Standard test needed to benchmark XRUNs

lilith
Established Member
Posts: 1039
Joined: Fri May 27, 2016 11:41 pm
Location: bLACK fOREST

Re: Standard test needed to benchmark XRUNs

Postby lilith » Wed Jan 02, 2019 4:00 pm

Drumfix wrote:Looking at the endpoint descriptor of the Zoom, it sends/receives 1 isochronous packet every 1 millisecond, so it will work best at buffer sizes that are an exact multiple of samplerate/1000, e.g. 48, 96, ... at 48000.


Holy Moly!

Code:

Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 97.838287
in complete 1 Xruns in 44758 circles
first Xrun happen at DSP load 97.838287 circle 44755


I always thought that 512, 1024, etc. were a kind of standard. I've never read about any other buffer sizes. Do you have a good source where I can read more about it? The difference is quite huge :D
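Drumfix's rule of thumb is easy to check for any buffer size. The following is only a minimal sketch: the function name is made up for illustration, and it assumes nothing beyond the "1 packet every 1 millisecond" figure quoted above, so a buffer is "aligned" when it holds a whole number of milliseconds of audio.

```python
def is_whole_milliseconds(buffersize, samplerate):
    """True if buffersize frames span an exact number of 1 ms USB packets.

    Hypothetical helper: assumes one isochronous packet per millisecond,
    as described in the quote above.
    """
    return (buffersize * 1000) % samplerate == 0


for size in (48, 96, 512, 528, 1024):
    print(size, is_whole_milliseconds(size, 48000))
```

At 48000 this accepts 48, 96 and 528 (11 packets) and rejects 512 and 1024, which matches the sizes discussed in the thread.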
https://soundcloud.com/lilith_93
latest: https://soundcloud.com/lilith_93/days-of-terror
_____________________________
Debian 9 (XFCE) & KXStudio repos

folderol
Established Member
Posts: 925
Joined: Mon Sep 28, 2015 8:06 pm
Location: Here, of course!

Re: Standard test needed to benchmark XRUNs

Postby folderol » Wed Jan 02, 2019 9:09 pm

@ raboof
"How do you benchmark that you get better performance when using ALSA directly?"
For a single audio source, using ALSA directly should give better results. There's a whole chunk of RT processing taken out of the audio chain.

raboof
Established Member
Posts: 1643
Joined: Tue Apr 08, 2008 11:58 am
Location: Deventer, NL

Re: Standard test needed to benchmark XRUNs

Postby raboof » Wed Jan 02, 2019 10:19 pm

folderol wrote:
raboof wrote:How do you benchmark that you get better performance when using ALSA directly?

For a single audio source, using ALSA directly should give better results. There's a whole chunk of RT processing taken out of the audio chain.

That is of course undeniably true. On the other hand it is not entirely obvious that it matters in practice: I have not seen a real-life apples-to-apples comparison where ALSA indeed allowed for lower latencies than going through JACK. After all, typically with ALSA you also choose/configure a buffer size, and the "jump" from the lowest buffer size that works for you with JACK to the next smaller one might not be possible on ALSA either, due to other bottlenecks.

Now Drumfix above alluded to some ALSA APIs that can achieve significantly better performance by avoiding the 'fixed buffer size' design of JACK, especially on USB. I'm of course curious, but he hasn't shared many details yet and I haven't found the time to go hunting for what he might be referring to either.

Jack Winter
Established Member
Posts: 376
Joined: Sun May 28, 2017 3:52 pm

Re: Standard test needed to benchmark XRUNs

Postby Jack Winter » Wed Jan 02, 2019 10:35 pm

FWIW, I've never observed any substantial performance difference between using JACK or ALSA.

AFAIK you can also use "non traditional" buffersizes with JACK, at least 48 works fine with JACK1. In fact I just tried with 528 and it also seems to work! :shock:
Reaper/KDE/Archlinux. i7-2600k/16GB + i7-4700HQ/16GB, RME Multiface/Babyface, Behringer X32, WA273-EQ, 2 x WA-412, ADL-600, Tegeler TRC, etc 8) For REAPER on Linux information: https://wiki.cockos.com/wiki/index.php/REAPER_for_Linux

merlyn
Established Member
Posts: 516
Joined: Thu Oct 11, 2018 4:13 pm

Re: Standard test needed to benchmark XRUNs

Postby merlyn » Sun Jan 06, 2019 1:59 am

It's been interesting seeing all the results from xruncounter. Good job tramp.

xruncounter produces two results : the DSP load at which the first Xrun happens, and how many circles or cycles it took to produce that DSP load. A well configured system doesn't produce Xruns until ~98% DSP load. The number of cycles it took to produce that DSP load tells us the power of the system, which tells us how many plugins and soft synths we can expect to run. The more cycles, the more plugins.

When interpreting results we can consider DSP load and the number of cycles. We do want the DSP load where the first Xrun happens to be as high as possible; we also want the maximum number of cycles. Here's an example :

lilith wrote:This is how it looks for me... not that good :/

Code:

first Xrun happen at DSP load 83.043152 circle 41055

I tried with the Debian stock kernel and it's slightly better (maybe the same within the reproducibility)

Code:

first Xrun happen at DSP load 87.472519 circle 40284


I would think the second result is worse for two reasons. First, the number of cycles is lower, which means fewer plugins. Second, if the stock system got to ~100% DSP load it would do so in fewer cycles than the RT system. Fewer cycles means fewer plugins, even if the configuration of the stock system is optimised. To put it another way, the stock system has a higher DSP load at a lower number of cycles. This shows that the stock kernel is worse for processing audio, specifically using JACK -- this is what we expect and fits with the received wisdom.

@windowsrefund -- If we take DSP load and number of cycles into account, 3 periods isn't performing better.

windowsrefund wrote:With Jack Periods set to 2

Code:

first Xrun happen at DSP load 81.481308 circle 41250

With Jack Periods set to 3

Code:

first Xrun happen at DSP load 93.284630 circle 43342


With JACK periods set to 2, your system Xruns at 81% and 41250 cycles. So say you tweaked a few things and got to 93%, as you did with 3 periods, how many cycles would that be? It's 41250 * (93/81) = 47361, which is more than the 43342 you got with 3 periods. Although you got to a higher DSP load and did squeeze a bit more out of the system, the 3 period system is actually performing less well.
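The projection above can be reproduced in a couple of lines. This is just a sketch of the arithmetic in this post, using the rounded loads (81% and 93%); the helper name is invented, and the straight-line scaling between DSP load and cycles is an assumption of the argument, not something xruncounter reports.

```python
def projected_cycles(cycles, dsp_load, target_load):
    """Scale a cycle count linearly from the measured DSP load to target_load.

    Hypothetical helper; assumes cycles grow in proportion to DSP load.
    """
    return cycles * target_load / dsp_load


two_periods_at_93 = projected_cycles(41250, 81, 93)
three_periods_measured = 43342

print(round(two_periods_at_93))                    # 47361
print(two_periods_at_93 > three_periods_measured)  # True
```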

Adding another period introduces more latency. Going from 2 periods to 3 periods increases latency by 50%, which you've traded for a 5% (43342/41250) increase in performance. In my tests latency and cycles have a nearly straight line relationship. If I halve the latency by halving the buffer size or doubling the sample rate, I get roughly half the number of cycles.

To attempt to clarify : Actual buffer size = (frames per period) * (periods).

'Frame' is a fancy word for 'sample' or 'group of samples' e.g. 2 samples in stereo. 'Frames per period' is what xruncounter calls 'buffersize'. And 'periods' are 'periods'. I've heard that 3 periods is better for USB devices. You have an internal soundcard, so probably 2 is best. Cadence reports 'block latency' which is (frames per period/sample rate). Qjackctl reports (block latency*periods).
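To make the two latency figures concrete, here is a small sketch of the formulas as described in this post. The function names are made up, and the 128 frames at 48000 Hz are example values only, not windowsrefund's actual settings.

```python
def block_latency_ms(frames_per_period, samplerate):
    """Latency of one period in milliseconds (the figure Cadence is said to report)."""
    return 1000.0 * frames_per_period / samplerate


def total_latency_ms(frames_per_period, periods, samplerate):
    """Block latency times the number of periods (the figure Qjackctl is said to report)."""
    return block_latency_ms(frames_per_period, samplerate) * periods


# Example values only: 128 frames per period at 48 kHz.
two = total_latency_ms(128, 2, 48000)
three = total_latency_ms(128, 3, 48000)
print("2 periods: %.2f ms" % two)
print("3 periods: %.2f ms" % three)
print("increase:  %.0f%%" % (100.0 * (three - two) / two))
```

Whatever the frames per period, going from 2 to 3 periods multiplies the total by 1.5, which is the 50% increase mentioned above.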

I think you could tweak your system to get better performance. The first thing I thought of was "do you have Nvidia proprietary drivers?".

To summarise : when using xruncounter DSP load where the first Xrun occurs taken together with number of cycles tells us about the performance of a system.

windowsrefund
Established Member
Posts: 64
Joined: Mon Jul 30, 2018 11:04 pm

Re: Standard test needed to benchmark XRUNs

Postby windowsrefund » Sun Jan 06, 2019 5:18 am

This is very detailed and informative. I'll need to read it a few dozen times. Thank you so much.

merlyn
Established Member
Posts: 516
Joined: Thu Oct 11, 2018 4:13 pm

Re: Standard test needed to benchmark XRUNs

Postby merlyn » Mon Jan 07, 2019 2:28 pm

windowsrefund wrote:Thank you so much.

You're welcome. Maybe I can put it more simply. On your system 2 periods has more potential than 3 periods. The maximum you could possibly get out of a system is :

(max cycles) = (cycles you got) * (100/(DSP load you got))

  • 2 Periods

    (max cycles) = 41250 * (100/81) = 50926

  • 3 Periods

    (max cycles) = 43342 * (100/93) = 46604

Add to this the fact that 2 periods has less latency and there is enough to say 3 periods doesn't agree with your system.
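The same formula as a short sketch, so other results from the thread can be plugged in. The function name and the dictionary layout are illustrative choices; the numbers are the rounded ones used in this post.

```python
def max_cycles(cycles, dsp_load):
    """Extrapolate the cycle count to 100% DSP load: cycles * (100 / load)."""
    return cycles * 100.0 / dsp_load


# periods -> (cycles at first Xrun, DSP load at first Xrun, rounded)
results = {2: (41250, 81), 3: (43342, 93)}

for periods, (cycles, load) in sorted(results.items()):
    print(periods, "periods:", round(max_cycles(cycles, load)))
```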

lilith
Established Member
Posts: 1039
Joined: Fri May 27, 2016 11:41 pm
Location: bLACK fOREST

Re: Standard test needed to benchmark XRUNs

Postby lilith » Fri Mar 08, 2019 10:07 pm

made a new thread here: viewtopic.php?f=27&t=19718



I did some tests again with the xruncounter script, as I got a lot of xruns again when using the stock Debian kernel.

This is with the RT kernel:
Linux fox 4.9.0-8-rt-amd64 #1 SMP PREEMPT RT Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux

Code:

Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 97.194702
in complete 1 Xruns in 44269 circles
first Xrun happen at DSP load 97.194702 circle 44268


--> looks fine

This is with the stock kernel:
Linux fox 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux

Code:

Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 90.951126
Xrun 2 at DSP load 90.279114
Xrun 3 at DSP load 93.730499
Xrun 4 at DSP load 91.147636
Xrun 5 at DSP load 95.048004
Xrun 6 at DSP load 98.028214
in complete 6 Xruns in 48195 circles
first Xrun happen at DSP load 90.951126 circle 47043

marco@fox:~/src$ ./xruncounter
Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 76.661041
Xrun 2 at DSP load 77.524490
Xrun 3 at DSP load 77.524490
Xrun 4 at DSP load 79.741531
Xrun 5 at DSP load 78.874954
Xrun 6 at DSP load 79.201111
Xrun 7 at DSP load 79.201111
Xrun 8 at DSP load 79.201111
Xrun 9 at DSP load 79.598999
Xrun 10 at DSP load 79.286118
Xrun 11 at DSP load 79.286118
Xrun 12 at DSP load 81.828911
Xrun 13 at DSP load 82.644875
Xrun 14 at DSP load 82.249924
Xrun 15 at DSP load 83.655365
Xrun 16 at DSP load 83.661469
Xrun 17 at DSP load 83.661469
Xrun 18 at DSP load 83.661469
Xrun 19 at DSP load 83.661469
Xrun 20 at DSP load 86.714508
Xrun 21 at DSP load 96.771843
in complete 21 Xruns in 43748 circles
first Xrun happen at DSP load 76.661041 circle 37179

marco@fox:~/src$ ./xruncounter
Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 92.366013
Xrun 2 at DSP load 92.366013
Xrun 3 at DSP load 96.760590
in complete 3 Xruns in 44181 circles
first Xrun happen at DSP load 92.366013 circle 44007
marco@fox:~/src$ ./xruncounter
Samplerate 48000
Buffersize is 528
jack running with realtime priority
Xrun 1 at DSP load 89.840897
Xrun 2 at DSP load 97.447464
in complete 2 Xruns in 44650 circles
first Xrun happen at DSP load 89.840897 circle 43296


It seems to be better with the RT kernel, but what I don't understand is why the results differ so much from run to run with the stock kernel. I turned pulseaudio on and off for some runs, but the results don't correlate with pulseaudio.
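One way to put a number on the run-to-run spread is to collect the summary lines and compute basic statistics. This is a minimal sketch: the regular expression is written against the output format shown above, the names are made up, and four runs are too few for anything more than a rough impression.

```python
import re
import statistics

# Matches the summary line printed at the end of each xruncounter run.
FIRST_XRUN = re.compile(r"first Xrun happen at DSP load ([\d.]+) circle (\d+)")


def first_xruns(log_text):
    """Return (dsp_load, circle) for every summary line found in log_text."""
    return [(float(load), int(circle))
            for load, circle in FIRST_XRUN.findall(log_text)]


# Summary lines of the four stock-kernel runs posted above.
stock_log = """
first Xrun happen at DSP load 90.951126 circle 47043
first Xrun happen at DSP load 76.661041 circle 37179
first Xrun happen at DSP load 92.366013 circle 44007
first Xrun happen at DSP load 89.840897 circle 43296
"""

runs = first_xruns(stock_log)
loads = [load for load, _ in runs]
print("runs:      %d" % len(runs))
print("mean load: %.2f" % statistics.mean(loads))
print("stdev:     %.2f" % statistics.stdev(loads))
print("min/max:   %.2f / %.2f" % (min(loads), max(loads)))
```

Redirecting several runs into one file and passing its contents to `first_xruns` would give the same summary for a longer series.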
Last edited by lilith on Sun Mar 10, 2019 8:21 pm, edited 1 time in total.

to7m
Established Member
Posts: 4
Joined: Sun Nov 13, 2016 6:52 am

Re: Standard test needed to benchmark XRUNs

Postby to7m » Sun Mar 10, 2019 1:17 am

I tried to benchmark my xruns ages ago.

A few conclusions I reached:

hackbench doesn't always increase xruns. On some of my lowest latency settings, hackbench resulted in 2 or 3 xruns per hour whereas a lack of hackbench resulted in thousands. I think this is because, despite having the ‘performance’ governor set, the cpu rate as shown by i7z still varied too much without a spinner process. So a good benchmark would only run a spinner process like hackbench some of the time, instead of all or none of the time.

Multi-core video processing might be a better option to torture the system than hackbench.

iperf for using the network card — this is a common cause of xruns and should ideally be included in tests.

PulseAudio should be put to work too as it is a very common source of xruns despite being an essential part of the Linux desktop.


I asked about this topic on Reddit a while ago and a user kindly wrote this program to log xruns: https://github.com/pulse0ne/xrun-logger

simonvanderveldt
Established Member
Posts: 37
Joined: Mon Sep 04, 2017 9:30 pm

Re: Standard test needed to benchmark XRUNs

Postby simonvanderveldt » Wed Mar 13, 2019 6:32 pm

tramp wrote:
merlyn wrote:I found an xrun counter that tramp wrote on this thread


Yes, that could easily be extended to a stress test.
Here I've added a growing atan function as a DSP load blob, so the DSP load will grow with every circle the test runs.
It will then detect at which DSP load the first Xrun happens, and will stop the test at 95% DSP load.
At least it will show us how many circles are needed to reach 95% DSP load, and at which DSP load the first Xrun happens.
So, for example, here is the output I get with 48kHz 128/2 frames/buffer:


Would it be an idea to contribute this nice tool to JACK?
Or if not contribute it there release it/put it in a repo?

tramp
Established Member
Posts: 1452
Joined: Mon Jul 01, 2013 8:13 am

Re: Standard test needed to benchmark XRUNs

Postby tramp » Fri Mar 15, 2019 5:13 am

simonvanderveldt wrote:Would it be an idea to contribute this nice tool to JACK?
Or if not contribute it there release it/put it in a repo?


Feel free to do so. I've put it into the public domain, which means you can do with it whatever you like, including relicensing.
I already have more than enough stuff to maintain, so it would be welcome if someone took it over.

regards
hermann
On the road again.

gimmeapill
Established Member
Posts: 546
Joined: Thu Mar 12, 2015 8:41 am

Re: Standard test needed to benchmark XRUNs

Postby gimmeapill » Wed Mar 20, 2019 12:07 pm

xruncounter is very useful, thanks once more for the great job Tramp!
And thanks merlyn + Drumfix for the tips about buffersizes and circles: after the tenth read this is finally making it through my thick skull ;-)

So let's join the party: Here are some preliminary results on my 5+ y old notebook (Ivy Bridge CPU, intel graphics), a Scarlett 2i2 first gen, jack2 in asynchronous mode with 3 periods, and the stock kernel on Arch (5.0.2).

I'm aiming here for the lowest possible latency with Guitarix, and preferably at 96 kHz, rather than running a ton of plugins in a DAW (although this side effect wouldn't hurt either).
So the circle number in the results below is rather tiny compared to what was posted earlier at higher latencies, but I assume this is the expected behavior with my config, or maybe the age of my system (more tests needed to confirm).

Code:

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 48000
Buffersize is 96
jack running with realtime priority
Xrun 1 at DSP load 93.923187
Xrun 2 at DSP load 93.923187
Xrun 3 at DSP load 96.811592
Xrun 4 at DSP load 96.811592
in complete 4 Xruns in 6355 circles
first Xrun happen at DSP load 93.923187 circle 6297

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 48000
Buffersize is 128
jack running with realtime priority
Xrun 1 at DSP load 96.727165
Xrun 2 at DSP load 96.727165
in complete 2 Xruns in 8670 circles
first Xrun happen at DSP load 96.727165 circle 8645

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 96
jack running with realtime priority
Xrun 1 at DSP load 96.163605
in complete 1 Xruns in 3403 circles
first Xrun happen at DSP load 96.163605 circle 3385


-> I still need to spend some more time on it and properly measure the results at each setting, but from what was already highlighted in the thread we should now have enough tools to A/B test:

- stock kernel vs rt
- standard buffersize vs custom buffersize set as a multiple of the frequency (it seems to scale quite linearly so far with the 2i2)
- periods: 2 vs 3

The things that I have not seen mentioned but could be worth measuring:
- jack1 vs jack 2 in synchronous and asynchronous modes (I can test that easily on Arch, since both are packaged).
- CPU affinity / parallelism (it seems we load a single CPU core here, not sure if there's a way to do any better, or if it even makes sense).
- IRQ priorities (rtirq)
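For A/B testing it helps to write down the full set of combinations before starting, so that no setting gets skipped. A small sketch, with example values picked from this thread; the variable names and the choice of only three factors are illustrative, and the other factors from the lists above could be added the same way.

```python
from itertools import product

kernels = ("stock", "rt")
buffersizes = (96, 128)  # a multiple of samplerate/1000 and a power of two at 48 kHz
periods = (2, 3)

# Every combination of the three factors, one dict per test run.
matrix = [
    {"kernel": k, "buffersize": b, "periods": p}
    for k, b, p in product(kernels, buffersizes, periods)
]

for run in matrix:
    print(run)
print(len(matrix), "runs")
```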

...And I have one feature request for xruncounter: would it be possible to print out also the detected number of periods?
Also, if you don't mind, I'd like to make a PKGBUILD out of it, so I'll probably upload it on GitHub over the weekend (unless someone is faster).

merlyn
Established Member
Posts: 516
Joined: Thu Oct 11, 2018 4:13 pm

Re: Standard test needed to benchmark XRUNs

Postby merlyn » Wed Mar 20, 2019 7:26 pm

@gimmeapill That's good. This program has potential.

Looking at your results I'd guess your processor is 2.6 GHz. That seems to be the most important factor in how many circles or cycles it takes to cause an Xrun.

tramp has handed this code on, so it's up to us to develop it. :) I looked at some info about JACK here and there doesn't seem to be a function that gets the number of periods. You will know the number of periods from starting JACK. If you want it in the output there will be a way of inputting it. I know basic C so I could figure it out. I've never used GitHub. I imagine this discussion can continue there.

gimmeapill
Established Member
Posts: 546
Joined: Thu Mar 12, 2015 8:41 am

Re: Standard test needed to benchmark XRUNs

Postby gimmeapill » Fri Mar 22, 2019 11:21 am

merlyn wrote:Looking at your results I'd guess your processor is 2.6 GHz. That seems to be the most important factor in how many circles or cycles it takes to cause an Xrun.

Quite close ;-)
That's an old i5 3317u (Ivy Bridge) that can hardly sustain the max turbo frequency of 2.4 GHz.

...And I think I managed to break it:

Code:

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 512
jack running with realtime priority
Xrun 1 at DSP load 100.000000
Xrun 2 at DSP load 100.000000
in complete 2 Xruns in 17895 circles
first Xrun happen at DSP load 100.000000 circle 17895


I didn't even try to confirm with an actual audio workload, I think this is just a calculation error:
The max DSP load jumped up to 99.9XXXX when I started jack2 in synchronous mode (/usr/bin/jackd -S).
The rest is more or less the same: 3 periods, stock kernel (5.0.3)

At lower latencies, the results are almost realistic: the DSP load is still very high, but the circle number scales more or less linearly with the buffersize:

Code:

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 64
jack running with realtime priority
Xrun 1 at DSP load 99.970459
Xrun 2 at DSP load 99.970459
Xrun 3 at DSP load 99.970459
Xrun 4 at DSP load 99.970459
Xrun 5 at DSP load 99.970459
Xrun 6 at DSP load 99.970459
Xrun 7 at DSP load 99.970459
Xrun 8 at DSP load 99.970459
Xrun 9 at DSP load 99.970459
Xrun 10 at DSP load 99.970459
Xrun 11 at DSP load 99.970459
Xrun 12 at DSP load 99.970459
in complete 12 Xruns in 2576 circles
first Xrun happen at DSP load 99.970459 circle 2576

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 96
jack running with realtime priority
Xrun 1 at DSP load 99.999977
Xrun 2 at DSP load 99.999977
Xrun 3 at DSP load 99.999977
Xrun 4 at DSP load 99.999977
Xrun 5 at DSP load 99.999977
Xrun 6 at DSP load 99.999977
in complete 6 Xruns in 3647 circles
first Xrun happen at DSP load 99.999977 circle 3647

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 128
jack running with realtime priority
Xrun 1 at DSP load 99.999847
Xrun 2 at DSP load 99.999847
Xrun 3 at DSP load 99.999847
in complete 3 Xruns in 4933 circles
first Xrun happen at DSP load 99.999847 circle 4933

[gimmeapill@pill-mobile4 xruncounter]$ ./xruncounter
Samplerate 96000
Buffersize is 256
jack running with realtime priority
Xrun 1 at DSP load 99.994133
Xrun 2 at DSP load 99.994133
in complete 2 Xruns in 9366 circles
first Xrun happen at DSP load 99.994133 circle 9366


As I understand it, that's very likely a DSP calculation error (If I start jack2 with the default asynchronous mode, the results are the same as posted earlier). In Synchronous mode, the results should degrade and be less forgiving, closer to jack1.
Could you give it a try to confirm?
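The "more or less linear" scaling can be checked from the synchronous-mode runs at 96000 posted above. This sketch only divides circles by buffersize and fits a line through the origin; the variable names are made up, and the 512 result is included even though it is the run suspected of being off.

```python
# buffersize -> circle of first Xrun (jack2 synchronous, 96000 Hz, from the runs above)
results = {64: 2576, 96: 3647, 128: 4933, 256: 9366, 512: 17895}

# Circles per frame for each run; a constant value would mean perfectly linear scaling.
ratios = {size: circles / size for size, circles in results.items()}
for size in sorted(ratios):
    print("%4d frames: %6d circles, %.2f circles/frame" % (size, results[size], ratios[size]))

# Least-squares slope of a line through the origin: sum(x*y) / sum(x*x).
slope = (sum(size * circles for size, circles in results.items())
         / sum(size * size for size in results))
print("fitted slope: %.2f circles/frame" % slope)
```

The per-run ratios drift downwards a little as the buffer grows, so the relationship is close to linear but not exactly proportional.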

merlyn
Established Member
Posts: 516
Joined: Thu Oct 11, 2018 4:13 pm

Re: Standard test needed to benchmark XRUNs

Postby merlyn » Fri Mar 22, 2019 11:30 pm

I have jack2dbus so to use synchronous mode I ticked the box in Cadence. My results do what you expected -- they're slightly worse in synchronous mode.

Asynchronous

Code:

Samplerate 48000
Buffersize is 512
jack running with realtime priority
Xrun 1 at DSP load 99.946274
Xrun 2 at DSP load 99.946274
in complete 2 Xruns in 34776 circles
first Xrun happen at DSP load 99.946274 circle 34769

Synchronous

Code:

Samplerate 48000
Buffersize is 512
jack running with realtime priority
Xrun 1 at DSP load 99.381393
Xrun 2 at DSP load 99.381393
in complete 2 Xruns in 34442 circles
first Xrun happen at DSP load 99.381393 circle 34441

