Benchmarking Some Plugins

All your LV2 and LADSPA goodness and more.

ssj71
Established Member
Posts: 1294
Joined: Tue Sep 25, 2012 6:36 pm
Has thanked: 1 time

Benchmarking Some Plugins

Post by ssj71 »

Hallo:

I've recently been challenged on the claim that one of my infamous plugins is light. I had no benchmarks to prove it, so who's to say my algorithm was any lighter than while(1){i++;}? Well, I decided to defend my honor and try some benchmarking. The process isn't perfect, and I'm sure you could all suggest some improvements, which I'm open to, but I'm limited in time too (just like all of you), so don't expect changes right away. Anyway:

Question: Is the infamous cheap distortion plugin any lighter on processing than other saturation plugins?
Hypothesis: Yes.
Method/Experimental Design: Using the plugin torture program http://carlh.net/plugins/torture.php running under callgrind, profile several saturation plugins on my core2 duo laptop. The cycle estimation will be the metric (fewer is better).
Results:
infamous hip2b - 203,544,718 cycles
infamous cheap distortion - 90,715,523 cycles
swh valve saturation lv2 - 195,221,722 cycles
swh valve saturation ladspa - 144,603,842 cycles
Conclusions: The hypothesis is correct.

Additional Comments: It's a fallacy to say that cheap distortion is half as computationally expensive as swh valve saturation, since the cycle estimate here includes all the overhead of loading the plugin, generating the various signals, etc. It's observed that the LADSPA version of the swh plugin requires fewer cycles for this overhead, and the difference will become less and less relevant as longer signals are run through (i.e. in practical realtime use). So this method can indicate which plugins are cheaper, but it surely isn't a definitive benchmark of how much cheaper. Running each plugin a few times found the cycle count varied with a standard deviation on the order of 1,000 cycles (though this is a rather unscientific measurement of a normal distribution). Unfortunately the plugin torture program only runs the first plugin in a collection, so it was impossible to try several of the other saturation plugins available, such as those by CAPS and Calf.

Anyhow, hopefully this is somewhat enlightening and provides devs or curious users a way to test some of their favorite plugins, as well as a methodology to somewhat substantiate claims of performance.
_ssj71

music: https://soundcloud.com/ssj71
My plugins are Infamous! http://ssj71.github.io/infamousPlugins
I just want to get back to making music!
tramp
Established Member
Posts: 2347
Joined: Mon Jul 01, 2013 8:13 am
Has thanked: 9 times
Been thanked: 466 times

Re: Benchmarking Some Plugins

Post by tramp »

mmm, have you ever had a look at the source for the swh valve?
It has nothing to do with the simple
x - 0.15*x^2 - 0.15*x^3
wave-shaping variants.

You have chosen to compare, as male would say, apples and oranges. Please try to see your results in the right context. Why don't you say how much CPU load the rt-process of your plugin produces? You could easily see it in htop when you run it in jalv, for example.
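For reference, the polynomial tramp quotes is a one-liner per sample. A minimal C sketch of such a cubic waveshaper (an illustration of the formula above, not the swh or infamous source):

```c
/* Cubic waveshaper from the formula above: y = x - 0.15*x^2 - 0.15*x^3.
   Per sample this is only a handful of multiplies and adds. */
static float waveshape(float x)
{
    return x - 0.15f * x * x - 0.15f * x * x * x;
}
```

In a plugin's run() callback this would simply be applied to every sample in the buffer.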
On the road again.
tramp

Re: Benchmarking Some Plugins

Post by tramp »

The DSP load is the overall CPU usage of the rt-processes which are currently running.
We're not talking here about programming on a DSP chip. You can easily add up the CPU load of the running rt-processes and compare it with the DSP load shown by JACK, and you will see that it is the same.
falkTX wrote:both functions use very little cpu-load (if any at all), but the plugin will still cause xruns (because of sleep).
it's the same with "bad" locks (which includes memory allocation). using those will usually not result in high cpu usage, but we can't say the same about DSP load.
I'm pretty sure that those will show up in the CPU load of the rt-process as well, but I must admit that I've never done a memlock in the rt-callback. But now, as you keep riding on this, I will test it here to see if it really wouldn't show in the CPU usage of the rt-process. If so, I would consider it a bug.

Edit// okay, you are right, if you do fancy things, the values will indeed differ. :x
So, a difference in these values could be an indicator of bad design in the DSP code. :(
ssj71

Re: Benchmarking Some Plugins

Post by ssj71 »

tramp wrote:mmm, have you ever had a look at the source for the swh valve?
It has nothing to do with the simple
x - 0.15*x^2 - 0.15*x^3
wave-shaping variants.
I haven't looked at the source, but cheap distortion is doing a square root approximation... I don't agree that this is apples to oranges. They are both waveshapers. I would gladly include any plugin you recommend, knowing the limitations of the test (only the first plugin in the bundle, only on the default parameters).
falkTX wrote:CPU load is related to DSP load, but not that much.
I agree that in this sense I might be showing apples and saying "check out these sweet oranges..."

But it's still not clear to me how to measure DSP load well, especially how to compare two plugins side by side in this manner. Perhaps I'm being dense here... I've found a reasonable way to measure CPU time, but how can I do the same with DSP? (Perhaps a feature request for Carla to have some kind of debug/benchmark mode... ;)) I'm as curious as anyone to see the results. Help me design the experiment.
ssj71

Re: Benchmarking Some Plugins

Post by ssj71 »

Similarly, I ran this code under callgrind:

Code: Select all

#include <stdint.h>
#include <math.h>

/* Type-punning union: view a float's bits as sign/exponent/mantissa,
   or as the sign bit plus the remaining 31 magnitude bits ("mine").
   Bit-field layout is implementation-defined; this happens to work
   on x86 with gcc. */
typedef union
{
    int32_t i;
    float f;
    struct
    {
        uint32_t mantissa : 23;
        uint32_t exponent : 8;
        uint32_t sign : 1;
    } parts;
    struct
    {
        uint32_t num : 31;
        uint32_t sign : 1;
    } mine;
} Float_t;

/* Approximate the 8th root by shifting the magnitude bits right by 3
   (roughly dividing the biased exponent by 8), then adding most of the
   exponent bias back. */
void hack()
{
    Float_t f2;
    volatile float sink; /* volatile so the loop isn't optimized away */
    float i;

    for (i = -1.1f; i < 1.1f; i += 0.0001f)
    {
        f2.f = i;
        f2.mine.num = f2.mine.num >> 3;
        f2.parts.exponent += 111;
        sink = f2.f;
    }
}

/* The traditional signed 8th root via powf, for comparison. */
void trad()
{
    volatile float j; /* volatile so the loop isn't optimized away */
    float i;

    for (i = -1.1f; i < 1.1f; i += 0.0001f)
    {
        if (i < 0)
            j = -powf(-i, 0.125f);
        else
            j = powf(i, 0.125f);
    }
}

int main()
{
    hack();
    trad();
    return 0;
}
This compares computing the 8th root with floating-point powf against the bit-twiddling hack used in cheap distortion. Callgrind found about 75% of the cycles were spent in the trad() function and 20% in the hack() function (suggesting a >3.5x improvement from the hack). These results are limited, so I didn't post them originally. And they still only show CPU load, not DSP load.

EDIT: ALSO, DON'T USE THIS CODE OUT OF CONTEXT. It is not officially supported and relies on undefined behavior. It happens to work on x86 systems when compiled with gcc. This wasn't intended as a silver bullet.
male
Established Member
Posts: 232
Joined: Tue May 22, 2012 5:45 pm

Re: Benchmarking Some Plugins

Post by male »

ssj71 wrote:But it's still not clear to me how to measure DSP load well, especially how to compare two plugins side by side in this manner. [...] I've found a reasonable way to measure CPU time, but how can I do the same with DSP? [...] Help me design the experiment.
Non Mixer already does this for LADSPA plugins. If you want, I can easily add a separate meter per plugin instead of per strip.

Calculating this figure is very simple. You just need a timestamp at the start of processing and compare it to the time at the end of processing and present it as a percentage of the current DSP time window (2.7ms or whatever the combination of period size and sample rate produces). The number will generally be very small, < 1%. You can exaggerate the effect by running each plugin in N stages. Say 10. You could also look at either peak or average load.

BTW, this is exactly how the main JACK DSP load figure is calculated. It has no relation to CPU load.

EDIT: I should point out that this technique is very accurate, but only when actually running in an RT context (i.e. SCHED_FIFO).
ssj71

Re: Benchmarking Some Plugins

Post by ssj71 »

So perhaps: find an appropriate sample, such as white noise, load the plugin in Carla (sorry, I'm comparing some LV2 and some LADSPA, so Non Mixer is a no-go) and play the sample through it, then use the DSP load in QjackCtl as the metric. Seems to have more room for error, but I'll give it a go when I have a few minutes back home. It would be nice if something would poll the JACK DSP load and give a peak and average for this task. Sounds like QjackCtl shows the peak over the last few moments. Thanks.

Also, when I looked at the source of plugin-torture, it does have an option to select the index of the LADSPA plugin. So I'm going to at least try CAPS saturation using both these tests.
male

Re: Benchmarking Some Plugins

Post by male »

ssj71 wrote:So perhaps: find an appropriate sample, such as white noise, load the plugin in Carla [...] and play the sample through it, then use the DSP load in QjackCtl as the metric. [...]
The problem with using the global JACK DSP load figure is A) you're including the host and other overheads in the measurement and B) you can only measure one thing at a time. In Non Mixer you can run two plugins side by side and compare the loads in real time. But getting back to A), it is quite possible that the host and other overheads are much larger than the actual load produced by the plugins. This effect can be reduced by running many plugin stages as I mentioned before. In all likelihood, doing what you described, you'll probably see a global DSP load of 1-3 percent regardless of what plugin you're testing--unless it is a very complex or poorly performing one.
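male's N-stages trick can be folded into a rough per-instance estimate: measure the global DSP load with the host idle, measure again with N copies of the plugin, and divide the difference by N. A naive sketch (it assumes the load scales linearly with instances and ignores measurement noise):

```c
/* Estimate one plugin instance's DSP load from two global measurements:
   baseline_pct = host alone, loaded_pct = host plus n_stages copies. */
static double per_instance_load(double baseline_pct, double loaded_pct,
                                int n_stages)
{
    return (loaded_pct - baseline_pct) / n_stages;
}
```

For example, a 0.72% idle load and 2.02% with ten stages would suggest roughly 0.13% per instance.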
tramp

Re: Benchmarking Some Plugins

Post by tramp »

However, when I work on a DSP algorithm, I watch the CPU load the formula produces. If I can reach the same result with a lower CPU load, I consider it an improvement in the source. The DSP load behaves similarly to the CPU load; I've never had a situation where a lower CPU load produced a higher DSP load, nor the opposite effect. Clearly, one must follow the rules for writing realtime tasks in the first place.
But I must admit that if you compare plugins, the results would be more accurate if one took the DSP load into account.
tramp

Re: Benchmarking Some Plugins

Post by tramp »

Well, I understand it.
The difference comes from the point of view.
Trying to improve a DSP formula is a different thing than benchmarking some plugins and watching for culprits in the source.
ssj71

Re: Benchmarking Some Plugins

Post by ssj71 »

Question: Is the infamous cheap distortion plugin any lighter on DSP load than other saturation plugins?
Hypothesis: Yes.
Method/Experimental Design: Using Carla, load each plugin and connect the input and output to the system input and output respectively. The DSP load reported by QJackCtl will be the metric (lower is better). Each will be observed for about 10 seconds or more and the highest number observed will be reported.
Results:
host with no plugin - .72%
infamous hip2b - 1.0%
infamous cheap distortion - .85%
swh valve saturation lv2 - 1.1%
swh valve saturation ladspa - .91%
Calf saturation lv2 & ladspa - 2.7%
cmt wave shaper - .98%
C* clip - 3.2%
millenium saturator - .95%
Function sqrt(x) - .86%
3 instances of function sqrt(x) cascaded (8th root) - 1.0%
Conclusions: The hypothesis is correct. Barely.

Other observations: The DSP load didn't vary greatly, so the peak was usually quite close to the average. No GUIs were run. Only default parameters were tested, save on cheap distortion, which didn't vary per setting. Interestingly, the Calf saturator's DSP load did not depend on the plugin standard (LADSPA or LV2), whereas the swh valve saturation's did. Viewing the source to determine whether the implementations differ, or it's simply a difference in overhead between the plugin standards, is beyond the scope of this work. The sqrt function was cascaded to save face, since a single instance yielded nearly the same DSP load as the approximation performed in cheap distortion. Since I ran the test, I can bias the results like that. This suggests the aggression parameter must be set to 2 or 3 to really see any performance savings. Retesting, or testing with multiple instances of each plugin to lessen the effect of host overhead on the measurement, is left as an exercise for the reader.

Feel free to challenge these results by conducting your own experiment and posting the results here!

p.s. the option to select the index of a LADSPA plugin in plugin-torture is not yet implemented, so C* clip will not be tested for CPU usage.
male

Re: Benchmarking Some Plugins

Post by male »

ssj71 wrote:Question: Is the infamous cheap distortion plugin any lighter on DSP load than other saturation plugins?
[...]
Function sqrt(x) - .86%
3 instances of function sqrt(x) cascaded (8th root) - 1.0%
Conclusions: The hypothesis is correct. Barely.
[...]
Interesting results. I do think the methodology could use some work (I was actually hoping to trick you into writing a plugin benchmarking suite, which, as you're beginning to see, would be a pretty useful thing to have).

Remember, knowing that an optimization or shortcut is unnecessary is just as valuable as having the optimization--either way, the end of that road leads to better performance. Math functions like sqrt and sin/cos are heavy optimization targets for compiler built-ins--you can bet your ass that they've been benchmarked and tuned for the architecture. And if there's a lesson to be learned here, it's probably just that--compiler optimization exists so that you can write the routine once (aiming for human readability) and have the compiler optimize it for whatever platforms your code will run on.

Oh, FYI the JACK DSP load figure *is* an average. The number of samples is a compile time option, but I believe it's 32, so that means it's an average of load over 32 periods (whatever that works out to in sample time).
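The averaging male describes can be sketched as a circular buffer of the last N per-period loads (a toy model, not JACK's actual code; 32 is just the value he mentions as the likely compile-time default):

```c
#define LOAD_WINDOW 32

/* Rolling average of per-period DSP load over the last LOAD_WINDOW periods. */
typedef struct {
    double samples[LOAD_WINDOW];
    int    next;   /* index of the slot to overwrite next */
    int    count;  /* number of valid samples so far, up to LOAD_WINDOW */
} load_avg;

/* Record one period's load, overwriting the oldest entry once full. */
static void load_avg_push(load_avg *a, double load)
{
    a->samples[a->next] = load;
    a->next = (a->next + 1) % LOAD_WINDOW;
    if (a->count < LOAD_WINDOW)
        a->count++;
}

/* Average over however many periods have been recorded so far. */
static double load_avg_get(const load_avg *a)
{
    double sum = 0.0;
    for (int i = 0; i < a->count; i++)
        sum += a->samples[i];
    return a->count ? sum / a->count : 0.0;
}
```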