Optimizing FFTW for NEON-Enabled ARM Devices: Final Benchmarks

Thursday, August 19, 2010

Final Benchmarks

Update (2010-08-20): you can download a consolidated patch of my fftw work (it includes all of my changesets) from here. Alternatively, clone my git repository. You can find build instructions here.

Below are my benchmarks from today. They're "final" only in the sense that they're the ones I'm submitting for GSoC. Yes, you might very well notice that 'fftw3ff' (i.e. ffmpeg called through fftw) dominates in most cases where power-of-two transforms are used. However, you should also notice that fftw provides a nice dispatching mechanism that lets one use ffmpeg's blazingly fast transforms to compute higher-order (i.e. multi-dimensional) transforms through a single API. Pretty nice!

As I already mentioned, these benchmarks do not include the neon-simd optimized strided memory transfers that I started rather late in the game. You can probably imagine, though, that for any composite transform (i.e. if N != m^p, or any transform of higher dimensionality) those memory transfer routines will improve the performance of ALL versions ('fftw3', 'fftw3n', 'fftw3ni', and 'fftw3ff') by a good factor of 2x, at least. After that, once a couple of straightforward small-prime routines (i.e. 3, 5, 7, 11) are implemented in neon-asm, the non-power-of-two (npot) performance will begin to approach the power-of-two (pot) performance, which is ideal. That is currently what makes fftw so attractive on x86 architectures: it provides O(N log N) performance for all lengths, not just powers of two, and without zero-padding (for most transforms of significant length).
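To make "strided memory transfer" concrete: in a multi-dimensional or composite transform, the data for one sub-transform is often not contiguous (for example, one column of a row-major 2D array), so it is gathered into a packed buffer, transformed, and scattered back. The following is only a plain-C reference sketch of such a gather/scatter, not the NEON version and not fftw's actual routine; the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of fftw): copy n complex single-precision
 * values, stored as interleaved (re, im) float pairs, from src to dst.
 * Strides are counted in complex elements, so a stride of 1 means the
 * data is tightly packed. */
void strided_copy_c32(float *dst, ptrdiff_t dst_stride,
                      const float *src, ptrdiff_t src_stride,
                      size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i) {
        const float *s = src + 2 * (ptrdiff_t)i * src_stride;
        float *d = dst + 2 * (ptrdiff_t)i * dst_stride;
        d[0] = s[0]; /* real part */
        d[1] = s[1]; /* imaginary part */
    }
}
```

Calling it with dst_stride = 1 and src_stride equal to the row length gathers one column of a row-major 2D array into a contiguous buffer. A SIMD version would move several elements per iteration, which is where the speedup would come from.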

Please feel free to check out the raw data on Google Docs (bench-1d-pot, bench-1d-npot, bench-2d-pot, bench-2d-npot, bench-3d-pot, bench-3d-npot). Also, just in case it isn't clear, when 'mflops' is on the y-axis, a higher value is better, and when log10(time(s)) is on the y-axis, a smaller value is better.

Some remarks: ffmpeg does not have API calls for any length that is not a power of two, nor does it have API calls for higher-order (e.g. NxN) transforms, which is why it's missing from some of the graphs below. Furthermore, ffmpeg (like almost all other libraries) does not support transforms of vector data; now that I have added the ffmpeg interface to fftw, it does :-). In this sense, calling ffmpeg through fftw leverages the excellent planning and dispatching routines of fftw to provide the absolute best possible performance. Eventually, it will be unlikely that anyone would directly use ffmpeg's transforms other than for decoding multimedia, but I think that's best left unchanged. Ffmpeg is likely the best library publicly available for multimedia.

In summary, my improvements to fftw include anything marked 'fftw3n' (neon simd, asm), 'fftw3ni' (neon simd, intrinsics), or 'fftw3ff' (neon simd, ffmpeg). Actually, even those that are marked 'fftw3' are still improved from their original values due to the added cycle counter... and you can count on me adding those missing features once my thesis is submitted ;-)

5 comments:

Jeff in Oakland said...: Christopher,

Very interesting results. I am curious about your experience with the tools. Can we discuss via email or phone?; October 7, 2010 at 1:02 PM
Anonymous said...: Christopher,
Thanks for this, should be really useful. FWIW, I needed to add --enable-mdct to my ffmpeg ./configure args to get it to build, otherwise it wouldn't assemble mdct_neon.S and the final link would fail on unresolved references. This is with the latest ffmpeg git sources (as of 10/29/10).; October 29, 2010 at 10:54 AM
Anonymous said...: Christopher,
I've been playing around with this for a bit and have had to tweak a few things to get it to work properly. There were several codelets that I couldn't get to cross-compile properly with several different toolchains -- which did you use, or did you compile natively on the Beagleboard? Also, the aligned() function in dft-ffmpeg does not work as intended, so applicable() will always fail and the FFMPEG functions will never be used by FFTW. A quick fix is to change:

return( (unsigned int)p % FFMPEG_ALIGNMENT() == (unsigned int)p );

to:

return( (unsigned int)p % FFMPEG_ALIGNMENT() == 0 );

But you probably meant to turn FFMPEG_ALIGNMENT into a mask ~(0xF) == 0xFFFFFFF0 and say:

return ( ((unsigned int)p & mask) == (unsigned int)p ), which might be faster if gcc can't figure that out for itself (it should strength reduce "modulo by power of 2" automatically, but I haven't bothered to check). Note that the inner parentheses are required, because == binds more tightly than & in C.

What is the best way to get in touch about integrating my small changes and anything else I do that extends your work?; November 8, 2010 at 6:01 PM
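The alignment fix described in the comment above can be sketched as follows. The original test, p % A == p, only holds when p is smaller than A, which is never the case for a real pointer, so it always reports "unaligned". This sketch uses uintptr_t rather than unsigned int so the cast stays valid on 64-bit targets; the function names are hypothetical and the 16-byte alignment is only inferred from the ~(0xF) mask mentioned above, not taken from the dft-ffmpeg source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modulo form: true when p is a multiple of `alignment` bytes.
 * Works for any non-zero alignment. */
int is_aligned_mod(const void *p, uintptr_t alignment)
{
    assert(alignment != 0);
    return (uintptr_t)p % alignment == 0;
}

/* Mask form: true when the low bits of p are all zero.
 * Valid only when `alignment` is a power of two. */
int is_aligned_mask(const void *p, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

For power-of-two alignments the two forms give the same answer, and a compiler will typically reduce the modulo to the mask on its own, so the choice is mostly about readability.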
cto_maverick said...: On my setup, the normal fftw-3.2.2 builds and runs fine natively on the Beagleboard xM.

However, your patch, when applied, fails at line 1441.

When I grab your git tree it configures but fails during make due to what looks like a missing file (see below).

-David

david@david-beagleboardXM:~/Documents/gsoc2010-fftw-neon$ make
make all-recursive
make[1]: Entering directory `/home/david/Documents/gsoc2010-fftw-neon'
Making all in support
make[2]: Entering directory `/home/david/Documents/gsoc2010-fftw-neon/support'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory `/home/david/Documents/gsoc2010-fftw-neon/support'
Making all in kernel
make[2]: Entering directory `/home/david/Documents/gsoc2010-fftw-neon/kernel'
/bin/bash ../libtool --tag=CC --mode=compile gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I../simd -g -O2 -MT cpy1d.lo -MD -MP -MF .deps/cpy1d.Tpo -c -o cpy1d.lo cpy1d.c
libtool: compile: gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I../simd -g -O2 -MT cpy1d.lo -MD -MP -MF .deps/cpy1d.Tpo -c cpy1d.c -o cpy1d.o
cpy1d.c:91: fatal error: cpy1d_neon.c: No such file or directory
compilation terminated.
make[2]: *** [cpy1d.lo] Error 1
make[2]: Leaving directory `/home/david/Documents/gsoc2010-fftw-neon/kernel'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/david/Documents/gsoc2010-fftw-neon'
make: *** [all] Error 2; March 22, 2011 at 7:20 AM
cto_maverick said...: I had to make a few modifications, shown below, in order to use gcc on the Beagleboard xM, since you used a cross-compiler on another system rather than gcc on the Beagleboard xM itself:

cd gsoc2010-fftw-neon
make clean
sh bootstrap.sh
CFLAGS="-O3 -pipe -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp" ./configure --prefix=/usr --enable-single --enable-shared --with-pic --host=arm-none-linux-gnueabi --enable-armv7-cycle-counter --enable-neon
cd kernel
cp cpy1d.c __cpy1d.c
cp cpy1d-neon.c cpy1d.c
cd ..
make

Make currently fails due to an internal gcc compiler error which I have submitted to GNU
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48256; March 23, 2011 at 8:59 AM
