Thursday, August 19, 2010

Final Benchmarks

update-2010-08-20: you can download a consolidated patch of my fftw work (includes all of my changesets) from here. Alternatively, clone my git repository. You can find build instructions here.

Below are my benchmarks from today. They're 'final' only in the sense that they're the ones I'm submitting for GSOC. Yes, you might very well notice that 'fftwff' (i.e. ffmpeg called through fftw) dominates in most cases where power-of-two transforms are used. However, you should also notice that fftw provides a nice dispatching mechanism that allows one to use ffmpeg's blazingly fast transforms to compute higher-order transforms through a single api. Pretty nice!

As I already mentioned, these benchmarks do not contain the neon-simd optimized strided memory transfers that I started rather late in the game, but you can probably imagine that for any composite transform (i.e. if N != m^p, or any transform of higher dimensionality) those memory transfer routines will improve the performance of ALL versions ('fftw3', 'fftwni', 'fftwn', and 'fftw3ff') by a good factor of 2x, at least. After that, once a couple of straightforward small-prime routines (i.e. 3, 5, 7, 11) are implemented in neon-asm, the non-power-of-two (npot) performance will begin to approach the power-of-two (pot) performance, which is ideal. That is currently what makes fftw so attractive on x86 architectures: it provides O(N log N) performance for all lengths, not just powers of two, and without zero-padding (for most transforms of significant length).

Please feel free to check out the raw data on Google Docs (bench-1d-pot, bench-1d-npot, bench-2d-pot, bench-2d-npot, bench-3d-pot, bench-3d-npot). Also, just in case it isn't clear, when 'mflops' is on the y-axis, a higher value is better, and when log10(time(s)) is on the y-axis, a smaller value is better. 

Some remarks: ffmpeg does not have api calls for any length that is not a power of two, nor does it have api calls for higher-order (e.g. NxN) transforms, which is why it's missing from some of the graphs below. Furthermore, ffmpeg (like most other libraries) does not support transforms of vector data; since I added the ffmpeg interface to fftw, it does now :-). In this sense, calling ffmpeg through fftw leverages the excellent planning and dispatching routines of fftw to provide the absolute best possible performance. It's unlikely that anyone would directly use ffmpeg's transforms for anything other than decoding multimedia - but I think that's best left unchanged. Ffmpeg is likely the best multimedia library publicly available.
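For the curious, here is roughly what driving ffmpeg's public FFT interface looks like (a sketch against libavcodec/avfft.h as it stands in current git; buffer contents elided). Note that the length is passed as log2(N) ('nbits'), so a non-power-of-two length simply cannot be expressed:

#include <libavcodec/avfft.h>
#include <libavutil/mem.h>

int main(void)
{
    /* The length is given as log2(N): here 2^10 = 1024 points. */
    FFTContext *ctx = av_fft_init(10, 0);             /* 0 = forward */
    FFTComplex *buf = av_malloc(1024 * sizeof(*buf)); /* aligned alloc */

    /* ... fill buf with interleaved {re, im} floats ... */
    av_fft_permute(ctx, buf);   /* bit-reversal reordering, first */
    av_fft_calc(ctx, buf);      /* in-place, contiguous transform */

    av_free(buf);
    av_fft_end(ctx);
    return 0;
}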

In summary, my improvements to fftw include anything marked 'fftw3n' (neon simd, asm), 'fftw3ni' (neon simd, intrinsics), or 'fftw3ff' (neon simd, ffmpeg). Actually, even the results marked 'fftw3' are improved over their original values due to the added cycle counter... and you can count on me adding the missing features once my thesis is submitted ;-)








Wednesday, August 18, 2010

TODO: The Leftovers

The few things that I did not completely finish in time:

  1. The biggest missing feature by far: optimized, strided (and non-strided) memory copies. You can see where I was going with this. 1D transforms are fairly straightforward, but 2D and beyond will require a bit of work. 
  2. Further testing of rdft-ffmpeg.c and reodft-ffmpeg.c. I did some very basic tests to verify correct output for forward transforms, but not for reverse transforms. 
  3. Ideally, a few extremely fast transforms in neon / asm for small, non-power-of-two cases. These would complement the extremely fast power-of-two cases and also make composite transforms considerably faster (much like on x86). 
I'll definitely be tackling those issues after my thesis is handed in. Only 16 more days and then I'm done in academia for the rest of my life! 

Monday, August 16, 2010

FFTW: Weekly Report - Weeks 11, 12, and 13

After much anticipation, here are my very-belated weekly updates, along with another few lovely graphs.

update-20100819 (am): I had to restart the benchmark before I left for the lab this morning, because the ffmpeg can_do() function in my benchfft source caused a double-free error in glibc. I fixed it this morning (removed the out-of-place junk). The benchmarks will be ready by the time I get home. Since these are my final benchmarks, I'll copy / paste the timing data into oocalc and finally put time on the y-axis instead of mflops. It's a bit of a pain that benchmarking takes 4-6 hours. Sorry for the delay.

update-20100818: Arg! A bug-ridden, cycle-hungry udevd process was absolutely killing the metrics of my last benchmark. It happens every time I reboot, and I just haven't gotten around to fixing it yet. Usually I remember to kill udevd before doing anything else, but I must have accidentally cycled the power on my Beagle last night before starting the benchmark. Below is one graph from the results (which I will replace as soon as the next set of results is finished). You can see several more results of neon performance under heavy load (and it's still pretty decent!). In particular, it's an interesting experience to see how neon just jogs leisurely along while 'top' shows a single process using 95% of the cpu cycles. If you look at the graphs under heavy load, you'll probably also notice how strided and non-strided memory transfers take a beating. I've already restarted the benchmark and should have some new graphs up later tonight.

When I set out to improve FFTW on NEON-enabled ARM processors, the main focus was the TI OMAP 3430, for the BeagleBoard project - and that is what I did. Originally, the speed increase was rather modest (1.5-2x or so), because I simply implemented the SIMD / codelet API that most architectures already use. However, the NEON coprocessor has no hardware instruction rescheduling (i.e. it is in-order), nor does the Cortex-A8 core, and without those two things (not to mention better scheduling from GCC), FFTW's codelet interface does not perform nearly as well as hand-written assembly code. In contrast, on x86 processors, the benefit of the out-of-order execution unit (and likely far better compiler optimizations) is clearly visible, as FFTW outperforms FFMPEG even on my eee701 (shown in the graphs below). The Cortex-A9 does have an out-of-order unit, but it's unlikely that this will have a major impact on NEON scheduling.


My most recent BeagleBoard performance comparison included FFMPEG's transform for powers of two, which was far faster than FFTW using codelets, so I knew that I had some high goals to reach. I originally thought it would be smart to port lots of highly optimized assembly into a new non-portable / architecture-dependent simd subfolder in FFTW, trying my damnedest to get my code to measure up to FFMPEG, but then I realized that was just stupid when there was already a perfectly optimized FFMPEG library right in front of me. Next, I went to lots of extra trouble to create a separate ffmpeg_fft library which I bundled with FFTW. That was fine, although I did dig up an interesting alignment-related bug with respect to how FFTW handles FFTW_FORWARD and FFTW_BACKWARD. I think I may also have dug up another cortex-a8 errata-worthy tidbit, or found a bug in the Linux kernel's handling of neon instructions... but that's slightly tangential.

Since my last post, I've smoothed over my work, added the beginnings of NEON copy acceleration for buffered / rank-0 transforms, and run a few tests that you can check out here (here) and here. The results might be surprising even for the non-power-of-two cases; I was surprised at first, myself. Originally, I was curious as to how the various non-power-of-two / prime algorithms worked. After some perusal of related articles on IEEEXplore, I learned that many of the non-power-of-two algorithms (e.g. Bluestein, Rader) rely on some form of recursion or composite solution where power-of-two DFTs (or other small prime factors) are often key, which is why non-power-of-two performance was slightly increased with NEON SIMD codelets and FFMPEG routines. In the non-power-of-two case, what typically happens is that the problem is re-expressed in terms of smaller power-of-two (or small-prime) sub-transforms, which the accelerated routines then speed up. I found Rader's algorithm to be particularly interesting; these days it's treated as common knowledge in most advanced DSP courses. Non-power-of-two routines are still not as fast as ffmpeg's power-of-two algorithms, but it might be possible to improve that with a bit more work on NEON copy acceleration.
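For the curious, Bluestein's trick fits in one line of algebra. Since nk = (n^2 + k^2 - (k-n)^2) / 2, the DFT sum X[k] = sum_n x[n] * e^(-j*2*pi*n*k/N) can be rewritten as

X[k] = e^(-j*pi*k^2/N) * sum_n [ x[n] * e^(-j*pi*n^2/N) ] * e^(+j*pi*(k-n)^2/N)

which is a convolution of two fixed sequences - and a convolution of any length can be evaluated with (zero-padded) power-of-two FFTs, exactly the kind that ffmpeg computes so quickly.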

A few technical details:
  • Not only the DFT, but also the RDFT, DCT, and DST interfaces to FFMPEG were brought out into FFTW, which gives much more flexibility for varying applications.
  • Currently the FFMPEG interfaces stick with the in-place, contiguous, and aligned API of FFMPEG. That is, any non-aligned or composite transforms require buffering before feeding them into FFMPEG. In FFTW terms, changing strided data into non-strided data is called a rank-0 transform. Rank-0 transforms always occur for composite DFTs, and thus require a very efficient set of copying routines (see the sketch just after this list).
  • The ARMv7 cycle timer made a major difference to performance, since algorithms could be chosen by actual measurement rather than simple estimation. I would highly suggest that people apply Mans' USER_PMON patch and recompile their kernel before trying out my changes. 
  • Since FFTW's interface to FFMPEG is generic, the native library may or may not actually use native SIMD instructions. As we all know, misalignment with SIMD instructions is bad, so it's necessary to ensure that input data is properly aligned. Currently FFMPEG uses a constant 16-byte alignment, which is straightforward to account for, but FFTW does a clever little swap of real and imaginary input pointers to indicate an IDFT (i.e. the data is no longer aligned). After some time, I found a bit of trickery to get around that, which I will document further on elinux.org.
  • Lastly, if you choose to link with FFMPEG, please ensure that you have --enable-hardcoded-tables selected. Otherwise, FFMPEG (current git) will cause a segmentation fault (in cos table initialization). That is an FFMPEG-related bug that I was unlucky enough to encounter during my testing phase.
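To make the rank-0 idea concrete, here's a minimal scalar sketch of what such a copy does (names are mine, not FFTW's; the NEON-accelerated versions replace this loop with vector loads and stores):

/* A rank-0 "transform" is just a strided copy: no arithmetic at all.
 * 'is' and 'os' are input/output strides in complex elements;
 * each complex element is two floats (re, im).
 */
static void rank0_copy(const float *in, float *out,
                       int n, int is, int os)
{
    int i;
    for (i = 0; i < n; ++i) {
        out[2 * i * os]     = in[2 * i * is];
        out[2 * i * os + 1] = in[2 * i * is + 1];
    }
}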

Having a cycle counter means that FFTW will no longer be restricted to running in ESTIMATE mode on ARMv7, which will allow it to accurately time several different strategies and then store wisdom files to speed up future calculations.
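For reference, reading the ARMv7 cycle counter (PMCCNTR) from user space boils down to a single coprocessor instruction - assuming the kernel has granted user-mode PMU access, which is exactly what the USER_PMON patch mentioned above does:

/* Read the ARMv7 performance monitor cycle counter (PMCCNTR).
 * Traps with an illegal instruction unless the kernel has enabled
 * user-space access to the PMU.
 */
static inline unsigned int read_ccnt(void)
{
    unsigned int cycles;
    __asm__ __volatile__ ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
    return cycles;
}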


So the bottom line is that FFMPEG's blazingly fast power-of-two transforms benefit all of FFTW's algorithms, including those for prime and non-power-of-two input lengths. However, such composite transforms rely heavily on efficient strided memory copy algorithms (rank-0 transforms). Since I only just started chipping away at that (for NEON), the performance for non-power-of-two transforms is still less than ideal. Also, I should note that any higher-order transform (e.g. for images or volumes) also requires highly efficient rank-0 transforms. Hence, none of my graphs show higher-order throughput results.

There is one more thing worth noting. Rather than using a bundled ffmpeg_fft library inside FFTW (as I initially did for testing), I decided to take the simple and obvious approach. That is, I chose to link to FFMPEG from within the FFTW code dynamically (statically linking is also possible). Why?
  1. maintenance becomes much easier,
  2. bundled libraries are inherently evil, and 
  3. my contribution then becomes architecture agnostic.
Rather than solely benefiting NEON-enabled processors, it actually benefits all architectures that have support in both FFTW and FFMPEG. The list includes all x86 variants, some PowerPC variants, the dozens of ARM SoCs with a NEON coprocessor (e.g. OMAP, i.MX, and handfuls of others), and likely more. It's like killing 100 birds with a single stone ;-) Please note that there are some non-power-of-two cases, as shown in the graphs, where my code in simd-neon.h (fftwfn) actually outperforms the mish-mash of power-of-two composites that the fftwff (ffmpeg) version generates. So it's nice to know that it is, on occasion, useful. Once NEON copy acceleration is finished, it will likely bring another major improvement.

In the most ideal case, genfft would have been adapted to directly output optimized ARM / NEON machine code, much like a JIT compiler, but that would probably have taken much more time than was available. At this point, I think I can finally say that I'm happy with FFTW's throughput results (and the potential for improvement with accelerated NEON rank-0 transforms), even through the 'brute force' methods that I chose. That being said, although the code is more-or-less finished and more-or-less bug-free, the fftw-neon project still needs some finishing touches. I will probably only have time to do some instructional documentation and a demonstration screencast, but it shouldn't be too hard to put together a bitbake recipe at a later date.

Although I thoroughly enjoyed working on this project and did put a lot of time into it (some of which really should have been dedicated to my MSc thesis), much of the credit should go to the authors of FFMPEG and FFTW. I would also like to thank the people who offered me advice and encouragement - specifically my mentor Mans, Matteo from the FFTW project, as well as Jason and Koen from #beagle. It has definitely been a good experience working on FFTW, and I feel inclined to do some further optimizations in the future - specifically, NEON-accelerated rank-0 transforms and perhaps some small-prime algorithms in NEON assembler (e.g. for N=3, 5, 7).

Please drop me a line if you find this work useful - I always love to hear how I've helped others out. Also, within the next few days, please feel free to visit my project page on elinux.org for documentation, instructions, demonstrations, and more. In case anyone is interested, you could probably also anticipate seeing the results of this project ending up in a certain free version of a popular MAtrix maTh LibrAry, Built for mobile devices (i.e. GNU Octave). 

Monday, August 9, 2010

Meanwhile

I'm waiting for another benchmark to finish before publishing my (rather lengthy) weekly update. It should be done within the next hour. Meanwhile, here's a small sample of many more graphs to come. 
update-20100812 (am): right before I was about to publish the rest of the graphs, I ran into a really strange (probably alignment-related) bug that was causing fftwff to stop at N=32. I've since fixed the bug, but also realized that I completely neglected optimized memory copies! FFTW terms these 'rank-0' transforms, and you can probably imagine that they're absolutely essential for pretty much everything. This morning I identified the appropriate files to fix, and will get to it tonight after spending the rest of the day on my thesis. For the impatient, I graciously refer you to my git repository.

update-20100814 (pm): ... found the bug... oddly enough, it was not due to misalignment. libavcodec/fft.c:75 in ff_init_ff_cos_tabs(). I rebuilt ffmpeg with hardcoded tables.

update-20100816 (am): I've since fixed several other bugs to do with alignment and vector strides, added the beginnings of memory copy acceleration (for rank-0 transforms and buffered operations), and am (yet again) generating lots of lovely graphs.  

Tuesday, July 27, 2010

FFTW: Weekly Report #10

After getting an extension on my thesis last week, I've been able to get back into FFTW a bit more, so I started going after the main thing on my agenda: adding non-portable, architecture-specific, neon-enabled fft routines that do NOT use codelets (gitorious). I decided to add these routines in dft/simd/nonportable, for lack of a better destination.

The first step was to determine how to expose a new fft algorithm / implementation to fftw's planner. Taking a look at an example like dft/bluestein.c, it was clear that each algorithm / implementation registers with the planner through a function of the form X(dft_NAME_register), and that each of these registration functions must appear in a const solvtab structure. In dft/conf.c, each of the solver tables (standard & simd) is added to the list of possible solvers via X(dft_conf_standard).
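As a rough sketch of that pattern (modeled on dft/bluestein.c and dft/conf.c - the solver and function names here are hypothetical placeholders, not code from my tree):

/* Registering a new DFT solver with the planner, FFTW-internals style.
 * X() is FFTW's name-mangling macro; SOLVTAB/SOLVTAB_END come from
 * kernel/ifftw.h. 'dft_neon_pot_register' is a made-up example name.
 */
#include "dft.h"

static const solvtab s =
{
     SOLVTAB(X(dft_neon_pot_register)),
     SOLVTAB_END
};

void X(dft_simd_nonportable_conf)(planner *p)
{
     X(solvtab_exec)(s, p);
}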

In dft/simd/nonportable, a subdirectory for each architecture (in my case 'arm') should be defined and added to dft/simd/nonportable/codlist.c (if supported).

If the ESTIMATE flag is passed to the fftw planner, then the chosen algorithm / implementation is based on the number of additions, multiplications, and other operations reported by each available algorithm and implementation. If the PATIENT flag is passed to the planner, then most available algorithms and implementations will actually be timed (now I definitely need to fix the armv7a timer code), and the one with the shortest time will be selected. I don't yet know of a way to explicitly select which algorithm to use via the advanced or guru interfaces, although that feature may well be deliberately absent from fftw.
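From the caller's side, the flag is just the last argument to the planner; a quick sketch with the public single-precision API:

#include <fftw3.h>

int main(void)
{
    int n = 1024;
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * n);

    /* ESTIMATE: pick a plan from operation counts, no timing at all. */
    fftwf_plan cheap = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD,
                                         FFTW_ESTIMATE);
    /* PATIENT: actually time the candidate plans (needs a good timer). */
    fftwf_plan tuned = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD,
                                         FFTW_PATIENT);

    fftwf_destroy_plan(cheap);
    fftwf_destroy_plan(tuned);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}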

So, now that I have the groundwork laid out, I can go ahead and splice some of the ffmpeg power-of-two fft code into fftw in dft/simd/nonportable/arm. I'll try to get that done within the next day or two. All that's necessary is a bit of a bridge to join the two apis. After that, I will begin the interesting journey of seeing whether I can improve the various non-power-of-two algorithms with some elegantly crafted assembler code.

Monday, July 19, 2010

FFTW: Weekly Report #9



Hi everyone, 

There was not a lot of activity this week, mainly due to playing catch-up on my thesis. 

Highlights: Misc Repository

  • ffmpeg's fft was separated out of ffmpeg into its own small library (<39kB)
  • ffmpeg's fft is integrated with benchfft, for easy visualization



Todos: 

* fix real / reverse portions of ffmpeg's benchmark in benchfft
* fix benchfft's graphing features to display runtime rather than mflops on the y-axis. 

Please see my last post for an initial graph of ffmpeg's fft vs fftw with neon, as well as details about my plans with fftw. 

Thursday, July 15, 2010

This One's For You, Mru

As a follow up to my last post, I thought that I would show everyone a pretty graph with FFMpeg's FFT performance benched against FFTW for powers of two. I think it's obvious who the clear winner is from the image below.

Just for clarification purposes, 'fftw3' means 'fftw without neon simd optimizations', 'fftw3n' means 'fftw with inline asm neon simd optimizations', and 'fftwni' means 'fftw with gcc intrinsic neon simd optimizations'.

All versions of FFTW were compiled with CFLAGS="-O3 -pipe -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon -fno-tree-vectorize". You can get the source at gitorious.org/gsoc2010-fftw-neon.

Now, a few comments.

First of all, NEON intrinsics perform almost identically to inline asm in the 'codelet' department, which really just goes to show exactly what one can expect from GCC's NEON optimization process (i.e. it is not optimized).

Secondly, well-optimized assembler will almost always outperform what the compiler is capable of (arguably even when there is a hardware optimizer; see OoOE). The NEON coprocessor does not execute out of order (although the Cortex-A9 has an out-of-order unit for the ARM cores), which means that instruction scheduling for NEON is entirely up to the compiler and the programmer.

Lastly, FFMpeg's implementation is only for powers of two (even with very large radices), which means that the way to compute a non-power-of-two transform with FFMpeg would be to first zero-pad the input. Zero-padding is also possible with FFTW, but it must be done before making the call to fftw*_plan. In other words, FFTW does not explicitly zero-pad data (as far as I'm aware, but I could be wrong). From what I understand (based on some random memory fragment suggesting that I read this somewhere a long time ago), programs such as Matlab actually do zero-pad all data before passing it to the FFTW library.
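For illustration, here is the kind of zero-padding a caller would do by hand before planning (a sketch against the public single-precision API; next_pow2 is a made-up helper):

#include <fftw3.h>
#include <string.h>

static int next_pow2(int n)
{
    int p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Pad a length-n signal to the next power of two, then plan/execute. */
static void padded_fft(const fftwf_complex *x, int n)
{
    int np = next_pow2(n);
    fftwf_complex *buf = fftwf_malloc(sizeof(fftwf_complex) * np);

    memcpy(buf, x, sizeof(fftwf_complex) * n);
    memset(buf + n, 0, sizeof(fftwf_complex) * (np - n));

    fftwf_plan p = fftwf_plan_dft_1d(np, buf, buf, FFTW_FORWARD,
                                     FFTW_ESTIMATE);
    fftwf_execute(p);
    fftwf_destroy_plan(p);
    fftwf_free(buf);
}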

So, for the purists out there, or those who use embedded devices with (really!) small amounts of RAM, this is where FFTW becomes very attractive. FFTW uses several well-published algorithms specifically for non-power-of-two problems to efficiently compute transforms without zero-padding. Put another way, it's better for large input lengths (on the order of 2^27+1), since zero-padding to the next power of two could max out available RAM. That's up for some debate anyway, and I don't know very many people who would do that on an ARM device (thankfully).

Now, the next step is to determine how to augment FFTW's existing plans with power-of-two algorithms optimized similarly to FFMpeg's. Then, the step after that is to tackle some of the non-power-of-two algorithms and try to rewrite them in NEON assembler. According to FFTW, even the non-power-of-two algorithms are O(N log N), so perhaps (just perhaps) they will show a similar level of performance when written directly in asm.

In short, I now have a visual depiction of what I should be shooting for. I will keep you posted on how well that goes.

PS: If you're wondering who Mru is, he's indeed my mentor for GSOC and maintainer for ARM portions of FFMpeg.

Sunday, July 11, 2010

Fftw-Neon: Weekly Report: Week 8

Over the last week, I did some more benchmarking, and also began working ffmpeg_fft into my benchmarks as a good target to aim for. I'm currently working out a segfault in the ffmpeg_fft library (specifically in ff_fft_permute_neon.S) and hope to have a few more pretty graphs to show next time. I added fftwni to the benchmarks in order to compare codelets coded in inline asm with codelets coded in neon intrinsics (I also started using -O3, for -finline-functions), and the results were very close for fftwn (inline neon asm) and fftwni (neon intrinsics). The one exception is that vtrn was coded in inline asm to prevent compiler errors about being unable to spill registers. Also, I'm changing the graphs to directly show cycles (or time) instead of mflops. The next major step will be working power-of-two fft 'algorithms' into the fftw planner. After that, I'll be tackling the much more interesting non-power-of-two algorithms. By directly coding specific algorithms in asm, we're hoping to achieve a greater speedup than what is possible with codelets. The asm routines do not necessarily need to completely displace the codelets from the planner's point of view (at least not without hard proof that they are always faster), so they can instead augment the pool of algorithms available to the planner.

I will need to have some serious crunch time for my thesis as well, since the submission deadline is in 19 days (!), so you might not see as many commits in my repositories this week. Please feel free to look at the latest in main and misc.

Tuesday, July 6, 2010

Results from Inline Asm / Codelets

I thought I would post the initial speed results achieved using the latest inline asm patch for simd-neon.h. The results are decent - a 2x mflop improvement in some cases - but I'm still aiming for approximately a 10x speedup on average, which will require some pretty intensive assembly coding. The results below are only for powers of two, just for a quick overview. Also, I'm trying to completely eliminate in-place transforms from the benchmark, since fftw always performs simd transforms out-of-place (as far as I can tell), and out-of-place transforms are generally faster anyway.




Sunday, July 4, 2010

FFTW: Weekly Report - Week 7

Over the last week, I've spent a fair amount of time implementing the "standard" fftw simd interface for neon using intrinsics, and one thing is certainly clear: there is absolutely no throughput gained at all, as you can see in the graph below. And yes, I was definitely using neon code for fftw3n. These results were more or less expected (although maybe not quite this stark), since I was really only using intrinsics for verification purposes, and inline C functions with intrinsics obviously produce some undesirable effects. As far as numerical accuracy goes, it's identical to the non-simd version, which is good. The benchmark was made via benchfft for consistency, and you can grab a copy of it from my misc repository.






Below is an example of how the VMUL and VZMUL inline functions appear in the disassembled .so file when using intrinsics. They shouldn't appear as separate sections at all. The best way to eliminate all of those useless branches, pushes, and pops is to rewrite each inline function in simd-neon.h as an inline asm statement. 





000dc268 <VMUL>:
vmul.f32        q0, q0, q1
bx      lr
...
...
00149cf0 <VZMUL>:
vorr    q8, q0, q0
vtrn.32 q0, q8
push    {lr}            ; (str lr, [sp, #-4]!)
vpush   {d8-d13}
sub     sp, sp, #68     ; 0x44
vstr    d16, [sp, #16]
vstr    d17, [sp, #24]
vstmia  sp, {d0-d1}
mov     lr, sp
add     ip, sp, #32     ; 0x20
ldm     lr!, {r0, r1, r2, r3}
vorr    q6, q1, q1
stmia   ip!, {r0, r1, r2, r3}
vldr    d0, [sp, #32]
vldr    d1, [sp, #40]
ldm     lr, {r0, r1, r2, r3}
stm     ip, {r0, r1, r2, r3}
vldr    d8, [sp, #48]
vldr    d9, [sp, #56]
bl      149c30
vorr    q5, q0, q0
vorr    q0, q6, q6
bl      149c38
vorr    q1, q0, q0
vorr    q0, q4, q4
bl      149c30
vorr    q1, q0, q0
vorr    q0, q5, q5
bl      149ce8
add     sp, sp, #68     ; 0x44
vpop    {d8-d13}
pop     {pc}





On a side note, I did discover that some as-yet-unidentified effect of the armv7 cycle-counter was making my benchmarks hang, so it is currently configured out in my test libraries (--enable-armv7-cycle-counter is not set). 

On another side note, all of my simulations seem to indicate that simd transforms are always out-of-place. This is probably a good thing in any case, because out-of-place transforms tend to be faster, but it also might allow me to implement pointer/register auto-incrementing (with ldXXX {...}, [rN]!). 


In pure C / neon intrinsics, there is no way to specify load-store alignment (major speedups lost) or pointer auto-increment (more arm instruction syncopation). Ideally, there would be only a few seldom-taken conditional branches in arm code, while most of the work is done in the neon coprocessor. 
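For example, the alignment hint and the post-increment are trivial to express in inline asm - a sketch (the actual simd-neon.h code differs):

#include <arm_neon.h>

/* Load 4 floats from a 128-bit aligned address, post-incrementing the
 * pointer - neither the ":128" hint nor the "!" writeback can be
 * expressed with intrinsics. Sketch only.
 */
static inline float32x4_t load4_inc(const float **p)
{
    float32x4_t v;
    __asm__ __volatile__ ("vld1.32 {%q0}, [%1:128]!"
                          : "=w" (v), "+r" (*p)
                          :
                          : "memory");
    return v;
}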


In case anyone would like to test the library out on their own, I'm configuring fftw3 with

CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfloat-abi=softfp" ./configure --prefix=/usr --host=arm-none-linux-gnueabi --enable-float --with-pic --enable-neon --enable-shared

I would highly suggest adding -mfpu=neon to the cflags above as well (for all code); otherwise configure.ac only adds -mfpu=neon to simd-specific compiles. 

Although I did get quite a bit done this week, it's been slower than I would have liked for two reasons: 1) Canada Day (daycare was closed), and 2) my significant other was on the other side of the continent for an academic visit, so I've been a single parent this week. Although I did get to spend lots of extra time with my son, which is always welcome, I think I lost about one or two hours a day getting to the daycare and back. 

Plans for next week: 

1) rewrite inline functions as inline asm instead of inline with intrinsics
2) speed-ups!
3) continue investigating codelet-free approaches (i.e. sacrificing the fftw methodology for speed)
4) fix cycle counter!

Wednesday, June 30, 2010

Added Intrinsics and Cycle Counter Support

Today I pushed two relatively simple changes that I wanted to get out of the way.

The first changeset adds NEON intrinsics to FFTW (although some testing / tuning still needs to be done). The end product will not rely on NEON intrinsics, but I thought it would be a good thing to benchmark against in any case. By specifically not using intrinsics (in the simd/ directory of FFTW), a lot more will have to be rewritten by hand in .S files, and I'm corresponding with Matteo about which routines to give special attention to.

The second changeset adds support for the armv7 cycle counter, which will be used during the benchmarking stages.

Monday, June 28, 2010

Weekly Report - Week 5

Originally, I was quite happy to have achieved a 5.5x speedup in multiplying long vectors of complex floats, but this week I managed to increase that to between 11x and 18x. My changes were quite subtle, but I suspect the main problem was issuing vmul instructions after vmla, which is specifically noted to produce stalls of 4 cycles in the Cortex-A8 reference manual. What's odd, though, is that the q-reg implementation seems to perform slightly faster than the d-reg implementation in most cases. In any case, these speedups are most certainly going to make a major difference when implemented in fftw, compared to the code generated from the generic C.

In case you aren't watching them already, I have two repositories set up. The first one is specifically for fftw, and the second is for snippets of c code and assembly, for benchmarking purposes.


Please feel free to check out 'demo2' from misc and run it on your beagleboard to observe the speedups first-hand.

Sunday, June 20, 2010

Demo #2

Hi all,

It's far past my bedtime, so I'm just going to put up a link to some sample code that shows a 5.5x speedup for multiplying long vectors of complex numbers using NEON, as opposed to the default libc implementation. Here it is. This has been an "educational" experience in familiarizing myself with the NEON instruction set. Tomorrow I'm moving on to more FFTW-related things, and will finally commit some demo code and FFTW changes to my git repository.

Monday, June 14, 2010

(Pre) Weekly Report #4

Hi folks,

Unfortunately I can't say that I have very much to report today. I spent this entire week working day and night on the final antenna model for my thesis, which was important for me to finish before my flight back to Canada tomorrow morning.

However, I should be able to accomplish the goals of my last week by tomorrow evening when I arrive back in Canada, since I'll have 10+ hours of travel between origin and destination. Please look forward to a more informative Weekly Update #4 tomorrow evening (GMT-5).

Monday, June 7, 2010

Weekly Report #3

Hello all,

I spent the last week familiarizing myself with the NEON instruction set and resolving some confusion I had about how to load interleaved data elements from memory. I also took a few first-stabs at integrating ARM/NEON specific code into FFTW.

As an example of interleaved data elements, complex floats have interleaved real and imaginary components in memory, and each component is separated by two word addresses. For my previous demo (multiplying long arrays of real floats), I had used the older VFP instructions (FLDM, FSTM) for loading and storing vectors. This time, I proceeded under the incorrect assumption that the FPSCR.STRIDE setting controlled interleaving of memory locations. It's actually the opposite - the stride setting controls interleaving of VFP registers. So, needless to say, I received a few segfaults and alignment errors in my next demo app.

The next demo app I'm writing (in progress) has so far helped me familiarize myself with the nooks and crannies of the NEON instruction set, including alignment and latency issues. This kind of in-depth knowledge is vital for the FFTW integration, and particularly for squeezing every last drop of performance out of the NEON co-processor. I changed all of my load and store operations to use the VLDn and VSTn instructions, and have so far written 1 of 3 test cases that will be benchmarked for speed, as was the case for my previous demo.

Before I divulge any details, I should say something about complex multiplication. A typical strategy for multiplying two complex numbers is the FOIL method; the acronym stands for First-Outside-Inside-Last. So given complex numbers x = a + jb and y = c + jd (where j = sqrt(-1)), FOIL multiplication resembles the following:

x * y = (ac - bd) + j(ad + bc) = (F - L) + j(O + I)

This can be broken down into the following (rough) pseudo-code:

MUL(reg0,a,c)
MACS(reg0,b,d)
MUL(reg1,a,d)
MAC(reg1,b,c)

Where MUL is floating-point multiplication, MACS is multiply and accumulate (subtractively), and MAC is multiply and accumulate (additively). Clear, right?
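In plain C, one such product looks like this (cf32_t being shorthand for a struct of two floats):

typedef struct { float re, im; } cf32_t;

/* One FOIL-style complex product: (F - L) + j(O + I). */
static cf32_t cf32_mul1(cf32_t x, cf32_t y)
{
    cf32_t z;
    z.re = x.re * y.re - x.im * y.im;   /* F - L */
    z.im = x.re * y.im + x.im * y.re;   /* O + I */
    return z;
}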

For longer arrays of complex numbers (e.g. cf32_t x[256], cf32_t y[256], cf32_t z[256], cf32_mul(z,x,y)), there are a few other considerations to take into account.

1) SIMD Operations

This is the most obvious speedup. If one can compute N products simultaneously, then there is an inherent speedup of N. The structure of the NEON pipeline allows for 2 x 2 simultaneous multiplications within the time of a single ARM instruction, so effectively we can perform 4 parallel float multiplications per ARM instruction.

2) Memory Bottlenecks

Most ARM devices have a fairly low memory bus clock frequency. That is basically the same thing as the front-side bus on the x86 architecture. What that means is that loading and storing variables from RAM is an even more expensive operation on ARM than it is on x86. As a reference, newer x86 chips support bus frequencies of over 1.3 GHz, whereas a typical bus frequency for ARM devices is on the order of 100 times slower.

To get around this issue, operands should be loaded and stored using vector instructions. As shown in my previous demo, that saves a significant amount of time.

3) Pipeline Activity

NEON supports concurrent memory loads and stores. Therefore, the best (fastest) performance can be achieved by keeping the NEON arithmetic pipeline busy while concurrently loading and storing operands and products. In assembler, this means that arithmetic instructions should be mixed with load and store instructions.

4) Co-processor Latency

When switching between ARM load / store operations and NEON load / store operations, a latency penalty is incurred in order to avoid memory hazards, as pointed out here. Thus, ARM loads and stores should be avoided at all costs during lengthy NEON operations.

What Next?

As I already mentioned, the current demo app that I'm working on is intended to be educational about the NEON instruction set. However, I will also be comparing 4 different methods for multiplying long arrays of complex floats.

i) Libc - in no way optimized
ii) FFF...OOO...III...LLL (this operates length-wise along the array)
iii) same as above but with intermixed load and store operations
iv) FOILFOILFOIL... (this completes complex multiplications one at a time)

Method (i) will obviously take the longest time to execute. Method (ii) will be an improvement over method (i). Method (iii) should perform even faster than method (ii). Method (iv) will use intermixed loading and storing, and will also use the MAC / MACS features of NEON.

I expect that methods (iii) and (iv) will be the fastest, and should have some benchmarks to show shortly.
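To give a flavour of what method (iv) boils down to, here is the FOIL product rendered in NEON intrinsics, four complex elements per iteration (the actual test cases are hand-scheduled assembly with intermixed loads and stores; this sketch assumes n is a multiple of 4):

#include <arm_neon.h>

/* Multiply n complex floats element-wise: z[i] = x[i] * y[i].
 * vld2q deinterleaves {re,im} pairs into one vector of four reals
 * and one of four imaginaries, so each iteration completes four
 * complex products using MUL + MLS/MLA, then vst2q reinterleaves.
 */
static void cf32_mul_neon(float *z, const float *x, const float *y, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        float32x4x2_t a = vld2q_f32(x + 2 * i);
        float32x4x2_t b = vld2q_f32(y + 2 * i);
        float32x4x2_t c;

        /* re = ac - bd,  im = ad + bc */
        c.val[0] = vmlsq_f32(vmulq_f32(a.val[0], b.val[0]),
                             a.val[1], b.val[1]);
        c.val[1] = vmlaq_f32(vmulq_f32(a.val[0], b.val[1]),
                             a.val[1], b.val[0]);
        vst2q_f32(z + 2 * i, c);
    }
}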

As For FFTW

My FFTW work thus far has included some autoconf additions and a few headers in simd/.

The configure.ac now includes autoconf macros for an '--enable-neon' flag. The flag can be set when cross-compiling, and is auto-detected for native arm compilation. I borrowed the same NEON test that ffmpeg uses in its configure script, so it was fairly straightforward. Subsequently, a HAVE_NEON preprocessor variable is either set or not set in config.h.

As far as integrating code goes, it's been a bit more challenging. Since the SSE SIMD optimizations are also only available for single precision, I decided to start with those and retrofit the NEON optimizations in. The SSE headers use a lot of x86 intrinsics in GCC. Although NEON intrinsics also exist, many have complained that they produce questionable code, so I'll probably tend to use .S files. Hopefully all goes well.

Monday, May 31, 2010

Some Insight

Well, this was a very busy weekend for me, because I was moving out of my apartment here in Kiel. I'll be couch-surfing for another two weeks before heading back to Montreal. Amid the chaos, I took some time - lying on the floor of my completely white-washed apartment with no furniture - to read some FFTW documentation and check out some of the more interesting corners of its codebase.

The documentation was actually quite informative, and it helped me identify some key areas to focus on for the next week.

  1. SIMD Integration
    • explore how this interacts with fftw_malloc (data alignment, etc); should be fairly straightforward
    • assemble codelets for default array formatting. Hopefully the 'stride' setting of load/store multiple actually works (last time I tried, it did not, to my disappointment); otherwise
    • provide an Advanced / Guru interface that accepts arrays in pre-arranged format (as well as bijective mapper between the default and pre-arranged format)
  2. FFTW Wisdom
    • FFTW's planner tries out several possible methods for computing the DFT and chooses the fastest one. The planner results can be stored as 'Wisdom' for later reuse.
    • it relies on direct access to hardware cycle counters, which are present on most supported platforms (although they say this is not the case on certain embedded platforms); otherwise gettimeofday() is used, which has a major cost in overhead.
    • look into the cycle-counter code and see if it can be implemented with any of the ARM timers, or if an actual cycle counter exists in hardware that can be accessed by userspace
  3. Perform some very initial rough optimization of double-precision operations using the VFP and load/store multiple (requires alignment). 

I should also mention that the NEON co-processor is (of course) incapable of double-precision floating-point, which is the default for fftw, and as far as I'm aware, the only way to 'accelerate' double-precision calculations would be to do them 'softly' in the DSP (any other suggestions?). So everything I'm speaking about above applies to single-precision floating-point only. My secondary goal of leveraging the DSP for (many) double-precision computations should address this problem; otherwise VFP must be used.

Tuesday, May 4, 2010

Hello, GSOC world!

What better way to start off the fftw-neon blog than with a greeting in the best of K&R traditions?

In case you were wondering, the reason that this blog exists is to document my progress through the summer of 2010, working on a project funded through the Google Summer of Code. Specifically, the intention of this project is to improve performance of FFTW for NEON-enabled ARM processors.

What is FFTW, you ask? Well, to explain that, I should probably first direct interested readers to a few articles which cover the basics about the Fourier Transform (FT), the Discrete Fourier Transform (DFT), and also the Fast Fourier Transform (FFT). What all of these transforms have in common is that they represent signals in the frequency-domain as opposed to the time domain. What kind of signals? Well... essentially everything in nature that is measurable can be expressed as a signal. For example music, speech, mechanical waves in a structure, voltages in an electrical circuit, radio waves, heart beats, earthquakes, brainwaves, and so on.

The DFT and FFT differ from the FT in that they operate on digital (or discrete) information that can be processed by a digital computer. Although the DFT and the FFT both operate on the time-domain representation of a signal, converting it to a frequency-domain representation (often called a spectrum), the FFT can perform this transformation incredibly faster than the DFT, while providing identical results (in certain cases). The FFT takes advantage of certain symmetries in the DFT to eliminate redundant calculations, which typically requires signals to have lengths equal to some power of 2.

Finally, to answer the original question: FFTW is a finely tuned software library that leverages certain hardware in modern computers to perform the FFT even faster. This is typically achieved by performing several operations in parallel or in sequence, or by reducing the number of accesses over the main memory bus. FFTW is probably most famous for its use in Matlab, GNU Octave, and GNU Radio, although it has certainly seen several other important applications.

To date, FFTW has mainly been optimized for x86 and PowerPC processors. However, as mobile devices become more advanced (reaching GHz clock speeds), we are seeing ARM processors move out of the traditional embedded market and into more general computing devices; the mobile processors of today are computing powerhouses compared to those of 10 years ago. ARM-powered devices are now also available in the laptop / netbook form factor that has traditionally been dominated by the x86 architecture. There has even been recent mention of multi-core ARM devices moving into the server market.

This is where NEON comes into play. The ARM Cortex-A8 architecture was the first to introduce the Advanced SIMD instruction set (NEON) which can perform several floating-point operations (even more integer operations) in parallel. Combining that with vector instructions for loading and storing data, a significant speedup is possible over the generic code generated by GCC. Applying those optimizations to FFTW's codebase will enable the next generation of ARM devices (in both the handheld and netbook formats) to be used for scientific computing, or really any application that uses the DFT / FFT.

Well, that's it for my initial post. If you're interested in following the progress of this project, be sure to follow this blog. Alternatively, feel free to subscribe to the mailing list and monitor the git repository for new code.