Wednesday, June 30, 2010

Added Intrinsics and Cycle Counter Support

Today I pushed two relatively simple changes that I wanted to get out of the way.

The first changeset adds NEON intrinsics to FFTW (although some testing / tuning still needs to be done). The end product will not rely on NEON intrinsics, but I thought it would be a good thing to benchmark against in any case. By specifically not using intrinsics (in FFTW's simd/ directory), a lot more will have to be written by hand as .S files, and I'm corresponding with Matteo about which routines to give special attention to.

The second changeset adds support for the ARMv7 cycle counter, which will be used during the benchmarking stages.

Monday, June 28, 2010

Weekly Report - Week 5

Originally, I was quite happy to have achieved a 5.5x speedup in multiplying long vectors of complex floats, but this week I managed to increase that to values varying between 11x and 18x. My changes were quite subtle, but I suspect the main problem was issuing vmul instructions after vmla, which is specifically noted to produce stalls of 4 cycles in the Cortex-A8 reference manual. What's odd, though, is that the q-reg implementation seems to perform slightly faster than the d-reg implementation in most cases. In any case, these speedups are most certainly going to make a major difference when implemented in fftw, compared to the code generated from the generic C code.

In case you aren't watching them already, I have two repositories set up: the first is specifically for fftw, and the second is for snippets of C code and assembly, for benchmarking purposes.


Please feel free to check out 'demo2' from misc and run it on your BeagleBoard to observe the speedups first hand.

Sunday, June 20, 2010

Demo #2

Hi all,

it's far past my bedtime, so I'm going to put up a link to some sample code that shows a 5.5x speedup for multiplying long vectors of complex numbers using NEON as opposed to the default libc implementation. Here it is. This has been an "educational" experience in familiarizing myself with the NEON instruction set. Tomorrow, I'm moving on to more FFTW-related things, and will finally commit some demo code and FFTW changes to my git repository.

Monday, June 14, 2010

(Pre) Weekly Report #4

Hi folks,

Unfortunately I can't say that I have very much to report today. I spent this entire week working day and night on the final antenna model for my thesis, which was important for me to finish before my flight back to Canada tomorrow morning.

However, I should be able to accomplish the goals of my last week by tomorrow evening when I arrive back in Canada, since I'll have 10+ hours of travel between origin and destination. Please look forward to a more informative Weekly Update #4 tomorrow evening (GMT-5).

Monday, June 7, 2010

Weekly Report #3

Hello all,

I spent the last week familiarizing myself with the NEON instruction set and resolving some confusion I had about how to load interleaved data elements from memory. I also took a few first-stabs at integrating ARM/NEON specific code into FFTW.

As an example of interleaved data elements, complex floats have interleaved real and imaginary components in memory, so successive components of the same kind are separated by two word addresses. For my previous demo (multiplying long arrays of real floats), I had used the older VFP instructions (FLDM, FSTM) for loading and storing vectors. This time, I proceeded under the incorrect assumption that the FPSCR.STRIDE setting controlled interleaving of memory locations. It's actually the opposite - the stride setting controls interleaving of VFP registers. So needless to say, I received a few segfaults and alignment errors in my next demo app.

The next demo app I'm writing (in progress), has so far helped me to familiarize myself with the nooks and crannies of the NEON instruction set including alignment and latency issues. This kind of in-depth knowledge is vital for FFTW integration, and particularly for squeezing every last drop of performance out of the NEON co-processor. I changed all of my load and store operations to use the VLDn and VSTn instructions, and have so far written 1 of 3 test cases that will be benchmarked for speed, as was the case for my previous demo.

Before I divulge any details, I should say something about complex multiplication. A typical strategy for multiplying two complex numbers is to use the FOIL method. The acronym stands for First, Outer, Inner, Last. So given complex numbers x = a + jb and y = c + jd (where j = sqrt(-1)), the four FOIL products are F = ac, O = ad, I = bc, and L = bd, and FOIL multiplication would resemble the following:

x * y = (ac - bd) + j(ad + bc) = (F - L) + j(O + I)

This can be broken down into the following (rough) pseudo-code:

MUL(reg0,a,c)
MACS(reg0,b,d)
MUL(reg1,a,d)
MAC(reg1,b,c)

Where MUL is floating-point multiplication, MACS is multiply and accumulate (subtractively), and MAC is multiply and accumulate (additively). Clear, right?

For longer arrays of complex numbers (e.g. cf32_t x[256], cf32_t y[256], cf32_t z[256], cf32_mul(z,x,y)), there are a few other considerations to take into account.

1) SIMD Operations

This is the most obvious speedup. If one can compute N products simultaneously, then there is an inherent speedup of N. The structure of the NEON pipeline allows 2 x 2 simultaneous multiplications within the time of a single ARM instruction, so effectively we can perform 4 parallel float multiplications with a single instruction.

2) Memory Bottlenecks

Most ARM devices have a fairly low memory bus clock frequency. That is basically the same thing as the front-side bus on the x86 architecture. What that means is that loading and storing variables from RAM is an even more expensive operation on ARM than it is on x86. As a reference, newer x86 chips support bus frequencies of over 1.3 GHz, whereas a typical bus frequency for ARM devices is on the order of 100 times slower.

To get around this issue, operands should be loaded and stored using vector instructions. As shown in my previous demo, that saves a significant amount of time.

3) Pipeline Activity

NEON supports concurrent memory loads and stores. Therefore, the best (fastest) performance can be achieved by keeping the NEON arithmetic pipeline busy, while concurrently loading and storing operands and products. In assembler, this would mean that arithmetic instructions should be mixed with load and store instructions.

4) Co-processor Latency

When switching between ARM load / store operations and NEON load / store operations, a latency penalty is incurred in order to avoid memory hazards, as pointed out here. Thus, ARM loads and stores should be avoided at all costs while doing lengthy NEON operations.

What Next?

As I already mentioned, the current demo app that I'm working on is intended to be educational about the NEON instruction set. However, I will also be comparing 4 different methods for multiplying long arrays of complex floats.

i) Libc - in no way optimized
ii) FFF...OOO...III...LLL (this operates length-wise along the array)
iii) same as above but with intermixed load and store operations
iv) FOILFOILFOIL... (this completes complex multiplications one at a time)

Method (i) will obviously take the longest time to execute. Method (ii) will be an improvement over method (i). Method (iii) should perform even faster than method (ii). Method (iv) will use intermixed loading and storing, and will also use the MAC / MACS features of NEON.

I expect that methods (iii) and (iv) will be the fastest, and should have some benchmarks to show shortly.

As For FFTW

My FFTW work thus far has included some autoconf work and a few headers in simd/.

The configure.ac now includes autoconf macros for an '--enable-neon' flag. The flag can be set when cross-compiling and is auto-detected for native ARM compilation. I borrowed the NEON test that ffmpeg uses in its configure script, so it was fairly straightforward. Subsequently, a HAVE_NEON preprocessor variable is either set or left unset in config.h.
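For readers unfamiliar with autoconf, the check has roughly the following shape (an illustrative sketch, not the exact contents of my configure.ac):

```m4
dnl Sketch of an --enable-neon check; the real configure.ac hooks into
dnl FFTW's existing SIMD machinery.
AC_ARG_ENABLE(neon,
  [AS_HELP_STRING([--enable-neon], [enable ARM NEON optimizations])],
  [have_neon=$enableval], [have_neon=no])

if test "$have_neon" = "yes"; then
  AC_MSG_CHECKING([whether the compiler accepts NEON instructions])
  AC_COMPILE_IFELSE(
    [AC_LANG_PROGRAM([], [[__asm__("vadd.f32 q0, q0, q0");]])],
    [AC_MSG_RESULT([yes])
     AC_DEFINE(HAVE_NEON, 1, [Define to enable ARM NEON optimizations.])],
    [AC_MSG_RESULT([no]); have_neon=no])
fi
```

When cross-compiling, a compile-only test like this is preferable to a run test, since the build host generally can't execute ARM binaries.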

As for integrating code, it's been a bit more challenging. Since the SSE SIMD optimizations are also only available for single precision, I decided to start with those and retrofit the NEON optimizations in. The SSE headers use a lot of x86 intrinsics in GCC. Although NEON intrinsics also exist, many have complained that they produce questionable code, so I'll probably tend to use .S files. Hopefully all goes well.

Monday, May 31, 2010

Some Insight

Well, this was a very busy weekend for me because I was moving out of my apartment here in Kiel. I'll be couch-surfing here for another two weeks before heading back to Montreal. Amid the chaos I took some time, lying down on the floor in my completely white-washed apartment with no furniture, to read some FFTW documentation and check out some of the more interesting corners of its codebase.

The documentation was actually quite resourceful, and it helped me identify some key areas to focus on for the next week.

  1. SIMD Integration
    • explore how this interacts with fftw_malloc (data alignment, etc), should be fairly straight-forward
    • assemble codelets for default array formatting. Hopefully the 'stride' setting of load/store multiple actually works (last time I tried, it did not, to my disappointment), otherwise
    • provide an Advanced / Guru interface that accepts arrays in pre-arranged format (as well as bijective mapper between the default and pre-arranged format)
  2. FFTW Wisdom
    • FFTW's planner tries out several possible methods for computing the DFT and chooses the fastest one. The planner results can be stored as 'Wisdom' for later reuse.
    • it relies on direct access to hardware cycle counters, which are present on most supported platforms (although they say on certain embedded platforms this is not the case); otherwise gettimeofday() is used, which incurs a major overhead cost.
    • look into the cycle-counter code and see if it can be implemented with any of the ARM timers, or if an actual cycle counter exists in hardware that can be accessed by userspace
  3. Perform some very initial rough-optimization of double-precision operations using the VFP and load/store multiple (requires alignment). 

I should also mention that the NEON co-processor is (of course) incapable of double-precision floating-point, which is the default for fftw, and as far as I'm aware, the only way to 'accelerate' double-precision calculations would be to do them 'softly' in the DSP (any other suggestions?). So everything I'm speaking about above applies to single-precision floating-point only. My secondary goal of leveraging the DSP for (many) double-precision computations should address this problem; otherwise VFP must be used.

Tuesday, May 4, 2010

Hello, GSOC world!

What better way to start off the fftw-neon blog than with a greeting in the best of K&R traditions?

In case you were wondering, the reason that this blog exists is to document my progress through the summer of 2010, working on a project funded through the Google Summer of Code. Specifically, the intention of this project is to improve performance of FFTW for NEON-enabled ARM processors.

What is FFTW, you ask? Well, to explain that, I should probably first direct interested readers to a few articles which cover the basics about the Fourier Transform (FT), the Discrete Fourier Transform (DFT), and also the Fast Fourier Transform (FFT). What all of these transforms have in common is that they represent signals in the frequency-domain as opposed to the time domain. What kind of signals? Well... essentially everything in nature that is measurable can be expressed as a signal. For example music, speech, mechanical waves in a structure, voltages in an electrical circuit, radio waves, heart beats, earthquakes, brainwaves, and so on.

The DFT and FFT differ from the FT in that they operate on digital (or discrete) information that can be processed by a digital computer. Although the DFT and the FFT both operate on the time-domain representation of a signal, converting it to a frequency-domain representation (often called a spectrum), the FFT can perform this transformation far faster than the DFT, while providing identical results (in certain cases). The FFT takes advantage of certain symmetries in the DFT and eliminates redundant calculations, which typically requires signals to have lengths equal to some power of 2.

Finally, to answer the original question, FFTW is a finely tuned software library that leverages certain hardware in modern computers to perform the FFT even faster. This is typically achieved by performing several operations in parallel or in sequence, or by reducing the number of accesses over the main memory bus. FFTW is probably most famous for its use in Matlab, GNU Octave, and GNU Radio, although it has certainly seen several other important applications.

To date, FFTW has mainly been optimized for x86 and PowerPC processors. However, as mobile devices are becoming more advanced (reaching GHz clock speeds), we are seeing ARM processors moving out of the traditional embedded market and into more general computing devices; the mobile processors of today are computing powerhouses compared to those of 10 years ago. ARM powered-devices are also now available in the laptop / netbook form factor that has traditionally been dominated by the x86 architecture. There has even been recent mention of multi-core ARM devices moving into the server market.

This is where NEON comes into play. The ARM Cortex-A8 architecture was the first to introduce the Advanced SIMD instruction set (NEON), which can perform several floating-point operations (even more integer operations) in parallel. Combined with vector instructions for loading and storing data, this makes a significant speedup possible over the generic code generated by GCC. Applying those optimizations to FFTW's codebase will enable the next generation of ARM devices (in both the handheld and netbook formats) to be used for scientific computing, or really any application that uses the DFT / FFT.

Well, that's it for my initial post. If you're interested in following the progress of this project, be sure to follow this blog. Alternatively, feel free to subscribe to the mailing list and monitor the git repository for new code.