Wednesday, June 30, 2010

Added Intrinsics and Cycle Counter Support

Today I pushed two relatively simple changes that I wanted to get out of the way.

The first changeset adds NEON intrinsics to FFTW (although some testing / tuning still needs to be done). The end product will not rely on NEON intrinsics, but I thought they would make a good baseline to benchmark against in any case. By specifically not using intrinsics (in the simd/ directory of FFTW), a lot more will have to be rewritten by hand in .S files, and I'm corresponding with Matteo about which routines to give special attention to.

The second changeset adds support for the ARMv7 cycle counter, which will be used during the benchmarking stages.

Monday, June 28, 2010

Weekly Report - Week 5

Originally, I was quite happy to have achieved a 5.5x speedup in multiplying long vectors of complex floats, but this week, I managed to increase that to values varying between 11x and 18x. My changes were quite subtle, but I suspect the main problem was issuing vmul instructions after vmla, which is specifically noted to produce stalls of 4 cycles in the Cortex-A8 reference manual. What's odd, though, is that the q-reg implementation seems to perform slightly faster than the d-reg implementation in most cases. In any case, these speedups are most certainly going to make a major difference when implemented in fftw compared to the code generated from the generic C code.

In case you aren't watching them already, I have two repositories set up. The first one is specifically for fftw, and the second is for snippets of C code and assembly, for benchmarking purposes.

Please feel free to check out 'demo2' from misc and run it on your beagleboard to observe the speedups first-hand.

Sunday, June 20, 2010

Demo #2

Hi all,

it's far past my bedtime, so I'm going to put up a link to some sample code that shows a 5.5x speedup for multiplying long vectors of complex numbers using NEON as opposed to the default libc implementation. Here it is. This has been an "educational" experience in familiarizing myself with the NEON instruction set. Tomorrow, I'm moving on to more FFTW-related things, and will finally commit some demo code and FFTW changes to my git repository.

Monday, June 14, 2010

(Pre) Weekly Report #4

Hi folks,

Unfortunately I can't say that I have very much to report today. I spent this entire week working day and night on the final antenna model for my thesis, which was important for me to finish before my flight back to Canada tomorrow morning.

However, I should be able to accomplish the goals of my last week by tomorrow evening when I arrive back in Canada, since I'll have 10+ hours of travel between origin and destination. Please look forward to a more informative Weekly Update #4 tomorrow evening (GMT-5).

Monday, June 7, 2010

Weekly Report #3

Hello all,

I spent the last week familiarizing myself with the NEON instruction set and resolving some confusion I had about how to load interleaved data elements from memory. I also took a few first stabs at integrating ARM/NEON-specific code into FFTW.

As an example of interleaved data elements, complex floats have their real and imaginary components interleaved in memory, so like components (real-to-real, imaginary-to-imaginary) are separated by two word addresses. For my previous demo (multiplying long arrays of real floats), I had used the older VFP instructions (FLDM, FSTM) for loading and storing vectors into memory. This time, I proceeded under the incorrect assumption that the FPSCR.STRIDE setting controlled interleaving of memory locations. It's actually the opposite - the stride setting controls interleaving of VFP registers. So needless to say, I received a few segfaults and alignment errors in my next demo app.

The next demo app I'm writing (in progress), has so far helped me to familiarize myself with the nooks and crannies of the NEON instruction set including alignment and latency issues. This kind of in-depth knowledge is vital for FFTW integration, and particularly for squeezing every last drop of performance out of the NEON co-processor. I changed all of my load and store operations to use the VLDn and VSTn instructions, and have so far written 1 of 3 test cases that will be benchmarked for speed, as was the case for my previous demo.

Before I divulge any details, I should say something about complex multiplication. A typical strategy for multiplying two complex numbers is to use the FOIL method. The acronym stands for First, Outer, Inner, Last. So given complex numbers x = a + jb and y = c + jd (where j = sqrt(-1)), the four FOIL products are F = ac, O = ad, I = bc, L = bd, and the multiplication resembles the following:

x * y = (ac - bd) + j(ad + bc) = (F - L) + j(O + I)

This can be broken down into the following (rough) pseudo-code:

    re = MUL(a, c)        ; F: re = a*c
    re = MACS(re, b, d)   ; L: re = re - b*d
    im = MUL(a, d)        ; O: im = a*d
    im = MAC(im, b, c)    ; I: im = im + b*c

Where MUL is floating-point multiplication, MACS is multiply and accumulate (subtractively), and MAC is multiply and accumulate (additively). Clear, right?

For longer arrays of complex numbers (e.g. cf32_t x[256], cf32_t y[256], cf32_t z[256], cf32_mul(z,x,y)), there are a few other considerations to take into account.

1) SIMD Operations

This is the most obvious speedup. If one can compute N products simultaneously, then there is an inherent speedup of N. The structure of the NEON pipeline allows for 2 x 2 simultaneous multiplications within the time of a single ARM instruction, so effectively we can make 4 parallel float multiplications with a single ARM instruction.

2) Memory Bottlenecks

Most ARM devices have a fairly low memory bus clock frequency - the rough equivalent of the front-side bus on the x86 architecture. That means loading and storing variables from RAM is an even more expensive operation on ARM than it is on x86. As a reference, newer x86 chips support bus frequencies of over 1.3 GHz, whereas a typical bus frequency for ARM devices is on the order of 100 times slower.

To get around this issue, operands should be loaded and stored using vector instructions. As shown in my previous demo, that saves a significant amount of time.

3) Pipeline Activity

NEON supports concurrent memory loads and stores. Therefore, the best (fastest) performance can be achieved by keeping the NEON arithmetic pipeline busy, while concurrently loading and storing operands and products. In assembler, this would mean that arithmetic instructions should be mixed with load and store instructions.

4) Co-processor Latency

When switching between ARM load / store operations and NEON load / store operations, a latency penalty is incurred in order to avoid memory hazards, as pointed out here. Thus, ARM loads and stores should be avoided at all costs while doing lengthy NEON operations.

What Next?

As I already mentioned, the current demo app that I'm working on is intended to be educational about the NEON instruction set. However, I will also be comparing 4 different methods for multiplying long arrays of complex floats.

i) Libc - in no way optimized
ii) FFF...OOO...III...LLL (this operates length-wise along the array)
iii) same as above but with intermixed load and store operations
iv) FOILFOILFOIL... (this completes complex multiplications one at a time)

Method (i) will obviously take the longest time to execute, and method (ii) will be an improvement over it. Method (iii) should perform even faster than method (ii). Method (iv) will use intermixed loading and storing, and will also use the MAC / MACS features of NEON.

I expect that methods (iii) and (iv) will be the fastest, and should have some benchmarks to show shortly.


My FFTW work thus far has included some autoconf work, and a few headers in simd/.

The configure script now includes autoconf macros for an '--enable-neon' flag. The flag can be set when cross-compiling and is auto-detected for native ARM compilation. I borrowed the same NEON test that ffmpeg uses in its configure script, so it was fairly straightforward. Subsequently, a HAVE_NEON preprocessor variable is either set or not set in config.h.

Integrating the code itself has been a bit more challenging. Since the SSE SIMD optimizations are also only available for single precision, I decided to start with those and retrofit the NEON optimizations in. The SSE headers use a lot of x86 intrinsics in GCC. Although NEON intrinsics also exist, many have complained that they produce questionable code, so I'll probably tend to use .S files. Hopefully all goes well.