Optimizing FFTW for NEON-Enabled ARM Devices: Weekly Report #3

Hello all,

I spent the last week familiarizing myself with the NEON instruction set and resolving some confusion I had about how to load interleaved data elements from memory. I also took a few first-stabs at integrating ARM/NEON specific code into FFTW.

As an example of interleaved data elements, complex floats have interleaved real and imaginary components in memory and each component is separated by two word addresses. For my previous demo (multiplying long arrays of real floats), I had used the older VFP instructions (FLDM,FSTM) for loading and storing vectors into memory. This time, I proceeded under the incorrect assumption that the FPSCR.STRIDE setting controlled interleaving of memory locations. It's actually the opposite - the stride setting controls interleaving of VFP registers. So needless to say, I received a few segfaults and alignment errors in my next demo app.

The next demo app I'm writing (in progress), has so far helped me to familiarize myself with the nooks and crannies of the NEON instruction set including alignment and latency issues. This kind of in-depth knowledge is vital for FFTW integration, and particularly for squeezing every last drop of performance out of the NEON co-processor. I changed all of my load and store operations to use the VLDn and VSTn instructions, and have so far written 1 of 3 test cases that will be benchmarked for speed, as was the case for my previous demo.

Before I divulge any details, I should say something about complex multiplication. A typical strategy for multiplying two complex numbers is to use the FOIL method. The acronym stands for First-Outside Inside-Last. So given a complex number x = a + jb, and y = c + jd (where j = sqrt(-1)), FOIL-multiplication would resemble the following:

x * y = (ac - ad) + j(bc + bd) = (F - O) + j(I + L)

This can be broken down into the following (rough) pseudo-code:

MUL(reg0,a,c)

MACS(reg0,a,d)

MUL(reg1,b,c)

MAC(reg1,b,d)

Where MUL is floating-point multiplication, MACS is multiply and accumulate (subtractively), and MAC is multiply and accumulate (additively). Clear, right?

For longer arrays of complex numbers (e.g. cf32_t x[256], cf32_t y[256], cf32_t z[256], cf32_mul(z,x,y)), there are a few other considerations to take into account.

1) SIMD Operations

This is the most obvious speedup. If one can compute N products simultaneously, then there is an inherent speedup of N. The structure of the NEON pipeline allows for 2 x 2 simultaneous multiplications within the time of a single ARM instruction, so effectively we can make 4 parallel float multiplications with a single ARM instruction.

2) Memory Bottlenecks

Most ARM devices have a fairly low memory bus clock frequency. That is basically the same thing as the front-side-bus for the x86 architecture. What that means, is that loading and storing variables from RAM is an even more expensive operation on the ARM architecture than it is on x86. As a reference, newer x86 chips support bus frequencies of over 1.3 GHz, whereas a typical bus frequency for ARM devices is in the order of 100 times slower.

To get around this issue, operands should be loaded and stored using vector instructions. As shown in my previous demo, that saves a significant amount of time.

3) Pipeline Activity

NEON supports concurrent memory loads and stores. Therefore, the best (fastest) performance can be achieved by keeping the NEON arithmetic pipeline busy, while concurrently loading and storing operands and products. In assembler, this would mean that arithmetic instructions should be mixed with load and store instructions.

4) Co-processor Latency

When switching between ARM load / store operations and NEON load / store operations, a latency penalty is incurred in order to avoid memory hazards, as pointed out here. Thus, ARM loads and stores should be avoided at all costs while doing lengthy NEON operations.

What Next?

As I already mentioned, the current demo app that I'm working on is intended to be educational about the NEON instruction set. However, I will also be comparing 4 different methods for multiplying long arrays of complex floats.

i) Libc - in no way optimized

ii) FFF...OOO...III...LLL (this operates length-wise along the array)

iii) same as above but with intermixed load and store operations

iv) FOILFOILFOIL... (this completes complex multiplications one at a time)

Method (i) will obviously take the longest time to execute. Method (ii) will be an improvement over method (i). Method (iii) should perform even faster than Method (ii) Method (iv) will use intermixed loading and storing, and will also use the MAC / MACS features of NEON.

I expect that methods (iii) and (iv) will be the fastest, and should have some benchmarks to show shortly.

As For FFTW

My FFTW work thus far, has included adding some autoconf work, and a few headers in simd/.

The configure.ac now includes autoconf macros for an '--enable-neon' flag. The flag can be set when cross-compiling and auto-detected for (native arm compilation). I borrowed the same NEON test that ffmpeg uses in its configure script, so it was fairly straight-forward. Subsequently, a HAVE_NEON preprocessor variable is either set or not-set in config.h .

As far as integrating code, it's been a bit more challenging. Since the SSE SIMD optimizations are also only available for single-precision, I decided to start with those and retrofit the NEON optimizations in. The SSE headers use a lot of x86 intrinsics in GCC. Although NEON intrinsics also exist, many have complained that they produce questionable code, so I'll probably tend to use .S files. Hopefully all goes well.

Optimizing FFTW for NEON-Enabled ARM Devices

Monday, June 7, 2010

Weekly Report #3

1 comment:

Post a Comment

Followers

Blog Archive

About Me