Optimizing FFTW for NEON-Enabled ARM Devices: asm

Tuesday, July 6, 2010

Results from Inline Asm / Codelets

I thought I would post the initial speed results achieved using the latest inline asm patch for neon-simd.h . The results are decent - a 2x mflop improvement in some cases - but I'm still aiming to get approximately a 10x speedup on average, which will require some pretty intensive assembly coding. The results below are ony for powers of two, just for a quick overview. Also, I'm trying to completely eliminate in-place transforms from the benchmark, since fftw always performs simd transforms out-of-place (as far as I can tell), and out-of-place transforms are generally faster anyway.

Monday, June 28, 2010

Weekly Report - Week 5

Originally, I was quite happy to have achieved a 5.5x speedup in mutliplying long vectors of complex floats, but this week, I managed to increase that to values varying between 11x and 18x. My changes were quite subtle, but I suspect the main problem was issuing vmul instructions after vmla, which is specifically noted to produce stalls of 4 cycles in the Cortex-A8 reference manual. What's odd though, is that the q-reg implementation seems to generally perform slightly faster than the d-reg implementation in most cases. In any case, these speedups are most certainly going to make a major difference when implemented in fftw compared to the code generated from the generic C code.

In case you aren't watching them already, I have two repositories set up. The first one is specifically for fftw, and the second is for snippets of c code and assembly, for benchmarking purposes.

http://gitorious.org/gsoc2010-fftw-neon

http://gitorious.org/gsoc2010-fftw-neon-misc

Please feel free to check out 'demo2' from misc and run it on your beagleboard to observe the speedups first hand.

Optimizing FFTW for NEON-Enabled ARM Devices

Tuesday, July 6, 2010

Results from Inline Asm / Codelets

Monday, June 28, 2010

Weekly Report - Week 5

Followers

Blog Archive

About Me