Optimizing FFTW for NEON-Enabled ARM Devices: Weekly Report

Originally, I was quite happy to have achieved a 5.5x speedup in multiplying long vectors of complex floats, but this week I managed to increase that to between 11x and 18x. My changes were quite subtle, but I suspect the main problem was issuing vmul instructions after vmla, which the Cortex-A8 reference manual specifically notes as producing a 4-cycle stall. What's odd, though, is that the q-reg implementation generally performs slightly faster than the d-reg implementation. In any case, these speedups will most certainly make a major difference once implemented in FFTW, compared to the code generated from the generic C code.
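To make the ordering issue concrete, here is a minimal sketch of an element-wise complex multiply written with NEON intrinsics instead of hand-written assembly. It is not the code from the repositories: the function name cmul_vec and the interleaved (re, im) layout are assumptions made for illustration. The q-register path issues both plain multiplies before the multiply-accumulates, so that no vmul has to follow a vmla. Keep in mind that with intrinsics the compiler is free to reorder instructions, so this only shows the intended order, and a scalar fallback keeps the sketch portable to machines without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not taken from the gsoc2010-fftw-neon repositories.
 * Computes out[k] = a[k] * b[k] for n complex floats, each stored as an
 * interleaved (re, im) pair of floats. */
void cmul_vec(float *out, const float *a, const float *b, size_t n)
{
    size_t k = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* q-register path: four complex values per iteration. */
    for (; k + 4 <= n; k += 4) {
        /* vld2q de-interleaves: val[0] holds real parts, val[1] imaginary. */
        float32x4x2_t va = vld2q_f32(a + 2 * k);
        float32x4x2_t vb = vld2q_f32(b + 2 * k);
        float32x4x2_t vr;

        /* Both plain multiplies first ... */
        float32x4_t re = vmulq_f32(va.val[0], vb.val[0]); /* ar*br */
        float32x4_t im = vmulq_f32(va.val[0], vb.val[1]); /* ar*bi */

        /* ... then the multiply-accumulates, so no vmul follows a vmla. */
        re = vmlsq_f32(re, va.val[1], vb.val[1]);         /* - ai*bi */
        im = vmlaq_f32(im, va.val[1], vb.val[0]);         /* + ai*br */

        vr.val[0] = re;
        vr.val[1] = im;
        vst2q_f32(out + 2 * k, vr);                       /* re-interleave */
    }
#endif

    /* Scalar path: handles the tail, or everything when NEON is absent. */
    for (; k < n; ++k) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        out[2 * k]     = ar * br - ai * bi;
        out[2 * k + 1] = ar * bi + ai * br;
    }
}
```

Calling cmul_vec(out, a, b, n) with three arrays of 2*n floats fills out with the products. To see whether the ordering survived compilation, inspect the generated assembly, since the scheduling that matters here happens at the instruction level.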

In case you aren't watching them already, I have two repositories set up. The first is specifically for FFTW, and the second holds snippets of C code and assembly for benchmarking purposes.

http://gitorious.org/gsoc2010-fftw-neon

http://gitorious.org/gsoc2010-fftw-neon-misc

Please feel free to check out 'demo2' from misc and run it on your BeagleBoard to observe the speedups firsthand.

Monday, June 28, 2010

Weekly Report - Week 5
