In case you were wondering, the reason that this blog exists is to document my progress through the summer of 2010, working on a project funded through the Google Summer of Code. Specifically, the intention of this project is to improve performance of FFTW for NEON-enabled ARM processors.
What is FFTW, you ask? Well, to explain that, I should probably first direct interested readers to a few articles which cover the basics of the Fourier Transform (FT), the Discrete Fourier Transform (DFT), and the Fast Fourier Transform (FFT). What all of these transforms have in common is that they represent signals in the frequency domain as opposed to the time domain. What kind of signals? Well... essentially everything in nature that is measurable can be expressed as a signal. For example: music, speech, mechanical waves in a structure, voltages in an electrical circuit, radio waves, heart beats, earthquakes, brainwaves, and so on.
The DFT and FFT differ from the FT in that they operate on discrete (sampled) data, which is what a digital computer can process. Both convert the time-domain representation of a signal into a frequency-domain representation (often called a spectrum). Strictly speaking, the FFT is not a different transform but a family of algorithms for computing the DFT: evaluating the DFT directly from its definition takes on the order of N^2 operations for a signal of N samples, while an FFT produces the same result (up to floating-point rounding) in on the order of N log N operations. The FFT gets there by exploiting symmetries in the DFT to eliminate redundant calculations. The simplest FFT algorithms require the signal length to be a power of 2, although variants exist for other lengths.
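To make the DFT / FFT relationship a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and has nothing to do with how FFTW is implemented internally; the function names `dft_naive` and `fft_radix2` are made up for this example, and the FFT shown is the textbook recursive radix-2 variant, which only accepts power-of-2 lengths.

```python
import cmath


def dft_naive(x):
    """Evaluate the DFT directly from its definition: O(N^2) operations."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 FFT: O(N log N), length must be a power of 2."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed samples and transform each half.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # The same twiddled term serves two outputs; this reuse is the
        # redundancy the FFT removes from the direct DFT.
        w = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
    slow = dft_naive(signal)
    fast = fft_radix2(signal)
    # Largest difference between the two spectra (rounding error only).
    print(max(abs(a - b) for a, b in zip(slow, fast)))
```

Both functions return the same spectrum to within rounding error; the difference is only in how much work they do to get there. A tuned library goes far beyond this sketch, but the underlying idea of reusing partial results is the same.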
Finally, to answer the original question, FFTW is a finely tuned software library that leverages the hardware features of modern computers to perform the FFT even faster. This is typically achieved by performing several operations in parallel, or by reducing the number of accesses over the main memory bus. FFTW is probably best known for its use in MATLAB, GNU Octave, and GNU Radio, although it has certainly seen many other important applications.
To date, FFTW has mainly been optimized for x86 and PowerPC processors. However, as mobile devices become more advanced (reaching GHz clock speeds), we are seeing ARM processors move out of the traditional embedded market and into more general computing devices; the mobile processors of today are computing powerhouses compared to those of 10 years ago. ARM-powered devices are also now available in the laptop / netbook form factor that has traditionally been dominated by the x86 architecture. There has even been recent mention of multi-core ARM devices moving into the server market.
This is where NEON comes into play. The ARM Cortex-A8 was the first processor core to implement the Advanced SIMD instruction set (NEON), which can perform several floating-point operations (and even more integer operations) in parallel. Combined with vector instructions for loading and storing data, this makes a significant speedup possible over the generic code generated by GCC. Applying those optimizations to FFTW's codebase will enable the next generation of ARM devices (in both the handheld and netbook formats) to be used for scientific computing, or really any application that uses the DFT / FFT.
Well, that's it for my initial post. If you're interested in following the progress of this project, be sure to follow this blog. Alternatively, feel free to subscribe to the mailing list and monitor the git repository for new code.
3 comments:
Mmmh, Neon. Isn't it too slow?
Why not use fixed point or a DSP (OMAP3)?
Edgar
NEON is actually fairly fast - lots of today's video codecs are using it (e.g. ffmpeg, x264).
There's a pretty significant overhead / complexity in using the DSP, so for small 'vectors' NEON will probably be the best bet, but for large 'vectors' I do plan on exploring the DSP option - I think it is fixed point, but it should still be quite capable.
Glad to hear you rec'd your shipment okay!