Optimizing FFTW for NEON-Enabled ARM Devices: FFTW: Weekly Report

Over the last week, I've spent a fair amount of time implementing the "standard" fftw simd interface for neon using intrinsics, and one thing is certainly clear: there is no throughput gain at all, as you can see in the graph below. And yes, I was definitely using neon code for fftw3n. These results were more or less expected (although maybe not quite so exactly), since I was really only using intrinsics for verification purposes, and inline C functions built on intrinsics evidently produce some undesirable effects. As far as numerical accuracy goes, it's identical to the non-simd version, which is good. The benchmark was made via benchfft for consistency, and you can grab a copy of it from my misc repository.

Below is an example of how the VMUL and VZMUL inline functions appear in the disassembled .so file when using intrinsics. They shouldn't actually appear as separate sections at all, since they are supposed to be inlined. The best way to eliminate all of those useless branches, pushes and pops is to rewrite each inline function in simd-neon.h with inline asm statements.

000dc268 :
    vmul.f32 q0, q0, q1
    bx lr

...

00149cf0 :
    vorr q8, q0, q0
    vtrn.32 q0, q8
    push {lr} ; (str lr, [sp, #-4]!)
    vpush {d8-d13}
    sub sp, sp, #68 ; 0x44
    vstr d16, [sp, #16]
    vstr d17, [sp, #24]
    vstmia sp, {d0-d1}
    mov lr, sp
    add ip, sp, #32 ; 0x20
    ldm lr!, {r0, r1, r2, r3}
    vorr q6, q1, q1
    stmia ip!, {r0, r1, r2, r3}
    vldr d0, [sp, #32]
    vldr d1, [sp, #40]
    ldm lr, {r0, r1, r2, r3}
    stm ip, {r0, r1, r2, r3}
    vldr d8, [sp, #48]
    vldr d9, [sp, #56]
    bl 149c30
    vorr q5, q0, q0
    vorr q0, q6, q6
    bl 149c38
    vorr q1, q0, q0
    vorr q0, q4, q4
    bl 149c30
    vorr q1, q0, q0
    vorr q0, q5, q5
    bl 149ce8
    add sp, sp, #68 ; 0x44
    vpop {d8-d13}
    pop {pc}
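To make the inline asm idea concrete, here is a minimal sketch of what a VMUL-style helper might look like when written that way. It is not the code from simd-neon.h: the names `vmul_sketch`, `vload_sketch`, `vstore_sketch` and `mul_arrays_sketch` are made up for illustration, the `%q` operand modifier and the `"w"` constraint are GCC conventions for 32-bit ARM, and the scalar fallback exists only so the file also builds on a non-ARM host.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__arm__) && (defined(__ARM_NEON__) || defined(__ARM_NEON))
#include <arm_neon.h>

typedef float32x4_t V;

/* Hypothetical stand-in for VMUL: a single vmul.f32 emitted via inline asm.
 * "w" asks GCC for a NEON register; %q prints it as a q-register name.
 * always_inline is an assumption about how to keep the call from being
 * emitted as a separate function. */
static inline __attribute__((always_inline)) V vmul_sketch(V a, V b)
{
    V r;
    __asm__("vmul.f32 %q0, %q1, %q2" : "=w"(r) : "w"(a), "w"(b));
    return r;
}

static inline V vload_sketch(const float *p) { return vld1q_f32(p); }
static inline void vstore_sketch(float *p, V v) { vst1q_f32(p, v); }

#else /* scalar fallback so the sketch also compiles off-target */

typedef struct { float f[4]; } V;

static inline V vmul_sketch(V a, V b)
{
    V r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] * b.f[i];
    return r;
}

static inline V vload_sketch(const float *p)
{
    V v;
    for (int i = 0; i < 4; ++i)
        v.f[i] = p[i];
    return v;
}

static inline void vstore_sketch(float *p, V v)
{
    for (int i = 0; i < 4; ++i)
        p[i] = v.f[i];
}

#endif

/* Multiply two float arrays element-wise, four lanes per iteration.
 * n must be a multiple of 4. */
void mul_arrays_sketch(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        vstore_sketch(out + i,
                      vmul_sketch(vload_sketch(a + i), vload_sketch(b + i)));
}
```

Whether a given compiler then inlines the helper without any branch, push or pop would still need to be confirmed by disassembling the result, as was done for the listing above.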

On a side note, I did discover that some as-yet-unidentified effect of the armv7 cycle counter was making my benchmarks hang, so it is currently configured out of my test libraries (--enable-armv7-cycle-counter is not set).

On another side note, all of my simulations seem to indicate that simd transforms are always out-of-place. This is probably a good thing in any case, because out-of-place transforms tend to be faster, but it also might allow me to implement pointer/register auto-incrementing (with ldxxx {xx} [rx]!).

In pure C / neon intrinsics, there is no way to specify load-store alignment (major speedups lost) or pointer auto-increment (so extra arm instructions end up interleaved with the neon ones). Ideally, there would be only a few, seldom-taken conditional branches in arm code, while most of the work is done in the neon coprocessor.
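As a rough illustration of the two features mentioned above, the sketch below uses inline asm for a load and a store that both carry a 128-bit alignment hint and post-increment their pointer. Everything here is an assumption made for illustration rather than fftw code: the helper names (`vload_inc_sketch`, `vstore_inc_sketch`, `vscale_sketch`, `scale_out_of_place_sketch`) are hypothetical, the `[rN,:128]!` addressing form is GNU assembler syntax for 32-bit ARM, and the scalar fallback is only there so the file also builds on a non-ARM host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__arm__) && (defined(__ARM_NEON__) || defined(__ARM_NEON))
#include <arm_neon.h>

typedef float32x4_t V;

/* Load 4 floats from a 16-byte-aligned address and advance *pp by 16 bytes. */
static inline __attribute__((always_inline)) V vload_inc_sketch(const float **pp)
{
    V v;
    const float *p = *pp;
    __asm__ volatile("vld1.32 {%q0}, [%1,:128]!"
                     : "=w"(v), "+r"(p) : : "memory");
    *pp = p;
    return v;
}

/* Store 4 floats to a 16-byte-aligned address and advance *pp by 16 bytes. */
static inline __attribute__((always_inline)) void vstore_inc_sketch(float **pp, V v)
{
    float *p = *pp;
    __asm__ volatile("vst1.32 {%q1}, [%0,:128]!"
                     : "+r"(p) : "w"(v) : "memory");
    *pp = p;
}

static inline V vscale_sketch(V v, float k) { return vmulq_n_f32(v, k); }

#else /* scalar fallback so the sketch also compiles off-target */

typedef struct { float f[4]; } V;

static inline V vload_inc_sketch(const float **pp)
{
    V v;
    for (int i = 0; i < 4; ++i)
        v.f[i] = (*pp)[i];
    *pp += 4; /* mimic the post-increment */
    return v;
}

static inline void vstore_inc_sketch(float **pp, V v)
{
    for (int i = 0; i < 4; ++i)
        (*pp)[i] = v.f[i];
    *pp += 4; /* mimic the post-increment */
}

static inline V vscale_sketch(V v, float k)
{
    for (int i = 0; i < 4; ++i)
        v.f[i] *= k;
    return v;
}

#endif

/* Out-of-place scaling: out[i] = k * in[i]. Both buffers must be
 * 16-byte aligned and n must be a multiple of 4. */
void scale_out_of_place_sketch(float *out, const float *in, size_t n, float k)
{
    assert(n % 4 == 0);
    assert((uintptr_t)in % 16 == 0 && (uintptr_t)out % 16 == 0);
    for (size_t i = 0; i < n; i += 4)
        vstore_inc_sketch(&out, vscale_sketch(vload_inc_sketch(&in), k));
}
```

Because the input and output buffers are distinct, each pointer can simply walk forward through its own buffer, which is what makes the post-increment form a natural fit for out-of-place transforms.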

In case anyone would like to test the library out on their own, I'm configuring fftw3 with

CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfloat-abi=softfp" ./configure --prefix=/usr --host=arm-none-linux-gnueabi --enable-float --with-pic --enable-neon --enable-shared

I would highly suggest adding -mfpu=neon to the CFLAGS above as well (for all code); otherwise configure.ac only adds -mfpu=neon to simd-specific compiles.

Although I did get quite a bit done this week, it's been slower than I would have liked for two reasons: 1) Canada Day (daycare is closed), and 2) my significant other was on the other side of the continent for an academic visit, so I've been a single parent this week. Although I did get to spend lots of extra time with my son, which is always welcome, I think I lost about one or two hours a day getting to the daycare and back.

Plans for next week:

1) rewrite inline functions as inline asm instead of inline with intrinsics
2) speed-ups!
3) continue investigating codelet-free approaches (i.e. sacrificing the fftw methodology for speed)
4) fix cycle counter!

Sunday, July 4, 2010

FFTW: Weekly Report - Week 7
