Sunday, July 4, 2010

FFTW: Weekly Report - Week 7

Over the last week, I've spent a fair amount of time implementing the "standard" fftw simd interface for neon using intrinsics, and one thing is certainly clear: there is absolutely no throughput gained at all, as you can see in the graph below. And yes, I was definitely using neon code for fftw3n. These results were more or less expected (although maybe not so exactly), since I was really only using intrinsics for verification purposes, and obviously inline C functions with intrinsics produce some undesirable effects. As far as numerical accuracy goes, its identical to the non-simd version, which is good. The benchmark was made via benchfft for consistency, and you can grab a copy of it from my misc repository.






Below is an example of how the VMUL and VZMUL inline functions appear in the disassembled .so file when using intrinsics. They shouldn't actually appear at all as seperate sections. The best way to eliminate all of those useless branches, push's and pop's, is to rewrite each inline function in simd-neon.h with inline asm statements. 





000dc268 :
vmul.f32        q0, q0, q1
bx      lr
...
...
00149cf0 :
vorr    q8, q0, q0
vtrn.32 q0, q8
push    {lr}            ; (str lr, [sp, #-4]!)
vpush   {d8-d13}
sub     sp, sp, #68     ; 0x44
vstr    d16, [sp, #16]
vstr    d17, [sp, #24]
vstmia  sp, {d0-d1}
mov     lr, sp
add     ip, sp, #32     ; 0x20
ldm     lr!, {r0, r1, r2, r3}
vorr    q6, q1, q1
stmia   ip!, {r0, r1, r2, r3}
vldr    d0, [sp, #32]
vldr    d1, [sp, #40]
ldm     lr, {r0, r1, r2, r3}
stm     ip, {r0, r1, r2, r3}
vldr    d8, [sp, #48]
vldr    d9, [sp, #56]
bl      149c30
vorr    q5, q0, q0
vorr    q0, q6, q6
bl      149c38
vorr    q1, q0, q0
vorr    q0, q4, q4
bl      149c30
vorr    q1, q0, q0
vorr    q0, q5, q5
bl      149ce8
add     sp, sp, #68     ; 0x44
vpop    {d8-d13}
pop     {pc}





On a side note, I did discover that some as-of-yet unidentified effect of the armv7 cycle-counter was making my benchmarks hang, so that was currently configured out in my test libraries (--enable-armv7-cycle-counter is not set). 

On another side note, all of my simulations seem to indicate that simd transforms are always out-of-place. This is probably a good thing in any case because out-of-place transformations tend to be faster, but it also might allow me to implement pointer/register auto-incrementing (with ldxxx {xx} [rx]!) . 


In pure C / neon intrinsics, there is no way to specify load-store alignment (major speedups lost) or pointer auto-increment (more arm instruction syncopation). Ideally, there would only be a few seldom, conditional branches in arm code while most of the work was done in the neon coprocessor. 


In case anyone would like to test the library out on their own, I'm configuring fftw3 with

CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfloat-abi=softfp" ./configure --prefix=/usr --host=arm-none-linux-gnueabi --enable-float --with-pic --enable-neon --enable-shared

I would highly suggest adding -mfpu=neon to the cflags above as well (for all code), otherwise configure.ac only adds -mfpu=neon to simd-specific compiles. 

Although I did get quite a bit done this week, it's been slower than I would have liked for two reasons: 1) Canada Day (daycare is closed), and 2) my significant other was on the other side of the continent for an academic visit, and so I've been a single parent this week. Although I did get to spend lots of extra time with my son, which is always welcome, i think I lost about 1 or two hours a day getting to the daycare and back. 

Plans for next week: 

1) rewrite inline functions as inline asm instead of inline with intrinsics
2) speed-ups!
3) continue investigating codelet-free approaches (i.e. sacrificing the fftw methodology for speed)
4) fix cycle counter!

No comments:

Post a Comment