Below is an example of how the VMUL and VZMUL inline functions appear in the disassembled .so file when using intrinsics. They shouldn't actually appear at all as separate sections. The best way to eliminate all of those useless branches, pushes and pops is to rewrite each inline function in simd-neon.h with inline asm statements.
000dc268 :
vmul.f32 q0, q0, q1
bx lr
...
...
00149cf0 :
vorr q8, q0, q0
vtrn.32 q0, q8
push {lr} ; (str lr, [sp, #-4]!)
vpush {d8-d13}
sub sp, sp, #68 ; 0x44
vstr d16, [sp, #16]
vstr d17, [sp, #24]
vstmia sp, {d0-d1}
mov lr, sp
add ip, sp, #32 ; 0x20
ldm lr!, {r0, r1, r2, r3}
vorr q6, q1, q1
stmia ip!, {r0, r1, r2, r3}
vldr d0, [sp, #32]
vldr d1, [sp, #40]
ldm lr, {r0, r1, r2, r3}
stm ip, {r0, r1, r2, r3}
vldr d8, [sp, #48]
vldr d9, [sp, #56]
bl 149c30
vorr q5, q0, q0
vorr q0, q6, q6
bl 149c38
vorr q1, q0, q0
vorr q0, q4, q4
bl 149c30
vorr q1, q0, q0
vorr q0, q5, q5
bl 149ce8
add sp, sp, #68 ; 0x44
vpop {d8-d13}
pop {pc}
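To make the inline asm idea concrete, here is a minimal sketch of what a rewritten VMUL could look like. It is not the actual simd-neon.h code: the V typedef, the v_load / v_store / vmul_array helpers and the always_inline attribute are my own assumptions for illustration, and the non-ARM branch is only there so the snippet builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#if defined(__arm__) && defined(__ARM_NEON__)
#include <arm_neon.h>
typedef float32x4_t V;

/* Hypothetical inline-asm VMUL: a single vmul.f32, no call, no stack traffic.
 * "w" asks GCC for a NEON register; the %q modifier prints it as qN. */
static inline __attribute__((always_inline)) V VMUL(V a, V b)
{
    V r;
    __asm__("vmul.f32 %q0, %q1, %q2" : "=w"(r) : "w"(a), "w"(b));
    return r;
}

static inline V v_load(const float *p) { return vld1q_f32(p); }
static inline void v_store(float *p, V v) { vst1q_f32(p, v); }

#else
/* Portable stand-in so the sketch compiles and runs on non-ARM hosts. */
typedef struct { float f[4]; } V;

static inline V VMUL(V a, V b)
{
    V r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = a.f[i] * b.f[i];
    return r;
}

static inline V v_load(const float *p) { V v; memcpy(v.f, p, sizeof v.f); return v; }
static inline void v_store(float *p, V v) { memcpy(p, v.f, sizeof v.f); }
#endif

/* Hypothetical driver: out[i] = a[i] * b[i], four floats per step. */
void vmul_array(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0); /* the sketch only handles whole vectors */
    for (size_t i = 0; i < n; i += 4)
        v_store(out + i, VMUL(v_load(a + i), v_load(b + i)));
}
```

Note that the asm statement on its own does not force inlining; in this sketch that job is done by always_inline, which is why I added it. Whether that alone would be enough for VZMUL is something I have not tested.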
On a side note, I did discover that some as-yet-unidentified effect of the armv7 cycle counter was making my benchmarks hang, so it is currently configured out of my test libraries (--enable-armv7-cycle-counter is not set).
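For anyone unfamiliar with the counter, here is a rough sketch of how such a read typically looks. It assumes the counter in question is the Cortex-A8 PMU cycle count register (PMCCNTR), read through coprocessor 15; whether that matches exactly what --enable-armv7-cycle-counter compiles in is an assumption on my part. USE_PMCCNTR, read_ticks and ticks_between are hypothetical names, not FFTW options or functions. User-mode reads of that register trap unless the kernel has enabled them, so that is one thing worth ruling out while debugging.

```c
#include <stdint.h>
#include <time.h>

/* USE_PMCCNTR is a hypothetical opt-in macro (not an FFTW option).
 * It is off by default because the mrc below faults in user mode
 * unless the kernel has enabled user access to the PMU. */
static inline uint32_t read_ticks(void)
{
#if defined(__arm__) && defined(USE_PMCCNTR)
    uint32_t r;
    /* Read PMCCNTR: coprocessor 15, c9, c13, opcode 0. */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(r));
    return r;
#else
    /* Portable fallback with much coarser resolution. */
    return (uint32_t)clock();
#endif
}

/* The counter is 32 bits wide, so it wraps; unsigned subtraction
 * still gives the right difference across a single wrap. */
uint32_t ticks_between(uint32_t t0, uint32_t t1)
{
    return t1 - t0;
}
```

Typical use would be to call read_ticks() before and after the code being timed and pass both values to ticks_between().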
On another side note, all of my simulations seem to indicate that simd transforms are always out-of-place. This is probably a good thing in any case because out-of-place transformations tend to be faster, but it also might allow me to implement pointer/register auto-incrementing (with ldxxx {xx} [rx]!).
In pure C / NEON intrinsics, there is no way to specify load-store alignment (major speedups lost) or pointer auto-increment (more ARM instructions interleaved with the NEON work). Ideally, there would be only a few, seldom-taken conditional branches in ARM code, with most of the work done in the NEON coprocessor.
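As a rough illustration of both features, here is a sketch of an out-of-place element-wise multiply that uses 128-bit alignment hints and post-increment addressing in inline asm. The function name mul_f32_aligned is hypothetical and this is not FFTW code; I am also assuming 16-byte-aligned buffers and a length that is a multiple of four. The non-ARM branch is a plain C fallback so the snippet builds anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = a[i] * b[i], out-of-place.
 * Assumes all three pointers are 16-byte aligned and n % 4 == 0. */
void mul_f32_aligned(float *out, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    assert((((uintptr_t)out | (uintptr_t)a | (uintptr_t)b) & 15u) == 0);

#if defined(__arm__) && defined(__ARM_NEON__)
    for (; n != 0; n -= 4) {
        __asm__ volatile(
            /* ":128" is the alignment hint; the trailing "!" writes the
             * advanced address back, so no separate add is needed. */
            "vld1.32 {d0, d1}, [%[a], :128]!   \n\t"
            "vld1.32 {d2, d3}, [%[b], :128]!   \n\t"
            "vmul.f32 q0, q0, q1               \n\t"
            "vst1.32 {d0, d1}, [%[out], :128]! \n\t"
            : [a] "+r"(a), [b] "+r"(b), [out] "+r"(out)
            :
            : "q0", "q1", "memory");
    }
#else
    /* Portable fallback so the sketch runs on non-ARM hosts. */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];
#endif
}
```

Because the transform is out-of-place, the input and output pointers can each advance independently with writeback, which is what makes the auto-increment form convenient here.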
In case anyone would like to test the library out on their own, I'm configuring fftw3 with
CFLAGS="-Os -pipe -mcpu=cortex-a8 -mfloat-abi=softfp" ./configure --prefix=/usr --host=arm-none-linux-gnueabi --enable-float --with-pic --enable-neon --enable-shared
I would highly suggest adding -mfpu=neon to the CFLAGS above as well (for all code); otherwise configure.ac only adds -mfpu=neon to simd-specific compiles.
Although I did get quite a bit done this week, it's been slower than I would have liked for two reasons: 1) Canada Day (daycare is closed), and 2) my significant other was on the other side of the continent for an academic visit, and so I've been a single parent this week. Although I did get to spend lots of extra time with my son, which is always welcome, I think I lost about one or two hours a day getting to the daycare and back.
Plans for next week:
1) rewrite inline functions as inline asm instead of inline with intrinsics
2) speed-ups!
3) continue investigating codelet-free approaches (i.e. sacrificing the fftw methodology for speed)
4) fix cycle counter!