Calling fsincos instruction in LLVM slower than calling libc sin/cos functions?


Asked 13 years, 8 months ago

Modified 2 years, 10 months ago

Viewed 7k times

18

I am working on a language that is compiled with LLVM. Just for fun, I wanted to do some microbenchmarks. In one, I run some million sin / cos computations in a loop. In pseudocode, it looks like this:

    var x: Double = 0.0
    for (i

If I'm computing sin/cos using LLVM IR inline assembly in the form:

%sc = call { double, double } asm "fsincos", "={st(1)},={st},1,~{dirflag},~{fpsr},~{flags}" (double %"res") nounwind

this is faster than using fsin and fcos separately. However, it is slower than calling the llvm.sin.f64 and llvm.cos.f64 intrinsics separately, which compile to calls to the C math library functions, at least with the target settings I'm using (x86_64 with SSE enabled).
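For readers more at home in C than in LLVM IR, the inline-assembly call above corresponds roughly to the following GCC/Clang extended-asm sketch. The helper name `sincos_x87` is made up for illustration, and the `#else` branch is only a portable fallback so the snippet also builds on non-x86 targets. In the IR constraint string, `={st(1)}` and `={st}` are the two outputs and `1` ties the input to the second output, which is what `"=u"`, `"=t"` and `"0"` express here.

```c
#include <math.h>  /* only used by the portable fallback below */

/* Hypothetical helper: computes sin(x) and cos(x) with one fsincos
   instruction on x86, or with libm elsewhere. */
void sincos_x87(double x, double *sin_out, double *cos_out)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    double s, c;
    /* fsincos replaces st(0) with sin(x) and then pushes cos(x), so
       cos ends up in st(0) ("t") and sin in st(1) ("u").
       "0"(x) asks the compiler to load x into st(0) first. */
    __asm__("fsincos" : "=t"(c), "=u"(s) : "0"(x));
    *sin_out = s;
    *cos_out = c;
#else
    /* Fallback for targets without x87: plain libm calls. */
    *sin_out = sin(x);
    *cos_out = cos(x);
#endif
}
```

With SSE as the default floating-point unit on x86_64, the compiler has to move x from an XMM register to the x87 stack through memory and move both results back the same way, which is the store/load traffic visible around `fsincos` in the listings below.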

It seems LLVM inserts some conversions between single/double precision FP -- that might be the culprit. Why is that? Sorry, I'm a relative newbie at assembly:

        .globl main
        .align 16, 0x90
        .type main,@function
    main: # @main
        .cfi_startproc
    # BB#0: # %loopEntry1
        xorps %xmm0, %xmm0
        movl $-1, %eax
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2: # %then4
        # in Loop: Header=BB44_1 Depth=1
        movss %xmm0, -4(%rsp)
        flds -4(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl -16(%rsp)
        fstpl -24(%rsp)
        movsd -16(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm1
        movsd -24(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        cvtsd2ss %xmm0, %xmm0
        addss %xmm1, %xmm0
    .LBB44_1: # %loop2
        # =>This Inner Loop Header: Depth=1
        incl %eax
        cmpl $99999999, %eax # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3: # %break3
        cvttss2si %xmm0, %eax
        ret
    .Ltmp160:
        .size main, .Ltmp160-main
        .cfi_endproc

Same test with calls to llvm sin/cos intrinsics:

        .globl main
        .align 16, 0x90
        .type main,@function
    main: # @main
        .cfi_startproc
    # BB#0: # %loopEntry1
        pushq %rbx
    .Ltmp162:
        .cfi_def_cfa_offset 16
        subq $16, %rsp
    .Ltmp163:
        .cfi_def_cfa_offset 32
    .Ltmp164:
        .cfi_offset %rbx, -16
        xorps %xmm0, %xmm0
        movl $-1, %ebx
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2: # %then4
        # in Loop: Header=BB44_1 Depth=1
        movsd %xmm0, (%rsp) # 8-byte Spill
        callq cos
        mulsd %xmm0, %xmm0
        movsd %xmm0, 8(%rsp) # 8-byte Spill
        movsd (%rsp), %xmm0 # 8-byte Reload
        callq sin
        mulsd %xmm0, %xmm0
        addsd 8(%rsp), %xmm0 # 8-byte Folded Reload
    .LBB44_1: # %loop2
        # =>This Inner Loop Header: Depth=1
        incl %ebx
        cmpl $99999999, %ebx # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3: # %break3
        cvttsd2si %xmm0, %eax
        addq $16, %rsp
        popq %rbx
        ret
    .Ltmp165:
        .size main, .Ltmp165-main
        .cfi_endproc

Can you suggest what the ideal assembly would look like with fsincos? PS. Adding -enable-unsafe-fp-math to llc makes the conversions disappear and switches to doubles (fldl etc.), but the speed remains the same:

        .globl main
        .align 16, 0x90
        .type main,@function
    main: # @main
        .cfi_startproc
    # BB#0: # %loopEntry1
        xorps %xmm0, %xmm0
        movl $-1, %eax
        jmp .LBB44_1
        .align 16, 0x90
    .LBB44_2: # %then4
        # in Loop: Header=BB44_1 Depth=1
        movsd %xmm0, -8(%rsp)
        fldl -8(%rsp)
        #APP
        fsincos
        #NO_APP
        fstpl -24(%rsp)
        fstpl -16(%rsp)
        movsd -24(%rsp), %xmm1
        mulsd %xmm1, %xmm1
        movsd -16(%rsp), %xmm0
        mulsd %xmm0, %xmm0
        addsd %xmm1, %xmm0
    .LBB44_1: # %loop2
        # =>This Inner Loop Header: Depth=1
        incl %eax
        cmpl $99999999, %eax # imm = 0x5F5E0FF
        jle .LBB44_2
    # BB#3: # %break3
        cvttsd2si %xmm0, %eax
        ret
    .Ltmp160:
        .size main, .Ltmp160-main
        .cfi_endproc

Tags: assembly, llvm, inline-assembly, x87


edited Sep 18, 2012 at 21:31

asked Sep 18, 2012 at 21:18

Erkki Lindpere

617 reputation, 6 silver badges, 11 bronze badges

Hmm.. I think I'm starting to get it. fsin/fcos/fsincos use x87 registers and mulsd/addsd use MMX / SSE. So the overhead probably comes from moving the data between them?

Erkki Lindpere

Commented Sep 18, 2012 at 21:38

No, cvtsd2ss is a conversion from double to float. But stay away from the legacy coprocessor (x87) instructions; they are slower and less precise than library routines nowadays. See for instance gcc.gnu.org/ml/gcc/2012-02/msg00188.html

Gunther Piez

Commented Sep 18, 2012 at 22:00

And yes, there is additional overhead from moving, but it doesn't amount to much compared to the 200-300 cycles fsincos uses.

Gunther Piez

Commented Sep 18, 2012 at 22:01

Thanks, I guess I'll stick with the llvm sin/cos intrinsics then.

Erkki Lindpere

Commented Sep 18, 2012 at 22:10

2 Answers
