Need some constructive criticism on my SSE/Assembly attempt

Posted by Brett on Stack Overflow
Published on 2010-05-27T17:40:14Z

Hello, I'm working on converting a bit of code to SSE, and while it produces the correct output, it turns out to be slower than the standard C++ code.

The bit of code that I need to do this for is:

float ox = p2x - (px * c - py * s)*m;
float oy = p2y - (px * s - py * c)*m;
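
To have something to check the SSE result against, the two expressions can be wrapped in a small scalar function. This is only a sketch: `Offset` and `scalarCalc` are hypothetical names that are not part of the original code, and the arithmetic is copied from the two lines above without changes.

```cpp
// Scalar reference for the two expressions above.
// `Offset` and `scalarCalc` are hypothetical names used only for this sketch.
struct Offset {
    float ox;
    float oy;
};

inline Offset scalarCalc(float px, float py, float c, float s,
                         float m, float p2x, float p2y)
{
    Offset out;
    out.ox = p2x - (px * c - py * s) * m;  // same arithmetic as the ox line
    out.oy = p2y - (px * s - py * c) * m;  // same arithmetic as the oy line
    return out;
}
```

For example, `scalarCalc(1, 2, 3, 4, 0.5f, 10, 20)` gives `ox = 12.5` and `oy = 21`.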

What I've got for SSE code is:

void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
{
    vector4 r;
    __m128 scale = _mm_set1_ps(m);

    __asm
    {
        mov     eax,    p           // load the address of p into a CPU register
        mov     ebx,    sc          // load the address of sc
        movups  xmm0,   [eax]       // move the vectors into SSE registers
        movups  xmm1,   [ebx]

        mulps   xmm0,   xmm1        // multiply the elements

        movaps  xmm2,   xmm0        // make a copy of the products
        shufps  xmm2,   xmm0,  0x1B // reverse the element order of the copy

        subps   xmm0,   xmm2        // subtract the elements

        mulps   xmm0,   scale       // multiply the vector by the scale

        mov     ecx,    xy          // load the address of xy into a CPU register
        movups  xmm3,   [ecx]       // move the vector into an SSE register

        subps   xmm3,   xmm0        // xmm3 = xmm3 - xmm0

        movups  [r],    xmm3        // save the return vector, and use elements 0 and 3
    }
}

Since it's very difficult to read the code, I'll explain what I did:

load vector4,     xmm0 = p  = [px, py, px, py]
multiply by,      xmm1 = cs = [c,  c,  s,  s ]
------------------ multiply ------------------
result,           xmm0 = [px*c, py*c, px*s, py*s]

reuse result,     xmm0 = [px*c, py*c, px*s, py*s]
shuffled result,  xmm2 = [py*s, px*s, py*c, px*c]
------------------ subtract ------------------
result,           xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]

reuse result,     xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
load m vector4,   scale = [m, m, m, m]
------------------ multiply ------------------
result,           xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]

load xy vector4,  xmm3 = [p2x, p2x, p2y, p2y]
reuse,            xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
------------------ subtract ------------------
result,           xmm3 = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]

Then ox = xmm3[0] and oy = xmm3[3], so I essentially don't use xmm3[1] or xmm3[2].
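
For comparison, the same sequence of steps can be sketched with compiler intrinsics instead of inline assembly. This is only a sketch under assumptions: `Vec4` is a hypothetical 16-byte-aligned stand-in for `vector4` (whose definition isn't shown), `intrinsicsCalc` is a made-up name, and element 0 is taken to be the lowest lane of the register. It returns all four lanes so that each one can be checked against the scalar expressions.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical 16-byte-aligned stand-in for the post's vector4 type.
struct alignas(16) Vec4 {
    float v[4];
};

// Mirrors the assembly steps lane by lane and returns all four result lanes.
// Expected layouts: p = [px, py, px, py], cs = [c, c, s, s],
// xy = [p2x, p2x, p2y, p2y].
inline Vec4 intrinsicsCalc(const Vec4 &p, const Vec4 &cs, float m, const Vec4 &xy)
{
    // [px*c, py*c, px*s, py*s]
    __m128 prod = _mm_mul_ps(_mm_load_ps(p.v), _mm_load_ps(cs.v));
    // 0x1B reverses the lane order: [py*s, px*s, py*c, px*c]
    __m128 rev = _mm_shuffle_ps(prod, prod, 0x1B);
    // [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
    __m128 diff = _mm_sub_ps(prod, rev);
    // scale every lane by m
    __m128 scaled = _mm_mul_ps(diff, _mm_set1_ps(m));
    // subtract the scaled lanes from [p2x, p2x, p2y, p2y]
    __m128 res = _mm_sub_ps(_mm_load_ps(xy.v), scaled);

    Vec4 out;
    _mm_store_ps(out.v, res);
    return out;
}
```

With p = [1, 2, 1, 2], cs = [3, 3, 4, 4], m = 0.5 and xy = [10, 10, 20, 20], the lanes come out as [12.5, 9, 21, 17.5]. Under this lane ordering, lane 0 holds p2x-(px*c-py*s)*m and lane 2 holds p2y-(px*s-py*c)*m, while lane 3 holds p2y-(py*s-px*c)*m.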

I apologize for the difficulty reading this, but I'm hoping someone can give me some guidance: the standard C++ code runs in 0.001444 ms and the SSE code runs in 0.00198 ms.

Let me know if there is anything I can do to explain this further or clean it up. I'm trying to use SSE because I run this calculation millions of times, and it is part of what is slowing down my current code.
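
Since the calculation runs millions of times, one alternative arrangement is to compute four points per iteration rather than packing a single point into one register. The sketch below rests on assumptions that are not stated in the post: the points are stored as separate `px` and `py` arrays, `c`, `s`, `m`, `p2x` and `p2y` are the same for every point, and `n` is a multiple of 4. `batchCalc` is a hypothetical name. Whether this is actually faster would need to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical batch helper: processes four points per loop iteration.
// Assumes structure-of-arrays input and n being a multiple of 4.
inline void batchCalc(const float *px, const float *py,
                      float c, float s, float m, float p2x, float p2y,
                      float *ox, float *oy, std::size_t n)
{
    // Broadcast the per-call constants once, outside the loop.
    const __m128 vc   = _mm_set1_ps(c);
    const __m128 vs   = _mm_set1_ps(s);
    const __m128 vm   = _mm_set1_ps(m);
    const __m128 vp2x = _mm_set1_ps(p2x);
    const __m128 vp2y = _mm_set1_ps(p2y);

    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(px + i);  // four px values
        __m128 y = _mm_loadu_ps(py + i);  // four py values

        // px*c - py*s and px*s - py*c for four points at once
        __m128 rx = _mm_sub_ps(_mm_mul_ps(x, vc), _mm_mul_ps(y, vs));
        __m128 ry = _mm_sub_ps(_mm_mul_ps(x, vs), _mm_mul_ps(y, vc));

        // ox = p2x - rx*m, oy = p2y - ry*m
        _mm_storeu_ps(ox + i, _mm_sub_ps(vp2x, _mm_mul_ps(rx, vm)));
        _mm_storeu_ps(oy + i, _mm_sub_ps(vp2y, _mm_mul_ps(ry, vm)));
    }
}
```

In this layout every lane does useful work and no shuffle is needed, at the cost of rearranging the input data into separate arrays.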

Thanks in advance for any help! Brett

© Stack Overflow or respective owner
