Search Results

Search found 138 results on 6 pages for 'sse'.


Adding SSE support in Java EE 8

- by delabassee

SSE (Server-Sent Events) is a standard mechanism used to push server notifications to clients over HTTP. SSE is often compared to WebSocket, as both are supported in HTML5 and both give the server a way to push information to its clients, but they are different too! See here for some of the pros and cons of using one or the other. For REST applications, SSE can be quite complementary, as it offers an effective solution for a one-way publish-subscribe model, i.e. a REST client can 'subscribe' and get SSE-based notifications from a REST endpoint. As a matter of fact, Jersey (the JAX-RS Reference Implementation) has supported SSE for quite some time (see the Jersey documentation for more details). There might also be cases where one wants to use SSE directly from the Servlet API. Sending SSE notifications using the Servlet API is relatively straightforward. To give you an idea, check here for 2 SSE examples based on the Servlet 3.1 API.

We are thinking about adding SSE support in Java EE 8, but the question is where, as there are several places in the platform where SSE could potentially be supported:

- the Servlet API
- the WebSocket API
- JAX-RS
- or even a dedicated SSE API, and thus a dedicated JSR too!

Santiago Pericas-Geertsen (JAX-RS Co-Spec Lead) conducted an initial investigation around that question. You can find the arguments for the different options and Santiago's findings here. So at this stage JAX-RS seems to be a good choice to support SSE in Java EE. This will obviously be discussed in the respective JCP Expert Groups, but what is your opinion on this question?

Need some constructive criticism on my SSE/Assembly attempt

- by Brett

Hello, I'm working on converting a bit of code to SSE, and while I have the correct output, it turns out to be slower than standard C++ code. The bit of code that I need to do this for is:

    float ox = p2x - (px * c - py * s)*m;
    float oy = p2y - (px * s - py * c)*m;

What I've got for SSE code is:

    void assemblycalc(vector4 &p, vector4 &sc, float &m, vector4 &xy)
    {
        vector4 r;
        __m128 scale = _mm_set1_ps(m);
        __asm
        {
            mov eax, p              //Load into CPU reg
            mov ebx, sc
            movups xmm0, [eax]      //move vectors to SSE regs
            movups xmm1, [ebx]
            mulps xmm0, xmm1        //Multiply the Elements
            movaps xmm2, xmm0       //make a copy of the array
            shufps xmm2, xmm0, 0x1B //shuffle the array
            subps xmm0, xmm2        //subtract the elements
            mulps xmm0, scale       //multiply the vector by the scale
            mov ecx, xy             //load the variable into cpu reg
            movups xmm3, [ecx]      //move the vector to the SSE regs
            subps xmm3, xmm0        //subtract xmm3 - xmm0
            movups [r], xmm3        //Save the return vector, and use elements 0 and 3
        }
    }

Since it's very difficult to read the code, I'll explain what I did:

    loaded vector4,    xmm0 p  = [px , py , px , py ]
    mult. by vector4,  xmm1 cs = [c , c , s , s ]
    ------------------ mult ------------------
    result,            xmm0 = [px*c, py*c, px*s, py*s]

    reuse result,      xmm0 = [px*c, py*c, px*s, py*s]
    shuffle result,    xmm2 = [py*s, px*s, py*c, px*c]
    ------------------ subtract --------------
    result,            xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]

    reuse result,      xmm0 = [px*c-py*s, py*c-px*s, px*s-py*c, py*s-px*c]
    load m vector4,    scale = [m, m, m, m]
    ------------------ mult ------------------
    result,            xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]

    load xy vector4,   xmm3 = [p2x, p2x, p2y, p2y]
    reuse,             xmm0 = [(px*c-py*s)*m, (py*c-px*s)*m, (px*s-py*c)*m, (py*s-px*c)*m]
    ------------------ subtract --------------
    result,            xmm3 = [p2x-(px*c-py*s)*m, p2x-(py*c-px*s)*m, p2y-(px*s-py*c)*m, p2y-(py*s-px*c)*m]

Then ox = xmm3[0] and oy = xmm3[3], so I essentially don't use xmm3[1] or xmm3[4].

I apologize for the difficulty reading this, but I'm hoping someone might be able to provide some guidance, as the standard C++ code runs in 0.001444ms and the SSE code runs in 0.00198ms. Let me know if there is anything I can do to further explain or clean this up a bit. The reason I'm trying to use SSE is that I run this calculation millions of times, and it is part of what is slowing down my current code. Thanks in advance for any help! Brett
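Since the calculation is said to run millions of times, one option worth measuring is to process four independent points per call instead of spreading one point across a register. The following is only a minimal sketch under that assumption: `transform4` and its structure-of-arrays parameters are hypothetical names that do not appear in the post, and whether this is actually faster depends on the surrounding data layout.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper (not from the post). For i in 0..3 it computes
//   ox[i] = p2x[i] - (px[i]*c - py[i]*s) * m
//   oy[i] = p2y[i] - (px[i]*s - py[i]*c) * m
// which is the scalar formula quoted in the question, applied to four points.
// Every pointer must reference at least four floats; alignment is not required.
void transform4(const float* px, const float* py,
                const float* p2x, const float* p2y,
                float c, float s, float m,
                float* ox, float* oy)
{
    const __m128 vc = _mm_set1_ps(c);   // broadcast the scalars once
    const __m128 vs = _mm_set1_ps(s);
    const __m128 vm = _mm_set1_ps(m);

    const __m128 x = _mm_loadu_ps(px);  // four x coordinates
    const __m128 y = _mm_loadu_ps(py);  // four y coordinates

    // (px*c - py*s) and (px*s - py*c), four lanes at a time
    const __m128 tx = _mm_sub_ps(_mm_mul_ps(x, vc), _mm_mul_ps(y, vs));
    const __m128 ty = _mm_sub_ps(_mm_mul_ps(x, vs), _mm_mul_ps(y, vc));

    // p2 - t*m, stored straight to the output arrays
    _mm_storeu_ps(ox, _mm_sub_ps(_mm_loadu_ps(p2x), _mm_mul_ps(tx, vm)));
    _mm_storeu_ps(oy, _mm_sub_ps(_mm_loadu_ps(p2y), _mm_mul_ps(ty, vm)));
}
```

With this layout every lane does useful work, whereas the version in the post computes four results and keeps two.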

Combining prefixes in SSE

- by Nathan Fellman

In SSE, the prefixes 066h (operand size override), 0F2h (REPNE) and 0F3h (REPE) are part of the opcode. In non-SSE code, 066h switches between 32-bit (or 64-bit) and 16-bit operation, while 0F2h and 0F3h are used for string operations. They can be combined so that 066h and 0F2h (or 0F3h) are used in the same instruction, because this is meaningful. What is the behavior in an SSE instruction? For instance, we have (ignoring mod/rm for now):

    0f 58    -- addps
    66 0f 58 -- addpd
    f2 0f 58 -- addsd
    f3 0f 58 -- addss

But what is this?

    66 f2 0f 58

And how about this?

    f2 66 0f 58

Not to mention the following, which has two conflicting REP prefixes:

    f2 f3 0f 58

What is the spec for these?

error A2070: invalid instruction operands IN SSE MASM64

- by Green

When compiling this with ml64.exe (64-bit MASM), the SSE instructions give me an error. What do I need to do to use SSE instructions in 64-bit code?

    .code
    test PROC
        movlps [rdx], xmm7      ;;error A2070: invalid instruction operands
        ;//Inc in vec ptr
        add rsi, 16
        movhlps xmm6, xmm7
        movss [rdx+8], xmm6     ;;error A2070: invalid instruction operands
        ret
    test ENDP
    end

I get the error:

    1>Performing Custom Build Step
    1> Assembling: extasm.asm
    1>extasm.asm(6) : error A2070: invalid instruction operands
    1>extasm.asm(10) : error A2070: invalid instruction operands
    1>Microsoft (R) Macro Assembler (x64) Version 8.00.50727.215
    1>Copyright (C) Microsoft Corporation. All rights reserved.
    1>Project : error PRJ0019: A tool returned an error code from "Performing Custom Build Step"

How to rotate an SSE/AVX vector

- by user1584773

I need to perform a rotate operation in as few clock cycles as possible. In the first case, let's assume __m128i as the source and dest type:

    source: || A0 || A1 || A2 || A3 ||
    dest  : || A1 || A2 || A3 || A0 ||

    dest = (__m128i)_mm_shuffle_epi32((__m128i)source, _MM_SHUFFLE(0,3,2,1));

Now I want to do the same with AVX intrinsics. So let's assume this time __m256i as the source and dest type:

    source: || A0 || A1 || A2 || A3 || A4 || A5 || A6 || A7 ||
    dest  : || A1 || A2 || A3 || A4 || A5 || A6 || A7 || A0 ||

The AVX intrinsics are missing most of the corresponding SSE integer operations. Maybe there is some way to get the desired output working with the floating-point version. I've tried with:

    dest = (__m256i)_mm256_shuffle_ps((__m256)source, (__m256)source, _MM_SHUFFLE(0,3,2,1));

but what I get is:

    || A0 || A2 || A3 || A4 || A5 || A6 || A7 || A1 ||

Any idea on how to solve this in an efficient way (without mixing SSE and AVX operations and without "manually" inverting A0 and A1)? Thanks in advance!
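The 128-bit case described above can be checked in isolation. This is a minimal sketch that covers only the SSE2 rotate from the post; `rotate4` is a hypothetical wrapper name. It does not solve the 256-bit case. For that, the AVX2 intrinsic `_mm256_permutevar8x32_epi32` may be worth a look, but it needs AVX2 hardware and is not shown here.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also provides _MM_SHUFFLE)
#include <cstdint>

// Hypothetical wrapper: rotates four 32-bit lanes left by one position,
// i.e. {A0, A1, A2, A3} becomes {A1, A2, A3, A0}.
void rotate4(const std::int32_t in[4], std::int32_t out[4])
{
    const __m128i src = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // _MM_SHUFFLE(0,3,2,1): lane0 <- src[1], lane1 <- src[2],
    //                       lane2 <- src[3], lane3 <- src[0]
    const __m128i dst = _mm_shuffle_epi32(src, _MM_SHUFFLE(0, 3, 2, 1));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), dst);
}
```

Calling `rotate4` on `{10, 11, 12, 13}` yields `{11, 12, 13, 10}`.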

128bit hash comparison with SSE

- by fokenrute

Hi, in my current project I have to compare 128-bit values (actually MD5 hashes), and I thought it would be possible to accelerate the comparison by using SSE instructions. My problem is that I can't find good documentation on SSE instructions; I'm searching for a 128-bit integer comparison instruction that lets me know whether one hash is larger than, smaller than, or equal to another. Does such an instruction exist? PS: The targeted machines are x86_64 servers with SSE2 instructions; I'm also interested in a NEON instruction for the same job.
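As far as I know, SSE2 has no single instruction that orders two 128-bit integers, but equality can be tested with a byte-wise compare plus a mask. The sketch below is one possible approach, not a definitive answer: `hash_equal` and `hash_compare` are hypothetical names, and the ordering it falls back to is plain byte-wise lexicographic order via `memcmp`, which is an assumption about what "larger" should mean for a hash.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstring>      // std::memcmp

// Hypothetical helper: true when the two 16-byte hashes are identical.
bool hash_equal(const unsigned char* a, const unsigned char* b)
{
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // cmpeq sets each byte to 0xFF where equal; movemask collects one bit per byte.
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}

// Hypothetical helper: returns -1, 0 or 1. Uses the SSE2 test for the equal
// case and byte-wise lexicographic order (an assumed convention) otherwise.
int hash_compare(const unsigned char* a, const unsigned char* b)
{
    if (hash_equal(a, b)) {
        return 0;
    }
    return std::memcmp(a, b, 16) < 0 ? -1 : 1;
}
```

Whether this beats a plain `memcmp` is something to benchmark; for 16 bytes the difference may be small.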

SSE SIMD Optimization For Loop

- by Projectile Fish

I have some code in a loop:

    for(int i = 0; i < n; i++)
    {
        u[i] = c * u[i] + s * b[i];
    }

So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?
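The loop above has no dependency between iterations, so it maps directly onto packed SSE operations. Here is a minimal sketch, assuming `u` and `b` are `float` arrays that do not overlap; `scale_add` is a hypothetical name, and unaligned loads are used so that no alignment assumption is needed.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Hypothetical helper: u[i] = c * u[i] + s * b[i] for i in [0, n).
void scale_add(float* u, const float* b, float c, float s, std::size_t n)
{
    const __m128 vc = _mm_set1_ps(c);
    const __m128 vs = _mm_set1_ps(s);

    std::size_t i = 0;
    // Main loop: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 vu = _mm_loadu_ps(u + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(u + i, _mm_add_ps(_mm_mul_ps(vc, vu), _mm_mul_ps(vs, vb)));
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < n; ++i) {
        u[i] = c * u[i] + s * b[i];
    }
}
```

An optimizing compiler may already auto-vectorize the original loop, so it is worth comparing against the plain version before keeping hand-written intrinsics.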

Fast double -> short conversion with clamping using SSE?

- by gct

Is there a fast way to cast double values to shorts (16 bits, signed)? Currently I'm doing something like this:

    double dval = <sum junk>
    int16_t sval;
    if (dval > int16_max) {
        sval = int16_max;
    } else if (dval < int16_min) {
        sval = int16_min;
    } else
        sval = (int16_t)dval;

I suspect there's a fast way to do this using SSE that will be significantly more efficient.
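One way to express the clamp-and-convert with SSE2 is to clamp in the double domain first and then truncate. The sketch below is written under that assumption and has not been benchmarked: `clamp_to_int16` is a hypothetical name, it handles two doubles per iteration, and it truncates toward zero like the C cast in the post.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>
#include <cstring>      // std::memcpy

// Hypothetical helper: converts n doubles to int16_t, clamping to [-32768, 32767].
void clamp_to_int16(const double* in, std::int16_t* out, std::size_t n)
{
    const __m128d lo = _mm_set1_pd(-32768.0);
    const __m128d hi = _mm_set1_pd(32767.0);

    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(in + i);
        v = _mm_min_pd(_mm_max_pd(v, lo), hi);             // clamp while still double
        const __m128i i32 = _mm_cvttpd_epi32(v);           // truncate to two int32
        const __m128i i16 = _mm_packs_epi32(i32, i32);     // narrow to int16 lanes
        const std::int32_t pair = _mm_cvtsi128_si32(i16);  // low 32 bits = two int16
        std::memcpy(out + i, &pair, sizeof(pair));         // x86 is little-endian
    }
    // Scalar tail for an odd element count.
    for (; i < n; ++i) {
        double v = in[i];
        if (v > 32767.0) v = 32767.0;
        else if (v < -32768.0) v = -32768.0;
        out[i] = static_cast<std::int16_t>(v);
    }
}
```

Clamping before the conversion matters: values outside the int32 range would otherwise convert to the "integer indefinite" value and end up at the wrong end of the range.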

help me improve my sse yuv to rgb ssse3 code

- by David McPaul

Hello, I am looking to optimise some SSE code I wrote for converting YUV to RGB (both planar and packed YUV functions). I am using SSSE3 at the moment, but if there are useful functions from later SSE versions, that's OK. I am mainly interested in how I would work out processor stalls and the like. Does anyone know of any tools that do static analysis of SSE code?

    ;
    ; Copyright (C) 2009-2010 David McPaul
    ;
    ; All rights reserved. Distributed under the terms of the MIT License.
    ;

    ; A rather unoptimised set of ssse3 yuv to rgb converters
    ; does 8 pixels per loop

    ; inputer:
    ; reads 128 bits of yuv 8 bit data and puts
    ; the y values converted to 16 bit in xmm0
    ; the u values converted to 16 bit and duplicated into xmm1
    ; the v values converted to 16 bit and duplicated into xmm2

    ; conversion:
    ; does the yuv to rgb conversion using 16 bit integer and the
    ; results are placed into the following registers as 8 bit clamped values
    ; r values in xmm3
    ; g values in xmm4
    ; b values in xmm5

    ; outputer:
    ; writes out the rgba pixels as 8 bit values with 0 for alpha

    ; xmm6 used for scratch
    ; xmm7 used for scratch

    %macro cglobal 1
        global _%1
        %define %1 _%1
        align 16
    %1:
    %endmacro

    ; conversion code
    %macro yuv2rgbsse2 0
    ; u = u - 128
    ; v = v - 128
    ; r = y + v + v >> 2 + v >> 3 + v >> 5
    ; g = y - (u >> 2 + u >> 4 + u >> 5) - (v >> 1 + v >> 3 + v >> 4 + v >> 5)
    ; b = y + u + u >> 1 + u >> 2 + u >> 6
    ; subtract 16 from y
        movdqa xmm7, [Const16]      ; loads a constant using data cache (slower on first fetch but then cached)
        psubsw xmm0,xmm7            ; y = y - 16
    ; subtract 128 from u and v
        movdqa xmm7, [Const128]     ; loads a constant using data cache (slower on first fetch but then cached)
        psubsw xmm1,xmm7            ; u = u - 128
        psubsw xmm2,xmm7            ; v = v - 128
    ; load r,b with y
        movdqa xmm3,xmm0            ; r = y
        pshufd xmm5,xmm0, 0xE4      ; b = y

    ; r = y + v + v >> 2 + v >> 3 + v >> 5
        paddsw xmm3, xmm2           ; add v to r
        movdqa xmm7, xmm1           ; move u to scratch
        pshufd xmm6, xmm2, 0xE4     ; move v to scratch

        psraw xmm6,2                ; divide v by 4
        paddsw xmm3, xmm6           ; and add to r
        psraw xmm6,1                ; divide v by 2
        paddsw xmm3, xmm6           ; and add to r
        psraw xmm6,2                ; divide v by 4
        paddsw xmm3, xmm6           ; and add to r

    ; b = y + u + u >> 1 + u >> 2 + u >> 6
        paddsw xmm5, xmm1           ; add u to b
        psraw xmm7,1                ; divide u by 2
        paddsw xmm5, xmm7           ; and add to b
        psraw xmm7,1                ; divide u by 2
        paddsw xmm5, xmm7           ; and add to b
        psraw xmm7,4                ; divide u by 32
        paddsw xmm5, xmm7           ; and add to b

    ; g = y - u >> 2 - u >> 4 - u >> 5 - v >> 1 - v >> 3 - v >> 4 - v >> 5
        movdqa xmm7,xmm2            ; move v to scratch
        pshufd xmm6,xmm1, 0xE4      ; move u to scratch
        movdqa xmm4,xmm0            ; g = y

        psraw xmm6,2                ; divide u by 4
        psubsw xmm4,xmm6            ; subtract from g
        psraw xmm6,2                ; divide u by 4
        psubsw xmm4,xmm6            ; subtract from g
        psraw xmm6,1                ; divide u by 2
        psubsw xmm4,xmm6            ; subtract from g

        psraw xmm7,1                ; divide v by 2
        psubsw xmm4,xmm7            ; subtract from g
        psraw xmm7,2                ; divide v by 4
        psubsw xmm4,xmm7            ; subtract from g
        psraw xmm7,1                ; divide v by 2
        psubsw xmm4,xmm7            ; subtract from g
        psraw xmm7,1                ; divide v by 2
        psubsw xmm4,xmm7            ; subtract from g
    %endmacro

    ; outputer
    %macro rgba32sse2output 0
    ; clamp values
        pxor xmm7,xmm7
        packuswb xmm3,xmm7          ; clamp to 0,255 and pack R to 8 bit per pixel
        packuswb xmm4,xmm7          ; clamp to 0,255 and pack G to 8 bit per pixel
        packuswb xmm5,xmm7          ; clamp to 0,255 and pack B to 8 bit per pixel
    ; convert to bgra32 packed
        punpcklbw xmm5,xmm4         ; bgbgbgbgbgbgbgbg
        movdqa xmm0, xmm5           ; save bg values
        punpcklbw xmm3,xmm7         ; r0r0r0r0r0r0r0r0
        punpcklwd xmm5,xmm3         ; lower half bgr0bgr0bgr0bgr0
        punpckhwd xmm0,xmm3         ; upper half bgr0bgr0bgr0bgr0
    ; write to output ptr
        movntdq [edi], xmm5         ; output first 4 pixels bypassing cache
        movntdq [edi+16], xmm0      ; output second 4 pixels bypassing cache
    %endmacro

    SECTION .data align=16

    Const16  dw 16, 16, 16, 16, 16, 16, 16, 16
    Const128 dw 128, 128, 128, 128, 128, 128, 128, 128
    UMask    db 0x01, 0x80, 0x01, 0x80, 0x05, 0x80, 0x05, 0x80, 0x09, 0x80, 0x09, 0x80, 0x0d, 0x80, 0x0d, 0x80
    VMask    db 0x03, 0x80, 0x03, 0x80, 0x07, 0x80, 0x07, 0x80, 0x0b, 0x80, 0x0b, 0x80, 0x0f, 0x80, 0x0f, 0x80
    YMask    db 0x00, 0x80, 0x02, 0x80, 0x04, 0x80, 0x06, 0x80, 0x08, 0x80, 0x0a, 0x80, 0x0c, 0x80, 0x0e, 0x80

    ; void Convert_YUV422_RGBA32_SSSE3(void *fromPtr, void *toPtr, int width)
    width    equ ebp+16
    toPtr    equ ebp+12
    fromPtr  equ ebp+8

    ; void Convert_YUV420P_RGBA32_SSSE3(void *fromYPtr, void *fromUPtr, void *fromVPtr, void *toPtr, int width)
    width1   equ ebp+24
    toPtr1   equ ebp+20
    fromVPtr equ ebp+16
    fromUPtr equ ebp+12
    fromYPtr equ ebp+8

    SECTION .text align=16

    cglobal Convert_YUV422_RGBA32_SSSE3
    ; reserve variables
        push ebp
        mov ebp, esp
        push edi
        push esi
        push ecx

        mov esi, [fromPtr]
        mov edi, [toPtr]
        mov ecx, [width]
    ; loop width / 8 times
        shr ecx,3
        test ecx,ecx
        jng ENDLOOP
    REPEATLOOP:                     ; loop over width / 8
    ; YUV422 packed inputer
        movdqa xmm0, [esi]          ; should have yuyv yuyv yuyv yuyv
        pshufd xmm1, xmm0, 0xE4     ; copy to xmm1
        movdqa xmm2, xmm0           ; copy to xmm2
    ; extract both y giving y0y0
        pshufb xmm0, [YMask]
    ; extract u and duplicate so each u in yuyv becomes u0u0
        pshufb xmm1, [UMask]
    ; extract v and duplicate so each v in yuyv becomes v0v0
        pshufb xmm2, [VMask]

    yuv2rgbsse2

    rgba32sse2output

    ; endloop
        add edi,32
        add esi,16
        sub ecx, 1                  ; apparently sub is better than dec
        jnz REPEATLOOP
    ENDLOOP:
    ; Cleanup
        pop ecx
        pop esi
        pop edi
        mov esp, ebp
        pop ebp
        ret

    cglobal Convert_YUV420P_RGBA32_SSSE3
    ; reserve variables
        push ebp
        mov ebp, esp
        push edi
        push esi
        push ecx
        push eax
        push ebx

        mov esi, [fromYPtr]
        mov eax, [fromUPtr]
        mov ebx, [fromVPtr]
        mov edi, [toPtr1]
        mov ecx, [width1]
    ; loop width / 8 times
        shr ecx,3
        test ecx,ecx
        jng ENDLOOP1
    REPEATLOOP1:                    ; loop over width / 8
    ; YUV420 Planar inputer
        movq xmm0, [esi]            ; fetch 8 y values (8 bit) yyyyyyyy00000000
        movd xmm1, [eax]            ; fetch 4 u values (8 bit) uuuu000000000000
        movd xmm2, [ebx]            ; fetch 4 v values (8 bit) vvvv000000000000

    ; extract y
        pxor xmm7,xmm7              ; 00000000000000000000000000000000
        punpcklbw xmm0,xmm7         ; interleave xmm7 into xmm0 y0y0y0y0y0y0y0y0
    ; extract u and duplicate so each becomes 0u0u
        punpcklbw xmm1,xmm7         ; interleave xmm7 into xmm1 u0u0u0u000000000
        punpcklwd xmm1,xmm7         ; interleave again u000u000u000u000
        pshuflw xmm1,xmm1, 0xA0     ; copy u values
        pshufhw xmm1,xmm1, 0xA0     ; to get u0u0
    ; extract v
        punpcklbw xmm2,xmm7         ; interleave xmm7 into xmm1 v0v0v0v000000000
        punpcklwd xmm2,xmm7         ; interleave again v000v000v000v000
        pshuflw xmm2,xmm2, 0xA0     ; copy v values
        pshufhw xmm2,xmm2, 0xA0     ; to get v0v0

    yuv2rgbsse2

    rgba32sse2output

    ; endloop
        add edi,32
        add esi,8
        add eax,4
        add ebx,4
        sub ecx, 1                  ; apparently sub is better than dec
        jnz REPEATLOOP1
    ENDLOOP1:
    ; Cleanup
        pop ebx
        pop eax
        pop ecx
        pop esi
        pop edi
        mov esp, ebp
        pop ebp
        ret

    SECTION .note.GNU-stack noalloc noexec nowrite progbits

What's the difference between logical SSE intrinsics?

- by ~buratinas

Hello, is there any difference between logical SSE intrinsics for different types? For example, if we take the OR operation, there are three intrinsics: _mm_or_ps, _mm_or_pd and _mm_or_si128. My questions:

- Is there any difference between using one intrinsic or another (with appropriate type casting)? Won't there be any hidden costs, like longer execution in some specific situation?
- These intrinsics map to three different x86 instructions (por, orps, orpd). Does anyone have any idea why Intel is wasting precious opcode space on several instructions which do the same thing?
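The bitwise result of the three OR intrinsics can at least be compared directly. The sketch below only demonstrates that the produced bits match when the operands are reinterpreted with the cast intrinsics; it says nothing about timing differences between the integer and floating-point variants, which would need measuring on the target CPU. `or_variants_agree` is a hypothetical name.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, including the _mm_cast* helpers
#include <cstdint>
#include <cstring>      // std::memcmp

// Hypothetical helper: ORs the same 128 bits with all three intrinsics and
// reports whether the three results are bit-for-bit identical.
bool or_variants_agree(const std::uint32_t a[4], const std::uint32_t b[4])
{
    const __m128i ia = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i ib = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    const __m128i r_int = _mm_or_si128(ia, ib);
    // The casts only reinterpret the register contents; no conversion happens.
    const __m128i r_ps = _mm_castps_si128(
        _mm_or_ps(_mm_castsi128_ps(ia), _mm_castsi128_ps(ib)));
    const __m128i r_pd = _mm_castpd_si128(
        _mm_or_pd(_mm_castsi128_pd(ia), _mm_castsi128_pd(ib)));

    std::uint32_t x[4], y[4], z[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(x), r_int);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(y), r_ps);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(z), r_pd);

    return std::memcmp(x, y, sizeof(x)) == 0 && std::memcmp(x, z, sizeof(x)) == 0;
}
```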

OpenMP + SSE gives no speedup

- by Sayan Ghosh

Hi, my professor found this interesting experiment on 3D linearly separable kernel convolution using SSE and OpenMP, and gave me the task of benchmarking it on our system. The author claims a crazy 18-fold speedup over the serial approach! That might not always be the case, but we were expecting at least a 2-4 times speedup running this on a dual-core Intel.

http://software.intel.com/en-us/articles/16bit-3d-convolution-sse4openmp-implementation-on-penryn-cpu/#comment-41994

Alas, we could find exactly no speedup. The serial code always performs better, with or without OpenMP. I am using Linux, and observed a certain trend: when no other processes are running on the system, after a while the loadavg starts increasing and the %CPU utilization falls. Another probable false positive, which I ran into accidentally: I started the program, then immediately paused it. Then I ran it in the background with bg, and saw a speedup of more than 2. This happens all the time! Any advice would be great. Thanks, Sayan

Concise SSE and MMX instruction reference with latencies and throughput

- by Joe

I am trying to optimize some arithmetic by using the MMX and SSE instruction sets with inline assembly. However, I have been unable to find good references for the timings and usages of these enhanced instruction sets. Could you please help me find references that contain information about the throughput, latency, operands, and perhaps short descriptions of the instructions? So far, I have found:

Intel Instruction References
http://www.intel.com/Assets/PDF/manual/253666.pdf
http://www.intel.com/Assets/PDF/manual/253667.pdf

Intel Optimization Guide
http://www.intel.com/Assets/PDF/manual/248966.pdf

Timings of Integer Operations
http://gmplib.org/~tege/x86-timing.pdf

g++ SSE intrinsics dilemma - value from intrinsic "saturates"

- by Sriram

Hi, I wrote a simple program to implement SSE intrinsics for computing the inner product of two large (100000 or more elements) vectors. The program compares the execution time for both the inner product computed the conventional way and using intrinsics. Everything works out fine, until I insert (just for the fun of it) an inner loop before the statement that computes the inner product. Before I go further, here is the code:

    //this is a sample Intrinsics program to compute inner product of two vectors and compare Intrinsics with traditional method of doing things.
    #include <iostream>
    #include <iomanip>
    #include <xmmintrin.h>
    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    using namespace std;

    typedef float v4sf __attribute__ ((vector_size(16)));

    double innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume len1 = len2.
        float result = 0.0;
        for(int i = 0; i < len1; i++) {
            for(int j = 0; j < len1; j++) {
                result += (arr1[i] * arr2[i]);
            }
        }
        //float y = 1.23e+09;
        //cout << "y = " << y << endl;
        return result;
    }

    double sse_v4sf_innerProduct(float* arr1, int len1, float* arr2, int len2) { //assume that len1 = len2.
        if(len1 != len2) {
            cout << "Lengths not equal." << endl;
            exit(1);
        }
        /*steps:
         * 1. load a long-type (4 float) into a v4sf type data from both arrays.
         * 2. multiply the two.
         * 3. multiply the same and store result.
         * 4. add this to previous results.
         */
        v4sf arr1Data, arr2Data, prevSums, multVal, xyz;
        //__builtin_ia32_xorps(prevSums, prevSums); //making it equal zero.
        //can explicitly load 0 into prevSums using loadps or storeps (Check).
        float temp[4] = {0.0, 0.0, 0.0, 0.0};
        prevSums = __builtin_ia32_loadups(temp);
        float result = 0.0;
        for(int i = 0; i < (len1 - 3); i += 4) {
            for(int j = 0; j < len1; j++) {
                arr1Data = __builtin_ia32_loadups(&arr1[i]);
                arr2Data = __builtin_ia32_loadups(&arr2[i]); //store the contents of two arrays.
                multVal = __builtin_ia32_mulps(arr1Data, arr2Data); //multiply.
                xyz = __builtin_ia32_addps(multVal, prevSums);
                prevSums = xyz;
            }
        }
        //prevSums will hold the sums of 4 32-bit floating point values taken at a time. Individual entries in prevSums also need to be added.
        __builtin_ia32_storeups(temp, prevSums); //store prevSums into temp.
        cout << "Values of temp:" << endl;
        for(int i = 0; i < 4; i++)
            cout << temp[i] << endl;
        result += temp[0] + temp[1] + temp[2] + temp[3];
        return result;
    }

    int main() {
        clock_t begin, end;
        int length = 100000;
        float *arr1, *arr2;
        double result_Conventional, result_Intrinsic;
        // printStats("Allocating memory.");
        arr1 = new float[length];
        arr2 = new float[length];
        // printStats("End allocation.");
        srand(time(NULL)); //init random seed.
        // printStats("Initializing array1 and array2");
        begin = clock();
        for(int i = 0; i < length; i++) {
            // for(int j = 0; j < length; j++) {
            // arr1[i] = rand() % 10 + 1;
            arr1[i] = 2.5;
            // arr2[i] = rand() % 10 - 1;
            arr2[i] = 2.5;
            // }
        }
        end = clock();
        cout << "Time to initialize array1 and array2 = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
        // printStats("Finished initialization.");
        // printStats("Begin inner product conventionally.");
        begin = clock();
        result_Conventional = innerProduct(arr1, length, arr2, length);
        end = clock();
        cout << "Time to compute inner product conventionally = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
        // printStats("End inner product conventionally.");
        // printStats("Begin inner product using Intrinsics.");
        begin = clock();
        result_Intrinsic = sse_v4sf_innerProduct(arr1, length, arr2, length);
        end = clock();
        cout << "Time to compute inner product with intrinsics = " << ((double) (end - begin)) / CLOCKS_PER_SEC << endl;
        //printStats("End inner product using Intrinsics.");
        cout << "Results: " << endl;
        cout << " result_Conventional = " << result_Conventional << endl;
        cout << " result_Intrinsics = " << result_Intrinsic << endl;
        return 0;
    }

I use the following g++ invocation to build this:

    g++ -W -Wall -O2 -pedantic -march=i386 -msse intrinsics_SSE_innerProduct.C -o innerProduct

Each of the loops above, in both functions, runs a total of N^2 times. However, given that arr1 and arr2 (the two floating point vectors) are loaded with the value 2.5 and the length of the array is 100,000, the result in both cases should be 6.25e+10. The results I get are:

    Results:
     result_Conventional = 6.25e+10
     result_Intrinsics = 5.36871e+08

This is not all. It seems that the value returned from the function that uses intrinsics "saturates" at the value above. I tried putting other values for the elements of the array and different sizes too. But it seems that any value above 1.0 for the array contents and any size above 1000 meets with the same value we see above. Initially, I thought it might be because all operations within SSE are in floating point, but floating point should be able to store a number that is of the order of e+08. I am trying to see where I could be going wrong but cannot seem to figure it out. I am using g++ version: g++ (GCC) 4.4.1 20090725 (Red Hat 4.4.1-2). Any help on this is most welcome. Thanks, Sriram.
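The reported value 5.36871e+08 is 2^29, i.e. four lanes of 2^27 each, which points at single-precision rounding rather than a real saturation. Once a float accumulator reaches 2^27 = 134217728, neighbouring floats are 16 apart, so adding 6.25 (2.5 * 2.5) rounds straight back to the same value. The sketch below demonstrates that one step; `add_in_float` and `add_in_double` are hypothetical names. Why the conventional loop does not stall is an assumption on my part: with `-march=i386` it likely keeps `result` in an 80-bit x87 register.

```cpp
// Hypothetical helper: one accumulation step carried out in single precision.
// The volatile store forces the sum to be rounded to float even on targets
// that evaluate float expressions in a wider format.
float add_in_float(float acc, float term)
{
    volatile float rounded = acc + term;
    return rounded;
}

// Hypothetical helper: the same step with a double accumulator, which has
// enough mantissa bits to keep the 6.25 increment at this magnitude.
double add_in_double(double acc, float term)
{
    return acc + static_cast<double>(term);
}
```

Here `add_in_float(134217728.0f, 6.25f)` returns 134217728, while `add_in_double(134217728.0, 6.25f)` returns 134217734.25. Accumulating into a wider type, or summing partial blocks, is one way to avoid the stall.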

SSE (SIMD extensions) support in gcc

- by goldenmean

Hi, I came across the code below:

    #include "stdio.h"
    #define VECTOR_SIZE 4
    typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE))); // vector of four single floats

    typedef union f4vector
    {
        v4sf v;
        float f[VECTOR_SIZE];
    } f4vector;

    void print_vector (f4vector *v)
    {
        printf("%f,%f,%f,%f\n", v->f[0], v->f[1], v->f[2], v->f[3]);
    }

    int main()
    {
        union f4vector a, b, c;
        a.v = (v4sf){1.2, 2.3, 3.4, 4.5};
        b.v = (v4sf){5., 6., 7., 8.};
        c.v = a.v + b.v;
        print_vector(&a);
        print_vector(&b);
        print_vector(&c);
    }

This code builds fine and works as expected using gcc (with its built-in SSE/MMX extensions and vector data types). The code does a SIMD vector addition using 4 single floats. I want to understand in detail what each keyword/function call on this typedef line means and does:

    typedef float v4sf __attribute__ ((vector_size(sizeof(float)*VECTOR_SIZE)));

- What does the vector_size() function return?
- What is the __attribute__ keyword for?
- Is the float data type here being typedef'd to the v4sf type?

I understand the rest. Thanks, -AD
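As a small illustration of the typedef line: `__attribute__` is GCC's syntax for attaching compiler-specific properties to a declaration, and `vector_size(N)` is an attribute argument (not a function call) stating the total size of the vector in bytes. With N = 16 and a `float` base type, `v4sf` names a 16-byte vector of four floats. The sketch assumes GCC or Clang; `add4` is a hypothetical name.

```cpp
// GCC/Clang vector extension: a 16-byte vector whose elements are float,
// so it holds 16 / sizeof(float) = 4 lanes.
typedef float v4sf __attribute__((vector_size(16)));

static_assert(sizeof(v4sf) == 16, "v4sf should occupy 16 bytes");
static_assert(sizeof(v4sf) / sizeof(float) == 4, "v4sf should hold four floats");

// Hypothetical helper: the + operator on vector types works lane by lane,
// and on x86 with SSE enabled the compiler can emit a single addps for it.
v4sf add4(v4sf a, v4sf b)
{
    return a + b;
}
```

Individual lanes can be read with subscripts, for example `add4(a, b)[0]`, which avoids the union used in the original code.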

Intrinsics program (SSE) - g++ - help needed

- by Sriram

Hi all, this is the first time I am posting a question on stackoverflow, so please try and overlook any errors I may have made in formatting my question/code. But please do point them out to me so I may be more careful. I was trying to write some simple intrinsics routines for the addition of two 128-bit (containing 4 float variables) numbers. I found some code on the net and was trying to get it to run on my system. The code is as follows:

    //this is a sample Intrinsics program to add two vectors.
    #include <iostream>
    #include <iomanip>
    #include <xmmintrin.h>
    #include <stdio.h>
    using namespace std;

    struct vector4 {
        float x, y, z, w;
    };

    //functions to operate on them.
    vector4 set_vector(float x, float y, float z, float w = 0) {
        vector4 temp;
        temp.x = x;
        temp.y = y;
        temp.z = z;
        temp.w = w;
        return temp;
    }

    void print_vector(const vector4& v) {
        cout << " This is the contents of vector: " << endl;
        cout << " > vector.x = " << v.x << endl;
        cout << " vector.y = " << v.y << endl;
        cout << " vector.z = " << v.z << endl;
        cout << " vector.w = " << v.w << endl;
    }

    vector4 sse_vector4_add(const vector4&a, const vector4& b) {
        vector4 result;
        asm volatile (
            "movl $a, %eax" //move operands into registers.
            "\n\tmovl $b, %ebx"
            "\n\tmovups (%eax), xmm0" //move register contents into SSE registers.
            "\n\tmovups (%ebx), xmm1"
            "\n\taddps xmm0, xmm1" //add the elements. addps operates on single-precision vectors.
            "\n\t movups xmm0, result" //move result into vector4 type data.
        );
        return result;
    }

    int main() {
        vector4 a, b, result;
        a = set_vector(1.1, 2.1, 3.2, 4.5);
        b = set_vector(2.2, 4.2, 5.6);
        result = sse_vector4_add(a, b);
        print_vector(a);
        print_vector(b);
        print_vector(result);
        return 0;
    }

The g++ parameters I use are:

    g++ -Wall -pedantic -g -march=i386 -msse intrinsics_SSE_example.C -o h

The errors I get are as follows:

    intrinsics_SSE_example.C: Assembler messages:
    intrinsics_SSE_example.C:45: Error: too many memory references for movups
    intrinsics_SSE_example.C:46: Error: too many memory references for movups
    intrinsics_SSE_example.C:47: Error: too many memory references for addps
    intrinsics_SSE_example.C:48: Error: too many memory references for movups

I have spent a lot of time trying to debug these errors, googled them and so on. I am a complete noob to intrinsics and so may have overlooked some important things. Any help is appreciated. Thanks, Sriram.
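The assembler messages are most likely caused by the register names: in AT&T syntax a register needs a `%` prefix, so a bare `xmm0` is read as a symbol, that is, as a memory operand. In addition, `$a`, `$b` and `result` refer to assembler symbols rather than the C++ variables. One way to sidestep inline assembly entirely is to use the intrinsics from `<xmmintrin.h>`, which the code already includes. The following is a minimal sketch of that alternative, not the original author's method, and it assumes `vector4` is laid out as four contiguous floats.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

struct vector4 {
    float x, y, z, w;
};

// Assumption made explicit: the four members are contiguous with no padding.
static_assert(sizeof(vector4) == 4 * sizeof(float), "vector4 must be 4 packed floats");

// Intrinsics-based replacement for the inline-assembly function in the post.
vector4 sse_vector4_add(const vector4& a, const vector4& b)
{
    vector4 result;
    const __m128 va = _mm_loadu_ps(&a.x);              // load a.x, a.y, a.z, a.w
    const __m128 vb = _mm_loadu_ps(&b.x);              // load b.x, b.y, b.z, b.w
    _mm_storeu_ps(&result.x, _mm_add_ps(va, vb));      // lane-wise add, then store
    return result;
}
```

The unaligned load and store variants are used because `vector4` has no 16-byte alignment guarantee.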

How much effort do you have to put in to get gains from using SSE?

- by John

Case One

Say you have a little class:

    class Point3D
    {
    private:
        float x,y,z;
    public:
        operator+=() ...etc
    };

    Point3D &Point3D::operator+=(Point3D &other)
    {
        this->x += other.x;
        this->y += other.y;
        this->z += other.z;
    }

A naive use of SSE would simply replace these function bodies with a few intrinsics. But would we expect this to make much difference? MMX used to involve costly state changes IIRC; does SSE, or are its instructions just like any others? And even if there's no direct "use SSE" overhead, would moving the values into SSE registers and back out again really make it any faster?

Case Two

Instead, you're working with a less OO-based code base. Rather than an array/vector of Point3D objects, you simply have a big array of floats:

    float coordinateData[NUM_POINTS*3];

    void add(int i,int j) //yes it's unsafe, no overlap check... example only
    {
        for (int x=0;x<3;++x)
        {
            coordinateData[i*3+x] += coordinateData[j*3+x];
        }
    }

What about use of SSE here? Any better?

In conclusion

Is trying to optimise single vector operations using SSE actually worthwhile, or is it really only valuable when doing bulk operations?

Websockets VS SSE

- by user3385828

Sorry for asking this here; I bet it has been asked plenty of times before, but this time it's something specific which I haven't seen explained anywhere else. Suppose I have a service which needs to query the database for different data every once in a while. For this I have 2 or 3 SSE connections, each one with a different retry base time (20000 milliseconds, 1000 milliseconds...). What I'd like to know is whether WebSockets can handle different "data types" according to the request. For example, could I create one WebSocket to handle a notification system, a chat system and a group system, instead of separate SSE connections, and treat the data differently with JavaScript? And if so, would it perform better than sending different queries to the server through different SSE connections?

Help with Assembly/SSE Multiplication

- by Brett

I've been trying to figure out how to gain some improvement in my code at a couple of very crucial lines:

    float x = a*b;
    float y = c*d;
    float z = e*f;
    float w = g*h;

All of a, b, c... are floats. I decided to look into using SSE, but can't seem to find any improvement; in fact, it turns out to be twice as slow. My SSE code is:

    Vector4 abcd, efgh, result;
    abcd = [float a, float b, float c, float d];
    efgh = [float e, float f, float g, float h];
    _asm {
        movups xmm1, abcd
        movups xmm2, efgh
        mulps xmm1, xmm2
        movups result, xmm1
    }

I also attempted using standard inline assembly, but it doesn't appear that I can pack the register with the four floating points like I can with SSE. Any comments or help would be greatly appreciated. I mainly need to understand why my calculations using SSE are slower than the serial C++ code. I'm compiling in Visual Studio 2005 on Windows XP, using a Pentium 4 with HT, if that provides any additional information to assist. Thanks in advance!

Send post data while opening SSE connection

- by Prosto Trader

I'm trying to establish an SSE connection and run some long-running actions on the server side, informing the user of progress through SSE events. What I don't understand is how to send some data along with the new connection. Do I have to combine regular ajax with new EventSource, or is there a way to transfer POST data with the connection itself? Here is what I have so far, and I need to send a pretty big JSON with the request. Is that possible, or is GET the only way to send data?

    var source = new EventSource('/terminal/ajax-put-packet-trade-order/');

MVC: returning multiple results on stream connection to implement HTML5 SSE

- by eddo

I am trying to set up a lightweight HTML5 Server-Sent Events implementation on my MVC 4 web application, without using one of the available socket libraries or similar. The lightweight approach I am trying is:

Client side: EventSource (or jquery.eventsource for IE)
Server side: long polling with AsyncController

(Sorry for dropping the raw test code here; it is just to give an idea.)

    public class HTML5testAsyncController : AsyncController
    {
        private static int curIdx = 0;
        private static BlockingCollection<string> _data = new BlockingCollection<string>();

        static HTML5testAsyncController()
        {
            addItems(10);
        }

        // adds some test messages
        static void addItems(int howMany)
        {
            _data.Add("started");
            for (int i = 0; i < howMany; i++)
            {
                _data.Add("HTML5 item" + (curIdx++).ToString());
            }
            _data.Add("ended");
        }

        // here comes the async action, 'Simple'
        public void SimpleAsync()
        {
            AsyncManager.OutstandingOperations.Increment();
            Task.Factory.StartNew(() =>
            {
                var result = string.Empty;
                var sb = new StringBuilder();
                string serializedObject = null;
                // wait up to 40 secs that a message arrives
                if (_data.TryTake(out result, TimeSpan.FromMilliseconds(40000)))
                {
                    JavaScriptSerializer ser = new JavaScriptSerializer();
                    serializedObject = ser.Serialize(new { item = result, message = "MSG content" });
                    sb.AppendFormat("data: {0}\n\n", serializedObject);
                }
                AsyncManager.Parameters["serializedObject"] = serializedObject;
                AsyncManager.OutstandingOperations.Decrement();
            });
        }

        // callback which returns the results on the stream
        public ActionResult SimpleCompleted(string serializedObject)
        {
            ServerSentEventResult sar = new ServerSentEventResult();
            sar.Content = () => { return serializedObject; };
            return sar;
        }

        // pushes the data on the stream in a format conforming to HTML5 SSE
        public class ServerSentEventResult : ActionResult
        {
            public ServerSentEventResult() { }

            public delegate string GetContent();
            public GetContent Content { get; set; }
            public int Version { get; set; }

            public override void ExecuteResult(ControllerContext context)
            {
                if (context == null)
                {
                    throw new ArgumentNullException("context");
                }
                if (this.Content != null)
                {
                    HttpResponseBase response = context.HttpContext.Response;
                    // this is the content type required by chrome 6 for server sent events
                    response.ContentType = "text/event-stream";
                    response.BufferOutput = false;
                    // this is important because chrome fails with a "failed to load resource"
                    // error if the server attempts to put the char set after the content type
                    response.Charset = null;
                    string[] newStrings = context.HttpContext.Request.Headers.GetValues("Last-Event-ID");
                    if (newStrings == null || newStrings[0] != this.Version.ToString())
                    {
                        string value = this.Content();
                        response.Write(string.Format("data:{0}\n\n", value));
                        //response.Write(string.Format("id:{0}\n", this.Version));
                    }
                    else
                    {
                        response.Write("");
                    }
                }
            }
        }
    }

The problem is on the server side, as there is still a big gap between the expected result and what is actually going on.

Expected result:
    EventSource opens a stream connection to the server.
    The server keeps it open for a safe time (say, 2 minutes) so that I am protected from thread leaks caused by dead clients.
    As new message events are received by the server (and enqueued to a thread-safe collection such as BlockingCollection), they are pushed to the client on the open stream:
        message 1 received at T+0ms, pushed to the client at T+x
        message 2 received at T+200ms, pushed to the client at T+x+200ms

Actual behaviour:
    EventSource opens a stream connection to the server.
    The server keeps it open until a message event arrives (thanks to long polling).
    Once a message is received, MVC pushes the message and closes the connection.
    EventSource has to reopen the connection, and this happens after a couple of seconds:
        message 1 received at T+0ms, pushed to the client at T+x
        message 2 received at T+200ms, pushed to the client at T+x+3200ms

This is not OK, as it defeats the purpose of using SSE: the clients keep reconnecting as in normal polling, and message delivery gets delayed. Now, the question: is there a native way to keep the connection open after sending the first message, and to send further messages on the same connection?
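The wire format that the expected behaviour relies on can be sketched independently of MVC. The snippet below is only an illustration in C, not ASP.NET code, and `format_sse_event` and `send_two_events` are hypothetical helper names. It builds one event frame (an `id:` line, a `data:` line, and a terminating blank line) and writes two frames to the same open stream, flushing after each one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not part of any framework): formats one Server-Sent
 * Events frame into buf. A frame is a set of "field: value" lines followed
 * by an empty line, which tells the client to dispatch the event.
 * Assumes data contains no line breaks (a multi-line payload would need one
 * "data:" line per line of text).
 * Returns the frame length, or -1 if buf is too small. */
int format_sse_event(char *buf, size_t cap, int id, const char *data)
{
    assert(buf != NULL && data != NULL);
    int n = snprintf(buf, cap, "id: %d\ndata: %s\n\n", id, data);
    if (n < 0 || (size_t)n >= cap) {
        return -1;
    }
    return n;
}

/* Writes two events on the same open stream, flushing after each one so
 * that every message leaves immediately instead of when the stream closes.
 * Returns 0 on success, -1 on failure. */
int send_two_events(FILE *out)
{
    char frame[128];
    const char *messages[] = { "first", "second" };
    for (int i = 0; i < 2; i++) {
        int n = format_sse_event(frame, sizeof frame, i + 1, messages[i]);
        if (n < 0 || fwrite(frame, 1, (size_t)n, out) != (size_t)n) {
            return -1;
        }
        if (fflush(out) != 0) {
            return -1;
        }
    }
    return 0;
}
```

The point of the sketch is the shape of the exchange: one response, many frames, one flush per frame. Completing the response after the first frame is what forces EventSource to reconnect.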

Read the article
Benefit of using multiple SIMD instruction sets simultaneously

- by GenTiradentes

I'm writing a highly parallel, multithreaded application. I've already written an SSE-accelerated thread class. If I were to write an MMX-accelerated thread class and run both at the same time (one SSE thread and one MMX thread per core), would performance improve noticeably? I would think that this setup would help hide memory latency, but I'd like to be sure before I start pouring time into it.

Read the article
How can I find a list of all SSE instructions? What happens if a CPU doesn't support SSE?

- by Blastcore

So I've been reading about how processors work. Now I'm on the instruction set extensions (SSE, SSE2, etc.), which are pretty interesting. I have a lot of questions (I've been reading about this on Wikipedia):

I've seen the names of some instructions that were added in SSE, but there's no explanation of any of them (maybe SSE4? They're not even listed on Wikipedia). Where can I read about what they do?
How do I know which of these instructions are being used?
If we do know which are being used, let's say I'm doing a comparison (this may be the most stupid question I've ever asked; I don't know assembly, though): is it possible to use the instruction directly in assembly code? (I've been looking at this: http://asm.inightmare.org/opcodelst/index.php?op=CMP)
How does the processor interpret the instructions?
What would happen if I had a processor without any of the SSE instructions? (I suppose that if we wanted to do a comparison, we wouldn't be able to, right?)
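On the last question: on x86, executing an instruction that the CPU does not implement raises an invalid-opcode exception, which the operating system typically reports as an illegal-instruction fault. For that reason, programs often check for a feature at run time and fall back to plain code. The sketch below assumes GCC or Clang on x86 and uses their `__builtin_cpu_supports` builtin. `cpu_has_sse42` and `pick_implementation` are hypothetical names chosen for illustration.

```c
/* Returns 1 if the running CPU reports SSE4.2 support, 0 if it does not,
 * and -1 if this build cannot tell (non-x86 target, or a compiler without
 * the GCC/Clang builtins used here). */
int cpu_has_sse42(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                        /* populate the feature data */
    return __builtin_cpu_supports("sse4.2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Chooses a code path by name; a real program would return a function
 * pointer to the SSE or the scalar version of a routine instead. */
const char *pick_implementation(void)
{
    return cpu_has_sse42() == 1 ? "sse4.2" : "scalar";
}
```

A comparison itself never depends on SSE: the scalar `CMP` instruction is part of the base instruction set, so only the vectorised variants would be unavailable.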

Read the article
GCC - How to realign stack?

- by psihodelia

I am trying to build an application which uses pthreads and the __m128 SSE type. According to the GCC manual, the default stack alignment is 16 bytes, and using __m128 requires 16-byte alignment. My target CPU supports SSE. I use a GCC compiler which doesn't support runtime stack realignment (e.g. -mstackrealign), and I cannot use any other GCC compiler version. My test application looks like:

    #include <xmmintrin.h>
    #include <pthread.h>

    void *f(void *x){
        __m128 y;
        ...
    }

    int main(void){
        pthread_t p;
        pthread_create(&p, NULL, f, NULL);
    }

The application generates an exception and exits. After some simple debugging (printf "%p", &y), I found that the variable y is not 16-byte aligned. My question is: how can I realign the stack properly (to 16 bytes) without using any GCC flags and attributes (they don't help)? Should I use GCC inline assembler within the thread function f()?
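One workaround that does not depend on how the stack is aligned is to reserve a slightly oversized buffer and round its address up to the next 16-byte boundary. The sketch below shows only that pointer arithmetic. The names `is_aligned16`, `align_up16`, and `demo_aligned_slot` are hypothetical, and whether this resolves the crash above is an assumption, not something the question confirms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SLOT_SIZE = 16 };   /* room for one __m128-sized value */

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Rounds p up to the next 16-byte boundary (returns p itself if it is
 * already aligned). The caller must have reserved at least 15 spare bytes
 * beyond the data it wants to store. */
void *align_up16(void *p)
{
    assert(p != NULL);
    uintptr_t addr = (uintptr_t)p;
    return (void *)((addr + 15u) & ~(uintptr_t)15u);
}

/* Carves an aligned 16-byte slot out of an oversized local buffer and
 * reports whether the slot is aligned and still inside the buffer. */
int demo_aligned_slot(void)
{
    unsigned char raw[SLOT_SIZE + 15];
    unsigned char *slot = (unsigned char *)align_up16(raw);
    return is_aligned16(slot) && slot + SLOT_SIZE <= raw + sizeof raw;
}
```

Inside a thread function, the aligned slot could then be accessed through a `__m128 *`, while unaligned loads and stores such as `_mm_loadu_ps` remain an alternative when alignment cannot be guaranteed.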

Read the article
Can one construct a "good" hash function using CRC32C as a base?

- by DavidD

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that? David
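To make the idea concrete, here is a portable sketch of the CRC32C computation (Castagnoli polynomial, the variant that the SSE 4.2 instruction accumulates), followed by one possible extra mixing step. The mixing step is the 32-bit finalizer from MurmurHash3, swapped in purely as an illustration. Whether it fixes the distribution issue raised above is an assumption that would need measuring. `crc32c`, `mix32`, and `crc32c_hash` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C over len bytes, using the reflected Castagnoli polynomial
 * 0x82F63B78 with the conventional initial value and final inversion.
 * Slow but portable; the hardware instruction performs the per-byte update. */
uint32_t crc32c(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    const unsigned char *p = (const unsigned char *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* shift out one bit; apply the polynomial if that bit was set */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* MurmurHash3 32-bit finalizer: an invertible scramble that lets every
 * input bit influence every output bit. */
uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85EBCA6Bu;
    h ^= h >> 13;
    h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}

/* Candidate hash: CRC32C followed by the extra mixing step. */
uint32_t crc32c_hash(const void *data, size_t len)
{
    return mix32(crc32c(data, len));
}
```

The standard check input "123456789" yields 0xE3069283 from `crc32c`, which is a convenient way to confirm that a hardware-accelerated replacement computes the same function.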

Read the article
Common SIMD techniques

- by zxcat

Hi! Where can I find information about common SIMD tricks? I have an instruction set and know how to write non-tricky SIMD code, but I know that SIMD is now much more powerful: it can express complex conditional logic as branchless code. For example (ARMv6), the following sequence of instructions sets each byte of Rd equal to the unsigned minimum of the corresponding bytes of Ra and Rb:

    USUB8 Rd, Ra, Rb
    SEL Rd, Rb, Ra

Links to tutorials / uncommon SIMD techniques are good too :) ARMv6 is the most interesting for me, but x86 (SSE, ...), NEON (in ARMv7) and others are good too. Thank you.
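The USUB8/SEL pair can be understood through a portable emulation. The sketch below is plain C rather than ARM code, and `umin8` is a hypothetical name. It reproduces, lane by lane, the GE flag that USUB8 sets and the byte that SEL then picks.

```c
#include <stdint.h>

/* Portable emulation of the ARMv6 pair
 *     USUB8 Rd, Ra, Rb
 *     SEL   Rd, Rb, Ra
 * USUB8 subtracts the bytes lane by lane and sets the GE flag of a lane when
 * no borrow occurred (byte of Ra >= byte of Rb). SEL then takes the byte
 * from its first source (Rb) where GE is set and from its second source (Ra)
 * otherwise, which leaves the smaller byte in every lane. */
uint32_t umin8(uint32_t ra, uint32_t rb)
{
    uint32_t rd = 0;
    for (int lane = 0; lane < 4; lane++) {
        unsigned shift = 8u * (unsigned)lane;
        uint32_t a = (ra >> shift) & 0xFFu;
        uint32_t b = (rb >> shift) & 0xFFu;
        int ge = (a >= b);               /* GE flag produced by USUB8 */
        rd |= (ge ? b : a) << shift;     /* byte chosen by SEL */
    }
    return rd;
}
```

For instance, `umin8(0x10FF2080, 0x20017F80)` gives `0x10012080`. The hardware version does the same work in two instructions with no branches, which is the kind of trick the question is asking about.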

Read the article
