Search Results

Search found 28 results on 2 pages for 'vectorization'.

Page 1/2 | 1 2 | Next Page >

auto vectorization on double and ffast-math

- by wiso

Why is mandatory to use -ffast-math with g++ to achieve vectorization of loop using doubles? I don't like ffast-math because I don't want to loose precision.

Read the article
vectorization of a text file

- by Fox

I am trying to implement vectorization of a text file...I have created a dictionary (Unique words in all the documents) ... Which is the best way to implement this in java? For example - My dictionary has the following words - {w1, w2, w3, w4} And I have 2 documents each having subset of the words in the vocabulary. I need to write to a text file the matrix in the form -- 1,3,4,0 0,0,2,1 Here each row represents a document and the values represent the occurrence of each word in the document. Can you suggest me the most efficient way to implement this in Java?

Read the article
R strsplit and vectorization

- by James

When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input? For example, to count the lengths of words in a character vector: words <- c("a","quick","brown","fox") > length(strsplit(words,"")) [1] 4 # The number of words (length of the list) > length(strsplit(words,"")[[1]]) [1] 1 # The length of the first word only > sapply(words,function (x) length(strsplit(x,"")[[1]])) a quick brown fox 1 5 5 3 # Success, but potentially very slow Ideally, something like length(strsplit(words,"")[[.]]) where . is interpreted as the being the relevant part of the input vector.

Read the article
Vectorization of matlab code for faster execution

- by user3237134

My code works in the following manner: 1.First, it obtains several images from the training set 2.After loading these images, we find the normalized faces,mean face and perform several calculation. 3.Next, we ask for the name of an image we want to recognize 4.We then project the input image into the eigenspace, and based on the difference from the eigenfaces we make a decision. 5.Depending on eigen weight vector for each input image we make clusters using kmeans command. Source code i tried: clear all close all clc % number of images on your training set. M=1200; %Chosen std and mean. %It can be any number that it is close to the std and mean of most of the images. um=60; ustd=32; %read and show images(bmp); S=[]; %img matrix for i=1:M str=strcat(int2str(i),'.jpg'); %concatenates two strings that form the name of the image eval('img=imread(str);'); [irow icol d]=size(img); % get the number of rows (N1) and columns (N2) temp=reshape(permute(img,[2,1,3]),[irow*icol,d]); %creates a (N1*N2)x1 matrix S=[S temp]; %X is a N1*N2xM matrix after finishing the sequence %this is our S end %Here we change the mean and std of all images. We normalize all images. %This is done to reduce the error due to lighting conditions. for i=1:size(S,2) temp=double(S(:,i)); m=mean(temp); st=std(temp); S(:,i)=(temp-m)*ustd/st+um; end %show normalized images for i=1:M str=strcat(int2str(i),'.jpg'); img=reshape(S(:,i),icol,irow); img=img'; end %mean image; m=mean(S,2); %obtains the mean of each row instead of each column tmimg=uint8(m); %converts to unsigned 8-bit integer. Values range from 0 to 255 img=reshape(tmimg,icol,irow); %takes the N1*N2x1 vector and creates a N2xN1 matrix img=img'; %creates a N1xN2 matrix by transposing the image. % Change image for manipulation dbx=[]; % A matrix for i=1:M temp=double(S(:,i)); dbx=[dbx temp]; end %Covariance matrix C=A'A, L=AA' A=dbx'; L=A*A'; % vv are the eigenvector for L % dd are the eigenvalue for both L=dbx'*dbx and C=dbx*dbx'; [vv dd]=eig(L); % Sort and eliminate those whose eigenvalue is zero v=[]; d=[]; for i=1:size(vv,2) if(dd(i,i)>1e-4) v=[v vv(:,i)]; d=[d dd(i,i)]; end end %sort, will return an ascending sequence [B index]=sort(d); ind=zeros(size(index)); dtemp=zeros(size(index)); vtemp=zeros(size(v)); len=length(index); for i=1:len dtemp(i)=B(len+1-i); ind(i)=len+1-index(i); vtemp(:,ind(i))=v(:,i); end d=dtemp; v=vtemp; %Normalization of eigenvectors for i=1:size(v,2) %access each column kk=v(:,i); temp=sqrt(sum(kk.^2)); v(:,i)=v(:,i)./temp; end %Eigenvectors of C matrix u=[]; for i=1:size(v,2) temp=sqrt(d(i)); u=[u (dbx*v(:,i))./temp]; end %Normalization of eigenvectors for i=1:size(u,2) kk=u(:,i); temp=sqrt(sum(kk.^2)); u(:,i)=u(:,i)./temp; end % show eigenfaces; for i=1:size(u,2) img=reshape(u(:,i),icol,irow); img=img'; img=histeq(img,255); end % Find the weight of each face in the training set. omega = []; for h=1:size(dbx,2) WW=[]; for i=1:size(u,2) t = u(:,i)'; WeightOfImage = dot(t,dbx(:,h)'); WW = [WW; WeightOfImage]; end omega = [omega WW]; end % Acquire new image % Note: the input image must have a bmp or jpg extension. % It should have the same size as the ones in your training set. % It should be placed on your desktop ed_min=[]; srcFiles = dir('G:\newdatabase\*.jpg'); % the folder in which ur images exists for b = 1 : length(srcFiles) filename = strcat('G:\newdatabase\',srcFiles(b).name); Imgdata = imread(filename); InputImage=Imgdata; InImage=reshape(permute((double(InputImage)),[2,1,3]),[irow*icol,1]); temp=InImage; me=mean(temp); st=std(temp); temp=(temp-me)*ustd/st+um; NormImage = temp; Difference = temp-m; p = []; aa=size(u,2); for i = 1:aa pare = dot(NormImage,u(:,i)); p = [p; pare]; end InImWeight = []; for i=1:size(u,2) t = u(:,i)'; WeightOfInputImage = dot(t,Difference'); InImWeight = [InImWeight; WeightOfInputImage]; end noe=numel(InImWeight); % Find Euclidean distance e=[]; for i=1:size(omega,2) q = omega(:,i); DiffWeight = InImWeight-q; mag = norm(DiffWeight); e = [e mag]; end ed_min=[ed_min MinimumValue]; theta=6.0e+03; %disp(e) z(b,:)=InImWeight; end IDX = kmeans(z,5); clustercount=accumarray(IDX, ones(size(IDX))); disp(clustercount); Running time for 50 images:Elapsed time is 103.947573 seconds. QUESTIONS: 1.It is working fine for M=50(i.e Training set contains 50 images) but not for M=1200(i.e Training set contains 1200 images).It is not showing any error.There is no output.I waited for 10 min still there is no output. I think it is going infinite loop.What is the problem?Where i was wrong?

Read the article
Do any JS implementations currently support (or have support on the roadmap for) fast, vectorized op

- by agnoster

I'd like to do a bit of matrix/vector arithmetic in JavaScript, and was wondering if any browsers or other JS implementations actually have support for vectorized operations, for instance for quickly summing the entries of two Arrays (or summing, or whatever). Even if that currently doesn't mean it compiles down to vectorized operations, at least some language support would be nice for when it does get implemented - I'd take the existence of functions or syntax to support it as a step in the right direction. (Understandably, "vectorization javascript" searches are pretty much all about graphics and SVG.)

Read the article
Parallelize or vectorize all-against-all operation on a large number of matrices?

- by reve_etrange

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm. In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so list = who('data_matrix_prefix*') H = cell(numel(list),numel(list)); for i=1:numel(list) for j=1:numel(list) if i ~= j eval([ 'H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']); end end end is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s). However, operating on all the data requires almost 25 million calls. I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop: # who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data. nextData = cell(k,1); for i=1:k [nextData{:}] = deal(data{i}); H{:,i} = cellfun(@compare,data,nextData,'UniformOutput',false); end Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced. compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?). Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?

Read the article
Mean filter in MATLAB without loops or signal processing toolbox

- by Doresoom

I need to implement a mean filter on a data set, but I don't have access to the signal processing toolbox. Is there a way to do this without using a for loop? Here's the code I've got working: x=0:.1:10*pi; noise=0.5*(rand(1,length(x))-0.5); y=sin(x)+noise; %generate noisy signal a=10; %specify moving window size my=zeros(1,length(y)-a); for n=a/2+1:length(y)-a/2 my(n-a/2)=mean(y(n-a/2:n+a/2)); %calculate mean for each window end mx=x(a/2+1:end-a/2); %truncate x array to match plot(x,y) hold on plot(mx,my,'r')

Read the article
vectorizing a for loop in numpy/scipy?

- by user248237

I'm trying to vectorize a for loop that I have inside of a class method. The for loop has the following form: it iterates through a bunch of points and depending on whether a certain variable (called "self.condition_met" below) is true, calls a pair of functions on the point, and adds the result to a list. Each point here is an element in a vector of lists, i.e. a data structure that looks like array([[1,2,3], [4,5,6], ...]). Here is the problematic function: def myClass: def my_inefficient_method(self): final_vector = [] # Assume 'my_vector' and 'my_other_vector' are defined numpy arrays for point in all_points: if not self.condition_met: a = self.my_func1(point, my_vector) b = self.my_func2(point, my_other_vector) else: a = self.my_func3(point, my_vector) b = self.my_func4(point, my_other_vector) c = a + b final_vector.append(c) # Choose random element from resulting vector 'final_vector' self.condition_met is set before my_inefficient_method is called, so it seems unnecessary to check it each time, but I am not sure how to better write this. Since there are no destructive operations here it is seems like I could rewrite this entire thing as a vectorized operation -- is that possible? any ideas how to do this?

Read the article
How do I translate this Matlab bsxfun call to R?

- by claytontstanley

I would also (fingers crossed) like the solution to work with R Sparse Matrices in the Matrix package. >> A = [1,2,3,4,5] A = 1 2 3 4 5 >> B = [1;2;3;4;5] B = 1 2 3 4 5 >> bsxfun(@times, A, B) ans = 1 2 3 4 5 2 4 6 8 10 3 6 9 12 15 4 8 12 16 20 5 10 15 20 25 >> EDIT: I would like to do a matrix multiplication of these sparse vectors, and return a sparse array: > class(NRowSums) [1] "dsparseVector" attr(,"package") [1] "Matrix" > class(NColSums) [1] "dsparseVector" attr(,"package") [1] "Matrix" > NRowSums * NColSums (I think) w/o using a non-sparse variable to temporarily store data.

Read the article
MATLAB: vectorized array creation from a list of start/end indices

- by merv

I have a two-column matrix M that contains the start/end indices of a bunch of intervals: startInd EndInd 1 3 6 10 12 12 15 16 How can I generate a vector of all the interval indices: v = [1 2 3 6 7 8 9 10 12 15 16]; I'm doing the above using loops, but I'm wondering if there's a more elegant vectorized solution? v = []; for i=1:size(M,1) v = [v M(i,1):M(i,2)]; end

Read the article
Vectorizing sums of different diagonals in a matrix

- by reve_etrange

I want to vectorize the following MATLAB code. I think it must be simple but I'm finding it confusing nevertheless. r = some constant less than m or n [m,n] = size(C); S = zeros(m-r,n-r); for i=1:m-r for j=1:n-r S(i,j) = sum(diag(C(i:i+r-1,j:j+r-1))); end end The code calculates a table of scores, S, for a dynamic programming algorithm, from another score table, C. The diagonal summing is to generate scores for individual pieces of the data used to generate C, for all possible pieces (of size r). Thanks in advance for any answers! Sorry if this one should be obvious...

Read the article
Is there a better (i.e vectorised) way to put part of a column name into a row of a data frame in R

- by PaulHurleyuk

I have a data frame in R that has come about from running some stats on the result fo a melt/cast operation. I want to add a row into this dataframe containing a Nominal value. That Nominal Value is present in the names for each column df<-as.data.frame(cbind(x=c(1,2,3,4,5),`Var A_100`=c(5,4,3,2,1),`Var B_5`=c(9,8,7,6,5))) > df x Var A_100 Var B_5 1 1 5 9 2 2 4 8 3 3 3 7 4 4 2 6 5 5 1 5 So, I want to create a new row, that contains '100' in the column Var A_100 and '5' in Var B_5. Currently this is what I'm doing but I'm sure there must be a better, vectorised way to do this. temp_nom<-NULL for (l in 1:length(names(df))){ temp_nom[l]<-strsplit(names(df),"_")[[l]][2] } temp_nom [1] NA "100" "5" df[6,]<-temp_nom > df x Var A_100 Var B_5 1 1 5 9 2 2 4 8 3 3 3 7 4 4 2 6 5 5 1 5 6 <NA> 100 5 rm(temp_nom) Typically I'd have 16-24 columns. Any ideas ?

Read the article
Can operations be auto-vectorized on struct's field referenced by pointer?

- by Eonil

This is my code. inline void inv(Vector* target) { (*target).x = -(*target).x; (*target).y = -(*target).y; (*target).z = -(*target).z; (*target).w = -(*target).w; } I'm using GCC. Can this be vectorized? PS: I'm trying some kind of optimization. Any recommendations are welcome.

Read the article
How to speed this kind of for-loop?

- by wok

I would like to compute the maximum of translated images along the direction of a given axis. I know about ordfilt2, however I would like to avoid using the Image Processing Toolbox. So here is the code I have so far: imInput = imread('tire.tif'); n = 10; imMax = imInput(:, n:end); for i = 1:(n-1) imMax = max(imMax, imInput(:, i:end-(n-i))); end Is it possible to avoid using a for-loop in order to speed the computation up, and, if so, how?

Read the article
Vectorize matrix operation in R

- by Fernando

I have a R x C matrix filled to the k-th row and empty below this row. What i need to do is to fill the remaining rows. In order to do this, i have a function that takes 2 entire rows as arguments, do some calculations and output 2 fresh rows (these outputs will fill the matrix). I have a list of all 'pairs' of rows to be processed, but my for loop is not helping performance: # M is the matrix # nrow(M) and k are even, so nLeft is even M = matrix(1:48, ncol = 3) # half to fill k = nrow(M)/2 # simulate empty rows to be filled M[-(1:k), ] = 0 cat('before fill') print(M) # number of empty rows to fill nLeft = nrow(M) - k nextRow = k + 1 # list of rows to process (could be any order of non-empty rows) idxList = matrix(1:k, ncol = 2) for ( i in 1 : (nLeft / 2)) { row1 = M[idxList[i, 1],] row2 = M[idxList[i, 2],] # the two columns in 'results' will become 2 rows in M # fake result, return 2*row1 and 3*row2 results = matrix(c(2*row1, 3*row2), ncol = 2) # fill the matrix M[nextRow, ] = results[, 1] nextRow = nextRow + 1 M[nextRow, ] = results[, 2] nextRow = nextRow + 1 } cat('after fill') print(M) I tried to vectorize this, but failed... appreciate any help on improving this code, thanks!

Read the article
Optimizing Solaris 11 SHA-1 on Intel Processors

- by danx

SHA-1 is a "hash" or "digest" operation that produces a 160 bit (20 byte) checksum value on arbitrary data, such as a file. It is intended to uniquely identify text and to verify it hasn't been modified. Max Locktyukhin and others at Intel have improved the performance of the SHA-1 digest algorithm using multiple techniques. This code has been incorporated into Solaris 11 and is available in the Solaris Crypto Framework via the libmd(3LIB), the industry-standard libpkcs11(3LIB) library, and Solaris kernel module sha1. The optimized code is used automatically on systems with a x86 CPU supporting SSSE3 (Intel Supplemental SSSE3). Intel microprocessor architectures that support SSSE3 include Nehalem, Westmere, Sandy Bridge microprocessor families. Further optimizations are available for microprocessors that support AVX (such as Sandy Bridge). Although SHA-1 is considered obsolete because of weaknesses found in the SHA-1 algorithm—NIST recommends using at least SHA-256, SHA-1 is still widely used and will be with us for awhile more. Collisions (the same SHA-1 result for two different inputs) can be found with moderate effort. SHA-1 is used heavily though in SSL/TLS, for example. And SHA-1 is stronger than the older MD5 digest algorithm, another digest option defined in SSL/TLS. Optimizations Review SHA-1 operates by reading an arbitrary amount of data. The data is read in 512 bit (64 byte) blocks (the last block is padded in a specific way to ensure it's a full 64 bytes). Each 64 byte block has 80 "rounds" of calculations (consisting of a mixture of "ROTATE-LEFT", "AND", and "XOR") applied to the block. Each round produces a 32-bit intermediate result, called W[i]. Here's what each round operates: The first 16 rounds, rounds 0 to 15, read the 512 bit block 32 bits at-a-time. These 32 bits is used as input to the round. The remaining rounds, rounds 16 to 79, use the results from the previous rounds as input. Specifically for round i it XORs the results of rounds i-3, i-8, i-14, and i-16 and rotates the result left 1 bit. The remaining calculations for the round is a series of AND, XOR, and ROTATE-LEFT operators on the 32-bit input and some constants. The 32-bit result is saved as W[i] for round i. The 32-bit result of the final round, W[79], is the SHA-1 checksum. Optimization: Vectorization The first 16 rounds can be vectorized (computed in parallel) because they don't depend on the output of a previous round. As for the remaining rounds, because of step 2 above, computing round i depends on the results of round i-3, W[i-3], one can vectorize 3 rounds at-a-time. Max Locktyukhin found through simple factoring, explained in detail in his article referenced below, that the dependencies of round i on the results of rounds i-3, i-8, i-14, and i-16 can be replaced instead with dependencies on the results of rounds i-6, i-16, i-28, and i-32. That is, instead of initializing intermediate result W[i] with: W[i] = (W[i-3] XOR W[i-8] XOR W[i-14] XOR W[i-16]) ROTATE-LEFT 1 Initialize W[i] as follows: W[i] = (W[i-6] XOR W[i-16] XOR W[i-28] XOR W[i-32]) ROTATE-LEFT 2 That means that 6 rounds could be vectorized at once, with no additional calculations, instead of just 3! This optimization is independent of Intel or any other microprocessor architecture, although the microprocessor has to support vectorization to use it, and exploits one of the weaknesses of SHA-1. Optimization: SSSE3 Intel SSSE3 makes use of 16 %xmm registers, each 128 bits wide. The 4 32-bit inputs to a round, W[i-6], W[i-16], W[i-28], W[i-32], all fit in one %xmm register. The following code snippet, from Max Locktyukhin's article, converted to ATT assembly syntax, computes 4 rounds in parallel with just a dozen or so SSSE3 instructions: movdqa W_minus_04, W_TMP pxor W_minus_28, W // W equals W[i-32:i-29] before XOR // W = W[i-32:i-29] ^ W[i-28:i-25] palignr $8, W_minus_08, W_TMP // W_TMP = W[i-6:i-3], combined from // W[i-4:i-1] and W[i-8:i-5] vectors pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) movdqa W, W_TMP // 4 dwords in W are rotated left by 2 psrld $30, W // rotate left by 2 W = (W >> 30) | (W << 2) pslld $2, W_TMP por W, W_TMP movdqa W_TMP, W // four new W values W[i:i+3] are now calculated paddd (K_XMM), W_TMP // adding 4 current round's values of K movdqa W_TMP, (WK(i)) // storing for downstream GPR instructions to read A window of the 32 previous results, W[i-1] to W[i-32] is saved in memory on the stack. This is best illustrated with a chart. Without vectorization, computing the rounds is like this (each "R" represents 1 round of SHA-1 computation): RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR With vectorization, 4 rounds can be computed in parallel: RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR Optimization: AVX The new "Sandy Bridge" microprocessor architecture, which supports AVX, allows another interesting optimization. SSSE3 instructions have two operands, a input and an output. AVX allows three operands, two inputs and an output. In many cases two SSSE3 instructions can be combined into one AVX instruction. The difference is best illustrated with an example. Consider these two instructions from the snippet above: pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) With AVX they can be combined in one instruction: vpxor W_minus_16, W, W_TMP // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) This optimization is also in Solaris, although Sandy Bridge-based systems aren't widely available yet. As an exercise for the reader, AVX also has 256-bit media registers, %ymm0 - %ymm15 (a superset of 128-bit %xmm0 - %xmm15). Can %ymm registers be used to parallelize the code even more? Optimization: Solaris-specific In addition to using the Intel code described above, I performed other minor optimizations to the Solaris SHA-1 code: Increased the digest(1) and mac(1) command's buffer size from 4K to 64K, as previously done for decrypt(1) and encrypt(1). This size is well suited for ZFS file systems, but helps for other file systems as well. Optimized encode functions, which byte swap the input and output data, to copy/byte-swap 4 or 8 bytes at-a-time instead of 1 byte-at-a-time. Enhanced the Solaris mdb(1) and kmdb(1) debuggers to display all 16 %xmm and %ymm registers (mdb "$x" command). Previously they only displayed the first 8 that are available in 32-bit mode. Can't optimize if you can't debug :-). Changed the SHA-1 code to allow processing in "chunks" greater than 2 Gigabytes (64-bits) Performance I measured performance on a Sun Ultra 27 (which has a Nehalem-class Xeon 5500 Intel W3570 microprocessor @3.2GHz). Turbo mode is disabled for consistent performance measurement. Graphs are better than words and numbers, so here they are: The first graph shows the Solaris digest(1) command before and after the optimizations discussed here, contained in libmd(3LIB). I ran the digest command on a half GByte file in swapfs (/tmp) and execution time decreased from 1.35 seconds to 0.98 seconds. The second graph shows the the results of an internal microbenchmark that uses the Solaris libpkcs11(3LIB) library. The operations are on a 128 byte buffer with 10,000 iterations. The results show operations increased from 320,000 to 416,000 operations per second. Finally the third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. The results show for 1 kernel thread, operations increased from 410 to 600 MBytes/second. For 8 kernel threads, operations increase from 1540 to 1940 MBytes/second. Availability This code is in Solaris 11 FCS. It is available in the 64-bit libmd(3LIB) library for 64-bit programs and is in the Solaris kernel. You must be running hardware that supports Intel's SSSE3 instructions (for example, Intel Nehalem, Westmere, or Sandy Bridge microprocessor architectures). The easiest way to determine if SSSE3 is available is with the isainfo(1) command. For example, nehalem $ isainfo -v $ isainfo -v 64-bit amd64 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu If the output also shows "avx", the Solaris executes the even-more optimized 3-operand AVX instructions for SHA-1 mentioned above: sandybridge $ isainfo -v 64-bit amd64 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this code. Solaris libraries and kernel automatically determine if it's running on SSSE3 or AVX-capable machines and execute the correctly-tuned code for that microprocessor. Summary The Solaris 11 Crypto Framework, via the sha1 kernel module and libmd(3LIB) and libpkcs11(3LIB) libraries, incorporated a useful SHA-1 optimization from Intel for SSSE3-capable microprocessors. As with other Solaris optimizations, they come automatically "under the hood" with the current Solaris release. References "Improving the Performance of the Secure Hash Algorithm (SHA-1)" by Max Locktyukhin (Intel, March 2010). The source for these SHA-1 optimizations used in Solaris "SHA-1", Wikipedia Good overview of SHA-1 FIPS 180-1 SHA-1 standard (FIPS, 1995) NIST Comments on Cryptanalytic Attacks on SHA-1 (2005, revised 2006)

Read the article
What's the right way to utilize ARM SIMD on iPhone for Game vector/matrix operation?

- by Eonil

I'm making an vector/matrix library for Game which utilizes SIMD unit on iPhone (3GS or later). How can I do this? I searched about this, now I know several options: Accelerate framework (BLAS+LAPACK+...) from Apple (iPhone OS 4) OpenMAX implementation library from ARM GCC auto-vectorization feature What's the most suitable way for vector/matrix library for game?

Read the article
SSE SIMD Optimization For Loop

- by Projectile Fish

I have some code in a loop for(int i = 0; i < n; i++) { u[i] = c * u[i] + s * b[i]; } So, u and b are vectors of the same length, and c and s are scalars. Is this code a good candidate for vectorization for use with SSE in order to get a speedup?

Read the article
C++ performance, for versus while

- by aaa

hello. In general (or from your experience), is there difference in performance between for and while loops? What if they are doubly/triply nested? Is vectorization (SSE) affected by loop variant in g++ or Intel compilers? Thank you

Read the article
returning aligned memory with new?

- by Steve

I currently allocate my memory for arrays using the MS specific mm_malloc. I align the memory, as I'm doing some heavy duty math and the vectorization takes advantage of the alignment. I was wondering if anyone knows how to overload the new operator to do the same thing, as I feel dirty malloc'ing everywhere (and would eventually like to also compile on Linux)? Thanks for any help

Read the article
STL find performs bettern than hand-crafter loop

- by dusha

Hello all, I have some question. Given the following C++ code fragment: #include <boost/progress.hpp> #include <vector> #include <algorithm> #include <numeric> #include <iostream> struct incrementor { incrementor() : curr_() {} unsigned int operator()() { return curr_++; } private: unsigned int curr_; }; template<class Vec> char const* value_found(Vec const& v, typename Vec::const_iterator i) { return i==v.end() ? "no" : "yes"; } template<class Vec> typename Vec::const_iterator find1(Vec const& v, typename Vec::value_type val) { return find(v.begin(), v.end(), val); } template<class Vec> typename Vec::const_iterator find2(Vec const& v, typename Vec::value_type val) { for(typename Vec::const_iterator i=v.begin(), end=v.end(); i<end; ++i) if(*i==val) return i; return v.end(); } int main() { using namespace std; typedef vector<unsigned int>::const_iterator iter; vector<unsigned int> vec; vec.reserve(10000000); boost::progress_timer pt; generate_n(back_inserter(vec), vec.capacity(), incrementor()); //added this line, to avoid any doubts, that compiler is able to // guess the data is sorted random_shuffle(vec.begin(), vec.end()); cout << "value generation required: " << pt.elapsed() << endl; double d; pt.restart(); iter found=find1(vec, vec.capacity()); d=pt.elapsed(); cout << "first search required: " << d << endl; cout << "first search found value: " << value_found(vec, found)<< endl; pt.restart(); found=find2(vec, vec.capacity()); d=pt.elapsed(); cout << "second search required: " << d << endl; cout << "second search found value: " << value_found(vec, found)<< endl; return 0; } On my machine (Intel i7, Windows Vista) STL find (call via find1) runs about 10 times faster than the hand-crafted loop (call via find2). I first thought that Visual C++ performs some kind of vectorization (may be I am mistaken here), but as far as I can see assembly does not look the way it uses vectorization. Why is STL loop faster? Hand-crafted loop is identical to the loop from the STL-find body. I was asked to post program's output. Without shuffle: value generation required: 0.078 first search required: 0.008 first search found value: no second search required: 0.098 second search found value: no With shuffle (caching effects): value generation required: 1.454 first search required: 0.009 first search found value: no second search required: 0.044 second search found value: no Many thanks, dusha. P.S. I return the iterator and write out the result (found or not), because I would like to prevent compiler optimization, that it thinks the loop is not required at all. The searched value is obviously not in the vector.

Read the article
Fast modulo 3 or division algorithm?

- by aaa

Hello is there a fast algorithm, similar to power of 2, which can be used with 3, i.e. n%3. Perhaps something that uses the fact that if sum of digits is divisible by three, then the number is also divisible. This leads to a next question. What is the fast way to add digits in a number? I.e. 37 - 3 +7 - 10 I am looking for something that does not have conditionals as those tend to inhibit vectorization thanks

Read the article
What is the most useful R trick?

- by Dirk Eddelbuettel

In order to share some more tips and tricks for R, what is you single-most useful feature or trick? Clever vectorization? Data input/output? Visualization and graphics? Statistical analysis? Special functions? The interactive environment itself? One item per post, and we will see if we get a winner by means of votes. [Edit 25-Aug 2008]: So after one week, it seems that the simple str() won the poll. As I like to recommend that one myself, it is an easy answer to accept.

Read the article
Where is the bottleneck in this code?

- by Mikhail

I have the following tight loop that makes up the serial bottle neck of my code. Ideally I would parallelize the function that calls this but that is not possible. //n is about 60 for (int k = 0;k < n;k++) { double fone = z[k*n+i+1]; double fzer = z[k*n+i]; z[k*n+i+1]= s*fzer+c*fone; z[k*n+i] = c*fzer-s*fone; } Are there any optimizations that can be made such as vectorization or some evil inline that can help this code? I am looking into finding eigen solutions of tridiagonal matrices. http://www.cimat.mx/~posada/OptDoglegGraph/DocLogisticDogleg/projects/adjustedrecipes/tqli.cpp.html

Read the article
General suggestions for making R code faster? [closed]

- by gsk3

Questions come up fairly frequently about how to make R code faster. This is an attempt to provide a general framework for thinking about the problem. Questions of this nature seem to fall into one of a few categories: I have a loop and it's running slowly. I've heard that vectorization can speed things up. How do I vectorize it? I've vectorized and it's still running slowly. What do I do next? I'd like to speed up my code but it's running quickly enough already. What are the principles and specifics which can be used to make R code run faster?

Read the article

1 2 | Next Page >