Search Results

Search found 21 results on 1 pages for 'vectorize'.

Page 1/1 | 1 

  • vectorize is indeterminate

    - by telliott99
    I'm trying to vectorize a simple function in numpy and getting inconsistent behavior. I expect my code to return 0 for values < 0.5 and the unchanged value otherwise. Strangely, different runs of the script from the command line yield varying results: sometimes it works correctly, and sometimes I get all 0's. It doesn't matter which of the three lines I use for the case when d <= T. It does seem to be correlated with whether the first value to be returned is 0. Any ideas? Thanks. import numpy as np def my_func(d, T=0.5): if d > T: return d #if d <= T: return 0 else: return 0 #return 0 N = 4 A = np.random.uniform(size=N**2) A.shape = (N,N) print A f = np.vectorize(my_func) print f(A) $ python x.py [[ 0.86913815 0.96833127 0.54539153 0.46184594] [ 0.46550903 0.24645558 0.26988519 0.0959257 ] [ 0.73356391 0.69363161 0.57222389 0.98214089] [ 0.15789303 0.06803493 0.01601389 0.04735725]] [[ 0.86913815 0.96833127 0.54539153 0. ] [ 0. 0. 0. 0. ] [ 0.73356391 0.69363161 0.57222389 0.98214089] [ 0. 0. 0. 0. ]] $ python x.py [[ 0.37127366 0.77935622 0.74392301 0.92626644] [ 0.61639086 0.32584431 0.12345342 0.17392298] [ 0.03679475 0.00536863 0.60936931 0.12761859] [ 0.49091897 0.21261635 0.37063752 0.23578082]] [[0 0 0 0] [0 0 0 0] [0 0 0 0] [0 0 0 0]]

    Read the article

  • List comprehension, map, and numpy.vectorize performance

    - by mcstrother
    I have a function foo(i) that takes an integer and takes a significant amount of time to execute. Will there be a significant performance difference between any of the following ways of initializing a: a = [foo(i) for i in xrange(100)] a = map(foo, range(100)) vfoo = numpy.vectorize(foo) a = vfoo(range(100)) (I don't care whether the output is a list or a numpy array.) Is there a better way?

    Read the article

  • Vectorize matrix operation in R

    - by Fernando
    I have a R x C matrix filled to the k-th row and empty below this row. What i need to do is to fill the remaining rows. In order to do this, i have a function that takes 2 entire rows as arguments, do some calculations and output 2 fresh rows (these outputs will fill the matrix). I have a list of all 'pairs' of rows to be processed, but my for loop is not helping performance: # M is the matrix # nrow(M) and k are even, so nLeft is even M = matrix(1:48, ncol = 3) # half to fill k = nrow(M)/2 # simulate empty rows to be filled M[-(1:k), ] = 0 cat('before fill') print(M) # number of empty rows to fill nLeft = nrow(M) - k nextRow = k + 1 # list of rows to process (could be any order of non-empty rows) idxList = matrix(1:k, ncol = 2) for ( i in 1 : (nLeft / 2)) { row1 = M[idxList[i, 1],] row2 = M[idxList[i, 2],] # the two columns in 'results' will become 2 rows in M # fake result, return 2*row1 and 3*row2 results = matrix(c(2*row1, 3*row2), ncol = 2) # fill the matrix M[nextRow, ] = results[, 1] nextRow = nextRow + 1 M[nextRow, ] = results[, 2] nextRow = nextRow + 1 } cat('after fill') print(M) I tried to vectorize this, but failed... appreciate any help on improving this code, thanks!

    Read the article

  • Problem: Vectorizing Code with Intel Visual FORTRAN for X64

    - by user313209
    I'm compiling my fortran90 code using Intel Visual FORTRAN on Windows Server 2003 Enterprise X64 Edition. When I compile the code for 32 bit structure and using automatic and manual vectorizing options. The code will be compiled, vectorized. And when I run it on 8 core system the compiled code uses 70% of CPU that shows me that vectorizing is working. But when I compile the code with 64 Bit compiler, it says that the code is vectorized but when I run it it only shows CPU usage of about 12% that is full usage for one core out of 8, so it means that while the compiler says that code is vectorized, vectorization is not working. And it's strange for me because it's on a X64 Edition Windows and I was expecting to see the reverse result. I thought that it should be better to run a code that is compiled for 64 Bit architecture on a 64 bit windows. Anyone have any idea why the compiled code is not able to use the full power of multiple cores for 64 Bit Compiled version? Thanks in advance for your responses.

    Read the article

  • Parallelize or vectorize all-against-all operation on a large number of matrices?

    - by reve_etrange
    I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm. In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so list = who('data_matrix_prefix*') H = cell(numel(list),numel(list)); for i=1:numel(list) for j=1:numel(list) if i ~= j eval([ 'H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']); end end end is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s). However, operating on all the data requires almost 25 million calls. I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop: # who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data. nextData = cell(k,1); for i=1:k [nextData{:}] = deal(data{i}); H{:,i} = cellfun(@compare,data,nextData,'UniformOutput',false); end Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced. compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?). Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?

    Read the article

  • speed of map() vs. list comprehension vs. numpy vectorized function in python

    - by mcstrother
    I have a function foo(i) that takes an integer and takes a significant amount of time to execute. Will there be a significant performance difference between any of the following ways of initializing 'a': a = [foo(i) for i in xrange(100)] , a = map(foo, range(100)) , and vfoo = numpy.vectorize(foo) vfoo(range(100)) ? (I don't care whether the output is a list or a numpy array). Is there some other better way of doing this? Thanks.

    Read the article

  • R strsplit and vectorization

    - by James
    When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input? For example, to count the lengths of words in a character vector: words <- c("a","quick","brown","fox") > length(strsplit(words,"")) [1] 4 # The number of words (length of the list) > length(strsplit(words,"")[[1]]) [1] 1 # The length of the first word only > sapply(words,function (x) length(strsplit(x,"")[[1]])) a quick brown fox 1 5 5 3 # Success, but potentially very slow Ideally, something like length(strsplit(words,"")[[.]]) where . is interpreted as the being the relevant part of the input vector.

    Read the article

  • Using numpy.apply

    - by andylei
    What's wrong with this snippet of code? import numpy as np from scipy import stats d = np.arange(10.0) cutoffs = [stats.scoreatpercentile(d, pct) for pct in range(0, 100, 20)] f = lambda x: np.sum(x > cutoffs) fv = np.vectorize(f) # why don't these two lines output the same values? [f(x) for x in d] # => [0, 1, 2, 2, 3, 3, 4, 4, 5, 5] fv(d) # => array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) Any ideas?

    Read the article

  • Mapping functions of 2D numpy arrays

    - by perimosocordiae
    I have a function foo that takes a NxM numpy array as an argument and returns a scalar value. I have a AxNxM numpy array data, over which I'd like to map foo to give me a resultant numpy array of length A. Curently, I'm doing this: result = numpy.array([foo(x) for x in data]) It works, but it seems like I'm not taking advantage of the numpy magic (and speed). Is there a better way? I've looked at numpy.vectorize, and numpy.apply_along_axis, but neither works for a function of 2D arrays. EDIT: I'm doing boosted regression on 24x24 image patches, so my AxNxM is something like 1000x24x24. What I called foo above applies a Haar-like feature to a patch (so, not terribly computationally intensive).

    Read the article

  • Vectorizing sums of different diagonals in a matrix

    - by reve_etrange
    I want to vectorize the following MATLAB code. I think it must be simple but I'm finding it confusing nevertheless. r = some constant less than m or n [m,n] = size(C); S = zeros(m-r,n-r); for i=1:m-r for j=1:n-r S(i,j) = sum(diag(C(i:i+r-1,j:j+r-1))); end end The code calculates a table of scores, S, for a dynamic programming algorithm, from another score table, C. The diagonal summing is to generate scores for individual pieces of the data used to generate C, for all possible pieces (of size r). Thanks in advance for any answers! Sorry if this one should be obvious...

    Read the article

  • General suggestions for making R code faster? [closed]

    - by gsk3
    Questions come up fairly frequently about how to make R code faster. This is an attempt to provide a general framework for thinking about the problem. Questions of this nature seem to fall into one of a few categories: I have a loop and it's running slowly. I've heard that vectorization can speed things up. How do I vectorize it? I've vectorized and it's still running slowly. What do I do next? I'd like to speed up my code but it's running quickly enough already. What are the principles and specifics which can be used to make R code run faster?

    Read the article

  • Fast image coordinate lookup in Numpy

    - by victor
    I've got a big numpy array full of coordinates (about 400): [[102, 234], [304, 104], .... ] And a numpy 2d array my_map of size 800x800. What's the fastest way to look up the coordinates given in that array? I tried things like paletting as described in this post: http://opencvpython.blogspot.com/2012/06/fast-array-manipulation-in-numpy.html but couldn't get it to work. I was also thinking about turning each coordinate into a linear index of the map and then piping it straight into my_map like so: my_map[linearized_coords] but I couldn't get vectorize to properly translate the coordinates into a linear fashion. Any ideas?

    Read the article

  • efficientcy effort: grep with a vectored pattern or match with a list of values

    - by Elad663
    I guess this is trivial, I apologize, I couldn't find how to do it. I am trying to abstain from a loop, so I am trying to vectorize the process: I need to do something like grep, but where the pattern is a vector. Another option is a match, where the value is not only the first location. For example data (which is not how the real data is, otherswise I would exploit it structure): COUNTRIES=c("Austria","Belgium","Denmark","France","Germany", "Ireland","Italy","Luxembourg","Netherlands", "Portugal","Sweden","Spain","Finland","United Kingdom") COUNTRIES_Target=rep(COUNTRIES,times=4066) COUNTRIES_Origin=rep(COUNTRIES,each=4066) Now, currently I got a loop that: var_pointer=list() for (i in 1:length(COUNTRIES_Origin)) { var_pointer[[i]]=which(COUNTRIES_Origin[i]==COUNTRS_Target) } The problem with match is that match(x=COUNTRIES_Origin,table=COUNTRIES_Target) returns a vector of the same length as COUNTRIES_Origin and the value is the first match, while I need all of them. The issue with grep is that grep(pattern=COUNTRIES_Origin,x=COUNTRIES_Target) is the given warning: Warning message: In grep(pattern = COUNTRIES_Origin, x = COUNTRIES_Target) : argument 'pattern' has length > 1 and only the first element will be used Any suggestions?

    Read the article

  • Surviving MATLAB and R as a Hardcore Programmer

    - by dsimcha
    I love programming in languages that seem geared towards hardcore programmers. (My favorites are Python and D.) MATLAB is geared towards engineers and R is geared towards statisticians, and it seems like these languages were designed by people who aren't hardcore programmers and don't think like hardcore programmers. I always find them somewhat awkward to use, and to some extent I can't put my finger on why. Here are some issues I have managed to identify: (Both): The extreme emphasis on vectors and matrices to the extent that there are no true primitives. (Both): The difficulty of basic string manipulation. (Both): Lack of or awkwardness in support for basic data structures like hash tables and "real", i.e. type-parametric and nestable, arrays. (Both): They're really, really slow even by interpreted language standards, unless you bend over backwards to vectorize your code. (Both): They seem to not be designed to interact with the outside world. For example, both are fairly bulky programs that take a while to launch and seem to not be designed to make simple text filter programs easy to write. Furthermore, the lack of good string processing makes file I/O in anything but very standard forms near impossible. (Both): Object orientation seems to have a very bolted-on feel. Yes, you can do it, but it doesn't feel much more idiomatic than OO in C. (Both): No obvious, simple way to get a reference type. No pointers or class references. For example, I have no idea how you roll your own linked list in either of these languages. (MATLAB): You can't put multiple top level functions in a single file, encouraging very long functions and cut-and-paste coding. (MATLAB): Integers apparently don't exist as a first class type. (R): The basic builtin data structures seem way too high level and poorly documented, and never seem to do quite what I expect given my experience with similar but lower level data structures. (R): The documentation is spread all over the place and virtually impossible to browse or search. Even D, which is often knocked for bad documentation and is still fairly alpha-ish, is substantially better as far as I can tell. (R): At least as far as I'm aware, there's no good IDE for it. Again, even D, a fairly alpha-ish language with a small community, does better. In general, I also feel like MATLAB and R could be easily replaced by plain old libraries in more general-purpose langauges, if sufficiently comprehensive libraries existed. This is especially true in newer general purpose languages that include lots of features for library writers. Why do R and MATLAB seem so weird to me? Are there any other major issues that you've noticed that may make these languages come off as strange to hardcore programmers? When their use is necessary, what are some good survival tips? Edit: I'm seeing one issue from some of the answers I've gotten. I have a strong personal preference, when I analyze data, to have one script that incorporates the whole pipeline. This implies that a general purpose language needs to be used. I hate having to write a script to "clean up" the data and spit it out, then another to read it back in a completely different environment, etc. I find the friction of using MATLAB/R for some of my work and a completely different language with a completely different address space and way of thinking for the rest to be a huge source of friction. Furthermore, I know there are glue layers that exist, but they always seem to be horribly complicated and a source of friction.

    Read the article

  • vectorizing a for loop in numpy/scipy?

    - by user248237
    I'm trying to vectorize a for loop that I have inside of a class method. The for loop has the following form: it iterates through a bunch of points and depending on whether a certain variable (called "self.condition_met" below) is true, calls a pair of functions on the point, and adds the result to a list. Each point here is an element in a vector of lists, i.e. a data structure that looks like array([[1,2,3], [4,5,6], ...]). Here is the problematic function: def myClass: def my_inefficient_method(self): final_vector = [] # Assume 'my_vector' and 'my_other_vector' are defined numpy arrays for point in all_points: if not self.condition_met: a = self.my_func1(point, my_vector) b = self.my_func2(point, my_other_vector) else: a = self.my_func3(point, my_vector) b = self.my_func4(point, my_other_vector) c = a + b final_vector.append(c) # Choose random element from resulting vector 'final_vector' self.condition_met is set before my_inefficient_method is called, so it seems unnecessary to check it each time, but I am not sure how to better write this. Since there are no destructive operations here it is seems like I could rewrite this entire thing as a vectorized operation -- is that possible? any ideas how to do this?

    Read the article

  • vectorizing loops in Matlab - performance issues

    - by Gacek
    This question is related to these two: http://stackoverflow.com/questions/2867901/introduction-to-vectorizing-in-matlab-any-good-tutorials http://stackoverflow.com/questions/2561617/filter-that-uses-elements-from-two-arrays-at-the-same-time Basing on the tutorials I read, I was trying to vectorize some procedure that takes really a lot of time. I've rewritten this: function B = bfltGray(A,w,sigma_r) dim = size(A); B = zeros(dim); for i = 1:dim(1) for j = 1:dim(2) % Extract local region. iMin = max(i-w,1); iMax = min(i+w,dim(1)); jMin = max(j-w,1); jMax = min(j+w,dim(2)); I = A(iMin:iMax,jMin:jMax); % Compute Gaussian intensity weights. F = exp(-0.5*(abs(I-A(i,j))/sigma_r).^2); B(i,j) = sum(F(:).*I(:))/sum(F(:)); end end into this: function B = rngVect(A, w, sigma) W = 2*w+1; I = padarray(A, [w,w],'symmetric'); I = im2col(I, [W,W]); H = exp(-0.5*(abs(I-repmat(A(:)', size(I,1),1))/sigma).^2); B = reshape(sum(H.*I,1)./sum(H,1), size(A, 1), []); But this version seems to be as slow as the first one, but in addition it uses a lot of memory and sometimes causes memory problems. I suppose I've made something wrong. Probably some logic mistake regarding vectorizing. Well, in fact I'm not surprised - this method creates really big matrices and probably the computations are proportionally longer. I have also tried to write it using nlfilter (similar to the second solution given by Jonas) but it seems to be hard since I use Matlab 6.5 (R13) (there are no sophisticated function handles available). So once again, I'm asking not for ready solution, but for some ideas that would help me to solve this in reasonable time. Maybe you will point me what I did wrong.

    Read the article

  • Introduction to vectorizing in MATLAB - any good tutorials?

    - by Gacek
    I'm looking for any good tutorials on vectorizing (loops) in MATLAB. I have quite simple algorithm, but it uses two for loops. I know that it should be simple to vectorize it and I would like to learn how to do it instead of asking you for the solution. But to let you know what problem I have, so you would be able to suggest best tutorials that are showing how to solve similar problems, here's the outline of my problem: B = zeros(size(A)); % //A is a given matrix. for i=1:size(A,1) for j=1:size(A,2) H = ... %// take some surrounding elements of the element at position (i,j) (i.e. using mask 3x3 elements) B(i,j) = computeSth(H); %// compute something on selected elements and place it in B end end So, I'm NOT asking for the solution. I'm asking for a good tutorials, examples of vectorizing loops in MATLAB. I would like to learn how to do it and do it on my own.

    Read the article

  • R optimization: How can I avoid a for loop in this situation?

    - by chrisamiller
    I'm trying to do a simple genomic track intersection in R, and running into major performance problems, probably related to my use of for loops. In this situation, I have pre-defined windows at intervals of 100bp and I'm trying to calculate how much of each window is covered by the annotations in mylist. Graphically, it looks something like this: 0 100 200 300 400 500 600 windows: |-----|-----|-----|-----|-----|-----| mylist: |-| |-----------| So I wrote some code to do just that, but it's fairly slow and has become a bottleneck in my code: ##window for each 100-bp segment windows <- numeric(6) ##second track mylist = vector("list") mylist[[1]] = c(1,20) mylist[[2]] = c(120,320) ##do the intersection for(i in 1:length(mylist)){ st <- floor(mylist[[i]][1]/100)+1 sp <- floor(mylist[[i]][2]/100)+1 for(j in st:sp){ b <- max((j-1)*100, mylist[[i]][1]) e <- min(j*100, mylist[[i]][2]) windows[j] <- windows[j] + e - b + 1 } } print(windows) [1] 20 81 101 21 0 0 Naturally, this is being used on data sets that are much larger than the example I provide here. Through some profiling, I can see that the bottleneck is in the for loops, but my clumsy attempt to vectorize it using *apply functions resulted in code that runs an order of magnitude more slowly. I suppose I could write something in C, but I'd like to avoid that if possible. Can anyone suggest another approach that will speed this calculation up?

    Read the article

  • Need help converting Ruby code to php code

    - by newprog
    Yesterday I posted this queston. Today I found the code which I need but written in Ruby. Some parts of code I have understood (I don't know Ruby) but there is one part that I can't. I think people who know ruby and php can help me understand this code. def do_create(image) # Clear any old info in case of a re-submit FIELDS_TO_CLEAR.each { |field| image.send(field+'=', nil) } image.save # Compose request vm_params = Hash.new # Submitting a file in ruby requires opening it and then reading the contents into the post body file = File.open(image.filename_in, "rb") # Populate the parameters and compute the signature # Normally you would do this in a subroutine - for maximum clarity all # parameters are explicitly spelled out here. vm_params["image"] = file # Contents will be read by the multipart object created below vm_params["image_checksum"] = image.image_checksum vm_params["start_job"] = 'vectorize' vm_params["image_type"] = image.image_type if image.image_type != 'none' vm_params["image_complexity"] = image.image_complexity if image.image_complexity != 'none' vm_params["image_num_colors"] = image.image_num_colors if image.image_num_colors != '' vm_params["image_colors"] = image.image_colors if image.image_colors != '' vm_params["expire_at"] = image.expire_at if image.expire_at != '' vm_params["licensee_id"] = DEVELOPER_ID #in php it's like this $vm_params["sequence_number"] = -rand(100000000);????? vm_params["sequence_number"] = Kernel.rand(1000000000) # Use a negative value to force an error when calling the test server vm_params["timestamp"] = Time.new.utc.httpdate string_to_sign = CREATE_URL + # Start out with the URL being called... #vm_params["image"].to_s + # ... don't include the file per se - use the checksum instead vm_params["image_checksum"].to_s + # ... then include all regular parameters vm_params["start_job"].to_s + vm_params["image_type"].to_s + vm_params["image_complexity"].to_s + # (nil.to_s => '', so this is fine for vm_params we don't use) vm_params["image_num_colors"].to_s + vm_params["image_colors"].to_s + vm_params["expire_at"].to_s + vm_params["licensee_id"].to_s + # ... then do all the security parameters vm_params["sequence_number"].to_s + vm_params["timestamp"].to_s vm_params["signature"] = sign(string_to_sign) #no problem # Workaround class for handling multipart posts mp = Multipart::MultipartPost.new query, headers = mp.prepare_query(vm_params) # Handles the file parameter in a special way (see /lib/multipart.rb) file.close # mp has read the contents, we can close the file now response = post_form(URI.parse(CREATE_URL), query, headers) logger.info(response.body) response_hash = ActiveSupport::JSON.decode(response.body) # Decode the JSON response string ##I have understood below def sign(string_to_sign) #logger.info("String to sign: '#{string_to_sign}'") Base64.encode64(HMAC::SHA1.digest(DEVELOPER_KEY, string_to_sign)) end # Within Multipart modul I have this: class MultipartPost BOUNDARY = 'tarsiers-rule0000' HEADER = {"Content-type" => "multipart/form-data, boundary=" + BOUNDARY + " "} def prepare_query (params) fp = [] params.each {|k,v| if v.respond_to?(:read) fp.push(FileParam.new(k, v.path, v.read)) else fp.push(Param.new(k,v)) end } query = fp.collect {|p| "--" + BOUNDARY + "\r\n" + p.to_multipart }.join("") + "--" + BOUNDARY + "--" return query, HEADER end end end Thanks for your help.

    Read the article

  • Optimizing Solaris 11 SHA-1 on Intel Processors

    - by danx
    SHA-1 is a "hash" or "digest" operation that produces a 160 bit (20 byte) checksum value on arbitrary data, such as a file. It is intended to uniquely identify text and to verify it hasn't been modified. Max Locktyukhin and others at Intel have improved the performance of the SHA-1 digest algorithm using multiple techniques. This code has been incorporated into Solaris 11 and is available in the Solaris Crypto Framework via the libmd(3LIB), the industry-standard libpkcs11(3LIB) library, and Solaris kernel module sha1. The optimized code is used automatically on systems with a x86 CPU supporting SSSE3 (Intel Supplemental SSSE3). Intel microprocessor architectures that support SSSE3 include Nehalem, Westmere, Sandy Bridge microprocessor families. Further optimizations are available for microprocessors that support AVX (such as Sandy Bridge). Although SHA-1 is considered obsolete because of weaknesses found in the SHA-1 algorithm—NIST recommends using at least SHA-256, SHA-1 is still widely used and will be with us for awhile more. Collisions (the same SHA-1 result for two different inputs) can be found with moderate effort. SHA-1 is used heavily though in SSL/TLS, for example. And SHA-1 is stronger than the older MD5 digest algorithm, another digest option defined in SSL/TLS. Optimizations Review SHA-1 operates by reading an arbitrary amount of data. The data is read in 512 bit (64 byte) blocks (the last block is padded in a specific way to ensure it's a full 64 bytes). Each 64 byte block has 80 "rounds" of calculations (consisting of a mixture of "ROTATE-LEFT", "AND", and "XOR") applied to the block. Each round produces a 32-bit intermediate result, called W[i]. Here's what each round operates: The first 16 rounds, rounds 0 to 15, read the 512 bit block 32 bits at-a-time. These 32 bits is used as input to the round. The remaining rounds, rounds 16 to 79, use the results from the previous rounds as input. Specifically for round i it XORs the results of rounds i-3, i-8, i-14, and i-16 and rotates the result left 1 bit. The remaining calculations for the round is a series of AND, XOR, and ROTATE-LEFT operators on the 32-bit input and some constants. The 32-bit result is saved as W[i] for round i. The 32-bit result of the final round, W[79], is the SHA-1 checksum. Optimization: Vectorization The first 16 rounds can be vectorized (computed in parallel) because they don't depend on the output of a previous round. As for the remaining rounds, because of step 2 above, computing round i depends on the results of round i-3, W[i-3], one can vectorize 3 rounds at-a-time. Max Locktyukhin found through simple factoring, explained in detail in his article referenced below, that the dependencies of round i on the results of rounds i-3, i-8, i-14, and i-16 can be replaced instead with dependencies on the results of rounds i-6, i-16, i-28, and i-32. That is, instead of initializing intermediate result W[i] with: W[i] = (W[i-3] XOR W[i-8] XOR W[i-14] XOR W[i-16]) ROTATE-LEFT 1 Initialize W[i] as follows: W[i] = (W[i-6] XOR W[i-16] XOR W[i-28] XOR W[i-32]) ROTATE-LEFT 2 That means that 6 rounds could be vectorized at once, with no additional calculations, instead of just 3! This optimization is independent of Intel or any other microprocessor architecture, although the microprocessor has to support vectorization to use it, and exploits one of the weaknesses of SHA-1. Optimization: SSSE3 Intel SSSE3 makes use of 16 %xmm registers, each 128 bits wide. The 4 32-bit inputs to a round, W[i-6], W[i-16], W[i-28], W[i-32], all fit in one %xmm register. The following code snippet, from Max Locktyukhin's article, converted to ATT assembly syntax, computes 4 rounds in parallel with just a dozen or so SSSE3 instructions: movdqa W_minus_04, W_TMP pxor W_minus_28, W // W equals W[i-32:i-29] before XOR // W = W[i-32:i-29] ^ W[i-28:i-25] palignr $8, W_minus_08, W_TMP // W_TMP = W[i-6:i-3], combined from // W[i-4:i-1] and W[i-8:i-5] vectors pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) movdqa W, W_TMP // 4 dwords in W are rotated left by 2 psrld $30, W // rotate left by 2 W = (W >> 30) | (W << 2) pslld $2, W_TMP por W, W_TMP movdqa W_TMP, W // four new W values W[i:i+3] are now calculated paddd (K_XMM), W_TMP // adding 4 current round's values of K movdqa W_TMP, (WK(i)) // storing for downstream GPR instructions to read A window of the 32 previous results, W[i-1] to W[i-32] is saved in memory on the stack. This is best illustrated with a chart. Without vectorization, computing the rounds is like this (each "R" represents 1 round of SHA-1 computation): RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR With vectorization, 4 rounds can be computed in parallel: RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR RRRRRRRRRRRRRRRRRRRR Optimization: AVX The new "Sandy Bridge" microprocessor architecture, which supports AVX, allows another interesting optimization. SSSE3 instructions have two operands, a input and an output. AVX allows three operands, two inputs and an output. In many cases two SSSE3 instructions can be combined into one AVX instruction. The difference is best illustrated with an example. Consider these two instructions from the snippet above: pxor W_minus_16, W // W = (W[i-32:i-29] ^ W[i-28:i-25]) ^ W[i-16:i-13] pxor W_TMP, W // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) With AVX they can be combined in one instruction: vpxor W_minus_16, W, W_TMP // W = (W[i-32:i-29] ^ W[i-28:i-25] ^ W[i-16:i-13]) ^ W[i-6:i-3]) This optimization is also in Solaris, although Sandy Bridge-based systems aren't widely available yet. As an exercise for the reader, AVX also has 256-bit media registers, %ymm0 - %ymm15 (a superset of 128-bit %xmm0 - %xmm15). Can %ymm registers be used to parallelize the code even more? Optimization: Solaris-specific In addition to using the Intel code described above, I performed other minor optimizations to the Solaris SHA-1 code: Increased the digest(1) and mac(1) command's buffer size from 4K to 64K, as previously done for decrypt(1) and encrypt(1). This size is well suited for ZFS file systems, but helps for other file systems as well. Optimized encode functions, which byte swap the input and output data, to copy/byte-swap 4 or 8 bytes at-a-time instead of 1 byte-at-a-time. Enhanced the Solaris mdb(1) and kmdb(1) debuggers to display all 16 %xmm and %ymm registers (mdb "$x" command). Previously they only displayed the first 8 that are available in 32-bit mode. Can't optimize if you can't debug :-). Changed the SHA-1 code to allow processing in "chunks" greater than 2 Gigabytes (64-bits) Performance I measured performance on a Sun Ultra 27 (which has a Nehalem-class Xeon 5500 Intel W3570 microprocessor @3.2GHz). Turbo mode is disabled for consistent performance measurement. Graphs are better than words and numbers, so here they are: The first graph shows the Solaris digest(1) command before and after the optimizations discussed here, contained in libmd(3LIB). I ran the digest command on a half GByte file in swapfs (/tmp) and execution time decreased from 1.35 seconds to 0.98 seconds. The second graph shows the the results of an internal microbenchmark that uses the Solaris libpkcs11(3LIB) library. The operations are on a 128 byte buffer with 10,000 iterations. The results show operations increased from 320,000 to 416,000 operations per second. Finally the third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. third graph shows the results of an internal kernel microbenchmark that uses the Solaris /kernel/crypto/amd64/sha1 module. The operations are on a 64Kbyte buffer with 100 iterations. The results show for 1 kernel thread, operations increased from 410 to 600 MBytes/second. For 8 kernel threads, operations increase from 1540 to 1940 MBytes/second. Availability This code is in Solaris 11 FCS. It is available in the 64-bit libmd(3LIB) library for 64-bit programs and is in the Solaris kernel. You must be running hardware that supports Intel's SSSE3 instructions (for example, Intel Nehalem, Westmere, or Sandy Bridge microprocessor architectures). The easiest way to determine if SSSE3 is available is with the isainfo(1) command. For example, nehalem $ isainfo -v $ isainfo -v 64-bit amd64 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu If the output also shows "avx", the Solaris executes the even-more optimized 3-operand AVX instructions for SHA-1 mentioned above: sandybridge $ isainfo -v 64-bit amd64 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu 32-bit i386 applications avx xsave pclmulqdq aes sse4.2 sse4.1 ssse3 popcnt tscp ahf cx16 sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu No special configuration or setup is needed to take advantage of this code. Solaris libraries and kernel automatically determine if it's running on SSSE3 or AVX-capable machines and execute the correctly-tuned code for that microprocessor. Summary The Solaris 11 Crypto Framework, via the sha1 kernel module and libmd(3LIB) and libpkcs11(3LIB) libraries, incorporated a useful SHA-1 optimization from Intel for SSSE3-capable microprocessors. As with other Solaris optimizations, they come automatically "under the hood" with the current Solaris release. References "Improving the Performance of the Secure Hash Algorithm (SHA-1)" by Max Locktyukhin (Intel, March 2010). The source for these SHA-1 optimizations used in Solaris "SHA-1", Wikipedia Good overview of SHA-1 FIPS 180-1 SHA-1 standard (FIPS, 1995) NIST Comments on Cryptanalytic Attacks on SHA-1 (2005, revised 2006)

    Read the article

  • Processing Kinect v2 Color Streams in Parallel

    - by Chris Gardner
    Originally posted on: http://geekswithblogs.net/freestylecoding/archive/2014/08/20/processing-kinect-v2-color-streams-in-parallel.aspxProcessing Kinect v2 Color Streams in Parallel I've really been enjoying being a part of the Kinect for Windows Developer's Preview. The new hardware has some really impressive capabilities. However, with great power comes great system specs. Unfortunately, my little laptop that could is not 100% up to the task; I've had to get a little creative. The most disappointing thing I've run into is that I can't always cleanly display the color camera stream in managed code. I managed to strip the code down to what I believe is the bear minimum: using( ColorFrame _ColorFrame = e.FrameReference.AcquireFrame() ) { if( null == _ColorFrame ) return;   BitmapToDisplay.Lock(); _ColorFrame.CopyConvertedFrameDataToIntPtr( BitmapToDisplay.BackBuffer, Convert.ToUInt32( BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight ), ColorImageFormat.Bgra ); BitmapToDisplay.AddDirtyRect( new Int32Rect( 0, 0, _ColorFrame.FrameDescription.Width, _ColorFrame.FrameDescription.Height ) ); BitmapToDisplay.Unlock(); } With this snippet, I'm placing the converted Bgra32 color stream directly on the BackBuffer of the WriteableBitmap. This gives me pretty smooth playback, but I still get the occasional freeze for half a second. After a bit of profiling, I discovered there were a few problems. The first problem is the size of the buffer along with the conversion on the buffer. At this time, the raw image format of the data from the Kinect is Yuy2. This is great for direct video processing. It would be ideal if I had a WriteableVideo object in WPF. However, this is not the case. Further digging led me to the real problem. It appears that the SDK is converting the input serially. Let's think about this for a second. The color camera is a 1080p camera. As we should all know, this give us a native resolution of 1920 x 1080. This produces 2,073,600 pixels. Yuy2 uses 4 bytes per 2 pixel, for a buffer size of 4,147,200 bytes. Bgra32 uses 4 bytes per pixel, for a buffer size of 8,294,400 bytes. The SDK appears to be doing this on one thread. I started wondering if I chould do this better myself. I mean, I have 8 cores in my system. Why can't I use them all? The first problem is converting a Yuy2 frame into a Bgra32 frame. It is NOT trivial. I spent a day of research of just how to do this. In the end, I didn't even produce the best algorithm possible, but it did work. After I managed to get that to work, I knew my next step was the get the conversion operation off the UI Thread. This was a simple process of throwing the work into a Task. Of course, this meant I had to marshal the final write to the WriteableBitmap back to the UI thread. Finally, I needed to vectorize the operation so I could run it safely in parallel. This was, mercifully, not quite as hard as I thought it would be. I had my loop return an index to a pair of pixels. From there, I had to tell the loop to do everything for this pair of pixels. If you're wondering why I did it for pairs of pixels, look back above at the specification for the Yuy2 format. I won't go into full detail on why each 4 bytes contains 2 pixels of information, but rest assured that there is a reason why the format is described in that way. The first working attempt at this algorithm successfully turned my poor laptop into a space heater. I very quickly brought and maintained all 8 cores up to about 97% usage. That's when I remembered that obscure option in the Task Parallel Library where you could limit the amount of parallelism used. After a little trial and error, I discovered 4 parallel tasks was enough for most cases. This yielded the follow code: private byte ClipToByte( int p_ValueToClip ) { return Convert.ToByte( ( p_ValueToClip < byte.MinValue ) ? byte.MinValue : ( ( p_ValueToClip > byte.MaxValue ) ? byte.MaxValue : p_ValueToClip ) ); }   private void ColorFrameArrived( object sender, ColorFrameArrivedEventArgs e ) { if( null == e.FrameReference ) return;   // If you do not dispose of the frame, you never get another one... using( ColorFrame _ColorFrame = e.FrameReference.AcquireFrame() ) { if( null == _ColorFrame ) return;   byte[] _InputImage = new byte[_ColorFrame.FrameDescription.LengthInPixels * _ColorFrame.FrameDescription.BytesPerPixel]; byte[] _OutputImage = new byte[BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight]; _ColorFrame.CopyRawFrameDataToArray( _InputImage );   Task.Factory.StartNew( () => { ParallelOptions _ParallelOptions = new ParallelOptions(); _ParallelOptions.MaxDegreeOfParallelism = 4;   Parallel.For( 0, Sensor.ColorFrameSource.FrameDescription.LengthInPixels / 2, _ParallelOptions, ( _Index ) => { // See http://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx int _Y0 = _InputImage[( _Index << 2 ) + 0] - 16; int _U = _InputImage[( _Index << 2 ) + 1] - 128; int _Y1 = _InputImage[( _Index << 2 ) + 2] - 16; int _V = _InputImage[( _Index << 2 ) + 3] - 128;   byte _R = ClipToByte( ( 298 * _Y0 + 409 * _V + 128 ) >> 8 ); byte _G = ClipToByte( ( 298 * _Y0 - 100 * _U - 208 * _V + 128 ) >> 8 ); byte _B = ClipToByte( ( 298 * _Y0 + 516 * _U + 128 ) >> 8 );   _OutputImage[( _Index << 3 ) + 0] = _B; _OutputImage[( _Index << 3 ) + 1] = _G; _OutputImage[( _Index << 3 ) + 2] = _R; _OutputImage[( _Index << 3 ) + 3] = 0xFF; // A   _R = ClipToByte( ( 298 * _Y1 + 409 * _V + 128 ) >> 8 ); _G = ClipToByte( ( 298 * _Y1 - 100 * _U - 208 * _V + 128 ) >> 8 ); _B = ClipToByte( ( 298 * _Y1 + 516 * _U + 128 ) >> 8 );   _OutputImage[( _Index << 3 ) + 4] = _B; _OutputImage[( _Index << 3 ) + 5] = _G; _OutputImage[( _Index << 3 ) + 6] = _R; _OutputImage[( _Index << 3 ) + 7] = 0xFF; } );   Application.Current.Dispatcher.Invoke( () => { BitmapToDisplay.WritePixels( new Int32Rect( 0, 0, Sensor.ColorFrameSource.FrameDescription.Width, Sensor.ColorFrameSource.FrameDescription.Height ), _OutputImage, BitmapToDisplay.BackBufferStride, 0 ); } ); } ); } } This seemed to yield a results I wanted, but there was still the occasional stutter. This lead to what I realized was the second problem. There is a race condition between the UI Thread and me locking the WriteableBitmap so I can write the next frame. Again, I'm writing approximately 8MB to the back buffer. Then, I started thinking I could cheat. The Kinect is running at 30 frames per second. The WPF UI Thread runs at 60 frames per second. This made me not feel bad about exploiting the Composition Thread. I moved the bulk of the code from the FrameArrived handler into CompositionTarget.Rendering. Once I was in there, I polled from a frame, and rendered it if it existed. Since, in theory, I'm only killing the Composition Thread every other hit, I decided I was ok with this for cases where silky smooth video performance REALLY mattered. This ode looked like this: private byte ClipToByte( int p_ValueToClip ) { return Convert.ToByte( ( p_ValueToClip < byte.MinValue ) ? byte.MinValue : ( ( p_ValueToClip > byte.MaxValue ) ? byte.MaxValue : p_ValueToClip ) ); }   void CompositionTarget_Rendering( object sender, EventArgs e ) { using( ColorFrame _ColorFrame = FrameReader.AcquireLatestFrame() ) { if( null == _ColorFrame ) return;   byte[] _InputImage = new byte[_ColorFrame.FrameDescription.LengthInPixels * _ColorFrame.FrameDescription.BytesPerPixel]; byte[] _OutputImage = new byte[BitmapToDisplay.BackBufferStride * BitmapToDisplay.PixelHeight]; _ColorFrame.CopyRawFrameDataToArray( _InputImage );   ParallelOptions _ParallelOptions = new ParallelOptions(); _ParallelOptions.MaxDegreeOfParallelism = 4;   Parallel.For( 0, Sensor.ColorFrameSource.FrameDescription.LengthInPixels / 2, _ParallelOptions, ( _Index ) => { // See http://msdn.microsoft.com/en-us/library/windows/desktop/dd206750(v=vs.85).aspx int _Y0 = _InputImage[( _Index << 2 ) + 0] - 16; int _U = _InputImage[( _Index << 2 ) + 1] - 128; int _Y1 = _InputImage[( _Index << 2 ) + 2] - 16; int _V = _InputImage[( _Index << 2 ) + 3] - 128;   byte _R = ClipToByte( ( 298 * _Y0 + 409 * _V + 128 ) >> 8 ); byte _G = ClipToByte( ( 298 * _Y0 - 100 * _U - 208 * _V + 128 ) >> 8 ); byte _B = ClipToByte( ( 298 * _Y0 + 516 * _U + 128 ) >> 8 );   _OutputImage[( _Index << 3 ) + 0] = _B; _OutputImage[( _Index << 3 ) + 1] = _G; _OutputImage[( _Index << 3 ) + 2] = _R; _OutputImage[( _Index << 3 ) + 3] = 0xFF; // A   _R = ClipToByte( ( 298 * _Y1 + 409 * _V + 128 ) >> 8 ); _G = ClipToByte( ( 298 * _Y1 - 100 * _U - 208 * _V + 128 ) >> 8 ); _B = ClipToByte( ( 298 * _Y1 + 516 * _U + 128 ) >> 8 );   _OutputImage[( _Index << 3 ) + 4] = _B; _OutputImage[( _Index << 3 ) + 5] = _G; _OutputImage[( _Index << 3 ) + 6] = _R; _OutputImage[( _Index << 3 ) + 7] = 0xFF; } );   BitmapToDisplay.WritePixels( new Int32Rect( 0, 0, Sensor.ColorFrameSource.FrameDescription.Width, Sensor.ColorFrameSource.FrameDescription.Height ), _OutputImage, BitmapToDisplay.BackBufferStride, 0 ); } }

    Read the article

1