Search Results

Search found 266 results on 11 pages for 'multiplication'.


  • Multiplication algorithm for arbitrary precision (bignum) integers.

    - by nn
    Hi, I'm writing a small bignum library for a homework project. I am to implement Karatsuba multiplication, but before that I would like to write a naive multiplication routine. I'm following a guide written by Paul Zimmermann titled "Modern Computer Arithmetic", which is freely available online. On page 4, there is a description of an algorithm titled BasecaseMultiply which performs grade-school multiplication. I understand steps 2 and 3, where B^j is a shift by one digit, done j times. But I don't understand steps 1 and 3, where we have A*b_j. How is this multiplication meant to be carried out if the bignum multiplication hasn't been defined yet? Would the operation "*" in this algorithm just be the repeated addition method? Here are the parts I have written thus far. I have unit tested them, so they appear to be correct for the most part. The structure I use for my bignum is as follows: #define BIGNUM_DIGITS 2048 typedef uint32_t u_hw; // halfword typedef uint64_t u_w; // word typedef struct { unsigned int sign; // 0 or 1 unsigned int n_digits; u_hw digits[BIGNUM_DIGITS]; } bn; Currently available routines: bn *bn_add(bn *a, bn *b); // returns a+b as a newly allocated bn void bn_lshift(bn *b, int d); // shifts d digits to the left, retains sign int bn_cmp(bn *a, bn *b); // returns 1 if a>b, 0 if a=b, -1 if a<b
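    For reference: in BasecaseMultiply, A*b_j is a bignum multiplied by a single digit, which needs neither a full bignum multiplication nor repeated addition; one pass with a word-sized accumulator and carry is enough. Below is a minimal sketch against the bn structure above. The helper name bn_addmul_digit is hypothetical, and it assumes the result's digits start out zeroed and that rows are processed in increasing j.

        // One inner step of BasecaseMultiply: r += (a * d) shifted left j digits.
        // A u_hw * u_hw product fits in a u_w, so the carry chain stays in one word.
        void bn_addmul_digit(bn *r, const bn *a, u_hw d, int j)
        {
            u_w carry = 0;
            for (unsigned int i = 0; i < a->n_digits; i++) {
                u_w t = (u_w)a->digits[i] * d + r->digits[i + j] + carry;
                r->digits[i + j] = (u_hw)t; // keep the low halfword
                carry = t >> 32;            // the high halfword carries over
            }
            r->digits[a->n_digits + j] += (u_hw)carry;
            if (r->n_digits < a->n_digits + j + 1)
                r->n_digits = a->n_digits + j + 1;
        }

    Calling this once per digit b_j of B, with j as the shift, yields the full schoolbook product in O(n*m) digit operations, whereas repeated addition would be exponential in the digit count.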

    Read the article

  • What is Chain Matrix Multiplication?

    - by Hellnar
    Hello, I am trying to understand what chain matrix multiplication is and how it is different from regular multiplication. I have checked several sources, yet all seem to be explained too academically for me to understand. I guess it is a form of dynamic programming algorithm that achieves the operation in an optimised way, but I didn't get any further. Thanks
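    For reference: the term usually refers to the matrix chain ordering problem. Given a chain A1 A2 ... An, the product is the same however you parenthesise it, but the number of scalar multiplications is not, and dynamic programming picks the cheapest split points. A minimal sketch of the classic O(n^3) recurrence (illustrative only, not from the question):

        #include <algorithm>
        #include <climits>
        #include <vector>

        // dims has n+1 entries; matrix i is dims[i] x dims[i+1].
        // Returns the minimum number of scalar multiplications for the chain.
        long long matrixChainCost(const std::vector<int>& dims)
        {
            int n = (int)dims.size() - 1;
            std::vector<std::vector<long long>> cost(n, std::vector<long long>(n, 0));
            for (int len = 2; len <= n; len++) {           // sub-chain length
                for (int i = 0; i + len - 1 < n; i++) {
                    int j = i + len - 1;
                    cost[i][j] = LLONG_MAX;
                    for (int k = i; k < j; k++)            // split (Ai..Ak)(Ak+1..Aj)
                        cost[i][j] = std::min(cost[i][j],
                            cost[i][k] + cost[k + 1][j]
                            + (long long)dims[i] * dims[k + 1] * dims[j + 1]);
                }
            }
            return cost[0][n - 1];
        }

    For dims {10, 100, 5, 50}, grouping as (A1 A2) A3 costs 10*100*5 + 10*5*50 = 7500 scalar multiplications against 75000 the other way, and the function returns 7500. Regular matrix multiplication is still used for each pairwise product; the algorithm only chooses the order.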

    Read the article

  • Repeated Squaring - Matrix Multiplication using NEWMAT

    - by Dinakar Kulkarni
    I'm trying to use the repeated squaring algorithm (using recursion) to perform matrix exponentiation. I've included header files from the NEWMAT library instead of using arrays. The original matrix has elements in the range (-5,5), all numbers being of type float. # include "C:\User\newmat10\newmat.h" # include "C:\User\newmat10\newmatio.h" # include "C:\User\newmat10\newmatap.h" # include <iostream> # include <time.h> # include <ctime> # include <cstdlib> # include <iomanip> using namespace std; Matrix repeated_squaring(Matrix A, int exponent, int n) //Recursive function { A(n,n); IdentityMatrix I(n); if (exponent == 0) //Matrix raised to zero returns an Identity Matrix return I; else { if ( exponent%2 == 1 ) // if exponent is odd return (A * repeated_squaring (A*A, (exponent-1)/2, n)); else //if exponent is even return (A * repeated_squaring( A*A, exponent/2, n)); } } Matrix direct_squaring(Matrix B, int k, int no) //Brute Force Multiplication { B(no,no); Matrix C = B; for (int i = 1; i <= k; i++) C = B*C; return C; } //----Creating a matrix with elements b/w (-5,5)---- float unifRandom() { int a = -5; int b = 5; float temp = (float)((b-a)*( rand()/RAND_MAX) + a); return temp; } Matrix initialize_mat(Matrix H, int ord) { H(ord,ord); for (int y = 1; y <= ord; y++) for(int z = 1; z<= ord; z++) H(y,z) = unifRandom(); return(H); } //--------------------------------------------------- void main() { int exponent, dimension; cout<<"Insert exponent:"<<endl; cin>>exponent; cout<< "Insert dimension:"<<endl; cin>>dimension; cout<<"The number of rows/columns in the square matrix is: "<<dimension<<endl; cout<<"The exponent is: "<<exponent<<endl; Matrix A(dimension,dimension),B(dimension,dimension); Matrix C(dimension,dimension),D(dimension,dimension); B= initialize_mat(A,dimension); cout<<"Initial Matrix: "<<endl; cout<<setw(5)<<setprecision(2)<<B<<endl; //----------------------------------------------------------------------------- cout<<"Repeated Squaring Result: "<<endl; clock_t time_before1 = clock(); C = repeated_squaring (B, exponent , dimension); cout<< setw(5) <<setprecision(2) <<C; clock_t time_after1 = clock(); float diff1 = ((float) time_after1 - (float) time_before1); cout << "It took " << diff1/CLOCKS_PER_SEC << " seconds to complete" << endl<<endl; //--------------------------------------------------------------------------------- cout<<"Direct Squaring Result:"<<endl; clock_t time_before2 = clock(); D = direct_squaring (B, exponent , dimension); cout<<setw(5)<<setprecision(2)<<D; clock_t time_after2 = clock(); float diff2 = ((float) time_after2 - (float) time_before2); cout << "It took " << diff2/CLOCKS_PER_SEC << " seconds to complete" << endl<<endl; } I face the following problems: The random number generator returns only "-5" as each element in the output. The matrix multiplication yields different results with brute-force multiplication and with the repeated squaring algorithm. I'm timing the execution time of my code to compare the times taken by brute force multiplication and by repeated squaring. Could someone please find out what's wrong with the recursion and with the matrix initialization? NOTE: While compiling this program, make sure you've imported the NEWMAT library. Thanks in advance!
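    For reference, two things in the code above look like the likely culprits; sketched fixes below, assuming the NEWMAT types as used there.

        // 1) rand()/RAND_MAX is integer division: it is 0 for every draw
        //    below RAND_MAX, so temp collapses to a == -5 every time.
        //    Force floating-point division instead:
        float unifRandom()
        {
            int a = -5, b = 5;
            return (b - a) * ((float)rand() / (float)RAND_MAX) + a;
        }

        // 2) The even branch multiplies by an extra A: A^(2m) = (A*A)^m,
        //    with no leading factor.
        Matrix repeated_squaring(Matrix A, int exponent, int n)
        {
            IdentityMatrix I(n);
            if (exponent == 0) return I;
            if (exponent % 2 == 1)
                return A * repeated_squaring(A * A, (exponent - 1) / 2, n);
            return repeated_squaring(A * A, exponent / 2, n);
        }

    Note also that direct_squaring starts from C = B and multiplies k more times, so it computes B^(k+1) rather than B^k, which by itself makes the two paths disagree; and the bare statements A(n,n) and B(no,no) are element accesses whose results are discarded, so they can simply be removed.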

    Read the article

  • Faster Matrix Multiplication in C#

    - by Kyle Lahnakoski
    I have a small C# project that involves matrices. I am processing large amounts of data by splitting it into n-length chunks, treating the chunks as vectors, and multiplying by a Vandermonde** matrix. The problem is, depending on the conditions, the size of the chunks and the corresponding Vandermonde** matrix can vary. I have a general solution which is easy to read, but way too slow: public byte[] addBlockRedundancy(byte[] data) { if (data.Length!=numGood) D.error("Expecting data to be just "+numGood+" bytes long"); aMatrix d=aMatrix.newColumnMatrix(this.mod, data); var r=vandermonde.multiplyBy(d); return r.ToByteArray(); }//method This can process about 1/4 megabyte per second on my i5 U470 @ 1.33GHz. I can make this faster by manually inlining the matrix multiplication: int o=0; int d=0; for (d=0; d<data.Length-numGood; d+=numGood) { for (int r=0; r<numGood+numRedundant; r++) { Byte value=0; for (int c=0; c<numGood; c++) { value=mod.Add(value, mod.Multiply(vandermonde.get(r, c), data[d+c])); }//for output[r][o]=value; }//for o++; }//for This can process about 1 megabyte per second. (Please note the "mod" is performing operations over GF(2^8) modulo my favorite irreducible polynomial.) I know this can get a lot faster: After all, the Vandermonde** matrix is mostly zeros. I should be able to make a routine, or find a routine, that can take my matrix and return an optimized method which will effectively multiply vectors by the given matrix, but faster. Then, when I give this routine a 5x5 Vandermonde matrix (the identity matrix), there is simply no arithmetic to perform, and the original data is just copied. ** Please note: When I use the term "Vandermonde", I actually mean an identity matrix with some number of rows from the Vandermonde matrix appended (see comments). This matrix is wonderful because of all the zeros, and because if you remove enough rows (of your choosing) to make it square, it is an invertible matrix. And, of course, I would like to use this same routine to convert any one of those inverted matrices into an optimized series of instructions. How can I make this matrix multiplication faster? Thanks! (edited to correct my mistake with Vandermonde matrix)
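    For reference, a low-tech version of the "compile the matrix" idea, sketched in C++ since the idea is language-independent: precompute the list of nonzero coefficients once per matrix, then multiply only over that list, so identity rows cost one XOR each and zeros cost nothing. gfMul below is a stand-in for the poster's "mod" object; it uses AES's irreducible polynomial 0x11B as an example, which may differ from the one actually in use.

        #include <cstdint>
        #include <vector>

        // Carry-less "Russian peasant" multiply in GF(2^8).
        uint8_t gfMul(uint8_t a, uint8_t b)
        {
            uint8_t p = 0;
            while (b) {
                if (b & 1) p ^= a;     // addition in GF(2^8) is XOR
                bool hi = a & 0x80;
                a <<= 1;
                if (hi) a ^= 0x1B;     // reduce by x^8+x^4+x^3+x+1 (example polynomial)
                b >>= 1;
            }
            return p;
        }

        struct Entry { int row, col; uint8_t coeff; };

        // "Compile" the matrix once: keep only its nonzero entries.
        std::vector<Entry> compileMatrix(const std::vector<std::vector<uint8_t>>& m)
        {
            std::vector<Entry> nz;
            for (int r = 0; r < (int)m.size(); r++)
                for (int c = 0; c < (int)m[r].size(); c++)
                    if (m[r][c] != 0) nz.push_back({r, c, m[r][c]});
            return nz;
        }

        // Multiply one chunk by the compiled matrix.
        void multiplyCompiled(const std::vector<Entry>& nz,
                              const uint8_t* in, uint8_t* out, int rows)
        {
            for (int r = 0; r < rows; r++) out[r] = 0;
            for (const Entry& e : nz)  // coefficient 1 needs no multiply at all
                out[e.row] ^= (e.coeff == 1) ? in[e.col] : gfMul(e.coeff, in[e.col]);
        }

    For the mostly-identity matrices described above, this touches roughly numGood + numRedundant*numGood entries instead of the full (numGood+numRedundant)*numGood grid; emitting and JIT-compiling straight-line code per matrix is the same idea taken to its limit.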

    Read the article

  • Java Polynomial Multiplication with ArrayList

    - by user1506919
    I am having a problem with one of the methods in my program. The method is designed to take 2 ArrayLists and then perform multiplication between the two like polynomials. For example, if I were to say list1={3,2,1} and list2={5,6,7}, I am trying to get a return value of 15,28,38,20,7. However, all I can get is an error message that says: Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0. I have provided the method below: private static ArrayList<Integer> multiply(ArrayList<Integer> list1,ArrayList<Integer> list2) { ArrayList<Integer> array =new ArrayList<Integer>(list1.size()+list2.size()); for (int i=0;i<array.size();i++) array.add(i, 0); for (int i = 0; i < list1.size(); i++) for (int j = 0; j < list2.size(); j++) array.set(i+j, ((list1.get(i) * list2.get(j))+array.get(i+j))); return array; } Any help with solving this problem is greatly appreciated.
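    For reference: the likely cause is that new ArrayList<Integer>(list1.size()+list2.size()) only sets an initial capacity; the new list's size() is still 0, so the zero-filling loop's condition i<array.size() is false immediately, the loop body never runs, and the first array.set(i+j, ...) throws IndexOutOfBoundsException on an empty list. Filling by count instead of by size() fixes it, e.g. for (int i = 0; i < list1.size() + list2.size() - 1; i++) array.add(0); (the product of polynomials with m and n coefficients has m+n-1 coefficients, which matches the five expected values above).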

    Read the article

  • Matrix multiplication using pairs

    - by sc_ray
    Hi, I am looking into alternative ways to do a matrix multiplication. Instead of storing my matrix as a two-dimensional array, I am using a vector such as vector<pair<pair<int,int >,int > > to store my matrix. The pair within my pair (pair) stores my indices (i,j) and the other int stores the value for the given (i,j) pair. I thought I might have some luck implementing my sparse array this way. The problem is when I try to multiply this matrix with itself. If this were a 2-d array implementation, I would have multiplied the matrix as follows: for(i=0; i<row1; i++) { for(j=0; j<col1; j++) { C[i][j] = 0; for(k=0; k<col2; k++) C[i][j] += A[i][k] * A[k][j]; } } Can somebody point out a way to achieve the same result using my vector of 'pair of pairs'? Thanks
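    For reference, a minimal sketch of one way to do it: every stored entry (i,k,v) pairs with every stored entry (k,j,w) whose row index equals its column index, and anything that never matches is an implicit zero. This is quadratic in the number of stored entries; bucketing entries by row first would do better, but the shape of the computation is the point here.

        #include <map>
        #include <utility>
        #include <vector>

        typedef std::pair<std::pair<int,int>, int> Cell;  // ((i, j), value)

        // Sparse C = A * A over the pair-of-pairs representation.
        std::vector<Cell> multiplySparse(const std::vector<Cell>& A)
        {
            std::map<std::pair<int,int>, int> acc;
            for (const Cell& x : A)
                for (const Cell& y : A)
                    if (x.first.second == y.first.first)  // x's col == y's row
                        acc[{x.first.first, y.first.second}] += x.second * y.second;
            std::vector<Cell> C;
            for (const auto& kv : acc)
                if (kv.second != 0)                       // drop exact cancellations
                    C.push_back({kv.first, kv.second});
            return C;
        }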

    Read the article

  • matrix multiplication with MPI [on hold]

    - by user3695701
    I'm working on an assignment on matrix multiplication with MPI. A*B=C. The requirement is that B should be vertically partitioned. Here's what I intend to do: broadcast matrix A to all processes and scatter B into several slices, with each slice containing n/p columns. The following code only works when the number of processes (p) is 1. When p > 1 (say 2), I get [cluster2:21080] *** Process received signal *** [cluster2:21080] Signal: Segmentation fault (11) [cluster2:21080] Signal code: Address not mapped (1) [cluster2:21080] Failing at address: (nil) [cluster2:21080] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7f49f38108f0] [cluster2:21080] [ 1] /lib/libc.so.6(memcpy+0xe1) [0x7f49f35024c1] [cluster2:21080] [ 2] /usr/lib/libmpi.so.0(ompi_convertor_unpack+0x121)[0x7f49f47c88e1] [cluster2:21080] [ 3] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x8a26) [0x7f49f0dcea26] [cluster2:21080] [ 4] /usr/lib/openmpi/lib/openmpi/mca_btl_tcp.so(+0x662c) [0x7f49efce462c] [cluster2:21080] [ 5] /usr/lib/libopen-pal.so.0(+0x1ede8) [0x7f49f42e0de8] [cluster2:21080] [ 6] /usr/lib/libopen-pal.so.0(opal_progress+0x99) [0x7f49f42d5369] [cluster2:21080] [ 7] /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so(+0x5585) [0x7f49f0dcb585] [cluster2:21080] [ 8] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0xcc01) [0x7f49eeeb1c01] [cluster2:21080] [ 9] /usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so(+0x266c) [0x7f49eeea766c] [cluster2:21080] [10] /usr/lib/openmpi/lib/openmpi/mca_coll_sync.so(+0x1388) [0x7f49ef0c0388] [cluster2:21080] [11] /usr/lib/libmpi.so.0(MPI_Bcast+0x10e) [0x7f49f47d025e] [cluster2:21080] [12] ./out(main+0x259) [0x401571] [cluster2:21080] [13] /lib/libc.so.6(__libc_start_main+0xfd) [0x7f49f3498c8d] [cluster2:21080] [14] ./out() [0x400f29] [cluster2:21080] *** End of error message *** Can someone help me? Thanks. //matrices A and B //double* A =(double *)malloc(n*n*sizeof(double)); //double* B =(double *)malloc(n*n*sizeof(double)); //code initializing A,B... //n is the size of the matrix //p is the number of processes //myrank is the rank of calling process MPI_Init (&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &myrank); MPI_Comm_size(MPI_COMM_WORLD, &p); //broadcast A to all processes MPI_Bcast (A, n*n, MPI_DOUBLE, 0, MPI_COMM_WORLD); MPI_Datatype tmp_type, col_type; // extract a slice from B MPI_Type_vector(n, num_of_col_per_slice, n, MPI_DOUBLE, &tmp_type); // position of the first (0) and each next (stride * sizeof(double) ) slice MPI_Type_create_resized(tmp_type, 0, n * sizeof(double), &col_type); MPI_Type_commit(&col_type); //scatter a slice of B to each process MPI_Scatter(B, 1, col_type, B+myrank*n/p, n * n/p, MPI_DOUBLE, 0, MPI_COMM_WORLD); //use blas function to calculate A*sliceOfB and store the resulting slice to C cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n/p, n, 1.0, A, n, B+myrank*n/p, n, 0.0, C+myrank*n/p, n); //gather all those resulting slices into C MPI_Gather (C+myrank*n/p, n*n/p, MPI_DOUBLE, C, n*n/p, MPI_DOUBLE, 0, MPI_COMM_WORLD);
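    For reference, two things worth checking, sketched under assumptions since the allocation code is commented out above. First, every rank must own valid buffers before any collective touches them; the crash inside MPI_Bcast is consistent with A only being allocated on rank 0. Second, consecutive column slices of a row-major n x n matrix start num_of_col_per_slice doubles apart, not n doubles apart, so the resized extent is probably meant to be the slice width.

        // Every rank allocates; only rank 0 needs to initialize A and B.
        double *A = (double *)malloc(n * n * sizeof(double));
        double *B = (double *)malloc(n * n * sizeof(double));
        double *C = (double *)malloc(n * n * sizeof(double));
        // ... rank 0 fills A and B, then:
        MPI_Bcast(A, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // Extent = distance between the starts of consecutive column slices:
        MPI_Type_create_resized(tmp_type, 0,
                                num_of_col_per_slice * sizeof(double), &col_type);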

    Read the article

  • Using multiplication and division with delta time

    - by tesselode
    Using delta time with addition and subtraction is easy. player.x += 100 * dt However, multiplication and division complicate things a bit. For example, let's say I want the player to double his speed every second. player.x = player.x * 2 * dt I can't do this because it'll slow down the player (unless delta time is really high). Division is the same way, except it'll speed things way up. How can I handle multiplication and division with delta time?
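    For reference, the standard resolution: a per-frame multiplicative change is exponential, so the per-second factor is raised to the dt power; any run of frames whose dt values sum to one second then multiplies the value by exactly 2, regardless of frame rate. A sketch (assuming dt is in seconds):

        // "double every second", frame-rate independent (needs <cmath>)
        player.x *= std::pow(2.0, dt);
        // division is the same idea with a factor below 1:
        player.x *= std::pow(0.5, dt);   // "halve every second"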

    Read the article

  • tile_static, tile_barrier, and tiled matrix multiplication with C++ AMP

    - by Daniel Moth
    We ended the previous post with a mechanical transformation of the C++ AMP matrix multiplication example to the tiled model and in the process introduced tiled_index and tiled_grid. This is part 2. tile_static memory You all know that in regular CPU code, static variables have the same value regardless of which thread accesses the static variable. This is in contrast with non-static local variables, where each thread has its own copy. Back to C++ AMP, the same rules apply and each thread has its own value for local variables in your lambda, whereas all threads see the same global memory, which is the data they have access to via the array and array_view. In addition, on an accelerator like the GPU, there is a programmable cache, a third kind of memory type if you'd like to think of it that way (some call it shared memory, others call it scratchpad memory). Variables stored in that memory share the same value for every thread in the same tile. So, when you use the tiled model, you can have variables where each thread in the same tile sees the same value for that variable, a value that threads from other tiles do not share. The new storage class for local variables introduced for this purpose is called tile_static. You can only use tile_static in restrict(direct3d) functions, and only when explicitly using the tiled model. What this looks like in code should be no surprise, but here is a snippet to confirm your mental image, using a good old regular C array // each tile of threads has its own copy of locA, // shared among the threads of the tile tile_static float locA[16][16]; Note that tile_static variables are scoped and have the lifetime of the tile, and they cannot have constructors or destructors. tile_barrier In amp.h one of the types introduced is tile_barrier. You cannot construct this object yourself (although if you had one, you could use a copy constructor to create another one). So how do you get one of these? You get it from a tiled_index object. Beyond the 4 properties returning index objects, tiled_index has another property, barrier, that returns a tile_barrier object. The tile_barrier class exposes a single member, the method wait. 15: // Given a tiled_index object named t_idx 16: t_idx.barrier.wait(); 17: // more code In the code above, all threads in the tile will reach line 16 before a single one progresses to line 17. Note that all threads must be able to reach the barrier, i.e. if your code branched in such a way that not all threads were guaranteed to reach line 16, then the code above would be illegal. Tiled Matrix Multiplication Example – part 2 So now that we have added to our understanding the concepts of tile_static and tile_barrier, let me rewrite the matrix multiplication code so that it takes advantage of tiling. Before you start reading this, I suggest you get a cup of your favorite non-alcoholic beverage to enjoy while you try to fully understand the code.
01: void MatrixMultiplyTiled(vector<float>& vC, const vector<float>& vA, const vector<float>& vB, int M, int N, int W) 02: { 03: static const int TS = 16; 04: array_view<const float,2> a(M, W, vA); 05: array_view<const float,2> b(W, N, vB); 06: array_view<writeonly<float>,2> c(M,N,vC); 07: parallel_for_each(c.grid.tile< TS, TS >(), 08: [=] (tiled_index< TS, TS> t_idx) restrict(direct3d) 09: { 10: int row = t_idx.local[0]; int col = t_idx.local[1]; 11: float sum = 0.0f; 12: for (int i = 0; i < W; i += TS) { 13: tile_static float locA[TS][TS], locB[TS][TS]; 14: locA[row][col] = a(t_idx.global[0], col + i); 15: locB[row][col] = b(row + i, t_idx.global[1]); 16: t_idx.barrier.wait(); 17: for (int k = 0; k < TS; k++) 18: sum += locA[row][k] * locB[k][col]; 19: t_idx.barrier.wait(); 20: } 21: c[t_idx.global] = sum; 22: }); 23: } Notice that all the code up to line 9 is the same as per the changes we made in part 1 of the tiling introduction. If you squint, the body of the lambda itself preserves the original algorithm on lines 10, 11, and 17, 18, and 21. The difference being that those lines use new indexing and the tile_static arrays; the tile_static arrays are declared and initialized on the brand new lines 13-15. On those lines we copy from the global memory represented by the array_view objects (a and b), to the tile_static vanilla arrays (locA and locB) – we are copying enough to fit a tile. Because in the code that follows on line 18 we expect the data for this tile to be in the tile_static storage, we need to synchronize the threads within each tile with a barrier, which we do on line 16 (to avoid accessing uninitialized memory on line 18). We also need to synchronize the threads within a tile on line 19, again to avoid the race between lines 14, 15 (retrieving the next set of data for each tile and overwriting the previous set) and line 18 (not being done processing the previous set of data). Luckily, as part of the awesome C++ AMP debugger in Visual Studio there is an option that helps you find such races, but that is a story for another blog post another time. May I suggest reading the next section, and then coming back to re-read and walk through this code with pen and paper to really grok what is going on, if you haven't already? Cool. Why would I introduce this tiling complexity into my code? Funny you should ask that, I was just about to tell you. There is only one reason we tiled our extent, had to deal with finding a good tile size, ensured the number of threads we schedule is evenly divisible by the tile size, had to use a tiled_index instead of a normal index, had to understand tile_barrier and figure out where we need to use it, and doubled the size of our lambda in terms of lines of code: the reason is to be able to use tile_static memory. Why do we want to use tile_static memory? Because accessing tile_static memory is around 10 times faster than accessing the global memory on an accelerator like the GPU, e.g. in the code above, if you can get 150GB/second accessing data from the array_view a, you can get 1500GB/second accessing the tile_static array locA. And since by definition you are dealing with really large data sets, the savings really pay off. We have seen tiled implementations being twice as fast as their non-tiled counterparts. Now, some algorithms will not have performance benefits from tiling (and in fact may deteriorate), e.g.
algorithms that require you to go only once to global memory will not benefit from tiling, since with tiling you already have to fetch the data once from global memory! Other algorithms may benefit, but you may decide that you are happy with your code being 150 times faster than the serial version you had, and you do not need to invest to make it 250 times faster. Also, algorithms with more than 3 dimensions, which C++ AMP supports in the non-tiled model, cannot be tiled. Also note that in future releases, we may invest in making the non-tiled model, which already uses tiling under the covers, go the extra step and use tile_static memory on your behalf, but it is obviously way too early to commit to anything like that, and we certainly don't do any of that today. Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • Booth multiplication algorithm

    - by grassPro
    Is Booth's algorithm for multiplication only for multiplying 2 negative numbers (-3 * -4) or one positive and one negative number (-3 * 4)? Whenever I multiply 2 positive numbers using Booth's algorithm I get a wrong result. Example: 5 * 4 A = 101 000 0 // binary of 5 is 101 S = 011 000 0 // 2's complement of 5 is 011 P = 000 100 0 // binary of 4 is 100 x = 3 y = 3 m = 5 -m = 2's complement of m r = 4 After right shift of P by 1 bit 0 000 100 After right shift of P by 1 bit 0 000 010 P+S = 011 001 0 After right shift by 1 bit 0 011 001 Discarding the LSB: 001100. But that comes out to be the binary of 12. It should have been 20 (010100).
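    For reference, the likely issue with the worked example above: Booth's algorithm handles every sign combination, but each operand must be representable as a signed two's-complement number in its field. In three bits, 101 reads as -3 (not 5) and 100 reads as -4 (not 4), so the run above is effectively recoding a different product. Widening both operands by a sign bit (multiplicand 0101, its complement S derived from 1011, multiplier 0100, so x = y = 4) makes the same shift-and-add procedure produce 00010100, i.e. 20, since the recoded multiplier 0100 contributes 8*A - 4*A = 4*A.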

    Read the article

  • Japanese Multiplication simulation - is a program actually capable of improving calculation speed?

    - by jt0dd
    On SuperUser, I asked a (possibly silly) question about processors using mathematical shortcuts and would like to have a look at the possibility at the software application of that concept. I'd like to write a simulation of Japanese Multiplication to get benchmarks on large calculations utilizing the shortcut vs traditional CPU multiplication. I'm curious as to whether it makes sense to try this. My Question: I'd like to know whether or not a software math shortcut, as described above is actually a shortcut at all. This is a question of programming concept. By utilizing the simulation of Japanese Multiplication, is a program actually capable of improving calculation speed? Or am I doomed from the start? The answer to this question isn't required to determine whether or not the experiment will succeed, but rather whether or not it's logically possible for such a thing to occur in any program, using this concept as an example. My theory is that since addition is computed faster than multiplication, a simulation of Japanese multiplication may actually allow a program to multiply (large) numbers faster than the CPU arithmetic unit can. I think this would be a very interesting finding, if it proves to be true. If, in the multiplication of numbers of any immense size, the shortcut were to calculate the result via less instructions (or faster) than traditional ALU multiplication, I would consider the experiment a success.

    Read the article

  • Matrix Multiplication with C++ AMP

    - by Daniel Moth
    As part of our API tour of C++ AMP, we looked recently at parallel_for_each. I ended that post by saying we would revisit parallel_for_each after introducing array and array_view. Now is the time, so this is part 2 of parallel_for_each, and also a post that brings together everything we've seen until now. The code for serial and accelerated Consider a naïve (or brute force) serial implementation of matrix multiplication  0: void MatrixMultiplySerial(std::vector<float>& vC, const std::vector<float>& vA, const std::vector<float>& vB, int M, int N, int W) 1: { 2: for (int row = 0; row < M; row++) 3: { 4: for (int col = 0; col < N; col++) 5: { 6: float sum = 0.0f; 7: for(int i = 0; i < W; i++) 8: sum += vA[row * W + i] * vB[i * N + col]; 9: vC[row * N + col] = sum; 10: } 11: } 12: } We notice that each loop iteration is independent of the others and so can be parallelized. If in addition we have really large amounts of data, then this is a good candidate to offload to an accelerator. First, I'll just show you an example of what that code may look like with C++ AMP, and then we'll analyze it. It is assumed that you included at the top of your file #include <amp.h> 13: void MatrixMultiplySimple(std::vector<float>& vC, const std::vector<float>& vA, const std::vector<float>& vB, int M, int N, int W) 14: { 15: concurrency::array_view<const float,2> a(M, W, vA); 16: concurrency::array_view<const float,2> b(W, N, vB); 17: concurrency::array_view<concurrency::writeonly<float>,2> c(M, N, vC); 18: concurrency::parallel_for_each(c.grid, 19: [=](concurrency::index<2> idx) restrict(direct3d) { 20: int row = idx[0]; int col = idx[1]; 21: float sum = 0.0f; 22: for(int i = 0; i < W; i++) 23: sum += a(row, i) * b(i, col); 24: c[idx] = sum; 25: }); 26: } First a visual comparison, just for fun: The beginning and end are the same, i.e. lines 0,1,12 are identical to lines 13,14,26. The double nested loop (lines 2,3,4,5 and 10,11) has been transformed into a parallel_for_each call (18,19,20 and 25). The core algorithm (lines 6,7,8,9) is essentially the same (lines 21,22,23,24). We have extra lines in the C++ AMP version (15,16,17). Now let's dig in deeper. Using array_view and extent When we decided to convert this function to run on an accelerator, we knew we couldn't use the std::vector objects in the restrict(direct3d) function. So we had a choice of copying the data to the concurrency::array<T,N> object, or wrapping the vector container (and hence its data) with a concurrency::array_view<T,N> object from amp.h – here we used the latter (lines 15,16,17). Now we can access the same data through the array_view objects (a and b) instead of the vector objects (vA and vB), and the added benefit is that we can capture the array_view objects in the lambda (lines 19-25) that we pass to the parallel_for_each call (line 18) and the data will get copied on demand for us to the accelerator. Note that line 15 (and ditto for 16 and 17) could have been written as two lines instead of one: extent<2> e(M, W); array_view<const float, 2> a(e, vA); In other words, we could have explicitly created the extent object instead of letting the array_view create it for us under the covers through the constructor overload we chose. The benefit of the extent object in this instance is that we can express that the data is indeed two-dimensional, i.e. a matrix. When we were using a vector object we could not do that, and instead we had to track via additional unrelated variables the dimensions of the matrix (i.e.
with the integers M and W) – aren't you loving C++ AMP already? Note that the const before the float when creating a and b will result in the underlying data only being copied to the accelerator and not copied back – a nice optimization. A similar thing is happening on line 17 when creating array_view c, where we have indicated that we do not need to copy the data to the accelerator, only copy it back. The kernel dispatch On line 18 we make the call to the C++ AMP entry point (parallel_for_each) to invoke our parallel loop or, as some may say, dispatch our kernel. The first argument we need to pass describes how many threads we want for this computation. For this algorithm we decided that we want exactly the same number of threads as the number of elements in the output matrix, i.e. in array_view c which will eventually update the vector vC. So each thread will compute exactly one result. Since the elements in c are organized in a 2-dimensional manner we can organize our threads in a two-dimensional manner too. We don't have to think too much about how to create the first argument (a grid) since the array_view object helpfully exposes that as a property. Note that instead of c.grid we could have written grid<2>(c.extent) or grid<2>(extent<2>(M, N)) – the result is the same in that we have specified M*N threads to execute our lambda. The second argument is a restrict(direct3d) lambda that accepts an index object. Since we elected to use a two-dimensional extent as the first argument of parallel_for_each, the index will also be two-dimensional and as covered in the previous posts it represents the thread ID, which in our case maps perfectly to the index of each element in the resulting array_view. The kernel itself The lambda body (lines 20-24), or as some may say, the kernel, is the code that will actually execute on the accelerator. It will be called by M*N threads and we can use those threads to index into the two input array_views (a,b) and write results into the output array_view (c). The four lines (21-24) are essentially identical to the four lines of the serial algorithm (6-9). The only difference is how we index into a,b,c versus how we index into vA,vB,vC. The code we wrote with C++ AMP is much nicer in its indexing, because the dimensionality is a first class concept, so you don't have to do funny arithmetic calculating the index of where the next row starts, which you have to do when working with vectors directly (since they store all the data in a flat manner). I skipped over describing line 20. Note that we didn't really need to read the two components of the index into temporary local variables. This mostly reflects my personal choice, in some algorithms, to break down the index into local variables with names that make sense for the algorithm, i.e. in this case row and col. In other cases it may be i,j,k or x,y,z, or M,N or whatever. Also note that we could have written line 24 as: c(idx[0], idx[1])=sum  or  c(row, col)=sum instead of the simpler c[idx]=sum Targeting a specific accelerator Imagine that we had more than one hardware accelerator on a system and we wanted to pick a specific one to execute this parallel loop on.
So there would be some code like this anywhere before line 18: vector<accelerator> accs = MyFunctionThatChoosesSuitableAccelerators(); accelerator acc = accs[0]; …and then we would modify line 18 so we would be calling another overload of parallel_for_each that accepts an accelerator_view as the first argument, so it would become: concurrency::parallel_for_each(acc.default_view, c.grid, ...and the rest of your code remains the same… how simple is that? Comments about this post by Daniel Moth welcome at the original blog.

    Read the article

  • Array Multiplication and Division

    - by Narfanator
    I came across a question that (eventually) landed me wondering about array arithmetic. I'm thinking specifically in Ruby, but I think the concepts are language independent. So, addition and subtraction are defined, in Ruby, as such: [1,6,8,3,6] + [5,6,7] == [1,6,8,3,6,5,6,7] # All the elements of the first, then all the elements of the second [1,6,8,3,6] - [5,6,7] == [1,8,3] # From the first, remove anything found in the second and array * scalar is defined: [1,2,3] * 2 == [1,2,3,1,2,3] But what, conceptually, should the following be? None of these are (as far as I can find) defined: Array x Array: [1,2,3] * [1,2,3] #=> ? Array / Scalar: [1,2,3,4,5] / 2 #=> ? Array % Scalar: [1,2,3,4,5] % 2 #=> ? Array / Array: [1,2,3,4,5] / [1,2] #=> ? Array % Array: [1,2,3,4,5] % [1,2] #=> ? I've found some mathematical descriptions of these operations for set theory, but I couldn't really follow them, and sets don't have duplicates (arrays do). Edit: Note, I do not mean vector (matrix) arithmetic, which is completely defined. Edit 2: If this is the wrong stack exchange, tell me which is the right one and I'll move it. Edit 3: Add mod operators to the list. Edit 4: I figure array / scalar is derivable from array * scalar: a * b = c => a = c / b [1,2,3] * 3 = [1,2,3]+[1,2,3]+[1,2,3] = [1,2,3,1,2,3,1,2,3] => [1,2,3] = [1,2,3,1,2,3,1,2,3] / 3 Which, given that programmer's division ignores the remainder and has a modulus, gives: [1,2,3,4,5] / 2 = [[1,2], [3,4]] [1,2,3,4,5] % 2 = [5] Except that these are pretty clearly non-reversible operations (not that modulus ever is), which is non-ideal. Edit: I asked a question over on Math that led me to multisets. I think maybe extensible arrays are "multisets", but I'm not sure yet.

    Read the article

  • Shader optimization - cg/hlsl pseudo-and via multiplication

    - by teodron
    Since HLSL/Cg do not allow texture fetching inside conditional blocks, I am first checking a variable and performing some computations, afterwards setting a float flag to 0.0 or 1.0, depending on the computations. I'd like to trigger a texture fetch only if the flag is 1.0 (or non-zero, for that matter). I kind of hoped this would do the trick: float4 TU0_atlas_colour = pseudoBool * tex2Dlod(TU0_texture, float4(tileCoord, 0, mipLevel)); That is, if pseudoBool is 0, will the texture fetch function still be called and produce overhead? I was hoping to prevent it from getting executed via this trick that usually works in plain C/C++.

    Read the article

  • Matrix multiplication - Scene Graphs

    - by bgarate
    I wrote a MatrixStack class in C# to use in a SceneGraph. So, to get the world matrix for an object I am supposed to use: WorldMatrix = ParentWorld * LocalTransform But, in fact, it only works as expected when I do it the other way: WorldMatrix = LocalTransform * ParentWorld My code is: public class MatrixStack { Stack<Matrix> stack = new Stack<Matrix>(); Matrix result = Matrix.Identity; public void PushMatrix(Matrix matrix) { stack.Push(matrix); result = matrix * result; } public Matrix PopMatrix() { result = Matrix.Invert(stack.Peek()) * result; return stack.Pop(); } public Matrix Result { get { return result; } } public void Clear() { stack.Clear(); result = Matrix.Identity; } } Why does it work this way and not the other? Thanks!
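    For reference, a plausible explanation (assuming this is XNA's Matrix type, which multiplies row vectors on the left, v' = v * M): with row vectors, transforms compose left to right, so WorldMatrix = LocalTransform * ParentWorld is the natural order, and ParentWorld * LocalTransform is the column-vector convention found in most graphics texts. The stack's own PushMatrix (result = matrix * result) composes in that same row-vector order, which is consistent with the second form being the one that works.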

    Read the article

  • How to solve a linker error in matrix multiplication in C using the LAPACK library?

    - by Malar
    I did matrix multiplication using the LAPACK library, and I am getting the error below. Can anyone help me? "error LNK2019: unresolved external symbol "void __cdecl dgemm(char,char,int *,int *,int *,double *,double *,int *,double *,int *,double *,double *,int *)" (?dgemm@@YAXDDPAH00PAN1010110@Z) referenced in function _main" 1..\bin\matrixMultiplicationUsingLapack.exe : fatal error LNK1120: 1 unresolved externals I've posted my code below: # define matARowSize 2 // -- Matrix 1 number of rows # define matAColSize 2 // -- Matrix 1 number of cols # define matBRowSize 2 // -- Matrix 2 number of rows # define matBColSize 2 // -- Matrix 2 number of cols using namespace std; void dgemm(char, char, int *, int *, int *, double *, double *, int *, double *, int *, double *, double *, int *); int main() { double iMatrixA[matARowSize*matAColSize]; // -- Input matrix 1 {m x n} double iMatrixB[matBRowSize*matBColSize]; // -- Input matrix 2 {n x k} double iMatrixC[matARowSize*matBColSize]; // -- Output matrix {m x n * n x k = m x k} double alpha = 1.0f; double beta = 0.0f; int n = 2; iMatrixA[0] = 1; iMatrixA[1] = 1; iMatrixA[2] = 1; iMatrixA[3] = 1; iMatrixB[0] = 1; iMatrixB[1] = 1; iMatrixB[2] = 1; iMatrixB[3] = 1; //dgemm('N','N',&n,&n,&n,&alpha,iMatrixA,&n,iMatrixB,&n,&beta,iMatrixC,&n); dgemm('N','N',&n,&n,&n,&alpha,iMatrixA,&n,iMatrixB,&n,&beta,iMatrixC,&n); std::cin.get(); return 0; }
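    For reference: the unresolved symbol is the C++-mangled name of a prototype that nothing defines; declaring dgemm yourself does not link it to the library. The Fortran BLAS routine is typically exported with C linkage, a trailing underscore (the exact name is toolchain-dependent, an assumption here), and pointer arguments throughout. A sketch of the conventional declaration and call:

        extern "C" void dgemm_(const char* transa, const char* transb,
                               const int* m, const int* n, const int* k,
                               const double* alpha, const double* a, const int* lda,
                               const double* b, const int* ldb,
                               const double* beta, double* c, const int* ldc);

        // call site: characters are passed by address too
        // dgemm_("N", "N", &n, &n, &n, &alpha, iMatrixA, &n,
        //        iMatrixB, &n, &beta, iMatrixC, &n);

    Linking against the BLAS/LAPACK import library in the project settings is still required; the declaration alone only fixes the name.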

    Read the article

  • Matrix multiplication in Java (RE-POST)

    - by Chapax
    Apologies for the re-post; the earlier time I'd posted I did not have all the details. My colleague, who quit the firm, was a C# programmer who was forced to write Java code that involved (large, dense) matrix multiplication. He coded his own DataTable class in Java, in order to be able to a) create indexes to sort and join with other DataTables and b) do matrix multiplication. The code in its current form is NOT maintainable/extensible. I want to clean up the code, and thought using something like R within Java would help me focus on business logic rather than sorting, joining, matrix multiplication, etc. Plus, I'm very new to the concept of DataTable; I just want to replace the DataTable with 2D arrays and let R handle the rest. (I currently do not know how to join 2 large datasets in Java very efficiently.) Please let me know what you think. Also, are there any simple examples that I can take a look at?

    Read the article

  • Matrix multiplication in Java

    - by Chapax
    Hi, I wanted to do matrix multiplication in Java, and the speed needs to be very good. I was thinking of calling R through Java to achieve this. I had a couple of Qs though: Is calling R using Java a good idea? If yes, are there any code samples that can be shared? What are the other ways that can be considered to do matrix multiplication in Java? Many thanks. --Chapax

    Read the article

  • theoretical and practical matrix multiplication FLOP

    - by mjr
    I wrote a traditional matrix multiplication in C++ and tried to measure and compare its theoretical and practical FLOP counts. As I understand it, the inner loop of MM has 2 operations, so the theoretical flop count of simple MM is 2*n*n*n (2n^3), but in practice I get something like 4n^3 plus the number of operations, which is 2, i.e. 6n^3. Also, if I just increment a single array element (a[i][j]++), the practical flops come out to something like 3n^3 and not n^3; as you see, again it is 2n^3 + 1 operation and not 1 operation * n^3. On the other hand, if I use a 1D array in three nested loops as matrix multiplication and compare flops, the practical flops are (nearly) the same as the theoretical flops and depend exactly on the number of operations in the inner loop. I could not find the reason for this behaviour. What is the reason in both cases? I know that the theoretical flop count is not the same as the practical one because of some operations like loads etc. System specification: Intel Core 2 Duo E4500, 3700g memory, L2 cache 2M, x64 Fedora 17. Sample results: Matrix matrix multiplication 512*512 Real_time: 1.718368 Proc_time: 1.227672 Total flpops: 807,107,072 MFLOPS: 657.429016 Real_time: 3.608078 Proc_time: 3.042272 Total flpops: 807,024,448 MFLOPS: 265.270355 theoretical flop: 2*512*512*512 = 268,435,456 practical flops: 6*512^3 = 807,107,072 Using a 1-dimensional array float d[size][size], size 512 (or any size): for (int j = 0; j < size; ++j) { for (int k = 0; k < size; ++k) { d[k]=d[k]+e[k]+f[k]+g[k]+r; } } Real_time: 0.002288 Proc_time: 0.002260 Total flpops: 1,048,578 MFLOPS: 464.027161 theoretical flop: 4n^2 = 4*512^2 = 1,048,576 practical flop: 4n^2 + overhead (other operations?) = 1,048,578 3 loop version: Real_time: 1.282257 Proc_time: 1.155990 Total flpops: 536,872,000 MFLOPS: 464.426117 theoretical flop: 4n^3 = 536,870,912 practical flop: 4n^3 + overheads (other operations?) = 536,872,000 Thank you

    Read the article

  • C++ BigInt multiplication conceptual problem

    - by Kapo
    I'm building a small BigInt library in C++ for use in my programming language. The structure is like the following: short digits[ 1000 ]; int len; I have a function that converts a string into a bigint by splitting it up into single chars and putting them into digits. The numbers in digits are all reversed, so the number 123 would look like the following: digits[0]=3 digits[1]=2 digits[2]=1 I have already managed to code the adding function, which works perfectly. It works somewhat like this: overflow = 0 for i ++ until length of both numbers exceeded: add numberA[ i ] to numberB[ i ] add overflow to the result set overflow to 0 if the result is 10 or bigger: subtract 10 from the result overflow = 1 put the result into numberReturn[ i ] (Overflow is in this case what happens when I add 1 to 9: subtract 10 from 10, add 1 to overflow, overflow gets added to the next digit) So think of how two numbers are stored, like these: 0 | 1 | 2 --------- A 2 - - B 0 0 1 The above represents the digits of the bigints 2 (A) and 100 (B). - means uninitialized digits, they aren't accessed. So adding the above numbers works fine: start at 0, add 2 + 0, go to 1, add 0, go to 2, add 1 But: When I want to do multiplication with the above structure, my program ends up doing the following: Start at 0, multiply 2 with 0 (eek), go to 1, ... So it is obvious that, for multiplication, I have to get an order like this: 0 | 1 | 2 --------- A - - 2 B 0 0 1 Then, everything would be clear: Start at 0, multiply 0 with 0, go to 1, multiply 0 with 0, go to 2, multiply 1 with 2 How can I manage to get digits into the correct form for multiplication? I don't want to do any array moving/flipping - I need performance!
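    For reference: with least-significant-digit-first storage, no flipping is needed at all, because the digit at index i carries weight 10^i in both operands, and the product of digits i and j always lands at index i+j. Schoolbook multiplication therefore iterates over each operand's actual length and never touches uninitialized digits. A minimal sketch against the structure above (base-10 digits assumed, as in the string conversion; sign and final-length handling omitted):

        // Schoolbook product; out must have alen + blen digits, all zeroed.
        void multiply_digits(const short a[], int alen,
                             const short b[], int blen, short out[])
        {
            for (int i = 0; i < alen; i++) {
                int carry = 0;
                for (int j = 0; j < blen; j++) {
                    int t = out[i + j] + a[i] * b[j] + carry;
                    out[i + j] = (short)(t % 10);
                    carry = t / 10;
                }
                // propagate whatever carry is left past the end of this row
                for (int k = i + blen; carry != 0; k++) {
                    int t = out[k] + carry;
                    out[k] = (short)(t % 10);
                    carry = t / 10;
                }
            }
        }

    In the "A 2 - - / B 0 0 1" picture, the loops run over len(A)=1 and len(B)=3 only, so the uninitialized cells are never read and the 2 is multiplied against each digit of B exactly once.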

    Read the article

  • Fast 4x4 matrix multiplication in Java with NIO float buffers

    - by kayahr
    I know there are a LOT of questions like this, but I can't find one specific to my situation. I have 4x4 matrices implemented as NIO float buffers (these matrices are used for OpenGL). Now I want to implement a multiply method which multiplies Matrix A with Matrix B and stores the result in Matrix C. So the code may look like this: class Matrix4f { private FloatBuffer buffer = FloatBuffer.allocate(16); public Matrix4f multiply(Matrix4f matrix2, Matrix4f result) { {{{result = this * matrix2}}} <-- I need this code return result; } } What is the fastest possible code to do this multiplication? Some OpenGL implementations (like the OpenGL ES stuff in Android) provide native code for this, but others don't. So I want to provide a generic multiplication method for these implementations.
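    For reference, a common pattern for exactly this situation, independent of the multiplication itself: pull each operand into a local float[16] with a single bulk FloatBuffer.get(float[]) (rewinding the buffer afterwards), compute the sixteen dot products on the plain arrays, where the JIT optimizes scalar array code well, and write the result out with one bulk put(float[]). Per-element absolute get(i) calls inside the multiply loops are typically what makes FloatBuffer-based math slow. Note too that FloatBuffer.allocate(16) creates a non-direct buffer; OpenGL bindings generally want direct buffers, e.g. via ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder()).asFloatBuffer().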

    Read the article
