Matrix Multiplication with C++ AMP

Posted by Daniel Moth on Sat, 10 Sep 2011
As part of our API tour of C++ AMP, we looked recently at parallel_for_each. I ended that post by saying we would revisit parallel_for_each after introducing array and array_view. Now is the time, so this is part 2 of parallel_for_each, and also a post that brings together everything we've seen until now.

The code for serial and accelerated

Consider a naïve (or brute force) serial implementation of matrix multiplication:

0:  void MatrixMultiplySerial(std::vector<float>& vC, 
        const std::vector<float>& vA, 
        const std::vector<float>& vB, int M, int N, int W)
1:  {
2:    for (int row = 0; row < M; row++) 
3:    {
4:      for (int col = 0; col < N; col++)
5:      {
6:        float sum = 0.0f;
7:        for(int i = 0; i < W; i++)
8:          sum += vA[row * W + i] * vB[i * N + col];
9:        vC[row * N + col] = sum;
10:     }
11:   }
12: }

We notice that the loop iterations are independent of one another and so can be parallelized. If in addition we have really large amounts of data, then this is a good candidate to offload to an accelerator. First, I'll just show you an example of what that code may look like with C++ AMP, and then we'll analyze it. It is assumed that you have included <amp.h> at the top of your file.
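In other words, something like this before the listing below (the using directive is optional and commented out here, since the listings fully qualify everything):

  #include <amp.h>
  // using namespace concurrency;  // optional: lets you drop the concurrency:: prefixes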

13:  void MatrixMultiplySimple(std::vector<float>& vC, 
         const std::vector<float>& vA, 
         const std::vector<float>& vB, int M, int N, int W)
14:  {
15:    concurrency::array_view<const float,2> a(M, W, vA);
16:    concurrency::array_view<const float,2> b(W, N, vB);
17:    concurrency::array_view<concurrency::writeonly<float>,2> c(M, N, vC);
18:    concurrency::parallel_for_each(c.grid, 
19:    [=](concurrency::index<2> idx) restrict(direct3d) {
20:      int row = idx[0]; int col = idx[1];
21:      float sum = 0.0f;
22:      for(int i = 0; i < W; i++)
23:        sum += a(row, i) * b(i, col);
24:      c[idx] = sum;
25:    });
26:  }

First a visual comparison, just for fun: the beginning and end are the same, i.e. lines 0,1,12 are identical to lines 13,14,26. The double nested loop (lines 2,3,4,5 and 10,11) has been transformed into a parallel_for_each call (18,19,20 and 25). The core algorithm (lines 6,7,8,9) is essentially the same (lines 21,22,23,24). We have extra lines in the C++ AMP version (15,16,17). Now let's dig in deeper.

[Image: matrix multiplication illustration, from Wikipedia]

Using array_view and extent

When we decided to convert this function to run on an accelerator, we knew we couldn't use the std::vector objects in the restrict(direct3d) function. So we had a choice of copying the data to a concurrency::array<T,N> object, or wrapping the vector container (and hence its data) with a concurrency::array_view<T,N> object from amp.h – here we used the latter (lines 15,16,17). Now we can access the same data through the array_view objects (a and b) instead of the vector objects (vA and vB), and the added benefit is that we can capture the array_view objects in the lambda (lines 19-25) that we pass to the parallel_for_each call (line 18), and the data will get copied on demand for us to the accelerator.
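For comparison, here is a rough sketch of the copy-based concurrency::array route we didn't take – the exact constructor and copy overloads shown are my assumption, so treat this as illustrative only:

  // sketch: with concurrency::array the copies are explicit and are our job
  concurrency::array<float, 2> arrA(M, W, vA.begin());  // copy vA to the accelerator
  concurrency::array<float, 2> arrB(W, N, vB.begin());  // copy vB to the accelerator
  concurrency::array<float, 2> arrC(M, N);              // results live on the accelerator
  // ... run the kernel against arrA, arrB, arrC ...
  concurrency::copy(arrC, vC.begin());                  // copy the results back to vC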

Note that line 15 (and ditto for 16 and 17) could have been written as two lines instead of one:

  extent<2> e(M, W);
  array_view<const float, 2> a(e, vA);

In other words, we could have explicitly created the extent object instead of letting the array_view create it for us under the covers through the constructor overload we chose. The benefit of the extent object in this instance is that we can express that the data is indeed two-dimensional, i.e. a matrix. When we were using a vector object we could not do that, and instead we had to track the dimensions of the matrix via additional, unrelated variables (the integers M and W) – aren't you loving C++ AMP already?
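To make that concrete, a hypothetical helper (my illustration, not from the original code, and it assumes <iostream>) can recover the shape from the view itself with no extra parameters – recall from further below that the view exposes its extent as a property:

  // hypothetical helper: the view knows its own shape
  void PrintShape(const concurrency::array_view<const float, 2>& m)
  {
    std::cout << m.extent[0] << " x " << m.extent[1] << std::endl;  // e.g. "M x W" for a
  }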

Note that the const before float when creating a and b will result in the underlying data only being copied to the accelerator and not copied back – a nice optimization. A similar thing is happening on line 17 when creating array_view c, where we have indicated that we do not need to copy the data to the accelerator, only to copy it back.
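For contrast, dropping the writeonly hint on line 17 would give the default behavior, where the data travels in both directions:

  // without the writeonly hint, c's data would be copied to the accelerator
  // and synchronized back – correct, but more traffic than this algorithm needs
  concurrency::array_view<float, 2> c(M, N, vC);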

The kernel dispatch

On line 18 we make the call to the C++ AMP entry point (parallel_for_each) to invoke our parallel loop or, as some may say, dispatch our kernel.

The first argument we need to pass describes how many threads we want for this computation. For this algorithm we decided that we want exactly the same number of threads as the number of elements in the output matrix, i.e. in array_view c which will eventually update the vector vC. So each thread will compute exactly one result. Since the elements in c are organized in a 2-dimensional manner we can organize our threads in a two-dimensional manner too. We don't have to think too much about how to create the first argument (a grid) since the array_view object helpfully exposes that as a property. Note that instead of c.grid we could have written grid<2>(c.extent) or grid<2>(extent<2>(M, N)) – the result is the same in that we have specified M*N threads to execute our lambda.
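Spelled out in code, those three equivalent spellings of the first argument look like this:

  // three equivalent ways to ask for M*N threads
  concurrency::grid<2> g1 = c.grid;                        // from the array_view property
  concurrency::grid<2> g2(c.extent);                       // from the view's extent
  concurrency::grid<2> g3(concurrency::extent<2>(M, N));   // built from scratch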

The second argument is a restrict(direct3d) lambda that accepts an index object. Since we elected to use a two-dimensional extent as the first argument of parallel_for_each, the index will also be two-dimensional, and as covered in previous posts it represents the thread ID, which in our case maps perfectly to the index of each element in the resulting array_view.

The kernel itself

The lambda body (lines 20-24), or as some may say, the kernel, is the code that will actually execute on the accelerator. It will be called by M*N threads, and we can use those threads to index into the two input array_views (a, b) and write results into the output array_view (c).

The four lines (21-24) are essentially identical to the four lines of the serial algorithm (6-9). The only difference is how we index into a,b,c versus how we index into vA,vB,vC. The code we wrote with C++ AMP is much nicer in its indexing, because the dimensionality is a first class concept, so you don't have to do funny arithmetic calculating the index of where the next row starts, which you have to do when working with vectors directly (since they store all the data in a flat manner).
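Side by side, the two indexing styles from the listings above:

  // manual flattening into a 1D vector (serial version, line 8)
  sum += vA[row * W + i] * vB[i * N + col];
  // first-class two-dimensional indexing (C++ AMP version, line 23)
  sum += a(row, i) * b(i, col);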

I skipped over describing line 20. Note that we didn't really need to read the two components of the index into temporary local variables. This mostly reflects my personal choice, in some algorithms, to break the index down into local variables with names that make sense for the algorithm, i.e. in this case row and col. In other cases it may be i,j,k, or x,y,z, or M,N, or whatever. Also note that we could have written line 24 as c(idx[0], idx[1]) = sum or c(row, col) = sum instead of the simpler c[idx] = sum.
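In code, the three equivalent forms of line 24:

  c[idx] = sum;             // what the listing uses: index the view with the index object
  c(idx[0], idx[1]) = sum;  // equivalent: subscript each dimension of the index explicitly
  c(row, col) = sum;        // equivalent: via the local variables from line 20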

Targeting a specific accelerator

Imagine that we had more than one hardware accelerator on a system and we wanted to pick a specific one to execute this parallel loop on. So there would be some code like this somewhere before line 18:

  vector<accelerator> accs = MyFunctionThatChoosesSuitableAccelerators();
  accelerator acc = accs[0];
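The chooser function itself is left to your imagination; purely as an illustration (mine, not from the post), a trivial stand-in could just return the default accelerator:

  // sketch of a trivial chooser: a one-element vector holding the default accelerator
  std::vector<concurrency::accelerator> MyFunctionThatChoosesSuitableAccelerators()
  {
    return std::vector<concurrency::accelerator>(1, concurrency::accelerator());
  }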

…and then we would modify line 18 so we would be calling another overload of parallel_for_each that accepts an accelerator_view as the first argument, so it would become:

  concurrency::parallel_for_each(acc.default_view, c.grid,

...and the rest of your code remains the same… how simple is that?
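Putting it all together, the complete accelerator-targeting variant would read as below – a sketch reusing the hypothetical chooser from above, with a function name of my choosing:

  void MatrixMultiplyOnChosenAccelerator(std::vector<float>& vC,
      const std::vector<float>& vA,
      const std::vector<float>& vB, int M, int N, int W)
  {
    // pick a device (see the chooser sketch earlier)
    concurrency::accelerator acc = MyFunctionThatChoosesSuitableAccelerators()[0];
    concurrency::array_view<const float, 2> a(M, W, vA);
    concurrency::array_view<const float, 2> b(W, N, vB);
    concurrency::array_view<concurrency::writeonly<float>, 2> c(M, N, vC);
    // same kernel as lines 19-25, now dispatched to acc's default view
    concurrency::parallel_for_each(acc.default_view, c.grid,
      [=](concurrency::index<2> idx) restrict(direct3d) {
        int row = idx[0]; int col = idx[1];
        float sum = 0.0f;
        for (int i = 0; i < W; i++)
          sum += a(row, i) * b(i, col);
        c[idx] = sum;
      });
  }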




