parallel_for_each from amp.h – part 1

Posted by Daniel Moth on Daniel Moth See other posts from Daniel Moth or by Daniel Moth
Published on Wed, 07 Sep 2011 02:52:15 GMT Indexed on 2011/11/11 18:18 UTC
Read the original article Hit count: 283

Filed under:
|

This posts assumes that you've read my other C++ AMP posts on index<N> and extent<N>, as well as about the restrict modifier. It also assumes you are familiar with C++ lambdas (if not, follow my links to C++ documentation).

Basic structure and parameters

Now we are ready for part 1 of the description of the new overload for the concurrency::parallel_for_each function. The basic new parallel_for_each method signature returns void and accepts two parameters:

  • a grid<N> (think of it as an alias to extent)
  • a restrict(direct3d) lambda, whose signature is such that it returns void and accepts an index of the same rank as the grid

So it looks something like this (with generous returns for more palatable formatting) assuming we are dealing with a 2-dimensional space:

  // some_code_A
  parallel_for_each(
    g, // g  is of type grid<2>
    [ ](index<2> idx) restrict(direct3d)
    {
      // kernel code
    }
  );
  // some_code_B

The parallel_for_each will execute the body of the lambda (which must have the restrict modifier), on the GPU. We also call the lambda body the "kernel". The kernel will be executed multiple times, once per scheduled GPU thread. The only difference in each execution is the value of the index object (aka as the GPU thread ID in this context) that gets passed to your kernel code. The number of GPU threads (and the values of each index) is determined by the grid object you pass, as described next.

You know that grid is simply a wrapper on extent. In this context, one way to think about it is that the extent generates a number of index objects. So for the example above, if your grid was setup by some_code_A as follows:

  extent<2> e(2,3);
  grid<2> g(e);

...then given that: e.size()==6, e[0]==2, and e[1]=3

...the six index<2> objects it generates (and hence the values that your lambda would receive) are:

   (0,0) (1,0) (0,1) (1,1) (0,2) (1,2)

So what the above means is that the lambda body with the algorithm that you wrote will get executed 6 times and the index<2> object you receive each time will have one of the values just listed above (of course, each one will only appear once, the order is indeterminate, and they are likely to call your code at the same exact time). Obviously, in real GPU programming, you'd typically be scheduling thousands if not millions of threads, not just 6.

If you've been following along you should be thinking: "that is all fine and makes sense, but what can I do in the kernel since I passed nothing else meaningful to it, and it is not returning any values out to me?"

Passing data in and out

It is a good question, and in data parallel algorithms indeed you typically want to pass some data in, perform some operation, and then typically return some results out. The way you pass data into the kernel, is by capturing variables in the lambda (again, if you are not familiar with them, follow the links about C++ lambdas), and the way you use data after the kernel is done executing is simply by using those same variables.

In the example above, the lambda was written in a fairly useless way with an empty capture list: [ ](index<2> idx) restrict(direct3d), where the empty square brackets means that no variables were captured.

If instead I write it like this [&](index<2> idx) restrict(direct3d), then all variables in the some_code_A region are made available to the lambda by reference, but as soon as I try to use any of those variables in the lambda, I will receive a compiler error. This has to do with one of the direct3d restrictions, where only one type can be capture by reference: objects of the new concurrency::array class that I'll introduce in the next post (suffice for now to think of it as a container of data).

If I write the lambda line like this [=](index<2> idx) restrict(direct3d), all variables in the some_code_A region are made available to the lambda by value. This works for some types (e.g. an integer), but not for all, as per the restrictions for direct3d. In particular, no useful data classes work except for one new type we introduce with C++ AMP: objects of the new concurrency::array_view class, that I'll introduce in the post after next. Also note that if you capture some variable by value, you could use it as input to your algorithm, but you wouldn’t be able to observe changes to it after the parallel_for_each call (e.g. in some_code_B region since it was passed by value) – the exception to this rule is the array_view since (as we'll see in a future post) it is a wrapper for data, not a container.

Finally, for completeness, you can write your lambda, e.g. like this [av, &ar](index<2> idx) restrict(direct3d) where av is a variable of type array_view and ar is a variable of type array - the point being you can be very specific about what variables you capture and how.

So it looks like from a large data perspective you can only capture array and array_view objects in the lambda (that is how you pass data to your kernel) and then use the many threads that call your code (each with a unique index) to perform some operation. You can also capture some limited types by value, as input only. When the last thread completes execution of your lambda, the data in the array_view or array are ready to be used in the some_code_B region. We'll talk more about all this in future posts…

(a)synchronous

Please note that the parallel_for_each executes as if synchronous to the calling code, but in reality, it is asynchronous. I.e. once the parallel_for_each call is made and the kernel has been passed to the runtime, the some_code_B region continues to execute immediately by the CPU thread, while in parallel the kernel is executed by the GPU threads. However, if you try to access the (array or array_view) data that you captured in the lambda in the some_code_B region, your code will block until the results become available. Hence the correct statement: the parallel_for_each is as-if synchronous in terms of visible side-effects, but asynchronous in reality.

 

That's all for now, we'll revisit the parallel_for_each description, once we introduce properly array and array_view – coming next.




Comments about this post by Daniel Moth welcome at the original blog.

© Daniel Moth or respective owner

Related posts about GPGPU

  • GPGPU

    as seen on Daniel Moth - Search for 'Daniel Moth'
    WhatGPU obviously stands for Graphics Processing Unit (the silicon powering the display you are using to read this blog post). The extra GP in front of that stands for General Purpose computing.So, altogether GPGPU refers to computing we can perform on GPU for purposes beyond just drawing on the screen… >>> More

  • gpgpu vs. physX for physics simulation

    as seen on Game Development - Search for 'Game Development'
    Hello First theoretical question. What is better (faster)? Develop your own gpgpu techniques for physics simulation (cloth, fluids, colisions...) or to use PhysX? (If i say develop i mean implement existing algorithms like navier-strokes...) I don't care about what will take more time to develop… >>> More

  • Financial applications on GPGPU

    as seen on Stack Overflow - Search for 'Stack Overflow'
    I want to know what sort of financial applications can be implemented using a GPGPU. I'm aware of Option pricing/ Stock price estimation using Monte Carlo simulation on GPGPU using CUDA. Can someone enumerate the various possibilities of utilizing GPGPU for any application in Finance domain, >>> More

  • Best approach for GPGPU/CUDA/OpenCL in Java?

    as seen on Stack Overflow - Search for 'Stack Overflow'
    General-purpose computing on graphics processing units (GPGPU) is a very attractive concept to harness the power of the GPU for any kind of computing. I'd love to use GPGPU for image processing, particles, and fast geometric operations. Right now, it seems the two contenders in this space are CUDA… >>> More

  • GPGPU programming with OpenGL ES 2.0

    as seen on Stack Overflow - Search for 'Stack Overflow'
    I am trying to do some image processing on the GPU, e.g. median, blur, brightness, etc. The general idea is to do something like this framework from GPU Gems 1. I am able to write the GLSL fragment shader for processing the pixels as I've been trying out different things in an effect designer app… >>> More

Related posts about ParallelComputing

  • C++ AMP Video Overview

    as seen on Daniel Moth - Search for 'Daniel Moth'
    I hope to be recording some C++ AMP screencasts for channel9 soon (you'll find them through my regular screencasts link on the left), and in all of them I will assume you have watched this short interview overview of C++ AMP.   Note: I think there were some technical problems with streaming… >>> More

  • Screencasts introducing C++ AMP

    as seen on Daniel Moth - Search for 'Daniel Moth'
    It has been almost 2.5 years since I last recorded a screencast, and I had forgotten how time consuming they are to plan/record/edit/produce/publish, but at the same time so much fun to see the end result! So below are links to 4 screencasts to teach you C++ AMP basics from scratch (even if you class… >>> More

  • GPGPU

    as seen on Daniel Moth - Search for 'Daniel Moth'
    WhatGPU obviously stands for Graphics Processing Unit (the silicon powering the display you are using to read this blog post). The extra GP in front of that stands for General Purpose computing.So, altogether GPGPU refers to computing we can perform on GPU for purposes beyond just drawing on the screen… >>> More

  • GPU Debugging with VS 11

    as seen on Daniel Moth - Search for 'Daniel Moth'
    With VS 11 Developer Preview we have invested tremendously in parallel debugging for both CPU (managed and native) and GPU debugging. I'll be doing a whole bunch of blog posts on those topics, and in this post I just wanted to get people started with GPU debugging, i.e. with debugging C++ AMP code… >>> More

  • Microsoft Windows HPC Server R2 Beta2

    as seen on Daniel Moth - Search for 'Daniel Moth'
    Internally and unofficially we refer to this as "HPC Server v3" and its Beta2 became available last week. Read the full story on this blog post from Ryan and this one from Don. There has been a lot of excitement on the web for this release with coverage from last Wednesday here, here, here… >>> More