Parallelism in .NET – Part 8, PLINQ’s ForAll Method

Posted by Reed on Reed Copsey See other posts from Reed Copsey or by Reed
Published on Thu, 04 Feb 2010 01:56:25 +0000 Indexed on 2010/12/06 17:00 UTC
Read the original article Hit count: 647

Filed under:
|
|
|
|
|
|

Parallel LINQ extends LINQ to Objects, and is typically very similar.  However, as I previously discussed, there are some differences.  Although the standard way to handle simple Data Parellelism is via Parallel.ForEach, it’s possible to do the same thing via PLINQ.

PLINQ adds a new method unavailable in standard LINQ which provides new functionality…

LINQ is designed to provide a much simpler way of handling querying, including filtering, ordering, grouping, and many other benefits.  Reading the description in LINQ to Objects on MSDN, it becomes clear that the thinking behind LINQ deals with retrieval of data.  LINQ works by adding a functional programming style on top of .NET, allowing us to express filters in terms of predicate functions, for example.

PLINQ is, generally, very similar.  Typically, when using PLINQ, we write declarative statements to filter a dataset or perform an aggregation.  However, PLINQ adds one new method, which provides a very different purpose: ForAll.

The ForAll method is defined on ParallelEnumerable, and will work upon any ParallelQuery<T>.  Unlike the sequence operators in LINQ and PLINQ, ForAll is intended to cause side effects.  It does not filter a collection, but rather invokes an action on each element of the collection.

At first glance, this seems like a bad idea.  For example, Eric Lippert clearly explained two philosophical objections to providing an IEnumerable<T>.ForEach extension method, one of which still applies when parallelized.  The sole purpose of this method is to cause side effects, and as such, I agree that the ForAll method “violates the functional programming principles that all the other sequence operators are based upon”, in exactly the same manner an IEnumerable<T>.ForEach extension method would violate these principles.  Eric Lippert’s second reason for disliking a ForEach extension method does not necessarily apply to ForAll – replacing ForAll with a call to Parallel.ForEach has the same closure semantics, so there is no loss there.

Although ForAll may have philosophical issues, there is a pragmatic reason to include this method.  Without ForAll, we would take a fairly serious performance hit in many situations.  Often, we need to perform some filtering or grouping, then perform an action using the results of our filter. 

Using a standard foreach statement to perform our action would avoid this philosophical issue:

// Filter our collection
var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() );

// Now perform an action
foreach (var item in filteredItems)
{
    // These will now run serially
    item.DoSomething();
}

This would cause a loss in performance, since we lose any parallelism in place, and cause all of our actions to be run serially.

We could easily use a Parallel.ForEach instead, which adds parallelism to the actions:

// Filter our collection
var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() );

// Now perform an action once the filter completes
Parallel.ForEach(filteredItems, item =>
{
    // These will now run in parallel
    item.DoSomething();
});

This is a noticeable improvement, since both our filtering and our actions run parallelized.  However, there is still a large bottleneck in place here. 

The problem lies with my comment “perform an action once the filter completes”.  Here, we’re parallelizing the filter, then collecting all of the results, blocking until the filter completes.  Once the filtering of every element is completed, we then repartition the results of the filter, reschedule into multiple threads, and perform the action on each element.  By moving this into two separate statements, we potentially double our parallelization overhead, since we’re forcing the work to be partitioned and scheduled twice as many times.

This is where the pragmatism comes into play.  By violating our functional principles, we gain the ability to avoid the overhead and cost of rescheduling the work:

// Perform an action on the results of our filter
collection
    .AsParallel()
    .Where( i => i.SomePredicate() )
    .ForAll( i => i.DoSomething() );

The ability to avoid the scheduling overhead is a compelling reason to use ForAll.  This really goes back to one of the key points I discussed in data parallelism: Partition your problem in a way to place the most work possible into each task.  Here, this means leaving the statement attached to the expression, even though it causes side effects and is not standard usage for LINQ.

This leads to my one guideline for using ForAll:

The ForAll extension method should only be used to process the results of a parallel query, as returned by a PLINQ expression.

Any other usage scenario should use Parallel.ForEach, instead.

© Reed Copsey or respective owner

Related posts about .NET

Related posts about algorithms