I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 rows x ~200 columns).  Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution.  Applying that method serially in a double loop, like so,
list = who('data_matrix_prefix*');        % names of the individual data matrices
H = cell(numel(list), numel(list));       % pairwise comparison results
for i = 1:numel(list)
    for j = 1:numel(list)
        if i ~= j
            % each matrix is a separate workspace variable, hence the eval
            eval(['H{i,j} = compare(' list{i} ',' list{j} ');']);
        end
    end
end
is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s).  However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
% who(), load() and struct2cell() calls place the k data matrices in a 1D cell array called data.
H = cell(k, k);
nextData = cell(k, 1);
for i = 1:k
    [nextData{:}] = deal(data{i});        % k copies of the i-th matrix
    H(:,i) = cellfun(@compare, data, nextData, 'UniformOutput', false);
end
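(For reference, the gathering step mentioned in the comment above amounts to roughly the following; the .mat filename here is made up.)
S    = load('data_matrices.mat', 'data_matrix_prefix*');  % one struct field per matrix
data = struct2cell(S);    % k-by-1 cell array of 20-by-~200 matrices
k    = numel(data);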
Unfortunately, the cellfun() version is not really any faster, because essentially all of the time is spent inside compare().  Both of these code examples seem ill-suited to parallelization, and I'm having trouble figuring out how to make my variables sliced.
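The closest I have come is a parfor loop that fills one row of H per iteration, roughly like the sketch below. The row-at-a-time slicing is my own guess at how to satisfy the slicing rules, and I suspect broadcasting the entire data cell array to every worker is part of the problem:
% Sketch of my parfor attempt -- data is the k-by-1 cell array from above.
% Each iteration builds a full row locally, then assigns it in one go so that
% H is (hopefully) treated as a sliced output variable.
k = numel(data);
H = cell(k, k);
parfor i = 1:k
    rowH = cell(1, k);          % temporary, local to each iteration
    for j = 1:k
        if i ~= j
            rowH{j} = compare(data{i}, data{j});
        end
    end
    H(i,:) = rowH;              % one sliced assignment per iteration
end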
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).  
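compare() itself is the method from the linked answer, so I won't reproduce it here; for anyone who wants to run the snippets above, a made-up stand-in with the same signature and a similar cost profile would be something like:
% Placeholder only -- NOT the real comparison; just conv2 plus a reduction so
% the surrounding loops can be exercised and timed.
function score = compare(A, B)
    C = conv2(A, rot90(B, 2));   % full 2-D cross-correlation of the two matrices
    score = max(C(:));           % collapse to a single scalar score
end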
Does anyone see an (explicitly) parallelized solution, or a better vectorization of the problem?