Unexpected performance curve from CPython merge sort

Posted by vkazanov on Stack Overflow See other posts from Stack Overflow or by vkazanov
Published on 2012-04-03T16:41:06Z Indexed on 2012/04/04 5:29 UTC
Read the original article Hit count: 322

Filed under:

python

|

sorting

|

merge

|

garbage-collection

I have implemented a naive merge sorting algorithm in Python. Algorithm and test code is below:

import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque

def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()

def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result

LOOP_COUNT = 100
START_N = 1
END_N = 1000

def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return  time.clock() - start

def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()

run_test()

As much as I can see everything is OK and I should get a nice N*logN curve as a result. But the picture differs a bit:

enter image description here

Things I've tried to investigate the issue:

PyPy. The curve is ok.
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious. ` I had another implementation (a recursive one using the same merge function), it acts the similar way. The more full test cycles I create - the more "jumps" there are in the curve.

So how can this behaviour be explained and - hopefully - fixed?

UPD: changed lists to collections.deque

UPD2: added the full test code

UPD3: I use Python 2.7.1 on a Ubuntu 11.04 OS, using a quad-core 2Hz notebook. I tried to turn of most of all other processes: the number of spikes went down but at least one of them was still there.

Developer IT