Parallel processing slower than sequential?
        Posted  
        
            by 
                zebediah49
            
        on Super User
        
        See other posts from Super User
        
            or by zebediah49
        
        
        
        Published on 2011-07-28T20:30:02Z
        Indexed on 
            2012/06/28
            15:19 UTC
        
        
        Read the original article
        Hit count: 354
        
EDIT: For anyone who stumbles upon this in the future: Imagemagick uses a MP library. It's faster to use available cores if they're around, but if you have parallel jobs, it's unhelpful.
Do one of the following:
- do your jobs serially (with Imagemagick in parallel mode)
- set MAGICK_THREAD_LIMIT=1 for your invocation of the imagemagick binary in question.
By making Imagemagick use only one thread, it slows down by 20-30% in my test cases, but meant I could run one job per core without issues, for a significant net increase in performance.
Original question:
While converting some images using ImageMagick, I noticed a somewhat strange effect. Using xargs was significantly slower than a standard for loop. Since xargs limited to a single process should act like a for loop, I tested that, and found it to be about the same.
Thus, we have this demonstration.
- Quad core (AMD Athalon X4, 2.6GHz)
- Working entirely on a tempfs (16g ram total; no swap)
- No other major loads
Results:
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real        0m3.784s
user        0m2.240s
sys         0m0.230s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level
real        0m9.097s
user        0m28.020s
sys         0m0.910s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level
real        0m9.844s
user        0m33.200s
sys         0m1.270s
Can anyone think of a reason why running two instances of this program takes more than twice as long in real time, and more than ten times as long in processor time to complete the same task? After that initial hit, more processes do not seem to have as significant of an effect.
I thought it might have to do with disk seeking, so I did that test entirely in ram. Could it have something to do with how Convert works, and having more than one copy at once means it cannot use processor cache as efficiently or something?
EDIT: When done with 1000x 769KB files, performance is as expected. Interesting.
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real    3m37.679s
user    5m6.980s
sys 0m6.340s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 1 convert -auto-level
real    3m37.152s
user    5m6.140s
sys 0m6.530s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 2 convert -auto-level
real    2m7.578s
user    5m35.410s
sys     0m6.050s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 4 convert -auto-level
real    1m36.959s
user    5m48.900s
sys     0m6.350s
/media/ramdisk/img$ time for f in *.bmp; do echo $f ${f%bmp}png; done | xargs -n 2 -P 10 convert -auto-level
real    1m36.392s
user    5m54.840s
sys     0m5.650s
© Super User or respective owner