Bash script 'while read' loop causes 'broken pipe' error when run with GNU Parallel

Posted by Joe White on Super User
Published on 2012-09-27T22:35:53Z

According to the GNU Parallel mailing list, this is not a GNU Parallel-specific problem, and they suggested that I post it here.

The error I'm getting is a "broken pipe" error, but I should first explain the context and what triggers it. It happens whenever I run a bash script containing a 'while read' loop under GNU Parallel.

I have a basic bash script like this:

#!/bin/bash
# linkcheck.sh

while read domain
do
    host "$domain"
done

Assume that I want to pipe in a large list (say, 250 MB):

cat urllist | ./linkcheck.sh

Running the host command on 250 MB worth of URLs is rather slow. To speed things up, I want to split the input into chunks before piping it and then run multiple jobs in parallel. GNU Parallel is capable of doing this:

cat urllist | parallel --pipe -j0 parallel ./linkcheck.sh {}

{} is replaced by each line of urllist in turn. Assume that my system's default setup can run 500ish jobs per instance of parallel. To get around this limitation, we can parallelize Parallel itself:

cat urllist | parallel -j10 --pipe parallel -j0 ./linkcheck.sh {}

This will run 5000ish jobs. Sadly, it will also cause a "broken pipe" error (see the bash FAQ). Yet the script starts to work if I remove the while read loop and take the input directly from whatever is substituted for {}, e.g.:

#!/bin/bash
# linkchecker.sh

domain="$1"
host "$domain"
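
To make the difference between the two scripts concrete, here is a minimal sketch of the two ways GNU Parallel can hand input to a worker: as a block of lines on standard input (--pipe) or as one argument per job ({}). The worker scripts, the temporary directory and the three example domains are hypothetical stand-ins, and the workers echo each domain instead of calling host so that no network is needed. It assumes GNU Parallel is installed and skips the demonstration otherwise.

```shell
#!/bin/bash
# Sketch only: contrasts stdin mode (--pipe) with argument mode ({}).
# All file names and domains below are made up for illustration.

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' a.example b.example c.example > "$workdir/urllist"

# Worker that reads domains from standard input, like linkcheck.sh.
cat > "$workdir/stdin_worker.sh" <<'EOF'
#!/bin/bash
while IFS= read -r domain; do
    echo "stdin: $domain"
done
EOF

# Worker that takes one domain as its first argument, like linkchecker.sh.
cat > "$workdir/arg_worker.sh" <<'EOF'
#!/bin/bash
echo "arg: $1"
EOF

stdin_out=""
arg_out=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --pipe: each job receives a block of input lines on its stdin; no {} is used.
    stdin_out=$(parallel --pipe bash "$workdir/stdin_worker.sh" < "$workdir/urllist")
    # No --pipe: each input line starts one job and is substituted for {}.
    # Job output order is not guaranteed, so sort it for a stable result.
    arg_out=$(parallel bash "$workdir/arg_worker.sh" {} < "$workdir/urllist" | sort)
    printf '%s\n' "$stdin_out" "$arg_out"
else
    echo "GNU Parallel not found; skipping demonstration" >&2
fi
```

In this sketch the stdin-reading worker is only ever paired with --pipe, and the argument-taking worker is only ever paired with {}. The commands in my question mix the two, which may or may not be related to the error.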

Why will it not work with a while read loop? Is it safe to just turn off the SIGPIPE signal to stop the "broken pipe" message, or will that have side effects such as data corruption?
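
For reference, here is a small sketch of what "broken pipe" means in isolation, independent of GNU Parallel: a writer keeps writing after its reader has exited. It compares the default SIGPIPE behaviour with a writer that ignores the signal. The domain string and the marker exit status 3 are arbitrary choices for illustration, and the sketch says nothing about whether ignoring the signal is appropriate in my pipeline.

```shell
#!/bin/bash
# Sketch: observe how a writer is affected when its reader exits early.

# Case 1: default SIGPIPE handling.
# `yes` writes forever; `head` exits after reading one line.
yes "example.com" | head -n 1 > /dev/null
default_status=("${PIPESTATUS[@]}")

# Case 2: SIGPIPE ignored in the writer.
# The write now fails with an error (EPIPE) instead of killing the process,
# so the loop has to notice the failed echo and stop by itself.
(
    trap '' PIPE
    while :; do
        echo "example.com" 2>/dev/null || exit 3   # 3 is an arbitrary marker status
    done
) | head -n 1 > /dev/null
ignored_status=("${PIPESTATUS[@]}")

# With default handling on Linux the writer status is typically 141 (128 + 13, SIGPIPE).
echo "writer status, default SIGPIPE: ${default_status[0]}"
echo "writer status, SIGPIPE ignored: ${ignored_status[0]}"
```

In the second case the writer is not killed; it only stops because the loop checks the result of echo. A writer that ignored both the signal and the write error would keep running with nobody reading its output.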

Thanks for reading.

© Super User or respective owner