Hadoop streaming job on EC2 stays in "pending" state

I'm experimenting with Hadoop Streaming using the Cloudera distribution (CDH3) on Ubuntu.

I have valid data in HDFS ready for processing.

I've written a small streaming mapper in Python.
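It's nothing more elaborate than the usual stdin-to-stdout filter; a minimal sketch of the sort of thing I'm running (the field parsing here is a placeholder, not my actual STBFlow logic) would be:

#!/usr/bin/env python
# Minimal Hadoop Streaming mapper: read records from stdin, emit
# tab-separated key/value pairs on stdout.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Placeholder parsing -- the real mapper pulls fields out of each STBFlow record.
    fields = line.split()
    print("%s\t%s" % (fields[0], 1))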

When I launch a mapper-only job using:

hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar -file /usr/src/mystuff/mapper.py -mapper /usr/src/mystuff/mapper.py -input /incoming/STBFlow/* -output testOP

Hadoop duly decides it will use 66 mappers on the cluster to process the data. The testOP output directory is created on HDFS and a job_conf.xml file is written, so the job is clearly being created. But the JobTracker UI on port 50030 never shows the job moving out of the "pending" state, nothing else happens, and CPU usage across the cluster stays at zero.

If I give it a single file (instead of the entire directory) as input, I get the same result, except that Hadoop decides it needs 2 mappers instead of 66.

I also tried launching jobs with the "dumbo" Python utility: same result, permanently pending.

So I am clearly missing something basic: could someone point me at what I should be looking for? The cluster is on Amazon EC2, so firewall issues are a possibility: ports are enabled explicitly, case by case, in the cluster security group.
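One thing I plan to rule out is the TaskTracker nodes being unable to reach the JobTracker's RPC port at all. A throwaway check along these lines, run from a worker node, is what I have in mind (the hostname is hypothetical and 8021 is just the CDH default; the real value is whatever mapred.job.tracker is set to):

#!/usr/bin/env python
# Quick check that this node can open a TCP connection to the JobTracker
# RPC port. Host and port below are assumptions: use the value of
# mapred.job.tracker from mapred-site.xml for the real cluster.
import socket

JOBTRACKER_HOST = "ec2-master.internal"  # hypothetical master hostname
JOBTRACKER_PORT = 8021                   # CDH default JobTracker RPC port

try:
    conn = socket.create_connection((JOBTRACKER_HOST, JOBTRACKER_PORT), timeout=5)
    conn.close()
    print("OK: reached %s:%d" % (JOBTRACKER_HOST, JOBTRACKER_PORT))
except socket.error as err:
    print("FAILED to reach %s:%d (%s)" % (JOBTRACKER_HOST, JOBTRACKER_PORT, err))

If that connection fails, it would at least explain why the tasks never get scheduled.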
