Search Results

Search found 1 results on 1 pages for 'user1429825'.

Page 1/1 | 1 

  • HDFS some datanodes of cluster are suddenly disconnected while reducers are running

    - by user1429825
    I have 8 slave computers and 1 master computer for running Hadoop (ver 0.21) some datanodes of cluster are suddenly disconnected while I was running MapReduce code on 10GB data After all mappers finished and around 80% of reducers was processed, randomly one or more datanode disconned from network. and then the other datanodes start to disappear from network even if I killed the MapReduce job when I found some datanode was disconnected. I've tried to change dfs.datanode.max.xcievers to 4096, turned off fire-walls of all computing node, disabled selinux and increased the number of file open limit to 20000 but they didn't work at all... anyone have a idea to solve this problem? followings are error log from mapreduce 12/06/01 12:31:29 INFO mapreduce.Job: Task Id : attempt_201206011227_0001_r_000006_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink as ***.***.***.148:20010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) and followings are logs from datanode 2012-06-01 13:01:01,118 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5549263231281364844_3453 src: /*.*.*.147:56205 dest: /*.*.*.142:20010 2012-06-01 13:01:01,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020) Starting thread to transfer block blk_-3849519151985279385_5906 to *.*.*.147:20010 2012-06-01 13:01:19,135 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-5797481564121417802_3453 to *.*.*.146:20010 got java.net.ConnectException: > Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1257) at java.lang.Thread.run(Thread.java:722) 2012-06-01 13:06:20,342 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6674438989226364081_3453 2012-06-01 13:09:01,781 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-3849519151985279385_5906 to *.*.*.147:20010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/*.*.*.142:60057 remote=/*.*.*.147:20010] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:203) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:388) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:476) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1284) at java.lang.Thread.run(Thread.java:722) hdfs-site.xml <configuration> <property> <name>dfs.name.dir</name> <value>/home/hadoop/data/name</value> </property> <property> <name>dfs.data.dir</name> <value>/home/hadoop/data/hdfs1,/home/hadoop/data/hdfs2,/home/hadoop/data/hdfs3,/home/hadoop/data/hdfs4,/home/hadoop/data/hdfs5</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> <property> <name>dfs.http.address</name> <value>0.0.0.0:20070</value> <description>50070 The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:20075</value> <description>50075 The datanode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.secondary.http.address</name> <value>0.0.0.0:20090</value> <description>50090 The secondary namenode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:20010</value> <description>50010 The address where the datanode server will listen to. If the port is 0 then the server will start on a free port. </description> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:20020</value> <description>50020 The datanode ipc server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:20475</value> </property> <property> <name>dfs.https.address</name> <value>0.0.0.0:20470</value> </property> </configuration> mapred-site.xml <configuration> <property> <name>mapred.job.tracker</name> <value>masternode:29001</value> </property> <property> <name>mapred.system.dir</name> <value>/home/hadoop/data/mapreduce/system</value> </property> <property> <name>mapred.local.dir</name> <value>/home/hadoop/data/mapreduce/local</value> </property> <property> <name>mapred.map.tasks</name> <value>32</value> <description> default number of map tasks per job.</description> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.reduce.tasks</name> <value>8</value> <description> default number of reduce tasks per job.</description> </property> <property> <name>mapred.map.child.java.opts</name> <value>-Xmx2048M</value> </property> <property> <name>io.sort.mb</name> <value>500</value> </property> <property> <name>mapred.task.timeout</name> <value>1800000</value> <!-- 30 minutes --> </property> <property> <name>mapred.job.tracker.http.address</name> <value>0.0.0.0:20030</value> <description> 50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>mapred.task.tracker.http.address</name> <value>0.0.0.0:20060</value> <description> 50060 </property> </configuration>

    Read the article

1