Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 2/17 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

Hadoop:Only master node does the work

- by user287722

I've setup a Hadoop 2.2 cluster with 1 master node(namenode and secondary namenode) and 3 slave nodes(datanode and namenode on each one).All of the machines use Linux Mint 64bit. When I run my MapReduce program, writen in Java, I can only see that master node is using extra CPU and RAM. Slave nodes are not doing a thing. I've checked the logs from all of the namenodes and there is nothing wrong with the namenodes on slave nodes. Resource Manager is running and all of the slave nodes can see the Resource Manager. I used this http://n0where.net/hadoop-2-2-multi-node-cluster-setup/ tutorial to configure my nodes. Datanodes are working in terms of distributed data storing but I can't see any indication of distributed data processing. Do I have to configure the xml configuration files in some other way so all of the machines will process data while I'm running my MapReduce Job?

Read the article
Hadoop streaming with Python and python subprocess

- by Ganesh

I have established a basic hadoop master slave cluster setup and able to run mapreduce programs (including python) on the cluster. Now I am trying to run a python code which accesses a C binary and so I am using the subprocess module. I am able to use the hadoop streaming for a normal python code but when I include the subprocess module to access a binary, the job is getting failed. As you can see in the below logs, the hello executable is recognised to be used for the packaging, but still not able to run the code. . . packageJobJar: [/tmp/hello/hello, /app/hadoop/tmp/hadoop-unjar5030080067721998885/] [] /tmp/streamjob7446402517274720868.jar tmpDir=null JarBuilder.addNamedStream hello . . 12/03/07 22:31:32 INFO mapred.FileInputFormat: Total input paths to process : 1 12/03/07 22:31:32 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local] 12/03/07 22:31:32 INFO streaming.StreamJob: Running job: job_201203062329_0057 12/03/07 22:31:32 INFO streaming.StreamJob: To kill this job, run: 12/03/07 22:31:32 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201203062329_0057 12/03/07 22:31:32 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201203062329_0057 12/03/07 22:31:33 INFO streaming.StreamJob: map 0% reduce 0% 12/03/07 22:32:05 INFO streaming.StreamJob: map 100% reduce 100% 12/03/07 22:32:05 INFO streaming.StreamJob: To kill this job, run: 12/03/07 22:32:05 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201203062329_0057 12/03/07 22:32:05 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201203062329_0057 12/03/07 22:32:05 ERROR streaming.StreamJob: Job not Successful! 12/03/07 22:32:05 INFO streaming.StreamJob: killJob... Streaming Job Failed! Command I am trying is : hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/MARS.py -reducer /home/hduser/MARS_red.py -input /user/hduser/mars_inputt -output /user/hduser/mars-output -file /tmp/hello/hello -verbose where hello is the C executable. It is a simple helloworld program which I am using to check the basic functioning. My Python code is : #!/usr/bin/env python import subprocess subprocess.call(["./hello"]) Any help with how to get the executable run with Python in hadoop streaming or help with debugging this will get me forward in this. Thanks, Ganesh

Read the article
How to run a jar file in hadoop

- by Arihant

I have created a jar file using the java file from this blog using following statements javac -classpath /usr/local/hadoop/hadoop-core-1.0.3.jar -d /home/hduser/dir Dictionary.java /usr/lib/jvm/jdk1.7.0_07/bin/jar cf Dictionary.jar /home/hduser/dir Now i have tried running this jar in hadoop by hit and trial of various commands 1hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop jar Dictionary.jar Output: Warning: $HADOOP_HOME is deprecated. RunJar jarFile [mainClass] args... 2.hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop jar Dictionary.jar Dictionary Output: Warning: $HADOOP_HOME is deprecated. Exception in thread "main" java.lang.ClassNotFoundException: Dictionary at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:423) at java.lang.ClassLoader.loadClass(ClassLoader.java:356) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) How can i run the jar in hadoop? I have the right DFS Locations as per needed by my program.

Read the article
Raspberry Pi based Hadoop cluster

- by Dmitriy Sukharev

Is it at least possible to build Hadoop cluster from Raspberry Pi-based nodes? Can such a cluster meet hardware requirements of Hadoop? And if so, how much Raspberry Pi nodes are required to meet requirements? I understand that a cluster from several Raspberry Pi nodes being cheap is not powerful. My purpose is to organize cluster without possibility of loosing personal data from my desktop or notebook, and to use this cluster studying Hadoop. I'd appreciate if you suggest any better ideas of organizing a cheap Hadoop cluster for studying purposes. UPD: I've seen that recommended amount of memory for Hadoop is 16-24GB, multi-core processors, and 1TB of HDD, but it doesn't look like minimal requirements.

Read the article
0.20.2 API hadoop version with java 5

- by abdeslam

I have started a maven project trying to implement the MapReduce algorithm in java 1.5.0_14. I have chosen the 0.20.2 API hadoop version. In the pom.xml i'm using thus the following dependency: < dependency < groupId>org.apache.hadoop< /groupId> < artifactId>hadoop-core< /artifactId> < version>0.20.2< /version> < /dependency But when I'm using an import to the org.apache.hadoop classes, I get the following error: bad class file: ${HOME_DIR}\repository\org\apache\hadoop\hadoop-core\0.20.2\hadoop-core-0.20.2.jar(org/apache/hadoop/fs/Path.class) class file has wrong version 50.0, should be 49.0. Does someone know how can I solve this issue. Thanks.

Read the article
Hadoop streaming job on EC2 stays in "pending" state

- by liamf

Trying to experiment with Hadoop and Streaming using cloudera distribution CDH3 on Ubuntu. Have valid data in hdfs:// ready for processing. Wrote little streaming mapper in python. When I launch a mapper only job using: hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming*.jar -file /usr/src/mystuff/mapper.py -mapper /usr/src/mystuff/mapper.py -input /incoming/STBFlow/* -output testOP hadoop duly decides it will use 66 mappers on the cluster to process the data. The testOP directory is created on HDFS. A job_conf.xml file is created. But the job tracker UI at port 50030 never shows the job moving out of "pending" state and nothing else happens. CPU usage stays at zero. (the job is created though) If I give it a single file (instead of the entire directory) as input, same result (except Hadoop decides it needs 2 mappers instead of 66). I also tried using the "dumbo" Python utility and launching jobs using that: same result: permanently pending. So I am missing something basic: could someone help me out with what I should look for? The cluster is on Amazon EC2. Firewall issues maybe: ports are enabled explicitly, case by case, in the cluster security group.

Read the article
How to Set Up a Hadoop Cluster Using Oracle Solaris (Hands-On Lab)

- by Orgad Kimchi

Oracle Technology Network (OTN) published the "How to Set Up a Hadoop Cluster Using Oracle Solaris" OOW 2013 Hands-On Lab. This hands-on lab presents exercises that demonstrate how to set up an Apache Hadoop cluster using Oracle Solaris 11 technologies such as Oracle Solaris Zones, ZFS, and network virtualization. Key topics include the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model. We will also cover the Hadoop installation process and the cluster building blocks: NameNode, a secondary NameNode, and DataNodes. In addition, you will see how you can combine the Oracle Solaris 11 technologies for better scalability and data security, and you will learn how to load data into the Hadoop cluster and run a MapReduce job. Summary of Lab Exercises This hands-on lab consists of 13 exercises covering various Oracle Solaris and Apache Hadoop technologies: Install Hadoop. Edit the Hadoop configuration files. Configure the Network Time Protocol. Create the virtual network interfaces (VNICs). Create the NameNode and the secondary NameNode zones. Set up the DataNode zones. Configure the NameNode. Set up SSH. Format HDFS from the NameNode. Start the Hadoop cluster. Run a MapReduce job. Secure data at rest using ZFS encryption. Use Oracle Solaris DTrace for performance monitoring. Read it now

Read the article
Documentation for installing and running hadoop 2.2 on Windows

- by user2325154

With the latest release of Hadoop 2.2 I see that the release notes mentions that this version has significant improvements for running Hadoop on Windows. I downloaded Hadoop 2.2 yesterday and I saw lot of .cmd file alon with .sh files which ensures that this version has scripts and batch files for running Hadoop on Windows environment. However while looking at the Apache Hadoop documentation I couldn't find any step-by-step instructions on how to install and run this newer version on Windows. Besides this it looks like that the newer version has YARN architecture embedded in it and the old configurations provided on some of the tutorials online may be outdated and not applicable anymore. Is there any good documentation for Hadoop 2.2 available online ? I want it specifically for running Hadoop under Windows.

Read the article
Why Hadoop is tightly bound to linux?

- by user1676346

I am new with Hadoop. What are the specific reasons why Hadoop is so tightly bound with Linux, and the cluster it runs upon is homogeneous? I'm looking for really specific details that can tell me why Hadoop does not work well with windows, and if there are some libraries some specific scripts that are involved? My project is to deploy Hadoop without using Cygwin. I have already seen the article from Hayes Davis where he explained how to install Hadoop without Cygwin, but he said that there are some bugs. I might start from scratch to properly configure Hadoop on Windows, but if any one can explain what, specifically, are the reasons that Hadoop doesn't work well on windows that would be very helpful.

Read the article
Hadoop Rolling Small files

- by Arenstar

I am running Hadoop on a project and need a suggestion. Generally by default Hadoop has a "block size" of around 64mb.. There is also a suggestion to not use many/small files.. I am currently having very very very small files being put into HDFS due to the application design of flume.. The problem is, that Hadoop <= 0.20 cannot append to files, whereby i have too many files for my map-reduce to function efficiently.. There must be a correct way to simply roll/merge roughly 100 files into one.. Therefore Hadoop is effectively reading 1 large file instead of 10 Any Suggestions??

Read the article
ORDER BY job failed in the Pig script while running EmbeddedPig using Java

- by C.c. Huang

I have this following pig script, which works perfectly using grunt shell (stored the results to HDFS without any issues); however, the last job (ORDER BY) failed if I ran the same script using Java EmbeddedPig. If I replace the ORDER BY job by others, such as GROUP or FOREACH GENERATE, the whole script then succeeded in Java EmbeddedPig. So I think it's the ORDER BY which causes the issue. Anyone has any experience with this? Any help would be appreciated! The Pig script: REGISTER pig-udf-0.0.1-SNAPSHOT.jar; user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float); simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score; grouped_user_similarity = GROUP simplified_user_similarity BY user_id; ordered_user_similarity = FOREACH grouped_user_similarity { sorted = ORDER simplified_user_similarity BY sim_score DESC; top = LIMIT sorted 10; GENERATE group, top; }; top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10); all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0); grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id; influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score; ordered_influence_scores = ORDER influence_scores BY influence_score DESC; STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage(); The error log from Pig: 12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job 12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job 12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission. 12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir. 12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml 12/04/05 10:00:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null 12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100 12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720 12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680 12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004 java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248) at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153) at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112) ... 6 more 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete 12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed! 12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 0.20.2-cdh3u3 0.8.1-cdh3u3 cchuang 2012-04-05 10:00:34 2012-04-05 10:01:04 GROUP_BY,ORDER_BY Some jobs have failed! Stop running all dependent jobs Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs job_local_0001 0 0 0 0 0 0 0 0 all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity GROUP_BY job_local_0002 0 0 0 0 0 0 0 0 grouped_influence_scores,influence_scores GROUP_BY,COMBINER job_local_0003 0 0 0 0 0 0 0 0 ordered_influence_scores SAMPLER Failed Jobs: JobId Alias Feature Message Outputs job_local_0004 ordered_influence_scores ORDER_BY Message: Job failed! Error - NA /tmp/cc-test-results-1, Input(s): Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000" Output(s): Failed to produce result in "/tmp/cc-test-results-1" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_local_0001 -> job_local_0002, job_local_0002 -> job_local_0003, job_local_0003 -> job_local_0004, job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs

Read the article
Which Hadoop API should I use?

- by Niels Basjes

In the latest Hadoop Studio the 0.18 API of Hadoop is called "Stable" and the 0.20 API of Hadoop is called "Unstable". Now given the fact that we'll start coding a new Hadoop project in the next few weeks; which API should we use and which Hadoop distribution (Apache, Cloudera, Yahoo, ...) should we use? Thanks for your insights.

Read the article
What is the value of the Cloudera Hadoop Certification for people new to the IT industry?

- by Saumitra

I am a software developer with 8 months of experience in the IT industry, currently working on the development of tools for BIG DATA analytics. I have learned Hadoop basics on my own and I am pretty comfortable with writing MapReduce Jobs, PIG, HIVE, Flume and other related projects. I am thinking of taking the exam for the Cloudera Hadoop Certification. Will this certification add value, considering that I have less than 1 year of experience? Many of the jobs I've seen relating to Hadoop require at least 3 years of experience. Should I invest more time in learning Hadoop and improving my skills to take this certification?

Read the article
Hadoop safemode recovery - taking lot of time

- by Algorist

Hi, We are running our cluster on Amazon EC2. we are using cloudera scripts to setup hadoop. On the master node, we start below services. 609 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start namenode' 610 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start secondarynamenode' 611 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start jobtracker' 612 613 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop dfsadmin -safemode wait' On the slave machine, we run the below services. 625 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start datanode' 626 $AS_HADOOP '"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker' The main problem we are facing is, hdfs safemode recovery is taking more than an hour and this is causing delays in our job completion. Below are the main log messages. 1. domU-12-31-39-0A-34-61.compute-1.internal 10/05/05 20:44:19 INFO ipc.Client: Retrying connect to server: ec2-184-73-64-64.compute-1.amazonaws.com/10.192.11.240:8020. Already tried 21 time(s). 2. The reported blocks 283634 needs additional 322258 blocks to reach the threshold 0.9990 of total blocks 606499. Safe mode will be turned off automatically. The first message is thrown in task trackers log because, job tracker is not started. job tracker didn't start because of hdfs safemode recovery. The second message is thrown during the recovery process. Is there something I am doing wrong? How much time does normal hdfs safemode recovery takes? Will there be any speedup, by not starting task trackers till job tracker is started? Are there any known hadoop problems on amazon cluster? Thanks for your help. Regards Bala Mudiam

Read the article
Error in using Hadoop MapReduce in Eclipse

- by Shweta

When I executed a MapReduce program in Eclipse using Hadoop, I got the below error. It has to be some change in path, but I'm not able to figure it out. Any idea? 16:35:39 INFO mapred.JobClient: Task Id : attempt_201001151609_0001_m_000006_0, Status : FAILED java.io.FileNotFoundException: File C:/tmp/hadoop-Shwe/mapred/local/taskTracker/jobcache/job_201001151609_0001/attempt_201001151609_0001_m_000006_0/work/tmp does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.mapred.TaskRunner.setupWorkDir(TaskRunner.java:519) at org.apache.hadoop.mapred.Child.main(Child.java:155)

Read the article
Hadoop WordCount example stuck at map 100% reduce 0%

- by Abhinav Sharma

[hadoop-1.0.2] ? hadoop jar hadoop-examples-1.0.2.jar wordcount /user/abhinav/input /user/abhinav/output Warning: $HADOOP_HOME is deprecated. ****hdfs://localhost:54310/user/abhinav/input 12/04/15 15:52:31 INFO input.FileInputFormat: Total input paths to process : 1 12/04/15 15:52:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/04/15 15:52:31 WARN snappy.LoadSnappy: Snappy native library not loaded 12/04/15 15:52:31 INFO mapred.JobClient: Running job: job_201204151241_0010 12/04/15 15:52:32 INFO mapred.JobClient: map 0% reduce 0% 12/04/15 15:52:46 INFO mapred.JobClient: map 100% reduce 0% I've set up hadoop on a single node using this guide (http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#run-the-mapreduce-job) and I'm trying to run a provided example but I'm getting stuck at map 100% reduce 0%. What could be causing this?

Read the article
io Exception error in wordcount example

- by Anitha

I have installed Hadoop 1.0.3 in Ubuntu 12.04 version (64bit) based on michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ . I am trying to run a mapreduce job using the wordcount example. Running the command hduser@ubuntu: $/usr/local/hadoop/bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output gives the following error: Warning: $HADOOP_HOME is deprecated. Exception in thread "main" java.io.IOException: Error opening job jar: hadoop-examples-1.0.3.jar at org.apache.hadoop.util.RunJar.main(RunJar.java:90) Caused by: java.util.zip.ZipException: error in opening zip file at java.util.zip.ZipFile.open(Native Method) at java.util.zip.ZipFile.<init>(ZipFile.java:131) at java.util.jar.JarFile.<init>(JarFile.java:150) at java.util.jar.JarFile.<init>(JarFile.java:87) at org.apache.hadoop.util.RunJar.main(RunJar.java:88) Thanks in advance.

Read the article
Classnotfound exception while running hadoop

- by vana

Hi, I am new to hadoop. I have a file Wordcount.java which refers hadoop.jar and stanford-parser.jar I am running the following commnad javac -classpath .:hadoop-0.20.1-core.jar:stanford-parser.jar -d ep WordCount.java jar cvf ep.jar -C ep . bin/hadoop jar ep.jar WordCount gutenburg gutenburg1 After executing i am getting the following error: lang.ClassNotFoundException: edu.stanford.nlp.parser.lexparser.LexicalizedParser The class is in stanford-parser.jar ... What can be the possible problem? Thanks

Read the article
Reverse and Forward DNS set up correctly but sometimes MapReduce job fails

- by phodamentals

Ever since we switched over our cluster to communicate via private interfaces and created a DNS server with correct forward and reverse lookup zones, we get this message before the M/R job runs: ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormatBase - Cannot resolve the host name for /192.168.3.9 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '9.3.168.192.in-addr.arpa' A dig and nslookup both show that the reverse and forward look-ups both get good responses with no errors from within the cluster. Shortly after these messages, the job runs...but every once in awhile we get a NPE: Exception in thread "main" java.lang.NullPointerException INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.net.DNS.reverseDns(DNS.java:93) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.reverseDNS(TableInputFormatBase.java:219) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:184) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1063) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1080) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:992) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:945) INFO app.insights.search.SearchIndexUpdater - at java.security.AccessController.doPrivileged(Native Method) INFO app.insights.search.SearchIndexUpdater - at javax.security.auth.Subject.doAs(Subject.java:415) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:945) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapreduce.Job.submit(Job.java:566) INFO app.insights.search.SearchIndexUpdater - at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:596) INFO app.insights.search.SearchIndexUpdater - at app.insights.search.correlator.comments.CommentCorrelator.main(CommentCorrelator.java:72 Does anyone else who has set-up a CDH Hadoop cluster on a private network w/DNS server get this? CDH 4.3.1 with MR1 2.0.0 and HBase 0.94.6

Read the article
hadoop: port appears open locally but not remotelly

- by miguel

I am new to linux and hadoop and I am having the same issue as in this question. I think I understand what is causing it but I don't know how to solve it (Don't know what they mean by "Edit the Hadoop server's configuration file so that it includes its NIC's address."). The other post that they link says that the configuration files should refer to the machine's externally accessible host name. I think I got this right as every hadoop configuration file refers to "master" and the etc/hosts file lists the master by its private IP address. How can I solve this? Edit: I have 5 nodes: master, slavec, slaved, slavee and slavef all running debian. This is the hosts file in master: 127.0.0.1 master 10.0.1.201 slavec 10.0.1.202 slaved 10.0.1.203 slavee 10.0.1.204 slavef this is the hosts file in slavec (it looks similar in the other slaves): 10.0.1.200 master 127.0.0.1 slavec 10.0.1.202 slaved 10.0.1.203 slavee 10.0.1.204 slavef the masters file in master: master the slaves file in master: master slavec slaved slavee slavef the masters and slaves file in slavex has only one line: slavex

Read the article
Where can I find the supported way to deploy hadoop on precise?

- by Jeff McCarrell

I want to set up a small (6 node) hadoop/hive/pig cluster. I see the work in the juju space on charms; however, the current status of deploying a single charm per node will not work for me. I see ServerTeam Hadoop which talks about re-packaging the bigtop packages. The cloudera CDH3 installation guide talks about Maverick and Lucid, but not precise. What am I missing? Is there a straight forward way to deploy hadoop/hive/pig on 6 nodes that does not involve building from tarballs?

Read the article
how to convince other we should move to hadoop?

- by Ramy

Everything I've read about Hadoop seems like exactly the technology we need to make our enterprise more scalable. We have terabytes of raw data that is in non-relational form (text files of some kind). We're quickly approaching the upper limits of what our centralized file server can handle and everyone is aware of this. Most people on the tech team, especially the more junior members of the tech team are all in favor of moving from the central file system to HDFS. The problem is, there is one key (most senior, etc.) member of the team who is resisting this change and every time Hadoop comes up, he tells us that we could simply add another file server and be in the clear. So, my question (and yes, it's really subjective, but I need more help with this than any of my other questions) is what steps can we take to get upper management to move forward with Hadoop despite the hesitation of one member of the team?

Read the article
Which Hadoop API version should I use?

- by Niels Basjes

In the latest Hadoop Studio the 0.18 API of Hadoop is called "Stable" and the 0.20 API of Hadoop is called "Unstable". The distribution that comes from Yahoo is a 0.20 (with yahoo patches), which is apparently "the way to go". From cloudera they state the 0.20 (with cloudera patches) is also stable. Now given the fact that we'll start coding a new Hadoop project in the next few weeks; which API should we use and which Hadoop distribution (Apache, Cloudera, Yahoo, ...) should we use? Thanks for your insights.

Read the article
Hadoop in a RESTful Java Web Application - Conflicting URI templates

- by user1231583

I have a small Java Web Application in which I am using Jersey 1.12 and the Hadoop 1.0.0 JAR file (hadoop-core-1.0.0.jar). When I deploy my application to my JBoss 5.0 server, the log file records the following error: SEVERE: Conflicting URI templates. The URI template / for root resource class org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods and the URI template / transform to the same regular expression (/.*)? To make sure my code is not the problem, I have created a fresh web application that contains nothing but the Jersey and Hadoop JAR files along with a small stub. My web.xml is as follows: <?xml version="1.0" encoding="UTF-8"?> <web-app version="2.5" xmlns="http://java.sun.com/xml/ns/javaee" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://java.sun.com/xml/ns/javaee http://java.sun.com/xml/ns/javaee/web-app_2_5.xsd"> <servlet> <servlet-name>ServletAdaptor</servlet-name> <servlet-class>com.sun.jersey.spi.container.servlet.ServletContainer</servlet- class> <load-on-startup>1</load-on-startup> </servlet> <servlet-mapping> <servlet-name>ServletAdaptor</servlet-name> <url-pattern>/mytest/*</url-pattern> </servlet-mapping> <session-config> <session-timeout> 30 </session-timeout> </session-config> <welcome-file-list> <welcome-file>index.jsp</welcome-file> </welcome-file-list> </web-app> My simple RESTful stub is as follows: import javax.ws.rs.core.Context; import javax.ws.rs.core.UriInfo; import javax.ws.rs.Path; @Path("/mytest") public class MyRest { @Context private UriInfo context; public MyRest() { } } In my regular application, when I remove the Hadoop JAR files (and the code that is using Hadoop), everything works as I would expect. The deployment is successful and the remaining RESTful services work. I have also tried the Hadoop 1.0.1 JAR files and have had the same problems with the conflicting URL template in the NamenodeWebHdfsMethods class. Any suggestions or tips in solving this problem would be greatly appreciated.

Read the article
Best practice for administering a (hadoop) cluster

- by Alex

Dear all, I've recently been playing with Hadoop. I have a six node cluster up and running - with HDFS, and having run a number of MapRed jobs. So far, so good. However I'm now looking to do this more systematically and with a larger number of nodes. Our base system is Ubuntu and the current setup has been administered using apt (to install the correct java runtime) and ssh/scp (to propagate out the various conf files). This is clearly not scalable over time. Does anyone have any experience of good systems for administering (possibly slightly heterogenous: different disk sizes, different numbers of cpus on each node) hadoop clusters automagically? I would consider diskless boot - but imagine that with a large cluster, getting the cluster up and running might be bottle-necked on the machine serving the OS. Or some form of distributed debian apt to keep the machines native environment synchronised? And how do people successfully manage the conf files over a number of (potentially heterogenous) machines? Thanks very much in advance, Alex

Read the article

Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 2/17 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by user287722

- by Ganesh

- by Arihant

- by Dmitriy Sukharev

- by abdeslam

- by liamf

- by Orgad Kimchi

- by user2325154

- by user1676346

- by Arenstar

- by C.c. Huang

- by Niels Basjes

- by Saumitra

- by Algorist

- by Shweta

- by Abhinav Sharma

- by Anitha

- by vana

- by phodamentals

- by miguel

- by Jeff McCarrell

- by Ramy

- by Niels Basjes

- by user1231583

- by Alex

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >