hadoop - Page 4 - Developer IT

Set LD_LIBRARY_PATH and CLASSPATH on cluster nodes before running a hadoop job

- by Ashish Sharma

I need to set LD_LIBRARY_PATH and CLASSPATH before running a job a cluster. In LD_LIBRARY_PATH i need to add location of some jars which are required while running the job, As these jars are avaiable at my cluster, similar with CLASSPATH. I have a 3 NODE cluster, I need to set this LD_LIBRARY_PATH and CLASSPATH for all the 3 data nodes so that the following jar are available while running the job

Read the article

Hadoop on Amazon EC2 : Job tracker not starting properly

- by Algorist

Hi, We are running Hadoop on Amazon EC2 cluster. We start the master, slaves and attach the ebs volumes and finally waiting for hadoop jobtracker, tasktracker etc to start and we have timeout of 3600 seconds. We are noticing 50% of the time that job tracker is not able to start before the timeout. Reason being, hdfs is not initialized properly and still in safemode and job tracker is unable to start. I noticed few connectivity issues between nodes on EC2 as I tried manually pinging slaves. Did anyone face similar issue and know how to solve this? Thank you Bala

Read the article

running Hadoop software on office computers (when they are idle)

- by Shahbaz

Is there a project which helps setup a Hadoop cluster on office desktops, when they are idle? I'd like to experiment with Hadoop/MR/hbase but don't have acces to 5-10 computers. The computers at work are idle after hours and are connected to each other through a very high speed connection. What's more, data on these computers stays within our network so there is no privacy issue. In order for this to work I need a fairly light weight monitor running on each machine. When the computer has been idle for X hours, it will join the cluster. If the user logs on, it has to drop out of the cluster and return all CPU/memory back. Does something like this exist?

Read the article

hadoop - large database query

- by Mastergeek

Situation: I have a Postgres DB that contains a table with several million rows and I'm trying to query all of those rows for a MapReduce job. From the research I've done on DBInputFormat, Hadoop might try and use the same query again for a new mapper and since these queries take a considerable amount of time I'd like to prevent this in one of two ways that I've thought up: 1) Limit the job to only run 1 mapper that queries the whole table and call it good. or 2) Somehow incorporate an offset in the query so that if Hadoop does try to use a new mapper it won't grab the same stuff. I feel like option (1) seems more promising, but I don't know if such a configuration is possible. Option(2) sounds nice in theory but I have no idea how I would keep track of the mappers being made and if it is at all possible to detect that and reconfigure. Help is appreciated and I'm namely looking for a way to pull all of the DB table data and not have several of the same query running because that would be a waste of time.

Read the article

Knowledge mining using Hadoop.

- by Anurag

Hello there, I want to do a project Hadoop and map reduce and present it as my graduation project. To this, I've given some thought,searched over the internet and came up with the idea of implementing some basic knowledge mining algorithms say on a social websites like Facebook or may stckoverflow, Quora etc and draw some statistical graphs, comparisons frequency distributions and other sort of important values.For searching purpose would it be wise to use Apache Solr ? I want know If such thing is feasible using the above mentioned tools, if so how should I build up on this little idea? Where can I learn about knowledge mining algorithms which are easy to implement using java and map reduce techniques? In case this is a wrong idea please suggest what else can otherwise be done on using Hadoop and other related sub-projects? Thank you

Read the article

Microsoft lance Hadoop pour Windows Server et Windows Azure, première version Beta du framework "HDInsight"

Microsoft lance Hadoop pour Windows Server et Windows Azure Première version Beta du framework HDInsight. Microsoft lance une version bêta publique du Framework Hadoop pour Windows Server et Windows Azure. Les deux nouveaux produits portent les noms officiels de Windows Azure HDInsight Service et Microsoft HDInsight Server pour Windows. Ces produits sont nés d'un partenariat entre Microsoft et Hortonworks, éditeur de logiciels et fournisseur de solutions Hadoop commerciales. Un mois après l'annonce du partenariat en automne 2011, Microsoft a renoncé à faire sa propre solution Big-Data intitulée Dryad

Read the article

Découvrir Hadoop, avec un tutoriel traduit par Stéphane Dupont

Bonjour,Je vous présente ce tutoriel traduit par Stéphane Dupont intitulé : Tutoriel Hadoop Hadoop est un système distribué, tolérant aux pannes, pour le stockage de données et qui est hautement scalable. Cette capacité de monter en charge est le résultat d'un stockage en cluster à haute bande passante et répliqué, connu sous l'acronyme de HDFS (Hadoop Distributed File System) et d'un traitement distribué...

Read the article

HDFS some datanodes of cluster are suddenly disconnected while reducers are running

- by user1429825

I have 8 slave computers and 1 master computer for running Hadoop (ver 0.21) some datanodes of cluster are suddenly disconnected while I was running MapReduce code on 10GB data After all mappers finished and around 80% of reducers was processed, randomly one or more datanode disconned from network. and then the other datanodes start to disappear from network even if I killed the MapReduce job when I found some datanode was disconnected. I've tried to change dfs.datanode.max.xcievers to 4096, turned off fire-walls of all computing node, disabled selinux and increased the number of file open limit to 20000 but they didn't work at all... anyone have a idea to solve this problem? followings are error log from mapreduce 12/06/01 12:31:29 INFO mapreduce.Job: Task Id : attempt_201206011227_0001_r_000006_0, Status : FAILED java.io.IOException: Bad connect ack with firstBadLink as ***.***.***.148:20010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:889) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:820) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427) and followings are logs from datanode 2012-06-01 13:01:01,118 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5549263231281364844_3453 src: /*.*.*.147:56205 dest: /*.*.*.142:20010 2012-06-01 13:01:01,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020) Starting thread to transfer block blk_-3849519151985279385_5906 to *.*.*.147:20010 2012-06-01 13:01:19,135 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-5797481564121417802_3453 to *.*.*.146:20010 got java.net.ConnectException: > Connection timed out at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:373) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1257) at java.lang.Thread.run(Thread.java:722) 2012-06-01 13:06:20,342 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6674438989226364081_3453 2012-06-01 13:09:01,781 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(*.*.*.142:20010, storageID=DS-1534489105-*.*.*.142-20010-1337757934836, infoPort=20075, ipcPort=20020):Failed to transfer blk_-3849519151985279385_5906 to *.*.*.147:20010 got java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/*.*.*.142:60057 remote=/*.*.*.147:20010] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:164) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:203) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:388) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:476) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1284) at java.lang.Thread.run(Thread.java:722) hdfs-site.xml <configuration> <property> <name>dfs.name.dir</name> <value>/home/hadoop/data/name</value> </property> <property> <name>dfs.data.dir</name> <value>/home/hadoop/data/hdfs1,/home/hadoop/data/hdfs2,/home/hadoop/data/hdfs3,/home/hadoop/data/hdfs4,/home/hadoop/data/hdfs5</value> </property> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> <property> <name>dfs.http.address</name> <value>0.0.0.0:20070</value> <description>50070 The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.http.address</name> <value>0.0.0.0:20075</value> <description>50075 The datanode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.secondary.http.address</name> <value>0.0.0.0:20090</value> <description>50090 The secondary namenode http server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.address</name> <value>0.0.0.0:20010</value> <description>50010 The address where the datanode server will listen to. If the port is 0 then the server will start on a free port. </description> <property> <name>dfs.datanode.ipc.address</name> <value>0.0.0.0:20020</value> <description>50020 The datanode ipc server address and port. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:20475</value> </property> <property> <name>dfs.https.address</name> <value>0.0.0.0:20470</value> </property> </configuration> mapred-site.xml <configuration> <property> <name>mapred.job.tracker</name> <value>masternode:29001</value> </property> <property> <name>mapred.system.dir</name> <value>/home/hadoop/data/mapreduce/system</value> </property> <property> <name>mapred.local.dir</name> <value>/home/hadoop/data/mapreduce/local</value> </property> <property> <name>mapred.map.tasks</name> <value>32</value> <description> default number of map tasks per job.</description> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>4</value> </property> <property> <name>mapred.reduce.tasks</name> <value>8</value> <description> default number of reduce tasks per job.</description> </property> <property> <name>mapred.map.child.java.opts</name> <value>-Xmx2048M</value> </property> <property> <name>io.sort.mb</name> <value>500</value> </property> <property> <name>mapred.task.timeout</name> <value>1800000</value>  </property> <property> <name>mapred.job.tracker.http.address</name> <value>0.0.0.0:20030</value> <description> 50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. </description> </property> <property> <name>mapred.task.tracker.http.address</name> <value>0.0.0.0:20060</value> <description> 50060 </property> </configuration>

Read the article

How do I control output files name and content of an Hadoop streaming job?

- by Eran Kampf

Is there a way to control the output filenames of an Hadoop Streaming job? Specifically I would like my job's output files content and name to be organized by the ket the reducer outputs - each file would only contain values for one key and its name would be the key. Update: Just found the answer - Using a Java class that derives from MultipleOutputFormat as the jobs output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.htmlhttp://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html I havent seen any samples for this out there... Can anyone point out to an Hadoop Streaming sample that makes use of a custom output format Java class?

Read the article

Oracle Big Data Software Downloads

- by Mike.Hallett(at)Oracle-BI&EPM

Companies have been making business decisions for decades based on transactional data stored in relational databases. Beyond that critical data, is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information. Oracle offers a broad integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships. Oracle Big Data Connectors Downloads here, includes: Oracle SQL Connector for Hadoop Distributed File System Release 2.1.0 Oracle Loader for Hadoop Release 2.1.0 Oracle Data Integrator Companion 11g Oracle R Connector for Hadoop v 2.1 Oracle Big Data Documentation The Oracle Big Data solution offers an integrated portfolio of products to help you organize and analyze your diverse data sources alongside your existing data to find new insights and capitalize on hidden relationships. Oracle Big Data, Release 2.2.0 - E41604_01 zip (27.4 MB) Integrated Software and Big Data Connectors User's Guide HTML PDF Oracle Data Integrator (ODI) Application Adapter for Hadoop Apache Hadoop is designed to handle and process data that is typically from data sources that are non-relational and data volumes that are beyond what is handled by relational databases. Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job usually requires expert programming knowledge. However, when you use Oracle Data Integrator with the Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. Employing familiar and easy-to-use tools and pre-configured knowledge modules (KMs), the application adapter provides the following capabilities: Loading data into Hadoop from the local file system and HDFS Performing validation and transformation of data within Hadoop Loading processed data from Hadoop to an Oracle database for further processing and generating reports Oracle Database Loader for Hadoop Oracle Loader for Hadoop is an efficient and high-performance loader for fast movement of data from a Hadoop cluster into a table in an Oracle database. It pre-partitions the data if necessary and transforms it into a database-ready format. Oracle Loader for Hadoop is a Java MapReduce application that balances the data across reducers to help maximize performance. Oracle R Connector for Hadoop Oracle R Connector for Hadoop is a collection of R packages that provide: Interfaces to work with Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables Predictive analytic techniques, written in R or Java as Hadoop MapReduce jobs, that can be applied to data in HDFS files You install and load this package as you would any other R package. Using simple R functions, you can perform tasks such as: Access and transform HDFS data using a Hive-enabled transparency layer Use the R language for writing mappers and reducers Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations Oracle SQL Connector for Hadoop Distributed File System Using Oracle SQL Connector for HDFS, you can use an Oracle Database to access and analyze data residing in Hadoop in these formats: Data Pump files in HDFS Delimited text files in HDFS Hive tables For other file formats, such as JSON files, you can stage the input in Hive tables before using Oracle SQL Connector for HDFS. Oracle SQL Connector for HDFS uses external tables to provide Oracle Database with read access to Hive tables, and to delimited text files and Data Pump files in HDFS. Related Documentation Cloudera's Distribution Including Apache Hadoop Library HTML Oracle R Enterprise HTML Oracle NoSQL Database HTML Recent Blog Posts Big Data Appliance vs. DIY Price Comparison Big Data: Architecture Overview Big Data: Achieve the Impossible in Real-Time Big Data: Vertical Behavioral Analytics Big Data: In-Memory MapReduce Flume and Hive for Log Analytics Building Workflows in Oozie

Read the article

hadoop implementing a generic list writable

- by Guruprasad Venkatesh

I am working on building a map reduce pipeline of jobs(with one MR job's output feeding to another as input). The values being passed around are fairly complex, in that there are lists of different types and hash maps with values as lists. Hadoop api does not seem to have a ListWritable. Am trying to write a generic one, but it seems i can't instantiate a generic type in my readFields implementation, unless i pass in the class type itself: public class ListWritable<T extends Writable> implements Writable { private List<T> list; private Class<T> clazz; public ListWritable(Class<T> clazz) { this.clazz = clazz; list = new ArrayList<T>(); } @Override public void write(DataOutput out) throws IOException { out.writeInt(list.size()); for (T element : list) { element.write(out); } } @Override public void readFields(DataInput in) throws IOException{ int count = in.readInt(); this.list = new ArrayList<T>(); for (int i = 0; i < count; i++) { try { T obj = clazz.newInstance(); obj.readFields(in); list.add(obj); } catch (InstantiationException e) { e.printStackTrace(); } catch (IllegalAccessException e) { e.printStackTrace(); } } } } But hadoop requires all writables to have a no argument constructor to read the values back. Has anybody tried to do the same and solved this problem? TIA.

Read the article

Hadoop File Read

- by user3684584

Hadoop Distributed Cache Wordcount example in hadoop 2.2.0. Copied file into hdfs filesystem to be used inside setup of mapper class. protected void setup(Context context) throws IOException,InterruptedException { Path[] uris = DistributedCache.getLocalCacheFiles(context.getConfiguration()); cacheData=new HashMap(); for(Path urifile: uris) { try { BufferedReader readBuffer1 = new BufferedReader(new FileReader(urifile.toString())); String line; while ((line=readBuffer1.readLine())!=null) { System.out.println("**************"+line); cacheData.put(line,line); } readBuffer1.close(); } catch (Exception e) { System.out.println(e.toString()); } } } Inside Driver Main class Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf,args).getRemainingArgs(); if (otherArgs.length != 3) { System.err.println("Usage: wordcount <in> <out>"); System.exit(2); } Job job = new Job(conf, "word_count"); job.setJarByClass(WordCount.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); FileInputFormat.addInputPath(job, new Path(otherArgs[0])); Path outputpath=new Path(otherArgs[1]); outputpath.getFileSystem(conf).delete(outputpath,true); FileOutputFormat.setOutputPath(job,outputpath); System.out.println("CachePath****************"+otherArgs[2]); DistributedCache.addCacheFile(new URI(otherArgs[2]),job.getConfiguration()); System.exit(job.waitForCompletion(true) ? 0 : 1); But getting exception java.io.FileNotFoundException: file:/home/user12/tmp/mapred/local/1408960542382/cache (No such file or directory) So Cache functionality not working properly. Any Idea ?

Read the article

Hadoop Hive web interface options

- by Garethr

I've been experimenting with Hive for some data mining activities and would like to make it easily available to less command line orientated colleagues. Hive does now ship with a web interface (http://wiki.apache.org/hadoop/Hive/HiveWebInterface) but it's very basic at this stage. My question is does a visually polished and fully featured interface (either desktop or preferably web based) to Hive exist yet? Are their any open source efforts outside the Hive project working on this?

Read the article

Debugging hadoop applications

- by Deepak Konidena

Hi, I tried printing out values using System.out.println(), but they won't appear on the console. How do i print out the values in a map/reduce application for debugging purposes using Hadoop? Thanks, Deepak.

Read the article

Very basic question about Hadoop and compressed input files

- by Luis Sisamon

I have started to look into Hadoop. If my understanding is right i could process a very big file and it would get split over different nodes, however if the file is compressed then the file could not be split and wold need to be processed by a single node (effectively destroying the advantage of running a mapreduce ver a cluster of parallel machines). My question is, assuming the above is correct, is it possible to split a large file manually in fixed-size chunks, or daily chunks, compress them and then pass a list of compressed input files to perform a mapreduce?

Read the article

fft algorithm implementation with hadoop

- by haluk

Hi, I want to implement Fast Fourier Transform algorithm with Hadoop. I know recursive-fft algorithm but I need your guideline in order to implement it Map/Reduce approach. Any suggestions? Thanks.

Read the article

Hadoop write directory

- by FaultyJuggler

Simple question but reading through documentation and configuration I can't quite seem to figure it out. How do I A) know where hadoop is writing to on the local disk and B) change that For initial testing I setup HDFS on a 20gb linux VM - to it we've added a 500gb networked drive for moving towards prototyping the full system. So now how do I point HDFS at that drive, or do I simply move the home directory/install with some slight change in setup and restart the process?

Read the article

Having two sets of input combined on hadoop

- by aeolist

I have a rather simple hadoop question which i'll try to present with an example say you have a list of strings and a large file and you want each mapper to process a piece of the file and one of the strings in a grep like program. how are you supposed to do that? I am under the impression that the number of mappers is a result of the inputsplits produced. I could run subsequent jobs, one for each string, but it seems kinda... messy?

Read the article

Is there anything like Hadoop in C++?

- by anon

What is the closest thing like Hadoop, but in C++? In particular, I want to do distributed computing using MapReduce. Thanks!

Read the article

Efficient way to store a graph for calculation in Hadoop

- by user337499

I am currently trying to perform calculations like clustering coefficient on huge graphs with the help of Hadoop. Therefore I need an efficient way to store the graph in a way that I can easily access nodes, their neighbors and the neighbors' neighbors. The graph is quite sparse and stored in a huge tab separated file where the first field is the node from which an edge goes to the second node in field two. Thanks in advance!

Read the article

Hadoop On Azure C#Isotope

- by Sreesankar

During the initial release of the HadoopOnAzure, Microsoft had provided the C#Isotope SDK as a programatic interface to the Hadoop cluster on the Azure. After the HDInsight release this is removed from the downoads. More over while trying with the previous version of the sdk we get a 500 - Internal Server Error. Any idea if this services is disabled? If so whats the alternative way to programatically interface with the HDInsight Cluster on Azure?

Read the article

Changing the block size of a dfs file in Hadoop

- by Sam

I found that my map tasks is currently inefficient when parsing one particular set of files (total 2 TB). I'd like to change the block size of files in the Hadoop dfs (from 64MB to 128 MB). I can't find how to do it in the documentation for only one set of files and not the entire cluster, does anyone know the command that would change the block size when I upload it (ie copy from local to the dfs)? Thanks!

Read the article

Hadoop application development, and PHP

- by alwaysarvind

For hadoop application development, are PHP frameworks less popular ?If so, why? Else,please do point literature/documentation/tutorials for a specific framework? (stuff for Symfony would be awesome!)

Read the article

How to setup Hadoop cluster so that it accepts mapreduce jobs from remote computers?

- by drasto

There is a computer I use for Hadoop map/reduce testing. This computer runs 4 Linux virtual machines (using Oracle virtual box). Each of them has Cloudera with Hadoop (distribution c3u4) installed and serves as a node of Hadoop cluster. One of those 4 nodes is master node running namenode and jobtracker, others are slave nodes. Normally I use this cluster from local network for testing. However when I try to access it from another network I cannot send any jobs to it. The computer running Hadoop cluster has public IP and can be reached over internet for another services. For example I am able to get HDFS (namenode) administration site and map/reduce (jobtracker) administration site (on ports 50070 and 50030 respectively) from remote network. Also it is possible to use Hue. Ports 8020 and 8021 are both allowed. What is blocking my map/reduce job submits from reaching the cluster? Is there some setting that I must change first in order to be able to submit map/reduce jobs remotely? Here is my mapred-site.xml file: <configuration> <property> <name>mapred.job.tracker</name> <value>master:8021</value> </property>  <property> <name>mapred.jobtracker.plugins</name> <value>org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin</value> <description>Comma-separated list of jobtracker plug-ins to be activated. </description> </property> <property> <name>jobtracker.thrift.address</name> <value>0.0.0.0:9290</value> </property> </configuration> And this is in /etc/hosts file: 192.168.1.15 master 192.168.1.14 slave1 192.168.1.13 slave2 192.168.1.9 slave3

Read the article

Read files from directory to create a ZIP hadoop

- by Félix

I'm looking for hadoop examples, something more complex than the wordcount example. What I want to do It's read the files in a directory in hadoop and get a zip, so I have thought to collect al the files in the map class and create the zip file in the reduce class. Can anyone give me a link to a tutorial or example than can help me to built it? I don't want anyone to do this for me, i'm asking for a link with better examples than the wordaccount. This is what I have, maybe it's useful for someone public class Testing { private static class MapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, BytesWritable> { // reuse objects to save overhead of object creation Logger log = Logger.getLogger("log_file"); public void map(LongWritable key, Text value, OutputCollector<Text, BytesWritable> output, Reporter reporter) throws IOException { String line = ((Text) value).toString(); log.info("Doing something ... " + line); BytesWritable b = new BytesWritable(); b.set(value.toString().getBytes() , 0, value.toString().getBytes() .length); output.collect(value, b); } } private static class ReduceClass extends MapReduceBase implements Reducer<Text, BytesWritable, Text, BytesWritable> { Logger log = Logger.getLogger("log_file"); ByteArrayOutputStream bout; ZipOutputStream out; @Override public void configure(JobConf job) { super.configure(job); log.setLevel(Level.INFO); bout = new ByteArrayOutputStream(); out = new ZipOutputStream(bout); } public void reduce(Text key, Iterator<BytesWritable> values, OutputCollector<Text, BytesWritable> output, Reporter reporter) throws IOException { while (values.hasNext()) { byte[] data = values.next().getBytes(); ZipEntry entry = new ZipEntry("entry"); out.putNextEntry(entry); out.write(data); out.closeEntry(); } BytesWritable b = new BytesWritable(); b.set(bout.toByteArray(), 0, bout.size()); output.collect(key, b); } @Override public void close() throws IOException { // TODO Auto-generated method stub super.close(); out.close(); } } /** * Runs the demo. */ public static void main(String[] args) throws IOException { int mapTasks = 20; int reduceTasks = 1; JobConf conf = new JobConf(Prue.class); conf.setJobName("testing"); conf.setNumMapTasks(mapTasks); conf.setNumReduceTasks(reduceTasks); MultipleInputs.addInputPath(conf, new Path("/messages"), TextInputFormat.class, MapClass.class); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(BytesWritable.class); FileOutputFormat.setOutputPath(conf, new Path("/czip")); conf.setMapperClass(MapClass.class); conf.setCombinerClass(ReduceClass.class); conf.setReducerClass(ReduceClass.class); // Delete the output directory if it exists already JobClient.runJob(conf); } }

Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 4/17 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by Ashish Sharma

- by Algorist

- by Shahbaz

- by Mastergeek

- by Anurag

- by user1429825

- by Eran Kampf

- by Mike.Hallett(at)Oracle-BI&EPM

- by Guruprasad Venkatesh

- by user3684584

- by Garethr

- by Deepak Konidena

- by Luis Sisamon

- by haluk

- by FaultyJuggler

- by aeolist

- by anon

- by user337499

- by Sreesankar

- by Sam

- by alwaysarvind

- by drasto

- by Félix

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >