Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 5/17 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >

  • Chaining multiple MapReduce jobs in Hadoop.

    - by Niels Basjes
    In many real-life situations where you apply MapReduce, the final algorithms end up being several MapReduce steps. I.e. Map1 , Reduce1 , Map2 , Reduce2 , etc. So you have the output from the last reduce that is needed as the input for the next map. The intermediate data is something you (in general) do not want to keep once the pipeline has been successfully completed. Also because this intermediate data is in general some data structure (like a 'map' or a 'set') you don't want to put too much effort in writing and reading these key-value pairs. What is the recommended way of doing that in Hadoop? Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward? Thanks.

    Read the article

  • OpenStreetMap and Hadoop

    - by portoalet
    Hi, I need some ideas for a weekend project about Hadoop and OpenStreetMap. I have access to AWS EC2 instance with OpenStreetMap snapshot in my EBS volume. The OpenStreetMap data is in a PostgreSQL database. What kind of MapReduce function can be run on the OpenStreetMap data, assuming I can export them into xml format, and then place into HDFS ? In other words, I am having a brain cramp at the moment, and cannot think what kind of MapReduce operation that can extract valuable insight from the OpenStreetMap xml? (i.e. extract all the places designated as park or golf course. But this needs to be done once only, not continuously) Many Thanks

    Read the article

  • Free Large datasets to experiment with Hadoop

    - by Sundar
    Do you know any large datasets to experiment with Hadoop which is free/low cost? Any pointers/links related is appreciated. Prefernce: Atleast one GB of data. Production log data of webserver. Few of them which I found so far: http://dumps.wikimedia.org/enwiki/20100130/ http://wiki.freebase.com/wiki/Data_dumps http://aws.amazon.com/publicdatasets/ Also can we run our own crawler to gather data from sites e.g. Wikipedia? Any pointers on how to do this is appreciated as well.

    Read the article

  • Free data warehouse - Infobright, Hadoop/Hive or what ?

    - by peperg
    I need to store large amount of small data objects (millions of rows per month). Once they're saved they wont change. I need to : store them securely use them to analysis (mostly time-oriented) retrieve some raw data occasionally It would be nice if it could be used with JasperReports or BIRT My first shot was Infobright Community - just a column-oriented, read-only storing mechanism for MySQL On the other hand, people says that NoSQL approach could be better. Hadoop+Hive looks promissing, but the documentation looks poor and the version number is less than 1.0 . I heard about Hypertable, Pentaho, MongoDB .... Do you have any recommendations ? (Yes, I found some topics here, but it was year or two ago)

    Read the article

  • How does Hadoop perform input splits?

    - by Deepak Konidena
    Hi, This is a conceptual question involving Hadoop/HDFS. Lets say you have a file containing 1 billion lines. And for the sake of simplicity, lets consider that each line is of the form <k,v> where k is the offset of the line from the beginning and value is the content of the line. Now, when we say that we want to run N map tasks, does the framework split the input file into N splits and run each map task on that split? or do we have to write a partitioning function that does the N splits and run each map task on the split generated? All i want to know is, whether the splits are done internally or do we have to split the data manually? More specifically, each time the map() function is called what are its Key key and Value val parameters? Thanks, Deepak

    Read the article

  • Help regarding no sql databases like hadoop, hbase etc

    - by user560370
    I am new to the distributed NoSQL databases like Hadoop, Cassandra, etc. I have few questions for which I seek an expert advice: Can you list problems/challenges one will generally face when making a shift from the present conventional database like MySQL to these large cluster-based databases? What are the difficulties, if any, when one needs to adapt to a newer version of these open source projects? Can you list out the things which are generally stored/kept in memcached for fast rendering of the page? How can I understand the source code of open-source projects so that I can build on it and maybe give back to the community? Above questions may sound to be idiotic and basic but please it's a request for the experts to answer the above questions in detailed and to best of their abilities.

    Read the article

  • My Take on Hadoop World 2011

    - by Jean-Pierre Dijcks
    I’m sure some of you have read pieces about Hadoop World and I did see some headlines which were somewhat, shall we say, interesting? I thought the keynote by Larry Feinsmith of JP Morgan Chase & Co was one of the highlights of the conference for me. The reason was very simple, he addressed some real use cases outside of internet and ad platforms. The following are my notes, since the keynote was recorded I presume you can go and look at Hadoopworld.com at some point… On the use cases that were mentioned: ETL – how can I do complex data transformation at scale Doing Basel III liquidity analysis Private banking – transaction filtering to feed [relational] data marts Common Data Platform – a place to keep data that is (or will be) valuable some day, to someone, somewhere 360 Degree view of customers – become pro-active and look at events across lines of business. For example make sure the mortgage folks know about direct deposits being stopped into an account and ensure the bank is pro-active to service the customer Treasury and Security – Global Payment Hub [I think this is really consolidation of data to cross reference activity across business and geographies] Data Mining Bypass data engineering [I interpret this as running a lot of a large data set rather than on samples] Fraud prevention – work on event triggers, say a number of failed log-ins to the website. When they occur grab web logs, firewall logs and rules and start to figure out who is trying to log in. Is this me, who forget his password, or is it someone in some other country trying to guess passwords Trade quality analysis – do a batch analysis or all trades done and run them through an analysis or comparison pipeline One of the key requests – if you can say it like that – was for vendors and entrepreneurs to make sure that new tools work with existing tools. JPMC has a large footprint of BI Tools and Big Data reporting and tools should work with those tools, rather than be separate. Security and Entitlement – how to protect data within a large cluster from unwanted snooping was another topic that came up. I thought his Elephant ears graph was interesting (couldn’t actually read the points on it, but the concept certainly made some sense) and it was interesting – when asked to show hands – how the audience did not (!) think that RDBMS and Hadoop technology would overlap completely within a few years. Another interesting session was the session from Disney discussing how Disney is building a DaaS (Data as a Service) platform and how Hadoop processing capabilities are mixed with Database technologies. I thought this one of the best sessions I have seen in a long time. It discussed real use case, where problems existed, how they were solved and how Disney planned some of it. The planning focused on three things/phases: Determine the Strategy – Design a platform and evangelize this within the organization Focus on the people – Hire key people, grow and train the staff (and do not overload what you have with new things on top of their day-to-day job), leverage a partner with experience Work on Execution of the strategy – Implement the platform Hadoop next to the other technologies and work toward the DaaS platform This kind of fitted with some of the Linked-In comments, best summarized in “Think Platform – Think Hadoop”. In other words [my interpretation], step back and engineer a platform (like DaaS in the Disney example), then layer the rest of the solutions on top of this platform. One general observation, I got the impression that we have knowledge gaps left and right. On the one hand are people looking for more information and details on the Hadoop tools and languages. On the other I got the impression that the capabilities of today’s relational databases are underestimated. Mostly in terms of data volumes and parallel processing capabilities or things like commodity hardware scale-out models. All in all I liked this conference, it was great to chat with a wide range of people on Oracle big data, on big data, on use cases and all sorts of other stuff. Just hope they get a set of bigger rooms next time… and yes, I hope I’m going to be back next year!

    Read the article

  • Hadoop/MapReduce: Reading and writing classes generated from DDL

    - by Dave
    Hi, Can someone walk me though the basic work-flow of reading and writing data with classes generated from DDL? I have defined some struct-like records using DDL. For example: class Customer { ustring FirstName; ustring LastName; ustring CardNo; long LastPurchase; } I've compiled this to get a Customer class and included it into my project. I can easily see how to use this as input and output for mappers and reducers (the generated class implements Writable), but not how to read and write it to file. The JavaDoc for the org.apache.hadoop.record package talks about serializing these records in Binary, CSV or XML format. How do I actually do that? Say my reducer produces IntWritable keys and Customer values. What OutputFormat do I use to write the result in CSV format? What InputFormat would I use to read the resulting files in later, if I wanted to perform analysis over them?

    Read the article

  • Need help implementing this algorithm with map Hadoop MapReduce

    - by Julia
    Hi all! i have algorithm that will go through a large data set read some text files and search for specific terms in those lines. I have it implemented in Java, but I didnt want to post code so that it doesnt look i am searching for someone to implement it for me, but it is true i really need a lot of help!!! This was not planned for my project, but data set turned out to be huge, so teacher told me I have to do it like this. EDIT(i did not clarified i previos version)The data set I have is on a Hadoop cluster, and I should make its MapReduce implementation I was reading about MapReduce and thaught that i first do the standard implementation and then it will be more/less easier to do it with mapreduce. But didnt happen, since algorithm is quite stupid and nothing special, and map reduce...i cant wrap my mind around it. So here is shortly pseudo code of my algorithm LIST termList (there is method that creates this list from lucene index) FOLDER topFolder INPUT topFolder IF it is folder and not empty list files (there are 30 sub folders inside) FOR EACH sub folder GET file "CheckedFile.txt" analyze(CheckedFile) ENDFOR END IF Method ANALYZE(CheckedFile) read CheckedFile WHILE CheckedFile has next line GET line FOR(loops through termList) GET third word from line IF third word = term from list append whole line to string buffer ENDIF ENDFOR END WHILE OUTPUT string buffer to file Also, as you can see, each time when "analyze" is called, new file has to be created, i understood that map reduce is difficult to write to many outputs??? I understand mapreduce intuition, and my example seems perfectly suited for mapreduce, but when it comes to do this, obviously I do not know enough and i am STUCK! Please please help.

    Read the article

  • Any good method for mounting Hadoop HDFS from another system?

    - by Beel
    I want to mount the Cloudera Hadoop as a Linux file system over the LAN. As a setup, I already have the hadoop cluster running on a set of Ubuntu machines. But now I need to be able to use it as a normal file system from a Fedora system over the LAN. I tried FUSe but two things: 1. Cloudera says FUSE loses data (click here for that comment by a Cloudera employee on the official Cloudera support site) 2. I've had no success making it work the way we want As a point of clarification, I am using Hadoop ONLY for the file system, not for its other capabilities.

    Read the article

  • If I intend to use Hadoop is there a difference in 12.04 LTS 64 Desktop and Server?

    - by Charles Daringer
    Sorry for such a Newbie Question, but I'm looking at installing M3 edition of MapR the requirements are at this link: http://www.mapr.com/doc/display/MapR/Requirements+for+Installation And my question is this, is the Desktop Kernel 64 for 12.04 LTS adequate or the "same" as the Server version of the product? If I'm setting up a lab to attempt to install a home cluster environment should I start with the Server or Dual Boot that distribution? My assumption is that the two are the same. That I can add any additional software to the 64 as needed. Can anyone elaborate on this? Have I missed something obvious?

    Read the article

  • Oracle Loader for Hadoop 1.1.0.0.3

    - by mannamal
    We are pleased to announce availability of Oracle Loader for Hadoop 1.1.0.0.3, containing bug fixes and performance improvements to Oracle Loader for Hadoop. The updated product can be downloaded from here: http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/big-data-downloads-1451048.html Note that the Oracle Loader for Hadoop 1.1.0.0.3 kit is a complete kit containing the product and bug fixes. Fixes of the earlier version 1.1 patch releases are also included. Upgrading to Oracle Loader for Hadoop 1.1.0.0.3 (from versions 1.1.x): On the Oracle Big Data Appliance:  1. 1.  Upload the new oraloader rpm to the first Oracle Big Data Appliance server.  For example:   /tmp/oraloader-1.1.0.0.3-1.x86_64.rpm 2.     As the root user, use dcli from the first Oracle Big Data Appliance server to copy the new rpm to all nodes. For example:   #dcli -f /tmp/oraloader-1.1.0.0.3-1.x86_64.rpm  -d /tmp/oraloader-replace.rpm 3. 3.  As the root user, use dcli from the first Oracle Big Data Appliance server to replace the old oraloader rpm with the new one.  For example:  #dcli "rpm -e oraloader ; rpm -Uvh /tmp/oraloader-replace.rpm" On other hardware: 1. 1.  Unzip oraloader-1.1.0.0.3.x86_64.zip at <location of install> 2. 2.  Update OLH_HOME to point to <location of install>/oraloader-1.1.0.0.3 

    Read the article

  • FairScheduling Conventions in Hadoop

    - by dan.mcclary
    While scheduling and resource allocation control has been present in Hadoop since 0.20, a lot of people haven't discovered or utilized it in their initial investigations of the Hadoop ecosystem. We could chalk this up to many things: Organizations are still determining what their dataflow and analysis workloads will comprise Small deployments under tests aren't likely to show the signs of strains that would send someone looking for resource allocation options The default scheduling options -- the FairScheduler and the CapacityScheduler -- are not placed in the most prominent position within the Hadoop documentation. However, for production deployments, it's wise to start with at least the foundations of scheduling in place so that you can tune the cluster as workloads emerge. To do that, we have to ask ourselves something about what the off-the-rack scheduling options are. We have some choices: The FairScheduler, which will work to ensure resource allocations are enforced on a per-job basis. The CapacityScheduler, which will ensure resource allocations are enforced on a per-queue basis. Writing your own implementation of the abstract class org.apache.hadoop.mapred.job.TaskScheduler is an option, but usually overkill. If you're going to have several concurrent users and leverage the more interactive aspects of the Hadoop environment (e.g. Pig and Hive scripting), the FairScheduler is definitely the way to go. In particular, we can do user-specific pools so that default users get their fair share, and specific users are given the resources their workloads require. To enable fair scheduling, we're going to need to do a couple of things. First, we need to tell the JobTracker that we want to use scheduling and where we're going to be defining our allocations. We do this by adding the following to the mapred-site.xml file in HADOOP_HOME/conf: <property> <name>mapred.jobtracker.taskScheduler</name> <value>org.apache.hadoop.mapred.FairScheduler</value> </property> <property> <name>mapred.fairscheduler.allocation.file</name> <value>/path/to/allocations.xml</value> </property> <property> <name>mapred.fairscheduler.poolnameproperty</name> <value>pool.name</value> </property> <property> <name>pool.name</name> <value>${user.name}</name> </property> What we've done here is simply tell the JobTracker that we'd like to task scheduling to use the FairScheduler class rather than a single FIFO queue. Moreover, we're going to be defining our resource pools and allocations in a file called allocations.xml For reference, the allocation file is read every 15s or so, which allows for tuning allocations without having to take down the JobTracker. Our allocation file is now going to look a little like this <?xml version="1.0"?> <allocations> <pool name="dan"> <minMaps>5</minMaps> <minReduces>5</minReduces> <maxMaps>25</maxMaps> <maxReduces>25</maxReduces> <minSharePreemptionTimeout>300</minSharePreemptionTimeout> </pool> <mapreduce.job.user.name="dan"> <maxRunningJobs>6</maxRunningJobs> </user> <userMaxJobsDefault>3</userMaxJobsDefault> <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout> </allocations> In this case, I've explicitly set my username to have upper and lower bounds on the maps and reduces, and allotted myself double the number of running jobs. Now, if I run hive or pig jobs from either the console or via the Hue web interface, I'll be treated "fairly" by the JobTracker. There's a lot more tweaking that can be done to the allocations file, so it's best to dig down into the description and start trying out allocations that might fit your workload.

    Read the article

  • Loading from Multiple Data Sources with Oracle Loader for Hadoop

    - by mannamal
    Oracle Loader for Hadoop can be used to load data from multiple data sources (for example Hive, HBase), and data in multiple formats (for example Apache weblogs, JSON files).   There are two ways to do this: (1) Use an input format implementation.  Oracle Loader for Hadoop includes several input format implementations.  In addition, a user can develop their own input format implementation for proprietary data sources and formats. (2) Leverage the capabilities of Hive, and use Oracle Loader for Hadoop to load from Hive. These approaches are discussed in our Oracle Open World 2013 presentation. 

    Read the article

  • Cannot save tar.gz file to usr/local

    - by ATMathew
    I'm using the following instruction to install and configure Hadoop on Ubuntu 10.10. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/#installation I tried to save the compressed tar.gz file to /usr/local/ but it just won't save. I've tried saving the tar.gz in my home folder and desktop and copying the files to the desired folder, but I get an error that tells me I don't have permission. How do I save and extract a tar.gz folder to /usr/local/hadoop?

    Read the article

  • Hadooop map reduce

    - by Aina Ari
    Im very much new to map reduce and i completed hadoop wordcount example. In that example it produces unsorted file (with key value) of word counts. So is it possible to make it sorted according to the most number of word occurrences by combining another map reduce task to the earlier one. Thanks in Advance

    Read the article

  • Implications of Multiple JobTracker nodes in a Hadoop cluster?

    - by Jim Dennis
    I get the impression that one can, potentially, have multiple JobTracker nodes configured to share the same set of MR (TaskTracker) nodes. I know that, conventionally, all the nodes in a Hadoop cluster should have the same set of configuration files (conventionally under /etc/hadoop/conf/ --- at least for the Cloudera Distribution of Hadoop (CDH). Can we define multiple Job Trackers in mapred-site.xml? Something like: <configuration> <property> <name>mapred.job.tracker</name> <value>jt01.mydomain.not:8021</value> </property> <property> <name>mapred.job.tracker</name> <value>jt02.mydomain.not:8021</value> </property> ... </configuration> Or is there some other allowed syntax for this? What are the implications of doing this. Does each JobTracker get information about the load on each TaskTracker node. In other words can the two JobTracker co-ordinated their scheduling across the TT nodes only based on the gossip information from the TTs or would they need to talk to one another? Is this documented anywhere?

    Read the article

  • Hadoop/Pig Cross-join

    - by sagie
    Hi I am using Pig to cross join two data sets, both with format. This will result as . In my case, if I have both tuples and , it is a duplication. Can I filter those duplications (or not joining them at all)? thanks, Sagie

    Read the article

  • Hadoop: Iterative MapReduce Performance

    - by S.N
    Is it correct to say that the parallel computation with iterative MapReduce can be justified only when the training data size is too large for the non-parallel computation for the same logic? I am aware that the there is overhead for starting MapReduce jobs. This can be critical for overall execution time when a large number of iterations is required. I can imagine that the sequential computation is faster than the parallel computation with iterative MapReduce as long as the memory allows to hold a data set in many cases. Is it the only benefit to use the iterative MapReduce? If not, what are the other benefits could be?

    Read the article

  • Hbase / Hadoop Query Help

    - by zechariahs
    I'm working on a project with a friend that will utilize Hbase to store it's data. Are there any good query examples? I seem to be writing a ton of Java code to iterate through lists of RowResult's when, in SQL land, I could write a simple query. Am I missing something? Or is Hbase missing something?

    Read the article

  • Hadoop 0.2: How to read outputs from TextOutputFormat?

    - by S.N
    My reducer class produces outputs with TextOutputFormat (the default OutputFormat given by Job). I like to consume this outputs after the MapReduce job complete to aggregate the outputs. In addition to this, I like to write out the aggregated information with TextInputFormat so that the output from this process can be consumed by the next iteration of MapReduce task. Can anyone give me an example on how to write & read with TextFormat? By the way, the reason why I am using TextFormat, rather Sequence, is the interoperability. The outputs should be consumed by any software.

    Read the article

  • Splitting input into substrings in PIG (Hadoop)

    - by Niels Basjes
    Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in pig latin. I have found the TOKENIZE function but that splits on word boundries. So can "pig latin" do this or is this something that requires a Java class to do that?

    Read the article

  • hadoop mapper static initialisation

    - by rakeshr
    Hi, I have a code fragment in which I am using a static code block to initialize a variable. public static class JoinMap extends Mapper<IntWritable, MbrWritable, LongWritable, IntWritable> { ....... public static RTree rt = null; static { String rtreeFileName = "R.rtree"; rt = new RTree(rtreeFileName); } public void map(IntWritable key, MbrWritable mbr,Context context) throws IOException, InterruptedException { ......... List elements = rt.overlaps(mbr.getRect()); ....... } } My problem is that the variable rt in the above code fragment is not getting initialised. Can anybody suggest a fix or an alternate way to initialise the variable. I don't want to initialise it inside my map function since that slows down the entire process.

    Read the article

  • Global variables in hadoop.

    - by Deepak Konidena
    Hi, My program follows a iterative map/reduce approach. And it needs to stop if certain conditions are met. Is there anyway i can set a global variable that can be distributed across all map/reduce tasks and check if the global variable reaches the condition for completion. Something like this. While(Condition != true){ Configuration conf = getConf(); Job job = new Job(conf, "Dijkstra Graph Search"); job.setJarByClass(GraphSearch.class); job.setMapperClass(DijkstraMap.class); job.setReducerClass(DijkstraReduce.class); job.setOutputKeyClass(IntWritable.class); job.setOutputValueClass(Text.class); } Where condition is a global variable that is modified during/after each map/reduce execution.

    Read the article

  • Need help implementing this algorithm with map reduce(hadoop)

    - by Julia
    Hi all! i have algorithm that will go through a large data set read some text files and search for specific terms in those lines. I have it implemented in Java, but I didnt want to post code so that it doesnt look i am searching for someone to implement it for me, but it is true i really need a lot of help!!! This was not planned for my project, but data set turned out to be huge, so teacher told me I have to do it like this. I was reading about MapReduce and thaught that i first do the standard implementation and then it will be more/less easier to do it with mapreduce. But didnt happen, since algorithm is quite stupid and nothing special, and map reduce...i cant wrap my mind around it. So here is shortly pseudo code of my algorithm LIST termList (there is method that creates this list from lucene index) FOLDER topFolder INPUT topFolder IF it is folder and not empty list files (there are 30 sub folders inside) FOR EACH sub folder GET file "CheckedFile.txt" analyze(CheckedFile) ENDFOR END IF Method ANALYZE(CheckedFile) read CheckedFile WHILE CheckedFile has next line GET line FOR(loops through termList) GET third word from line IF third word = term from list append whole line to string buffer ENDIF ENDFOR END WHILE OUTPUT string buffer to file Also, as you can see, each time when "analyze" is called, new file has to be created, i understood that map reduce is difficult to write to many outputs??? I understand mapreduce intuition, and my example seems perfectly suited for mapreduce, but when it comes to do this, obviously I do not know enough and i am STUCK! Please please help.

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12  | Next Page >