Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 8/17 | < Previous Page | 4 5 6 7 8 9 10 11 12 13 14 15  | Next Page >

  • Java MapReduce read data

    - by Tatiana
    Hi I am having following map-reduce code by which I am trying to read records from my database. There's code: import java.io.*; import java.util.ArrayList; import java.util.List; import org.apache.hadoop.fs.*; import org.apache.hadoop.io.*; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.db.DBConfiguration; import org.apache.hadoop.mapred.lib.db.DBInputFormat; import org.apache.hadoop.mapred.lib.db.DBWritable; import org.apache.hadoop.util.*; import org.apache.hadoop.conf.*; public class Connection extends Configured implements Tool { public int run(String[] args) throws IOException { JobConf conf = new JobConf(getConf(), Connection.class); conf.setInputFormat(DBInputFormat.class); DBConfiguration.configureDB(conf, "com.sun.java.util.jar.pack.Driver", "jdbc:postgresql://localhost:5432/polyclinic", "postgres", "12345"); String[] fields = { "name" }; DBInputFormat.setInput(conf, MyRecord.class, "doctors", null, null, fields); conf.setMapOutputKeyClass(LongWritable.class); conf.setMapOutputValueClass(MyRecord.class); conf.setOutputKeyClass(LongWritable.class); conf.setOutputValueClass(TextOutputFormat.class); TextOutputFormat.setOutputPath(conf, new Path(args[0])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new Connection(), args); System.exit(exitCode); } } Class Mapper: import java.io.IOException; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapred.MapReduceBase; import org.apache.hadoop.mapred.Mapper; import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reporter; public class MyMapper extends MapReduceBase implements Mapper<LongWritable, MyRecord, Text, IntWritable> { public void map(LongWritable key, MyRecord val, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { output.collect(new Text(val.name), new IntWritable(1)); } } Class Record: import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.sql.PreparedStatement; import java.sql.ResultSet; import java.sql.SQLException; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapred.lib.db.DBWritable; class MyRecord implements Writable, DBWritable { String name; public void readFields(DataInput in) throws IOException { this.name = Text.readString(in); } public void readFields(ResultSet resultSet) throws SQLException { this.name = resultSet.getString(1); } public void write(DataOutput out) throws IOException { } public void write(PreparedStatement stmt) throws SQLException { } } After this I got error: WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). Can you give me any suggestion how to solve this problem?

    Read the article

  • Does changing the default HDFS replication factor from 3 affect mapper performance?

    - by liamf
    Have a HDFS/Hadoop cluster setup and am looking into tuning. I wonder if changing the default HDFS replication factor (default:3) to something bigger will improve mapper performance, at the obvious expense of increasing disk storage used? My reasoning being that if the data is already replicated to more nodes, mapper jobs can be run on more nodes in parallel without any data streaming/copying? Anyone got any opinions?

    Read the article

  • Facebook sort Presto, son moteur de requêtes open source pour le big data, qui serait dix fois plus performant que celui de Hadoop

    Facebook sort Presto, son moteur de requêtes open source pour le big data qui serait dix fois plus performant que celui de HadoopDe nombreuses entreprises comme Facebook dépendent du Big data. Dans le domaine, on compte la paire Hadoop/Hive parmi les références. Pour rappel, Hive c'est le moteur de requêtes populaire pour Hadoop. Cependant, il se pourrait que le MapReduce élément essentiel sur lequel repose Hive ne soit pas optimisé pour des situations ou la quantité de données excède un certain...

    Read the article

  • How can I get started with BigData?

    - by ????? ????????
    I have a programming background, and I've done lots of database design and written lots of queries with Sql Server. I am really excited about looking at bigdata solutions. I know almost nothing about it. The way I want to learn is to sign up for a sandbox where I can try things out. questions Is there a sandbox where I can play around with hadoop? It does not have to be free. Would amazon EMR be the right path to go? What technologies should I be looking at to get started quickly? Is there a 'bigdata' dataset that is available to play with? Thank you so much for your guidance.

    Read the article

  • Unable to login into CentOS

    - by Rendl
    I had setup a multinode cluster using CentOS with VMware yesterday. Today when I reboot the nodes I get the below error on startup. "there is a problem with the configuration server status 256 centOS" (/usr/libexec/gconf-sanity-check-2 ) I am unable to login as root or any user as the screen is frozen. The solutions online is to change the permissions for some tmp files. My problem is I am unable to access the terminal as I cannot login. Also on reboot I do not have any recovery options in CentOS. I only see command line GRUB. I am new to linux and Hadoop.Pls help asap.

    Read the article

  • Hive NR map progress inconsistent and regurlarly restart from 0%

    - by user92471
    I have a Yarn MR (with two ec2 instances to mapreduce) job on a dataset of approximately a thousand avro records, and the map phase is behaving erratically. See the progress below. Of course i checked the logs on resourcemanager and nodemanagers and saw nothing suspicious, but these logs are too verbose What is going on there ? hive> select * from nikon where qs_cs_s_aid='VIEW' limit 10; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator Starting Job = job_1352281315350_0020, Tracking URL = http://blabla.ec2.internal:8088/proxy/application_1352281315350_0020/ Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=blabla.com:8032 -kill job_1352281315350_0020 Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0 2012-11-07 11:14:40,976 Stage-1 map = 0%, reduce = 0% 2012-11-07 11:15:06,136 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 10.38 sec 2012-11-07 11:15:07,253 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec 2012-11-07 11:15:08,371 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec 2012-11-07 11:15:09,491 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 12.18 sec 2012-11-07 11:15:10,643 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 15.42 sec (...) 2012-11-07 11:15:35,441 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 37.77 sec 2012-11-07 11:15:36,486 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 37.77 sec here restart at 16% ? 2012-11-07 11:15:37,692 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec 2012-11-07 11:15:38,815 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec 2012-11-07 11:15:39,865 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 21.15 sec 2012-11-07 11:15:41,064 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec 2012-11-07 11:15:42,181 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec 2012-11-07 11:15:43,299 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 22.4 sec here restart at 0% ? 2012-11-07 11:15:44,418 Stage-1 map = 0%, reduce = 0% 2012-11-07 11:16:02,076 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 6.86 sec 2012-11-07 11:16:03,193 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 6.86 sec 2012-11-07 11:16:04,259 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 8.45 sec (...) 2012-11-07 11:16:31,291 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 35.34 sec 2012-11-07 11:16:32,414 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 37.93 sec here restart at 11% ? 2012-11-07 11:16:33,459 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 19.53 sec 2012-11-07 11:16:34,507 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 19.53 sec 2012-11-07 11:16:35,731 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 21.47 sec (...) 2012-11-07 11:16:46,839 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 24.14 sec here restart at 0% ? 2012-11-07 11:16:47,939 Stage-1 map = 0%, reduce = 0% 2012-11-07 11:16:56,653 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 7.54 sec 2012-11-07 11:16:57,814 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 7.54 sec (...) Needless to say the job crashes after some time with an Error: java.io.IOException: java.io.IOException: java.lang.ArrayIndexOutOfBoundsException: -56

    Read the article

  • How to use Mahout in a Windows environment?

    - by oopdemo
    I am trying to use Mahout in an application running on Windows. I want to build clusters from a lucene index using k-means. As soon as I have to create sequence files (creating vectors from a lucene index), I get a Hadoop-Exception, since Hadoop makes command line calls to programs unknown in a Windows environment (e.g. chmod). Running in Cygwin is not an option, since I want to be able to run the App from eclipse. So my question is is there a way to avoid having to create sequence files to retrieve my vectors from a lucene index? or is there a way to create sequence files in a Windows environment?

    Read the article

  • facing problems while updating rows in hbase

    - by sammy
    Hello i've just started exploring hbase i've run samples : SampleUploader,PerformanceEvaluation and rowcount as given in hadoop wiki: http://wiki.apache.org/hadoop/Hbase/MapReduce The problem im facing is : table1 is my table with the column family column create 'table1','column' put 'table1','row1','column:address','SanFrancisco' hbase(main):020:0 scan 'table1' ROW COLUMN+CELL row1 column=column:address, timestamp=1276351974560, value=SanFrancisco put 'table1','row1','column:name','Hannah' hbase(main):020:0 scan 'table1' ROW COLUMN+CELL row1 column=column:address,timestamp=1276351974560,value=SanFrancisco row1 column=column:name, timestamp=1276351899573, value=Hannah i want both the columns to appear in the same row as a different version similary, if i change the name column to sarah, it shows the updated row.... but i want both the old row and the changed row to appear as 2 different versions so that i could make analysis on the data........ whatis the mistake im making???? thank u a lot sammy

    Read the article

  • Multiple lines of text to a single map

    - by steven
    I've been trying to use Hadoop to send N amount of lines to a single mapping. I don't require for the lines to be split already. I've tried to use NLineInputFormat, however that sends N lines of text from the data to each mapper one line at a time [giving up after the Nth line]. I have tried to set the option and it only takes N lines of input sending it at 1 line at a time to each map: job.setInt("mapred.line.input.format.linespermap", 10); I've found a mailing list recommending me to override LineRecordReader::next, however that is not that simple, as that the internal data members are all private. I've just checked the source for NLineInputFormat and it hard codes LineReader, so overriding will not help. Also, btw I'm using Hadoop 0.18 for compatibility with the Amazon EC2 MapReduce.

    Read the article

  • I have an Errno 13 Permission denied with subprocess in python

    - by wDroter
    The line with the issue is ret=subprocess.call(shlex.split(cmd)) cmd = /usr/share/java -cp pig-hadoop-conf-Simpsons:lib/pig-0.8.1-cdh3u1-core.jar:lib/hadoop-core-0.20.2-cdh3u1.jar org.apache.pig.Main -param func=cat -param from =foo.txt -x mapreduce fsFunc.pig The error is. File "./run_pig.py", line 157, in process ret=subprocess.call(shlex.split(cmd)) File "/usr/lib/python2.7/subprocess.py", line 493, in call return Popen(*popenargs, **kwargs).wait() File "/usr/lib/python2.7/subprocess.py", line 679, in __init__ errread, errwrite) File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child raise child_exception OSError: [Errno 13] Permission denied Let me know if any more info is needed. Any help is appreciated. Thanks.

    Read the article

  • Java or Python distributed compute job (on a student budget)?

    - by midget_sadhu
    I have a large dataset (c. 40G) that I want to use for some NLP (largely embarrassingly parallel) over a couple of computers in the lab, to which i do not have root access, and only 1G of user space. I experimented with hadoop, but of course this was dead in the water-- the data is stored on an external usb hard drive, and i cant load it on to the dfs because of the 1G user space cap. I have been looking into a couple of python based options (as I'd rather use NLTK instead of Java's lingpipe if I can help it), and it seems distributed compute options look like: Ipython DISCO After my hadoop experience, i am trying to make sure i try and make an informed choice -- any help on what might be more appropriate would be greatly appreciated. Amazon's EC2 etc not really an option, as i have next to no budget.

    Read the article

  • kmeans based on mapreduce by python

    - by user3616059
    I am going to write a mapper and reducer for the kmeans algorithm, I think the best course of action to do is putting the distance calculator in mapper and sending to reducer with the cluster id as key and coordinates of row as value. In reducer, updating the centroids would be performed. I am writing this by python. As you know, I have to use Hadoop streaming to transfer data between STDIN and STOUT. according to my knowledge, when we print (key + "\t"+value), it will be sent to reducer. Reducer will receive data and it calculates the new centroids but when we print new centroids, I think it does not send them to mapper to calculate new clusters and it just send it to STDOUT and as you know, kmeans is a iterative program. So, my questions is whether Hadoop streaming suffers of doing iterative programs and we should employ MRJOB for iterative programs?

    Read the article

  • What database strategy to choose for a large web application

    - by Snoopy
    I have to rewrite a large database application, running on 32 servers. The hardware is up to date, each machine has two quad core Xeon and 32 GByte RAM. The database is multi-tenant, each customer has his own file, around 5 to 10 GByte each. I run around 50 databases on this hardware. The app is open to the web, so I have no control on the load. There are no really complex queries, so SQL is not required if there is a better solution. The databases get updated via FTP every day at midnight. The database is read-only. C# is my favourite language and I want to use ASP.NET MVC. I thought about the following options: Use two big SQL servers running SQL Server 2012 to serve the 32 servers with data. On the 32 servers running IIS hosting providing REST services. Denormalize the database and use Redis on each webserver. Use booksleeve as a Redis client. Use a combination of SQL Server and Redis Use SQL Server 2012 together with Hadoop Use Hadoop without SQL Server What is the best way for a read-only database, to get the best performance without loosing maintainability? Does Map-Reduce make sense at all in such a scenario? The reason for the rewrite is, the old app written in C++ with ISAM technology is too slow, the interfaces are old fashioned and not nice to use from an website, especially when using ajax. The app uses a relational datamodel with many tables, but it is possible to write one accerlerator table where all queries can be performed on, and all other information from the other tables are possible by a simple key lookup.

    Read the article

  • Amazon EC2 - network issues

    - by Algorist
    Hi, We are launching hadoop cluster on amazon ec2 and recently we are having network issues like master unable to connect to slave. We thought the reason is due to amazon throttling the network connections over a limit. So, we tried to establish a connection after a random delay from each slave node. But, that didn't help. Are there any other suggestions? Thank you Bala

    Read the article

  • Any Open Source Pregel like framework for distributed processing of large Graphs?

    - by Akshay Bhat
    Google has described a novel framework for distributed processing on Massive Graphs. http://portal.acm.org/citation.cfm?id=1582716.1582723 I wanted to know if similar to Hadoop (Map-Reduce) are there any open source implementations of this framework? I am actually in process of writing a Pseudo distributed one using python and multiprocessing module and thus wanted to know if someone else has also tried implementing it. Since public information about this framework is extremely scarce. (A link above and a blog post at Google Research)

    Read the article

  • How does Hive compare to HBase?

    - by mrhahn
    I'm interested in finding out how the recently-released (http://mirror.facebook.com/facebook/hive/hadoop-0.17/) Hive compares to HBase in terms of performance. The SQL-like interface used by Hive is very much preferable to the HBase API we have implemented.

    Read the article

  • Nearmap architecture

    - by portoalet
    Looking at http://www.nearmap.com/, Just wondering if you can approximate how much storage is needed to store the images? (NearMap’s monthly city PhotoMaps are captured at 3cm, 5cm, 7.5cm, or 10cm resolution) And what kind of systems/architecture is suitable to deliver those data/images? (say you are not Google, and want to implement this from scratch, what would you do? ) ie. would you store the images in Hadoop, and use memcache to deliver etc ?

    Read the article

  • How can I load a file into a DataBag from within a Yahoo PigLatin UDF?

    - by Cervo
    I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without needing to COGROUP...... Code: # **** Load files for iteration **** register myudfs.jar; wordcounts = LOAD 'input/wordcounts.txt' USING PigStorage('\t') AS (PatentNumber:chararray, word:chararray, frequency:double); centerassignments = load 'input/centerassignments/part-*' USING PigStorage('\t') AS (PatentNumber: chararray, oldCenter: chararray, newCenter: chararray); kcenters = LOAD 'input/kcenters/part-*' USING PigStorage('\t') AS (CenterID:chararray, word:chararray, frequency:double); kcentersa1 = CROSS centerassignments, kcenters; kcentersa = FOREACH kcentersa1 GENERATE centerassignments::PatentNumber as PatentNumber, kcenters::CenterID as CenterID, kcenters::word as word, kcenters::frequency as frequency; #***** Assign to nearest k-mean ******* assignpre1 = COGROUP wordcounts by PatentNumber, kcentersa by PatentNumber; assignwork2 = FOREACH assignpre1 GENERATE group as PatentNumber, myudfs.kmeans(wordcounts, kcentersa) as CenterID; basically my issue is that for each patent I need to pass the sub relations (wordcounts, kcenters). In order to do this, I do a cross and then a COGROUP by PatentNumber in order to get the set PatentNumber, {wordcounts}, {kcenters}. If I could figure a way to pass a relation or open up the centers from within the UDF, then I could just GROUP wordcounts by PatentNumber and run myudfs.kmeans(wordcount) which is hopefully much faster without the CROSS/COGROUP. This is an expensive operation. Currently this takes about 20 minutes and appears to tack the CPU/RAM. I was thinking it might be more efficient without the CROSS. I'm not sure it will be faster, so I'd like to experiment. Anyway it looks like calling the Loading functions from within Pig needs a PigContext object which I don't get from an evalfunc. And to use the hadoop file system, I need some initial objects as well, which I don't see how to get. So my question is how can I open a file from the hadoop file system from within a PIG UDF? I also run the UDF via main for debugging. So I need to load from the normal filesystem when in debug mode. Another better idea would be if there was a way to pass a relation into a UDF without needing to CROSS/COGROUP. This would be ideal, particularly if the relation resides in memory.. ie being able to do myudfs.kmeans(wordcounts, kcenters) without needing the CROSS/COGROUP with kcenters... But the basic idea is to trade IO for RAM/CPU cycles. Anyway any help will be much appreciated, the PIG UDFs aren't super well documented beyond the most simple ones, even in the UDF manual.

    Read the article

  • Stateful Iterators Java

    - by Gitmo
    What is a Stateful Iterator? This question relates to an Iterator defined in Hadoop for performing Joins. As the reference documentation states: This defines an interface to a stateful Iterator that can replay elements added to it directly. Note that this does not extend Iterator. What does 'replay elements added to it directly' mean? How is this iterator different from a usual iterator?

    Read the article

  • Look up values in a BDB for several files in parallel

    - by biznez
    What is the most efficient way to look up values in a BDB for several files in parallel? If I had a Perl script which did this for one file at a time, would forking/running the process in background with the ampersand in Linux work? How might Hadoop be used to solve this problem? Would threading be another solution?

    Read the article

  • New Feature in ODI 11.1.1.6: ODI for Big Data

    - by Julien Testut
    Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";} By Ananth Tirupattur Starting with Oracle Data Integrator 11.1.1.6.0, ODI is offering a solution to process Big Data. This post provides an overview of this feature. With all the buzz around Big Data and before getting into the details of ODI for Big Data, I will provide a brief introduction to Big Data and Oracle Solution for Big Data. So, what is Big Data? Big data includes: structured data (this includes data from relation data stores, xml data stores), semi-structured data (this includes data from weblogs) unstructured data (this includes data from text blob, images) Traditionally, business decisions are based on the information gathered from transactional data. For example, transactional Data from CRM applications is fed to a decision system for analysis and decision making. Products such as ODI play a key role in enabling decision systems. However, with the emergence of massive amounts of semi-structured and unstructured data it is important for decision system to include them in the analysis to achieve better decision making capability. While there is an abundance of opportunities for business for gaining competitive advantages, process of Big Data has challenges. The challenges of processing Big Data include: Volume of data Velocity of data - The high Rate at which data is generated Variety of data In order to address these challenges and convert them into opportunities, we would need an appropriate framework, platform and the right set of tools. Hadoop is an open source framework which is highly scalable, fault tolerant system, for storage and processing large amounts of data. Hadoop provides 2 key services, distributed and reliable storage called Hadoop Distributed File System or HDFS and a framework for parallel data processing called Map-Reduce. Innovations in Hadoop and its related technology continue to rapidly evolve, hence therefore, it is highly recommended to follow information on the web to keep up with latest information. Oracle's vision is to provide a comprehensive solution to address the challenges faced by Big Data. Oracle is providing the necessary Hardware, software and tools for processing Big Data Oracle solution includes: Big Data Appliance Oracle NoSQL Database Cloudera distribution for Hadoop Oracle R Enterprise- R is a statistical package which is very popular among data scientists. ODI solution for Big Data Oracle Loader for Hadoop for loading data from Hadoop to Oracle. Further details can be found here: http://www.oracle.com/us/products/database/big-data-appliance/overview/index.html ODI Solution for Big Data: ODI’s goal is to minimize the need to understand the complexity of Hadoop framework and simplify the adoption of processing Big Data seamlessly in an enterprise. ODI is providing the capabilities for an integrated architecture for processing Big Data. This includes capability to load data in to Hadoop, process data in Hadoop and load data from Hadoop into Oracle. ODI is expanding its support for Big Data by providing the following out of the box Knowledge Modules (KMs). IKM File to Hive (LOAD DATA).Load unstructured data from File (Local file system or HDFS ) into Hive IKM Hive Control AppendTransform and validate structured data on Hive IKM Hive TransformTransform unstructured data on Hive IKM File/Hive to Oracle (OLH)Load processed data in Hive to Oracle RKM HiveReverse engineer Hive tables to generate models Using the Loading KM you can map files (local and HDFS files) to the corresponding Hive tables. For example, you can map weblog files categorized by date into a corresponding partitioned Hive table schema. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";} Using the Hive control Append KM you can validate and transform data in Hive. In the below example, two source Hive tables are joined and mapped to a target Hive table. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";} The Hive Transform KM facilitates processing of semi-structured data in Hive. In the below example, the data from weblog is processed using a Perl script and mapped to target Hive table. Normal 0 false false false EN-US X-NONE X-NONE /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Calibri","sans-serif"; mso-bidi-font-family:"Times New Roman";} Using the Oracle Loader for Hadoop (OLH) KM you can load data from Hive table or HDFS to a corresponding table in Oracle. OLH is available as a standalone product. ODI greatly enhances OLH capability by generating the configuration and mapping files for OLH based on the configuration provided in the interface and KM options. ODI seamlessly invokes OLH when executing the scenario. In the below example, a HDFS file is mapped to a table in Oracle. Development and Deployment:The following diagram illustrates the development and deployment of ODI solution for Big Data. Using the ODI Studio on your development machine create and develop ODI solution for processing Big Data by connecting to a MySQL DB or Oracle database on a BDA machine or Hadoop cluster. Schedule the ODI scenarios to be executed on the ODI agent deployed on the BDA machine or Hadoop cluster. ODI Solution for Big Data provides several exciting new capabilities to facilitate the adoption of Big Data in an enterprise. You can find more information about the Oracle Big Data connectors on OTN. You can find an overview of all the new features introduced in ODI 11.1.1.6 in the following document: ODI 11.1.1.6 New Features Overview

    Read the article

  • Could not start ZK at requested port of 2181, while export HBASE_MANAGES_ZK=false

    - by utrecht
    Problem The first aim was to run HBase standalone. Navigating to ip:60010/master-status is succesfull once HBase has been started. The second aim is to run a distinct ZooKeeper quorum. ZooKeeper has been downloaded and has been started: netstat -nato | grep 2181 tcp 0 0 :::2181 :::* LISTEN off (0.00/0/0) The conf/hbase-env.sh was changed as follows: # Tell HBase whether it should manage it's own instance of Zookeeper or not. export HBASE_MANAGES_ZK=false in order to avoid HBase starts ZooKeeper once HBase has been started. However, the following error occurs once HBase has been started. Could not start ZK at requested port of 2181. ZK was started at port: 2182. Aborting as clients (e.g. shell) will not be able to find this ZK quorum. Question How to disable the startup of ZooKeeper by HBase and run ZooKeeper separately?

    Read the article

  • Cloudera Manager agent deploy failing to receive heartbeat from agent

    - by user150341
    All, I am getting the error on the console at the last phase of the installation: Installation failed. Failed to receive heartbeat from agent Server Log: 2012-12-19 00:32:12,132 INFO [NodeConfiguratorThread-4- 0:node.NodeConfiguratorProgress@503] 192.168.1.100: Setting WAIT_FOR_HEARTBEAT as failed and done state All nodes (name node and (2)client nodes) are VM's running 64bit CentOS. sshd has been enabled on all nodes, and VM's are set to Bridge. Any clue on how to fix this error?

    Read the article

  • What is meant by "streaming data access" in HDFS?

    - by Van Gale
    According to the HDFS Architecture page HDFS was designed for "streaming data access". I'm not sure what that means exactly, but would guess it means an operation like seek is either disabled or has sub-optimal performance. Would this be correct? I'm interested in using HDFS for storing audio/video files that need to be streamed to browser clients. Most of the streams will be start to finish, but some could have a high number of seeks. Maybe there is another file system that could do this better?

    Read the article

< Previous Page | 4 5 6 7 8 9 10 11 12 13 14 15  | Next Page >