Search Results

Search found 841 results on 34 pages for 'mr pig'.


Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

- by Pinal Dave

In yesterday’s blog post we learned the importance of Hive in the Big Data story. In this article we will look at what Pig and Pig Latin are. Yahoo started working on Pig for their application deployment on Hadoop. Yahoo’s goal was to manage its unstructured data.

What is Pig and What is Pig Latin?

Pig is a high-level platform for creating MapReduce programs used with Hadoop, and the language used on this platform is called Pig Latin. Pig was designed to make Hadoop more user-friendly and approachable for power users and non-developers. Pig is an interactive execution environment that supports the Pig Latin language. Pig Latin supports loading and processing input data through a series of transformations to produce the desired results.

Pig has two execution environments:
1) Local Mode – all the scripts run on a single machine.
2) Hadoop – all the scripts run on a Hadoop cluster.

Pig Latin vs SQL

Pig essentially creates a set of map and reduce jobs under the hood. Because of this, users no longer have to write, compile and build a Big Data solution themselves. Pig is similar to SQL in many ways. The Pig Latin language provides an abstraction layer over the data. It focuses on the data and not the structure under the hood. Pig Latin is a very powerful language and can perform operations such as loading and storing data, streaming data, filtering data, as well as various string operations. The major difference between SQL and Pig Latin is that Pig is procedural and SQL is declarative. In simpler words, Pig Latin is very similar to a SQL execution plan, and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows a directed acyclic graph (DAG). DAGs are used to model several different kinds of structures in mathematics and computer science.

Tomorrow

In tomorrow’s blog post we will discuss a very important component of the Big Data ecosystem – Zookeeper.

Reference: Pinal Dave (http://blog.sqlauthority.com)

Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Storing data to SequenceFile from Apache Pig

- by asquithea

Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:

REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
log = LOAD '/data/logs' USING SequenceFileLoader AS (...)

Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

ORDER BY job failed in the Pig script while running EmbeddedPig using Java

- by C.c. Huang

I have the following Pig script, which works perfectly in the grunt shell (it stores the results to HDFS without any issues); however, the last job (ORDER BY) fails if I run the same script using Java EmbeddedPig. If I replace the ORDER BY job with others, such as GROUP or FOREACH GENERATE, the whole script succeeds in Java EmbeddedPig. So I think it's the ORDER BY that causes the issue. Does anyone have any experience with this? Any help would be appreciated!

The Pig script:

REGISTER pig-udf-0.0.1-SNAPSHOT.jar;
user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float);
simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score;
grouped_user_similarity = GROUP simplified_user_similarity BY user_id;
ordered_user_similarity = FOREACH grouped_user_similarity {
    sorted = ORDER simplified_user_similarity BY sim_score DESC;
    top = LIMIT sorted 10;
    GENERATE group, top;
};
top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10);
all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0);
grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id;
influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score;
ordered_influence_scores = ORDER influence_scores BY influence_score DESC;
STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage();

The error log from Pig:

12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job
12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job
12/04/05
10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission. 12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir. 
12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: 
/var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml 12/04/05 10:00:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null 12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100 12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720 12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680 12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004 java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37) at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248) at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153) at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112) ... 6 more 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete 12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed! 12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 0.20.2-cdh3u3 0.8.1-cdh3u3 cchuang 2012-04-05 10:00:34 2012-04-05 10:01:04 GROUP_BY,ORDER_BY Some jobs have failed! Stop running all dependent jobs Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs job_local_0001 0 0 0 0 0 0 0 0 all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity GROUP_BY job_local_0002 0 0 0 0 0 0 0 0 grouped_influence_scores,influence_scores GROUP_BY,COMBINER job_local_0003 0 0 0 0 0 0 0 0 ordered_influence_scores SAMPLER Failed Jobs: JobId Alias Feature Message Outputs job_local_0004 ordered_influence_scores ORDER_BY Message: Job failed! 
Error - NA /tmp/cc-test-results-1, Input(s): Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000" Output(s): Failed to produce result in "/tmp/cc-test-results-1" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_local_0001 -> job_local_0002, job_local_0002 -> job_local_0003, job_local_0003 -> job_local_0004, job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs

Pig_Cassandra integration caused - ERROR 1070: Could not resolve CassandraStorage using imports:

- by Le Dude

I'm following a basic Pig, Cassandra and Hadoop installation. Everything works just fine stand-alone, with no errors. However, when I tried to run the example script provided with the pig_cassandra example, I got this error:

[root@localhost pig]# /opt/cassandra/apache-cassandra-1.1.6/examples/pig/bin/pig_cassandra -x local -x local /opt/cassandra/apache-cassandra-1.1.6/examples/pig/example-script.pig
Using /opt/pig/pig-0.10.0/pig-0.10.0-withouthadoop.jar.
2012-10-24 21:14:58,551 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-10-24 21:14:58,552 [main] INFO org.apache.pig.Main - Logging error messages to: /opt/pig/pig_1351138498539.log
2012-10-24 21:14:59,004 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-10-24 21:14:59,472 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /opt/pig/pig_1351138498539.log

Here is the log file:

Pig Stack Trace
---------------
ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1597)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1540)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:555)
    at org.apache.pig.Main.main(Main.java:111)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: Failed to parse: Cannot instantiate: CassandraStorage
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1589)
    ... 14 more
Caused by: java.lang.RuntimeException: Cannot instantiate: CassandraStorage
    at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:510)
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:791)
    at org.apache.pig.parser.LogicalPlanBuilder.buildFuncSpec(LogicalPlanBuilder.java:780)
    at org.apache.pig.parser.LogicalPlanGenerator.func_clause(LogicalPlanGenerator.java:4583)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3115)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175)
    ... 15 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:495)
    at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:507)
    ... 24 more
================================================================================

I googled around and found a question from another Stack Overflow user that identified the potential problem but not the solution: "Cassandra and pig integration cause error during startup". I believe my configuration is correct and the path has already been defined properly. I didn't change anything in the pig_cassandra file. I'm not quite sure how to proceed from here. Please help?

Splitting input into substrings in PIG (Hadoop)

- by Niels Basjes

Assume I have the following input in Pig:

some

And I would like to convert that into:

s
so
som
some

I've not (yet) found a way to iterate over a chararray in Pig Latin. I have found the TOKENIZE function, but that splits on word boundaries. So can Pig Latin do this, or does it require a Java class?
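
If a Java class does turn out to be needed, the core logic is small. The following is only a sketch of that logic in plain Java, assuming the goal is every leading substring of the input; `PrefixDemo` and `prefixes` are hypothetical names, and the Pig UDF wrapper that would call it is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: returns every leading substring of a word.
class PrefixDemo {
    static List<String> prefixes(String word) {
        List<String> result = new ArrayList<>();
        if (word == null) {
            return result; // treat a null field as "no prefixes"
        }
        // substring(0, end) yields the first `end` characters of the word
        for (int end = 1; end <= word.length(); end++) {
            result.add(word.substring(0, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one prefix per line for the question's sample input
        for (String prefix : prefixes("some")) {
            System.out.println(prefix);
        }
    }
}
```

A UDF built around this would return the prefixes as a bag so that FLATTEN can turn them into separate rows.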

Bunny Inc. – Episode 1. Mr. CIO meets Mr. Executive Manager

- by kellsey.ruppel(at)oracle.com

To make accurate and timely business decisions, executive managers are constantly in need of valuable information that is often hidden in old-style traditional systems. What can Mr. CIO come up with to help make Mr. Executive Manager's job easier at Bunny Inc.? Take a look and discover how you too can make informed business decisions by combining back-office systems with social media.

Technorati Tags: UXP, collaboration, enterprise 2.0, modern user experience, oracle, portals, webcenter, e20bunnies

Filtering null values with pig

- by arianp

It looks like a silly problem, but I can't find a way to filter null values from my rows. This is the result when I dump the object geoinfo:

DUMP geoinfo;
([longitude#70.95853,latitude#30.9773])
([longitude#-9.37944507,latitude#38.91780853])
(null)
(null)
(null)
([longitude#-92.64416,latitude#16.73326])
(null)
(null)
([longitude#-9.15199849,latitude#38.71179122])
([longitude#-9.15210796,latitude#38.71195131])

Here is the description:

DESCRIBE geoinfo;
geoinfo: {geoLocation: bytearray}

What I'm trying to do is to filter null values like this:

geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null;

but the result remains the same; nothing is filtered. I also tried something like this:

geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null';

and I got an error:

org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a map to a String

What am I doing wrong?

Details: running on Ubuntu, hadoop-1.0.3 with Pig 0.9.3.

pig -version
Apache Pig version 0.9.3-SNAPSHOT (rexported) compiled Oct 24 2012, 19:04:03

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Pig: Count number of keys in a map

- by Donald Miner

I'd like to count the number of keys in a map in Pig. I could write a UDF to do this, but I was hoping there would be an easier way.

data = LOAD 'hbase://MARS1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
    'A:*', '-loadKey true -caching=100000')
    AS (id:bytearray, A_map:map[]);

In the code above, I basically want to build a histogram of id and how many items in column family A that key has. Hoping it would work, I tried

c = FOREACH data GENERATE id, COUNT(A_map);

but that unsurprisingly didn't work. Or perhaps someone can suggest a better way to do this entirely. If I can't figure this out soon, I'll just write a Java MapReduce job or a Pig UDF.
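
Before writing a UDF, it may be worth checking Pig's built-in SIZE function, which is documented as returning the number of key/value pairs when given a map; whether it behaves that way on the Pig version in use here is an assumption to verify. If a UDF is still preferred, Pig passes map values to Java as `java.util.Map`, so the counting itself reduces to the sketch below. `MapKeyCountDemo` and `countKeys` are hypothetical names and the UDF wrapper is omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts the keys of a map, treating a missing map as empty.
class MapKeyCountDemo {
    static long countKeys(Map<String, Object> columns) {
        return columns == null ? 0L : columns.size();
    }

    public static void main(String[] args) {
        // Made-up column names standing in for one row of column family A
        Map<String, Object> row = new HashMap<>();
        row.put("A:col1", "x");
        row.put("A:col2", "y");
        System.out.println(countKeys(row));  // two keys in this sample row
        System.out.println(countKeys(null)); // a null map counts as zero
    }
}
```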

Mr Flibble: As Seen Through a Lens, Darkly

- by Phil Factor

One of the rewarding things about getting involved with Simple-Talk has been in meeting and working with some pretty daunting talents. I’d like to say that Dom Reed’s talents are at the end of the visible spectrum, but then there is Richard, who pops up on national radio occasionally, presenting intellectual programs, Andrew, master of the ukulele, with his pioneering local history work, and Tony with marathon running and his past as a university lecturer. However, Dom, who is Red Gate’s head of creative design and who did the preliminary design work for Simple-Talk, has taken art photography to an extreme that was impossible before Photoshop. He’s not the first person to take a photograph of himself every day for two years, but he is definitely the first to weave the results into a frightening narrative that veers from comedy to pathos, using all the arts of Photoshop to create a fictional character, Mr Flibble. Have a look at some of the Flickr pages.

Uncle Spike
The B-Men – Woolverine
The 2011 BoyZ iN Sink reunion tour turned out to be their last
Error 404 – Flibble not found

Mr Flibble is not a normal type of alter-ego. We generally prefer to choose bronze age warriors of impossibly magnificent physique and stamina; superheroes who bestride the world, scorning the forces of evil and anarchy in a series of noble and righteous quests. Not so Dom, whose Mr Flibble is vulnerable, and laid low by an addiction to toxic substances. His work has gained an international cult following and is used as course material by several courses in photography. Although his work was for a while ignored by the more conventional world of ‘art’ photography, it became famous through the internet. His photos have received well over a million views on Flickr. It was definitely time to turn this work into a book, because the whole sequence of images has its maximum effect when seen in sequence.
He has a Kickstarter project page, one of the first following the recent UK launch of the crowdfunding platform. The publication of the book should be a major event and the £45 I shall divvy up will be one of the securest investments I shall ever make. The local news in Cambridge picked up on the project and I can quote from the report by the excellent Cabume website , the source of Tech news from the ‘Cambridge cluster’ Put really simply Mr Flibble likes to dress up and take pictures of himself. One of the benefits of a split personality, however is that Mr Flibble is supported in his endeavour by Reed’s top notch photography skills, supreme mastery of Photoshop and unflinching dedication to the cause. The duo have collaborated to take a picture every day for the past 730-plus days. It is not a big surprise that neither Mr Flibble nor Reed watches any TV: In addition to his full-time role at Cambridge software house,Red Gate Software as head of creativity and the two to five hours a day he spends taking the Mr Flibble shots, Reed also helps organise the . And now Reed is using Kickstarter to see if the world is ready for a Mr Flibble coffee table book. Judging by the early response it is. At the time of writing, just a few days after it went live, ‘I Drink Lead Paint: An absurd photography book by Mr Flibble’ had raised £1,545 of the £10,000 target it needs to raise by the Friday 30 November deadline from 37 backers. Following the standard Kickstarter template, Reed is offering a series of rewards based on the amount pledged, ranging from a Mr Flibble desktop wallpaper for pledges of £5 or more to a signed copy of the book for pledges of £45 or more, right up to a starring role in the book for £1,500. Mr Flibble is unquestionably one of the more deranged Kickstarter hopefuls, but don’t think for a second that he doesn’t have a firm grasp on the challenges he faces on the road to immortalisation on 150 gsm stock. 
Under the section ‘risks and challenges’ on his Kickstarter page his statement begins: “An angry horde of telepathic iguanas discover the world’s last remaining stock of vintage lead paint and hold me to ransom. Gosh how I love to guzzle lead paint. Anyway… faced with such brazen bravado, I cower at the thought of taking on their combined might and die a sad and lonely Flibble deprived of my one and only true liquid love.” At which point, Reed manages to wrestle away the keyboard, giving him the opportunity to present slightly more cogent analysis of the obstacles the project must still overcome. We asked Reed a few questions about Mr Flibble’s Kickstarter adventure and felt that his responses were worth publishing in full: Firstly, how did you manage it – holding down a full time job and also conceiving and executing these ideas on a daily basis? I employed a small team of ferocious gerbils to feed me ideas on a daily basis. Whilst most of their ideas were incomprehensibly rubbish and usually revolved around food, just occasionally they’d give me an idea like my B-Men series. As a backup plan though, I found that the best way to generate ideas was to actually start taking photos. If I were to stand in front of the camera, pull a silly face, place a vegetable on my head or something else equally stupid, the resulting photo of that would typically spark an idea when I came to look at it. Sitting around idly trying to think of an idea was doomed to result in no ideas. I admit that I really struggled with time. I’m proud that I never missed a day, but it was definitely hard when you were late from work, tired or doing something socially on the same day. I don’t watch TV, which I guess really helps, because I’d frequently be spending 2-5 hours taking and processing the photos every day. Are there any overlaps between software development and creative thinking? 
Software is an inherently creative business and the speed that it moves ensures you always have to find solutions to new things. Everyone in the team needs to be a problem solver. Has it helped me specifically with my photography? Probably. Working within teams that continually need to figure out new stuff keeps the brain feisty I suppose, and I guess I’m continually exposed to a lot of possible sources of inspiration. How specifically will this Kickstarter project allow you to test the commercial appeal of your work and do you plan to get the book into shops? It’s taken a while to be confident saying it, but I know that people like the work that I do. I’ve had well over a million views of my pictures, many humbling comments and I know I’ve garnered some loyal fans out there who anticipate my next photo. For me, this Kickstarter is about seeing if there’s worth to my work beyond just making people smile. In an online world where there’s an abundance of freely available content, can you hope to receive anything from what you do, or would people just move onto the next piece of content if you happen to ask for some support? A book has been the single-most requested thing that people have asked me to produce and it’s something that I feel would showcase my work well. It’s just hard to convince people in the publishing industry just now to take any kind of risk – they’ve been hit hard. If I can show that people would like my work enough to buy a book, then it sends a pretty clear picture that publishers might hear, or it gives me the confidence enough to invest in myself a bit more – hard to do when you’re riddled with self-doubt! I’d love to see my work in the shops, yes. I could see it being the thing that someone flips through idly as they’re Christmas shopping and recognizing that it’d be just the perfect gift for their difficult to buy for friend or relative. 
That said, working in the software industry means I’m clearly aware of how I could use technology to distribute my work, but I can’t deny that there’s something very appealing to having a physical thing to hold in your hands.

If the project is successful is there a chance that it could become a full-time job?

At the moment that seems like a distant dream, as should this be successful, there are many more steps I’d need to take to reach any kind of business viability. Kickstarter seems exactly that – a way for people to help kick start me into something that could take off. If people like my work and want me to succeed with it, then taking a look at my Kickstarter page (and hopefully pledging a bit of support) would make my elbows blush considerably.

So there it is. An opportunity to open the wallet just a bit to ensure that one of the more unusual talents sees the light in the format it deserves.

PIG doesn't read my custom InputFormat

- by Simon Guo

I have a custom MyInputFormat that is supposed to deal with the record boundary problem for multi-line inputs. But the problem shows up when I put MyInputFormat into my UDF load function, as follows:

public class EccUDFLogLoader extends LoadFunc {
    @Override
    public InputFormat getInputFormat() {
        System.out.println("I am in getInputFormat function");
        return new MyInputFormat();
    }
}

public class MyInputFormat extends TextInputFormat {
    public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf) throws IOException {
        System.out.println("I am in createRecordReader");
        // MyRecordReader is supposed to handle the record boundary
        return new MyRecordReader((FileSplit) inputSplit, jobConf);
    }
}

For each mapper, it prints "I am in getInputFormat function" but not "I am in createRecordReader". I am wondering if anyone can provide a hint on how to hook up my custom MyInputFormat to Pig's UDF loader? Many thanks. I am using Pig on Amazon EMR.
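
One plausible cause, offered as a guess, not a confirmed diagnosis: the new-API `TextInputFormat` in `org.apache.hadoop.mapreduce.lib.input` declares `createRecordReader(InputSplit, TaskAttemptContext)`, while the method in the question takes a `JobConf`. A method with a different parameter type is an overload, not an override, so the framework would keep calling the inherited version. The sketch below reproduces that effect with stand-in classes only; `Split`, `OldConf`, `NewContext`, `BaseFormat` and the rest are hypothetical and are not the real Hadoop types.

```java
// Stand-ins for the framework types; these are NOT the real Hadoop classes.
class Split {}
class OldConf {}
class NewContext {}

class BaseFormat {
    // The only method the (simulated) framework ever calls.
    String createRecordReader(Split split, NewContext context) {
        return "default reader";
    }
}

class MyFormat extends BaseFormat {
    // Different parameter type: an overload, so the framework never reaches it.
    String createRecordReader(Split split, OldConf conf) {
        return "custom reader";
    }
}

class FixedFormat extends BaseFormat {
    @Override // compiles only because the signature matches the base method
    String createRecordReader(Split split, NewContext context) {
        return "custom reader";
    }
}

class OverrideDemo {
    // Simulates the framework, which only knows the base-class signature.
    static String frameworkCall(BaseFormat format) {
        return format.createRecordReader(new Split(), new NewContext());
    }

    public static void main(String[] args) {
        System.out.println(frameworkCall(new MyFormat()));    // default reader
        System.out.println(frameworkCall(new FixedFormat())); // custom reader
    }
}
```

Adding `@Override` to the method in the question would make the compiler report the mismatch directly.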

Configuring correct port for Oozie (invoking PIG script) in Cloudera Hue

- by user2985324

I am new to the CDH4 Oozie workflow editor. While trying to invoke a Pig script from the Oozie workflow editor, I am getting the following error:

HadoopAccessorException: E0900: Jobtracker [mymachine:8032] not allowed, not in Oozies whitelist

It looks like Oozie is submitting the job to the YARN port (8032). I want it to submit to port 8021 (the MR jobtracker). Can someone help me identify where to set the job tracker URL or port so that Oozie picks up the correct one (using Hue or Cloudera Manager)?

Previously I tried the following, but none of it helped:

1. Modified the /user/hue/oozie/workspaces/../workflow.xml file. However, it gets overwritten when I submit the job from the workflow editor.

2. In Cloudera Manager -- oozie -- configuration -- Oozie Server (advanced) -- Oozie Server Configuration Safety Valve for oozie-site.xml, I set the following properties and restarted the Oozie service:

<property>
  <name>oozie.service.HadoopAccessorService.nameNode.whitelist</name>
  <value>mymachine:8020</value>
</property>
<property>
  <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
  <value>mymachine:8021</value>
</property>

3. Tried to override the 'jobTracker' property while configuring the Pig task. This appears as follows in the workflow file; however, it doesn't take effect (or doesn't override) and port 8032 is still used.

<global>
  <configuration>
    <property>
      <name>jobTracker</name>
      <value>mymachine:8021</value>
    </property>
  </configuration>
</global>

I am using CDH4. Thanks for looking into my question.

Hadoop/Pig Cross-join

- by sagie

Hi I am using Pig to cross join two data sets, both with format. This will result as . In my case, if I have both tuples and , it is a duplication. Can I filter those duplications (or not joining them at all)? thanks, Sagie
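
The tuple formats are missing from the text above, so the following is only a sketch under an assumption: each record carries a comparable key, and a pair and its mirror image count as the same pair. Keeping a pair only when the left key sorts before the right key drops the mirrored copy (and also self-pairs). `UnorderedPairsDemo` and `unorderedPairs` are hypothetical names; in Pig the same idea would be a FILTER on the two key fields after the CROSS.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: cross-joins two key lists but keeps each unordered pair once.
class UnorderedPairsDemo {
    static List<String[]> unorderedPairs(List<String> left, List<String> right) {
        List<String[]> pairs = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                // Keep (l, r) only when l sorts strictly before r:
                // this drops the mirrored pair (r, l) and self-pairs (l, l).
                if (l.compareTo(r) < 0) {
                    pairs.add(new String[] {l, r});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c");
        for (String[] pair : unorderedPairs(keys, keys)) {
            System.out.println(pair[0] + "," + pair[1]);
        }
    }
}
```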

groovy, merge two lists : listA is [[Name: mr good, note: good, rating:9], [Name: mr bad, note: bad,

- by user311884

I have two lists:

listA: [[Name: mr good, note: good, rating:9], [Name: mr bad, note: bad, rating:5]]
listB: [[Name: mr good, note: good, score:77], [Name: mr bad, note: bad, score:12]]

I want to get this one:

listC: [[Name: mr good, note: good, rating:9, score:77], [Name: mr bad, note: bad, rating:5, score:12]]

How could I do it? Thanks.
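The question is about Groovy, but the merge logic itself is language-neutral. Here is a minimal Python sketch of the idea, assuming that `Name` is the key that identifies matching entries (the question does not say so explicitly) and that each entry behaves like a map. The function name `merge_by_name` is made up for illustration.

```python
# Sample data mirroring listA and listB from the question.
list_a = [
    {"Name": "mr good", "note": "good", "rating": 9},
    {"Name": "mr bad", "note": "bad", "rating": 5},
]
list_b = [
    {"Name": "mr good", "note": "good", "score": 77},
    {"Name": "mr bad", "note": "bad", "score": 12},
]


def merge_by_name(first, second):
    """Combine entries that share the same "Name" into one dict each.

    Assumption: "Name" uniquely identifies an entry. Later values
    overwrite earlier ones when both lists define the same field.
    """
    merged = {}
    for record in first + second:
        # Create an empty entry on first sight, then fold the fields in.
        merged.setdefault(record["Name"], {}).update(record)
    return list(merged.values())


list_c = merge_by_name(list_a, list_b)
```

The same shape should carry over to Groovy by grouping on the `Name` field and summing the maps in each group, though that translation is not tested here.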

Read the article
Does throwing an exception in an EvalFunc pig UDF skip just that line, or stop completely?

- by Daniel Huckstep

I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to pig, so it can do all the processing. It looks something like this:

public abstract class Foo extends EvalFunc<Tuple> {
    public Foo() {
        super();
    }

    public Tuple exec(Tuple input) throws IOException {
        try {
            // do stuff with input
        } catch (Exception e) {
            throw WrappedIOException.wrap("Error with line", e);
        }
    }
}

My question is: if it throws the IOException, will it stop completely, or will it return results for the rest of the lines that don't throw an exception?

Example: I run this in pig

REGISTER myjar.jar
DEFINE Extractor com.namespace.Extractor();
logs = LOAD '$IN' USING TextLoader AS (line: chararray);
events = FOREACH logs GENERATE FLATTEN(Extractor(line));

With this input:

1.5 7 "Valid Line"
1.3 gghyhtt Inv"alid line"" I throw an exceptioN!!
1.8 10 "Valid Line 2"

Will it process the two lines and will 'logs' have 2 tuples, or will it just die in a fire?

Read the article
Professional Custom Logo Design vs. Mr. Right

John is an ex-marine and ex-employee of general motors. He recently lost his job working as a welder on the assembly lines of one of GM manufacturing plants. John has traveled a lot and knows a lot a... [Author: Emily Matthew - Web Design and Development - March 31, 2010]

Read the article
Why was Mr. Scott Scottish?

- by iamjames

It's a good question: of all the engineers in the world, why choose a Scottish engineer? Gene Roddenberry probably chose a Scottish engineer because of James Watt, the man the watt (the unit of power) is named after. He was a Scottish inventor and mechanical engineer who made significant improvements to the steam engine. That made sense in the '60s. Given the past hundred years, a new Star Trek might instead have started with a German engineer (or maybe a Japanese one). But World War II had ended barely 20 years earlier, the 20-somethings who had survived the war were now 40-somethings, and seeing a German engineer probably wouldn't have gone over too well.

Read the article
How to use Cassandra's Map Reduce with or w/o Pig?

- by UltimateBrent

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa, how would I load in a new map reduce function, and then call it? Does my map reduce function have to be java that's installed on the cassandra server? If so, how do I call it from Pycassa? There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help. Your answer can use Thrift or whatever, I just mentioned Pycassa to denote the client side. I'm just trying to understand the difference between what runs in the Cassandra cluster vs. the actual server making the requests.

Read the article
Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?

- by lmonson

I'm using Amazon's elastic map reduce. I have log files that look something like this random text foo="1" more random text foo="2" more text noise foo="1" blah blah blah foo="1" blah blah foo="3" blah blah foo="4" ... How can I write a pig expression to pick out all the numbers in the 'foo' expressions? I prefer tuples that look something like this: (1,2) (1) (1,3,4) I've tried the following: TUPLES = foreach LINES generate FLATTEN(EXTRACT(line,'foo="([0-9]+)"')); But this yields only the first match in each line: (1) (1) (1)
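As a way to pin down the target output, here is a minimal Python sketch of the per-line extraction. It is not a Pig solution. It assumes the sample log breaks into three lines, as the desired tuples (1,2), (1) and (1,3,4) suggest, and the helper name `extract_foo` is made up for illustration.

```python
import re

# Matches foo="<digits>" and captures only the digits.
FOO_PATTERN = re.compile(r'foo="([0-9]+)"')


def extract_foo(line):
    """Return every foo value on one log line as a tuple of ints."""
    return tuple(int(number) for number in FOO_PATTERN.findall(line))


# Assumed line breaks for the sample log in the question.
sample_lines = [
    'random text foo="1" more random text foo="2" more text noise',
    'foo="1" blah blah blah',
    'foo="1" blah blah foo="3" blah blah foo="4" ...',
]
results = [extract_foo(line) for line in sample_lines]
```

The point is that the pattern has to be applied repeatedly across the line ("find all") rather than once. Whatever Pig construct is used needs the same repeated-match behavior.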

Read the article
Calling Grep inside Java gives incorrect results when calling grep in shell gives correct results.

- by futureelite7

I've got a problem where calling grep from inside Java gives incorrect results, compared to the results from calling grep on the same file in the shell.

My grep command (called both in Java and in bash; I escaped the slash in Java accordingly):

/bin/grep -vP --regexp='^[0-9]+\t.*' /usr/local/apache-tomcat-6.0.18/work/Catalina/localhost/saccitic/237482319867147879_1271411421

The command is supposed to match and discard strings like these:

85295371616 Hi Mr Lee, please be informed that...

My input file is this:

85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
8~!95371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
852&^*&1616 Hi Mr Lee, please be informed that...
8529537Ax16 Hi Mr Lee, please be informed that...
85====ppq16 Hi Mr Lee, please be informed that...
85291234783 a3283784428349247233834728482984723333
85219299222

The command works when I call it from inside bash (results below):

85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
8~!95371616 Hi Mr Lee, please be informed that...
852&^*&1616 Hi Mr Lee, please be informed that...
8529537Ax16 Hi Mr Lee, please be informed that...
85====ppq16 Hi Mr Lee, please be informed that...
85219299222

However, when I call grep inside Java, I get the entire file back, identical to the input listing above.

What could be the problem that causes the grep called by Java to return incorrect results? I tried passing locale information via the environment string array in runtime.exec, but nothing seems to change. Am I passing in the locale information incorrectly, or is the problem something else entirely?

private String[] localeArray = new String[] { "LANG=", "LC_COLLATE=C", "LC_CTYPE=UTF-8", "LC_MESSAGES=C", "LC_MONETARY=C", "LC_NUMERIC=C", "LC_TIME=C", "LC_ALL=" };

Read the article
Calling Grep inside Java gives incorrect results while calling grep in shell gives correct results.

- by futureelite7

I've got a problem where calling grep from inside Java gives incorrect results, compared to the results from calling grep on the same file in the shell.

My grep command (called both in Java and in bash; I escaped the slash in Java accordingly):

/bin/grep -vP --regexp='^[0-9]+\t.*' /usr/local/apache-tomcat-6.0.18/work/Catalina/localhost/saccitic/237482319867147879_1271411421

Java code:

String filepath = "/path/to/file";
String options = "P";
String grepparams = "^[0-9]+\\t.*";
String greppath = "/bin/";
String[] localeArray = new String[] { "LANG=", "LC_COLLATE=C", "LC_CTYPE=UTF-8", "LC_MESSAGES=C", "LC_MONETARY=C", "LC_NUMERIC=C", "LC_TIME=C", "LC_ALL=" };
options = "v" + options;
// Assign optional params
if (options.contains("P")) {
    grepparams = "\'" + grepparams + "\'"; // Quote the regex expression if -P flag is used
} else {
    options = "E" + options; // equivalent to calling egrep
}
proc = sysRuntime.exec(greppath + "/grep -" + options + " --regexp=" + grepparams + " " + filepath, localeArray);
System.out.println(greppath + "/grep -" + options + " --regexp=" + grepparams + " " + filepath);
inStream = proc.getInputStream();

The command is supposed to match and discard strings like these:

85295371616 Hi Mr Lee, please be informed that...

My input file is this:

85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
8~!95371616 Hi Mr Lee, please be informed that...
85295371616 Hi Mr Lee, please be informed that...
852&^*&1616 Hi Mr Lee, please be informed that...
8529537Ax16 Hi Mr Lee, please be informed that...
85====ppq16 Hi Mr Lee, please be informed that...
85291234783 a3283784428349247233834728482984723333
85219299222

The command works when I call it from inside bash (results below):

85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
85aaa234567 Hi Ms Chan, please be informed that...
85292vx5678 Hi Mrs Ng, please be informed that...
8~!95371616 Hi Mr Lee, please be informed that...
852&^*&1616 Hi Mr Lee, please be informed that...
8529537Ax16 Hi Mr Lee, please be informed that...
85====ppq16 Hi Mr Lee, please be informed that...
85219299222

However, when I call grep inside Java, I get the entire file back, identical to the input listing above.

What could be the problem that causes the grep called by Java to return incorrect results? I tried passing locale information via the environment string array in runtime.exec, but nothing seems to change. Am I passing in the locale information incorrectly, or is the problem something else entirely?
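One plausible explanation, offered as a hypothesis rather than a confirmed diagnosis: the single-string form of Runtime.exec splits the command on whitespace and does not interpret shell quoting, so the single quotes around the pattern reach grep as literal characters. A pattern that starts with a literal quote matches no line, and with -v every line is then printed. The Python sketch below only illustrates the tokenization difference. `str.split` stands in for whitespace-only splitting, `shlex.split` stands in for what a POSIX shell does with quotes, and the file path is a placeholder.

```python
import shlex

# The command line as it would be typed in bash (path is a placeholder).
command = r"/bin/grep -vP --regexp='^[0-9]+\t.*' /path/to/file"

# Whitespace-only tokenization: the quotes stay inside the argument.
naive_args = command.split()

# Shell-style tokenization: the quotes are consumed by the "shell".
shell_args = shlex.split(command)

print(naive_args[2])  # what grep would receive without a shell
print(shell_args[2])  # what grep receives when bash parses the line
```

If this is the cause, dropping the added quotes, or passing the arguments as a String array so nothing needs quoting, would be the thing to try on the Java side.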

Read the article
Regex to match . (periods marking end of sentences) but not Mr. (as in Mr. Hopkins)

- by Josh Crews

I'm trying to parse a text file into sentences ending in periods, but names like Mr. Hopkins are throwing false alarms on matching for periods. What regex identifies "." but not "Mr."? For a bonus, I'm also using ! to find the end of sentences, so my current regex is /(!/./ and I'd love an answer that incorporates my !'s too.
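One common approach is a negative lookbehind: match a period only when it is not immediately preceded by "Mr", and match "!" unconditionally. The sketch below uses Python's `re` module. The question does not name a regex flavor, so the lookbehind syntax is an assumption about what the target engine supports, and `split_sentences` is a hypothetical helper name.

```python
import re

# A period not preceded by the word "Mr", or any exclamation mark.
SENTENCE_END = re.compile(r"(?<!\bMr)\.|!")


def split_sentences(text):
    """Cut text after each sentence-ending match and trim whitespace."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:  # keep trailing text that has no terminator
        sentences.append(tail)
    return sentences


example = "Mr. Hopkins arrived. He was late! Nobody minded."
sentences = split_sentences(example)
```

Other abbreviations (Mrs., Dr., and so on) would each need their own lookbehind. That list is left out here because the question only mentions "Mr.".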

Read the article
Unable to run MR on cluster

- by RAVITEJA SATYAVADA

I have a MapReduce program that runs successfully in standalone (Eclipse) mode, but when I export the jar and run the same job on the cluster, it throws a null pointer exception like this:

13/06/26 05:46:22 ERROR mypackage.HHDriver: Error while configuring run method. java.lang.NullPointerException

I double-checked the run method parameters; they are not null, and the job runs fine in standalone mode.

Read the article
How can I load a file into a DataBag from within a Yahoo PigLatin UDF?

- by Cervo

I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or be able to pass another relation into the UDF without needing to COGROUP.

Code:

-- **** Load files for iteration ****
register myudfs.jar;
wordcounts = LOAD 'input/wordcounts.txt' USING PigStorage('\t') AS (PatentNumber:chararray, word:chararray, frequency:double);
centerassignments = load 'input/centerassignments/part-*' USING PigStorage('\t') AS (PatentNumber: chararray, oldCenter: chararray, newCenter: chararray);
kcenters = LOAD 'input/kcenters/part-*' USING PigStorage('\t') AS (CenterID:chararray, word:chararray, frequency:double);
kcentersa1 = CROSS centerassignments, kcenters;
kcentersa = FOREACH kcentersa1 GENERATE centerassignments::PatentNumber as PatentNumber, kcenters::CenterID as CenterID, kcenters::word as word, kcenters::frequency as frequency;

-- ***** Assign to nearest k-mean *******
assignpre1 = COGROUP wordcounts by PatentNumber, kcentersa by PatentNumber;
assignwork2 = FOREACH assignpre1 GENERATE group as PatentNumber, myudfs.kmeans(wordcounts, kcentersa) as CenterID;

Basically my issue is that for each patent I need to pass the sub-relations (wordcounts, kcenters). In order to do this, I do a CROSS and then a COGROUP by PatentNumber in order to get the set PatentNumber, {wordcounts}, {kcenters}. If I could figure out a way to pass a relation or open up the centers from within the UDF, then I could just GROUP wordcounts by PatentNumber and run myudfs.kmeans(wordcount), which is hopefully much faster without the CROSS/COGROUP.

This is an expensive operation. Currently this takes about 20 minutes and appears to tax the CPU/RAM. I was thinking it might be more efficient without the CROSS. I'm not sure it will be faster, so I'd like to experiment.

Anyway, it looks like calling the loading functions from within Pig needs a PigContext object, which I don't get from an EvalFunc. And to use the Hadoop file system, I need some initial objects as well, which I don't see how to get. So my question is: how can I open a file from the Hadoop file system from within a Pig UDF? I also run the UDF via main for debugging, so I need to load from the normal filesystem when in debug mode.

Another, better idea would be a way to pass a relation into a UDF without needing to CROSS/COGROUP. This would be ideal, particularly if the relation resides in memory, i.e. being able to do myudfs.kmeans(wordcounts, kcenters) without needing the CROSS/COGROUP with kcenters. The basic idea is to trade IO for RAM/CPU cycles.

Any help will be much appreciated. The Pig UDFs aren't well documented beyond the simplest ones, even in the UDF manual.

Read the article
Building Simple Workflows in Oozie

- by dan.mcclary

Introduction

More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want to codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site.

Hive Actions: Prepping for Pig

In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie.

I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me the flexibility of Hive's query language when I want it, but lets me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code.

CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column 1, column2) PARTITIONED BY (yr string) STORED AS ... LOCATION '/user/oracle/weather/historic';

As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow.
Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of its long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing a SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access.

ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010') LOCATION '/user/oracle/weather/historic/yr=2010';

INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history'
SELECT w.stn, w.wban, w.weather_year, w.weather_month, w.weather_day, w.temp, w.dewp, w.weather
FROM (
  FROM historic_weather
  SELECT TRANSFORM(...) USING '/path/to/hive/filters/ncdc_parser.py'
  AS stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather
) w;

Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive statements using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called ncdc_parse.hql.

Starting Our Workflow

Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions as workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling.
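The fixed-width parsing step described above (the ncdc_parser.py script handed to TRANSFORM) might look roughly like the sketch below. The article never gives the real byte offsets, so every offset in FIELDS is a made-up placeholder, and only a subset of the columns is shown. Hive's TRANSFORM feeds rows to the script on standard input and reads tab-separated columns back from standard output, which is why the output is joined with tabs.

```python
# HYPOTHETICAL field layout: (column name, start offset, end offset).
# The article only says "x bytes for the station ID, y bytes for the
# dew point", so these numbers are placeholders, not the NCDC format.
FIELDS = [
    ("stn", 0, 6),
    ("wban", 7, 12),
    ("weather_year", 14, 18),
    ("weather_month", 18, 20),
    ("weather_day", 20, 22),
    ("temp", 24, 30),
]


def parse_line(line):
    """Slice one fixed-width record into a list of trimmed column values."""
    return [line[start:end].strip() for _, start, end in FIELDS]


def transform(lines):
    """Yield one tab-separated row per input record."""
    for line in lines:
        yield "\t".join(parse_line(line.rstrip("\n")))


# A fabricated record laid out to match the placeholder offsets above.
sample = "724940 23234  20100115    55.3"
parsed = parse_line(sample)
```

To use something like this as a TRANSFORM script, the rows would be driven from standard input, for example `for row in transform(sys.stdin): print(row)` after importing `sys`.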
The bare minimum for workflow XML defines a name, a starting point, and an end point:

<workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1">
  <start to="ParseNCDCData"/>
  <end name="end"/>
</workflow-app>

To this we need to add an action, and within that we'll specify the Hive parameters. Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure.

<action name="ParseNCDCData">
  <hive xmlns="uri:oozie:hive-action:0.2">
    <job-tracker>localhost:8021</job-tracker>
    <name-node>localhost:8020</name-node>
    <configuration>
      <property>
        <name>oozie.hive.defaults</name>
        <value>/user/oracle/weather_ooze/hive-default.xml</value>
      </property>
    </configuration>
    <script>ncdc_parse.hql</script>
  </hive>
  <ok to="WeatherMan"/>
  <error to="end"/>
</action>

There are a couple of things to note here:

- I have to give the FQDN (or IP) and port of my JobTracker and NameNode.
- I have to include a hive-default.xml file.
- I have to include a script file.
- The hive-default.xml and script file must be stored in HDFS.

That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have a different hive-default.xml on different clusters (e.g. MySQL or Postgres-backed metastores). A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml. At this point, our local directory should contain:

- workflow.xml
- hive-default.xml (make sure this file contains your metastore connection data)
- ncdc_parse.hql

Adding Pig to the Ooze

Adding our Pig script as an action is slightly simpler from an XML standpoint.
All we do is add an action to workflow.xml as follows:

<action name="WeatherMan">
  <pig>
    <job-tracker>localhost:8021</job-tracker>
    <name-node>localhost:8020</name-node>
    <script>weather_train.pig</script>
  </pig>
  <ok to="end"/>
  <error to="end"/>
</action>

Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My Pig script registers the Weka jar and a chunk of Jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the Pig script, because Pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes. While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the Pig action with an <archive> tag. Yes, that's as confusing as you think it is. You can get the exact rules for adding jars to the distributed cache from Oozie's Pig Cookbook.

Making the Workflow Work

We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server.
In the same working directory, we'll make a file called job.properties as follows:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
weatherRoot=weather_ooze
mapreduce.jobtracker.kerberos.principal=foo
dfs.namenode.kerberos.principal=foo
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot}
outputDir=weather-ooze

While some of the pieces of the properties file are familiar (e.g., the JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The oozie.libpath piece is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory.

We're finally ready to submit our job! After all that work we only need to do a few more things:

1. Validate our workflow.xml
2. Copy our working directory to HDFS
3. Submit our job to the Oozie server
4. Run our workflow

Let's do them in order. First validate the workflow:

oozie validate workflow.xml

Next, copy the working directory up to HDFS:

hadoop fs -put working_dir /user/oracle/working_dir

Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our job.properties file as an argument.

oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit

We've submitted the job, but we don't see any activity on the JobTracker?
All I got was this funny bit of output:

14-20120525161321-oozie-oracle

This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job.

oozie job -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle

Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run". This will prep and run the workflow immediately.

Takeaway

So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

Read the article
Arguments passed on by shell to command in Unix

- by Ryan Brown

I've been going over this question and I can't for the life of me figure out why the answer is what it is.

How many arguments are passed to the command by the shell on this command line:

<pig pig -x " " -z -r" " >pig pig pig

a. 8 b. 6 c. 5 d. 7 e. 9

The first symbol is supposed to be the symbol for redirected input but the site isn't letting me use it. [Fixed.] I looked at this question and said OK... arguments... not options, so the 2nd pig, then " ", then -r" ", the 4th pig and the 5th pig... -z and -x are options, so I count 5. The answer is b. 6. Where is the 6th argument that's being passed on?
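One way to read the expected answer: the shell makes no distinction between "options" and "arguments". Every word left after it strips the redirections and the command name is passed along. The rough Python sketch below approximates that with `shlex.split`. The redirection handling is a simplification that only works here because each redirection operator is attached to its file name, and `split_command` is a hypothetical helper name.

```python
import shlex


def split_command(line):
    """Return (command, arguments) roughly as a POSIX shell would see them.

    Simplification: redirections are recognized only when the operator
    is glued to its target, as in "<pig" and ">pig".
    """
    tokens = shlex.split(line)  # handles the quoting, e.g. -r" " -> "-r "
    words = [t for t in tokens if not t.startswith(("<", ">"))]
    command, *arguments = words
    return command, arguments


command, arguments = split_command('<pig pig -x " " -z -r" " >pig pig pig')
print(command, arguments, len(arguments))
```

Under these assumptions the command is the first remaining pig, and the words passed to it are -x, a single space, -z, "-r " (the quoted space joins onto -r), and the two trailing pigs. Whether the command's own name also counts as an argument is a matter of convention, and this sketch excludes it.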

Read the article
