Search Results

Search found 8 results on 1 pages for 'piglatin'.

Page 1/1 | 1 

  • How can I load a file into a DataBag from within a Yahoo PigLatin UDF?

    - by Cervo
    I have a Pig program where I am trying to compute the minimum center between two bags. In order for it to work, I found I need to COGROUP the bags into a single dataset. The entire operation takes a long time. I want to either open one of the bags from disk within the UDF, or to be able to pass another relation into the UDF without needing to COGROUP...... Code: # **** Load files for iteration **** register myudfs.jar; wordcounts = LOAD 'input/wordcounts.txt' USING PigStorage('\t') AS (PatentNumber:chararray, word:chararray, frequency:double); centerassignments = load 'input/centerassignments/part-*' USING PigStorage('\t') AS (PatentNumber: chararray, oldCenter: chararray, newCenter: chararray); kcenters = LOAD 'input/kcenters/part-*' USING PigStorage('\t') AS (CenterID:chararray, word:chararray, frequency:double); kcentersa1 = CROSS centerassignments, kcenters; kcentersa = FOREACH kcentersa1 GENERATE centerassignments::PatentNumber as PatentNumber, kcenters::CenterID as CenterID, kcenters::word as word, kcenters::frequency as frequency; #***** Assign to nearest k-mean ******* assignpre1 = COGROUP wordcounts by PatentNumber, kcentersa by PatentNumber; assignwork2 = FOREACH assignpre1 GENERATE group as PatentNumber, myudfs.kmeans(wordcounts, kcentersa) as CenterID; basically my issue is that for each patent I need to pass the sub relations (wordcounts, kcenters). In order to do this, I do a cross and then a COGROUP by PatentNumber in order to get the set PatentNumber, {wordcounts}, {kcenters}. If I could figure a way to pass a relation or open up the centers from within the UDF, then I could just GROUP wordcounts by PatentNumber and run myudfs.kmeans(wordcount) which is hopefully much faster without the CROSS/COGROUP. This is an expensive operation. Currently this takes about 20 minutes and appears to tack the CPU/RAM. I was thinking it might be more efficient without the CROSS. I'm not sure it will be faster, so I'd like to experiment. Anyway it looks like calling the Loading functions from within Pig needs a PigContext object which I don't get from an evalfunc. And to use the hadoop file system, I need some initial objects as well, which I don't see how to get. So my question is how can I open a file from the hadoop file system from within a PIG UDF? I also run the UDF via main for debugging. So I need to load from the normal filesystem when in debug mode. Another better idea would be if there was a way to pass a relation into a UDF without needing to CROSS/COGROUP. This would be ideal, particularly if the relation resides in memory.. ie being able to do myudfs.kmeans(wordcounts, kcenters) without needing the CROSS/COGROUP with kcenters... But the basic idea is to trade IO for RAM/CPU cycles. Anyway any help will be much appreciated, the PIG UDFs aren't super well documented beyond the most simple ones, even in the UDF manual.

    Read the article

  • Storing data to SequenceFile from Apache Pig

    - by asquithea
    Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader: REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar; DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader(); log = LOAD '/data/logs' USING SequenceFileLoader AS (...) Is there also a library out there that would allow writing to Hadoop sequence files from Pig?

    Read the article

  • Does throwing an exception in an EvalFunc pig UDF skip just that line, or stop completely?

    - by Daniel Huckstep
    I have a User Defined Function (UDF) written in Java to parse lines in a log file and return information back to pig, so it can do all the processing. It looks something like this: public abstract class Foo extends EvalFunc<Tuple> { public Foo() { super(); } public Tuple exec(Tuple input) throws IOException { try { // do stuff with input } catch (Exception e) { throw WrappedIOException.wrap("Error with line", e); } } } My question is: if it throws the IOException, will it stop completely, or will it return results for the rest of the lines that don't throw an exception? Example: I run this in pig REGISTER myjar.jar DEFINE Extractor com.namespace.Extractor(); logs = LOAD '$IN' USING TextLoader AS (line: chararray); events = FOREACH logs GENERATE FLATTEN(Extractor(line)); With this input: 1.5 7 "Valid Line" 1.3 gghyhtt Inv"alid line"" I throw an exceptioN!! 1.8 10 "Valid Line 2" Will it process the two lines and will 'logs' have 2 tuples, or will it just die in a fire?

    Read the article

  • ORDER BY job failed in the Pig script while running EmbeddedPig using Java

    - by C.c. Huang
    I have this following pig script, which works perfectly using grunt shell (stored the results to HDFS without any issues); however, the last job (ORDER BY) failed if I ran the same script using Java EmbeddedPig. If I replace the ORDER BY job by others, such as GROUP or FOREACH GENERATE, the whole script then succeeded in Java EmbeddedPig. So I think it's the ORDER BY which causes the issue. Anyone has any experience with this? Any help would be appreciated! The Pig script: REGISTER pig-udf-0.0.1-SNAPSHOT.jar; user_similarity = LOAD '/tmp/sample-sim-score-results-31/part-r-00000' USING PigStorage('\t') AS (user_id: chararray, sim_user_id: chararray, basic_sim_score: float, alt_sim_score: float); simplified_user_similarity = FOREACH user_similarity GENERATE $0 AS user_id, $1 AS sim_user_id, $2 AS sim_score; grouped_user_similarity = GROUP simplified_user_similarity BY user_id; ordered_user_similarity = FOREACH grouped_user_similarity { sorted = ORDER simplified_user_similarity BY sim_score DESC; top = LIMIT sorted 10; GENERATE group, top; }; top_influencers = FOREACH ordered_user_similarity GENERATE com.aol.grapevine.similarity.pig.udf.AssignPointsToTopInfluencer($1, 10); all_influence_scores = FOREACH top_influencers GENERATE FLATTEN($0); grouped_influence_scores = GROUP all_influence_scores BY bag_of_topSimUserTuples::user_id; influence_scores = FOREACH grouped_influence_scores GENERATE group AS user_id, SUM(all_influence_scores.bag_of_topSimUserTuples::points) AS influence_score; ordered_influence_scores = ORDER influence_scores BY influence_score DESC; STORE ordered_influence_scores INTO '/tmp/cc-test-results-1' USING PigStorage(); The error log from Pig: 12/04/05 10:00:56 INFO pigstats.ScriptState: Pig script settings are added to the job 12/04/05 10:00:56 INFO mapReduceLayer.JobControlCompiler: mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3 12/04/05 10:00:58 INFO mapReduceLayer.JobControlCompiler: Setting up single store job 12/04/05 10:00:58 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized 12/04/05 10:00:58 INFO mapReduceLayer.MapReduceLauncher: 1 map-reduce job(s) waiting for submission. 12/04/05 10:00:58 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 12/04/05 10:00:58 INFO input.FileInputFormat: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths to process : 1 12/04/05 10:00:58 INFO util.MapRedUtil: Total input paths (combined) to process : 1 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating tmp-1546565755 in /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134-work-6955502337234509704 with rwxr-xr-x 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Cached hdfs://localhost/tmp/temp1725960134/tmp-1546565755#pigsample_854728855_1333645258470 as /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:58 WARN mapred.LocalJobRunner: LocalJobRunner does not support symlinking into current working dir. 12/04/05 10:00:58 INFO mapred.TaskRunner: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/pigsample_854728855_1333645258470 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.jar.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.jar.crc 12/04/05 10:00:58 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.split.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.split.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.splitmetainfo.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.splitmetainfo.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/.job.xml.crc <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/.job.xml.crc 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.jar <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.jar 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.split <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.split 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.splitmetainfo <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.splitmetainfo 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Creating symlink: /var/lib/hadoop-0.20/cache/cchuang/mapred/staging/cchuang402164468/.staging/job_local_0004/job.xml <- /var/lib/hadoop-0.20/cache/cchuang/mapred/local/localRunner/job.xml 12/04/05 10:00:59 INFO mapred.Task: Using ResourceCalculatorPlugin : null 12/04/05 10:00:59 INFO mapred.MapTask: io.sort.mb = 100 12/04/05 10:00:59 INFO mapred.MapTask: data buffer = 79691776/99614720 12/04/05 10:00:59 INFO mapred.MapTask: record buffer = 262144/327680 12/04/05 10:00:59 WARN mapred.LocalJobRunner: job_local_0004 java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:139) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:560) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:639) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210) Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/cchuang/workspace/grapevine-rec/pigsample_854728855_1333645258470 at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248) at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153) at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:112) ... 6 more 12/04/05 10:00:59 INFO filecache.TrackerDistributedCacheManager: Deleted path /var/lib/hadoop-0.20/cache/cchuang/mapred/local/archive/4334795313006396107_361978491_57907159/localhost/tmp/temp1725960134/tmp-1546565755 12/04/05 10:00:59 INFO mapReduceLayer.MapReduceLauncher: HadoopJobId: job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: job job_local_0004 has failed! Stop running all dependent jobs 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: 100% complete 12/04/05 10:01:04 ERROR pigstats.PigStatsUtil: 1 map reduce job(s) failed! 12/04/05 10:01:04 INFO pigstats.PigStats: Script Statistics: HadoopVersion PigVersion UserId StartedAt FinishedAt Features 0.20.2-cdh3u3 0.8.1-cdh3u3 cchuang 2012-04-05 10:00:34 2012-04-05 10:01:04 GROUP_BY,ORDER_BY Some jobs have failed! Stop running all dependent jobs Job Stats (time in seconds): JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs job_local_0001 0 0 0 0 0 0 0 0 all_influence_scores,grouped_user_similarity,simplified_user_similarity,user_similarity GROUP_BY job_local_0002 0 0 0 0 0 0 0 0 grouped_influence_scores,influence_scores GROUP_BY,COMBINER job_local_0003 0 0 0 0 0 0 0 0 ordered_influence_scores SAMPLER Failed Jobs: JobId Alias Feature Message Outputs job_local_0004 ordered_influence_scores ORDER_BY Message: Job failed! Error - NA /tmp/cc-test-results-1, Input(s): Successfully read 0 records from: "/tmp/sample-sim-score-results-31/part-r-00000" Output(s): Failed to produce result in "/tmp/cc-test-results-1" Counters: Total records written : 0 Total bytes written : 0 Spillable Memory Manager spill count : 0 Total bags proactively spilled: 0 Total records proactively spilled: 0 Job DAG: job_local_0001 -> job_local_0002, job_local_0002 -> job_local_0003, job_local_0003 -> job_local_0004, job_local_0004 12/04/05 10:01:04 INFO mapReduceLayer.MapReduceLauncher: Some jobs have failed! Stop running all dependent jobs

    Read the article

  • Splitting input into substrings in PIG (Hadoop)

    - by Niels Basjes
    Assume I have the following input in Pig: some And I would like to convert that into: s so som some I've not (yet) found a way to iterate over a chararray in pig latin. I have found the TOKENIZE function but that splits on word boundries. So can "pig latin" do this or is this something that requires a Java class to do that?

    Read the article

  • PIG doesn't read my custom InputFormat

    - by Simon Guo
    I have a custom MyInputFormat that suppose to deal with record boundary problem for multi-lined inputs. But when I put the MyInputFormat into my UDF load function. As follow: public class EccUDFLogLoader extends LoadFunc { @Override public InputFormat getInputFormat() { System.out.println("I am in getInputFormat function"); return new MyInputFormat(); } } public class MyInputFormat extends TextInputFormat { public RecordReader createRecordReader(InputSplit inputSplit, JobConf jobConf) throws IOException { System.out.prinln("I am in createRecordReader"); //MyRecordReader suppose to handle record boundary return new MyRecordReader((FileSplit)inputSplit, jobConf); } } For each mapper, it print out I am in getInputFormat function but not I am in createRecordReader. I am wondering if anyone can provide a hint on how to hoop up my costome MyInputFormat to PIG's UDF loader? Much Thanks. I am using PIG on Amazon EMR.

    Read the article

  • I have an Errno 13 Permission denied with subprocess in python

    - by wDroter
    The line with the issue is ret=subprocess.call(shlex.split(cmd)) cmd = /usr/share/java -cp pig-hadoop-conf-Simpsons:lib/pig-0.8.1-cdh3u1-core.jar:lib/hadoop-core-0.20.2-cdh3u1.jar org.apache.pig.Main -param func=cat -param from =foo.txt -x mapreduce fsFunc.pig The error is. File "./run_pig.py", line 157, in process ret=subprocess.call(shlex.split(cmd)) File "/usr/lib/python2.7/subprocess.py", line 493, in call return Popen(*popenargs, **kwargs).wait() File "/usr/lib/python2.7/subprocess.py", line 679, in __init__ errread, errwrite) File "/usr/lib/python2.7/subprocess.py", line 1249, in _execute_child raise child_exception OSError: [Errno 13] Permission denied Let me know if any more info is needed. Any help is appreciated. Thanks.

    Read the article

  • Filtering null values with pig

    - by arianp
    It looks like a silly problem, but I can´t find a way to filter null values from my rows. This is the result when I dump the object geoinfo: DUMP geoinfo; ([longitude#70.95853,latitude#30.9773]) ([longitude#-9.37944507,latitude#38.91780853]) (null) (null) (null) ([longitude#-92.64416,latitude#16.73326]) (null) (null) ([longitude#-9.15199849,latitude#38.71179122]) ([longitude#-9.15210796,latitude#38.71195131]) here is the description DESCRIBE geoinfo; geoinfo: {geoLocation: bytearray} What I'm trying to do is to filter null values like this: geoinfo_no_nulls = FILTER geoinfo BY geoLocation is not null; but the result remains the same. nothing is filtered. I also tried something like this geoinfo_no_nulls = FILTER geoinfo BY geoLocation != 'null'; and I got an error org.apache.pig.backend.executionengine.ExecException: ERROR 1071: Cannot convert a map to a String What am I doing wrong? details, running on ubuntu, hadoop-1.0.3 with pig 0.9.3 pig -version Apache Pig version 0.9.3-SNAPSHOT (rexported) compiled Oct 24 2012, 19:04:03 java version "1.6.0_24" OpenJDK Runtime Environment (IcedTea6 1.11.4) (6b24-1.11.4-1ubuntu0.12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

    Read the article

1