Search Results

Search found 31 results on 2 pages for 'weka'.

Page 1/2 | 1 2  | Next Page >

  • Export a SQL database into a CSV file and use it with WEKA

    - by Simon
    How can I export a query result from a .sql database into a .csv file? I tried with SELECT * FROM players INTO OUTFILE 'players.csv' FIELDS TERMINATED BY ',' LINES TERMINATED BY ';';` and my .csv file is something like: p1,1,2,3 p2,1,4,5 But they are not in saparated columns, all are in 1 column. I tried to create a .csv file by myself just to try WEKA, something like: p1 1 2 3 p2 1 4 5 But WEKA recognizes p1 1 2 3 as a single attribute. So: how can I export correctly a table from a sql db to a csv file? And how can I use it with WEKA?

    Read the article

  • Image classification using openCV and weka

    - by simk
    Hi i want to do image classification, so i am planning to use openCV for the preprocessing of image and weka to check which ML algorithm gives best result, so the problem i am facing is converting the image data in to weka ARFF file format, when i apply some image transformation to image and write the image data it become so large and not sure how to define @ATTRIBUTE for that. If you have done similar thing before please suggest how can i solve this and it particularly doesn't have to be openCV i can use other tools also, please suggest.

    Read the article

  • How to interpret weka classification?

    - by gargi2010
    How can we interpret the classification result in weka using naive bayes? How is mean, std deviation, weight sum and precision calculated? How is kappa statistic, mean absolute error, root mean squared error etc calculated? What is the interpretation of the confusion matrix?

    Read the article

  • Visualize Classifier Error Weka

    - by user1780592
    Hye there i have a have datasets where this data i have test it on weka with J48 classifier It give me an output = 87.2611% Total of instances = 157 Correctly Instances = 137 Incorrectly instance = 20 Then i have do a visualize classifier error on my data. However my result have been decrease to: New result = 85.4015% Correctly Instances = 117 Incorrectly instances = 20 Total of instances = 137 Is there any reason for that? Should my result become much better after i do the visualize classifier error?

    Read the article

  • Filtering Attributes with Weka

    - by hrzafer
    Hi eveyone! I have a simple question about filtering attributes in WEKA. Let's say I have 500 attributes 30 classes and 100 samples for each class which equals 3000 rows and 500 columns. This causes time and memory problems a you can guess. How do I filter attributes that occur only once or twice (or n times) in 3000 rows. And is it a good idea? Thank you

    Read the article

  • How to persist non-trivial fields in Play Framework

    - by AlexR
    I am trying to persist complex objects using Ebeans in Play Framework (2.03). In particular, I've created a class that contains a field of type weka.classifier.Classifier (Weka is a popular machine learning library - see http://weka.sourceforge.net/doc/weka/classifiers/Classifier.html). Classifier implements Serializeable so I hoped that I can get away with something like @Entity @Table(name = "classifiers") public class ClassifierData extends Model { @Id public Long id; public Classifier classifier; } However, the Evolutions script suggests the following database structure: create table classifiers ( id bigint auto_increment not null, constraint pk_classifiers primary key (id)) ) In other words, it ignores the field of type Classifier. (The database is MySQL if it makes any difference) What should I do to store complex serializeable objects using Ebean/Evolutions/PlayFramework?

    Read the article

  • Sentiment analysis with NLTK python for sentences using sample data or webservice?

    - by Ke
    I am embarking upon a NLP project for sentiment analysis. I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task. Here is my task: I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice) I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??) Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron" Then I would like to check for positive/negative sentiment in each sentence and count them accordingly NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm. Here are the troubles I am having: All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?) I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary? I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link? Any help is much appreciated and will save me much hair! Cheers Ke

    Read the article

  • Text mining with PHP

    - by garyc40
    Hi, I'm doing a project for a college class I'm taking. I'm using PHP to build a simple web app that classify tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is Naive Bayes classifier or decision tree. However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP? I'm planning to use WEKA as the back end of the web app (by calling Weka in command line from within PHP), but it doesn't seem that efficient. Do you have any idea what I should use for this project? Or should I just switch to Python? Thanks

    Read the article

  • Using MOA to classify new examples?

    - by Sam Zetoloth
    I'm trying to use the java machine learning library MOA to train on a training data stream, then predict classes for a test data stream. The first part works fine, using (for example) java -cp .:moa.jar:weka.jar -javaagent:sizeofag.jar moa.DoTask "LearnModel -l MajorityClass -s (ArffFileStream -f atrain.arff -c -1) -O amodel.moa" But then I cannot figure out how to use the trained model (amodel.moa) on another stream (atest.arff) to predict the classes. Has anyone done this before?

    Read the article

  • what is the best way to generate fake data for classification problem ?

    - by Berkay
    i'm working on a project and i have a subset of user's key-stroke time data.This means that the user makes n attempts and i will use these recorded attempt time data in various kinds of classification algorithms for future user attempts to verify that the login process is done by the user or some another person. (Simply i can say that this is biometrics) I have 3 different times of the user login attempt process, ofcourse this is subset of the infinite data. until now it is an easy classification problem, i decided to use WEKA but as far as i understand i have to create some fake data to feed the classification algorithm. can i use some optimization algorithms ? or is there any way to create this fake data to get min false positives ? Thanks

    Read the article

  • please help me to interpret the naive bayes result in weka..

    - by resmi
    Anybody please help me to interpret the following result generated in weka for classification using naive bayes.....Please explain clearly what is this Normal Distribution , Mean , StandardDev , WeightSum and Precision.Please help me.Am new in weka. ** Naive Bayes Classifier Class Normal: Prior probability = 0.5 1374195_at: Normal Distribution. Mean = 218.06 StandardDev = 6.0572 WeightSum = 3 Precision = 36.34333334 1373315_at: Normal Distribution. Mean = 1142.58 StandardDev = 21.1589 WeightSum = 3 Precision = 126.95333339999999

    Read the article

  • What machine learning algorithms can be used in this scenario?

    - by ExceptionHandler
    My data consists of objects as follows. Obj1 - Color - shape - size - price - ranking So I want to be able to predict what combination of color/shape/size/price is a good combination to get high ranking. Or even a combination could work like for eg: in order to get good ranking, the alg predicts best performance for this color and this shape. Something like that. What are the advisable algorithms for such a prediction? Also may be if you can briefly explain how I can approach towards the model building I would really appreciate it. Say for eg: my data looks like Blue pentagon small $50.00 #5 Red Squre large $30.00 #3 So what is a useful prediction model that I should look at? What algorithm should I try to predict like say highest weightage is for price followed by color and then size. What if I wanted to predict in combinations like a Red small shape is less likely to higher rank compared to pink small shape . (In essence trying to combine more than one nominal values column to make the prediction)

    Read the article

  • Managing text-maps in a 2D array on to be painted on HTML5 Canvas

    - by weka
    So, I'm making a HTML5 RPG just for fun. The map is a <canvas> (512px width, 352px height | 16 tiles across, 11 tiles top to bottom). I want to know if there's a more efficient way to paint the <canvas>. Here's how I have it right now. How tiles are loaded and painted on map The map is being painted by tiles (32x32) using the Image() piece. The image files are loaded through a simple for loop and put into an array called tiles[] to be PAINTED on using drawImage(). First, we load the tiles... and here's how it's being done: // SET UP THE & DRAW THE MAP TILES tiles = []; var loadedImagesCount = 0; for (x = 0; x <= NUM_OF_TILES; x++) { var imageObj = new Image(); // new instance for each image imageObj.src = "js/tiles/t" + x + ".png"; imageObj.onload = function () { console.log("Added tile ... " + loadedImagesCount); loadedImagesCount++; if (loadedImagesCount == NUM_OF_TILES) { // Onces all tiles are loaded ... // We paint the map for (y = 0; y <= 15; y++) { for (x = 0; x <= 10; x++) { theX = x * 32; theY = y * 32; context.drawImage(tiles[5], theY, theX, 32, 32); } } } }; tiles.push(imageObj); } Naturally, when a player starts a game it loads the map they last left off. But for here, it an all-grass map. Right now, the maps use 2D arrays. Here's an example map. [[4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 4, 1, 1, 1, 1, 1], [1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 1, 1, 1, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 1, 13, 13, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 1, 13, 13, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 1, 13, 13, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 1, 1, 1, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 13, 13, 13, 1, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 13, 13, 11, 11, 11, 13, 13, 13, 13, 13, 13, 13, 1], [13, 13, 13, 1, 1, 1, 1, 1, 1, 1, 13, 13, 13, 13, 13, 1], [1, 1, 1, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 1, 1, 1]]; I get different maps using a simple if structure. Once the 2d array above is return, the corresponding number in each array will be painted according to Image() stored inside tile[]. Then drawImage() will occur and paint according to the x and y and times it by 32 to paint on the correct x-y coordinate. How multiple map switching occurs With my game, maps have five things to keep track of: currentID, leftID, rightID, upID, and bottomID. currentID: The current ID of the map you are on. leftID: What ID of currentID to load when you exit on the left of current map. rightID: What ID of currentID to load when you exit on the right of current map. downID: What ID of currentID to load when you exit on the bottom of current map. upID: What ID of currentID to load when you exit on the top of current map. Something to note: If either leftID, rightID, upID, or bottomID are NOT specific, that means they are a 0. That means they cannot leave that side of the map. It is merely an invisible blockade. So, once a person exits a side of the map, depending on where they exited... for example if they exited on the bottom, bottomID will the number of the map to load and thus be painted on the map. Here's a representational .GIF to help you better visualize: As you can see, sooner or later, with many maps I will be dealing with many IDs. And that can possibly get a little confusing and hectic. The obvious pros is that it load 176 tiles at a time, refresh a small 512x352 canvas, and handles one map at time. The con is that the MAP ids, when dealing with many maps, may get confusing at times. My question Is this an efficient way to store maps (given the usage of tiles), or is there a better way to handle maps? I was thinking along the lines of a giant map. The map-size is big and it's all one 2D array. The viewport, however, is still 512x352 pixels. Here's another .gif I made (for this question) to help visualize: Sorry if you cannot understand my English. Please ask anything you have trouble understanding. Hopefully, I made it clear. Thanks.

    Read the article

  • Android - Adding external library to project

    - by mmontalbo
    Hi, I am having a lot of trouble adding the WEKA library to a project I am working on. I have followed several tutorials that explain how to do this including the Android Developers guide: http://developer.android.com/guide/appendix/faq/commontasks.html#addexternallibrary and several of the postings on SO. I have created a folder in my project with the weka.jar file, created a new library (adding the weka.jar file to the library) and included this library in my build path. I have also added the library under the "Order and Export" tab in the project properties. I have also tried importing the jar file so that the actual contents of the jar are extracted into a directory in my project. The end result of all of this is that my project is able to build correctly and without error, but when it comes time to run my code on the emulator I get the following exception: 04-10 22:52:21.051: ERROR/dalvikvm(582): Could not find class 'weka.classifiers.trees.J48', referenced from method edu.usc.student.composure.classifier.GaitClassifierImpl. with J48 being the class I reference in my code. Does anyone have any additional suggestions that I may have overlooked? Thanks!

    Read the article

  • Vista 64-bit, DISK BOOT FAILURE

    - by weka
    So I have this Acer Aspire AX3200-U3600A with Windows Vista (64-bit). Every night I turn it off and turn it back on in the morning. Around three weeks ago, I did a fresh factory reimage. Good as new. Then around two days ago, when I turned it on, I noticed it was running extremly slow. As in, it would often freeze up while I had multiple applications open when it usually never froze up. So I decided to restart my computer. Big mistake. My computer froze right after I clicked shut-down. I waited a while. Nothing. Waited some minutes. Nope. I decided to shut it down by pressing the power button. Here is where the problems begin. When I turned it back on, I saw the Windows logo and loading bar and then it loaded to black. I turned it off again forcefully by power button and then once more... then I got: AMD Data Change... Update New Data to DMI! then later the screen clears and I get: AHCI Option ROM BIOS Revision: 01.05.92 Date: 02-19-2008 Copyright (c) 2006-2008 Phoenix Technologies, LTD Port 01: Reset Port Error!! Port 02: then the screen clears again but this time, this loads from the bottom: Nvidia Boot Agent 249.0542 (copyright stuff... blah blah) PXE-E61: Media test failure, check cable. PXE-M0F: Exiting Nvidia Boot Agent DISK BOOT FAILURE, INSERT SYSTEM DISK AND PRESS ENTER. So I try to go into Safe Mode. Well, first of all it doesn't load as fast. After it loads disk.sys from windows/drivers, it will wait a while (2-3 mins) THEN load. However it loads the Acer eRecovery Management Tool. I have three options: Reset computer to factory default, Restore computer from user's backup, or Exit. However, the top two options are gray and disabled where as the Exit is in blue and definitely clickable. So obviously safe mode is not there... A strong thing to note: In the beginning when all of this started, I did a Boot Windows Normal from pressing f8 and I got to my desktop! It logged me in. I could see the icons on my files. However my desktop was extremely slow as in when I clicked on the Start menu, it would wait a while, then load up the menu with JUST the gradient, no text or icons... so as you can see... it saw my HDD? Also, before anyone says, I have NO USB plugged in. My mouse and keyboard are not USB inputs, I assure you. And this came without a recovery CD AND when I went in BIOS, to change the BOOT ORDER, I did NOT see a CD-ROM option. And when I tried pressing ALT+F10 to get into Acer eRecovery Management, the top two options were disabled as well. But sometimes on start-up, I get: Windows has encountered a problem communicating with a device connected to your computer. This error can be caused by unplugging a removable storage device such as an external USB drive while the device is in use, or by faulty hardware such as a hard drive or CD-ROM drive that is failing. Make sure any removeable storage is properly connected and then restart your computer. If you continue to receive this error message, contact the hardware manufacturer. Status: 0xc00000e9 Info: An unexpected I/O error has occured. Then I tried Last Known Good Configuration Settings, that gives me a BSOD. What should I do/

    Read the article

  • Updating GeForce 9200 driver on Vista (64-bit) yields installer failed

    - by weka
    OK, so I have a Windows Vista... more specifically this: http://www.newegg.com/Product/Product.aspx?Item=N82E16883103182 a Acer Aspire AX3200-U3600A Desktop. I have DirectX 11 and I am trying to update my drivers... During the update of the latest one: GeForce 306.23 Driver my screen flashes a bit and goes black. This is standard when updating GRAPHIC DRIVERS. Then... my screen shifts 50 pixels or so to the left... OFFSCREEN. Not even Auto Image Adjust will fix the screen back... Then after that happens, I get this: Help, please? I just want to update my graphic drivers....

    Read the article

  • CSS / HTML - Image will not show up

    - by weka
    Ugh, ok. I've been up all night working this thing and now an image won't show. It's so darn annoying. Trying to get this .png image to show up on a simple PHP webpage. I just wanna go to sleep X_X CSS: <style> .achievement { position:relative; width:500px; background:#B5B5B5; float:left; padding:10px; margin-bottom:10px; } .icon { float:left; width:32px; height:32px; background: url("images/trophy.php") no-repeat center; padding:05px; border:4px solid #4D4D4D; } .ptsgained { position:absolute; top:0; right:0; background:#79E310; color:#fff; font-family:Tahoma; font-weight:bold; font-size:12px; padding:5px; } .achievement h1 { color:#454545; font-size:12pt; font-family:Georgia; font-weight:none; margin:0;padding:0; } .achievement p { margin:0;padding:0; font-size:12px; font-family:Tahoma; color:#1C1C1C; } .text { margin-left:10px; float:left; } </style> HTML: <div class="achievement"> <span class="ptsgained">+10</span> <div class="icon"></div> <div class="text"><h1>All Around Submitter</h1> <p>Submit and have approved content in all 6 areas.</p> </div> </div> What am I doing wrong, guys? :\

    Read the article

  • Which web algorithms book to get ?

    - by fjxx
    I am currently undecided between which of the following web algorithms book to buy: 1) Algorithms of the Intelligent web by Marmanis 2) Collective Intelligence by Alag Both feature code in Java; Marmanis' book delves deeper into the core algorithms while Alag's book discusses more APIs including WEKA. I have already read Programming Collective Intelligence by Segaran and enjoyed it. Any comments on these books or any other recommendations are welcome.

    Read the article

  • Which web algorithms book to get? [closed]

    - by fjxx
    I am currently undecided between which of the following web algorithms book to buy: 1) Algorithms of the Intelligent web by Marmanis 2) Collective Intelligence by Alag Both feature code in Java; Marmanis' book delves deeper into the core algorithms while Alag's book discusses more APIs including WEKA. I have already read Programming Collective Intelligence by Segaran and enjoyed it. Any comments on these books or any other recommendations are welcome.

    Read the article

  • /etc/init.d Character Encoding Issue

    - by Ryan Rosario
    I have a script in /etc/init.d on an EC2 image that, on machine startup, pulls in source code via SVN, builds it, and then runs it using Ant. The source code is Java. Within this code is a call to the Weka library which writes a file to disk. On most Ubuntu AMIs, and my home machines' versions of Ubuntu, there is no issue. The problem is that with certain versions/AMIs of Ubuntu, Unicode characters in the file are replaced with question marks ('?'). If I run the job manually on the trouble instance, Unicode is output to file correctly, but not when run from /etc/init.d. What might be causing this problem and how can I fix it so that Unicode characters appear correctly in files written from /etc/init.d processes?

    Read the article

  • Implementing Custom Software or Using Ready Softwares at Industy at Machine Learning Area? [closed]

    - by kamaci
    I am studying on Machine Learning and its implementations. I have different choices in front of me for my future. Testing algorithms by some tools as like Weka and finding best approach and after that implementing it(maybe with using some libraries at Machine Learning) On the other hand I see that there are softwares as like SPSS, SAS etc. Instead of improving myself like that should I learn that kind of programs. Do I reinventing the wheel or if I improve myself and implement custom solutions to customers then can I be a part of industry?

    Read the article

  • Building Simple Workflows in Oozie

    - by dan.mcclary
    Introduction More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site. Hive Actions: Prepping for Pig In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in a handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie. I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me flexibility of Hive's query language when I want it, but let's me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code. CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column 1, column2) PARTITIONED BY (yr string) STORED AS ... LOCATION '/user/oracle/weather/historic'; As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of their long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access. ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010') LOCATION '/user/oracle/weather/historic/yr=2011'; INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history' SELECT w.stn, w.wban, w.weather_year, w.weather_month, w.weather_day, w.temp, w.dewp, w.weather FROM ( FROM historic_weather SELECT TRANSFORM(...) USING '/path/to/hive/filters/ncdc_parser.py' as stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather ) w; Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive actions using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called weather_train.hql. Starting Our Workflow Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions of Workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling. The bare minimum for workflow XML defines a name, a starting point, and an end point: <workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1"> <start to="ParseNCDCData"/> <end name="end"/> </workflow-app> To this we need to add an action, and within that we'll specify the hive parameters Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure. <action name="ParseNCDCData"> <hive xmlns="uri:oozie:hive-action:0.2"> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <configuration> <property> <name>oozie.hive.defaults</name> <value>/user/oracle/weather_ooze/hive-default.xml</value> </property> </configuration> <script>ncdc_parse.hql</script> </hive> <ok to="WeatherMan"/> <error to="end"/> </action> There are a couple of things to note here: I have to give the FQDN (or IP) and port of my JobTracker and NameNode. I have to include a hive-default.xml file. I have to include a script file. The hive-default.xml and script file must be stored in HDFS That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have different hive-defaults.xml on different clusters (e.g. MySQL or Postgres-backed metastores). A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml. At this point, our local directory should contain: workflow.xml hive-defaults.xml (make sure this file contains your metastore connection data) ncdc_parse.hql Adding Pig to the Ooze Adding our Pig script as an action is slightly simpler from an XML standpoint. All we do is add an action to workflow.xml as follows: <action name="WeatherMan"> <pig> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <script>weather_train.pig</script> </pig> <ok to="end"/> <error to="end"/> </action> Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes. While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the pig action with an <archive> tag. Yes, that's as confusing as you think it is. You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook. Making the Workflow Work We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server. In the same working directory, we'll make a file called job.properties as follows: nameNode=hdfs://localhost:8020 jobTracker=localhost:8021 queueName=default weatherRoot=weather_ooze mapreduce.jobtracker.kerberos.principal=foo dfs.namenode.kerberos.principal=foo oozie.libpath=${nameNode}/user/oozie/share/lib oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot} outputDir=weather-ooze While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The oozie.libpath pieces is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory. We're finally ready to submit our job! After all that work we only need to do a few more things: Validate our workflow.xml Copy our working directory to HDFS Submit our job to the Oozie server Run our workflow Let's do them in order. First validate the workflow: oozie validate workflow.xml Next, copy the working directory up to HDFS: hadoop fs -put working_dir /user/oracle/working_dir Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our job.properties file as an argument. oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit We've submitted the job, but we don't see any activity on the JobTracker? All I got was this funny bit of output: 14-20120525161321-oozie-oracle This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job. oozie -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run." This will prep and run the workflow immediately. Takeaway So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

    Read the article

  • How do I cluster strings based on a relation between two strings?

    - by Tom Wijsman
    If you don't know WEKA, you can try a theoretical answer. I don't need literal code/examples... I have a huge data set of strings in which I want to cluster the strings to find the most related ones, these could as well be seen as duplicate. I already have a set of couples of string for which I know that they are duplicate to each other, so, now I want to do some data mining on those two sets. The result I'm looking for is a system that would return me the possible most relevant couples of strings for which we don't know yet that they are duplicates, I believe that I need clustering for this, which type? Note that I want to base myself on word occurrence comparison, not on interpretation or meaning. Here is an example of two string of which we know they are duplicate (in our vision on them): The weather is really cold and it is raining. It is raining and the weather is really cold. Now, the following strings also exist (most to least relevant, ignoring stop words): Is the weather really that cold today? Rainy days are awful. I see the sunshine outside. The software would return the following two strings as most relevant, which aren't known to be duplicate: The weather is really cold and it is raining. Is the weather really that cold today? Then, I would mark that as duplicate or not duplicate and it would present me with another couple. How do I go to implement this in the most efficient way that I can apply to a large data set?

    Read the article

1 2  | Next Page >