Search Results

Search found 7 results on 1 pages for 'weatherman'.

Page 1/1 | 1 

  • Building Simple Workflows in Oozie

    - by dan.mcclary
    Introduction More often than not, data doesn't come packaged exactly as we'd like it for analysis. Transformation, match-merge operations, and a host of data munging tasks are usually needed before we can extract insights from our Big Data sources. Few people find data munging exciting, but it has to be done. Once we've suffered that boredom, we should take steps to automate the process. We want codify our work into repeatable units and create workflows which we can leverage over and over again without having to write new code. In this article, we'll look at how to use Oozie to create a workflow for the parallel machine learning task I described on Cloudera's site. Hive Actions: Prepping for Pig In my parallel machine learning article, I use data from the National Climatic Data Center to build weather models on a state-by-state basis. NCDC makes the data freely available as gzipped files of day-over-day observations stretching from the 1930s to today. In reading that post, one might get the impression that the data came in a handy, ready-to-model files with convenient delimiters. The truth of it is that I need to perform some parsing and projection on the dataset before it can be modeled. If I get more observations, I'll want to retrain and test those models, which will require more parsing and projection. This is a good opportunity to start building up a workflow with Oozie. I store the data from the NCDC in HDFS and create an external Hive table partitioned by year. This gives me flexibility of Hive's query language when I want it, but let's me put the dataset in a directory of my choosing in case I want to treat the same data with Pig or MapReduce code. CREATE EXTERNAL TABLE IF NOT EXISTS historic_weather(column 1, column2) PARTITIONED BY (yr string) STORED AS ... LOCATION '/user/oracle/weather/historic'; As new weather data comes in from NCDC, I'll need to add partitions to my table. That's an action I should put in the workflow. Similarly, the weather data requires parsing in order to be useful as a set of columns. Because of their long history, the weather data is broken up into fields of specific byte lengths: x bytes for the station ID, y bytes for the dew point, and so on. The delimiting is consistent from year to year, so writing SerDe or a parser for transformation is simple. Once that's done, I want to select columns on which to train, classify certain features, and place the training data in an HDFS directory for my Pig script to access. ALTER TABLE historic_weather ADD IF NOT EXISTS PARTITION (yr='2010') LOCATION '/user/oracle/weather/historic/yr=2011'; INSERT OVERWRITE DIRECTORY '/user/oracle/weather/cleaned_history' SELECT w.stn, w.wban, w.weather_year, w.weather_month, w.weather_day, w.temp, w.dewp, w.weather FROM ( FROM historic_weather SELECT TRANSFORM(...) USING '/path/to/hive/filters/ncdc_parser.py' as stn, wban, weather_year, weather_month, weather_day, temp, dewp, weather ) w; Since I'm going to prepare training directories with at least the same frequency that I add partitions, I should also add that to my workflow. Oozie is going to invoke these Hive actions using what's somewhat obviously referred to as a Hive action. Hive actions amount to Oozie running a script file containing our query language statements, so we can place them in a file called weather_train.hql. Starting Our Workflow Oozie offers two types of jobs: workflows and coordinator jobs. Workflows are straightforward: they define a set of actions to perform as a sequence or directed acyclic graph. Coordinator jobs can take all the same actions of Workflow jobs, but they can be automatically started either periodically or when new data arrives in a specified location. To keep things simple we'll make a workflow job; coordinator jobs simply require another XML file for scheduling. The bare minimum for workflow XML defines a name, a starting point, and an end point: <workflow-app name="WeatherMan" xmlns="uri:oozie:workflow:0.1"> <start to="ParseNCDCData"/> <end name="end"/> </workflow-app> To this we need to add an action, and within that we'll specify the hive parameters Also, keep in mind that actions require <ok> and <error> tags to direct the next action on success or failure. <action name="ParseNCDCData"> <hive xmlns="uri:oozie:hive-action:0.2"> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <configuration> <property> <name>oozie.hive.defaults</name> <value>/user/oracle/weather_ooze/hive-default.xml</value> </property> </configuration> <script>ncdc_parse.hql</script> </hive> <ok to="WeatherMan"/> <error to="end"/> </action> There are a couple of things to note here: I have to give the FQDN (or IP) and port of my JobTracker and NameNode. I have to include a hive-default.xml file. I have to include a script file. The hive-default.xml and script file must be stored in HDFS That last point is particularly important. Oozie doesn't make assumptions about where a given workflow is being run. You might submit workflows against different clusters, or have different hive-defaults.xml on different clusters (e.g. MySQL or Postgres-backed metastores). A quick way to ensure that all the assets end up in the right place in HDFS is just to make a working directory locally, build your workflow.xml in it, and copy the assets you'll need to it as you add actions to workflow.xml. At this point, our local directory should contain: workflow.xml hive-defaults.xml (make sure this file contains your metastore connection data) ncdc_parse.hql Adding Pig to the Ooze Adding our Pig script as an action is slightly simpler from an XML standpoint. All we do is add an action to workflow.xml as follows: <action name="WeatherMan"> <pig> <job-tracker>localhost:8021</job-tracker> <name-node>localhost:8020</name-node> <script>weather_train.pig</script> </pig> <ok to="end"/> <error to="end"/> </action> Once we've done this, we'll copy weather_train.pig to our working directory. However, there's a bit of a "gotcha" here. My pig script registers the Weka Jar and a chunk of jython. If those aren't also in HDFS, our action will fail from the outset -- but where do we put them? The Jython script goes into the working directory at the same level as the pig script, because pig attempts to load Jython files in the directory from which the script executes. However, that's not where our Weka jar goes. While Oozie doesn't assume much, it does make an assumption about the Pig classpath. Anything under working_directory/lib gets automatically added to the Pig classpath and no longer requires a REGISTER statement in the script. Anything that uses a REGISTER statement cannot be in the working_directory/lib directory. Instead, it needs to be in a different HDFS directory and attached to the pig action with an <archive> tag. Yes, that's as confusing as you think it is. You can get the exact rules for adding Jars to the distributed cache from Oozie's Pig Cookbook. Making the Workflow Work We've got a workflow defined and have collected all the components we'll need to run. But we can't run anything yet, because we still have to define some properties about the job and submit it to Oozie. We need to start with the job properties, as this is essentially the "request" we'll submit to the Oozie server. In the same working directory, we'll make a file called job.properties as follows: nameNode=hdfs://localhost:8020 jobTracker=localhost:8021 queueName=default weatherRoot=weather_ooze mapreduce.jobtracker.kerberos.principal=foo dfs.namenode.kerberos.principal=foo oozie.libpath=${nameNode}/user/oozie/share/lib oozie.wf.application.path=${nameNode}/user/${user.name}/${weatherRoot} outputDir=weather-ooze While some of the pieces of the properties file are familiar (e.g., JobTracker address), others take a bit of explaining. The first is weatherRoot: this is essentially an environment variable for the script (as are jobTracker and queueName). We're simply using them to simplify the directives for the Oozie job. The oozie.libpath pieces is extremely important. This is a directory in HDFS which holds Oozie's shared libraries: a collection of Jars necessary for invoking Hive, Pig, and other actions. It's a good idea to make sure this has been installed and copied up to HDFS. The last two lines are straightforward: run the application defined by workflow.xml at the application path listed and write the output to the output directory. We're finally ready to submit our job! After all that work we only need to do a few more things: Validate our workflow.xml Copy our working directory to HDFS Submit our job to the Oozie server Run our workflow Let's do them in order. First validate the workflow: oozie validate workflow.xml Next, copy the working directory up to HDFS: hadoop fs -put working_dir /user/oracle/working_dir Now we submit the job to the Oozie server. We need to ensure that we've got the correct URL for the Oozie server, and we need to specify our job.properties file as an argument. oozie job -oozie http://url.to.oozie.server:port_number/ -config /path/to/working_dir/job.properties -submit We've submitted the job, but we don't see any activity on the JobTracker? All I got was this funny bit of output: 14-20120525161321-oozie-oracle This is because submitting a job to Oozie creates an entry for the job and places it in PREP status. What we got back, in essence, is a ticket for our workflow to ride the Oozie train. We're responsible for redeeming our ticket and running the job. oozie -oozie http://url.to.oozie.server:port_number/ -start 14-20120525161321-oozie-oracle Of course, if we really want to run the job from the outset, we can change the "-submit" argument above to "-run." This will prep and run the workflow immediately. Takeaway So, there you have it: the somewhat laborious process of building an Oozie workflow. It's a bit tedious the first time out, but it does present a pair of real benefits to those of us who spend a great deal of time data munging. First, when new data arrives that requires the same processing, we already have the workflow defined and ready to run. Second, as we build up a set of useful action definitions over time, creating new workflows becomes quicker and quicker.

    Read the article

  • Is there an application that can do a blue screen effect with a webcam?

    - by Axxmasterr
    Background: This is not the blue screen of death I am speaking of but the process called "Blue Screening" that takes and removes a particular colored background from an image so that it can be superimposed on some other video/still picture. If you have ever seen the weatherman stand in front of the map, then you have seen someone doing a blue screen technique. I would like to be able to capture video from my webcam, then send that video to a blue screen program which removes the white (or other color) from the background and then inserts a background of my own choosing. (think of the dead guy in freejack who was calling from all of the different places on earth) Then once the image is superimposed, I would like to pipe it into Skype for video conferencing. Anyone have a good way to do this?

    Read the article

  • Podcast Show Notes: The Red Room Interview &ndash; Part 2

    - by Bob Rhubart
    Room bloggers Sean Boiling, Richard Ward, and Mervin Chaing bring their in-the-trenches perspective to the conversation once again in this week’s edition of the OTN ArchBeat Podcast. Listen. (Missed last week? No problemo: Listen to Part 1) In this segment the conversation turns to SOA governance and balancing the need for reuse against the need for speed.  It’s no mystery that many people react to the term “SOA Governance” in much the same way as they would to the sound of Darth Vader’s respirator. But Mervin explains how a simple change in terminology can go a long way toward lowering blood pressure. Those interested in connecting with Sean, Richard, or Mervin can do so via the links listed below: Sean Boiling - Sales Consulting Manager for Oracle Fusion Middleware LinkedIn | Twitter | Blog Richard Ward - SOA Channel Development Manager at Oracle LinkedIn | Blog Mervin Chiang - Consulting Principal at Leonardo Consulting LinkedIn | Twitter | Blog And you’ll find the complete list of the Red Room SOA Best Practice Posts in last week’s show notes. The third and final segment of the Red Room series runs next week.  I have enough material from the original interview for a fourth program,  but it’ll have to wait. Also, as mentioned last week, the podcast name change is now complete, from Arch2Arch, to ArchBeat. As WPBH-TV9 weatherman Phil Connors says, “Anything different is good.”   Technorati Tags: archbeat,podcast. arch2arch,soa,soa governance,oracle,otn Flickr Tags: archbeat,podcast. arch2arch,soa,soa governance,oracle,otn

    Read the article

  • Monitoring Baseline

    - by Grant Fritchey
    Knowing what's happening on your servers is important, that's monitoring. Knowing what happened on your server is establishing a baseline. You need to do both. I really enjoyed this blog post by Ted Krueger (blog|twitter). It's not enough to know what happened in the last hour or yesterday, you need to compare today to last week, especially if you released software this weekend. You need to compare today to 30 days ago in order to begin to establish future projections. How your data has changed over 30 days is a great indicator how it's going to change for the next 30. No, it's not perfect, but predicting the future is not exactly a science, just ask your local weatherman. Red Gate's SQL Monitor can show you the last week, the last 30 days, the last year, or all data you've collected (if you choose to keep a year's worth of data or more, please have PLENTY of storage standing by). You have a lot of choice and control here over how much data you store. Here's the configuration window showing how you can set this up: This is for version 2.3 of SQL Monitor, so if you're running an older version, you might want to update. The key point is, a baseline simply represents a moment in time in your server. The ability to compare now to then is what you're looking for in order to really have a useful baseline as Ted lays out so well in his post.

    Read the article

  • Question on a tutorial

    - by hansa
    Hello, i´m trying to get following tutorial to run and understand: http://www.ibm.com/developerworks/web/library/wa-cometjava/index.html In the example code which can be downloaded at the bottom of the page is everything in one class with two inner classes. How can i make the the thread of "MessageSender" (Listing 3) visible to "The Weatherman" (Listing 4) so i can use it in the run method without using inner classes? Thank you hansa Reformulation of Question: How to make the send-method of inner class MessageSender make accessible in ClassThatDoSomething. Example-Code: public class Example extends HttpServlet implements CometProcessor { private MessageSender messageSender = null; @Override public void init() throws ServletException { // starts thread MessageSender } public event(CometEvent) { // Object of ClassThatDoSomething gets created started } private class ClassThatDoSomething { public void start() { Runnable runnable = new Runnable() { public void run(){ messageSender.send(message); } Thread thread = new Thread(runnable); thread.start(); } } private class MessageSender implements Runnable { public void send(String message) { //... } public void run() { //...} } }

    Read the article

  • To My 24 Year Old Self, Wherever You Are&hellip;

    - by D'Arcy Lussier
    A decade is a milestone in one’s life, regardless of when it occurs. 2011 might seem like a weird year to mark a decade, but 2001 was a defining year for me. It marked my emergence into the technology industry, an unexpected loss of innocence, and triggered an ongoing struggle with faith and belief. Once you go through a valley, climbing the mountain and looking back over where you travelled, you can take in the entirety of the journey. Over the last 10 years I kept journals, and in this new year I took some time to review them. For those today that are me a decade ago, I share with you what I’ve gleamed from my experiences. Take it for what it’s worth, and safe travels on your own journeys through life. Life is a Performance-Based Sport Have confidence, believe you’re capable, but realize that life is a performance-based sport. Everything you get in life is based on whether you can show that you deserve it. Performance is also your best defense against personal attacks. Just make sure you know what standards you’re expected to hit and if people want to poke holes at you let them do the work of trying to find them. Sometimes performance won’t matter though. Good things will happen to bad people, and bad things to good people. What’s important is that you do the right things and ensure the good and bad even out in your own life. How you finish is just as important as how you start. Start strong, end strong. Respect is Your Most Prized Reward Respect is more important than status or ego. The formula is simple: Performing Well + Building Trust + Showing Dedication = Respect Focus on perfecting your craft and helping your team and respect will come. Life is a Team Sport Whatever aspect of your life, you can’t do it alone. You need to rely on the people around you and ensure you’re a positive aspect of their lives; even those that may be difficult or unpleasant. Avoid criticism and instead find ways to help colleagues and superiors better whatever environment you’re in (work, home, etc.). Don’t just highlight gaps and issues, but also come to the table with solutions. At the same time though, stand up for yourself and hold others accountable for the commitments they make to the team. A healthy team needs accountability. Give feedback early and often, and make it verbal. Issues should be dealt with immediately, and positives should be celebrated as they happen. Life is a Contact Sport Difficult moments will happen. Don’t run from them or shield yourself from experiencing them. Embrace them. They will further mold you and reveal who you will become. Find Your Tribe and Embrace Your Community We all need a tribe: a group of people that we gravitate to for support, guidance, wisdom, and friendship. Discover your tribe and immerse yourself in them. Don’t look for a non-existent tribe just to fill the need of belonging though that will leave you empty and bitter when they don’t meet your unrealistic expectations. Try to associate with people more experienced and more knowledgeable than you. You’ll always learn, and you’ll always remember you have much to learn. Put yourself out there, get involved with the community. Opportunities will present themselves. When we open ourselves up to be vulnerable, we also give others the chance to do the same. This helps us all to grow and help each other, it’s very important. And listen to your wife. (Easter *is* a romantic holiday btw, regardless of what you may think.) Don’t Believe Your Own Press Clippings (and by that I mean the ones you write) Until you have a track record of performance to refer to, any notions of grandeur are just that: notions. You lose your rookie status through trials and tribulations, not by the number of stamps in your passport. Be realistic about your own “experience and leadership” and be honest when you aren’t ready for something. And always remember: nobody really cares about you as much as you think they do. Don’t Let Assholes Get You Down The world isn’t evil, but there is evil in the world. Know the difference and don’t paint all people with the same brush. Do be wary of those that use personal beliefs to describe their business (i.e. “We’re a [religion] company”). What matters is the culture of the organization, and that will tell you the moral compass and what is truly valued. Don’t make someone or something a priority that only makes you an option. Life is unfair and enemies/opponents will succeed when you fail. Don’t waste your energy getting upset at this; the only one that will lose out is you. As mentioned earlier, nobody really cares about you as much as you think they do. Misc Ecclesiastes is bullshit. Everything is certainly *not* meaningless. Software development is about delivery, not the process. Having a great process means nothing if you don’t produce anything. Watch “The Weatherman” (“It’s not easy, but easy doesn’t enter into grownup life.”). Read Tony Dungee’s autobiography, even if you don’t like football, and even if you aren’t a Christian. Say no, don’t feel like you have to commit right away when someone asks you to.

    Read the article

1