hadoop - Page 14 - Developer IT

ODI 11g – Faster Files

- by David Allan

Deep in the trenches of ODI development I raised my head above the parapet to read a few odds and ends and then think why don’t they know this? Such as this article here – in the past customers (see forum) were told to use a staging route which has a big overhead for large files. This KM is an example of the great extensibility capabilities of ODI, its quite simple, just a new KM that; improves the out of the box experience – just build the mapping and the appropriate KM is used improves out of the box performance for file to file data movement. This improvement for out of the box handling for File to File data integration cases (from the 11.1.1.5.2 companion CD and on) dramatically speeds up the file integration handling. In the past I had seem some consultants write perl versions of the file to file integration case, now Oracle ships this KM to fill the gap. You can find the documentation for the IKM here. The KM uses pure java to perform the integration, using java.io classes to read and write the file in a pipe – it uses java threading in order to super-charge the file processing, and can process several source files at once when the datastore's resource name contains a wildcard. This is a big step for regular file processing on the way to super-charging big data files using Hadoop – the KM works with the lightweight agent and regular filesystems. So in my design below transforming a bunch of files, by default the IKM File to File (Java) knowledge module was assigned. I pointed the KM at my JDK (since the KM generates and compiles java), and I also increased the thread count to 2, to take advantage of my 2 processors. For my illustration I transformed (can also filter if desired) and moved about 1.3Gb with 2 threads in 140 seconds (with a single thread it took 220 seconds) - by no means was this on any super computer by the way. The great thing here is that it worked well out of the box from the design to the execution without any funky configuration, plus, and a big plus it was much faster than before, So if you are doing any file to file transformations, check it out!

Read the article

What makes a person contribute to opensource? [duplicate]

- by Jibin

This question already has an answer here: Why develop free, open source programs? [closed] 14 answers I know this is controversial, but.. There are many great projects like Apache Webserver or Hadoop provided by the OpenSource Community. I often feel that the people that actually benefit (financially), from these projects are developers like me, sitting in India, working for MNCs, who has never contributed anything to any opensource project so far, but earn handsomely due to my basic googling skills & the community provided documentation. Is it fair ? I mean no other industry in the world face such dilemma.Those who work get paid. I mean I almost am starting to feel guilty of taking such advantage of some thing that I contribute nothing to. I had to do projects every semester in college (we could choose projects) & I used to enjoy coding then. I want to contribute. But contributing to opensource is not a task [unlike college or office work]. And life is so busy .. Are all these opensource contributors really jobless ?[just kidding..] Could someone please share some personal experiences on how you guys started contributing or any advice on why I should contribute or what attitude in life makes you keep aside time or is it that you just crazly love writing code or is it that you just love to see your name in the contributors list or do you have a local coding group you hang out with ? Do you feel I am destined to do this ? This is my part of contributing back to the world ? Whats that basic mentality that make you guys want to contribute, while I just want to finish my work and go home. What makes you guys tick ?

Read the article

ODI 11g - Faster Files

- by David Allan

Deep in the trenches of ODI development I raised my head above the parapet to read a few odds and ends and then think why don’t they know this? Such as this article here – in the past customers (see forum) were told to use a staging route which has a big overhead for large files. This KM is an example of the great extensibility capabilities of ODI, its quite simple, just a new KM that; improves the out of the box experience – just build the mapping and the appropriate KM is used improves out of the box performance for file to file data movement. This improvement for out of the box handling for File to File data integration cases (from the 11.1.1.5.2 companion CD and on) dramatically speeds up the file integration handling. In the past I had seem some consultants write perl versions of the file to file integration case, now Oracle ships this KM to fill the gap. You can find the documentation for the IKM here. The KM uses pure java to perform the integration, using java.io classes to read and write the file in a pipe – it uses java threading in order to super-charge the file processing, and can process several source files at once when the datastore's resource name contains a wildcard. This is a big step for regular file processing on the way to super-charging big data files using Hadoop – the KM works with the lightweight agent and regular filesystems. So in my design below transforming a bunch of files, by default the IKM File to File (Java) knowledge module was assigned. I pointed the KM at my JDK (since the KM generates and compiles java), and I also increased the thread count to 2, to take advantage of my 2 processors. For my illustration I transformed (can also filter if desired) and moved about 1.3Gb with 2 threads in 140 seconds (with a single thread it took 220 seconds) - by no means was this on any super computer by the way. The great thing here is that it worked well out of the box from the design to the execution without any funky configuration, plus, and a big plus it was much faster than before, So if you are doing any file to file transformations, check it out!

Read the article

Venez nous voir au Forum Oracle Big Data le 5 avril !

- by Kinoa

Le Big Data vient de plus en plus souvent au devant de la scène et vous souhaitez en apprendre davantage ? Générés à partir des réseaux sociaux, de capteurs numériques et autres équipements mobiles, les Big Data - autrement dits, d'énormes volumes de données - constituent une mine d'informations précieuses sur vos activités et les comportements de vos clients. Votre challenge aujourd’hui consiste à gérer l’acquisition, l’organisation et la compréhension de ces volumes de données non structurées, et à les intégrer dans votre système d’information. Vous avez des questions ? Ca vous parait complexe ? Alors le Forum Oracle Bid Data organisé par Oracle et Intel est fait pour vous ! Nous aborderons plusieurs points : Accélération du déploiement de Big Data par l'approche intégrée du hardware et du software Mise à disposition de tous les outils nécessaires au processus complet, de l'acquisition des données à la restitution Intégration de Big Data dans votre système d'information pour fournir aux utilisateurs la quintessence de l'information Nous vous avons concocté un programme des plus alléchant pour cette journée du 5 avril : 9h00 Accueil et remise des badges 9h30 Big Data : The Industry View. Are you ready ?Johan Hendrickx, Core Technology Director, Oracle EMEA Keynote : Big Data – Are you ready ? George Lumpkin, Vice President of DW Product Management, Oracle Corporation Acquisition des données dans votre Big Dataavec Hadoop et Oracle NoSQL Pause Organisez et structurez l'information au sein de votre Big Data avec Big Data Connectors et Oracle Data Integrator Tirez parti des analyses des données de votre Big Dataavec Oracle Endeca et Oracle Business Intelligence 13h00 Cocktail déjeunatoire Le nombre de places est limité, pensez à vous inscrire dès maintenant. Lieu : Maison de la Chimie28 B, rue Saint Dominique 75007 Paris

Read the article

How can Agile methodologies be adapted to High Volume processing system development?

- by luckyluke

I am developing high volume processing systems. Like mathematical models that calculate various parameters based on millions of records, calculated derived fields over milions of records, process huge files having transactions etc... I am well aware of unit testing methodologies and if my code is in C# I have no problem in unit testing it. Problem is I often have code in T-SQL, C# code that is a SQL stored assembly, and SSIS workflow with a good amount of logic (and outcomes etc) or some SAS process. What is the approach YOu use when developing such systems. I usually develop several tests as Stored procedures in a designed schema(TEST) and then automatically run them overnight and check out the results. But this is only for T-SQL. And Continous integration IS hard. But the problem is with testing SSIS packages. How do You test it? What is Your preferred approach for stubbing data into tables (especially if You need a lot data initialization). I have some approach derived over the years but maybe I am just not reading enough articles. So Banking, Telecom, Risk developers out there. How do You test your mission critical apps that process milions of records at end day, month end etc? What frameworks do You use? How do You validate that Your ssis package is Correct (as You develop it)/ How do You achieve continous integration in such an environment (Personally I never got there)? I hope this is not to open-ended question. How do You test Your map-reduce jobs for example (i do not use hadoop but this is quite similar). luke Hope that this is not too open ended

Read the article

CodePlex Daily Summary for Wednesday, October 24, 2012

CodePlex Daily Summary for Wednesday, October 24, 2012Popular ReleasesfastJSON: v2.0.9: - added support for root level DataSet and DataTable deserialize (you have to do ToObject<DataSet>(...) ) - added dataset testsInfo Gempa BMKG: Info Gempa BMKG v 1.0.0.3: Release perdana aplikasi pembaca informasi gempa dari data feeder BMKG.Microsoft Ajax Minifier: Microsoft Ajax Minifier 4.72: Fix for Issue #18819 - bad optimization of return/assign operator.DNN Module Creator: 01.01.00: Updated templates for DNN7 ( ie. DAL2, Web Service API ). Numerous bug fixes and enhancements.WPF Application Framework (WAF): WPF Application Framework (WAF) 2.5.0.390: Version 2.5.0.390 (Release Candidate): This release contains the source code of the WPF Application Framework (WAF) and the sample applications. Requirements .NET Framework 4.0 (The package contains a solution file for Visual Studio 2010) The unit test projects require Visual Studio 2010 Professional Changelog Legend: [B] Breaking change; [O] Marked member as obsolete WAF: Fix recent file list remove issue. WAF: Minor code improvements. BookLibrary: Fix Blend design time support o...ltxml.js - LINQ to XML for JavaScript: 1.0 - Beta 1: First release!ZXMAK2: Version 2.6.6.0: + fix refresh debugger after open RZX file + add NoFlic video filterEPiServer CMS ElencySolutions.MultipleProperty: ElencySolutions.MultipleProperty v1.6.3: The ElencySolutions.MulitpleProperty property controls have been developed by Lee Crowe a technical developer at Fortune Cookie (London). Installation notes The property copy page can be locked down by adding the following location element, the path of this will be different depending on whether you use the embedded or non embedded resource version. When installing the nuget package these will be added automatically, examples below: Embedded: <location path="util/ElencySolutionsMultipleP...Fiskalizacija za developere: FiskalizacijaDev 1.1: Ovo je prva nadogradnja ovog projekta nakon inicijalnog predstavljanja - dodali smo nekoliko feature-a, bilo zato što smo sami primijetili da bi ih bilo dobro dodati, bilo na osnovu vaših sugestija - hvala svima koji su se ukljucili :) Ovo su stvari riješene u v1.1.: 1. Bilo bi dobro da se XML dokument koji se šalje u CIS može snimiti u datoteku (http://fiskalizacija.codeplex.com/workitem/612) 2. Podrška za COM DLL (VB6) (http://fiskalizacija.codeplex.com/workitem/613) 3. Podrška za DOS (unu...MCEBuddy 2.x: MCEBuddy 2.3.4: Changelog for 2.3.4 (32bit and 64bit) 1. Fixed a bug introduced in 2.3.3 that would cause HD recordings and recordings with multiple audio channels to fail. 2. Updated <encoder-unsupported> option to compare with all Audio tracks for videos with multiple audio tracks. 3. Fixed a bug with SRT and EDL files, when input and output directory are the same the files are not preserved.Liberty: v3.4.0.0 Release 20th October 2012: Change Log -Added -Halo 4 support (invincibility, ammo editing) -Reach A warning dialog now shows up when you first attempt to swap a weapon -Fixed -A few minor bugsFoxOS: Stable Fox: Stable Fox Versione 0.0.0.1 Richiede .NET Framework 3.5 o superioriDoctor Reg: Doctor Reg V1.0: Doctor Reg V1.0 PT-PTkv: kv 1.0: if it were any more stable it would be a barn.LINQ for C++: cpplinq-20121020: LINQ for C++ is an attempt to bring LINQ-like list manipulation to C++11. This release includes just the source code. What's new in this release: join range operators: Inner Joins two ranges using a key selector reverse range operator distinct range operator union_with range operator intersect_with range operator except range operator concat range operator sequence_equal range aggregator to_lookup range aggregator This is a sample on how to use cpplinq: #include "cpplinq.h...helferlein_Form: 02.03.05: Requirements.Net 4.0 DotNetNuke 05.06.07 or higher, maybe it works with lower versions, but I developed it on this one and tested it on DotNetNuke 06.02.00 as well helferlein_BabelFish version 01.01.03 - please upgrade this first! Issues fixed Fixed issue with all users from all portals are listed as Host users in the sender options (E-Mail Options - Sender - ALL Users Listed) Registered postback-button for Excel-Export on Form submission edit control Changed behaviour Due to some mis...ClosedXML - The easy way to OpenXML: ClosedXML 0.68.1: ClosedXML now resolves formulas! Yes it finally happened. If you call cell.Value and it has a formula the library will try to evaluate the formula and give you the result. For example: var wb = new XLWorkbook(); var ws = wb.AddWorksheet("Sheet1"); ws.Cell("A1").SetValue(1).CellBelow().SetValue(1); ws.Cell("B1").SetValue(1).CellBelow().SetValue(1); ws.Cell("C1").FormulaA1 = "\"The total value is: \" & SUM(A1:B2)"; var...Orchard Project: Orchard 1.6 RC: RELEASE NOTES This is the Release Candidate version of Orchard 1.6. You should use this version to prepare your current developments to the upcoming final release, and report problems. Please read our release notes for Orchard 1.6 RC: http://docs.orchardproject.net/Documentation/Orchard-1-6-Release-Notes Please do not post questions as reviews. Questions should be posted in the Discussions tab, where they will usually get promptly responded to. If you post a question as a review, you wil...Rawr: Rawr 5.0.1: This is the Downloadable WPF version of Rawr!For web-based version see http://elitistjerks.com/rawr.php You can find the version notes at: http://rawr.codeplex.com/wikipage?title=VersionNotes Rawr Addon (NOT UPDATED YET FOR MOP)We now have a Rawr Official Addon for in-game exporting and importing of character data hosted on Curse. The Addon does not perform calculations like Rawr, it simply shows your exported Rawr data in wow tooltips and lets you export your character to Rawr (including ba...Yahoo! UI Library: YUI Compressor for .Net: Version 2.1.1.0 - Sartha (BugFix): - Revered back the embedding of the 2x assemblies.New ProjectsASP.NET DatePicker (Persian/Gregorian): This is just another DatePicker for ASP.NET that supports both Persian (Jalali/Shamsi/Solar) and Gregorian Calendar.AspectMap: AspectMap is an Aspect Oriented framework built on top of the StructureMap IoC framework.Building a Pinterest like Image Crawler: More about it here : http://www.alexpeta.ro/article/building-a-pinterest-like-image-crawlerCRM 2011 client scripting TypeScript definition file: xrm crm 2011 typescript javascript definition fileDarkSky Commerce: DarkSky Commerce is an Orchard module that provides a generic and extensible set of core commerce features and servicesDarkSky Learning: An Orchard module providing E-learning authoring tools and engines.Dusk Consulting: Dusk Consulting provides automation scripts to help IT Professionals and Developers alike.Edi: Edi is a an IDE based editor that is currently focused on text based editing (other editors may follow). This project is based on AvalonDock 2.0 and AvalonEdit.Entity Framework Extensions: This project includes extensions for the entity framework, includes unit of work pattern with event support, repository pattern with event support, and so on.EPiBoost (EPiServer 7 MVC Toolkit): Coming SoonExtended Guid: Easilly create version 5 guids. Converts an arbitrary string to a guid given a specific work area, or namespace.FileToText: pour la création de liste de fichierFilmBook, Film and Photo Archival Management: FilmBook is a simple image gallery and organization application. It's goal is to help identify the location of negatives in an archival page.FridgeBoard: Este proyecto está siendo realizado por alumnos de informática. No nos hacemos responsables de la mala calidad del mismo.GIV_P1: testyIQTool: Provide a set of tools to generate reports capturing the state of a server. This is written in powershell to allow easy modification / update of the scriptsLeaving Application System: have a good time.Leaving Management System: optimize work flow of staff work attendance managementltxml.js - LINQ to XML for JavaScript: ltxml.js is an XML processing library for JavaScript that is inspired by and largely similar to the LINQ to XML library for .NET.Microsoft .NET SDK For Hadoop: A collection of API's, tools and samples for using .NET with Apache Hadoop.MyLib: This is my personal library of various tidbits that I have been using regularly. Includes a generic repository with EF and MongoDB implementations.Parlez MVC: A multi-lingual implementation for the ASP.NET MVC 4 framework to make it easier to build websites that are viewable in different languages.Perfect Sport Nutrition: Proyecto realizado por la Empresa de Desarrollo Software SoftwareHC E.I.R.LScale Model Database: This project is a work of databasesSliding Block Puzzle: This is a simple (not very good) game for Windows Phone 7.1 (Mango). The point of it is to show off how to do different things in WP7.1.SmallBasic Extension Manager: SXM, or the SmallBasic Extension Manager, is a program intended for use with the Small Basic programming language from Microsoft.SQL Azure Federation Backup to Azure Blob Storage using Azure Worker: Azure Worker project, that will backup any number of Azure SQL Databases to any selected Azure Blob containers. Config in XML. Automatic 7z compression.TeF: A framework for running tests with a plenty of features and ease of use.Testge: testTwittOracle: This project is a prototype project intended to be submitted in a competition. The details will be updated later as things are finalized.uMembership - Alternative Umbraco Member Api: uMembership is an alternative and faster way of dealing with Members in Umbraco when you have 1000's or even 100Variedades Silverlight 5: Varidades Silverlight 5Website d?t tour du lich: Ð? án môn ASP.NET Xây d?ng website d?t tour du l?ch d?a trên mô hình MVCWPF About Box: WPF About Box is simple and free about box for WPF using MVVM pattern. Several properties can be set. Some properties are read from assembly, automatically.WTUnion: WTUnionXML Cleaner: Console app for batch cleaning XML files containing ilegal caracters. It can be useful for cleaning XML files before importing them to SOLR or something similarXmlToObject: XmlToObject is a .Net library for object serialization with use of XPath expressions applied to class fields and properties via custom attributes.XNAGalcon: This Project will create Galcon Game for XNA version

Read the article

Java Magazine: Growing on Open

- by Tori Wieldt

The November/December issue of Java Magazine is now out, with several great Java stories, including: Growing on Open AgroSense provides an all-Java open source platform for sustainable farming and precision agriculture. An Engine for Big Data Hadoop uses Java for large-scale analytics. JavaFX in SpringStephen Chin shows you why to use the Spring framework on the client. JCP Executive Q&A: Mike MilinkovichThe Eclipse Foundation’s executive director assesses the state of Java and the JCP. Exploring Lambda Expressions for the Java Language and the JVMBen Evans, Martijn Verburg, and Trisha Gee help you get ready for lambda expressions in Java SE 8. Get Started with Java SE for Embedded Devices on Raspberry PiWe walk you through getting Linux and Java SE for Embedded Devices to run on the Raspberry Pi in less than an hour. Java NationGet the news from JavaOne 2012 in San Francisco. Java Magazine is a bi-monthly online publication. It includes technical articles on the Java language and platform; Java innovations and innovators; JUG and JCP news; Java events; links to online Java communities; and videos and multimedia demos. Subscriptions are free. Do you have feedback about Java Magazine? Send a tweet to @oraclejavamag.

Read the article

Customer Highlight: NTT DOCOMO

- by jeckels

NTT DOCOMO is the largest mobile operator in Japan, and serves over 13 million smartphone customers. Due to their growing data processing and scalability needs, they turned to Oracle's Cloud Application Foundation products for an integral soultion. At Oracle OpenWorld 2012, we first showcased NTT DOCOMO as a customer who was utilizing Oracle Coherence to process mobile data at a rate of 700,000 events per second (and then using Hadoop for distributed processing of big data). Overall, this Led to a 50% cost reduction due to the ultra-high velocity traffic processing of their customers' events. Recently, on October 7th, 2013, Oracle and NTT DOCOMO were proud to again announce a partnership around another key component of Oracle CAF: WebLogic Server. WebLogic was recently deployed as the application platform of choice to run DOCOMO's mission-critical data system ALADIN, which connects nationwide shops and information centers. ALADIN, which also utilizes Oracle Database and Oracle Tuxedo, is based on Java Platform, Enterprise Edition (Java EE), which has allowed the company to operate smoothly while minimizing additional development and modification associated with the migration of application server products. We look forward to continuing to partner with NTT DOCOMO, and are proud that Oracle Cloud Application Foundation products are providing the mission-critical solutions - at scale - that DOCOMO requires. Want to learn more about how CAF products are working in the real world? Join us for a FREE Virtual Developer Day on November 5th from 9am-1pm Pacific Time!REGISTER NOW

Read the article

???: Oracle NoSQL Database??

- by zhangqm

?????????Oracle?????Oracle NoSQL Database,?????NoSQL Database ??????????Oracle NoSQL Database??2???,Community Edition ?Enterprise Edition?????????NoSQL Database 11g R2 (11gR2.1.2.123). ?????????????????: Oracle NoSQL Database OTN portal (includes download facility) Oracle NoSQL Database OTN documentation Oracle NoSQL Database license information ??Oracle NoSQL Database ???????????,????,?????(key-value)???TB????,????????????(???)????,??????????????????????????,????,??????????? ?Oracle NoSQL Database?,???????????key-value???,??key???????:??????????key?(?????string),????????(??????????bytes)??????key-value ??primary key?hash?,????????????????????????????????,???????,????????????????????????? ???????????????????Java API??????Oracle NoSQL Database driver ????????,?????key-value????????????????Oracle NoSQL Database ?????Create, Read, Update and Delete (CRUD)??,???????durability??????????????????????:?????web console???command line??? Oracle Berkeley DB Java Edition Oracle NoSQL Database?? Oracle Berkeley DB Java Edition ????????,??????????????????????????????,?????????????????? ????????????Oracle NoSQL Database Driver?????key-value????????????Oracle NoSQL Database Driver??:?????????hash??????????????????,?????????????????????? ????????Oracle NoSQL Database Oracle NoSQL Database????????????????????????????????????????????????????????????????: ???? ???? ???? ?????? ???? ?????? ????,??,?? ???? ???? ??? (sub-millisecond) ???????? ????? ??????? ???????? ?????Oracle?????? ???? (Oracle Big Data Appliance) ???? ?????????????????????????????????,???“??”???????????,Oracle NoSQL Database???????????Oracle NoSQL Database?????(Cloud)??,????????(TB?PB??)???Oracle NoSQL Database ??????ETL??(??MapReduce, Hadoop)??,??acquire-organize-analyze ?????????? ???????Oracle NoSQL Database?????: • Large schema-less data repositories• Web?? (click-through capture)• ????• ????• ?????????? • Sensor/statistics/network capture (?????, ?????)• ?????????• ???? (MMS, SMS, routing)• ???? Oracle NoSQL Database (Community Edition ??)??????????? Oracle Big Data Appliance???

Read the article

How to use Cassandra's Map Reduce with or w/o Pig?

- by UltimateBrent

Can someone explain how MapReduce works with Cassandra .6? I've read through the word count example, but I don't quite follow what's happening on the Cassandra end vs. the "client" end. https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ For instance, let's say I'm using Python and Pycassa, how would I load in a new map reduce function, and then call it? Does my map reduce function have to be java that's installed on the cassandra server? If so, how do I call it from Pycassa? There's also mention of Pig making this all easier, but I'm a complete Hadoop noob, so that didn't really help. Your answer can use Thrift or whatever, I just mentioned Pycassa to denote the client side. I'm just trying to understand the difference between what runs in the Cassandra cluster vs. the actual server making the requests.

Read the article

Running Awk command on a cluster

- by alex

How do you execute a Unix shell command (awk script, a pipe etc) on a cluster in parallel (step 1) and collect the results back to a central node (step 2) Hadoop seems to be a huge overkill with its 600k LOC and its performance is terrible (takes minutes just to initialize the job) i don't need shared memory, or - something like MPI/openMP as i dont need to synchronize or share anything, don't need a distributed VM or anything as complex Google's SawZall seems to work only with Google proprietary MapReduce API some distributed shell packages i found failed to compile, but there must be a simple way to run a data-centric batch job on a cluster, something as close as possible to native OS, may be using unix RPC calls i liked rsync simplicity but it seem to update remote notes sequentially, and you cant use it for executing scripts as afar as i know switching to Plan 9 or some other network oriented OS looks like another overkill i'm looking for a simple, distributed way to run awk scripts or similar - as close as possible to data with a minimal initialization overhead, in a nothing-shared, nothing-synchronized fashion Thanks Alex

Read the article

Clojure / HBase: How to Import HBaseTestingUtility in v0.94.6.1

- by David Williams

In Clojure, if I want to start a test cluster using the hbase testing utility, I have to annotate my dependencies with: [org.apache.hbase/hbase "0.92.2" :classifier "tests" :scope "test"] First of all, I have no idea what this means. According to leiningens sample project.clj ;; Dependencies are listed as [group-id/name version]; in addition ;; to keywords supported by Pomegranate, you can use :native-prefix ;; to specify a prefix. This prefix is used to extract natives in ;; jars that don't adhere to the default "<os>/<arch>/" layout that ;; Leiningen expects. Question 1: What does that mean? Question 2: If I upgrade the version: [org.apache.hbase/hbase "0.94.6.1" :classifier "tests" :scope "test"] Then I receive a ClassNotFoundException Exception in thread "main" java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration Whats going on here and how do I fix it?

Read the article

Essential skills of a Data Scientist

- by harshsinghal

I would like to know more about the relevant skills in the arsenal of a Data Scientist, and with new technologies coming in every day, how one picks and chooses the essentials. A few ideas germane to this discussion: Knowing SQL and the use of a DB such as MySQL, PostgreSQL was great till the advent of NoSql and non-relational databases. MongoDB, CouchDB etc. are becoming popular to work with web-scale data. Knowing a stats tool like R is enough for analysis, but to create applications one may need to add Java, Python, and such others to the list. Data now comes in the form of text, urls, multi-media to name a few, and there are different paradigms associated with their manipulation. What about cluster computing, parallel computing, the cloud, Amazon EC2, Hadoop ? OLS Regression now has Artificial Neural Networks, Random Forests and other relatively exotic machine learning/data mining algos. for company Thoughts?

Read the article

ClassCastException happens when I use maven with tomcat plugin

- by zjffdu

Hi all, I try to use maven with tomcat plugin to develop a simple web application. But When I invoke the servlet, ClassCastException happens, this is the error message: java.lang.ClassCastException: "com.snda.dw.moniter.LogQueryServlet cannot be to javax.servlet.Servlet" But I already make com.snda.dw.moniter.LogQueryServlet extends HttpServlet, it should can be cast to avax.servlet.Servlet. The following is my pom.xml http://maven.apache.org/maven-v4_0_0.xsd" 4.0.0 com.snda dw.moniter war 0.0.1-SNAPSHOT dw.moniter Maven Webapp http://maven.apache.org junit junit 3.8.1 test <dependency> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> <version>2.4</version> <scope>provided</scope> </dependency> <dependency> <groupId>javax.servlet.jsp</groupId> <artifactId>jsp-api</artifactId> <version>2.0</version> <scope>provided</scope> </dependency> <dependency> <groupId>com.google.guava</groupId> <artifactId>guava</artifactId> <version>r07</version> <type>jar</type> <scope>compile</scope> </dependency> <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-api</artifactId> <version>1.6.1</version> <type>jar</type> <scope>compile</scope> </dependency> <dependency> <groupId>org.slf4j</groupId> <artifactId>slf4j-simple</artifactId> <version>1.6.1</version> <type>jar</type> <scope>compile</scope> </dependency> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>0.20.2</version> <type>jar</type> <scope>compile</scope> </dependency> <dependency> <groupId>com.snda</groupId> <artifactId>dw.common</artifactId> <version>1.0-SNAPSHOT</version> <type>jar</type> <scope>compile</scope> </dependency> <dependency> <groupId>net.sf.flexjson</groupId> <artifactId>flexjson</artifactId> <version>2.1</version> <type>jar</type> <scope>compile</scope> </dependency> </dependencies> <build> <finalName>dw.moniter</finalName> <pluginManagement> <plugins> <plugin> <artifactId>maven-compiler-plugin</artifactId> <configuration> <source>1.5</source> <target>1.5</target> </configuration> </plugin> <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>tomcat-maven-plugin</artifactId> <version>1.1</version> </plugin> <plugin> <groupId>org.mortbay.jetty</groupId> <artifactId>jetty-maven-plugin</artifactId> <version>8.0.0.M2</version> </plugin> </plugins> </pluginManagement> </build>

Read the article

Map Reduce Frameworks/Infrastructure

- by Johannes Rudolph

Map Reduce is a pattern that seems to get a lot of traction lately and I start to see it manifest in one of my projects that is focused on an event processing pipeline (iPhone Accelerometer and GPS data). I needed to built a lot of infrastructure for this project, in fact it overweighs the logic code interacting with it by 2x. Some of the components I built where EventProcessors (with in- and output plus buffering, timing etc.), multiplexers and aggregators. This leads me to my question what the "common" required infrastrucutre for map reduce is. Since I am working with .Net a lot I can see map reduce infrastructure built into the Framework and language constructs. Functional languages support this paradigm per se. It seems every language can be used with map reduce, some have better support than others, others again are built around that concept (e.g. Go). And there are Frameworks like Apache Hadoop to support map reduce.

Read the article

Can't find netbooted for Kerrighed pxe boot with Ubuntu Lucid Server

- by Pengin

I'm following installtion guides for pxe booting and kerrighed. I can't find the package nfsbooted for Ubuntu 10.04. Where did it go? Context: At work I have access to 8 mini-ITX PCs and am trying to build a cluster. My plans include trying Condor, GridGain, Hadoop, and recently Kerrighed has caught my eye. (I reaslise these are all for different kinds of things, I'm just evaluating). Ideally, I'd like to have all the nodes network boot from a single server, since that seems so much easier to manage, plus I can 'borrow' additional PCs for a while without touching their HD. I've been getting on great with Ubuntu Lucid Server (10.04), trying to follow the only guides I can find to get pxe booting (and ultimately kerrighed) to work. This guide is for Ubuntu 8.04 and this one is for Debian. They both refer to a package I can't seem to find, nfsbooted. Has this package been replaced? Am I doing something daft?

Read the article

Solution for distributing MANY simple network tasks?

- by EmpireJones

I would like to create some sort of a distributed setup for running a ton of small/simple REST web queries in a production environment. For each 5-10 related queries which are executed from a node, I will generate a very small amount of derived data, which will need to be stored in a standard relational database (such as PostgreSQL). What platforms are built for this type of problem set? The nature, data sizes, and quantities seem to contradict the mindset of Hadoop. There are also more grid based architectures such as Condor and Sun Grid Engine, which I have seen mentioned. I'm not sure if these platforms have any recovery from errors though (checking if a job succeeds). What I would really like is a FIFO type queue that I could add jobs to, with the end result of my database getting updated. Any suggestions on the best tool for the job?

Read the article

Map Reduce job on Amazon: argument for custom jar

- by zero51

Hi all, This is one of my first try with Map Reduce on AWS in its Management Console. Hi have uploaded on AWS S3 my runnable jar developed on Hadoop 0.18, and it works on my local machine. As described on documentation, I have passed the S3 paths for input and output as argument of the jar: all right, but the problem is the third argument that is another path (as string) to a file that I need to load while the job is in execution. That file resides on S3 bucket too, but it seems that my jar doesn't recognize the path and I got a FileNotFound Exception while it tries to load it. That is strange because this is a path exactly like the other two... Anyone have any idea? Thank you Luca

Read the article

Reading a simple Avro file from HDFS

- by John Galt... who

I am trying to do a simple read of an Avro file stored in HDFS. I found out how to read it when it is on the local file system.... FileReader reader = DataFileReader.openReader(new File(filename), new GenericDatumReader()); for (GenericRecord datum : fileReader) { String value = datum.get(1).toString(); System.out.println("value = " value); } reader.close(); My file is in HDFS, however. I cannot give the openReader a Path or an FSDataInputStream. How can I simply read an Avro file in HDFS? EDIT: I got this to work by creating a custom class (SeekableHadoopInput) that implements SeekableInput. I "stole" this from "Ganglion" on github. Still, seems like there would be a Hadoop/Avro integration path for this. Thanks

Read the article

Notify via email if something wrong got happened in the shell script

- by Nevzz03

fileexist=0 for i in $( ls /data/read-only/clv/daily/Finished-HADOOP_EXPORT_&processDate#.done); do mv /data/read-only/clv/daily/Finished-HADOOP_EXPORT_&processDate#.done /data/read-only/clv/daily/archieve-wip/ fileexist=1 done --some other script below Above is the shell script I have in which in the for loop, I am moving some files. I want to notify myself via email if something wrong got happened in the moving process, as I am running this script on the Hadoop Cluster, so it might be possible that cluster went down while this was running etc etc. So how can I have better error handling mechanism in this shell script? Any thoughts?

Read the article

The Business case for Big Data

- by jasonw

The Business Case for Big Data Part 1 What's the Big Deal Okay, so a new buzz word is emerging. It's gone beyond just a buzzword now, and I think it is going to change the landscape of retail, financial services, healthcare....everything. Let me spend a moment to talk about what i'm going to talk about. Massive amounts of data are being collected every second, more than ever imaginable, and the size of this data is more than can be practically managed by today’s current strategies and technologies. There is a revolution at hand centering on this groundswell of data and it will change how we execute our businesses through greater efficiencies, new revenue discovery and even enable innovation. It is the revolution of Big Data. This is more than just a new buzzword is being tossed around technology circles.This blog series for Big Data will explain this new wave of technology and provide a roadmap for businesses to take advantage of this growing trend. Cases for Big Data There is a growing list of use cases for big data. We naturally think of Marketing as the low hanging fruit. Many projects look to analyze twitter feeds to find new ways to do marketing. I think of a great example from a TED speech that I recently saw on data visualization from Facebook from my masters studies at University of Virginia. We can see when the most likely time for breaks-ups occurs by looking at status changes and updates on users Walls. This is the intersection of Big Data, Analytics and traditional structured data. Ted Video Marketers can use this to sell more stuff. I really like the following piece on looking at twitter feeds to measure mood. The following company was bought by a hedge fund. They could predict how the S&P was going to do within three days at an 85% accuracy. Link to the article Here we see a convergence of predictive analytics and Big Data. So, we'll look at a lot of these business cases and start talking about what this means for the business. It's more than just finding ways to use Hadoop + NoSql and we'll talk about that too. How do I start in Big Data? That's what is coming next post.

Read the article

Cost justification for buying a 32GB superfast Alienware M18x with a price tag of around £5K ($10K)

- by tonyrogerson

When considering buying a laptop that’s going to cost me around £5,000 I really need to justify the purchase from a business perspective; my Lenovo W700 has served me very well for the last 2 years, it’s an extremely good machine and as solid as a rock (and as heavy), alas though it is limited to the 8GB. As SQL Server 2012 approaches and with my interest in working in the Business Intelligence space over the next year or two it is clear I need a powerful machine that I can run a full infrastructure though virtualised. My requirements For High Availability / Disaster Recovery research and demonstration Machine for a domain controller Four machines in a shared disk cluster (SQL Server Clustering active – active etc.) Five machines in a file share cluster (SQL Server Availability Groups) For Business Intelligence research and demonstration Not entirely sure how many machine I want to run here, but it would be to cover the entire BI stack in an enterprise setting, sharepoint, sql server etc. For Big Data Research I have a fondness for the NoSQL approach to scalability and dealing with large volumes so I need a number of machines to research VoltDB, Hadoop etc. As you can see the requirements for a SQL Server consultant to service their clients well is considerable; will 8GB suffice, alas no, it will no longer do. I’m a very strong believer that in order to do your job well you must expense it, short cuts only cost you time, waiting 5 minutes instead of an hour for something to run not only saves me time but my clients time, I can do things quicker and more importantly I can demonstrate concepts. My W700 with the 8GB of RAM and SSD’s cost me around £3.5K two years ago, to be honest I’ve not got the full use I wanted out of it but the machine has had the power when I’ve needed it, it’s served me and my clients well. Alienware now do a model (the M18x) with 32GB of RAM; yes 32GB in a laptop! Dual drives so I can whack a couple of really good SSD’s in there, a quad core with hyper threading i7 and a decent speed. I can reduce the cost of the memory by getting it from Crucial, so instead of £1.5K for 32GB it will be around £900, I can also cost save on the SSD as well. The beauty about the M18x is that it is USB3.0, SATA 3 and also really importantly has eSATA, running VM’s will never be easier, I can have a removeable SSD with my VM’s on it and can plug it into my home machine or laptop – an ideal world! The initial outlay of £5K is peanuts compared to the benefits I’ll give my clients, I will be able to present real enterprise concepts, I’ll also be able to give training on those real enterprise concepts and with real, albeit virtualised machines.

Read the article

Big Data – Buzz Words: What is NoSQL – Day 5 of 21

- by Pinal Dave

In yesterday’s blog post we explored the basic architecture of Big Data . In this article we will take a quick look at one of the four most important buzz words which goes around Big Data – NoSQL. What is NoSQL? NoSQL stands for Not Relational SQL or Not Only SQL. Lots of people think that NoSQL means there is No SQL, which is not true – they both sound same but the meaning is totally different. NoSQL does use SQL but it uses more than SQL to achieve its goal. As per Wikipedia’s NoSQL Database Definition – “A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases.“ Why use NoSQL? A traditional relation database usually deals with predictable structured data. Whereas as the world has moved forward with unstructured data we often see the limitations of the traditional relational database in dealing with them. For example, nowadays we have data in format of SMS, wave files, photos and video format. It is a bit difficult to manage them by using a traditional relational database. I often see people using BLOB filed to store such a data. BLOB can store the data but when we have to retrieve them or even process them the same BLOB is extremely slow in processing the unstructured data. A NoSQL database is the type of database that can handle unstructured, unorganized and unpredictable data that our business needs it. Along with the support to unstructured data, the other advantage of NoSQL Database is high performance and high availability. Eventual Consistency Additionally to note that NoSQL Database may not provided 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance. Though, NoSQL Database does not support ACID they provide eventual consistency. That means over the long period of time all updates can be expected to propagate eventually through the system and data will be consistent. Taxonomy Taxonomy is the practice of classification of things or concepts and the principles. The NoSQL taxonomy supports column store, document store, key-value stores, and graph databases. We will discuss the taxonomy in detail in later blog posts. Here are few of the examples of the each of the No SQL Category. Column: Hbase, Cassandra, Accumulo Document: MongoDB, Couchbase, Raven Key-value : Dynamo, Riak, Azure, Redis, Cache, GT.m Graph: Neo4J, Allegro, Virtuoso, Bigdata As of now there are over 150 NoSQL Database and you can read everything about them in this single link. Tomorrow In tomorrow’s blog post we will discuss Buzz Word – Hadoop. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

Read the article

Fast Data: Go Big. Go Fast.

- by J Swaroop

Cross-posting Dain Hansen's excellent recap of the Big Data/Fast Data announcement during OOW: For those of you who may have missed it, today’s second full day of Oracle OpenWorld 2012 started with a rumpus. Joe Tucci, from EMC outlined the human face of big data with real examples of how big data is transforming our world. And no not the usual tried-and-true weblog examples, but real stories about taxi cab drivers in Singapore using big data to better optimize their routes as well as folks just trying to get a better hair cut. Next we heard from Thomas Kurian who talked at length about the important platform characteristics of Oracle’s Cloud and more specifically Oracle’s expanded Cloud Services portfolio. Especially interesting to our integration customers are the messaging support for Oracle’s Cloud applications. What this means is that now Oracle’s Cloud applications have a lightweight integration fabric that on-premise applications can communicate to it via REST-APIs using Oracle SOA Suite. It’s an important element to our strategy at Oracle that supports this idea that whether your requirements are for private or public, Oracle has a solution in the Cloud for all of your applications and we give you more deployment choice than any vendor. If this wasn’t enough to get the juices flowing, later that morning we heard from Hasan Rizvi who outlined in his Fusion Middleware session the four most important enterprise imperatives: Social, Mobile, Cloud, and a brand new one: Fast Data. Today, Rizvi made an important step in the definition of this term to explain that he believes it’s a convergence of four essential technology elements: Event Processing for event filtering, business rules – with Oracle Event Processing Data Transformation and Loading - with Oracle Data Integrator Real-time replication and integration – with Oracle GoldenGate Analytics and data discovery – with Oracle Business Intelligence Each of these four elements can be considered (and architect-ed) together on a single integrated platform that can help customers integrate any type of data (structured, semi-structured) leveraging new styles of big data technologies (MapReduce, HDFS, Hive, NoSQL) to process more volume and variety of data at a faster velocity with greater results. Fast data processing (and especially real-time) has always been our credo at Oracle with each one of these products in Fusion Middleware. For example, Oracle GoldenGate continues to be made even faster with the recent 11g R2 Release of Oracle GoldenGate which gives us some even greater optimization to Oracle Database with Integrated Capture, as well as some new heterogeneity capabilities. With Oracle Data Integrator with Big Data Connectors, we’re seeing much improved performance by running MapReduce transformations natively on Hadoop systems. And with Oracle Event Processing we’re seeing some remarkable performance with customers like NTT Docomo. Check out their upcoming session at Oracle OpenWorld on Wednesday to hear more how this customer is using Event processing and Big Data together. If you missed any of these sessions and keynotes, not to worry. There's on-demand versions available on the Oracle OpenWorld website. You can also checkout our upcoming webcast where we will outline some of these new breakthroughs in Data Integration technologies for Big Data, Cloud, and Real-time in more details.

Read the article

Tackling Big Data Analytics with Oracle Data Integrator

- by Irem Radzik

v\:* {behavior:url(#default#VML);} o\:* {behavior:url(#default#VML);} w\:* {behavior:url(#default#VML);} .shape {behavior:url(#default#VML);} Normal 0 false false false EN-US X-NONE X-NONE MicrosoftInternetExplorer4 /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Table Normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-priority:99; mso-style-qformat:yes; mso-style-parent:""; mso-padding-alt:0in 5.4pt 0in 5.4pt; mso-para-margin:0in; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman","serif"; mso-fareast-font-family:"Times New Roman";} By Mike Eisterer The term big data draws a lot of attention, but behind the hype there's a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases. Beyond that critical data, however, is a potential treasure trove of less structured data: weblogs, social media, email, sensors, and documents that can be mined for useful information. Companies are facing emerging technologies, increasing data volumes, numerous data varieties and the processing power needed to efficiently analyze data which changes with high velocity. Oracle offers the broadest and most integrated portfolio of products to help you acquire and organize these diverse data sources and analyze them alongside your existing data to find new insights and capitalize on hidden relationships Oracle Data Integrator Enterprise Edition(ODI) is critical to any enterprise big data strategy. ODI and the Oracle Data Connectors provide native access to Hadoop, leveraging such technologies as MapReduce, HDFS and Hive. Alongside with ODI’s metadata driven approach for extracting, loading and transforming data; companies may now integrate their existing data with big data technologies and deliver timely and trusted data to their analytic and decision support platforms. In this session, you’ll learn about ODI and Oracle Big Data Connectors and how, coupled together, they provide the critical integration with multiple big data platforms. Tackling Big Data Analytics with Oracle Data Integrator October 1, 2012 12:15 PM at MOSCONE WEST – 3005 For other data integration sessions at OpenWorld, please check our Focus-On document. If you are not able to attend OpenWorld, please check out our latest resources for Data Integration.

Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 14/17 | < Previous Page | 10 11 12 13 14 15 16 17 | Next Page >

- by David Allan

- by Jibin

- by David Allan

- by Kinoa

- by luckyluke

- by Tori Wieldt

- by jeckels

- by zhangqm

- by UltimateBrent

- by alex

- by David Williams

- by harshsinghal

- by zjffdu

- by Johannes Rudolph

- by Pengin

- by EmpireJones

- by zero51

- by John Galt... who

- by Nevzz03

- by jasonw

- by tonyrogerson

- by Pinal Dave

- by J Swaroop

- by Irem Radzik

< Previous Page | 10 11 12 13 14 15 16 17 | Next Page >