Search Results

Search found 401 results on 17 pages for 'hadoop'.

Page 7/17

  • Welcome!

    - by mannamal
    Welcome to the Oracle Big Data Connectors blog, which will focus on posts related to integrating data on a Hadoop cluster with Oracle Database. In particular, the blog will focus on best practices, usage notes, and performance tips for using Oracle Loader for Hadoop and Oracle Direct Connector for HDFS, which are part of Oracle Big Data Connectors. Oracle Big Data Connectors 1.0 also includes Oracle R Connector for Hadoop and Oracle Data Integrator Application Adapters for Hadoop.
    Oracle Loader for Hadoop: Oracle Loader for Hadoop loads data from Hadoop to Oracle Database. It runs as a MapReduce job on Hadoop to partition, sort, and convert the data into an Oracle-ready format, offloading to Hadoop the processing that is typically done using database CPUs. The data is then loaded into the database by the Oracle Loader for Hadoop job (online load) or written out as Oracle Data Pump files for load and access later (offline load) with Oracle Direct Connector for HDFS.
    Oracle Direct Connector for HDFS: Oracle Direct Connector for HDFS is a connector for high-speed access to data on HDFS from Oracle Database. With this connector, Oracle SQL can be used to query data on HDFS directly. The data can be Oracle Data Pump files generated by Oracle Loader for Hadoop or delimited text files. The connector can also be used to load data into the database using SQL.

    Read the article

  • Windows Azure Recipe: Big Data

    - by Clint Edmonson
    As the name implies, what we're talking about here is the explosion of electronic data that comes from huge volumes of transactions, devices, and sensors being captured by businesses today. This data often comes in unstructured formats and/or arrives too fast for us to effectively process in real time. Collectively, we call these the 4 big data V's: Volume, Velocity, Variety, and Variability. These qualities make this type of data best managed by NoSQL systems like Hadoop, rather than by conventional relational database management systems (RDBMS). We know that there are patterns hidden inside this data that might provide competitive insight into market trends. The key is knowing when and how to leverage these "NoSQL" tools combined with traditional business systems such as SQL-based relational databases, warehouses, and other business intelligence tools.
    Drivers
    - Petabyte-scale data collection and storage
    - Business intelligence and insight
    Solution
    The sketch below shows one of many big data solutions using Hadoop's unique highly scalable storage and parallel processing capabilities combined with Microsoft Office's Business Intelligence Components to access the data in the cluster.
    Ingredients
    - Hadoop – this big data industry heavyweight provides both large-scale data storage infrastructure and a highly parallelized map-reduce processing engine to crunch through the data efficiently. Here are the key pieces of the environment:
      - Pig – a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
      - Mahout – a machine learning library with algorithms for clustering, classification, and batch-based collaborative filtering that are implemented on top of Apache Hadoop using the map/reduce paradigm.
      - Hive – data warehouse software built on top of Apache Hadoop that facilitates querying and managing large datasets residing in distributed storage. Directly accessible to Microsoft Office and other consumers via add-ins and the Hive ODBC data driver.
      - Pegasus – a peta-scale graph mining system that runs in a parallel, distributed manner on top of Hadoop and provides algorithms for important graph mining tasks such as Degree, PageRank, Random Walk with Restart (RWR), Radius, and Connected Components.
      - Sqoop – a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
      - Flume – a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data to HDFS.
    - Database – directly accessible to Hadoop via the Sqoop-based Microsoft SQL Server Connector for Apache Hadoop; data can be efficiently transferred to traditional relational data stores for replication, reporting, or other needs.
    - Reporting – provides easily consumable reporting when combined with a database being fed from the Hadoop environment.
    Training
    These links point to online Windows Azure training labs where you can learn more about the individual ingredients described above.
    - Hadoop Learning Resources (20+ tutorials and labs) – huge collection of resources for learning about all aspects of Apache Hadoop-based development on Windows Azure and the Hadoop and Windows Azure ecosystems.
    - SQL Azure (7 labs) – Microsoft SQL Azure delivers on the Microsoft Data Platform vision of extending the SQL Server capabilities to the cloud as web-based services, enabling you to store structured, semi-structured, and unstructured data.
    See my Windows Azure Resource Guide for more guidance on how to get started, including links to web portals, training kits, samples, and blogs related to Windows Azure.

    Read the article

  • How best to merge/sort/page through tons of JSON arrays?

    - by Joshiatto
    Here's the scenario: Say you have millions of JSON documents stored as text files. Each JSON document is an array of "activity" objects, each of which contain a "created_datetime" attribute. What is the best way to merge/sort/filter/page through these activities via a web UI? For example, say we want to take a few thousand of the documents, merge them into a gigantic array, sort the array by the "created_datetime" attribute descending and then page through it 10 activities at a time. Also keep in mind that roughly 25% of these JSON documents are updated every day, and updates have to make it into the view within 5 minutes. My first thought is to parse all of the documents into an RDBMS table and then it would just be a simple query such as "select top 10 name, created_datetime from Activity where user_id=12345 order by created_datetime desc". Some have suggested I use NoSQL techniques such as hadoop or map/reduce instead. How exactly would this work? For more background, see: Why is NoSQL better for this scenario?

    Read the article

  • unable to start regionservers in HBase

    - by amsal
    I have a problem starting regionservers on the slave PCs. When I list only the master PC in conf/regionservers, everything works fine, but when I add two more slaves HBase does not start. If I delete all HBase folders in the tmp folder on all PCs and then start the regionservers (with 3 regionservers listed), HBase starts, but when I try to create a table it fails again (gets stuck). I am using Hadoop 0.20.0, which is working fine, and HBase 0.92.0. I have 3 PCs in the cluster, one master and two slaves. Also, is DNS (with both forward and reverse lookup working) necessary for HBase in my case? Is there any way to replicate an HBase table to all region servers? That is, I want to have a copy of the table on each PC and want to access it locally (when I execute a map task it should use its local copy of the HBase table).

    Read the article

  • Design an application to send messages by marking a circle on the map where you want to send the message

    - by jhamb
    This is a question an interviewer asked me: given a map of the world, for the countries where you want to send a message you just mark a circle over that area, and the message is sent to all the people who fall inside it. The question's visual link is: Design this application. The approach I described to him: first, build up all the person data (contacts, place information and so on); then, wherever you mark on the map, build a cluster for that country using Hadoop and fire the message to every person's contact that falls inside that cluster. Please help me understand this problem better, and if you have another good approach (both back-end and front-end), please tell me or discuss it here with me. Thanks in advance.
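
    On the back-end side, the step of deciding which contacts fall inside the drawn circle is just a great-circle distance test against the circle's center and radius; here is a small illustrative sketch (the class, method names, and sample coordinates are mine) of the filter that each Hadoop map task could apply to its share of the contact records.

        // Illustrative only: test whether a contact's coordinates fall inside the
        // circle the user drew on the map (center + radius), using the haversine formula.
        public class CircleFilter {

            static final double EARTH_RADIUS_KM = 6371.0;

            // Great-circle distance between two latitude/longitude points, in kilometers.
            static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
                double dLat = Math.toRadians(lat2 - lat1);
                double dLon = Math.toRadians(lon2 - lon1);
                double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                         + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                         * Math.sin(dLon / 2) * Math.sin(dLon / 2);
                return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
            }

            static boolean insideCircle(double centerLat, double centerLon, double radiusKm,
                                        double contactLat, double contactLon) {
                return distanceKm(centerLat, centerLon, contactLat, contactLon) <= radiusKm;
            }

            public static void main(String[] args) {
                // Made-up example: a circle drawn roughly over Paris with a 500 km radius.
                System.out.println(insideCircle(48.85, 2.35, 500, 50.85, 4.35));  // Brussels: true
                System.out.println(insideCircle(48.85, 2.35, 500, 40.41, -3.70)); // Madrid: false
            }
        }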

    Read the article

  • Disable RAID Controller

    - by B.Mr.W.
    I have some decent HP ProLiant servers that come with the "HP Smart Array P410i Controller" enabled. I am using these boxes to set up a Hadoop cluster, and RAID is for sure a no-no for Hadoop, since the application itself takes care of data redundancy and the extra intelligence provided by RAID won't be helpful and might hurt performance. I tried to disable the device in the BIOS, and the box could not even access the disks afterwards. So I am assuming the controller sits between the disks and the motherboard, and we have to leave it on and configure it as "level 0" (one logical drive per disk) or something like that. I am wondering what I should do to "disable" the RAID functionality so it will fit into the Hadoop environment.

    Read the article

  • How to remove a large number of files/folders in Linux

    - by user1745713
    We are using Hadoop to split a table into smaller files to feed to Mahout, but in the process we created a huge number of _temporary logs. We have an NFS mount for the Hadoop volume, so we can use all the Linux commands to delete folders and files, but we just can't get them to be deleted. Here's what I've tried so far:
      hadoop fs -rmr /.../_temporary : hangs for hours and does nothing
      on the NFS mount: rm -rf /.../_temporary : hangs for hours and does nothing
      find . -name '*.*' -type f -delete : same as above
    The folders look like this (38 of these folders inside _temporary):
      drwxr-xr-x 319324 user user 319322 Oct 24 12:12 _attempt_201310221525_0404_r_000000_0
    The contents of these are actually folders, not files. Each one of those 319322 folders has exactly one file inside. Not sure why they do the logging this way. Any help is appreciated.
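
    For what it's worth, the programmatic equivalent of the hadoop fs -rmr attempt above is a recursive FileSystem.delete() call; on HDFS the recursive removal is handled by the NameNode rather than by the client walking every one of those 319322 directories, which is what rm -rf and find over the NFS mount end up doing. A minimal sketch (the class name is made up; the target path is passed in rather than guessed):

        // Minimal sketch: delete a directory tree through the HDFS client API.
        // Pass the directory to remove (e.g. the job's _temporary dir) as args[0].
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class DeleteTree {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();   // picks up fs.default.name from the cluster config
                FileSystem fs = FileSystem.get(conf);
                Path target = new Path(args[0]);
                boolean deleted = fs.delete(target, true);  // true = recursive
                System.out.println(target + (deleted ? " deleted" : " not deleted"));
            }
        }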

    Read the article

  • Using PIG with Hadoop, how do I regex match parts of text with an unknown number of groups?

    - by lmonson
    I'm using Amazon's Elastic MapReduce. I have log files that look something like this:
      random text foo="1" more random text foo="2" more text
      noise foo="1" blah blah blah
      foo="1" blah blah foo="3" blah blah foo="4"
      ...
    How can I write a Pig expression to pick out all the numbers in the 'foo' expressions? I'd prefer tuples that look something like this:
      (1,2)
      (1)
      (1,3,4)
    I've tried the following:
      TUPLES = foreach LINES generate FLATTEN(EXTRACT(line,'foo="([0-9]+)"'));
    But this yields only the first match in each line:
      (1)
      (1)
      (1)
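
    Pig's built-in regex helpers hand back capture groups from a single match, so one common route for "every match on the line" is a small Java UDF that loops over Matcher.find() and returns a bag; here is a hedged sketch (the UDF class name is made up), which would be registered from a jar and used in place of the EXTRACT call above, with FLATTEN applied afterwards if a flat list is wanted.

        // Hypothetical Pig UDF: returns a bag with one tuple per foo="N" match on the
        // input line, so the number of matches per record need not be known in advance.
        import java.io.IOException;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.BagFactory;
        import org.apache.pig.data.DataBag;
        import org.apache.pig.data.Tuple;
        import org.apache.pig.data.TupleFactory;

        public class ExtractAllFoo extends EvalFunc<DataBag> {

            private static final Pattern FOO = Pattern.compile("foo=\"([0-9]+)\"");
            private static final BagFactory BAG = BagFactory.getInstance();
            private static final TupleFactory TUPLE = TupleFactory.getInstance();

            @Override
            public DataBag exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0 || input.get(0) == null) {
                    return null;
                }
                DataBag out = BAG.newDefaultBag();
                Matcher m = FOO.matcher((String) input.get(0));
                while (m.find()) {                       // one tuple per match, not just the first
                    out.add(TUPLE.newTuple(m.group(1)));
                }
                return out;
            }
        }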

    Read the article

  • How to calculate Centered Moving Average of a set of data in Hadoop Map-Reduce?

    - by 100gods
    I want to calculate the centered moving average of a set of data.
    Example input format:
      quarter | sales
      Q1'11   | 9
      Q2'11   | 8
      Q3'11   | 9
      Q4'11   | 12
      Q1'12   | 9
      Q2'12   | 12
      Q3'12   | 9
      Q4'12   | 10
    Mathematical representation of the data, the moving average, and then the centered moving average:
      Period | Value | MA     | Centered
      1      | 9     |        |
      1.5    |       |        |
      2      | 8     |        |
      2.5    |       | 9.5    |
      3      | 9     |        | 9.5
      3.5    |       | 9.5    |
      4      | 12    |        | 10.0
      4.5    |       | 10.5   |
      5      | 9     |        | 10.750
      5.5    |       | 11.0   |
      6      | 12    |        |
      6.5    |       |        |
      7      | 9     |        |
    I am stuck on the implementation of a RecordReader that will provide the mapper with the sales values of a year, i.e. of four quarters. (See also: The RecordReader Problem question thread.) Thanks
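
    Whatever RecordReader ends up feeding the mappers, the centered-moving-average arithmetic itself is small once one place sees the ordered series; here is a plain, non-distributed Java sketch of the window-of-four calculation over the quarterly figures above (the class and helper names are mine, and it simply illustrates the computation rather than reproducing the partially transcribed worked table).

        // Non-distributed sketch of a centered moving average with an even window:
        // cma[i] is the average of the two w-period means that straddle period i.
        public class CenteredMovingAverage {

            static double[] centered(double[] v, int w) {
                double[] cma = new double[v.length];
                java.util.Arrays.fill(cma, Double.NaN);        // NaN where the window does not fit
                for (int i = w / 2; i <= v.length - 1 - w / 2; i++) {
                    double first = mean(v, i - w / 2, w);      // first w-period window straddling i
                    double second = mean(v, i - w / 2 + 1, w); // same window shifted forward by one
                    cma[i] = (first + second) / 2.0;
                }
                return cma;
            }

            static double mean(double[] v, int from, int len) {
                double sum = 0;
                for (int i = from; i < from + len; i++) sum += v[i];
                return sum / len;
            }

            public static void main(String[] args) {
                double[] sales = {9, 8, 9, 12, 9, 12, 9, 10};  // Q1'11 .. Q4'12 from the question
                double[] cma = centered(sales, 4);
                for (int i = 0; i < sales.length; i++) {
                    System.out.printf("period %d  sales %.0f  centered MA %s%n", i + 1, sales[i],
                            Double.isNaN(cma[i]) ? "-" : String.format("%.2f", cma[i]));
                }
            }
        }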

    Read the article

  • Microsoft launches Windows Azure HDInsight, its Hadoop-based PaaS platform for analyzing large volumes of data

    Microsoft launches Windows Azure HDInsight, its Hadoop-based PaaS platform for analyzing large volumes of data. Apache Hadoop is an open-source framework for analyzing large volumes of data. Microsoft has just announced the general availability of the solution on its Windows Azure cloud platform. Unveiled in public beta nearly a year ago, the Windows Azure HDInsight PaaS (Platform as a Service) offering provides a Hadoop clone and associated tools for manipulating...

    Read the article

  • End of LINQ to HPC: Microsoft abandons its platform for processing large volumes of data in favor of its competitor Hadoop

    End of LINQ to HPC: Microsoft abandons its platform for processing large volumes of data to concentrate on supporting its competitor Hadoop. Microsoft is dropping LINQ to HPC (High Performance Computing), code-named Dryad, its own high-performance platform for distributed computation and data-intensive processing, in order to concentrate on supporting its competitor Hadoop in its products. The company had recently shown its interest in Hadoop, the Java platform for storing and batch-processing very large amounts of data (Big Data), notably by publishing two connectors

    Read the article

  • Amazon EC2 master node hanging

    - by Algorist
    Hi, I am using the Cloudera setup to launch a Hadoop cluster on Amazon. Sometimes the master Hadoop node hangs and we have to restart the job from the beginning. Did anyone face a similar problem and resolve the issue? Thank you.

    Read the article

  • Storing data to SequenceFile from Apache Pig

    - by asquithea
    Apache Pig can load data from Hadoop sequence files using the PiggyBank SequenceFileLoader:
      REGISTER /home/hadoop/pig/contrib/piggybank/java/piggybank.jar;
      DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
      log = LOAD '/data/logs' USING SequenceFileLoader AS (...)
    Is there also a library out there that would allow writing to Hadoop sequence files from Pig?
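
    If no ready-made storer is available, one option is a custom StoreFunc whose OutputFormat is a SequenceFileOutputFormat. Below is an untested sketch along those lines, assuming a simple two-field (key, value) tuple layout written as Text/Text; the class name is made up. It would then be invoked from the script as STORE data INTO '/output/path' USING SequenceFileTextStorer();

        // Untested sketch of a StoreFunc that writes (key, value) tuples to a Hadoop
        // SequenceFile with Text keys and values. Assumes each tuple has exactly two fields.
        import java.io.IOException;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.OutputFormat;
        import org.apache.hadoop.mapreduce.RecordWriter;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
        import org.apache.pig.StoreFunc;
        import org.apache.pig.data.Tuple;

        public class SequenceFileTextStorer extends StoreFunc {

            private RecordWriter<Text, Text> writer;

            @Override
            public OutputFormat getOutputFormat() throws IOException {
                return new SequenceFileOutputFormat<Text, Text>();
            }

            @Override
            public void setStoreLocation(String location, Job job) throws IOException {
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
                FileOutputFormat.setOutputPath(job, new Path(location));
            }

            @Override
            @SuppressWarnings("unchecked")
            public void prepareToWrite(RecordWriter writer) throws IOException {
                this.writer = writer;
            }

            @Override
            public void putNext(Tuple t) throws IOException {
                try {
                    writer.write(new Text(String.valueOf(t.get(0))),
                                 new Text(String.valueOf(t.get(1))));
                } catch (InterruptedException e) {
                    throw new IOException(e);
                }
            }
        }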

    Read the article

  • kick off a MapReduce job from my Java/MySQL webapp

    - by Brian
    Hi guys, I need a bit of architecture advice. I have a Java-based webapp, with a JPA-based ORM backed by a MySQL relational database. Now, as part of the application I have a batch job that compares thousands of database records with each other. This job has become too time-consuming and needs to be parallelized. I'm looking at using MapReduce and Hadoop in order to do this. However, I'm not too sure about how to integrate this into my current architecture. I think the easiest initial solution is to find a way to push data from MySQL into Hadoop jobs. I have done some initial research on this and found the following relevant information and possibilities:
    1) https://issues.apache.org/jira/browse/HADOOP-2536 gives an interesting overview of some built-in JDBC support
    2) This article http://architects.dzone.com/articles/tools-moving-sql-database describes some third-party tools to move data from MySQL to Hadoop.
    To be honest I'm just starting out with learning about HBase and Hadoop, but I really don't know how to integrate this into my webapp. Any advice is greatly appreciated. Cheers, Brian
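
    The built-in JDBC support tracked in HADOOP-2536 surfaced as DBInputFormat, which lets a job read its input splits straight out of a MySQL table through the JDBC driver. Here is a hedged sketch of that route; the table name, columns, and connection details are placeholders rather than the webapp's real schema, and the pairwise comparison logic itself is left as a stub.

        // Hedged sketch: feed rows from a MySQL table into a MapReduce job via DBInputFormat.
        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.Writable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
        import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
        import org.apache.hadoop.mapreduce.lib.db.DBWritable;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class CompareRecordsJob {

            // One row of the (hypothetical) "records" table.
            public static class RecordRow implements Writable, DBWritable {
                long id;
                String payload;

                public void readFields(ResultSet rs) throws SQLException {
                    id = rs.getLong("id");
                    payload = rs.getString("payload");
                }
                public void write(PreparedStatement ps) throws SQLException {
                    ps.setLong(1, id);
                    ps.setString(2, payload);
                }
                public void readFields(DataInput in) throws IOException {
                    id = in.readLong();
                    payload = in.readUTF();
                }
                public void write(DataOutput out) throws IOException {
                    out.writeLong(id);
                    out.writeUTF(payload);
                }
            }

            public static class RecordMapper
                    extends Mapper<LongWritable, RecordRow, LongWritable, Text> {
                @Override
                protected void map(LongWritable key, RecordRow row, Context ctx)
                        throws IOException, InterruptedException {
                    // The record-comparison logic would go here (or in a later stage).
                    ctx.write(new LongWritable(row.id), new Text(row.payload));
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                        "jdbc:mysql://dbhost/mydb", "user", "password");

                Job job = Job.getInstance(conf, "compare records");
                job.setJarByClass(CompareRecordsJob.class);
                job.setMapperClass(RecordMapper.class);
                job.setInputFormatClass(DBInputFormat.class);
                DBInputFormat.setInput(job, RecordRow.class,
                        "records",        // table
                        null,             // optional WHERE conditions
                        "id",             // ORDER BY column (keeps splits deterministic)
                        "id", "payload"); // columns to read
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                FileOutputFormat.setOutputPath(job, new Path(args[0]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }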

    Read the article

  • Pig_Cassandra integration caused - ERROR 1070: Could not resolve CassandraStorage using imports:

    - by Le Dude
    I'm following a basic Pig, Cassandra, and Hadoop installation. Everything works just fine standalone, with no errors. However, when I tried to run the example file provided with the Pig_cassandra example, I got this error:
      [root@localhost pig]# /opt/cassandra/apache-cassandra-1.1.6/examples/pig/bin/pig_cassandra -x local -x local /opt/cassandra/apache-cassandra-1.1.6/examples/pig/example-script.pig
      Using /opt/pig/pig-0.10.0/pig-0.10.0-withouthadoop.jar.
      2012-10-24 21:14:58,551 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
      2012-10-24 21:14:58,552 [main] INFO org.apache.pig.Main - Logging error messages to: /opt/pig/pig_1351138498539.log
      2012-10-24 21:14:59,004 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
      2012-10-24 21:14:59,472 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
      Details at logfile: /opt/pig/pig_1351138498539.log
    Here is the log file:
      Pig Stack Trace
      ---------------
      ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
      org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
              at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1597)
              at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1540)
              at org.apache.pig.PigServer.registerQuery(PigServer.java:540)
              at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:970)
              at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:386)
              at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:189)
              at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
              at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
              at org.apache.pig.Main.run(Main.java:555)
              at org.apache.pig.Main.main(Main.java:111)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:601)
              at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
      Caused by: Failed to parse: Cannot instantiate: CassandraStorage
              at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:184)
              at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1589)
              ... 14 more
      Caused by: java.lang.RuntimeException: Cannot instantiate: CassandraStorage
              at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:510)
              at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:791)
              at org.apache.pig.parser.LogicalPlanBuilder.buildFuncSpec(LogicalPlanBuilder.java:780)
              at org.apache.pig.parser.LogicalPlanGenerator.func_clause(LogicalPlanGenerator.java:4583)
              at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3115)
              at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1291)
              at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:789)
              at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:507)
              at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:382)
              at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:175)
              ... 15 more
      Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
              at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:495)
              at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:507)
              ... 24 more
      ================================================================================
    I googled around and found another Stack Overflow user who identified the potential problem but not the solution: "Cassandra and pig integration cause error during startup". I believe my configuration is correct and the path has already been defined properly. I didn't change anything in the pig_cassandra file. I'm not quite sure how to proceed from here. Please help?

    Read the article

  • configuration required for HIVE to be installed on a node

    - by ????? ????????
    I went through the process of manually installing Ambari (not through SSH, because I couldn't get keyless login to work) and everything installed OK, except for HIVE and GANGLIA. I got this message:
      stderr: None
      stdout:
      warning: Unrecognised escape sequence '\;' in file /var/lib/ambari-agent/puppet/modules/hdp-hive/manifests/hive/service_check.pp at line 32
      warning: Dynamic lookup of $configuration is deprecated. Support will be removed in Puppet 2.8. Use a fully-qualified variable name (e.g., $classname::variable) or parameterized classes.
      notice: /Stage[1]/Hdp::Snappy::Package/Hdp::Snappy::Package::Ln[32]/Hdp::Exec[hdp::snappy::package::ln 32]/Exec[hdp::snappy::package::ln 32]/returns: executed successfully
      notice: /Stage[2]/Hdp-hive::Hive::Service_check/File[/tmp/hiveserver2Smoke.sh]/ensure: defined content as '{md5}7f1d24221266a2330ec55ba620c015a9'
      notice: /Stage[2]/Hdp-hive::Hive::Service_check/File[/tmp/hiveserver2.sql]/ensure: defined content as '{md5}0c429dc9ae0867b5af74ef85b5530d84'
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/File[/tmp/hcatSmoke.sh]/ensure: defined content as '{md5}bae7742f7083db968cb6b2bd208874cb'
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: 13/06/25 03:11:56 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: FAILED: SemanticException org.apache.hadoop.hive.ql.parse.SemanticException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: 13/06/25 03:12:06 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: FAILED: SemanticException [Error 10001]: Table not found hcatsmokeida8c07401_date102513
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: 13/06/25 03:12:15 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
      notice: /Stage[2]/Hdp-hcat::Hcat::Service_check/Exec[hcatSmoke.sh prepare]/returns: FAILED: SemanticException o
    When I go to the alerts and health checks I'm getting this:
      Hive Metastore status check CRIT for 42 minutes
      CRITICAL: Error accessing hive-metaserver status [13/06/25 03:44:06 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect.
    What am I doing wrong? I have already tried to do ambari-server reset on the database, without results.

    Read the article

  • HDFS datanode startup fails when disks are full

    - by mbac
    Our HDFS cluster is only 90% full, but some datanodes have some disks that are 100% full. That means when we mass-reboot the entire cluster, some datanodes completely fail to start with a message like this:
      2013-10-26 03:58:27,295 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Mkdirs failed to create /mnt/local/sda1/hadoop/dfsdata/blocksBeingWritten
    Only three have to fail this way before we start experiencing real data loss. Currently we work around it by decreasing the amount of space reserved for the root user, but we'll eventually run out. We also run the re-balancer pretty much constantly, but some disks stay stuck at 100% anyway. Changing the dfs.datanode.failed.volumes.tolerated setting is not the solution, as the volume has not failed. Any ideas?

    Read the article

  • How much power supply do I need for my server, and could a shortage be causing my odd crashing?

    - by dolan
    I have 5 servers, all with similar hardware (i7, four 2 TB 7200 rpm drives, two 4 TB 5400 rpm drives, 430 watt power supply), and lately the machines have been freezing up. This has gotten worse in the last day or so, and I can't pinpoint any explanation. One recent change was adding the two 4 TB hard drives. The crashes happen most often while running a large Hadoop job, so I was originally thinking the load was causing some issues, but last night one server just froze without any heavy load on the box (or so I think), other than that HDFS (Hadoop's distributed file system) was probably rebalancing itself, since two of the five nodes were offline. If I plug in a monitor and keyboard to one of these frozen machines, I can't get any response or feedback on the screen. Any ideas on possible points of failure and/or different logs I can look at to investigate? Thanks
    Edit: The systems are running Ubuntu 10.04
    Edit 2: More on hardware:
    - Intel Core i7-930 Bloomfield 2.8 GHz processor (quad core)
    - 12 GB (6 x 2 GB) Kingston DDR3 1333 RAM
    - Antec EarthWatts Green 430 power supply
    - MSI X58M LGA 1366 motherboard

    Read the article

  • Big Data – Interacting with Hadoop – What is PIG? – What is PIG Latin? – Day 16 of 21

    - by Pinal Dave
    In yesterday's blog post we learned the importance of Hive in the Big Data story. In this article we will understand what Pig and Pig Latin are in the Big Data story. Yahoo started working on Pig for deploying their applications on Hadoop; Yahoo's goal was to manage their unstructured data.
    What are Pig and Pig Latin? Pig is a high-level platform for creating MapReduce programs used with Hadoop, and the language we use on this platform is called Pig Latin. Pig was designed to make Hadoop more user-friendly and approachable for power users and non-developers. Pig is an interactive execution environment supporting the Pig Latin language. The Pig Latin language supports loading and processing input data through a series of transformations to produce the desired results. Pig has two different execution environments: 1) Local Mode – in this case all the scripts run on a single machine. 2) Hadoop – in this case all the scripts run on a Hadoop cluster.
    Pig Latin vs SQL: Pig essentially creates a set of map and reduce jobs under the hood. Because of this, users do not have to write, compile, and build solutions for Big Data themselves. Pig is very similar to SQL in many ways. The Pig Latin language provides an abstraction layer over the data. It focuses on the data and not the structure under the hood. Pig Latin is a very powerful language and it can do various operations such as loading and storing data, streaming data, and filtering data, as well as various data operations related to strings. The major difference between SQL and Pig Latin is that Pig Latin is procedural and SQL is declarative. In simpler words, Pig Latin is very similar to a SQL execution plan, and that makes it much easier for programmers to build various processes. Whereas SQL handles trees naturally, Pig Latin follows a directed acyclic graph (DAG). DAGs are used to model several different kinds of structures in mathematics and computer science.
    Tomorrow: In tomorrow's blog post we will discuss a very important component of the Big Data ecosystem – Zookeeper. Reference: Pinal Dave (http://blog.sqlauthority.com) Filed under: Big Data, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL

    Read the article

  • Big Data's Killer App…

    - by jean-pierre.dijcks
    Recently Keith spent some time talking about the cloud on this blog, and I will spare you my thoughts on the whole thing. What I do want to write down is something about the Big Data movement and what I think is the killer app for Big Data... Where is this coming from? OK, I confess... I spent 3 days in cloud land at the Cloud Connect conference in Santa Clara and it was quite a lot of fun. One of the nice things at Cloud Connect was that there was a track dedicated to Big Data, which prompted me to some extent to write this post.
    What is Big Data anyways?
    The most valuable point made in the Big Data track was that Big Data in itself is not very cool. Doing something with Big Data is what makes all of this cool and interesting to a business user! The other good insight I got was that a lot of people think Big Data means a single gigantic monolithic system holding gazillions of bytes or documents or log files. Well, it turns out that most people in the Big Data track are talking about a lot of collections of smaller data sets. So rather than thinking "big = monolithic" you should be thinking "big = many data sets". This is more than just theoretical; it is actually relevant when thinking about big data and how to process it. It is important because it means that the platform that stores data will most likely consist of multiple solutions. You may be storing logs on something like HDFS, you may store your customer information in Oracle, and you may store clickstream information in some distilled form in MySQL. The big question you will need to solve is not what lives where, but how to get it all together and get some value out of all that data.
    NoSQL and MapReduce
    Nope, sorry, this is not the killer app... and no, I'm not saying this because my business card says Oracle and I'm therefore biased. I think language is important, but as with storage I think pragmatic is better. In other words, some questions can be answered with SQL very efficiently, others can be answered with Perl or Tcl, others with MR. History should teach us that anyone trying to solve a problem will use any and all tools around. For example, most data warehouses (Big Data 1.0?) get a lot of data in flat files. Everyone then runs a bunch of shell scripts to massage or verify those files and then shoves those files into the database. We've even built shell script support into external tables to allow for this. I think the Big Data projects will do the same. Some people will use MapReduce, although I would argue that things like Cascading are more interesting; some people will use Java. Some data is stored on HDFS, making Cascading the way to go; some data is stored in Oracle, and SQL does do a good job there. As with storage and with history, be pragmatic and use what fits, and neither NoSQL nor MR will be the one and only. Also, a language, while important, does in itself not deliver business value. So while cool, it is not a killer app...
    Vertical Behavioral Analytics
    This is the killer app! And you are now thinking: "what does that mean?" Let's decompose that heading. First of all, analytics. I would think you had guessed by now that this is really what I'm after, and of course you are right. But not just analytics, which has a very large scope and means many things to many people. I'm not just after Business Intelligence (analytics 1.0?) or data mining (analytics 2.0?); I'm after something more interesting that you can only do after collecting large volumes of specific data.
    That all-important data is about behavior. What do my customers do? More importantly, why do they behave like that? If you can figure that out, you can tailor web sites, stores, products, etc. to that behavior and figure out how to be successful. Today's behavior that is somewhat easily tracked is web site clicks, search patterns, and all of those things that a web site or web server tracks. That is where the Big Data lives and where these patterns are now emerging. Other examples are emerging, however, and one of the examples used at the conference was about predicting churn for a telco based on the social network its members are a part of. That social network is not about LinkedIn or Facebook, but about who calls whom. I call you a lot, you switch provider, and I might/will switch too. And that just naturally brings me to the next word, vertical. Vertical in this context means per industry, e.g. communications or retail or government or any other vertical. The reason for being more specific than just behavioral analytics is that each industry has its own data sources, its own quirky logic, and its own demands and priorities. Of course, the methods and some of the software will be common, and some will have both retail and service industry analytics in place (your corner coffee store, for example). But the gist of it all is that analytics that can predict customer behavior for a specific, focused group of people in a specific industry is what makes Big Data interesting.
    Building a Vertical Behavioral Analysis System
    Well, that is going to be interesting. I have not seen much going on in that space, and if I had to level some criticism at the Cloud Connect conference it would be the lack of concrete use cases on big data. The telco example, while a step into the vertical behavioral part, is not really on big data. It used a sample of data from the customers' data warehouse. One thing I do think, and this is where I think parts of the NoSQL stuff come from, is that we will be doing this analysis where the data is. Over the past 10 years we at Oracle have called this in-database analytics. I guess we were (too) early? Now the entire market is going there, including companies like SAS. In-place, by the way, does not mean "no data movement at all"; what it means is that you will do this on the data's permanent home. For SAS that is kind of the current problem. Most of the inputs live in a data warehouse. So why move it into SAS and back? That all worked with 1 TB data warehouses, but when we are looking at 100 TB to 500 TB of distilled data...
    Comments?
    As it is still early days with these systems, I'm very interested in seeing reactions and thoughts to some of these thoughts...

    Read the article

  • Scalable distributed file system for blobs like images and other documents

    - by Pinnacle
    Cassandra & HBase both do not efficiently support storage of blobs like images. Storing directly on HDFS stresses the Namenode because of huge number of files. Facebook uses Haystack for images and attachments storage, but this is not open source. So is Lustre a good choice for distributed blob storage? I have read that Amazon S3 is used by many, but this would cost money and personally, I would not like to rely on third party system. What are other suggestions?

    Read the article

  • Using a "local" S3 emulation layer as a replacement for HDFS?

    - by user183394
    I have been testing out the most recent Cloudera CDH4 hadoop-conf-pseudo (i.e. MRv2 or YARN) on a notebook, which has 4 cores, 8 GB RAM, an Intel X25-M G2 SSD, and runs an S3 emulation layer my colleagues and I wrote in C++. The OS is Ubuntu 12.04 LTS 64-bit. So far so good. Looking at "Setting up hadoop to use S3 as a replacement for HDFS", I would like to do the same on my notebook. Nevertheless, I can't find where I can change the jets3t.properties to set the endpoint to localhost. I downloaded hadoop-2.0.1-alpha.tar.gz and searched the source without finding a clue. There is a similar question on SO, "Using s3 as fs.default.name or HDFS?", but I want to use our own lightweight and fast S3 emulation layer, instead of AWS S3, for our experiments. I would appreciate a hint as to how I can change the endpoint to a different hostname. Regards, --Zack

    Read the article

  • Cannot access data from HBase with Amazon EC2

    - by Najeeb Thalakkatt
    I have a single-node Hadoop machine on which HBase is running, in an Amazon EC2 instance. For some reason the server got restarted, so I had to start Hadoop and HBase again. After that it works fine, but the old data in HBase cannot be accessed through the web services. When I use the shell commands it works fine and I get the data. I recreated the scenario on my local server machine, and there everything works fine. The version details are as follows: hadoop-0.20.2, hbase-0.90.5, apache-tomcat-7.0.30, on an Amazon EC2 medium instance. And I use RESTful web services with Orm-hbase to access data.

    Read the article

  • how to convince others we should move to Hadoop?

    - by Ramy
    Everything I've read about Hadoop seems like exactly the technology we need to make our enterprise more scalable. We have terabytes of raw data that is in non-relational form (text files of some kind). We're quickly approaching the upper limits of what our centralized file server can handle and everyone is aware of this. Most people on the tech team, especially the more junior members of the tech team are all in favor of moving from the central file system to HDFS. The problem is, there is one key (most senior, etc.) member of the team who is resisting this change and every time Hadoop comes up, he tells us that we could simply add another file server and be in the clear. So, my question (and yes, it's really subjective, but I need more help with this than any of my other questions) is what steps can we take to get upper management to move forward with Hadoop despite the hesitation of one member of the team?

    Read the article

  • writing hbase reports

    - by sammy
    Hello, I've just started exploring HBase. I've run 2 samples: SampleUploader from the examples and PerformanceEvaluation, as given in the Hadoop wiki: http://wiki.apache.org/hadoop/Hbase/MapReduce Well, my application involves updating a huge amount of data once a day. I need samples that store and retrieve rows based on timestamp and run an analysis over the data. Could you please provide pointers on how to continue, as I don't find many tutorials or samples using TIMESTAMP? Thank you a lot, sammy
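
    For the retrieve-by-timestamp part, the HBase client Scan API can restrict results to a time range directly, which is usually how "everything written in the last day" reports are pulled. A small sketch against the 0.90-era client API (the table and column family names are made up):

        // Small sketch: scan rows whose cell versions fall inside a timestamp window.
        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.ResultScanner;
        import org.apache.hadoop.hbase.client.Scan;
        import org.apache.hadoop.hbase.util.Bytes;

        public class DailyReportScan {
            public static void main(String[] args) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                HTable table = new HTable(conf, "uploads");      // hypothetical table name

                long end = System.currentTimeMillis();
                long start = end - 24L * 60 * 60 * 1000;         // the last 24 hours

                Scan scan = new Scan();
                scan.addFamily(Bytes.toBytes("data"));           // hypothetical column family
                scan.setTimeRange(start, end);                   // min inclusive, max exclusive

                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result row : scanner) {
                        // per-row analysis over the day's data would go here
                        System.out.println(Bytes.toString(row.getRow()));
                    }
                } finally {
                    scanner.close();
                    table.close();
                }
            }
        }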

    Read the article
