Between the elephant (the traditional image of the Hadoop framework) and the Oracle Iron Man (Big Data), an English setter could be seen as the link to the right data.
    Data, data, data: we live in a world where data technology, through popular applications, search engines,
    web servers, rich SMS messages, email clients, weather forecasts
    and so on, plays a predominant role in our lives.
     More and more technologies are used to analyze and track our
      behavior, detect patterns, and propose "the best/right
      user experience", from the Google Ad services to telco companies
      and large consumer sites (like Amazon :) ). The more we use
      these technologies, the more data we generate, and thus there is
      a need for huge data marts and specific hardware/software servers
      (such as the Exadata servers) in order to process, analyze and understand
      the trends and offer new services to the users.
    Some of these "data feeds" are raw, unstructured data and
      cannot be processed effectively by normal SQL queries. Large-scale
      distributed processing was an emerging infrastructure need,
      and the solution seemed to be the "collocation of compute nodes
      with the data", which in turn led to MapReduce parallel
      patterns and the development of the Hadoop framework, which is
      based on MapReduce and a distributed file system (HDFS) that
      runs on large clusters of rather inexpensive servers.
    Several Oracle products use this distribute/aggregate
      pattern for data computation (Coherence, NoSQL Database, TimesTen), so
      once you are familiar with one of these technologies, let's
      say with Coherence aggregators, you will find the whole Hadoop/MapReduce
      concept very similar.
    The Oracle Big Data Appliance is based on the Cloudera Distribution
      (CDH), and the Oracle Big Data Connectors can be plugged into a
      Hadoop cluster running the CDH distribution or an equivalent Hadoop
      cluster.
    In this paper, a "lab-like" implementation of this concept is
      done on a single Linux x64 server, running an Oracle Database
      11g Enterprise Edition Release 11.2.0.4.0 instance and a single-node
      Apache hadoop-1.2.1 HDFS cluster, using the Oracle SQL Connector for HDFS.
    The whole setup is fairly simple:
      
        Install an Oracle Database 11g Enterprise Edition Release 11.2.0.4.0
        server on a Linux x64 server (or VirtualBox appliance)
        Get the Apache Hadoop distribution from:
          http://mir2.ovh.net/ftp.apache.org/dist/hadoop/common/hadoop-1.2.1.
        Get the Oracle Big Data Connectors from:
          http://www.oracle.com/technetwork/bdc/big-data-connectors/downloads/index.html?ssSourceSiteId=ocomen.
        Check the java version of your Linux server with the command:
          
java -version
 java version "1.7.0_40"
Java(TM) SE Runtime Environment (build 1.7.0_40-b43)
Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode)
        Decompress the hadoop-1.2.1.tar.gz file to
          /u01/hadoop-1.2.1, for instance:
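          A minimal sketch of this step, assuming the tarball was downloaded to the current directory and that /u01 is writable by the oracle user:

tar -xzf hadoop-1.2.1.tar.gz -C /u01
ls /u01/hadoop-1.2.1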
          
        Modify your .bash_profile:
          export HADOOP_HOME=/u01/hadoop-1.2.1
export HIVE_HOME=/u01/hive-0.11.0
export PATH=$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin
          (also see my
          sample .bash_profile)
        Set up ssh trust for the Hadoop
          processes; this is a mandatory step, and in our case we have to
          establish a "local trust", as we are using a single-node
          configuration:

          copy the new public key to the
            list of authorized keys

          connect and test the ssh setup
            to your localhost (see the sketch below):
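            A minimal sketch of this local trust, assuming a passphrase-less RSA key for the oracle user (the key type and file names are assumptions, not taken from the original post):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh localhost date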
            
            
          We will run a "pseudo-Hadoop
            cluster" in pseudo-distributed mode: all
            the Hadoop Java components run on a single
            node, which is enough for our demo purposes. We need to
            "fine tune" some Hadoop configuration files: go to
            $HADOOP_HOME/conf and modify the files:

core-site.xml
hdfs-site.xml
mapred-site.xml
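            A minimal single-node sketch of these three files; the HDFS port (19000) and the data directory (/u01/hadoop-1.2.1/data) are taken from later in this post, while dfs.replication=1 and the JobTracker port 9001 are standard single-node assumptions, not values from the original configuration.

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:19000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/u01/hadoop-1.2.1/data</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>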
            
            check that the hadoop binaries
              are referenced correctly from the command line by
              executing:
              
hadoop version
            
             As Hadoop manages our
              "clustered" HDFS file system, we have to create "the
              mount point" and format it. The mount point is declared in
              core-site.xml via the hadoop.tmp.dir property (see the sketch above):
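              A sketch of creating that directory, matching the hadoop.tmp.dir value used in the core-site.xml sketch above:

mkdir -p /u01/hadoop-1.2.1/data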
              
              The layout under
                /u01/hadoop-1.2.1/data will be created and used by the
                other Hadoop components (MapReduce uses the /mapred/...
                structure, HDFS uses the /dfs/... layout structure).
              format the Hadoop HDFS file system:
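              In Hadoop 1.x this is done with the namenode format command, run as the user that owns the Hadoop installation:

hadoop namenode -format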
              
              Start the Java components
                for the HDFS system:
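              A sketch of starting the HDFS daemons only (start-all.sh would also start the MapReduce daemons, which are not strictly needed for this demo); jps lists the running Java processes as a quick check:

start-dfs.sh
jps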
                
              As an additional check,
                you can use the Hadoop web UI browsers to check the
                content of your HDFS configuration:
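                With the stock Hadoop 1.x ports (an assumption, since the post does not list them), the NameNode web UI is at http://localhost:50070/ and the JobTracker UI at http://localhost:50030/.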
                
                
                 Once our HDFS
                    Hadoop setup is done, you can use the HDFS
                    file system to store data (big data :) ) and
                    move it back and forth to Oracle databases by
                    means of the Big Data Connectors (which is
                    the next configuration step).
                You can create and use a Hive db, but in our case we
                  will make a simple integration of "raw data",
                  through the creation of an external table in a local
                  Oracle instance (on the same Linux box we run the
                  single-node Hadoop HDFS cluster and one Oracle DB).
              
              Download some public "big
                data". I used the site
                  http://france.meteofrance.com/france/observations,
                from where I can get *.csv files for my big data
                simulations :).
              Here is the data layout of my
                example file:
              
              Download the Big Data
                Connector from OTN (oraosch-2.2.0.zip) and unzip it
                to your local file system (see the sketch below)
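              A sketch of this step; the exact archive layout is an assumption, the goal is simply to end up with the connector installed under /u01/orahdfs-2.2.0, the directory referenced by the scripts below:

unzip oraosch-2.2.0.zip -d /u01
ls /u01/orahdfs-2.2.0/bin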
              
              
        Modify your environment in order
          to access the connector libraries, and make the following
          test:
          
[oracle@dg1 bin]$./hdfs_stream
Usage: hdfs_stream locationFile
[oracle@dg1 bin]$
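        A sketch of the environment variables assumed by the rest of this post; OSCH_HOME is referenced by the t3.sh script below, and adding the connector bin directory to the PATH is only a convenience assumption:

export OSCH_HOME=/u01/orahdfs-2.2.0
export PATH=$PATH:$OSCH_HOME/bin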
        
        
        Load the data into the Hadoop HDFS
          file system:
          
hadoop fs  -mkdir bgtest_data
hadoop  fs  -put obsFrance.txt bgtest_data/obsFrance.txt
hadoop fs  -ls  /user/oracle/bgtest_data/obsFrance.txt       
[oracle@dg1 bg-data-raw]$ hadoop fs -ls  /user/oracle/bgtest_data/obsFrance.txt
Found 1 items
-rw-r--r--   1 oracle supergroup      54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt
[oracle@dg1 bg-data-raw]$hadoop fs -ls  hdfs:///user/oracle/bgtest_data/obsFrance.txt
Found 1 items
-rw-r--r--   1 oracle supergroup      54103 2013-10-22 06:10 /user/oracle/bgtest_data/obsFrance.txt
        
        Check the content of the HDFS with
          the browser UI:
          
        
        Start the Oracle database, and run
          the following script in order to create the Oracle database
          user and the Oracle directories for the Oracle Big Data
          Connector (dg1 is my own database SID; replace it with yours):
          #!/bin/bash
export ORAENV_ASK=NO
export ORACLE_SID=dg1
. oraenv
sqlplus /nolog <<EOF
CONNECT / AS sysdba;
CREATE OR REPLACE DIRECTORY osch_bin_path  AS  '/u01/orahdfs-2.2.0/bin';
CREATE USER BGUSER IDENTIFIED BY oracle;
GRANT CREATE SESSION, CREATE TABLE TO BGUSER;
GRANT EXECUTE ON sys.utl_file TO BGUSER;
GRANT READ, EXECUTE ON DIRECTORY osch_bin_path TO BGUSER;
CREATE OR REPLACE DIRECTORY BGT_LOG_DIR as '/u01/BG_TEST/logs';
GRANT READ, WRITE ON DIRECTORY BGT_LOG_DIR to BGUSER;
CREATE OR REPLACE DIRECTORY BGT_DATA_DIR as '/u01/BG_TEST/data';
GRANT READ, WRITE ON DIRECTORY BGT_DATA_DIR to BGUSER;
EOF
        
        Put the following in a file named t3.sh and make it executable:
hadoop jar $OSCH_HOME/jlib/orahdfs.jar \
oracle.hadoop.exttab.ExternalTable \
-D oracle.hadoop.exttab.tableName=BGTEST_DP_XTAB \
-D oracle.hadoop.exttab.defaultDirectory=BGT_DATA_DIR \
-D oracle.hadoop.exttab.dataPaths="hdfs:///user/oracle/bgtest_data/obsFrance.txt" \
-D oracle.hadoop.exttab.columnCount=7 \
-D oracle.hadoop.connection.url=jdbc:oracle:thin:@//localhost:1521/dg1 \
-D oracle.hadoop.connection.user=BGUSER \
-D oracle.hadoop.exttab.printStackTrace=true \
-createTable  --noexecute
Then test the creation of the external table with it:
[oracle@dg1 samples]$ ./t3.sh
./t3.sh: line 2: /u01/orahdfs-2.2.0: Is a directory
Oracle SQL Connector for HDFS Release 2.2.0 - Production
Copyright (c) 2011, 2013, Oracle and/or its affiliates. All rights reserved.
Enter Database Password:]
The create table command was not executed.
The following table would be created.
CREATE TABLE "BGUSER"."BGTEST_DP_XTAB"
(
 "C1"                             VARCHAR2(4000),
 "C2"                             VARCHAR2(4000),
 "C3"                             VARCHAR2(4000),
 "C4"                             VARCHAR2(4000),
 "C5"                             VARCHAR2(4000),
 "C6"                             VARCHAR2(4000),
 "C7"                             VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "BGT_DATA_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     STRING SIZES ARE IN CHARACTERS
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'2C'
     MISSING FIELD VALUES ARE NULL
     (
       "C1" CHAR(4000),
       "C2" CHAR(4000),
       "C3" CHAR(4000),
       "C4" CHAR(4000),
       "C5" CHAR(4000),
       "C6" CHAR(4000),
       "C7" CHAR(4000)
     )
   )
   LOCATION
   (
     'osch-20131022081035-74-1'
   )
) PARALLEL REJECT LIMIT UNLIMITED;
The following location files would be created.
osch-20131022081035-74-1 contains 1 URI, 54103 bytes
       54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt
        
        Then remove
          the --noexecute flag and create the
          external Oracle table for the Hadoop data.
          Check the results:
          
The create table command succeeded.
CREATE TABLE "BGUSER"."BGTEST_DP_XTAB"
(
 "C1"                             VARCHAR2(4000),
 "C2"                             VARCHAR2(4000),
 "C3"                             VARCHAR2(4000),
 "C4"                             VARCHAR2(4000),
 "C5"                             VARCHAR2(4000),
 "C6"                             VARCHAR2(4000),
 "C7"                             VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( 
   TYPE ORACLE_LOADER
   DEFAULT DIRECTORY "BGT_DATA_DIR"
   ACCESS PARAMETERS
   (
     RECORDS DELIMITED BY 0X'0A'
     CHARACTERSET AL32UTF8
     STRING SIZES ARE IN CHARACTERS
     PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
     FIELDS TERMINATED BY 0X'2C'
     MISSING FIELD VALUES ARE NULL
     (
       "C1" CHAR(4000),
       "C2" CHAR(4000),
       "C3" CHAR(4000),
       "C4" CHAR(4000),
       "C5" CHAR(4000),
       "C6" CHAR(4000),
       "C7" CHAR(4000)
     )
   )
   LOCATION
   (
     'osch-20131022081719-3239-1'
   )
) PARALLEL REJECT LIMIT UNLIMITED;
The following location files were created.
osch-20131022081719-3239-1 contains 1 URI, 54103 bytes
       54103 hdfs://localhost:19000/user/oracle/bgtest_data/obsFrance.txt
          This is the view from SQL Developer:

          And finally, the number of lines in the Oracle table, imported from our Hadoop HDFS cluster:
          
SQL> select count(*) from "BGUSER"."BGTEST_DP_XTAB";
COUNT(*)
----------
      1151
In a future post we will integrate data from a Hive database, and try
some ODI integrations with the ODI Big Data connector. Our simplistic
approach is just a first step to show you how this unstructured-data world
  can be integrated into an Oracle infrastructure.
Hadoop, Big Data and NoSQL are great technologies; they are widely used,
and Oracle offers a large integration infrastructure based on
these services.
Oracle University
presents a complete curriculum on all the Oracle-related technologies:
NoSQL:
   Introduction to Oracle NoSQL Database
  Using Oracle NoSQL Database
    
Big Data:
  Introduction to Big Data
  Oracle Big Data Essentials
  Oracle Big Data Overview
  
Oracle Data Integrator:
  Oracle Data Integrator 12c: New Features
  Oracle Data Integrator 11g: Integration and Administration
  Oracle Data Integrator: Administration and Development
  Oracle Data Integrator 11g: Advanced Integration and Development
Oracle Coherence 12c:
  Oracle Coherence 12c: New Features
  Oracle Coherence 12c: Share and Manage Data in Clusters
Oracle GoldenGate 11g:
  Oracle GoldenGate 11g: Fundamentals for Oracle
  Oracle GoldenGate 11g: Fundamentals for SQL Server
  Oracle GoldenGate 11g Fundamentals for DB2
  Oracle GoldenGate 11g Fundamentals for Teradata
  Oracle GoldenGate 11g Fundamentals for HP NonStop
  Oracle GoldenGate 11g Management Pack: Overview
  Oracle GoldenGate 11g Troubleshooting and Tuning
  Oracle GoldenGate 11g: Advanced Configuration for Oracle
Other Resources:
  Apache Hadoop: http://hadoop.apache.org/ is the homepage
    for these technologies.
  "Hadoop: The Definitive Guide, 3rd Edition" by Tom White
    is a classic read for people who want to know more about
    Hadoop, and some active "googling" will also give you more
    references.
About the author:
Eugene Simos is based in France and joined Oracle
  through the BEA-WebLogic acquisition, where he worked in
  Professional Services, Support and Education for major accounts
  across the EMEA region. He has worked in the banking sector and with
  AT&T and telco companies, giving him extensive experience with production
  environments. Eugene currently specializes in Oracle Fusion
  Middleware, teaching an array of courses on WebLogic/WebCenter,
  Content, BPM/SOA/Identity-Security/GoldenGate/Virtualisation/Unified
  Comm Suite throughout the EMEA region.