Extending Oracle CEP with Predictive Analytics

Posted by vikram.shukla(at)oracle.com on Oracle Blogs See other posts from Oracle Blogs or by vikram.shukla(at)oracle.com
Published on Thu, 10 Feb 2011 13:36:16 -0800 Indexed on 2011/02/10 23:30 UTC
Read the original article Hit count: 433

Filed under:
Introduction:

OCEP is often used as a business rules engine to execute a set of business logic rules via CQL statements, and take decisions based on the outcome of those rules. There are times where configuring rules manually is sufficient because an application needs to deal with only a small and well-defined set of static rules. However, in many situations customers don't want to pre-define such rules for two reasons. First, they are dealing with events with lots of columns and manually crafting such rules for each column or a set of columns and combinations thereof is almost impossible. Second, they are content with probabilistic outcomes and do not care about 100% precision. The former is the case when a user is dealing with data with high dimensionality, the latter when an application can live with "false" positives as they can be discarded after further inspection, say by a Human Task component in a Business Process Management software.

The primary goal of this blog post is to show how this can be achieved by combining OCEP with Oracle Data Mining® and leveraging the latter's rich set of algorithms and functionality to do predictive analytics in real time on streaming events. The secondary goal of this post is also to show how OCEP can be extended to invoke any arbitrary external computation in an RDBMS from within CEP. The extensible facility is known as the JDBC cartridge.

The rest of the post describes the steps required to achieve this:

We use the dataset available at http://blogs.oracle.com/datamining/2010/01/fraud_and_anomaly_detection_made_simple.html to showcase the capabilities. We use it to show how transaction anomalies or fraud can be detected.

Building the model:

Follow the self-explanatory steps described at the above URL to build the model.  It is very simple - it uses built-in Oracle Data Mining PL/SQL packages to cleanse, normalize and build the model out of the dataset.  You can also use graphical Oracle Data Miner®  to build the models.

To summarize, it involves:

  • Specifying which algorithms to use. In this case we use Support Vector Machines as we're trying to find anomalies in highly dimensional dataset.
  • Build model on the data in the table for the algorithms specified.

For this example, the table was populated in the scott/tiger schema with appropriate privileges.

Configuring the Data Source:

This is the first step in building CEP application using such an integration.  Our datasource looks as follows in the server config file.  It is advisable that you use the Visualizer to add it to the running server dynamically, rather than manually edit the file.

   <data-source>
        <name>DataMining</name>
        <data-source-params>
            <jndi-names>
                <element>DataMining</element>
            </jndi-names>
            <global-transactions-protocol>OnePhaseCommit</global-transactions-protocol>
        </data-source-params>
        <connection-pool-params>
            <credential-mapping-enabled></credential-mapping-enabled>
            <test-table-name>SQL SELECT 1 from DUAL</test-table-name>
            <initial-capacity>1</initial-capacity>
            <max-capacity>15</max-capacity>
            <capacity-increment>1</capacity-increment>
        </connection-pool-params>
        <driver-params>
            <use-xa-data-source-interface>true</use-xa-data-source-interface>
            <driver-name>oracle.jdbc.OracleDriver</driver-name>
            <url>jdbc:oracle:thin:@localhost:1522:orcl</url>
            <properties>
                <element>
                    <value>scott</value>
                    <name>user</name>
                </element>
                <element>
                    <value>{Salted-3DES}AzFE5dDbO2g=</value>
                    <name>password</name>
                </element>
                                <element>
                    <name>com.bea.core.
datasource.serviceName</name>
                    <value>oracle11.2g</value>
                </element>
                <element>
                    <name>com.bea.core.
datasource.serviceVersion</name>
                    <value>11.2.0</value>
                </element>
                <element>
                    <name>com.bea.core.
datasource.serviceObjectClass</name>
                    <value>java.sql.Driver</value>
                </element>
            </properties>
        </driver-params>
    </data-source>

 

Designing the EPN:

The EPN is very simple in this example.

Thumbnail image for dm-epn.jpg





We briefly describe each of the components.

  • The adapter ("DataMiningAdapter") reads data from a .csv file and sends it to the CQL processor downstream. The event payload here is same as that of the table in the database (refer to the attached project or do a "desc table-name" from a SQL*PLUS prompt). While this is for convenience in this example, it need not be the case. One can still omit fields in the streaming events, and need not match all columns in the table on which the model was built. Better yet, it does not even need to have the same name as columns in the table, as long as you alias them in the USING clause of the mining function. (Caveat: they still need to draw values from a similar universe or domain, otherwise it constitutes incorrect usage of the model).
  • There are two things in the CQL processor ("DataMiningProc") that make scoring possible on streaming events.

1.      User defined cartridge function

Please refer to the OCEP CQL reference manual to find more details about how to define such functions. We include the function below in its entirety for illustration.

<?xml version="1.0" encoding="UTF-8"?>

<jdbcctxconfig:config

    xmlns:jdbcctxconfig="http://www.bea.com/ns/wlevs/config/application"

    xmlns:jc="http://www.oracle.com/ns/ocep/config/jdbc">

       <jc:jdbc-ctx>

        <name>Oracle11gR2</name>

        <data-source>DataMining</data-source>

              <function name="prediction2">           

                     <param name="CQLMONTH" type="char"/>

                     <param name="WEEKOFMONTH" type="int"/>

                     <param name="DAYOFWEEK" type="char" />

                     <param name="MAKE" type="char" />

                     <param name="ACCIDENTAREA"   type="char" />

                     <param name="DAYOFWEEKCLAIMED"  type="char" />

                     <param name="MONTHCLAIMED" type="char" />

                     <param name="WEEKOFMONTHCLAIMED" type="int" />

                     <param name="SEX" type="char" />

                     <param name="MARITALSTATUS"   type="char" />

                     <param name="AGE" type="int" />

                     <param name="FAULT" type="char" />

                     <param name="POLICYTYPE"   type="char" />

                     <param name="VEHICLECATEGORY"  type="char" />

                     <param name="VEHICLEPRICE" type="char" />

                     <param name="FRAUDFOUND" type="int" />

                     <param name="POLICYNUMBER" type="int" />

                     <param name="REPNUMBER" type="int" />

                     <param name="DEDUCTIBLE"   type="int" />

                     <param name="DRIVERRATING"  type="int" />

                     <param name="DAYSPOLICYACCIDENT"   type="char" />

                     <param name="DAYSPOLICYCLAIM" type="char" />

                     <param name="PASTNUMOFCLAIMS" type="char" />

                     <param name="AGEOFVEHICLES" type="char" />

                     <param name="AGEOFPOLICYHOLDER" type="char" />

                     <param name="POLICEREPORTFILED" type="char" />

                     <param name="WITNESSPRESNT" type="char" />

                     <param name="AGENTTYPE" type="char" />

                     <param name="NUMOFSUPP" type="char" />

                     <param name="ADDRCHGCLAIM"   type="char" />

                     <param name="NUMOFCARS" type="char" />

                     <param name="CQLYEAR" type="int" />

                     <param name="BASEPOLICY" type="char" />                      

              <return-component-type>char</return-component-type>

                    

                    

           <sql><![CDATA[

            SELECT to_char(PREDICTION_PROBABILITY(CLAIMSMODEL, '0' USING *))

              AS probability

            FROM (SELECT  :CQLMONTH AS MONTH,                 

                          :WEEKOFMONTH AS WEEKOFMONTH,

                          :DAYOFWEEK AS DAYOFWEEK,

                          :MAKE AS MAKE,

                          :ACCIDENTAREA AS ACCIDENTAREA,

                          :DAYOFWEEKCLAIMED AS DAYOFWEEKCLAIMED,

                          :MONTHCLAIMED AS MONTHCLAIMED,

                          :WEEKOFMONTHCLAIMED,  

                          :SEX AS SEX,

                          :MARITALSTATUS AS MARITALSTATUS, 

                          :AGE AS AGE,

                          :FAULT AS FAULT,

                          :POLICYTYPE AS POLICYTYPE, 

                          :VEHICLECATEGORY AS VEHICLECATEGORY,

                          :VEHICLEPRICE AS VEHICLEPRICE,

                          :FRAUDFOUND AS FRAUDFOUND,

                          :POLICYNUMBER AS POLICYNUMBER,

                          :REPNUMBER AS REPNUMBER,

                          :DEDUCTIBLE AS DEDUCTIBLE, 

                          :DRIVERRATING AS DRIVERRATING,

                          :DAYSPOLICYACCIDENT AS DAYSPOLICYACCIDENT, 

                          :DAYSPOLICYCLAIM AS DAYSPOLICYCLAIM,

                          :PASTNUMOFCLAIMS AS PASTNUMOFCLAIMS,

                          :AGEOFVEHICLES AS AGEOFVEHICLES,

                          :AGEOFPOLICYHOLDER AS AGEOFPOLICYHOLDER,

                          :POLICEREPORTFILED AS POLICEREPORTFILED,

                          :WITNESSPRESNT AS WITNESSPRESENT,

                          :AGENTTYPE AS AGENTTYPE,

                          :NUMOFSUPP AS NUMOFSUPP,

                          :ADDRCHGCLAIM AS ADDRCHGCLAIM, 

                          :NUMOFCARS AS NUMOFCARS,

                          :CQLYEAR AS YEAR,

                          :BASEPOLICY AS BASEPOLICY

                FROM dual)

                ]]>

        </sql>

       </function>

    </jc:jdbc-ctx>

</jdbcctxconfig:config>

2.      Invoking the function for each event.

Once this function is defined, you can invoke it from CQL as follows:

<?xml version="1.0" encoding="UTF-8"?>

<wlevs:config xmlns:wlevs="http://www.bea.com/ns/wlevs/config/application">

 

<processor>

    <name>DataMiningProc</name>

    <rules>

       <query id="q1"><![CDATA[

                    ISTREAM(SELECT S.CQLMONTH,

                                   S.WEEKOFMONTH,

                                   S.DAYOFWEEK, S.MAKE,

                                   :     

                                   S.BASEPOLICY,

                                   C.F AS probability

                                                 FROM

                                StreamDataChannel [NOW] AS S,

                                TABLE(prediction2@Oracle11gR2(S.CQLMONTH,

                                      S.WEEKOFMONTH,

                                      S.DAYOFWEEK

                                      S.MAKE, ...,

                                      S.BASEPOLICY) AS F of char) AS C)

                      ]]></query>

                </rules>

              </processor>

          </wlevs:config>

 

  • Finally, the last stage in the EPN prints out the probability of the event being an anomaly. One can also define a threshold in CQL to filter out events that are normal, i.e., below a certain mark as defined by the analyst or designer.

Sample Runs:

Now let's see how this behaves when events are streamed through CEP. We use only two events for brevity, one normal and other one not.

This is one of the "normal" looking events and the probability of it being anomalous is less than 60%.

Event is: eventType=DataMiningOutEvent object=q1  time=2904821976256 S.CQLMONTH=Dec, S.WEEKOFMONTH=5, S.DAYOFWEEK=Wednesday, S.MAKE=Honda, S.ACCIDENTAREA=Urban, S.DAYOFWEEKCLAIMED=Tuesday, S.MONTHCLAIMED=Jan, S.WEEKOFMONTHCLAIMED=1, S.SEX=Female, S.MARITALSTATUS=Single, S.AGE=21, S.FAULT=Policy Holder, S.POLICYTYPE=Sport - Liability, S.VEHICLECATEGORY=Sport, S.VEHICLEPRICE=more than 69000, S.FRAUDFOUND=0, S.POLICYNUMBER=1, S.REPNUMBER=12, S.DEDUCTIBLE=300, S.DRIVERRATING=1, S.DAYSPOLICYACCIDENT=more than 30, S.DAYSPOLICYCLAIM=more than 30, S.PASTNUMOFCLAIMS=none, S.AGEOFVEHICLES=3 years, S.AGEOFPOLICYHOLDER=26 to 30, S.POLICEREPORTFILED=No, S.WITNESSPRESENT=No, S.AGENTTYPE=External, S.NUMOFSUPP=none, S.ADDRCHGCLAIM=1 year, S.NUMOFCARS=3 to 4, S.CQLYEAR=1994, S.BASEPOLICY=Liability, probability=.58931702982118561 isTotalOrderGuarantee=true\nAnamoly probability: .58931702982118561

However, the following event is scored as an anomaly with a very high probability of  89%. So there is likely to be something wrong with it. A close look reveals that the value of "deductible" field (10000) is not "normal". What exactly constitutes normal here?. If you run the query on the database to find ALL distinct values for the "deductible" field, it returns the following set: {300, 400, 500, 700}

Event is: eventType=DataMiningOutEvent object=q1  time=2598483773496 S.CQLMONTH=Dec, S.WEEKOFMONTH=5, S.DAYOFWEEK=Wednesday, S.MAKE=Honda, S.ACCIDENTAREA=Urban, S.DAYOFWEEKCLAIMED=Tuesday, S.MONTHCLAIMED=Jan, S.WEEKOFMONTHCLAIMED=1, S.SEX=Female, S.MARITALSTATUS=Single, S.AGE=21, S.FAULT=Policy Holder, S.POLICYTYPE=Sport - Liability, S.VEHICLECATEGORY=Sport, S.VEHICLEPRICE=more than 69000, S.FRAUDFOUND=0, S.POLICYNUMBER=1, S.REPNUMBER=12, S.DEDUCTIBLE=10000, S.DRIVERRATING=1, S.DAYSPOLICYACCIDENT=more than 30, S.DAYSPOLICYCLAIM=more than 30, S.PASTNUMOFCLAIMS=none, S.AGEOFVEHICLES=3 years, S.AGEOFPOLICYHOLDER=26 to 30, S.POLICEREPORTFILED=No, S.WITNESSPRESENT=No, S.AGENTTYPE=External, S.NUMOFSUPP=none, S.ADDRCHGCLAIM=1 year, S.NUMOFCARS=3 to 4, S.CQLYEAR=1994, S.BASEPOLICY=Liability, probability=.89171554529576691 isTotalOrderGuarantee=true\nAnamoly probability: .89171554529576691

Conclusion:

By way of this example, we show:

  • real-time scoring of events as they flow through CEP leveraging Oracle Data Mining.
  • how CEP applications can invoke complex arbitrary external computations (function shipping) in an RDBMS.

© Oracle Blogs or respective owner