Search Results

Search found 28 results on 2 pages for 'outliers'.

Page 2/2 | < Previous Page | 1 2

Fraud Detection with the SQL Server Suite Part 2

- by Dejan Sarka

This is the second part of the fraud detection whitepaper. You can find the first part in my previous blog post about this topic. My Approach to Data Mining Projects It is impossible to evaluate the time and money needed for a complete fraud detection infrastructure in advance. Personally, I do not know the customer’s data in advance. I don’t know whether there is already an existing infrastructure, like a data warehouse, in place, or whether we would need to build one from scratch. Therefore, I always suggest to start with a proof-of-concept (POC) project. A POC takes something between 5 and 10 working days, and involves personnel from the customer’s site – either employees or outsourced consultants. The team should include a subject matter expert (SME) and at least one information technology (IT) expert. The SME must be familiar with both the domain in question as well as the meaning of data at hand, while the IT expert should be familiar with the structure of data, how to access it, and have some programming (preferably Transact-SQL) knowledge. With more than one IT expert the most time consuming work, namely data preparation and overview, can be completed sooner. I assume that the relevant data is already extracted and available at the very beginning of the POC project. If a customer wants to have their people involved in the project directly and requests the transfer of knowledge, the project begins with training. I strongly advise this approach as it offers the establishment of a common background for all people involved, the understanding of how the algorithms work and the understanding of how the results should be interpreted, a way of becoming familiar with the SQL Server suite, and more. Once the data has been extracted, the customer’s SME (i.e. the analyst), and the IT expert assigned to the project will learn how to prepare the data in an efficient manner. Together with me, knowledge and expertise allow us to focus immediately on the most interesting attributes and identify any additional, calculated, ones soon after. By employing our programming knowledge, we can, for example, prepare tens of derived variables, detect outliers, identify the relationships between pairs of input variables, and more, in only two or three days, depending on the quantity and the quality of input data. I favor the customer’s decision of assigning additional personnel to the project. For example, I actually prefer to work with two teams simultaneously. I demonstrate and explain the subject matter by applying techniques directly on the data managed by each team, and then both teams continue to work on the data overview and data preparation under our supervision. I explain to the teams what kind of results we expect, the reasons why they are needed, and how to achieve them. Afterwards we review and explain the results, and continue with new instructions, until we resolve all known problems. Simultaneously with the data preparation the data overview is performed. The logic behind this task is the same – again I show to the teams involved the expected results, how to achieve them and what they mean. This is also done in multiple cycles as is the case with data preparation, because, quite frankly, both tasks are completely interleaved. A specific objective of the data overview is of principal importance – it is represented by a simple star schema and a simple OLAP cube that will first of all simplify data discovery and interpretation of the results, and will also prove useful in the following tasks. The presence of the customer’s SME is the key to resolving possible issues with the actual meaning of the data. We can always replace the IT part of the team with another database developer; however, we cannot conduct this kind of a project without the customer’s SME. After the data preparation and when the data overview is available, we begin the scientific part of the project. I assist the team in developing a variety of models, and in interpreting the results. The results are presented graphically, in an intuitive way. While it is possible to interpret the results on the fly, a much more appropriate alternative is possible if the initial training was also performed, because it allows the customer’s personnel to interpret the results by themselves, with only some guidance from me. The models are evaluated immediately by using several different techniques. One of the techniques includes evaluation over time, where we use an OLAP cube. After evaluating the models, we select the most appropriate model to be deployed for a production test; this allows the team to understand the deployment process. There are many possibilities of deploying data mining models into production; at the POC stage, we select the one that can be completed quickly. Typically, this means that we add the mining model as an additional dimension to an existing DW or OLAP cube, or to the OLAP cube developed during the data overview phase. Finally, we spend some time presenting the results of the POC project to the stakeholders and managers. Even from a POC, the customer will receive lots of benefits, all at the sole risk of spending money and time for a single 5 to 10 day project: The customer learns the basic patterns of frauds and fraud detection The customer learns how to do the entire cycle with their own people, only relying on me for the most complex problems The customer’s analysts learn how to perform much more in-depth analyses than they ever thought possible The customer’s IT experts learn how to perform data extraction and preparation much more efficiently than they did before All of the attendees of this training learn how to use their own creativity to implement further improvements of the process and procedures, even after the solution has been deployed to production The POC output for a smaller company or for a subsidiary of a larger company can actually be considered a finished, production-ready solution It is possible to utilize the results of the POC project at subsidiary level, as a finished POC project for the entire enterprise Typically, the project results in several important “side effects” Improved data quality Improved employee job satisfaction, as they are able to proactively contribute to the central knowledge about fraud patterns in the organization Because eventually more minds get to be involved in the enterprise, the company should expect more and better fraud detection patterns After the POC project is completed as described above, the actual project would not need months of engagement from my side. This is possible due to our preference to transfer the knowledge onto the customer’s employees: typically, the customer will use the results of the POC project for some time, and only engage me again to complete the project, or to ask for additional expertise if the complexity of the problem increases significantly. I usually expect to perform the following tasks: Establish the final infrastructure to measure the efficiency of the deployed models Deploy the models in additional scenarios Through reports By including Data Mining Extensions (DMX) queries in OLTP applications to support real-time early warnings Include data mining models as dimensions in OLAP cubes, if this was not done already during the POC project Create smart ETL applications that divert suspicious data for immediate or later inspection I would also offer to investigate how the outcome could be transferred automatically to the central system; for instance, if the POC project was performed in a subsidiary whereas a central system is available as well Of course, for the actual project, I would repeat the data and model preparation as needed It is virtually impossible to tell in advance how much time the deployment would take, before we decide together with customer what exactly the deployment process should cover. Without considering the deployment part, and with the POC project conducted as suggested above (including the transfer of knowledge), the actual project should still only take additional 5 to 10 days. The approximate timeline for the POC project is, as follows: 1-2 days of training 2-3 days for data preparation and data overview 2 days for creating and evaluating the models 1 day for initial preparation of the continuous learning infrastructure 1 day for presentation of the results and discussion of further actions Quite frequently I receive the following question: are we going to find the best possible model during the POC project, or during the actual project? My answer is always quite simple: I do not know. Maybe, if we would spend just one hour more for data preparation, or create just one more model, we could get better patterns and predictions. However, we simply must stop somewhere, and the best possible way to do this, according to my experience, is to restrict the time spent on the project in advance, after an agreement with the customer. You must also never forget that, because we build the complete learning infrastructure and transfer the knowledge, the customer will be capable of doing further investigations independently and improve the models and predictions over time without the need for a constant engagement with me.

Read the article
Fraud Detection with the SQL Server Suite Part 1

- by Dejan Sarka

While working on different fraud detection projects, I developed my own approach to the solution for this problem. In my PASS Summit 2013 session I am introducing this approach. I also wrote a whitepaper on the same topic, which was generously reviewed by my friend Matija Lah. In order to spread this knowledge faster, I am starting a series of blog posts which will at the end make the whole whitepaper. Abstract With the massive usage of credit cards and web applications for banking and payment processing, the number of fraudulent transactions is growing rapidly and on a global scale. Several fraud detection algorithms are available within a variety of different products. In this paper, we focus on using the Microsoft SQL Server suite for this purpose. In addition, we will explain our original approach to solving the problem by introducing a continuous learning procedure. Our preferred type of service is mentoring; it allows us to perform the work and consulting together with transferring the knowledge onto the customer, thus making it possible for a customer to continue to learn independently. This paper is based on practical experience with different projects covering online banking and credit card usage. Introduction A fraud is a criminal or deceptive activity with the intention of achieving financial or some other gain. Fraud can appear in multiple business areas. You can find a detailed overview of the business domains where fraud can take place in Sahin Y., & Duman E. (2011), Detecting Credit Card Fraud by Decision Trees and Support Vector Machines, Proceedings of the International MultiConference of Engineers and Computer Scientists 2011 Vol 1. Hong Kong: IMECS. Dealing with frauds includes fraud prevention and fraud detection. Fraud prevention is a proactive mechanism, which tries to disable frauds by using previous knowledge. Fraud detection is a reactive mechanism with the goal of detecting suspicious behavior when a fraudster surpasses the fraud prevention mechanism. A fraud detection mechanism checks every transaction and assigns a weight in terms of probability between 0 and 1 that represents a score for evaluating whether a transaction is fraudulent or not. A fraud detection mechanism cannot detect frauds with a probability of 100%; therefore, manual transaction checking must also be available. With fraud detection, this manual part can focus on the most suspicious transactions. This way, an unchanged number of supervisors can detect significantly more frauds than could be achieved with traditional methods of selecting which transactions to check, for example with random sampling. There are two principal data mining techniques available both in general data mining as well as in specific fraud detection techniques: supervised or directed and unsupervised or undirected. Supervised techniques or data mining models use previous knowledge. Typically, existing transactions are marked with a flag denoting whether a particular transaction is fraudulent or not. Customers at some point in time do report frauds, and the transactional system should be capable of accepting such a flag. Supervised data mining algorithms try to explain the value of this flag by using different input variables. When the patterns and rules that lead to frauds are learned through the model training process, they can be used for prediction of the fraud flag on new incoming transactions. Unsupervised techniques analyze data without prior knowledge, without the fraud flag; they try to find transactions which do not resemble other transactions, i.e. outliers. In both cases, there should be more frauds in the data set selected for checking by using the data mining knowledge compared to selecting the data set with simpler methods; this is known as the lift of a model. Typically, we compare the lift with random sampling. The supervised methods typically give a much better lift than the unsupervised ones. However, we must use the unsupervised ones when we do not have any previous knowledge. Furthermore, unsupervised methods are useful for controlling whether the supervised models are still efficient. Accuracy of the predictions drops over time. Patterns of credit card usage, for example, change over time. In addition, fraudsters continuously learn as well. Therefore, it is important to check the efficiency of the predictive models with the undirected ones. When the difference between the lift of the supervised models and the lift of the unsupervised models drops, it is time to refine the supervised models. However, the unsupervised models can become obsolete as well. It is also important to measure the overall efficiency of both, supervised and unsupervised models, over time. We can compare the number of predicted frauds with the total number of frauds that include predicted and reported occurrences. For measuring behavior across time, specific analytical databases called data warehouses (DW) and on-line analytical processing (OLAP) systems can be employed. By controlling the supervised models with unsupervised ones and by using an OLAP system or DW reports to control both, a continuous learning infrastructure can be established. There are many difficulties in developing a fraud detection system. As has already been mentioned, fraudsters continuously learn, and the patterns change. The exchange of experiences and ideas can be very limited due to privacy concerns. In addition, both data sets and results might be censored, as the companies generally do not want to publically expose actual fraudulent behaviors. Therefore it can be quite difficult if not impossible to cross-evaluate the models using data from different companies and different business areas. This fact stresses the importance of continuous learning even more. Finally, the number of frauds in the total number of transactions is small, typically much less than 1% of transactions is fraudulent. Some predictive data mining algorithms do not give good results when the target state is represented with a very low frequency. Data preparation techniques like oversampling and undersampling can help overcome the shortcomings of many algorithms. SQL Server suite includes all of the software required to create, deploy any maintain a fraud detection infrastructure. The Database Engine is the relational database management system (RDBMS), which supports all activity needed for data preparation and for data warehouses. SQL Server Analysis Services (SSAS) supports OLAP and data mining (in version 2012, you need to install SSAS in multidimensional and data mining mode; this was the only mode in previous versions of SSAS, while SSAS 2012 also supports the tabular mode, which does not include data mining). Additional products from the suite can be useful as well. SQL Server Integration Services (SSIS) is a tool for developing extract transform–load (ETL) applications. SSIS is typically used for loading a DW, and in addition, it can use SSAS data mining models for building intelligent data flows. SQL Server Reporting Services (SSRS) is useful for presenting the results in a variety of reports. Data Quality Services (DQS) mitigate the occasional data cleansing process by maintaining a knowledge base. Master Data Services is an application that helps companies maintaining a central, authoritative source of their master data, i.e. the most important data to any organization. For an overview of the SQL Server business intelligence (BI) part of the suite that includes Database Engine, SSAS and SSRS, please refer to Veerman E., Lachev T., & Sarka D. (2009). MCTS Self-Paced Training Kit (Exam 70-448): Microsoft® SQL Server® 2008 Business Intelligence Development and Maintenance. MS Press. For an overview of the enterprise information management (EIM) part that includes SSIS, DQS and MDS, please refer to Sarka D., Lah M., & Jerkic G. (2012). Training Kit (Exam 70-463): Implementing a Data Warehouse with Microsoft® SQL Server® 2012. O'Reilly. For details about SSAS data mining, please refer to MacLennan J., Tang Z., & Crivat B. (2009). Data Mining with Microsoft SQL Server 2008. Wiley. SQL Server Data Mining Add-ins for Office, a free download for Office versions 2007, 2010 and 2013, bring the power of data mining to Excel, enabling advanced analytics in Excel. Together with PowerPivot for Excel, which is also freely downloadable and can be used in Excel 2010, is already included in Excel 2013. It brings OLAP functionalities directly into Excel, making it possible for an advanced analyst to build a complete learning infrastructure using a familiar tool. This way, many more people, including employees in subsidiaries, can contribute to the learning process by examining local transactions and quickly identifying new patterns.

Read the article
ZFS for Database Log Files

- by user12620111

I've been troubled by drop outs in CPU usage in my application server, characterized by the CPUs suddenly going from close to 90% CPU busy to almost completely CPU idle for a few seconds. Here is an example of a drop out as shown by a snippet of vmstat data taken while the application server is under a heavy workload. # vmstat 1 kthr memory page disk faults cpu r b w swap free re mf pi po fr de sr s3 s4 s5 s6 in sy cs us sy id 1 0 0 130160176 116381952 0 16 0 0 0 0 0 0 0 0 0 207377 117715 203884 70 21 9 12 0 0 130160160 116381936 0 25 0 0 0 0 0 0 0 0 0 200413 117162 197250 70 20 9 11 0 0 130160176 116381920 0 16 0 0 0 0 0 0 1 0 0 203150 119365 200249 72 21 7 8 0 0 130160176 116377808 0 19 0 0 0 0 0 0 0 0 0 169826 96144 165194 56 17 27 0 0 0 130160176 116377800 0 16 0 0 0 0 0 0 0 0 1 10245 9376 9164 2 1 97 0 0 0 130160176 116377792 0 16 0 0 0 0 0 0 0 0 2 15742 12401 14784 4 1 95 0 0 0 130160176 116377776 2 16 0 0 0 0 0 0 1 0 0 19972 17703 19612 6 2 92 14 0 0 130160176 116377696 0 16 0 0 0 0 0 0 0 0 0 202794 116793 199807 71 21 8 9 0 0 130160160 116373584 0 30 0 0 0 0 0 0 18 0 0 203123 117857 198825 69 20 11 This behavior occurred consistently while the application server was processing synthetic transactions: HTTP requests from JMeter running on an external machine. I explored many theories trying to explain the drop outs, including: Unexpected JMeter behavior Network contention Java Garbage Collection Application Server thread pool problems Connection pool problems Database transaction processing Database I/O contention Graphing the CPU %idle led to a breakthrough: Several of the drop outs were 30 seconds apart. With that insight, I went digging through the data again and looking for other outliers that were 30 seconds apart. In the database server statistics, I found spikes in the iostat "asvc_t" (average response time of disk transactions, in milliseconds) for the disk drive that was being used for the database log files. Here is an example: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 2053.6 0.0 8234.3 0.0 0.2 0.0 0.1 0 24 c3t60080E5...F4F6d0s0 0.0 2162.2 0.0 8652.8 0.0 0.3 0.0 0.1 0 28 c3t60080E5...F4F6d0s0 0.0 1102.5 0.0 10012.8 0.0 4.5 0.0 4.1 0 69 c3t60080E5...F4F6d0s0 0.0 74.0 0.0 7920.6 0.0 10.0 0.0 135.1 0 100 c3t60080E5...F4F6d0s0 0.0 568.7 0.0 6674.0 0.0 6.4 0.0 11.2 0 90 c3t60080E5...F4F6d0s0 0.0 1358.0 0.0 5456.0 0.0 0.6 0.0 0.4 0 55 c3t60080E5...F4F6d0s0 0.0 1314.3 0.0 5285.2 0.0 0.7 0.0 0.5 0 70 c3t60080E5...F4F6d0s0 Here is a little more information about my database configuration: The database and application server were running on two different SPARC servers. Storage for the database was on a storage array connected via 8 gigabit Fibre Channel Data storage and log file were on different physical disk drives Reliable low latency I/O is provided by battery backed NVRAM Highly available: Two Fibre Channel links accessed via MPxIO Two Mirrored cache controllers The log file physical disks were mirrored in the storage device Database log files on a ZFS Filesystem with cutting-edge technologies, such as copy-on-write and end-to-end checksumming Why would I be getting service time spikes in my high-end storage? First, I wanted to verify that the database log disk service time spikes aligned with the application server CPU drop outs, and they did: At first, I guessed that the disk service time spikes might be related to flushing the write through cache on the storage device, but I was unable to validate that theory. After searching the WWW for a while, I decided to try using a separate log device: # zpool add ZFS-db-41 log c3t60080E500017D55C000015C150A9F8A7d0 The ZFS log device is configured in a similar manner as described above: two physical disks mirrored in the storage array. This change to the database storage configuration eliminated the application server CPU drop outs: Here is the zpool configuration: # zpool status ZFS-db-41 pool: ZFS-db-41 state: ONLINE scan: none requested config: NAME STATE ZFS-db-41 ONLINE c3t60080E5...F4F6d0 ONLINE logs c3t60080E5...F8A7d0 ONLINE Now, the I/O spikes look like this: extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1053.5 0.0 4234.1 0.0 0.8 0.0 0.7 0 75 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1131.8 0.0 4555.3 0.0 0.8 0.0 0.7 0 76 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1167.6 0.0 4682.2 0.0 0.7 0.0 0.6 0 74 c3t60080E5...F8A7d0s0 0.0 162.2 0.0 19153.9 0.0 0.7 0.0 4.2 0 12 c3t60080E5...F4F6d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1247.2 0.0 4992.6 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0 0.0 41.0 0.0 70.0 0.0 0.1 0.0 1.6 0 2 c3t60080E5...F4F6d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1241.3 0.0 4989.3 0.0 0.8 0.0 0.6 0 75 c3t60080E5...F8A7d0s0 extended device statistics r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device 0.0 1193.2 0.0 4772.9 0.0 0.7 0.0 0.6 0 71 c3t60080E5...F8A7d0s0 We can see the steady flow of 4k writes to the ZIL device from O_SYNC database log file writes. The spikes are from flushing the transaction group. Like almost all problems that I run into, once I thoroughly understand the problem, I find that other people have documented similar experiences. Thanks to all of you who have documented alternative approaches. Saved for another day: now that the problem is obvious, I should try "zfs:zfs_immediate_write_sz" as recommended in the ZFS Evil Tuning Guide. References: The ZFS Intent Log Solaris ZFS, Synchronous Writes and the ZIL Explained ZFS Evil Tuning Guide: Cache Flushes ZFS Evil Tuning Guide: Tuning ZFS for Database Performance

Read the article

< Previous Page | 1 2