Search Results

Search found 266 results on 11 pages for 'etl'.

Page 1/11 | 1 2 3 4 5 6 7 8 9 10 11  | Next Page >

  • SQL SERVER – Introduction to Adaptive ETL Tool – How adaptive is your ETL?

    - by pinaldave
    I am often reminded by the fact that BI/data warehousing infrastructure is very brittle and not very adaptive to change. There are lots of basic use cases where data needs to be frequently loaded into SQL Server or another database. What I have found is that as long as the sources and targets stay the same, SSIS or any other ETL tool for that matter does a pretty good job handling these types of scenarios. But what happens when you are faced with more challenging scenarios, where the data formats and possibly the data types of the source data are changing from customer to customer?  Let’s examine a real life situation where a health management company receives claims data from their customers in various source formats. Even though this company supplied all their customers with the same claims forms, they ended up building one-off ETL applications to process the claims for each customer. Why, you ask? Well, it turned out that the claims data from various regional hospitals they needed to process had slightly different data formats, e.g. “integer” versus “string” data field definitions.  Moreover the data itself was represented with slight nuances, e.g. “0001124” or “1124” or “0000001124” to represent a particular account number, which forced them, as I eluded above, to build new ETL processes for each customer in order to overcome the inconsistencies in the various claims forms.  As a result, they experienced a lot of redundancy in these ETL processes and recognized quickly that their system would become more difficult to maintain over time. So imagine for a moment that you could use an ETL tool that helps you abstract the data formats so that your ETL transformation process becomes more reusable. Imagine that one claims form represents a data item as a string – acc_no(varchar) – while a second claims form represents the same data item as an integer – account_no(integer). This would break your traditional ETL process as the data mappings are hard-wired.  But in a world of abstracted definitions, all you need to do is create parallel data mappings to a common data representation used within your ETL application; that is, map both external data fields to a common attribute whose name and type remain unchanged within the application. acc_no(varchar) is mapped to account_number(integer) expressor Studio first claim form schema mapping account_no(integer) is also mapped to account_number(integer) expressor Studio second claim form schema mapping All the data processing logic that follows manipulates the data as an integer value named account_number. Well, these are the kind of problems that that the expressor data integration solution automates for you.  I’ve been following them since last year and encourage you to check them out by downloading their free expressor Studio ETL software. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Business Intelligence, Pinal Dave, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology Tagged: ETL, SSIS

    Read the article

  • SQL SERVER – 4 Tips for ETL Software IDE Developers

    - by pinaldave
    In a previous blog, I introduced the notion of Semantic Types. To an end-user, a seamlessly integrated semantic typing engine significantly increases the ease of use of an ETL IDE (integrated development environment, or developer studio). This led me to think about other ease-of-use issues I have encountered while building ETL applications. When I get stumped while programming, I find myself asking the variations on these questions: “How do I…?” “Now what?” “Why isn’t this working?” “Why do I have to redo the work I just did?” It seems to me that a good ETL IDE will anticipate these questions and seek to answer them before they are even asked. So here are my tips to help software vendors build developer IDEs that actually make development easier. How do I…? While developing an ETL application, have you ever asked yourself: “How do I set up the connection to my SQL Server database?”,“How do I import my table definitions from Access?”, etc. An easy answer might be “read the manual” but sometimes product manuals are not robust or easily accessible. So, integrating robust how-to instructions directly into your ETLstudio would help users get the information they need at the time they need it. Now what? IDEs in general know where you last clicked or performed an action using an input device such as a keyboard; so they should be able to reasonably predict the design context you are in and suggest the next steps accordingly. Context-sensitive suggestions based on the state of the user’s work will help users move forward in ETL application development. Why isn’t this working? Or why do I have to wait till I compile to be told about a critical design issue? If an ETL IDE is smart enough to signal to users what in their design structures is left to be completed or has been completed incorrectly, then the developer can spend much less time in the designàcompileàerror-correct loop. Just-in-time validation helps users detect and correct programming errors earlier in the ETL development life cycle. Why do I have to redo the work I just did? In ETL development, schemas, transformation rules, connectivity objects, etc., can be reused in various situations. Using mouse-clicks to build and manage libraries of reusable design objects implies that the application development effort should decrease over time and as the library acquires more objects. I met a great company at SQL Pass that is trying to address many of these usability issues. Check them out at www.expressor-software.com. What other ease-of-use suggestions do you have for ETL software vendors? Please post your valuable comments. ?Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Best Practices, Pinal Dave, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology Tagged: ETL

    Read the article

  • ETL Software Research Question

    - by WernerCD
    Where I work, we use an in-house ETL solution that's homegrown and has been around for 5-10 years. I'm still new to my data analysis job, but I was wondering about the ETL tools that are out there. This is a new area for me. My situation, and job, is basically digging in a set of databases (DB2, SQL2005, Citrix, Ancient Cobol Database with a SQL Wrapper on top, MySQL, etc). Gather the desired information. combine the different datasets into one set. output into a file of choice (CSV, Tab Separated, Pipe Separated, XLS, etc). FTP to customer. I guess what my real question is, given my job, what are some good ETL suites that I can look at and compare to my in-house tools? This is more to research some other options. Ultimately, I'd either suggest a new solution or get options/ideas to improve our current app.

    Read the article

  • Designing a Content-Based ETL Process with .NET and SFDC

    - by Patrick
    As my firm makes the transition to using SFDC as our main operational system, we've spun together a couple of SFDC portals where we can post customer-specific documents to be viewed at will. As such, we've had the need for pseudo-ETL applications to be implemented that are able to extract metadata from the documents our analysts generate internally (most are industry-standard PDFs, XML, or MS Office formats) and place in networked "queue" folders. From there, our applications scoop of the queued documents and upload them to the appropriate SFDC CRM Content Library along with some select pieces of metadata. I've mostly used DbAmp to broker communication with SFDC (DbAmp is a Linked Server provider that allows you to use SQL conventions to interact with your SFDC Org data). I've been able to create [console] applications in C# that work pretty well, and they're usually structured something like this: static void Main() { // Load parameters from app.config. // Get documents from queue. var files = someInterface.GetFiles(someFilterOrRegexPattern); foreach (var file in files) { // Extract metadata from the file. // Validate some attributes of the file; add any validation errors to an in-memory // structure (e.g. List<ValidationErrors>). if (isValid) { var fileData = File.ReadAllBytes(file); // Upload using some wrapper for an ORM or DAL someInterface.Upload(fileData, meta.Param1, meta.Param2, ...); } else { // Bounce the file } } // Report any validation errors (via message bus or SMTP or some such). } And that's pretty much it. Most of the time I wrap all these operations in a "Worker" class that takes the needed interfaces as constructor parameters. This approach has worked reasonably well, but I just get this feeling in my gut that there's something awful about it and would love some feedback. Is writing an ETL process as a C# Console app a bad idea? I'm also wondering if there are some design patterns that would be useful in this scenario that I'm clearly overlooking. Thanks in advance!

    Read the article

  • Designing Content-Based ETL Process with .NET and SFDC

    - by Patrick
    As my firm makes the transition to using SFDC as our main operational system, we've spun together a couple of SFDC portals where we can post customer-specific documents to be viewed at will. As such, we've had the need for pseudo-ETL applications to be implemented that are able to extract metadata from the documents our analysts generate internally (most are industry-standard PDFs, XML, or MS Office formats) and place in networked "queue" folders. From there, our applications scoop of the queued documents and upload them to the appropriate SFDC CRM Content Library along with some select pieces of metadata. I've mostly used DbAmp to broker communication with SFDC (DbAmp is a Linked Server provider that allows you to use SQL conventions to interact with your SFDC Org data). I've been able to create [console] applications in C# that work pretty well, and they're usually structured something like this: static void Main() { // Load parameters from app.config. // Get documents from queue. var files = someInterface.GetFiles(someFilterOrRegexPattern); foreach (var file in files) { // Extract metadata from the file. // Validate some attributes of the file; add any validation errors to an in-memory // structure (e.g. List<ValidationErrors>). if (isValid) { // Upload using some wrapper for an ORM an someInterface.Upload(meta.Param1, meta.Param2, ...); } else { // Bounce the file } } // Report any validation errors (via message bus or SMTP or some such). } And that's pretty much it. Most of the time I wrap all these operations in a "Worker" class that takes the needed interfaces as constructor parameters. This approach has worked reasonably well, but I just get this feeling in my gut that there's something awful about it and would love some feedback. Is writing an ETL process as a C# Console app a bad idea? I'm also wondering if there are some design patterns that would be useful in this scenario that I'm clearly overlooking. Thanks in advance!

    Read the article

  • TimeStamp and mini-ETL (extract, transform, load)

    - by Tomaz.tsql
    Short example how to use Timestamp for a mini ETL process of your data. example below is following: Table_1 is production table on server1 Table_2 is datawarehouse table on server2 where datawarehouse is located Every day data are extracted, transformed and loaded to dataware house for further off-line usage and data analysis and business decision support. 1. Creating the environment if object_id ('table_1') is not null drop table table_1; go if object_id ('table_2') is not null drop...(read more)

    Read the article

  • How to automate a monitoring system for ETL runs

    - by Jeffrey McDaniel
    Upon completion of the Primavera ETL process there are a few ways to determine if the process finished successfully.  First, in the <installation directory>\log folder,  there is a staretlprocess.log and staretl.html files. These files will give the output results of the ETL run. The staretl.html file will give a detailed summary of each step of the process, its run time, and its status. The .log file, based on the logging level set in the Configuration tool, can give extensive information about the ETL process. The log file can be used as a validation for process completion.  To automate the monitoring of these log files, perform the following steps: 1. Write a custom application to parse through the log file and search for [ERROR] . In most cases,  a major [ERROR] could cause the ETL process to fail. Searching the log and finding this value is worthy of an alert. 2. Determine the total number of steps in the ETL process, and validate that the log file recorded and entry for the final step.  For example validate that your log file contains an entry for Step 39/39 (could be different based on the version you are running). If there is no Step 39/39, then either the process is taking longer than expected or it didn't make it to the end.  Either way this would be a good cause for an alert. 3. Check the last line in the log file. The last line of the log file should contain an indication that the ETL run completed successfully. For example, the last line of a log file will say (results could be different based on Reporting Database versions):   [INFO] (Message) Finished Writing Report 4. You could write an Ant script to execute the ETL process and have it set to - failonerror="true" - and from there send results to an external tool to monitor the jobs, send to email, or send to database. With each ETL run, the log file appends to the existing log file by default. Because of this behavior, I would recommend renaming the existing log files before running a new ETL process. By doing this,  only log entries for the currently running ETL process is recorded in the new log files. Based on these log entries, alerts can be setup to notify the administrator or DBA. Another way to determine if the ETL process has completed successfully is to monitor the etl_processmaster table.  Depending on the Reporting Database version this could be in the Stage or Star databases. As of Reporting Database 2.2 and higher this would be in the Star database.  The etl_processmaster table records entries for the ETL run along with a Start and Finish time.  If the ETl process has failed the Finish date should be null. This table can be queried at a time when ETL process is expected to be finished and if null send an alert.  These are just some options. There are additional ways this can be accomplished based around these two areas - log files or database. Here is an additional query to gather more information about your ETL run (connect as Staruser): SELECT SYSDATE,test_script,decode(loc, 0, PROCESSNAME, trim(SUBSTR(PROCESSNAME, loc+1))) PROCESSNAME ,duration duration from ( select (e.endtime - b.starttime) * 1440 duration, to_char(b.starttime, 'hh24:mi:ss') starttime, to_char(e.endtime, 'hh24:mi:ss') endtime,  b.PROCESSNAME, instr(b.PROCESSNAME, ']') loc, b.infotype test_script from ( select processid, infodate starttime, PROCESSNAME, INFOMSG, INFOTYPE from etl_processinfo  where processid = (select max(PROCESSID) from etl_processinfo) and infotype = 'BEGIN' ) b  inner Join ( select processid, infodate endtime, PROCESSNAME, INFOMSG, INFOTYPE from etl_processinfo  where processid = (select max(PROCESSID) from etl_processinfo) and infotype = 'END' ) e on b.processid = e.processid  and b.PROCESSNAME = e.PROCESSNAME order by b.starttime)

    Read the article

  • SSIS Design Patterns Training in London 8-11 Sep!

    - by andyleonard
    A few seats remain for my course SQL Server Integration Services 2012 Design Patterns to be delivered in London 8-11 Sep 2014. Register today to learn more about: New features in SSIS 2012 and 2014 Advanced patterns for loading data warehouses Error handling The (new) Project Deployment Model Scripting in SSIS The (new) SSIS Catalog Designing custom SSIS tasks Executing, managing, monitoring, and administering SSIS in the enterprise Business Intelligence Markup Language (Biml) BimlScript ETL Instrumentation...(read more)

    Read the article

  • Great Discussion of ETL and ELT Tooling in TDWI Linkedin Group

    - by antonio romero
    All, There’s a great discussion of ETL and ELT tooling going on in the official TDWI Linkedin group, under the heading “How Sustainable is SQL for ETL?” It delves into a wide range of topics: The pros and cons of handcoding vs. using tools to design ETL ETL (with separate transformation engines) vs. ELT (transforms in the database) and push-down solutions The future of ETL and data warehousing products A number of community members (of varying affiliations) have kept this conversation going for many months, and are learning from each other as they go. So check it out… Also, while you’re on Linkedin, join the Oracle ETL/Data Integration Linkedin group (for both OWB and ODI users), which recently passed the 2000 member mark.

    Read the article

  • SQL SERVER – Sharing your ETL Resources Across Applications with Ease

    - by pinaldave
    Frequently an organization will find that the same resources are used in multiple ETL applications, for example, the same database, general purpose processing logic, or file system locations.  Creating an easy way to reuse these resources across multiple applications would increase efficiency and reduce errors.  Moreover, not every ETL developer has the same skill set, and it is likely that one developer will be more adept at writing code while another is more comfortable configuring database connections.  Real productivity gains will come when these developers are able to work independently while still making their work available to others assigned to the same project.  These are the benefits of a centralized version control system. Of course, most version control systems could be used to store and serve files, but the real need is to store and serve entire ETL applications so that each developer’s ongoing work can immediately benefit from another developer’s completed work.  In other words, the version control system needs to be tightly integrated with the tools used to develop the ETL application. The following screen shot shows such a tool. Desktop ETL tool that tightly integrates with a central version control system Developers can checkout or commit entire projects or just a single artifact.  Each artifact may be managed independently so if you need to go back to an earlier version of one artifact, changes you may have made to other artifacts are not lost.  By being tightly integrated into the graphical environment used to create and edit the project artifacts, it is extremely easy and straight-forward to move your files to and from the version control system and there is no dependency on another vendor’s version control system.  The built in version control system is optimized for managing the artifacts of ETL applications. It is equally important that the version control system supports all of the actions one typically performs such as rollbacks, locking and unlocking of files, and the ability to resolve conflicts.  Note that this particular ETL tool also has the capability to switch back and forth between multiple version control systems. It also needs to be easy to determine the status of an artifact.  Not just that it has been committed or modified, but when and by whom.  Generally you must query the version control system for this information, but having it displayed within the development environment is more desirable. Who’s ETL tool works in this fashion?  Last month I mentioned the data integration solution offered by expressor software.  The version control features I described in this post are all available in their just released expressor 3.1 Standard Edition through the integration of their expressor Studio development environment with a centralized metadata repository and version control system. You can download their Studio application, which is free, or evaluate the full Standard Edition on your own hardware.  It may be worth your time. Reference: Pinal Dave (http://blog.SQLAuthority.com) Filed under: Pinal Dave, PostADay, SQL, SQL Authority, SQL Query, SQL Server, SQL Tips and Tricks, T SQL, Technology

    Read the article

  • ETL Tools and Build Tools

    - by Ngu Soon Hui
    I have familiarities with software automated build tools ( such as Automated Build Studio). Now I am looking at ETL tools. The one thing crosses my mind is that, I can do anything I can do in ETL tools by using a software build tool. ETL tools are tailored for data loading and manipulation for which a lot of scripts are needed in order to do the job. Software build tool, on the other hand, is versatile enough to do any jobs, including writing scripts to extract, transform and load any data from any format into any format. Am I right?

    Read the article

  • Loop Control within a DataflowTask in ETL

    - by Ben
    Hi, Being fairly new to SSIS and the ETL process, I was wondering if there is anyway to loop though a record set within a DataFlowTask and pass each row (deriving parameters from the row) into a Stored Procedure (the next step in the ETL phase). Once i have passed the row into the stored procedure, I want the results from each iteration to be written to a Table. Does anyone know how to do this? Thanks.

    Read the article

  • Oracle Warehouse Builder és Enterprise ETL

    - by Fekete Zoltán
    Friss és ropogós az adatlap!!! Fogyasszátok egészséggel: ODI Enterprise Edition: Warehouse Builder Enterprise ETL white paper. A jó hír: minden megvásárolt Oracle Database-hez ingyenese használható az Oracle Warehouse Builder alap (core) funkcionalitása. Mi is az az OWB core funkcionalitás, és mit használhatunk az opciókban? Az Enterprise ETL funkcionalitás az Oracle Data Integrator Enterprise Edition licensz részeként érheto el az OWB-hez. Azok a funkciók, amik csak az ODI EE licensszel érhetok el (a korábbi OWB Enterprise ETL opció is ennek a része) megtekinthetok itt is a szöveg alján. Ezek: - Transportable ETL modules, multiple configurations, and pluggable mappings - Operators for pluggable mapping, pluggable mapping input signature, pluggable mapping output signature - Design Environment Support for RAC - Metadata change propagation - Schedulable Mappings and Process Flows - Slowing Changing Dimensions (SCD) Type 2 and 3 - XML Files as a target - Target load ordering - Seeded spatial and streams transformations - Process Flow Activity templates - Process Flow variables support - Process Flow looping activities such as For Loop and While Loop - Process Flow Route and Notification activities - Metadata lineage and impact analysis - Metadata Extensibility - Deployment to Discoverer EUL - Deployment to Oracle BI Beans catalog Tehát ha komolyabb környezetben szeretném használni az OWB-t, több környezetbe deployálni, stb, akkor szükség van az ODI EE licenszre is. ODI Enterprise Edition: Warehouse Builder Enterprise ETL white paper.

    Read the article

  • Field specific errors for ETL

    - by AaronLS
    I am creating a ETL process in MS SQL Server and I would like to have errors specific to a particular column of a particular row. For example, the data is initially loaded from excel files into a table(we'll call the Initial table) where all columns are varchar(2000) and then I stage the data to another table(the DataTypedTable) that contains more specific data types (datetime,int, etc.) or more tightly constrained varchar lengths. I need to be able to create error messages for a specific field such as: "Jan. 13th" is not a valid date format for the submission date. Please use a format of MM/DD/YYYY These error messages would need to be stored in some way such that later in the process a automated process can create reports with the error messages such that each message references a specific row and field(someone will need to go back and correct the data in the source system and resubmit the excel file). So ideally it would be inserted into a Failures tables of some sort and contain the primary key of the failed row, the column name, and the error message. Question: So I am wondering if this can be accomplished with SSIS, or some open source tool like Talend, and if so, what would be your general approach? Or what hand coded approach you would take? Couple approaches I've thought of using SQL(up until no I have done ETL by hand in SQL procs, but I want to consider other approaches. Possible C# even.): Use a cursor to read through the Initial table, and for each row insert a blank record with only the primary key into the DataTyped table, then use a single update statement for each column, such that if that update fails I can insert a very specific error message specific to that column in the error messages table. Insert all the data as is into the DataTyped table, but have duplicate columns like SubmissionDate and SubmissionDateOld. After the initial insert the *Old columns have data, the rest are blank, and I have a single update for each column that sets the SubmissionDate based on the SubmissionDateOld. In addition to suggesting an approach, I'd like to know if you are using that approach or something similar already in the work you do.

    Read the article

  • Replication Services as ETL extraction tool

    - by jorg
    In my last blog post I explained the principles of Replication Services and the possibilities it offers in a BI environment. One of the possibilities I described was the use of snapshot replication as an ETL extraction tool: “Snapshot Replication can also be useful in BI environments, if you don’t need a near real-time copy of the database, you can choose to use this form of replication. Next to an alternative for Transactional Replication it can be used to stage data so it can be transformed and moved into the data warehousing environment afterwards. In many solutions I have seen developers create multiple SSIS packages that simply copies data from one or more source systems to a staging database that figures as source for the ETL process. The creation of these packages takes a lot of (boring) time, while Replication Services can do the same in minutes. It is possible to filter out columns and/or records and it can even apply schema changes automatically so I think it offers enough features here. I don’t know how the performance will be and if it really works as good for this purpose as I expect, but I want to try this out soon!” Well I have tried it out and I must say it worked well. I was able to let replication services do work in a fraction of the time it would cost me to do the same in SSIS. What I did was the following: Configure snapshot replication for some Adventure Works tables, this was quite simple and straightforward. Create an SSIS package that executes the snapshot replication on demand and waits for its completion. This is something that you can’t do with out of the box functionality. While configuring the snapshot replication two SQL Agent Jobs are created, one for the creation of the snapshot and one for the distribution of the snapshot. Unfortunately these jobs are  asynchronous which means that if you execute them they immediately report back if the job started successfully or not, they do not wait for completion and report its result afterwards. So I had to create an SSIS package that executes the jobs and waits for their completion before the rest of the ETL process continues. Fortunately I was able to create the SSIS package with the desired functionality. I have made a step-by-step guide that will help you configure the snapshot replication and I have uploaded the SSIS package you need to execute it. Configure snapshot replication   The first step is to create a publication on the database you want to replicate. Connect to SQL Server Management Studio and right-click Replication, choose for New.. Publication…   The New Publication Wizard appears, click Next Choose your “source” database and click Next Choose Snapshot publication and click Next   You can now select tables and other objects that you want to publish Expand Tables and select the tables that are needed in your ETL process In the next screen you can add filters on the selected tables which can be very useful. Think about selecting only the last x days of data for example. Its possible to filter out rows and/or columns. In this example I did not apply any filters. Schedule the Snapshot Agent to run at a desired time, by doing this a SQL Agent Job is created which we need to execute from a SSIS package later on. Next you need to set the Security Settings for the Snapshot Agent. Click on the Security Settings button.   In this example I ran the Agent under the SQL Server Agent service account. This is not recommended as a security best practice. Fortunately there is an excellent article on TechNet which tells you exactly how to set up the security for replication services. Read it here and make sure you follow the guidelines!   On the next screen choose to create the publication at the end of the wizard Give the publication a name (SnapshotTest) and complete the wizard   The publication is created and the articles (tables in this case) are added Now the publication is created successfully its time to create a new subscription for this publication.   Expand the Replication folder in SSMS and right click Local Subscriptions, choose New Subscriptions   The New Subscription Wizard appears   Select the publisher on which you just created your publication and select the database and publication (SnapshotTest)   You can now choose where the Distribution Agent should run. If it runs at the distributor (push subscriptions) it causes extra processing overhead. If you use a separate server for your ETL process and databases choose to run each agent at its subscriber (pull subscriptions) to reduce the processing overhead at the distributor. Of course we need a database for the subscription and fortunately the Wizard can create it for you. Choose for New database   Give the database the desired name, set the desired options and click OK You can now add multiple SQL Server Subscribers which is not necessary in this case but can be very useful.   You now need to set the security settings for the Distribution Agent. Click on the …. button Again, in this example I ran the Agent under the SQL Server Agent service account. Read the security best practices here   Click Next   Make sure you create a synchronization job schedule again. This job is also necessary in the SSIS package later on. Initialize the subscription at first synchronization Select the first box to create the subscription when finishing this wizard Complete the wizard by clicking Finish The subscription will be created In SSMS you see a new database is created, the subscriber. There are no tables or other objects in the database available yet because the replication jobs did not ran yet. Now expand the SQL Server Agent, go to Jobs and search for the job that creates the snapshot:   Rename this job to “CreateSnapshot” Now search for the job that distributes the snapshot:   Rename this job to “DistributeSnapshot” Create an SSIS package that executes the snapshot replication We now need an SSIS package that will take care of the execution of both jobs. The CreateSnapshot job needs to execute and finish before the DistributeSnapshot job runs. After the DistributeSnapshot job has started the package needs to wait until its finished before the package execution finishes. The Execute SQL Server Agent Job Task is designed to execute SQL Agent Jobs from SSIS. Unfortunately this SSIS task only executes the job and reports back if the job started succesfully or not, it does not report if the job actually completed with success or failure. This is because these jobs are asynchronous. The SSIS package I’ve created does the following: It runs the CreateSnapshot job It checks every 5 seconds if the job is completed with a for loop When the CreateSnapshot job is completed it starts the DistributeSnapshot job And again it waits until the snapshot is delivered before the package will finish successfully Quite simple and the package is ready to use as standalone extract mechanism. After executing the package the replicated tables are added to the subscriber database and are filled with data:   Download the SSIS package here (SSIS 2008) Conclusion In this example I only replicated 5 tables, I could create a SSIS package that does the same in approximately the same amount of time. But if I replicated all the 70+ AdventureWorks tables I would save a lot of time and boring work! With replication services you also benefit from the feature that schema changes are applied automatically which means your entire extract phase wont break. Because a snapshot is created using the bcp utility (bulk copy) it’s also quite fast, so the performance will be quite good. Disadvantages of using snapshot replication as extraction tool is the limitation on source systems. You can only choose SQL Server or Oracle databases to act as a publisher. So if you plan to build an extract phase for your ETL process that will invoke a lot of tables think about replication services, it would save you a lot of time and thanks to the Extract SSIS package I’ve created you can perfectly fit it in your usual SSIS ETL process.

    Read the article

  • ETL mechanisms for MySQL to SQL Server over WAN

    - by Troy Hunt
    I’m looking for some feedback on mechanisms to batch data from MySQL Community Server 5.1.32 with an external host down to an internal SQL Server 05 Enterprise machine over VPN. The external box accumulates data throughout business hours (about 100Mb per day), which then needs to be transferred internationally across a WAN connection to an internal corporate environment before some BI work is performed. This should just be change-sets making their way down each night. I’m interested in thoughts on the ETL mechanisms people have successfully used in similar scenarios before. SSIS seems like a potential candidate; can anyone comment on the suitability for this scenario? Alternatively, other thoughts on how to do this in a cost-conscious way would be most appreciated. Thanks!

    Read the article

  • Adhoc Data processing / ETL

    - by Dane
    I've just started at a new company in outsourced communications (e.g. print and mail, email, fax). One of the requirements is to process clients data and get it ready for mailing. For recurring jobs, this is easy using an ETL tool linked in with some addressing software, but for adhoc stuff it's a bit overkill. I've used inhouse developed stuff before (clunky but usable), but I don't want to have to re-develop that here. Any recommendations? Some features : Basic DBMS functionality (preferably with a proper DBMS backend for SQL support) Field concatenation (e.g. combine Firstname + Surname) "Pushing columns" (e.g. with address fields 1 - 8, push them left so if one is blank, the next one gets pushed up) Australia post mail sorting and dpid allocation (or can link into external tools relatively easily)

    Read the article

  • ETL , Esper or Drools?

    - by geoaxis
    Hello, The question environment relates to JavaEE, Spring I am developing a system which can start and stop arbitrary TCP (or other) listeners for incoming messages. There could be a need to authenticate these messages. These messages need to be parsed and stored in some other entities. These entities model which fields they store. So for example if I have property1 that can have two text fields FillLevel1 and FillLevel2, I could receive messages on TCP which have both fill levels specified in text as F1=100;F2=90 Later I could add another filed say FillLevel3 when I start receiving messages F1=xx;F2=xx;F3=xx. But this is a conscious decision on the part of system modeler. My question is what do you think is better to use for parsing and storing the message. ETL (using Pantaho, which is used in other system) where you store the raw message and use task executor to consume them one by one and store the transformed messages as per your rules. One could use Espr or Drools to do the same thing , storing rules and executing them with timer, but I am not sure how dynamic you could get with making rules (they have to be made by end user in a running system and preferably in most user friendly way, ie no scripts or code, only GUI) The end user should be capable of changing the parse rules. It is also possible that end user might want to change the archived data as well (for example in the above example if a new value of FillLevel is added, one would like to put a FillLevel=-99 in the previous values to make the data consistent). Please ask for explanations, I have the feeling that I need to revise this question a bit. Thanks

    Read the article

  • PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data

    - by belvoir
    Background: I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP. I need to extract data from it on a semi real-time basis (some-one is bound to ask what semi real-time means and the answer is as frequently as I reasonably can but I will be pragmatic, as a benchmark lets say we are hoping for every 15min) and feed it into a data-warehouse. How much data? At peak times we are talking approx 80-100k rows per min hitting the OLTP side, off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each but there are various tables etc so the data is quite diverse and can range up to 4000 bytes per row. The OLTP is active 24x5.5. Best Solution? From what I can piece together the most practical solution is as follows: Create a TRIGGER to write all DML activity to a rotating CSV log file Perform whatever transformations are required Use the native DW data pump tool to efficiently pump the transformed CSV into the DW Why this approach? TRIGGERS allow selective tables to be targeted rather than being system wide + output is configurable (i.e. into a CSV) and are relatively easy to write and deploy. SLONY uses similar approach and overhead is acceptable CSV easy and fast to transform Easy to pump CSV into the DW Alternatives considered .... Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). Problem with this is it looked very verbose relative to what I needed and was a little trickier to parse and transform. However it could be faster as I presume there is less overhead compared to a TRIGGER. Certainly it would make the admin easier as it is system wide but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log) Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... problem is the OLTP schema would need tweaked to support this and that has many negative side-effects Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner Using the WAL Has anyone done this before? Want to share your thoughts?

    Read the article

  • ETL Operation - Return Primary Key

    - by user302254
    I am using Talend to populate a data warehouse. My job is writing customer data to a dimension table and transaction data to the fact table. The surrogate key (p_key) on the fact table is auto-incrementing. When I insert a new customer, I need my fact table to reflect the id of the related customer. As I mentioned my p_key is auto auto_incrementing so I can't just insert an arbitrary value for the p_key. Any thought on how I can insert a row into my dimension table and still retrieve the primary key to reference in my fact record? Thanks.

    Read the article

1 2 3 4 5 6 7 8 9 10 11  | Next Page >