Data Warehouse ETL slow - change primary key in dimension?

Posted by Jubbles on Stack Overflow See other posts from Stack Overflow or by Jubbles
Published on 2012-07-02T01:02:45Z Indexed on 2012/07/02 3:16 UTC
Read the original article Hit count: 172

I have a working MySQL data warehouse that is organized as a star schema and I am using Talend Open Studio for Data Integration 5.1 to create the ETL process. I would like this process to run once per day. I have estimated that one of the dimension tables (dimUser) will have approximately 2 million records and 23 columns.

I created a small test ETL process in Talend that worked, but given the amount of data that may need to be updated daily, the current performance will not cut it. It takes the ETL process four minutes to UPDATE or INSERT 100 records to dimUser. If I assumed a linear relationship between the count of records and the amount of time to UPDATE or INSERT, then there is no way the ETL can finish in 3-4 hours (my hope), let alone one day.

Since I'm unfamiliar with Java, I wrote the ETL as a Python script and ran into the same problem. Although, I did discover that if I did only INSERT, the process went much faster. I am pretty sure that the bottleneck is caused by the UPDATE statements.

The primary key in dimUser is an auto-increment integer. My friend suggested that I scrap this primary key and replace it with a multi-field primary key (in my case, 2-3 fields).

Before I rip the test data out of my warehouse and change the schema, can anyone provide suggestions or guidelines related to

  1. the design of the data warehouse
  2. the ETL process
  3. how realistic it is to have an ETL process INSERT or UPDATE a few million records each day
  4. will my friend's suggestion significantly help

If you need any further information, just let me know and I'll post it.

UPDATE - additional information:

mysql> describe dimUser;
Field                        Type                Null Key Default            Extra
user_key                     int(10) unsigned    NO   PRI NULL               auto_increment
id_A                         int(10) unsigned    NO       NULL
id_B                         int(10) unsigned    NO       NULL
field_4                      tinyint(4) unsigned NO       0
field_5                      varchar(50)         YES      NULL
city                         varchar(50)         YES      NULL
state                        varchar(2)          YES      NULL
country                      varchar(50)         YES      NULL
zip_code                     varchar(10)         NO       99999
field_10                     tinyint(1)          NO       0
field_11                     tinyint(1)          NO       0
field_12                     tinyint(1)          NO       0
field_13                     tinyint(1)          NO       1
field_14                     tinyint(1)          NO       0
field_15                     tinyint(1)          NO       0
field_16                     tinyint(1)          NO       0
field_17                     tinyint(1)          NO       1
field_18                     tinyint(1)          NO       0
field_19                     tinyint(1)          NO       0
field_20                     tinyint(1)          NO       0
create_date                  datetime            NO       2012-01-01 00:00:00
last_update                  datetime            NO       2012-01-01 00:00:00
run_id                       int(10) unsigned    NO       999

I used a surrogate key because I had read that it was good practice. Since, from a business perspective, I want to keep aware of potential fraudulent activity (say for 200 days a user is associated with state X and then the next day they are associated with state Y - they could have moved or their account could have been compromised), so that is why geographic data is kept. The field id_B may have a few distinct values of id_A associated with it, but I am interested in knowing distinct (id_A, id_B) tuples. In the context of this information, my friend suggested that something like (id_A, id_B, zip_code) be the primary key.

For the large majority of daily ETL processes (>80%), I only expect the following fields to be updated for existing records: field_10 - field_14, last_update, and run_id (this field is a foreign key to my etlLog table and is used for ETL auditing purposes).

© Stack Overflow or respective owner

Related posts about python

Related posts about mysql