handling long running large transactions with perl dbi
- by 1stdayonthejob
I've got a large transaction comprising of getting lots of data from database A, do some manipulations with this data, then inserting the manipulated data into database B.  I've only got permissions to select in database A but I can create tables and insert/update etc in database B.
The manipulation and insertion part is written in perl and already in use for loading data into database B from other data sources, so all that's required is to get the necessary data from database A and using it to initialize the perl classes.
How can I go about doing this so I can easily track back and pick up from where the error happened if any error occurs during the manipulation or insertion procedures (database disconnection, problems with class initialization because of invalid values, hard disk failure etc...)?  Doing the transaction in one go doesn't seem like a good option because the amount data from database A means it would take at least a day or 2 for data manipulation and insertion into database B.
The data from database A can be grouped into around 1000 groups using unique keys, with each key containing 1000s of rows each.  One way I thought I could do is to write a script that does commits per group, meaning I've got to track which group has already been inserted into database B.  The only way I can think of to track the progress of which groups have been processed or not is either in a log file or in a table in database B.  A second way I thought could work is to dump all the necessary fields needed for loading the classes for manipulation and insertion into a flatfile, read the file to initialize the classes and insert into database B.  This also means that I got to do some logging, but should narrow it down to the exact row in the flatfile if any error occurs.  The script will look something like this:
use strict;
use warnings;
use DBI;
#connect to database A
my $dbh = DBI->connect('dbi:oracle:my_db', $user, $password, { RaiseError => 1, AutoCommit => 0 });
#statement to get data based on group unique key
my $sth = $dbh->prepare($my_sql);
my @groups; #I have a list of this already
open my $fh, '>>', 'my_logfile' or die "can't open logfile $!";
eval {
    foreach my $g (@groups){
        #subroutine to check if group has already been processed, either from log file or from database table
        next if is_processed($g);
        $sth->execute($g);
        my $data = $sth->fetchall_arrayref;
        #manipulate $data, then use it to load perl classes for insertion into database B
        #.
        #.
        #.
    }
    print $fh "$g\n";
};
if ($@){
   $dbh->rollback;
   die "something wrong...rollback";
}
So if any errors do occur, I can just run this script again and it should skip the groups or rows that have been processed and continue.
Both these methods is just variations on the same theme, and both require going back to where I've been tracking my progress (in table or file), skip the ones that've been commited to database B and process the remaining data.
I'm sure there's a better way of doing this but am struggling to think of other solutions.  Is there another way of handling large transactions between databases that require data manipulation between getting data out from one and inserting into another?  The process doesn't need to be all in Perl, as long as I can reuse the perl classes for manipulating and inserting the data into the database.