Alternatives for comparing data from different databases
- by Alex
I have two huge tables on separate databases. One of them has the information of all the SMS that passed through the company's servers while the other one has the information of the actual billing of those SMS. 
My job is to compare samples of both of these tables (for example, the records between 1 and 2 pm) to see if there are any differences: SMS that were sent but not charged to the user for whatever reason that may be happening.
The columns I will be using to compare are the remitent's phone number and the exact date the SMS was sent. An issue here is that dates usually are the same on both sides, but in many cases differ by 1 or 2 seconds.
I have, so far, two alternatives to do this:
(PL/SQL) Create two tables where i'm going to temporarily store all the records of that 1hour sample. One for each of the main tables. Then, for each distinct phone number, select the time of every SMS sent from that phone from both my temporary tables and start comparing one by one using cursors. In this case, the procedure would be ran on the server where one of the sources is so the contents of the other one would be looked up using a dblink.
(sqlplus + c++) Instead of storing the 1hour samples in new tables, output the query to a text file. I will have two text files, one for each source. Then, open the first file and load all of it's content on a hash_map (key-value) using c++, where the key will be the phone number and the value a list of times of SMS sent from that phone. Finally, open the second file, grab each line (in this format: numberX timeX), look for numberX's entry on the hash_map (wich will be a list of times) and then check if timeX is on that list. If it isn't, save it somewhere to finally store it on a "uncharged" table (this would also be the final step on case 1)
My main concern is efficiency. These samples have about 2 million records on each source, so just grabbing one record on one side and looking it up on the other would not be possible. That's the reason I wanted to use hash_maps
Which do you think is a better option?