a txt - Page 107 - Developer IT

Searching for duplicate records within a text file where the duplicate is determined by only two fie

- by plg

First, Python Newbie; be patient/kind. Next, once a month I receive a large text file (think 7 Million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is dupliacted, I compress all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straight forward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match. So far, my code looks like this: Compress Full Part to a Compressed Part and Check for Duplicates on Supplier Code and Compressed Part combination import sys import re import time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ start=time.time() try: file1 = open("C:\Accounting\May Accounting\May.txt", "r") except IOError: print sys.stderr, "Cannot Open Read File" sys.exit(1) try: file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a") except IOError: print sys.stderr, "Cannot Open Write File" sys.exit(1) hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR" file2.write(hdrList+chr(10)) lines_seen=set() affirm="YES" records = file1.readlines() for record in records: fields = record.split(chr(124)) if fields[0]=="CIGSupplier": continue #If incoming file has a header line, skip it file2.write(fields[0]+"|"), #Supplier Code file2.write(fields[1]+"|"), #Full_Part file2.write(fields[2]+"|"), #Part Status file2.write(fields[3]+"|"), #Alias Flag file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1]) if dupechk not in lines_seen: file2.write(chr(10)) lines_seen.add(dupechk) else: file2.write(affirm+chr(10)) print "it took", time.time() - start, "seconds." ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ file2.close() file1.close() It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import the results into Access and do a self join to locate the duplicates. Loading/querying/exporting results in Access a file this size takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file. Confusing enough? Thanks.

Search Results

Search found 4783 results on 192 pages for 'a txt'.

Page 107/192 | < Previous Page | 103 104 105 106 107 108 109 110 111 112 113 114 | Next Page >

- by plg

- by suvirai

- by latinunit-net

- by Tim

- by ajsie

- by chavanak

- by Joomala

- by Sten

- by VBeginner

- by riad

- by user136104

- by Pankaj

- by Bernhard V

- by Crimsonland

- by Pedro Silva

- by user1296787

- by user365828

- by tazim

- by avon_verma

- by user369701

- by Fred Yang

- by user316360

- by user2971553

- by user2913962

- by Alsciende

< Previous Page | 103 104 105 106 107 108 109 110 111 112 113 114 | Next Page >