text search - Page 381 - Developer IT

Searching for duplicate records within a text file where the duplicate is determined by only two fie

- by plg

First, Python Newbie; be patient/kind. Next, once a month I receive a large text file (think 7 Million records) to test for duplicate values. This is catalog information. I get 7 fields, but the two I'm interested in are a supplier code and a full orderable part number. To determine if the record is dupliacted, I compress all special characters from the part number (except . and #) and create a compressed part number. The test for duplicates becomes the supplier code and compressed part number combination. This part is fairly straight forward. Currently, I am just copying the original file with 2 new columns (compressed part and duplicate indicator). If the part is a duplicate, I put a "YES" in the last field. Now that this is done, I want to be able to go back (or better yet, at the same time) to get the previous record where there was a supplier code/compressed part number match. So far, my code looks like this: Compress Full Part to a Compressed Part and Check for Duplicates on Supplier Code and Compressed Part combination import sys import re import time ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ start=time.time() try: file1 = open("C:\Accounting\May Accounting\May.txt", "r") except IOError: print sys.stderr, "Cannot Open Read File" sys.exit(1) try: file2 = open(file1.name[0:len(file1.name)-4] + "_" + "COMPRESSPN.txt", "a") except IOError: print sys.stderr, "Cannot Open Write File" sys.exit(1) hdrList="CIGSUPPLIER|FULL_PART|PART_STATUS|ALIAS_FLAG|ACQUISITION_FLAG|COMPRESSED_PART|DUPLICATE_INDICATOR" file2.write(hdrList+chr(10)) lines_seen=set() affirm="YES" records = file1.readlines() for record in records: fields = record.split(chr(124)) if fields[0]=="CIGSupplier": continue #If incoming file has a header line, skip it file2.write(fields[0]+"|"), #Supplier Code file2.write(fields[1]+"|"), #Full_Part file2.write(fields[2]+"|"), #Part Status file2.write(fields[3]+"|"), #Alias Flag file2.write(re.sub("[$\r\n]", "", fields[4])+"|"), #Acquisition Flag file2.write(re.sub("[^0-9a-zA-Z.#]", "", fields[1])+"|"), #Compressed_Part dupechk=fields[0]+"|"+re.sub("[^0-9a-zA-Z.#]", "", fields[1]) if dupechk not in lines_seen: file2.write(chr(10)) lines_seen.add(dupechk) else: file2.write(affirm+chr(10)) print "it took", time.time() - start, "seconds." ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ file2.close() file1.close() It runs in less than 6 minutes, so I am happy with this part, even if it is not elegant. Right now, when I get my results, I import the results into Access and do a self join to locate the duplicates. Loading/querying/exporting results in Access a file this size takes around an hour, so I would like to be able to export the matched duplicates to another text file or an Excel file. Confusing enough? Thanks.

Search Results

Search found 48264 results on 1931 pages for 'text search'.

Page 381/1931 | < Previous Page | 377 378 379 380 381 382 383 384 385 386 387 388 | Next Page >

- by plg

- by janoChen

- by Jack Low

- by Patrick

- by OutOFTouch

- by DerNalia

- by Lieven Cardoen

- by ashishjsoni

- by leon

- by William

- by janoChen

- by Hates_

- by eco_bach

- by AlexanderJohannesen

- by dr. evil

- by user355008

- by NetCaster

- by jitendra

- by tiendv

- by aku

- by Ayyappan.Anbalagan

- by Ross

- by Guru1985

- by user307840

- by Géza

< Previous Page | 377 378 379 380 381 382 383 384 385 386 387 388 | Next Page >