dictreader - Developer IT

Python DictReader - Skipping rows with missing columns?

- by victorhooi

heya, I have a Excel .CSV file I'm attempting to read in with DictReader. All seems to be well, except it seems to omit rows, specifically those with missing columns. Our input looks like: mail,givenName,sn,lorem,ipsum,dolor,telephoneNumber [email protected],ian,bay,3424,8403,2535,+65(2)34523534545 [email protected],mike,gibson,3424,8403,2535,+65(2)34523534545 [email protected],ross,martin,,,,+65(2)34523534545 [email protected],david,connor,,,,+65(2)34523534545 [email protected],chris,call,3424,8403,2535,+65(2)34523534545 So some of the rows have missing lorem/ipsum/dolor columns, and it's just a string of commas for those. We're reading it in with: def read_gd_dump(input_file="blah 20100423.csv"): gd_extract = csv.DictReader(open('blah 20100423.csv'), restval='missing', dialect='excel') return dict([(row['something'], row) for row in gd_extract]) And I checked that "something" (the key for our dict) isn't one of the missing columns, I had originally suspected it might be that. It's one of the columns after that. However, DictReader seems to completely skip over the rows. I tried setting restval to something, didn't seem to make any difference. I can't seem to find anything in Python's CSV docs (http://docs.python.org/library/csv.html) that would explain this behaviour, but I may have misread something. Any ideas? Thanks, Victor

Read the article

Number of lines in csv.DictReader

- by Alan Harris-Reid

Hi there, I have a csv DictReader object (using Python 3.1), but I would like to know the number of lines/rows contained in the reader before I iterate through it. Something like as follows... myreader = csv.DictReader(open('myFile.csv', newline='')) totalrows = ? rowcount = 0 for row in myreader: rowcount +=1 print("Row %d/%d" % (rowcount,totalrows)) I know I could get the total by iterating through the reader, but then I couldn't run the 'for' loop. I could iterate through a copy of the reader, but I cannot find how to copy an iterator. I could also use totalrows = len(open('myFile.csv').readlines()) but that seems an unnecessary re-opening of the file. I would rather get the count from the DictReader if possible. Any help would be appreciated. Alan

Read the article

writing header in csv python with DictWriter

- by user248237

assume I have a csv.DictReader object and I want to write it out as a csv file. How can I do this? I thought of the following: dr = csv.DictReader(open(f), delimiter='\t') # process my dr object # ... # write out object output = csv.DictWriter(open(f2, 'w'), delimiter='\t') for item in dr: output.writerow(item) Is that the best way? More importantly, how can I make it so a header is written out too, in this case the object "dr"s .fieldnames property? thanks.

Read the article

Excel CSV into Nested Dictionary; List Comprehensions

- by victorhooi

heya, I have a Excel CSV files with employee records in them. Something like this: mail,first_name,surname,employee_id,manager_id,telephone_number [email protected],john,smith,503422,503423,+65(2)3423-2433 [email protected],george,brown,503097,503098,+65(2)3423-9782 .... I'm using DictReader to put this into a nested dictionary: import csv gd_extract = csv.DictReader(open('filename 20100331 original.csv'), dialect='excel') employees = dict([(row['employee_id'], row) for row in gp_extract]) Is the above the proper way to do it - it does work, but is it the Right Way? Something more efficient? Also, the funny thing is, in IDLE, if I try to print out "employees" at the shell, it seems to cause IDLE to crash (there's approximately 1051 rows). 2. Remove employee_id from inner dict The second issue issue, I'm putting it into a dictionary indexed by employee_id, with the value as a nested dictionary of all the values - however, employee_id is also a key:value inside the nested dictionary, which is a bit redundant? Is there any way to exclude it from the inner dictionary? 3. Manipulate data in comprehension Thirdly, we need do some manipulations to the imported data - for example, all the phone numbers are in the wrong format, so we need to do some regex there. Also, we need to convert manager_id to an actual manager's name, and their email address. Most managers are in the same file, while others are in an external_contractors CSV, which is similar but not quite the same format - I can import that to a separate dict though. Are these two items things that can be done within the single list comprehension, or should I use a for loop? Or does multiple comprehensions work? (sample code would be really awesome here). Or is there a smarter way in Python do it? Cheers, Victor

Read the article

Python + MySQLdb executemany

- by lhahne

I'm using Python and its MySQLdb module to import some measurement data into a Mysql database. The amount of data that we have is quite high (currently about ~250 MB of csv files and plenty of more to come). Currently I use cursor.execute(...) to import some metadata. This isn't problematic as there are only a few entries for these. The problem is that when I try to use cursor.executemany() to import larger quantities of the actual measurement data, MySQLdb raises a TypeError: not all arguments converted during string formatting My current code is def __insert_values(self, values): cursor = self.connection.cursor() cursor.executemany(""" insert into values (ensg, value, sampleid) values (%s, %s, %s)""", values) cursor.close() where values is a list of tuples containing three strings each. Any ideas what could be wrong with this? Edit: The values are generated by yield (prefix + row['id'], row['value'], sample_id) and then read into a list one thousand at a time where row is and iterator coming from csv.DictReader.

Read the article

Python - Converting CSV to Objects - Code Design

- by victorhooi

Hi, I have a small script we're using to read in a CSV file containing employees, and perform some basic manipulations on that data. We read in the data (import_gd_dump), and create an Employees object, containing a list of Employee objects (maybe I should think of a better naming convention...lol). We then call clean_all_phone_numbers() on Employees, which calls clean_phone_number() on each Employee, as well as lookup_all_supervisors(), on Employees. import csv import re import sys #class CSVLoader: # """Virtual class to assist with loading in CSV files.""" # def import_gd_dump(self, input_file='Gp Directory 20100331 original.csv'): # gd_extract = csv.DictReader(open(input_file), dialect='excel') # employees = [] # for row in gd_extract: # curr_employee = Employee(row) # employees.append(curr_employee) # return employees # #self.employees = {row['dbdirid']:row for row in gd_extract} # Previously, this was inside a (virtual) class called "CSVLoader". # However, according to here (http://tomayko.com/writings/the-static-method-thing) - the idiomatic way of doing this in Python is not with a class-fucntion but with a module-level function def import_gd_dump(input_file='Gp Directory 20100331 original.csv'): """Return a list ('employee') of dict objects, taken from a Group Directory CSV file.""" gd_extract = csv.DictReader(open(input_file), dialect='excel') employees = [] for row in gd_extract: employees.append(row) return employees def write_gd_formatted(employees_dict, output_file="gd_formatted.csv"): """Read in an Employees() object, and write out each Employee() inside this to a CSV file""" gd_output_fieldnames = ('hrid', 'mail', 'givenName', 'sn', 'dbcostcenter', 'dbdirid', 'hrreportsto', 'PHFull', 'PHFull_message', 'SupervisorEmail', 'SupervisorFirstName', 'SupervisorSurname') try: gd_formatted = csv.DictWriter(open(output_file, 'w', newline=''), fieldnames=gd_output_fieldnames, extrasaction='ignore', dialect='excel') except IOError: print('Unable to open file, IO error (Is it locked?)') sys.exit(1) headers = {n:n for n in gd_output_fieldnames} gd_formatted.writerow(headers) for employee in employees_dict.employee_list: # We're using the employee object's inbuilt __dict__ attribute - hmm, is this good practice? gd_formatted.writerow(employee.__dict__) class Employee: """An Employee in the system, with employee attributes (name, email, cost-centre etc.)""" def __init__(self, employee_attributes): """We use the Employee constructor to convert a dictionary into instance attributes.""" for k, v in employee_attributes.items(): setattr(self, k, v) def clean_phone_number(self): """Perform some rudimentary checks and corrections, to make sure numbers are in the right format. Numbers should be in the form 0XYYYYYYYY, where X is the area code, and Y is the local number.""" if self.telephoneNumber is None or self.telephoneNumber == '': return '', 'Missing phone number.' else: standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})') extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})') missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<local_second_half>\d{4})') if standard_format.search(self.telephoneNumber): result = standard_format.search(self.telephoneNumber) return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), '' elif extra_zero.search(self.telephoneNumber): result = extra_zero.search(self.telephoneNumber) return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Extra zero in area code - ask user to remediate. ' elif missing_hyphen.search(self.telephoneNumber): result = missing_hyphen.search(self.telephoneNumber) return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Missing hyphen in local component - ask user to remediate. ' else: return '', "Number didn't match recognised format. Original text is: " + self.telephoneNumber class Employees: def __init__(self, import_list): self.employee_list = [] for employee in import_list: self.employee_list.append(Employee(employee)) def clean_all_phone_numbers(self): for employee in self.employee_list: #Should we just set this directly in Employee.clean_phone_number() instead? employee.PHFull, employee.PHFull_message = employee.clean_phone_number() # Hmm, the search is O(n^2) - there's probably a better way of doing this search? def lookup_all_supervisors(self): for employee in self.employee_list: if employee.hrreportsto is not None and employee.hrreportsto != '': for supervisor in self.employee_list: if supervisor.hrid == employee.hrreportsto: (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = supervisor.mail, supervisor.givenName, supervisor.sn break else: (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = ('Supervisor not found.', 'Supervisor not found.', 'Supervisor not found.') else: (employee.SupervisorEmail, employee.SupervisorFirstName, employee.SupervisorSurname) = ('Supervisor not set.', 'Supervisor not set.', 'Supervisor not set.') #Is thre a more pythonic way of doing this? def print_employees(self): for employee in self.employee_list: print(employee.__dict__) if __name__ == '__main__': db_employees = Employees(import_gd_dump()) db_employees.clean_all_phone_numbers() db_employees.lookup_all_supervisors() #db_employees.print_employees() write_gd_formatted(db_employees) Firstly, my preamble question is, can you see anything inherently wrong with the above, from either a class design or Python point-of-view? Is the logic/design sound? Anyhow, to the specifics: The Employees object has a method, clean_all_phone_numbers(), which calls clean_phone_number() on each Employee object inside it. Is this bad design? If so, why? Also, is the way I'm calling lookup_all_supervisors() bad? Originally, I wrapped the clean_phone_number() and lookup_supervisor() method in a single function, with a single for-loop inside it. clean_phone_number is O(n), I believe, lookup_supervisor is O(n^2) - is it ok splitting it into two loops like this? In clean_all_phone_numbers(), I'm looping on the Employee objects, and settings their values using return/assignment - should I be setting this inside clean_phone_number() itself? There's also a few things that I'm sorted of hacked out, not sure if they're bad practice - e.g. print_employee() and gd_formatted() both use __dict__, and the constructor for Employee uses setattr() to convert a dictionary into instance attributes. I'd value any thoughts at all. If you think the questions are too broad, let me know and I can repost as several split up (I just didn't want to pollute the boards with multiple similar questions, and the three questions are more or less fairly tightly related). Cheers, Victor

Read the article

How to use SQLAlchemy to dump an SQL file from query expressions to bulk-insert into a DBMS?

- by Mahmoud Abdelkader

Please bear with me as I explain the problem, how I tried to solve it, and my question on how to improve it is at the end. I have a 100,000 line csv file from an offline batch job and I needed to insert it into the database as its proper models. Ordinarily, if this is a fairly straight-forward load, this can be trivially loaded by just munging the CSV file to fit a schema, but I had to do some external processing that requires querying and it's just much more convenient to use SQLAlchemy to generate the data I want. The data I want here is 3 models that represent 3 pre-exiting tables in the database and each subsequent model depends on the previous model. For example: Model C --> Foreign Key --> Model B --> Foreign Key --> Model A So, the models must be inserted in the order A, B, and C. I came up with a producer/consumer approach: - instantiate a multiprocessing.Process which contains a threadpool of 50 persister threads that have a threadlocal connection to a database - read a line from the file using the csv DictReader - enqueue the dictionary to the process, where each thread creates the appropriate models by querying the right values and each thread persists the models in the appropriate order This was faster than a non-threaded read/persist but it is way slower than bulk-loading a file into the database. The job finished persisting after about 45 minutes. For fun, I decided to write it in SQL statements, it took 5 minutes. Writing the SQL statements took me a couple of hours, though. So my question is, could I have used a faster method to insert rows using SQLAlchemy? As I understand it, SQLAlchemy is not designed for bulk insert operations, so this is less than ideal. This follows to my question, is there a way to generate the SQL statements using SQLAlchemy, throw them in a file, and then just use a bulk-load into the database? I know about str(model_object) but it does not show the interpolated values. I would appreciate any guidance for how to do this faster. Thanks!

Read the article

Convert array to CSV/TSV-formated string in Python.

- by dreeves

Python provides csv.DictWriter for outputting CSV to a file. What is the simplest way to output CSV to a string or to stdout? For example, given a 2D array like this: [["a b c", "1,2,3"], ["i \"comma-heart\" you", "i \",heart\" u, too"]] return the following string: "a b c, \"1, 2, 3\"\n\"i \"\"comma-heart\"\" you\", \"i \"\",heart\"\" u, too\"" which when printed would look like this: a b c, "1,2,3" "i ""heart"" you", "i "",heart"" u, too" (I'm taking csv.DictWriter's word for it that that is in fact the canonical way to output that array as CSV. Excel does parse it correctly that way, though Mathematica does not. From a quick look at the wikipedia page on CSV it seems Mathematica is wrong.) One way would be to write to a temp file with csv.DictWriter and read it back with csv.DictReader. What's a better way? TSV instead of CSV It also occurs to me that I'm not wedded to CSV. TSV would make a lot of the headaches with delimiters and quotes go away: just replace tabs with spaces in the entries of the 2D array and then just intersperse tabs and newlines and you're done. Let's include solutions for both TSV and CSV in the answers to make this as useful as possible for future searchers.

Read the article

Python - Checking for membership inside nested dict

- by victorhooi

heya, This is a followup questions to this one: http://stackoverflow.com/questions/2901422/python-dictreader-skipping-rows-with-missing-columns Turns out I was being silly, and using the wrong ID field. I'm using Python 3.x here. I have a dict of employees, indexed by a string, "directory_id". Each value is a nested dict with employee attributes (phone number, surname etc.). One of these values is a secondary ID, say "internal_id", and another is their manager, call it "manager_internal_id". The "internal_id" field is non-mandatory, and not every employee has one. (I've simplified the fields a little, both to make it easier to read, and also for privacy/compliance reasons). The issue here is that we index (key) each employee by their directory_id, but when we lookup their manager, we need to find managers by their "internal_id". Before, when employee.keys() was a list of internal_ids, I was using a membership check on this. Now, the last part of my if statement won't work, since the internal_ids is part of the dict values, instead of the key itself. def lookup_supervisor(manager_internal_id, employees): if manager_internal_idis not None and manager_internal_id!= "" and manager_internal_id in employees.keys(): return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn']) else: return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found') So the first question is, how do I check whether the manager_internal_id is present in the dict's values. I've tried substituting employee.keys() with employee.values(), that didn't work. Also, I'm hoping for something a little more efficient, not sure if there's a way to get a subset of the values, specifically, all the entries for employees[directory_id]['internal_id']. Hopefully there's some Pythonic way of doing this, without using a massive heap of nested for/if loops. My second question is, how do I then cleanly return the required employee attributes (mail, givenname, surname etc.). My for loop is iterating over each employee, and calling lookup_supervisor. I'm feeling a bit stupid/stumped here. def tidy_data(employees): for directory_id, data in employees.items(): # We really shouldnt' be passing employees back and forth like this - hmm, classes? data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = lookup_supervisor(data['manager_internal_id'], employees) Thanks in advance =), Victor

Developer IT