Importing a large dataset into a database

Posted by peaceful on Stack Overflow See other posts from Stack Overflow or by peaceful
Published on 2010-03-15T17:35:41Z Indexed on 2010/03/15 17:39 UTC
Read the original article Hit count: 729

Filed under:

ruby-on-rails

I'm a beginning programmer in the relevant areas to this question, so if possible, it'd be helpful to avoid assuming I know a lot already.

I'm trying to import the OpenLibrary dataset into a local Postgres database. After it's imported, I plan to use it as a starting seed for a Ruby on Rails application that will include information on books.

The OpenLibrary datasets are available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump

I only need very basic information for my application, much less than what is provided in the dumps. I'm only trying to get out book titles, author names, and relationships between books and authors.

Below are two typical entries from their dataset, the first for an author, and the second for a book (they seem to have an entry for each edition of a book). The entries seem to lead off with a primary key, and then with a type, before including the actual JSON database dump.

/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}

/b/OL345M /type/edition {"publishers": ["Social Science Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54 p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk ", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}

The size of the uncompressed dumps are enormous, about 2GB for the authors list, and 18GB for the book editions list. OpenLibrary does not provide any tools for this themselves, they provide a simple unoptimized Python script for reading in sample data (which unlike the actual dumps comes in pure JSON format), but they estimate if that was modified for use on their actual data it would take 2 months (!) to finish loading the data.

How can I read this into the database? I assume I'll need to write a program to do this. What language and any guidance on how I should do it to finish in a reasonable amount of time? The only scripting language I have any experience with is Ruby.

Developer IT

Importing a large dataset into a database - Developer IT

Importing a large dataset into a database

dataset

import

JSON

postgresql

ruby-on-rails

Related posts about dataset

Dataset -> XML Document - Load DataSet into an XML Document - C#.Net

Combine multiple dataset columns to one dataset

Making a DataSet from another DataSet

Declaring variables with New DataSet vs DataSet

DataSet.Copy vs Dataset.Clone

Related posts about import

E: Sub-process /usr/bin/dpkg returned an error code (1) seems to be choking on kde-runtime-data version issue

Python ImportError when executing 'import.py', but not when executing 'python import.py'

Python Wildcard Import Vs Named Import

Import, Can I import all lib's, how to ??

how to import the blog.py(i import the 'blog' folder)

Categories cloud