Efficient alternative to merge() when building a data frame from JSON files with R?

Posted by Bryan on Stack Overflow

I have written the following code, which works but is painfully slow once I start executing it over thousands of records:

require("RJSONIO")

# Running master table, one row per person
people_data <- data.frame(person_id = numeric(0))

json_data <- fromJSON(json_file)
n_people <- length(json_data)

for (person in 1:n_people) {
    # Flatten this person's nested list into a one-row data frame
    person_dataframe <- as.data.frame(t(unlist(json_data[[person]])))
    # Fold it into the master table; all=TRUE fills missing columns with NA
    people_data <- merge(people_data, person_dataframe, all = TRUE)
}

output_file <- paste("people_data", ".csv", sep = "")
write.csv(people_data, file = output_file)

I am attempting to build a unified data table from a series of JSON-formatted files. The fromJSON() function reads the data in as a list of lists: each element of the list is a person, which in turn contains a list of that person's attributes.

For example:

[[1]]
    person_id
    name
    gender
    hair_color
[[2]]
    person_id
    name
    location
    gender
    height

[[...]]

structure(list(person_id = "Amy123", name = "Amy", gender = "F",
               hair_color = "brown"), 
          .Names = c("person_id", "name", "gender", "hair_color"))

structure(list(person_id = "matt53", name = "Matt", 
               location = structure(c(47231, "IN"), 
                                    .Names = c("zip_code", "state")), 
               gender = "M", height = 172), 
          .Names = c("person_id", "name", "location", "gender", "height"))

The end result of the code above is a single table whose columns are every person attribute that appears in the structures above, and whose rows hold the corresponding values for each person. As you can see, though, some attributes are missing for some people, so I need those to show up as NA and make sure every value lands in the right column. Further, location is itself a vector with two components, zip_code and state, so it needs to be flattened to location.zip_code and location.state before it can be merged with another person's record; this is what I use unlist() for. I keep the running master table in people_data.
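For instance, running the same as.data.frame(t(unlist(...))) step from my loop on the second structure() literal above collapses the nested location into dot-separated columns, roughly:

matt <- structure(list(person_id = "matt53", name = "Matt",
                       location = structure(c(47231, "IN"),
                                            .Names = c("zip_code", "state")),
                       gender = "M", height = 172),
                  .Names = c("person_id", "name", "location", "gender", "height"))

as.data.frame(t(unlist(matt)))
#   person_id name location.zip_code location.state gender height
# 1    matt53 Matt             47231             IN      M    172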

The above code works, but do you know of a more efficient way to accomplish what I'm trying to do? It appears that the repeated merge() is slowing this to a crawl; I have hundreds of files with hundreds of people in each file.
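To make the question concrete: is something along these lines the right direction? That is, flatten every record up front and combine everything once at the end with a fill-aware bind (plyr::rbind.fill pads missing columns with NA) instead of calling merge() inside the loop. A rough, unbenchmarked sketch:

require("RJSONIO")
require("plyr")

json_data <- fromJSON(json_file)

# Flatten each person into a one-row data frame first...
rows <- lapply(json_data, function(p) as.data.frame(t(unlist(p)), stringsAsFactors = FALSE))

# ...then combine in a single pass, padding missing attributes with NA
people_data <- do.call(rbind.fill, rows)

write.csv(people_data, file = "people_data.csv")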

Thanks! Bryan
