Parse and transform XML with missing elements into table structure

Posted by dnlbrky on Stack Overflow
Published on 2014-08-25T16:08:34Z Indexed on 2014/08/25 16:20 UTC

I'm trying to parse an XML file. A simplified version of it looks like this:

x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

library(XML)
xmlRoot(xmlTreeParse(x))
## <grandparent>
##   <parent>
##     <child1>ABC123</child1>
##     <child2>1381956044</child2>
##   </parent>
##   <parent>
##     <child2>1397527137</child2>
##   </parent>
##   <parent>
##     <child3>4675</child3>
##   </parent>
##   <parent>
##     <child1>DEF456</child1>
##     <child3>3735</child3>
##   </parent>
##   <parent>
##     <child1/>
##     <child3>3735</child3>
##   </parent>
## </grandparent>

I'd like to transform the XML into a data.frame / data.table that looks like this:

parent <- data.frame(child1=c("ABC123",NA,NA,"DEF456",NA), child2=c(1381956044, 1397527137, rep(NA, 3)), child3=c(rep(NA, 2), 4675, 3735, 3735))
parent
##   child1     child2 child3
## 1 ABC123 1381956044     NA
## 2   <NA> 1397527137     NA
## 3   <NA>         NA   4675
## 4 DEF456         NA   3735
## 5   <NA>         NA   3735

If each parent node always contained all of the possible elements ("child1", "child2", "child3", etc.), I could use xmlToList and unlist to flatten it, and then dcast to put it into a table. But the XML often has missing child elements, so the flattened values lose track of which parent node they came from. Here is an attempt that produces incorrect output:

library(data.table)

## Flatten:
dt <- as.data.table(unlist(xmlToList(x)), keep.rownames=TRUE)
setnames(dt, c("column", "value"))

## Add row numbers, but they're incorrect due to missing XML elements:
dt[, row:=.SD[,.I], by=column][]
##           column      value row
## 1: parent.child1     ABC123   1
## 2: parent.child2 1381956044   1
## 3: parent.child2 1397527137   2
## 4: parent.child3       4675   1
## 5: parent.child1     DEF456   2
## 6: parent.child3       3735   2
## 7: parent.child3       3735   3

## Reshape from long to wide, but some values are in the wrong row:
dcast.data.table(dt, row~column, value.var="value", fill=NA)
##    row parent.child1 parent.child2 parent.child3
## 1:   1        ABC123    1381956044          4675
## 2:   2        DEF456    1397527137          3735
## 3:   3            NA            NA          3735

I won't know the names of the child elements ahead of time, or how many distinct child element names appear under the parent nodes, so the answer should be flexible.
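
One direction that seems to fit these constraints, offered as a minimal sketch rather than a definitive answer: build one named list per parent node, so each value stays tied to its own parent, and let rbindlist(fill=TRUE) pad whatever is missing with NA. The names doc, parents, rows, and res are only illustrative. The sketch assumes that each child name occurs at most once per parent and that every parent has at least one child element (a completely empty parent would be dropped by rbindlist).

```r
library(XML)
library(data.table)

x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>'

doc <- xmlParse(x, asText = TRUE)
parents <- getNodeSet(doc, "/grandparent/parent")

## One named list per <parent>, holding the text of each child element.
## Empty elements such as <child1/> are mapped to NA.
rows <- lapply(parents, function(p) {
  kids <- getNodeSet(p, "./*")   # element children only
  vals <- vapply(kids, function(k) {
    v <- xmlValue(k)
    if (length(v) == 0L || !nzchar(v)) NA_character_ else v
  }, character(1))
  names(vals) <- vapply(kids, xmlName, character(1))
  as.list(vals)
})

## Stack the rows by name; fill=TRUE adds NA for children a parent lacks.
res <- rbindlist(rows, fill = TRUE)

## Everything is character at this point; guess column types afterwards.
res <- res[, lapply(.SD, type.convert, as.is = TRUE)]
res
```

Because the column names come from the XML itself, nothing about the child elements has to be known in advance. The type.convert step is optional and only guesses types from the text values.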
