Search Results

Search found 1 results on 1 pages for 'dnlbrky'.

Page 1/1 | 1 

  • Parse and transform XML with missing elements into table structure

    - by dnlbrky
    I'm trying to parse an XML file. A simplified version of it looks like this: x <- '<grandparent><parent><child1>ABC123</child1><child2>1381956044</child2></parent><parent><child2>1397527137</child2></parent><parent><child3>4675</child3></parent><parent><child1>DEF456</child1><child3>3735</child3></parent><parent><child1/><child3>3735</child3></parent></grandparent>' library(XML) xmlRoot(xmlTreeParse(x)) ## <grandparent> ## <parent> ## <child1>ABC123</child1> ## <child2>1381956044</child2> ## </parent> ## <parent> ## <child2>1397527137</child2> ## </parent> ## <parent> ## <child3>4675</child3> ## </parent> ## <parent> ## <child1>DEF456</child1> ## <child3>3735</child3> ## </parent> ## <parent> ## <child1/> ## <child3>3735</child3> ## </parent> ## </grandparent> I'd like to transform the XML into a data.frame / data.table that looks like this: parent <- data.frame(child1=c("ABC123",NA,NA,"DEF456",NA), child2=c(1381956044, 1397527137, rep(NA, 3)), child3=c(rep(NA, 2), 4675, 3735, 3735)) parent ## child1 child2 child3 ## 1 ABC123 1381956044 NA ## 2 <NA> 1397527137 NA ## 3 <NA> NA 4675 ## 4 DEF456 NA 3735 ## 5 <NA> NA 3735 If each parent node always contained all of the possible elements ("child1", "child2", "child3", etc.), I could use xmlToList and unlist to flatten it, and then dcast to put it into a table. But the XML often has missing child elements. Here is an attempt with incorrect output: library(data.table) ## Flatten: dt <- as.data.table(unlist(xmlToList(x)), keep.rownames=T) setnames(dt, c("column", "value")) ## Add row numbers, but they're incorrect due to missing XML elements: dt[, row:=.SD[,.I], by=column][] column value row 1: parent.child1 ABC123 1 2: parent.child2 1381956044 1 3: parent.child2 1397527137 2 4: parent.child3 4675 1 5: parent.child1 DEF456 2 6: parent.child3 3735 2 7: parent.child3 3735 3 ## Reshape from long to wide, but some value are in the wrong row: dcast.data.table(dt, row~column, value.var="value", fill=NA) ## row parent.child1 parent.child2 parent.child3 ## 1: 1 ABC123 1381956044 4675 ## 2: 2 DEF456 1397527137 3735 ## 3: 3 NA NA 3735 I won't know ahead of time the names of the child elements, or the count of unique element names for children of the grandparent, so the answer should be flexible.

    Read the article

1