Optimizing a "set in a string list" to a "set as a matrix" operation

Posted by Eric Fournier on Stack Overflow
Published on 2013-10-25T15:46:45Z

I have a set of strings which contain space-separated elements. I want to build a matrix which will tell me which elements were part of which strings. For example:

""
"A B C"
"D"
"B D"

Should give something like:

  A B C D
1
2 1 1 1
3       1
4   1   1

Now I've got a solution, but it runs slow as molasses, and I've run out of ideas on how to make it faster:

reverseIn <- function(vector, value) {
    return(value %in% vector)
}

buildCategoryMatrix <- function(valueVector) {
    allClasses <- c()
    for(classVec in unique(valueVector)) {
        allClasses <- unique(c(allClasses,
                               strsplit(classVec, " ", fixed=TRUE)[[1]]))
    }

    resMatrix <- matrix(ncol=0, nrow=length(valueVector))
    splitValues <- strsplit(valueVector, " ", fixed=TRUE)

    for(cat in allClasses) {
        if(cat=="") {
            catIsPart <- (valueVector == "")
        } else {
            catIsPart <- sapply(splitValues, reverseIn, cat)
        }
        resMatrix <- cbind(resMatrix, catIsPart)
    }
    colnames(resMatrix) <- allClasses

    return(resMatrix)
}

Profiling the function gives me this:

$by.self
                  self.time self.pct total.time total.pct
"match"               31.20    34.74      31.24     34.79
"FUN"                 30.26    33.70      74.30     82.74
"lapply"              13.56    15.10      87.86     97.84
"%in%"                12.92    14.39      44.10     49.11

So my actual questions would be:

- Where is the 33% spent in "FUN" coming from?
- Would there be any way to speed up the %in% call?

I tried turning the strings into factors before entering the loop so that I'd be matching numbers instead of strings, but that actually makes R crash. I've also tried partial matrix assignment (i.e., resMatrix[i,x] <- 1), where i is the index of the string and x is the vector of factors. No dice there either: it seems to keep running indefinitely.
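
For reference, the partial-assignment idea can be sketched without any per-category loop. In the profile, "FUN" is the name under which lapply/sapply call the function they are handed (here reverseIn), so that time is most likely the cost of invoking it once per string per category. The sketch below is an assumption-laden alternative, not the original code: the function name buildCategoryMatrixIndexed is hypothetical, it relies on lengths() (available in R 3.2.0 and later), and it treats an empty token produced by doubled spaces as an ordinary category rather than special-casing it.

```r
# Hypothetical alternative to buildCategoryMatrix(); the name is illustrative only.
buildCategoryMatrixIndexed <- function(valueVector) {
    # Split every string once; "" yields character(0), which leaves its row all FALSE.
    splitValues <- strsplit(valueVector, " ", fixed = TRUE)
    allElements <- unlist(splitValues)
    allClasses <- unique(allElements)

    # Row index of each element: string i is repeated once per element it contains.
    rowIdx <- rep(seq_along(splitValues), lengths(splitValues))
    # Column index of each element: its position in allClasses (a single match() call).
    colIdx <- match(allElements, allClasses)

    resMatrix <- matrix(FALSE, nrow = length(valueVector), ncol = length(allClasses),
                        dimnames = list(NULL, allClasses))
    # A two-column index matrix addresses (row, column) pairs in one assignment.
    resMatrix[cbind(rowIdx, colIdx)] <- TRUE
    resMatrix
}

example <- c("", "A B C", "D", "B D")
res <- buildCategoryMatrixIndexed(example)
print(res)
```

The design choice here is to call match() once over all elements instead of calling %in% once per string per category, which is where the profile above shows most of the time going. Whether this is fast enough on the real data has not been measured.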

© Stack Overflow or respective owner
