R: Are there any alternatives to loops for subsetting from an optimization standpoint?

Posted by Adam on Stack Overflow. Published on 2010-03-27T00:46:29Z.


A recurring analysis paradigm I encounter in my research is the need to subset the data by each distinct group ID, perform a statistical analysis on each group in turn, and collect the results in an output matrix for further processing/summarizing.

How I typically do this in R is something like the following:

data.mat <- read.csv("...")
groupids <- unique(data.mat$ID)  # Assume there are then 100 unique groups

# Use real NA values (not the string "NA") so the matrix stays numeric
results <- matrix(NA, nrow = length(groupids), ncol = 3)

for (i in seq_along(groupids)) {
  tempmat <- subset(data.mat, ID == groupids[i])

  # Run various stats on tempmat (correlations, regressions, etc.), checking
  # to make sure this specific group doesn't have NAs in the variables I'm
  # using, and assign the results to x, y, and z, for example.

  results[i, 1] <- x
  results[i, 2] <- y
  results[i, 3] <- z
}

This ends up working for me, but depending on the size of the data and the number of groups I'm working with, this can take up to three days.

Besides branching out into parallel processing, is there any "trick" for making something like this run faster? For instance, converting the loop into something else (such as an apply call whose function contains the stats I want to run), or eliminating the need to assign each subset of the data to a variable?
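As a hedged sketch of the apply-style rewrite mentioned above (the columns x1/x2 and the per-group statistics are placeholders, not from the original post), one common base-R pattern is to split() the data frame once and loop over the pieces with sapply():

```r
# Sketch: split the data frame into per-group pieces once, then apply a
# function to each piece. Columns x1/x2 and the stats are illustrative only.
set.seed(1)
data.mat <- data.frame(ID = rep(1:3, each = 5),
                       x1 = rnorm(15), x2 = rnorm(15))

results <- t(sapply(split(data.mat, data.mat$ID), function(d) {
  d <- d[complete.cases(d), ]  # guard against NAs, as in the loop version
  # Replace these with the real statistics (correlations, regressions, etc.)
  c(x = mean(d$x1), y = mean(d$x2), z = cor(d$x1, d$x2))
}))

results  # one row per group; columns x, y, z
```

This avoids rescanning the full data frame with subset() on every iteration, which is often the dominant cost of the loop version; packages such as plyr or data.table take the same split-apply-combine idea further.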
