Partitioning a data set in R based on multiple classes of observations

Posted by Danny on Stack Overflow
Published on 2012-11-23T22:44:38Z Indexed on 2012/11/23 23:04 UTC

I'm trying to partition a data set that I have in R, 2/3 for training and 1/3 for testing. I have one classification variable, and seven numerical variables. Each observation is classified as either A, B, C, or D.

For simplicity's sake, let's say that the classification variable, cl, is A for observations 1 to 100, B for observations 101 to 200, C for observations 201 to 300, and D for observations 301 to 400. I'm trying to get a partition that contains 2/3 of the observations from each of A, B, C, and D, as opposed to simply taking 2/3 of the observations from the entire data set, since that would likely not give equal numbers of each class.

When I try to sample from a subset of the data, for example with sample(subset(data, cl=='A')), the columns are shuffled instead of the rows.
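The column shuffling is consistent with how R treats a data frame as a list of columns, so sample() permutes those list elements. A minimal sketch of the difference, assuming a made-up stand-in for the data set (the column x1 and the values are hypothetical, and only one numeric column is used for brevity):

```r
# Hypothetical stand-in for the data set described above:
# 400 rows, a class label `cl`, and a single numeric column `x1`.
set.seed(1)
data <- data.frame(cl = rep(c("A", "B", "C", "D"), each = 100),
                   x1 = rnorm(400))

a_rows <- subset(data, cl == "A")

# A data frame is a list of columns, so sample() permutes the columns.
shuffled_cols <- sample(a_rows)
dim(shuffled_cols)  # same rows and columns as a_rows; only column order can change

# Sampling row indices instead draws random observations.
picked <- a_rows[sample(nrow(a_rows), 67), ]
```

Indexing with sample(nrow(a_rows), 67) draws 67 distinct row positions without replacement, which is what a training subset for one class needs.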

To summarize, my goal is to have 67 random observations from each of A, B, C, and D as my training data, and to store the remaining 33 observations from each of A, B, C, and D as testing data. I have found a question very similar to mine, but it did not factor in multiple variables.
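One way the split described above could be sketched in base R is to sample row indices separately within each class and then combine them. This is only an illustration under assumptions: the simulated data frame, its generated column names, and the seed are all hypothetical, and round() is used so that 2/3 of 100 becomes 67.

```r
# Hypothetical data: class label `cl` plus seven numeric columns (X1..X7).
set.seed(42)
data <- data.frame(cl = rep(c("A", "B", "C", "D"), each = 100),
                   matrix(rnorm(400 * 7), ncol = 7))

# Group the row indices by class, then draw about 2/3 of each group.
idx_by_class <- split(seq_len(nrow(data)), data$cl)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  n_train <- round(length(idx) * 2 / 3)   # 67 when a class has 100 rows
  idx[sample.int(length(idx), n_train)]   # sample positions, then map back to rows
}), use.names = FALSE)

train <- data[train_idx, ]    # sampled rows from every class
test  <- data[-train_idx, ]   # everything that was not sampled

table(train$cl)
table(test$cl)
```

Sampling positions with sample.int() rather than calling sample() on the index vector directly avoids the surprise where sample() on a single number n draws from 1:n. The same approach should carry over when the classes have unequal sizes, since each class is handled on its own.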

I feel silly asking this question because it seems so simple, but I'm stumped. Also, this is my first question on this site, so I apologize in advance for any faux pas on my part.

© Stack Overflow or respective owner