Search Results

Search found 17 results on 1 pages for 'plyr'.

Page 1/1 | 1 

  • Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

    - by Mark Hornick
    The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that if TRUE, will apply the function in parallel, using the backend provided by foreach. This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the correpsonding functionality using ore.groupApply. To get started, we'll create a demo data set and load the plyr package. set.seed(1) d <- data.frame(year = rep(2000:2014, each = 3),         count = round(runif(45, 0, 20))) dim(d) library(plyr) This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame. # Example 1 res <- ddply(d, "year", function(x) {   mean.count <- mean(x$count)   sd.count <- sd(x$count)   cv <- sd.count/mean.count   data.frame(cv.count = cv)   }) To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame. D <- ore.push(d) res <- ore.groupApply (D, D$year, function(x) {   mean.count <- mean(x$count)   sd.count <- sd(x$count)   cv <- sd.count/mean.count   data.frame(year=x$year[1], cv.count = cv)   }, FUN.VALUE=data.frame(year=1, cv.count=1)) You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result. The results in both cases are the same, however the ore.groupApply result is an ore.frame. In this case the data stays in the database until it's actually required. This can result in significant memory and time savings whe data is large. R> class(res) [1] "ore.frame" attr(,"package") [1] "OREbase" R> head(res)    year cv.count 1 2000 0.3984848 2 2001 0.6062178 3 2002 0.2309401 4 2003 0.5773503 5 2004 0.3069680 6 2005 0.3431743 To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used. The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year. # Example 2 ddply(d, "year", summarise, mean.count = mean(count)) res <- ore.groupApply (D, D$year, function(x) {   mean.count <- mean(x$count)   data.frame(year=x$year[1], mean.count = mean.count)   }, FUN.VALUE=data.frame(year=1, mean.count=1)) R> head(res)    year mean.count 1 2000 7.666667 2 2001 13.333333 3 2002 15.000000 4 2003 3.000000 5 2004 12.333333 6 2005 14.666667 Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicilty, which is returned as an ore.frame. # Example 3 ddply(d, "year", transform, total.count = sum(count)) res <- ore.groupApply (D, D$year, function(x) {   total.count <- sum(x$count)   data.frame(year=x$year[1], count=x$count, total.count = total.count)   }, FUN.VALUE=data.frame(year=1, count=1, total.count=1)) > head(res)    year count total.count 1 2000 5 23 2 2000 7 23 3 2000 11 23 4 2001 18 40 5 2001 4 40 6 2001 18 40 In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns. # Example 4 ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),       cv = sigma/mu) res <- ore.groupApply (D, D$year, function(x) {   mu <- mean(x$count)   sigma <- sd(x$count)   cv <- sigma/mu   data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)   }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1)) R> head(res)    year count mu sigma cv 1 2000 5 7.666667 3.055050 0.3984848 2 2000 7 7.666667 3.055050 0.3984848 3 2000 11 7.666667 3.055050 0.3984848 4 2001 18 13.333333 8.082904 0.6062178 5 2001 4 13.333333 8.082904 0.6062178 6 2001 18 13.333333 8.082904 0.6062178 In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data. # Example 5 baseball.dat <- subset(baseball, year > 2000) # data from the plyr package x <- ddply(baseball.dat, c("year", "team"), summarize,            homeruns = sum(hr)) We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows. BB.DAT <- ore.push(baseball) BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+")) BB.DAT2 <- subset(BB.DAT, year > 2000) X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {   data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))   }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE) res <- ore.sort(X, by=c("year","team")) R> head(res)    year team homeruns 1 2001 ANA 4 2 2001 ARI 155 3 2001 ATL 63 4 2001 BAL 58 5 2001 BOS 77 6 2001 CHA 63 Our next example is derived from the ggplot function documentation. This illustrates the use of ddply within using the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database and invoke this on the database server. The graph will be returned to the client window, as depicted below. # Example 6 with ggplot2 library(ggplot2) df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),                  y = rnorm(30)) # Compute sample mean and standard deviation in each group library(plyr) ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y)) # Set up a skeleton ggplot object and add layers: ggplot() +   geom_point(data = df, aes(x = gp, y = y)) +   geom_point(data = ds, aes(x = gp, y = mean),              colour = 'red', size = 3) +   geom_errorbar(data = ds, aes(x = gp, y = mean,                                ymin = mean - sd, ymax = mean + sd),              colour = 'red', width = 0.4) DF <- ore.push(df) ore.tableApply(DF, function(df) {   library(ggplot2)   library(plyr)   ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))   ggplot() +     geom_point(data = df, aes(x = gp, y = y)) +     geom_point(data = ds, aes(x = gp, y = mean),                colour = 'red', size = 3) +     geom_errorbar(data = ds, aes(x = gp, y = mean,                                  ymin = mean - sd, ymax = mean + sd),                   colour = 'red', width = 0.4) }) But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element. df2 <- rbind(df,df,df) df2$y <- df2$y + rnorm(nrow(df2)) df2$index <- c(rep(1,300), rep(2,300), rep(3,300)) DF2 <- ore.push(df2) res <- ore.groupApply(DF2, DF2$index, function(df) {   df <- df[,1:2]   library(ggplot2)   library(plyr)   ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))   ggplot() +     geom_point(data = df, aes(x = gp, y = y)) +     geom_point(data = ds, aes(x = gp, y = mean),                colour = 'red', size = 3) +     geom_errorbar(data = ds, aes(x = gp, y = mean,                                  ymin = mean - sd, ymax = mean + sd),                   colour = 'red', width = 0.4)   }, parallel=TRUE) res[[1]] res[[2]] res[[3]] To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

    Read the article

  • Calculating a Sample Covariance Matrix for Groups with plyr

    - by John A. Ramey
    I'm going to use the sample code from http://gettinggeneticsdone.blogspot.com/2009/11/split-apply-and-combine-in-r-using-plyr.html for this example. So, first, let's copy their example data: mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10))) I am going to ignore SNP2 in this example and just pretend the values in SNP1 denote group membership. So then, I may want some summary statistics about each group in SNP1: "AA", "Aa", "aa". Then if I want to calculate the means for each variable, it makes sense (modifying their code slightly) to use: > ddply(mydata, c("SNP1"), function(df) data.frame(meanX1=mean(df$X1), meanX2=mean(df$X2))) SNP1 meanX1 meanX2 1 aa 0.05178028 4.812302 2 Aa 0.30586206 4.820739 3 AA -0.26862500 4.856006 But what if I want the sample covariance matrix for each group? Ideally, I would like a 3D array, where the I have the covariance matrix for each group, and the third dimension denotes the corresponding group. I tried a modified version of the previous code and got the following results that have convinced me that I'm doing something wrong. > daply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2))) , , = 1 SNP1 1 2 aa 1.4961210 -0.9496134 Aa 0.8833190 -0.1640711 AA 0.9942357 -0.9955837 , , = 2 SNP1 1 2 aa -0.9496134 2.881515 Aa -0.1640711 2.466105 AA -0.9955837 4.938320 I was thinking that the dim() of the 3rd dimension would be 3, but instead, it is 2. Really this is a sliced up version of the covariance matrix for each group. If we manually compute the sample covariance matrix for aa, we get: [,1] [,2] [1,] 1.4961210 -0.9496134 [2,] -0.9496134 2.8815146 Using plyr, the following gives me what I want in list() form: > dlply(mydata, c("SNP1"), function(df) cov(cbind(df$X1, df$X2))) $aa [,1] [,2] [1,] 1.4961210 -0.9496134 [2,] -0.9496134 2.8815146 $Aa [,1] [,2] [1,] 0.8833190 -0.1640711 [2,] -0.1640711 2.4661046 $AA [,1] [,2] [1,] 0.9942357 -0.9955837 [2,] -0.9955837 4.9383196 attr(,"split_type") [1] "data.frame" attr(,"split_labels") SNP1 1 aa 2 Aa 3 AA But like I said earlier, I would really like this in a 3D array. Any thoughts on where I went wrong with daply() or suggestions? Of course, I could typecast the list from dlply() to a 3D array, but I'd rather not do this because I will be repeating this process many times in a simulation. As a side note, I found one method (http://www.mail-archive.com/[email protected]/msg86328.html) that provides the sample covariance matrix for each group, but the outputted object is bloated. Thanks in advance.

    Read the article

  • How to better create stacked bar graphs with multiple variables from ggplot2?

    - by deoksu
    I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things: First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, which is why I take out the axis labels completely. Second, there must be a simpler way to reorganize my data to make this happen. It seems like the sort of thing I should be able to do natively in ggplot2 with plyR, but the documentation for plyR is not very clear (and I have read both the ggplot2 book and the online plyR documentation. My best graph looks like this, the code to create it follows: the R code I use to get it is the following: library(epicalc) ### recode the variables to factors ### recode(c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ), c(1,2,3,4,5,6,7,8,9, NA), c('Very Interested','Somewhat Interested','Not Very Interested','Not At All interested',NA,NA,NA,NA,NA,NA)) ### Combine recoded variables to a common vector Interest1<-c(int_newcoun, int_newneigh, int_neweur, int_newusa, int_neweco, int_newit, int_newen, int_newsp, int_newhr, int_newlit, int_newent, int_newrel, int_newhth, int_bapo, int_wopo, int_eupo, int_educ) ### Create a second vector to label the first vector by original variable ### a1<-rep("News about Bangladesh", length(int_newcoun)) a2<-rep("Neighboring Countries", length(int_newneigh)) [...] a17<-rep("Education", length(int_educ)) Interest2<-c(a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17) ### Create a Weighting vector of the proper length ### Interest.weight<-rep(weight, 17) ### Make and save a new data frame from the three vectors ### Interest.df<-cbind(Interest1, Interest2, Interest.weight) Interest.df<-as.data.frame(Interest.df) write.csv(Interest.df, 'C:\\Documents and Settings\\[name]\\Desktop\\Sweave\\InterestBangladesh.csv') ### Sort the factor levels to display properly ### Interest.df$Interest1<-relevel(Interest$Interest1, ref='Not Very Interested') Interest.df$Interest1<-relevel(Interest$Interest1, ref='Somewhat Interested') Interest.df$Interest1<-relevel(Interest$Interest1, ref='Very Interested') Interest.df$Interest2<-relevel(Interest$Interest2, ref='News about Bangladesh') Interest.df$Interest2<-relevel(Interest$Interest2, ref='Education') [...] Interest.df$Interest2<-relevel(Interest$Interest2, ref='European Politics') detach(Interest) attach(Interest) ### Finally create the graph in ggplot2 ### library(ggplot2) p<-ggplot(Interest, aes(Interest2, ..count..)) p<-p+geom_bar((aes(weight=Interest.weight, fill=Interest1))) p<-p+coord_flip() p<-p+scale_y_continuous("", breaks=NA) p<-p+scale_fill_manual(value = rev(brewer.pal(5, "Purples"))) p update_labels(p, list(fill='', x='', y='')) I'd very much appreciate any tips, tricks or hints. Thanks.

    Read the article

  • subtotals in columns usind reshape2 in R

    - by user1043144
    I have spent some time now learning RESHAPE2 and plyr but I still do not get it. This time I have a problem with (a) subtotals and (b) passing different aggregate functions . Here an example using data from the excellent tutorial on the blog of mrdwab http://news.mrdwab.com/ # libraries library(plyr) library(reshape2) # get data and add few more variables book.sales = read.csv("http://news.mrdwab.com/data-booksales") book.sales$Stock = book.sales$Quantity + 10 book.sales$SubjCat[(book.sales$Subject == 'Economics') | (book.sales$Subject == 'Management') ] <- '1_EconSciences' book.sales$SubjCat[book.sales$Subject %in% c('Anthropology', 'Politics', 'Sociology') ] <- '2_SocSciences' book.sales$SubjCat[book.sales$Subject %in% c('Communication', 'Fiction', 'History', 'Research', 'Statistics') ] <- '3_other' # to get to my starting dataframe (close to the project I am working on) book.sales1 <- ddply(book.sales, c('Region', 'Representative', 'SubjCat', 'Subject', 'Publisher'), summarize, Stock = sum(Stock), Sold = sum(Quantity), Ratio = round((100 * sum(Quantity)/ sum(Stock)), digits = 1)) #melt it m.book.sales = melt(data = book.sales1, id.vars = c('Region', 'Representative', 'SubjCat', 'Subject', 'Publisher'), measured.vars = c('Stock', 'Sold', 'Ratio')) # cast it Tab1 <- dcast(data = m.book.sales, formula = Region + Representative ~ Publisher + variable, fun.aggregate = sum, margins = c('Region', 'Representative')) Now my questions : I have been able to add the subtotals in rows. But is it possible also to add margins in the columns. Say for example, Totals of Stock for one Publisher ? Sorry I meant to say example total sold for all publishers There is a problem with the columns with “ratio”. How can I get “mean” instead of “sum” for this variable ? P.S: I have seen some examples using reshape. Will you recommend to use it instead of reshape2 (which seems not to include the functionalities of two functions).

    Read the article

  • Repeat elements of vector in R

    - by bshor
    Hi, I'm trying to repeat the elements of vector a, b number of times. That is, a="abc" should be "aabbcc" if y = 2. Why doesn't either of the following code examples work? sapply(a, function (x) rep(x,b)) and from the plyr package, aaply(a, function (x) rep(x,b)) I know I'm missing something very obvious ...

    Read the article

  • break dataframe into subsets by factor values, send to function that returns glm class, how to recom

    - by Alex Holcombe
    Thanks to Hadley's plyr package ddply function we can take a dataframe, break it down into subdataframes by factors, send each to a function, and then combine the function results for each subdataframe into a new dataframe. But what if the function returns an object of a class like glm or in my case, a c("glm", "lm"). Then, these can't be combined into a dataframe can they? I get this error instead Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) : cannot coerce class 'c("glm", "lm")' into a data.frame Is there some more flexible data structure that will accommodate all the complex glm class results of my function calls, preserving the information regarding the dataframe subsets? Or should this be done in an entirely different way?

    Read the article

  • Efficient alternatives to merge for larger data.frames R

    - by Etienne Low-Décarie
    I am looking for an efficient (both computer resource wise and learning/implementation wise) method to merge two larger (size1 million / 300 KB RData file) data frames. "merge" in base R and "join" in plyr appear to use up all my memory effectively crashing my system. Example load test data frame and try test.merged<-merge(test, test) or test.merged<-join(test, test, type="all") - The following post provides a list of merge and alternatives: How to join data frames in R (inner, outer, left, right)? The following allows object size inspection: https://heuristically.wordpress.com/2010/01/04/r-memory-usage-statistics-variable/ Data produced by anonym

    Read the article

  • Using ddply() to Get Frequency of Certain IDs, by Appearance in Multiple Rows (in R)

    - by EconomiCurtis
    Goal If the following description is hard follow, please see the example "before" and "after" to see a straightforward example. I have bartering data, with unique trade ids, and two sides of the trade. Side1 and Side2 are baskets, lists of item ids that represent both sides of the barter transaction. I'd like to count the frequency each ITEM appears in TRADES. E.g, if item "001" appeared in 3 trades, I'd have a count of 3 (ignoring how many times the item appeared in each trade). Further, I'd like to do this with the plyr ddply function. (If you're interested as to my motivation, I working over many hundreds of thousands of transactions and am already using a ddply to calculate several other summary statistics. I'd like to add this to the ddply I'm already using, rather than calculate it after, and merge it into the ddply output.... sorry if that was difficult to follow.) In terms of pseudo code I'm working off of: merge each row of Side1 and Side2 by row, get unique() appearances of each item id apply table() function transpose and relabel output from table Example of the structure of my data, and the output I desire. Data Example (before): df <- data.frame(TradeID = c("01","02","03","04")) df$Side1 = list(c("001","001","002"), c("002","002","003"), c("001","004"), c("001","002","003","004")) df$Side2 = list(c("001"),c("007"),c("009"),c()) Desired Output (after): df.ItemRelFreq_byTradeID <- data.frame(ItemID = c("001","002","003","004","007","009"), RelFreq_byTrade = c(3,3,2,2,1,1)) One method to do this without ddply I've worked out one way to do this below. My problem is that I can't quite seem to get ddply to do this for me. temp <- table(unlist(sapply(mapply(c,df$Side1,df$Side2), unique))) df.ItemRelFreq_byTradeID <- data.frame(ItemID = names(temp), RelFreq_byTrade = temp[]) Thanks for any help you can offer! Curtis

    Read the article

  • How do I sub sample data by group using ddply?

    - by Maiasaura
    I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset. I wrote a function to subset a data frame as follows: samp <- function(dataf) { dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),] } Now I want to apply this function to each species in a larger data frame. When I try something like culled_data = ddply (larger_data, .(species), subset, samp) I get this error: Error in subset.data.frame(piece, ...) : 'subset' must evaluate to logical Anyone got ideas on how to do this?

    Read the article

  • compare two characters based on subset

    - by schultem
    I have a simple dataframe with two columns: df <- data.frame(x = c(1,1,2,2,3), y = c(rep(1:2,2),1), target = c('a','a','a','b','a')) I would like to compare the strings in the target column (find out whether they are equal or not, i.e., TRUE or FALSE) within every level of x (same number for x). First I would like to compare lines 1 and 2, then 3 and 4 ... My problem is that I am missing some comparisons, for example, line 5 has only one case instead of two - so it should turn out to be FALSE. Variable y indicates the first and second case within x. I played around with ddply doing something like: ddply(df, .(x), summarise, ifelse(as.character(df[df$y == '1',]$target), as.character(df[df$y == '2',]$target),0,1)) which is ugly ... and does not work ... Any insights how I could achieve this comparison? Thanks

    Read the article

  • how to determine if a character vector is a valid numeric or integer vector

    - by Andrew Barr
    I am trying to turn a nested list structure into a dataframe. The list looks similar to the following (it is serialized data from parsed JSON read in using the httr package). myList <- list(object1 = list(w=1, x=list(y=0.1, z="cat")), object2 = list(w=2, x=list(y=0.2, z="dog"))) unlist(myList) does a great job of recursively flattening the list, and I can then use lapply to flatten all the objects nicely. flatList <- lapply(myList, FUN= function(object) {return(as.data.frame(rbind(unlist(object))))}) And finally, I can button it up using plyr::rbind.fill myDF <- do.call(plyr::rbind.fill, flatList) str(myDF) #'data.frame': 2 obs. of 3 variables: #$ w : Factor w/ 2 levels "1","2": 1 2 #$ x.y: Factor w/ 2 levels "0.1","0.2": 1 2 #$ x.z: Factor w/ 2 levels "cat","dog": 1 2 The problem is that w and x.y are now being interpreted as character vectors, which by default get parsed as factors in the dataframe. I believe that unlist() is the culprit, but I can't figure out another way to recursively flatten the list structure. A workaround would be to post-process the dataframe, and assign data types then. What is the best way to determine if a vector is a valid numeric or integer vector?

    Read the article

  • data handling with javascript

    - by Vincent Warmerdam
    Python has a very neat package called pandas which allows for quick data transformation; tables, aggregation, that sort of thing. A lot of these types of functionality can also be found in the python itertools module. The plyR package in R is also very similar. Usually one woud use this functionality to produce a table which is later visualized with a plot. I am personally very fond of d3, and I would like to allow the user to first indicate what type of data aggregation he wants on the dataset before it is visualized. The visualisation in question involves making a heatmap where the user gets to select the size of the bins of the heatmap beforehand (I want d3 to project this through leaflet). I want to visually select the ideal size of the bins for the heatmap. The way I work now is that I take the dataset, aggregate it with python and then manually load it in d3. This is a process that takes a lot of human effort and I was wondering if the data aggregation can be done through the javascript of the browser. I couldn't find a package for javascript specifically built for data, suggesting (to me) that this is a bad idea and that one should not use javascript for the data handling. Is there a good module/package for javascript to handle data aggregation? Is it a good/bad idea to do the data aggregation in javascript (performance wise)?

    Read the article

  • Subset a data.frame by list and apply function on each part, by rows

    - by aL3xa
    This may seem as a typical plyr problem, but I have something different in mind. Here's the function that I want to optimize (skip the for loop). # dummy data set.seed(1985) lst <- list(a=1:10, b=11:15, c=16:20) m <- matrix(round(runif(200, 1, 7)), 10) m <- as.data.frame(m) dfsub <- function(dt, lst, fun) { # check whether dt is `data.frame` stopifnot (is.data.frame(dt)) # check if vectors in lst are "whole" / integer # vector elements should be column indexes is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol # fall if any non-integers in list idx <- rapply(lst, is.wholenumber) stopifnot(idx) # check for list length stopifnot(ncol(dt) == length(idx)) # subset the data subs <- list() for (i in 1:length(lst)) { # apply function on each part, by row subs[[i]] <- apply(dt[ , lst[[i]]], 1, fun) } # preserve names names(subs) <- names(lst) # convert to data.frame subs <- as.data.frame(subs) # guess what =) return(subs) } And now a short demonstration... actually, I'm about to explain what I primarily intended to do. I wanted to subset a data.frame by vectors gathered in list object. Since this is a part of code from a function that accompanies data manipulation in psychological research, you can consider m as a results from personality questionnaire (10 subjects, 20 vars). Vectors in list hold column indexes that define questionnaire subscales (e.g. personality traits). Each subscale is defined by several items (columns in data.frame). If we presuppose that the score on each subscale is nothing more than sum (or some other function) of row values (results on that part of questionnaire for each subject), you could run: > dfsub(m, lst, sum) a b c 1 46 20 24 2 41 24 21 3 41 13 12 4 37 14 18 5 57 18 25 6 27 18 18 7 28 17 20 8 31 18 23 9 38 14 15 10 41 14 22 I took a glance at this function and I must admit that this little loop isn't spoiling the code at all... BUT, if there's an easier/efficient way of doing this, please, let me know!

    Read the article

  • How to include multiple tables programmaticaly into a Sweave document using R

    - by PaulHurleyuk
    Hello, I want to have a sweave document that will include a variable number of tables in. I thought the example below would work, but it doesn't. I want to loop over the list foo and print each element as it's own table. % \documentclass[a4paper]{article} \usepackage[OT1]{fontenc} \usepackage{longtable} \usepackage{geometry} \usepackage{Sweave} \geometry{left=1.25in, right=1.25in, top=1in, bottom=1in} \listfiles \begin{document} <<label=start, echo=FALSE, include=FALSE>>= startt<-proc.time()[3] library(RODBC) library(psych) library(xtable) library(plyr) library(ggplot2) options(width=80) #Produce some example data, here I'm creating some dummy dataframes and putting them in a list foo<-list() foo[[1]]<-data.frame(GRP=c(rep("AA",10), rep("Aa",10), rep("aa",10)), X1=rnorm(30), X2=rnorm(30,5,2)) foo[[2]]<-data.frame(GRP=c(rep("BB",10), rep("bB",10), rep("BB",10)), X1=rnorm(30), X2=rnorm(30,5,2)) foo[[3]]<-data.frame(GRP=c(rep("CC",12), rep("cc",18)), X1=rnorm(30), X2=rnorm(30,5,2)) foo[[4]]<-data.frame(GRP=c(rep("DD",10), rep("Dd",10), rep("dd",10)), X1=rnorm(30), X2=rnorm(30,5,2)) @ \title{Docuemnt to test putting a variable number of tables into a sweave Document} \author{"Paul Hurley"} \maketitle \section{Text} This document was created on \today, with \Sexpr{print(version$version.string)} running on a \Sexpr{print(version$platform)} platform. It took approx \input{time} sec to process. <<label=test, echo=FALSE, results=tex>>= cat("Foo") @ that was a test, so is this <<label=table1test, echo=FALSE, results=tex>>= print(xtable(foo[[1]])) @ \newpage \subsection{Tables} <<label=Tables, echo=FALSE, results=tex>>= for(i in seq(foo)){ cat("\n") cat(paste("Table_",i,sep="")) cat("\n") print(xtable(foo[[i]])) cat("\n") } #cat("<<label=endofTables>>= ") @ <<label=bye, include=FALSE, echo=FALSE>>= endt<-proc.time()[3] elapsedtime<-as.numeric(endt-startt) @ <<label=elapsed, include=FALSE, echo=FALSE>>= fileConn<-file("time.tex", "wt") writeLines(as.character(elapsedtime), fileConn) close(fileConn) @ \end{document} Here, the table1test chunk works as expected, and produced a table based on the dataframe in foo[[1]], however the loop only produces Table(underscore)1.... Any ideas what I'm doing wrong ?

    Read the article

  • 2D Histogram in R: Converting from Count to Frequency within a Column

    - by Jac
    Would appreciate help with generating a 2D histogram of frequencies, where frequencies are calculated within a column. My main issue: converting from counts to column based frequency. Here's my starting code: # expected packages library(ggplot2) library(plyr) # generate example data corresponding to expected data input x_data = sample(101:200,10000, replace = TRUE) y_data = sample(1:100,10000, replace = TRUE) my_set = data.frame(x_data,y_data) # define x and y interval cut points x_seq = seq(100,200,10) y_seq = seq(0,100,10) # label samples as belonging within x and y intervals my_set$x_interval = cut(my_set$x_data,x_seq) my_set$y_interval = cut(my_set$y_data,y_seq) # determine count for each x,y block xy_df = ddply(my_set, c("x_interval","y_interval"),"nrow") # still need to convert for use with dplyr # convert from count to frequency based on formula: freq = count/sum(count in given x interval) ################ TRYING TO FIGURE OUT ################# # plot results fig_count <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = nrow)) # count fig_freq <- ggplot(xy_df, aes(x = x_interval, y = y_interval)) + geom_tile(aes(fill = freq)) # frequency I would appreciate any help in how to calculate the frequency within a column. Thanks! jac EDIT: I think the solution will require the following steps 1) Calculate and store overall counts for each x-interval factor 2) Divide the individual bin count by its corresponding x-interval factor count to obtain frequency. Not sure how to carry this out though. .

    Read the article

  • How do I replace values within a data frame with a string in R?

    - by Arturito
    short version: How do I replace values within a data frame with a string found within another data frame? longer version: I'm a biologist working with many species of bees. I have a data set with many thousands of bees. Each row has a unique bee ID # along with all the relevant info about that specimen (data of capture, GPS location, etc). The species information for each bee has not been entered because it takes a long time to ID them. When IDing, I end up with boxes of hundred of bees, all of the same species. I enter these into a separate data frame. I am trying to write code that will update the original data file with species information (family, genus, species, sex, etc) as I ID the bees. Currently, in the original data file, the species info is blank and is interpreted as NA within R. I want to have R find all unique bee ID #'s and fill in the species info, but I am having trouble figuring out how to replace the NA values with a string (e.g. "Andrenidae") Here is a simple example of what I am trying to do: rawData<-data.frame(beeID=c(1:20),family=rep(NA,20)) speciesInfo<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7)) rawData[rawData$beeID == 4,"family"] <- speciesInfo[speciesInfo$beeID == 4,"family"] So, I am replacing things as I want, but with a number rather than the family name (a string). What I would eventually like to do is write a little loop to add in all the species info, e.g.: for (i in speciesInfo$beeID){ rawData[rawData$beeID == i,"family"] <- speciesInfo[speciesInfo$beeID == i,"family"] } Thanks in advance for any advice! Cheers, Zak EDIT: I just noticed that the first two methods below add a new column each time, which would cause problems if I needed to add species info multiple times (which I typically do). For example: rawData<-data.frame(beeID=c(1:20),family=rep(NA,20)) Andrenidae<-data.frame(beeID=seq(1,20,3),family=rep("Andrenidae",7)) Halictidae<-data.frame(beeID=seq(1,20,3)+1,family=rep("Halictidae",7)) # using join library(plyr) rawData <- join(rawData, Andrenidae, by = "beeID", type = "left") rawData <- join(rawData, Halictidae, by = "beeID", type = "left") # using merge rawData <- merge(x=rawData,y=Andrenidae,by='beeID',all.x=T,all.y=F) rawData <- merge(x=rawData,y=Halictidae,by='beeID',all.x=T,all.y=F) Is there a way to either collapse the columns so that I have one, unified data frame? Or a way to update the rawData rather than adding a new column each time? Thanks in advance!

    Read the article

  • Solving Big Problems with Oracle R Enterprise, Part I

    - by dbayard
    Abstract: This blog post will show how we used Oracle R Enterprise to tackle a customer’s big calculation problem across a big data set. Overview: Databases are great for managing large amounts of data in a central place with rigorous enterprise-level controls.  R is great for doing advanced computations.  Sometimes you need to do advanced computations on large amounts of data, subject to rigorous enterprise-level concerns.  This blog post shows how Oracle R Enterprise enables R plus the Oracle Database enabled us to do some pretty sophisticated calculations across 1 million accounts (each with many detailed records) in minutes. The problem: A financial services customer of mine has a need to calculate the historical internal rate of return (IRR) for its customers’ portfolios.  This information is needed for customer statements and the online web application.  In the past, they had solved this with a home-grown application that pulled trade and account data out of their data warehouse and ran the calculations.  But this home-grown application was not able to do this fast enough, plus it was a challenge for them to write and maintain the code that did the IRR calculation. IRR – a problem that R is good at solving: Internal Rate of Return is an interesting calculation in that in most real-world scenarios it is impractical to calculate exactly.  Rather, IRR is a calculation where approximation techniques need to be used.  In this blog post, we will discuss calculating the “money weighted rate of return” but in the actual customer proof of concept we used R to calculate both money weighted rate of returns and time weighted rate of returns.  You can learn more about the money weighted rate of returns here: http://www.wikinvest.com/wiki/Money-weighted_return First Steps- Calculating IRR in R We will start with calculating the IRR in standalone/desktop R.  In our second post, we will show how to take this desktop R function, deploy it to an Oracle Database, and make it work at real-world scale.  The first step we did was to get some sample data.  For a historical IRR calculation, you have a balances and cash flows.  In our case, the customer provided us with several accounts worth of sample data in Microsoft Excel.      The above figure shows part of the spreadsheet of sample data.  The data provides balances and cash flows for a sample account (BMV=beginning market value. FLOW=cash flow in/out of account. EMV=ending market value). Once we had the sample spreadsheet, the next step we did was to read the Excel data into R.  This is something that R does well.  R offers multiple ways to work with spreadsheet data.  For instance, one could save the spreadsheet as a .csv file.  In our case, the customer provided a spreadsheet file containing multiple sheets where each sheet provided data for a different sample account.  To handle this easily, we took advantage of the RODBC package which allowed us to read the Excel data sheet-by-sheet without having to create individual .csv files.  We wrote ourselves a little helper function called getsheet() around the RODBC package.  Then we loaded all of the sample accounts into a data.frame called SimpleMWRRData. Writing the IRR function At this point, it was time to write the money weighted rate of return (MWRR) function itself.  The definition of MWRR is easily found on the internet or if you are old school you can look in an investment performance text book.  In the customer proof, we based our calculations off the ones defined in the The Handbook of Investment Performance: A User’s Guide by David Spaulding since this is the reference book used by the customer.  (One of the nice things we found during the course of this proof-of-concept is that by using R to write our IRR functions we could easily incorporate the specific variations and business rules of the customer into the calculation.) The key thing with calculating IRR is the need to solve a complex equation with a numerical approximation technique.  For IRR, you need to find the value of the rate of return (r) that sets the Net Present Value of all the flows in and out of the account to zero.  With R, we solve this by defining our NPV function: where bmv is the beginning market value, cf is a vector of cash flows, t is a vector of time (relative to the beginning), emv is the ending market value, and tend is the ending time. Since solving for r is a one-dimensional optimization problem, we decided to take advantage of R’s optimize method (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/optimize.html). The optimize method can be used to find a minimum or maximum; to find the value of r where our npv function is closest to zero, we wrapped our npv function inside the abs function and asked optimize to find the minimum.  Here is an example of using optimize: where low and high are scalars that indicate the range to search for an answer.   To test this out, we need to set values for bmv, cf, t, emv, tend, low, and high.  We will set low and high to some reasonable defaults. For example, this account had a negative 2.2% money weighted rate of return. Enhancing and Packaging the IRR function With numerical approximation methods like optimize, sometimes you will not be able to find an answer with your initial set of inputs.  To account for this, our approach was to first try to find an answer for r within a narrow range, then if we did not find an answer, try calling optimize() again with a broader range.  See the R help page on optimize()  for more details about the search range and its algorithm. At this point, we can now write a simplified version of our MWRR function.  (Our real-world version is  more sophisticated in that it calculates rate of returns for 5 different time periods [since inception, last quarter, year-to-date, last year, year before last year] in a single invocation.  In our actual customer proof, we also defined time-weighted rate of return calculations.  The beauty of R is that it was very easy to add these enhancements and additional calculations to our IRR package.)To simplify code deployment, we then created a new package of our IRR functions and sample data.  For this blog post, we only need to include our SimpleMWRR function and our SimpleMWRRData sample data.  We created the shell of the package by calling: To turn this package skeleton into something usable, at a minimum you need to edit the SimpleMWRR.Rd and SimpleMWRRData.Rd files in the \man subdirectory.  In those files, you need to at least provide a value for the “title” section. Once that is done, you can change directory to the IRR directory and type at the command-line: The myIRR package for this blog post (which has both SimpleMWRR source and SimpleMWRRData sample data) is downloadable from here: myIRR package Testing the myIRR package Here is an example of testing our IRR function once it was converted to an installable package: Calculating IRR for All the Accounts So far, we have shown how to calculate IRR for a single account.  The real-world issue is how do you calculate IRR for all of the accounts?This is the kind of situation where we can leverage the “Split-Apply-Combine” approach (see http://www.cscs.umich.edu/~crshalizi/weblog/815.html).  Given that our sample data can fit in memory, one easy approach is to use R’s “by” function.  (Other approaches to Split-Apply-Combine such as plyr can also be used.  See http://4dpiecharts.com/2011/12/16/a-quick-primer-on-split-apply-combine-problems/). Here is an example showing the use of “by” to calculate the money weighted rate of return for each account in our sample data set.  Recap and Next Steps At this point, you’ve seen the power of R being used to calculate IRR.  There were several good things: R could easily work with the spreadsheets of sample data we were given R’s optimize() function provided a nice way to solve for IRR- it was both fast and allowed us to avoid having to code our own iterative approximation algorithm R was a convenient language to express the customer-specific variations, business-rules, and exceptions that often occur in real-world calculations- these could be easily added to our IRR functions The Split-Apply-Combine technique can be used to perform calculations of IRR for multiple accounts at once. However, there are several challenges yet to be conquered at this point in our story: The actual data that needs to be used lives in a database, not in a spreadsheet The actual data is much, much bigger- too big to fit into the normal R memory space and too big to want to move across the network The overall process needs to run fast- much faster than a single processor The actual data needs to be kept secured- another reason to not want to move it from the database and across the network And the process of calculating the IRR needs to be integrated together with other database ETL activities, so that IRR’s can be calculated as part of the data warehouse refresh processes In our next blog post in this series, we will show you how Oracle R Enterprise solved these challenges.

    Read the article

1