Search Results

Search found 21 results on 1 pages for 'rbind'.

Page 1/1 | 1 

  • rbind.zoo doesn't seem create consistent zoo object

    - by a-or-b
    I want to rbind.zoo two zoo object together. When I was testing I came across the following issue(?)... Note: The below is an example, there is clearly no point to it apart from being illustrative. I have an zoo object, call it, 'X'. I want to break it into two parts and then rbind.zoo them together. When I compare it to the original object then all.equal gives differences. It appears that the '$class' attribute differs, but I can't see how or why. Is I make these xts objects then the all.equal works as expected. i.e. ..... X.date <- as.POSIXct(paste("2003-", rep(1:4, 4:1), "-", sample(1:28, 10, replace = TRUE), sep = "")) X <- zoo(matrix(rnorm(24), ncol = 2), X.date) a <- X[c(1:3), ] # first 3 elements b <- X[c(4:6), ] # second 3 elements c <- rbind.zoo(a, b) # rbind into an object of 6 elements d <- X[c(1:6), ] # all 6 elements all.equal(c, d) # are they equal? ~~~~ all.equal gives me the following difference: "Attributes: < Component 3: Attributes: < Length mismatch: comparison on first 1 components "

    Read the article

  • R problem with apply + rbind

    - by Carl
    I cannot seem to get the following to work directory <- "./" files.15x16 <- c("15x16-70d.out", "15x16-71d.out") data.15x16<-rbind( lapply( as.array(paste(directory, files.15x16, sep="")), FUN=read.csv, sep=" ", header=F) ) What it should be doing is pretty straightforward - I have a directory name, some file names, and actual files of data. I paste the directory and file names together, read the data from the files in, and then rbind them all together into a single chunk of data. Except the result of the lapply has the data in [[]] - i.e., accessing it occurs via a[[1]], a[[2]], etc which rbind doesn't seem to accept. Suggestions?

    Read the article

  • do.call(rbind, list) for uneven number of column

    - by h.l.m
    I have a list, with each element being a character vector, of differing lengths I would like to bind the data as rows, so that the column names 'line up' and if there is extra data then create column and if there is missing data then create NAs Below is a mock example of the data I am working with x <- list() x[[1]] <- letters[seq(2,20,by=2)] names(x[[1]]) <- LETTERS[c(1:length(x[[1]]))] x[[2]] <- letters[seq(3,20, by=3)] names(x[[2]]) <- LETTERS[seq(3,20, by=3)] x[[3]] <- letters[seq(4,20, by=4)] names(x[[3]]) <- LETTERS[seq(4,20, by=4)] The below line would normally be what I would do if I was sure that the format for each element was the same... do.call(rbind,x) I was hoping that someone had come up with a nice little solution that matches up the column names and fills in blanks with NAs whilst adding new columns if in the binding process new columns are found...

    Read the article

  • how to determine if a character vector is a valid numeric or integer vector

    - by Andrew Barr
    I am trying to turn a nested list structure into a dataframe. The list looks similar to the following (it is serialized data from parsed JSON read in using the httr package). myList <- list(object1 = list(w=1, x=list(y=0.1, z="cat")), object2 = list(w=2, x=list(y=0.2, z="dog"))) unlist(myList) does a great job of recursively flattening the list, and I can then use lapply to flatten all the objects nicely. flatList <- lapply(myList, FUN= function(object) {return(as.data.frame(rbind(unlist(object))))}) And finally, I can button it up using plyr::rbind.fill myDF <- do.call(plyr::rbind.fill, flatList) str(myDF) #'data.frame': 2 obs. of 3 variables: #$ w : Factor w/ 2 levels "1","2": 1 2 #$ x.y: Factor w/ 2 levels "0.1","0.2": 1 2 #$ x.z: Factor w/ 2 levels "cat","dog": 1 2 The problem is that w and x.y are now being interpreted as character vectors, which by default get parsed as factors in the dataframe. I believe that unlist() is the culprit, but I can't figure out another way to recursively flatten the list structure. A workaround would be to post-process the dataframe, and assign data types then. What is the best way to determine if a vector is a valid numeric or integer vector?

    Read the article

  • Mounting a network filesystem into schroot

    - by haggai_e
    Hi, I'm using a network file system (AFS) and I would like to also mount it into my schroot. Adding a line to /etc/schroot/mount-defaults, with bind or rbind in the options didn't help: schroot always mounts the directory with bind, and it remains empty in the chroot. My current solution is a script that remounts the /afs directory with rbind after the schroot has been set up. Is there a way to make schroot use rbind directly?

    Read the article

  • Solving quadratic programming using R

    - by user702846
    I would like to solve the following quadratic programming equation using ipop function from kernlab : min 0.5*x'*H*x + f'*x subject to: A*x <= b Aeq*x = beq LB <= x <= UB where in our example H 3x3 matrix, f is 3x1, A is 2x3, b is 2x1, LB and UB are both 3x1. edit 1 My R code is : library(kernlab) H <- rbind(c(1,0,0),c(0,1,0),c(0,0,1)) f = rbind(0,0,0) A = rbind(c(1,1,1), c(-1,-1,-1)) b = rbind(4.26, -1.73) LB = rbind(0,0,0) UB = rbind(100,100,100) > ipop(f,H,A,b,LB,UB,0) Error in crossprod(r, q) : non-conformable arguments I know from matlab that is something like this : H = eye(3); f = [0,0,0]; nsamples=3; eps = (sqrt(nsamples)-1)/sqrt(nsamples); A=ones(1,nsamples); A(2,:)=-ones(1,nsamples); b=[nsamples*(eps+1); nsamples*(eps-1)]; Aeq = []; beq = []; LB = zeros(nsamples,1); UB = ones(nsamples,1).*1000; [beta,FVAL,EXITFLAG] = quadprog(H,f,A,b,Aeq,beq,LB,UB); and the answer is a vector of 3x1 equals to [0.57,0.57,0.57]; However when I try it on R, using ipop function from kernlab library ipop(f,H,A,b,LB,UB,0)) and I am facing Error in crossprod(r, q) : non-conformable arguments I appreciate any comment

    Read the article

  • Group variables in a boxplot in R

    - by tao.hong
    I am trying to generate a boxplot whose data come from two scenarios. In the plot, I would like to group boxes by their names (So there will be two boxes per variable). I know ggplot would be a good choice. But I got errors which I could not figure out. Can anyone give me some suggestions? sensitivity_out1 structure(c(0.0522902104339716, 0.0521369824334004, 0.0520240345973737, 0.0519818337359876, 0.051935071418996, 0.0519089404325544, 0.000392698277338341, 0.000326135474295325, 0.000280863338343747, 0.000259631566041935, 0.000246594043996332, 0.000237923540393391, 0.00046732650331544, 0.000474448907808135, 0.000478287273678457, 0.000480194683464109, 0.000480631753078668, 0.000481760272726273, 0.000947965771207979, 0.000944821699830455, 0.000939631071343889, 0.000937186900570605, 0.000936007346568281, 0.000934756220144141, 0.00132442589501872, 0.00132658367774979, 0.00133334696220742, 0.00133622384928092, 0.0013381577476241, 0.00134005741746304, 0.0991622968751298, 0.100791399440082, 0.101946808417405, 0.102524244727408, 0.102920085260477, 0.103232984259916, 0.0305219507186844, 0.0304635269233494, 0.0304161055015213, 0.0303742106794513, 0.0303381888169022, 0.0302996157711171, 1.94268588634518e-05, 2.23991225564447e-05, 2.5756135487907e-05, 2.79997917298194e-05, 3.00753967077715e-05, 3.16270817369878e-05, 0.544701146678523, 0.542887331601984, 0.541632986366816, 0.541005610554556, 0.540617004208336, 0.540315690692195, 0.000453386694666078, 0.000448473414508756, 0.00044692043197248, 0.000444826296854332, 0.000445747996014684, 0.000444764303682453, 0.000127569551159321, 0.000128422491392669, 0.00012933662856487, 0.000129941842982939, 0.000129578971489026, 0.000131113075233758, 0.00684610571790029, 0.00686349387897349, 0.00687468164010565, 0.00687880720347743, 0.00688275579317197, 0.00687822247621936), .Dim = c(6L, 12L)) out2 structure(c(0.0189965816735366, 0.0189995096225103, 0.0190099362589894, 0.0190033523148514, 0.01900896721937, 0.0190099427513381, 0.00192043989797585, 0.00207303208721059, 0.00225931163225165, 0.0024049969048389, 0.00252310364086785, 0.00262940166568126, 0.00195164921633517, 0.00190079923515755, 0.00186139563778548, 0.00184188171395076, 0.00183248544676564, 0.00182492970673969, 1.83038731485927e-05, 1.98252671720347e-05, 2.14794764479231e-05, 2.30713122969332e-05, 2.4484220713564e-05, 2.55958833705284e-05, 0.0428066864455102, 0.0431686808647809, 0.0434411033615353, 0.0435883377765726, 0.0436690169266633, 0.0437340464360965, 0.145288252474567, 0.141488776430307, 0.138204532539654, 0.136281799717717, 0.134864952272761, 0.133738386148036, 0.0711728636959696, 0.072031388688795, 0.0727536853228245, 0.0731581966147734, 0.0734424337399303, 0.0736637270702609, 0.000605277151497094, 0.000617268349064968, 0.000632975679951382, 0.000643904422677427, 0.000653775268094148, 0.000662225067910141, 0.26735354610469, 0.267515415990146, 0.26753155165617, 0.267553498616325, 0.267532284594615, 0.267510330320289, 0.000334158771646756, 0.000319032383145857, 0.000306074699839994, 0.000299153278494114, 0.000293956197852583, 0.000290171804454218, 0.000645975219899115, 0.000637548672578787, 0.000632375486965757, 0.000629579821884212, 0.000624956458229123, 0.000622456283217054, 0.0645188290106884, 0.0651539609630352, 0.0656417364889907, 0.0658996698322889, 0.0660715073023965, 0.0662034341510152), .Dim = c(6L, 12L)) Melt data: group variable value 1 1 PLDKRT 0 2 1 PLDKRT 0 3 1 PLDKRT 0 4 1 PLDKRT 0 5 1 PLDKRT 0 6 1 PLDKRT 0 Code: #Data_source 1 sensitivity_1=rbind(sensitivity_out1,sensitivity_out2) sensitivity_1=data.frame(sensitivity_1) colnames(sensitivity_1)=main_l #variable names sensitivity_1$group=1 #Data_source 2 sensitivity_2=rbind(sensitivity_out1[3:4,],sensitivity_out2[3:4,]) sensitivity_2=data.frame(sensitivity_2) colnames(sensitivity_2)=main_l sensitivity_2$group=2 sensitivity_pool=rbind(sensitivity_1,sensitivity_2) sensitivity_pool_m=melt(sensitivity_pool,id.vars="group") ggplot(data = sensitivity_pool_m, aes(x = variable, y = value)) + geom_boxplot(aes( fill= group), width = 0.8) Error: "Error in unit(tic_pos.c, "mm") : 'x' and 'units' must have length > 0" Update Figure out the error. I should use geom_boxplot(aes( fill= factor(group)), width = 0.8) rather than fill= group

    Read the article

  • How do I introduce row names to a function in R

    - by Tahnoon Pasha
    Hi I have a utility function I've put together to insert rows into a dataframe below. If I was writing out the formula by hand I would put something like newframe=rbind(oldframe[1:rownum,],row_to_insert=row_to_insert,oldframe[(rownum+1:nrow(oldframe),] to name row_to_insert. Could someone tell me how to do this in a function? Thanks insertrows=function (x, y, rownum) { newframe = rbind(y[1:rownum, ], x, y[(rownum + 1):nrow(y), ]) return(data.frame(newframe)) } MWE of some underlying data added below financials=data.frame(sales=c(100,150,200,250),some.direct.costs=c(25,30,35,40),other.direct.costs=c(15,25,25,35),indirect.costs=c(40,45,45,50)) oldframe=t(financials) colnames(oldframe)=make.names(seq(2000,2003,1)) total.direct.costs=oldframe['some.direct.costs',]+oldframe['other.direct.costs',] newframe=total.direct.costs n=rownum=3 oldframe=insertrows(total.direct.costs=newframe,oldframe,n)

    Read the article

  • function not working R

    - by user3722828
    I've never programmed before and am trying to learn. I'm following that "coursera" course that I've seen other people post about — a course offered by Johns Hopkins on R programming. Anyway, this was supposed to be my first function. Yet, it doesn't work! But when I type out all the steps individually, it runs just fine... Can anyone tell me why? > pollutantmean <- function(directory, pollutant, id = 1:332){ + x<- list.files("/Users/mike******/Desktop/directory", full.names=TRUE) + y<- lapply(x, read.csv) + z<- do.call(rbind.data.frame, y[id]) + + mean(z$pollutant, na.rm=TRUE) + } > pollutantmean(specdata,nitrate,1:10) [1] NA Warning message: In mean.default(z$pollutant, na.rm = TRUE) : argument is not numeric or logical: returning NA #### > x<- list.files("/Users/mike******/Desktop/specdata",full.names=TRUE) > y<- lapply(x,read.csv) > z<- do.call(rbind.data.frame,y[1:10]) > mean(z$nitrate,na.rm=TRUE) [1] 0.7976266

    Read the article

  • [R] How to create a data.frame with a unknow number of columns ?

    - by Olivier
    Hello I would like to create, in a function, a boucle to create a data.frame with a variable number of columns. WIth something like : a = c("a","b") b = c(list(1,2,3), list(4,5,6)) data.frame(a,b) I would like to get a data-frame like : a 1 2 3 b 4 5 6 Instead of I obtain : a 1 2 3 4 5 6 b 1 2 3 4 5 6 Thank you ! PS : I also try with rbind, but it's doesn't work...

    Read the article

  • Create lags with a for-loop in R

    - by cptn
    I've got a data.frame with stock data of several companies (here it's only two). I want 10 additional columns in my stock data.frame df with lagged dates (from -5 days to +5 days) for both companies in my event data.frame. I'm using a for loop which is probably not the best solution, but it works partially. DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000") RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12) COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B") df <- data.frame(DATE, RET, COMP, stringsAsFactors=F) df # DATE RET COMP # 1 01.01.2000 -2.00 A # 2 02.01.2000 1.10 A # 3 03.01.2000 3.00 A # 4 06.01.2000 1.40 A # 5 07.01.2000 -0.20 A # 6 09.01.2000 0.60 A # 7 10.01.2000 0.10 A # 8 01.01.2000 -0.21 B # 9 02.01.2000 -1.20 B # 10 04.01.2000 0.90 B # 11 06.01.2000 0.30 B # 12 07.01.2000 -0.10 B # 13 09.01.2000 0.30 B # 14 10.01.2000 -0.12 B this loop works fine comp <- as.vector(unique(df$COMP)) mylist <- vector('list', length(comp)) # create lags in DATE for(i in 1:length(comp)) { print(i) comp_i <- comp[i] df_k <- df[df$COMP %in% comp_i, ] # all trading days of one firm df_k <- transform(df_k, DATEm1 = c(NA, head(DATE, -1)), DATEm2 = c(NA, NA, head(DATE, -2)), DATEm3 = c(NA, NA, NA, head(DATE, -3)), DATEm4 = c(NA, NA, NA, NA,head(DATE, -4)), DATEm5 = c(NA, NA, NA, NA, NA, head(DATE, -5)), DATEp1 = c(DATE[-1], NA)) #DATEp2 = c(DATE[-2], NA, NA), #DATEp3 = c(DATE[-3], NA, NA, NA), #DATEp4 = c(DATE[-4], NA, NA, NA, NA), #DATEp5 = c(DATE[-5], NA, NA, NA, NA, NA)) mylist[[i]] = df_k } df1 <- do.call(rbind, mylist) But if I add the lines with DATEp2, DATEp3, DATEp4, DATEp5. the code doesn't work. Can anybody tell me what I'm doing wrong here? Here the code with all the lagged dates. # create lags in DATE for(i in 1:length(comp)) { print(i) comp_i <- comp[i] df_k <- df[df$COMP %in% comp_i, ] # all trading days of one firm df_k <- transform(df_k, DATEm1 = c(NA, head(DATE, -1)), DATEm2 = c(NA, NA, head(DATE, -2)), DATEm3 = c(NA, NA, NA, head(DATE, -3)), DATEm4 = c(NA, NA, NA, NA,head(DATE, -4)), DATEm5 = c(NA, NA, NA, NA, NA, head(DATE, -5)), DATEp1 = c(DATE[-1], NA), DATEp2 = c(DATE[-2], NA, NA), DATEp3 = c(DATE[-3], NA, NA, NA), DATEp4 = c(DATE[-4], NA, NA, NA, NA), DATEp5 = c(DATE[-5], NA, NA, NA, NA, NA)) mylist[[i]] = df_k } df1 <- do.call(rbind, mylist)

    Read the article

  • R: Converting a list of data frames into one data frame

    - by JD Long
    I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame. I got some pointers from an earlier question which was trying to do something similar but more complex. Here's an example of what I am starting with (this is grossly simplified for illustration): listOfDataFrames <- NULL for (i in 1:100) { listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, rep=T), b=rnorm(500), c=rnorm(500)) } I am currently using this: df <- do.call("rbind", listOfDataFrames) *EDIT* whoops. In my haste to implement what I had "learned" in a previous question I totally screwed up. Yes, the unlist() is just plain wrong. I'm editing that out of the question above.

    Read the article

  • Best way to reduce consecutive NAs to single NA

    - by digEmAll
    I need to reduce the consecutive NA's in a vector to a single NA, without touching the other values. So, for example, given a vector like this: NA NA 8 7 NA NA NA NA NA 3 3 NA -1 4 what I need to get, is the following result: NA 8 7 NA 3 3 NA -1 4 Currently, I'm using the following function: reduceConsecutiveNA2One <- function(vect){ enc <- rle(is.na(vect)) # helper func tmpFun <- function(i){ if(enc$values[i]){ data.frame(L=c(enc$lengths[i]-1, 1), V=c(TRUE,FALSE)) }else{ data.frame(L=enc$lengths[i], V=enc$values[i]) } } Df <- do.call(rbind.data.frame,lapply(1:length(enc$lengths),FUN=tmpFun)) return(vect[rep.int(!Df$V,Df$L)]) } and it seems to work fine, but probably there's a simpler/faster way to accomplish this task. Any suggestions ? Thanks in advance.

    Read the article

  • improve my code for collapsing a list of data.frames

    - by romunov
    Dear StackOverFlowers (flowers in short), I have a list of data.frames (walk.sample) that I would like to collapse into a single (giant) data.frame. While collapsing, I would like to mark (adding another column) which rows have came from which element of the list. This is what I've got so far. This is the data.frame that needs to be collapsed/stacked. > walk.sample [[1]] walker x y 1073 3 228.8756 -726.9198 1086 3 226.7393 -722.5561 1081 3 219.8005 -728.3990 1089 3 225.2239 -727.7422 1032 3 233.1753 -731.5526 [[2]] walker x y 1008 3 205.9104 -775.7488 1022 3 208.3638 -723.8616 1072 3 233.8807 -718.0974 1064 3 217.0028 -689.7917 1026 3 234.1824 -723.7423 [[3]] [1] 3 [[4]] walker x y 546 2 629.9041 831.0852 524 2 627.8698 873.3774 578 2 572.3312 838.7587 513 2 633.0598 871.7559 538 2 636.3088 836.6325 1079 3 206.3683 -729.6257 1095 3 239.9884 -748.2637 1005 3 197.2960 -780.4704 1045 3 245.1900 -694.3566 1026 3 234.1824 -723.7423 I have written a function to add a column that denote from which element the rows came followed by appending it to an existing data.frame. collapseToDataFrame <- function(x) { # collapse list to a dataframe with a twist walk.df <- data.frame() for (i in 1:length(x)) { n.rows <- nrow(x[[i]]) if (length(x[[i]])>1) { temp.df <- cbind(x[[i]], rep(i, n.rows)) names(temp.df) <- c("walker", "x", "y", "session") walk.df <- rbind(walk.df, temp.df) } else { cat("Empty list", "\n") } } return(walk.df) } > collapseToDataFrame(walk.sample) Empty list Empty list walker x y session 3 1 -604.5055 -123.18759 1 60 1 -562.0078 -61.24912 1 84 1 -594.4661 -57.20730 1 9 1 -604.2893 -110.09168 1 43 1 -632.2491 -54.52548 1 1028 3 240.3905 -724.67284 1 1040 3 232.5545 -681.61225 1 1073 3 228.8756 -726.91980 1 1091 3 209.0373 -740.96173 1 1036 3 248.7123 -694.47380 1 I'm curious whether this can be done more elegantly, with perhaps do.call() or some other more generic function?

    Read the article

  • P values in wilcox.test gone mad :(

    - by Error404
    I have a code that isn't doing what it should do. I am testing P value for a wilcox.test for a huge set of data. the code i am using is the following library(MASS) data1 <- read.csv("file1path.csv",header=T,sep=",") data2 <- read.csv("file2path.csv",header=T,sep=",") data3 <- read.csv("file3path.csv",header=T,sep=",") data4 <- read.csv("file4path.csv",header=T,sep=",") data1$K <- with(data1,{"N"}) data2$K <- with(data2,{"E"}) data3$K <- with(data3,{"M"}) data4$K <- with(data4,{"U"}) new=rbind(data1,data2,data3,data4) i=3 for (o in 1:4800){ x1 <- data1[,i] x2 <- data2[,i] x3 <- data3[,i] x4 <- data4[,i] wt12 <- wilcox.test(x1,x2, na.omit=TRUE) wt13 <- wilcox.test(x1,x3, na.omit=TRUE) wt14 <- wilcox.test(x1,x4, na.omit=TRUE) if (wt12$p.value=="NaN"){ print("This is wrong") } else if (wt12$p.value < 0.05){ print(wt12$p.value) mypath=file.path("C:", "all1-less-05", (paste("graph-data1-data2",names(data1[i]), ".pdf", sep="-"))) pdf(file=mypath) mytitle = paste("graph",names(data1[i])) boxplot(new[,i] ~ new$K, main = mytitle, names.arg=c("data1","data2","data3","data4")) dev.off() } if (wt13$p.value=="NaN"){ print("This is wrong") } else if (wt13$p.value < 0.05){ print(wt13$p.value) mypath=file.path("C:", "all2-less-05", (paste("graph-data1-data3",names(data1[i]), ".pdf", sep="-"))) pdf(file=mypath) mytitle = paste("graph",names(data1[i])) boxplot(new[,i] ~ new$K, main = mytitle, names.arg=c("data1","data2","data3","data4")) dev.off() } if (wt14$p.value=="NaN"){ print("This is wrong") } else if (wt14$p.value < 0.05){ print(wt14$p.value) mypath=file.path("C:", "all3-less-05", (paste("graph-data1-data4",names(data1[i]), ".pdf", sep="-"))) pdf(file=mypath) mytitle = paste("graph",names(data1[i])) boxplot(new[,i] ~ new$K, main = mytitle, names.arg=c("data1","data2","data3","data4")) dev.off() } i=i+1 } I am having 2 problems with this long command: 1- Without specifying a certain P value, the code gives me arouind 14,000 graphs, when specifying a p value less than 0.05 the number of graphs generated goes down to 9,0000. THE FIRST PROBLEM IS: Some P value are more than 0.05 and are still showing up! 2- I designed the program to give me a result of "This is wrong" when the Value of P is "NaN", I am getting results of "NaN" Here's a screenshot from the results do you know what the mistake i made with the command to get these errors? Thanks in advance

    Read the article

  • Performing calculations by subsets of data in R

    - by Vivi
    I want to perform calculations for each company number in the column PERMNO of my data frame, the summary of which can be seen here: > summary(companydataRETS) PERMNO RET Min. :10000 Min. :-0.971698 1st Qu.:32716 1st Qu.:-0.011905 Median :61735 Median : 0.000000 Mean :56788 Mean : 0.000799 3rd Qu.:80280 3rd Qu.: 0.010989 Max. :93436 Max. :19.000000 My solution so far was to create a variable with all possible company numbers compns <- companydataRETS[!duplicated(companydataRETS[,"PERMNO"]),"PERMNO"] And then use a foreach loop using parallel computing which calls my function get.rho() which in turn perform the desired calculations rhos <- foreach (i=1:length(compns), .combine=rbind) %dopar% get.rho(subset(companydataRETS[,"RET"],companydataRETS$PERMNO == compns[i])) I tested it for a subset of my data and it all works. The problem is that I have 72 million observations, and even after leaving the computer working overnight, it still didn't finish. I am new in R, so I imagine my code structure can be improved upon and there is a better (quicker, less computationally intensive) way to perform this same task (perhaps using apply or with, both of which I don't understand). Any suggestions?

    Read the article

  • R Random Data Sets within loops

    - by jugossery
    Here is what I want to do: I have a time series data frame with let us say 100 time-series of length 600 - each in one column of the data frame. I want to pick up 4 of the time-series randomly and then assign them random weights that sum up to one (ie 0.1, 0.5, 0.3, 0.1). Using those I want to compute the mean of the sum of the 4 weighted time series variables (e.g. convex combination). I want to do this let us say 100k times and store each result in the form ts1.name, ts2.name, ts3.name, ts4.name, weight1, weight2, weight3, weight4, mean so that I get a 9*100k df. I tried some things already but R is very bad with loops and I know vector oriented solutions are better because of R design. Thanks Here is what I did and I know it is horrible The df is in the form v1,v2,v2.....v100 1,5,6,.......9 2,4,6,.......10 3,5,8,.......6 2,2,8,.......2 etc e=NULL for (x in 1:100000) { s=sample(1:100,4)#pick 4 variables randomly a=sample(seq(0,1,0.01),1) b=sample(seq(0,1-a,0.01),1) c=sample(seq(0,(1-a-b),0.01),1) d=1-a-b-c e=c(a,b,c,d)#4 random weights average=mean(timeseries.df[,s]%*%t(e)) e=rbind(e,s,average)#in the end i get the 9*100k df } The procedure runs way to slow. EDIT: Thanks for the help i had,i am not used to think R and i am not very used to translate every problem into a matrix algebra equation which is what you need in R. Then the problem becomes a little bit complex if i want to calculate the standard deviation. i need the covariance matrix and i am not sure i can if/how i can pick random elements for each sample from the original timeseries.df covariance matrix then compute the sample variance (t(sampleweights)%*%sample_cov.mat%*%sampleweights) to get in the end the ts.weighted_standard_dev matrix Last question what is the best way to proceed if i want to bootstrap the original df x times and then apply the same computations to test the robustness of my datas thanks

    Read the article

  • Calculating Growth-Rates by applying log-differences

    - by mropa
    I am trying to transform my data.frame by calculating the log-differences of each column and controlling for the rows id. So basically I like to calculate the growth rates for each id's variable. So here is a random df with an id column, a time period colum p and three variable columns: df <- data.frame (id = c("a","a","a","c","c","d","d","d","d","d"), p = c(1,2,3,1,2,1,2,3,4,5), var1 = rnorm(10, 5), var2 = rnorm(10, 5), var3 = rnorm(10, 5) ) df id p var1 var2 var3 1 a 1 5.375797 4.110324 5.773473 2 a 2 4.574700 6.541862 6.116153 3 a 3 3.029428 4.931924 5.631847 4 c 1 5.375855 4.181034 5.756510 5 c 2 5.067131 6.053009 6.746442 6 d 1 3.846438 4.515268 6.920389 7 d 2 4.910792 5.525340 4.625942 8 d 3 6.410238 5.138040 7.404533 9 d 4 4.637469 3.522542 3.661668 10 d 5 5.519138 4.599829 5.566892 Now I have written a function which does exactly what I want BUT I had to take a detour which is possibly unnecessary and can be removed. However, somehow I am not able to locate the shortcut. Here is the function and the output for the posted data frame: fct.logDiff <- function (df) { df.log <- dlply (df, "code", function(x) data.frame (p = x$p, log(x[, -c(1,2)]))) list.nalog <- llply (df.log, function(x) data.frame (p = x$p, rbind(NA, sapply(x[,-1], diff)))) ldply (list.nalog, data.frame) } fct.logDiff(df) id p var1 var2 var3 1 a 1 NA NA NA 2 a 2 -0.16136569 0.46472004 0.05765945 3 a 3 -0.41216720 -0.28249264 -0.08249587 4 c 1 NA NA NA 5 c 2 -0.05914281 0.36999681 0.15868378 6 d 1 NA NA NA 7 d 2 0.24428771 0.20188025 -0.40279188 8 d 3 0.26646102 -0.07267311 0.47041227 9 d 4 -0.32372771 -0.37748866 -0.70417351 10 d 5 0.17405309 0.26683625 0.41891802 The trouble is due to the added NA-rows. I don't want to collapse the frame and reduce it, which would be automatically done by the diff() function. So I had 10 rows in my original frame and am keeping the same amount of rows after the transformation. In order to keep the same length I had to add some NAs. I have taken a detour by transforming the data.frame into a list, add the NAs, and afterwards transform the list back into a data.frame. That looks tedious. Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?

    Read the article

  • Calculating all distances between one point and a group of points efficiently in R

    - by dbarbosa
    Hi, First of all, I am new to R (I started yesterday). I have two groups of points, data and centers, the first one of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find j in the second with the minimum distance. My idea is simple: for each i, let dist[j] be the distance between i and j, I only need to use which.min(dist) to find what I am looking for. Each point is an array of 64 doubles, so > dim(data) [1] 3823 64 > dim(centers) [1] 10 64 I have tried with for (i in 1:n) { for (j in 1:K) { d[j] <- sqrt(sum((centers[j,] - data[i,])^2)) } S[i] <- which.min(d) } which is extremely slow (with n = 200, it takes more than 40s!!). The fastest solution that I wrote is distance <- function(point, group) { return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)]) } for (i in 1:n) { d <- distance(data[i,], centers) which.min(d) } Even if it does a lot of computation that I don't use (because dist(m) computes the distance between all rows of m), it is way more faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. And also, the distance code is very ugly. I tried to replace it with distance <- function(point, group) { return (dist(rbind(point,group))[1:nrow(group)]) } but this seems to be twice slower. I also tried to use dist for each pair, but it is also slower. I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently? ps: I need this to implement k-means by hand (and I need to do it, it is part of an assignment). I believe I will only need Euclidian distance, but I am not yet sure, so I will prefer to have some code where the distance computation can be replaced easily. stats::kmeans do all computation in less than one second.

    Read the article

  • Convert ddply {plyr} to Oracle R Enterprise, or use with Embedded R Execution

    - by Mark Hornick
    The plyr package contains a set of tools for partitioning a problem into smaller sub-problems that can be more easily processed. One function within {plyr} is ddply, which allows you to specify subsets of a data.frame and then apply a function to each subset. The result is gathered into a single data.frame. Such a capability is very convenient. The function ddply also has a parallel option that if TRUE, will apply the function in parallel, using the backend provided by foreach. This type of functionality is available through Oracle R Enterprise using the ore.groupApply function. In this blog post, we show a few examples from Sean Anderson's "A quick introduction to plyr" to illustrate the correpsonding functionality using ore.groupApply. To get started, we'll create a demo data set and load the plyr package. set.seed(1) d <- data.frame(year = rep(2000:2014, each = 3),         count = round(runif(45, 0, 20))) dim(d) library(plyr) This first example takes the data frame, partitions it by year, and calculates the coefficient of variation of the count, returning a data frame. # Example 1 res <- ddply(d, "year", function(x) {   mean.count <- mean(x$count)   sd.count <- sd(x$count)   cv <- sd.count/mean.count   data.frame(cv.count = cv)   }) To illustrate the equivalent functionality in Oracle R Enterprise, using embedded R execution, we use the ore.groupApply function on the same data, but pushed to the database, creating an ore.frame. The function ore.push creates a temporary table in the database, returning a proxy object, the ore.frame. D <- ore.push(d) res <- ore.groupApply (D, D$year, function(x) {   mean.count <- mean(x$count)   sd.count <- sd(x$count)   cv <- sd.count/mean.count   data.frame(year=x$year[1], cv.count = cv)   }, FUN.VALUE=data.frame(year=1, cv.count=1)) You'll notice the similarities in the first three arguments. With ore.groupApply, we augment the function to return the specific data.frame we want. We also specify the argument FUN.VALUE, which describes the resulting data.frame. From our previous blog posts, you may recall that by default, ore.groupApply returns an ore.list containing the results of each function invocation. To get a data.frame, we specify the structure of the result. The results in both cases are the same, however the ore.groupApply result is an ore.frame. In this case the data stays in the database until it's actually required. This can result in significant memory and time savings whe data is large. R> class(res) [1] "ore.frame" attr(,"package") [1] "OREbase" R> head(res)    year cv.count 1 2000 0.3984848 2 2001 0.6062178 3 2002 0.2309401 4 2003 0.5773503 5 2004 0.3069680 6 2005 0.3431743 To make the ore.groupApply execute in parallel, you can specify the argument parallel with either TRUE, to use default database parallelism, or to a specific number, which serves as a hint to the database as to how many parallel R engines should be used. The next ddply example uses the summarise function, which creates a new data.frame. In ore.groupApply, the year column is passed in with the data. Since no automatic creation of columns takes place, we explicitly set the year column in the data.frame result to the value of the first row, since all rows received by the function have the same year. # Example 2 ddply(d, "year", summarise, mean.count = mean(count)) res <- ore.groupApply (D, D$year, function(x) {   mean.count <- mean(x$count)   data.frame(year=x$year[1], mean.count = mean.count)   }, FUN.VALUE=data.frame(year=1, mean.count=1)) R> head(res)    year mean.count 1 2000 7.666667 2 2001 13.333333 3 2002 15.000000 4 2003 3.000000 5 2004 12.333333 6 2005 14.666667 Example 3 uses the transform function with ddply, which modifies the existing data.frame. With ore.groupApply, we again construct the data.frame explicilty, which is returned as an ore.frame. # Example 3 ddply(d, "year", transform, total.count = sum(count)) res <- ore.groupApply (D, D$year, function(x) {   total.count <- sum(x$count)   data.frame(year=x$year[1], count=x$count, total.count = total.count)   }, FUN.VALUE=data.frame(year=1, count=1, total.count=1)) > head(res)    year count total.count 1 2000 5 23 2 2000 7 23 3 2000 11 23 4 2001 18 40 5 2001 4 40 6 2001 18 40 In Example 4, the mutate function with ddply enables you to define new columns that build on columns just defined. Since the construction of the data.frame using ore.groupApply is explicit, you always have complete control over when and how to use columns. # Example 4 ddply(d, "year", mutate, mu = mean(count), sigma = sd(count),       cv = sigma/mu) res <- ore.groupApply (D, D$year, function(x) {   mu <- mean(x$count)   sigma <- sd(x$count)   cv <- sigma/mu   data.frame(year=x$year[1], count=x$count, mu=mu, sigma=sigma, cv=cv)   }, FUN.VALUE=data.frame(year=1, count=1, mu=1,sigma=1,cv=1)) R> head(res)    year count mu sigma cv 1 2000 5 7.666667 3.055050 0.3984848 2 2000 7 7.666667 3.055050 0.3984848 3 2000 11 7.666667 3.055050 0.3984848 4 2001 18 13.333333 8.082904 0.6062178 5 2001 4 13.333333 8.082904 0.6062178 6 2001 18 13.333333 8.082904 0.6062178 In Example 5, ddply is used to partition data on multiple columns before constructing the result. Realizing this with ore.groupApply involves creating an index column out of the concatenation of the columns used for partitioning. This example also allows us to illustrate using the ORE transparency layer to subset the data. # Example 5 baseball.dat <- subset(baseball, year > 2000) # data from the plyr package x <- ddply(baseball.dat, c("year", "team"), summarize,            homeruns = sum(hr)) We first push the data set to the database to get an ore.frame. We then add the composite column and perform the subset, using the transparency layer. Since the results from database execution are unordered, we will explicitly sort these results and view the first 6 rows. BB.DAT <- ore.push(baseball) BB.DAT$index <- with(BB.DAT, paste(year, team, sep="+")) BB.DAT2 <- subset(BB.DAT, year > 2000) X <- ore.groupApply (BB.DAT2, BB.DAT2$index, function(x) {   data.frame(year=x$year[1], team=x$team[1], homeruns=sum(x$hr))   }, FUN.VALUE=data.frame(year=1, team="A", homeruns=1), parallel=FALSE) res <- ore.sort(X, by=c("year","team")) R> head(res)    year team homeruns 1 2001 ANA 4 2 2001 ARI 155 3 2001 ATL 63 4 2001 BAL 58 5 2001 BOS 77 6 2001 CHA 63 Our next example is derived from the ggplot function documentation. This illustrates the use of ddply within using the ggplot2 package. We first create a data.frame with demo data and use ddply to create some statistics for each group (gp). We then use ggplot to produce the graph. We can take this same code, push the data.frame df to the database and invoke this on the database server. The graph will be returned to the client window, as depicted below. # Example 6 with ggplot2 library(ggplot2) df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),                  y = rnorm(30)) # Compute sample mean and standard deviation in each group library(plyr) ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y)) # Set up a skeleton ggplot object and add layers: ggplot() +   geom_point(data = df, aes(x = gp, y = y)) +   geom_point(data = ds, aes(x = gp, y = mean),              colour = 'red', size = 3) +   geom_errorbar(data = ds, aes(x = gp, y = mean,                                ymin = mean - sd, ymax = mean + sd),              colour = 'red', width = 0.4) DF <- ore.push(df) ore.tableApply(DF, function(df) {   library(ggplot2)   library(plyr)   ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))   ggplot() +     geom_point(data = df, aes(x = gp, y = y)) +     geom_point(data = ds, aes(x = gp, y = mean),                colour = 'red', size = 3) +     geom_errorbar(data = ds, aes(x = gp, y = mean,                                  ymin = mean - sd, ymax = mean + sd),                   colour = 'red', width = 0.4) }) But let's take this one step further. Suppose we wanted to produce multiple graphs, partitioned on some index column. We replicate the data three times and add some noise to the y values, just to make the graphs a little different. We also create an index column to form our three partitions. Note that we've also specified that this should be executed in parallel, allowing Oracle Database to control and manage the server-side R engines. The result of ore.groupApply is an ore.list that contains the three graphs. Each graph can be viewed by printing the list element. df2 <- rbind(df,df,df) df2$y <- df2$y + rnorm(nrow(df2)) df2$index <- c(rep(1,300), rep(2,300), rep(3,300)) DF2 <- ore.push(df2) res <- ore.groupApply(DF2, DF2$index, function(df) {   df <- df[,1:2]   library(ggplot2)   library(plyr)   ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))   ggplot() +     geom_point(data = df, aes(x = gp, y = y)) +     geom_point(data = ds, aes(x = gp, y = mean),                colour = 'red', size = 3) +     geom_errorbar(data = ds, aes(x = gp, y = mean,                                  ymin = mean - sd, ymax = mean + sd),                   colour = 'red', width = 0.4)   }, parallel=TRUE) res[[1]] res[[2]] res[[3]] To recap, we've illustrated how various uses of ddply from the plyr package can be realized in ore.groupApply, which affords the user explicit control over the contents of the data.frame result in a straightforward manner. We've also highlighted how ddply can be used within an ore.groupApply call.

    Read the article

  • Join and sum not compatible matrices through data.table

    - by leodido
    My goal is to "sum" two not compatible matrices (matrices with different dimensions) using (and preserving) row and column names. I've figured this approach: convert the matrices to data.table objects, join them and then sum columns vectors. An example: > M1 1 3 4 5 7 8 1 0 0 1 0 0 0 3 0 0 0 0 0 0 4 1 0 0 0 0 0 5 0 0 0 0 0 0 7 0 0 0 0 1 0 8 0 0 0 0 0 0 > M2 1 3 4 5 8 1 0 0 1 0 0 3 0 0 0 0 0 4 1 0 0 0 0 5 0 0 0 0 0 8 0 0 0 0 0 > M1 %ms% M2 1 3 4 5 7 8 1 0 0 2 0 0 0 3 0 0 0 0 0 0 4 2 0 0 0 0 0 5 0 0 0 0 0 0 7 0 0 0 0 1 0 8 0 0 0 0 0 0 This is my code: M1 <- matrix(c(0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0), byrow = TRUE, ncol = 6) colnames(M1) <- c(1,3,4,5,7,8) M2 <- matrix(c(0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0), byrow = TRUE, ncol = 5) colnames(M2) <- c(1,3,4,5,8) # to data.table objects DT1 <- data.table(M1, keep.rownames = TRUE, key = "rn") DT2 <- data.table(M2, keep.rownames = TRUE, key = "rn") # join and sum of common columns if (nrow(DT1) > nrow(DT2)) { A <- DT2[DT1, roll = TRUE] A[, list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1), by = rn] } That outputs: rn X1 X3 X4 X5 X7 X8 1: 1 0 0 2 0 0 0 2: 3 0 0 0 0 0 0 3: 4 2 0 0 0 0 0 4: 5 0 0 0 0 0 0 5: 7 0 0 0 0 1 0 6: 8 0 0 0 0 0 0 Then I can convert back this data.table to a matrix and fix row and column names. The questions are: how to generalize this procedure? I need a way to automatically create list(X1 = X1 + X1.1, X3 = X3 + X3.1, X4 = X4 + X4.1, X5 = X5 + X5.1, X7, X8 = X8 + X8.1) because i want to apply this function to matrices which dimensions (and row/columns names) are not known in advance. In summary I need a merge procedure that behaves as described. there are other strategies/implementations that achieve the same goal that are, at the same time, faster and generalized? (hoping that some data.table monster help me) to what kind of join (inner, outer, etc. etc.) is assimilable this procedure? Thanks in advance. p.s.: I'm using data.table version 1.8.2 EDIT - SOLUTIONS @Aaron solution. No external libraries, only base R. It works also on list of matrices. add_matrices_1 <- function(...) { a <- list(...) cols <- sort(unique(unlist(lapply(a, colnames)))) rows <- sort(unique(unlist(lapply(a, rownames)))) out <- array(0, dim = c(length(rows), length(cols)), dimnames = list(rows,cols)) for (m in a) out[rownames(m), colnames(m)] <- out[rownames(m), colnames(m)] + m out } @MadScone solution. Used reshape2 package. It works only on two matrices per call. add_matrices_2 <- function(m1, m2) { m <- acast(rbind(melt(M1), melt(M2)), Var1~Var2, fun.aggregate = sum) mn <- unique(colnames(m1), colnames(m2)) rownames(m) <- mn colnames(m) <- mn m } BENCHMARK (100 runs with microbenchmark package) Unit: microseconds expr min lq median uq max 1 add_matrices_1 196.009 257.5865 282.027 291.2735 549.397 2 add_matrices_2 13737.851 14697.9790 14864.778 16285.7650 25567.448 No need to comment the benchmark: @Aaron solution wins. I'll continue to investigate a similar solution for data.table objects. I'll add other solutions eventually reported or discovered.

    Read the article

1