strsplit - Developer IT

R strsplit and vectorization

- by James

When creating functions that use strsplit, vector inputs do not behave as desired, and sapply needs to be used. This is due to the list output that strsplit produces. Is there a way to vectorize the process - that is, the function produces the correct element in the list for each of the elements of the input? For example, to count the lengths of words in a character vector: words <- c("a","quick","brown","fox") > length(strsplit(words,"")) [1] 4 # The number of words (length of the list) > length(strsplit(words,"")[[1]]) [1] 1 # The length of the first word only > sapply(words,function (x) length(strsplit(x,"")[[1]])) a quick brown fox 1 5 5 3 # Success, but potentially very slow Ideally, something like length(strsplit(words,"")[[.]]) where . is interpreted as the being the relevant part of the input vector.

Read the article

strsplit in R with metacharacter

- by user1429852

I have received a large amount of data where the delimiter is a backslash (obviously a bad choice). I'm processing it in R for computation, and having a hard time finding how to split the string since the backslash is a metacharacter. For example, a string would look like this: "1128\0019\XA5\E2R\366\00=15" and I want to split it along the "\" character, but when I run the strsplit command: strsplit(tempStr, "\") Error in strsplit(tempStr, "\") : invalid regular expression '\', reason 'Trailing backslash' When I try to used the "fixed" option, it does not run because it is expecting something after the backslash: strsplit(tempStr, "\", fixed = TRUE) Unfortunately, I can't preprocess the data with another program because the data is generated daily. Please help and thanks!

Read the article

How do I specify a dynamic position for the start of substring?

- by analyticsPierce

As in the example, I am trying to substring the Video_full column in a data.frame (video_data_2) I am working on. I want to keep all the characters after the period. The period is always present, there is only one period and it is in a different position in each value for the column. Date Video_full Instances 1 Apr 1, 2010 installs/AA.intro_video_1 546 2 Apr 1, 2010 installs/ABAC.intro_video_2 548 I got substring to work; video_data_2$Video_full <- substring(video_data_2$Video_full,11) And strsplit also; strsplit("installs/AA.intro_video_1 ",'[.]') I'm just not able to figure out how to start the substring in a dynamic position or only keep the second value returned by strsplit. Thanks for any help you can offer for a simple question.

Read the article

Faster way to split a string and count characters using R?

- by chrisamiller

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider. I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this: ## ## count the number of GCs in the characters between start and stop ## gcCount <- function(line, st, sp){ chars = strsplit(as.character(line),"")[[1]] numGC = 0 for(j in st:sp){ ##nested ifs faster than an OR (|) construction if(chars[[j]] == "g"){ numGC <- numGC + 1 }else if(chars[[j]] == "G"){ numGC <- numGC + 1 }else if(chars[[j]] == "c"){ numGC <- numGC + 1 }else if(chars[[j]] == "C"){ numGC <- numGC + 1 } } return(numGC) } Running Rprof gives me the following output: > a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg" > Rprof(filename="Rprof.out") > for(i in 1:500000){gcCount(a,1,40)}; > Rprof(NULL) > summaryRprof(filename="Rprof.out") self.time self.pct total.time total.pct "gcCount" 77.36 76.8 100.74 100.0 "==" 18.30 18.2 18.30 18.2 "strsplit" 3.58 3.6 3.64 3.6 "+" 1.14 1.1 1.14 1.1 ":" 0.30 0.3 0.30 0.3 "as.logical" 0.04 0.0 0.04 0.0 "as.character" 0.02 0.0 0.02 0.0 $by.total total.time total.pct self.time self.pct "gcCount" 100.74 100.0 77.36 76.8 "==" 18.30 18.2 18.30 18.2 "strsplit" 3.64 3.6 3.58 3.6 "+" 1.14 1.1 1.14 1.1 ":" 0.30 0.3 0.30 0.3 "as.logical" 0.04 0.0 0.04 0.0 "as.character" 0.02 0.0 0.02 0.0 $sampling.time [1] 100.74 Any advice for making this code faster?

Read the article

How to get the second sub element of every element in a list in R

- by PaulHurleyuk

I know I've come across this problem before, but I'm having a bit of a mental block at the moment. and as I can't find it on SO, I'll post it here so I can find it next time. I have a dataframe that contains a field representing an ID label. This label has two parts, an alpha prefix and a numeric suffix. I want to split it apart and create two new fields with these values in. structure(list(lab = c("N00", "N01", "N02", "B00", "B01", "B02", "Z21", "BA01", "NA03")), .Names = "lab", row.names = c(NA, -9L ), class = "data.frame") df$pre<-strsplit(df$lab, "[0-9]+") df$suf<-strsplit(df$lab, "[A-Z]+") Which gives lab pre suf 1 N00 N , 00 2 N01 N , 01 3 N02 N , 02 4 B00 B , 00 5 B01 B , 01 6 B02 B , 02 7 Z21 Z , 21 8 BA01 BA , 01 9 NA03 NA , 03 So, the first strsplit works fine, but the second gives a list, each having two elements, an empty string and the result I want, and stuffs them both into the dataframe column. How can I select the second sub-element from each element of the list ? (or, is there a better way to do this) Thanks Paul.

Read the article

R regex to validate user input is correct.

- by John

I'm trying to practice writing better code, so I wanted to validate my input sequence with regex to make sure that the first thing I get is a single letter A to H only, and the second is a number 1 to 12 only. I'm new to regex and not sure what the expression should look like. I'm also not sure what type of error R would throw if this is invalidated? In Perl it would be something like this I think: =~ m/([A-M]?))/) Here is what I have so far for R: input_string = "A1" first_well_row = unlist(strsplit(input_string, ""))[1] # get the letter out first_well_col = unlist(strsplit(input_string, ""))[2] # get the number out

Read the article

Splitting a string into new rows in R

- by user3703195

I have a data set like below: Country Region Molecule Item Code IND NA PB102 FR206985511 THAI AP PB103 BA-107603 / F000113361 / 107603 LUXE NA PB105 1012701 / SGP-1012701 / F041701000 IND AP PB106 AU206985211 / CA-F206985211 THAI HP PB107 F034702000 / 1010701 / SGP-1010701 BANG NA PB108 F000007970/25781/20009021 I want to split based the string values in ITEMCODE column on / and create a new row for each entry. For instance, the desired output will be: Country Region Molecule Item Code New row IND NA PB102 FR206985511 FR206985511 THAI AP PB103 BA-107603 / F000113361 / 107603 F000113361 107603 BA-107603 LUXE NA PB105 1012701 / SP-1012701 / F041701000 1012701 SP-1012701 F041701000 IND AP PB106 AU206985211 / CA-F206985211 AU206985211 CA-F206985211 THAI HP PB107 F034702000 / 1010701 / SP-1010701 F034702000 1010701 SP-1010701 BANG NA PB108 F000007970/25781/20009021 F000007970 25781 20009021 I tried the below code library(splitstackshape) df2=concat.split.multiple(df1,"Plant.Item.Code","/", direction="long") but got the Error "Error: memory exhausted (limit reached?)" When i tried strsplit() i got the below error message. Error in strsplit(df1$Plant.Item.Code, "/") : non-character argument Any help from you will be appreciated.

Read the article

Convert 12hour time to 24Hour time

- by RwardBound

I have hourly weather data. I've seen the function examples from here: http://casoilresource.lawr.ucdavis.edu/drupal/node/991 I'm altering the code to account for airport data, which has a different URL type. Another issue with the airport weather data is that the time data is saved in 12 hour format. Here is a sample of the data: 14 10:43 AM 15 10:54 AM 16 11:54 AM 17 12:07 PM 18 12:15 PM 19 12:54 PM 20 1:54 PM 21 2:54 PM Here's what I attempted: (I see that using just 'PM' isn't careful enough because any times between 12 and 1 pm will be off if they go through this alg) date<-Sys.Date() data$TimeEST<-strsplit(data$TimeEST, ' ') for (x in 1:35){ if('AM' %in% data$TimeEST[[x]]){ gsub('AM','',data$TimeEST[[x]]) data$TimeEST[[x]]<-str_trim(data$TimeEST[[x]]) data$TimeEST[[x]]<-str_c(date,' ',data$TimeEST[x],':',data$TimeEST[2]) } else if('PM' %in% data$TimeEST[[x]]){ data$TimeEST[[x]]<-gsub('PM', '',data$TimeEST[[x]]) data$TimeEST[[x]]<-strsplit(data$TimeEST[[x]], ':') data$TimeEST[[x]][[1]][1]<-as.integer(data$TimeEST[[x]][[1]][1])+12 data$TimeEST[[x]]<-str_trim(data$TimeEST[[x]][[1]]) data$TimeEST[[x]]<-str_c(date, " ", data$TimeEST[[x]][1],':',data$TimeEST[[x]][2]) } } Any help?

Read the article

Extracting noun+noun or (adj|noun)+noun from Text

- by ssuhan

I would like to query if it is possible to extract noun+noun or (adj|noun)+noun in R package openNLP?That is, I would like to use linguistic filtering to extract candidate noun phrases. Could you direct me how to do? Many thanks. Thanks for the responses. here is the code: library("openNLP") acq <- "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in pipeline and terminal operations for 12.2 mln dlrs. The company said the sale is subject to certain post closing adjustments, which it did not explain. Reuter." acqTag <- tagPOS(acq) acqTagSplit = strsplit(acqTag," ") acqTagSplit qq = 0 tag = 0 for (i in 1:length(acqTagSplit[[1]])){ qq[i] <-strsplit(acqTagSplit[[1]][i],'/') tag[i] = qq[i][[1]][2] } index = 0 k = 0 for (i in 1:(length(acqTagSplit[[1]])-1)) { if ((tag[i] == "NN" && tag[i+1] == "NN") | (tag[i] == "NNS" && tag[i+1] == "NNS") | (tag[i] == "NNS" && tag[i+1] == "NN") | (tag[i] == "NN" && tag[i+1] == "NNS") | (tag[i] == "JJ" && tag[i+1] == "NN") | (tag[i] == "JJ" && tag[i+1] == "NNS")){ k = k +1 index[k] = i } } index Reader can refer index on acqTagSplit to do noun+noun or (adj|noun)+noun extractation. (The code is not optimum but work. If you have any idea, please let me know.) Furthermore, I still have a problem. Justeson and Katz (1995) proposed another linguistic filtering to extract candidate noun phrases: ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun I cannot well understand its meaning, could someone do me a favor to explain it or transform such representation into R language

Read the article

Optimizing a "set in a string list" to a "set as a matrix" operation

- by Eric Fournier

I have a set of strings which contain space-separated elements. I want to build a matrix which will tell me which elements were part of which strings. For example: "" "A B C" "D" "B D" Should give something like: A B C D 1 2 1 1 1 3 1 4 1 1 Now I've got a solution, but it runs slow as molasse, and I've run out of ideas on how to make it faster: reverseIn <- function(vector, value) { return(value %in% vector) } buildCategoryMatrix <- function(valueVector) { allClasses <- c() for(classVec in unique(valueVector)) { allClasses <- unique(c(allClasses, strsplit(classVec, " ", fixed=TRUE)[[1]])) } resMatrix <- matrix(ncol=0, nrow=length(valueVector)) splitValues <- strsplit(valueVector, " ", fixed=TRUE) for(cat in allClasses) { if(cat=="") { catIsPart <- (valueVector == "") } else { catIsPart <- sapply(splitValues, reverseIn, cat) } resMatrix <- cbind(resMatrix, catIsPart) } colnames(resMatrix) <- allClasses return(resMatrix) } Profiling the function gives me this: $by.self self.time self.pct total.time total.pct "match" 31.20 34.74 31.24 34.79 "FUN" 30.26 33.70 74.30 82.74 "lapply" 13.56 15.10 87.86 97.84 "%in%" 12.92 14.39 44.10 49.11 So my actual questions would be: - Where are the 33% spent in "FUN" coming from? - Would there be any way to speed up the %in% call? I tried turning the strings into factors prior to going into the loop so that I'd be matching numbers instead of strings, but that actually makes R crash. I've also tried going for partial matrix assignment (IE, resMatrix[i,x] <- 1) where i is the number of the string and x is the vector of factors. No dice there either, as it seems to keep on running infinitely.

Read the article

WPF Datagrid Get Selected Item

- by wonea

Hello, How do I get the selected item in a WPF datagrid? Tried the following, with no luck; dataGrid1.CurrentCell.Item.ToString(); string[] strsplit = dataGrid1.SelectedValue.ToString().Split('+'); dataGrid1.SelectedCells[0].Item.ToString(); dataGrid1.CurrentItem.ToString(); dataGrid1.CurrentCell.Item.ToString(); dataGrid1.CurrentCell.Item.ToString(); Thanks

Read the article

Extracting ""((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun"" from Text (Justeson & Katz, 1995)

- by ssuhan

I would like to query if it is possible to extract ((Adj|Noun)+|((Adj|Noun)(Noun-Prep)?)(Adj|Noun))Noun proposed by Justeson and Katz (1995) in R package openNLP? That is, I would like to use this linguistic filtering to extract candidate noun phrases. I cannot well understand its meaning. Could you do me a favor to explain it or transform such representation into R language. Many thanks. Maybe we can start the sample code from: library("openNLP") acq <- "This paper describes a novel optical thread plug gauge (OTPG) for internal thread inspection using machine vision. The OTPG is composed of a rigid industrial endoscope, a charge-coupled device camera, and a two degree-of-freedom motion control unit. A sequence of partial wall images of an internal thread are retrieved and reconstructed into a 2D unwrapped image. Then, a digital image processing and classification procedure is used to normalize, segment, and determine the quality of the internal thread." acqTag <- tagPOS(acq) acqTagSplit = strsplit(acqTag," ")

Read the article

Is there a better (i.e vectorised) way to put part of a column name into a row of a data frame in R

- by PaulHurleyuk

I have a data frame in R that has come about from running some stats on the result fo a melt/cast operation. I want to add a row into this dataframe containing a Nominal value. That Nominal Value is present in the names for each column df<-as.data.frame(cbind(x=c(1,2,3,4,5),`Var A_100`=c(5,4,3,2,1),`Var B_5`=c(9,8,7,6,5))) > df x Var A_100 Var B_5 1 1 5 9 2 2 4 8 3 3 3 7 4 4 2 6 5 5 1 5 So, I want to create a new row, that contains '100' in the column Var A_100 and '5' in Var B_5. Currently this is what I'm doing but I'm sure there must be a better, vectorised way to do this. temp_nom<-NULL for (l in 1:length(names(df))){ temp_nom[l]<-strsplit(names(df),"_")[[l]][2] } temp_nom [1] NA "100" "5" df[6,]<-temp_nom > df x Var A_100 Var B_5 1 1 5 9 2 2 4 8 3 3 3 7 4 4 2 6 5 5 1 5 6 <NA> 100 5 rm(temp_nom) Typically I'd have 16-24 columns. Any ideas ?

Read the article

How to patch an S4 method in an R package?

- by Richie Cotton

If you find a bug in a package, it's usually possible to patch the problem with fixInNamespace, e.g. fixInNamespace("mean.default", "base"). For S4 methods, I'm not sure how to do it though. The method I'm looking at is in the gWidgetstcltk package. You can see the source code with getMethod(".svalue", c("gTabletcltk", "guiWidgetsToolkittcltk")) I can't find the methods with fixInNamespace. fixInNamespace(".svalue", "gWidgetstcltk") Error in get(subx, envir = ns, inherits = FALSE) : object '.svalue' not found I thought setMethod might do the trick, but setMethod(".svalue", c("gTabletcltk", "guiWidgetsToolkittcltk"), definition = function (obj, toolkit, index = NULL, drop = NULL, ...) { widget = getWidget(obj) sel <- unlist(strsplit(tclvalue(tcl(widget, "selection")), " ")) if (length(sel) == 0) { return(NA) } theChildren <- .allChildren(widget) indices <- sapply(sel, function(i) match(i, theChildren)) inds <- which(visible(obj))[indices] if (!is.null(index) && index == TRUE) { return(inds) } if (missing(drop) || is.null(drop)) drop = TRUE chosencol <- tag(obj, "chosencol") if (drop) return(obj[inds, chosencol, drop = drop]) else return(obj[inds, ]) }, where = "package:gWidgetstcltk" ) Error in setMethod(".svalue", c("gTabletcltk", "guiWidgetsToolkittcltk"), : the environment "gWidgetstcltk" is locked; cannot assign methods for function ".svalue" Any ideas?

Search Results

Search found 14 results on 1 pages for 'strsplit'.

Page 1/1 | 1

- by James

- by user1429852

- by analyticsPierce

- by chrisamiller

- by PaulHurleyuk

- by John

- by user3703195

- by RwardBound

- by ssuhan

- by Eric Fournier

- by wonea

- by ssuhan

- by PaulHurleyuk

- by Richie Cotton