Search Results

Search found 4 results on 1 pages for 'dnkb'.

Page 1/1 | 1 

  • SQL like group by and sum for text files in command line?

    - by dnkb
    I have huge text files with two fields, the first is a string the second is an integer. The files are sorted by the first field. What I'd like to get in the output is one line per unique string and the sum of the numbers for the identical strings. Some strings appear only once while other appear multiple times. E.g. Given the sample data below, for the string glehnia I'd like to get 10+22=32 in the result. Any suggestions how to do this either with gnuwin32 command line tools or in linux shell? Thanks! glehnia 10 glehnia 22 glehniae 343 glehnii 923 glei 1171 glei 2283 glei 3466 gleib 914 gleiber 652 gleiberg 495 gleiberg 709

    Read the article

  • remove words containing non-alpha characters

    - by dnkb
    Given a text file with space separated string and a tab separated integer, I'd ;like to get rid of all words that have non-alpha characters but keep words consisting of alpha only characters and the tab plus the integer afterwards. My attempts like the ones below didin't yield any good. What I was trying to express is something like: "replace anything within word boundaries that starts and ends with 0 or more whatever and there is at least one :digits: or :punct: in between". sed 's/\b.[:digits::punct:]+.\b//g' sed 's/\b.[^:alpha:]+.\b//g' What am I missing? See sample input data below. Thank you! asdf 754m 563 a2a 754mm 291 754n 463 754 ppp 1409 754pin 4652 pin pin 462 754pins 652 754 ppp 1409 754pin 4652 pi$n pin 462 754/p ins 652 754 pp+p 1409 754 p=in 4652

    Read the article

  • joining text files with 600M+ lines

    - by dnkb
    I have two files huge.txt and small.txt. Huge has around 600M rows and it's 14Gigs, each line has four space separated words (tokens) and finally another space separated column with a number. Small has 150K rows with a size of ~3M, a space separated word and a number. Both Files are sorted using the sort command, with no extra options. The words in both files may include apostrophes (') and dashes (-). The deisred output would contain all columns from the huge.txt and the second column (the number) from small txt where the first word of huge.txt and the first word of small.txt match. My attemtpts below failed miserably with the following error: cat huge.txt|join -o 1.1 1.2 1.3 1.4 2.2 - small.txt > output.txt join: memory exhausted What I suspect is that the sorting order isn't right somehow even though the files are pre-sorted using: sort -k1 huge.unsorted.txt > huge.txt sort -k1 small.unsorted.txt > small.txt Problems seem to appear around words that have apostrophes (') or dashes (-). I also tried dictinoary sorting using the -d option bumping into the same error at the end. I see two ways out of this but don't know how to implement any of them. 1) Any tips how to sort the files in a way that the join command considers them to be sorted properly? 2) I was thinking of calculating MD5 or some other hashes of the strings to get rid of the apostrophes and dashes, and do the sorting and joining with the hashes instead of the strings themselves an dat the "translate" back the hashes to strings, but leave the numbers intact at the end of the lines. Since there would be only 150K hashes it's not that bad. What would be a good way to calculate individual hashes for each of the strings? Some AWK magic? See file samples at the end. Thank you! sample of huge.txt had stirred me to 46 had stirred my corruption 57 had stirred old emotions 55 had stirred something in 69 had stirred something within 40 sample of small.txt caley 114881 calf 2757974 calfed 137861 calfee 71143 calflora 154624 calfskin 148347 calgary 9416465 calgon's 94846

    Read the article

1