Ruby : UTF-8 IO
Posted by subtenante on Stack Overflow
        
        Published on 2010-03-12T22:09:35Z
        
I am using Ruby 1.8.7.
I am trying to parse some text files containing Greek sentences, encoded in UTF-8.
(I can't paste sample files here because they are subject to copyright. They really are just Greek text encoded in UTF-8.)
For each file, I want to parse it, extract all the words, and list each word that appears for the first time in that file. Everything is saved to one big index file.
Here is my code:
#!/usr/bin/ruby -KU
def prepare_line(l)
    l.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/u, "")
end
def tokenize(l)
    l.split /['·.;!:\s]+/u
end
$dict = {}
$cpt = 0
$out = File.new 'out.txt', 'w'
def lesson(file)
    $cpt = $cpt + 1
    file.readlines.each do |l|
        $out.puts l
        l = prepare_line l
        tokenize(l).each do |t|
            unless $dict[t]
                $dict[t] = $cpt
                $out.puts  "  #{t}\n"
            end
        end
    end
end
Dir.new('etc/').each do |filename|
    f = File.new("etc/#{filename}")
    unless File.directory? f
        lesson f
    end
end
Here is part of my output:
?@???†?†?????????? ?...[snip very long hangul/hanzi mishmash]... ????????†? ???N2 : ?e?te?? (2) µ???µa
(Note that the `$out.puts l` part seems to work fine: the original line appears at the end of the output line above.)
Any idea what is wrong with my code?
(General comments about Ruby idioms I could use are very welcome; I'm really a beginner.)
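
For comparison, here is a minimal sketch of the same per-file word index, written under the assumption of Ruby 1.9 or later (not the 1.8.7 used above), where strings carry an encoding and files can be opened with an explicit one. It is not a diagnosis of the garbled output. The helper names `index_lines` and `build_index` are hypothetical, and the files are visited in sorted order, which may differ from the order `Dir.new('etc/').each` yields.

```ruby
# encoding: UTF-8
# Sketch only: assumes Ruby 1.9 or later, not the 1.8.7 of the question.

# Strip a leading "S<n> :" or "T<n> :" label, "(n)" markers and trailing whitespace.
def prepare_line(line)
  line.gsub(/^\s*[ST]\d+\s*:\s*|\s+$|\(\d+\)\s*/, "")
end

# Split a cleaned line into words, dropping the empty string that
# String#split leaves when the line starts with a separator.
def tokenize(line)
  line.split(/['·.;!:\s]+/).reject { |word| word.empty? }
end

# Echo each line to `out`, then list the words not seen before.
# `seen` maps each word to the number of the lesson that introduced it.
def index_lines(lines, seen, lesson_number, out)
  lines.each do |line|
    out.puts line
    tokenize(prepare_line(line)).each do |word|
      next if seen.key?(word)
      seen[word] = lesson_number
      out.puts "  #{word}"
    end
  end
end

# Index every regular file in `dir` into a single output file.
def build_index(dir, out_path)
  seen = {}
  paths = Dir.glob(File.join(dir, "*")).select { |path| File.file?(path) }.sort
  # The block form of File.open closes the output file automatically.
  File.open(out_path, "w:UTF-8") do |out|
    paths.each_with_index do |path, i|
      index_lines(File.readlines(path, encoding: "UTF-8"), seen, i + 1, out)
    end
  end
  seen
end

# Usage, with the directory and file names from the question:
#   build_index("etc", "out.txt")
```

Passing the hash and the output stream as arguments replaces the `$dict`, `$cpt` and `$out` globals, and `File.file?` skips the `.` and `..` entries without opening them.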
© Stack Overflow or respective owner