How to count the Chinese word in a file using regex in perl?

Posted by Ivan on Stack Overflow See other posts from Stack Overflow or by Ivan
Published on 2011-01-06T03:19:41Z Indexed on 2011/01/06 3:53 UTC
Read the original article Hit count: 225

Filed under:
|
|
|

I tried following perl code to count the Chinese word of a file, it seems working but not get the right thing. Any help is greatly appreciated.

The Error message is

Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things  = 125, valid words = 

which seems to me the problem is the file format. The "total thing" is 125 that is the string number (125 lines). The strangest part is my console displayed all the individual Chinese words correctly without any problem. The utf-8 pragma is installed.

#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;

my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;

open (FILE, "< $input_file") or die "Can't open $input_file: $!";

while (<FILE>) {
 foreach (split) { #break $_ into words, assign each to $_ in turn
 $total++;
 next if /\W|^\d+/;  #strange words skip the remainder of the loop
 $valid++;
 $count{$_}++;  # count each separate word stored in a hash
 ## next comes here ##
      }
   }

   print "Total things  = $total, valid words = $valid\n";
   foreach my $word (sort keys %count) {
      print "$word \t was seen \t $count{$word} \t times.\n";
   }

##---Data----
sample_file.txt

??????,???????,????.??????.????:"?????????????,??????,????????.????????,?????????, ???????????.????????,???????????,??????,??????.???:`??,???????????.'?????, ??????????."??????,??????.????.???, ????????????,????,??????,?????????,??????????????. ????????,??????,???????????,????????,????????.????,????,???????, ??????????,??????,????????.??????.

© Stack Overflow or respective owner

Related posts about regex

Related posts about perl