How do I count the Chinese words in a file using a regex in Perl?

Posted by Ivan on Stack Overflow
Published on 2011-01-06T03:19:41Z

Filed under:

chinese

I tried the following Perl code to count the Chinese words in a file. It runs, but it does not produce the right result. Any help is greatly appreciated.

The error message is:

Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things  = 125, valid words =

It seems to me that the problem is the file format. The "Total things" value is 125, which is just the number of strings read (the file has 125 lines). The strangest part is that my console displays all the individual Chinese words correctly, without any problem. The utf8 pragma is enabled.

#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;

my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;

open (FILE, "< $input_file") or die "Can't open $input_file: $!";

while (<FILE>) {
    foreach (split) { #break $_ into words, assign each to $_ in turn
        $total++;
        next if /\W|^\d+/;  #strange words skip the remainder of the loop
        $valid++;
        $count{$_}++;  # count each separate word stored in a hash
        ## next comes here ##
    }
}

print "Total things  = $total, valid words = $valid\n";
foreach my $word (sort keys %count) {
    print "$word \t was seen \t $count{$word} \t times.\n";
}

##---Data----
sample_file.txt

??????,???????,????.??????.????:"?????????????,??????,????????.????????,?????????, ???????????.????????,???????????,??????,??????.???:`??,???????????.'?????, ??????????."??????,??????.????.???, ????????????,????,??????,?????????,??????????????. ????????,??????,???????????,????????,????????.????,????,???????, ??????????,??????,????????.??????.
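One likely cause, offered as a sketch rather than a definitive diagnosis: `use utf8` only tells Perl that the script's own source is UTF-8; it does not decode what is read from FILE. Each line therefore arrives as raw bytes, and because the text has almost no whitespace, `split` returns one long chunk per line. Every chunk contains punctuation and undecoded bytes that match `\W`, so `next` fires every time and `$valid` is never incremented, which would explain the "uninitialized value" warning. The code below assumes the file is UTF-8 (if it is in another encoding, the name in the `:encoding(...)` layer would have to change) and counts individual Han characters with `\p{Han}`. This is a swapped-in technique: it counts characters, not segmented words, since splitting Chinese text into words needs a dictionary or a segmenter. The name `count_han` and the sample string are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count Han characters in an already-decoded (character) string.
# Returns the total and a hash reference of per-character counts.
sub count_han {
    my ($text) = @_;
    my $total = 0;
    my %count;
    # \p{Han} matches one character of the Han script; punctuation,
    # Latin letters and digits are skipped.
    while ($text =~ /(\p{Han})/g) {
        $total++;
        $count{$1}++;
    }
    return ($total, \%count);
}

# Encode characters as UTF-8 when printing.
binmode(STDOUT, ':encoding(UTF-8)');

my $text;
if (@ARGV) {
    # Assumption: the input file is UTF-8. The I/O layer decodes it
    # while reading; "use utf8" alone does not do this.
    open(my $fh, '<:encoding(UTF-8)', $ARGV[0])
        or die "Can't open $ARGV[0]: $!";
    local $/;                          # slurp the whole file
    $text = <$fh>;
    $text = '' unless defined $text;   # empty file
    close($fh);
} else {
    # Hypothetical sample text, written as escapes so that the script
    # does not depend on its own source encoding.
    $text = "\x{4F60}\x{597D},\x{4E16}\x{754C}.\x{4F60}\x{597D}!";
}

my ($total, $count) = count_han($text);
print "Han characters = $total\n";
foreach my $char (sort keys %$count) {
    print "$char\twas seen\t$count->{$char}\ttimes.\n";
}
```

Run with a file name as the first argument (for example `perl word_counting.pl sample_file.txt`) to count that file. Without arguments the script counts the built-in sample, which contains 6 Han characters.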
