Code golf: Word frequency chart

Posted by ChristopheD on Stack Overflow See other posts from Stack Overflow or by ChristopheD
Published on 2010-07-02T20:54:05Z Indexed on 2012/04/07 23:29 UTC
Read the original article Hit count: 304

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

  • Only accept a-z and A-Z (alphabetic characters) as part of a word.
  • Ignore casing (She == she for our purpose).
  • Ignore the following words (quite arbitary, I know): the, and, of, to, a, i, it, in, or, is
  • Clarification: considering don't: this would be taken as 2 different 'words' in the ranges a-z and A-Z: (don and t).

  • Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

  • Display the chart (also see the example below) for the 22 most common words (ordered by descending frequency).
  • The bar width represents the number of occurences (frequency) of the word (proportionally). Append one space and print the word.
  • Make sure these bars (plus space-word-space) always fit: bar + [space] + word + [space] should be always <= 80 characters (make sure you account for possible differing bar and word lenghts: e.g.: the second most common word could be a lot longer then the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).

An example:

The text for the example can be found here (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so 


For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that
', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t'
, 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('
but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurence of you in the linked Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so 

The winner:

Shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten to solution.

Only solutions shorter then 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

© Stack Overflow or respective owner

Related posts about language-agnostic

Related posts about code-golf

  • Code Golf: Collatz Conjecture

    as seen on Stack Overflow - Search for 'Stack Overflow'
    Inspired by http://xkcd.com/710/ here is a code golf for it. The Challenge Given a positive integer greater than 0, print out the hailstone sequence for that number. The Hailstone Sequence See Wikipedia for more detail.. If the number is even, divide it by two. If the number is odd, triple… >>> More

  • Code Golf - p day

    as seen on Stack Overflow - Search for 'Stack Overflow'
    The Challenge The shortest code by character count to display a representation of a circle of radius R using the *character, followed by an approximation of p. Input is a single number, R. Since most computers seem to have almost 2:1 ratio you should only output lines where y is odd. The approximation… >>> More

  • Code Golf - PI day

    as seen on Stack Overflow - Search for 'Stack Overflow'
    The Challenge The shortest code by character count to display a representation of a circle of radius R using the *character. Followed by an approximation of pi Input is a single number, R Since most computers seem to have almost 2:1 ratio you should only output lines where y is odd. The approximation… >>> More

  • Code Golf: Triforce

    as seen on Stack Overflow - Search for 'Stack Overflow'
    This is inspired by/taken from this thread: http://www.allegro.cc/forums/thread/603383 The Problem Assume the user gives you a numeric input ranging from 1 to 7. Input should be taken from the console, arguments are less desirable. When the input is 1, print the following: *********** *********… >>> More

  • Code Golf: Tic Tac Toe

    as seen on Stack Overflow - Search for 'Stack Overflow'
    Post your shortest code, by character count, to check if a player has won, and if so, which. Assume you have an integer array in a variable b (board), which holds the Tic Tac Toe board, and the moves of the players where: 0 = nothing set 1 = player 1 (X) 2 = player 2 (O) So, given the array b… >>> More