Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

Posted by Jonathan Sampson on Stack Overflow See other posts from Stack Overflow or by Jonathan Sampson
Published on 2009-06-24T13:12:52Z Indexed on 2010/03/16 1:39 UTC
Read the original article Hit count: 428

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).

  1. Return only words with an occurrence greater than X
  2. Return only words with a length greater than Y
  3. Ignore common terms like "and, is, the, etc"
  4. Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
  5. Return results in a collection/array

Extra Credit

  1. Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
    Where 'too good to be true' would be the actual statement

Extra-Extra Credit

  1. Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
    "The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."
    Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

  1. It would be great to see the results of your code, output, in addition to how long the operation lasted.

© Stack Overflow or respective owner

Related posts about code-golf

  • Code Golf: Collatz Conjecture

    as seen on Stack Overflow - Search for 'Stack Overflow'
    Inspired by http://xkcd.com/710/ here is a code golf for it. The Challenge Given a positive integer greater than 0, print out the hailstone sequence for that number. The Hailstone Sequence See Wikipedia for more detail.. If the number is even, divide it by two. If the number is odd, triple… >>> More

  • Code Golf - p day

    as seen on Stack Overflow - Search for 'Stack Overflow'
    The Challenge The shortest code by character count to display a representation of a circle of radius R using the *character, followed by an approximation of p. Input is a single number, R. Since most computers seem to have almost 2:1 ratio you should only output lines where y is odd. The approximation… >>> More

  • Code Golf - PI day

    as seen on Stack Overflow - Search for 'Stack Overflow'
    The Challenge The shortest code by character count to display a representation of a circle of radius R using the *character. Followed by an approximation of pi Input is a single number, R Since most computers seem to have almost 2:1 ratio you should only output lines where y is odd. The approximation… >>> More

  • Code Golf: Triforce

    as seen on Stack Overflow - Search for 'Stack Overflow'
    This is inspired by/taken from this thread: http://www.allegro.cc/forums/thread/603383 The Problem Assume the user gives you a numeric input ranging from 1 to 7. Input should be taken from the console, arguments are less desirable. When the input is 1, print the following: *********** *********… >>> More

  • Code Golf: Tic Tac Toe

    as seen on Stack Overflow - Search for 'Stack Overflow'
    Post your shortest code, by character count, to check if a player has won, and if so, which. Assume you have an integer array in a variable b (board), which holds the Tic Tac Toe board, and the moves of the players where: 0 = nothing set 1 = player 1 (X) 2 = player 2 (O) So, given the array b… >>> More

Related posts about text-parsing