Random forests for short texts

Posted by Jasie on Stack Overflow
Published on 2010-05-02T23:45:12Z

Hi all,

I've been reading about Random Forests (1,2) because I think it'd be really cool to be able to classify a set of 1,000 sentences into pre-defined categories. Could someone explain the algorithm more clearly? I find the papers a bit dense. Here's the gist from 1:

Overview

We assume that the user knows about the construction of single classification trees. Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows:

  1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.
  2. If there are M input variables, a number m « M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
  3. Each tree is grown to the largest extent possible. There is no pruning.
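
To make the three steps concrete, here is a minimal sketch in plain Python, assuming binary (0/1) input variables and string class labels. The function names (`gini`, `grow_tree`, `grow_forest`, `predict_tree`, `predict_forest`) are hypothetical, and using Gini impurity to pick the "best split" is an assumption, since the quoted overview does not name a criterion. As a simplification, a node becomes a majority-vote leaf when none of its m sampled variables separates the cases.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of labels (assumed split criterion)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def grow_tree(X, y, m, rng):
    """Grow one unpruned tree; returns a label (leaf) or (var, zero_branch, one_branch)."""
    if len(set(y)) == 1:
        return y[0]  # pure node: nothing left to split
    best = None
    # Step 2: at THIS node, draw m of the M variables and keep the best split among them.
    for j in rng.sample(range(len(X[0])), m):
        zeros = [i for i, row in enumerate(X) if row[j] == 0]
        ones = [i for i, row in enumerate(X) if row[j] == 1]
        if not zeros or not ones:
            continue  # this variable does not separate the cases at this node
        score = (len(zeros) * gini([y[i] for i in zeros])
                 + len(ones) * gini([y[i] for i in ones])) / len(y)
        if best is None or score < best[0]:
            best = (score, j, zeros, ones)
    if best is None:
        # Simplification: no sampled variable splits the node, so vote by majority.
        return Counter(y).most_common(1)[0][0]
    _, j, zeros, ones = best
    # Step 3: keep splitting both children; there is no pruning.
    return (j,
            grow_tree([X[i] for i in zeros], [y[i] for i in zeros], m, rng),
            grow_tree([X[i] for i in ones], [y[i] for i in ones], m, rng))


def grow_forest(X, y, n_trees, m, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample of the N cases."""
    rng = random.Random(seed)
    n_cases = len(X)
    forest = []
    for _ in range(n_trees):
        # Step 1: sample N cases at random, with replacement.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest


def predict_tree(node, x):
    """Send an input vector down one tree until it reaches a leaf label."""
    while isinstance(node, tuple):
        j, zero_branch, one_branch = node
        node = one_branch if x[j] else zero_branch
    return node


def predict_forest(forest, x):
    """Each tree votes; the forest returns the class with the most votes."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

Note that the m variables are redrawn at every node inside `grow_tree`, not once per tree, and that the number of trees (`n_trees` here) is a separate parameter chosen by the user.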

So, does this look right?

  1. I'd have N = 1,000 training cases (sentences),
  2. M = 100 variables (let's say, there are only 100 unique words across all sentences), so the input vector is a bit vector of length 100 corresponding to each word.
  3. I sample N = 1,000 cases at random (with replacement) to build trees from.
  4. I pick some small number of input variables m « M, let's say 10, to build a tree off of.
  5. Do I build tree nodes randomly, using all m input variables? How many classification trees do I build?
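
For point 2, a rough sketch of how each sentence could become a bit vector of length M, assuming simple lowercasing and whitespace tokenization (both are assumptions; the helper names `build_vocabulary` and `to_bit_vector` are hypothetical):

```python
def build_vocabulary(sentences):
    """Map each unique word across all sentences to a fixed vector position."""
    words = sorted({word for s in sentences for word in s.lower().split()})
    return {word: position for position, word in enumerate(words)}


def to_bit_vector(sentence, vocab):
    """Return a list of 0/1 flags, one per vocabulary word, marking presence."""
    present = set(sentence.lower().split())
    # Iterating the dict follows insertion order, which matches the positions.
    return [1 if word in present else 0 for word in vocab]


sentences = ["the cat sat", "the dog ran"]
vocab = build_vocabulary(sentences)              # M = len(vocab)
X = [to_bit_vector(s, vocab) for s in sentences]  # N rows, each of length M
```

Words that are not in the vocabulary are simply ignored, so a new sentence still maps to a vector of the same length M.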

Thanks for the help!
