Naive Bayesian classification (spam filtering) - Doubt in one calculation? Which one is right? Plz c

Posted by Microkernel on Stack Overflow
Published on 2010-05-13T15:38:37Z Indexed on 2010/05/13 16:34 UTC

Filed under: bayesian | algorithm | mathematics | spam-filtering | statistics

Hi guys,

I am implementing a Naive Bayesian classifier for spam filtering, and I have a doubt about one calculation. Please clarify what I should do. Here is my question.

In this method, you have to calculate

$P(S|W) = \frac{P(W|S) \cdot P(S)}{P(W|S) \cdot P(S) + P(W|H) \cdot P(H)}$

P(S|W) -> Probability that a message is spam, given that word W occurs in it.

P(W|S) -> Probability that word W occurs in a spam message.

P(W|H) -> Probability that word W occurs in a ham message.
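
For reference, the formula in the Wikipedia article on Bayesian spam filtering combines these quantities with the priors P(S) and P(H), the overall probabilities that a message is spam or ham. A minimal sketch in Python; the function name `posterior_spam` and the default priors of 0.5 are assumptions made for illustration, not something stated in the question:

```python
def posterior_spam(p_w_given_s, p_w_given_h, p_s=0.5, p_h=0.5):
    """Return P(S|W) via Bayes' theorem.

    p_w_given_s -- P(W|S), probability of word W in a spam message
    p_w_given_h -- P(W|H), probability of word W in a ham message
    p_s, p_h    -- priors for spam and ham (0.5 each is an assumption)
    """
    numerator = p_w_given_s * p_s
    denominator = numerator + p_w_given_h * p_h
    if denominator == 0:
        # W was never seen in either class, so the ratio is undefined.
        raise ValueError("word has zero probability in both classes")
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical inputs: W is five times as likely in spam as in ham.
    print(posterior_spam(0.05, 0.01))
```

Whichever way P(W|S) and P(W|H) are estimated, both must be estimated the same way for this ratio to be meaningful.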

So to calculate P(W|S), should I do

(1) (Number of times W occurs in spam) / (Total number of times W occurs in all the messages)

OR

(2) (Number of times W occurs in spam) / (Total number of words in the spam messages)

So, to calculate P(W|S), should I do (1) or (2)? (I think it is (2), but I am not sure, so please clarify.)
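
To make the two candidates concrete, here is a rough sketch that computes both from tokenized training messages. The function names (`count_word`, `option_1`, `option_2`) and the tiny sample data are hypothetical, and the sketch only shows what each ratio measures; it does not settle which one is correct:

```python
def count_word(messages, word):
    """Total occurrences of `word` across a list of tokenized messages."""
    return sum(message.count(word) for message in messages)


def option_1(spam, ham, word):
    """(1): occurrences of W in spam / occurrences of W in all messages."""
    in_spam = count_word(spam, word)
    in_all = in_spam + count_word(ham, word)
    return in_spam / in_all if in_all else 0.0


def option_2(spam, word):
    """(2): occurrences of W in spam / total number of words in spam."""
    total_words = sum(len(message) for message in spam)
    return count_word(spam, word) / total_words if total_words else 0.0


if __name__ == "__main__":
    # Hypothetical training data: each message is a list of words.
    spam = [["buy", "cheap", "pills", "cheap"], ["cheap", "offer"]]
    ham = [["cheap", "lunch", "today"]]
    print(option_1(spam, ham, "cheap"))  # 3 / 4
    print(option_2(spam, "cheap"))       # 3 / 6
```

Note that (1) divides by a count that spans both classes, while (2) divides by a count taken from spam only.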

I am referring to http://en.wikipedia.org/wiki/Bayesian_spam_filtering for the info, by the way.

I have to complete the implementation by this weekend :(

Thanks and regards,

MicroKernel :)

@sth:

Hmm... Shouldn't repeated occurrences of word 'W' increase a message's spam score? In your approach it wouldn't, right?

Let's take a scenario and discuss...

Let's say we have 100 training messages, of which 50 are spam and 50 are ham, and say the word count of each message = 100.

And let's say word W occurs 5 times in each spam message and 1 time in each ham message.

So the total number of times W occurs in all the spam messages = 5*50 = 250 times.

And the total number of times W occurs in all the ham messages = 1*50 = 50 times.

Total occurrences of W in all of the training messages = (250+50) = 300 times.

So, in this scenario, how do you calculate P(W|S) and P(W|H)?

Naturally, we should expect P(W|S) > P(W|H), right?
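
The arithmetic for this scenario under both candidate formulas can be sketched as follows. The variable names are made up for illustration, and the ham-side versions of (1) and (2) are assumed to mirror the spam-side ones:

```python
# Scenario from the post: 50 spam and 50 ham messages, 100 words each.
spam_messages = 50
ham_messages = 50
words_per_message = 100
w_per_spam = 5  # occurrences of W in each spam message
w_per_ham = 1   # occurrences of W in each ham message

w_in_spam = w_per_spam * spam_messages  # 250
w_in_ham = w_per_ham * ham_messages     # 50
w_total = w_in_spam + w_in_ham          # 300

# Candidate (1): divide by occurrences of W in all messages.
opt1_spam = w_in_spam / w_total  # 250 / 300, about 0.833
opt1_ham = w_in_ham / w_total    # 50 / 300, about 0.167

# Candidate (2): divide by the total number of words in that class.
opt2_spam = w_in_spam / (spam_messages * words_per_message)  # 250 / 5000
opt2_ham = w_in_ham / (ham_messages * words_per_message)     # 50 / 5000

print("option 1:", round(opt1_spam, 3), round(opt1_ham, 3))
print("option 2:", round(opt2_spam, 3), round(opt2_ham, 3))
```

Both candidates give P(W|S) > P(W|H) here. Under (1) the two values add up to 1 across spam and ham for the same word, whereas under (2) each value is a share of the words within its own class.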

Please share your thoughts...
