A good machine learning technique to weed out good URLs from bad

Posted by git-noob on Stack Overflow
Published on 2010-03-11T14:11:30Z Indexed on 2010/03/12 11:47 UTC

Hi,

I have an application that needs to distinguish good HTTP GET requests from bad ones.

For example:

http://somesite.com?passes=dodgy+parameter                # BAD
http://anothersite.com?passes=a+good+parameter            # GOOD

My system can already make a binary good/bad decision for a URL, but ideally I would like it to predict whether a previously unseen URL is good or bad, for example:

http://some-new-site.com?passes=a+really+dodgy+parameter # BAD
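
To make the idea of "features" concrete, here is a minimal sketch of one way to break a URL into tokens a classifier could learn from. The helper name `url_tokens` and the choice of "parameter name plus word" tokens are assumptions for illustration, not something the application already does.

```python
from urllib.parse import urlsplit, parse_qsl

def url_tokens(url):
    """Split a URL into a host token and one token per query-parameter word.

    Hypothetical helper: the token scheme is just one possible choice.
    """
    parts = urlsplit(url)
    tokens = ["host=" + parts.netloc.lower()]
    # parse_qsl decodes '+' and %xx escapes, so 'a+really+dodgy' becomes 'a really dodgy'.
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        for word in value.lower().split():
            tokens.append(name + "=" + word)
    return tokens

print(url_tokens("http://some-new-site.com?passes=a+really+dodgy+parameter"))
# ['host=some-new-site.com', 'passes=a', 'passes=really', 'passes=dodgy', 'passes=parameter']
```

The token `passes=dodgy` also appears in the first BAD example, which is the kind of overlap that would let a model generalise to a site it has never seen.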

I feel the need for a support vector machine (SVM) ... but I need to learn machine learning. Some questions:

1) Is an SVM appropriate for this task?
2) Can I train it with the raw URLs, without explicitly specifying 'features'?
3) How many URLs will I need for it to be good at predictions?
4) What kind of SVM kernel should I use?
5) After I train it, how do I keep it up to date?
6) How do I test unseen URLs against the SVM to decide whether they are good or bad?
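
For questions 2, 4 and 6, the workflow might look roughly like the sketch below. It is a toy, standard-library-only linear SVM trained with Pegasos-style stochastic subgradient steps, not a recommendation of a specific library. The assumptions are mine: character trigrams of the decoded query string stand in for "raw URL" features, the host is ignored, there is no bias term, and the names `features`, `train_linear_svm`, `predict` and the values of `lam` and `epochs` are hypothetical. It is trained only on the two labelled URLs from this post, which is far too little data for real use.

```python
import random
from collections import Counter
from urllib.parse import urlsplit, unquote_plus

def features(url, n=3):
    """Count character n-grams of the decoded query string (host ignored on purpose)."""
    text = unquote_plus(urlsplit(url).query).lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(weights, x):
    """Dot product between a sparse weight dict and a sparse feature Counter."""
    return sum(weights.get(g, 0.0) * c for g, c in x.items())

def train_linear_svm(urls, labels, lam=0.01, epochs=100, seed=0):
    """Pegasos-style training: hinge loss plus L2 regularisation, labels in {+1, -1}."""
    rng = random.Random(seed)
    data = [(features(u), y) for u, y in zip(urls, labels)]
    w, t = {}, 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)          # decaying step size
            margin = y * score(w, x)       # margin under the current weights
            for g in w:                    # regularisation shrinks every weight
                w[g] *= 1.0 - eta * lam
            if margin < 1:                 # hinge loss is active: move towards y * x
                for g, c in x.items():
                    w[g] = w.get(g, 0.0) + eta * y * c
    return w

def predict(weights, url):
    return "GOOD" if score(weights, features(url)) > 0 else "BAD"

# The two labelled examples from the question: +1 = GOOD, -1 = BAD.
train_urls = ["http://somesite.com?passes=dodgy+parameter",
              "http://anothersite.com?passes=a+good+parameter"]
train_labels = [-1, +1]
w = train_linear_svm(train_urls, train_labels)
print(predict(w, "http://some-new-site.com?passes=a+really+dodgy+parameter"))  # BAD
```

A linear kernel is a common starting point for sparse, text-like features such as these. Because training is incremental, the same update step could in principle be run on newly labelled URLs to keep the weights current (question 5), though periodic retraining from scratch is the simpler option.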

© Stack Overflow or respective owner
