Web Crawler for Learning Topics on Wikipedia

Posted by Chris Okyen on Programmers
Published on 2012-10-08T16:23:39Z Indexed on 2012/10/08 21:50 UTC

When I want to learn a vast topic on Wikipedia, I don't know where to start. For instance, say I want to learn about binary stars. I then have to understand the other things linked on that page, the pages linked from all of those pages, and so on for a specified number of levels. I want to write a web crawler, like HTTrack or something similar, that will display a hierarchy of the links on a given page and the links on those linked pages. I wish to use as much prewritten code as possible. Here is an example:

Pretend we are bending the rules by grabbing links from only the first sentence of each page.

The example archives and "processes" two levels deep

The page is Ternary operation

The First Level

In mathematics a ternary operation is an N-ary operation

The Second Level

Under Mathematics:

Mathematics (from Greek μάθημα máthēma, “knowledge, study, learning”) is the abstract study of topics encompassing quantity, structure, space, change and others; it has no generally accepted definition.

Under N-ary:

In logic, mathematics, and computer science, the arity /ˈærɪti/ of a function or operation is the number of arguments or operands that the function takes

Under Operation:

In its simplest meaning in mathematics and logic, an operation is an action or procedure which produces a new value from one or more input values
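
The two-level example above can be sketched as a depth-limited, breadth-first crawl. This is only a rough sketch, not a finished crawler: `SAMPLE_LINKS`, `sample_get_links`, `crawl`, and `hierarchy_lines` are hypothetical names, and the link table is a hand-written stand-in that keeps only the links discussed in this post instead of fetching real Wikipedia pages. A real version would swap in a `get_links` function that downloads a page and extracts its links.

```python
from collections import deque

# Hypothetical stand-in for real page fetching: maps a page title to the
# titles linked from its first sentence (simplified to the links in this post).
SAMPLE_LINKS = {
    "Ternary operation": ["Mathematics", "N-ary", "Operation"],
    "Mathematics": [],
    "N-ary": ["Operation"],
    "Operation": [],
}


def sample_get_links(title):
    """Look up links in the hand-written table; unknown pages have no links."""
    return SAMPLE_LINKS.get(title, [])


def crawl(start, get_links, max_depth):
    """Breadth-first crawl from `start`, expanding pages on the first
    `max_depth` levels. Returns {page: [linked pages]} for expanded pages."""
    graph = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # beyond the requested number of levels: do not expand
        links = list(get_links(page))
        graph[page] = links
        for target in links:
            if target not in seen:  # expand each page at most once
                seen.add(target)
                queue.append((target, depth + 1))
    return graph


def hierarchy_lines(graph, page, level=0, path=()):
    """Render the link hierarchy as indented lines, one page per line."""
    lines = ["  " * level + page]
    if page in path:  # already on the current branch: stop to avoid a loop
        return lines
    for child in graph.get(page, []):
        lines.extend(hierarchy_lines(graph, child, level + 1, path + (page,)))
    return lines


if __name__ == "__main__":
    graph = crawl("Ternary operation", sample_get_links, max_depth=2)
    print("\n".join(hierarchy_lines(graph, "Ternary operation")))
```

Passing the link lookup in as a function keeps the traversal separate from whatever prewritten library ends up doing the fetching and parsing.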

-------------------------------------------------------------------------

I need some way to determine in what order to approach all these wiki pages to learn the concept (in this case ternary operations). Following along with this example, one way to show the reading path would be a flowchart-style printout like so:

[Image: diagram of the reading path; not preserved]

This shows that the first sentence of the Mathematics page doesn't link to any of the other pages linked from the first sentence of the Ternary operation page. (Please tell me how I should explain this.) In other words, Mathematics, a child node of the top page, Ternary operation, has no child nodes that reference the top page's other children, N-ary and Operation. Thus it is safe to read Mathematics first. Since N-ary has a link to Operation, we should read the Operation page second and the N-ary page last.
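
The ordering rule described here looks like a topological sort of the link graph: a page is read only after the pages it links to. Below is a minimal sketch under that reading, using a depth-first post-order traversal. The names `LINKS` and `reading_order` are hypothetical, and the link table is again a hand-written simplification of the example, not real crawl output. Wikipedia links can form cycles, so the sketch simply ignores a link back to a page that is already being processed; that tie-breaking choice is an assumption, not something the example settles.

```python
# Hypothetical link table for the example: page -> pages linked from its
# first sentence (simplified to the links discussed in this post).
LINKS = {
    "Ternary operation": ["Mathematics", "N-ary", "Operation"],
    "Mathematics": [],
    "N-ary": ["Operation"],
    "Operation": [],
}


def reading_order(graph, start):
    """Return the pages reachable from `start`, ordered so that each page
    comes after the pages it links to (depth-first post-order)."""
    order = []
    visited = set()

    def visit(page):
        if page in visited:
            return  # already handled, or a link back into a cycle
        visited.add(page)
        for target in graph.get(page, []):
            visit(target)  # read what this page depends on first
        order.append(page)

    visit(start)
    return order


if __name__ == "__main__":
    for position, page in enumerate(reading_order(LINKS, "Ternary operation"), 1):
        print(position, page)
```

For the example graph this yields Mathematics, Operation, N-ary, and then Ternary operation itself, which matches the order worked out by hand above.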

Again, I wish to use as much prewritten code as possible. What language should I use, and what would be the simplest way to go about doing this if there isn't already something out there?

Thank You!

© Programmers or respective owner
