How to store data crawled from a website

Posted by Richard on Stack Overflow
Published on 2010-03-17T04:19:30Z Indexed on 2010/03/17 4:31 UTC

Filed under: webcrawling | file-system | database

I want to crawl a website and store the content on my computer for later analysis. However, my OS's file system limits the number of subdirectories, so mirroring the site's original folder structure on disk is not going to work.

Suggestions?

Should I map each URL to a filename so that everything can be stored in one flat directory? Or just put it all in a database such as SQLite to avoid the file system limitations?
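
For illustration, here is a minimal Python sketch of both ideas, using only the standard library. The function names, the `pages` table, and its columns are hypothetical choices made for this example, and hashing the URL with SHA-256 is just one possible way to get a flat, filesystem-safe name.

```python
import hashlib
import sqlite3
from pathlib import Path


def url_to_filename(url: str) -> str:
    """Map a URL to a fixed-length, filesystem-safe name (SHA-256 hex digest)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def save_flat(root: Path, url: str, content: bytes) -> Path:
    """Option 1: write every page into a single directory, named by URL hash."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / url_to_filename(url)
    path.write_bytes(content)
    return path


def open_store(db_path: str) -> sqlite3.Connection:
    """Option 2: open (or create) a SQLite database keyed by URL."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " fetched_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " content BLOB NOT NULL)"
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, content: bytes) -> None:
    """Insert a page, overwriting any earlier copy stored under the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
            (url, content),
        )


def load_page(conn: sqlite3.Connection, url: str):
    """Return the stored bytes for a URL, or None if it was never saved."""
    row = conn.execute(
        "SELECT content FROM pages WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

A hashed filename cannot be turned back into the URL, so the flat-file variant would need a separate index from name to URL if that mapping matters for the analysis. The SQLite variant keeps the URL next to the content, so no extra index is needed.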

© Stack Overflow or respective owner
