Writing a PHP web crawler using cron

Posted by Horse on Stack Overflow See other posts from Stack Overflow or by Horse
Published on 2011-01-11T16:42:58Z Indexed on 2011/01/11 16:53 UTC
Read the original article Hit count: 243

Filed under:

php

|

regex

|

web-crawler

|

links

Hi all

I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links into a database table, sets a session pointer, and meta refreshes the page to carry onto the next page. That keeps going until it runs out of links

That works fine however obviously the crawl time for larger websites is pretty tedious. I wanted to be able to speed things up a bit though, and possibly make it a cron job.

Any ideas on making it as quick and efficient as possible other than setting the memory limit / execution time higher?

© Stack Overflow or respective owner

Related posts about php

Magento, NGINX, PHP-FPM, APC, MEMCACHED, 16gb Ram CentOS, Spiking PHP-FPM to 100% CPU

as seen on Server Fault - Search for 'Server Fault'
I have been trying to resolve my issue of spiking cpu caused by php-fpm processes. I've reduced the php-fpm config settings to: pm = ondemand pm.max_children = 12 pm.start_servers = 2 pm.min_spare_servers = 2 pm.max_spare_servers = 10 pm.max_requests = 500 php_admin_value[memory_limit] = 128M Problem… >>> More
PHP Pear Installation on CentOS

as seen on Server Fault - Search for 'Server Fault'
[root@ip ~]# yum install php-pear* Reducing CentOS-5 Testing to included packages only Finished Setting up Install Process Package 1:php-pear-1.8.1-2.el5.centos.noarch already installed and latest versio … >>> More
Apache configurations for php "AddType text/html php" or "AddType application/x-httpd-php php .php"

as seen on Server Fault - Search for 'Server Fault'
I am taking over an application server and discover that it contain the following settings: AddType text/html php Although it works, but my understanding is that it should set as following: AddType application/x-httpd-php php .php What are the key differences between the two settings?… >>> More
mod_rewrite settings causes server to throw HTTP 500 errors instead of 404

as seen on Server Fault - Search for 'Server Fault'
Hello. I have a server with VBulletin forum (working under Apache 2.2, CentOS). The default settings for it in .htaccess are as follows: RewriteEngine on RewriteCond %{HTTP_HOST} ^gsmforum\.ru RewriteRule (.*) http://www.gsmforum.ru/$1 [R=301,L] # If you are having problems or are using VirtualDocumentRoot… >>> More
Problems installing Memcache (PECL extension)

as seen on Server Fault - Search for 'Server Fault'
I have installed memcached fine, and now I will need to install PECL extension memcache. Im running RedHat x86_64 es5. The installation gives me this: downloading memcache-2.2.6.tgz ... Starting to download memcache-2.2.6.tgz (35,957 bytes) ..........done: 35,957 bytes 11 source files, building running:… >>> More

Related posts about regex

Find multiple regex in each line and skip result if one of the regex doesn't match

as seen on Stack Overflow - Search for 'Stack Overflow'
I have a list of variables: variables = ['VariableA', 'VariableB','VariableC'] which I'm going to search for, line by line ifile = open("temp.txt",'r') d = {} match = zeros(len(variables)) for line in ifile: emptyCells=0 for i in range(len(variables)): regex = r'('+variables[i]+r')[:|=|\(](-… >>> More
OWASP Regex Repository: Is this regex correct?

as seen on Stack Overflow - Search for 'Stack Overflow'
I was looking at the regular expression for validating various data types from the (OWASP Regex Repository). One of the regular expressions in there is called safetext and looks like: ^[a-zA-Z0-9\s.\-]+$ My first question is: Is this regular expression correct? complementary question If this… >>> More
Make a Perl-style regex interpreter behave like a basic or extended regex interpreter

as seen on Stack Overflow - Search for 'Stack Overflow'
I am writing a tool to help students learn regular expressions. I will probably be writing it in Java. The idea is this: the student types in a regular expression and the tool shows which parts of a text will get matched by the regex. Simple enough. But I want to support several different regex… >>> More
JS regex isn't matching, even thought it works with a regex tester

as seen on Stack Overflow - Search for 'Stack Overflow'
I'm writing a piece of client-side javascript code that takes a function and finds the derivative of it, however, the regex that's supposed to match with the power rule fails to work in the context of the javascript program, even though it sucessfully matches when it's used with an independent regex… >>> More
c# RegEx with "|"

as seen on Stack Overflow - Search for 'Stack Overflow'
I need to be able to check for a pattern with | in them. For example an expression like d*|*t should return true for a string like "dtest|test". I'm no regex hero so I just tried a couple of things, like: Regex Pattern = new Regex("s*\|*d"); //unable to build because of single backslash Regex Pattern… >>> More