Search Results

Search found 211 results on 9 pages for 'spider paddy'.

Page 1/9 | 1 2 3 4 5 6 7 8 9 | Next Page >

Creating a spider using Scrapy, Spider generation error.

- by Nacari

I just downloaded Scrapy (web crawler) on Windows 32 and have just created a new project folder using the "scrapy-ctl.py startproject dmoz" command in dos. I then proceeded to created the first spider using the command: scrapy-ctl.py genspider myspider myspdier-domain.com but it did not work and returns the error: Error running: scrapy-ctl.py genspider, Cannot find project settings module in python path: scrapy_settings. I know I have the path set right (to python26/scripts), but I am having difficulty figuring out what the problem is. I am new to both scrapy and python so there is a good possibility that I have failled to do something important. Also, I have been using eclipse with the Pydev plugin to edit the code if that might cause some problems.

Read the article
Site crawler/spider that tosses results into mysql

- by ian.evans

It's been suggested that we use mysql for our site's search as it'd be running on the same server that hosts our web server (nginx) and our db (mysql). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, and toss the page url and data into mysql and have sphinx index on that. Does anyone know of an open source spider that has a mysql storing option out of the box. Thanks.

Read the article
Baidu spider is hammering my server and bloating my error_log file

- by Gravy

I am getting the following errors in my /etc/httpd/logs/error_log file [Sun Oct 20 00:04:15 2013] [error] [client 180.76.5.16] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:08:31 2013] [error] [client 180.76.5.113] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:12:47 2013] [error] [client 180.76.5.88] File does not exist: /usr/local/apache/htdocs/homes [Sun Oct 20 00:17:07 2013] [error] [client 180.76.5.138] File does not exist: /usr/local/apache/htdocs/homes These kinds of errors are so often, that my error log files are over 500MB! I have done an IP trace on the client address to find that it belongs to something called baidu. Beijing Baidu Netcom Science and Technology Co in China. Is there a way that I can just get apache to deny any incoming requests from some crummy spider that is repeatedly hitting my site??? Is there a better way of dealing with the problem? I am happy to completely block out China if it means that I can actually track real errors.

Read the article
Website crawler/spider to get site map

- by ack__

I need to retrieve a whole website map, in a format like : http://example.org/ http://example.org/product/ http://example.org/service/ http://example.org/about/ http://example.org/product/viewproduct/ I need it to be linked-based (no file or dir brute-force), like : parse homepage - retrieve all links - explore them - retrieve links, ... And I also need the ability to detect if a page is a "template" to not retrieve all of the "child-pages". For example if the following links are found : http://example.org/product/viewproduct?id=1 http://example.org/product/viewproduct?id=2 http://example.org/product/viewproduct?id=3 I need to get only once the http://example.org/product/viewproduct I've looked into HTTtracks, wget (with spider-option), but nothing conclusive so far. The soft/tool should be downloadable, and I prefer if it runs on Linux. It can be written in any language. Thanks

Read the article
Spider a Website and Return URLs Only

- by Rob Wilkerson

I'm not quite sure how best to define/articulate this, but I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work: wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:' The grep filter seems to have absolutely no affect on the wget output. Have I got something wrong or is there another tool I should try that's more geared towards providing this kind of limited result set? Thanks. UPDATE So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I piped the return to stdout, I got closer to what I need: wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:' I'd still be interested in other/better means for doing this kind of thing, if any exist.

Read the article
Getting Started with Python: Attribute Error

- by Nacari

I am new to python and just downloaded it today. I am using it to work on a web spider, so to test it out and make sure everything was working, I downloaded a sample code. Unfortunately, it does not work and gives me the error: "AttributeError: 'MyShell' object has no attribute 'loaded' " I am not sure if the code its self has an error or I failed to do something correctly when installing python. Is there anything you have to do when installing python like adding environmental variables, etc.? And what does that error generally mean? Here is the sample code I used with imported spider class: import chilkat spider = chilkat.CkSpider() spider.Initialize("www.chilkatsoft.com") spider.AddUnspidered("http://www.chilkatsoft.com/") for i in range(0,10): success = spider.CrawlNext() if (success == True): print spider.lastUrl() else: if (spider.get_NumUnspidered() == 0): print "No more URLs to spider" else: print spider.lastErrorText() # Sleep 1 second before spidering the next URL. spider.SleepMs(1000)

Read the article
How does bing-bot( is that the right spider-name? ) and googlebot interpret 301 redirect?

- by jbcurtin

I've been looking for documentation on how the Microsoft and Google bots interpret 301 redirects. It seems that google-bot stores documents on a url based index system. But I haven't been able to figure out how bing works. Should I assume that they are still working towards coping everyone else and assume they use an algorithm close to google? Is it best to just forward a page to a new location via Javascript? I think this might be a blackhat trick, but how would I tell the bots that it's not? Is 301 redirect my best option and I just have to bit the bullet because said pages are no longer in existence? What other options do I have that I might not be aware of?

Read the article
How to create a web crawler/spider/robot?

- by Chris

Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only...I don't need links, descriptions, etc. What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way? A simple example or a link to more information would be much appreciated.

Read the article
Scrapy Could not find spider Error

- by Nacari

I have been trying to get a simple spider to run with scrapy, but keep getting the error: Could not find spider for domain:stackexchange.com when I run the code with the expression scrapy-ctl.py crawl stackexchange.com. The spider is as follow: from scrapy.spider import BaseSpider from __future__ import absolute_import class StackExchangeSpider(BaseSpider): domain_name = "stackexchange.com" start_urls = [ "http://www.stackexchange.com/", ] def parse(self, response): filename = response.url.split("/")[-2] open(filename, 'wb').write(response.body) SPIDER = StackExchangeSpider()` Another person posted almost the exact same problem months ago but did not say how they fixed it, http://stackoverflow.com/questions/1806990/scrapy-spider-is-not-working I have been following the turtorial exactly at http://doc.scrapy.org/intro/tutorial.html, and cannot figure out why it is not working.

Read the article
Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 4)

- by hinkmond

Here's the first sign of life of a Hexbug Spider Robot converted to become a Skynet Big Data model T-1. Yes, this is T-1 the precursor to the Cyberdyne Systems T-101 (and you know where that will lead to...) It is demonstrating a heartbeat using a simple Java SE Embedded program to drive it. See: Skynet Model T-1 Heartbeat It's alive!!! Well, almost alive. At least there's a pulse. We'll program more to its actions next, and then finally connect it to Skynet Big Data to do more advanced stuff, like hunt for Sara Connor. Java SE Embedded programming makes it simple to create the first model in the long line of T-XXX robots to take on the world. Raspberry Pi makes connecting it all together on one simple device, easy. Next post, I'll show how the wires are connected to drive the T-1 robot. Hinkmond

Read the article
Skynet Big Data Demo Using Hexbug Spider Robot, Raspberry Pi, and Java SE Embedded (Part 3)

- by hinkmond

In Part 2, I described what connections you need to make for this demo using a Hexbug Spider Robot, a Raspberry Pi, and Java SE Embedded for programming. Here are some photos of me doing the soldering. Software engineers should not be afraid of a little soldering work. It's all good. See: Skynet Big Data Demo (Part 2) One thing to watch out for when you open the remote is that there may be some glue covering the contact points. Make sure to use an Exacto knife or small screwdriver to scrape away any glue or non-conductive material covering each place where you need to solder. And after you are done with your soldering and you gave the solder enough time to cool, make sure all your connections are marked so that you know which wire goes where. Give each wire a very light tug to make sure it is soldered correctly and is making good contact. There are lots of videos on the Web to help you if this is your first time soldering. Check out Laday Ada's (from adafruit.com) links on how to solder if you need some additional help: http://www.ladyada.net/learn/soldering/thm.html If everything looks good, zip everything back up and meet back here for how to connect these wires to your Raspberry Pi. That will be it for the hardware part of this project. See, that wasn't so bad. Hinkmond

Read the article
Scrapy domain_name for spider

- by Zeynel

From the Scrapy tutorial: domain_name: identifies the Spider. It must be unique, that is, you can’t set the same domain name for different Spiders. Does this mean that domain_name must be a valid domain name, like domain_name = 'example.com' Or can I name domain_name = 'ex1' The problem is I had a spider that worked with domain name domain_name = 'whitecase.com' Now I created a new spider as an instance of CrawlSpider and named it domain_name = 'wc2' but I am getting the error "could not find spider for domain "wc2""

Read the article
Is there anyway of making json data readable by a Google spider?

- by leeand00

Is it possible to make JSON data readable by a Google spider? Say for instance that I have a JSON feed that contains the data for an e-commerce site. This JSON data is used to populate a human-readable page in the users browser. (I.E. The translation from JSON data to human displayed page is done inside the users browser; not my choice, just what I've been given to work with, its an old legacy CGI application and not an actual server-side scripting language.) My concern here is that, the google spiders will not be able to pickup/directly link to the item in question when a user clicks on it in google, being presented with an index page full of all the items, rather than being linked directly to the item they clicked on. Is there anyway of "informing" the google spider in the JSON that what they should feed the user a different link?

Read the article
What is the best way to archive (spider) a site that is going to be removed?

- by Guy

Three different blogs that I read have recently announced that they are going to be discontinued and removed from the web. Although the archived pages will probably be in Google's cache for a few weeks after they've gone and some of the pages will be in the Way Back Machine I'd like to archive those sites to my hard disk for future reference. What is the best way to do this? Is there any software that transforms a blog (e.g. Blogspot) into a chronological PDF?

Read the article
Code Golf: Spider webs

- by LiraNuna

The challenge The shortest code by character count to output a spider web with rings equal to user's input. A spider web is started by reconstructing the center ring: \_|_/ _/ \_ \___/ / | \ Then adding rings equal to the amount entered by the user. A ring is another level of a "spider circles" made from \ / | and _, and wraps the center circle. Input is always guaranteed to be a single positive integer. Test cases Input 1 Output \__|__/ /\_|_/\ _/_/ \_\_ \ \___/ / \/_|_\/ / | \ Input 4 Output \_____|_____/ /\____|____/\ / /\___|___/\ \ / / /\__|__/\ \ \ / / / /\_|_/\ \ \ \ _/_/_/_/_/ \_\_\_\_\_ \ \ \ \ \___/ / / / / \ \ \ \/_|_\/ / / / \ \ \/__|__\/ / / \ \/___|___\/ / \/____|____\/ / | \ Input: 7 Output: \________|________/ /\_______|_______/\ / /\______|______/\ \ / / /\_____|_____/\ \ \ / / / /\____|____/\ \ \ \ / / / / /\___|___/\ \ \ \ \ / / / / / /\__|__/\ \ \ \ \ \ / / / / / / /\_|_/\ \ \ \ \ \ \ _/_/_/_/_/_/_/_/ \_\_\_\_\_\_\_\_ \ \ \ \ \ \ \ \___/ / / / / / / / \ \ \ \ \ \ \/_|_\/ / / / / / / \ \ \ \ \ \/__|__\/ / / / / / \ \ \ \ \/___|___\/ / / / / \ \ \ \/____|____\/ / / / \ \ \/_____|_____\/ / / \ \/______|______\/ / \/_______|_______\/ / | \ Code count includes input/output (i.e full program).

Read the article
how does spider in a search engine works?

- by Niraj CHoubey

How does crawler or spider in a search engine works

Read the article
Scrapy spider is not working

- by Zeynel

Since nothing so far is working I started a new project with python scrapy-ctl.py startproject Nu I followed the tutorial exactly, and created the folders, and a new spider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.selector import HtmlXPathSelector from scrapy.item import Item from Nu.items import NuItem from urls import u class NuSpider(CrawlSpider): domain_name = "wcase" start_urls = ['http://www.whitecase.com/aabbas/'] names = hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+') u = names.pop() rules = (Rule(SgmlLinkExtractor(allow=(u, )), callback='parse_item'),) def parse(self, response): self.log('Hi, this is an item page! %s' % response.url) hxs = HtmlXPathSelector(response) item = Item() item['school'] = hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)') return item SPIDER = NuSpider() and when I run C:\Python26\Scripts\Nu>python scrapy-ctl.py crawl wcase I get [Nu] ERROR: Could not find spider for domain: wcase The other spiders at least are recognized by Scrapy, this one is not. What am I doing wrong? Thanks for your help!

Read the article
Google Analytics: Spider Image

- by Fuxi

hi all, i'd like to have google analytics also spider images (eg. livecams) - i mean it should directly spider how much a certain .jpg was loaded. is this possible? thx

Read the article
How do I block a user-agent from Apache

- by rubo77

How do I realize a UA string block by regular expression in the config files of my Apache webserver? For example: if I would like to block out all bots from Apache on my debian server, that have the regular expression /\b\w+[Bb]ot\b/ or /Spider/ in their user-agent. Those bots should not be able to see any page on my server and they should not appear neither in the accesslogs nor in the errorlogs. http://global-security.blogspot.de/2009/06/how-to-block-robots-before-they-hit.html supposes to uses mod_security for that, but isn't there a simple directive for http.conf?

Read the article
building Mozilla Spider Monkey on Ubuntu

- by Hussain

Hi. I'm trying to build spider monkey on ubuntu 10.04 (lucid). However, when I run autoconf2.13 on the js/src directory, it tells me there is no configure.in file. I can't just do the usual ./configure make sudo make install , either. What's up with it?

Read the article
Cron job on Plesk command line options (spider)

- by Paulo Bueno

Hi guys, My customer is setting up a cron task via plesk. In pycron there's an option called spider that prevents the cron to save a copy of the file on the server root. Does anyone knows wich are the command line options accepted by a cron task setted trought plesk? thx.

Read the article
SphinxSearch or a spider - which one to choose?

- by r2b2

Hello, here is my problem: We own SiteA and SiteB and they share the same server and database where we have full control. SiteC , siteD and siteE are some of the sites we own as well but reside on a different web hosts. The goal is to create a unified search functionality for all of the sites mentioned above. That is if somebody search for a term in SiteA, the search result will automatically come up with results from SiteB,SiteC,SiteD and Site E too. The search results should be shown under the website they were found in. All these websites content are stored in their own databases. If I use SphinxSearch to index the above sites,I would then require those sites that we dont have complete control with to setup a web service where i can download a database dump or csv file for indexing. Im not quite sure about how a sphider will come into play here so need your opinion. Sphinx or a spider? THanks!

Read the article
Is there any reason to allow Yahoo! Slurp to crawl my site?

- by James Skemp

I thought a year or more ago Yahoo! would be using another search engine for results, and no longer using their own Slurp bot. However, a couple of the sites I manage Yahoo! Slurp continues to crawl pages, and seems to ignore the Gone status code when returned (as it keeps coming back). Is there any reason why I wouldn't want to block Yahoo! Slurp via robots.txt or by IP (since it tends to ignore robots.txt in some cases anyways)? I've confirmed that when the bot does hit it is from Yahoo! IPs, so I believe this is a legit instance of the bot. Is Yahoo Search the same as Bing Search now? is a related question, but I don't think it completely answers whether one should add a new block of the bot.

Read the article
Using Scrapy Spider in a CGI Script or Web Framework, ie: Django

- by Israel ANY

How do I use a Python library such as Scrapy in a CGI script or even in a framework such as Django if I need to do so later? Here is the documentation I've consulted thus far, but it doesn't seem to meet the concern I have. http://doc.scrapy.org/topics/spiders.html http://doc.scrapy.org/topics/webconsole.html Critique and suggestions are welcomed!

Read the article
Storing URLs while Spidering

- by itemio

I created a little web spider in python which I'm using to collect URLs. I'm not interested in the content. Right now I'm keeping all the visited URLs in a set in memory, because I don't want my spider to visit URLs twice. Of course that's a very limited way of accomplishing this. So what's the best way to keep track of my visited URLs? Should I use a database? * which one? MySQL, sqlite, postgre? * how should I save the URLs? As a primary key trying to insert every URL before visiting it? Or should I write them to a file? * one file? * multiple files? how should I design the file-structure? I'm sure there are books and a lot of papers on this or similar topics. Can you give me some advice what I should read?

Read the article

1 2 3 4 5 6 7 8 9 | Next Page >