Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Posted by Diego on Stack Overflow, 2010-05-17

Is there a way to get around the following?

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Is the only way around this to contact the site owner (barnesandnoble.com)? I'm building a site that would bring them more sales, so I'm not sure why they would deny access at a certain depth.

I'm using mechanize and BeautifulSoup on Python 2.6.

Hoping for a workaround.
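For context, a minimal sketch of the kind of workaround in question, assuming the site owner's terms permit it: mechanize fetches and obeys robots.txt by default, which is what raises this 403, so the check can be switched off and a browser-like User-Agent sent instead. The URL and header string below are illustrative only.

    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2.6

    br = mechanize.Browser()
    # Skip the robots.txt check that triggers
    # "HTTP Error 403: request disallowed by robots.txt"
    br.set_handle_robots(False)
    # Send a browser-like User-Agent instead of mechanize's default
    br.addheaders = [('User-Agent',
                      'Mozilla/5.0 (compatible; example-scraper/0.1)')]

    response = br.open('http://www.barnesandnoble.com/')  # example URL
    soup = BeautifulSoup(response.read())

The cleaner route is still the one raised above: ask barnesandnoble.com for permission or an API, since ignoring robots.txt may conflict with their terms of use.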

