Search Results

Search found 261 results on 11 pages for 'crawler'.

Page 4/11 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11  | Next Page >

  • How to block this URL pattern in Varnish VCL?

    - by iTech
    My website is getting badly hit by spambots and scrappers, I am using Cloudflare but the problem still remains there. The problem is spambots accessing non-existing urls causing a lot of load to my drupal backend which goes all the way and bootstraps db just to serve a 404 error doc. I cant simply dish out non-drupal 404's for all page not found errors, as I need to have drupal catch them. Since, varnish is in front it can check if the bot is acting nice and asking for valid url - if not it servers them a 404 or 403. These bots are causing errors using this pattern : http://www.megaleecher.net/http:/www.megaleecher.net/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_S/Using_iPhone_As_USB_Mass_Storage Now, pls. suggest a regex varnbisg VCL directive which catches this URL pattern and serves a 404 error from varnish, preventing it from reaching apache/drupal ?

    Read the article

  • Does searching a keyword on Google make the crawlers look harder in the future?

    - by Foo Bar
    Do the search requests made by the users influence the Google crawlers "attraction" by this keyword? Let's say Google has some hits on a specific keyword in the search index. And now I search for exactly this keyword. Will the Google crawlers react to the search and keep looking more intense for pages that could match this keyword? A reason why this could be important: Privacy when searching yourself. Assume you just want to know how much Google (and thus other people) can find out about you. If now any (statistical) additional search for your name trigger the crawlers even one step harder to find even more about you, it would have the negative effect that you would actually be found easier in the future, even though you had the intention and hope to find out how few Google finds about you. It's a bit like the dillema in quantum mechanis: Does observing the system automatically change the system?

    Read the article

  • Is there a list of known web crawlers?

    - by J. Pablo Fernández
    I'm trying to get accurate download numbers for some files on a web server. I look at the user agents and some are clearly bots or web crawlers, but many for many I'm not sure, they may or may not be a web crawler and they are causing many downloads so it's important for me to know. Is there somewhere a list of know web crawlers with some documentation like user agent, IPs, behavior, etc? I'm not interested in the official ones, like Google's, Yahoo's, or Microsoft's. Those are generally well behaved and self-indentified.

    Read the article

  • Legality, terms of service for performing a web crawl

    - by Berlin Brown
    I was going to crawl a site for some research I was collecting. But, apparently the terms of service is quite clear on the topic. Is it illegal to now "follow" the terms of service. And what can the site normally do? Here is an example clause in the TOS. Also, what about sites that don't provide this particular clause. Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site" It is just research? Edit: "OK, from the standpoint of designing an efficient crawler. Should I provide some form of natural language engine to read terms of service and then abide by them."

    Read the article

  • How to Identify the website's content language

    - by Ajay
    I am developing a website to crawl the other website content in ASP.NET . I am able to get the content correctly but how can I identify which language is used based on that content. I used following code. HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(TextBox1.Text ); request.UserAgent = "A .NET Web Crawler"; WebResponse response = request.GetResponse(); Stream stream = response.GetResponseStream(); StreamReader reader = new StreamReader(stream); string htmlText = reader.ReadToEnd();

    Read the article

  • TypeError: coercing to Unicode: need string or buffer, User found

    - by Clemens
    hi, i have to crawl last.fm for users (university exercise). I'm new to python and get following error: Traceback (most recent call last): File "crawler.py", line 23, in <module> for f in user_.get_friends(limit='200'): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 2717, in get_friends for node in _collect_nodes(limit, self, "user.getFriends", False): File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 3409, in _collect_nodes doc = sender._request(method_name, cacheable, params) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 969, in _request return _Request(self.network, method_name, params).execute(cacheable) File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 721, in __init__ self.sign_it() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 727, in sign_it self.params['api_sig'] = self._get_signature() File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/pylast.py", line 740, in _get_signature string += self.params[name] TypeError: coercing to Unicode: need string or buffer, User found i use the pylast lib for crawling. what i want to do: i want to get a users friends and the friends of the users friends. the error occurs, when i have a for loop in another for loop. here's the code: network = pylast.get_lastfm_network(api_key = API_KEY, api_secret = API_SECRET, username = username, password_hash = password_hash) user = network.get_user("vidarnelson") friends = user.get_friends(limit='200') i = 1 for friend in friends: user_ = network.get_user(friend) print '#%d %s' % (i, friend) i = i + 1 for f in user_.get_friends(limit='200'): print f any advice? thanks in advance. regards!

    Read the article

  • Is there a search engine that indexes source code of a web-page?

    - by Dexter
    I need to search the web for sites that are in our industry that use the same Adwords management company, to ensure that the said company is not violating our contract, as they have been accused of doing. They use a tracking code in the template of every page which has a certain domain in the URL, and I'm wondering if it's possible "Google" the source code using some bot that crawls the code rather than the content? For example, I bought an unlimited license for an image gallery, and I was asked to type the license number in a comment just before the script. I thought it was just so a human could look at the source and find out if someone paid, but it turned out that it was actually that they had a crawler looking for their source code and that comment. If it ran across the code on your site, it would look for the comment, and if it found one, it would check to see if it was an existing one. If not, it would first notify you of your noncompliance, and then notify the owner of the script. Edit: I'm looking to index HTML and JavaScript only, not the server-side languages or Java.

    Read the article

  • Does a multithreaded crawler in Python really speed things up?

    - by beagleguy
    Was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, one pool of threads downloading and one pool processing results. Due to the GIL would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, etc..? Basically I'm asking is doing a multi-threaded crawler in python really going to buy me much performance vs single threaded? thanks!

    Read the article

  • Parsing with BeautifulSoup, error message TypeError: coercing to Unicode: need string or buffer, NoneType found

    - by Samsun Knight
    so I'm trying to scrape an Amazon page for data, and I'm getting an error when I try to parse for where the seller is located. Here's my code: #getting the html request = urllib2.Request('http://www.amazon.com/gp/offer-listing/0393934241/') opener = urllib2.build_opener() #hiding that I'm a webscraper request.add_header('User-Agent', 'Mozilla/5 (Solaris 10) Gecko') #opening it up, putting into soup form html = opener.open(request).read() soup = BeautifulSoup(html, "html5lib") #parsing for the seller info sellers = soup.findAll('div', {'class' : 'a-row a-spacing-medium olpOffer'}) for eachseller in sellers: #parsing for price price = eachseller.find('span', {'class' : 'a-size-large a-color-price olpOfferPrice a-text-bold'}) #parsing for shipping costs shippingprice = eachseller.find('span' , {'class' : 'olpShippingPrice'}) #parsing for condition condition = eachseller.find('span', {'class' : 'a-size-medium'}) #parsing for seller name sellername = eachseller.find('b') #parsing for seller location location = eachseller.find('div', {'class' : 'olpAvailability'}) #printing it all out print "price, " + price.string + ", shipping price, " + shippingprice.string + ", condition," + condition.string + ", seller name, " + sellername.string + ", location, " + location.string I get the error message, pertaining to the 'print' command at the end, "TypeError: coercing to Unicode: need string or buffer, NoneType found" I know that it's coming from this line - location = eachseller.find('div', {'class' : 'olpAvailability'}) - because the code works fine without that line, and I know that I'm getting NoneType because the line isn't finding anything. Here's the html from the section I'm looking to parse: <*div class="olpAvailability"> In Stock. Ships from WI, United States. <*br/><*a href="/gp/aag/details/ref=olp_merch_ship_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=shipping-rates#aag_shipping">Domestic shipping rates</a> and <*a href="/gp/aag/details/ref=olp_merch_return_9/175-0430757-3801038?ie=UTF8&amp;asin=0393934241&amp;seller=A1W2IX7T37FAMZ&amp;sshmPath=returns#aag_returns">return policy</a>. <*/div> (but without the stars - just making sure the HTML doesn't compile out of code form) I don't see what's the problem with the 'location' line of code, or why it's not pulling the data I want. Help?

    Read the article

  • How do web crawlers affect site statistics?

    - by LM
    What are ways in which web crawlers (both from search engines and non-search engines) could affect site statistics (e.g., when doing AB-testing different page variations)? And what are ways to take care of these problems? For example: Do a lot of people writing web crawlers often delete their cookies and mask their IPs, so that web crawlers often show up as different users each time they crawl the site? What are heuristics to use to recognize that something is a bot? (I'm guessing any sophisticated enough bot can be indistinguishable from a real user, if it wants to -- is this correct?)

    Read the article

  • What are the best measures to protect content from being crawled?

    - by Moak
    I've been crawling a lot of websites for content recently and am surprised how no site so far was able to put up much resistance. Ideally the site I'm working on should not be able to be harvested so easily. So I was wondering what are the best methods to stop bots from harvesting your web content. Obvious solutions: Robots.txt (yea right) IP blacklists What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them crap data? Just looking for ideas, no right/wrong answer

    Read the article

  • Where Googlebot starts crawling?

    - by Nirmal
    Say if I register a domain and have developed it into a complete website. From where and how Googlebot knows that the new domain is up? Does it always start with the domain registry? If it starts with the registry, does that mean that anyone can have complete access to the registry's database? Thanks for any insight.

    Read the article

  • Nutch - how to crawl by small patches?

    - by Yurish
    Hi everyone! I am stuck! Can`t get Nutch to crawl for me by small patches. I start it by bin/nutch crawl command with parameters -depth 7 and -topN 10000. And it never ends. Ends only when my HDD is empty. What i need to do: Start to crawl my seeds with possibility to go further on outlinks. Crawl 20000 pages, then index them. Crawl another 20000 pages, index them and merge with first index. Loop step 3 n times. Tried also with scripts found in wiki, but all scripts i found don't go further. If i run them again, they do everything from beginning. And in the end of script i have the same index i had, when started to crawl. But, i need to continue my crawl. Some help would be very usefull!

    Read the article

  • Age verification forms and crawlers

    - by user333763
    I have created a website about some beer brand and had to include age verification page. The verification script is written in PHP and uses sessions to store verification variable. The script works the way that no matter form which link you will try to enter the website it will take you to the verification page first. The verification is very simple. There are 2 button: "I'm under 21" and "I'm over 21". If you click the latter, you can browse the website. After some time I discovered that the web crawlers are not able to get past verification page. I checked the website in Google webmaster tools and the only text content scanned was from the verification page. I read somewhere that crawlers are not able to submit form buttons, is it true? Considering the fact that age verification pages are useless anyways, maybe I should just leave it as a starting page but don't forbid going around it, e.g. from links to the subpages?

    Read the article

  • Which metadata I should save when downloading web-pages?

    - by Vojtech R.
    Hi, I'm going to download (for future purposes of language processing) some thousands webpages. Now I'm thinking, which metadata I should save. I explore this, but I do not wont to neglect something important. <title> <link> <publish_date> <date_downloaded> <source> // to this page <keyword> // for Solr indexing <text> // cleaned body of page Is there something important what I could miss in future?

    Read the article

  • architecture python question

    - by tom smith
    hi. creating a distributed crawling python app. it consists of a master server, and associated client apps that will run on client servers. the purpose of the client app is to run across a targeted site, to extract specific data. the clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site. each client app looks something like main: parse initial url call function level1 (data1) function level1 (data) parse the url, for data1 use the required xpath to get the dom elements call the next function call level2 (data) function level2 (data2) parse the url, for data2 use the required xpath to get the dom elements call the next function call level3 function level3 (dat3) parse the url, for data3 use the required xpath to get the dom elements call the next function call level4 function level4 (data) parse the url, for data4 use the required xpath to get the dom elements at the final function.. --all the data output, and eventually returned to the server --at this point the data has elements from each function... my question: given that the number of calls that is made to the child function by the current function varies, i'm trying to figure out the best approach. each function essentialy fetches a page of content, and then parses the page using a number of different XPath expressions, combined with different regex expressions depending on the site/page. if i run a client on a single box, as a sequential process, it'll take awhile, but the load on the box is rather small. i've thought of attempting to implement the child functions as threads from the current function, but that could be a nightmare, as well as quickly bring the "box" to its knees! i've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way to allow each client/function to be run directly from the master. this process requires a bit of rewrite, but it has a number of advantages. a bunch of redundancy, and speed. it would detect if a section of the process was crashing and restart from that point. but not sure if it would be any faster... i'm writing the parsing scripts in python.. so... any thoughts/comments would be appreciated... i can get into a great deal more detail, but didn't want to bore anyone!! thanks! tom

    Read the article

  • How to scrap the first paragraphe from a wikipedia page?

    - by David
    Hi, Let's say i want to grab the first paragraphe in This wikipedia Page How to get the principal text between the title and CONTENTS box using XPath or DOM & PHP or something similar? Is there any php library for that? i don't want to use the api because it's a bit complex. Note: i just need that to add a widget under my pages that displays related infos from wikipedia. Thanks

    Read the article

  • How to scrape the first paragraph from a wikipedia page?

    - by David
    Let's say I want to grab the first paragraph in this wikipedia page. How do I get the principal text between the title and contents box using XPath or DOM & PHP or something similar? Is there any php library for that? I don't want to use the api because it's a bit complex. Note: i just need that to add a widget under my pages that displays related info from Wikipedia.

    Read the article

  • Scrapy - Follow RSS links

    - by Tupak Goliam
    Hello, I was wondering if anyone ever tried to extract/follow RSS links using SgmlLinkExtractor/CrawlSpider. I can't get it to work... I am using the following rule: rules = ( Rule(SgmlLinkExtractor(tags=('link',), attrs=False), follow=True, callback='parse_article'), ) (having in mind that rss links are located in the link tag). I am not sure how to tell SgmlLinkExtractor to extract the text() of the link and not to search the attributes ... Any help is welcome, Thanks in advance

    Read the article

  • is it possible to extract all PDFs from a site

    - by deming
    given a URL like www.mysampleurl.com is it possible to crawl through the site and extract links for all PDFs that might exist? I've gotten the impression that Python is good for this kind of stuff. but is this feasible to do? how would one go about implementing something like this? also, assume that the site does not let you visit something like www.mysampleurl.com/files/

    Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11  | Next Page >