Search Results

Search found 241 results on 10 pages for 'crawling'.

Page 8/10 | < Previous Page | 4 5 6 7 8 9 10  | Next Page >

  • Ranking drop after using reverse proxy for blog subdirectory and robots.txt for old blog subdomain

    - by user40387
    We have a 3Dcart store and a WordPress blog hosted on a separate server. Originally, we had a CNAME set up to point the blog to http://blog.example.com/. However, in our attempt to boost link-based and traffic-based authority on the main site, we've opted to do a reverse proxy to http://www.example.com/blog/. It's been about two months since we finished the reverse proxy migration. It appears that everything is technically working as intended, including some robots and sitemap changes; the new URLs are even generating some traffic, as indicated in Google Analytics. While Google has been indexing the new URL locations, they're ranking very poorly, even for non-competitive, long-tail keywords. Meanwhile, the old subdomain URLs are still ranking mostly as well as they used to (even though they aren't showing meta titles and descriptions, due to being blocked by robots.txt). Our working theory is that Google has an old index of the subdomain URLs and is considering the new URLs to be duplicate content, since it's being told not to crawl the subdomain and therefore can't see the rel=canonical tags we have in place. To resolve this, we've updated the subdomain's robots.txt to no longer block crawling and indexing. Theoretically, seeing the canonical tag on the subdomain pages will resolve any perceived duplicate content issues. In the meantime, we were wondering if anyone has any other ideas. We are very concerned that we'll be losing valuable traffic, as we're entering our on-season at the moment.
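
    For reference, the two pieces described above look roughly like this; the post URL is just a placeholder:

      # robots.txt on blog.example.com - no longer blocking, so Google can re-crawl the subdomain
      User-agent: *
      Disallow:

      <!-- on each blog.example.com page, pointing at its proxied counterpart -->
      <link rel="canonical" href="http://www.example.com/blog/some-post/" />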

    Read the article

  • JavaScript loaded external content SEO

    - by user005569871
    I wonder what the best way is to have JavaScript-loaded content indexed by search engines. I know that search engines don't execute JavaScript, but I am thinking more of a progressive enhancement. I am creating a responsive website, and on the home page I will have some sections for most-visited products and recommended products that I plan to load depending on the device detected. These products will be in sliders with thumbnail images and the names of the products. If mobile is detected, the slider content will not load, and a link to the external page will be shown instead. I know that external content will be indexed via the links to those resources. Where will users be directed from search in this case? To the external page or the home page? Will it be bad for SEO if I show only product names on the front page, so they can be indexed, and hide them with CSS? What is the best way to get that content indexed and, if possible, direct users from search to the home page? Also, I've seen the AJAX crawling scheme, but I would rather not use that if there is a better way.
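
    One progressive-enhancement shape that keeps the product names crawlable is to render plain product links server-side and let a script turn them into the slider on capable devices; a rough sketch, with made-up class names and URLs:

      <!-- served to everyone, so crawlers see real product links on the home page -->
      <ul class="recommended-products">
        <li><a href="/products/example-product-1">Example Product 1</a></li>
        <li><a href="/products/example-product-2">Example Product 2</a></li>
      </ul>
      <!-- on desktop, JavaScript enhances .recommended-products into the thumbnail slider;
           on mobile it can stay a plain list or be replaced by the single link described above -->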

    Read the article

  • Disallow robots.txt from being accessed in a browser but still accessible by spiders?

    - by Michael Irigoyen
    We make use of the robots.txt file to prevent Google (and other search spiders) from crawling certain pages/directories in our domain. Some of these directories/files are secret, meaning they aren't linked (except perhaps on other pages encompassed by the robots.txt file). Some of these directories/files aren't secret; we just don't want them indexed. If somebody browses directly to www.mydomain.com/robots.txt, they can see the contents of the robots.txt file. From a security standpoint, this is not something we want publicly available to anybody. Any directories that contain secure information are set behind authentication, but we still don't want them to be discoverable unless the user specifically knows about them. Is there a way to provide a robots.txt file but have its contents hidden from John Doe accessing it in his browser? Perhaps by using PHP to generate the document based on certain criteria? Perhaps something I'm not thinking of? We'd prefer a way to do it centrally (meaning a <meta> tag solution is less than ideal).
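
    Generating the file with PHP, as suggested above, could look roughly like this; the rewrite rule, bot list, and file name are made up, and since the User-Agent header is trivially spoofed this is obscurity rather than real security:

      # .htaccess - route requests for robots.txt through a PHP script instead of a static file
      RewriteEngine On
      RewriteRule ^robots\.txt$ /robots.php [L]

      <?php
      // robots.php - only reveal the rules to user agents that identify as known crawlers
      header('Content-Type: text/plain');
      $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
      if (preg_match('/googlebot|bingbot|slurp|msnbot/i', $ua)) {
          readfile(dirname(__FILE__) . '/robots_rules.txt'); // the real rules, kept in a non-obvious file
      } else {
          header('HTTP/1.0 404 Not Found'); // browsers just get a 404
      }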

    Read the article

  • mod_rewrite and SEO friendliness

    - by John Doe
    My website has an atypical structure and I'm not sure if this could create problems in the long run, especially for SEO purposes. I have a single, large PHP script, and I use the Apache module mod_rewrite in the .htaccess file to create friendly URLs, for example:

      RewriteRule ^$ /index.php?section=Main
      RewriteRule ^createArticle$ /index.php?section=Main&view=CreateArticle
      RewriteRule ^configuration$ /index.php?section=Configuration
      RewriteRule ^article/([0-9]{1,10})$ /index.php?section=Article&view=Default&id=$1
      RewriteRule ^deleteArticle/([0-9]{1,10})$ /index.php?section=Article&view=Delete&id=$1
      RewriteRule ^reportArticle/([0-9]{1,10})$ /index.php?section=Article&view=Report&id=$1
      RewriteRule ^logIn$ /index.php?section=Authentication
      ...

    So, www.example.com/index.php?section=Article&view=Default&id=105 would become www.example.com/article/105. The only real physical file is index.php, which processes the URL parameters and outputs the corresponding result. My question is: do crawling robots (e.g. Googlebot) recognize these links? Do they index the HTML output by index.php for the given parameters as if it were an actual HTML file? Also, would this become a problem when creating a Sitemap?
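
    On the Sitemap point, a sitemap would normally list the friendly URLs that mod_rewrite exposes rather than the index.php variants; a minimal example:

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://www.example.com/article/105</loc>
        </url>
      </urlset>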

    Read the article

  • Dualboot (Win 8 / Ubuntu 13) is stuck at 'switching to clocksource'

    - by Daniel Puscht
    For days I have been crawling the web for solutions to my problem, but couldn't find any. Here it is: I got a new laptop (ASUS Vivobook S200E) with Win 8 OEM preinstalled. I wanted to create a dual-boot system with Ubuntu 13 next to it. I read about UEFI and that I have to turn off Secure Boot and use the existing EFI partition as the boot loader partition for Ubuntu. So I did. I also ran Boot-Repair, reinstalling GRUB. The result is that when I start the computer I get into the boot menu. So far, so good. When I pick Windows, everything is fine. But when I choose Ubuntu (recovery), the system starts but gets stuck at the line '[1.806366] Switching to clocksource tsc'. I have already tried other versions of Ubuntu (12.04.2, 12.10) and played with Boot-Repair (using the recommended fix, setting everything manually), but nothing works; it is always the same issue. I read that it could be a problem concerning graphics drivers, but I can hardly believe this. If it is any help, Boot-Repair gave me this link to post in forums: http://paste.ubuntu.com/5810391/ Thanks for any help in advance.

    Read the article

  • Can I use nofollow for offsite links without it affecting my page rank?

    - by Jack
    What I have is a page with almost all offsite links. Each clicked link is forwarded on to its destination. What I would like the search engines to do is index the text between the anchor tags and not follow the link itself: <a href="somelink">Index This Text Only</a> I've read several articles and they all seem to contradict one another as to when to use nofollow. What has been happening over the past two months that the site has been live is that both Google and Bing are crawling the site, as well as all the links on the site that it forwards to. The search engines are now generating a lot of 404s for images and files that never existed on my site but seem to correlate to the sites being forwarded to. The search engines don't seem to honor the 302 redirect when forwarding. I would like to get a definitive answer on the nofollow attribute as it relates to my situation. Can I use nofollow to stop the 404s, and if so, will it affect my page ranking negatively?
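
    For reference, nofollow is a value of the link's rel attribute rather than a separate tag, so the example above would become:

      <a href="somelink" rel="nofollow">Index This Text Only</a>

    The anchor text remains ordinary page content that can be indexed; nofollow only asks engines not to follow the link or pass credit through it.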

    Read the article

  • Cross-domain jQuery using YQL gives robots.txt error

    - by Jens Roland
    On the page http://qxlapps.dk/test.htm I am trying to perform an Ajax load from another domain, qxlapp.dk. I am using James Padolsey's xdomainajax.js plugin from: http://james.padolsey.com/javascript/cross-domain-requests-with-jquery/ When I open my test page, I get no output, but Firebug shows the JSON result, including the error message: "forbidden":"robots.txt for the domain disallows crawling for url: http://qxlapp.dk/projects/dagens_kup/show.php". The robots.txt on the qxlapp.dk domain contains the following:

      User-agent: Yahoo Pipes 2.0
      Allow: /

      User-agent: *
      Allow: /

    So I don't see what the problem is. Shouldn't it pull the page just fine with those settings?

    Read the article

  • SQL Server 2005, Sudden increase of connections - SharePoint 2007

    - by CrazyNick
    We observed a sudden increase in SQL connections during a specific hour; the server is the backend of a SharePoint 2007 farm.

    From the SharePoint 2007 perspective:
    1. Incremental crawling is scheduled at that time, and a few of the (normal) timer jobs are scheduled to run every minute or every 10 minutes.
    2. The number of user requests is low.

    From the SQL Server 2005 perspective:
    1. Transaction log backup is scheduled at that time.
    2. No other scheduled jobs are running at that time.

    So, how do I narrow down the issue? What could be causing the sudden increase in SQL connections?
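
    To narrow down who is opening the connections, one starting point is to group the current sessions by application and host using the SQL Server 2005 DMVs, with something along these lines run during the spike:

      -- count current user sessions per application and host
      SELECT s.program_name, s.host_name, COUNT(*) AS session_count
      FROM sys.dm_exec_sessions AS s
      WHERE s.is_user_process = 1
      GROUP BY s.program_name, s.host_name
      ORDER BY session_count DESC;

    Comparing the counts during the spike with a normal hour should show which application or server is responsible.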

    Read the article

  • Extracting pure content / text from HTML Pages by excluding navigation and chrome content

    - by Ankur Gupta
    Hi, I am crawling news websites and want to extract the news title, the news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for that). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the similar text. This gives me the content minus the common navigation content, etc. Despite the above approach I am still getting quite a bit of junk in my final text, which results in incorrect news abstracts being extracted. The error rate is 5 in 10 articles, i.e. 50%. Can you suggest an alternative strategy for extracting pure content? Would natural language processing help in extracting the correct abstract from these articles? How would you approach the above problem? Are there any research papers on the same? Regards, Ankur Gupta

    Read the article

  • What are the best measures to protect content from being crawled?

    - by Moak
    I've been crawling a lot of websites for content recently and am surprised that no site so far has been able to put up much resistance. Ideally the site I'm working on should not be so easy to harvest. So I was wondering: what are the best methods to stop bots from harvesting your web content? Obvious solutions: robots.txt (yeah, right) and IP blacklists. What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them crap data? I'm just looking for ideas; there's no right/wrong answer.

    Read the article

  • iTunes Visualization -- What type of code is it written in and what does that code look like?

    - by Christopher Altman
    Being a web developer, I know how event-driven user interfaces are written, but I do not have insight into other families of code (embedded software like automotive software, automation software on assembly lines, drivers, the crawling lower-thirds on CNN, etc.). I was looking at the iTunes visualizer (example) and am curious: What is the visualizer written in? Objective-C? Does it use Core Animation? What type of abstraction does that library offer? What does the code look like? Is it a list of mathematical equations for producing the crazy graphics? Is it a list of key frames with tweening? Is there an array of images, fractals, wormholes, flowers, and sparkles, with some magic that mixes them together? Or something totally different? I am not looking for a tutorial, just an understanding of how something very different from web development works. Oh yeah, I know iTunes is closed source, so all of this is conjecture.

    Read the article

  • Millions of anonymous ASP.Net profiles!?

    - by Mantorok
    Hi all, some advice needed! Our website receives approximately 50,000 hits a day, and we use anonymous ASP.NET membership profiles/users. This is resulting in millions (4.5m currently) of "active" profiles, and the database is 'crawling'; we have a nightly task that cleans up all the inactive ones. There is no way that we have 4.5m unique visitors (our county's population is only half a million), so could this be caused by crawlers and spiders? Also, if we have to live with this huge number of profiles, is there any way of optimising the DB? Thanks, Kev

    Read the article

  • What is the right license for tutorial source code?

    - by devdude
    Putting source code from tutorials or books online requires the author to add some kind of disclaimer or license (otherwise people could use it to make lots of $$$, or break a power plant's IT control system and sue you as the author). But what is the right license or disclaimer statement? Can I use the BSD license, with its "... IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA,..."? We are talking about tutorials, released to teach and share knowledge. Or do I need to follow the (potentially different) licenses of the libraries I use? It might be insignificant now, but I feel we will face a license hunt in "public" source code (aka OSS) in the future, similar to companies/lawyers currently crawling the web for pictures with wrong copyright statements or infringing on their IP (and suing someone for using a picture in a personal blog, etc.).

    Read the article

  • Django: proper way to use the model, duplicates!

    - by llazzaro
    Hello, I have a question about the proper, best way to manage the model. I am a relative newbie to Django, so I think I need to read more docs and tutorials (suggestions for this would be cool!). Anyway, this is my question: I have a Python web crawler that is "connected" to a Django model. Crawling is done once a day, so it's really common to find "duplicates". To avoid duplicates I do this:

      cars = Car.objects.filter(name=crawledItem['name'])
      if len(cars) > 0:
          # object already exists, update it
          car = cars[0]
      else:
          car = Car()
      # some non-relevant code here
      car.save()

    I want to know if this is the proper/correct way to do it, or if there is an "automatic" way to do it. It is also possible to put the logic inside the Car() constructor; should I do that? Thanks a lot!
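
    For the "automatic" route, Django's get_or_create does this lookup-or-insert in one call; a minimal sketch, assuming name is what makes a Car unique:

      # returns the existing Car, or creates and saves a new one in a single call
      car, created = Car.objects.get_or_create(name=crawledItem['name'])
      if not created:
          # the car was already in the database, so refresh whatever fields the crawler re-scraped
          car.save()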

    Read the article

  • SwingWorker in Java (beginner question)

    - by Malachi
    I am relatively new to multi-threading and want to execute a background task using a SwingWorker thread. The method that is called does not actually return anything, but I would like to be notified when it has completed. The code I have so far doesn't appear to be working:

      private void crawl(ActionEvent evt) {
          try {
              SwingWorker<Void, Void> crawler = new SwingWorker<Void, Void>() {
                  @Override
                  protected Void doInBackground() throws Exception {
                      Discoverer discover = new Discoverer();
                      discover.crawl();
                      return null;
                  }

                  @Override
                  protected void done() {
                      JOptionPane.showMessageDialog(jfThis, "Finished Crawling", "Success",
                              JOptionPane.INFORMATION_MESSAGE);
                  }
              };
              crawler.execute();
          } catch (Exception ex) {
              JOptionPane.showMessageDialog(this, ex.getMessage(), "Exception",
                      JOptionPane.ERROR_MESSAGE);
          }
      }

    Any feedback/advice would be greatly appreciated, as multi-threading is a big area of programming that I am weak in.

    Read the article

  • Should a given URI in a RESTful architecture always return the same response?

    - by keithjgrant
    This is kind of a follow-up question to this one. So, is having a unique response for any given URI a core tenet of RESTful architecture? A lot of discussion here tends in that direction, but I haven't seen it stated anywhere as a "hard and fast" rule. I understand the value of it (for caching, crawling, passing links around, etc.), but I also see things like the Twitter API violate it (a request to http://api.twitter.com/1/statuses/friends_timeline.xml will vary based on the username given), and I understand there are times when it may be necessary, not to mention that a chronologically paged resource will also change as new elements are added. Should I strive to eliminate varied responses from the same URI altogether, or do I just accept that sometimes it isn't practical, and as long as I minimize its occurrence I'll be in decent shape?

    Read the article

  • Issues with SharePoint 2010 Development

    - by Rahul Soni
    I am planning to use my laptop for SharePoint 2010 development, and I have only 4 GB of RAM, which is not even upgradable. Just because of the RAM constraint, VS 2010 keeps crawling if I try to run it alongside SharePoint 2010 on the same machine. Hence, I've reformatted my machine and am looking for alternative solutions until I get a new laptop. Currently, I have installed ONLY VS 2010 on my laptop and wanted to create an empty SharePoint project. Once done with my project, I want to deploy it on a different machine (which is a 4 GB RAM machine as well, but contains only SharePoint 2010). I thought this would work and give me a bit of a breather if everything were configured well. Unfortunately, when I tried creating a new SharePoint Empty Project in VS 2010, it says: "A SharePoint server is not installed on this computer. A SharePoint server must be installed to work with SharePoint projects." Is there a way out?

    Read the article

  • Is it better to store serialized data or raw HTML in MySQL?

    - by Yegor
    I de-normalized my database, since the application was crawling otherwise, and I'm storing the list of categories for each item in the DB as a raw HTML version, simply echoing it out in my design. Each category is actually a link wrapped in an <a> tag. Naturally, this is a bit of a pain, especially if I want to change how the category links are displayed, since I would have to update all the old cached entries. What if I were to store this data as a serialized array instead, and simply unserialize it and then apply the formatting to it in PHP? Would there be a significant performance decrease over simply echoing out the raw HTML?
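
    As a rough sketch of the serialized-array version (the URL pattern and column handling are made up):

      <?php
      // when saving the item: store only the category names
      $categories = array('books', 'music');
      $serialized = serialize($categories);   // this string goes into the DB column

      // when rendering the item: rebuild the links in one place
      foreach (unserialize($serialized) as $name) {
          echo '<a href="/category/' . urlencode($name) . '">' . htmlspecialchars($name) . '</a> ';
      }

    Unserializing a small array and concatenating a few strings is normally negligible next to the query itself, but it is worth benchmarking against the cached-HTML version before committing to either.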

    Read the article

  • Python urllib3 and how to handle cookie support?

    - by bigredbob
    So I'm looking into urllib3 because it has connection pooling and is thread safe (so performance is better, especially for crawling), but the documentation is... minimal to say the least. urllib2 has build_opener, so something like:

      #!/usr/bin/python
      import cookielib, urllib2

      cj = cookielib.CookieJar()
      opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
      r = opener.open("http://example.com/")

    But urllib3 has no build_opener method, so the only way I have figured out so far is to manually put it in the header:

      #!/usr/bin/python
      import urllib3

      http_pool = urllib3.connection_from_url("http://example.com")
      myheaders = {'Cookie': 'some cookie data'}
      r = http_pool.get_url("http://example.org/", headers=myheaders)

    But I am hoping there is a better way and that one of you can tell me what it is. Also can someone tag this with "urllib3" please.

    Read the article

  • Does Wicket hamper SEO or search engines ability to crawl?

    - by Nick
    We're coming from GWT projects, and because of problems with SEO not liking GWT, we're going to steer clear of it for our next project (mainly because SEO is a high priority this time). In choosing a new framework, I'm looking at Wicket and liking what I've seen so far. I've only done a few tutorials, but in looking at the WAR layout (from these tutorials), it looks like most of the HTML pages are in the WEB-INF folder. Is this going to cause problems for SEO and for search engines crawling through the site's files? Ideally, I'd like to use Wicket with some Ajax and deploy to Google App Engine.

    Read the article

  • How to display recently installed programs and when they were installed?

    - by salvationishere
    I have a Windows XP laptop and I just installed about 12 new programs. Big mistake! Before I installed these programs, my internet connection was running great, but now, after installing them and restarting my laptop, the internet is crawling. How can I see what was changed? Hint: prior to installing these 12 programs, I installed IE 8, so removing that would probably fix it; the problem is that I need IE in order for my SQL/C# web application to work properly.

    Read the article

  • WebClient.DownloadString() Not Producing Exact HTML

    - by Ryan Fuentes
    So here's the deal: I'm creating a spider bot for a website that scans all the product pages and records the product data. I'm using C# and the WebClient class to download the HTML string. The site I'm crawling must be specially made, because the HTML received from WebClient.DownloadString() is different from the HTML I get when I view the source of the same page in a browser. This seems intentional, because the only info I can't get is the price. Does anyone know a workaround for this problem, or can anyone explain what is happening? Thanks.
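
    Two common causes for this kind of mismatch are server-side User-Agent sniffing and prices filled in by JavaScript, which WebClient never executes. Sending a browser-like User-Agent is a quick way to test the first theory; a sketch, with placeholder UA string and URL:

      using System;
      using System.Net;

      class PriceProbe
      {
          static void Main()
          {
              using (var client = new WebClient())
              {
                  // pretend to be a regular browser so user-agent sniffing serves the normal page
                  client.Headers[HttpRequestHeader.UserAgent] =
                      "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0";
                  string html = client.DownloadString("http://www.example.com/some-product");
                  Console.WriteLine(html.Length); // compare against what the browser's view-source shows
              }
          }
      }

    If the markup still differs, the price is most likely injected client-side, and the bot would need to call whatever endpoint the page's scripts call (or drive a real browser) instead.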

    Read the article

  • Rewrite this function as a DB query?

    - by aLk
    I'm cleaning up my code. Should I change the following function to a MySQL query? If so, what would be a nice MySQL query to achieve this functionality?

      public ArrayList getNewTitles(ArrayList candidateTitles, ArrayList existingTitles) {
          ArrayList newTitles = new ArrayList();
          Movie movie = new Movie();
          boolean isNew = true;
          for (int i = 0; i < candidateTitles.size(); i++) {
              for (int j = 0; j < existingTitles.size(); j++) {
                  movie = (Movie) existingTitles.get(j);
                  if (((String) candidateTitles.get(i)).equals(movie.getRawTitle())) {
                      isNew = false;
                  }
              }
              if (isNew == true) {
                  System.out.println("newTitle for crawling: " + (String) candidateTitles.get(i));
                  newTitles.add((String) candidateTitles.get(i));
              } else {
                  System.out.println("candidate binned: " + (String) candidateTitles.get(i));
              }
              isNew = true;
          }
          return newTitles;
      }
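
    If both title lists live in MySQL (the table and column names below are assumed: a movies table with a raw_title column and a candidate_titles staging table filled by the crawler), the same filtering can be pushed into a single query:

      -- titles not yet in movies, i.e. the ones worth crawling
      SELECT c.title
      FROM candidate_titles AS c
      LEFT JOIN movies AS m ON m.raw_title = c.title
      WHERE m.raw_title IS NULL;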

    Read the article

  • Create an SEO and web accessibility analyzer

    - by rebellion
    I'm thinking of making a little web tool for analyzing the search engine optimization and web accessibility of a whole website. First of all, this is just a private tool for now. Crawling a whole website takes up a lot of resources and time. I've found that wget is the best option for downloading the markup for a whole site. I plan on using PHP/MySQL (maybe even CodeIgniter), but I'm not quite sure if that's the right way to do it. There's always someone who recommends Python, Ruby, or Perl, but I only know PHP and a little bit of Rails. I've also found a great HTML DOM parser class in PHP on SourceForge. The thing is, I need some feedback on what I should and should not do: everything from how I should run the crawl process to what I should be checking for with regards to SEO and WCAG. So, what comes to mind when you hear this?
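
    As a small sketch of the per-page checks, PHP's built-in DOM extension (rather than the SourceForge parser class mentioned above) can already flag the basics; the file path and rules are only examples:

      <?php
      // crude SEO/accessibility checks on one page fetched by wget
      $doc = new DOMDocument();
      @$doc->loadHTMLFile('crawl/example-page.html'); // @ silences warnings from messy real-world HTML

      $issues = array();

      $titles = $doc->getElementsByTagName('title');
      if ($titles->length === 0 || trim($titles->item(0)->textContent) === '') {
          $issues[] = 'missing or empty <title> (SEO)';
      }

      foreach ($doc->getElementsByTagName('img') as $img) {
          if (!$img->hasAttribute('alt')) {
              $issues[] = 'img without alt attribute (WCAG)';
          }
      }

      print_r($issues);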

    Read the article

  • Ruby: execute code in the class being inherited from

    - by AdamB
    I'm trying to have a global exception capture where I can add extra information when an error happens. I have two classes, Crawler and Amazon. What I want to do is be able to call crawl, execute a function from the Amazon class, and use the exception handling in the crawl function. Here are the two classes I have:

      require 'mechanize'

      class Crawler
        Mechanize.html_parser = Nokogiri::HTML

        def initialize
          @agent = Mechanize.new
        end

        def crawl
          puts "crawling"
          begin
            # execute code in Amazon class here?
          rescue Exception => e
            puts "Exception: #{e.message}"
            puts "On url: #{@current_url}"
            puts e.backtrace
          end
        end

        def get(url)
          @current_url = url
          @agent.get(url)
        end
      end

      class Amazon < Crawler
        # some code with errors
        def stuff
          page = get("http://www.amazon.com")
          puts page.parser.xpath("//asldkfjasdlkj").first['href']
        end
      end

      a = Amazon.new
      a.crawl

    Is there a way I can call stuff inside of crawl, so I can use that exception handling over the entire stuff function? Is there a better way to accomplish this?
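
    One shape that matches the comment inside crawl is to let crawl accept a block, so whatever the caller passes in runs inside the rescue; a minimal sketch of just the changed method, reusing the Amazon class above (not the only way to do this):

      class Crawler
        def crawl
          puts "crawling"
          yield if block_given?          # the caller's work runs inside this method's rescue
        rescue Exception => e
          puts "Exception: #{e.message}"
          puts "On url: #{@current_url}"
          puts e.backtrace
        end
      end

      a = Amazon.new
      a.crawl { a.stuff }                # errors raised by stuff are caught and annotated by crawl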

    Read the article
