crawl - Developer IT

Redirect pages to fix crawl errors

- by sarah

Google is giving me a crawl error for pages that I have removed like www.mysite.com/mypage.html. I want to redirect this pages to the new page www.mysite.com/mysite/mypage. I tried to do that by using .htaccess but instead of fixing the problem, the crawl pages increased and a new crawl came www.mysite.com/www.mysite.com. This is my .htaccess file: <IfModule mod_rewrite.c> RewriteEngine On RewriteBase /sitename/ RewriteRule ^index\.php$ - [L] RewriteCond %{REQUEST_FILENAME} !-f RewriteCond %{REQUEST_FILENAME} !-d RewriteRule . /sitename/index.php [L] </IfModule> # END WordPress Should I add this after the rewrite rule or I should do something else? RewriteRule ^pagename\.html$ http://www.sitename.com/pagename [R=301]

Read the article

GWT: Generate more complete crawl error report

- by Mike

I'm a developer in charge of managing Webmasters and related issues (including correcting crawl errors) for dozens (hundreds, maybe?) of active sites and as part of my duties I create a report of every discrepancy, including all pages generating a 404 and all pages that link to those pages. Currently within Webmaster Tools I'm able to download a csv file of all pages with a 404 response, but I'm then having to manually click on every single one of those links and copy the "linked from" field to paste into my spreadsheet. This is extremely tedious and seems unnecessary; I would expect the ability to download all that data at once. I'm ultimately looking for the end result of one csv file that has every url with a 404, but also has every url that links to each one of them. Am I overlooking this functionality somewhere or does anyone have a good solution? Edit 1 (2/11/2013): Example of what the csv output looks like now: URL,Response Code,News Error,Detected,Category http://www.abcdef.com/123.php,404,,11/12/13,Not found http://www.abcdef.com/456.php,404,,11/12/13,Not found Which is great, but let's say 123.php has 5 pages that link to it. Now I have to duplicate that row in my spreadsheet 4 more times, then go into Webmasters, get all the url's that link to the page, and add that data to my spreadsheet. The output I would prefer: URL,Response Code,Linked From,News Error,Detected,Category http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage1.php,,11/12/13,Not found http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage2.php,,11/12/13,Not found http://www.abcdef.com/123.php,404,http://www.ghijkl.com/naughtypage3.php,,11/12/13,Not found http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage1.php,,11/12/13,Not found http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage2.php,,11/12/13,Not found http://www.abcdef.com/456.php,404,http://www.ghijkl.com/naughtypage3.php,,11/12/13,Not found Note the (hypothetical) addition of a "Linked From" column, as well as the fact there are only 2 unique URL's now (like before) but all of the "Linked To" pages are shown in one report. Edit 2 (2/12/2013): To clarify, my question is less about detecting and correcting 404's, but more about generating a report of what Google has listed as errors. Oftentimes, these errors aren't even valid anymore but I still need documentation to show that Google detected a problem and that problem is now fixed. Many of the "linked from" url's I find are actually outdated, cached resources. For example, I'll frequently see that the linked-from url is the sitemap, which is actually an old sitemap cached by Google that points to an old page. Neither the sitemap or old page exist, but they still appear in my crawl error reports because they are cached resources.

Read the article

crawl websites out of java web application without using bin/nutch

- by Marcel

hi :) i am trying to using nutch (1.1) without bin/nutch from my (java) mojarra 2.0.2 webapp... i am searching at google for examples, but there are no examples how i can realize this :/ ... i get an exception and the job fails :/ (i think of cause something with hadoop)... here is my code: public void run() throws Exception { final String[] args = new String[] { String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_URLS), "-dir", String.format("%s%s%s%s", JSFUtils.getWebAppRoot(), "nutch", File.separator, DIRECTORY_CRAWL), "-threads", this.preferences.get("threads"), "-depth", this.preferences.get("depth"), "-topN", this.preferences.get("topN"), "-solr", this.preferences.get("solr") }; Crawl.main(args); } and a part of the logging: 10/05/17 10:42:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 10/05/17 10:42:54 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1 10/05/17 10:42:54 INFO mapred.JobClient: Running job: job_local_0001 10/05/17 10:42:54 INFO mapred.FileInputFormat: Total input paths to process : 1 10/05/17 10:42:55 INFO mapred.MapTask: numReduceTasks: 1 10/05/17 10:42:55 INFO mapred.MapTask: io.sort.mb = 100 java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232) at org.apache.nutch.crawl.Injector.inject(Injector.java:211) at org.apache.nutch.crawl.Crawl.main(Crawl.java:124) at lan.localhost.process.NutchCrawling.run(NutchCrawling.java:108) at lan.localhost.main.Index.indexing(Index.java:71) at lan.localhost.bean.FeedingBean.actionStart(FeedingBean.java:25) .... can someone help me or tell me how i can crawling from a java application? i have increased the Xms to 256m and Xmx to 768m, but nothing changed... best regards marcel

Read the article

"Error in the Site Data Web Service." when performing crawl

- by Janis Veinbergs

Installed SharePoint Services v3 (SP2, october 2009 cumulative updates, Language Pack), attached to a content database I had previously (all works). Installed Search server 2008 Express (with language pack) on top of WSS and crawl does not work. However it works for newly created web application + database. Was playing around with accounts, permissions to try get it working. Currently I have WSS_Crawler account with such permissions: Office Search Server runs with WSS_Crawler account Config database has read permissions for WSS_Crawler Content database has read permissions for WSS_Crawler WSS_Crawler is owner of search database. Added WSS_Crawler to SQL server browser user group and administrator Yes, i'v given more permissions than needed, but it doesn't even work with that and i don't know if its permission problem or what. Crawl log says there is Error in the Site Data Web Service., nothing more. There were known issues with a similar error: Error in the Site Data Web Service. (Value does not fall within the expected range.), but this is not the case as thats an old issue and i hope it has been included in SP2... Logs are from olders to newest (descending order). They don't appear to be very helpful. Crawl log http://serveris Crawled Local Office SharePoint Server sites 3/15/2010 9:39 AM sts3://serveris Crawled Local Office SharePoint Server sites 3/15/2010 9:39 AM sts3://serveris/contentdbid={55180cfa-9d2d-46e4... Crawled Local Office SharePoint Server sites 3/15/2010 9:39 AM http://serveris/test Error in the Site Data Web Service. Local Office SharePoint Server sites 3/15/2010 9:39 AM http://serveris Error in the Site Data Web Service. Local Office SharePoint Server sites 3/15/2010 9:39 AM EventLog No errors in EventLog, just some Information events that Office Server Search provides The search service started. Successfully stored the application configuration registry snapshot in the database. Context: Application 'SharedServices Component: da1288b2-4109-4219-8c0c-3a22802eb842 Catalog: Portal_Content. A master merge was started due to an external request. Component: da1288b2-4109-4219-8c0c-3a22802eb842 A master merge has completed for catalog Portal_Content. Component: da1288b2-4109-4219-8c0c-3a22802eb842 Catalog: AnchorProject. A master merge was started due to an external request. Component: da1288b2-4109-4219-8c0c-3a22802eb842 A master merge has completed for catalog AnchorProject. ULS Log Just some information, but no exceptions, unexpected errors 03/15/2010 09:03:28.28 mssearch.exe (0x1B2C) 0x0E8C Search Server Common GatherStatus 0 Monitorable Insert crawl 771 to inprogress queue hr 0x00000000 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6591 03/15/2010 09:03:28.28 mssearch.exe (0x1B2C) 0x0E8C Search Server Common GatherStatus 0 Monitorable Request Start Crawl 1, project Portal_Content, crawl 771 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2875 03/15/2010 09:03:28.28 mssearch.exe (0x1B2C) 0x0E8C Search Server Common GatherStatus 0 Monitorable Advise status change 1, project Portal_Content, crawl 771 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:28.28 w3wp.exe (0x1D98) 0x0958 Search Server Common MS Search Administration 8wn6 Information A full crawl was started on 'Local Office SharePoint Server sites' by BALTICOVO\janis.veinbergs. 03/15/2010 09:03:28.43 mssdmn.exe (0x1750) 0x10F8 ULS Logging Unified Logging Service 8wsv High ULS Init Completed (mssdmn.exe, Microsoft.Office.Server.Native.dll) 03/15/2010 09:03:30.48 mssdmn.exe (0x1750) 0x09C0 Search Server Common MS Search Indexing 8z0v Medium Create CCache 03/15/2010 09:03:30.56 mssdmn.exe (0x1750) 0x09C0 Search Server Common MS Search Indexing 8z0z Medium Create CUserCatalogCache 03/15/2010 09:03:32.06 w3wp.exe (0x1D98) 0x0958 Search Server Common MS Search Administration 90ge Medium SQL: dbo.proc_MSS_PropagationGetQueryServers 03/15/2010 09:03:32.09 w3wp.exe (0x1D98) 0x0958 Search Server Common MS Search Administration 7phq High GetProtocolConfigHelper failed in GetNotesInterface(). 03/15/2010 09:03:34.26 mssearch.exe (0x1B2C) 0x16A4 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project Portal_Content, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:35.92 mssearch.exe (0x1B2C) 0x16A4 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project Portal_Content, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:37.32 mssearch.exe (0x1B2C) 0x16A4 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project Portal_Content, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:37.23 mssdmn.exe (0x1750) 0x1850 Search Server Common MS Search Indexing 8z14 Medium Test TRACE (NULL):(null), (NULL)(null), (CrLf): , end 03/15/2010 09:03:39.04 mssearch.exe (0x1B2C) 0x16A4 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project Portal_Content, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:40.98 mssdmn.exe (0x1750) 0x0B24 Search Server Common MS Search Indexing 7how Monitorable GetWebDefaultPage fail. error 2147755542, strWebUrl http://serveris 03/15/2010 09:03:41.87 mssdmn.exe (0x1750) 0x1260 Search Server Common PHSts 0 Monitorable CSTS3Accessor::GetSubWebListItemAccessURL GetAccessURL failed: Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3acc.cxx Line:505 03/15/2010 09:03:41.87 mssdmn.exe (0x1750) 0x1260 Search Server Common PHSts 0 Monitorable CSTS3Accessor::Init: GetSubWebListItemAccessURL failed. Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3acc.cxx Line:348 03/15/2010 09:03:41.87 mssdmn.exe (0x1750) 0x1260 Search Server Common PHSts 0 Monitorable CSTS3Accessor::Init fails, Url sts3://serveris/siteurl=test/siteid={390611b2-55f3-4a99-8600-778727177a28}/weburl=/webid={fb0e4bff-65d5-4ded-98d5-fd099456962b}, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3handler.cxx Line:243 03/15/2010 09:03:41.87 mssdmn.exe (0x1750) 0x1260 Search Server Common PHSts 0 Monitorable CSTS3Handler::CreateAccessorExB: Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3handler.cxx Line:261 03/15/2010 09:03:40.98 mssdmn.exe (0x1750) 0x1260 Search Server Common MS Search Indexing 7how Monitorable GetWebDefaultPage fail. error 2147755542, strWebUrl http://serveris/test 03/15/2010 09:03:41.90 mssdmn.exe (0x1750) 0x0B24 Search Server Common PHSts 0 Monitorable CSTS3Accessor::GetSubWebListItemAccessURL GetAccessURL failed: Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3acc.cxx Line:505 03/15/2010 09:03:41.90 mssdmn.exe (0x1750) 0x0B24 Search Server Common PHSts 0 Monitorable CSTS3Accessor::Init: GetSubWebListItemAccessURL failed. Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3acc.cxx Line:348 03/15/2010 09:03:41.90 mssdmn.exe (0x1750) 0x0B24 Search Server Common PHSts 0 Monitorable CSTS3Accessor::Init fails, Url sts3://serveris/siteurl=/siteid={505443fa-ef12-4f1e-a04b-d5450c939b78}/weburl=/webid={c5a4f8aa-9561-4527-9e1a-b3c23200f11c}, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3handler.cxx Line:243 03/15/2010 09:03:41.90 mssdmn.exe (0x1750) 0x0B24 Search Server Common PHSts 0 Monitorable CSTS3Handler::CreateAccessorExB: Return error to caller, hr=80042616 - File:d:\office\source\search\search\gather\protocols\sts3\sts3handler.cxx Line:261 03/15/2010 09:03:43.26 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 24, project Portal_Content, crawl 771 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:43.26 mssearch.exe (0x1B2C) 0x1804 Search Server Common GatherStatus 0 Monitorable Remove crawl 771 from inprogress queue - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6722 03/15/2010 09:03:43.26 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project Portal_Content, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:44.65 mssearch.exe (0x1B2C) 0x1804 Search Server Common GatherStatus 0 Monitorable Insert crawl 772 to inprogress queue hr 0x00000000 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6591 03/15/2010 09:03:44.65 mssearch.exe (0x1B2C) 0x1804 Search Server Common GatherStatus 0 Monitorable Request Start Crawl 0, project AnchorProject, crawl 772 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2875 03/15/2010 09:03:44.65 mssearch.exe (0x1B2C) 0x1804 Search Server Common GatherStatus 0 Monitorable Advise status change 0, project AnchorProject, crawl 772 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:44.65 mssearch.exe (0x1B2C) 0x1804 Search Server Common GatherStatus 0 Monitorable Unlock Queue, project Portal_Content - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2922 03/15/2010 09:03:44.82 mssearch.exe (0x1B2C) 0x1DD0 Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:44.95 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project AnchorProject, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:46.51 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project AnchorProject, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:46.39 mssearch.exe (0x1B2C) 0x1E4C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.01 mssearch.exe (0x1B2C) 0x1C6C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.87 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.29 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.53 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.67 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.82 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.84 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.89 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:49.90 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project AnchorProject, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:51.42 mssearch.exe (0x1B2C) 0x1E4C Search Server Common GatherStatus 0 Monitorable Advise status change 4, project AnchorProject, crawl 772 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:51.00 mssearch.exe (0x1B2C) 0x1E4C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:51.42 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Remove crawl 772 from inprogress queue - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6722 03/15/2010 09:03:52.96 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Insert crawl 773 to inprogress queue hr 0x00000000 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6591 03/15/2010 09:03:52.96 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Request Start Crawl 0, project AnchorProject, crawl 773 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2875 03/15/2010 09:03:55.29 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Unlock Queue, project AnchorProject - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2922 03/15/2010 09:03:55.29 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Removed start crawl request from Queue 0, crawl 773 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2942 03/15/2010 09:03:55.29 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Request Start Crawl 0, project AnchorProject, crawl 773 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2875 03/15/2010 09:03:55.29 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GatherStatus 0 Monitorable Advise status change 0, project AnchorProject, crawl 773 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:55.37 mssearch.exe (0x1B2C) 0x1CCC Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:55.37 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project AnchorProject, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:56.71 mssearch.exe (0x1B2C) 0x1E4C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:56.78 mssearch.exe (0x1B2C) 0x0750 Search Server Common GatherStatus 0 Monitorable Advise status change 12, project AnchorProject, crawl -1 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:58.40 mssearch.exe (0x1B2C) 0x155C Search Server Common GathererSql 0 Monitorable CGatherer::LoadTransactionsFromCrawlInternal Flush anchor, count 0 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4943 03/15/2010 09:03:58.89 mssearch.exe (0x1B2C) 0x155C Search Server Common GatherStatus 0 Monitorable Advise status change 4, project AnchorProject, crawl 773 - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:4853 03/15/2010 09:03:58.89 mssearch.exe (0x1B2C) 0x1130 Search Server Common GatherStatus 0 Monitorable Remove crawl 773 from inprogress queue - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:6722 03/15/2010 09:03:58.89 mssearch.exe (0x1B2C) 0x1130 Search Server Common GatherStatus 0 Monitorable Unlock Queue, project AnchorProject - File:d:\office\source\search\search\gather\server\gatherobj.cxx Line:2922 What could be wrong here - any clues?

Read the article

Fake links cause crawl error in Google Webmaster Tools

- by Itai

Google reported Crawl Errors last week on my largest site though Webmaster Tools. Here is the message: Google detected a significant increase in the number of URLs that return a 404 (Page Not Found) error. Investigating these errors and fixing them where appropriate ensures that Google can successfully crawl your site's pages. The Crawl Errors list is now full of hundreds of fake links like these causing 16,519 errors so far: Note that my site does not even have a search.html and is not related to any of the terms shown in the above image. Inspecting sources for one of those links, I can see this is not simply an isolated source but a concerted effort: Each of the links has a few to a dozen sources all from different, seemingly unrelated sites. It is completely baffling as to why would someone to spending effort doing this. What are they hoping to achieve? Is this an attack? Most importantly: Does this have a negative effect on my side? Could it negatively impact my ranking? If so, what to do about it? The few linking pages I looked at are full of thousands of links to tons of sites and have no contact information and do not seem like the kind of people who would simply stop if asked nicely! According to Google Webmaster Tools, these errors have appeared in a span of 11 days. No crawl errors were being reported previously.

Read the article

Understanding the maximum hit-rate supported by a web-server

- by SNag

I would like to crawl a publicly available site (and one that's legal to crawl) for a personal project. From a brief trial of the crawler, I gathered that my program hits the server with a new HTTPRequest 8 times in a second. At this rate, as per my estimate, to obtain the full set of data I need about 60 full days of crawling. While the site is legal to crawl, I understand it can still be unethical to crawl at a rate that causes inconvenience to the regular traffic on the site. What I'd like to understand here is -- how high is 8 hits per second to the server I'm crawling? Could I possibly do 4 times that (by running 4 instances of my crawler in parallel) to bring the total effort down to just 15 days instead of 60? How do you find the maximum hit-rate a web-server supports? What would be the theoretical (and ethical) upper-limit for the crawl-rate so as to not adversely affect the server's routine traffic?

Read the article

Page load speeds effect on crawl rate

- by Sam Pegler

We've noticed a big drop in the total pages crawled per day on our site, we have no control over the crawl rate in google webmaster tools so it's possible this has been changed by google. However it's a fairly large site and I wouldn't of thought that the crawl rate would've been decreased. What we have noticed though is a sizeable increase in page load times, in my mind this would be the cause. Can anyone else confirm if the crawl rate is directly correlated to page load time? Seems logical, longer page load time, less pages crawled. Any decent documentation on this would be appreciated, I don't normally have any input on SEO so this is new to me.

Read the article

How to crawl a webPage with dynamic content added by javascript

- by blunderboy

I guess there is a news that Google bots have the capability to understand our javascript code. It means this is possible to fully crawl a webpage which has lazy loading feature enabled. I am using Apache Nutch to crawl websites but I don't think it has the capability to fetch the URLs being injected in HTML page by javascript when the page is scrolled down. I see a lot of websites doing lazy loading for performance issue. So Can somebody please explain me how can i crawl the data which comes in HTML page on lazy load. (On scrolling the page down).

Read the article

404 crawl error for mailto:[email protected] in google webmaster

- by sam

im getting a 404 crawl error for mailto:[email protected], in google webmaster tools under the health crawl errors surly google should see that mailto: is related to an email not a webpage.. the html im using for the mailto on my page is <a href="mailto:mailto:[email protected]">[email protected]</a> Whats the best way to resolve this ? Is mailto still widely used or is there a newer alternative ?

Read the article

Factors of Crawl Allocation

Google, Yahoo!, and Bing all have heavy duty ways to crawl each and every website on the internet. Here are some factors that can influence your crawl allocation:

Read the article

Google Webmasters tools crawl error

- by Shiro

I am looking in to Google Webmaster Tools - Crawl Error section. How should I handle for those URL due their system / application showed invalid URL. e.g http://www.example/images/products/s_=enlarge_16gb.jpg but, I dunno what happen to yahoo groups, it break the link into http://www.example/images/products/s_= enlarge_16gb.jpg and I only make the top part become hyperlink, which is http://www.example/images/products/s_= Because of the URL, Google show crawl error, I got few error because of this kind of result or because other people typo error. How do I prevent this. I am sure I don't have the right go and change other people post. What is the solution for this. Thanks!

Read the article

Google Webmasters tools crawl error caused by URL split into two lines

- by Shiro

I am looking in to Google Webmaster Tools - Crawl Error section. How should I handle for those URL due their system / application showed invalid URL. e.g http://www.example/images/products/s_=enlarge_16gb.jpg but, I dunno what happen to yahoo groups, it break the link into http://www.example/images/products/s_= enlarge_16gb.jpg and I only make the top part become hyperlink, which is http://www.example/images/products/s_= Because of the URL, Google show crawl error, I got few error because of this kind of result or because other people typo error. How do I prevent this. I am sure I don't have the right go and change other people post. What is the solution for this. Thanks!

Read the article

SQL Saturday 300 BBQ Crawl

- by Bill Graziano

SQL Saturday #300 is coming up right here in Kansas City on September 13th, 2014. This is our fifth SQL Saturday which means it's the fifth anniversary of our now infamous BBQ Crawl. We get together on Friday afternoon before the event and visit a few local joints. We've done nice places and we've done dives. We haven’t picked the venues yet but I promise you’ll be well fed! And if you’re thinking about the BBQ crawl you should think about submitting a session. Our call for speakers closes Tuesday, July 15th so you just have time! If you’re going to be at the event, contact me and I’ll get you added to the list.

Read the article

Nutch - how to crawl by small patches?

- by Yurish

Hi everyone! I am stuck! Can`t get Nutch to crawl for me by small patches. I start it by bin/nutch crawl command with parameters -depth 7 and -topN 10000. And it never ends. Ends only when my HDD is empty. What i need to do: Start to crawl my seeds with possibility to go further on outlinks. Crawl 20000 pages, then index them. Crawl another 20000 pages, index them and merge with first index. Loop step 3 n times. Tried also with scripts found in wiki, but all scripts i found don't go further. If i run them again, they do everything from beginning. And in the end of script i have the same index i had, when started to crawl. But, i need to continue my crawl. Some help would be very usefull!

Read the article

Meaning of Crawl errors

- by com

My question is about definition of Crawl errors in Google Webmaster Tools. Crawl errors is devided into few sections. Let's first consider HTTP section. I assume that all broken links in this section was somehow found by crawler, this is not the links from sitemap. If all this links was found by scanning pages from sitemap for links, why it doesn't mention what was the source page, like in sitemap section with column Linked From. Please correct me if I am wrong. Sitemap section. Looks like all those links came from my sitemap. But there is Linked From column, I already know, that all those broken links is from sitemap, so in order to fix the error, I should revise my sitemap. Am I wrong? Not followed section. I don't know what does it mean. Looks like it accumulates all links that caused redirect, but for some reason Google considers all those redirect as wrong redirect. Do you know if there are any set of rules how to determine wrong redirect. Actually I found were was my mistake, I tried to normalize URL and redirect it to the right URL, but I did normalization in a wrong way. Not found section. This section like HTTP section but with 404 errors. This section has Linked From column. But very often Linked From has unavailable. What does it mean, Google can not say me how it found this non existing page. How this section related to sitemap section. Does this section contains all 404 links from sitemap too. But there is too many 404 links, much more than in sitemap. I tried to take a look what we have in Linked From, and I saw that this link came from sitemap two month ago. But why Google keeps it indexed, the link is already dead, new sitemap doesn't have it. If there is any expire date for old links? Unreachable section. Looks like this section for 500 errors. This section doesn't contain Linked From column. There are too many completely meaningless links, I really don't know where this stuff came from, and without Linked From I am not able to figure out how to deal with it. Sorry for such a big topic, but I just want to make it clear, what every section stands for, because it's extremely crucial in order to deal with all those problems. Hopefully it will be useful not just for me. Thanks!

Read the article

Top SEO Tips - Get Google to Crawl Your Website in Less Than 24 Hours

No matter if you are a website owner or internet marketing professional it is important to work with Google and from time to time you will need to get Google to crawl yours or you clients website ASAP. This could be for many reasons, one being that your client has changed their domain name or you are launching a new or improved website.

Read the article

Kansas City SQL Saturday 2012: BBQ Crawl

- by Bill Graziano

The next Kansas City SQL Saturday is coming up on August 4th. We’ll have the usual SQL Saturday goodness: lots of technical sessions, great networking events and a fantastic speaker dinner. And we’ll have the Third Annual Kansas City SQL Saturday BBQ Crawl. On Friday afternoon we’ll visit a few BBQ places in town. We tend to order big sampler plates and just share everything around. It’s a great way to try a variety of styles. This year we’ll be hitting an all new selection of BBQ joints. You don’t need to be a speaker to attend. However the call for speakers is open until June 28th (hint, hint). Locals and out-of-towners are all welcome. If you’re interested in attending send me an email and I’ll get you added to the list. We finish in plenty of time to get you to the speaker dinner – as if you could eat any more.

Read the article

Why do people crawl sites without downloading pictures?

- by Michael

Let me show you what I mean: IP Pages Hits Bandwidth 85.xx.xx.xxx 236 236 735.00 KB 195.xx.xxx.xx 164 164 533.74 KB 95.xxx.xxx.xxx 90 90 293.47 KB It's very clear that these person are crawling my site with bots. There's no way that you could visit my site and use <1MB bandwidth. You might say that there's the possibility that they could be browsing the site using some browser or plug-in that does not download images, js/css files, etc., but the simple fact of the matter is that there are not 90-236 pages that are linked from the home page (outside of WP files), even if you visited every page twice. I could understand if these people were crawling the site for pictures, but once again, the bandwidth indicates that this isn't what is happening. Why, then, would they crawl the site to simply view the HTML/txt/js/etc. files? The only thing that I can come up with is that they are scanning for outdated versions of WordPress, SQL injection vulnerabilities, etc., which makes me inclined to outright ban the IPs, but I'm curious, is it possible that this person is a legitimate user, or at the very least, not intending to be harmful?

Read the article

How do I rate limit google's crawl of my class C IP block?

- by Zak

I have several sites in a class C network that all get crawled by google on a pretty regular basis. Normally this is fine. However, when google starts crawling all the sites at the same time, the small set of servers that back this IP block can take a pretty big hit on load. With google webmaster tools, you can rate limit the googlebot on a given domain, but I haven't found a way to limit the bot across an IP network yet. Anyone have experience with this? How did you fix it?

Read the article

MOSS search crawl fails with "Access is denied ..."

- by strongopinions

Recently the search crawler stopped working on my MOSS installation. The message in the crawl log is Access is denied. Check that the Default Content Access Account has access to this content, or add a crawl rule to crawl this content. (The item was deleted because it was either not found or the crawler was denied access to it.) The default content account is an admin on the site collection that I am trying to crawl. Almost every result for this error on Google tells me to add the DisableLoobackCheck registry key with a value of 1. I have done this and rebooted and the error continues. The "Do not allow Basic Authentication" checkbox in my crawl rule screen is unchecked. Is there anything else that could be causing this error? Something with file system or database permissions maybe?

Read the article

Search server 2008 express working with WSS 3.0 (error when crawl second web application (website) s

- by tberube

My search server 2008 express crawls 3 sharepoint servers and 1 windows file server. The file server and 2 other sharepoint server crawl all content and master and sub sites just fine. On the 3rd sharepoint server the default site crawl just fine...The second web application site (content database) crawls the top site and crawls all sub sites BUT The subsites get the below error in the crawl log. Deleted by the gatherer (The start address or content source that contained this item was deleted and hence this item was deleted.) I have checked the security rights (OK) I have checked the setting to crawl a SharePoint site (if wrong the top site would not work)... ???? Last part = I can crawl 3 other sharepoint sites (web applications/other content databases) on the same server...It seems to be just this one site...

Read the article

Is there a way to crawl all facebook fan pages?

- by user220755

Is there a way to crawl all facebook fan pages and collect some information? like for example crawling facebook fan pages and save their names, or how many fans, etc? Or at least, do you have a hint of how this could be possibly done?

Read the article

How to resolve "Google can't find your site's robots.txt" error?

- by Manivasagam

I've recently found that "Google can't find your site's robots.txt" in crawl errors. When I tried Fetching as Google, I got result "SUCCESS", then I tried looking at crawl errors and it still shows "Google can't find your site's robots.txt". What can I do to resolve this issue? Before this issue arose, my site was indexed within a few mintues, but now I find that it took time to be indexed in Google's search. When I access http://mydomain.com/robots.txt, it shows the data below: User-agent: *Disallow: /wp-admin/ Disallow: /wp-includes/ I found Blocked URLs = 0, also no any other errors. Is there any other thing I need to change? Or what could be the solution for this? Any help would be appreciated.

Read the article

Yandex frequently replaces page names with ampersands

- by Guy

The Yandex spider is a frequent visitor to one of the sites I manage. On ocassion it replaces the page name with two ampersands and a space. So if the page is: /mypage.aspx?param=value then it will try and crawl it as: /&& ?param=value Any idea why it is doing this? [EDIT] If I remember correctly the IP that this "mistake" is coming from is based in California and not Russia. I believe that they crawl US sites from a US based IP address. Not sure if that helps. More Info about request: IP: 199.21.99.82 City: Palo Alto State: California Country: United States ISP: Yandex Inc. User-Agent: Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

Read the article

Google ranking, page crawl

- by Nawaf Mubarak

please don't mind me for asking this newbie question about Google ranking. I know that in order to get ranked the page has to be crawled by Google bots, I have had a page example of which I will get a better understanding of how the system works with Google. I have made a page on my website last month, it got indexed pretty quickly, then I found that it's in Google's page 15 on my keyword as a start, next day it made it to page 13, then after a week it was jumping back and forth in page 17/18 up to 20. Now a month passed by, when and it isn't listed in any position of that 'keyword' sometimes I will find it in page 30, but later I won't find it anywhere, keep happening this way these days. Even if it isn't listed in any page for my keyword if I do a search for "site:thepageadress" it will be listed which means I'm not penalized and my page is there for google to see, but it isn't in the search result for my keyword. But when I write "site:thepage_adress" and I hit "search tools" option and click on "Past day" or "past week" it isn't listed, it is only listed when I click on "Past month" which I think means that Google indexed the page, looked at it once when I published it, and never looked at it again, is this a fair statement? So two questions that comes to mind here. 1- Should Google keep looking at a page even if I haven't changed any info for it? and is this an indication for me that my page is doing fine? or is it normal that Google see's it once and thats it? 2- Why and how to fix the fact that my page keeps jumping back and forth in the ranking result for keyword, and sometimes it isn't even listed, what does that mean? Sorry for the long msg, I hope to god that somebody help me with this. Thank you!

Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 1/18 | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

- by sarah

- by Mike

- by Marcel

- by Janis Veinbergs

- by Itai

- by SNag

- by Sam Pegler

- by blunderboy

- by sam

- by Shiro

- by Shiro

- by Bill Graziano

- by Yurish

- by com

- by Bill Graziano

- by Michael

- by Zak

- by strongopinions

- by tberube

- by user220755

- by Manivasagam

- by Guy

- by Nawaf Mubarak

1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >