Search Results

Search found 261 results on 11 pages for 'prion crawler'.


How can I index content within a Content Editor web part?

- by Hirvox

I'm using MOSS 2007 v12.0.0.6529, and the Shared Services crawler is ignoring content inside Content Editor Web Parts. The page itself is a Publishing page, and content within the Page Content field is indexed properly and shows up in search results. How can I ensure that content within Content Editor Web Parts is also indexed? Or do I have to use other methods, such as additional content fields in the page?

Read the article
Why is my site not on Google? [closed]

- by RD

I wanted to post a link here, but some people might see that as advertising. So instead I'm going to phrase my question like this: what can I do to make sure my site appears on Google? I have already done the following: submitted my sitemap; added my site at www.google.com/addurl; added Analytics to my site; checked Webmaster Tools for crawler errors. But still, after about three or four days, the crawler hasn't crawled my site. What am I missing?

Read the article
What can I use as a 3D tile map editor?

- by alfa64

I need to make grid-based levels with 3D models for a dungeon crawler (a recent example is Legend of Grimrock), but I need to have several layers and to place entities with properties such as position, angle, etc. I was considering Tiled, using layers as the height for each level, but it's very hard to work with and visualize. What can I use for this purpose? The output format needs to be JSON, XML, or something else I can use in my engine. Ideally I'd want something like Tiled with a 3D visualization/edit mode and support for loading models, or at least some visual representation of them.

Read the article
Understanding Ajax crawling of search site

- by vacuum

I have a couple of questions about Ajax crawling of a site that is itself a kind of search engine. The base article explains the mechanism for making an AJAX application crawlable. Everything about HTML snapshots is clear and easy to implement, but I can't understand where Googlebot gets the "pretty AJAX URL" (i.e. www.example.com/ajax.html#key=value) referred to in "the crawler finds a pretty AJAX URL". The first thing that came to mind is breadcrumbs: in the sitemap we can specify pages that have breadcrumbs on them, so the bot will go to these pages and get HTML snapshots from there. But I'm sure there are other ways to give the bot this "pretty AJAX URL". In our case, we have a simple search site where the user enters a keyword and presses "Find"; the JS executes an Ajax request, receives a JSON response, and fills the page with results (without any refresh, of course). In this case, how do we make Googlebot crawl all the results in addition to the sitemap? Is there an example of the solution described in the article above?

Read the article
sqlite3 timestamp (current_timestamp) one hour off

- by Eiriks

I run a small crawler on a virtual Ubuntu server, started hourly by crontab. The datetime is inserted by defaulting the date field to TIMESTAMP DEFAULT CURRENT_TIMESTAMP. Table creation looks like this:

    CREATE TABLE links (
        page TEXT,
        link TEXT,
        date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY(page,link)
    );

The datetime gets stored fine, but it is one hour off (one hour behind) Norwegian time (GMT+1). The server is located wherever; I just need it to be on GMT+1. By typing date in the SSH session I get:

    Wed Dec 19 17:26:02 CET 2012

and that is correct (just now). So where does sqlite3 get its time from? What must I do to set the time so that sqlite3 gets the time right?
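SQLite documents CURRENT_TIMESTAMP as UTC, which would account for a one-hour gap relative to CET. The following is a minimal sketch, not a definitive fix: it recreates the table from the question in an in-memory database and reads the stored value back both as stored and converted with SQLite's 'localtime' modifier. The helper name stored_and_local and the sample row values are assumptions made for illustration, and the local result depends on the time zone of the process running it.

```python
import sqlite3
from datetime import datetime, timezone


def stored_and_local(conn):
    """Insert one row and return (stored text, same instant in local time).

    Hypothetical helper; the table definition is the one from the question.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        "page TEXT, link TEXT, "
        "date TIMESTAMP DEFAULT CURRENT_TIMESTAMP, "
        "PRIMARY KEY(page, link))"
    )
    # The date column is omitted, so SQLite fills in CURRENT_TIMESTAMP (UTC).
    conn.execute(
        "INSERT OR REPLACE INTO links (page, link) VALUES (?, ?)",
        ("page-a", "link-a"),  # sample values, not from the question
    )
    return conn.execute(
        "SELECT date, datetime(date, 'localtime') FROM links "
        "WHERE page = ? AND link = ?",
        ("page-a", "link-a"),
    ).fetchone()


conn = sqlite3.connect(":memory:")
stored, local = stored_and_local(conn)
print("stored (UTC):  ", stored)
print("local rendering:", local)
```

Converting at read time keeps the stored values unambiguous; an alternative would be to store local time by default, at the cost of ambiguity around daylight-saving changes.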

Read the article
Is content in option tags indexed?

- by Silfverstrom

Is data inside an <option> tag indexed? For example, would the following option tag allow "Volvo", "Saab", "Opel" and "Audi" to be indexed by a crawler?

    <select>
      <option value="volvo">Volvo</option>
      <option value="saab">Saab</option>
      <option value="opel">Opel</option>
      <option value="audi">Audi</option>
    </select>

Will search engines put any weight on data in an option form element?

Read the article
Do search engines directly penalize bad grammar?

- by Nicolas Raoul

Let's say I have a web page with user-contributed content, which is good content but with bad grammar, slang terms, and inappropriate tone. I know that bad grammar is also a problem because it drives away visitors and scares people from linking to it, but let's put that aside. Let's also put aside the fact that incorrectly spelt terms might be ignored by a crawler, potentially leading to fewer text-comparison hits. QUESTION: Do search engines like Google directly recognize and penalize bad grammar? For instance, because they might consider bad grammar a sign of low-quality content.

Read the article
Webmaster Tools word count

- by Henrik Erlandsson

Is there a way to somehow verify that Googlebot finds the headings and the content, for example by word count? I'm asking because I tried a program called Screaming Frog, which fails to even fetch the first h1 on a validated page for about 1/3 of all the pages(!), and that made me unsure. Even though the site looks hunky-dory in Webmaster Tools, I'd like to know what a Googlebot-like content crawler finds on my page and in what order. Any tips on such tools are appreciated. This is not about keyword count.

Read the article
Directing crawlers to content in language per language sub-domain

- by Noam

I have a multilingual website with many pages (40M). The site has UGC, and the translation covers only the titles: each sub-domain points to the same content with different titles per language. As far as I understand, each sub-domain should be indexed by search engines, meaning they will actually need to crawl 40M x supported languages. So I thought it might be best to direct each sub-domain's crawler to pages that are fully in that language (titles + UGC). Is there a way to do this? Should search engines understand this on their own?

Read the article
How to disallow indexing but allow crawling?

- by John Doe

On the front page of my website, I have some previews of articles (each with a small introduction) that link to the full articles. I want to disallow the front page to prevent duplicate content. But if I do this (in robots.txt), would it still be crawled? I mean, would the full articles still be reached by the crawler even though I disallowed the only page that links to them? I don't want to stop the web crawler from accessing the page and following the links on it; I just don't want it to save the information (which will be repeated in the full articles).

Read the article
Foolproof way to ensure Google News pulls the correct image for its thumbnails?

- by Anthony

Google News results have an accompanying thumbnail next to articles that show up in the results. If Google's crawler can't find a thumbnail to pull from our site, it uses its next best guess from another site, so the image links to another site while the headline is still ours. Example: headline from Reuters, image from Livemint. Our pages absolutely have images, and they are not massive in file size or dimensions, yet they are not being pulled / crawled correctly. We have read up on the suggestions from Google and from others around the web, and nothing is panning out. Has anyone had any experience ensuring that Google News pulls a thumbnail of our choosing?

Read the article
HTTP 303 redirection and robots.txt

- by Ian Dickinson

On a site I'm working on, we're using the HTTP 303 redirect pattern (see this article for background) to distinguish between information and non-information resources. So some URLs under /id get redirected to dynamically created pages under /doc. These dynamic pages are built from a database and contain links to other /doc/ resources, so in general we don't want them to be crawled. Our robots.txt contains:

    Disallow: /doc

However, we do want the non-redirected pages under /id to get indexed by Google et al.:

    Allow: /id

So the question I have, which I can't find an answer to so far, is: if an allowed /id page 303-redirects to a /doc page, will it still be blocked by robots.txt? If yes, we're OK, but otherwise I'm going to disallow all /id resources in the robots file, as having the crawler hammer the db would be worse than losing search indexing for the /id pages.
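The two rules above can be evaluated with Python's standard urllib.robotparser, as a rough sketch of how a rule-following client would see each URL. It assumes a "User-agent: *" line (not shown in the question), the placeholder host example.com, and a crawler that checks robots.txt separately for every URL it requests, including redirect targets; whether any particular search engine behaves that way is not established here. The helper may_fetch_chain is hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_LINES = [
    "User-agent: *",   # assumption: the question does not show this line
    "Disallow: /doc",  # from the question
    "Allow: /id",      # from the question
]


def build_parser(lines):
    """Parse robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def may_fetch_chain(parser, urls, agent="*"):
    """Hypothetical helper: True only if every URL in a redirect chain is allowed."""
    return all(parser.can_fetch(agent, url) for url in urls)


parser = build_parser(ROBOTS_LINES)
id_url = "http://example.com/id/42"    # placeholder URLs
doc_url = "http://example.com/doc/42"

print("id allowed: ", parser.can_fetch("*", id_url))
print("doc allowed:", parser.can_fetch("*", doc_url))
print("id -> doc chain allowed:", may_fetch_chain(parser, [id_url, doc_url]))
```

Under these assumptions the /id URL itself is fetchable, while the /doc target of the 303 is not, so a client that re-checks the rules after the redirect would stop before requesting the database-backed page.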

Read the article
How to fix Google 404 not found Crawl Errors?

- by Freeme

I was checking Google Webmaster Tools for my blog site to see if there was any indication of why my blog traffic dropped by half in one day, and I saw 43 Not Found crawl errors and 5 Not Found errors in the sitemap. The 5 Not Found errors in the sitemap were links to categories. I guess I renamed the categories, and that's why Google can't find the links. As for the 43 other Not Found errors, I see blog post titles that contain (' .), e.g. McDonald's, O.N.E. They weren't found by Google's crawler. Blog posts with /CachedYou at the end, and blog posts with /www.example.com attached at the end, weren't found by Google's crawlers either. My question is: how do I correct those Not Found errors? Thanks

Read the article
What does Enable/Disable mean in Bing's URL Normalization feature?

- by DisgruntledGoat

I'm in Bing Webmaster Tools, under Index URL Normalization. Many parameters are listed in the table with 3 other columns: Status, Source, Date. The "Source" column says "Webmaster" where I have added parameters, and "Bing" where I assume the parameter has been auto-detected. "Date" is probably the last date it detected the parameter. I've tried searching the help files but I can't find what the Status column means. The top of the page says: This feature allows you to specify query parameters for Bing’s crawler to ignore. But it's not clear whether "Enable" or "Disable" is related to this, and if so what happens in each case. Does anyone know?

Read the article
How to allow Google Images search to bypass hotlink protection?

- by Marco Demaio

I noticed that Google Images seems to index my images only if hotlink protection is off. * I use hotlink protection anyway because I don't like the idea of people sucking my bandwidth. I simply use this code to protect my sites from being hotlinked:

    RewriteEngine on
    RewriteCond %{HTTP_REFERER} !^$
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mydomain\.com/.*$ [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?mydomain\.com$ [NC]
    RewriteRule .*\.(jpg|jpeg|png|gif)$ - [F,NC,L]

But in order to allow Google Images search to bypass my hotlink protection (I want Google Images search to show my images), would it suffice to add lines like these?

    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?google\.com/.*$ [NC]
    RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?google\.com$ [NC]

I'm wondering: is the crawler crawling just from google.com? And what about google.it, google.co.uk, etc.? FYI: I did not find info about this in Google's official guidelines. I suppose hotlink protection prevents Google Images from showing images in its results, because I did some tests and hotlink protection does seem to prevent my images from being shown in Google Images search.
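The referer patterns in the question can be tried out with Python's re module before touching the server configuration. This is only a sketch: QUESTION_PATTERN is the google.com expression from the question, while BROADER_PATTERN, the sample referers, and the helper allowed_referers are assumptions made for illustration. It tests which Referer strings a pattern would let through; it says nothing about what Referer, if any, Google's crawler actually sends.

```python
import re

# Pattern from the question: google.com only, with an optional "www." prefix.
QUESTION_PATTERN = re.compile(
    r"^http(s)?://(www\.)?google\.com/.*$", re.IGNORECASE
)

# Hypothetical broader pattern (not from the question): any subdomain of
# google.<tld> or google.<sld>.<cc>, e.g. google.it or images.google.co.uk.
BROADER_PATTERN = re.compile(
    r"^https?://([a-z0-9-]+\.)*google\.[a-z]{2,3}(\.[a-z]{2})?(/.*)?$",
    re.IGNORECASE,
)

# Made-up sample Referer values for illustration.
SAMPLES = [
    "http://www.google.com/imgres?q=x",
    "https://www.google.it/imgres?q=x",
    "http://images.google.co.uk/imgres?q=x",
    "http://www.example.org/page.html",
]


def allowed_referers(pattern, referers):
    """Return the referers that the given pattern matches."""
    return [ref for ref in referers if pattern.match(ref)]


print(allowed_referers(QUESTION_PATTERN, SAMPLES))
print(allowed_referers(BROADER_PATTERN, SAMPLES))
```

The google.com-only pattern lets through just the first sample, which suggests the concern about country domains in the question is justified. Referer headers can be forged, so any such whitelist is a convenience rather than a security measure.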

Read the article
How to get rid of crawling errors due to the URL Encoded Slashes (%2F) problem in Apache

- by user14198

The Google web crawler has indexed a whole set of URLs with encoded slashes (%2F) for our site. I assume it has picked up the pages from our XML sitemap file. The problem is that the live pages will actually result in a failure because of the URL-encoded slashes problem in Apache. Some solutions are mentioned here. We are implementing a 301 redirect scheme for all the error pages. This should make Googlebot remove the pages from the crawl errors (no more crashing pages). Does implementing the 301s require the pages to be "live"? In that case we may be forced to implement solution 1 in the article. The problem is that solution 1 will introduce a security vulnerability.

Read the article
Disqus thread migration. Gotchas?

- by sramsay

I've been migrating a site to a new domain. The site itself is pretty straightforward (it uses Jekyll), and everything has gone fine, except migration of Disqus threads. I've had partial success: some of the threads have migrated successfully, but not all. I've tried the domain migration wizard (which caught a few), the URL mapper (which caught a few), and the 301 redirect crawler (which caught a few). But the remaining threads just won't move, no matter which method I use. So I suppose I'm asking if there are any "gotchas" I should know about with this. When you execute any of these migration tools, it says it will "take awhile." Does that mean hours? Days? I can't tell if it's working, and there's no logging or error reporting that I can see.

Read the article
Google Analytics - Traffic Source - Search engine - (Not Provided)

- by Dharmavir

I am using Google Analytics. When I go to "Traffic Source Overview", it shows the keyword "(Not provided)", which accounts for almost 40% of my traffic sources. More than 90% of my search engine traffic is from Google, and still more than 40% of the keywords are "(Not provided)". Can anyone explain to me what is going wrong here, or how I can get that data? It comes up as the first option and is the biggest keyword in the list. Could it be some crawler, or secure Google search?

Read the article
How to optimize my PageRank calculation?

- by asmaier

In the book Programming Collective Intelligence I found the following function to compute the PageRank:

    def calculatepagerank(self,iterations=20):
        # clear out the current PageRank tables
        self.con.execute("drop table if exists pagerank")
        self.con.execute("create table pagerank(urlid primary key,score)")
        self.con.execute("create index prankidx on pagerank(urlid)")

        # initialize every url with a PageRank of 1.0
        self.con.execute("insert into pagerank select rowid,1.0 from urllist")
        self.dbcommit()

        for i in range(iterations):
            print "Iteration %d" % i
            for (urlid,) in self.con.execute("select rowid from urllist"):
                pr=0.15

                # Loop through all the pages that link to this one
                for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                    # Get the PageRank of the linker
                    linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0]

                    # Get the total number of links from the linker
                    linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0]
                    pr+=0.85*(linkingpr/linkingcount)
                self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid))
            self.dbcommit()

However, this function is very slow, because of all the SQL queries in every iteration:

    >>> import cProfile
    >>> cProfile.run("crawler.calculatepagerank()")
             2262510 function calls in 136.006 CPU seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000  136.006  136.006 <string>:1(<module>)
            1   20.826   20.826  136.006  136.006 searchengine.py:179(calculatepagerank)
           21    0.000    0.000    0.528    0.025 searchengine.py:27(dbcommit)
           21    0.528    0.025    0.528    0.025 {method 'commit' of 'sqlite3.Connecti
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
      1339864  112.602    0.000  112.602    0.000 {method 'execute' of 'sqlite3.Connec
       922600    2.050    0.000    2.050    0.000 {method 'fetchone' of 'sqlite3.Cursor'
            1    0.000    0.000    0.000    0.000 {range}

So I optimized the function and came up with this:

    def calculatepagerank2(self,iterations=20):
        # clear out the current PageRank tables
        self.con.execute("drop table if exists pagerank")
        self.con.execute("create table pagerank(urlid primary key,score)")
        self.con.execute("create index prankidx on pagerank(urlid)")

        # initialize every url with a PageRank of 1.0
        self.con.execute("insert into pagerank select rowid,1.0 from urllist")
        self.dbcommit()

        inlinks={}
        numoutlinks={}
        pagerank={}

        for (urlid,) in self.con.execute("select rowid from urllist"):
            inlinks[urlid]=[]
            numoutlinks[urlid]=0
            # Initialize pagerank vector with 1.0
            pagerank[urlid]=1.0
            # Loop through all the pages that link to this one
            for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                inlinks[urlid].append(inlink)
            # get number of outgoing links from a page
            numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]

        for i in range(iterations):
            print "Iteration %d" % i
            for urlid in pagerank:
                pr=0.15
                for link in inlinks[urlid]:
                    linkpr=pagerank[link]
                    linkcount=numoutlinks[link]
                    pr+=0.85*(linkpr/linkcount)
                pagerank[urlid]=pr

        for urlid in pagerank:
            self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid))
        self.dbcommit()

This function is 20 times faster (but uses a lot more memory for all the temporary dictionaries) because it avoids the unnecessary SQL queries in every iteration:

    >>> cProfile.run("crawler.calculatepagerank2()")
             64802 function calls in 6.950 CPU seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.004    0.004    6.950    6.950 <string>:1(<module>)
            1    1.004    1.004    6.946    6.946 searchengine.py:207(calculatepagerank2
            2    0.000    0.000    0.104    0.052 searchengine.py:27(dbcommit)
        23065    0.012    0.000    0.012    0.000 {meth 'append' of 'list' objects}
            2    0.104    0.052    0.104    0.052 {meth 'commit' of 'sqlite3.Connection
            1    0.000    0.000    0.000    0.000 {meth 'disable' of '_lsprof.Profiler'
        31298    5.809    0.000    5.809    0.000 {meth 'execute' of 'sqlite3.Connectio
        10431    0.018    0.000    0.018    0.000 {method 'fetchone' of 'sqlite3.Cursor'
            1    0.000    0.000    0.000    0.000 {range}

But is it possible to further reduce the number of SQL queries to speed up the function even more?
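One possible direction, offered as a sketch rather than a measured improvement, is to read the whole link table with a single query and write the scores back with one executemany call, so the number of SQL statements no longer grows with the number of URLs. The version below is a standalone Python 3 function rather than a method; the name calculate_pagerank_bulk, the demo schema in build_demo_db, and the tiny three-page graph are assumptions for illustration. It also assumes every fromid in link appears in urllist, and it inserts the final scores directly instead of inserting 1.0 and then updating.

```python
import sqlite3
from collections import defaultdict


def calculate_pagerank_bulk(con, iterations=20):
    """PageRank with a constant number of SQL statements (hypothetical variant)."""
    con.execute("drop table if exists pagerank")
    con.execute("create table pagerank(urlid primary key, score)")

    # Every URL starts with a PageRank of 1.0, as in the original function.
    pagerank = {urlid: 1.0 for (urlid,) in con.execute("select rowid from urllist")}

    # One pass over the link table replaces the two per-URL queries.
    inlinks = defaultdict(set)       # toid -> distinct linking pages
    numoutlinks = defaultdict(int)   # fromid -> number of rows, like count(*)
    for fromid, toid in con.execute("select fromid, toid from link"):
        inlinks[toid].add(fromid)
        numoutlinks[fromid] += 1

    for _ in range(iterations):
        for urlid in pagerank:
            pr = 0.15
            for linker in inlinks.get(urlid, ()):
                pr += 0.85 * (pagerank[linker] / numoutlinks[linker])
            pagerank[urlid] = pr  # updated in place, as in the question's code

    con.executemany(
        "insert into pagerank(urlid, score) values (?, ?)", pagerank.items()
    )
    con.commit()
    return pagerank


def build_demo_db():
    """Tiny in-memory database with an assumed schema: pages 1, 2, 3."""
    con = sqlite3.connect(":memory:")
    con.execute("create table urllist(url)")
    con.execute("create table link(fromid integer, toid integer)")
    con.executemany("insert into urllist(url) values (?)", [("a",), ("b",), ("c",)])
    con.executemany(
        "insert into link(fromid, toid) values (?, ?)",
        [(1, 2), (1, 3), (2, 3), (3, 1)],
    )
    return con


demo_con = build_demo_db()
print(calculate_pagerank_bulk(demo_con, iterations=20))
```

The trade-off is the same as in calculatepagerank2: all links are held in memory at once, which may not suit a very large link table.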

Read the article
Equivalent of libwww-perl in .NET or Java

- by voidvector

I wrote a crawler in Perl a while back, and it was super simple given the high-level capability of libwww-perl. It is so straightforward, in fact, that it can take the raw HTML response of one request and create the next HTTP request for you from the FORMs on that page (as in, it will parse the HTML for you). Does anyone know of a library like this for .NET or Java? Selenium is out of the question because it requires the browser to be open, which we cannot accommodate in our implementation.

Read the article
Unhandled Exception in C#

- by nightcoder1

Hello, I am currently trying to run a web crawler through the terminal. It compiles fine and the debugger does not find any errors; however, I get the following error, which I do not understand. Any ideas on how to get rid of this error would be much appreciated.

    Unhandled Exception: System.ArgumentOutOfRangeException: startIndex + length > this.length
    Parameter name: length
      at System.String.Substring (Int32 startIndex, Int32 length) [0x00000]
      at OpenWebSpiderCS.mysql.executeSQLQuery (System.String SQL) [0x00000]
      at OpenWebSpiderCS.db.startIndexThisSite (OpenWebSpiderCS.page p) [0x00000]
      at OpenWebSpiderCS.ows.startCrawling () [0x00000]
      at OpenWebSpiderCS.mainClass.Main (System.String[] args) [0x00000]

Thank you

Read the article
SEO & Ajax

- by cloudhead

I'm experimenting with building sites dynamically on the client side, using JavaScript plus a JSON content server: the JS retrieves the content and builds the page client-side. Now, the content won't be indexed by Google this way. Is there a workaround for this, like having a crawler version and a user version? Or having some sort of static archives? Has anyone done this already?

Read the article
Oracle Secure Enterprise Search(SES) Intranet crawling problem.

- by vipin k.

I am using Oracle Secure Enterprise Search (SES) and its crawler to crawl the intranet site, but I am getting this error:

    EQG-30008: http://site-name/: Not found

I have added the logon user name and password and also added the proxy settings. If anybody has worked on SES crawling, please take a look.

Read the article
file_get_contents VS CURL, what has better performance?

- by ahmed

I am using PHP to build a web crawler that will crawl millions of URLs. Which is better for me in terms of performance: file_get_contents or CURL? Thanks

Read the article
Valid content-type for XML, HTML and XHTML documents

- by astropanic

What are the correct content types for these documents? I need to write a simple crawler that only fetches these kinds of files. Nowadays http://somedomain.com/index.html can serve, for example, a JPEG file due to mod_rewrite, so I need to check the content type in the response header and compare it with a list of allowed content types. Where can I get such a list?
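The check described above could look roughly like the following sketch. The set contains the registered media types commonly used for HTML (text/html), XHTML (application/xhtml+xml), and XML (application/xml and text/xml); the names ALLOWED_TYPES and is_allowed_content_type are hypothetical, and a real crawler may want to accept further XML-based types.

```python
# Media types the crawler is willing to fetch (illustrative whitelist).
ALLOWED_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "application/xml",
    "text/xml",
}


def is_allowed_content_type(header_value):
    """Return True if a Content-Type header value names an allowed media type.

    Parameters such as "; charset=utf-8" are ignored, and the comparison is
    case-insensitive. A missing header (None or empty) is treated as not allowed.
    """
    if not header_value:
        return False
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES


print(is_allowed_content_type("text/html; charset=UTF-8"))
print(is_allowed_content_type("image/jpeg"))
```

Checking the header from the response, rather than the file extension in the URL, covers the mod_rewrite case in the question, although servers can still send a header that does not match the body.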

Read the article
