Search Results

Search found 446 results on 18 pages for 'crawl'.

Page 7/18 | < Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >

7 Ways to Get Your Website Noticed

Putting your website on the internet is only the first step to getting your website noticed. Sooner or later, the web spiders will crawl around it and a while after that you'll start appearing in the search results, but probably only for very specific searches such as your company name. And, let's face it, if someone knows your company name they probably already know a bit about you.

Read the article
How to Use SEO Services to Have a Successful Website

Essentially the optimization of web pages in a site is required because search engines are software programs based on a specific algorithm that is used at the time of its crawling into your website. Each website has numerous web pages and it is practically difficult to index and crawl each and every web page. No search engine can perform this function.

Read the article
Analyzing I/O performance in Linux

<b>cmdln.org:</b> "Monitoring and analyzing performance is an important task for any sysadmin. Disk I/O bottlenecks can bring applications to a crawl. What is an IOP? Should I use SATA, SAS, or FC? How many spindles do I need?"

Read the article
Basic SEO Strategies - How to Get Your Website at the Top of the Search Results

Search Engine Optimisation means making changes on your website so that it can be crawled more easily by search engines and found in search listings. By making changes on your website you can increase targeted visitors to your site. Search Engines use robots to crawl websites and the robots are only able to read content that is text. Your website results are displayed based on how relevant the content is in relation to the keyword that is being searched for in the search engine.

Read the article
Optimizing Robots Text File

We can block spiders to crawl restricted parts of our website. Restricted parts of our website means those links of our website which we don't want to be indexed in search engines and getting some unwanted visitors. For example:

Read the article
Search Engine Optimization Terms

By the time you complete this multiple lesson tutorial, you'll know just what it takes to score top search engine positions for your Web sites. You'll understand how search engines crawl the Web, how they rank Web sites, and how they find previously undiscovered sites. You'll master the important HTML tags that are your key to getting your sites on a search engine's radar, and you'll see why it's important to amass as many potential keywords as possible.

Read the article
How to deduplicate 40TB of data?

- by Michael Stauffer

I've inherited a research cluster with ~40TB of data across three filesystems. The data stretches back almost 15 years, and there are most likely a good amount of duplicates as researchers copy each others data for different reasons and then just hang on to the copies. I know about de-duping tools like fdupes and rmlint. I'm trying to find one that will work on such a large dataset. I don't care if it takes weeks (or maybe even months) to crawl all the data - I'll probably throttle it anyway to go easy on the filesystems. But I need to find a tool that's either somehow super efficient with RAM, or can store all the intermediary data it needs in files rather than RAM. I'm assuming that my RAM (64GB) will be exhausted if I crawl through all this data as one set. I'm experimenting with fdupes now on a 900GB tree. It's 25% of the way through and RAM usage has been slowly creeping up the whole time, now it's at 700MB. Or, is there a way to direct a process to use disk-mapped RAM so there's much more available and it doesn't use system RAM? I'm running CentOS 6.

Read the article
Apache process consuming all memory on the server

- by jemmille

I have an apache process that suddenly appears on a particular server. When it shows up it starts consuming memory at a very rapid rate, then moves on to all the swap. In all it consumes about 11GB (including swap) of memory and the server eventually becomes unresponsive. The load on the server is under 1 at all other times. The process runs as nobody and I am having a hard time tracking down the source. If i run an strace on the process and all it did was continuously dump out mprotect over and over again If i run lsof -p <pid>, I get this, but only sometimes: httpd 19229 nobody 152u IPv4 175050 crawl-66-249-67-216.googlebot.com:62336 (CLOSE_WAIT) httpd 19229 nobody 153u IPv4 179104 crawl-66-249-71-167.googlebot.com:58012 (ESTABLISHED) As long as I catch it, I can kill the process and the server almost immediately stabilizes. I have on site on the server that is getting a few thousand hits a a day that I think might be the source, but I still can't find the exact reason. Also, this is a cPanel server and I have upcp'd the server, rebuilt apache with easy apache, and rebuilt httpd.conf. It is not spawing any related processes, meaning I can find any php, mysql, cgi, etc. processes that relate to this process. It's just a loner process that balloons fast and consumes ever last MB of memory. This is on a XenServer 5.6 based VM. No other servers in the cluster are having this issue.

Read the article
Force request to miss cache but still store the response

- by Tom Marthenal

I have a slow web app that I've placed Varnish in front of. All of the pages are static (they don't vary for a different user), but they need to be updated every 5 minutes so they contain recent data. I have a simple script (wget --mirror) that crawls the entire website every 15 minutes. Each crawl takes about 5 minutes. The point of the crawl is to update every page in the Varnish cache so that a user never has to wait for the page to generate (since all pages have been generated recently thanks to the spider). The timeline looks like this: 00:00:00: Cache flushed 00:00:00: Spider starts crawling to update cache with new pages 00:05:00: Spider finishes crawling, all pages are updated until 1:15 A request that comes in between 0:00:00 and 0:05:00 might hit a page that hasn't been updated yet, and will be forced to wait a few seconds for a response. This isn't acceptable. What I'd like to do is, perhaps using some VCL magic, always foward requests from the spider to the backend, but still store the response in the cache. This way, a user will never have to wait for a page to generate since there is no 5-minute window in which parts of the cache are empty (except perhaps at server startup). How can I do this?

Read the article
Removing Paths/ Landing Pages From SharePoint Search Results

- by j.strugnell

Hi there, We've been asked by a client to remove a number of pages from being shown up in their public website search results page. I've been into the SSP and created Crawl Rules to remove these pages. All seemed to have worked ok but we have an issue in that landing pages are still showing up in their "www.domain.com/sitearea/" form but not in their "www.domain.com/sitearea/pages/default.aspx". For each of this type of page we have created one rule to "Exclude" the "aspx" path and another rule to include the "/" path but to "Follow links on the URL without crawling the URL itself". We tried adding rules to exclude the "/" format but that only resulted in all results underneath that being excluded. Does anybody know how to remove the "area/pages/default.aspx" and the "area/" pats from Search Results? I'm not sure if it's the "done thing" to ask 2 questions in one but this is in a similar vein so it should be ok. I was wondering if anyone knew of a tool (or if it is possible) to allow site admins to exclude pages from search results (not via SSP/Crawl Rules). I know they can do it at the site level but I was wondering if anything out there enabled this to be done at the page level through either Page or Site Settings?

Read the article
Why am I experiencing random connection timeouts? (CentOS)

- by Ryan

I have a CentOS server setup that currently hosts several websites (all relative of each other in some form or another). As of recently throughout the day at the most random times the website speed will lag to a crawl and eventually hit a connection timeout. When I say random times this typically happens anywhere between 10am and 1pm usually, however, this morning this happened to me at 8am. I do not have a lot of familiarity with server knowledge as far as what I am looking for in this situation. What are some possible causes of why my server is slowing the websites down to a complete crawl or timing out? Are there specific things I should be checking for when this happens? I have noticed using: tail /var/log/httpd/access_log That usually when this down time occurs there are lot of IP addresses related to BingBot, Googlebot, and sometimes various bots or spiders that I am unfamiliar with. Could this be related and if so how can I avoid this from causing my websites to lag out? Thanks in advance for any help or advice. The websites that are timing out are built with PHP and use a MySQL database to display information.

Read the article
Yahoo BOSS Python Library, ExpatError

- by Wraith

I tried to install the Yahoo BOSS mashup framework, but am having trouble running the examples provided. Examples 1, 2, 5, and 6 work, but 3 & 4 give Expat errors. Here is the output from ex3.py: gpython examples/ex3.py examples/ex3.py:33: Warning: 'as' will become a reserved keyword in Python 2.6 Traceback (most recent call last): File "examples/ex3.py", line 27, in <module> digg = db.select(name="dg", udf=titlef, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news") File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 214, in select tb = create(name, data=data, url=url, keep_standards_prefix=keep_standards_prefix) File "/usr/lib/python2.5/site-packages/yos/yql/db.py", line 201, in create return WebTable(name, d=rest.load(url), keep_standards_prefix=keep_standards_prefix) File "/usr/lib/python2.5/site-packages/yos/crawl/rest.py", line 38, in load return xml2dict.fromstring(dl) File "/usr/lib/python2.5/site-packages/yos/crawl/xml2dict.py", line 41, in fromstring t = ET.fromstring(s) File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 963, in XML parser.feed(text) File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1245, in feed self._parser.Parse(data, 0) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 It looks like both examples are failing when trying to query Digg.com. Here is the query that is constructed in ex3.py's code: diggf = lambda r: {"title": r["title"]["value"], "diggs": int(r["diggCount"]["value"])} digg = db.select(name="dg", udf=diggf, url="http://digg.com/rss_search?search=google+android&area=dig&type=both&section=news") Any help is appreciated. Thanks!

Read the article
Asp.net Crawler Webresponse Operation Timed out.

- by Leon

Hi I have built a simple threadpool based web crawler within my web application. Its job is to crawl its own application space and build a Lucene index of every valid web page and their meta content. Here's the problem. When I run the crawler from a debug server instance of Visual Studio Express, and provide the starting instance as the IIS url, it works fine. However, when I do not provide the IIS instance and it takes its own url to start the crawl process(ie. crawling its own domain space), I get hit by operation timed out exception on the Webresponse statement. Could someone please guide me into what I should or should not be doing here? Here is my code for fetching the page. It is executed in the multithreaded environment. private static string GetWebText(string url) { string htmlText = ""; HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url); request.UserAgent = "My Crawler"; using (WebResponse response = request.GetResponse()) { using (Stream stream = response.GetResponseStream()) { using (StreamReader reader = new StreamReader(stream)) { htmlText = reader.ReadToEnd(); } } } return htmlText; } And the following is my stacktrace: at System.Net.HttpWebRequest.GetResponse() at CSharpCrawler.Crawler.GetWebText(String url) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 366 at CSharpCrawler.Crawler.CrawlPage(String url, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 105 at CSharpCrawler.Crawler.CrawlSiteBuildIndex(String hostUrl, String urlToBeginSearchFrom, List1 threadCityList) in c:\myAppDev\myApp\site\App_Code\CrawlerLibs\Crawler.cs:line 89 at crawler_Default.threadedCrawlSiteBuildIndex(Object threadedCrawlerObj) in c:\myAppDev\myApp\site\crawler\Default.aspx.cs:line 108 at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(Object state) at System.Threading.ExecutionContext.runTryCode(Object userData) at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData) at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean ignoreSyncCtx) at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() at System.Threading.ThreadPoolWorkQueue.Dispatch() at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback() Thanks and cheers, Leon.

Read the article
What techniques can be used to detect so called "black holes" (a spider trap) when creating a web crawler?

- by Tom

When creating a web crawler, you have to design somekind of system that gathers links and add them to a queue. Some, if not most, of these links will be dynamic, which appear to be different, but do not add any value as they are specifically created to fool crawlers. An example: We tell our crawler to crawl the domain evil.com by entering an initial lookup URL. Lets assume we let it crawl the front page initially, evil.com/index The returned HTML will contain several "unique" links: evil.com/somePageOne evil.com/somePageTwo evil.com/somePageThree The crawler will add these to the buffer of uncrawled URLs. When somePageOne is being crawled, the crawler receives more URLs: evil.com/someSubPageOne evil.com/someSubPageTwo These appear to be unique, and so they are. They are unique in the sense that the returned content is different from previous pages and that the URL is new to the crawler, however it appears that this is only because the developer has made a "loop trap" or "black hole". The crawler will add this new sub page, and the sub page will have another sub page, which will also be added. This process can go on infinitely. The content of each page is unique, but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages, which we actually are not interested in. These loop traps are very difficult to find, and if your crawler does not have anything to prevent them in place, it will get stuck on a certain domain for infinity. My question is, what techniques can be used to detect so called black holes? One of the most common answers I have heard is the introduction of a limit on the amount of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legit site, like Wikipedia, can have hundreds of thousands of pages. Such limit could return a false positive for these kind of sites. Any feedback is appreciated. Thanks.

Read the article
Site crawler/spider that tosses results into mysql

- by ian.evans

It's been suggested that we use mysql for our site's search as it'd be running on the same server that hosts our web server (nginx) and our db (mysql). Since not all of our pages are created from the database, it's been suggested that we have a crawler that can crawl the site, and toss the page url and data into mysql and have sphinx index on that. Does anyone know of an open source spider that has a mysql storing option out of the box. Thanks.

Read the article
Why are there hard faults when my RAM is not 100% used?

- by Vilx-

I've got 2GB of RAM and the resource monitor shows that it's only used about 75%. However there are some apps (NetBeans, Visual Studio) that every once in a while start making a lot of hard faults (up to and over 2000/min), thus predictably slowing down to a crawl. How is this so? The memory usage during these "fits" doesn't change. Perhaps it also includes memory mapped files or something?

Read the article
Using Export-Mailbox without including subfolders

- by AspNyc

I want to delete a certain group of messages from somebody's mailbox. I already have the basic Powershell command ready to go: Get-Mailbox -Identity jshmoe | Export-Mailbox -SubjectKeywords "VirusWarning" -IncludeFolders "\Inbox" -StartDate "02/24/2010" -DeleteContent The problem is that Joe Shmoe's "Inbox" is huge, and I know the messages I want to delete are only in the main Inbox folder. However, the above Powershell command appears to crawl all subfolders beneath "Inbox". Is there a way to tell it not to?

Read the article
Application to open / edit a very large CSV file (500 MB, 4 million records)?

- by Giorgi

Hello, I have a csv file which has about 4 million rows and is about 500MB in size. Can you recommend any editor that can open the file without making the system crawl? I tried EmEditor but it is complaining that there are too many characters in a single line. Thanks.

Read the article
Utility to determine the font used on a site?

- by Lazer

For example: I like the fonts used on the website : NYTimes, so something that could list all the fonts used by the website would be great! I do NOT want a image processing utility that could analyze the screenshots or a site that will ask questions about the font and narrow down from its list. Something that could just crawl the website and list its fonts... It would be great if it could overlay the fonts on the regions, but that would be going too far, I guess.

Read the article
How do I find out which websites are slowing down my server?

- by jesse.gavin

Are there any good tools out there that I can use to find out which websites are the biggest drain on my Windows 2003 web server's resources? The server has been steadily slowing to a crawl and I want to know which sites should get priority for troubleshooting.

Read the article
How would I convert a whole web site to PDF?

- by Mark Nold

Hi, I'm looking for something that I can point at a URL and it crawl the site and create a single PDF of the of the site. I would imagine it would be something like a Firefox plugin .. but with similar capabilities to wget.

Read the article
Is there anything I can do about someone who has pointed there domain at my ip?

- by Micah

This is kind of a weird request. For some reason whoever owns safeandbuy.com has pointed there domain at my ip. The reason it's a problem is that I'm having all kinds of crawlers that are trying to crawl my site with that domain name. Is there anything I can do about this? Thanks,

Read the article
Celery and Django : How to start at boot in production env (linux)

- by llazzaro

Hello, I have and app that uses celery and django to run distribuited tasks (like send email, crawl web,etc). The app never wa sin prod, so I always start celeryd with ./manage celeryd. I want to setup a pre-post env in linux, and I will need information in how to make an init.d script for start the celeryd for django. (I had made some init.d scripts before, no need complete script just relevant part) Thanks!

Read the article
Is there anything I can do about someone who has pointed their domain at my ip?

- by Micah

This is kind of a weird request. For some reason whoever owns safeandbuy.com has pointed their domain at my IP address. The reason it's a problem is that I'm having all kinds of crawlers that are trying to crawl my site with that domain name. Is there anything I can do about this?

Read the article
open 500 mb, 4 million record csv file

- by Giorgi

Hello, I have a csv file which has about 4 million rows and is about 500MB in size. Can you recommend any editor that can open the file without making the system crawl? I tried EmEditor but it is complaining that there are too many characters in a single line. Thanks.

Read the article

< Previous Page | 3 4 5 6 7 8 9 10 11 12 13 14 | Next Page >