Search Results

Search found 346 results on 14 pages for 'scraping'.


How to architect Rails site that can be edited while running?

- by Chris Kimpton

Hi, I am writing a Rails app that "scrapes/navigates" some other websites and web services for content. I am using Mechanize and Savon to do the heavy lifting. But given the dynamic nature of the web, I'd like to make my calls to these editable by the admin users of the site, rather than requiring me to release a new version of the site. The actual scraping thread runs asynchronously to the website, using the daemons gem.

My requirements are:

- Since the scraping/webservice calling code is quite simple, the easiest route is to make the whole class editable by the admins.
- Keep a history of the scraping code, so that we can fairly easily revert if we introduce a problem.
- Initially use the code from the file system, but as soon as that has been edited and stored somewhere, use that code instead.

I am thinking my options are:

- Store the code in the db (with a history table for the old versions).
- Store the code in a private git repo somewhere and access that for the history/latest versions.

I am thinking the git route might be easiest, given its raison d'etre is to track file history... But perhaps there is a gem/plugin that does all this for me, out of the box? Thanks in advance for any tips/advice. ~chris

Read the article
Parsing HTML Documents with the Html Agility Pack

Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML. The .NET Framework offers a variety of classes for accessing data from a remote website, namely the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf, String.Substring, and the like, or on regular expressions. Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.) This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!
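To make the contrast between string searching and parser-based extraction concrete, here is a minimal sketch in Python. It uses the standard library's html.parser rather than the Html Agility Pack itself, so it only illustrates the general idea; the LinkCollector class and extract_links function are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Parse an HTML string and return its links as (href, text) tuples."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the parser tracks element boundaries, nested markup inside a link does not need special handling, which is the kind of case where IndexOf/Substring code tends to break.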

Read the article
Translating with Google Translate without API and C# Code

- by Rick Strahl

Some time back I created a database-driven ASP.NET Resource Provider along with some tools that make it easy to edit ASP.NET resources interactively in a Web application. One of the small helper features of the interactive resource admin tool is the ability to do simple translations using both Google Translate and Babelfish. Here's what this looks like in the resource administration form: When a resource is displayed, the user can click a Translate button and it will show the current resource text and then let you set the source and target languages to translate. The Go button fires the translation for both Google and Babelfish and displays them - pressing Use then changes the language of the resource to the target language and sets the resource value to the newly translated value. It's a nice and quick way to get a quick translation going.

Ch… Ch… Changes

Originally, both implementations basically did some screen scraping of the interactive Web sites and retrieved translated text out of result HTML. Screen scraping is always kind of an iffy proposition as content can be changed easily, but surprisingly that code worked for many years without fail. Recently however, Google at least changed their input pages to use AJAX callbacks and the page updates no longer worked the same way. End result: The Google translate code was broken. Now, Google does have an official API that you can access, but the API is being deprecated and you actually need to have an API key. Since I have public samples that people can download, the API key is an issue if I want people to have the samples work out of the box - the only way I could even do this is by sharing my API key (not allowed). However, after a bit of spelunking and playing around with the public site I found that Google's interactive translate page actually makes callbacks using plain public access without an API key. 
By intercepting some of those AJAX calls and calling them directly from code I was able to get translation back up and working with minimal fuss, by parsing out the JSON these AJAX calls return. Warning: This is hacky code, but after a fair bit of testing I found this to work very well with all sorts of languages and accented and escaped text etc. as long as you stick to small blocks of translated text. I thought I'd share it in case anybody else had been relying on a screen scraping mechanism like I did and needed a non-API based replacement. Here's the code:

    using System;
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web;
    using System.Web.Script.Serialization;

    /// <summary>
    /// Translates a string into another language using Google's translate API JSON calls.
    /// <seealso>Class TranslationServices</seealso>
    /// </summary>
    /// <param name="text">Text to translate. Should be a single word or sentence.</param>
    /// <param name="fromCulture">
    /// Two letter culture (en of en-us, fr of fr-ca, de of de-ch)
    /// </param>
    /// <param name="toCulture">
    /// Two letter culture (as for fromCulture)
    /// </param>
    public string TranslateGoogle(string text, string fromCulture, string toCulture)
    {
        fromCulture = fromCulture.ToLower();
        toCulture = toCulture.ToLower();

        // normalize the culture in case something like en-us was passed
        // retrieve only en since Google doesn't support sub-locales
        string[] tokens = fromCulture.Split('-');
        if (tokens.Length > 1)
            fromCulture = tokens[0];

        // normalize toCulture
        tokens = toCulture.Split('-');
        if (tokens.Length > 1)
            toCulture = tokens[0];

        string url = string.Format(@"http://translate.google.com/translate_a/t?client=j&text={0}&hl=en&sl={1}&tl={2}",
                                   HttpUtility.UrlEncode(text), fromCulture, toCulture);

        // Retrieve Translation with HTTP GET call
        string html = null;
        try
        {
            WebClient web = new WebClient();

            // MUST add a known browser user agent or else response encoding doesn't return UTF-8 (WTF Google?)
            web.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0");
            web.Headers.Add(HttpRequestHeader.AcceptCharset, "UTF-8");

            // Make sure we have response encoding to UTF-8
            web.Encoding = Encoding.UTF8;
            html = web.DownloadString(url);
        }
        catch (Exception ex)
        {
            this.ErrorMessage = Westwind.Globalization.Resources.Resources.ConnectionFailed + ": " +
                                ex.GetBaseException().Message;
            return null;
        }

        // Extract out trans":"...[Extracted]...","from the JSON string
        string result = Regex.Match(html, "trans\":(\".*?\"),\"", RegexOptions.IgnoreCase).Groups[1].Value;
        if (string.IsNullOrEmpty(result))
        {
            this.ErrorMessage = Westwind.Globalization.Resources.Resources.InvalidSearchResult;
            return null;
        }

        //return WebUtils.DecodeJsString(result);

        // Result is a JavaScript string so we need to deserialize it properly
        JavaScriptSerializer ser = new JavaScriptSerializer();
        return ser.Deserialize(result, typeof(string)) as string;
    }

To use the code is straightforward enough - simply provide a string to translate and a pair of two letter source and target languages:

    string result = service.TranslateGoogle("Life is great and one is spoiled when it goes on and on and on", "en", "de");
    TestContext.WriteLine(result);

How it works

The code to translate is fairly straightforward. It basically uses the URL I snagged from the Google Translate Web Page slightly changed to return a JSON result (&client=j) instead of the funky nested PHP style JSON array that the default returns. The JSON result returned looks like this:

    {"sentences":[{"trans":"Das Leben ist großartig und man wird verwöhnt, wenn es weiter und weiter und weiter geht","orig":"Life is great and one is spoiled when it goes on and on and on","translit":"","src_translit":""}],"src":"en","server_time":24}

I use WebClient to make an HTTP GET call to retrieve the JSON data and strip out part of the full JSON response that contains the actual translated text. 
Since this is a JSON response I need to deserialize the JSON string in case it's encoded (for upper/lower ASCII chars or quotes etc.).

Couple of odd things to note in this code: First note that a valid user agent string must be passed (or at least one starting with a common browser identification - I use Mozilla/5.0). Without this Google doesn't encode the result with UTF-8, but instead uses an ISO encoding that .NET can't easily decode. Google seems to ignore the character set header and use the user agent instead which is - odd to say the least. The other is that the call returns a full JSON response. Rather than use the full response and decode it into a custom type that matches Google's result object, I just strip out the translated text. Yeah I know that's hacky but avoids an extra type and firing up the JavaScript deserializer. My internal version uses a small DecodeJsString() method to decode Javascript without the overhead of a full JSON parser. It's obviously not rocket science but as mentioned above what's nice about it is that it works without a Google API key. I can't vouch for how many translations you can do before there are cut offs but in my limited testing running a few stress tests on a Web server under load I didn't run into any problems.

Limitations

There are some restrictions with this:

- It only works on single words or single sentences - multiple sentences (delimited by .) are cut off at the ".".
- There is also a length limitation which appears to happen at around 220 characters or so. While that may not sound like much, for typical word or phrase translations this is plenty of length.

Use with a grain of salt - Google seems to be trying to limit their exposure to usage of the Translate APIs so this code might break in the future, but for now at least it works. FWIW, I also found that Google's translation is not as good as Babelfish, especially for contextual content like sentences. 
Google is faster, but Babelfish tends to give better translations. This is why in my translation tool I show both Google and Babelfish values retrieved. You can check out the code for this in the West Wind Web Toolkit's TranslationService.cs file which contains both the Google and Babelfish translation code pieces. Ironically the Babelfish code has been working forever using screen scraping and continues to work just fine today. I think it's a good idea to have multiple translation providers in case one is down or changes its format, hence the dual display in my translation form above. I hope this has been helpful to some of you - I've actually had many small uses for this code in a number of applications and it's sweet to have a simple routine that performs these operations for me easily.

Resources

- Live Localization Sample: Localization Resource Provider Administration form that includes options to translate text using Google and Babelfish interactively.
- TranslationService.cs: The full source code in the West Wind Web Toolkit's Globalization library that contains the translation code.

© Rick Strahl, West Wind Technologies, 2005-2011. Posted in CSharp HTTP
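The post pulls the translated text out of the response with a regular expression. As a rough illustration of the alternative it mentions (decoding the full JSON response), here is a small Python sketch. It assumes the response has the shape shown in the post; the function names and the abbreviated sample payload are made up for this example, and no request is sent to Google.

```python
import json


def normalize_culture(culture):
    """Reduce a culture code such as 'en-US' to its two-letter language part."""
    return culture.lower().split("-")[0]


def extract_translation(payload):
    """Join the 'trans' field of every entry in the response's 'sentences' list."""
    data = json.loads(payload)
    return "".join(s.get("trans", "") for s in data.get("sentences", []))


# Abbreviated, hypothetical sample in the same shape as the response in the post.
sample = (
    '{"sentences":[{"trans":"Das Leben ist gro\\u00dfartig","orig":"Life is great",'
    '"translit":"","src_translit":""}],"src":"en","server_time":24}'
)
```

With this sample, extract_translation(sample) yields the decoded German text, and the JSON parser takes care of escaped characters that the regex approach has to deserialize separately.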

Read the article
C# WebClient only downloads partial html

- by H4mm3rHead

Hi, I am working on a scraping app and ran into a problem while trying to get it to work. I have replaced the original scraping destination in the code below with Google's web page, just for testing. It seems that my download doesn't get everything; I note that the body and html tags are missing their closing tags. How do I get it to download everything? What's wrong with my sample code?

    string filename = "test.html";
    WebClient client = new WebClient();
    string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
    client.QueryString.Add("q", searchTerm);
    client.QueryString.Add("hl", "en");
    string data = client.DownloadString("http://www.google.com/search");
    StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode);
    writer.Write(data);
    writer.Flush();
    writer.Close();

Read the article
getting real link from rss feed link

- by pfunc

I am experimenting with scraping certain pages from an RSS feed using curl and PHP. The page scraping was working fine when I was just using actual links, not links from the RSS feeds. However, I realize now that links in RSS feeds are usually just redirects to the actual page (at least this is what it seems like), because now when I scrape a page with the RSS link, it doesn't actually get the information I am looking for. Has anyone encountered this and know of a workaround? Is there any way to see where the RSS link is redirecting to and capture that value?
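One way to see where a feed link ends up is to let the HTTP client follow the redirects and then ask it for the final URL. The sketch below shows that idea in Python with urllib; resolve_final_url is a hypothetical helper name and the browser-like User-Agent is an assumption, since some sites answer differently without one. In PHP's curl bindings the corresponding pieces would be CURLOPT_FOLLOWLOCATION and CURLINFO_EFFECTIVE_URL.

```python
import urllib.request


def resolve_final_url(url, timeout=10):
    """Follow HTTP redirects and return the URL that finally served the response."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    # urlopen follows 301/302/303/307 redirects by default.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.geturl()
```

This only covers HTTP-level redirects; a link that redirects through JavaScript or a meta refresh tag would need the intermediate page to be parsed instead.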

Read the article
Create a PHP cache system in MySQL database?

- by Zach Smith

I'm creating a web service that often scrapes data from remote web pages. After scraping this data, I have a simple multidimensional array of information to use. The scraping process is fairly taxing on my server, and the page load takes a while. I was considering adding a simple cache system using a MySQL database, where I create one row per remote web page, with the array of information pulled from it stored as a JSON-encoded string. Is this a good enough system? Or would something like a text file per web page be a better idea?
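The one-row-per-page design described above can be sketched in a few lines. The example below is in Python and uses the standard library's sqlite3 as a stand-in for MySQL so that it runs without a server; the table name, column names, PageCache class, and the one-hour expiry are all assumptions made for illustration.

```python
import json
import sqlite3
import time


class PageCache:
    """One row per remote page; the scraped array is stored as a JSON string."""

    def __init__(self, conn, ttl_seconds=3600):
        self.conn = conn
        self.ttl = ttl_seconds
        conn.execute(
            "CREATE TABLE IF NOT EXISTS page_cache ("
            "url TEXT PRIMARY KEY, payload TEXT NOT NULL, fetched_at REAL NOT NULL)"
        )

    def get(self, url, now=None):
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT payload, fetched_at FROM page_cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl:
            return None  # missing or stale: the caller should scrape again
        return json.loads(row[0])

    def put(self, url, data, now=None):
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO page_cache (url, payload, fetched_at) VALUES (?, ?, ?)",
            (url, json.dumps(data), now),
        )
        self.conn.commit()


def fetch_with_cache(cache, url, scrape):
    """Return cached data when fresh; otherwise call scrape(url) and store the result."""
    data = cache.get(url)
    if data is None:
        data = scrape(url)
        cache.put(url, data)
    return data
```

Storing a fetch timestamp next to the payload is what lets the cache expire entries; without it, a row-per-page table and a file-per-page layout behave much the same.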

Read the article
Is there an SO API which can fetch all Questions & Answers for a particular keyword

- by user4203

I am looking for an API which helps in fetching all the questions and answers from SO and other Stack Exchange sites for a particular keyword only. Later, using XML-RPC, these questions will be posted as blog posts, with the answers posted as each post's answers. Just wondering whether it's possible with an API. One of my friends suggested that we should scrape, but I don't want screen scraping; instead I am looking for API requests that can handle this.

Read the article
What icon would you use to denote an XML (not rss) feed available [closed]

- by mplungjan

Given two sites - one aimed at regular users and one for automated access. The first site is the best known, so many are (still) screen scraping that site for data. It is preferable to have them move to the other site, where the same data is available in XML format. What icon (+text/title) on a page you are about to screen scrape would make you pay attention and decide to see what it was about? Examples from Google Image search for xml icon

Read the article
CodePlex Daily Summary for Sunday, November 06, 2011

CodePlex Daily Summary for Sunday, November 06, 2011Popular ReleasesSelf-Tracking Entity Generator for WPF and Silverlight: Self-Tracking Entity Generator v 0.9.9 Update 2: Self-Tracking Entity Generator v 0.9.9 for Entity Framework 4.0. No change to the self-tracking entity generator v 0.9.9. WPF sample (SchoolSample) is updated with unit testing for both ViewModel and Model classes.SubExtractor: Release 1020: Feature: added "baseline double quotes" character to selector box Feature: added option to save SRT files as ANSI Feature: made "Save Sup files to Source directory" apply to both Sup and Idx source files. Fix: removed SDH text (...) or ... that is split over 2 lines Fix: better decision-making in when to prefix a line with a '-' because SDH was removedAcDown????? - Anime&Comic Downloader: AcDown????? v3.6.1: ?? ● AcDown??????????、??????，??????????????????????，???????Acfun、Bilibili、???、???、???、Tucao.cc、SF???、?????80????，???????????、?????????。 ● AcDown???????????????????????????，???，???????????????????。 ● AcDown???????C#??，????.NET Framework 2.0??。?????"Acfun?????"。 ????32??64? Windows XP/Vista/7 ????????????? ??：????????Windows XP???,?????????.NET Framework 2.0???(x86)?.NET Framework 2.0???(x64)，?????"?????????"??? ??????????????，??????????： ??"AcDown?????"????????? ?? v3.6.1?? ??.hlv...Track Folder Changes: Track Folder Changes 1.1: Fixed exception when right-clicking the root nodeKinect Toolbox: Kinect Toolbox v1.1.0.2: This version adds support for the Kinect for Windows SDK beta 2.MapWindow 4: MapWindow GIS v4.8.6 - Final release - 32Bit: This is the final release of MapWindow v4.8. It has 4.8.6 as version number. This version has been thoroughly tested. If you do get an exception send the exception to us. Don't forget to include your e-mail address. Use the forums at http://www.mapwindow.org/phorum/ for questions. 
Please consider donating a small portion of the money you have saved by having free GIS tools: http://www.mapwindow.org/pages/donate.php What’s New in 4.8.6 (Final release) · A few minor issues have been fixed Wha...Kinect Mouse Cursor: Kinect Mouse Cursor 1.1: Updated for Kinect for Windows SDK v1.0 Beta 2!Coding4Fun Kinect Toolkit: Coding4Fun Kinect Toolkit 1.1: Updated for Kinect for Windows SDK v1.0 Beta 2!Async Executor: 1.0: Source code of the AsyncExecutorMedia Companion: MC 3.421b Weekly: Ensure .NET 4.0 Full Framework is installed. (Available from http://www.microsoft.com/download/en/details.aspx?id=17718) Ensure the NFO ID fix is applied when transitioning from versions prior to 3.416b. (Details here) TV Show Resolutions... Fix to show the season-specials.tbn when selecting an episode from season 00. Before, MC would try & load season00.tbn Fix for issue #197 - new show added by 'Manually Add Path' not being picked up. Also made non-visible the same thing in Root Folders...Nearforums - ASP.NET MVC forum engine: Nearforums v7.0: Version 7.0 of Nearforums, the ASP.NET MVC Forum Engine, containing new features: UI: Flexible layout to handle both list and table-like template layouts. Theming - Visual choice of themes: Deliver some templates on installation, export/import functionality, preview. Allow site owners to choose default list sort order for the forums. Forum latest activity. Visit the project Roadmap for more details. Webdeploy packages sha1 checksum: e6bb913e591543ab292a753d1a16cdb779488c10?????????? - ????????: All-In-One Code Framework ??? 
2011-11-02: http://download.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=1codechs&DownloadId=216140 ??????，11??，?????20????Microsoft OneCode Sample，????6?Program Language Sample，2?Windows Base Sample，2?GDI+ Sample，4?Internet Explorer Sample?6?ASP.NET Sample。?????????????。 ????，?????。http://i3.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=1code&DownloadId=128165 Program Language CSImageFullScreenSlideShow VBImageFullScreenSlideShow CSDynamicallyBuildLambdaExpressionWithFie...Python Tools for Visual Studio: 1.1 Alpha: We’re pleased to announce the release of Python Tools for Visual Studio 1.1 Alpha. Python Tools for Visual Studio (PTVS) is an open-source plug-in for Visual Studio which supports programming with the Python programming language. This release includes new core IDE features, a couple of new sample libraries for interacting with Kinect and Excel, and many bug fixes for issues reported since the release of 1.0. For the core IDE features we’ve added many new features which improve the basic edit...BExplorer (Better Explorer): Better Explorer 2.0.0.631 Alpha: Changelog: Added: Some new functions in ribbon Added: Possibility to choose displayed columns Added: Basic Search Fixed: Some bugs after navigation Fixed: Attempt to fix slow navigation and slow start Known issues: - BreadcrumbBar fails on some situations - Basic search not work quite well in some situations Please if anyone find bugs be kind and report them at the Issue Tracker! 
Thanks!DotNetNuke® Community Edition: 05.06.04: Major Highlights Fixed issue with upgrades on systems that had upgraded the Telerik library to 6.0.0 Fixed issue with Razor Host upgrade to 5.6.3 The logic for module administration checks contains incorrect logic in 1 place, opening the possibility of a user with edit permissions gaining access to functionality they should not have through a particularly crafted url Security FixesBrowsers support the ability to remember common strings such as usernames/addresses etc. Code was adde...Terminals: Version 2.0 - Beta 3 Release: Beta 3 Refresh Dont forget to backup your config files BEFORE upgrading! The team has finally put the nail into the official release date for version 2.0. As bugs are winding down on the 2.0 Roadmap we decided to push out another build - the first 2.0 Beta build. Please take time to use and abuse this release. We left logging in place, and this is a debug build so be sure to submit your logs on each bug reported, and please do report all bugs! Check the source code page on the site, th...iTuner - The iTunes Companion: iTuner 1.4.4322: Added German (unverified, apologies if incorrect) Properly source invariant resources with correct resIDs Replaced obsolete lyric providers with working providers Fix Pseudolater to correctly morph every third char Fix null reference in CatalogBaseTumbleDeck: TumbleDeck 1.0.1 Alpha: New version of TumbleDeck is out! Check it out, it's great, we will be testing it and releasing more stable versions all the time. If you spot any unwanted bugs or features you want added please, please, please email us at tumbledeck@mail.com or contact us on the Discussions tab! If you can see your old version of TumbleDeck, please uninstall it and install this version again. Thanks.VidCoder: 1.2.1: Fixed a couple regressions: video encoder was blank in queue and crashes with the High Profile preset when opening the Settings window. Fixed problem with auto-update introduced in 1.2.0. 
If you have 1.2.0 you will need to update manually to get this.AssaultCube Reloaded: Release 2.3: THE RELEASE YOU'VE ALL BEEN WAITING FOR! IT CAN NOW BE CONSIDERED STABLE Linux has Debian 64-bit precompiled binaries, but you can compile your own as it also contains the source. If you are using Mac or other operating systems, download the Linux package. The server pack is ready for both Windows and Linux, but you might need to compile your own for linux (source included) If you are using Windows and require the source code, download the source package!New ProjectsApploft: Apploft is a new App Platform for windows allowing you to run apps based on powerful code which can pull content from Online.Bugshooting Output for Microsoft Dynamics CRM: Provides an output DLL to use with Bugshooting.Bulk Copy Test Cases Tool for Microsoft Test Manager & TFS: A while ago I had written a blog post Microsoft Test Manager Test Case Versioning on how to manage Test Cases over multiple releases which required you to manually copy test cases individually. I have created a tool to help with the bulk copying of Test Cases that updates the ItDiagnostics Tool for Microsoft Dynamics CRM 2011: Diagnostics Tool for Microsoft Dynamics CRM 2011 helps CRM developers and administrators to enable trace and devErrors on CRM server. It also generates an HTML report file with information about the CRM deployment.Ege University Renewable Energy Society: Ege University Renewable Energy Society Open Source Projectsfirst use of tfs: first project. connection to tfsFlagFtp: FlagFtp is a FTP library for .NET which supports various operations, such as retrieving file lists, write and read from/to files, retrieving file and directory infos, etc...LJCommon: LJCommonnwrole: .Net Worker RoleOrchard Mango Theme: Orchard Mango Theme is a simple inspired Microsoft Windows Phone OS. 
Original creator and designer Marco Siniscalco (http://www.marcosiniscalco.com)Project Rainbow: This is a school project from KAHO-SL in Belgium, ghent. although this is an open source site, we wish to ask not to copy or steal any of our code if you are related to our school and/or project.Rawler　-The Web scraping Framework using XAML: This is the Web scraping Framework using XAML .This framework makes Web scraping possible by only XAML. TenneySoftware Graphing Calculator: A 2D and 3D graphing calculator inspired by Analog's ZPlotter, utilizing my very own libraries to manage the 2D and 3D plotting. The article from Analog, an 8-bit Atari computer magazine from the 80's, can be located here: http://www.cyberroach.com/analog/an30/ZpltAn30.htmThe GINA bot: under constructionTime To Go: A little handy app that shows you how long you've got left until stuff.Windows Azure WordPress Accelerator: Accelerator to deploy WordPress in Windows AzureWP7 App site template: The WP7 App Site Template is intended to make it easier for Windows Phone 7 developers to market their apps. It's currently a simple one page site template, but any contributions/suggestions welcome.ZViewTV.NET: ZViewTV.NET est un programme de visualisation de flux audio-vidéo. Il a été créé par l'equipe qui à fait ZGuideTV.NET

Read the article
How to determine which source files are required for an Eclipse run configuration

- by isme

When writing code in an Eclipse project, I'm usually quite messy and undisciplined in how I create and organize my classes, at least in the early hacky and experimental stages. In particular, I create more than one class with a main method for testing different ideas that share most of the same classes. If I come up with something like a useful app, I can export it to a runnable jar so I can share it with friends. But this simply packs up the whole project, which can become several megabytes big if I'm relying on a large library such as httpclient. Also, if I decide to refactor my lump of code into several projects once I work out what works, and I can't remember which source files are used in a particular run configuration, all I can do is copy the main class to a new project and then keep copying missing types till the new project compiles. Is there a way in Eclipse to determine which classes are actually used in a particular run configuration? EDIT: Here's an example. Say I'm experimenting with web scraping, and so far I've tried to scrape the search-result pages of both youtube.com and wrzuta.pl. I have a bunch of classes that implement scraping in general, and a few that are specific to each of youtube and wrzuta. On top of this I have a basic gui common to both scrapers, but a few wrzuta- and youtube-specific buttons and options. The WrzutaGuiMain and YoutubeGuiMain classes each contain a main method to configure and show the gui for each respective website. Can Eclipse look at each of these to determine which types are referenced?

Read the article
Raising events and object persistence in Django

- by Mridang Agarwalla

Hi, I have a tricky Django problem which didn't occur to me when I was developing it. My Django application allows a user to sign up and store his login credentials for another site. The Django application basically allows the user to search this other site (by scraping content off it) and returns the result to the user. For each query, it does a couple of queries of the other site. This seemed to work fine but sometimes, the other site slaps me with a CAPTCHA. I've written the code to get the CAPTCHA image and I need to return this to the user so he can type it in but I don't know how. My search request (the query, the username and the password) in my Django application gets passed to a view which in turn calls the backend that does the scraping/search. When a CAPTCHA is detected, I'd like to raise a client side event or something along those lines, display the CAPTCHA to the user and wait for the user's input so that I can resume my search. I would somehow need to persist my backend object between calls. I've tried pickling it but it doesn't work because I get the "Can't pickle 'lock' object" error. I don't know how to implement this though. Any help/ideas? Thanks a ton.
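The "Can't pickle 'lock' object" error usually means the object holds a threading lock (directly or through a connection object) that pickle cannot serialize. One possible approach, sketched below in Python, is to exclude the lock when pickling and recreate it on load, so that only plain state such as the query and cookies is persisted. The ScrapeSession class and its fields are hypothetical; the real backend object in the question will look different, and where the pickled bytes are kept between requests is left open.

```python
import pickle
import threading


class ScrapeSession:
    """Holds the state needed to resume a search after a CAPTCHA is solved."""

    def __init__(self, query, cookies=None):
        self.query = query
        self.cookies = dict(cookies or {})
        self.pending_captcha_url = None
        self._lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Copy the instance state and drop the unpicklable lock.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the plain state and create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()
```

The view that detects the CAPTCHA could then store pickle.dumps(session) and return the image to the user, and the view that receives the answer could load it back and continue the search.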

Read the article
How to justify using a scripting language as part of a project

- by sylvanaar

I have a specific project in which I want to use either a scripting language + C, or as an alternative a 100% Java solution. The program adapts a legacy system for use with other modern systems. Basically, I have few choices as to what language I can use. I have C/C++, Java 1.4, and I have also compiled Lua for this environment. The program does 'screen scraping' and has to deal with a lot of strings. That part of the code is highly variable. Most of the developers at my company use C, so my original design was to write some portions in C, and use Lua for the part that dealt with strings and changed frequently. I was told 'You have to justify your use of the scripting language.' So I reworked my design using 100% Java, and was told: Java won't have enough performance. You should do the whole thing in C. I'm not controlling lasers or doing image processing - just some screen scraping. I still have to provide justification for using anything but C - so what justification can I provide?

Read the article
Web Development know how. Best practices [closed]

- by Mir

Possible Duplicate: What should every programmer know about web development? I have recently started learning about web development and I am currently working on a project that involves web scraping. While doing the project I came across an error which, after a little web searching, made me realize that one must clean the HTML before processing it further. Similarly, there were a few more interesting things that I had missed. My question is: how can I quickly familiarize myself with best practices for web development? (I am asking as an electrical engineer with experience in C/C++/Java and very little experience in web dev.) Thanks

Read the article
What PC for programming? [on hold]

- by James Jeffery

I'm asking this here because I'm looking for some advice on a PC that will be suitable for my needs. I currently have Macs and have rarely used PCs apart from my Vaio laptop, which is on its way out. I will be using the PC for C# and .NET development. I mainly develop desktop apps using a PC, but I will be doing some ASP.NET as I'm switching from PHP to ASP. The selection of PCs is here: http://www.pcworld.co.uk/ I have £500, but if I can avoid spending all of that I'd be happy. I will be doing nothing on the computer apart from C# development (desktop and ASP). Any help would be much appreciated. My applications are not intensive. They are usually automation software for web scraping and marketing purposes.

Read the article
How can I move towards the Business Intelligence/ data mining fields from software developer [closed]

- by user1758043

I am working as a Python developer and I work with Django. I also do some web scraping and build spiders and bots. From there I want to make my move to Business Intelligence, and I want to know how I can move into that field. Since companies are not going to hire me in that field directly, I want to know how I can make the transition. I was thinking of first working as a database developer in SQL and then seeing where to go from there. But I want advice from you guys so that I can start learning that stuff and change jobs with that in mind. In my area there are plenty of jobs in all areas, but I need to know how to transition and what things I should learn before making that transition. Jobs are plentiful here, so if I know my stuff, getting a job is a piece of cake because they don't have any people. The same jobs keep getting advertised for months and months.

Read the article
CLI program to download album art

- by John Baber

I'd like to be able to do this:

    $ pwd
    /home/$USER/music/ripped_music/Monty_Python-Instant_Record_Collection
    $ ls
    01.The_Executive_Intro.mp3 ... 16.The_Lumberjack_Song.mp3
    $ mystery_command_or_script .
    $ ls
    01.The_Executive_Intro.mp3 ... 16.The_Lumberjack_Song.mp3 album_cover.jpg
    $

Somewhere in the guts of Rhythmbox, Totem, etc. this is being done. I'd like to be able to do it myself. I don't need help actually writing a script. I'd really just like to know if there's something like CDDB for album covers. (Scraping albumart.org is the current working solution.)

Read the article
How to use PostgreSQL on AWS - Ubuntu 11.10

- by That1Guy

I'm extremely new to cloud computing, Linux, and PostgreSQL, so if this is a stupid question, I apologize. I've managed to create an m1.large instance running Ubuntu 11.10, connect via PuTTY SSH, and install PostgreSQL (sudo apt-get install postgresql), but that is as far as I've gotten. My goal is to run several Python web-scraping scripts that I've written on this instance, so as not to eat up all of our bandwidth (we are a smaller company at the moment), insert the scraped data into a PostgreSQL table on the instance, and later retrieve that data to store on our local server (I've heard AWS EBS is unreliable and I don't want to take chances). How can I configure PostgreSQL on my AWS instance? How can I access the data from my machine? I currently use PgAdmin3 to manage PostgreSQL on our local server. Can I use this same interface to manage PostgreSQL on my AWS instance? Any suggestions, solutions, links, etc. are greatly appreciated. And again, if this is a dumb question, I apologize.

Read the article
wget not respecting my robots.txt. Is there an interceptor?

- by Jane Wilkie

I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping pretty hard, and I was wondering how to curb that, even if only a little. I have implemented a robots.txt policy, posted below:

User-agent: wget
Disallow: /

User-agent: libwww
Disallow: /

User-agent: *
Disallow: /

Issuing a wget from my totally independent Ubuntu box shows that the policy just doesn't seem to work against wget, like so: http://myserver.com/file.csv

Anyway, I don't mind people just grabbing the info; I just want to implement some sort of flood control, like a wrapper or an interceptor. Does anyone have a thought about this, or could you point me in the direction of a resource? I realize that it might not even be possible. Just after some ideas. Janie
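The "flood control" idea in this question can be pictured as a per-client token bucket placed in front of whatever serves the CSV files. The following is only a minimal sketch in Python under stated assumptions: the class name `TokenBucket`, the `rate` and `capacity` parameters, and the use of a client string (for example, the remote IP) as the key are all illustrative and do not come from the question or from any particular web server.

```python
import time


class TokenBucket:
    """Per-client token bucket: each request spends one token; tokens refill over time."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (assumed value, tune as needed)
        self.capacity = capacity  # maximum burst size per client
        self.clock = clock        # injectable clock so the limiter can be tested without sleeping
        self.buckets = {}         # client id -> (tokens remaining, time of last update)

    def allow(self, client):
        """Return True if this client's request may proceed, False if it should be throttled."""
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.capacity, now))
        # Refill according to elapsed time, never exceeding the capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return True
        self.buckets[client] = (tokens, now)
        return False
```

A request handler would call `allow(remote_ip)` and answer with an error status (such as 429) when it returns False. This throttles heavy downloaders without shutting out occasional visitors, which seems to be what the question is after.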

Read the article
How to deal with "software end-of-life" situations?

- by rwong

When a vendor declares that they no longer intend to provide any support or services for a piece of software (and has stated the intent to exit the business, offering no upgrade paths), and has stated that customers must pay a nominal fee in order to have the existing data exported, what kind of recourse do programmers/customers have? Things I can think of:

- Purchasing spare hardware and setting up a spare environment on which the software can continue to operate.
- Various data export methods which do not require vendor involvement (for example, screen scraping, printing to image followed by re-scanning, etc.).
- Parallel systems where staff duplicate the old data into a new system manually or semi-automatically.
- Legal means, in case the vendor is in financial trouble.

Any other ideas? Assuming that there is no "circumvention" involved (no DRM, no DMCA), is data recovery or reverse engineering legal/acceptable?

Read the article
What are some potential issues in blocking all incoming requests from the Amazon cloud?

- by ElHaix

Recently I, along with the rest of the world, have seen a significant increase in what appears to be scraping from Amazon AWS-related sources. So simply put, I blocked all incoming requests from the Amazon cloud for our hosted application. I know that some good services/bots are now hosted on the cloud, and I'm wondering if certain IP addresses should be allowed, as they may gather data that would in the end benefit our site's SEO rankings?

UPDATE: I added a feature to block requests from the following hosts: Amazon, Softlayer, ServerDeals, GigAvenue. Since then, I have seen my network traffic decrease (monitored by network out bytes). Average operation is around 10,000,000 bytes. You can see where last week I was not blocking, then started blocking. I've since removed the blocks and will see what the outcome is.
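The "block the cloud ranges but let certain IP addresses through" idea raised here can be sketched with Python's standard `ipaddress` module. Everything specific in this sketch is an assumption for illustration: the networks are reserved documentation ranges, not real Amazon ranges, and the names `BLOCKED_NETWORKS`, `ALLOWED_ADDRESSES`, and `is_blocked` are hypothetical.

```python
import ipaddress

# Hypothetical ranges for illustration only (reserved documentation blocks),
# NOT real Amazon, Softlayer, ServerDeals, or GigAvenue ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

# Individual addresses that should get through even though they sit in a
# blocked range, e.g. a crawler you have decided to trust (assumed value).
ALLOWED_ADDRESSES = {ipaddress.ip_address("203.0.113.10")}


def is_blocked(remote_addr):
    """Return True if the request's source address should be rejected."""
    ip = ipaddress.ip_address(remote_addr)
    if ip in ALLOWED_ADDRESSES:
        return False  # the allowlist wins over the blocklist
    return any(ip in network for network in BLOCKED_NETWORKS)
```

Checking the allowlist first keeps the exceptions explicit, so a blocked range can stay in place while individual well-behaved bots are let back in one at a time.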

Read the article
Can a domain specific language be used to represent the Open SRD?

- by NeoModulus

I am in the early stages of creating an open source C# library that would allow developers to drop the open SRD (http://www.d20srd.org/) into an existing project. In the abstract, it is a complex set of tightly coupled business rules. Having previously worked on an adaptive object model project for health care risk management, I began with that pattern in mind. Due to the high coupling of the rules, it is becoming apparent that the project may require some kind of scripting. Having started researching DSL implementation, I am now considering scrapping the adaptive object model in favor of a domain specific language. I have not worked with domain specific languages, so my question is: is it reasonable to assume a domain specific language can be used to represent the open SRD?

Read the article
Stock ticker symbol lookup API

- by dancavallaro

Is there any sort of API that just offers a simple symbol lookup service? i.e., input a company name and it will tell you the ticker symbol? I've tried just screen-scraping Google Finance, but after a little while it rate limits you and you have to enter a CAPTCHA. I'm trying to batch-lookup about 2000 ticker symbols. Any ideas?

Read the article
How can I convert HTML to Textile?

- by Joe Van Dyk

I'm scraping a static html site and moving the content into a database-backed CMS. I'd like to use Textile in the CMS. Is there a tool out there that converts HTML into Textile, so I can scrape the existing site, convert the HTML to Textile, and insert that data into the database?
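The scrape, convert, insert pipeline described here hinges on the conversion step. As a rough illustration of what that step involves, here is a minimal Python sketch built on the standard library's `html.parser`. It is not an existing tool: the names `TextileConverter` and `html_to_textile` are hypothetical, and it handles only a small subset of tags (headings, paragraphs, bold, italic, links, and flat bullet lists), assuming reasonably clean input.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class TextileConverter(HTMLParser):
    """Converts a small subset of HTML to Textile markup; unknown tags are dropped."""

    INLINE = {"strong": "*", "b": "*", "em": "_", "i": "_"}

    def __init__(self):
        super().__init__()
        self.out = []     # output fragments, joined at the end
        self.href = None  # target of the link currently being converted

    def handle_starttag(self, tag, attrs):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag in HEADINGS:
            self.out.append(tag + ". ")   # Textile heading prefix, e.g. "h1. "
        elif tag == "li":
            self.out.append("* ")         # flat bullet list item
        elif tag == "a":
            self.href = dict(attrs).get("href") or ""
            self.out.append('"')          # Textile links look like "text":url

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.out.append(self.INLINE[tag])
        elif tag == "a":
            self.out.append('":' + (self.href or ""))
            self.href = None
        elif tag == "li":
            self.out.append("\n")
        elif tag == "p" or tag == "ul" or tag in HEADINGS:
            self.out.append("\n\n")       # blank line ends a Textile block

    def handle_data(self, data):
        self.out.append(data)


def html_to_textile(html):
    """Return Textile text for the supported subset of the given HTML fragment."""
    parser = TextileConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()
```

Real pages will contain tables, nested lists, images, and stray whitespace that this sketch ignores, so it is best read as an outline of the mapping rather than a drop-in converter.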

Read the article
Using ASP.NET to automatically log in to an external website and redirect

- by DoodleWalker

Hello, We have a series of products with built-in web servers, each of which has a login page. A customer wants to create a web portal that they log into once; from there they can simply click on any of the devices (external websites), and the portal will automatically log in to that site and redirect them to the page after the login screen. The portal is using ASP.NET MVC; the external devices are Windows CE based units running embedded web servers. I can find a lot on scraping, but not much on redirection after the event. Many Thanks Andy

Read the article
How can I get the ultimate URL without fetching the pages using Perl and LWP?

- by planetp

I'm doing some web scraping using Perl's LWP. I need to process a set of URLs, some of which may redirect (one or more times). How can I get the ultimate URL, with all redirects resolved, using the HEAD method?
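The underlying technique is independent of the HTTP library: send a HEAD request, and while the response is a redirect, follow its Location header with another HEAD request, so no page body is ever downloaded. The sketch below shows that loop in Python rather than Perl, using only the standard library. The function names `head_once` and `resolve_final_url`, the `max_hops` limit, and the 10-second timeout are assumptions made for illustration; it also assumes the servers involved answer HEAD requests the same way they answer GET, which is not always the case.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head_once(url):
    """Send one HEAD request without following redirects; return (status, Location or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def resolve_final_url(url, head_fn=head_once, max_hops=10):
    """Follow redirects with HEAD requests only and return the last URL reached."""
    for _ in range(max_hops):
        status, location = head_fn(url)
        if status not in REDIRECT_CODES or not location:
            return url
        # The Location header may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")
```

The `head_fn` parameter exists so the loop can be exercised with a fake responder instead of the network; in normal use it is left at its default.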

Read the article
