Search Results

Search found 32104 results on 1285 pages for 'html parsing'.

Page 3/1285 | < Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >

Scrape HTML tables from a given URL into CSV

- by dreeves

I seek a tool that can be run on the command line like so: tablescrape 'http://someURL.foo.com' [n] If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV. Potential additional features: To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill. The Perl module HTML::TableExtract can do this and may be good place to start for writing the tool I have in mind. An option to asciify any unicode. An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table. Related questions: http://stackoverflow.com/questions/259091/how-can-i-scrape-an-html-table-to-csv http://stackoverflow.com/questions/1403087/how-can-i-convert-an-html-table-to-csv http://stackoverflow.com/questions/2861/options-for-html-scraping

Read the article
Java library for HTML analysis

- by Raj

Hi, (I've seen similar questions, but I think none of them cater to my specific needs, hence...) I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like: figuring out the most prominent color in an HTML chunk changing that color to some other color (hence, has to support modification of the HTML as well) pruning out unwanted tags fixing up the HTML to result in a well formed HTML snippet Parts of the last two are done by libraries such as Jericho, and jTidy. 'Plugins' on top of these would be great. Thanks in advance!

Read the article
Simplest way to add HTML as a String to a new Nokogiri HTML document body?

- by viatropos

I have a bunch of content from the body of one HTML file. How do I put that into the body of a new blank-slate HTML document using Nokogiri? Something like this, but with Nokogiri: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>Default Title</title> </head> <body class='default-class'> <%= yield :body %> </body> </html>

Read the article
Haskell - Parsec Parsing <p> element

- by Martin

I'm using Text.ParserCombinators.Parsec and Text.XHtml to parse an input like this: This is the first paragraph example\n with two lines\n \n And this is the second paragraph\n And my output should be: <p>This is the first paragraph example\n with two lines\n</p> <p>And this is the second paragraph\n</p> I defined: line= do{ ;t<-manyTill (anyChar) newline ;return t } paragraph = do{ t<-many1 (line) ;return ( p << t ) } But it returns: <p>This is the first paragraph example\n with two lines\n\n And this is the second paragraph\n</p> What is wrong? Any ideas? Thanks!

Read the article
Managing JS and CSS for a static HTML web application

- by Josh Kelley

I'm working on a smallish web application that uses a little bit of static HTML and relies on JavaScript to load the application data as JSON and dynamically create the web page elements from that. First question: Is this a fundamentally bad idea? I'm unclear on how many web sites and web applications completely dispense with server-side generation of HTML. (There are obvious disadvantages of JS-only web apps in the areas of graceful degradation / progressive enhancement and being search engine friendly, but I don't believe that these are an issue for this particular app.) Second question: What's the best way to manage the static HTML, JS, and CSS? For my "development build," I'd like non-minified third-party code, multiple JS and CSS files for easier organization, etc. For the "release build," everything should be minified, concatenated together, etc. If I was doing server-side generation of HTML, it'd be easy to have my web framework generate different development versus release HTML that includes multiple verbose versus concatenated minified code. But given that I'm only doing any static HTML, what's the best way to manage this? (I realize I could hack something together with ERB or Perl, but I'm wondering if there are any standard solutions.) In particular, since I'm not doing any server-side HTML generation, is there an easy, semi-standard way of setting up my static HTML so that it contains code like <script src="js/vendors/jquery.js"></script> <script src="js/class_a.js"></script> <script src="js/class_b.js"></script> <script src="js/main.js"></script> at development time and <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js"></script> <script src="js/entire_app.min.js"></script> for release?

Read the article
HTML Parsing for multiple input files using java code [closed]

- by mkp

FileReader f0 = new FileReader("123.html"); StringBuilder sb = new StringBuilder(); BufferedReader br = new BufferedReader(f0); while((temp1=br.readLine())!=null) { sb.append(temp1); } String para = sb.toString().replaceAll("<br>","\n"); String textonly = Jsoup.parse(para).text(); System.out.println(textonly); FileWriter f1=new FileWriter("123.txt"); char buf1[] = new char[textonly.length()]; textonly.getChars(0,textonly.length(),buf1,0); for(i=0;i<buf1.length;i++) { if(buf1[i]=='\n') f1.write("\r\n"); f1.write(buf1[i]); } I've this code but it is taking only one file at a time. I want to select multiple files. I've 2000 files and I've given them numbering name from 1 to 2000 as "1.html". So I want to give for loop like for(i=1;i<=2000;i++) and after executing separate txt file should be generated.

Read the article
Ajax Control Toolkit July 2011 Release and the New HTML Editor Extender

- by Stephen Walther

I’m happy to announce the July 2011 release of the Ajax Control Toolkit which includes important bug fixes and a completely new HTML Editor Extender control. You can download the July 2011 Release by visiting the Ajax Control Toolkit CodePlex site at: http://AjaxControlToolkit.CodePlex.com Using the New HTML Editor Extender Control You can use the new HTML Editor Extender to extend any standard ASP.NET TextBox control so that it supports rich formatting such as bold, italics, bulleted lists, numbered lists, typefaces and different foreground and background colors. The following code illustrates how you can extend a standard ASP.NET TextBox control with the HtmlEditorExtender: <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Simple.aspx.cs" Inherits="WebApplication1.Simple" %> <%@ Register TagPrefix="asp" Namespace="AjaxControlToolkit" Assembly="AjaxControlToolkit" %> <html xmlns="http://www.w3.org/1999/xhtml"> <head runat="server"> <title>Simple</title> </head> <body> <form id="form1" runat="server"> <asp:ToolkitScriptManager runat="Server" /> <asp:TextBox ID="txtComments" TextMode="MultiLine" Columns="60" Rows="8" runat="server" /> <asp:HtmlEditorExtender TargetControlID="txtComments" runat="server" /> </form> </body> </html> This page has the following three controls: ToolkitScriptManager – The ToolkitScriptManager renders all of the scripts required by the Ajax Control Toolkit. TextBox – The TextBox control is a standard ASP.NET TextBox which is set to display multiple lines (a TextArea instead of an Input element). HtmlEditorExtender – The HtmlEditorExtender is set to extend the TextBox control. You can use the standard TextBox Text property to read the rich text entered into the TextBox control on the server. Lightweight and HTML5 The HTML Editor Extender works on all modern browsers including the most recent versions of Mozilla Firefox (Firefox 5), Google Chrome (Chrome 12), and Apple Safari (Safari 5). Furthermore, the HTML Editor Extender is compatible with Microsoft Internet Explorer 6 and newer. The HTML Editor Extender is very lightweight. It takes advantage of the HTML5 ContentEditable attribute so it does not require an iframe or complex browser workarounds. If you select View Source in your browser while using the HTML Editor Extender, we hope that you will be pleasantly surprised by how little markup and script is generated by the HTML Editor Extender. Customizable Toolbar Buttons Depending on the web application that you are building, you will want to display different toolbar buttons with the HTML Editor Extender. One of the design goals of the HTML Editor Extender was to make it very easy for you to customize the toolbar buttons. Imagine, for example, that you want to use the HTML Editor Extender when accepting comments on blog posts. In that case, you might want to restrict the type of formatting that a user can display. You might want to enable a user to format text as bold or italic but you do not want the user to make any other formatting changes. The following page illustrates how you can customize the HTML Editor Extender toolbar: <%@ Page Language="C#" AutoEventWireup="true" CodeBehind="CustomToolbar.aspx.cs" Inherits="WebApplication1.CustomToolbar" %> <%@ Register TagPrefix="asp" Namespace="AjaxControlToolkit" Assembly="AjaxControlToolkit" %> <html> <head runat="server"> <title>Custom Toolbar</title> </head> <body> <form id="form1" runat="server"> <asp:ToolkitScriptManager Runat="server" /> <asp:TextBox ID="txtComments" TextMode="MultiLine" Columns="50" Rows="10" Text="Hello <b>world!</b>" Runat="server" /> <asp:HtmlEditorExtender TargetControlID="txtComments" runat="server"> <Toolbar> <asp:Bold /> <asp:Italic /> </Toolbar> </asp:HtmlEditorExtender> </form> </body> </html> Notice that the HTML Editor Extender in the page above has a Toolbar subtag. You can list the toolbar buttons which you want to appear within the subtag. In the case above, only Bold and Italic buttons are displayed. Here is a complete list of the Toolbar buttons currently supported by the HTML Editor Extender: Undo Redo Bold Italic Underline StrikeThrough Subscript Superscript JustifyLeft JustifyCenter JustifyRight JustifyFull InsertOrderedList InsertUnorderedList CreateLink UnLink RemoveFormat SelectAll UnSelect Delete Cut Copy Paste BackgroundColorSelector ForeColorSelector FontNameSelector FontSizeSelector Indent Outdent InsertHorizontalRule HorizontalSeparator Of course the HTML Editor Extender was designed to be extensible. You can create your own buttons and add them to the control. Compatible with the AntiXSS Library When using the HTML Editor Extender on a public facing website, we strongly recommend that you use the HTML Editor Extender with the AntiXSS Library. If you allow users to submit arbitrary HTML, and you don’t take any action to strip out malicious markup, then you are opening your website to Cross-Site Scripting Attacks (XSS attacks). The HTML Editor Extender uses the Provider Model to support different Sanitizer Providers. The July 2011 release of the Ajax Control Toolkit ships with a single Sanitizer Provider which uses the AntiXSS library (see http://AntiXss.CodePlex.com ). A Sanitizer Provider is responsible for sanitizing HTML markup by removing any malicious elements, attributes, and attribute values. For example, the AntiXss Sanitizer Provider will take the following block of HTML: <b><a href=""javascript:doEvil()"">Visit Grandma</a></b> <script>doEvil()</script> And return the following sanitized block of HTML: <b><a href="">Visit Grandma</a></b> Notice that the JavaScript href and <SCRIPT> tag are both stripped out. Be aware that there are a depressingly large number of ways to sneak evil markup into your HTML. You definitely want a Sanitizer as a safety net. Before you can use the AntiXSS Sanitizer Provider, you must add three assemblies to your web application: AntiXSSLibrary.dll, HtmlSanitizationLibrary.dll, and SanitizerProviders.dll. All three assemblies are included with the CodePlex download of the Ajax Control Toolkit in the SanitizerProviders folder. Here’s how you modify your web.config file to use the AntiXSS Sanitizer Provider: <configuration> <configSections> <sectionGroup name="system.web"> <section name="sanitizer" requirePermission="false" type="AjaxControlToolkit.Sanitizer.ProviderSanitizerSection, AjaxControlToolkit"/> </sectionGroup> </configSections> <system.web> <compilation targetFramework="4.0" debug="true"/> <sanitizer defaultProvider="AntiXssSanitizerProvider"> <providers> <add name="AntiXssSanitizerProvider" type="AjaxControlToolkit.Sanitizer.AntiXssSanitizerProvider"></add> </providers> </sanitizer> </system.web> </configuration> You can detect whether the HTML Editor Extender is using the AntiXSS Sanitizer Provider by checking the HtmlEditorExtender SanitizerProvider property like this: if (MyHtmlEditorExtender.SanitizerProvider == null) { throw new Exception("Please enable the AntiXss Sanitizer!"); } When the SanitizerProvider property has the value null, you know that a Sanitizer Provider has not been configured in the web.config file. Because the AntiXSS library requires Full Trust, you cannot use the AntiXSS Sanitizer Provider with most shared website hosting providers. Because most shared hosting providers only support Medium Trust and not Full Trust, we do not recommend using the HTML Editor Extender with a public website hosted with a shared hosting provider. Why a New HTML Editor Control? The Ajax Control Toolkit now includes two HTML Editor controls. Why did we introduce a new HTML Editor control when there was already an existing HTML Editor? We think you will like the new HTML Editor much more than the previous one. We had several goals with the new HTML Editor Extender: Lightweight – We wanted to leverage HTML5 to create a lightweight HTML Editor. The new HTML Editor generates much less markup and script than the previous HTML Editor. Secure – We wanted to make it easy to integrate the AntiXSS library with the HTML Editor. If you are creating a public facing website, we strongly recommend that you use the AntiXSS Provider. Customizable – We wanted to make it easy for users to customize the toolbar buttons displayed by the HTML Editor. Compatibility – We wanted to ensure that the HTML Editor will work with the latest versions of the most popular browsers (including Internet Explorer 6 and higher). The old HTML Editor control is still included in the Ajax Control Toolkit and continues to live in the AjaxControlToolkit.HTMLEditor namespace. We have not modified the control and you can continue to use the control in the same way as you have used it in the past. However, we hope that you will consider migrating to the new HTML Editor Extender for the reasons listed above. Summary We’ve introduced a new Ajax Control Toolkit control with this release. I want to thank the developers and testers on the Superexpert team for the huge amount of work which they put into this control. It was a non-trivial task to build an entirely new control which has the complexity of the HTML Editor in less than 6 weeks. Please let us know what you think! We want to hear your feedback. If you discover issues with the new HTML Editor Extender control, or you have questions about the control, or you have ideas for how it can be improved, then please post them to this blog. Tomorrow starts a new sprint

Read the article
How to parse malformed HTML in python, using standard libraries

- by bukzor

There are so many html and xml libraries built into python, that it's hard to believe there's no support for real-world HTML parsing. I've found plenty of great third-party libraries for this task, but this question is about the python standard library. Requirements: Use only Python standard library components (I'm currently using v2.6) DOM support Handle HTML entities ( ) Handle partial documents (like: Hello, <iWorld</i!) Bonus points: XPATH support Handle unclosed/malformed tags. (<bigdoes anyone here know <html ???

Read the article
extract part of html using tidy in php

- by hunt

hi, I want to extract particular part of HTML using tidy in php. the html page has table in it and i just want to fetch that table from html page. please help and post the solution.... Thanks

Read the article
301 redirect from "/index.html" to root if index.html not exist

- by Andrij Muzychka

Can I create 301 redirect from "index.html" to root directory if file "index.html" not exist? For example: link "http://example.com/index.html" show "404 Error" page. I need 301 redirect to root directory: "http://example.com/" in .htaccess I add rule: Options +FollowSymLinks RewriteCond %{THE_REQUEST} ^.*/index.html RewriteRule ^(.*)index.html$ http://example.com/$1 [R=301,L] but it doesn't work. Can you help me solve this problem?

Read the article
HTML to XML conversion or a good HTML Parser Suggestion written in .Net

- by mehmet6parmak

Hi all, I need to parse html for a project and looking for a good html parser or an API providing conversion from html to xml. Waiting for suggestions... Thanks All...

Read the article
Vote on Pros and Cons of Java HTML to XML cleaners

- by George Bailey

I am looking to allow HTML emails (and other HTML uploads) without letting in scripts and stuff. I plan to have a white list of safe tags and attributes as well as a whitelist of CSS tags and value regexes (to prevent automatic return receipt). I asked a question: Parse a badly formatted XML document (like an HTML file) I found there are many many ways to do this. Some systems have built in sanitizers (which I don't care so much about). This page is a very nice listing page but I get kinda lost http://java-source.net/open-source/html-parsers It is very important that the parsers never throw an exception. There should always be best guess results to the parse/clean. It is also very important that the result is valid XML that can be traversed in Java. I posted some product information and said Community Wiki. Please post any other product suggestions you like and say Community Wiki so they can be voted on. Also any comments or wiki edits on what part of a certain product is better and what is not would be greatly appreciated. (for example,, speed vs accuracy..) It seems that we will go with either jsoup (seems more active and up to date) or TagSoup (compatible with JDK4 and been around awhile). A +1 for any of these products would be if they could convert all style sheets into inline style on the elements.

Read the article
extract specific element from nested elements using lxml html

- by Dan.StackOverflow

Hi all I am having some problems that I think can be attributed to xpath problems. I am using the html module from the lxml package to try and get at some data. I am providing the most simplified situation below, but keep in mind the html I am working with is much uglier. <table> <tr> <td> <table> <tr><td></td></tr> <tr><td> <table> <tr><td><u><b>Header1</b></u></td></tr> <tr><td>Data</td></tr> </table> </td></tr> </table> </td></tr> </table> What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so: from lxml import html page = '...' tree = html.fromstring(page) print tree.xpath('//table[//*[contains(text(), "Header1")]]') but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?

Read the article
Regex to match 2 things in 1 HTML file

- by CyberK

Hi, I have a HTML file which contains the following: <img src="MATCH1" bla="blabla"> <something:else bla="blabla" bla="bla"><something:else2 something="something"> <something image="MATCH2" bla="abc"> Now I need a regex to match both MATCH1 and MATCH2 Also the HTML contains multiple parts like this, so it can be in the HTML 1, 2, 3 of x times.. When I say: <img\s*src="(.*?)".*?<something\s*image="(.*?)" It doesn't match it. What am I missing here? Thanks in advance!

Read the article
When should JavaScript generate HTML?

- by VirtuosiMedia

I try to generate as little HTML from JavaScript as possible. Instead, I prefer to manipulate existing markup whenever I can and only generate HTML when I need to dynamically insert an element that isn't a good candidate for using Ajax. This, I believe, makes it far easier to maintain the code and quickly make changes to it because the markup is easier to read and trace. My rule of thumb is: HTML is for document structure, CSS is for presentation, JavaScript is for behavior. However, I've seen a lot of JS code that generates mounds of HTML, including entire forms and content-heavy modal dialogs. In general, which method is considered best practice? In what circumstances should JavaScript be used to generate HTML and when should it not?

Read the article
Parse HTML with CSS or XPath selectors?

- by ovolko

My goal is to parse HTML with lxml, which supports both XPath and CSS selectors. I can tie my model properties either to CSS or XPath, but I'm not sure which one would be the best, e.g. less fuss when HTML layout is changed, simpler expressions, greater extraction speed. What would you choose in such a situation?

Read the article
Get HTML element informations in .NET

- by martin.malek

Hi, I'm just thinking if there is any way how to get information about element in HTML in my .NET application. The input is HTML page and path to CSS files etc. I want to take e.g. H1 tag and found what will be the CSS for it. Is there any code or can I use IE and try to take this information from it automatically inside of my application?

Read the article
Removing HTML from a Java String

- by Mason

Is there a good way to remove HTML from a Java string? A simple regex like replaceAll("\\<.*?>","") will work, but things like & wont be converted correctly and non-HTML between the two angle brackets will be removed (ie the .*? in the regex will disappear).

Read the article
Parse HTML Offline

- by Clark Gable

Are there any HTML parsers that parse HTML docs offline, i.e. stored on your computer? If so, can anyone name some good ones please?

Read the article
Pure JSP without mixing HTML, by writing html as Java-like code

- by ADTC

Please read before answering. This is a fantasy programming technique I'm dreaming up. I want to know if there's anything close in real life. The following JSP page: <% html { head { title {"Pure fantasy";} } body { h1 {"A heading with double quote (\") character";} p {"a paragraph";} String s = "a paragraph in string. the date is "; p { s; new Date().toString(); } table (Border.ZERO, new Padding(27)) { tr { for (int i = 0; i < 10; i++) { td {i;} } } } } } %> could generate the following HTML page: <html> <head> <title>Pure fantasy</title> </head> <body> <h1>A heading with double quote (") character</h1> <p>a paragraph</p> <p>a paragraph in string. the date is 11 December 2012</p> <table border="0" padding="27"> <tr> <td>0</td> <td>1</td> <td>2</td> <td>3</td> <td>4</td> <td>5</td> <td>6</td> <td>7</td> <td>8</td> <td>9</td> </tr> </table> </body> </html> The thing about this fantasy is it reuses the same old Java programming language technique that enable customized keywords used in a way similar to if-else-then, while, try-catch etc to represent html tags in a non-html way that can easily be checked for syntactic correctness, and most importantly can easily be mixed up with regular Java code without being lost in a sea of <%, %>, <%=, out.write(), etc. An added feature is that strings can directly be placed as commands to print out into generated HTML, something Java doesn't support (where pure strings have to be assigned to variables before use). Is there anything in real life that comes close? If not, is it possible to define customized keywords in Java or JSP? Or do I have to create an entirely new programming language for that? What problems do you see with this kind of setup? PS: I know you can use HTML libraries to construct HTML using Java code, but the problem with such libraries is, the source code itself doesn't have a readable HTML representation like the code above does - if you get what I mean.

Read the article
Sending HTML to Gmail always lands in Spam

- by cartaysm

I am having an issue with sending HTML emails to Gmail. I can send them to Yahoo, Hotmail, RR, AOL, etc. with no problem at all, but when I send them to Gmail I get kicked to spam. I have checked my IP with a lot of different list to make sure it is not listed anywhere, which it is not. spamhaus = is not listed in the DBL abuse.net = is not listed in the SBL abuse.net = is not listed in the PBL abuse.net = is not listed in the XBL spamcop = not listed in bl.spamcop.net host 24.172.204.xxx xxx.204.172.24.in-addr.arpa domain name pointer xxxevents.com. host xxxevents.com xxxevents.com has address 24.172.204.xxx xxxevents.com mail is handled by 10 mail.xxxevents.com. I am just trying to send a very VERY basic HTML message (listed below). I use an Ubuntu server, swiftmailer, multipart/alternative (HTML & plain), SPF = pass, and I am going to setup DKIM today to see if that fixes it (but I doubt it will)... For now I will only post the message I sent that gets kicked to spam and can provide any details needed. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head><title>Triathlon</title></head> <body> <table cellpadding="0" cellspacing="0"> <tr> <td> <p>Thank you for attending our 4th annual Triathlon/Duathlon/5k at Hueston Woods State Park on August 12th. This event is held annually to raise research funding for Crohn's Disease, Ulcerative Colitis, and Muscular Dystrophy diseases.</p> </td> </tr> <tr> <td> <p>As you know the results and pictures have been posted on our home page at since Sunday 8/13/2012. Now we also have updated our Facebook page with those photos and you can start tagging yourself or downloading the pictures now! <br /> our page and tag yourself at </p> <p> test test </p> <p>Race day events is professionally managed by Speedy-Feet</p> </td> </tr> </table> </body> </html> Just plain text works great, I thought maybe wording was messing me up but not the case... I am almost done install opendkim so I will be able to rule that out very soon. Edit: Okay installed opendkim and I am getting passing results so I sent the html I posted above it went through just fine. So now when I start to add a few more lines I am getting kicked back to spam again. Here is updated html code: ` <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head><title>Triathlon</title></head> <body> <table cellpadding="0" cellspacing="0"> <tr> <td> <center><a href='http://xxxevents.com' target="_blank"> <font face="Verdana, Arial, Helvetica, sans-serif" color="#666666" size="2"> <img src="http://xxxevents.com/marketemailimages/xxxlogo.png" alt="xxx It Events | Raising funds for Crohns, Colitis, and Muscular Dystrophy" border="0" /> </font></a></center> </td> <tr> <td> <p>Thank you for attending our 4th annual Triathlon/Duathlon/5k at Hueston Woods State Park on August 12th. This event is held annually to raise research funding for Crohn's Disease, Ulcerative Colitis, and Muscular Dystrophy diseases.</p> </td> </tr> <tr> <td> <p>As you know the results and pictures have been posted on our home page at since Sunday 8/13/2012. Now we also have updated our Facebook page with those photos and you can start tagging yourself or downloading the pictures now! <br /> our page and tag yourself at </p> <p> test test </p> <p>Race day events is professionally managed by Speedy-Feet</p> </td> </tr> </table> <table width="100%" border="0" cellspacing="0" cellpadding="0"> <tr> <td valign="top"> <div align="center" style="font-family:Verdana, Arial, Helvetica, sans-serif; font-size:10px;"><br />PO Box xxx Maineville, OH 45039<br /> <a href="mailto:[email protected]">[email protected]</a> | <a href='http://xxxevents.com' target="_blank">xxxevents.com</a><br /> <br /> </div> </td> </tr> </table> </body> </html>`

Read the article
Html Agility Pack for Reading “Real World” HTML

- by WeigeltRo

In an ideal world, all data you need from the web would be available via well-designed services. In the real world you sometimes have to scrape the data off a web page. Ugly, dirty – but if you really want that data, you have no choice. Just don’t write (yet another) HTML parser. I stumbled across the Html Agility Pack (HAP) a long time ago, but just now had the need for a robust way to read HTML. A quote from the website: This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams). Using the HAP was a simple matter of getting the Nuget package, taking a look at the example and dusting off some of my XPath knowledge from years ago. The documentation on the Codeplex site is non-existing, but if you’ve queried a DOM or used XPath or XSLT before you shouldn’t have problems finding your way around using Intellisense (ReSharper tip: Press Ctrl+Shift+F1 on class members for reading the full doc comments).

Read the article
Idea of an algorithm to detect a website's navigation structure?

- by Uwe Keim

Currently I am in the process of developing an importer of any existing, arbitrary (static) HTML website into the upcoming release of our CMS. While the downloading the files is solved successfully, I'm pulling my hair off when it comes to detect a site structure (pages and subpages) purely from the HTML files, without the user specifying additional hints. Basically I want to get a tree like: + Root page 1 + Child page 1 + Child page 2 + Child child page1 + Child page 3 + Root page 2 + Child page 4 + Root page 3 + ... I.e. I want to be able to detect the menu structure from the links inside the pages. This has not to be 100% accurate, but at least I want to achieve more than just a flat list. I thought of looking at multiple pages to see similar areas and identify these as menu areas and parse the links there, but after all I'm not that satisfied with this idea. My question: Can you imagine any algorithm when it comes to detecting such a structure? Update 1: What I'm looking for is not a web spider, but an algorithm do create a logical tree of the relationship of the pages to be able to create pages and subpages inside my CMS when importing them. Update 2: As of Robert's suggestion I'll solve this by starting at the root page, and then simply parse links as you go and treat every link inside a page simply as a child page. Probably I'll recurse not in a deep-first manner but rather in a breadth-first manner to get a more balanced navigation structure.

Read the article
Removing HTML entities while preserving line breaks with JSoup

- by shrodes

I have been using JSoup to parse lyrics and it has been great until now, but have run into a problem. I can use Node.html() to return the full HTML of the desired node, which retains line breaks as such: Glóandi augu, silfurnátt <br />Blóð alvöru, starir á <br />Óður hundur er í vígamóð, í maga... mér <br /> <br />Kolniður gref, kvik sem dreg hér <br />Kolniður svart, hvergi bjart né But has the unfortunate side-effect, as you can see, of retaining HTML entities and tags. However, if I use Node.text(), I can get a better looking result, free of tags and entities: Glóandi augu, silfurnátt Blóð alvöru, starir á Óður hundur er í vígamóð, í maga... mér Kolniður gref, kvik sem dreg hér Kolniður svart, Which has another unfortunate side-effect of removing the line breaks and compressing into a single line. Simply replacing <br /> from the node before calling Node.text() yields the same result, and it seems that that method is compressing the text onto a single line in the method itself, ignoring newlines. Is it possible to have the best of both worlds, and have tags and entities replaced correctly which preserving the line breaks, or is there another method or way of decoding entities and removing tags without having to replace them manually?

Read the article
Parsec Haskell to HTML

- by Martin

I'm using Text.ParserCombinators.Parsec and Text.XHtml to parse an input like this: hello 123 --this is an emphasized text-- bye\n And my output should be: <p>hello 123 <em>this is an emphasized text</em> bye\n</p> Any ideas? Thanks!!

Read the article

< Previous Page | 1 2 3 4 5 6 7 8 9 10 11 12 | Next Page >