parsing - Developer IT

Parsing / Extracting Text from String in Rails?

- by user641116

I have a string in Rails, e.g. "This is a Twitter message. #books War & Peace by Leo Tolstoy. I love this book!", and I want to parse the text and extract only certain phrases, like "War & Peace by Leo Tolstoy". Is this a matter of using a regex and lifting the text between "#books" and "."?

What if there's no structure to the message, like:

"This is a Twitter message #books War & Peace by Leo Tolstoy I love this book!"

or

"This is a Twitter message. I love the book War & Peace by Leo Tolstoy #books"

How can I reliably pull the phrase "War & Peace by Leo Tolstoy" without knowing the phrase ex ante? Are there any gems, methods, etc. that can help me do this? At the very least, what would you call what I'm trying to do? It will help me search for a solution on Google. I've tried a few searches on "parsing" with no luck.
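For the structured case only (a tag followed by a phrase that ends at the next period), a regex is enough. The sketch below is in Python rather than Ruby, and the function name `phrase_after_tag` is made up for illustration. It deliberately returns nothing for the two unstructured messages, where there is no delimiter to rely on; that harder problem is usually searched for under "named entity recognition" or "information extraction".

```python
import re

def phrase_after_tag(message, tag="#books"):
    """Return the text between `tag` and the next period, or None."""
    # Assumption: the phrase itself contains no period.
    pattern = re.escape(tag) + r"\s+([^.]+)\."
    match = re.search(pattern, message)
    return match.group(1).strip() if match else None

structured = ("This is a Twitter message. #books War & Peace by Leo Tolstoy. "
              "I love this book!")
unstructured = ("This is a Twitter message #books War & Peace by Leo Tolstoy "
                "I love this book!")

print(phrase_after_tag(structured))    # the book title
print(phrase_after_tag(unstructured))  # None: no period closes the phrase
```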


Parsing scripts that use curly braces

- by Keikoku

To get an idea of what I'm doing: I am writing a Python parser that will parse DirectX .x text files. The problem I have deals with how the files are formatted. Although I'm writing it in Python, I'm looking for general algorithms for dealing with this sort of parsing.

.x files define data using templates. The format of a template is

    template_name {
        [some_data]
    }

The goal I have is to parse the file line by line and, whenever I come across a template, deal with it accordingly. My initial approach was to check whether a line contains an opening or closing brace. If it's an open brace, then I check what the template name is. The catch is that the open brace doesn't have to occur on the same line as the template name. It could just as well be

    template_name
    {
        [some_data]
    }

So if I were to use my "open brace exists" criterion, it won't work for any files that use the latter format.

A lot of languages also use curly braces (though I'm not sure when people would be parsing the scripts themselves), so I was wondering if anyone knows how to accurately get the template name (or, in some other languages, it could just as well be a function name, though there aren't any keywords to look for).
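One common way around the line-break problem is to stop thinking in lines and split the text into tokens first, so that the name is simply whatever token precedes an opening brace. The following is a minimal sketch under that assumption; `TOKEN` and `template_names` are illustrative names, and real .x files have more syntax (strings, comments, optional object names) than this handles.

```python
import re

# A token is either a single brace or a run of non-space, non-brace characters.
TOKEN = re.compile(r"[{}]|[^\s{}]+")

def template_names(text):
    """Yield the token that immediately precedes each opening brace."""
    previous = None
    for token in TOKEN.findall(text):
        if token == "{" and previous not in (None, "{", "}"):
            yield previous
        previous = token

sample = """Mesh {
  1;
}
Frame
{
  Mesh { 2; }
}"""

print(list(template_names(sample)))
```

Because whitespace, including newlines, is discarded during tokenizing, both brace placements from the question are handled by the same code path.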


Parsing a string, Grammar file.

- by defn

How would I separate the string below into its parts? What I need to separate is each <word>, including the angle brackets, from the rest of the string. So in the case below I would end up with several strings:

1. "I have to break up with you because "
2. "<reason>"
3. " . But let's still "
4. "<disclaimer>"
5. " ."

The input is:

    I have to break up with you because <reason> . But let's still <disclaimer> .

Below is what I currently have (it's ugly...):

    boolean complete = false;
    int begin = 0;
    int end = 0;
    while (complete == false) {
        if (s.charAt(end) == '<') {
            stack.add(new Terminal(s.substring(begin, end)));
            begin = end;
        } else if (s.charAt(end) == '>') {
            stack.add(new NonTerminal(s.substring(begin, end)));
            begin = end;
            end++;
        } else if (end == s.length()) {
            if (isTerminal(getSubstring(s, begin, end))) {
                stack.add(new Terminal(s.substring(begin, end)));
            } else {
                stack.add(new NonTerminal(s.substring(begin, end)));
            }
            complete = true;
        }
        end++;
    }
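The same split can be expressed with a regular expression that captures the bracketed words, since splitting on a capturing group keeps the delimiters in the result. This is a Python sketch of the idea rather than a fix for the Java above; `split_grammar_line` is a hypothetical name, and it assumes angle brackets are never nested.

```python
import re

def split_grammar_line(line):
    """Split into literal text and <non-terminal> parts, brackets included."""
    # The capturing group makes re.split keep each <word> in the output.
    parts = re.split(r"(<[^<>]+>)", line)
    return [part for part in parts if part]

line = "I have to break up with you because <reason> . But let's still <disclaimer> ."
for part in split_grammar_line(line):
    kind = "non-terminal" if part.startswith("<") else "terminal"
    print(kind, repr(part))
```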


Parsing tab delimited file with double quotes in Perl

- by sfactor

I have a data set that is tab delimited, with the user-agent strings in double quotes. I need to parse each of these columns, and based on the answer to my other post I used the Text::CSV module.

    94410634 0 GET "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; AskTB5.5)" 1

The code is a simple one.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new(sep_char => "\t");

    while (<>) {
        if ($csv->parse($_)) {
            my @columns = $csv->fields();
            print "@columns\n";
        } else {
            my $err = $csv->error_input;
            print "Failed to parse line: $err";
        }
    }

But I get the "Failed to parse line:" error when I try it on this dataset. What am I doing wrong? I need to extract the 4th column, containing the user-agent strings, for further processing.
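For comparison, here is how the same extraction looks with Python's standard `csv` module, where the separator and quote character are passed as keyword arguments. The sample row is a shortened, made-up version of the one in the question, and `user_agents` is an illustrative name. On the Perl side it may be worth checking how the options are passed: the Text::CSV documentation shows the constructor taking a hash reference, as in `new({ sep_char => "\t" })`.

```python
import csv
import io

# Shortened stand-in for the row in the question, with real tab characters.
sample = '94410634\t0\tGET\t"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)"\t1\n'

def user_agents(stream):
    """Yield the 4th column (index 3) of each tab-delimited row."""
    reader = csv.reader(stream, delimiter="\t", quotechar='"')
    for row in reader:
        yield row[3]

print(list(user_agents(io.StringIO(sample))))
```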


Perl: parsing string enclosed by double quotes

- by sfactor

I need to parse tab/space delimited files that have a lot of columns in Perl. Some of the values are large strings enclosed in double quotes. These strings can contain any characters, such as tabs and spaces or anything else. When I try to parse them with the split function, it splits these strings as well. How can I make Perl understand that the strings within the " " are a single column entry? A simple example is:

    12 345546.67677 "Hello World!!!" -567.55656 0.5465767 "Hello_Again; "
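The underlying technique is a splitter that treats a quoted run as one token. As a sketch of that behaviour (in Python, not Perl), the standard `shlex` module does this for whitespace-delimited input. Note the assumption that the data follows shell-like quoting: backslashes and single quotes are also treated specially, which may not suit every file.

```python
import shlex

line = '12 345546.67677 "Hello World!!!" -567.55656 0.5465767 "Hello_Again; "'

# shlex.split breaks on whitespace but keeps each double-quoted
# string together as a single field, dropping the quotes.
fields = shlex.split(line)
for index, field in enumerate(fields):
    print(index, repr(field))
```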


Language parsing to find important words

- by Matt Huggins

I'm looking for some input and theory on how to approach a lexical topic. Let's say I have a collection of strings, which may just be one sentence or potentially multiple sentences. I'd like to parse these strings and rip out the most important words, perhaps with a score that denotes how likely the word is to be important. Let's look at a few examples of what I mean.

Example #1: "I really want a Keurig, but I can't afford one!"

This is a very basic example, just one sentence. As a human, I can easily see that "Keurig" is the most important word here. Also, "afford" is relatively important, though it's clearly not the primary point of the sentence. The word "I" appears twice, but it is not important at all since it doesn't really tell us any information. I might expect to see a hash of word/scores something like this:

    "Keurig" => 0.9
    "afford" => 0.4
    "want" => 0.2
    "really" => 0.1
    etc...

Example #2: "Just had one of the best swimming practices of my life. Hopefully I can maintain my times come the competition. If only I had remembered to take off my non-waterproof watch."

This example has multiple sentences, so there will be more important words throughout. Without repeating the point exercise from example #1, I would probably expect to see two or three really important words come out of this: "swimming" (or "swimming practice"), "competition", & "watch" (or "waterproof watch" or "non-waterproof watch" depending on how the hyphen is handled).

Given a couple of examples like this, how would you go about doing something similar? Are there any existing (open source) libraries or algorithms in programming that already do this?
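A common starting point for this kind of scoring is TF-IDF: a word scores highly when it is frequent in one string but rare across the collection. The sketch below is a bare-bones version using only the standard library. The function names are illustrative, the smoothing in the IDF term is one choice among several, and with a collection of only two strings the scores are crude. It will not reproduce the hand-assigned numbers above, but it does push words like "I" towards zero.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

def score_words(document, corpus):
    """Score each word of `document` by TF-IDF against `corpus`."""
    tokens = tokenize(document)
    term_counts = Counter(tokens)
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    scores = {}
    for word, count in term_counts.items():
        tf = count / len(tokens)
        # Smoothed inverse document frequency; 0 when a word is in every string.
        idf = math.log((1 + len(corpus)) / (1 + doc_freq[word]))
        scores[word] = tf * idf
    return scores

corpus = [
    "I really want a Keurig, but I can't afford one!",
    "Just had one of the best swimming practices of my life. Hopefully I can "
    "maintain my times come the competition. If only I had remembered to take "
    "off my non-waterproof watch.",
]
scores = score_words(corpus[0], corpus)
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.3f}")
```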


Parsing a website's source

- by Davlog

I want to create an application and maybe upload it to the Play Store, but I am not sure whether what my app does is legal. I am downloading a page's source from a website to get some information I need. For example, I download a page about the song "Ramble On" by Led Zeppelin and parse the page source to get the song's name, maybe a link to an image, and the lyrics. Would that be illegal, or can I display this information to my users without getting into any trouble? Also, the website says it's an "open 'wiki-style' [...]. It's completely user built by people like you and used every day by fans and developers alike."


Parsing mathematical expressions with two values that have parentheses and minus signs

- by user45921

I'm trying to parse equations like these from a text file. Each has only two values, or the square root of a single value:

    100+100
    -100-100
    -(100)+(-100)
    sqrt(100)

I want to parse them by the minus signs, the parentheses, the operator symbol in the middle and the square root, and I have no idea how to start off... I've got the file part done and the simple calculation parts, except that I couldn't get my program to solve the equations above.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void) {
        FILE *fp;
        char buff[255], sym, sym2, del1, del2, del3, del4;
        double num1, num2;
        int ret;

        fp = fopen("input.txt", "r");
        if (fp == NULL) {
            perror("input.txt");
            return 1;
        }
        while (fgets(buff, sizeof(buff), fp) != NULL) {
            char *tok = buff;
            /* Reads "number operator number"; parentheses are not handled. */
            sscanf(tok, "%lf%c%lf", &num1, &sym, &num2);
            switch (sym) {
            case '+': printf("%lf\n", num1 + num2); break;
            case '-': printf("%lf\n", num1 - num2); break;
            case '*': printf("%lf\n", num1 * num2); break;
            case '/': printf("%lf\n", num1 / num2); break;
            default: printf("The input value is not correct\n"); break;
            }
        }
        fclose(fp);
        return 0;
    }

That is what I have written for the other basic operations, without parentheses or a minus sign on the second value, and it works great for the simple ones. I'm using a switch to calculate the add, sub, mul and divide, but I'm not sure how to properly use the sscanf function (if I am not using it properly), or whether there is another way, using a function like strtok, to properly parse the parentheses and the minus signs. Any kind help?
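Since the input is limited to two operands or a single sqrt, one option is to describe an operand once (an optional minus, then either a bare number or a parenthesised signed number) and reuse that description for both sides. The sketch below shows the idea in Python rather than C. The names `OPERAND`, `operand_value` and `evaluate` are illustrative, and nested expressions are out of scope.

```python
import math
import re

# An operand: optional leading minus, then "(signed number)" or a bare number.
OPERAND = r"-?(?:\(-?\d+(?:\.\d+)?\)|\d+(?:\.\d+)?)"
BINARY = re.compile(rf"^({OPERAND})([+\-*/])({OPERAND})$")
SQRT = re.compile(rf"^sqrt\(({OPERAND})\)$")

def operand_value(text):
    """Convert operand text such as '-(100)' or '(-100)' to a float."""
    sign = 1.0
    if text.startswith("-"):
        sign, text = -1.0, text[1:]
    return sign * float(text.strip("()"))

def evaluate(line):
    """Evaluate one line holding 'a op b' or 'sqrt(a)'."""
    expr = line.replace(" ", "").strip()
    match = SQRT.match(expr)
    if match:
        return math.sqrt(operand_value(match.group(1)))
    match = BINARY.match(expr)
    if not match:
        raise ValueError(f"The input value is not correct: {line!r}")
    left, op, right = match.groups()
    a, b = operand_value(left), operand_value(right)
    return {"+": a + b, "-": a - b, "*": a * b,
            "/": a / b if op == "/" else None}[op]

for text in ["100+100", "-100-100", "-(100)+(-100)", "sqrt(100)"]:
    print(text, "=", evaluate(text))
```

The division is guarded with `if op == "/"` only so that the dictionary can be built without dividing when another operator was requested.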


Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful Soup?

- by Eric Wilson

I want to do some screen-scraping with Python 2.7, and I have no context for the differences between HTMLParser, SGMLParser, or Beautiful Soup. Are these all trying to solve the same problem, or do they exist for different reasons? Which is simplest, which is most robust, and which (if any) is the default choice? Also, please let me know if I have overlooked a significant option. Edit: I should mention that I'm not particularly experienced in HTML parsing, and I'm particularly interested in which will get me moving the quickest, with the goal of parsing HTML on one particular site.


How to fix nautilus desktop on linux mint

- by user59530

So I'm using Linux Mint 13 with Cinnamon, and suddenly there are no icons on the desktop and the right click doesn't work. It's like the desktop doesn't start up at all, but the Cinnamon interface and everything else are working just fine. This happens only when I open the session with Cinnamon; if I start the session on the classic Gnome or MATE, the desktop works. I tried to re-install Cinnamon but nothing changed. Then I noticed that there are some little problems in Nautilus (sometimes menus aren't the color they're supposed to be), so I'm convinced that Nautilus might be the problem, but I don't know how to fix this. I've tried a few things, but I'm starting to fear that I'm only making it worse. Also, when I open the terminal and type in nautilus, here's what shows up. Any help?

    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:85:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:192:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:228:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:275:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:310:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:389:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:737:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1095:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1137:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1755:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1856:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1873:18: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1889:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1947:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1954:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:1967:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:2025:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:2075:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:2090:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gtk-widgets.css:2195:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: gnome-panel.css:92:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:15:15: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:15:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:79:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:84:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:113:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nautilus.css:118:18: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:15:15: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:15:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:79:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:84:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:113:17: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: nemo.css:118:18: Not using units is deprecated. Assuming 'px'.
    (nautilus:2906): Gtk-WARNING **: Theme parsing error: unity.css:21:18: Not using units is deprecated. Assuming 'px'.
    Initializing nautilus-dropbox 1.4.0
    Initializing nautilus-open-terminal extension
    ** Message: Initializing gksu extension...


XML parsing with SAX | how to handle special characters?

- by cedar715

We have a Java application that pulls data from SAP, parses it and renders it to the users. The data is pulled using the JCO connector. Recently we were thrown an exception:

    org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.

So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML. My questions here are:

1. Is there any existing (open source) utility that does this job of replacing illegal characters in XML?
2. Or, if I had to write such a utility, how should I handle them?
3. Why is the above exception thrown?

Thank you.
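As a rough illustration of the pre-filtering step described above, the sketch below removes numeric character references whose code point falls outside the character range that XML 1.0 allows (tab, line feed, carriage return, and the ranges from #x20 upward defined by the specification). It is written in Python rather than Java, the function names are made up, and it only looks at `&#...;` references, not at raw illegal bytes in the text.

```python
import re

# Decimal (&#0;) and hexadecimal (&#x0;) character references.
CHAR_REF = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")

def is_legal_xml_char(code_point):
    """True if the code point matches the XML 1.0 'Char' production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )

def strip_illegal_char_refs(xml_text):
    """Drop character references that point at characters XML 1.0 forbids."""
    def replace(match):
        body = match.group(1)
        code_point = int(body[1:], 16) if body.startswith("x") else int(body)
        return match.group(0) if is_legal_xml_char(code_point) else ""
    return CHAR_REF.sub(replace, xml_text)

print(strip_illegal_char_refs("<name>A&#00;B&#x1F;C&#65;</name>"))
```

References to legal characters, such as `&#65;`, are left untouched, so the filter should not change documents that already parse.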


MODX: Snippet strips and hangs string when parsing the vars.

- by CuSS

Hey all, I have a snippet call like this:

    [!mysnippet?&content=`[*content*]` !]

What happens is that if I send some HTML like this:

    [!mysnippet?&content=`<p color='red'>Yeah</p>` !]

it returns this:

    <p colo

The [test only] snippet code (mysnippet) is:

    <?php return $content; ?>

Why is this happening? My actual snippet converts HTML to PDF, so I really need this. Thank you all ;D

EDIT: I'm using Modx Evo 1.0.2


Treetop: parsing single node returns nil

- by Matchu

I'm trying to get the basics of Treetop parsing. Here's a very simple bit of grammar so that I can say ArithmeticParser.parse('2+2').value == 4.

    grammar Arithmetic
      rule additive
        first:number '+' second:number {
          def value
            first.value + second.value
          end
        }
      end

      rule number
        [1-9] [0-9]* {
          def value
            text_value.to_i
          end
        }
      end
    end

Parsing 2+2 works correctly. However, parsing 2 or 22 returns nil. What did I miss?


Parsing every part of an HTTP header field-value

- by brickner

Hi all. I'm parsing HTTP data directly from packets (either TCP reconstructed or not, you can assume it is). I'm looking for the best way to parse HTTP as accurately as possible. The main issue here is the HTTP header. Looking at the basic RFC of HTTP/1.1, it seems that HTTP header parsing would be complex. The RFC describes very complex regular expressions for different parts of the header. Should I write these regular expressions to parse the different parts of the HTTP header?

The basic parsing I've written so far is for the generic HTTP header:

    message-header = field-name ":" [ field-value ]

I've included replacing inner LWS with SP, and combining repeated headers with the same field-name into comma-separated values, as described in section 4.2. However, looking at section 14.9, for example, shows that in order to parse the different parts of the field-value I need a much more complex parsing scheme.

How do you suggest I should handle the complex parts of HTTP parsing (specifically the field-value), assuming I want to give the parser's users the full capabilities of HTTP and to parse every part of HTTP? Design suggestions for this would also be appreciated. Thanks.
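Many field-values, including the Cache-Control directives of section 14.9, share one shape: a comma-separated list of `name`, `name=token` or `name="quoted string"`. One possible design is a generic parser for that shape, with per-header code layered on top of it. The Python sketch below shows such a generic layer. `TOKEN_CHARS` is an assumed approximation of the RFC's token characters, and `parse_directives` is an illustrative name, not part of any library.

```python
import re

# Assumed approximation of the characters allowed in an HTTP token.
TOKEN_CHARS = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
DIRECTIVE = re.compile(
    r"\s*(" + TOKEN_CHARS + r")"                    # directive name
    r'(?:=(?:"((?:[^"\\]|\\.)*)"|([^\s,]+)))?'      # optional quoted or bare value
    r"\s*(?:,|$)"                                   # separator or end of value
)

def parse_directives(field_value):
    """Parse 'a, b=1, c="x, y"' into a dict; valueless names map to None."""
    field_value = field_value.strip()
    directives = {}
    position = 0
    while position < len(field_value):
        match = DIRECTIVE.match(field_value, position)
        if not match:
            raise ValueError(f"malformed field-value at offset {position}")
        name, quoted, bare = match.groups()
        if quoted is not None:
            # Undo backslash escapes inside the quoted string.
            value = re.sub(r"\\(.)", r"\1", quoted)
        else:
            value = bare
        directives[name.lower()] = value
        position = match.end()
    return directives

print(parse_directives('no-cache="Set-Cookie, Accept", max-age=3600, private'))
```

Note that a comma inside a quoted string does not split the directive, which is the case a plain split on "," gets wrong.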


lr parsing table

- by flufferok

Could anyone explain how I can transform an LL(1) parsing table into an LR(1) parsing table? Or are there any existing LR(1) tables for parsing mathematical expressions (+, -, /, *, ^)?


android sdk main.out.xml parsing error?

- by mobibob

I just started a new Android project, "WeekendStudy", to continue learning Android development, and I got stumped compiling and running the default 'hello weekendstudy'. I think that I missed a step in configuration and setup, but I am at a loss to find out where. I have an AVD configured, set and launched. When I press 'run', the SDK builds a file main.out.xml and then fails like this:

    [2010-03-06 09:46:47 - WeekendStudy]Error in an XML file: aborting build.
    [2010-03-06 09:46:48 - WeekendStudy]res/layout/main.xml:0: error: Resource entry main is already defined.
    [2010-03-06 09:46:48 - WeekendStudy]res/layout/main.out.xml:0: Originally defined here.
    [2010-03-06 09:46:48 - WeekendStudy]/Users/mobibob/Projects/workspace-weekend/WeekendStudy/res/layout/main.out.xml:1: error: Error parsing XML: no element found
    [2010-03-06 09:48:16 - WeekendStudy]Error in an XML file: aborting build.
    [2010-03-06 09:48:16 - WeekendStudy]res/layout/main.xml:0: error: Resource entry main is already defined.
    [2010-03-06 09:48:16 - WeekendStudy]res/layout/main.out.xml:0: Originally defined here.
    [2010-03-06 09:48:16 - WeekendStudy]/Users/mobibob/Projects/workspace-weekend/WeekendStudy/res/layout/main.out.xml:1: error: Error parsing XML: no element found
    [2010-03-06 09:55:29 - WeekendStudy]res/layout/main.xml:0: error: Resource entry main is already defined.
    [2010-03-06 09:55:29 - WeekendStudy]res/layout/main.out.xml:0: Originally defined here.
    [2010-03-06 09:55:29 - WeekendStudy]/Users/mobibob/Projects/workspace-weekend/WeekendStudy/res/layout/main.out.xml:1: error: Error parsing XML: no element found
    [2010-03-06 09:55:49 - WeekendStudy]Error in an XML file: aborting build.
    [2010-03-06 09:55:49 - WeekendStudy]res/layout/main.xml:0: error: Resource entry main is already defined.
    [2010-03-06 09:55:49 - WeekendStudy]res/layout/main.out.xml:0: Originally defined here.
    [2010-03-06 09:55:49 - WeekendStudy]/Users/mobibob/Projects/workspace-weekend/WeekendStudy/res/layout/main.out.xml:1: error: Error parsing XML: no element found


Memory Issues When DOM Parsing A Large XML File on Android Devices

- by tonyc

Hey awesome SO users, I have an Android application that parses an XML file for users and displays the results in a much more mobile-friendly format. The app works great for most users, but some users have lots and lots of data, and the app crashes on them because it runs out of memory. Is there any way to have a DOM-style XML parser quit parsing after a certain amount of data? I only need the first 30 or so elements, so it would make the application much more efficient. I'd like to use a SAX or pull parser instead, but the XML I'm parsing is not valid and I have no control over it. Unless anyone has some good SAX solutions that let me parse messy, invalid XML, I think DOM is the only way to go. Thanks for reading!
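The "stop after the first 30 elements" idea is what incremental (pull-style) parsers are for: elements are handed over as they complete, and the caller can simply stop reading. The sketch below illustrates this with Python's standard `xml.etree.ElementTree.iterparse`, not with an Android API. The tag name `item` and the function name are assumptions, and how such a parser behaves when it reaches malformed markup before the cut-off is not something this sketch addresses.

```python
import io
import xml.etree.ElementTree as ET

def first_elements(stream, tag, limit=30):
    """Collect the text of the first `limit` elements named `tag`, then stop."""
    found = []
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == tag:
            found.append(element.text)
            element.clear()  # release the element's children and text
            if len(found) >= limit:
                break  # nothing after this point is processed
    return found

document = "<root>" + "".join(f"<item>{i}</item>" for i in range(100)) + "</root>"
print(first_elements(io.BytesIO(document.encode("utf-8")), "item", limit=3))
```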


getting 502 proxy error while parsing

- by developer

I am parsing a page and getting a response from it, but after some of the parsing is done I get this error from the server:

    Proxy Error
    The proxy server received an invalid response from an upstream server. The proxy server could not handle the request GET /file.php.
    Reason: Error reading from remote server

After this my parsing fails. I even tried the sleep() function, but it didn't help and the error still came. Are they temporarily blocking my IP, or what? What could be the reason for this, and how can I parse those pages without getting this error?


What are the arguments against parsing the Cthulhu way?

- by smarmy53

I have been assigned the task of implementing a Domain Specific Language for a tool that may become quite important for the company. The language is simple but not trivial: it already allows nested loops, string concatenation, etc., and it is practically certain that other constructs will be added as the project advances.

I know from experience that writing a lexer/parser by hand, unless the grammar is trivial, is a time-consuming and error-prone process. So I was left with two options: a parser generator à la yacc or a combinator library like Parsec. The former would have been fine as well, but I picked the latter for various reasons and implemented the solution in a functional language. The result is pretty spectacular to my eyes; the code is very concise, elegant and readable/fluent. I concede it may look a bit weird if you never programmed in anything other than java/c#, but then this would be true of anything not written in java/c#.

At some point, however, I was literally attacked by a co-worker. After a quick glance at my screen he declared that the code is incomprehensible and that I should not reinvent parsing but just use a stack and String.Split like everybody does. He made a lot of noise, and I could not convince him, partially because I was taken by surprise and had no clear explanation, partially because his opinion was immutable (no pun intended). I even offered to explain the language to him, but to no avail.

I'm positive the discussion is going to resurface in front of management, so I'm preparing some solid arguments. These are the first few reasons that come to my mind to avoid a String.Split-based solution:

- you need a lot of ifs to handle special cases, and things quickly spiral out of control
- lots of hardcoded array indexes make maintenance painful
- it is extremely difficult to handle things like a function call as a method argument (e.g. add( (add a, b), c))
- it is very difficult to provide meaningful error messages in case of syntax errors (very likely to happen)

I'm all for simplicity, clarity and avoiding unnecessary smart-cryptic stuff, but I also believe it's a mistake to dumb down every part of the codebase so that even a burger flipper can understand it. It's the same argument I hear for not using interfaces, not adopting separation of concerns, copying-pasting code around, etc. A minimum of technical competence and willingness to learn is required to work on a software project, after all. (I won't use this argument, as it will probably sound offensive, and starting a war is not going to help anybody.)

What are your favorite arguments against parsing the Cthulhu way?*

*Of course, if you can convince me he's right I'll be perfectly happy as well.


Parsing Strings ( .crt files )

- by user1661521

Background: I have a .crt file (a certificate authority file) made up of many fields. The part that matters for this question looks like this:

    Certificate: ...(a lot of stuff before)...
    Subject: C=US, ST=Maryland, L=Pasadena, O=Brent Baccala, OU=FreeSoft, CN=www.freesoft.org/[email protected]
    Subject Public Key Info: ...(a lot of stuff after)

I need to parse the file to populate a .csv file, and most of that is done. The problem I need help with is extracting the field CN=www.freesoft.org. When the CN value contains a lot of slashes, my parsing goes wrong. For example, the raw string is

    CN=foo/bar/the/hell/emailAddress=blablabla

and I need only

    foo/bar/the/hell

For a moment I had that in the correct column, but when the emailAddress part is missing my parsing fails and the CN column of the .csv gets the wrong information. Instead of

    |CN| foo/bar/the/hell

I get

    |CN| OU=FreeSoft, foo/bar/the/hell

This is the code doing the CN parsing:

    #!/bin/bash
    subject_line=$(echo $cert | grep -o "Subject:.*Subject Public Key Info")
    cn=$(echo $subject_line | grep -o "CN=.*" )
    if [ $(echo $cn | grep -c ".*email.*") -gt 0 ]; then
        end_cn=$(echo $cn | grep -b -o emailAddress)
        end_cn_idx=$(echo $end_cn | grep -o .*:)
        final_end_cn=${end_cn_idx:0:-1}
        common_name=${cn:3:$final_end_cn-4}
        echo $common_name
    else
        end_cn=$(echo $cn | grep -b -o "Subject Public Key Info")
        end_cn_idx=$(echo $end_cn | grep -o .*:)
        final_end_cn=${end_cn_idx:0:-1}
        common_name=${cn:3:$final_end_cn-5}
        echo $common_name
    fi
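Both branches of the script compute offsets by hand, which is where the two cases drift apart. A single pattern that says "everything after CN= up to /emailAddress=, the next field, or the end of the line" covers both. The sketch below expresses that in Python rather than bash. `CN_PATTERN` and `common_name` are illustrative names, and it assumes the Subject line has already been isolated as one string.

```python
import re

# Lazily capture the CN value, stopping at the first of:
# "/emailAddress=", a following ", KEY=" field, or the end of the string.
CN_PATTERN = re.compile(r"CN=(.*?)(?:/emailAddress=|,\s*[A-Za-z]+=|$)")

def common_name(subject_line):
    """Return the CN value from a certificate Subject line, or None."""
    match = CN_PATTERN.search(subject_line)
    return match.group(1).strip() if match else None

with_email = "Subject: C=US, OU=FreeSoft, CN=foo/bar/the/hell/emailAddress=blablabla"
without_email = "Subject: C=US, OU=FreeSoft, CN=foo/bar/the/hell"

print(common_name(with_email))
print(common_name(without_email))
```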


Java iteration reading & parsing

- by Patrick Lorio

I have a log file that I am reading into a string:

    public static String Read(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        int r;
        while ((r = in.read()) != -1) {
            sb.append((char) r);  // cast so the character, not its numeric code, is appended
        }
        return sb.toString();
    }

Then I have a parser that iterates over the entire string once:

    void Parse() throws IOException {
        String con = Read("log.txt");
        for (int i = 0; i < con.length(); i++) {
            /* parsing action */
        }
    }

This is a huge waste of CPU cycles. I loop over all the content in Read, then I loop over all the content again in Parse. I could just place the /* parsing action */ inside the while loop in the Read method, which would be fine, but I don't want to copy the same code all over the place. How can I parse the file in one iteration over the contents and still have separate methods for parsing and reading? In C# I understand there is some sort of yield return thing, but I'm locked in to Java. What are my options in Java?


Which type of file is easiest and most efficient to parse? (html, pdf, csv, text)

- by Harikrishna

I want to parse HTML, PDF, CSV and plain-text files. Which of these file types is the easiest and most efficient to parse?

I am asking because I would like to parse all four through common parsing code, if possible. Suppose HTML turns out to be the easiest and most efficient to parse. Then I would write the parsing code for HTML and try to convert PDF, CSV and text files to HTML (if possible), so that the HTML parsing code also works for them. Likewise, if PDF is the easiest and most efficient, I would convert HTML, CSV and text files to PDF and write the parsing code for PDF only.

So my questions are:

(1) Which type of file is the easiest and most efficient to parse (PDF, CSV, HTML, text)?

(2) Is it possible to convert these file types (PDF, text, HTML, CSV) into each other? For example, if HTML is the easiest to parse: PDF to HTML, text to HTML and CSV to HTML.


Parsing a header with two different versions [ID3] while avoiding code duplication?

- by user66141

I really hope you can give me some interesting viewpoints on my situation; the approaches I have come up with so far are not to my liking. I am writing an mp3 parser, starting with an ID3v2 parser. Right now I'm working on the extended header parsing. My issue is that the optional header is defined differently in versions 2.3 and 2.4 of the tag.

The 2.3 version of the optional header is defined as follows:

    struct ID3_3_EXTENDED_HEADER {
        DWORD dwExtHeaderSize;  // Extended header size (either 6 or 8 bytes, excluded)
        WORD  wExtFlags;        // Extended header flags
        DWORD dwSizeOfPadding;  // Size of padding (size of the tag excluding the frames and headers)
    };

While the 2.4 version is defined as:

    struct ID3_4_EXTENDED_HEADER {
        DWORD dwExtHeaderSize;      // Extended header size (synchsafe int)
        BYTE  bNumberOfFlagBytes;   // Number of flag bytes
        BYTE  bFlags;               // Flags
    };

How can I parse the header while minimizing code duplication? Using two different functions to parse each version doesn't sound great, and using a single function with a different flow for each case is not much better. Are there any good practices for this kind of issue? Any tips for avoiding code duplication? Anything would be great.
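One pattern that keeps the per-version differences small is a dispatch table: a tiny parse function per version, selected by the tag's major version, with everything shared (such as synchsafe decoding) factored out. The sketch below shows that shape in Python with the `struct` module. The field names mirror the two structs above, the byte order is assumed to be big-endian, and all function names are illustrative.

```python
import struct

def synchsafe_to_int(raw):
    """Decode a synchsafe integer: 7 significant bits per byte."""
    value = 0
    for byte in raw:
        value = (value << 7) | (byte & 0x7F)
    return value

def parse_v23(data):
    # 4-byte size, 2-byte flags, 4-byte padding size (assumed big-endian).
    size, flags, padding = struct.unpack(">IHI", data[:10])
    return {"size": size, "flags": flags, "padding": padding}

def parse_v24(data):
    # 4-byte synchsafe size, 1-byte flag-byte count, 1 byte of flags.
    size = synchsafe_to_int(data[:4])
    flag_byte_count, flags = struct.unpack(">BB", data[4:6])
    return {"size": size, "flag_byte_count": flag_byte_count, "flags": flags}

# Version-specific code lives only in this table and the two functions above.
EXTENDED_HEADER_PARSERS = {3: parse_v23, 4: parse_v24}

def parse_extended_header(major_version, data):
    """Parse an extended header using the parser for the given version."""
    try:
        parser = EXTENDED_HEADER_PARSERS[major_version]
    except KeyError:
        raise ValueError(f"unsupported ID3v2 major version: {major_version}")
    return parser(data)

print(parse_extended_header(3, bytes([0, 0, 0, 6, 0x80, 0, 0, 0, 1, 0])))
print(parse_extended_header(4, bytes([0, 0, 0, 6, 1, 0x40])))
```

Code that consumes the result can then branch on the presence of a key rather than on the version number, which keeps the version check in one place.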


html parsing with libxml

- by zajcev

In another thread I got convinced into using HTML parsers instead of regexps for HTML parsing (I thought they would work fine, but they didn't ;) ). I thought of using libxml (it has some HTML parser built in), but failed to find any useful tutorial. I also found this site and it says here it should do fine even with severely broken HTML. Could you give me some examples of HTML parsing with libxml, or maybe recommend some different free library for Linux? I'm using C++. I just thought someone would have some example code, so that I don't have to analyze the headers ;)


Perl, efficient parsing of csv file

- by Mike

I'm working on a project that involves parsing a large CSV-formatted file in Perl, and I am looking to make things more efficient. My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting processing in half would be a significant improvement to the entire application.

My question is: what is the most time-efficient means of parsing a large CSV file using only built-in tools?

Note: each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also, we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). Also, I don't want to get into parallel processing, although that might work effectively.
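One way to avoid the separate "split into lines" pass is to let the file handle deliver one line at a time and split each line as it arrives, so the whole file is never held in memory or scanned twice. The sketch below shows that streaming shape in Python rather than Perl. `parse_rows` is an illustrative name, and it leans on the stated assumption that fields are plain alphanumeric text with no quoting.

```python
import io

def parse_rows(stream):
    """Yield each non-empty line of `stream` as a list of fields."""
    for line in stream:  # the stream hands over one line at a time
        line = line.rstrip("\r\n")
        if line:
            # Rows may have different lengths; each is split on its own.
            yield line.split(",")

sample = io.StringIO("a,b,c\n1,2\n\nx\n")
for row in parse_rows(sample):
    print(len(row), row)
```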

Search Results

Search found 3176 results on 128 pages for 'parsing'.
