Need help with regex parsing (in Perl)
Posted by Charlie on Stack Overflow
Published on 2010-05-02T03:14:01Z
Hi all, I need some help parsing an HTML file in Perl.
I used the LWP module to retrieve a web page into $_, with $/ undefined so there are no newline issues. Now I'm trying to find all strings matching a pattern. I know how to find one instance, but how do I match all instances? And what data structure should the results go into, a multidimensional array?
My text (excerpt) looks like the following:
<TR> 
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1233"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 1</A></FONT></TD> 
 <TD BGCOLOR=EEEEEE nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jun 27 2010  3:00PM</FONT></TD> 
 <TD BGCOLOR=EEEEEE> </TD> 
</TR> 
<TR><TD BGCOLOR=EEEEEE COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR><TD COLSPAN=3 BGCOLOR=999999><IMG SRC="http://images.domain.com/images/spacer.gif" HEIGHT=1 WIDTH=1></TD></TR> 
<TR><TD COLSPAN=3 ><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR> 
 <TD><A HREF="/program.cgi?pid=1234"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 2</A></FONT></TD> 
 <TD nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jun 29 2010  7:00PM</FONT></TD> 
 <TD> </TD> 
</TR> 
<TR><TD COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR><TD COLSPAN=3 BGCOLOR=999999><IMG SRC="http://images.domain.com/images/spacer.gif" HEIGHT=1 WIDTH=1></TD></TR> 
<TR><TD COLSPAN=3  BGCOLOR=EEEEEE><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR> 
<TR> 
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1235"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 3</A></FONT></TD> 
 <TD BGCOLOR=EEEEEE nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jul  3 2010  7:00PM</FONT></TD> 
 <TD BGCOLOR=EEEEEE> </TD> 
</TR> 
I want to get the following into an array (or any structure):
( ["/program.cgi?pid=1233", "Title 1"], ["/program.cgi?pid=1234", "Title 2"], ["/program.cgi?pid=1235", "Title 3"] )
Thanks
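For what it's worth, here is a minimal sketch of one way the "match all instances" part might be approached: a `while` loop around a match with the `/g` modifier, pushing each pair of captures onto an array of array references. The variable names (`$html`, `@links`) and the pattern itself are assumptions tailored to the excerpt above, and the heredoc only stands in for the page fetched with LWP. A regex like this is brittle against markup changes, so a real HTML parser (for example HTML::LinkExtor or HTML::TreeBuilder from CPAN) would probably be the more robust choice.

```perl
use strict;
use warnings;

# Stand-in for the fetched page; only two rows of the excerpt are kept here.
my $html = <<'HTML';
<TR>
 <TD BGCOLOR=EEEEEE><A HREF="/program.cgi?pid=1233"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 1</A></FONT></TD>
 <TD BGCOLOR=EEEEEE nowrap><FONT FACE="ARIAL,HELVETICA" SIZE=2>Jun 27 2010  3:00PM</FONT></TD>
</TR>
<TR><TD COLSPAN=3><IMG SRC="http://images.domain.com/images/spacer.gif" WIDTH=1 HEIGHT=2><BR></TD></TR>
<TR>
 <TD><A HREF="/program.cgi?pid=1234"><FONT FACE="ARIAL,HELVETICA,SANS-SERIF" SIZE=2>Title 2</A></FONT></TD>
</TR>
HTML

my @links;   # will hold [href, title] pairs, i.e. an array of array refs

# In scalar context, /g makes each iteration resume where the previous
# match ended; /i ignores the case of the tag names.
# Capture 1 is the HREF value, capture 2 is the text before </A>.
while ($html =~ m{<A\s+HREF="([^"]+)">\s*<FONT[^>]*>([^<]+)</A>}gi) {
    push @links, [ $1, $2 ];
}

# Each element is a two-item array ref: $links[0][0] is the first href,
# $links[0][1] is the first title.
printf "%s => %s\n", @$_ for @links;
# prints:
# /program.cgi?pid=1233 => Title 1
# /program.cgi?pid=1234 => Title 2
```

The same pattern used in list context (`my @flat = $html =~ m{...}gi;`) returns all captures as one flat list (href, title, href, title, ...), which is shorter to write but loses the pairing that the array of arrays keeps.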
© Stack Overflow or respective owner