How can I extract paragraphs and selected lines with Perl?

Posted by neversaint on Stack Overflow
Published on 2010-04-14T10:40:22Z

I have a text from which I need to:

  1. extract the whole paragraph under the section "AceView summary", up to (but not including) the line that starts with "Please quote";
  2. extract the line that starts with "The closest human gene";
  3. store both in an array with two elements.

The text looks like this (also on pastebin):

  AceView: gene:1700049G17Rik, a comprehensive annotation of human, mouse and worm genes with mRNAs or ESTsAceView.

  <META NAME="title"
 CONTENT="
AceView: gene:1700049G17Rik a comprehensive annotation of human, mouse and worm genes with mRNAs or EST">

<META NAME="keywords"
 CONTENT="
AceView, genes, Acembly, AceDB, Homo sapiens, Human,
 nematode, Worm, Caenorhabditis elegans , WormGenes, WormBase, mouse,
 mammal, Arabidopsis, gene, alternative splicing variant, structure,
 sequence, DNA, EST, mRNA, cDNA clone, transcript, transcription, genome,
 transcriptome, proteome, peptide, GenBank accession, dbest, RefSeq,
 LocusLink, non-coding, coding, exon, intron, boundary, exon-intron
 junction, donor, acceptor, 3'UTR, 5'UTR, uORF, poly A, poly-A site,
 molecular function, protein annotation, isoform, gene family, Pfam,
 motif ,Blast, Psort, GO, taxonomy, homolog, cellular compartment,
 disease, illness, phenotype, RNA interference, RNAi, knock out mutant
 expression, regulation, protein interaction, genetic, map, antisense,
 trans-splicing, operon, chromosome, domain, selenocysteine, Start, Met,
 Stop, U12, RNA editing, bibliography">
<META NAME="Description" 
 CONTENT= "
AceView offers a comprehensive annotation of human, mouse and nematode genes
 reconstructed by co-alignment and clustering of all publicly available
 mRNAs and ESTs on the genome sequence. Our goals are to offer a reliable
 up-to-date resource on the genes, their functions, alternative variants,
 expression, regulation and interactions, in the hope to stimulate
 further validating experiments at the bench
">


<meta name="author"
 content="Danielle Thierry-Mieg and Jean Thierry-Mieg,
 NCBI/NLM/NIH, [email protected]">




   <!--
    var myurl="av.cgi?db=mouse" ;
    var db="mouse" ;
    var doSwf="s" ;
    var classe="gene" ;
  //-->

However, I am stuck on the following script logic. What is the right way to achieve this?

    #!/usr/bin/perl -w
    use Carp;    # provides croak, used below

    my $INFILE_file_name = $file;      # input file name

    open ( INFILE, '<', $INFILE_file_name )
        or croak "$0 : failed to open input file $INFILE_file_name : $!\n";


    my @allsum;

    while ( <INFILE> ) {
        chomp;

        my $line = $_;

        my @temp1 = ();
        if ( $line =~ /^ AceView summary/ ) {
            print "$line\n";
            push @temp1, $line;
        }
        elsif( $line =~ /Please quote/) {
            push @allsum, [@temp1];
            @temp1 = ();
        }
        elsif ($line =~ /The closest human gene/) {

            push @allsum, $line;
        }

    }

    close ( INFILE );           # close input file
    # Do something with @allsum

I need to process many files like this.
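
For reference, here is a minimal sketch of one way the three requirements could be met. It is not a definitive implementation. It assumes that the section header line begins with "AceView summary" (matched case-insensitively), that the header line itself is not part of the wanted paragraph, and that the marker lines may carry leading whitespace. The excerpt above does not show those lines, so the exact layout is a guess. The helper name `extract_summary` is hypothetical. A state flag declared outside the read loop keeps track of whether the current line lies between the two markers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads one open filehandle and returns a reference to a
# two-element array: [ summary paragraph, "The closest human gene" line ].
sub extract_summary {
    my ($fh) = @_;
    my ( @summary, $closest );
    my $in_summary = 0;    # true while between the header and "Please quote"

    while ( my $line = <$fh> ) {
        chomp $line;

        # Assumption: the header line starts the section and is not kept.
        if ( $line =~ /^\s*AceView summary/i ) {
            $in_summary = 1;
            next;
        }

        # The end marker closes the section and is excluded.
        if ( $line =~ /^\s*Please quote/ ) {
            $in_summary = 0;
            next;
        }

        # Remember the first matching line, wherever it occurs.
        if ( !defined $closest && $line =~ /^\s*The closest human gene/ ) {
            $closest = $line;
        }

        push @summary, $line if $in_summary;
    }

    return [ join( "\n", @summary ), $closest ];
}

# Process every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file
        or die "$0: failed to open input file $file: $!\n";
    my $result = extract_summary($fh);
    close $fh;

    print "== $file ==\n";
    print "Summary: $result->[0]\n";
    print "Closest: ", ( defined $result->[1] ? $result->[1] : '(not found)' ), "\n";
}
```

Run as `perl script.pl file1 file2 ...` (the script name is a placeholder), this prints the two extracted elements for each file. Keeping the flag and the accumulator outside the per-line logic is the main difference from the loop above, where `@temp1` is re-declared on every iteration.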
