Imputing missing data in aligned sequences

Posted by Kwame Oduro on Stack Overflow
Published on 2012-09-06T15:31:07Z Indexed on 2012/09/06 15:38 UTC

I want a simple Perl script to impute missing nucleotides (Ns) in aligned sequences. As an example, my old_file contains the following aligned sequences:

seq1
ATGTC
seq2
ATGTC
seq3
ATNNC
seq4
NNGTN
seq5
CTCTN

I want to replace every N with the majority nucleotide at that position in the alignment and write the result to a new file. My new_file should look like this:

seq1
ATGTC
seq2
ATGTC
seq3
ATGTC
seq4
ATGTC
seq5
CTCTC

A script with the usage "impute_missing_data.pl old_file new_file", or any other approach, would be helpful. Thank you.
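One possible approach, as a minimal sketch rather than a definitive solution: count the non-N bases in each alignment column, then replace every N with the most frequent base in its column. The sub names (majority_bases, impute, main) are made up for this illustration. The sketch assumes the file strictly alternates a name line and a sequence line, as in the example above, and it breaks ties alphabetically, which is an arbitrary choice. A column that contains only Ns is left unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the most frequent non-N base for each alignment column.
# Ties are broken alphabetically (an arbitrary choice for this sketch).
sub majority_bases {
    my @seqs = @_;
    my @counts;     # $counts[$col]{$base} = number of sequences with $base
    my $len = 0;    # length of the longest sequence
    for my $seq (@seqs) {
        my @bases = split //, uc($seq);
        $len = @bases if @bases > $len;
        for my $col (0 .. $#bases) {
            next if $bases[$col] eq 'N';
            $counts[$col]{ $bases[$col] }++;
        }
    }
    my @majority;
    for my $col (0 .. $len - 1) {
        my $c = $counts[$col] || {};
        my ($best) = sort { $c->{$b} <=> $c->{$a} or $a cmp $b } keys %$c;
        # A column made up entirely of Ns has no majority; keep N there.
        $majority[$col] = defined $best ? $best : 'N';
    }
    return @majority;
}

# Replace each N in every sequence with the majority base of its column.
sub impute {
    my @seqs     = @_;
    my @majority = majority_bases(@seqs);
    my @out;
    for my $seq (@seqs) {
        my @bases = split //, $seq;
        for my $col (0 .. $#bases) {
            $bases[$col] = $majority[$col] if uc($bases[$col]) eq 'N';
        }
        push @out, join('', @bases);
    }
    return @out;
}

# Read old_file, impute the Ns, and write new_file.
sub main {
    my @args = @_;
    die "Usage: $0 old_file new_file\n" unless @args == 2;
    my ($in_file, $out_file) = @args;

    open my $in, '<', $in_file or die "Cannot read $in_file: $!\n";
    my (@names, @seqs);
    while (my $line = <$in>) {
        $line =~ s/\r?\n\z//;           # strip Unix or Windows line endings
        next unless length($line);      # skip blank lines
        # Assumption: lines alternate between a name and its sequence.
        if (@names == @seqs) { push @names, $line }
        else                 { push @seqs,  $line }
    }
    close $in;

    my @imputed = impute(@seqs);
    open my $out, '>', $out_file or die "Cannot write $out_file: $!\n";
    print {$out} "$names[$_]\n$imputed[$_]\n" for 0 .. $#imputed;
    close $out or die "Cannot close $out_file: $!\n";
}

# Runs only when arguments are given, so the subs can be reused elsewhere.
main(@ARGV) if @ARGV;
```

Saved as impute_missing_data.pl, it would be run as "perl impute_missing_data.pl old_file new_file". If the real files are FASTA (names starting with ">"), the parsing in main would need to change, and a module such as BioPerl's Bio::SeqIO may be a better fit for reading them.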

