Python - calculate multinomial probability density functions on large dataset?

Posted by Seafoid on Stack Overflow
Published on 2010-06-14T12:18:32Z Indexed on 2010/06/14 12:22 UTC

Based on the background frequency of occurrence of each amino acid and the count of quadruplets, I aim to calculate the multinomial probability density function for each quadruplet and subsequently use it as the expected value in a maximum likelihood calculation.

The multinomial distribution is as follows:

f(x|n, p) = n!/(x1!*x2!*...*xk!)*((p1^x1)*(p2^x2)*...*(pk^xk))

where x1, ..., xk are the numbers of occurrences of each of the k outcomes in n trials, and p1, ..., pk are their fixed probabilities. n is 4 in all cases in my calculation.
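As a minimal sketch of the formula above, assuming Python 3 and a hypothetical helper named multinomial_pmf (not part of the original post), the calculation for one quadruplet with four distinct outcomes works out as follows:

```python
import math


def multinomial_pmf(counts, probs):
    """Return n!/(x1!*...*xk!) * p1^x1 * ... * pk^xk, with n = sum(counts).

    Hypothetical helper for illustration; the name is not from the original post.
    """
    n = sum(counts)
    coefficient = math.factorial(n)
    for x in counts:
        # each partial quotient is itself an integer, so // stays exact
        coefficient //= math.factorial(x)
    probability = 1.0
    for x, p in zip(counts, probs):
        probability *= p ** x
    return coefficient * probability


# one occurrence of each of four outcomes, each with probability 0.25:
# 4!/(1!*1!*1!*1!) * 0.25**4 = 24 * 0.00390625 = 0.09375
print(multinomial_pmf([1, 1, 1, 1], [0.25, 0.25, 0.25, 0.25]))
```

Taking n from the counts rather than hard-coding it keeps the sketch usable if n ever differs from 4.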

I have created the following functions to calculate this distribution.

# functions for multinomial distribution


def expected_quadruplets(x, y):
    expected = x * y
    return expected


# calculates the product of the probabilities of occurrence, each raised to its number of occurrences
def prod_prob(p1, a, p2, b, p3, c, p4, d):
    prob_prod = pow(p1, a) * pow(p2, b) * pow(p3, c) * pow(p4, d)
    return prob_prod


# factorial() and multinomial_coefficient() work in tandem to calculate C, the multinomial coefficient

def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)


def multinomial_coefficient(a, b, c, d):
    # n is 4 in all cases, so n! = 24 (kept as a float so the division is never truncated)
    n_factorial = 24.0
    multi_coeff = n_factorial / (factorial(a) * factorial(b) * factorial(c) * factorial(d))
    return multi_coeff

The problem is how best to structure the data in order to tackle the calculation most efficiently, in a manner that I can read (you guys write some cryptic code :-)) and that will not create an overflow or runtime error.
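One possible way to avoid both recursion and overflow, sketched here under the assumption that counts could grow well beyond 4, is to work in log space with math.lgamma from the standard library. The function name log_multinomial_pmf is hypothetical and not from the original post:

```python
import math


def log_multinomial_pmf(counts, probs):
    """Natural log of the multinomial probability, computed without factorials.

    Hypothetical helper for illustration; the name is an assumption.
    """
    n = sum(counts)
    log_prob = math.lgamma(n + 1)  # lgamma(n + 1) equals log(n!)
    for x, p in zip(counts, probs):
        if x == 0:
            continue  # 0! == 1 and p**0 == 1, so this term contributes nothing
        if p == 0.0:
            return float("-inf")  # an outcome with zero probability was observed
        log_prob += x * math.log(p) - math.lgamma(x + 1)
    return log_prob


# exponentiate only at the end, and only if the plain probability is needed
print(math.exp(log_multinomial_pmf([1, 1, 1, 1], [0.25] * 4)))  # approximately 0.09375
```

For n = 4 the plain formula is already safe; the log form only matters when factorials or powers would become too large or too small for a float.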

To date, my data is represented as nested lists.

amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'], ['T', '0.25', '1'], ['P', '0.25', '1']]

quadruplets = [['ASTP', '1']]

I initially intended to call these functions within a nested for loop, but this resulted in runtime errors or overflow errors. I know that I can reset the recursion limit but I would rather do this more elegantly.

I had the following:

for i in quadruplets:
    quad = i[0].split(' ')
    for j in amino_acids:
        for k in quadruplets:
            for v in k:
                if j[0] == v:
                    multinomial_coefficient(int(j[2]), int(j[2]), int(j[2]), int(j[2]))

I haven't really gotten to how to incorporate the other functions yet. I think that my current nested list arrangement is suboptimal.

I wish to compare each letter within the string 'ASTP' with the first component of each sub list in amino_acids. Where a match exists, I wish to pass the appropriate numeric values to the functions using indices.

Is there a better way? Can I append the appropriate numbers for each amino acid and quadruplet to a temporary data structure within a loop, pass this to the functions and clear it for the next iteration?
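One possible arrangement, offered as a sketch rather than a definitive answer, is to turn amino_acids into a dictionary keyed by letter and to count the letters of each quadruplet with collections.Counter. It assumes Python 3 and that the second field of each amino_acids entry is the background frequency. The names background and quadruplet_probability are hypothetical:

```python
from collections import Counter
import math

# same nested lists as in the post; the numeric fields are strings there
amino_acids = [['A', '0.25', '1'], ['S', '0.25', '1'], ['T', '0.25', '1'], ['P', '0.25', '1']]
quadruplets = [['ASTP', '1']]

# assumption: field 2 of each entry is the background frequency of that letter
background = {letter: float(freq) for letter, freq, _ in amino_acids}


def quadruplet_probability(quad, background):
    """Multinomial probability of the letter composition of one quadruplet."""
    letter_counts = Counter(quad)  # e.g. 'AAST' -> A: 2, S: 1, T: 1
    result = float(math.factorial(len(quad)))
    for letter, count in letter_counts.items():
        # letters absent from the quadruplet would contribute a factor of 1
        result *= background[letter] ** count / math.factorial(count)
    return result


# one dictionary lookup per letter replaces the nested loops over amino_acids
probabilities = {quad: quadruplet_probability(quad, background) for quad, _ in quadruplets}
print(probabilities)
```

With a dictionary there is no temporary structure to fill and clear on each iteration; every quadruplet is handled by a single function call.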

Thanks, S :-)

Tags: python, beginner, data-structures
