# Probability Distribution on Sequences

Using a transition matrix, a probability distribution is assigned to sequences:

\[

p(x) = Tx

\]

where \(T\) is the transition matrix and \(x\) is the sparse binary representation of the sequence. For example,

\[

x = \left( \begin{array}{lllll}

1 & 0 & 0 & 0 & 0\\

0 & 1 & 1 & 0 & 0\\

0 & 0 & 0 & 1 & 0\\

0 & 0 & 0 & 0 & 1

\end{array} \right)

\]

represents the sequence \( (a,c,c,g,t) = (1, 2, 2, 3, 4)\).
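As a concrete illustration, here is a minimal sketch in Python. The transition matrix `T`, the alphabet ordering, and the uniform initial distribution are all assumptions for the example, not taken from the text; it builds the sparse binary representation above, applies \(T\) to it, and evaluates the usual Markov-chain product for the sequence probability.

```python
import numpy as np

# Hypothetical column-stochastic transition matrix over (a, c, g, t):
# T[i, j] = P(next symbol = i | current symbol = j). Columns sum to 1.
T = np.array([
    [0.4, 0.1, 0.2, 0.3],
    [0.2, 0.5, 0.2, 0.1],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
])

ALPHABET = {"a": 0, "c": 1, "g": 2, "t": 3}

def one_hot(seq):
    """Sparse binary representation: column j is the one-hot vector of seq[j]."""
    x = np.zeros((4, len(seq)))
    for j, s in enumerate(seq):
        x[ALPHABET[s], j] = 1.0
    return x

x = one_hot("accgt")

# Column j of T @ x is the next-symbol distribution given the j-th symbol.
Tx = T @ x

# Markov-chain probability of the sequence (uniform initial distribution assumed):
idx = [ALPHABET[s] for s in "accgt"]
p = 0.25
for a, b in zip(idx, idx[1:]):
    p *= T[b, a]
print(p)  # 0.25 * P(c|a) * P(c|c) * P(g|c) * P(t|g)
```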

Once the corresponding probability distribution function (pdf) of each sequence is obtained, the Kullback-Leibler divergence is defined for each pair of sequences:

\[

KL^1(x \| y) = KL(p(x) \,\|\, p(y)) = E_{p(x)}\!\left[ \log{\frac{p(x)}{p(y)}} \right] = \sum_{H} p(x) \log{\frac{p(x)}{p(y)}}

\]
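The sum above can be sketched directly, assuming `p` and `q` are probability vectors indexed by a common enumeration of the sequence space \(H\) (the toy distributions below are purely illustrative):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum over H of p * log(p / q).

    Terms with p == 0 contribute nothing (0 * log 0 := 0).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Toy pdfs over a four-element sequence space (hypothetical values):
p = [0.5, 0.25, 0.125, 0.125]
q = [0.25, 0.25, 0.25, 0.25]
print(kl_divergence(p, q))  # 0.25 * ln 2
print(kl_divergence(p, p))  # 0.0
```

Note that \(KL\) is asymmetric: `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ.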

For gapped alignments of sequences of variable length, a pdf can in principle still be defined for each sequence on a sequence space of variable length. For practical applications, the sequence space can be restricted to the maximum length of the sequences under consideration.

Can we efficiently calculate the sum in \(KL^1\)? Probably yes.
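One reason to expect "yes": if the pdfs come from Markov chains, the chain rule for KL divergence lets the sum over all \(4^n\) length-\(n\) sequences factorize into per-position terms. A sketch under that assumption follows; the matrices `T1`, `T2` and initial distributions `pi1`, `pi2` are hypothetical, and the exponential brute-force sum is included only to check the cheap decomposition against it.

```python
import itertools
import numpy as np

# Hypothetical Markov models over a 4-letter alphabet: column-stochastic
# transition matrices (T[b, a] = P(next = b | current = a)) and initial pdfs.
pi1 = np.array([0.4, 0.3, 0.2, 0.1])
T1 = np.array([
    [0.4, 0.1, 0.2, 0.3],
    [0.2, 0.5, 0.2, 0.1],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
])
pi2 = np.full(4, 0.25)
T2 = np.full((4, 4), 0.25)

def seq_prob(seq, pi, T):
    """Probability of an index sequence under a Markov chain (pi, T)."""
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[b, a]
    return p

def kl_bruteforce(n):
    """Direct sum over all 4**n length-n sequences: exponential in n."""
    total = 0.0
    for seq in itertools.product(range(4), repeat=n):
        p = seq_prob(seq, pi1, T1)
        q = seq_prob(seq, pi2, T2)
        total += p * np.log(p / q)
    return total

def kl_markov(n):
    """Chain rule for KL: the sum factorizes position by position, O(n * 16)."""
    kl = float(np.sum(pi1 * np.log(pi1 / pi2)))
    marg = pi1.copy()  # distribution of the t-th symbol under model 1
    for _ in range(n - 1):
        per_state = np.sum(T1 * np.log(T1 / T2), axis=0)  # column-wise KL
        kl += float(marg @ per_state)
        marg = T1 @ marg
    return kl

print(kl_bruteforce(5), kl_markov(5))  # the two agree
```

The decomposition scales linearly in the sequence length, so restricting the sequence space to the maximum length, as above, stays tractable.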