Dirichlet Prior for Sequence Family Modelling

The Dirichlet process is used to model probability distributions that are mixtures of an unknown number of components. Amino acid frequencies at homologous positions within related proteins have been fruitfully modeled by Dirichlet mixtures, and we use the Dirichlet process to derive such mixtures with an unbounded number of components. This application of the method requires several technical innovations to sample an unbounded number of Dirichlet-mixture components. The resulting Dirichlet mixtures model multiple-alignment data substantially better than do previously derived ones. They consist of over 500 components, in contrast to fewer than 40 previously, and provide a novel perspective on the structure of proteins. Individual protein positions should be seen not as falling into one of several categories, but rather as arrayed near probability ridges winding through amino acid multinomial space.

When residue \(a\) is observed \(c_{ja}\) times at position \(j\) of the alignment, the maximum-likelihood estimate of the emission probability \(e_{M_j(a)}\) is
\[
e_{M_j(a)} = \frac{c_{ja}}{\sum_{a'} c_{ja'}} \tag{5.2}\label{eqn:05.02}
\]
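
As a minimal sketch of the estimate in (5.2), the following Python snippet counts the residues in a single alignment column and normalizes the counts. The function name `ml_emission`, the 20-letter alphabet string, and the toy column `"LLLIV"` are illustrative assumptions, not part of the original notes; gap characters are simply skipped.

```python
from collections import Counter

# Assumption: the standard 20 amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ml_emission(column):
    """Maximum-likelihood emission probabilities for one alignment column.

    `column` is a string of residues observed at position j; characters
    outside AMINO_ACIDS (for example gaps) are ignored.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)  # c_{ja}
    total = sum(counts.values())                                   # sum over a' of c_{ja'}
    if total == 0:
        raise ValueError("column contains no residues")
    return {a: counts[a] / total for a in AMINO_ACIDS}

# Hypothetical toy column: three L, one I, one V.
probs = ml_emission("LLLIV")
print(probs["L"], probs["I"], probs["W"])
```

Any residue that never appears in the column, such as `W` here, receives probability exactly 0.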

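Simple pseudocounts

With maximum-likelihood estimation, a residue that has not been observed at a position is assigned probability 0.
The pseudocount method avoids this by adding a fixed amount to every count, for example an amount proportional to the background distribution.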
\[
e_{M_j(a)} = \frac{c_{ja} + A q_a}{\sum_{a'} c_{ja'} + A} \tag{5.3}\label{eqn:05.03}
\]
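Here \(c_{ja}\) is the observed count, \(q_a\) is the background frequency of residue \(a\), and \(A\) is the weight given to the pseudocounts relative to the observed counts. A value of \(A\) around 20 appears to work well for protein alignments.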
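
The pseudocount estimate in (5.3) can be sketched in the same way. In this illustration the function name `pseudocount_emission`, the uniform background (\(q_a = 1/20\)), and the toy column are assumptions made for the example; in practice \(q_a\) would be estimated from a large set of protein sequences, and it must sum to 1 for the result to be a probability distribution.

```python
from collections import Counter

# Assumption: the standard 20 amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseudocount_emission(column, background, weight=20.0):
    """Emission probabilities with background-proportional pseudocounts.

    `background` maps each residue a to q_a (assumed to sum to 1);
    `weight` is A, the total pseudocount mass added to the column.
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)  # c_{ja}
    total = sum(counts.values())
    return {
        a: (counts[a] + weight * background[a]) / (total + weight)
        for a in AMINO_ACIDS
    }

# Assumption: a uniform background, used only to keep the example small.
uniform = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}

# Hypothetical toy column: three L, one I, one V.
probs = pseudocount_emission("LLLIV", uniform, weight=20.0)
print(probs["L"], probs["W"])
```

With only five observed residues and \(A = 20\), the background dominates: the unobserved residue `W` now gets a small non-zero probability, and the estimate for `L` is pulled well below its raw frequency of 0.6.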

In Bayesian estimation, we place a prior distribution \(P(\theta)\) on the parameters \(\theta\) and estimate the posterior distribution of the parameters after observing the data \(D\).
\[
P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}
\]
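Here the parameters \(\theta\) are the probabilities of the model. In the Bayesian framework, the pseudocount method described above corresponds to assuming a Dirichlet prior with parameters \(\alpha_a = A q_a\).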
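
A brief sketch of why this correspondence holds, using the standard Dirichlet-multinomial argument rather than anything specific to these notes. The position index \(j\) is dropped, \(\theta_a\) denotes the emission probability of residue \(a\), \(c_a\) its observed count, and the background frequencies are assumed to satisfy \(\sum_a q_a = 1\). The prior and the likelihood are
\[
P(\theta) \propto \prod_a \theta_a^{\alpha_a - 1}, \qquad P(D | \theta) \propto \prod_a \theta_a^{c_a},
\]
so the posterior is again a Dirichlet distribution, with parameters \(c_a + \alpha_a\):
\[
P(\theta | D) \propto \prod_a \theta_a^{c_a + \alpha_a - 1}.
\]
Its mean is
\[
E[\theta_a | D] = \frac{c_a + \alpha_a}{\sum_{a'} (c_{a'} + \alpha_{a'})} = \frac{c_a + A q_a}{\sum_{a'} c_{a'} + A},
\]
which has the same form as equation (5.3). Under these assumptions, the pseudocount estimate is the posterior mean of \(\theta_a\).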

Reference
1. VA Nguyen, J Boyd-Graber and SF Altschul, Dirichlet Mixtures, the Dirichlet Process, and the Structure of Protein Space, J Comput Biol. 2013 Jan;20(1):1-18. doi: 10.1089/cmb.2012.0244.