This method implements the reliable metric for quantifying the conservation of all amino acid positions in a multiple sequence alignment (MSA) of a given protein as proposed by Valdar [1]. The conservation score is calculated based on three elements:
Each of the above elements has a different contribution to the final score as defined by three parameters: $\alpha$, $\beta$, and $\gamma$: $$score(x) = (1-t(x))^{\alpha}(1-r(x))^{\beta}(1-g(x))^{\gamma}$$ For a formal definition of each term look below. By default they are set to $1$, $0.5$, and $2$, respectively. For each amino acid position this method will provide a value in the range from $0$ to $1$, where $score(x) = 1$ means that the position $x$ is strictly conserved, while $score(x)=0$ means that the position $x$ is not conserved.
The conservation scores derived from a multiple sequence alignment of a protein and its homologues can usually be used to infer the importance of the residues for maintaining the structure and function of the protein [1]. Recommended software for generating multiple sequence alignment is listed below:
Input data:
Output data:
W. Valdar[1] proposes a generalized formula for scoring conservation as a function of three variables: symbol diversity $t$, stereochemical diversity $r$, and gaps $g$. For a position $x$, it is defined as $$score(x) = (1-t(x))^{\alpha}(1-r(x))^{\beta}(1-g(x))^{\gamma}$$ where $\alpha$, $\beta$, and $\gamma$ weight the importance of each term.
The symbol diversity $t(x)$ is specified as Shannon's entropy: $$t(x)={\lambda_t}\sum_{a}^{K}{p_a \log_2 p_a}$$ where $K$ is the alphabet size, i.e., 20 amino acids and one gap symbol, and $p_a$ is the probability of observing the $a$th symbol type. $\lambda_t$ scales the entropy to range $[0,1]$ and is defined as $$\lambda_t=[\log_2(\min(N,K))]^{-1}$$ where $N$ the number of sequences in the alignment. By normalizing each $p_a$, a sequence weighting can be incorporated into Shannon's entropy: $$p_a=\sum_{i\in \left \{ i:s_i(x)=a \right \} }{\omega_i}$$ where $w_i$ is the weight of the $i$th sequence and $s_i(x)$ is the symbol type at position $x$ in that sequence. Ideally, the sum of all weights should be $1$. Thus, the weighting scheme of Henikoff and Henikoff is used: $$\omega_i=\frac{1}{L}\sum_{x}^{L}\frac{1}{k_x n_{x_i}}$$ where $L$ is the length of the alignment, $k_x$ is the number of symbol types present at the $x$th position and $n_{x_i}$ is the frequency of amino acid $i$ at position $x$.
The stereochemical diversity is denoted by $r(x)$ and uses a substitution matrix. Each amino acid $a$ is represented by a point $X_a$ in 20-dimensional space such that $$X_a=\begin{bmatrix}M(a,a_1) \\ M(a, a_2) \\ ... \\M(a,a_{20}) \end{bmatrix}$$ where $a_i$ is the the $i$th amino acid type and $M(a,b)$ is the similarity between amino acids $a$ and $b$ based on the BLOSUM62 matrix denoted by $m$: $$M(a,b)=\frac{m(a,b)}{\sqrt{m(a,a)m(b,b)}}$$ Then, for a position $x$, we define the consensus amino acid type as point $\overline{X}(x)$: $$\overline{X}(x)=\frac{1}{k_x}\sum_{a}^{k_x}X_a$$ where $k_x$ is the number of amino acide types present at position $x$. Finally, the stereochemical diversity is calculated as the average distance of observed amino acids from the consensus point: $$r(x)=\lambda_r\frac{1}{k_x}\sum_{a}^{k_x}|\overline{X}(x)-X_a|$$ where the scalar $\lambda_r=\left[\sqrt{20(max(M)-min(M))^2}\right]^{-1}$ ensures $r=1$.
The gap cost $g(x)$ is simply defined as the fraction of symbols in column $x$ that are gaps.
[1] Valdar, W (2002): Scoring Residue Conservation, Proteins, Vol. 48:227-241.