Supplementary Materials

Supplementary Table 1

Scoring a simple alignment for covariance by the OMES algorithm. The alignment shown yields six possible pairs of residues: YH, YF, YM, SH, SF and SM. Summing the last column of the table and dividing by 7 (the number of sequences in the alignment with non-gapped residues at both positions) yields a score of 1.31 for this pair of columns.
Pair Cxi Cxj Nex Nobs (Nobs-Nex)^2
YH 4 4 2.29 4 2.94
YF 4 1 0.57 0 0.33
YM 4 2 1.14 0 1.31
SH 3 4 1.71 0 2.94
SF 3 1 0.43 1 0.33
SM 3 2 0.86 2 1.31
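
The arithmetic in the table can be reproduced with a minimal Python sketch. The function name `omes_score` and the two example columns are assumptions made for illustration: the columns are one arrangement of seven sequences that is consistent with the counts in the table, not necessarily the alignment the authors used. The sketch also assumes that residue counts are taken only over sequences that are non-gapped at both positions.

```python
from collections import Counter
from itertools import product


def omes_score(col_i, col_j, gap="-"):
    """Sum (Nobs - Nex)^2 over all possible residue pairs, divided by N."""
    # Keep only sequences with a residue (not a gap) at both positions.
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != gap and b != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)   # Cxi for each residue in column i
    count_j = Counter(b for _, b in pairs)   # Cxj for each residue in column j
    observed = Counter(pairs)                # Nobs for each residue pair
    total = 0.0
    # Every combination is scored, including pairs that are never observed.
    for a, b in product(count_i, count_j):
        expected = count_i[a] * count_j[b] / n   # Nex
        total += (observed[(a, b)] - expected) ** 2
    return total / n


# Hypothetical columns matching the table: Y x4, S x3 and H x4, F x1, M x2.
column_i = "YYYYSSS"
column_j = "HHHHFMM"
print(round(omes_score(column_i, column_j), 2))  # 1.31
```

Unobserved pairs such as YF still contribute to the sum, which is why the table lists all six combinations rather than only the three that occur.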

Supplementary Figures 1-5

Shown are the results of running the Conservation, McBASC, OMES, SCA and MI algorithms for all 224 Pfam alignments. A pdb file was chosen for each Pfam alignment as described in the methods section of the main paper. The x-axis of each trace is scaled from the minimum covariance score to the maximum covariance score. The y-axis is scaled from the minimum pair distance to the maximum pair distance. The vertical red lines indicate the 75 top-scoring pairs of residues. The horizontal red line indicates the 50th percentile of all pair distances. The orange line indicates 8 Å.

Supplementary Figure 1, Supplementary Figure 2, Supplementary Figure 3, Supplementary Figure 4, Supplementary Figure 5

Supplementary Figures 6-11

The accuracy of predicting residue contacts (Cβ-Cβ within 8 Å) as a function of the number of predictions each algorithm was asked to make for all 224 Pfam families.

Conservation, McBASC, OMES, SCA, MI, random
Supplementary Figure 6, Supplementary Figure 7, Supplementary Figure 8, Supplementary Figure 9, Supplementary Figure 10, Supplementary Figure 11
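
The accuracy measure plotted in these figures can be illustrated with a short Python sketch. The function name `contact_accuracy` and its parameters are hypothetical. The sketch assumes that a higher covariance score means a pair is predicted to be in contact, and it treats a distance of exactly 8 Å as a contact; the text above does not specify either detail.

```python
def contact_accuracy(scores, distances, n_predictions, cutoff=8.0):
    """Fraction of the n top-scoring pairs whose Cb-Cb distance is within cutoff."""
    # Rank residue pairs from highest to lowest covariance score (assumption).
    ranked = sorted(zip(scores, distances), key=lambda p: p[0], reverse=True)
    top = ranked[:n_predictions]
    if not top:
        return 0.0
    hits = sum(1 for _, distance in top if distance <= cutoff)
    return hits / len(top)


# Made-up scores and distances (in angstroms) for four residue pairs.
example_scores = [0.9, 0.8, 0.1, 0.5]
example_distances = [5.0, 12.0, 20.0, 7.5]
for n in (1, 2, 3):
    print(n, contact_accuracy(example_scores, example_distances, n))
```

Evaluating the function for increasing values of `n_predictions` gives one accuracy curve per family and algorithm.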

Supplementary Figures 12-17

Shown are the results of running the algorithms over all 224 Pfam alignments. The x-axis of each trace ranges from the 0th to the 100th score percentile. The y-axis of each trace is the pair distance percentile corresponding to Cβ-Cβ distances in the chosen crystal structure. The red line in each trace represents the 50th percentile of pair distance. Each point on each trace is the average of the predictions made for 20 pairs of columns. See Normalization schemes below.

Conservation, McBASC, OMES, SCA, MI, random
Supplementary Figure 12, Supplementary Figure 13, Supplementary Figure 14, Supplementary Figure 15, Supplementary Figure 16, Supplementary Figure 17

Normalization schemes

The goal of all x-axis normalizations was to convert all the predictions from each protein family into a percentile score that could be compared across algorithms and protein families. A simple approach would be to have each point on the x-axis represent a fixed width of percentile. For example, the x-axis could be divided into 4,000 points, with each point spanning 0.025 of a percentile. The rightmost point on a graph under such a scheme would be the average of the points with covariance scores between the 99.975th and 100th percentile.

Such an approach is too simplistic for supplemental figures 12-17, which show the results for all 224 protein families. The problem is that the alignments have different numbers of columns, and the number of predictions each algorithm makes about each protein family goes up as the square of the number of columns. Imagine two protein families, one with 100 columns and one with 500 columns. The smaller family will have ~100^2/2 = 5,000 pairs of residues for each algorithm to predict. The larger family will have ~500^2/2 = 125,000 pairs of residues. If graphed with a constant number of points on the x-axis, say 5,000, these two protein families will look drastically different even if an algorithm is similarly effective for both of them. For the smaller family, each point would represent one prediction of the algorithm (5,000 pairs of residues / 5,000 bins). For the larger family, each point would be the average of 25 predictions of the algorithm (125,000 pairs of residues / 5,000 bins). The performance of the algorithm would look much noisier and "worse" for the smaller family than for the larger family even if, on average, the covariance algorithm was an equally good predictor of distance for both families. In fact, for the small family, the noise level would be similar to that of the data shown in Figure 3 in the main paper, where there is a good deal of variation between adjacent points. The large family would look much smoother because much of the noise between adjacent points would have been averaged out.
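
The numbers in this example can be checked with a few lines of Python. The helper name `approx_pairs` is hypothetical. It uses the same n^2/2 approximation as the text; the exact number of distinct column pairs, n(n-1)/2, is printed alongside for comparison.

```python
def approx_pairs(n_columns):
    """Approximate number of column pairs, using n^2 / 2 as in the text."""
    return n_columns ** 2 // 2


n_bins = 5000  # constant number of x-axis points assumed in the example
for n_columns in (100, 500):
    approx = approx_pairs(n_columns)
    exact = n_columns * (n_columns - 1) // 2
    # Columns printed: alignment width, approximate pairs, exact pairs,
    # and predictions averaged into each x-axis point.
    print(n_columns, approx, exact, approx / n_bins)
```

With a fixed bin count, each point for the 500-column family averages about 25 times as many predictions as each point for the 100-column family, which is the source of the difference in apparent noise.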

To solve this problem, each point in supplemental figures 12-17 has been constrained to represent the average of 20 residue-pair distances placed at an appropriate percentile. (The first and last points on the x-axis may have fewer than 20 predictions, since the number of predictions for each family will not, as a general rule, be an exact multiple of 20.) The result of this normalization is that the longer alignments have more points in their graphs, but the overall level of noise is similar across graphs. That is, we argue that, as displayed, differences in noise are due to true algorithm noise and are not an artifact of the length of each alignment.
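
A minimal Python sketch of this fixed-count scheme follows. It is not the authors' Java code. The function name `chunked_trace` is hypothetical, the x position of each point is assumed to be the midpoint score percentile of its group, and for simplicity any leftover predictions are placed in the final group only, whereas the text notes that both the first and last points may be short. The `values` argument can hold whatever is plotted on the y-axis, such as pair distance percentiles.

```python
def chunked_trace(scores, values, chunk_size=20):
    """Return (score percentile, mean value) points, chunk_size pairs per point."""
    # Order residue pairs from lowest to highest covariance score.
    ranked = sorted(zip(scores, values), key=lambda p: p[0])
    n = len(ranked)
    points = []
    for start in range(0, n, chunk_size):
        chunk = ranked[start:start + chunk_size]
        mean_value = sum(v for _, v in chunk) / len(chunk)
        # Assumption: plot each point at the midpoint rank of its chunk.
        midpoint_rank = start + len(chunk) / 2
        points.append((100.0 * midpoint_rank / n, mean_value))
    return points


# Fifty made-up pairs produce three points: two of 20 pairs and one of 10.
demo_scores = list(range(50))
demo_values = [float(s) for s in demo_scores]
print(chunked_trace(demo_scores, demo_values))
```

Under this scheme a family with more column pairs simply contributes more points, while every full point averages the same number of predictions.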

Figure 5 in the main paper is meant to represent an average across all 224 protein families. We could not, however, simply average all of the points on all of the graphs in the supplemental figures, because some protein families yield many more predictions than others and we did not wish to bias the average towards the longer protein families. So in this case, we returned to a constant number of bins per family (4,000, each representing 0.025%) and took the average of all 224 families binned in that way. For example, to calculate the rightmost point for all of the graphs in Figure 5, for each protein family we calculated the average value of all the points for that protein family that lay between the 99.975th percentile and the 100th percentile. We then plotted the average of all of those 224 points. In this case, we let the fact that we are collapsing across protein families smooth out the noise.
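
The two-stage averaging described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Java code: the function names are hypothetical, percentiles are assigned by rank with ties ignored, and a family that has no predictions in a given bin is skipped for that bin, a case the text does not address.

```python
def binned_family(scores, values, n_bins=4000):
    """Average one family's values within equal-width score-percentile bins."""
    ranked = sorted(zip(scores, values), key=lambda p: p[0])
    n = len(ranked)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for rank, (_, value) in enumerate(ranked):
        bin_index = rank * n_bins // n   # rank-based percentile bin
        sums[bin_index] += value
        counts[bin_index] += 1
    # None marks a bin that received no predictions from this family.
    return [s / c if c else None for s, c in zip(sums, counts)]


def average_across_families(families, n_bins=4000):
    """Average the per-family bin means so each family carries equal weight."""
    per_family = [binned_family(s, v, n_bins) for s, v in families]
    averaged = []
    for b in range(n_bins):
        present = [bins[b] for bins in per_family if bins[b] is not None]
        averaged.append(sum(present) / len(present) if present else None)
    return averaged


# Two made-up families of different sizes, binned into 2 bins for brevity.
families = [([1, 2, 3, 4], [10.0, 20.0, 30.0, 40.0]), ([1, 2], [0.0, 100.0])]
print(average_across_families(families, n_bins=2))  # [7.5, 67.5]
```

Because each family is reduced to one mean per bin before the cross-family average, a long alignment with many predictions counts no more than a short one.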

While these normalization schemes are unfortunately complex, we believe that they faithfully capture the intuitive sense of "percentile" and offer a fair comparison across protein families and algorithms. The Java code for all normalizations is available upon request.

Pfam families used in this study

The following 224 Pfam families were used in this study. They are listed in the same order as the graphs in all of the Supplementary Figures, going from left to right and top to bottom.

2-Hacid_DH_C, 2-oxoacid_dh, 3HCDH_N, aakinase, abhydrolase, aconitase, Aconitase_C, Acyl-CoA_dh, Acyl-CoA_dh_N, Acyl_transf, adenylatekinase, AhpC-TSA, AIRS, AIRS_C, Ald_Xan_dh_C2, aldedh, aldo_ket_red, alpha-amylase, Amidohydro_1, Amino_oxidase, aminotran_1_2, aminotran_3, aminotran_5, ANF_receptor, Anth_synt_I_N, arf, asp, ATP-grasp, B12-binding, Bet_v_I, beta-lactamase, BPL_LipA_LipB, carb_anhydrase, catalase, Cation_ATPase_C, cellulase, Chal_stil_syntC, CheW, chorismate_bind, CLP_protease, CN_hydrolase, CoA_binding, COesterase, COX3, cpn60_TCP1, CPSase_L_chain, CPSase_L_D2, CRAL_TRIO, CTP_transf_2, cyclin, cyclin_C, Cys_Met_Meta_PP, DAO, DegT_DnrJ_EryC1, DHDPS, DJ-1_PfpI, DNA_methylase, DNA_pol_B, DNA_pol_B_exo, DNA_topoisoIV, DnaJ_C, DSPc, dUTPase, E1-E2_ATPase, E1_dehydrog, ECH, EFG_IV, enolase, enolase_N, Epimerase, Exo_endo_phos, Exonuclease, FAD_binding_2, FAD_binding_4, fer4_NifH, ferritin, FGGY, FGGY_C, Fimbrial, flavodoxin, formyl_transf, FtsJ, G-alpha, Gal-bind_lectin, GATase, GATase_2, GFO_IDH_MocA, GHMP_kinases, GLFV_dehydrog, gln-synt, Glyco_hydro_1, Glyco_hydro_10, Glyco_hydro_16, Glyco_hydro_17, Glyco_hydro_18, Glyco_hydro_19, Glyco_hydro_28, Glyco_hydro_3, Glyco_hydro_3_C, Glyco_hydro_9, Glyoxalase, GMC_oxred, gpdh, gpdh_C, Gram-ve_porins, hemocyanin, hemocyanin_C, Hist_deacetyl, hormone, hormone_rec, HSP70, inositol_P, Isochorismatase, isodh, Jacalin, ketoacyl-synt, ketoacyl-synt_C, kinesin, lactamase_B, ldh, ldh_C, lectin_legB, Lipase_3, Lipase_GDSL, lipocalin, Lipoprotein_6, lyase_1, malic_N, MATH, MCPsignal, Metallophos, MIP, molybdopterin, Monooxygenase, Mur_ligase, myosin_head, NAD_binding_1, NAD_binding_2, NDK, Nitroreductase, NTP_transferase, OMPdecase, Orn_Arg_deC_N, Orn_DAP_Arg_deC, OTCace, OTCace_N, oxidored_FMN, PALP, Peptidase_C1, Peptidase_M20, Peptidase_M24, Peptidase_S8, Peripla_BP_2, Peripla_BP_like, peroxidase, PFK, pfkB, PGAM, PGK, PGM_PMM_I, Phage_integrase, phoslip, PI3_PI4_kinase, pilin, PK, polyprenyl_synt, PP2C, 
Pribosyltran, pro_isomerase, proteasome, PTS_EIIA_2, pyr_redox, recA, Reprolysin, RhoGAP, RhoGEF, ribonuc_L-PSP, ribonuc_red_lgC, ribonuc_red_sm, ribonuclease_T2, Ribosomal_S7, rnaseA, rnaseH, SBP_bac_1, SCP, Semialdhyde_dh, Semialdhyde_dhC, serine_carbpept, serpin, SLT, SMC_N, sodcu, SRP54, Sulfotransfer, Terpene_synth, thiolase, thiolase_C, TIM, TIR, TonB_dep_Rec, Topoisom_bac, TP_methylase, TPP_enzymes, TPP_enzymes_C, TPP_enzymes_N, transferrin, transket_pyr, transketolase, transketolase_C, Transpeptidase, Transposase_11, tRNA-synt_1, tRNA-synt_1b, tRNA-synt_1c, tRNA-synt_2, tRNA-synt_2b, tyrosinase, UDG, UQ_con, Usp, UvrD-helicase, vwa, Y_phosphatase, Zn_carbOpept