[Links to Supplementary Figures 1–17]
The goal of all x-axis normalizations was to convert the predictions for each protein family into percentile scores that can be compared across algorithms and protein families. A simple approach would be to have each point on the x-axis represent a fixed percentile width. For example, the x-axis could be divided into 4,000 points, each representing 0.025 percentile; the rightmost point on such a graph would then be the average of the predictions with covariance scores between the 99.975th and 100th percentile.
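The fixed-width scheme can be sketched as follows. This is an illustrative reconstruction, not the authors' released Java code; the class and method names (`FixedWidthBins`, `fixedWidthBins`) are ours.

```java
import java.util.Arrays;

public class FixedWidthBins {
    /**
     * Sort residue-pair distances by their covariance score (ascending),
     * then average them into a fixed number of equal-width percentile bins.
     * With numBins = 4000, each bin spans 0.025 percentile; the last bin
     * averages the pairs scoring between the 99.975th and 100th percentile.
     */
    public static double[] fixedWidthBins(double[] scores, double[] distances, int numBins) {
        int n = scores.length;
        // Order the pairs by covariance score so bins follow percentile order.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[a], scores[b]));

        double[] sums = new double[numBins];
        int[] counts = new int[numBins];
        for (int rank = 0; rank < n; rank++) {
            // Map each percentile rank to its bin.
            int bin = Math.min((int) ((long) rank * numBins / n), numBins - 1);
            sums[bin] += distances[order[rank]];
            counts[bin]++;
        }
        double[] avg = new double[numBins];
        for (int b = 0; b < numBins; b++) {
            avg[b] = counts[b] > 0 ? sums[b] / counts[b] : Double.NaN;
        }
        return avg;
    }
}
```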
Such an approach is too simplistic for Supplementary Figures 12–17, which show the results for all 224 protein families. The problem is that the alignments all have different numbers of columns, and the number of predictions each algorithm makes for a protein family grows as the square of the number of columns. Imagine two protein families, one with 100 columns and one with 500 columns. The smaller family will have ~100²/2 = 5,000 pairs of residues for each algorithm to predict; the larger family will have ~500²/2 = 125,000 pairs. If graphed with a constant number of points on the x-axis, say 5,000, these two protein families would look drastically different even if an algorithm were similarly effective on both. For the smaller family, each point would represent a single prediction (5,000 pairs of residues / 5,000 bins); for the larger family, each point would be the average of 25 predictions (125,000 pairs of residues / 5,000 bins). The performance of the algorithm would look much noisier and "worse" for the smaller family than for the larger one even if, for example, the covariance algorithm were on average an equally good predictor of distance for both families. In fact, for the small family the noise level would be similar to the data shown in Figure 3 of the main paper, where there is a good deal of variation between adjacent points; the large family would look much smoother because much of the noise between adjacent points would have been averaged out.
To solve this problem, each point in Supplementary Figures 12–17 has therefore been constrained to represent the average of 20 residue-pair distances, placed at the appropriate percentile. (The first and last points on the x-axis may have fewer than 20 predictions, since the number of predictions for a family will not, as a general rule, be an even multiple of 20.) The result of this normalization is that longer alignments have more points in their graphs, but the overall noise level is similar from graph to graph. That is, we argue that, as displayed, differences in noise reflect true algorithm noise and are not an artifact of the length of each alignment.
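A minimal sketch of this fixed-count binning, again with hypothetical names (`FixedCountBins`, `fixedCountBins`) rather than the authors' actual code; this version puts any short bin at the end, whereas the figures may split the remainder between the first and last points:

```java
public class FixedCountBins {
    /**
     * Group residue-pair distances (already sorted by covariance score)
     * into consecutive bins of binSize predictions each (20 in the
     * supplementary figures), so that the noise per plotted point is
     * comparable across alignment lengths. The final bin may hold fewer
     * than binSize predictions when n is not a multiple of binSize.
     */
    public static double[] fixedCountBins(double[] sortedDistances, int binSize) {
        int n = sortedDistances.length;
        int numBins = (n + binSize - 1) / binSize;  // ceiling division
        double[] avg = new double[numBins];
        for (int b = 0; b < numBins; b++) {
            int start = b * binSize;
            int end = Math.min(start + binSize, n);
            double sum = 0;
            for (int i = start; i < end; i++) sum += sortedDistances[i];
            avg[b] = sum / (end - start);
        }
        return avg;
    }
}
```

Under this scheme a 125,000-pair family yields 6,250 points and a 5,000-pair family yields 250, but each point averages the same number of predictions.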
Figure 5 in the main paper is meant to represent an average across all 224 protein families. We could not, however, simply average all of the points on all of the graphs in the supplementary figures, because some protein families make many more predictions than others and we did not wish to bias the average toward the longer protein families. In this case we therefore returned to a constant number of bins per family (4,000, each representing 0.025%) and took the average of all 224 families binned in that way. For example, to calculate the rightmost point of each graph in Figure 5, we calculated, for each protein family, the average value of all of that family's points lying between the 99.975th and the 100th percentile, and then plotted the average of those 224 values. In this case, we let the collapsing across protein families smooth out the noise.
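The cross-family average can be sketched as below, assuming each family's curve has already been reduced to the same fixed number of percentile bins (e.g. by a fixed-width binning step); `FamilyAverage` and `averageAcrossFamilies` are our illustrative names, not the authors':

```java
public class FamilyAverage {
    /**
     * Average per-family percentile curves across families. Each row of
     * familyBins holds one family's distances already averaged into the
     * same fixed number of percentile bins (4,000 bins of 0.025% each for
     * the main-paper figure), so every family contributes equally to each
     * plotted point regardless of its alignment length.
     */
    public static double[] averageAcrossFamilies(double[][] familyBins) {
        int numBins = familyBins[0].length;
        double[] avg = new double[numBins];
        for (double[] family : familyBins) {
            for (int b = 0; b < numBins; b++) avg[b] += family[b];
        }
        for (int b = 0; b < numBins; b++) avg[b] /= familyBins.length;
        return avg;
    }
}
```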
While these normalization schemes are unfortunately complex, we believe that they faithfully capture the intuitive sense of "percentile" and offer a fair comparison across protein families and algorithms. The Java code for all normalizations is available upon request.