This compressed-sensing distance matrix approach to protein fold recognition was tested using sets of fold families from the CATH database. Only fully-resolved protein domain structures (no missing residues) with length between 100—400 residues were considered. CATH v3.3 lists 1288 distinct fold families, but only 562 meet these criteria. Recognition accuracy was assessed by the following method:
with the third step repeated five times for each dimensionality.
There are 466 CATH fold families with at least three domains ($n=3$), and we ran the search algorithm while varying $s$, the number of samples per family:
Notice that using at most 20 samples per fold (), the algorithm has a 95% classification accuracy rate at just 21 dimensions. Compare this to the baseline, random-guess accuracy of 1/466 = 0.2%. Essentially, a six-by-six pixel thumbnail of a protein’s distance matrix is distinctive enough to determine its fold 95% of the time. Furthermore, there is no substantial improvement at higher dimensions, suggesting that the intrinsic dimensionality of the space of protein folds is only around twenty. Even when the support of each family is restricted to two representatives out of the 932 dictionary atoms (), the accuracy is still greater than 80%.
The runtime of the algorithm depends on the reconstruction difficulty (the number of samples per class $s$ in the dictionary). On a Linux virtual machine running on a 2.8 GHz core with 1 GB RAM, this was about 0.3—1 seconds per protein. This figure includes distance matrix calculation (i.e. dictionary construction) time; the classification time itself is less.
Bibliography →