The primary inspiration for this project was a paper that used compressed sensing for face recognition:
(available at their research group webpage)
We used a modified version of YALL1 to solve the optimization problem.
The figures are built with the Protovis visualization toolkit, and the 3D protein view is powered by Mr. Doob’s three.js JavaScript rendering engine.
There are several possible extensions of this work. With respect to search accuracy, dictionary learning techniques are very likely to improve performance substantially. In this work, we simply solved the problem $\min \norm{\vec x}_{1}$ subject to $\norm{\mat D \vec x - \vec y}_{2} \lt \epsilon$. But our task is classification, and explicitly optimizing for it (rather than for minimum $\ell^1$-norm under the $\ell^2$ residual constraint) may well improve performance. That is, we should give up some of the dictionary’s representational power (accepting solutions with larger residuals) in exchange for greater discriminative power. Techniques for such supervised dictionary learning are being actively researched; see this paper or this one.
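For readers who want to experiment with the sparse-coding step, here is a minimal Python sketch. It is not the modified YALL1 solver used in this project. It swaps in plain ISTA (iterative soft-thresholding) on the penalized form $\tfrac{1}{2}\norm{\mat D \vec x - \vec y}_{2}^{2} + \lambda \norm{\vec x}_{1}$, which is closely related to the constrained problem above but not identical to it. The function names, the value of `lam`, and the random dictionary are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal operator of t * l1-norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(D, y, lam=0.05, n_iter=500):
    """Approximately minimize 0.5 * ||D x - y||_2^2 + lam * ||x||_1 with ISTA.

    Hypothetical stand-in for the solver used in the project; `lam` plays a
    role similar to the residual tolerance epsilon in the constrained form.
    """
    # Step size 1/L, where L = ||D||_2^2 is the Lipschitz constant of the gradient.
    L = np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)          # gradient of the least-squares term
        x = soft_threshold(x - grad / L, lam / L)
    return x


# Toy data (assumed): 100 dictionary atoms in 21 dimensions, unit-norm columns.
rng = np.random.default_rng(0)
D = rng.standard_normal((21, 100))
D /= np.linalg.norm(D, axis=0)

x_true = np.zeros(100)
x_true[[3, 40]] = [1.0, -0.5]             # a query built from two atoms
y = D @ x_true

x_hat = ista(D, y)
residual = np.linalg.norm(D @ x_hat - y)
print("residual:", residual)
print("entries above 1e-3:", np.flatnonzero(np.abs(x_hat) > 1e-3))
```

Raising `lam` trades a larger residual for a sparser coefficient vector, which is the same kind of trade-off discussed above, although a supervised dictionary learning method would tune the dictionary itself rather than a single penalty weight.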
There exist two programs for predicting individual contacts: SVMcon and NNcon (based on support vector machines and neural networks, respectively).
What is perhaps more interesting than the fold search itself is the demonstration that 21-dimensional vectors contain enough information to accurately determine a protein domain’s fold. The resized distance matrix representation of protein structure may be an easier target for structure prediction algorithms. Rather than predicting individual contacts, machine learning algorithms only need to estimate an aggregate quantity: the average distance between these 15 residues.
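As a rough illustration of how a structure might be reduced to 21 numbers, the sketch below block-averages a pairwise distance matrix down to 6×6 and keeps its upper triangle, which has 6·7/2 = 21 entries. This is an assumption about how a "resized distance matrix" could be built, not a description of the project's actual pipeline; the function name, the choice of contiguous residue groups, and the random coordinates are all hypothetical.

```python
import numpy as np


def resized_distance_vector(coords, size=6):
    """Block-average a residue-residue distance matrix to size x size.

    Returns the upper triangle (diagonal included) as a flat vector,
    so size=6 gives 21 values. Hypothetical helper for illustration only.
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    if n < size:
        raise ValueError("need at least `size` residues")

    # Full pairwise Euclidean distance matrix (n x n).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Split the chain into `size` contiguous groups of near-equal length.
    groups = np.array_split(np.arange(n), size)

    # Each entry of the small matrix is the mean distance between two groups.
    small = np.empty((size, size))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            small[i, j] = dist[np.ix_(gi, gj)].mean()

    return small[np.triu_indices(size)]


# Toy input (assumed): 90 residues with random 3D coordinates.
rng = np.random.default_rng(1)
coords = rng.standard_normal((90, 3)) * 10.0
v = resized_distance_vector(coords)
print(v.shape)
```

With 90 residues and six groups, each group happens to hold 15 residues, so every entry is an average over a 15×15 block of distances. A predictor targeting this representation would estimate those block averages directly instead of individual contacts.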
If you have specific questions or ideas for research directions, contact me or (even better) fork the project and start exploring them immediately.
GitHub repository →