TY - GEN
T1 - Predicting core columns of protein multiple sequence alignments for improved parameter advising
AU - DeBlasio, Dan
AU - Kececioglu, John
N1 - Funding Information: This research was supported by NSF grant IIS-1217886 to J.K. Publisher Copyright: © Springer International Publishing Switzerland 2016.
PY - 2016
Y1 - 2016
N2 - In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so its coreness can only be predicted. We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
AB - In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the folded three-dimensional structures of the proteins. When computing a protein multiple sequence alignment in practice, a reference alignment is not known, so its coreness can only be predicted. We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment’s accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner’s scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
UR - http://www.scopus.com/inward/record.url?scp=84984999015&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84984999015&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-43681-4_7
DO - 10.1007/978-3-319-43681-4_7
M3 - Conference contribution
SN - 9783319436807
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 77
EP - 89
BT - Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Proceedings
A2 - Frith, Martin
A2 - Pedersen, Christian Nørgaard Storm
PB - Springer-Verlag
T2 - 16th International Workshop on Algorithms in Bioinformatics, WABI 2016
Y2 - 22 August 2016 through 24 August 2016
ER -