Improving the Quality of Protein Sequence Alignments by Estimating their Accuracy

Recent advances in next-generation sequencing have provided biologists with massive amounts of DNA and protein data. A non-trivial step in analyzing such data is aligning similar sequences for comparative studies. Each alignment tool has its own strengths and weaknesses, and most aligners expose many user-specified parameters that can greatly affect the accuracy of the computed alignment. In practice, researchers must either rely on the default parameter setting or spend considerable time searching for a better one. For a given set of input sequences, our tool Facet (feature-based accuracy estimator) selects a good aligner and a good parameter setting. Facet does this by combining alignment features into an estimator of alignment accuracy. These independent features are informed by our knowledge of how proteins evolve and fold. Using Facet to choose a parameter setting improves alignment accuracy by up to 27% over the best default setting.
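
The selection idea can be illustrated with a minimal sketch. It is not Facet's implementation: it assumes the estimator is a weighted sum of feature values, and the two toy features, the weights, and every function name below are hypothetical stand-ins chosen only to show how an accuracy estimate could rank candidate alignments produced under different parameter settings.

```python
from typing import Callable, Dict, List, Sequence

# An alignment is a list of equal-length rows; '-' marks a gap.
Alignment = List[str]
Feature = Callable[[Alignment], float]


def gap_free_column_fraction(aln: Alignment) -> float:
    """Hypothetical feature: fraction of columns that contain no gaps."""
    ncols = len(aln[0])
    if ncols == 0:
        return 0.0
    gap_free = sum(
        1 for c in range(ncols) if all(row[c] != "-" for row in aln)
    )
    return gap_free / ncols


def identical_column_fraction(aln: Alignment) -> float:
    """Hypothetical feature: fraction of columns with one residue and no gaps."""
    ncols = len(aln[0])
    if ncols == 0:
        return 0.0
    identical = 0
    for c in range(ncols):
        column = {row[c] for row in aln}
        if len(column) == 1 and "-" not in column:
            identical += 1
    return identical / ncols


def estimate_accuracy(
    aln: Alignment, features: Sequence[Feature], weights: Sequence[float]
) -> float:
    """Assumed form of the estimator: a weighted sum of feature values."""
    return sum(w * f(aln) for f, w in zip(features, weights))


def choose_setting(
    candidates: Dict[str, Alignment],
    features: Sequence[Feature],
    weights: Sequence[float],
) -> str:
    """Return the name of the candidate with the highest estimated accuracy."""
    return max(
        candidates,
        key=lambda name: estimate_accuracy(candidates[name], features, weights),
    )


# Toy candidates: the same sequences aligned under two parameter settings.
candidates = {
    "default": ["AC-GT", "A-CGT"],
    "alternative": ["ACGT", "ACGT"],
}
features = [gap_free_column_fraction, identical_column_fraction]
weights = [0.5, 0.5]  # illustrative weights, not learned values

best = choose_setting(candidates, features, weights)
```

In this sketch no reference alignment is needed at selection time: each candidate is scored only from its own features, and the highest-scoring candidate is kept.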