"Perhaps one of the most immediate applications is in direct medical decisions concerning the matching of stem cell transplant donors to unrelated recipients," write Jennifer Listgarten, lead researcher for the paper, and her team. "However, high-resolution HLA (Human Leukocyte Antigen) typing is frequently unavailable due to its high cost or the inability to re-type historical data."
Listgarten looked into specialised immune cells and how they are trained to kill. For these immune cells to know what to target, they must first be sensitized to recognize small peptides from foreign sources, such as HIV or bacteria. The sensitization only occurs when the foreign peptide is paired up with an HLA molecule.
The way the HLA molecule interacts with the peptide defines how the immune system will react, and the HLA type can even determine a person's susceptibility to disease. However, there is a huge repertoire of HLA molecules and almost no two people have the same set. Laboratory tests to obtain high-quality HLA data currently take a long time and are also very expensive because they need specialized equipment.
The team from Microsoft Research, the National Cancer Institute, Massachusetts General, and the University of Oxford therefore modelled a large set of previously measured, high-quality HLA data and extrapolated statistical patterns from it. Using these patterns, they were able to take low quality HLA data and clean it up so that it was of higher quality than originally measured in the laboratory.
"We introduced new methodology to this area, improving upon the Expectation-Maximization (EM) based approaches currently used within the HLA community. Our improvements are achieved by using a parsimonious parameterization for haplotype distributions and by smoothing the maximum likelihood (ML) solution. These improvements make it possible to scale the refinement to a larger number of alleles and loci in a more computationally efficient and stable manner."
They have also shown how to modify their method to account for variation in HLA types by race and geographic area, though they warn that their modelling approach assumes the test populations are drawn from the same distribution. Caution is therefore needed in case-control studies, where cases and controls may be drawn from different distributions.
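The general idea of EM-based refinement with a smoothed estimate can be illustrated with a toy example. The sketch below is not the team's method: it does not use their parsimonious parameterization, it treats each person as carrying a single label rather than a pair of multi-locus haplotypes, and it replaces their smoothing with a simple pseudo-count. The function name `refine_types`, its parameters, and the sample data are all hypothetical.

```python
def refine_types(observations, n_iter=50, pseudo_count=0.1):
    """Toy EM refinement of ambiguous types (illustrative assumption only).

    observations: list of sets; each set holds the high-resolution labels
    that are consistent with one person's low-resolution measurement.
    Returns (estimated label frequencies, per-person posterior probabilities).
    """
    labels = sorted(set().union(*observations))
    # Start from a uniform guess over every label seen in the data.
    freq = {label: 1.0 / len(labels) for label in labels}

    for _ in range(n_iter):
        # Smoothing: every label starts with a small pseudo-count, so the
        # estimate is pulled away from the raw maximum likelihood solution.
        counts = {label: pseudo_count for label in labels}
        # E-step: split each person's unit of evidence across their
        # candidate labels in proportion to the current frequencies.
        for candidates in observations:
            total = sum(freq[c] for c in candidates)
            for c in candidates:
                counts[c] += freq[c] / total
        # M-step: renormalise the expected counts into frequencies.
        norm = sum(counts.values())
        freq = {label: count / norm for label, count in counts.items()}

    # Refinement: posterior probability of each candidate for each person.
    posteriors = []
    for candidates in observations:
        total = sum(freq[c] for c in candidates)
        posteriors.append({c: freq[c] / total for c in candidates})
    return freq, posteriors


# Hypothetical data: four unambiguous samples and one ambiguous sample.
samples = [
    {"A*02:01"},
    {"A*02:01"},
    {"A*02:01"},
    {"A*02:01", "A*02:05"},
    {"A*02:05"},
]
frequencies, refined = refine_types(samples)
print(refined[3])
```

In this made-up data set the ambiguous fourth sample is assigned a higher probability for the label that is more common among the unambiguous samples, which is the sense in which patterns learned from well-typed data can sharpen poorly typed data.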
"Further work in probabilistic HLA refinement may involve comparing EM based approaches to full Bayesian approaches," concludes Listgarten. The full paper is published in PLoS Computational Biology.