Phasing is the task or process of assigning alleles (the As, Cs, Ts and Gs) to the paternal and maternal chromosomes. The term is usually applied to types of DNA that recombine, such as autosomal DNA or the X-chromosome. Phasing can help to determine whether matches are on the paternal side or the maternal side, on both sides or on neither side. Phasing can also help with the process of chromosome mapping – assigning segments to specific ancestors. The use of phased data reduces the number of false positive matches, particularly for smaller segments under 15 centiMorgans (cMs).

Trio phasing

Trio phasing – using data from a child and both parents – is the gold standard for phasing. It is possible to phase about 94% of the alleles in an autosomal dataset using a two-parent, one-child trio. The remaining alleles are mostly at sites where the child and both parents are all heterozygous for the same two alleles, so the parental origin of each allele cannot be determined. The number of alleles that can be phased is marginally increased if siblings are also tested. Roach et al. found that they were able to phase 98.8% of the alleles by using data from two parents and four children.[1]
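
The rule behind trio phasing can be sketched for a single SNP as follows. This is a minimal illustration in Python, not the method used by any particular tool: the function name `phase_trio_snp` and the two-letter genotype strings are assumptions made for the example.

```python
VALID_ALLELES = set("ACGT")

def phase_trio_snp(child, mother, father):
    """Return (maternal_allele, paternal_allele) for the child at one SNP.

    Genotypes are two-letter strings such as "AG" (hypothetical format).
    Returns None when the site cannot be phased: a no-call, a Mendelian
    inconsistency, or a site where more than one assignment is possible.
    """
    for genotype in (child, mother, father):
        if len(genotype) != 2 or any(x not in VALID_ALLELES for x in genotype):
            return None  # no-call or malformed genotype

    a, b = child
    candidates = set()
    # Try both assignments of the child's alleles to the two parents.
    if a in mother and b in father:
        candidates.add((a, b))
    if b in mother and a in father:
        candidates.add((b, a))

    # Exactly one consistent assignment means the site is phased.
    if len(candidates) == 1:
        return candidates.pop()
    return None  # ambiguous (all three heterozygous) or inconsistent


print(phase_trio_snp("AG", "AA", "GG"))  # child received A from mother, G from father
print(phase_trio_snp("AG", "AG", "AG"))  # all heterozygous: cannot be phased
```

Sites where the function returns `None` because all three genotypes are heterozygous are the ones that data from additional siblings can sometimes resolve.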

Phasing with data from one parent or other family members

If only one parent is available for testing, first test the parent and all of that parent's children. Then test at least one of the parent's grandchildren through each of the parent's children who had children. It would also be reasonable to test the spouses of the parent's children since that increases the amount of data you can phase.

If no parents are available for testing, first test all children of the family up to at least five (assuming five or more are available for testing). Then test at least one of the parents' grandchildren through each of the parents' children who had children. It would also be reasonable to test the spouses of the parents' children since that increases the amount of data you can phase.

Once you have done the above, start concentrating on testing first and second cousins of the parents. There will be a diminishing return after about five or so first cousins, but it makes sense to test as many first cousins as you can afford to test up to some limit.

Some phasing can also be done using siblings, aunts and uncles or other close relatives as a proxy for a parent. This is sometimes known as poor man's phasing.
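
To show why fewer alleles can be phased when only one parent has tested, here is a minimal single-SNP sketch in Python. The function name `phase_with_one_parent` and the two-letter genotype strings are assumptions for illustration only; it does not reproduce the logic of any specific phasing tool.

```python
VALID_ALLELES = set("ACGT")

def phase_with_one_parent(child, parent):
    """Return (allele_from_tested_parent, allele_from_other_parent) or None.

    Genotypes are two-letter strings such as "AG" (hypothetical format).
    """
    for genotype in (child, parent):
        if len(genotype) != 2 or any(x not in VALID_ALLELES for x in genotype):
            return None  # no-call or malformed genotype

    a, b = child
    if a == b:
        # Homozygous child: both alleles are known, provided the tested
        # parent could actually have passed on that allele.
        return (a, a) if a in parent else None

    a_fits = a in parent
    b_fits = b in parent
    if a_fits and not b_fits:
        return (a, b)
    if b_fits and not a_fits:
        return (b, a)
    # Either allele could have come from the tested parent (ambiguous),
    # or neither could (inconsistent data).
    return None


print(phase_with_one_parent("AG", "AA"))  # A from tested parent, G from the other
print(phase_with_one_parent("AG", "AG"))  # ambiguous without more relatives
```

Whenever the child and the tested parent are both heterozygous for the same alleles the site stays unphased, which is where data from additional relatives becomes useful.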

Statistical phasing

It is not always possible to obtain trios for phasing and, even if it were, it is not economical or computationally feasible to phase large trio datasets. Sophisticated statistical algorithms have been developed which phase the data based on allele frequencies derived from reference populations. A number of programs are available such as Beagle and FastIBD. Phasing can be done with a high degree of accuracy if large enough reference cohorts are available which are representative of the populations being studied. However, with genotype data the current methodologies are not able to reliably phase small segments under 5 cMs. One study reported a false positive rate of over 67% for 2-4 cM segments when compared with trios.[2]

Statistical or population-based phasing works because our DNA is all very similar and because it's passed on in chunks. Think of it like trying to read a sentence when some of the letters are missing. There are only so many combinations that will fit in the available spaces. If you saw these words:

R-d is my f-v--r-t- c-l--r

You would probably be able to work out that the sentence should read:

Red is my favourite colour

There are regional variations in the "sentences" but even if there were a couple of "deletions" you'd still be able to work it out:

Red is my favorite color

Difficulties arise when you have a short word without the context of a full sentence. R-d on its own could be red, rid, or rod.
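
The same idea can be sketched in code. The toy example below, in Python, picks the pair of reference haplotypes that best explains a run of unphased genotypes. Everything in it is an assumption made for illustration: the function names, the tiny reference panel and its frequencies are invented, and real programs such as Beagle use far more sophisticated statistical models and much larger panels.

```python
from itertools import product

def consistent(hap1, hap2, genotypes):
    """True if the two haplotypes together produce the unphased genotypes."""
    return all(
        sorted((x, y)) == sorted(g)
        for x, y, g in zip(hap1, hap2, genotypes)
    )

def most_likely_phase(genotypes, panel):
    """Return the most probable consistent haplotype pair, or None.

    genotypes: list of two-letter unphased genotypes, one per SNP.
    panel: dict mapping a haplotype string to its (made-up) frequency.
    """
    best_pair, best_score = None, 0.0
    for hap1, hap2 in product(panel, repeat=2):
        if len(hap1) != len(genotypes) or len(hap2) != len(genotypes):
            continue
        if consistent(hap1, hap2, genotypes):
            score = panel[hap1] * panel[hap2]
            if score > best_score:
                best_pair, best_score = (hap1, hap2), score
    return best_pair


# Invented reference panel: haplotypes across three SNPs with frequencies.
panel = {"ACT": 0.50, "GTC": 0.30, "ATT": 0.15, "GCC": 0.05}
print(most_likely_phase(["AG", "CT", "CT"], panel))
```

Two haplotype pairs fit these genotypes (ACT with GTC, and ATT with GCC); the sketch prefers the pair built from the more common haplotypes. With fewer SNPs more pairs fit equally well, which is the "R-d" problem described above.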

Genetic genealogy companies

The raw genotype data generated by the Illumina microarray chips used for the autosomal DNA tests from the genetic genealogy companies is unphased and therefore does not distinguish the alleles on the maternal and paternal chromosomes. Customers who download their raw data file will observe that in the genotype column there are two DNA letters for each SNP. These letters are unsorted and could have come from either parent.
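
As a small illustration of what "unphased" means in practice, the Python sketch below reads lines laid out like a 23andMe-style raw data file (tab-separated rsid, chromosome, position and genotype, with comment lines starting with `#`). The function name and the sample lines are invented for this example, and other companies use slightly different layouts.

```python
def parse_raw_line(line):
    """Parse one tab-separated raw data line; return None for comments/blanks.

    The genotype is returned as a sorted tuple to make clear that the
    order of the two letters carries no parental information.
    """
    if not line.strip() or line.startswith("#"):
        return None
    rsid, chromosome, position, genotype = line.rstrip("\n").split("\t")
    return rsid, chromosome, int(position), tuple(sorted(genotype))


# Invented sample lines: the same SNP reported as "AG" and as "GA".
first = parse_raw_line("rs0000001\t1\t1000\tAG\n")
second = parse_raw_line("rs0000001\t1\t1000\tGA\n")
print(first == second)  # the two records are equivalent in unphased data
```

Only after phasing can each letter be assigned to the maternal or the paternal chromosome.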

AncestryDNA and MyHeritage DNA are currently the only two companies which phase the data before assigning matches. Ancestry has developed its own phasing algorithm known as Underdog. The technical details are provided in the AncestryDNA Matching White Paper. AncestryDNA claims an error rate of under 1%, and says that the error rate improves as the size of the training reference dataset increases. As of the beginning of 2016, AncestryDNA uses a reference panel of more than 300,000 genotypes. The details of MyHeritage DNA's phasing are given in their blog post on major updates and improvements to MyHeritage DNA matching. See also the presentation given by Yaniv Erlich, MyHeritage DNA's Chief Scientific Officer, at RootsTech 2018: MyHeritage DNA 1010: from test to results.

Note, however, that if you download the raw data from AncestryDNA or MyHeritage to upload to third-party sites you will receive a file of unphased data.

The 23andMe test and the Family Finder test from Family Tree DNA do not phase the data before assigning matches. However, 23andMe uses statistical phasing for its Ancestry Composition. If one or both parents have been tested at 23andMe, Ancestry Composition can determine which ancestral segments have been inherited from each parent. For a detailed explanation see the 23andMe article on The phasing process.

None of the companies currently provide a facility for customers who have tested their parents to phase their data, and none of the companies allow customers to upload their own phased file.

The future

Phased whole genome sequencing is now available from Illumina and from 10X Genomics.

GedMatch

The free GedMatch website provides a Phasing Data Generator which allows the user to generate phased maternal and paternal data files. The algorithm was developed by John S Walden and implemented by John Olson. Phased paternal files have the prefix P. Phased maternal files have the prefix M. The phased kits can be compared in the GedMatch database in the usual way. For a detailed explanation see the GedMatch Wiki page on phasing.

David Pike's phasing utilities

David Pike has developed two tools for phasing which can be accessed from his website.

Felix Immanuel's phasing utility

Felix Immanuel has written his own phasing utility which can be downloaded from his Genetic Genealogy Tools website.

Oxford Statistics Phasing Server

Oxford Statistics provides a free phasing server for phasing whole genomes using VCF files. For details see the Oxford Statistics website.

Excel programs

Early pioneers of autosomal phasing, like Whit Athey and Tim Janzen, used Microsoft Excel. (NOTE: Do not use versions of Excel prior to 2007 since they are limited to 65,536 rows per worksheet, which is far fewer than the hundreds of thousands of SNPs in a raw data file. Phasing can also be done with the free and open-source office suite LibreOffice.)

Tim Janzen's Excel program will phase either 23andMe or Family Finder data from two parents and one of their children. The program can process several or all of the autosomal chromosomes at once, assuming that your computer can handle a large Excel file with all of the data in it. The program can be downloaded from Tim's Dropbox account at: http://dl.dropbox.com/u/21841126/phasing%20program%20%28small%20version%29.xls.

Instructions on how to use the program may be found at: http://dl.dropbox.com/u/21841126/phasing%20program%20instructions.rtf.

Tim has also uploaded a small version of the program that includes sample data from two parents and one of their children for 500 SNPs which will give people an idea of what the output looks like on a small scale. The program can be downloaded here.

For instructions on phasing see the article on the phasing process which outlines Tim Janzen's methodology.

Scientific papers

Articles

Blog posts

Videos

David Pike gave a presentation on "The use of phasing in genetic genealogy" at the Institute for Genetic Genealogy held in Maryland in 2014. The lecture can be viewed online for a small fee.

A guide to phasing from Illumina:

See also

References