HumanGenomeDating

Frequently Asked Questions

How to interpret age estimation profiles?

...

How to interpret the cumulative coalescent function (CCF)?

...

How to interpret the coalescent intensity function (CIF)?

...

How was the ancestral/derived state determined?

...

How are the figures generated?

Every figure displayed on this website is dynamically generated in your browser using the data fetched for a given component from the human.genome.dating database. The underlying plotting library is Vega v4.4, which is built on D3 (Data-Driven Documents).

If you encounter problems with the visualisation of any figure, please try again using another browser. Most modern browsers (e.g. Chrome, Safari, Firefox, or Opera) should display every figure correctly by default. Note that JavaScript must be enabled in your browser.

How to download figures?

Every figure displayed on this website is dynamically generated in your browser and can be downloaded in PNG format.

Most modern browsers (e.g. Chrome, Safari, Firefox, or Opera) should be able to correctly display and download any figure. However, some browsers may show errors when attempting to download a figure, because of the large number of visual components that need to be converted into a downloadable graphical format. For example, Chrome is known to block such requests if the number of components exceeds a certain threshold, resulting in a "network error". If you encounter problems with PNG downloads, please try again using another browser. Note that JavaScript must be enabled in your browser.

How to download data?

A download button is provided on every page that displays a figure. Clicking the download button dynamically generates a file for the currently viewed component, and the download should start automatically. The file is stored locally on your computer under the filename displayed next to the download button. By default, all data files are generated in CSV format.

If you encounter problems with the downloading function, please try again using another browser. Note that you do not need to have JavaScript enabled to download data.

In the Safari browser, file downloads work, but the Safari console reports an error; this is a known issue in Safari.

How much data is there?

Data in the Atlas of Variant Age has an approximate size of 7.5 Terabytes.

Data in the Shared Ancestry Database has an approximate size of 22.1 Terabytes for the 1000 Genomes Project (TGP) sample, and 271.7 Gigabytes for the Simons Genome Diversity Project (SGDP) sample.

These numbers refer to the approximate total disk space required to store all downloads provided for age estimation profiles or pairwise shared ancestry results. The underlying database itself is smaller, approximately 300 Gigabytes, because all data in it are highly compressed, clustered, and cross-referenced. The framework includes several MySQL and SQLite3 databases, as well as static data files. A direct download of this database framework is not provided.
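To put these figures side by side, the short calculation below adds up the quoted download sizes and compares them with the size of the underlying database. It is only a rough sketch: it assumes decimal units (1 Terabyte = 1000 Gigabytes), which the text does not specify, and all variable names are illustrative.

```python
# Sizes quoted above, expressed in Gigabytes.
# Assumption: decimal units, i.e. 1 Terabyte = 1000 Gigabytes.
GB_PER_TB = 1000

atlas_gb = 7.5 * GB_PER_TB         # Atlas of Variant Age
shared_tgp_gb = 22.1 * GB_PER_TB   # Shared Ancestry Database, TGP sample
shared_sgdp_gb = 271.7             # Shared Ancestry Database, SGDP sample
database_gb = 300                  # approximate size of the compressed database

total_download_gb = atlas_gb + shared_tgp_gb + shared_sgdp_gb
ratio = total_download_gb / database_gb

print(f"Total downloadable data: {total_download_gb / GB_PER_TB:.1f} TB")  # 29.9 TB
print(f"Downloads are roughly {ratio:.0f} times the database size")        # 100 times
```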

Data formats

Variant age profile data

Variant age estimation profiles can be downloaded by variant locus. Profile data consists of pairwise inference results for the haplotype pairs that were analysed to estimate allele age. Relative to a given variant locus, GEVA estimates the local haplotype segment shared between a pair of haplotypes (i.e. the position of recombination breakpoints), from which the TMRCA (time to the most recent common ancestor) is inferred. Information from a larger number of pairs is combined to estimate allele age.

All data files are provided in tabular CSV format. Each file has a meta-header (lines beginning with ##), which contains the download date, variant ID (rsID), allele information, and genomic location. The first line following the meta-header is the actual CSV header that defines the number and names of the columns in the data table. Column names GammaAlpha_*, GammaBeta_*, and MeanTMRCA_* are distinguished by their suffix, which indicates the clock model used: mutation clock (Mut), recombination clock (Rec), or joint clock (Jnt). The following table lists and describes each column.

Pair1st: Sample ID (as defined in a given data source) of the first diploid individual in the haplotype pair. Haplotypes are distinguished by ˜A and ˜B, referring to the first or second phased haplotype of a given individual (in order of appearance in the data set).
Pair2nd: As above, but for the second haplotype in the pair.
PairType: Type of pair; either Concordant (both haplotypes carry the focal allele) or Discordant (one carrier and one non-carrier).
Source: Abbreviation of the data source; 1000 Genomes Project (TGP) or Simons Genome Diversity Project (SGDP).
SegmentBreakLHS: Physical position (GRCh37) of the inferred recombination breakpoint on the left-hand side of the focal variant position. Floating point values (as opposed to position integers) are given for consistency with the inferred physical length of a shared haplotype segment, which is bounded by recombination occurring in between sites (rounded to the nearest 0.5 distance).
SegmentBreakRHS: Physical position (GRCh37) of the inferred recombination breakpoint on the right-hand side of the focal variant position. Floating point values (as opposed to position integers) are given for consistency with the inferred physical length of a shared haplotype segment, which is bounded by recombination occurring in between sites (rounded to the nearest 0.5 distance).
SegmentLength: Physical length of the locally inferred shared haplotype segment.
GeneticLength: Genetic length (in units of Morgan) of the locally inferred shared haplotype segment, based on HapMap genetic maps.
GammaAlpha_Mut: Inferred α parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the mutation clock model; derived from the number of pairwise differences between the two haplotypes along the shared haplotype segment (after applying a correction to make this number consistent with expectations under the infinite-sites model), plus 1 due to the prior expectation of exponential coalescent times with rate = 1.
GammaAlpha_Rec: Inferred α parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the recombination clock model; derived from the number of inferred recombination breakpoints that delimit the shared haplotype segment (0 if it stretches along the whole chromosome, 1 if one-sided, and 2 if breakpoints were inferred on both sides), plus 1 due to the prior expectation of exponential coalescent times with rate = 1.
GammaAlpha_Jnt: Inferred α parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the joint clock model (which considers both mutational and recombinational information), plus 1 due to the prior expectation of exponential coalescent times with rate = 1.
GammaBeta_Mut: Inferred β parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the mutation clock model; derived from the physical length of the shared haplotype segment, the mutation rate (µ = 1.2 × 10⁻⁸ per site per generation), and Ne, where Ne = 10,000.
GammaBeta_Rec: Inferred β parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the recombination clock model; derived from the genetic length and, thus, variable recombination rates (based on HapMap genetic maps) along the shared haplotype segment, and Ne, where Ne = 10,000.
GammaBeta_Jnt: Inferred β parameter of the Gamma distribution (describing the posterior probability of TMRCA over time) in the joint clock model (which considers both mutational and recombinational information); derived from both the mutation and recombination rates (as given above) and Ne, where Ne = 10,000.
MeanTMRCA_Mut: Posterior mean of TMRCA from the inferred Gamma distribution under the mutation clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000.
MeanTMRCA_Rec: Posterior mean of TMRCA from the inferred Gamma distribution under the recombination clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000.
MeanTMRCA_Jnt: Posterior mean of TMRCA from the inferred Gamma distribution under the joint clock model (with parameters as given above), scaled by 2Ne, where Ne = 10,000.
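The relationship between the Gamma parameters and the mean TMRCA can be illustrated with a small Python sketch. It rests on assumptions that are not stated above: that β is a rate parameter (so the Gamma mean is α / β), that the unscaled mean is in coalescent units, and that "scaled by 2Ne" means multiplying by 2Ne to obtain generations. The function names are hypothetical, the input values are made up, and the result is not guaranteed to reproduce the MeanTMRCA_* values in a downloaded file.

```python
NE = 10_000  # effective population size given in the column descriptions


def alpha_rec(num_breakpoints: int) -> int:
    """Shape parameter under the recombination clock.

    Number of inferred breakpoints (0, 1 or 2) plus 1 for the
    exponential prior with rate 1, as described for GammaAlpha_Rec.
    """
    if num_breakpoints not in (0, 1, 2):
        raise ValueError("expected 0, 1 or 2 inferred breakpoints")
    return num_breakpoints + 1


def mean_tmrca(alpha: float, beta: float, ne: int = NE) -> float:
    """Mean of a Gamma(alpha, beta) distribution, multiplied by 2 * Ne.

    Assumption: beta is a rate parameter, so the mean is alpha / beta,
    and multiplying by 2 * Ne converts coalescent units to generations.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return alpha / beta * 2 * ne


# Made-up example: breakpoints on both sides, hypothetical beta of 48.
alpha = alpha_rec(2)              # 2 breakpoints + 1 = 3
print(mean_tmrca(alpha, 48.0))    # 3 / 48 * 20000 = 1250.0
```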

See example download on page: rs182549
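A downloaded profile file can be read with a few lines of Python by separating the meta-header (lines beginning with ##) from the CSV table. The sketch below is a minimal example under stated assumptions: the function name read_download is hypothetical, the table is assumed to be comma-separated, and the sample text is made up for illustration (it uses only a subset of the columns and placeholder sample IDs, and the meta-header contents are invented).

```python
import csv
from collections import Counter


def read_download(text: str):
    """Split a downloaded file into meta-header lines and data rows.

    Lines beginning with '##' are collected as meta information; the
    first remaining line is treated as the CSV header.
    Assumption: fields are separated by commas.
    """
    meta, table = [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line[2:].strip())
        elif line.strip():
            table.append(line)
    rows = list(csv.DictReader(table))
    return meta, rows


# Made-up content; real files contain more columns and real sample IDs.
SAMPLE = """\
## example meta line 1 (made up)
## example meta line 2 (made up)
Pair1st,Pair2nd,PairType,Source,MeanTMRCA_Jnt
id1,id2,Concordant,TGP,100.0
id1,id3,Discordant,TGP,900.0
id4,id5,Concordant,SGDP,150.0
"""

meta, rows = read_download(SAMPLE)
pair_counts = Counter(row["PairType"] for row in rows)
print(len(meta), "meta lines;", len(rows), "pairs")
print(dict(pair_counts))
```

Values are returned as strings, so numeric columns such as MeanTMRCA_Jnt need an explicit float() conversion before any calculation.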

Variant age summary data

Variant dating results can be downloaded for the variants in a given gene or genomic region (as well as by chromosome; see bulk downloads). Summary results are provided as a point estimate of allele age for each variant (per data source). Full results (variant age profiles) can be downloaded separately for each variant.

All data files are provided in tabular CSV format. Each file has a meta-header (lines beginning with ##), which contains information about the download date and genomic location (as well as gene names if downloaded for a specific gene). The first line following the meta-header is the actual CSV header that defines the number and names of the columns in the data table. Columns starting with AgeMode_*, AgeMean_*, AgeMedian_*, AgeCI95Lower_*, AgeCI95Upper_*, and QualScore_* are distinguished by their suffix, which indicates the clock model used: mutation clock (Mut), recombination clock (Rec), or joint clock (Jnt). The following table lists and describes each column.

VariantID: Genetic variant rsID. Note that some variant IDs begin with X followed by a unique numeric string; this is because the data source from which allele age has been estimated either did not contain rsID information or matching of the variant to reference data (Ensembl) was inconclusive (matched by genomic location and allelic states).
Chromosome: Human chromosome 1 to 22 (i.e. autosome) on which a given variant is located.
Position: Physical position of the variant on the chromosome (GRCh37).
AlleleRef: Reference allele.
AlleleAlt: Alternate allele. Note that the alternate allele was assumed to be derived in all analyses, which may not correctly distinguish ancestral and derived states. Dating results reflect the estimated age of the alternate allele.
AlleleAnc: Ancestral allele according to external reference data (Ensembl), or "." if unknown at the time of data upload.
DataSource: Abbreviation of the data source used to date a given variant; 1000 Genomes Project (TGP), Simons Genome Diversity Project (SGDP), or combined from both (Combined). For the latter, results from the pairwise inference of TMRCA between haplotype pairs (independently in each data set) were combined to re-estimate allele age.
NumConcordant: Number of concordant haplotype pairs (both carrying the derived/alternate allele) that were analysed (shared haplotype detection and inference of TMRCA) to estimate allele age. All concordant pairs were sampled at random from the set of possible concordant pairs for a given variant.
NumDiscordant: Number of discordant haplotype pairs (carrier and non-carrier haplotypes) available. Discordant pairs were sampled after applying a "relaxed" prioritisation algorithm to identify non-carrier haplotypes that are the nearest genealogical neighbours to the focal sub-tree (carriers). Effectively, on average, half the pairs were selected from a prioritised set, and the other half were sampled at random.
AgeMode_*: Allele age estimate taken at the mode of the composite posterior distribution, resulting from combining TMRCA information across available haplotype pairs (after filtering) under the mutation clock (suffix Mut), recombination clock (suffix Rec), or joint clock (suffix Jnt) model.
AgeMean_*: Allele age estimate taken at the mean of the composite posterior distribution; as above.
AgeMedian_*: Allele age estimate taken at the median of the composite posterior distribution; as above.
AgeCI95Lower_*: 95% confidence interval, lower bound; estimated by computing the cumulative composite posterior distribution.
AgeCI95Upper_*: 95% confidence interval, upper bound; estimated by computing the cumulative composite posterior distribution.
QualScore_*: Quality score, calculated from the proportion of concordant/discordant pairs retained after filtering of outlier haplotype pairs.

See example download on page: LCT
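When working with summary files, it is often convenient to pull out the estimates for one clock model and to recognise variant IDs that are not rsIDs. The Python sketch below shows one way to do this, using the column prefixes and suffixes listed above. The helper names are hypothetical, the row is made up (with values as strings, as a CSV reader would return them), and the pattern for placeholder IDs simply assumes "X followed by digits" as described for the VariantID column.

```python
import re

CLOCKS = ("Mut", "Rec", "Jnt")
FIELDS = ("AgeMode", "AgeMean", "AgeMedian",
          "AgeCI95Lower", "AgeCI95Upper", "QualScore")

# Assumption: placeholder IDs are an 'X' followed only by digits.
PLACEHOLDER_ID = re.compile(r"X\d+")


def is_placeholder_id(variant_id: str) -> bool:
    """True if the ID looks like X<digits> rather than an rsID."""
    return PLACEHOLDER_ID.fullmatch(variant_id) is not None


def estimates_for(row: dict, clock: str) -> dict:
    """Collect the age estimates and quality score for one clock model."""
    if clock not in CLOCKS:
        raise ValueError(f"unknown clock model: {clock!r}")
    return {field: float(row[f"{field}_{clock}"]) for field in FIELDS}


# Made-up row containing only the joint-clock columns.
row = {
    "VariantID": "X0000001",
    "AgeMode_Jnt": "100", "AgeMean_Jnt": "130", "AgeMedian_Jnt": "120",
    "AgeCI95Lower_Jnt": "80", "AgeCI95Upper_Jnt": "200",
    "QualScore_Jnt": "0.9",
}

print(is_placeholder_id(row["VariantID"]))   # True
print(is_placeholder_id("rs182549"))         # False
print(estimates_for(row, "Jnt")["AgeMedian"])  # 120.0
```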

Ancestry shared between two individuals

...

Sample-wide shared ancestry

...