a, The joint prior distribution for stem cell numbers (HSCs) and the generation time for the first approximate Bayesian computation (ABC). b, The location in sample space of the 10% of simulations that produced summary statistics (using only the ltt summary statistics; Supplementary Methods and Supplementary Information) most similar to the observed summary statistics. c, The joint prior distribution for the second ABC, in the area of sample space indicated to be plausible by the first set of simulations. d, The joint posterior distribution of the best 500 simulations from the second ABC, as shown in Fig. 5 for ease of reference. ‘n’, ‘o’ and ‘p’ on the plot indicate the position in sample space from which panels n–p were drawn. e–i, Cross-validation of the model to choose the number of accepted simulations and the weighting applied to the ltt summary statistics (Supplementary Methods and Supplementary Information). j, For illustrative purposes, five simulations were sampled for each of three population sizes along the plausible diagonal of sample space indicated in b. One set of summary statistics is shown for these simulations in k. k, The red line indicates a simulation from the area of sample space indicated by a red point in j, and similarly for the blue and green lines. The black dotted line indicates the observed values for these summary statistics. For each number of samples (x axis), these summary statistics count how many of the 3,952 mutations that we considered (y axis) are present in that many samples with two or more reads, using error model 1, which simulates errors according to the error rate in control DNA (Supplementary Methods). The same summary statistics were calculated for different mutant read number cut-offs.
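The sample-sharing summary statistic described for panel k can be illustrated with a minimal sketch. It assumes the data are available as a mutations × samples matrix of mutant read counts; the function name `sample_sharing_counts`, the `min_reads` parameter and the toy matrix are hypothetical, and the error model is not included.

```python
import numpy as np

def sample_sharing_counts(mutant_reads, min_reads=2):
    """Count, for k = 0..n_samples, the mutations found in exactly k samples.

    A mutation counts as present in a sample when it has at least
    `min_reads` mutant reads there (the cut-off is an assumed parameter).
    """
    mutant_reads = np.asarray(mutant_reads)
    present = mutant_reads >= min_reads          # boolean mutations x samples
    samples_per_mutation = present.sum(axis=1)   # samples carrying each mutation
    n_samples = mutant_reads.shape[1]
    return np.bincount(samples_per_mutation, minlength=n_samples + 1)

# Toy matrix: 4 mutations (rows) x 4 samples (columns); values are invented.
reads = np.array([
    [0, 3, 5, 0],
    [2, 2, 2, 2],
    [1, 0, 0, 0],
    [0, 0, 4, 1],
])
counts = sample_sharing_counts(reads)             # cut-off of two reads
counts_cutoff1 = sample_sharing_counts(reads, 1)  # a different cut-off
```

Entry `counts[k]` is the number of mutations present in exactly `k` samples; changing `min_reads` gives the corresponding statistic for another read cut-off.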
l, For each of the 1,000 simulations that produced summary statistics most similar to the observed data, the Euclidean distance from the observed data (y axis) is plotted against the number of stem cells in that simulation (x axis). This information is used by the neural network regression step to define the most likely value for the number of stem cells. The most similar values are seen at around 100,000 stem cells, which was the location of the median of the posterior distribution from neural network regression. m, The observed phylogeny, with branch points indicated by asterisks. n–p, Phylogenies drawn from simulations that occur at the points in sample space indicated in d. n, A relatively plausible simulation, since the pattern of branch points is not dissimilar to that of the observed phylogeny (m). Simulations with smaller stem cell populations and faster stem cell turnover rates result in phylogenies in which the stem cells are very closely related to each other (o), whereas those with larger populations and slower turnover result in phylogenies in which the stem cells share only an embryonic common ancestor, and no branches are seen through the tree (p).
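The two-stage scheme described in panels a–d and l can be sketched as a plain rejection ABC ranked by Euclidean distance. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_simulator`, `abc_rejection`, the prior ranges, the observed values and the numbers of simulations are all hypothetical, and the neural network regression step is omitted. The toy summary statistics depend only on the sum of the two log-scale parameters, so accepted simulations fall along a diagonal, loosely resembling the ridge in panel b.

```python
import numpy as np

def toy_simulator(theta, rng):
    """Hypothetical stand-in for the stem cell simulation (not the real model)."""
    log_cells, log_gen_time = theta
    s = log_cells + log_gen_time  # only the sum is identifiable in this toy
    return np.array([s, 0.5 * s]) + rng.normal(0.0, 0.05, size=2)

def abc_rejection(sample_prior, simulator, observed, n_sims, n_accept, rng):
    """Keep the n_accept simulations closest to the observed statistics."""
    thetas = np.array([sample_prior(rng) for _ in range(n_sims)])
    stats = np.array([simulator(theta, rng) for theta in thetas])
    distances = np.linalg.norm(stats - observed, axis=1)  # Euclidean distance
    order = np.argsort(distances)[:n_accept]
    return thetas[order], distances[order]

rng = np.random.default_rng(0)
observed = np.array([6.0, 3.0])  # invented observed summary statistics

# Stage 1: wide uniform prior on (log10 stem cells, log10 generation time);
# keep the closest 10% of simulations.
wide_lo, wide_hi = np.array([3.0, -1.0]), np.array([7.0, 1.0])
stage1_theta, stage1_dist = abc_rejection(
    lambda r: r.uniform(wide_lo, wide_hi), toy_simulator, observed, 2000, 200, rng)

# Stage 2: new prior restricted to the range spanned by the stage 1 acceptances.
lo, hi = stage1_theta.min(axis=0), stage1_theta.max(axis=0)
stage2_theta, stage2_dist = abc_rejection(
    lambda r: r.uniform(lo, hi), toy_simulator, observed, 2000, 100, rng)
```

Plotting `stage2_dist` against `stage2_theta[:, 0]` gives the analogue of panel l: distance from the observed data as a function of the (log-scale) number of stem cells among the accepted simulations.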