table of contents
- expected learning outcome
- getting started
- exercise 1: file conversion
- exercise 2: estimating Θ alone
- exercise 3: estimating Θ and growth rate
- exercise 4: estimating Θ and recombination rate
- exercise 5: improving a poorly performing run
- final thoughts
expected learning outcomes
The objective of this activity is to help you understand the estimation of population genetic parameters using the program LAMARC. The things you will learn include input file preparation and estimation of Θ (Theta), growth rate, and recombination rate. You will also learn a number of ways to determine if an analysis is successful, and how to use Tracer in conjunction with LAMARC for diagnostics.
For additional help with running the program and interpreting the results, you can view LAMARC’s documentation at: https://biotech.inbre.alaska.edu/fungal_portal/program_docs/lamarc/index…
getting started
A reminder: if you want to keep results (they are written by default into a file called outfile and outfile.pdf), save them using cp outfile yourfavoredfilename or change the name of the output file in the menu.
You will use two data sets for the LAMARC exercises. You can download them using these links:
Download lamarcFiles.zip, which contains files corresponding to a human mtDNA data set (Ingman et al. 2000). The original data were 53 complete mtDNA sequences from people living on the five contintents. A cut-down version with 10 sequences is available as europe500.phy. The full European data set is in the fileeurope16kb.mig. Use of the cut-down version for this exercise is recommended for speed.
Note that the data set was cut down by randomly removing sequences and NOT by systematically removing duplicates.
exercise 1: file conversion
Use the file conversion utility, lam_conv, to convert the sample file europe500.phy to a LAMARC infile. After reading in the data file, you will want to check (and probably change) the converter’s guess for the data set’s data type. The easiest way to do this is by double-clicking in the contiguous segment area of the window, then checking the appropriate box under Data Type (in this case DNA, not SNP). Be sure to click the Apply button to actually save any changes made.
To save your converted data file, choose Write Lamarc File under the File drop-down menu. This will by default write a file named infile.xml.
If you become frustrated with the file converter, a pre-converted input file is available as europe500.xml.
exercise 2: estimating Θ alone
Run lamarc and give it the full name of your LAMARC input file. Be careful to include extensions at the end of a filename (like .xml or .txt).
We will do a likelihood-based run, in which the overall strategy is to generate many genealogies based on a single set of driving values. Each genealogy proposed is one “step”. Some proportion of the steps are sampled (recorded) for use in parameter estimation. Periodically the program will estimate the parameters and start a new search with the improved parameters; each iteration of this is a “chain.” The typical likelihood run will do several “initial chains” to improve the driving values, and one or two much longer “final chains” to produce the final estimate.
The default run length is generally too short for publication, but too long for a demonstration. Under the Search Strategy submenu, Sampling Strategy submenu, cut the Number of samples to discard down to 100 for both initial and final chains, and the final-chain Number of recorded genealogies down to 1000.
Since we know this is mtDNA with a high frequency of transitions over transversions, go back to the main menu and under submenu Data options, submenu Edit data model(s) for Region 1 edit the data model. The TT ratio (transition/transversion ratio) is set by default to 2, which is unreasonable for mtDNA. Set it to the more reasonable value of 20. (PAUP* with ModelTest is an ideal tool for finding an optimal data model for your data.)
Run the program. Progress reports should appear, including a guess as to how long the analysis will take. At the end of this section are some tips for deciding whether your run is satisfactory. You may want to consider these as you watch the progress reports; you will certainly want to consult them when the run has finished.
When the program finishes, your output will be in outfile unless you told the program to rename this file.
- What is the point estimate of Θ? Write the Θ into the spreadsheet. What does this number mean?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Theta is a measure of genetic variability that is a function of the effective population size (Ne) and mutation rate per site per generation (mu). For diploid DNA sequences, Theta would be equal to 4Ne(mu). But, because mtDNA is heritable only in females, Theta is Ne(mu) or 2Ne(f)mu.[/toggle] - What are the 95% confidence interval boundaries for this estimate?
For subsequent runs, you may want to use the menusettings_infile.xml file produced by this run as your starting point, as it will already have the desired TT ratio and search strategy.
validation
Tracer is mainly useful for Bayesian runs; for this likelihood run LAMARC’s own diagnostics will be more helpful. Examine the output file from your run, looking for any of the following symptoms. You won’t find them all; this is a general checklist useful for any likelihood-based run.
- Do the parameter estimates increase (or decrease) throughout the run, rather than settling around a value?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If you observe directional changes in the parameter estimates, the run is too short; try more chains.[/toggle] - In the run reports, does the “data lnL”, which shows the score of the best genealogy found during that chain, increase throughout the run, rather than settling around a value?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If so, the run is too short; try more or longer chains.[/toggle] - In the run reports, does the “posterior lnL”, which shows improvement in driving value from one chain to the next, greater than 2-3x the number of parameters being estimated in the final chain?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In general, if the posterior lnL is too high, your run is too short; try more or longer chains.[/toggle] - If you run the program twice with the same data and settings, but different random number seeds, do the results disagree? (Compare the between-run difference with the confidence intervals.)
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If the results from multiple, independent runs disagree, the run is too short, or data are uninformative. Try more or longer chains. If that doesn’t work, try a Bayesian analysis, or reconsider use of these data.[/toggle] - Do the estimates of the parameters vary wildly from one chain to the next?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If so, chains are too short, or data are uninformative. Try longer chains; if this doesn’t work, try a Bayesian analysis, or reconsider use of these data.[/toggle] - Is the acceptance rate above 50% or below 5%?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Acceptance above 50% suggests that the data are saturated with mutations; this is usually an alignment problem. Acceptance below 5% may be unavoidable with some data sets, but it suggests the need for particularly long runs.[/toggle]
exercise 3: estimating Θ and growth rate
Repeat the analysis, but turn on estimation of growth from the Analysis menu. Make sure to save your previous results by either copying/moving/renaming the previous outfile, or by telling LAMARC to save its output to a file with a different name (change this in the Input and Output menu).
- What is the point estimate of Θ? What is the interpretation of this number?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In a run with growth, Theta is Ne(mu) at the time the sequences were collected. NOTE: for diploid data, theta would be equal to 4Ne(mu).[/toggle] - What are the 95% confidence interval boundaries for this estimate?
- What is the point estimate of the growth rate g? Write the Θ and g into the spreadsheet. How do you interpret your estimate?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Positive g indicates a growing population, negative g a shrinking one. Further interpretation requires some idea of the mutation rate mu, using the following relationship: Theta(t)=Theta(now) exp(-gt(mu)) where Theta(t) is Theta at a time t generations in the past and Theta(now) is the modern-day Theta estimated by the run.[/toggle] - What are the 95% confidence interval boundaries for this estimate?
- Is there a lot of correlation between Θ and g? Why?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Hint: look at the profile likelihoods. Theta and g are normally strongly correlated, because the average rate of coalescence is about the same for a smaller population which has grown more slowly, and a somewhat larger population that has grown somewhat faster.[/toggle] - How much does this estimate of Θ differ from one made without taking growth into account? Why might we expect it to differ?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In a run with growth, Theta is an estimate of Ne(mu) at the time the data were sampled. In a run without growth, it is a composite estimate over the last (approximately) N generations. If the population size has changed, these estimates will differ. NOTE: Again, if estimated from diploid data, Theta is 4Ne(mu) at time of sampling with growth. And, without growth, it is a composite estimate over 4N generations.[/toggle]
validation
Check the same points as for exercise 2. Additionally, for a run with growth the following problem is relevant:
- Are the estimates of Theta and g enormous, with upper confidence limits that are nearly infinite?
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If the values of these estimates are extremely large and the population has grown very rapidly, the genealogy will resemble a star and not enough information will be available to co-estimate Theta and g. Additional unlinked loci may help. Otherwise, you can estimate Theta while holding g constant, or vice versa. If multiple time point data are available, consider using BEAST instead.[/toggle]
exercise 4: estimating Θ and recombination rate
For estimation of recombination we provide a file lpl_short.xml with SNP data from three human populations: European Finns from North Karelia; assorted Europeans from Rochester, Minnesota; and African-Americans from Jackson, Mississippi (Nickerson et al. 1998). The full version, too large for class exercises, is in lpl.xml.
These are pre-generated LAMARC infiles which will perform a Bayesian analysis with one final chain sampling 7000 parameter values split randomly among all of the available parameters. If you would like to experiment with some of the more complex features of the data converter, we provide the raw migrate format file containing the full dataset in lpl.mig, plus an XML command file for use with the converter in lpl.conv.
You will want to examine the priors for Theta (Search Strategy menu); they probably start off too wide for human data. A more reasonable choice might be an upper bound of 1.0 and a lower bound close to zero, like 0.00001. You can experiment with a linear prior here if you like; then the lower bound can actually be zero.
- What is the point estimate of Θ?
- What are the 95% support interval boundaries for this estimate?
- What is the point estimate of the recombination rate r? How is this number interpreted?
[toggle title_open=”Hide Answer” title_closed=”Show Answer” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]The recombination rate is estimated as C/mu, where C is the probability of a recombination at a given site and mu is the probability of a mutation at a given site, both per generation.[/toggle] - What are the 95% support interval boundaries for this estimate?
- (If you have time) How does changing the prior for r change these results?
validation
All of the questions for Exercise 1 are relevant here. Since this was a Bayesian run, we can additionally use the Tracer tool of Drummond and Rambaut. Files with “tracefile” in the name are ready to be read by Tracer; in this case, read tracefile_LPL_1.txt.
- View the trace and histogram for each parameter. The trace shows how the value of that parameter changed over the course of the run; the histogram shows the final distribution of values. You can also look at the trace of P(D|G), called the “Ln(Data Likelihood)”. Note that less negative numbers are better scores (a difference from PAUP*).
- Check the ESS (Effective Sample Size) for each parameter.
Look for the following symptoms. Bear in mind that a bad Tracer result is a near-guarantee of a failed run, but a good Tracer result is not proof of success. If the search has failed to find an island of good trees, no diagnostic can detect this.
- ESS (Effective Sample Size) for any parameter less than 200.
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Run is too short; try more steps. (Multiple chains are not helpful in a Bayesian run.)[/toggle] - Directional trends in the trace of a parameter.
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]Run is too short; try more steps. (Multiple chains are not helpful in a Bayesian run.) A good trace resembles a uniformly fuzzy caterpillar. The fuzz can be big or small, but the caterpillar should not be climbing up or down.[/toggle] - Sudden leaps in the trace of a parameter, usually the recombination rate.
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]The search has found a new area. These don’t invalidate the analysis if they are early in the run, but if they are still happening late in the run, it should be run longer.[/toggle] - Multiple peaks in the histogram of a parameter, usually a migration rate.
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]In a migration parameter this can reflect a real feature of the data; some data sets are genuinely ambiguous, supporting asymmetrical migration but not defining its direction. Multiple peaks in Theta or r, however, are probably signs of a too-short run.[/toggle] - A “knee” on the lower side of the histogram for a parameter.
[toggle title_open=”Hide Diagnosis” title_closed=”Show Diagnosis” hide=”yes” border=”yes” style=”default” excerpt_length=”0″ read_more_text=”Read More” read_less_text=”Read Less” include_excerpt_html=”no”]If the prior puts a lot of weight on values the data cannot actually distinguish, parts of the final curve will be due more to the prior than to the data. In general, a Bayesian posterior which looks like its prior is cause for concern. You may have an inappropriate prior, or uninformative data. For this data set, a flat prior on r is better than a log prior.[/toggle]
exercise 5: improving a poorly performing run
With the LPL data, the given run conditions lead to very poor performance. Here are some strategies that may improve these runs. Try a few and see what happens.
- Increase the number of steps in the chain (Search Strategy menu).
- Add heating, with one heated chain (Search Strategy menu). A temperature of 1.2 for the hot chain is recommended. Aim for a temperature at which swaps between hot and cold are moderate, between 10% and 50%.
- Refine the mutation model (Analysis menu) so that migration rates are symmetrical or some migration pathways are not allowed. For example, it is probably reasonable not to estimate gene flow from Jackson (African-American) into Karelia (Finnish).
- Set some Theta values to pre-determined figures so that they will not have to be estimated (Analysis menu). A previous study suggested Theta(Jackson)=0.0072, Theta(Karelia)=0.0027, Theta(Rochester)=0.0031.
- Turn off estimation of recombination (Analysis menu).
- Narrow the Bayesian priors (Search Strategy menu) so that a smaller range of values will be proposed. This may improve acceptance rates.
- Change to a likelihood run. (This is mainly helpful if data are very sparse or parameter values very close to their lower limits.) If you do, you will have two more options to consider:
- Run more chains (Search Strategy menu).
- Adjust your starting values (Analysis menu).
- Cut down this three-population case to a two-population case. (This can be done by cutting down the ram migrate input file (lpl.mig) and then feeding it into the file converter lam_conv.)
- Randomly remove some sequences from the LAMARC input file, being careful to keep the rest of the XML intact. Be sure to save the file as flat text, not .doc or other rich formats.
final thoughts
LAMARC is a powerful tool but must be used thoughtfully. It will often perform poorly with “out of the box” settings. Plan to do some trial runs to explore the behavior of the program with your data set. If you encounter any difficulties, please email the LAMARC development team. It is helpful if you include the data file that is causing the problem, especially for crashes and other bugs.