# EMGM 2011 – day one

After a confused but exhilarating ride through London, during which I had the privilege of seeing the beautiful sights of London – Hyde Park, Buckingham Palace, and Big Ben, not to mention Havelock (yes, thanks, I have) – I’m at the European Mathematical Genetics Meeting in King’s College London. Last time I went conferencing I blogged the talks, and I quite enjoyed it, so I’ve decided to do it again, and here it is:

**Jo Knight** – Welcome. 181 people here. Talking about the weather; today nice, tomorrow not. Lots of talks so a tight programme. There’s a joke involving a water pistol and a… oh no, it’s just a water pistol. There’s a toilet joke as well! Conference dinner will be on a boat. Various admin details, and finally it’s the Future Of The Meeting, which will be in ~~Heidelberg~~ Göttingen next year.

Now onto the talks.

**First session.**

**Simon Teyssedre** (INRA). Talking about single-SNP analysis in family-structured populations. We know populations often have related individuals. (In fact I know several related individuals myself.) Conventional methods assume independence of individuals. Question: what is the range of validity of methods that do not take structure into account? Question 2: if we do control for structure, what are the power and robustness of the model? He aims to answer these algebraically for some common models. The true model is assumed to be a mixed model $y = X\beta + Zu + e$, but this slide was only up for a couple of seconds so I don’t know what assumptions are on the variables. Presumably $u$ is the random effects, $\beta$ the fixed effects, and $e$ the error term. Tested the simple regression model $y = \mu + x\beta + e$, QTDT, GRAMMAR, and FASTA (or EMMAX) (Kang et al, Nature Genetics 2010). Now talking about the simple regression model. There are lots of formulas deriving the Student’s t-test which, of course, I follow perfectly, but they are too large for me to write down in the margin just now. Graphs show that Type I error increases considerably as <something> increases, but either I’m too sleepy or these slides are going too fast, because I miss what the something is. QTDT has a much lower increase in Type I error but lower power. Similar for GRAMMAR and FASTA, which does better in terms of power while maintaining good rates of Type I error, i.e. FASTA/EMMAX is the best for this situation. Somebody comments that QTDT is designed also to cope with environmental covariance, which the other methods aren’t. Somebody else asks for an extension to multiple SNPs. (It seemed to me that if you simulate from a mixed model, it isn’t terrifically surprising to find that the mixed model fits best. Or maybe I’ve got the wrong end of the stick and there’s something more subtle going on.)
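None of the speaker’s code is public as far as I know, but here is a quick numpy sketch, entirely my own toy setup (50 sibships of 4, so the relationship matrix K is block diagonal), of what simulating from a true mixed model and then naively fitting the simple regression looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup (mine, not the speaker's): 50 sibships of 4,
# so the relationship matrix K is block diagonal.
n_fam, fam_size = 50, 4
n = n_fam * fam_size
block = 0.5 * np.ones((fam_size, fam_size)) + 0.5 * np.eye(fam_size)
K = np.kron(np.eye(n_fam), block)

# True model y = x*beta + u + e with polygenic effect u ~ N(0, sg2*K)
sg2, se2, beta = 1.0, 1.0, 0.0            # beta = 0: the SNP is null
x = rng.binomial(2, 0.3, size=n).astype(float)
u = np.linalg.cholesky(sg2 * K) @ rng.standard_normal(n)
e = np.sqrt(se2) * rng.standard_normal(n)
y = x * beta + u + e

# The naive simple regression ignores the correlation in u, which is what
# inflates Type I error in structured samples.
X = np.column_stack([np.ones(n), x])
beta_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
```

Repeating this many times and counting how often the naive t-test rejects would reproduce the inflated Type I error the speaker showed.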

**Amelie Baud** (WTCHG, Oxford). About controlling for relatedness in “heterogeneous stock” rodents, 1900 mice in fact, derived from inbred strains 50 generations ago: small LD blocks, very few rare variants, and known founder strains, which allows mapping phenotypes to ancestral haplotype segments. Now explaining how relatedness creates spurious associations, and identity-by-state. Now talking about mixed models. Amelie implemented a modified version of EMMAX (again) in R, customised to deal with covariates and with testing for ancestral haplotypes (rather than SNPs). The model is $y = X\beta + Zu + e$, where $u$ represents the “random effects” term. As in Kang et al, the variance components are estimated only once, under the null model, and the same estimate is used in the alternative model. A trick (multiply through by $V^{-1/2}$, where $V$ is the phenotypic covariance matrix, to convert to a standard linear model) is presented. Now discussing simulations, with simulated phenotype = relatedness + QTL effects + errors. Conclusion: mixed models correct for relatedness much better than linear models. (But, as in the previous talk, isn’t the simulation assumption basically that the mixed model holds?) Another conclusion: SNP mapping misses some associations that can be seen by considering ancestral haplotypes. (In response to a question it is pointed out that this is essentially equivalent to imputation.) David Balding says he prefers allelic correlation to IBS as a measure of relatedness. (And he says that he has an implementation in GenABEL that doesn’t (I think) need the only-null-model estimation step.)
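The trick here is, I believe, the standard “whitening” that turns generalised least squares into ordinary least squares: multiply everything by $V^{-1/2}$. A sketch of my own (arbitrary made-up covariance, nothing to do with Amelie’s R code):

```python
import numpy as np

def whiten(V):
    """Return W = V^{-1/2} via eigendecomposition (V symmetric positive definite)."""
    vals, vecs = np.linalg.eigh(V)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)          # an arbitrary symmetric PD covariance
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)

# GLS estimate: (X' V^-1 X)^-1 X' V^-1 y
Vi = np.linalg.inv(V)
b_gls = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ y)

# Same thing via the whitening trick: OLS on (W X, W y), since W'W = V^-1
W = whiten(V)
b_ols = np.linalg.lstsq(W @ X, W @ y, rcond=None)[0]
```

Since $W^\top W = V^{-1}$, OLS on the transformed data gives exactly the GLS estimate, so any standard linear-model machinery can be reused.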

**Fazil Baksh** (Reading). Talking about a robust score test for family-based association studies with “ordinal responses”. Examples of ordinal traits: cancer stage, asthma state, nicotine dependence scale, and baldness (measured on the Hamilton-Norwood scale. Who knew there was a scale for baldness?) Methods developed for binary and quantitative traits have been shown to be inefficient and may be inappropriate under some circumstances, e.g. a differential effect of genes in more severe cases, environment, or when trait measurement reliability depends on severity. Now comparing score tests using the joint likelihood of genotype and trait (Baksh et al), using the retrospective likelihood due to Wang and Ye (2006), or using variance components (Diao and Lin). All were more efficient than QTDT. Diao and Lin claimed Baksh et al. is sensitive to population structure and has a higher Type I error rate. The speaker claims this is not true if the population structure is appropriately accounted for, and presents this work and simulation results.

**Oliver Davis** (King’s College London). Talking about “visual analysis” of geocoded twin data. Why do visual analysis? Shows the “classic” example (Anscombe’s quartet, I believe) with 4 datasets having the same mean, variance, correlation, and regression line, but which are totally different. The human visual system is clever. Shows optical illusions with squares that are the same colour, but aren’t the same colour, or maybe I’ve got that the wrong way round, but either way, it’s surprising. But then gives a more serious misleading example from a published plot. Now talking about TEDS and the “standard twin model” with additive, shared, and non-shared components. Now doing visualisation. Sibling pairs on the map of Britain, most concentrated in central London. Seems to show a plot that shows heritability varies geographically. (That’s what he says, anyway. I think this is all talking about the “standard twin model”, so it is about correlations between sibs’ phenotype values.) London has high heritability, Liverpool and Newcastle have low heritability. A questioner says heritability is a meaningless construct, and that’s what the speaker’s plots show. The speaker says that he was surprised to see the geographical aspect of the results.

**Second session**

**Suzanne Leal** (Baylor). “Analysis, power and replication of complex trait rare variants”. Age-related Macular Degeneration was a special case. Talking about the model that complex traits are the result of multiple rare variants with large phenotypic effects; says few direct tests of this hypothesis have been reported. Interesting because of the missing (measured) heritability, usually << 10%. Other possibilities: interactions, epigenetics, etc. GWAS is poorly powered to detect rare variants due to low correlation with tag SNPs. Haplotype analysis can increase power, but is still underpowered. Need direct mapping, starting with identification of rare variants using sequencing (which is getting cheaper, etc.) For analysis, use aggregation methods, e.g. across a gene. Survey of methods for rare variant association: the RVE method (Cohen et al, Science 2004), the CMC method (Li & Leal, AJHG 2008), WSS (Madsen & Browning, PLoS Genetics 2009), KBAC (Liu & Leal, PLoS Genetics 200?), VT (Price et al, AJHG 2010), RareCover (Bhatia et al 2010, PLoS Comp. Biology), ANRV (Morris & Zeggini 2010, Genetic Epidemiology). Also aSum (Han & Pan, Hum Hered 2010), C-alpha (Neale et al 2011, PLoS Genetics), and TestRate (Ionita-Laza et al 2011, PLoS Genetics) when you think there are protective and deleterious variants in the same gene. Power analysis: estimate the needed sample size. For rare variants it is necessary to generate data under a realistic disease model and variant spectrum. RarePower generates variant data with demographic history under realistic models (Boyko et al 2009, PLoS Genetics; Kryukov et al 2009, PNAS; Williamson et al 2005, PNAS). Can incorporate purifying selection. Quantitative traits (via a linear model) and qualitative traits. Different population sampling strategies supported. Lots of specifiable parameters. Details also given for qualitative traits. P-values estimated by permutation (performed adaptively for speed). Most (or maybe all?) of the previously mentioned tests are supported.
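For concreteness, here is roughly what the simplest of these aggregation methods looks like: a CMC-style collapsing of rare variants into a carrier indicator, plus a permutation p-value of the kind just described. This is my own toy sketch, not code from any of the cited papers:

```python
import numpy as np

def collapse(G, maf_threshold=0.01):
    """CMC-style collapsing: indicator of carrying any rare allele in the gene.
    G is an (individuals x variants) genotype matrix of 0/1/2 allele counts."""
    maf = G.mean(axis=0) / 2.0
    rare = maf < maf_threshold
    return (G[:, rare].sum(axis=1) > 0).astype(float)

def perm_pvalue(score, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for a higher mean score in cases."""
    rng = np.random.default_rng(seed)
    obs = score[is_case].mean() - score[~is_case].mean()
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(is_case)          # shuffle case labels
        hits += (score[perm].mean() - score[~perm].mean()) >= obs
    return (hits + 1) / (n_perm + 1)
```

The adaptive version the speaker mentions would stop permuting early once enough “hits” accumulate, since a p-value that is clearly non-significant doesn’t need 3000 permutations.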

What are the most powerful methods? It is difficult to compare them from the literature because of different simulation contexts. The speaker presents an unbiased (she hopes) comparison with a European demographic model, mutation rate (I missed the value), gene size 1500bp, variants at <1% frequency, 3000 permutations to evaluate p-values, and 2000 replicates to evaluate power. 1. Variable effects model: OR=1.5–5.0 depending on allele frequency. The best tests (in terms of power) are VT, RareCover, KBAC, WSS. RVE does not do well at all. Next, fixed effects model, OR=2: KBAC, WSS, ANRV good, RVE still bad. Effect of gene length: length=1500, 5000, 10000. Power increases with gene length for all methods, but this is not a very good comparison because it does not take into account known causal variants. With either the variable effects or fixed effects model, power falls off sharply with the proportion of causal variants. Next, a scenario with protective & detrimental variants in different proportions. KBAC, WSS, VT do well when there are no protective variants. Nothing does all that well in the around half-protective, half-detrimental scenario, not even the methods designed for this scenario. Some methods improve again when almost all variants are protective. Running time (without adaptive permutation) is 5 seconds to 254 minutes depending on which test is run. Summary: some tests are superior. Advice might be to use different tests (keeping in mind Multiple Testing, that scourge of statisticians worldwide). Also, note that this analysis has been (almost) exclusively about power.

Now a change of topic: replication of gene- or region-based association studies. Because of spurious associations, of course. (Spurious associations make me furious.) (The word “independent” is underlined.) Ways to replicate: variant-based (only known variants are typed in replication; cheaper) or sequence-based (replication samples are sequenced; more expensive, but you might get more variants). Comparison of power for these two methods, generating variants via a population-genetic model (African), with genotyping and sequencing error rates taken from existing studies. Variable effects model (others are similar), 50% of sites causal (really? that seems a lot). Sequencing errors are platform- and coverage-dependent and not “random”. Genotyping is at known variant sites, a proportion of which can’t be genotyped. The “error ratio” is the ratio between genotyping and sequencing error rates. Small/medium and large-scale studies considered. CMC and WSS used. Conclusions: in small studies, 60% of causal variants are uncovered, explaining >90% of the population-attributable risk; in large studies >90% are uncovered, almost all of the PAR. Power is better for the sequence-based study, but only by a little bit. Also applied to a real dataset, the Dallas Heart Study, with various metabolism traits, extremalised to make a case-control study: <10% and >90% in “discovery” and 10–25% and 65–90% in “replication”. ANGPTL4 and another known gene which I didn’t catch were uncovered. Conclusions: sequence-based replication is more powerful, but not much more, so customised genotyping might be a better choice, particularly as this facilitates large sample sizes for replication. But proceed with caution if you are sampling from a different population in replication, as this can strongly affect results.

A questioner asks whether there is any data indicating that rarer variants have stronger effects, answer is that there are few (but some) studies to that effect. Another questioner points out that none of these methods tell you which variants cause the disease (they just do an association test in aggregate), and does the speaker think people will want to know that? Answer is that yes, this is a big problem. Another questioner says that in his data (14000 sequences, I think), over 50% of variants are singletons and this will make sequence-based replication more appealing.

**Lunch.** Sandwiches and other things. I could do with a bag of chips.

**Third session**

**Daniel Crouch** (King’s College London). “Ancestry in forensic casework”. The 2004 Madrid train bombings (Phillips et al 2009, PLoS ONE); “Operation Minstead”, where ancestry analysis was used to trace a rapist to the south Caribbean (genetically, I presume). A new method for providing information on parental ancestry. The method is designed for *recent* admixture. The “Wahlund effect”: departures from HWE in the first generation after admixture, depending on the difference in allele frequencies between the source populations, and restored in subsequent generations. Generalised to N populations; the formulas are a bit complicated but nicely worked out on the slides. Simulations under the Balding-Nichols model: simulate parents then offspring and maximise over parameters. Simulations using HapMap III, offspring 50/50 admixed, parent 1 European, parent 2 African: works very well. Japan versus China works less well for parental admixture. Conclusions: first-generation admixture between divergent populations can be identified using GWAS data. Absence of parental divergence is harder to discern. Future work: phased data, intercontinental admixture.
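The two-population Wahlund deficit is easy to compute directly; a little sketch of my own (equal-sized source populations, each in HWE, invented allele frequencies):

```python
import numpy as np

def heterozygosity_deficit(p1, p2):
    """Wahlund effect for two equally sized source populations in HWE.

    In the pooled (not yet interbred) sample, observed heterozygosity is the
    average of the within-population values, which falls below the HW
    expectation at the pooled allele frequency by 2*Var(p).  One generation
    of random mating restores HWE, which is why the method targets *recent*
    admixture.
    """
    p_bar = (p1 + p2) / 2
    h_obs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    h_exp = 2 * p_bar * (1 - p_bar)
    return h_exp - h_obs            # = 2 * Var(p) >= 0

deficit = heterozygosity_deficit(0.1, 0.7)   # 2 * 0.09 = 0.18
```

The deficit vanishes when the source populations have equal frequencies, which is the “absence of parental divergence is harder to discern” point.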

**Kimmo Palin** (Sanger). “Systematic long-range phasing (SLRP) with the max-product algorithm”. Statistical haplotype phasing is not very accurate. Most methods assume a general population, most recent common ancestor a very long time ago, and a small number of extant haplotypes. This works well e.g. for HapMap or WTCCC controls, but we could do better in isolated populations. Long-range phasing: P(nth-degree cousins share a locus IBD) is small, but the expected length of the segment around a shared locus is many centimorgans (that is, very long). A rule-based method was developed by deCODE, who have genotyped 10% of Icelanders, but here is a model-based, Bayesian-network method. Diplotype = ordered pair of haplotypes; the model combines the probability of the diplotype given the genotype, a prior on the diplotype, and the IBD status, which is assumed to be Markovian along a chromosome, given the diplotypes. IBD states form a continuous-time Markov chain parameterised by the expected lengths of the IBD (10cM) and IBS (1cM) states. Simulations: isolated founder population, 100 CEU founders, 12 generations. In this setting SLRP does much better than Mach or Beagle (comparison presented in terms of error rate). Real data from Orkney. Orkney is a mixture of Viking and later Scottish immigrant haplotypes. 599 individuals + parents for 169; 327 distantly related individuals + parents for 102. Again SLRP does much better than Mach or Beagle. David Balding says “what do you mean by IBD?” and says that it doesn’t mean anything, really (unless it is defined with respect to a well-defined set of founders, but that’s not possible with a real population like Orkney, which has at least two sets of founder haplotypes).

**Uwe Roesler**. “Population genetics on groups”. Motivation: modelling microsatellites. Mutations can change the length of a microsatellite by one repeat unit. Shifted frequency distributions of length for different populations, which the speaker would like to model. Haploid population of size N with coalescence, independent mutations, exchangeability; mutations increase or decrease the length by one. Write $X^n = (X^n_1, \dots, X^n_N)$ for the vector of states in the nth generation, with i.i.d. transitions (choices of mothers) and mutations. The model is an iterated function system. Advantage: it captures the essential structure (e.g. no exchangeability), with a spatial interpretation equating spatial distance with mating distance. Coalescence: the most recent common ancestor gives a unique life line. Global picture: look at one particle, $X^n_1$. What about the others? Intuition: the states in a generation are strongly dependent on each other and are ‘close’ together. For the lengths of microsatellites, look at the vector of distances $(X^n_i - X^n_1)_i$. Main theorems (my notation; the slides went by fast). 1. $X^n_1$ and the vector of distances are asymptotically independent given the coalescence time $\tau_n$. 2. A corresponding pair is exactly independent given $\tau_n$ (I didn’t catch which). 3. The distance vector converges exponentially fast to the invariant distribution. 4. $X^n_1$, followed along its line of descent, is a random walk. Is modelling with a group as the state space (rather than the positive integers) OK? Apparently yes, because the walk spends most of its time away from the boundary. (Note on interpretation: $X^n_i$ is the state (= microsatellite length) of particle i in the nth generation, and the ancestral process tracks the ancestor in the mth generation of the ith particle of the nth generation.)
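The stepwise mutation model underlying all this is easy to simulate; a toy haploid Wright-Fisher sketch of my own (all parameters invented):

```python
import numpy as np

def simulate_lengths(pop_size, generations, mu, start=20, seed=0):
    """Haploid Wright-Fisher population under the stepwise mutation model:
    each child copies a random mother's microsatellite length, then with
    probability mu the length steps up or down by one repeat unit."""
    rng = np.random.default_rng(seed)
    lengths = np.full(pop_size, start)
    for _ in range(generations):
        mothers = rng.integers(0, pop_size, size=pop_size)  # i.i.d. choice of mothers
        lengths = lengths[mothers]
        steps = rng.choice([-1, 0, 1], size=pop_size,
                           p=[mu / 2, 1 - mu, mu / 2])
        lengths = np.maximum(lengths + steps, 1)            # lengths stay positive
    return lengths

final = simulate_lengths(pop_size=200, generations=100, mu=0.01)
```

Running this for two separated populations gives exactly the kind of shifted length distributions the speaker wants to model, and following one lineage back in time shows the random-walk behaviour of theorem 4.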

**Amke Caliebe** (Institute of Legal Medicine, Berlin). Also forensics. “What evidence can rare haplotypes survive?”. Null hypothesis: the suspect is not the source of the trace; a random person is the source. Alternative: the suspect is the source of the trace. The suspect is male and the victim female. (This reminds me of Peter’s paper about the court case.) Likelihood ratio test. Difficulty: estimating the probability of the suspect haplotype if that haplotype is not in the known database. Example: a man and woman split up. The animation draws laughs. Later, the woman is attacked and left for dead. This draws no laughs. (I’m not sure of the wisdom of putting pictures of these people on the slides. They both *look* nice enough. Maybe they’re not the real people.) Discussion of two different methods for estimating the desired probability, but I’m afraid I haven’t been paying attention. In the example the two different methods give similarly low probabilities. This low number was enough to make the suspect confess, whereupon he was given a jail sentence.

**Clive Hoggart**. Fine-scale estimation of birthplace. Systematic non-random mating in populations results in stratification. Shows the image from “Genes mirror geography within Europe”, Novembre et al, Nature Genetics (2008), which uses the top 2 PCs to predict geographical origin to within a few hundred kilometres. The speaker is talking about the North Finland Birth Cohort 1966. Investigated population structure using PCA, performed on 61917 SNPs thinned for LD. The settlement of Finland came in two phases: early settlement of the south and west, inhabited for many millennia, followed by 16th-century migration into other areas. Now showing PCA side-by-side with the map. Higher-order PCs have “no linear relationship” with geography, but do take extreme values in some towns. Introduces “pcLOCATE”, a program that uses the top p PCs to estimate the probability of origin from each town, modelling the kth PC for an individual from the jth town. The estimated location is then a weighted average of the locations of the different possible towns of origin. Seems to require a known parental town of origin (?), but this can’t really be true because he goes on to estimate this. The model does better than the linear model of the cited NG paper, accuracy increasing with the number of PCs, for estimating (A) parental birthplace and (B) most recent residence. Summary: pcLOCATE exploits the local smoothness of PC variation.
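The weighted-average step at the end is simple enough to sketch; here is my own toy version (made-up towns and posterior probabilities, nothing to do with the actual pcLOCATE implementation):

```python
import numpy as np

# Hypothetical town coordinates (x, y) and a posterior over towns of origin,
# as might come out of a pcLOCATE-style model built on the top PCs.
towns = np.array([[0.0, 0.0],
                  [10.0, 0.0],
                  [0.0, 10.0]])
posterior = np.array([0.7, 0.2, 0.1])   # P(origin = town j | PCs)

# Estimated birthplace = posterior-weighted average of the town locations
estimate = posterior @ towns
```

When the posterior is concentrated on one town, the estimate collapses to that town’s coordinates, which is the obvious sanity check.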

**Dominik Glodzik** (Edinburgh). Inference of shared ancestral haplotypes in population isolates. Isolated population -> only a few common ancestors -> only a few haplotypes at any locus. So we should be able to find the haplotypes from genotypes, and find those associated with disease. (And something about sequencing.) SNP data from ORCADES (749), Korcula (945), Vis (991), SOCCS (958). Tree figure from Morris AP et al, 2002. Description of phasing using parents or surrogate parents. To find these, look for regions with long stretches of no opposing homozygotes. Sharing can start or end anywhere. A plot of the number of surrogate parents against genomic position shows most sharing in the middle of chromosomes. Orkney has the most sharing, SOCCS the least. 65% or 72% accuracy of phasing, depending on the IBD threshold, which also affects inconsistencies. Error rate 0.558% using parent-offspring trios. How many haplotypes are there really? About 40% of haplotypes cluster into big clusters on average. Optimising sequencing studies: if we sequenced 200 Orkney individuals, about 95% of all individuals would have a surrogate parent within the 200. Somebody asks the same “what is IBD?” question again, and about populations without a clearly-defined founder event.
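A sketch of my own of the opposing-homozygote idea (not Dominik’s actual method, which handles genotyping errors and thresholds more carefully):

```python
import numpy as np

def opposing_homozygotes(g1, g2):
    """Boolean mask of sites where two genotype vectors (0/1/2 allele counts)
    are homozygous for different alleles -- impossible under IBD sharing."""
    return ((g1 == 0) & (g2 == 2)) | ((g1 == 2) & (g2 == 0))

def longest_shared_run(g1, g2):
    """Length of the longest stretch with no opposing homozygotes: a crude
    proxy for a shared ('surrogate parent') haplotype segment."""
    best = run = 0
    for opp in opposing_homozygotes(g1, g2):
        run = 0 if opp else run + 1
        best = max(best, run)
    return best
```

Scanning every pair of individuals for long such runs is how the surrogate parents are found; a single genotyping error can break a run in two, which is why real implementations tolerate a few inconsistencies.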

**Fourth Session.**

**Elinor Jones** (Leicester). Mendelian randomisation studies. Estimate the causal effect of X on Y in the presence of unobserved confounding. Presents a Bayesian model. Why go Bayesian? (Because it’s better.) Sorry, this is a well-explained talk but my brain is dozing. There are questions about priors and their specification, and why we need them.

**So-Youn Shin** (Sanger). SEM modelling for causal inference from genetic variants. “Metabolite profiles and the risk of developing diabetes”, Wang et al, Nature Medicine (2011). Want to test the model where a SNP affects lipids through a metabolite. SEM is an extension of the regression model. Pearson’s goodness-of-fit test is used to check whether the model agrees with the data. Considers 10 models for the effects between SNP, lipid and metabolites (they differ in the positions and directions of the causation arrows). Pros: SEMs allow both direct and indirect effects, and variables can be both response and predictor simultaneously. Cons: nonlinearity can’t be detected, and hidden confounders can affect things. Association test in KORA (N=~1800), 95 lipid SNPs, 151 metabolites, 4 lipids. Replication in (singletons from) TwinsUK, N=800, two-stage least squares. In a separate stream, PCA was used first to reduce dependence on highly correlated metabolites. Results: Model 4 (lipid SNP -> metabolite -> lipid) found 318 pathways. (What? Where did pathways come from?) And <X> (a hundred and something) replicated. But only 9 pathways in the PCA version, 3 replicated. (In response to a question, says that’s probably because of the high correlation between metabolites leading to the high number of pathways in the first version.) Example: PC.aa.C36.3, model 4 fits best.
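Two-stage least squares with a SNP as the instrument is easy to sketch; here is my own toy simulation (all effect sizes invented), showing OLS picking up the confounding while 2SLS recovers the causal effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

# Hypothetical simulation: SNP g -> exposure x -> outcome y, with a hidden
# confounder u affecting both x and y (so OLS of y on x is biased).
g = rng.binomial(2, 0.3, size=n).astype(float)
u = rng.standard_normal(n)
x = 0.5 * g + u + rng.standard_normal(n)
beta = 0.5                               # true causal effect of x on y
y = beta * x + u + rng.standard_normal(n)

# Stage 1: regress x on the instrument g; Stage 2: regress y on fitted x.
# With a single SNP this is equivalent to the Wald ratio cov(g,y)/cov(g,x).
xhat = np.polyval(np.polyfit(g, x, 1), g)
b_2sls = np.polyfit(xhat, y, 1)[0]       # close to 0.5
b_ols = np.polyfit(x, y, 1)[0]           # biased upwards by the confounder
```

The instrument only works because g affects y solely through x and is independent of u, which is exactly the assumption the SEM arrow diagrams are meant to encode.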

**Christina Loley**. Why was the X chromosome ignored? The X chromosome is special: different numbers in males and females, and X-chromosome inactivation. But some loci may escape inactivation in some females, so one should model with and without inactivation. Model without inactivation: females have 0, 1, or 2 risk alleles, males 0 or 1. Model with inactivation: males are like homozygous females. No inactivation: tests proposed by Zheng et al (2007), Genet. Epidemiology: an allele-based $\chi^2$ test; an allele-based test for males + trend test for females; a linear combination of allele-based tests for males and females; a linear combination of an allele-based test for males and a trend test for females. A dominance model is also possible. Talking about variance estimates; it all sounds like David Clayton’s paper to me. Simulation study. Big slide about which Type I errors are ok (tick) or not (cross). Lack of HWE gives problems for allele-based tests. Sex-specific allele frequencies cause problems for non-stratified tests. Power comparison: no test was uniformly most powerful over all genetic models. One test (I didn’t catch which) does best under sex-specific allele frequencies. A questioner asks if it is known which regions are subject to inactivation or not; she says that it is getting better known, and some females may be inactivated and others not at the same locus. Another questioner asks what causes sex-specific allele frequencies; she doesn’t know, but has observed it.
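The two coding schemes are simple to write down; a sketch of my own:

```python
import numpy as np

def code_x_genotypes(alleles, is_male, inactivation=True):
    """Code X-chromosome genotypes as risk-allele doses.

    `alleles` holds risk-allele counts: 0/1/2 for females, 0/1 for males.
    Under the X-inactivation model, hemizygous males are treated like
    homozygous females (dose 0 or 2); without inactivation they keep 0 or 1.
    """
    alleles = np.asarray(alleles, dtype=float)
    dose = alleles.copy()
    if inactivation:
        dose[is_male] = 2 * alleles[is_male]
    return dose

a = np.array([0, 1, 2, 0, 1])                      # last two are males
m = np.array([False, False, False, True, True])
```

A trend test on either coding is then just a regression of case status on the dose, which is why getting the male coding right matters for power.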

**Tom Cattaert** (Liege). Impact of genotyping error. *(My battery is about to die.)* He is doing a literature survey. Types studied (binary traits): random error, affecting cases and controls equally. GE reduces power, especially for low MAF or bad tags. Kang et al consider a general genotyping model using 6 GE rates. The most severe type of error is when a frequent type is read as a less frequent type. Now QTLs, the topic of this talk. The trait is normal with a genotype-specific mean. *Battery dead!* Probabilistic model. *OK, battery not dead any more (because this is now Tuesday and I’m catching up, so these notes will be even more sketchy than usual).* Simulation for a QTL with random GE, and non-random GE whose severity depends on the trait value. Conclusion: we must try to reduce genotyping errors as much as possible. Um. A questioner says it would be interesting to compare these GE models to commonly used QC metrics.
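A general genotyping-error model of this kind is essentially a 3x3 misclassification matrix; here is a toy version of my own (invented error rates, not Kang et al’s values) showing how reading a frequent genotype as a rarer one biases the apparent allele frequency:

```python
import numpy as np

# Hypothetical 3x3 genotyping-error matrix: E[i, j] = P(read j | true i),
# for genotypes coded 0/1/2 copies of the risk allele.  Here some of the
# common genotype-0 calls leak into genotypes 1 and 2.
E = np.array([[0.98, 0.015, 0.005],
              [0.01, 0.98, 0.01],
              [0.005, 0.015, 0.98]])

p = 0.1                                            # risk-allele frequency
true_geno = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])  # HWE freqs
observed = true_geno @ E                           # genotype freqs as read

true_freq = (true_geno * [0, 1, 2]).sum() / 2      # 0.10
obs_freq = (observed * [0, 1, 2]).sum() / 2        # inflated
```

Because genotype 0 is by far the most common, even small leak rates out of it dominate the bias, which is the “frequent type read as a less frequent type” point.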

**Jinghua Zhao**. “PLSPM-based statistics for regional and polygenic association on latent QTLs”. This is similar in principle to So-Youn Shin’s talk (above), but the speaker claims “partial least-squares path modelling” (PLSPM) is better than SEM (even though it is still linear). Bootstrap used for significance testing. EPIC-Norfolk data, 25,631 individuals, 12,559 replicants.

**Daniel Barnes** (Cambridge) Non-random ascertained gene mutation carriers. About cancer and BRCA1/2 mutations. A systematic evaluation of association methods. Sorry, I didn’t note the details.

**Poster session.** After looking at the posters for a while I wandered off over the bridge to the National Theatre, where I found a band playing a funky groove in 11/8, in fact, 3,3,2,2,3,2,2,3,2. Then they crotcheted down the tension for a beautiful bridge which was 4/4/5 in crotchets before quavering it up again.

**Dinner.** Dinner was on a boat in the river, which swayed gently in the swell (either that or I was more drunk than I thought). There were seats, but not enough, so we stood.

**To be continued** here, in fact.

YEAH