Sunday, 23 April 2017

Sample selection in genetic studies: impact of restricted range

I'll shortly be posting a preprint about methodological quality of studies in the field of neurogenetics. It's something I've been working on with a group of colleagues for a while, and we are aiming to make recommendations to improve the field.

I won't go into details here, as you will be able to read the preprint fairly soon. Instead, what I want to do here is to expand on a small point that cropped up as I looked at this literature, and which I think is underappreciated.

It's to do with sampling. There's a particular problem that I started to think about a while back when I heard someone give a talk about a candidate gene study. I can't remember who it was or even what the candidate gene was, but basically they took a bunch of students, genotyped them, and then looked for associations between their genotypes and measures of memory. They were excited because they found some significant results. But I was, as usual, sitting there thinking convoluted thoughts about all of this, and wondering whether it really made sense. In particular, if you have a common genetic variant that has such a big effect on memory, would this really show up in a bunch of students – who are presumably people who have pretty good memories? Wouldn't it rather be the case that what you'd expect would be an alteration in the frequencies of genotypes in the student population?

Whenever I have an intuition like that, I find the best thing to do is to try a simulation. Sometimes the intuition is confirmed, and sometimes things turn out differently and, very often, more complicated.

But this time, I'm pleased to say my intuition seems to have something going for it.

So here's the nuts and bolts.

I simulated genotypes and associated phenotypes using R's nice mvrnorm function. For the examples below, I specified that a and A are equally common (i.e. minor allele frequency is .5), so we have 25% as aa, 50% as aA, and 25% as AA. The script lets you specify how closely genotype relates to phenotype, but from what we know about genetics, it's very unlikely that a common variant would have a correlation with the phenotype of more than about .25.
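The script itself isn't reproduced here, but the basic setup can be sketched along these lines (a minimal sketch, not the actual script; the sample size, seed, and the value of r are arbitrary choices, and the realised genotype-phenotype correlation will be a little lower than r because of the discretisation into three genotypes):

```r
# Minimal sketch: draw a latent genetic value and a phenotype from a
# bivariate normal with correlation r, then cut the latent value at the
# 25th and 75th centiles to get genotypes in the expected 1:2:1 ratio
# for aa:aA:AA (i.e. MAF = .5)
library(MASS)                    # for mvrnorm
set.seed(1)
n <- 10000
r <- .25                         # assumed genotype-phenotype correlation
sigma <- matrix(c(1, r, r, 1), nrow = 2)
dat <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)
latent <- dat[, 1]
phenotype <- dat[, 2]
genotype <- cut(latent, breaks = qnorm(c(0, .25, .75, 1)),
                labels = c("aa", "aA", "AA"))
nA <- as.numeric(genotype) - 1   # number of A alleles: 0, 1 or 2
table(genotype) / n              # proportions close to .25, .50, .25
```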

We can then test for two things:
1)  How far does the distribution of genotypes in the sample (i.e. people who are aa, aA or AA) resemble that in the general population? If we know that MAF is .5, we expect this distribution to be 1:2:1.
2) We can assign each person a score corresponding to number of A alleles (coding aa as zero, aA as 1, and AA as 2) and look at the regression of the phenotype on the genotype. That's the standard approach to looking for genotype-phenotype association.
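Assuming a genotype factor, an allele count nA, and a phenotype vector like those produced by the simulation described above, both tests are one-liners in R:

```r
# 1) Do genotype counts depart from the expected 1:2:1 population ratio?
chisq.test(table(genotype), p = c(.25, .50, .25))

# 2) Standard association test: regress phenotype on allele count
summary(lm(phenotype ~ nA))
```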

If we work with the whole population of simulated data, these values will correspond to those that we specified in setting up the simulation, provided we have a reasonably large sample size.

But what if we take a selective sample of cases who fall above some cutoff on the phenotype? This is equivalent to taking, for instance, a sample of students from a selective institution, when the phenotype is a measure of cognitive function: you're not likely to get into the institution unless you have good cognitive ability. Then, working with this selected subgroup, we recompute our two measures, i.e. the proportions of each genotype, and the regression of phenotype on genotype.
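In R terms, selection on the phenotype is just subsetting before re-running the same two tests; for example (again assuming the simulated genotype, nA and phenotype objects from a sketch like the one above):

```r
cutoff <- 0.5                       # z-score cutoff: roughly the top third
keep <- phenotype > cutoff

# genotype proportions in the subsample vs population expectation
chisq.test(table(genotype[keep]), p = c(.25, .50, .25))

# genotype-phenotype regression within the restricted range
summary(lm(phenotype[keep] ~ nA[keep]))
```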

Now, the really interesting thing here is that, as the selection cutoff gets more extreme, two things happen:
a) The proportions of people with different genotypes start to depart from the values expected for the population in general. We can test when the departure becomes statistically significant with a chi square test.
b) The regression of the phenotype on the genotype weakens. We can quantify this effect by just computing the p-value associated with the correlation between genotype and phenotype.
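Both p-values can be tracked across a range of cutoffs with a short loop (same assumed objects as above):

```r
# p-values for the chi square test (a) and the regression (b)
# at increasingly extreme z-score cutoffs
cutoffs <- seq(0, 1.5, by = .25)
res <- sapply(cutoffs, function(z) {
  keep <- phenotype > z
  c(p.chisq = chisq.test(table(genotype[keep]),
                         p = c(.25, .50, .25))$p.value,
    p.regress = summary(lm(phenotype[keep] ~ nA[keep]))$coefficients[2, 4])
})
colnames(res) <- cutoffs
round(res, 4)
```

With a very large simulated sample both tests may remain significant throughout this range; whether and where the two lines cross depends on the sample size and the true effect size, as discussed below.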

Figure 1: Genotype-phenotype associations for samples selected on phenotype

Figure 1 shows the mean phenotype scores for each genotype for three samples: an unselected sample, a sample selected with z-score cutoff zero (corresponding to the top 50% of the population on the phenotype) and a sample selected with z-score cutoff of .5 (roughly selecting the top third of the population).

It's immediately apparent from the figure that the selection dramatically weakens the association between genotype and phenotype. In effect, we are distorting the relationship between genotype and phenotype by focusing just on a restricted range. 

Figure 2: Comparison of p-values from a conventional regression analysis and from a chi square test on genotype frequencies, in relation to sample selection

Figure 2 shows the data from another perspective, by considering the statistical results from a conventional regression analysis when different z-score cutoffs are used, selecting an increasingly extreme subset of the population. If we take a cutoff of zero, in effect selecting just the top half of the population, the regression effect (predicting phenotype from genotype), shown in the blue line, which was strong in the full population, is already much reduced. If you select only people with z-scores of .5 or above (equivalent to an IQ score of around 108), then the regression is no longer significant. But notice what happens to the black line. This shows the p-value from a chi square test which compares the distribution of genotypes in each subsample with the expected population values. If there is a true association between genotype and phenotype, then the greater the selection on the phenotype, the more the genotype distribution departs from expected values. The specific patterns observed will depend on the true association in the population and on the sample size, but this kind of cross-over is a typical result.

So what's the moral of this exercise? Well, if you are interested in a phenotype that has a particular distribution in the general population, you need to be careful when selecting a sample for a genetic association study. If you pick a sample that has a restricted range of phenotypes relative to the general population, then you make it less likely that you will detect a true genetic association in a conventional regression analysis. In fact, if you take a selected sample, there comes a point when the optimal way to demonstrate an association is by looking for a change in the frequency of different genotypes in the selected population vs the general population.

No doubt this effect is already well-known to geneticists, and it's all pretty obvious to anyone who is statistically savvy, but I was pleased to be able to quantify the effect via simulations. It is clear that it has implications for those who work predominantly with selected samples such as university students. For some phenotypes, use of a student sample may not be a problem, provided they are similar to the general population in the range of phenotype scores. But for cognitive phenotypes that's very unlikely, and attempting to show genetic effects in such samples seems a doomed enterprise.

The script for this simulation, simulating genopheno cutoffs.R, should be available here:

(This link updated on 29/4/17).


  1. A familiar variant of this is, if you aim to compare two groups on something (gene frequencies, brain measures, cognitive scores), sampling from a population where the two groups differ less than in the general population is not a sensible strategy.
    As I'm arguing in a forthcoming paper, looking for brain correlates of dyslexia in a university student population (or in children reading just 1SD below the norm) is likely to decrease the expected effect size, and hence the likelihood of finding a reliable difference. A more sensible sampling strategy would be to recruit the most severe dyslexic individuals that can be found. Yet my feeling is that neuroimaging studies of dyslexia have particularly lax inclusion criteria.

    1. Thanks Franck. It is, of course, much harder to get truly representative samples, and with cognitive phenotypes it is often the case that those who have major difficulties are less likely to volunteer.
      Marcus Munafo also drew my attention to this paper on 'collider bias' which is relevant:

  2. Off topic: even though I don't really understand github I managed to grab the code and it runs in R.

    More or less on topic, I read the post, without understanding 50% and still said. Oh yes, obvious once someone points this out.

    Now I need to go back and read this carefully and see if I really agree. :) It is considerably outside my area but the implications are fascinating.

  3. That’s a very good work that you and your colleagues are doing. Quality and methods of teaching and learning needs a lot of improvement. Glad someone is working on it.

    1. I think if you are offering your services as someone writing CVs, you should polish up your written English.