Fixed or random?

Want to run a mixed model but don’t know which of your variables to have as fixed effects and which as random effects? Follow my simple guide for deciding if it will be a fixed or random effect below! There will be exceptions, but after all, Biology is the science of exceptions…

  geese live in social groups, except if you’re a fox

You have some data (whoop!). Your response is a continuous variable (e.g. body size), a count (e.g. number of offspring) or binary (survived or not). You have a bunch of variables that you want to include, some of which are continuous, some categorical. You are happy with using a general (or generalised) linear model, but perhaps it should be a mixed model. That means that some of the variables are fixed effects and some are random effects. Let’s assume you know this, but your statistical knowledge starts petering out here. Which of those variables are fixed effects, and which are random?

Just follow this simple two-step guide:

For each predictor variable:

  1. Is your variable a) categorical or b) continuous?

If a), proceed to step 2. If b), fixed effect.

  2. Do you want to know a) the general effect of the factor on the population, or b) the specific differences among levels of the factor?

If a), random effect. If b), fixed effect.
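If you like your guides executable, the two steps above can be sketched as a tiny Python function. The function name, argument names and example predictors are all made up for illustration; the logic is just the two questions, nothing more.

```python
def effect_type(is_categorical: bool, want_level_differences: bool = False) -> str:
    """Return 'fixed' or 'random' for one predictor, following the two-step guide.

    Hypothetical helper: the argument names simply encode the two questions.
    """
    # Step 1: continuous predictors go in as fixed effects.
    if not is_categorical:
        return "fixed"
    # Step 2: categorical predictors are fixed if you care about the
    # specific levels, random if you only care about the overall effect.
    return "fixed" if want_level_differences else "random"


# Illustrative predictors (assumed, not from a real data set).
predictors = {
    "height": effect_type(is_categorical=False),
    "school": effect_type(is_categorical=True, want_level_differences=False),
    "treatment": effect_type(is_categorical=True, want_level_differences=True),
}
print(predictors)
```

As with the guide itself, this ignores the exceptions, so treat its answer as a starting point rather than a verdict.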

 How did we arrive at these decisions?

The first step comes from the fact that a random effect uses 1 degree of freedom in a model, but only gives you an estimate of the variance accounted for by that effect.

For continuous fixed effects, the model is already estimating the intercept/population mean, so estimating a slope (the effect of a continuous variable on the response) only uses 1 degree of freedom as well. The slope, however, gives you information on the direction of the relationship as well as its strength. So if you have a continuous variable, it may as well always be a fixed effect [1].

For the second step, the fact that a random effect only gives you a measure of variance is again key. There is no information about whether being in group A, B or C results in higher or lower values of the response. You can only get at that by having the variable as a fixed effect. If A, B and C are very different from each other, you will find that a large amount of variance is accounted for by the random effect, but you will still not know which group gives high scores and which gives low scores [2].
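To see why a variance alone cannot tell you who is on top, here is a minimal sketch using made-up group means. It computes the plain variance among three group means, which is only a simplified stand-in for the variance component a mixed model would actually estimate. The numbers are invented for illustration.

```python
from statistics import pvariance

# Hypothetical group means (not real data).
scenario_1 = {"A": 2.0, "B": 5.0, "C": 8.0}  # group C scores highest
scenario_2 = {"A": 8.0, "B": 5.0, "C": 2.0}  # group A scores highest

# Variance among the group means in each scenario.
var_1 = pvariance(list(scenario_1.values()))
var_2 = pvariance(list(scenario_2.values()))

print(var_1, var_2)  # the two variances are identical
```

The ranking of the groups is reversed between the two scenarios, yet the variance is the same. That is the sense in which a random effect tells you how much the groups differ, but not which way.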

You use a fixed effect if you wish to know about the levels of the effect; you use a random effect if you are only interested in the effect on the population at large.

Of course there can be exceptions. Let’s say you are analysing the effect of height on schoolchildren’s test scores. You also have the categorical variable of the school a child went to. You know that school will have an effect on the population, so it needs to be included in the model, probably as a random effect. However, if your categorical variable has few levels (2-4), you have a large data set (100s of points) and you are not fitting many other variables, then you will have a very large number of degrees of freedom to play with. In that case, you can fit the variable as a fixed effect and gain the information about the different levels, i.e. which school leads to better test scores. So although your focus was not on which schools give higher-scoring children, if you have the degrees of freedom spare you can gain the result for free by fitting school as a fixed rather than a random effect.

However, this will then limit your inference to only those schools. A benefit of random effects is that they assume the categories in the analysis are only a subset of all possible categories in the population (more precisely, a random sample of categories, where the effect sizes across all categories in the population follow a normal distribution). So it depends on the question. Do you care about the schools, or the whole population of children?

The answer is probably both. Go get more data.

  data collection: like a large pink pavement
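The degrees-of-freedom trade-off in the school example can be sketched with some back-of-the-envelope arithmetic. This follows the simplified counting used in this post (a random effect costs 1 degree of freedom, a categorical fixed effect with k levels costs k - 1); the function name, sample size and number of schools are assumptions for illustration, not a real analysis.

```python
def residual_df(n_obs: int, n_levels: int, school_as_fixed: bool,
                n_other_params: int = 1) -> int:
    """Rough residual degrees of freedom under the post's simplified counting.

    Hypothetical helper: n_other_params covers the other fixed-effect
    slopes in the model (here just height).
    """
    intercept = 1
    # Fixed: one parameter per level beyond the baseline level.
    # Random: a single variance parameter, whatever the number of levels.
    school_cost = (n_levels - 1) if school_as_fixed else 1
    return n_obs - intercept - n_other_params - school_cost


# 300 children, 4 schools: fixed vs random barely differ.
print(residual_df(300, 4, school_as_fixed=True))    # 295
print(residual_df(300, 4, school_as_fixed=False))   # 297

# 300 children, 40 schools: fitting school as fixed gets expensive.
print(residual_df(300, 40, school_as_fixed=True))   # 259
```

With only a handful of schools the fixed-effect version costs almost nothing extra, which is why you can get the school comparison "for free". With many schools the cost climbs quickly, and the random effect looks much more attractive.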

  1. A slight caveat is if you have a continuous variable that represents a relationship between two data points e.g. degree of relatedness, or distance in space, rather than a score on a scale. In this case, you can use this variable as a random effect by creating a variance-covariance matrix of all the pairwise relationships. But this is not the time or place to go into that.
  2. Actually, R packages for mixed modelling such as lme4 and MCMCglmm will spit out the estimated mean effect for each level of a random effect. So in that way perhaps you can have your cake and eat it. But statistically testing those values will likely be dubious.
