Have you ever wondered whether you should report separate means for different groups or a pooled mean from the entire sample? This scenario comes up often, for instance, when deciding whether to separate by sex, region, or observed treatment.
For example, imagine you are an agriculture researcher working with potato farmers. Growing potatoes depletes phosphorus from the soil, so farmers add fertilizer every year to restore it. The problem is that no one is ever exactly sure how much fertilizer to use so that the potatoes get the nutrients they need while keeping costs sustainable.
So, you design a study with 10 potato farmers; 5 of them use manure in their fields, and 5 do not.
A neighboring potato farmer hears about your study and asks your advice on how much fertilizer to buy for next year.
You begin to calculate the average estimated amount of fertilizer added across the 10 farms. Then you hesitate. The farmer mentioned that they use manure in their fields. Perhaps you should just average the five fields with manure? But those five fields may differ from your neighbor’s fields in important ways.
One of the manure fields was a lot like your neighbor's field. Maybe you should just give them that single estimate. Yet there are still differences between the two fields, not to mention some error in measuring the average nutrient uptake for any one field.
The struggle you are feeling illustrates the balance between two competing forces that drive the error in estimating a mean. On the one hand, the error in anticipating uptake depends on the number of fields that you sample: including more fields reduces your overall error.
On the other hand, if you average across all the fields, those with manure and those without, the overall average is going to be an overestimate for fields without manure and an underestimate for fields that do have manure.
The Variance-Bias Tradeoff
The balance between these two sources of error in statistics is called the variance-bias tradeoff. Here, increasing the number of samples reduces the variance, but pooling both manure and non-manure samples increases the bias. So how should we choose whether to pool or to separate?
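To make the tradeoff concrete, here is a minimal simulation sketch in Python. The uptake values, the size of the manure effect, and the field-to-field variability are invented purely for illustration.

```python
import numpy as np

# A toy simulation of the variance-bias tradeoff (all numbers hypothetical).
# Suppose manure fields truly take up 60 units and non-manure fields 40,
# with a field-to-field SD of 10, and 5 fields per group.
rng = np.random.default_rng(42)
true_manure, true_no_manure, sd = 60.0, 40.0, 10.0
n_per_group, n_sims = 5, 10_000

pooled_err, separate_err = [], []
for _ in range(n_sims):
    manure = rng.normal(true_manure, sd, n_per_group)
    no_manure = rng.normal(true_no_manure, sd, n_per_group)
    # Target: the mean for a manure field, like the neighbor's.
    pooled_est = np.concatenate([manure, no_manure]).mean()  # more data, but biased
    separate_est = manure.mean()                             # unbiased, but noisier
    pooled_err.append(pooled_est - true_manure)
    separate_err.append(separate_est - true_manure)

for name, errors in [("pooled", pooled_err), ("separate", separate_err)]:
    errors = np.array(errors)
    print(f"{name:9s} bias={errors.mean():6.2f}  "
          f"variance={errors.var():6.2f}  MSE={np.mean(errors**2):6.2f}")
```

With a large true difference between the groups, the pooled estimate's lower variance is swamped by its bias; shrink that difference and the pooled estimate starts to win.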
Your two basic options are to make an educated guess ahead of time or to use the data to make the decision.
Making an Educated Guess
To make an educated guess, you would review prior literature with a focus on differences in uptake between manure and non-manure fields.
You might even conduct a precision analysis for your study. This is an appraisal of the precision of your estimates for both the pooled and separate cases.
Overall, you would balance the loss in precision you see in the pooled calculation with the anticipated difference in uptake between manure and non-manure fields.
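As a rough sketch of what such a precision analysis might look like, you could compare the standard errors of the pooled and separate means under an assumed field-to-field SD. The SD and anticipated group difference below are placeholders you would pull from prior literature.

```python
import math

# Standard error of the mean for the pooled sample (n = 10) versus a
# single manure group (n = 5), given a field-to-field SD guessed from
# prior studies. All numbers are placeholders.
assumed_sd = 10.0                 # assumed field-to-field SD
se_pooled = assumed_sd / math.sqrt(10)
se_separate = assumed_sd / math.sqrt(5)
print(f"SE, pooled mean (n=10):   {se_pooled:.2f}")
print(f"SE, separate mean (n=5):  {se_separate:.2f}")

# Weigh the precision you give up by separating (the larger SE) against
# the difference you expect between manure and non-manure fields.
anticipated_difference = 20.0     # hypothetical manure vs non-manure gap
print(f"Anticipated group difference: {anticipated_difference:.1f}")
```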
Using Your Data to Decide
The other option is to use the data to help you choose whether to pool or separate. In this case, you could use techniques such as Akaike’s Information Criterion (AIC) or cross-validation. Both let you weigh the observed effect size against the group sample sizes at the time of data analysis.
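For instance, an AIC comparison of the pooled and separate models might look like the sketch below. The data values are made up, and the statsmodels formulas are just one way to fit the two models.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical fertilizer estimates for the 10 fields (units arbitrary).
df = pd.DataFrame({
    "fertilizer": [62, 58, 65, 61, 59, 38, 43, 35, 41, 40],
    "manure":     [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

# Model 1 pools all fields (a single mean); model 2 fits separate
# means for manure and non-manure fields.
pooled = smf.ols("fertilizer ~ 1", data=df).fit()
separate = smf.ols("fertilizer ~ C(manure)", data=df).fit()

print(f"AIC, pooled mean:    {pooled.aic:.1f}")
print(f"AIC, separate means: {separate.aic:.1f}")
# The lower AIC wins: a large group difference relative to the noise
# and the sample size favors reporting separate means.
```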
In choosing between the two approaches, you will face a conundrum. If you make the decision ahead of time, you are in the land of guesswork. You will have to use possibly insufficient prior information in a qualitative fashion.
If you use AIC or any other data-based technique to choose your statistical methods, the choice to pool becomes a random variable, and the downstream estimates no longer follow easily determinable sampling distributions.
What does this all mean? In short, we cannot say that our estimates are necessarily unbiased, or that our p-values will behave precisely the way they are supposed to.
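A small simulation sketch can show the effect. Here the true group difference is moderate, AIC chooses between pooling and separating on each simulated dataset, and the reported estimate for manure fields is whatever the selected model provides. All numbers are invented.

```python
import numpy as np

# A small simulation of post-selection behavior (all numbers invented).
# Target: the mean fertilizer need for manure fields.
rng = np.random.default_rng(7)
true_manure, true_no_manure, sd = 55.0, 45.0, 10.0
n, n_sims = 5, 20_000

post_selection = []
for _ in range(n_sims):
    manure = rng.normal(true_manure, sd, n)
    no_manure = rng.normal(true_no_manure, sd, n)
    both = np.concatenate([manure, no_manure])

    # AIC for normal models, up to a shared constant:
    #   n_total * log(RSS / n_total) + 2 * (number of parameters)
    rss_pooled = np.sum((both - both.mean()) ** 2)
    rss_separate = (np.sum((manure - manure.mean()) ** 2)
                    + np.sum((no_manure - no_manure.mean()) ** 2))
    aic_pooled = 2 * n * np.log(rss_pooled / (2 * n)) + 2 * 2
    aic_separate = 2 * n * np.log(rss_separate / (2 * n)) + 2 * 3

    # Report whichever estimate the AIC-selected model gives for manure fields.
    est = manure.mean() if aic_separate < aic_pooled else both.mean()
    post_selection.append(est)

post_selection = np.array(post_selection)
print(f"bias of post-selection estimate: {post_selection.mean() - true_manure:6.2f}")
print(f"SD of post-selection estimate:   {post_selection.std():6.2f}")
print(f"naive SE if always separating:   {sd / np.sqrt(n):6.2f}")
```

In repeated samples, the post-selection estimate is no longer centered on the manure-field mean, and its sampling distribution is a mixture of the two models' estimates rather than either one alone.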
Main Takeaway
In reporting means for small samples, you might need to decide between averaging across a smaller but more representative group versus a larger, less representative group. Ultimately, your choice will depend upon the aim of your analysis.
You can choose to report pooled means or separate group means either before or at the time of analysis. Choosing ahead of time comes down to reviewing the literature to make a guess at the effect size; it might also involve running precision calculations under different scenarios. Choosing once you have the data means you can use tools that optimize the variance-bias tradeoff, like AIC or cross-validation.
Clark Kogan is an experienced statistical scientist and owner of StatsCraft LLC.
His expertise includes Bayesian models, generalized linear mixed models, research design, and R programming. As a collaborator, consultant, and mentor across multiple fields (including agriculture, veterinary medicine, psychology and pharmacy), Clark loves the challenge of finding the best statistical strategies for the specific research project. He especially enjoys the collaborative discussions that lead to insights and solutions.