In our last two posts, we explained (1) that in a simple random sample every member of the population has an equal probability of selection and (2) that there are some really good reasons why complex samples can work better, despite being more complex.
Today, we’re going to talk a bit about one complex sampling technique: stratified sampling.
What is Stratified Sampling?
In stratified sampling, the target population is first classified into subgroups, or strata. (Grammar note: “strata” is the plural of “stratum,” just as “data” is the plural of “datum.”)
A simple random sample is then selected within every stratum.
That’s it.
For example, let’s say you’re doing a linguistics study within the US. You want to make sure that you have enough people in your sample with most of the major dialects within American English. You know there are regional differences in pronunciation and word use, and you want to ensure you include people in your sample who say “crawfish” and “crayfish” as well as “tennis shoes” and “sneakers.”
So rather than taking one simple random sample across the US, you first create four regional strata: Northeast, Southeast, Midwest, and West. These regions are based on other studies that generally define four common dialectal patterns.
You then randomly sample from each of the four regions.
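Here is a minimal sketch of those two steps in Python. The sampling frame, the function name `stratified_sample`, and the sample size of 50 per region are all made up for illustration; a real study would start from an actual list of people and their regions.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample (without replacement) inside every stratum."""
    rng = random.Random(seed)
    # Step 1: classify every unit in the population into its stratum.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Step 2: take a simple random sample within each stratum.
    return {
        name: rng.sample(units, n_per_stratum[name])
        for name, units in strata.items()
    }

# Hypothetical sampling frame: (person_id, region) pairs.
regions = ["Northeast", "Southeast", "Midwest", "West"]
frame = [(i, regions[i % 4]) for i in range(2000)]

sample = stratified_sample(
    frame,
    stratum_of=lambda person: person[1],
    n_per_stratum={region: 50 for region in regions},
    seed=1,
)
for region, people in sample.items():
    print(region, len(people))
```

Every region is guaranteed to show up with exactly the sample size you asked for, which a single simple random sample across the whole frame cannot promise.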
What are the advantages of stratified sampling?
- Administration of field work is more convenient and less costly. You can, for example, assign a separate research team to each region. They have to do less travelling, reducing expenses.
- If people really are more similar within each stratum, stratified sampling will lead to improved precision of estimates. In other words, standard errors will be lower. Not just standard errors within a stratum, but the standard errors for estimates of the entire US. This is incredibly important, as it’s one of the few things you can do to improve power in a study without huge increases in sample size.
- It allows oversampling of small populations of interest. This is also incredibly important, as sometimes those small populations are vital to your research questions.
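To make the precision point concrete, here is a rough sketch that compares the standard error of an estimated proportion under a simple random sample and under a stratified sample of the same total size. The population is entirely made up, the stratified sample is assumed to be allocated proportionally to stratum size, and the finite population correction is ignored to keep the arithmetic short. `W_h` is a stratum's share of the population and `n_h` is its sample size.

```python
import statistics

# Hypothetical population: 1 = says "crawfish", 0 = says "crayfish".
# The made-up proportions differ a lot between regions.
population = {
    "Northeast": [1] * 100 + [0] * 900,
    "Southeast": [1] * 900 + [0] * 100,
    "Midwest":   [1] * 300 + [0] * 700,
    "West":      [1] * 200 + [0] * 800,
}
n = 400  # total sample size in both designs

everyone = [y for values in population.values() for y in values]
N = len(everyone)

# Simple random sample: the variance of the sample mean is about S^2 / n.
var_srs = statistics.pvariance(everyone) / n

# Stratified sample, proportional allocation: sum of W_h^2 * S_h^2 / n_h.
var_strat = 0.0
for values in population.values():
    W_h = len(values) / N      # stratum share of the population
    n_h = n * W_h              # sample size allocated to this stratum
    var_strat += W_h ** 2 * statistics.pvariance(values) / n_h

print(f"SE, simple random sample: {var_srs ** 0.5:.4f}")
print(f"SE, stratified sample:    {var_strat ** 0.5:.4f}")
```

The stratified standard error comes out smaller because only the variation within regions contributes to it. If the regions all had the same proportion, the two designs would give essentially the same answer.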
Let’s say that there is a 5th known, but relatively small, dialect in Alaska. You have a particular interest in being able to compare this group to others in the study.
Perhaps Alaskans have their own word for crayfish and gym shoes that no one else uses. (And yes, I’m totally making up this example.) Lumping them in with the other, more populous western states means you may end up with only 5 or 10 Alaskans, even if the sample for the entire western region is pretty large.
(FYI, if you’re not familiar with US geography: although Alaska is huge in area, it has a small population. Other western states, like California, on the other hand, have enormous populations.)
If all you care about is representing the US, it’s fine. But if you’d like to also do some analyses on just Alaskans, you won’t have enough in the sample without specifically sampling Alaskans at a higher rate than residents of other western states.
So, one option is to make Alaska its own stratum, then make sure the sample in that stratum is large enough to use in your statistical tests.
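A quick back-of-the-envelope calculation shows why. The population sizes below are rough, rounded figures used only for illustration (not official counts), and the sample sizes are hypothetical.

```python
# Rough, illustrative population sizes (assumptions, not official figures).
west_population = 78_000_000     # all western states, including Alaska
alaska_population = 700_000
west_sample_size = 1_000         # hypothetical sample drawn across the whole West

# With one random sample across the West, Alaskans turn up in proportion
# to their share of the western population.
expected_alaskans = west_sample_size * alaska_population / west_population
print(f"Expected Alaskans without a separate stratum: {expected_alaskans:.1f}")

# With Alaska as its own stratum, you choose its sample size directly.
alaska_sample_size = 200         # hypothetical target for Alaska-only analyses
alaska_rate = alaska_sample_size / alaska_population
rest_rate = west_sample_size / (west_population - alaska_population)

# How many times more likely an Alaskan is to be selected than another westerner.
oversampling_factor = alaska_rate / rest_rate
print(f"Alaskans are sampled at {oversampling_factor:.0f} times the rate of other westerners")
```

With these numbers, a western sample of 1,000 would be expected to contain only about 9 Alaskans, which is why the separate stratum is needed.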
The Consequences
In order to have accurate estimates of the US in general, though, you need to account for the fact that Alaskans make up a larger share of the sample than they do of the population.
You do this through weighting and through incorporating the stratification into the statistical analysis.
The weighting ensures that parameter estimates like means and regression coefficients are accurate and unbiased. Incorporating the stratification ensures the standard errors are accurate.
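As a hedged illustration of both pieces, the sketch below uses two made-up strata with Alaska deliberately oversampled. It assumes the usual design weight of `N_h / n_h` (population size over sample size in each stratum) and the textbook stratified variance for a mean, again ignoring the finite population correction. Real survey software does much more than this, so treat it as a picture of the idea rather than a replacement.

```python
import statistics

# Hypothetical strata: population size N and sampled 0/1 outcomes y
# (1 = uses the word of interest). Alaska is deliberately oversampled.
strata = {
    "Rest of West": {"N": 77_300_000, "y": [1] * 20 + [0] * 80},
    "Alaska":       {"N": 700_000,    "y": [1] * 80 + [0] * 20},
}
N_total = sum(s["N"] for s in strata.values())

# The unweighted mean treats every respondent alike, so Alaska counts for half.
all_y = [y for s in strata.values() for y in s["y"]]
unweighted_mean = statistics.mean(all_y)

weighted_sum = 0.0
weight_total = 0.0
variance = 0.0
for s in strata.values():
    n_h = len(s["y"])
    # Weighting: each respondent stands in for N_h / n_h people.
    w_h = s["N"] / n_h
    weighted_sum += w_h * sum(s["y"])
    weight_total += w_h * n_h
    # Stratification: the variance is built up stratum by stratum,
    # as the sum of W_h^2 * s_h^2 / n_h.
    W_h = s["N"] / N_total
    variance += W_h ** 2 * statistics.variance(s["y"]) / n_h

weighted_mean = weighted_sum / weight_total
standard_error = variance ** 0.5

print(f"Unweighted mean: {unweighted_mean:.3f}")
print(f"Weighted mean:   {weighted_mean:.3f}")
print(f"Stratified SE:   {standard_error:.4f}")
```

In this made-up example the unweighted mean is 0.5, while the weighted mean is about 0.21, much closer to the rest of the West, because Alaskans are a tiny share of the population even though they are half of the sample.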
Unfortunately, although most procedures in general statistical software can incorporate weights, you need to use software designed for complex surveys to include the stratification.
Luckily, all the major stat packages (and a few specialized ones) now have complex survey procedures available.
Marie says
Thank you for this! What about a survey where I want to investigate people from, say, five specific areas in a country and:
– do a random sampling (by that, I mean that I do not target the areas specifically but collect a lot of data and “hope for the best” in terms of how many respondents from each area will complete the survey)
– end up with one area being overrepresented or another underrepresented? By overrepresented/underrepresented, I mean that the percentage of this group is larger/smaller in my sample than in the population.
My understanding is that this does not qualify as stratified sampling since I used random sampling. Am I right?
But should I still use weights to correct the skewed sample if finding the difference between these areas is important to my research question?
Thanks!
Peter Sych says
Thank you for the Nice Informative article. Following from https://www.seku.ac.ke/
Samson Odira Omolo says
I really enjoyed what you posted on statistical analyses. My study is on
*Impact on abundance, diversity, distribution and Public Health implications of disease vectors in solid waste disposal system in Mombasa County*. How would I go about the stratified sampling, and what advantages would it have for this study? Mombasa County has six sub-counties, each with different solid waste disposal sites. I would appreciate your ideas so that I can develop a good proposal.