
In my last article, Hierarchical Regression in Stata: An Easy Method to Compare Model Results, I presented the following table, which examined the impact of several predictors on one's mental health.

At the bottom of the table is the number of observations (N) contained within each sample.
The sample sizes are quite large. Does it really matter that they are different? The answer is absolutely yes.
Fortunately, it is not difficult in Stata to use the same sample for all four models shown above.
Some background info:
As I have mentioned previously, Stata temporarily stores the results of the commands you run. You don't have to do anything to make Stata store these results, but if you'd like to use them, you need to know what they're called.
To see what is stored after an estimation command, use the following code:
ereturn list
After a summary command:
return list
One of the stored results after an estimation command is e(sample). For each observation in the data set, e(sample) equals 1 if the observation was used in the estimation and 0 if it was not.
Remember that these stored results are temporary. They will disappear the next time you run another estimation command.
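For example, after a summary command you can copy a stored result into a scalar before the next command replaces it. A minimal sketch (the variable income here is hypothetical):
summarize income
return list                    // lists r(N), r(mean), r(sd), and so on
scalar mean_income = r(mean)   // copy r(mean) before it disappears
display mean_income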
The Steps
So how do I use the same sample for all my models? Follow these steps.
Using the regression example on mental health, I first determine which model has the fewest observations. In this case it is model four.
I rerun the model:
regress MCS weeks_unemployed i.marital_status kids_in_house religious_attend income
Next I use the generate command to create a new variable whose value is 1 if the observation was in the model and 0 if the observation was not. I will name the new variable “in_model_4”.
gen in_model_4 = e(sample)
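A quick sanity check: the number of flagged observations should match the number of observations Stata reported for model 4.
count if in_model_4            // should equal e(N) from the regression above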
Now I will re-run my four regressions and include only the observations that were used in model 4. I will store the models using different names so that I can compare them to the original models.
My commands to run the models are:
regress MCS weeks_unemployed i.marital_status if in_model_4==1
estimates store model_1a
regress MCS weeks_unemployed i.marital_status kids_in_house if in_model_4==1
estimates store model_2a
regress MCS weeks_unemployed i.marital_status kids_in_house religious_attend if in_model_4==1
estimates store model_3a
regress MCS weeks_unemployed i.marital_status kids_in_house religious_attend income if in_model_4==1
estimates store model_4a
Note: I could write if in_model_4 instead of if in_model_4==1. Stata treats 0 as false and any nonzero value (here, 1) as true.
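To compare the stored models side by side within Stata, the built-in estimates table command works well. A minimal sketch, assuming the original models were stored as model_1 through model_4 when they were first run:
estimates table model_1 model_1a model_4 model_4a, b(%9.2f) se stats(N)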
Here are the results comparing the original models (e.g., model_1) versus the models using the same sample (e.g., model_1a):


Comparing the original models 3 and 4, one would have assumed that the predictor variable “Income level” significantly impacted the coefficient of “Frequent religious attendance.” Its coefficient changed from -58.48 in model 3 to 6.33 in model 4.
That would have been the wrong assumption. The change in the coefficient was not so much about any effect of the variable itself as about the way adding it changes the sample via listwise deletion. Using the same sample, the change in the coefficient between the two models is very small, moving from 4 to 6.
Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.
I recently opened a very large data set titled “1998 California Work and Health Survey,” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables.
Have you ever worked with a data set that had so many observations and/or variables that you couldn’t see the forest for the trees? You would like to extract some simple information but you can’t quite figure out how to do it.
Get to know Stata's collapse command: it's your new friend. The collapse command converts your current data set into a much smaller data set of means, medians, maximums, minimums, counts, or percentiles (your choice of which percentile).
Let’s take a look at an example. I’m currently looking at a longitudinal data set filled with economic data on all 67 counties in Alabama. The time frame is in decades, from 1960 to 2000. Five time periods by 67 counties give me a total of 335 observations.
What if I wanted to see some trend information, such as the total population and jobs per decade for all of Alabama? I just want a simple table to see my results as well as a graph. I want results that I can copy and paste into a Word document.
Here’s my code:
preserve                               // set the full data set aside
collapse (sum) Pop Jobs, by(year)      // total population and jobs per decade
graph twoway (line Pop year) (line Jobs year), ylabel(, angle(horizontal))
list
restore                                // bring back the original data set
And here is my output:


By starting my code with the preserve command and ending it with restore, my data set returns to its original state after giving me the results I want.
What if I want to look at variables measured in percentages, such as percent of college graduates, mobility, and labor force participation rate (lfp)? In this case I don't want to sum the values, because they are percentages, and calculating an unweighted mean would give equal weight to every county regardless of size.
Fortunately, Stata gives you a very simple way to weight your data by frequency. You just have to determine which variable to weight by; in this situation I will use the population variable.
Here’s my coding and results:
preserve
collapse (mean) lfp College Mobil [fw=Pop], by(year)    // population-weighted means
graph twoway (line lfp year) (line College year) (line Mobil year), ylabel(, angle(horizontal))
list
restore


It’s as easy as that. This is one of the five tips and tricks I’ll be discussing during the free Stata webinar on Wednesday, July 29th.
Stata allows you to describe, graph, manipulate and analyze your data in countless ways. But at times (many times) it can be very frustrating trying to create even the simplest results. Join us and learn how to reduce your future frustrations.
This one-hour demonstration is for new and intermediate users of Stata. If you're a beginner, the drop-down commands can be extremely daunting.
If you're an intermediate user who isn't constantly using Stata, it's hard to remember which commands generate the results you're looking to create.
This webinar, by guest presenter Jeff Meyer, will give you five actionable tips (and examples you can re-use) that will make your next analysis in Stata much simpler.
We’ll explore:
- How to save time with a do-file that creates exactly the table you want
- A few methods (some easier than others) for creating dummy variables from a categorical variable with several categories (see the sketch after this list)
- At least three ways to insert a table into a document
- How to quickly alter the look of your graphs through the use of macros
- How to aggregate data to the group level based on a number of parameters
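As a preview of the dummy variable tip, here is one built-in approach. A minimal sketch using the marital_status variable from the regression example above (the marital_d stub name is hypothetical):
tabulate marital_status, generate(marital_d)    // creates marital_d1, marital_d2, ...
Each generated variable equals 1 for observations in that category and 0 otherwise.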
Date: Wednesday, July 29, 2015
Time: 4pm EDT (New York time)
Cost: Free
***Note: This webinar has already taken place. Sign up below to get access to the video recording of the webinar.
Our next free webinar, “Random Intercept and Random Slope Models,” is coming up in August.
One of Stata’s incredibly useful abilities is to temporarily store calculations from commands.
Why is this so useful?
If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.
Every data analysis project is unique and there are always many good ways to keep your data organized.
In case it’s helpful, here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.
1. Use file directory structures to keep relevant files together
In our data set, it was clear which analyses were needed for each outcome. Therefore, all files and corresponding file directories were organized by outcomes.
Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.
This made it easy to find the final data set, program, or output for any particular analysis.
You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
2. Split large data sets into smaller relevant ones
In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.
Rather than create one enormous and unmanageable data set, each outcome scale made up a unique data set. Variables that were common to all analyses–demographics, controls, and condition variables–were in their own data set.
For each analysis, we merged the common variables data set with the relevant unique variable data set.
This allowed us to run each analysis without the clutter of irrelevant variables.
This strategy can be particularly helpful when you are running secondary data analysis on a large data set.
Spend some time thinking about which variables are common to all analyses and which are unique to a single model.
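In Stata, the merge step for one outcome might look like this. A minimal sketch, with hypothetical file names and an id variable that uniquely identifies subjects in both files:
use common_vars, clear              // demographics, controls, and condition variables
merge 1:1 id using outcome1_vars    // add the variables for one outcome scale
drop _merge
save outcome1_analysis, replace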
3. Do all data manipulation in syntax
I can’t emphasize this one enough.
As you’re cleaning data it’s tempting to make changes in menus without documenting them, then save the changes in a separate data file.
It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.
Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.
So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.
Do this each time you clean data and you end up with dozens of versions of the same data.
A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it gets incredibly confusing which version of each variable is where.
Picture this instead.
Start with one raw data set.
Write a syntax file that opens that raw data set, cleans, recodes, and computes new variables, then saves a finished one, ready for analysis.
If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.
If you love using menus instead of writing syntax, still no problem.
Paste the commands as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.
If you made a mistake in recoding something, edit the syntax, not the data file.
Need to make small changes? If it’s set up well, rerunning it only takes seconds.
There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.
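In Stata, such a syntax file is a do-file. A minimal sketch of the pattern, with hypothetical file and variable names:
* clean_data.do: builds the analysis file from the raw data
use raw_data, clear
recode age (999 = .)                // convert a missing-value code to missing
generate log_income = ln(income)    // compute a new variable
save analysis_data, replace         // safe to overwrite: every change is documented here
Rerunning this do-file rebuilds the finished data set from the raw data in one click.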