I recently opened a very large data set titled “1998 California Work and Health Survey,” compiled by the Institute for Health Policy Studies at the University of California, San Francisco. There are 1,771 observations and 345 variables.
Have you ever worked with a data set that had so many observations and/or variables that you couldn’t see the forest for the trees? You would like to extract some simple information but you can’t quite figure out how to do it.
Get to know Stata’s collapse command: it’s your new friend. It converts your current data set into a much smaller data set of means, medians, maximums, minimums, counts, or percentiles (your choice of which percentile).
Let’s take a look at an example. I’m currently looking at a longitudinal data set filled with economic data on all 67 counties in Alabama. The time frame is in decades, from 1960 to 2000. Five time periods by 67 counties give me a total of 335 observations.
What if I wanted to see some trend information, such as the total population and jobs per decade for all of Alabama? I just want a simple table to see my results as well as a graph. I want results that I can copy and paste into a Word document.
Here’s my code:
preserve
collapse (sum) Pop Jobs, by(year)
graph twoway (line Pop year) (line Jobs year), ylabel(, angle(horizontal))
list
restore
And here is my output:
Bracketing my code with preserve and restore returns my data set to its original state after providing the results I want.
What if I want to look at variables that are measured as percentages, such as the percent of college graduates, mobility, and the labor force participation rate (lfp)? In this case I don’t want to sum the values because they are percentages.
Calculating the mean would give equal weighting to all counties regardless of size.
Fortunately, Stata gives you a very simple way to weight your data by frequency. You just have to decide which variable to use as the weight; in this situation I will use the population variable.
Here’s my code and the results:
preserve
collapse (mean) lfp College Mobil [fw=Pop], by(year)
graph twoway (line lfp year) (line College year) (line Mobil year), ylabel(, angle(horizontal))
list
restore
It’s as easy as that. This is one of the five tips and tricks I’ll be discussing during the free Stata webinar on Wednesday, July 29th.
Jeff Meyer is a statistical consultant with The Analysis Factor, a stats mentor for Statistically Speaking membership, and a workshop instructor. Read more about Jeff here.
Stata allows you to describe, graph, manipulate and analyze your data in countless ways. But at times (many times) it can be very frustrating trying to create even the simplest results. Join us and learn how to reduce your future frustrations.
This one-hour demonstration is for new and intermediate users of Stata. If you’re a beginner, the drop-down menus can be extremely daunting.
And if you’re an intermediate user who doesn’t use Stata constantly, it’s nearly impossible to remember which commands generate the results you’re looking for.
This webinar, by guest presenter Jeff Meyer, will give you five actionable tips (and examples you can re-use) that will make your next analysis in Stata much simpler.
We’ll explore each tip with a worked example you can re-use.
Date: Wednesday, July 29, 2015
Time: 4pm EDT (New York time)
Cost: Free
Note: This webinar has already taken place. Sign up below to get access to the video recording of the webinar.
Our next free webinar, “Random Intercept and Random Slope Models,” is coming up in August.
One of Stata’s incredibly useful abilities is to temporarily store calculations from commands.
Why is this so useful?
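For example, many commands leave their calculations behind in r() (estimation commands use e()), where later commands can pick them up. Here’s a minimal sketch using the auto data set that ships with Stata:

sysuse auto, clear

* summarize stores its calculations in r()
summarize price
return list                         // see everything it stored

* reuse the stored mean without retyping or rounding it
display "Mean price: " r(mean)
generate price_centered = price - r(mean)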
If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.
Every data analysis project is unique and there are always many good ways to keep your data organized.
In case it’s useful, here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.
In our data set, it was clear which analyses were needed for each outcome. Therefore, all files and corresponding file directories were organized by outcomes.
Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.
This made it easy to find the final data set, analysis, or output for any particular outcome.
You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.
Rather than create one enormous and unmanageable data set, each outcome scale made up a unique data set. Variables that were common to all analyses (demographics, controls, and condition variables) were in their own data set.
For each analysis, we merged the common variables data set with the relevant unique variable data set.
This allowed us to run each analysis without the clutter of irrelevant variables.
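In Stata, a merge like that might look something like this sketch (the file names and the subject_id variable are hypothetical):

* start from the shared demographics, controls, and condition variables
use common_variables.dta, clear

* add the variables for one outcome scale
merge 1:1 subject_id using outcome_scale_A.dta
keep if _merge == 3                 // keep subjects present in both files
drop _merge

save analysis_scale_A.dta, replace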
This strategy can be particularly helpful when you are running secondary data analysis on a large data set.
Spend some time thinking about which variables are common to all analyses and which are unique to a single model.
I can’t emphasize this one enough.
As you’re cleaning data it’s tempting to make changes in menus without documenting them, then save the changes in a separate data file.
It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.
Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.
So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.
Do this each time you clean data and you end up with dozens of versions of the same data.
A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it gets incredibly confusing which version of each variable is where.
Picture this instead.
Start with one raw data set.
Write a syntax file that opens that raw data set, cleans, recodes, and computes new variables, then saves a finished one, ready for analysis.
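Here’s a minimal sketch of what such a syntax file might look like in Stata; the file names, variables, and recodes are all hypothetical:

* clean_data.do -- rerunnable from top to bottom at any time
use raw_survey.dta, clear           // always start from the untouched raw file

* cleaning: set out-of-range ages to missing
replace age = . if age < 18 | age > 99

* recoding: collapse a 5-point item into three categories
recode q12 (1 2 = 1 "Agree") (3 = 2 "Neutral") (4 5 = 3 "Disagree"), generate(q12_3cat)

* computing: a new total score
generate stress_total = q1 + q2 + q3 + q4

* overwriting the finished file is fine because every change is documented above
save survey_clean.dta, replace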
If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.
If you love using menus instead of writing syntax, still no problem.
Paste the commands as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.
If you made a mistake in recoding something, edit the syntax, not the data file.
Need to make small changes? If it’s set up well, rerunning it only takes seconds.
There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.
Just last week, a colleague mentioned that while he does a lot of study design these days, he no longer does much data analysis.
His main reason was that 80% of the work in data analysis is preparing the data for analysis. Data preparation is s-l-o-w and he found that few colleagues and clients understood this.
Consequently, he was running into expectations that he should analyze a raw data set in an hour or so.
You know, by clicking a few buttons.
I see this as well with researchers new to data analysis. While they know it will take longer than an hour, they still have unrealistic expectations about how long it takes.
So I am here to tell you, the time-consuming part is preparing the data. Weeks or months is a realistic time frame. Hours is not.
(Feel free to send this to your colleagues who want instant results.)
There are three parts to preparing data: cleaning it, creating necessary variables, and formatting all variables.
Data cleaning means finding and eliminating errors in the data. How you approach it depends on how large the data set is, but the kinds of things you’re looking for include impossible values and values outside a plausible range.
You can’t avoid data cleaning and it always takes a while, but there are ways to make it more efficient. For example, rather than search through the data set for impossible values, print a table of data values outside a normal range, along with subject ids.
This is where learning how to code in your statistical software of choice really helps. You’ll need to subset your data using IF statements to find those impossible values.
But if your data set is anything but small, you can also save yourself a lot of time, code, and errors by incorporating efficiencies like loops and macros so that you can perform some of these checks on many variables at once.
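For instance, a Stata sketch of both ideas might look like this (the subject_id variable, the q1-q20 items, and the 1-5 range are hypothetical):

* list the ids attached to impossible values for a single variable
list subject_id age if age < 18 | age > 99

* run the same range check over many similar items at once
foreach v of varlist q1-q20 {
    quietly count if !inrange(`v', 1, 5) & !missing(`v')
    if r(N) > 0 {
        display "`v' has " r(N) " out-of-range values"
        list subject_id `v' if !inrange(`v', 1, 5) & !missing(`v')
    }
}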
Once the data are free of errors, you need to set up the variables that will directly answer your research questions.
It’s a rare data set in which every variable you need is measured directly.
So you may need to do a lot of recoding and computing of variables.
Examples include reverse-coding survey items, collapsing categories, computing scale scores or totals, and creating indicator (dummy) variables.
And of course, part of creating each new variable is double-checking that it worked correctly.
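As a quick illustration in Stata (the variable and the category cutoffs are hypothetical), a recode followed by a cross-tabulation is an easy way to confirm a new variable did what you intended:

* collapse a 7-point satisfaction item into three categories
recode satisfaction (1/3 = 1 "Low") (4 = 2 "Medium") (5/7 = 3 "High"), generate(satis_3cat)

* double-check: every old value should map to exactly one new category
tabulate satisfaction satis_3cat, missing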
Both original and newly created variables need to be formatted correctly for two reasons:
First, so your software works with them correctly. Failing to format a missing value code or a dummy variable correctly will have major consequences for your data analysis.
Second, it’s much faster to run the analyses and interpret results if you don’t have to keep looking up which variable Q156 is.
Examples include converting missing value codes (like 999) to actual missing values, making sure dummy variables are stored as numeric 0/1 variables, and giving variables and their values descriptive labels.
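In Stata, those steps might look something like this sketch (the 999 code and the variable names are hypothetical):

* convert a numeric missing-value code to Stata's system missing value
mvdecode income age, mv(999)

* rename and label a variable so you never have to look up what Q156 is
rename Q156 job_satisfaction
label variable job_satisfaction "Overall job satisfaction (1-5)"

* label the values of a dummy variable
label define yesno 0 "No" 1 "Yes"
label values employed yesno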
All three of these steps require a solid knowledge of how to manage data in your statistical software. Each one approaches them a little differently.
It’s also very important to keep track of and be able to easily redo all your steps. Always assume you’ll have to redo something. So use (or record) syntax, not only menus.