Karen Grace-Martin

3 Tips for Keeping Track of Data Files in a Large Data Analysis

March 23rd, 2015

If you’ve ever worked on a large data analysis project, you know that just keeping track of everything is a battle in itself.

Every data analysis project is unique and there are always many good ways to keep your data organized.

Here are a few strategies I used in a recent project that you may find helpful. They didn’t make the project easy, but they helped keep it from spiraling into overwhelm.

1. Use file directory structures to keep relevant files together

In our data set, it was clear which analyses were needed for each outcome. Therefore, all files and corresponding file directories were organized by outcomes.

Organizing everything by outcome variable also allowed us to keep the unique raw and cleaned data, programs, and output in a single directory.

This made it easy to find the final data set, analysis, or output for any particular outcome.

You may not want to organize your directories by outcome. Pick a directory structure that makes it easy to find each set of analyses with corresponding data and output files.
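As a rough illustration of the idea, here is a minimal Python sketch that sets up one directory per outcome, each with the same set of subfolders. The subfolder names, the outcome names, and the function `make_outcome_dirs` are all made up for this example; they are not the structure from the actual project, so adapt them to whatever makes your own files easy to find.

```python
from pathlib import Path

# Hypothetical subfolder names; rename them to match your own project.
SUBFOLDERS = ("raw_data", "clean_data", "programs", "output")


def make_outcome_dirs(project_root, outcomes):
    """Create one directory per outcome, each with the same subfolders."""
    created = []
    for outcome in outcomes:
        for sub in SUBFOLDERS:
            path = Path(project_root) / outcome / sub
            # exist_ok=True makes this safe to rerun as outcomes are added
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    import tempfile

    # Demo in a temporary folder so nothing is left behind on disk.
    with tempfile.TemporaryDirectory() as root:
        for p in make_outcome_dirs(root, ["anxiety_scale", "depression_scale"]):
            print(p.relative_to(root))
```

Because every outcome gets the same layout, you always know where to look for the program or output that goes with a given data set.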

2. Split large data sets into smaller relevant ones

In this particular analysis, there were about a dozen outcomes, each of which was a scale. In other words, each one had many, many variables.

Rather than create one enormous and unmanageable data set, we gave each outcome scale its own data set. Variables that were common to all analyses–demographics, controls, and condition variables–were in their own data set.

For each analysis, we merged the common variables data set with the relevant unique variable data set.

This allowed us to run each analysis without the clutter of irrelevant variables.

This strategy can be particularly helpful when you are running secondary data analysis on a large data set.

Spend some time thinking about which variables are common to all analyses and which are unique to a single model.
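To make the merge step concrete, here is a small Python sketch of the idea, assuming each data set has a shared ID variable. The variable names (`subject_id`, `age`, `anx_item1`, and so on), the example values, and the function `merge_on_id` are invented for illustration. In practice you would do this with your statistical package's own merge command.

```python
def merge_on_id(common_rows, outcome_rows, id_field="subject_id"):
    """Attach the common variables to each row of one outcome's data set."""
    common_by_id = {row[id_field]: row for row in common_rows}
    merged = []
    for row in outcome_rows:
        common = common_by_id.get(row[id_field])
        if common is None:
            # Fail loudly rather than silently dropping a case
            raise KeyError(f"No common variables for {id_field}={row[id_field]!r}")
        merged.append({**common, **row})
    return merged


# Made-up data: one common-variables data set, one outcome-scale data set.
common = [
    {"subject_id": 1, "age": 34, "condition": "treatment"},
    {"subject_id": 2, "age": 41, "condition": "control"},
]
anxiety = [
    {"subject_id": 1, "anx_item1": 3, "anx_item2": 4},
    {"subject_id": 2, "anx_item1": 1, "anx_item2": 2},
]

analysis_data = merge_on_id(common, anxiety)
print(analysis_data[0])
```

The merged data set holds only the common variables plus the items for the one scale being analyzed, which is the point: none of the other scales' variables come along.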

3. Do all data manipulation in syntax

I can’t emphasize this one enough.

As you’re cleaning data it’s tempting to make changes in menus without documenting them, then save the changes in a separate data file.

It may be quicker in the short term, but it will ultimately cost you time and a whole lot of frustration.

Above and beyond the inability to find your mistakes (we all make mistakes) and document changes, the problem is this: you won’t be able to clean a large data set in one sitting.

So at each sitting, you have to save the data to keep changes. You don’t feel comfortable overwriting the data, so instead you create a new version.

Do this each time you clean data and you end up with dozens of versions of the same data.

A few strategic versions can make sense if each is used for specific analyses. But if you have too many, it gets incredibly confusing which version of each variable is where.

Picture this instead.

Start with one raw data set.

Write a syntax file that opens that raw data set, cleans, recodes, and computes new variables, then saves a finished one, ready for analysis.

If you don’t get the syntax file done in one sitting, no problem. You can add to it later and rerun everything from your previous sitting with one click.

If you love using menus instead of writing syntax, still no problem.

Paste the commands as you go along. The goal is not to create a new version of the data set, but to create a clean syntax file that creates the new version of the data set. Edit it as you go.

If you made a mistake in recoding something, edit the syntax, not the data file.

Need to make small changes? If it’s set up well, rerunning it only takes seconds.

There is no problem with overwriting the finished data set because all the changes are documented in the syntax file.
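Here is what that workflow might look like as a script. This is only a sketch in Python rather than SPSS or SAS syntax, and the file layout, the variable names (`item1`, `item2`, `scale_total`), and the missing-value code `-99` are all assumptions made up for the example. The pattern is the important part: read the raw file, apply every change in code, and overwrite the finished file.

```python
import csv

ITEM_FIELDS = ("item1", "item2")  # hypothetical scale items


def clean_row(row):
    """All cleaning steps live here, so every change is documented in code."""
    cleaned = dict(row)
    # Recode: treat the (assumed) missing-value code -99 as missing.
    for field in ITEM_FIELDS:
        cleaned[field] = None if row[field] == "-99" else int(row[field])
    # Compute: a scale total, only when every item is present.
    values = [cleaned[field] for field in ITEM_FIELDS]
    cleaned["scale_total"] = None if None in values else sum(values)
    return cleaned


def build_clean_file(raw_path, clean_path):
    """Read the raw file, clean every row, and overwrite the finished file.

    The raw file is only ever read, never written to.
    """
    with open(raw_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames or []) + ["scale_total"]
        rows = [clean_row(r) for r in reader]
    with open(clean_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # missing values are written as empty cells
    return rows
```

If a recode turns out to be wrong, you fix `clean_row` and call `build_clean_file` again. The finished file is rebuilt from the raw data every time, so there is never a question about which version it is.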

 


Why Mixed Models are Harder in Repeated Measures Designs: G-Side and R-Side Modeling

February 25th, 2015

I have recently worked with two clients who were running generalized linear mixed models in SPSS.

Both had repeated measures experiments with a binary outcome.

The details of the designs were quite different, of course. But both had pretty complicated combinations of within-subjects factors.

Fortunately, both clients are intelligent, have a good background in statistical modeling, and are willing to do the work to learn how to do this. So in both cases, we made a lot of progress in just a couple of meetings.

I found it interesting, though, that both were getting stuck on the same subtle point. It’s the same point I was missing for a long time in my own learning of mixed models.

Once I finally got it, a huge light bulb turned on. (more…)


When Main Effects are Not Significant, But the Interaction Is

January 21st, 2015

If you have a significant interaction effect and non-significant main effects, would you interpret the interaction effect?

It’s a question I get pretty often, and it has a more straightforward answer than most.

(more…)


Actually, you can interpret some main effects in the presence of an interaction

November 14th, 2014

One of those “rules” about statistics you often hear is that you can’t interpret a main effect in the presence of an interaction.

Stats professors seem particularly good at drilling this into students’ brains.

Unfortunately, it’s not true.

At least not always. (more…)


When Does Repeated Measures ANOVA not work for Repeated Measures Data?

September 8th, 2014

Repeated measures ANOVA is the approach most of us learned in stats classes for repeated measures and longitudinal data. It works very well in certain designs.

But it’s limited in what it can do. Sometimes trying to fit a data set into a repeated measures ANOVA requires too much data gymnastics. (more…)


When a Variable’s Level of Measurement Isn’t Obvious

July 14th, 2014

A central concept in statistics is the level of measurement of a variable. It’s so important to everything you do with data that it’s usually taught within the first week in every intro stats class.

But even something so fundamental can be tricky once you start working with real data. (more…)