It’s easy to develop bad habits in data analysis. When you’re new to it, you don’t yet have the experience to realize that what feels like efficiency will come back to slow you down, introduce problems, and cause more frustration.
I’ve outlined 14 steps to running any data analysis, in four phases. They help keep your analysis on track. But even if you’re following those steps, you can make the work harder on yourself with a few bad habits.
Bad Habit #1. Not allowing enough time to implement and learn
One of the great things about data analysis is that each project is an opportunity to improve your skills. Because there is so much nuance in every messy data set, you learn something new with every analysis.
The other side of this, of course, is that few statistical analyses are routine or quick.
That means it’s easy to underestimate not just the time it will take to run the analysis, but also the time to troubleshoot issues and learn new methods that you hadn’t realized you needed. Chances are the new methods you’ll have to employ will be challenging.
Even if you already have good statistical skills, a method that is new to you can take weeks or months to learn and implement. Not days.
This is especially true if it turns out you need to use a new statistical software program to implement it.
Suggested Strategy: Plan your data analysis to take months, not days. If there are no surprises, you’ll finish early.
Bad Habit #2. Not using a system for keeping track of files and steps
No matter which statistical software you use, every analysis has lots of files. Data files, program files, output files, log files. Then there are the supporting files, like the data codebook file and the statistical analysis plan. Oh yes, and the report you’re writing from the results.
And there are many steps to any analysis, within each of the four phases of data analysis: design & planning; data preparation; data analysis; and communicating results.
Your system doesn’t have to be complicated or technical. But throwing dozens of files with generic names into too few (or too many) folders is a recipe for frustration.
The same is true for variable names. Come up with a naming convention for variables too (and make sure it’s documented somewhere).
This is especially helpful if you’re collaborating with someone else. But even if you’re on your own, be kind to your future self and make everything clear.
Suggested Strategy: Take the time to organize files and institute a naming convention for files and variables. And follow it!
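As one illustration of what a file naming convention could look like, here is a minimal Python sketch. The function name build_filename, the phase codes, and the project name "sleepstudy" are all made up for this example. The only thing taken from the article is the four phases of data analysis. Treat it as one possible scheme, not a recommended standard.

```python
import re
from datetime import date

# Hypothetical two-digit codes so files sort in the order of the four phases.
PHASES = {"design": "01", "prep": "02", "analysis": "03", "report": "04"}


def build_filename(project, phase, description, ext, when=None):
    """Return a name like project_02-prep_short-description_YYYYMMDD.ext."""
    when = when or date.today()
    # Lowercase the description and collapse anything that is not a letter
    # or digit into single hyphens, so names stay readable and portable.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"{project}_{PHASES[phase]}-{phase}_{slug}_{when:%Y%m%d}.{ext}"


example = build_filename(
    "sleepstudy", "prep", "Recode missing values", "py", date(2024, 3, 5)
)
print(example)
```

Generating names from a function (rather than typing them by hand) is just one way to make sure the convention is actually followed. A short written note in the project folder would serve the same purpose.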
Bad Habit #3. Not making it easy to replicate what you’ve done on each step
As I already mentioned, there are many steps in each of the four phases of data analysis. One thing to keep in mind: while there is a clear order to these steps, it’s also common to have to backtrack.
For example, writing a statistical analysis plan is an early step and checking model assumptions is a later one.
But data don’t always behave the way we expected. So checking an assumption can derail the plan, requiring you to come up with a new plan. This is fine and it’s common. It’s worth planning, but you have to incorporate the realities and limitations of the real data set in the final analysis.
So the easier you make it to replicate your early steps, the easier they will be to rerun. There are just too many steps to remember exactly what you did on each one.
Suggested Strategy: Always use (or record) syntax for every data change and analysis you do. Use the tools available in your software to help you with this. Comment liberally so you remember exactly what each piece of code does.
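To make that strategy concrete, here is a small sketch in Python of recording every data change as commented, rerunnable code. The records, the variable names, and the cut point of 40 are invented for illustration only. The same idea applies to syntax files in any statistical software.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("analysis")


def drop_missing_age(rows):
    """Step 1: exclude records with no recorded age."""
    kept = [r for r in rows if r["age"] is not None]
    log.info("drop_missing_age: %d -> %d rows", len(rows), len(kept))
    return kept


def add_age_group(rows):
    """Step 2: derive age_group (the cut point of 40 is arbitrary here)."""
    out = [{**r, "age_group": "40+" if r["age"] >= 40 else "<40"} for r in rows]
    log.info("add_age_group: added 'age_group' to %d rows", len(out))
    return out


# The ordered list of steps is the record of what was done to the data.
STEPS = [drop_missing_age, add_age_group]


def prepare(rows):
    """Rerun every data preparation step, in order, on the raw records."""
    for step in STEPS:
        rows = step(rows)
    return rows


# Made-up raw records standing in for a real data file.
raw = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 52}]
clean = prepare(raw)
```

Because each step builds new records instead of editing the raw ones, backtracking after a failed assumption check means changing one function and calling prepare again, rather than trying to remember what was done by hand.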
Jeremy says
Good advice. It’s nice to hear from a pro that you have to learn new things all the time, and that analysis can take weeks or months (not just days) when techniques that are new to you are involved. I especially like the advice on having a naming convention for files and intermediate output, something I’m guilty of not doing enough.