Formatting Date Variables seems like it should be straightforward, but sadly, it’s not.
If you are given data that includes dates, expect confusion. Dates can be represented in many different ways.
Avoid two-digit years
The year value of the date may have only two digits. This leads you to guess whether 10/28/29 is referring to the year 2029, a hopeful date not too far in our future, or the year 1929, when the stock market crash started off the Great Depression.
Maybe 12/7/41 is the year 2041, when you see your first grandchild. Or maybe it is 1941, the day that Franklin Delano Roosevelt said would “live in infamy.”
When you feed a two digit year into your statistical software, it will make a guess as to which century it belongs to. Some systems will assume that any two digit date is 19xx. Others will treat large values, say greater than 30, as 19xx and smaller values as 20xx.
If your software guesses wrong, you may end up with a pediatric study with patients aged 101 or -99. It’s easy to mess this up because every package is different and the guesses that your software makes may change over time.
If you inherit data with two digit years, be sure to redisplay the data with four digit years after you import it. Then do a careful review.
Watch the day and month order
In the United States, we ordinarily list dates with the month first, then the day, then the year. In most of the rest of the world, the day is listed first, followed by the month and the year.
For any day 13 and above, it will be obvious what the order is, but there’s plenty of opportunities for confusion. Star Wars Day will be 05/04 here in the U.S., but they will celebrate Guerre Des Etoilles on 04/05 in France.
So know what continent your data is coming from before you start analyzing it.
Month names and abbreviations, of course, will also vary. In Mexico, Santa Claus visits in “Dic,” not “Dec.”
ISO 8601 to the rescue
There is a standard format for displaying dates, ISO 8601. ISO, the International Standards Organization, tries to bring order to a variety of computer systems by enforcing a common standard.
For dates, ISO requires you to specify a four digit year, followed by a two digit month, and ending with a two digit day. The year, month, and day are separated by dashes. So the Declaration of Independence was signed on 1776-07-04.
It’s important to include the leading zeros for single digit months and days. So 1776-7-4 is not consistent with the standard. You can’t use “July” or “Jul” either. It has to be a two digit month.
The ISO standard is nice in that if your software package stores the date as a string, the dates still sort properly.
If you have control over how dates are entered, insist on the ISO 8601 standard.
Randall Munroe, author of the xkcd comic series, satirizes all the variations of date storage in a cartoon:
Conclusions
Dates come in all different shapes and sizes. This can lead to all sorts of confusion.
Two digit years might be placed in the wrong century. Sometimes the months are listed before the days, sometimes afterwards.
You have to do a lot of quality checks when you are using dates that someone else entered. If you are entering dates yourself, use the ISO 8601 standard.
by Steve Simon, PhD
Leave a Reply