You probably learned about the four levels of measurement in your very first statistics class: nominal, ordinal, interval, and ratio.
Knowing the level of measurement of a variable is crucial when working out how to analyze the variable. Failing to correctly match the statistical method to a variable’s level of measurement leads either to nonsense or to misleading results.
But the four-level framework is too simplistic for most real-world data analysis situations.
by Christos Giannoulis, PhD
Attributes are often measured using multiple variables with different upper and lower limits. For example, we may have five measures of political orientation, each with a different range of values.
Each variable is measured in a different way. The measures have a different number of categories and the low and high scores on each measure are different.
Have you ever experienced befuddlement when you dust off a data analysis that you ran six months ago? 
Ever gritted your teeth when your collaborator invalidates all your hard work by telling you that the data set you were working on had “a few minor changes”?
Or panicked when someone running a big meta-analysis asks you to share your data?
If any of these experiences rings true to you, then you need to adopt the philosophy of reproducible research.
We’ve talked a lot around here about the reasons to use syntax — not only menus — in your statistical analyses.
Regardless of which software you use, the syntax file is pretty much always a text file. This is true for R, SPSS, SAS, Stata — just about all of them.
This is important because it means you can use an unlikely tool to help you code: Microsoft Word.
I know what you’re thinking. Word? Really?
Yep, it’s true. Essentially it’s because Word has much better Search-and-Replace options than your stat software’s editor.
Here are a couple features of Word’s search-and-replace that I use to help me code faster:
Sometimes when you’re learning a new stat software package, the most frustrating part is not knowing how to do very basic things. This is especially frustrating if you already know how to do them in some other software.
Let’s look at some basic but very useful commands that are available in R.
We will use the following data set of tourists from different nations, with their gender and number of children. Copy and paste the following data frame definition into R.
A <- structure(list(NATION = structure(c(3L, 3L, 3L, 1L, 3L, 2L, 3L,
1L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 3L, 2L), .Label = c("CHINA",
"GERMANY", "FRANCE"), class = "factor"), GENDER = structure(c(1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L
), .Label = c("F", "M"), class = "factor"), CHILDREN = c(1L,
3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L
)), .Names = c("NATION", "GENDER", "CHILDREN"), row.names = 2:18, class = "data.frame")
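The structure() call above is the kind of text that dput() produces: convenient to paste, but hard to read. As a rough sketch, the same data could also be built with data.frame() and factor(). The names A2, nation_levels, nation_codes and gender_codes are introduced here only for illustration and are not part of the original post.

```r
# Integer codes copied from the structure() call
# (1 = CHINA, 2 = GERMANY, 3 = FRANCE; 1 = F, 2 = M)
nation_levels <- c("CHINA", "GERMANY", "FRANCE")
nation_codes  <- c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)
gender_codes  <- c(1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1)

A2 <- data.frame(
  NATION    = factor(nation_levels[nation_codes], levels = nation_levels),
  GENDER    = factor(c("F", "M")[gender_codes], levels = c("F", "M")),
  CHILDREN  = c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L),
  row.names = 2:18   # keep the same row names as the original object
)

str(A2)   # compact overview of each column's type and first few values
```

Passing levels = explicitly keeps CHINA, GERMANY, FRANCE in the same order as the original, rather than the alphabetical order factor() would choose by default.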
Want to check that R read the variables correctly? We can look at the first 3 rows using the head() command, as follows:
head(A, 3)
NATION GENDER CHILDREN
2 FRANCE F 1
3 FRANCE M 3
4 FRANCE M 2
Now we look at the last 4 rows using the tail() command:
tail(A, 4)
NATION GENDER CHILDREN
15 FRANCE F 1
16 FRANCE M 1
17 FRANCE F 0
18 GERMANY F 2
Now we find the number of rows and number of columns using nrow() and ncol().
nrow(A)
[1] 17
ncol(A)
[1] 3
So we have 17 rows (cases) and three columns (variables). These functions look very basic, but they turn out to be very useful if you want to write R-based software to analyse data sets of different dimensions.
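To illustrate why these functions help with data sets of different dimensions, here is a minimal sketch of a helper that works on any data frame. The function name describe_size is hypothetical; iris and mtcars are example data sets that ship with R.

```r
# Hypothetical helper: report the size of any data frame
describe_size <- function(df) {
  sprintf("%d rows (cases) and %d columns (variables)", nrow(df), ncol(df))
}

describe_size(iris)     # a built-in data set with 150 rows and 5 columns
describe_size(mtcars)   # a built-in data set with 32 rows and 11 columns

# dim() returns both numbers at once: rows first, then columns
dim(mtcars)
```

Because the helper asks the data frame for its own dimensions, nothing has to be hard-coded when the input changes.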
Now let’s attach A and check for the existence of particular data.
attach(A)
As you may know, attaching a data object makes it possible to refer to any variable by name, without having to specify the data object which contains that variable.
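Attaching is convenient in a short session, but attached variables can be masked by other objects of the same name. As a hedged alternative, the sketch below reaches the same variables with $ or with(), without attaching anything. The small data frame tourists is a made-up stand-in, not the data set above.

```r
# Made-up stand-in data frame for illustration
tourists <- data.frame(
  NATION   = factor(c("FRANCE", "CHINA", "GERMANY")),
  CHILDREN = c(1L, 2L, 0L)
)

# Option 1: name the data frame explicitly with $
has_usa_dollar <- any(tourists$NATION == "USA")

# Option 2: evaluate an expression inside the data frame with with()
has_usa_with <- with(tourists, any(NATION == "USA"))

has_usa_dollar
has_usa_with
```

If you do use attach(), calling detach() with the same object when you are finished removes it from the search path again.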
Does the USA appear in the NATION variable? We use the any() command and put USA inside quotation marks.
any(NATION == "USA")
[1] FALSE
Clearly, we do not have any data pertaining to the USA.
What are the distinct values (levels) of the variable NATION?
levels(NATION)
[1] "CHINA" "GERMANY" "FRANCE"
How many observations do we have in the variable NATION? (Note that length() counts every element, including any missing values.)
length(NATION)
[1] 17
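To count only the non-missing values, one option is sum(!is.na(x)). A minimal sketch, using a hypothetical vector x that is not part of the data set:

```r
# Hypothetical vector with two missing values
x <- c("FRANCE", NA, "CHINA", "FRANCE", NA)

n_total       <- length(x)        # counts every element, NA included
n_non_missing <- sum(!is.na(x))   # counts only the elements that are not NA

n_total
n_non_missing
```

For NATION the two counts agree, because the data set contains no missing values.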
OK, but how many different values of NATION do we have?
length(levels(NATION))
[1] 3
We have three different values.
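As a sketch of two related tools: nlevels() is a shortcut for length(levels()), and table() shows how many observations fall into each level. The vector nation_copy below rebuilds the NATION column by hand so that the example runs on its own; the name is introduced only for illustration.

```r
# Rebuild the NATION column from its integer codes (illustrative copy)
nation_levels <- c("CHINA", "GERMANY", "FRANCE")
nation_codes  <- c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)
nation_copy   <- factor(nation_levels[nation_codes], levels = nation_levels)

n_distinct <- nlevels(nation_copy)   # same result as length(levels(nation_copy))
counts     <- table(nation_copy)     # number of observations per level

n_distinct
counts
```

Keep in mind that levels() lists every level defined for the factor, even one with no observations; length(unique(x)) counts only the values actually present.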
Do we have tourists with more than three children? We use the any() command to find out.
any(CHILDREN > 3)
[1] FALSE
None of the tourists in this data set have more than three children.
Do we have any missing data in this data set?
In R, missing data is indicated in the data set with NA.
any(is.na(A))
[1] FALSE
We have no missing data here.
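When a data set does contain missing values, the next question is usually where they are. The sketch below uses a small made-up data frame, demo, to show two base R functions that help: colSums(is.na()) counts NA values per column, and complete.cases() flags the rows that have none.

```r
# Made-up data frame with one NA in each column
demo <- data.frame(
  NATION   = c("FRANCE", NA, "CHINA", "GERMANY"),
  CHILDREN = c(1, 2, NA, 0)
)

na_per_column <- colSums(is.na(demo))   # number of NA values in each column
complete_rows <- complete.cases(demo)   # TRUE for rows with no NA at all

na_per_column
complete_rows
```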
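Which observations involve FRANCE? We use the which() command to identify the relevant indices. Because the comparison is applied to the whole data frame, which() counts positions column by column; NATION is the first column, so the indices here are also the row positions.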
which(A == "FRANCE")
[1] 1 2 3 5 7 9 10 14 15 16
How many observations involve FRANCE? We wrap the above syntax inside the length() command to perform this calculation.
length(which(A == "FRANCE"))
[1] 10
We have a total of ten such observations.
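A slightly more targeted variant, offered as a sketch, compares only the NATION column rather than the whole data frame, and counts matches with sum(), which treats TRUE as 1. The vector nation_col rebuilds the column by hand so the example is self-contained; the name is an assumption for illustration.

```r
# Rebuild the NATION column from its integer codes (illustrative copy)
nation_levels <- c("CHINA", "GERMANY", "FRANCE")
nation_codes  <- c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2)
nation_col    <- factor(nation_levels[nation_codes], levels = nation_levels)

france_rows <- which(nation_col == "FRANCE")   # positions of the FRANCE rows
n_france    <- sum(nation_col == "FRANCE")     # number of FRANCE rows

france_rows
n_france
```

Restricting the comparison to one column avoids accidental matches if another column ever contained the same text.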
That wasn’t so hard! In our next post we will look at further analytic techniques in R.
About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.
See our full R Tutorial Series and other blog posts regarding R programming.