Sometimes when you’re learning a new statistical software package, the most frustrating part is not knowing how to do very basic things. This is especially frustrating if you already know how to do them in some other software.
Let’s look at some basic but very useful commands that are available in R.
We will use the following data set of tourists from different nations, with their gender and number of children. Copy and paste the following code into R to create the data frame A.
A <- structure(list(NATION = structure(c(3L, 3L, 3L, 1L, 3L, 2L, 3L,
1L, 3L, 3L, 1L, 2L, 2L, 3L, 3L, 3L, 2L), .Label = c("CHINA",
"GERMANY", "FRANCE"), class = "factor"), GENDER = structure(c(1L,
2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L
), .Label = c("F", "M"), class = "factor"), CHILDREN = c(1L,
3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L
)), .Names = c("NATION", "GENDER", "CHILDREN"), row.names = 2:18, class = "data.frame")
Want to check that R read the variables correctly? We can look at the first 3 rows using the head() command, as follows:
head(A, 3)
  NATION GENDER CHILDREN
2 FRANCE      F        1
3 FRANCE      M        3
4 FRANCE      M        2
Now we look at the last 4 rows using the tail() command:
tail(A, 4)
    NATION GENDER CHILDREN
15  FRANCE      F        1
16  FRANCE      M        1
17  FRANCE      F        0
18 GERMANY      F        2
Now we find the number of rows and number of columns using nrow() and ncol().
nrow(A)
[1] 17
ncol(A)
[1] 3
So we have 17 rows (cases) and three columns (variables). These functions look very basic, but they turn out to be very useful if you want to write R-based software to analyse data sets of different dimensions.
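As a rough illustration of why nrow() and ncol() matter in reusable code, the sketch below reports the shape of whatever data frame it is given. The function name describe_shape and the small data frame are hypothetical and are not part of the original tutorial.

```r
# Hypothetical helper: report the dimensions and column types of any data frame.
describe_shape <- function(df) {
  cat("Rows:", nrow(df), " Columns:", ncol(df), "\n")
  # seq_len() gives an empty loop if the data frame has no columns
  for (j in seq_len(ncol(df))) {
    cat(" ", names(df)[j], ":", class(df[[j]])[1], "\n")
  }
  # Return the dimensions invisibly so they can be reused by the caller
  invisible(c(rows = nrow(df), cols = ncol(df)))
}

# Made-up example data, used only to exercise the helper
small <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
shape <- describe_shape(small)
```

Because the loop is driven by ncol(df) rather than a hard-coded number, the same function works unchanged on a data frame with 3 columns or 300.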
Now let’s attach A and check for the existence of particular data.
attach(A)
As you may know, attaching a data object makes it possible to refer to any variable by name, without having to specify the data object which contains that variable.
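If you would rather not attach a data frame (attached variables can be masked by other objects of the same name), the sketch below shows two common alternatives, the $ operator and with(). The data frame demo_df and its column names are made up for illustration, so that this example does not interfere with A.

```r
# Made-up data frame, used only for this illustration
demo_df <- data.frame(
  country = c("FRANCE", "CHINA", "GERMANY"),
  kids    = c(1L, 2L, 0L)
)

# Option 1: name the data frame explicitly with $
mean_dollar <- mean(demo_df$kids)

# Option 2: evaluate an expression inside the data frame with with()
mean_with <- with(demo_df, mean(kids))

# Option 3: attach(), paired with detach() once you are finished
attach(demo_df)
mean_attached <- mean(kids)
detach(demo_df)
```

All three approaches give the same answer; the first two simply keep the source of each variable visible in the code.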
Does the USA appear in the NATION variable? We use the any() command and put USA inside quotation marks.
any(NATION == "USA")
[1] FALSE
Clearly, we do not have any data pertaining to the USA.
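One caveat worth knowing: if a variable contains missing values, any() combined with == can return NA rather than FALSE. The sketch below uses a made-up vector (not the tutorial data, which has no missing values) to show this, along with the %in% operator as an alternative.

```r
# Made-up vector containing one missing value
nation <- c("FRANCE", "CHINA", NA, "GERMANY")

# The comparison with NA is itself NA, so any() cannot say FALSE for certain
res_plain <- any(nation == "USA")                 # NA

# Ignoring missing values restores a definite answer
res_narm  <- any(nation == "USA", na.rm = TRUE)   # FALSE

# %in% never returns NA
res_in    <- "USA" %in% nation                    # FALSE
```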
What are the values of the variable NATION?
levels(NATION)
[1] "CHINA"   "GERMANY" "FRANCE"
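How many observations do we have in the variable NATION? Note that length() counts every element, including any missing values; this data set has none, so here the result is also the number of non-missing observations.
length(NATION)
[1] 17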
OK, but how many different values of NATION do we have?
length(levels(NATION))
[1] 3
We have three different values.
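The counts above agree with each other because the tutorial data has no missing values and no unused factor levels. The sketch below uses a made-up factor, with one NA and one unused level, to show how the different ways of counting can diverge.

```r
# Made-up factor: one missing value, and a level ("USA") that never occurs
nation <- factor(c("FRANCE", "CHINA", NA, "FRANCE", "GERMANY"),
                 levels = c("CHINA", "GERMANY", "FRANCE", "USA"))

n_total       <- length(nation)               # 5: counts the NA too
n_non_missing <- sum(!is.na(nation))          # 4: non-missing values only
n_levels      <- nlevels(nation)              # 4: includes the unused "USA"
n_observed    <- nlevels(droplevels(nation))  # 3: levels that actually occur

# table() gives a count per level, showing 0 for "USA"
counts <- table(nation)
```

nlevels(x) is shorthand for length(levels(x)).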
Do we have tourists with more than three children? We use the any() command to find out.
any(CHILDREN > 3)
[1] FALSE
None of the tourists in this data set have more than three children.
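A few related checks can confirm the same conclusion from different angles. The sketch below copies the CHILDREN values from the data set into a standalone vector so that it runs on its own.

```r
# CHILDREN values copied from the data set above
children <- c(1L, 3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L)

more_than_three <- any(children > 3)    # FALSE
all_at_most_3   <- all(children <= 3)   # TRUE: the complementary check
largest         <- max(children)        # 3

# Positions of the tourists with the largest number of children
largest_at <- which(children == largest)  # 2 5
```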
Do we have any missing data in this data set?
In R, missing data is indicated in the data set with NA.
any(is.na(A))
[1] FALSE
We have no missing data here.
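When a data set does contain missing values, it usually helps to know where they are. The sketch below uses a made-up data frame called survey (not the tutorial data) to count missing values per column and to find incomplete rows.

```r
# Made-up data frame with two missing values
survey <- data.frame(
  nation   = c("FRANCE", NA, "CHINA", "GERMANY"),
  children = c(1L, 2L, NA, 0L)
)

has_missing     <- any(is.na(survey))             # TRUE
missing_per_col <- colSums(is.na(survey))         # nation 1, children 1
incomplete_rows <- which(!complete.cases(survey)) # 2 3
n_complete      <- sum(complete.cases(survey))    # 2
```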
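Which observations involve FRANCE? We use the which() command to identify the relevant indices. Because A == "FRANCE" compares every cell of the data frame, which() counts positions column by column; NATION is the first column, so the indices below are row positions (1 to 17), not the row names (2 to 18).
which(A == "FRANCE")
[1]  1  2  3  5  7  9 10 14 15 16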
How many observations involve FRANCE? We wrap the above syntax inside the length() command to perform this calculation.
length(which(A == "FRANCE"))
[1] 10
We have a total of ten such observations.
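As an alternative sketch, the same questions can be asked of the NATION column alone, which avoids comparing every cell of the data frame. The code below rebuilds only that column (with the original row names 2 to 18) so that it runs by itself.

```r
# Rebuild the NATION column of A from its integer codes and labels
A <- data.frame(
  NATION = factor(c(3, 3, 3, 1, 3, 2, 3, 1, 3, 3, 1, 2, 2, 3, 3, 3, 2),
                  levels = 1:3, labels = c("CHINA", "GERMANY", "FRANCE")),
  row.names = 2:18
)

# Row positions of the FRANCE observations
pos <- which(A$NATION == "FRANCE")   # 1 2 3 5 7 9 10 14 15 16

# The corresponding row names, which run from 2 to 18 in this data set
france_rows <- rownames(A)[pos]

# Summing a logical vector counts its TRUE values directly
n_france <- sum(A$NATION == "FRANCE")   # 10

# table() counts every nation at once
nation_counts <- table(A$NATION)
```

Note that sum() on a logical vector returns NA if the vector contains missing values, unless na.rm = TRUE is supplied.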
That wasn’t so hard! In our next post we will look at further analytic techniques in R.
About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.
See our full R Tutorial Series and other blog posts regarding R programming.
Tony says
Thanks David! This guideline is very easy to follow and very helpful!
Charles O'Riley says
Thanks for the exercise. Do note, where you’re setting the ‘Children’ variable, there is an incorrect character at the end of the value list. See below:
CHILDREN = c(1L,
3L, 2L, 2L, 3L, 1L, 0L, 1L, 0L, 1L, 2L, 2L, 1L, 1L, 1L, 0L, 2L
<))