R Is Not So Hard! A Tutorial, Part 15: Counting Elements in a Data Set

Combining the length() and which() commands gives a handy method of counting elements that meet particular criteria.

b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)
b

Let’s count the 3s in the vector b.

count3 <- length(which(b == 3))
count3
[1] 4
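To see why this works, it can help to look at what which() returns on its own. The short sketch below redefines b so that it runs by itself; the name idx is just an illustrative choice.

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# which() returns the positions (indices) where the condition is TRUE
idx <- which(b == 3)
idx          # 4 7 8 13

# length() then counts how many positions were returned
length(idx)  # 4
```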

In fact, you can count the number of elements that satisfy almost any given condition.

length(which(b < 7))
[1] 9
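Conditions can also be combined with & (and) and | (or). The following is a small sketch along those lines; the names n_between and n_extreme are made up for illustration.

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# Elements greater than 2 AND less than 8
n_between <- length(which(b > 2 & b < 8))
n_between   # 8

# Elements that are negative OR greater than 7
n_extreme <- length(which(b < 0 | b > 7))
n_extreme   # 4
```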

Here is an alternative approach, which also uses the length() command but relies on square brackets for subsetting:

length(b[ b < 7 ])
[1] 9

The square brackets allow us to subset. For such operations using square brackets, I like to use the words “such that”. Here, we have the elements of b, such that the elements are less than 7.
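As a quick illustration of the difference between the two approaches, the sketch below prints the elements selected by the square brackets next to the positions returned by which(). The name small is only an illustrative choice.

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# Logical subsetting returns the elements themselves ...
small <- b[b < 7]
small          # 2 4 3 -1 -2 3 3 6 3

# ... whereas which() returns their positions
which(b < 7)   # 2 3 4 5 6 7 8 9 13

# Either way, the count is the same
length(small) == length(which(b < 7))   # TRUE
```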

R provides another alternative that not everyone knows about:

sum(b < 7)
[1] 9

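This syntax gives a count rather than a sum. Be aware of the meaning of syntax like sum(b < 7). Here sum() operates on the logical vector b < 7, whose elements are either TRUE or FALSE. Try entering b < 7 at the keyboard (take care to type b < 7 and not b <- 7, which would overwrite b with the single value 7).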

b < 7
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE

We see that sum(b < 7) counts the number of elements that are TRUE. There are nine such elements.
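One way to see why sum() ends up counting is to convert the logical vector to numbers explicitly. This is only a sketch for illustration; the name flags is an assumption, and in practice sum() does the conversion for you.

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# In arithmetic, TRUE is treated as 1 and FALSE as 0
flags <- as.integer(b < 7)
flags        # 0 1 1 1 1 1 1 1 1 0 0 0 1

# Summing the 0s and 1s is the same as counting the TRUEs
sum(flags)   # 9
```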

Now try:

mean(b < 7)
[1] 0.6923077

That syntax found the proportion of elements meeting the criterion rather than the mean of b. Again, if you use the sum() and mean() functions on a condition you must be very careful to ensure that your output is what you intended. Note that sum(), length() with square brackets, and length(which()) all provide mechanisms for counting elements.
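One place where that care matters is missing values, because the counting approaches stop agreeing as soon as the vector contains an NA. The sketch below assumes a copy of b with one NA appended (called b1 here purely for illustration).

```r
b  <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)
b1 <- c(b, NA)   # same data with one missing value appended

length(which(b1 < 7))       # 9   which() silently drops the NA
sum(b1 < 7)                 # NA  the NA propagates through sum()
sum(b1 < 7, na.rm = TRUE)   # 9   na.rm = TRUE removes it first
length(b1[b1 < 7])          # 10  the NA is kept as an extra element
mean(b1 < 7, na.rm = TRUE)  # 0.6923077 (9 of the 13 non-missing values)
```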

Now find the percentage of 7s in b.

P7 <- 100 * length(which(b == 7)) / length(b)
P7
[1] 15.38462
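Since mean() of a condition gives a proportion, the same percentage can presumably be reached more directly. A minimal sketch, with P7_alt as an illustrative name:

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# mean() of a logical vector is a proportion; multiply by 100 for a percentage
P7_alt <- 100 * mean(b == 7)
P7_alt   # 15.38462
```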

Extension Example

You can find counts and percentages using functions that involve length(which()). Here we create two functions: one for finding counts, and the other for calculating percentages.

count <- function(x, n) { length(which(x == n)) }
perc <- function(x, n) { 100 * length(which(x == n)) / length(x) }

Note the syntax involved in setting up a function in R. Now let’s use the count function to count the threes in the vector b.

count(b, 3)
[1] 4

Similarly, the perc function finds the percentage of 4s in b:

perc(b, 4)
[1] 7.692308
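If you need counts or percentages for every distinct value at once, base R's table() and prop.table() are one possible route. This is a sketch of an alternative, not part of the functions defined above; the names counts and percs are illustrative.

```r
b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)

# Count every distinct value at once
counts <- table(b)
counts[["3"]]   # 4

# Convert the counts to percentages of the whole vector
percs <- 100 * prop.table(counts)
percs[["4"]]    # 7.692308
```

The values are looked up by name, so they are quoted: "3" rather than 3.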

That wasn’t so hard! In our next blog post we’ll discuss counting values within cases.

About the Author: David Lillis has taught R to many researchers and statisticians. His company, Sigma Statistics and Research Limited, provides both on-line instruction and face-to-face workshops on R, and coding services in R. David holds a doctorate in applied statistics.

See our full R Tutorial Series and other blog posts regarding R programming.

 


Comments

  1. Rob Baer says

    Missing Values
    Just a note on using length() on a whole vector that includes NA. The missing values are counted in the whole vector length when using the length() function.

    b <- c(7, 2, 4, 3, -1, -2, 3, 3, 6, 8, 12, 7, 3)
    b1 <- c(b, NA)
    length(b)   # 13
    length(b1)  # 14, the NA is counted
    sd(b)
    sd(b1, na.rm = TRUE)

    # If you want an "n" to go with the sd for b1, don't use length().
    (n = sum(!is.na(b))) # 13
    (n = sum(!is.na(b1))) # 13

  2. Nathalie says

    I am stuck on a string-counting issue and could not find any helpful post so far. Maybe someone here can help me:

    I have a string variable tours in my dataframe df that represents the different stops an individual made during a journey.

    For example:
    1. home_work_leisure_home
    2. home_work_shopping_work_home
    3. home_work_leisure_errand_home

    In transport planning we group activities into primary activities (work and education) and secondary activities (everything else). I want to count the number of secondary activities before the first primary activity, in between two primary activities, and after the last primary activity for each tour.

    This means I am looking for a function in R that:
    a. identifies the first work in the string variable,
    b. then counts the number of activities before this first work activity
    c. then identifies the last work in the string if there is more than one
    d. if there is then count the number of activities between the two work activities,
    e. then count the number of activities after the last work activity

    The result for the three example tours then would be:
    1. number of activities before first primary: 1 (home)
       number of activities between first and last primary: 0
       number of activities after last primary: 2 (leisure & home)
       number of primary activities: 1 (work)
    2. number of activities before first primary: 1 (home)
       number of activities between first and last primary: 1 (shopping)
       number of activities after last primary: 1 (home)
       number of primary activities: 2 (work)
    3. number of activities before first primary: 1 (home)
       number of activities between first and last primary: 0
       number of activities after last primary: 3 (leisure, errand & home)
       number of primary activities: 1 (work)

    I would be super thankful if someone could give me a hand with this issue – even if it is a link to a similar question.

    Thank you. Kind regards, N

    • Karen Grace-Martin says

      Nathalie,

      I’m not the R expert, but I’ve done a lot of this kind of thing in other software. It sounds like this will be a multi-step process. The very first thing you need to do is split this into multiple variables.

  3. Pranjit Sarmah says

    obj <- function(x, y, x_cat, y_val) {
      xx <- which(x == x_cat)
      yy <- which(y == y_val)
      ## will return the index of observation for which x_cat has observation value y_val
      return(xx[xx %in% yy])
    }

  4. Karol says

    Hi,
    I have a data something like this:
    X Y
    A 1
    A 2
    B 1
    B 2
    B 3
    C 1

    I mean: X is a factor with k categories and Y is a continuous variable.
    I’d like to compute a vector (let’s say Z) giving which observation of X (within each category) each row is, as Y does here. Something like an ID within each category of X. Can you please give me a tip?
    Thank you in advance!
    Karol


