Extract rows

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can I extract rows from a data frame based on their values?

Objectives
  • Filter data frames using logical operators

  • Combine logical expressions in a filter

If we are looking to subset the rows of our data (rather than the columns using select()) we can use filter(). This function takes a data frame as it’s first argument, then a set of conditions to check. Any row of the data frame that meets all the conditions is kept and any that fail are discarded.

For example, to get just the Australian data from the gapminder set:

filter(gapminder, country == "Australia")
# A tibble: 12 × 6
   country   continent  year lifeExp      pop gdpPercap
   <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Australia Oceania    1952    69.1  8691212    10040.
 2 Australia Oceania    1957    70.3  9712569    10950.
 3 Australia Oceania    1962    70.9 10794968    12217.
 4 Australia Oceania    1967    71.1 11872264    14526.
 5 Australia Oceania    1972    71.9 13177000    16789.
 6 Australia Oceania    1977    73.5 14074100    18334.
 7 Australia Oceania    1982    74.7 15184200    19477.
 8 Australia Oceania    1987    76.3 16257249    21889.
 9 Australia Oceania    1992    77.6 17481977    23425.
10 Australia Oceania    1997    78.8 18565243    26998.
11 Australia Oceania    2002    80.4 19546792    30688.
12 Australia Oceania    2007    81.2 20434176    34435.

or to get only data from 1997 or later:

filter(gapminder, year >= 1997)
# A tibble: 426 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1997    41.8 22227415      635.
 2 Afghanistan Asia       2002    42.1 25268405      727.
 3 Afghanistan Asia       2007    43.8 31889923      975.
 4 Albania     Europe     1997    73.0  3428038     3193.
 5 Albania     Europe     2002    75.7  3508512     4604.
 6 Albania     Europe     2007    76.4  3600523     5937.
 7 Algeria     Africa     1997    69.2 29072015     4797.
 8 Algeria     Africa     2002    71.0 31287142     5288.
 9 Algeria     Africa     2007    72.3 33333216     6223.
10 Angola      Africa     1997    41.0  9875024     2277.
# ℹ 416 more rows

Filter operators

Any of the standard (comparison operators)(../02-Using-R/index.html#comparing-things) can be used in a filter

Any function that produces a TRUE/FALSE output (such as the is.na() function) can be used as well.

Challenge 1

Extract all rows from gapminder where the recorded life expectancy is 80 or higher.

Then try to extract just the rows from European countries.

Solution to Challenge 1

filter(gapminder, lifeExp >= 80)
# A tibble: 22 × 6
   country          continent  year lifeExp      pop gdpPercap
   <chr>            <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Australia        Oceania    2002    80.4 19546792    30688.
 2 Australia        Oceania    2007    81.2 20434176    34435.
 3 Canada           Americas   2007    80.7 33390141    36319.
 4 France           Europe     2007    80.7 61083916    30470.
 5 Hong Kong, China Asia       1997    80    6495918    28378.
 6 Hong Kong, China Asia       2002    81.5  6762476    30209.
 7 Hong Kong, China Asia       2007    82.2  6980412    39725.
 8 Iceland          Europe     2002    80.5   288030    31163.
 9 Iceland          Europe     2007    81.8   301931    36181.
10 Israel           Asia       2007    80.7  6426679    25523.
# ℹ 12 more rows
filter(gapminder, continent == "Europe")
# A tibble: 360 × 6
   country continent  year lifeExp     pop gdpPercap
   <chr>   <chr>     <dbl>   <dbl>   <dbl>     <dbl>
 1 Albania Europe     1952    55.2 1282697     1601.
 2 Albania Europe     1957    59.3 1476505     1942.
 3 Albania Europe     1962    64.8 1728137     2313.
 4 Albania Europe     1967    66.2 1984060     2760.
 5 Albania Europe     1972    67.7 2263554     3313.
 6 Albania Europe     1977    68.9 2509048     3533.
 7 Albania Europe     1982    70.4 2780097     3631.
 8 Albania Europe     1987    72   3075321     3739.
 9 Albania Europe     1992    71.6 3326498     2497.
10 Albania Europe     1997    73.0 3428038     3193.
# ℹ 350 more rows

If you provide multiple filter conditions (separated by a comma) then only rows matching all of the conditions will be kept.

filter(gapminder, country == "Australia", year >= 1997)
# A tibble: 3 × 6
  country   continent  year lifeExp      pop gdpPercap
  <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Australia Oceania    1997    78.8 18565243    26998.
2 Australia Oceania    2002    80.4 19546792    30688.
3 Australia Oceania    2007    81.2 20434176    34435.

Challenge 2

How would you extract just rows that are healthy (life expectancy of 80 or higher) and rich (GDP per capita of $30,000 or higher)?

Solution to Challenge 2

filter(gapminder, lifeExp >= 80, gdpPercap >= 30000)
# A tibble: 13 × 6
   country          continent  year lifeExp       pop gdpPercap
   <chr>            <chr>     <dbl>   <dbl>     <dbl>     <dbl>
 1 Australia        Oceania    2002    80.4  19546792    30688.
 2 Australia        Oceania    2007    81.2  20434176    34435.
 3 Canada           Americas   2007    80.7  33390141    36319.
 4 France           Europe     2007    80.7  61083916    30470.
 5 Hong Kong, China Asia       2002    81.5   6762476    30209.
 6 Hong Kong, China Asia       2007    82.2   6980412    39725.
 7 Iceland          Europe     2002    80.5    288030    31163.
 8 Iceland          Europe     2007    81.8    301931    36181.
 9 Japan            Asia       2007    82.6 127467972    31656.
10 Norway           Europe     2007    80.2   4627926    49357.
11 Sweden           Europe     2007    80.9   9031088    33860.
12 Switzerland      Europe     2002    80.6   7361757    34481.
13 Switzerland      Europe     2007    81.7   7554661    37506.

If you need to keep all rows that meet one or the other condition, use the logical OR operator (|). | resolves as TRUE if either of the left and right conditions are TRUE.

# Tries to find data from Australia AND New Zealand (and returns an empty table)
filter(gapminder, country == "Australia", country == "New Zealand")
# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, continent <chr>, year <dbl>, lifeExp <dbl>,
#   pop <dbl>, gdpPercap <dbl>
# Use | to look for data from Australia OR New Zealand
filter(gapminder, country == "Australia" | country == "New Zealand")
# A tibble: 24 × 6
   country   continent  year lifeExp      pop gdpPercap
   <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Australia Oceania    1952    69.1  8691212    10040.
 2 Australia Oceania    1957    70.3  9712569    10950.
 3 Australia Oceania    1962    70.9 10794968    12217.
 4 Australia Oceania    1967    71.1 11872264    14526.
 5 Australia Oceania    1972    71.9 13177000    16789.
 6 Australia Oceania    1977    73.5 14074100    18334.
 7 Australia Oceania    1982    74.7 15184200    19477.
 8 Australia Oceania    1987    76.3 16257249    21889.
 9 Australia Oceania    1992    77.6 17481977    23425.
10 Australia Oceania    1997    78.8 18565243    26998.
# ℹ 14 more rows

It might seem natural to write country == "Australia" | "New Zealand" but try it and you will see that you get an error. Each side of the | operator must result in a TRUE or FALSE, and “New Zealand” is neither.

Challenge 3

Modify your answer to Challenge 2 so that you extract the rows where countries are either healthy (life expectancy of 80 or higher) or rich (GDP per capita of $30,000 or higher).

Solution to Challenge 3

filter(gapminder, lifeExp >= 80 | gdpPercap >= 30000)
# A tibble: 65 × 6
   country   continent  year lifeExp      pop gdpPercap
   <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Australia Oceania    2002    80.4 19546792    30688.
 2 Australia Oceania    2007    81.2 20434176    34435.
 3 Austria   Europe     2002    79.0  8148312    32418.
 4 Austria   Europe     2007    79.8  8199783    36126.
 5 Belgium   Europe     2002    78.3 10311970    30486.
 6 Belgium   Europe     2007    79.4 10392226    33693.
 7 Canada    Americas   2002    79.8 31902268    33329.
 8 Canada    Americas   2007    80.7 33390141    36319.
 9 Denmark   Europe     2002    77.2  5374693    32167.
10 Denmark   Europe     2007    78.3  5468120    35278.
# ℹ 55 more rows

When you have many possible matches from a row that you want to keep, writing a long expression with many | can be time consuming and error prone. Instead, the %in% operator can be used to simplify things. %in% returns TRUE if the left hand side is found somewhere in the right hand side and FALSE otherwise.

filter(gapminder, country %in% c("Australia", "New Zealand"))
# A tibble: 24 × 6
   country   continent  year lifeExp      pop gdpPercap
   <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Australia Oceania    1952    69.1  8691212    10040.
 2 Australia Oceania    1957    70.3  9712569    10950.
 3 Australia Oceania    1962    70.9 10794968    12217.
 4 Australia Oceania    1967    71.1 11872264    14526.
 5 Australia Oceania    1972    71.9 13177000    16789.
 6 Australia Oceania    1977    73.5 14074100    18334.
 7 Australia Oceania    1982    74.7 15184200    19477.
 8 Australia Oceania    1987    76.3 16257249    21889.
 9 Australia Oceania    1992    77.6 17481977    23425.
10 Australia Oceania    1997    78.8 18565243    26998.
# ℹ 14 more rows

Challenge 4

Extract the rows from all countries in Africa, Asia, or Europe. How many rows does your dataframe have?

Solution to Challenge 4

filter(gapminder, continent %in% c("Africa","Asia", "Europe"))
# A tibble: 1,380 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,370 more rows

Key Points

  • Use filter() to choose data based on values.

  • The logical operator %in% can filter data from a list of possible values