Extract rows
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can I extract rows from a data frame based on their values?
Objectives
Filter data frames using logical operators
Combine logical expressions in a filter
If we are looking to subset the rows of our data (rather than the columns using select()
) we can
use filter()
. This function takes a data frame as it’s first argument, then a set of conditions to
check. Any row of the data frame that meets all the conditions is kept and any that fail are discarded.
For example, to get just the Australian data from the gapminder set:
filter(gapminder, country == "Australia")
# A tibble: 12 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
11 Australia Oceania 2002 80.4 19546792 30688.
12 Australia Oceania 2007 81.2 20434176 34435.
or to get only data from 1997 or later:
filter(gapminder, year >= 1997)
# A tibble: 426 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1997 41.8 22227415 635.
2 Afghanistan Asia 2002 42.1 25268405 727.
3 Afghanistan Asia 2007 43.8 31889923 975.
4 Albania Europe 1997 73.0 3428038 3193.
5 Albania Europe 2002 75.7 3508512 4604.
6 Albania Europe 2007 76.4 3600523 5937.
7 Algeria Africa 1997 69.2 29072015 4797.
8 Algeria Africa 2002 71.0 31287142 5288.
9 Algeria Africa 2007 72.3 33333216 6223.
10 Angola Africa 1997 41.0 9875024 2277.
# ℹ 416 more rows
Filter operators
Any of the standard (comparison operators)(../02-Using-R/index.html#comparing-things) can be used in a filter
==
equal to!=
not equal to<
/>
less/greater than<=
/>=
less/greater than or equal to
Any function that produces a TRUE
/FALSE
output (such as the is.na()
function) can be used as well.
Challenge 1
Extract all rows from
gapminder
where the recorded life expectancy is 80 or higher.Then try to extract just the rows from European countries.
Solution to Challenge 1
filter(gapminder, lifeExp >= 80)
# A tibble: 22 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Australia Oceania 2002 80.4 19546792 30688. 2 Australia Oceania 2007 81.2 20434176 34435. 3 Canada Americas 2007 80.7 33390141 36319. 4 France Europe 2007 80.7 61083916 30470. 5 Hong Kong, China Asia 1997 80 6495918 28378. 6 Hong Kong, China Asia 2002 81.5 6762476 30209. 7 Hong Kong, China Asia 2007 82.2 6980412 39725. 8 Iceland Europe 2002 80.5 288030 31163. 9 Iceland Europe 2007 81.8 301931 36181. 10 Israel Asia 2007 80.7 6426679 25523. # ℹ 12 more rows
filter(gapminder, continent == "Europe")
# A tibble: 360 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Albania Europe 1952 55.2 1282697 1601. 2 Albania Europe 1957 59.3 1476505 1942. 3 Albania Europe 1962 64.8 1728137 2313. 4 Albania Europe 1967 66.2 1984060 2760. 5 Albania Europe 1972 67.7 2263554 3313. 6 Albania Europe 1977 68.9 2509048 3533. 7 Albania Europe 1982 70.4 2780097 3631. 8 Albania Europe 1987 72 3075321 3739. 9 Albania Europe 1992 71.6 3326498 2497. 10 Albania Europe 1997 73.0 3428038 3193. # ℹ 350 more rows
If you provide multiple filter conditions (separated by a comma) then only rows matching all of the conditions will be kept.
filter(gapminder, country == "Australia", year >= 1997)
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1997 78.8 18565243 26998.
2 Australia Oceania 2002 80.4 19546792 30688.
3 Australia Oceania 2007 81.2 20434176 34435.
Challenge 2
How would you extract just rows that are healthy (life expectancy of 80 or higher) and rich (GDP per capita of $30,000 or higher)?
Solution to Challenge 2
filter(gapminder, lifeExp >= 80, gdpPercap >= 30000)
# A tibble: 13 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Australia Oceania 2002 80.4 19546792 30688. 2 Australia Oceania 2007 81.2 20434176 34435. 3 Canada Americas 2007 80.7 33390141 36319. 4 France Europe 2007 80.7 61083916 30470. 5 Hong Kong, China Asia 2002 81.5 6762476 30209. 6 Hong Kong, China Asia 2007 82.2 6980412 39725. 7 Iceland Europe 2002 80.5 288030 31163. 8 Iceland Europe 2007 81.8 301931 36181. 9 Japan Asia 2007 82.6 127467972 31656. 10 Norway Europe 2007 80.2 4627926 49357. 11 Sweden Europe 2007 80.9 9031088 33860. 12 Switzerland Europe 2002 80.6 7361757 34481. 13 Switzerland Europe 2007 81.7 7554661 37506.
If you need to keep all rows that meet one or the other condition, use the logical OR operator (|
).
|
resolves as TRUE
if either of the left and right conditions are TRUE
.
# Tries to find data from Australia AND New Zealand (and returns an empty table)
filter(gapminder, country == "Australia", country == "New Zealand")
# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, continent <chr>, year <dbl>, lifeExp <dbl>,
# pop <dbl>, gdpPercap <dbl>
# Use | to look for data from Australia OR New Zealand
filter(gapminder, country == "Australia" | country == "New Zealand")
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ℹ 14 more rows
It might seem natural to write country == "Australia" | "New Zealand"
but try it and you will see
that you get an error. Each side of the |
operator must result in a TRUE
or FALSE
, and “New
Zealand” is neither.
Challenge 3
Modify your answer to Challenge 2 so that you extract the rows where countries are either healthy (life expectancy of 80 or higher) or rich (GDP per capita of $30,000 or higher).
Solution to Challenge 3
filter(gapminder, lifeExp >= 80 | gdpPercap >= 30000)
# A tibble: 65 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Australia Oceania 2002 80.4 19546792 30688. 2 Australia Oceania 2007 81.2 20434176 34435. 3 Austria Europe 2002 79.0 8148312 32418. 4 Austria Europe 2007 79.8 8199783 36126. 5 Belgium Europe 2002 78.3 10311970 30486. 6 Belgium Europe 2007 79.4 10392226 33693. 7 Canada Americas 2002 79.8 31902268 33329. 8 Canada Americas 2007 80.7 33390141 36319. 9 Denmark Europe 2002 77.2 5374693 32167. 10 Denmark Europe 2007 78.3 5468120 35278. # ℹ 55 more rows
When you have many possible matches from a row that you want to keep, writing a long expression with
many |
can be time consuming and error prone. Instead, the %in%
operator can be used to simplify
things. %in%
returns TRUE
if the left hand side is found somewhere in the right hand side and
FALSE
otherwise.
filter(gapminder, country %in% c("Australia", "New Zealand"))
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ℹ 14 more rows
Challenge 4
Extract the rows from all countries in Africa, Asia, or Europe. How many rows does your dataframe have?
Solution to Challenge 4
filter(gapminder, continent %in% c("Africa","Asia", "Europe"))
# A tibble: 1,380 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ℹ 1,370 more rows
Key Points
Use
filter()
to choose data based on values.The logical operator
%in%
can filter data from a list of possible values