Extract rows
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can I extract rows from a data frame based on their values?
Objectives
Filter data frames using logical operators
Combine logical expressions in a filter
If we are looking to subset the rows of our data (rather than the columns using select()
) we can
use filter()
. This function takes a data frame as it’s first argument, then a set of conditions to
check. Any row of the data frame that meets all the conditions is kept and any that fail are discarded.
For example, to get just the Australian data from the gapminder set:
filter(gapminder, country == "Australia")
# A tibble: 12 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
11 Australia Oceania 2002 80.4 19546792 30688.
12 Australia Oceania 2007 81.2 20434176 34435.
or to get only data from 1997 or later:
filter(gapminder, year >= 1997)
# A tibble: 426 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1997 41.8 22227415 635.
2 Afghanistan Asia 2002 42.1 25268405 727.
3 Afghanistan Asia 2007 43.8 31889923 975.
4 Albania Europe 1997 73.0 3428038 3193.
5 Albania Europe 2002 75.7 3508512 4604.
6 Albania Europe 2007 76.4 3600523 5937.
7 Algeria Africa 1997 69.2 29072015 4797.
8 Algeria Africa 2002 71.0 31287142 5288.
9 Algeria Africa 2007 72.3 33333216 6223.
10 Angola Africa 1997 41.0 9875024 2277.
# ℹ 416 more rows
Filter operators
Any of the standard (comparison operators)(../02-Using-R/index.html#comparing-things) can be used in a filter
==
equal to!=
not equal to<
/>
less/greater than<=
/>=
less/greater than or equal to
Any function that produces a TRUE
/FALSE
output (such as the is.na()
function) can be used as well.
Challenge 1
Extract all rows from
gapminder
where the recorded life expectancy is 80 or higher.Then try to extract just the rows from European countries.
Solution to Challenge 1
If you provide multiple filter conditions (separated by a comma) then only rows matching all of the conditions will be kept.
filter(gapminder, country == "Australia", year >= 1997)
# A tibble: 3 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1997 78.8 18565243 26998.
2 Australia Oceania 2002 80.4 19546792 30688.
3 Australia Oceania 2007 81.2 20434176 34435.
Challenge 2
How would you extract just rows that are healthy (life expectancy of 80 or higher) and rich (GDP per capita of $30,000 or higher)?
Solution to Challenge 2
If you need to keep all rows that meet one or the other condition, use the logical OR operator (|
).
|
resolves as TRUE
if either of the left and right conditions are TRUE
.
# Tries to find data from Australia AND New Zealand (and returns an empty table)
filter(gapminder, country == "Australia", country == "New Zealand")
# A tibble: 0 × 6
# ℹ 6 variables: country <chr>, continent <chr>, year <dbl>, lifeExp <dbl>,
# pop <dbl>, gdpPercap <dbl>
# Use | to look for data from Australia OR New Zealand
filter(gapminder, country == "Australia" | country == "New Zealand")
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ℹ 14 more rows
It might seem natural to write country == "Australia" | "New Zealand"
but try it and you will see
that you get an error. Each side of the |
operator must result in a TRUE
or FALSE
, and “New
Zealand” is neither.
Challenge 3
Modify your answer to Challenge 2 so that you extract the rows where countries are either healthy (life expectancy of 80 or higher) or rich (GDP per capita of $30,000 or higher).
Solution to Challenge 3
When you have many possible matches from a row that you want to keep, writing a long expression with
many |
can be time consuming and error prone. Instead, the %in%
operator can be used to simplify
things. %in%
returns TRUE
if the left hand side is found somewhere in the right hand side and
FALSE
otherwise.
filter(gapminder, country %in% c("Australia", "New Zealand"))
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ℹ 14 more rows
Challenge 4
Extract the rows from all countries in Africa, Asia, or Europe. How many rows does your dataframe have?
Solution to Challenge 4
Key Points
Use
filter()
to choose data based on values.The logical operator
%in%
can filter data from a list of possible values