Selecting columns
Overview
Teaching: 15 min
Exercises: 10 minQuestions
How can I select specific columns from a data frame to work with?
Objectives
Select columns using several methods
Rename columns to make them easier to work with
The first task we will cover is extracting specific columns from a dataset to work with. This may be needed if your data is very, very wide and you are only interested in a few columns. The gapminder data doesn’t exactly meet that description, but we will continue using it to understand how the functions work.
gapminder <- read_csv("data/gapminder.csv")
select()
You can specify columns to keep from a data frame using select()
. This function, as well as many
others we will be learning today are found in the dplyr
package of the tidyverse. At it’s simplest,
you provide select()
with a data frame and the column names you would like to keep.
#Select the year, country and pop columns
select(gapminder, year, country, pop)
# A tibble: 1,704 × 3
year country pop
<dbl> <chr> <dbl>
1 1952 Afghanistan 8425333
2 1957 Afghanistan 9240934
3 1962 Afghanistan 10267083
4 1967 Afghanistan 11537966
5 1972 Afghanistan 13079460
6 1977 Afghanistan 14880372
7 1982 Afghanistan 12881816
8 1987 Afghanistan 13867957
9 1992 Afghanistan 16317921
10 1997 Afghanistan 22227415
# ℹ 1,694 more rows
Here you can see that the select()
produces a new data frame with just the columns you provided.
It also keeps them in the order you specified.
You can also select columns by index position:
#Select the first, third and fourth columns
select(gapminder, 1, 3, 4)
# A tibble: 1,704 × 3
country year lifeExp
<chr> <dbl> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
6 Afghanistan 1977 38.4
7 Afghanistan 1982 39.9
8 Afghanistan 1987 40.8
9 Afghanistan 1992 41.7
10 Afghanistan 1997 41.8
# ℹ 1,694 more rows
or you can select columns to exclude by using negation:
#Remove lifeExp and pop columns
select(gapminder, -lifeExp, -pop)
# A tibble: 1,704 × 4
country continent year gdpPercap
<chr> <chr> <dbl> <dbl>
1 Afghanistan Asia 1952 779.
2 Afghanistan Asia 1957 821.
3 Afghanistan Asia 1962 853.
4 Afghanistan Asia 1967 836.
5 Afghanistan Asia 1972 740.
6 Afghanistan Asia 1977 786.
7 Afghanistan Asia 1982 978.
8 Afghanistan Asia 1987 852.
9 Afghanistan Asia 1992 649.
10 Afghanistan Asia 1997 635.
# ℹ 1,694 more rows
If the columns you want are all together in the original data frame, you can select them as a range
#Select all columns between year and lifeExp
select(gapminder, year:lifeExp)
# A tibble: 1,704 × 2
year lifeExp
<dbl> <dbl>
1 1952 28.8
2 1957 30.3
3 1962 32.0
4 1967 34.0
5 1972 36.1
6 1977 38.4
7 1982 39.9
8 1987 40.8
9 1992 41.7
10 1997 41.8
# ℹ 1,694 more rows
#Exclude all columns between year and lifeExp
select(gapminder, -(year:lifeExp))
# A tibble: 1,704 × 4
country continent pop gdpPercap
<chr> <chr> <dbl> <dbl>
1 Afghanistan Asia 8425333 779.
2 Afghanistan Asia 9240934 821.
3 Afghanistan Asia 10267083 853.
4 Afghanistan Asia 11537966 836.
5 Afghanistan Asia 13079460 740.
6 Afghanistan Asia 14880372 786.
7 Afghanistan Asia 12881816 978.
8 Afghanistan Asia 13867957 852.
9 Afghanistan Asia 16317921 649.
10 Afghanistan Asia 22227415 635.
# ℹ 1,694 more rows
Save your output
In the examples above, the output of the
select()
function is being printed to the screen to demonstrate how it works. Normally, you will want to actually use the selected data later on, so don’t forget to save the output as a variable.# Just prints to screen, can't use data select(gapminder, year, country, pop)
# A tibble: 1,704 × 3 year country pop <dbl> <chr> <dbl> 1 1952 Afghanistan 8425333 2 1957 Afghanistan 9240934 3 1962 Afghanistan 10267083 4 1967 Afghanistan 11537966 5 1972 Afghanistan 13079460 6 1977 Afghanistan 14880372 7 1982 Afghanistan 12881816 8 1987 Afghanistan 13867957 9 1992 Afghanistan 16317921 10 1997 Afghanistan 22227415 # ℹ 1,694 more rows
# Saves output to variable; can be accessed later just_population <- select(gapminder, year, country, pop) just_population
# A tibble: 1,704 × 3 year country pop <dbl> <chr> <dbl> 1 1952 Afghanistan 8425333 2 1957 Afghanistan 9240934 3 1962 Afghanistan 10267083 4 1967 Afghanistan 11537966 5 1972 Afghanistan 13079460 6 1977 Afghanistan 14880372 7 1982 Afghanistan 12881816 8 1987 Afghanistan 13867957 9 1992 Afghanistan 16317921 10 1997 Afghanistan 22227415 # ℹ 1,694 more rows
Challenge 1
Use three different methods to select just the
country
,year
,pop
, andgdpPercap
columns from thegapminder
dataframe.Solution to Challenge 1
#By name select(gapminder, country, year, pop, gdpPercap)
# A tibble: 1,704 × 4 country year pop gdpPercap <chr> <dbl> <dbl> <dbl> 1 Afghanistan 1952 8425333 779. 2 Afghanistan 1957 9240934 821. 3 Afghanistan 1962 10267083 853. 4 Afghanistan 1967 11537966 836. 5 Afghanistan 1972 13079460 740. 6 Afghanistan 1977 14880372 786. 7 Afghanistan 1982 12881816 978. 8 Afghanistan 1987 13867957 852. 9 Afghanistan 1992 16317921 649. 10 Afghanistan 1997 22227415 635. # ℹ 1,694 more rows
#By position select(gapminder, 1, 3, 5, 6)
# A tibble: 1,704 × 4 country year pop gdpPercap <chr> <dbl> <dbl> <dbl> 1 Afghanistan 1952 8425333 779. 2 Afghanistan 1957 9240934 821. 3 Afghanistan 1962 10267083 853. 4 Afghanistan 1967 11537966 836. 5 Afghanistan 1972 13079460 740. 6 Afghanistan 1977 14880372 786. 7 Afghanistan 1982 12881816 978. 8 Afghanistan 1987 13867957 852. 9 Afghanistan 1992 16317921 649. 10 Afghanistan 1997 22227415 635. # ℹ 1,694 more rows
#By exclusion select(gapminder, -continent, -lifeExp)
# A tibble: 1,704 × 4 country year pop gdpPercap <chr> <dbl> <dbl> <dbl> 1 Afghanistan 1952 8425333 779. 2 Afghanistan 1957 9240934 821. 3 Afghanistan 1962 10267083 853. 4 Afghanistan 1967 11537966 836. 5 Afghanistan 1972 13079460 740. 6 Afghanistan 1977 14880372 786. 7 Afghanistan 1982 12881816 978. 8 Afghanistan 1987 13867957 852. 9 Afghanistan 1992 16317921 649. 10 Afghanistan 1997 22227415 635. # ℹ 1,694 more rows
Select helper functions
There are a number of helper functions that can be used to select the correct columns. Some commonly
used ones include starts_with()
, ends_with()
and contains()
, but you can see a full list by
looking at the help file (?tidyselect::select_helpers
).
These helpers do exactly what you would expect from their names, but to see them in action:
#All columns that start with "co"
select(gapminder, starts_with("co"))
# A tibble: 1,704 × 2
country continent
<chr> <chr>
1 Afghanistan Asia
2 Afghanistan Asia
3 Afghanistan Asia
4 Afghanistan Asia
5 Afghanistan Asia
6 Afghanistan Asia
7 Afghanistan Asia
8 Afghanistan Asia
9 Afghanistan Asia
10 Afghanistan Asia
# ℹ 1,694 more rows
#All columns that contain the letter "e"
select(gapminder, contains("e"))
# A tibble: 1,704 × 4
continent year lifeExp gdpPercap
<chr> <dbl> <dbl> <dbl>
1 Asia 1952 28.8 779.
2 Asia 1957 30.3 821.
3 Asia 1962 32.0 853.
4 Asia 1967 34.0 836.
5 Asia 1972 36.1 740.
6 Asia 1977 38.4 786.
7 Asia 1982 39.9 978.
8 Asia 1987 40.8 852.
9 Asia 1992 41.7 649.
10 Asia 1997 41.8 635.
# ℹ 1,694 more rows
Challenge 2
Select all columns from
gapminder
that end with the letter “p”Solution to Challenge 2
select(gapminder, ends_with("p"))
# A tibble: 1,704 × 3 lifeExp pop gdpPercap <dbl> <dbl> <dbl> 1 28.8 8425333 779. 2 30.3 9240934 821. 3 32.0 10267083 853. 4 34.0 11537966 836. 5 36.1 13079460 740. 6 38.4 14880372 786. 7 39.9 12881816 978. 8 40.8 13867957 852. 9 41.7 16317921 649. 10 41.8 22227415 635. # ℹ 1,694 more rows
Challenge 3 (optional)
Run the following code:
select(gapminder, contains("P"))
Does it behave the way you expected? What additional argument do you need to provide to
contains()
to make it return only thegdpPercap
column?Hint
The
contains()
function is actually provided by thetidyselect
package, try?tidyselect::contains
if you are having trouble finding the help page for it.Solution to Challenge 3
select(gapminder, contains("P"))
# A tibble: 1,704 × 3 lifeExp pop gdpPercap <dbl> <dbl> <dbl> 1 28.8 8425333 779. 2 30.3 9240934 821. 3 32.0 10267083 853. 4 34.0 11537966 836. 5 36.1 13079460 740. 6 38.4 14880372 786. 7 39.9 12881816 978. 8 40.8 13867957 852. 9 41.7 16317921 649. 10 41.8 22227415 635. # ℹ 1,694 more rows
select(gapminder, contains("P", ignore.case = F))
# A tibble: 1,704 × 1 gdpPercap <dbl> 1 779. 2 821. 3 853. 4 836. 5 740. 6 786. 7 978. 8 852. 9 649. 10 635. # ℹ 1,694 more rows
rename()
select()
can be used to rename columns while you are selecting them using the form
new_name = old_name
:
select(gapminder, country_name = country, population = pop)
# A tibble: 1,704 × 2
country_name population
<chr> <dbl>
1 Afghanistan 8425333
2 Afghanistan 9240934
3 Afghanistan 10267083
4 Afghanistan 11537966
5 Afghanistan 13079460
6 Afghanistan 14880372
7 Afghanistan 12881816
8 Afghanistan 13867957
9 Afghanistan 16317921
10 Afghanistan 22227415
# ℹ 1,694 more rows
but since it drops all the other columns, it is might not be the outcome you were looking for.
Instead, rename()
can rename columns while retaining the rest of the data frame.
rename(gapminder, country_name = country, population = pop)
# A tibble: 1,704 × 6
country_name continent year lifeExp population gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
This will often be useful to convert column names from your imported data into the consistent naming format you have decided on.
Challenge 4
Rename the
lifeExp
andgdpPercap
columns to thesnake_case
formatSolution to Challenge 4
rename(gapminder, life_exp = lifeExp, gdp_per_cap = gdpPercap)
# A tibble: 1,704 × 6 country continent year life_exp pop gdp_per_cap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ℹ 1,694 more rows
Key Points
Use
select()
to choose variables from a dataframe.Helper functions make it easier to select the correct columns.
Use
rename()
to rename variables without dropping columns.