Selecting columns

Overview

Teaching: 15 min
Exercises: 10 min
Questions
  • How can I select specific columns from a data frame to work with?

Objectives
  • Select columns using several methods

  • Rename columns to make them easier to work with

The first task we will cover is extracting specific columns from a dataset to work with. This may be needed if your data is very, very wide and you are only interested in a few columns. The gapminder data doesn’t exactly meet that description, but we will continue using it to understand how the functions work.

gapminder <- read_csv("data/gapminder.csv")

select()

You can specify columns to keep from a data frame using select(). This function, as well as many others we will be learning today are found in the dplyr package of the tidyverse. At it’s simplest, you provide select() with a data frame and the column names you would like to keep.

#Select the year, country and pop columns
select(gapminder, year, country, pop)
# A tibble: 1,704 × 3
    year country          pop
   <dbl> <chr>          <dbl>
 1  1952 Afghanistan  8425333
 2  1957 Afghanistan  9240934
 3  1962 Afghanistan 10267083
 4  1967 Afghanistan 11537966
 5  1972 Afghanistan 13079460
 6  1977 Afghanistan 14880372
 7  1982 Afghanistan 12881816
 8  1987 Afghanistan 13867957
 9  1992 Afghanistan 16317921
10  1997 Afghanistan 22227415
# ℹ 1,694 more rows

Here you can see that the select() produces a new data frame with just the columns you provided. It also keeps them in the order you specified.

You can also select columns by index position:

#Select the first, third and fourth columns
select(gapminder, 1, 3, 4)
# A tibble: 1,704 × 3
   country      year lifeExp
   <chr>       <dbl>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ 1,694 more rows

or you can select columns to exclude by using negation:

#Remove lifeExp and pop columns
select(gapminder, -lifeExp, -pop)
# A tibble: 1,704 × 4
   country     continent  year gdpPercap
   <chr>       <chr>     <dbl>     <dbl>
 1 Afghanistan Asia       1952      779.
 2 Afghanistan Asia       1957      821.
 3 Afghanistan Asia       1962      853.
 4 Afghanistan Asia       1967      836.
 5 Afghanistan Asia       1972      740.
 6 Afghanistan Asia       1977      786.
 7 Afghanistan Asia       1982      978.
 8 Afghanistan Asia       1987      852.
 9 Afghanistan Asia       1992      649.
10 Afghanistan Asia       1997      635.
# ℹ 1,694 more rows

If the columns you want are all together in the original data frame, you can select them as a range

#Select all columns between year and lifeExp
select(gapminder, year:lifeExp)
# A tibble: 1,704 × 2
    year lifeExp
   <dbl>   <dbl>
 1  1952    28.8
 2  1957    30.3
 3  1962    32.0
 4  1967    34.0
 5  1972    36.1
 6  1977    38.4
 7  1982    39.9
 8  1987    40.8
 9  1992    41.7
10  1997    41.8
# ℹ 1,694 more rows
#Exclude all columns between year and lifeExp
select(gapminder, -(year:lifeExp))
# A tibble: 1,704 × 4
   country     continent      pop gdpPercap
   <chr>       <chr>        <dbl>     <dbl>
 1 Afghanistan Asia       8425333      779.
 2 Afghanistan Asia       9240934      821.
 3 Afghanistan Asia      10267083      853.
 4 Afghanistan Asia      11537966      836.
 5 Afghanistan Asia      13079460      740.
 6 Afghanistan Asia      14880372      786.
 7 Afghanistan Asia      12881816      978.
 8 Afghanistan Asia      13867957      852.
 9 Afghanistan Asia      16317921      649.
10 Afghanistan Asia      22227415      635.
# ℹ 1,694 more rows

Save your output

In the examples above, the output of the select() function is being printed to the screen to demonstrate how it works. Normally, you will want to actually use the selected data later on, so don’t forget to save the output as a variable.

# Just prints to screen, can't use data
select(gapminder, year, country, pop)
# A tibble: 1,704 × 3
    year country          pop
   <dbl> <chr>          <dbl>
 1  1952 Afghanistan  8425333
 2  1957 Afghanistan  9240934
 3  1962 Afghanistan 10267083
 4  1967 Afghanistan 11537966
 5  1972 Afghanistan 13079460
 6  1977 Afghanistan 14880372
 7  1982 Afghanistan 12881816
 8  1987 Afghanistan 13867957
 9  1992 Afghanistan 16317921
10  1997 Afghanistan 22227415
# ℹ 1,694 more rows
# Saves output to variable; can be accessed later
just_population <- select(gapminder, year, country, pop)

just_population
# A tibble: 1,704 × 3
    year country          pop
   <dbl> <chr>          <dbl>
 1  1952 Afghanistan  8425333
 2  1957 Afghanistan  9240934
 3  1962 Afghanistan 10267083
 4  1967 Afghanistan 11537966
 5  1972 Afghanistan 13079460
 6  1977 Afghanistan 14880372
 7  1982 Afghanistan 12881816
 8  1987 Afghanistan 13867957
 9  1992 Afghanistan 16317921
10  1997 Afghanistan 22227415
# ℹ 1,694 more rows

Challenge 1

Use three different methods to select just the country, year, pop, and gdpPercap columns from the gapminder dataframe.

Solution to Challenge 1

#By name
select(gapminder, country, year, pop, gdpPercap)
# A tibble: 1,704 × 4
   country      year      pop gdpPercap
   <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan  1952  8425333      779.
 2 Afghanistan  1957  9240934      821.
 3 Afghanistan  1962 10267083      853.
 4 Afghanistan  1967 11537966      836.
 5 Afghanistan  1972 13079460      740.
 6 Afghanistan  1977 14880372      786.
 7 Afghanistan  1982 12881816      978.
 8 Afghanistan  1987 13867957      852.
 9 Afghanistan  1992 16317921      649.
10 Afghanistan  1997 22227415      635.
# ℹ 1,694 more rows
#By position
select(gapminder, 1, 3, 5, 6)
# A tibble: 1,704 × 4
   country      year      pop gdpPercap
   <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan  1952  8425333      779.
 2 Afghanistan  1957  9240934      821.
 3 Afghanistan  1962 10267083      853.
 4 Afghanistan  1967 11537966      836.
 5 Afghanistan  1972 13079460      740.
 6 Afghanistan  1977 14880372      786.
 7 Afghanistan  1982 12881816      978.
 8 Afghanistan  1987 13867957      852.
 9 Afghanistan  1992 16317921      649.
10 Afghanistan  1997 22227415      635.
# ℹ 1,694 more rows
#By exclusion
select(gapminder, -continent, -lifeExp)
# A tibble: 1,704 × 4
   country      year      pop gdpPercap
   <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan  1952  8425333      779.
 2 Afghanistan  1957  9240934      821.
 3 Afghanistan  1962 10267083      853.
 4 Afghanistan  1967 11537966      836.
 5 Afghanistan  1972 13079460      740.
 6 Afghanistan  1977 14880372      786.
 7 Afghanistan  1982 12881816      978.
 8 Afghanistan  1987 13867957      852.
 9 Afghanistan  1992 16317921      649.
10 Afghanistan  1997 22227415      635.
# ℹ 1,694 more rows

Select helper functions

There are a number of helper functions that can be used to select the correct columns. Some commonly used ones include starts_with(), ends_with() and contains(), but you can see a full list by looking at the help file (?tidyselect::select_helpers).

These helpers do exactly what you would expect from their names, but to see them in action:

#All columns that start with "co"
select(gapminder, starts_with("co"))
# A tibble: 1,704 × 2
   country     continent
   <chr>       <chr>    
 1 Afghanistan Asia     
 2 Afghanistan Asia     
 3 Afghanistan Asia     
 4 Afghanistan Asia     
 5 Afghanistan Asia     
 6 Afghanistan Asia     
 7 Afghanistan Asia     
 8 Afghanistan Asia     
 9 Afghanistan Asia     
10 Afghanistan Asia     
# ℹ 1,694 more rows
#All columns that contain the letter "e"
select(gapminder, contains("e"))
# A tibble: 1,704 × 4
   continent  year lifeExp gdpPercap
   <chr>     <dbl>   <dbl>     <dbl>
 1 Asia       1952    28.8      779.
 2 Asia       1957    30.3      821.
 3 Asia       1962    32.0      853.
 4 Asia       1967    34.0      836.
 5 Asia       1972    36.1      740.
 6 Asia       1977    38.4      786.
 7 Asia       1982    39.9      978.
 8 Asia       1987    40.8      852.
 9 Asia       1992    41.7      649.
10 Asia       1997    41.8      635.
# ℹ 1,694 more rows

Challenge 2

Select all columns from gapminder that end with the letter “p”

Solution to Challenge 2

select(gapminder, ends_with("p"))
# A tibble: 1,704 × 3
   lifeExp      pop gdpPercap
     <dbl>    <dbl>     <dbl>
 1    28.8  8425333      779.
 2    30.3  9240934      821.
 3    32.0 10267083      853.
 4    34.0 11537966      836.
 5    36.1 13079460      740.
 6    38.4 14880372      786.
 7    39.9 12881816      978.
 8    40.8 13867957      852.
 9    41.7 16317921      649.
10    41.8 22227415      635.
# ℹ 1,694 more rows

Challenge 3 (optional)

Run the following code:

select(gapminder, contains("P"))

Does it behave the way you expected? What additional argument do you need to provide to contains() to make it return only the gdpPercap column?

Hint

The contains() function is actually provided by the tidyselect package, try ?tidyselect::contains if you are having trouble finding the help page for it.

Solution to Challenge 3

select(gapminder, contains("P"))
# A tibble: 1,704 × 3
   lifeExp      pop gdpPercap
     <dbl>    <dbl>     <dbl>
 1    28.8  8425333      779.
 2    30.3  9240934      821.
 3    32.0 10267083      853.
 4    34.0 11537966      836.
 5    36.1 13079460      740.
 6    38.4 14880372      786.
 7    39.9 12881816      978.
 8    40.8 13867957      852.
 9    41.7 16317921      649.
10    41.8 22227415      635.
# ℹ 1,694 more rows
select(gapminder, contains("P", ignore.case = F))
# A tibble: 1,704 × 1
   gdpPercap
       <dbl>
 1      779.
 2      821.
 3      853.
 4      836.
 5      740.
 6      786.
 7      978.
 8      852.
 9      649.
10      635.
# ℹ 1,694 more rows

rename()

select() can be used to rename columns while you are selecting them using the form new_name = old_name:

select(gapminder, country_name = country, population = pop)
# A tibble: 1,704 × 2
   country_name population
   <chr>             <dbl>
 1 Afghanistan     8425333
 2 Afghanistan     9240934
 3 Afghanistan    10267083
 4 Afghanistan    11537966
 5 Afghanistan    13079460
 6 Afghanistan    14880372
 7 Afghanistan    12881816
 8 Afghanistan    13867957
 9 Afghanistan    16317921
10 Afghanistan    22227415
# ℹ 1,694 more rows

but since it drops all the other columns, it is might not be the outcome you were looking for. Instead, rename() can rename columns while retaining the rest of the data frame.

rename(gapminder, country_name = country, population = pop)
# A tibble: 1,704 × 6
   country_name continent  year lifeExp population gdpPercap
   <chr>        <chr>     <dbl>   <dbl>      <dbl>     <dbl>
 1 Afghanistan  Asia       1952    28.8    8425333      779.
 2 Afghanistan  Asia       1957    30.3    9240934      821.
 3 Afghanistan  Asia       1962    32.0   10267083      853.
 4 Afghanistan  Asia       1967    34.0   11537966      836.
 5 Afghanistan  Asia       1972    36.1   13079460      740.
 6 Afghanistan  Asia       1977    38.4   14880372      786.
 7 Afghanistan  Asia       1982    39.9   12881816      978.
 8 Afghanistan  Asia       1987    40.8   13867957      852.
 9 Afghanistan  Asia       1992    41.7   16317921      649.
10 Afghanistan  Asia       1997    41.8   22227415      635.
# ℹ 1,694 more rows

This will often be useful to convert column names from your imported data into the consistent naming format you have decided on.

Challenge 4

Rename the lifeExp and gdpPercap columns to the snake_case format

Solution to Challenge 4

rename(gapminder, life_exp = lifeExp, gdp_per_cap = gdpPercap)
# A tibble: 1,704 × 6
   country     continent  year life_exp      pop gdp_per_cap
   <chr>       <chr>     <dbl>    <dbl>    <dbl>       <dbl>
 1 Afghanistan Asia       1952     28.8  8425333        779.
 2 Afghanistan Asia       1957     30.3  9240934        821.
 3 Afghanistan Asia       1962     32.0 10267083        853.
 4 Afghanistan Asia       1967     34.0 11537966        836.
 5 Afghanistan Asia       1972     36.1 13079460        740.
 6 Afghanistan Asia       1977     38.4 14880372        786.
 7 Afghanistan Asia       1982     39.9 12881816        978.
 8 Afghanistan Asia       1987     40.8 13867957        852.
 9 Afghanistan Asia       1992     41.7 16317921        649.
10 Afghanistan Asia       1997     41.8 22227415        635.
# ℹ 1,694 more rows

Key Points

  • Use select() to choose variables from a dataframe.

  • Helper functions make it easier to select the correct columns.

  • Use rename() to rename variables without dropping columns.