Creating New Columns

Overview

Teaching: 30 min
Exercises: 15 min

Questions

How can I create new columns of data in my data frame?

How do I deal with a data frame made up of different groups?

Objectives

Create new columns by performing calculations on old variables.

Use functions to create new variables.

So far, we have been looking at functions that deal just with the variables in our data frame. But what if we are needing to create new variables? For that task, we will use the mutate() function from the dplyr package of the tidyverse.

To create a new variable, you provide mutate() with the name of the new column and how to calculate the new value.

mutate(gapminder, gdp = gdpPercap * pop)

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          gdp
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>        <dbl>
Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows

This line adds a new column to our gapminder data called gdp and the value in this column is calculated by multiplying the gdpPercap and the pop figure for each row.

Challenge 1

Create a new column in the gapminder data frame that contains the population in millions (ie. divide the population column by 1,000,000).

Solution to Challenge 1

mutate(gapminder, pop_m = pop/1e6)

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap pop_m
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl> <dbl>
Afghanistan Asia       1952    28.8  8425333      779.  8.43
Afghanistan Asia       1957    30.3  9240934      821.  9.24
Afghanistan Asia       1962    32.0 10267083      853. 10.3 
Afghanistan Asia       1967    34.0 11537966      836. 11.5 
Afghanistan Asia       1972    36.1 13079460      740. 13.1 
Afghanistan Asia       1977    38.4 14880372      786. 14.9 
Afghanistan Asia       1982    39.9 12881816      978. 12.9 
Afghanistan Asia       1987    40.8 13867957      852. 13.9 
Afghanistan Asia       1992    41.7 16317921      649. 16.3 
Afghanistan Asia       1997    41.8 22227415      635. 22.2 
# ℹ 1,694 more rows

You are not limited to mathematical operators in creating new columns. Any function that produces a value for each row can be used:

# Take the logarithm of the population value
mutate(gapminder, log_pop = log(pop))

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap log_pop
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <dbl>
Afghanistan Asia       1952    28.8  8425333      779.    15.9
Afghanistan Asia       1957    30.3  9240934      821.    16.0
Afghanistan Asia       1962    32.0 10267083      853.    16.1
Afghanistan Asia       1967    34.0 11537966      836.    16.3
Afghanistan Asia       1972    36.1 13079460      740.    16.4
Afghanistan Asia       1977    38.4 14880372      786.    16.5
Afghanistan Asia       1982    39.9 12881816      978.    16.4
Afghanistan Asia       1987    40.8 13867957      852.    16.4
Afghanistan Asia       1992    41.7 16317921      649.    16.6
Afghanistan Asia       1997    41.8 22227415      635.    16.9
# ℹ 1,694 more rows

# Abbreviate the country name
# str_sub() takes a subset of a string from the given coordinates
mutate(gapminder, country_abbr = str_sub(country, start = 1, end = 4))

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap country_abbr
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl> <chr>       
Afghanistan Asia       1952    28.8  8425333      779. Afgh        
Afghanistan Asia       1957    30.3  9240934      821. Afgh        
Afghanistan Asia       1962    32.0 10267083      853. Afgh        
Afghanistan Asia       1967    34.0 11537966      836. Afgh        
Afghanistan Asia       1972    36.1 13079460      740. Afgh        
Afghanistan Asia       1977    38.4 14880372      786. Afgh        
Afghanistan Asia       1982    39.9 12881816      978. Afgh        
Afghanistan Asia       1987    40.8 13867957      852. Afgh        
Afghanistan Asia       1992    41.7 16317921      649. Afgh        
Afghanistan Asia       1997    41.8 22227415      635. Afgh        
# ℹ 1,694 more rows

Challenge 2

Using the function str_length() which returns the length of a string, create a new column that contains the number of letters in the country’s name.

Solution to Challenge 2

mutate(gapminder, num_letters = str_length(country))

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap num_letters
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>       <int>
Afghanistan Asia       1952    28.8  8425333      779.          11
Afghanistan Asia       1957    30.3  9240934      821.          11
Afghanistan Asia       1962    32.0 10267083      853.          11
Afghanistan Asia       1967    34.0 11537966      836.          11
Afghanistan Asia       1972    36.1 13079460      740.          11
Afghanistan Asia       1977    38.4 14880372      786.          11
Afghanistan Asia       1982    39.9 12881816      978.          11
Afghanistan Asia       1987    40.8 13867957      852.          11
Afghanistan Asia       1992    41.7 16317921      649.          11
Afghanistan Asia       1997    41.8 22227415      635.          11
# ℹ 1,694 more rows

In fact, anything that produces a vector of the same length as the data frame can be used. It does not even have to reference data within the data frame.

# 1704 rows in gapminder
index_numbers <- 1:1704

mutate(gapminder, row_num = index_numbers)

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap row_num
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>   <int>
Afghanistan Asia       1952    28.8  8425333      779.       1
Afghanistan Asia       1957    30.3  9240934      821.       2
Afghanistan Asia       1962    32.0 10267083      853.       3
Afghanistan Asia       1967    34.0 11537966      836.       4
Afghanistan Asia       1972    36.1 13079460      740.       5
Afghanistan Asia       1977    38.4 14880372      786.       6
Afghanistan Asia       1982    39.9 12881816      978.       7
Afghanistan Asia       1987    40.8 13867957      852.       8
Afghanistan Asia       1992    41.7 16317921      649.       9
Afghanistan Asia       1997    41.8 22227415      635.      10
# ℹ 1,694 more rows

Multiple mutations

You can perform multiple mutations at the same time by separating them with a comma.

mutate(gapminder, gdp = gdpPercap * pop, log_pop = log(pop))

# A tibble: 1,704 × 8
   country     continent  year lifeExp      pop gdpPercap          gdp log_pop
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>        <dbl>   <dbl>
Afghanistan Asia       1952    28.8  8425333      779.  6567086330.    15.9
Afghanistan Asia       1957    30.3  9240934      821.  7585448670.    16.0
Afghanistan Asia       1962    32.0 10267083      853.  8758855797.    16.1
Afghanistan Asia       1967    34.0 11537966      836.  9648014150.    16.3
Afghanistan Asia       1972    36.1 13079460      740.  9678553274.    16.4
Afghanistan Asia       1977    38.4 14880372      786. 11697659231.    16.5
Afghanistan Asia       1982    39.9 12881816      978. 12598563401.    16.4
Afghanistan Asia       1987    40.8 13867957      852. 11820990309.    16.4
Afghanistan Asia       1992    41.7 16317921      649. 10595901589.    16.6
Afghanistan Asia       1997    41.8 22227415      635. 14121995875.    16.9
# ℹ 1,694 more rows

One useful feature is that you can refer to your created columns later on in the same mutate() call.

mutate(
  gapminder,
  gdp = gdpPercap * pop,
  log_gdp = log(gdp)
)

# A tibble: 1,704 × 8
   country     continent  year lifeExp      pop gdpPercap          gdp log_gdp
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>        <dbl>   <dbl>
Afghanistan Asia       1952    28.8  8425333      779.  6567086330.    22.6
Afghanistan Asia       1957    30.3  9240934      821.  7585448670.    22.7
Afghanistan Asia       1962    32.0 10267083      853.  8758855797.    22.9
Afghanistan Asia       1967    34.0 11537966      836.  9648014150.    23.0
Afghanistan Asia       1972    36.1 13079460      740.  9678553274.    23.0
Afghanistan Asia       1977    38.4 14880372      786. 11697659231.    23.2
Afghanistan Asia       1982    39.9 12881816      978. 12598563401.    23.3
Afghanistan Asia       1987    40.8 13867957      852. 11820990309.    23.2
Afghanistan Asia       1992    41.7 16317921      649. 10595901589.    23.1
Afghanistan Asia       1997    41.8 22227415      635. 14121995875.    23.4
# ℹ 1,694 more rows

Challenge 3

Create a column in the gapminder data showing the life expectancy in days and one for GDP in billions of dollars

Solution to Challenge 3

mutate(
  gapminder,
    life_exp_days = lifeExp * 365,
    gdp_billion = gdpPercap * pop / 10^9
)

# A tibble: 1,704 × 8
   country    continent  year lifeExp    pop gdpPercap life_exp_days gdp_billion
   <chr>      <chr>     <dbl>   <dbl>  <dbl>     <dbl>         <dbl>       <dbl>
Afghanist… Asia       1952    28.8 8.43e6      779.        10512.        6.57
Afghanist… Asia       1957    30.3 9.24e6      821.        11071.        7.59
Afghanist… Asia       1962    32.0 1.03e7      853.        11679.        8.76
Afghanist… Asia       1967    34.0 1.15e7      836.        12417.        9.65
Afghanist… Asia       1972    36.1 1.31e7      740.        13172.        9.68
Afghanist… Asia       1977    38.4 1.49e7      786.        14030.       11.7 
Afghanist… Asia       1982    39.9 1.29e7      978.        14547.       12.6 
Afghanist… Asia       1987    40.8 1.39e7      852.        14900.       11.8 
Afghanist… Asia       1992    41.7 1.63e7      649.        15211.       10.6 
Afghanist… Asia       1997    41.8 2.22e7      635.        15243.       14.1 
# ℹ 1,694 more rows

Variable creation functions

As mentioned before, any function that can take a vector of inputs and return a vector with the same length as an output can be used in a mutate() call. There are many R functions that can be used in this situation, and dplyr introduces a number of new functions that may be useful for creating new variables. Some functions that you might find useful are:

Offset functions

To calculate differences between observations, you may be wanting to look at the value immediately before or after it. These can be accessed using lag() or lead() respectively.

mutate(gapminder, life_exp_prev = lag(lifeExp), life_exp_next = lead(lifeExp))

# A tibble: 1,704 × 8
   country  continent  year lifeExp    pop gdpPercap life_exp_prev life_exp_next
   <chr>    <chr>     <dbl>   <dbl>  <dbl>     <dbl>         <dbl>         <dbl>
Afghani… Asia       1952    28.8 8.43e6      779.          NA            30.3
Afghani… Asia       1957    30.3 9.24e6      821.          28.8          32.0
Afghani… Asia       1962    32.0 1.03e7      853.          30.3          34.0
Afghani… Asia       1967    34.0 1.15e7      836.          32.0          36.1
Afghani… Asia       1972    36.1 1.31e7      740.          34.0          38.4
Afghani… Asia       1977    38.4 1.49e7      786.          36.1          39.9
Afghani… Asia       1982    39.9 1.29e7      978.          38.4          40.8
Afghani… Asia       1987    40.8 1.39e7      852.          39.9          41.7
Afghani… Asia       1992    41.7 1.63e7      649.          40.8          41.8
Afghani… Asia       1997    41.8 2.22e7      635.          41.7          42.1
# ℹ 1,694 more rows

Cumulative computations

Cumulative sums (cumsum()), products (cumprod()) and means (cummean()) are all available to provide running computation of a variable. While not particularly useful for the gapminder data, you can get an idea of how they function:

mutate(gapminder, cumulative_life_exp = cumsum(lifeExp))

# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap cumulative_life_exp
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>               <dbl>
Afghanistan Asia       1952    28.8  8425333      779.                28.8
Afghanistan Asia       1957    30.3  9240934      821.                59.1
Afghanistan Asia       1962    32.0 10267083      853.                91.1
Afghanistan Asia       1967    34.0 11537966      836.               125. 
Afghanistan Asia       1972    36.1 13079460      740.               161. 
Afghanistan Asia       1977    38.4 14880372      786.               200. 
Afghanistan Asia       1982    39.9 12881816      978.               240. 
Afghanistan Asia       1987    40.8 13867957      852.               280. 
Afghanistan Asia       1992    41.7 16317921      649.               322. 
Afghanistan Asia       1997    41.8 22227415      635.               364. 
# ℹ 1,694 more rows

Challenge 4

Using an offset function, create a column showing the change in life expectancy between each sample period. Do any countries have no change in life expectancy for a period?
Solution to Challenge 4
diff_lifeExp <- mutate(gapminder, life_exp_change = lifeExp - lag(lifeExp))  

filter(diff_lifeExp, life_exp_change == 0)
# A tibble: 1 × 7
  country continent  year lifeExp      pop gdpPercap life_exp_change
  <chr>   <chr>     <dbl>   <dbl>    <dbl>     <dbl>           <dbl>
1 Romania Europe     1967    66.8 19284814     6471.               0

Key Points

Use mutate() to create new variables from old ones.

You can create new variables using any function that returns a vector of the same length as the data frame.

Use group_by() to group your data based on a variable.

previous episode

Introduction to R

next episode