Creating New Columns
Overview
Teaching: 30 min
Exercises: 15 minQuestions
How can I create new columns of data in my data frame?
How do I deal with a data frame made up of different groups?
Objectives
Create new columns by performing calculations on old variables.
Use functions to create new variables.
So far, we have been looking at functions that deal just with the variables in our data frame. But
what if we are needing to create new variables? For that task, we will use the mutate()
function
from the dplyr package of the tidyverse.
To create a new variable, you provide mutate()
with the name of the new column and how to calculate
the new value.
mutate(gapminder, gdp = gdpPercap * pop)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
# ℹ 1,694 more rows
This line adds a new column to our gapminder data called gdp
and the value in this column is
calculated by multiplying the gdpPercap
and the pop
figure for each row.
Challenge 1
Create a new column in the
gapminder
data frame that contains the population in millions (ie. divide the population column by 1,000,000).Solution to Challenge 1
mutate(gapminder, pop_m = pop/1e6)
# A tibble: 1,704 × 7 country continent year lifeExp pop gdpPercap pop_m <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 8.43 2 Afghanistan Asia 1957 30.3 9240934 821. 9.24 3 Afghanistan Asia 1962 32.0 10267083 853. 10.3 4 Afghanistan Asia 1967 34.0 11537966 836. 11.5 5 Afghanistan Asia 1972 36.1 13079460 740. 13.1 6 Afghanistan Asia 1977 38.4 14880372 786. 14.9 7 Afghanistan Asia 1982 39.9 12881816 978. 12.9 8 Afghanistan Asia 1987 40.8 13867957 852. 13.9 9 Afghanistan Asia 1992 41.7 16317921 649. 16.3 10 Afghanistan Asia 1997 41.8 22227415 635. 22.2 # ℹ 1,694 more rows
You are not limited to mathematical operators in creating new columns. Any function that produces a value for each row can be used:
# Take the logarithm of the population value
mutate(gapminder, log_pop = log(pop))
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap log_pop
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 15.9
2 Afghanistan Asia 1957 30.3 9240934 821. 16.0
3 Afghanistan Asia 1962 32.0 10267083 853. 16.1
4 Afghanistan Asia 1967 34.0 11537966 836. 16.3
5 Afghanistan Asia 1972 36.1 13079460 740. 16.4
6 Afghanistan Asia 1977 38.4 14880372 786. 16.5
7 Afghanistan Asia 1982 39.9 12881816 978. 16.4
8 Afghanistan Asia 1987 40.8 13867957 852. 16.4
9 Afghanistan Asia 1992 41.7 16317921 649. 16.6
10 Afghanistan Asia 1997 41.8 22227415 635. 16.9
# ℹ 1,694 more rows
# Abbreviate the country name
# str_sub() takes a subset of a string from the given coordinates
mutate(gapminder, country_abbr = str_sub(country, start = 1, end = 4))
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap country_abbr
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 Afghanistan Asia 1952 28.8 8425333 779. Afgh
2 Afghanistan Asia 1957 30.3 9240934 821. Afgh
3 Afghanistan Asia 1962 32.0 10267083 853. Afgh
4 Afghanistan Asia 1967 34.0 11537966 836. Afgh
5 Afghanistan Asia 1972 36.1 13079460 740. Afgh
6 Afghanistan Asia 1977 38.4 14880372 786. Afgh
7 Afghanistan Asia 1982 39.9 12881816 978. Afgh
8 Afghanistan Asia 1987 40.8 13867957 852. Afgh
9 Afghanistan Asia 1992 41.7 16317921 649. Afgh
10 Afghanistan Asia 1997 41.8 22227415 635. Afgh
# ℹ 1,694 more rows
Challenge 2
Using the function
str_length()
which returns the length of a string, create a new column that contains the number of letters in the country’s name.Solution to Challenge 2
mutate(gapminder, num_letters = str_length(country))
# A tibble: 1,704 × 7 country continent year lifeExp pop gdpPercap num_letters <chr> <chr> <dbl> <dbl> <dbl> <dbl> <int> 1 Afghanistan Asia 1952 28.8 8425333 779. 11 2 Afghanistan Asia 1957 30.3 9240934 821. 11 3 Afghanistan Asia 1962 32.0 10267083 853. 11 4 Afghanistan Asia 1967 34.0 11537966 836. 11 5 Afghanistan Asia 1972 36.1 13079460 740. 11 6 Afghanistan Asia 1977 38.4 14880372 786. 11 7 Afghanistan Asia 1982 39.9 12881816 978. 11 8 Afghanistan Asia 1987 40.8 13867957 852. 11 9 Afghanistan Asia 1992 41.7 16317921 649. 11 10 Afghanistan Asia 1997 41.8 22227415 635. 11 # ℹ 1,694 more rows
In fact, anything that produces a vector of the same length as the data frame can be used. It does not even have to reference data within the data frame.
# 1704 rows in gapminder
index_numbers <- 1:1704
mutate(gapminder, row_num = index_numbers)
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap row_num
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333 779. 1
2 Afghanistan Asia 1957 30.3 9240934 821. 2
3 Afghanistan Asia 1962 32.0 10267083 853. 3
4 Afghanistan Asia 1967 34.0 11537966 836. 4
5 Afghanistan Asia 1972 36.1 13079460 740. 5
6 Afghanistan Asia 1977 38.4 14880372 786. 6
7 Afghanistan Asia 1982 39.9 12881816 978. 7
8 Afghanistan Asia 1987 40.8 13867957 852. 8
9 Afghanistan Asia 1992 41.7 16317921 649. 9
10 Afghanistan Asia 1997 41.8 22227415 635. 10
# ℹ 1,694 more rows
Multiple mutations
You can perform multiple mutations at the same time by separating them with a comma.
mutate(gapminder, gdp = gdpPercap * pop, log_pop = log(pop))
# A tibble: 1,704 × 8
country continent year lifeExp pop gdpPercap gdp log_pop
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 15.9
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 16.0
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 16.1
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 16.3
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 16.4
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 16.5
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 16.4
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 16.4
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 16.6
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. 16.9
# ℹ 1,694 more rows
One useful feature is that you can refer to your created columns later on in the same
mutate()
call.
mutate(
gapminder,
gdp = gdpPercap * pop,
log_gdp = log(gdp)
)
# A tibble: 1,704 × 8
country continent year lifeExp pop gdpPercap gdp log_gdp
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330. 22.6
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670. 22.7
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797. 22.9
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150. 23.0
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274. 23.0
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231. 23.2
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401. 23.3
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309. 23.2
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589. 23.1
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875. 23.4
# ℹ 1,694 more rows
Challenge 3
Create a column in the gapminder data showing the life expectancy in days and one for GDP in billions of dollars
Solution to Challenge 3
mutate( gapminder, life_exp_days = lifeExp * 365, gdp_billion = gdpPercap * pop / 10^9 )
# A tibble: 1,704 × 8 country continent year lifeExp pop gdpPercap life_exp_days gdp_billion <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghanist… Asia 1952 28.8 8.43e6 779. 10512. 6.57 2 Afghanist… Asia 1957 30.3 9.24e6 821. 11071. 7.59 3 Afghanist… Asia 1962 32.0 1.03e7 853. 11679. 8.76 4 Afghanist… Asia 1967 34.0 1.15e7 836. 12417. 9.65 5 Afghanist… Asia 1972 36.1 1.31e7 740. 13172. 9.68 6 Afghanist… Asia 1977 38.4 1.49e7 786. 14030. 11.7 7 Afghanist… Asia 1982 39.9 1.29e7 978. 14547. 12.6 8 Afghanist… Asia 1987 40.8 1.39e7 852. 14900. 11.8 9 Afghanist… Asia 1992 41.7 1.63e7 649. 15211. 10.6 10 Afghanist… Asia 1997 41.8 2.22e7 635. 15243. 14.1 # ℹ 1,694 more rows
Variable creation functions
As mentioned before, any function that can take a vector of inputs and return a vector with the same
length as an output can be used in a mutate()
call. There are many R functions that can be used
in this situation, and dplyr
introduces a number of new functions that may be useful for creating
new variables. Some functions that you might find useful are:
Offset functions
To calculate differences between observations, you may be wanting to look at the value immediately
before or after it. These can be accessed using lag()
or lead()
respectively.
mutate(gapminder, life_exp_prev = lag(lifeExp), life_exp_next = lead(lifeExp))
# A tibble: 1,704 × 8
country continent year lifeExp pop gdpPercap life_exp_prev life_exp_next
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghani… Asia 1952 28.8 8.43e6 779. NA 30.3
2 Afghani… Asia 1957 30.3 9.24e6 821. 28.8 32.0
3 Afghani… Asia 1962 32.0 1.03e7 853. 30.3 34.0
4 Afghani… Asia 1967 34.0 1.15e7 836. 32.0 36.1
5 Afghani… Asia 1972 36.1 1.31e7 740. 34.0 38.4
6 Afghani… Asia 1977 38.4 1.49e7 786. 36.1 39.9
7 Afghani… Asia 1982 39.9 1.29e7 978. 38.4 40.8
8 Afghani… Asia 1987 40.8 1.39e7 852. 39.9 41.7
9 Afghani… Asia 1992 41.7 1.63e7 649. 40.8 41.8
10 Afghani… Asia 1997 41.8 2.22e7 635. 41.7 42.1
# ℹ 1,694 more rows
Cumulative computations
Cumulative sums (cumsum()
), products (cumprod()
) and means (cummean()
) are all available to
provide running computation of a variable. While not particularly useful for the gapminder data,
you can get an idea of how they function:
mutate(gapminder, cumulative_life_exp = cumsum(lifeExp))
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap cumulative_life_exp
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 28.8
2 Afghanistan Asia 1957 30.3 9240934 821. 59.1
3 Afghanistan Asia 1962 32.0 10267083 853. 91.1
4 Afghanistan Asia 1967 34.0 11537966 836. 125.
5 Afghanistan Asia 1972 36.1 13079460 740. 161.
6 Afghanistan Asia 1977 38.4 14880372 786. 200.
7 Afghanistan Asia 1982 39.9 12881816 978. 240.
8 Afghanistan Asia 1987 40.8 13867957 852. 280.
9 Afghanistan Asia 1992 41.7 16317921 649. 322.
10 Afghanistan Asia 1997 41.8 22227415 635. 364.
# ℹ 1,694 more rows
Challenge 4
Using an offset function, create a column showing the change in life expectancy between each sample period. Do any countries have no change in life expectancy for a period?
Solution to Challenge 4
diff_lifeExp <- mutate(gapminder, life_exp_change = lifeExp - lag(lifeExp)) filter(diff_lifeExp, life_exp_change == 0)
# A tibble: 1 × 7 country continent year lifeExp pop gdpPercap life_exp_change <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Romania Europe 1967 66.8 19284814 6471. 0
Key Points
Use
mutate()
to create new variables from old ones.You can create new variables using any function that returns a vector of the same length as the data frame.
Use
group_by()
to group your data based on a variable.