Summarise and Grouping
Overview
Teaching: 20 min
Exercises: 20 min
Questions
How can I make summaries of groups of data?
How can I count the elements of groups in a dataset?
Objectives
Identify when grouping is necessary to make summaries
Be able to usefully summarise variables in a data frame
The summarise() function lets us create new variables by collapsing a data frame into a single row
of summary statistics. You can use summarise() with any function that takes a vector as input and
returns a single value as output. For example, what is the average life expectancy in the gapminder
data?
summarise(gapminder, mean_life_exp = mean(lifeExp))
# A tibble: 1 × 1
mean_life_exp
<dbl>
1 59.5
On its own, this may not seem that exciting. You could just as easily get the same result by using
mean(gapminder$lifeExp). Where it becomes more useful, however, is that you can use multiple
summary functions at the same time.
summarise(
gapminder,
mean_life_exp = mean(lifeExp),
sd_life_exp = sd(lifeExp),
mean_gdp_per_cap = mean(gdpPercap),
max_gdp_per_cap = max(gdpPercap)
)
# A tibble: 1 × 4
mean_life_exp sd_life_exp mean_gdp_per_cap max_gdp_per_cap
<dbl> <dbl> <dbl> <dbl>
1 59.5 12.9 7215. 113523.
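Because summarise() accepts any function that takes a vector and returns a single value, functions you write yourself work too. The following is a minimal sketch, assuming dplyr is installed; the small data frame toy and the helper value_range() are made up for illustration and are not part of the gapminder data.

```r
library(dplyr)

# Hypothetical toy data, standing in for a real dataset
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# Hypothetical helper: takes a vector and returns a single number
value_range <- function(x) {
  max(x) - min(x)
}

# Built-in and user-written summary functions can be mixed freely
toy_summary <- summarise(
  toy,
  mean_value = mean(value),
  range_value = value_range(value)
)
toy_summary
```

The result has one row and two columns, one for each summary that was requested.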
Challenge 1
Calculate the mean and median population for the gapminder data
Solution to Challenge 1
summarise(gapminder, mean_pop = mean(pop), median_pop = median(pop))
# A tibble: 1 × 2
mean_pop median_pop
<dbl> <dbl>
1 29601212. 7023596.
You can also get summaries for different groups by using summarise() in conjunction with group_by().
gapminder_by_country <- group_by(gapminder, country)
summarise(gapminder_by_country, mean_life_exp = mean(lifeExp))
# A tibble: 142 × 2
country mean_life_exp
<chr> <dbl>
1 Afghanistan 37.5
2 Albania 68.4
3 Algeria 59.0
4 Angola 37.9
5 Argentina 69.1
6 Australia 74.7
7 Austria 73.1
8 Bahrain 65.6
9 Bangladesh 49.8
10 Belgium 73.6
# ℹ 132 more rows
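It can help to see what group_by() does on its own before summarise() is applied. The sketch below assumes dplyr is installed and uses a made-up data frame called toy, not the gapminder data. Grouping leaves every row in place and only records which column defines the groups; summarise() is what collapses each group to one row.

```r
library(dplyr)

# Hypothetical toy data: two groups with two observations each
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# group_by() keeps every row; it only records the grouping
toy_by_group <- group_by(toy, group)
nrow(toy_by_group)        # same number of rows as toy
group_vars(toy_by_group)  # the column(s) used for grouping

# summarise() then returns one row per group
group_means <- summarise(toy_by_group, mean_value = mean(value))
group_means
```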
Challenge 2
Adjust your answer to Challenge 1 to show the mean and median population for each continent.
Solution to Challenge 2
gapminder_by_continent <- group_by(gapminder, continent)
summarise(gapminder_by_continent, mean_pop = mean(pop), median_pop = median(pop))
# A tibble: 5 × 3
continent mean_pop median_pop
<chr> <dbl> <dbl>
1 Africa 9916003. 4579311
2 Americas 24504795. 6227510
3 Asia 77038722. 14530830.
4 Europe 17169765. 8551125
5 Oceania 8874672. 6403492.
Sorting your results
If you need to sort your resulting data frame by a particular variable, use arrange(). This
function takes a data frame and a set of column names and rearranges the rows so that the
specified columns are sorted in ascending order.
arrange(gapminder, gdpPercap)
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Congo, Dem. Rep. Africa 2002 45.0 55379852 241.
2 Congo, Dem. Rep. Africa 2007 46.5 64606759 278.
3 Lesotho Africa 1952 42.1 748747 299.
4 Guinea-Bissau Africa 1952 32.5 580653 300.
5 Congo, Dem. Rep. Africa 1997 42.6 47798986 312.
6 Eritrea Africa 1952 35.9 1438760 329.
7 Myanmar Asia 1952 36.3 20092996 331
8 Lesotho Africa 1957 45.0 813338 336.
9 Burundi Africa 1952 39.0 2445618 339.
10 Eritrea Africa 1957 38.0 1542611 344.
# ℹ 1,694 more rows
# Use desc() to sort from highest to lowest
arrange(gapminder, desc(gdpPercap))
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Kuwait Asia 1957 58.0 212846 113523.
2 Kuwait Asia 1972 67.7 841934 109348.
3 Kuwait Asia 1952 55.6 160000 108382.
4 Kuwait Asia 1962 60.5 358266 95458.
5 Kuwait Asia 1967 64.6 575003 80895.
6 Kuwait Asia 1977 69.3 1140357 59265.
7 Norway Europe 2007 80.2 4627926 49357.
8 Kuwait Asia 2007 77.6 2505559 47307.
9 Singapore Asia 2007 80.0 4553009 47143.
10 Norway Europe 2002 79.0 4535591 44684.
# ℹ 1,694 more rows
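arrange() accepts more than one column, in which case later columns break ties in earlier ones, and desc() can be applied to any of them individually. A minimal sketch, assuming dplyr is installed and using a made-up data frame called toy:

```r
library(dplyr)

# Hypothetical toy data, deliberately out of order
toy <- data.frame(
  group = c("b", "a", "b", "a"),
  value = c(10, 3, 20, 1)
)

# Sort by group first, then by value (highest first) within each group
sorted <- arrange(toy, group, desc(value))
sorted
```

The rows for group "a" come first, and within each group the larger value is listed before the smaller one.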
Challenge 3
Calculate the average life expectancy per country. Which has the shortest average life expectancy and which has the longest average life expectancy?
Solution to Challenge 3
summarised_life_exp <- summarise(gapminder_by_country, mean_life_exp = mean(lifeExp))
arrange(summarised_life_exp, mean_life_exp)
# A tibble: 142 × 2
country mean_life_exp
<chr> <dbl>
1 Sierra Leone 36.8
2 Afghanistan 37.5
3 Angola 37.9
4 Guinea-Bissau 39.2
5 Mozambique 40.4
6 Somalia 41.0
7 Rwanda 41.5
8 Liberia 42.5
9 Equatorial Guinea 43.0
10 Guinea 43.2
# ℹ 132 more rows
arrange(summarised_life_exp, desc(mean_life_exp))
# A tibble: 142 × 2
country mean_life_exp
<chr> <dbl>
1 Iceland 76.5
2 Sweden 76.2
3 Norway 75.8
4 Netherlands 75.6
5 Switzerland 75.6
6 Canada 74.9
7 Japan 74.8
8 Australia 74.7
9 Denmark 74.4
10 France 74.3
# ℹ 132 more rows
Counting things
A very common summary operation is to count the number of observations. The n() function will
help simplify this process. n() will return the number of rows in the data frame (or in the group
if the data frame is grouped).
summarise(gapminder, num_rows = n())
# A tibble: 1 × 1
num_rows
<int>
1 1704
summarise(gapminder_by_country, num_rows = n())
# A tibble: 142 × 2
country num_rows
<chr> <int>
1 Afghanistan 12
2 Albania 12
3 Algeria 12
4 Angola 12
5 Argentina 12
6 Australia 12
7 Austria 12
8 Bahrain 12
9 Bangladesh 12
10 Belgium 12
# ℹ 132 more rows
The n() function can be very useful if we need to use the number of observations in calculations.
For instance, if we wanted to get the standard error of the population per country:
# standard error = standard deviation / square root of the number of observations
summarise(gapminder_by_country, se_pop = sd(pop) / sqrt(n()) )
# A tibble: 142 × 2
country se_pop
<chr> <dbl>
1 Afghanistan 2053803.
2 Albania 239192.
3 Algeria 2486462.
4 Angola 771421.
5 Argentina 2178518.
6 Australia 1130222.
7 Austria 126342.
8 Bahrain 60880.
9 Bangladesh 10020393.
10 Belgium 150295.
# ℹ 132 more rows
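To check that n() really is evaluated per group, it helps to use data where the answer is easy to work out by hand. The sketch below assumes dplyr is installed and uses a made-up data frame called toy with four rows in each group, so sqrt(n()) is 2 and the standard error should be exactly half the standard deviation.

```r
library(dplyr)

# Hypothetical toy data: two groups with four observations each
toy <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "b"),
  value = c(1, 3, 5, 7, 10, 20, 30, 40)
)

toy_by_group <- group_by(toy, group)

# n() is evaluated separately within each group
toy_se <- summarise(
  toy_by_group,
  num_rows = n(),
  sd_value = sd(value),
  se_value = sd(value) / sqrt(n())
)
toy_se
```

Keeping num_rows and sd_value alongside se_value in the output makes it straightforward to confirm the calculation for each group.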
Challenge 4
Let’s try to put together all three functions introduced here. Produce a data frame that summarises the number of rows for each continent, sorted from highest to lowest. Use group_by(), summarise(), and arrange() in that order to achieve it.
Solution to Challenge 4
gap_by_cont <- group_by(gapminder, continent)
count_by_cont <- summarise(gap_by_cont, num_rows = n())
arrange(count_by_cont, desc(num_rows))
# A tibble: 5 × 2
continent num_rows
<chr> <int>
1 Africa 624
2 Asia 396
3 Europe 360
4 Americas 300
5 Oceania 24
Key Points
summarise() creates new variables that provide a one-row summary per group
n() is useful to count rows per group