Putting it all together
Overview
Teaching: 25 min
Exercises: 25 min
Questions
How can I use multiple data verbs together?
How should I plan my data analysis?
Objectives
Break down an analysis into several small steps
Understand approaches to linking analysis steps together.
Use the pipe (%>%) to chain together multiple functions in a readable format.
Now that we have learned some data manipulation verbs you might begin to see how they could be used together. By combining several small and simple steps you can start to perform complex manipulations.
Challenge 1
Using both select() and filter(), create a data frame containing only the country, year, and pop columns for Australian data.
Your solution to this challenge probably took one of two approaches:
Using an intermediate variable
filtered_data <- filter(data, ...)
final_data <- select(filtered_data, ...)
Nesting function calls
final_data <- select(filter(data, ...), ...)
Each of these is a perfectly acceptable way to solve the problem, but both can lead to code that is difficult to read and understand. With intermediate variables, you need to come up with a name for each one, and unless you are careful to keep the connected lines of code together, it can become difficult to track which variable is being used in each function. The nested version removes some of these problems, but it is already hard to read with only two simple steps.
Piping in R
From the magrittr package in the tidyverse we have another approach for chaining a sequence of functions. This is the pipe operator (%>%), which works in a similar way to the Unix pipe.
Shortcuts
Typing %>% every time you want a pipe is a bit awkward, so RStudio has a shortcut to help. Use Ctrl+Shift+M (Cmd+Shift+M on a Mac) to insert a pipe into your code.
Using the pipe
The pipe works by taking the output of its left-hand side and inserting it as the first argument to the function on its right-hand side.
This means that the following
some_function(data)
could be rewritten using a pipe as
data %>% some_function()
Other function arguments are left untouched so
#Standard form
some_function(data, first_arg, second_arg)
#Piped form
data %>% some_function(first_arg, second_arg)
are equivalent.
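As a concrete check of that equivalence, here is a minimal sketch that uses base R's round() in place of some_function. It assumes the magrittr package is installed (loading dplyr or the tidyverse also makes %>% available), and the number 3.14159 is just an illustrative input.

```r
library(magrittr)  # provides the %>% operator

# Standard form: the value is the first argument, the digits are the second
standard <- round(3.14159, 2)

# Piped form: the left-hand side becomes the first argument of round()
piped <- 3.14159 %>% round(2)

# Check that both forms give the same result
identical(standard, piped)
```

Only the first argument moves to the left of the pipe; the 2 stays inside the call in both forms.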
Challenge 2
Rewrite the following line from the previous lesson in a piped form. Does it give you the same output?
filter(gapminder, country == "Australia", year >= 1997)
Solution to Challenge 2
gapminder %>% filter(country == "Australia", year >= 1997)
# A tibble: 3 × 6
  country   continent  year lifeExp      pop gdpPercap
  <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Australia Oceania    1997    78.8 18565243    26998.
2 Australia Oceania    2002    80.4 19546792    30688.
3 Australia Oceania    2007    81.2 20434176    34435.
Combining
The real value of the pipe comes when there are multiple steps to complete. To rewrite our example above using pipes:
#Nested form
final_data <- select(filter(data, ...), ...)
#Piped form
final_data <- data %>%
  filter(...) %>%
  select(...)
This can be read as a series of instructions. Take the data frame data, then filter it to keep some rows, then select some columns.
For many people, the piped form is one that is easier to read and understand, as well as easier to write.
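To see the two forms side by side on something runnable, here is a small sketch. It assumes dplyr is installed; the toy data frame and its column names (name, group, value) are made up for illustration and are not part of the gapminder data.

```r
library(dplyr)

# Made-up toy data, for illustration only
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# Nested form: has to be read from the inside out
nested <- select(filter(toy, group == "x"), name, value)

# Piped form: reads from top to bottom, one step per line
piped <- toy %>%
  filter(group == "x") %>%
  select(name, value)

# Check that both forms produce the same data frame
identical(nested, piped)
```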
Challenge 3
Take your answer to Challenge 1 and rewrite it in a piped form.
Now, imagine that you later decide more steps are required. Add a step renaming the pop column to population for both forms. Do you find one form easier to work with than the other?
Solution to Challenge 3
For the piped version:
gapminder %>%
  select(country, year, pop) %>%
  filter(country == "Australia")
# A tibble: 12 × 3
   country    year      pop
   <chr>     <dbl>    <dbl>
 1 Australia  1952  8691212
 2 Australia  1957  9712569
 3 Australia  1962 10794968
 4 Australia  1967 11872264
 5 Australia  1972 13177000
 6 Australia  1977 14074100
 7 Australia  1982 15184200
 8 Australia  1987 16257249
 9 Australia  1992 17481977
10 Australia  1997 18565243
11 Australia  2002 19546792
12 Australia  2007 20434176
And adding a rename step:
gapminder %>%
  select(country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
# A tibble: 12 × 3
   country    year population
   <chr>     <dbl>      <dbl>
 1 Australia  1952    8691212
 2 Australia  1957    9712569
 3 Australia  1962   10794968
 4 Australia  1967   11872264
 5 Australia  1972   13177000
 6 Australia  1977   14074100
 7 Australia  1982   15184200
 8 Australia  1987   16257249
 9 Australia  1992   17481977
10 Australia  1997   18565243
11 Australia  2002   19546792
12 Australia  2007   20434176
Common pitfalls
When using pipes to construct a sequential analysis, there are a few problems that can trip people up.
Failing to complete the pipe
When constructing a sequence by adding or removing steps, a very common problem is that you will leave a pipe command trailing at the end of your code block.
#Pipe to nowhere
final_data <- data %>%
  filter(...) %>%
  select(...) %>%
In this case, your code will not complete and R will pause and wait for the rest of the pipe before continuing. This is easy to spot in your console because the input indicator will change from a > to a +, telling you that R is waiting for you to complete a command. Just hit the Esc key to cancel the command, and fix up your pipe before running it again.
Forgetting to assign the output
This has been covered before, but is something that often gets forgotten again once you start using pipes. If the final output of your pipe is not saved into a variable, it will just get printed to the screen and then lost.
So instead of
gapminder %>%
  select(country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
make sure to save the output if you need access to it later
aust_data <- gapminder %>%
  select(country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
Forgetting to pass some data into the pipe
Some people have the opposite problem where they focus so much on getting each step of the pipe right that they forget that it needs some data to work on.
Try running the pipe above without any data:
select(country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
Error in select(country, year, pop): object 'country' not found
It throws an error because select() in the first step is expecting its first argument to be a data frame that it can work on. Instead, it finds country, which is not a data frame.
You could fix this by providing the data frame directly to select()
select(gapminder, country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
But it is easier to know exactly what data is going into a pipe if you put it on its own line at the start
gapminder %>%
  select(country, year, pop) %>%
  filter(country == "Australia") %>%
  rename(population = pop)
Overly long pipes
Using pipes can make your code more understandable, but there is a limit to their effectiveness. If your pipe has too many steps it becomes harder to read through and keep in your memory all the steps that have been applied to the data.
It also makes identifying errors harder. Imagine a pipe with 30 steps that is giving you the wrong output or throwing an error. Working out which of those 30 steps is the cause of the problem can become time-consuming and difficult. If you find yourself writing very long pipes, consider whether there are logical ways to break them up into a series of smaller pipes.
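One way to break a pipe up is to save the result of each logical stage under a descriptive name and start the next pipe from it. The sketch below assumes dplyr is installed; the measurements data frame and the names clean_data and final_data are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Made-up toy data, for illustration only
measurements <- tibble(
  site   = c("north", "north", "south", "south"),
  year   = c(2001, 2002, 2001, 2002),
  length = c(10, 12, 20, 22)
)

# First pipe: tidy up the raw data
clean_data <- measurements %>%
  rename(length_cm = length) %>%
  filter(year >= 2002)

# Second pipe: reduce the cleaned data to what the analysis needs
final_data <- clean_data %>%
  arrange(desc(length_cm)) %>%
  select(site, length_cm)

final_data
```

If the final output looks wrong, you can inspect clean_data on its own to see whether the problem lies in the first stage or the second.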
Style questions
There are a couple of style guidelines you can follow when writing pipes to make them more understandable.
Separate each step in a pipe
Each step of the pipe should be on a separate line, and all steps after the first should be indented. This helps to clearly identify the elements of the pipe and read through them in a step-by-step fashion.
Comment sensibly
You can add comments into the middle of a pipe chain
gapminder %>%
  rename(gdpPerCap = gdpPercap) %>% # Inconsistent capitalisation annoys me
  select(country, year, gdpPerCap) %>%
  # Data version of #MapsWithoutNZ
  filter(country != "New Zealand")
but consider the effect those comments have on the legibility of your code. It may be best to add your comment at the start of the pipe instead, or to break the pipe up so that the complicated parts needing comments are separated from the simpler steps.
Constructing an analysis
We have now learned a number of tools for manipulating data, as well as a convenient way to link them together. When it comes time to put them together, it can help to start at the end. Decide on what your end goal is, and then work backwards step by step to figure out how to achieve it. As an example using the gapminder data, suppose a colleague asked you which country had the sixth highest population in 1972. In order to answer that, you would need a list of the countries ordered by their population in 1972. To get such a list, you would first need to extract just the data from 1972 from the complete gapminder set. And to extract just the 1972 data (assuming you are doing this analysis in R), you would first need to import the data from somewhere.
So this very basic analysis has a number of steps to complete:
- Read gapminder data into R
- Keep only the data from 1972
- Sort this data by population size
- Get the sixth highest population size
- Look at the country with that population
Challenge 4
Using this process, what steps might you need to determine which countries are in the top ten life expectancy lists for both 1987 and 2007?
From design to implementation
Once you have sketched out an analysis, you will have a series of small, self-contained steps to complete. This maps well onto the tidyverse philosophy that each function should be as simple as possible, and do one thing well. Ideally each of your analysis steps will correspond to one (or a small number) of the tidyverse verbs that we have covered. For example, we could take our analysis from before and add in some tidyverse verbs without losing the descriptiveness.
- read_csv the gapminder data into R
- filter only the data from 1972
- arrange this data by population size
- filter the sixth highest population size
- select the country variable with that population
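The steps above could be sketched as a single pipe along the following lines. This is only one possible implementation, and it rests on several assumptions: dplyr is installed, a small made-up tibble (toy_gapminder, with hypothetical countries "A" to "G" and invented populations) stands in for the data that read_csv would import, and the dplyr helper row_number() is used to pick out the sixth row.

```r
library(dplyr)

# Made-up stand-in for the gapminder data (hypothetical countries and values).
# In the real analysis this data frame would come from read_csv().
toy_gapminder <- tibble(
  country = rep(c("A", "B", "C", "D", "E", "F", "G"), times = 2),
  year    = rep(c(1972, 1977), each = 7),
  pop     = c(70, 10, 50, 30, 60, 20, 40,
              75, 15, 55, 35, 65, 25, 45)
)

# Keep the 1972 rows, sort from largest to smallest population,
# keep the sixth row, then look at the country in that row
sixth_country <- toy_gapminder %>%
  filter(year == 1972) %>%
  arrange(desc(pop)) %>%
  filter(row_number() == 6) %>%
  select(country)

sixth_country
```

Note that arrange() sorts from smallest to largest by default, so desc() is needed to put the highest population first.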
As you become familiar with these functions and the ‘grammar’ of the tidyverse, it will become easier to see how to link them together step by step to complete an analysis. Eventually this process will become second nature and you will easily be able to break down complex analyses into a series of small steps that can be solved using a handful of simple functions.
Challenge 5
Try and map some tidyverse verbs onto the steps you identified in Challenge 4. How many verbs did you need for each step?
Now, see if you can implement them in code to answer the question.
Key Points
Data analyses can be broken down into discrete stages
Most data analysis stages fit into a small number of types
Pipes pass their left hand side through as the first argument of the right hand side.
Pipes make your code more readable, but be careful of going overboard.