Putting it all together

Overview

Teaching: 25 min
Exercises: 25 min

Questions

How can I use multiple data verbs together?

How should I plan my data analysis?

Objectives

Break down an analysis into several small steps

Understand approaches to linking analysis steps together.

Use the pipe (%>%) to chain together multiple functions in a readable format.

Now that we have learned some data manipulation verbs you might begin to see how they could be used together. By combining several small and simple steps you can start to perform complex manipulations.

Challenge 1

Using both select() and filter(), create a data frame containing only the country, year, and pop columns for Australian data.

Your solution to this challenge probably took one of two approaches:

Using an intermediate variable

filtered_data <- filter(data, ...)

final_data <- select(filtered_data, ...)

Nesting function calls

final_data <- select(filter(data, ...), ...)

Each of these is a perfectly acceptable way to solve the problem, but both can lead to code that is difficult to read and understand. With intermediate variables, you need to come up with a name for each one, and if you aren’t very careful about keeping the connected lines of code together it can become difficult to track which variable is being used in each function. The nested version removes some of these problems, but it has to be read from the inside out, and it is hard to read even with only two simple steps.
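To make the two approaches concrete without giving away the challenge, here is a minimal sketch on a small toy data frame. The data frame toy and its columns are invented for illustration and are not part of the gapminder data; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Toy data frame, invented purely for illustration
toy <- tibble(
  name  = c("a", "b", "c", "d"),
  group = c("x", "y", "x", "y"),
  value = c(1, 2, 3, 4)
)

# Approach 1: an intermediate variable holds the filtered rows
toy_x <- filter(toy, group == "x")
via_intermediate <- select(toy_x, name, value)

# Approach 2: nested calls, read from the inside out
via_nesting <- select(filter(toy, group == "x"), name, value)

# Both approaches should give the same data frame
identical(via_intermediate, via_nesting)
```

Both versions do the same work; they differ only in how easy they are to read and extend.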

Piping in R

From the magrittr package in the tidyverse we have another approach for chaining a sequence of functions. This is the pipe operator (%>%), which works in a similar way to the Unix pipe.

Shortcuts

Typing %>% every time you want a pipe is a bit awkward, so RStudio has a shortcut to help. Use Ctrl+Shift+M (Cmd+Shift+M on a Mac) to insert a pipe into your code.

Using the pipe

The pipe works by taking the output of its left hand side and inserting it as the first argument to the function on its right hand side.

This means that the following

some_function(data)

could be rewritten using a pipe as

data %>% some_function()

Other function arguments are left untouched so

#Standard form
some_function(data, first_arg, second_arg)

#Piped form
data %>% some_function(first_arg, second_arg)

are equivalent.
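As a quick check of this equivalence, the sketch below uses base R’s round() in place of some_function. It assumes dplyr is installed, since loading dplyr makes %>% available.

```r
library(dplyr)  # loading dplyr makes %>% available

# Standard form: the value is passed as the first argument
standard <- round(3.14159, 2)

# Piped form: the left hand side becomes the first argument,
# and the remaining argument (2) is left untouched
piped <- 3.14159 %>% round(2)

# The two forms should give the same result
identical(standard, piped)
```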

Challenge 2

Rewrite the following line from the previous lesson in a piped form. Does it give you the same output?

filter(gapminder, country == "Australia", year >= 1997)

Solution to Challenge 2

gapminder %>% 
    filter(country == "Australia", year >= 1997)
# A tibble: 3 × 6
  country   continent  year lifeExp      pop gdpPercap
  <chr>     <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Australia Oceania    1997    78.8 18565243    26998.
2 Australia Oceania    2002    80.4 19546792    30688.
3 Australia Oceania    2007    81.2 20434176    34435.

Combining

The real value of the pipe comes when there are multiple steps to complete. To rewrite our example above using pipes:

#Nested form
final_data <- select(filter(data, ...), ...)

#Piped form
final_data <- data %>% 
                filter(...) %>% 
                select(...)

This can be read as a series of instructions. Take the data frame data, then filter it to keep some rows, then select some columns.

For many people, the piped form is one that is easier to read and understand, as well as easier to write.
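The difference in readability grows with the number of steps. As a small illustration that avoids data frames altogether, the sketch below chains three base R functions on a made-up numeric vector; it assumes dplyr is installed to provide %>%.

```r
library(dplyr)  # loading dplyr makes %>% available

# Made-up values, chosen so the arithmetic is easy to follow
values <- c(4, 9, 16)

# Nested form: read from the inside out
nested <- round(sum(sqrt(values)), 1)

# Piped form: read from top to bottom.
# Take values, then take square roots, then sum, then round.
piped <- values %>%
  sqrt() %>%
  sum() %>%
  round(1)

# Both should equal 2 + 3 + 4
identical(nested, piped)
```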

Challenge 3

Take your answer to Challenge 1 and rewrite it in a piped form.

Now, imagine that you later decide more steps are required. Add a step renaming the pop column to population for both forms. Do you find one form easier to work with than the other?

Solution to Challenge 3

For the piped version:

gapminder %>% 
    select(country, year, pop) %>% 
    filter(country == "Australia")

# A tibble: 12 × 3
   country    year      pop
   <chr>     <dbl>    <dbl>
Australia  1952  8691212
Australia  1957  9712569
Australia  1962 10794968
Australia  1967 11872264
Australia  1972 13177000
Australia  1977 14074100
Australia  1982 15184200
Australia  1987 16257249
Australia  1992 17481977
Australia  1997 18565243
Australia  2002 19546792
Australia  2007 20434176

And adding a rename step:

gapminder %>% 
    select(country, year, pop) %>% 
    filter(country == "Australia") %>% 
    rename(population = pop)

# A tibble: 12 × 3
   country    year population
   <chr>     <dbl>      <dbl>
Australia  1952    8691212
Australia  1957    9712569
Australia  1962   10794968
Australia  1967   11872264
Australia  1972   13177000
Australia  1977   14074100
Australia  1982   15184200
Australia  1987   16257249
Australia  1992   17481977
Australia  1997   18565243
Australia  2002   19546792
Australia  2007   20434176

Common pitfalls

When using pipes to construct a sequential analysis, there are a few problems that can trip people up.

Failing to complete the pipe

When constructing a sequence by adding or removing steps, a very common problem is leaving a pipe operator trailing at the end of your code block.

#Pipe to nowhere
final_data <- data %>% 
                filter(...) %>% 
                select(...) %>% 

In this case, your code will not complete and R will pause and wait for the rest of the pipe before continuing. This is easy to spot in your console because the input indicator will change from a > to a +, telling you that R is waiting for you to complete a command. Just hit the Esc key to cancel the command, and fix up your pipe before running it again.

Forgetting to assign the output

This has been covered before, but is something that often gets forgotten again once you start using pipes. If the final output of your pipe is not saved into a variable, it will just get printed to the screen and then lost.

So instead of

gapminder %>% 
  select(country, year, pop) %>% 
  filter(country == "Australia") %>% 
  rename(population = pop)

make sure to save the output if you need access to it later:

aust_data <- gapminder %>% 
  select(country, year, pop) %>% 
  filter(country == "Australia") %>% 
  rename(population = pop)

Forgetting to pass some data in to the pipe

Some people have the opposite problem where they focus so much on getting each step of the pipe right that they forget that it needs some data to work on.

Try running the pipe above without any data:

select(country, year, pop) %>% 
  filter(country == "Australia") %>% 
  rename(population = pop)

Error in select(country, year, pop): object 'country' not found

It throws an error because select() in the first step is expecting its first argument to be a data frame that it can work on. Instead, it finds country, which is not a data frame.

You could fix this by providing the data frame directly to select():

select(gapminder, country, year, pop) %>% 
  filter(country == "Australia") %>% 
  rename(population = pop)

But it is easier to know exactly what data is going into a pipe if you put it on its own line at the start:

gapminder %>% 
  select(country, year, pop) %>% 
  filter(country == "Australia") %>% 
  rename(population = pop)

Overly long pipes

Using pipes can make your code more understandable, but there is a limit to their effectiveness. If your pipe has too many steps it becomes harder to read through and keep in your memory all the steps that have been applied to the data.

It also makes identifying errors harder. Imagine a pipe with 30 steps that is giving you the wrong output or throwing an error. Working out which of those 30 steps is the cause of the problem can become time consuming and difficult. If you find yourself writing very long pipes, consider whether there is a logical way to break them up into a series of smaller pipes.
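One way to break up a long pipe is to save the result of each logical stage under a descriptive name, so that each stage can be inspected on its own. The sketch below is only an illustration: the measurements data frame, its columns, and the stage names are all invented, and it assumes dplyr is installed.

```r
library(dplyr)

# Toy data, invented for illustration
measurements <- tibble(
  site  = c("north", "north", "south", "south", "south"),
  year  = c(2001, 2002, 2001, 2002, 2003),
  count = c(10, 12, 7, NA, 9)
)

# Stage 1: cleaning. Drop missing counts and tidy the column name.
cleaned <- measurements %>%
  filter(!is.na(count)) %>%
  rename(n_observed = count)

# Stage 2: the question of interest, starting from the cleaned data
south_only <- cleaned %>%
  filter(site == "south") %>%
  arrange(desc(n_observed)) %>%
  select(year, n_observed)

south_only
```

If the final result looks wrong, you can print cleaned first to see whether the problem lies in the first stage or the second.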

Style questions

There are a couple of style guidelines you can follow when writing pipes to make them more understandable.

Separate each step in a pipe

Each step of the pipe should be on a separate line, and all steps after the first should be indented. This helps to clearly identify the elements of the pipe and read through them in a step-by-step fashion.

Comment sensibly

You can add comments into the middle of a pipe chain

gapminder %>% 
  rename(gdpPerCap = gdpPercap) %>% # Inconsistent capitalisation annoys me
  select(country, year, gdpPerCap) %>% 
  # Data version of #MapsWithoutNZ
  filter(country != "New Zealand")

but consider the effect those comments have on the legibility of your code. It may be best to add your comment at the start of the pipe instead, or to break the pipe up so that the complicated parts needing comments are separated from the simpler steps.

Constructing an analysis

We have now learned a number of tools for manipulating data, as well as a convenient way to link them together. When it comes time to put them together, it can help to start at the end. Decide on what your end goal is, and then work backwards step by step to figure out how to achieve it. As an example using the gapminder data, suppose a colleague asked you which country had the sixth highest population in 1972. In order to answer that, you would need a list of the countries ordered by their population in 1972. To get such a list, you would first need to extract just the data from 1972 from the complete gapminder set. And to extract just the 1972 data (assuming you are doing this analysis in R), you would first need to import the data from somewhere.

So this very basic analysis has a number of steps to complete:

Read gapminder data into R
Keep only the data from 1972
Sort this data by population size
Get the sixth highest population size
Look at the country with that population

Challenge 4

Using this process, what steps might you need to determine which countries are in the top ten life expectancy lists for both 1987 and 2007?

From design to implementation

Once you have sketched out an analysis, you will have a series of small, self-contained steps to complete. This maps well onto the tidyverse philosophy that each function should be as simple as possible, and do one thing well. Ideally each of your analysis steps will correspond to one (or a small number) of the tidyverse verbs that we have covered. For example, we could take our analysis from before and add in some tidyverse verbs without losing the descriptiveness.

read_csv the gapminder data into R
filter only the data from 1972
arrange this data by population size
filter the sixth highest population size
select the country variable with that population
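The steps above can be sketched as a single pipe. This is one possible translation, not the only one. To keep the example self-contained, the read_csv step is replaced by a toy table with made-up country names and populations (toy_gapminder is hypothetical and is not the real gapminder data). Note that arrange() sorts in ascending order by default, so desc() is used to put the largest population first. The sketch assumes dplyr is installed.

```r
library(dplyr)

# Stand-in for the read_csv() step: a toy table with made-up values,
# so that the sketch does not depend on a file path
toy_gapminder <- tibble(
  country = rep(c("A", "B", "C", "D", "E", "F", "G", "H"), times = 2),
  year    = rep(c(1972, 1977), each = 8),
  pop     = c(50, 80, 10, 70, 30, 60, 20, 40,
              55, 85, 15, 75, 35, 65, 25, 45)
)

sixth_largest <- toy_gapminder %>%
  filter(year == 1972) %>%        # keep only the data from 1972
  arrange(desc(pop)) %>%          # sort by population, largest first
  filter(row_number() == 6) %>%   # keep the sixth row
  select(country)                 # look at the country in that row

sixth_largest
```

Each line of the pipe corresponds to one line of the plan, which makes it easy to check the code against the design.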

As you become familiar with these functions and the ‘grammar’ of the tidyverse, it will become easier to see how to link them together step by step to complete an analysis. Eventually this process will become second nature and you will easily be able to break down complex analyses into a series of small steps that can be solved using a handful of simple functions.

Challenge 5

Try to map some tidyverse verbs onto the steps you identified in Challenge 4. How many verbs did you need for each step?

Now, see if you can implement them in code to answer the question.

Key Points

Data analyses can be broken down into discrete stages

Most data analysis stages fit into a small number of types

Pipes pass their left hand side through as the first argument of the right hand side.

Pipes make your code more readable, but be careful of going overboard.
