Reproducibility
Overview
Teaching: 10 min
Exercises: 45 min
Questions
How can I ensure my code is reproducible?
Objectives
Understand the value of reproducibility
Identify the key considerations to enable reproducibility
A core concept of reproducible research is that all steps from raw data to results are available and well documented so that others can make use of them. The explanation of research methods in published papers is rarely detailed enough to recreate the results, and so there is no easy way to confirm the published results, or apply the methods in a new context.
Taking the first step towards making your research more reproducible is easy when your analysis is recorded in code. Just collect your analysis code into a script and make it available with your raw data and results. Some people have an aversion to letting other people look at their code, but give it a try as there are a lot of potential benefits.
Benefits of reproducibility
- It makes it easy for your methods to be reused in the future. This may be a colleague trying to apply your approach in their own project, or yourself in a year or two trying to compare newly collected data with your previous work.
- You are more likely to catch mistakes in your analysis in the process of documenting it for others. Since each step in the process is recorded, it also makes it easy to trace back to find when the mistake was introduced and correct it.
- Other people will be able to spot mistakes in your code. More people looking at your code means more opportunities to catch and correct errors, ensuring that your results are accurate.
Reproducibility with R
Besides recording your analysis in a script, there are a few other considerations to increase the reproducibility of your research.
Firstly, your script will run within the R environment it is launched from. Use the RStudio Environment pane to explore your current environment after completing these lessons, or use ls() to view the data in your environment and loadedNamespaces() to view the packages that are loaded in the environment.
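For example, running these two commands at the console gives a quick snapshot of the current session; the exact output will depend on what you have created and loaded so far:

```r
# List the objects currently defined in your environment
ls()

# List the packages currently loaded in this session
loadedNamespaces()
```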
Challenge: New sessions
Create a fresh R session (in RStudio select “Session” > “New Session”) and compare the environments. What effect might this have on a script?
To make sure that your script is self-contained, always test that it performs as expected in a fresh session.
Secondly, make sure that your analysis is documented clearly. This might mean adding short comments to your code as you write it, or, for longer-form documentation, creating a plain text README file in your project's directory (e.g. to explain the experimental design). An alternative approach is to use an R Markdown document, which lets you mix text and code together more easily.
When documenting your process, it is often more important to record the decisions that you make than to simply describe what your code is doing. Explaining why your analysis takes a particular approach is far more valuable to anyone trying to understand it.
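As a small illustration of the difference, compare two ways of commenting the same filtering step; the data frame, column names, and cut-off below are made up purely for this example:

```r
library(dplyr)

# A made-up summary table, purely for the sake of the example
station_counts <- tibble(
  Station_number = c(9021, 14015, 66062),
  n_obs          = c(31, 12, 28)
)

# A comment that only restates what the code does:
# keep rows where n_obs is at least 25
reliable <- filter(station_counts, n_obs >= 25)

# A comment that records the decision behind the code:
# stations with fewer than 25 observations give unreliable averages,
# so they are excluded before summarising
reliable <- filter(station_counts, n_obs >= 25)
```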
Finally, record precisely which software versions were used to produce your results. Software can
be updated and may not always produce the same results as older versions. Record the version of R
you are using, as well as the versions of any packages you use. This information can be found with sessionInfo().
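One lightweight way to do this is to save the output of sessionInfo() alongside your results at the end of your script; the file name here is just an example:

```r
# Record the R version, operating system, and package versions
# used to produce the results, and keep a copy with the outputs
writeLines(capture.output(sessionInfo()), "session_info.txt")
```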
Challenge: Putting it all together
As a final challenge, you will work in pairs to complete an end-to-end data analysis task in a reproducible fashion.
Getting the data
- Create a new RStudio project for this analysis
- Download two data files from the Bureau of Meteorology, one containing meteorological information, and one containing metadata about weather stations
- Take some time to explore the data files and understand what they contain
- Write a script that answers the following questions:
Question 1
For each station, how many days have a minimum temperature, a maximum temperature and a rainfall measurement recorded?
Hint
This question can be answered using just the BOM_data file.
You will first need to separate() a column to access both the minimum and maximum temperature data. Then, you can filter() the data to keep only rows that have minimum temperature, maximum temperature, and rainfall measurements. A group_by() followed by summarise() will then allow you to count the number of rows remaining for each station.
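A minimal sketch of one possible approach is shown below. The file path, the column names (Temp_min_max, Rainfall, Station_number), the "/" separator, and the use of "-" for missing measurements are assumptions; check them against the actual file when you explore the data.

```r
library(tidyverse)

# Assumed file name; adjust to match your project
BOM_data <- read_csv("BOM_data.csv")

BOM_data %>%
  # Split the combined temperature column into minimum and maximum values
  separate(Temp_min_max, into = c("min_temp", "max_temp"), sep = "/") %>%
  # Keep only days where all three measurements were recorded
  # (assuming missing values are recorded as "-")
  filter(min_temp != "-", max_temp != "-", Rainfall != "-") %>%
  # Count the remaining days for each station
  group_by(Station_number) %>%
  summarise(days_with_all_three = n())
```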
Question 2
Which month saw the lowest average daily temperature difference?
Hint
This question can be answered using just the BOM_data file.
In addition to the functions you used above, this question will need a mutate() to calculate the temperature difference. The temperature values are stored as characters after you have run separate() (see the <chr> in the second row if you print the data frame to the console). To be able to calculate the difference without an error, you will need to convert them to numeric values with as.numeric() first. For rows that are missing a temperature measurement, the temperature difference will be NA. How will you deal with these in the rest of the analysis?
Question 3
Which state saw the lowest average daily temperature difference?
Hint
State information is found in the BOM_stations file, so we will need to join it with our previous dataset.
The station data is not in a tidy format, however, as each station is recorded in its own column. (Why is this data not tidy?)
To tidy it before merging, you will need to gather() the station data into an intermediate form that has three columns: one for the station ID number, one for the type of data being recorded (the info column in the original data), and one for the actual recorded value itself. (Is this intermediate data tidy?)
This data frame can then be spread() into a shape with one row for each station. Remember that the key argument to spread() identifies the column that will provide the data for the new column names, and the value argument identifies the column that will provide the data for the new cells.
Finally, you will want to join the two datasets together to identify the state of each weather station. If you run into errors at this step, check that the two data frames have a shared column to merge on, and that the shared columns are the same data type (e.g. you can't merge a character column with a numeric column).
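One way the reshaping and joining could look is sketched below. The file name, the column names (info, state), and the assumption that station IDs are numeric in BOM_data but character in the reshaped station data are guesses to be checked against the real files.

```r
# Assumed file name; adjust to match your project
BOM_stations <- read_csv("BOM_stations.csv")

stations_tidy <- BOM_stations %>%
  # Intermediate form: one row per station per piece of information
  gather(key = "Station_number", value = "value", -info) %>%
  # Final form: one row per station, one column per piece of information
  spread(key = info, value = value) %>%
  # Station IDs are character here; convert so they match BOM_data
  mutate(Station_number = as.numeric(Station_number))

BOM_data %>%
  separate(Temp_min_max, into = c("min_temp", "max_temp"), sep = "/") %>%
  mutate(temp_diff = as.numeric(max_temp) - as.numeric(min_temp)) %>%
  filter(!is.na(temp_diff)) %>%
  # Attach the state (and other station metadata) to each daily record
  left_join(stations_tidy, by = "Station_number") %>%
  group_by(state) %>%
  summarise(mean_temp_diff = mean(temp_diff)) %>%
  arrange(mean_temp_diff)
```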
Question 4
Does the westmost (lowest longitude) or eastmost (highest longitude) weather station in our dataset have a higher average solar exposure?
Hint
This question will need both the BOM_data and the BOM_stations files.
You will not need any new verbs other than what you have used in previous answers.
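For example, reusing stations_tidy from the previous question, and assuming longitude and solar exposure columns named lon and Solar_exposure with "-" for missing values (all of which should be checked against the real data):

```r
# Identify the westmost and eastmost stations by longitude
extreme_stations <- stations_tidy %>%
  mutate(lon = as.numeric(lon)) %>%
  filter(lon == min(lon) | lon == max(lon))

# Compare their average daily solar exposure
BOM_data %>%
  filter(Station_number %in% extreme_stations$Station_number,
         Solar_exposure != "-") %>%
  group_by(Station_number) %>%
  summarise(mean_solar_exposure = mean(as.numeric(Solar_exposure)))
```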
If answering this final question is easy, spend some time reviewing your entire script to see if there are any ways you can improve it. Are there any repeated steps that you could save as an intermediate variable? Could you add some comments to make your code more understandable?
Optional extension
Design your own question. What is a question you had after exploring the contents of the data? Or was there something that surprised you when working with the data?
Sharing your work
After answering these questions, swap scripts with another pair, along with any instructions they will need to make sure it works. Can you get their script to run?
Compare your code for the first four questions. Are there any major differences in how you went about solving them?
Key Points
A script is a discrete unit of analysis
A script will be run in the context of an environment
Software (and compute) dependencies need to be considered