Reproducibility
Overview
Teaching: 10 min
Exercises: 45 min
Questions
How can I ensure my code is reproducible?
Objectives
Understand the value of reproducibility
Identify the key considerations to enable reproducibility
A core concept of reproducible research is that all steps from raw data to results are available and well documented so that others can make use of them. The explanation of research methods in published papers is rarely detailed enough to recreate the results, and so there is no easy way to confirm the published results, or apply the methods in a new context.
Taking the first step towards making your research more reproducible is easy when your analysis is recorded in code. Just collect your analysis code into a script and make it available with your raw data and results. Some people have an aversion to letting other people look at their code, but give it a try as there are a lot of potential benefits.
Benefits of reproducibility
- It makes it easy for your methods to be reused in the future. This may be a colleague trying to apply your approach in their own project, or yourself in a year or two trying to compare newly collected data with your previous work.
- You are more likely to catch mistakes in your analysis in the process of documenting it for others. Since each step in the process is recorded, it also makes it easy to trace back to find when the mistake was introduced and correct it.
- Other people will be able to spot mistakes in your code. More people looking at your code means more opportunities to catch and correct errors, ensuring that your results are accurate.
Reproducibility with R
Besides recording your analysis in a script, there are a few other considerations to increase the reproducibility of your research.
Firstly, your script will run within the R environment it is launched from. Use the RStudio Environment pane to explore your current environment after completing these lessons, or use ls() to view the data in your environment and loadedNamespaces() to view the packages that are loaded in the environment.
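For example, running these two commands at the console gives a quick snapshot of the current session; the exact output will depend on what you have created and loaded so far:

```r
# List the objects currently defined in your environment
ls()

# List the packages currently loaded in this session
loadedNamespaces()
```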
Challenge: New sessions
Create a fresh R session (in RStudio select “Session” > “New Session”) and compare the environments. What effect might this have on a script?
To make sure that your script is self-contained, always test that it performs as expected in a fresh session.
Secondly, make sure that your analysis is documented clearly. This might mean adding short comments to your code as you write it, or, for longer-form documentation, creating a plain text README file in your project's directory (e.g. to explain the experimental design). An alternative approach is to use an R Markdown document, which lets you mix text and code together more easily.
When documenting your process, it is often more important to record the decisions that you make than to simply describe what your code is doing. Explaining why your analysis takes a particular approach is far more valuable to anyone trying to understand it.
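As a small illustration of the difference, compare two ways of commenting the same filtering step; the data frame, column names, and cut-off below are made up purely for this example:

```r
library(dplyr)

# A made-up summary table, purely for the sake of the example
station_counts <- tibble(
  Station_number = c(9021, 14015, 66062),
  n_obs          = c(31, 12, 28)
)

# A comment that only restates what the code does:
# keep rows where n_obs is at least 25
reliable <- filter(station_counts, n_obs >= 25)

# A comment that records the decision behind the code:
# stations with fewer than 25 observations give unreliable averages,
# so they are excluded before summarising
reliable <- filter(station_counts, n_obs >= 25)
```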
Finally, record precisely which software versions were used to produce your results. Software can
be updated and may not always produce the same results as older versions. Record the version of R
you are using, as well as the versions of any packages you use. This information can be found with sessionInfo().
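One lightweight way to do this is to save the output of sessionInfo() alongside your results at the end of your script; the file name here is just an example:

```r
# Record the R version, operating system, and package versions
# used to produce the results, and keep a copy with the outputs
writeLines(capture.output(sessionInfo()), "session_info.txt")
```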
Challenge: Putting it all together
As a final challenge, you will work in pairs to complete an end-to-end data analysis task in a reproducible fashion.
Getting the data
- Create a new RStudio project for this analysis
- Download two data files from the Bureau of Meteorology, one containing meteorological information, and one containing metadata about weather stations
- Take some time to explore the data files and understand what they contain
- Write a script that answers the following questions:
Question 1
For each station, how many days have a minimum temperature, a maximum temperature and a rainfall measurement recorded?
Hint
This question can be answered using just the BOM_data file.
You will first need to separate() a column to access both the minimum and maximum temperature data. Then, you can filter() the data to keep only rows that have minimum temperature, maximum temperature, and rainfall measurements. A group_by() followed by summarise() will then allow you to count the number of rows remaining for each station.
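A minimal sketch of one possible approach is shown below. The file path, the column names (Temp_min_max, Rainfall, Station_number), the "/" separator, and the use of "-" for missing measurements are assumptions; check them against the actual file when you explore the data.

```r
library(tidyverse)

# Assumed file name; adjust to match your project
BOM_data <- read_csv("BOM_data.csv")

BOM_data %>%
  # Split the combined temperature column into minimum and maximum values
  separate(Temp_min_max, into = c("min_temp", "max_temp"), sep = "/") %>%
  # Keep only days where all three measurements were recorded
  # (assuming missing values are recorded as "-")
  filter(min_temp != "-", max_temp != "-", Rainfall != "-") %>%
  # Count the remaining days for each station
  group_by(Station_number) %>%
  summarise(days_with_all_three = n())
```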
Question 2
Which month saw the lowest average daily temperature difference?
Hint
This question can be answered using just the BOM_data file.
In addition to the functions you used above, this question will need a mutate() to calculate the temperature difference. The temperature values are stored as characters after you have run separate() (see the <chr> in the second row if you print the data frame to the console). To be able to calculate the difference without an error, you will need to convert them to numeric values with as.numeric() first. For rows that are missing a temperature measurement, the temperature difference will be NA. How will you deal with these in the rest of the analysis?
Question 3
Which state saw the lowest average daily temperature difference?
Hint
State information is found in the BOM_stations file, so we will need to join it with our previous dataset.
The station data is not in a tidy format, however, as each station is recorded in its own column. (Why is this data not tidy?)
To tidy it before merging, you will need to gather() the station data into an intermediate form that has three columns: one for the station ID number, one for the type of data being recorded (the info column in the original data), and one for the actual recorded value itself. (Is this intermediate data tidy?)
This data frame can then be spread() into a shape with one row for each station. Remember that the key argument to spread() identifies the column that will provide the data for the new column names, and the value argument identifies the column that will provide the data for the new cells.
Finally, you will want to join the two datasets together to identify the state of each weather station. If you run into errors at this step, check that the two data frames have a shared column to merge on, and that the shared columns are the same data type (e.g. you can't merge a character column with a numeric column).
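One way the reshaping and joining could look is sketched below. The file name, the column names (info, state), and the assumption that station IDs are numeric in BOM_data but character in the reshaped station data are guesses to be checked against the real files.

```r
# Assumed file name; adjust to match your project
BOM_stations <- read_csv("BOM_stations.csv")

stations_tidy <- BOM_stations %>%
  # Intermediate form: one row per station per piece of information
  gather(key = "Station_number", value = "value", -info) %>%
  # Final form: one row per station, one column per piece of information
  spread(key = info, value = value) %>%
  # Station IDs are character here; convert so they match BOM_data
  mutate(Station_number = as.numeric(Station_number))

BOM_data %>%
  separate(Temp_min_max, into = c("min_temp", "max_temp"), sep = "/") %>%
  mutate(temp_diff = as.numeric(max_temp) - as.numeric(min_temp)) %>%
  filter(!is.na(temp_diff)) %>%
  # Attach the state (and other station metadata) to each daily record
  left_join(stations_tidy, by = "Station_number") %>%
  group_by(state) %>%
  summarise(mean_temp_diff = mean(temp_diff)) %>%
  arrange(mean_temp_diff)
```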
Question 4
Does the westmost (lowest longitude) or eastmost (highest longitude) weather station in our dataset have a higher average solar exposure?
Hint
This question will need both the BOM_data and the BOM_stations files.
You will not need any new verbs other than what you have used in previous answers.
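For example, reusing stations_tidy from the previous question, and assuming longitude and solar exposure columns named lon and Solar_exposure with "-" for missing values (all of which should be checked against the real data):

```r
# Identify the westmost and eastmost stations by longitude
extreme_stations <- stations_tidy %>%
  mutate(lon = as.numeric(lon)) %>%
  filter(lon == min(lon) | lon == max(lon))

# Compare their average daily solar exposure
BOM_data %>%
  filter(Station_number %in% extreme_stations$Station_number,
         Solar_exposure != "-") %>%
  group_by(Station_number) %>%
  summarise(mean_solar_exposure = mean(as.numeric(Solar_exposure)))
```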
If answering this final question is easy, spend some time reviewing your entire script to see if there are any ways you can improve it. Are there any repeated steps that you could save as an intermediate variable? Could you add some comments to make your code more understandable?
Optional extension
Design your own question. What is a question you had after exploring the contents of the data? Or was there something that surprised you when working with the data?
Sharing your work
After answering these questions, swap scripts with another pair, along with any instructions they will need to make sure it works. Can you get their script to run?
Compare your code for the first four questions. Are there any major differences in how you went about solving them?
Key Points
A script is a discrete unit of analysis
A script will be run in the context of an environment
Software (and compute) dependencies need to be considered