Tidy Data
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can I use a consistent underlying data structure?
Objectives
To recognise tidy or messy data.
To know where to find more information about the tidyverse.
Happy families are all alike; every unhappy family is unhappy in its own way
Leo Tolstoy
Tidy datasets are all alike, but every messy dataset is messy in its own way
Hadley Wickham
Data can come in many different shapes and forms, and people often invent whatever structure makes sense to them. As a result, a great deal of time is spent restructuring data into a format that R can use.
Within R, different packages can have different expectations about data structures, which can make it difficult to move between functions in different packages.
The tidyverse is a collection of R packages that share a particular philosophy about data structure.
The concept of tidy data can be distilled into three principles. A data set can be considered ‘tidy’ if:
- Each variable is in its own column
- Each case is in its own row
- Each value is in its own cell
Challenge 1
In the following table, what makes it untidy?
id  rep1  rep2
1   1.44  2.07
2   1.77  2.13
3   3.56  3.72
Challenge 2
What would it look like if it were tidy?
Solution to Challenge 2
id  rep  value
1   1    1.44
2   1    1.77
3   1    3.56
1   2    2.07
2   2    2.13
3   2    3.72
In the table above, the same variable (a measurement value) was stored in two different columns. Making the data tidy required combining those two columns into one, which doubled the number of rows. This is usually called going from “wide” to “long” format, and in the simplest cases it is all that tidying the data requires.
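The wide-to-long conversion described here can be sketched in R. This is a minimal illustration, assuming the tidyr, tibble and dplyr packages are installed; the object names `wide` and `long` are made up for this example, and `pivot_longer()` is only one of several ways to reshape the table.

```r
library(tidyr)
library(tibble)
library(dplyr)

# The untidy table from Challenge 1: one column per replicate
wide <- tibble(
  id   = c(1, 2, 3),
  rep1 = c(1.44, 1.77, 3.56),
  rep2 = c(2.07, 2.13, 3.72)
)

# Gather the two replicate columns into a 'rep' column (which replicate)
# and a 'value' column (the measurement itself)
long <- pivot_longer(
  wide,
  cols = c(rep1, rep2),
  names_to = "rep",
  names_prefix = "rep",                      # strip "rep" so only 1 or 2 remains
  names_transform = list(rep = as.integer),  # store the replicate as a number
  values_to = "value"
)

# pivot_longer() works row by row, so sort to match the layout
# shown in the solution to Challenge 2
long <- arrange(long, rep, id)

print(long)
```

The result has six rows and three columns, one row per measurement, which matches the three tidy data principles listed earlier.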
Challenge 3
Open the file plates.xlsx (download here). This is a very common format for storing data from 96-well plates. What would this look like if it were tidy? Discuss the steps you would need to go through to convert it to a tidy format.
There is also a package called tidyverse, which has no functionality of its own; its only job is to attach the core packages of the tidyverse.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ purrr 1.0.2
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ ggplot2 3.5.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::group_rows() masks kableExtra::group_rows()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading the tidyverse package is equivalent to loading each core package individually:
library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
library(lubridate)
The power of maintaining a consistent approach to data structures will become clearer as we work with data in R. You will see that once data has been transformed into a tidy format, all subsequent work is easier to understand, and your code becomes clearer as well.
Other great resources
Key Points
For data to be tidy, each variable must be in its own column
For data to be tidy, each case must be in its own row
For data to be tidy, each value must be in its own cell