Tidy Data

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I use a consistent underlying data structure?

Objectives
  • To recognise tidy or messy data.

  • To know where to find more information about the tidyverse.

Happy families are all alike; every unhappy family is unhappy in its own way

Leo Tolstoy

Tidy datasets are all alike, but every messy dataset is messy in its own way

Hadley Wickham

Data come in many shapes and forms, and people often invent whatever structure makes sense to them. As a result, a great deal of time is spent restructuring data into a format that R can use.

Within R, different packages can have different expectations about data structures, which can make it difficult to move between functions in different packages.

The tidyverse is a collection of R packages that share a common philosophy about data structure.

The concept of tidy data can be distilled into three principles. A data set can be considered ‘tidy’ if:

  1. Each variable is in its own column
  2. Each case is in its own row
  3. Each value is in its own cell
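As a small illustration of the third principle, the sketch below starts from a table in which one cell holds two values and splits it so that each value has its own cell. The data, the column names (`patient`, `bp`, `systolic`, `diastolic`) and the choice of `tidyr::separate()` are assumptions made for this example, not part of the lesson's data.

```r
library(tibble)
library(tidyr)

# Hypothetical example data: the `bp` column packs two values into one cell
messy <- tibble(
  patient = c("A", "B"),
  bp      = c("120/80", "135/85")
)

# Split the combined cell at "/" so that each value gets its own cell;
# convert = TRUE turns the resulting text into numbers
tidy <- separate(messy, bp, into = c("systolic", "diastolic"),
                 sep = "/", convert = TRUE)
tidy
```

After the split, every column holds one variable and every cell holds a single value.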

Challenge 1

In the following table, what makes it untidy?

id rep1 rep2
1 1.44 2.07
2 1.77 2.13
3 3.56 3.72

Challenge 2

What would it look like if it were tidy?

Solution to Challenge 2

id rep value
1 1 1.44
2 1 1.77
3 1 3.56
1 2 2.07
2 2 2.13
3 2 3.72

In the table above, the same variable (a measurement value) was stored in two different columns. Making the data tidy required combining those two columns into one, which doubled the number of rows. This is usually called going from “wide” to “long” format, and in the simplest cases it is all that is needed to tidy a data set.
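One possible way to carry out this wide-to-long conversion in R is with `pivot_longer()` from the tidyr package. The sketch below types the table from Challenge 1 in by hand; the object names `wide` and `long` are chosen only for this example.

```r
library(tibble)
library(tidyr)

# The untidy table from Challenge 1, entered by hand
wide <- tibble(
  id   = c(1, 2, 3),
  rep1 = c(1.44, 1.77, 3.56),
  rep2 = c(2.07, 2.13, 3.72)
)

# Stack rep1 and rep2 into a single `value` column.
# names_prefix strips the "rep" text from the old column names and
# names_transform stores what is left (1 or 2) as an integer.
long <- pivot_longer(
  wide,
  cols            = c(rep1, rep2),
  names_to        = "rep",
  names_prefix    = "rep",
  names_transform = list(rep = as.integer),
  values_to       = "value"
)
long
```

The result has the same six rows as the solution to Challenge 2, although `pivot_longer()` orders them by `id` rather than by `rep`. Row order does not affect whether the data are tidy.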

Challenge 3

Open the file plates.xlsx (download here). This is a very common format for storing data from 96-well plates. What would this look like if it were tidy? Discuss the steps you would need to go through to convert it to a tidy format.

There is also a package called tidyverse. It has almost no functionality of its own; its purpose is to attach the core packages of the tidyverse.

library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()     masks stats::filter()
✖ dplyr::group_rows() masks kableExtra::group_rows()
✖ dplyr::lag()        masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

is the equivalent of

library(ggplot2)
library(dplyr)
library(tidyr)
library(readr)
library(purrr)
library(tibble)
library(stringr)
library(forcats)
library(lubridate)
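The conflict messages printed when the tidyverse is attached mean that two attached packages export a function with the same name, and the one attached last is found first. A minimal sketch of how to be explicit about which one is meant, using the `package::function()` form, is shown below. The small data frame `df` is made up for this example.

```r
library(dplyr)

# Small data frame used only for illustration
df <- data.frame(id = 1:3, value = c(1.44, 1.77, 3.56))

# dplyr::filter() keeps the rows that satisfy a condition
big <- dplyr::filter(df, value > 1.5)

# stats::filter() is an unrelated function: it applies a linear filter
# to a series, here a 2-point moving average
smooth <- stats::filter(df$value, rep(1 / 2, 2))
```

Writing the package name in front of the function removes any ambiguity, whatever order the packages were attached in.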

The power of maintaining a consistent approach to data structures will become clearer as we work with data in R. You will see that once data are tidy, all subsequent work is easier to understand and your code becomes clearer as well.

Other great resources

Key Points

  • For data to be tidy, each variable must be in its own column

  • For data to be tidy, each case must be in its own row

  • For data to be tidy, each value must be in its own cell