Cleaning Data

Overview

Teaching: 15 min
Exercises: 30 min
Questions
  • How can I process data that is untidy/inconsistent/missing parts?

Objectives
  • Identify inconsistencies in data

  • Appropriately fix errors in data without overwriting the original

  • Reshape and manipulate data to make it tidy

  • Set data to missing where necessary

Real world data sets rarely come in a form ready for analysis. Even the untidy example in the previous lesson was unusual because it was a complete data set with no missing values and informative column names. Using gather()/spread() and separate()/unite() to reshape your data will solve a lot of problems, but they can not fix all the problems you will encounter.

A few other situations you may encounter when trying to clean your data up for analysis are:

Handling of missing data

Sometimes, missing data is not recorded as blank orNA, but as some other value. As discussed in the data import lesson, you can define the missing data values on import with the na argument. But in some situations it might not be possible because you are importing from a very specialised data format or using a data frame provided by someone else.

In this situation, we can recode values to be NA with the na_if(). You provide na_if() with a vector of values to check, and then the value that needs to be recoded as NA. This works well within a mutate() call.

# -1 should be NA
missing <- data_frame(id = c("a", "b", "c"), value = c(10, -1, 20))

missing
# A tibble: 3 × 2
  id    value
  <chr> <dbl>
1 a        10
2 b        -1
3 c        20
missing %>% 
  mutate(value = na_if(value, -1))
# A tibble: 3 × 2
  id    value
  <chr> <dbl>
1 a        10
2 b        NA
3 c        20

Filling in implied values

When recording a number of measurements, repeated values may be left blank for convenience, with the implication that the first value is carried through the blank spaces. For example:

implied_id <- data_frame(
  id = c("a", NA, NA, "b", NA, NA), 
  value = c(12, 23, 18, 34, 23, 16)
)

implied_id
# A tibble: 6 × 2
  id    value
  <chr> <dbl>
1 a        12
2 <NA>     23
3 <NA>     18
4 b        34
5 <NA>     23
6 <NA>     16

Here, the id variable is recorded in the first row, but then left blank until an observation with a new id is recorded. To complete the data frame, you will need to fill() the id column, which replaces blank entries with the last recorded value.

implied_id %>% 
  fill(id)
# A tibble: 6 × 2
  id    value
  <chr> <dbl>
1 a        12
2 a        23
3 a        18
4 b        34
5 b        23
6 b        16

Dealing with outliers/errors

Your data might have values that look like they have been recorded incorrectly or are outliers. It can be appropriate to remove or alter these values in some situations. But just editing your raw data is a poor choice because it does not leave any record of what you have done.

A conservative choice in this situation is to create a new variable containing your modifications, which makes explicit what your criteria are. The case_when() function may be useful for this. case_when() is a fast and powerful way of specifying an outcome in response to a set of particular conditions. For example, assume we have the following data frame, but know that the measurement system can only record values between 0 and 30. We might decide that any measurements outside these values should be set to either 0 or 30 (Is this a good way of dealing with these errors?).

outliers <- data_frame(
  id = c("a", "b", "c", "d", "e"),
  value = c(15, -2, 19, 9, 35)
)

outliers
# A tibble: 5 × 2
  id    value
  <chr> <dbl>
1 a        15
2 b        -2
3 c        19
4 d         9
5 e        35
# Clamp values outside of the range 0-30 to the closest limit
outliers %>% 
  mutate(clean_value = case_when(
    value < 0 ~ 0,
    value > 30 ~ 30,
    TRUE ~ value
  )
)
# A tibble: 5 × 3
  id    value clean_value
  <chr> <dbl>       <dbl>
1 a        15          15
2 b        -2           0
3 c        19          19
4 d         9           9
5 e        35          30

Each part of the case_when() call has two elements separated by a ~. The left hand side is a condition that evaluates to TRUE or FALSE. The right hand side is the output that case_when should return when the condition is TRUE. These conditions are evaluated from top to bottom and only return the output from the first condition that is TRUE.

So the code above can be read as:

Key Points

  • Real world data is messy

  • It takes time and care to prepare it for analysis and visualisation