Cleaning Data
Overview
Teaching: 15 min
Exercises: 30 minQuestions
How can I process data that is untidy/inconsistent/missing parts?
Objectives
Identify inconsistencies in data
Appropriately fix errors in data without overwriting the original
Reshape and manipulate data to make it tidy
Set data to missing where necessary
Real world data sets rarely come in a form ready for analysis. Even the untidy example in the
previous lesson was unusual because it was a complete data set with no missing values and informative
column names. Using gather()
/spread()
and separate()
/unite()
to reshape your data will solve
a lot of problems, but they can not fix all the problems you will encounter.
A few other situations you may encounter when trying to clean your data up for analysis are:
Handling of missing data
Sometimes, missing data is not recorded as blank orNA
, but as some other value. As discussed in the
data import lesson, you can
define the missing data values on import with the na
argument. But in some situations it might not
be possible because you are importing from a very specialised data format or using a data frame
provided by someone else.
In this situation, we can recode values to be NA
with the na_if()
. You provide na_if()
with
a vector of values to check, and then the value that needs to be recoded as NA
. This works
well within a mutate()
call.
# -1 should be NA
missing <- data_frame(id = c("a", "b", "c"), value = c(10, -1, 20))
missing
# A tibble: 3 × 2
id value
<chr> <dbl>
1 a 10
2 b -1
3 c 20
missing %>%
mutate(value = na_if(value, -1))
# A tibble: 3 × 2
id value
<chr> <dbl>
1 a 10
2 b NA
3 c 20
Filling in implied values
When recording a number of measurements, repeated values may be left blank for convenience, with the implication that the first value is carried through the blank spaces. For example:
implied_id <- data_frame(
id = c("a", NA, NA, "b", NA, NA),
value = c(12, 23, 18, 34, 23, 16)
)
implied_id
# A tibble: 6 × 2
id value
<chr> <dbl>
1 a 12
2 <NA> 23
3 <NA> 18
4 b 34
5 <NA> 23
6 <NA> 16
Here, the id
variable is recorded in the first row, but then left blank until an observation with
a new id is recorded. To complete the data frame, you will need to fill()
the id
column, which
replaces blank entries with the last recorded value.
implied_id %>%
fill(id)
# A tibble: 6 × 2
id value
<chr> <dbl>
1 a 12
2 a 23
3 a 18
4 b 34
5 b 23
6 b 16
Dealing with outliers/errors
Your data might have values that look like they have been recorded incorrectly or are outliers. It can be appropriate to remove or alter these values in some situations. But just editing your raw data is a poor choice because it does not leave any record of what you have done.
A conservative choice in this situation is to create a new variable containing your modifications, which
makes explicit what your criteria are. The case_when()
function may be useful
for this. case_when()
is a fast and powerful way of specifying an outcome in response to a set
of particular conditions. For example, assume we have the following data frame, but know that the
measurement system can only record values between 0 and 30. We might decide that any measurements
outside these values should be set to either 0 or 30 (Is this a good way of dealing with these errors?).
outliers <- data_frame(
id = c("a", "b", "c", "d", "e"),
value = c(15, -2, 19, 9, 35)
)
outliers
# A tibble: 5 × 2
id value
<chr> <dbl>
1 a 15
2 b -2
3 c 19
4 d 9
5 e 35
# Clamp values outside of the range 0-30 to the closest limit
outliers %>%
mutate(clean_value = case_when(
value < 0 ~ 0,
value > 30 ~ 30,
TRUE ~ value
)
)
# A tibble: 5 × 3
id value clean_value
<chr> <dbl> <dbl>
1 a 15 15
2 b -2 0
3 c 19 19
4 d 9 9
5 e 35 30
Each part of the case_when()
call has two elements separated by a ~
. The left hand side is a
condition that evaluates to TRUE
or FALSE
. The right hand side is the output that case_when
should return when the condition is TRUE
. These conditions are evaluated from top to bottom and
only return the output from the first condition that is TRUE
.
So the code above can be read as:
- Take the
outliers
data frame, then mutate
it to create a new column calledclean_value
- If the
value
column is less than zero,clean_value
should be zero - Otherwise, if the
value
column is greater than 30,clean_value
should be 30 - Otherwise, if
TRUE
isTRUE
(ie. always),clean_value
should be the same asvalue
Key Points
Real world data is messy
It takes time and care to prepare it for analysis and visualisation