Reading Data In
Overview
Teaching: 35 min
Exercises: 10 minQuestions
How can I get data into R?
Objectives
Read data in from plain text and Excel files
Control import parameters to cope with unusual data formats
So far, we have been using datasets that are already available within R. Chances are though that the data you want to work with exists separately, perhaps as a spreadsheet or CSV file. In order to start working with this data, we need to learn how to read it in to R.
Common file types
There are many different ways that data could be stored, but for now we will focus on reading in tabular data. This is data that is already in a data frame-like structure and so we will want to read it in to R as a data frame.
Two of the most common ways you might find data like this are:
Plain text files
These are text files where the columns of data are separated by some delimiting character. Examples of these might include
- CSV (comma separated value) files that use a comma to delimit the columns
carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
- TSV (tab separated value) files, which use a tab instead
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
Spreadsheet files
All spreadsheet programs (Excel, Calc, Numbers, etc.) will have a way to export data into one of the plain text formats above. This will usually be the best way to get data into R. For excel files (.xls or .xlsx) however, it is possible to read them in directly which we will demonstrate later
Gapminder data
For this section, we will be reading in data from the Gapminder organisation, which records various statistics for 142 countries betwen 1952 and 2007. This data is available as an R package, but we have prepared csv, tsv, and excel versions for you to practice with.
Challenge 1
Download the three versions of the Gapminder above and save them to your project folder.
Open one of the files and describe what statistics are recorded
Solution to Challenge 1
Using the ideas discussed previously about project structure, we will save the files into a
data
directory within our project. We can then access them with a relative pathdata/gapminder.csv
(for the csv example)Opening a file we can see that there are six columns of data: a country name and continent, the year that the data was recorded, and the life expectancy, population and GDP per capita.
To load this data into R, we will use the read_csv
function from the readr
package (which will
be loaded automatically if you have preciously run library(tidyverse)
, but here we will
load it separately).
library(readr)
gapminder_csv <- read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gapminder_csv
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Here, we can see that read_csv
provides us with some information on what it thinks the data types
are as it reads the file in (two columns of character data and four columns of double in this case).
We can also see that it has imported the data in the
tibble format discussed previously.
Similarly, for the tsv file there is a read_tsv
function
Challenge 2
Read in the
gapminder.tsv
file usingread_tsv
Confirm that reading from csv or tsv files produce the same output (you might find the
all.equal
function useful)Solution to Challenge 2
gapminder_tsv <- read_tsv("data/gapminder.tsv") gapminder_tsv
# A tibble: 1,704 × 6 country continent year lifeExp pop gdpPercap <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan Asia 1952 28.8 8425333 779. 2 Afghanistan Asia 1957 30.3 9240934 821. 3 Afghanistan Asia 1962 32.0 10267083 853. 4 Afghanistan Asia 1967 34.0 11537966 836. 5 Afghanistan Asia 1972 36.1 13079460 740. 6 Afghanistan Asia 1977 38.4 14880372 786. 7 Afghanistan Asia 1982 39.9 12881816 978. 8 Afghanistan Asia 1987 40.8 13867957 852. 9 Afghanistan Asia 1992 41.7 16317921 649. 10 Afghanistan Asia 1997 41.8 22227415 635. # ℹ 1,694 more rows
all.equal(gapminder_csv, gapminder_tsv)
[1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"
Control of import parameters
Both read_csv
and read_tsv
have a number of parameters that may be needed to help import a file.
You can read more about these in the help file for the functions, but some more commonly encountered
situations are:
Ignoring comments or metadata from the top of a file
Sometimes files will have metadata information included along with the data. Use skip
to ignore a
set number of lines from the top of the file or comment
if metadata lines are indicated by a
specific character such as #
.
Eg. a file such as:
#Metadata information detailing data collection process
#Not part of the data, but included to explain it
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 9240934 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
Can be read in using:
read_tsv("data/commented_file.tsv", comment = "#")
read_tsv("data/commented_file.tsv", skip = 2)
Both lines above produce the same output
Providing column names
By default, readr
assumes that the first row of data is a “header” - that is defines the names of the columnes for the data frame. If this is not
the case you can either specify the names manually with a character vector, or have the column names generated
automatically
Eg.
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 9240934 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
Can be read in using:
read_tsv("data/no_names.tsv", col_names = F)
# A tibble: 3 x 6
X1 X2 X3 X4 X5 X6
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
or
read_tsv("data/no_names.tsv", col_names = c("country", "continent","year","lifeExp","pop","gdpPercap"))
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
Dealing with missing data
By default, blank columns or ones containing “NA” are considered missing. If your data uses a different value for missing data you will need to specify it.
Eg.
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453145
Afghanistan Asia 1957 30.332 - 820.8530296
Afghanistan Asia 1962 31.997 10267083 853.10071
read_tsv("data/missing.tsv")
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 - 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
Using the default parameters, pop
is incorrectly read in as a character column because
missing data is indicated by -
read_tsv("data/missing.tsv", na = "-")
# A tibble: 3 x 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 NA 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
Now the pop
data is correctly recognised as a number, and the missing data is correctly recorded
as NA
.
Setting explicit column types
Sometimes, the default column types might not be what you intend. In that case you can set them explicitly by either providing a string representing the column types
From the
read_tsv
help file: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- to skip the column
Eg. setting country
to be a factor, pop
as an integer, and not reading in the lifeExp
column.
The remaining columns are left as ‘guess’, in which case readr
tries to determine their type automatically.
read_tsv("data/gapminder.tsv", col_types = "f??-i?")
# A tibble: 1,704 × 5
country continent year pop gdpPercap
<fct> <chr> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 8425333 779.
2 Afghanistan Asia 1957 9240934 821.
3 Afghanistan Asia 1962 10267083 853.
4 Afghanistan Asia 1967 11537966 836.
5 Afghanistan Asia 1972 13079460 740.
6 Afghanistan Asia 1977 14880372 786.
7 Afghanistan Asia 1982 12881816 978.
8 Afghanistan Asia 1987 13867957 852.
9 Afghanistan Asia 1992 16317921 649.
10 Afghanistan Asia 1997 22227415 635.
# ℹ 1,694 more rows
Note:
If you choose to use a type string like this, you must provide a type for each colum in the data or R will throw a warning
Or by providing a column specification, like how read_csv
did to explain it’s default parsing
when we first ran it
Importing the same types as above in this format would look like this:
# Default is col_guess()
# So we just need to specify the
# columns that are different
gapminder_spec <- cols(
country = col_factor(),
pop = col_integer(),
gdpPercap = col_skip()
)
read_tsv("data/gapminder.tsv", col_types = gapminder_spec)
# A tibble: 1,704 × 5
country continent year lifeExp pop
<fct> <chr> <dbl> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# ℹ 1,694 more rows
Knowing how these few parameters work will enable you to read most csv/tsv files you come across into R.
Challenge 4
Read in the gapminder tsv again, but make the
continent
column a factor and theyear
column a date. (Hint: Use thecols()
specification and “%Y” is the date format you will need)Check whether the newly imported data is still the same as the csv data imported previously
Solution to Challenge 4
spec <- cols( continent = col_factor(), year = col_date(format = "%Y") ) gapminder_tsv <- read_tsv("data/gapminder.tsv", col_types = spec) all.equal(gapminder_csv, gapminder_tsv)
[1] "Incompatible type for column `country`: x character, y factor" "Incompatible type for column `year`: x numeric, y Date"
Reading excel files
For reading in excel files, we will use the readxl
package. This
package is part of the tidyverse, but is not loaded with library(tidyverse)
, so we will have to do
it ourselves.
library(readxl)
We can now use the read_excel
function from readxl
to read in our data.
gapminder_excel <- read_excel("data/gapminder.xlsx")
Since read_excel
uses most of the same options as read_csv
/read_tsv
covered above, you already
know how to use it. Some excel specific options you may need to use include
Specifying the worksheet to extract
The gapminder data is all in the first sheet in the file (called ‘gapminder’). This means that
read_excel("data/gapminder.xlsx")
read_excel("data/gapminder.xlsx", sheet = 1)
read_excel("data/gapminder.xlsx", sheet = "gapminder")
will all produce the same result.
Extracting data from a specific range
Perhaps you are only interested in data from a particular region of the worksheet. This can be imported using an excel range pattern.
read_excel("data/gapminder.xlsx", range = "A1:E4")
# A tibble: 3 × 5
country continent year lifeExp pop
<chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
read_csv
andread.csv
You may have noticed that there is also a
read.csv
function. This default behaviour of this function is that all strings are read in as factors, which can be a common source of mistakes for newer R users. For this reason, we will stick with using the tidyverse equivalents where possible.Regardless of the method chosen. After importing your data, it’s always a good choice to check it afterwards using the data frame exploration methods
Key Points
Use
read_csv()
orread_tsv()
to read in plain text dataUse
read_excel()
from thereadxl
package to read in Excel files