Reading Data In

Overview

Teaching: 35 min
Exercises: 10 min
Questions
  • How can I get data into R?

Objectives
  • Read data in from plain text and Excel files

  • Control import parameters to cope with unusual data formats

So far, we have been using datasets that are already available within R. Chances are though that the data you want to work with exists separately, perhaps as a spreadsheet or CSV file. In order to start working with this data, we need to learn how to read it in to R.

Common file types

There are many different ways that data could be stored, but for now we will focus on reading in tabular data. This is data that is already in a data frame-like structure and so we will want to read it in to R as a data frame.

Two of the most common ways you might find data like this are:

Plain text files

These are text files where the columns of data are separated by some delimiting character. Examples of these might include

carat,cut,color,clarity,depth,table,price,x,y,z
0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31
0.29,Premium,I,VS2,62.4,58,334,4.2,4.23,2.63
0.31,Good,J,SI2,63.3,58,335,4.34,4.35,2.75
carat	cut	color	clarity	depth	table	price	x	y	z
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.2	4.23	2.63
0.31	Good	J	SI2	63.3	58	335	4.34	4.35	2.75

Spreadsheet files

All spreadsheet programs (Excel, Calc, Numbers, etc.) will have a way to export data into one of the plain text formats above. This will usually be the best way to get data into R. For excel files (.xls or .xlsx) however, it is possible to read them in directly which we will demonstrate later

Gapminder data

For this section, we will be reading in data from the Gapminder organisation, which records various statistics for 142 countries betwen 1952 and 2007. This data is available as an R package, but we have prepared csv, tsv, and excel versions for you to practice with.

Challenge 1

Download the three versions of the Gapminder above and save them to your project folder.

Open one of the files and describe what statistics are recorded

Solution to Challenge 1

Using the ideas discussed previously about project structure, we will save the files into a data directory within our project. We can then access them with a relative path data/gapminder.csv (for the csv example)

Opening a file we can see that there are six columns of data: a country name and continent, the year that the data was recorded, and the life expectancy, population and GDP per capita.

To load this data into R, we will use the read_csv function from the readr package (which will be loaded automatically if you have preciously run library(tidyverse), but here we will load it separately).

library(readr)

gapminder_csv <- read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gapminder_csv
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Here, we can see that read_csv provides us with some information on what it thinks the data types are as it reads the file in (two columns of character data and four columns of double in this case). We can also see that it has imported the data in the tibble format discussed previously.

Similarly, for the tsv file there is a read_tsv function

Challenge 2

Read in the gapminder.tsv file using read_tsv

Confirm that reading from csv or tsv files produce the same output (you might find the all.equal function useful)

Solution to Challenge 2

gapminder_tsv <- read_tsv("data/gapminder.tsv")

gapminder_tsv
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
all.equal(gapminder_csv, gapminder_tsv)
[1] "Attributes: < Component \"spec\": Component \"delim\": 1 string mismatch >"

Control of import parameters

Both read_csv and read_tsv have a number of parameters that may be needed to help import a file. You can read more about these in the help file for the functions, but some more commonly encountered situations are:

Ignoring comments or metadata from the top of a file

Sometimes files will have metadata information included along with the data. Use skip to ignore a set number of lines from the top of the file or comment if metadata lines are indicated by a specific character such as #.

Eg. a file such as:

#Metadata information detailing data collection process
#Not part of the data, but included to explain it
country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071

Can be read in using:

read_tsv("data/commented_file.tsv", comment = "#")
read_tsv("data/commented_file.tsv", skip = 2)

Both lines above produce the same output

Providing column names

By default, readr assumes that the first row of data is a “header” - that is defines the names of the columnes for the data frame. If this is not the case you can either specify the names manually with a character vector, or have the column names generated automatically

Eg.

Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	9240934	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071

Can be read in using:

read_tsv("data/no_names.tsv", col_names = F)
# A tibble: 3 x 6
   X1          X2       X3    X4       X5    X6
   <chr>       <chr> <dbl> <dbl>    <dbl> <dbl>
 1 Afghanistan Asia   1952  28.8  8425333  779.
 2 Afghanistan Asia   1957  30.3  9240934  821.
 3 Afghanistan Asia   1962  32.0 10267083  853.

or

read_tsv("data/no_names.tsv", col_names = c("country", "continent","year","lifeExp","pop","gdpPercap"))
# A tibble: 3 x 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.

Dealing with missing data

By default, blank columns or ones containing “NA” are considered missing. If your data uses a different value for missing data you will need to specify it.

Eg.

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453145
Afghanistan	Asia	1957	30.332	-	820.8530296
Afghanistan	Asia	1962	31.997	10267083	853.10071
read_tsv("data/missing.tsv")
# A tibble: 3 x 6
  country     continent  year lifeExp pop      gdpPercap
  <chr>       <chr>     <dbl>   <dbl> <chr>        <dbl>
1 Afghanistan Asia       1952    28.8 8425333       779.
2 Afghanistan Asia       1957    30.3 -             821.
3 Afghanistan Asia       1962    32.0 10267083      853.

Using the default parameters, pop is incorrectly read in as a character column because missing data is indicated by -

read_tsv("data/missing.tsv", na = "-")
# A tibble: 3 x 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3       NA      821.
3 Afghanistan Asia       1962    32.0 10267083      853.

Now the pop data is correctly recognised as a number, and the missing data is correctly recorded as NA.

Setting explicit column types

Sometimes, the default column types might not be what you intend. In that case you can set them explicitly by either providing a string representing the column types

From the read_tsv help file: c = character, i = integer, n = number, d = double, l = logical, f = factor, D = date, T = date time, t = time, ? = guess, or _/- to skip the column

Eg. setting country to be a factor, pop as an integer, and not reading in the lifeExp column. The remaining columns are left as ‘guess’, in which case readr tries to determine their type automatically.

read_tsv("data/gapminder.tsv", col_types = "f??-i?")
# A tibble: 1,704 × 5
   country     continent  year      pop gdpPercap
   <fct>       <chr>     <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952  8425333      779.
 2 Afghanistan Asia       1957  9240934      821.
 3 Afghanistan Asia       1962 10267083      853.
 4 Afghanistan Asia       1967 11537966      836.
 5 Afghanistan Asia       1972 13079460      740.
 6 Afghanistan Asia       1977 14880372      786.
 7 Afghanistan Asia       1982 12881816      978.
 8 Afghanistan Asia       1987 13867957      852.
 9 Afghanistan Asia       1992 16317921      649.
10 Afghanistan Asia       1997 22227415      635.
# ℹ 1,694 more rows

Note:

If you choose to use a type string like this, you must provide a type for each colum in the data or R will throw a warning

Or by providing a column specification, like how read_csv did to explain it’s default parsing when we first ran it

Importing the same types as above in this format would look like this:

# Default is col_guess()
# So we just need to specify the 
# columns that are different 
gapminder_spec <- cols(
  country = col_factor(), 
  pop = col_integer(), 
  gdpPercap = col_skip()
)

read_tsv("data/gapminder.tsv", col_types = gapminder_spec)
# A tibble: 1,704 × 5
   country     continent  year lifeExp      pop
   <fct>       <chr>     <dbl>   <dbl>    <int>
 1 Afghanistan Asia       1952    28.8  8425333
 2 Afghanistan Asia       1957    30.3  9240934
 3 Afghanistan Asia       1962    32.0 10267083
 4 Afghanistan Asia       1967    34.0 11537966
 5 Afghanistan Asia       1972    36.1 13079460
 6 Afghanistan Asia       1977    38.4 14880372
 7 Afghanistan Asia       1982    39.9 12881816
 8 Afghanistan Asia       1987    40.8 13867957
 9 Afghanistan Asia       1992    41.7 16317921
10 Afghanistan Asia       1997    41.8 22227415
# ℹ 1,694 more rows

Knowing how these few parameters work will enable you to read most csv/tsv files you come across into R.

Challenge 4

Read in the gapminder tsv again, but make the continent column a factor and the year column a date. (Hint: Use the cols() specification and “%Y” is the date format you will need)

Check whether the newly imported data is still the same as the csv data imported previously

Solution to Challenge 4

spec <- cols(
  continent = col_factor(),
  year = col_date(format = "%Y")
)

gapminder_tsv <- read_tsv("data/gapminder.tsv", col_types = spec)

all.equal(gapminder_csv, gapminder_tsv)
[1] "Incompatible type for column `country`: x character, y factor" "Incompatible type for column `year`: x numeric, y Date"

Reading excel files

For reading in excel files, we will use the readxl package. This package is part of the tidyverse, but is not loaded with library(tidyverse), so we will have to do it ourselves.

library(readxl)

We can now use the read_excel function from readxl to read in our data.

gapminder_excel <- read_excel("data/gapminder.xlsx")

Since read_excel uses most of the same options as read_csv/read_tsv covered above, you already know how to use it. Some excel specific options you may need to use include

Specifying the worksheet to extract

The gapminder data is all in the first sheet in the file (called ‘gapminder’). This means that

read_excel("data/gapminder.xlsx")
read_excel("data/gapminder.xlsx", sheet = 1)
read_excel("data/gapminder.xlsx", sheet = "gapminder")

will all produce the same result.

Extracting data from a specific range

Perhaps you are only interested in data from a particular region of the worksheet. This can be imported using an excel range pattern.

read_excel("data/gapminder.xlsx", range = "A1:E4")
# A tibble: 3 × 5
  country     continent  year lifeExp      pop
  <chr>       <chr>     <dbl>   <dbl>    <dbl>
1 Afghanistan Asia       1952    28.8  8425333
2 Afghanistan Asia       1957    30.3  9240934
3 Afghanistan Asia       1962    32.0 10267083

read_csv and read.csv

You may have noticed that there is also a read.csv function. This default behaviour of this function is that all strings are read in as factors, which can be a common source of mistakes for newer R users. For this reason, we will stick with using the tidyverse equivalents where possible.

Regardless of the method chosen. After importing your data, it’s always a good choice to check it afterwards using the data frame exploration methods

Key Points

  • Use read_csv() or read_tsv() to read in plain text data

  • Use read_excel() from the readxl package to read in Excel files