Dataframes
Overview
Teaching: 25 min
Exercises: 15 min
Questions
What is a dataframe?
Why use a dataframe as a tidy data structure?
Objectives
To learn how to create a dataframe.
To understand how to find basic information about a dataframe
To know how to inspect the data in a dataframe
Now that we understand a little about why we might prefer our data to be ‘tidy’, we have one data structure left to learn: the dataframe. Dataframes are where the vast majority of work in R is done. A dataframe can look like any table of data (and we will often refer to it that way), but it is a very particular structure: a special type of list made up of vectors that must all be the same length. Because every element of a vector must have the same data type, a dataframe is a rectangular table in which each column holds a single type, although different columns can hold different types. Dataframes are an ideal format for storing and working with tidy data, and all the tidyverse tools we will be learning from here are designed to work with data in this form.
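To make the "list of equal-length vectors" idea concrete, here is a minimal sketch. The object name small is only an illustration, and the values are copied from the first rows of the gapminder data used later in this lesson.

```r
library(tibble)  # part of the tidyverse

# A small illustrative data frame built from three equal-length vectors
small <- tibble(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952, 1957, 1962),
  lifeExp = c(28.801, 30.332, 31.997)
)

is.list(small)        # a data frame is a list underneath
lengths(small)        # every column (list element) has the same length
sapply(small, class)  # each column has one type, but columns can differ
```

Here country is a character column while year and lifeExp are numeric, so the table is rectangular without every column sharing one type.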
Creating a data frame
In order to start working with data, we first need to learn how to read it into R. For learning how to work with data, we will be using records from the Gapminder organisation, which contain various statistics for 142 countries between 1952 and 2007. This data is available as an R package, but we have prepared a csv version for you to practice with.
Challenge 1
Download the gapminder.csv file and save it in your project directory.
Open the file in a text editor and describe what statistics are recorded.
Solution to Challenge 1
Using the ideas discussed previously about project structure, we will save the files into a data directory within our project. We can then access them with a relative path data/gapminder.csv.
Opening the file, we can see that there are six columns of data: a country name and continent, the year that the data was recorded, and the life expectancy, population and GDP per capita.
To load this data into R, we will use the read_csv function. For reading in different data formats, or for control of the import options, see the optional section on reading data in to R.
library(tidyverse)
gapminder <- read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gapminder
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
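The column specification message printed by read_csv suggests specifying the column types yourself. The sketch below shows one way to do that. It reads a two-row piece of literal CSV text (wrapped in I(), which assumes readr 2.0 or later) instead of the real file so that it runs on its own. The names csv_text and demo are illustrative, and reading year as an integer is a choice made for this example, not something the lesson requires.

```r
library(readr)  # part of the tidyverse

# Literal CSV text standing in for a file; I() tells read_csv() to treat
# the string as data rather than as a path
csv_text <- I("country,continent,year,lifeExp
Afghanistan,Asia,1952,28.801
Afghanistan,Asia,1957,30.332")

# Declaring the column types up front silences the specification message
demo <- read_csv(
  csv_text,
  col_types = cols(
    country   = col_character(),
    continent = col_character(),
    year      = col_integer(),
    lifeExp   = col_double()
  )
)

spec(demo)  # shows the column specification that was used
```

With the real data you would pass "data/gapminder.csv" in place of csv_text.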
What’s a tibble?
You might notice in the output above that it calls itself a tibble, rather than a data frame. A tibble is just the tidyverse’s version of a data frame that has a few behaviours tweaked to make it behave more predictably. Try comparing the output of a base R data.frame version of gapminder with as.data.frame(gapminder) to get some idea of the differences.
We will try to refer to them as data frames throughout these lessons. But know that dataframes and tibbles are interchangeable for our purposes.
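As a rough illustration of one tweaked behaviour, the sketch below compares a tiny tibble with its base R data.frame equivalent. The objects tb and df are made up for this example. Printing differs too, but the clearest difference is what a single-column selection with [ , ] returns.

```r
library(tibble)  # part of the tidyverse

tb <- tibble(country = c("Afghanistan", "Afghanistan"), year = c(1952, 1957))
df <- as.data.frame(tb)

class(tb)  # "tbl_df" "tbl" "data.frame": a tibble is still a data frame
class(df)  # "data.frame"

# Selecting one column with [ , ] keeps a tibble as a tibble,
# but a base data.frame drops down to a plain vector
tb[, "year"]
df[, "year"]
```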
Inspecting a dataframe
Looking at the printed output from the gapminder dataframe can tell us a lot of information about it. The first line tells us the dimensions of the data: in this case there are 1,704 rows and 6 columns. This information can also be found with:
nrow(gapminder)
[1] 1704
ncol(gapminder)
[1] 6
dim(gapminder)
[1] 1704 6
The next row of output gives the names of the columns, which can also be found using colnames().
colnames(gapminder)
[1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
The next row tells you the data type of each column, followed by the data itself. We won’t discuss data types in too much detail, but see this section for a full description.
Should you need the data from a specific column, it can be accessed using the $ notation.
gapminder$lifeExp
[1] 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763
[11] 42.129 43.828 55.230 59.280 64.820 66.220 67.690 68.930 70.420 72.000
[21] 71.581 72.950 75.651 76.423 43.077 45.685 48.303 51.407 54.518 58.014
[31] 61.368 65.799 67.744 69.152 70.994 72.301 30.015 31.999 34.000 35.985
[41] 37.928 39.483 39.942 39.906 40.647 40.963 41.003 42.731 62.485 64.399
[51] 65.142 65.634 67.065 68.481 69.942 70.774 71.868 73.275 74.340 75.320
[61] 69.120 70.330 70.930 71.100 71.930 73.490 74.740 76.320 77.560 78.830
[71] 80.370 81.235 66.800 67.480 69.540 70.140 70.630 72.170 73.180 74.940
[ reached getOption("max.print") -- omitted 1624 entries ]
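Because $ returns an ordinary vector, the result can be passed straight to other functions. The sketch below uses a three-row illustrative tibble (values taken from the gapminder output above, with small as a made-up name) and also shows the equivalent [[ ]] form.

```r
library(tibble)  # part of the tidyverse

small <- tibble(
  country = rep("Afghanistan", 3),
  year    = c(1952, 1957, 1962),
  lifeExp = c(28.801, 30.332, 31.997)
)

small$lifeExp        # the column as a plain numeric vector
small[["lifeExp"]]   # equivalent; handy when the name is stored in a variable
mean(small$lifeExp)  # the vector can be used like any other
```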
Overview of a dataframe
There are many other ways to view the data and look at its data types and structure. To look at the first 5 rows of the gapminder dataset, use head():
head(gapminder, 5)
# A tibble: 5 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
Use tail() to look at the last 10 rows:
tail(gapminder, 10)
# A tibble: 10 × 6
country continent year lifeExp pop gdpPercap
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Zimbabwe Africa 1962 52.4 4277736 527.
2 Zimbabwe Africa 1967 54.0 4995432 570.
3 Zimbabwe Africa 1972 55.6 5861135 799.
4 Zimbabwe Africa 1977 57.7 6642107 686.
5 Zimbabwe Africa 1982 60.4 7636524 789.
6 Zimbabwe Africa 1987 62.4 9216418 706.
7 Zimbabwe Africa 1992 60.4 10704340 693.
8 Zimbabwe Africa 1997 46.8 11404948 792.
9 Zimbabwe Africa 2002 40.0 11926563 672.
10 Zimbabwe Africa 2007 43.5 12311143 470.
To look at the structure of the data (particularly when there are many columns) use glimpse():
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Use summary() to get a summarised breakdown of each column:
summary(gapminder)
country continent year lifeExp
Length:1704 Length:1704 Min. :1952 Min. :23.60
Class :character Class :character 1st Qu.:1966 1st Qu.:48.20
Mode :character Mode :character Median :1980 Median :60.71
Mean :1980 Mean :59.47
3rd Qu.:1993 3rd Qu.:70.85
Max. :2007 Max. :82.60
pop gdpPercap
Min. :6.001e+04 Min. : 241.2
1st Qu.:2.794e+06 1st Qu.: 1202.1
Median :7.024e+06 Median : 3531.8
Mean :2.960e+07 Mean : 7215.3
3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Max. :1.319e+09 Max. :113523.1
Factors
One special type of data in R is a factor, which is most often used in statistical modelling for categorical data. We can look at what these are by explicitly reading part of our data frame in as a factor.
Challenge 2
Read in a new gapminder data frame with the continent column as a factor using
gapminder_factor <- read_csv("data/gapminder.csv", col_types = cols(continent = col_factor()))
What does the output from glimpse and summary on this new data frame show you? How is it different from our original gapminder data frame, and why do you think it has a different format for different columns?
Solution to Challenge 2
The output from summary() changes depending on the class of the data in the column. For the numeric columns it shows the minimum, maximum, mean and quartile values. For the factor column, it shows a count of each category. For the character column it cannot provide any useful summary.
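The same behaviour can be seen on standalone vectors. This is a small base R sketch: the vector names are illustrative, and the four values are picked from the gapminder output shown earlier.

```r
# Three vectors of different classes
life_exp      <- c(28.801, 30.332, 55.230, 43.077)
continent_chr <- c("Asia", "Asia", "Europe", "Africa")
continent_fct <- factor(continent_chr)

summary(life_exp)       # numeric: minimum, quartiles, mean and maximum
summary(continent_fct)  # factor: a count for each level
summary(continent_chr)  # character: only length, class and mode
```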
Let’s look at what has changed in this column.
class(gapminder_factor$continent)
[1] "factor"
gapminder_factor$continent
[1] Asia Asia Asia Asia Asia Asia Asia Asia
[9] Asia Asia Asia Asia Europe Europe Europe Europe
[17] Europe Europe Europe Europe Europe Europe Europe Europe
[25] Africa Africa Africa Africa Africa Africa Africa Africa
[33] Africa Africa Africa Africa Africa Africa Africa Africa
[41] Africa Africa Africa Africa Africa Africa Africa Africa
[49] Americas Americas Americas Americas Americas Americas Americas Americas
[57] Americas Americas Americas Americas Oceania Oceania Oceania Oceania
[65] Oceania Oceania Oceania Oceania Oceania Oceania Oceania Oceania
[73] Europe Europe Europe Europe Europe Europe Europe Europe
[ reached getOption("max.print") -- omitted 1624 entries ]
Levels: Asia Europe Africa Americas Oceania
Here you can see that the continent column in the gapminder_factor data set is a factor that lists the continent the country is in. In the final line, you can see that it lists the possible categories of the data with Levels: Asia Europe Africa Americas Oceania. The levels of this factor can also be accessed using:
levels(gapminder_factor$continent)
[1] "Asia" "Europe" "Africa" "Americas" "Oceania"
But what happens when we look more closely at this data using glimpse()?
glimpse(gapminder_factor$continent)
Factor w/ 5 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
This tells us we have a factor with 5 levels, just like we expected. But when it comes to showing the data itself, all we see are a bunch of numbers. This is because, to R, a factor is really just an integer underneath, with the levels telling it how to map the integer to the actual category. So a value of 1 would map to the first level (Asia), while a value of 2 would map to the second level (Europe), etc.
Expecting factors to behave as characters, rather than integers, is a common cause of errors for people new to R. So always remember to inspect your data with the methods shown here to make sure it is of the right type.
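The mapping between integer codes and levels, and the kind of mistake it can cause, can be sketched in base R. The vectors continent and year_fct are made up for this illustration.

```r
# A factor with its levels given in an explicit order
continent <- factor(
  c("Asia", "Europe", "Asia", "Africa"),
  levels = c("Asia", "Europe", "Africa")
)

as.integer(continent)    # the underlying codes: 1 2 1 3
as.character(continent)  # the labels those codes map to

# Common pitfall: a factor whose labels look like numbers
year_fct <- factor(c("1952", "1957", "1952"))
as.integer(year_fct)                # gives the codes 1 2 1, not the years
as.integer(as.character(year_fct))  # convert via character to recover 1952 1957 1952
```

Converting through as.character() first is the usual way to get the original values back out of a factor.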
Your turn
So far, you’ve been walked through investigating a dataframe. Let’s use those skills to explore a data set you have not yet been exposed to.
Challenge 3
The storms data set comes built in to the tidyverse packages and contains information on hurricanes recorded in the Atlantic Ocean. It can be accessed just by typing storms into your console.
Using the tools you have learned so far, explore this data set and describe what it contains. Explain both the structural features of the data set as a whole, as well as its content.
Hint
If you are encountering a data type you haven’t seen before, try looking at the class() of the column to see if that helps you work out what it is.
Solution to Challenge 3
The object storms is a dataframe (a tibble) with 10,010 rows and 13 columns.
name and status are character vectors.
year, month, hour, lat, long, ts_diameter and hu_diameter are numeric vectors.
day, wind, and pressure are integer vectors.
category is an ordered factor vector, with levels -1 < 0 < 1 < 2 < 3 < 4 < 5
Key Points
Dataframes (or tibbles in the tidyverse) are lists where each element is a vector of the same length
Use read_csv() to read comma separated files into a data frame
Use nrow(), ncol(), dim(), or colnames() to find information about a dataframe
Use head(), tail(), summary(), or glimpse() to inspect a dataframe’s content