Dataframes

Overview

Teaching: 25 min
Exercises: 15 min
Questions
  • What is a dataframe?

  • Why use a dataframe as a tidy data structure?

Objectives
  • To learn how to create a dataframe.

  • To understand how to find basic information about a dataframe.

  • To know how to inspect the data in a dataframe.

Now that we understand a little bit about why we might prefer our data to be ‘tidy’, we have one data structure left to learn: the dataframe. Dataframes are where the vast majority of work in R is done. A dataframe can look a lot like any other table of data (and we will often refer to them in that way), but it is a very particular structure. A dataframe is a special type of list made up of vectors that must all be the same length. Since every element of a vector must have the same data type, a dataframe is a rectangular table of data in which all the values in a column share one type, although different columns may have different types. Dataframes are an ideal format for storing and working with tidy data. All the tidyverse tools we will be learning from here are designed to work with data in this form.
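As a minimal sketch of this structure, the example below builds a small dataframe by hand from three equal-length vectors. The object name toy and all of its values are made up for illustration.

```r
library(tibble)

# Three vectors of equal length, each holding a single data type
# (all values here are invented for illustration)
country <- c("A", "B", "C")
year    <- c(1952, 1957, 1962)
visited <- c(TRUE, FALSE, FALSE)

toy <- tibble(country, year, visited)

is.list(toy)        # a dataframe is a list underneath
lengths(toy)        # every column (list element) has the same length
sapply(toy, class)  # one data type per column
```

Each column keeps its own type, but within a column every value has the same type.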

Creating a data frame

In order to start working with data, we first need to learn how to read it into R. For learning how to work with data, we will be using records from the Gapminder organisation, which contain various statistics for 142 countries between 1952 and 2007. This data is available as an R package, but we have prepared a csv version for you to practice with.

Challenge 1

Download the gapminder.csv file and save it in your project directory.

Open the file in a text editor and describe what statistics are recorded.

Solution to Challenge 1

Using the ideas discussed previously about project structure, we will save the files into a data directory within our project. We can then access them with a relative path data/gapminder.csv.

Opening the file we can see that there are six columns of data: a country name and continent, the year that the data was recorded, and the life expectancy, population and GDP per capita.

To load this data into R, we will use the read_csv() function, which becomes available once the tidyverse is loaded. For reading in different data formats, or for control of the import options, see the optional section on reading data into R.

library(tidyverse)

gapminder <- read_csv("data/gapminder.csv")
Rows: 1704 Columns: 6
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent
dbl (4): year, lifeExp, pop, gdpPercap

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows
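The message printed by read_csv() mentions two options: show_col_types = FALSE to quiet it, and spec() to retrieve the column specification afterwards. The sketch below tries both on a few lines of inline CSV text rather than on the real file. The text is wrapped in I() so that readr treats it as literal data, and the object names csv_text and small are made up for illustration.

```r
library(readr)

# Literal CSV text wrapped in I() stands in for a file on disk
csv_text <- I("country,continent,year\nAfghanistan,Asia,1952\nAfghanistan,Asia,1957\n")

# show_col_types = FALSE suppresses the column specification message
small <- read_csv(csv_text, show_col_types = FALSE)

# spec() shows the column types that read_csv() guessed
spec(small)
```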

What’s a tibble?

You might notice in the output above that it calls itself a tibble, rather than a data frame. A tibble is just the tidyverse’s version of a data frame that has a few behaviours tweaked to make it behave more predictably. To get some idea of the differences, convert gapminder to a base R data.frame with as.data.frame(gapminder) and compare its printed output with the tibble’s.

We will try to refer to them as data frames throughout these lessons. But know that dataframes and tibbles are interchangeable for our purposes.
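One of the tweaked behaviours can be seen in a small sketch. The objects tb and df and their values are made up for illustration; the point is only that selecting a single column with [ , ] gives a different kind of result for each.

```r
library(tibble)

# A tiny tibble and its base R data.frame equivalent
tb <- tibble(country = c("A", "B"), lifeExp = c(50.1, 60.2))
df <- as.data.frame(tb)

class(tb)  # a tibble is still a data.frame, with extra classes on top
class(df)

# Selecting one column with [ , ] keeps a tibble a tibble,
# but drops a base data.frame down to a plain vector
is.data.frame(tb[, "lifeExp"])
is.data.frame(df[, "lifeExp"])
```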

Inspecting a dataframe

Looking at the printed output from the gapminder dataframe can tell us a lot of information about it. The first line tells us the dimensions of the data: in this case there are 1,704 rows and 6 columns. This information can also be found with:

nrow(gapminder)
[1] 1704
ncol(gapminder)
[1] 6
dim(gapminder)
[1] 1704    6

The next row of output gives the names of the columns, which can also be found using colnames().

colnames(gapminder)
[1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
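These functions are closely related, which a small hand-made data frame can show. The object toy and its values are made up for illustration.

```r
# A small made-up data frame with 3 rows and 2 columns
toy <- data.frame(country = c("A", "B", "C"), year = c(1952, 1957, 1962))

# dim() returns the number of rows followed by the number of columns
dim(toy)
identical(dim(toy), c(nrow(toy), ncol(toy)))

# For a data frame, names() gives the same result as colnames()
identical(names(toy), colnames(toy))
```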

The next row tells you the data type of each column, followed by the data itself. We won’t discuss data types in too much detail, but see this section for a full description.

Should you need the data from a specific column, it can be accessed using the $ notation.

gapminder$lifeExp
 [1] 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763
[11] 42.129 43.828 55.230 59.280 64.820 66.220 67.690 68.930 70.420 72.000
[21] 71.581 72.950 75.651 76.423 43.077 45.685 48.303 51.407 54.518 58.014
[31] 61.368 65.799 67.744 69.152 70.994 72.301 30.015 31.999 34.000 35.985
[41] 37.928 39.483 39.942 39.906 40.647 40.963 41.003 42.731 62.485 64.399
[51] 65.142 65.634 67.065 68.481 69.942 70.774 71.868 73.275 74.340 75.320
[61] 69.120 70.330 70.930 71.100 71.930 73.490 74.740 76.320 77.560 78.830
[71] 80.370 81.235 66.800 67.480 69.540 70.140 70.630 72.170 73.180 74.940
 [ reached getOption("max.print") -- omitted 1624 entries ]
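Because $ returns the column as an ordinary vector, the result can be passed straight to other functions. The sketch below uses a small made-up tibble called toy in place of gapminder, and also shows the equivalent [[ ]] notation.

```r
library(tibble)

# Made-up values standing in for the real data
toy <- tibble(country = c("A", "B", "C"), lifeExp = c(40, 50, 60))

toy$lifeExp        # the column as a plain numeric vector
toy[["lifeExp"]]   # the same column, selected by name with [[ ]]

mean(toy$lifeExp)  # vector functions work directly on the column
```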

Overview of a dataframe

There are many other ways to view the data, its data types, and its structure. To look at the first 5 rows of the gapminder dataset, use head():

head(gapminder, 5)
# A tibble: 5 × 6
  country     continent  year lifeExp      pop gdpPercap
  <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.

and use tail() to look at the last 10 rows:

tail(gapminder, 10)
# A tibble: 10 × 6
   country  continent  year lifeExp      pop gdpPercap
   <chr>    <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Zimbabwe Africa     1962    52.4  4277736      527.
 2 Zimbabwe Africa     1967    54.0  4995432      570.
 3 Zimbabwe Africa     1972    55.6  5861135      799.
 4 Zimbabwe Africa     1977    57.7  6642107      686.
 5 Zimbabwe Africa     1982    60.4  7636524      789.
 6 Zimbabwe Africa     1987    62.4  9216418      706.
 7 Zimbabwe Africa     1992    60.4 10704340      693.
 8 Zimbabwe Africa     1997    46.8 11404948      792.
 9 Zimbabwe Africa     2002    40.0 11926563      672.
10 Zimbabwe Africa     2007    43.5 12311143      470.
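The second argument to head() and tail() is the number of rows to return. When it is left out, both functions return six rows. A quick sketch with a made-up ten-row tibble called toy:

```r
library(tibble)

# Ten made-up rows
toy <- tibble(id = 1:10, value = id * 2)

nrow(head(toy))  # with no second argument, six rows are returned
head(toy, 3)     # the first three rows
tail(toy, 3)     # the last three rows
```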

To look at the structure of the data (particularly when there are many columns) use glimpse():

glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
$ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

And use summary() to get a summarised breakdown of each column:

summary(gapminder)
   country           continent              year         lifeExp     
 Length:1704        Length:1704        Min.   :1952   Min.   :23.60  
 Class :character   Class :character   1st Qu.:1966   1st Qu.:48.20  
 Mode  :character   Mode  :character   Median :1980   Median :60.71  
                                       Mean   :1980   Mean   :59.47  
                                       3rd Qu.:1993   3rd Qu.:70.85  
                                       Max.   :2007   Max.   :82.60  
      pop              gdpPercap       
 Min.   :6.001e+04   Min.   :   241.2  
 1st Qu.:2.794e+06   1st Qu.:  1202.1  
 Median :7.024e+06   Median :  3531.8  
 Mean   :2.960e+07   Mean   :  7215.3  
 3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
 Max.   :1.319e+09   Max.   :113523.1  

Factors

One special type of data in R is a factor, which is most often used in statistical modelling for categorical data. We can look at what these are by explicitly reading part of our data frame in as a factor.

Challenge 2

Read in a new gapminder data frame with the continent column as a factor using

gapminder_factor <- read_csv("data/gapminder.csv", col_types = cols(continent = col_factor()))

What does the output from glimpse and summary on this new data frame show you? How is it different from our original gapminder data frame and why do you think it has a different format for different columns?

Solution to Challenge 2

The output from summary() changes depending on the class of the data in the column. For the numeric columns it shows the minimum, maximum, mean and quartile values. For the factor column, it shows a count of each category. For the character column it cannot provide any useful summary.
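The same behaviour can be seen by calling summary() on individual vectors. In this sketch the three vectors and their values are made up for illustration.

```r
# The same made-up categories stored as character and as factor,
# plus a numeric vector
chr <- c("Asia", "Asia", "Europe")
fct <- factor(chr)
num <- c(40, 50, 60)

summary(chr)  # only length, class and mode
summary(fct)  # a count for each level
summary(num)  # minimum, quartiles, mean and maximum
```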

Let’s look at what has changed in this column.

class(gapminder_factor$continent)
[1] "factor"
gapminder_factor$continent
 [1] Asia     Asia     Asia     Asia     Asia     Asia     Asia     Asia    
 [9] Asia     Asia     Asia     Asia     Europe   Europe   Europe   Europe  
[17] Europe   Europe   Europe   Europe   Europe   Europe   Europe   Europe  
[25] Africa   Africa   Africa   Africa   Africa   Africa   Africa   Africa  
[33] Africa   Africa   Africa   Africa   Africa   Africa   Africa   Africa  
[41] Africa   Africa   Africa   Africa   Africa   Africa   Africa   Africa  
[49] Americas Americas Americas Americas Americas Americas Americas Americas
[57] Americas Americas Americas Americas Oceania  Oceania  Oceania  Oceania 
[65] Oceania  Oceania  Oceania  Oceania  Oceania  Oceania  Oceania  Oceania 
[73] Europe   Europe   Europe   Europe   Europe   Europe   Europe   Europe  
 [ reached getOption("max.print") -- omitted 1624 entries ]
Levels: Asia Europe Africa Americas Oceania

Here you can see that the continent column in the gapminder_factor data set is a factor that lists the continent the country is in. In the final line, you can see that it lists the possible categories of the data with Levels: Asia Europe Africa Americas Oceania. The levels of this factor can also be accessed using:

levels(gapminder_factor$continent)
[1] "Asia"     "Europe"   "Africa"   "Americas" "Oceania" 

But what happens when we look more closely at this data using glimpse()?

glimpse(gapminder_factor$continent)
 Factor w/ 5 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...

This tells us we have a factor with 5 levels, just like we expected. But when it comes to showing the data itself, all we see is a series of numbers. This is because, to R, a factor is really just a vector of integers underneath, with the levels telling it how to map each integer to the actual category. So a value of 1 maps to the first level (Asia), a value of 2 maps to the second level (Europe), and so on.
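The mapping between integers and levels can be reproduced by hand. In this sketch the factor is a short made-up one, with its levels given explicitly in the same order as the first three levels above.

```r
# A short made-up factor with explicitly ordered levels
continent <- factor(c("Asia", "Europe", "Asia", "Africa"),
                    levels = c("Asia", "Europe", "Africa"))

as.integer(continent)                     # 1 2 1 3
levels(continent)                         # "Asia" "Europe" "Africa"

# Using the integers to index the levels recovers the categories
levels(continent)[as.integer(continent)]  # "Asia" "Europe" "Asia" "Africa"
```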

Expecting factors to behave as characters, rather than integers, is a common cause of errors for people new to R. So always remember to inspect your data with the methods shown here to make sure it is of the right type.
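A typical instance of this error, sketched with a made-up factor whose labels happen to look like numbers: converting it directly to a number returns the underlying integer codes, not the values the labels spell out.

```r
# A made-up factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

as.integer(f)                # 1 2 1 3: the integer codes, not the labels
as.numeric(as.character(f))  # 10 20 10 30: convert via character first
```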

Your turn

So far, you’ve been walked through investigating a dataframe. Let’s use those skills to explore a data set you have not yet been exposed to.

Challenge 3

The storms data set comes built in to the tidyverse packages and contains information on hurricanes recorded in the Atlantic Ocean. It can be accessed just by typing storms into your console.

Using the tools you have learned so far, explore this data set and describe what it contains. Explain both the structural features of the data set as a whole, as well as its content.

Hint

If you are encountering a data type you haven’t seen before, try looking at the class() of the column to see if that helps you work out what it is.

Solution to Challenge 3

The object storms is a dataframe (a tibble) with 10,010 rows and 13 columns.

  • name and status are character vectors.
  • year, month, hour, lat, long, ts_diameter and hu_diameter are numeric vectors.
  • day, wind, and pressure are integer vectors.
  • category is an ordered factor vector, with levels -1 < 0 < 1 < 2 < 3 < 4 < 5
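One possible way to arrive at a description like this is sketched below. It assumes storms is loaded from the dplyr package (part of the tidyverse). The number of rows and the column names and types depend on the installed version of dplyr, so your output may not match the solution above exactly. The name col_classes is made up for illustration.

```r
library(dplyr)

# Overall size of the data set
dim(storms)

# The first class of every column, collected into one named character vector
col_classes <- vapply(storms, function(x) class(x)[1], character(1))
col_classes

# Following the hint: inspect the full class of an unfamiliar column
class(storms$category)
```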

Key Points

  • Dataframes (or tibbles in the tidyverse) are lists where each element is a vector of the same length

  • Use read_csv() to read comma separated files into a data frame

  • Use nrow(), ncol(), dim(), or colnames() to find information about a dataframe

  • Use head(), tail(), summary(), or glimpse() to inspect a dataframe’s content