Introduction to R

Key Points

Introduction to R and R Studio
  • Use RStudio to write and run R programs.

  • Set up an RStudio project for each analysis you are performing.

Using R
  • R has the usual arithmetic operators and mathematical functions.

  • Use <- to assign values to variables.

  • Use ls() to list the objects in the current environment.

  • Use rm() to delete objects from the current environment.

  • Use install.packages() to install packages.
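
As a minimal sketch of the points above, the following lines can be typed at the R console. The names radius and area are hypothetical, chosen only for illustration, and the install.packages() call is left commented out because installation is a one-off step.

```r
# Arithmetic operators and mathematical functions
3 + 4 * 2      # multiplication is evaluated before addition, giving 11
sqrt(16)       # 4

# Assign values with <- (radius and area are hypothetical names)
radius <- 2
area <- pi * radius^2

# List the objects in the current environment, then delete one of them
ls()
rm(radius)
ls()

# install.packages("dplyr")  # one-off installation; needs network access
```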

Data
  • The basic data types in R are double, integer, complex, logical, and character.

  • A vector is an ordered collection of data of the same type.

  • Create vectors with c().

  • A list is an ordered collection of data that can be of any type.
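
The difference between vectors and lists is easiest to see by checking types with typeof(). This is a small sketch using made-up values; heights, mixed, and record are hypothetical names.

```r
# c() combines values into a vector; every element has the same type
heights <- c(1.62, 1.75, 1.80)
typeof(heights)            # "double"
typeof(c(1L, 2L))          # "integer"
typeof(c(TRUE, FALSE))     # "logical"
typeof(c("a", "b"))        # "character"
typeof(1 + 2i)             # "complex"

# Mixing types inside c() coerces every element to one common type
mixed <- c(1, "two", TRUE)
typeof(mixed)              # "character"

# A list keeps the type of each element
record <- list(name = "Ada", age = 36, member = TRUE)
sapply(record, typeof)
```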

Getting help in R
  • Use help() to get online help in R.

Reading Data In
  • Use read_csv() or read_tsv() from the readr package to read in comma-separated or tab-separated plain text data.

  • Use read_excel() from the readxl package to read in Excel files.
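
The sketch below reads delimited text with readr. It assumes the readr package is installed, and it writes two small hypothetical files to a temporary location so that it does not depend on any real data file. read_excel() is not shown because it needs an existing Excel workbook.

```r
library(readr)

# Create a small hypothetical CSV file in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,species,mass", "1,cat,4.2", "2,dog,12.5"), csv_path)

# read_csv() returns a tibble and guesses a type for each column
animals <- read_csv(csv_path)
animals

# read_tsv() works the same way for tab-separated files
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("id\tspecies", "1\tcat", "2\tdog"), tsv_path)
animals_tsv <- read_tsv(tsv_path)
animals_tsv
```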

Tidy Data
  • For data to be tidy, each variable must be in its own column.

  • For data to be tidy, each case must be in its own row.

  • For data to be tidy, each value must be in its own cell.

Dataframes
  • Dataframes (or tibbles in the tidyverse) are lists where each element is a vector of the same length.

  • Use read_csv() to read comma-separated files into a dataframe.

  • Use nrow(), ncol(), dim(), or colnames() to find information about a dataframe.

  • Use head(), tail(), summary(), or glimpse() to inspect a dataframe’s content.
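
The inspection functions above can be tried on a small tibble. This sketch assumes dplyr is installed (it re-exports tibble() and glimpse()); the animals data are invented for illustration.

```r
library(dplyr)

# A hypothetical tibble with one row per animal
animals <- tibble(
  species = c("cat", "dog", "rabbit"),
  mass_kg = c(4.2, 12.5, 1.8)
)

# Size and column names
nrow(animals)
ncol(animals)
dim(animals)
colnames(animals)

# Content
head(animals, 2)
summary(animals)
glimpse(animals)

# A dataframe is a list of equal-length vectors, one per column
is.list(animals)
lengths(animals)
```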

Selecting columns
  • Use select() to choose variables from a dataframe.

  • Helper functions make it easier to select the correct columns.

  • Use rename() to rename variables without dropping columns.
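
A short sketch of select(), a selection helper, and rename(), assuming dplyr is installed. The surveys tibble and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical survey data
surveys <- tibble(
  id           = 1:3,
  weight_g     = c(40, 48, 52),
  weight_class = c("small", "medium", "medium"),
  year         = c(2019, 2019, 2020)
)

# Choose columns by name
ids_years <- select(surveys, id, year)

# starts_with() is one of the helpers; it matches on a column-name prefix
weights <- select(surveys, starts_with("weight"))

# rename() changes a column name (new = old) and keeps every column
renamed <- rename(surveys, record_id = id)

colnames(weights)
colnames(renamed)
```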

Extract rows
  • Use filter() to choose data based on values.

  • The %in% operator lets filter() keep rows whose values match any of a set of possible values.
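
For illustration, the sketch below filters a small invented tibble, first with a comparison and then with %in%. It assumes dplyr is installed; surveys and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical data
surveys <- tibble(
  species = c("cat", "dog", "rabbit", "dog"),
  mass_kg = c(4.2, 12.5, 1.8, 30.1)
)

# Keep rows where a condition on the values is TRUE
heavy <- filter(surveys, mass_kg > 10)

# %in% is TRUE for each value found in the vector on its right
small_pets <- filter(surveys, species %in% c("cat", "rabbit"))

heavy
small_pets
```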

Creating New Columns
  • Use mutate() to create new variables from old ones.

  • You can create new variables using any function that returns a vector of the same length as the data frame.

  • Use group_by() to group your data based on a variable.
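
The following sketch shows mutate() on its own and then on grouped data, where functions such as mean() are evaluated within each group. It assumes dplyr is installed; the data and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical data
surveys <- tibble(
  species = c("cat", "cat", "dog", "dog"),
  mass_kg = c(4, 5, 10, 30)
)

# A new column computed from an existing one
with_grams <- mutate(surveys, mass_g = mass_kg * 1000)

# After group_by(), mean() is calculated separately for each species
grouped <- group_by(surveys, species)
centred <- ungroup(mutate(grouped, diff_from_mean = mass_kg - mean(mass_kg)))

with_grams
centred
```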

Summarise and Grouping
  • summarise() creates new variables that summarise each group in a single row.

  • n() counts the rows in each group.
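
A minimal sketch of a grouped summary, assuming dplyr is installed. The surveys data and the output column names mean_mass and count are hypothetical.

```r
library(dplyr)

# Hypothetical data
surveys <- tibble(
  species = c("cat", "cat", "dog", "dog"),
  mass_kg = c(4, 5, 10, 30)
)

# One output row per species; n() counts the rows in each group
grouped <- group_by(surveys, species)
stats <- summarise(grouped, mean_mass = mean(mass_kg), count = n())
stats
```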

Adding and Combining Datasets
  • bind_rows() combines datasets that share the same variables.

  • The join family of functions provides a complete range of methods to merge datasets that share common variables.
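
As a sketch, the example below stacks two tables with bind_rows() and then compares two members of the join family. It assumes dplyr is installed; all three tables are invented.

```r
library(dplyr)

# Two hypothetical tables with the same variables
survey_2019 <- tibble(species = c("cat", "dog"), count = c(3, 5))
survey_2020 <- tibble(species = c("cat", "rabbit"), count = c(4, 2))

# bind_rows() stacks them on top of each other
all_years <- bind_rows(survey_2019, survey_2020)

# A hypothetical lookup table that shares the species variable
diets <- tibble(species = c("cat", "dog"), diet = c("carnivore", "omnivore"))

# left_join() keeps every row of the first table; unmatched rows get NA
left <- left_join(survey_2020, diets, by = "species")

# inner_join() keeps only rows that have a match in both tables
inner <- inner_join(survey_2020, diets, by = "species")

all_years
left
inner
```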

Putting it all together
  • Data analyses can be broken down into discrete stages.

  • Most data analysis stages fit into a small number of types.

  • Pipes pass their left-hand side through as the first argument of the right-hand side.

  • Pipes make your code more readable, but be careful of going overboard.
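
To illustrate the readability point, here is the same analysis written as nested calls and as a pipeline. This is a sketch assuming dplyr is installed (it provides the %>% pipe); the data are hypothetical.

```r
library(dplyr)

# Hypothetical data
surveys <- tibble(
  species = c("cat", "cat", "dog", "rabbit"),
  mass_kg = c(4, 5, 10, 1.8)
)

# Nested calls have to be read from the inside out
nested <- summarise(
  group_by(filter(surveys, mass_kg > 2), species),
  mean_mass = mean(mass_kg)
)

# With pipes, each stage receives the previous result as its first argument
piped <- surveys %>%
  filter(mass_kg > 2) %>%
  group_by(species) %>%
  summarise(mean_mass = mean(mass_kg))

nested
piped
```

Both versions run the same three stages (filter, group, summarise); only the layout differs.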

Gather & Spread
  • Use the tidyr package to change the layout of dataframes.

  • Use gather() to go from wide to long format.

  • Use spread() to go from long to wide format.
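
The sketch below reshapes a small invented table from wide to long and back again. It assumes tidyr is installed; the column names y2019 and y2020 are hypothetical. In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider().

```r
library(tidyr)

# Hypothetical wide data: one column of counts per year
wide <- data.frame(
  species = c("cat", "dog"),
  y2019   = c(3, 5),
  y2020   = c(4, 6),
  stringsAsFactors = FALSE
)

# Wide to long: the year columns become key/value pairs
long <- gather(wide, key = "year", value = "count", y2019, y2020)

# Long to wide: each distinct year becomes a column again
wide_again <- spread(long, key = year, value = count)

long
wide_again
```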

Cleaning Data
  • Real-world data is messy.

  • It takes time and care to prepare it for analysis and visualisation.

Writing Data
  • Intermediate data objects do not need to be written to disk.

  • Write data in an appropriate format.

  • Write data to the most useful location.
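
As one possible illustration, a final result can be written as CSV with write_csv() from readr. This sketch assumes readr is installed. The results table and the file name mean_mass.csv are hypothetical, and a temporary directory stands in for whatever location is most useful in a real project.

```r
library(readr)

# Hypothetical final result of an analysis
results <- data.frame(
  species   = c("cat", "dog"),
  mean_mass = c(4.5, 20)
)

# Build the output path explicitly so the destination is clear
out_path <- file.path(tempdir(), "mean_mass.csv")
write_csv(results, out_path)

# Check that the file exists and can be read back
file.exists(out_path)
read_back <- read_csv(out_path)
read_back
```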

Reproducibility
  • A script is a discrete unit of analysis.

  • A script will be run in the context of an environment.

  • Software (and compute) dependencies need to be considered.
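
One simple way to record software dependencies, sketched here with base R only, is to save the session details alongside the results. The package queried below (stats) is just an example that ships with every R installation.

```r
# Record the R version, operating system, and attached packages
info <- sessionInfo()
print(info)

# The R version on its own
R.version.string

# The installed version of one specific package
packageVersion("stats")
```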