This lesson is still being designed and assembled (Pre-Alpha version)

What is ML: A Taxonomy of Machine Learning

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • What is the difference between classification and regression?

  • What is the difference between supervised and unsupervised learning?

  • How can I quantify machine learning algorithm performance?

Objectives
  • Understand the landscape of machine learning algorithms.

  • Use this understanding to identify the appropriate type of algorithm to use for a given problem.

  • Understand the importance of performance metrics.

Machine learning taxonomy (figure: scikit-learn algorithm cheat sheet)

The following taxonomy draws heavily from Chapter 5, "Machine Learning Basics", of Goodfellow, Bengio, & Courville (2016).

The Experience, $E$

Typically, the experience a machine learning algorithm encounters during learning takes the form of a dataset, or exposure to a dataset (or a subset of one). A dataset is a collection of examples, each example comprising a set of features that have been quantitatively measured from some object or event. We typically represent an example as a vector $\boldsymbol{x} \in \mathbb{R}^n$, where each entry $x_i$ of the vector is another feature. Broadly speaking, experiences are categorised as either unsupervised or supervised.

Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset.

Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target.

Roughly speaking, unsupervised learning involves observing several examples of a random vector $\mathbf{x}$ and attempting to implicitly or explicitly learn the probability distribution $p(\mathbf{x})$, or some interesting properties of that distribution; while supervised learning involves observing several examples of a random vector $\mathbf{x}$ and an associated value or vector $\mathbf{y}$, then learning to predict $\mathbf{y}$ from $\mathbf{x}$, usually by estimating $p(\mathbf{y} \mid \mathbf{x})$.

(Goodfellow, Bengio, & Courville, 2016)
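
As a minimal sketch of the distinction drawn in the quotation above, assume a toy dataset with a single discrete feature, so that both distributions can be estimated simply by counting. The data and the names `observations`, `p_x` and `p_y_given_x` are invented for illustration and do not come from Goodfellow et al.

```python
from collections import Counter

# Hypothetical toy data: each example is (feature x, label y).
# x is the weather; y records whether a picnic took place.
observations = [
    ("sunny", "yes"), ("sunny", "yes"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "no"), ("rainy", "no"), ("rainy", "yes"), ("rainy", "no"),
]

# Unsupervised view: ignore the labels and estimate p(x) from the features alone.
x_counts = Counter(x for x, _ in observations)
p_x = {x: n / len(observations) for x, n in x_counts.items()}

# Supervised view: use the labels and estimate p(y | x) for each feature value.
pair_counts = Counter(observations)
p_y_given_x = {(y, x): n / x_counts[x] for (x, y), n in pair_counts.items()}

print("p(x):", p_x)
print("p(y = yes | x = sunny):", p_y_given_x[("yes", "sunny")])
```

Real datasets rarely allow direct counting like this, because features are usually continuous and high-dimensional; the sketch only shows which quantity each kind of learning is trying to estimate.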

The types of experience are not necessarily mutually exclusive. A single problem may call for either of the above techniques, often both, and sometimes a hybrid of the two.

For completeness, when characterising the types of experience available to a machine learning algorithm we also include reinforcement learning. Reinforcement learning algorithms work with a dataset that is not necessarily fixed. Instead, they interact with their environment, so that there is a feedback loop between the learning system and its experiences.
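
The feedback loop described above can be sketched with a very small example. This is an illustrative epsilon-greedy agent acting in a made-up environment with three actions; the reward values, the function names `pull` and `run_agent`, and the parameter choices are all assumptions made for this sketch, not part of the lesson's source material.

```python
import random

# Hypothetical environment: three actions with different average rewards.
# The agent does not know these values; it only sees the rewards it receives.
TRUE_MEAN_REWARD = [0.2, 0.8, 0.5]


def pull(action, rng):
    """Return a noisy reward for the chosen action."""
    return TRUE_MEAN_REWARD[action] + rng.uniform(-0.1, 0.1)


def run_agent(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_actions = len(TRUE_MEAN_REWARD)
    estimates = [1.0] * n_actions  # optimistic start, so every action gets tried
    counts = [0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick a random action
        else:
            # exploit: pick the action that currently looks best
            action = max(range(n_actions), key=lambda a: estimates[a])
        reward = pull(action, rng)  # the environment responds to the action
        counts[action] += 1
        # Update the running mean of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_agent()
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated rewards:", [round(e, 2) for e in estimates])
print("times each action was chosen:", counts)
print("action the agent prefers:", best_action)
```

Unlike the supervised and unsupervised cases, the examples here are not fixed in advance: which rewards the agent observes depends on the actions it chose earlier.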

Challenge

Try to identify the type of experience for each of the examples below.

  1. A set of holiday images taken from Flickr with their associated locations.
  2. A set of satellite images continuously collected across the globe.
  3. A time series of temperatures recorded across a range of sites.
  4. A game of Go.
  5. A stream of news articles.

Solution

  1. Supervised.
  2. Unsupervised.
  3. Supervised.
  4. Reinforcement.
  5. Unsupervised.

Discussion

What type of datasets (experiences) have you worked with in the past? Are there any unique experiences you can identify in your domain that might be applicable to a learning algorithm?

The Task, $T$

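Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks, as listed in Chapter 5 of Goodfellow, Bengio, & Courville (2016), include the following:

  • Classification
  • Classification with missing inputs
  • Regression
  • Transcription
  • Machine translation
  • Structured output
  • Anomaly detection
  • Synthesis and sampling
  • Imputation of missing values
  • Denoising
  • Density estimation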

Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.
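
Classification and regression, two of the task types that appear in the challenge solution below, differ mainly in what is predicted: a category or a real number. The following sketch contrasts them on invented one-dimensional data. The nearest-neighbour classifier and the least-squares line are just two simple choices made for illustration, and all names and values here are hypothetical.

```python
# Classification: the target is a category.
# Hypothetical training data: one feature (size) and one label per example.
sizes = [1.0, 1.5, 4.0, 5.0]
categories = ["small", "small", "large", "large"]


def classify(x):
    """1-nearest-neighbour: return the label of the closest training example."""
    nearest = min(range(len(sizes)), key=lambda i: abs(sizes[i] - x))
    return categories[nearest]


# Regression: the target is a real number.
# Hypothetical training data: one feature and one numeric target per example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)


def regress(x):
    """Predict a real-valued target from the fitted line."""
    return slope * x + intercept


print("classification output:", classify(1.2), classify(4.4))
print("regression output:", round(regress(6.0), 2))
```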

Challenge

Try to identify the type of task for each of the examples below.

  1. Estimate required steering wheel angle given an image from a dash-cam.
  2. Predict the rating a user might assign a particular movie, given a handful of ratings from other movies and users.
  3. Identify potentially malicious traffic in a computer network.
  4. Convert page layout sketches into functioning HTML.
  5. Identify the sub-surface structure based on sensor readings.

Solution

  1. Regression.
  2. Imputation.
  3. Anomaly detection.
  4. Translation.
  5. Classification.

Discussion

Are any of these tasks applicable to datasets that you have? Would any of these tasks solve some interesting science questions you have?

The Performance Measure, $P$

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure is specific to the task being carried out by the system. For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model, that is, the proportion of examples for which the model produces the correct output.

Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
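
As a rough sketch of these two ideas, the code below defines accuracy (a common measure for classification) and mean squared error (a common measure for regression), then evaluates a deliberately simple classifier on a held-out test set. The data, the 80/20 split and the threshold "model" are assumptions chosen to keep the example short; they are not prescribed by the lesson.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean(values):
    return sum(values) / len(values)


# Hypothetical data: one feature in [0, 10]; the label is 1 when it exceeds 5.
rng = random.Random(0)
data = [(x, int(x > 5)) for x in (rng.uniform(0, 10) for _ in range(100))]

# Hold out 20% of the examples; the model never sees them during training.
rng.shuffle(data)
train, test = data[:80], data[80:]

# "Training": place a threshold midway between the two class means.
mean_0 = mean([x for x, y in train if y == 0])
mean_1 = mean([x for x, y in train if y == 1])
threshold = (mean_0 + mean_1) / 2

# Evaluation: measure performance only on the unseen test examples.
test_true = [y for _, y in test]
test_pred = [int(x > threshold) for x, _ in test]
print("test accuracy:", accuracy(test_true, test_pred))
```

Reporting accuracy on `train` instead of `test` would tend to overstate how well the model works on new data, which is why the two sets are kept separate.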

Discussion

What are the useful metrics of performance for some of the tasks you identified above? Are they easy to capture or express mathematically?


Challenge

  1. Assume we are given the task of building a system to distinguish healthy crops from unhealthy crops. What is in an unhealthy crop that lets us know that it is unhealthy? How can the computer detect an unhealthy crop through image analysis? What would we like the computer to do if it detects an unhealthy crop?

  2. Write the phrase “data school” ten times on a piece of paper, and ask a friend to do the same. Analysing these twenty images, try to find features (types of strokes, curvatures, loops, how you make dots, and so on) that discriminate your handwriting from your friend's.

  3. In estimating the price of a used car, it makes more sense to estimate the percent depreciation over the original price than to estimate the absolute price. Why?

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Retrieved from https://www.deeplearningbook.org

Key Points

  • There are a number of machine learning algorithms available. Which one you use depends on the type of data you have, the problem you are trying to solve and your definition of ‘what is good’.

  • There is typically more than one way to solve a problem; which one you choose usually depends on how you frame what you are doing.