Databases and SQL: Glossary

Key Points

Why use Relational Databases
  • Relational databases are an efficient way to store and query data, making use of relationships between multiple tables of information.

  • The most common language for interacting with such databases is SQL (Structured Query Language).

  • Implementations include MySQL, SQL Server, Oracle and Postgres.

Selecting Data
  • Use SELECT… FROM… to get values from a database table.

  • SQL keywords are case-insensitive (but data is case-sensitive).
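
The two points above can be tried out with a minimal sketch. It assumes Python's built-in sqlite3 module and a hypothetical `staff` table invented for illustration; other database managers may treat the case of table and column names differently.

```python
import sqlite3

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id TEXT, given_name TEXT, family_name TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("ann", "Ann", "Lee"), ("bob", "Bob", "Ortiz")],
)

# SELECT ... FROM ... names the fields wanted and the table they come from.
# Keywords can be written in any case, so these two queries are equivalent here.
upper = conn.execute("SELECT family_name, given_name FROM staff").fetchall()
lower = conn.execute("select family_name, given_name from staff").fetchall()

# The stored data is case-sensitive: 'Lee' and 'lee' are different strings.
exact = conn.execute("SELECT id FROM staff WHERE family_name = 'Lee'").fetchall()
wrong_case = conn.execute("SELECT id FROM staff WHERE family_name = 'lee'").fetchall()

print(upper, exact, wrong_case)
```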

Sorting and Removing Duplicates
  • The records in a database table are not intrinsically ordered: if we want to display them in some order, we must specify that explicitly with ORDER BY.

  • The values in a database are not guaranteed to be unique: if we want to eliminate duplicates, we must specify that explicitly as well using DISTINCT.
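
As a rough illustration of ORDER BY and DISTINCT, the sketch below uses Python's sqlite3 module with a hypothetical `readings` table (the table, its fields, and its values are assumptions made up for this example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, quantity TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("B", "temp", 21.5), ("A", "sal", 0.13), ("B", "sal", 0.09), ("A", "temp", 18.0)],
)

# DISTINCT removes duplicate rows from the result; ORDER BY fixes the order.
distinct_sites = conn.execute(
    "SELECT DISTINCT site FROM readings ORDER BY site"
).fetchall()

# Sort by site ascending, then by value descending within each site.
sorted_rows = conn.execute(
    "SELECT site, value FROM readings ORDER BY site ASC, value DESC"
).fetchall()

print(distinct_sites)
print(sorted_rows)
```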

Filtering
  • Use WHERE to specify conditions that records must meet in order to be included in a query’s results.

  • Use AND, OR, and NOT to combine tests.

  • Filtering is done on whole records, so conditions can use fields that are not actually displayed.

  • Write queries incrementally.
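
One way to picture incremental query writing is to add a condition at each step and check the result before going further. The sketch below assumes Python's sqlite3 module and a hypothetical `readings` table; the final query filters on `site` and `quantity` even though only `value` is displayed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, quantity TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("A", "sal", 0.13),
        ("B", "sal", 0.09),
        ("C", "sal", 0.21),
        ("A", "temp", 18.0),
        ("B", "sal", 41.6),  # implausibly high salinity, to be filtered out
    ],
)

# Step 1: one simple condition.
step1 = conn.execute(
    "SELECT site, quantity, value FROM readings WHERE quantity = 'sal'"
).fetchall()

# Step 2: combine tests with AND and OR (parentheses make the grouping explicit).
step2 = conn.execute(
    "SELECT site, quantity, value FROM readings "
    "WHERE quantity = 'sal' AND (site = 'A' OR site = 'B')"
).fetchall()

# Step 3: add NOT, and display only value although the filter uses other fields.
step3 = conn.execute(
    "SELECT value FROM readings "
    "WHERE quantity = 'sal' AND (site = 'A' OR site = 'B') AND NOT (value > 1.0) "
    "ORDER BY value"
).fetchall()

print(len(step1), len(step2), step3)
```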

Calculating New Values
  • Queries can do the usual arithmetic operations on values.

  • Use UNION to combine the results of two or more queries.
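
A small sketch of both points, assuming Python's sqlite3 module and a hypothetical `readings` table in which one person recorded temperatures in Fahrenheit and another in Celsius. The first query converts with ordinary arithmetic, and UNION merges it with the second.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (person TEXT, quantity TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("ann", "temp_f", 68.0), ("bob", "temp_c", 21.5), ("ann", "temp_f", 50.0)],
)

# Convert Fahrenheit readings to Celsius, then combine with the Celsius readings.
# ORDER BY 2 sorts the combined result by its second column.
celsius = conn.execute(
    """
    SELECT person, round((value - 32) * 5 / 9, 1) FROM readings
    WHERE quantity = 'temp_f'
    UNION
    SELECT person, value FROM readings
    WHERE quantity = 'temp_c'
    ORDER BY 2
    """
).fetchall()

print(celsius)
```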

Missing Data
  • Databases use a special value called NULL to represent missing information.

  • Almost all operations on NULL produce NULL.

  • Queries can test for NULLs using IS NULL and IS NOT NULL.
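
The behaviour of NULL is easy to get wrong, so a short sketch may help. It assumes Python's sqlite3 module (which maps NULL to Python's None) and a hypothetical `readings` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (person TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("ann", 1.0), (None, 2.0), ("bob", 3.0)],
)

# Arithmetic and comparisons involving NULL produce NULL.
null_ops = conn.execute("SELECT NULL + 1, NULL = NULL").fetchone()

# Comparing with = NULL never matches, because the comparison itself is NULL.
eq_null = conn.execute("SELECT COUNT(*) FROM readings WHERE person = NULL").fetchone()[0]

# IS NULL and IS NOT NULL are the tests that work.
is_null = conn.execute("SELECT COUNT(*) FROM readings WHERE person IS NULL").fetchone()[0]
not_null = conn.execute("SELECT COUNT(*) FROM readings WHERE person IS NOT NULL").fetchone()[0]

# A record with a NULL person is also excluded by != 'ann'.
not_ann = conn.execute("SELECT person FROM readings WHERE person != 'ann'").fetchall()

print(null_ops, eq_null, is_null, not_null, not_ann)
```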

Aggregation
  • Use aggregation functions to combine multiple values.

  • Aggregation functions ignore NULL values.

  • Aggregation happens after filtering.

  • Use GROUP BY to combine subsets separately.

  • If no aggregation function is specified for a field, the query may return an arbitrary value for that field.
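
The points above can be sketched as follows, assuming Python's sqlite3 module and a hypothetical `readings` table. The last query relies on SQLite accepting a field with no aggregation function; many other database managers reject such a query instead of returning an arbitrary value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (person TEXT, quantity TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("ann", "sal", 0.1),
        ("ann", "sal", 0.3),
        ("bob", "sal", None),  # missing measurement
        ("bob", "temp", 20.0),
        ("bob", "temp", 22.0),
    ],
)

# COUNT(value) ignores the NULL; COUNT(*) counts every record.
counts = conn.execute("SELECT COUNT(value), COUNT(*) FROM readings").fetchone()

# Filtering happens first, so only the 'temp' records are averaged.
avg_temp = conn.execute(
    "SELECT AVG(value) FROM readings WHERE quantity = 'temp'"
).fetchone()[0]

# GROUP BY aggregates each person's records separately.
per_person = conn.execute(
    "SELECT person, COUNT(value), MAX(value) FROM readings "
    "GROUP BY person ORDER BY person"
).fetchall()

# quantity has no aggregation function here, so which of a person's
# quantity values is returned should be treated as arbitrary.
arbitrary = conn.execute(
    "SELECT person, quantity, COUNT(*) FROM readings GROUP BY person"
).fetchall()

print(counts, avg_temp, per_person)
```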

Combining Data
  • Use JOIN to combine data from two tables.

  • Use table.field notation to refer to fields when doing joins.

  • Every fact should be represented in a database exactly once.

  • Without a join condition, a join produces all combinations of records from one table with records from another (the cross product).

  • A primary key is a field (or set of fields) whose values uniquely identify the records in a table.

  • A foreign key is a field (or set of fields) in one table whose values are a primary key in another table.

  • We can eliminate meaningless combinations of records by matching primary keys and foreign keys between tables.

  • The most common join condition is matching keys.
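
To make the difference between an unconditioned join and a key-matching join concrete, here is a minimal sketch. It assumes Python's sqlite3 module and two hypothetical tables, `site` (primary key `name`) and `visit` (primary key `id`, foreign key `site`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE site (name TEXT PRIMARY KEY, lat REAL)")
conn.execute(
    "CREATE TABLE visit (id INTEGER PRIMARY KEY, site TEXT REFERENCES site(name))"
)
conn.executemany("INSERT INTO site VALUES (?, ?)", [("A", -49.85), ("B", -47.15)])
conn.executemany("INSERT INTO visit VALUES (?, ?)", [(1, "A"), (2, "A"), (3, "B")])

# No condition: every site is paired with every visit (2 x 3 combinations).
all_pairs = conn.execute("SELECT COUNT(*) FROM site JOIN visit").fetchone()[0]

# Matching the primary key to the foreign key keeps only meaningful pairs.
# table.field notation says which table each field comes from.
matched = conn.execute(
    "SELECT site.name, site.lat, visit.id "
    "FROM site JOIN visit ON site.name = visit.site "
    "ORDER BY visit.id"
).fetchall()

print(all_pairs)
print(matched)
```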

LLMs for SQL
  • LLMs can be used for tasks including tutoring, debugging, writing code, fixing errors, and search.

  • LLMs can generate incorrect or unexpected responses.

  • Testing is essential, and it too can be done in collaboration with an LLM.

Interfacing Programming Languages and Databases - Python
  • General-purpose languages have libraries for accessing databases.

  • We can create data

  • Polars is a high-performance, database-like library similar to Pandas.
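
As one possible illustration of accessing a database from a general-purpose language, the sketch below uses Python's standard sqlite3 module; this choice is an assumption made because it needs no installation. The `staff` table and the `family_name_of` helper are hypothetical. The query is written as a prepared statement, so the user's input is passed as a value and never pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()  # the cursor keeps track of the statements we run
cur.execute("CREATE TABLE staff (id TEXT, family_name TEXT)")
cur.executemany("INSERT INTO staff VALUES (?, ?)", [("ann", "Lee"), ("bob", "Ortiz")])
conn.commit()


def family_name_of(connection, staff_id):
    """Return the family name for staff_id, or None if there is no match."""
    cursor = connection.cursor()
    # The ? placeholder is filled in by the library, not by string formatting.
    cursor.execute("SELECT family_name FROM staff WHERE id = ?", (staff_id,))
    row = cursor.fetchone()
    return row[0] if row else None


found = family_name_of(conn, "ann")

# An injection attempt is treated as an ordinary (non-matching) value.
malicious = family_name_of(conn, "ann'; DROP TABLE staff; --")
remaining = conn.execute("SELECT COUNT(*) FROM staff").fetchone()[0]

print(found, malicious, remaining)
```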

Geospatial Data Science with Python and Databases
  • Shapefiles are a common geospatial format that can be used in DuckDB.

  • Geospatial data can be processed in some specialized databases.

  • Geospatial data from DuckDB can be visualised using GeoPandas and Folium.

Glossary

aggregation function
A function that combines multiple values to produce a single new value (e.g. sum, mean, median).
atomic
Describes a value not divisible into parts that one might want to work with separately. For example, if one wanted to work with first and last names separately, the values “Ada” and “Lovelace” would be atomic, but the value “Ada Lovelace” would not.
cascading delete
An SQL constraint requiring that if a given record is deleted, all records referencing it (via foreign key) in other tables must also be deleted.
case insensitive
Treating text as if upper and lower case characters were the same. See also: case sensitive.
case sensitive
Treating upper and lower case characters as different. See also: case insensitive.
comma-separated values (CSV)
A common textual representation for tables in which the values in each row are separated by commas.
cross product
A pairing of all elements of one set with all elements of another.
cursor
A pointer into a database that keeps track of outstanding operations.
database manager
A program that manages a database, such as SQLite.
field
A set of data values of a particular type, one for each record in a table.
filter
To select only the records that meet certain conditions.
foreign key
One or more fields in a database table whose values identify records in another table, typically by matching that table's primary key.
prepared statement
A template for an SQL query in which some values can be filled in.
primary key
One or more fields in a database table whose values are guaranteed to be unique for each record, i.e., whose values uniquely identify the entry.
query
A textual description of a database operation. Queries are expressed in a special-purpose language called SQL, and despite the name “query”, they may modify or delete data as well as interrogate it.
record
A set of related values making up a single entry in a database table, typically shown as a row. See also: field.
referential integrity
The internal consistency of values in a database. If an entry in one table contains a foreign key, but the corresponding records don’t exist, referential integrity has been violated.
relational database
A collection of data organized into tables.
sentinel value
A value in a collection that has a special meaning, such as 999 to mean “age unknown”.
SQL
A special-purpose language for describing operations on relational databases.
SQL injection attack
An attack on a program in which the user’s input contains malicious SQL statements. If this text is copied directly into an SQL statement, it will be executed in the database.
table
A set of data in a relational database organized into a set of records, each having the same named fields.
wildcard
A character used in pattern matching. In SQL’s LIKE operator, the wildcard “%” matches zero or more characters, so that %able% matches “fixable” and “tablets”.
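
The wildcard entry can be checked with a short sketch, assuming Python's sqlite3 module and a hypothetical `words` table. Because “%” also matches zero characters, the word “able” itself matches %able%.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("able",), ("apple",), ("fixable",), ("tablets",)],
)

# %able% matches any text containing "able", with anything (or nothing) around it.
contains_able = [
    row[0]
    for row in conn.execute("SELECT word FROM words WHERE word LIKE '%able%' ORDER BY word")
]

# able% only matches text that starts with "able".
starts_with_able = [
    row[0]
    for row in conn.execute("SELECT word FROM words WHERE word LIKE 'able%' ORDER BY word")
]

print(contains_able, starts_with_able)
```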