Exercise: NCBI E-utilities

Overview

Teaching: 5 min
Exercises: 20 min
Questions
  • How can we use the Entrez Programming Utilities (E-utilities) to search across the Entrez Molecular Sequence Database System?

Objectives
  • Introduce the Entrez Molecular Sequence Database System (Entrez) and the databases it includes.

  • Provide sources for documentation and more information about using the E-utilities.

  • Develop API calls to answer research questions using data pulled from Entrez through the E-utilities.

Background Information

Entrez Molecular Sequence Database System (Entrez)

Entrez is a molecular biology database system that provides integrated access to nucleotide and protein sequence data, genomic mapping information, 3D structure data, PubMed MEDLINE, and more. This system is produced by the National Center for Biotechnology Information (NCBI).

Entrez is NCBI’s primary text search and retrieval system. It integrates the PubMed database of biomedical literature with 38 other literature and molecular databases.

The web-based search interface for these NCBI databases is available to the public here, through the U.S. National Library of Medicine.

Databases included in Entrez

You can find a full list of Entrez databases here.

Entrez Programming Utilities (E-utilities)

The E-utilities are a set of 9 programs that provide access to Entrez (https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4_ESummary). The table below lists these 9 programs. The information in this table was taken from Eric Sayers' A General Introduction to the E-utilities.

| E-utility | Query string (base URL for the API) | Use |
|---|---|---|
| EInfo | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi | Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases. |
| ESearch | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi | Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query. |
| EPost | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi | Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset. |
| ESummary | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi | Responds to a list of UIDs from a given database with the corresponding document summaries. |
| EFetch | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi | Responds to a list of UIDs in a given database with the corresponding data records in a specified format. |
| ELink | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi | Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database; checks for the existence of a specified link from a list of one or more UIDs; creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs. |
| EGQuery | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi | Responds to a text query with the number of records matching the query in each Entrez database. |
| ESpell | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi | Retrieves spelling suggestions for a text query in a given database. |
| ECitMatch | https://eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi | Retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings. |
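
Because every utility shares the same URL prefix, the table can be held in a small lookup structure. The following is a minimal sketch of one way to do that in Python; the names EUTILS_BASE, ENDPOINTS, and build_url are illustrative choices made for this lesson and are not part of any NCBI library.

```python
from urllib.parse import urlencode

# Shared prefix of every E-utility URL listed in the table
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Script name for each utility; note that ECitMatch ends in .cgi, not .fcgi
ENDPOINTS = {
    "einfo": "einfo.fcgi",
    "esearch": "esearch.fcgi",
    "epost": "epost.fcgi",
    "esummary": "esummary.fcgi",
    "efetch": "efetch.fcgi",
    "elink": "elink.fcgi",
    "egquery": "egquery.fcgi",
    "espell": "espell.fcgi",
    "ecitmatch": "ecitmatch.cgi",
}


def build_url(utility, **params):
    """Return the full URL for a utility, with params encoded as a query string."""
    url = EUTILS_BASE + ENDPOINTS[utility.lower()]
    if params:
        url += "?" + urlencode(params)
    return url


# Builds the URL only; no request is sent
print(build_url("ESearch", db="pubmed", term="cold urticaria"))
```

The helper only assembles a URL string, so it can be checked by pasting the printed result into a browser before any request code is written.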

E-utilities Documentation

Basic Searching

API string: esearch.fcgi?db=&term=

Input: Entrez database (&db); Any Entrez text query (&term)

Output: List of UIDs matching the Entrez query

Example: Get the PubMed IDs (PMIDs) for articles about brain tumour published in Science in 2023

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+brain+tumour+AND+2023[pdat]
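
The search term in this URL is three clauses joined with AND, two of which carry a field tag in square brackets. As a rough sketch, the same URL can be assembled in Python with the standard library; the variable names here are illustrative only. Note that urlencode percent-encodes the brackets, so the printed URL looks slightly different from the hand-written one but carries the same query.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Each clause is an Entrez search term; square brackets name the search field
clauses = ["science[journal]", "brain tumour", "2023[pdat]"]
term = " AND ".join(clauses)

# urlencode percent-encodes the brackets and turns spaces into '+'
query = urlencode({"db": "pubmed", "term": term})
url = f"{ESEARCH_URL}?{query}"
print(url)
```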

Basic Downloading

API string: efetch.fcgi?db=&id=&rettype=&retmode=

Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)

Output: Formatted data records as specified

Example: Download nuccore GIs 34577062 and 24475906 in FASTA format

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=34577062,24475906&rettype=fasta&retmode=text
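
Two details of this request are easy to miss: several UIDs travel in a single comma-separated id value, and a FASTA response with more than one record has to be split on its ">" header lines. The sketch below illustrates both under simple assumptions. The helper names efetch_params and split_fasta are hypothetical, and the sample text is a made-up placeholder, not real NCBI data.

```python
def efetch_params(db, ids, rettype="fasta", retmode="text"):
    """Build the EFetch parameter dict; multiple UIDs go in one comma-separated string."""
    return {
        "db": db,
        "id": ",".join(str(uid) for uid in ids),
        "rettype": rettype,
        "retmode": retmode,
    }


def split_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


params = efetch_params("nuccore", [34577062, 24475906])
print(params["id"])

# Placeholder text standing in for a real response; not actual NCBI data
sample = ">seq1 made-up record\nACGT\nACGT\n\n>seq2 made-up record\nTTGA\n"
for header, sequence in split_fasta(sample):
    print(header, len(sequence))
```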

Getting Database Statistics and Search Fields

API string: einfo.fcgi?db=

Input: Entrez database (&db)

Output: XML containing database statistics

Note: If no database parameter is supplied, einfo will return a list of all valid Entrez databases.

Example: Find database statistics for Entrez Protein.

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein

Exercises

Retrieve PubMed Articles About Cold Urticaria

Using the Entrez API, write a Python program to search for articles about “cold urticaria” in PubMed, sorted by publication date.

Solution

import requests
import json

# Base URL of the API endpoint
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Parameters to be sent with the request
params = {
   'db': 'pubmed',
   'term': 'cold urticaria',
   'sort': 'pub_date',
   'retmode': 'json'
}

# Making the GET request
response = requests.get(base_url, params=params)

# Checking the response status and handling the data
if response.status_code == 200:
   # Request was successful, process the data
   data = response.json()  # Parsing the response as JSON
   print("Data received from the API:")
   json_data = json.dumps(data, indent=4, sort_keys=True)
   print(json_data) 
else:
   # Handling request errors
   print(f"Request failed with status code: {response.status_code}")
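
Printing the whole response is useful for exploring, but later steps usually need only the list of IDs. The sketch below shows one way to pull them out, assuming the JSON response keeps its results under an "esearchresult" key with "count" and "idlist" entries. The function name extract_ids is hypothetical, and the sample dictionary is a hand-written stand-in with placeholder IDs.

```python
def extract_ids(payload):
    """Return (total_count, id_list) from a parsed ESearch JSON response."""
    result = payload.get("esearchresult", {})
    # ESearch reports the count as a string, so convert it for arithmetic
    total = int(result.get("count", 0))
    return total, list(result.get("idlist", []))


# Hand-written stand-in for response.json(); the IDs are placeholders
sample = {
    "esearchresult": {
        "count": "3",
        "retmax": "2",
        "idlist": ["11111111", "22222222"],
    }
}

total, ids = extract_ids(sample)
print(f"{len(ids)} of {total} matching records returned")
```

In the solution above, the same function could be called as extract_ids(data) once the response has been parsed.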

Gather Information on the PubMed Database

Using the Entrez API, write a Python program to gather detailed information about the PubMed database.

Solution

import requests
import json

# Base URL of the API endpoint
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"

# Parameters to be sent with the request
params = {
   'db': 'pubmed',
   'retmode': 'json'
}

# Making the GET request
response = requests.get(base_url, params=params)

# Checking the response status and handling the data
if response.status_code == 200:
   # Request was successful, process the data
   data = response.json()  # Parsing the response as JSON
   print("Data received from the API:")
   json_data = json.dumps(data, indent=4, sort_keys=True)
   print(json_data)
else:
   # Handling request errors
   print(f"Request failed with status code: {response.status_code}")

Search for Genetic Information About BRCA1

Using the Entrez API, write a Python program to search for genetic information about the gene BRCA1 in the Nucleotide database. Limit the results to 10 entries.

Solution

import requests
import json

# Base URL of the API endpoint
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Parameters to be sent with the request
params = {
   'db': 'nucleotide',
   'term': 'BRCA1',
   'retmax': '10',
   'retmode': 'json'
}

# Making the GET request
response = requests.get(base_url, params=params)

# Checking the response status and handling the data
if response.status_code == 200:
   # Request was successful, process the data
   data = response.json()  # Parsing the response as JSON
   print("Data received from the API:")
   json_data = json.dumps(data, indent=4, sort_keys=True)
   print(json_data)
else:
   # Handling request errors
   print(f"Request failed with status code: {response.status_code}")

Fetch Sequence Data for a Specific Nucleotide ID

Use the Entrez API to fetch the sequence data for a specific nucleotide ID returned by a previous search. Use one of the IDs returned by the ESearch query, for example 359465566.

Solution

import requests

# Base URL of the EFetch API endpoint
efetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# ID to fetch details for (taken from the ESearch result)
nucleotide_id = "359465566"

# Parameters for the EFetch request
params = {
   'db': 'nucleotide',
   'id': nucleotide_id,
   'rettype': 'fasta',
   'retmode': 'text'
}

# Making the GET request
response = requests.get(efetch_url, params=params)

# Checking the response status and handling the data
if response.status_code == 200:
   # Request was successful, process the data
   fasta_data = response.text  # Fetch the response as plain text
   print("FASTA data received from the API:")
   print(fasta_data)
   # Save the FASTA data to a file
   with open("sequence.fasta", "w") as file:
      file.write(fasta_data)
   print("FASTA data saved to sequence.fasta")
else:
   # Handling request errors
   print(f"Request failed with status code: {response.status_code}")
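
The solution above copies an ID by hand from an earlier search. As a sketch of how the two steps might be chained, the function below runs ESearch and passes the returned IDs directly to EFetch. It uses the standard library's urllib instead of requests so that it has no third-party dependency; the names http_get and search_then_fetch, and the get parameter used to swap in a different fetch function, are assumptions made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url):
    """Fetch a URL and return the body as text (standard library only)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def search_then_fetch(term, db="nucleotide", retmax=1, get=http_get):
    """Run ESearch, then pass the returned UIDs straight to EFetch as FASTA."""
    search_url = BASE + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    ids = json.loads(get(search_url))["esearchresult"]["idlist"]
    if not ids:
        return ""  # nothing matched, so there is nothing to fetch

    fetch_url = BASE + "efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )
    return get(fetch_url)
```

Calling search_then_fetch("BRCA1") sends two live requests to NCBI and returns the FASTA text for the first matching record. Passing a different function as get makes it possible to try the logic without network access.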

Gather Information on the Taxonomy Database

Using the Entrez API, write a Python program to gather detailed information about the Taxonomy database.

Solution

import requests
import json

# Base URL of the API endpoint
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"

# Parameters to be sent with the request
params = {
   'db': 'taxonomy',
   'retmode': 'json'
}

# Making the GET request
response = requests.get(base_url, params=params)

# Checking the response status and handling the data
if response.status_code == 200:
   # Request was successful, process the data
   data = response.json()  # Parsing the response as JSON
   print("Data received from the API:")
   json_data = json.dumps(data, indent=4, sort_keys=True)
   print(json_data)
else:
   # Handling request errors
   print(f"Request failed with status code: {response.status_code}")

Key Points

  • By linking to the NCBI Entrez system through the E-utilities, you can make complicated data requests across a huge dataset.