Introduction

I’m Marina and I work on viruses. Before data school I would typically enter my data into Excel or Prism to make bar graphs with errors bars, then communicate these in publications and powerpoint slides. Using R, I can now reproducibly wrangle data imported from Excel, plot with ggplot2 to make really flexible and informative visualizations of the raw data. This has helped me to understand the sources of variation in my data, apply appropriate statistics and make novel visualizations.

My Project

My group are looking to find microRNAs that are predictive of viral infection. Such microRNAs could act as early bio-markers of infection, allowing us to isolate the infected individual before they show signs of infection. To find such microRNAs, we set up an experiment in which three horses were infected with Hendra virus and blood was taken at day 0, 1, 3, 5 and 7. RNA was extracted and sequenced, many millions of sequencing reads were then aligned to the 800+ currently annotated microRNAs in the horse genome. This process generates a table where the number of aligned reads for each microRNA are counted for each sample. My goal was to model a relationship between microRNA expression level and time since infection using multi-linear regression.

I began with two tables, one a count matrix for each microRNA and the other a table containing metadata, a snippet of each is shown here.

Just by looking at these tables, there are some challenging aspects to this data.
* 800 microRNAs
* microRNAs with zero counts
* Sample 6 did not work
* High dynamic range of counts

gene	s1	s2	s3	s4	s5
eca-miR-486-5p	811365	558104	640028	512889	838624
eca-miR-451	327463	344708	459206	398780	201224
eca-miR-92a	60426	75414	66700	63154	42697
eca-miR-25	58482	100886	80506	73170	60151
eca-let-7g	57536	79323	108838	101643	88311
eca-let-7f	48383	102888	157322	86508	29084

sample	animal	condition
s1	h1	d0
s2	h1	d1
s3	h1	d3
s4	h1	d5
s5	h1	d7
s6	h2	d0

Plotting the total counts

Plotting the total counts for each sample or library we see that they vary considerably around the median for this experiment (1.803372 million counts as shown by the red dashed line). To conduct a filtering step I needed to normalize counts by library size so that sample 11 for example, doesn’t loose too many microRNAs because the library wasn’t as high yielding as the others.

counts_to_plot <- microRNA_counts %>%
  gather(sample, counts, -gene) %>% 
  left_join(redlands_horse_metadata_long, by = "sample") %>% 
  rename(library = sample) %>% 
  mutate(library = sub("s","", library)) %>%
  mutate(counts_million = counts/1000000) %>% 
  full_join(lib_size, by = "library")
  
ggplot(counts_to_plot, aes(y = counts_sum, x = library, color = day, shape = animal)) +
  geom_point(size = 5) +
  geom_hline(yintercept=1.808277, linetype="dashed", color = "red") +
  scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) +
  labs( x = "Library", 
        y = "Library size (million counts)", 
        title = "Library sizes") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14, face = "bold"),
        title = element_text(size = 14, face = "bold"))

Filtering out the lowly expressed genes

I converted the raw counts to counts per million (CPM) using the edgeR package after removing sample 6. Then filtered the lowly expressed microRNAs. The filterByExpr function used here removes genes based on sequencing depth and experimental design. For this data set I kept microRNAs with less than 5 cpm in at least 3 samples because we have 3 animals or biological replicates and also removed genes with less than 80 cpm across all samples. Shown here is a density plot of log transformed counts before and after filtering, where the red dashed line represents the 5 cpm cut off chosen for this dataset. log-cpm is used to normalize the increasing variance with increasing count size.

#creating a DGEList with counts per million after removing s6
microRNA_counts_s6 <- microRNA_counts %>% 
  select(-s6)

redlands_horse_metadata_s6 <- redlands_horse_metadata %>% 
  filter(sample != "s6")

horse_counts <- DGEList(counts = microRNA_counts_s6[, -1], genes = microRNA_counts_s6[, 1], samples = redlands_horse_metadata_s6)

# tidying for plotting
cpm_long <- cpm(horse_counts) %>% 
  as.tibble() %>%
  bind_cols(as.tibble(horse_counts$genes)) %>%
  gather(sample, cpm, -gene)
  
redlands_horse_metadata_s6_long <- 
  redlands_horse_metadata_long %>% 
  filter(sample != "s6")

counts_cpm <- microRNA_counts_s6 %>%
  gather(sample, counts, -gene) %>% 
  left_join(redlands_horse_metadata_s6_long, by = "sample") %>%
  bind_cols(cpm_long) %>%
  select(-sample1, -gene1) %>% 
  rename(library = sample) %>% 
  mutate(library = sub("s","", library)) %>%
  mutate(counts_million = counts/1000000) %>% 
  full_join(lib_size, by = "library") %>% 
  mutate(log_cpm = log(cpm))

min_count <- 10/median


# filtering lowly expressed genes
design_matrix <- model.matrix(~ condition + animal, data = redlands_horse_metadata_s6)

gene_filter <- filterByExpr(horse_counts, min.count = 6, min.total.count = 80, design = design_matrix)

horse_filtered <- horse_counts[gene_filter, , keep.lib.sizes = FALSE]

# tidying for plotting
filtered_cpm_long <- cpm(horse_filtered) %>% 
  as.tibble() %>%
  bind_cols(as.tibble(horse_filtered$genes)) %>%
  gather(sample, cpm, -gene)

filtered_cpm <- filtered_cpm_long %>% 
  left_join(cpm_long) %>% 
  left_join(redlands_horse_metadata_s6_long) %>%
  rename(library = sample) %>% 
  mutate(library = sub("s","", library)) %>% 
  mutate(log_filtered_cpm = log(cpm))

# Plotting

plot_unfiltered <- ggplot(counts_cpm, aes(x = log_cpm, color = library)) +
  geom_density() +
  geom_vline(xintercept = log(5.53), linetype="dashed", color = "red") +
  scale_color_discrete(limits = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15)) +
  labs(title = "Unfiltered microRNAs",
       x = "log-cpm",
       y = "density") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14, face = "bold"),
        title = element_text(size = 14, face = "bold"),
        plot.margin = margin(6, 2, 6, 2))


plot_filtered <- ggplot(filtered_cpm, aes(x = log_filtered_cpm, color = library)) +
  geom_density() +
  geom_vline(xintercept = log(5.53), linetype="dashed", color = "red") + 
  scale_color_discrete(limits = c(1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15)) +
  labs(title = "Filtered microRNAs",
       x = "log-cpm",
       y = "") +
  theme(axis.text = element_text(size = 12),
        axis.title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 14, face = "bold"),
        title = element_text(size = 14, face = "bold"),
        plot.margin = margin(6, 0, 6, 0))


plots <- plot_grid(
  plot_unfiltered + theme(legend.position="none"),
  plot_filtered + theme(legend.position="none"),
  hjust = -1,
  nrow = 1
)

legend <- get_legend(
  plot_unfiltered + theme(legend.box.margin = margin(0, 0, 0, 6))
)

plot_grid(plots, legend, rel_widths = c(3, .4))

Trimmed Means of M values normalization

I was left with 231 from the original 889 and then needed to perform normalization. It is assumed that all samples should have a similar range and distribution of expression values. However, during the sample preparation or the sequencing process, external factors that are not of biological interest can affect the expression of individual samples. Normalization is required to ensure that the expression distributions of each sample are similar across the entire experiment. edgeR employs the Trimmed Means of M values (TMM) in which highly expressed genes and those that have a large variation of expression are excluded, whereupon a weighted average of the subset of genes is used to calculate a normalization factor shown in the table below.

horse_norm <- calcNormFactors(horse_filtered, method = "TMM")

# Table of norm factors
norm_factors <- horse_norm$samples %>% 
  select(norm.factors, sample) %>% 
  rename(library = sample) %>% 
  mutate(library = sub("s","", library)) %>%
  mutate(library = as.numeric(library)) %>% 
  mutate(norm.factors = round(norm.factors, 2)) %>% 
  spread(library, norm.factors)
  

norm_cpm_long <- cpm(horse_norm) %>% 
  as.tibble() %>%
  bind_cols(as.tibble(horse_norm$genes)) %>%
  gather(sample, norm_cpm, -gene)

norm_cpm <- norm_cpm_long %>% 
  left_join(cpm_long) %>% 
  left_join(redlands_horse_metadata_s6_long) %>%
  rename(library = sample) %>% 
  mutate(library = sub("s","", library)) %>%
  mutate(library = as.numeric(library)) %>% 
  mutate(log_cpm = log(cpm)) %>% 
  mutate(log_norm_cpm = log(norm_cpm))


plot_unnorm <- ggplot(norm_cpm, aes(y = log_cpm, x = library, group = library)) +
  geom_boxplot() +
  scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) +
  labs(title = "Unnormalized",
       y = "log-cpm") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        title = element_text(size = 10, face = "bold"),
        legend.position = "none")

plot_norm <- ggplot(norm_cpm, aes(y = log_norm_cpm, x = library, group = library)) +
  geom_boxplot() +
  scale_x_discrete(limits = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)) +
  labs(title = "Normalized",
       y = "log-cpm") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        title = element_text(size = 10, face = "bold"),
        legend.position = "none")

1	2	3	4	5	7	8	9	10	11	12	13	14	15
0.68	1.08	1.26	1.2	0.99	0.94	0.96	1.65	0.79	0.94	1	0.82	0.92	1.08

plot_grid(plot_unnorm, plot_norm)

Voom transformation to normalize heteroscedastic counts

I could then look at expression of individual genes in the libraries. In genome-wide statistical analysis We can use the extensive between gene comparisons by generalized linear regression to provide a reliable inference of expression. However count measurements are heteroscedastic, i.e. the variance depends on the mean abundance. This can be modeled using the negative binomial distribution. In voom we model the mean-variance trend of the log-cpm values at the individual observation level. Voom stands for “variance modeling at the observational level”. Note the flat line, indicating high biological variation in this dataset.

horse_voom <- voom(horse_norm, plot = T, design = design_matrix)

Multi-linear regression

The the voom weights calculated by the distance from the red curve feed intolimma so that the variance is no longer dependent on the mean expression level. Here I used the lmFit function with model ~ day + animal. This plot highlights microRNAs with a p-value <0.05 and coefficient more than 0.15 units away from zero. Many of the significant coefficients are close to zero meaning that the day post infection has little impact on expression of most microRNAs. There were however some of interest which have been plotted against day pot infection below. Equine miR-143 looks promising. This microRNA is consitently downregulated in Human Papilloma Virus-induced cancer.

metadata_numeric <- redlands_horse_metadata %>% 
  filter(sample != "s6") %>%
  mutate(day = sub("d","", condition)) %>% 
  select(-condition) %>% 
  mutate(day = as.numeric(day))

horse_counts_day <- DGEList(counts = microRNA_counts_s6[, -1], genes = microRNA_counts_s6[, 1], samples = metadata_numeric)

design_matrix_day <- model.matrix(~ day + animal, data = metadata_numeric)

gene_filter_day <- filterByExpr(horse_counts_day, min.count = 6, min.total.count = 80, design = design_matrix_day)

horse_filtered_day <- horse_counts_day[gene_filter_day, , keep.lib.sizes = FALSE]

horse_norm_day <- calcNormFactors(horse_filtered_day)

horse_voom_day <- voom(horse_norm_day, design = design_matrix_day)

vfit <- lmFit(horse_voom_day, design_matrix_day)

efit <- eBayes(vfit)

# plotting the significant genes

coeff <- topTable(efit, number = 231) %>% 
  as.tibble() %>% 
  select(gene, day, adj.P.Val) %>% 
  mutate(signif = -log10(adj.P.Val))

highlight <- coeff %>% 
  filter(adj.P.Val <= 0.05) %>% 
  filter(day >=0.15 | day <=-0.15)
  
 
ggplot(coeff, aes(x = day, y = signif)) +
  geom_point(size = 1) +
  geom_hline(yintercept = -log10(0.05), linetype="dashed", color = "red") +
  geom_vline(xintercept = 0.15, linetype = "dashed", color = "blue") +
  geom_vline(xintercept = -0.15, linetype = "dashed", color = "blue") +
  geom_text(data = highlight, aes(x = day, y = signif, label = gene), 
            hjust = 0.5, vjust = -0.5, size = 5, check_overlap=TRUE) +
  coord_cartesian(xlim = c(-1, 1), ylim = c(0, 4)) +
  labs(
    x = "coefficient of `~ day + animal' model",
    y = "-log10(adjusted P value)", 
    title = "Identifying microRNAs that respond to infection") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        title = element_text(size = 14, face = "bold"))

microRNA_expn_day <- horse_voom$E %>% 
  as.tibble %>%
  bind_cols(as.tibble(horse_voom$genes)) %>% 
  gather(sample, expression, -gene) %>% 
  left_join(metadata_numeric, by = "sample") %>% 
  filter(gene == "eca-miR-143"| gene == "eca-miR-215"|gene =="eca-miR-10b"|gene == "6_70095"|gene =="31_60371"|gene == "24_47765"|gene == "20_38845"|gene =="8_76957")

ggplot(microRNA_expn_day, aes(x = day, y = expression, color = animal, group = gene)) +
  geom_point() +
  geom_smooth(method = "glm") +
  facet_wrap(~gene, ncol = 4) +
  labs(title = "Plotting individual microRNAs",
       x = "day post Hendra infection",
       y = "microRNA expression")+
    theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        title = element_text(size = 14, face = "bold"))

Principle Components Analysis (PCA) for QC

Using gene expression levels from my voom transformation, I performed a principle component analysis to show similarities and dissimilarities between samples in an unsupervised manner. Ideally, samples would cluster well within the primary condition of interest, and any sample straying far from its group could be identified and followed up for sources of error or extra variation. Ideally, replicates should lie very close to one another. We see that microRNA expression in horse 3 day 0 (s11) is more affected by a source of variation than any other samples suggesting a technical problem. This sample was only sequenced to a depth of 0.5 million reads. It would be interesting to exclude this sample as an outlier and repeat differential expression analysis. A further concern is that horse 1 and 3 cluster in different locations along the dominant principle component (PC1), suggesting that our biological replicates hold a greater source of variation than conditions.

#spread and convert to a matrix for PCA analysis 
microRNA_expn <- horse_voom$E %>% 
  as.tibble %>%
  bind_cols(as.tibble(horse_voom$genes)) %>% 
  gather(sample, expression, -gene) %>% 
  left_join(redlands_horse_metadata, by = "sample") 

scaled_microRNAs <- microRNA_expn %>%
  spread(gene, expression) %>%
  select(-animal, -condition) %>% 
  column_to_rownames("sample") %>% 
  scale()

pca_microRNAs <- prcomp(scaled_microRNAs)

pca_microRNAs$x %>% 
  as_tibble(rownames = "sample") %>%
  gather(PC, expression, -sample) %>% 
  left_join(redlands_horse_metadata, by = "sample") %>%
  spread(PC, expression) %>% 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_text(aes(label = condition, color = animal), size = 6) +
  theme(axis.text = element_text(size = 14),
        axis.title = element_text(size = 16, face = "bold"),
        title = element_text(size = 14, face = "bold"),
        legend.text = element_text(size = 14)) +
  labs(title = "Principle Component Analysis")

My Digital Toolbox

I’ve used tidyverse to bring tabular data outputs from analysis into ‘tidy’ tables in which each variable forms a column, each observation forms a row and each cell holds a value. This can then feed into ggplot2 to make flexible visualizations. I have also used the RNA-seq packages, edgeR, limma and voom then the publishing packages rmarkdown and knitr. All these were totally new to me!

Favourite tool

I’ve really enjoyed tidyverse, such a straightforward and standardized way of looking at data, particularly love the filter function! Also rmarkdown, I really want to write papers using rmarkdown.

My time went …

In the early weeks, learning to use a programming language was really hard. There were so many functions to remember and things moved very slowly. Once I could gather and spread and make plots of my data things moved much more quickly and it became addictive!

Next steps

I would like to try the same analysis pipeline used here with different data sets. Going forward I’m eager to generate my own data, beginning with FAIR data principles so that my work is findable, accessible, interoperable and reproducible!

My Data School Experience

I’m so thankful to be given such a great introduction to programming, data management and app development. I also appreciated the in-depth stats lessons. Already I have been collecting data in a tidy way to save time on analysis and plotting the raw data to understand sources of variation which feeds into improved experimental design. On a personal note I have really enjoyed the feeling of doors opening. Excited to feel connected to a community of data scientists at CSIRO and beyond.

Mining the transcriptome for biomarkers of infection

A play with RNA-seq count data

Marina Alexander

Health & Biosecurity| Managing invasive species and Disease