Transcriptomic analysis of high-oil transgenic cowpea

Exploration of gene expression changes

Jose Barrero

CSIRO Agriculture and Food

Introduction

My name is Jose Barrero and I am a Research Scientist in Agriculture and Food. Although I have worked mainly with cereals in the past I recently started leading the Cowpea Team in Black Mountain. I have not done any coding before but I generate large datasets that need complex analysis, so that is why I decided to learn a bit of R and to come to Data School as part of my Julius Career Award activities.

My Project

My project is about exploring gene changes in a transgenic “high-oil” cowpea line that we have made. Cowpea (Image 1) is a key staple crop in West Africa being a main protein source for over 200 million people. By introducing the gene cassette that appears below (Image 2) with three oil-related genes (oleosin, WRI1 and DGAT1) we tried to increased the oil-content in the seeds making them more nutritious.

Our transgenic line (named 15B) has about 7 times more oil in the seeds that the wildtype line (named IT86). We collected RNA samples from developing seeds from 15B and IT86 at 4 different stages and performed a trancriptomic RNAseq analysis.

The goal of this project is to evaluate the gene expression changes bewteen the two genotypes and to identify genes related with the production of oil. The plan was making a heatmap illustrating the gene expression changes across the samples and to group the genes into clusters with similar expression patterns.

Image 1. Cowpea seeds.Image 2. Gene cassette used.

Description of the samples

After generating the transgenic line, we first analysed the oil content in wildtype (IT86) and transgenic (15B) seeds during seed developmet (Figure 1). The transgenic line showed a significant increased in oil content from stage 3, remaining high until stage 5 (matured dry seed). So we decided to collect RNA samples from stages 1 to 4 to performed RNAseq.

Oil content analysis in our cowpea samples. Seeds from five developmental stages were analysed, being the stage 5 the dry mature seed. For gene expression analysis RNA samples were collected from stages 1 to 4.

Figure 1: Oil content analysis in our cowpea samples. Seeds from five developmental stages were analysed, being the stage 5 the dry mature seed. For gene expression analysis RNA samples were collected from stages 1 to 4.

Workflow

I started with a matrix of expression values and first I have to transform the data into a tidy dataframe and merge different files to add the gene descriptions into it. After that I did some basic data processing to remove genes with low expression, transform the expression and normalise it.

After that I performed a cluster analysis using k-means clustering (cluster::clara()) with an arbitrary number of clusters (for this example I used k=10) to get a quick look at expression patterns. It was a bit tricky because for this kind of analysis I have to change between dataframes and matrices, but I got there at the end (Table 1).

Table 1: First rows of the final table containing the clustered relative expression values for each gene and their annotations.
gene_name annotation genotype tissue stage expression sample log_expn norm_expn cluster
Vigun01g000100.1 cleavage and polyadenylation specificity factor 73 kDa subunit-II WT Embryo 1 282 norm.WT_Embryo_1 8.144658 -1.0343485 1
Vigun01g000200.1 Domain of unknown function (DUF543) WT Embryo 1 248 norm.WT_Embryo_1 7.960002 -0.3842461 2
Vigun01g000200.2 Domain of unknown function (DUF543) WT Embryo 1 1508 norm.WT_Embryo_1 10.559377 -0.8153969 3
Vigun01g000400.1 Protein of unknown function (DUF179) WT Embryo 1 129 norm.WT_Embryo_1 7.022368 -0.0667888 4
Vigun01g000500.1 O-fucosyltransferase family protein WT Embryo 1 198 norm.WT_Embryo_1 7.636625 -1.1377203 5

Results

Whole genome clustering

Once I obtained the data frame containing the clustered genes with relative expression values I used ggplot() to built the heatmap converting the gene names to a factor, with the levels of the factor being the gene order produced by the clustering (Figure 2).

Gene expression heatmap during seed developmet. Normalised expression of 24,534 genes is shown for all transgenic (T) and wildtype (WT) samples.

Figure 2: Gene expression heatmap during seed developmet. Normalised expression of 24,534 genes is shown for all transgenic (T) and wildtype (WT) samples.

My Digital Toolbox

For this project I have used mainly the Tydyverse package, but I had to add a few others to add images to graphs (library(png) and library(grid)) I also explored the heatmap() function that is built in R for using a matrix of data, although I decided to use ggplot at the end for making the heatmap. I have also enjoyed a lot learning about Rmarkdown which will be a great tool in the future too.

Favourite tools

and of course

My time went …

I spent about 10% of the time tidying the data, about 80% trying to do the cluster analysis and heatmap, and about 10% preparing the Rmarkdown file. I found it a bit overwhelming initially because there where so many options when I searched the web for help. It was difficult to change from dataframes to matrix formats. Also, because I tried to analyse the whole cowpea genome I was running out of computing power. Likely, the instructors put me into the right track and I could find the appropiate approach.

Next steps

After this exploratory work, I need now to keep playing with the pre-processing filtering steps and with the clustering parameters to filter the genes to identify those more likely to be biologically relevant for oil accumulation.

My Data School Experience

With this exercise I was able to identify groups of genes that are high- or low-expressed at different stages during seed developemnt in transgenic seeds with high oil.

I am very happy to have attended Data School. The experiencxe has been much more positive than I though and, starting from zero, I have learnt heaps. The course was very well structured and the teachers were fantastic. I will definetly recommend it.

Thanks a lot to Stephen, Keransa, Khristian and all the helpers for your effort and patience!

And thanks to TJ Higgins and Surinder Singh’s Team and to Jerome Verdier for sharing the data.

R version: 4.0.0 Tidyverse version: 1.3.0