My name is Jose Barrero and I am a Research Scientist in Agriculture and Food. Although I have worked mainly with cereals in the past I recently started leading the Cowpea Team in Black Mountain. I have not done any coding before but I generate large datasets that need complex analysis, so that is why I decided to learn a bit of R and to come to Data School as part of my Julius Career Award activities.
My project is about exploring gene changes in a transgenic “high-oil” cowpea line that we have made. Cowpea (Image 1) is a key staple crop in West Africa being a main protein source for over 200 million people. By introducing the gene cassette that appears below (Image 2) with three oil-related genes (oleosin, WRI1 and DGAT1) we tried to increased the oil-content in the seeds making them more nutritious.
Our transgenic line (named 15B) has about 7 times more oil in the seeds that the wildtype line (named IT86). We collected RNA samples from developing seeds from 15B and IT86 at 4 different stages and performed a trancriptomic RNAseq analysis.
The goal of this project is to evaluate the gene expression changes bewteen the two genotypes and to identify genes related with the production of oil. The plan was making a heatmap illustrating the gene expression changes across the samples and to group the genes into clusters with similar expression patterns.
After generating the transgenic line, we first analysed the oil content in wildtype (IT86) and transgenic (15B) seeds during seed developmet (Figure 1). The transgenic line showed a significant increased in oil content from stage 3, remaining high until stage 5 (matured dry seed). So we decided to collect RNA samples from stages 1 to 4 to performed RNAseq.
I started with a matrix of expression values and first I have to transform the data into a tidy dataframe and merge different files to add the gene descriptions into it. After that I did some basic data processing to remove genes with low expression, transform the expression and normalise it.
After that I performed a cluster analysis using k-means clustering (cluster::clara()
) with an arbitrary number of clusters (for this example I used k=10) to get a quick look at expression patterns. It was a bit tricky because for this kind of analysis I have to change between dataframes and matrices, but I got there at the end (Table 1).
gene_name | annotation | genotype | tissue | stage | expression | sample | log_expn | norm_expn | cluster |
---|---|---|---|---|---|---|---|---|---|
Vigun01g000100.1 | cleavage and polyadenylation specificity factor 73 kDa subunit-II | WT | Embryo | 1 | 282 | norm.WT_Embryo_1 | 8.144658 | -1.0343485 | 1 |
Vigun01g000200.1 | Domain of unknown function (DUF543) | WT | Embryo | 1 | 248 | norm.WT_Embryo_1 | 7.960002 | -0.3842461 | 2 |
Vigun01g000200.2 | Domain of unknown function (DUF543) | WT | Embryo | 1 | 1508 | norm.WT_Embryo_1 | 10.559377 | -0.8153969 | 3 |
Vigun01g000400.1 | Protein of unknown function (DUF179) | WT | Embryo | 1 | 129 | norm.WT_Embryo_1 | 7.022368 | -0.0667888 | 4 |
Vigun01g000500.1 | O-fucosyltransferase family protein | WT | Embryo | 1 | 198 | norm.WT_Embryo_1 | 7.636625 | -1.1377203 | 5 |
Once I obtained the data frame containing the clustered genes with relative expression values I used ggplot()
to built the heatmap converting the gene names to a factor, with the levels of the factor being the gene order produced by the clustering (Figure 2).
For this project I have used mainly the Tydyverse
package, but I had to add a few others to add images to graphs (library(png)
and library(grid)
) I also explored the heatmap()
function that is built in R for using a matrix of data, although I decided to use ggplot
at the end for making the heatmap. I have also enjoyed a lot learning about Rmarkdown
which will be a great tool in the future too.
and of course
I spent about 10% of the time tidying the data, about 80% trying to do the cluster analysis and heatmap, and about 10% preparing the Rmarkdown file. I found it a bit overwhelming initially because there where so many options when I searched the web for help. It was difficult to change from dataframes to matrix formats. Also, because I tried to analyse the whole cowpea genome I was running out of computing power. Likely, the instructors put me into the right track and I could find the appropiate approach.
After this exploratory work, I need now to keep playing with the pre-processing filtering steps and with the clustering parameters to filter the genes to identify those more likely to be biologically relevant for oil accumulation.
With this exercise I was able to identify groups of genes that are high- or low-expressed at different stages during seed developemnt in transgenic seeds with high oil.
I am very happy to have attended Data School. The experiencxe has been much more positive than I though and, starting from zero, I have learnt heaps. The course was very well structured and the teachers were fantastic. I will definetly recommend it.
Thanks a lot to Stephen, Keransa, Khristian and all the helpers for your effort and patience!
And thanks to TJ Higgins and Surinder Singh’s Team and to Jerome Verdier for sharing the data.
R version: 4.0.0 Tidyverse version: 1.3.0