Hunting for fungal rust effectors

Melania Figueroa

Agriculture and Food - Next generation crop protection

Introduction

My name is Melania Figueroa and I am currently Leader of the Next Generation Crop Protection Group, Ag & Food Traits Program. I am very interested in understanding the evolution of rust fungi and delivering genetic disease resistance to cereals. I did not have any previous experience using R language before Data School.

My Project

I aim to identify genes responsible for virulence in rust fungi. In this case, our team is asking if deletions of specific genes (effectors) result in gain of virulence in a species of fungus (Puccinia graminis). Here, I am visualising data that depicts gene coverage in a defined region of the genome of P. graminis.

Figure.1. Symptoms caused the P. graminis in wheat. Credit: Zak Pretorius

Figure.1. Symptoms caused the P. graminis in wheat. Credit: Zak Pretorius

Data used in this project

My project uses two dataframes called: read_coverage_AvrSr27_deleted.xlsx and Phenotypes.xlsx.

I am sharing here code to input and read dataframes

Here is preview of dataframe Del_pgt21

Chr Region start end strand Name ID secretome kmeans cluster allias gene length 21-0,percent zero bases Sr27_1,percent zero bases Sr27_2,percent zero bases Sr27_3,percent zero bases 34M1,percent zero bases 34M2,percent zero bases 98,percent zero bases 194,percent zero bases SA01,percent zero bases SA02,percent zero bases SA03,percent zero bases SA04,percent zero bases SA05,percent zero bases SA06,percent zero bases SA07,percent zero bases
chr2_B 554728..556351 554728 556351
PGT21_005586 PGT21_005586 NA NA NA 1624 0 100.00000 62.50000 100.00000 0 0 0 5.726601 0 0 0.0000000 0.0000000 0.0000000 0.000000 0.0000000
chr2_B 556676..561771 556676 561771
TRP5_1 PGT21_005613 NA NA TRP5_1 5096 0 100.00000 89.58006 60.08634 0 0 0 8.202512 0 0 0.0000000 0.2158556 3.1593407 1.236264 3.2378336
chr2_B complement(557532..558142) 557532 558142
PGT21_005606 PGT21_005606 NA NA NA 611 0 100.00000 100.00000 79.54173 0 0 0 0.000000 0 0 0.0000000 0.0000000 0.0000000 0.000000 0.0000000
chr2_B 562054..565783 562054 565783
PGT21_005646 PGT21_005646 NA NA NA 3730 0 92.09115 87.58713 88.04290 0 0 0 3.672922 0 0 0.0000000 0.4289544 0.1340483 0.000000 6.7292225
chr2_B complement(565786..567670) 565786 567670
PGT21_005680 PGT21_005680 PGT21_005680 2 NA 1885 0 100.00000 79.89390 52.14854 0 0 0 1.273210 0 0 0.5835544 0.0000000 0.0000000 0.000000 0.6366048

and a preview of dataframe Phenotypes

Sr gene Ug99 Pgt55 Pgt59 Pgt60 Pgt61 04KEN156 126-5,6,7,11 21-0 34-2-12 34-2-12-13 98-1,2,3,5,6 194-1,2,3,5,6 326-1,2,3,5,6 SA-01 SA-02 SA-03 SA-04 SA-05 SA-06 SA-07 Pgt62 MCCFC RKQQC Mutant 1 Mutant 2 Mutant 3
5 V V V V V V V A V V V A A V V A A A A A A V V NA NA NA
6 V V V V V V A A A A V V V V V A A A A A A A V NA NA NA
7a NA V V NA NA V NA NA NA NA NA NA NA V V V V NA NA NA NA NA NA NA NA NA
7b V V V V V V V V V V V V A V V A A A A A A V V NA NA NA
8a V V V V V V V A A A V V V V V V V V V V V A V NA NA NA

Some scripts to manipulate dataframes and prepare materials for plotting are included below.

It was key to extract the data I cared about (only certain strains) and so I created a different table, Del_pgt21

Then, I was set up to tidy my data in Del_pgt21.

Another task was to manipulate column names and other content of the dataframes so I could get ready to make some plots.

#Step 4. Complete other dataframe manipulations. 

#In the column "Percent_coverage", the isolate name was combined with other #information, for example"21-0,percent zero bases", so I separated this using the #"," . The dataframe was also showing "0" to indicate full coverage so I changed #this to make it "100". Here, I created a new dataframe called *df_plot*.

df_plot <- separate(tidy_gcoverage_isolate,
         Percent_coverage, into = c("isolate","percent"), sep = "," ) %>% 
         mutate(Coverage = 100-coverage) %>% 
         select(-"percent", - "coverage") 

#Some rust strains mentioned in the _tidy_gcoverage_isolate_ dataframe needed renaming becuase they did not match the information provided by the dataframe _Phenotypes_. Here, I created another dataframe called *df_plot2*

df_plot2 <- df_plot %>% 
  mutate(isolate = str_replace(isolate,"Sr27_", "Mutant ")) %>%
  mutate(isolate= str_replace(isolate,"SA", "SA-")) %>% 
  mutate(isolate = replace(isolate, isolate == "194", "194-1,2,3,5,6")) %>% 
  mutate(isolate = replace(isolate, isolate == "98", "98-1,2,3,5,6")) %>%  
  mutate(isolate = replace(isolate, isolate == "34M1", "34-2-12")) %>%  
  mutate(isolate = replace(isolate, isolate == "34M2", "34-2-12-13")) 

#I converted the columns to factors becuase otherwise the plot won't mantain the order of genes and isolates that I wish to keep from the daaframe.Here, I created another dataframe called *df_plot3*

df_plot3 <- df_plot2 %>% 
mutate(isolate = factor(isolate,levels = rev(unique(isolate)))) %>%  
mutate(ID = factor(ID,levels = unique(ID))) 

Next, I started working with the dataframe Phenotypes, which provides information about phenotypes, to then integrate with the data about deletions of genes in the region of interest. So, manipulations in this dataframe began.

After tidying and extracting the data I needed, I joined the two tables by isolate name.

Data visualisation

Now, we have all the elements to make some plots and visualise our data

and I ended up with something like this:

Figure.2. Heatmap showing percentage of gene coverage without including phenotypic data

Figure.2. Heatmap showing percentage of gene coverage without including phenotypic data

I saved my plot using the code below, although the phenotypic data was yet to be included.

Then I re-did this plot but showing the isolates that are virulent or avirulent on wheat that carry the gene Sr27 modifying my code as follows

And here I can show you my final plot

Figure.3. Heatmap showing percentage of gene coverage and phenotypic data per rust strain

Figure.3. Heatmap showing percentage of gene coverage and phenotypic data per rust strain

My Digital Toolbox

Since starting data FOCUS school I have learned:

My time went …

to tidying the data. This was the most difficult part. It was surprising how many name inconsistencies existed between the two tables. This made it hard to combine information from both tables to generate a hypothesis.

Next steps

I did not have time to apply statistics to determine if the deletion of one or more genes could explain the change in phenotype. So, I would like to explore this question further, and apply some other material we covered in the course.

My Data School Experience

Data School has been a very positive experience and has made me aware of the many resources we have access to enhance computational skills within our groups. This small project is part of a publication my group is working on. I am keen on gaining additional expertise to bring those to my group. My experience will benefit memeber of my groups as I will be better possition to guide them through data management practices.

This document was completed using R 3.6.1 and the tidyverse package 1.2.1

My support and admiration to all R ladies…

My support and admiration to all R ladies…