My name is Melania Figueroa and I am currently Leader of the Next Generation Crop Protection Group, Ag & Food Traits Program. I am very interested in understanding the evolution of rust fungi and delivering genetic disease resistance to cereals. I did not have any previous experience using R language before Data School.
I aim to identify genes responsible for virulence in rust fungi. In this case, our team is asking if deletions of specific genes (effectors) result in gain of virulence in a species of fungus (Puccinia graminis). Here, I am visualising data that depicts gene coverage in a defined region of the genome of P. graminis.
Figure.1. Symptoms caused the P. graminis in wheat. Credit: Zak Pretorius
My project uses two dataframes called: read_coverage_AvrSr27_deleted.xlsx
and Phenotypes.xlsx
.
I am sharing here code to input and read dataframes
#Step 1. Read in dataframes and assign them to objects
Del_pgt21 <- read_excel("data/read_coverage_AvrSr27_deleted.xlsx")
Phenotypes <- read_excel("data/Phenotypes.xlsx")
Here is preview of dataframe Del_pgt21
Chr | Region | start | end | strand | Name | ID | secretome | kmeans cluster | allias | gene length | 21-0,percent zero bases | Sr27_1,percent zero bases | Sr27_2,percent zero bases | Sr27_3,percent zero bases | 34M1,percent zero bases | 34M2,percent zero bases | 98,percent zero bases | 194,percent zero bases | SA01,percent zero bases | SA02,percent zero bases | SA03,percent zero bases | SA04,percent zero bases | SA05,percent zero bases | SA06,percent zero bases | SA07,percent zero bases |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chr2_B | 554728..556351 | 554728 | 556351 |
|
PGT21_005586 | PGT21_005586 | NA | NA | NA | 1624 | 0 | 100.00000 | 62.50000 | 100.00000 | 0 | 0 | 0 | 5.726601 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 |
chr2_B | 556676..561771 | 556676 | 561771 |
|
TRP5_1 | PGT21_005613 | NA | NA | TRP5_1 | 5096 | 0 | 100.00000 | 89.58006 | 60.08634 | 0 | 0 | 0 | 8.202512 | 0 | 0 | 0.0000000 | 0.2158556 | 3.1593407 | 1.236264 | 3.2378336 |
chr2_B | complement(557532..558142) | 557532 | 558142 |
|
PGT21_005606 | PGT21_005606 | NA | NA | NA | 611 | 0 | 100.00000 | 100.00000 | 79.54173 | 0 | 0 | 0 | 0.000000 | 0 | 0 | 0.0000000 | 0.0000000 | 0.0000000 | 0.000000 | 0.0000000 |
chr2_B | 562054..565783 | 562054 | 565783 |
|
PGT21_005646 | PGT21_005646 | NA | NA | NA | 3730 | 0 | 92.09115 | 87.58713 | 88.04290 | 0 | 0 | 0 | 3.672922 | 0 | 0 | 0.0000000 | 0.4289544 | 0.1340483 | 0.000000 | 6.7292225 |
chr2_B | complement(565786..567670) | 565786 | 567670 |
|
PGT21_005680 | PGT21_005680 | PGT21_005680 | 2 | NA | 1885 | 0 | 100.00000 | 79.89390 | 52.14854 | 0 | 0 | 0 | 1.273210 | 0 | 0 | 0.5835544 | 0.0000000 | 0.0000000 | 0.000000 | 0.6366048 |
and a preview of dataframe Phenotypes
Sr gene | Ug99 | Pgt55 | Pgt59 | Pgt60 | Pgt61 | 04KEN156 | 126-5,6,7,11 | 21-0 | 34-2-12 | 34-2-12-13 | 98-1,2,3,5,6 | 194-1,2,3,5,6 | 326-1,2,3,5,6 | SA-01 | SA-02 | SA-03 | SA-04 | SA-05 | SA-06 | SA-07 | Pgt62 | MCCFC | RKQQC | Mutant 1 | Mutant 2 | Mutant 3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | V | V | V | V | V | V | V | A | V | V | V | A | A | V | V | A | A | A | A | A | A | V | V | NA | NA | NA |
6 | V | V | V | V | V | V | A | A | A | A | V | V | V | V | V | A | A | A | A | A | A | A | V | NA | NA | NA |
7a | NA | V | V | NA | NA | V | NA | NA | NA | NA | NA | NA | NA | V | V | V | V | NA | NA | NA | NA | NA | NA | NA | NA | NA |
7b | V | V | V | V | V | V | V | V | V | V | V | V | A | V | V | A | A | A | A | A | A | V | V | NA | NA | NA |
8a | V | V | V | V | V | V | V | A | A | A | V | V | V | V | V | V | V | V | V | V | V | A | V | NA | NA | NA |
Some scripts to manipulate dataframes and prepare materials for plotting are included below.
It was key to extract the data I cared about (only certain strains) and so I created a different table, Del_pgt21
#Step 2. The dataframe _Del_pgt21_ has more information than I needed so I #extracted the data for the rust strains I care about. Here I created a new #dataframe called *gcoverage_isolate*:
gcoverage_isolate <- Del_pgt21 %>%
select("ID","21-0,percent zero bases",
"Sr27_1,percent zero bases","Sr27_2,percent zero bases",
"Sr27_3,percent zero bases", "34M1,percent zero bases",
"34M2,percent zero bases", "98,percent zero bases",
"194,percent zero bases", "SA01,percent zero bases",
"SA02,percent zero bases","SA03,percent zero bases",
"SA04,percent zero bases", "SA05,percent zero bases",
"SA06,percent zero bases", "SA07,percent zero bases")
Then, I was set up to tidy my data in Del_pgt21
.
#*Step 3. Tidy the dataframe _gcoverage_isolate_:
tidy_gcoverage_isolate <- gcoverage_isolate %>%
gather(key = "Percent_coverage", value = "coverage", -ID)
Another task was to manipulate column names and other content of the dataframes so I could get ready to make some plots.
#Step 4. Complete other dataframe manipulations.
#In the column "Percent_coverage", the isolate name was combined with other #information, for example"21-0,percent zero bases", so I separated this using the #"," . The dataframe was also showing "0" to indicate full coverage so I changed #this to make it "100". Here, I created a new dataframe called *df_plot*.
df_plot <- separate(tidy_gcoverage_isolate,
Percent_coverage, into = c("isolate","percent"), sep = "," ) %>%
mutate(Coverage = 100-coverage) %>%
select(-"percent", - "coverage")
#Some rust strains mentioned in the _tidy_gcoverage_isolate_ dataframe needed renaming becuase they did not match the information provided by the dataframe _Phenotypes_. Here, I created another dataframe called *df_plot2*
df_plot2 <- df_plot %>%
mutate(isolate = str_replace(isolate,"Sr27_", "Mutant ")) %>%
mutate(isolate= str_replace(isolate,"SA", "SA-")) %>%
mutate(isolate = replace(isolate, isolate == "194", "194-1,2,3,5,6")) %>%
mutate(isolate = replace(isolate, isolate == "98", "98-1,2,3,5,6")) %>%
mutate(isolate = replace(isolate, isolate == "34M1", "34-2-12")) %>%
mutate(isolate = replace(isolate, isolate == "34M2", "34-2-12-13"))
#I converted the columns to factors becuase otherwise the plot won't mantain the order of genes and isolates that I wish to keep from the daaframe.Here, I created another dataframe called *df_plot3*
df_plot3 <- df_plot2 %>%
mutate(isolate = factor(isolate,levels = rev(unique(isolate)))) %>%
mutate(ID = factor(ID,levels = unique(ID)))
Next, I started working with the dataframe Phenotypes
, which provides information about phenotypes, to then integrate with the data about deletions of genes in the region of interest. So, manipulations in this dataframe began.
#Here, I needed to fix the column name to be Sr_gene
colnames(Phenotypes)[1] <- "Sr_gene"
#Then I prepared a tidy dataframe and clarified the information in the column"reaction", otherwise the data was going to be difficult to be interpreted. The dataframe created here was named *clean_phenotypes*.
clean_phenotypes <- Phenotypes %>%
gather(- Sr_gene, key = "isolate", value = "reaction") %>%
mutate(reaction = replace_na(reaction,"N")) %>%
mutate(reaction = replace(reaction, reaction == "?", "N")) %>%
mutate(reaction = replace(reaction, reaction == "A/V", "N")) %>%
mutate(reaction = replace(reaction, reaction == "A", "Avirulence")) %>%
mutate(reaction = replace(reaction, reaction == "V", "Virulence")) %>%
mutate(reaction = replace(reaction, reaction == "X", "Mesothetic")) %>%
mutate(reaction = replace(reaction, reaction == "N", "Unknown")) %>%
mutate(isolate = factor(isolate,levels = unique(isolate)))
#The column order needed to be changed to make it easier to read and then the phenotypes of the rust strains when gene Sr27 is present in the plant was extracted since it was the only bit of data I cared about. Here, I finished with a dataframe called *pheno_interest*
clean_phenotypes <- clean_phenotypes[c(2,1,3)]
pheno_interest <- clean_phenotypes %>%
filter(Sr_gene =="27")
After tidying and extracting the data I needed, I joined the two tables by isolate name.
#*Step 6. Next I needed to link up the data regarding phenotypes for Sr27 and #gene coverage that reflects deletions. Here, I created a new dataframe called:
#*df_plot4*.
df_plot4 <- right_join(pheno_interest, df_plot3, by = "isolate")
Now, we have all the elements to make some plots and visualise our data
del_heatmap <- ggplot(data = df_plot3,
mapping= aes(y=isolate, x= ID, color = Coverage)) +
geom_tile(aes(fill = Coverage), color="white") +
theme_minimal() +
theme(text = element_text(size=8),
axis.text.x = element_text(angle=90, hjust=0.1,vjust=0.1, color= "black"),
axis.text.y = element_text(angle=0, hjust=1,vjust=1, color = "black")) +
ylab("Isolate") +
xlab("Gene model") +
scale_fill_gradient(low = "lightgray", high ="red") +
scale_x_discrete(expand = c(0, 0),
position = "top") +
coord_equal()
and I ended up with something like this:
Figure.2. Heatmap showing percentage of gene coverage without including phenotypic data
I saved my plot using the code below, although the phenotypic data was yet to be included.
Then I re-did this plot but showing the isolates that are virulent or avirulent on wheat that carry the gene Sr27 modifying my code as follows
heatmap_2 <- df_plot4 %>%
mutate(isolate = factor(isolate,levels = rev(unique(isolate)))) %>%
ggplot(
mapping= aes(y=isolate, x= ID, color = Coverage)) +
geom_tile(aes(fill = Coverage), color="white") +
theme_minimal() +
theme(text = element_text(size=10),
axis.text.x = element_text(angle=90, hjust=0.1,vjust=0.1, color= "black"),
axis.text.y = element_blank(),
panel.grid = element_blank()) +
ylab("Isolate") +
xlab("Gene model") +
scale_fill_gradient(low = "lightgray", high ="red") +
scale_colour_manual("Virulence", values= c("grey70","black")) +
scale_x_discrete(expand = expand_scale(mult = c(0, 0), add= c(7,0)),
position = "top") +
coord_equal() +
geom_text(aes(x = 0, y = isolate, label = isolate, colour = reaction), hjust = 1, size = 3)
ggsave("results/phenotype_heatmap.jpg", plot = heatmap_2,
width = 20, height = 16, units = "cm")
And here I can show you my final plot
Figure.3. Heatmap showing percentage of gene coverage and phenotypic data per rust strain
Since starting data FOCUS school I have learned:
dplyr
, ggplot
, R markdown
to tidying the data. This was the most difficult part. It was surprising how many name inconsistencies existed between the two tables. This made it hard to combine information from both tables to generate a hypothesis.
I did not have time to apply statistics to determine if the deletion of one or more genes could explain the change in phenotype. So, I would like to explore this question further, and apply some other material we covered in the course.
Data School has been a very positive experience and has made me aware of the many resources we have access to enhance computational skills within our groups. This small project is part of a publication my group is working on. I am keen on gaining additional expertise to bring those to my group. My experience will benefit memeber of my groups as I will be better possition to guide them through data management practices.
This document was completed using R 3.6.1 and the tidyverse package 1.2.1
My support and admiration to all R ladies…