Introduction

I am a Principal Research Scientist in Agriculture and Food, and my main interest is in understanding how polyploidy impacts crop genetic improvement. Before Data School I used SAS or GenStat for statistical analysis and either Excel or SigmaPlot for generating graphs. I joined Data School to learn how to create reproducible graphs from large sequence data sets. I had no previous experience with the R language.

My Project

Sugarcane is a complex autopolyploid with a genome size of 10 Gb condensed into 110 chromosomes. There are 8 to 12 copies of each chromosome, which makes genome assembly very complex. To reduce the complexity of the genome and to help in the assembly of long-read data, chromosomes from the sugarcane variety R570 were flow sorted to collect individual chromosomes. Each single chromosome was then amplified using multiple displacement amplification and sequenced using Illumina HiSeq. For each chromosome library a set of reads was generated; these were quality trimmed, then aligned to both the sugarcane R570 BAC monoploid sequence and the genome sequence of sugarcane's closest diploid relative, sorghum. The resultant coverage reports were then tidied and used in this project to generate coverage graphs across the gene space of the R570 monoploid genome sequence and of sorghum.

Preliminary results

In this project, coverage reports from the single chromosome libraries of chromosomes numbered 38 and 54 were analysed. Each library of reads had been aligned to both the R570 monoploid genome sequence and the sorghum genome. The coverage reports were read into R and tidied to remove all regions with zero reads aligned. The gene files were also read in, and the length and mid position of each gene were calculated. The two files were then joined on the column ‘gene’, which they have in common. The percent coverage was calculated, and reads that did not map to chromosomes were removed. Finally, a plot of the coverage depth of each gene was generated for each chromosome using ggplot. An example of the data is given in Table 1.

Table 1: Data file used for plotting gene coverage graphs
X	gene	coverage	chromosome	begin	end	length	position	perc_coverage	Mb_position	Chr
192	Sh01_g000050	19	Sh01	69088	74749	5661	71918.5	0.3356297	0.0719185	Sh0
193	Sh01_g000060	37	Sh01	77087	83334	6247	80210.5	0.5922843	0.0802105	Sh0
194	Sh01_g000070	320	Sh01	102543	106773	4230	104658.0	7.5650118	0.1046580	Sh0
195	Sh01_g000080	151	Sh01	130451	137050	6599	133750.5	2.2882255	0.1337505	Sh0
196	Sh01_g000180	18	Sh01	192073	196925	4852	194499.0	0.3709810	0.1944990	Sh0
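
The derived columns in Table 1 can be reproduced with a few dplyr steps. The sketch below is only an illustration under stated assumptions: the object names `coverage_report` and `genes` are hypothetical, the formulas (length = end - begin, position = midpoint, perc_coverage = coverage / length * 100, Mb_position = position / 1e6) are inferred from the values in Table 1 rather than taken from the original script, and one extra zero-coverage row is invented purely to show the filter. The step that removes reads not mapped to chromosomes is left out because the source does not say how those are labelled.

```r
library(dplyr)

# Coverage per gene: five rows copied from Table 1, plus one invented
# zero-coverage row (hypothetical) to show the filtering step.
coverage_report <- data.frame(
  gene = c("Sh01_g000050", "Sh01_g000060", "Sh01_g000070",
           "Sh01_g000080", "Sh01_g000180", "hypothetical_gene"),
  coverage = c(19, 37, 320, 151, 18, 0)
)

# Gene coordinates: begin/end copied from Table 1; the last row is hypothetical.
genes <- data.frame(
  gene = c("Sh01_g000050", "Sh01_g000060", "Sh01_g000070",
           "Sh01_g000080", "Sh01_g000180", "hypothetical_gene"),
  chromosome = "Sh01",
  begin = c(69088, 77087, 102543, 130451, 192073, 200000),
  end   = c(74749, 83334, 106773, 137050, 196925, 201000)
)

# Gene length and mid position, calculated before the join.
genes <- genes %>%
  mutate(length = end - begin,
         position = (begin + end) / 2)

gene_coverage <- coverage_report %>%
  filter(coverage > 0) %>%            # drop regions with zero reads aligned
  inner_join(genes, by = "gene") %>%  # join on the shared 'gene' column
  mutate(perc_coverage = coverage / length * 100,
         Mb_position = position / 1e6)

print(gene_coverage)
```

With real data, the two data frames would be read from the coverage and gene files (for example with readr) instead of being typed in.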

Data Visualisation

Figures 1 and 2 show that chromosome 54 aligns to sugarcane chromosome 1 and that chromosome 38 aligns to two different chromosomes, 8 and 9, of the monoploid genome sequence. Figures 3 and 4 confirm that sugarcane chromosome 54 is colinear with sorghum chromosome 1 but indicate that chromosome 38 is a recombinant chromosome that aligns to the whole of sorghum chromosome 9 and half of chromosome 8. The centre of each sorghum chromosome has no reads because this is the centromeric region, which has no genes. This is not seen in the R570 alignment because only the gene regions are present in this BAC assembly.

Figure 1: Single chromosome library aligned to R570 sugarcane monoploid sequence

Figure 2: Single chromosome library aligned to R570 sugarcane monoploid sequence

Figure 3: Single chromosome library aligned to the sorghum genome

Figure 4: Single chromosome library aligned to the sorghum genome
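
Plots of this kind could be drawn from a table shaped like Table 1 along the following lines. This is a minimal sketch, not the script behind the figures: the choice of `geom_point`, the axis labels, the use of `perc_coverage` on the y axis and faceting by `chromosome` are all assumptions, and the data are just the five rows shown in Table 1.

```r
library(ggplot2)

# Five rows copied from Table 1 (only the columns needed for the plot).
gene_coverage <- data.frame(
  chromosome    = "Sh01",
  Mb_position   = c(0.0719185, 0.0802105, 0.1046580, 0.1337505, 0.1944990),
  perc_coverage = c(0.3356297, 0.5922843, 7.5650118, 2.2882255, 0.3709810)
)

# One point per gene, positioned at the gene midpoint along the chromosome,
# with one panel per reference chromosome.
coverage_plot <- ggplot(gene_coverage,
                        aes(x = Mb_position, y = perc_coverage)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~ chromosome, scales = "free_x") +
  labs(x = "Gene mid position (Mb)", y = "Coverage (perc_coverage)") +
  theme_bw()

print(coverage_plot)
```

Faceting by chromosome means that a library aligning to more than one reference chromosome, as described for chromosome 38, would appear as reads spread across several panels.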

My Digital Toolbox

I have learned multiple tools since I started Data School Focus. These include:

tidyverse
ggplot
dplyr
R Markdown
readr
kableExtra

My time went …

The 10 weeks flew by, and having such a great course structure helped with the transition to working from home that has happened since COVID-19. I really wanted to start using R to tidy my data because the sequence files are so large, and I was expecting that it would take a really long time. I was surprised by how quickly I was able to troubleshoot and figure things out on my own using online resources. It helped to have so many people from Data School that I could ask for advice.

Next steps

Although Data School has given me an excellent foundation in data analysis using R, there are still many more skills I would like to learn, especially statistical analysis methods. I will be expanding on the analysis I have carried out here, as this project has many more chromosomes to analyse. Looking at the data, I think there are even better ways to present this information.

My Data School Experience

I have found Data School a very positive and enjoyable experience. It was often challenging, but with the help of really good presenters who were happy to go back over anything I didn’t understand, I have gained many skills that I will apply to my everyday work. I really enjoyed working out how to graph my data using ggplot and exploring the vast range of options available. I have also enjoyed comparing notes with other team members who are using R.

Sequencing the complex polyploid sugarcane genome

Karen Aitken

CSIRO Agriculture and Food