I am a Principal Research Scientist in Agriculture and Food, and my main interest is in understanding how polyploidy impacts crop genetic improvement. Before Data School I used SAS or GenStat for statistical analysis and either Excel or SigmaPlot for generating graphs. I joined Data School to learn how to create reproducible graphs from large sequence data sets. I had no previous experience with R.
Sugarcane is a complex autopolyploid with a genome size of 10 Gb condensed into 110 chromosomes. There are 8 to 12 copies of each chromosome, which makes genome assembly very complex. To reduce the complexity of the genome and to help in the assembly of long read data, chromosomes from the sugarcane variety R570 were flow sorted to collect individual chromosomes. Each single chromosome was then amplified using multiple displacement amplification and sequenced using Illumina HiSeq. For each chromosome library a set of reads was generated. These reads were quality trimmed, then aligned to both the sugarcane R570 BAC monoploid sequence and the genome sequence of sugarcane's closest diploid relative, sorghum. The resultant coverage reports were then tidied and used in this project to generate coverage graphs across the gene space of the R570 monoploid genome sequence and of sorghum.
In this project, coverage reports from the single chromosome libraries of chromosomes numbered 38 and 54 were analysed. Each library of reads had been aligned to both the R570 monoploid genome sequence and the sorghum genome. The coverage reports were read into R and tidied to remove all regions with zero reads aligned. The gene files were also read in, and the length and mid position of each gene were calculated. The two files were joined using the column ‘gene’, which is common to both. The percent coverage was then calculated, and reads that did not map to chromosomes were removed. Finally, a plot of the coverage depth of each gene was generated for each chromosome using ggplot. An example of the data is given in Table 1.
X | gene | coverage | chromosome | begin | end | orientation | length | position | perc_coverage | Mb_position | Chr |
---|---|---|---|---|---|---|---|---|---|---|---|
192 | Sh01_g000050 | 19 | Sh01 | 69088 | 74749 |  | 5661 | 71918.5 | 0.3356297 | 0.0719185 | Sh0 |
193 | Sh01_g000060 | 37 | Sh01 | 77087 | 83334 |  | 6247 | 80210.5 | 0.5922843 | 0.0802105 | Sh0 |
194 | Sh01_g000070 | 320 | Sh01 | 102543 | 106773 |  | 4230 | 104658.0 | 7.5650118 | 0.1046580 | Sh0 |
195 | Sh01_g000080 | 151 | Sh01 | 130451 | 137050 |  | 6599 | 133750.5 | 2.2882255 | 0.1337505 | Sh0 |
196 | Sh01_g000180 | 18 | Sh01 | 192073 | 196925 |  | 4852 | 194499.0 | 0.3709810 | 0.1944990 | Sh0 |
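
The tidying steps described above can be sketched in base R as follows. This is only an illustration, not the code used in the project. The formulas for `length`, `position`, `perc_coverage` and `Mb_position` are inferred from the values in Table 1. The object names, the zero-coverage gene, the scaffold row and the chromosome naming pattern are assumptions made for the example, and the original analysis may well have used different functions (for example tidyverse verbs).

```r
# Toy stand-ins for the coverage report and the gene file.
# The first three genes come from Table 1; the last two rows are
# hypothetical and exist only to show the filtering steps.
coverage <- data.frame(
  gene = c("Sh01_g000050", "Sh01_g000060", "Sh01_g000070",
           "Sh01_g000090", "scaffold_1_g000010"),
  coverage = c(19, 37, 320, 0, 12),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  gene = c("Sh01_g000050", "Sh01_g000060", "Sh01_g000070",
           "Sh01_g000090", "scaffold_1_g000010"),
  chromosome = c("Sh01", "Sh01", "Sh01", "Sh01", "scaffold_1"),
  begin = c(69088, 77087, 102543, 150000, 100),
  end = c(74749, 83334, 106773, 151000, 900),
  stringsAsFactors = FALSE
)

# 1. Remove regions with zero reads aligned.
coverage <- coverage[coverage$coverage > 0, ]

# 2. Gene length and mid position.
genes$length <- genes$end - genes$begin
genes$position <- (genes$begin + genes$end) / 2

# 3. Join the two tables on the shared 'gene' column.
cov_genes <- merge(coverage, genes, by = "gene")

# 4. Percent coverage and position in Mb (formulas inferred from Table 1).
cov_genes$perc_coverage <- cov_genes$coverage / cov_genes$length * 100
cov_genes$Mb_position <- cov_genes$position / 1e6

# 5. Keep only genes on chromosomes. The pattern "Sh" followed by digits
#    is an assumption about how chromosomes are named.
cov_genes <- cov_genes[grepl("^Sh[0-9]+$", cov_genes$chromosome), ]

# 6. Coverage depth of each gene along each chromosome, one panel per
#    chromosome. Built only if ggplot2 is installed; the choice of axes
#    and geom is a guess at the kind of plot described.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(cov_genes,
                       ggplot2::aes(x = Mb_position, y = coverage)) +
    ggplot2::geom_point() +
    ggplot2::facet_wrap(~ chromosome)
}
```

With these toy inputs, `cov_genes` keeps the three Table 1 genes, and the zero-coverage gene and the scaffold gene are dropped.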
Figures 1 and 2 show that chromosome 54 aligns to sugarcane chromosome 1, while chromosome 38 aligns to two different chromosomes, 8 and 9, of the monoploid gene sequence. Figures 3 and 4 confirm that sugarcane chromosome 54 is colinear with sorghum chromosome 1, but indicate that chromosome 38 is a recombinant chromosome that aligns to the whole of sorghum chromosome 9 and half of chromosome 8. The centre of each sorghum chromosome has no reads, as this is the centromeric region, which has no genes. This gap is not seen in the R570 alignment because only the gene regions are present in the BAC assembly.
I have learned multiple tools since I started Data School Focus. These include:
The 10 weeks flew by, and having such a great course structure helped in the transition to working from home that has happened since COVID-19. I really wanted to start using R to tidy my data, as the sequence files are so large, and I was expecting that it would take a really long time. I was surprised how quickly I was able to troubleshoot and figure things out on my own using online resources. It helped to have so many people from Data School that I could ask for advice.
Although Data School has given me an excellent foundation in data analysis using R, there are still many more skills I would like to learn, especially statistical analysis methods. I will be expanding on the analysis I have carried out here, as this project has many more chromosomes to analyse, and looking at the data, I think there are even better ways to present this information.
I have found Data School a very positive and enjoyable experience. It was often challenging, but with the help of really good presenters who were happy to go back over anything I didn’t understand, I have gained many skills that I will apply to my everyday work. I really enjoyed working out how to graph my data using ggplot and exploring the vast range of options available. I have also enjoyed comparing notes with other team members who are using R.