Finding gluten proteins in cereal grains

Sophia Escobar-Correas

CSIRO Agriculture & Food

Introduction

Hello! My name is Sophia. I am a molecular biologist working in proteomics, currently as a Postdoctoral Fellow. Before Data School, my only coding experience was with Excel macros. I used to spend a lot of time cleaning and tidying protein data, and I always felt I could do it faster if I had programming skills. These weeks learning R have changed my daily work. The skills I have gained allow me to fast-track the routine tasks and focus on new perspectives in my research.

My Project

Gluten refers to a class of storage proteins found in cereal grains, including wheat, rye, barley, and oats. In people with coeliac disease, consumption of these gluten proteins triggers an autoimmune response. For my project, I want to characterise the known gluten proteins and peptides found in cereal grains in order to find an easy way of identifying gluten in different cultivars.

Preliminary results

I will analyse the amino acid composition of the proteins in my database. Since gluten proteins are rich in the amino acids glutamine (Q) and proline (P), I will search for all proteins that contain over 20% glutamine.
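As a rough illustration of this search, the sketch below computes the glutamine percentage of each sequence and keeps those above the 20% threshold. The two toy sequences and the helper name `aa_percent` are hypothetical and are not taken from the database; this is a minimal base R sketch, not the script used in the project.

```r
# Hypothetical toy sequences (not taken from the database).
sequences <- c(
  glutamine_rich = "QQPQQPFPQQ",
  glutamine_poor = "MASAELSREE"
)

# Percentage of a single amino acid (one-letter code) in each sequence.
aa_percent <- function(seqs, aa) {
  hits <- lengths(regmatches(seqs, gregexpr(aa, seqs, fixed = TRUE)))
  100 * hits / nchar(seqs)
}

q_percent <- aa_percent(sequences, "Q")

# Keep only the sequences above the 20% glutamine threshold.
high_q <- sequences[q_percent > 20]
print(names(high_q))
```

Only the first toy sequence (6 Q out of 10 residues) passes the filter; the same function could be called with "P" to look at proline.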

Tables
Table 1: Protein database
N Accession Name Sequence
1 spP4910614331_MAIZE 14-3-3-like MASAELSREENVYMAKLAEQAERYEEMVEFMEKVAKTVDSEELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEEGRGNEDRVTLIKDYRGKIETELTKICDGILKLLETHLVPSSTAPESKVFYLKMKGDYYRYLAEFKTGAERKDAAENTMVAYKAAQDIALAELAPTHPIRLGLALNFSVFYYEILNSPDRACSLAKQAFDEAISELDTLSEESYKDSTLIMQLLRDNLTLWTSDISEDPAEEIREAPKRDSSEGQ
2 spQ84Q72HS181_ORYSJ 18.1 MSLIRRSNVFDPFSLDLWDPFDGFPFGSGSRSSGSIFPSFPRGTSSETAAFAGARIDWKETPEAHVFKADVPGLKKEEVKVEVEDGNVLQISGERSKEQEEKTDKWHRVERSSGKFLRRFRLPENTKPEQIKASMENGVLTVTVPKEEPKKPDVKSIQVTG
3 spP69555PSBH_WHEAT Photosystem MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN
4 spP36886PSAK_HORVU Photosystem MASQLSAMTSVPQFHGLRTYSSPRSMATLPSLRRRRSQGIRCDYIGSSTNLIMVTTTTLMLFAGRFGLAPSANRKATAGLKLEARESGLQTGDPAGFTLADTLACGAVGHIMGVGIVLGLKNTGVLDQIIG
5 spQ6YZE2GSA_ORYSJ Glutamate-1-semialdehyde MAGAAAASAAAAAVASGISARPVAPRPSPSRARAPRSVVRAAISVEKGEKAYTVEKSEEIFNAAKELMPGGVNSPVRAFKSVGGQPIVFDSVKGSRMWDVDGNEYIDYVGSWGPAIIGHADDTVNAALIETLKKGTSFGAPCVLENVLAEMVISAVPSIEMVRFVNSGTEACMGALRLVRAFTGREKILKFEGCYHGHADSFLVKAGSGVATLGLPDSPGVPKGATSETLTAPYNDVEAVKKLFEENKGQIAAVFLEPVVGNAGFIPPQPGFLNALRDLTKQDGALLVFDEVMTGFRLAYGGAQEYFGITPDVSTLGKIIGGGLPVGAYGGRKDIMEMVAPAGPMYQAGTLSGNPLAMTAGIHTLKRLMEPGTYDYLDKITGDLVRGVLDAGAKTGHEMCGGHIRGMFGFFFTAGPVHNFGDAKKSDTAKFGRFYRGMLEEGVYLAPSQFEAGFTSLAHTSQDIEKTVEAAAKVLRRI
Note: Five examples of proteins found in the database. The Sequence column gives the amino acids (one-letter code) that make up each protein.

The next step is to count the amino acids Q and P in each protein.

Table 2: Amino acid composition
N  Accession            Name                      totalAA  Qcomp  Q100  Pcomp  P100
1  spP4910614331_MAIZE  14-3-3-like                   261      6  2.30      7  2.68
2  spQ84Q72HS181_ORYSJ  18.1                          161      4  2.48     12  7.45
3  spP69555PSBH_WHEAT   Photosystem                    73      1  1.37      5  6.85
4  spP36886PSAK_HORVU   Photosystem                   131      5  3.82      5  3.82
5  spQ6YZE2GSA_ORYSJ    Glutamate-1-semialdehyde      478      8  1.67     27  5.65
Note: totalAA = Total number of amino acids in the protein
Qcomp = Number of glutamine residues in the protein
Q100 = Percentage of glutamine in the protein
Pcomp = Number of proline residues in the protein
P100 = Percentage of proline in the protein
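The columns of Table 2 can be derived from the Sequence column alone. The sketch below does this in base R for the PSBH_WHEAT entry from Table 1; the helper name `count_aa` is an assumption for illustration, and the original analysis may well have used stringr or Biostrings instead.

```r
# Sequence of the PSBH_WHEAT entry, copied from Table 1.
proteins <- data.frame(
  Accession = "spP69555PSBH_WHEAT",
  Sequence  = "MATQTVEDSSKPRPKRTGAGSLLKPLNSEYGKVAPGWGTTPFMGVAMALFAIFLSIILEIYNSSVLLDGILTN",
  stringsAsFactors = FALSE
)

# Count one amino acid letter by comparing lengths before and after removing it.
count_aa <- function(seqs, aa) {
  nchar(seqs) - nchar(gsub(aa, "", seqs, fixed = TRUE))
}

proteins$totalAA <- nchar(proteins$Sequence)
proteins$Qcomp   <- count_aa(proteins$Sequence, "Q")
proteins$Q100    <- round(100 * proteins$Qcomp / proteins$totalAA, 2)
proteins$Pcomp   <- count_aa(proteins$Sequence, "P")
proteins$P100    <- round(100 * proteins$Pcomp / proteins$totalAA, 2)

print(proteins[, c("Accession", "totalAA", "Qcomp", "Q100", "Pcomp", "P100")])
```

For this protein the result agrees with row 3 of Table 2: 73 amino acids, 1 glutamine (1.37%) and 5 prolines (6.85%).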

Working with Protein Data

Plotting

Figure 1: Glutamine and Proline composition in database

Next, we look at how the gluten proteins group in wheat.

My Digital Toolbox

To work with protein databases, which are usually in .fasta format, I used the Biostrings package. For tidying the data, I used the tidyverse (my new best friend), in particular dplyr and stringr. For visualisation, I used ggplot2 and gganimate.
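For readers unfamiliar with the format, a FASTA file stores each record as a header line starting with ">" followed by one or more sequence lines. Biostrings provides `readAAStringSet()` for reading such files; the sketch below is a dependency-free stand-in in base R, not the project's code. The function name `read_fasta` and the two records are hypothetical, and the parser assumes every header is followed by at least one sequence line.

```r
# Two hypothetical FASTA records, written to a temporary file for the demo.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(
  ">protein_A example record",
  "MATQTVEDSS",
  "KPRPK",
  ">protein_B example record",
  "QQPQQPF"
), fasta_file)

read_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]      # drop empty lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)     # record number that each line belongs to
  # Join the sequence lines of each record into one string.
  sequences <- tapply(lines[!is_header], record_id[!is_header],
                      paste, collapse = "")
  data.frame(
    Header   = sub("^>", "", lines[is_header]),
    Sequence = as.vector(sequences),
    stringsAsFactors = FALSE
  )
}

db <- read_fasta(fasta_file)
print(db)
```

The result is a data frame with one row per protein, which is a convenient shape for the tidying and composition steps described above.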

Favourite tool

My favourite package is tidyverse. After learning only a few functions in the first few days of Data School, I had already found ways to make my daily work much easier. It was love at second sight. I was able to clean and tidy my data. The functions I used the most are mutate, join and, of course, the pipe %>%. Another of my favourite parts of working with R is using regex; learning it was so useful for writing scripts.

My time went …

tidying and cleaning raw protein data, usually in Excel. At the beginning I thought it was going to be hard to work with Excel sheets in R, but once you use read.xlsx it is easy. I started by creating a script to tidy and clean the protein data. In the beginning, it was hard to figure out how to tell the program to select variables, but once we learned regex it became much faster. When I have doubts about functions or how to do something in R, I use Stack Overflow. I also check Twitter because there is always good news or updates on there.
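Selecting variables with a regular expression might look like the sketch below. The data frame and its column names are hypothetical and do not reflect the real report layout; the example uses base R, while in the tidyverse a similar selection can be written with `dplyr::select()` and `matches()`.

```r
# Hypothetical export with mixed column names (not the real report layout).
raw <- data.frame(
  Accession      = c("protein_A", "protein_B"),
  Peptide.count  = c(12, 3),
  Intensity.rep1 = c(1.5, 0.2),
  Intensity.rep2 = c(1.7, 0.3),
  stringsAsFactors = FALSE
)

# Select every column whose name starts with "Intensity".
intensity_cols <- grep("^Intensity", names(raw), value = TRUE)
intensities <- raw[, intensity_cols, drop = FALSE]

# Tidy the names: keep only the replicate label after the last dot.
names(intensities) <- sub("^.*\\.", "", names(intensities))
print(intensities)
```

One pattern replaces a hand-written list of column names, which is what makes the approach faster once a report has many replicate columns.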

Next steps

I will keep working with R; I think there is a lot I haven’t tried yet. I want to get more practice writing functions. In the future I would like to create a script that identifies non-gluten proteins - proteins that are similar to gluten or that generate an immune reaction in patients. But for that I will have to learn more things in R, like working with APIs.

And maybe some day start with Python…

My Data School Experience

These months in Data School have been really helpful for my career and have confirmed how much I like bioinformatics (some days too much!). In the future, I will keep focusing on improving my coding skills. I hope that with this new skill I will be able to help my team and myself by making the process of tidying protein data much easier and faster. Since we started Data School I have been able to develop different scripts that will help me and the team in future work. One of the scripts is .proteins_fdr_report, for cleaning and tidying protein reports using an FDR threshold and selecting peptides with no modifications. The one I showed here is .finding_gluten. I have also been supported by my helper, who is part of my team, and we hope to create new scripts that help improve our team’s capabilities.