The information sources of LARGE-like GlcNAc Transferase Database
Glycosylation is one of the major post-translational modification processes and is essential for the expression and function of many proteins. It has been estimated that 1% of the open reading frames of a genome are dedicated to glycosylation. Many different enzymes are involved in glycosylation, such as glycosyltransferases and glycosidases.
Traditionally, glycosyltransferases are classified by their enzymatic activities according to the Enzyme Commission (http://www.chem.qmul.ac.uk/iubmb/enzyme/). Glycosyltransferases are named after the type of activated donor, for example glucosyltransferase, mannosyltransferase and
Glycosyltransferases are enzymes that synthesize sugar moieties by transferring saccharides from activated donors onto various macromolecules such as DNA, proteins, lipids and glycans. More than 100 glycosyltransferases are localized in the endoplasmic reticulum (ER) and Golgi apparatus and are involved in glycan synthesis (Narimatsu, H., 2006). Structural studies of the ER and Golgi glycosyltransferases have revealed several domains and motifs common to them. The glycosyltransferases are grouped into functional subfamilies based on sequence similarity, enzyme characteristics, donor specificity, acceptor specificity and the specific donor and acceptor linkages (Ishida et al., 2005). Glycosyltransferase sequences are 330-560 amino acids long and share the same type II transmembrane protein structure with four functional domains: a short cytoplasmic domain, a targeting / membrane anchoring domain, a stem region and a catalytic domain (Fukuda et al., 1994). Mammals use only nine sugar nucleotide donors for glycosyltransferases: UDP-glucose, UDP-galactose, UDP-GlcNAc, UDP-GalNAc, UDP-xylose, UDP-glucuronic acid, GDP-mannose, GDP-fucose, and CMP-sialic acid. Other organisms have a more extensive range of nucleotide sugar donors (Varki et al., 2008).
Based on these structural studies, we have designed an intelligent platform for the LARGE protein, a Golgi glycosyltransferase. LARGE is a glycosyltransferase that has been studied in protein glycosylation (Fukuda & Hindsgaul, 2000). It was originally isolated from a region of chromosome 22 of the human genome that is frequently deleted in human meningiomas with altered glycosphingolipid composition. This led to the suggestion that LARGE may have a role in complex lipid glycosylation (Dumanski et al., 1987; Peyrard et al., 1999).
LARGE is one of the largest genes in the human genome; it spans 660 kb of genomic DNA and contains 16 exons encoding a 756-amino-acid protein. The protein shows 98% amino acid identity to its mouse homologue, and the two genes have a similar genomic organization. LARGE is expressed ubiquitously, but the highest levels of LARGE mRNA are found in heart, brain and skeletal muscle (Peyrard et al., 1999).
LARGE encodes a protein with an N-terminal transmembrane anchor, a coiled-coil motif and two putative catalytic domains with a conserved DXD (Asp-any-Asp) motif typical of many glycosyltransferases that use nucleoside diphosphate sugars as donors (Longman et al., 2003; Peyrard et al., 1999). The proximal catalytic domain of LARGE is most homologous to members of the bacterial glycosyltransferase family 8 (GT8 in the CAZy database) (Coutinho et al., 2003). The members of this family are mainly involved in the synthesis of bacterial outer membrane lipopolysaccharide. The distal domain resembles the human β1,3-N-acetylglucosaminyltransferase (iGnT), a member of the GT49 family. The iGnT enzyme is required for the synthesis of the poly-N-acetyllactosamine backbone which is part of the erythrocyte
2.1. Functions of LARGE
2.1.1. Dystroglycan glycosylation
Dystroglycan (DG) is an important constituent of the dystrophin-glycoprotein complex (DGC). This complex plays an essential role in maintaining the stability of the muscle membrane, and glycosylation of some of its components is required for their correct localization and/or ligand-binding activity (Durbeej et al., 1998). DG comprises two subunits, the extracellular α-DG and the transmembrane β-DG (Barresi, 2004). Various components of the extracellular matrix, including laminin (Smalheiser & Schwartz 1987), agrin (Gee et al., 1994), neurexin (Sugita et al., 2001), and perlecan (Peng et al., 1998), interact with α-DG. The carbohydrate moieties of α-DG are essential for binding laminin and other ligands. α-DG is modified by three different types of glycans: mucin type
2.1.2. Human LARGE and α-Dystroglycan
Functional glycosylation of α-DG by LARGE is likely to involve the generation of a glycan polymer, which gives rise to the broad molecular weight range observed for α-DG detected by the VIA4-1 and IIH6 antibodies. The C-terminal glycosyltransferase domain of both human and mouse LARGE is similar to β3GnT6, which adds GlcNAc to Gal to generate linear polylactosamine chains (Sasaki et al., 1997); the chain formed by LARGE might also be composed of GlcNAc and Glc.
In 1963, Myodystrophy,
A clinical spectrum ranging from severe congenital muscular dystrophy (CMD) with structural brain and eye abnormalities [Walker-Warburg syndrome (WWS), MIM 236670] to a relatively mild form of limb-girdle muscular dystrophy (LGMD2I, MIM 607155) is linked to abnormal O-linked glycosylation of α-DG (van Reeuwijk et al., 2005). A study by Barresi R. et al. (2004) revealed dual, concentration-dependent functions of LARGE. At physiological concentrations, LARGE may be involved in regulating the α-DG O-mannosylation pathway. But when LARGE is forcibly expressed, it may trigger some other alternative pathways for the
2.1.3. LARGE in visual signal processing
The role of LARGE in proper visual signal processing was studied through the retinal pathology of Largemyd mice. The functional abnormalities of the retina were investigated with a sensitive tool, the electroretinogram (ERG). In Largemyd mice, the normal a-wave indicated that the mutant glycosyltransferase does not affect photoreceptor function.
The altered b-wave, however, may reflect altered signal processing in the downstream retinal circuitry (Newman & Frishman, 1991). The DGC may also have a role in this aspect of the phenotype. An abnormal b-wave similar to that of Largemyd mice is also associated with the loss of retinal isoforms of dystrophin in humans and mice.
2.2. LARGE homologues
A gene homologous to LARGE was identified and named LARGE2. Like LARGE, it is involved in α-DG maturation (Fujimura et al., 2005). Whether these two proteins are compensatory or cooperative is still not well understood. Co-expression of LARGE and LARGE2 did not increase the maturation of α-DG in comparison with either one alone, indicating that in α-DG maturation the function of LARGE2 is compensatory rather than cooperative. Gene therapy for muscular dystrophy using the LARGE gene is a current topic of research (Barresi R. et al., 2004; Braun, 2004). The LARGE2 gene may be more effective than LARGE because it glycosylates α-DG more heavily and also prevents the production of harmful, immature α-DG.
Closely related homologues of LARGE are found in the human genome (glycosyltransferase-like 1B; GYLTL1B), the mouse genome (Gyltl1b; also called LARGE-Like or LargeL) and some other vertebrate species (Grewal & Hewitt, 2002). The human homologue is located on chromosome 11p11.2 and encodes a 721-amino-acid protein with 67% identity to LARGE, suggesting that the two genes may have arisen by gene duplication. Like LARGE, it is predicted to have two catalytic domains, though it lacks the coiled-coil motif present in LARGE. Hyperglycosylation of α-dystroglycan through overexpression of GYLTL1B increased its ability to bind laminin, and both genes produced the same level of increase in laminin-binding ability (Brockington, et al., 2005).
3. Bioinformatics workflow and platform design
Many public databases and bioinformatics tools have been developed and are currently available for use (Ding & Berleant, 2002). The primary goal of bioinformaticians is to develop reliable databases and effective analysis tools capable of handling large amounts of biological data. The objective of laboratory researchers, by contrast, is to study specific areas within the life sciences, which requires only a limited set of databases and analysis tools. The existing free bioinformatics tools are therefore sometimes too complicated for biologists to choose among. One solution is an expert team whose members are familiar with bioinformatics databases and also know the needs of a research group in a particular field. The expert team recommends a workflow built from selected bioinformatics tools and databanks and helps the scientists with the complicated issues of tools and databases. Moreover, such a team could organize a large number of heterogeneous sources of biological information into a specific, expertly annotated databank.
The team can also regularly and systematically update the information essential to help biologists overcome the problems of integrating and keeping up-to-date with heterogeneous biological information (Gerstein, 2000).
We have built a novel information management platform, LGTBase. This composite knowledge management platform includes the “LARGE-like GlcNAc Transferase Database”, which integrates specific public databases such as the CAZy database, and a workflow analysis that combines specific public and custom-designed bioinformatics tools to identify the members of the LARGE-like protein family.
4. Tools and database selection
To analyze a novel protein family, biologists need to understand many different types of information. Moreover, the pace of discovery in biology has been increasing exponentially in recent years, so biologists have to pick the right information from the vast resources available. To overcome these obstacles, a bioinformatics workflow can be designed for analysing a specific protein family. In our study, a workflow was designed based on the structure and characteristics of the LARGE protein, as shown in Figure 1 (Hwa et al., 2007). Unknown DNA/protein sequences are first identified as members of known gene families by using the Basic Local Alignment Search Tool (BLAST). The
DXD motif prediction is then followed by transmembrane domain prediction using the TMHMM program (version 2.0; Center for Biological Sequence Analysis, Technical University of Denmark [http://www.cbs.dtu.dk/services/TMHMM-2.0/]). The transmembrane domain is a characteristic feature of the Golgi enzymes.
The sequence motifs are then identified by MEME (Multiple Expectation-maximization for Motif Elicitation) program (version 3.5.4; San Diego Supercomputer Center, UCSD [http://meme.sdsc.edu/meme/]).
This program finds motif homology between the target sequence and other known glycosyltransferases. In addition to the tools mentioned above, the Pfam search (Sanger Institute [http://www.sanger.ac.uk/Software/Pfam/search.shtml]) can be used to compare the sequence against the multiple sequence alignments and hidden Markov models that represent many existing protein domains and families. The Pfam results indicate which protein family the peptide belongs to. If it is a protein of interest, investigators can then identify its evolutionary relationships by phylogenetic analysis.
4.1. LARGE-like GlcNAc transferase database
The annotation entries in LGTBase are compiled from information retrieved from several databases.
In the CAZy (Carbohydrate-Active enZymes) database ([http://afmb.cnrs-mrs.fr/CAZY/]), the glycosyltransferases are classified into families, clans, and folds based on their structural and sequence similarities and on mechanistic investigations. The other databases used in this platform are listed in Table 1.
|Database||Description||URL|
|EntrezGene||NCBI's repository for gene-specific information||http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene|
|GenBank||NIH genetic sequence database, an annotated collection of all publicly available DNA sequences||http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucleotide|
|Dictybase||Database for the model organism Dictyostelium discoideum||http://dictybase.org/|
|UniProtKB/Swiss-Prot||High-quality, manually annotated, non-redundant protein sequence database||http://www.uniprot.org/|
|InterPro||Database of protein families, domains and functional sites||http://www.ebi.ac.uk/interpro/|
|MGI||Integrated genetic, genomic, and biological data on the laboratory mouse||http://www.informatics.jax.org/|
|Ensembl||Genome annotation information||http://www.ensembl.org/index.html|
|HGMD||Human Gene Mutation Database; comprehensive data on human inherited disease mutations||http://www.hgmd.cf.ac.uk/ac/index.php|
|UniGene||NCBI database of the transcriptome||http://www.ncbi.nlm.nih.gov/unigene|
|GeneWiki||Transfers information on human genes to Wikipedia articles||http://en.wikipedia.org/wiki/Gene_Wiki|
|TGDB||Information about the genes involved in cancers||http://www.tumor-gene.org/TGDB/tgdb.html|
|HUGE||Results of the Human cDNA project at the Kazusa DNA Research Institute||http://zearth.kazusa.or.jp/huge/|
|RGD||Collection of genetic and genomic information on the rat||http://rgd.mcw.edu/|
|OMIM||Information on human genes and genetic disorders||http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim|
|CGAP||Gene expression profiles of normal, precancer, and cancer cells||http://cgap.nci.nih.gov/|
|PubMed||20 million citations for biomedical literature from medical journals, life science journals and related books||http://www.ncbi.nlm.nih.gov/PubMed/|
|GO||Representation of gene and gene product attributes across all species||http://www.geneontology.org/|
All the information related to the LARGE-like protein family was retrieved from these biological databases. To confirm that the information obtained was reliable, the data were scrutinized at two levels. First, the information was selected from the above-mentioned biological databases with customized programs (using the
The annotated data in the LGTBase database are divided into nine categories (Figure 2). The first category, genomic location, displays the chromosome, the cytogenetic band and the map location of the gene. The second, aliases and descriptions, displays synonyms and aliases for the relevant gene, together with descriptions of its function, cellular localization and effect on phenotype. The third category, proteins, provides annotated information about the proteins encoded by the relevant genes. The fourth, protein domains and families, provides annotated information about protein domains and families, and the fifth, protein function, provides annotated information about gene function. The sixth category, pathways and interactions, provides links to pathways and interactions. The seventh, disorders and mutations, draws its information from OMIM and UniProt. The eighth category, expression in specific tissues, shows the tissue expression values available for a given gene. The last category, research articles, lists the references related to the proteins under study. In addition, the investigator can use DNA or protein sequences to assemble the dataset for analysis with this workflow.
4.2. LARGE-like GlcNAc transferase workflow
4.2.1. Reference sequences search
Unknown DNA/protein sequences are identified as members of known gene families using the Basic Local Alignment Search Tool (BLAST). BlastP, one of the BLAST programs, searches protein databases using a protein query. We used BlastP to look for new LARGE-like proteins from different species, gathered the protein sequences of LARGE-like GlcNAc transferases and built a protein database of ‘LARGE-like protein’. This database assists in the search for more reference sequences of LARGE-like proteins.
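As an illustration of how BlastP results could be screened for candidate reference sequences, the following minimal Python sketch parses BLAST tabular output and keeps the strongest hits. It assumes the standard 12-column tabular format produced by BLAST+ (`-outfmt 6`); the E-value and alignment-length cutoffs, the function names and the example rows are hypothetical and are not taken from the LGTBase implementation.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length
    evalue: float
    bitscore: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse 12-column BLAST tabular output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        float(f[10]), float(f[11])))
    return hits


def candidate_subjects(hits: List[Hit],
                       max_evalue: float = 1e-10,
                       min_length: int = 100) -> List[str]:
    """Return unique subject IDs passing the (hypothetical) cutoffs,
    ordered from the lowest E-value upwards."""
    selected: List[str] = []
    for h in sorted(hits, key=lambda h: h.evalue):
        if (h.evalue <= max_evalue and h.length >= min_length
                and h.subject not in selected):
            selected.append(h.subject)
    return selected


# Made-up example rows, for illustration only.
EXAMPLE = "\n".join([
    "query1\tseqA\t98.0\t756\t15\t0\t1\t756\t1\t756\t0.0\t1500",
    "query1\tseqB\t67.0\t700\t200\t5\t20\t720\t10\t705\t1e-150\t900",
    "query1\tseqC\t30.0\t60\t40\t2\t5\t64\t8\t67\t0.5\t25.0",
])

if __name__ == "__main__":
    print(candidate_subjects(parse_tabular(EXAMPLE)))
```

Stricter or looser cutoffs change how many distant homologues enter the reference set, so the values above should be treated as placeholders to be tuned for the family under study.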
4.2.2. DXD motif search
In several glycosyltransferase families, the DXD motif is essential for enzymatic activity (Busch et al. 1998). We therefore first searched for the aspartate-any residue-aspartate (DXD) motif, which is commonly found in glycosyltransferases, and designed the ‘DXD Motif Search’ tool for this purpose. The input protein sequences are loaded or pasted into this tool, and the results indicate the presence or absence of the DXD motif.
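The idea behind such a motif search can be sketched in a few lines of Python. This is only an illustrative sketch: the internals of the ‘DXD Motif Search’ tool are not described here, and the function name `find_dxd` and the example sequence are hypothetical. A lookahead pattern is used so that overlapping occurrences are also reported.

```python
import re

# Asp, any residue, Asp. The lookahead makes overlapping matches visible.
DXD_PATTERN = re.compile(r"(?=(D[A-Z]D))")


def find_dxd(sequence: str):
    """Return (1-based position, motif) for every DXD occurrence."""
    seq = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    return [(m.start() + 1, m.group(1)) for m in DXD_PATTERN.finditer(seq)]


def has_dxd(sequence: str) -> bool:
    """True if the sequence contains at least one DXD motif."""
    return bool(find_dxd(sequence))


if __name__ == "__main__":
    # Made-up peptide, for illustration only.
    print(find_dxd("MKDVDDADLL"))
```

Because a three-residue pattern occurs by chance in many proteins, a positive result is best read as a filter for the later steps rather than as evidence of glycosyltransferase activity on its own.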
4.2.3. Transmembrane helices search
The LARGE protein is a member of the N-acetylglucosaminyltransferase family, and the presence of a transmembrane domain is a characteristic feature of this family. The TMHMM program predicts transmembrane helices using a hidden Markov model. The prediction gives the most probable location and orientation of the transmembrane alpha helices in the sequence, as well as the location of the intervening loop regions, indicating whether each loop lies inside or outside the cell or organelle. The program is based on the observation that an alpha helix of about 20 hydrophobic amino acids can span a cell membrane.
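The notion of a roughly 20-residue hydrophobic stretch can be illustrated with a simple sliding-window hydropathy scan. Note that this is not the hidden Markov model used by TMHMM; it is a much cruder stand-in based on the Kyte-Doolittle hydropathy scale, and the window size, the threshold of 1.6 and the function name are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(sequence, window=20, threshold=1.6):
    """Return (1-based start, mean hydropathy) for each window whose
    mean Kyte-Doolittle score reaches the threshold.
    Unknown residue codes contribute 0.0 to the mean."""
    seq = sequence.upper()
    hits = []
    for i in range(len(seq) - window + 1):
        segment = seq[i:i + window]
        mean = sum(KD.get(aa, 0.0) for aa in segment) / window
        if mean >= threshold:
            hits.append((i + 1, round(mean, 2)))
    return hits


if __name__ == "__main__":
    # Made-up sequence: a charged N-terminus, 20 leucines, a charged tail.
    example = "MKR" + "L" * 20 + "DEKR"
    for start, score in hydrophobic_windows(example):
        print(start, score)
```

A scan of this kind only flags candidate membrane-spanning stretches; it does not predict orientation or loop topology, which is what the hidden Markov model adds.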
4.2.4. MEME analysis
A motif is a sequence pattern that occurs repeatedly in a group of related protein or DNA sequences. MEME (Multiple Expectation-maximization for Motif Elicitation) represents motifs as position-dependent letter-probability matrices, which describe the probability of each possible letter at each position in the pattern. The program can thus be used to find motifs shared among the input protein sequences.
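To make the matrix representation concrete, the sketch below builds a position-dependent letter-probability matrix from a set of already aligned motif occurrences. It illustrates only the representation, not MEME's expectation-maximization search for the motif; the function name, the optional pseudocount and the example sites are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def letter_probability_matrix(sites, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Build a position-dependent letter-probability matrix.

    `sites` are equal-length motif occurrences. The result is a list with
    one dict per motif position, mapping each letter to its probability.
    """
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")

    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({aa: (counts.get(aa, 0) + pseudocount) / total
                       for aa in alphabet})
    return matrix


if __name__ == "__main__":
    # Made-up occurrences of a three-residue motif.
    pwm = letter_probability_matrix(["DVD", "DAD", "DVD", "DID"])
    for pos, column in enumerate(pwm, start=1):
        print(pos, {aa: p for aa, p in column.items() if p > 0})
```

In this example the first and third columns are fully conserved, while the middle column spreads its probability over several residues, which is the pattern expected for a DXD-like motif.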
4.2.5. Protein families search
The Pfam HMM search was used to identify the protein family to which the input protein sequences belong. The Pfam database contains the information about most of the protein domains and families. The results from the Pfam HMM search will show the relation of input protein sequences with the existing protein families and domains.
4.2.6. Phylogenetic analysis
Phylogenetic analysis was performed to find any significant evolutionary relationship between the new protein sequences and the LARGE protein family and to support our previous findings. ClustalW is a multiple alignment program that aligns two or more sequences to determine any significant consensus sequences between them (Thompson et al., 1994). This approach can also be used to search for patterns in the sequences. The phylogenetic tree was constructed with the PHYLIP program (v.3.6.9) and viewed with Treeview software (v.1.6.6). In the GlcNAc-transferase phylogenetic analysis, once the multiple alignment of all GlcNAc-transferases has been completed, it can be used to construct the phylogenetic tree. About 25 protein sequences were identified as members of the LARGE-like protein family. Using the neighbor-joining distance method, the phylogenetic tree showed that these proteins can be divided into 6 groups (Figure 3). The evolutionary history inferred from phylogenetic analysis is usually depicted as a branching, tree-like diagram that represents an estimated pedigree of the inherited relationships among the protein sequences from different species. These evolutionary relationships can be viewed either as cladograms (Chenna et al., 2003) or as phylograms (Radomski & Slonimski, 2001).
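Distance methods such as neighbor joining start from a matrix of pairwise distances computed from the multiple alignment. The sketch below computes the simplest such measure, the proportion of differing residues (p-distance), for every pair of aligned sequences. It is an illustration under stated assumptions: the distance models actually used by PHYLIP are more elaborate and are not reproduced here, and the function names and the toy alignment are hypothetical.

```python
def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Proportion of differing residues between two aligned sequences.
    Columns with a gap in either sequence are ignored."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    different = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            different += 1
    return different / compared if compared else 0.0


def distance_matrix(aligned: dict) -> dict:
    """Return {(name1, name2): p-distance} for all ordered pairs."""
    names = list(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q])
            for p in names for q in names}


if __name__ == "__main__":
    # Made-up alignment, for illustration only.
    alignment = {
        "seq_a": "MKDVD-L",
        "seq_b": "MKDAD-L",
        "seq_c": "MRDADQL",
    }
    d = distance_matrix(alignment)
    for (p, q), value in sorted(d.items()):
        print(p, q, round(value, 3))
```

A matrix of this form is what a neighbor-joining implementation takes as input; the smaller the distance between two sequences, the closer they are placed in the resulting tree.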
4.3. Organization of the LGTBase platform
The data obtained from the analyses were stored in a MySQL relational database, and the web interface was built using PHP and CGI/Java scripts. The workflow was designed according to the characteristics of LARGE-like GlcNAc transferase proteins and developed using the Java language and several open source bioinformatics programs. Tools with different languages, C,
5. Application with LARGE protein family
A protein sequence (FASTA format) can be entered into the BlastP assistant interface, enabling other known proteins with similar sequences to be identified (Figure 5). The investigator can select all the resulting sequences or use only some of them. The data can then be transferred to the DXD analysis page (Figure 6). The DXD analysis was chosen because the motif is represented in many families of glycosyltransferases, which makes it easy to narrow down the analysis of putative protein sequences to particular protein families or domains. Many online tools are available for the identification and characterization of unknown protein sequences, so the tools can be picked according to the target protein under study.
The sequences are analyzed with the DXD motif search tool (Figure 6), which selects those sequences containing the DXD motif for TMHMM analysis. The transmembrane helices can then be predicted with TMHMM (Figure 7). Transmembrane domains are predicted from the hydrophobic nature of the proteins and are mainly used to identify the cellular location of the proteins. As with transmembrane domain prediction, several other domains can be predicted from a protein’s properties, such as hydrophobicity or hydrophilicity. The sequences containing DXD motifs and transmembrane helices are then selected for MEME (Figure 8) and Pfam analysis (Figure 9). MEME analysis predicts sequence motifs that occur repeatedly in the data set and are conjectured to have biological significance. This application plays a significant role in characterizing the putative protein sequences after the initial studies with the DXD motif, transmembrane domain, and other tools. It can be used for all kinds of protein sequences, since its prediction is based on the sequence patterns present in the data set under study. The protein sequences in the dataset can be assigned to known protein families by Pfam analysis. The Pfam classification can also be used for almost all putative protein sequences because of its large collection of protein domain families represented by multiple sequence alignments and hidden Markov models. After the MEME and Pfam analyses, the ClustalW and PHYLIP programs were used for phylogenetic analysis (Figure 9) to examine the evolutionary relationships among the data sets (Figure 10). Finally, these results can be used to design experiments to be performed in the laboratory.
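The staged selection described above, in which only sequences passing one analysis are forwarded to the next, can be sketched as a chain of filters. This is a schematic illustration rather than the LGTBase code: the `run_stages` function is hypothetical, and the two predicates are simple stand-ins (a regular expression for the DXD motif and a run of 15 or more hydrophobic residues in place of a TMHMM prediction).

```python
import re
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[str], bool]]


def run_stages(sequences: Dict[str, str], stages: List[Stage]):
    """Apply the filters in order.

    Returns (kept, dropped): `kept` maps IDs to sequences that passed
    every stage, `dropped` maps IDs to the name of the stage that
    excluded them.
    """
    kept = dict(sequences)
    dropped: Dict[str, str] = {}
    for name, passes in stages:
        for seq_id in list(kept):
            if not passes(kept[seq_id]):
                dropped[seq_id] = name
                del kept[seq_id]
    return kept, dropped


def has_dxd(seq: str) -> bool:
    """Stand-in for the DXD motif search."""
    return re.search(r"D[A-Z]D", seq) is not None


def has_hydrophobic_run(seq: str) -> bool:
    """Crude stand-in for a transmembrane helix prediction."""
    return re.search(r"[AILMFVWC]{15,}", seq) is not None


STAGES: List[Stage] = [("dxd", has_dxd), ("tm", has_hydrophobic_run)]

if __name__ == "__main__":
    # Made-up sequences, for illustration only.
    candidates = {
        "p1": "MK" + "LIVAF" * 4 + "RRDVDKK",
        "p2": "MKRRDADKK",
        "p3": "MK" + "L" * 18 + "RRKK",
    }
    kept, dropped = run_stages(candidates, STAGES)
    print("forwarded to MEME/Pfam:", sorted(kept))
    print("excluded:", dropped)
```

Recording the stage at which each sequence was excluded makes it straightforward to revisit borderline cases, for example a sequence that carries the DXD motif but fails the transmembrane criterion.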
6. Future direction
We have described how to construct a computational platform to analyze the LARGE protein family. Since the platform was built on several commonly shared protein domains and motifs, it can also be modified to analyze other Golgi glycosyltransferases. Furthermore, the phylogenetic analysis (Figure 3) revealed that the LARGE protein family is related to β-1,3-
β3GnT1 (iGnT) was the first enzyme to be isolated when the cDNA of a human β-1,3-N-acetylglucosaminyltransferase essential for poly-N-acetyllactosamine synthesis was studied (Zhou et al., 1999). The poly-N-acetyllactosamine synthesized by iGnT provides a critical backbone structure for the addition of functional oligosaccharides such as sialyl Lewis X. It has been reported that β3GnT1 is involved in attenuating prostate cancer cell locomotion by regulating the synthesis of laminin-binding glycans on α-DG (Bao et al., 2009). Since the β3GnT proteins share several domains with the LARGE protein, a new platform for the β3GnT protein family can be constructed based on the original platform. In addition to β3GnT1, the β3GnT2 enzyme is responsible for elongation of polylactosamine chains. This enzyme was isolated based on structural similarity with the β3GalT family. Studies on a panel of invasive and noninvasive fresh transitional cell carcinomas (TCCs) showed strong down-regulation of β3GnT2 in the invasive lesions, suggesting a decline in the expression levels of some members of the glycosyltransferase family (Gromova et al., 2001).
The β3GnT3 and β3GnT4 enzymes were subsequently isolated based on their structural similarity with the β3GalT family. β3GnT3 is a type II transmembrane protein and contains a signal anchor that is not cleaved. It prefers lacto-N-tetraose and lacto-N-neotetraose as substrates, and it is involved in the biosynthesis of poly-N-acetyllactosamine chains and of the backbone structure of dimeric sialyl Lewis A. It plays a dominant role in L-selectin ligand biosynthesis, lymphocyte homing and lymphocyte trafficking. The β3GnT3 enzyme is highly expressed in non-invasive colon cancer cells. β3GnT4 is involved in the biosynthesis of poly-N-acetyllactosamine chains and prefers lacto-N-neotetraose as its substrate. It is a type II transmembrane protein and is more highly expressed in bladder cancer cells (Shiraishi et al., 2001). β3GnT5 is responsible for the synthesis of lactotriaosylceramide, an essential component of lacto/neolacto series glycolipids (Togayachi et al., 2001). The expression of the HNK-1 and Lewis x antigens on the lacto/neolacto series of glycolipids is regulated by β3GnT5 in a developmental and tissue-specific manner. The overexpression of β3GnT5 in human gastric carcinoma cell lines led to increased sialyl-Lewis X expression and increased
β3GnT6 synthesizes the core 3 O-glycan structure, and this enzyme is speculated to play an important role in the synthesis and function of mucin O-glycans in the digestive organs. In addition, the expression of β3GnT6 was markedly down-regulated in gastric and colorectal carcinomas (Iwai et al., 2005). Expression of β3GnT7 has been reported to be down-regulated upon malignant transformation (Kataoka et al., 2002). Elongation of the carbohydrate backbone of keratan sulfate proteoglycan is catalyzed by β3GnT7 and β1,4-galactosyltransferase 4 (Hayatsu et al., 2008). β3GnT7 can transfer GlcNAc to Gal to synthesize a polylactosamine chain, with each enzyme differing in its acceptor molecule preference. Polylactosamine and related structures play crucial roles in cell-cell interaction, cell-extracellular matrix interaction, immune response and the determination of metastatic capacity. The β3GnT8 enzyme extends a polylactosamine chain specifically on tetraantennary N-glycans. β3GnT8 transfers GlcNAc to the non-reducing terminus of the Galβ1-4GlcNAc of tetra antennary
Our initial motif analysis predicted three important functional motifs that are commonly found among the β3GnT enzymes. The first is a structural motif necessary for maintaining the protein fold. The second, the DXD motif represented in many glycosyltransferases, is involved in binding the nucleotide-sugar donor substrate, both directly and indirectly through coordination of metal ions such as magnesium or manganese in the active site. The third is a glycine-rich loop found at the bottom of the active site cleft. This loop is likely to play a role in the recognition of both the GlcNAc portion of the donor and the substrate. Since these three common motifs of β3GnT are similar to those of the LARGE protein family, it is feasible to modify the current LARGE platform to analyze other Golgi glycosyltransferases such as β3GnT.