Summary of major allergen databases and their relevant features.
Allergic diseases are considered one of the major health problems worldwide because of their increasing prevalence. Advances in genomic, proteomic, and analytical techniques have brought considerable progress in allergology and have led to the accumulation of a huge amount of data. Allergen bioinformatics comprises allergen-related data resources and computational methods/tools for the efficient archival, management, and analysis of allergological data. Significant work in this area has proven pivotal for the development and progress of the field. In this chapter, we describe the current status of databases and algorithms in allergen bioinformatics by examining the work carried out thus far on allergens and allergenicity, allergen databases, algorithms/tools for allergen/allergenicity prediction, allergen epitope prediction, and allergenic cross-reactivity assessment. The chapter illustrates concepts and algorithms in allergen bioinformatics and outlines the key areas for potential development in allergology.
- allergen databases
- allergen epitope prediction
- allergenic proteins
The immune system is a very complex system comprising numerous biological molecules and processes, which combine to form the body’s defense against infectious agents and other threats. Immunity is divided into two types: innate immunity and adaptive immunity. Innate immunity, also referred to as natural, native, or nonspecific immunity, acts as the first line of defense against common harmful agents. The innate immune response provides immediate protection and involves a number of components such as monocytes, macrophages, neutrophils, cytokines, complement, and epithelial barriers. Adaptive or acquired immunity comprises highly specific immune responses that are elicited against particular pathogens or antigens. These immune responses are either cell mediated or antibody mediated (humoral) and are executed by specialized lymphocytes or immunoglobulins, respectively. On certain occasions, the immune system produces responses that are harmful to the host organism. Autoimmunity is one such case, wherein the body elicits immune responses against its own cells and tissues (self-antigens), leading to the development of autoimmune diseases. In other cases, the immune system produces inappropriate immune responses known as hypersensitivity, which have deleterious effects on the host organism. Hypersensitivity reactions are categorized into four groups based on the type of immune response and the effector mechanism involved: (i) immediate hypersensitivity (type I), (ii) antibody-mediated hypersensitivity (type II), (iii) immune complex–mediated hypersensitivity (type III), and (iv) cell-mediated hypersensitivity (type IV).
Allergic reactions are type I hypersensitivity reactions, which are characterized by the induction of a specific class of antibodies known as immunoglobulin E (IgE). These reactions are elicited against a specific type of antigen commonly referred to as an allergen. An allergic reaction involves specialized cells and specific molecules of the immune system. IgE antibodies induced by allergens upon allergic sensitization bind to effector cells such as basophils and mast cells via specific Fc receptors present on the surfaces of those cells. Subsequent exposure to the allergen causes cross-linking of membrane-bound IgE on these effector cells, which leads to their degranulation and the release of pharmacologically active agents such as histamine. These pharmacological mediators are responsible for the clinical manifestations of allergic reactions in affected individuals. Immunogenicity in general refers to the potential of an antigen to elicit an immune response, while allergenicity reflects the allergenic potential of an allergen: its capability to induce clinical symptoms of allergy as well as to induce and bind to IgE antibodies. The prevalence of allergic reactions has increased significantly in the last few years, especially in developing countries. This has resulted in a considerable increase in disease burden as well as in the economic costs associated with these diseases. Therefore, the study of allergic diseases has gained tremendous importance, as they represent one of the major health problems in both urban and rural regions.
The field of allergy research has progressed rapidly in the last few years. Recent advances in genomic, proteomic, and analytical methods have given rise to large amounts of data relevant to allergens. These data can be correlated with the pathology of various allergic diseases based on experimental, clinical, and epidemiological data for allergic reactions. The continuous growth of data calls for its efficient archival, management, and analysis. This has led to the development of the field of allergen informatics, which comprises allergen-specific databases/resources and computational methods/tools. Allergen informatics constitutes an important branch of immunoinformatics. This chapter reviews the current status of allergen informatics with respect to allergens and allergenicity, allergen databases, algorithms/tools for allergen/allergenicity prediction, allergen epitope prediction, and allergenic cross-reactivity assessment (Figure 1).
2. Allergens and allergenicity
Allergens represent the most critical component of an allergic reaction, although IgE antibodies, Fc receptors, mast cells, and basophils, as well as pharmacological mediators such as histamine and heparin, also play very significant roles. Allergens are ubiquitous substances that arise from a variety of sources such as foods, plants, animals, or the environment. An allergen can be either a chemical substance (e.g., penicillin) or a protein (e.g., albumin, profilin, etc.). The majority of allergens are proteins or glycoproteins with high water solubility. Several biochemical and structural features of allergens, such as stability, hydrophobicity, and ligand-binding domains, are known to contribute to their allergenicity. However, common molecular and structural features of allergens that are responsible for allergenicity have not yet been conclusively identified.
Allergens are assigned a unique, unambiguous, and systematic nomenclature, which is developed and maintained by the Allergen Nomenclature Sub-committee of the World Health Organization (WHO) and the International Union of Immunological Societies (IUIS) [9, 10]. The nomenclature is based on the Linnean system, and an allergen that satisfies certain biochemical and immunological criteria is included in the WHO/IUIS nomenclature. An allergen name consists of an abbreviation of the scientific name of the allergen source organism: the first 3–4 letters denote the genus, the subsequent 1–2 letters denote the species, and an Arabic numeral denotes the order of identification. For instance, Der p 1 is the first allergen to be characterized from the house dust mite Dermatophagoides pteronyssinus. An allergen may possess isoallergens or isoforms/variants, which are considered multiple molecular forms of the same allergen. The WHO/IUIS nomenclature defines an isoallergen as an allergen from a single species with a similar molecular size, identical biological function, and ≥67% amino acid sequence identity, while a variant or isoform corresponds to allergen sequences that differ by only a limited number of amino acid substitutions. It is very important to archive and study the data on isoallergens and isoforms/variants separately, as variations in allergens have been shown to significantly affect their allergenicity and cross-reactivity and to influence the recognition of epitopes by T cells and IgE. An allergen is classified as major or minor based on its allergenicity. Major allergens are those to which >50% of patients with an allergy to the source are sensitized, while minor allergens are recognized by a limited number of patients.
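As an illustration of the naming rule described above, the following minimal Python sketch splits a name such as Der p 1 into its genus abbreviation, species abbreviation, and number. The regular expression, the function names, and the treatment of edge cases are assumptions made for this example and are not part of the WHO/IUIS specification; isoallergen and variant suffixes are deliberately not handled, and the identity check covers only one of the isoallergen criteria listed in the text.

```python
import re
from typing import NamedTuple, Optional


class AllergenName(NamedTuple):
    genus: str    # 3-4 letter genus abbreviation, e.g. "Der"
    species: str  # 1-2 letter species abbreviation, e.g. "p"
    number: int   # order of identification, e.g. 1


# Assumed pattern: capitalized genus (3-4 letters), lowercase species
# (1-2 letters), Arabic numeral, separated by single spaces.
_NAME_RE = re.compile(r"^([A-Z][a-z]{2,3}) ([a-z]{1,2}) (\d+)$")


def parse_allergen_name(name: str) -> Optional[AllergenName]:
    """Return the parsed name, or None if it does not fit the basic pattern."""
    match = _NAME_RE.match(name.strip())
    if match is None:
        return None
    genus, species, number = match.groups()
    return AllergenName(genus, species, int(number))


def meets_isoallergen_identity(percent_identity: float,
                               threshold: float = 67.0) -> bool:
    """Check only the sequence identity criterion (>= 67%) for isoallergens.

    Species, molecular size, and biological function are not examined here.
    """
    return percent_identity >= threshold


if __name__ == "__main__":
    print(parse_allergen_name("Der p 1"))
    print(meets_isoallergen_identity(72.5))
```

A parser of this kind is only a convenience for organizing records; whether a name is valid is decided by the nomenclature sub-committee, not by the pattern.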
Allergens display important features, such as epitopes and cross-reactivity, that are critical for understanding allergic reactions and for developing newer approaches to the diagnosis and treatment of allergic diseases. An epitope, or antigenic determinant, is the immunologically active region of the allergen. An epitope can be an IgE-binding epitope or a T-cell epitope depending on whether it interacts with an IgE antibody or a T-lymphocyte. An IgE epitope can be either sequential (linear), consisting of a contiguous stretch of amino acids, or conformational (discontinuous), comprising amino acids located at different positions in the antigen sequence. An antibody is said to be cross-reactive when it recognizes and binds to multiple antigens.
2.1. IgE-binding epitopes
IgE-binding epitopes are the IgE recognition sites in allergens that are involved in the specific interaction between allergens and IgE antibodies. Inferences drawn from allergen–antibody complexes and other important studies have shown that the majority of IgE-binding epitopes are conformational in nature. IgE epitopes possess some defining structural and immunological features: they are more cross-reactive and have higher intrinsic flexibility. These features make them distinct from other antibody epitopes and contribute significantly to allergenicity [15, 16]. Identification and in-depth analysis of IgE-binding epitopes has the potential to contribute immensely to accurate diagnosis and allergen-specific immunotherapy of allergies, especially food allergy [17, 18]. A large amount of data on allergen epitopes has been generated by strategies based on overlapping synthetic peptides, recombinant allergenic fragments, cocrystal structure complexes, etc. However, it is believed that insights obtained from the study of allergen–antibody complexes will be the most helpful in understanding the role these epitopes play in allergic reactions [19–21].
2.2. T-cell epitopes
T-cell epitopes are the antigenic determinants of allergens that interact with T-lymphocytes via specific T-cell receptors. T-cell epitopes of allergens have been shown to be very important for the modulation of the allergic response, thereby contributing to the symptoms associated with allergic diseases. Considering their fundamental role in the allergic response, they have enormous potential in the development of allergy vaccines as well as newer strategies in allergen immunotherapy [23, 24]. Recent findings have indicated that the T-cell epitope repertoire in allergens is more diverse than that of IgE epitopes and that it can be very useful in specific immunotherapy for allergy. An analysis of the available epitope data has shown that T-cell epitopes are known to occur more commonly in airborne allergens than in food allergens.
Cross-reactivity is a clinically and immunologically critical phenomenon displayed by allergens from various sources and is the cause of pollen-food syndromes, such as the one seen with birch and apple. Cross-reactivity is considered a property of antibodies, and it arises when an antibody or a subgroup of antibodies recognizes more than one allergen or epitope. Two allergens are considered cross-reactive if they are recognized by a single antibody (or T-cell receptor). It has been stated that cross-reactivity among allergens at the level of B cells, T cells, and mast cells reflects clinical sensitivities and contributes very significantly to the regulation of allergic sensitization.
Cross-reactivity is predominantly an antibody-defined phenomenon, and IgE antibodies have been shown to be more cross-reactive in nature. The affinity of the antibodies toward the allergen is known to play an important role in cross-reactivity. However, the properties of the allergenic protein are also very important, and shared features at the level of both the primary and tertiary structures of the cross-reactive proteins are found to be responsible for cross-reactivity. Sequence similarity is an important indicator, and cross-reactivity seems to require more than 70% sequence identity. In addition, other factors such as the host immune response against the allergen, the dosage of the allergen, and the mode of exposure also contribute to the clinical relevance of allergic cross-reactivity. Inferences drawn from studying a large number of allergens have led to the conclusion that structural similarity among proteins from diverse sources is the molecular basis of allergic cross-reactivity. Considering the role it plays in the development of allergic symptoms, a detailed analysis of cross-reactivity has the potential to contribute to new strategies in the diagnosis and therapy of allergic diseases.
3. Allergen databases
The last few years have witnessed substantial technological advances in genomics and proteomics along with tremendous improvements in analytical methods. This has led to significant progress in allergy research and, as a result, to a steady increase in the number of characterized protein allergens. Efficient storage and management of data has become very important because of this continuous accumulation of molecular and clinical data on allergens. Allergen databases therefore represent crucial resources for basic allergy research, as they archive the available allergen knowledge.
|Database|Developed by (URL)|Type of data archived|Computational tools (if any)|…|
|IUIS Allergen|WHO/IUIS Allergen …|…|…|…|
|Allergome|Centre for Clinical …|…|…|…|
|SDAP|Sealy Centre for … University of Texas, USA|…|…|…|
|ADFS|National Institute of …|Sequence, structure, IgE epitopes, small molecule …|…|…|
|AllergenOnline|Food Allergy and … Resource Program (FARP)|…|…|…|
|AllFam|Department of Pathophysiology and Allergy Research, Medical University of …|Allergen family data, cross-link to Pfam|…|…|
|AllergenPro|The National Agricultural …|Sequence, IgE epitopes|Yes|2015|
|AllerBase|Bioinformatics Centre, Savitribai Phule Pune …|Sequence and structure … epitopes, IgE antibody, experimental evidences of …|…|…|
Many allergen-specific databases have been developed in the past few years, although they differ from each other with respect to their objectives, type of data archived, accessibility of contents, and level of annotation and applications. In addition to dedicated allergen databases, primary bioinformatics databases also document significant data on allergens. Examples include GenBank/GenPept [31, 32], UniProtKB, and the Protein Data Bank (PDB), which archive sequence and structure data on allergens along with their annotation. A summary of allergen-specific databases is provided in Table 1. The existing allergen databases are described in the following sections.
3.1. IUIS Allergen database
The IUIS Allergen Nomenclature Sub-Committee, under the auspices of the WHO, provides the systematic nomenclature of allergenic proteins and develops and maintains the Allergen database [35, 36]. The database archives all WHO/IUIS-recognized allergens along with their isoallergens and isoforms (variants). To maintain a consistent nomenclature, researchers are required to submit newly described allergens to the Allergen Nomenclature Sub-Committee before submitting their manuscript to a journal.
Each allergen in this database is annotated with its biochemical name, molecular weight, information on its allergenicity, references, etc. Sequence data for allergens and isoallergens/isoforms are also stored in the database, along with cross-references to GenBank, GenPept, and UniProtKB for nucleotide and protein sequences and to PDB for 3D structure data. The Allergen database can be searched by allergen name, biochemical name, allergen source organism, taxonomic group, etc. The database is updated continuously with the specific names assigned to newly discovered allergens and isoallergens/variants. Although it documents the majority of characterized allergens, the Allergen database is not comprehensive, because a large number of allergens reported in the literature are not recognized by IUIS. The database does not archive data on allergen epitopes and cross-reactivity.
3.2. Allergome database
Allergome is an extensive repository of information on allergen molecules causing IgE-mediated (allergic, atopic) diseases. The database comprises comprehensive data on WHO/IUIS-approved allergens as well as allergens not recognized by WHO/IUIS. These allergenic molecules are selected and curated from the published literature and web-based resources. It also contains data on allergenic sources, whether or not allergenic molecules have been identified in them. Allergome documents information on allergens and isoallergens/isoforms along with their sequences. Cross-links to sequence and structure databases such as UniProtKB and PDB are also provided.
Allergome can be searched using basic and advanced search options. Basic search employs criteria such as allergen name, biochemical name, source organism, etc., while advanced search enables the user to search using specific attributes. Each allergen molecule is represented by a monograph that is organized into three parts: basic information, data on the native form, and data on the recombinant form. The most important and unique feature of the Allergome platform is the presence of several support modules that archive specific aspects of allergen data. Two important modules are RefArray, for easy access to the references stored in Allergome, and Real Time Monitoring of IgE sensitization (ReTiME), for real-time collection and storage of IgE sensitization data; a number of other utilities are also available. Allergome is updated regularly with allergen data curated from the literature.
3.3. Structural database of allergenic proteins
The Structural Database of Allergenic Proteins (SDAP) is an allergen database that deals prominently with the structural aspects of allergens. It houses comprehensive cross-referenced sequence data on allergens, IgE-binding epitopes, 3D structures, and models of allergens. Each allergen in SDAP is cross-linked to primary databases such as UniProtKB and PDB, as well as to important resources such as the NCBI Taxonomy Browser and PubMed for literature references. SDAP also serves as a web server that integrates various computational tools to assist structural biology–related studies of allergens and their epitopes. It employs an algorithm based on the conserved properties of amino acid side chains to detect regions associated with allergenicity in novel sequences. The database includes a number of tools that can be used to assess the potential cross-reactivity of allergens and to screen for IgE epitopes in novel proteins. The database was last updated on February 25, 2013. SDAP does not archive complete data for allergens that are not recognized by IUIS, and it does not document data on allergen cross-reactivity.
3.4. Allergen database for food safety
The Allergen Database for Food Safety (ADFS) was developed as a project of the Division of Biochemistry and Immunochemistry of the National Institute of Health Sciences (Japan). The aim of the database is to archive allergenic proteins and their IgE epitopes, with special emphasis on food allergens and food safety. Allergens archived in ADFS are grouped into eight categories (pollen, mite, animal, fungus, insect, food, latex, and others), and each allergen entry is provided with the primary database accession numbers of its gene and 3D structure information. The database is also equipped with a homology-based sequence search tool for the evaluation of allergenicity. One of the most distinctive features of ADFS is the archival of data on small-molecule, nonprotein (chemical) allergens. The database does not archive data on allergen cross-reactivity.
3.5. AllergenOnline database
AllergenOnline is a well-curated allergen database that documents a peer-reviewed allergen list compiled from various resources such as IUIS-Allergen, PubMed, scientific publications, and other allergen databases. The database was developed within the Food Allergy Research and Resource Program (FARRP) at the University of Nebraska. For each allergen, it provides the source organism, common name, IUIS official nomenclature, protein length, class of allergen (food allergen, contact allergen, etc.), and a link to the NCBI protein (GenPept) database. AllergenOnline also provides sequence-based searches for allergens, which include alignments by FASTA and an eight-amino-acid short-sequence identity search. This utility can be very useful in identifying proteins that may present a potential risk of allergenic cross-reactivity. AllergenOnline is updated every year; the most recent update, which resulted in version 16 of the database, was reported on January 27, 2016. It does not archive data on allergen epitopes or on allergenic cross-reactivity.
3.6. AllFam database
AllFam is a very important resource for allergens, as it classifies allergens into protein families. This classification has shown that allergens are distributed into relatively few protein families and possess a limited number of biochemical functions. The structural classification of allergens in AllFam is performed using family information from Pfam and the Structural Classification of Proteins (SCOP) database, while the biochemical functions of allergens are extracted from the Gene Ontology annotation database. The database allows browsing of allergen families by allergen source (plants, animals, and fungi) and route of exposure (inhalation, ingestion, etc.), and searches for specific protein families can also be performed. Each allergen family in AllFam is linked to a family fact sheet that describes the biochemical properties of the family members and lists key references related to the family. The last update of AllFam was reported on September 12, 2011. AllFam does not archive data on the molecular features of individual allergens, although cross-links to IUIS-Allergen and Allergome are provided for each documented allergen.
3.7. AllergenPro database
AllergenPro is a recently developed allergen database that archives data on allergen sequences, structures, and epitopes from various sources. It is an integrated database that provides information about allergens in foods, microorganisms, fungi, animals, and plants. It can be searched for allergens by keyword as well as by sequence. AllergenPro is also equipped with a computational tool for the prediction of allergenicity, which is based on three different approaches: an FAO/WHO guidelines (sequence)–based approach, a motif-based approach, and an epitope-based approach. The database was last updated on June 4, 2015. AllergenPro does not archive data on allergen cross-reactivity, and literature references for the documented allergens and epitopes are not provided.
3.8. Archival of allergen epitope data
As mentioned earlier, epitopes are a very important feature of allergens, as they play a vital role in allergic diseases. Because the molecular characterization of allergens has increased immensely in recent years, the data on allergenic epitopes have also grown significantly. It has therefore become necessary to store and manage the epitope data for efficient utilization.
Some of the allergen databases described above, such as SDAP, ADFS, and AllergenPro, store allergen epitope data. A few databases dedicated to epitope data from all types of antigens also document information on allergen epitopes [54–56]. However, the allergy-associated epitope data stored in these databases may not be comprehensive. The Immune Epitope Database (IEDB), which is a repository of immune epitope reactivity data, is also a major database of allergy-derived epitope data. It archives extensive allergen epitope data along with the associated biological assays, including IgE-binding as well as T-cell epitopes curated and compiled from allergy-related references. IEDB is also equipped with several strategies for efficient searching and visualization of data on allergy-related epitopes. It therefore represents a very useful and user-friendly platform for allergists to access and retrieve allergy-related epitope data. A study that classified all the epitope-specific literature in various immunological domains found that IEDB comprises relatively fewer references for allergy-derived epitopes than for cancer and infectious diseases. This indicates that there is considerable scope for more in-depth archival of allergen epitope data from the literature. Another study, a meta-analysis of the allergy-associated epitope data in IEDB, indicated that relatively less data is archived for allergen T-cell epitopes than for IgE epitopes.
3.8.1. AllerBase database
Observations from the study of the existing allergen databases indicated that they archive significant data on various aspects of allergens and allergenicity, although the level of completeness of data differs considerably across allergen features. AllerBase is a recently developed comprehensive database of allergens and allergen features that addresses some of the limitations of the existing allergen databases [Kadam et al. 2016, unpublished]. The database comprises extensive data on experimentally validated allergens and allergen-specific features such as IgE-binding epitopes, IgE cross-reactivity, IgE antibodies, and evidence for the experimental validation of allergens. AllerBase provides basic and advanced search utilities along with a browse option to retrieve the desired allergen data. Important features of the database include the Completeness Index, which represents the availability of data for various features of each allergen, and a structure visualization utility. AllerBase also provides cross-references to several immunological and allergen databases and represents a notable instance of the integration of allergen data from a number of resources.
4. Computational prediction of allergens/allergenicity
Allergens mainly comprise commonly occurring proteins in foods, pollens, and other biological entities in the environment. Considering the health hazards associated with allergic reactions, it has become necessary to assess the potential allergenicity of these proteins. In recent years, genetic engineering and food processing methods have been routinely employed to modify existing proteins or introduce new ones. Analysis of the allergenicity of such proteins/products, along with newly introduced biopharmaceuticals, is essential in order to avoid the transfer of an allergenic molecule. Computational assessment or prediction of allergenicity represents a major approach to testing for allergenicity, and numerous bioinformatics tools/methods have been employed successfully for this purpose. The majority of these methods utilize the amino acid sequence of allergens along with its various features, while very few approaches use structure information. Table 2 lists the computational tools/servers available for the prediction of allergens/allergenicity. The prominent approaches used for the computational assessment or prediction of allergens/allergenicity are described briefly in the following sections.
|No.|Method (URL)|Approach used|Efficiency|
|…|…|… sequence and SVM|AROC = 0.90, SE = 86%, SP = 86%|
|…|…|Sequence features and SVM, sequence motifs, …|Accuracy = 85%, SE = 88%, SP = 81%|
|…|…|Sequence based descriptors, auto and cross-covariance, …|Accuracy = 85.3%, SE = 82.5%, SP = 88.1%|
|…|…|DFLAP algorithm and SVM|–|
|…|…|Iterative pairwise sequence similarity and SVM|AROC = 0.928, accuracy = 95.3%, SE = 83.4%, SP = 96.4%|
|…|…|Sequence based features …|MCC = 0.95, SE = 93%, SP = 99.9%|
|…|…|AFFP dataset, normalized BLAST E-values and …|MCC = 0.97, SE = 98.6%, SP = 98.4%|
|…|…|…|Accuracy = 93.42%|
|…|…|Sequences as text documents, Naive Bayes classifier and SVM|…|
|…|…|Integration of methods based on FAO/WHO guidelines, sequence motifs and …|…|
|…|…|Auto and cross-covariance, …|Accuracy = 88%, MCC = 0.759|
|…|…|Fuzzy rule based system|–|
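The efficiency column of the table above reports sensitivity (SE), specificity (SP), accuracy, and the Matthews correlation coefficient (MCC); AROC denotes the area under the ROC curve and is not covered here. As a reminder of how the first four measures relate to a confusion matrix, a minimal Python sketch using the standard definitions is given below. The function name and the example counts are hypothetical and are not taken from any of the listed tools.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute SE, SP, accuracy, and MCC from confusion-matrix counts.

    tp/fn: allergens predicted as allergen / non-allergen
    tn/fp: non-allergens predicted as non-allergen / allergen
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in the test set")

    sensitivity = tp / positives
    specificity = tn / negatives
    accuracy = (tp + tn) / (positives + negatives)

    # MCC is set to 0 when any marginal total is zero (undefined case).
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return {"SE": sensitivity, "SP": specificity,
            "accuracy": accuracy, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts for a balanced test set of 100 + 100 proteins.
    print(classification_metrics(tp=86, tn=86, fp=14, fn=14))
```

Because accuracy depends on the ratio of allergens to non-allergens in the test set, values reported for different tools are directly comparable only when the underlying datasets are similar.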
4.1. Sequence similarity-based approaches
One of the first studies dealing with the analysis of allergenicity was put forth by Metcalfe et al., who proposed a decision tree–based approach for the allergenicity assessment of foods derived from genetically modified crops. The first computational approach for the assessment of allergenicity was provided by the Codex Alimentarius Commission of FAO/WHO [64, 65]. It stated that a protein can be regarded as an allergen if, when compared with the sequence of a known allergen, it shares an exact match of at least six contiguous amino acids or shows more than 35% similarity over a window of 80 amino acids. This approach has been widely used to predict allergenicity, and a number of web servers for allergen prediction are based on it. Allermatch, AllerTool, and AllergenPro are some of the prominent web servers that employ these FAO/WHO guidelines for allergen prediction. Additionally, some of the major allergen databases, such as SDAP and AllergenOnline, also utilize this strategy for allergenicity prediction. A recent study by Verma et al. showed that the sequence similarity-based approach gives substantially better results when used in combination with other bioinformatics methods. However, several studies have indicated that approaches based on these guidelines are not highly efficient at identifying allergenic proteins and often lead to false or irrelevant allergenicity estimations [69–71]. As a result, it became necessary to discover and employ other strategies for the prediction of allergenicity.
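The two FAO/WHO criteria described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure used by any of the servers named in the text: the 35% criterion is computed here as the fraction of identical positions in ungapped 80-residue windows, whereas practical implementations typically rely on an alignment program such as FASTA. All function names are hypothetical.

```python
def shares_kmer(query: str, allergen: str, k: int = 6) -> bool:
    """True if query and allergen share an exact stretch of k residues."""
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def best_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Highest fraction of identical positions over ungapped window pairings.

    Matches are always divided by the full window length, so sequences
    shorter than the window are scored conservatively.
    """
    best = 0.0
    for i in range(max(1, len(query) - window + 1)):
        q_win = query[i:i + window]
        for j in range(max(1, len(allergen) - len(q_win) + 1)):
            a_seg = allergen[j:j + len(q_win)]
            matches = sum(1 for a, b in zip(q_win, a_seg) if a == b)
            best = max(best, matches / window)
    return best


def flagged_by_fao_who(query: str, allergen: str, k: int = 6,
                       window: int = 80, cutoff: float = 0.35) -> bool:
    """Apply both criteria: shared k-mer OR window score above the cutoff."""
    return (shares_kmer(query, allergen, k)
            or best_window_identity(query, allergen, window) > cutoff)


if __name__ == "__main__":
    # Toy sequences for illustration only, not real allergens.
    toy_allergen = "ACDEFGHIKLMNPQRSTVWY" * 4
    print(flagged_by_fao_who("WWWWACDEFGWWWW", toy_allergen))
```

The six-residue rule in particular is easy to satisfy by chance in long proteins, which is consistent with the false estimations reported for guideline-based approaches.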
4.2. Motif-based approaches
Stadler and Stadler observed that the use of sequence motifs, which represent the secondary structures of proteins, performs significantly better than the approach based on FAO/WHO guidelines. Their method employs MEME motifs with a length of 50 residues for the prediction of allergenicity, using pairwise sequence alignment with a certain threshold. WebAllergen is a web server for the prediction of allergenic proteins that is also based on specific detectable allergenic motifs in known allergens. Furthermore, a study carried out by Kong et al. showed that an approach based on searching for multiple motifs is more specific and efficient than the conventional single-motif search. AlgPred and AllergenPro are important web servers for allergen prediction in which one of the prediction approaches is based on allergen-derived motifs. A recent study that used computational approaches to compare allergens and metazoan parasite proteins reported significant sequence and structure similarity between parasite proteins and allergenic proteins. The analysis was carried out using sequence and structural motifs in allergens, and a workflow was developed for the computational analysis of parasite proteins.
4.3. Machine learning–based approaches
Recent years have witnessed a tremendous increase in the application of machine learning methods for solving biological problems. Machine learning–based approaches have been widely used for predicting various aspects of protein function. These methods are also employed routinely for the development of algorithms to predict the allergenicity of novel proteins.
Although Support Vector Machine (SVM) is the most commonly used machine learning method for allergen prediction, other methods have also been frequently employed. One of the earliest methods, developed by Zorzet et al., utilizes a k-Nearest-Neighbor (kNN) classification algorithm for the prediction of allergenicity, while a Bayesian classifier was employed by Soeria-Atmadja et al. for the same purpose. An approach based on the combination of a hidden Markov model (HMM) and conserved motifs in allergens was also used to successfully predict protein allergenicity. Dimitrov et al. developed two artificial neural network (ANN)-based algorithms for allergenicity prediction, which utilize descriptors derived from amino acids that denote their structural and physicochemical properties. AllerTOP is an online bioinformatics tool for the computational prediction of allergens. This algorithm employs descriptors that denote the chemical properties of amino acids in allergen sequences, together with an auto- and cross-covariance transformation and several machine learning methods for classification. These methods are random forest, multilayer perceptron, logistic regression, decision tree, naïve Bayes, and kNN.
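The auto- and cross-covariance (ACC) transformation mentioned above turns a variable-length sequence into a fixed-length feature vector, which is what most classifiers require. The following is a minimal sketch of that idea. The descriptor values in `TOY_DESCRIPTORS` are hypothetical placeholders covering only four residues, not the published amino acid descriptors, and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical two-component descriptors, for illustration only.
TOY_DESCRIPTORS: Dict[str, Tuple[float, float]] = {
    "A": (0.1, -0.5), "C": (-1.0, 0.3), "D": (3.0, 1.2), "K": (3.0, -1.1),
}


def encode(sequence: str,
           descriptors: Dict[str, Tuple[float, float]]) -> List[List[float]]:
    """Replace each residue by its descriptor vector."""
    return [list(descriptors[residue]) for residue in sequence]


def acc_transform(matrix: List[List[float]], max_lag: int) -> List[float]:
    """Auto- and cross-covariance over lags 1..max_lag.

    For descriptors j and k and lag l, the term is the mean over positions i
    of matrix[i][j] * matrix[i + l][k]. The output length is
    max_lag * d * d, independent of sequence length.
    """
    n = len(matrix)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    d = len(matrix[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(matrix[i][j] * matrix[i + lag][k]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features
```

Because the output length depends only on the number of descriptors and the maximum lag, proteins of different lengths can be passed to the same classifier. The maximum lag must stay below the length of the shortest sequence in the dataset.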
A number of web-based tools/servers have been developed that use SVM for the classification/prediction of allergens. AlgPred is one of the earlier web servers developed for the prediction of allergenic proteins. It employs SVM with amino acid composition and dipeptide composition as features of allergens, achieving accuracies of 85.02% and 84.00%, respectively. EVALLER is another web server created for in silico determination of potential allergenicity with very good efficiency. It performs detection based on the filtered length–adjusted allergen peptides (DFLAP) algorithm and SVM. The AllerTool web server also applies an SVM-based algorithm for the prediction of allergenicity and provides a sensitivity and specificity of 86.00%. AllerHunter is an important web-based computational system for allergenicity assessment, which uses a scheme based on iterative pairwise sequence similarity encoding along with SVM. The method is very efficient, with a sensitivity of 83.4% and a specificity of 96.4%.
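Amino acid composition and dipeptide composition, the two feature sets mentioned above, are simple to compute. The sketch below shows one common way of doing so; the function names are assumptions, residues outside the 20 standard amino acids are ignored, and the resulting vectors would then be handed to a classifier such as an SVM, which is not shown here.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each of the 20 standard residues (20 features)."""
    residues = [r for r in sequence if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / total for a in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> List[float]:
    """Fraction of each overlapping residue pair (400 features)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    total = len(pairs)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [pairs.count(d) / total for d in DIPEPTIDES]
```

The dipeptide vector captures local residue order that the 20-dimensional composition vector discards, at the cost of a much sparser representation for short sequences.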
The web-based tool APPEL was developed for the prediction of allergenic proteins and employs physicochemical and structural features derived from the allergen sequence in combination with SVM. Zhang et al. have developed an online allergen prediction tool titled SORTALLER, which is based on an allergen family featured peptide (AFFP) dataset and employs SVM as a classifier. An algorithm developed by Mohabatkar et al. for the prediction of allergenic proteins utilizes pseudo amino acid composition (PseAAC) along with SVM and provides an accuracy of 91.19%. PREAL is a web-based tool that performs allergen prediction by using SVM along with feature selection methods such as maximum relevance minimum redundancy (mRMR) and incremental feature selection (IFS). In a signal-processing bioinformatics approach, a combination of a hydrophobicity amino acid index and the discrete Fourier transform along with an SVM classifier is employed for highly efficient prediction of allergenicity. Allerdictor is a web server that specializes in large-scale allergen discovery. It models protein sequences as text documents and employs SVM in text classification for carrying out allergen prediction.
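Treating a protein sequence as a text document, as described for Allerdictor above, requires a rule for splitting the sequence into "words". A minimal sketch follows, assuming overlapping k-mers serve as words; the word length `k = 6` and the function names are illustrative assumptions, and the downstream text classifier is not shown.

```python
from collections import Counter
from typing import List


def sequence_to_words(sequence: str, k: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers, read as the words of a document."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bag_of_words(sequence: str, k: int = 6) -> Counter:
    """Word counts for one sequence, the usual input to a text classifier."""
    return Counter(sequence_to_words(sequence, k))
```

Once every sequence is reduced to word counts, standard text classification machinery can be applied without any sequence alignment step, which is one reason such a representation suits large-scale screening.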
4.4. Other approaches
A study carried out by Wang et al. evaluated sequence-, motif-, and SVM-based approaches for the computational prediction of allergens and also performed parameter optimization to obtain better performance. The resulting methods from this study are integrated and made available as a web application titled proAP. AllergenFP is a recently developed web server for allergenicity prediction that utilizes an alignment-free, descriptor-based fingerprint approach. The descriptors used here are important properties of amino acids, such as size, hydrophobicity, relative abundance, and helix- and beta-strand-forming propensities. In a structure-based approach proposed by Bragin et al., information derived from the protein 3D structure is used to represent the protein surface as patches designated as discontinuous peptides. Prediction of allergenic proteins based on this approach was observed to give better accuracy. Vijayakumar and Lakshmi have developed a fuzzy inference system–based algorithm for allergenicity prediction that utilizes five different modules. These modules include a machine learning classifier, motif search, sequence similarity, and the FAO/WHO evaluation scheme. FuzzyApp, a web server based on a fuzzy rule–based system, was then developed for the prediction of allergenicity. Jiang et al. performed an analysis of food allergens using a computational model that simulates gastric fluid digestion. This study stated that food allergens could be classified as alimentary canal-sensitized and nonalimentary canal-sensitized allergens based on their digestibility in simulated gastric fluid.
5. Computational prediction of allergen epitopes
Epitopes represent distinctive amino acid residues on antigens and are important determinants of an immune response. Identification of epitopes is considered a key aspect of designing highly effective multiple-subunit vaccines and developing efficient diagnostic and therapeutic methods against allergens. Although experimental methods have been very useful for the identification of epitopes, their usefulness is restricted by their time- and cost-intensive nature and their inability to deal with large-scale elucidation of epitopes. Hence, computational approaches are considered to be a very beneficial alternative, as they are cost and time effective.
A large number of highly efficient algorithms and tools have been developed over the years for the computational prediction of epitopes. These methods deal with the prediction of both B-cell and T-cell epitopes as well as sequential (linear) and discontinuous (conformational) epitopes. Based on the information (data) utilized for performing prediction, the methodologies can be grouped as sequence-based or structure-based approaches. Many sequence-based linear epitope prediction methods for B cells have been developed and used for a long time, and the majority of them are propensity scale–based and machine learning–based methods [109, 110]. Some of the major tools/servers that deal with the prediction of linear B-cell epitopes are listed in Table 3.
|No.||Method (URL)||Approach used||Efficiency|
|1||…||Fixed length epitope patterns, …||Accuracy = 65.93%, SE = 67.14%, SP = 64.71%|
|2||…||Amino acid anchoring pair composition (APC) and SVM||AROC = 0.809, accuracy …|
|3||…||SVM classifiers with string …||AROC = 0.758|
|4||…||Physicochemical properties of …||Accuracy = 58.7%|
|5||…||… scale and HMM||…|
|6||…||Antigen sequence features, SVM||AROC = 0.85|
|7||…||Bayes feature extraction and SVM||AROC = 0.84, accuracy = 74.5%|
|8||…||Antigen fragment score and SVM||AROC = 0.829|
|9||…||Sequence features and multiple linear regression (MLR)||AROC = 0.728, SE = 81.8%, SP = 64.1%|
|10||IEDB Analysis Resource (http://tools.…)||A collection of tools based on …||…|
|11||…||Large datasets of epitopes, …||Accuracy = 86%|
|12||…||Tri-peptide similarity, propensity scores and SVM||AROC = 0.702, SE = 80.1%, SP = 55.2%|
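The propensity scale methods mentioned for linear B-cell epitope prediction generally average a per-residue scale over a sliding window and report positions whose average exceeds a cutoff. The sketch below illustrates that idea only. The choice of the Hopp-Woods hydrophilicity scale is an assumption made for this example and is not taken from the tools listed in Table 3; the values are given as commonly tabulated and should be checked against the original publication before any real use. The window size and threshold are likewise illustrative.

```python
from typing import Dict, List

# Hopp-Woods hydrophilicity values as commonly tabulated (verify before use).
HOPP_WOODS: Dict[str, float] = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def window_scores(sequence: str, scale: Dict[str, float],
                  window: int = 7) -> List[float]:
    """Mean scale value in a window centred on each residue.

    Windows are truncated at the sequence ends; unknown residues score 0.
    """
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        lo, hi = max(0, i - half), min(len(sequence), i + half + 1)
        values = [scale.get(residue, 0.0) for residue in sequence[lo:hi]]
        scores.append(sum(values) / len(values))
    return scores


def candidate_epitope_positions(sequence: str, scale: Dict[str, float],
                                window: int = 7,
                                threshold: float = 1.0) -> List[int]:
    """Zero-based positions whose windowed score reaches the threshold."""
    return [i for i, score in enumerate(window_scores(sequence, scale, window))
            if score >= threshold]
```

For example, `candidate_epitope_positions("LLLLLKDEKRLLLLL", HOPP_WOODS, window=3, threshold=2.0)` marks only the centre of the charged stretch, since the flanking leucines pull the windowed average down at its edges.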
A number of methods that utilize the 3D structure of antigens for discontinuous epitope prediction have also been developed. These methods use different approaches for prediction, such as solvent accessibility of surface residues [123, 124], solvent accessibility with propensity scores, and propensity scores with packing density of amino acids. An account of major tools/servers for conformational epitope prediction is provided in Table 4.
|No.||Method (URL)||Approach used||Efficiency|
|1||BEpro (formerly PEPITO) (http://pepito.proteomics.ics.uci.edu/) ||3D structure of antigen, amino acid propensity scores||AROC = 0.75|
|2||B-Pred (http://immuno.bio.uniroma2.it/bpred) ||3D structure or model of antigen, solvent exposure of residues||SE = 0.70|
|3||CBTOPE (http://www.imtech.res.in/raghava/cbtope/) ||Sequence features and SVM||AROC = 0.9, Accuracy = 85%|
|4||CEP (http://126.96.36.199/cgi-bin/cep.pl) ||3D structure of antigen, solvent accessibility of amino acids||Accuracy = 75%|
|5||DiscoTope 2.0 (http://www.cbs.dtu.dk/services/DiscoTope/) ||3D structure of antigen, epitope propensity scores, surface accessibility||AROC = 0.824|
|6||ElliPro (http://tools.immuneepitope.org/tools/ElliPro) ||3D structure of antigen, Thornton's method, residue clustering algorithm||AROC = 0.732|
|7||Epitopia (http://epitopia.tau.ac.il/) ||Antigen sequence or 3D structure, Naïve Bayes classifier||AROC = 0.59|
|8||EPSVR (http://sysbio.unl.edu/EPSVR/) ||3D structure of antigen, Support vector regression (SVR)||AROC = 0.597|
|9||EPMeta (http://sysbio.unl.edu/EPMeta/) ||Meta server integrating EPSVR with other methods||AROC = 0.638|
|10||EPCES (http://sysbio.unl.edu/EPCES/) ||3D structure of antigen, surface features||–|
|11||SEPPA 2.0 (http://badd.tongji.edu.cn/…)||3D structure of antigen, subcellular localization of antigen, residue propensity, etc.||AROC = 0.745|
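The efficiency columns of these tables quote accuracy, SE, SP, and AROC. Assuming the usual meanings (SE for sensitivity, SP for specificity, and AROC for the area under the receiver operating characteristic curve, none of which is expanded in the tables themselves), the measures can be computed as in the sketch below. The function names are chosen for this illustration.

```python
from typing import Sequence


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true epitope residues (or allergens) that are recovered."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that are correctly rejected."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def area_under_roc(positive_scores: Sequence[float],
                   negative_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as one half. This pairwise form is equivalent to the area
    under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

An AROC of 0.5 corresponds to random ranking, which gives a reference point when reading values such as 0.59 or 0.824 in Table 4. Figures reported by different tools are usually measured on different benchmark datasets, so they are not directly comparable.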
Studies have shown that the analysis of antigen–antibody complex structures is very useful for the characterization of conformational epitopes. A dedicated resource titled AgAbDb, which archives the interactions derived from antigen–antibody complexes, is available and can be very useful for the analysis of epitopes [20, 21]. Several algorithms have also been developed for the prediction of T-cell epitopes in antigens. These methodologies deal with the prediction of peptides that possess the ability to interact with specific major histocompatibility complex (MHC) molecules. Machine learning–based approaches are very commonly employed for this purpose and are found to be very efficient. The details of epitope prediction methods/tools for B cells and T cells have been reviewed elsewhere [138, 139]. Some of the important tools/servers that perform the prediction of T-cell epitopes are listed in Table 5. Recently, it has been shown that epitope prediction can be performed over the whole proteome by integrating multiple epitope prediction methods. Antibody-specific epitope prediction has emerged as a significant alternative to the traditional antibody-independent epitope prediction methods.
|No.||Method (URL)||Approach used||Efficiency|
|…||…||Cytotoxic T-lymphocyte epitopes, …||Accuracy = 75.2%|
|…||…||Multi-step algorithm that employs …||Accuracy = 60%|
|…||…||QSAR approach based on …||Accuracy = 89%|
|…||…||Integrated method employing proteasomal cleavage, TAP transport efficiency, and MHC class I binding affinity||AROC = 0.95|
|…||…||A method for all HLA class II molecules based on peptide-binding MHC environment||AROC = 0.807|
|…||…||Based on specificity-determining …||…|
|…||…||Scoring system based on position of residue in the epitopes||…|
|…||…||Algorithm based on HLA-DR binding …||…|
|…||…||Combination of methods based on proteasomal cleavage, TAP transport and MHC binding||…|
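One of the approaches listed above scores a peptide according to which residue occupies each position. A minimal sketch of such position-based scoring follows. The scoring matrix `TOY_MATRIX` is entirely hypothetical, with arbitrary favoured residues at the second and last positions of a nine-residue peptide; it does not represent any real MHC allele or any published tool, and the function names are assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical position-specific scores for a 9-residue peptide.
# Index 1 and index 8 act as "anchor" positions in this toy example.
TOY_MATRIX: List[Dict[str, int]] = [
    {}, {"L": 10, "M": 6}, {}, {}, {}, {}, {}, {}, {"V": 10, "L": 6},
]


def score_peptide(peptide: str, matrix: List[Dict[str, int]]) -> int:
    """Sum the position-specific scores; unlisted residues contribute 0."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must match the matrix length")
    return sum(matrix[pos].get(residue, 0)
               for pos, residue in enumerate(peptide))


def rank_peptides(protein: str,
                  matrix: List[Dict[str, int]]) -> List[Tuple[int, str, int]]:
    """Score every window of the protein; return (start, peptide, score),
    best score first, with earlier start positions breaking ties."""
    length = len(matrix)
    hits = [(start, protein[start:start + length],
             score_peptide(protein[start:start + length], matrix))
            for start in range(len(protein) - length + 1)]
    return sorted(hits, key=lambda hit: (-hit[2], hit[0]))
```

Real predictors derive such matrices from measured binding data for a specific MHC molecule, and the integrated methods in the table combine the binding score with separate estimates for proteasomal cleavage and TAP transport.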
Epitopes represent critical components of allergens from the perspective of allergic reactions and the development of new diagnosis and treatment strategies. Therefore, the computational prediction of these epitopes in allergens is of immense importance. Due to limitations associated with the detailed archival of allergen epitope data and the highly heterogeneous nature of the data, the number of tools available for allergen epitope prediction is far smaller than the number of tools available for allergen/allergenicity prediction. The general epitope prediction methods listed above can therefore be employed for epitope prediction studies in allergens. Kleter and Peijnenburg developed a strategy to screen for potential linear IgE epitopes using sequence comparison with a minimal length of six amino acids. The approach was moderately effective, and it showed that further verification of IgE binding of epitopes by experimental tests is necessary. AlgPred, developed by Saha and Raghava, is one of the first and major tools for the computational assessment of IgE epitopes. Here, a database of known IgE epitopes was created and is used to predict allergenic proteins. AllerPred is an SVM-based computational system for the assessment of overlapping continuous and discontinuous B-cell epitope binding patterns in allergenic proteins. This approach has been used successfully to predict the allergenicity of novel proteins. Dall’Antonia et al. have developed a software tool titled Surface comparison–based Prediction of Allergenic Discontinuous Epitopes (SPADE). The algorithm consists of a structure-based comparison of allergen surfaces and IgE cross-reactivity data and is able to predict IgE epitopes from three important allergen families. A recent meta-analysis of IgE-binding epitopes performed by Lollier et al. provided some important findings regarding these epitopes. They computed the fraction of allergen amino acids that are involved in epitopes and modeled the relationship between the rising number of literature references and the amino acid fractions to assess the possibility of binary classification of epitopes and nonepitopes. A web-based tool, LocAllEpi, has also been developed for the visualization of allergen epitopes along the protein sequence together with their structural features.
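The fraction of allergen residues involved in epitopes can be computed directly from a list of mapped epitope positions. The sketch below is one simple way to do this and is not the procedure of the meta-analysis cited above; it assumes epitopes are given as 1-based inclusive `(start, end)` intervals on the allergen sequence, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def epitope_residue_fraction(sequence_length: int,
                             epitopes: Iterable[Tuple[int, int]]) -> float:
    """Fraction of residues covered by at least one epitope.

    Overlapping epitopes are merged, so each residue is counted once.
    Intervals are clipped to the sequence boundaries.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = set()
    for start, end in epitopes:
        lo, hi = max(1, start), min(sequence_length, end)
        covered.update(range(lo, hi + 1))
    return len(covered) / sequence_length
```

Counting each residue once matters because independently reported epitopes for well-studied allergens often overlap, and summing their lengths would overstate the covered fraction.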
6. Computational prediction of allergenic cross-reactivity
Cross-reactivity plays an important role in allergic reactions in both the immunological and the clinical context. Therefore, the computational prediction of allergenic cross-reactivity has been considered of substantial significance. In the majority of cases, the prediction of cross-reactivity in allergens is associated with the prediction of allergenicity. This is mainly because the antigenic determinants that contribute to cross-reactivity in allergens are also responsible for their allergenicity. As a result, many of the tools/algorithms that have been developed for the prediction of allergens/allergenicity also perform cross-reactivity prediction.
The criteria defined by FAO/WHO experts, which have been mentioned earlier, help to identify cross-reactivity in allergens. AllerTool is a web server that performs cross-reactivity prediction based on amino acid sequence and the FAO/WHO guidelines. It also provides a graphical representation of the published and predicted cross-reactivity patterns of allergens. Stadler and Stadler developed a sequence-based approach and stated that a motif-based strategy provides better results for the computational assessment of cross-reactivity than the FAO/WHO guidelines. SDAP, the specialized allergen database described before, also comprises a sequence-based tool for the identification of cross-reactivity among allergens. AllerHunter is an SVM-based web server that deals with efficient assessment of allergic cross-reactivity in proteins. A recently developed fuzzy inference system–based algorithm for allergenicity prediction is also able to predict cross-reactivity in allergens.
7. Future perspectives and challenges
Allergy represents a serious problem, as allergic diseases are known to affect millions of people worldwide. Advancements in genomic, proteomic, and analytical techniques have led to the generation of a large amount of data related to allergy and allergens. Archival and analysis of these data represent a major challenge in allergen bioinformatics. Data integration is one of the key limitations for efficient and useful storage of allergy-associated data. This is mainly due to the heterogeneous nature of the data, which are derived from various sources, such as molecular data from the experimental characterization of allergic reactions and clinical and epidemiological data from patients/populations. Bioinformatics resources and tools have an important role to play in overcoming this problem. In the wake of the ever-expanding volume of data, it is vital to focus on developing databases/resources that integrate information from different sources as well as from the literature and provide rapid access to it. Analysis of such data can be further utilized to obtain important insights into allergic reactions. Structural features of allergens contribute significantly to their allergenicity, and this knowledge can therefore be employed for developing more efficient methods for allergen/allergenicity and allergic cross-reactivity prediction. Recent advances in epitope prediction methodologies focus on antibody-specific epitope prediction approaches. Application of such approaches to the prediction of IgE-binding epitopes will be extremely important in the development of newer and more effective strategies for the diagnosis and treatment of allergic diseases. Allergen immunotherapy (AIT), which is an individualized, allergen-based treatment approach, has been considered a prototype of precision medicine or personalized medicine. Bioinformatics has the potential to play an important role in the development of novel approaches in AIT as well as to contribute to further enrichment of the field of allergen informatics. This should aid in gaining a better understanding of allergic diseases and positively influence upcoming research in the field.
The work was supported under the Senior Research Fellowship (SRF) granted to KK by the University Grants Commission (UGC), New Delhi, India. UKK and SS would like to acknowledge the Centre of Excellence grant from the Department of Biotechnology (DBT), New Delhi, India. The authors would also like to acknowledge the Bioinformatics Centre, Savitribai Phule Pune University, for providing the infrastructure and resources.