Two-Dimensional Gel Electrophoresis as an Information Base for Human Proteome

The main intricacy of the human proteome is that it is tremendously complex and composed of diverse and heterogeneous gene products. These products are called protein species or proteoforms and are the smallest units of the proteome. Comprehensive profiling of the human proteome still requires significant advances. Approaches that use two-dimensional gel electrophoresis (2DE) to reveal and store information about the human proteome are described. Experimental identification methods such as mass spectrometry of high resolution and sensitivity (MALDI-TOF MS and ESI LC-MS/MS) or immunodetection, in combination with bioinformatics and 2DE, can be used to develop a comprehensive knowledge base of the human proteome.


Introduction
The first aim in human proteomics, as proposed by the Human Proteome Organization (HUPO), is a complete catalog of all human proteins. Owing to collaborative efforts within the Chromosome-based Human Proteome Project (C-HPP), this task is now close to completion [1,2]. According to the NextProt release of Aug 1, 2017, 17,168 of 20,199 predicted proteins have already been found. Still, ~3000 proteins remain on the list of so-called "missing proteins." But even after completion of this task, only representative proteins will have been identified in most cases [3,4]. The real situation is much more complicated, as proteins can exist as different proteoforms (protein species) [5][6][7]. Proteoforms, as the smallest units of the proteome, are molecules (polypeptides) arising from all combinatorial sources of variation after expression of a single gene. Each proteoform is a chemically clearly defined molecule. These molecules differ because of genetic variation, alternatively spliced RNA transcripts, and posttranslational modifications [5,7]. Accordingly, the term protein refers to its coding gene and therefore becomes the umbrella term for all proteoforms/protein species derived from it [6]. Sometimes the term "proteoform" is also used to describe structural variants of proteins [8]. But this makes the terminology in proteomics more complicated and confusing, because even within the above definition the field of proteoforms is very broad and could encompass many billions of components [9][10][11][12]. For instance, all combinations of 30 known modifications of histone H3 alone can theoretically produce more than 1 billion proteoforms [10,13]. Because of this variety, the huge range of concentrations (7-8 orders of magnitude in blood plasma), and dynamic changes during the life cycle, the identification, quantitation, and database organization of proteoforms are a serious challenge. Nevertheless, there is evident progress in this area.
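The histone H3 estimate can be checked with simple arithmetic. The sketch below assumes, purely for illustration, that each of the 30 modifications is independently either present or absent; real modification sites are not fully independent, so this is an upper bound rather than a measured count. The function name is hypothetical.

```python
def count_modification_patterns(n_modifications: int, states_per_site: int = 2) -> int:
    """Upper bound on distinct modification patterns.

    Assumption: every modification is independent and has
    `states_per_site` possible states (2 = present or absent).
    """
    return states_per_site ** n_modifications


# 30 independent on/off modifications, as in the histone H3 example.
h3_patterns = count_modification_patterns(30)
print(f"{h3_patterns:,}")  # 1,073,741,824
```

Under this assumption, 2^30 already exceeds one billion, which matches the order of magnitude quoted above.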
So far, the main workhorse in proteomics has been bottom-up mass spectrometry, but the top-down approach is becoming pre-eminent today [14]. Top-down proteomics implies that mass spectrometry is applied at the proteoform level, which preserves information about the intramolecular complexity that might be overlooked in bottom-up shotgun workflows [14,15]. But top-down proteomics cannot be just a one-step procedure. Several approaches based on protein separation are also involved in proteoform analysis. Among these methods, two-dimensional gel electrophoresis (2DE) occupies a special place. Accordingly, different schemes could be used to establish a basis for a comprehensive knowledge base for the protein/proteoform inventory.

The principles of protein separation by 2DE
The application of different electrophoretic separation methods in two-dimensional combinations has a long history [16][17][18][19][20][21][22][23]. Continuous improvement of electrophoretic techniques and the development of new support media for protein separation made it possible to achieve ever better resolution. Finally, the best combination was chosen and optimized by O'Farrell in 1975 [24]. The method is based on separation under completely denaturing conditions according to two independent parameters (pI and Mw). Isoelectric focusing (IEF), which separates proteins according to differences in their isoelectric points (pI), is used as the first dimension. High resolution according to protein size or molecular weight (Mw) in the second dimension is achieved by SDS polyacrylamide gel electrophoresis (SDS-PAGE) [23]. The method was quickly accepted by the scientific community as the most powerful approach for separating complex protein samples [25,26]. Besides the high resolution, an important feature of this method is that all stages of separation are performed under denaturing conditions, so in complex mixtures such as cell extracts, all proteins or proteoforms are separated according to the basic parameters (pI and Mw) of their primary chemical structure (Figure 1). This is very important, as it allows proteoforms not only to be separated but also to have their pI and Mw determined experimentally. These parameters can also be calculated from information about the chemical structure of the proteoforms. Accordingly, two-dimensional separation of proteoforms can be performed not only experimentally but virtually as well [27][28][29] (Figure 2).
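As a minimal sketch of how a "virtual" 2DE coordinate could be calculated from a sequence, the example below sums average residue masses for Mw and finds the pH of zero net charge by bisection for pI. The pKa values are one illustrative set and are an assumption: published pKa scales differ and shift the result, and this is not necessarily the algorithm used by the tools cited above. Modifications are ignored, and all function names are hypothetical.

```python
# Average residue masses in Da; one water is added per polypeptide chain.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Illustrative pKa values (assumption: other published scales give other pIs).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def molecular_weight(seq: str) -> float:
    """Average molecular weight of an unmodified polypeptide, in Da."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


def net_charge(seq: str, ph: float) -> float:
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(seq, mid) > 0:
            low = mid   # still positively charged: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def virtual_spot(seq: str) -> tuple:
    """Theoretical (pI, Mw) coordinate of a sequence on a virtual 2DE map."""
    return isoelectric_point(seq), molecular_weight(seq)
```

Calling `virtual_spot` on every sequence of a predicted proteome would give the point cloud of a virtual map; the bisection works because net charge decreases monotonically with pH.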

Information produced by 2DE gels
High-resolution separation of proteoforms is the main function of 2DE, but extra steps are needed to obtain the necessary information. Protein identification is the next very important step. Immunological methods, as the most specific approaches, were mainly used for this purpose. They are still successfully used in 2DE-based proteomics for protein identification (Western blot) (Figure 3).
To study protein-protein interactions, the so-called Far-Western blot is used (Figure 4) [32,33]. In this case, the proteins separated by 2DE are transferred to a membrane and then treated with a series of buffer washes that allow the "prey" proteins to denature and renature [32][33][34]. The membrane is then blocked and probed with a purified "bait" protein (the protein whose binding partners, the "preys," are to be detected). If the bait and prey proteins bind, the bait protein is detected at the spots where the prey protein is located. For detection, the bait proteins can be labeled, or antibodies can be used [32,33]. This technique is very informative, but it is also laborious and not convenient for large-scale analysis. The situation improved radically when mass spectrometry became a central element of proteomics analysis [35,36].
The MS approach to protein identification, which searches for the best match between peptide masses produced by specific hydrolysis and peptide masses calculated from theoretical cleavage of proteins, was developed simultaneously by different groups [37][38][39][40][41]. In this case, the peptide mass sets are acquired by matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). This approach was named peptide mass fingerprinting (PMF). For a long time, PMF was the most powerful technique for high-throughput protein identification. With more sophisticated MS instruments (especially ESI LC-MS/MS), it was revealed that, depending on the gel resolution, 2DE spots often contain more than a single protein, especially in the case of mammalian cells [29,42]. Accordingly, the quantitation of proteins became an ambiguous task, as densitometry of spots cannot be used for accurate proteoform quantitation. Fortunately, dedicated MS-based quantitative approaches can be applied in this case. In addition, as only the proteins detected as stained spots are analyzed, a lot of information is missed. This problem can be solved only by analyzing all parts of the gel [28,43]. The general view of this method is shown in Figure 1. The main steps are as follows:
1. 2DE separation.
3. Scanning the image produced (2DE map).
6. Treatment of each section with trypsin according to the protocol for mass spectrometry by ESI LC-MS/MS.
7. Analysis of the tryptic peptides obtained from each 2DE section on an Orbitrap Q-Exactive mass spectrometer.
8. Protein identification and relative quantification using Mascot "2.4.1" and the exponentially modified protein abundance index (emPAI).
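The peptide mass fingerprinting principle described earlier in this section can be illustrated with a small sketch: digest each candidate sequence in silico with trypsin, compute the singly protonated peptide masses, and count how many observed masses fall within a tolerance of a theoretical one. This is a simplified illustration and not the scoring used by Mascot or any other search engine; it assumes no missed cleavages and no modifications, and the tolerance, function names, and database layout are hypothetical.

```python
# Monoisotopic residue masses in Da.
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MONO = 18.01056
PROTON = 1.00728


def tryptic_peptides(seq: str) -> list:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


def peptide_mh(peptide: str) -> float:
    """Monoisotopic [M+H]+ mass of an unmodified peptide."""
    return sum(MONO_RESIDUE_MASS[aa] for aa in peptide) + WATER_MONO + PROTON


def pmf_score(observed_masses, seq: str, tol_da: float = 0.2) -> int:
    """Number of observed masses explained by the theoretical digest of seq."""
    theoretical = [peptide_mh(p) for p in tryptic_peptides(seq)]
    return sum(
        any(abs(obs - theo) <= tol_da for theo in theoretical)
        for obs in observed_masses
    )


def best_match(observed_masses, database: dict, tol_da: float = 0.2) -> str:
    """Name of the database entry whose digest explains the most masses."""
    return max(database, key=lambda name: pmf_score(observed_masses, database[name], tol_da))
```

A real search engine would additionally weigh matches by peptide length, mass accuracy, and database size to estimate significance; the raw match count here only conveys the idea.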
In total, up to 500 unique proteins were identified in each section of the 2DE gel after separation of proteins from a glioblastoma cell extract (Figure 1). All proteins detected in the same section were assigned the pI/Mw parameters of that section. Accordingly, the same protein detected in different sections was considered to represent different proteoforms. About 20,000 proteoforms coded by ~4000 genes were identified using this approach [28,43]. Additionally, 3D graphs representing gene-centric expression of proteoforms can be generated (Figure 5). Here, the proteoform profile for each gene can be observed. Considering that only 96 sections from a small gel (8 cm × 8 cm) were taken for analysis, the situation can be improved significantly by increasing the gel size, the sample loading, and the number of sections. This approach also solves the problem of staining sensitivity.
However, there are still some issues that need to be tackled. One of them is resolution, which drops considerably if the gel is cut into large sections. Ideally, the size of the sections should be close to the size of the smallest spots. But in this case, we face a dramatically increased number of samples (around 1000). As the processing time of each sample by ESI LC-MS/MS is limited (usually at least 0.5 h per sample), about 1 month of continuous work on the Orbitrap would be needed to analyze the samples from a single 2DE gel.
Another issue is proteoform quantitation. Several options are available. The exponentially modified protein abundance index (emPAI) [28,43,44] is not ideal, since it gives only a relative and not very accurate estimate of the protein content. Another option is MaxQuant, which has recently become possibly one of the most frequently used platforms for mass spectrometry (MS)-based proteomics data analysis [45]. Since its release in 2008, it has been improved substantially [45,46]. Selected reaction monitoring (SRM) and isotope-coded affinity tag (ICAT) are other excellent examples of the power of MS technology in protein quantitation [47][48][49]. Further approaches have been described and reviewed elsewhere [50,51].
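The bookkeeping behind the sectional analysis can be sketched as follows: every identification inherits the pI/Mw of its gel section, identifications are grouped by gene to give a proteoform profile, and a relative abundance is attached using the published emPAI definition, emPAI = 10^(N_observed / N_observable) - 1. This is only a sketch under stated assumptions: in the workflow above emPAI is reported by Mascot rather than recomputed, and the gene names, section coordinates, and peptide counts below are invented placeholders.

```python
from collections import defaultdict


def empai(n_observed: int, n_observable: int) -> float:
    """Exponentially modified protein abundance index."""
    return 10 ** (n_observed / n_observable) - 1


def proteoform_profiles(identifications, section_coords) -> dict:
    """Group identifications by gene; each (gene, section) pair is one proteoform.

    identifications: iterable of (section_id, gene, n_observed, n_observable)
    section_coords:  section_id -> (pI, Mw in kDa) assigned to that section
    """
    profiles = defaultdict(list)
    for section_id, gene, n_obs, n_observable in identifications:
        pi, mw = section_coords[section_id]
        profiles[gene].append({
            "section": section_id,
            "pI": pi,
            "Mw": mw,
            "emPAI": empai(n_obs, n_observable),
        })
    return dict(profiles)


# Hypothetical example data (not taken from the study).
section_coords = {"A1": (4.5, 120.0), "B7": (6.8, 45.0), "C3": (5.2, 45.0)}
identifications = [
    ("B7", "GENE_A", 8, 20),
    ("C3", "GENE_A", 3, 20),   # same gene, more acidic section: second proteoform
    ("A1", "GENE_B", 5, 10),
]
profiles = proteoform_profiles(identifications, section_coords)
for gene, spots in profiles.items():
    print(gene, len(spots), "proteoform(s)")
```

Plotting each gene's entries with pI and Mw as the horizontal axes and emPAI as the height would give a gene-centric 3D profile of the kind mentioned above.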

Databases based on 2DE gels
The overall approach has been used to generate annotated 2DE gel databases for many cell types. The first 2DE database, SWISS-2DPAGE, was launched in 1993 and is maintained by the Central Clinical Chemistry Laboratory of the Geneva University Hospital and the Swiss Institute of Bioinformatics (SIB). This database is now part of World-2DPAGE (http://world-2dpage.expasy.org/list/), a dynamic portal for simultaneously querying gel-based proteomics databases worldwide. Together, these databases hold over 250 maps for 23 species, totaling nearly 40,000 identified spots, which makes this the biggest gel-based proteomics dataset accessible from a single interface. Here, a 2DE map can be selected and displayed for inspection. The database can be queried by keywords (protein description, protein name, gene name, species, author, full text, protein spot serial number) or graphically by clicking on a spot. Each spot is linked to a page containing the corresponding gene (protein) information and identification details. Information is also displayed about spots in other maps where a product of the same gene was detected. All these spots are highlighted in the maps, and the calculated parameters [isoelectric point (pI) and molecular weight (Mw)] are displayed. Cross-references make it possible to obtain more information from other 2DE databases and from UniProtKB. UniProtKB, a comprehensive protein sequence knowledge base, has two sections: UniProtKB/Swiss-Prot, which is manually curated, and UniProtKB/TrEMBL, which contains computer-annotated entries. UniProtKB/Swiss-Prot entries provide users with cross-links to about 100 external databases and with access to additional information or tools [52].

Collection of experimental and theoretical data on the platform of 2DE gel
Proteomes can be retrieved from experimental data or generated by available programs that calculate theoretical protein parameters [27,53,54]. Because 2DE separates polypeptides according to their basic parameters (pI/Mw), it allows crosstalk to be organized between experimental and theoretical data. In a simple form, this approach was realized in the abovementioned 2DE databases. The main idea behind the approach, in which each spot is considered to contain only one protein, works well only with simple proteomes (for instance, Mycoplasma), where the number of proteoforms is not as large as in mammalian cells [11,12,55,56]. A European proteome database focused on pathogenic microorganisms was launched. Now, under the name "The Proteome 2D-PAGE Database," it contains 13,893 identified spots and 3245 mass peak lists in 57 reference maps representing experiments from 26 different organisms and strains. The database provides protein information such as ORF name, predicted isoelectric point (pI) and molecular weight (Mw), several protein identifiers, identification method, sequence coverage, and so on [57]. This database contains information about mammalian cells as well; for these cells, multiple proteins can therefore be attributed to the same spot. An attempt to manage theoretical proteomes and link them with experimental, 2DE-based data was made in another online database, DynaProt 2D [58]. It was developed for dynamic access to proteomes and 2DE gels. Here, a 2DE gel could serve as a reference map and as a tool for navigating the database [58]. A complete theoretical proteome integrated into a 2DE database could provide a powerful tool for simply linking newly identified spots to the appropriate theoretical data already available [58]. But this idea was tried for only one organism, Lactococcus lactis, and was not developed further.
If we plot experimentally measured physicochemical parameters of proteoforms (pI or Mw) against the theoretical ones, a general view of the proteome in terms of proteoform diversity is revealed (Figure 6). Here, the dots are distributed along the diagonal of the graph if the experimentally detected parameters match or are close to the theoretical values. Otherwise, the dots lie above or below the diagonal, and the bigger the difference between the theoretical and experimental parameters, the bigger the deviation of the dot from the diagonal. In particular, in the case of pI, a proteoform dot located above the diagonal shows that the experimental pI of this proteoform is smaller than the theoretical one. This is usually caused by the addition of negative charges (phosphorylation) or the removal of positive charges (acetylation). In the opposite situation, with an increased pI (dots below the diagonal), PTMs remove negative charges from the protein (esterification). In the case of Mw, a dot located above the diagonal means that the theoretical mass (weight) of the polypeptide is bigger than the experimentally observed one. This can occur if the polypeptide is truncated or proteolytically processed. If the dots are observed below the diagonal, there can be several reasons. It can result from technical issues of the 2DE procedure (protein polymerization, aggregation, or precipitation). Glycosylation can also strongly increase the mass of a polypeptide. This situation can be analyzed in more detail by representing all proteoforms corresponding to the same gene on a separate chart (Figure 7).
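The reading of the diagonal plot described above can be expressed as a small rule-based sketch. The tolerances that define "on the diagonal" are assumptions chosen for illustration and would need tuning to the actual gel resolution; the hints simply restate the interpretations given in the text and are not a diagnosis of any specific modification. All names are hypothetical.

```python
# Assumed tolerances for "close to the diagonal" (illustrative only).
PI_TOLERANCE = 0.2    # pH units
MW_TOLERANCE = 0.10   # fraction of the theoretical Mw

# Interpretations taken from the text; keys are (parameter, direction of the
# experimental value relative to the theoretical one).
HINTS = {
    ("pI", "lower"): "negative charge added or positive charge removed "
                     "(e.g. phosphorylation, acetylation)",
    ("pI", "higher"): "negative charge removed (e.g. esterification)",
    ("Mw", "lower"): "truncation or proteolytic processing",
    ("Mw", "higher"): "glycosylation, or polymerization/aggregation during 2DE",
}


def classify(theoretical: float, experimental: float,
             tolerance: float, relative: bool = False) -> str:
    """Return 'lower', 'higher', or 'match' for experimental vs theoretical."""
    delta = experimental - theoretical
    if relative:
        delta /= theoretical
    if delta < -tolerance:
        return "lower"
    if delta > tolerance:
        return "higher"
    return "match"


def interpret_spot(theo_pi: float, exp_pi: float,
                   theo_mw: float, exp_mw: float) -> dict:
    """Direction of the pI and Mw shifts with a possible explanation for each."""
    result = {}
    checks = (
        ("pI", theo_pi, exp_pi, PI_TOLERANCE, False),
        ("Mw", theo_mw, exp_mw, MW_TOLERANCE, True),
    )
    for name, theo, exp, tol, rel in checks:
        direction = classify(theo, exp, tol, rel)
        hint = HINTS.get((name, direction), "close to the value predicted from the sequence")
        result[name] = (direction, hint)
    return result


# Hypothetical proteoform: acidic shift, mass close to prediction.
print(interpret_spot(theo_pi=6.5, exp_pi=5.9, theo_mw=50000.0, exp_mw=51000.0))
```

Applying `interpret_spot` to all proteoforms of one gene gives a compact summary of the same information shown in the gene-specific chart.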

Conclusion
The continuing evolution of detection techniques (mostly mass spectrometry) and their use in combination with optimal protein separation techniques will finally allow us to reach the main aim of HUPO: an image of the whole human proteome. Combining 2DE, a classic proteomics method for protein separation, with bottom-up mass spectrometry (shotgun analysis of peptides by ESI LC-MS/MS) is an efficient way to increase the productivity of tandem mass spectrometry. Additionally, this union of top-down and bottom-up approaches allows a very convenient visual representation (profiling) of information about diverse proteoforms. As 2DE maps are a convenient and effective way to represent information about proteomes and to navigate among all their proteoforms, they will allow the construction of a knowledge base for an inventory of all human protein species/proteoforms that is visually attractive, clear, easy to search, and easy to perceive.
In particular, the development of chromosome-centric interactive virtual 2DE maps of proteins coded by specific genes, in combination with experimental 2DE protein maps, will allow the C-HPP to be executed more effectively and the number of proteoforms to be estimated more accurately, and could serve as a basis for the knowledge base of human proteins. The development of such an inventory will be based on existing databases such as http://world-2dpage.expasy.org/, https://www.nextprot.org/, http://www.uniprot.org/ and http://atlas.topdownproteomics.org.