Genetically linked to the survival motor neuron 1 gene SMN1, spinal muscular atrophy (SMA) is an autosomal recessive neuromuscular disease with dysfunctional α-motor neurons. As the product of the SMN1 gene, the survival motor neuron protein (SMN) plays an essential role in the molecular pathogenesis of SMA. On 1 June 2017, a PLoS ONE article reported a set of computational structural analyses to illustrate how SMA-linked mutations of SMN1 lead to structurally and functionally deficient variants of SMN. Following that article, this chapter provides a brief update on the structural and functional consequences of the missense mutations of this SMA protein.
- spinal muscular atrophy
- survival motor neuron protein
- missense mutation
- structural consequence(s)
- functional consequence(s)
1. Setting the scene
On 1 June 2017, PLoS ONE published an original research article (Figure 1) titled ‘How do SMA-linked mutations of SMN1 lead to structural/functional deficiency of the SMA protein?’, of which this chapter aims to provide a brief update.
1.1. The genetics of SMA: a brief introduction
SMA is an autosomal recessive neuromuscular disease with α-motor neuron (anterior horn of the spinal cord) dysfunction and muscular atrophy. SMA is caused by loss (95% of SMA cases) or mutation (5% of SMA cases) of the survival motor neuron 1 gene SMN1 (telomeric SMN, telSMN or SMN1, GenBank: U18423), which lies in the 5q13 region of human chromosome 5. The same 5q13 region also contains a nearly identical survival motor neuron 2 gene SMN2 (centromeric SMN, cenSMN or SMN2, GenBank: NM_022875). The two genes (SMN1 and SMN2) have been extensively characterised, and their roles in SMA have been reviewed in detail [2, 3, 4, 5, 6, 7, 8].
1.2. The survival motor neuron protein and its role in SMA
The survival motor neuron (SMN) protein is the product of SMN1, the SMA-determining survival motor neuron gene [2, 3]. As a result, SMN is also called the SMA protein. The 38-kD SMN is the protein actually affected in SMA [9, 10, 11]. It is a cytoplasmic protein that also occurs in dot-like nuclear structures called gems, which is why SMN was formerly also termed Gemin1 [3, 12].
In the molecular pathogenesis of SMA, of particular interest is an exon 7-skipping defect identified in the pre-mRNA splicing of the SMN2 gene. Due to this splicing defect, SMN2 predominantly produces exon 7-skipped transcripts, which encode a truncated isoform of the SMN protein (SMNΔ7 or SMN2, with 282 residues), in comparison with the full-length SMN protein with 294 residues (SMN1 or FL-SMN).
In pre-mRNA splicing, the spliceosome is the major functional unit, and spliceosomal small nuclear ribonucleoproteins (snRNPs) are essential components of the nuclear pre-mRNA processing machinery [13, 14, 15, 16, 17]. In the pathogenesis of SMA, the SMN protein plays a critical role in pre-mRNA processing, because the biogenesis of spliceosomal snRNPs is promoted by the SMN complex [14, 18, 19], which consists of SMN (Gemin1), Gemin2–8 and UNR-interacting protein (UNRIP) [13, 16, 20]. In the formation of the SMN complex, SMN forms oligomers and directly interacts via its N-terminus with Gemin2 and via its tudor domain with spliceosomal (Sm) proteins [13, 21, 22]. As a key component of the SMN complex, SMN first assembles the essential SMN/Gemin complex, which in turn mediates the formation of the Sm core domain of the spliceosomal snRNPs [13, 21, 22].
2. Structural and functional consequences of the SMA-linked missense mutations of SMN
In general, genetic mutations include missense, nonsense, insertion and deletion mutations. A nonsense mutation is a point mutation in a DNA sequence that results in a premature stop codon (a nonsense codon) in the transcribed mRNA, and therefore in a truncated, incomplete and usually functionally deficient protein product. In contrast, a missense mutation substitutes one single amino acid residue, and is therefore able to provide residue-specific structural insights into the role of that residue in the structure and function of the target protein, provided that the three-dimensional structure of the target protein has been experimentally determined and deposited in the Protein Data Bank. Thus, this chapter focuses on SMA-linked missense mutations of SMN and aims to provide a brief update on their structural and functional consequences with a set of computational structural analyses as described in the PLoS ONE article.
2.1. An update of SMA-linked missense mutations of SMN
A set of point mutations (missense and nonsense mutations) have been previously summarised in , including A2G , nonsense mutation Q15X , D30N , D44V [25, 26, 27], V94G , G95R , Y130C , nonsense mutation Q157X , A188S , nonsense mutation W190X , nonsense mutation L228X , P245L , L260S , S262G and S262I [4, 25], M263T , S266P , Y272C [4, 35, 36], H273R , T274I [4, 35, 36], G275S , G279C and G279V [4, 35, 37, 38]. As of 25 September 2018, eight more missense mutations of SMN were summarised and reported, including A2V, Y109C, Y130C, Y130H, P221L, S230L, P244L and R288S .
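The one-letter notation used in this list encodes the wild-type residue, its position and the replacement residue, with X marking a premature stop codon. As a minimal sketch (the helper `classify` and its regular expression are illustrative assumptions, not part of the cited work), the previously summarised point mutations can be sorted into missense and nonsense classes as follows:

```python
import re

# Point mutations previously summarised, as listed in this section.
MUTATIONS = [
    "A2G", "Q15X", "D30N", "D44V", "V94G", "G95R", "Y130C", "Q157X",
    "A188S", "W190X", "L228X", "P245L", "L260S", "S262G", "S262I",
    "M263T", "S266P", "Y272C", "H273R", "T274I", "G275S", "G279C", "G279V",
]

# Hypothetical pattern: wild-type letter, residue number, replacement letter.
PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def classify(mutation):
    """Return (wild_type, position, replacement, kind) for e.g. 'D44V'."""
    match = PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation notation: {mutation!r}")
    wild_type, position, replacement = match.groups()
    # 'X' denotes a stop codon, i.e. a nonsense mutation.
    kind = "nonsense" if replacement == "X" else "missense"
    return wild_type, int(position), replacement, kind


nonsense = [m for m in MUTATIONS if classify(m)[3] == "nonsense"]
missense = [m for m in MUTATIONS if classify(m)[3] == "missense"]
print(len(missense), "missense;", len(nonsense), "nonsense:", nonsense)
```

The same helper can be applied to the eight more recently reported mutations (A2V, Y109C, Y130C, Y130H, P221L, S230L, P244L and R288S), all of which are missense in this notation.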
2.2. An update of experimentally determined SMN-related structures
In , 11 SMN-related structures were retrieved from the PDB database  with 2 search parameters (text search for: survival motor neuron protein and molecule: survival motor neuron protein). In a new search of the PDB database (accessed 25 September 2018)  with the same parameters, 14 PDB entries were retrieved, including 1G5V, 1MHN, 2LEH, 4A4E, 4A4G, 4GLI, 4QQ6, 4V98, 5XJL, 5XJQ, 5XJR, 5XJS, 5XJT and 5XJU. In a comparison with the PDB entries in , during the past 16 months, six new SMN-related structures were deposited in the Protein Data Bank, including 5XJL (to supersede 3S6N ), 5XJQ , 5XJR , 5XJS , 5XJT  and 5XJU . While the six PDB entries do contain a set of different yet functionally related protein molecules, including snRNP Sm-D1, snRNP Sm-D2, snRNP E, snRNP F and snRNP G, they also contain a fragment of the survival motor neuron protein (SMN residues 26–62), according to the fasta format data of the six PDB entries .
2.3. An update of the structural and functional consequences of the missense mutations of SMN
2.3.1. Asp44 in the Gemin2-binding domain of SMN
In light of the six new experimentally determined SMN-related structures (Table 1), a new set of computational structural analyses, carried out as previously described in detail in the PLoS ONE article, can now provide an update. Two aspartates (Asp35 and Asp44) of SMN stood out in the structural analysis of both intramolecular and intermolecular salt bridges for this SMA protein, as listed in Table 2.
|PDB ID||Structure title||Method||Release date|
|5XJL||Crystal structure of the Gemin2-binding domain of SMN, Gemin2 in complex with SmD1/D2/F/E/G from human||X-ray||2 May 2018|
|5XJQ||Crystal structure of the Gemin2-binding domain of SMN, Gemin2 in complex with SmD1(1–82)/D2/F/E/G from human||X-ray||4 July 2018|
|5XJR||Crystal structure of the Gemin2-binding domain of SMN, Gemin2dN39 in complex with SmD1(1-82)/D2/F/E/G from human||X-ray||4 July 2018|
|5XJS||Crystal structure of the Gemin2-binding domain of SMN, Gemin2dN39 in complex with SmD1(1-82)/D2/F/E from human||X-ray||4 July 2018|
|5XJT||Crystal structure of the Gemin2-binding domain of SMN, Gemin2 in complex with SmD1(1-82)/D2.R61A/F/E/G from human||X-ray||4 July 2018|
|5XJU||Crystal structure of the Gemin2-binding domain of SMN, Gemin2dN39 in complex with SmD1(1-82)/D2.R61A/F/E/G from human||X-ray||4 July 2018|
|PDB ID||SBnum||Residue A||Atom A||Residue B||Atom B||Distance (Å)|
Asp44 is encoded by exon 2a of SMN1 (the Gemin2-binding domain) and is involved in an SMA-linked Asp44Val (D44V) missense mutation, which replaces Asp44’s charged side chain with Val44’s hydrophobic side chain. Of particular functional significance, SMN’s Gemin2-binding activity is totally suppressed by the D44V mutation in SMN1. Moreover, the snRNP assembly activity of the D44V SMN mutant (SMND44V) is lower than that of the wild-type SMN (FL-SMN or SMN1).
In good agreement with the earlier computational analysis, a set of salt bridges was structurally identified between SMN’s Asp44 (M_Asp_44) and Gemin2’s Arg213 (2_Arg_213), as shown in Table 2. In particular, according to the coordinate data in the PDB entry 5XJL, four intermolecular salt bridges were identified between the buried side chains (Table 3) of these two charged residues, as shown in Figure 2.
|Residue||SASA (Å²)||SASA-intrinsic (Å²)||SASA-Ratio|
Taken together, it is conceivable that the buried side chains of SMN’s Asp44 and Gemin2’s Arg213 form a salt bridge, which constitutes a favourable electrostatic energy contribution to the SMN-Gemin2 complex structural stability , and highlights the functionally indispensable roles of the two residues’ charged side chains, considering the experimental observation that the SMN-Gemin2 binding is abrogated by the D44V mutation , resulting in a functionally deficient SMA-linked D44V SMN mutant.
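Salt bridges of this kind are commonly identified by a distance criterion between the charged side-chain atoms of an acidic and a basic residue. The sketch below illustrates that idea only: the 4.0 Å cutoff is an assumed value (the chapter does not state the cutoff it used), the coordinates are hypothetical rather than taken from 5XJL, and `salt_bridges` is an illustrative helper name.

```python
import math
from itertools import product

CUTOFF = 4.0  # Å; assumed cutoff, not taken from the chapter

# Hypothetical coordinates (Å) for illustration only, not from PDB entry 5XJL.
asp_oxygens = {"OD1": (0.0, 0.0, 0.0), "OD2": (2.2, 0.0, 0.0)}
arg_nitrogens = {
    "NE": (1.15, 4.9, 0.0),
    "NH1": (0.0, 2.8, 0.0),
    "NH2": (2.3, 2.9, 0.0),
}


def salt_bridges(acid_atoms, base_atoms, cutoff=CUTOFF):
    """List (acid atom, base atom, distance) pairs within the cutoff."""
    pairs = []
    for (a_name, a_xyz), (b_name, b_xyz) in product(
        acid_atoms.items(), base_atoms.items()
    ):
        distance = math.dist(a_xyz, b_xyz)
        if distance <= cutoff:
            pairs.append((a_name, b_name, round(distance, 2)))
    return pairs


for acid, base, distance in salt_bridges(asp_oxygens, arg_nitrogens):
    print(f"Asp {acid} - Arg {base}: {distance:.2f} Å")
```

Replacing Asp with Val, as in D44V, removes both carboxylate oxygens, so no atom pair can satisfy such a criterion; this is the structural reading of the loss of the salt bridge discussed above.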
In addition to the intermolecular salt bridges formed between SMN’s Asp44 and Gemin2’s Arg213, a set of intramolecular salt bridges was also identified between the side chains of SMN’s Asp35 and Lys41 (Table 2). This was also reported in the earlier analysis, where 15 salt bridges were identified between the side chains of SMN’s Asp35 and Lys41 in the salt bridge analysis of the NMR-determined SMN-Gemin2 complex ensemble (PDB ID: 2LEH) [22, 41]. In SMN, Lys41 is a positively charged residue and also a neighbouring residue of Asp44. Functionally different from the SMA-linked D44V mutation, a Lys41Ala (K41A) mutation (not SMA-linked) does not affect SMN-Gemin2 binding. Thus, again in agreement with the earlier structural analysis, the structural analysis here highlights that the salt bridges between SMN’s Asp35 and Lys41 are intramolecular, i.e. within the apo SMN protein, instead of intermolecular, i.e. at the SMN-Gemin2 complex structure interface, which helps to explain why the Lys41Ala (K41A) mutation is not SMA-linked.
Overall, there is good agreement between the old and the new (this chapter) sets of computational structural analyses for both NMR and X-ray SMN-related structures, reflecting the technical maturity of these two main biophysical tools for biomolecular structure determination. This maturity stands out in light of the booming number of cryo-electron microscopy (cryo-EM) images uploaded to the Electron Microscopy Data Bank (EMDB): cryo-EM still has a long way to go to match NMR spectroscopy and X-ray crystallography in technical maturity, and tools for structural model quality validation are urgently needed.
2.3.2. Gly95 in the SMN tudor domain
Although not located in the structurally determined region of the six new structures (Table 1), Gly95 is a residue in the SMN tudor domain, and it is involved in a Gly95Arg (G95R) mutation. This G95R mutation significantly reduces SMN’s ability to bind Sm proteins, such as Sm-B and Sm-D1, confirming that the tudor domain is the essential binding site of SMN for Sm proteins.
In a further inspection of the earlier computational analysis, no salt bridge or hydrogen bond was identified for Gly95. Nonetheless, in the SMN tudor domain NMR ensembles, between the side chains of Asp96 and Lys93, 1 salt bridge was found for PDB ID 1G5V with 10 structure models, 18 salt bridges were found for PDB ID 4A4E with 20 structure models (Figure 3) and 16 salt bridges were found for PDB ID 4A4G with 20 structure models. Similarly, 15 salt bridges were also identified between the side chains of Glu147 and Lys97 of SMN (PDB ID: 4A4G, with 20 structure models), with the distance between the 2 oppositely charged groups being 2.93 ± 0.39 Å.
Quite interestingly, Gly95 sits right between the two oppositely charged neighbouring residues (Asp96 and Lys93), which are the only two charged residues in the tudor domain that are in the spatial proximity of Gly95. Thus, it is conceivable that a G95R mutation disrupts the Asp96-Lys93 salt bridge and/or builds another one (possibly even stronger) between the side chains of Arg95 and Asp96. This would perturb the structure-stabilising activity of the Asp96-Lys93 salt bridge and/or make it energetically more unfavourable for Asp96’s side chain to orient towards positively charged side chains in Sm proteins, thereby affecting the binding of SMN to Sm proteins. The potential local electrostatic interaction disruption mechanism for this SMA-linked G95R mutation is similar to that of the E134K and the Q136E mutations of SMN. However, the former mechanism depends on the occurrence of energetically unfavourable electrostatic interaction(s), whereas the latter depends on the loss of energetically favourable electrostatic interaction(s) for local structural stability of the SMN tudor domain, the essential part of SMN for Sm protein binding, which can help explain the reduced Sm core assembly activity of the two SMA-linked SMNE134K and SMNQ136E mutants.
2.3.3. Y109C, Y130C and Y130H in the SMN tudor domain
Among the eight SMN residues with SMA-linked missense mutations , only Y109 and Y130 are located in the structurally determined region of SMN , according to the updated list of SMN-related structures as of 25 September 2018. Although Y109C, Y130C and Y130H are not located in the structurally determined region of the six new structures, the three missense mutations are located in the structurally determined region of the experimentally determined structures .
Tyr130 is a hydrophobic residue of the tudor domain and is involved in a Tyr130Cys (Y130C) mutation. In the earlier computational analysis, no salt bridge or hydrogen bond was identified for Tyr130. Nonetheless, Tyr130 is about 50% buried, with a SASA value of 111.1 ± 4.18 Å² compared with its standard value of 212.7 Å², while Tyr109 is deeply buried, with a SASA value of 61.1 ± 8.43 Å² compared with its standard value of 212.7 Å². Taken together, the SASA analysis of the three SMA-linked mutations highlights the potential significance of the buried hydrophobic side chains of Tyr109 and Tyr130 in the SMN tudor domain.
What is more, in the earlier computational analysis, 10 side chain hydrogen bonds (Table 4) were identified between SMN’s Tyr109 and Asp105 in the PDB entry 4A4E, with donor-acceptor distances of 2.72 ± 0.06 Å and angles of 14.75 ± 2.93° (last column of Table 4). No salt bridge was identified for Asp105, and no further hydrogen bonds were identified for Tyr109 and Asp105 in any of the experimentally determined SMN-related structures as of 25 September 2018.
|PDB File||Acceptor (A)||Donor (D)||Hydrogen (H)||D-A (Å)||H-A (Å)||Angle (°)|
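The degree of burial quoted here follows from the ratio of the observed SASA to the standard SASA of the residue. As a small worked sketch using only the values given in the text (the helper `relative_sasa` is an illustrative name, and treating one minus this ratio as the buried fraction is an assumption about how burial is defined):

```python
# Values quoted in the text, in Å^2; 212.7 is the standard value given for Tyr.
TYR_STANDARD = 212.7
observed = {"Tyr130": 111.1, "Tyr109": 61.1}


def relative_sasa(sasa, standard):
    """Fraction of the standard side-chain surface that remains exposed."""
    return sasa / standard


for residue, sasa in observed.items():
    ratio = relative_sasa(sasa, TYR_STANDARD)
    # Assumption: buried fraction is taken as 1 - exposed fraction.
    print(f"{residue}: {ratio:.1%} exposed, {1 - ratio:.1%} buried")
```

Under this definition, Tyr130 is roughly half buried and Tyr109 is markedly more buried, in line with the description above.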
|0.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.73||1.80||13.75|
|3.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.69||1.77||15.61|
|4.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.67||1.72||10.96|
|5.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.77||1.86||17.26|
|6.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.71||1.78||15.09|
|8.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.78||1.87||16.14|
|12.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.83||1.95||20.70|
|14.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.71||1.78||13.76|
|18.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.71||1.76||10.91|
|19.pdb||OD2, A_ASP_105||OH, A_TYR_109||HH, A_TYR_109||2.63||1.70||13.36|
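The averages quoted for these ten hydrogen bonds can be checked directly from the rows of Table 4. The sketch below uses the Python standard library and the sample standard deviation; that the original analysis also used the sample (rather than population) standard deviation is an assumption, and `summarise` is an illustrative helper name.

```python
from statistics import mean, stdev

# D-A distances (Å) and last-column values from the ten rows of Table 4.
d_a = [2.73, 2.69, 2.67, 2.77, 2.71, 2.78, 2.83, 2.71, 2.71, 2.63]
last_column = [13.75, 15.61, 10.96, 17.26, 15.09, 16.14, 20.70, 13.76, 10.91, 13.36]


def summarise(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return mean(values), stdev(values)


for label, values in (("D-A distance", d_a), ("last column", last_column)):
    average, spread = summarise(values)
    print(f"{label}: {average:.2f} ± {spread:.2f}")
```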
Taken together, the computational findings here indicate that SMN’s Tyr109 and Asp105 contribute to the structural stability of SMN through hydrogen bonding between their side chains. If Tyr109 is replaced by Cys109, the side chain hydrogen bond (Figure 4, Table 4) will disappear, and the negatively charged side chain of Asp105 will gain more geometric freedom, which can potentially disrupt the intramolecular and/or intermolecular electrostatic interaction network. There is also the possibility of a disrupted disulphide bonding network within the SMN protein, the SMN complex or even the snRNP assembly, which is critical to ensure that pre-mRNA splicing of the SMN1 gene does not go wrong and that its product is the FL-SMN protein, instead of its truncated functionally deficient counterpart.
2.3.4. A structural analysis of the hydrogen bonds formed within the six new SMN-related structures
In light of the six new experimentally determined SMN-related structures (Table 1), a new set of hydrogen bonding analyses was conducted as previously described, the results of which are briefly summarised in Table 5.
|PDB ID||Acceptor (A)||Donor (D)||Hydrogen (H)||D-A (Å)||H-A (Å)||Angle (°)|
|5XJR||OE1, A_GLN_24||NH2, B_ARG_94||HH21, B_ARG_94||3.00||1.99||1.79|
|5XJR||OD2, B_ASP_104||NH1, B_ARG_102||HH12, B_ARG_102||2.98||2.07||20.93|
|5XJS||OD1, B_ASP_93||NE, 2_ARG_235||HE, 2_ARG_235||2.94||1.96||11.86|
|5XJS||OD1, B_ASP_93||NH2, 2_ARG_239||HH21, 2_ARG_239||2.98||1.98||5.80|
|5XJS||OD2, B_ASP_60||ND2, B_ASN_64||HD22, B_ASN_64||2.99||2.13||25.53|
|5XJT||OD1, B_ASP_93||NE, 2_ARG_235||HE, 2_ARG_235||2.65||1.75||21.43|
|5XJU||OD1, B_ASP_93||NE, 2_ARG_235||HE, 2_ARG_235||2.98||2.08||23.00|
|5XJU||OD2, B_ASP_60||ND2, B_ASN_64||HD21, B_ASN_64||2.80||1.99||29.63|
Table 5 shows the four hydrogen bonds formed between snRNP Sm-D2’s Asp93 and Gemin2’s Arg235 and Arg239. Functionally, Gemin2 is closely linked to SMN (formerly known as Gemin1), and NMR spectroscopy was used to experimentally determine a Gemin1-Gemin2 complex structure (PDB ID: 2LEH) [22, 41], which makes a closer visual inspection of the SMN-related structures (PDB IDs: 5XJS, 5XJT and 5XJU, Table 1) worthwhile.
From Figure 5 (PDB ID:5XJS), it is quite clear that the three charged residues (snRNP Sm-D2’s Asp93 and Gemin2’s Arg235 and Arg239) sit right at the structural interface between Sm-D2 (pink) and Gemin2 (green), with their oppositely charged side chains closely facing each other, similar to the situation as reported by , where the deeply buried side chains of SMN’s Lys45 and Asp36 act as two electrostatic clips at the SMN-Gemin2 interface via interactions with both the side chains and the backbone of Gemin2’s Gln105, Gln109, His120, His123 and Trp124.
In the subsequent computational salt bridge analysis of the six new SMN-related structures, it turned out that the three charged residues did form salt bridges between their closely facing oppositely charged side chains, as listed in Table 6 below and illustrated in Figure 6.
|PDB ID||SBnum||Residue A||Atom A||Residue B||Atom B||Distance (Å)|
Collectively, snRNP Sm-D2’s Asp93 and Gemin2’s Arg235 and Arg239 are three structurally important residues which help stabilise the structural interface through intermolecular electrostatic interactions, including both salt bridges and also hydrogen bonds, similar to the way SMN’s Asp44, Gemin2’s Arg213 and the two SMN residues (Lys45 and Asp36) play stabilising roles in the SMN-Gemin2 complex structure formation .
Considering the intimate functional relationship between Gemin2 and SMN, a further set of structural analysis was conducted for the hydrogen bond and the salt bridge for Arg235 and Arg239 of PDB entry 2LEH [22, 41], and it turned out that the two arginines did not form any intermolecular electrostatic interaction with SMN, neither salt bridge nor hydrogen bond. Instead, the 2 arginines of Gemin2 only formed 2 hydrogen bonds with Gln272 and His231 of Gemin2, and 1 stable salt bridge with Asp274 of Gemin2, where 16 salt bridges were identified for the 32 NMR structural models (Table 7), according to the structural analysis of PDB entry 2LEH [22, 41].
|PDB ID||SBnum||Residue A||Atom A||Residue B||Atom B||Distance (Å)|
3. Concluding remarks
Given SMN’s critical role in the maturation of snRNP and in the development of SMA [2, 6, 11], it is necessary for the structure-activity relationship (SAR) characterisation of the SMA protein to continue. With various biophysical tools available for structural determination, the structure determination and functional characterisation of SMN-related proteins and biological complexes, such as the SMN complex and snRNPs, will undoubtedly continue to advance, which will help further the understanding of SMN’s role in SMA from a molecular structural point of view. In practice, however, advancements do not come easily. For instance, although full-length structures of both FL-SMN (with 294 residues) and SMNΔ7 (with 282 residues) were experimentally determined using X-ray crystallography and deposited in the database (PDB IDs: 4NL6 and 4NL7), they were subsequently withdrawn by the author because the sample used for the structure determination was wrong. Otherwise, these two full-length SMN structures would have constituted the very first step towards a comprehensive picture of the structural and functional insights into SMN’s role in the molecular pathogenesis of SMA.
As of 25 September 2018, there is still no full-length structure of SMN (or of the SMN complex or the snRNP assembly) deposited in the wwPDB, although it contains six new experimentally determined SMN-related structures in addition to those reported earlier. In terms of amino acid sequence, those SMN-related structures are still only SMN fragments, ranging from Gly26 to Lys51, and from Asn84 to Glu147. The remaining regions, which are structurally not yet determined (referred to as structural gaps below), consist of 204 SMN residues. Sixteen months after the publication of the PLoS ONE article, the structural gaps still remain: no progress has been made to bridge them in spite of the six newly deposited structures, which calls again for further comprehensive structural determination and functional research for this SMA protein.
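The figure of 204 residues follows from the sequence length and the two structurally determined fragments quoted above. A minimal sketch of that arithmetic (the helper `structural_gaps` is an illustrative name):

```python
SMN_LENGTH = 294                   # residues in full-length SMN
DETERMINED = [(26, 51), (84, 147)]  # Gly26-Lys51 and Asn84-Glu147


def structural_gaps(length, determined):
    """Return the residue ranges (inclusive) not covered by any fragment."""
    gaps, cursor = [], 1
    for start, end in sorted(determined):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= length:
        gaps.append((cursor, length))
    return gaps


gaps = structural_gaps(SMN_LENGTH, DETERMINED)
missing = sum(end - start + 1 for start, end in gaps)
print(gaps, missing)
```

The gaps comprise the N-terminal residues before Gly26, the linker between Lys51 and Asn84, and everything after Glu147, which together account for the 204 residues.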
4. A residue-specific distributional analysis of the structural gaps in the Protein Data Bank
As a 38-kD protein, SMN is a small one in terms of molecular weight, in comparison with all proteins whose structures have been deposited in the Protein Data Bank (PDB), a primary database for experimentally determined structures of biological molecules. As discussed above, even for a protein as small as SMN, experimental structure determination does not seem simple or easy, especially when it has to be done in a full-length and gapless manner. Therefore, to test whether any residue-specific statistical pattern (not reported before this chapter) exists in the structural gaps in the whole Protein Data Bank (accessed 25 September 2018), this chapter presents a residue-specific distributional analysis of all structural gaps throughout the PDB.
While the number of experimentally determined protein structures in the PDB keeps increasing, with the number of cryo-EM structures on the rise, X-ray crystallography and NMR spectroscopy remain to date the two main complementary biophysical tools in structural biology (Table 8), both with strengths and weaknesses [50, 51].
|Experimental method||Proteins||Nucleic acids||Protein/NA complex||Other||Total|
In PDB-format data, the atomic coordinates presented in ATOM records in a PDB file may not exactly match the sequence in the SEQRES records. For X-ray crystallography data, the ends of chains and mobile loops are often not observed in crystallographic experiments, and as a result, their atomic coordinates are not included as ATOM records in the file, leading to the occurrence of gaps in structures determined by X-ray crystallography. However, these amino acids will often be included in the SEQRES records, since that portion of the chain was present during the experiment. In these cases, a ‘REMARK 465’ entry will be included in the header of the PDB file to identify each missing residue. Among currently available biophysical tools, NMR spectroscopy is able to provide unique access to atomic-level structural dynamic behaviour of protein molecules in solution under physiological conditions (such as temperature, pH, etc.). As a result, this chapter focuses on the structural gaps within protein structures determined by NMR spectroscopy, and aims to test whether any residue-specific statistical pattern exists in them. Here, structural gaps are defined as protein fragments with residues which exist in the originally studied molecule as shown in the SEQRES records, but not in the observed structure/atomic coordinates.
As of 20 September 2018, 10,844 NMR-determined protein structures had been deposited in the Protein Data Bank, according to a structure search with two parameters (molecule type = protein, experimental method = NMR). After the 10,844 PDB files were downloaded from the PDB website, the numbers of total and missing amino acid residues were extracted for all proteins with an in-house Python script, as listed in Table 9.
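The definition above can be turned into a per-residue count by reading the SEQRES records (all residues present in the sample) and the REMARK 465 records (residues without coordinates). The following is a minimal sketch under stated assumptions, not the in-house script used for this chapter: `count_residues` is a hypothetical helper, only the 20 standard residue names are counted, and the short example text is made up for illustration.

```python
from collections import Counter

STANDARD = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def count_residues(pdb_text):
    """Return (total, missing) Counters of residue names for one PDB file."""
    total = Counter()
    missing_sites = set()  # (name, chain, number): repeated models count once
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            # Residue names occupy columns 20-70 of a SEQRES record.
            total.update(n for n in line[19:].split() if n in STANDARD)
        elif line.startswith("REMARK 465"):
            fields = line[10:].split()
            # Data rows end with residue name, chain ID and sequence number
            # (optionally preceded by a model number); header text is skipped.
            if len(fields) in (3, 4) and fields[-3] in STANDARD:
                missing_sites.add(tuple(fields[-3:]))
    missing = Counter(name for name, _chain, _number in missing_sites)
    return total, missing


# Made-up example records, for illustration only.
example = "\n".join([
    "REMARK 465   MODELS 1-20",
    "REMARK 465     RES C SSSEQI",
    "REMARK 465     MET A     1",
    "REMARK 465     HIS A     2",
    "SEQRES   1 A    6  MET HIS HIS GLY ALA LYS",
])
total, missing = count_residues(example)
for name in sorted(total):
    print(name, missing[name], total[name], round(missing[name] / total[name], 3))
```

Summing such counters over all downloaded files would give the per-residue missing ratio (missing number divided by total number) in the form reported in Table 9.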
|Residue||Missing no.||Total no.||Ratio = Missing no./Total no.|
In total, the 10,844 protein structures contain 1,066,583 amino acid residues, 2.8% of which (29,812) are missing, i.e. the atomic positions of the 29,841 residues were not experimentally determined by NMR spectroscopy, although they were present in the NMR sample during the structural determination process.
From Figure 7, it can be seen that for 19 residues (excluding histidine), the missing ratio is well below or close to 5%, while the missing ratio is 19.6% for histidine, as shown by the sharp blue peak in Figure 7. A one-sample t-test of the 19 missing ratios found a mean ratio of 0.0231 to be 100% acceptable (), and a chi-square test found the fit between the 19 missing ratios and the red horizontal line (Figure 8) to be 100% acceptable as well ().
While a missing ratio of 5% might be considered statistically insignificant, a missing ratio of 19.6% is clearly not to be ignored here, raising one obvious question: what makes histidine so special among the 20 naturally occurring amino acids in this residue-specific distributional analysis of the structural gaps?
Similar to the other 19, histidine is a naturally occurring amino acid that is used in the biosynthesis of proteins. Also similar to the other 19, it contains an amino group (which is in the protonated –NH3+ form under biological conditions) and a carboxylic acid group (which is in the deprotonated –COO− form under biological conditions). In particular, histidine has an imidazole side chain (which is partially protonated), classifying it as a positively charged amino acid at physiological pH (7.4). Among the 20 naturally occurring amino acids, five (Arg, Lys, His, Glu and Asp) possess ionisable side chains. Among the five, histidine is the only one whose side chain has an ionisable imidazole ring structure (with an intrinsic pKa of 6.04) [52, 53], which can exist in two inter-convertible tautomeric states. At a pH of 7.0, the imidazole ring is mostly deprotonated (proton occupancy = 9.88%), whereas at a pH of 6.0, it is largely protonated (proton occupancy = 52.30%), as defined by the classical Henderson-Hasselbalch equation. The protonated imidazole ring bears two NH bonds and has a positive electric charge, which is equally distributed between both nitrogens. As the pH increases, the imidazole ring loses the positive charge, and the remaining proton of the neutral imidazole ring can reside on either nitrogen, giving rise to two tautomeric states of the histidine side chain [52, 54, 55].
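The two proton occupancies quoted above follow from the Henderson-Hasselbalch equation, pH = pKa + log10([base]/[acid]), which gives a protonated fraction of 1 / (1 + 10^(pH − pKa)). A minimal sketch with the intrinsic pKa of 6.04 given in the text (the function name `protonated_fraction` is illustrative):

```python
HIS_PKA = 6.04  # intrinsic pKa of the histidine imidazole ring, as given in the text


def protonated_fraction(ph, pka=HIS_PKA):
    """Fraction of side chains carrying the extra proton at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


for ph in (6.0, 7.0, 7.4):
    print(f"pH {ph}: proton occupancy = {protonated_fraction(ph):.2%}")
```

Because the pKa lies close to neutral pH, small pH changes shift the occupancy considerably, which is the protonation-deprotonation equilibrium referred to in this section.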
To sum up, it is probable that the missing ratio of histidine is much higher than those of the other 19 because of its special side chain: it has special dynamic structural and physicochemical properties (such as stacking interaction), and its imidazole ring is in constant protonation-deprotonation equilibrium and has two tautomeric states [52, 54, 55]. These properties make its NMR observables (chemical shift, for instance) difficult to observe and measure experimentally by NMR spectroscopy, and difficult to use in structural calculation by NMR-related software in the structural determination of proteins. To address this issue of PDB-wide structural gaps, selective isotope labelling of histidine residues (the side chains in particular) can be a useful approach in biomolecular structural determination by NMR spectroscopy, not just alone but also in combination with other biophysical tools, and not just for histidine but also for the other 19 amino acids, the fundamental building blocks of life.