Mass Spectrometry (MS)-based strategies featuring chemical or biochemical probing represent powerful and versatile tools for studying structural and dynamic features of proteins and their complexes. In fact, they can be used both as an alternative for systems intractable by other established high-resolution techniques, and as a complementary approach to the latter, providing different information on poorly characterized or very critical regions of the systems under investigation (Russell et al., 2004). The versatility of these MS-based methods depends on the wide range of usable probing techniques and reagents, which makes them suitable for virtually any class of biomolecules and complexes (Aebersold et al., 2003). Versatility is further increased by the possibility of operating at very different levels of accuracy, ranging from qualitative high-throughput fold recognition or complex identification (Young et al., 2000), to the fine detail of structural rearrangements in biomolecules after environmental changes, point mutations or complex formation (Nikolova et al., 1998; Millevoi et al., 2001; Zheng et al., 2007). However, these techniques rely heavily upon the availability of powerful computational approaches to fully exploit the information content of the experimental data.
The determination of three-dimensional (3D) structures or models by MS-based techniques (MS3D) involves four main activity areas: 1) preparation of the sample and its derivatives labelled with chemical probes; 2) generation of derivatives/fragments of these molecules for further MS analysis; 3) interpretation of MS data to identify those residues that have reacted with probes; 4) derivation of 3D structures consistent with information from previous steps. Ideally, this procedure should be considered the core of an iterative process, where the final model possibly prompts for new validating experiments or helps the assignment of ambiguous information from the mass spectra interpretation step.
Both the overall MS3D procedure and its different steps have been the subject of several accurate review and perspective articles (Sinz, 2006; Back et al., 2003; Young et al., 2000; Friedhoff, 2005; Renzone et al., 2007a). However, with the partial exception of a few recent papers (Van Dijk et al., 2005; Fabris et al., 2010; Leitner et al., 2010), the full computational detail behind 3D model building (step 4) has generally received less attention than the former three steps. Structural derivation in MS3D, in fact, is considered a special case of structural determination from sparse/indirect constraints (SD-SIC). Nevertheless, the information for modelling derivable from MS-based experiments exhibits some peculiar features that differentiate it from the data types associated with other experimental techniques involved in SD-SIC procedures, such as nuclear magnetic resonance (NMR), electron microscopy, small-angle X-ray scattering (SAXS), Förster resonance energy transfer (FRET) and other fluorescence spectroscopy techniques, for which most of the currently available SD-SIC methods have been developed and tailored (Förster et al., 2008; Lin et al., 2008; Nilges et al., 1988a; Aszodi et al., 1995).
In this view, this study will illustrate possible approaches to model building in MS3D, underlining the main issues related to this specific field and outlining some of the possible solutions to these problems. Whenever possible, alternative methods will be described, employing either different programs selected among the most popular applications in homology modelling, threading, docking and molecular dynamics (MD), or different strategies to exploit the information contained in MS data. Discussion will be limited to packages that are either freely available or cost less than 1,000 US$ for academic users. For programs, the home web address is reported, rather than references that are very often partial and/or outdated. Some examples, derived from the literature available in this field, or developed ad hoc to illustrate some critical features of the computational methods in MS3D, should clarify the potential and the current limitations of this approach.
2. General MS3D modelling procedures
2.1. Possible computational protocols for MS3D approaches
MS3D can be fruitfully applied to many structure-related problems; thus, it requires the (possibly combined) use of different modelling procedures. However, a very general scheme for an MS3D approach can still be sketched (Fig. 1). It includes:
an initial generation of possible structures for the investigated system by some sampling algorithms (S1 or S2 stages);
followed by classification, clustering and selection steps of the best sampled structures based on one or more criteria (F1 or F2a-F2b-F2c);
an optional narrowing of the ensemble by a refinement of the selected models (R);
followed by new classification, clustering and selection stages for the identification of the most representative models (FF).
Selection criteria are very often represented by more or less sophisticated combinations of different scoring (i.e. the higher, the better), penalty (i.e. the lower, the better) or target (i.e. the closer to its reference value, the better) functions. For the sake of brevity, from here onwards the term “scoring” will be indiscriminately used for either true scoring, or penalty, or target function, when their discrimination is not necessary.
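The three kinds of function can be placed on a common footing by converting each one to a quantity for which lower is better. The following is a minimal sketch of that idea; the names `to_penalty` and `rank_models`, the `kind` labels and the numerical values are illustrative assumptions, not taken from any specific modelling package.

```python
def to_penalty(value, kind, reference=None):
    """Map a scoring, penalty or target value onto a 'lower is better' scale."""
    if kind == "score":      # the higher, the better
        return -value
    if kind == "penalty":    # the lower, the better
        return value
    if kind == "target":     # the closer to the reference value, the better
        if reference is None:
            raise ValueError("a target function needs a reference value")
        return abs(value - reference)
    raise ValueError(f"unknown function kind: {kind}")


def rank_models(values, kind, reference=None):
    """Return model names sorted from best to worst for one criterion."""
    return sorted(values, key=lambda name: to_penalty(values[name], kind, reference))


# Hypothetical per-model values of a target-type quantity (illustrative only)
radius_of_gyration = {"m1": 14.2, "m2": 15.9, "m3": 15.1}
best_first = rank_models(radius_of_gyration, "target", reference=15.0)
```

With a single convention of this kind, several criteria can be combined or compared without tracking the direction of each one separately.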
The features characterizing a specific approach are: a) combination of sampling (and optimization) algorithms, b) scoring functions in sampling/optimization and classification/clustering/selection stages, c) strategies to introduce MS-based experimental information.
A first major branching in this scheme already occurs in the earliest modelling stages (box A), depending on whether MS-based information is, at least in part, integrated in the structure generation stage (path S1-F1), or rather deferred to a subsequent model classification/selection step (path S2-F2a-F2b-F2c).
Depending on information types, programs and strategies used in modelling (see next sections for theory and examples), MS-based data can be either all introduced during sampling (S1), or all used in the filtering stage (F2a), or subdivided between the two steps (S1+F1). The main advantage of the inclusion of MS-based information into sampling (path S1-F1) is an increase in model generation efficiency by limitation of the conformational or configurational subspace to be explored. In several potentially problematic cases, i.e. large molecules with very limited additional information available, this reduction can transform a potentially insoluble problem into a reliable model generation, capable of correlating structural and functional features of the investigated system. However, for the very same reason, if information is introduced too abruptly or tightly during structural sampling, it can artificially freeze the models into a wrong, or at least incomplete, set of solutions (Latek et al., 2007; Bowers et al., 2000). Also the weight of erroneous restraints will be considerably amplified by the impossibility of a comparison with solutions characterized by some restraint violations, but considerably more favourable scoring function values, which are often diagnostic of inadequate sampling and/or errors in the experimental restraint set.
Accordingly, both the protocol used to implement MS-based information into modelling procedures and the MS-based data themselves generally represent very critical features, which require the maximum attention during computational setup and final analyses. In addition, implementation of restraints in the sampling procedure either requires some purpose-written programming, or severely limits the choice of modelling tools to programs that already support suitable user-defined restraints.
Use of MS-based information in post-sampling analyses (path S2-F2a-F2b-F2c) to help classify and select the final models exhibits a mostly complementary profile of advantages and disadvantages. In fact, it decreases the sampling efficiency of the modelling methods (S2), by leading to a potentially very large number of models to be subsequently discarded on the mere basis of their violations of MS-derived restraints (F2a), and by providing no a priori limitations to the available conformational/configurational space of the system. Furthermore, it may still require programming activity if the available restraint analysis tools (F2a) are lacking or inefficient for the kind of information implemented. However, this approach gives the user the maximum freedom in the choice of the sampling program; this may prove very useful in those cases where the peculiar features of a specific program are strongly required to model the investigated system. In addition, a comparative analysis of both structural features and scoring function values between models accepted and rejected on the basis of MS-based data may allow the identification of potential issues in the selected models and the corresponding data sets (steps F2c-X).
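The post-sampling path can be sketched as a simple filter that splits models by the number of violated restraints and then flags rejected models that score better than every accepted one. This is only a sketch under stated assumptions: each model is a hypothetical dictionary holding a name, a score for which lower is better, and one representative coordinate per residue; restraints are hypothetical `(residue_i, residue_j, upper_bound)` triples in Å.

```python
import math


def count_violations(coords, restraints, tolerance=0.0):
    """Count restraints whose upper bound is exceeded by more than the tolerance."""
    return sum(
        1 for i, j, upper in restraints
        if math.dist(coords[i], coords[j]) > upper + tolerance
    )


def split_by_restraints(models, restraints, max_violations=0):
    """Separate models into accepted and rejected sets (stage F2a)."""
    accepted, rejected = [], []
    for model in models:
        n = count_violations(model["coords"], restraints)
        (accepted if n <= max_violations else rejected).append(model)
    return accepted, rejected


def better_scored_rejects(accepted, rejected):
    """Rejected models scoring better (lower) than the best accepted one."""
    if not accepted:
        return [m["name"] for m in rejected]
    best = min(m["score"] for m in accepted)
    return [m["name"] for m in rejected if m["score"] < best]


# Illustrative data only: one crosslink-derived restraint between residues 10 and 42
restraints = [(10, 42, 24.0)]
models = [
    {"name": "A", "score": -120.0, "coords": {10: (0.0, 0.0, 0.0), 42: (8.0, 0.0, 0.0)}},
    {"name": "B", "score": -150.0, "coords": {10: (0.0, 0.0, 0.0), 42: (30.0, 0.0, 0.0)}},
]
accepted, rejected = split_by_restraints(models, restraints)
suspicious = better_scored_rejects(accepted, rejected)
```

A non-empty `suspicious` list is the situation described above: restraint-violating models with clearly better scores may point to inadequate sampling or to errors in the restraint set.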
2.2. Integration of MS-based data into modelling procedures
Although an ever-increasing number of MS-based strategies has been developed, they provide essentially two information classes for model building: i) surface accessible residues, from chemical/isotopic labelling or limited proteolysis experiments (Renzone et al., 2007a); ii) pairs of residues whose relative distances span a predefined range, from crosslinking experiments (Sinz, 2006; Renzone et al., 2007a). Details on the nature of the combined biochemical and MS approaches used to generate these data, and on the experimental procedures adopted in these cases, are provided in the exhaustive reviews cited above.
2.2.1. Surface-related information (selective proteolysis and chemical labelling)
Although many structure generation approaches include surface-dependent terms, these are usually not exposed to the user; thus, implementation of accessibility information is always indirect and ranges from very difficult to impossible. In some docking programs, surface residue patches can be excluded from the exploration, thus restricting the region of space to be sampled (Section 3.2). More generally, this information is exploited through programs that build and evaluate different kinds of molecular surfaces, applied during the model validation stages. In this view, the main available programs and their usage will be described in the section dedicated to model validation (Section 3.3.2).
In the case of modelling procedures based on sequence alignment with templates of known 3D structure, surface-dependent data can be employed both to validate alignments before modelling (early steps in S1 stage), and to filter the structures resulting from the different steps of a traditional model building procedure (stages F1 or F2a, and FF).
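As an illustration of such filtering, the sketch below compares the labelling pattern with per-residue relative solvent accessibilities that are assumed to have been computed beforehand on a model by an external surface program. The function name, the 0.25 exposure cutoff and the numerical values are assumptions made for illustration only.

```python
def accessibility_conflicts(relative_sasa, labelled, protected, exposed_cutoff=0.25):
    """List residues whose accessibility in the model disagrees with MS data.

    relative_sasa: residue number -> relative accessibility (0 buried, 1 exposed)
    labelled:      residues modified or cleaved in the experiment
    protected:     residues found unreactive in the experiment
    """
    buried_but_labelled = sorted(
        r for r in labelled if relative_sasa[r] < exposed_cutoff
    )
    exposed_but_protected = sorted(
        r for r in protected if relative_sasa[r] >= exposed_cutoff
    )
    return buried_but_labelled, exposed_but_protected


# Illustrative values only
relative_sasa = {7: 0.62, 12: 0.05, 39: 0.41, 55: 0.70}
conflicts = accessibility_conflicts(relative_sasa, labelled=[7, 12, 39], protected=[55])
```

The first list is usually the stronger evidence against a model, since a residue can remain unlabelled for reasons other than burial, such as low intrinsic reactivity.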
Crosslinking information often contributes directly to the model building procedure (in the form of distance restraints or direct linker addition to the simulated systems) (stage S1 in Fig. 1), in addition to its model validation/interpretation role (stages F1, F2a, FF).
Whenever information from crosslinking experiments is integrated within the modelling procedure, the most common approach recurring in literature is its translation into distance constraints (i.e. “hard”, fixed distances) or restraints (i.e. variable within an interval and/or around a fixed distance with a given tolerance) involving atoms, in a full-atomistic representation, or higher-order units, such as residues, secondary structure (SS) elements, or domains, in coarse-grained models. A less common approach consists in the explicit inclusion of the crosslinker atoms in the simulation.
2.2.2.1. Distance restraints
Distance restraints (DRs) are usually implemented by adding a penalty term to the scoring function used to generate, classify or select the models, whenever the distance between specified atom pairs exceeds a threshold value. In this way, the associated experimental information can be introduced rather easily and with moderate computational overhead in all the molecular modelling and simulation approaches based on scoring functions. However, crosslinking agents are molecules endowed with well-defined and specific conformational and interaction properties, both internal and with the crosslinked molecules. As a result, even accurate theoretical and experimental estimates of the distance ranges associated with a crosslinking agent correspond only qualitatively to the experimentally detected distances between pairs of crosslinked residues (Green et al., 2001; Leitner et al., 2010). Steric bumps, specific favourable or unfavourable electrostatic interactions, the presence of functional groups capable of promoting/hampering the crosslinking reaction and changes in the crosslinker conformational population under the effects of the macromolecule are all possible causes for the observed discrepancies.
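A common functional form for such a penalty is a flat-bottomed harmonic well, which is zero inside the allowed range and grows quadratically outside it. The sketch below assumes this form; the force constant, the bounds and the function names are illustrative and do not reproduce the implementation of any particular package.

```python
import math


def restraint_penalty(xyz_i, xyz_j, upper, lower=0.0, force_constant=1.0):
    """Flat-bottomed harmonic penalty for one distance restraint (distances in Å)."""
    d = math.dist(xyz_i, xyz_j)
    if d > upper:                      # pair too far apart for the crosslinker
        return force_constant * (d - upper) ** 2
    if d < lower:                      # optional lower bound
        return force_constant * (lower - d) ** 2
    return 0.0                         # inside the allowed range: no penalty


def total_penalty(coords, restraints, force_constant=1.0):
    """Sum the penalties of all (atom_i, atom_j, upper) restraints."""
    return sum(
        restraint_penalty(coords[i], coords[j], upper, force_constant=force_constant)
        for i, j, upper in restraints
    )


# Illustrative coordinates and bounds only
coords = {"K10:NZ": (0.0, 0.0, 0.0), "K42:NZ": (10.0, 0.0, 0.0), "K77:NZ": (0.0, 5.0, 0.0)}
restraints = [("K10:NZ", "K42:NZ", 8.0), ("K10:NZ", "K77:NZ", 8.0)]
penalty = total_penalty(coords, restraints)
```

Because the penalty vanishes anywhere inside the allowed range, widening `upper` is a simple way to account for the only qualitative correspondence between nominal linker length and observed residue distances.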
2.2.2.2. Explicit linkers
Explicit inclusion of crosslinkers in the systems, although potentially able to overcome the limits of DRs, presently suffers from several drawbacks that limit its usage either to final selection/validation stages, or to cases where a limited number of totally independent and simultaneously holding crosslinks are observed. In fact, when many crosslinks are detected in a system by MS analysis, they very often correspond to mixtures of different patterns, because crosslinks can interfere with each other either by direct steric hindrance, or by competition for one of the macromolecule reacting groups, or by inducing deformation in the linked system, thus preventing further reactions. However, the added information from explicit crosslinkers may: i) allow disambiguation between alternative predicted binding modes, ii) provide more realistic and strict estimates of the linker length to be used in further stages of DR-based calculations, iii) help modelling convergence, iv) substantially contribute to model validation.
An attempt to reproduce by an implicit approach at least the geometrical constraints associated with a physical linker has been made by developing algorithms to identify minimum-length paths on protein surfaces (Potluri et al., 2004). This approach provides upper/lower bounds to possible crosslinking distances, but it has only been applied to static structures as a post-modelling validation tool, and no further applications have been reported so far.
3. Available computational approaches in MS3D
MS-based data can be used to obtain structural information on different classes of problems:
a) single conformational states (e.g. the overall fold);
b) conformational changes upon mutations/environmental modifications;
c) macromolecular aggregation (multimerization);
d) binding of small ligands to macromolecules.
Sampling efficiency and physical soundness of the scoring functions used during sampling (stages S1/S2 of Fig. 1) and to select computed structures (stages F1/F2b and FF) generally represent the main current limitations of 3D structure prediction and simulation methods. In this view, introduction of experimental data represents a powerful approach to reduce the geometrical space to be explored during sampling, and also an independent criterion to evaluate the quality of selected models.
From a computational point of view, structural problems a)-d) translate into system-dependent proper combinations of:
fold identification and characterization;
structural refinement and characterization of dynamic properties and of changes under the effects of local or environmental perturbations.
Since the optimal combination of methods for a given problem depends upon a large number of system- and data-dependent parameters, and the number of programs developed for biomolecular simulations is huge, an exhaustive description and comparative analysis of methods for biomolecular structure generation/refinement is practically impossible. However, we will try to offer a general overview of the main approaches to generate, refine and select 3D structures in MS3D applications, with special attention to possible ways of introducing MS-based data and exploiting their full information content.
3.1. Fold identification and characterization
The last CASP (Critical Assessment of techniques for protein Structure Prediction) experiment call (CASP9, 2010) classified modelling methods in two main categories: “Template Based Modelling” (TBM) and “Template Free Modelling” (TFM), depending on whether meaningful homology between the target sequence and those of proteins/domains whose 3D structures are known (templates) can be identified before modelling.
TFM represents the most challenging task because it requires the exploration of the widest conformational space and relies heavily on scoring methods inspired by the principles of physics governing protein folding (de novo or ab initio methods), possibly complemented by statistical predictions, such as probabilities of interresidue contacts, surface accessibility of single residues or local patches and SS occurrence. When the amount and quality of this information increase, together with the extent of the target sequence for which it is available, “fold recognition” and “threading” techniques can be used, including a broad range of methods at the interface between TFM and TBM. In these approaches, several partial 3D structure “seeds” are generated by statistical prediction or distant homology relationships, and their relative arrangements are subsequently optimized by strategies deriving from de novo methods.
The most typical TBM approach, “comparative” or “homology” modelling (HM), uses experimentally elucidated structures of related protein family members as “templates” to model the structure of the protein under investigation (the “target”). Target sequence can either be fully covered by one or more templates, exhibiting good homology over most of the target sequence, or can require a “patchwork” of different templates, each best covering a different region of the target.
A further group of approaches, presently under active development and already exhibiting good performances in CASP and other benchmark and testing experiments, is formed by the “integrative” or “hybrid” methods. They combine information from a varied set of computational and experimental sources, often acting as/based on “metaservers”, i.e. servers that submit a prediction request to several other servers, then averaging their results to provide a consensus that in many cases is more reliable than the single predictions from which it originated. Some metaservers use the consensus as input to their own prediction algorithms to further elaborate the models.
In order to provide some guidelines for structural prediction/refinement tasks in the presence of MS-based data, a general procedure will be outlined for protein fold/structure modelling. The starting step in protein modelling is usually a search for similar sequences that have already been structurally characterized. Sensitive methods for sequence homology detection and alignment have been developed, based on iterative profile searches, e.g. PSI-Blast (Altschul et al., 1997), Hidden Markov Models, e.g. SAM (Karplus et al., 1998), HMMER (Eddy, 1998), or profile-profile alignment such as FFAS03 (Jaroszewski et al., 2005), profile.scan (Marti-Renom et al., 2004), and HHsearch (Soding, 2005).
When sequence identity with known templates is over 40%, HM programs can be used rather confidently. In this case, especially when the alignments to be used in modelling have already been obtained, local programs represent a more viable alternative to web-based methods than in TFM processes. If analysis is limited to the most popular programs and web services capable of implementing user MS-based restraints (strategy S1 in Fig. 1), the number of possible candidates considerably decreases. Among web servers, Robetta is capable of switching automatically from ab initio to comparative modelling on the basis of the identified homologies with templates, while I-TASSER requires user-provided alignments or templates to activate comparative modelling mode. A very powerful, versatile and popular HM program, available both as a standalone application and as a web service, and embedded in many modelling servers, is MODELLER (http://www.salilab.org/modeller/). It includes routines for template search, sequence and structural alignments, determination of homology-derived restraints, model building, loop modelling, model refinement and validation. MS-based distance restraints can be added to those produced from target-template alignments, as well as to other restraints enforcing secondary structures, symmetry or parts of the structure that must not be allowed to change upon modelling. However, some scripting ability is required to fully exploit MODELLER versatility.
The overall accuracy of HM models calculated from alignments with sequence identities of 40% or higher is almost always good (typical root mean square deviations (RMSDs) from corresponding experimental structures less than 2 Å). The frequency of models deviating by more than 2 Å RMSD from experimental structures rapidly increases when target–template sequence identity falls significantly below 30–40%, the so-called “twilight zone” of HM (Blake & Cohen, 2001; Melo & Sali, 2007). In such cases, the quality of the resulting modelled structures significantly increases by combining additional information, both of statistical origin, such as SS prediction profiles, and from sparse experimental data (low resolution NMR or chemical crosslinking, limited proteolysis, chemical/isotopic labelling coupled with MS).
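The identity thresholds discussed above can be checked directly on a target-template alignment. In the sketch below, identity is computed over aligned (non-gap) positions, which is only one of several conventions in use; the function names, the regime labels and the two toy sequences are illustrative assumptions.

```python
def percent_identity(aligned_target, aligned_template, gap="-"):
    """Percentage of identical residues over positions aligned without gaps."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b) for a, b in zip(aligned_target, aligned_template)
        if a != gap and b != gap
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def modelling_regime(identity, confident=40.0, borderline=30.0):
    """Rough classification following the 30-40% thresholds quoted in the text."""
    if identity >= confident:
        return "homology modelling"
    if identity >= borderline:
        return "borderline"
    return "twilight zone"


# Toy alignment, illustrative only
identity = percent_identity("ACDE-FG", "ACDQYFG")
regime = modelling_regime(identity)
```

In the borderline and twilight regimes, MS-derived restraints and SS predictions become correspondingly more important for obtaining a reliable model.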
If the search does not produce templates with sufficient homology and/or coverage of the target sequence, TFM or mixed TFM/TBM methods must be used. Many programs based on ab initio, fold recognition and threading methods are presently offered as web services; this is because very often they use a metaserver approach for some steps, need extensive searches in large databases, require huge computational resources, or because their developers wish to better protect the underlying programs and algorithms, currently under very active development. Although this may offer some advantages, especially to users who are less experienced in biocomputing or have limited computing facilities, it may also imply strong limitations in the full exploitation of the features implemented in the different methods, with particularly serious implications in MS3D. Only a few servers either include an NMR structure determination module (not always suitable for MS-based data), or explicitly allow the optional usage of user-provided distance restraints in the main input form. Fortunately, two of the most used and versatile servers, Robetta (http://robetta.bakerlab.org/) and I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/), good performers at the last CASP rounds (
A successful example of modelling with MS-based information in a low-homology case is Gadd45β. A model was built, despite the low sequence identity (<20%) with the template identified by fold recognition programs, through the introduction of additional SS restraints, which were based on SS profiles and experimental data from limited proteolysis and alkylation reactions combined with MS analysis (Papa et al., 2007). Model robustness was confirmed by comparison with the structure of the homolog Gadd45γ, solved later (Schrag et al., 2008), where the only divergences in SS profiles were the occurrence of two short 3₁₀ helices (each three residues long) and an additional two-residue β-strand in predicted loop regions (Fig. 2). Furthermore, this latter β-strand is so distorted that only a few SS assignment programs could identify it, and the corresponding sequence in Gadd45β, predicted to be unstructured and lying outside the template alignment, was not modelled at all.
Usually, methods for protein docking involve a six-dimensional search of the rotational and translational space of one protein with respect to the other, where the molecules are treated as rigid or semi-rigid bodies. However, during protein-protein association, the interface residues of both molecules may undergo conformational changes that sometimes involve not only side-chains, but also large backbone rearrangements. To manage these conformational changes at least in part, protein docking protocols have introduced some degree of protein flexibility, either by the use of "soft" scoring functions allowing some steric clashes, or by the explicit inclusion of domain movement/side chain flexibility. Biological information from experimental data on regions or residues involved in complexation can guide the search of complex configurations or filter out wrong solutions. Some of the programs most frequently used for protein-protein docking, recently reviewed by Moreira and colleagues (Moreira et al., 2010), can manage biological information and will be discussed in this context.
In the Attract program (http://www.t38.physik.tu-muenchen.de/08475.html), proteins are represented with a reduced model (up to 3 pseudoatoms per amino acid) to allow the systematic docking minimization of many thousands of starting structures. During docking, both partner proteins are treated as rigid bodies and the protocol is based on energy minimization in the translational and rotational degrees of freedom of one protein with respect to the other. Flexibility of critical surface side-chains, as well as large loop movements, is introduced in the calculation by using a multiple conformational copy approach (Bastard et al., 2006). Experimental data can be taken into account at various stages of the docking procedure.
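The six rigid-body degrees of freedom mentioned above can be made concrete as three rotation angles and three translations applied to the coordinates of the mobile partner, after which a crosslink-derived distance bound is checked. This is a minimal sketch: the Z-Y-X angle convention, the function names and the test coordinates are assumptions for illustration and do not correspond to the internals of any of the docking programs discussed here.

```python
import math


def rotation_matrix(alpha, beta, gamma):
    """Rotation Rz(alpha) * Ry(beta) * Rx(gamma), angles in radians."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb, cb * sg, cb * cg],
    ]


def place(points, pose):
    """Apply a rigid-body pose (alpha, beta, gamma, tx, ty, tz) to a list of points."""
    alpha, beta, gamma, tx, ty, tz = pose
    rot = rotation_matrix(alpha, beta, gamma)
    shift = (tx, ty, tz)
    return [
        tuple(sum(rot[r][c] * p[c] for c in range(3)) + shift[r] for r in range(3))
        for p in points
    ]


def pose_satisfies(receptor_atom, ligand_atom, pose, upper):
    """True if the posed ligand atom lies within the crosslink upper bound."""
    moved = place([ligand_atom], pose)[0]
    return math.dist(receptor_atom, moved) <= upper


# Rotate by 90 degrees about z, then translate by 5 Å along z (illustrative only)
pose = (math.pi / 2, 0.0, 0.0, 0.0, 0.0, 5.0)
moved = place([(1.0, 0.0, 0.0)], pose)[0]
```

A restraint check of this kind can be used either during the search, to discard poses early, or afterwards, to filter the final list of complexes.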
The 3D-Dock algorithm (http://www.sbg.bio.ic.ac.uk/docking/) performs a global scan of translational and rotational space of the two interacting proteins, with a scoring function based on shape complementarity and electrostatic interaction. The protein is described at atomic level, while the side-chain conformations are modelled by multiple copy representation using a rotamer library. Biological information can be used as distance restraints to filter final complexes.
ZDOCK (http://zdock.bu.edu/) is a rigid body docking program based on FFT algorithm and an energy function that combines shape complementarity, electrostatics and desolvation terms. RDOCK (
As in protein folding, in docking too the use of MS-based information has allowed the modelling of several complexes even in the absence of suitable templates with high homology. The fold of prohibitin proteins PHB1 and PHB2 was predicted (Back et al., 2002) by SS and fold recognition algorithms, while crosslinking made it possible to model the relative spatial arrangement of the two proteins in their 1:1 complex. Another example of the combined use of SS information, chemical crosslinking, limited proteolysis and MS analysis results with a low sequence identity (~20%) template is the modelling of the porcine aminoacylase 1 dimer; in this case, standard modelling procedures based on automatic alignment had failed to produce a dimeric model consistent with experimental data (D'Ambrosio et al., 2003).
In the case of protein-small ligand docking, the conformational space to be explored is reduced by the small size of the ligand, whose full flexibility can usually be allowed, and by the limited fraction of protein surface to be sampled, corresponding to the binding site, often already known. Among the programs for ligand-flexible docking that allow protein side-chain flexibility, Autodock is one of the most popular (
In general, MS-based data can be used to limit the protein region to be sampled (Kessl et al., 2009) or can be explicitly considered in the docking procedure, as in the case of the mapping of the Sso7d ATPase site (Renzone et al., 2007b). In this case, three independent approaches for molecular docking/MD studies were followed, considering both FSBA-derivatives and the ATP-Sso7d non-covalent complex: i) unrestrained MD, starting from a fully extended, external conformation for the Y7-FSBA and K39-FSBA residue sidechains, and from several random orientations for ATP, with an initial distance of 20 Å from the Sso7d surface, in regions not involved in protein binding; ii) restrained MD, by gradually imposing distance restraints corresponding to an H-bond between the adenine NH2 group and each accessible (i.e., within a distance lower than or equal to the maximum length of the corresponding FSBA-derivative) donor sidechain; iii) rigid ligand docking, by calculating 2000 ZDOCK models of the non-covalent complex of Sso7d with an adenosine molecule. The rigid ligand docking reproduced the features from the other approaches only in part: it correctly predicted the anchoring point for the adenosine ring, but failed to achieve a correct position for the ribose moiety, because this requires a concerted rearrangement of two Sso7d loops involved in the binding. The ability to capture such rearrangements represents one of the main advantages of modelling strategies involving MD (in particular, in Cartesian coordinates), because MD-based simulation techniques are the best or the only approaches that reproduce medium-to-large scale concerted rearrangements of non-contiguous regions.
3.3. Model simulation, refinement and validation
Refinement (R stage in Fig. 1) and validation of final models (FF stage) represent very important steps, especially in cases of low homology with known templates and when fine details of the models are used to predict or explain functional properties of the investigated system. In addition, very often the modelled structures are aimed at understanding the structural effects of point mutations or other local sequence alterations (sequence deletions/insertions, addition or deletion of disulphide bridges, formation of covalent constructs between two molecules and post-translational modifications), or of changes in environmental parameters (temperature, pressure, salt concentration and pH). In these cases, techniques are required to simulate the static or dynamic behaviour of the investigated system in its perturbed and unperturbed states.
3.3.1. Computational techniques and programs for model simulation and refinement
Model refinement, when not implemented in the modelling procedure, can be performed by energy minimization (EM) or, better, by different molecular simulation methods, mostly based on variants of molecular dynamics (MD) or Monte Carlo (MC) techniques. They are also commonly used to characterize dynamic properties and structural changes upon local or environmental perturbations.
Structures deriving from folding or docking procedures need, in general, at least a structural regularization by EM before the final validation steps, to avoid meaningless results from many validation methods. The scoring functions of the latter evaluate the plausibility of parameters such as dihedral angle distributions, presence and distribution of steric bumps, voids in the molecular core, and specific nonbonded interactions (H-bonds, hydrophobic clusters). Representing a mandatory step in most MC/MD protocols, EM programs are included in all the molecular simulation packages, and they share with MC/MD most input files and part of the setup parameters. Thus, unless explicitly discussed otherwise, all system- and restraint-related features or issues illustrated for simulation methods also implicitly hold for EM.
As we are mostly interested in techniques implementing experimentally-derived constraints or restraints, some of the most popular methods for constraint-based modelling will be briefly described. These methods have been developed and optimized mainly to identify and refine 3D structures consistent with spatial constraints from diffraction and resonance experiments (de Bakker et al., 2006). They have also been extensively applied to both TBM (Fiser & Sali, 2003) and free modelling prediction and simulation (Bradley et al., 2005; Schueler-Furman et al., 2005), and are often used to refine/validate models produced in the TFM and TBM approaches described in sections 3.1 and 3.2. There are two main categories of constraint-based modelling algorithms: i) distance geometry embedding, which uses a metric matrix of distances from atomic coordinates to their collective centroid, to project distance space to 3D space (Havel et al., 1983; Aszodi et al., 1995, 1997); ii) minimization, which incorporates distance constraints in variable energy optimization procedures, such as molecular dynamics (MD) and Monte Carlo (MC). For both MD and MC, it is possible to work either in full Cartesian coordinates, or in the restricted torsion angle (TA) space, with covalent structure parameters kept fixed at their reference values, thus originating the Torsional Angle MD (TAMD) and Torsional Angle MC (TAMC) approaches. They are currently implemented in several modelling and refinement packages, developed for structural refinement of X-ray or NMR structures (Rice & Brünger, 1994; Stein et al., 1997; Güntert et al., 1997), folding prediction (Gray et al., 2003), or more general packages (Mathiowetz et al., 1994; Vaidehi et al., 1997). Standard MC/MD methods are only useful for structural refinement, local exploration and to characterize limited global rearrangements.
However, they are also widely used as sampling techniques in folding/docking approaches, although in those cases enhanced sampling extensions of both methods are employed. Simulated annealing (SA) (Kirkpatrick et al., 1983) and replica exchange (RE) approaches (Nymeyer et al., 2004) are the most common examples of these MC/MD enhancements; both can potentially overcome the large energy barriers encountered when sampling the wide conformational and configurational spaces to be explored in folding and docking applications, respectively.
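As an illustration of how RE lets configurations cross energy barriers, the following minimal sketch computes the standard Metropolis acceptance probability for swapping the configurations of two replicas run at different temperatures. The function name, the choice of kcal/mol units and the value used for the Boltzmann constant are assumptions made for this example; they are not taken from any specific package.

```python
import math

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption
K_B = 0.0019872041


def swap_probability(e_i, e_j, t_i, t_j):
    """Metropolis probability of exchanging the configurations of two replicas.

    e_i, e_j: potential energies of the configurations currently held by
              replicas i and j; t_i, t_j: the temperatures of the replicas.
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    # Swaps that move the lower-energy configuration to the colder replica
    # (delta >= 0) are always accepted; the others with probability exp(delta)
    return 1.0 if delta >= 0.0 else math.exp(delta)
```

With this criterion, a low-energy configuration found at high temperature migrates towards the colder replicas, while trapped configurations can be moved to higher temperatures, where barriers are crossed more easily.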
A non-exhaustive list of the most widely used simulation packages that include a more-than-basic treatment of distance-related restraints and also exhibit good versatility (i.e. implementation of different algorithms, approaches, force fields and solvent representations) may include at least: AMBER (
The main problems associated with simulation methods having relevant potential implications on MS3D are: i) insufficient sampling; ii) inaccuracy in the potential energy functionals driving the simulations; iii) influence of the approach used to implement experimentally-derived information on final structure sets.
The sampling problem can be approached both by increasing the sampling efficiency with MC/MD variations like SA and RE, and by decreasing the size of the space to be explored. The latter result can be reached by reducing the overall number of degrees of freedom to be explicitly sampled, and/or by reducing the number of possible values per variable to a small, finite number (discretization, as in grid-based methods), and/or by restraining acceptable variable ranges. Reduction of the total number of degrees of freedom can be accomplished by switching to coarse-grained representations of the system, where a number of explicit atoms, ranging from connected triples, to amino acid sidechains, to whole residues, up to full protein subdomains, are replaced by a single particle. This method is frequently used in the initial stages of ab initio folding modelling, or in the simulation of very large systems, such as giant structural proteins or huge protein aggregates.
Another possible way to reduce the number of degrees of freedom is the aforementioned TA approach, which requires, for an N-atom system, only about N/3 torsional angles compared with 3N coordinates in atomic cartesian space (Schwieters & Clore, 2001). Moreover, as the high-frequency bending and stretching motions are removed, TAMD can use longer time steps in the numerical integration of the equations of motion than those required for classical molecular dynamics in cartesian space. Its main limitation may derive from neglecting covalent geometry variations (in particular, bending centred on protein Cα atoms) that are known to be associated with conformational variations (Berkholz et al., 2009), for instance from α-helix to β-strand, and that can be important in concerted transitions or in large structures with extensive and oriented SS regions. Discretization is mostly employed in the initial screening of computationally intensive problems, such as ab initio modelling. Restraining variable value ranges in MS3D is usually associated with predictive methods (SS, H-bond pattern, residue exposure), with homology analysis, or with experimentally-derived information. Origin, nature and form of these restraints have already been discussed in previous sections, while some more detail on the implementation of distance-related information into simulation programs will be given at the end of this section.
While the implementation of restraints can vary widely in methods where the scoring function does not intend to mimic or replicate a physical interaction between the involved entities, methods based on physically sound molecular potential functions (force fields) implement DRs through a more limited number of approaches. At its simplest, a DR is represented as a harmonic restraint, for which only the target distance and the force constant need to be specified in input. This functional form is present in practically all the most common programs, but it either requires a precise knowledge of the target distance, or results in a very loose restraint if the force constant is lowered to account for low-precision target values, the usual case with MS-based data. In a more complex and useful form, implemented with slight variations in several programs (AMBER, CHARMM, GROMACS, XPLOR/CNS, DESMOND, TINKER), the restraint is a flat-bottomed well with parabolic sides out to a defined distance, and linear beyond that on both sides (AMBER) or just on the upper-limit side (CHARMM, GROMACS, XPLOR/CNS, DESMOND). In some programs (CHARMM, AMBER, XPLOR/CNS), it is possible to select an alternative behaviour when a distance restraint violation gets very large (Nilges et al., 1988b) by “flattening out” the potential, thus leading to no force for large violations; this allows for errors in constraint lists, but might tend to ignore constraints that should be included to pull a bad initial structure towards a more correct one.
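The flat-bottomed form with parabolic sides and linear tails on both sides can be sketched as a piecewise function. The example below is only a minimal illustration: the function name, the parameter names (r1 to r4, k_low, k_up) and the numerical values in the usage note are assumptions loosely modelled on the general shape described above, not the input syntax or exact implementation of any of the cited programs.

```python
def flat_bottom_restraint(r, r1, r2, r3, r4, k_low, k_up):
    """Energy of a flat-bottomed distance restraint (requires r1 <= r2 <= r3 <= r4).

    r < r1        : linear, continuing the slope of the lower parabola at r1
    r1 <= r < r2  : parabolic, k_low * (r - r2)**2
    r2 <= r <= r3 : zero (flat bottom: the restraint is satisfied)
    r3 < r <= r4  : parabolic, k_up * (r - r3)**2
    r > r4        : linear, continuing the slope of the upper parabola at r4
    """
    if r < r1:
        return k_low * (r1 - r2) ** 2 + 2.0 * k_low * (r1 - r2) * (r - r1)
    if r < r2:
        return k_low * (r - r2) ** 2
    if r <= r3:
        return 0.0
    if r <= r4:
        return k_up * (r - r3) ** 2
    return k_up * (r4 - r3) ** 2 + 2.0 * k_up * (r4 - r3) * (r - r4)
```

For a hypothetical crosslink with an allowed range of 4 to 12 Å, `flat_bottom_restraint(10.0, 2.0, 4.0, 12.0, 14.0, 10.0, 10.0)` gives no penalty, while larger distances are penalized first quadratically and then only linearly, so that a few strongly violated restraints do not dominate the total energy.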
Other forms for less common applications may also be available in the programs or be implemented by a user. However, the most interesting additional features of versatile DR implementations are the different kinds of averaging that can be used to describe DRs: i) complex restraints can involve atom groups rather than single atoms at either or both restraint sides; ii) time-averaged DRs, where target values are satisfied on average within a given time lapse rather than instantaneously; iii) ambiguous DRs, averaged on different distance pairs. The latter two cases are very useful when the overall DRs are not fully consistent with each other because they are observed in the presence of conformational equilibria and, as such, are associated with different microstates of the system. In addition, complex and versatile protocols can easily be developed in those programs where different parameters can be smoothly varied during the simulation (AMBER).
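The averaging in cases ii) and iii) can be illustrated with inverse-power averages, which weight short distances more than long ones. The sketch below assumes the conventions commonly adopted for NMR-derived restraints (an inverse sixth-power sum for ambiguous restraints, an inverse third-power mean for time averaging); these exponents and the function names are assumptions of this example, and a different averaging may be more appropriate for a given crosslinker or program.

```python
def ambiguous_distance(distances, power=6.0):
    """Effective distance of an ambiguous restraint over candidate atom pairs.

    Uses (sum_i r_i**-power)**(-1/power): the result is dominated by the
    shortest candidate distance, so one satisfied pair satisfies the restraint.
    """
    return sum(r ** -power for r in distances) ** (-1.0 / power)


def time_averaged_distance(distances, power=3.0):
    """Effective distance of one atom pair over a series of simulation frames.

    Uses (mean_t r_t**-power)**(-1/power): the restraint can be satisfied on
    average even if the instantaneous distance is sometimes too large.
    """
    n = len(distances)
    return (sum(r ** -power for r in distances) / n) ** (-1.0 / power)
```

The effective distance returned by either function can then be passed to the restraint energy function in place of an instantaneous single-pair distance.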
3.3.2. Programs for model validation
A validation of the final models, very often included in part in the available automated modelling protocols, represents a mandatory step, especially for more complex (low-homology, few experimental data) modelling tasks. A huge number of protein and nucleic acid structural analysis and validation tools exist, based on many different criteria, and subjected to continuous development and testing; thus, even a CASP section is dedicated to structural assessment tools (
Specific parameters associated with MS-based data can usually be analysed with available tools. Distance restraints and their violations can be analysed both on single structures and on ensembles (sets of possible solutions of prediction methods, frames from molecular dynamics trajectories) with several graphic or textual programs, the most specialized obviously being those tools developed for the analysis of NMR-derived structures.
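When a dedicated tool is not at hand, the violation analysis on an ensemble reduces to a few array operations. The following is a minimal sketch using NumPy, assuming that the ensemble is already loaded as an array of shape (models, atoms, 3) and that each restraint is an upper distance bound between two atom indices; the function names and the data layout are hypothetical.

```python
import numpy as np


def restraint_violations(ensemble, restraints, tolerance=0.0):
    """Upper-bound violations (in coordinate units) for every model and restraint.

    ensemble:   array of shape (n_models, n_atoms, 3)
    restraints: sequence of (atom_index_i, atom_index_j, upper_bound)
    Returns an array of shape (n_models, n_restraints); 0 means satisfied.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    i = np.array([r[0] for r in restraints])
    j = np.array([r[1] for r in restraints])
    upper = np.array([r[2] for r in restraints], dtype=float)
    dist = np.linalg.norm(ensemble[:, i, :] - ensemble[:, j, :], axis=-1)
    return np.clip(dist - upper - tolerance, 0.0, None)


def summarize_violations(excess):
    """Per-model and per-restraint statistics from the violation array."""
    violated = excess > 0.0
    return {
        "per_model_count": violated.sum(axis=1),         # violated restraints in each model
        "per_model_max": excess.max(axis=1),             # largest violation in each model
        "per_restraint_fraction": violated.mean(axis=0)  # fraction of models violating each restraint
    }
```

The per-model counts can be used to rank or filter candidate structures, while restraints violated in most models may point to errors in the data or to conformational heterogeneity.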
Surface information can be analysed by programs that calculate different kinds of molecular surfaces, such as van der Waals, accessible, or solvent-excluded surfaces for overall systems, and contact surfaces for complexes. However, unlike the case of distance restraints, available programs usually work on a single input structure at a time, thus making structure filtering and analysis on the large ensembles of models potentially produced by conformational prediction, molecular simulation or docking calculations a painful or impossible task. In these cases, scripts or programs to automate the surface calculations and to average or filter the results must be developed.
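A script of this kind can be sketched as follows. The example computes per-atom solvent accessible surface areas with a simple numerical point-counting scheme (in the spirit of the Shrake-Rupley approach) and averages them over an ensemble. It is a minimal NumPy illustration under stated assumptions, not a replacement for dedicated surface programs: the function names, the 1.4 Å probe radius, the number of test points and the array layout are all choices made for this example.

```python
import numpy as np


def sphere_points(n=200):
    """Quasi-uniform points on the unit sphere (golden-spiral construction)."""
    k = np.arange(n) + 0.5
    phi = np.arccos(1.0 - 2.0 * k / n)
    theta = np.pi * (1.0 + 5.0 ** 0.5) * k
    return np.column_stack((np.cos(theta) * np.sin(phi),
                            np.sin(theta) * np.sin(phi),
                            np.cos(phi)))


def sasa_per_atom(coords, radii, probe=1.4, n_points=200):
    """Approximate accessible surface area of each atom in one structure."""
    coords = np.asarray(coords, dtype=float)
    ext = np.asarray(radii, dtype=float) + probe   # probe-expanded radii
    unit = sphere_points(n_points)
    areas = np.empty(len(coords))
    for a in range(len(coords)):
        pts = coords[a] + ext[a] * unit            # test points on the expanded sphere
        others = np.arange(len(coords)) != a
        d = np.linalg.norm(pts[:, None, :] - coords[others][None, :, :], axis=-1)
        buried = (d < ext[others]).any(axis=1)     # point lies inside another expanded sphere
        areas[a] = 4.0 * np.pi * ext[a] ** 2 * (1.0 - buried.mean())
    return areas


def ensemble_sasa(ensemble, radii, **kwargs):
    """Mean and standard deviation of per-atom areas over all models."""
    per_model = np.array([sasa_per_atom(m, radii, **kwargs) for m in ensemble])
    return per_model.mean(axis=0), per_model.std(axis=0)
```

Summing the per-atom values over the atoms of a residue gives per-residue exposures, which can then be compared with labelling data to filter the models; for large systems a neighbour list would be needed to keep the calculation affordable.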
4. Modelling with sparse experimental restraints
In the previous section, many of the computational methods that can contribute to producing structural models in MS3D applications have been outlined, together with different ways to integrate MS-based experimental information into them. Here we will refocus on the overall computational approach in MS3D, to illustrate some of its peculiar features and issues, its present potentialities and the variety of possible combinations of data and protocols that can be devised to optimally handle different types of structural problems. Depending on the nature and quantity of available experimental information and on previous knowledge of the investigated system, different combinations of the methods mentioned in previous sections can be optimally employed. We will start by illustrating examples of methods for de novo protein folding, a frontier application of modelling with sparse restraints, because it is based on minimal additional information on the system under investigation.
The MONSSTER program (Skolnick et al., 1997) only makes use of SS profiles and a limited number of long-distance restraints. By employing system discretization and coarse-graining to reduce the required sampling, a protein is represented by a lattice-based Cα-backbone trace with single interaction center rotamers for the side-chains. By using N/7 (N is the protein length) long-range restraints, this method is able to produce folds of moderate resolution, falling in the range from 4 to 5.5 Å of RMSD for Cα-traces with respect to the native conformation for all-α and α/β proteins, whereas β-proteins require, for the same resolution, N/4 restraints. A more recent method for de novo protein modelling (Latek et al., 2007) adopts restrained folding simulations supported by SS predictions, reinforced with sparse experimental data. The authors focused on NMR chemical-shift-based restraints, but sparse restraints from different sources can also be employed. A significant improvement of model quality was already obtained by using a number of DRs equal to N/12.
As already stated by Latek and colleagues, the introduction of DRs into a protein folding protocol represents a critical step that in principle could negatively affect the sampling of conformational space. In fact, restraint application at too early stages of the calculations can trap the protein in local minima, where restraints are satisfied but the native conformation is not reached. In addition to the number, even the specific distribution of long-range restraints along the sequence can affect the sampling efficiency. To test the influence of data sets in the folding problem, we applied a well-tested SA protocol, developed for the AMBER program and mainly oriented to NMR structure determination, to the folding simulation of bovine pancreatic trypsin inhibitor (BPTI), by using different sets of ten long-distance restraints, randomly selected from available NMR data (Berndt et al., 1992), with optional inclusion of an SS profile. Fig. 3 shows representative structures for each restraint set.
The three native BPTI disulphide bridges were taken into account by additional distance and angle restraints. BPTI represents a typical benchmark for this kind of study, due to its peculiar topology (an α/β fold with long connection loops, stabilized by disulphide bonds) combined with a limited size (58 residues), and to the availability of accurate X-ray and NMR structures. SA cycles of 50 structures each were obtained and compared for four combinations of three sets (S1-3) of ten long-distance restraints, totally non-redundant among different sets, and SS profiles: a) S1+SS profile; b) S1 alone; c) S2+SS profile; d) S3+SS profile. The S1 set performed definitely better than the other two, its best model exhibiting an RMSD value of 2.4 Å on the protein backbone of residues 3-53 from the representative NMR structure.
This set was also able to provide a reasonable low-resolution fold even in the absence of SS restraints (b). S2 resulted in almost correctly folded models, but with significantly worse RMSD values than S1 (c). With S3, pseudomirror images (d) of the BPTI fold occurred several times and only one model out of 50 was correctly folded (not shown).
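The backbone RMSD values used in comparisons of this kind are computed after optimal rigid-body superposition. A minimal NumPy sketch of such a calculation, based on the Kabsch algorithm, is given below; it is an illustration and not the procedure actually used for the data above, and the function name is hypothetical. Since only proper rotations are allowed, a mirror-image fold cannot be superposed on the native one and is detected by a large RMSD.

```python
import numpy as np


def superposed_rmsd(mobile, reference):
    """RMSD between two equally sized coordinate sets after optimal superposition.

    Both inputs are arrays of shape (n_atoms, 3) with atoms in the same order.
    Only translations and proper rotations are applied (no reflections).
    """
    p = np.asarray(mobile, dtype=float)
    q = np.asarray(reference, dtype=float)
    p = p - p.mean(axis=0)                      # remove translation
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)           # SVD of the 3x3 covariance matrix
    d = np.sign(np.linalg.det(vt.T @ u.T))      # -1 if the best fit would be a reflection
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T   # optimal proper rotation
    diff = p @ rot.T - q
    return float(np.sqrt((diff ** 2).sum() / len(p)))
```

In practice the two coordinate sets would contain the selected backbone atoms of the model and of the reference structure, extracted in the same residue order.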
These results suggest a strong dependency of the outcome upon both the exact nature of the experimental data used in structure determination and the protocol followed for model building. Thus, the number of restraints estimated in the aforementioned studies as necessary/sufficient for a reliable structural prediction should be interpreted with caution for practical purposes. If a proper protocol is adopted, increasing quantity, quality and distribution homogeneity of data should decrease this dependency, but the problem still remains severe when using very sparse restraints, such as those associated with many MS3D applications. A careful validation of the models and, possibly, execution of more modelling cycles with variations in different protocol parameters, can help to identify and solve these kinds of problems.
However, in spite of these potential issues, ab initio MS3D can provide precious insight into systems that are impossible to study with other structural methods. In addition to increases in the number of experimental data, homology-based information and other statistically-derived constraints can also substantially increase the reliability of MS3D predictions. Thus, suitable combinations of experimental data, predictive methods and computational approaches have allowed the modelling of many different proteins and protein complexes spanning a wide range of sizes and complexity. The illustrative examples shown in Table 1 represent just a sample of the systems affordable with current computational MS3D techniques and a guideline to select possible approaches for different problem classes. The heterogeneity of reported systems, data and methods, while suggesting the enormous potentialities of MS3D approaches, practically prevents any really meaningful critical comparison among methods, whose description in applicative papers is often incomplete. A standardization of MS3D computational methods is still far from being achieved, since it requires a considerable computational effort to tackle the large number of strategies and parameters that should be tested in a truly exhaustive analysis. Furthermore, the extreme sensitivity of modelling with sparse data to constraint distribution, as seen in the example shown in Fig. 3, either introduces some degree of arbitrariness in comparative analyses, or makes them even more computationally intensive, by requiring the use of more subsets for each system setup to be sampled.
Advancements in MS3D experimental approaches continuously change the scenarios for computational procedures, by substantially increasing the number of data, as well as the types of crosslinking or labelling agents and proteolytic enzymes. The large numbers of crosslinks obtained for apolipoproteins (Silva et al., 2005; Tubb et al., 2008) or CopA copper ATPase (Lübben et al., 2009) are good examples of these trends (Table 1).
As already stated in the preceding section, the comparative analysis of the computational approaches involved in MS3D is still considerably limited, because of the complexity both of the systems to be investigated and of the methods themselves, especially when they are used in combination with restraints as sparse as those usually available in MS3D studies. The continuous development in all the experimental and computational techniques involved considerably accelerates the obsolescence of the results provided by any accurate methodological analysis, thus representing a further disincentive to these usually very time-consuming studies. In this view, rather than strict prescriptions, detailed recipes or a sharp critical comparison of available approaches, this study was meant to provide an overall and as wide as possible picture of the state-of-the-art approaches in MS3D computational techniques and their potential application fields. However, in spite of these limitations, some general conclusions can still be drawn.
For the predictive methods that underlie the most ambitious MS3D applications (ab initio folding, folding prediction, threading), at least when used in the absence of experimental data, metaservers exhibit on average better performance than the single servers they employ, as also shown by the results of the last CASP rounds on automatic servers (
In comparing server-based applications to standalone programs (often available as an alternative for a given approach), potential users should also consider that the former require less computational skill and fewer resources, but are intrinsically less flexible than the latter, and that legal and confidentiality issues may arise, because several servers consider submitted prediction requests and the corresponding results as public data, as usually clearly stated in submission pages. In addition to possible information “leakage” in projects, the public status of the models would prevent their use in patents.
When considering MS3D procedures more specifically, it has been shown that even a small number of MS-based restraints can significantly help in restricting the overall space to be explored and in identifying the correct fold/complexation mode, especially if they are introduced in the early modelling stages of a computational procedure optimized to deal with both the investigated system and the available data. Thus, experimental restraints can allow the use of a single model generation procedure, rather than a multiple/metaserver approach, at least in non-critical cases. In fact, they should filter out all the wrong solutions deriving from the biases of the modelling method, leaving only those close to the “real” one, if it is included in the sampled set. In particular, since the lowest energy structure should ideally also be associated with a minimum violation of experimentally-derived restraints, the coincidence of minimum energy structures with least violated restraints should be suggestive of correct modelling convergence and evaluation of experimental data. However, particular care must be taken not only in the choice of the overall computational procedure, but especially in the choice of the protocol used to introduce experimental information, because a too abrupt build-up of the restraints can easily lead to local minima far from the correct solution. Comparison of proper scoring functions other than energy between experimentally-restrained and unrestrained solutions may provide significant help in identifying potential issues in data or protocols. Estimates of the sensitivity of solutions to changes in protocols may also reinforce the reliability of the best converged cases. In particular, when other restraints are also present, the relative strength and/or introduction order of the different sets could play an important role in the final result; thus, their weight should be carefully evaluated by performing more modelling runs with different setups.
When evaluating the overall modelling procedures, their corresponding caveats and performance issues, the importance of many details in the setup and validation of MS3D computational procedures fully emerges, thus suggesting that they still require considerable human skill, although many fully automated programs and servers allow in principle the use of MS3D protocols even by inexperienced users. This is also demonstrated for the pure ab initio modelling stage by the still superior performance obtained by human-guided predictions in CASP rounds, when compared to fully automated servers.
Future improvements in MS3D are expected as a natural consequence of the continuous development in biochemical/MS techniques for experimental data, and in hardware/software for molecular simulations and predictive methods. However, some specific, less expensive and possibly quicker evolution in MS3D could be propelled by the targeted development of computational approaches more directly related to the real nature of the experimental information on which MS3D is based, notably algorithms implementing surface-dependent contributions and more faithful representations of crosslinkers than straight distance restraints.