X-Ray Diffraction in Biology: How Can We See DNA and Proteins in Three Dimensions?

Knowing the three-dimensional structure of biological macromolecules, such as proteins and DNA, is crucial for understanding how living systems function. Structural biology is the branch of biology that studies the structure and spatial organization of biological macromolecules. Its main method, biological crystallography, is based on the study of X-ray diffraction by crystals of macromolecules. This article presents the principle, methodology and limitations of solving biological structures by crystallography.


Introduction
In 1953, James Watson and Francis Crick revealed the double helical structure of DNA using results that Rosalind Franklin had obtained by X-ray scattering on natural filaments formed by DNA molecules [1]. Proteins, the nanomachines essential to living organisms, have their "manufacturing plan" encoded in the DNA sequence of their gene [2]. During their synthesis, proteins adopt a specific three-dimensional structure that allows them to perform their functions within the cell. "Seeing" the structure of biological macromolecules, such as proteins or nucleic acids (RNA or DNA), allows researchers to elucidate the mechanisms of life in all organisms and, among many other applications, to design new drugs [3]. Is "seeing" proteins or nucleic acids in three dimensions a dream or a reality? Could microscopy, a technique known for more than 350 years that makes it possible to visualize biological cells, be the right approach? The dimensions of these two kinds of objects, macromolecules and cells, are very different: cell size generally ranges from 10 to 100 microns (1 micron = 10⁻⁶ m), whereas the dimensions of biological macromolecules, proteins or nucleic acids, are of the order of tens of angstroms (1 Å = 10⁻¹⁰ m) (Figure 1). To reach atomic detail, the method of choice is crystallography, whose principle is based on bombarding crystals composed of biological macromolecules with X-rays [4]. Why use X-rays? Their wavelength is of the order of one angstrom and thus corresponds to the distance between two bonded atoms. Why use a crystal? To date, the design of an X-ray microscope encounters two obstacles. First, the signal from a single macromolecule is too weak. Second, no device, such as a lens, exists that can form a direct image of a macromolecule with X-rays. Using a crystal, which contains about 10¹⁵ identical macromolecules periodically arranged in the three directions of space, overcomes these obstacles.
In only 50 years, crystallography has become the technique of choice for determining the structures of biological macromolecules at the atomic scale, taking advantage of major advances in fields as diverse as molecular biology, biochemistry, computer science, physics and, more recently, robotics. Today, crystallography can determine the three-dimensional structures of increasingly complex macromolecules, and can do so increasingly quickly. Currently, more than 25 crystal structures are deposited daily in the Protein Data Bank (http://www.rcsb.org) [5].
The physical principle of crystallography is based on X-ray diffraction by all the electrons constituting the atoms of all the macromolecules contained in the crystal (Figure 2). The analysis of these diffraction data then allows the crystallographer to calculate the electron density, which is the distribution of the electron cloud of the macromolecule in the crystal. Provided it is sufficiently precise (this precision depends on the resolution of the diffraction data), the electron density allows each atom of the molecule to be located, and thus its coordinates in three-dimensional space to be determined [6]. Obtaining this three-dimensional structure requires several steps that fall within multiple disciplines (Figure 3). Each of these steps is a potential bottleneck that needs to be overcome. They are the production and purification of the macromolecule, its crystallization, and the collection and processing of diffraction data. Another crucial step is the determination of the phases of the measured signal, which is absolutely required to calculate the electron density. The last step is the refinement of the built structure, called the model, which will then be interpreted in the context of its biological function. The analysis of the model will in turn raise new questions leading to the determination of other crystal structures, such as the structure of a complex between the studied protein and its partners [7]. The following sections describe each of these steps.

Steps upstream of the structure determination
The first step, which falls within biology and includes molecular biology and biochemistry techniques, is the production of highly pure macromolecule in large quantity. Once the sequence of the macromolecule to be studied has been identified and characterized by bioinformatics analyses, the sequence corresponding to the gene of the macromolecule is cloned into an expression vector, and the macromolecule is classically produced in a bacterial organism (typically Escherichia coli). The macromolecule is then extracted from the bacterial cells and purified using chromatographic techniques. The prerequisite for the next step is a concentrated (of the order of tens of grams per liter) and highly pure (greater than 98%) sample of the macromolecule.
The next bottleneck is based on physical chemistry, specifically crystallization, which involves concepts such as the solubility of molecules and their transition from a soluble state to an ordered, solid crystalline state [8]. This step, which relies on statistical screening, explores the variation of parameters such as temperature, pH, the concentration of the biological macromolecule, and the nature and concentration of crystallizing agents and various additives [9]. Obtaining a single homogeneous crystal that yields high-quality diffraction data is a crucial step in the process of determining a macromolecular structure. To increase the success rate, crystallization robots are used today to screen several thousand combinations of parameters. The size (from tens to hundreds of microns) and the morphology of the crystals are highly variable (Figure 4) and are not necessarily related to their diffracting power and quality.
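The idea of screening combinations of crystallization parameters can be illustrated with a minimal sketch. All values below (temperatures, pH values, concentrations, the choice of precipitants and the names `build_screen` and `screen`) are hypothetical and chosen only for illustration. They are not taken from the article or from any commercial screen.

```python
from itertools import product

# Hypothetical screening axes (illustrative values only, not a real screen).
temperatures_c = [4, 20]
ph_values = [5.5, 6.5, 7.5, 8.5]
protein_mg_per_ml = [10, 20]
precipitants = ["PEG 4000", "ammonium sulfate", "MPD"]
precipitant_levels = ["low", "medium", "high"]

def build_screen():
    """Return every combination of the parameters above as a list of dicts."""
    keys = ("temperature_c", "ph", "protein_mg_per_ml",
            "precipitant", "precipitant_level")
    axes = (temperatures_c, ph_values, protein_mg_per_ml,
            precipitants, precipitant_levels)
    return [dict(zip(keys, combo)) for combo in product(*axes)]

screen = build_screen()
print(len(screen), "conditions; first one:", screen[0])
```

Even this small grid of five parameters gives 2 × 4 × 2 × 3 × 3 = 144 conditions, which suggests why robots are used once more parameters and finer steps are considered.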

The diffraction data
The crystals obtained during the previous step are fished using a small loop (Figure 4), cryocooled to protect them from radiation damage [10], and then placed into a monochromatic X-ray beam produced by an appropriate source, either a rotating anode generator available in crystallography laboratories or a synchrotron, the latter producing significantly more intense beams [11]. Under these conditions, the waves scattered by the electrons of the macromolecules, which are ordered in three dimensions in the crystal, add up in given directions (each diffracted beam is characterized by a structure factor, Figure 7) and generate diffraction spots on the detector (Figure 5A). All the spots, regularly spaced, form the diffraction pattern (Figure 5A). This diffraction pattern is reconstituted from several hundred images, each corresponding to one orientation of the crystal, which is rotated during the measurement of the diffraction data (Figure 2 and Figure 5B). The information contained in each diffraction spot is characterized by the amplitude and the phase of the structure factor of the corresponding scattered wave.
The three-dimensional distribution of the spots is directly related to the cell parameters, e.g. the three edge lengths of the parallelepiped that constitutes the volume element (the cell), which is regularly repeated in space (Figure 6) and thus describes the crystal. The distribution of the spot intensities is directly related to the distribution of the electron density (the macromolecules) in the cell. Mathematically, this means that the diffraction pattern is the Fourier transform of the electron density (Figure 7). Because this transformation is invertible, the electron density contained in one cell can be calculated by inverse Fourier transform, provided that the amplitude and the phase of all the diffracted beams are known (Figure 7). Whereas the amplitude is proportional to the square root of the intensity of the diffracted spot, the phase information is not experimentally measurable.
In summary, the crystal "realizes" a Fourier analysis producing diffraction data, and the crystallographer will calculate a Fourier synthesis to get the electron density contained in one cell (Figure 7).
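The Fourier analysis and synthesis described above can be sketched with a toy one-dimensional "cell". This is only an illustration under strong simplifying assumptions: the density is sampled on 16 grid points, the two point-like "atoms" and their weights are invented, and the function names `structure_factors` and `fourier_synthesis` are hypothetical. The sketch shows that amplitudes together with phases give back the density, whereas amplitudes alone do not.

```python
import cmath
import math

def structure_factors(density):
    """Fourier analysis: density samples -> complex structure factors F(h)."""
    n = len(density)
    return [
        sum(density[x] * cmath.exp(2j * math.pi * h * x / n) for x in range(n))
        for h in range(n)
    ]

def fourier_synthesis(factors):
    """Fourier synthesis: structure factors -> real-valued density samples."""
    n = len(factors)
    return [
        (sum(factors[h] * cmath.exp(-2j * math.pi * h * x / n)
             for h in range(n)) / n).real
        for x in range(n)
    ]

# Toy 1D cell with two point-like "atoms" (positions and weights are invented).
density = [0.0] * 16
density[3] = 8.0
density[10] = 6.0

F = structure_factors(density)
amplitudes = [abs(f) for f in F]      # obtainable from the measured intensities
phases = [cmath.phase(f) for f in F]  # lost in the experiment

# Amplitudes AND phases: the original density is recovered.
recovered = fourier_synthesis(
    [a * cmath.exp(1j * p) for a, p in zip(amplitudes, phases)]
)

# Amplitudes only (all phases set to zero): the map no longer matches the
# density, and its highest peak sits at the origin instead of on an atom.
no_phase = fourier_synthesis(amplitudes)

print("max error with phases:",
      max(abs(r - d) for r, d in zip(recovered, density)))
print("highest peak without phases at x =", no_phase.index(max(no_phase)))
```

A real calculation is three-dimensional and uses fast Fourier transform routines, but the need for both amplitude and phase is the same.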

From the diffraction data to the electron density
Three main methods exist for the estimation of the phases [12]. We have to remember here that the number of phases to be estimated is typically several tens to hundreds of thousands (the phase of each spot for which the intensity has been measured has to be estimated).
The first method is molecular replacement. It uses the known structure of a homologous protein.
To date, approximately 60% of the structures found in the PDB were solved by this method [5]. It consists of constructing a virtual crystal by placing the homologous structure in the cell of the crystal studied using mathematical translation and rotation functions and comparing the diffraction pattern calculated from this virtual crystal and the measured diffraction data. Since the Fourier transforms of two homologous molecules placed in the same crystal are similar, the calculated phases are an excellent approximation of the phases of the measured signal [13][14][15].
The second method is multiple isomorphous replacement, which consists of diffusing electron-rich heavy atoms into the crystal [16]. This method was used to solve the phase problem in the first protein structure determinations, those of myoglobin and hemoglobin [17,18], by John Kendrew and Max Perutz in 1960. The presence of the heavy atoms slightly modifies the diffraction intensities. Once the heavy atoms have been positioned in the crystal lattice using methods based on Patterson functions [19], comparing the diffraction patterns recorded in the presence and in the absence of the heavy atoms allows the phases to be estimated by triangulation.
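The triangulation can be sketched for a single reflection. The sketch below is a simplified illustration, not the procedure used by crystallographic software: it assumes error-free amplitudes, heavy-atom contributions that are already known, and a brute-force scan over whole-degree trial phases. All numerical values and the names `residual` and `consistent_phases` are hypothetical.

```python
import cmath
import math

def residual(fp_amp, phi_deg, derivatives):
    """Squared mismatch between predicted and measured derivative amplitudes."""
    fp = cmath.rect(fp_amp, math.radians(phi_deg))
    return sum((abs(fp + fh) - fph_amp) ** 2 for fh, fph_amp in derivatives)

def consistent_phases(fp_amp, derivatives, tol=1e-6):
    """Trial phases (whole degrees) compatible with all given derivatives."""
    return [phi for phi in range(360)
            if residual(fp_amp, phi, derivatives) < tol]

# Hypothetical reflection: the true phase is what the experiment cannot measure.
true_phase_deg = 130
fp_amp = 100.0
f_protein = cmath.rect(fp_amp, math.radians(true_phase_deg))

# Heavy-atom contributions, assumed known once the heavy atoms are located.
fh1 = cmath.rect(20.0, math.radians(30))
fh2 = cmath.rect(15.0, math.radians(250))

# Each derivative: (heavy-atom structure factor, measured derivative amplitude).
derivative1 = (fh1, abs(f_protein + fh1))
derivative2 = (fh2, abs(f_protein + fh2))

one_derivative = consistent_phases(fp_amp, [derivative1])
two_derivatives = consistent_phases(fp_amp, [derivative1, derivative2])

print("one derivative :", one_derivative)
print("two derivatives:", two_derivatives)
```

With a single derivative, two phases fit the data equally well. A second derivative with a differently oriented heavy-atom contribution removes the ambiguity, which is why several derivatives are used in practice.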
The third method is anomalous dispersion, a specific property of the diffraction pattern that appears when the absorption of X-rays is no longer negligible [20,21]. This method consists of varying the wavelength of the incident beam around the absorption edge of one of the atom types contained in the molecule. Comparing the diffraction patterns at different wavelengths allows the phases to be estimated using methods similar to those of isomorphous replacement [22]. Selenium is often used because it has an absorption edge close to the wavelengths used (e.g. 1 Å). For proteins, selenomethionine, an analogue of the amino acid methionine in which the sulfur is replaced by selenium, is generally introduced biosynthetically [23]. In the case of nucleic acids, modified bases containing bromine are frequently used [24].

From the electron density to the structural model
Once a first set of phases is estimated, a first electron density map is calculated. If this map is sufficiently interpretable, the macromolecule can be built step by step in this map (Figure 8). A combination of automated algorithms and manual methods available through interactive graphics software is used [25], leading to a final model composed of the three-dimensional coordinates of each atom of the cell content, which consists of one or several macromolecules.
From this first model, the diffraction intensities are calculated by Fourier transform and compared with the experimentally measured intensities. This comparison allows the model to be improved step by step. This cyclical process, called crystallographic refinement, alternates between the search for the global minimum of energy functions and the manual rebuilding of the model [26].
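The comparison between calculated and measured data is conventionally summarized by a residual known as the crystallographic R-factor, the sum of the absolute differences between observed and calculated amplitudes divided by the sum of the observed amplitudes. The article does not name this statistic, so its use here is an assumption about which measure is meant. The sketch below also assumes a single overall scale factor and uses invented amplitude values.

```python
def r_factor(f_obs, f_calc):
    """R = sum(| |Fobs| - k*|Fcalc| |) / sum(|Fobs|), with one overall scale k."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    # Simplest possible scaling (an assumption): match the sums of amplitudes.
    scale = sum(f_obs) / sum(f_calc)
    numerator = sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

# Invented amplitudes for four reflections.
observed = [10.0, 20.0, 30.0, 40.0]
early_model = [12.0, 18.0, 33.0, 37.0]
refined_model = [10.5, 19.5, 30.5, 39.5]

r_early = r_factor(observed, early_model)
r_refined = r_factor(observed, refined_model)
print(f"R before refinement: {r_early:.3f}, after: {r_refined:.3f}")
```

A decreasing value over successive cycles indicates that the model explains the measurements better. Actual refinement programs use more elaborate scaling and target functions than this sketch.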

Steps downstream of the structure determination
The final step, downstream of the structure determination by X-ray diffraction, concerns the interpretation of the structure and its integration into the biological context [27][28][29]. It consists of understanding the structural result as a three-dimensional object and appreciating its function at the cellular or evolutionary level. The description of the interatomic interactions, the secondary structures (Figure 9), the domains and their arrangement, which defines the fold or tertiary structure (Figure 9), as well as the characterization of the shape, the electrostatic properties and the quaternary structure based on the content of the cell in the crystal packing, are often complemented by the study of the macromolecule in solution, to better characterize its oligomeric state (Figure 9) and its dynamic behavior, alone or in the presence of interactors, if known. These studies use a variety of biophysical methods, such as mass spectrometry, analytical ultracentrifugation, light scattering, microcalorimetry or surface plasmon resonance (Biacore® technology), etc. [30]. In the case of enzymes, these studies will be coupled with enzymological approaches to determine the activity and the catalytic constants.
[Figure caption, beginning missing: … Figure 2). The "all-atom" representation shows all the atoms in the protein, the "Cα backbone" representation shows only one atom of each amino acid, the Cα carbon atom, and the cartoon representation shows the secondary structures in the shape of a helix for α-helices and in the form of arrows for β-strands. (B) Protein structures are described in four levels, from primary to quaternary structure.]
An analysis based on bioinformatics tools makes it possible to place the determined structure in the context of the structural and evolutionary knowledge available at a given time [31]. The lessons learned from these studies, often of primary importance, provide information on the classification of the structure and its sequence within a family of counterparts, on the distribution and evolution of the fold in the different domains of life (viruses, bacteria, archaea, eukaryotes), on the possible function when it is unknown, on the catalytic site and its conservation in space and in sequence, on the degree of oligomerization, or on the existence of interactions with other partners, proteins, nucleic acids or ligands. A final type of study seeks to place the three-dimensional object in the context of what is known about the major biological mechanisms of life, such as gene expression through transcriptomics, complex formation through interactomics, etc. This information will include the characterization of the partners of the studied macromolecule at the scale of the cell or the whole organism.
All these steps, from the structure determination to the biological interpretation, far from being the end of the story, are often the beginning of new structural studies (Figure 3). These can be organized around analyses of the relative importance of the components of the macromolecule, the amino acids, by determining the structures of mutants, or around studies of the interactions with partners, by determining the structures of macromolecular complexes.