Third party software tools implemented in the REACT suite. "n.a.", not applicable.
1. Introduction
The "age of omics" has provided a wealth of genomic and transcriptomic information that is readily available in public databases. In September 2011, GOLD (the Genomes OnLine Database) [URL for the GOLD database: http://genomesonline.org; URL for the Stanford Microarray database: http://smd.stanford.edu; URL for the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/]
This enormous amount of data provides a treasure chest of information ready to explore. In recent years, a number of powerful comparative genomics databases such as GenoList [URL for GenoList: http://genolist.pasteur.fr/; URL for the MicrobesOnline database: http://www.microbesonline.org]
Combining microarray data with genomic information is a particularly powerful approach for identifying and predicting regulons, which are regulatory units consisting of a number of genes or operons under the control of specific transcription factors. Such studies require the identification of co-expressed genes (indicative of co-regulation) from in-depth comparative transcriptome profiling, combined with genomic information, including operon structure, genomic context conservation and the presence of specific regulator binding sites.
The major problem of combining genomic with transcriptomic data to ultimately extract meaningful regulon information is the lack of defined standard formats and software interfaces that allow a direct transfer of data sets derived from transcriptome analyses to comparative genomics databases and vice versa. The REACT suite was developed to facilitate the combination of the different analysis steps outlined above in one intuitive and user-friendly environment. Transcriptome datasets from different sources can be integrated into REACT via a sophisticated import interface and are stored, together with the cognate genomic information, in a MySQL database. This database, together with the central part of the software toolkit and all interlinked third-party tools, runs on a central computer that performs the actual analyses: the "REACT-server". It is accessed by the user interface (the "REACT-client") via the internet or an intranet. The user works solely with the corresponding client program, which can be installed on the personal computers or laptops of various users. While the installation of the REACT-server demands some technical knowledge, the client can be run easily on any computer with a Java runtime environment.
Taken together, the REACT suite provides users with a simple-to-use but powerful bioinformatics environment to perform regulon annotation and comparative genomics analyses based on microarray data and genome sequences. Both server and client software of the REACT suite are freely available from the corresponding author.
2. The basic concept of REACT
REACT was developed to enable users to perform the various steps of expression and regulon analyses in a quick and intuitive manner. Tools are no longer separate entities demanding different and often incompatible data formats, but can rather be regarded as parts of a comprehensive, fully integrated unit. Data from a wide range of sources can be collected and analysed together. When working with REACT, the user has access to the various representations of the data as well as to the analysis tools via so-called "views" that are intuitively interlinked to enable an interactive flow of both data and analyses:
The �
The �
The �
The �
The concept of REACT includes an in-depth integration of the different views via links, enabling users to switch easily between different aspects of the data. Most views are flexible and can be extended with additional data fields to accommodate additional external links, allowing more individualized views and analyses of the data.
Moreover, wherever gene or array data are displayed, the user can easily collect them, thereby creating a data subset available as input for all other implemented analyses. During the various analysis steps, these collections can be continuously changed and expanded, again by selecting single genes and arrays or whole groups of them, such as groups of genes clustering together within a scatter plot analysis. All collected or "marked" arrays and genes are displayed throughout the various views of REACT in the form of sortable lists. The items of these lists act as internal links to the corresponding detailed
The implemented REACT-databases are organism-specific. In its current version, REACT contains two databases for the model bacteria
3. Description of the individual views of the REACT suite
The information stored in REACT databases can be accessed via so-called views that display the data, allow their selection and provide functional links between different types of data for their interactive analysis. In the following sections, we will describe the major views of REACT, to provide an overview of their features.
3.1. The GeneView
The � URL for the NCBI genome database: http://www.ncbi.nlm.nih.gov/sites/genome
In addition to the name, each gene has a unique gene-ID or gene number, which consists of an abbreviation for the organism and a number of the gene (based on the chromosomal position). For example, the identifier of the
The central part of the [URL for the COGs database: http://www.ncbi.nlm.nih.gov/COG; URL for the ENZYME database: http://enzyme.expasy.org/enzyme_ref.html; URL for the NCBI Protein database: http://www.ncbi.nlm.nih.gov/protein; URL for PDB: http://www.rcsb.org/; URL for the Pfam database: http://pfam.sanger.ac.uk/; URL for the Prosite database: http://prosite.expasy.org/; URL for the SMART database: http://smart.embl-heidelberg.de/; URL for the BSORF site: http://bacillus.genome.ad.jp/; URL for SubtiList: http://genolist.pasteur.fr/SubtiList/]
For all of the above, the links in the REACT databases are gene-specific and directly connect the user with the cognate gene/protein-specific page of the external database. Depending on the type of the external database and the information available for the displayed gene, zero to many external hits will be provided as links. If no such specific database identifier exists, as in the case of Google's search engine, a gene-related term (e.g. the gene name) has been chosen as the link parameter. REACT is highly adjustable to the individual users' needs. Hence, the external links are not limited to those preimplemented in the existing REACT databases for
In addition to the links and data fields, the URL for BLASTn and BLASTp: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Two additional functions are available in the
3.2. The OperonView
Operons are transcriptional units consisting of two or more neighbouring genes that are co-expressed. If a gene has been assigned to an operon and annotated accordingly in the REACT database, a link from the
The operon identifier is again immutable, since it is used by REACT as internal reference. The operon name by default consists of the concatenated names of the genes within this operon. When displayed outside of the
In addition to providing a direct link to all corresponding
3.3. The RegulonView
The next higher level of genetic units is the regulon, which consists of a number of genes or operons under the direct control of a specific transcription factor. Regulons are displayed within the REACT suite in the
The regulon-ID is derived from the gene-ID of the corresponding transcription factor and marked by the extension "_R". It is implemented as an active link that directly connects to an external regulon database. In case of [URL for the DBTBS database: http://dbtbs.hgc.jp/; URL for RegulonDB: http://regulondb.ccg.unam.mx/]
The central part of the
3.4. The ArrayView
All three views explained so far are highly similar to one another and strongly integrated, not only regarding the information provided but also in the way the user can navigate from one view to the next. They all provide a gene-centric view on the REACT data and invariantly rely on the genomic sequence as a reference. Regulons consist of operons, which are made up of individual genes with a defined position on the chromosome. The same is true for regulator binding sites.
In contrast, the URL for the description of the MIAME standard: http://www.mged.org/Workgroups/MIAME/miame.html
3.4.1. Organizing microarray data in the REACT suite
A complete microarray dataset contains at least three types of information. (i) A list of all genes represented by a given DNA microarray, which is linked to the corresponding expression values, either expressed as (ii) raw fluorescence values for the reference and experimental condition, or as (iii) the respective ratio (or fold-change) between the two conditions. Within the REACT suite, such a data collection is called an "Array". Obviously, the Array is only useful if additional descriptive information (meta-information) is available. This can be a short description of the specific experimental set-up or a link where this information is stored. Often, a group of array datasets are related to each other and described in a single format, e.g. as a result of one experiment. This is reflected by the REACT data format "Array Set".
The
Selection of one array or array set leads to the next sub-view "Act. Array", which provides the detailed information, including the ID, name, a description of the underlying experiment, the source of the data, available literature, and external links. The "Array Set" sub-view lists all individual arrays within the set, which can be marked separately for further analysis. The most important feature of this sub-view is a tabulated, sortable list of all genes for which data are available within this array. It contains information on the gene name, the signal value, the control value, the ratio of signal to control, the number of replicates that were combined, the arithmetic mean and error of the values. This data is normally directly derived from the original data sets. Two additional columns indicate which genes are currently marked and if their value can be trusted. The trust value is a simple way to allow users to flag single values as untrusted, thereby automatically excluding them from subsequent analyses. Trust values can be easily set for marked genes within the current sub-view.
The data table is sortable based on any column, e.g. high or low signals or ratio values. Genes of interest can be collected as "marked genes" for inclusion into follow-up analyses. Each gene-specific data row of the table functions as an internal link to the corresponding
An additional feature of the
3.4.2. Importing microarray data into the REACT database
As already mentioned, one of the major problems in comparative transcriptome analyses is the lack of a mandatory gold standard for array datasets, especially from the early, pre-MIAME era. But even ten years after this standard was introduced, this problem is still far from being solved, and the number of microarray datasets not complying with these standards is still rising (Brazma, 2009).
Even implementing the minimum amount of information needed to integrate an array data set into the REACT database (a two-column table, with one column containing the gene identifiers and the second containing either the signal values or expression ratios between signal and control) can be daunting. Gene identifiers are not always used consistently (as synonyms often exist), and the DNA microarray might not contain all genes or might contain duplicates of some. Likewise, signals can be represented as raw fluorescence values, either as mean or average values, in which case control values need to be provided or defined. Alternatively, a table might provide the ratio of signal to control, which can be expressed either as log-values or as fold-changes. To facilitate the handling and import of such diverse types of data, the REACT suite contains an easy-to-use microarray import interface (Fig. 3).
During import, microarray data in any tabulated format is initially pre-loaded into the REACT import panel. REACT automatically detects the number of columns in the file and generates an adequate number of numbered preview columns for easy identification. After semiautomated discrimination of commentary lines, the appropriate type of information has to be assigned to each column. REACT needs at least one column containing the gene identifiers and one column for the signal or ratio values. Other types of information can also be assigned, such as the signal background, the control value, and the control background. Based on the assignment, REACT "knows" what to do with the individual data, e.g. if background columns are specified, their values will be subtracted from the corresponding signal or control values. Ratios between signal and control values can either be directly imported or will be calculated, depending on the data provided. It is even possible to import data with only a single column containing the signal values (e.g. during time course experiments). In a later step, one of the imported arrays (e.g. time 0) can then be used as a standard control for all datasets to calculate the ratio values needed for most analyses. Large datasets containing many replicates of one experiment can be imported in a single table. In this case, REACT offers the possibility to average the sets of columns assigned for signal, control or ratio values.
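The column handling described above can be illustrated with a minimal Python sketch. It is not REACT's actual import code: the function names, the tuple layout of a replicate and the choice to average ratios (rather than signals) are assumptions made purely for illustration.

```python
from statistics import mean

def compute_ratio(signal, control, signal_bg=0.0, control_bg=0.0):
    """Subtract the backgrounds, then return signal/control (None if undefined)."""
    corrected_signal = signal - signal_bg
    corrected_control = control - control_bg
    if corrected_control <= 0:
        return None  # assumption: an undefined ratio is reported as missing
    return corrected_signal / corrected_control

def average_replicates(values):
    """Average replicate values, ignoring missing (None) entries."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None

# Hypothetical table: gene id -> replicates of
# (signal, signal background, control, control background)
rows = {
    "gene_A": [(1200.0, 200.0, 600.0, 100.0), (1100.0, 100.0, 500.0, 100.0)],
    "gene_B": [(300.0, 100.0, 100.0, 100.0)],
}

ratios = {
    gene: average_replicates(
        [compute_ratio(s, c, s_bg, c_bg) for s, s_bg, c, c_bg in replicates]
    )
    for gene, replicates in rows.items()
}
```

In this toy table, the control of `gene_B` vanishes after background subtraction, so no ratio is reported for it; a real importer might instead flag such a value as untrusted.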
If large numbers of different experiments are stored in a single table, they can be parsed at once using the "batch" import. The user defines the different ratio columns, and each column will be treated as a separate array within a common array set. Moreover, it is possible to define whether the ratio data are in logarithmic format, in which case they will be converted to internal non-logarithmic values.
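A batch import of this kind can be sketched as follows. The function name `split_batch_table`, the assumption that the gene identifier sits in the first column and the `log_base` parameter are hypothetical and only serve to illustrate the idea of one array per ratio column with optional conversion from logarithmic values.

```python
def split_batch_table(header, rows, ratio_columns, log_base=None):
    """Turn each selected ratio column into its own array (gene id -> ratio).

    If log_base is given, the values are treated as logarithmic ratios and
    converted to plain fold changes.
    """
    arrays = {header[col]: {} for col in ratio_columns}
    for row in rows:
        gene_id = row[0]  # assumption: first column holds the gene identifier
        for col in ratio_columns:
            value = float(row[col])
            if log_base is not None:
                value = log_base ** value
            arrays[header[col]][gene_id] = value
    return arrays

# Hypothetical table with two experiments stored as log2 ratios
header = ["gene", "condition_1", "condition_2"]
rows = [["gene_A", "1.0", "-1.0"], ["gene_B", "0.0", "2.0"]]
array_set = split_batch_table(header, rows, ratio_columns=[1, 2], log_base=2)
```

Each key of `array_set` then corresponds to one array of the common array set.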
One major challenge when comparing data from different sources and hence formats is dealing with variations and differences in the gene identifiers used in different microarray templates. REACT knows a large number of different gene descriptions, as mentioned in section 3.1. During data import, REACT will accept any of these names and synonyms. But if unknown identifiers occur or synonyms have been assigned more than once in a microarray dataset (e.g. in case of different probes representing a single gene), REACT will ask the user for a specific decision. The user can then skip/delete the line, manually assign a gene name, or add the new synonym to the database for future use.
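The identifier check described above amounts to a lookup against a synonym table that separates clean hits from cases needing a user decision. The following sketch assumes a simple case-insensitive dictionary; the identifiers and the function name are invented for illustration and do not reflect REACT's internal data model.

```python
def resolve_identifiers(identifiers, synonyms):
    """Map imported identifiers to internal gene ids.

    Returns (resolved, unknown, duplicates):
      resolved   - gene id -> line number of its first occurrence
      unknown    - (line number, identifier) pairs not found in the table
      duplicates - (line number, identifier, gene id) for repeated genes
    """
    resolved, unknown, duplicates = {}, [], []
    for line_no, identifier in enumerate(identifiers, start=1):
        gene_id = synonyms.get(identifier.lower())
        if gene_id is None:
            unknown.append((line_no, identifier))
        elif gene_id in resolved:
            duplicates.append((line_no, identifier, gene_id))
        else:
            resolved[gene_id] = line_no
    return resolved, unknown, duplicates

# Hypothetical synonym table (lower-case name or synonym -> gene id)
synonyms = {"abca": "ORG0001", "yxza": "ORG0001", "abcb": "ORG0002"}
resolved, unknown, duplicates = resolve_identifiers(
    ["abcA", "yxzA", "abcB", "mystery"], synonyms
)
```

The `unknown` and `duplicates` lists are exactly the cases in which a user would have to skip the line, assign a gene manually or register a new synonym.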
Taken together, REACT should be able to import virtually all formats of array data, as long as they are tabulated. For the more complex datasets, such as those generated by the GEO, special parsing options for the corresponding meta-information are available in REACT.
3.5. The MotifView
The
The "Upstream" panel is used to collect and display DNA regions upstream of coding sequences. Mostly, these will be intergenic regions, which are of particular interest, since they contain both (alternative) promoters and putative DNA-binding sequences of transcriptional regulators. The possibility to retrieve and manage such upstream regions is therefore of crucial importance in the context of regulon analyses. Upstream regions can be added to the "Upstream" table by one of three means: (i) collectively from the active list of marked genes, (ii) individually by gene name, or (iii) directly from within the
The "Upstream" table displays all upstream regions collected by the user in the course of an analysis by any of the three methods described above. For each upstream region, the ID and name of the corresponding gene, and the sequence and position of the respective region in the genome are displayed. These regions (or subsets thereof) can easily be removed or added, exported as FASTA-formatted sequence files or selected for further analyses, such as the MEME/MAST analyses (see section 4.4).
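As an illustration of the FASTA export mentioned above, the following sketch writes a list of upstream regions in FASTA format. The header layout (gene ID followed by gene name) and the line width are assumptions; the layout actually written by REACT is not specified here.

```python
def to_fasta(regions, line_width=60):
    """Format (gene id, gene name, sequence) tuples as FASTA text."""
    lines = []
    for gene_id, name, sequence in regions:
        lines.append(f">{gene_id} {name}")
        # Wrap the sequence into lines of at most line_width characters
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "\n".join(lines) + "\n"

# Hypothetical upstream region
fasta_text = to_fasta([("ORG0001", "abcA", "ACGTACGTAC")], line_width=4)
```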
In the context of regulon analyses performed within the REACT suite, motifs are defined as short stretches of nucleotide sequence that are conserved in a collection of upstream regions, derived e.g. from co-expressed genes. They are expressed as so-called position-specific scoring matrices (PSSMs, also known as Position Weight Matrices, PWM) or regular expressions (RE), which both describe the probability for specific bases to occur at a specific position of the motif. Such matrices are graphically displayed as so-called "SequenceLogos", in which the height of the letters representing the four bases is a measure for the degree of conservation at any given position within the motif.
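The relationship between a frequency matrix and the letter heights of a sequence logo can be made concrete with a small sketch. It assumes equally long, already aligned binding sites and uses the common definition of information content for DNA (2 bits minus the Shannon entropy of a column), without small-sample correction. The example sites are invented.

```python
import math

BASES = "ACGT"

def position_frequencies(sites):
    """Return one {base: frequency} dict per position of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({base: column.count(base) / len(sites) for base in BASES})
    return matrix

def information_content(frequencies):
    """Information content of one column in bits (maximum 2.0 for DNA)."""
    entropy = -sum(p * math.log2(p) for p in frequencies.values() if p > 0)
    return 2.0 - entropy

# Hypothetical aligned binding sites
sites = ["TGACA", "TGACT", "TGTCA", "TGACA"]
matrix = position_frequencies(sites)
ic = [information_content(column) for column in matrix]
```

A fully conserved position (here the first one) reaches the maximum of 2 bits, whereas variable positions yield shorter letter stacks in the logo.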
In REACT, defined motifs of known regulator-binding sites are stored in the "MotifTable". In this table, each motif is represented by the REACT-internal ID, the name of the motif (normally equivalent with the name of its cognate regulator), the motif length, the associated regular expression or PSSM, as well as the corresponding SequenceLogo. Selection of a motif opens the "Act. Motif" panel, which provides all available information of one motif, including the name of the regulon it is associated with. This regulon name serves as an internal link to the corresponding page within the
4. Search options and analysis tools within the REACT suite
So far, this book chapter has described the major views that represent and display the data stored within the organism-specific REACT databases. In the following sections, we will describe the tools that allow the user to search the database and analyse genes, motifs, and microarray datasets in order to extract and define regulons. These tools include a search engine, an internal BLAST tool, cluster analysis and scatter plot tools for microarray datasets, as well as the MEME/MAST algorithms to identify and search for regulator binding sites in upstream regions of co-expressed genes.
4.1. The Search tool
The wealth of information stored in the REACT databases requires search tools to find specific data sets. The REACT Search tool contains four panels, enabling the user to search for genes, regulons, arrays and array-sets, respectively. These panels share the same general structure and differ only in minor features. The common features will be described for the gene search panel (Fig. 3).
Genes of interest can be searched by all gene-specific data fields, e.g. by gene-ID, name, synonyms, function, comments, but also any other user-defined field. These fields can be searched using a number of match conditions, such as <containing>, <being equal to> or <starting with> a certain term. After the search, the hits are displayed in tabulated form in a result window below the search panel, where they can be marked or used as internal links to the respective views. Consecutive searches can be combined by <add>, <remove>, <keep> results or <negate> operations, thus enabling even more sophisticated searches.
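Combining consecutive searches in this way corresponds to ordinary set operations on the hit lists. The sketch below assumes one plausible reading of the four operations (union, difference, intersection and complement against all genes); the exact semantics in REACT, as well as the gene records and function names used here, are assumptions.

```python
def matches(value, term, mode):
    """Case-insensitive comparison for the three match conditions."""
    value, term = value.lower(), term.lower()
    if mode == "containing":
        return term in value
    if mode == "equal":
        return value == term
    if mode == "starting_with":
        return value.startswith(term)
    raise ValueError(f"unknown match mode: {mode}")

def search(genes, field, term, mode):
    """Return the ids of all genes whose field matches the term."""
    return {
        gene_id
        for gene_id, record in genes.items()
        if matches(record.get(field, ""), term, mode)
    }

def combine(current, new_hits, operation, universe=None):
    """Combine a previous result set with the hits of a new search."""
    if operation == "add":
        return current | new_hits
    if operation == "remove":
        return current - new_hits
    if operation == "keep":
        return current & new_hits
    if operation == "negate":  # assumption: complement of the current result
        return set(universe) - current
    raise ValueError(f"unknown operation: {operation}")

# Hypothetical gene records
genes = {
    "G1": {"name": "abcA", "function": "regulator"},
    "G2": {"name": "abcB", "function": "stress regulator"},
    "G3": {"name": "xyzT", "function": "transporter"},
}
first = search(genes, "name", "abc", "starting_with")
second = search(genes, "function", "stress", "containing")
```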
The search functions introduced so far are available in all four search panels. For genes and arrays, an additional function allows searching marked genes or arrays, respectively. Moreover, genes can also be successively searched by COG categories and COG terms.
4.2. The internal BLAST tool
Within the REACT suite, BLAST analyses (Altschul et al., 1990) can be performed in two different ways. First, it can be performed from within the
Both external (pasted into the input window) and internal (derived from the gene/protein displayed in the current
For each match, the DNA or amino acid sequence can be retrieved. Moreover, REACT also provides access to the corresponding upstream region via the "Retrieve upstream" function. The corresponding sequences will then be added to the "Upstream" panel of the
4.3. Microarray analysis tools
As mentioned before, the REACT suite is based on organism-specific databases that contain two types of data. The gene-centric data is derived from public genome sequence information and accessible through the
4.3.1. The scatterplot tool
A scatter plot is a graphical way to project values for two variables of a data set into a two-dimensional grid, thereby placing similar samples in the same regions of the grid. The data is displayed as a collection of points, each having the value of one variable determining the position on the
Within REACT, scatter plot analyses are normally used to display genes according to their expression data of two selected arrays, using the expression value of the first array as
In most cases, the vast majority of analysed genes should show the same expression values/ratios under both conditions and will therefore be placed closely together on the
Scatter plot analyses can be performed using either signal or ratio array values, thereby allowing the comparison of the behaviour of genes in the presence of different stimuli (ratio data), but also of different time points from one time course experiment (using signal data). Such comparisons of expression data from two different microarrays are called two-dimensional scatter plots (see Fig. 4 for an example).
But the user can also compare the data of one array against itself, using the same signal or ratio values as coordinates for the
The input (expression data) for both types of analyses can either be log-transformed or normalized for the arrays or for the genes (array- and gene-centering, respectively). Moreover, the data can be filtered to remove "untrusted" genes prior to the analysis. Here, REACT removes all genes previously flagged as untrusted and unreliable (either automatically during the import or later by the user) in one or both array datasets.
The major advantage over using external standard scatter plot tools is the deep integration of the REACT scatter plots with the REACT database. Without pre-selection of genes, the analysis will be carried out with the complete microarray data sets. Genes that specifically respond to only one of the two conditions will appear as outliers and can then be easily selected directly from the plot and thereby added to the list of marked genes for further analyses within REACT. This deep integration and direct connection of array-centric results with gene-centric information is one of the major strengths of the REACT suite, which enables the user to efficiently analyse even complex datasets.
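The preprocessing options named above can be sketched in a few lines. The example assumes that array-centering means subtracting the mean of all values of one array and that the log-transformation uses base 2; both are common conventions but are assumptions here, as are the function names.

```python
import math
from statistics import mean

def drop_untrusted(values, untrusted):
    """Remove all genes that were flagged as untrusted."""
    return {gene: v for gene, v in values.items() if gene not in untrusted}

def log2_transform(values):
    """Log2-transform positive values; non-positive values are dropped."""
    return {gene: math.log2(v) for gene, v in values.items() if v > 0}

def center(values):
    """Array-centering: subtract the mean of the array from every gene."""
    array_mean = mean(values.values())
    return {gene: v - array_mean for gene, v in values.items()}

# Hypothetical ratio values of one array
ratios = {"g1": 8.0, "g2": 2.0, "g3": 0.5, "g4": 100.0}
prepared = center(log2_transform(drop_untrusted(ratios, untrusted={"g4"})))
```

Gene-centering would follow the same pattern, but subtract the mean of one gene across all selected arrays instead.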
But scatter plots can also be performed on only a small group of genes collected in previous analyses, thereby enabling the user to focus on a relevant subset of the data. The second approach is for example useful if these genes are known or suspected to belong to one regulon, in which case they should show a similar behaviour under various conditions. Two-dimensional scatter plots provide an easy way to test this hypothesis, since currently marked genes can be labelled in the plot and thereby easily visualized (Fig. 4).
Images of the scatter-plots can be directly retrieved. For presentation or publication purposes, individual genes can be labelled with their names, or specific symbols can be assigned to groups of genes, in order to distinguish them.
4.3.2. The cluster analysis (HeatMap) tool
To perform more sophisticated expression analyses of multiple microarray datasets, the hierarchical clustering functions of the Cluster 3.0 Software (de Hoon URL for the source code of the Cluster software: http://rana.lbl.gov/EisenSoftwareSource.htm
The hierarchical cluster analyses embedded in the REACT suite provide a way to compare the expression behaviour of genes over multiple microarray datasets but also, if needed, to group and cluster arrays. The result is a two-dimensional, colour-coded matrix (or grid) in which each row represents one gene, while each column corresponds to one array dataset. Rows and/or columns are sorted according to their overall distances, and this clustering is illustrated by flanking distance trees, in which the length of the branches serves as a measure for similarity: The shorter the branches, the higher their similarity (Fig. 5).
The complexity of the data is not lost, as all ratio or signal values for each gene within all arrays are visualized by the colour of the individual cells within the heat map grid. When ratio values are displayed, green colour indicates an increase (positive ratio value) and red a decrease (negative ratio value) of the expression in comparison to the control condition of the array, while the intensity of the colour is an indicator for the magnitude of change (Fig. 5). Signal values are coloured according to their percentage from the lowest and highest measured value within the array.
To run a cluster analysis, the user first has to decide which genes and microarray datasets are to be included. Again, the active list of marked genes/arrays can directly be applied. Since REACT only serves as an interface to the Cluster 3.0 software package, its panel mimics the original input fields, with some modifications (inset to Fig. 5). The choice of parameters includes: (i) clustering of only genes, arrays, or both; (ii) use of ratio or signal values; (iii) log-transformation of the data; (iv) removal of "untrusted" data (see above). Distance measures such as Euclidean distance, Kendall's tau, Pearson correlation or Spearman's rank correlation are available for both gene- and array-clustering. Moreover, genes and arrays can again be normalized as well as centered, as described for the scatter plot analysis above. For the final linkage, the user can choose between Pairwise Single, Pairwise Complete, Pairwise Centroid and Pairwise Average Linkage clustering methods. For details on clustering, see: http://bonsai.hgc.jp/~mdehoon/software/cluster/manual/Hierarchical.html#Hierarchical
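To illustrate how a distance measure and a linkage method interact, the following sketch implements a Pearson-correlation distance and a naive pairwise average-linkage clustering in plain Python. It is written from scratch for illustration only and is not the Cluster 3.0 implementation that REACT calls; the expression profiles are invented.

```python
import math
from itertools import combinations

def pearson_distance(x, y):
    """Distance 1 - r, where r is the Pearson correlation of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # assumption: a flat profile is treated as uncorrelated
    return 1.0 - cov / (norm_x * norm_y)

def average_linkage(profiles, distance=pearson_distance):
    """Agglomerative clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average of all pairwise distances between members of a and b
            d = sum(
                distance(profiles[i], profiles[j]) for i in a for j in b
            ) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges

# Hypothetical expression profiles over three arrays
profiles = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0], "g3": [3.0, 2.0, 1.0]}
merges = average_linkage(profiles)
```

The order and distances of the merges correspond to the branch structure of the distance tree that flanks a heat map: `g1` and `g2` are joined first at a distance close to zero, and the anti-correlated `g3` joins last.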
The results of the cluster analysis are displayed in the
To further analyse a certain gene cluster, it can directly be selected from the flanking distance trees, which are also interactive: Selecting any branch will mark the corresponding rows or columns. Intersections of selected rows and columns can be obtained and selected parts of the heat map can be displayed in higher resolution in the right subpanel of the
The content of each subpanel can be exported both as an image file (different file formats can be chosen), as well as in tabulated form. Cluster results can also be stored and reloaded again, e.g. to enable the user to compare the clustering of specific groups of genes between different analyses.
4.3.3. The "Show regulons of marked genes" function
Co-expression (and therefore co-clustering) of groups of genes is a strong indication that they belong to one regulon, i.e. are under the direct control of a common transcriptional regulator. In the case of the two model bacteria currently implemented in the REACT suite,
To simplify the identification of known regulons within a marked group of genes derived from one of the above analyses, the "Show regulons of marked genes" function was implemented in the REACT suite. It displays, in an additional window, all regulons with which at least one currently marked gene is associated. Moreover, the results window will also list all operons and genes of any identified regulon, thereby providing a direct overview of the coverage of a given regulon within the group of marked genes identified by the cluster analysis. As usual, the identifiers of the regulons, operons and genes function as internal links to the corresponding views, enabling a seamless integration with subsequent analyses of the identified transcriptional units.
This function therefore offers a very straightforward and easy-to-use approach to identify the regulators responsible for an observed co-expression of a group of genes.
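The logic of such a lookup can be sketched as follows. The nested dictionary layout (regulon, then operon, then genes), the identifiers and the reported coverage value are assumptions for illustration; only the "_R" extension of the regulon-ID follows the convention described in section 3.3.

```python
def regulons_of_marked_genes(marked, regulons):
    """List every regulon that contains at least one marked gene.

    regulons: regulon id -> {operon name -> [gene ids]}
    """
    report = {}
    for regulon_id, operons in regulons.items():
        members = {gene for genes in operons.values() for gene in genes}
        hits = members & marked
        if hits:
            report[regulon_id] = {
                "marked": sorted(hits),
                "coverage": len(hits) / len(members),
                "operons": operons,
            }
    return report

# Hypothetical regulon annotation and marked gene list
regulons = {
    "ORG0010_R": {"opA": ["ORG0011", "ORG0012"], "opB": ["ORG0020"]},
    "ORG0050_R": {"opC": ["ORG0051"]},
}
marked = {"ORG0011", "ORG0020", "ORG0099"}
report = regulons_of_marked_genes(marked, regulons)
```

The coverage value shows at a glance which fraction of a known regulon is represented in the marked group.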
4.4. Motif analysis tools
If the above-mentioned function did not yield a direct insight into the regulatory principles underlying an observed co-expression, the next step of a typical analysis would be to search for putative regulator binding sites in the upstream genomic regions of co-expressed genes and operons. To facilitate these analyses, the MEME/MAST tools from the MEME (Multiple EM for Motif Elicitation) suite [URL for online access to the complete MEME suite at: http://meme.nbcr.net]
4.4.1. The MEME-Analysis tool
A prerequisite for any motif search is a collection of (upstream) sequences that are supposed to contain a common motif. In the REACT suite, this is facilitated by the "get upstream" function, which can be found in a number of views, including the
Like other analysis panels of REACT, the MEME view is also divided into two areas: in the upper part, the sequences and analysis parameters can be specified, while the results will be displayed in the lower panel (Fig. 6).
To start a MEME analysis, the user has to provide the sequences (in this case: upstream regions of genes) that are believed to share a common motif. This can be done in one of three ways: (i) selecting upstream sequences from the "Upstream sequence" panel, (ii) directly pasting sequences into the respective sequence window of the MEME interface, or (iii) uploading an external file. The latter two options enable the inclusion of sequences that are not derived from the REACT database. Next, the number of allowed (or expected) motifs per sequence needs to be defined. Additional parameters include (i) the minimum and maximum motif width, (ii) the maximum number of motifs to be discovered, (iii) a statistical threshold value, and (iv) limitation to palindromic sequences.
REACT's MEME results consist of a graphical overview of the analysed sequences (Fig. 6) illustrating the occurrence and position of the motifs. Each motif is described by the following information: a motif ID, the length of the motif, a statistical value as a measure for the reliability of the motif, and a corresponding SequenceLogo as a graphical representation of the motif. As computable definitions, the description also includes the Regular Expression, an alignment of the motif from the analysed sequences, and the PSSM, which can all be exported. Alternatively, these definitions can be used directly for a MAST analysis to screen genome sequences from the REACT database for additional upstream regions containing this pattern (described in the following section) or stored in the REACT database for later analyses.
4.4.2. The MAST-Analysis tool
An important strategy to identify regulon members in large datasets, such as (multiple) genome sequences, is to screen them for the presence of sequence motifs, especially in intergenic regions, that are known or postulated to function as regulator binding sites. Such patterns can be derived from known operator sites described in other, closely related organisms (Wecke et al., 2006), or from motifs identified by MEME analyses from collections of co-expressed loci, as described above. One way of testing predicted motifs
MAST is a tool for searching biological sequence databases for sequences that contain one or more copies of a known motif. The quality of a resulting hit is calculated as the strength of the similarity of the particular sequence to all motifs, based on statistical probabilities. MAST works by calculating match scores for each sequence in the database compared with each of the provided motifs. These initial scores are then converted into statistical probability values, which are used to determine the overall match of the sequence to the group of motifs. By this approach, the best fitting sequences in the analysed data set can ultimately be identified.
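The first step of this procedure, calculating a match score for a motif at every position of a sequence, can be sketched with a log-odds matrix. This is a simplified illustration and not the MAST algorithm itself: the conversion of scores into statistical probability values is omitted, and the uniform background, the pseudocount and the function names are assumptions.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def log_odds_matrix(frequencies, background=0.25, pseudocount=0.01):
    """Convert per-position base frequencies into log2-odds scores."""
    matrix = []
    for column in frequencies:
        matrix.append({
            base: math.log2(
                (column[base] + pseudocount) / (1 + 4 * pseudocount) / background
            )
            for base in BASES
        })
    return matrix

def best_match(sequence, matrix):
    """Return (score, start, strand) of the best motif match.

    Both strands are scanned; for the "-" strand, start refers to the
    reverse-complemented sequence. The sequence may only contain A, C, G, T.
    """
    width = len(matrix)
    reverse = sequence.translate(COMPLEMENT)[::-1]
    best = None
    for strand, seq in (("+", sequence), ("-", reverse)):
        for start in range(len(seq) - width + 1):
            score = sum(matrix[i][seq[start + i]] for i in range(width))
            if best is None or score > best[0]:
                best = (score, start, strand)
    return best

# Hypothetical, perfectly conserved motif "TGAC"
frequencies = [
    {base: 1.0 if base == conserved else 0.0 for base in BASES}
    for conserved in "TGAC"
]
pssm = log_odds_matrix(frequencies)
forward_hit = best_match("AATGACAA", pssm)
reverse_hit = best_match("AAGTCAAA", pssm)
```

Ranking all upstream regions of a genome by such scores gives a rough impression of how candidate binding sites are prioritised before the statistical evaluation.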
The MAST interface of the REACT suite is located within the
The second important parameter is the sequence database to be searched. REACT contains pre-compiled data files containing all upstream regions of the two currently implemented model organisms, as well as of all respective reference organisms. These regions are defined as the 200 bases upstream of the start codon of each gene. Other parameters to be defined are the maximum number of sequences to be displayed, a probability threshold, and whether genes overlapping with the upstream regions should be displayed in the results.
After the analysis has been performed, a graphical overview of the results in the form of a block diagram is displayed. It shows the matching regions for each motif within each sequence, the direction of the match (forward or reverse), the gene ID to which the upstream region belongs, and a probability value indicating the match strength. The information can also be displayed in tabulated form. As usual, the diagram is interactive and provides a direct link to the corresponding gene-specific information.
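Extracting a fixed-length upstream region can be sketched as shown below. The coordinate convention (0-based, end-exclusive positions on the forward strand) and the treatment of the chromosome as linear are simplifying assumptions; for genes on the reverse strand, the region is returned as the reverse complement so that it reads in the direction of transcription.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def upstream_region(genome, start, end, strand, length=200):
    """Return up to `length` bases upstream of the start codon of a gene.

    start/end: 0-based, end-exclusive coordinates of the coding sequence on
    the forward strand. The genome is treated as linear, so regions are
    clipped at the sequence ends (no wrap-around for circular chromosomes).
    """
    if strand == "+":
        return genome[max(0, start - length):start]
    if strand == "-":
        region = genome[end:end + length]
        return region.translate(COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")

# Hypothetical toy genome with a gene occupying positions 6..14
genome = "TTGACAATGAAATAACATGG"
forward = upstream_region(genome, 6, 15, "+", length=4)
reverse = upstream_region(genome, 6, 15, "-", length=3)
```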
If additional promising matches are identified, they can be fed into a new iteration of creating motifs with the MEME tool and re-checking them with MAST. Again, the integrative nature of REACT enables and simplifies such follow-up analyses.
5. Operating the REACT suite
We will conclude this chapter with a brief summary of how the REACT suite can be navigated and modified. For this purpose, we will first describe a typical workflow through the features of REACT from the perspective of a user (5.1). In the second section, we will specifically address the rights and options of REACT-administrators (5.2). We will then outline how REACT can be expanded by embedding new organism-specific databases (5.3). Finally, we will provide a brief summary of the REACT concept and infrastructure (5.4).
5.1. Navigating REACT: The user approach
The functionality of the REACT suite relies on curated and comprehensive data that is provided by the organism-specific REACT database. It provides three different types of data: (i) gene-centric data (derived from genome sequences and their annotation), (ii) array-centric data (extracted from microarray databases and individual sources of transcriptome experiments), and (iii) motif data (based on experimental and computational evidence).
While there are many ways to use the REACT suite, it was developed with the goal of enabling the user to identify and characterize regulons starting from in-depth analyses of microarray datasets. Here, we will illustrate a typical workflow through the REACT suite (Fig. 7) in order to highlight the concept of REACT by connecting the central features that have so far been described primarily in isolation in the previous sections.
A typical experiment could start with importing new microarray datasets, which are subsequently analysed in detail by scatter plot or cluster analysis. These initial studies will presumably be performed genome-wide, but with a limited number of relevant microarray datasets. As a result, groups of interesting genes will be identified that respond in a condition-specific manner and could potentially be co-expressed and therefore co-regulated. All unknown genes can be subjected to in-depth analyses, primarily using the information stored in the
Enabling such iterative and interactive processes that rely on both sequence-based and array-based data and analysis tools is a major advantage of the REACT suite. Because of its concept and architecture, the necessary information and data flow can be controlled easily and the analyses can be performed efficiently.
5.2. Modifying REACT: The administrator mode
In the age of omics, new genome sequences and microarray studies are published with ever-increasing speed. It is therefore important that a REACT database, once established, can be updated regularly to keep pace with the growth of available data and information. As a precautionary measure to avoid data corruption and thereby ensure the integrity of the database, however, it is advisable that not all users have the right to modify the core data at all times during analyses. REACT therefore implements two different user roles. The REACT-user normally works in "read-only" mode, which allows them to browse the data, perform analyses, and export data to external files. In contrast, logging in as a REACT-administrator enables the user to permanently import additional data (such as microarray datasets or new reference genomes), to edit data already stored in the REACT database, and even to change the main views of REACT by incorporating additional links and features.
When logged in as REACT-administrator, most data displayed in the different views can be edited manually. To prevent unintentional data corruption, data can only be changed after deliberately switching into the edit-mode via the appropriate buttons provided in each view. In the edit-mode, all editable data is displayed in green and all links are disabled. Any changes applied to the data remain transient until they are confirmed by the REACT-administrator and thereby sent to the REACT-server and stored permanently.
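The interplay of user roles, edit-mode and confirmation can be illustrated with a small Python sketch. The class, its method names and the example record are hypothetical and serve only to show the principle that changes stay transient until they are confirmed; REACT itself is a Java application, and its actual implementation is not shown here.

```python
class PermissionDenied(Exception):
    """Raised when an action is not allowed in the current role or mode."""


class RecordView:
    """A view on one database record with an explicit edit mode."""

    def __init__(self, stored, role):
        self._stored = dict(stored)  # stands in for data held by the server
        self._pending = {}           # transient changes, not yet stored
        self.role = role             # "user" or "administrator"
        self.edit_mode = False

    def enter_edit_mode(self):
        if self.role != "administrator":
            raise PermissionDenied("read-only role cannot edit data")
        self.edit_mode = True

    def get(self, field):
        # Pending changes are visible in the view but not yet stored.
        return self._pending.get(field, self._stored[field])

    def set(self, field, value):
        if not self.edit_mode:
            raise PermissionDenied("switch to edit mode first")
        self._pending[field] = value

    def confirm(self):
        """Store pending changes permanently and leave edit mode."""
        self._stored.update(self._pending)
        self._pending.clear()
        self.edit_mode = False

    def cancel(self):
        """Discard pending changes and leave edit mode."""
        self._pending.clear()
        self.edit_mode = False

    def stored(self):
        return dict(self._stored)


view = RecordView({"description": "unknown"}, role="administrator")
view.enter_edit_mode()
view.set("description", "putative ABC transporter")
before = view.stored()["description"]  # stored data is still unchanged
view.confirm()
after = view.stored()["description"]   # change is now permanent
print(before, "->", after)
```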
However, some data fields are not editable, as REACT uses them as immutable internal references (e.g. as primary and secondary database keys) to identify the complete dataset. This includes the names and IDs of genes, operons, or microarray datasets, as well as DNA and amino acid sequences from the
A REACT-administrator can also define new data fields for the above listed views according to the individual requirements, including plain text fields and numeric fields. Moreover, new external links can also be added to the views. While it is quite easy to generate plain text or numeric fields (as just the field name and type have to be defined within the REACT-administrator dialogue), creation of additional link-fields is technically a bit more demanding.
In addition to the aforementioned options, REACT-administrators can import additional array data, create new array sets or change the assignments of arrays to a set. They can also store motifs computed during a MEME analysis permanently within the REACT database or define new regulons. In the
5.3. Expanding REACT: Embedding new organism-specific databases
REACT was initially developed for the analysis of two model organisms,
Therefore, REACT is equipped with a small set of additional tools that enable researchers with little knowledge of programming languages or database administration to create new organism-specific REACT databases from scratch. Following the instructions provided by the software, the user downloads freely available files containing the data used by REACT from sources such as the NCBI, UniProt or MicrobesOnline databases. Additional information (e.g. links to Pfam or PDB) is obtained from the KEGG web service via SOAP/WSDL, again without the need for more than very basic user interaction.
After the creation of an initial, empty REACT database (done by importing a provided SQL file into the SQL database), the information contained in the downloaded files and provided by KEGG is parsed by helper tools included in the REACT package, again minimizing user interaction.
Users with basic programming knowledge will then be able to extend the new REACT database by parsing data from additional data sources, depending on the organism chosen and the focus of the respective database. Subsequently, additional data needed by REACT (e.g. internal BLAST databases) will be computed automatically. The user will then be able to connect to this newly created REACT database in order to upload the first array sets.
5.4. Developing REACT: Concept, sources and infrastructure
As mentioned previously, the major aim of the REACT bioinformatics toolkit was the creation of an intuitive and interactive graphical user interface that allows an integrative view of genomic and microarray data and provides combined access to various bioinformatics tools commonly used in comparative genomic and transcriptomic studies. The overall structure of the REACT suite is illustrated in Fig. 8.
In the current release, the tools listed in Table 1 are integrated into the REACT suite. The software was implemented using a client/server architecture, enabling one or several users to work in parallel from different locations (Fig. 8). The REACT-server is the central computer running the database management software (MySQL), as well as all internal and integrated third-party analysis tools. Users work solely with the corresponding client program, which can be installed on their personal computers or laptops. Client and server communicate via intranet or internet using remote method invocation (RMI). REACT is implemented as a Java Swing application; client and server should therefore run under a variety of operating systems, depending only on the Java Runtime Environment (version 5 or higher). In the case of the server, however, this portability is limited by the external tools, as some of them (e.g. the MEME suite) depend on a Linux/Unix environment. To circumvent this limitation, REACT was developed and tested to run on Windows using Cygwin (1.5.x or higher), a compatibility layer for Windows that provides substantial Linux API functionality.
Software | Version | URL | Reference |
Blast | 2.2.x | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/LATEST/ | (Altschul et al., 1990) |
Cluster 3 | 3.0.x | http://bonsai.hgc.jp/~mdehoon/software/cluster/ | (de Hoon et al., 2004) |
MEME suite | 4.0.0 | http://meme.sdsc.edu/meme/meme-download.html | (Bailey et al., 2009) |
Cygwin | 1.7.5.x | http://www.cygwin.com/install.html | n.a. |
MySQL | 5.5.x | http://www.mysql.de/downloads/mysql/ | n.a. |
6. Conclusion
This chapter aimed at providing a thorough overview of the concept and functions of the REACT suite, a bioinformatics toolkit that was developed to simplify regulon predictions and comparative transcriptomic analyses for biologists with little to no background in bioinformatics. REACT was written in the belief that it will provide a powerful, yet simple-to-use platform that will also support the work of other research groups in extracting meaningful data from transcriptome studies with the help of comparative genomics. The complete REACT suite, including the databases for
The authors would like to thank Tina Wecke for beta-testing of the REACT suite, providing the figures and critical reading of the manuscript. Work in the Mascher lab is financially supported by grants from the Deutsche Forschungsgemeinschaft (DFG). Development of the REACT suite was enabled by funding from the "Concept for the future" of the Karlsruhe Institute of Technology (KIT) within the framework of the German Excellence Initiative.
References
1. Altschul S. F., Gish W., Miller W., Myers E. W., Lipman D. J. 1990. Basic local alignment search tool. 215: 403-410.
2. Bailey T. L., Boden M., Buske F. A., Frith M., Grant C. E., Clementi L., Ren J., Li W. W., Noble W. S. 2009. MEME SUITE: tools for motif discovery and searching. 37: W202-208.
3. Bairoch A. 2000. The ENZYME database in 2000. 28: 304-305.
4. Barrett T., Troup D. B., Wilhite S. E., Ledoux P., Evangelista C., Kim I. F., Tomashevsky M., Marshall K. A., Phillippy K. H., Sherman P. M., Muertter R. N., Holko M., Ayanbule O., Yefanov A., Soboleva A. 2011. NCBI GEO: archive for functional genomics data sets - 10 years on. 39: D1005-1010.
5. Brazma A. 2009. Minimum Information About a Microarray Experiment (MIAME)--successes, failures, challenges. 9: 420-423.
6. Brazma A., Hingamp P., Quackenbush J., Sherlock G., Spellman P., Stoeckert C., Aach J., Ansorge W., Ball C. A., Causton H. C., Gaasterland T., Glenisson P., Holstege F. C., Kim I. F., Markowitz V., Matese J. C., Parkinson H., Robinson A., Sarkans U., Schulze-Kremer S., Stewart J., Taylor R., Vilo J., Vingron M. 2001. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. 29: 365-371.
7. Cao M., Kobel P. A., Morshedi M. M., Wu M. F., Paddon C., Helmann J. D. 2002. Defining the σW regulon: a comparative analysis of promoter consensus search, run-off transcription/macroarray analysis (ROMA), and transcriptional profiling approaches. J Mol Biol 316: 443-457.
8. de Hoon M. J. L., Imoto S., Nolan J., Miyano S. 2004. Open source clustering software. 20: 1453-1454.
9. Dehal P. S., Joachimiak M. P., Price M. N., Bates J. T., Baumohl J. K., Chivian D., Friedland G. D., Huang K. H., Keller K., Novichkov P. S., Dubchak I. L., Alm E. J., Arkin A. P. 2010. MicrobesOnline: an integrated portal for comparative and functional genomics. 38: D396-400.
10. Demeter J., Beauheim C., Gollub J., Hernandez-Boussard T., Jin H., Maier D., Matese J. C., Nitzberg M., Wymore F., Zachariah Z. K., Brown P. O., Sherlock G., Ball C. A. 2007. The Stanford Microarray Database: implementation of new analysis tools and open source release of software. 35: D766-770.
11. Finn R. D., Mistry J., Tate J., Coggill P., Heger A., Pollington J. E., Gavin O. L., Gunasekaran P., Ceric G., Forslund K., Holm L., Sonnhammer E. L., Eddy S. R., Bateman A. 2010. The Pfam protein families database. 38: D211-222.
12. Gama-Castro S., Salgado H., Peralta-Gil M., Santos-Zavaleta A., Muniz-Rascado L., Solano-Lira H., Jimenez-Jacinto V., Weiss V., Garcia-Sotelo J. S., Lopez-Fuentes A., Porron-Sotelo L., Alquicira-Hernandez S., Medina-Rivera A., Martinez-Flores I., Alquicira-Hernandez K., Martinez-Adame R., Bonavides-Martinez C., Miranda-Rios J., Huerta A. M., Mendoza-Vargas A., Collado-Torres L., Taboada B., Vega-Alvarado L., Olvera M., Olvera L., Grande R., Morett E., Collado-Vides J. 2011. RegulonDB version 7.0: transcriptional regulation of K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res 39: D98-105.
13. Lechat P., Hummel L., Rousseau S., Moszer I. 2008. GenoList: an integrated environment for comparative analysis of microbial genomes. 36: D469-474.
14. Letunic I., Doerks T., Bork P. 2009. SMART 6: recent updates and new developments. 37: D229-232.
15. Liolios K., Chen I. M., Mavromatis K., Tavernarakis N., Hugenholtz P., Markowitz V. M., Kyrpides N. C. 2010. The Genomes On Line Database (GOLD) in 2009: status of genomic and metagenomic projects and their associated metadata. 38: D346-354.
16. Rose P. W., Beran B., Bi C., Bluhm W. F., Dimitropoulos D., Goodsell D. S., Prlic A., Quesada M., Quinn G. B., Westbrook J. D., Young J., Yukich B., Zardecki C., Berman H. M., Bourne P. E. 2011. The RCSB Protein Data Bank: redesigned web site and web services. 39: D392-401.
17. Sierro N., Makita Y., de Hoon M., Nakai K. 2008. DBTBS: a database of transcriptional regulation in containing upstream intergenic conservation information. Nucleic Acids Res 36: D93-96.
18. Sigrist C. J., Cerutti L., de Castro E., Langendijk-Genevaux P. S., Bulliard V., Bairoch A., Hulo N. 2010. PROSITE, a protein domain database for functional characterization and annotation. 38: D161-166.
19. Tatusov R. L., Koonin E. V., Lipman D. J. 1997. A genomic perspective on protein families. 278: 631-637.
20. Utts J. M. 2005. Thomson Brooks.
21. Wecke T., Veith B., Ehrenreich A., Mascher T. 2006. Cell envelope stress response in : Integrating comparative genomics, transcriptional profiling, and regulon mining to decipher a complex regulatory network. J. Bacteriol. 188: 7500-7511.
Notes
- URL for the GOLD database: http://genomesonline.org
- URL for the Stanford Microarray database: http://smd.stanford.edu
- URL for the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/
- URL for GenoList: http://genolist.pasteur.fr/
- URL for the MicrobesOnline database: http://www.microbesonline.org
- URL for the NCBI genome database: http://www.ncbi.nlm.nih.gov/sites/genome
- URL for the COGs database: http://www.ncbi.nlm.nih.gov/COG
- URL for the ENZYME database: http://enzyme.expasy.org/enzyme_ref.html
- URL for the NCBI Protein database: http://www.ncbi.nlm.nih.gov/protein
- URL for PDB: http://www.rcsb.org/
- URL for the Pfam database: http://pfam.sanger.ac.uk/
- URL for the Prosite database: http://prosite.expasy.org/
- URL for the SMART database: http://smart.embl-heidelberg.de/
- URL for the BSORF site: http://bacillus.genome.ad.jp/
- URL for SubtiList: http://genolist.pasteur.fr/SubtiList/
- URL for BLASTn and BLASTp: http://blast.ncbi.nlm.nih.gov/Blast.cgi
- URL for the DBTBS database: http://dbtbs.hgc.jp/
- URL for RegulonDB: http://regulondb.ccg.unam.mx/
- URL for the description of the MIAME standard: http://www.mged.org/Workgroups/MIAME/miame.html
- URL for the source code of the Cluster software: http://rana.lbl.gov/EisenSoftwareSource.htm
- For details on clustering, see: http://bonsai.hgc.jp/~mdehoon/software/cluster/manual/Hierarchical.html#Hierarchical
- URL for online access to the complete MEME suite: http://meme.nbcr.net