DRAGON 7.0 TUTORIAL
This short walk-through illustrates Dragon 7.0 with a real example of use.
RETRIEVE AND PREPARE DATA
The preliminary step before working with Dragon is to retrieve and prepare the molecule set, that is, the set of molecular structures for which the descriptors and/or fingerprints will be calculated.
Dragon accepts several digital molecular structure formats as input, such as SMILES strings, MDL structures (.SDF and .MOL files), HyperChem structures (.HIN files) and more. The user has to prepare the desired set of structures in one or more files. If several files are prepared, they have to be stored in the same folder in order to be uploaded.
NOTE: Dragon does not provide tools for manipulation and drawing of the molecular structure; it is up to the user to prepare the correct structures and, if needed, optimize the three-dimensional coordinates.
In this tutorial, every operation is carried out on a set of 546 organic molecules, whose molecular structures are provided as SMILES notations; in addition, an experimental property (i.e., the acute aquatic toxicity towards Daphnia magna) is provided. These data are publicly available at http://michem.disat.unimib.it/chm/download/toxicity.htm as a Microsoft Excel file.
To prepare an input file suitable for Dragon, copy the SMILES notations of all the molecules and paste them into a plain text file along with their CAS numbers (which will be used as the molecule identifier throughout the software), separated by a 'tab' delimiter. Then, copy and paste into a separate text file the values of aquatic toxicity of all the molecules; this file will be used to import external variables.
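For larger sets, the copy-and-paste step can also be scripted. The following is a minimal Python sketch, not part of Dragon: the function name, the file names and the column order (SMILES first, then the CAS number) are assumptions made for illustration, and the tutorial does not state whether a header line is expected, so none is written.

```python
def write_dragon_inputs(records, smiles_path, property_path):
    """Write the two plain text files described above.

    `records` is an iterable of (smiles, cas_number, toxicity) tuples,
    listed in the order in which the molecules should be loaded.
    Returns the number of molecules written.
    """
    count = 0
    with open(smiles_path, "w", encoding="utf-8") as smiles_file, \
         open(property_path, "w", encoding="utf-8") as property_file:
        for smiles, cas_number, toxicity in records:
            # One molecule per line: structure and identifier, tab-separated.
            # The column order is an assumption of this sketch.
            smiles_file.write(f"{smiles.strip()}\t{cas_number.strip()}\n")
            # Same row order in the property file, so that the i-th value
            # belongs to the i-th molecule.
            property_file.write(f"{toxicity}\n")
            count += 1
    return count
```

Calling write_dragon_inputs(rows, "molecules.smi", "toxicity.txt") with the rows taken from the Excel sheet produces both files in one pass, which keeps the two files aligned row by row.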
LOAD AND CALCULATE DESCRIPTORS
Once a proper input file has been generated, the first operation in Dragon is to select 'File' in the main menu bar and then click on the 'Load molecules' command. Select the input structure file of interest; Dragon will then read it and import the molecular structures.
The scroll-down sheet of the main Dragon window will now show the list of imported molecules. Below this list, the status bar reports the summary of the currently loaded set: the total number of molecules, the number of warned and rejected molecules, the number of calculated descriptors and fingerprints (only after MDs and FPs calculation), and the number of imported external variables (if any). As the imported structures contain no errors in their SMILES notations, no molecules are marked as rejected.
The next step is to select the desired list of descriptors to be calculated and to run the calculation. Select 'Calculate' in the main menu bar, then click 'Descriptors (MDs)'; the window for the descriptor selection is shown in the following figure.
Here, the user can select whole blocks, sub-blocks, or single descriptors. In this example, Dragon already comes with a selection of descriptors that excludes a priori all the descriptors requiring information on molecular geometry, since this is not provided in the input file. This happens because the molecules have been imported as SMILES strings, a molecular format that does not contain three-dimensional information.
NOTE: if no 3D information is available for the structures but 3D descriptors are nevertheless included in the calculation, no error occurs. Those descriptors will simply be assigned a missing value code for all the compounds.
HINT: please be aware that even if the input file is in a molecular format able to store three-dimensional information for the compound (for instance, a .SDF file), this does not guarantee that the encoded information is suitable for 3D descriptor calculation. In fact, atomic coordinates should be calculated by specific procedures; if this is not the case, the structure files will only seemingly contain 3D information and will lead to the calculation of erroneous and meaningless 3D descriptors.
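One cheap sanity check, which catches only the most obvious case, is to look for records whose atoms all have a z coordinate of zero, as typically happens when a 2D drawing is saved as an .SDF file. The Python sketch below is not a Dragon feature; it assumes V2000 mol blocks (fixed-width atom lines with x, y and z in three 10-character fields), and all function names are hypothetical. A flagged record is only a hint: a genuinely planar molecule may also lie in the z = 0 plane, and non-zero z values do not prove that the geometry was properly optimized.

```python
def split_sdf(sdf_text):
    """Yield each SDF record as a list of lines (the '$$$$' line is dropped)."""
    record = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield record  # tolerate a missing final separator


def is_flat(record, tolerance=1e-4):
    """Return True if every atom of a V2000 mol block has z close to zero."""
    counts_line = record[3]  # fourth line of the mol block
    if "V2000" not in counts_line:
        raise ValueError("only V2000 mol blocks are handled by this sketch")
    n_atoms = int(counts_line[0:3])
    atom_lines = record[4:4 + n_atoms]
    # The z coordinate occupies characters 21-30 of each atom line.
    return all(abs(float(line[20:30])) <= tolerance for line in atom_lines)


def flat_records(sdf_text):
    """Return (index, title) for every record that looks two-dimensional."""
    return [(index, record[0].strip())
            for index, record in enumerate(split_sdf(sdf_text))
            if is_flat(record)]
```

Any record returned by flat_records is worth inspecting before 3D descriptors are requested for it.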
Once the descriptors (in this example, all available descriptors except for 3D descriptors) have been selected, the calculation starts by clicking the 'Calculate' button.
After descriptor calculation, the status bar of the main window is updated with the total number of calculated descriptors (in this example 3,839) and the number of molecules with warnings. In this example, several molecules are marked as warned (their row in the scroll-down sheet of the main window is highlighted in orange). This simply means that for these molecules some descriptors have not been calculated (this commonly happens, as several descriptors have particular constraints). To see the warnings of a molecule, click on its row in the scroll-down sheet of the main window, then look at the message box below.
IMPORT ADDITIONAL DATA
After descriptor calculation, one can import some additional experimental properties (i.e., external variables). These are not required for the calculation of descriptors, but it is often useful to have them available together with the calculated descriptors. They allow some preliminary diagnostic analysis (for instance, the analysis of the correlation between some specific descriptors and the experimental property), and they can be exported together with the descriptors in the same output file, which can later be imported and used in any modelling software.
To import external variables, the user should select 'File' in the main menu and then click on 'Add external variables'. After the selection of the file containing the properties of interest, a window will guide the user through the import procedure.
Dragon will first check whether the number of values in the selected file (i.e., number of selected records: 546) agrees with the number of uploaded molecules (i.e., number of molecules: 546); if not, it will not be possible to import any data. To import the external variables, click the 'Import' button. The values of the property LC50[-LOG(mol/L)] are now available (in the block ) along with the calculated descriptors for any further operation throughout the software.
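Since a mismatch in the number of records blocks the import, it can be convenient to verify the two files before opening them in Dragon. A minimal Python sketch of such a check follows; it is not Dragon's own logic, the function names are hypothetical, and it assumes one record per non-empty line with no header line in either file.

```python
def count_records(path):
    """Count the non-empty lines of a plain text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def counts_agree(molecule_path, property_path):
    """Compare the number of molecules with the number of property values.

    Returns (agree, n_molecules, n_values).
    """
    n_molecules = count_records(molecule_path)
    n_values = count_records(property_path)
    return n_molecules == n_values, n_molecules, n_values
```

For the data set of this tutorial, both counts should be 546.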
HINT: external variables can also be labels or any alphanumeric string, not only numerical values. The import of text variables can be useful for merging other information about compounds (for example, alternative identifiers) and then exporting from Dragon a single output file with all the additional information.
ANALYZE CALCULATED DESCRIPTORS
The full list of the values of calculated descriptors can be explored by clicking on 'View' > 'Descriptors (MDs)'. A scroll-down sheet is shown, where each row represents a molecule and each column a descriptor or imported external variable (if any). It is possible to explore the descriptor blocks one at a time by selecting them in the scroll-down lists in the upper part of the window.
To gain in-depth insight into a single descriptor, click 'Analyze' > 'Univariate statistics'. A window will be shown, where two sections (i.e., window tabs) are available. In the first tab (i.e., Grid), some univariate statistics are reported for each descriptor (i.e., average value, standard deviation, minimum and maximum value). In the second tab (i.e., Chart), one can see the numerical values of a selected descriptor and a bar plot that visualizes the values of the descriptor for all the input molecules. This bar plot is an easy diagnostic tool. For our case study, the following figure shows the values and the corresponding bar plot of the descriptor MW (Molecular Weight), where the option 'descending order' has been set by clicking on the heading (MW) of the descriptor column (see the down-oriented green arrow at the right of the column heading, which indicates the selected order).
HINT: the ordered bar plot can help to evaluate the descriptor distribution; in this example, we can conclude that the distribution of the molecular weights in the molecule set is quite homogeneous; nonetheless, there is a single compound with a particularly high value (which could potentially be an outlier in the subsequent modelling/analysis tasks).
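The same statistics and the same descending ordering can be reproduced outside Dragon on exported values. The Python sketch below uses only the standard library; the function names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the tutorial does not say which formula Dragon applies.

```python
import statistics


def univariate_summary(values):
    """Average, standard deviation, minimum and maximum of one descriptor."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "max": max(values),
    }


def ranked(identifiers, values):
    """Pair each molecule with its value and sort in descending order."""
    return sorted(zip(identifiers, values),
                  key=lambda pair: pair[1], reverse=True)
```

The first entries returned by ranked correspond to the leftmost bars of a descending bar plot, so an isolated high value such as the one noted above appears at the top of the list.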
To analyze the correlation between descriptor pairs, click 'Analyze' > 'Correlation analysis'. A window with three sections (i.e., tabs) is shown, where calculated descriptors can be analyzed on the basis of their pair-wise correlation. The first tab (i.e., Correlation map) provides a heat map of the descriptor correlation matrix. In the second tab (i.e., Correlation list), it is possible to select a specific descriptor (e.g., LC50[-lOG(mol/L)]) and Dragon will fill in the lists of the descriptors that have a correlation coefficient larger or smaller than the user-defined threshold.
In our case study, the external variable (i.e., LC50[-lOG(mol/L)]) has been selected to check whether any descriptors are highly correlated with it; indeed, this is relevant information for the subsequent modelling stage. The following figure shows the results obtained by setting a threshold for direct correlation equal to 0.55. Unfortunately, there is no descriptor with a remarkably high correlation with the experimental property; however, it is interesting to note that, among the most correlated descriptors, there is AlogP, an estimate of the octanol/water partition coefficient, which is known to be related to chemical toxicity.
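The idea behind the correlation list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dragon's implementation: it uses the Pearson correlation coefficient, treats "direct" as positive and "inverse" as negative correlation, skips descriptors with zero variance, and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient, or None if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def correlation_list(descriptors, target, threshold):
    """Split descriptors into directly and inversely correlated lists.

    `descriptors` maps a descriptor name to its list of values; `target`
    holds the values of the selected variable for the same molecules.
    """
    direct, inverse = [], []
    for name, values in descriptors.items():
        r = pearson(values, target)
        if r is None:
            continue  # correlation is undefined for a constant descriptor
        if r >= threshold:
            direct.append((name, r))
        elif r <= -threshold:
            inverse.append((name, r))
    direct.sort(key=lambda item: item[1], reverse=True)
    inverse.sort(key=lambda item: item[1])
    return direct, inverse
```

With threshold set to 0.55 and the toxicity values as target, the first returned list plays the role of the direct-correlation list discussed above.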
The third tab (i.e., Scatter plot) visualizes the projection of the molecules in the space defined by two selected descriptors. In order to better explore the data points, one can add a third dimension to the graph by selecting a third descriptor that defines the point colors. In our example, the scatter plot is generated by selecting AlogP vs the experimental property, and the points are colored according to the values of the descriptor P_VSA_v_3.
HINT: the scatter plot can be very useful to study the correlation between pairs of descriptors/external variables and to highlight possible outliers in the molecule set.
Calculated descriptors can be further analyzed by the statistical technique of Principal Component Analysis (PCA). To run PCA, click on 'Analyze' > 'Principal Component Analysis'. The user first has to choose the descriptors to be included in the analysis; in our case study, PCA has been carried out on the whole block of 'constitutional descriptors'. These descriptors encode information on fundamental structural/chemical properties of compounds such as, for instance, the molecular weight, the number of atoms, and the number of atoms of certain atom types. The following figure shows the score and loading plots of the first two principal components; in the score plot, one can color the points according to a selected descriptor. Here, the points are colored on the basis of their toxicity values.
HINT: PCA can provide a quick overview of the chemical/structural information of the molecule set; in addition, it helps to discover possible clusters of similar compounds and outliers.
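For readers who want to see what lies behind score and loading plots, the following is a small pure-Python PCA sketch. It is not Dragon's algorithm: it assumes that each descriptor is autoscaled (centered and divided by its standard deviation) and extracts the components by power iteration with deflation, which is adequate for a handful of components; it also assumes that constant descriptors have already been removed. All function names are hypothetical.

```python
import math


def autoscale(rows):
    """Center every column and scale it to unit standard deviation."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def covariance(rows):
    """Covariance matrix of rows whose columns are already centered."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def dominant_eigenpair(matrix, iterations=1000):
    """Power iteration on a symmetric positive semi-definite matrix."""
    p = len(matrix)
    start = [float(i + 1) for i in range(p)]  # arbitrary non-zero start
    scale = math.sqrt(sum(x * x for x in start))
    vector = [x / scale for x in start]
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p))
                   for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm < 1e-12:  # no variance left to extract
            return 0.0, vector
        vector = [x / norm for x in product]
    value = sum(vector[i] * matrix[i][j] * vector[j]
                for i in range(p) for j in range(p))
    return value, vector


def pca(rows, n_components=2):
    """Return (explained variance ratios, loadings, scores)."""
    scaled = autoscale(rows)
    cov = covariance(scaled)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    ratios, loadings = [], []
    for _ in range(n_components):
        value, vector = dominant_eigenpair(cov)
        ratios.append(value / total)
        loadings.append(vector)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[i][j] - value * vector[i] * vector[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(x * w for x, w in zip(row, vec)) for vec in loadings]
              for row in scaled]
    return ratios, loadings, scores
```

Here each row of rows holds the descriptor values of one molecule. The scores give the coordinates of the molecules in the score plot, while the loadings show how much each descriptor contributes to each component.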
Finally, the user can export the calculated descriptors and use them in any third-party application for building statistical models. For instance, several scientists build their own modelling tools using platforms like MATLAB and R, taking advantage of existing libraries that offer multivariate modelling approaches (e.g., MLR, PLS, KNN) and machine learning tools (e.g., neural networks, support vector machines). Many others use tools where no direct coding is needed, for instance, data exploration and pipelining tools like KNIME or machine learning platforms like WEKA.
Clicking 'File' > 'Save descriptors' opens a window for choosing the descriptors to be exported, together with the desired options. The options are particularly relevant, as they give the user the chance to prune the whole descriptor list, discarding all the descriptors that would provide only meaningless or redundant information and would therefore be almost useless in the following modelling stage.
The user can choose to discard descriptors with constant values (as they do not carry any useful information content) and descriptors with one or more missing values (strongly recommended, as such descriptors would compromise the modelling). The user can also choose to discard some descriptors on the basis of the pair-wise correlation: this means that when two descriptors have an absolute correlation coefficient higher than the desired threshold, only one of them is retained (thus avoiding redundancy).
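The three exclusion rules can be sketched as follows in Python. This is an illustration, not Dragon's procedure: missing values are represented by None, the descriptor retained from a correlated pair is simply the one encountered first (the tutorial does not say which one Dragon keeps), and all names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prune(descriptors, max_abs_correlation):
    """Apply the three exclusion rules to a {name: values} mapping.

    Returns a new mapping containing only the retained descriptors.
    """
    kept = {}
    for name, values in descriptors.items():
        # Rule 1: discard descriptors with at least one missing value.
        if any(value is None for value in values):
            continue
        # Rule 2: discard constant descriptors.
        if min(values) == max(values):
            continue
        # Rule 3: discard a descriptor that is too correlated with one
        # already retained (the earlier one wins; an assumption).
        redundant = any(abs(pearson(values, other)) > max_abs_correlation
                        for other in kept.values())
        if not redundant:
            kept[name] = values
    return kept
```

Because the pair-wise rule compares each candidate with the descriptors already retained, the result depends on the order in which the descriptors are presented.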
The final set of pruned descriptors is saved as a plain text file (where values are separated by the 'tab' character) that can easily be imported into any third-party software. In this example, all descriptors have been saved in a single text file, which also includes the external variable. After the pruning procedure, the number of exported descriptors is 1,228 (starting from the initial 3,839 calculated descriptors).
HINT: the descriptor pruning procedure allows discarding several descriptors that would be useless, or even dangerous, for modelling purposes; by selecting the exclusion options, one gets a final descriptor subset that represents a good starting point for modelling with the maximal information content.
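As an example of the last step, the exported tab-separated file can be read into a descriptor matrix and a response vector with a few lines of Python. The sketch below assumes that the file starts with a header row of column names and contains no missing values; the tutorial states neither, and the function name and the column names passed to it are hypothetical.

```python
import csv


def load_descriptor_table(path, target_column, id_columns=()):
    """Read a tab-separated table into (ids, names, X, y).

    `target_column` is the header of the external variable and
    `id_columns` lists the headers of non-numerical identifier columns;
    every other column is treated as a numerical descriptor.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        names = [name for name in reader.fieldnames
                 if name != target_column and name not in id_columns]
        ids, X, y = [], [], []
        for row in reader:
            ids.append(tuple(row[column] for column in id_columns))
            X.append([float(row[name]) for name in names])
            y.append(float(row[target_column]))
    return ids, names, X, y
```

The returned X and y can then be passed to any modelling library of choice.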