ABOUT THIS USER GUIDE
This user guide is a practical guide to using the program GoldMine.

GoldMine is a tool for analysing large quantities of docking information, such as that produced by a structure-based virtual screen. This guide includes instructions on using the Windows interface and provides help on the scientific issues relevant to the analysis of virtual screening data. It is intended for readers who already have some experience of using docking programs.

Two additional software modules are provided with the GOLD suite of programs:
• The Hermes visualiser, for preparation of input files, visualisation of docking results and calculation of descriptors. The visualiser is also used for interactive docking setup, e.g. for defining the binding site and setting constraints. GoldMine is accessed via Hermes.
• GOLD, for protein-ligand docking.

Separate help is available for Hermes and GOLD. Tutorials are also available for GoldMine (see Appendix A: Tutorials).

The GoldMine user guide is divided into the following sections:
1 GoldMine and GoldMine Databases (see page 3)
2 Analysing and Data Mining GoldMine Databases (see page 11)
3 More Tools for Working with the Data (see page 26)
4 Calculation of further Descriptors to characterise the Docking Pose (see page 35)
5 Arithmetically Manipulating Descriptors: Consensus Scoring (see page 39)
6 Per Atom Scores (see page 45)
7 Hotspots (see page 47)
8 Creating Training and Test Sets of poses for Regression Model Building (see page 51)
9 Creating Statistical models that describe Biological Activity: The Regression Window (see page 55)
10 Visualising and Refining Selections of Docking Poses (see page 65)
11 Interactive Docking and Analysis: Using the GOLD Server (see page 66)
12 Acknowledgements (see page 71)
13 Appendix A: Tutorials (see page 73)

GoldMine User Guide


1 GoldMine and GoldMine Databases

1.1 Introduction
• GoldMine is a tool for the analysis and post-processing of docking results. Although primarily designed for the analysis of GOLD docking results, it can also be used to process data generated by other docking tools.
• GoldMine is installed as a component of Hermes and can be accessed from the Hermes top-level menu bar. Hermes is the name of the CCDC protein visualiser and is supplied free with GOLD and GoldMine.
• It is possible to create within GoldMine a database of docking data which may comprise one or more sets of docking data. A 'GoldMine Database' (or GoldMine DB) is the term we will use to describe such a database.
• GoldMine can be used to combine and analyse several docking runs. For instance, docking runs against different protein models may be combined within a GoldMine DB and analysed for selectivity and specificity. Docking runs carried out against one protein model but scored using different scoring functions may also be combined within a GoldMine DB. Several different schemes of consensus scoring may be carried out within GoldMine.
• Each set of docking results saved within a GoldMine DB will contain one or multiple binding poses for each ligand and the corresponding protein configurations. If a GoldMine DB is created from an ensemble docking run then all the proteins from the ensemble will be included.
• GoldMine DBs also contain any numerical or text information that is present as tagged fields in the .sdf or .mol2 files used to create the GoldMine DB. Such data may include the individual terms that make up the scoring function used in the docking. Each individual quantity for which a set of data is saved is termed a Descriptor.
• GoldMine allows you to filter your results in a sophisticated manner. Ranges for a number of descriptors can be set and combined in Boolean fashion to create sets of docking poses satisfying appropriate properties. These can be saved as Selections. Saved Selections can be opened on startup by other users, allowing GoldMine/Hermes to be effectively used within intranet-based information sharing systems.
• It is possible within Hermes to further describe docking poses by calculating additional descriptors for them that measure aspects of the protein-ligand interaction. Further details can be found within the Hermes documentation. These descriptors can be added to a GoldMine DB and used in further analysis.
• GoldMine supports the calculation of per-atom descriptors. Thus contributions to scoring functions can be broken down according to individual atoms or groups of atoms on the receptor.
• Numerical descriptors can be arithmetically transformed. They can for example be normalised. Rank orderings can be generated from them. They can also be arithmetically combined to give rise to composite descriptors. These can have value in consensus scoring schemes.
• Histograms and scatter plots can be generated for any numeric descriptors.

• Hot Spot grids can be calculated over poses for active molecules. These can be used to identify regions preferentially favoured by certain atom types and this information can then be used in docking or post-processing.
• Any Selection of poses can be defined as a cluster for which a centroid can be calculated in the Euclidean space of chosen descriptors. This cluster might comprise only active molecules for instance. The Euclidean distance to this centroid, over the same descriptors, can be calculated for an entire dock set, and saved as a new descriptor.
• GoldMine also has functionality to create the most effective rescoring protocols for Structure Based Virtual Screening, given a training set of docked actives and inactives. Receiver Operating Characteristic (ROC) curves can be generated for scoring functions, other descriptors, and linear combinations of descriptors, and a variety of enrichment metrics (EF, AUC under ROC, BEDROC) can be calculated.
• Step-wise multiple regression can be used to generate linear equations of scoring functions and descriptors that give optimum enrichment profiles on training and test sets.

1.2 Creating a GoldMine Database
1.2.1 Reading in GOLD Results
• Select GoldMine from the Hermes top-level menu and choose Create from the pulldown menu. This will initiate the GoldMine Creation Wizard.
• Both the protein and a full set of docked ligands may be read in a single step. Activate the Gold run (*.conf) radio button if not set by default.
• Specify the appropriate GOLD .conf file in the text box. You can use the Browse button to navigate to the appropriate file. Then click on Next.
• The radio button adjacent to New GoldMine should be toggled on. You will now need to input a file name for the new GoldMine DB.
• If you wish to use SQLite to create your database simply type the name into the Filename box. Contact your database administrator for information on connecting to a PostgreSQL database.
• In the box titled New Dock Set put in the name you wish to give this results set. The appropriate protein file to use with this set is that specified within the gold.conf file. Protein files are imported into the GoldMine DB and no link to the original file location is required once the GoldMine DB is created. Click on Next.

• At this point a confirmation window should appear, detailing the makeup of the new GoldMine database.
• If the details are correct click on Finish. The GoldMine DB should now import the relevant data. Once data import has been completed you will be given a choice of opening the new GoldMine DB or not.

1.2.2 Reading in MACCS or MOL2 Results
A GoldMine DB can be created from docking poses which are saved in MACCS (.sdf) or MOL2 (.mol2) format. Any data fields or tags that are associated with these files will also be imported and appropriately named descriptors created. The data fields may be of the type integer, real or text.
• Select GoldMine from the Hermes top-level menu and choose Create from the pulldown menu.
• Choose either the MACCS (*.sdf, *.sd) option or the MOL2 (*.mol2) option. Select the appropriate file that you wish to create the GoldMine DB from. Click on Next.
Notes:
A: A special text field called NAME will be created using information from the first line of each ligand entry in the MACCS or MOL2 input file. This will normally be the name associated with the ligand model when it was originally created.
B: GoldMine decides on whether a field is Integer, Real or Text by the properties of the first entry found. Null entries are accepted but that field will then be assigned as Text. It follows that care needs to be taken in ensuring that data associated with the first structure of an input file has the desired format, i.e. nulls corresponding to integer and numeric fields are replaced by 0 and 0.0 respectively.
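Note B above can be illustrated with a minimal sketch, assuming the type of a field is decided purely by parsing its first entry. The function names `infer_field_type` and `fill_null_first_entry` are hypothetical and are not part of GoldMine; the sketch only mirrors the rule as stated in the note.

```python
def infer_field_type(first_entry):
    """Classify a field as 'Integer', 'Real' or 'Text' from its first entry only.

    Assumption: this mirrors the rule in Note B; GoldMine's real parser
    may differ in its details.
    """
    text = "" if first_entry is None else str(first_entry).strip()
    if text == "":
        return "Text"  # a null first entry makes the whole field Text
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Real"
    except ValueError:
        return "Text"


def fill_null_first_entry(first_entry, intended_type):
    """Replace a null first entry by 0 or 0.0 so the intended type is kept."""
    text = "" if first_entry is None else str(first_entry).strip()
    if text != "":
        return text
    return {"Integer": "0", "Real": "0.0"}.get(intended_type, "")
```

For example, a tag whose first value is empty would be typed as Text, whereas the same tag with the null replaced by `0.0` would be typed as Real.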

• You will be given a choice as to whether you want to standardise the incoming structures according to Cambridge Structural Database conventions. Use the tick boxes to invoke desired standardisations. Then click on Next.
• You will usually at this point wish to specify a protein file that you want associated with the dock set. Normally this will be the protein file you used in the docking run. However you can, if you wish, create a GoldMine database without a protein. Simply toggle the No protein radio button in this case. Click on Next again.
• Choose a name for your GoldMine DB as before. Click on Next again. You will be asked to supply a name for the new dock set.
• Click on Next to view the confirmation pane. Then click on Finish. The GoldMine DB will now import the relevant data. Once data import has been completed you will be given a choice of opening the new GoldMine DB or not.

1.3 Appending to an Existing GoldMine Database
1.3.1 Appending GOLD Results
• Select GoldMine from the top-level menu and choose Create from the pulldown menu.
• Activate the Gold run (*.conf) radio button if not set by default. Specify the appropriate GOLD .conf file in the Filename box. You can use the Browse button to navigate to the appropriate file. Then click on Next.
• Toggle on the Old GoldMine radio button. Navigate to the GoldMine DB you wish to append to using the Browse button and select it. Click on Next.
• You can choose to add a new dock set or append to an existing dock set by clicking the appropriate radio button. If you choose to add to an old dock set select it from those available and click on Next. You also have the option to overwrite the data in an existing dock set.
• If you wish to add a new dock set, type the name of the new set in the appropriate box and then click on Next. The protein file imported will be the one specified in the gold.conf file. Click on Next.
• The confirmation page should come up. Click on Finish to create the GoldMine DB.
• The GoldMine DB should now import the relevant data. Once data import has been completed you will be given a choice of opening the new GoldMine DB or not.

1.3.2 Appending MACCS or MOL2 Results
• Select GoldMine from the Hermes top-level menu and choose Create from the pulldown menu.
• Choose either the MACCS (*.sdf, *.sd) option or the MOL2 (*.mol2) option. Select the appropriate file that you wish to append. Click on Next.
• Toggle on the Old GoldMine radio button. Navigate to the GoldMine DB you wish to append to using the Browse button and select it. Click on Next.
• You can choose to add a new dock set or append to an existing dock set by clicking the appropriate radio button. If you choose to add to an old dock set select it from those available and click on Next. You also have the option to overwrite the data in an existing dock set.
• If you wish to add a new dock set, type the name of the new set in the appropriate box and then click on Next.
• You will be given a choice as to whether you want to standardise the incoming structures according to Cambridge Structural Database conventions. Use the tick boxes to invoke desired standardisations. Then click on Next.
• You will now be asked to associate a protein with the dock set. However it is not compulsory to do this. Click on Next.

• The confirmation page should come up. Click on Finish to create the GoldMine DB.
• The GoldMine DB should now import the relevant data. Once data import has been completed you will be given a choice of opening the new GoldMine DB or not.

1.3.3 Appending Data in CSV Format
• It is possible to add additional data to your dock sets so long as that data is in CSV format. This data might be numerical, e.g. activity data from biochemical assays. Alternatively it can be textual data.
• The first row of the CSV file should take a comma separated set of names of fields. The first column should be named Entry and its content should be {dock set name}¦{ligand name}¦{index}. The second column should be an integer column called Index. The Index is simply a count of solutions. An easy way to generate a template CSV file that already has these fields in place is to open the GoldMine DB and export a CSV file containing only the Docking Solutions field for the entire dock set you wish to append to.
• Select GoldMine from the Hermes top-level menu and choose Create from the pulldown menu.
• Choose the CSV (*.csv) option. Select the file that you wish to append. Click on Next.
• Navigate to the GoldMine DB you wish to append to using the Browse button and select it. Click on Next.
• Highlight the dock set you wish to associate this data with.
Note: GoldMine decides on whether a field is Integer, Real or Text by the properties of the first entry in a column. Null entries are accepted but subsequent column entries will then be assigned as Text. It follows that care needs to be taken in ensuring that the first row of data is of desired format throughout, i.e. nulls corresponding to integer and numeric fields are replaced by 0 and 0.0 respectively.

1.4 Opening and Closing GoldMine Databases
• A GoldMine DB can be opened from the Hermes GUI by selecting Open from the GoldMine pulldown menu on the Hermes top-level menu bar. Only one GoldMine DB can be opened at any given time.
• To close a GoldMine DB, click on Close from the GoldMine pull-down menu.
• Opening a GoldMine DB on GoldMine startup via Linux: If the environment variable GOLDMINE_DIR points to the location of the GoldMine installation then the command:
$GOLDMINE_DIR/bin/goldmine database.db
will open the specified GoldMine DB.
• Opening a GoldMine DB with a Selection specified, via Linux: It is possible to create and save Selections, i.e. subsets of docking poses with associated descriptor data (see Creating a Selection, page 20). It is possible to open a previously saved Selection within Hermes by using the modifier -selection in conjunction with the name of the Selection, i.e.:
$GOLDMINE_DIR/bin/goldmine database.db -selection newsel1
This may be useful if GoldMine is incorporated as part of an intranet-based methodology for disseminating and publicising the results of analysis of virtual screens and docking runs.
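For intranet-style use, the two Linux commands above could be assembled by a small wrapper script. The following is a minimal sketch: the helper name `goldmine_command` is hypothetical, and only the `$GOLDMINE_DIR/bin/goldmine` path, the database argument and the `-selection` modifier are taken from this guide.

```python
import os


def goldmine_command(database, selection=None, goldmine_dir=None):
    """Build the argument list for launching GoldMine on a database.

    Assumption: GOLDMINE_DIR points at the GoldMine installation, as
    described in the guide. No other command-line options are assumed.
    """
    root = goldmine_dir if goldmine_dir is not None else os.environ["GOLDMINE_DIR"]
    command = [os.path.join(root, "bin", "goldmine"), database]
    if selection is not None:
        # Open the database with a previously saved Selection.
        command += ["-selection", selection]
    return command


# Launching requires a real installation, so the call is left commented out:
# import subprocess
# subprocess.run(goldmine_command("database.db", selection="newsel1"), check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when database or Selection names contain spaces.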

2 Analysing and Data Mining GoldMine Databases
2.1 Overview
• GoldMine provides the user with several interfaces to manipulate, filter and perform graphical or statistical analysis of dock sets, prior to visualising the docking poses. The primary interface is called the GoldMine Controller and it will be displayed whenever a GoldMine DB is opened. The GoldMine Controller can be hidden and re-displayed by toggling the Controller option from the Main sub-menu within the GoldMine pull-down in the Hermes top level menu bar. The GoldMine Controller comprises three tabbed panes:
  • Descriptors pane. This page allows you to view which descriptors are associated with each dock set of the GoldMine DB and to select which descriptors you wish to work with.
  • Descriptor ranges pane. This page allows you to set ranges for those descriptors you are interested in and create histograms over those ranges.
  • Selection manager pane. This page allows you to combine different descriptors with different ranges in a Boolean fashion to make Selections of docking poses.
• The basic mode of working with a GoldMine Database is to first select the descriptors you wish to work with in the Descriptors tab, set the ranges that you want for those descriptors in the Descriptor ranges tab and then create subsets of the database or Selections by logically combining the different descriptor ranges using the Selection manager. These Selections may then be viewed in Hermes. Alternatively they may be saved and used in further more complex queries. It is also possible to directly transfer to, and create ranges for descriptors in, the Selection manager.
• GoldMine supports drag-and-drop for most functionality. If in doubt, pick a descriptor via the left mouse button and drag and drop it into a relevant window, or onto the button that carries out the desired action. It is possible to drag-and-drop multiple descriptors which have been pre-highlighted.
• There are three additional important interfaces accessible from GoldMine. These can be hidden and re-displayed via the Main sub-menu within the GoldMine pull-down in the Hermes top level menu bar. The interfaces are as follows:
  • Data Explorer window. This window allows you to calculate descriptive statistics for descriptors, and correlation matrices between descriptors. Histograms and scatterplots can also be generated. Accumulation and Receiver Operating Characteristic (ROC) curves can be viewed for any descriptors, if the database contains identified active and inactive ligands. A range of enrichment metrics (e.g. EF, AUC under ROC, BEDROC) can also be calculated.
  • Calculator window. Here numerical descriptors can be arithmetically transformed. They can for example be normalised. Rank orderings can be generated from them. Descriptors can also be arithmetically combined to give rise to composite descriptors. These can have value in consensus scoring schemes.
  • Regression window. This window is useful A) when trying to establish the best analysis protocol for virtual screening, or B) if wishing to construct a 3D Quantitative Structure Activity Relationship (QSAR). For A) a database containing docking poses for identifiable actives and inactives is required. Multiple regression can then be used to calculate the linear combination of descriptors that provides best discrimination between actives and inactives. Corresponding Accumulation and Receiver Operating Characteristic (ROC) curves can be viewed, and a range of enrichment metrics (e.g. EF, AUC under ROC, BEDROC) calculated for both training and test sets. For B) activity data is required for all ligands in the training set. Multiple regression can then be used to calculate the linear combination of descriptors that provides best QSAR model for the activity data.
• Test and training sets can be created for the purpose of statistical modelling. Randomly selected training and test sets can be set up. This is achieved under Define test sets via the Tools pull-down menu within the GoldMine top level menu bar.
• It is possible to calculate directly in GoldMine simple additional descriptors that rely only on the 2D characteristics of the structures stored in a GoldMine Database. These can be accessed under Simple properties, via the Descriptors pull-down menu within the GoldMine top level menu bar. Additional descriptors that have to be calculated with respect to a protein structure can be created in Hermes (see Incorporating New Descriptors into the Analysis, page 81).

2.2 Viewing and Selecting Descriptors
2.2.1 The Descriptors Pane
• Open an existing GoldMine DB by selecting Open from the GoldMine pull-down in the Hermes top level menu bar, and browsing to the database you wish to open. You will only be able to open a new GoldMine DB if you have not already got a GoldMine DB open.
• When the GoldMine DB is opened the GoldMine Controller comes up and the Descriptors tab is shown on top. The dock sets within the GoldMine DB will be displayed.
• The Descriptors tab supports a tree view. To see the descriptors associated with each dock set click on the + box next to the dock set name.
• The data type for each descriptor will also be displayed. The data type can be Integer, Real or Text.
• One special text descriptor called NAME is created using information from the first line of each ligand entry in the MACCS or MOL2 file input. This will normally be the name associated with the ligand model when it was originally created.
• The dock set and descriptor names on view can be sorted alphabetically by clicking on the column header labelled Dock set. The sort order is reversed if the column header is clicked again.
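The enrichment metrics named in the overview (EF and AUC under ROC) can be sketched as follows. This is a minimal illustration using common textbook definitions; the exact formulae, tie handling and rounding used inside GoldMine are not given in this guide, and the function names are hypothetical. At least one active and one inactive are assumed to be present.

```python
def enrichment_factor(scores, is_active, fraction=0.1, higher_is_better=True):
    """EF = (hit rate in the top fraction) / (hit rate in the whole set).

    Assumption: ties at the cut-off are broken by sort order.
    """
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0],
                    reverse=higher_is_better)
    n_total = len(ranked)
    n_actives = sum(1 for _, active in ranked if active)
    n_top = max(1, int(round(fraction * n_total)))
    hits = sum(1 for _, active in ranked[:n_top] if active)
    return (hits / n_top) / (n_actives / n_total)


def roc_auc(scores, is_active, higher_is_better=True):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen active is ranked ahead
    of a randomly chosen inactive, with ties counted as one half.
    """
    actives = [s for s, a in zip(scores, is_active) if a]
    inactives = [s for s, a in zip(scores, is_active) if not a]
    wins = 0.0
    for s_act in actives:
        for s_inact in inactives:
            if s_act == s_inact:
                wins += 0.5
            elif (s_act > s_inact) == higher_is_better:
                wins += 1.0
    return wins / (len(actives) * len(inactives))
```

A scoring function that ranks every active ahead of every inactive gives an AUC of 1, while a random ranking gives a value near 0.5.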

2.2.2 Selecting Descriptors
• To select an individual dock set or descriptor, click on its name. It is possible to select more than one descriptor by holding down the Control key whilst selecting. It is possible to select a block of descriptors by holding down the Shift key when selecting the second extremity of the block. Selecting a descriptor activates all the buttons at the top of the pane.
• The Choose Descriptors area can be used to select multiple descriptors. This may be especially useful if very many descriptors have been generated (e.g. via the per-atom calculation option) and it is necessary to thin these down to only those which contain useful information. Seven options are available from the left-hand pull-down menu. The choice is activated by hitting the Choose button. The options available from the pull-down menu to the top left of the Choose Descriptors area are:
  • all: selects all descriptors in every dock set.
  • none: deselects all descriptors.
  • by name: Selects descriptors according to the text fragment that is typed into the adjacent text box.
  • by count <=: Selects descriptors with number of entries less than the figure typed in the adjacent text box. This can be used to identify descriptors with low information content.
  • by non-zero count <=: Selects descriptors with number of non-zero entries less than the figure typed in the adjacent text box. This also can be used to identify descriptors containing low information content.
  • by variance <=: Selects descriptors with variance less than the figure typed in the adjacent text box. This too can be used to identify descriptors containing low information content. This may take some time to calculate if many descriptors are available.
  • by max absolute value <=: Selects descriptors with maximum absolute value less than the figure typed in the adjacent text box. This again can be used to identify descriptors containing low information content.
• Use the Send to ranges button at the top to view and modify the range of values for the chosen descriptors in the Descriptor ranges tab.
• Use the Send to selections button at the top to manipulate the chosen descriptors in the Selection manager tab.
• Use the Delete button to permanently delete the chosen descriptors from the dock set.
• Descriptors can also be transferred to other panes via the option from the Send Choices to pulldown. The possible destinations to which descriptors can be sent are the AND, OR or NOT categories in the Selection manager, the Ranges, Explorer, and Regression windows, the Spreadsheet (in the protein visualiser), the Descriptor Arithmetic window (Calculator) and the Euclidean Distance Calculator.

2.2.3 Viewing All Solutions Within a Dock Set
• Select any descriptor or set of descriptors within a dock set and then click on View. Alternatively select Spreadsheet from the Send Choices to pulldown and click on Send Choices to. The entire dock set of solutions will be made available in Hermes for viewing. All solutions will be displayed in a column within the GoldMine spreadsheet. A column displaying the values of the descriptor or descriptors selected will also be displayed. Initially no protein or ligand will be displayed. When a solution is selected by clicking on it then the protein is also displayed. If an ensemble docking run is being analysed this will be the protein model that was chosen as best for this GA attempt.

2.2.4 Deleting and Renaming Dock Sets and Descriptors
• Right-click with the mouse on a dock set or descriptor, or a previously highlighted set of dock sets or descriptors.
• Selecting the Delete option brings up a confirmation window. Click on Yes to complete the permanent deletion of the item from the database. Multiple confirmation windows will be brought up if multiple items are selected for deletion.
• Selecting the Rename option brings up a window in which it is possible to change the name of selected dock sets or descriptors.
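The four numerical criteria in the Choose Descriptors list of section 2.2.2 can be sketched as simple summary statistics. This is an illustration under stated assumptions: the comparison is taken as "<=" to follow the option labels, missing values are represented by `None`, and the population variance is used; which variance GoldMine itself uses is not stated in this guide. All names are hypothetical.

```python
from statistics import pvariance


def descriptor_summary(values):
    """Summarise one descriptor column (None marks a missing entry)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "non_zero_count": sum(1 for v in present if v != 0),
        # Assumption: population variance; GoldMine's choice is not documented here.
        "variance": pvariance(present) if present else 0.0,
        "max_abs": max((abs(v) for v in present), default=0.0),
    }


def choose_descriptors(table, criterion, threshold):
    """Return names of descriptors whose chosen statistic is <= threshold."""
    return [name for name, values in table.items()
            if descriptor_summary(values)[criterion] <= threshold]
```

For instance, `choose_descriptors(table, "variance", 0.0)` would pick out constant columns, which are natural candidates for deletion after a per-atom calculation.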

2.2.5 Aggregating Descriptors so as to allow Comparison across Dock Sets
• Usually, if a GoldMine database contains more than one dock set, then the same ligands will be present in both dock sets. However one must be careful not to assume a one-to-one correspondence between the poses in one dock set and the corresponding poses in another dock set. They will usually lack a close relationship, especially if each dock set contains multiple poses per ligand. Because of this there are restrictions in GoldMine as to how descriptors can be compared with each other if they come from different dock sets. They cannot for instance be histogrammed on the same scale or used on the same scatterplot. Neither can they be directly combined into a composite descriptor within the Descriptor Calculator. This is only possible for descriptors from within one dock set. Descriptors from different dock sets can however be used together in the Selection manager.
• If it is desired to compare data from descriptors from different dock sets then it is first necessary to aggregate the data from each of the descriptors. Several different ways of aggregating the data are available (Max, Min, Mean, Sum and Count). New descriptors are created to hold this data. If it is desired to create histograms and scatterplots or use the descriptors in a regression analysis it is necessary that the aggregated descriptors be placed within the same dock set. Aggregation functions are available from the menu bar in the GoldMine Controller or via the Descriptor Calculator (see The Descriptor Calculator, page 39).
• From the Descriptors pull-down menu on the GoldMine Controller menu bar, select Aggregate...
• The Import Aggregate Descriptors window will come up. In the Dock set area highlight the descriptor you wish to aggregate.

• Select an Aggregator. This identifies or calculates from the descriptor datapoints a characteristic parameter for each ligand. Five parameter types are available from the associated pull-down menu. Max returns the maximum value for the descriptor for each ligand, Min returns the minimum value for each ligand, Mean returns the mean value for each ligand, Sum returns the sum of the values for a ligand, and Count gives the number of entries for that descriptor for a ligand. Each aggregate value characteristic of a ligand is written to all poses of that ligand in the aggregated descriptor.
• Select the dock set in which the new descriptor is to be created. This should be the dock set that contains other descriptors, aggregated or otherwise, that you wish to compare with this one.
• Type in an appropriate prefix that will be added to the existing descriptor name to give the new descriptor name, and click on Calculate.

2.2.6 Histograms and Scatter Plots
• To create a Histogram for a descriptor, right-click on it and select the Histogram option. This will open the Histogram window and create a histogram. For more on the histogram functionality see Histograms, page 30.
• In order to create a Scatter Plot select precisely two or three descriptors, right-click on one of them, and select the Scatter Plot option. This will open the Scatter plots window and create a scatter plot for the selected descriptors. The order of selection of the first and second descriptors will decide the axes. The third descriptor, if selected, is used to colour the scatter plot data points according to spectrum colouring. For more on the Scatter plots window see Scatter Plots, page 33.
Note: Scatterplots cannot be created using descriptors from different dock sets (see Aggregating Descriptors so as to allow Comparison across Dock Sets, page 15).

2.3 Creating Database Subsets with Individual Descriptors
2.3.1 The Descriptor Ranges Pane
• Open a GoldMine DB, select a number of descriptors within the Descriptors tab, and click on the Send to ranges button. Note: It may take some time for the transfer to take place, especially if there are many poses in the database. The data for a descriptor is cached when it is first transferred to either the Descriptor ranges or the Selection manager tabbed panes and subsequent manipulations should be considerably faster.
• Each descriptor is displayed as a row, segregated within a box that is associated with its corresponding dock set.
• Numeric descriptors display editable boxes for selecting upper and lower bounds of the descriptor. Initially the values displayed in these boxes are the upper and lower bounds within the whole dock set for that descriptor.
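The aggregation described in section 2.2.5 (one value per ligand, written back to every pose of that ligand) can be sketched as below. The five aggregator names come from this guide; the data layout as `(ligand name, value)` pairs and the function name `aggregate_descriptor` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# The five aggregators named in the guide, mapped to Python callables.
AGGREGATORS = {"Max": max, "Min": min, "Mean": mean, "Sum": sum, "Count": len}


def aggregate_descriptor(poses, aggregator):
    """Aggregate a descriptor per ligand and write the result to every pose.

    poses: list of (ligand_name, value) pairs, one pair per docking pose.
    Returns a list of aggregated values in the same pose order.
    """
    by_ligand = defaultdict(list)
    for ligand, value in poses:
        by_ligand[ligand].append(value)
    func = AGGREGATORS[aggregator]
    per_ligand = {ligand: func(values) for ligand, values in by_ligand.items()}
    return [per_ligand[ligand] for ligand, _ in poses]
```

Because every pose of a ligand carries the same aggregated value, two dock sets with different numbers of poses per ligand can then be compared ligand by ligand.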

• Text descriptors display a single editable dialogue box.
• Descriptors can be sent to background by toggling off the tick in the top left hand corner.
• Descriptors can be removed from the tab by clicking on the associated Options... button and selecting Remove (right-clicking in a descriptor space reveals a pull-down that also allows this option).
• Click on Clear to delete all descriptors from the tab.

2.3.2 Selecting Ranges for Numeric Descriptors
• Upper and lower bounds of ranges can be altered by typing in appropriate values into the left and right hand range boxes.
• Once a range has been modified the number of solutions and the number of ligands that are encompassed by the new range will be displayed (it is necessary to press Return to display these figures). Each docking pose corresponds to a solution. If only one docking pose has been saved per ligand then the number of solutions will equal the number of ligands. Each range can be associated with only one protein.
• It is possible to select a fixed percentage of solutions that lie at the top or bottom of the range. Choose using the Options pull-down for a descriptor. Alternatively right-click with the mouse within the row of the descriptor of interest. Select Set by % from the pull-down menu. The resulting dialogue box allows you to do the following:
  • Select to cut at the top or the bottom end of the descriptor range.
  • Alter the percentage of the cut. The default is 10%.
  • Select whether the cut be applied relative to the number of ligands or to the number of solutions.
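The Set by % options above can be sketched as follows. This is only one possible reading: "top" is taken to mean the highest descriptor values, the cut size is rounded up, and a cut relative to ligands is interpreted as keeping poses in rank order until the required number of distinct ligands has been reached. None of these details is confirmed by this guide, and the function name is hypothetical.

```python
import math


def select_by_percent(poses, percent=10.0, end="top", relative_to="solutions"):
    """Return the poses kept by a percentage cut on one descriptor.

    poses: list of (ligand_name, value) pairs, one pair per solution.
    end: "top" keeps the highest values, "bottom" the lowest (assumption).
    relative_to: "solutions" or "ligands".
    """
    ranked = sorted(poses, key=lambda p: p[1], reverse=(end == "top"))
    if relative_to == "solutions":
        n_keep = math.ceil(len(ranked) * percent / 100.0)
        return ranked[:n_keep]
    # Relative to ligands: stop before the first ligand beyond the target count.
    n_ligands = len({ligand for ligand, _ in ranked})
    target = math.ceil(n_ligands * percent / 100.0)
    kept, seen = [], set()
    for ligand, value in ranked:
        if ligand not in seen and len(seen) == target:
            break
        seen.add(ligand)
        kept.append((ligand, value))
    return kept
```

With several poses per ligand the two modes differ: a 20% cut over ten solutions keeps two solutions, whereas a 20% cut over nine ligands keeps every leading pose up to the second distinct ligand.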

• All solutions that are within the specified range can be viewed by either clicking on the Options... button for that descriptor and selecting View, or right-clicking and selecting View. This instructs the visualiser to display those docked solutions that satisfy the range set for the descriptor. The protein linked to the appropriate dock set is also displayed.
• It is possible to reset a range to its original bounds by clicking on the Options... button for that descriptor (or right-clicking) and selecting Reset.

2.3.3 Searching Using a Text Descriptor
• You can specify an exact text string within the dialogue box of the text descriptor. On hitting Return all incidences of that string will be found.
• The wild card character % can be used to search for strings which have a common fragment. Thus for instance LIG_THROMBIN% and %THROM% are both valid search strings. The first will retrieve ligands for which the text descriptor entry starts LIG_THROMBIN; the second will retrieve those ligands which have the string THROM within the text descriptor entry.

2.3.4 Other Options
• To create a Histogram for a descriptor, right-click on it and select Show on Histogram. The Histograms window will be brought up and the relevant histogram displayed. If a histogram is already displayed for this dock set and a range has been set for the descriptor, then those values that accord to the range will be highlighted on the histogram. For more on the histogram functionality (see Histograms, page 30).
• To create a new descriptor which takes the value 1 if within the range set for that descriptor, and 0 otherwise, right-click on it and select Save binary descriptor. You will be asked to input a name for this descriptor.
• To dock a selected range of ligands via the GOLD Server, right-click and select Dock in GOLD.

2.3.5 Transferring Descriptor Ranges to the Selection Manager
• Once ranges have been set the next step is to transfer the relevant descriptors to the Selection manager tab.
• Clicking the Send to selections button at the top of the tab transfers all active descriptors.
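The % wild card matching described for text descriptors can be illustrated with a small sketch. This is not GoldMine code: the function names and the example entries are invented for illustration, and case-sensitive matching against the whole entry is an assumption.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a search string that uses % as a wild card into an anchored regex."""
    # Escape each literal fragment, then join the fragments with ".*" where % appeared.
    parts = [re.escape(fragment) for fragment in pattern.split("%")]
    return re.compile("^" + ".*".join(parts) + "$")

def search(entries, pattern):
    """Return the entries whose full text matches the wild card pattern."""
    regex = wildcard_to_regex(pattern)
    return [entry for entry in entries if regex.match(entry)]

# Hypothetical text descriptor entries, invented for this illustration.
names = ["LIG_THROMBIN_001", "LIG_THROMBIN_002", "DECOY_THROM_17", "LIG_FXA_003"]

starts_with = search(names, "LIG_THROMBIN%")  # entries starting LIG_THROMBIN
contains = search(names, "%THROM%")           # entries containing THROM anywhere
exact = search(names, "LIG_FXA_003")          # no wild card: exact string only
```

A pattern with no % only matches the exact string, which mirrors the exact text search described above.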

2.4 Combining Descriptor Ranges to Create Selections

2.4.1 The Selection Manager Pane
• Open a GoldMine DB, select a number of descriptors and take into the Descriptor ranges tab. Set some appropriate ranges for the descriptors and then click on the Send to selections button. Note: It is not strictly necessary to first set ranges in the Descriptor ranges tab, as this can also be done in the Selection manager. However it may sometimes be advantageous to do so.
• The descriptors selected will be displayed underneath the appropriate dock set name in the Ranges tree in the left hand side of the screen.
• Below the Ranges tree is displayed the Saved selections tree. Any selection that is created by the user and saved in the GoldMine DB will be displayed here. By default a selection which is prefixed All_ will already be present for each dock set. This selection represents the entire set of docked poses associated with the dock set.
• The lower right hand part of the tab is subdivided into three Query panes labelled Must be in (Boolean AND), Must not be in (Boolean NOT) and Must be in at least one of (Boolean OR). We will refer to these as the AND, NOT and OR panes from now on. These panes are used to create complex Selections involving several different descriptors.
• Click off the tick in the top corner of the NOT and OR panes to minimise them.

2.4.2 Creating a Selection
• Individual descriptors can be dragged into any one of the Query panes. Once within a pane the descriptor is displayed in the same fashion as in the Descriptor ranges tab. All the options that were available in the Descriptor ranges tab for manipulating individual descriptors are also available within the Selection manager (see Creating Database Subsets with Individual Descriptors, page 16).
• Drag two descriptors into the AND pane. Set the ranges of these two descriptors to suitable values if this hasn’t been done already.
• Click on the Count button. The number of solutions that satisfy both ranges will be displayed at the top of the window. The number of ligands this corresponds to and the number of protein models will also be shown.
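The logic behind the Query panes and the Count button can be sketched as follows. This is only an illustration of the Boolean combination, not GoldMine's implementation: the pose records, the descriptor names, the inclusive range bounds and the treatment of an empty OR pane as "no restriction" are all assumptions.

```python
def satisfies(pose, descriptor_range):
    """True if the pose's descriptor value lies within the (inclusive) range."""
    name, low, high = descriptor_range
    return low <= pose[name] <= high

def select(poses, and_ranges=(), not_ranges=(), or_ranges=()):
    """Combine ranges as Must be in (AND), Must not be in (NOT), at least one of (OR)."""
    selected = []
    for pose in poses:
        ok_and = all(satisfies(pose, r) for r in and_ranges)
        ok_not = not any(satisfies(pose, r) for r in not_ranges)
        # An empty OR pane is assumed to place no restriction on the Selection.
        ok_or = any(satisfies(pose, r) for r in or_ranges) if or_ranges else True
        if ok_and and ok_not and ok_or:
            selected.append(pose)
    return selected

# Hypothetical docked poses; "fitness" and "mw" are invented descriptor names.
poses = [
    {"ligand": "L1", "protein": "P1", "fitness": 62.0, "mw": 310.0},
    {"ligand": "L1", "protein": "P1", "fitness": 48.0, "mw": 310.0},
    {"ligand": "L2", "protein": "P1", "fitness": 71.0, "mw": 540.0},
    {"ligand": "L3", "protein": "P2", "fitness": 66.0, "mw": 420.0},
]

# Two descriptors in the AND pane, as in the example above.
hits = select(poses, and_ranges=[("fitness", 60.0, 100.0), ("mw", 200.0, 500.0)])

# The three figures reported by Count: solutions, ligands and protein models.
n_solutions = len(hits)
n_ligands = len({pose["ligand"] for pose in hits})
n_proteins = len({pose["protein"] for pose in hits})
```

Moving the molecular weight range into the NOT pane instead would exclude the poses inside that range rather than require them.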

• All three Query panes may be used to create complex Selections.
• To save a Selection click on Save selection. You will be prompted to enter a name for the new Selection. This Selection can either be saved without viewing or, alternatively, it can be viewed in the visualiser.
• To view in Hermes the solutions that make up a Selection, click on View. The corresponding protein will also be displayed.
• If you have displayed histograms that were created in the Descriptors or Descriptor ranges window or the Data Explorer, then it is possible to highlight within them a second histogram for those solutions that make up the Selection. To do this click on the Show on histograms button.
• It is possible to use saved Selections to create more complex Selections. For instance a Selection might be created to encompass ligands which we want to exclude from other queries. This Selection could then be dragged into the NOT window for subsequent queries.

2.4.3 Creating Selections Easily via the Histogram and Scatter Plot Options
It is possible to use the Histogram and Scatter plot functionality to conveniently and quickly create Selections. This has the additional advantage that you can concentrate analysis on those parts of the landscape that look most interesting from the plots.
• Take any histogram that is displayed and left-click on one of the bars, drag over other bars and release. This will highlight the selected bars in red and the corresponding descriptor, bounded by the range covered by the selected bars, will appear in the OR pane of the Selection manager tab. The Selection manager tab will be opened up if it was not previously open.
• Clicking on two or more non-consecutive bars will create two or more ranges in the OR pane.
• Clicking on highlighted bars turns off highlighting and modifies the ranges in the OR pane accordingly. For more information on the Histogram functionality (see Histograms, page 30).

• Within a Scatter plot it is possible to define a rectangular region of the plot by use of the Select Region mouse mode. The region is then defined using mouse Left-click, Drag and Release within the Scatter plot. This opens the Selection manager in the GoldMine Controller and the descriptors that constitute the X and Y axes are brought into the Must be in (Boolean AND) box. The ranges set for these descriptors are those defined by the scatter plot region. For more information on the Scatter plot functionality (see Scatter Plots, page 33).
• Pressing the Escape key will cancel the highlighting on the histogram or scatter plot but will not affect the ranges already set in the OR pane.

2.4.4 Exporting Selections
• Click on the Options... button at the top left of the tab. Select Export... to export the solutions satisfied by the chosen Range or Selection. Several file formats are supported. Solutions and descriptor data can be saved as MACCS (SDF) or MOL2 structure files. Alternatively descriptor data can be exported in .csv format. In addition a list of chemical names or identifiers can be generated. This is useful if the intention is to purchase compounds or select from a database for screening. Lastly data can be exported as an HTML Table.
• On choosing Export..., the Export Docking Solutions window appears. To export:
  • Select the relevant file output from the pull-down menu at the bottom left corner of the pane.
  • Select those descriptors in the left-hand pane that you wish to include as fields in the MACCS file and click on Add.
  • Enter an appropriate file name. The Browse button will allow you to navigate to the appropriate directory.
  • Click on OK.

2.4.5 Other Options for Working with Selections
• The Options... button on the left hand side of the screen allows the following operations to be carried out on any Range or Selection which is picked in the Name column. Some of the commands below are also directly available as buttons on the top right of the Selection manager. These commands, when directly available, will be applied to the Selections currently defined through the AND, NOT and OR panes.
  • View Details: displays details of the Range or Selection.
  • View: displays the ligand poses selected and the associated proteins in the visualiser.
  • Dock in GOLD: docks the Selection of ligands via the GOLD Server.
  • Show on Histogram: if you have displayed histograms that were created in the Descriptors or Descriptor ranges window then it is possible, by choosing this option, to highlight within them a second histogram for those solutions that make up the selected Range or Selection. If no histogram is already displayed and a Range has been selected then one will be created for that Range.
  • Save binary descriptor: this saves a descriptor for which the solutions satisfied by the chosen Range or Selection take the value 1; all others take the value 0. This descriptor can be useful in a number of ways. For instance it can be brought into the GoldMine spreadsheet as an extra column and additional solutions manually added or removed from the list. Such a descriptor can also be used in further Boolean queries.
  • Edit: transfers the Range or Selection to the Query panes to allow editing. The Selections prefixed All_ cannot be edited.
  • Delete: deletes the selected Selection.

3 More Tools for Working with the Data

3.1 The Data Explorer
• The Data Explorer window can be opened via the Main pull-down on the top level Menu in GoldMine.
• Data can be sent to the Data Explorer from the Descriptors pane. Highlight or select the descriptors required, set the destination to Explorer in the pull-down menu next to the Send Choices to button, and then click on Send Choices to.

• The Data Explorer contains a number of sub-windows. Each sub-window can be hidden by toggling off the tick-box in its top left-hand corner. The windows can also be hidden and then redisplayed via the Windows pull-down on the top level menu bar. The windows are as follows:
  • Values. In this window it is possible to select a data set to work with. The pull-down menu in the Choose data set area provides a list of Selections to choose from. These may be complete dock set datasets (designated by the prefix All_) or they may be smaller Selections constructed in the Selection manager. Below the Choose data set area are listed the descriptors that have been brought into the Data Explorer. These can be selected with the left-hand mouse button, histograms and scatterplots can be constructed from them (highlight two for the latter option), or they can be removed or deleted via the right-hand mouse button.
  • Choose descriptors. This area can be used to select multiple descriptors. This may be especially useful if very many descriptors have been generated (e.g. via the Per-atom calculation option) and it is necessary to thin these down to only those which contain useful information. This option is documented elsewhere (see Selecting Descriptors, page 13).
  • Descriptor Operations. Allows use of a number of statistical tools to operate on selected descriptors. Datapoints used will be restricted according to the Selection picked in the Values window.
  • Operations with a selection. This allows specific operations to be carried out on previously created Selections. These options are particularly useful if the Selection represents a set of active molecules and it is desired to see how this set is distributed within the larger set of decoys.
  • Scatter Plots. This allows the quick and easy generation of scatter plots. The scatterplot will be generated for only the datapoints contained in the Selection picked at the top of the Values window. Drag and drop descriptors from the Values window onto the X-axis, Y-axis and, optionally, the Colour buttons. Then click on Scatter Plot. Note: Scatterplots cannot be created using descriptors from different dock sets (see Aggregating Descriptors so as to allow Comparison across Dock Sets, page 15).

3.1.1 Descriptor Operations: Probing the data
• In order to tabulate descriptive statistics for one or more descriptors, highlight these descriptors in the Values window and then click on Descriptive Stats (alternatively highlight and drag-and-drop the descriptors). If a Selection is made at the top of the Values window which is not an entire dock set (i.e. doesn’t start with All_), statistics will only be generated for those poses contained in the Selection. An error window will be generated if there is insufficient data to run the statistical calculations.

• The following parameters will be calculated: Count, Number of missing values, Minimum, Maximum, Sum, Mean, Variance, Standard Deviation, Mean Deviation (i.e. average absolute deviation from the mean), Skewness and Kurtosis.
• In order to calculate a correlation matrix, highlight a number of descriptors in the Values window and click on Correlations. This is useful for identifying descriptors which contain redundant information. The calculation is performed on only those datapoints from the Selection picked at the top of the Values window.
• To create superimposed histograms with data from multiple descriptors, highlight the relevant descriptors in the Values pane, and then click on Histogram. The calculation is performed on only those datapoints from the Selection picked at the top of the Values window. Note: Superimposed histograms cannot be created using descriptors from different dock sets (see Aggregating Descriptors so as to allow Comparison across Dock Sets, page 15).
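For readers who want to check the Descriptive Stats figures by hand, the parameters listed above can be reproduced with a short sketch. The guide does not state which estimators GoldMine uses, so the sample variance (n - 1 denominator) and the moment-based skewness and excess kurtosis used here are assumptions, as is the use of None to mark a missing value.

```python
import math

def descriptive_stats(values):
    """Summarise one descriptor; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n < 2:
        # Mirrors the error raised when there is insufficient data.
        raise ValueError("insufficient data to run the statistical calculations")
    total = sum(present)
    mean = total / n
    deviations = [v - mean for v in present]
    variance = sum(d * d for d in deviations) / (n - 1)  # sample variance (assumed)
    # Central moments used for skewness and kurtosis (population form, assumed).
    m2 = sum(d ** 2 for d in deviations) / n
    m3 = sum(d ** 3 for d in deviations) / n
    m4 = sum(d ** 4 for d in deviations) / n
    return {
        "count": n,
        "missing": len(values) - n,
        "minimum": min(present),
        "maximum": max(present),
        "sum": total,
        "mean": mean,
        "variance": variance,
        "std_dev": math.sqrt(variance),
        "mean_deviation": sum(abs(d) for d in deviations) / n,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0,
    }

# Hypothetical descriptor values with one missing entry.
stats = descriptive_stats([2.0, 4.0, 4.0, None, 6.0])
```

If GoldMine's reported values differ slightly from such a hand calculation, a different choice of denominator is the most likely cause.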

• To create box plots from one or more descriptors, highlight the relevant descriptors in the Values pane, and then click on Box Plot. The calculation is performed on only those datapoints from the Selection picked at the top of the Values window.

3.1.2 Operations with a Selection: Looking for Enrichment
It may often be useful to identify where a subset of docked poses is ranked in the whole dataset. This subset may for instance represent a set of active molecules, the remainder of the set comprising decoy molecules. Such a situation is often used to evaluate the efficacy of docking protocols for Virtual Screening, and many such active and decoy sets are now available. In GoldMine we can employ the docking results of such test sets to quickly and easily identify the most useful scoring functions and devise the most effective rescoring function for a specific protein target. To do this we need to identify those descriptors which contribute most to enrichment, i.e. those descriptors which generally rank actives high in the dataset. Here we will look at how we can create Enrichment curves to measure the success of ranking active molecules highly according to a descriptor.
• The functions within the Operations with a selection area require a Selection to be chosen. This choice is made via the pull-down menu next to the Options button. Note: If the data set chosen at the top left of the Values window is itself a Selection from a dock set, then most of the functionality in this window will operate on the union of the two Selections. It will usually be desired here to work with a data set for a full dock set (i.e. have a dataset starting with All_ chosen at the top of the Values window).
• A variety of options are available via the Options button. These are very similar to those available from the Selection Manager.
  • View Details: displays details of the Range or Selection.

  • View: displays the ligand poses selected and the associated proteins in the visualiser.
  • Dock in GOLD: docks the Selection of ligands via the GOLD Server.
  • Show on Histogram: if you have already displayed histograms then it is possible, by choosing this option, to highlight within them a second histogram for those solutions that make up the chosen Selection.
  • Save as binary descriptor: this saves a descriptor for which the solutions contained by the chosen Selection take the value 1; all others take the value 0. This descriptor can be useful in a number of ways. For instance it can be brought into the GoldMine spreadsheet as an extra column and additional solutions manually added or removed from the list. Such a descriptor can also be used in further Boolean queries.
  • Export: allows the export of poses making up the Selection in a number of file formats (see Exporting Selections, page 23).
  • Edit: transfers the Range or Selection to the Query panes to allow editing. The Selections prefixed All_ cannot be edited.
  • Delete: deletes the selected Selection.
• Enrichment curves and associated enrichment metrics can be generated using the ROC function. Choose, via Options, a Selection for which enrichment is to be graphed, select one or more descriptors in the Values window (usually one would include any GoldScore, ChemScore or ASP fitness functions available), and click on ROC.
• A Receiver Operating Characteristic (ROC) curve is generated for each of the descriptors in a new ROC plots window. The X-axis is the False Positive accumulation rate, the Y-axis the True Positive accumulation rate, as one descends the ranked list for each descriptor.
• The plot can be changed to an Accumulation Curve by selecting Accumulation instead of ROC. The X-axis in this case is the rank of the pose and the Y-axis is the number of actives retrieved. Although very similar to the ROC curve, it has the disadvantage that the maximum area under the curve is dependent on the ratio of actives to decoys. For the ROC curve the maximum area is always 100%.
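The construction of a ROC curve from a ranked descriptor can be sketched as below. This is a minimal illustration rather than GoldMine's own calculation: it assumes that higher descriptor values rank first, that ties need no special handling, and that the area is obtained with the trapezium rule. The scores and activity flags are invented.

```python
def roc_points(scores, is_active):
    """Walk down the ranked list, recording (false positive rate, true positive rate)."""
    # Higher scores are assumed to rank first.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    n_active = sum(1 for _, active in ranked if active)
    n_decoy = len(ranked) - n_active
    if n_active == 0 or n_decoy == 0:
        raise ValueError("need at least one active and one decoy")
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for _, active in ranked:
        if active:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_decoy, true_pos / n_active))
    return points

def area_under_curve(points):
    """Trapezium-rule area under a curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical fitness scores for six poses, two of which are actives.
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
actives = [True, False, True, False, False, False]

curve = roc_points(scores, actives)
auc = area_under_curve(curve)  # 1.0 would mean every active outranks every decoy
```

Because both axes are rates, the largest possible area is 1.0 (100%) whatever the ratio of actives to decoys, which is the advantage over the Accumulation Curve noted above.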

• Also calculated are a number of the enrichment metrics currently reported and in use in the literature. Each enrichment metric is calculated assuming the % cut off displayed in the Cut off (%) text box at top left. This cut off can be changed by editing this text box. Enrichment metrics available are as follows (for more information see J-F. Truchon, C. I. Bayly, J. Chem. Inf. Model., 47, 488-508, 2007):
  • Enrichment Factor (EF)
  • Area under Accumulation Curve (AUAC)
  • Area under ROC curve (AU ROC)
  • BEDROC
  • MCC
• From the enrichment curves and associated metrics it should be possible to identify those descriptors which best discriminate actives from inactives.
• It is possible to establish whether the mean values of selected descriptors, over the chosen Selection, are statistically different from the mean values over the whole population (i.e. dock set). Choose, via Options, a Selection to represent the sample population, select one or more descriptors in the Values window, and then click on the Significance button. Sample and population means will be tabulated and chi squared significance statistics generated. The cases where the means are significantly different, according to the confidence level specified in the text box at the top of the window, will have the significance figures highlighted in blue.
• It is possible to highlight, in the Data Explorer, those descriptors for which the means are significantly different. To do this click on Choose significant descriptors. This might be useful functionality if we have a great many descriptors (e.g. per atom descriptors) and wish to use only those which contain interesting information.

3.2 Histograms
• Histograms can be generated in a number of ways:
  • Right-clicking on any descriptor in the Descriptors pane or the Data Explorer brings up an option to create a histogram from that descriptor.
  • Right-clicking in a range area, either in the Descriptor ranges window or the Selection manager, and selecting Show on Histogram will, if no histograms are already displayed, generate a histogram for the associated descriptor. The datapoints in the range that is set will be highlighted.
  • Drag-and-dropping a descriptor onto an existing histogram will place a new histogram on the same axes as the first histogram. Note: Superimposed histograms cannot be created using descriptors from different dock sets (see Aggregating Descriptors so as to allow Comparison across Dock Sets, page 15).
• Selections and ranges can be highlighted on histograms:
  • Creating a selection in Selection manager and then clicking on Show on Histograms will highlight the selection on any displayed histograms relevant to that dock set.
  • Any Options button that is associated with a pull-down menu to choose a selection will have a Show on Histogram option available that will highlight the chosen selection on any displayed histograms relevant to that dock set.
  • Any selection specified in the left hand side of the Selection Manager can be drag-and-dropped into a histogram and the histogram will be highlighted with the Selection.
  • Any Range can be drag-and-dropped from either the Descriptor ranges window or the Selection manager into a histogram to highlight that range on the histogram.
• Histograms can be created individually or a number of histograms can be displayed on one pair of axes. Individual histograms can be altered in size and shape by dragging their borders.
• A number of options are available to tailor the display of a histogram. To access some of these right-click in the histogram of interest and select Configure. The options available are as follows:
  • Number of bins: allows the number of bins in the histogram of interest to be either increased or decreased from the default value.
  • Title: allows the histogram title to be altered.
  • Axis Title: allows the X-axis title to be altered.
  • Background: the background colour for the histogram can be selected from a palette.
  • Highlight colour: it is possible to overlay a second histogram formed from a selection of the dock set. The bar colour of this histogram can be changed by use of this option.
  • Pick colour: it is possible to select, via the mouse, histogram bars or ranges of bars. The colour used on the histogram to denote these picks can be changed by use of this option.
  • Clear highlights: activating this tick box leads to the removal of any overlays present on a histogram.
  • Clear picks: activating this tick box leads to the removal of any picked bars or ranges on a histogram.
• Other data set specific options are available by clicking on a descriptor name that lies to the right of the histogram:
  • Colour: the colour of the bars relevant only to this descriptor can be selected from a palette.
  • Set highlight colour
  • Set pick Colour
  • Clear highlight
  • Clear pick
  • Remove: removes the histogram relevant to the Descriptor.

3.3 Scatter Plots

3.3.1 Overview
• Scatter plots can be generated in three ways:
  • In any window in which a descriptor tree is displayed, highlight precisely two or three descriptors, right-click on one of them, and select the Scatter Plot option. This will open the Scatter plots window and create a scatter plot for the selected descriptors. The order of selection of the first and second descriptors will decide the axes. The third descriptor, if selected, is used to colour the scatter plot data points according to spectrum colouring.
  • In the Data Explorer window drag-and-drop descriptors from the Values window onto the X-axis, Y-axis and, optionally, the Colour buttons. Then click on Scatter Plot.
  • A scatter plot is created within a Scatter plot window. Click on the down arrow at the right of the X Axis box and select a numeric descriptor from the pull down displayed.
• It is possible to drag-and-drop descriptors from the Data Explorer window or Descriptors, or Ranges pane onto the axes/colour defining buttons at the top of the Scatter plot window, in order to create a modified Scatter plot.
Note: Scatterplots cannot be created using descriptors from different dock sets (see Aggregating Descriptors so as to allow Comparison across Dock Sets, page 15).

3.3.2 Manipulating a Scatter Plot
• Once a scatter plot has been generated it is possible to manipulate it in a number of ways.
• The scatter plot can be highlighted by choosing a Selection in the Highlight selection text box. The name of the selection will appear on the colour designator at the top right. The Selection data points will by default be coloured red whereas the remaining data points will be coloured blue.
• Click on the pull-down menu to the right of the Mouse options box to change the mouse operation which will be associated with the cursor within the scatter plot. The following options are available.
  • Select point: individual data points can be selected. If that data point represents a docked pose that is already present in the GoldMine spreadsheet of Hermes then that binding pose will be shown in Hermes.
  • Select region: left-click, drag-and-drop in the scatter plot can be used to select a rectangular region. The Selection manager in the GoldMine Controller is opened and the X and Y descriptors are straight away brought into the Must be in (Boolean AND) box (see Combining Descriptor Ranges to Create Selections, page 19). The ranges set for these descriptors are those selected in the scatter plot region. This is a quick and easy way of making a Selection based on two descriptor ranges.
  • Zoom in: left-click and drag within the scatter plot can be used to select a rectangular region. On release of the left mouse button this region is zoomed in on. The axes are adjusted accordingly. It is possible to zoom in repeatedly.
  • Zoom out: clicking within the scatter plot reverses the last zoom in operation.
  • Pan: dragging the mouse within the scatter plot translates it and brings into view different portions. Useful in conjunction with Zoom in.

3.3.3 Customising a Scatter plot Display
• Click on Configure at the top right. The following options are available for modifying the scattergram display.
  • Background: the background colour can be selected from a palette.
  • Symbol: shape of the symbol can be selected from a pull-down menu.
  • Size: symbol size can be selected from a pull-down menu.
  • Fill colour: the fill colour for the symbol can be selected from a palette.
  • Outline colour: the colour for the symbol outline can be selected from a palette.
  • Delete: deletes the points for the Selection.

4 Calculation of further Descriptors to characterise the Docking Pose
• It is possible within Hermes to calculate a wide variety of further descriptors that can be used to characterise the quality of a binding pose. These include whole ligand or protein properties such as number of occluded acceptors and donors, descriptors to evaluate the occupancy of certain pockets, and descriptors that monitor the presence or absence of certain important bonding interactions such as hydrogen bonds. These descriptors can be associated with the dock set used to generate them. They are useful to analyse and filter sets of docking poses and can be employed in combination with fitness function related descriptors. The user is referred to the Hermes documentation for further information.
• In addition several useful descriptors can be calculated within GoldMine.

4.1 Calculation of Simple properties
• Options to calculate properties that relate only to the ligand are available under the Descriptors pull-down on the top menubar in the GoldMine Controller and Data Explorer windows. Click on Simple properties to bring up the Molecular descriptors window.
• It is possible to calculate Molecular Weight, Donor atom count, Acceptor atom count and Rotatable bond count for a given selection of poses chosen from the pull-down at the bottom of the Molecular descriptors window.

4.2 Calculation of SMILES representation
• It is possible to calculate SMILES representations for all structures. These are string representations of 2D molecules. Detailed information can be found on the Daylight web pages (http://www.daylight.com/dayhtml/doc/theory/index.html). The SMILES encoding used within GoldMine is not canonical and it is not necessarily entirely consistent with the encoding used by other vendors. It is however consistent with the encoding used in the CCDC products Relibase+ and WebCSD.
• Click on the Descriptors pull-down on the top menubar in the GoldMine Controller and Data Explorer windows. Select Simple properties to bring up the Molecular descriptors window.
• Activate the Smiles tick box and deactivate all other tick boxes.
• Select an appropriate selection via the pull-down menu and click on Calculate. This will place a descriptor called Smiles in the dock set appropriate to the selection.
• You can use the SMILES representation to pick out chemical substructures in Hermes, because the text search capability in the Selection manager can be co-opted to search for relevant SMILES strings (see Searching Using a Text Descriptor, page 18). However because the SMILES representations are not guaranteed to be canonical some care is required to ensure all instances of the desired substructure are retrieved. Appropriate use of the % wild card character before and/or after the search string may be necessary. Several different text searches may be required.

4.3 Calculation of Euclidean Distances
• The intent here is to calculate a centroid in multi-dimensional space from a selection of solutions, then to calculate a descriptor for the corresponding larger dataset, representing their Euclidean distance from that centroid.
• The Euclidean distances window can be displayed by selecting Distances from the Descriptors pull-down available at the top of most GoldMine windows. Alternatively it is possible to highlight descriptors in, for instance, the Descriptors pane of the GoldMine Controller, select Distance calculator from the pull-down adjacent to Send choices to, and click on Send choices to.

• Descriptors can be brought into the Dock set region of the Euclidean distances pane, either via drag-and-drop or through using the method described above.
• The descriptors that are required for the distance calculation should be highlighted with the mouse. Note: It is most useful to use normalised descriptors for distance calculation so that all descriptors have equal weighting. For information on how to normalise a descriptor refer to the documentation on the Descriptor calculator (see Transformation Functions, page 40).
• A selection should be picked from the Define Centroid pull-down. The centroid in Euclidean space will be calculated from this selection.
• A second selection should be selected from the Calculate distances pull-down. The Euclidean distances will be calculated from the centroid described above.
• A name for the new descriptor should be placed in the text box at the bottom.
• Clicking on Calculate will calculate the centroid and distances to the centroid. Once the calculation is finished the new descriptor will be available for use from the Descriptors pane in the GoldMine Controller.

4.4 Calculation of the Solutions RMSD Matrix and Descriptor Subsets therefrom
• If multiple solutions have been generated for a ligand it is possible to create the matrix of Root Mean Square Deviation (RMSD) of atom positions between each pair of solutions. It is also possible to create a descriptor which contains the RMSDs of all solutions with reference to one particular solution.
• With a GoldMine DB open, click on RMSD from the Tools pull-down menu on the GoldMine Controller.
• The RMSD window will come up. The left-hand pane of this window lists the ligand structures; the right-hand pane gives the RMSD Matrix for all solutions for the ligand that is highlighted in the left-hand pane.
• The Target selection and Reference selection boxes at the top of the RMSD window can be used to pick Selections which can then be used to highlight subsets of the RMSD matrix in red. If the selection in both these boxes is an entire dock set, as above, then all entries will be highlighted red.
• If a Selection containing a single ligand pose is created then only one row and column in the matrix appropriate to that ligand will be highlighted. It is possible to create a descriptor from this selection by using the Calculate descriptor button at the top-left of the window. An appropriate name has to be entered for the descriptor. Note: It is not usually useful to create a descriptor if more than one pose per ligand is highlighted. The descriptor will only be calculated for one of the poses (normally the one in the matrix furthest to the right).
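The RMSD matrix for the solutions of one ligand can be illustrated with the following sketch. It is an assumption-laden illustration, not the GoldMine algorithm: poses are taken to share the same atom ordering, no superposition is applied (docked poses already sit in the same protein frame), and no symmetry correction is made. The coordinates are invented.

```python
import math

def rmsd(pose_a, pose_b):
    """Root mean square deviation of atom positions between two poses."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))

def rmsd_matrix(poses):
    """Symmetric matrix of RMSDs between each pair of solutions."""
    n = len(poses)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(poses[i], poses[j])
    return matrix

# Three hypothetical two-atom poses of the same ligand, as (x, y, z) per atom.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
    [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)],
]

matrix = rmsd_matrix(poses)

# A descriptor of RMSDs relative to one reference solution is one column of the matrix.
rmsd_to_first = [row[0] for row in matrix]
```

Taking a single column in this way corresponds to the case described above where the reference Selection contains exactly one pose.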

5 Arithmetically Manipulating Descriptors: Consensus Scoring
• Many descriptors calculated during a docking job are available and further descriptors that describe the protein-ligand interaction can be calculated using Hermes. However it may also be useful to arithmetically transform and combine descriptors to generate new ones that more efficiently describe the binding interaction. To give an example: it may be believed that the fitness function being used is too effective at rewarding high molecular weight ligands. Therefore it might be desirable to create a new fitness descriptor that is normalised for molecular weight. Creating a composite descriptor that is simply the fitness score divided by molecular weight might work. Alternatively it might be necessary to use a more complex normalisation scheme such as that described in: G. Carta, A. J. S. Knox and D. G. Lloyd, J. Chem. Inf. Model., 47, 1564-1571, 2007.
• A well recognised technique for increasing hit rates in virtual screening is to use consensus scoring. This technique employs several different scoring functions to evaluate individual docking poses and to establish whether a given pose is good or poor. The assumption is that the weaknesses in one scoring function are compensated for by strengths in another. The area is well reviewed in: M. Feher, Drug Discovery Today, 11, 421-428, 2006. Three different scoring functions are available within GOLD. In addition, it is possible, on rescoring a pose with a new scoring function, to allow local minimisation of the docking pose so that an optimum score representative of the pose is obtained. Local optimisation has been shown to increase enrichment rates in in-house studies on rescoring strategies. Thus an analysis strategy that involves consensus scoring is often highly appropriate when analysing GOLD virtual screens. GoldMine allows the user to create single consensus descriptors which can then be used to rank and filter solutions.

5.1 The Descriptor Calculator
• The Descriptor Calculator can be brought up by first highlighting the descriptors within a dock set that are to be manipulated, and then clicking on the Arithmetic button.
• The descriptors to be manipulated appear in the top left box.
• Available functions are at the top right of the Calculator. Four types of function are available:
• Calculator Operations (at left).
• Transformation Functions (two columns to the far right). These transform each descriptor data point within a descriptor. The transformed data are saved under a new descriptor name.
• Global Aggregate Functions (middle column). These calculate single parameter values over all descriptor data points.
• Local Aggregate Functions (lower far right). These operate on groups of poses of each ligand. A value is returned for each and every pose under a new descriptor name.
• Transformed, aggregated and composite descriptors are assembled within the box at the bottom of the Descriptor Calculator.
• Buttons to Clear entries in the lower box, to do a Calculation, and to exit the Descriptor Calculator (Done) are available at the bottom.
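The molecular-weight example given in the introduction to this section can be sketched outside GoldMine as follows. This is only an illustration of the arithmetic: the pose names, the dictionary keys and the numbers are hypothetical, and the sketch does not use any GoldMine interface.

```python
# Hypothetical per-pose descriptor values (illustrative numbers only).
poses = [
    {"name": "lig1_pose1", "fitness": 62.0, "mol_weight": 310.0},
    {"name": "lig2_pose1", "fitness": 75.0, "mol_weight": 500.0},
    {"name": "lig3_pose1", "fitness": 48.0, "mol_weight": 200.0},
]


def fitness_per_weight(pose):
    """Composite descriptor: fitness score divided by molecular weight."""
    return pose["fitness"] / pose["mol_weight"]


# Store the composite value under a new descriptor name for every pose.
for pose in poses:
    pose["fitness_per_mw"] = fitness_per_weight(pose)

# Rank poses by the new descriptor, best (highest) first.
ranked = sorted(poses, key=lambda p: p["fitness_per_mw"], reverse=True)
for pose in ranked:
    print(pose["name"], round(pose["fitness_per_mw"], 3))
```

In this made-up data the heaviest ligand has the highest raw fitness but drops to last place once the score is divided by molecular weight, which is the effect the composite descriptor is meant to expose.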

5.1.1 Transformation Functions
• Transformation functions return a value for each descriptor entry. The transformed function is saved as a new descriptor within the dock set that the original descriptor came from.
• The following transformation functions are available:
• Log() - This calculates the log to base 10.
• Ln() - This calculates the natural logarithm.
• Exp() - This calculates the exponential.
• Sqrt() - This calculates the square root.
• Pow() - This raises to the power by a specified number n. To use, click on Pow() and then select the descriptor to be transformed. Then enter a comma within the brackets after the name of the descriptor, followed by n.
• Mean centre() - This subtracts the mean value for the descriptor from each component data point.
• Normalise() - This normalises the descriptor, i.e. each component data point is mean centred and then divided by the standard deviation.
• Rank() - This calculates the rank order over a descriptor. Highest positive descriptors have the lowest rank.
• Least() - This takes two or more descriptors as arguments. It is necessary that these be separated by commas. The descriptor values are compared for each common entry, and the lowest value is returned.
• Greatest() - This takes two or more descriptors as arguments. It is necessary that these be separated by commas. The descriptor values are compared for each common entry, and the highest value is returned.
• To use, click on the desired transformation function. The function will appear in the lower box associated with empty brackets.
• Click on the descriptor at the top left that should be transformed. The name of the descriptor will be placed within the empty brackets.
• Type an appropriate name in the New descriptor box. The name format dock_set.descriptor can be used if it is desired to save the new descriptor in a dock set other than the one the old descriptor is in.
• Hit Calculate. The transformed descriptor will be calculated and added to the appropriate dock set. Note: Null entries are saved for cases where the transformation function returns imaginary values.

5.1.2 Global Aggregate Functions
• Global aggregate functions return a single value over the dock set for the descriptor. They cannot be saved. They can however be used to create composite descriptors (see Composite Functions and Consensus Scoring, page 42).
• The following global aggregate functions are available:
• Min() - This calculates the minimum value for a descriptor.
• Max() - This calculates the maximum value for a descriptor.
• Mean() - This calculates the mean value for a descriptor.
• StdDev() - This calculates the standard deviation for a descriptor.
• Sum() - This sums all data points for the descriptor.
• Count() - This returns the number of data points for the descriptor.
• To use, click on the desired global aggregate function. The function will appear in the lower box associated with empty brackets.
• Click on a descriptor at the top left. The name of the descriptor will be placed within the empty brackets.
• Hit Calculate. The relevant parameter for that descriptor will be calculated and displayed on the screen.

5.1.3 Local Aggregate Functions
• Local aggregate functions operate over the set of poses for a ligand. The function returns a value for each and every pose, which is saved under a new descriptor name within the dock set that the original descriptor came from.

• The following local aggregate functions are available:
• Aggregate() - This identifies or calculates from the descriptor datapoints a characteristic parameter for each ligand. Five parameter types are available from the associated pull-down menu. Max returns the maximum value for the descriptor for each ligand, Min returns the minimum value for each ligand, Mean returns the mean value for each ligand, Sum returns the sum of the values for a ligand, and Count gives the number of entries for that descriptor for a ligand. Each aggregate value characteristic of a ligand is written to all poses of that ligand.
• Best solution() - This returns 1 for the best value of the descriptor over the poses for each ligand, and 0 otherwise. What counts as best is determined by whether Max or Min is selected from the associated pull-down menu.
• To use, first open the pull-down menu to the right of the appropriate local aggregate function. Select the type of aggregation required.
• Click on the desired local aggregate function. The function will appear in the lower box associated with empty brackets.
• Click on a descriptor at the top left. The name of the descriptor will be placed within the empty brackets.
• Type an appropriate name in the New descriptor box. The name format dock_set.descriptor can be used if it is desired to save the new descriptor in a dock set other than the one the old descriptor is in.
• Hit Calculate. The relevant parameter for that descriptor will be calculated and added to the appropriate dock set.

5.2 Composite Functions and Consensus Scoring
• Expressions for composite functions can easily be built up from descriptors by using the calculator operations within the Descriptor Calculator. This can involve two or more descriptors from the same dock set.
• Descriptors can be transformed in situ and then linked by arithmetic operations. One thing to note is that when a transformed descriptor is first defined in the lower window, the cursor by default remains inside the brackets enclosing the descriptor name. It should normally be repositioned outside the brackets before additional arithmetic operators and descriptors are added to the expression.
• It is however possible to carry out arithmetic within the brackets of an aggregate or a transformation function.
• When descriptors from the same dock set combine within a composite function to create a new descriptor, then the composite function will act on each entry in the dock set separately.
• Special considerations apply if the descriptors to be combined come from different dock sets. It is usually the case that docked poses with the same identifier but in different dock sets will not be closely related. There is often no sensible one-to-one correspondence and great care is needed to avoid generating nonsensical composite functions. If more than one pose per ligand is saved then the dangers of applying a one-to-one correspondence are even greater. A more sensible choice in this case is to combine values representative of each ligand, for each dock set separately, before composing the composite function.
• If a composite function creates a new descriptor from descriptors from separate dock sets, it is advisable to specify in which dock set the new descriptor is to be placed by using the name format dock_set.descriptor within the New descriptor box. GoldMine deals with these cases in a special way. If a single pose per ligand is saved then GoldMine will combine the descriptor values in the obvious way. However, if several poses are saved per ligand then GoldMine will first automatically Aggregate (see Local Aggregate Functions, page 41) the appropriate descriptor values over each ligand, before carrying through the combination function, generating a new descriptor with a single value per ligand.
Example: The GoldMine calculation entered as
Normalise(Cox2_Goldscore.Goldscore_Fitness) + Normalise(Cox2_Chemscore.Chemscore.fitness)
would, by default, be exactly equivalent to the following calculation
Normalise(Aggregate Max(Cox2_Goldscore.Goldscore_Fitness)) + Normalise(Aggregate Max(Cox2_Chemscore.Chemscore.fitness))
• GoldMine will give a warning in cases such as that above. NOTE: The default Aggregate option applied is Max. To change this, it is necessary to select a different Aggregate option from the pull-down menu next to the Aggregate() button, before completing the calculation. The standard Aggregate options Min, Max, Mean, Sum and Count are available.
• Special composite functions can be created to allow Consensus Scoring. At least three common flavours of Consensus scoring can be carried out within GoldMine. In all cases it is assumed that a GoldMine DB is available which has descriptors, not only for the original fitness function used to carry out the docking, but also for the same poses re-scored with one or more other scoring functions. Best results may be achieved when multiple poses for each ligand are saved in the original docking run. It is important to note that the three different schemes described below are not guaranteed to always give very similar results. This is demonstrated in Tutorial 3.
• Consensus by Normalised Score - A composite descriptor is created that is made up of the sum of the Normalised scores for the fitness functions to be used. If desired a weighting can be given to one or more of the component fitness functions. This descriptor can then be used to create a ranked list of poses and to identify the best for further study.
• Consensus by Rank - A composite descriptor is created that is made up of the sum or average of the Ranks according to each fitness function. The better the rank the lower this score. This descriptor may be used in a similar way to Consensus by Normalised Score.
• Consensus by Vote - This consensus method doesn't require any descriptor arithmetic. Instead the appropriate fitness functions are brought into the Boolean AND box of the Selection Manager and the Selection filters set for each fitness function such that the total number of accepted poses after Boolean AND is of a desired percentage cut. Poses have to have high scores in both fitness functions to do well in this scheme of consensus scoring. It is not possible to generate a ranked list of poses easily.
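The three consensus schemes described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GoldMine's implementation: one score per ligand is assumed (for example after a per-ligand Max aggregate), the population standard deviation is assumed for normalisation, ties in the ranking are broken by input order, and the vote uses the same top fraction for every scoring function. All ligand names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def normalise(values):
    """Mean centre, then divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def rank(values):
    """Rank 1 for the highest value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def top_fraction(values, fraction):
    """Indices of the best-scoring fraction of entries (at least one)."""
    keep = max(1, round(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:keep])


# Hypothetical scores, one value per ligand, from two scoring functions.
ligands = ["A", "B", "C", "D"]
goldscore = [60.0, 55.0, 40.0, 50.0]
chemscore = [30.0, 20.0, 28.0, 35.0]

# Consensus by Normalised Score: sum of normalised scores (higher is better).
by_score = [g + c for g, c in zip(normalise(goldscore), normalise(chemscore))]

# Consensus by Rank: sum of ranks (lower is better).
by_rank = [g + c for g, c in zip(rank(goldscore), rank(chemscore))]

# Consensus by Vote: accepted only if in the top 50% of both functions.
by_vote = top_fraction(goldscore, 0.5) & top_fraction(chemscore, 0.5)

print(by_score, by_rank, sorted(ligands[i] for i in by_vote))
```

The vote yields a set of accepted ligands rather than a score, which is why that scheme does not readily give a ranked list.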

6 Per Atom Scores
• The energy terms produced by GOLD during a docking can be recorded on a per atom basis (see the GOLD user guide for further details on per atom scores). These atom energies are automatically imported into a GoldMine DB but are not displayed by default. This is because each atom will produce a new descriptor in the GoldMine DB and as such there is the potential of ending up with a large set of descriptors.

6.1 Extracting Atom Energies
• To extract per-atom scores into a GoldMine DB select Atom Energies from the Tools menu. This option will only be active if per-atom scores are available to be extracted.
• Select the set of docking solutions for which atom energies should be extracted.
• Atom energies can be extracted for the ligand atoms (Ligand atom scores), the protein atoms (Protein atom scores) and/or on a residue basis (Protein residue scores). Note that it makes little sense to extract the Ligand atom scores on a set of disparate ligands.
• Tick the check boxes of interest and press Extract. This will create a new descriptor for each set of per atom energies.

• It may then be useful to use some of the Choose descriptors options in the Descriptors pane to filter out those descriptors standing in for per atom scores that contain little or no information (see Selecting Descriptors, page 13).
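As an illustration of what filtering out low-information descriptors can mean, the sketch below drops per-atom descriptors whose values barely vary across poses. The variance criterion, the threshold and the descriptor names are assumptions made for this example; the criteria offered by the Choose descriptors options in GoldMine itself are described in Selecting Descriptors.

```python
from statistics import pvariance

# Hypothetical per-atom descriptors: name -> one energy value per pose.
atom_descriptors = {
    "ligand_atom_1": [-1.2, -0.8, -1.5, -0.9],
    "ligand_atom_2": [0.0, 0.0, 0.0, 0.0],
    "ligand_atom_3": [-0.3, -0.3, -0.3, -0.3],
}


def informative(descriptors, min_variance=1e-6):
    """Keep only descriptors whose values vary across the poses."""
    return {
        name: values
        for name, values in descriptors.items()
        if pvariance(values) > min_variance
    }


kept = informative(atom_descriptors)
print(sorted(kept))
```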


7 Hotspots
• The hotspots functionality in GoldMine allows the calculation of pharmacophore grids based on a selection of docked solutions. These so-called hotspot grids can be visualised in Hermes and provide a means of visually observing average trends in the docking solutions. The hotspots can also be used to calculate new descriptors based on whether or not a docked solution has atom(s) of interest in the hotspot grids. Furthermore, combining the hotspot and regression tools provides the functionality to create receptor based 3D QSAR models. The methodology of using docking to align ligands for a 3D QSAR model has been successfully applied by Tuccinardi (La Motta et al., J. Med. Chem., 52, 964-975, 2009) who managed to create a model for predicting IC50 binding affinities with an r2 of 0.86.

7.1 Defining Hotspots
• To define GoldMine hotspots select Hotspots from the Tools menu.
• The GoldMine Hotspots window has two main areas: one for defining hotspots (left) and one for calculating descriptors based on the hotspots (right).
• Select a docking solution set of interest for which you want to calculate hotspots. This is likely to be a set of known actives or a subset of known actives.
• Press the Add button to define a new hotspot probe.
• A number of predefined probes are available (Heavy, Polar, Donor, Acceptor, Lipophilic). Further, it is possible to define custom probes based on the Sybyl atom types. For example, one could define a carbon sp3 probe by deselecting all atom types apart from C.3.
• After having defined all probes of interest press Calculate Grids. This will calculate hotspot grids based on the location of the specific atom type(s) for the selected set of docked solutions.

7.2 Writing a Hotspot ACNT file
• It is possible to save the hotspot grid to an *.acnt file.
• Select the hotspot grid that you want to write to file by clicking on it.
• Press the Write button in the GoldMine Hotspots window.
• Specify the file name and press Save.

7.3 Reading a Hotspot from an ACNT file
• It is possible to import a hotspot grid from a saved *.acnt file. These files can be created from hotspots defined in GoldMine (see Writing a Hotspot ACNT file, page 47) or by other programs such as SuperStar.
• Press the Read button in the GoldMine Hotspots window. This will bring up a dialogue asking for the file location.


• Select the file of interest and press Open.
• It is worth noting that grids produced by external programs can be used to calculate hotspot descriptors in GoldMine.

7.4 Visualising Hotspots
• In order to view hotspots a suitable set of hotspots must first be defined (see Defining Hotspots, page 47).
• To view a hotspot select the probe of interest by clicking on it (on the left hand side of the GoldMine Hotspots window) and press the View button. This will bring up the Contour Surfaces window.
• The Contour Surfaces window has two panes, Create and Edit. In the Create pane one can define the number of surfaces used to represent the hotspot as well as the colouring and isocontour level of the surfaces. By pressing the Create button the surfaces are generated in Hermes.
• In the Edit pane of the Contour Surfaces window individual surfaces can be edited and deleted. Individual surfaces can have their isocontour level, colouring, display type and opacity edited. Further, individual surfaces can be switched on and off using the Visible check box.
• Sometimes one will want access to the Contour Surfaces window after it has been closed. This can be achieved through the Contour Surfaces... option in the Hermes Display menu.

7.5 Calculating Hotspot Descriptors
• In order to calculate hotspot descriptors a suitable set of hotspots must first be defined (see Defining Hotspots, page 47).
• After a suitable set of hotspot grids have been calculated (on the left hand side of the GoldMine Hotspots window) it is possible to calculate hotspot descriptors for a set of docking solutions (on the right hand side of the GoldMine Hotspots window).
• Click on the Add button on the right hand side to add a descriptor.
• A number of predefined descriptors are available (Heavy, Polar, Donor, Acceptor, Lipophilic). Further, it is possible to define custom descriptors based on the Sybyl atom types. For example, one could define a carbon sp3 probe by deselecting all atom types apart from C.3.
• After defining the descriptors of interest press Calculate. This will calculate all descriptors and add them to the GoldMine DB. The descriptors are named GoldMineHotspots_<descriptor_name>.
• The calculated descriptors can be inspected in the GoldMine Controller.
• The new hotspot descriptors could be used to create modified scoring functions using either the GoldMine Descriptor Calculator (see Calculation of further Descriptors to characterise the Docking Pose, page 35) or the GoldMine Regression tool (see Creating Statistical models that describe Biological Activity: The Regression Window, page 55). If activity data is available the GoldMine Regression tool could be used to create a 3D-QSAR model using the hotspot descriptors.
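The idea behind a hotspot grid and a hotspot descriptor can be illustrated with a deliberately simplified sketch. Everything here is an assumption made for illustration: the cubic binning, the grid spacing, the occupancy threshold and the coordinates are hypothetical, and the way GoldMine actually computes its grids and descriptors is not specified in this guide.

```python
from collections import Counter


def cell(coord, spacing=1.0):
    """Map an (x, y, z) coordinate to the index of its cubic grid cell."""
    return tuple(int(c // spacing) for c in coord)


def hotspot_grid(poses, spacing=1.0):
    """Count, per grid cell, the probe atoms found over a set of poses."""
    counts = Counter()
    for atoms in poses:
        for coord in atoms:
            counts[cell(coord, spacing)] += 1
    return counts


def hotspot_descriptor(atoms, grid, min_count=2, spacing=1.0):
    """1 if any atom of the pose sits in a well-populated cell, else 0."""
    return int(any(grid[cell(a, spacing)] >= min_count for a in atoms))


# Hypothetical donor-atom coordinates for three docked actives.
actives = [
    [(1.2, 0.3, 0.1)],
    [(1.7, 0.9, 0.4)],
    [(5.0, 5.0, 5.0)],
]
grid = hotspot_grid(actives)

print(hotspot_descriptor([(1.1, 0.5, 0.5)], grid))  # atom in the shared cell
print(hotspot_descriptor([(5.2, 5.1, 5.3)], grid))  # cell visited only once
```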

8 Creating Training and Test Sets of poses for Regression Model Building
• GoldMine offers two types of regression model. Either the user supplies, within a GoldMine database, a set of docking poses for active molecules, associated descriptors, and a measured activity for each molecule. GoldMine then can be used to create a Quantitative Structure Activity Relationship (QSAR) that relates activity to a linear combination of descriptors. Alternatively the user supplies a set of docking poses for actives (for which no precise activity data is required), a set of poses for decoys, and associated descriptors. Then GoldMine can be used to create a Discrimination Model that best discriminates between the actives and decoys. Both types of model can be thought of as rescoring functions that can be used after docking to evaluate the likelihood of a given pose/structure having biological relevance/activity. This functionality is covered elsewhere (see Creating Statistical models that describe Biological Activity: The Regression Window, page 55).
• If creating statistical models, we also need to be careful to create training and test sets of ligand poses, to ensure that any rescoring functions generated retain good predictive powers outside of the training set. This functionality is available as the Define test set option under the Tools pulldown menu in the top menu bar of the GoldMine Controller and the Data Explorer.

8.1 Creating Training and Test Sets assuming only a single docking pose has been saved per ligand
• If the dataset only contains one docked pose per ligand then the top third of the Define Training and Test Sets window may be ignored. Training and test sets are created by splitting a dataset into two random sets of specified size. To create training and test sets carry out the following operations:
• Choose a dataset to split from the Selection to split pull-down menu on the left hand side. The dataset may be an entire dock set (in which case the selection to choose will start with All_) or it may be a subset of poses previously saved as a Selection.
• Set the percentage you wish to have in the training set. This normally should be between 35 and 65%.
• Edit the names for the training and test sets as you wish and then click on Split.
• If you are intending to create a QSAR model using activity data associated with all members of the original dataset then this is all you need do. In the Regression window you will be able to create a model using your training set and, simultaneously, observe the performance of the model on the test set.
• If you are creating a Discrimination Model then it will be necessary to create selections that identify the active molecules in the training and test sets you have already created. You will need to have previously created a selection of all the molecules designated active. One way of doing this is to have the information that a molecule is active encoded in the NAME designator within the molecule structure file. Then a text query will allow you to create the actives selection (see Searching Using a Text Descriptor, page 18).
• In the Actives pull-down menu at the bottom left select the selection that contains your active molecules.
• The names of the training and test actives sets should change accordingly. These can be edited.
• Click on Create.
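The random split described in this section, and the derivation of the training and test actives selections, can be sketched as below. The function name, the seed and the ligand names are hypothetical; this illustrates the operation and is not GoldMine's own code.

```python
import random


def split_selection(names, train_percent, seed=0):
    """Randomly split a list of pose names into training and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = names[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical dataset with one pose per ligand and a known actives selection.
poses = [f"ligand_{i}" for i in range(20)]
actives = {"ligand_0", "ligand_3", "ligand_7", "ligand_12"}

train, test = split_selection(poses, train_percent=50)

# Actives selections restricted to the training and test sets respectively.
train_actives = [p for p in train if p in actives]
test_actives = [p for p in test if p in actives]

print(len(train), len(test), len(train_actives), len(test_actives))
```

With a small number of actives a purely random split can leave very few actives in one of the two sets, so it is worth checking the counts before building a model.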

8.2 Creating Training and Test Sets assuming multiple docking poses have been saved per ligand
• It is possible in GoldMine to create models from dock sets that contain multiple poses per ligand. However it is not possible to use information from more than one pose per ligand to create a model. Using information from some or all available poses may lead to over-fitted models and should be avoided. So it is necessary to first select the pose for each ligand that you wish to include in the model building. This would be very time consuming to do manually. A much quicker way is to choose a descriptor that you think already does a reasonable job of picking out good poses and use this to select the pose per ligand that will be used in building the model. The scoring function used for the original docking would usually be a good choice. A model created from single poses can afterwards be applied to all poses in the dock set.
• Drag and drop from the Descriptors pane or the Data Explorer the descriptor you wish to use to define the poses to work with, into the area at the top of the Define Training and Test Sets window.
• Choose a criterion from Min and Max to decide which ligand pose is selected.
• Next to Best set name type in the name that you wish to assign to the new selection.
• Click on Create to the right. A selection will be created with the name assigned. This name will automatically populate the Selection to split pull-down further down the page. It is now possible to create training and test sets as described previously (see Creating Training and Test Sets assuming only a single docking pose has been saved per ligand, page 52).
• A side-effect of the creation of this selection is the creation of a descriptor of the same name within the parent dock set. This descriptor takes the value 1 for the best poses, 0 otherwise.
• The steps carried out to create the Best set of poses could have been carried out in a more long-winded way in the Descriptor Calculator by first using an aggregation function to create a Best of descriptor, and then using this to create the Best set selection.
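The selection of one pose per ligand, and the accompanying 1/0 descriptor, can be sketched as follows. The pose records, the key names and the tie handling (the first pose encountered wins) are assumptions made for this illustration.

```python
def best_pose_flags(poses, descriptor, use_max=True):
    """Return 1 for the best pose of each ligand and 0 for all others."""
    best = {}  # ligand name -> index of the best pose seen so far
    for i, pose in enumerate(poses):
        j = best.get(pose["ligand"])
        if j is None:
            best[pose["ligand"]] = i
            continue
        if use_max:
            better = pose[descriptor] > poses[j][descriptor]
        else:
            better = pose[descriptor] < poses[j][descriptor]
        if better:
            best[pose["ligand"]] = i
    chosen = set(best.values())
    return [1 if i in chosen else 0 for i in range(len(poses))]


# Hypothetical dock set with several poses per ligand.
poses = [
    {"ligand": "lig1", "fitness": 50.0},
    {"ligand": "lig1", "fitness": 62.0},
    {"ligand": "lig1", "fitness": 58.0},
    {"ligand": "lig2", "fitness": 44.0},
    {"ligand": "lig2", "fitness": 41.0},
]

flags = best_pose_flags(poses, "fitness", use_max=True)
best_set = [pose for pose, flag in zip(poses, flags) if flag == 1]
print(flags)
```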

9 Creating Statistical models that describe Biological Activity: The Regression Window
• It is the case in Structure-based Virtual Screening that no scoring function on its own is likely to be the best tool for identifying actives for any given target. Scoring functions for docking are generally designed to be accurate at identifying crystallographic poses over a wide variety of target types. However individual targets might have requirements for good ligand binding that are not sufficiently well represented in the scoring functions available. For instance a particular hydrogen bonding motif might be especially important, or lipophilic contacts to a particular residue might be essential. There are facilities in GoldMine and Hermes for generating descriptors that encode important interactions (see Calculation of further Descriptors to characterise the Docking Pose, page 35). There also exist facilities for creating arithmetic combinations of scoring functions that allow consensus scoring to be carried out (see Arithmetically Manipulating Descriptors: Consensus Scoring, page 39). So pretty much all the useful descriptors for characterising ligand binding are available for combining, in order to build composite scoring functions for post processing analysis. We will now look at functionality that enables us to construct and optimise such scoring functions by using statistical modelling.
• If the target protein is one for which actives are already known then it may be possible to find a numerical model that is tailored for separating actives from inactives on the basis of their docked binding modes. We first need to set up an enrichment study in which a number of actives and a much larger number of decoys of similar molecular weight and functionality are docked into a suitable protein. In the Data Explorer it is possible to evaluate how good individual scoring functions and descriptors are at separating actives from inactives (see The Data Explorer, page 26). In the Regression window it is possible to go further. Stepwise multi-linear regression is used to generate optimised linear models from the available descriptors and scoring functions. The best model can then be set up in the Descriptor Calculator (see Arithmetically Manipulating Descriptors: Consensus Scoring, page 39) for a set of dockings from a true virtual screen, and used as a rescoring tool.
• Alternatively we may have a set of actives for which measured activity data is available. If the range of activity is wide enough and the set of active molecules large enough it may be possible to use measured activity data to create a Quantitative Structure Activity Relationship (QSAR) that relates descriptors calculated in docking, to activity. Creating such a model is possible via the Regression window. This model can, in principle, be used to estimate the activity for new molecules.
• In model building it is essential to ensure numerical models are not created which are over-fitted to the data.
• It is usually advisable to divide available data into Training and Test sets for model building. GoldMine provides tools for creating random training and test sets (see Creating Training and Test Sets of poses for Regression Model Building, page 51). In GoldMine it is possible to divide selections of actives randomly into training and test sets. A model can be built in the Regression window using the training set and the performance on the test set can be monitored at the same time.

9.1 Data preparation
• If only the binary active/decoy distinction is being used, then the dock set that is being worked with should contain a subset of actives that are identifiable in some way. One acceptable way of doing this is by placing, for all actives, a constant prefix in front of the structure name that is contained within the .sdf or .mol2 input files used for the original docking. This needs to be done prior to docking. The actives can then be retrieved in the Selection manager by carrying out a text search for Prefix% on the NAME descriptor. Alternatively a tag field in the input .sdf or .mol2 files may contain a binary discriminator, 1 for active, 0 for inactive.
• A named Selection then needs to be created for all actives, either via the text search described above or via a numeric search in Selection manager.
• If the model is to be created from activity data then this can be present as a tag in the input .sdf or .mol2 files.
• It is possible to append data to a GoldMine DB from .csv files, so this is another way that binary discriminators or activity data can be added. Additional descriptors calculated by other methods may also be imported this way if it is wished to include them in model building.
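The stepwise multi-linear regression mentioned in the introduction to this section can be illustrated with a small forward-selection sketch. It makes several assumptions that are not taken from GoldMine: the next descriptor is chosen as the one giving the smallest residual sum of squares, all coefficients are refitted by ordinary least squares at every step, the response y may be a measured activity or a 1/0 active indicator, and the descriptor names and data are invented. The selection criterion and significance tests used by GoldMine itself are not described here.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(columns, y):
    """Least-squares fit (intercept first); returns coefficients and RSS."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    coef = solve(xtx, xty)
    rss = sum(
        (yi - sum(c * v for c, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y)
    )
    return coef, rss


def forward_stepwise(descriptors, y, max_terms=2):
    """Add, one at a time, the descriptor that most reduces the RSS."""
    chosen = []
    while len(chosen) < max_terms:
        candidates = [n for n in descriptors if n not in chosen]
        if not candidates:
            break
        best = min(
            candidates,
            key=lambda n: fit([descriptors[c] for c in chosen + [n]], y)[1],
        )
        chosen.append(best)
    coef, rss = fit([descriptors[c] for c in chosen], y)
    return chosen, coef, rss


# Hypothetical descriptors; y is constructed as 1 + 2*x1 + 0.5*x2.
descriptors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "x3": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
}
y = [1 + 2 * a + 0.5 * b for a, b in zip(descriptors["x1"], descriptors["x2"])]

chosen, coef, rss = forward_stepwise(descriptors, y)
print(chosen, [round(c, 3) for c in coef])
```

Because all coefficients are refitted whenever a term is added, the coefficient of an earlier descriptor can change as the model grows.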

9.2 Model Building when the Data Set consists of Actives and Decoys
• First highlight all the descriptors of possible importance, to work with, in the Descriptors pane or Data Explorer, select Regression from the pull-down adjacent to the Send Choices to button, and click on Send Choices to. This will open up the GoldMine Regression window. The GoldMine Regression window contains two tabs. The one displayed on opening the window is the Data Set pane. It is here that we define the training and test sets that we wish to use.
• First toggle on the Active selection radio button. This means we are creating a model from a set of actives and a set of decoys but we are not using actual activity data.
• Select the full training set of actives and decoys from the Data set pull-down menu in the Training data area. If this set is created using the Define Training and Test Sets window the name is likely to end in .train. If you have not defined a training set then use the entire data set. Note: If you have multiple poses per ligand it is advisable not to use the full set of poses. The models that result may not be meaningful. Instead create a selection consisting of one pose per ligand. It would be normal to use the original docking fitness score as the descriptor to identify this pose. This type of selection is most easily created via the Define Training and Test Sets window (see Creating Training and Test Sets of poses for Regression Model Building, page 51).
• Select the set of actives that are wholly within the training set from the pull-down menu in the Regress against area. If no training set has been defined use a selection that represents all the actives.
• If you have defined a test data set of actives and decoys, select this test data set from the Data set pull-down menu in the Test data area.
• Select the selection of actives that are wholly within this test set from the Test set pull-down menu.
• Now click on the Regression tab at the top of the page. The Regression pane is now visible and the descriptors you loaded from the Descriptors pane are present at the bottom left.
• Models are built up by incrementally adding new descriptors to a model. Models can be created automatically, manually or by a combination of both methods.
• To create a model automatically click Auto add at the bottom left. A window will come up suggesting the next descriptor to add to the model. The descriptor that is chosen is the one that maximally distinguishes actives from inactives. Correlation and significance figures are quoted. You have the option to Accept this descriptor, reject it and Try again, or Abandon the model building process.

• Once a descriptor has been selected it will appear in the upper left box. The model generated up to this point is represented in the coefficients for the descriptors making up the model in the Regression model area. Additionally an enrichment curve is graphed in the middle left region. Enrichment metrics are also calculated at the top left.
• If a test set is also specified then enrichment curves and enrichment metrics are generated in the right-hand bottom and top areas. As with the Data Explorer it is possible to select ROC or Accumulation curve. It is also possible to change the cut-off at which the enrichment metrics are calculated. More information is available in the Data Explorer documentation (see Operations with a Selection: Looking for Enrichment, page 28).
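For readers unfamiliar with enrichment metrics, the sketch below computes two commonly used ones from a list of model scores and active/decoy labels: the area under the ROC curve and an enrichment factor at a chosen cut-off. These are standard textbook definitions and are assumed here for illustration; the exact metrics reported by GoldMine are documented in the Data Explorer section. The scores and labels are invented.

```python
def roc_auc(scores, is_active):
    """Chance that a random active outscores a random decoy (ties count 0.5)."""
    actives = [s for s, a in zip(scores, is_active) if a]
    decoys = [s for s, a in zip(scores, is_active) if not a]
    wins = sum((a > d) + 0.5 * (a == d) for a in actives for d in decoys)
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, is_active, cutoff=0.1):
    """Fraction of actives in the top-scoring cut-off, relative to random."""
    n_top = max(1, round(len(scores) * cutoff))
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    found = sum(active for _, active in ranked[:n_top])
    return (found / n_top) / (sum(is_active) / len(is_active))


# Hypothetical model scores (higher is better) and labels (1 = active).
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
is_active = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

auc = roc_auc(scores, is_active)
ef20 = enrichment_factor(scores, is_active, cutoff=0.2)
print(round(auc, 3), round(ef20, 3))
```

Computing the same metrics on the training set and on the test set after each added descriptor gives a simple numerical check for over-fitting.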

• A second descriptor can be automatically added by clicking on Auto add again. The descriptor that comes up will be the one that provides the best discrimination for the residual explanatory data, the effect of previously added descriptors being taken into account. You will be asked whether you want to add this descriptor to the model. When the new descriptor is added, however, descriptor coefficients are re-calculated for all components of the model, and not just for the last descriptor added to the model, and a new enrichment curve and new enrichment metrics calculated.
• Models can also be created manually. Drag-and-drop a descriptor from the lower left window to the upper. Further descriptors can be manually added to the resulting model.
• It may sometimes make sense to add the first term or couple of terms manually and then Auto add subsequent terms. For instance it might make sense to have the first term in the model the scoring or fitness function that was used to drive the original docking. Then further terms added to the model indicate what aspects the scoring function failed to take account of when ranking actives and inactives.
• It is possible to remove descriptors from the model. Right-click on the descriptor in the top left area and select Remove. The model will be recalculated with that descriptor removed.
• Models in which enrichment metrics continue to rise for the Training set, but start to fall for the Test set are likely to be overfitted. Try removing some of the less important descriptors from the model in this case.
• Once a satisfactory model has been found it can be used to create a new descriptor over all members of the relevant dock set. To do this click on the Save button at the centre of the Regression window. You will be asked to enter a name for the new descriptor.
• Models, graphs and metrics can be removed by clicking the Reset button. Clicking the Clear button also clears the descriptors pane at bottom left.
• It is possible to customise the enrichment curves:
• To remove one curve from a set of superimposed curves, select with left-click the relevant row in the metrics area above, and, from the resulting pulldown, choose Remove.
• The same pull-down menu also provides options to change the Background colour, the Colour of the curve, and the Thickness of the curve.

9.3 Model Building on the basis of Activity Data
• First highlight all the descriptors of possible importance, in the Descriptors pane or Data Explorer, select Regression from the pull-down adjacent to the Send Choices to button, and click on Send Choices to. This will open up the GoldMine Regression window. The GoldMine Regression window contains two tabs. The one displayed on opening the window is the Data Set pane. It is here that we define the training and test sets that we wish to use.

• A training set can be selected from the Data set pull-down menu in the Training data area.
• A test data set can be selected from the Data set pull-down in the Test data area. No set is required to be set in the Test set area.
• Then toggle the Activity data radio button in the Training data area. If Activity data is chosen then a QSAR model will be generated using one descriptor as the Activity measurement. This descriptor requires a measure of activity to be entered for all members of a selected training set.
• If Activity is highlighted, then it will be necessary to drag-and-drop a descriptor representing activity into the space next to the Activity radio-button. Activity data should be available for all members of the training set.
• Now click on the Regression tab at the top of the page. The Regression pane is now visible and the descriptors you loaded from the Descriptors pane are present at the bottom left.

• Models are built up by incrementally adding new descriptors to a model. Models can be created automatically, manually or by a combination of both methods.
• To create a model automatically click Auto add at the bottom left. A window will come up suggesting the next descriptor to add to the model. The descriptor that is chosen is the one that correlates best with the activity data. Correlation and significance figures are quoted. You have the option to Accept this descriptor, reject it and Try again, or Abandon the model building process.
• Once a descriptor has been selected it will appear in the upper left box. The model generated up to this point is represented in the coefficients for the descriptors making up the model in the Regression model area. The standard error on these coefficients is also given, as are significance T and P values. Additionally a Predicted Vs. Actual scatter plot is graphed in the middle area.
• If a Test set is also specified then a corresponding scatterplot is generated in the right-hand bottom area.
• A second descriptor can be automatically added by clicking on Auto add again. The descriptor that comes up will be the one that provides the best discrimination for the residual explanatory data, the effect of previously added descriptors being taken into account. You will be asked whether you want to add this descriptor to the model. When the new descriptor is added, regression coefficients are re-calculated for all components of the model, and not just for the last descriptor added to the model, and a new scatter plot and correlation coefficient calculated.
• Models can also be created manually. Drag-and-drop a descriptor from the lower left window to the upper. Further descriptors can be manually added to the resulting model.
• It may sometimes make sense to add the first term or couple of terms manually and then Auto add subsequent terms. For instance it might make sense to have the first term in the model the scoring or fitness function that was used to drive the original docking. Then further terms added to the model indicate what aspects the scoring function failed to take account of when ranking actives.
• It is possible to remove descriptors from the model. Right-click on the descriptor in the top left area and select Remove. The model will be recalculated with that descriptor removed.
• Models in which the correlations for the test set are significantly worse than that for the training set and get even worse with addition of more variables, are over-fitted. Try removing some of the less important descriptors from the model in this case. Also look out for when the standard error on a coefficient is more than half the magnitude of the coefficient itself. The corresponding descriptor may not be contributing well to the model.
• Once a satisfactory model has been found it can be used to create a new descriptor over all members of the relevant dock set. To do this click on the Save button at the centre of the Regression window.
• Models, graphs and metrics can be removed by clicking the Reset button. Clicking the Clear button also clears the descriptors pane at bottom left.
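The incremental model building described above can be sketched in plain Python. This is an illustration only, under stated assumptions: the guide does not say which criterion Auto add uses internally, so the hypothetical `auto_add` below simply picks the descriptor that gives the highest R squared once it joins the terms already chosen. All function names and data values are invented. The sketch also reports the standard error of each coefficient, so that the "more than half the magnitude" check can be applied.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    size = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def solve_ols(columns, y):
    """Fit y on an intercept plus descriptor columns.

    Returns (coefficients, standard_errors, r_squared); index 0 is the intercept.
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    rss = sum((y[k] - sum(X[k][j] * beta[j] for j in range(p))) ** 2
              for k in range(n))
    mean_y = sum(y) / n
    tss = sum((value - mean_y) ** 2 for value in y)
    sigma2 = rss / (n - p)
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, se, 1.0 - rss / tss


def auto_add(candidates, chosen, y):
    """Assumed criterion: the unused descriptor giving the highest R squared."""
    best_name, best_r2 = None, -1.0
    for name, column in candidates.items():
        if name in chosen:
            continue
        r2 = solve_ols([candidates[c] for c in chosen] + [column], y)[2]
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name


def poorly_determined(beta, se):
    """Indices of descriptor terms whose standard error exceeds half the coefficient."""
    return [i for i in range(1, len(beta)) if se[i] > 0.5 * abs(beta[i])]


# Invented training data: an activity value and three descriptors for eight poses.
activity = [5.1, 6.0, 6.8, 8.1, 9.0, 9.9, 11.2, 11.8]
descriptors = {
    "fitness": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "hbond": [2.0, 1.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
    "buried_area": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
first = auto_add(descriptors, [], activity)
beta, se, r2 = solve_ols([descriptors[first]], activity)
t_values = [b / s for b, s in zip(beta, se)]
print(first, beta, se, t_values, r2, poorly_determined(beta, se))
```

Note that every coefficient is refitted each time a term is added, which mirrors the statement that regression coefficients are re-calculated for all components of the model. P values are omitted because they need a t-distribution routine that the standard library does not provide.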


10 Visualising and Refining Selections of Docking Poses
• Docking poses from Selections can be brought into the Hermes visualiser and spreadsheet for visualisation alongside the associated protein model.
• To view a set of docking poses, in Hermes, first make a Selection on a GoldMine DB via the GoldMine Controller. Then use View to transfer over the poses making up the Selection. The associated protein model will also be displayed.
• Descriptor data which are included within the Selection are tabulated as columns within the spreadsheet. The spreadsheet data can be sorted and/or colour coded according to selected data columns. Poses can be grouped. Poses can also be manually selected and can then be re-exported with a tailorable number of fields of associated data.
• Further information on using the GoldMine spreadsheet functionality of Hermes can be found within the Hermes documentation (See Appendix B: Special Details for GOLDSuite Users).

11 Interactive Docking and Analysis: Using the GOLD Server
11.1 Introduction
• It is possible within GoldMine to take a selection of ligand poses from a GoldMine analysis and submit them seamlessly to GOLD for docking. The results can then be saved within output files or alternatively they can be returned to GoldMine and saved within a GoldMine DB. To do this we use the Server option within GOLD.

11.2 Sending poses from GoldMine to the Server
• First it is necessary to create or read into Hermes a valid GOLD configuration file. Since it will often be the case that you will be using a tried and tested docking protocol to re-dock selected ligands, reading in an existing file is probably the usual case.
• Check the Select Ligands pane from the Configuration Options and delete any ligand file specifications made there. It is also possible to adjust the number of GA runs that will be carried out on each ligand. The default is 10.
• Select GoldMine from Configuration options. Within the pane that comes up it is possible to set GOLD up so that it either receives or sends docking poses from GoldMine.
• Click on the Get ligands from GoldMine tick box. An appropriate machine Hostname and a Port number should automatically come up in the relevant text boxes.

• Click on Run GOLD. GOLD will start and the protein initialisation step will run. Look under the tabbed panes Messages and gold_protein.log to follow this step. The GOLD server is now activated and ready to receive ligands for docking.
• Open a GoldMine DB and make a selection of docking poses of interest within the Selection Manager window.
• At the top right of the Selection Manager is an option marked Dock in GOLD. Click on this button.
• A progress window is now displayed. The top segment displays the number of ligands queued for docking, and the middle segment informs which ligand is currently being docked. The docking run is underway and results are being transferred once the word Waiting appears alongside the host name in the middle segment.
• A second selection can be created and then submitted to the same GOLD process. The results will be treated as though they are contiguous with the first set, i.e. the index numbers of the new solutions will continue from the old end point and, if solutions were being saved in a single molecule file, the solutions of the second set will be appended to that same file.

11.3 Receiving poses back into GoldMine
• It is possible to use the GOLD server so that solution poses are saved within a GoldMine. The poses are saved in a new dock set and the protein file used in the docking is the protein model associated with that dock set. The machine that the GOLD job is running on need not be the machine that GoldMine is running on. Indeed several machines can simultaneously download their results into one GoldMine.
• First it is necessary to create a valid GOLD configuration file. We will do this via Hermes. Since it will often be the case that you will be using a tried and tested docking protocol to redock selected ligands, reading in an existing file is probably the usual case.

• Select GoldMine from Configuration Options.
• Click on the Send ligands to GoldMine tick box. A machine Hostname and a Port number appropriate to the machine you are working on should automatically come up in the relevant text boxes. However you may wish to send the results to another machine, in which case edit these boxes appropriately. It will also be necessary to enter in the Dock set text box the name of the new dock set to which the poses need to be sent.
• If the results are being sent to GoldMine running on another machine then it is necessary to do the following:
• Open GoldMine on the second machine. It is probably safest to do this before the GOLD job is started.
• Under the GoldMine pull-down menu on the top Hermes menu bar, select Receive Ligands from GOLD. The port number should be set automatically in the resulting pull-down. It should be the same as the port number set in the gold.conf file. The GoldMine to which the docked results will be appended should be selected via the Browse option. Click on OK.
• Click on Run GOLD. If the GoldMine is open then the docking job will proceed and poses will be saved within the GoldMine. The bottom segment of the progress window will show how many ligands have been processed and saved in the GoldMine DB.
• The GoldMine should be populated with docked poses. It is important that GoldMine not be closed before the docking is completed as otherwise the connection will be broken and the GOLD job will crash.
• It is possible to have a GoldMine open whilst running the Server docking job. It is also possible to set the server up to both send and receive poses to and from an opened GoldMine.


12 Acknowledgements
GoldMine is based in part on the work of the Qwt project (http://qwt.sf.net)

SQLite (http://sqlite.org)
All of the deliverable code in SQLite has been dedicated to the public domain by the authors. All code authors, and representatives of the companies they work for, have signed affidavits dedicating their contributions to the public domain and originals of those signed affidavits are stored in a fire safe at the main offices of Hwaci. Anyone is free to copy, modify, publish, use, compile, sell, or distribute the original SQLite code, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

PostgreSQL Database Management System (http://www.postgresql.org) (formerly known as Postgres, then as Postgres95)
Portions Copyright © 1996-2005, The PostgreSQL Global Development Group
Portions Copyright © 1994, The Regents of the University of California
Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies.
IN NO EVENT SHALL THE UNIVERSITY OF CALIFORNIA BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF THE UNIVERSITY OF CALIFORNIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE UNIVERSITY OF CALIFORNIA SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND THE UNIVERSITY OF CALIFORNIA HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.


13 Appendix A: Tutorials
13.1 Tutorial 1. Creating a GoldMine
• We will now create a GoldMine from a docking run and then append a second dock set to this GoldMine. The example we will use comes from Tutorial 7 of the GOLD Manual. This is a cross-docking experiment using two structures of the oestrogen receptor protein.
• We will first create a GoldMine from a docking run in which the ligand out of the oestrogen receptor structure with PDB code 1l2i is docked into the oestrogen receptor structure 1x7r.
• Select the Create command in the GoldMine pull-down menu of Hermes.
• Now we will create a GoldMine with SQLite. Click on the Browse... button and check that the GoldMine to be created is placed in the examples/tutorial1 directory. Type ER_CrossDock.db in the File name: box, and click on Save. Then hit Next>.
• We now are asked to put in a name for the dock set we are importing. Type in Cross_Dock_1. We will be creating the GoldMine from a GOLD run so we can leave the file format in the default setting. Choose the GOLD configuration file gold_1x7r_1l2i.conf from the GOLD Suite/GoldMine/examples/tutorials1 sub-directory and then click on Next. Notice, because we are importing from a GOLD configuration file, the appropriate protein structure is already selected for us. However we would have to choose that ourselves if creating a GoldMine in other ways.
• Click on Finish. The data will be imported and you will be asked whether you want to open the new GoldMine. Click on No.
• We will now import results from a GOLD run which has the protocol changed to take account of the fact that the 1x7r binding site is more constrained than the 1l2i binding site. Go back to the GoldMine pull-down and select Create again.
• We wish to append the new dock set to the existing GoldMine. Click the Old GoldMine toggle button and then ensure that the correct GoldMine is present in the File name text box. Click on Next.
• We wish to create a new dock set so click the toggle button next to New dock set if not already set. Call this dock set Cross_Dock_2. This time import the GOLD configuration file gold_1x7r_1l2i_SP.conf and then click on Next. Hit Next followed by Finish, again.
• This time when asked whether you wish to open the new GoldMine, click in the affirmative.
• To view both sets of docked structures take the two Gold_Goldscore_Fitness descriptors into the AND box of the Selection Manager and click on View.
• Superimpose the poses from each dock set which have the highest GoldScore. The first pose can be selected by clicking on the relevant row in the GoldMine spreadsheet. To select the second, click the relevant row with Control held down. You may find it convenient to group the solutions by Protein. This ends the tutorial.

13.2 Tutorial 2. Using GoldMine to analyse the results of a Virtual Screen
13.2.1 Introduction
• The object of this tutorial is to illustrate how GoldMine can be used to analyse a large amount of data obtained from a structure-based virtual screen against the Cox2 protein. GoldMine will be used to combine data from two virtual screens carried out with GOLD, one run with the GoldScore scoring function, one with the ChemScore scoring function.
• This tutorial will demonstrate how to use two scoring functions in the analysis to generate better enrichment rates than using one alone. In addition it will be demonstrated how parameters that further describe features of the protein-ligand interface can be calculated and used in the analysis.
• This tutorial will also illustrate how to create a GoldMine and how to append to an existing one.

13.2.2 The Virtual Screen
• We use as source material for this experiment a Cox2 protein structure and set of ligands reported on by Hongming Chen et al (H. Chen, P. D. Lyne, F. Giordanetto, T. Lovell, J. Li, J. Chem. Inf. Model., 46, 401-415, 2006).
• The Cox2 structure used is PDB code 1cx2 and is the file protein.mol2 in the GOLD Suite/GoldMine/examples/tutorial2 subdirectory. The binding site definition is that used by Chen et al. The set of ligands incorporates 160 structures of Cox2 actives which represent 125 molecules. Where there is more than one structure per molecule it is because it is represented in different tautomeric or protonation states or with different ring conformers. In addition there are included 3982 decoy structures with similar molecular weight distribution and physical attributes, but which are not believed to be Cox2 active (this has been cut down approximately 10 fold from the list in the original paper).
• The GOLD protocol used for this screen utilised the auto GA settings option, setting search efficiency at 10%. This is a protocol which optimises the trade off between speed and accuracy and is recommended for Virtual Screening. One binding pose has been saved per ligand.

13.2.3 Opening a GoldMine
• A GoldMine has already been created which contains the set of docking results from both the GoldScore and ChemScore runs. This is Cox2.db in the tutorial library. We will open this.
• Start Hermes by clicking on the Hermes icon.
• Under the GoldMine pull-down menu click Open. In the resulting window click Browse and select Cox2.db from the examples/tutorial_1 subdirectory. (Note: this is a database created using SQLite, which is the default option). Then hit Connect. This will open up the GoldMine Controller window.
• Two dock sets named Cox2_CS and Cox2_GS should be shown. Click on the + sign next to Cox2_GS. This will reveal the descriptors that were brought in from the GoldScore docking results file used to create this GoldMine. All components that make up the Fitness score are available as well as the GoldScore itself (Gold_Goldscore_Fitness). Also available is a text field, NAME, which contains the text identifier of each structure.
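Because the tutorial GoldMine is an SQLite database, its file can in principle be inspected outside Hermes with any SQLite client. The sketch below uses Python's standard sqlite3 module to list the tables in such a file. The internal table layout of a GoldMine is not described in this guide, so the helper `list_tables` is purely illustrative and makes no assumption about table or column names. The file is opened read-only to avoid accidental changes.

```python
import pathlib
import sqlite3


def list_tables(db_path):
    """Return the names of the tables stored in an SQLite database file."""
    # Build a file: URI with mode=ro so the database cannot be modified.
    uri = pathlib.Path(db_path).resolve().as_uri() + "?mode=ro"
    connection = sqlite3.connect(uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Calling `list_tables` with the path to Cox2.db would return whatever table names that file contains. It is advisable to work on a copy of the file rather than on a GoldMine that is currently open in Hermes.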

13.2.4 Graphing Numeric Data
• It is possible to carry out simple graphical analysis of descriptors contained within a dock set.
• Highlight Gold_Goldscore_Fitness and then right-click. Select Histogram. The histograms window will be opened and a histogram for this descriptor displayed.
• Highlight both Gold_Goldscore_Fitness and Gold_Goldscore_External_Vdw and then right-click. Select the scatter plot option to generate a scatter plot for these two descriptors. Some correlation between them can be observed.
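The degree of correlation seen in a scatter plot can be put on a numerical footing with the Pearson correlation coefficient. The following is a minimal sketch: the descriptor values are invented for illustration and the function is not part of GoldMine.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented values standing in for two descriptor columns of a dock set.
fitness = [42.0, 47.5, 51.2, 55.8, 60.1]
external_vdw = [30.1, 33.0, 37.4, 38.9, 44.2]
print(round(pearson_r(fitness, external_vdw), 3))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 correspond to a scatter plot with no visible trend.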

• Highlight both Gold_Goldscore_Fitness and Gold_Goldscore_External_HBond. Click on Scatter Plot. This time the graph shows little correlation between the descriptors. It appears that Cox2 has an active site that has more hydrophobic character than hydrogen bonding character.

13.2.5 Analysis by Scoring Function and Calculation of Enrichment rates
• We will now carry out some simple analyses of the dataset and calculate enrichment rates based on scoring function alone.
• Click on Gold_Goldscore_Fitness and, with Control held down, click on NAME. Then hit the Send to ranges button.
• Each of the selected descriptors should be shown within a box border. The Gold_Goldscore_Fitness range box will show the range of values this descriptor covers. We will now change the range of this descriptor so that only the top scoring 10% of ligands are selected. Type into the left hand range box 53.786 and then hit return. The number of ligands will drop to 414.
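The lower range limit typed in above is simply the score of the last ligand that still falls inside the top 10% of the ranked list. A minimal sketch of how such a cut-off can be found is given below. The scores are invented and the rounding convention (rounding the number of ligands down) is an assumption, although it is consistent with 414 ligands being kept out of the 4142 in this data set.

```python
import math


def top_fraction_cutoff(values, fraction=0.10, higher_is_better=True):
    """Return the score of the last value inside the best `fraction` of the list."""
    ranked = sorted(values, reverse=higher_is_better)
    keep = max(1, math.floor(fraction * len(ranked)))
    return ranked[keep - 1]


# Invented fitness scores for twenty ligands.
scores = [31.2, 58.4, 44.9, 39.0, 61.7, 47.3, 35.5, 52.8, 41.1, 49.6,
          36.8, 55.0, 43.2, 38.4, 46.0, 50.9, 33.7, 57.1, 40.3, 45.5]
cutoff = top_fraction_cutoff(scores, fraction=0.10)
selected = [s for s in scores if s >= cutoff]
print(cutoff, len(selected))
```

For a descriptor where lower values are better, pass `higher_is_better=False` and keep the values that are less than or equal to the returned cut-off.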

• There is an alternative way of setting the range. Right-click with the mouse within the box border for ...Goldscore... Click Set by % on the pull-down menu. The default cut of the top 10% ligands is what we need. The option to select either ligands or solutions would be relevant if we were dealing with a dataset containing more than one docking pose per ligand. However here we can ignore it. Click on OK. Note: this functionality is also available via the Options... button.
• We will use the text searching capability within GoldMine to select out the active molecules. In the text box under the NAME descriptor type in LIG_COX2%. The % sign is a wildcard. On hitting return this should give 160 ligands. Now click on Show on histograms. Recall we created a histogram for Gold_Goldscore_Fitness earlier? You should now see the distribution of Cox2 active molecules highlighted upon that histogram.
• Now click on Send to selections.
• This takes us to the third tab in the GoldMine Controller where we can set up Boolean analyses to generate refined lists of structures. We will only be using the AND box so, to increase screen space, toggle off the display tick boxes in the NOT and OR box.
• Click and drag the NAME objects into the AND box.
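The effect of the % wildcard in the text search can be mimicked with a regular expression. In the sketch below the identifiers are hypothetical stand-ins for NAME values, and the matching is case-sensitive. Whether GoldMine itself matches case-sensitively is not stated in this guide, so treat that as an assumption.

```python
import re


def like_to_regex(pattern):
    """Translate a pattern using % as 'any run of characters' into a regex."""
    parts = (re.escape(piece) for piece in pattern.split("%"))
    return re.compile("^" + ".*".join(parts) + "$")


# Hypothetical identifiers; the real NAME values come from the docked ligand files.
names = ["LIG_COX2_001", "LIG_COX2_002_t1", "DECOY_0457", "LIG_COX1_003"]
matcher = like_to_regex("LIG_COX2%")
matched = [name for name in names if matcher.match(name)]
print(matched)
```

Only the identifiers that begin with LIG_COX2 are kept, which is how the search restricts the selection to the known actives.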

• Click and drag the Gold_Goldscore_Fitness object into the AND box.
• Now click on Count to run the analysis. This should return 73 solutions and ligands. We can now calculate the enrichment rate from the ratio of found ligands to known ligands, divided by the fraction of cut (=73/(160*0.1)). This sum comes to 4.56. Maximum possible enrichment is 10.
• Now view the docked solutions within the active site of the protein by clicking View. The left hand pane holds the GoldMine spreadsheet. The two descriptors that we have filtered on appear as columns in this spreadsheet. Individual docked poses can be viewed by clicking on the relevant row in the spreadsheet. To see two or more poses superimposed, hold Control down when picking rows.
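The enrichment arithmetic above can be written as a one-line function. This is a sketch of the formula as stated in the text, with the function and argument names chosen here for illustration.

```python
def enrichment_rate(actives_found, actives_total, cut_fraction):
    """Fraction of known actives recovered, divided by the fraction of the database kept."""
    return (actives_found / actives_total) / cut_fraction


rate = enrichment_rate(actives_found=73, actives_total=160, cut_fraction=0.10)
print(rate)                              # about 4.56
print(enrichment_rate(160, 160, 0.10))   # upper limit for a 10% cut
```

The second call shows why the maximum possible enrichment is 10: if every known active fell inside a 10% cut, the recovered fraction would be 1 and the result 1/0.1.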

• Note: It needs to be appreciated that the enrichment we have calculated doesn’t represent a true enrichment rate because some active molecules are represented by more than one ‘ligand’ in the dataset. You can calculate a true enrichment rate by counting up, in the ‘GoldMine’ tab of the Protein Visualiser window, the number of unique molecules represented. Each unique molecule is designated by the number immediately following ‘COX2’ in the ligand name. The total number of active molecules in the test-set is 125 and the true enrichment rate is somewhat higher at 4.96.
• Hit Clear and then go back to the Descriptors tab (top left), click the + next to Cox2_CS to expand the descriptor tree, highlight Gold_Chemscore_Fitness and NAME, and click on Send to ranges. Proceed as before selecting the top 10% of the database (the figure in the left hand range window will read 32.619). In order to restrict the query to Cox2 known actives use the same text as before, LIG_COX2%, in the box under the NAME descriptor type. Take the descriptors into the Selection manager tab and carry out the Boolean AND analysis once more. This time the analysis should return 53 ligands which corresponds to an enrichment rate of 3.31, not as good.

13.2.6 Analyses using two Scoring Functions
• It was first recognised by Charifson et al (P. S. Charifson, J. J. Corkery, M. A. Murcko, W. P. Walters, J. Med. Chem., 42, 5100-5109, 1999) that taking account of more than one scoring function when ranking a Virtual Screen can be of benefit. We will attempt to do something similar, though with the difference that they rescored with other scoring functions poses obtained using a single function, whereas we will analyse poses both docked and scored by separate scoring functions.
• The ChemScore fitness function is made up of a Delta G term (DG) and an internal energy and clash term. The Delta G term is calibrated against binding affinities of a test set of ligand/protein complexes for which the binding affinity is known. The other terms are present to avoid highly unrealistic binding poses being returned. Arguably therefore the Delta G term might be more appropriate to use than the Chemscore_Fitness term to estimate an affinity of binding and to rank poses, so long as we can remove highly unreasonable but high scoring poses by other means. We will use this as one of the scoring functions in the analysis.
• Expand the Cox2_CS list of descriptors. Go to the Descriptors tab, and highlight Gold_Chemscore_DG. Take this descriptor all the way through to the Selection Manager tab and into the AND window.
• In the AND window of the Selection Manager you should have the Gold_Chemscore_Fitness and NAME objects. Place the cursor in the box for the Gold_Chemscore_Fitness object and right-click. Choose Remove from the pull-down menu.
• Drag into the AND window the Gold_Goldscore_Fitness and Gold_Chemscore_DG objects.
• Because we now have two numeric descriptors to filter on, it becomes harder to make a 10% cut of the dataset and some trial and error is called for in setting the ranges of individual descriptors. It is possible to change the descriptor ranges whilst in the Selection Manager tab and this makes the process easier. Here we supply the correct cut-offs. Select the top 28.1% of ...Goldscore... descriptor. Select the bottom 28.1% cut of the Gold_Chemscore_DG descriptor. We select the bottom 28.1% because this descriptor estimates energy of binding and so high negative is better.
• Now toggle off the tick box in the corner of the Name descriptor area. This will prevent its inclusion in the Boolean analysis. Click on Count. The ligand count should number 415, a tenth of the dataset near enough.
• Toggle on the tick box in the corner of the Name descriptor area to activate its inclusion, and recount. This gives 87 active ligands returned which corresponds to an enrichment rate of 5.43.

13.2.7 Incorporating New Descriptors into the Analysis
• It is possible within Hermes and GoldMine to calculate a large number of additional descriptors from the docking poses. These descriptors are the same descriptors that previously could be calculated with SILVER.
• Carefully selected descriptors may be able to represent desirable features of binding that are not sufficiently well represented by the scoring function. These are often features that are of particular importance to the protein under study, such as a hydrogen bond that needs to be made to a particular residue. Filtering of docking results by using one or more such descriptors in combination with a scoring function may prove beneficial.
• We will first calculate a small number of simple descriptors for our docking set, and then use some precalculated descriptors in the actual analysis.
• First we need to make a selection of all relevant docking poses available for descriptor calculation. Then we need to take them into Hermes. To save time, we will only do this for the active set of molecules. Clear the entries in the ‘AND’ box of Selection Manager. Drag in the Name object from the Cox2_CS dock set. Then click on Save Selection. Type in Actives and hit Save. Hit View.
• Hit the Descriptors button on the top level menu bar of the Hermes window. Click on Define... on the pull-down-menu and then on Add Simple Descriptor... In the Descriptor Name: box type Descriptor 2. This is so as not to overwrite descriptors already calculated for this GoldMine.
• We will just calculate two descriptors, the molecular weight and the number of exposed hydrophobic atoms. Toggle on the boxes next to the relevant text labels in the Ligand properties tab, and then hit the define button at the base of the page.
• Now hit the Descriptors button on the top level menu of Hermes and click on Calculate... In the Calculate descriptors Input ligands dialog area, toggle against GoldMine selection and choose Actives from the pull-down menu to the right. In the Output options area toggle This GoldMine. Now hit Run.
• Once the calculation is complete, go to the Descriptors tab in the GoldMine Controller and look in the expanded descriptor tree for Cox2_CS. The two new descriptors will be visible. Now take both these descriptors through to the Selection manager tab and into the AND box. Now hit Count and then View. This will add columns for both descriptors to the GoldMine Spreadsheet.
• In the Cox2_CS descriptor tree in the Descriptors tab there is one additionally calculated descriptor, Descriptor_1_occluded_ligand_donor_count. We will now use this in an analysis. Take Descriptor_1_occluded_ligand_donor_count into the Descriptor Range window, edit the right-hand range window to read 0, and then take the object into the Selection manager tab.
• Set up a query with only ...occluded_ligand_donor_count and Gold_Chemscore_DG in the AND box. Set the Gold_Chemscore_DG range to cut down to the best (i.e. lowest) 24% and verify that this query retrieves 10% of the ligands.
• Now bring in the NAME object into the AND box, set the test criterion to be LIG_COX2% as before, and find out how many active ligands are retrieved. This query should retrieve 84 ligands which corresponds to an enrichment rate of 5.25. If you check Gold_Chemscore_DG alone you will find only 29 ligands are retrieved at 10% cut off, an enrichment of 1.8. So we do considerably better with the combined search.
• In this example we’ve seen that filtering with a post-calculated descriptor in combination with a scoring function can be a useful way of improving enrichment rates. Care needs to be taken to ensure that only relevant descriptors are used however. You can, as an optional exercise, calculate other descriptors for yourself and see if they too can be usefully employed.

13.2.8 Exporting Subsets of Solutions
• We can export all 84 solutions found in the search carried out previously. Click on View to transport these solutions to the visualiser and then under the File pull-down menu in the top menu bar of the visualiser, select Export ... Alternatively save the 84 poses as a selection in the Selection manager, highlight the new selection in the Name pane of the Selection manager and select Export via the Options... button.
• Highlight Identifier, Gold_Chemscore_DG and Descriptor_1_occluded_ligand_donor_count in the left hand window and click Add>>. These fields will appear in the right hand window. They will be included as tags in the structure file we will generate.
• Ensure that the bottom left text box reads MACCS(*.sd *.sdf). Select a file name and a save location in the bottom right text box.
• Click on Save. This will create a MACCS format file with all 84 docked poses in it and with the data we have specified associated with each structure.
• We could also have saved the structures in concatenated MOL2 format. Alternatively we could have saved the data fields only in .csv format. Lastly we could just have saved a list of compound names. These options are all available in the pull-down menu at the top left of the Export window.
• Once a database has been filtered down to a small enough number of poses, it is possible to manually view the remainder and to pick out those of particular interest. Here we will manually pick out some individual poses and export them. We will just examine the related structures which have numbers between 100 and 109 and pick out unique examples of each.
• Click on the title bar of the Identifier column under the GoldMine tab in the visualiser. This will sort the solutions by ligand name.
• With the mouse in the GoldMine spreadsheet, right-click and select Start picking from the resulting pull-down menu. A new column will appear entitled Picks.
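Data fields exported as tags in a MACCS (SD) file can be read back by other software. The sketch below is a minimal tag reader that relies only on the general SD file conventions: records end with a line of four dollar signs, and each data item starts with a header line of the form "> <TAG>" followed by its value and a blank line. The sample record and its values are invented, and it is an assumption that the exported tag names match the descriptor names selected in the Export window.

```python
def read_sd_tags(text):
    """Return one {tag: value} dictionary per record of an SD file's text."""
    records = []
    for block in text.split("$$$$"):
        if not block.strip():
            continue
        tags, current = {}, None
        for line in block.splitlines():
            if line.startswith(">"):
                # Data header lines look like "> <TAG_NAME>".
                start, end = line.find("<"), line.rfind(">")
                current = line[start + 1:end] if 0 <= start < end else None
                if current is not None:
                    tags[current] = ""
            elif current is not None:
                if line.strip():
                    tags[current] = (tags[current] + "\n" + line).strip()
                else:
                    current = None  # a blank line ends the data item
        records.append(tags)
    return records


# Invented single-record example with an empty molecule block.
sample = """ligand_a
  demo

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <Identifier>
ligand_a

> <Gold_Chemscore_DG>
-31.5

$$$$
"""
print(read_sd_tags(sample))
```

All values are returned as text, so numeric fields need converting with `float()` before they are sorted or filtered.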

• Highlight the top solution and type 'a' at the keyboard. A '1' will appear in the Picks column. Repeat for as many other solutions as you wish from the top ten.
• Right-click in the GoldMine spreadsheet and select Stop picking.
• Right-click in the GoldMine spreadsheet and select Export picks. Choose the option To a file.
• The same Export selection window will appear as before. Select which descriptors you wish to associate with the structures, and export the concatenated structure file containing the picked structures.

13.2.9 Working with the GoldMine Spreadsheet in Hermes
• We will now further explore some of the functionality of the GoldMine spreadsheet. First go to the GoldMine tab in the Hermes Visualiser. Click on Customise and then highlight all the descriptors in the right-hand box, other than Identifier. Now click on Remove and OK. This will refresh the spreadsheet in preparation for the next step.
• Go to the GoldMine Selection Manager and clear the AND box in the Selection Manager. Bring into the box the objects Gold_Chemscore_Fitness and Descriptor_1_occluded_ligand_donor_count. Use right mouse-click and reset to set the ranges at their starting values. Now click on Count and then View.
• Highlight Gold_Chemscore_Fitness and Descriptor_1_occluded_ligand_donor_count in the left-hand column and click on Add>> followed by OK. The two descriptors will now be tabulated in the spreadsheet.
• It is possible to sort the data by column. You can do this via the Sort... button, which gives you the option to hierarchically sort by three separate descriptors. Alternatively click on any of the column headers to carry out a sort by that descriptor (click again on a header to reverse the sort order if necessary).
• First sort the table by descending Gold_Chemscore_Fitness and then by ascending Descriptor_1_occluded_ligand_donor_count.
• It is possible to apply a temperature shading to one or more columns. To do this first click on the Colours... button.

• You can scroll down the spreadsheet to check any correlation between high scoring docking poses and low numbers of occluded donors.
• It is often useful to group solutions in alternative ways. Click on the Customise... button and bring the Gold_Goldscore_Fitness object into the right-hand window. Hit OK.
• Click on the pull-down menu next to Group by:. Select Ligand. Each ligand now has a separate branch and, when the branch is expanded, the results for both GoldScore and ChemScore runs are displayed together.
• If you select instead Protein only, each dock set is displayed separately.
• You can also hierarchically order the grouping to be Protein then Ligand, or vice versa. In addition you can set up a customised grouping according to any descriptors tabulated.
• Now close the GoldMine by picking the appropriate command from the GoldMine pull-down menu in Hermes.
This ends the tutorial.
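The enrichment rates quoted in this tutorial compare how often actives turn up in a filtered subset with how often they turn up in the whole database. The sketch below uses the common definition of an enrichment factor; whether GoldMine computes its figure in exactly this way is an assumption, and the counts are hypothetical rather than taken from the Cox2 data.

```python
def enrichment_factor(actives_retrieved, n_selected, n_actives_total, n_total):
    """Active rate inside the selection divided by the active rate overall.

    Assumed definition; GoldMine's own calculation may differ in detail.
    """
    if n_selected <= 0 or n_actives_total <= 0 or n_total <= 0:
        raise ValueError("counts must be positive")
    rate_in_selection = actives_retrieved / n_selected
    rate_overall = n_actives_total / n_total
    return rate_in_selection / rate_overall


# Hypothetical screen: 1000 ligands, 50 of them active.
# A filter keeps 100 ligands (10%), 30 of which are active.
ef = enrichment_factor(actives_retrieved=30, n_selected=100,
                       n_actives_total=50, n_total=1000)
print(round(ef, 2))
```

A value of 1 means the filter does no better than picking ligands at random; the maximum possible value at a 10% cut off is 10.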

13.3 Tutorial 3. Using the Descriptor Calculator to carry out Consensus Scoring

13.3.1 Introduction
We have already used one form of consensus scoring in Tutorial 2, when we used two scoring functions to filter down our list of actives. It was found to be an effective way of improving our enrichment rate. This method is a form of Consensus by Vote. However some trial and error was required in order to make the scoring function cutoffs return 10% (or 1%) of the dataset. We will now look at alternative consensus strategies which can be set up quickly and easily using the Descriptor Calculator.

13.3.2 Consensus by Rank
• We will generate a function that is the sum of Rank Scores according to two different scoring functions. We will then use this function to do a consensus filtering.
• Open up the Cox2.db database used in Tutorial 2. Expand both the Cox2_GS and Cox2_CS trees in the Descriptors pane. You may have to click on the + sign next to each dock set name to visualise them. Highlight both Gold_Goldscore_Fitness and Gold_Chemscore_DG and then click on Arithmetic at the top. This will take you into the Descriptor Calculator. The two descriptors we have picked appear in the panel at the top left.
• Click on Rank() at the top right hand corner. Then click on Cox2_GS.Gold_Goldscore_Fitness. The function as it is currently set up will generate a ranking based on the fitness score, with the lowest rank corresponding to the best score. However we don't need to calculate this function yet.
• Place the cursor to the right of the expression in the lower box and click on "-" (Note: you can do this by typing "-" as well). Click on Rank() again and then select Cox2_CS.Gold_Chemscore_DG.
• We have subtracted the rank of Chemscore_DG because, unlike Goldscore_Fitness, lower scores are better. In order to avoid having negative numbers in the final function, place the cursor to the far right and, using either the calculator functions or by typing, add + 4141 to the end of the sum. 4141 equals the number of structures in the dataset.
• Type Cox2_GS.GS_CS_Rank in the New descriptor name box. The first part of this name will ensure the resulting descriptor is placed in the Cox2_GS dock set. Now click on Calculate.

• A warning message will appear. This is because, when using descriptors from more than one dock set, and when more than one pose has been saved per ligand, it is easy to generate nonsensical descriptors. It is safe to ignore the warning in this case. Hit OK.
• The new descriptor will be added to the Cox2_GS dock set. Go back to the Descriptors pane and highlight from this dock set NAME, Gold_Goldscore_Fitness and GS_CS_Rank. Send these descriptors to the Selection Manager.
• Take the Gold_Goldscore_Fitness and NAME objects into the AND box. Type LIG_COX2% into the NAME text box.
• Filter to the top 10% of solutions according to fitness score and calculate the intersection of the two fields. 73 solutions should be returned, the same as in Tutorial 2.
• Now filter to the top 1% of solutions according to fitness score and do the intersection. 10 solutions should be returned.
• Now do the same using GS_CS_Rank as the criterion for best solution. It will be necessary to select the bottom 10% and 1% according to this criterion.
• 91 solutions are obtained at the 10% cut off, which is slightly better than we managed using the consensus scheme in Tutorial 2 (87). 12 solutions are returned at the 1% cut off.
• This consensus scoring scheme therefore does show some advantage over using the GoldScore fitness function on its own. Further exercises that you can carry out include seeing how the rank consensus function performs when compared with Gold_Chemscore_DG. You can also calculate a new rank consensus function using Gold_Chemscore_Fitness instead of Gold_Chemscore_DG and see how it performs.
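The rank consensus expression built above, Rank(Gold_Goldscore_Fitness) - Rank(Gold_Chemscore_DG) + N, can be sketched outside GoldMine as follows. This is an illustration only: it assumes that Rank() gives rank 1 to the largest value (consistent with the text, where the lowest rank is the best fitness), it breaks ties by input order, and the scores are hypothetical.

```python
def rank_descending(values):
    """Return ranks where rank 1 goes to the largest value (ties: input order)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_consensus(goldscore_fitness, chemscore_dg):
    """Rank(GS) - Rank(DG) + N, where N is the number of structures.

    Low values are best: a good pose has a low GoldScore rank and, because
    lower DG is better, a high DG rank. Adding N keeps every value positive.
    """
    n = len(goldscore_fitness)
    gs_rank = rank_descending(goldscore_fitness)
    dg_rank = rank_descending(chemscore_dg)
    return [g - d + n for g, d in zip(gs_rank, dg_rank)]


# Hypothetical scores for four structures (N = 4 here, 4141 in the tutorial).
goldscore = [62.1, 48.3, 55.0, 40.2]
chemscore_dg = [-38.5, -22.0, -30.1, -35.7]
consensus = rank_consensus(goldscore, chemscore_dg)
print(consensus)
```

Because the best structures receive the smallest values, the cut is made on the bottom 10% or 1% of this descriptor rather than the top.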

13.3.3 Consensus by Normalised Score
• This time we will put the two scoring functions into comparable form by normalising them.
• We will want to use the same two scoring functions as above to carry out the calculation.
• In the Descriptor Calculator click on Normalise() and then select Cox2_GS.Gold_Goldscore_Fitness.
• Place a "-" to the right of the above expression. Click on Normalise() and this time select Cox2_CS.Gold_Chemscore_DG. The minus sign is needed for the same reason as before, namely to ensure the two scoring functions are not working against each other.
• Type Cox2_GS.GS_CS_Norm in the New descriptor box, and click on Calculate. Again the warning message that comes up can be ignored.
• Make 10% and 1% top cuts of the database using the new consensus function and see how many actives are found.
• 83 actives are found at 10% and only 8 at 1%. The Normalised consensus scoring function is not performing as well as the Rank consensus scoring function in this case.
• Which form of consensus function performs best will depend on the problem at hand and the behaviour of the scoring functions with the sets of ligands used. To give just one example, if one scoring function incorrectly awards particularly high scores to a certain type of inactive molecule then this type of molecule is more likely to be incorrectly picked out by the Normalised function than by the Rank function. Some experimentation may be necessary in selecting the right consensus scheme for the problem in hand.
• Further exercises that you can carry out include seeing how the Normalised consensus function performs when compared with Gold_Chemscore_DG. You can also investigate how a Normalised consensus function created using Gold_Chemscore_Fitness instead of Gold_Chemscore_DG performs.
This ends the tutorial.
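The normalised consensus can be sketched in the same way. The guide does not say how Normalise() scales a descriptor, so the version below assumes a z-score (subtract the mean, divide by the standard deviation); a min-max scaling would be an equally plausible reading. The scores are hypothetical.

```python
from statistics import mean, pstdev


def normalise(values):
    """Assumed form of Normalise(): z-score over the whole dataset."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot normalise a constant descriptor")
    return [(v - mu) / sigma for v in values]


def normalised_consensus(goldscore_fitness, chemscore_dg):
    """Normalise(GS) - Normalise(DG); high values are best."""
    gs = normalise(goldscore_fitness)
    dg = normalise(chemscore_dg)
    return [g - d for g, d in zip(gs, dg)]


# Hypothetical scores for four structures.
goldscore = [62.1, 48.3, 55.0, 40.2]
chemscore_dg = [-38.5, -22.0, -30.1, -35.7]
norm_consensus = normalised_consensus(goldscore, chemscore_dg)
best_index = max(range(len(norm_consensus)), key=norm_consensus.__getitem__)
print(best_index)
```

Unlike the rank form, the size of each score matters here, so a single structure with an extreme score from one function can dominate the consensus. That is the behaviour the text warns about.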

13.4 Tutorial 4. Creating a Discrimination Model for Rescoring using Docking Data from a set of Actives and Decoys

13.4.1 Introduction
• Ideally a general purpose scoring function should be able to distinguish binding and non-binding ligands with a high degree of accuracy. For some protein targets this can be the case. However for others it may be useful to develop a target specific scoring function that can be used to rescore docking poses prior to selecting the best structures for laboratory testing.
• Good quality test sets of actives and decoys are now freely available for a wide range of protein targets and these can be used to validate virtual screening protocols. One such collection is the DUD (Directory of Useful Decoys) set (N. Huang, B. K. Shoichet, J. J. Irwin, J. Med. Chem., 49, 6789-6801, 2006), available to download from http://dud.docking.org/. A set of decoys and a set of actives is presented for each of 40 different targets. The decoys in this collection are chosen to have very similar complexity and functionality to the actives they are associated with. So this is reckoned to be a tough set for discriminating actives from decoys.
• We will use the DUD set of actives and decoys for the antithrombotic target factor Xa in this tutorial. Factor Xa is a key serine protease in the blood coagulation cascade and has long been a target for the design of safe antithrombotic therapies.
• A GoldMine has already been created for this dataset. This can be found in GOLD Suite/ GoldMine/examples/tutorial4/FactorXa_VS.db. The DUD set has been docked against the factor Xa crystal structure 1ezq using a fast virtual screening protocol and employing GoldScore to dock the structures. Three GA attempts were run per structure and three poses were saved for each structure. Subsequently each pose was rescored using the Astex Statistical Potential (ASP) scoring function, allowing Simplex optimisation of the pose. The docking poses saved in this database are the poses resulting from optimisation with the ASP scoring function. The database contains 17673 poses in all, representing 5237 ligands. The active ligands number 142. The database contains descriptors for the parent scoring functions GoldScore and ASP, and their individual components. In addition some additional descriptors have been pre-calculated from each docking pose. These are Descriptor_1_occluded_ligand_polar_count, Descriptor_1_occluded_ligand_donor_count, Descriptor_1_occluded_ligand_acceptor_count and Descriptor_1_occluded_exposed_hydrophobic_count.

13.4.2 Creating Training and Test Sets
• Open the examples/tutorials4/FactorXa_VS.db database via the GoldMine command on the Hermes menu bar. In the Descriptors pane of the GoldMine Controller click on the + symbol next to the name of the dock set to expose the stored descriptors.
• First we need to create a subset (called in GoldMine a selection) containing all the poses of the active molecules. The names of the active molecules all start with FXa so we can identify them via a text search.
• Highlight the NAME field in the Descriptors pane and click the Send to selections pane. In the Selection manager drag and drop NAME into the AND box. Type FXa% in the text box for this descriptor and press return. Now save the selection created by clicking on Save selection. This selection can be named Actives.
• Now we need to define the sets of poses to use to train and test the model. We can only use one pose per structure for model building. Since we have three poses per structure, first we need to create a subset of poses which we think will be suitable for model building. An obvious thing to do is to select the pose for each ligand for which the original docking score was the best.
• Click on Tools on the menu bar of the GoldMine Controller. Select Define test set to bring up the Define Training and Test Sets window.
• Drag and drop the Gold_Goldscore_Fitness descriptor from the Descriptors pane into the top area of the Define Training and Test Sets window. You will notice that the Best criterion is set as Max. This is correct so leave it as is.
• Type in a name for the selection we are about to create in the Best set name text field, such as GS_Best. Click on Create. A selection named GS_Best should now exist.
• Now we are in a position to create training and test sets from the GS_Best selection, in the Define training and test sets from a selection area. You will notice that the name of this selection has already been filled into the Selection to split pull-down menu. We can change the percentage in the training set if we wish. It is good practice to keep this figure within the range 35 - 65%. However for this exercise we will leave it at 50%. We can also edit the names of the training and test sets. However we'll leave them at the default names. Click on Split. This will create two complementary selections called GS_Best_train and GS_Best_test, out of the GS_Best selection.
• Now as a last step we need to identify subsets of the training and test data sets which contain the actives. In the Actives pull-down menu select the Actives selection. Names for the training actives and test actives sets will be automatically entered in the appropriate boxes. Click on Create to create the selections. The Define Training and Test Sets window should be filled in like so:
[Figure: the completed Define Training and Test Sets window]
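The selections built in this section can be mirrored in a few lines of code. The sketch below keeps the best-scoring pose per ligand, splits that selection into two complementary halves, and picks out the actives by name prefix. The dictionary layout, the field name goldscore, the random shuffle and the helper names are all hypothetical; the guide does not describe how GoldMine assigns poses to the two sets.

```python
import random


def best_pose_per_ligand(poses, score_key):
    """Keep, for each ligand name, the pose with the highest score (Best = Max)."""
    best = {}
    for pose in poses:
        current = best.get(pose["name"])
        if current is None or pose[score_key] > current[score_key]:
            best[pose["name"]] = pose
    return list(best.values())


def split_train_test(selection, train_fraction=0.5, seed=0):
    """Split a selection into complementary training and test sets (assumed random)."""
    shuffled = list(selection)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def actives_in(selection, prefix="FXa"):
    """Equivalent of the FXa% text search on the NAME field."""
    return [pose for pose in selection if pose["name"].startswith(prefix)]


# Hypothetical poses: two actives and two decoys, with more than one pose each.
poses = [
    {"name": "FXa_001", "goldscore": 61.0}, {"name": "FXa_001", "goldscore": 58.2},
    {"name": "FXa_002", "goldscore": 49.5}, {"name": "FXa_002", "goldscore": 52.3},
    {"name": "decoy_001", "goldscore": 44.0}, {"name": "decoy_001", "goldscore": 47.1},
    {"name": "decoy_002", "goldscore": 39.9}, {"name": "decoy_002", "goldscore": 35.4},
]
gs_best = best_pose_per_ligand(poses, "goldscore")
gs_best_train, gs_best_test = split_train_test(gs_best, train_fraction=0.5)
gs_best_train_actives = actives_in(gs_best_train)
gs_best_test_actives = actives_in(gs_best_test)
```

Fixing the seed makes the split repeatable, which helps when comparing models built at different times.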

13.4.3 Creating the Discrimination Model
• Now we are ready to do a regression analysis. We will attempt to create a model consisting of a linear combination of descriptors, which best discriminates actives from inactives. We will use the Receiver Operating Characteristic (ROC) curve and a number of established enrichment metrics to monitor our success.
• Highlight all the descriptors in the Descriptors pane of the GoldMine Controller. Select Regression from the pull-down menu next to the Send choices to button. Click on this button.
• The pane that is now displayed allows us to set out training and test sets. Choose the GS_Best_train set in the top pull-down menu in the Training data area. Select GS_Best_train_Actives in the next pull-down menu, ensuring that the Active selection radio button is toggled on. Similarly set the appropriate two test selections in the Test data area.

• Click on the Regression tab at the top of the GoldMine Regression window. Models are created by transferring descriptors one-by-one from the lower left-hand pane to the upper. The performance of the developing model is monitored for both training and test sets in the right hand area of the display.
• Drag and drop Gold_Goldscore_Fitness into the top left area. A window displaying significance statistics will be brought up and you will be asked whether to accept this descriptor. Accept it. ROC curves and enrichment statistics should have appeared for both test and training sets. The coefficients for each of the components of the model should have appeared on the middle left. Try changing the cut-offs at which the enrichment metrics are calculated. See what happens if you use a 1% cut off.
• Drag and drop Gold_ASP_Fitness into the top left area. Accept it too. We now have a consensus model that involves just our two principal scoring functions.
• We will now add a third variable to the model. However we don't have a clear idea which one to choose. Therefore we will now allow GoldMine to choose the descriptor that best accounts for the remaining signal. Click on Auto add.
• The variable Gold_Goldscore_External_Bond_weighted is chosen. Accept this variable too. We now have a three component model. Now looking at the ROC curve, we see that the one belonging to the three component model has greater area under it than the two or one component models. The enrichment metrics are also better, in particular those for the test set.
• The coefficients for each descriptor in the model are worth examination. Those of GoldScore and ASP are positive. This is what we'd hope to see, as higher values for these scoring functions should associate with better binding. The coefficient of the Gold_Goldscore_External_Bond_weighted component is negative however. This descriptor is a component of the GoldScore fitness, and so is downweighted in this regression model. Therefore we conclude that the contribution of Gold_Goldscore_External_Bond_weighted within GoldScore might be overstated for this particular protein structure.
• This model appears to be significantly better at discrimination than if we used GoldScore on its own. We could continue to add further descriptors to the model and indeed you are encouraged to try that later in the tutorial. However improvements to the model are likely to be slight and the danger of overfitting the data becomes greater as more variables are added. You can assess whether overfitting is going on by monitoring the enrichment metrics for the test set. If the majority of these start to go down once a variable has been added, then there is overfitting.
• Click on Save. This calculates a new descriptor that enumerates the model over every pose within the dock set. You will be asked to supply a name for this descriptor, say Consensus_3term. You can, if you like, consider this descriptor as representing a rescoring function that is specific to docking runs carried out against the 1ezq crystal model.
• If you wanted to use this model on a new set of docking poses then it would be necessary to record the coefficients for the descriptor components of the model, and then set up a corresponding arithmetic calculation in the Descriptor Calculator.
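Reusing a saved model on new poses, as described in the last step above, amounts to evaluating a weighted sum of descriptor values. The sketch below shows that arithmetic. The coefficient values are hypothetical placeholders (only their signs follow the text), and the optional intercept is an assumption, since the guide does not say whether the regression model includes one.

```python
def apply_linear_model(pose, coefficients, intercept=0.0):
    """Evaluate intercept + sum(coefficient * descriptor value) for one pose."""
    return intercept + sum(
        weight * pose[descriptor] for descriptor, weight in coefficients.items()
    )


# Hypothetical coefficients recorded from the Regression window.
# Signs follow the tutorial: GoldScore and ASP positive, the
# External_Bond_weighted component negative.
consensus_3term = {
    "Gold_Goldscore_Fitness": 0.02,
    "Gold_ASP_Fitness": 0.05,
    "Gold_Goldscore_External_Bond_weighted": -0.01,
}

# Hypothetical descriptor values for one new docking pose.
new_pose = {
    "Gold_Goldscore_Fitness": 60.0,
    "Gold_ASP_Fitness": 30.0,
    "Gold_Goldscore_External_Bond_weighted": 10.0,
}
score = apply_linear_model(new_pose, consensus_3term)
print(round(score, 3))
```

The same expression, typed into the Descriptor Calculator with the recorded coefficients, would give the rescoring descriptor for a new dock set.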

13.4.4 Applying the Model to Multiple Poses
• You may have noted a problem with the way we have set up the model. We are only using the poses for which the GoldScore is best. However, since our model includes a positive contribution from ASP, for some of these poses the ASP score may be poor and therefore the overall model score may actually be better for one of the other of the three saved poses of that structure. If the structure is an active molecule then there is a danger that it will be poorly represented by the pose we are currently using. Therefore we will now look at applying the model taking into consideration all the available poses to see if this is occurring. We will not however optimise the model in any way in regard to other poses. We could try this, however there lies a danger of overfitting if we do.
• Bring up the Define Training and Test Sets window via the Tools command on the GoldMine Controller menu bar. Drag and drop the descriptor Consensus_3term from the Descriptors pane into the top area of this window and highlight it. Type in an appropriate Best set name (e.g. Best_Consensus_3term) and click on Create. This will create a selection representing the best poses per structure, according to the Discrimination model we created earlier.
• We now will examine the performance of the model in discriminating actives from inactives over only the best poses for each ligand according to the model.
• In the Descriptors pane highlight Consensus_3term, choose as destination Explorer in the Choose descriptors area, and click on Send choices to.
• Click on ROC.
• Highlight the descriptor Consensus_3term in the left hand window. This is the discriminant function we are using.
• In the Values area at the top left, choose Best_Consensus_3term as the data set to work with.
• In the Selection area on the right choose Actives as the selection to pick out. The ROC curve and associated enrichment metrics should now be displayed.

• Repeat this experiment except use the GS_Best test set as the set to work with, in the Values area.
• Compare the enrichment metrics. There should be a slight improvement.
• You can finish the tutorial here. Alternatively there are one or two things you might like to experiment with.

13.4.5 Other Things to Try
• You can try adding a fourth variable to the model using Auto add. Use the same training and test sets as you did originally. Reject any variable that comes up that is a component of a scoring function, to try to find something more interesting. Accept the first variable that comes up that describes occluded functionality on the ligand.
• Check the coefficients of the model. Do they make sense? In other words are the signs of the coefficients consistent with what you'd expect for the corresponding descriptors?
• Save the model. Create a selection for the best poses per ligand according to that model and examine its discriminatory performance over those poses.
• Compare the enrichment characteristics with the three component model. Which test set is the model more successful on?
• Experiment with adding further descriptors to the model and see if you can detect when overfitting is occurring.
This ends the tutorial.
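The area under the ROC curve, used throughout this tutorial to compare models, can also be computed directly from a list of scores and active flags. The sketch below uses the standard pairwise interpretation: the fraction of active/inactive pairs in which the active scores higher, with ties counted as half. It is not GoldMine's implementation, and the scores are hypothetical.

```python
def roc_auc(scores, is_active):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    actives = [s for s, flag in zip(scores, is_active) if flag]
    inactives = [s for s, flag in zip(scores, is_active) if not flag]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for active_score in actives:
        for inactive_score in inactives:
            if active_score > inactive_score:
                wins += 1.0
            elif active_score == inactive_score:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Hypothetical model scores for the best pose of five ligands.
scores = [2.6, 2.3, 2.1, 0.7, 1.2]
is_active = [True, False, True, False, False]
auc = roc_auc(scores, is_active)
print(round(auc, 3))
```

An area of 0.5 corresponds to random ordering and 1.0 to perfect separation. Comparing the value on the training and test sets as descriptors are added is one way to watch for overfitting.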
