University of West of England,
Oxon, OX14 4SD
involves the identification of molecules that have a greater probability of exhibiting desired biological activity when subjected to in vitro screening (assaying) against a particular biological target. The paper introduces the integration of cluster-oriented genetic algorithms (COGAs) with such machine-based library design environments.
COGAs have a proven capability to identify high-performance regions of complex, continuous design spaces relating to engineering design problems. Modifications to the basic COGA approach are described that allow a transfer of this capability from continuous variable parameter space to the highly discrete spaces described by reactants across reagent libraries. Results relating firstly to the identification of optimal molecules and secondly to the focussing of reagent libraries in terms of high-performance reactants are presented. Single objective optimisation and focussing are initially considered before moving on to multiple objective satisfaction.
Drug design and discovery is a systematic, serial process of identification and modification of chemical structure to achieve desired results against biological targets associated with a particular disease. Traditionally, the process involves the development of a biochemical assay for a biological target of interest and the subsequent screening of large numbers of drug-like organic chemical compounds in a high-throughput manner to identify hit compounds. Such hit compounds usually possess weak biological activity that requires improvement by a process of medicinal chemistry optimisation, first into a lead series with robust properties and then into a drug candidate that is suitable for evaluation in human clinical trials. This process requires the optimisation of multiple parameters including biological activity, selectivity for the biological target of interest over related proteins, pharmacokinetics, pharmacodynamics and pharmaceutical properties.
In modern drug discovery extensive use is being made of in silico techniques to find hit molecules through virtual screening and then to aid their subsequent optimisation. A large compound collection in the form of virtual libraries is described by the reagents required for their synthesis, as shown in figure 1. Reagents (inputs to the reaction equation in figure 1) are the chemical reactants required to make a set of molecules. Examples of reagents could be carboxylic acids, amines, aldehydes etc. All molecules that contain the key reactive functional group for a reagent would constitute possible reactants across that reagent. In this sense, in terms of a variable parameter space, a reagent corresponds to a particular variable parameter (dimension) whereas the reactants are the available values (molecules) across that dimension.
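The reagent/reactant terminology can be illustrated with a minimal sketch. It assumes a virtual library is simply the Cartesian product of the reactant lists, one list per reagent; the reactant names and the helper functions `library_size` and `enumerate_library` are hypothetical and are not taken from the software discussed in this paper.

```python
from itertools import product

# Hypothetical reagent libraries: each key is a reagent (a dimension),
# each list holds the reactants (the values) available along that dimension.
reagent_libraries = {
    "amine": ["amine_1", "amine_2", "amine_3"],
    "acid": ["acid_1", "acid_2"],
}


def library_size(libraries):
    """Number of product molecules: the product of the reactant counts."""
    size = 1
    for reactants in libraries.values():
        size *= len(reactants)
    return size


def enumerate_library(libraries):
    """Yield every combination of one reactant per reagent."""
    names = list(libraries)
    for combo in product(*(libraries[name] for name in names)):
        yield dict(zip(names, combo))


virtual_library = list(enumerate_library(reagent_libraries))
print(library_size(reagent_libraries), "virtual products")
```

Full enumeration of this kind is only practical for small libraries, since the library size grows multiplicatively with each additional reagent.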
These libraries may then be subjected to in silico screening to identify compounds (hits) that exhibit activity against the biological target during the subsequent assay. In silico models such as molecular docking, pharmacophore matching and QSAR (Quantitative Structure-Activity Relationship) can be utilised as objective functions to identify possible high-performance compounds that have a greater probability of exhibiting desired biological activity during in vitro assaying.
The size of the virtual libraries, coupled with the computational expense of most of the objective functions, rules out exhaustive search (i.e. complete enumeration of all library members from the corresponding chemical reagents). What is therefore required is a search process that rapidly samples and identifies as many high-performance molecules as possible within available time limitations. The development and integration of appropriate search techniques during in silico screening could significantly enhance the drug discovery process by both improving the hit rates during assaying and reducing drug design cycle times.
In collaboration with Evotec OAI, Cluster-oriented Genetic Algorithms (COGAs) have been integrated as a potential user-interactive search and exploratory tool with Evotec OAI’s existing drug design software, EVOSeek™. COGA has a proven capability within engineering design environments described by continuous variables to identify regions of high performance (HP) solutions. This can be achieved with no a priori knowledge of the problem space in terms of the possible number of local optima and the settings of niche radii, sharing factors etc.
The successful transfer of COGA technology to combinatorial chemical space offers significant potential in terms of the ability to identify groupings of HP reactants and hence support the focussing of reagent libraries or the identification of individual HP molecules. This library focussing and optimisation capability is therefore the motivation for the research described in the following sections. This work represents a proof-of-concept relating to the potential of COGA integration.
Section 2 is a brief review of library design literature. Section 3 introduces the COGA methodology. Section 4 describes initial experimentation to determine basic COGA performance. Section 5 concentrates upon identifying optimal molecules from a test library whereas Section 6 presents the introduction of a tabu element to the basic COGA to enable focussing of this library. Scalability issues are investigated in Section 7 before moving on to multi-objective satisfaction in Section 8.
A review of drug design approaches and strategies can be found in Tsinopoulos and McCarthy. Computer Aided Drug Design found application in the early 1990s in terms of the modelling of structure-activity relationships (SARs). The introduction of evolutionary computing (EC) techniques to modelling strategies began to emerge a little later. Milne reports only five published papers employing evolutionary algorithms between 1989 and 1992. However, between 1993 and 1997 more than 210 EC-based papers appeared, as reviewed in [4,5,6,7]. The first published application of EC to combinatorial library design was by Sheridan et al., utilising a measure of chemical similarity as an objective function. Papers by Singh et al. and Weber et al., however, utilised the results from in vitro biological activity to provide a fitness measure for a GA-based search, thus providing a proof-of-concept of the potential of a GA to direct reactant selection for chemical synthesis.
Gillet et al. developed a GA-based technique (SELECT) to optimise virtual libraries against a single diversity objective using a distance-based diversity matrix. A weighted sum approach for multiple objectives was refined via the introduction of a multi-objective genetic algorithm (MoSELECT). Wright et al. developed a different selection scheme in MoSELECT-II for the optimisation of library size and configuration. Brown et al., using a GA-based approach called GALOPED, also addressed size and configuration whereas Pickett et al. introduced Monte Carlo approaches to achieve similar objectives.
There is little basis for comparison between previous work and the aims and objectives of the research described in this paper. Our objective, initially using chemical similarity, is to optimise and focus a library, as opposed to Gillet’s objective of generating a diverse library. Whereas we introduce search, optimisation and multi-objective satisfaction in large reagent libraries, other works [8,9,10] have attempted single objective optimisation on significantly smaller libraries. This, plus the absence of suitable benchmark problems, makes comparison of our results with other works very difficult.
Cluster-oriented Genetic Algorithms, initially developed by Parmee in the early 1990s, provide the means to identify high-performance (HP) regions of complex conceptual engineering design spaces and enable the extraction of information from such regions [16,17]. COGAs identify HP solution regions through the on-line adaptive filtering of solutions generated by a genetic algorithm. Further work resulted in several variations of COGA and also identified and illustrated the manner in which the COGA approach can be utilised to generate highly relevant design information relating to single and multi-objective problem domains [17,18,19].
COGA comprises two primary components: the diverse search engine, which utilises a highly exploratory genetic algorithm to search the design space, and the adaptive filter (AF), which extracts solutions from each COGA generation and stores them within a Final Cluster Set (FCS). The AF scales solution fitness in terms of distance from the mean (figure 2) and only solutions that lie above a pre-defined threshold value, Rf, are copied to the FCS. By reducing the severity of Rf, more HP solutions, albeit with a lower average fitness, can enter the FCS. The user can therefore vary the filter setting in order to identify regions ranging from succinct groupings of very high performance solutions to larger regions of high and lower performance solutions. Design space exploration is enhanced in the underlying search engine via variable mutation regimes or Halton injection sequences. Sufficient HP regional set-cover (in terms of number of solutions) can be achieved to allow significant qualitative and quantitative design information to be extracted.
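The filtering step can be sketched as follows. This is an illustrative approximation only: it assumes that fitness is scaled as the number of standard deviations from the generation mean, whereas the text states only that scaling is "in terms of distance from the mean". The function name `adaptive_filter`, the solutions and the fitness values are all hypothetical.

```python
from statistics import mean, pstdev


def adaptive_filter(population, fitnesses, rf):
    """Return the solutions whose scaled fitness exceeds the threshold rf.

    Assumption: scaled fitness = (f - generation mean) / generation standard
    deviation. The exact scaling used by COGA is not given in the text.
    """
    mu = mean(fitnesses)
    sigma = pstdev(fitnesses)
    if sigma == 0:
        return []  # every solution is equally fit, so none stands out
    return [sol for sol, f in zip(population, fitnesses)
            if (f - mu) / sigma > rf]


# One made-up generation of five two-gene solutions with made-up fitnesses.
generation = [(1, 1), (2, 5), (3, 9), (4, 2), (7, 7)]
fitnesses = [0.10, 0.20, 0.90, 0.30, 0.50]

final_cluster_set = []
final_cluster_set.extend(adaptive_filter(generation, fitnesses, rf=1.0))
```

With these numbers a severe setting (`rf=1.0`) admits only the single best solution, while relaxing it to `rf=0.0` also admits the next best, which mirrors the trade-off between succinct and broader groupings described above.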
It is not possible within the space available to provide a more detailed description of COGA. However, anyone wishing to replicate the research in this paper can refer to well-documented COGA development in many papers available
The COGA approach was primarily developed for search and exploration across design spaces described by continuous variable parameters which tend to predominate in engineering design. Typical COGA output from continuous design spaces comprises clusters of solutions describing high performance regions. The spread and distribution of these high quality solutions can offer a wealth of information relating to the characteristics of the search space and the complex relationships between variable and objective space as demonstrated in Parmee
One of the challenges in the research described in the following sections has been to modify the COGA approach so that, when searching the discrete combinatorial problem spaces that are an all-pervading aspect of drug design, it offers a utility similar to that proven in engineering design.
Given the very positive results from the engineering design domain, the initial assumption was that application of COGAs would significantly support the identification of ‘best’ reactants. The chemist could interact with the evolutionary process by varying the adaptive filter to identify either succinct groupings of high-performance molecules or larger collections of lesser performance molecules in terms of any chosen in silico objective function (e.g. QSAR, chemical similarity etc).
Binary representation has generally been utilised within the original COGA algorithms. However, binary representation for the reactants of each reagent revealed a number of potential problems. For instance, directly mapping binary strings onto the integer space comprising the numbered address of each reactant molecule of a reagent grouping results in illegal solutions and / or a degree of redundancy. Problems relating to the crossover of binary strings, the generation of non-feasible solutions and a subsequent requirement for chromosome repair were also inherent.
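The scale of the redundancy can be made concrete with a small calculation. The sketch below assumes a plain fixed-width binary code per reagent and uses the figure of 400 reactants per reagent from the test library in this paper; the function name `binary_encoding_waste` is hypothetical.

```python
def binary_encoding_waste(n_reactants):
    """Return (bits needed, number of bit patterns that address no reactant)
    for a fixed-width binary code over n_reactants addresses."""
    bits = max(1, (n_reactants - 1).bit_length())
    return bits, 2 ** bits - n_reactants


bits, illegal = binary_encoding_waste(400)
print(f"{bits} bits per gene, {illegal} of {2 ** bits} patterns are illegal")
```

For 400 reactants, 9 bits are required and 112 of the 512 possible patterns correspond to no reactant, so they must either be repaired or mapped redundantly onto legal addresses.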
A straightforward integer encoding was therefore adopted where integer values represent an index of a reactant’s location in the proprietary database. The chromosome in our integer representation scheme has a length (number of genes) equal to the number of dimensions (reagents) of the virtual library. A gene can then take an integer value between one and the maximum number of reactant molecules across that dimension. This number can be transposed directly to the index of that reactant molecule in the chemical database of molecules. The phenotype is then the product molecule of the reaction involving the reactant molecules represented by each gene in the chromosome.
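A minimal sketch of this encoding is given below. The helper names `random_chromosome` and `decode` are hypothetical, and plain Python lists stand in for the proprietary reactant database; forming the product molecule itself is outside the scope of the sketch.

```python
import random


def random_chromosome(library_sizes, rng):
    """One gene per reagent; each gene is a 1-based reactant index."""
    return [rng.randint(1, size) for size in library_sizes]


def decode(chromosome, reagent_libraries):
    """Map each gene to the reactant it indexes (genes are 1-based)."""
    return [library[gene - 1]
            for gene, library in zip(chromosome, reagent_libraries)]


rng = random.Random(0)
# Two reagents with 400 reactants each, as in the test library.
chromosome = random_chromosome([400, 400], rng)

# Tiny stand-in libraries to show the decoding step.
reactants = decode([1, 2], [["amine_1", "amine_2"], ["acid_1", "acid_2"]])
```

Because every gene is drawn from the legal index range of its own reagent, any chromosome produced this way, or by exchanging genes at the same position between two parents, decodes to a valid reactant combination and needs no repair.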
A test virtual library from the reaction scheme of figure 1 comprising amines and acids was initially chosen to assess the performance of COGA, utilising Tanimoto similarity as a simple test objective function. The two reagents each possess 400 reactants, creating a search space of 160,000 possible solutions. To allow a true evaluation of COGA performance and proof-of-concept, all the product molecules in this virtual library were enumerated and their chemical similarity to a specific drug molecule (methotrexate) calculated. This allows the top 0.5% of solutions to be plotted as shown in figure 3, which illustrates a typical distribution of high-performance solutions against which COGA output can be compared.
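For reference, the Tanimoto coefficient and the sizes quoted above can be sketched as follows. The sketch assumes each molecule is represented by the set of "on" bit positions of a binary fingerprint; the paper does not state which fingerprint was used, and the example fingerprints are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient |A & B| / |A | B| of two fingerprints, each
    given as a collection of 'on' bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up fingerprints sharing two of four distinct bits.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})

# Sizes quoted in the text: 400 amines x 400 acids, and the top 0.5%.
search_space = 400 * 400
top_count = search_space * 5 // 1000
print(similarity, search_space, top_count)
```

The top 0.5% of the 160,000 enumerated products therefore corresponds to 800 molecules.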
i.e. the identification of high performance reactants that provide a focussed combinatorial compound library the members of which include a significant number of high performance molecules.
It was initially assumed that appropriate settings of the adaptive filter threshold of COGA would result in the achievement of each of these objectives. High filter settings would provide smaller numbers of HP solutions whereas low settings would identify much larger numbers of high performance solutions with a lower average fitness which would also support the identification of high performance reactants i.e. reactants generally exhibiting high-performance across all possible combinations.
Before investigating these initial assumptions both variable mutation COGA (vmCOGA) and Halton injection COGA (hiCOGA) were investigated with the test virtual library from Figure 1 to determine their comparative performance. Performance criteria relates to