Statistical molecular design
Sunday, March 2nd, 2008
An external validation set, collected independently of the training set, is widely regarded as the best way to assess the quality of a QSAR/qSAR model (Wold et al. 1995). However, additional sources of data for constructing such a set are usually difficult to find, so the typical method is to split the original dataset into two sets: a training set for developing the QSAR/qSAR model and a validation set for evaluating its performance (Gramatica et al. 2004). The training set should contain structurally diverse compounds that adequately represent all of the compounds that possess a particular activity (Rajer-Kanduc et al. 2003; Schultz et al. 2003). The validation set also needs to be sufficiently diverse and representative of the compounds studied in order to assess the accuracy of the QSAR/qSAR models reliably (Rajer-Kanduc et al. 2003; Schultz et al. 2003).
There are a number of approaches for creating diverse training sets and representative validation sets from a dataset; they are listed in Table 1. These include random selection, cluster-based methods, dissimilarity-based methods, cell-based methods, stochastic techniques, statistical experimental designs and neural networks (Daszykowski et al. 2002; Leach et al. 2003). Studies have shown that dissimilarity-based methods, such as the Kennard and Stone algorithm and the removal-until-done algorithm, are more effective than other algorithms in selecting diverse training sets and representative validation sets for developing and validating QSAR/qSAR models (Snarey et al. 1997; Rajer-Kanduc et al. 2003).
Table 1: Methods for selecting training and validation sets
| Category | Methods |
| --- | --- |
| Cluster-based methods (hierarchical) | Single linkage (Leach et al. 2003); Complete linkage (Leach et al. 2003); Group average (Leach et al. 2003); Ward's method (Leach et al. 2003); Centroid method (Leach et al. 2003); Median method (Leach et al. 2003) |
| Cluster-based methods (non-hierarchical) | K-means (Forgy 1965); Jarvis-Patrick clustering (Jarvis et al. 1973); DBSCAN (Ester et al. 1996); OPTICS (Ankerst et al. 1999); DENCLUE (Han et al. 2001) |
| Dissimilarity-based methods | MaxSum (Snarey et al. 1997); Kennard and Stone algorithm (Kennard et al. 1969); Removal-until-done (Hobohm et al. 1992); Sphere exclusion (Hudson et al. 1996); OptiSim (Clark 1997); IcePick (Mount et al. 1999); Minimum spanning tree error function (Waldman et al. 2000) |
| Cell-based methods | Cummins algorithm (Cummins et al. 1996); Menard algorithm (Menard et al. 1998); Uniform cell coverage (Lam et al. 2002) |
| Stochastic techniques | Techniques using Monte Carlo sampling (Agrafiotis 1996; Hassan et al. 1996); Techniques using genetic algorithms (Sheridan et al. 2000; Gillet et al. 2002) |
| Statistical experimental designs | D-optimal design (Mitchell 1974); Factorial design (Box et al. 1978) |
| Others | Random selection; Kohonen's self-organizing map; Informative design (Miller et al. 2002) |
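To make the dissimilarity-based idea concrete, the following is a minimal sketch of a Kennard and Stone style selection, written in plain Python. It assumes each compound is already described by a numeric descriptor vector and that Euclidean distance is an acceptable dissimilarity measure; the function name `kennard_stone`, the toy `descriptors` data and the choice to use the selected compounds as the training set are illustrative assumptions, not details taken from the cited papers.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kennard_stone(points, n_select):
    """Return indices of n_select points picked by a max-min distance rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")

    # Seed the selection with the two most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist(points[pair[0]], points[pair[1]]),
    )
    selected = [first, second]

    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [min(dist(p, points[first]), dist(p, points[second])) for p in points]

    while len(selected) < n_select:
        remaining = [i for i in range(n) if i not in selected]
        # Pick the candidate that is farthest from everything chosen so far.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist(points[i], points[nxt]))
    return selected


# Hypothetical 2-D descriptor values for six compounds (illustration only).
descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0), (0.5, 0.5), (0.0, 1.0)]
training = kennard_stone(descriptors, 4)
validation = [i for i in range(len(descriptors)) if i not in training]
```

In this toy example the selected compounds spread across the descriptor space, while the compounds left for validation each lie close to a selected one, which is the behaviour that makes such a split both diverse and representative. The pairwise search for the seed pair is quadratic in the number of compounds, so a large dataset would likely need a more efficient implementation.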
References
- Agrafiotis DK (1996). Stochastic algorithms for maximizing molecular diversity. 3rd Electronic Computational Chemistry Conference.
- Ankerst M, Breunig M, Kriegel H and Sander J (1999). OPTICS: Ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data: 49-60.
- Box GEP, Hunter WG and Hunter JS (1978). Statistics for experimenters: An introduction to design, data analysis, and model building. New York, Wiley.
- Clark RD (1997). OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. Journal of Chemical Information and Computer Sciences 37(6): 1181-1188.
- Cummins DJ, Andrews CW, Bentley JA and Cory M (1996). Molecular diversity in chemical databases: Comparison of medicinal chemistry knowledge bases and databases of commercially available compounds. Journal of Chemical Information and Computer Sciences 36(4): 750-763.
- Daszykowski M, Walczak B and Massart DL (2002). Representative subset selection. Analytica Chimica Acta 468(1): 91-103.
- Ester M, Kriegel HP, Sander J and Xu X (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining: 226-231.
- Forgy E (1965). Cluster analysis of multivariate data: Efficiency vs interpretability of classifications. Biometrics 21: 768-780.
- Gillet VJ, Willett P, Fleming PJ and Green DVS (2002). Designing focused libraries using MoSELECT. Journal of Molecular Graphics and Modelling 20(6): 491-498.
- Gramatica P, Pilutti P and Papa E (2004). Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling. Journal of Chemical Information and Computer Sciences 44(5): 1794-1802.
- Han JW and Kamber M (2001). Data mining: Concepts and techniques. San Francisco, Morgan Kaufmann Publishers.
- Hassan M, Bielawski JP, Hempel JC and Waldman M (1996). Optimization and visualization of molecular diversity of combinatorial libraries. Molecular Diversity 2(1-2): 64-74.
- Hobohm U, Scharf M, Schneider R and Sander C (1992). Selection of representative protein data sets. Protein Science 1(3): 409-417.
- Hudson BD, Hyde RM, Rahr E, Wood J and Osman J (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure-Activity Relationships 15: 285-289.
- Jarvis RA and Patrick EA (1973). Clustering using a similarity measure based on shared near neighbours. IEEE Transactions on Computers C-22: 1025-1034.
- Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.
- Lam RLH, Welch WJ and Young SS (2002). Uniform coverage designs for molecule selection. Technometrics 44(2): 99-109.
- Leach AR and Gillet VJ (2003). Selecting diverse sets of compounds. An introduction to chemoinformatics. Boston, Kluwer Academic Publishers: 123-145.
- Menard PR, Mason JS, Morize I and Bauerschmidt S (1998). Chemical space metrics in diversity analysis, library design, and compound selection. Journal of Chemical Information and Computer Sciences 38(6): 1204-1213.
- Miller JL, Bradley EK and Teig SL (2002). Luddite: An information-theoretic library design tool. Journal of Chemical Information and Computer Sciences 43(1): 47-54.
- Mitchell TJ (1974). An algorithm for the construction of “D-optimal” experimental designs. Technometrics 16: 203-210.
- Mount J, Ruppert J, Welch W and Jain AN (1999). IcePick: flexible surface-based system for molecular diversity. Journal of Medicinal Chemistry 42(1): 60-66.
- Rajer-Kanduc K, Zupan J and Majcen N (2003). Separation of data on the training and test set for modelling: a case study for modelling of five colour properties of a white pigment. Chemometrics and Intelligent Laboratory Systems 65(2): 221-229.
- Schultz TW, Netzeva TI and Cronin MTD (2003). Selection of data sets for QSARs: analyses of Tetrahymena toxicity from aromatic compounds. SAR and QSAR in Environmental Research 14(1): 59-81.
- Sheridan RP, SanFeliciano SG and Kearsley SK (2000). Designing targeted libraries with genetic algorithms. Journal of Molecular Graphics and Modelling 18(4-5): 320-334.
- Snarey M, Terrett NK, Willett P and Wilton DJ (1997). Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling 15(6): 372-385.
- Waldman M, Li H and Hassan M (2000). Novel algorithms for the optimization of molecular diversity of combinatorial libraries. Journal of Molecular Graphics and Modelling 18(4-5): 412-426.
- Wold S and Eriksson L (1995). Statistical validation of QSAR results. Chemometric methods in molecular design. van de Waterbeemd H. Weinheim; New York; Basel; Cambridge; Tokyo, VCH: 309-318.




