Archive for March, 2008

Overfitting

Tuesday, March 11th, 2008

It is not sufficient for a QSAR/qSAR model to have good predictive capability. A second requirement for a good quality QSAR/qSAR model is that it must not suffer from overfitting. There are two main types of overfitting: (1) using a model that is more flexible than it needs to be and (2) using a model that includes irrelevant descriptors (Hawkins 2004). There are various methods that can be used to prevent or to check for these two types of overfitting.

A number of different QSAR/qSAR models can be developed using machine learning methods of varying complexities. The QSAR/qSAR model with the best balance between the complexity of the machine learning method used and its predictive capability is the one that is most suitable for predicting the activity of a compound. Choosing the model in this way avoids using a QSAR/qSAR model that is more flexible than necessary.

A frequently used method for checking whether a QSAR/qSAR model is overfitted is to compare its prediction capability as estimated by cross-validation with that estimated by using independent validation sets (Hawkins 2004). Even though cross-validation methods tend to give a pessimistic estimate of the predictive capability of a QSAR/qSAR model, a model that is not overfitted should not show a large difference between the estimates obtained from cross-validation and from independent validation sets.

Y-randomization is commonly used to determine the probability of chance correlation during descriptor selection (Manly 1997; Leardia et al. 1998). In classification problems, a portion of the compounds in the training set belonging to the positive data class (D+) is randomly exchanged with compounds in the training set belonging to the negative data class (D-), creating new training sets with false D+ and D- compounds. For regression problems, the activities of all the compounds in the training set are randomly rearranged. The machine learning method is trained using this scrambled training set. The randomization is repeated a number of times and the prediction capabilities of the scrambled QSAR/qSAR models from each run are compared to that of the original QSAR/qSAR model. If the scrambled training sets give significantly lower prediction capabilities than the original training set, it can be concluded that the original QSAR/qSAR model is relevant and unlikely to have arisen as a result of chance correlation.
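The regression case of Y-randomization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "machine learning method" is a one-descriptor least-squares line, predictive capability is taken as r2 on the training data, and the function names (fit_line, r_squared, y_randomization) are hypothetical rather than taken from any QSAR package.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor (stand-in model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Assumed definition: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_runs=100, seed=0):
    """Refit the model on scrambled activities and compare with the original fit."""
    rng = random.Random(seed)
    slope, intercept = fit_line(x, y)
    original = r_squared(y, [slope * a + intercept for a in x])
    scrambled = []
    for _ in range(n_runs):
        y_perm = list(y)
        rng.shuffle(y_perm)  # randomly rearrange the activities
        s, i = fit_line(x, y_perm)
        scrambled.append(r_squared(y_perm, [s * a + i for a in x]))
    # Fraction of scrambled runs that do at least as well as the original model
    fraction = sum(r >= original for r in scrambled) / n_runs
    return original, scrambled, fraction
```

A small fraction returned as the third value suggests that the original model is unlikely to be a chance correlation. In practice the descriptor selection step would be repeated inside each run as well.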

In order to determine whether the selected descriptors of the original QSAR/qSAR model include some that are irrelevant for predicting the activity of a compound, different groups of QSAR/qSAR models, each containing a different number of descriptors, can be generated by using the descriptor selection method. Each group contains a fixed number of QSAR/qSAR models having the same number of descriptors. The prediction capabilities of the QSAR/qSAR models in each group are determined, and the average prediction capabilities of the groups are compared to determine the optimal number of descriptors for predicting the activity of a compound. If the optimal number of descriptors coincides with the number of descriptors in the original QSAR/qSAR model, the original model is unlikely to contain irrelevant descriptors.
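The comparison of group averages can be illustrated as follows. This is a minimal sketch: the function name optimal_descriptor_count is hypothetical, and it assumes that a higher score means better predictive capability.

```python
from statistics import mean


def optimal_descriptor_count(scores_by_count):
    """Return the descriptor count whose group of models has the best average score.

    scores_by_count maps a number of descriptors to the list of prediction
    scores of the models in that group (higher is assumed to be better).
    """
    averages = {count: mean(scores) for count, scores in scores_by_count.items()}
    best = max(averages, key=averages.get)
    return best, averages
```

For example, with made-up scores {3: [0.60, 0.62], 5: [0.71, 0.73], 8: [0.66, 0.70]}, the function returns 5 as the optimal number of descriptors, which would then be compared with the number of descriptors in the original model.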

References

  • Hawkins DM (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences 44(1): 1-12.
  • Leardia R and González AL (1998). Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemometrics and Intelligent Laboratory Systems 41(2): 195-207.
  • Manly BFJ (1997). Randomization bootstrap and Monte Carlo methods in biology. London, Chapman and Hall.

Methods for measuring predictive capability of QSAR models

Sunday, March 9th, 2008

The following statistics are commonly calculated to determine the predictive capability of a QSAR model.

[Equation images not reproduced: r2, MSE, MAE, fold error, average fold error]

The r2 value measures the explained variance between the predicted and actual activity values. The fold-error of a compound measures the degree of overprediction or underprediction for a compound and is useful for identifying chemical structures which are not well-represented by the QSAR model. The average-fold error avoids the cases in which poor overpredictions are cancelled by equally poor underpredictions. A QSAR model that predicts an activity value perfectly gives an average-fold error of 1 and a model with an average-fold error of less than 2 is considered to be a successful one (Obach et al. 1997).
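The statistics named above can be computed as in the sketch below. The exact formulas in the original post are not available, so the definitions here are assumptions: r2 is taken as 1 - SS_res/SS_tot, the fold error as the larger of predicted/actual and actual/predicted, and the average fold error as 10 raised to the mean absolute log10 ratio, which is consistent with a perfect model giving a value of 1. Activities are assumed to be positive.

```python
import math


def r_squared(actual, predicted):
    """Assumed definition: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def fold_error(actual, predicted):
    """Assumed definition: ratio of the larger to the smaller value (always >= 1)."""
    return max(predicted / actual, actual / predicted)


def average_fold_error(actual, predicted):
    """Assumed definition: 10 ** mean(|log10(predicted / actual)|)."""
    logs = [abs(math.log10(p / a)) for a, p in zip(actual, predicted)]
    return 10 ** (sum(logs) / len(logs))
```

Because the absolute value of each log ratio is taken before averaging, a twofold overprediction and a twofold underprediction both count as a twofold error and do not cancel.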

References

  • Obach RS, Baxter JG, Liston TE, Silber BM, Jones BC, Macintyre F, Rance DJ and Wastall P (1997). The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. Journal of Pharmacology and Experimental Therapeutics 283(1): 46-58.

Methods for measuring predictive capability of qSAR models

Friday, March 7th, 2008

The following statistics are usually calculated to determine the predictive capability of a qSAR model.

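$$SE = \frac{TP}{TP + FN}$$

$$SP = \frac{TN}{TN + FP}$$

$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$$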

where MCC is the Matthews correlation coefficient (Matthews 1975), TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. Sensitivity (SE) and specificity (SP) are the classification accuracies of a qSAR model for the positive and negative data classes respectively. Overall accuracy (Q) is the classification accuracy of the qSAR model over both data classes. The shortcoming of overall accuracy is that an imbalance in the data classes may result in a high overall accuracy even if either sensitivity or specificity is low. For example, a qSAR model which has a sensitivity of 100% and a specificity of 0% will have an overall accuracy of 90% for a validation set that has 9 times more compounds of the positive data class than of the negative data class. Thus MCC, which accounts for all four of TP, TN, FP and FN, is increasingly being used to measure the predictive capability of qSAR models. An MCC value of 1 indicates that the qSAR model can predict the data classes of unknown compounds perfectly, an MCC value of 0 is expected for a qSAR model that is no better than random guessing, and an MCC value of -1 indicates total disagreement between the predicted and the actual data classes. For the above example, the model predicts every compound as positive, so MCC takes a value of 0 (by the usual convention, since TN + FN = 0 makes its denominator zero), which is a more accurate representation of the predictive capability of the model.
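The worked example can be checked numerically with a short sketch. The function name classification_metrics is hypothetical, and MCC is set to 0 when its denominator is zero, which is a common convention rather than part of the original definition.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy and MCC from the four counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    q = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # Convention (assumption): MCC is 0 when the denominator vanishes
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, q, mcc


# 90 positives and 10 negatives, every compound predicted as positive
se, sp, q, mcc = classification_metrics(tp=90, tn=0, fp=10, fn=0)
print(se, sp, q, mcc)  # 1.0 0.0 0.9 0.0
```

The sketch does not guard against a validation set with no positive or no negative compounds, for which sensitivity or specificity is undefined.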

References

  • Matthews BW (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405(2): 442-451.

Performance evaluation of a QSAR/qSAR model

Wednesday, March 5th, 2008

One of the objectives of QSAR/qSAR modeling is to allow prediction of the activities of compounds which have not been biologically tested. Thus it is important to determine the ability of the developed QSAR/qSAR model to predict the activities of compounds that are not present in the training set. There are two methods which are commonly used to determine the predictive capability of a QSAR/qSAR model (Wold et al. 1995). The first method is the use of cross-validation, which includes leave-one-out (LOO) and k-fold cross-validation. In LOO, a compound is left out of the training set and the remaining compounds are used to train the machine learning method. The derived QSAR/qSAR model is then used to predict the activity of the left-out compound. This process is repeated until every compound in the training set has been left out once. In k-fold cross-validation, the training set is randomly divided into k mutually exclusive subsets of approximately equal size. k-1 of the subsets are combined to form a modeling training set for developing a QSAR/qSAR model. The remaining subset is used as a modeling testing set to assess the predictive capability of the QSAR/qSAR model. This process is repeated until k QSAR/qSAR models have been developed and each subset has been used as a modeling testing set once. The second method is the use of an independent validation set.
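The k-fold procedure, with LOO as the special case where k equals the number of compounds, can be sketched as follows. The function names and the fit/predict callables are hypothetical placeholders for whatever machine learning method is used.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k mutually exclusive subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(x, y, k, fit, predict, seed=0):
    """Return one out-of-fold prediction per compound.

    fit(xs, ys) must return a model and predict(model, xi) a prediction.
    Setting k = len(x) gives leave-one-out cross-validation.
    """
    predictions = [None] * len(x)
    for test_fold in k_fold_indices(len(x), k, seed):
        held_out = set(test_fold)
        train = [i for i in range(len(x)) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test_fold:
            predictions[i] = predict(model, x[i])
    return predictions


# Toy "model" that always predicts the mean activity of its training set
mean_fit = lambda xs, ys: sum(ys) / len(ys)
mean_predict = lambda model, xi: model
loo = cross_validate([0, 1, 2, 3], [1.0, 2.0, 3.0, 4.0], k=4,
                     fit=mean_fit, predict=mean_predict)
```

The out-of-fold predictions can then be compared with the actual activities using whichever statistic is appropriate for the model type.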

There are reports of a lack of correlation between cross-validation results and the prediction capability of a QSAR/qSAR model (Golbraikh et al. 2002; Kozak et al. 2003; Reunanen 2003; Olsson et al. 2004). Moreover, cross-validation methods tend to underestimate the prediction capability of a QSAR/qSAR model, especially if important molecular features are present in only a minority of the compounds in the training set (Mosier et al. 2002; Hawkins et al. 2004). Thus a model having low cross-validation results can still be quite predictive (Mosier et al. 2002). This has led some studies to suggest that an independent validation set may provide a more reliable estimate of the prediction capability of a QSAR/qSAR model (Wold et al. 1995; Golbraikh et al. 2002). Despite these disadvantages, cross-validation methods are still useful for assessing QSAR/qSAR models during optimization of the parameters of machine learning methods and during descriptor selection.

A validation set should ideally be obtained independently of the training set. However, validation sets are usually constructed by using statistical molecular design because of the limited availability of high-quality activity data. Regardless of the method used to obtain a validation set, a good validation set should be representative of the training set so that it can properly assess the prediction capabilities of the QSAR/qSAR model (Tropsha et al. 2003).

References

  • Golbraikh A and Tropsha A (2002). Beware of q2! Journal of Molecular Graphics and Modelling 20(4): 269-276.
  • Hawkins DM, Basak SC and Mills D (2004). Assessing model fit by cross-validation. Journal of Chemical Information and Computer Sciences 43(2): 579-586.
  • Kozak A and Kozak R (2003). Does cross validation provide additional information in the evaluation of regression models? Canadian Journal of Forest Research 33(6): 976-987.
  • Mosier PD and Jurs PC (2002). QSAR/QSPR studies using probabilistic neural networks and generalized regression neural networks. Journal of Chemical Information and Computer Sciences 42(6): 1460-1470.
  • Olsson I-M, Gottfries J and Wold S (2004). D-optimal onion designs in statistical molecular design. Chemometrics and Intelligent Laboratory Systems 73(1): 37-46.
  • Reunanen J (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research 3: 1371-1382.
  • Tropsha A, Gramatica P and Gombar VK (2003). The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR & Combinatorial Science 22(1): 69-77.
  • Wold S and Eriksson L (1995). Statistical validation of QSAR results. Chemometric methods in molecular design. van de Waterbeemd H. Weinheim; New York; Basel; Cambridge; Tokyo, VCH: 309-318.

Diversity and representativity of datasets

Tuesday, March 4th, 2008

The diversity of a dataset can be estimated by a diversity index (DI) which is the average value of the similarity between all of the pairs of compounds in that dataset (Perez 2005):

$$DI = \frac{\sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} sim(i,j)}{n(n-1)}$$

where sim(i,j) is a measure of the similarity between compounds i and j, and n is the number of compounds in the dataset. The diversity of a dataset increases with decreasing DI. The similarity between two compounds i and j is commonly described by the Tanimoto coefficient (Potter et al. 1998; Willett et al. 1998; Molnar et al. 2002):

$$sim(i,j) = \frac{\sum_{d=1}^{p} x_{di} x_{dj}}{\sum_{d=1}^{p} x_{di}^{2} + \sum_{d=1}^{p} x_{dj}^{2} - \sum_{d=1}^{p} x_{di} x_{dj}}$$

where x_{di} is the value of descriptor d for compound i and p is the number of descriptors of the compounds in the dataset. The mean maximum Tanimoto coefficient of the compounds in dataset A and those in dataset B can be used as a representativity index (RI) to measure the level of representativity of dataset A by dataset B. Dataset B is more representative of dataset A if the RI value between datasets A and B is higher.
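The two indices can be sketched in Python as follows. The continuous form of the Tanimoto coefficient is assumed, and the RI is interpreted as the average, over the compounds of dataset A, of each compound's maximum similarity to any compound of dataset B; both the interpretation and the function names are assumptions made for illustration.

```python
def tanimoto(a, b):
    """Continuous Tanimoto coefficient between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)


def diversity_index(dataset):
    """Average similarity over all ordered pairs of distinct compounds."""
    n = len(dataset)
    total = sum(tanimoto(dataset[i], dataset[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def representativity_index(dataset_a, dataset_b):
    """Mean over A of the maximum similarity to any compound in B (assumed reading)."""
    best = [max(tanimoto(a, b) for b in dataset_b) for a in dataset_a]
    return sum(best) / len(best)
```

The sketch assumes that no compound has an all-zero descriptor vector, since the Tanimoto coefficient is then undefined.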

References

  • Molnar L and Keseru GM (2002). A neural network based virtual screening of cytochrome P450 3A4 inhibitors. Bioorganic and Medicinal Chemistry Letters 12(3): 419-421.
  • Perez JJ (2005). Managing molecular diversity. Chemical Society Reviews 34(2): 143-152.
  • Potter T and Matter H (1998). Random or rational design? Evaluation of diverse compound subsets from chemical structure databases. Journal of Medicinal Chemistry 41(4): 478-488.
  • Willett P, Barnard JM and Downs GM (1998). Chemical similarity searching. Journal of Chemical Information and Computer Sciences 38(6): 983-996.

Statistical molecular design

Sunday, March 2nd, 2008

The use of an external independent validation set, which has been collected independently of the training set, is widely regarded as the best way to assess the quality of a QSAR/qSAR model (Wold et al. 1995). However, it is usually difficult to find additional sources of data to construct an independent validation set, and thus the typical method is to split the original dataset into two different sets: a training set for developing the QSAR/qSAR model and a validation set for evaluating the model performance (Gramatica et al. 2004). The training set should contain compounds of diverse structures that can adequately represent all of the compounds that possess a particular activity (Rajer-Kanduc et al. 2003; Schultz et al. 2003). The validation set also needs to be sufficiently diverse and representative of the compounds studied in order to assess the accuracy of the QSAR/qSAR models reliably (Rajer-Kanduc et al. 2003; Schultz et al. 2003).

There are a number of approaches for creating diverse training sets and representative validation sets from a dataset, which are given in Table 1. These include random selection, cluster-based methods, dissimilarity-based methods, cell-based methods, stochastic techniques, statistical experimental designs and neural networks (Daszykowski et al. 2002; Leach et al. 2003). Studies have shown that dissimilarity-based methods, such as the Kennard and Stone algorithm and the removal-until-done algorithm, are more effective than other algorithms in selecting diverse training sets and representative validation sets for developing and validating QSAR/qSAR models (Snarey et al. 1997; Rajer-Kanduc et al. 2003).

Table 1: Methods for selecting training and validation sets

Cluster-based methods
  Hierarchical
    • Single linkage (Leach et al. 2003)
    • Complete linkage (Leach et al. 2003)
    • Group average (Leach et al. 2003)
    • Ward's method (Leach et al. 2003)
    • Centroid method (Leach et al. 2003)
    • Median method (Leach et al. 2003)
  Non-hierarchical
    • K-means (Forgy 1965)
    • Jarvis-Patrick clustering (Jarvis et al. 1973)
    • DBSCAN (Ester et al. 1996)
    • OPTICS (Ankrest et al. 1999)
    • DENCLUE (Han et al. 2001)

Dissimilarity-based methods
  • MaxSum (Snarey et al. 1997)
  • Kennard and Stone algorithm (Kennard et al. 1969)
  • Removal-until-done (Hobohm et al. 1992)
  • Sphere exclusion (Hudson et al. 1996)
  • OptiSim (Clark 1997)
  • IcePick (Mount et al. 1999)
  • Minimum spanning tree error function (Waldman et al. 2000)

Cell-based methods
  • Cummins algorithm (Cummins et al. 1996)
  • Menard algorithm (Menard et al. 1998)
  • Uniform cell coverage (Lam et al. 2002)

Stochastic techniques
  • Techniques using Monte Carlo sampling (Agrafiotis 1996; Hassan et al. 1996)
  • Techniques using genetic algorithms (Sheridan et al. 2000; Gillet et al. 2002)

Statistical experimental designs
  • D-optimal design (Mitchell 1974)
  • Factorial design (Box et al. 1978)

Others
  • Random selection
  • Kohonen’s self-organizing map
  • Informative design (Miller et al. 2002)
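As an illustration of a dissimilarity-based method from Table 1, the Kennard and Stone algorithm can be sketched as below. The sketch assumes Euclidean distance in descriptor space and at least two compounds to select; the function name kennard_stone is hypothetical, and the compounds that are not selected would typically form the validation set.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard and Stone algorithm.

    Starts from the two most distant points, then repeatedly adds the point
    whose distance to its nearest already-selected point is largest.
    Assumes n_select >= 2 and Euclidean distance between descriptor vectors.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select and remaining:
        # sorted() makes tie-breaking deterministic
        candidate = max(sorted(remaining),
                        key=lambda c: min(dist[c][s] for s in selected))
        selected.append(candidate)
        remaining.remove(candidate)
    return selected
```

For five compounds described by a single descriptor with values 0, 1, 2, 5 and 10, selecting three compounds picks the two extremes first and then the compound with value 5, which lies farthest from both.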

References

  • Agrafiotis DK (1996). Stochastic algorithms for maximizing molecular diversity. 3rd Electronic Computational Chemistry Conference.
  • Ankrest M, Breunig M, Kriegel H and Sander J (1999). OPTICS: Ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD International Conference on Management of Data: 49-60.
  • Box GEP, Hunter WG and Hunter JS (1978). Statistics for experimenters: An introduction to design, data analysis, and model building. New York, Wiley.
  • Clark RD (1997). OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. Journal of Chemical Information and Computer Sciences 37(6): 1181-1188.
  • Cummins DJ, Andrews CW, Bentley JA and Cory M (1996). Molecular diversity in chemical databases: Comparison of medicinal chemistry knowledge bases and databases of commercially available compounds. Journal of Chemical Information and Computer Sciences 36(4): 750-763.
  • Daszykowski M, Walczak B and Massart DL (2002). Representative subset selection. Analytica Chimica Acta 468(1): 91-103.
  • Ester M, Kriegel HP, Sander J and Xu X (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining: 226-231.
  • Forgy E (1965). Cluster analysis of multivariate data: Efficiency vs interpretability of classifications. Biometrics 21: 768-780.
  • Gillet VJ, Willett P, Fleming PJ and Green DVS (2002). Designing focused libraries using MoSELECT. Journal of Molecular Graphics and Modelling 20(6): 491-498.
  • Gramatica P, Pilutti P and Papa E (2004). Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling. Journal of Chemical Information and Computer Sciences 44(5): 1794-1802.
  • Han JW and Kamber M (2001). Data mining: Concepts and techniques. San Francisco, Morgan Kaufmann Publishers.
  • Hassan M, Bielawski JP, Hempel JC and Waldman M (1996). Optimization and visualization of molecular diversity of combinatorial libraries. Molecular Diversity 2(1-2): 64-74.
  • Hobohm U, Scharf M, Schneider R and Sander C (1992). Selection of representative protein data sets. Protein Science 1(3): 409-417.
  • Hudson BD, Hyde RM, Rahr E, Wood J and Osman J (1996). Parameter based methods for compound selection from chemical databases. Quantitative Structure-Activity Relationships 15: 285-289.
  • Jarvis RA and Patrick EA (1973). Clustering using a similarity measure based on shared near neighbours. IEEE Transactions on Computers C-22: 1025-1034.
  • Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.
  • Lam RLH, Welch WJ and Young SS (2002). Uniform coverage designs for molecule selection. Technometrics 44(2): 99-109.
  • Leach AR and Gillet VJ (2003). Selecting diverse sets of compounds. An introduction to chemoinformatics. Boston, Kluwer Academic Publisher: 123-145.
  • Menard PR, Mason JS, Morize I and Bauerschmidt S (1998). Chemical space metrics in diversity analysis, library design, and compound selection. Journal of Chemical Information and Computer Sciences 38(6): 1204-1213.
  • Miller JL, Bradley EK and Teig SL (2002). Luddite: An information-theoretic library design tool. Journal of Chemical Information and Computer Sciences 43(1): 47-54.
  • Mitchell TJ (1974). An algorithm for the construction of “D-optimal” experimental designs. Technometrics 16: 203-210.
  • Mount J, Ruppert J, Welch W and Jain AN (1999). IcePick: flexible surface-based system for molecular diversity. Journal of Medicinal Chemistry 42(1): 60-66.
  • Rajer-Kanduc K, Zupan J and Majcen N (2003). Separation of data on the training and test set for modelling: a case study for modelling of five colour properties of a white pigment. Chemometrics and Intelligent Laboratory Systems 65(2): 221-229.
  • Schultz TW, Netzeva TI and Cronin MTD (2003). Selection of data sets for QSARs: analyses of Tetrahymena toxicity from aromatic compounds. SAR and QSAR in Environmental Research 14(1): 59-81.
  • Sheridan RP, SanFeliciano SG and Kearsley SK (2000). Designing targeted libraries with genetic algorithms. Journal of Molecular Graphics and Modelling 18(4-5): 320-334.
  • Snarey M, Terrett NK, Willett P and Wilton DJ (1997). Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling 15(6): 372-385.
  • Waldman M, Li H and Hassan M (2000). Novel algorithms for the optimization of molecular diversity of combinatorial libraries. Journal of Molecular Graphics and Modelling 18(4-5): 412-426.
  • Wold S and Eriksson L (1995). Statistical validation of QSAR results. Chemometric methods in molecular design. van de Waterbeemd H. Weinheim; New York; Basel; Cambridge; Tokyo, VCH: 309-318.
