Archive for the ‘Tutorial’ Category

Removal-until-done algorithm

Wednesday, June 4th, 2008

Besides the Kennard and Stone algorithm, another algorithm for dividing a dataset into a training set and a validation set is the removal-until-done algorithm. In this algorithm, compounds are sequentially removed from the dataset in pairs and placed in the training and validation sets until a defined similarity threshold is reached or the desired number of compounds has been selected for the validation set. The compounds to be removed are selected based on their distribution in the chemical space. Here, chemical space is defined by the structural and chemical descriptors used to represent a compound: each descriptor is one dimension of a multidimensional space, and each compound occupies a particular location in it. All possible pairs of the compounds in the dataset are generated and a similarity score is computed for each pair. The pairs are then ranked by their similarity scores, and compounds with similar structural and chemical features are evenly assigned to the training and validation sets. Compounds without enough structurally and chemically similar counterparts are assigned to the training set.
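The description above leaves several details open, so the following is only a rough sketch of one way the pairing could work. The similarity score (1 / (1 + Euclidean distance)), the function names and the rule that the first member of a pair goes to training and the second to validation are all assumptions made for illustration, not part of the original algorithm description.

```python
import math
from itertools import combinations


def similarity(a, b):
    """Hypothetical similarity score: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(a, b))


def removal_until_done(descriptors, n_validation, min_similarity=0.0):
    """Split compound indices into (training, validation) lists.

    Pairs are visited from most to least similar. When both members of a
    pair are still unassigned, one goes to training and one to validation.
    Compounds left over at the end go to the training set.
    """
    pairs = sorted(
        combinations(range(len(descriptors)), 2),
        key=lambda p: similarity(descriptors[p[0]], descriptors[p[1]]),
        reverse=True,
    )
    training, validation, assigned = [], [], set()
    for i, j in pairs:
        if len(validation) >= n_validation:
            break  # desired validation set size reached
        if similarity(descriptors[i], descriptors[j]) < min_similarity:
            break  # all remaining pairs are below the similarity threshold
        if i in assigned or j in assigned:
            continue  # one member of this pair has already been removed
        training.append(i)
        validation.append(j)
        assigned.update((i, j))
    # Compounds without a sufficiently similar partner stay in training.
    training.extend(k for k in range(len(descriptors)) if k not in assigned)
    return sorted(training), sorted(validation)
```

With five compounds of which two pairs are close neighbours and one is an outlier, each close pair is split between the two sets and the outlier ends up in the training set.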

Kennard and Stone algorithm

Monday, June 2nd, 2008

I mentioned the Kennard and Stone algorithm when I was reviewing the various machine learning software. So what exactly is the Kennard and Stone algorithm? In a nutshell, the procedure is as follows. The two compounds with the largest Euclidean distance between them are initially selected for the training set. The remaining compounds for the training set are selected one at a time by maximizing the minimum distance between the compounds already in the training set and the rest of the compounds in the dataset. This selection process continues until the desired number of compounds has been selected for the training set. The remaining compounds in the dataset are used as the validation set (Kennard et al. 1969).
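The procedure can be sketched in a few lines of Python. This is a minimal, unoptimized illustration that assumes each compound is given as a tuple of descriptor values; the function name and the tie-breaking by lowest index are my own choices.

```python
import math
from itertools import combinations


def kennard_stone(descriptors, n_training):
    """Return (training, validation) index lists chosen by Kennard-Stone."""
    n = len(descriptors)
    if not 2 <= n_training <= n:
        raise ValueError("n_training must be between 2 and the dataset size")
    # Step 1: start with the two compounds that are farthest apart.
    first, second = max(
        combinations(range(n), 2),
        key=lambda p: math.dist(descriptors[p[0]], descriptors[p[1]]),
    )
    training = [first, second]
    remaining = set(range(n)) - {first, second}
    # Step 2: repeatedly add the candidate whose nearest training compound
    # is farthest away (maximize the minimum distance).
    while len(training) < n_training:
        best = max(
            sorted(remaining),
            key=lambda c: min(
                math.dist(descriptors[c], descriptors[t]) for t in training
            ),
        )
        training.append(best)
        remaining.remove(best)
    # Whatever is left over becomes the validation set.
    return training, sorted(remaining)
```

For five compounds on a line at 0, 1, 2, 5 and 10, asking for three training compounds first picks the two extremes (0 and 10) and then the compound at 5, leaving those at 1 and 2 for validation.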

References

  • Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.

Consensus methods for regression models

Monday, May 26th, 2008

The simplest, and a frequently used, method for building a consensus regression model is to average the biological property values predicted for a compound by the different regression models.

Another method is to use a weighted average of the predicted biological properties. The difficulty with a weighted average, however, lies in determining appropriate weights for the different regression models. One option is to measure the performance of each regression model by its R2, mean square error, or mean absolute error and assign higher weights to the models that perform better.
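Both variants fit in one small function. The sketch below is illustrative only; in particular, weighting each model by the inverse of its mean square error is just one hypothetical scheme that gives better models higher weights, not a recommendation from the literature.

```python
def consensus_prediction(predictions, weights=None):
    """Combine predictions from several regression models for one compound."""
    if weights is None:
        weights = [1.0] * len(predictions)  # plain (unweighted) average
    if len(weights) != len(predictions):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


def weights_from_mse(mse_values):
    """Hypothetical weighting scheme: weight each model by 1 / MSE."""
    return [1.0 / m for m in mse_values]
```

For example, two models predicting 1.0 and 4.0 with mean square errors of 0.5 and 1.0 get weights 2 and 1, so the weighted consensus is pulled towards the better model.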

Consensus methods for classification models

Saturday, May 24th, 2008

I have used two types of consensus methods in my research (Yap et al. 2005). The first is a ‘positive majority’ consensus method, which classifies a compound as positive if the majority of the models classify the compound as positive (Eriksson et al. 2003). This consensus method requires an odd number of models to prevent ambiguity in its prediction. The second is a ‘positive probability’ consensus method, which explicitly computes the probability for a compound to be positive using the following formulas (McDowell et al. 2002):

$$P_i = \frac{\alpha_i^{+} P_{i-1}}{\alpha_i^{+} P_{i-1} + (1 - \alpha_i^{-})(1 - P_{i-1})}$$ … (1)
$$P_i = \frac{(1 - \alpha_i^{+}) P_{i-1}}{(1 - \alpha_i^{+}) P_{i-1} + \alpha_i^{-} (1 - P_{i-1})}$$ … (2)

where $P_i$ is the posterior probability that a compound is positive given the classification result from model i, $P_{i-1}$ is the probability before that result is taken into account ($P_0$ being the prior probability), and $\alpha_i^{+}$ and $\alpha_i^{-}$ are the sensitivity and specificity of model i respectively. Equation (1) is used when model i classifies the compound as positive and equation (2) when it classifies the compound as negative. In the absence of knowledge about the ratio of positive to negative compounds in the population, the prior probability of a compound being positive can be tentatively set at 0.5. In practice, the actual value of the prior probability is unimportant if a large number of models are used for the consensus process.
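As a rough sketch of the two consensus methods: the 'positive probability' function below assumes the formulas are a plain Bayesian update in which the posterior from one model becomes the prior for the next, using each model's sensitivity and specificity. That reading, and the function names, are my assumptions rather than a transcription of McDowell et al.

```python
def positive_majority(votes):
    """True if more than half of the models classify the compound as positive."""
    return sum(votes) > len(votes) / 2


def positive_probability(votes, sensitivities, specificities, prior=0.5):
    """Probability that a compound is positive after all models have voted.

    votes[i] is True when model i classifies the compound as positive.
    """
    p = prior
    for vote, se, sp in zip(votes, sensitivities, specificities):
        if vote:
            # Positive result: true-positive rate versus false-positive rate.
            p = se * p / (se * p + (1.0 - sp) * (1.0 - p))
        else:
            # Negative result: false-negative rate versus true-negative rate.
            p = (1.0 - se) * p / ((1.0 - se) * p + sp * (1.0 - p))
    return p
```

With a prior of 0.5, a single positive vote from a model with 80% sensitivity and 90% specificity raises the probability to 0.4 / 0.45, or about 0.89, while a negative vote from the same model lowers it to 0.1 / 0.55, or about 0.18.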

References

  • Eriksson L, Jaworska J, Cronin M, Worth A, Gramatica P and McDowell R (2003). Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environmental Health Perspectives 111(10): 1361-1375.
  • McDowell R and Jaworska J (2002). Bayesian analysis and inference from QSAR predictive model results. SAR and QSAR in Environmental Research 13: 111-125.
  • Yap CW and Chen YZ (2005). Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. Journal of Chemical Information and Modeling 45(4): 982-992.

Functional dependence study of QSAR models

Tuesday, May 20th, 2008

A functional dependence study can provide insights on the type of molecular characteristics that are important for a particular biological property and how changes in these molecular characteristics affect the biological property. This information is useful for guiding structural changes during computer-aided drug design so that the desired biological property can be obtained. It is also useful for validating a QSAR model. A valid QSAR model should be consistent with previous findings of important factors that affect the biological property.

For QSAR models developed from linear modeling methods, the descriptors are either positively or negatively correlated to biological properties in a linear relationship. In contrast, descriptors in models developed by using machine learning methods correlate to biological properties in a non-linear relationship. Thus these models can potentially provide more information about the relationships between descriptors and biological properties.

The relationships between descriptors and biological properties can be obtained by using functional dependence plots, where the value of a single descriptor is varied through its range while all other descriptors are held constant at a certain value (Wessel et al. 1998). However, QSAR models usually contain descriptors that are correlated with one another, and these intercorrelations can drastically alter the shape of a functional dependence plot if the values of the descriptors that are held constant are changed (Andrea et al. 1991). In addition, descriptors may encode multiple physicochemical and structural aspects of the molecule. This makes it difficult to determine the relationship between a specific molecular characteristic and a biological property.

Principal component analysis (PCA) can be used to overcome both problems (Yap et al. 2005). PCA can extract dominant patterns in the descriptor subsets and group similar descriptors under a single principal component (PC). Different PCs encode different molecular characteristics, and the orthogonality among the PCs can be exploited to determine the correlation between a molecular characteristic and a biological property without the influence of other molecular characteristics. A descriptor may belong to multiple PCs, and the explained variation of a descriptor in each PC can be used to determine its level of contribution to that PC (Eriksson et al. 2001). Artificial testing sets can be created to determine the relationship between the PCs and the biological property. Each artificial testing set contains 1000 artificial compounds and initially uses the PCs as its descriptors. The PC to be evaluated is varied uniformly from -5 to 5 while all of the other PCs are assigned a value of zero. The loadings derived from PCA are then used to transform the PCs back to the original molecular descriptors. Artificial compounds with molecular descriptors outside the range of the corresponding descriptor in the training set are removed to prevent extrapolation of the model. The values of the biological property of the remaining artificial compounds are predicted by using the developed QSAR models. Functional dependence plots of the biological property against the PCs can then be used to find the trends between various molecular characteristics and the biological property.

References

  • Andrea TA and Kalayeh H (1991). Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors. Journal of Medicinal Chemistry 34: 2824-2836.
  • Eriksson L, Johansson E, Kettaneh-Wold N and Wade KM (2001). PCA. Multi- and megavariate data analysis - Principles and applications. Umea, Sweden, Umetrics AB: 43-70.
  • Wessel MD, Jurs PC, Tolan JW and Muskal SM (1998). Prediction of human intestinal absorption of drug compounds from molecular structure. Journal of Chemical Information and Computer Sciences 38(4): 726-735.
  • Yap CW and Chen YZ (2005). Quantitative structure-pharmacokinetic relationships for drug distribution properties by using general regression neural network. Journal of Pharmaceutical Sciences 94(1): 153-168.

Overfitting

Tuesday, March 11th, 2008

It is not sufficient for a QSAR/qSAR model to have good predictive capability. A second requirement for a good quality QSAR/qSAR model is that it must not suffer from overfitting. There are two main types of overfitting: (1) using a model that is more flexible than it needs to be and (2) using a model that includes irrelevant descriptors (Hawkins 2004). There are various methods that can be used to prevent or to check for these two types of overfitting.

A number of different QSAR/qSAR models can be developed using machine learning methods of varying complexities. The QSAR/qSAR model with the best balance between complexity of the machine learning method used and its predictive capability is the one that is most suitable for predicting the activity of a compound. This method prevents the use of a QSAR/qSAR model that is more flexible than is necessary.

A frequently used method for checking whether a QSAR/qSAR model is overfitted is to compare its prediction capability determined by using cross-validation methods with that determined by using independent validation sets (Hawkins 2004). Even though cross-validation methods tend to give a pessimistic estimate of the predictive capability of a QSAR/qSAR model, a model that is not overfitted should not show large differences between the estimates of its predictive capability from cross-validation methods and from independent validation sets.

Y-randomization is commonly used to determine the probability of chance correlation during descriptor selection (Manly 1997; Leardia et al. 1998). In classification problems, a portion of the compounds in the training set belonging to the positive data class (D+) is randomly exchanged with compounds in the training set belonging to the negative data class (D-), creating new training sets with false D+ and D- compounds. For regression problems, the activities of all the compounds in the training set are randomly rearranged. The machine learning method is then trained using this scrambled training set. The randomization is repeated a number of times and the prediction capabilities of the scrambled QSAR/qSAR models from each run are compared with that of the original QSAR/qSAR model. If the scrambled training sets give significantly lower prediction capabilities than the original training set, it can be concluded that the original QSAR/qSAR model is relevant and unlikely to have arisen as a result of chance correlation.
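For the regression case, the scrambling loop might look like the sketch below. To keep it self-contained it uses a one-descriptor least-squares line as a stand-in for the machine learning method and the fitted r2 as a stand-in for the prediction capability; both are simplifying assumptions, and in practice the actual model and a cross-validated or validation-set statistic would take their place.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a single descriptor."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def y_randomization(x, y, n_runs=100, seed=0):
    """Return (original r2, list of r2 values from scrambled activities)."""
    def score(activities):
        slope, intercept = fit_line(x, activities)
        return r_squared(activities, [slope * a + intercept for a in x])

    rng = random.Random(seed)
    scrambled_scores = []
    for _ in range(n_runs):
        shuffled = list(y)
        rng.shuffle(shuffled)  # rearrange the activities among the compounds
        scrambled_scores.append(score(shuffled))
    return score(y), scrambled_scores
```

If the original score sits well above the whole distribution of scrambled scores, the original model is unlikely to be a chance correlation.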

In order to determine whether the selected descriptors of the original QSAR/qSAR model include any that are irrelevant for predicting the activity of a compound, different groups of QSAR/qSAR models, each containing a different number of descriptors, can be generated by using the descriptor selection method. Each group contains a fixed number of QSAR/qSAR models having the same number of descriptors. The prediction capabilities of the QSAR/qSAR models in each group are determined, and the average prediction capabilities of the groups are compared to determine the optimal number of descriptors for predicting the activity of a compound. If the optimal number of descriptors coincides with the number of descriptors in the original QSAR/qSAR model, the original model is unlikely to contain irrelevant descriptors.

References

  • Hawkins DM (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences 44(1): 1-12.
  • Leardia R and González AL (1998). Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemometrics and Intelligent Laboratory Systems 41(2): 195-207.
  • Manly BFJ (1997). Randomization bootstrap and Monte Carlo methods in biology. London, Chapman and Hall.

Methods for measuring predictive capability of QSAR models

Sunday, March 9th, 2008

The following statistics are commonly calculated to determine the predictive capability of a QSAR model.

rsquare.jpg
mse.jpg
mae.jpg
folderror.jpg
averagefolderror.jpg

The r2 value measures the explained variance between the predicted and actual activity values. The fold-error of a compound measures the degree of overprediction or underprediction for a compound and is useful for identifying chemical structures which are not well-represented by the QSAR model. The average-fold error avoids the cases in which poor overpredictions are cancelled by equally poor underpredictions. A QSAR model that predicts an activity value perfectly gives an average-fold error of 1 and a model with an average-fold error of less than 2 is considered to be a successful one (Obach et al. 1997).
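The exact formulas are not reproduced in the text, so the sketch below falls back on common definitions, which may differ in detail from the ones intended here. I am assuming r2 is the coefficient of determination, the fold-error of a compound is the larger of predicted/actual and actual/predicted, and the average-fold error is 10 raised to the mean absolute log10 ratio. All activity values are assumed to be positive.

```python
import math


def mean_square_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Assumed definition: 1 - residual sum of squares / total sum of squares."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def fold_error(actual_value, predicted_value):
    """Assumed definition: always >= 1, equal to 1 for a perfect prediction."""
    ratio = predicted_value / actual_value
    return max(ratio, 1.0 / ratio)


def average_fold_error(actual, predicted):
    """Assumed definition: 10 ** mean(|log10(predicted / actual)|)."""
    logs = [abs(math.log10(p / a)) for a, p in zip(actual, predicted)]
    return 10 ** (sum(logs) / len(logs))
```

Because the absolute value is taken before averaging, a two-fold overprediction and a two-fold underprediction both push the average-fold error up instead of cancelling out.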

References

  • Obach RS, Baxter JG, Liston TE, Silber BM, Jones BC, Macintyre F, Rance DJ and Wastall P (1997). The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. Journal of Pharmacology and Experimental Therapeutics 283(1): 46-58.

Methods for measuring predictive capability of qSAR models

Friday, March 7th, 2008

The following statistics are usually calculated to determine the predictive capability of a qSAR model.

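$$SE = \frac{TP}{TP + FN}$$
$$SP = \frac{TN}{TN + FP}$$
$$Q = \frac{TP + TN}{TP + TN + FP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$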

where MCC is the Matthews correlation coefficient (Matthews 1975), TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. Sensitivity (SE) and specificity (SP) are the classification accuracies of a qSAR model for the positive and negative data classes respectively. Overall accuracy (Q) is the classification accuracy of the qSAR model for both positive and negative data classes. The shortcoming of the overall accuracy is that an imbalance in the data classes may result in a high overall accuracy even if either sensitivity or specificity is low. For example, a qSAR model which has a sensitivity of 100% and a specificity of 0% will have an overall accuracy of 90% for a validation set that has 9 times more compounds of the positive data class than of the negative data class. Thus MCC, which is a weighted measure, is increasingly being used to measure the predictive capability of qSAR models. An MCC value of 1 indicates that the qSAR model can predict the data classes of unknown compounds perfectly, an MCC value of 0 is expected for a qSAR model that is no better than random guessing, and an MCC value of -1 indicates total disagreement between the predicted data classes and the actual data classes. For the above example, MCC is taken as 0 (both its numerator and its denominator are 0 because the model never predicts the negative class, and MCC is conventionally set to 0 in that case), which is a more accurate representation of the predictive capability of the model.
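The worked example can be checked with a short function. This sketch uses the standard definitions of the four statistics and assumes that both data classes are present in the validation set; returning 0 for MCC when its denominator is 0 is a common convention that I have adopted here.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, overall accuracy and MCC from the four counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    q = (tp + tn) / (tp + tn + fp + fn)
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # MCC is undefined (0/0) when a whole row or column of the confusion
    # matrix is empty; by convention it is reported as 0 in that case.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SE": se, "SP": sp, "Q": q, "MCC": mcc}
```

Calling `classification_metrics(tp=90, tn=0, fp=10, fn=0)` reproduces the example: sensitivity 1.0, specificity 0.0, overall accuracy 0.9 and MCC 0.0.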

References

  • Matthews BW (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta 405(2): 442-451.

Performance evaluation of a QSAR/qSAR model

Wednesday, March 5th, 2008

One of the objectives of QSAR/qSAR modeling is to allow prediction of the activities of compounds which have not been biologically tested. Thus it is important to determine the ability of the developed QSAR/qSAR model to predict the activities of compounds that are not present in the training set. There are two methods which are commonly used to determine the predictive capability of a QSAR/qSAR model (Wold et al. 1995). The first method is the use of cross-validation, which includes leave-one-out (LOO) and k-fold cross-validation. In LOO, a compound is left out of the training set and the remaining compounds are used to train the machine learning method. The derived QSAR/qSAR model is then used to predict the activity of the left-out compound. This process is repeated until every compound in the training set has been left out once. In k-fold cross-validation, the training set is randomly divided into k mutually exclusive subsets of approximately equal size. k - 1 of the subsets are combined to form a modeling training set for developing a QSAR/qSAR model. The remaining subset is used as a modeling testing set to assess the predictive capability of the QSAR/qSAR model. This process is repeated until k QSAR/qSAR models have been developed and each subset has been used as a modeling testing set once.
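The splitting step of k-fold cross-validation can be sketched as follows; the function name and the fixed random seed are illustrative choices. Setting k equal to the number of compounds gives leave-one-out.

```python
import random


def k_fold_splits(n_compounds, k, seed=0):
    """Yield (modeling training indices, modeling testing indices) k times."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)  # random division of the training set
    # Deal the shuffled indices into k mutually exclusive subsets whose
    # sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test_fold)
```

Each compound appears in exactly one modeling testing set, so collecting the predictions from all k models gives one cross-validated prediction per compound.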

There are reports of a lack of correlation between cross-validation results and the prediction capability of a QSAR/qSAR model (Golbraikh et al. 2002; Kozak et al. 2003; Reunanen 2003; Olsson et al. 2004). Moreover, cross-validation methods tend to underestimate the prediction capability of a QSAR/qSAR model, especially if important molecular features are present in only a minority of the compounds in the training set (Mosier et al. 2002; Hawkins et al. 2004). Thus a model with low cross-validation results can still be quite predictive (Mosier et al. 2002). This has led some studies to suggest that an independent validation set may provide a more reliable estimate of the prediction capability of a QSAR/qSAR model (Wold et al. 1995; Golbraikh et al. 2002). Despite these disadvantages, cross-validation methods are still useful for assessing QSAR/qSAR models during optimization of the parameters of machine learning methods and during descriptor selection.

A validation set should ideally be obtained independently of the training set. However, validation sets are usually constructed by using statistical molecular design because of the limited availability of high-quality activity data. Regardless of the method used to obtain a validation set, a good validation set should be representative of the training set so that it can properly assess the prediction capabilities of the QSAR/qSAR model (Tropsha et al. 2003).

References

  • Golbraikh A and Tropsha A (2002). Beware of q2! Journal of Molecular Graphics and Modelling 20(4): 269-276.
  • Hawkins DM, Basak SC and Mills D (2004). Assessing model fit by cross-validation. Journal of Chemical Information and Computer Sciences 43(2): 579-586.
  • Kozak A and Kozak R (2003). Does cross validation provide additional information in the evaluation of regression models? Canadian Journal of Forest Research 33(6): 976-987.
  • Mosier PD and Jurs PC (2002). QSAR/QSPR studies using probabilistic neural networks and generalized regression neural networks. Journal of Chemical Information and Computer Sciences 42(6): 1460-1470.
  • Olsson I-M, Gottfries J and Wold S (2004). D-optimal onion designs in statistical molecular design. Chemometrics and Intelligent Laboratory Systems 73(1): 37-46.
  • Reunanen J (2003). Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research 3: 1371-1382.
  • Tropsha A, Gramatica P and Gombar VK (2003). The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR & Combinatorial Science 22(1): 69-77.
  • Wold S and Eriksson L (1995). Statistical validation of QSAR results. Chemometric methods in molecular design. van de Waterbeemd H. Weinheim; New York; Basel; Cambridge; Tokyo, VCH: 309-318

Diversity and representativity of datasets

Tuesday, March 4th, 2008

The diversity of a dataset can be estimated by a diversity index (DI) which is the average value of the similarity between all of the pairs of compounds in that dataset (Perez 2005):

$$DI = \frac{\sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} sim(i,j)}{n(n-1)}$$

where sim(i,j) is a measure of the similarity between compounds i and j, and n is the number of compounds in the dataset. The diversity of a dataset increases with decreasing DI. The similarity between two compounds i and j is commonly described by the Tanimoto coefficient (Potter et al. 1998; Willett et al. 1998; Molnar et al. 2002):

similarity.jpg

where p is the number of descriptors of the compounds in the dataset. The mean maximum Tanimoto coefficient of the compounds in dataset A and those in dataset B can be used as a representativity index (RI) to measure the level of representativity of dataset A by dataset B. Dataset B is more representative of dataset A if the RI value between dataset A and B is higher.
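Both indices can be computed directly from the descriptor vectors. The sketch below assumes the usual Tanimoto coefficient for continuous descriptors, and it reads the RI of dataset A by dataset B as the mean, over the compounds in A, of their highest similarity to any compound in B; that direction is my interpretation of the sentence above. It also assumes no compound has an all-zero descriptor vector.

```python
from itertools import combinations


def tanimoto(x, y):
    """Tanimoto coefficient for two vectors of continuous descriptors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)


def diversity_index(dataset):
    """Average similarity over all pairs of distinct compounds."""
    pairs = list(combinations(dataset, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def representativity_index(dataset_a, dataset_b):
    """Mean over A of each compound's maximum similarity to a compound in B."""
    best = [max(tanimoto(a, b) for b in dataset_b) for a in dataset_a]
    return sum(best) / len(best)
```

Averaging over unordered pairs gives the same DI as averaging over all ordered pairs with i different from j, since the similarity is symmetric.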

References

  • Molnar L and Keseru GM (2002). A neural network based virtual screening of cytochrome P450 3A4 inhibitors. Bioorganic and Medicinal Chemistry Letters 12(3): 419-421.
  • Perez JJ (2005). Managing molecular diversity. Chemical Society Reviews 34(2): 143-152.
  • Potter T and Matter H (1998). Random or rational design? Evaluation of diverse compound subsets from chemical structure databases. Journal of Medicinal Chemistry 41(4): 478-488.
  • Willett P, Barnard JM and Downs GM (1998). Chemical similarity searching. Journal of Chemical Information and Computer Sciences 38(6): 983-996.
