Overfitting

Good predictive capability alone is not sufficient for a QSAR/qSAR model. A good-quality QSAR/qSAR model must also be free of overfitting. There are two main types of overfitting: (1) using a model that is more flexible than it needs to be and (2) using a model that includes irrelevant descriptors (Hawkins 2004). Various methods can be used to prevent or to check for these two types of overfitting.

A number of different QSAR/qSAR models can be developed using machine learning methods of varying complexity. The QSAR/qSAR model with the best balance between the complexity of the machine learning method and its predictive capability is the one most suitable for predicting the activity of a compound. Selecting the model in this way avoids using a QSAR/qSAR model that is more flexible than necessary.
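The selection described above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it assumes polynomial degree as a stand-in for model complexity, k-fold cross-validated RMSE as the measure of predictive capability, and a 5% relative tolerance for deciding that a simpler model is "good enough". The data, the function names `kfold_rmse` and `simplest_adequate`, and the tolerance are all hypothetical.

```python
import numpy as np

def kfold_rmse(x, y, degree, k=5, seed=0):
    """Cross-validated RMSE of a polynomial model of the given degree."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    sq_err = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # all indices not in this fold
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        sq_err.extend((pred - y[fold]) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

def simplest_adequate(scores, tolerance=0.05):
    """Lowest complexity whose error is within `tolerance` (relative) of the best."""
    best = min(scores.values())
    return min(c for c, s in scores.items() if s <= best * (1 + tolerance))

# Hypothetical data: one descriptor x, activity y with a quadratic trend plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# Score models of increasing flexibility, then keep the simplest adequate one.
scores = {d: kfold_rmse(x, y, d) for d in range(1, 9)}
chosen_degree = simplest_adequate(scores)
```

Preferring the simplest model within a tolerance of the best score, rather than the single best score, is one way to express the balance between complexity and predictive capability; other criteria are equally possible.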

A frequently used method for checking whether a QSAR/qSAR model is overfitted is to compare its predictive capability as estimated by cross-validation with that estimated on an independent validation set (Hawkins 2004). Even though cross-validation tends to give a pessimistic estimate of the predictive capability of a QSAR/qSAR model, the two estimates should not differ greatly for a model that is not overfitted.
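A minimal sketch of this check is given below, assuming an ordinary least-squares model, 5-fold cross-validation, and R² as the measure of predictive capability. The synthetic data, the 70/30 split, and all function names are illustrative assumptions; what counts as a "large" gap is left to the reader.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def cross_validated_q2(X, y, k=5, seed=0):
    """R^2 computed from predictions made on held-out folds only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    pred = np.empty_like(y, dtype=float)
    for fold in folds:
        train = np.ones(len(y), dtype=bool)
        train[fold] = False
        pred[fold] = predict_linear(fit_linear(X[train], y[train]), X[fold])
    return r_squared(y, pred)

# Hypothetical data set: 100 compounds, 3 descriptors, linear activity plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# First 70 compounds for training, last 30 as the independent validation set.
X_train, y_train = X[:70], y[:70]
X_ext, y_ext = X[70:], y[70:]

q2_cv = cross_validated_q2(X_train, y_train)
r2_ext = r_squared(y_ext, predict_linear(fit_linear(X_train, y_train), X_ext))
gap = abs(q2_cv - r2_ext)  # a large gap would suggest overfitting
```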

Y-randomization is commonly used to determine the probability of chance correlation during descriptor selection (Manly 1997; Leardi and González 1998). In classification problems, a portion of the compounds in the training set belonging to the positive data class (D+) is randomly exchanged with compounds in the training set belonging to the negative data class (D-), creating new training sets with false D+ and D- compounds. For regression problems, the activities of all the compounds in the training set are randomly rearranged. The machine learning method is then trained on this scrambled training set. The randomization is repeated a number of times, and the predictive capability of the scrambled QSAR/qSAR model from each run is compared with that of the original QSAR/qSAR model. If the scrambled training sets give significantly lower predictive capability than the original training set, it can be concluded that the original QSAR/qSAR model is relevant and unlikely to have arisen from chance correlation.
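The regression case of this procedure can be sketched as follows. The sketch assumes an ordinary least-squares model scored by its fitted R², which stands in for whatever performance measure is actually used (a cross-validated estimate could be substituted). The data, the number of runs, and the names `fit_r2` and `y_randomization` are hypothetical.

```python
import numpy as np

def fit_r2(X, y):
    """R^2 of an ordinary least-squares fit (stand-in for any performance metric)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

def y_randomization(X, y, n_runs=200, seed=0):
    """Score the original model and n_runs models trained on scrambled activities."""
    rng = np.random.default_rng(seed)
    original = fit_r2(X, y)
    # Each run randomly rearranges the activities while leaving the descriptors untouched.
    scrambled = np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_runs)])
    # Empirical p-value: share of scrambled runs scoring at least as well as the
    # original; the +1 terms count the original arrangement itself.
    p_value = (np.sum(scrambled >= original) + 1) / (n_runs + 1)
    return original, scrambled, float(p_value)

# Hypothetical data: 50 compounds, 4 descriptors, activity driven by the first two.
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=50)

original, scrambled, p_value = y_randomization(X, y)
```

A small `p_value` means that scrambled training sets rarely match the original score, which is the situation in which chance correlation is an unlikely explanation for the original model.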

To determine whether the descriptors selected for the original QSAR/qSAR model include any that are irrelevant for predicting the activity of a compound, different groups of QSAR/qSAR models can be generated using the descriptor selection method. Each group contains a fixed number of QSAR/qSAR models, all with the same number of descriptors, and the number of descriptors differs from group to group. The predictive capabilities of the QSAR/qSAR models in each group are determined, and the group averages are compared to find the optimal number of descriptors for predicting the activity of a compound. If the optimal number of descriptors coincides with the number of descriptors in the original QSAR/qSAR model, the original model is unlikely to contain irrelevant descriptors.
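One possible sketch of this comparison is shown below. It makes several assumptions that are not part of the description above: greedy forward selection stands in for the descriptor selection method, each model in a group is built on a different random 70/30 split of the data, and held-out R² of a least-squares model is the measure of predictive capability. The data and all function names are hypothetical.

```python
import numpy as np

def ols_predict(X_train, y_train, X_test):
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def r2(y_true, y_pred):
    return float(1.0 - np.sum((y_true - y_pred) ** 2)
                 / np.sum((y_true - y_true.mean()) ** 2))

def forward_select(X, y, n_desc):
    """Greedy forward selection: a simple stand-in for the descriptor selection method."""
    chosen = []
    for _ in range(n_desc):
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        # Add the descriptor that gives the best training fit with those already chosen.
        best = max(remaining,
                   key=lambda j: r2(y, ols_predict(X[:, chosen + [j]], y,
                                                   X[:, chosen + [j]])))
        chosen.append(best)
    return chosen

def mean_score_per_size(X, y, sizes, n_models=20, seed=0):
    """Average held-out R^2 for each group of models with the same descriptor count."""
    rng = np.random.default_rng(seed)
    n_train = int(0.7 * len(y))
    # The same random splits are reused for every group so the groups are comparable.
    splits = [rng.permutation(len(y)) for _ in range(n_models)]
    means = {}
    for m in sizes:
        scores = []
        for idx in splits:
            train, test = idx[:n_train], idx[n_train:]
            cols = forward_select(X[train], y[train], m)
            pred = ols_predict(X[train][:, cols], y[train], X[test][:, cols])
            scores.append(r2(y[test], pred))
        means[m] = float(np.mean(scores))
    return means

# Hypothetical data: 80 compounds, 10 descriptors, only the first 3 are relevant.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = (2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
     + rng.normal(scale=0.5, size=80))

sizes = range(1, 11)
means = mean_score_per_size(X, y, sizes)
optimal_size = max(means, key=means.get)  # descriptor count with the best group average
```

Here `optimal_size` would be compared with the number of descriptors in the original model; agreement between the two suggests that the original model does not carry irrelevant descriptors.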

References

  • Hawkins DM (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences 44(1): 1-12.
  • Leardi R and González AL (1998). Genetic algorithms applied to feature selection in PLS regression: How and when to use them. Chemometrics and Intelligent Laboratory Systems 41(2): 195-207.
  • Manly BFJ (1997). Randomization, bootstrap and Monte Carlo methods in biology. London, Chapman and Hall.