Molecular descriptors - Selection
The purpose of descriptor selection is to remove descriptors that are irrelevant to, or have a negligible effect on, the activity of the compounds, so as to improve the computation speed, performance and interpretability of predictive models. Irrelevant and redundant descriptors are removed by a filter approach, a wrapper approach, or a combination of the two. The filter approach is independent of the in silico method and is frequently used to remove redundant descriptors or descriptors of low information content. Descriptors are chosen or removed based on one or more of the following considerations: prior knowledge of the factors affecting a particular activity, the properties of the descriptors (e.g. variance), the correlation between different descriptors, and the distribution of the descriptor values in different data classes. In the wrapper approach, a descriptor selection algorithm is incorporated into an in silico classification method, so that candidate descriptor subsets are judged by the performance of the resulting model (Guyon et al. 2003).
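
As an illustration of the filter approach, the following is a minimal sketch that applies two of the considerations listed above: it drops near-constant descriptors and then greedily removes descriptors that are highly correlated with one already kept. The function names, the thresholds (`min_variance`, `max_abs_r`) and the toy descriptor values are assumptions made for this example; they do not come from any of the cited methods.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_descriptors(descriptors, min_variance=1e-6, max_abs_r=0.95):
    """Return the names of descriptors that survive both filters.

    descriptors: dict mapping descriptor name -> list of values (one per compound).
    Thresholds are illustrative assumptions, not recommended defaults.
    """
    # Step 1: remove descriptors with (almost) no variance across compounds.
    kept = {n: v for n, v in descriptors.items() if pvariance(v) > min_variance}

    # Step 2: keep a descriptor only if it is not highly correlated
    # with any descriptor that has already been selected.
    selected = []
    for name, values in kept.items():
        if all(abs(pearson(values, kept[s])) < max_abs_r for s in selected):
            selected.append(name)
    return selected


# Hypothetical descriptor values for four compounds.
toy = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "MW": [100.0, 210.0, 290.0, 405.0],   # nearly collinear with logP here
    "const": [1.0, 1.0, 1.0, 1.0],        # zero variance
    "TPSA": [40.0, 20.0, 60.0, 30.0],
}
print(filter_descriptors(toy))
```

Because the pruning in step 2 is greedy, the result depends on the order in which the descriptors are visited; which member of a correlated pair is retained is therefore arbitrary in this sketch.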
In many cases, it is difficult to select a unique optimum set of descriptors because many descriptors are highly redundant and overlap with one another (Gramatica et al. 2004). Separate sets of descriptors containing different members of redundant descriptor classes have been found to give similar prediction accuracies (Izrailev et al. 2004). In these cases, the prediction results are more appropriately interpreted at the descriptor class level, where redundant and overlapping descriptors are grouped into one class. Table 1 lists the common descriptor selection methods used in QSAR studies.
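
One possible way to form such descriptor classes is sketched below: descriptors whose absolute pairwise correlation reaches a threshold are merged into the same group. Using correlation as the grouping criterion, the threshold `min_abs_r`, the function names and the toy values are all assumptions for illustration; the cited studies may define descriptor classes differently.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def group_descriptors(descriptors, min_abs_r=0.9):
    """Group descriptors into classes of mutually linked, highly correlated members.

    Two descriptors end up in the same class if they are connected by a chain
    of pairs with |r| >= min_abs_r (connected components via union-find).
    """
    names = list(descriptors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the class representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(descriptors[a], descriptors[b])) >= min_abs_r:
                parent[find(b)] = find(a)

    classes = {}
    for n in names:
        classes.setdefault(find(n), []).append(n)
    return list(classes.values())


# Hypothetical descriptor values for four compounds.
toy = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "MW": [100.0, 210.0, 290.0, 405.0],
    "TPSA": [40.0, 20.0, 60.0, 30.0],
}
print(group_descriptors(toy))
```

A model that selects "logP" and another that selects "MW" would then be read as relying on the same descriptor class, which is the level of interpretation suggested above.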
Table 1: Common descriptor selection methods used in QSAR studies
| Filter methods | Wrapper methods |
| --- | --- |
| Remove descriptors with low variance | Forward selection (Xu et al. 2001) |
| Remove highly correlated descriptors | Backward elimination (Xu et al. 2001) |
| CORCHOP (Livingstone et al. 1989) | Stepwise regression (Xu et al. 2001) |
| Decision tree (Cardie 1993) | Branch and bound (Narendra et al. 1977) |
| FOCUS (Almuallim et al. 1994) | Floating search (Pudil et al. 1994) |
| LVF (Brassard et al. 1996) | Adaptive floating search (Somol et al. 1999) |
| RELIEF (Kononenko 1994) | Oscillating search (Somol et al. 2000) |
| Discrimination scores (Guyon et al. 2002) | Tabu search (Glover 1989) |
| Information gain (Liu 2004) | Simulated annealing (Sutter et al. 1993) |
| Mutual information (Liu 2004) | Genetic algorithm (Siedlecki et al. 1989) |
| Chi-square test (Liu 2004) | Recursive feature elimination (Guyon et al. 2002) |
| Odds ratio (Liu 2004) | |
| GSS coefficient (Liu 2004) | |
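
To make the wrapper idea concrete, the following is a minimal sketch of forward selection, the first wrapper method in Table 1. It starts from an empty set and repeatedly adds the descriptor that most improves a model-based score, stopping when no candidate helps. The scoring model used here (leave-one-out accuracy of a 1-nearest-neighbour classifier), the function names and the toy data are assumptions chosen to keep the example self-contained; they are not the procedure of Xu et al. (2001).

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    X: list of rows (one per compound), y: class labels,
    features: column indices used to compute squared Euclidean distances.
    """
    correct = 0
    for i, row in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((row[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


def forward_selection(X, y, score=loo_accuracy):
    """Greedy forward selection: add one descriptor at a time while the score improves."""
    remaining = list(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        # Score every candidate subset formed by adding one more descriptor.
        scores = {f: score(X, y, selected + [f]) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:
            break  # no candidate improves the model, so stop
        selected.append(f_best)
        remaining.remove(f_best)
        best = scores[f_best]
    return selected, best


# Hypothetical data: column 0 separates the two classes, columns 1 and 2 do not.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 4.0],
    [0.3, 3.0, 2.0],
    [0.9, 4.9, 3.9],
    [1.0, 1.1, 1.1],
    [1.1, 3.1, 2.1],
]
y = [0, 0, 0, 1, 1, 1]
print(forward_selection(X, y))
```

Unlike a filter, every candidate subset here requires refitting and evaluating the model, which is why wrapper methods are generally more expensive and why their result depends on the chosen in silico method.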
References
- Almuallim H and Dietterich TG (1994). Learning Boolean concepts in the presence of many irrelevant features. Artificial Intelligence 69: 279-306.
- Brassard G et al. (1996). Fundamentals of algorithms. New Jersey, Prentice Hall.
- Cardie C (1993). Using decision trees to improve case-based learning. Proceedings 10th International Conference on Machine Learning. Los Altos, Morgan Kaufmann: 25-32.
- Glover F (1989). Tabu search - Part I. ORSA Journal on Computing 1: 190-206.
- Gramatica P, Pilutti P and Papa E (2004). Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling. Journal of Chemical Information and Computer Sciences 44(5): 1794-1802.
- Guyon I, Weston J, Barnhill S and Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3): 389-422.
- Guyon I and Elisseeff A (2003). An introduction to variable and feature selection. Journal of Machine Learning Research 3: 1157-1182.
- Izrailev S and Agrafiotis DK (2004). A method for quantifying and visualizing the diversity of QSAR models. Journal of Molecular Graphics and Modelling 22(4): 275-284.
- Kononenko I (1994). Estimating attributes: analysis and extensions of RELIEF. Machine Learning: ECML-94. European Conference on Machine Learning. Proceedings.
- Liu Y (2004). A comparative study on feature selection methods for drug discovery. Journal of Chemical Information and Computer Sciences 44(5): 1823-1828.
- Livingstone DJ and Rahr E (1989). Corchop - An interactive routine for the dimension reduction of large QSAR data sets. Quantitative Structure-Activity Relationships 8: 103-108.
- Narendra PM and Fukunaga K (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers 26: 917-922.
- Pudil P, Novovičová J and Kittler J (1994). Floating search methods in feature selection. Pattern Recognition Letters 15(11): 1119-1125.
- Siedlecki W and Sklansky J (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters 10: 335-347.
- Somol P, Pudil P, Novovičová J and Paclík P (1999). Adaptive floating search methods in feature selection. Pattern Recognition Letters 20(11-13): 1157-1163.
- Somol P and Pudil P (2000). Oscillating search algorithms for feature selection. Proceedings of the 15th International Conference on Pattern Recognition. Barcelona. 2: 406-409.
- Sutter JM and Kalivas JH (1993). Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection. Microchemical Journal 47(1-2): 60-66.
- Xu L and Zhang WJ (2001). Comparison of different methods for variable selection. Analytica Chimica Acta 446: 475-481.