Archive for the ‘Data mining’ Category

Recursive feature elimination (RFE)

Sunday, June 8th, 2008

It has been suggested that the ranking criterion for descriptor selection can be formulated from the variation in an objective function upon removing each descriptor (Kohavi et al. 1997). In order to improve the efficiency of support vector machine (SVM) training, this objective function is represented by a cost function J for the i-th descriptor and it is computed by using the training set only. When the i-th descriptor is removed or its weight wi is reduced to zero, the variation of the cost function DJ(i) is given by

dj.jpg

The case of Dwi = wi - 0 corresponds to the removal of descriptor i.

Guyon et al have used RFE to reduce the descriptors of a linear SVM classification system for cancer detection from gene selection data (Guyon et al. 2002). In the corresponding linear SVM classifier, the cost function is

j.jpg

where l is an m dimensional identity vector (m is the number of compounds in the training set). Therefore DJ(i) = (1/2) wi2 and wi2 can be used as a descriptor ranking criterion. Yu et al have used RFE to reduce the descriptors of a non-linear SVM classification system of polynomial kernels for prediction of drug activity (Yu et al. 2003). However, because of the diversity and complexity of the compounds to be classed, the use of linear and polynomial kernels may not always be sufficient for accurate prediction of various pharmaceutical and biological properties. Thus SVM classification systems of Gaussian kernels were frequently used. In this case, the cost function to be minimized, under the constraints 0 ≤ αk ≤ C and ay.jpg, is

j2.jpg

where H is the matrix with elements yi yj exp(-||xi - xj||2/(2σ2)).

To compute the variation in the cost function upon removal of input component i, the parameters αs were kept unchanged and the matrix H was re-computed. The resulting ranking coefficient is

dj2.jpg

where H(-i) is the matrix computed by using the same method as that of matrix H but with its i-th component removed. One or more of the descriptors with the smallest DJ(i) can thus be eliminated.

References

  • Guyon I, Weston J, Barnhill S and Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3): 389-422.
  • Kohavi R and John GH (1997). Wrappers for feature subset selection. Artificial Intelligence 97(1-2): 273-324.
  • Yu H, Yang J, Wang W and Han J (2003). Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. Proceeding of the IEEE computer society bioinformatics conference (CSB): 220-228.

Genetic algorithm-based descriptor selection

Friday, June 6th, 2008

Genetic algorithm-based descriptor selection method comprises of four phases: initialization, evaluation, exploitation and exploration.

The initialization phase involves constructing an initial population of randomly selected descriptor subsets.

During the evaluation phase, each descriptor subset is evaluated by calculating its fitness score, which indicates the relevance of a descriptor subset to the biological property.

In the exploitation phase, the descriptor subsets were first ranked by their fitness value. The higher ranked descriptor subsets were given a higher probability of being chosen for reproduction. The top x selected descriptor subsets were then used to replace the x lowest ranking descriptor subsets in the population. These x new descriptor subsets, together with the y highest ranked descriptor subsets in the current generation, form a new generation of descriptor subsets.

In the last phase, which is the exploration phase, the x new descriptor subsets were subjected to one point crossover and mutation to increase the diversity of the population. In the mutation process, descriptors might be randomly added to or deleted from a descriptor subset. After the exploration phase, the genetic algorithm returns to the evaluation phase and the cycle repeats until at least n generations have passed and the highest ranked descriptor subset remains the same for s generations. The highest ranked descriptor subset was used to construct the final QSAR/qSAR model.

Note: x, y, n, s are defined by the user.

Removal-until-done algorithm

Wednesday, June 4th, 2008

Other than the Kennard and Stone algorithm, another algorithm for dividing a dataset into training set and validation set is the removal-until-done algorithm. In this algorithm, compounds are sequentially removed from the dataset in pairs and placed in the training and validation sets until a defined similarity threshold or desired number of compounds was selected for the validation set. The selection of the compounds to be removed was based on their distribution in the chemical space. Here, chemical space is defined by the structural and chemical descriptors used to represent a compound and each descriptor value is a point in a multidimensional space. Each compound occupies a particular location in this chemical space. All possible pairs of the compounds in the dataset were generated and a similarity score was computed for each pair. These pairs were then ranked in terms of their similarity scores, based on which compounds of similar structural and chemical features were evenly assigned into the training and validation sets. For those compounds without enough structurally and chemically similar counterparts, they were assigned to the training set.

Kennard and Stone algorithm

Monday, June 2nd, 2008

I had mentioned Kennard and Stone algorithm when I was doing the reviews on the various machine learning software. So what exactly is the Kennard and Stone algorithm? In a nutshell, the following is the procedure for the algorithm: Two compounds with the largest Euclidean distance apart were initially selected for the training set. The remaining compounds for the training set were selected by maximizing the minimum distances between the compounds in the training set and the rest of the compounds in the dataset. This selection process continues until the desired number of compounds was selected for the training set. The remaining compounds in the dataset will be used as the validation set (Kennard et al. 1969).

References

  • Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.

Different requirements for predictive models at different stages of drug design cycle

Friday, May 30th, 2008

In a typical qSAR, it is usually assumed that sensitivity and specificity of the predictive model are equally important. However, in a drug discovery project, these accuracies may have different importance at different stages of the design cycle. For example, in the initial target and hit identification phase, it may be more important not to miss potential leads. Thus, it is more important to have a predictive model which has very high sensitivity (small number of false negatives) and reasonably good specificity. At later stages, it becomes increasingly important to focus on a manageable number of candidates. Thus a predictive model with very high specificity (small number of false positives) and reasonably good sensitivity may become more important. Hence it is important to be able to modify the modeling method or predictive model so as to meet this two types of requirements.

For SVM classification systems, there are two possible approaches for modifications to suit these different needs. The first approach uses different training error penalties for compounds in positive and negative classes. For example, a higher training error penalty for compounds in positive class and lower training error penalty for compounds in negative class can be used to increase the sensitivity of the SVM classification systems. The second approach adds a correction factor to the SVM decision function. A positive or negative correction factor will improve the sensitivity or specificity of the SVM classification system respectively.

UC Irvine Machine Learning Repository

Wednesday, May 28th, 2008

UC Irvine Machine Learning Repository maintains a large number of datasets that are available for use by machine learning researchers. The current number of datasets in the repository is 171, with 113 datasets related to classification, 12 datasets related to regression, and the rest related to clustering and other tasks. Some of the more famous datasets include Iris, Adult, Abalone, Internet advertisements, etc.

The datasets in this repository are frequently used to compare and validate newly developed machine learning methods and related techniques. This is because these datasets had been used by many researchers to build models using different machine learning methods and thus a large number of models are available for comparison.

For students studying machine learning or machine learning practitioners who wish to hone their skills, these datasets present a good learning opportunity. Students can download these datasets and try to build models for them. Since most of these datasets had been used by other researchers, students can easily compare their models with exisiting models. In addition, students can also learn about the different methods that can be used to process and model the datasets from published literature.

Consensus methods for regression models

Monday, May 26th, 2008

The simplest and frequently used method to build a consensus regression model is to average the predicted biological properties of a compound from different regression models.

Another method is to use the weighted average of the predicted biological properties. However, a difficulty of using weighted average is the determination of appropriate weights for the different regression models. Possible ways of determining appropriate weights include using R2, mean square error, or mean absolute error to measure the performance of the regression models and assigning higher weights to those models which have better performance.

Consensus methods for classification models

Saturday, May 24th, 2008

I have used two types of consensus methods in my research (Yap et al. 2005). The first is a ‘positive majority’ consensus method, which classifies a compound as positive if the majority of the models classify the compound as positive (Eriksson et al. 2003). This consensus method requires an odd number of models to prevent ambiguity in its prediction. The second is a ‘positive probability’ consensus method, which explicitly computes the probability for a compound to be positive using the following formulas (McDowell et al. 2002):

prpos.jpg … (1)
prneg.jpg … (2)

where pr.jpgis the posterior probability that a compound is positive given the classification result from model i and alphapos.jpg and alphaneg.jpg is the sensitivity and specificity of model i respectively. Equation (1) or (2) was used when model i classifies the compound as positive or negative respectively. In the absence of the knowledge about the ratio of positive to negative compounds in the population, the prior probability of a compound to be positive can be tentatively set at 0.5. In practice, the actual value for the prior probability is unimportant if a large number of models are used for the consensus process.

References

  • Eriksson L, Jaworska J, Cronin M, Worth A, Gramatica P and McDowell R (2003). Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environmental Health Perspectives 111(10): 1361-1375.
  • McDowell R and Jaworska J (2002). Bayesian analysis and inference from QSAR predictive model results. SAR and QSAR in Environmental Research 13: 111-125.
  • Yap CW and Chen YZ (2005). Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. Journal of Chemical Information and Modeling 45(4): 982-992.

Functional dependence study of QSAR models

Tuesday, May 20th, 2008

A functional dependence study can provide insights on the type of molecular characteristics that are important for a particular biological property and how changes in these molecular characteristics affect the biological property. This information is useful for guiding structural changes during computer-aided drug design so that the desired biological property can be obtained. It is also useful for validating a QSAR model. A valid QSAR model should be consistent with previous findings of important factors that affect the biological property.

For QSAR models developed from linear modeling methods, the descriptors are either positively or negatively correlated to biological properties in a linear relationship. In contrast, descriptors in models developed by using machine learning methods correlate to biological properties in a non-linear relationship. Thus these models can potentially provide more information about the relationships between descriptors and biological properties.

The relationships between descriptors and biological properties can be obtained by using functional dependence plots where the value of a single descriptor is varied through its range, while all other descriptors are held constant at a certain value (Wessel et al. 1998). However, QSAR models usually contain descriptors that are correlated with one another and these intercorrelations can drastically alter the shape of a functional dependence plot if the values of the descriptors that are held constant are changed (Andrea et al. 1991). In addition, descriptors may encode multiple physicochemical and structural aspects of the molecule. This makes it difficult to determine the relationship between a specific molecular characteristic and an biological property.

Principal component analysis (PCA) can be used to overcome both problems (Yap et al. 2005). PCA can extract dominant patterns in the descriptor subsets and group similar descriptors under a single principal component (PC). Different PCs encode different molecular characteristics and the orthogonality among the PCs can be exploited to determine the correlation between a molecular characteristic and a biological property without the influence of other molecular characteristics. A descriptor may belong to multiple PCs and the explained variations of a descriptor in each PC can be used to determine its level of contribution in the PCs (Eriksson et al. 2001). Artificial testing sets can be created to determine the relationship between the PCs and biological property. Each artificial testing set contains 1000 artificial compounds and initially used PCs as descriptors. The PC to be evaluated is varied uniformly from -5 to 5 while all of the other PCs are assigned a value of zero. The loadings derived from PCA are then used to transform the PCs back to the original molecular descriptors. Artificial compounds with molecular descriptors outside the range of the corresponding descriptor in the training set are removed to prevent extrapolation of the model. The values of the biological property of the remaining artificial compounds are predicted by using the developed QSAR models. Functional dependence plots of the biological property against the PCs can then be used to find the trends between various molecular characteristics and the biological property.

References

  • Andrea TA and Kalayeh H (1991). Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors. Journal of Medicinal Chemistry 34: 2824-2836.
  • Eriksson L, Johansson E, Kettaneh-Wold N and Wade KM (2001). PCA. Multi- and megavariate data analysis - Principles and applications. Umea, Sweden, Umetrics AB: 43-70.
  • Wessel MD, Jurs PC, Tolan JW and Muskal SM (1998). Prediction of human intestinal absorption of drug compounds from molecular structure. Journal of Chemical Information and Computer Sciences 38(4): 726-735.
  • Yap CW and Chen YZ (2005). Quantitative structure-pharmacokinetic relationships for drug distribution properties by using general regression neural network. Journal of Pharmaceutical Sciences 94(1): 153-168.

Data mining tools comparison - Summary

Sunday, May 18th, 2008

KNIME is very easy to use and is good for preprocessing of datasets and descriptors. Personally, among the various software, I enjoy using KNIME the most. It is a pity that it is weaker in its model building and validation portion. Hopefully the next major version of KNIME will address these issues.

RapidMiner has a very large set of operators, which makes it very suitable for comparing different machine learning/statistical methods. It is also very good for model building and validation. However, the learning curve for the software is rather steep.

Weka (KnowledgeFlow) is somewhat in between KNIME and RapidMiner. Like RapidMiner, it has quite a large number of components and like KNIME, it is relatively simple to use. However, it is not able to perform all the functions that are available in RapidMiner and its graphical user interface is not as friendly as KNIME.

TANAGRA is similar to RapidMiner in terms of the layout for representing an experimental procedure. However it has significantly less operators than RapidMiner. My initial impression of it is that it should be quite good for performing QSAR experiments. However, after using it, it seems like it is lacking in several important features.

Orange is similar to Weka (KnowledgeFlow) in terms of layout. However, like TANAGRA, it seems to be lacking in some important features for QSAR experiments.

A missing feature in all these software is the ability to perform parallel computing, either through job distribution among different computers in the network or through the use of all the cores in multi-core CPUs.

Table 1 shows a comparison of the five software for performing procedures that are widely used in QSAR experiments. The best software appears to be RapidMiner. At a first glance, Weka seems to be redundant since RapidMiner has incorporated most of its algorithms. However, it still contains some algorithms, especially in the area of descriptor selection, which are not available in other software. Although TANAGRA and Orange are the worst performing software among the five, they do have their own merits. For instance, TANAGRA has an interesting collection of statistical tests while Orange has some interesting prototypes like MeSH Term Browser. Personally, I will invest my time to learn KNIME, RapidMiner, and Weka well, and will use these three software for my future research work.

Table 1: Comparison of the four software for performing procedures that are widely used in QSAR experiments.

Procedure KNIME RapidMiner Weka TANAGRA Orange
Partitioning of dataset into training and testing sets. Pass (but limited partitioning methods) Pass (but limited partitioning methods) Pass (but limited partitioning methods) Pass (but limited partitioning methods) Pass (but limited partitioning methods)
Descriptor scaling Pass Pass Fail (cannot save parameters for scaling to apply to future datasets) Fail (cannot save parameters for scaling to apply to future datasets) Fail (no scaling methods)
Descriptor selection Fail (no wrapper methods) Pass Pass (but is not part of KnowledgeFlow) Fail (wrapper methods valid for logistic regression only) Fail (no wrapper methods)
Parameter optimization of machine learning/statistical methods Fail (not automatic) Pass Fail (not automatic) Fail (not automatic) Fail (not automatic)
Model validation using cross-validation and/or independent validation set Pass (but limited error measurement methods) Pass Pass (but cannot save model so have to rebuild model for every future dataset) Fail (cannot validate independent validation set) Pass (but cannot save model so have to rebuild model for every future dataset)

Lastly, I need to reiterate that the above comments and all the previous posts on these software are very subjective. They are subjective because I have a vested interest in QSAR type of modeling and also because I am not very familar with these software (I have never used them in any of my research projects). Thus there may be factual inaccuracies about my review (i.e. some procedures which I stated that a particular software is unable to do may be false). The authors of these software or readers who are experienced with these software are welcome to comment on these factual inaccuracies and I will update the posts to reflect the truth.


Close
E-mail It