Consensus methods for classification models

May 24th, 2008

I have used two types of consensus methods in my research (Yap et al. 2005). The first is a ‘positive majority’ consensus method, which classifies a compound as positive if the majority of the models classify the compound as positive (Eriksson et al. 2003). This consensus method requires an odd number of models to prevent ambiguity in its prediction. The second is a ‘positive probability’ consensus method, which explicitly computes the probability for a compound to be positive using the following formulas (McDowell et al. 2002):

prpos.jpg … (1)
prneg.jpg … (2)

where pr.jpg is the posterior probability that a compound is positive given the classification result from model i, and alphapos.jpg and alphaneg.jpg are the sensitivity and specificity of model i respectively. Equation (1) is used when model i classifies the compound as positive, and equation (2) when it classifies the compound as negative. In the absence of knowledge about the ratio of positive to negative compounds in the population, the prior probability that a compound is positive can be tentatively set at 0.5. In practice, the actual value of the prior probability has little influence when a large number of models are used for the consensus process.
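The two consensus methods can be sketched in a few lines of Python. The equation images are not reproduced in this text, so the update rule below is an assumption: it is the standard Bayes update that treats each model's sensitivity and specificity as likelihoods and feeds the posterior from one model in as the prior for the next. The function and variable names are illustrative only.

```python
def positive_majority(votes):
    """Return True if more than half of the models vote positive.

    votes: list of booleans, one per model (True = classified positive).
    """
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of models to avoid ties")
    return sum(votes) > len(votes) / 2


def positive_probability(votes, sensitivities, specificities, prior=0.5):
    """Update P(positive) with Bayes' rule, one model at a time.

    Assumed update (not copied from the original equation images):
      model says positive: p <- se*p / (se*p + (1 - sp)*(1 - p))
      model says negative: p <- (1 - se)*p / ((1 - se)*p + sp*(1 - p))
    """
    p = prior
    for vote, se, sp in zip(votes, sensitivities, specificities):
        if vote:
            p = se * p / (se * p + (1 - sp) * (1 - p))
        else:
            p = (1 - se) * p / ((1 - se) * p + sp * (1 - p))
    return p


# Hypothetical example: three models, two of which vote positive.
votes = [True, True, False]
se = [0.8, 0.7, 0.9]   # sensitivities (made-up values)
sp = [0.9, 0.8, 0.7]   # specificities (made-up values)
print(positive_majority(votes))
print(positive_probability(votes, se, sp))
```

With these made-up values the posterior works out to 0.8, up to floating-point rounding. Under this assumed form the order in which the models are applied does not change the result, because each model simply multiplies the odds by its own likelihood ratio.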

References

  • Eriksson L, Jaworska J, Cronin M, Worth A, Gramatica P and McDowell R (2003). Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environmental Health Perspectives 111(10): 1361-1375.
  • McDowell R and Jaworska J (2002). Bayesian analysis and inference from QSAR predictive model results. SAR and QSAR in Environmental Research 13: 111-125.
  • Yap CW and Chen YZ (2005). Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. Journal of Chemical Information and Modeling 45(4): 982-992.

Functional dependence study of QSAR models

May 20th, 2008

A functional dependence study can provide insights on the type of molecular characteristics that are important for a particular biological property and how changes in these molecular characteristics affect the biological property. This information is useful for guiding structural changes during computer-aided drug design so that the desired biological property can be obtained. It is also useful for validating a QSAR model. A valid QSAR model should be consistent with previous findings of important factors that affect the biological property.

For QSAR models developed from linear modeling methods, the descriptors are either positively or negatively correlated to biological properties in a linear relationship. In contrast, descriptors in models developed by using machine learning methods correlate to biological properties in a non-linear relationship. Thus these models can potentially provide more information about the relationships between descriptors and biological properties.

The relationships between descriptors and biological properties can be obtained by using functional dependence plots, where the value of a single descriptor is varied through its range while all other descriptors are held constant at a certain value (Wessel et al. 1998). However, QSAR models usually contain descriptors that are correlated with one another, and these intercorrelations can drastically alter the shape of a functional dependence plot if the values of the descriptors that are held constant are changed (Andrea et al. 1991). In addition, descriptors may encode multiple physicochemical and structural aspects of the molecule. This makes it difficult to determine the relationship between a specific molecular characteristic and a biological property.

Principal component analysis (PCA) can be used to overcome both problems (Yap et al. 2005). PCA can extract dominant patterns in the descriptor subsets and group similar descriptors under a single principal component (PC). Different PCs encode different molecular characteristics, and the orthogonality among the PCs can be exploited to determine the correlation between a molecular characteristic and a biological property without the influence of other molecular characteristics. A descriptor may belong to multiple PCs, and the explained variation of a descriptor in each PC can be used to determine its level of contribution to the PCs (Eriksson et al. 2001).

Artificial testing sets can be created to determine the relationship between the PCs and the biological property. Each artificial testing set contains 1000 artificial compounds and initially uses PCs as descriptors. The PC to be evaluated is varied uniformly from -5 to 5 while all of the other PCs are assigned a value of zero. The loadings derived from PCA are then used to transform the PCs back to the original molecular descriptors. Artificial compounds with molecular descriptors outside the range of the corresponding descriptor in the training set are removed to prevent extrapolation of the model. The values of the biological property of the remaining artificial compounds are predicted by using the developed QSAR models. Functional dependence plots of the biological property against the PCs can then be used to find the trends between various molecular characteristics and the biological property.
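The artificial testing set procedure can be sketched as follows. This is a minimal illustration, not the code used in the cited work. It assumes the loadings and descriptor means have already been obtained from a PCA of mean-centred training descriptors (if the descriptors were also scaled, the back-transform would need to undo that scaling too). The names `loadings`, `means`, `mins`, `maxs` and `predict` are hypothetical placeholders.

```python
def artificial_compounds(loadings, means, pc_index, n=1000, lo=-5.0, hi=5.0):
    """Build artificial compounds by varying one PC and zeroing the others.

    loadings: list of PCs, each a list of per-descriptor loadings.
    means:    per-descriptor means used to centre the training data.
    Returns a list of (pc_score, descriptor_vector) pairs.
    """
    step = (hi - lo) / (n - 1)
    compounds = []
    for j in range(n):
        score = lo + j * step
        # With every other PC at zero, the back-transform reduces to
        # mean + score * loading for the PC under study.
        x = [m + score * w for m, w in zip(means, loadings[pc_index])]
        compounds.append((score, x))
    return compounds


def within_training_range(compounds, mins, maxs):
    """Drop compounds with any descriptor outside the training-set range."""
    return [(s, x) for s, x in compounds
            if all(a <= v <= b for v, a, b in zip(x, mins, maxs))]


def dependence_curve(compounds, predict):
    """Pair each PC score with the model prediction for that compound."""
    return [(s, predict(x)) for s, x in compounds]


# Hypothetical two-descriptor example with a toy stand-in for a QSAR model.
loadings = [[0.6, 0.8], [-0.8, 0.6]]
means = [1.0, 2.0]
mins, maxs = [0.0, 0.0], [2.0, 4.0]
kept = within_training_range(artificial_compounds(loadings, means, 0), mins, maxs)
curve = dependence_curve(kept, lambda x: x[0] + x[1])
print(len(kept), curve[0], curve[-1])
```

Plotting the second element of each pair in `curve` against the first gives the functional dependence plot for the chosen PC.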

References

  • Andrea TA and Kalayeh H (1991). Applications of neural networks in quantitative structure-activity relationships of dihydrofolate reductase inhibitors. Journal of Medicinal Chemistry 34: 2824-2836.
  • Eriksson L, Johansson E, Kettaneh-Wold N and Wade KM (2001). PCA. Multi- and megavariate data analysis - Principles and applications. Umea, Sweden, Umetrics AB: 43-70.
  • Wessel MD, Jurs PC, Tolan JW and Muskal SM (1998). Prediction of human intestinal absorption of drug compounds from molecular structure. Journal of Chemical Information and Computer Sciences 38(4): 726-735.
  • Yap CW and Chen YZ (2005). Quantitative structure-pharmacokinetic relationships for drug distribution properties by using general regression neural network. Journal of Pharmaceutical Sciences 94(1): 153-168.

Data mining tools comparison - Summary

May 18th, 2008

KNIME is very easy to use and is good for preprocessing of datasets and descriptors. Personally, among the various software, I enjoy using KNIME the most. It is a pity that it is weaker in its model building and validation portion. Hopefully the next major version of KNIME will address these issues.

RapidMiner has a very large set of operators, which makes it very suitable for comparing different machine learning/statistical methods. It is also very good for model building and validation. However, the learning curve for the software is rather steep.

Weka (KnowledgeFlow) is somewhat in between KNIME and RapidMiner. Like RapidMiner, it has quite a large number of components and like KNIME, it is relatively simple to use. However, it is not able to perform all the functions that are available in RapidMiner and its graphical user interface is not as friendly as KNIME.

TANAGRA is similar to RapidMiner in terms of the layout for representing an experimental procedure. However, it has significantly fewer operators than RapidMiner. My initial impression was that it should be quite good for performing QSAR experiments. However, after using it, I found that it lacks several important features.

Orange is similar to Weka (KnowledgeFlow) in terms of layout. However, like TANAGRA, it seems to be lacking in some important features for QSAR experiments.

A missing feature in all these software is the ability to perform parallel computing, either through job distribution among different computers in the network or through the use of all the cores in multi-core CPUs.

Table 1 shows a comparison of the five software packages for performing procedures that are widely used in QSAR experiments. The best software appears to be RapidMiner. At first glance, Weka seems to be redundant since RapidMiner has incorporated most of its algorithms. However, it still contains some algorithms, especially in the area of descriptor selection, which are not available in the other software. Although TANAGRA and Orange are the worst performing software among the five, they do have their own merits. For instance, TANAGRA has an interesting collection of statistical tests while Orange has some interesting prototypes like the MeSH Term Browser. Personally, I will invest my time in learning KNIME, RapidMiner, and Weka well, and will use these three for my future research work.

Table 1: Comparison of the five software packages for performing procedures that are widely used in QSAR experiments.

| Procedure | KNIME | RapidMiner | Weka | TANAGRA | Orange |
| --- | --- | --- | --- | --- | --- |
| Partitioning of dataset into training and testing sets | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) |
| Descriptor scaling | Pass | Pass | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (no scaling methods) |
| Descriptor selection | Fail (no wrapper methods) | Pass | Pass (but is not part of KnowledgeFlow) | Fail (wrapper methods valid for logistic regression only) | Fail (no wrapper methods) |
| Parameter optimization of machine learning/statistical methods | Fail (not automatic) | Pass | Fail (not automatic) | Fail (not automatic) | Fail (not automatic) |
| Model validation using cross-validation and/or independent validation set | Pass (but limited error measurement methods) | Pass | Pass (but cannot save model so have to rebuild model for every future dataset) | Fail (cannot validate independent validation set) | Pass (but cannot save model so have to rebuild model for every future dataset) |

Lastly, I need to reiterate that the above comments and all the previous posts on these software are very subjective. They are subjective because I have a vested interest in QSAR-type modeling and also because I am not very familiar with these software (I have never used them in any of my research projects). Thus there may be factual inaccuracies in my review (e.g. a software may in fact be able to perform a procedure that I stated it cannot). The authors of these software, or readers who are experienced with them, are welcome to comment on any factual inaccuracies and I will update the posts accordingly.


Orange - Part VI: Model validation using cross-validation and/or independent validation set

May 16th, 2008

The previous post already provides the steps for model validation using cross-validation. So how do we validate a model using an independent validation set?

Validate model on an independent validation set

  1. Put File widget (Data) to canvas and configure it to load a training set from a file.
  2. Put Select Attributes widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Specify the attributes and class for the training set.
    • Click on the Apply button.
  3. Put K Nearest Neighbours widget (Classify) to canvas and connect the output port from the Select Attributes widget to its input port.
    • Configure it by setting Number of neighbours to 3.
    • Click on the Apply button.
  4. Put File widget (Data) to canvas and configure it to load an independent validation set from a file.
  5. Put Select Attributes widget (Data) to canvas and connect the output port from the second File widget to its input port.
    • Specify the attributes and class for the independent validation set.
    • Click on the Apply button.
  6. Put Test Learners widget (Evaluate) to canvas.
    • Connect the output port from K Nearest Neighbours widget to its Learner input port.
    • Connect the output port from the first Select Attributes widget to its Data input port.
    • Connect the output port from the second Select Attributes widget to its Separate Test Data input port.
    • Configure it by choosing Test on test data.

It can be seen that Orange is able to validate a model using either cross-validation or an independent validation set. However, it seems that Orange is unable to save a model, and thus the model has to be rebuilt each time it is to be used on an independent validation set.


Orange - Part V: Parameter optimization of machine learning/statistical methods

May 14th, 2008

  1. Put File widget (Data) to canvas and configure it to load a training set from a file.
  2. Put Select Attributes widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Specify the attributes and class for the training set.
    • Click on the Apply button.
  3. Put K Nearest Neighbours widget (Classify) to canvas and connect the output port from the Select Attributes widget to its input port.
    • Configure it by setting Number of neighbours to 3.
    • Click on the Apply button.
  4. Put Test Learners widget (Evaluate) to canvas.
    • Connect the output port from K Nearest Neighbours widget to its Learner input port.
    • Connect the output port from the Select Attributes widget to its Data input port.
    • Configure it by choosing Cross-validation and setting the Number of folds to 10.

The above procedure shows how Orange can be used to train a model and assess its performance. However, Orange cannot automatically determine the optimum parameter value (e.g. the number of neighbours to consider (k) in the above procedure) for a machine learning/statistical method. The optimum has to be found manually: set a parameter value, execute, record the overall error rate, set another parameter value, execute again, record the overall error rate, and so on until all the parameter values of interest have been evaluated. The parameter value that gives the lowest overall error rate is then the optimum parameter value of the machine learning/statistical method for the training set.
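The manual search described above is a simple loop over candidate parameter values, and it can be automated outside the canvas. The sketch below is plain Python and does not use Orange's scripting interface. The k-nearest-neighbour classifier, the fold assignment and the toy dataset are all simplified stand-ins chosen for illustration.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], x))
    nearest = sorted(train, key=sq_dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_error(data, k, folds=10, seed=0):
    """Return the cross-validated error rate of k-NN on data."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    errors = 0
    for f in range(folds):
        # Every folds-th row (starting at f) forms the test fold.
        test = rows[f::folds]
        train = [r for i, r in enumerate(rows) if i % folds != f]
        errors += sum(knn_predict(train, x, k) != y for x, y in test)
    return errors / len(rows)


def best_k(data, candidates, folds=10):
    """Try each candidate k and return (best k, error rate for every k)."""
    errors = {k: cv_error(data, k, folds) for k in candidates}
    return min(errors, key=errors.get), errors


# Hypothetical toy dataset: two well-separated classes on a line.
data = [((i * 0.1, 0.0), "neg") for i in range(20)]
data += [((5 + i * 0.1, 0.0), "pos") for i in range(20)]
k, errors = best_k(data, [1, 3, 5, 7])
print(k, errors)
```

When several values of k tie for the lowest error rate, `best_k` returns the first one in the candidate list.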


Orange - Part IV: Descriptor selection

May 12th, 2008

Orange does not have any wrapper descriptor selection methods.


Orange - Part III: Descriptor scaling

May 10th, 2008

Orange does not have any capability for scaling descriptors. Zero marks for this one.


Orange - Part II: Partitioning of dataset into training and testing sets

May 8th, 2008

  1. Put File widget (Data) to canvas and configure it to load a dataset from a file.
  2. Put Data Sampler widget (Data) to canvas and connect the output port from the File widget to its input port.
    • Configure it by choosing Random sampling and setting the Sample size to 80%.
    • Click on the Sample Data button.
  3. Put Save widget (Data) on the canvas. Connect the Examples output port from the Data Sampler widget to the input port of the Save widget and configure it to save the training set to a file. Then click on the Save current data button.
  4. Put Save widget (Data) on the canvas. Connect the Remaining Examples output port from the Data Sampler widget to the input port of the Save widget and configure it to save the testing set to a file. Then click on the Save current data button.

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, Orange does not seem to contain other algorithms, like the Kennard and Stone algorithm, for partitioning datasets.
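For readers who have not met the Kennard and Stone algorithm, here is a minimal sketch of the usual formulation: start with the two compounds that are farthest apart in descriptor space, then repeatedly add the compound whose nearest already-selected neighbour is farthest away. This is an illustrative plain-Python version, not something Orange provides, and the example points are made up.

```python
def kennard_stone(points, n_train):
    """Return indices of n_train points chosen by the Kennard-Stone rule."""
    def dist(i, j):
        # Squared Euclidean distance; it preserves the distance ordering.
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(points)")
    # Start with the two points that are farthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda pair: dist(*pair))
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_train:
        # Add the candidate whose nearest selected point is farthest away.
        nxt = max(remaining, key=lambda i: min(dist(i, s) for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


# Hypothetical one-descriptor dataset.
points = [(0.0,), (1.0,), (2.0,), (5.0,), (10.0,)]
train_idx = kennard_stone(points, 3)
test_idx = [i for i in range(len(points)) if i not in train_idx]
print(train_idx, test_idx)
```

The selected compounds form the training set and the rest form the testing set, so the training set spans the descriptor space more evenly than a random sample typically would. The pairwise search for the starting points is quadratic in the number of compounds, which is fine for a sketch but slow for large datasets.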


Orange - Part I: Overview

May 6th, 2008

Orange (Snapshot 11 April 2008)

From their official website, “Orange is a component-based data mining software. It includes a range of preprocessing, modelling and data exploration techniques. It is based on C++ components, that are accessed either directly (not very common), through Python scripts (easier and better), or through GUI objects called Orange Widgets”. Orange is distributed under GPL.

If you install the current version of Orange, you will have a total of 77 widgets, distributed across categories as follows:

  • Data: 15
  • Classify: 14
  • Evaluate: 6
  • Visualize: 13
  • Associate: 13
  • Prototypes: 13
  • Regression: 3

However, since I am interested in using it for QSAR experiments, I will only examine those widgets that are relevant. Basically, Orange can read data from five sources: text-delimited files (which include csv files), C4.5 files, and three other formats which I am not familiar with. Orange cannot read data from SVMlight files, LIBSVM files or Microsoft Excel files. The lack of support for Microsoft Excel files is no big deal since you can easily convert them to csv format using Microsoft Excel. However, the lack of support for SVMlight and LIBSVM files will inconvenience users who are already using these two popular support vector machine packages.

Orange has a few filter descriptor selection methods such as ReliefF, Information gain, Gain ratio and Gini gain.

Currently, Orange contains one algorithm for developing regression models and 10 algorithms for constructing classification models. It seems strange that Orange does not have a multiple linear regression algorithm, which is the most basic of regression algorithms.

Orange has a Data Sampler widget that provides validation methods like cross-validation and leave-one-out.

Overall, my first impression of Orange is that it has a nice graphical user interface but it seems quite inadequate for QSAR experiments.


TANAGRA - Part VI: Model validation using cross-validation and/or independent validation set

May 2nd, 2008

The previous post already provides the steps for model validation using cross-validation. TANAGRA does not provide any functionality for loading another dataset into the same diagram, or saving and loading a model. Thus TANAGRA is unable to validate an independent validation set (TANAGRA is able to validate on a testing set only if the testing set is derived using its Sampling operator).
