C4.5 decision tree (DT)

June 18th, 2008

C4.5 DT is a branch-test-based classifier (Quinlan 1993). A branch in a decision tree corresponds to a group of data classes and a leaf represents a specific data class. A decision node specifies a test to be conducted on a single descriptor value, with one branch and its subsequent data classes as possible outcomes of the test. A given compound with vector x is classified by starting at the root of the tree and moving through the tree until a leaf is encountered. At each non-leaf decision node, a test is conducted and the classification process proceeds to the branch selected by the test. Upon reaching the destination leaf, the data class of the given compound is predicted to be that associated with the leaf.

The algorithm is a recursive greedy heuristic that selects descriptors for membership within the tree. It uses recursive partitioning to examine every descriptor of the compounds in the training set and rank them according to their ability to partition the remaining compounds, thereby constructing a decision tree. Whether or not a descriptor is included within the tree is based on the value of its information gain. As a statistical property, information gain measures how well the descriptor separate training cases into subsets in which the data class is homogeneous. For descriptors with continuous values, a threshold value had to be established within each descriptor so that it could partition the training cases into subsets. These threshold values for each descriptor were established by rank ordering the values within each descriptor from lowest to highest and repeatedly calculating the information gain using the arithmetical midpoint between all successive values within the rank order. The midpoint value with the highest information gain was selected as the threshold value for the descriptor. That descriptor with the highest information gain (information being the most useful for classification) was then selected for inclusion in the DT. The algorithm continued to build the tree in this manner until it accounted for all training cases. Ties between descriptors that were equal in terms of information gain were broken randomly (Carnahan et al. 2003).

References

  • Carnahan B, Meyer G and Kuntz L-A (2003). Comparing statistical and machine learning classifiers: Alternatives for predictive modeling in human factors research. Human Factors 45(3): 408-423.
  • Quinlan JR (1993). C4.5 : programs for machine learning. San Mateo, Calif, Morgan Kaufmann.
Share This

k nearest neighbour (kNN) for regression

June 14th, 2008

kNN can be modified for regression problems by replacing the previous equation with the following equation:

yreg.jpg

The predicted biological value of the given compound is the average of the biological values of its k nearest neighbours. Unlike kNN that is used for classification problems, k need not be an odd number in this case.

Share This

k nearest neighbour (kNN)

June 10th, 2008

kNN is a basic instance-based method and was introduced by Fix and Hodges (Fix et al. 1951). kNN measures the Euclidean distance between a given compound with vector x and each compound in the training set with individual vector xi (Fix et al. 1951; Johnson et al. 1982). The Euclidean distances for the vector pairs are calculated using the following formula:

d.jpg

A total of k number of training compounds nearest to the given compound is used to determine its data class:

y.jpg

where sigma(a,b)=1 if a=b and sigma(a,b)=0 if a!=b, argmax is the maximum of the function, V is a finite set of data classes. k is usually an odd number to prevent ambiguity in the estimation of y.

References

  • Fix E and Hodges JL (1951). Discriminatory analysis: Non-parametric discrimination: Consistency properties. Texas, USAF School of Aviation Medicine, Randolph Field: 261-279.
  • Johnson RA and Wichern DW (1982). Applied multivariate statistical analysis. Englewood Cliffs, NJ, Prentice Hall.
  • Share This

Recursive feature elimination (RFE)

June 8th, 2008

It has been suggested that the ranking criterion for descriptor selection can be formulated from the variation in an objective function upon removing each descriptor (Kohavi et al. 1997). In order to improve the efficiency of support vector machine (SVM) training, this objective function is represented by a cost function J for the i-th descriptor and it is computed by using the training set only. When the i-th descriptor is removed or its weight wi is reduced to zero, the variation of the cost function DJ(i) is given by

dj.jpg

The case of Dwi = wi - 0 corresponds to the removal of descriptor i.

Guyon et al have used RFE to reduce the descriptors of a linear SVM classification system for cancer detection from gene selection data (Guyon et al. 2002). In the corresponding linear SVM classifier, the cost function is

j.jpg

where l is an m dimensional identity vector (m is the number of compounds in the training set). Therefore DJ(i) = (1/2) wi2 and wi2 can be used as a descriptor ranking criterion. Yu et al have used RFE to reduce the descriptors of a non-linear SVM classification system of polynomial kernels for prediction of drug activity (Yu et al. 2003). However, because of the diversity and complexity of the compounds to be classed, the use of linear and polynomial kernels may not always be sufficient for accurate prediction of various pharmaceutical and biological properties. Thus SVM classification systems of Gaussian kernels were frequently used. In this case, the cost function to be minimized, under the constraints 0 ≤ αk ≤ C and ay.jpg, is

j2.jpg

where H is the matrix with elements yi yj exp(-||xi - xj||2/(2σ2)).

To compute the variation in the cost function upon removal of input component i, the parameters αs were kept unchanged and the matrix H was re-computed. The resulting ranking coefficient is

dj2.jpg

where H(-i) is the matrix computed by using the same method as that of matrix H but with its i-th component removed. One or more of the descriptors with the smallest DJ(i) can thus be eliminated.

References

  • Guyon I, Weston J, Barnhill S and Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3): 389-422.
  • Kohavi R and John GH (1997). Wrappers for feature subset selection. Artificial Intelligence 97(1-2): 273-324.
  • Yu H, Yang J, Wang W and Han J (2003). Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. Proceeding of the IEEE computer society bioinformatics conference (CSB): 220-228.
Share This

Genetic algorithm-based descriptor selection

June 6th, 2008

Genetic algorithm-based descriptor selection method comprises of four phases: initialization, evaluation, exploitation and exploration.

The initialization phase involves constructing an initial population of randomly selected descriptor subsets.

During the evaluation phase, each descriptor subset is evaluated by calculating its fitness score, which indicates the relevance of a descriptor subset to the biological property.

In the exploitation phase, the descriptor subsets were first ranked by their fitness value. The higher ranked descriptor subsets were given a higher probability of being chosen for reproduction. The top x selected descriptor subsets were then used to replace the x lowest ranking descriptor subsets in the population. These x new descriptor subsets, together with the y highest ranked descriptor subsets in the current generation, form a new generation of descriptor subsets.

In the last phase, which is the exploration phase, the x new descriptor subsets were subjected to one point crossover and mutation to increase the diversity of the population. In the mutation process, descriptors might be randomly added to or deleted from a descriptor subset. After the exploration phase, the genetic algorithm returns to the evaluation phase and the cycle repeats until at least n generations have passed and the highest ranked descriptor subset remains the same for s generations. The highest ranked descriptor subset was used to construct the final QSAR/qSAR model.

Note: x, y, n, s are defined by the user.

Share This

Removal-until-done algorithm

June 4th, 2008

Other than the Kennard and Stone algorithm, another algorithm for dividing a dataset into training set and validation set is the removal-until-done algorithm. In this algorithm, compounds are sequentially removed from the dataset in pairs and placed in the training and validation sets until a defined similarity threshold or desired number of compounds was selected for the validation set. The selection of the compounds to be removed was based on their distribution in the chemical space. Here, chemical space is defined by the structural and chemical descriptors used to represent a compound and each descriptor value is a point in a multidimensional space. Each compound occupies a particular location in this chemical space. All possible pairs of the compounds in the dataset were generated and a similarity score was computed for each pair. These pairs were then ranked in terms of their similarity scores, based on which compounds of similar structural and chemical features were evenly assigned into the training and validation sets. For those compounds without enough structurally and chemically similar counterparts, they were assigned to the training set.

Share This

Kennard and Stone algorithm

June 2nd, 2008

I had mentioned Kennard and Stone algorithm when I was doing the reviews on the various machine learning software. So what exactly is the Kennard and Stone algorithm? In a nutshell, the following is the procedure for the algorithm: Two compounds with the largest Euclidean distance apart were initially selected for the training set. The remaining compounds for the training set were selected by maximizing the minimum distances between the compounds in the training set and the rest of the compounds in the dataset. This selection process continues until the desired number of compounds was selected for the training set. The remaining compounds in the dataset will be used as the validation set (Kennard et al. 1969).

References

  • Kennard RW and Stone L (1969). Computer aided design of experiments. Technometrics 11: 137-148.
Share This

Different requirements for predictive models at different stages of drug design cycle

May 30th, 2008

In a typical qSAR, it is usually assumed that sensitivity and specificity of the predictive model are equally important. However, in a drug discovery project, these accuracies may have different importance at different stages of the design cycle. For example, in the initial target and hit identification phase, it may be more important not to miss potential leads. Thus, it is more important to have a predictive model which has very high sensitivity (small number of false negatives) and reasonably good specificity. At later stages, it becomes increasingly important to focus on a manageable number of candidates. Thus a predictive model with very high specificity (small number of false positives) and reasonably good sensitivity may become more important. Hence it is important to be able to modify the modeling method or predictive model so as to meet this two types of requirements.

For SVM classification systems, there are two possible approaches for modifications to suit these different needs. The first approach uses different training error penalties for compounds in positive and negative classes. For example, a higher training error penalty for compounds in positive class and lower training error penalty for compounds in negative class can be used to increase the sensitivity of the SVM classification systems. The second approach adds a correction factor to the SVM decision function. A positive or negative correction factor will improve the sensitivity or specificity of the SVM classification system respectively.

Share This

UC Irvine Machine Learning Repository

May 28th, 2008

UC Irvine Machine Learning Repository maintains a large number of datasets that are available for use by machine learning researchers. The current number of datasets in the repository is 171, with 113 datasets related to classification, 12 datasets related to regression, and the rest related to clustering and other tasks. Some of the more famous datasets include Iris, Adult, Abalone, Internet advertisements, etc.

The datasets in this repository are frequently used to compare and validate newly developed machine learning methods and related techniques. This is because these datasets had been used by many researchers to build models using different machine learning methods and thus a large number of models are available for comparison.

For students studying machine learning or machine learning practitioners who wish to hone their skills, these datasets present a good learning opportunity. Students can download these datasets and try to build models for them. Since most of these datasets had been used by other researchers, students can easily compare their models with exisiting models. In addition, students can also learn about the different methods that can be used to process and model the datasets from published literature.

Share This

Consensus methods for regression models

May 26th, 2008

The simplest and frequently used method to build a consensus regression model is to average the predicted biological properties of a compound from different regression models.

Another method is to use the weighted average of the predicted biological properties. However, a difficulty of using weighted average is the determination of appropriate weights for the different regression models. Possible ways of determining appropriate weights include using R2, mean square error, or mean absolute error to measure the performance of the regression models and assigning higher weights to those models which have better performance.

Share This

Close
E-mail It