Archive for the ‘Tutorial’ Category

Visualization software for exploratory data analysis

Monday, July 14th, 2008

A dataset may contain anywhere from one to several thousand features. When the number of features in a dataset exceeds three, it is difficult to visualize how different instances are related to one another. Luckily, there are methods available to help us visualize these high-dimensional dataset. The common thing about these methods is that they reduce the original features in the dataset into not more than three features, while retaining the distance relationship between the instances. This allows the instances to be plotted as a 2D or 3D graph, providing us with a visual overview of the structure of the dataset.

The usual method for visualizing datasets in QSAR is principal component analysis (PCA). PCA is used to convert the existing features into another set of orthogonal features, with the first few features capturing the bulk of the variance in the dataset. A 2D plot is usually made from the first two principal components and is useful for showing clusters in the dataset, areas where data is sparse, possible outliers, and whether it is possible to separate the different classes in the dataset using PCA alone.

Other than PCA, other dimensionality reduction methods are seldom used in QSAR. The reasons are not clear. Perhaps, it is due to the lack of software, or the lack of expertise in interpreting such graphs. Indeed, it is for both reasons that I do not use visualization methods often in my research. Previously, I was too involved in data mining alone. Now, I have broaden my approach to data exploration and thus it is necessary for me to learn how to visualize data properly.

A search on the internet shows that there are some visualization software available. I have selected three software, VisuMap, OmniViz, and GGobi to explore in more details.

General regression neural network (GRNN)

Sunday, July 6th, 2008

GRNN is a modification of PNN for regression problems (Specht 1991). For GRNN, the predicted value of the biological property is the most probable value, which is given by

grnnpredy.jpg

where f(x,y) is the joint density and can be estimated by using Parzen’s nonparametric estimator. Substituting Parzen’s nonparametric estimator for f(x,y) and performing the integrations leads to the fundamental equation of GRNN.

grnnpredyfinal.jpg

where

grnndxx.jpg

The network architecture of a GRNN is similar to that of a PNN except that its summation layer has two neurons that calculate the numerator and denominator. The single neuron in the output layer then performs a division of the two summation neurons to obtain the predicted biological value of the given compound.

References

  • Specht DF (1991). A general regression neural network. IEEE Transactions on Neural Networks 2(6): 568-576.

Support vector regression (SVR)

Wednesday, July 2nd, 2008

The theoretical background of SVR is similar to that of SVM (Smola et al.; Vapnik 1995; Yuan et al. 2004). In SVR, the kernel function is used to map the vectors into a higher dimensional feature space and linear regression is then conducted in this space. The optimal regression function can be represented by:

svrpredy.jpg

where y represents the predicted value of a biological property, and the coefficients alpha, alpha* and bias b are determined by maximizing the following Langrangian expression:

svrlangrangian.jpg

under the following conditions:

svralpha.jpg
svrsum.jpg

References

  • Smola AJ and Scholkopf B A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.
  • Yuan Z and Huang BX (2004). Prediction of protein accessible surface areas by support vector regression. Proteins 57(3): 558-564.

Support vector machine (SVM)

Saturday, June 28th, 2008

SVM is based on the structural risk minimization principle from statistical learning theory (Vapnik 1995; Burges 1998; Evgeniou et al. 2001). A compound is represented by a vector xi which is its molecular descriptors. In linearly separable cases, SVM constructs a hyperplane which separates two data classes of compounds with a maximum margin. This is accomplished by finding another vector w and a parameter b that minimizes w2.jpg and satisfies the following conditions:

dplus.jpg Class 1 (D+)

dminus.jpg Class 2 (D–)

where yi is the data class index of compound i, w is a vector normal to the hyperplane, bw.jpg is the perpendicular distance from the hyperplane to the origin and w2.jpg is the Euclidean norm of w. After the determination of w and b, a given compound with vector x can be classified by:

ysvm.jpg

In non-linearly separable cases, SVM maps the vectors into a higher dimensional feature space using a kernel function K(xi, xj). The table below lists three different types of kernel functions which are commonly used. The Gaussian radial basis function kernel has been extensively used in a number of different studies with good results (Burbidge et al. 2001; Czerminski et al. 2001; Trotter et al. 2001).

Commonly used kernel functions

Kernel Equation
Polynomial polynomialkernel.jpg
Gaussian radial basis function rbfkernel.jpg
Sigmoidal sigmoidalkernel.jpg

Linear support vector machine is applied to this feature space and then the decision function is given by:

ysvmnonlinear.jpg

where l is the number of support vectors and the coefficients alphai0 and b are determined by maximizing the following Langrangian expression:

langrangian.jpg

under the following conditions:

a1.jpg

a2.jpg

where C is a penalty for training errors. A positive or negative value from decision function equation indicates that the compound with vector x belongs to the positive or negative data class respectively.

References

  • Burbidge R, Trotter M, Buxton B and Holden S (2001). Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers and Chemistry 26(1): 5-14.
  • Burges CJC (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 127-167.
  • Czerminski R, Yasri A and Hartsough D (2001). Use of support vector machine in pattern classification: Application to QSAR studies. Quantitative Structure-Activity Relationships 20(3): 227-240.
  • Evgeniou T and Pontil M (2001). Support vector machines: theory and applications. Machine learning and its applications. Advanced lectures. Paliouras G, Karkaletsis V and Spyropoulos CD. New York, Springer: 249-257.
  • Trotter MWB, Buxton BF and Holden SB (2001). Support vector machines in combinatorial chemistry. Measurement and Control 34(8): 235-239.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.

Probabilistic neural network (PNN)

Sunday, June 22nd, 2008

PNN was introduced by Specht in 1990 (Specht 1990) and is a form of neural network designed for classification through the use of Bayes’ optimal decision rule:

bayes.jpg

where hi and hj are the prior probabilities, ci and cj are the costs of misclassification and fi(x) and fj(x) are the probability density function for data class i and j respectively. A given compound with vector x is classified into data class i if the product of all the three terms is greater for data class i than for any other data class j not equal to i. In most applications, the prior probabilities and costs of misclassifications are treated as being equal. The probability density function for each data class for a univariate case can be estimated by the Parzen’s nonparametric estimator (Parzen 1962):

gx.jpg

where n is the sample size, sigma is a scaling parameter which defines the width of the bell curve that surrounds each compound, W(d) is a weight function which has its largest value at d = 0 and (x – xi) is the distance between a given compound and a compound in the training set. The Parzen’s nonparametric estimator was later expanded by Cacoullos (Cacoullos 1966) for the multivariate case.

gxx.jpg

The Gaussian function is frequently used as the weight function because it is well behaved, easily calculated and satisfies the conditions required by Parzen’s estimator. Thus the probability density function for the multivariate case becomes

gx2.jpg

To simplify the equation, a single sigma that is common to all the descriptors (single-sigma model) can be used instead of an individual sigma for each descriptor (multi-sigma model). Single-sigma models could be computed faster and can produce reasonable models when all the descriptors are of approximately equal importance. However, multi-sigma models are more general than single-sigma model and are useful when descriptors are of different nature and importance (Masters 1995).

PNN can be implemented as a neural network (Masters 1995), which is shown in the figure below. The network architecture of a PNN is determined by the number of compounds and descriptors in the training set. There are 4 layers in a PNN. The input layer provides input values to all neurons in the pattern layer and has as many neurons as the number of descriptors in the training set. The number of pattern neurons is determined by the total number of compounds in the training set. Each pattern neuron computes a distance measure between the input compound and the training compound represented by that neuron and then subjects the distance measure to the Parzen’s nonparameteric estimator. The summation layer has a neuron for each data class and the neurons sum all the pattern neurons’ output corresponding to members of that summation neuron’s data class to obtain the estimated probability density function for that data class. The single neuron in the output layer then determines the final data class of the input compound by comparing all the probability density functions from the summation neurons and choosing the data class with the highest value for the probability density function.

pnn.jpg

References

  • Cacoullos T (1966). Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics 18: 179-189.
  • Masters T (1995). Advanced algorithms for neural networks : a C++ sourcebook. New York, Wiley.
  • Parzen E (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3): 1065-1076.
  • Specht DF (1990). Probabilistic neural networks. Neural Networks 3(1): 109-118.

C4.5 decision tree (DT)

Wednesday, June 18th, 2008

C4.5 DT is a branch-test-based classifier (Quinlan 1993). A branch in a decision tree corresponds to a group of data classes and a leaf represents a specific data class. A decision node specifies a test to be conducted on a single descriptor value, with one branch and its subsequent data classes as possible outcomes of the test. A given compound with vector x is classified by starting at the root of the tree and moving through the tree until a leaf is encountered. At each non-leaf decision node, a test is conducted and the classification process proceeds to the branch selected by the test. Upon reaching the destination leaf, the data class of the given compound is predicted to be that associated with the leaf.

The algorithm is a recursive greedy heuristic that selects descriptors for membership within the tree. It uses recursive partitioning to examine every descriptor of the compounds in the training set and rank them according to their ability to partition the remaining compounds, thereby constructing a decision tree. Whether or not a descriptor is included within the tree is based on the value of its information gain. As a statistical property, information gain measures how well the descriptor separate training cases into subsets in which the data class is homogeneous. For descriptors with continuous values, a threshold value had to be established within each descriptor so that it could partition the training cases into subsets. These threshold values for each descriptor were established by rank ordering the values within each descriptor from lowest to highest and repeatedly calculating the information gain using the arithmetical midpoint between all successive values within the rank order. The midpoint value with the highest information gain was selected as the threshold value for the descriptor. That descriptor with the highest information gain (information being the most useful for classification) was then selected for inclusion in the DT. The algorithm continued to build the tree in this manner until it accounted for all training cases. Ties between descriptors that were equal in terms of information gain were broken randomly (Carnahan et al. 2003).

References

  • Carnahan B, Meyer G and Kuntz L-A (2003). Comparing statistical and machine learning classifiers: Alternatives for predictive modeling in human factors research. Human Factors 45(3): 408-423.
  • Quinlan JR (1993). C4.5 : programs for machine learning. San Mateo, Calif, Morgan Kaufmann.

k nearest neighbour (kNN) for regression

Saturday, June 14th, 2008

kNN can be modified for regression problems by replacing the previous equation with the following equation:

yreg.jpg

The predicted biological value of the given compound is the average of the biological values of its k nearest neighbours. Unlike kNN that is used for classification problems, k need not be an odd number in this case.

k nearest neighbour (kNN)

Tuesday, June 10th, 2008

kNN is a basic instance-based method and was introduced by Fix and Hodges (Fix et al. 1951). kNN measures the Euclidean distance between a given compound with vector x and each compound in the training set with individual vector xi (Fix et al. 1951; Johnson et al. 1982). The Euclidean distances for the vector pairs are calculated using the following formula:

d.jpg

A total of k number of training compounds nearest to the given compound is used to determine its data class:

y.jpg

where sigma(a,b)=1 if a=b and sigma(a,b)=0 if a!=b, argmax is the maximum of the function, V is a finite set of data classes. k is usually an odd number to prevent ambiguity in the estimation of y.

References

  • Fix E and Hodges JL (1951). Discriminatory analysis: Non-parametric discrimination: Consistency properties. Texas, USAF School of Aviation Medicine, Randolph Field: 261-279.
  • Johnson RA and Wichern DW (1982). Applied multivariate statistical analysis. Englewood Cliffs, NJ, Prentice Hall.

Recursive feature elimination (RFE)

Sunday, June 8th, 2008

It has been suggested that the ranking criterion for descriptor selection can be formulated from the variation in an objective function upon removing each descriptor (Kohavi et al. 1997). In order to improve the efficiency of support vector machine (SVM) training, this objective function is represented by a cost function J for the i-th descriptor and it is computed by using the training set only. When the i-th descriptor is removed or its weight wi is reduced to zero, the variation of the cost function DJ(i) is given by

dj.jpg

The case of Dwi = wi - 0 corresponds to the removal of descriptor i.

Guyon et al have used RFE to reduce the descriptors of a linear SVM classification system for cancer detection from gene selection data (Guyon et al. 2002). In the corresponding linear SVM classifier, the cost function is

j.jpg

where l is an m dimensional identity vector (m is the number of compounds in the training set). Therefore DJ(i) = (1/2) wi2 and wi2 can be used as a descriptor ranking criterion. Yu et al have used RFE to reduce the descriptors of a non-linear SVM classification system of polynomial kernels for prediction of drug activity (Yu et al. 2003). However, because of the diversity and complexity of the compounds to be classed, the use of linear and polynomial kernels may not always be sufficient for accurate prediction of various pharmaceutical and biological properties. Thus SVM classification systems of Gaussian kernels were frequently used. In this case, the cost function to be minimized, under the constraints 0 ≤ αk ≤ C and ay.jpg, is

j2.jpg

where H is the matrix with elements yi yj exp(-||xi - xj||2/(2σ2)).

To compute the variation in the cost function upon removal of input component i, the parameters αs were kept unchanged and the matrix H was re-computed. The resulting ranking coefficient is

dj2.jpg

where H(-i) is the matrix computed by using the same method as that of matrix H but with its i-th component removed. One or more of the descriptors with the smallest DJ(i) can thus be eliminated.

References

  • Guyon I, Weston J, Barnhill S and Vapnik V (2002). Gene selection for cancer classification using support vector machines. Machine Learning 46(1-3): 389-422.
  • Kohavi R and John GH (1997). Wrappers for feature subset selection. Artificial Intelligence 97(1-2): 273-324.
  • Yu H, Yang J, Wang W and Han J (2003). Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. Proceeding of the IEEE computer society bioinformatics conference (CSB): 220-228.

Genetic algorithm-based descriptor selection

Friday, June 6th, 2008

Genetic algorithm-based descriptor selection method comprises of four phases: initialization, evaluation, exploitation and exploration.

The initialization phase involves constructing an initial population of randomly selected descriptor subsets.

During the evaluation phase, each descriptor subset is evaluated by calculating its fitness score, which indicates the relevance of a descriptor subset to the biological property.

In the exploitation phase, the descriptor subsets were first ranked by their fitness value. The higher ranked descriptor subsets were given a higher probability of being chosen for reproduction. The top x selected descriptor subsets were then used to replace the x lowest ranking descriptor subsets in the population. These x new descriptor subsets, together with the y highest ranked descriptor subsets in the current generation, form a new generation of descriptor subsets.

In the last phase, which is the exploration phase, the x new descriptor subsets were subjected to one point crossover and mutation to increase the diversity of the population. In the mutation process, descriptors might be randomly added to or deleted from a descriptor subset. After the exploration phase, the genetic algorithm returns to the evaluation phase and the cycle repeats until at least n generations have passed and the highest ranked descriptor subset remains the same for s generations. The highest ranked descriptor subset was used to construct the final QSAR/qSAR model.

Note: x, y, n, s are defined by the user.


Close
E-mail It