Archive for the ‘Data mining’ Category

PaDEL-Descriptor

Saturday, August 2nd, 2008

Introducing the first product from my laboratory, PaDEL-Descriptor. It is a software to calculate molecular descriptors and fingerprints. The software currently calculates 393 descriptors (290 1D, 2D descriptors and 103 3D descriptors) and 5 types of fingerprints. The descriptors and fingerprints are calculated using The Chemistry Development Kit with some in-house addition for electrotopological descriptors. All the different types of descriptors are calculated in parallel to take full advantage of the multi-core CPUs that are commonly found nowadays. The usage instructions can be found on the website itself. This software is free for all (e.g. personal, academic, non-profit, non-commercial, government, commercial, etc) to use.

The software is Java Web Start ready. What this means is that if you have Java JRE installed on your computer (which most people should have by now), you can just click on a link on the website to launch the software directly. A copy of the software will automatically be downloaded, stored on your computer and run. You can create a shortcut to this software on your desktop. When you click on this shortcut and if you are online, Java Web Start will automatically check if there is a new version of the software available. If there is, it will download it before running the software. If you are offline, Java Web Start will just run your local copy. The main advantage of Java Web Start is that it will always ensure that you are running the latest version of the software (if you are online). If I have the time, I will give a short writeup on how to make your own Java software Java Web Start ready. It is really very easy if you are using NetBeans.

VisuMap

Tuesday, July 22nd, 2008

VisuMap is a high dimensional data visualizer. It provides a number of dimensionality reduction methods like principal component analysis, Sammon mapping, curvilinear component analysis, relational perspective map and SMACOF MDS. It also has a few data clustering methods such as K-mean clustering, agglomerative clustering, self-organizing map and metric sampling.

The website contains some sample maps, sample datasets for you to work on. There are also white papers, and demo videos on the software (which is only available after you register with the website).

To evaluate this software, I used my own dataset. In one of my previous research, I gathered three congeneric groups of compounds: penicillins, cephalosporins, fluoroquinolones. I compute fingerprints (1025 dimensions) using openbabel for these compounds and combined them into one dataset. Then I load the dataset into VisuMap and run it through each of the different dimensionality reduction methods.

pca3d.jpg

Results from Principal component analysis. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

sammon2d.jpg

Results from Sammon mapping. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

cca2d.jpg

Results from Curvilinear component analysis. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

rpm2d.jpg

Results from Relational perspective map. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

mds2d.jpg

Results from SMACOF MDS. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

All the pictures above (except PCA) are the 2D maps produced by the various algorithms. Although the software can also produce 3D maps, it is not easy to visualize them as the software does not provide very good controls for rotating the map. I could not get the 3D animation to work in my VMWare machine so I don’t know whether it provides an easy way to view 3D maps. It will be good if the software adopts the way that molecular structure viewer software like Sybyl handles 3D structures (i.e. hold down right mouse button and move the mouse to rotate).

It can be seen from the pictures that the algorithms PCA, Sammon and MDS did a very good job in showing that there are three distinct groups from the obvious separation between the groups (The colours and shapes of the different groups were added in manually to enhance the visual effects. Bear in mind that when you process a dataset with unknown groupings, every point will appear to be the same. Thus the only way to differentitate groups is if there is an obvious separation band). For the other algorithms, the separation between the groups are not as good, although it can be seen that members of each group does not mix with those from other groups. The Sammon and MDS algorithm also correctly showed that penicillins and cephalosporins are closer to each other than they are to fluoroquinolones.

Visualization software for exploratory data analysis

Monday, July 14th, 2008

A dataset may contain anywhere from one to several thousand features. When the number of features in a dataset exceeds three, it is difficult to visualize how different instances are related to one another. Luckily, there are methods available to help us visualize these high-dimensional dataset. The common thing about these methods is that they reduce the original features in the dataset into not more than three features, while retaining the distance relationship between the instances. This allows the instances to be plotted as a 2D or 3D graph, providing us with a visual overview of the structure of the dataset.

The usual method for visualizing datasets in QSAR is principal component analysis (PCA). PCA is used to convert the existing features into another set of orthogonal features, with the first few features capturing the bulk of the variance in the dataset. A 2D plot is usually made from the first two principal components and is useful for showing clusters in the dataset, areas where data is sparse, possible outliers, and whether it is possible to separate the different classes in the dataset using PCA alone.

Other than PCA, other dimensionality reduction methods are seldom used in QSAR. The reasons are not clear. Perhaps, it is due to the lack of software, or the lack of expertise in interpreting such graphs. Indeed, it is for both reasons that I do not use visualization methods often in my research. Previously, I was too involved in data mining alone. Now, I have broaden my approach to data exploration and thus it is necessary for me to learn how to visualize data properly.

A search on the internet shows that there are some visualization software available. I have selected three software, VisuMap, OmniViz, and GGobi to explore in more details.

General regression neural network (GRNN)

Sunday, July 6th, 2008

GRNN is a modification of PNN for regression problems (Specht 1991). For GRNN, the predicted value of the biological property is the most probable value, which is given by

grnnpredy.jpg

where f(x,y) is the joint density and can be estimated by using Parzen’s nonparametric estimator. Substituting Parzen’s nonparametric estimator for f(x,y) and performing the integrations leads to the fundamental equation of GRNN.

grnnpredyfinal.jpg

where

grnndxx.jpg

The network architecture of a GRNN is similar to that of a PNN except that its summation layer has two neurons that calculate the numerator and denominator. The single neuron in the output layer then performs a division of the two summation neurons to obtain the predicted biological value of the given compound.

References

  • Specht DF (1991). A general regression neural network. IEEE Transactions on Neural Networks 2(6): 568-576.

Support vector regression (SVR)

Wednesday, July 2nd, 2008

The theoretical background of SVR is similar to that of SVM (Smola et al.; Vapnik 1995; Yuan et al. 2004). In SVR, the kernel function is used to map the vectors into a higher dimensional feature space and linear regression is then conducted in this space. The optimal regression function can be represented by:

svrpredy.jpg

where y represents the predicted value of a biological property, and the coefficients alpha, alpha* and bias b are determined by maximizing the following Langrangian expression:

svrlangrangian.jpg

under the following conditions:

svralpha.jpg
svrsum.jpg

References

  • Smola AJ and Scholkopf B A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.
  • Yuan Z and Huang BX (2004). Prediction of protein accessible surface areas by support vector regression. Proteins 57(3): 558-564.

Support vector machine (SVM)

Saturday, June 28th, 2008

SVM is based on the structural risk minimization principle from statistical learning theory (Vapnik 1995; Burges 1998; Evgeniou et al. 2001). A compound is represented by a vector xi which is its molecular descriptors. In linearly separable cases, SVM constructs a hyperplane which separates two data classes of compounds with a maximum margin. This is accomplished by finding another vector w and a parameter b that minimizes w2.jpg and satisfies the following conditions:

dplus.jpg Class 1 (D+)

dminus.jpg Class 2 (D–)

where yi is the data class index of compound i, w is a vector normal to the hyperplane, bw.jpg is the perpendicular distance from the hyperplane to the origin and w2.jpg is the Euclidean norm of w. After the determination of w and b, a given compound with vector x can be classified by:

ysvm.jpg

In non-linearly separable cases, SVM maps the vectors into a higher dimensional feature space using a kernel function K(xi, xj). The table below lists three different types of kernel functions which are commonly used. The Gaussian radial basis function kernel has been extensively used in a number of different studies with good results (Burbidge et al. 2001; Czerminski et al. 2001; Trotter et al. 2001).

Commonly used kernel functions

Kernel Equation
Polynomial polynomialkernel.jpg
Gaussian radial basis function rbfkernel.jpg
Sigmoidal sigmoidalkernel.jpg

Linear support vector machine is applied to this feature space and then the decision function is given by:

ysvmnonlinear.jpg

where l is the number of support vectors and the coefficients alphai0 and b are determined by maximizing the following Langrangian expression:

langrangian.jpg

under the following conditions:

a1.jpg

a2.jpg

where C is a penalty for training errors. A positive or negative value from decision function equation indicates that the compound with vector x belongs to the positive or negative data class respectively.

References

  • Burbidge R, Trotter M, Buxton B and Holden S (2001). Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers and Chemistry 26(1): 5-14.
  • Burges CJC (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 127-167.
  • Czerminski R, Yasri A and Hartsough D (2001). Use of support vector machine in pattern classification: Application to QSAR studies. Quantitative Structure-Activity Relationships 20(3): 227-240.
  • Evgeniou T and Pontil M (2001). Support vector machines: theory and applications. Machine learning and its applications. Advanced lectures. Paliouras G, Karkaletsis V and Spyropoulos CD. New York, Springer: 249-257.
  • Trotter MWB, Buxton BF and Holden SB (2001). Support vector machines in combinatorial chemistry. Measurement and Control 34(8): 235-239.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.

Probabilistic neural network (PNN)

Sunday, June 22nd, 2008

PNN was introduced by Specht in 1990 (Specht 1990) and is a form of neural network designed for classification through the use of Bayes’ optimal decision rule:

bayes.jpg

where hi and hj are the prior probabilities, ci and cj are the costs of misclassification and fi(x) and fj(x) are the probability density function for data class i and j respectively. A given compound with vector x is classified into data class i if the product of all the three terms is greater for data class i than for any other data class j not equal to i. In most applications, the prior probabilities and costs of misclassifications are treated as being equal. The probability density function for each data class for a univariate case can be estimated by the Parzen’s nonparametric estimator (Parzen 1962):

gx.jpg

where n is the sample size, sigma is a scaling parameter which defines the width of the bell curve that surrounds each compound, W(d) is a weight function which has its largest value at d = 0 and (x – xi) is the distance between a given compound and a compound in the training set. The Parzen’s nonparametric estimator was later expanded by Cacoullos (Cacoullos 1966) for the multivariate case.

gxx.jpg

The Gaussian function is frequently used as the weight function because it is well behaved, easily calculated and satisfies the conditions required by Parzen’s estimator. Thus the probability density function for the multivariate case becomes

gx2.jpg

To simplify the equation, a single sigma that is common to all the descriptors (single-sigma model) can be used instead of an individual sigma for each descriptor (multi-sigma model). Single-sigma models could be computed faster and can produce reasonable models when all the descriptors are of approximately equal importance. However, multi-sigma models are more general than single-sigma model and are useful when descriptors are of different nature and importance (Masters 1995).

PNN can be implemented as a neural network (Masters 1995), which is shown in the figure below. The network architecture of a PNN is determined by the number of compounds and descriptors in the training set. There are 4 layers in a PNN. The input layer provides input values to all neurons in the pattern layer and has as many neurons as the number of descriptors in the training set. The number of pattern neurons is determined by the total number of compounds in the training set. Each pattern neuron computes a distance measure between the input compound and the training compound represented by that neuron and then subjects the distance measure to the Parzen’s nonparameteric estimator. The summation layer has a neuron for each data class and the neurons sum all the pattern neurons’ output corresponding to members of that summation neuron’s data class to obtain the estimated probability density function for that data class. The single neuron in the output layer then determines the final data class of the input compound by comparing all the probability density functions from the summation neurons and choosing the data class with the highest value for the probability density function.

pnn.jpg

References

  • Cacoullos T (1966). Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics 18: 179-189.
  • Masters T (1995). Advanced algorithms for neural networks : a C++ sourcebook. New York, Wiley.
  • Parzen E (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3): 1065-1076.
  • Specht DF (1990). Probabilistic neural networks. Neural Networks 3(1): 109-118.

C4.5 decision tree (DT)

Wednesday, June 18th, 2008

C4.5 DT is a branch-test-based classifier (Quinlan 1993). A branch in a decision tree corresponds to a group of data classes and a leaf represents a specific data class. A decision node specifies a test to be conducted on a single descriptor value, with one branch and its subsequent data classes as possible outcomes of the test. A given compound with vector x is classified by starting at the root of the tree and moving through the tree until a leaf is encountered. At each non-leaf decision node, a test is conducted and the classification process proceeds to the branch selected by the test. Upon reaching the destination leaf, the data class of the given compound is predicted to be that associated with the leaf.

The algorithm is a recursive greedy heuristic that selects descriptors for membership within the tree. It uses recursive partitioning to examine every descriptor of the compounds in the training set and rank them according to their ability to partition the remaining compounds, thereby constructing a decision tree. Whether or not a descriptor is included within the tree is based on the value of its information gain. As a statistical property, information gain measures how well the descriptor separate training cases into subsets in which the data class is homogeneous. For descriptors with continuous values, a threshold value had to be established within each descriptor so that it could partition the training cases into subsets. These threshold values for each descriptor were established by rank ordering the values within each descriptor from lowest to highest and repeatedly calculating the information gain using the arithmetical midpoint between all successive values within the rank order. The midpoint value with the highest information gain was selected as the threshold value for the descriptor. That descriptor with the highest information gain (information being the most useful for classification) was then selected for inclusion in the DT. The algorithm continued to build the tree in this manner until it accounted for all training cases. Ties between descriptors that were equal in terms of information gain were broken randomly (Carnahan et al. 2003).

References

  • Carnahan B, Meyer G and Kuntz L-A (2003). Comparing statistical and machine learning classifiers: Alternatives for predictive modeling in human factors research. Human Factors 45(3): 408-423.
  • Quinlan JR (1993). C4.5 : programs for machine learning. San Mateo, Calif, Morgan Kaufmann.

k nearest neighbour (kNN) for regression

Saturday, June 14th, 2008

kNN can be modified for regression problems by replacing the previous equation with the following equation:

yreg.jpg

The predicted biological value of the given compound is the average of the biological values of its k nearest neighbours. Unlike kNN that is used for classification problems, k need not be an odd number in this case.

k nearest neighbour (kNN)

Tuesday, June 10th, 2008

kNN is a basic instance-based method and was introduced by Fix and Hodges (Fix et al. 1951). kNN measures the Euclidean distance between a given compound with vector x and each compound in the training set with individual vector xi (Fix et al. 1951; Johnson et al. 1982). The Euclidean distances for the vector pairs are calculated using the following formula:

d.jpg

A total of k number of training compounds nearest to the given compound is used to determine its data class:

y.jpg

where sigma(a,b)=1 if a=b and sigma(a,b)=0 if a!=b, argmax is the maximum of the function, V is a finite set of data classes. k is usually an odd number to prevent ambiguity in the estimation of y.

References

  • Fix E and Hodges JL (1951). Discriminatory analysis: Non-parametric discrimination: Consistency properties. Texas, USAF School of Aviation Medicine, Randolph Field: 261-279.
  • Johnson RA and Wichern DW (1982). Applied multivariate statistical analysis. Englewood Cliffs, NJ, Prentice Hall.

Close
E-mail It