Update

October 25th, 2008

Have not been posting for quite some time. I am just too busy at work and at home to post anything (ya, I know, it is poor excuse). Anyway, I have just completed a programming project (see my next post) and now should have time to go back to look at visualization of datasets again :)

Share This

VisuMap - Part 2

August 29th, 2008

I had previously tried out the mapping algorithms in the software VisuMap using my own dataset of 171 compounds, which can be separated into three congeneric groups of compounds: penicillins, cephalosporins, fluoroquinolones.

A recent comment by the author of VisuMap suggests that the performance of the mapping algorithms could be improved by carefully selecting the appropriate distance metric. Since my dataset was using fingerprints (1025 binary features), he suggested using the Jaccard or Dice distance metric.

So I did some more experiments.

sammon-jaccard.jpg
Results from Sammon mapping using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

cca-jaccard.jpg
Results from Curvilinear component analysis (CCA) using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

rpm-jaccard.jpg
Results from Relational perspective map (RPM) using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

mds-jaccard.jpg
Results from SMACOF MDS using Jaccard distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

sammon-dice.jpg
Results from Sammon mapping using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

cca-dice.jpg
Results from Curvilinear component analysis (CCA) using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

rpm-dice.jpg
Results from Relational perspective map (RPM) using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

mds-dice.jpg
Results from SMACOF MDS using Dice distance metric. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

From the figures, it can be seen that the Jaccard distance metric didn’t really improve the results (i.e. Sammon and MDS are still good, and CCA and RPM are still poor in separating the three groups). However, using the Dice distance metric has rather interesting results. CCA now shows very good separation for the three groups, whereas RPM failed rather terribly. Sammon and MDS has good separation between the penicillins/cephalosporins and fluoroquinolones but now has difficulty in cleaning separating the penicillins and cephalosporins. (It is important for me to reiterate here that the colours and shapes of the different groups were added in manually to enhance the visual effects. Bear in mind that when you process a dataset with unknown groupings, every point will appear to be the same. Thus the only way to differentitate groups is if there is an obvious separation band).

Thus the obvious conclusion is that both the mapping algorithm and distance metric are important for separating different groups (nothing new here). The question then boils down to how to select the appropriate mapping algorithm and distance metric for a dataset? Is there some rule of thumb for selection or do we have to manually try different combinations? Some clues from the author of VisuMap is that “Sammon map and PCA emphasize on the global inter-cluster structure, whereas other mapping algorithms (like the RPM and CCA) emphasize more on the details within clusters.”. So does that mean that we should use Sammon map or PCA to get a broad overview, then extract each cluster (or highlight each cluster) and then use RPM or CCA to examine the structure of the cluster? Also, is there any references which state what type of distance metric is appropriate for what type of features? These are some questions that are going in my mind now and I guess it is time to do some literature searches. Anyone has any comments, answers or can provide some useful references?

Share This

Knowledge Discovery and Data Mining Process Model

August 12th, 2008

I read an article (Kurgan et al. 2006) reviewing several commonly used process models for knowledge discovery and data mining recently. The number of steps in these models ranged from 5 to 9 but the actual process is pretty similar among the models. Well, I guess you can’t deviate too much if you wish to do knowledge discovery properly.

Among the various models presented, I particularly like the Generic model, which pools and summarizes the important points from the reviewed models. The Generic model borrows heavily on a model proposed by Cios et al. in 2000. The steps in the Generic model are:

  1. Application domain understanding
  2. Data understanding
  3. Data preparation and identification of data mining technology
  4. Data mining
  5. Evaluation
  6. Knowledge consolidation and deployment

I am sure these steps are nothing new and will be familiar to those involved in knowledge discovery and data mining. However, if you had not been following any particular models, this might serve as a good reference to show that the methods which you had been using were already validated by others.

References

  • Cios KJ, Teresinska A, Konieczna S, Potocka J, Sharma S. A knowledge discovery approach to diagnosing myocardial perfusion. Engineering in Medicine and Biology Magazine, IEEE. 2000;19(4):17-25.
  • Kurgan LA, Musilek P. A survey of Knowledge Discovery and Data Mining process models. The Knowledge Engineering Review. 2006;21(01):1-24.
Share This

PaDEL-Descriptor

August 2nd, 2008

Introducing the first product from my laboratory, PaDEL-Descriptor. It is a software to calculate molecular descriptors and fingerprints. The software currently calculates 393 descriptors (290 1D, 2D descriptors and 103 3D descriptors) and 5 types of fingerprints. The descriptors and fingerprints are calculated using The Chemistry Development Kit with some in-house addition for electrotopological descriptors. All the different types of descriptors are calculated in parallel to take full advantage of the multi-core CPUs that are commonly found nowadays. The usage instructions can be found on the website itself. This software is free for all (e.g. personal, academic, non-profit, non-commercial, government, commercial, etc) to use.

The software is Java Web Start ready. What this means is that if you have Java JRE installed on your computer (which most people should have by now), you can just click on a link on the website to launch the software directly. A copy of the software will automatically be downloaded, stored on your computer and run. You can create a shortcut to this software on your desktop. When you click on this shortcut and if you are online, Java Web Start will automatically check if there is a new version of the software available. If there is, it will download it before running the software. If you are offline, Java Web Start will just run your local copy. The main advantage of Java Web Start is that it will always ensure that you are running the latest version of the software (if you are online). If I have the time, I will give a short writeup on how to make your own Java software Java Web Start ready. It is really very easy if you are using NetBeans.

Share This

VisuMap

July 22nd, 2008

VisuMap is a high dimensional data visualizer. It provides a number of dimensionality reduction methods like principal component analysis, Sammon mapping, curvilinear component analysis, relational perspective map and SMACOF MDS. It also has a few data clustering methods such as K-mean clustering, agglomerative clustering, self-organizing map and metric sampling.

The website contains some sample maps, sample datasets for you to work on. There are also white papers, and demo videos on the software (which is only available after you register with the website).

To evaluate this software, I used my own dataset. In one of my previous research, I gathered three congeneric groups of compounds: penicillins, cephalosporins, fluoroquinolones. I compute fingerprints (1025 dimensions) using openbabel for these compounds and combined them into one dataset. Then I load the dataset into VisuMap and run it through each of the different dimensionality reduction methods.

pca3d.jpg

Results from Principal component analysis. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

sammon2d.jpg

Results from Sammon mapping. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

cca2d.jpg

Results from Curvilinear component analysis. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

rpm2d.jpg

Results from Relational perspective map. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

mds2d.jpg

Results from SMACOF MDS. Yellow squares are cephalosporins, Red circles are penicillins, Blue triangles are fluoroquinolones

All the pictures above (except PCA) are the 2D maps produced by the various algorithms. Although the software can also produce 3D maps, it is not easy to visualize them as the software does not provide very good controls for rotating the map. I could not get the 3D animation to work in my VMWare machine so I don’t know whether it provides an easy way to view 3D maps. It will be good if the software adopts the way that molecular structure viewer software like Sybyl handles 3D structures (i.e. hold down right mouse button and move the mouse to rotate).

It can be seen from the pictures that the algorithms PCA, Sammon and MDS did a very good job in showing that there are three distinct groups from the obvious separation between the groups (The colours and shapes of the different groups were added in manually to enhance the visual effects. Bear in mind that when you process a dataset with unknown groupings, every point will appear to be the same. Thus the only way to differentitate groups is if there is an obvious separation band). For the other algorithms, the separation between the groups are not as good, although it can be seen that members of each group does not mix with those from other groups. The Sammon and MDS algorithm also correctly showed that penicillins and cephalosporins are closer to each other than they are to fluoroquinolones.

Share This

Visualization software for exploratory data analysis

July 14th, 2008

A dataset may contain anywhere from one to several thousand features. When the number of features in a dataset exceeds three, it is difficult to visualize how different instances are related to one another. Luckily, there are methods available to help us visualize these high-dimensional dataset. The common thing about these methods is that they reduce the original features in the dataset into not more than three features, while retaining the distance relationship between the instances. This allows the instances to be plotted as a 2D or 3D graph, providing us with a visual overview of the structure of the dataset.

The usual method for visualizing datasets in QSAR is principal component analysis (PCA). PCA is used to convert the existing features into another set of orthogonal features, with the first few features capturing the bulk of the variance in the dataset. A 2D plot is usually made from the first two principal components and is useful for showing clusters in the dataset, areas where data is sparse, possible outliers, and whether it is possible to separate the different classes in the dataset using PCA alone.

Other than PCA, other dimensionality reduction methods are seldom used in QSAR. The reasons are not clear. Perhaps, it is due to the lack of software, or the lack of expertise in interpreting such graphs. Indeed, it is for both reasons that I do not use visualization methods often in my research. Previously, I was too involved in data mining alone. Now, I have broaden my approach to data exploration and thus it is necessary for me to learn how to visualize data properly.

A search on the internet shows that there are some visualization software available. I have selected three software, VisuMap, OmniViz, and GGobi to explore in more details.

Share This

General regression neural network (GRNN)

July 6th, 2008

GRNN is a modification of PNN for regression problems (Specht 1991). For GRNN, the predicted value of the biological property is the most probable value, which is given by

grnnpredy.jpg

where f(x,y) is the joint density and can be estimated by using Parzen’s nonparametric estimator. Substituting Parzen’s nonparametric estimator for f(x,y) and performing the integrations leads to the fundamental equation of GRNN.

grnnpredyfinal.jpg

where

grnndxx.jpg

The network architecture of a GRNN is similar to that of a PNN except that its summation layer has two neurons that calculate the numerator and denominator. The single neuron in the output layer then performs a division of the two summation neurons to obtain the predicted biological value of the given compound.

References

  • Specht DF (1991). A general regression neural network. IEEE Transactions on Neural Networks 2(6): 568-576.
Share This

Support vector regression (SVR)

July 2nd, 2008

The theoretical background of SVR is similar to that of SVM (Smola et al.; Vapnik 1995; Yuan et al. 2004). In SVR, the kernel function is used to map the vectors into a higher dimensional feature space and linear regression is then conducted in this space. The optimal regression function can be represented by:

svrpredy.jpg

where y represents the predicted value of a biological property, and the coefficients alpha, alpha* and bias b are determined by maximizing the following Langrangian expression:

svrlangrangian.jpg

under the following conditions:

svralpha.jpg
svrsum.jpg

References

  • Smola AJ and Scholkopf B A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.
  • Yuan Z and Huang BX (2004). Prediction of protein accessible surface areas by support vector regression. Proteins 57(3): 558-564.
Share This

Support vector machine (SVM)

June 28th, 2008

SVM is based on the structural risk minimization principle from statistical learning theory (Vapnik 1995; Burges 1998; Evgeniou et al. 2001). A compound is represented by a vector xi which is its molecular descriptors. In linearly separable cases, SVM constructs a hyperplane which separates two data classes of compounds with a maximum margin. This is accomplished by finding another vector w and a parameter b that minimizes w2.jpg and satisfies the following conditions:

dplus.jpg Class 1 (D+)

dminus.jpg Class 2 (D–)

where yi is the data class index of compound i, w is a vector normal to the hyperplane, bw.jpg is the perpendicular distance from the hyperplane to the origin and w2.jpg is the Euclidean norm of w. After the determination of w and b, a given compound with vector x can be classified by:

ysvm.jpg

In non-linearly separable cases, SVM maps the vectors into a higher dimensional feature space using a kernel function K(xi, xj). The table below lists three different types of kernel functions which are commonly used. The Gaussian radial basis function kernel has been extensively used in a number of different studies with good results (Burbidge et al. 2001; Czerminski et al. 2001; Trotter et al. 2001).

Commonly used kernel functions

Kernel Equation
Polynomial polynomialkernel.jpg
Gaussian radial basis function rbfkernel.jpg
Sigmoidal sigmoidalkernel.jpg

Linear support vector machine is applied to this feature space and then the decision function is given by:

ysvmnonlinear.jpg

where l is the number of support vectors and the coefficients alphai0 and b are determined by maximizing the following Langrangian expression:

langrangian.jpg

under the following conditions:

a1.jpg

a2.jpg

where C is a penalty for training errors. A positive or negative value from decision function equation indicates that the compound with vector x belongs to the positive or negative data class respectively.

References

  • Burbidge R, Trotter M, Buxton B and Holden S (2001). Drug design by machine learning: support vector machines for pharmaceutical data analysis. Computers and Chemistry 26(1): 5-14.
  • Burges CJC (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery 2(2): 127-167.
  • Czerminski R, Yasri A and Hartsough D (2001). Use of support vector machine in pattern classification: Application to QSAR studies. Quantitative Structure-Activity Relationships 20(3): 227-240.
  • Evgeniou T and Pontil M (2001). Support vector machines: theory and applications. Machine learning and its applications. Advanced lectures. Paliouras G, Karkaletsis V and Spyropoulos CD. New York, Springer: 249-257.
  • Trotter MWB, Buxton BF and Holden SB (2001). Support vector machines in combinatorial chemistry. Measurement and Control 34(8): 235-239.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.
  • Share This

Probabilistic neural network (PNN)

June 22nd, 2008

PNN was introduced by Specht in 1990 (Specht 1990) and is a form of neural network designed for classification through the use of Bayes’ optimal decision rule:

bayes.jpg

where hi and hj are the prior probabilities, ci and cj are the costs of misclassification and fi(x) and fj(x) are the probability density function for data class i and j respectively. A given compound with vector x is classified into data class i if the product of all the three terms is greater for data class i than for any other data class j not equal to i. In most applications, the prior probabilities and costs of misclassifications are treated as being equal. The probability density function for each data class for a univariate case can be estimated by the Parzen’s nonparametric estimator (Parzen 1962):

gx.jpg

where n is the sample size, sigma is a scaling parameter which defines the width of the bell curve that surrounds each compound, W(d) is a weight function which has its largest value at d = 0 and (x – xi) is the distance between a given compound and a compound in the training set. The Parzen’s nonparametric estimator was later expanded by Cacoullos (Cacoullos 1966) for the multivariate case.

gxx.jpg

The Gaussian function is frequently used as the weight function because it is well behaved, easily calculated and satisfies the conditions required by Parzen’s estimator. Thus the probability density function for the multivariate case becomes

gx2.jpg

To simplify the equation, a single sigma that is common to all the descriptors (single-sigma model) can be used instead of an individual sigma for each descriptor (multi-sigma model). Single-sigma models could be computed faster and can produce reasonable models when all the descriptors are of approximately equal importance. However, multi-sigma models are more general than single-sigma model and are useful when descriptors are of different nature and importance (Masters 1995).

PNN can be implemented as a neural network (Masters 1995), which is shown in the figure below. The network architecture of a PNN is determined by the number of compounds and descriptors in the training set. There are 4 layers in a PNN. The input layer provides input values to all neurons in the pattern layer and has as many neurons as the number of descriptors in the training set. The number of pattern neurons is determined by the total number of compounds in the training set. Each pattern neuron computes a distance measure between the input compound and the training compound represented by that neuron and then subjects the distance measure to the Parzen’s nonparameteric estimator. The summation layer has a neuron for each data class and the neurons sum all the pattern neurons’ output corresponding to members of that summation neuron’s data class to obtain the estimated probability density function for that data class. The single neuron in the output layer then determines the final data class of the input compound by comparing all the probability density functions from the summation neurons and choosing the data class with the highest value for the probability density function.

pnn.jpg

References

  • Cacoullos T (1966). Estimation of a multivariate density. Annals of the Institute of Statistical Mathematics 18: 179-189.
  • Masters T (1995). Advanced algorithms for neural networks : a C++ sourcebook. New York, Wiley.
  • Parzen E (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3): 1065-1076.
  • Specht DF (1990). Probabilistic neural networks. Neural Networks 3(1): 109-118.
  • Share This

Close
E-mail It