Archive for July, 2008

VisuMap

Tuesday, July 22nd, 2008

VisuMap is a high-dimensional data visualizer. It provides a number of dimensionality reduction methods, such as principal component analysis (PCA), Sammon mapping, curvilinear component analysis, relational perspective map and SMACOF MDS. It also has a few data clustering methods, such as k-means clustering, agglomerative clustering, self-organizing map and metric sampling.

The website contains some sample maps and sample datasets for you to work on. There are also white papers and demo videos of the software (which is only available after you register with the website).

To evaluate this software, I used my own dataset. In one of my previous research projects, I gathered three congeneric groups of compounds: penicillins, cephalosporins and fluoroquinolones. I computed fingerprints (1025 dimensions) for these compounds using Open Babel and combined them into one dataset. I then loaded the dataset into VisuMap and ran it through each of the dimensionality reduction methods.

pca3d.jpg

Results from principal component analysis. Yellow squares are cephalosporins, red circles are penicillins and blue triangles are fluoroquinolones.

sammon2d.jpg

Results from Sammon mapping. Yellow squares are cephalosporins, red circles are penicillins and blue triangles are fluoroquinolones.

cca2d.jpg

Results from curvilinear component analysis. Yellow squares are cephalosporins, red circles are penicillins and blue triangles are fluoroquinolones.

rpm2d.jpg

Results from relational perspective map. Yellow squares are cephalosporins, red circles are penicillins and blue triangles are fluoroquinolones.

mds2d.jpg

Results from SMACOF MDS. Yellow squares are cephalosporins, red circles are penicillins and blue triangles are fluoroquinolones.

All the pictures above (except the PCA one) are 2D maps produced by the various algorithms. Although the software can also produce 3D maps, they are not easy to visualize because the software does not provide very good controls for rotating the map. I could not get the 3D animation to work in my VMware machine, so I don’t know whether it provides an easier way to view 3D maps. It would be good if the software adopted the way that molecular structure viewers like Sybyl handle 3D structures (i.e. hold down the right mouse button and move the mouse to rotate).

It can be seen from the pictures that PCA, Sammon mapping and MDS did a very good job of showing that there are three distinct groups, as there is an obvious separation between the groups. (The colours and shapes of the different groups were added manually to enhance the visual effect. Bear in mind that when you process a dataset with unknown groupings, every point will look the same, so the only way to differentiate groups is if there is an obvious separation band.) For the other algorithms, the separation between the groups is not as good, although it can be seen that members of each group do not mix with those from other groups. The Sammon and MDS algorithms also correctly showed that penicillins and cephalosporins are closer to each other than they are to fluoroquinolones.

Visualization software for exploratory data analysis

Monday, July 14th, 2008

A dataset may contain anywhere from one to several thousand features. When the number of features in a dataset exceeds three, it is difficult to visualize how different instances are related to one another. Luckily, there are methods available to help us visualize these high-dimensional datasets. What these methods have in common is that they reduce the original features in the dataset to no more than three features while retaining the distance relationships between the instances. This allows the instances to be plotted as a 2D or 3D graph, providing us with a visual overview of the structure of the dataset.
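One way to put a number on "retaining the distance relationship" is Sammon's stress, the quantity that Sammon mapping minimizes. The sketch below is a minimal illustration in plain Python, not the implementation of any particular package. The toy points and the naive "keep the first two features" projection are made up for the example.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def sammon_stress(original, projected):
    """Sammon's stress; 0 means every pairwise distance is preserved exactly."""
    total = 0.0   # sum of original pairwise distances (normalizer)
    error = 0.0   # distance errors, weighted by 1 / original distance
    for i, j in combinations(range(len(original)), 2):
        d_high = euclidean(original[i], original[j])
        d_low = euclidean(projected[i], projected[j])
        if d_high == 0.0:
            continue  # coincident points carry no distance information
        total += d_high
        error += (d_high - d_low) ** 2 / d_high
    return error / total if total > 0.0 else 0.0

# Toy data (made up): four instances with four features each.
points_4d = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [5.0, 5.0, 0.1, 0.0],
]
# A naive 2D "map": keep only the first two features.
points_2d = [p[:2] for p in points_4d]

stress_naive = sammon_stress(points_4d, points_2d)
stress_identity = sammon_stress(points_4d, points_4d)
print(stress_naive, stress_identity)
```

Because almost all of the spread in the toy data lies in the first two features, the naive projection gives a stress very close to zero. A map that distorted the pairwise distances would score higher.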

The usual method for visualizing datasets in QSAR is principal component analysis (PCA). PCA is used to convert the existing features into another set of orthogonal features, with the first few features capturing the bulk of the variance in the dataset. A 2D plot is usually made from the first two principal components and is useful for showing clusters in the dataset, areas where data is sparse, possible outliers, and whether it is possible to separate the different classes in the dataset using PCA alone.
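As a rough illustration of the steps just described, the sketch below centers the data, builds the covariance matrix, and extracts the first two principal components by power iteration with deflation. Power iteration is used only to keep the example free of dependencies; real packages typically use an eigendecomposition or SVD routine instead. The function names and the toy data are assumptions made for this example.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every instance."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered data."""
    n = len(centered)
    dims = len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1)
             for b in range(dims)] for a in range(dims)]

def mat_vec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def leading_eigenpair(matrix, iterations=1000):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    dims = len(matrix)
    # Asymmetric start vector, to reduce the chance of starting
    # exactly orthogonal to the leading eigenvector.
    vec = [1.0 + 0.1 * k for k in range(dims)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = mat_vec(matrix, vec)
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(v * m for v, m in zip(vec, mat_vec(matrix, vec)))  # Rayleigh quotient
    return value, vec

def pca(rows, n_components=2):
    """Return (scores, components, variances) for the leading components."""
    centered = center(rows)
    cov = covariance(centered)
    dims = len(cov)
    variances, components = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        variances.append(value)
        components.append(vec)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(dims)]
               for a in range(dims)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return scores, components, variances

# Toy data (made up): three features, with nearly all variance along x = y.
data = [[2.0, 1.9, 0.1], [4.0, 4.1, 0.0], [6.0, 5.8, 0.2],
        [8.0, 8.2, 0.1], [10.0, 9.9, 0.0]]
scores, components, variances = pca(data, n_components=2)
cov_full = covariance(center(data))
total_variance = sum(cov_full[k][k] for k in range(len(cov_full)))
explained_first = variances[0] / total_variance
print(explained_first)
```

The rows of `scores` are the 2D coordinates that would be plotted. `explained_first` is the fraction of the total variance captured by the first component, which for this toy data is nearly all of it.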

Apart from PCA, dimensionality reduction methods are seldom used in QSAR. The reasons are not clear. Perhaps it is due to the lack of software, or the lack of expertise in interpreting such graphs. Indeed, it is for both reasons that I do not use visualization methods often in my research. Previously, I was too involved in data mining alone. Now that I have broadened my approach to include data exploration, it is necessary for me to learn how to visualize data properly.

A search on the internet shows that some visualization software is available. I have selected three packages, VisuMap, OmniViz and GGobi, to explore in more detail.

General regression neural network (GRNN)

Sunday, July 6th, 2008

GRNN is a modification of the probabilistic neural network (PNN) for regression problems (Specht 1991). For GRNN, the predicted value of the biological property is the most probable value, which is given by

grnnpredy.jpg

where f(x,y) is the joint density and can be estimated by using Parzen’s nonparametric estimator. Substituting Parzen’s nonparametric estimator for f(x,y) and performing the integrations leads to the fundamental equation of GRNN.

grnnpredyfinal.jpg

where

grnndxx.jpg
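For readers who cannot see the equation images, the three expressions can be written out as follows. This is the standard form from Specht (1991). The notation is an assumption and may differ from the images: n training compounds with descriptor vectors x_i and observed values y_i, and a smoothing parameter sigma. In order, the blocks are the conditional mean, the fundamental equation of GRNN, and the squared distance term.

```latex
\hat{y}(\mathbf{x}) = E[y \mid \mathbf{x}]
  = \frac{\int_{-\infty}^{\infty} y \, f(\mathbf{x}, y) \, dy}
         {\int_{-\infty}^{\infty} f(\mathbf{x}, y) \, dy}

\hat{y}(\mathbf{x})
  = \frac{\sum_{i=1}^{n} y_i \exp\left(-\frac{D_i^2}{2\sigma^2}\right)}
         {\sum_{i=1}^{n} \exp\left(-\frac{D_i^2}{2\sigma^2}\right)}

D_i^2 = (\mathbf{x} - \mathbf{x}_i)^{T} (\mathbf{x} - \mathbf{x}_i)
```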

The network architecture of a GRNN is similar to that of a PNN, except that its summation layer has two neurons, which calculate the numerator and the denominator. The single neuron in the output layer then divides the output of the numerator neuron by that of the denominator neuron to obtain the predicted biological value of the given compound.
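The numerator/denominator structure described above can be sketched in a few lines of plain Python. This is a minimal illustration following Specht's Gaussian-kernel form, not a tuned implementation. The function name, the smoothing parameter value and the toy training data are assumptions made for the example.

```python
import math

def grnn_predict(x, train_x, train_y, sigma=0.5):
    """Predict y for descriptor vector x with a Gaussian-kernel GRNN."""
    numerator = 0.0    # first summation neuron: sum of y_i * weight_i
    denominator = 0.0  # second summation neuron: sum of weight_i
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))    # squared distance
        weight = math.exp(-d2 / (2.0 * sigma ** 2))      # pattern-layer output
        numerator += yi * weight
        denominator += weight
    if denominator == 0.0:
        raise ValueError("all kernel weights underflowed; try a larger sigma")
    return numerator / denominator                       # output neuron: division

# Toy training set (made up): one descriptor, y = x squared.
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [0.0, 1.0, 4.0, 9.0]

pred_narrow = grnn_predict([1.0], train_x, train_y, sigma=0.1)
pred_wide = grnn_predict([1.0], train_x, train_y, sigma=100.0)
print(pred_narrow, pred_wide)
```

A small sigma makes the prediction follow the nearest training compound, while a very large sigma pulls every prediction towards the mean of the training values. In practice sigma is the one parameter that has to be tuned.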

References

  • Specht DF (1991). A general regression neural network. IEEE Transactions on Neural Networks 2(6): 568-576.

Support vector regression (SVR)

Wednesday, July 2nd, 2008

The theoretical background of SVR is similar to that of SVM (Smola and Schölkopf 1998; Vapnik 1995; Yuan and Huang 2004). In SVR, the kernel function is used to map the vectors into a higher-dimensional feature space, and linear regression is then conducted in this space. The optimal regression function can be represented by:

svrpredy.jpg

where y represents the predicted value of a biological property, and the coefficients alpha, alpha* and the bias b are determined by maximizing the following Lagrangian expression:

svrlangrangian.jpg

under the following conditions:

svralpha.jpg
svrsum.jpg
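To make the role of alpha, alpha* and b concrete, the sketch below evaluates an SVR regression function of the usual kernel-expansion form and checks the two dual constraints (the box constraint on each coefficient and the zero-sum constraint). The sign convention (alpha minus alpha*) follows the tutorial by Smola and Schölkopf and may be the opposite of the one in the images above. The RBF kernel, the function names and all numeric values are assumptions made for the example; the coefficients are made up, not the result of an actual optimization.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; one common choice among many."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def satisfies_dual_constraints(alpha, alpha_star, C, tol=1e-9):
    """Check 0 <= alpha_i, alpha_i* <= C and sum(alpha_i - alpha_i*) = 0."""
    in_box = all(-tol <= a <= C + tol for a in list(alpha) + list(alpha_star))
    balanced = abs(sum(a - s for a, s in zip(alpha, alpha_star))) <= tol
    return in_box and balanced

def svr_predict(x, support_x, alpha, alpha_star, b, kernel=rbf_kernel):
    """f(x) = sum_i (alpha_i - alpha_i*) * K(x_i, x) + b."""
    return sum((a - s) * kernel(xi, x)
               for xi, a, s in zip(support_x, alpha, alpha_star)) + b

# Made-up dual solution with two support vectors.
support_x = [[0.0], [2.0]]
alpha = [0.8, 0.0]
alpha_star = [0.0, 0.8]
bias = 1.5

feasible = satisfies_dual_constraints(alpha, alpha_star, C=1.0)
y_mid = svr_predict([1.0], support_x, alpha, alpha_star, bias)
y_left = svr_predict([0.0], support_x, alpha, alpha_star, bias)
print(feasible, y_mid, y_left)
```

At the midpoint between the two support vectors their contributions cancel, so the prediction equals the bias. Closer to the first support vector the prediction rises above it.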

References

  • Smola AJ and Schölkopf B (1998). A tutorial on support vector regression. NeuroCOLT2 Technical Report NC2-TR-1998-030.
  • Vapnik VN (1995). The nature of statistical learning theory. New York, Springer.
  • Yuan Z and Huang BX (2004). Prediction of protein accessible surface areas by support vector regression. Proteins 57(3): 558-564.
