Comments on: VisuMap - Part 2
http://voyagememoirs.com/pharmine/2008/08/29/visumap-part-2/
Data mining in Pharmacy
Wed, 08 Sep 2010 09:01:53 +0000

By: James X. Li, Fri, 05 Sep 2008 01:26:25 +0000
http://voyagememoirs.com/pharmine/2008/08/29/visumap-part-2/#comment-382

Thanks for posting the dataset. After some examination, we found a defect in the implementation of the Dice dissimilarity metric, which is available as a free plugin module for VisuMap.

The defect has since been corrected, and the new version is ready for download on our web site. With the corrected Dice metric, the RPM method produces maps similar to those obtained with the other metrics.

It is not too surprising that the differences between those binary metrics are not apparent to the human eye. But those fine differences could be significant for automated search algorithms.
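
As a rough illustration of why the maps can look alike, here is a minimal sketch of the Jaccard and Dice dissimilarities on binary fingerprints. It is not the VisuMap plugin code; the function names and the toy fingerprints are made up for this example. For binary vectors the two are tied by d_Dice = d_Jaccard / (2 - d_Jaccard), so they order pairs of compounds the same way, although the numeric values differ.

```python
import numpy as np

def jaccard_dissimilarity(a, b):
    """1 - |A and B| / |A or B| for two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:  # both fingerprints empty: treat them as identical
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def dice_dissimilarity(a, b):
    """1 - 2|A and B| / (|A| + |B|) for two binary fingerprint vectors."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:  # both fingerprints empty: treat them as identical
        return 0.0
    return 1.0 - 2.0 * np.logical_and(a, b).sum() / total

# Toy fingerprints, invented for illustration (not taken from the dataset).
fp1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp2 = [1, 0, 0, 1, 1, 0, 1, 0]

dj = jaccard_dissimilarity(fp1, fp2)
dd = dice_dissimilarity(fp1, fp2)
print(dj, dd, dj / (2.0 - dj))  # the last two values agree
```

Because one value is a monotone function of the other, a method that depends only on the ordering of distances sees no difference between the two metrics, while anything that uses an absolute cutoff on the distance value may behave differently.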

The similarity metric as a whole depends on how you calculate the fingerprints, and that in turn depends on how you select the fragments used to produce them. The process of finding an appropriate similarity (or dissimilarity) metric is domain and problem specific. VisuMap can also offer some help in this regard. Please see my blog (http://jamesxli.blogspot.com/2008/09/on-similarity-metrics-for-chemical.html) for more comments.

By: Yap Chun Wei, Tue, 02 Sep 2008 08:58:53 +0000
http://voyagememoirs.com/pharmine/2008/08/29/visumap-part-2/#comment-324

Thank you so much for the useful information. I will experiment more and read more of the literature.

I have added the link to the dataset to the post (in the first paragraph). Ids 1 to 72 are cephalosporins, 73 to 111 are fluoroquinolones and 112 to 170 are penicillins.

I have experimented with t-SNE. The results are pretty similar to those of the Sammon and SMACOF methods for the three distance metrics.

By: James X. Li, Sat, 30 Aug 2008 15:37:30 +0000
http://voyagememoirs.com/pharmine/2008/08/29/visumap-part-2/#comment-285

I am a little surprised by the disappointing result of the Dice metric with RPM, as this metric is quite similar to the Jaccard metric. Would it be possible to release your dataset so that we can take a closer look? If you need to keep the data confidential, you can replace the labels and names of the data points with anonymous strings (with the table editor).

If cluster structure is your main interest, you should also try the t-SNE mapping method. This method is the latest addition to VisuMap and preserves cluster structure very well.

In general, when I explore a new dataset I first try the PCA method. PCA is the safest mapping algorithm; it does not actually do any data processing except rotating and shifting the coordinate system and then projecting onto the 3 coordinates with the most variance. PCA is less powerful than non-linear mapping algorithms, since it does not do any unfolding, twisting, segmentation etc. But PCA often provides good results in practice.

When applying the PCA method, you should check the eigenvalues of the principal components (via PCA Projection>PCA Analyzer>PCA Details). Larger eigenvalues mean more variance (and more information). If you see more than 3 relatively large eigenvalues, you should be careful with the results of the PCA map: since it only displays the projection onto 3 components, some relevant information may be invisible in the map.
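
Outside VisuMap, the same eigenvalue check can be sketched with NumPy. This is only an illustration under my own assumptions: the helper names are hypothetical, the data is synthetic, and "relatively large" is left for you to judge from the printed spectrum.

```python
import numpy as np

def pca_eigenvalues(X):
    """Eigenvalues of the covariance matrix of X (rows = data points), largest first."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)              # shift: center the data at the origin
    cov = np.cov(Xc, rowvar=False)       # columns are the variables
    eigvals = np.linalg.eigvalsh(cov)    # symmetric matrix, ascending order
    return eigvals[::-1]

def variance_captured(eigvals, k=3):
    """Fraction of the total variance carried by the k largest eigenvalues."""
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative rounding errors
    return eigvals[:k].sum() / eigvals.sum()

# Synthetic data, invented for illustration: 170 points in 10 dimensions,
# five of which have clearly larger spread than the rest.
rng = np.random.default_rng(0)
scales = np.array([5, 4, 4, 3, 3, 0.1, 0.1, 0.1, 0.1, 0.1])
X = rng.normal(size=(170, 10)) * scales

eigvals = pca_eigenvalues(X)
frac = variance_captured(eigvals, k=3)
print(np.round(eigvals, 2))
print(f"variance shown by a 3-component map: {frac:.0%}")
```

In this synthetic case five eigenvalues are large, so a 3-component map leaves a noticeable share of the variance out of view, which is the situation the advice above warns about.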

Also note that with VisuMap you can select a cluster of data points and then apply PCA to the selected data (via the context menu “Show PCA View”). Thus, in addition to the global structure, you can easily explore the detailed relationships within clusters.

If PCA does not deliver satisfactory results (e.g. the map has no clusters, no clear geometrical shapes or density gradients, or does not fit expectations and looks like a random cloud), I would then try the Sammon or SMACOF method, and then more powerful methods such as CCA, t-SNE or RPM. The Shepard diagram tool (view>Shepard Diagram) helps assess how well a map reflects the original distance information.
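
The idea behind a Shepard diagram can be sketched as follows: for every pair of points, compare the distance in the original data with the distance in the map. This is not VisuMap's implementation; the function names and the synthetic data are hypothetical, and Euclidean distance is assumed on both sides (for fingerprints you would use the chosen dissimilarity for the original distances instead).

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distances for all unordered pairs of rows in P, as a flat array."""
    P = np.asarray(P, dtype=float)
    diff = P[:, None, :] - P[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(P), k=1)   # each pair once, no self-pairs
    return D[upper]

def shepard_pairs(X_original, X_map):
    """Return (original distances, map distances), aligned pair by pair."""
    return pairwise_distances(X_original), pairwise_distances(X_map)

# Synthetic example, invented for illustration: the "map" simply keeps
# the first two coordinates of 8-dimensional data.
rng = np.random.default_rng(1)
X_original = rng.normal(size=(50, 8))
X_map = X_original[:, :2]

d_orig, d_map = shepard_pairs(X_original, X_map)
r = np.corrcoef(d_orig, d_map)[0, 1]
print(f"correlation between original and map distances: {r:.2f}")
```

Plotting d_map against d_orig gives the diagram itself; the closer the points lie to a monotone curve, the better the map preserves the original distance information.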

The selection of a distance metric and, more generally, the preparation of data (filtering, transformation, cleansing etc.) are very domain-specific tasks. Knowledge of the different groups of distance metrics (dissimilarity distances) can offer some guidance when searching the literature.
