Data mining tools comparison - Summary

KNIME is very easy to use and is good for preprocessing datasets and descriptors. Personally, of all the tools I tried, I enjoy using KNIME the most. It is a pity that it is weaker at model building and validation. Hopefully the next major version of KNIME will address these issues.

RapidMiner has a very large set of operators, which makes it very suitable for comparing different machine learning/statistical methods. It is also very good for model building and validation. However, the learning curve for the software is rather steep.

Weka (KnowledgeFlow) is somewhere in between KNIME and RapidMiner. Like RapidMiner, it has quite a large number of components, and like KNIME, it is relatively simple to use. However, it cannot perform all the functions that are available in RapidMiner, and its graphical user interface is not as friendly as KNIME's.

TANAGRA is similar to RapidMiner in the layout it uses to represent an experimental procedure. However, it has significantly fewer operators than RapidMiner. My initial impression was that it should be quite good for performing QSAR experiments. After using it, however, it seems to lack several important features.

Orange is similar to Weka (KnowledgeFlow) in terms of layout. However, like TANAGRA, it seems to be lacking in some important features for QSAR experiments.

A feature missing from all of these tools is the ability to perform parallel computing, either by distributing jobs among different computers on a network or by using all the cores of a multi-core CPU.

Table 1 compares the five tools on procedures that are widely used in QSAR experiments. The best tool appears to be RapidMiner. At first glance, Weka seems redundant, since RapidMiner has incorporated most of its algorithms. However, it still contains some algorithms, especially in the area of descriptor selection, which are not available in the other tools. Although TANAGRA and Orange are the worst performers among the five, they do have their own merits. For instance, TANAGRA has an interesting collection of statistical tests, while Orange has some interesting prototypes such as the MeSH Term Browser. Personally, I will invest my time in learning KNIME, RapidMiner, and Weka well, and will use these three tools in my future research work.

Table 1: Comparison of the five tools for performing procedures that are widely used in QSAR experiments.

| Procedure | KNIME | RapidMiner | Weka | TANAGRA | Orange |
|---|---|---|---|---|---|
| Partitioning of dataset into training and testing sets | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) |
| Descriptor scaling | Pass | Pass | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (no scaling methods) |
| Descriptor selection | Fail (no wrapper methods) | Pass | Pass (but is not part of KnowledgeFlow) | Fail (wrapper methods valid for logistic regression only) | Fail (no wrapper methods) |
| Parameter optimization of machine learning/statistical methods | Fail (not automatic) | Pass | Fail (not automatic) | Fail (not automatic) | Fail (not automatic) |
| Model validation using cross-validation and/or independent validation set | Pass (but limited error measurement methods) | Pass | Pass (but cannot save model so have to rebuild model for every future dataset) | Fail (cannot validate independent validation set) | Pass (but cannot save model so have to rebuild model for every future dataset) |
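
To make the "save parameters for scaling" criterion in the descriptor scaling row concrete, here is a minimal Python sketch of the idea. It is not taken from any of the five tools; the function names (fit_scaler, apply_scaler), the choice of z-score scaling, and the toy descriptor values are all assumptions made for illustration. The point is only that the scaling parameters are computed once from the training set, stored, and later reused unchanged on a new dataset.

```python
import json
from statistics import mean, pstdev

def fit_scaler(rows):
    """Compute per-descriptor mean and standard deviation from the training set."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant descriptor has zero standard deviation; fall back to 1.0
    # so that applying the scaler never divides by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {"mean": means, "std": stds}

def apply_scaler(params, rows):
    """Scale rows with previously fitted parameters (no refitting)."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["std"])]
        for row in rows
    ]

# Hypothetical training set: three compounds, two descriptors.
train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
params = fit_scaler(train)

# "Saving" the parameters: serialise them so a later session can reload them.
saved = json.dumps(params)
restored = json.loads(saved)

# A future dataset is scaled with the training-set parameters, not its own.
future = [[2.0, 12.0]]
scaled_future = apply_scaler(restored, future)
```

A tool that fails this criterion would have to recompute the mean and standard deviation from each new dataset, so the same compound could end up with different scaled descriptor values depending on which dataset it arrives in.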

Lastly, I need to reiterate that the above comments and all the previous posts on these tools are very subjective. They are subjective because I have a vested interest in QSAR-type modeling and also because I am not very familiar with these tools (I have never used them in any of my research projects). Thus there may be factual inaccuracies in my review (e.g. a procedure that I stated a particular tool cannot perform may in fact be possible). The authors of these tools, or readers who are experienced with them, are welcome to comment on any factual inaccuracies, and I will update the posts accordingly.


14 Responses to “Data mining tools comparison - Summary”

  1. Ingo Mierswa Says:

    Hello,

    first of all I would like to thank you for your efforts in comparing those data mining tools. After reading your evaluation of RapidMiner we decided to start to better meet your requirements and, for example, added a new sampling operator for Kennard-Stone sampling. This new operator is available in our new release of RapidMiner 4.1, available on our web site:

    http://rapid-i.com

    We also added improved in-program documentation. All operators now show more exhaustive descriptions as a tool tip or in the operator info dialog so that the steep learning curve should be flattened at least a bit ;-)

    Thanks again, and I hope that this information is useful to you or your readers.

    Cheers,
    Ingo

  2. Yap Chun Wei Says:

    Hi Ingo,

    Congratulations on the improvements to an already excellent product. I will certainly download the new version and give it a try.

    I saw an interesting new feature in RapidMiner 4.1, which is support for multiple cores. Personally, I feel that this is a very important feature since most CPUs nowadays are multi-core. However, it seems that this feature is only available in the Enterprise edition, which is a pity. Hopefully, Rapid-I will decide to release this feature to the community edition in the future.

  3. kannan Says:

    Dear sir,
    I am interested in QSAR studies of properties like solubility, logP, etc. My interest is in comparing different classifiers. I have started using Weka KnowledgeFlow and connected the training set and test set from the split maker directly to the classifier (multilayer perceptron).
    Is this wrong?
    Also, how do I save a trained classifier?

  4. Frank Xavier Says:

    Dear Yap Chun Wei,

    thanks for your informative texts about the various data mining tools and your in-depth comparison. It gives a lot of insight into these tools and their usefulness as well as their shortcomings. It also gives the open source projects good pointers on where to improve their great software tools.

    Regarding the multi-core features, I can really recommend the RapidMiner Enterprise Edition. The speed-up is enormous. Since you are an assistant professor at the National University of Singapore, you could ask the team at Rapid-I for academic pricing. They offer an academic version of RapidMiner at a reduced price for universities, professors, research assistants, and students.

    Looking at RapidMiner, KNIME, Weka, and the R-Project, I see four enterprise-ready data mining tools and a lot of dynamism in the further development of these tools, which is good for all data miners, both in academic research and commercial applications.

    Best wishes,
    Frank

  5. Ingo Mierswa Says:

    Hi,

    it has been quite a while since I visited your web site. I just wanted to let you know that the support for multicores in RapidMiner is now part of the freely available community edition of RapidMiner - as well as all other extensions which were formerly only available for our enterprise customers.

    The second piece of information is about process handling: beginning with RapidMiner 5, there is also a process flow design available like that known from the Weka Knowledge Flow or SPSS Clementine. Now RapidMiner combines its power in modeling and validation with the ease of use of other solutions. The RapidMiner 5 Release Candidate was released at the end of last year, and maybe you want to give it another try.

    Hope you and your readers find this information interesting.

    All the best,
    Ingo

  6. bxshi.nku Says:

    Hi,
    Thanks for sharing. I've just begun to learn data mining all by myself and wanted to find a powerful experiment environment, and I have finally decided to try RapidMiner. Thanks for your advice!

    Best regards,
    bxshi

