KNIME is very easy to use and is good for preprocessing datasets and descriptors. Personally, of the packages reviewed, I enjoy using KNIME the most. It is a pity that its model building and validation are weaker. Hopefully the next major version of KNIME will address these issues.
RapidMiner has a very large set of operators, which makes it very suitable for comparing different machine learning/statistical methods. It is also very good for model building and validation. However, the learning curve for the software is rather steep.
Weka (KnowledgeFlow) is somewhat in between KNIME and RapidMiner. Like RapidMiner, it has quite a large number of components and, like KNIME, it is relatively simple to use. However, it cannot perform all the functions that are available in RapidMiner, and its graphical user interface is not as friendly as KNIME's.
TANAGRA is similar to RapidMiner in terms of the layout for representing an experimental procedure. However, it has significantly fewer operators than RapidMiner. My initial impression was that it should be quite good for performing QSAR experiments. However, after using it, I found that it lacks several important features.
Orange is similar to Weka (KnowledgeFlow) in terms of layout. However, like TANAGRA, it seems to be lacking in some important features for QSAR experiments.
A feature missing from all of these packages is the ability to perform parallel computing, either through job distribution among different computers in the network or through the use of all the cores in multi-core CPUs.
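To make the multi-core case concrete, here is a minimal sketch in plain Python of what distributing independent jobs (for example, one job per cross-validation fold) across cores could look like. It is not taken from any of the five packages; `evaluate_fold` and `run_jobs` are hypothetical names, and the error calculation is only a stand-in for real model building.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def evaluate_fold(fold):
    """Hypothetical stand-in for one unit of work (e.g. scoring one fold).

    Returns the mean absolute error between observed and predicted values.
    """
    observed, predicted = fold
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def run_jobs(folds, executor_cls=ProcessPoolExecutor, workers=None):
    """Evaluate every fold in a pool of workers; results keep the input order.

    ProcessPoolExecutor spreads CPU-bound jobs over several cores.
    ThreadPoolExecutor can be passed instead for I/O-bound jobs.
    """
    workers = workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(evaluate_fold, folds))
```

When using the default process pool from a script, the call to `run_jobs` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.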
Table 1 compares the five software packages on procedures that are widely used in QSAR experiments. The best package appears to be RapidMiner. At first glance, Weka seems redundant since RapidMiner has incorporated most of its algorithms. However, it still contains some algorithms, especially in the area of descriptor selection, which are not available in the other packages. Although TANAGRA and Orange are the weakest of the five, they do have their own merits. For instance, TANAGRA has an interesting collection of statistical tests while Orange has some interesting prototypes like the MeSH Term Browser. Personally, I will invest my time in learning KNIME, RapidMiner, and Weka well, and will use these three packages in my future research work.
Table 1: Comparison of the five software packages for performing procedures that are widely used in QSAR experiments.
| Procedure | KNIME | RapidMiner | Weka | TANAGRA | Orange |
| --- | --- | --- | --- | --- | --- |
| Partitioning of dataset into training and testing sets | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) | Pass (but limited partitioning methods) |
| Descriptor scaling | Pass | Pass | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (cannot save parameters for scaling to apply to future datasets) | Fail (no scaling methods) |
| Descriptor selection | Fail (no wrapper methods) | Pass | Pass (but is not part of KnowledgeFlow) | Fail (wrapper methods valid for logistic regression only) | Fail (no wrapper methods) |
| Parameter optimization of machine learning/statistical methods | Fail (not automatic) | Pass | Fail (not automatic) | Fail (not automatic) | Fail (not automatic) |
| Model validation using cross-validation and/or independent validation set | Pass (but limited error measurement methods) | Pass | Pass (but cannot save model so have to rebuild model for every future dataset) | Fail (cannot validate independent validation set) | Pass (but cannot save model so have to rebuild model for every future dataset) |
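Two of the "Fail" entries for descriptor scaling in Table 1 come down to not being able to save the scaling parameters for later use. As an illustration of what that requirement means, here is a minimal sketch in plain Python, assuming simple mean/standard-deviation scaling. It is not how any of the five packages implement scaling; `fit_scaler` and `apply_scaler` are hypothetical names, and the descriptor values are made up.

```python
import json
from statistics import mean, pstdev


def fit_scaler(training_rows):
    """Learn the mean and standard deviation of each descriptor (column)."""
    columns = list(zip(*training_rows))
    return {
        "mean": [mean(col) for col in columns],
        # A constant descriptor has zero deviation; use 1.0 to avoid dividing by zero.
        "std": [pstdev(col) or 1.0 for col in columns],
    }


def apply_scaler(params, rows):
    """Scale any dataset with previously learned parameters."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["std"])]
        for row in rows
    ]


# Made-up training set: two compounds, two descriptors.
training = [[1.0, 10.0], [3.0, 10.0]]
params = fit_scaler(training)

# Persisting the parameters (here as a JSON string) is the step flagged as missing.
saved = json.dumps(params)

# Later: reload the parameters and scale a new compound the same way as the training set.
future = [[5.0, 12.0]]
scaled_future = apply_scaler(json.loads(saved), future)
```

The point of the design is that the future dataset is scaled with the training set's parameters rather than its own, so a model built on the scaled training data sees new compounds on the same scale.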
Lastly, I need to reiterate that the above comments and all the previous posts on these software packages are very subjective. They are subjective because I have a vested interest in QSAR-type modeling and also because I am not very familiar with these packages (I have never used them in any of my research projects). Thus there may be factual inaccuracies in my review (e.g. a package may in fact be able to perform a procedure that I stated it cannot). The authors of these packages, or readers who are experienced with them, are welcome to comment on these factual inaccuracies, and I will update the posts to correct them.