RapidMiner (version Free 4.1beta2, licensed under GPL version 3)
According to its official website, RapidMiner “is the world-wide leading open-source data mining solution due to the combination of its leading-edge technologies and its functional range. Applications of RapidMiner cover a wide range of real-world data mining tasks”. RapidMiner “is available in different flavours: the open-source version licensed under the GPL which can be used by everyone for free, another free version including an improved graphical user interface, and a proprietary version which can be used by commercial developers where the open-source license does not suit their needs”.
If you install the current version of RapidMiner with all its optional plugins, you get a total of 436 operators (+127 from Weka), distributed as follows:
- Core: 9
- IO: 75
- Learner: 65 (+119 Weka’s)
- Meta: 21
- OLAP: 3
- Other: 11
- Postprocessing: 6
- Preprocessing: 205 (+8 Weka’s)
- Validation: 31
- Visualization: 10
This is an enormous collection of operators. However, since I am interested in using it for QSAR experiments, I will only examine the operators that are relevant. RapidMiner can read data from quite a number of sources, e.g. ARFF, CSV, Excel, sparse format, SPSS and databases, so most users should not have any problems opening their existing data files in RapidMiner.
RapidMiner has a few descriptor selection operators. However, it does not seem to include some common ones such as stepwise regression. Also, the filter and wrapper descriptor selection methods seem to be mixed together rather than grouped separately. I will explore this in more detail when I start the testing proper.
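For readers unfamiliar with the filter/wrapper distinction, here is a minimal Python sketch. It is not RapidMiner code and does not mirror any of its operators; the function names (`filter_select`, `wrapper_select`, `sum_model_score`) and the toy scoring model are all made up for illustration. A filter ranks descriptors by a statistic computed without building a model, while a wrapper repeatedly builds and scores a model on candidate descriptor subsets.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def filter_select(columns, target, top_n):
    """Filter method: rank descriptors by |correlation| with the target; no model is built."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:top_n]


def wrapper_select(columns, target, score, max_features):
    """Wrapper method: greedy forward selection driven by a model score (higher is better)."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_features:
        candidates = [name for name in columns if name not in selected]
        if not candidates:
            break
        # Score the current subset extended by each remaining descriptor in turn.
        scored = [
            (score({m: columns[m] for m in selected + [name]}, target), name)
            for name in candidates
        ]
        top_score, top_name = max(scored)
        if top_score <= best_score:
            break  # no remaining descriptor improves the model
        selected.append(top_name)
        best_score = top_score
    return selected


def sum_model_score(subset, target):
    """Hypothetical toy model: predict the target as the sum of the chosen descriptors.

    Returns the negative squared error so that higher means better.
    """
    predictions = [sum(values) for values in zip(*subset.values())]
    return -sum((p - t) ** 2 for p, t in zip(predictions, target))
```

In a real QSAR study the scoring function would typically be a cross-validated error of the actual learner, which is what makes wrappers far more expensive than filters.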
Currently, RapidMiner contains 11 (+21 Weka’s) algorithms for developing regression models and 30 (+69 Weka’s) algorithms for constructing classification models.
RapidMiner contains a few validation methods like cross-validation, boosting and bagging.
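As a reminder of what cross-validation does, here is a minimal k-fold sketch in plain Python. It is an illustration only, not how RapidMiner implements its validation operators; `k_fold_indices`, `mean_predictor` and `cross_validate` are hypothetical names, and the "model" is a deliberately trivial baseline that always predicts the training mean.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_predictor(train_y):
    """Hypothetical baseline model: always predicts the mean of the training targets."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


def cross_validate(xs, ys, k=5, seed=0):
    """Return the mean squared error averaged over the k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        # Train only on samples outside the held-out fold.
        train_y = [ys[i] for i in range(len(ys)) if i not in held]
        model = mean_predictor(train_y)
        # Evaluate only on the held-out fold.
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)
```

Each sample is used for testing exactly once and for training in the remaining k-1 rounds, so the averaged error estimates how the model performs on compounds it has not seen.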
Overall, my first impression of RapidMiner is that it has a very comprehensive set of tools for a full QSAR experiment. However, the large number of tools can make it difficult for an inexperienced data miner to decide which ones to use. The learning curve also seems quite steep, as it is not easy to visualize the experimental workflow in the current graphical user interface. Fortunately, RapidMiner has a useful wizard that automatically constructs the experimental workflow for a few common data mining scenarios. In addition, it comes with a comprehensive set of sample files which can help users learn how to construct the workflow for different parts of the data mining process.