Archive for the ‘Review’ Category

KNIME - Part II: Partitioning of dataset into training and testing sets

Monday, March 17th, 2008
  1. Put a File Reader node (IO->Read) on the workbench and configure it to load a dataset from a file.
  2. Put a Partitioning node (Data Manipulation->Row) on the workbench and connect the output port of the File Reader node to its input port.
    • Configure it by choosing Relative and setting it to 80%.
    • Ensure the Draw randomly box is checked.
  3. Put two CSV Writer nodes (IO->Write) on the workbench. Connect the first output port of the Partitioning node to the input port of the first CSV Writer and configure it to save the first set (the training set) to a file. Connect the second output port of the Partitioning node to the input port of the second CSV Writer and configure it to save the second set (the testing set) to a file.
  4. Execute all nodes.
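
For readers who want to check the result outside KNIME, the same 80/20 random split can be sketched in a few lines of Python. This is only an illustration of the idea, not what the Partitioning node does internally: the function names, the assumption that the CSV file has a header row, and the rounding of the training set size are all choices made for this sketch.

```python
import csv
import random


def partition_rows(rows, train_fraction=0.8, seed=None):
    """Randomly split rows into (training, testing) lists."""
    rng = random.Random(seed)      # a fixed seed makes the split repeatable
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def partition_csv(in_path, train_path, test_path, train_fraction=0.8, seed=None):
    """Read a CSV file with a header row and write training/testing CSV files."""
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)      # assumption: first line holds column names
        rows = list(reader)
    train, test = partition_rows(rows, train_fraction, seed)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train), len(test)
```

Calling partition_csv("dataset.csv", "train.csv", "test.csv", seed=42) with hypothetical file names would write the two files and return the number of rows in each.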

As can be seen from the above procedure, it is very easy to partition a dataset randomly into a training set and a testing set. However, KNIME does not seem to contain other algorithms, such as the Kennard and Stone algorithm, for partitioning datasets.
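
For comparison, here is a minimal sketch of the Kennard and Stone selection rule as it is commonly described: start with the two compounds that are farthest apart in descriptor space, then repeatedly add the compound whose distance to its nearest already-selected compound is largest. The function name, the use of Euclidean distance, and the pure-Python distance matrix are assumptions made for illustration. In practice the descriptors would normally be scaled first.

```python
import math


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Full pairwise Euclidean distance matrix (fine for small datasets).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Start with the two points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = {i: min(dist[i][first], dist[i][second]) for i in remaining}
    while len(selected) < n_select:
        # Pick the remaining point that is farthest from the selected set.
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist[i][nxt])
    return selected
```

The returned indices would form the training set and the rest the testing set.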

KNIME - Part I: Overview

Saturday, March 15th, 2008

KNIME - Konstanz Information Miner (version 1.3.3)

From their official website, “KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models”. KNIME uses a non-profit open source license which “allows KNIME to be downloaded, distributed, and used freely as long as the software or its use is not distributed for profit”.

If you install the current version of KNIME with all its optional plugins, you will have a total of 189 nodes, distributed as follows:

  • IO: 11
  • Database: 2
  • Data manipulation: 36
  • Data views: 21
  • Statistics: 4
  • Machines: 28
  • Chemistry: 22
  • Meta: 7
  • Misc: 3
  • Weka: 47
  • Python: 3
  • R: 4
  • Reporting: 2

However, since I am interested in using it for QSAR experiments, I will only examine those nodes that are relevant. Basically, KNIME can only read data from three sources: ARFF files (which are Weka files), delimited text files (which include csv files), and databases. There are no nodes for reading SVMlight files, LIBSVM files, or Microsoft Excel files. The lack of support for Microsoft Excel files is no big deal since you can easily convert them to csv format using Microsoft Excel. However, the lack of support for SVMlight and LIBSVM files will inconvenience users who are already using these two popular support vector machine packages.
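
One possible workaround is to convert such files to csv before loading them with the File Reader node. The sketch below assumes the usual sparse layout of these formats, where each line is a label followed by index:value pairs with 1-based indices, and fills missing features with 0. The function names and the column names (label, x1, x2, ...) are made up for this example.

```python
import csv


def parse_libsvm_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    line = line.split("#", 1)[0].strip()   # SVMlight allows trailing comments
    if not line:
        return None                        # blank or comment-only line
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":", 1)
        if index == "qid":                 # SVMlight ranking field, not a feature
            continue
        features[int(index)] = float(value)
    return label, features


def libsvm_to_csv(in_path, out_path):
    """Write a dense CSV file from a sparse LIBSVM/SVMlight-style file."""
    with open(in_path) as f:
        records = [r for r in map(parse_libsvm_line, f) if r is not None]
    # The highest feature index seen determines the number of columns.
    n_features = max((max(feats, default=0) for _, feats in records), default=0)
    columns = range(1, n_features + 1)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label"] + [f"x{i}" for i in columns])
        for label, feats in records:
            # Features absent from a sparse line are written as 0.0.
            writer.writerow([label] + [feats.get(i, 0.0) for i in columns])
```

Note that a dense file can become very large if the original data has many features that are mostly zero.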

At first sight, KNIME does not seem to have any descriptor selection capability. I will explore this in more detail when I start the testing proper.

Currently, KNIME contains 3 algorithms for developing regression models and 8 algorithms for constructing classification models. I did not count those algorithms that are under the Weka branch because those algorithms are just wrappers over algorithms that are present in Weka and do not have the ability to load and save developed models.

KNIME contains a Cross validation meta-node. Though the website states that it also has boosting and bagging nodes, they were not present in the downloadable version.

Overall, my first impression of KNIME is that it has a very good graphical user interface and seems easy to use. However, it may not contain sufficient tools for a full QSAR experiment.

Data mining tools comparison methodology

Thursday, March 13th, 2008

As mentioned in my previous post, I will explore Weka, RapidMiner, and KNIME in more detail. A reader has suggested that I also look at TANAGRA, so I will try to give a comparison between these four tools. However, I will not be doing the usual side-by-side comparison, and I will not be going into all the features of these tools. Instead, I will gauge the ease with which each tool can be used for QSAR experiments. I will evaluate the tools using a few procedures that are widely used in QSAR experiments. These procedures have been described in my previous posts and are:

  1. Partitioning of dataset into training and testing sets.
  2. Descriptor scaling.
  3. Descriptor selection.
  4. Parameter optimization of machine learning/statistical methods.
  5. Model validation using cross-validation and/or independent validation set.

As I am not very familiar with these tools, my comments on these tools will be highly subjective.

