KNIME - Part I: Overview
KNIME - Konstanz Information Miner (version 1.3.3)
From their official website, “KNIME is a modular data exploration platform that enables the user to visually create data flows (often referred to as pipelines), selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models”. KNIME uses a non-profit open source license which “allows KNIME to be downloaded, distributed, and used freely as long as the software or its use is not distributed per profit”.
If you install the current version of KNIME with all its optional plugins, you will have a total of 189 nodes, distributed as follows:
- IO: 11
- Database: 2
- Data manipulation: 36
- Data views: 21
- Statistics: 4
- Machines: 28
- Chemistry: 22
- Meta: 7
- Misc: 3
- Weka: 47
- Python: 3
- R: 4
- Reporting: 2
However, since I am interested in using it for QSAR experiments, I will only examine those nodes that are relevant. Basically, KNIME can only read data from three sources: ARFF files (the Weka format), delimited text files (which include csv files), and databases. There are no nodes for reading SVMlight files, LIBSVM files, or Microsoft Excel files. The lack of support for Microsoft Excel files is no big deal, since you can easily convert them to csv format using Microsoft Excel itself. However, the lack of support for SVMlight and LIBSVM files will inconvenience users who already use these two popular support vector machine packages.
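For anyone stuck with SVMlight or LIBSVM files, one workaround is to convert them to csv before loading them into KNIME. The following is only a minimal Python sketch, not anything that ships with KNIME. It assumes the usual sparse layout of one "<label> <index>:<value> ..." record per line with 1-based indices, fills missing features with 0.0, and drops SVMlight's trailing "# comment" and "qid:" fields. The function names and the f1..fN column names are my own invention.

```python
import csv
import io


def parse_sparse_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    line = line.split("#", 1)[0].strip()  # drop a trailing '# comment' (SVMlight)
    if not line:
        return None  # blank or comment-only line
    tokens = line.split()
    label = tokens[0]
    features = {}
    for token in tokens[1:]:
        key, value = token.split(":", 1)
        if key == "qid":  # SVMlight ranking field, not a descriptor
            continue
        features[int(key)] = float(value)
    return label, features


def sparse_to_csv(src, dst):
    """Write a dense csv (label first, then f1..fN) from sparse-format lines."""
    rows = [r for r in (parse_sparse_line(l) for l in src) if r is not None]
    # The widest feature index seen decides how many columns are written.
    width = max((max(f) for _, f in rows if f), default=0)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["label"] + [f"f{i}" for i in range(1, width + 1)])
    for label, features in rows:
        # Features absent from a sparse record are written as 0.0 (assumption).
        writer.writerow([label] + [features.get(i, 0.0) for i in range(1, width + 1)])


if __name__ == "__main__":
    # Two made-up records, just to show the call pattern.
    sample = io.StringIO("+1 1:0.5 3:2.0\n-1 2:1.5 # second compound\n")
    out = io.StringIO()
    sparse_to_csv(sample, out)
    print(out.getvalue())
```

With real files you would pass open file handles instead of the StringIO objects, and the resulting csv can then be read with the File Reader node. Note that the dense output can get very large if the original data is high-dimensional and sparse.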
At first sight, KNIME does not seem to have any descriptor selection capability. I will explore this in more detail when I start the testing proper.
Currently, KNIME contains 3 algorithms for developing regression models and 8 algorithms for constructing classification models. I did not count the nodes under the Weka branch because they are just wrappers around algorithms that already exist in Weka, and they cannot load or save developed models.
KNIME contains a Cross validation meta-node. Although the website states that it also has boosting and bagging nodes, they were not present in the downloadable version.
Overall, my first impression of KNIME is that it has a very good graphical user interface and seems easy to use. However, it may not contain sufficient tools for a full QSAR experiment.
March 28th, 2008 at 8:26 pm
Hi,
You say “Basically, KNIME can only read data from three sources: ARFF files (which are Weka files), text-delimited files (which include csv files), and from a database” … “The lack of support for Microsoft Excel files is no big deal since you can easily convert them to csv format using Microsoft Excel.”
In fact you don’t need to convert from Excel, since the file reader node will handle Excel files and automatically detect delimiters etc. Also, as far as output is concerned, there is an xls writer.
From the QSAR point of view, a distinct advantage of KNIME is that it allows direct reading of .sdf files via the SDF reader node in the Chemistry/CDK plugin.
March 31st, 2008 at 9:41 am
That is strange. I have checked KNIME again and the File Reader node is able to read ASCII files only and not native Excel files.
For output, KNIME does have an XLS writer but input is a more important criterion than output to me personally.
Yes, the ability to read SDF files directly is definitely a distinct advantage of KNIME. However, I did not highlight this as I assumed that most QSAR modellers will use their favorite descriptor calculation software to calculate the descriptors before the modelling process. Thus they will not need to read molecular files directly.
April 1st, 2008 at 10:49 pm
Strange indeed. I was convinced that I had used .xls files on my Linux machine at home, but when I checked on my Windows machine in the office, you are quite right. Maybe I was confused by the Excel-like icon associated with .csv files from Excel. I only mentioned the sdf file handling capability because KNIME enables calculation of CDK descriptors from sdf input followed by the CDK conversion node. Regrettably, KNIME has no parameter reduction routines though, as you say.
April 2nd, 2008 at 9:39 pm
Personally, I feel that KNIME has the potential to be a very good tool for QSAR. I like the GUI and the workflow is rather intuitive. Hopefully, the next major release will bring a lot of advances to this software.
June 10th, 2008 at 1:16 pm
We use KNIME for chemoinformatics applications. However, I am interested in using it for QSAR predictions as well, and I am looking for suitable material on this. I would be thankful if anybody could point me to some resources.
March 31st, 2009 at 12:20 am
You can use RapidMiner ( http://www.RapidMiner.com/ ), another open source data mining tool, to convert Microsoft Excel sheets, SVM^light or LibSVM files to ARFF or CSV format or into a database of your choice and then load the data into KNIME from there.
While RapidMiner is not a dedicated ETL (Extract, Transform, Load) tool, you can use it for that purpose as well.
And if you install RapidMiner anyway, you can also use it as an alternative or additional data mining tool. It’s downloadable for free and well worth a try.
Have fun,
Frank
January 2nd, 2010 at 3:07 am
In Excel, I’ve saved xls files using the ‘other’ (comma delimited) formats and then used the KNIME file reader to import them. This works fine on the KNIME version I downloaded a couple of days back.
February 9th, 2010 at 9:34 pm
I am looking for an outlier detection algorithm, open-source and written in Java, that runs on top of a database or a CSV file.
All the parameters are strings; there are no numeric parameters.
I need it for high-volume data.
Can anyone please help?
Sorry if this is not the right forum.