Archive for November, 2008

Health Discovery Corporation holds the patents to SVM and RFE

Thursday, November 20th, 2008

While doing some literature searching and reading, I discovered that SVM and RFE are actually patented technologies. I am not sure what the implications of this are for researchers, but I don’t like the sound of it. Maybe it is time to look into other machine learning methods and hold off using SVM for the next 20 years.

Y-randomization in KNIME

Monday, November 10th, 2008

Previously, I wrote about how to perform y-randomization in Rapidminer. The same basic concepts can be used to do y-randomization in KNIME. Unlike the previous post, where I detailed the steps for an entire y-randomization experiment, this post only shows how to perform a single y-randomization on a dataset. The basic workflow is shown below.

workflow1.jpg

“Column Filter” is used to remove all variables except the label. The result is then passed to “Shuffle” to randomize the labels. An increasing row id number is then added to both the randomized-label dataset and the original dataset using “Math Formula”.

mathformula.jpg

“Row ID” is then used to replace the original row ids in both the original and the randomized-label datasets with the newly created row id.

rowid.jpg

Finally, “Joiner” is used to merge the two datasets together, creating a randomized dataset.

joiner.jpg
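
For readers without KNIME at hand, the same four steps can be sketched in plain Python. This is only an illustration of the idea, not what the KNIME nodes do internally; the function name `y_randomize`, the `label` column name and the toy rows are all made up for the example.

```python
import random


def y_randomize(rows, label_key, seed=None):
    """Return a copy of rows with the label column shuffled across rows."""
    rng = random.Random(seed)

    # "Column Filter": keep only the label column.
    labels = [row[label_key] for row in rows]

    # "Shuffle": randomize the order of the labels.
    rng.shuffle(labels)

    # "Math Formula" + "Row ID": give both tables the same increasing row id.
    labels_by_id = dict(enumerate(labels))
    features_by_id = {
        i: {k: v for k, v in row.items() if k != label_key}
        for i, row in enumerate(rows)
    }

    # "Joiner": merge the two tables on the row id.
    return [
        {**features_by_id[i], label_key: labels_by_id[i]}
        for i in sorted(features_by_id)
    ]


# Hypothetical toy dataset, only for illustration.
data = [
    {"x1": 0.1, "x2": 1.0, "label": "active"},
    {"x1": 0.4, "x2": 0.8, "label": "active"},
    {"x1": 2.3, "x2": 0.2, "label": "inactive"},
    {"x1": 2.9, "x2": 0.1, "label": "inactive"},
]
randomized = y_randomize(data, "label", seed=42)
```

The descriptors stay attached to their original rows; only the labels move, so the label distribution of the dataset is unchanged.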

Y-randomization in Rapidminer

Tuesday, November 4th, 2008

I have previously mentioned Y-randomization as one of the methods for checking whether a prediction model is overfitted. Recently, someone asked me about the Y-randomization implemented in my software, PHAKISO. PHAKISO was created during my PhD studies and, unfortunately, it has no automated way to repeat the Y-randomization experiment n times. I always used the associated library, YMLL, to write a simple program to do the job, and thus never implemented such a feature in PHAKISO.

Since I am using Rapidminer for my research now, I thought it would be easy to create a Y-randomization process in it. Unfortunately, Rapidminer does not have a Y-randomization operator. However, thanks to the solutions provided by the helpful moderators in the Rapidminer forum, I finally know how to do it in Rapidminer and, in the process, learnt more about Rapidminer.

The basic process is shown in the following figure. operatortree.jpg

The basic idea in Y-randomization is to randomize the label of the dataset. So in Rapidminer, you would load a dataset, create a copy of it and remove all attributes, except the label, from the copy. Then randomly permute the examples in the copy and tag all the examples with a unique id. Select the original dataset, tag all its examples with a unique id and do a join between the original dataset and its copy, using the id as the key for joining. If you do it in the correct way, the labels in the original dataset will be skipped during the joining and the permuted labels in the copy will be used for the joined dataset. To perform the entire Y-randomization experiment automatically, you will need to use the IteratingPerformanceAverage operator chain to enclose the Y-randomization portion and add a validation procedure after the Y-randomization portion, as shown in the figure.

The complete XML process is as follows:
<operator name="Root" class="Process" expanded="yes">
    <parameter key="random_seed" value="-1"/>
    <operator name="CSVExampleSource" class="CSVExampleSource">
    </operator>
    <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
        <parameter key="iterations" value="100"/>
        <operator name="IOMultiplier" class="IOMultiplier">
            <parameter key="io_object" value="ExampleSet"/>
        </operator>
        <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
            <parameter key="attribute_name_regex" value="label"/>
            <parameter key="condition_class" value="attribute_name_filter"/>
            <parameter key="keep_subset_only" value="true"/>
            <operator name="Permutation" class="Permutation">
            </operator>
            <operator name="IdTagging" class="IdTagging">
            </operator>
        </operator>
        <operator name="IOSelector" class="IOSelector">
            <parameter key="io_object" value="ExampleSet"/>
            <parameter key="select_which" value="2"/>
        </operator>
        <operator name="IdTagging (2)" class="IdTagging">
        </operator>
        <operator name="ExampleSetJoin" class="ExampleSetJoin">
        </operator>
        <operator name="XValidation" class="XValidation" expanded="yes">
            <parameter key="leave_one_out" value="true"/>
            <operator name="NearestNeighbors" class="NearestNeighbors">
                <parameter key="k" value="3"/>
            </operator>
            <operator name="OperatorChain" class="OperatorChain" expanded="yes">
                <operator name="ModelApplier" class="ModelApplier">
                    <list key="application_parameters">
                    </list>
                </operator>
                <operator name="ClassificationPerformance" class="ClassificationPerformance">
                    <parameter key="accuracy" value="true"/>
                    <list key="class_weights">
                    </list>
                </operator>
            </operator>
        </operator>
    </operator>
</operator>

Just enter your dataset file in CSVExampleSource and change the method in XValidation from NearestNeighbors to your desired modeling method.
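
If you want to check the logic outside Rapidminer, the whole experiment can be roughly sketched in plain Python. This is not a translation of the operators above, just the same idea under some assumptions: a hand-written 3-nearest-neighbour classifier with Euclidean distance, leave-one-out validation, and a made-up two-cluster toy dataset. All function names here are hypothetical.

```python
import random


def knn_predict(train, query, k=3):
    """Majority vote among the k training examples nearest to query."""
    nearest = sorted(
        train,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], query)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def loo_accuracy(features, labels, k=3):
    """Leave-one-out accuracy of the k-NN classifier."""
    correct = 0
    for i in range(len(features)):
        # Train on every example except the i-th one.
        train = [
            (f, l)
            for j, (f, l) in enumerate(zip(features, labels))
            if j != i
        ]
        if knn_predict(train, features[i], k) == labels[i]:
            correct += 1
    return correct / len(features)


def y_randomization_experiment(features, labels, iterations=100, k=3, seed=0):
    """Average leave-one-out accuracy over repeated label shuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        shuffled = labels[:]      # copy, so the true labels stay intact
        rng.shuffle(shuffled)     # break the link between features and labels
        scores.append(loo_accuracy(features, shuffled, k))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated clusters.
features = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5),
]
labels = ["a"] * 5 + ["b"] * 5

true_accuracy = loo_accuracy(features, labels)
randomized_accuracy = y_randomization_experiment(features, labels)
```

A model that has learnt something real should score clearly higher on the true labels than on the averaged randomized runs; if the two numbers are close, the model is probably fitting noise.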

