I previously mentioned Y-randomization as one of the methods for checking whether a prediction model is overfitted. Recently, someone asked me about the Y-randomization implemented in my software, PHAKISO. PHAKISO was created during my PhD studies and, unfortunately, it has no automated way to repeat the Y-randomization experiment n times. I had always used the associated library, YMLL, to write a simple program for the job, and so never implemented the feature in PHAKISO.
Since I now use RapidMiner for my research, I thought it would be easy to create a Y-randomization process in it. Unfortunately, RapidMiner does not have a Y-randomization operator. However, with the solutions provided by the helpful moderators in the RapidMiner forum, I finally worked out how to do it and, in the process, learnt more about RapidMiner.
The basic process is shown in the following figure. 
The basic idea of Y-randomization is to randomize the labels of the dataset while leaving the other attributes untouched. So in RapidMiner, you load a dataset, create a copy of it (IOMultiplier) and remove all attributes except the label from the copy (AttributeSubsetPreprocessing). Then randomly permute the examples in the copy (Permutation) and tag each example with a unique id (IdTagging). Next, select the original dataset (IOSelector), tag each of its examples with a unique id as well, and join the original dataset with its copy (ExampleSetJoin), using the id as the join key. If you do it in the correct way, the labels in the original dataset are skipped during the join and the permuted labels from the copy are used in the joined dataset. To run the entire Y-randomization experiment automatically, enclose the Y-randomization portion in an IteratingPerformanceAverage operator chain and add a validation procedure (XValidation) after the Y-randomization portion, as shown in the figure.
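For readers who want to see the idea outside RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner process itself: the function names and the toy dataset are made up for illustration, and list positions play the role of the id used for the join. It shuffles the labels, runs leave-one-out validation with a 3-nearest-neighbour classifier, and averages the accuracy over a number of iterations.

```python
import random

def permute_labels(labels, rng):
    """Return a shuffled copy of the labels; the features are left untouched."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = [train_y[i] for i in order[:k]]
    # sorted(set(...)) makes the tie-break deterministic
    return max(sorted(set(votes)), key=votes.count)

def loo_accuracy(features, labels, k=3):
    """Leave-one-out accuracy of the k-NN classifier."""
    hits = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, features[i], k) == labels[i]:
            hits += 1
    return hits / len(features)

def y_randomization(features, labels, iterations=100, k=3, seed=0):
    """Average leave-one-out accuracy over repeated label permutations."""
    rng = random.Random(seed)
    scores = [
        loo_accuracy(features, permute_labels(labels, rng), k)
        for _ in range(iterations)
    ]
    return sum(scores) / len(scores)

# Toy data (invented for this sketch): two well-separated clusters.
features = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.3], [5.3, 5.0],
]
labels = ["a"] * 4 + ["b"] * 4

true_score = loo_accuracy(features, labels)
random_score = y_randomization(features, labels, iterations=100)
print("accuracy with real labels:    ", true_score)
print("mean accuracy, shuffled labels:", random_score)
```

If the model has learnt a real relationship, the accuracy with the real labels should be clearly higher than the average accuracy obtained with the shuffled labels. If the two are close, the model is probably fitting noise.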
The complete XML process is as follows:
<operator name="Root" class="Process" expanded="yes">
  <parameter key="random_seed" value="-1"/>
  <operator name="CSVExampleSource" class="CSVExampleSource">
  </operator>
  <operator name="IteratingPerformanceAverage" class="IteratingPerformanceAverage" expanded="yes">
    <parameter key="iterations" value="100"/>
    <operator name="IOMultiplier" class="IOMultiplier">
      <parameter key="io_object" value="ExampleSet"/>
    </operator>
    <operator name="AttributeSubsetPreprocessing" class="AttributeSubsetPreprocessing" expanded="yes">
      <parameter key="attribute_name_regex" value="label"/>
      <parameter key="condition_class" value="attribute_name_filter"/>
      <parameter key="keep_subset_only" value="true"/>
      <operator name="Permutation" class="Permutation">
      </operator>
      <operator name="IdTagging" class="IdTagging">
      </operator>
    </operator>
    <operator name="IOSelector" class="IOSelector">
      <parameter key="io_object" value="ExampleSet"/>
      <parameter key="select_which" value="2"/>
    </operator>
    <operator name="IdTagging (2)" class="IdTagging">
    </operator>
    <operator name="ExampleSetJoin" class="ExampleSetJoin">
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
      <parameter key="leave_one_out" value="true"/>
      <operator name="NearestNeighbors" class="NearestNeighbors">
        <parameter key="k" value="3"/>
      </operator>
      <operator name="OperatorChain" class="OperatorChain" expanded="yes">
        <operator name="ModelApplier" class="ModelApplier">
          <list key="application_parameters">
          </list>
        </operator>
        <operator name="ClassificationPerformance" class="ClassificationPerformance">
          <parameter key="accuracy" value="true"/>
          <list key="class_weights">
          </list>
        </operator>
      </operator>
    </operator>
  </operator>
</operator>
To use it, set your dataset file in CSVExampleSource and replace NearestNeighbors inside XValidation with your desired modeling method.