Removal-until-done algorithm
Besides the Kennard and Stone algorithm, the removal-until-done algorithm can be used to divide a dataset into a training set and a validation set. In this algorithm, compounds are sequentially removed from the dataset in pairs and placed in the training and validation sets until a defined similarity threshold is reached or the desired number of compounds has been selected for the validation set. The compounds to be removed are selected according to their distribution in chemical space. Here, chemical space is defined by the structural and chemical descriptors used to represent a compound: each descriptor corresponds to one dimension of a multidimensional space, so each compound occupies a particular location in this space. All possible pairs of compounds in the dataset are generated, and a similarity score is computed for each pair. The pairs are then ranked by their similarity scores, and compounds with similar structural and chemical features are assigned evenly to the training and validation sets according to this ranking. Compounds without enough structurally and chemically similar counterparts are assigned to the training set.
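The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that the text does not specify: similarity is measured as Euclidean distance between descriptor vectors (a smaller distance means a more similar pair), the first compound of each selected pair goes to the training set and the second to the validation set, and the names `removal_until_done`, `n_validation`, and `max_distance` are hypothetical.

```python
import math
from itertools import combinations


def removal_until_done(descriptors, n_validation, max_distance=math.inf):
    """Split compound indices into (training, validation) lists.

    descriptors  : list of equal-length numeric vectors, one per compound
    n_validation : desired number of validation compounds
    max_distance : assumed similarity threshold; pairs farther apart
                   than this are not treated as similar
    """
    # Score every possible pair and rank them, most similar first.
    # Assumption: Euclidean distance stands in for the similarity score.
    pairs = sorted(
        (math.dist(descriptors[i], descriptors[j]), i, j)
        for i, j in combinations(range(len(descriptors)), 2)
    )

    remaining = set(range(len(descriptors)))
    training, validation = [], []

    for distance, i, j in pairs:
        # Stop once enough validation compounds have been selected or
        # the next pair is less similar than the threshold allows.
        if len(validation) >= n_validation or distance > max_distance:
            break
        if i in remaining and j in remaining:
            # Remove the pair from the dataset and split it across the sets.
            remaining -= {i, j}
            training.append(i)    # assumption: first member to training
            validation.append(j)  # assumption: second member to validation

    # Compounds without a sufficiently similar counterpart go to training.
    training.extend(remaining)
    return sorted(training), sorted(validation)


# Toy data: compounds 0/1 and 2/3 are close neighbours, compound 4 is isolated.
toy_descriptors = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
train_idx, valid_idx = removal_until_done(toy_descriptors, n_validation=2)
```

With this toy data, the two close pairs are split between the sets and the isolated compound falls into the training set, giving training indices `[0, 2, 4]` and validation indices `[1, 3]`. A real application would likely use a chemistry-specific similarity measure in place of the Euclidean distance assumed here.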