Diversity and representativity of datasets
The diversity of a dataset can be estimated by a diversity index (DI), defined as the average similarity over all pairs of compounds in that dataset (Perez 2005):
$$\mathrm{DI} = \frac{\sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} \mathrm{sim}(i,j)}{n(n-1)}$$
where sim(i,j) is a measure of the similarity between compounds i and j, and n is the number of compounds in the dataset. A lower DI indicates a more diverse dataset. The similarity between two compounds i and j is commonly described by the Tanimoto coefficient (Potter and Matter 1998; Willett et al. 1998; Molnar and Keseru 2002):
$$\mathrm{sim}(i,j) = \frac{\sum_{d=1}^{p} x_{di}\, x_{dj}}{\sum_{d=1}^{p} x_{di}^{2} + \sum_{d=1}^{p} x_{dj}^{2} - \sum_{d=1}^{p} x_{di}\, x_{dj}}$$
where p is the number of descriptors of the compounds in the dataset and x_{di} is the value of descriptor d for compound i. The mean maximum Tanimoto coefficient between the compounds in dataset A and those in dataset B can be used as a representativity index (RI) that measures how well dataset B represents dataset A. The higher the RI value, the more representative dataset B is of dataset A.
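The two indices can be illustrated with a short Python sketch. It is not taken from the cited references: the function names are hypothetical, each compound is assumed to be a plain list of numeric descriptor values, and the RI is read as "for each compound in A, take its highest Tanimoto coefficient to any compound in B, then average over A". Returning 0.0 for two all-zero descriptor vectors is likewise only a convention chosen here to avoid division by zero.

```python
from itertools import combinations


def tanimoto(x, y):
    """Continuous Tanimoto coefficient between two descriptor vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Assumption: two all-zero vectors are given similarity 0.0.
    return dot / denom if denom else 0.0


def diversity_index(dataset):
    """Average pairwise similarity; a lower value means a more diverse set."""
    n = len(dataset)
    if n < 2:
        raise ValueError("at least two compounds are needed")
    # sim(i,j) is symmetric, so summing over unordered pairs and doubling
    # equals the sum over all ordered pairs with i != j.
    total = sum(tanimoto(x, y) for x, y in combinations(dataset, 2))
    return 2.0 * total / (n * (n - 1))


def representativity_index(dataset_a, dataset_b):
    """Mean over A of the maximum similarity to any compound in B."""
    if not dataset_a or not dataset_b:
        raise ValueError("both datasets must be non-empty")
    best = [max(tanimoto(x, y) for y in dataset_b) for x in dataset_a]
    return sum(best) / len(best)


# Toy descriptor vectors (illustrative values, not real compounds).
set_a = [[1, 0], [0, 1]]
set_b = [[1, 1]]

print(diversity_index(set_a + set_b))
print(representativity_index(set_a, set_b))  # 0.5
```

In practice the descriptors would normally be scaled to comparable ranges before computing similarities, and for large datasets the quadratic number of pairs makes a vectorised or sampled computation preferable.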
References
- Molnar L and Keseru GM (2002). A neural network based virtual screening of cytochrome P450 3A4 inhibitors. Bioorganic and Medicinal Chemistry Letters 12(3): 419-421.
- Perez JJ (2005). Managing molecular diversity. Chemical Society Reviews 34(2): 143-152.
- Potter T and Matter H (1998). Random or rational design? Evaluation of diverse compound subsets from chemical structure databases. Journal of Medicinal Chemistry 41(4): 478-488.
- Willett P, Barnard JM and Downs GM (1998). Chemical similarity searching. Journal of Chemical Information and Computer Sciences 38(6): 983-996.