C4.5 decision tree (DT)
C4.5 DT is a branch-test-based classifier (Quinlan 1993). A branch in a decision tree corresponds to a group of data classes and a leaf represents a specific data class. A decision node specifies a test to be conducted on a single descriptor value, with one branch and its subsequent data classes as possible outcomes of the test. A given compound with vector x is classified by starting at the root of the tree and moving through the tree until a leaf is encountered. At each non-leaf decision node, a test is conducted and the classification process proceeds to the branch selected by the test. Upon reaching the destination leaf, the data class of the given compound is predicted to be that associated with the leaf.
The algorithm is a recursive greedy heuristic that selects descriptors for membership within the tree. It uses recursive partitioning to examine every descriptor of the compounds in the training set and rank them according to their ability to partition the remaining compounds, thereby constructing a decision tree. Whether or not a descriptor is included within the tree is based on the value of its information gain. As a statistical property, information gain measures how well the descriptor separate training cases into subsets in which the data class is homogeneous. For descriptors with continuous values, a threshold value had to be established within each descriptor so that it could partition the training cases into subsets. These threshold values for each descriptor were established by rank ordering the values within each descriptor from lowest to highest and repeatedly calculating the information gain using the arithmetical midpoint between all successive values within the rank order. The midpoint value with the highest information gain was selected as the threshold value for the descriptor. That descriptor with the highest information gain (information being the most useful for classification) was then selected for inclusion in the DT. The algorithm continued to build the tree in this manner until it accounted for all training cases. Ties between descriptors that were equal in terms of information gain were broken randomly (Carnahan et al. 2003).
References
- Carnahan B, Meyer G and Kuntz L-A (2003). Comparing statistical and machine learning classifiers: Alternatives for predictive modeling in human factors research. Human Factors 45(3): 408-423.
- Quinlan JR (1993). C4.5 : programs for machine learning. San Mateo, Calif, Morgan Kaufmann.