19  Classification using Trees and Rules

Classification trees are a powerful and interpretable approach for classification problems. Their structure consists of nested if-then statements that partition the predictor space into regions, each of which is associated with a specific class. We’ve already seen them in a few different places: Figure 2.5 visualized a regression tree for the food delivery data, while Figure 4.3 shows a classification tree to understand missingness.

As a refresher, decision trees are a hierarchical set of splits of predictors that produce two or more nodes1. The split conditions (or rules) are logical statements based on the predictor values. We’ve seen rules such as hour < 14.769 and day in {Fri, Sat}. As decision trees are trained, the child nodes produced by each split are split further until a predefined stopping rule is triggered. At the end of the tree are the terminal nodes, and the training data that fall into these nodes are used to estimate the model predictions. The terminal nodes are defined by a complete set of mutually exclusive rules that contain all of the split conditions from each stage of splitting.

The two seminal works in this area are Breiman et al. (1984) and Quinlan (1993). Additionally, Loh (2014) provides a more recent and comprehensive review.

To describe the myriad aspects of tree-based models, we’ll start from the smallest computations first, the tools to choose a split point for the logical statement inside the rule. After this, we’ll describe how missing values can be handled algorithmically during split selection, tree growth, and pruning, along with other relevant details. After these discussions, specific single-tree-based models are described, such as CART, C4.5, and others. Finally, the chapter concludes with discussions of tree stability and the creation of rule-based models.

As with previous chapters, we’ll use the forestation data to illustrate how these methods work.

19.1 Creating Splits

Splitting or recursively partitioning the data is the foundation of classification tree construction. The effectiveness of a classification tree largely depends on how well the method partitions the predictor space into regions that separate the classes.

In this section, we’ll discuss the tactics of splitting qualitative and quantitative predictors using various methods. Our goal is to understand and elucidate the possible objective functions that could be optimized when evaluating a specific predictor. Recall that because splitting occurs recursively, we can explain the process independent of any splitting level.

To more easily conceptualize splitting methods, we’ll use the small training data subset shown in Table 19.1. From these data (n = 12), the raw probabilities for each class are \widehat{p}_1 = 0.583 (forested) and \widehat{p}_2 = 0.417 (unforested).

Forested? County Maximum Vapor
Yes stevens 1,340
Yes stevens 862
Yes yakima 665
Yes okanogan 1,274
Yes okanogan 580
Yes okanogan 1,165
Yes okanogan 1,224
No yakima 1,179
No yakima 1,659
No yakima 1,551
No okanogan 1,639
No okanogan 1,657
Table 19.1: A small sample of the forestation data that will be used to illustrate splitting methods.

Let’s assume that the splits in our trees are binary; they produce two partitions of the data. This can be via thresholding a continuous predictor or by assigning levels of categorical predictors to two groups (e.g., {stevens, yakima} vs {okanogan}). In either case, a proposed split can be represented by a 2 \times 2 table as seen in Table 19.2.

Class 1 Class 2
Left n_{11} n_{12} n_{1+}
Right n_{21} n_{22} n_{2+}
n_{+1} n_{+2} n
Table 19.2: Example notation for a binary split of an outcome with C=2 class levels.

In this case, n is the number of training set points available at the time, and the plus signs indicate the marginal totals; for example, n_{+1} = n_{11} + n_{21}. This table is shown for thresholding numeric predictors, but is also valid for categorical splits. When thresholding, we associate the “left split” with smaller predictor values (i.e., X\le c) and the “right split” with larger predictor values.

Using this table, we can compute various statistics to assess the quality of the split. Many of these statistics seek to maximize the class purity of the resulting partitions. For example, the perfect split for our example data occurs when each partition contains only a single class.

The next section will describe general statistics that can be produced from this 2 \times 2 table, all of which generalize to C\ge 3 classes.

19.1.1 General Splitting Criteria

The choice of splitting criterion is a critical decision in tree construction, because this determines how the algorithm evaluates potential splits. Several metrics exist to evaluate the quality of potential splits, each with its own theoretical foundation and practical implications.

The Gini index (Breiman et al. 1984) is one of the most widely used splitting criteria in classification trees. For a two-class problem, the Gini index is defined as:

\text{Gini} = p_1(1-p_1) + p_2(1-p_2) = 2p_1p_2

where p_1 and p_2 are the class probabilities in the node. The Gini index can be interpreted as a measure of node impurity. It reaches its minimum value of zero when a node contains only one class (perfect purity) and its maximum value of 0.5 when classes are evenly distributed (maximum impurity).

When evaluating a potential split, the algorithm calculates the weighted average of the Gini indices for the resulting child nodes, with weights proportional to the number of samples in each node. The split that minimizes this weighted average is selected.

For the data in Table 19.1, the Gini index for this node would be 2 \times 0.583 \times 0.417 = 0.243. We call the node, prior to being split, the parent node.

Now, suppose a split value for maximum vapor of 1250 creates two child nodes:

Truly Forested?
Yes No
Max. Vapor \le 1250 5 1 6
Max. Vapor > 1250 2 4 6
7 5 12

Using these counts, we can compute the Gini statistic for each rule and then their weighted average:

\begin{align} \text{Left} &= 2 \times \frac{5}{6} \times \frac{1}{6} = 0.139 \notag \\ \text{Right} &= 2 \times \frac{2}{6} \times \frac{4}{6} = 0.222 \notag \\ \text{Total} &= \frac{6}{12} 0.139 + \frac{6}{12} 0.222 = 0.181 \notag \end{align}

The Gini value for this split is lower than the parent node’s, indicating a reduction in impurity. We often frame the importance of a split in terms of its gain relative to the parent node. Following the notation from Section 12.2, our objective function is denoted as \psi, and the gain is computed as:

\psi_{gain} = \begin{cases} \psi_L + \psi_R - \psi_P & \text{if maximize } \psi \notag \\ \psi_P - \psi_L - \psi_R & \text{if minimize } \psi \end{cases}

where L, R, and P represent the statistics from the left split, the right split, and the parent (i.e., prior to splitting). For the Gini statistic, the gain is 0.243 - 0.181 = 0.063. During training, we can also impose a condition that a split can only be made if the expected gain exceeds a minimum threshold and/or if there are at least n_{min} data points in the leaf. One or both of these can be treated as tuning parameters to discourage excessively deep trees, which may overfit.

When there are C > 2 classes, the more general form of the statistic is:

Gini = \sum _{k=1}^{C}\left(p_{k}\sum _{k'\neq k}^Cp_{k'}\right)

Another common splitting criterion based on information theory is the cross-entropy statistic from Equation 15.20, otherwise known as the information statistic, which also measures the reduction in uncertainty achieved by a split. In our current context, the statistic is:

CE = -\sum_{k = 1}^C p_k \log_2p_k

The information content is measured in bits and can be interpreted as the average number of bits needed to encode the class of a randomly selected sample from the node. A node with perfect purity (all samples belonging to the same class) has an information value of zero, while a node with maximum impurity (equal distribution of classes) has the highest information value. Splits with larger information gains are preferred as they create more homogeneous nodes.

Using the same example as before, the information content of the parent node would be:

CE = - \left[0.583\; \log_2\left(0.583\right) + 0.417\; \log_2\left(0.417\right)\right] = 0.98\; \text{bits}

Breaking out the calculations for each side and weighting them shows that there is a non-zero gain for this split:

\begin{align} \text{Left} &= -\left[ \frac{5}{6}\; \log_2\left(\frac{5}{6}\right) + \frac{1}{6}\; \log_2\left(\frac{1}{6}\right)\right] = 0.65 \notag \\ \text{Right} &= -\left[ \frac{2}{6}\; \log_2\left(\frac{2}{6}\right) + \frac{4}{6}\; \log_2\left(\frac{4}{6}\right)\right] = 0.918 \notag \\ \text{Total} &= \frac{6}{12} 0.65 + \frac{6}{12} 0.918 = 0.784 \; \text{bits}\notag \\ \text{Gain} &= 0.98 - 0.784 = 0.196\; \text{bits} \notag \end{align}

A limitation of information gain is its bias toward predictors with many possible values, particularly categorical predictors with numerous categories. To address this issue, Quinlan (1986) introduced the gain ratio in the C4.5 algorithm2. The gain ratio is calculated by dividing the information gain by the intrinsic information of the split, which measures the potential information generated by dividing the samples according to the predictor:

\text{gain ratio(split)} = \frac{\text{gain(split)}}{\text{intrinsic information(split)}} \tag{19.1}

The intrinsic information is calculated as:

\text{intrinsic information(split)} = -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}

Where |S| is the number of samples in the parent node, |S_i| is the number of samples in the i-th child node, and n is the number of child nodes. The gain ratio penalizes splits that create many small partitions, thus addressing the bias of information gain toward predictors with many values.

Instead of purity, we can also measure the association between the class values and the class predictions produced by a candidate split. The most well-known is the \chi^2 statistic (Agresti 2002; McHugh 2013). This is produced by summing a set of residuals that compare the observed counts in each cell to their expected values if there were no association. The expected values are produced by the marginals in each dimension: e_{ij} = n\, p_{i+}\, p_{+j}. This represents the number of training set points we would, on average, expect to fall into row i and column j if the split were completely uninformative. The statistic is:

\chi^2=\sum _{j=1}^{C}\sum _{k=1}^{C}{\frac {{\left(n_{jk}-e_{jk}\right)}^{2}}{e_{jk}}}

or, for 2 \times 2 tables3:

\chi^2 = \frac{n\,(n_{11}\, n_{22} - n_{12}\, n_{21})^2}{n_{1+} \, n_{2+} \, n_{+1} \, n_{+2}} \tag{19.2}

Since this statistic measures the difference between our model and an uninformative state, we want to maximize its value. This \chi^2 statistic is often used in conditional inference trees (Hothorn, Hornik, and Zeileis 2006). In Section 19.7.3, we’ll describe the strategic advantages of minimizing the associated p-value of this value rather than maximizing the statistic itself.

As we’ll see in the next chapter, modern boosting models often use splitting criteria based on the log-likelihood. The rationales for the methods discussed in this chapter are described ?sec-cls-modern-boost. Also, for simplicity, we’ll assume that the outcome has C = 2 classes. We’ll focus on the three most popular models: XGBoost (Chen and Guestrin 2016), LightGBM (Ke et al. 2017), and CatBoost (Prokhorenkova et al. 2018).

These tools often minimize the sum of the Pearson residuals (previously seen in Section 16.2.1):

\psi = \sum_{i=1}^n\sum_{k=1}^{2}\frac{(y_{ik} - \widehat{p}_{ik})^2}{\widehat{p}_{i1} \widehat{p}_{i2}} \tag{19.3}

It is common to add a small numeric constant to the denominator to avoid division by zero.

The XGBoost model adds a few modifications to this statistic that are inspired by standard regularization methods4. Recall that L1 penalization can shrink model coefficients directly to absolute zero (Section 16.2.4). For XGBoost, an analogous operation is applied to the raw residuals e_{ik} = y_{ik} - \widehat{p}_{ik} so that only “large” values contribute to the splitting criteria. That function is:

\tilde{e}_{ik} = \text{sign}(e_{ik})\, \max\left( 0, \,|e_{ik}| - \lambda_1\right) = \begin{cases} e_{ik} - \lambda_1 & \text{if } e_{ik} > \phantom{-}\lambda_1 \\ e_{ik} + \lambda_1 & \text{if } e_{ik} < -\lambda_1 \\ 0 & \text{otherwise} \end{cases} \tag{19.4}

For example, if a data point from the first class is mis-predicted using \lambda_1 = 0.01, it will disregard any data points with \widehat{p}_{i1} < 0.01 and will shrink the other residuals towards zero by subtracting 0.01 from their absolute value.

Additionally, it can emulate weight decay by adding an L2 regularization term \lambda_2 that affects the denominator of the statistic:

\psi_{xgb}(\lambda_1, \lambda_2) = \sum_{i=1}^n\sum_{k=1}^{2}\frac{\tilde{e}_{ik}^2}{\widehat{p}_{i1} \widehat{p}_{i2} + \lambda_2} \tag{19.5}

Unlike \lambda_1, this penalizes the sum of the current data at the node, which can range from 0 to n/2. All things being equal, this will penalize nodes with more data more than it does for nodes with fewer data points. However, very uncertain predictions (values near \widehat{p}_{ij} = 0.5) will affect this sum more than others. Both regularization methods can be used simultaneously and treated as tuning parameters.

As with previous statistics, this splitting statistic is computed for both sides of the split to measure the gain. LightGBM takes the same approach.

CatBoost uses essentially the same statistics with some slight alterations. First, no L1 regularization is applied. Second, the L2 penalty value is scaled to be:

\lambda_2^* = \frac{\lambda_2}{n}\sum_{i=1}^n \widehat{p}_{i1} \widehat{p}_{i2}

This mechanism is probably intended to ensure that the regularization effect is similar as the tree becomes more and more certain about its predictions. For example, in initial iterations when there are few trees, the uncertainty in the predictions may be high, such that the variances \widehat{p}_{i1} \widehat{p}_{i2} are larger than they will be in later boosting iterations. For example, suppose that, in the first tree created, the variance averages around 0.20 for the n data points in the node. If the tree improves over iterations, let’s suppose they become more certain, and the average variance reduces to about 0.05. Without scaling, the effect of \lambda_2 is largest at early iterations (which very well may be appropriate). When the penalty is scaled, its influence is normalized to be roughly the same across iterations and/or across prediction quality.

Third, when computing the final splitting statistic, CatBoost does not subtract the parent value (i.e., \psi_P).

19.1.2 Quantitative Predictors

For continuous predictors, the conventional process of finding the optimal split point involves evaluating all possible binary partitions of the data based on the predictor values. The algorithm first orders the samples by predictor values, then evaluates splits between adjacent unique values. For each potential split point, it computes the chosen split criterion and selects the split that optimizes it.

The computational complexity of this process is generally O(n \log n) for each predictor, where n is the number of samples, due to sorting. Once the samples are sorted, evaluating each potential split point can be efficient, as class distributions can be updated incrementally as you move through the sorted list.

For large datasets with many unique predictor values, some implementations use approximations to reduce computational cost (Ranka 1998; Ke et al. 2017). For example, the implementations might consider only a subset of potential split points or use binning techniques to group similar predictor values. All three previously discussed boosting methods employ one or both of these strategies.

Figure 19.1 illustrates the metrics described in the previous section for the maximum vapor predictor from the entire forested training set. To facilitate comparisons, the statistics’ values have been normalized to a common scale. The “rug” on the x-axis should indicate which data points were considered split candidates.

Many of these criteria show similar patterns and recommend splits in the same region (if not the same value), except for the Gini statistic. It suggests a larger split value of although any split point larger than 1,400 produces nearly the same result.

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| label: shiny-tree-splits
#| viewerHeight: 550
#| standalone: true
library(shiny)
library(bslib)
library(dplyr)
library(ggplot2)
library(scales)

source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-setup.R")
# source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-penalty.R")

ui <- page_fillable(
  theme = bs_theme(bg = "#fcfefe", fg = "#595959"),
  padding = "1rem",
  layout_columns(
    fill = TRUE,
    column(
      width = 10,
      radioButtons(
        inputId = "method",
        label = "Method",
        choices = c(
          "Gini Gain",
          "Information Gain",
          "Information Gain Ratio",
          "Chi-Square",
          "XGBoost",
          "LightGBM",
          "CatBoost"
        ),
        inline = TRUE
      )
    )
  ),

  as_fill_carrier(plotOutput("scores"))
)

server <- function(input, output) {
  load(url(
    "https://raw.githubusercontent.com/aml4td/website/main/RData/forested_split_examples.RData"
  ))

  output$scores <-
    renderPlot(
      {
        chosen <-
          forested_split_examples |>
          dplyr::filter(metric == input$method)

        if (input$method == "Gini Index") {
          best <-
            chosen |>
            slice_min(value)
        } else {
          best <-
            chosen |>
            slice_max(value)
        }

        others <-
          forested_split_examples |>
          dplyr::filter(metric != input$method)

        p <-
          others |>
          ggplot(aes(split_value)) +
          geom_line(
            aes(y = unit_value, group = metric, colour = metric),
            alpha = 0.3,
            show.legend = FALSE
          ) +
          geom_line(
            data = chosen,
            aes(y = unit_value),
            col = "black",
            linewidth = 0.75,
            alpha = 0.6
          ) +
          geom_rug(
            data = chosen,
            aes(x = split_value),
            col = "black",
            alpha = 0.1,
            length = unit(0.02, "npc")
          ) +
          geom_vline(xintercept = best$split_value, col = "black", lty = 2) +
          labs(x = "Maximum Vapor", y = "Normalized Statistic") +
          theme_light_bl()

        print(p)
      },
      res = 100
    )
}

app <- shinyApp(ui = ui, server = server)

app
Figure 19.1: Different splitting criteria for the maximum vapor predictor data from the entire training set. The criteria values have been scaled to have the same range.

The split curves and optimal values were the same for information gain, the \chi^2 statistic, and the three boosting methods. Note that, for the latter, the split-point candidates were not the same. Each boosting method used a binning strategy with 1,273 possible breaks. For 2 \times 2 tables, the \chi^2 and information-based statistics are directly proportional to one another and differ by a constant. Their split values ranged from 1,248 to 1,286, mostly due to small numerical differences. The curve for the information gain statistic was slightly shifted to the right, but it also produces an optimal split in this region.

19.1.3 Qualitative Predictors

Handling categorical predictors in classification trees presents unique challenges compared to continuous predictors. There are two main approaches to incorporating categorical predictors in tree models. The first approach allows the algorithm to bin the levels of a predictor into two categories, while the second approach creates new dummy variables for each category (Section 6.2). Section 5.7 of Kuhn and Johnson (2019) contains a broad discussion of these two strategies.

When using the binning approach, the algorithm must determine how to optimally partition the categories into two groups. For a predictor with k categories, there are 2^{k-1} - 1 possible ways to create a binary split. Evaluating all these possibilities becomes computationally infeasible for predictors with many categories. Therefore, algorithms often use heuristics to find good splits without exhaustive search. For example, in the Classification and Regression Trees (CART) methodology (Breiman et al. 1984), categorical predictors are ordered by the proportion of samples in each class, and then binary splits are evaluated between adjacent categories in this ordering. This approach, very similar to an effect-encoding strategy and to the process described by Fisher (1958), reduces computational complexity while still allowing the algorithm to find effective splits.

To illustrate the binning approach, consider the partitioning model illustrated in Figure 19.2. The figure illustrates a tree of a maximum depth of 3 splits. The first split uses the Gini index criterion to split the maximum vapor predictor at 1,248. Samples with values greater than or equal to 1,248 are illustrated on the right side of the tree. The next split on this side of the tree is the county variable.

Decision tree diagram with a maximum depth of three splits. The first split uses maximum vapor at 1,248. Subsequent splits involve grouped county categories (labeled by the number of counties in each group) and other numeric predictors. Each terminal node displays the dominant class, sample counts, and class percentages.
Figure 19.2: Recursive partitioning tree when using the county variable in its adjacent category form. The labels for the split points for the county variable have been simplified to summarize the number of counties included at each split.

The adjacent category methodology creates a split with 7 counties in one group and the remaining counties in the other group. Figure 19.3 demonstrates why the counties were partitioned into these groups. The x-axis in (a) and (b) of this figure provides the counties ordered from the lowest proportion of forested counties (left) to the highest proportion (right). The top portion of the figure provides the Gini index for the partitioning of the samples at that county and all counties to the left. The purity is optimized between the counties of Asotin and Okanogan. At this split point, 7 counties are placed into one group, and 13 counties are in the other group, as illustrated in Figure 19.2.

Two-panel stacked figure. Panel (a) is a line-and-dot plot of the weighted Gini index for each possible binary partition of counties ordered by forestation rate, with the optimal split point highlighted. Panel (b) is a bar chart of the forestation rate for each county, ordered from lowest to highest. Together they show the split between Asotin and Okanogan produces the purest partition.
Figure 19.3: (a) The Gini index index for each partition of the ordered counties for the samples of forestation training data with a value of maximum vapor greater than or equal to 1,248. (b) The proportion of forested samples within each county for this subset of data.

The second approach creates k binary dummy variables, one for each category (or k-1 dummy variables to avoid redundancy). Each dummy variable is then treated as a separate predictor in the tree-building process. This approach, referred to as using independent categories, forces binary splits for the categories and effectively implements a one-versus-all splitting strategy. We do not recommend this approach.

Figure 19.4 shows a tree resulting from binary indicators where no county features were chosen. This is likely to do with the inherent splitting bias in CART trees discussed below in Section 19.2.

The choice between these approaches can impact model complexity and interpretability. The grouped categories approach can capture more complex relationships between categories and the response variable. Because the algorithm groups categories that have similar effects on the response, this often leads to more accurate and interpretable models. The independent categories approach, on the other hand, may not select any categories, resulting in simpler, less accurate models.

Decision tree diagram built using binary indicator variables for each county. The tree is deeper than the adjacent-category version but does not select any county indicator as a split variable, relying entirely on numeric predictors. Each terminal node shows the dominant class, sample counts, and class percentages.
Figure 19.4: Recursive partitioning tree when transforming the county variable into many dummy, binary variables.

However, the grouped categories approach can sometimes capture more complex relationships between categories and the response variable. It allows the algorithm to group categories that have similar effects on the response, potentially leading to more accurate and interpretable models. The independent categories approach, on the other hand, treats each category in isolation, which may miss important interactions between categories but often yields simpler models.

XGBoost and LightGBM take very similar approaches to split selection for categorical predictors. The predictor categories are ordered using some statistic, similar to the CART discussion earlier5. If there are “few” category values, a full set of binary indicators is created and treated in the same way as numeric predictors. This would generate a potential set of one-versus-all splits. If there are numerous categories, a two-pass approach is used to limit the total number of possible splits to consider. The forward pass starts with the lowest-ranked categories and sequentially adds them to the left split grouping (all others go to the right). The backward pass does the opposite, sequentially grouping the highest-ranked categories.

For example, suppose we have a predictor with 26 categories, ranked A to Z. If we were to only consider a maximum of five categories within a split, we would consider partitions:

  • {A} versus {others}
  • {A, B} versus {others}
  • {A, B, C} versus {others}
  • {A, B, C, D} versus {others}
  • {Z} versus {others}
  • {Z, Y} versus {others}
  • {Z, Y, X} versus {others}
  • {Z, Y, X, W} versus {others}

The final split is the configuration with the best gain statistics for these two methods, as described above. This approach effectively limits the computational costs of splitting categories for a predictor, even when the number of categories is very large.

For XGBoost, the ranking metric uses a function of standardized residuals that does not include the square in the numerator:

R_{xgb}(\lambda_1, \lambda_2) = \sum_{i=1}^n\sum_{k=1}^{2}\frac{-\tilde{e}_{ik}}{\widehat{p}_{i1} \widehat{p}_{i2} + \lambda_2} \tag{19.6}

LightGBM is very similar, but uses a modified ranking statistic and does not use L1 regularization:

R_{lgb}(\lambda_2, \epsilon) = \sum_{i=1}^n\sum_{k=1}^{2}\frac{-e_{ik}}{\widehat{p}_{i1} \widehat{p}_{i2} + \lambda_2 + \epsilon} \tag{19.7}

where \epsilon is a smoothing value that attempts to provide additional regularization for high cardinality splits. The default value is \epsilon = 10.

For categorical predictors, CatBoost uses a target encoding strategy similar to that described in Section 6.4.3. Its version of this is called Categorical Target Representation (CTR). They use the same Beta-Binomial estimate for the probability of an event for category j:

\widehat{p}_j = \frac{n_{+j} + \alpha}{n + \alpha + \beta}

n_{+j} is the number of events in in category j within all of the n rows under consideration. CatBoost uses \beta = 1 and calls \alpha the “prior” parameter, which can be used for tuning. Once all of the j values of \widehat{p}_j are computed, the split determination treats the predictor as if it were quantitative from the start. CatBoost’s algorithm, which uses the CTR values to determine the best predictor (and value) to split on, is fairly complex and is detailed in ?sec-cls-modern-boost.

So far, we’ve constrained our splits to be binary. C4.5 can create multiway splits for categorical predictors. For example, with categories A-F, a split might contain three groups such as {A, B, C}, {D}, and {E, F}. To do this, C4.5 uses a greedy, hill-climbing algorithm. When a specific split candidate is evaluated, it is rated using a penalized gain ratio. The gain is the improvement of the information statistic before and after combining categories. The penalty is based on the Minimum Description Length (MDL) principle (Quinlan and Rivest 1987) and uses the logarithm of the number of ways to partition C categories into k groups (a.k.a. a Bell number) divided by the number of data points.

We start with k = C groups and immediately combine categories with the same frequency distributions (since they yield identical results). If we start with k-way splits, the first iteration loops through the k splits and merges one additional category. Again, with categories A-F, the first round of candidates is

{A,B}  {C}   {D}   {E}   {F}
{A,C}  {B}   {D}   {E}   {F}
{A,D}  {B}   {C}   {E}   {F}
{A,E}  {B}   {C}   {D}   {F}
{A,F}  {B}   {C}   {D}   {E}
 {A}  {B,C}  {D}   {E}   {F}
 :     :     :     :     :
  {A}   {B}   {C}   {D}  {E,F}

If there is no pair that improves the penalized gain, the algorithm stops with the current configuration. Otherwise, the combination with the best gain is used. This process continues until there are no more improvements or if there are two categories.

For ordered categories, some algorithms restrict splits to maintain the ordering, allowing only those that group adjacent categories.

19.1.4 Oblique Splits

The tree structure that we have discussed splits on a single predictor at a time. This partitions the predictor space into a collection of (hyper)rectangles. For this reason, these splits are called “rectangular” or “axis-parallel”.

It is possible for trees to create class boundaries that do not result in rectangular partitions, and these occur when more than one predictor is used in a single rule. Some advanced tree algorithms consider splits based on linear combinations of predictors (Murthy, Kasif, and Salzberg 1994; Wickramarachchi et al. 2016). These oblique trees6 can capture more complex decision boundaries than traditional trees, which are limited to axis-parallel splits. For example, an oblique tree might use a split like “if 2 \times \text{Predictor A} + 3 \times \text{Predictor B} > 5 then…”. While these splits can lead to more accurate models, they often sacrifice interpretability.

To demonstrate, a logistic regression model was simulated with linear predictor \eta = 0.1 + 5.0\, A - 5.0\, B where A and B had a bivariate Gaussian distribution with a correlation coefficient of 0.7. This configuration results in slightly overlapping data clouds, as seen in the left-hand panel of Figure 19.5. While this pattern is trivial for a logistic model to estimate, a tree-based model must go to heroic lengths to emulate a diagonal line. The tree fit shown in the middle panel required 9 terminal nodes to produce the blocky step function that does a mediocre job in separating the data.

Three-panel scatter plot comparing splitting approaches. The left panel shows the raw data with two overlapping classes separated by a diagonal boundary. The middle panel overlays a rectangular CART decision boundary that requires many splits to approximate the diagonal, producing a blocky step function. The right panel shows an oblique tree boundary that captures the diagonal with a single linear split.

Figure 19.5: An example of rectangular and oblique splits. The data were simulated using a linear predictor of \eta = 0.1 + 5.0\, A - 5.0\, B and a logit link.

An oblique tree required a single split to produce the class boundary shown in the right-hand panel. Using a training set of n_{tr} = 500 data points in the training set, the diagonal split does a remarkable job approximating the underlying pattern; it’s split uses the rule

- 1.71 \, \left(\frac{A - 0.0652}{0.966}\right) + 1.81\,\left(\frac{B - 0.048}{1.01}\right) \le 0.0494. \tag{19.8}

The values inside the parentheses are means and standard deviations used to scale the predictors. When simplified:

0.020+1.77\, A-1.80 \, B \ge 0

We’ll discuss a specific technique for creating oblique trees in Section 19.7.5.

19.2 Understanding Bias in Splitting

For trees created using greedy optimization, there is the possibility of selection bias – some predictors are, all other things being equal, more likely to be selected because of the characteristics of their distribution. The primary way this can occur is when some predictors have a higher information content than others. For example, imagine a dense numerical predictor whose values are all unique (i.e., high precision) and range uniformly from 0.0 to 10.0. Say we have n = 1,000 training set points. If we make a version of that predictor and round it to the first decimal place, there should be 100 unique values (on average). In a model such as CART, the original predictor is evaluated 10 times more than the truncated version. If they have the same relationship with the outcome, the version with full precision is much more likely to have the best splitting criterion value just by chance.

Bias in tree-based models has been well-studied. Good places to begin are Quinlan (1993), White and Liu (1994), Loh and Shih (1997), Dobra and Gehrke (2001), Hothorn, Hornik, and Zeileis (2006), and Strobl et al. (2007).

This has a variety of consequences, the foremost of them being that our tree might not favor the truly best predictors. This can be particularly critical if the model’s variance importance score is a primary focus of the analysis. Also, if most informative predictors have few unique values (e.g., categorical predictors) and many irrelevant predictors with high precision, the model might overfit to the (wrong) predictors.

There are a few methods to mitigate this issue. First, the standard splitting criteria can be adjusted to account for predictor precision. For example, Quinlan’s gain ratio normalizes the statistic to help compensate for the differences. A second approach is to decouple the process of choosing the predictor to split on from the process of choosing the best split for that predictor. If each predictor provides a single answer to the question of whether it is the best predictor, we might have less bias in the process. This idea hinges on the assumption that this single value should also be free of bias. Several models greatly reduce the bias by quantifying the association between each predictor and the outcome to determine which is best. One tool that will be described more in Section 19.7.3 uses different statistical tests for significance as the statistic. The most significant predictor is chosen (if one is chosen at all), and then standard split-value determination can be performed.

19.3 Missing Data Handling

Classification trees can have built-in mechanisms for handling missing predictor values, which provides a significant advantage over many other modeling techniques. Instead of requiring imputation or discarding samples with missing values, trees can work directly with incomplete data during both training and prediction.

Missing data is a common issue across datasets and can arise for various reasons, such as measurement errors, non-response, or data integration issues. Traditional predictive modeling methods often require complete data, forcing the user to either discard incomplete samples or impute missing values. The discussions in Chapter 4 are directly applicable here.

Tree models can handle missing values directly, without requiring sample deletion or explicit imputation. This capability stems from the tree’s hierarchical structure and the flexibility of the splitting and prediction processes.

19.3.1 Binning Missing Values

For XGBoost and LightGBM, observations with missing predictor values are placed in a separate bin. They differ in how they are treated in the forward and backward passes described in Section 19.1.3. XGBoost always adds the missing value category to the side of the initial categories used in the scan. For example, for the grouped splits listed above in Section 19.1.3, missing are added to the groups containing A on the forward pass and those containing Z on the backward pass. LightGBM always adds the missing data category to the larger group (shown as {others} above in Section 19.1.3).

19.3.2 Surrogate Splits

In the CART methodology, missing values are handled differently during tree construction and prediction. During tree construction, CART uses a “surrogate split” approach. When evaluating a potential split on a predictor with missing values, only the samples with non-missing values for that predictor are used to determine the optimal split point and calculate the improvement in the impurity measure. This ensures that the split is based on actual data rather than imputed values.

Once a split has been selected, CART identifies surrogate splits that can be used when the primary split variable is missing. A surrogate split is an alternative split on a different predictor that mimics the primary split as closely as possible. Specifically, CART looks for splits that send samples to the same child nodes as the primary split would.

For example, the first split in Figure 19.2 uses the maximum vapor predictor. Its first surrogate for this split is annual precipitation. Although the forestation data set has no missing values, the model would use annual precipitation at this point in the tree to send the sample directly to a terminal node (provided the sample did not have a missing value for this predictor).

CART ranks surrogate splits by how well they replicate the primary split’s behavior on the training data. For each primary split, CART identifies multiple surrogate splits in order of performance, creating a hierarchy of alternatives to use when predictors have missing values.

During prediction, when a sample reaches a node with a split on a predictor for which the sample has a missing value, CART uses the best available surrogate split to determine which child node the sample should go to. If all potential surrogate splits also involve predictors with missing values for that sample, CART sends the sample in the direction of the majority of training samples.

This approach allows the tree to make predictions for samples with missing values without requiring explicit imputation. It also naturally handles patterns of missingness, as the surrogate splits are chosen based on the relationships observed in the training data.

19.3.3 Fractional Accounting

The C4.5 algorithm takes a different approach to handling missing values, based on probabilistic weighting rather than surrogate splits. During tree construction, C4.5 adjusts the information gain calculations to account for missing values. When evaluating a potential split on a predictor with missing values, C4.5 calculates the information gain using only the samples with non-missing values for that predictor, then scales this gain by the fraction of non-missing data. This adjustment ensures that predictors with many missing values are not unfairly penalized or favored.

C4.5 also modifies how it handles categorical predictors with missing values. When calculating the information value for a categorical predictor (used in the gain ratio calculation), C4.5 treats missing values as a separate category. This increases the number of branches in the split and affects the penalty applied to the information gain.

A distinctive aspect of C4.5’s approach is how it handles missing values when determining the class distribution in the resulting splits. Instead of assigning samples with missing values to a specific child node, C4.5 allows them to contribute fractionally to all child nodes. The fractional contribution is based on the distribution of non-missing values among the child nodes.

For example, if 70% of samples with non-missing values go to the left child and 30% go to the right child, a sample with a missing value would contribute 0.7 of a sample to the left child and 0.3 of a sample to the right child. This fractional accounting means that the class frequency distribution in each node may not contain whole numbers, and the number of errors in a terminal node can also be fractional.

During prediction, C4.5 uses a similar fractional approach. When a sample with missing values reaches a node with a split on a predictor for which the sample is missing, the sample is sent down all possible paths, with weights proportional to the distribution of non-missing values in the training data. The final prediction is based on a weighted vote from all relevant terminal nodes.

This approach allows C4.5 to make use of all available information in a sample, even when some predictors have missing values. It also provides a natural way to incorporate uncertainty due to missing values into the prediction process.

19.3.4 Comparison of Approaches

Both the CART and C4.5 approaches to handling missing data have their strengths and weaknesses. The CART approach with surrogate splits is more deterministic, as each sample follows a single path through the tree. This can make the prediction process easier to interpret and explain. Surrogate splits also explicitly model relationships between predictors, providing additional insights into the data. However, identifying and storing surrogate splits increases the model’s computational complexity and memory requirements.

The C4.5 approach with fractional samples is more probabilistic and can better represent the uncertainty introduced by missing values. By allowing samples to contribute to multiple nodes, it leverages all available information in the tree. This approach can be particularly effective when missingness is informative or when there are complex patterns of missingness in the data. However, fractional accounting can make the model more difficult to interpret and the prediction process more complex.

Both approaches have been shown to perform well in practice, and the choice between them often depends on the specific requirements of the application, such as the need for interpretability, the patterns of missingness in the data, and the computational resources available.

19.3.5 Practical Considerations for Missing Data

While classification trees can handle missing values directly, there are still practical considerations to keep in mind when working with incomplete data.

First, the effectiveness of the missing value handling mechanisms depends on the patterns of missingness in the data. If values are missing completely at random (MCAR), both CART and C4.5 should perform well. If values are missing at random (MAR) or missing not at random (MNAR), performance may depend on whether the tree can capture relationships between missingness and other predictors, or between missingness and the response variable.

Second, even with these sophisticated mechanisms, very high rates of missingness can still be problematic. If a predictor has too many missing values, it may not be selected as a split variable, even if it would be highly informative when available. In such cases, imputation might still be beneficial as a preprocessing step.

Third, the handling of missing values can interact with other aspects of the tree-building process, such as pruning. For example, in C4.5, the fractional accounting for missing values affects the calculation of the pessimistic error rate used for pruning. This interaction should be considered when tuning the model.

19.4 Growing Trees

For some shape and size of a tree, the search space of which predictors and splits to use to create all the data partitions is vast. Traditionally, trees are built from the top down using greedy optimization. Growing a classification tree involves recursively applying the splitting process described above until a stopping criterion is met. This will be our main focus, but Section 19.7.4 will describe an alternative, non-greedy approach to creating trees.

Most growing processes begin with all the training data in a single root node. The algorithm then searches for the best predictor and split point according to the chosen splitting criterion, as described in the previous section. Once identified, the split divides the data into two child nodes. This process repeats recursively for each child node until one or more stopping criteria are met.

At each step, the algorithm considers all available predictors and all possible split points for each predictor. This exhaustive search ensures that the selected split is the best possible split according to the chosen criterion. However, it also makes tree growing computationally intensive, especially for large datasets with many predictors.

The recursive nature of tree growing means that each split is conditional on the previous splits. This allows trees to automatically capture interactions between predictors, without the need for explicit interaction terms. For example, in Figure 19.2, the first partition on the left side used the maximum vapor variable. This split was followed by a split using the roughness variable. This structure implicitly models an interaction between maximum vapor and roughness.

To further illustrate how recursive partitioning across predictors serves as an implicit surrogate for predictor interactions, consider Figure 19.6, which displays heatmaps of the predicted probability of forestation for two different pairs of variables from the forestation data set.

The left panel was constructed using a logistic regression model with maximum vapor, roughness, and their two-way interaction as predictors. The predicted probability of forestation is represented by color, with green indicating a high probability of forestation and brown indicating a high probability of non-forestation. The probability surface exhibits notable curvature, with the probability of forestation changing substantially across values of maximum vapor, conditional on the value of roughness. This non-linear relationship between the predictors and the outcome is the hallmark of a statistical interaction: accurate prediction requires knowledge of both predictor values simultaneously, as neither predictor’s effect can be understood in isolation.

In contrast, the right panel displays results from a logistic regression model using maximum vapor, northness, and their two-way interaction term as predictors. Here, the probability surface shows no curvature, with the probability of forestation remaining relatively constant across northness values for any given value of maximum vapor. The straight contours indicate that there is no meaningful statistical interaction between these predictors. In this case, the predictors contribute additively to the probability of forestation, and each predictor’s effect can be interpreted independently of the other’s value.

Two side-by-side heatmaps of predicted forestation probability. Panel (a) shows maximum vapor versus roughness with a curved probability surface, indicating a strong interaction where the effect of one predictor depends on the other. Panel (b) shows maximum vapor versus northness with straight contours and no meaningful curvature, indicating an additive relationship with no interaction.
Figure 19.6: Heat maps of predicted probabilities of forestation from two different logistic models with main effects and interactions. (a) A model containing maximum vapor and roughness. (b) A model with maximum vapor and northness.

As the tree grows, the nodes become increasingly homogeneous with respect to the response variable. In the extreme case, each terminal node would contain samples from only one class, resulting in perfect classification on the training data. However, such a tree would likely overfit the training data and perform poorly on new, unseen data. Therefore, various stopping criteria are used to prevent excessive tree growth.

When should the algorithm stop recursively partitioning the data? Several parameters control when tree growth should terminate. These stopping criteria are essential for preventing overfitting and ensuring that the tree generalizes well to new data.

As previously mentioned, one common criterion is the minimum node size, which prevents the algorithm from splitting when a node contains fewer than a specified number of samples (i.e., n_{min}). This criterion ensures that each node has sufficient data to make reliable predictions. If a node has too few samples, any patterns observed there might be due to random variation rather than true relationships in the data. The optimal minimum node size depends on the dataset size and the complexity of the underlying relationships. Larger minimum node sizes lead to smaller trees with fewer terminal nodes, reducing the risk of overfitting but potentially increasing the risk of underfitting and missing important patterns in the data.

A second stopping criterion is tree depth, which limits the number of levels in the tree. A tree with a maximum depth of 1 would have only the root node. A tree without a maximum depth constraint could have as many levels as there are samples in the training data. Like the minimum node size, the maximum depth controls the tree’s complexity during the growing phase. Deeper trees can capture more complex relationships but are more prone to overfitting.

Some tree-growing algorithms also use a minimum improvement criterion, which requires that a split improve the objective function by at least a specified amount to be considered. This criterion prevents splits that provide only marginal improvements, which are likely to capture noise rather than true patterns. The minimum improvement threshold can be set based on statistical significance (see Section 19.7.3 below) or as a fixed value.

Another criterion is early stopping. In this approach, the tree is grown incrementally, and the performance on a validation set is monitored. When the validation performance starts to deteriorate, tree growth is stopped. This approach directly optimizes for generalization performance but requires additional computational resources for cross-validation.

Stopping criterion is a tuning parameter. To find a good fit for the data, that is, a model that does not under- or overfit, we should utilize some context-appropriate form of cross-validation (Section 10.1).

19.5 Reducing Tree Complexity

Trees grown to maximum depth typically overfit the training data, capturing noise rather than underlying predictive patterns. Pruning, the process of removing leaves in a tree, addresses this issue by simplifying an overgrown tree, removing branches that do not contribute significantly to the model’s predictive performance. Pruning is a critical step in tree construction, as it helps balance the trade-off between model complexity and predictive performance.

Unpruned classification trees often exhibit excellent performance on the training data but poor performance on new, unseen data. This discrepancy occurs because the tree has learned not only the underlying patterns in the data but also the random noise present in the training samples. By fitting the noise, the tree becomes overly complex and fails to generalize well.

Pruning aims to identify and remove the parts of the tree that are likely capturing noise rather than true patterns. The goal is to find a subtree of the original tree that minimizes the expected error on new data. This subtree should be complex enough to capture the important patterns in the data but simple enough to avoid fitting the noise.

The pruning process can be viewed as a form of regularization, similar to the penalty terms used in linear models like ridge regression or lasso. By penalizing complexity, pruning encourages the model to focus on the most important patterns in the data while ignoring noise.

19.5.0.1 Cost-Complexity Pruning

Cost-complexity pruning, also known as weakest-link pruning, is used in the CART methodology and other methods. It introduces a complexity parameter (\alpha) that penalizes the tree size:

R_\alpha(T) = R(T) + \alpha|T|

where R(T) is the misclassification rate (or another measure of error) on the training data, |T| is the number of terminal nodes in the tree, and \alpha controls the trade-off between accuracy and complexity. Larger values of \alpha result in smaller trees, as the penalty for additional nodes increases. This is another regularization method, not unlike the ones shown in Section 16.2.4.

The cost-complexity pruning process involves several steps:

  1. Grow a maximal tree T_{max} that fits the training data as well as possible, subject only to minimum node size constraints.

  2. Calculate a sequence of nested subtrees T_1, T_2, ..., T_k and corresponding complexity parameters \alpha_1, \alpha_2, ..., \alpha_k, where T_k is the root node (a tree with a single node). Each subtree T_i is obtained by pruning T_{i-1} in a way that minimizes the cost-complexity measure for the corresponding \alpha_i.

  3. Use an internal cross-validation to estimate the optimal complexity parameter \alpha_{opt} that minimizes the expected error on new data.

  4. Return the subtree T_i corresponding to \alpha_{opt} as the final pruned tree.

The sequence of nested subtrees is constructed by iteratively collapsing the internal node that yields the smallest increase in training error per node removed. This node is called the “weakest link” because it contributes the least to the tree’s performance relative to its complexity.

An internal cross-validation is used to select the optimal complexity parameter because the training error alone does not reliably indicate generalization performance. The complexity parameter that minimizes this resampled error is selected. This process is identical to previously discussed resampling procedures, but occurs inside the training routine.

Some tree implementations use a “one standard error” rule when selecting the optimal complexity parameter. Instead of choosing the complexity parameter with the minimum cross-validated error, they select the largest complexity parameter (resulting in the smallest tree) whose cross-validated error falls within one standard error of the minimum. This rule favors simpler models when the performance differences are not statistically significant.

This automated selection heuristic can be very effective, but one can also tune the cost-complexity parameter \alpha using the usual (external) resampling procedure of choice.

19.5.0.2 Pessimistic Pruning

Pessimistic pruning, used in the C4.5 algorithm, differs from cost-complexity pruning. Instead of using cross-validation, which can be computationally expensive, pessimistic pruning uses a statistical heuristic to estimate the expected error on new data.

The key insight of pessimistic pruning is that the apparent error rate on the training data is an optimistic estimate of the true error rate on new data. To correct for this optimism, C4.5 computes an upper confidence bound on the error rate, serving as a pessimistic estimate of the true error rate.

For a node with n training samples and \mathcal{E} misclassifications, the apparent error rate is \mathcal{E}/n. The pessimistic error rate is calculated using the upper bound of a confidence interval for this proportion:

\text{pessimistic error rate} = \frac{\mathcal{E} + 0.5 + \frac{z^2}{2} + \sqrt{z^2 \left[ (\mathcal{E} + 0.5)\left(1 - \frac{\mathcal{E} + 0.5}{n}\right) + \frac{z^2}{4} \right]}}{n + z^2} \tag{19.9}

where z_{\alpha/2} is the critical value from the standard normal distribution for the desired confidence level. C4.5 uses a default confidence level of 75% (corresponding to z_{\alpha/2} = 0.69), but this can be adjusted as a tuning parameter. Equation 19.9 is the Wilson Score technique for constructing a confidence interval (Wilson 1927).

The pessimistic pruning process involves a bottom-up traversal of the tree, starting from the terminal nodes. For each internal node, the algorithm compares the pessimistic error rate of the subtree rooted at that node with the pessimistic error rate if the node were converted to a terminal node (i.e., if the subtree were pruned). If pruning would reduce the pessimistic error rate, the subtree is pruned.

C4.5 also considers a more complex operation called “subtree raising,” where a subtree is moved up to replace its parent node. This operation is evaluated using the same pessimistic error rate criterion.

While pessimistic pruning lacks the theoretical foundation of cost-complexity pruning, it is computationally efficient and often works well in practice. The confidence level parameter controls the aggressiveness of pruning, with higher confidence levels resulting in more aggressive pruning.

19.6 Terminal Node Estimates

After the classification tree is built, each terminal node generates a raw probability for each class based on the training samples that fall into that node. For example, in the table shown in Section 19.1.1, the left split contains six samples, and based on the split results, the probability estimates are (5/6, 1/6).

For a new sample, the predicted class probabilities are determined by the terminal node it reaches after traversing the tree according to the splitting rules. The sample is typically assigned to the class with the highest probability in that node, which is equivalent to the majority class among the training samples in the node.

Some tree implementations also provide confidence intervals or other measures of uncertainty for these probability estimates. For example, C4.5 uses a pessimistic estimate of the error rate to provide upper and lower bounds on the class probabilities.

The class probability estimates from terminal nodes can be used not only for classification but also for ranking and probability estimation. When a terminal node contains few samples, the quality of the probability estimates decreases. It is possible to apply Laplace smoothing as previously described for naive Bayes or to use a regularized estimate, such as the Beta-binomial method from Equation 12.4 (as described for CatBoost).

Alternatively, C4.5 decision trees add the event rate to the numerator count and 1.0 to the denominator. For example, our previous left split, the raw probability for the “Yes” class was 5/6, and 7/12 of the data were truly forested. C4.5 would compute it as (5 + (7/12)) / (6 + 1) = 0.798 instead of the raw estimate of 0.833. In the context of that model, this terminal node estimate is called the “confidence value.”

For XGboost and LightGBM, once the split is made, the probability estimates for the leaves are estimated by:

\begin{align} \widehat{\mathcal{F}}_{ik} &= \sum_{i=1}^n\sum_{k=1}^{2}\frac{\tilde{e}_{ik}}{\widehat{p}_{i1} \widehat{p}_{i2} + \lambda_2}\notag \\ \widehat{p}_{ik} &= \frac{1}{1 + e^{-\widehat{\mathcal{F}}_{ik}}}\notag \end{align} \tag{19.10}

In these equations, \widehat{\mathcal{F}}_{ik} plays a part similar to a linear predictor and can range across the entire real line. The relationship between that value and the probability \widehat{p}_{ik} is a logistic link. We’ll see more of this format in the next chapter.

There is also the possibility of including an additional model inside each terminal node. For example, we could embed a logistic or multinomial regression model within each terminal node, trained on the data captured by the rule that leads to that node. One of the earliest proposals of this kind was for regression “model trees” described by Quinlan (1992). This specific model will be described in ?sec-reg-cubist. For classification models, there are similar models: Loh (2002), Landwehr, Hall, and Frank (2005), Chan and Loh (2004), Gama (2004), and Zeileis, Hothorn, and Hornik (2008).

19.7 Specific Tree-Based Models

19.7.1 CART

CART, short for Classification And Regression Trees, is the most popular approach for creating individual trees (Breiman et al. 1984). These trees are restricted to binary splits and use the following tactics discussed above:

  • Performing recursive greedy searches that typically optimize the Gini statistic.
  • Grouping two sets of categories when splitting qualitative predictors.
  • Reducing tree complexity via cost-complexity pruning.
  • Accounting for missing predictor data through surrogate splits.
  • Estimating proportions in the terminal nodes.

The primary tuning parameters are:

  • The cost-complexity parameter C_p. Values tend to range between 0.1 (usually a tree with a single split) and zero (no pruning). Grids for this parameter are often created with equal spacing on the log10 scale.
  • n_{min} and the maximum tree depth to reduce the complexity of the tree in the growing phase.

Despite multiple exhaustive searches, CART models are very efficient during training and and can generate predictions quickly. Like all models described in this chapter, CART can be easily converted to highly efficient code, especially via SQL.

These models require minimal maintenance; they require almost no data preprocessing, handle missing values seamlessly, and perform automatic feature selection.

The downside to these models is that their split selection is biased towards predictors with higher information content (i.e., many unique values). Despite this, CART is a staple of ML modeling.

To demonstrate with the forestation data, a CART model was tuned over cost-complexity (10-10 to 10-1), the tree-depth (1 to 30), and n_{min} (2 to 40) using a space-filling design with 25 candidates. Figure 19.7 shows the results based on minimizing the Brier score. The figure also computes several different aspects of the fitted trees: the number of terminal nodes, how many predictors (out of 16) were used in a split, and the average number of predictors in the rules. Like the Brier score values, the points in the visualization are averages of the trees fitted within each fold.

Six-panel scatter plot of CART tuning results. The left three panels show Brier score versus cost-complexity (log scale), minimal node size, and tree depth. The right three panels show Brier score versus number of active predictors, mean rule size, and number of terminal nodes. There is a clear quadratic trend in the terminal nodes panel, with optimal performance around 25 to 60 nodes.
Figure 19.7: Tuning results for the CART model showing how the tree tuning parameters (cost-complexity, n_{min}, and tree depth) relate to the Brier score. There are three other characteristics computed for each tree that reflect the average complexity of the trees created during resampling.

We can see that the cost-complexity parameter does influence performance, but the trend is somewhat muddled, so it is not the most important parameter. In general, though, smaller values have better Brier scores; the data set favors deep CART trees. The panel for the tree depth has the same interpretation. For n_{min}, smaller Brier scores are mostly likely to occur with values greater than n_{min} \approx 18.

The three panels on the right measure the tree’s complexity. We see a clear quadratic trend in the mean rule size and the number of terminal nodes. This indicates that the trees can under- and over-fit across these values. Tuning parameter combinations that produce around 25 to 60 terminal nodes have the best Brier scores. We can also see from the panel measuring how many “active parameters” in the tree (i.e., used in at least one split) that models that used the entire feature set did poorly. However, they can also perform poorly if fewer than 11(ish) predictors are used by the tree. These values are also related to the model’s complexity.

The numerically best tree used a cost-complexity value of 10-5.12, a depth of 15, and n_{min} = 33 training set samples. The model’s resampled performance statistics are shown in Table 19.3.

Model
Brier Score
ROC AUC
Estimate Lower CI Upper CI Distribution Estimate Lower CI Upper CI Distribution
CART 0.098 0.093 0.104 0.929 0.922 0.935
C5.0 0.097 0.091 0.103 0.925 0.918 0.932
Conditional Inference 0.098 0.092 0.103 0.924 0.917 0.931
Oblique (Ridge) 0.104 0.099 0.110 0.920 0.913 0.927
C5.0 Rules 0.108 0.101 0.114 0.897 0.889 0.905
Table 19.3: Performance statistics for the best candidates of each tree model. The limits are 90% confidence intervals. The distribution columns show density estimates of the bootstrap sampling distributions of the performance metrics.

When trained on the entire data set with the optimal parameters, the tree had 39 terminal nodes. Of the 16 predictors, 12 were actually used in the model: county, eastness, elevation, latitude, longitude, annual precipitation, roughness, annual maximum temperature, annual mean temperature, january minimum temperature, maximum vapor, and year. Across the different paths to the terminal nodes, the average number of predictors used in the rules was 6.41.

Figure 19.8 visualizes a few aspects of the model. While we know the resampled estimates of the Brier score and area under the ROC curve, panels (a) and (b) visualize the calibration and ROC curves, respectively. Both appear acceptable.

Three-panel diagnostic figure for the final CART model. Panel (a) shows a calibration curve that follows the diagonal reasonably well. Panel (b) shows an ROC curve that rises steeply and hugs the upper-left corner, indicating good discrimination. Panel (c) is a horizontal bar chart of variable importance scores, showing which predictors were most influential in the tree.
Figure 19.8: Aspects of the final CART model: (a) calibration assessment, (b) the ROC curve, and (c) variable importance scores. The first two plots are created from the pooled out-of-sample predictions.

Panel (c) shows the variable importance scores for the model. As CART splits the data to reduce impurity, it tracks the improvement in the Gini statistic for each predictor. This plot shows the cumulative effect of each predictor on the model7. The two predictors have the largest contributions: maximum vapor and county. Almost half of the predictors had little to no effect on the model.

19.7.2 C4.5 and C5.0

C4.5 is a popular tree-based machine learning model, described in detail by Quinlan (1993). C5.0 is its successor, developed as a closed-source product but released as open-source around 2010, and first described in Kuhn and Johnson (2013). Many aspects of these two models are the same. For example, we’ve previously described a few aspects of how this model creates trees:

  • information gain ratio (Equation 19.1)
  • pessimistic pruning
  • multiway splits for qualitative predictors
  • small adjustments to the terminal node estimates

Differences between the two are largely due to the addition of a boosting algorithm, which will be described in ?sec-cls-modern-boost. One difference between the two versions is that C5.0 includes an option to perform a global pruning phase at the end of tree construction. This uses a form of the one-standard-error rule from CART. Pessimistic pruning occurs locally, trimming the tree at a specific node. The global process considers the entire tree at once.

To start, we must compute what our “error budget.” This is the number of errors in the tree that is allowable under the one-standard-error rule. To compute this, C5.0 determines the number of errors (\mathcal{E}) out of n training set samples. Based on the equation for the Binomial variance, the equation for the standard error of the number of errors is:

\sigma_{\mathcal{E}} = \sqrt{\mathcal{E}\left(1 - \frac{\mathcal{E}}{n}\right)} \tag{19.11}

If the tree has n=100 training set points and \mathcal{E} = 10 misclassified samples, \sigma_{\mathcal{E}} = 3. This is the number of errors that would be acceptable if we were to choose a tree with one additional standard error or misclassified points. Note that C5.0 uses the resubstitution error rate; unlike CART, it does not perform internal cross-validation to obtain a better statistical estimate.

Once we have our budget of \sigma_{\mathcal{E}} errors, we traverse the tree from top to bottom and search for a subtree whose removal would reduce the error count below the error budget.

Common tuning parameters are:

  • Confidence value: Affects the pessimistic pruning, with larger values producing smaller trees.
  • n_{min}: The number of data points required to enact an additional split within a node. This limits the tree’s depth during the growing phase.

Of these two, n_{min} tends to have the larger effect. To demonstrate, C5.0 was tuned on the example data using 25 candidates for n_{min} ranging from 2 to 40 and confidence factors from the tree-depth 0.1 to 1. Figure 19.9 shows a plot of the results that is similar to the CART analysis.

Five-panel scatter plot of C5.0 tuning results. Panels show Brier score versus minimal node size, confidence factor (log scale), number of active predictors, mean rule size, and number of terminal nodes. There is a strong downward trend in the minimal node size panel, with smaller values yielding better Brier scores. The confidence factor has almost no effect on results.
Figure 19.9: Tuning results for the C5.0 model showing how the tree tuning parameters (n_{min} and the confidence factor) relate to the Brier score. There are three other characteristics computed for each tree that reflect the average complexity of the trees created during resampling.

In this case, there is a definitive trend between n_{min} and the Brier score. These data preferred small values of n_{min}, which makes as many splits as possible by lowering the threshold for how many training set samples are required to keep splitting the data. The confidence factor had almost no effect on the results. The other panels in the plot indicate that the data preferred to use as many predictors as possible, but still used fewer than the CART tree. Also, the number of predictors in the rule was associated with small Brier scores, as was a larger number of terminal nodes. However, the number of terminal nodes produced by C5.0 was much smaller than CART. The final model was fit using a threshold of n_{min} = 5 samples to continue splitting. The resampled performance of this configuration is shown in Table 19.3 and is very similar to CART.

19.7.3 Conditional Inference Trees

Conditional inference trees (Hothorn, Hornik, and Zeileis 2006) take a more statistically motivated approach to building a tree model. When creating a new split, a statistical test is applied to each predictor to assess whether there is a statistically significant relationship between each predictor and the outcome. If there is a statistically significant relationship, the most important predictor is selected, and an appropriate split is made (usually using the \chi^2 test statistic).

There are a few interesting aspects to this process. First, there is no pruning phase. As splitting continues, the amount of data per node decreases, and so does the statistical test’s ability to identify a predictor as “significant.” To make this judgment, a p-value cutoff is used to determine whether the splitting process should proceed. This stopping criterion can be coupled with the usual thresholds for the number of data points required to split, the maximum tree depth, and so on. The p-value cutoff can be a tuning parameter. Additionally, to be more conservative, the p-values between predictors can be adjusted for multiple testing using a Bonferroni or similar correction (Sedgwick 2012).

For example, during the first pass of the splitting process, the entire training set is used to evaluate the significance of each predictor (regardless of data type). Figure 19.10 shows the results for each predictor. These test statistics are very large, and their corresponding p-values can be smaller than 10-288, indicating extremely strong statistical significance. In this case, the county predictor was chosen for splitting. The same figure shows another pass much later in the splitting process, where only n = 22 points are available. The test statistics are much smaller, and their corresponding p-values range from 0.6753 to 1. At this point, the best p-value is no longer significant enough, and the splitting process halts.

Two-panel dot plot showing conditional inference tree predictor screening. The left panel displays very large test statistics for all predictors at the root node with a large sample size, where the best predictor has an overwhelmingly significant result. The right panel shows much smaller test statistics at a deeper node with a small sample size, where no predictor reaches statistical significance and splitting halts.
Figure 19.10: Two examples of conditional inference tree screening. The left panel is the initial split with a large sample size and test statistics. The right panel shows a node much deeper in the tree with a smaller sample size and no statistically significant predictors.

What statistical test is used to evaluate the predictors? Strasser and Weber (1999) developed mathematical methods for permutation tests of various data types and found a unifying framework. Their method shows that many statistical tests fall within the framework of linear statistics, and that their permutation test statistics can have the same distributions. In essence, we can conduct many different types of statistical tests and still make an “apples-to-apples” comparison (either using the test statistics or their p-values). For classification, Table 19.4 shows different combinations of predictor and outcome types, their corresponding test descriptions, and any connections to existing statistical methods. Note that the permutation tests are analogs to these statistical methods. For example, the test comparable to the Kruskal-Wallis test is evaluating the same hypothesis but using a different computational approach (i.e., permutation tests).

Predictor Type Test Type Analog
Outcome: Quantitative
Quantitative Correlation test Pearson correlation (Pearson 1931)
Qualitative ANOVA-type test Kruskal-Wallis(Kruskal and Wallis 1952)
Ordered Qualitative linear-by-linear trend test Jonckheere-Terpstra (Jonckheere 1954)
Outcome: Binary
Quantitative Two-sample location test Wilcoxon rank-sum (Mann and Whitney 1947)
Qualitative Independence test Chi-squared
Ordered Qualitative Trend test Cochran-Armitage (Cochran 1954; Armitage 1955)
Outcome: Multiclass
Quantitative Multi-sample location test Kruskal-Wallis (Kruskal and Wallis 1952)
Qualitative Independence test Chi-squared
Ordered Qualitative Multi-way trend test none
Outcome: Ordered Qualitative
Quantitative Correlation test with scores Spearman-type (Fieller, Hartley, and Pearson 1957)
Qualitative Location test on scores none
Ordered Qualitative Linear-by-linear test (Agresti 2002)
Table 19.4: The permutation test types for different outcome and predictor variables, along with their conventional analogs.

For example, when evaluating numeric predictors such as longitude or annual precipitation, we compare the centers of their distributions, similar to comparing the medians per class. For our county predictor, we use the previously discussed \chi^2 statistic. For ordered qualitative columns (e.g., low, medium, high), the categories are converted to rank “scores” and are treated as numeric.

The primary benefit of using this framework for data splitting is that the predictor selection process is completely unbiased. Since permutation statistics have the same distribution across data types and characteristics, these factors do not affect their probability of selection for splitting. It also has the benefit of producing trees that are more stable (i.e., lower model variance) than those from most other models described here.

Using the forestation data, conditional inference trees were tuned over the following parameters:

  • Minimum node size (n_{min}): 2 to 40.
  • Tree depth: 1 to 15.
  • P-value threshold to continue splitting: 0.01 to 0.2.
  • Bonferroni p-value correction: enabled or disabled.

The primary implementation is constrained to using qualitative predictors with fewer than 31 possible values. For the county predictor, we used an “othering” approach that limits the number of counties to fit this constraint. That process was not tuned, but resampled along with the rest of the tuning parameters.

Performance statistics for 25 candidate parameters are shown in Figure 19.11. For this model, the tree depth is the strongest tuning parameter for these data. The p-value correction and n_{min} do not seem to make a significant difference to the results. The choice of whether or not to correct the p-values ostensibly does not affect the Brier score in a meaningful way. However, using the correction definitely helps prevent overfitting by reducing the number of predictors and terminal nodes. When the correction is not used, we see a trend of poor performance begin as the model becomes excessively complex.

Six-panel scatter plot of conditional inference tree tuning results, with points distinguished by whether Bonferroni p-value correction is used. Panels show Brier score versus minimal node size, tree depth, p-value threshold, number of active predictors, mean rule size, and number of terminal nodes. Tree depth has the strongest effect on performance, while the p-value correction reduces model complexity without substantially affecting accuracy.
Figure 19.11: Tuning results for conditional inference trees showing how the tree tuning parameters (p-value threshold, p-value correction, n_{min}, and tree depth) relate to the Brier score. There are three other characteristics computed for each tree that reflect the average complexity of the trees created during resampling.

The conditional inference tree has equivalent performance to the previous trees but with reduced complexity and uses fewer predictors.

19.7.4 Globally Estimated Trees

So far, we’ve used greedy methods to construct trees. At each split, these models make locally optimal decisions without reconsidering any previous splits to ensure global optimality. Although Murthy and Salzberg (1995) found that greedy optimization performed well in their experiments, it is worth considering alternatives.

There are also global methods for tree construction. One will be discussed in the next chapter (Section 20.4) that uses Bayesian estimation. Another approach is to use evolutionary optimization methods, such as genetic algorithms, to simultaneously configure all of the splits in the tree.

How do we encode the tree in a manner that such optimization tools can be used? One approach is to specify the largest size tree that we are interested in as a placeholder. Specifying a maximum depth d_{max} for binary splits indicates that there will be, at most, 2^{d_{max}}-1, terminal nodes. A vector-based structure to contain the tree details (a.k.a. the chromosome in GAs) would bind together for the split variable index, the split values, and the rule type (i.e., \le, >, =, or \in).

A basic genetic algorithm can be used to generate new trees via reproduction (e.g., crossover, mutation). However, it is very possible that generic genetic operations can result in nonsensical or infeasible tree structures.

Alternatively, Chipman, George, and McCulloch (1998) describe sets of operations to modify tree structures. Figure 19.12 shows examples of how a binary tree with two splits and three terminal nodes might be refactored. The first is growth, in which a terminal node is picked at random, and a new split is created. Pruning removes a randomly chosen split. We can also change a randomly chosen split by modifying the split variable and/or its split value. Finally, it might be helpful during the search process to swap adjacent nodes in a tree.

Diagram showing four tree mutation operations applied to a simple binary tree. Growth adds a new split to a terminal node. Pruning removes a split and collapses its children. Change modifies the split variable or value at a node. Swap exchanges two adjacent nodes. Affected nodes are highlighted in a contrasting tone.
Figure 19.12: Methods of mutating trees: growing, pruning, mutating, and swapping nodes. Orange nodes represent the portions of the tree affected by the changes.

For example, Grubinger, Zeileis, and Pfeiffer (2014) describes a specific genetic algorithm called evtree using these operators8 to evolve a tree where the objective/fitness function is a penalized misclassification cost metric. The penalty is

\mathcal{P}(T;\alpha) = \alpha\; (d_{max} + 1)\; log(n)

where T is the tree and \alpha is a tuning parameter. Decreasing \alpha should increase the depth of the final tree. By default, the algorithm uses 100 individuals in the population, 10,000 generations, and the following probabilities for operators:

  • Growing: 20%
  • Pruning: 20%,
  • Cross-over: 20%
  • Change split variable and point: 20%
  • Change split point: 20%

There are a variety of similar evolutionary algorithms that can be used. See Papagelis and Kalles (2001), DeLisle and Dixon (2004), and Barros et al. (2011).

19.7.5 Oblique Trees

Recall that oblique trees create non-rectangular splits of the data by using rules that involve more than one predictor at a time.

Many oblique CART trees deterministically compute predictor coefficients and, given these values, search to optimize the threshold parameter to minimize model impurity at each node. Section 5.3 of Breiman et al. (1984) first described the notion of combining predictors within a single rule. Another early technique was OC1 (Murthy, Kasif, and Salzberg 1994).

OC1 begins with a standard parallel-axis tree and uses its impurity estimate as a baseline value for the splitting statistic. It initializes the first oblique split coefficients with these split values. Over the course of iterations, the coefficients are perturbed. If the newly perturbed coefficient set yields better results, accept those values and move to the next split. This results in a basic hill-climbing approach to constructing trees.

More recently, Jaeger et al. (2022) extended the work in Jaeger et al. (2019) to create a framework for oblique classification trees. Although this work is designed to produce ensembles, it can also be used to generate single oblique trees (and was used to create Figure 19.5).

The tree construction is fairly similar to the process used by CART; the Gini statistic is the default criterion for selecting the best split using a greedy search. However, at each node, the model fits a logistic regression model to determine the linear combination that defines the oblique split. At the end of this stage, we have the left-hand side of the rule. For example, we might estimate the linear combination to be 2A - 7B + C but the rule needs to be finalized by completing the inequality 2A - 7B + C > threshold.

After the logistic model, a secondary step performs a small grid search over threshold values on the right-hand side of the rule. The threshold value associated with the best Gini statistic is used to finalize the rule. This implementation does not consider the gain relative to the parent node; it only looks at child purity, ignoring parent impurity.

For subsequent splits, predefined termination criteria are used to halt the growing process:

  • The number of data points in the current node must be greater than n_{min} (where 10 samples is the default).
  • The child nodes must have at least n^*_{min} samples in the leaves.
  • The improvement in the Gini statistic should be greater than a preset threshold (defaulted to zero).

Each of these criteria can be optimized as tuning parameters.

This implementation does not prune the trees. Also, the quantitative predictors are centered and scaled by default, and a reference cell encoding method is applied to qualitative predictors. This is reflected in Equation 19.8.

Regarding the logistic regression models, the training process might use an abbreviated number of optimization rounds to estimate the logistic model – perhaps a single step of iteratively reweighed least squares. Alternatively, following Truong (2009), a regularized logistic regression model could be used to estimate a sparse linear combination. Jaeger et al. (2022) enables the specification of the glmnet mixing parameter \alpha in Equation 16.10. To determine the amount of regularization, the penalty is chosen based on an upper limit on the glmnet model’s degrees of freedom for each penalty value. By default, the degrees of freedom are set to \lceil \sqrt{p} \rceil where p is the number of parameters used in the linear combination.

Also, recall that logistic regression models are specific to classes with two possible values. For C > 2 outcome classes, the model creates a set of one-against-all models. It identified the outcome class that is determined to be most “splittable” (i.e., has sufficient events and non-events). Based on this, it fits a logistic regression for the favored class versus the others.

To demonstrate using the forestation data, the model was tuned over n_{min} and also by the logistic regression method. A regular grid was used, combining 50 values of n_{min} with four model configurations: standard logistic regression and glmnet models with \alpha \in \{0.0, 0.5, 1.0\}9.

Figure 19.13 shows the results. First, none of the logistic models preferred small values of n_{min}; values of at least 25 showed good (and stable) performance. The unpenalized model worked well, but the logistic regression using the ridge penalty showed the best results. The models with mixtures of \alpha \in \{0.5, 1.0\} did not perform as well, especially with smaller values of n_{min}. The right-hand panel shows the relationship between the Brier scores and the number of terminal nodes. Fewer are better for this model and these data. We can see that the Brier scores rapidly inflate when there are more than 50 terminal nodes in the tree.

Two-panel scatter plot with smoothed trend lines for oblique tree tuning results. The left panel shows Brier score versus minimal node size, and the right panel shows Brier score versus number of terminal nodes, with separate trend lines for unpenalized, ridge, glmnet, and lasso logistic regression methods. All methods show similar trends, with lower Brier scores at smaller node sizes and more terminal nodes.
Figure 19.13: Tuning results for the oblique tree models showing how n_{min} relates to the Brier score and the number of terminal nodes for each method of estimating the logistic regression model.

The performance metrics for the tree using n_{min} = 30 samples and the ridge penalty are shown in Table 19.3; the Brier and ROC metric values are slightly worse than the trees using rectangular splits.

We’ll discuss ensembles of oblique trees in the next chapter.

19.8 Rule-Based Models

Rule-based models can be created in different ways. The most frequently used strategy is to start with a decision tree and use the paths to the terminal nodes as an initial set of rules. Examples of this approach are Cohen (1993) and Frank and Witten (1998). C4.5 was one of the first methods to do this. We’ll describe what C5.0 does, but it is mostly the same as its predecessors. Also, this section focuses on a rule set derived from a single tree. The next chapter describes how rule set ensembles are created.

We start with a C5.0 classification tree and populate the initial rule set. From this initial set, the process to finalize the model is to

  1. Simplify and/or prune the conditions of individual rules.
  2. Use an optimization method to determine which rules should be removed or added to the rule set.

To process the individual rules, we can first simplify them by removing logical redundancies. For example, if a rule was A < 5 & A < 3, we can achieve the same effect with just A < 3. The simplification can also involve different predictors, and Quinlan (1993) presents a general method for identifying redundancies using cross-tabulations.

The second strategy for an individual rule is to determine which conditions of the rule will have acceptable performance; how many can we remove without significantly increasing errors? The algorithm iterates over all conditions in a specific rule and greedily removes those that do not improve the objective function. C5.0Rules uses an information gain statistic to make the determination on which (if any) condition to remove. For simplicity, we’ll show the process using a simple accuracy statistic which, for this example, has the same results as information gain.

For example, consider this rule:

max vapor < 1247.5 & roughness < 12.5 & max annual temp < 15.965

The training set points that are “active” under this rule are predominantly forested, so the class prediction suggested by this rule is “forested”.

This baseline rule has a resubstitution accuracy of 59.7%, using the correction defined in Section 19.6, which it calls the “confidence value”. What happens when we remove each of the conditions in the rule but keep the same predicted value?

  • max vapor < 1247.5 removed: 37.4% accuracy
  • roughness < 12.5 removed: 84.4% accuracy
  • max annual temp < 15.965 removed: 52.9% accuracy

Removing the roughness term improves the rule’s performance, so we exclude it from the final model. The new baseline accuracy is 84.4% .

We continue the process by removing each of the two remaining conditions, one at a time:

  • max vapor < 1247.5 removed: 68.5% accuracy
  • max annual temp < 15.965 removed: 82.4% accuracy

Since neither operation increases the accuracy, the rule is finalized with two conditions.

Similar to C5.0’s approach to combining categories when splitting, the global optimization process tries to minimize a statistic based on the Minimum Description Length (MDL) principle (Quinlan 1993,sect 5.2). This statistic is another trade-off between the complexity of the rule set and the error rate that penalizes overfitting. We can conceptualize it as the sum of the model cost and the error cost. Assuming equal costs for different types of errors, the model cost is an information-theoretic value that is the number of bits needed to encode the entire rule set. As the rule set grows, so does the model’s cost10. The error cost is a function of the resubstitution error rate of the current rule set.

Once the rules and the members of the rule set are determined, the model can make predictions. The complication is that many of the rules are active for any specific data point being predicted, and it is also possible that no rules are active for a new data point.

The algorithm orders the rules so that it prioritizes better quality rules, as measured by their terminal node estimate (\widehat{\pi} ) and the number of training set samples covered/supported by the rule (n_{rule}):

Utility = \widehat{\pi} \, \log(n_{rule} + 1)

When a new sample is predicted, C5.0Rules finds the first active rule and uses the dominant class in that subset of data as the hard class prediction and \widehat{\pi} as the probability estimate.

What occurs when no rules are active? In this case, the model finds locates training set points that are not matched by any rules. The class frequencies are computed from these data and their dominant class is used as the hard class prediction.

The model also intrinsically handles missing values. When evaluating whether a rule applies to a sample with missing values, C5.0Rules treats the rule as partially satisfied, with the degree of satisfaction determined by the distribution of non-missing values in the training data. This partial satisfaction contributes to the computation of \widehat{\pi}. During prediction, multiple rules may be partially satisfied due to missing values. C5.0Rules aggregates the \widehat{\pi} from all relevant rules to make the final prediction, similar to the weighted voting approach used in C5.0 trees. This approach allows rule-based models to handle missing values without requiring explicit imputation, maintaining the flexibility and robustness of the original tree model.

Figure 19.14 shows the results of tuning the C5.0Rules model on the forstation data using the same approach as the C5.0 tree optimization. The results are very similar; more rules reduce the Brier score, and the confidence factor has little effect on performance. Unfortunately, n_{min} cannot be any lower, so the best candidate is at the model’s maximum complexity.

Five-panel scatter plot of C5.0Rules tuning results. Panels show Brier score versus minimal node size, confidence factor (log scale), number of active predictors, number of rules, and mean rule size. More rules and smaller node sizes produce lower Brier scores. The confidence factor has little effect on performance.
Figure 19.14: Tuning results for the C5.0Rules model showing how the tree tuning parameters (n_{min} and the confidence factor) relate to the Brier score. There are three other characteristics computed for each tree that reflect the average complexity of the rules created during resampling.

Referring back to Table 19.3, we see that this model, out of all of the methods discussed in this chapter, had the worst Brier score and area under the ROC curve. However, it’s till better than the naive Bayes model.

19.9 The Instability of Trees

Tree models, especially those that use the standard greedy training approach, can be highly unstable (Breiman 1996). This means that the variance of these models is higher than many other models11. Consider CART and C5.0 trees, which search all split points across all predictors multiple times. The first split point is chosen as the numerically best split. However, as seen in Figure 19.1, the chosen split point (or predictor) might be influenced by random noise; other split points might perform equally well and achieve virtually the same performance. With a slightly different data set, the split might have been very different, to the point where individual data sets might have a considerable impact on the results.

For trees, each split is greatly influenced by the split(s) that came before. If random noise in the data affects the first split, a small change in the data might lead to a different split, which will likely change all subsequent splits12. If the interpretability of the tree model is important, the idea that you could get a completely different tree by adding or subtracting a few training set points can be problematic (Turney 1995).

However, some tree-based models are less unstable. Recall that conditional inference trees use a statistical test using all of the current training set data in the node to pick the predictor to split. After this, the standard search within that predictor is used (and is unstable). This decoupling enables these methods to have more reliable models. Also, the previously described oblique tree method in Jaeger et al. (2022) fits a logistic regression model to obtain its linear predictor, which it uses as a split. Conditional inference trees employ a more complex but substantially more stable process than CART or C5.0, with even greater reliability when regularization is applied.

This is a significant disadvantage for trees, but, as we’ll see in the next chapter, it makes them ideal for creating ensemble models. Perhaps surprisingly, ensembles have specific mechanisms to deliberately make their trees more unstable to improve performance.

19.10 Comparisons to Previous Models

How well do these models perform relative to previously discussed techniques? Currently, the KNN model has the smallest Brier score. Figure 19.15 visualizes the Brier score results across models. The Bayesian methods in Section 14.1.3 were used to estimate the probability distributions for each of the best candidates for each model. The intervals show the 0.05, 0.50, and 0.95 quantiles.

Two-panel figure comparing all forestation models discussed so far. The left panel is a dot-and-whisker plot of Brier scores with 90% credible intervals, ordered from best to worst. Tree-based models cluster at higher Brier scores than the leading models. The right panel shows horizontal bars for the probability of practical equivalence, indicating that single tree models have virtually no chance of matching the currently best-performing models.
Figure 19.15: Comparisons of current forestation models. The intervals are 90% credible intervals. The bars on the right show the probability of practical equivalence to the current best model, assuming that a difference in Brier scores of +/- 0.01 is practical.

The results show that, so far, the tree-based models discussed have mediocre results; they are not the worst, but they are not competitive with some of the previous models. The figure shows that they have virtually no chance of being practically equivalent to the higher performing model. That isn’t much of a surprise. Single tree-based models are not known for their outstanding effectiveness.

It’s also important to realize that the differences in the Brier scores and ROC AUC values for the five specific models discussed in this chapter are only different at the second or third decimal place. Their performance is effectively equivalent to one another.

While the results for these models might be disappointing, their value lies in the next chapter where ensembles of these models are described. These are some of the most effective tools for tabular data.

Chapter References

Agresti, A. 2002. Categorical Data Analysis. Wiley-Interscience.
Armitage, P. 1955. Tests for Linear Trends in Proportions and Frequencies.” Biometrics 11 (3): 375–86.
Barros, R, M Basgalupp, A De Carvalho, and A Freitas. 2011. A Survey of Evolutionary Algorithms for Decision-Tree Induction.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (3): 291–312.
Breiman, L. 1996. Heuristics of Instability and Stabilization in Model Selection.” The Annals of Statistics 24 (6): 2350–83.
Breiman, L, J Friedman, C Stone, and RA Olshen. 1984. Classification and Regression Trees. CRC Press.
Chan, KY, and WY Loh. 2004. LOTUS: An Algorithm for Building Accurate and Comprehensible Logistic Regression Trees.” Journal of Computational and Graphical Statistics 13 (4): 826–52.
Chen, T, and G Guestrin. 2016. XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94.
Chipman, H, E George, and R McCulloch. 1998. Bayesian CART Model Search.” Journal of the American Statistical Association 93 (443): 935–48.
Cochran, W. 1954. Some Methods for Strengthening the Common \chi^2 Tests.” Biometrics 10 (4): 417–51.
Cohen, W. 1993. E Cient Pruning Methods for Separate-and-Conquer Rule Learning Systems.” Learning 140: 160.
DeLisle, R, and S Dixon. 2004. Induction of Decision Trees via Evolutionary Programming.” Journal of Chemical Information and Computer Sciences 44 (3): 862–70.
Dobra, A, and J Gehrke. 2001. Bias Correction in Classification Tree Construction.” In Proceedings of the Eighteenth International Conference on Machine Learning, 90–97.
Fieller, E, H Hartley, and E Pearson. 1957. Tests for Rank Correlation Coefficients. I.” Biometrika 44 (3/4): 470–81.
Fisher, W. 1958. On Grouping for Maximum Homogeneity.” Journal of the American Statistical Association 53 (284): 789–98.
Frank, E, and I Witten. 1998. Generating Accurate Rule Sets Without Global Optimization.” In Proceedings of the Fifteenth International Conference on Machine Learning, 144–51.
Gama, J. 2004. Functional Trees.” Machine Learning 55 (3): 219–50.
Grubinger, T, A Zeileis, and KP Pfeiffer. 2014. Evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in r.” Journal of Statistical Software 61 (1): 1–29.
Hothorn, T, K Hornik, and A Zeileis. 2006. Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3): 651–74.
Jaeger, B, L Long, D Long, M Sims, J Szychowski, YI Min, L Mcclure, G Howard, and N Simon. 2019. Oblique Random Survival Forests.” The Annals of Applied Statistics 13 (3): 1847.
Jaeger, B, S Welden, K Lenoir, and N Pajewski. 2022. aorsf: An r Package for Supervised Learning Using the Oblique Random Survival Forest.” Journal of Open Source Software 7 (77): 4705.
Jonckheere, A. 1954. A Distribution-Free k-Sample Test Against Ordered Alternatives.” Biometrika 41 (1/2): 133–45.
Ke, G, Q Meng, T Finley, T Wang, W Chen, W Ma, Q Ye, and TY Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc.
Kruskal, W, and A Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association 47 (260): 583–621.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Kuhn, M, and K Johnson. 2019. Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press.
Landwehr, N, M Hall, and E Frank. 2005. Logistic Model Trees.” Machine Learning 59 (1): 161–205.
Loh, WY. 2002. Regression Tress with Unbiased Variable Selection and Interaction Detection.” Statistica Sinica, 361–86.
Loh, WY. 2014. Fifty Years of Classification and Regression Trees.” International Statistical Review 82 (3): 329–48.
Loh, WY, and YS Shih. 1997. Split Selection Methods for Classification Trees.” Statistica Sinica, 815–40.
Mann, H, and D Whitney. 1947. On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other.” The Annals of Mathematical Statistics, 50–60.
McHugh, M. 2013. The Chi-Square Test of Independence.” Biochemia Medica 23 (2): 143–49.
Murthy, S, S Kasif, and S Salzberg. 1994. A System for Induction of Oblique Decision Trees.” Journal of Artificial Intelligence Research 2: 1–32.
Murthy, S, and S Salzberg. 1995. Decision Tree Induction: How Effective Is the Greedy Heuristic? In KDD, 222–27.
Papagelis, A, and D Kalles. 2001. Breeding Decision Trees Using Evolutionary Techniques.” In ICML, 1:393–400.
Pearson, E. 1931. The Test of Significance for the Correlation Coefficient.” Journal of the American Statistical Association 26 (174): 128–34.
Prokhorenkova, L, G Gusev, A Vorobev, AV Dorogush, and A Gulin. 2018. CatBoost: Unbiased Boosting with Categorical Features.” In Advances in Neural Information Processing Systems, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett. Vol. 31.
Quinlan, R. 1986. Induction of Decision Trees.” Machine Learning 1 (1): 81–106.
Quinlan, R. 1992. Learning with Continuous Classes.” In 5th Australian Joint Conference on Artificial Intelligence, 92:343–48.
Quinlan, R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
Quinlan, R, and R Rivest. 1987. Inferring Decision Trees Using the Minimum Description Length Principle.” Laboratory for Computer Science. Massachusetts Institute of Technology.
Ranka, S. 1998. CLOUDS: A Decision Tree Classifier for Large Datasets.” Knowledge Discovery and Data Mining.
Sedgwick, P. 2012. Multiple Significance Tests: The Bonferroni Correction.” BMJ 344.
Strasser, H, and C Weber. 1999. On the Asymptotic Theory of Permutation Statistics.” 27. Report Series SFB "Adaptive Information Systems and Modelling in Economics and Management Science. SFB Adaptive Information Systems; Modelling in Economics; Management Science.
Strobl, C, AL Boulesteix, A Zeileis, and T Hothorn. 2007. Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8: 1–21.
Truong, A. 2009. Fast Growing and Interpretable Oblique Trees via Logistic Regression Models.” PhD thesis, Oxford University, UK.
Turney, P. 1995. Bias and the Quantification of Stability.” Machine Learning 20 (1): 23–33.
White, A, and W Liu. 1994. Bias in Information-Based Measures in Decision Tree Induction.” Machine Learning 15 (3): 321–29.
Wickramarachchi, D, Blair L Robertson, M Reale, C John Price, and J Brown. 2016. HHCART: An Oblique Decision Tree.” Computational Statistics & Data Analysis 96: 12–23.
Wilson, E. 1927. Probable Inference, the Law of Succession, and Statistical Inference.” Journal of the American Statistical Association 22 (158): 209–12.
Zeileis, A, T Hothorn, and K Hornik. 2008. Model-Based Recursive Partitioning.” Journal of Computational and Graphical Statistics 17 (2): 492–514.

  1. Some models, such as C4.5, can produce more than two groups, and these will be discussed on a case-by-case basis.↩︎

  2. Quinlan’s original classification tree model is called C4.5, while C5.0 is the latest version. The models are very similar, and until Section 19.7.2 we will refer to the model as C4.5 (and then C5.0 from that point onward).↩︎

  3. This statistic can also be applied to tables of any dimensionality where the number of rows and columns is greater than 2.↩︎

  4. We say “inspired” because regularization is a property that arises from an optimization problem. For XGBoost and LightGBM, these modifications are not derived in that way; they just emulate what previously published tools have done.↩︎

  5. The specific statistics are discussed momentarily.↩︎

  6. Historically called multivariate trees or linear machines.↩︎

  7. These computations should not include the effect of surrogate splits.↩︎

  8. However, it does not use the swapping operation and considers switching the split predictors and split values separate operations.↩︎

  9. Recall the $values of 0.0, 0.5, and 1.0 correspond to the ridge regression, glmnet, and LASSO models↩︎

  10. This statistic has different costs for different condition types (e.g., numeric, categorical) and includes a term related to the number of possible orderings.↩︎

  11. Recall the variance-bias discussion in Section 8.4.↩︎

  12. This is especially likely if there are correlated predictors in the training set.↩︎