17 Complex Nonlinear Boundaries
This chapter focuses on several classes of models that naturally produce nonlinear boundaries, such as K-nearest neighbors, and support vector machines. A separate category of models, which use tree structures, also has inherently nonlinear classification boundaries. We’ll discuss those models in Chapter 19 and Chapter 20. We’ll also save neural networks for the next chapter; their prominence in machine learning requires an extensive discussion.
In this chapter, several recurring motifs are present across different types of models. First, this involves the use of an operation that utilizes the positive part of a function, known as “hinge” or “rectifier” functions. We’ll see these in Section 17.1.2 and Section 18.1.1. Another element that will be encountered across different models is the kernel function, previously seen in Section 12.6.2. These functions are used to quantify the distance or similarity between data points (or, more specifically, vectors). Both the K-nearest neighbor and the support vector machine (SVM) models rely on these functions. Finally, the dot product between vectors plays a critical role in the important attention mechanism in neural networks (?sec-attention) as well as in SVM models (Section 17.4).
To get started, the first chapter summarizes an old, traditional model for classification, linear discriminant analysis (LDA), and shows how it can be effectively modified to produce nonlinear class boundaries.
17.1 Nonlinear Discriminants
Discriminant analysis is a classical statistical method for classification. While its prominence in the statistical learning literature has diminished in recent decades, understanding its foundations provides valuable context for several modern and highly effective classification methods. This section focuses on linear discriminant analysis (LDA).
Linear discriminant analysis (McLachlan 2005; Murphy 2012) produces linear classification boundaries and can be motivated via Bayes’ Rule. Let’s say that our training set predictors for class k are contained in the random variable vector X_k with p elements. LDA assumes that the predictor distributions within each class follow a multivariate Gaussian distribution with class-specific mean vectors but a common covariance matrix across all classes. Formally, for class k, the conditional distribution of the predictors given class membership is X_k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}).
To classify a new observation with predictor vector \boldsymbol{x}, LDA applies Bayes’ theorem to compute the posterior probability of class membership. Given a prior probability Pr[Y = k] for class k, the posterior probability is:
Pr[Y = k |\boldsymbol{x}] =\frac{Pr[Y = k]Pr[\boldsymbol{x}|Y = k]}{\sum\limits_{l=1}^C Pr[Y = l]Pr[\boldsymbol{x}|Y = l]} \tag{17.1}
This equation is straightforward to calculate. The parameters \boldsymbol{\mu}_k and \boldsymbol{\Sigma} are estimated from the training data using their maximum likelihood estimates: the class-specific sample mean vectors \bar{\boldsymbol{x}}_k and the pooled sample covariance matrix \boldsymbol{S}, respectively. The class-conditional densities (Pr[\boldsymbol{x}|Y]) are then evaluated using the multivariate normal probability density function, which modern statistical software computes efficiently. In practice, computational efficiency can be improved by calculating only the numerator of Equation 17.1 for each class, then normalizing these values to ensure the posterior probabilities sum to 1. For a new observation \boldsymbol{x}, the class with the maximum posterior probability is assigned as the predicted class.
Alternatively, we can frame the prediction in terms of the discriminant function:
D_k(\boldsymbol{x}) = \boldsymbol{x}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \frac{1}{2}\boldsymbol{\mu}_k'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log\left(Pr[Y = k]\right) \tag{17.2}
The discriminant values are conceptually similar to the linear predictor in logistic regression, taking values on the real line, with larger values corresponding to higher posterior probabilities. The predicted class is determined by identifying which discriminant function yields the largest value.
A notable advantage of LDA is its explicit modeling of the predictor covariance structure through \boldsymbol{\Sigma}. This characteristic renders the method robust to moderate levels of multicollinearity among predictors. Furthermore, the framework naturally extends to multi-class problems.
However, LDA has important limitations. The method requires that \boldsymbol{\Sigma} be non-singular (invertible). When the training set contains fewer observations than predictors (n < p) or when perfect linear dependencies exist among predictors, the sample covariance matrix becomes singular and the discriminant functions cannot be computed. Additionally, the Gaussian distributional assumption is fundamental to the method’s theoretical justification. While binary indicator variables can technically be included in the model (as they are numeric), their discrete, non-Gaussian nature violates the model assumptions and undermines the probabilistic interpretation of the results. For datasets with categorical predictors or high-dimensional settings where n < p, alternative classification methods may be more appropriate as we will see later.
As presented, LDA is limited as a classification tool. Its assumption of a common covariance matrix across classes restricts it to producing linear decision boundaries. While nonlinear boundaries could be induced through feature engineering (e.g., adding polynomial or interaction terms), such an approach would be better done using logistic regression.
Nevertheless, the discriminant analysis framework has led to extensions that address the various limitations of LDA. Quadratic discriminant analysis (QDA) (Ghojogh and Crowley 2019) relaxes the assumption of a common covariance matrix, allowing each class to have its own covariance structure: X_k \sim N(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). This modification results in quadratic decision boundaries, providing greater flexibility at the cost of estimating additional parameters. Robust variants of both LDA and QDA have been developed to mitigate the influence of outliers on parameter estimation (Hubert, Raymaekers, and Rousseeuw 2024), addressing the well-known sensitivity of maximum likelihood estimates to extreme observations. Additionally, regularized discriminant analysis methods (Friedman 1989; Pang, Tong, and Zhao 2009) introduce penalty terms that shrink the covariance matrix estimates, improving stability and predictive performance, particularly when sample sizes are modest relative to the number of predictors.
Two extensions of discriminant analysis are practically useful. The first, naive Bayes classification, represents a simplification that assumes conditional independence among predictors within each class. The second, flexible discriminant analysis (FDA), is a generalization of the LDA framework. FDA can accommodate nonlinear decision boundaries, perform implicit feature selection, and incorporate various modeling strategies, making it a considerably more versatile tool for classification problems. These methods are discussed in the following sections.
17.1.1 Naive Bayes
While LDA estimates the class-conditional density Pr[\boldsymbol{x}|Y = k] using a multivariate Gaussian distribution, naive Bayes (Hand and Yu 2001) adopts an alternative approach by decomposing this joint density into a product of univariate marginal distributions:
Pr[\boldsymbol{x}|Y = k] = \prod_{j=1}^p Pr[x_j|Y = k] \tag{17.3}
This formulation assumes conditional independence among all predictors within each class; i.e., the predictors are all statistically independent of one another. This assumption is often violated in practice. Despite this seemingly restrictive assumption, naive Bayes classifiers can occasionally perform surprisingly well in practice.
The primary advantage of the conditional independence assumption is that it relaxes the distributional requirements for individual predictors. Unlike LDA, which requires joint multivariate normality, naive Bayes allows each predictor to follow its own distribution, which need not be Gaussian. Moreover, these distributions do not even need to be fully parametric.
For continuous predictors, kernel density estimation (Silverman 1986; Chen 2017) provides a flexible, nonparametric approach to estimating Pr[x_j|Y = k]. Kernel density estimators are smoothing methods that adaptively characterize the empirical distribution of the data. A kernel function K(\cdot) is a non-negative, symmetric function that integrates to one and serves as a weighting scheme1. Common kernel functions are:
- The uniform kernel, which assigns equal weight to all observations within a fixed range
- The Gaussian kernel, K(u) = \frac{1}{\sqrt{2\pi}}\exp(-u^2/2), which assigns weights that decrease exponentially with distance from the point of interest
The kernel density estimate at a point u is constructed as:
\hat{f}(u) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{u - x_i}{h}\right)
where n is the number of observations, x_i are the observed training set data points, and h is the bandwidth parameter that controls the smoothness of the estimate. We’ll come back to the bandwidth parameter shortly.
For example, Figure 17.1 displays estimated densities for the minimum vapor pressure variable across three subsets of the data. Kernel density estimates are shown in blue, while fully parametric Gaussian density estimates (assuming N(\mu, \sigma)) are shown in orange. The left-hand panel presents the entire training set, which exhibits modest right skewness. Both estimation approaches provide reasonable approximations for vapor pressure values2 above 200. However, the Gaussian estimate lacks the flexibility to adequately capture the density in the middle of the range, and both methods perform poorly for values below 200, where the data are sparse.
The middle panel shows the density for non-forested locations. The histogram suggests these data are approximately trimodal with left skewness, and the kernel density estimate successfully captures these multiple modes. In contrast, the Gaussian does not capture the multimodal structure, and relying on this parametric assumption would likely result in underfitting and poor probability estimates for classification.
The right-hand panel displays the density for forested locations, which exhibits right skewness. The parametric Gaussian approach provides a particularly poor fit for the lower half of this distribution, where the bulk of the data reside. The kernel density estimate adapts to the asymmetry and provides a better representation of the empirical distribution.
These examples illustrate the potential advantage of kernel density estimation within the naive Bayes framework: the ability to accommodate non-Gaussian, multimodal, and skewed distributions without requiring distributional transformations or parametric assumptions that may be violated in practice.
The kernel density estimator contains a tuning parameter: the bandwidth h, which controls the degree of smoothing applied to the density estimate. The bandwidth determines the effective width of the kernel’s influence around each observation. Small bandwidth values produce density estimates that closely track the observed data, potentially capturing fine-scale features but risking variability and overfitting. Conversely, large bandwidth values distribute the kernel weight over a broader range, yielding smoother density estimates that may obscure multimodality or other important distributional characteristics.
Several data-driven methods exist for selecting the bandwidth that optimizes specific criteria. Common approaches include minimizing an estimate of the mean integrated squared error or using cross-validation (Park and Marron 1990). These default bandwidth selection methods are designed to produce generally accurate density estimates.
However, the bandwidth that yields the most accurate density estimate may not necessarily produce the best classification performance. The objective in density estimation—minimizing the mean squared error—differs from the classification objective of minimizing prediction error. To address this potential misalignment, the bandwidth can be treated as a tuning parameter in the naive Bayes model. Rather than directly tuning h, it is often more convenient to tune a bandwidth adjustment factor that is a multiplier of the bandwidth. A value greater than 1.0 encourages more smoothing of the data, while a smaller values make very bumpy densities.
Since naive Bayes does not assume much about Pr[x_j|Y = k], categorical predictors can be readily incorporated into the model. For a categorical predictor, the conditional probabilities are estimated directly from the observed frequencies within each class. Figure 16.3 showed the raw probabilities associated with the forested class (i.e., Pr[y = forested]).
However, sparse data can create computational difficulties. Some counties contain no forested locations, while others have very few locations at all. When a particular predictor-class combination is unobserved in the training data, the estimated probability is zero. Due to the multiplicative structure in Equation 17.3, a single zero probability causes the entire class-conditional probability to equal zero, regardless of the values of other predictors. This results in the class receiving zero posterior probability, which is clearly problematic, especially when it occurs with all classes.
Several smoothing methods address this issue by adjusting probability estimates away from the boundaries of zero and one. One approach employs a Beta-Binomial model, as described in Equation 12.4, which shrinks extreme estimates toward more moderate values. For example, consider a county with ten locations, none of which are forested. The Beta-Binomial method with \alpha = 1 and \beta = 3 would pull the estimate from 0.0% to 7.1%.
Alternatively, Laplace smoothing (Manning, Raghavan, and Schutze 2009, sec. 11.3.2) provides another approach:
\hat{p}_{k} = \frac {x_k+\alpha }{n_k+\alpha C_{x}} \tag{17.4}
where C_{x} is the number of levels of the qualitative predictor and x_k and n_k are the number of events and locations, respectively. Here, with C_x = 2 and \alpha = 1, the new estimate is 8.3%.
Despite the potentially unrealistic assumption of conditional independence among predictors, naive Bayes classifiers can demonstrate competitive predictive performance with other methods. In addition, naive Bayes models can be computationally efficient since many of the calculations can be done in parallel. Depending on how the conditional densities are computed, they can be easily added to database tables, and the final prediction can be executed using highly efficient SQL.
One potential downside for this model is that there is a tendency for the class probability estimates to be seriously miscalibrated. Since we treat the predictors as independent entities, we end up multiplying many probabilities together. When the predictors are related, this means including redundant terms. When any set of probabilities is multiplied together, there is a tendency for the values to become polar; the products tend to migrate towards zero or one. This may not effect the ability of the model to separate classes, but it will compromise the accuracy of the probability estimates. Many predictions, when incorrect, are confidently incorrect.
Figure 17.2 illustrates the probability distribution and calibration performance for one model candidate applied to the forestation data. The left panel demonstrates that the predicted probabilities exhibit a bimodal distribution, with the majority of observations concentrated near the boundaries of zero and one, while relatively few observations fall in the intermediate probability range. The right panel presents the corresponding calibration curve, which reveals severe miscalibration. Specifically, observations assigned predicted probabilities near zero exhibit observed event rates of approximately 10%, while those with predicted probabilities approaching one show similarly discordant observed rates. The intermediate probabilities are no better, with observed event rates clustering near 0.5 across multiple bins regardless of predicted probability. The Brier score for these data is poor (0.128) but the area under the ROC curve (0.922) shows good separation of the classes by the model.
To address these calibration issues, an initial feature filter could be helpful. Reducing the number of predictors may improve probability estimates by mitigating potential overfitting or reducing noise from uninformative predictors. For the forestation data, a random forest model was computed (within each resample), and predictors were ranked according to their permutation importance scores. We can choose a proportion of parameters to retain and then fit our naive Bayes models. Proportions of 20%, 40%, …, 100% were used during the tuning process. Figure 17.3 displays the resulting importance scores, with error bars representing 90% confidence intervals estimated using the 10 replicate values produced during cross-validation.
#> Warning in select_best(forest_nb_res): No value of `metric` was given;
#> "brier_class" will be used.
The results show that, according to random forest, annual precipitation has, by far, the largest effect on the prediction function. The second tier of importance includes maximum vapor pressure and longitude predictors, which exhibit nearly equivalent importance scores. A moderate cluster of predictors shows intermediate importance values, while several predictors—including county, year, eastness, and westness—yield low importance scores.
It is important to note that variable importance measures are model-specific and should not be interpreted as universal or absolute assessments of predictor relevance3. In this context, the random forest importance scores serve as a filtering mechanism to identify and remove potentially irrelevant predictors prior to fitting the classification model, rather than as definitive decisions regarding predictor-response relationships.
Additionally, we have summarized the results over resamples for visualization. In practice, we execute the random forest within each resample and then fit the model on that resample’s listing of which predictors are “important”. We do not compute the importance once and use the same ranking across resamples.
For the forestation data, the naive Bayes kernel adjustment parameter was tuned over values ranging from 0.5 to 1.5, while the Laplace smoothing parameter was held constant at \alpha = 1. Beyond feature selection, no additional preprocessing was applied; the categorical counties remained as-is , and no transformations were applied to the data.
The tuning results for Brier score and ROC AUC are presented in Figure 17.4. Both metrics achieved optimal performance when the full predictor set was retained, with substantial performance degradation observed when only 20% of predictors were included. Across all feature selection thresholds, the results favor smaller kernel bandwidths (more wiggly). However, the magnitude of improvement attributable to bandwidth adjustment is modest relative to the effect of feature selection proportion.
The predictions for the best candidate are those shown in Figure 17.2. As noted there, the Brier score was mediocre while the ROC curve results indicate that the model can do a good job at qualitative separating the classes.
It is also worth noting that, for this model, the Brier and ROC AUC values point to similar findings. Across all of the candidates, the correlation between these metrics was -0.95; they tend to agree very often. The correlation between these metrics and cross-entropy was slightly lower: -0.73 and 0.84 for the Brier score and the ROC AUC, respectively.
Now, let’s turn our attention to the other notable variation of LDA: flexible discriminant analysis.
17.1.2 Flexible Discriminants
Flexible discriminant analysis (FDA) by Hastie, Tibshirani, and Buja (1994) extends LDA to accommodate more complex discriminant functions. The derivation involves substantial linear algebra; you can skip to the “Important” box below for the key concepts.
Hand (1981) and Breiman and Ihaka (1984) demonstrated that the LDA model can be estimated through a series of basic linear regressions. To do this, we first construct an n_{tr}\times C indicator matrix \boldsymbol{Y} where each column contains binary indicators for the corresponding class. Next, create a C\times (C-1) matrix of initial “scores” \boldsymbol{\Omega} using a contrast function (see Section 6.2). Hastie and Tibshirani (2024) employ a Helmert-like contrast matrix. For C=3 classes, the Helmert contrast is
Helmert = \begin{pmatrix} -1 & -1 \\ \phantom{-} 1 & -1 \\ \phantom{-} 0 &\phantom{-} 2 \\ \end{pmatrix}
which is then normalized using a function of training set class proportions. For example, with C = 3 equally frequent classes, the initial score matrix is:
\boldsymbol{\Omega} = \begin{pmatrix} \phantom{-} 1.22 & -0.71 \\ -1.22 & -0.71 \\ \phantom{-} 0.00 & \phantom{-} 1.41 \\ \end{pmatrix}
Being a contrast matrix, the column means are zero and have covariances of zero.
The modified response matrix for linear regression is computed as \boldsymbol{\Omega}^* = \boldsymbol{Y}\boldsymbol{\Omega}. Using the predictor matrix \boldsymbol{X} and the columns of \boldsymbol{\Omega}^* as outcomes, multivariate linear regression yields coefficient matrix \widehat{\boldsymbol{B}} and fitted values \hat{\boldsymbol{\Omega}} = \boldsymbol{X}\hat{\boldsymbol{B}}.
An eigendecomposition of \boldsymbol{\Omega}'\hat{\boldsymbol{\Omega}} produces eigenvalues and a matrix of eigenvectors \boldsymbol{\Phi}.
The matrix \boldsymbol{\Phi} scales the linear regression predictions to approximate the discriminant functions in Equation 17.2. For a new sample \boldsymbol{x}, the vector of FDA discriminant functions is:
\boldsymbol{D}(\boldsymbol{x}) \approx \hat{\boldsymbol{\Omega}}\Phi = \boldsymbol{x}'\hat{\boldsymbol{B}}\Phi \tag{17.5}
This (C-1) \times 1 vector contains entries for all but the reference class4. Class probabilities are computed from the discriminant functions (typically using the softmax transformation). We compute last probability estimate using \hat{\pi}_C = 1 - \sum_k^{C-1}\hat{\pi}_k.
The magic in Hastie, Tibshirani, and Buja (1994) is that linear regression can be replaced with any regression method while maintaining the discriminant analysis framework. For example, ridge or lasso regression can produce improved predictions that are then converted to class probabilities by rescaling with \boldsymbol{\Phi} as in Equation 17.5.
Hastie, Tibshirani, and Buja (1994) originally implemented FDA using multivariate adaptive regression splines (MARS) and a penalized spline method called Bruto. In practice, we have found that the best application of FDA uses MARS, which will be discussed further in ?sec-mars.
As we will see in ?sec-reg-nonlinear, MARS is a forward stepwise procedure that sequentially adds pairs of features for each predictor. These features are derived via a hinge function: h(x) = xI(x > 0). At each iteration, MARS identifies the optimal predictor and split point a, then adds two terms: h(x - a) and h(a - x). The first term equals zero for x < a, while the second equals zero for x > a. Figure 17.5 illustrates these “left” and “right” hinge functions.
#| '!! shinylive warning !!': |
#| shinylive does not work in self-contained HTML documents.
#| Please set `embed-resources: false` in your metadata.
#| label: shiny-fda-hinge
#| viewerHeight: 600
#| viewerWidth: "100%"
#| standalone: true
#| fig-alt: Examples of a single set of MARS features.
library(shiny)
library(bslib)
library(ggplot2)
library(dplyr)
library(tidyr)
source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-setup.R")
ui <- page_fillable(
theme = bs_theme(bg = "#fcfefe", fg = "#595959"),
padding = "1rem",
layout_columns(
fill = FALSE,
col_widths = breakpoints(xs = c(-2, 8, -2), sm = 4),
sliderInput(
"beta_1",
label = "Left slope",
min = -4.0,
max = 4.0,
step = 0.5,
value = 1
),
sliderInput(
"beta_2",
label = "Right slope",
min = -4.0,
max = 4.0,
step = 0.5,
value = 1
),
sliderInput(
"split",
label = "Split",
min = 0,
max = 10,
step = 0.1,
value = 5
)
),
layout_columns(
fill = FALSE,
col_widths = breakpoints(xs = c(-2, 8, -2)),
as_fill_carrier(plotOutput("plots"))
)
)
server <- function(input, output) {
theme_set(theme_bw())
output$plots <-
renderPlot(
{
dat <- tibble(x = seq(0, 10, by = 0.05))
h_right <- function(x, a) ifelse(x < a, a - x, 0)
h_left <- function(x, a) ifelse(x > a, x - a, 0)
split <- input$split
beta_0 <- 0
beta_1 <- input$beta_1
beta_2 <- input$beta_2
feat <-
dat |>
mutate(
left = h_left(x, split),
right = h_right(x, split),
pred = beta_0 + beta_1 * left + beta_2 * right
)
p <-
feat |>
pivot_longer(
cols = c(left, right, pred),
names_to = "side",
values_to = "value"
) |>
mutate(
group = case_when(
side == "left" ~ "Left Hinge: h(x - a)",
side == "right" ~ "Right Hinge: h(a - x)",
TRUE ~ "Regression Equation"
),
group = factor(
group,
levels = c(
"Left Hinge: h(x - a)",
"Right Hinge: h(a - x)",
"Regression Equation"
)
)
) |>
ggplot(aes(x, value)) +
geom_line() +
geom_vline(xintercept = split, lty = 3) +
facet_wrap(~group, ncol = 1, scale = "free_y") +
labs(x = "Original Predictor", y = "Feature Value")
print(p)
},
res = 100
)
}
app <- shinyApp(ui, server)
app
As seen in the figure, the hinge feature h(x − a) isolates the effect of the predictor to values greater than a. Likewise, its companion feature h(a − x) restricts the effect of the predictor only when x < a. If x is the first predictor selected by MARS, these functions create a flexible segmented regression:
y_i = \beta_0 + \beta_1 h(x_{i1} - a) + \beta_2 h(a - x_{i1}) + \epsilon_i
which is estimated using ordinary linear regression. The figure illustrates the impact of different values of \beta_1 and \beta_2.
The “growing” phase of MARS creates new pairs of features by evaluating all prospective pairs, retaining the pair that most improves model fit, and repeating until a user-specified maximum number of terms is reached.
During the training process, MARS may split the same predictor multiple times when the predictor-response relationship exhibits substantial nonlinearity. For categorical predictors with C levels, MARS automatically generates C binary indicators that can then be split. The algorithm can also include a predictor in its original form (without hinge functions) if that representation best improves the model at a given iteration. The maximum number of model terms is an important tuning parameter that controls model complexity.
Thus far, we have described MARS when it is constrained to produce an additive model (i.e., where each feature depends on a single predictor). MARS can also generate interaction terms. After selecting an initial pair of hinge functions on one predictor, the algorithm conducts a secondary search over all other predictors (and their candidate split points) to find the combination that most improves the model. This produces four new features, each representing a distinct two-dimensional region of the predictor space. The degree of interaction (typically one or two) is a tuning parameter; higher-order interactions are possible but computationally prohibitive.
Following the growing phase, MARS performs a backward pass that sequentially removes terms using generalized cross-validation (GCV), an efficient approximation to leave-one-out cross-validation that is available when parameters are estimated by ordinary least squares (Golub, Heath, and Wahba 1979). GCV quantifies the increase in prediction error when a given term is removed. During pruning, one or both members of a hinge pair may be removed. The final model consists of the retained features, their estimated coefficients, and the \Phi matrix, which together define the prediction function.
FDA-MARS models can produce class boundaries that are segmented or trapezoidal. For example, the “Medium Complexity” and “High Complexity” panels in Figure 16.1 were generated using FDA-MARS, with interactions and differing numbers of retained features.
Regarding preprocessing, MARS internally converts categorical predictors to binary indicators, so this transformation should not prior to model fitting. Since MARS estimates feature coefficients using ordinary least squares, standard preprocessing techniques for linear models apply. For instance, transforming predictors to improve symmetry may enhance performance. Missing values must also be imputed prior to modeling.
MARS performs automatic feature selection: if a predictor is never used in a basis function or if all terms containing that predictor are removed during pruning, the final model is independent of that predictor. This property extends to FDA-MARS. As illustrated in Figure 13.1, irrelevant predictors can be included during training without adverse effects, eliminating the need for a separate feature selection step. However, including irrelevant features will increase computation time. This may or may not have a practical impact, depending on the data set size and number of irrelevant features.
For using MARS with FDA with our forestation data, we initially generated a total of 200 features (apart from the intercept) during the growing phase. To tune the model, we considered additive and interaction models as well as using forward and backward removal strategies for the pruning phase. The number of retained terms was optimized to be between 2 and 50 final model terms.
We can see how this worked out in Figure 17.6. Performance leveled off around 25-35 terms for each pruning method, with forward selection showing a tiny advantage for these data. Models with interaction effects were uniformly more successful than additive models. An interaction model using forward selection and 30 terms appears to be a reasonable choice. This configuration had an estimated Brier score of 0.083 and an area under the ROC curve of 0.95.
Once the FDA-MARS model was fit to the training set, the results was 25 total coefficients. The implementation of MARS checks several stopping conditions for the growing phase based on the coefficient of determination (a.k.a. R2), including:
- Reached a R2 of 0.999 or more.
- No new term increases R2
- Adding a term changes R2 by less than 0.001.
The last condition triggered the training procedure to stop at 25 model terms instead of the 30 that were requested. Of these terms, 6 used a single predictor and 18 were interaction terms. The categorical predictor (county) was used 3 times in both main effects and interactions. To give the reader a sense of what the linear regression model looks like, Table 17.1 shows the set of coefficients, the model terms, and split points.
| Model Term | value |
|---|---|
| (Intercept) | −1.04 |
| h(834 - annual precipitation) x h(annual maximum temperature - 10.3) | 0.0000138 |
| h(43 - roughness) | 0.0173 |
| h(annual precipitation - 650) x h(8.65 - annual maximum temperature) | 0.000135 |
| h(annual mean temperature - 7.31) x h(longitude - - 118.905) | 0.0876 |
| h(650 - annual precipitation) | 0.00478 |
| h(1920 - annual precipitation) x h(annual mean temperature - 7.31) | 0.000180 |
| h(650 - annual precipitation) x h(annual mean temperature - 7.95) | −0.00122 |
| h(650 - annual precipitation) x county: ferry | −0.00431 |
| h(650 - annual precipitation) x county: stevens | −0.00409 |
| h(1348 - maximum vapor) x county: whitman | 0.00978 |
| h(141 - elevation) x h(annual maximum temperature - 10.3) | 0.000645 |
| h(650 - annual precipitation) x h(7.95 - annual mean temperature) | −0.000898 |
| h(annual mean temperature - 7.31) x h( - 118.905 - longitude) | −0.0332 |
| h(year - 2020) x h(43 - roughness) | −0.0158 |
| h(0.3 - dew temperature) x h(annual mean temperature - 7.31) | −0.708 |
| h(43 - roughness) x h(1018 - annual precipitation) | −0.0000238 |
| h(43 - roughness) x h(annual precipitation - 1018) | −0.00000822 |
| h(roughness - 43) | 0.00164 |
| h(annual mean temperature - 7.31) | 0.151 |
| h(10.3 - annual maximum temperature) | 0.0876 |
| h(roughness - 7) x h(annual mean temperature - 7.31) | −0.000739 |
| h(annual precipitation - 650) | 0.0000552 |
| h(elevation - 141) x h(annual maximum temperature - 10.3) | 0.0000548 |
| h(7 - roughness) x h(annual mean temperature - 7.31) | 0.00595 |
To better understand what is driving the model, MARS has an internal variable importance metric. The method estimates variable importance based on the improvement in model fit as terms are added. During the forward pass, MARS quantifies the reduction in GCV associated with each new basis function and attributes this improvement to the corresponding predictor(s). These GCV reductions are summed across all terms involving each predictor and scaled to range from 0 to 100, where 0 indicates the predictor does not appear in the final model. For our model, Figure 17.7 displays the results. Annual precipitation and annual maximum temperature were the most influential predictors, each receiving an importance score of 100. Terrain roughness, annual mean temperature, and longitude also contributed substantially to the classifier. There were 6 predictors that were not used at all: annual minimum temperature, eastness, january minimum temperature, latitude, minimum vapor, and northness.
FDA via MARS occupies a middle ground between fully interpretable models and black-box methods. While it may not achieve optimal predictive performance, it often approaches it while still being interpretable through visualization, particularly when the model is additive.
For example, suppose we chose an additive model based on the tuning results in Figure 17.6. In an additive model, the functional form relating each predictor to the outcome is independent of the other predictors. Although the values of other predictors shift the range of predicted probabilities, the shape of each predictor’s effect remains constant. This property enables us to construct partial dependence plots showing how predicted class probabilities vary with each predictor while holding others fixed. Figure 17.8 illustrates such profiles from an additive fit, enabling an interpretation of the model’s behavior.
For example, we can say that:
- The probability of forestation has notable jump after the year 2020.
- The probability increases sharply when annual precipitation exceeds approximately 500 mm and remains elevated thereafter (holding other predictors constant).
- The downward spike in the longitude corresponds to the unforested region east of Seattle.
and so on. These patterns illustrate how individual predictors influence the classification boundary in an interpretable manner.
17.2 K-Nearest Neighbors
K-nearest neighbors (KNN) is predicated on the idea that data points that are close to one another in the predictor space should have outcomes that are also similar. If we are predicting a new sample, we find the K training set points that are “closest” to the new data point and use their outcome values to estimate the prediction. There is essentially no training here; everything happens when the sample is being predicted. KNN falls into a class of techniques called instance-based learning since it stores the training set instances and directly uses these data during prediction.
What does “similar” and/or “close” mean? The operational definition of “proximity” or “closeness” requires specifying either a distance or similarity metric. While Euclidean distance is a common metric, the Minkowski distance provides a more flexible framework. For a new observation \boldsymbol{u} with p predictors, the Minkowski distance is defined as:
d(\boldsymbol{u}, \boldsymbol{x}) ={\biggl (}\sum _{j=1}^{p}|u_j-x_{j}|^{degree}{\biggr )}^{\frac {1}{degree}},
where degree is the distance parameter. This formulation yields Euclidean distance when degree = 2, Manhattan distance (also called taxicab or city-block distance) when degree = 1, and intermediate distance metrics for other values of degree.
The choice of distance metric can substantially influence model performance and can be considered a tuning parameter in practice (via the Minkowski degree).
Beyond the Minkowski family, several specialized distance metrics may be appropriate depending on the structure and nature of the predictor space:
Geodesic distance (discussed in Section 7.3.1) is based on a general distance matrix to construct a graph-based representation of inter-point relationships.
Haversine distance (introduced in Section 7.4) accounts for the curvature of the Earth’s surface and is the appropriate metric when predictors represent geographic coordinates (latitude and longitude).
Mahalanobis distance (De Maesschalck, Jouan-Rimbaud, and Massart 2000; Johnson and Wichern 2007) incorporates the covariance structure among predictors, effectively scaling predictor differences by their variances and correlations. This metric is particularly valuable when predictors exhibit substantial correlation and sufficient data are available to reliably estimate the covariance matrix. The Mahalanobis distance has a direct connection to the mathematics of discriminant analysis.
These distance-based metrics require numeric predictors. For the Minkowski distance, predictors must be expressed in the same units to prevent variables with larger scales from dominating the distance calculation. This is typically accomplished through centering and scaling (standardization) or other normalization techniques. Additionally, addressing skewness through appropriate transformations (e.g., Box-Cox, Yeo-Johnson, orderNorm, etc.) prior to distance calculation is recommended, as highly skewed distributions can distort relationships and degrade model performance.
For qualitative predictors, one approach is to encode them as indicator (dummy) variables, which should then be preprocessed in the same way as continuous predictors. However, when the predictor set contains a mixture of data types, Gower distance (Gower 1971) provides a unified framework that accommodates different variable types without requiring uniform encoding. Gower distance computes a dissimilarity measure separately for each predictor according to its data type, then aggregates these component distances:
d(\boldsymbol{u}, \boldsymbol{x}) =\frac {1}{p}\sum _{j=1}^{p}d_j(u_j, x_{j}),
where d_j denotes the distance function for predictor j. For continuous predictors, the component distance is range-normalized:
d_j(u_j, x_{j})=1-{\frac {|u_{j}-x_{j}|}{R_{j}}}
where R_j represents the range of predictor j in the training set. This normalization ensures all continuous predictors contribute equally regardless of their original scales.
For categorical predictors, the component distance is a simple mismatch indicator, d_j(u_j, x_{j}) = I(u_j = x_j). This binary metric assigns zero distance for matching categories and unit distance otherwise.
For a comprehensive treatment of distance metrics in the context of nearest neighbor methods, see Cunningham and Delany (2021).
For classification problems, KNN identifies the K nearest neighbors and estimates class probabilities as the proportion of neighbors belonging to each class:
\widehat{\pi}_{c} = \frac{1}{K}\sum_{k=1}^K y_{kc}
where y_{kc} is a binary indicator equal to 1 if neighbor k belongs to class c, and 0 otherwise. For small values of K, these empirical probability estimates can be unreliable, particularly when certain classes are underrepresented among the neighbors. The Laplace correction (previously discussed in Section 17.1.1) can be used to regularize the estimate:
\widehat{\pi}_{c} = \frac{1}{K}\sum_{k=1}^K \frac{y_{kc} +\alpha}{K + \alpha C}
where C denotes the number of classes and \alpha \in [0, 2] is a smoothing parameter. Setting \alpha = 0 yields the unregularized estimate, while \alpha > 0 shrinks extreme probabilities toward a uniform distribution.
It is common to fix K and collect all neighbors, regardless of their distances to the query point. Alternatively, one can put a cap on the maximum distance or use a more informative approach of using weighted distances:
\widehat{p}_{c} = \frac{1}{W}\sum_{k=1}^K g(d_k)\;y_{kc}
where g() is a kernel function that weights the distances and W is the sum of the g(d_k) (Dudani 1976; Zuo, Zhang, and Wang 2008; Samworth 2012). Figure 17.9 shows some commonly used functions (Hechenbichler and Schliep 2004).
Outside of preprocessing methods, the primary tuning parameters are the number of neighbors, the weighting function, and the distance degree (if using a Minkowski distance). Recall from Figure 13.1 that KNN models can be harmed when used with irrelevant predictors. As will be seen with these data, tuning an initial feature selection filter can improve model quality.
How well does this technique work for the forestation data? To start, we used preprocessing that included the same random forest importance filter as Section 17.1.1 (and tune the proportion fo predictors retained). After that, an effect encoding was used for the county column, and then applied an orderNorm transformation to all predictors to standardize them to the same units and potentially remove skewness. The model was tuned over the proportion of features retained (5% to 100%), the number of neighbors (two through fifty), the Minkowski degree parameter (between 0.1 and 2.0), and the kernel functions listed in Figure 17.9. A total of fifty configurations were evaluated using resampling, and the Brier scores are shown in Figure 17.10.
Poor results occurred when there were fewer than twenty neighbors, and subsequently, the Brier scores remained consistently low. The distance parameter appeared to favor small values (around 0.5). There are no strong trends in terms of which kernel functions; Gaussian kernels appear to do well overall, but there are configurations that do well using any kernel. Numerically, the best results retained 86.4% of the predictors, used 28 neighbors, a Minkowski degree of 1.65, and a cos kernel function. In the figure above, we can see that there is a small increase in error when the entire predictor set is used. Otherwise, smaller feature sets are underfitting the data. The corresponding Brier score was 0.0778, which is very the smallest Brier score seen so far. The area under the ROC curve was also good, with a resampling estimate of 0.953.
17.4 Support Vector Machines
17.5 Comparisons to Previous Models
Is there a specific model, so far, that appears to work best for the forested data? To determine that, we conduct a Bayesian comparison of the models using their resampling statistics (from Section 14.1.3). We’ll focus on the best candidate from each distinct modeling method.
A Bayesian model was created using Gaussian priors to estimate the posterior distribution of the Brier statistic for each model and to contrast the current best model with others. A ROPE approach is taken here, in which we specify that a difference of ±0.01 in the Brier scores between models would be considered a realistic change.
While the KNN model currently has the smallest Brier score, there is a cluster of models that have roughly the same Brier scores. This includes the logistic models from the previous chapter. Clearly, the naive Bayes model is not in the running for best model, not even through the lens of practical equivalence.
Chapter References
We’ll see kernel functions two more times in this chapter.↩︎
In units of Pa × 100.↩︎
A different view of predictor importance will be seem later in Figure 17.7.↩︎
The approximation involves scaling factors derived from the eigenvalues of \boldsymbol{\Phi}.↩︎