16  Generalized Linear and Additive Classifiers

We often conceptualize classification models by the type of class boundary they produce. For example, in Figure 9.1, two predictors were visualized, and the colors and shapes of the points indicated their class memberships. For one model, results were shown for three different settings of a tuning parameter. Additionally, the class boundary was drawn as a black line, dividing the data into two (or more) regions where the model predicts a specific class. When there are more than two or three predictors, visualizing this boundary becomes impractical, but we still use the idea of a boundary that demarcates regions sharing the same (hard) class prediction.

That figure also demonstrated that some models can produce simple boundaries while others produce very convoluted partitions. The simplest models produce linear boundaries, which naturally constrains them from over-adapting to localized trends in the data (i.e., overfitting). Unfortunately, this lack of complexity also increases the model bias so that, when the true boundary is nonlinear, the model may dramatically underperform.

This chapter focuses on models that, at first appearance, produce linear class boundaries. Apart from feature engineering, they tend to be relatively easy and fast to train, and their simplicity makes them more likely to be interpretable. We’ll start by discussing the most widely used classification model: logistic regression. This model has many important aspects to discuss, as well as numerous ways to estimate its parameters. An extension of this model for more than two classes, known as multinomial regression, is also discussed. Finally, we review a fairly antiquated classification model (discriminant analysis) that will lead to some effective generalizations in the next chapter.

However, before diving into modeling techniques, let’s take a closer look at the Washington State forestation data originally introduced in Section 3.9. These data will be used to demonstrate the nuances of different classification models in this and the next two chapters.

16.1 Exploring Forestation Data

These data have been discussed in Sections 3.9, 10.8, and 13.4. As a refresher, locations in Washington State were surveyed, and specific criteria were applied to determine whether each was sufficiently forested. Using predictors describing the climate, terrain, and location, we want to accurately predict the probability of forestation at other sites within the state.

As previously mentioned, this type of data exhibits spatial autocorrelation, where objects close to each other tend to have similar attributes. This is not a book specific to spatial analysis; ordinary machine learning tools will be used to analyze these data, and our analyses might be, to some degree, suboptimal for the task. However, Bechtold and Patterson (2015) describe the sampling methodology for our data, in which the on-site inspection locations are sampled from within a collection of 6,000-acre hexagonal regions. While spatial autocorrelation is very relevant, the large spacing between these points may reduce the risk of using spatially ignorant modeling methodologies. Even so, our data spending methodologies are spatially aware, and these might further mitigate any issues caused by ordinary ML models. For example, when initially splitting the data (and when resampling them), recall that we used a buffer to add some space between the data used to fit the model and those used for assessment (e.g., Figures 3.5 and 10.10). This can reduce the risk of ignoring the autocorrelation when estimating model parameters.
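To make the buffering idea concrete, here is a minimal sketch of a spatially buffered split. This is not the splitting code used for our analyses; it assumes `coords` is an (n, 2) NumPy array of location coordinates and that distances in those units are meaningful (e.g., projected coordinates).

```python
import numpy as np
from scipy.spatial import cKDTree

def buffered_split(coords, test_frac=0.2, buffer=1.0, seed=None):
    """Randomly select a test set, then drop any remaining point that
    falls within `buffer` (in the units of `coords`) of a test point."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    test = rng.choice(n, size=int(test_frac * n), replace=False)
    rest = np.setdiff1d(np.arange(n), test)

    # Distance from each candidate training point to its nearest test point
    dist, _ = cKDTree(coords[test]).query(coords[rest])
    train = rest[dist > buffer]
    return train, test
```

Discarding the points that fall inside the buffer costs some training data, but it keeps the assessment locations from being near-duplicates of the training locations.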

To learn more about spatial machine learning and data analysis, Kopczewska (2022) is a nice overview. We also recommend Nikparvar and Thill (2021), Kanevski, Timonin, and Pozdnukhov (2009), and Cressie (2015).

As with any ML project, we conduct preliminary exploratory data analysis to determine whether any data characteristics might affect how we model them. Table 16.1 contains statistical summaries of the 15 numeric predictors, computed on the training data. Several of the predictors exhibit pronounced right or left skew. By coercing the distributions of some predictors to be more symmetric, we might gain robustness and perhaps an incremental improvement in performance (for some models).
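One way to carry out this symmetrization is the Yeo-Johnson transformation, estimated on the training set and applied to new data. The sketch below uses scikit-learn; the `train` and `test` data frames and the `skewed_cols` list are hypothetical names.

```python
from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson accommodates zero and negative values (e.g., sub-zero
# temperatures), unlike the Box-Cox transformation.
symmetrize = PowerTransformer(method="yeo-johnson")
train[skewed_cols] = symmetrize.fit_transform(train[skewed_cols])
test[skewed_cols] = symmetrize.transform(test[skewed_cols])
```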

Also, the annual minimum temperature, dew temperature, January minimum temperature, and maximum vapor have bimodal distributions. The year of inspection is also interesting; data collection was sparse before 2011, and subsequent years contain a few hundred data points per year before dropping off in 2021. These characteristics are not indicative of problems with data quality, but it can be important to know that they exist when debugging why a model is underperforming or showing odd results.

| Predictor                    | Minimum | Mean  | Maximum | Std. Dev. | Skewness |
|:-----------------------------|--------:|------:|--------:|----------:|---------:|
| Annual Maximum Temperature   | −2.98   | 13.9  | 19.0    | 2.79      | −1.00    |
| Annual Mean Temperature      | −7.89   | 8.51  | 12.5    | 2.37      | −1.12    |
| Annual Minimum Temperature   | −15.4   | −3.13 | 4.50    | 3.16      | 0.200    |
| Annual Precipitation         | 171     | 1,200 | 5,900   | 1,040     | 1.15     |
| Dew Temperature              | −14.4   | 2.26  | 8.56    | 2.89      | −0.0831  |
| Eastness                     | −100    | −3.21 | 100     | 69.4      | 0.0649   |
| Elevation                    | 0       | 674   | 3,820   | 484       | 0.915    |
| January Minimum Temperature  | −12.8   | 3.12  | 8.14    | 2.24      | −0.751   |
| Latitude                     | 45.6    | 47.4  | 49.0    | 0.877     | 0.0592   |
| Longitude                    | −125    | −120  | −117    | 2.04      | −0.0843  |
| Maximum Vapor                | 306     | 1,100 | 2,020   | 351       | 0.165    |
| Minimum Vapor                | 12.0    | 132   | 336     | 75.1      | 0.271    |
| Northness                    | −100    | −2.08 | 100     | 70.0      | 0.0432   |
| Roughness                    | 0       | 48.8  | 376     | 47.7      | 1.44     |
| Year                         | 2000    | 2020  | 2020    | 3.24      | −0.402   |

Table 16.1: Statistical summaries of the numeric predictors in the Washington State training set.

Also, recall that Figure 13.2 previously described the correlation structure of these predictors. There were several clusters of strongly correlated predictors, implying some redundancy of information in these features. However, machine learning is often “a game of inches,” where even redundant predictors can contribute incremental improvements in performance. In any case, the analyses in this chapter will be profoundly affected by this characteristic; it is discussed in more detail below.

Individually, how does each of these features appear to relate to the outcome? To assess this, we binned each predictor into roughly 20 groups based on percentiles and used these groups (each containing about 230 locations) to compute the rate of forestation and 90% confidence intervals¹. As an exploratory tool, these binned versions of the data can reveal potential relationships with the outcome. Figure 16.1 shows the profiles. Note that there is enough data in each bin that the confidence intervals are quite narrow and hug the estimated rates.

Figure 16.1: Binned rates of forestation over percentiles of the numeric predictors. The shaded regions are 90% confidence intervals.
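As a rough sketch of how such binned profiles might be computed, the function below groups a predictor by percentiles and attaches an interval to each bin’s rate. The exact interval method used for the figure is not shown here, so a simple normal approximation to the binomial proportion is assumed.

```python
import numpy as np
import pandas as pd
from scipy import stats

def binned_rates(x, y, n_bins=20, conf=0.90):
    """Bin a numeric predictor into percentile groups and compute the
    event rate plus a normal-approximation confidence interval per bin."""
    df = pd.DataFrame({"x": x, "y": y})
    df["bin"] = pd.qcut(df["x"], q=n_bins, duplicates="drop")
    out = df.groupby("bin", observed=True).agg(
        midpoint=("x", "median"), rate=("y", "mean"), n=("y", "size")
    )
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    half_width = z * np.sqrt(out["rate"] * (1 - out["rate"]) / out["n"])
    out["lower"] = out["rate"] - half_width
    out["upper"] = out["rate"] + half_width
    return out.reset_index()
```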

Quite a few predictors show considerable nonlinear trends, and a few are not monotonic (i.e., the sign of the slope changes over the range of values). The Eastness and Northness features, which capture the landscape orientation at the location, show flat trends. This means that these predictors are less likely to be important. However, once in a model with other predictors, the model may be able to extract some utility from them, perhaps via interaction terms. The primary takeaway from this visualization is that models that are able to express nonlinear trends will probably do better than those restricted to linear classification boundaries.

In addition to longitude and latitude, the data contain a qualitative location-based predictor: the county in Washington. There are data on 39 counties, and the number of locations within each county varies widely, with San Juan County having the fewest training set samples (6) and Okanogan having the most (365). Figure 16.2 shows how the rate of forestation changes across counties, along with the uncertainty in these estimates. Several counties in the training set have no forested locations. Given the number of counties and their varying frequencies of data, an effect encoding strategy might be appropriate for this predictor; a sketch of one such encoder follows Figure 16.2.

Figure 16.2: Outcome rates for different counties in Washington State.
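As one sketch of effect encoding, scikit-learn’s `TargetEncoder` shrinks each county’s observed rate toward the overall rate, which tempers the estimates for sparse counties such as San Juan. The `train` data frame and the "county" and "forested" column names below are hypothetical.

```python
from sklearn.preprocessing import TargetEncoder

# Cross fitting inside fit_transform() limits the leakage that comes
# from encoding a predictor with its own outcome values.
encoder = TargetEncoder(target_type="binary", smooth="auto")
train["county_effect"] = encoder.fit_transform(
    train[["county"]], train["forested"]
).ravel()
```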

Finally, it might be a good idea to assess potential interaction effects prior to modeling. Since almost all of our features are numeric, it can be difficult to assess interactions visually, so the H-statistics for two-way interactions were calculated with a boosted tree (fit to the numeric features) as the base model. Since there are only 105 possible interactions, the H-statistics were recomputed 25 times using different random number seeds so that we can compute a mean H-statistic and its associated standard error.

Figure 16.3: Results for the top 25 H-statistic interactions. The error bars are 90% intervals based on replicate computations.
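For reference, here is a minimal brute-force sketch of Friedman’s H² statistic for a pair of features. It assumes `X` is a NumPy array of the numeric predictors and `model` is a fitted classifier with a `predict_proba()` method; it is far less efficient than purpose-built implementations since each partial dependence value averages predictions over the full data.

```python
import numpy as np

def h_statistic(model, X, j, k, n_points=100, seed=None):
    """Friedman's H^2 for the interaction between features j and k,
    using brute-force partial dependence on a random subsample."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(len(X), size=min(n_points, len(X)), replace=False)]

    def centered_pd(cols):
        # Fix `cols` at each sampled value, average the predicted
        # probability over the full data, then center the values.
        vals = np.empty(len(sample))
        for i, row in enumerate(sample):
            X_mod = X.copy()
            X_mod[:, cols] = row[cols]
            vals[i] = model.predict_proba(X_mod)[:, 1].mean()
        return vals - vals.mean()

    pd_j = centered_pd([j])
    pd_k = centered_pd([k])
    pd_jk = centered_pd([j, k])
    return np.sum((pd_jk - pd_j - pd_k) ** 2) / np.sum(pd_jk ** 2)
```

Repeating the computation with different seeds subsamples different points, which is one way to obtain the replicate estimates summarized in Figure 16.3.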

The vast majority of the 105 H-statistics are less than 0.215, and a handful of interactions exceed that value; we’ll take those five interactions and use them in the logistic regression shown below.
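Mechanically, these interactions can enter a linear classifier as products of the two predictors. A brief sketch follows: the pair shown is a hypothetical placeholder for the five pairs flagged in Figure 16.3, and `train` is assumed to hold only the numeric predictors plus a "forested" outcome column.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pairs = [("elevation", "annual_precipitation")]  # placeholder pair
for a, b in pairs:
    # Interaction term: the elementwise product of the two predictors
    train[f"{a}_x_{b}"] = train[a] * train[b]

predictors = train.columns.drop("forested")
fit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fit.fit(train[predictors], train["forested"])
```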

The next section will describe a mainstay of machine learning models for two classes: logistic regression.

16.2 Logistic Regression

Generalized Linear Models

Forestation Model Development

Interpretation

Examining Our Model

A Lurking Problem: Multicollinearity

16.2.1 Regularized

Refitting the Forestation Model

16.2.2 Bayesian Estimation

16.3 Generalized Additive Models

16.4 Multinomial Regression

16.5 Discriminants

Chapter References

Bechtold, W, and P Patterson. 2015. The Enhanced Forest Inventory and Analysis Program - National Sampling Design and Estimation Procedures. U.S. Department of Agriculture, Forest Service, Southern Research Station.
Cressie, Noel. 2015. Statistics for Spatial Data. John Wiley & Sons.
Kanevski, M, V Timonin, and A Pozdnukhov. 2009. Machine Learning for Spatial Environmental Data: Theory, Applications, and Software. EPFL Press.
Kopczewska, K. 2022. “Spatial Machine Learning: New Opportunities for Regional Science.” The Annals of Regional Science 68 (3): 713–55.
Nikparvar, B, and JC Thill. 2021. “Machine Learning of Spatial Data.” ISPRS International Journal of Geo-Information 10 (9): 600.

  1. Binning is used here as a visualization tool; we re-emphasize that converting numeric predictors into categorical features is problematic.↩︎