10  Measuring Performance with Resampling

Section 9.3 showed that using the same data to estimate and evaluate our model can produce inaccurate estimates of how well it predicts new samples. It also described using separate partitions of the data to fit and evaluate the model.

There are two general approaches for using external data for model evaluation. The first is a validation set, which we’ve seen before in Section 3.7. This is a good idea if you have a lot of data on hand. The second approach is to resample the training set. Resampling is an iterative approach that reuses the training data multiple times based on sound statistical methodology. We’ll discuss both validation sets and resampling in this chapter.

Figure 10.1: A general data usage scheme that includes resampling. The colors denote the data used to train the model (in tan) and the separate data sets for evaluating the model (colored periwinkle).

Figure 10.1 shows a standard data usage scheme that incorporates resampling. After an initial partition, resampling creates multiple versions of the training set. We’ll use special terminology1 for the data partitions within each resample which serve the same function as the training and test sets:

  • The analysis set: the portion of a resample used to fit the model (serving the same role as the training set).
  • The assessment set: the remaining portion of that resample used to evaluate the fitted model (serving the same role as the test set).

Like the training and test sets, the analysis and assessment sets are mutually exclusive data partitions and are different for each of the B iterations of resampling.

The resampling process fits a model to an analysis set and predicts the corresponding assessment set. One or more performance statistics are calculated from these held-out predictions and saved. This process continues for B iterations, and, in the end, there is a collection of B statistics of efficacy. These are averaged to produce the overall resampling estimate of performance. To formalize this, we’ll consider a mapping function \(M(\mathfrak{D}^{tr}, B)\) that takes the training set as input and can output \(B\) sets of analysis and assessment sets.

Algorithm 10.1 describes this process.

\begin{algorithm} \begin{algorithmic} \State $\mathfrak{D}^{tr}$: training set of predictors $X$ and outcome $y$ \State $B$: number of resamples \State $M(\mathfrak{D}^{tr}, B)$: a mapping function to split $\mathfrak{D}^{tr}$ for each of $B$ iterations. \State $f()$: model pipeline \Procedure{Resample}{$\mathfrak{D}^{tr}, f, M(\mathfrak{D}^{tr}, B)$} \For{$b = 1$ \To $B$} \State Partition $\mathfrak{D}^{tr}$ into $\{\mathfrak{D}_b^{fit}, \mathfrak{D}_b^{pred}\}$ using $M_b(\mathfrak{D}^{tr}, B)$. \State Train model pipeline $f$ on the analysis set to produce $\hat{f}_{b}(\mathfrak{D}_b^{fit})$. \State Generate assessment set predictions $\hat{y}_b$ by applying model $\hat{f}_{b}$ to $\mathfrak{D}_b^{pred}$. \State Estimate performance statistic $\hat{Q}_{b}$. \EndFor \State Compute resampling estimate $\hat{Q} = \frac{1}{B}\sum_{b=1}^B \hat{Q}_{b}$. \Return $\hat{Q}$. \EndProcedure \end{algorithmic} \end{algorithm}
Algorithm 10.1: Resampling models to estimate performance.
Note

In Algorithm 10.1, \(f()\) is the model pipeline described in Section 1.5. Any estimation methods, before, during, or after the supervised model, are executed on the analysis set B times during resampling.

There are many different resampling methods, such as cross-validation and the bootstrap. They differ in how the analysis and assessment sets are created in line 7 of Algorithm 10.1. The sections below discuss a number of these mapping functions.
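As a minimal sketch of Algorithm 10.1 (not code from this book), the Python function below accepts any splitter object that yields B pairs of analysis/assessment row indices in place of the mapping function \(M\); the simulated data, logistic model, and 10-fold splitter are illustrative stand-ins.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import KFold

def resample(X, y, model, splitter, metric):
    """Fit on each analysis set, score on each assessment set, then average."""
    stats = []
    for fit_idx, pred_idx in splitter.split(X, y):          # partition D^tr (line 7)
        fitted = clone(model).fit(X[fit_idx], y[fit_idx])   # train on the analysis set
        prob = fitted.predict_proba(X[pred_idx])[:, 1]      # predict the assessment set
        stats.append(metric(y[pred_idx], prob))             # per-resample statistic Q_b
    return np.mean(stats), np.array(stats)                  # average the B statistics

# Illustrative use: simulated two-predictor data with 10-fold CV as the mapping function.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
estimate, per_resample = resample(X, y, LogisticRegression(),
                                  KFold(n_splits=10, shuffle=True, random_state=1),
                                  brier_score_loss)
print(f"Resampling estimate of the Brier score: {estimate:.3f}")
```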

Two primary aspects of resampling make it an effective tool. First, the separation of data used to create and appraise the model avoids the data reuse problem described in Section 9.3.

Second, using multiple iterations (B > 1) means that you are evaluating your model under slightly different conditions (because the analysis and assessment sets are different). This is a bit like a “multiversal” science fiction story: what would have happened if the situation were slightly different before I fit the model? This lets us directly observe the model variance; how stable is it when the inputs are slightly modified? An unstable model (or one that overfits) will have a high variance.

In this chapter, we’ll describe some conceptual aspects of resampling. Then, we’ll define and discuss various methods. Finally, the last section lists frequently asked questions we often hear when teaching these tools.

Up until Section 10.6, we will assume that each row of the data is independent. We’ll see examples of non-independent data in the later sections.

10.1 What is Resampling Trying to Do?

It’s important to understand some of the philosophical aspects of resampling. It is a statistical procedure that tries to estimate some true, unknowable performance value (let’s call it \(Q\)) associated with the same population of data represented by the training set. The quantity \(Q\) could be RMSE, the Brier score, or any other performance statistic.

There have been ongoing efforts to understand what resampling methods do on a theoretical level. Bates, Hastie, and Tibshirani (2023) is an interesting work focused on linear regression models with ordinary least squares. Their conclusion regarding resampling estimates is that they

… cannot be viewed as estimates of the prediction error of the final model fit on the whole data. Rather, the estimate of prediction error is an estimate of the average prediction error of the final model across other hypothetical data sets from the same distribution.

It is reasonable to assume that this also holds for more complex machine learning models. The important idea is that resampling is measuring performance on training sets similar to ours (and of the same size).

To understand different methods, let’s examine some theoretical properties of resampling estimates that matter to applied data analysis.

Let’s start with an analogy: consider the usual sample mean estimator used in basic statistics (\(\bar{x}\)). Based on assumptions about the data, we can derive an equation that optimally estimates the true mean value based on some criterion. Often, the statistical theory focuses on our estimators’ theoretical mean and variance. For example, if the data are independent and follow the same Gaussian distribution, the sample mean attempts to estimate the actual population mean (i.e., it is an unbiased estimator). Under the same assumptions, we can also derive the estimator’s theoretical variance. These properties, or some combination of them, such as mean squared error2, help guide us to the best estimator.

The same is true for resampling methods. We can understand how their bias and variance change under different circumstances and choose an appropriate technique. To demonstrate resampling methods, we’ll focus on bias and variance as the main properties of interest.

First is bias: how accurately does a resampling technique estimate the true population value \(Q\)? We’d like it to be unbiased, but to our knowledge, that is nearly impossible. However, we do know a few things. Figure 10.1 shows that the training set (of size \(n_{tr}\)) is split into B resampling partitions, which are then split into analysis and assessment sets of size \(n_{fit}\) and \(n_{pred}\), respectively. It turns out that, as \(n_{pred}\) becomes larger, the bias increases in a pessimistic direction3. In other words, if our training set has 1,000 samples, a resampling method where the assessment set has \(n_{pred} = 100\) data points has a smaller bias than one using \(n_{pred} = 500\). We also know that increasing the number of resamples (B) can’t significantly reduce the bias. Therefore, we must think carefully about the way the data are partitioned in the resampling process to reduce potential bias.

The variance (often called precision) of resampling is also important. We want to get a performance estimate that gives us consistent results if we repeat it. The resampling precision is driven mainly by B and \(n_{tr}\). With some resampling techniques, if your results are too noisy, you can resample more and stabilize or reduce the estimated variance (at the cost of increased computational time). We’ll examine the bias and precision of different methods as we discuss each method.

Let’s examine a few specific resampling methods, starting with the simplest: a single validation set.

10.2 Validation Sets

We’ve already described a validation set; it is typically created via a three-way initial split4 of the data (as shown in Figure 10.2). For time series data, it is common to use the most recent data for the test set. Similarly, the validation set would include the most recent data (once the test set partition is created). The training set would include everything else.

Figure 10.2: An initial data splitting scheme that incorporates a validation set.
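As a hedged sketch (the proportions and splitting tool are assumptions, not prescriptions from this chapter), a three-way split can be produced by two successive random splits; for time series data, the most recent rows would instead be sliced off chronologically.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 5))   # hypothetical predictors
y = rng.normal(size=1000)        # hypothetical outcome

# Split off the test set first, then carve a validation set out of the remainder,
# giving roughly 60% training, 20% validation, and 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```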

While not precisely the same, a validation set is extremely similar to an approach described below called Monte Carlo cross-validation (MCCV). If we used a single MCCV resample, the results would be effectively the same as those of a validation set; the difference is not substantive.

One practical question when using a validation set is what to do with it after you’ve made your final decision about the model pipeline. Ordinarily, the entire training set is used for the final model fit.

Data can be scarce and, in some cases, an argument can be made that the final model fit could include the training and validation set. Doing this does add some risk; your validation set statistics are no longer completely valid since they measured how well the model works with similar training sets of size \(n_{tr}\). If you have an abundance of data, the risk is low, but, at the same time, the model fit won’t change much by adding \(n_{val}\) data points. However, if your training data set is not large, adding more data could have a profound impact, and you risk using the wrong model for your data5.

10.3 Monte Carlo Cross-Validation

Monte Carlo cross-validation emulates the initial train/test partition. For each one of B resamples, it takes a random sample of the training set (say, about 75%) to use as the analysis set6. The remainder is used for model assessment. Each of the B resamples is created independently of the others, so some of the training set points are included in multiple assessment sets. Figure 10.3 shows the results of \(M(\mathfrak{D}^{tr}, 3)\) for MCCV with a data set of \(n_{tr} = 30\) data points, where 80% of the data are allocated to the analysis set. Note that samples 16 and 17 are in multiple assessment sets.

Figure 10.3: A schematic of three Monte Carlo cross-validation resamples created from an initial pool of 30 data points.
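One way to generate MCCV-style splits (a sketch, assuming scikit-learn is acceptable here) is ShuffleSplit, which draws each partition independently of the others; the 30-row data set and 80% analysis proportion mirror Figure 10.3.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(30).reshape(-1, 1)   # stand-in for a 30-row training set

# Three independent random splits, each with 80% of the rows in the analysis set.
mccv = ShuffleSplit(n_splits=3, train_size=0.8, random_state=2)
for b, (analysis_idx, assessment_idx) in enumerate(mccv.split(X), start=1):
    print(f"Resample {b}: assessment rows = {sorted(assessment_idx.tolist())}")
```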

How many resamples should we use, and what proportion of the data should be used for the analysis set? These choices are partly driven by computing power and training set size. Clearly, using more resamples means better precision.

A thought experiment can be useful for any resampling method. Based on the proportion of data going into the assessment set, how confident would you feel in a performance metric computed from that much data? Suppose that we are computing the RMSE for a regression model. If our assessment sets contained \(n_{pred} = 10\) points (on average), we might expect our metric estimate to be excessively noisy. In that case, we could either increase the proportion held out (better precision, worse bias) or resample more (better precision, same bias, longer computation time).

Each resampling method has different trade-offs between bias and variance. Let’s look at some simulation results to help understand the trade-off for MCCV.

Another Simulation Study

Section 9.1 described a simulated data set that included 200 training set points. In that section, Figure 9.1 illustrates the effect of overfitting using a particular model (KNN). This chapter will use the same training set but with a simple logistic regression model in which a four-degree-of-freedom spline was used for each of the two predictors.

Since these are simulated data, an additional, very large data set was simulated and used to approximate the model’s true performance, once again evaluated using a Brier score. Our logistic model has a Brier score value of \(Q \approx\) 0.0898, which demonstrates a good fit. Using this estimate, the model bias can be computed by subtracting the resampling estimate from 0.0898.

The simulation created 500 realizations of this 200-sample training set7. We resampled the logistic model for each and then computed the corresponding Brier score estimate from MCCV. This process was repeated using the different analysis set proportions and values of B shown in Figure 10.4. The left panel shows that the bias8 decreases as the proportion of data in the analysis set approaches the amount in the training set. It is also apparent that the bias is unaffected by the number of resamples (B). The panel on the right shows the precision of the resampled estimates, estimated using the standard error. This decreases nonlinearly as B increases; the cost-benefit ratio of adding more resamples shows eventual diminishing returns. Regarding the amount retained for the analysis set, proportions close to one have higher precision than the others. Regarding computational costs, the time to resample a model increases with both parameters (the amount retained and B).

Figure 10.4: Variance and bias statistics for simulated data using different configurations for Monte Carlo cross-validation. The ranges of the y-axes are common across similar plots below for different resampling techniques.

The results of this simulation indicate that, in terms of precision, there is little benefit in including more than 70% of the training set in the analysis set. Also, while more resamples always help, there is only incremental benefit to using more than 50 or 60 resamples (for a training set of this size). Regarding bias, the decrease slows down somewhat once about 80% of the training set is retained. Retaining around 75%-80% of the training set for the analysis set might be a reasonable rule of thumb (these values are, of course, subjective and based on this simulation).

10.4 V-Fold Cross-Validation

V-fold cross-validation (sometimes called K-fold cross-validation) (Stone 1974) is the most well-known resampling method9. It randomly allocates the training data to one of V groups of about equal size, called a “fold” (a stratified allocation can also be used). There are B = V iterations where the analysis set comprises V -1 of the folds, and the remaining fold defines the assessment set. For example, for 10-fold cross-validation, the ten analysis sets consist of 90% of the data, and the ten assessment sets contain the remaining 10% of the data and are used to quantify performance. As a visual illustration, Figure 10.5 shows the process for V = 3 (for brevity), for a data set with 30 samples.

Figure 10.5: An example mapping with \(V = 3\)-fold cross-validation.
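As a brief sketch (again assuming scikit-learn), KFold produces this mapping; the V = 3, 30-row configuration mirrors Figure 10.5, and each row lands in exactly one assessment set.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(30).reshape(-1, 1)   # stand-in for a 30-row training set

# V = 3: each analysis set holds two of the three folds (about 2/3 of the rows),
# and the assessment sets are mutually exclusive.
cv = KFold(n_splits=3, shuffle=True, random_state=3)
for fold, (analysis_idx, assessment_idx) in enumerate(cv.split(X), start=1):
    print(f"Fold {fold}: {len(analysis_idx)} analysis rows, "
          f"{len(assessment_idx)} assessment rows")
```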

Arlot and Celisse (2010) is a comprehensive survey of cross-validation. Let’s look at two special cases.

Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV, also called the jackknife) sets \(V = n_{tr}\). A single sample is withheld and, over \(n_{tr}\) iterations, a set of models is created, each with an analysis set of \(n_{tr} - 1\) samples. The resampling estimate is created by applying the metric function to the resulting \(n_{tr}\) predicted values. There are no replicate performance values; only a single \(\hat{Q}\). As one might expect, the bias is nearly zero for this method. Although it would be difficult to quantify, the standard error of the estimator should be very large.

This method is extremely computationally expensive10 and is not often used in practice unless computational shortcuts are available.

Repeated Cross-Validation

One way to improve the precision of this method is to use repeated V-fold cross-validation. In this case, V-fold CV is repeated multiple times using different random number seeds. For example, two repeats of 10-fold cross-validation11 create two sets of V folds, which are treated as a collection of twenty resamples (i.e., B = V \(\times\) R for R repeats). Again, increasing the number of resamples does not generally change the bias.
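A sketch of the repeated scheme, again assuming scikit-learn: RepeatedKFold reruns the fold assignment with a different shuffle for each repeat, yielding B = V × R resamples.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(200).reshape(-1, 1)   # stand-in for a 200-row training set

# Two repeats of 10-fold CV: each repeat shuffles the rows into folds anew.
rcv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=4)
print(sum(1 for _ in rcv.split(X)))   # 20 resamples in total
```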

Variance and Bias for V-Fold Cross-Validation

The discussions of LOO and repeated cross-validation raise the question: what value of V should we use? Sometimes the choice of V is driven by computational costs; smaller values of V require less computation time. A second consideration when choosing V is how well the performance metric is estimated, in terms of bias and variance. As with the other resampling methods, the degree of bias is determined by how much data are placed in the assessment sets (which is dictated by V). As V increases, bias decreases; V = 5 has substantially more bias than V = 10. Fushiki (2011) describes post hoc methods for reducing the bias of this resampling method.

Compared to other resampling methods, V-fold cross-validation generally has a smaller bias, but relatively poor precision (that would be improved via repeats).

To understand the trade-off, the same simulation approach from the previous section is used. Figure 10.6 shows bias and variance results, where the x-axis is the total number of resamples \(V\times R\). As expected, the bias decreases with V and is constant over the number of resamples. Five-fold cross-validation stands out as particularly bad, with a percent bias of roughly 3.9%. Using ten folds decreases this to about 2.2%. The precision values show that smaller values of V have better precision; 10-fold CV needs substantially more resamples to reach the same standard error as V = 5.

Figure 10.6: Variance and bias statistics for simulated data using different configurations for \(V\)-fold cross-validation.

Is there any advantage to using V > 10? Not much. The decrease in bias is small and many more replicates are required to reach the same precision as V = 10. For example, twice as many resamples are required for 20-fold CV to match the variance of 10-fold CV.

For this simulation, the properties are about the same when MCCV and 10-fold CV are matched in terms of number of resamples and the amount allocated to the assessment set.

Our recommendation is to almost always use V = 10. The bias and variance improvements are both good with 10 folds. Reducing bias in 5-fold cross-validation is difficult but the precision for 10-fold can be improved by increased replication.

10.5 The Bootstrap

The bootstrap (Efron 1979, 2003; Davison and Hinkley 1997; Efron and Hastie 2016) is a resampling methodology originally created to compute the sampling distribution of statistics using minimal probabilistic assumptions12. In this chapter, we’ll define a bootstrap sample and show how it can be used to measure fit quality for predictive models.

For a training set with \(n_{tr}\) data points, a bootstrap resample takes a random sample of the training set that is also of size \(n_{tr}\). It does this by sampling with replacement; when each of the \(n_{tr}\) samples is drawn, it has no memory of the prior selections. This means that each row can be randomly selected again. For example, Figure 10.7 shows another schematic for three bootstrap samples. In this figure, the thirteenth training set sample was selected to go into the first analysis set three times. For this reason, a bootstrap resample will contain multiple replicates of some training set points, while others will not be selected at all. The data that were never selected are used to create the assessment set. This means that while each analysis set has the same number of data points as the training set, the assessment sets will have varying numbers of data points.

For the bootstrap, we’ll refer to the mean estimate (line 12 in Algorithm 10.1) as the “ordinary” estimator (others will follow shortly).

Figure 10.7: A schematic of three bootstrap resamples created from an initial pool of 30 data points.
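As a minimal sketch using only NumPy, a single bootstrap resample draws \(n_{tr}\) rows with replacement, and the never-selected ("out-of-bag") rows become the assessment set:

```python
import numpy as np

rng = np.random.default_rng(5)
n_tr = 30
rows = np.arange(n_tr)

# Analysis set: n_tr rows drawn with replacement (so some rows repeat).
analysis_idx = rng.choice(rows, size=n_tr, replace=True)
# Assessment set: the rows that were never drawn.
assessment_idx = np.setdiff1d(rows, analysis_idx)

print(f"{len(np.unique(analysis_idx))} unique analysis rows, "
      f"{len(assessment_idx)} assessment rows")
```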

The probability that a given data point is picked on any single draw is \(1 / n_{tr}\) and, from this, the probability that a training set point is never selected is

\[\prod_{i=1}^{n_{tr}} \left(1 - \frac{1}{n_{tr}}\right) = \left(1 - \frac{1}{n_{tr}}\right)^{n_{tr}}\approx e^{-1} = 0.368\]

Since the bootstrap sample contains the selected data, each training point is selected at least once with probability \(1 - e^{-1} \approx 0.632\). The implication is that, on average, the analysis set contains about 63.2% unique training set points and the assessment set includes, on average, 36.8% of the training data. When comparing the number of unique training set points excluded by the bootstrap method to those excluded by V-fold cross-validation, the bootstrap method is roughly equivalent to using \(V=3\) in cross-validation.
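A quick numerical check of the approximation (nothing beyond the formula above is assumed):

```python
import math

for n_tr in (30, 200, 1000):
    p_never = (1 - 1 / n_tr) ** n_tr   # probability a row is never drawn
    print(f"n_tr = {n_tr:5d}: {p_never:.4f}  (e^-1 = {math.exp(-1):.4f})")
```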

Figure 10.8 shows bias and variance for the simulated data. As one might expect, there is considerable bias in the bootstrap estimate of performance. In comparison to the corresponding plot for MCCV, the bootstrap bias is worse than the MCCV curve where 60% were held out. The curve is also flat; the bias doesn’t go away by increasing B.

However, the precision is extremely good, even with very few resamples.

Figure 10.8: Variance and bias statistics for simulated data using different configurations for the bootstrap.

Correcting for Bias

There have been some attempts to de-bias the ordinary bootstrap estimate. The “632 estimator” (Efron 1983) uses the ordinary bootstrap estimate (\(\hat{Q}_{bt}\)) and the resubstitution estimate (\(\hat{Q}_{rsub}\)) together:

\[\hat{Q}_{632} = e^{-1}\, \hat{Q}_{rsub} + (1 - e^{-1})\, \hat{Q}_{bt} = 0.368\,\hat{Q}_{rsub} + 0.632\,\hat{Q}_{bt}\] Figure 10.8 shows that there is a significant drop in the bias when using this correction. The 632 estimator combines two different statistical estimates, and we only know the standard error of \(\hat{Q}_{bt}\). Therefore, the right-hand panel does not show a standard error curve for the 632 estimator.

Let’s look at an average example from the simulated data sets with B = 100. Table 10.1 has the estimator values and their intermediate values. Let’s first focus on the values in the column labeled “Logistic.” The ordinary bootstrap estimate was \(\hat{Q}_{bt}\) = 0.101 and repredicting the training set produced \(\hat{Q}_{rsub}\) = 0.0767. The 632 estimate shifts the Brier score downward to \(\hat{Q}_{632}\) = 0.0919. This reduces the pessimistic bias; our large-sample estimate of the true Brier score is 0.0898, so we are closer to that value.

Another technique, the 632+ estimator (Efron and Tibshirani 1997), uses the same blending strategy but with dynamic weights based on how much the model overfits (if at all). It factors in a model’s “no-information rate”: the metric value if the predicted and true outcome values were independent. The authors give a formula for this, but it can also be estimated using a permutation approach where we repeatedly shuffle the outcome values and compute the metric. We will denote this value as \(\hat{Q}_{nir}\). For our simulated data set, \(\hat{Q}_{nir}\) = 0.427; we believe that this is the worst-case value for our metric.

We then compute the relative overfitting rate (ROR) as

\[ ROR = \frac{\hat{Q}_{bt}-\hat{Q}_{rsub}}{\hat{Q}_{nir} -\hat{Q}_{rsub}} \]

The denominator measures the range of the metric (from most optimistic to most pessimistic). The numerator measures the optimism of our ordinary estimator. A value of zero implies that the model does not overfit, and a value of one indicates the opposite. For our logistic model, the ratio is 0.024 / 0.35 = 0.0686, indicating that the logistic model is not overinterpreting the training set.

The final 632+ estimator uses a different weighting system to combine estimates:

\[ \begin{align} \hat{Q}_{632+} &= (1 - \hat{w})\, \hat{Q}_{rsub} + \hat{w}\, \hat{Q}_{bt} \quad \text{with}\notag \\ \hat{w} &= \frac{0.632}{1 - 0.368ROR} \notag \end{align} \] Plugging in our values, the final weight on the ordinary estimator (\(w\) = 0.648) is very close to what is used by the regular 632 estimator (\(w\) = 0.632). The final estimate is also similar: \(\hat{Q}_{632+}\) = 0.0923.
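The arithmetic behind both corrections, using the rounded Table 10.1 values for the logistic model (so the outputs can differ from the text in the last digit):

```python
import math

q_bt, q_rsub, q_nir = 0.101, 0.0767, 0.427   # ordinary, resubstitution, no-information rate

# 632 estimator: fixed weights of 0.368 and 0.632.
q_632 = math.exp(-1) * q_rsub + (1 - math.exp(-1)) * q_bt

# 632+ estimator: the weight on the ordinary estimate grows with the
# relative overfitting rate (ROR).
ror = (q_bt - q_rsub) / (q_nir - q_rsub)
w = 0.632 / (1 - 0.368 * ror)
q_632_plus = (1 - w) * q_rsub + w * q_bt

print(f"ROR = {ror:.3f}, w = {w:.3f}")
print(f"632 estimate = {q_632:.4f}, 632+ estimate = {q_632_plus:.4f}")
```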

                             Brier Score
                          Logistic     1 NN
Estimates
  (truth)                    0.090    0.170
  resubstitution             0.077    0.000
  simple mean                0.101    0.174
Intermediates
  no information rate        0.427    0.496
  relative overfitting rate  0.069    0.351
  weights                    0.648    0.726
Final Estimates
  632                        0.092    0.110
  632+                       0.092    0.126

Table 10.1: Bias correction values for two models on a simulated data set.

The simulation studies shown here in Figure 10.8 replicate what Molinaro (2005) found: the 632+ estimator has bias properties about the same as 10-fold cross-validation.

How do these bias-corrected estimators work when there is extreme overfitting? One way to tell is to consider a 1-nearest neighbor model. In this case, re-predicting the training set returns each data point’s own outcome value (i.e., a “perfect” model). For the Brier score, this means \(\hat{Q}_{rsub}\) is zero. Table 10.1 shows the rest of the computations. For this model \(Q\approx\) 0.17 and the ordinary estimate comes close: \(\hat{Q}_{bt}\) = 0.174.

The 632 estimate is \(\hat{Q}_{632} = 0.632\,\hat{Q}_{bt} =\) 0.11; this is too much bias reduction as it overshoots the true value by a large margin. The 632+ estimator is slightly better. The ROR value is higher (0.351) leading to a higher weight on the ordinary estimator to produce \(\hat{Q}_{632+}\) = 0.126.

Should we use the bootstrap to compare models? Its small variance is enticing, but the bias remains an issue. The bias is likely to change as a function of the magnitude of the metric. For example, the bootstrap’s bias is probably different for models with 60% and 95% accuracy. If we think our models will have about the same performance, the bootstrap (with a bias correction) may be a good choice. Note that Molinaro (2005) found that the 632+ estimator may not perform well when the training set is small (especially if the number of predictors is large).

Let’s take a look at a few specialized resampling methods.

10.6 Time Series Data

Time series data (Hyndman and Athanasopoulos 2024) are sequential observations ordered over time. The most common example is daily stock prices. These data are a special case due to autocorrelation: rows of the data set are not independent. This can occur for many different reasons (e.g., underlying seasonal trends), and this dependency complicates the resampling process in two primary ways. We need to think more carefully about how analysis and assessment sets are created:

  • Randomly allocating samples to these sets would disrupt the relationships among the samples due to autocorrelation. If the autocorrelation is important information related to the response, then random sampling will negatively impact a model’s ability to uncover the predictive relationship with the response.
  • Placing newer data in the analysis set and older data in the assessment set would not make sense because of the time trends in the data.

Rolling forecasting origin splitting (Tashman 2000) is a resampling scheme that mimics the train/test splitting pattern from Section 3.4 where the most recent data are used in the assessment set. We would choose how much data (or what range of time) to use for the analysis and assessment sets, then slide this pattern over the whole data set. Figure 10.9 illustrates this process where each block could represent some unit of time. For example, we might build the model on six months of data and then predict the next month. The next resample iteration could bump that forward by a month or some other period (so that the assessment sets can be distinct or overlapping).

Figure 10.9: A schematic of rolling origin resampling for time series data. Each box corresponds to a unit of time (e.g., day, month, year, etc.).

We also have the choice to cumulatively expand the analysis data from the start of the training set to the end of the current slice. For example, in Resample 2 of Figure 10.9, the model would be fit using samples 1-9; in Resample 3, the model would be fit using samples 1-10; and so on. This approach may be useful when the number of time periods is small and we would like to include as much of the historical data as possible in each resample.
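A minimal sketch of the rolling scheme, written from scratch (the window sizes are assumptions chosen to match the six-periods-fit, one-period-predict example above):

```python
def rolling_origin(n, analysis_size, assessment_size, shift, cumulative=False):
    """Yield (analysis, assessment) index lists for rolling forecasting origin resampling."""
    start = 0
    while start + analysis_size + assessment_size <= n:
        stop = start + analysis_size
        analysis = list(range(0 if cumulative else start, stop))   # optionally expand from time 0
        assessment = list(range(stop, stop + assessment_size))     # the next block of time
        yield analysis, assessment
        start += shift

# Twelve time periods: fit on six consecutive periods, predict the next one,
# then slide the window forward by one period.
for b, (fit_idx, pred_idx) in enumerate(rolling_origin(12, 6, 1, 1), start=1):
    print(f"Resample {b}: analysis {fit_idx[0]}-{fit_idx[-1]}, assessment {pred_idx}")
```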

We’ll use rolling forecasting origin resampling in one of the upcoming case studies using the hotel rate data previously discussed in Section 6.1.

10.7 Spatial Data

Section 3.9 described the problem of spatial correlation between locations and a method for creating an initial split for the forestation data.

Like time-series data, the resampling scheme for this type of data emulates the initial splitting process. We could create groupings for the training set using the following methods:

  • Clustering methods: These methods create partitions of the data where points within the same cluster are spatially closer to each other than to points in other clusters.
  • Grid-based methods: A grid of equally sized blocks, either rectangular or hexagonal, is created to encompass the training set points.

From these groupings, we can apply procedures similar to previous cross-validation variations:

  • Leave-one-group-out resampling: This method creates as many resamples as there are groups. In each resample, data points from one group are used as the assessment set, while the remaining groups are used to train the model. This process repeats for each group.
  • V-fold cross-validation: This method assigns each group to one of V meta-groups (e.g., folds). In each of the V iterations, one fold is left out as the assessment set, while the remaining folds are used to train the model.
  • Repeated V-fold cross-validation: This method involves restarting the grouping process with a different random number seed, allowing for multiple rounds of V-fold cross-validation.

Each of these approaches can be augmented using a buffer to exclude very close neighbors from being used.
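As a rough sketch of the grid-based approach (the coordinates, grid size, and fold assignment rule are all made up for illustration, and the buffer step is only noted in a comment):

```python
import numpy as np

rng = np.random.default_rng(6)
coords = rng.uniform(0, 100, size=(500, 2))   # hypothetical (x, y) locations

# Grid-based grouping: a 5 x 5 lattice of square blocks covering the coordinates.
n_blocks = 5
block_xy = np.floor(coords / (100 / n_blocks)).astype(int).clip(max=n_blocks - 1)
block_id = block_xy[:, 0] * n_blocks + block_xy[:, 1]

# Systematically assign the blocks to V = 10 folds.
V = 10
fold_of_block = {b: i % V for i, b in enumerate(np.unique(block_id))}
fold = np.array([fold_of_block[b] for b in block_id])

# First resample: fold 0 forms the assessment set; a distance-based buffer could
# additionally drop analysis points that sit near those blocks.
assessment_idx = np.where(fold == 0)[0]
analysis_idx = np.where(fold != 0)[0]
print(len(analysis_idx), len(assessment_idx))
```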

Going back to our forestry data, we previously used hexagonal blocks to split the data (see Figure 3.5). The same blocking/buffering scheme was re-applied to the training set in order to create a 10-fold cross-validation. Recall that a 25 x 25 grid was used. For the training set, some of these hexagons contain no data (since their locations were previously allocated to the test set). The remaining hexagons were systematically assigned to the ten folds. Figure 10.10 shows this process for the first fold. The figure shows the blocks of data, along with the corresponding buffer for this iteration of cross-validation. The purple blocks are, for the most part, not adjacent to one another. These are combined into the first assessment set. The other locations that are not in the buffers are pooled into the first analysis set.

Figure 10.10: An example of one split produced by block cross-validation on the forestation data. The analysis (magenta) and assessment (purple) sets are shown with locations in the buffer in black. The empty spaces represent the data reserved for the test set (see Figure 3.5).

Our training set contains 4,834 locations. Across the ten folds, every location appears in exactly one assessment set.

Just as with standard V-fold cross-validation, the number of folds can be chosen based on bias and variance concerns. Unlike standard resampling, any investigation into this topic would depend on a specific spatial autocorrelation structure. As such, we have decided to follow the recommendations for non-spatial data and use V=10. If we find that the variation in our statistics estimated from ten resamples is too large, we can employ the same process of repeating the block resampling process using different random number seeds. For our data, the average number of locations in the assessment sets is 483 and this appears large enough to produce sufficiently precise performance statistics.

10.8 Grouped or Multi-Level Data

When rows of data are correlated for other reasons, such as those discussed in Section 3.8, we can extend the techniques described above. Consider the previously mentioned example of having the purchase histories of many users from a retail database. If we had a training set of 10,000 instances that included 100 customers, we could conduct V-fold cross-validation to determine which customers would go into the analysis and assessment sets. As with the initial split, any rows associated with specific customers would be placed into the same partition (i.e., analysis or assessment).
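A short sketch, assuming scikit-learn’s GroupKFold and a made-up customer id column, to show that all rows for a customer stay in the same partition:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(7)
n_rows, n_customers = 10_000, 100
customer = rng.integers(0, n_customers, size=n_rows)   # customer id for each row
X = rng.normal(size=(n_rows, 3))                       # hypothetical predictors

# Grouped 10-fold CV: folds are formed from customers, not individual rows.
gcv = GroupKFold(n_splits=10)
for analysis_idx, assessment_idx in gcv.split(X, groups=customer):
    shared = np.intersect1d(customer[analysis_idx], customer[assessment_idx])
    assert shared.size == 0   # no customer appears in both partitions
```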

10.9 Frequently Asked Questions

The general concept of resampling seems clear and straightforward. However, we regularly encounter confusion and commonly asked questions about this process. In this section we will address these common questions and misconceptions.

Is it bad to get different results from each resample?

The point of resampling is to see how the model pipeline changes when the data change. There is a good chance that some important estimates differ from resample to resample. For example, when algorithmically filtering predictors, you might get B different predictor lists, and that’s okay. It does give you a sense of how much noise is in that part of your model pipeline.

It can be good to examine the distribution of resampling performance. Occasionally, we may find a resample that has unusual performance with respect to the rest of the resamples. This may merit further investigation as to why the resample is unusual.

Can/should I use the individual resampling results to change the model pipeline?

We often hear questions about how to use the cross-validated results to define the model. For example, using the top X% of predictors selected during resampling as the final predictor set in a new model pipeline seems intuitive.

One important thing to consider is that you now have a different model pipeline than the one that was just resampled. To understand the performance of your newly informed model, you need to resample a model pipeline that algorithmically selects the top X% of predictors.

Note that resampling the specific set of top predictors determined from the resamples is not appropriate. We have seen this done many times before in the hopes of improving predictive performance. Performance often improves for the assessment sets. But the performance cannot be generalized to the test set or for other new samples. The rule to determine the top set should be resampled, not the specific output of that rule.

For the original pipeline, the final predictor set is determined when you fit that pipeline on the training set. The resampling results tell you about what could have happened; the training set results show what did happen.

What happens to the B trained models? Which one should I keep?

The B models fit within resampling are only used to estimate how well the model fits the data. You don’t need them after that. You can retain them for diagnostic purposes though. As mentioned above, using the replicate models to look into how parts of the model change can be helpful.

Also, if you are not using a single validation set, recall that none of these models are trained with the unaltered training set, so none would be the final model.

Is this some sort of ensemble?

No. We’ll talk more about ensembles in ?sec-ensembles and how the assessment set predictions can be used to create an ensemble.

Can I accidentally resample incorrectly?

Unfortunately, yes. As previously mentioned, improper data usage is one of the most frequent ways that machine learning models silently fail; that is, you won’t know that there is a problem until the next set of labeled data arrives. A good example discussed in ?sec-removing-predictors is Ambroise and McLachlan (2002).

If you are

  • training every part of your model pipeline after the initial data split, using only the training set, and

  • evaluating performance with external data

the risk of error is low.

What is nested resampling?

This is a version of resampling where an additional layer of resampling occurs. For example, suppose you are using 10-fold cross-validation. Within each of the 10 iterations, you might add 20 bootstrap iterations (which resample the analysis set). That ends up training 200 distinct models13.

For a single model, that’s not very useful. However, as discussed in Section 11.4, there are situations where the analysis set is being used for too many purposes. For example, during model tuning, you might use resampling to find the best model and measure its performance. In certain cases, that can lead to bad results for the latter task.

Another example is recursive feature selection, where we are trying to rank predictors, sequentially remove them, and determine how many to remove. In this case, it’s a good idea to use an outer resampling loop to determine where to stop and an inner loop for the other tasks.

We will see nested resampling in the next chapter.

Is resampling related to permutation tests?

Not really. Permutation methods are similar to resampling only in that they perform multiple calculations on different versions of the training set.

We used permutation methods in the bootstrapping section to estimate the no-information rate. Otherwise, they won’t be used to evaluate model performance.

Are sub-sampling or down-sampling the same as resampling?

No. These are techniques to modify your training set (or analysis set) to rebalance the data when a classification problem has rare events. They’ll be discussed in ?sec-imbalance-sampling.

Why do I mostly hear about validation sets?

As discussed in Section 1.6, stereotypical deep learning models are trained on very large sets of data. There is very little reason to use multiple resamples in these instances, and deep learning takes up a lot of space in the media and social media.

Generally, every data set that is not massive could benefit from multiple resamples.

Chapter References

Ambroise, C, and G McLachlan. 2002. “Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data.” Proceedings of the National Academy of Sciences 99 (10): 6562–66.
Arlot, S, and A Celisse. 2010. “A Survey of Cross-Validation Procedures for Model Selection.” Statistics Surveys 4: 40–79.
Bates, S, T Hastie, and R Tibshirani. 2023. “Cross-Validation: What Does It Estimate and How Well Does It Do It?” Journal of the American Statistical Association, 1–12.
Davison, A, and D Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge University Press.
Efron, B. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1): 1–26.
Efron, B. 1983. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation.” Journal of the American Statistical Association, 316–31.
Efron, B. 2003. “Second Thoughts on the Bootstrap.” Statistical Science, 135–40.
Efron, B, and T Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Efron, B, and R Tibshirani. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60.
Fushiki, T. 2011. “Estimation of Prediction Error by Using k-Fold Cross-Validation.” Statistics and Computing 21: 137–46.
Hyndman, RJ, and G Athanasopoulos. 2024. Forecasting: Principles and Practice, 3rd Edition. Otexts.
Molinaro, A. 2005. “Prediction Error Estimation: A Comparison of Resampling Methods.” Bioinformatics 21 (15): 3301–7.
Stone, M. 1974. “Cross-Validatory Choice and Assessment of Statistical Predictions.” Journal of the Royal Statistical Society: Series B (Methodological) 36 (2): 111–33.
Tashman, L. 2000. “Out-of-Sample Tests of Forecasting Accuracy: An Analysis and Review.” International Journal of Forecasting 16 (4): 437–50.

  1. These names are not universally used. We only invent new terminology to avoid confusion; people often refer to the data in our analysis set as the “training set” because it has the same purpose.↩︎

  2. This discussion is very similar to the variance-bias tradeoff discussed in Section 8.4. The context here differs, but the themes are the same.↩︎

  3. Meaning that performance looks worse than it should. For example, smaller \(R^2\) or inflated Brier score.↩︎

  4. The split can be made using the same tools already discussed, such as completely random selection, stratified random sampling, etc.↩︎

  5. Also, if \(n_{tr}\) is not “large,” you shouldn’t use a validation set anyway.↩︎

  6. Again, this could be accomplished using any of the splitting techniques described in Chapter 3.↩︎

  7. The simulation details, sources, and results can be found at https://github.com/topepo/resampling_sim.↩︎

  8. The percent bias is calculated as \(100(Q_{true} - \hat{Q})/Q_{true}\) for metrics where smaller is better.↩︎

  9. Arlot and Celisse (2010) provides a comprehensive survey of cross-validation.↩︎

  10. However, for linear regression, the LOOCV predictions can be quickly computed without refitting a large number of models.↩︎

  11. This isn’t the same as using V = 20. That scheme has very different bias properties than two repetitions of V = 10.↩︎

  12. This application will be discussed in ?sec-boot-intervals.↩︎

  13. We will refer to the first resample (10-fold CV) as the outer resample and the bootstrap as the inner resample.↩︎