Applied Machine Learning for Tabular Data

Authors

Max Kuhn

Kjell Johnson

Published

2024-03-18

Preface

Welcome! This is a work in progress. We want to create a practical guide to developing quality predictive models from tabular data. We’ll publish materials here as we create them and welcome community contributions in the form of discussions, suggestions, and edits.

We also want these materials to be reusable and open. The sources are in a GitHub repository with a Creative Commons license attached (see below).

Our intention is to write these materials and, when we feel we’re done, pick a publishing partner to produce a print version.

The book takes a holistic view of the predictive modeling process and focuses on a few areas that are usually left out of similar works. For example, the effectiveness of the model can be driven by how the predictors are represented. Because of this, we tightly couple feature engineering methods with machine learning models. Also, quite a lot of work happens after we have determined our best model and created the final fit. These post-modeling activities are an important part of the model development process and will be described in detail.

We deliberately avoid using the term “artificial intelligence.” Eugen Rochko’s (@Gargron@mastodon.social) comment on Mastodon does a good job of summarizing our reservations regarding the term:

It’s hard not to say “AI” when everybody else does too, but technically calling it AI is buying into the marketing. There is no intelligence there, and it’s not going to become sentient. It’s just statistics, and the danger they pose is primarily through the false sense of skill or fitness for purpose that people ascribe to them.

To cite this website, we suggest:

@online{aml4td,
  author = {Kuhn, M and Johnson, K},
  title = {{Applied Machine Learning for Tabular Data}},
  year = {2023},
  url = {https://aml4td.org},
  urldate = {2024-01-02}
}

License

This work is licensed under CC BY-NC-SA 4.0.

This license requires that reusers give credit to the creator. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. If others modify or adapt the material, they must license the modified material under identical terms.

  • BY: Credit must be given to the creator.
  • NC: Only noncommercial uses of the work are permitted. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation.
  • SA: Adaptations must be shared under the same terms.

Our goal is to have an open book where people can reuse and reference the materials but can’t just put their names on them and resell them (without our permission).

Intended Audience

Our intended audience includes data analysts of many types: statisticians, data scientists, professors and instructors of machine learning courses, laboratory scientists, and anyone else who wants to understand how to create a model for prediction. We don’t expect readers to be experts in these methods or the math behind them. Instead, our approach throughout this work is applied. That is, we want readers to use this material to build intuition about the predictive modeling process. What are good and bad ideas for the modeling process? What pitfalls should we look out for? How can we be confident that the model will be predictive for new samples? What are the advantages and disadvantages of different types of models? These are just some of the questions that this work will address.

Some background in modeling and statistics will be extremely useful. Having seen or used basic regression models is good, and an understanding of basic statistical concepts such as variance, correlation, populations, and samples is needed. There will also be some mathematical notation, so you’ll need to be able to follow these abstractions, but we will keep the math to the parts where it is absolutely necessary. There are a few more statistically sophisticated sections for some of the more advanced topics.

If you would like a more theoretical treatment of machine learning models, then we recommend Hastie, Tibshirani, and Friedman (2017). Other books for gaining a more in-depth understanding of machine learning are Bishop and Nasrabadi (2006), Arnold, Kane, and Lewis (2019) and, for more of a deep learning focus, Goodfellow, Bengio, and Courville (2016) and/or Prince (2023).

Is there code?

We want to decouple the content of this work from specific software. One of our other books on modeling had computing sections, and many people found them to be a useful resource at the time of the book’s publication. However, code quickly becomes outdated in today’s computational environment, and that information takes up a lot of page space that would be better used for other topics.

We will create computing supplements to go along with the materials. Since we use R’s tidymodels framework for the calculations, a tidymodels supplement is the one currently in progress.

If you are interested in working on a python/scikit-learn supplement, please file an issue.
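
To give a sense of that framework, here is a minimal sketch of a tidymodels fit. The data set (mtcars) and the linear regression model are illustrative choices of ours, not examples from the book:

library(tidymodels)

# Hold out 20% of the (illustrative) mtcars data for testing
set.seed(1)
car_split <- initial_split(mtcars, prop = 0.80)
car_train <- training(car_split)
car_test <- testing(car_split)

# A recipe declares the outcome, the predictors, and preprocessing steps
car_rec <- recipe(mpg ~ ., data = car_train) |>
  step_normalize(all_numeric_predictors())

# A workflow bundles the preprocessing recipe with a model specification
car_wflow <- workflow() |>
  add_recipe(car_rec) |>
  add_model(linear_reg())

# Fit on the training set and predict the holdout samples
car_fit <- fit(car_wflow, data = car_train)
predict(car_fit, new_data = car_test)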

Are there exercises?

Many readers found the Exercise sections of Applied Predictive Modeling helpful for solidifying the concepts presented in each chapter. The current set can be found at exercises.aml4td.org.

How can I ask questions?

If you have questions about the content, it is probably best to ask on a public forum, such as Cross Validated. You’ll most likely get a faster answer there if you take the time to ask your question in the best way possible.

If you want a direct answer from us, follow what we call Yihui’s Rule: add an issue to GitHub (labeled as “Discussion”) first. It may take some time for us to get back to you.

If you think there is a bug, please file an issue.

Can I contribute?

There is a contributing page with details on how to get up and running to compile the materials (there are a lot of software dependencies) and suggestions on how to help.

If you just want to fix a typo, you can make a pull request to alter the appropriate .qmd file.

Please feel free to improve the quality of this content by submitting pull requests. A merged PR will make you appear in the contributor list. It will, however, be considered a donation of your work to this project. You are still bound by the conditions of the license, meaning that you are not considered an author, copyright holder, or owner of the content once it has been merged in.

Computing Notes

Quarto was used to compile and render the materials.

Quarto 1.4.533
[✓] Checking versions of quarto binary dependencies...
      Pandoc version 3.1.11: OK
      Dart Sass version 1.69.5: OK
      Deno version 1.37.2: OK
[✓] Checking versions of quarto dependencies......OK
[✓] Checking Quarto installation......OK
      Version: 1.4.533
[✓] Checking tools....................OK
      TinyTeX: (external install)
      Chromium: (not installed)
[✓] Checking LaTeX....................OK
      Using: TinyTex
      Version: 2022
[✓] Checking basic markdown render....OK
[✓] Checking Python 3 installation....OK
      Version: 3.8.13 (Conda)
      Jupyter: (None)
      Jupyter is not available in this Python installation.
[✓] Checking R installation...........OK
      Version: 4.3.2
      LibPaths:
      knitr: 1.45
      rmarkdown: 2.25
[✓] Checking Knitr engine render......OK

R version 4.3.2 (2023-10-31) was used for the majority of the computations, along with torch 1.13.1 (the libtorch backend used by the R torch package). The versions of the primary R modeling and visualization packages used here are:

applicable (0.1.0) baguette (1.0.1) bestNormalize (1.9.1)
bonsai (0.2.1) broom (1.0.5) brulee (0.2.0)
C50 (0.1.8) Cubist (0.4.2.1) DALEXtra (2.3.0)
dbarts (0.9-23) desirability2 (0.0.1) dials (1.2.0)
dimRed (0.2.6) discrim (1.0.1) doMC (1.3.8)
dplyr (1.1.4) e1071 (1.7-13) earth (5.3.2)
embed (1.1.3) finetune (1.1.0) GA (3.2.3)
gganimate (1.0.8) ggiraph (0.8.7) ggplot2 (3.4.4)
glmnet (4.1-8) gt (0.10.0) hardhat (1.3.0)
ipred (0.9-14) irlba (2.3.5.1) kernlab (0.9-32)
kknn (1.3.1) klaR (1.7-2) lightgbm (3.3.5)
mda (0.5-4) mgcv (1.9-0) mixOmics (6.25.1)
modeldata (1.2.0) modeldatatoo (0.2.1) pamr (1.56.1)
parsnip (1.1.1) partykit (1.2-20) patchwork (1.1.3)
plsmod (1.0.0) probably (1.0.2) pROC (1.18.5)
purrr (1.0.2) ragg (1.2.6) ranger (0.16.0)
recipes (1.0.8) rpart (4.1.21) rsample (1.2.0)
rstudioapi (0.15.0) rules (1.0.2) sparsediscrim (0.3.0)
sparseLDA (0.1-9) spatialsample (0.5.1) splines2 (0.5.1)
stacks (1.0.3) stopwords (2.3) textrecipes (1.0.6.9000)
themis (1.0.2) tidymodels (1.1.1) tidyposterior (1.0.1)
tidyr (1.3.0) torch (0.11.0) tune (1.1.2)
usethis (2.2.2) VBsparsePCA (0.1.0) workflows (1.1.3)
workflowsets (1.0.1) xgboost (1.7.5.1) xrf (0.2.2)
yardstick (1.2.0)

Chapter References

Arnold, T, M Kane, and B Lewis. 2019. A Computational Approach to Statistical Learning. Chapman & Hall/CRC.
Bishop, C M, and N M Nasrabadi. 2006. Pattern Recognition and Machine Learning. Vol. 4. Springer.
Goodfellow, I, Y Bengio, and A Courville. 2016. Deep Learning. MIT Press.
Hastie, T, R Tibshirani, and J Friedman. 2017. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
Prince, S. 2023. Understanding Deep Learning. MIT Press.