
References

Abramson, I.S. (1982). On bandwidth variation in kernel estimates-a square root law. The Annals of Statistics 10(4), 1217–1223. DOI: 10.1214/aos/1176345986

Afendras, G., Markatou, M. (2019). Optimality of training/test size and resampling effectiveness in cross-validation. Journal of Statistical Planning and Inference 199, 286–301. DOI: 10.1016/j.jspi.2018.07.005

Aitchison, J., Aitken, C.G.G. (1976). Multivariate binary discrimination by the kernel method. Biometrika 63(3), 413–420. DOI: 10.1093/biomet/63.3.413

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723. DOI: 10.1109/TAC.1974.1100705

Akaike, H. (1972). Information theory and an extension of the maximum likelihood principle, in: Proceedings of the 2nd International Symposium on Information Theory. pp. 267–281.

Amit, Y., Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9(7), 1545–1588. DOI: 10.1162/neco.1997.9.7.1545

Andoni, A., Indyk, P., Razenshteyn, I. (2018). Approximate nearest neighbor search in high dimensions, in: Proceedings of the International Congress of Mathematicians (ICM 2018). pp. 3287–3318. DOI: 10.1142/9789813272880_0182

Arlot, S., Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. arXiv (0909.1884). URL https://arxiv.org/abs/0909.1884

Arlot, S., Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79. DOI: 10.1214/09-SS054

Aslett, L.J.M. (2021). Statistical Machine Learning. URL https://www.louisaslett.com/StatML/notes/

Aslett, L.J.M., Esperança, P.M., Holmes, C.C. (2015). A review of homomorphic encryption and software tools for encrypted statistical machine learning. arXiv (1508.06574). URL https://arxiv.org/abs/1508.06574

Aurenhammer, F. (1991). Voronoi diagrams — a survey of a fundamental geometric data structure. ACM Computing Surveys 23(3), 345–405. DOI: 10.1145/116873.116880

Bach, F. (2021). Learning Theory from First Principles, Draft. ed. URL https://www.di.ens.fr/~fbach/ltfp_book.pdf

Barber, D. (2012). Bayesian Reasoning and Machine Learning, 1st ed. Cambridge University Press. URL http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.Online, ISBN: 978-0521518147

Bartlett, P., Freund, Y., Lee, W.S., Schapire, R.E. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26(5), 1651–1686. DOI: 10.1214/aos/1024691352

Bates, S., Hastie, T., Tibshirani, R. (2021). Cross-validation: What does it estimate and how well does it do it? arXiv (2104.00673). URL https://arxiv.org/abs/2104.00673

Belkin, M., Hsu, D., Ma, S., Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116(32), 15849–15854. DOI: 10.1073/pnas.1903070116

Belson, W.A. (1959). Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society: Series C 8(2), 65–75. DOI: 10.2307/2985543

Bengio, Y., Grandvalet, Y. (2004). No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research 5, 1089–1105. URL https://www.jmlr.org/papers/v5/grandvalet04a.html

Bentley, J.L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM 18(9), 509–517. DOI: 10.1145/361002.361007

Bergmeir, C., Hyndman, R.J., Koo, B. (2018). A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis 120, 70–83. DOI: 10.1016/j.csda.2017.11.003

Bernardo, J.M., Smith, A.F.M. (1994). Bayesian Theory, 1st ed, Wiley Series in Probability and Statistics. Wiley. ISBN: 978-0471494645

Beygelzimer, A., Kakade, S., Langford, J. (2006). Cover trees for nearest neighbor, in: Proceedings of the 23rd International Conference on Machine Learning. pp. 97–104. DOI: 10.1145/1143844.1143857

Bishop, C.M. (2006). Pattern Recognition and Machine Learning, 1st ed, Information Science and Statistics. Springer. URL https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf, ISBN: 978-0387310732

Bohanec, M., Bratko, I. (1994). Trading accuracy for simplicity in decision trees. Machine Learning 15, 223–250. DOI: 10.1023/A:1022685808937

Box, G.E.P., Draper, N.R. (1987). Empirical model-building and response surfaces, 1st ed, Wiley series in probability and statistics. Wiley. ISBN: 978-0471810339

Breiman, L. (2001a). Statistical modeling: The two cultures. Statistical Science 16(3), 199–231. DOI: 10.1214/ss/1009213726

Breiman, L. (2001b). Random forests. Machine Learning 45, 5–32. DOI: 10.1023/A:1010933404324

Breiman, L. (1998). Arcing classifiers. The Annals of Statistics 26(3), 801–849. DOI: 10.1214/aos/1024691079

Breiman, L. (1996). Bagging predictors. Machine Learning 24, 123–140. DOI: 10.1007/BF00058655

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and regression trees, 1st ed. Chapman & Hall/CRC. ISBN: 978-1351460484

Breiman, L., Meisel, W., Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics 19(2), 135–144. DOI: 10.1080/00401706.1977.10489521

Breiman, L., Spector, P. (1992). Submodel selection and evaluation in regression. The \(X\)-random case. International Statistical Review 60(3), 291–319. DOI: 10.2307/1403680

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34(2), 559–583. DOI: 10.1214/009053606000000092

Bühlmann, P., Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science 22(4), 477–505. DOI: 10.1214/07-STS242

Caton, S., Haas, C. (2020). Fairness in machine learning: A survey. arXiv (2010.04053). URL https://arxiv.org/abs/2010.04053

Chen, G.H., Shah, D. (2018). Explaining the success of nearest neighbor methods in prediction. Foundations and Trends in Machine Learning 10(5-6), 337–588. DOI: 10.1561/2200000064

Chen, T., Guestrin, C. (2016). XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 785–794. DOI: 10.1145/2939672.2939785

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., Chen, K., Mitchell, R., Cano, I., Zhou, T., Li, M., Xie, J., Lin, M., Geng, Y., Li, Y. (2021). xgboost: Extreme Gradient Boosting. R package version 1.4.1.1. URL https://CRAN.R-project.org/package=xgboost

Christodoulou, E., Ma, J., Collins, G.S., Steyerberg, E.W., Verbakel, J.Y., Van Calster, B. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology 110, 12–22. DOI: 10.1016/j.jclinepi.2019.02.004

Clopper, C.J., Pearson, E.S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26(4), 404–413. DOI: 10.1093/biomet/26.4.404

Collins, G.S., Moons, K.G.M. (2019). Reporting of artificial intelligence prediction models. Lancet 393, 1577–1579. DOI: 10.1016/S0140-6736(19)30037-6

Collins, G.S., Moons, K.G.M. (2012). Comparing risk prediction models. BMJ 344. DOI: 10.1136/bmj.e3186

Cover, T.M., Hart, P.E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13(1), 21–27. DOI: 10.1109/TIT.1967.1053964

Cox, D.R. (1958). Two further applications of a model for binary regression. Biometrika 45(3-4), 562–565. DOI: 10.1093/biomet/45.3-4.562

Davison, A.C., Hinkley, D.V., Young, G.A. (2003). Recent developments in bootstrap methodology. Statistical Science 18(2), 141–157. DOI: 10.1214/ss/1063994969

DeGroot, M.H., Schervish, M.J. (2012). Probability and Statistics, 4th ed. Pearson. ISBN: 978-0321500465

Devroye, L., Györfi, L., Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition, 1st ed. Springer. ISBN: 978-0387946184

Dietterich, T.G. (2000). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning 40, 139–157. DOI: 10.1023/A:1007607513941

Dietterich, T.G., Kong, E.B. (1995). Machine learning bias, statistical bias, and statistical variance of decision tree algorithms. Technical Report, Department of Computer Science, Oregon State University. URL https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.38.2702&rep=rep1&type=pdf

Dwork, C., Roth, A. (2014). The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3-4), 211–407. DOI: 10.1561/0400000042

Efron, B. (2020). Prediction, estimation, and attribution. Journal of the American Statistical Association 115(530), 636–655. DOI: 10.1080/01621459.2020.1762613

Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association 99(467), 619–632. DOI: 10.1198/016214504000000692

Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association 81(394), 461–470. DOI: 10.2307/2289236

Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association 78(382), 316–331. DOI: 10.2307/2288636

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7(1), 1–26. DOI: 10.1214/aos/1176344552

Efron, B., Tibshirani, R. (1997). Improvements on cross-validation: The .632+ bootstrap method. Journal of the American Statistical Association 92(438), 548–560. DOI: 10.2307/2965703

Epanechnikov, V.A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications 14(1), 153–158. DOI: 10.1137/1114019

Fix, E., Hodges, Jr., J.L. (1951). Discriminatory analysis — nonparametric discrimination: Consistency properties. Technical Report: USAF School of Aviation Medicine (21-49-004.4). DOI: 10.2307/1403797

Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation 121(2), 256–285. DOI: 10.1006/inco.1995.1136

Freund, Y., Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139. DOI: 10.1006/jcss.1997.1504

Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29(5), 1189–1232. DOI: 10.1214/aos/1013203451

Friedman, J.H. (1998). Data mining and statistics: What’s the connection? Computing Science and Statistics 29(1), 3–9.

Friedman, J.H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1, 55–77. DOI: 10.1023/A:1009778005914

Friedman, J., Hastie, T., Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. The Annals of Statistics 28(2), 337–407. DOI: 10.1214/aos/1016218223

Friedman, J.H., Bentley, J.L., Finkel, R.A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3(3), 209–226. DOI: 10.1145/355744.355745

García, S., Derrac, J., Cano, J.R., Herrera, F. (2012). Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3), 417–435. DOI: 10.1109/TPAMI.2011.142

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association 70(350), 320–328. DOI: 10.2307/2285815

Gilmour, S.G. (1996). The interpretation of Mallows’s \(C_p\)-statistic. Journal of the Royal Statistical Society: Series D 45(1), 49–56. DOI: 10.2307/2348411

Glur, C. (2020). data.tree: General Purpose Hierarchical Data Structure. R package version 1.0.0. URL https://CRAN.R-project.org/package=data.tree

Gneiting, T., Raftery, A.E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102(477), 359–378. DOI: 10.1198/016214506000001437

Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning, 1st ed. MIT Press. URL https://www.deeplearningbook.org/, ISBN: 978-0262035613

Gorman, K.B., Williams, T.D., Fraser, W.R. (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLOS ONE 9(3). DOI: 10.1371/journal.pone.0090081

Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871. DOI: 10.2307/2528823

Greenwell, B., Boehmke, B., Cunningham, J., GBM Developers (2020). gbm: Generalized Boosted Regression Models. R package version 2.1.8. URL https://CRAN.R-project.org/package=gbm

Grund, B., Hall, P., Marron, J.S. (1994). Loss and risk in smoothing parameter selection. Journal of Nonparametric Statistics 4(2), 107–132. DOI: 10.1080/10485259408832605

H2O.ai (2021). R Interface for H2O. R package version 3.32.1.3. URL https://github.com/h2oai/h2o-3

Hall, P., Samworth, R.J. (2005). Properties of bagged nearest neighbour classifiers. Journal of the Royal Statistical Society: Series B 67(3), 363–379. DOI: 10.1111/j.1467-9868.2005.00506.x

Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45, 171–186. DOI: 10.1023/A:1010920819831

Hand, D.J., Yu, K. (2007). Idiot’s Bayes — not so stupid after all? International Statistical Review 69(3), 385–398. DOI: 10.1111/j.1751-5823.2001.tb00465.x

Hart, P. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14(3), 515–516. DOI: 10.1109/TIT.1968.1054155

Hastie, T., Tibshirani, R., Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed, Springer Series in Statistics. Springer. URL https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf, ISBN: 978-0387848570

Hays, J., Efros, A.A. (2008). IM2GPS: estimating geographic information from a single image, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8. DOI: 10.1109/CVPR.2008.4587784

Ho, T.K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8), 832–844. DOI: 10.1109/34.709601

Holmes, C.C., Adams, N.M. (2002). A probabilistic nearest neighbour method for statistical pattern recognition. Journal of the Royal Statistical Society: Series B 64, 295–306. DOI: 10.1111/1467-9868.00338

Horst, A.M., Hill, A.P., Gorman, K.B. (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. URL https://allisonhorst.github.io/palmerpenguins/

Hothorn, T., Hornik, K., Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674. DOI: 10.1198/106186006X133933

Hyafil, L., Rivest, R.L. (1976). Constructing optimal binary decision trees is NP-complete. Information Processing Letters 5(1), 15–17. DOI: 10.1016/0020-0190(76)90095-8

James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning, 1st ed, Springer Texts in Statistics. Springer. URL https://www.statlearning.com/s/ISLR-Seventh-Printing.pdf, ISBN: 978-1461471370

Jenkins, P.A., Johansen, A.M., Evers, L. (2021). APTS: Computer Intensive Statistics Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/cis-notes.pdf

Jones, M.C., Marron, J.S., Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91(433), 401–407. DOI: 10.2307/2291420

Jones, M.L. (2018). How we became instrumentalists (again): Data positivism since World War II. Historical Studies in the Natural Sciences 48(5), 673–684. DOI: 10.1525/hsns.2018.48.5.673

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree, in: Proceedings of the 30th International Conference on Neural Information Processing Systems. URL https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

Ke, G., Soukhavong, D., Lamb, J., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.-Y. (2021). lightgbm: Light Gradient Boosting Machine. R package version 3.2.1. URL https://CRAN.R-project.org/package=lightgbm

Kearns, M. (1988). Thoughts on hypothesis boosting. Machine Learning Class Project (unpublished) 1–9. URL http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf

Kohler, M., Krzyżak, A., Walk, H. (2006). Rates of convergence for partitioning and nearest neighbor regression estimates with unbounded data. Journal of Multivariate Analysis 97(2), 311–323. DOI: 10.1016/j.jmva.2005.03.006

Kuhn, M., Johnson, K. (2020). Feature Engineering and Selection: A Practical Approach for Predictive Models, 1st ed, Data Science Series. Chapman & Hall/CRC. ISBN: 978-1138079229

Kuhn, M., Wickham, H. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. URL https://www.tidymodels.org

Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software 4(44), 1903. DOI: 10.21105/joss.01903

Larson, S.C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology 22(1), 45–55. DOI: 10.1037/h0072400

Leslie, D. (2019). Understanding artificial intelligence ethics and safety: A guide for the responsible design and implementation of AI systems in the public sector. Technical Report, The Alan Turing Institute. DOI: 10.5281/zenodo.3240529

Li, K. (1984). Consistency for cross-validated nearest neighbor estimates in nonparametric regression. The Annals of Statistics 12(1), 230–240. DOI: 10.1214/aos/1176346403

Liaw, A., Wiener, M. (2002). Classification and regression by randomForest. R News 2(3), 18–22. URL https://www.r-project.org/doc/Rnews/Rnews_2002-3.pdf

Liley, J., Emerson, S.R., Mateen, B.A., Vallejos, C.A., Aslett, L.J.M., Vollmer, S.J. (2021). Model updating after interventions paradoxically introduces bias, in: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics. pp. 3916–3924. URL http://proceedings.mlr.press/v130/liley21a.html

Long, P.M., Servedio, R.A. (2010). Random classification noise defeats all convex potential boosters. Machine Learning 78, 287–304. DOI: 10.1007/s10994-009-5165-z

Lugosi, G., Nobel, A. (1996). Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics 24(2), 687–706. DOI: 10.1214/aos/1032894460

Majka, M. (2019). naivebayes: High Performance Implementation of the Naive Bayes Algorithm in R. R package version 0.9.7. URL https://CRAN.R-project.org/package=naivebayes

Malik, M.M. (2020). A hierarchy of limitations in machine learning. arXiv (2002.05193). URL https://arxiv.org/abs/2002.05193

Mallows, C.L. (1973). Some comments on \(C_p\). Technometrics 15(4), 661–675. DOI: 10.1080/00401706.1973.10489103

Maxwell, J.C. (1860). On the theory of compound colours, and the relations of the colours of the spectrum. Proceedings of the Royal Society of London 10, 404–409. DOI: 10.1098/rspl.1859.0074

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys 54(6), 1–35. DOI: 10.1145/3457607

Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F. (2021). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-8. URL https://CRAN.R-project.org/package=e1071

Milborrow, S. (2021). rpart.plot: Plot ’rpart’ Models: An Enhanced Version of ’plot.rpart’. R package version 3.1.0. URL https://CRAN.R-project.org/package=rpart.plot

Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning 4, 227–243. DOI: 10.1023/A:1022604100933

Mingers, J. (1987). Expert systems – rule induction with statistical data. Journal of the Operational Research Society 38, 39–47. DOI: 10.1057/jors.1987.5

Mohri, M., Rostamizadeh, A., Talwalkar, A. (2018). Foundations of Machine Learning, 2nd ed. MIT Press. URL https://www.dropbox.com/s/7voitv0vt24c88s/10290.pdf?dl=1, ISBN: 978-0262039406

Molnar, C., Casalicchio, G., Bischl, B. (2021). Interpretable machine learning – a brief history, state-of-the-art and challenges, in: ECML PKDD 2020 Workshops. pp. 417–431. DOI: 10.1007/978-3-030-65965-3_28

Moons, K.G.M., Altman, D.G., Reitsma, J.B., Ioannidis, J.P.A., Macaskill, P., Steyerberg, E.W., Vickers, A.J., Ransohoff, D.F., Collins, G.S. (2015). Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration. Annals of Internal Medicine 162(1), W1–W73. DOI: 10.7326/M14-0698

Morgan, J.N., Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58(302), 415–434. DOI: 10.2307/2283276

Murphy, K.P. (2012). Machine Learning: A Probabilistic Perspective, 1st ed. MIT Press. ISBN: 978-0262018029

Murthy, S.K., Salzberg, S. (1995). Decision tree induction: How effective is the greedy heuristic?, in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining. pp. 222–227. URL https://www.aaai.org/Library/KDD/kdd95contents.php

Nadaraya, E.A. (1964). On estimating regression. Theory of Probability & Its Applications 9(1), 141–142. DOI: 10.1137/1109020

Nagy, G. (1968). State of the art in pattern recognition. Proceedings of the IEEE 56(5), 836–863. DOI: 10.1109/PROC.1968.6414

Neyshabur, B., Tomioka, R., Srebro, N. (2015). In search of the real inductive bias: On the role of implicit regularization in deep learning, in: Proceedings of the Third International Conference on Learning Representations. URL https://arxiv.org/abs/1412.6614

Ng, A.Y., Jordan, M.I. (2001). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes, in: Proceedings of the 14th International Conference on Neural Information Processing Systems. pp. 841–848. URL http://papers.nips.cc/paper/2020-on-discriminative-vs-generative-classifiers-a-comparison-of-logistic-regression-and-naive-bayes.pdf

Niculescu-Mizil, A., Caruana, R. (2005). Predicting good probabilities with supervised learning, in: Proceedings of the 22nd International Conference on Machine Learning. pp. 625–632. DOI: 10.1145/1102351.1102430

Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. The Annals of Statistics 24(3), 1084–1105. DOI: 10.1214/aos/1032526958

Ogden, H.E., Davison, A.C., Forster, J.J., Woods, D.C., Overstall, A.M. (2021). APTS: Statistical Modelling Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/statmod-notes.pdf

Parzen, E. (1962). On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3), 1065–1076. DOI: 10.1214/aoms/1177704472

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830. URL https://www.jmlr.org/papers/v12/pedregosa11a.html

Platt, J.C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, in: Advances in Large Margin Classifiers. pp. 61–74. ISBN: 978-0262283977

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning, 1st ed. Morgan Kaufmann Publishers. ISBN: 978-1558602380

Quinlan, J.R. (1986). Induction of decision trees. Machine Learning 1, 81–106. DOI: 10.1007/BF00116251

R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

Robert, C.P. (2007). The Bayesian Choice, 2nd ed, Springer Texts in Statistics. Springer. ISBN: 978-0387715988

Robertson, T., Wright, F.T., Dykstra, R.L. (1988). Order restricted statistical inference, 1st ed. Wiley. ISBN: 978-0471917878

Rosenblatt, M. (1971). Curve estimates. The Annals of Mathematical Statistics 42(6), 1815–1842. DOI: 10.1214/aoms/1177693050

Rosenblatt, M. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics 27(3), 832–837. DOI: 10.1214/aoms/1177728190

Rosset, S., Tibshirani, R.J. (2020). From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association 115(529), 138–151. DOI: 10.1080/01621459.2018.1424632

Sachs, M.C. (2017). plotROC: A Tool for Plotting ROC Curves. Journal of Statistical Software, Code Snippets 79(2), 1–19. DOI: 10.18637/jss.v079.c02

Schapire, R.E. (1990). The strength of weak learnability. Machine Learning 5, 197–227. DOI: 10.1023/A:1022648800760

Schapire, R.E., Freund, Y. (2012). Boosting: Foundations and Algorithms, 1st ed. MIT Press. ISBN: 978-0262017183

Scott, D.W. (2015). Multivariate Density Estimation: Theory, Practice, and Visualization, 2nd ed, Wiley Series in Probability and Statistics. Wiley. ISBN: 978-1118575574

Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association 88(422), 486–494. DOI: 10.2307/2290328

Shaw, S.C., Rougier, J.C. (2020). APTS: Statistical Inference Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/lecturenotes.pdf

Sheather, S.J. (2004). Density estimation. Statistical Science 19(4), 588–597. DOI: 10.1214/088342304000000297

Sheather, S.J., Jones, M.C. (1991). A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B 53(3), 683–690. DOI: 10.1111/j.2517-6161.1991.tb01857.x

Shmueli, G. (2010). To explain or to predict? Statistical Science 25(3), 289–310. DOI: 10.1214/10-STS330

Silverman, B.W. (1998). Density Estimation for Statistics and Data Analysis, 1st ed. Chapman & Hall/CRC. ISBN: 978-0412246203

Stein, C.M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9(6), 1135–1151. DOI: 10.1214/aos/1176345632

Stone, C.J. (1977). Consistent nonparametric regression. The Annals of Statistics 5(4), 595–645. DOI: 10.1214/aos/1176343886

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society: Series B 39(1), 44–47. DOI: 10.1111/j.2517-6161.1977.tb01603.x

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B 36(2), 111–133. DOI: 10.1111/j.2517-6161.1974.tb00994.x

Štrumbelj, E. (2018). Predictive model evaluation. Course Notes. URL https://file.biolab.si/textbooks/ml1/model-evaluation.pdf

Terrell, G.R., Scott, D.W. (1992). Variable kernel density estimation. The Annals of Statistics 20(3), 1236–1265. DOI: 10.1214/aos/1176348768

The CONSORT-AI and SPIRIT-AI Steering Group (2019). Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nature Medicine 25, 1467–1468. DOI: 10.1038/s41591-019-0603-3

Therneau, T., Atkinson, B. (2019). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. URL https://CRAN.R-project.org/package=rpart

The Turing Way Community, Arnold, B., Bowler, L., Gibson, S., Herterich, P., Higman, R., Krystalli, A., Morley, A., O’Reilly, M., Whitaker, K. (2019). The Turing Way: A Handbook for Reproducible Data Science. Zenodo v0.0.4. DOI: 10.5281/zenodo.3233986

Tibshirani, R.J. (2015). Degrees of freedom and model search. Statistica Sinica 25(3), 1265–1296. DOI: 10.5705/ss.2014.147

Tibshirani, R.J., Rosset, S. (2019). Excess optimism: How biased is the apparent error of an estimator tuned by SURE? Journal of the American Statistical Association 114(526), 697–712. DOI: 10.1080/01621459.2018.1429276

Tukey, J.W. (1962). The future of data analysis. The Annals of Mathematical Statistics 33(1), 1–67. DOI: 10.1214/aoms/1177704711

Van Calster, B., McLernon, D.J., Van Smeden, M., Wynants, L., Steyerberg, E.W. (2019). Calibration: The Achilles heel of predictive analytics. BMC Medicine 17(230). DOI: 10.1186/s12916-019-1466-7

Van Calster, B., Nieboer, D., Vergouwe, Y., De Cock, B., Pencina, M.J., Steyerberg, E.W. (2016). A calibration hierarchy for risk models was defined: From utopia to empirical data. Journal of Clinical Epidemiology 74, 167–176. DOI: 10.1016/j.jclinepi.2015.12.005

Van Calster, B., Vickers, A.J. (2014). Calibration of risk prediction models: Impact on decision-analytic performance. Medical Decision Making 35(2), 162–169. DOI: 10.1177/0272989X14547233

van der Laan, M.J., Polley, E.C., Hubbard, A.E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology 6(1). DOI: 10.2202/1544-6115.1309

van der Ploeg, T., Austin, P.C., Steyerberg, E.W. (2014). Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology 14(137). DOI: 10.1186/1471-2288-14-137

Vapnik, V.N. (1998). The Nature of Statistical Learning Theory, 2nd ed. Springer. ISBN: 978-0387987804

Viering, T., Loog, M. (2021). The shape of learning curves: A review. arXiv (2103.10948). URL https://arxiv.org/abs/2103.10948

Wager, S. (2020). Cross-validation, risk estimation, and model selection: Comment on a paper by Rosset and Tibshirani. Journal of the American Statistical Association 115(529), 157–160. DOI: 10.1080/01621459.2020.1727235

Wand, M.P. (2021). KernSmooth: Functions for Kernel Smoothing Supporting Wand & Jones (1995). R package version 2.23-20. URL https://CRAN.R-project.org/package=KernSmooth

Wand, M.P., Jones, M.C. (1995). Kernel Smoothing, 1st ed, Monographs on Statistics and Applied Probability. Chapman & Hall/CRC. ISBN: 978-0412552700

Wang, J., Shen, X. (2006). Estimation of generalization error: Random and fixed inputs. Statistica Sinica 16(2), 569–588. URL http://www3.stat.sinica.edu.tw/statistica/J16N2/J16N213/J16N213.html

Warner, H.R., Toronto, A.F., Veasey, L.G., Stephenson, R. (1961). A mathematical approach to medical diagnosis: Application to congenital heart disease. JAMA 177(3), 177–183. DOI: 10.1001/jama.1961.03040290005002

Wasserman, L.A. (2014). Rise of the machines, in: Lin, X., Genest, C., Banks, D.L., Molenberghs, G., Scott, D.W., Wang, J.-L. (Eds.), Past, Present, and Future of Statistical Science. Chapman & Hall/CRC, pp. 525–536. DOI: 10.1201/b16720

Watson, G.S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A 26(4), 359–372. URL http://www.jstor.org/stable/25049340

Whittle, P. (1958). On the smoothing of probability density functions. Journal of the Royal Statistical Society: Series B 20(2), 334–343. DOI: 10.1111/j.2517-6161.1958.tb00298.x

Wilson, D.L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics SMC-2(3), 408–421. DOI: 10.1109/TSMC.1972.4309137

Wolpert, D.H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation 8, 1341–1390. DOI: 10.1162/neco.1996.8.7.1341

Wong, W.H. (1983). On the consistency of cross-validation in kernel nonparametric regression. The Annals of Statistics 11(4), 1136–1141. DOI: 10.1214/aos/1176346327

Wright, M.N., Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software 77(1), 1–17. DOI: 10.18637/jss.v077.i01

Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika 92(4), 937–950. DOI: 10.1093/biomet/92.4.937

Yousef, W.A. (2020). A leisurely look at versions and variants of the cross validation estimator. arXiv (1907.13413). URL https://arxiv.org/abs/1907.13413

Yu, Y. (2021). APTS: High-dimensional Statistics Notes. URL https://warwick.ac.uk/fac/sci/statistics/apts/students/resources/hdsnotes.pdf

Zadrozny, B., Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 694–699. DOI: 10.1145/775047.775151

Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2017). Understanding deep learning requires rethinking generalization, in: Proceedings of the Fifth International Conference on Learning Representations. URL https://arxiv.org/abs/1611.03530

Epanechnikov, V.A. (1969). Non-parametric estimation of a multivariate probability density [in Russian]. Teoriya Veroyatnostei i ee Primeneniya 14(1), 156–161. URL http://mi.mathnet.ru/tvp1130