The Fourth Industrial Revolution

Supervisors: Dr L.J.M. Aslett & Dr J. Einbeck

In 2016 the World Economic Forum described machine learning and artificial intelligence as the 'fourth industrial revolution' (Schwab, 2016). Although the area suffers from significant hype, there has in fact been genuinely impressive progress in automating everything from driving (Waldrop, 2015) and video-game control (Mnih et al., 2015) to cancer diagnostics (Esteva et al., 2017) and activity recognition (Willetts et al., 2018), making machine learning a highly employable skill set.

At the heart of machine learning lies a variety of methods which sit at the intersection of statistics and computer science. These methods are developed with a primary emphasis on highly accurate prediction of some outcome of interest; as a result they often produce complex, hard-to-interpret models, but ones which can nonetheless achieve impressive levels of accuracy.

The core idea is to take a large corpus of data containing many covariates, \( X \), which are in principle readily available, and for which the outcome we are interested in predicting, \( Y \), is already known. We then aim to 'learn' a prediction rule, \( \hat{Y} = f(X) \), which performs well on future data where only the covariates are known. For example, in medical diagnostics we may have ready access to covariate information about a patient, such as medical history and diagnostic test results, and be interested in predicting whether they have a particular disease. Machine learning would take a large amount of covariate data about patients for whom the disease status was known and learn a model which could automate the prediction of a future patient's disease status given only their covariate information.
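
As a concrete illustration of this workflow, here is a minimal sketch in base R using simulated data; the covariate names ('age', 'test_result') and the data-generating mechanism are invented purely for illustration, and any of the methods discussed below could play the role of \( f \).

# A minimal sketch of the supervised learning workflow, base R only.
# The data are simulated; 'age' and 'test_result' stand in for real covariates.
set.seed(1)

n <- 500
train <- data.frame(
  age         = rnorm(n, mean = 55, sd = 10),
  test_result = rnorm(n, mean = 1,  sd = 0.5)
)
# Simulated 'true' relationship between the covariates X and disease status Y
p <- plogis(-10 + 0.15 * train$age + 1.5 * train$test_result)
train$disease <- rbinom(n, size = 1, prob = p)

# 'Learn' f from patients whose disease status is already known
fit <- glm(disease ~ age + test_result, data = train, family = binomial)

# Predict Y-hat = f(X) for a future patient where only the covariates are known
new_patient <- data.frame(age = 62, test_result = 1.4)
predict(fit, newdata = new_patient, type = "response")  # estimated probability of disease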


Figure: Support vector machine non-linear transformation to achieve linear separability (Yu et al., 2010).
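
To make the idea behind the figure concrete, the following sketch uses simulated data and the e1071 package (one common R implementation of support vector machines; the package choice and the data are illustrative assumptions, not part of the project description). A radial kernel, which implicitly applies a non-linear transformation of the inputs, can separate classes which no straight line in the original coordinates can.

# Illustrative only: simulated two-dimensional data whose classes are
# separated by a circle, so no linear boundary in (x1, x2) separates them.
# install.packages("e1071")   # if not already installed
library(e1071)
set.seed(2)

n  <- 400
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- factor(ifelse(x1^2 + x2^2 < 0.5, "inner", "outer"))
dat <- data.frame(x1, x2, y)

# A linear kernel cannot separate the classes...
fit_linear <- svm(y ~ x1 + x2, data = dat, kernel = "linear")
mean(predict(fit_linear, dat) == dat$y)   # roughly the proportion of the larger class

# ...whereas a radial kernel implicitly maps the inputs into a space
# where a separating hyperplane does exist.
fit_radial <- svm(y ~ x1 + x2, data = dat, kernel = "radial")
mean(predict(fit_radial, dat) == dat$y)   # close to 1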

There is a large and rich set of interesting problems which require careful treatment, including, among other concerns, overfitting, bias, variance, loss functions, feature engineering, variable selection, and parameter tuning.
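
As a small illustration of the first of these concerns, the base R sketch below (with simulated data and polynomial degrees chosen purely for demonstration) compares training error with held-out error as model flexibility grows: the training error keeps shrinking while the held-out error typically starts to rise again, which is the signature of overfitting.

# Illustration of overfitting using simulated data and base R.
set.seed(3)

n <- 100
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Hold out half the data to mimic 'future' observations
train_idx <- sample(n, n / 2)
train <- data.frame(x = x[train_idx],  y = y[train_idx])
test  <- data.frame(x = x[-train_idx], y = y[-train_idx])

mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Increasingly flexible polynomial regressions: training error falls
# monotonically, held-out error eventually deteriorates.
for (degree in c(1, 3, 10, 20)) {
  fit <- lm(y ~ poly(x, degree), data = train)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              degree, mse(fit, train), mse(fit, test)))
}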

The goal in this project will initially be to learn about some machine learning methods, which may include logistic regression, naïve Bayes, support vector machines, random forests, gradient boosting machines or deep neural networks, as well as best-practice methodology for their use. With this understanding you may then take the project in whatever direction you find most interesting, which might include going deeper into the methodology of a technique which interests you, or finding a data set to use for a real-world application.

Prerequisites

Statistical Concepts II and familiarity with the statistical language R (or alternatively Python/Julia) are essential.

Background

Friedman et al. (2009) is an excellent book on statistical machine learning methods. Deep learning is thoroughly covered in Goodfellow et al. (2016). Both are available in the library or for free online (legally!).

References

Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M. and Thrun, S., 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), p.115. DOI: 10.1038/nature21056

Friedman, J., Hastie, T. and Tibshirani, R., 2009. The Elements of Statistical Learning. 2nd Edition. New York: Springer Series in Statistics. Durham Library, or free online copy.

Goodfellow, I., Bengio, Y. and Courville, A., 2016. Deep Learning. MIT Press. Durham Library, or free online copy.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), p.529. DOI: 10.1038/nature14236

Schwab, K., 2016. The Fourth Industrial Revolution. World Economic Forum.

Waldrop, M.M., 2015. No drivers required. Nature, 518(7537), p.20. DOI: 10.1038/518020a

Willetts, M., Hollowell, S., Aslett, L.J.M., Holmes, C.C. and Doherty, A., 2018. Statistical machine learning of sleep and physical activity phenotypes from sensor data in 96,220 UK Biobank participants. Scientific Reports, 8, 7961. DOI: 10.1038/s41598-018-26174-1

Yu, W., Liu, T., Valdez, R., Gwinn, M. and Khoury, M.J., 2010. Application of support vector machine modeling for prediction of common diseases: the case of diabetes and pre-diabetes. BMC Medical Informatics and Decision Making, 10(1), p.16. DOI: 10.1186/1472-6947-10-16