Evaluating Similarities and Differences between Machine Learning and Traditional Statistical Modeling in Healthcare Analytics

Written By

Michele Bennett, Ewa J. Kleczyk, Karin Hayes and Rajesh Mehta

Submitted: 14 February 2022 Reviewed: 02 May 2022 Published: 31 May 2022

DOI: 10.5772/intechopen.105116

From the Annual Volume

Artificial Intelligence Annual Volume 2022

Edited by Marco Antonio Aceves Fernandez and Carlos M. Travieso-Gonzalez

Abstract

Data scientists and statisticians are often at odds when determining the best approach, machine learning or statistical modeling, to solve their analytical challenges and problem statements across industries. However, machine learning and statistical modeling are more closely related to each other than they are opposing sides of an analysis battleground. The choice between them is often based on the problem at hand, the expected outcome(s), the real-world application of the results and insights, as well as the availability and granularity of the data for the analysis. Overall, machine learning and statistical modeling are complementary techniques that are guided by similar mathematical principles but leverage different tools to arrive at insights. Determining the best approach should consider the problem to be solved, empirical evidence and the resulting hypothesis, data sources and their completeness, the number of variables/data elements, assumptions, and expected outcomes such as the need for prediction or for causality and reasoning. Experienced analysts and data scientists are often well versed in both types of approaches and their applications, and hence use the tools best suited to their analytical challenges. Due to the importance and relevance of the subject in the current analytics environment, this chapter presents an overview of each approach and outlines their similarities and differences to provide the understanding needed to select the proper technique for a problem at hand. Furthermore, the chapter provides examples of applications in the healthcare industry and outlines how to decide which approach is best when analyzing healthcare data. Understanding the best-suited methodologies can help the healthcare industry develop and apply advanced analytical tools that speed up diagnostic and treatment processes and improve patients' quality of life.

Keywords

  • machine learning
  • statistical modeling
  • data science
  • healthcare analytics
  • research design

1. Introduction

In recent years, machine learning techniques have been utilized to solve problems across a multitude of industries and topics. In the healthcare industry, these techniques are often applied to a variety of healthcare claims and electronic health records data to garner valuable insights into diagnostic and treatment pathways, in order to help optimize patient healthcare access and the treatment process [1]. Unfortunately, many of these applications have resulted in inaccurate or irrelevant research results, as proper research protocols were not fully followed [2]. On the other hand, statistics has been the basis of analysis in healthcare research for decades, especially in the areas of clinical trials and health economics and outcomes research (HEOR), where the precision and accuracy of analyses are the primary objectives [3]. Furthermore, classical statistical methodologies are often preferred in those research areas to ensure the ability to replicate and defend the results and, ultimately, the ability to publish the research content in peer-reviewed medical journals [3]. The increased availability of data, including data from wearables, has provided the opportunity to apply a variety of analytical techniques and methodologies to identify patterns, often hidden, that could help optimize healthcare access as well as the diagnostic and treatment process [4].

With the rapid increase in data from healthcare and many other industries, it is important to consider how to select well-suited statistical and machine learning methodologies for the problem at hand, the available data type, and the overall research objectives [5]. Machine learning alone, or complemented by statistical modeling, is becoming not just more common but a desired convergence that takes advantage of the best of both approaches to advance healthcare outcomes [1].

2. Machine learning's foundation is in statistical learning theory

Machine learning (ML) is considered a branch of artificial intelligence and computer science that focuses on mimicking human behaviors through a set of algorithms and methods that use historical values to predict new values [7], without being specifically coded to do so, thereby learning over time [8, 9]. ML is grounded in statistical learning theory (SLT), which provides the constructs used to create prediction functions from data. One of the first examples of SLT was the creation of the support vector machine (SVM), a supervised learning method that can be used for both classification and regression and has become a standard in modeling visual object recognition [7]. SLT formalizes the model that makes a prediction based on observations (i.e., data), and ML automates the modeling [7].

SLT sets the mathematical and theoretical framework for ML as well as the properties of learning algorithms [7], with the goals of providing mechanisms for studying inference and creating algorithms that become more precise and improved over time [8]. SLT is based on multivariate statistics and functional analysis [8]. Functional analysis extends multivariate vector statistics to continuous functions, measuring shapes, curves, and surfaces and finding functions that describe data patterns [8]. Inductive inference is the process of generalizing and modeling past observations to make predictions for the future; SLT formalizes the modeling concepts of inductive inference, while ML automates them [8].

For example, pattern recognition, one of the most common applications of ML, is considered a problem of inductive inference and SLT, as it is a curve-fitting problem [7, 8, 9]. Pattern recognition is not suited to traditional computer programming because the inferences needed are not free of assumptions and the patterns are not easily described or labeled programmatically with deterministic functions. The standard mathematics behind SLT makes no assumptions about distributions, uses stochastic functions that can include humans labeling the "right" classification (i.e., training data), and can assume that the probability of the occurrence of one observation is independent of another, thereby incorporating the concept of randomness [7, 8, 9]. These tenets are therefore those of ML as well.

SLT also provides the definitions of terms often used in ML, such as overfitting, underfitting, and generalization. Overfitting occurs when noise in the data is incorporated into the learning process, negatively affecting training and ultimate model performance and producing error when the model sees new data [8, 9]. Underfitting degrades performance on both the training data and new, unseen data [9]. In ML, discussions of underfitting and overfitting often describe models that do not generalize the data effectively and might not present the right set of data elements to explain the data patterns and posited hypotheses [9]. Underfitting is often defined as a model missing features that would be present in the most optimized model, akin to a regression model not fully explaining the variance of the dependent variable [9]. In a similar vein, overfitting occurs when the model contains more features, or different features, than is optimal, like a regression model with autocorrelation or multicollinearity [9].

The general goal of learning algorithms, and therefore of ML model optimization, is to reduce the dimensions, features, or data variables to the fewest number needed, as this reduces noise and the impact of trivial variables that can overfit or underfit the model [8, 9]. A regularized model can then generalize, performing not just on the past or training data but also on future, yet unseen data [8, 9]. True generalization, however, requires both the right modeling criteria and strong subject matter knowledge [8].
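
To make these fit concepts concrete, here is a minimal sketch on synthetic data (scikit-learn assumed available; the data and degrees are arbitrary illustrative choices, not from any study cited here). It compares underfit, reasonably fit, and overfit polynomial regressions via cross-validated MSE, then applies ridge regularization to improve the overfit model's generalization:

```python
# Contrast under-, well-, and overfitting via cross-validation, then regularize.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)  # signal + noise

for degree in (1, 4, 15):  # underfit, reasonable, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"degree={degree:2d}  cross-validated MSE={mse:.3f}")

# Ridge regularization shrinks the high-degree model's coefficients toward
# zero, reducing the noise it memorizes and improving generalization.
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0))
mse = -cross_val_score(regularized, X, y, cv=5,
                       scoring="neg_mean_squared_error").mean()
print(f"degree=15 + ridge  cross-validated MSE={mse:.3f}")
```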

Often, dimension reduction approaches like principal component analysis (PCA) or bootstrapping techniques, used along with subject matter expertise, can help refine models, combat fit challenges, and improve generalization potential [9, 10]. Furthermore, understanding the studied population and data characteristics can further inform the data to be used, variable selection, and proper model setup [10].
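
As an illustration, the sketch below (a hypothetical, randomly generated feature matrix standing in for real patient features) uses PCA to retain only the components explaining 90% of the variance before any modeling:

```python
# Dimension reduction sketch: compress correlated features with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 40)  # placeholder for real healthcare features

# Standardize, then keep however many components explain 90% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.90))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # far fewer columns than the original 40
```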

3. Similarities between machine learning and statistical modeling

Statistical modeling shares its foundation with SLT and uses mathematical models and statistical assumptions to generate sample data and make predictions about real-world occurrences. A statistical model is often represented as a collection of probability distributions on the set of all possible outcomes. Furthermore, statistical modeling has evolved over the last few decades and shaped the future of business analytics and data science, including the current use and applications of ML algorithms. Machine learning, on the other hand, requires fewer assumptions and interventions when running algorithms in order to accurately predict studied outcomes [7].

There are similarities between ML and statistical modeling that are prevalent across most analytical efforts. Both techniques use historical data as input to predict new output values but, as noted above, they vary in their underlying assumptions and in the level of analyst intervention and data preparation required.

Overall, machine learning's foundations are based on statistical learning theory, and data scientists are advised to apply SLT's guiding rules during analysis. While it may seem that a statistical background and understanding are not required when analyzing the underlying data, this misconception often leads to a data scientist's inability to set up a proper research hypothesis and analysis due to a lack of understanding of the problem and of the underlying data assumptions and caveats. This issue can in turn result in biased and irrelevant results as well as unfounded conclusions and insights. With that in mind, it is important to evaluate the problem at hand and consider both statistical modeling and ML as possible methods to be applied. Understanding the underlying assumptions of the data and statistical inference can help support proper technique selection and guide the pathway to a solution [11]. In the later sections of the chapter, applications of both techniques are provided, along with the reasoning for selecting the methods presented, to guide future research.

As mentioned above, the similarities between ML and statistical modeling start with the underlying assumption that data or observations from the past can be used to predict the future [7]. The variables included in the analysis generally represent two types: dependent variables, which in ML are called targets, and independent variables, which in ML are called features. The definitions of the variables are the same across both techniques [8]. Furthermore, both ML and statistical modeling leverage the available data in a way that allows for generalization of results to a larger population [7]. The loss and risk associated with a model's accuracy and representation of the real-world occurrence are frequently described in terms of mean squared error (MSE). In statistical modeling, MSE is the squared difference between the predicted and actual values, averaged over observations, and is used to measure loss in prediction performance. In ML, the analogous concept for a classification problem is presented via a confusion matrix that evaluates the model's accuracy [9].
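
A brief sketch of these parallel loss measures, using illustrative placeholder values and scikit-learn's metrics:

```python
# Parallel loss measures: MSE for regression, confusion matrix for classification.
from sklearn.metrics import confusion_matrix, mean_squared_error

# Regression: loss as mean squared error between actual and predicted values
y_actual = [3.0, 2.5, 4.1, 5.0]
y_predicted = [2.8, 2.9, 4.0, 4.6]
print(mean_squared_error(y_actual, y_predicted))

# Classification: accuracy summarized via a confusion matrix
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
```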

4. Differences between machine learning and statistical modeling

The differences between machine learning and statistical modeling are distinct and based on the purposes and needs of the analysis as well as the outcomes. Assumptions and purposes for the analysis and approach can vastly differ. For example, statistics typically assumes that predictors or features are known and additive, that models are parametric, and that testing of hypotheses and uncertainty are at the forefront. ML, on the other hand, does not make these assumptions [12]. In ML, many models are based on non-parametric approaches where the structure of the model is not specified in advance, additivity is not expected, and assumptions about normal distributions, linearity, or residuals, for example, are not needed for modeling [10].

The purpose of ML is predictive performance, using general-purpose learning algorithms to find patterns, often previously unknown or unrelated, in complex data without an a priori view of the underlying structures [10]. In statistical modeling, by contrast, inferences, correlations, and the effects of a small number of variables are the drivers [12].

Due to the differences in the methods' characteristics, it is important to understand the variations in application of the techniques when solving healthcare problems. For example, one typical application of statistics is to analyze whether a population has a particular medical condition. Some diseases such as diabetes are easily screened for and diagnosed using distinct lab values, such as elevated and increasing HbA1c over time, high glucose levels, and low insulin levels, often due to insulin depletion occurring from unmanaged diabetes. Conditions such as hypertension can easily be detected at home or in the healthcare provider's office using simple blood pressure measurement and monitoring, and wearables can identify when patients are experiencing atrial fibrillation, abnormal heart rhythms, and even increased patient falls (possible syncope). Therefore, analyses of patients with these easily measurable conditions can be done simply by qualifying patients based on lab values or biomarkers falling within or outside of certain ranges. One of the simplest examples is identifying patients with diabetes [13]. This can be accomplished by using A1C levels to group patients as having no diabetes (A1C < 5.7), pre-diabetes (A1C of 5.7–6.4), or diabetes (A1C > 6.4). These ranges are based on the American Diabetes Association diagnosis guidelines and a very high, medically accepted correlation between A1C levels and the diagnosis of diabetes [14].
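
These published cut points translate directly into a simple rule-based classifier; the sketch below (hypothetical patient data and column names) applies the A1C ranges quoted above:

```python
# Rule-based cohort labeling from the ADA A1C cut points cited in the text.
import pandas as pd

def classify_a1c(a1c: float) -> str:
    """Label diabetes status from an A1C value using the ranges above."""
    if a1c < 5.7:
        return "no diabetes"
    elif a1c <= 6.4:
        return "pre-diabetes"
    return "diabetes"

patients = pd.DataFrame({"patient_id": [1, 2, 3], "a1c": [5.4, 6.1, 7.2]})
patients["status"] = patients["a1c"].apply(classify_a1c)
print(patients)
```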

On the other hand, if the objective of the research is to predict which pre-diabetic patients are most likely to progress to diabetes, a myriad of factors influence diabetes progression, including the extent of chronic kidney disease, high blood pressure, insulin levels over time, body mass index/obesity, age, years with diabetes, success of prior therapy, number and types of prior therapies, family history, coronary artery disease, prior cardiovascular events, infections, etc. A complicated combination of comorbidities, risk factors, and patient behavior can lead to differing diabetes complications and varying outcomes, making prediction more challenging; this therefore represents a good candidate for the use of machine learning techniques. Classification models such as gradient boosting tree algorithms have been used to successfully predict diabetes progression, especially earlier in the disease. While there are many diabetes risk factors and comorbidities, these disease characteristics have been well studied over many years, enabling stable predictive models that perform well over time [14].
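
As a hedged illustration, and not the cited study's actual model, the sketch below trains a gradient boosting classifier on synthetic stand-ins for the progression risk factors listed above:

```python
# Gradient boosting sketch for a progression-style classification problem.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Placeholder for a real cohort of pre-diabetic patients with engineered
# features (BMI, blood pressure, years with diabetes, prior therapies, etc.).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
print(f"Held-out accuracy: {model.score(X_test, y_test):.3f}")
```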

Overall, machine learning is highly effective when the model uses more than a handful of independent variables/features [10]. ML is required when the number of features (p) is larger than the number of records or observations (n), a situation associated with the curse of dimensionality [15, 16], which increases the risk of overfitting but can be overcome with dimensionality reduction techniques (e.g., PCA) as part of modeling [15], together with clinical/expert input on the importance, or lack thereof, of certain features as they relate to the disease or its treatment. Additionally, statistical learning theory teaches that learning algorithms increase their ability to extract complex structures from data at a greater and faster rate than an increase in sample size alone can provide [8]. Therefore, statistical learning theory and ML offer methods for addressing high-dimensional data or big data (high velocity, volume, and variety) and smaller sample sizes [17], such as recursive feature elimination, support vector machines, boosting, and cross-validation, which can also minimize prediction error [18].
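
The sketch below illustrates this p > n setting on synthetic data, combining recursive feature elimination with a linear SVM and cross-validation; the feature counts are arbitrary choices for illustration:

```python
# Recursive feature elimination + SVM in a p >> n setting, cross-validated.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# 500 features but only 100 observations: more columns than rows
X, y = make_classification(n_samples=100, n_features=500, n_informative=10,
                           random_state=0)

# Feature elimination runs inside the pipeline, so each cross-validation fold
# selects features on its own training split and the error estimate stays honest.
pipeline = make_pipeline(
    RFE(SVC(kernel="linear"), n_features_to_select=10, step=50),
    SVC(kernel="linear"),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.3f}")
```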

In the healthcare industry, machine learning models are frequently used in cancer prediction, generally in three areas: (1) predicting a patient's cancer prognosis/diagnosis, (2) predicting cancer progression, and (3) predicting cancer mortality. Of these, predicting whether a patient may have a cancer prognosis/diagnosis can be more or less difficult depending on the tumor type. Certain cancers such as lung cancer, breast cancer, prostate cancer, and skin cancer are evaluated based on specific signs and symptoms, non-invasive imaging, or blood tests. These cancers are easier to predict. Conversely, cancers with non-descript symptoms such as fatigue, dizziness, GI pain and distress, and lack of appetite are much more difficult to predict even with machine learning models, as these symptoms are associated with multiple tumor types (for example, esophageal, stomach, bladder, liver, and pancreatic cancer) and also mimic numerous other conditions [14].

For cancers with vague symptoms, understanding the patient journey is very important to cancer prediction. If a prediction period is too long and does not reflect the time period before diagnosis when symptoms develop, the model may overfit due to spurious variables not related to the condition. If the prediction period is too short, key risk factors from the patient record could be missing. Variable pruning is required in these situations. A multi-disciplinary team including business and clinical experts can help trim unrelated variables and improve model performance [14].

Model validation is an inherent part of the ML process, where the data is split into training data and test data, with the larger portion of the data used to train the model to learn outputs based on known inputs. This process allows the model to rapidly learn structure from the data, with the primary focus on building the ability to predict future outcomes [15]. Beyond initial validation of the model within the test data set, the model should be further tested in the real world using a large, representative, and more recent sample of data [19]. This can be accomplished by using the model to score the eligible population and using a look-forward period to assess the incidence or prevalence of the desired outcome. If the model is performing well, probability scores should be directly correlated with incidence/prevalence (the higher the probability score, the higher the incidence/prevalence). Model accuracy, precision, and recall can also be assessed using this approach [20].
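
The following sketch (synthetic data; all names are hypothetical) walks through this flow: a train/test split, then a decile check that the observed outcome rate rises with the model's probability score, as the paragraph above describes:

```python
# Train/test split, then validate that outcome rates rise with score deciles.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
model = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Stand-in for scoring a recent, representative population with a look-forward
# period; here y_test plays the role of the observed outcome.
scores = model.predict_proba(X_test)[:, 1]
df = pd.DataFrame({"score": scores, "outcome": y_test})
df["decile"] = pd.qcut(df["score"], 10, labels=False, duplicates="drop")
print(df.groupby("decile")["outcome"].mean())  # should rise with the decile
```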

Epidemiology studies and prior published machine learning research in related areas of healthcare can help benchmark the performance of the model relative to the baseline prevalent or incident population for the condition to be predicted. Machine learning models created using a few hundred or thousand patients often do not perform as well in the real world. Careful variable pruning, cohort refinement, and adjustment of modeling periods can often resolve model performance problems. Newer software can be used to more quickly build, test, and iterate models, allowing users to easily transform and combine features, run many models simultaneously, visualize model performance, and diagnose and solve model issues [21].

5. How to choose between machine learning and statistical modeling

Machine learning algorithms are the preferred choice over a statistical modeling approach under specific circumstances, data configurations, and desired outcomes.

5.1 Importance of prediction over causal relationships

As noted above, machine learning algorithms are leveraged for prediction of the outcome rather than for presenting the inferential and causal relationships between the outcome and the independent variables/data elements [17, 22]. Once a model has been created, statistical analysis can sometimes elucidate and validate the importance of, and relationships between, independent and dependent variables.
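
One way to perform this post-hoc step, shown here as a hedged sketch on synthetic data rather than a prescribed method, is permutation importance, which measures how much shuffling each feature degrades the trained model's performance:

```python
# Post-hoc variable importance for a fitted prediction model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=3)
model = RandomForestClassifier(random_state=3).fit(X, y)

# Shuffle each feature in turn and record the drop in score (evaluated on the
# training data here for brevity; a holdout set is preferable in practice).
result = permutation_importance(model, X, y, n_repeats=10, random_state=3)
for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"feature {i}: importance {mean:.3f} ± {std:.3f}")
```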

5.2 Application of wide and big dataset(s)

Machine learning algorithms are learner algorithms that learn from large amounts of data, often presented as a large number of data elements but not necessarily with many observations [23]. The ability to replicate samples multiple times, cross-validate, or apply bootstrapping techniques allows machine learning to handle wide datasets with many data elements and few observations, which is extremely helpful in predicting rare disease onset [24], as long as the process is accompanied by real-world testing to ensure the models do not suffer from overfitting [18, 19]. With the advent of less expensive and more powerful computing and storage, multi-algorithm, ensembled models using larger cohorts can be built more efficiently. Larger modeling samples that are more representative of the overall population can help reduce the likelihood of overfitting or underfitting [25]. A large cohort imposes various issues, chief among them the ability to identify the set of independent variables that are most meaningful and impactful. These significant independent variables provide a predictive and/or inferential model that can be readily accepted for real-world application. The variables in such instances may also yield a more realistic magnitude and direction for the causal relationship between the independent and outcome variables of interest.
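
A minimal sketch of bootstrapped evaluation on a wide (p >> n) synthetic dataset follows; the out-of-bag rows serve as a holdout for each resample, and the figures are illustrative assumptions rather than benchmarks:

```python
# Bootstrapped model evaluation on a wide dataset: few rows, many columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=80, n_features=1000, n_informative=15,
                           random_state=0)  # wide: p >> n

accuracies = []
for seed in range(50):
    idx = np.arange(len(y))
    boot = resample(idx, random_state=seed)  # sample rows with replacement
    oob = np.setdiff1d(idx, boot)            # out-of-bag rows become the test set
    model = LogisticRegression(max_iter=2000).fit(X[boot], y[boot])
    accuracies.append(model.score(X[oob], y[oob]))

print(f"Bootstrapped accuracy: {np.mean(accuracies):.3f} "
      f"± {np.std(accuracies):.3f}")
```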

A recent real-world healthcare example of machine learning algorithm application is identifying the likelihood of hospitalization for high-risk patients diagnosed with COVID-19. The dataset leveraged included over 20,000 independent variables across healthcare claims data covering diagnostic and treatment variables. The optimal ML model consisted of approximately 200 important predictor variables such as age, diagnoses like type 2 diabetes, chronic kidney disease, and hypertension, frequency of office visits, and obesity, among others. None of the variables in this example were 'new'; however, the magnitude and direction resulting from the ML exercise may illustrate the 'true' impact of each independent variable, something that is a serious limitation in traditional statistical modeling [26].

Furthermore, as explained above, statistical models tend not to operate well on very large datasets and often require manageable datasets with a smaller number of pre-defined attributes/data elements for analysis [23]. The recommended number of attributes in a statistical model is up to 12, because these techniques are highly prone to overfitting [25]. This limitation creates a challenge when analyzing large healthcare datasets and requires the application of dimension reduction techniques or expert guidance to reduce the number of independent variables in the study [23].

5.3 Limited data and model assumptions are required

In machine learning algorithms, fewer assumptions need to be made about the dataset and its data elements [5]. However, a good model is usually preceded by profiling of the target and control groups and some knowledge of the domain. Understanding relationships within the data improves outcomes and interpretability [27].

Machine learning algorithms are comparatively more flexible than statistical models, as they do not require assumptions regarding collinearity, normal distribution of residuals, etc. [5]. Thus, they have a high tolerance for uncertainty in variable performance (e.g., confidence intervals, hypothesis tests) [28]. In statistical modeling, emphasis is placed on uncertainty estimates; furthermore, a variety of assumptions have to be satisfied before the outcome from a statistical model can be trusted and applied [28]. As a result, statistical models have a low uncertainty tolerance [25].
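
For contrast, the sketch below (illustrative simulated data; statsmodels and scipy assumed available) shows the kind of assumption checking a statistical model entails: fitting an OLS regression, inspecting coefficient confidence intervals, and testing residual normality:

```python
# Assumption checking for a statistical model: OLS fit, confidence intervals,
# and a residual normality test.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -0.8, 0.3]) + rng.normal(scale=1.0, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.conf_int())  # uncertainty estimate for each coefficient

stat, p_value = stats.shapiro(model.resid)     # normality of residuals
print(f"Shapiro-Wilk p-value: {p_value:.3f}")  # large p: normality not rejected
```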

Machine learning algorithms tend to be preferred over statistical modeling when the outcome to be predicted does not have a strong component of randomness (e.g., in visual pattern recognition, an object must be an E or not an E [5]) and when the learning algorithm can be trained on an unlimited number of exact replications [29].

ML is also appropriate when the overall prediction is the goal, with less visibility into the impact of any one independent variable or the relationships between variables [30], and when estimating uncertainty in forecasts or in the effects of selected predictors is not a requirement [28]. However, data scientists and data analysts often leverage regression analytics to understand the estimated impact, including the directionality of the relationships between the outcome and data elements, to help with model interpretation, relevance, and validity for the studied problem [27]. ML is also preferred when the dataset is wide and very large [23] and the underlying variables are not fully known or previously described [5].

6. Machine learning extends statistics

Machine learning requires no prior assumptions about the underlying relationships between the data elements. It is generally applied to high-dimensional data sets and does not require many observations to create a working model [5]. However, understanding the underlying data will support building representative modeling cohorts, deriving features relevant for the disease state and population of interest, as well as understanding how to interpret modeling results [19, 27].

In contrast, a statistical model requires a deeper understanding of how the data was collected, the statistical properties of the estimator (p-values, unbiased estimators), the underlying distribution of the population, etc. [17]. Statistical modeling techniques are usually applied to low-dimensional data sets [25].

7. Machine learning can extend the utility of statistical modeling

Robert Tibshirani, a statistician and machine learning expert at Stanford University, calls machine learning "glorified statistics," highlighting the dependence of machine learning techniques on statistics for successful execution that allows not only a high level of prediction but also interpretation of the results, ensuring their validity and applicability in healthcare [17]. Understanding the association between the two fields and knowing their differences enables data scientists and statisticians to expand their knowledge and apply a variety of methods outside their domain of expertise. This is the notion of "data science," which aims to bridge the gap between the areas as well as bring in other important aspects of research [5]. Data science is evolving beyond statistics and simpler ML approaches to incorporate self-learning and autonomy, with the ability to interpret context, assess and fill in data gaps, and make modeling adjustments over time [31]. While these modeling approaches are not perfect and are more difficult to interpret, they provide exciting new options for difficult-to-solve problems, especially where the underlying data or environment is rapidly changing [27].

Collaboration and communication among not only data scientists and statisticians but also medical and clinical experts, public policy creators, epidemiologists, etc., allows for designing successful research studies that not only provide predictions and insights into relationships between the vast number of data elements and health outcomes [30] but also allow for valid, interpretable, and relevant results that can be applied with confidence to the project objectives and future deployment in the real world [30, 32].

Finally, it is important to remember that machine learning's foundations are based in statistical theory and learning. It may seem that machine learning can be done without a sound statistical background, but this leads to not really understanding the different nuances in the data and the presented results [17]. Well-written machine learning code does not negate the need for an in-depth understanding of the problem, the assumptions, and the importance of interpretation and validation [29].

8. Specific examples in healthcare

As mentioned earlier in the chapter, machine learning algorithms can be leveraged in the healthcare industry to help evaluate a continuum of access, diagnostic and treatment outcomes, including prediction of patient diagnoses, treatment, adverse events, side effects, and improved quality of life as well as lower mortality rates [24].

As shown in Figure 1, these algorithms can be helpful in predicting a variety of disease conditions and shortening the time from awareness to diagnosis and treatment, especially in rare and underdiagnosed conditions; estimating the 'true' market size; and predicting disease progression, such as identifying fast- vs. slow-progressing patients as well as determinants of a suitable next line change [32]. Finally, the models can be leveraged for patient and physician segmentation and clustering to identify appropriate targets for in-person and non-personal promotion [30].

Figure 1.

Examples of Machine Learning Applications in Healthcare Analytics [22].

There are, however, instances in which machine learning might not be the right tool to leverage, including when the condition or the underlying condition has few known variables, when the market is mature and has a known, predetermined diagnostic and treatment algorithm, and when understanding correlations and inference is more important than making predictions [5].

One aspect of the machine learning process is to involve a cross-functional team of experts in the healthcare area to ensure that the questions and problem statement, along with the hypothesis, are properly set up [33, 34]. Many therapeutic areas require in-depth understanding of clinical and medical concepts (i.e., the diagnostic process, treatment regimens, potential adverse effects, etc.), which can help with the research design and the selection of proper analytical techniques. If expert knowledge is not considered or properly captured in the research design, it might lead to irrelevant, invalid, and biased results and ultimately invalidate the entire research study [33, 34].

9. A practical guide to the predominant approach

Using a real example of a project with the goal of predicting the risk of hypertension due to underlying comorbid conditions or induced by medication, the decision to lead with machine learning vs. statistical modeling can be based on explicit criteria that are weighed and ranked based on the desired outcome of the work [17, 32]. Figure 2 presents an example of the approach.

Figure 2.

Criteria for Choosing the Predominant Approach for a Project.

As shown in Figure 2, depending on the research objectives, machine learning or statistical modeling, or both techniques, could be the right method(s) to apply. For example, shifts in market trends, including shifts in patient volume for diagnosis and treatment, present a suitable case for a statistical modeling type of analysis. On the other hand, predicting which patients are at high risk for hypertension requires the utilization of ML approaches. Leveraging both methods is best suited when both predictive power and explanatory reasoning are needed to understand the important factors driving the outcome and their relative magnitudes and inferences.

10. Conclusions

Machine learning requires fewer assumptions about the underlying relationships between the data elements. It is generally applied to high-dimensional data sets and requires fewer observations to create a working model [5]. In contrast, a statistical model requires an understanding of how the data was collected, the statistical properties of the estimator (p-values, unbiased estimators), the underlying distribution of the population, etc. [17]. Statistical modeling techniques are usually applied to low-dimensional data sets [25]. Statistical modeling and ML are not at odds but rather complementary approaches that offer a choice of techniques based on need and desired outcomes. Data scientists and analysts should not have to choose between machine learning and statistical modeling as a mutually exclusive decision. Instead, selected approaches from both areas should be considered, as both types of methodologies are based on the same mathematical principles but expressed somewhat differently [5, 10].

Note: This book chapter was originally posted on Cornell University's research working paper website: https://arxiv.org. The content of the book chapter is largely the same as the version posted on https://arxiv.org [6].

Conflict of interest

The authors declare no conflict of interest.

Funding

The authors work for Symphony Health, an ICON plc organization.

References

  1. Beam AL, Kohane IS. Big data and machine learning in health care. JAMA. 2018;319(13):1317-1318. DOI: 10.1001/jama.2017.18391
  2. Shelmerdine et al. Review of study reporting guidelines for clinical studies using artificial intelligence in healthcare. BMJ Health & Care Informatics. 2021;28(1):e100385. DOI: 10.1136/bmjhci-2021-100385
  3. Romano R, Gambale E. Statistics and medicine: The indispensable know-how of the researcher. Translational Medicine @UniSa. 2013;5:28-31
  4. Razzak et al. Big data analytics for preventive medicine. Neural Computing and Applications. 2020;32:4417-4451. DOI: 10.1007/s00521-019-04095-y
  5. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nature Methods. 2018;15(4):233-234. DOI: 10.1038/nmeth.4642
  6. Bennett M, Hayes K, Kleczyk EJ, Mehta R. Analytics in healthcare: Similarities and differences between machine learning and traditional advanced statistical modeling. Cornell University. 2022:1-16. Available from: https://arxiv.org/abs/2201.02469
  7. Von Luxburg U, Schölkopf B. Statistical learning theory: Models, concepts, and results. In: Handbook of the History of Logic. Vol. 10. New York: Elsevier; 2011
  8. Bousquet et al. Introduction to Statistical Learning Theory. 2003. Available from: http://www.econ.upf.edu/~lugosi/mlss_slt.pdf
  9. Field A. Discovering Statistics Using R. London: Sage; 2012
  10. Carmichael I, Marron JS. Data science vs. statistics: Two cultures? Japanese Journal of Statistics and Data Science. 2018;1(1):117-138
  11. Cahn A, Shoshan A, Sagiv T, Yesharim R, Goshen R, Shalev V, et al. Prediction of progression from pre-diabetes to diabetes: Development and validation of a machine learning model. Diabetes/Metabolism Research and Reviews. 2020;36(2):e3252. DOI: 10.1002/dmrr.3252. Epub 2020 Jan 14
  12. Breiman L. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science. 2001;16(3):199-231
  13. Mehta R, Uppunuthula S. Use of machine learning techniques to identify the likelihood of hospitalization for high-risk patients diagnosed with COVID-19. In: ISPOR Conference; Washington DC. 2022
  14. American Diabetes Association. Understanding A1C Diagnosis. 2022. Available from: https://www.diabetes.org/diabetes/a1c/diagnosis#:~:text=Diabetes%20is%20diagnosed%20at%20fasting,equal%20to%20126%20mg%2Fdl
  15. Bzdok et al. Machine learning: A primer. Nature Methods. 2017;14(12):1119-1120. DOI: 10.1038/nmeth.4526
  16. Bellman RE. Adaptive Control Processes. Princeton, NJ: Princeton University Press; 1961
  17. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer; 2016
  18. Chapman et al. Statistical learning theory for high dimensional prediction: Application to criterion-keyed scale development. Psychological Methods. 2016;21(4):603-620. DOI: 10.1037/met0000088
  19. Argent et al. The importance of real-world validation of machine learning systems in wearable exercise biofeedback platforms: A case study. Sensors (Basel). 2021;21(7):2346. DOI: 10.3390/s21072346
  20. Parikh et al. Understanding and using sensitivity, specificity and predictive values. Indian Journal of Ophthalmology. 2008;56(1):45-50. DOI: 10.4103/0301-4738.37595
  21. Mendis A. Statistical Modeling vs. Machine Learning. 2019. Available from: https://www.kdnuggets.com/2019/08/statistical-modelling-vs-machine-learning.html
  22. Hayes K, Rajabathar R, Balasubramaniam V. Uncovering the machine learning "black box": Discovering latent patient insights using text mining & machine learning. In: Conference Paper Presented at Innovation in Analytics via Machine Learning & AI; Las Vegas, NV. 2019. Available from: https://www.pmsa.org/other-events/past-symposia
  23. Belabbas M, Wolfe PJ. Spectral methods in machine learning and new strategies for very large datasets. Proceedings of the National Academy of Sciences. 2009;106(2):369-374. DOI: 10.1073/pnas.0810600105
  24. Kempa-Liehr et al. Healthcare pathway discovery and probabilistic machine learning. International Journal of Medical Informatics. 2020;137:104087. DOI: 10.1016/j.ijmedinf.2020.104087
  25. Wasserman L. Rise of the machines. In: Past, Present, and Future of Statistical Science. Chapman and Hall; 2013. pp. 1-12. DOI: 10.1201/b16720-49
  26. Ranjan R. Calibration in machine learning. 2019. Available from: https://medium.com/analytics-vidhya/calibration-in-machine-learning-e7972ac93555
  27. Childs CM, Washburn NR. Embedding domain knowledge for machine learning of complex material systems. MRS Communications. 2019;9(3):806-820. DOI: 10.1557/mrc.2019.90
  28. Hüllermeier E, Waegeman W. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning. 2021;110:457-506. DOI: 10.1007/s10994-021-05946-3
  29. Goh et al. Evaluating human versus machine learning performance in classifying research abstracts. Scientometrics. 2020;125:1197-1212. DOI: 10.1007/s11192-020-03614-2
  30. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21:6. DOI: 10.1186/s12864-019-6413-7
  31. Ansari et al. Rethinking human-machine learning in Industry 4.0: How does the paradigm shift treat the role of human learning? Procedia Manufacturing. 2018;23:117-122. DOI: 10.1016/j.promfg.2018.04.003
  32. Morgenstern et al. Predicting population health with machine learning: A scoping review. BMJ Open. 2020;10(10):e037860. DOI: 10.1136/bmjopen-2020-037860
  33. Terranova et al. Application of machine learning in translational medicine: Current status and future opportunities. The AAPS Journal. 2021;23:74. DOI: 10.1208/s12248-021-00593-x
  34. Kleczyk E, Hayes K, Bennett M. Building organization AI and ML acumen during the COVID era. In: PMSA Annual Conference; Louisville, KY. 2022
