Patient demographic summary (age and gender).
Coronavirus disease (COVID-19) caused an overwhelming healthcare, economic, social, and psychological impact on the world during 2020 and the first part of 2021. Certain populations, especially those with Substance Use Disorders (SUD), were particularly vulnerable to contracting the virus and were also likely to suffer a greater psychosocial and psychological burden. COVID-19 and addiction are two conditions on the verge of a collision, potentially causing a major public health threat. There was a surge of addictive behaviors (both new and relapse), including use of alcohol, nicotine, and recreational drugs. This book chapter analyzed the bi-directional relationship between COVID-19 and SUD by leveraging descriptive summaries, advanced analytics, and machine learning approaches. The data sources included a healthcare claims dataset as well as state-level alcohol consumption data. Results suggest that alcohol and nicotine use increased during the pandemic and that the profile of SUD patients included diagnoses and symptoms of COVID-19, depression and anxiety, as well as hypertensive conditions.
- opioid addiction
- public health
- advanced analytics
- linear regression
- machine learning model
The coronavirus disease (COVID-19) caused a large healthcare, economic, and psychosocial impact on communities in the United States and around the world in 2020 and the first part of 2021. Many communities, especially those with low income and Substance Use Disorders (SUD), were particularly vulnerable to contracting the infection and likely to suffer a greater economic and psychosocial burden.
Addiction, characterized by a range of mental, physical, and behavioral symptoms, claims the lives of millions of people every year around the world. In its National Survey on Drug Use and Health, the Substance Abuse and Mental Health Services Administration estimated that 22.6 million Americans 12 years of age or older (9.2% of the population) have SUD, including alcohol and tobacco use. Furthermore, long-term treatment is challenging due to frequent relapse. Alcohol consumption and drug addiction account for around 1.5% of the global burden of disease, and the share can be as high as 5% in some countries, according to recent data.
There are two basic settings to treat SUD: inpatient and outpatient. The primary goal is for patients affected by addiction to be in the most effective, yet least restrictive, environment that allows them to move along a continuum of care, depending on their personal and medical needs. There are four phases of SUD care: outpatient treatment, intensive outpatient treatment, residential treatment, and inpatient hospitalization.
Furthermore, SUD treatment programs are often designed based on three basic models:
Psychological model that includes behavioral therapy and treats emotional challenges as the primary cause of SUD.
Medical model that requires treatment of SUD symptoms by a healthcare provider. It focuses on the physiological, biological, and genetic causes of the disease.
Sociocultural model that aims to modify the physical and social environment of a person with SUD.
Many patients receiving SUD treatment may also have problems in other areas of their life, including but not limited to physical and mental health issues, relationship problems, inadequate social and work skills, as well as legal or financial challenges. As a result, the treatment options should aim to address the entire spectrum of issues, and not only treat the addiction component.
Even with the variety of treatment options for SUD, more than 6,000 people a month in the US died from overdoses before the pandemic started. In addition to the continued loss of lives due to addiction, the pandemic added other challenges for those suffering from SUD, resulting in an additional 2,000 individuals a month dying from SUD between March and August 2020. Government-imposed COVID-19 restrictions, such as home confinement, placed an enormous economic burden on communities in the US as well as around the world. Individuals and their families faced various unwelcome emotional, psychological, and behavioral challenges, including excessive substance abuse and depression, which further increased the risk for addiction. The COVID-19 related restrictions caused individuals to turn to smoking, alcohol, and drugs, including opioids and synthetic drugs like fentanyl, as well as gaming activities to deal with the COVID-19 pandemic [6, 7, 8].
On the other hand, individuals suffering from addiction were often also part of low-income communities that already faced many significant challenges related to access to healthcare, quality education, and unemployment. They were also more prone to contracting the infection during the COVID-19 pandemic due to their underlying comorbid conditions and immune system deficiencies [8, 9].
In this book chapter, the bi-directional correlation between COVID-19 diagnoses and SUD was investigated, and insights were provided to better understand the impact of the pandemic on addiction occurrence. The research leveraged multiple analytics methods, ranging from descriptive statistics through simple linear regression to selected machine learning models, to analyze this relationship. The data sources utilized for the analysis included a healthcare claims dataset and state-level alcohol consumption data.
2. Literature review
The COVID-19 pandemic limited social interactions for individuals around the world due to strict national, state, and local governmental restrictions [10, 11]. As a result of the restrictions, many individuals started using tobacco, alcohol, and other substances to help with stress-related symptoms. On the other hand, the increased restrictions and home confinement reduced substance exposure, but also resulted in more pronounced cravings and withdrawal effects in current users. Selected articles have cited a substantially increased number of drug- and alcohol-withdrawal cases and hospitalizations, which potentially put a burden on already strained health care systems [12, 13].
Opioid addiction and its management were a frequently discussed SUD topic in the COVID-19 era. Individuals with opioid addiction faced particular challenges due to difficulty in accessing healthcare services, imposed restrictions on prescription and over-the-counter drugs, closures of rehabilitation centers, and an increased risk of life-threatening withdrawals. While loosening of restrictions was recommended for home-based self-injections and long-acting formulations of methadone and buprenorphine to mitigate these problems, there was also fear of overdosing and fatalities [15, 16].
Due to the financial burden and an uncertain future as a result of the pandemic, gambling activities also increased to unprecedented levels [17, 18]. Eating disorders and compulsive buying were increasingly reported [19, 20]. The COVID-19 pandemic created a vicious cycle of stress, depression, social isolation, anxiety, and excess leisure time that led to a surge of behavioral addictions, resulting in mood alterations, irritability, anxiety, and stress [19, 20].
3. Data and methodology overview
Multiple types of data were utilized for this research. The first data source was the healthcare claims database, with a study time period from January 31, 2019 to December 31, 2020. Two patient cohorts, study target and control, were established using SUD and COVID-19 ICD-10 diagnosis codes. The diagnosis codes are listed in the appendix. The healthcare claims dataset included diagnosis codes, medical and surgical codes, and therapeutics and treatments prescribed at the transactional level. In addition, socioeconomic variables, including age, gender, race, education, and income levels, were leveraged to provide additional insights into the characteristics of patients with SUD during the COVID-19 pandemic.
The second dataset employed for the study represented the state-level alcohol consumption trends for 2019 and 2020. The 2020 data was available, however, only through the end of September. For this analysis, per capita alcohol sales data by type of alcoholic beverage from 19 states (Alaska, Arkansas, Colorado, Connecticut, Delaware, Florida, Illinois, Kansas, Kentucky, Louisiana, Massachusetts, Missouri, North Dakota, Oregon, Tennessee, Texas, Utah, Virginia, and Wisconsin) was leveraged. Only information from the states noted above was used due to limited availability of data from other states.
A number of analytical methods were employed for the analysis, from rules-based patient qualification criteria, descriptive statistics, and linear regression analysis to machine learning algorithms, in order to understand the bi-directional relationship between SUD trends and the COVID-19 surge.
3.1 Healthcare claims patient level database
The healthcare claims database is an anonymized longitudinal patient data set that can help researchers, healthcare providers, and pharmaceutical companies design research studies that compare diagnosis and treatment outcomes based on individual patient experiences and interactions with the US healthcare system.
The healthcare claims database leveraged for this study consisted of medical, hospital, and prescription claims across all insurance payment types. As shown in Figure 1, the database covers more than 317 million patients in the US, spans more than 17 years of medical health history, and includes more than 1.9 million healthcare providers. The data elements used for the study included diagnosis codes for SUD and other comorbid conditions, procedures and treatments, payment types (commercial, Medicaid, Medicare, and cash), along with patient sociodemographic characteristics like age, gender, race, education and income levels, as well as geography.
3.2 Methodology overview: Linear regression introduction
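One of the methods utilized to analyze the relationship between addiction and the COVID-19 pandemic was a linear regression approach. In statistics, linear regression is a linear approach to modeling the relationship between a scalar response (dependent variable – y) and one or more explanatory variables (independent variables – x):
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$$
where $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the regression coefficients, and $\varepsilon$ is the error term.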
When there is only one explanatory variable, the regression is called a simple linear regression. When there is more than one independent variable, the process is called a multiple linear regression.
In a linear regression, the relationships are modeled using linear predictor functions, whose model parameters are estimated from the data. Linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis.
Two metrics are often leveraged to evaluate the model performance: R-squared and the F-statistic. R-squared, also called the coefficient of determination, is the proportion of the variance in the dependent variable that can be explained by the variation in the independent variable(s). The value of the metric ranges between 0 and 1, and a higher value represents a better fit of the model. An F-test is a statistical test often used when comparing statistical models fitted to the studied datasets, in order to identify the model that best fits the population from which the data sample was drawn.
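The two metrics can be illustrated with a minimal sketch of a simple linear regression. The function name `simple_ols` and the data values are hypothetical and are not taken from the study; the F-statistic is computed from R-squared for a model with a single predictor.

```python
def simple_ols(x, y):
    """Fit y = b0 + b1 * x by ordinary least squares and report fit metrics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    p = 1  # number of predictors in a simple linear regression
    f_statistic = (r_squared / p) / ((1 - r_squared) / (n - p - 1))
    return {"intercept": intercept, "slope": slope,
            "r_squared": r_squared, "f_statistic": f_statistic}


# Hypothetical illustrative data (not from the study)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(x, y)
print(fit)
```

A large F-statistic together with an R-squared close to 1 indicates that the predictor explains most of the variation in the response.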
3.3 Machine learning introduction
Machine learning is a subfield of artificial intelligence that draws on statistics, mathematics, and computer algorithms, and focuses on building applications that learn and improve their predictive capabilities automatically over time without being explicitly programmed to do so. Machine learning models are built upon a statistical framework, since they involve data elements often described using statistical distributions and assumptions. These algorithms have gained popularity in recent years due to the increased availability of data and significant advancements in computing power.
In this book chapter, selected algorithms were leveraged to analyze the relationship between SUD and COVID-19 diagnoses. The analysis identified factors beyond the pandemic, such as patient characteristics: age, race, education and income levels, comorbid conditions (example: diabetes, hypertension, mental health), concomitant treatments that increased the addiction diagnoses, including patients most likely to struggle with SUD, regions of greater prevalence, and comorbid conditions presented along with SUD and COVID-19 diagnoses.
3.3.1 Supervised learning algorithms
Supervised learning is the process of training or building machine learning algorithms, in which algorithms learn to map from input space (X) to output space (Y).
The major objective is to approximate the mapping function (f) in order to predict the outcome (y) when a new data point (x) is added. Supervised learning algorithms are mainly used for classification and prediction problems. The following are examples of supervised algorithms: logistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting (XGBoost), support vector machines (SVMs), naïve Bayes, adaptive boosting (AdaBoost), and artificial neural network (ANN).
3.3.2 Unsupervised learning algorithms
Unsupervised learning algorithms, on the other hand, learn the hidden patterns within the input dataset (X). These models are called unsupervised because there is no supervision to guide them, and the algorithms learn, discover, and display the patterns in the input data (X). These algorithms are often employed for uncovering natural clusters, dimension reduction, anomaly detection, etc. Examples of unsupervised algorithms include k-means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), and the apriori algorithm (association rules).
Depending on the study objectives and the available data type, algorithms are tested for performance and data type fit, and are selected accordingly. Random forest and extreme gradient boosting models were selected to explain the bi-directional relationships of the SUD trends and the COVID-19 pandemic surge.
3.3.3 Extreme gradient boosting
The gradient boosting algorithm builds an ensemble of weak prediction models, mostly decision trees. XGBoost starts by creating a first simple tree [32, 33], then adds other trees, building upon the weaker learners. The model learns with each iteration, with each new tree correcting the errors of the previous trees, until an optimal point is reached.
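The sequential idea can be sketched with a plain gradient boosting loop for squared error that uses one-split regression trees (stumps) as weak learners. This is an illustrative simplification and not the XGBoost implementation, which adds regularization and other refinements; all function names and data values below are hypothetical.

```python
def fit_stump(x, residuals):
    """Return the one-split regression stump with the lowest squared error."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def mse(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, stumps, predictions


# Hypothetical illustrative data (not from the study)
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 2.0, 5.0, 5.8, 6.1]
base, stumps, boosted = gradient_boost(x, y)
baseline_mse = mse(y, [base] * len(y))
boosted_mse = mse(y, boosted)
print(baseline_mse, boosted_mse)
```

Each round lowers the training error a little, which is why a learning rate below 1 and a sufficient number of rounds are typically combined.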
Feature importance is a value mostly generated by tree-based models like decision trees, random forest, and XGBoost, and signifies the importance of each feature in the model in predicting the outcome. It represents how good the feature is at reducing node impurity. It is widely known as ‘gini importance’ or ‘mean decrease impurity,’ and is defined as the total decrease in node impurity averaged over all trees of the ensemble. In XGBoost, importance is mostly calculated as weight, gain, or cover, where ‘weight’ is the number of times a feature is used to split the data across the trees, ‘gain’ is the average gain of splits that use the feature, and ‘cover’ is the average coverage of those splits, with ‘coverage’ being defined as the number of samples affected by the split.
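The impurity decrease behind ‘gini importance’ can be sketched for a single split of binary labels. The helper names and label lists are hypothetical; in a full ensemble, this quantity would be accumulated per feature over all splits and averaged over all trees.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Hypothetical node with 4 target (1) and 4 control (0) patients
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A weak split that only partly separates the classes
weak_decrease = impurity_decrease(parent, [1, 1, 1, 0], [1, 0, 0, 0])

# A perfect split that fully separates the classes
perfect_decrease = impurity_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])

print(weak_decrease, perfect_decrease)
```

A feature whose splits produce larger impurity decreases receives a higher importance score.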
3.3.4 Random forest
Random forest, or random decision forest, is an ensemble learning method for classification and regression analysis that constructs a collection of decision trees during training. The output of the random forest for the classification task is the class selected by the majority of trees, while for the regression task, the output is the mean prediction across the individual trees [35, 36].
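The aggregation step can be sketched in a few lines, assuming the individual tree outputs are already available. The function names and values are hypothetical, and tree training itself is omitted.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees.

    Ties are broken in favor of the class that appears first in the votes.
    """
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the individual tree predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for one patient
class_votes = [1, 0, 1, 1, 0]
majority_class = forest_classify(class_votes)

# Hypothetical regression outputs of three trees
average_prediction = forest_regress([2.0, 4.0, 6.0])

print(majority_class, average_prediction)
```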
3.3.5 Chi-square test and p-value
The Chi-square test is one of the most widely used non-parametric tests. It compares the observed and expected frequencies of one or more attributes in a contingency table and is used either to test the independence of attributes or as a ‘goodness of fit’ test.
The p-value, also used in this study, evaluates the statistical significance of the predictor variables. The significance level was set at 5% and 10% to aid the feature importance evaluation and the interpretation of statistical results [24, 38].
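A minimal sketch of the Chi-square test of independence on a contingency table follows. The counts are hypothetical and are not taken from the study. The closed-form p-value used here is valid only for one degree of freedom (a 2x2 table); other table sizes would need a statistical library.

```python
import math


def chi_square_independence(table):
    """Return the Chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail p-value of a Chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical 2x2 table: rows are cohorts, columns are feature present/absent
table = [[30, 70],
         [50, 50]]
statistic, dof = chi_square_independence(table)
p_value = p_value_one_dof(statistic)
print(statistic, dof, p_value)
```

With these hypothetical counts, the p-value falls below 5%, so the feature would be treated as significantly associated with the cohort.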
3.3.6 Classification metrics
The following classification metrics were leveraged to validate the machine learning models’ performance. A confusion matrix is often generated from the predicted probability values with 0.5 as the classification threshold. Patients with a probability value greater than or equal to 0.5 are noted as 1, and those below 0.5 are noted as 0.
True Positive (TP) – Target patient correctly identified by the model as target patient
False Positive (FP) – Control patient misclassified by the model as target patient
True Negative (TN) – Control patient correctly classified by the model as control patient
False Negative (FN) – Target patient misclassified by the model as control patient
Model performance metrics:
Accuracy: % of total patients correctly identified among total patients
Positive Predictive Value (PPV, Precision): % of true target patients among total predicted target patients
True Positive Rate (TPR, Sensitivity, Recall, Hit Rate): % of true target patients who were correctly identified among total target patients
False Positive Rate (FPR): % of true control patients incorrectly identified among total control patients
Specificity (True Negative Rate): % of true control patients who were correctly identified as control among total control patients
F1 Score: harmonic mean of precision and recall
AUC: Area Under the Receiver Operating Characteristic (ROC) Curve, used to evaluate the trade-off between the true positive rate and the false positive rate.
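The metrics above can be computed directly from the confusion matrix counts, as in the following sketch. The labels and predicted probabilities are hypothetical, the function names are illustrative, and the code assumes that each denominator is non-zero. AUC is computed with the rank-based definition: the share of (target, control) pairs in which the target patient receives the higher predicted probability, with ties counted as one half.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute confusion matrix counts and the derived metrics."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc(y_true, y_prob):
    """Rank-based AUC over all (target, control) pairs."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = target, 0 = control) and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
metrics = classification_metrics(y_true, y_prob)
auc_value = auc(y_true, y_prob)
print(metrics, auc_value)
```

Note that specificity equals one minus the false positive rate, so the two metrics carry the same information.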
4. Analysis results and discussion
4.1 Substance use disorder trends overview
This section provides an overview of SUD trends for 2019 and 2020 when leveraging the healthcare claims dataset that was discussed in the earlier section of the chapter. The summary includes information on the overall trends, patient demographics, and insights into the COVID-19 diagnosis rates. The first part of the analysis was to review and understand the SUD diagnoses trends as well as COVID-19 infection rates within the SUD population. The focus of the analysis was on the SUD population only to understand changes in trends during the pandemic.
The monthly trends of patients with SUD diagnoses showed that the addiction trends stayed consistent over 2019 and 2020, with the exception of the April–May 2020 timeframe. The list of SUD diagnoses is presented in the appendix (Tables 6–9). At the beginning of the pandemic (April–May 2020), there was a decrease in the number of patients with addiction diagnoses. A two-sample t-test that compared the SUD diagnosis counts between April–May 2019 and April–May 2020 revealed that the difference in counts was not significant at either the 5% or the 10% significance level. However, the directional decline might have been a result of the state-enacted restrictions, including home confinement as well as the inability to hold in-person HCP office visits and elective procedures (Figure 2).
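A two-sample comparison of this kind can be sketched as follows. The chapter does not state which variant of the t-test was used, so the sketch assumes Welch's unequal-variance version; the weekly counts are hypothetical. Only the test statistic and degrees of freedom are computed here, since converting them to a p-value requires the t distribution, which is available in statistical libraries such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`).

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means
    se2_a = variance(sample_a) / na
    se2_b = variance(sample_b) / nb
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t_stat, dof


# Hypothetical weekly diagnosis counts for the same period in two years
counts_2019 = [100, 102, 98, 101, 99]
counts_2020 = [95, 97, 93, 96, 94]
t_stat, dof = welch_t(counts_2019, counts_2020)
print(t_stat, dof)
```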
The SUD diagnosis trend analysis also involved splitting the patient cohort into patients newly diagnosed in the last 12 months and patients previously diagnosed within the same timeframe. The analysis showed that the share of newly diagnosed patients vs. previously diagnosed patients declined slightly between 2019 and 2020, but the difference was not statistically significant. In 2020, newly diagnosed patients accounted for 62% of all patients vs. 66% in 2019. In addition, patients diagnosed with addiction as well as COVID-19 represented 3% of the newly diagnosed patients and 4% of those with a previous diagnosis.
Furthermore, several different types of SUD experienced a decline in the number of patients diagnosed at the start of the pandemic. Opioid dependence was the leading addiction type, with alcohol dependence following as the next most frequently diagnosed SUD (Figure 3). The counts of opioid dependence diagnoses were statistically different from the counts for other types of SUD. A statistically significant difference in trends at the 5% significance level was also observed between opioid dependence and alcohol dependence, opioid dependence and cannabis dependence, sedatives dependence and cannabis dependence, and sedatives dependence and alcohol use. The addiction diagnosis codes are noted in the appendix. Patients with a psychoactive type of addiction represented a higher share within the COVID-19 diagnosed population (19%) as compared to the overall share (3%). Psychoactive SUD refers to an addiction type with hallucinatory symptoms. The COVID-19 patient distribution for other addictions was very similar to that of the overall addiction population.
An additional analysis of demographic and geographic attributes, as presented in Table 1, revealed that males represented a higher percentage of the SUD population than of the SUD and COVID-19 population, but the percentage was not statistically significantly different from the percentage of women SUD patients. On the other hand, patients using cannabis appeared younger compared to the rest of the SUD population. This finding was statistically significant based on a two-sample t-test (p-value = 0.00, statistically significant at 5%).
Most of both SUD and COVID-19 patients had commercially provided insurance coverage (∼70%), while ∼30% of patients had a government provided healthcare insurance (p-value = 0.00, statistically significant at 5%). Inhalants and rehabilitation drug addiction represented the highest share of patients with commercial insurance with more than 75%.
Furthermore, the North East and Midwest regions represented the two main geographic areas of the United States with the highest number of patients diagnosed with SUD and covered more than 50% of the total addiction diagnosed patients. The share of patients in these two regions was statistically significantly higher compared to the rest of the US regions (p-value = 0.00, statistically significant at 5%). The West region, on the other hand, covered approximately 20% of the addiction diagnosed population.
The SUD treatment pattern analysis revealed that the procedural services, including psychotherapy, recommended to treat SUD patients declined in April and May 2020 and then returned to pre-pandemic levels, on par with the 2019 trends (Figure 4). The decline was statistically significant at the 10% significance level with a p-value = 0.07. The decline might have been related to the country-wide lockdown imposed during those two months of 2020.
On the other hand, the number of patients treated with prescription medications increased statistically significantly between 2019 and 2020 (p-value = 0.00, statistically significant at 5%). The trend continued to increase even during the pandemic, implying that patients continued to receive care (Figure 5). The drug names are presented in the appendix. Prescription treatments for drug-related addiction had the highest share of the treatments, followed by addiction relapse treatments. The share of drug prescription treatments was statistically different from the other types of therapy, including relapse and alcohol treatments.
4.2 Alcohol consumption overview
This section of the book chapter provides an overview of alcohol consumption trends for 2019 and 2020. For this analysis, per capita alcohol sales data by type of alcoholic beverage from 19 states (Alaska, Arkansas, Colorado, Connecticut, Delaware, Florida, Illinois, Kansas, Kentucky, Louisiana, Massachusetts, Missouri, North Dakota, Oregon, Tennessee, Texas, Utah, Virginia, and Wisconsin) was leveraged. The alcohol consumption information was limited to these states due to limited data availability at the source.
The monthly trends of pure alcohol (gallons of ethanol) from 2019 and 2020 in Figure 6 showed that the trend stayed nearly the same, with only a directional increase in 2020. A two-sample t-test did not show statistically significant differences between the yearly trends. On the other hand, it was observed that with the increase in COVID-19 cases in the middle of the pandemic (June–August 2020), there was an associated increase in the consumption of pure alcohol, as evident from the high Pearson correlation coefficient of 0.87 between alcohol consumption and the number of patients diagnosed with COVID-19. The increase in alcohol use might have been associated with individuals experiencing hardship due to the prolonged lockdowns, job loss, and the overall changes in lifestyle as a result of the pandemic, with alcohol being perceived as a way of coping with the changing environment.
The trends describing gallons of alcohol per capita for age 14 and older (Figure 7) showed a statistically significant increase (p-value = 0.04, statistically significant at 5%) in gallons per capita from mid-May 2020, which might be a result of the spread of the COVID-19 pandemic. This was also apparent from a strong Pearson correlation coefficient of 0.91 between the pandemic outbreak, as denoted by the volume of patients diagnosed with COVID-19, and gallons of alcohol per capita.
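The Pearson correlation coefficient reported above measures the strength of the linear association between two series. A minimal sketch of the calculation follows; the two monthly series are hypothetical and do not reproduce the study data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly values: COVID-19 diagnosed patients (scaled) and
# alcohol consumption (scaled)
covid_patients = [1, 2, 3, 4, 5]
alcohol_gallons = [2, 4, 5, 4, 5]
r = pearson_r(covid_patients, alcohol_gallons)
print(r)
```

A value close to 1 indicates that the two series rise together, but it does not by itself establish that one causes the other.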
In order to understand the alcohol consumption over time, the percentage change in gallons of alcohol per capita between the 2017–2019 3-year average and 2020 was analyzed (Figure 8). Overall, a visible increase in the trend was noticed; however, in May 2020, a statistically significant decline in the percentage change in alcohol consumption (p-value = 0.06, statistically significant at 10%) was observed, which might have been due to the COVID-19 imposed lockdowns and closures of liquor stores. From the month of June onwards, a statistically significant increase in the percentage change (p-value = 0.06, statistically significant at 10%) was noted, which might have been associated with the lockdown restrictions being partially lifted, leading to the re-opening of liquor stores and increased purchasing levels.
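The percentage change against a multi-year baseline can be sketched as follows, assuming one per capita value per year for the same calendar month. The function name and the values are hypothetical.

```python
def pct_change_vs_baseline(baseline_values, current_value):
    """Percentage change of the current value vs. the baseline average."""
    baseline = sum(baseline_values) / len(baseline_values)
    return (current_value - baseline) / baseline * 100


# Hypothetical gallons per capita for one month in 2017, 2018, and 2019
baseline_values = [2.0, 2.1, 2.2]
# Hypothetical gallons per capita for the same month in 2020
change = pct_change_vs_baseline(baseline_values, 2.31)
print(change)
```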
Figure 9 depicts a comparison of selected states’ gallons of alcohol consumption per capita between 2019 and 2020. As noted earlier, only a few selected states were considered for the analysis due to the limited availability of the data at the source. For most of the states, an increase in alcohol (in gallons) per capita was noticeable in 2020. Delaware had the highest increase in per capita alcohol in 2020, which might have been related to its excise tax on liquor of $3.75 per gallon, lower than that of 72% of the other states.
The alcohol consumption data was also analyzed for each alcohol type. Figure 10 presents a yearly consumption comparison between 2019 and 2020 for beer, wine, and spirits. It was observed that beer consumption was higher in January and February 2020 compared to January and February 2019, but after the COVID-19 pandemic started, the consumption decreased from March until May 2020, which might have been due to lockdowns and the limited number of liquor facilities open in each state. For wine and spirits, the consumption trend showed an increase from January 2020 onwards. It was also evident that the increase in the volume of beer, spirits, and wine consumption from 2019 to 2020 was statistically significant based on the two-sample t-test, which resulted in a p-value < 0.02 for each alcohol type.
4.3 Effects of alcohol and substance use during COVID-19
In this section of the book chapter, the effects of alcohol and SUD during the COVID-19 pandemic were analyzed. An ordinary least squares linear regression model was used to investigate the correlation between these events. The dataset employed was a combination of the healthcare claims data and the alcohol consumption data, both aggregated at the state and monthly levels for comparison [21, 22].
The results of the ordinary least squares regression are presented in Table 2. The analysis results showed that sedatives use, alcohol abuse, and beer consumption were highly significant variables and were positively correlated with the COVID-19 pandemic spread. Sedatives like benzodiazepines are often prescribed for anxiety and insomnia, which is consistent with this finding. Furthermore, an increased consumption of alcohol might have led to seeking treatments to manage the signs of frustration, sadness, mental health conditions, and stress caused by the prolonged isolation during the pandemic.
Other parameters, like opioid use and cannabis use/abuse, were significant as well, but they were negatively correlated with the pandemic spread. These results were consistent with previous research articles, which reported that drug use declined during the pandemic while at the same time patients suffered from withdrawals and other symptoms related to fewer substances being available for consumption [12, 13, 14, 41, 42]. On the other hand, while beer consumption was positively correlated with COVID-19 pandemic trends, the consumption of spirits showed the reverse correlation, which was contrary to the findings of overall alcohol trend increases. The difference in correlation might have been related to the differences in state regulations of the different types of alcohol, which in turn might have affected the availability of each alcohol type for consumption at the state level.
The model significance was evident from the F-statistic value of 14.7 with an adjusted R-squared value of ∼74%, which indicated that a relatively high proportion of the variation in the data could be explained by the predictor variables.
4.4 Machine learning: important features leading to addiction
To understand the parameters associated with SUD and identify whether the COVID-19 pandemic impacted the addiction diagnosis rate, supervised classification machine learning algorithms, including random forest and XGBoost, were applied.
4.4.1 Dataset overview
As a part of the analysis, two distinct patient cohorts, the study target and control groups, were developed to allow for analysis of the SUD and COVID-19 trends. The distinction between these two groups permitted the machine learning models to learn the variations in the data and identify the important variables that best distinguished between both groups. The target group was defined as the patients in the data from October 2020 to December 2020 who had at least two addiction diagnoses, followed by a treatment after the initial diagnosis. The control group was defined as the patients from October 2019 to December 2019 who had two addiction diagnoses, followed by a treatment after the initial diagnosis. A sample of ∼20,000 records was randomly selected for the modeling exercise based on a similar age and gender distribution as in the target group. Two months of historical claims data related to diagnoses, procedures, and pharmacological treatment were pulled from the initial diagnosis event, along with other demographic data elements like age, gender, income, education, etc. The healthcare claims level data was converted to patient level records using data pre-processing steps, and a final data structure with ∼20,000 records and ∼15,000 features was created for machine learning modeling.
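The conversion from claims level data to patient level records can be sketched as a pivot into one row per patient with one binary indicator per code. The chapter does not describe the pre-processing steps in detail, so this is only one plausible approach; the patient identifiers and code labels below are hypothetical placeholders rather than actual claims codes.

```python
from collections import defaultdict


def claims_to_patient_features(claims):
    """Pivot (patient_id, code) claim rows into one indicator row per patient."""
    codes = sorted({code for _, code in claims})
    column = {code: index for index, code in enumerate(codes)}
    rows = defaultdict(lambda: [0] * len(codes))
    for patient_id, code in claims:
        # Mark the code as present at least once in the patient's history
        rows[patient_id][column[code]] = 1
    return codes, dict(rows)


# Hypothetical claim rows: (patient identifier, diagnosis or treatment label)
claims = [
    ("p1", "dx:opioid_dependence"),
    ("p1", "dx:covid19"),
    ("p2", "dx:alcohol_dependence"),
    ("p1", "dx:covid19"),  # repeated claims collapse into a single indicator
]
feature_names, patient_rows = claims_to_patient_features(claims)
print(feature_names)
print(patient_rows)
```

With thousands of distinct codes, this kind of pivot quickly produces a very wide table, which is why a feature selection step is needed before modeling.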
4.4.2 Feature selection
The data elements used for the study included diagnosis, procedures, and pharmacological treatments along with other demographic features like age, gender, income, education, etc. Since the number of features was ∼15,000, the data element dimension needed to be reduced to a more manageable number.
In order to reduce the variables’ space and select the top features, the LightGBM and Boruta algorithms were leveraged for the purposes of dimension reduction. LightGBM is a gradient boosting framework, which uses tree-based learning, whereas Boruta is a feature selection algorithm, a wrapper built around RF Classification algorithm. The top features from both the algorithms were selected for the machine learning algorithms development .
Table 3 below lists the selected features that were important in the preliminary run of the models. Data elements related to COVID-19 diagnoses and associated symptoms, along with alcohol and nicotine use and major depressive disorder, were noted as important variables separating the 2020 and 2019 SUD patient cohorts.
4.4.3 Machine learning models overview
To understand the underlying factors for the SUD 2020 and 2019 patient cohorts and investigate whether there was an association with the COVID-19 pandemic, the following machine learning models were applied: random forest and XGBoost. Hyperparameter tuning was also performed to optimize the models. A brief methodology overview is presented below.
Random forest is a classification algorithm consisting of many decision trees. It uses bootstrap aggregation (bagging) and feature randomness when building each individual tree. It creates a forest of largely uncorrelated trees whose combined prediction is typically more accurate than that of any individual tree. The model output also provides estimates of the importance of each variable in the classification.
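The two ingredients of bagging, resampling with replacement and voting, can be sketched in a few lines. This is an illustration of the general idea only, not the implementation used in the study; the function names and toy inputs are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common class label among the individual tree votes."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)           # fixed seed so the sketch is repeatable
rows = list(range(10))           # stand-in for ten training records
sample = bootstrap_sample(rows, rng)

# Five hypothetical trees vote on one patient: three say class 1, two say class 0.
vote = majority_vote([1, 0, 1, 1, 0])
```

Each tree in a forest would be trained on its own bootstrap sample, and the forest prediction for a record would be the majority vote across trees.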
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. Trees are built sequentially, with the construction of each tree parallelized. Each new tree learns from the errors of the ensemble built so far, so that a series of weak learners is combined into a stronger model. XGBoost can also train each tree on a random subsample of the data, a technique related to bootstrap aggregating (bagging), which implies dividing the data into subsamples for each iteration of model training. For prediction purposes, the outputs of all learners are combined into a single score, which is then converted into a class label.
Hyperparameter tuning is the optimization process of finding the model settings that improve model performance. The objective is to minimize the cost function and hence reduce the error of the model. A related optimization technique, gradient descent, is used when fitting model parameters: it starts from an initial (often random) parameter assignment, calculates the cost function, and improves the parameters at each step until the cost function reaches a minimum. Mathematically, for a least-squares cost, it takes the derivative of the sum of squared residuals and looks for the point where the derivative equals 0, that is, where the function stops decreasing.
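The gradient descent idea described above can be illustrated on the simplest possible case, a straight-line fit that minimizes the sum of squared residuals. This is a minimal sketch under stated assumptions: the learning rate, step count, starting point of zero (rather than a random start), and toy data are all chosen for illustration and are not taken from the study.

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y = a*x + b by gradient descent on the sum of squared residuals."""
    a, b = 0.0, 0.0  # fixed starting point; a random start would also work
    for _ in range(steps):
        residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
        # Partial derivatives of sum(residual**2) with respect to a and b.
        grad_a = -2 * sum(r * x for r, x in zip(residuals, xs))
        grad_b = -2 * sum(residuals)
        # Step against the gradient to reduce the cost.
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1 with no noise
a, b = fit_line(xs, ys)
```

At convergence both partial derivatives are close to zero, which is the condition described in the text.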
4.4.4 Machine learning analysis details
The machine learning algorithms were evaluated on the performance metrics noted in the earlier section of the book chapter. Initially, the models were overfitting: they captured most of the variation in the training data but also captured the noise in the data. This resulted in many negative records being misclassified as positive, which might have led to a considerably lower precision value.
As noted in Table 4, the baseline random forest model resulted in an AUC of 73% and the baseline XGBoost model in an AUC of 74%, with recall of 74.08% and 82.80% respectively. When working with healthcare claims data, a high proportion of false negative records is perceived as a serious problem. For example, predicting a sick patient as healthy may delay treatment that the patient should have received on time, which may lead to health complications and potentially even death. Thus, it is advisable to minimize the number of false negative observations. Because recall (also called the true positive rate or sensitivity) rises as the number of false negatives falls, the two are inversely related. However, the initial baseline models resulted in a comparatively low recall. To improve model performance, hyperparameter tuning was executed to find the best model parameters.
In addition, the F1 score was defined as follows:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
which is a function of precision and recall and should be maximized so that both precision and recall are as high as possible [45, 46, 47]. The hyperparameter-tuned models not only improved the recall but also slightly improved the F1 scores for random forest and XGBoost, to 69.4% and 69.57% respectively, implying a robust model sensitive to false negative observations.
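As a worked illustration, precision, recall, and the F1 score can be computed directly from confusion matrix counts. The counts below are invented for the example and are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 40 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=40, fn=20)
```

Lowering the false negative count raises recall, while precision depends on the false positive count, which is why the F1 score is useful as a single summary of both.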
The machine learning models were retrained using several model parameters, including max_depth, min_samples_split, and max_features. To obtain the optimal values of the model parameters, 5-fold cross-validation was used in the hyperparameter tuning process.
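The mechanics of a grid search with k-fold cross-validation can be sketched as follows. The candidate values in the grid are hypothetical, since the chapter names the parameters but not the values that were tried, and the sketch stops short of fitting any model.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, valid
        start += size

# Hypothetical search grid over the three parameters named in the text.
grid = {
    "max_depth": [4, 8],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", "log2"],
}

# Every combination of grid values is one candidate configuration.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
folds = list(k_fold_indices(n_samples=23, k=5))
```

In a full search, each candidate would be trained on the training indices of every fold and scored on the held-out indices, and the candidate with the best average score (for example recall or F1) would be kept. In practice the records would usually be shuffled or stratified before splitting.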
Using the optimal values of the model parameters obtained from hyperparameter tuning, the models showed an improvement in performance, which was evident from the model metrics. As shown in Table 4, the recall increased significantly for both models, and the F1 scores improved slightly compared to the baseline models. The increase in recall reflects a decrease in the number of false negatives.
Figure 11 depicts the final ROC AUC plot, which shows the relationship between the true positive rate and the false positive rate at different probability thresholds. The true positive rate, also known as sensitivity, is the proportion of positive records that were correctly classified out of the total number of positive records. The false positive rate is the proportion of negative records that were misclassified out of the total number of negative records. Both random forest and XGBoost resulted in an AUC of 75%, with recall of 94.7% and 85.6% respectively, an improvement over the baseline models.
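The construction of an ROC curve and its AUC can be illustrated with a small sketch that sweeps the probability threshold over every predicted score. The labels and scores below are invented for illustration and do not come from the study.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over every score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels (1 = target cohort) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

For this toy data the area equals 8/9, which matches the fraction of positive and negative pairs in which the positive record receives the higher score.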
4.4.5 Machine learning model interpretation
The random forest and XGBoost models identified features that were a combination of SUD and COVID-19 data elements. Table 5 presents the most important features. The importance of the features was measured using the ‘gini’ importance metric, which is based on the impurity of a node. The metric measured how much each feature decreased the impurity at each split while building the decision trees in the algorithm, averaged over all the trees in the forest, resulting in the measure of feature importance [32, 33].
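The impurity decrease behind the Gini importance can be illustrated for a single split with binary labels. The labels below are invented for the example, and the sketch covers one node only; a forest would accumulate such decreases per feature over all splits and trees.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 4 records from the 2020 cohort (1) and 4 from 2019 (0).
parent = [1, 1, 1, 0, 0, 0, 0, 1]
left = [1, 1, 1, 0]    # records where the splitting feature is present
right = [0, 0, 0, 1]   # records where it is absent
decrease = impurity_decrease(parent, left, right)
```

Here the parent impurity is 0.5 and each child has impurity 0.375, so the split reduces impurity by 0.125.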
Features such as nicotine dependence, alcohol abuse, long-term drug therapy, disulfiram, and methadone were presented as important in explaining the differences in SUD patient cohorts between 2019 and 2020. For example, the importance value of 0.003 for nicotine dependence denoted the reduction in node impurity attributable to that variable, which contributed to the model robustness and a higher accuracy level. The effects of the pandemic on individuals’ lives were not restricted to patients’ physical health but also affected their mental health, as noted by the major depressive and anxiety diagnoses and suicidal tendencies among the most important healthcare data elements. The unexpected and unwanted changes imposed on daily lives drastically increased stress levels. Difficulties in managing the changing environment and following preventive measures, such as lockdowns, fueled stress levels even more. The economic downturn, leading to unemployment and low consumer confidence, also played an important role in increasing stress levels. As a result of the prolonged stress and anxiety due to the lockdowns, alcohol consumption, smoking, and the use of other nicotine-based products increased [12, 13, 41, 42].
It was also interesting to see features related to COVID-19 pandemic being noted as important differentiators between the 2019 and 2020 SUD patient cohorts. The features included COVID-19 diagnosis and related symptoms: headache, cough, acute upper respiratory infection, specimen collection for severe acute respiratory condition. Procedures noting HCP in-office and tele-visits along with in-patient hospital or ER visits were also noted as important variables, further highlighting that the amount of care might have increased as a result of SUD, but also due to COVID-19 diagnoses and related symptoms. In addition, medications often used to treat viruses and infections like Azithromycin were also presented as important data elements defining the 2020 SUD patient cohort.
Furthermore, the cohorts differed in the occurrence of comorbid conditions, such as chronic kidney conditions, hypertension, hyperlipidemia, and gastroesophageal conditions. This might indicate a potential impact of greater alcohol and other substance abuse during the pandemic, or it might simply show that the patient profile changed during the pandemic, expanding the definition of the SUD patient group. There were also several data elements identifying SUD treatments, such as Narcan and methadone, to list a few, and procedures related to drug testing, blood panels, and other related treatments. These indicate an increased rate of addiction testing and treatment between the two periods, confirming the earlier findings of increased SUD treatment trends. The analysis also showed that the cohorts differed on diagnoses of lower back pain and the use of pain relief medications.
From the sociodemographic data elements, patients diagnosed with or treated for addiction presented characteristics that can help further define the profiles of individuals who were likely to develop SUD during the pandemic. For example, an average age of 42 was observed for the impacted population. Caucasian and Black/African American ethnicities were also noted as prevalent. Patients with nicotine dependence, alcohol dependence, opioid use, and cannabis dependence were relatively more prevalent in the states of Florida and Texas. These states presented a relatively higher volume of patients with specific SUD diagnoses compared to other states. The impacted patients had attended some college or achieved at least a high school diploma and were more likely to be associated with communities of lower economic status, with annual income below $30K. These educational and economic levels are consistent with other published articles that present the economic impact of, and the increased risk for, COVID-19 within the low-income population.
5. Conclusions and study limitations
This book chapter investigated SUD and the resulting impact of the COVID-19 pandemic on the rate of diagnoses and treatment. Overall, the diagnosis rate of SUD in 2020 was consistent with that in 2019 (except for April and May); however, a statistically significant increase in treatment of different addiction types was noted during the pandemic. In 2020, newly diagnosed patients accounted for 62% of all SUD patients compared to 66% in 2019, but the difference was not statistically significant. Furthermore, the number of procedures performed for addiction testing declined significantly at the beginning of the pandemic and then returned to normal levels in June of 2020, while SUD treatment increased significantly between 2019 and 2020. In addition, patients diagnosed with both addiction and COVID-19 represented 3% of the newly diagnosed patients and 4% of those with an existing diagnosis. Patients using cannabis were found to be statistically significantly younger than the rest of the SUD population.
In 2020, a noticeable increase in alcohol consumption and drinking behaviors was observed compared to 2019, including an increase in the average gallons consumed by alcohol type: spirits, wine, and beer. A statistically significant positive percentage change in gallons of alcohol per capita was observed between the 2017 to 2019 period (a 3-year average) and 2020, which could be related to the increased stress levels due to the pandemic spread and prolonged lockdowns.
Machine learning analysis of the 2020 and 2019 SUD patient cohorts showed that patients in the 2020 cohort who were diagnosed with SUD were also often diagnosed with either COVID-19 or related symptoms, including headache, upper respiratory infection, and cough. Furthermore, SUD patients with addiction to drugs and nicotine products were likely more susceptible to contracting COVID-19, as a result of a weaker immune system due to lower white blood cell levels [51, 52]. The analysis also highlighted the importance of HCP in-office and tele-visits along with in-patient hospital visits, which could be related to the increased level of SUD treatment but could also reflect the severity of COVID-19 related symptoms and the need for treatment.
Moreover, excess alcohol consumption, identified as one of the important factors differentiating the two SUD patient cohorts, could lead to immune deficiency, causing increased susceptibility to certain diseases. Prolonged alcohol abuse may disrupt the digestive system and could result in liver failure. Alcohol use may also affect an individual’s ability to store adequate amounts of protein and nutrients. Most importantly, drugs and alcohol affect white blood cells, which act as the defense system of the body. A weaker defense system can increase the risk of developing life-threatening diseases.
Finally, based on the machine learning analysis, the SUD patient cohorts differed in the occurrence of comorbid conditions, such as chronic kidney conditions, hypertension, hyperlipidemia, and gastroesophageal conditions, which might indicate that the SUD patient profile changed during the pandemic due to changes in lifestyle and increased consumption of alcohol and tobacco. Additional investigation should be conducted to further examine the patients’ health history and understand the underlying reasons for the differences in the SUD patient cohort characteristics.
5.1 Study limitations
Due to the timing of writing this book chapter, not all data was available for the entire year of 2020 to allow for a comprehensive analysis. Adding further data on alcohol consumption, as well as data on recreational drug use by state during the pandemic, could enhance the analysis in presenting the SUD population characteristics, including their physical health, mental health, and economic state. Furthermore, it might also be helpful to add COVID-19 vaccination data by state in order to understand the effects of vaccinations and COVID-19 variants on general virus trends as well as on SUD impacted populations.
Additionally, the patient-level healthcare claims data, comprising prescription, medical, and hospital claims, may have gaps in the coverage of long-term care institutions, mental health hospitals, correctional facilities, and other institutions with limited public reporting, which could result in a potential bias in the studied population when compared to the entire US population. Furthermore, the diagnoses of COVID-19 related symptoms might also skew the analysis and overestimate the impact of COVID-19 on the SUD population. The statistical results could be further enhanced and become more robust with additional data availability and a better understanding of the diagnosis codes for the COVID-19 related symptoms.
Since the COVID-19 pandemic was a rare event, it became a new topic of interest for analysis. As a result, there was a limited number of prior research studies on this topic, which posed a challenge in creating a theoretical foundation for this book chapter’s research questions and hypotheses. With little prior research, developing an entirely new research typology was challenging.
The scope of the analysis can also be enhanced by adding data sources and using a longer timeframe to evaluate the impact of the pandemic on addiction and on the health of those impacted by either condition. Furthermore, a new set of advanced analytics, including deep learning and natural language processing (NLP) approaches, could be applied to create data-driven evidence to confirm newly established hypotheses, research objectives, and results.
Availability of data and materials: The healthcare claims dataset that supports the findings of this study is available from Symphony Health, ICON plc Organization, but restrictions apply to the availability of these data, which were used under a license for the current study, and so they are not publicly available.
The Alcohol Consumption Dataset is available from the National Institute on Alcohol Abuse and Alcoholism [Online]. https://pubs.niaaa.nih.gov/publications/surveillance-covid-19/COVSALES.htm
Conflict of interest
The authors declare no conflict of interest.
The authors work for Symphony Health, ICON plc Organization. The data used in the article is the property of Symphony Health, ICON plc Organization. The authors used the healthcare claims data for the sole purpose of publication of this article.
The authors would like to thank Mike Byzon, Suzanne Rosado, Heather Valera, Lakshya Mandawat, and Koichi Iwata for their valuable feedback on earlier versions of this chapter.