Open access peer-reviewed chapter

# Descriptive and Inferential Statistics in Undergraduate Data Science Research Projects

By Malcolm J. D’Souza, Edward A. Brandenburg, Derald E. Wentzien, Riza C. Bautista, Agashi P. Nwogbaga, Rebecca G. Miller and Paul E. Olsen

Submitted: April 17th 2016; Reviewed: September 12th 2016; Published: April 26th 2017

DOI: 10.5772/65721

## Abstract

Undergraduate data science research projects form an integral component of the Wesley College science and mathematics curriculum. In this chapter, we provide examples for hypothesis testing, where statistical methods or strategies are coupled with methodologies using interpolating polynomials, probability and the expected value concept in statistics. These are areas where real-world critical thinking and decision analysis applications pique a student’s interest.

### Keywords

• Wesley College
• STEM
• solvolysis
• phenyl chloroformate
• benzoyl chloride
• benzoyl fluoride
• benzoyl cyanide
• Grunwald-Winstein equation
• transition-state
• multiple regression
• time-series
• Ebola
• polynomial functions
• probability
• expected value

## 1. Introduction

Wesley College (Wesley) is a minority-serving, primarily undergraduate liberal-arts institution. Its STEM (science, technology, engineering and mathematics) fields contain a robust (federal and state) sponsored directed research program [1, 2]. In this program, students receive individual mentoring on diverse projects from a full-time STEM faculty member. In addition, undergraduate research is a capstone thesis requirement and students complete research projects within experiential courses or for an annual Scholars’ Day event.

Undergraduate research is indeed a hallmark of Wesley’s progressive liberal-arts core-curriculum. All incoming freshmen are immersed in research in a specially designed 100-level quantitative reasoning mathematics core course, a first-year seminar course and a 100-level frontiers in science core course [1]. Projects in all level-1 STEM core courses provide an opportunity to develop a base knowledge for interacting with and manipulating data. These courses also introduce students to modern computing techniques and platforms.

At the other end of the Wesley core-curriculum spectrum, the advanced undergraduate STEM research requirements reflect the breadth and rigor necessary to prepare students for (possible) future postgraduate programs. For analyzing data in experiential research projects, descriptive and inferential statistics are major components. In informatics, students are trained in the SAS Institute’s statistical analysis system (SAS) software and in the use of geographic information system (GIS) spatial tools through ESRI’s ArcGIS platform [2].

To help students with poor mathematical ability and to further enhance their general thinking skills, in our remedial mathematics courses, we provide a foundation in algebraic concepts, problem-solving skills, basic quantitative reasoning and simple simulations. Our institution also provides a plethora of student academic support services that include an early alert system, peer and professionally trained tutoring services and writing center support. In addition, Wesley College non-STEM majors are required to take the project-based 100-level mathematics core course and can then opt to take two project-based 300-level SAS and GIS core courses. Such students who are trained in the concepts and applications of mathematical and statistical methods can then participate in Scholars’ Day to augment their mathematical and critical thinking skills.

## 2. Linear free energy relationships to understand molecular pathways

Single and multiparameter linear free energy relationships (LFERs) help chemists evaluate multiple kinds of transition-state molecular interactions observed in association with compound variability [3]. Chemical kinetics measurements are understood by correlating the experimental compound reaction rate (k) or equilibrium data with their thermodynamics. The computationally challenging stoichiometric analysis elucidates metabolic pathways by analyzing the effect of physicochemical, environmental and biological factors on the overall chemical network structure. All of these determinations are important in the design of chemical processes for petrochemical, pharmaceutical and agricultural building blocks.

In this section, through results obtained from our undergraduate directed research program in chemistry, we outline examples with statistical descriptors that use inferential correctness for testing hypotheses about regression coefficients in LFERs that are common to the study of solvent reactions. To understand mechanistic approaches, multiple regression correlation analyses using the one- and two-term Grunwald-Winstein equations (Eqs. (1) and (2)) are proven to be effective instruments that elucidate the transition-state in solvolytic reactions [3]. To avoid multicollinearity, it is stressed that the chosen solvents have widely varying ranges of nucleophilicity (N) and solvent-ionizing power (Y) values [3, 4]. In Eqs. (1) and (2) (for a particular substrate), k is the rate of reaction in a given solvent, ko is the 80% ethanol (EtOH) reaction rate, l is the sensitivity toward changes in N, m is the sensitivity toward changes in Y and c is a constant (residual) term. In substrates that have the potential for transition-state electron delocalization, Kevill and D’Souza introduced an additional hI term to Eqs. (1) and (2) (as shown in Eqs. (3) and (4)). In Eqs. (3) and (4), h represents the sensitivity to changes in the aromatic ring parameter I [3].

$$\log(k/k_o) = mY + c \tag{1}$$

$$\log(k/k_o) = lN + mY + c \tag{2}$$

$$\log(k/k_o) = mY + hI + c \tag{3}$$

$$\log(k/k_o) = lN + mY + hI + c \tag{4}$$

Eqs. (1) and (3) are useful in substrates where the unimolecular dissociative transition-state (SN1 or E1) formation is rate-determining. Eqs. (2) and (4) are employed for reactions where there is evidence for bimolecular associative (SN2 or E2) mechanisms or addition-elimination (A-E) processes. In substrates undergoing similar mechanisms, the resultant l/m ratios obtained can be important indicators to compensate for earlier and later transition-states (TS). Furthermore, l/m ratios between 0.5 and 1.0 are indicative of unimolecular processes (SN1 or E1), values ≥ 2.0 are typical in bimolecular processes (SN2, E2, or A-E mechanisms) and values ≪ 0.5 imply that ionization-fragmentation is occurring [3].

To study the (solvent) nucleophilic attack at an sp2 carbonyl carbon, we completed detailed Grunwald-Winstein (Eqs. (1), (2) and (4)) analyses for phenyl chloroformate (PhOCOCl) at 25.0°C in 49 solvents with widely varying N and Y values [3, 4]. Using Eq. (1), we obtained an m value of −0.07 ± 0.11, c = −0.46 ± 0.31, a very poor correlation coefficient (R = 0.093) and an extremely low F-test value of 0.4. An analysis of Eq. (2) resulted in a very robust correlation, with R = 0.980, F-test = 568, l = 1.66 ± 0.05, m = 0.56 ± 0.03 and c = 0.15 ± 0.07. Using Eq. (4), we obtained l = 1.77 ± 0.08, m = 0.61 ± 0.04, h = 0.35 ± 0.19 (P-value = 0.07), c = 0.16 ± 0.06, R = 0.982 and the F-test value was 400.

Since the use of Eq. (2) provided superior statistically significant results (R, F-test and P-values) for PhOCOCl, we strongly recommended that in substrates where nucleophilic attack occurs at an sp2 hybridized carbonyl carbon, the PhOCOCl l/m ratio of 2.96 should be used as a guiding indicator for determining the presence of an addition-elimination (A-E) process [3, 4]. Furthermore, for n-octyl fluoroformate (OctOCOF) and n-octyl chloroformate (OctOCOCl), we found that the leaving-group ratio (kF/kCl) was close to, or above, unity. Fluorine is a very poor leaving-group when compared to chlorine; hence, for carbonyl group containing molecules, we proposed the existence of a bimolecular tetrahedral transition-state (TS) with a rate-determining addition step within an A-E pathway (as opposed to a bimolecular concerted associative SN2 process with a penta-coordinate TS).

For chemoselectivity, the sp2 hybridized benzoyl groups (PhCO─) are found to be efficient and practical protecting agents that are utilized during the synthesis of nucleoside, nucleotide and oligonucleotide analogue derivative compounds. Yields for regio- and stereoselective reactions are shown to depend on the preference of the leaving group and commercially, benzoyl fluoride (PhCOF), benzoyl chloride (PhCOCl) and benzoyl cyanide (PhCOCN) are cheap and readily available.

We experimentally measured the solvolytic rates for PhCOF at 25.0°C [5]. In 37 solvent systems, a two-term Grunwald-Winstein (Eq. (2)) application resulted in an l value of 1.58 ± 0.09, an m value of 0.82 ± 0.05, a c value of −0.09, R = 0.953 and an F-test value of 186. The l/m ratio of 1.93 for PhCOF is close to the OctOCOF l/m ratio of 2.28 (in 28 pure and binary mixtures), indicating similar A-E transition states with rate-determining addition.

On the other hand, for PhCOCl at 25.0°C, we used the available literature data (47 solvents) from various international groups and proved the presence of simultaneous competing dual side-by-side mechanisms [6]. For 32 of the more ionizing solvents, we obtained l = 0.47 ± 0.03, m = 0.79 ± 0.02, c = −0.49 ± 0.17, R = 0.990 and F-test = 680. The l/m ratio is 0.59. Hence, we proposed an SN1 process with significant solvation (l component) of the developing aryl acylium ion. In 12 of the more nucleophilic solvents, we obtained l = 1.27 ± 0.29, m = 0.46 ± 0.07, c = 0.18 ± 0.23, R = 0.917 and F-test = 24. The l/m ratio of 2.76 is close to the 2.96 value obtained for PhOCOCl. This suggests that the A-E pathway is prevalent. In addition, there were three solvents where there was no clear demarcation of the changeover region.

At 25.0°C, in solvents that are common to PhCOCl and PhCOF, we observed kPhCOCl > kPhCOF. This rate trend is primarily due to more efficient PhCOF ground-state stabilization.

Lee and co-workers followed the kinetics of benzoyl cyanide (PhCOCN) at 1, 5, 10, 15 and 20°C in a variety of pure and mixed solvents and proposed the presence of an associative SN2 (penta-coordinate TS) process [7]. PhCOCN is an ecologically important chemical defensive secretion of polydesmoid millipedes and cyanide is a synthetically useful, highly active leaving group. Since the leaving group is involved in the rate-determining step of any SN2 process, we became skeptical of the associative SN2 proposal and decided to reinvestigate the PhCOCN analysis. We hypothesized that since PhCOCl showed mechanism duality, analogous dual mechanisms should operate during PhCOCN solvolyses.

Using the Lee data within Arrhenius plots (Eq. (5)), we determined the PhCOCN solvolytic rates at 25°C (Table 1). We obtained the rates for PhCOCN in 39 pure and mixed aqueous organic solvents of ethanol (EtOH), methanol (MeOH), acetone (Me2CO), dioxane, 2,2,2-trifluoroethanol (TFE) and in TFE-EtOH (T-E) mixtures.

$$\ln(k) = -\frac{E_a}{RT} + \ln(A) \tag{5}$$

| Solvent (v/v) | 10⁵ k/s⁻¹ | NT | YCl | I | Solvent (v/v) | 10⁵ k/s⁻¹ | NT | YCl | I |
|---|---|---|---|---|---|---|---|---|---|
| 90% EtOH¹ | 139.9 | 0.16 | −0.94 | 0.10 | 30% acetone¹ | 2447 | −0.96 | 3.21 | −0.38 |
| 80% EtOH¹ | 210.0 | 0.00 | 0.00 | 0.00 | 20% acetone¹ | 3726 | −1.11 | 3.77 | −0.40 |
| 70% EtOH¹ | 322.8 | −0.20 | 0.78 | −0.06 | 10% acetone¹ | 4071 | −1.23 | 4.28 | −0.43⁴ |
| 60% EtOH¹ | 598.8 | −0.38 | 1.38 | −0.15 | 30% dioxane¹ | 1690 | −0.98 | 2.97 | −0.29⁵ |
| 50% EtOH¹ | 986.7 | −0.58 | 2.02 | −0.23 | 20% dioxane¹ | 2887 | −1.12 | 3.71 | −0.25 |
| 40% EtOH¹ | 1761 | −0.74 | 2.75 | −0.24 | 10% dioxane¹ | 4196 | −1.25 | 4.23 | −0.34⁵ |
| 30% EtOH¹ | 3064 | −0.93 | 3.53 | −0.30 | 10T−90E² | 37.20 | 0.27⁴ | −1.99⁴ | 0.26⁴ |
| 20% EtOH¹ | 3732 | −1.16 | 4.09 | −0.33 | 20T−80E² | 41.64 | 0.08 | −1.42 | 0.31 |
| 90% MeOH¹ | 390.5 | −0.01 | −0.18 | 0.28 | 30T−70E² | 35.50 | −0.11 | −0.95 | 0.38 |
| 80% MeOH¹ | 575.3 | −0.06 | 0.67 | 0.14 | 40T−60E² | 32.98 | −0.34 | −0.48 | 0.43 |
| 70% MeOH¹ | 800.2 | −0.40 | 1.46 | 0.04 | 50T−50E² | 30.30 | −0.64 | 0.16 | 0.51 |
| 60% MeOH¹ | 1616 | −0.54 | 2.07 | −0.19 | 70T−30E² | 27.70 | −1.34 | 1.24 | 0.65⁴ |
| 50% MeOH¹ | 2573 | −0.75 | 2.70 | −0.05 | 76.3 TFE³ | 639.9 | −2.19⁴ | 2.84 | 0.28 |
| 40% MeOH² | 4205 | −0.87 | 3.25 | −0.13 | 67.4 TFE³ | 886.1 | −1.88⁴ | 2.93⁴ | 0.22⁴ |
| 30% MeOH² | 5351 | −1.06 | 3.73 | −0.22 | 57.9 TFE³ | 1075 | −1.78⁴ | 3.05⁴ | 0.14⁴ |
| 80% acetone¹ | 9.547 | −0.37 | −0.83 | −0.23 | 47.9 TFE³ | 1512 | −1.33⁴ | 3.21⁴ | 0.06⁴ |
| 70% acetone¹ | 86.02 | −0.42 | 0.17 | −0.29 | 37.1 TFE³ | 2089 | −1.19⁴ | 3.44⁴ | −0.03⁴ |
| 60% acetone¹ | 157.1 | −0.52 | 0.95 | −0.28 | 25.6 TFE³ | 2944 | −1.15⁴ | 3.73⁴ | −0.15⁴ |
| 50% acetone¹ | 505.9 | −0.70 | 1.73 | −0.32 | 13.3 TFE³ | 3870 | −1.23⁴ | 4.10⁴ | −0.29⁴ |
| 40% acetone¹ | 1149 | −0.83 | 2.46 | −0.35 | | | | | |

### Table 1.

The 25.0°C calculated rates for PhCOCN, the NT, YCl and I values.

¹ Calculated using four data points in an Arrhenius plot.

² Calculated using three data points in an Arrhenius plot.

³ Calculated using three data points in an Arrhenius plot and are w/w compositions.

⁴ Determined using a second-degree polynomial equation.

⁵ Determined using a third-degree polynomial equation.

For all of the Arrhenius plots, the R² values ranged from 0.9937 to 1.0000, except in 60% Me2CO, where R² was 0.9861. The Arrhenius plot for 80% EtOH is shown in Figure 1. In order to utilize Eqs. (1)–(4) for all 39 solvents, second-degree or third-degree polynomial equations were used to calculate the missing NT, YCl and I values. The calculated 25°C PhCOCN reaction rates and the literature available or interpolated NT, YCl and I values are listed in Table 1.
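The Arrhenius extrapolation described here can be sketched in Python as a straight-line fit of ln k against 1/T, followed by evaluation at 25°C. This is only a minimal illustration: the function name `extrapolate_rate` is ours and the four rate constants are hypothetical placeholders, not the Lee data.

```python
import numpy as np

R_GAS = 8.314462618  # gas constant, J mol^-1 K^-1

def extrapolate_rate(temps_c, rates, target_c=25.0):
    """Fit ln k = -Ea/(R T) + ln A and return (k at target_c, Ea, R^2)."""
    T = np.asarray(temps_c, dtype=float) + 273.15      # Celsius -> Kelvin
    ln_k = np.log(np.asarray(rates, dtype=float))
    # Straight line in 1/T: slope = -Ea/R, intercept = ln A.
    slope, intercept = np.polyfit(1.0 / T, ln_k, 1)
    fitted = slope / T + intercept
    ss_res = ((ln_k - fitted) ** 2).sum()
    ss_tot = ((ln_k - ln_k.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    k_target = np.exp(slope / (target_c + 273.15) + intercept)
    return float(k_target), float(-slope * R_GAS), float(r2)

# HYPOTHETICAL rate constants (s^-1) at four temperatures, for illustration only.
temps = [5.0, 10.0, 15.0, 20.0]
rates = [4.1e-4, 6.3e-4, 9.5e-4, 1.41e-3]

k25, Ea, r2 = extrapolate_rate(temps, rates)
print(f"k(25 C) = {k25:.3e} s^-1, Ea = {Ea / 1000:.1f} kJ/mol, R^2 = {r2:.4f}")
```

With only three or four temperatures per solvent, the R² of each plot is worth checking before the extrapolated rate is used, as was done for the values in Table 1.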

Using Eq. (2) for 32 of the PhCOCN solvents in Table 1 (20–90% EtOH, 30–90% MeOH, 20–80% Me2CO, 10–30% dioxane, 10T–90E, 20T–80E, 30T–70E, 40T–60E, 50T–50E and 70T–30E), we obtained R= 0.988, F-test = 595, l= 1.54 ± 0.11, m= 0.74 ± 0.03 and c= 0.13 ± 0.04. Using Eq. (4), we obtained R= 0.989, F-test = 432, l= 1.62 ± 0.11, m= 0.78 ± 0.03, h= 0.22 ± 0.11 (P-value = 0.07) and c= 0.13 ± 0.04.
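Because Eq. (2) is an ordinary multiple linear regression, the correlation can be sketched in Python with NumPy. The function name `fit_grunwald_winstein` is ours, and the ten solvents below are an arbitrary subset of Table 1 chosen only to keep the example short, so the coefficients it produces will not match those reported for the full solvent set. The F statistic is assumed to be the usual overall regression F for a model with an intercept.

```python
import numpy as np

def fit_grunwald_winstein(N, Y, log_ratio):
    """Least-squares fit of log(k/ko) = l*N + m*Y + c (Eq. (2))."""
    N = np.asarray(N, dtype=float)
    Y = np.asarray(Y, dtype=float)
    y = np.asarray(log_ratio, dtype=float)
    # Design matrix: one column per predictor plus a column of ones for c.
    X = np.column_stack([N, Y, np.ones_like(N)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    l, m, c = (float(v) for v in coef)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), 2                      # p = number of predictors (N and Y)
    f_stat = (r2 / p) / ((1.0 - r2) / (n - p - 1))
    return {"l": l, "m": m, "c": c, "R": r2 ** 0.5, "F": f_stat, "l/m": l / m}

# Illustrative subset of Table 1 (10^5 k in s^-1, NT, YCl); not the full 32-solvent set.
k = [139.9, 210.0, 986.7, 3732, 390.5, 2573, 9.547, 1149, 41.64, 30.30]
NT = [0.16, 0.00, -0.58, -1.16, -0.01, -0.75, -0.37, -0.83, 0.08, -0.64]
YCl = [-0.94, 0.00, 2.02, 4.09, -0.18, 2.70, -0.83, 2.46, -1.42, 0.16]
k_o = 210.0                               # 80% EtOH reference rate

fit = fit_grunwald_winstein(NT, YCl, np.log10(np.array(k) / k_o))
print({name: round(value, 3) for name, value in fit.items()})
```

Choosing solvents that span wide and independent ranges of N and Y matters here for the reason given earlier: if the two columns of the design matrix are nearly collinear, the individual l and m estimates become unstable even when R is high.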

The l/m ratio of 2.08 obtained (for PhCOCN) using Eq. (2) is close to that obtained (1.93) for PhCOF and hence we propose a parallel A-E mechanism.

For the seven highly ionizing aqueous TFE mixtures, using Eq. (1) we obtained R = 0.977, F-test = 105, m = 0.61 ± 0.06 and c = −1.15 ± 0.20. Using Eq. (2), we obtained R = 0.999, F-test = 763, l = 0.25 ± 0.031, m = 0.42 ± 0.03 and c = −0.13 ± 0.14. Using Eq. (3), we obtained R = 0.998, F-test = 417, m = −0.65 ± 0.22 (P-value = 0.04), h = −2.83 ± 0.491 (P-value = 0.01) and c = 3.12 ± 0.73 (P-value = 0.01). Using Eq. (4), we obtained R = 0.989, F-test = 572, l = 0.17 ± 0.07 (P-value = 0.11), m = 0.02 ± 0.33 (P-value = 0.96), h = −1.04 ± 0.86 (P-value = 0.31) and c = 1.10 ± 1.02 (P-value = 0.36).

In the very polar TFE mixtures, the l/m ratio from Eq. (2) was 0.60, indicating a dissociative SN1 process. The l value of 0.25 is consistent with the need for small preferential solvation to stabilize the developing SN1 carbocation and the lower m value (0.42) attained can be rationalized in terms of less demand for solvation of the cyanide anion (leaving group).

In all of the common solvents at 25.0°C, kPhCOCl > kPhCOCN > kPhCOF. In addition, PhCOCN was found to be faster than PhCOF by a factor of 18–71 times in the aqueous ethanol, methanol, acetone and dioxane mixtures and 185–1100 times faster in the TFE-EtOH and TFE-H2O mixtures. These observations are very reasonable as the cyanide group is shown to have a greater inductive effect and in addition, the cyanide anion is a weak conjugate base. This rationalization is logical as (l/m)PhCOCN > (l/m)PhCOF.

## 3. Estimating missing values from a time series data set

Complete historical data time series are needed to create effective mathematical models. Unfortunately, systems that track and record the data values periodically malfunction thereby creating missing and/or inaccurate values in the time series. If a reasonable estimate for the missing value can be determined, the data series can then be used for future analysis.

In this section, we present a methodology to generate a reasonable estimate for missing or inaccurate values when two important conditions exist: (1) a similar data series with complete information is available and (2) a pattern (or trend) is observable.

The extent of the ice at the northern polar ice cap in square kilometers is tracked on a daily basis and this data is made available to researchers by the National Snow & Ice Data Center (NSIDC). A review of the NASA Distributed Active Archive Center (DAAC) data at NSIDC indicates that the extent of the northern polar ice cap follows a cyclical pattern throughout the year. The extent increases until it reaches a maximum for the year in mid-March and decreases until it reaches a minimum for the year in mid-September. Unfortunately, the data set contains missing data for some of the days.

The extent of the northern polar ice cap in the month of January for 2011, 2012 and 2013 is utilized as an example. Complete daily data for January in 2011 and 2012 is available. The 2013 January data has a missing data value for January 25, 2013.

Figure 2 presents the line graph of the daily ice extent for January of 2011, 2012 and 2013. A complete time series is available for 2011 and 2012, so the first condition is met. The line graphs also indicate that the extent of the polar ice caps is increasing in January, so the second condition is met. An interpolating polynomial will be introduced and used to estimate the missing value for the extent of the polar ice cap on January 25, 2013.

Let $t$ = the time period or observation number in a time series.

Let $f(t)$ = the extent of the sea ice for time period $t$.

The extent of the sea ice can be written as a function of time.

For a polynomial of degree 1, the function will be: $f(t) = a_0 + a_1 t$

For a polynomial of degree 3, the function will be: $f(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3$

Polynomials of higher degrees could also be used. The extent of the polar ice for January 25 will be removed from the data series for 2011 and 2012 and an estimate will be prepared using polynomials of degree 1. Another estimate is prepared using polynomials of degree 3. The estimated value will be compared to the actual value for the years 2011 and 2012. The degree of the polynomial that generates the best (closest) estimate for January 25 will be the degree of the polynomial used to generate the estimate for January 25, 2013.

A two-equation, two-unknown system of equations is created when using polynomials of degree 1. One known value before and after the missing value for each year is used to set up the system of equations. To simplify the calculations, January 24 is recorded as time period 1, January 25 is recorded as time period 2 and January 26 is recorded as time period 3. The time period and extent of the sea ice for each year was recorded in Excel.

| Time period | 2011 | 2012 | 2013 |
|---|---|---|---|
| 1 | 12,878,750 | 13,110,000 | 13,077,813 |
| 2 | 12,916,563 | 13,123,125 | |
| 3 | 12,996,875 | 13,204,219 | 13,404,688 |

The system of equations using a first-order polynomial for January 2011 is:

$$\begin{aligned} a_0 + a_1(1) &= 12{,}878{,}750 \\ a_0 + a_1(3) &= 12{,}996{,}875 \end{aligned} \tag{6}$$

The coefficients $a_i$ can be found by solving the system of equations. Substitution, elimination, or matrices can be used to solve the system of equations. A TI-84 graphing calculator and matrices were used to solve this system.

The solution to this system of equations is: $a_0 = 12{,}819{,}687.5$, $a_1 = 59{,}062.5$

The estimate for January 25, 2011 is: $12{,}819{,}687.5 + 59{,}062.5(2) = 12{,}937{,}812.5$ km².
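The same two-equation system can also be solved numerically. The sketch below uses NumPy's linear solver in place of the TI-84 matrix routine; the variable names are ours and the data are the January 2011 values from the table above.

```python
import numpy as np

# Known points on either side of the gap (time period, extent in km^2), January 2011.
t_known = np.array([1.0, 3.0])
extent_known = np.array([12_878_750.0, 12_996_875.0])

# Each row of the coefficient matrix is [1, t], matching a0 + a1*t.
A = np.column_stack([np.ones_like(t_known), t_known])
a0, a1 = np.linalg.solve(A, extent_known)

# Evaluate the line at the missing time period (t = 2, January 25).
estimate_jan25 = a0 + a1 * 2.0
print(f"a0 = {a0:,.1f}, a1 = {a1:,.1f}, estimate = {estimate_jan25:,.1f} km^2")
```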

The system of equations using a first-order polynomial for 2012 is:

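$$\begin{aligned} a_0 + a_1(1) &= 13{,}110{,}000 \\ a_0 + a_1(3) &= 13{,}204{,}219 \end{aligned} \tag{7}$$

The solution to this system of equations is: $a_0 = 13{,}062{,}890.5$, $a_1 = 47{,}109.5$

The estimate for January 25, 2012 is: $13{,}062{,}890.5 + 47{,}109.5(2) = 13{,}157{,}109.5$ km².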

The absolute values of the deviations (actual and estimated values) were calculated in Excel.

| Degree | Year | Actual | Estimated | Absolute deviation |
|---|---|---|---|---|
| 1 | 2011 | 12,916,563 | 12,937,812.5 | 21,249.5 |
| 1 | 2012 | 13,123,125 | 13,157,109.5 | 33,984.5 |

A four-equation, four-unknown system of equations is created when using polynomials of degree 3. Two known values before and after the missing value are used to set up the system of equations. To simplify the calculations, January 23 is recorded as time period 1, January 24 is recorded as time period 2, January 25 is recorded as time period 3, January 26 is recorded as time period 4 and January 27 is recorded as time period 5. The time period and extent of the sea ice for each year was recorded in Excel.

| Time period | 2011 | 2012 | 2013 |
|---|---|---|---|
| 1 | 12,848,281 | 13,199,375 | 13,168,594 |
| 2 | 12,878,750 | 13,110,000 | 13,077,813 |
| 3 | 12,916,563 | 13,123,125 | |
| 4 | 12,996,875 | 13,204,219 | 13,404,688 |
| 5 | 13,090,625 | 13,227,344 | 13,388,750 |

The system of equations using a third-order polynomial for 2011 is:

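$$\begin{aligned} a_0 + a_1(1) + a_2(1)^2 + a_3(1)^3 &= 12{,}848{,}281 \\ a_0 + a_1(2) + a_2(2)^2 + a_3(2)^3 &= 12{,}878{,}750 \\ a_0 + a_1(4) + a_2(4)^2 + a_3(4)^3 &= 12{,}996{,}875 \\ a_0 + a_1(5) + a_2(5)^2 + a_3(5)^3 &= 13{,}090{,}625 \end{aligned} \tag{8}$$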

The solution to this system of equations is: $a_0 = 12{,}832{,}811.67$, $a_1 = 8{,}985.17$, $a_2 = 5{,}976.33$, $a_3 = 507.83$

The estimate for January 25, 2011 is: $12{,}832{,}811.67 + 8{,}985.17(3) + 5{,}976.33(3)^2 + 507.83(3)^3 \approx 12{,}927{,}265.6$ km².

The system of equations using a third-order polynomial for 2012 is:

$$\begin{aligned} a_0 + a_1(1) + a_2(1)^2 + a_3(1)^3 &= 13{,}199{,}375 \\ a_0 + a_1(2) + a_2(2)^2 + a_3(2)^3 &= 13{,}110{,}000 \\ a_0 + a_1(4) + a_2(4)^2 + a_3(4)^3 &= 13{,}204{,}219 \\ a_0 + a_1(5) + a_2(5)^2 + a_3(5)^3 &= 13{,}227{,}344 \end{aligned} \tag{9}$$

The solution to this system of equations is: $a_0 = 13{,}486{,}719$, $a_1 = -413{,}073.33$, $a_2 = 139{,}101.75$, $a_3 = -13{,}372.42$

The estimate for January 25, 2012 is: $13{,}486{,}719 - 413{,}073.33(3) + 139{,}101.75(3)^2 - 13{,}372.42(3)^3 = 13{,}138{,}359.42$ km².

The absolute values of the deviations (actual and estimated values) were calculated in Excel.

| Degree | Year | Actual | Estimated | Absolute deviation |
|---|---|---|---|---|
| 3 | 2011 | 12,916,563 | 12,927,265.6 | 10,702.6 |
| 3 | 2012 | 13,123,125 | 13,138,359.4 | 15,234.4 |

The mean of the absolute deviations for polynomials of degree 1 and the mean of the absolute deviations for polynomials of degree 3 were calculated in Excel. The polynomial of degree 3 provided the smaller mean absolute deviation.

| Degree | Mean absolute deviation |
|---|---|
| 1 | 27,617.00 |
| 3 | 12,968.50 |

Therefore, a third-order polynomial will be used to generate an estimate for the sea ice extent on January 25, 2013.
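The degree-selection procedure (hide the known January 25 value, interpolate it with each candidate degree, and keep the degree with the smallest mean absolute deviation) can be sketched as follows. The helper names are ours, and the time periods use the five-day numbering (January 23 to 27) for both degrees; this does not affect the degree-1 estimate, which still uses January 24 and 26.

```python
import numpy as np

# Daily extents (km^2) for time periods 1..5 (January 23-27), from the tables above.
DATA = {
    2011: [12_848_281, 12_878_750, 12_916_563, 12_996_875, 13_090_625],
    2012: [13_199_375, 13_110_000, 13_123_125, 13_204_219, 13_227_344],
}
GAP = 3  # time period of January 25 within the five-day window

def interpolate_gap(extents, degree):
    """Estimate the value at GAP from the (degree + 1) nearest known neighbours."""
    half = (degree + 1) // 2
    t = [p for p in range(GAP - half, GAP + half + 1) if p != GAP]
    y = [extents[p - 1] for p in t]
    # Vandermonde rows are [1, t, t^2, ...], matching a0 + a1*t + a2*t^2 + ...
    A = np.vander(np.array(t, dtype=float), degree + 1, increasing=True)
    coeffs = np.linalg.solve(A, np.array(y, dtype=float))
    return float(np.polyval(coeffs[::-1], GAP))  # polyval wants highest power first

def mean_abs_deviation(degree):
    """Average |estimate - actual| at the hidden point over the complete years."""
    devs = [abs(interpolate_gap(v, degree) - v[GAP - 1]) for v in DATA.values()]
    return sum(devs) / len(devs)

mad = {degree: mean_abs_deviation(degree) for degree in (1, 3)}
best_degree = min(mad, key=mad.get)
print(mad, "-> use degree", best_degree)
```

Higher degrees could be added to the tuple `(1, 3)` in the same way, provided enough complete neighbouring days are available on each side of the gap.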

The system of equations using a third-order polynomial for 2013 is:

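$$\begin{aligned} a_0 + a_1(1) + a_2(1)^2 + a_3(1)^3 &= 13{,}168{,}594 \\ a_0 + a_1(2) + a_2(2)^2 + a_3(2)^3 &= 13{,}077{,}813 \\ a_0 + a_1(4) + a_2(4)^2 + a_3(4)^3 &= 13{,}404{,}688 \\ a_0 + a_1(5) + a_2(5)^2 + a_3(5)^3 &= 13{,}388{,}750 \end{aligned} \tag{10}$$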

The solution to this system of equations is: $a_0 = 13{,}717{,}916.67$, $a_1 = -850{,}859.17$, $a_2 = 337{,}669.33$, $a_3 = -36{,}132.83$.

The estimate for January 25, 2013 is: $13{,}717{,}916.67 - 850{,}859.17(3) + 337{,}669.33(3)^2 - 36{,}132.83(3)^3 = 13{,}228{,}776.72$ km². Figure 3 shows the extent of the sea ice in January 2013 with the estimate for January 25.
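As a cross-check on the 2013 estimate, the interpolating cubic can be evaluated without solving for the coefficients at all by using Lagrange weights. This is a different but mathematically equivalent route to the same polynomial, not the matrix method used above; the function name `lagrange_weights` is ours.

```python
from fractions import Fraction

def lagrange_weights(nodes, target):
    """Weights w_i with p(target) = sum(w_i * y_i) for the interpolating polynomial."""
    weights = []
    for i, ti in enumerate(nodes):
        w = Fraction(1)
        for j, tj in enumerate(nodes):
            if j != i:
                w *= Fraction(target - tj, ti - tj)
        weights.append(w)
    return weights

nodes = [1, 2, 4, 5]  # January 23, 24, 26 and 27
extent_2013 = [13_168_594, 13_077_813, 13_404_688, 13_388_750]  # km^2

weights = lagrange_weights(nodes, 3)  # target: January 25 (time period 3)
estimate_2013 = float(sum(w * y for w, y in zip(weights, extent_2013)))
print([str(w) for w in weights], round(estimate_2013, 2))
```

For this symmetric layout the weights depend only on the spacing of the days, so the same four weights apply to any year. The small difference from the value computed with the listed coefficients comes from rounding those coefficients to two decimals.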

## 4. Statistical methodologies and applications in the Ebola war

In 2014, an unprecedented outbreak of Ebola occurred predominantly in West Africa. According to the Centers for Disease Control and Prevention (CDC), more than 28,500 cases were reported, resulting in more than 11,000 deaths [8]. The countries that were affected by the Ebola outbreak were Senegal, Guinea, Nigeria, Mali, Sierra Leone, Liberia, Spain and the United States of America (USA). Statistics through dynamic modeling played a crucial role in clinical data collection and management. The lessons learned and the resultant statistical advances continue to inform responses to current and subsequent pandemics.

For this honors thesis project, we tracked and gathered Ebola data over an extended period of time from the CDC, World Health Organization (WHO) and the news media [8, 9]. We used statistical curve fitting that involved both exponential and polynomial functions as well as model validation using nonlinear regression and R² statistical analysis.

The first WHO report (initial announcement) of the West Africa Ebola outbreak was made during the week of March 23rd, 2014. Consequently, the data for this project run from that week to October 31, 2014. The 2014 Ebola data was used to create epidemiological models to predict the possible pathway of a 2014 West Africa type of Ebola outbreak. The WHO number of Ebola cases and death toll as of October 31st, 2014 were Liberia (6635 cases with 2413 deaths), Sierra Leone (5338 cases with 1510 deaths), Guinea (1667 cases with 1018 deaths), Nigeria (20 cases with eight deaths), the United States (four cases with one death), Mali (one case with one death) and Spain (one case with no deaths).

Microsoft Excel was used to model the three examples shown. The models were predicated upon the following assumptions: (1) Week 1 is the week of March 23rd, 2014; (2) X is the number of weeks starting from Week 1 and Y is the number of Ebola deaths; (3) there was no vaccine/cure; and (4) the missing data for the 24th week was obtained by interpolation.

### 4.1. Modeling of weekly Guinea Ebola deaths

The dotted curve in Figure 4 shows the actual observed deaths while the solid line shows the number of deaths as determined by the fitted model. As shown in Figure 4, the growth of the Guinea deaths is exponential. The best-fit curve for the projected growth is $y = 72.827e^{0.0823x}$. A comparison of the actual data to the projected data shows that the two are similar but not exact (Table 2). The projected number of deaths is approximately 1300 by week 35 (or the week of November 23, 2014).
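An exponential trend of this form can be fitted by ordinary least squares on the logarithm of the deaths, which is our understanding of how a spreadsheet exponential trendline is typically computed. The sketch below applies that approach to the observed Guinea values for weeks 1 to 30 in Table 2; it is an illustration in Python, not the Excel workbook used in the project.

```python
import numpy as np

# Observed Guinea deaths for weeks 1-30, as listed in Table 2.
deaths = [29, 70, 95, 108, 136, 143, 155, 157, 174, 193,
          215, 226, 264, 267, 303, 307, 304, 314, 339, 363,
          377, 396, 406, 450, 494, 494, 648, 739, 862, 904]
weeks = np.arange(1, len(deaths) + 1)

# Fit ln(y) = ln(a) + b*x by least squares, then back-transform the intercept.
b, ln_a = np.polyfit(weeks, np.log(deaths), 1)
a = float(np.exp(ln_a))
b = float(b)

def model(x):
    """Fitted exponential model y = a * exp(b * x)."""
    return a * np.exp(b * x)

print(f"y = {a:.3f} * exp({b:.4f} x); week 35 projection = {model(35):.0f}")
```

Because the fit is done on ln y, it minimizes relative rather than absolute errors, so early weeks with few deaths influence the curve as much as later weeks with many.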

### 4.2. Modeling of Liberia Ebola deaths (weekly)

Unlike the Guinea deaths, the Liberian deaths are modeled using a polynomial function (Figure 5).

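Ebola deaths in Guinea

| Week | Deaths | Model | Week | Deaths | Model | Week | Deaths | Model |
|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 79 | 13 | 264 | 212 | 25 | 494 | 570 |
| 2 | 70 | 86 | 14 | 267 | 231 | 26 | 494 | 619 |
| 3 | 95 | 93 | 15 | 303 | 250 | 27 | 648 | 672 |
| 4 | 108 | 101 | 16 | 307 | 272 | 28 | 739 | 730 |
| 5 | 136 | 110 | 17 | 304 | 295 | 29 | 862 | 792 |
| 6 | 143 | 119 | 18 | 314 | 320 | 30 | 904 | 860 |
| 7 | 155 | 130 | 19 | 339 | 348 | 31 | XXX | 934 |
| 8 | 157 | 141 | 20 | 363 | 378 | 32 | XXX | 1014 |
| 9 | 174 | 153 | 21 | 377 | 410 | 33 | XXX | 1101 |
| 10 | 193 | 166 | 22 | 396 | 445 | 34 | XXX | 1195 |
| 11 | 215 | 180 | 23 | 406 | 483 | 35 | XXX | 1298 |
| 12 | 226 | 196 | 24 | 450 | 525 | 36 | XXX | XXX |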

### Table 2.

Actual and projected Ebola deaths in Guinea.

The best-fit curve is given by the polynomial equation $y = 0.0003x^5 - 0.0069x^4 + 0.0347x^3 + 0.5074x^2 - 4.1442x + 10.487$. The model is not exact, but it is close enough to predict that by week 35 there would be over 7000 deaths in Liberia (Table 3).

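Ebola deaths in Liberia

| Week | Deaths | Model | Week | Deaths | Model | Week | Deaths | Model |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 7 | 13 | 24 | 33 | 25 | 871 | 1001 |
| 2 | 0 | 4 | 14 | 25 | 43 | 26 | 670 | 1267 |
| 3 | 10 | 3 | 15 | 65 | 58 | 27 | 1830 | 1589 |
| 4 | 13 | 3 | 16 | 84 | 79 | 28 | 2069 | 1976 |
| 5 | 6 | 3 | 17 | 105 | 107 | 29 | 2484 | 2436 |
| 6 | 6 | 5 | 18 | 127 | 145 | 30 | 2705 | 2981 |
| 7 | 11 | 7 | 19 | 156 | 197 | 31 | XXX | 3620 |
| 8 | 11 | 9 | 20 | 282 | 264 | 32 | XXX | 4366 |
| 9 | 11 | 12 | 21 | 355 | 352 | 33 | XXX | 5231 |
| 10 | 11 | 15 | 22 | 576 | 464 | 34 | XXX | 6230 |
| 11 | 11 | 20 | 23 | 624 | 606 | 35 | XXX | 7377 |
| 12 | 11 | 25 | 24 | 748 | 783 | 36 | XXX | XXX |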

### Table 3.

Actual and projected Ebola deaths in Liberia.

### 4.3. Modeling of total deaths (World)

When analyzing the total Ebola deaths (for 35 weeks), the data was best modeled using the polynomial function $y = 0.033x^4 - 1.4617x^3 + 23.437x^2 - 118.18x + 231.59$ (Figure 6). An exponential function was not suitable because the actual growth was not (initially) fast enough to match exponential growth. As shown in Table 4, the projected total deaths according to this model would be greater than 11,000 by week 35.

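Total Ebola deaths in the world

| Week | Deaths | Model | Week | Deaths | Model | Week | Deaths | Model |
|---|---|---|---|---|---|---|---|---|
| 1 | 29 | 135 | 13 | 337 | 387 | 25 | 1848 | 1977 |
| 2 | 70 | 78 | 14 | 350 | 428 | 26 | 1647 | 2392 |
| 3 | 105 | 51 | 15 | 467 | 470 | 27 | 3091 | 2893 |
| 4 | 121 | 49 | 16 | 518 | 516 | 28 | 3439 | 3494 |
| 5 | 142 | 65 | 17 | 603 | 571 | 29 | 4555 | 4206 |
| 6 | 149 | 93 | 18 | 660 | 638 | 30 | 4877 | 5044 |
| 7 | 166 | 131 | 19 | 729 | 722 | 31 | XXX | 6022 |
| 8 | 168 | 173 | 20 | 932 | 829 | 32 | XXX | 7155 |
| 9 | 185 | 217 | 21 | 1069 | 967 | 33 | XXX | 8461 |
| 10 | 210 | 262 | 22 | 1350 | 1141 | 34 | XXX | 9955 |
| 11 | 232 | 305 | 23 | 1427 | 1362 | 35 | XXX | 11656 |
| 12 | 244 | 347 | 24 | 1638 | 1637 | 36 | XXX | XXX |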

### Table 4.

Actual and projected worldwide deaths.
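The week-35 projections quoted for the three models can be checked by evaluating the fitted equations directly. The short Python sketch below uses the coefficients exactly as printed in the text; the dictionary and function names are ours.

```python
import math
import numpy as np

def guinea(x):
    """Exponential model for Guinea: y = 72.827 * exp(0.0823 * x)."""
    return 72.827 * math.exp(0.0823 * x)

# Polynomial coefficients listed from the highest power down, as np.polyval expects.
LIBERIA = [0.0003, -0.0069, 0.0347, 0.5074, -4.1442, 10.487]
WORLD = [0.033, -1.4617, 23.437, -118.18, 231.59]

week = 35
projections = {
    "Guinea": guinea(week),
    "Liberia": float(np.polyval(LIBERIA, week)),
    "World": float(np.polyval(WORLD, week)),
}
for region, value in projections.items():
    print(f"{region}: about {value:.0f} deaths projected by week {week}")
```

Since the printed coefficients are rounded, evaluations at large x are sensitive to the last digit shown, particularly for the fifth-degree Liberia model.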

### 4.4. Nonlinear regression and R-squared analysis

A visual inspection of the graphs and tables suggests that the model for Liberia and the model for the worldwide total deaths fit the data more closely than does the Guinea model. Hence, statistical goodness-of-fit tests are used to confirm these observations. Here, nonlinear polynomial regression (Eq. (11)) and R² statistical analysis are employed. In Eq. (11), Σ signifies summation, w refers to the actual (observed) number of Ebola deaths, w̄ is the mean of the observed values, z is the number of Ebola deaths as calculated with the model and n is the total number of weeks.

$$R^2 = \frac{2\sum(wz) + n(\bar{w})^2 - \sum(z^2) - 2\bar{w}\sum(w)}{\sum(w^2) + n(\bar{w})^2 - 2\bar{w}\sum(w)} \tag{11}$$

For the Guinea epidemiological Ebola model, the nonlinear regression equation is $y = 72.827e^{0.0823x}$ with R² = 0.9077, indicating that about 91% of the total variation in y (the number of actual Ebola deaths) can be explained by the regression equation. The polynomial epidemiological model for Ebola deaths in Liberia, $y = 0.0003x^5 - 0.0069x^4 + 0.0347x^3 + 0.5074x^2 - 4.1442x + 10.487$, has R² = 0.9715, so that about 97% of the total variation in y (the number of observed Ebola deaths) can be explained by the regression equation. For the third, worldwide model, the polynomial for the total Ebola deaths for all countries combined fits better still. Here, R² is 0.9823, so that about 98% of the total variation in the number of actual Ebola deaths can be explained by the regression equation, $y = 0.033x^4 - 1.4617x^3 + 23.437x^2 - 118.18x + 231.59$.
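Eq. (11) is an expanded form of the familiar R² = 1 − SSE/SST, with SSE = Σ(w − z)² and SST = Σ(w − w̄)². The sketch below implements both forms so that they can be compared; the function names are ours, and the six observed and modeled values are taken from weeks 25 to 30 of Table 2 purely as sample input, so the result is not the R² reported for any of the models.

```python
import numpy as np

def r_squared_eq11(w, z):
    """R^2 written term by term as in Eq. (11): w = observed, z = model values."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    n = w.size
    w_bar = w.mean()
    numerator = 2 * (w * z).sum() + n * w_bar**2 - (z**2).sum() - 2 * w_bar * w.sum()
    denominator = (w**2).sum() + n * w_bar**2 - 2 * w_bar * w.sum()
    return float(numerator / denominator)

def r_squared_residual(w, z):
    """Equivalent compact form: 1 - SSE/SST."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    sse = ((w - z) ** 2).sum()
    sst = ((w - w.mean()) ** 2).sum()
    return float(1.0 - sse / sst)

# Sample input only: Guinea weeks 25-30 from Table 2 (observed, model).
observed = [494, 494, 648, 739, 862, 904]
modeled = [570, 619, 672, 730, 792, 860]

print(r_squared_eq11(observed, modeled), r_squared_residual(observed, modeled))
```

The compact form is usually preferable in floating-point arithmetic, because the expanded form subtracts large, nearly equal sums.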

This shows that recording good and organized data that is easily retrievable is paramount in the fight against pandemics. The statistical models developed, in turn, can continue to inform and drive current and subsequent pandemic analyses.

## 5. Probability and expected value in statistics

At Wesley College, probability and expected value in statistics are introduced in two freshman-level mathematics classes: the quantitative reasoning math-core course and a first-year seminar, Mathematics in Gambling.

In general, there are two practical approaches to assigning a probability value to an event:

1. The classical approach

2. The relative frequency/empirical approach

The classical approach to assigning a probability assumes that all outcomes to a probability experiment are equally likely. In the case of a roulette wheel at a casino, the little rolling ball is equally likely to land in any of the 38 compartments of the roulette wheel. In general, the rule for the probability of an event according to the classical approach is:

$$P(\text{event A}) = \frac{\text{number of ways event A can occur}}{\text{total number of ways anything can occur}} \tag{12}$$

In the case of roulette, there are 18 red, 18 black and 2 green compartments, so the probability of a gambler winning by placing a bet on the color red is 18/38 = 9/19, or approximately 0.474.

Unfortunately, the classical approach to probability is not always applicable. In the insurance industry, actuaries are interested in the likelihood of a policyholder dying. Since the two events of a policyholder living or dying are not equally likely, the classical approach cannot be used.

Instead, the relative frequency approach is used, which is:

$$P(\text{event B}) = \frac{\text{number of times event B has happened in the past } n \text{ trials}}{\text{number of trials, } n} \tag{13}$$

When setting life insurance rates for policyholders, life insurance companies must consider variables such as age, sex and smoking status (among others). Suppose recent mortality data for 65-year-old non-smoking males indicates 1800 such men died last year out of 900,000 such men. Based on this data, one would say the probability a 65-year-old non-smoking male will die in the next year, based on the relative frequency approach is:

P(65-year-old non-smoking male dies) = 1,800/900,000, or approximately 0.002 or 0.2%.
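Both probability rules reduce to a ratio of counts, which the following small Python sketch makes explicit using exact fractions. The function names are ours; the numbers are the roulette and mortality figures from the text.

```python
from fractions import Fraction

def classical_probability(favourable, total):
    """Eq. (12): equally likely outcomes, favourable ways over total ways."""
    return Fraction(favourable, total)

def relative_frequency(occurrences, trials):
    """Eq. (13): observed occurrences over the number of past trials."""
    return Fraction(occurrences, trials)

p_red = classical_probability(18, 38)          # 18 red compartments out of 38
p_death = relative_frequency(1_800, 900_000)   # deaths among 65-year-old non-smoking males

print(p_red, round(float(p_red), 3))
print(p_death, float(p_death))
```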

The field of decision analysis often employs the concept of expected value. Take the case of a 65-year-old non-smoking male buying a $250,000 term life insurance policy. Is it worth buying this policy? Based on the concept of expected value, a calculation based on probability is made and interpreted. If the value turns out to be negative, students then have to explain the rationale justifying the purchase of the term life insurance policy.

For a casino installing a roulette wheel or craps table, will the table game be a money maker for the casino? In the Mathematics of Gambling first-year seminar course, students research the rules for the game of roulette and the payoffs for various bets. Based on their findings, they determine the “house edge” for various bets. They also compare various bets in different games of chance to analyze which is a “better bet” and in what game.

Assume a situation has various outcomes/states of nature which occur randomly and are unknown when a decision is to be made. In the case of a person considering a life-insurance policy, the person will either live (L) or die (D) during the next year. Assuming the person has no adverse medical condition, the person’s state of nature is unknown when he has to make the decision to buy the term life-insurance (the two outcomes will occur in no predictable manner and are considered random). If each monetary outcome (denoted $O_i$) has a probability (denoted $p_i$), then the expected value can be computed by the formula:

$$\text{Expected value} = O_1 p_1 + O_2 p_2 + O_3 p_3 + \cdots + O_n p_n = \sum_{i=1}^{n} O_i p_i \tag{14}$$

where there are n possible outcomes. In other words, it is the sum of each monetary outcome times its corresponding probability.

Example 1: A freshman-level quantitative reasoning mathematics-core class

Assume a 67-year-old non-smoking male is charged $1180 for a one-year $250,000 term life-insurance policy. Assume actuarial tables show the probability of death for such a person to be 0.003. What is the expected value of this life-insurance policy to the buyer? A payoff table can be constructed showing the outcomes, probabilities and “net” payoffs:

| Outcome | Person dies | Person lives |
|---|---|---|
| Probability | 0.003 | 1 − 0.003 = 0.997 |
| Net payoff | $250,000 − $1180 = $248,820 | −$1180 |

The payoff in the case of the person living is negative since the money is spent with no return on the investment. Using these data, the expected value is calculated as

$$\text{Expected value} = \$248{,}820(0.003) + (-\$1180)(0.997) = -\$430 \tag{15}$$

The negative sign in the expected value means the consumer should expect to lose money (while the insurance company can expect to make money). Students are asked to explain the meaning of the expected value and explain reasons for people throwing their money away like this. What will they do when it comes time to consider term life insurance?
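The arithmetic in Example 1 can be checked numerically. The following is a minimal Python sketch, not part of the original coursework; the function name `expected_value` and the variable names are illustrative assumptions, and the inputs are the figures quoted in Example 1.

```python
from math import isclose


def expected_value(outcomes, probabilities):
    """Sum of each monetary outcome times its probability (formula E14)."""
    if not isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(o * p for o, p in zip(outcomes, probabilities))


# Figures from Example 1, taken from the buyer's perspective
premium = 1180        # cost of the one-year policy, in dollars
benefit = 250_000     # amount paid out if the insured person dies
p_death = 0.003       # probability of death from the actuarial table

net_payoffs = [benefit - premium, -premium]   # person dies, person lives
probabilities = [p_death, 1 - p_death]

policy_ev = expected_value(net_payoffs, probabilities)
print(f"Expected value to the buyer: {policy_ev:.2f}")
```

The same helper accepts any number of outcomes, so it could be reused for bets with more than two results, provided the probabilities sum to 1.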

Example 2: Mathematics of Gambling class

Students are asked to research the rules of various games of chance and the meaning of various payoffs (for example, 35 to 1 versus 35 for 1), and then to calculate and interpret the house edge in gambling. This is defined by the formula

$$\text{House edge} = \frac{\text{Expected value of the bet}}{\text{Size of the bet}} \tag{E16}$$

By asking different students to evaluate the house edge of different gambling bets, students can analyze and decide which bet is safest if they do choose to gamble.

Which bet has the lower house edge and why?

Bet #1 – Placing a $10 bet in American roulette on the “row” 25–27.

Bet #2 – Placing a $5 bet in craps on rolling the sum of 11.

Students must research each game of chance and determine important information to use, which is recorded as follows:

| | $10 bet on a row in roulette | $5 bet on a sum of 11 in craps |
|---|---|---|
| Probability of a winning bet | 3/38 | 2/36 = 1/18 |
| Payoff odds | 11 to 1 | 15 to 1 |
| Payoff | −$110 | −$75 |
| Probability of a losing bet | 35/38 | 34/36 = 17/18 |
| Payoff to house for lost bet | +$10 | +$5 |
| House edge | $0.0526 | $0.1111 |
| Computed by | [3/38 ⋅ (−$110) + 35/38 ⋅ (+$10)] ÷ $10 | [1/18 ⋅ (−$75) + 17/18 ⋅ (+$5)] ÷ $5 |
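The “Computed by” entries in the table can be reproduced with exact fractions. This is a hedged sketch rather than the students' actual worksheet: the function name `house_edge` is hypothetical, and, as in the table, payoffs are taken from the house's point of view (negative when the house pays out, positive when it keeps the stake).

```python
from fractions import Fraction


def house_edge(p_win, payoff_odds, bet):
    """House edge per dollar bet: house's expected value divided by bet size."""
    p_lose = 1 - p_win
    house_loss = -payoff_odds * bet   # house pays out when the gambler wins
    house_gain = bet                  # house keeps the stake when the gambler loses
    ev_house = p_win * house_loss + p_lose * house_gain
    return ev_house / bet


# $10 on a row in American roulette: 3 winning pockets out of 38, pays 11 to 1
roulette_edge = house_edge(Fraction(3, 38), 11, 10)

# $5 on a sum of 11 in craps: 2 winning rolls out of 36, pays 15 to 1
craps_edge = house_edge(Fraction(2, 36), 15, 5)

print(f"Roulette row bet: {roulette_edge} = {float(roulette_edge):.4f}")
print(f"Craps sum of 11:  {craps_edge} = {float(craps_edge):.4f}")
```

Using `Fraction` avoids rounding, so the results come out as the exact values 1/19 and 1/9 that round to the 0.0526 and 0.1111 shown in the table.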

The roulette bet has a lower house edge and is financially safer in the long run for the gambler. Students were then asked to compute the house edge using the shortcut method based on the theory of odds. The house edge is the difference between the true odds (denoted a:b) and the payoff odds the casino pays, expressed as a percentage of the true total odds (a + b).

In the example involving craps, the true odds against a sum of 11 are 34:2, which reduces to 17:1. The difference between the true odds and the payoff odds is 17 − 15 = 2 (see Example 2). Expressing this difference as a percentage of (a + b), the house edge is then calculated as 2 ÷ (17 + 1) = 2 ÷ 18 = 1/9 ≈ 0.1111, which is the same answer found using the expected value.
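The shortcut can be written as a one-line function. The sketch below is illustrative only: the name `shortcut_house_edge` and its parameters are assumptions, and payoff odds quoted as “x to 1” are rescaled to the same base b as the true odds a:b before the difference is taken. Applying it to the roulette row bet (true odds 35:3) is an added check, not a calculation shown in the chapter.

```python
from fractions import Fraction


def shortcut_house_edge(a, b, payoff_to_one):
    """House edge from true odds a:b against and payoff odds of 'x to 1'.

    Payoff odds of x to 1 equal (x * b) to b, so the gap between true odds
    and payoff odds is a - x * b, expressed over the total a + b.
    """
    return Fraction(a - payoff_to_one * b, a + b)


# Craps, sum of 11: true odds 34:2 (that is, 17:1), payoff odds 15 to 1
craps_shortcut = shortcut_house_edge(34, 2, 15)

# Roulette row bet: true odds 35:3, payoff odds 11 to 1
roulette_shortcut = shortcut_house_edge(35, 3, 11)

print(craps_shortcut, roulette_shortcut)
```

Both values agree with the expected-value results in the table (1/9 ≈ 0.1111 and 1/19 ≈ 0.0526).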

Due to the concept of the house edge, casinos know that in the long run, every time a bet is made in roulette, the house averages a profit of \$0.0526 for each dollar bet. Yes, gamblers do win at the roulette table and large amounts of money are paid out. But in the long run, the game is a money maker for the casino.
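The long-run claim can be illustrated with a small Monte Carlo simulation. This is an illustrative sketch rather than an exercise described in the chapter; the function name, the seed and the number of simulated bets are arbitrary assumptions, and the wheel is modeled simply as 38 equally likely pockets, 3 of which win the row bet.

```python
import random


def simulate_house_profit_per_dollar(n_bets, bet=10, seed=0):
    """Average house profit per dollar wagered over n_bets row bets."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    total_profit = 0
    for _ in range(n_bets):
        pocket = rng.randrange(38)      # American wheel: 38 equally likely pockets
        if pocket < 3:                  # any 3 pockets can stand in for the row
            total_profit -= 11 * bet    # house pays 11 to 1 on a winning bet
        else:
            total_profit += bet         # house keeps the stake on a losing bet
    return total_profit / (n_bets * bet)


average = simulate_house_profit_per_dollar(1_000_000)
print(f"Simulated house profit per dollar bet: {average:.4f}")
print(f"Theoretical house edge (1/19):         {1 / 19:.4f}")
```

With a finite number of bets the simulated average only approximates 1/19 ≈ 0.0526; the estimate varies with the seed and tightens as the number of bets grows.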

## Acknowledgments

This work was made possible by grants from the National Institute of General Medical Sciences—NIGMS (P20GM103446) from the National Institutes of Health (DE-INBRE IDeA program), a National Science Foundation (NSF) EPSCoR grant IIA-1301765 (DE-EPSCoR program) and the Delaware (DE) Economic Development Office (DEDO program). The undergraduates acknowledge tuition scholarship support from Wesley’s NSF S-STEM Cannon Scholar Program (NSF DUE 1355554) and RB acknowledges further support from the NASA DE-Space Grant Consortium (DESGC) program (NASA NNX15AI19H). The DE-INBRE, the DE-EPSCoR and the DESGC grants were obtained through the leadership of the University of Delaware and the authors sincerely appreciate their efforts.

## Author contributions

Drs. D’Souza, Wentzien and Nwogbaga served as undergraduate research mentors to Brandenburg, Bautista and Miller, respectively. Professor Olsen has developed and taught the probability and expected value examples in his freshman-level mathematics core courses. The findings and conclusions drawn within the chapter in no way reflect the interpretations and/or views of any other federal or state agency.

## Conflicts of interest

The authors declare no conflict of interest.


© 2017 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## How to cite and reference

### Cite this chapter

Malcolm J. D’Souza, Edward A. Brandenburg, Derald E. Wentzien, Riza C. Bautista, Agashi P. Nwogbaga, Rebecca G. Miller and Paul E. Olsen (April 26th 2017). Descriptive and Inferential Statistics in Undergraduate Data Science Research Projects, Advances in Statistical Methodologies and Their Application to Real Problems, Tsukasa Hokimoto, IntechOpen, DOI: 10.5772/65721. Available from:
