Open access peer-reviewed chapter - ONLINE FIRST

Spatial Statistics: A GIS Methodology to Investigate Point Patterns in Stroke Patient Healthcare

Written By

Joanne N. Halls, Barbara J. Lutz, Sara B. Jones and Matthew A. Psioda

Submitted: 23 May 2023 Reviewed: 24 May 2023 Published: 07 August 2023

DOI: 10.5772/intechopen.1001922

Recent Advances in Biostatistics IntechOpen
Recent Advances in Biostatistics Edited by B. Santhosh Kumar

From the Edited Volume

Recent Advances in Biostatistics [Working Title]

B. Santhosh Kumar

Abstract

Stroke is the leading cause of major disability and the fifth leading cause of death in the United States. Stroke incidence across the U.S. is not uniform: the southeastern states, known as the “Stroke Belt”, have historically higher rates. Importantly, while the national average death rate due to stroke has been declining, the death rate in the Stroke Belt increased 4.2% overall and 5.8% within the Hispanic population from 2013 to 2015. Healthcare interventions have been designed to improve acute stroke care, but fewer address the post-acute care needs of stroke survivors. This chapter describes the results of a recent study that investigated patterns in post-stroke care using a sequence of geospatial statistics. Through this investigation, the reader will learn the sequence of Geographic Information System (GIS) techniques appropriate for studying complex spatial patterns.

Keywords

  • geospatial statistics
  • point patterns
  • drive time
  • GIS
  • healthcare data
  • stroke
  • North Carolina USA

1. Introduction

Stroke is the leading cause of major disability and the fifth leading cause of death in the U.S. [1]. Stroke incidence across the U.S. is not uniform. The southern states of Arkansas, Louisiana, Mississippi, Alabama, Tennessee, Georgia, South Carolina, and North Carolina are known as the “Stroke Belt”, where rates are historically higher [2]. Importantly, while the national average death rate due to stroke has been declining, from 2013 to 2015 the death rate in the Stroke Belt increased 4.2% overall and 5.8% within the Hispanic population [3]. In North Carolina the average death rate from stroke is 84.6 deaths per 100,000 (age 35 and up, all races/ethnicities, both genders, 2014–2016) while the national average is 73.3 per 100,000. However, this average does not tell the full story of stroke in North Carolina because the death rate varies substantially among counties: 5 rural counties have the lowest rates, between 51 and 60 per 100,000 people, whereas 21 counties have rates of 100 to 190 per 100,000 people (Figure 1). Specifically, 68 of the 100 North Carolina counties have stroke death rates above the national average [5]. This high rate of stroke and its high variability across the state have prompted research into the differences between these locations to identify reasons for such variability in stroke death rates.

Figure 1.

Stroke death rate, per 100,000 people aged 35+, all races/ethnicities, all genders, 2014–2016. Data source: Centers for Disease Control, Interactive Atlas of Heart Disease and Stroke [4].

Interventions, such as timely administration of intravenous tissue plasminogen activator (IV tPA) or mechanical thrombectomy, have improved stroke patient outcomes in acute care [6]. However, evidence-based interventions to optimize post-acute stroke recovery and address recurrent stroke after discharge have not been widely implemented. Several Transitional Care (TC) models have been designed to reduce care fragmentation and improve post-discharge outcomes [7]. The Comprehensive Post-Acute Stroke Services (COMPASS) study evaluated an evidence-based TC model, which included early telephone follow-up and an in-person clinic visit, compared with usual care in a cluster-randomized pragmatic trial in 40 (out of 110) hospitals in North Carolina [8, 9]. Average attendance at follow-up clinics in the intervention group was 35% and ranged from 6 to 70% across the 19 hospitals that implemented the intervention [8]. This variability in attendance at follow-up care led to this geographic study that investigated where attendance was highest and lowest and the relationship with drive time from patients’ homes to the follow-up clinics.

2. Data and methods

We spatially compared where patients lived, drive time to their assigned follow-up clinic, the Area Deprivation Index from the Health Innovation Program [10], and the designation of urban versus rural from the United States Department of Agriculture. It was hypothesized that (1) the farther a patient lived from the clinic, the less likely they would be to attend the follow-up visit; (2) higher area deprivation would correlate with lower attendance at the clinic; and (3) urban clinics would have a higher attendance rate than rural clinics. Results from this geospatial statistical analysis could yield insights into the variability of stroke death rates across North Carolina and lead to improvements in patient access to follow-up care.

2.1 Geocoding address data

Studies have shown that implementing a Geographic Information System (GIS) can provide spatial data analytics for identifying areas with less access to care and therefore potential health care disparities [11]. Additionally, once a variety of data layers are spatially referenced, functions such as map overlay can identify the areas where layers intersect, which enables statistical comparison among the layers. For example, overlaying drinking water sources (public system versus well water), agricultural use of pesticides, and prevalence of Parkinson’s Disease (PD) has identified a link between well water and PD in California [12]. To begin the GIS process, non-spatial data need to be converted into a spatial data set. Several data sets, consisting of hospital, clinic, and patient data, were geo-referenced using the World Geocoding Service within ArcGIS 10.7.1. These data included 19 hospitals that participated in the intervention study, affiliated hospital clinics where patients were assigned for follow-up care, and patient data collected between July 2016 and March 2018.

Geocoding was performed on the physical addresses of hospitals and clinics as well as residential addresses of study participants. Geocoding is an iterative process that compares an address with a reference base map to estimate a spatial location for the address. To perform the geocoding, the first step is to import the address data into a geodatabase table and then parse the addresses into several components (e.g. address number, direction prefix, street name, direction suffix, city name, and zip code) using the Address Locator tool. Parsing is an important step because these components are what is compared with the reference base map [13]. Next, search criteria are defined and then the data are batch processed to compare them with the base map. This results in a match score for each address record. The search criteria define how spatially accurate the resulting locations will be. For example, one can specify to match a record if the city and zip code match or, for greater precision, if the address number, direction and street name also match. These criteria can greatly impact the resulting spatial accuracy of the output point data and therefore care must be taken when applying the search criteria. In this example, we specified the most stringent criteria in order to have the highest spatial accuracy of the output point data.

All records that can be matched using the criteria and the reference base map are given a match score that reflects the quality of the output data. It is important to check all output points that have a match score less than 100% because these points could have multiple matches to the reference map or other issues, such as a location at a large complex or a typographical error in the format of the address data. An example of a typographical error is the slight misspelling of a street name. When this occurs, the user must check all results with a score that is less than 100%, correct any errors in the address, and then re-run the geocoding to obtain a final result. If the address data have a substantial number of inconsistencies, it is best to reformat all of the data before geocoding rather than iteratively fix each error; this will improve the spatial accuracy of the resulting point data [14].
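The review step described above can be sketched in a few lines of Python. This is a minimal illustration and not the ArcGIS workflow used in the study: the `GeocodeResult` record, its field names, the example addresses, and the rule that tied candidates also trigger a review are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class GeocodeResult:
    record_id: str      # hypothetical identifier for an address record
    address: str        # address as entered (made-up examples below)
    score: float        # match score reported by the geocoder, 0 to 100
    n_candidates: int   # number of candidate locations returned

def split_for_review(results, threshold=100.0):
    """Separate results that can be accepted from those needing a manual check."""
    accepted, review = [], []
    for r in results:
        # Assumption: anything below a perfect score, or with tied candidates,
        # is checked by hand before the geocoding is re-run.
        if r.score < threshold or r.n_candidates > 1:
            review.append(r)
        else:
            accepted.append(r)
    return accepted, review

results = [
    GeocodeResult("p1", "100 Main St, Raleigh, NC 27601", 100.0, 1),
    GeocodeResult("p2", "12 Oak Stret, Wilmington, NC 28401", 87.5, 1),    # misspelled street
    GeocodeResult("p3", "500 Hospital Dr, Charlotte, NC 28203", 100.0, 2),  # tied candidates
]
accepted, review = split_for_review(results)
print("accepted:", [r.record_id for r in accepted])
print("needs review:", [r.record_id for r in review])
```

Records in the review list would be corrected and geocoded again, which mirrors the iterative loop described in the text.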

In this study, hospitals, clinics and patient records were geocoded, iteratively checked, and final point data were derived. Of the 2689 patient records only 5 (0.2%) were not able to be geocoded because the patients were listed as homeless or had an invalid address. Many addresses were initially unable to be geocoded due to errors in the address data. This is very common with address data because of the many opportunities to enter incorrect data. Therefore, as discussed above, it is critical to employ extensive quality control and assurance procedures during data collection and to correct address data prior to initiating the geocoding process. Because some patients resided outside the study area, i.e., beyond 50 miles of the North Carolina border, these records were removed from further analysis, which resulted in 2615 (or 97.2%) of the geocoded sample residing within the study area. Due to patient confidentiality, we cannot illustrate the point results from the geocoding of the patient data; hospital and clinic locations are shown in Figure 2, which demonstrates the conversion from address data to point locations.

Figure 2.

Hospitals and clinics in the North Carolina, USA, study area. Hospitals were designated as control and intervention. Patients seen at the Intervention (N = 19) hospitals were given custom care plans and assigned to clinics for follow-up care. Inset A is the Raleigh area and insets B and C are the Charlotte area. Also shown (in orange) are Joint Commission Certified primary or comprehensive stroke hospitals that did not participate in the study [15].

2.2 Computing shortest path and drive time

Drive time is commonly used as a metric of geographic accessibility of health care services [16, 17]. Due to the design of the study intervention, patients were assigned to the follow-up clinic that was affiliated with the hospital where they received their acute stroke care, which may not have been close to where they lived because acute stroke hospitals have wide catchment areas. Since patients’ residences were not considered, in some cases patients were not assigned to the closest clinic. Patient addresses, therefore, were linked with their assigned clinic. This is an important aspect of this study because in most geographic analyses the closest, or shortest, distance is used as a measure of nearness, but in this case the closest clinic may not have been the clinic the patient was assigned to. This is an excellent example of letting the structure of the data and the study context guide the appropriate use of spatial statistics and the interpretation of the resulting nearness or other spatial statistics results.

OpenStreetMap (https://www.openstreetmap.org/#map=4/38.01/-95.84) and ArcGIS Network Analyst (https://www.esri.com/en-us/arcgis/products/arcgis-network-analyst/overview) were used to compute the shortest drive time from each patient’s residence to their assigned follow-up clinic. Recall that the drive time calculation considered all patients within North Carolina as well as patients who lived within 50 miles of the border. Therefore, it was necessary to download the OpenStreetMap data for North Carolina, South Carolina, Tennessee, and Virginia, import these into ArcGIS, and then build a comprehensive network dataset for all these state road networks. Once the network was built, the patient and clinic data were used as “origin” and “destination” data in performing the shortest drive time calculation. Additionally, the shortest drive time to the nearest clinic was calculated to identify the number of patients who were not assigned to the closest clinic and to identify if the spatial patterns differ when comparing closest versus assigned clinic. Using the average drive time, zones around each follow-up clinic were computed using Network Analysis to derive service areas, which can then be used to identify underserved areas.
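To make the origin-destination idea concrete, the following sketch computes shortest drive times on a toy road network with Dijkstra's algorithm and compares the assigned clinic with the nearest one. It is an illustration only and is not the Network Analyst implementation; the node names and the edge drive times are hypothetical.

```python
import heapq

def drive_times(graph, origin):
    """Dijkstra's algorithm: shortest travel time (minutes) from origin to every reachable node."""
    best = {origin: 0.0}
    queue = [(0.0, origin)]
    while queue:
        t, node = heapq.heappop(queue)
        if t > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, minutes in graph.get(node, {}).items():
            candidate = t + minutes
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return best

# Hypothetical undirected road segments: (from, to, drive time in minutes).
edges = [("home", "a", 5), ("a", "clinic_near", 7), ("a", "b", 10),
         ("b", "clinic_assigned", 12), ("home", "b", 20)]
graph = {}
for u, v, minutes in edges:
    graph.setdefault(u, {})[v] = minutes
    graph.setdefault(v, {})[u] = minutes

times = drive_times(graph, "home")
clinics = ["clinic_near", "clinic_assigned"]
nearest = min(clinics, key=lambda c: times[c])
assigned = "clinic_assigned"
print(f"assigned: {times[assigned]} min, nearest ({nearest}): {times[nearest]} min")
```

In this toy case the patient is assigned to a clinic that is farther away than the nearest one, which is the situation the study had to account for.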

Lastly, Analysis of Variance (ANOVA) was used to test whether drive time and visitation rates differed between urban and rural areas, and cross tabulation and Pearson’s Chi Square were used to compare the rate of attendance at the clinic and drive time to the assigned versus the closest clinic.
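As a small worked example of the cross tabulation step, the sketch below computes Pearson's chi-square statistic for a 2 x 2 table. The counts are hypothetical and are not the study data; the row and column meanings are assumptions chosen to match the comparison described above.

```python
def pearson_chi_square(table):
    """Pearson's chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

# Hypothetical counts (NOT study data).
# Rows: assigned clinic is the closest clinic (yes, no). Columns: attended (yes, no).
table = [[30, 20],
         [20, 30]]
stat, dof = pearson_chi_square(table)
critical_05 = 3.841  # chi-square critical value for 1 degree of freedom at alpha = 0.05
print(f"chi-square = {stat:.2f}, dof = {dof}, significant at 0.05: {stat > critical_05}")
```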

2.3 Spatial statistical analyses

There are a variety of GIS methods to identify spatial clusters and relationships among several data layers. Importantly, the methodology should follow a series of steps to establish the neighborhood distances between observations that are valid for each unique study area [18, 19, 20, 21, 22, 23, 24, 25]. Therefore, we computed both global and local spatial statistics to systematically assess the degree of spatial clustering. First, the Moran’s I spatial autocorrelation statistic was computed using the patient location and shortest drive time to determine if they were spatially clustered. Next, we performed an Average Nearest Neighbor (ANN) calculation, which measures the degree of clustering using distances between neighboring locations. The ANN statistic is useful for confirming the spatial autocorrelation results from Moran’s I and gives a measure (distance) of the amount of clustering. Importantly, when calculating the ANN statistic, you must include the size of the study area (in this case it was 127,605,669,275 m²); otherwise the statistic will use the size of the bounding box of the dataset and will likely overestimate the size of the study area, which can dramatically alter the results. Once global clustering had been measured using Moran’s I and ANN, we then computed a local spatial cluster analysis (local Moran’s I) to identify where the clusters are located. This technique identified several types of clusters: (1) the location of clusters with low values (shorter drive time), (2) the location of clusters with high values (longer drive time), and (3) cluster outliers where clusters of low values are surrounded by high values and, vice versa, where clusters of high values are surrounded by low values [26].
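The two global statistics can be sketched compactly in Python. The code below is an illustration under stated assumptions, not the ArcGIS tools used in the study: the weights are simple binary adjacency weights, the toy values and coordinates are hypothetical, and the ANN expected distance and standard error use the standard constants for complete spatial randomness (0.5 and 0.26136).

```python
import math

def morans_i(values, weights):
    """Global Moran's I for a list of values and a dict {(i, j): w_ij} of spatial weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    return (n / s0) * cross / sum(d * d for d in dev)

def average_nearest_neighbor(points, area):
    """ANN ratio and z-score; `area` is the study-area size in squared map units."""
    n = len(points)
    nearest = [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]
    observed = sum(nearest) / n
    expected = 0.5 / math.sqrt(n / area)         # mean distance expected under randomness
    std_err = 0.26136 / math.sqrt(n * n / area)
    return observed / expected, (observed - expected) / std_err

# Four hypothetical locations along a road; neighbors are adjacent locations.
drive_minutes = [1.0, 2.0, 3.0, 4.0]
chain = {(0, 1): 1, (1, 0): 1, (1, 2): 1, (2, 1): 1, (2, 3): 1, (3, 2): 1}
moran = morans_i(drive_minutes, chain)

# The same four points evaluated with two different study-area sizes.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ratio_bbox, z_bbox = average_nearest_neighbor(corners, area=1.0)    # bounding box of the points
ratio_area, z_area = average_nearest_neighbor(corners, area=16.0)   # larger, user-supplied area
print(round(moran, 3), ratio_bbox, ratio_area)
```

The same four points give a very different ANN ratio depending on the area supplied, which is why the study-area size has to be set explicitly rather than left to the bounding box.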

Regression analysis is used to look for relationships among independent, or explanatory, variables and a dependent variable. With spatial data we can use Ordinary Least Squares (OLS) Regression to identify an overall pattern in the data and Geographically Weighted Regression (GWR) to identify regression equations at the local level, or each spatial unit such as a county, Census Tract, or other enumeration unit. In this study we used demographic data, the Area Deprivation Index, and access to community resources as independent variables and attendance rate at the follow-up clinic as the dependent variable. In North Carolina there are 100 counties, 1410 Census Tracts, and 6155 Block Groups with an average size of 22.6 sq. km. In the regression analysis, we used Census Block Groups which enables the highest granularity of resolution.

OLS is a multiple regression technique that identifies the strength of the relationships (both positive and negative) between the independent variables and the dependent variable (rate of attendance at the follow-up clinic visit). First, all independent variables are tested and if any are collinear then they are iteratively removed from the analysis until a result can be achieved that has no multicollinearity problems. From this shorter list of independent variables, the GWR technique is used to identify local weights (importance) for each independent variable and to derive unique regression equations for each location. Unlike OLS regression, which looks at the entire dataset as a whole, the GWR method uses a neighborhood around each location (e.g. Block Group) to compute a regression equation. Therefore, all independent variables were tested and those that did not violate the rules of regression were included in the development of GWR models. The benefit of GWR is the ability to identify the importance of the independent variables across the study area. For example, in some areas drive time may be more important compared to other areas where demographics (e.g. age or race/ethnicity) may be more related to the attendance at follow-up clinics. GWR has been used in many disciplines including health studies to investigate the spatial patterns of diseases [27, 28, 29]. It is best to run many GWR trials to identify the highest performance based on the combination of independent variables.

One of the most important decisions in the use of GWR is the bandwidth, or the number of neighbors around each observation. This decision is important because it determines the number of observations used in each unique local regression equation. The bandwidth can be a fixed distance, or it can be adaptive, varying across the study area with the geographic distribution of observations so that each local regression uses the same number of neighbors. The corrected Akaike Information Criterion (AICc) identifies the optimal distance and the Cross Validation (CV) identifies the optimal number of neighbors. One can also use the distance identified from the ANN results. Because the bandwidth is so important in the derivation of local regression equations, we recommend testing all three approaches and using the one that yields the best results.
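The role of the bandwidth can be seen in a stripped-down local regression. The sketch below fits a weighted simple regression (one explanatory variable) at a target location using an adaptive bisquare kernel whose bandwidth is the distance to the k-th nearest neighbor. It is a teaching example under stated assumptions, not the GWR tool used in the study: the function name `local_fit`, the kernel choice, and the toy data are all hypothetical.

```python
import math

def local_fit(points, x, y, target, k):
    """Weighted simple regression at `target` with an adaptive bisquare kernel.

    The bandwidth is the distance from the target to its k-th nearest neighbor,
    so the neighborhood grows or shrinks with the local density of observations.
    Observations at or beyond the bandwidth get zero weight.
    Returns (intercept, slope).
    """
    tx, ty = points[target]
    dists = [math.hypot(px - tx, py - ty) for px, py in points]
    bandwidth = sorted(dists)[k]  # index 0 is the target itself (distance 0)
    w = [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0 for d in dists]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: ten locations along a line; the effect of drive time on
# attendance is weaker in the west (first five) than in the east (last five).
points = [(float(i), 0.0) for i in range(10)]
drive = [10, 20, 15, 30, 25, 12, 22, 18, 28, 16]
attend = [60 - d for d in drive[:5]] + [100 - 3 * d for d in drive[5:]]

west = local_fit(points, drive, attend, target=0, k=3)
east = local_fit(points, drive, attend, target=9, k=3)
print("west (intercept, slope):", west)
print("east (intercept, slope):", east)
```

A single global regression would average the two regimes, whereas the local fits recover a different slope at each end of the toy study area.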

In this study, 1523 Block Groups with patients were included since the dependent variable was attendance at the follow-up clinic. Unlike other studies, such as studies based on Census data where there is complete geographic coverage, in this study there were gaps in coverage and the distance between neighboring polygons varied because of the large study area. Therefore, the size of the neighborhood was defined using the cross-validation approach where the optimal number of neighbors was defined locally. A well-specified model will have randomly distributed over and under predictions (residuals) of the dependent variable (attendance at follow-up clinic). If there is clustering in the over and under predictions (residuals), then it is likely that the model is missing at least one key variable.

Next, Grouping Analysis, which is similar to Principal Component Analysis, is a method that is used to look for an overall spatial pattern, especially when there are many variables, with the goal of identifying statistically significant clusters in space. The Grouping Analysis technique uses the K-means algorithm to form groups in which the values of the independent variables are as similar as possible within each group while the groups themselves are as different as possible. The most important variables identified through the regression analysis were used in the Grouping Analysis to identify the statistically significant groups within the study area.
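The K-means idea behind the grouping step can be sketched as follows. This is a plain K-means on attribute values only, with fixed starting centroids so the result is reproducible; it is not the ArcGIS Grouping Analysis tool, and the standardized values for drive time and community resources are hypothetical.

```python
import math

def k_means(rows, centroids, iterations=20):
    """Plain K-means: assign each row to its nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda g: math.dist(row, centroids[g]))
                  for row in rows]
        for g in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == g]
            if members:  # keep the old centroid if a group becomes empty
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical standardized values per Block Group: (drive time, community resources).
rows = [(-1.0, 1.2), (-0.8, 1.0), (-1.1, 0.9),   # short drive, many resources
        (1.0, -1.0), (1.2, -0.8), (0.9, -1.1)]   # long drive, few resources
labels, centroids = k_means(rows, centroids=[rows[0], rows[3]])
print(labels)
print(centroids)
```

As in the study, the algorithm only returns group memberships; interpreting what each group represents still requires inspecting the variable values within it.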

3. Results

3.1 Drive time analysis

The average drive time from patients’ residences to assigned clinics was 24 minutes (± 25 minutes standard deviation), with a minimum of 0.75 minutes and a maximum of 360 minutes. A majority of patients, 94%, had a drive time of less than 1 hour and 74% had a drive time of less than 30 minutes. The global Moran’s I spatial autocorrelation statistic, which compared patient location and drive time, identified that the data were significantly clustered with a z-score of 4.289 (P-value < 0.0001). The next spatial statistic test, the Average Nearest Neighbor (ANN), also showed highly significant clustering with a z-score of −35.742 (P-value < 0.001). Given these results, we then computed a spatial cluster analysis, the local Moran’s I, to identify where the clusters were located (Figure 3). These results identify where there are high clusters (longer drive time), low clusters (shorter drive time), and outliers; high/low outliers (higher drive time surrounded by lower drive time) were more prevalent than low/high outliers.

Figure 3.

Drive-time spatial cluster analysis (Local Moran’s I) showing locations of high clusters (longer drive time), low clusters (shorter drive time), and outliers where low drive time clusters are surrounded by high drive time and high drive time clusters are surrounded by low drive time.
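The four categories in this kind of map follow from the sign of each location's standardized value and the sign of its neighbors' average. The sketch below shows that classification under simplifying assumptions: row-standardized weights, a hypothetical chain of neighbors, hypothetical drive times, and no significance testing (the permutation test that a full local Moran's I analysis applies before mapping a cluster is omitted).

```python
def local_moran(values, neighbors):
    """Local Moran's I and quadrant label for each observation.

    `neighbors` maps an index to the list of its neighbor indices; weights are
    row-standardized, so each neighbor of a location counts equally.
    """
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    z = [(v - mean) / std for v in values]
    quadrant = {(True, True): "high-high", (False, False): "low-low",
                (True, False): "high-low", (False, True): "low-high"}
    out = []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])  # mean neighbor z-score
        out.append((z[i] * lag, quadrant[(z[i] > 0, lag > 0)]))
    return out

# Hypothetical drive times (minutes) at eight locations along a road.
values = [10, 12, 60, 11, 9, 55, 58, 57]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))}
results = local_moran(values, neighbors)
for i, (local_i, label) in enumerate(results):
    print(i, round(local_i, 2), label)
```

A positive local value indicates a location that resembles its neighbors (a high or low cluster), and a negative value indicates an outlier such as a long drive time surrounded by short ones.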

The average rate of attendance at the follow-up clinics was 35% (range 6 to 70%), and the average drive time for those who attended the clinic (19 minutes) was significantly less than for those who did not attend (23 minutes) (P = 0.005). Given this significant relationship, we can compare the rate of attendance at the follow-up clinic with drive time by deriving drive time zones around each clinic using the average drive time for each clinic (Figure 4). Some clinics have a much longer average drive time for those who did not attend the follow-up clinic, shown in red on the map. Additionally, the clinics with larger average drive times (large zones on the map) tend to also have the lowest visit rates (smaller yellow marker size). Conversely, the clinics with smaller drive time zones had higher attendance rates. These relationships do not hold everywhere, but they are significant. What this type of mapping reveals is the importance of investigating the spatial patterns in data rather than solely relying on global statistics.

Figure 4.

Average drive time for patients who attended the follow-up clinic (blue hatch) versus those who did not attend the clinic (red) as well as the rate of attendance (yellow graduated circles). Most locations had a higher attendance rate with shorter drive times (P = 0.005).

The drive time portion of the study concluded that there are locations of significant clusters of shorter and longer drive times and that shorter drive time was significantly related to higher rates of attendance.

3.2 Regression analysis

There were 18 independent variables tested to determine which are associated with attendance rate at the follow-up clinic (the dependent variable). Table 1 contains the list of independent variables, sorted by decreasing importance as indicated by the overall significance, along with the direction of the association (positive or negative), the Variance Inflation Factor (VIF), which indicates multicollinearity, and the list of covariates, or variables that are collinear and are potentially providing the same information. Since the goal of regression analysis is to include variables that each explain a unique aspect of the dependent variable, it is wise to remove redundant variables. One way of deciding which covariates to select is to keep the variable with the strongest positive or negative relationship and remove, one at a time, the variables with VIF greater than 7.5, re-running the OLS after each removal to check the results. In this iterative way the most important explanatory variables are identified.

| Variable | Significance (%) | Negative (%) | Positive (%) | VIF | Covariates* |
| --- | --- | --- | --- | --- | --- |
| Average drive-time | 100 | 100 | 0 | 1.25 | |
| Rate of caregivers | 100 | 0 | 100 | 1.88 | |
| Not Hispanic | 93.59 | 0 | 100 | 19.19 | White, Urban Area, Urban Cluster, Rural, Black |
| ADI | 85.59 | 1.56 | 98.44 | 1.26 | |
| Percent rural | 77.77 | 0 | 100 | 296.73 | Urban Area, Urban Cluster, Not Hispanic, White, Black |
| Average age | 75.15 | 94.39 | 5.61 | 1.11 | |
| Percent white | 55.7 | 22.49 | 77.51 | 355.46 | Not Hispanic, Black, Urban Area, Urban Cluster, Rural |
| Percent black | 48.87 | 5.05 | 94.95 | 101.36 | White, Not Hispanic, Urban Area, Urban Cluster, Rural |
| Percent urban area | 36.62 | 30.4 | 69.6 | 194.63 | Rural, Urban Cluster, Not Hispanic, White, Black |
| Density of community resources | 35.31 | 22.3 | 77.7 | 1.91 | |
| Percent unknown race | 34.95 | 92.34 | 7.66 | 3.59 | |
| Percent within urban cluster | 28.53 | 52.31 | 47.69 | 196.62 | Rural, Urban Area, Not Hispanic, White, Black |
| Percent Hispanic | 26.09 | 13.67 | 86.33 | 1.84 | |
| Percent multi-race | 24.09 | 0 | 100 | 2.84 | |
| Percent Asian | 11.66 | 80.28 | 19.72 | 1.54 | |
| Percent Native American | 6.2 | 43.58 | 56.42 | 4.43 | |
| Percent other race | 4.7 | 39.13 | 60.87 | 6.31 | |
| Percent Pacific Islander | 0.1 | 8.93 | 91.07 | 1.29 | |

Table 1.

List of Ordinary Least Squares regression results where the independent variables are listed in order of decreasing significance.

* Covariates are listed in decreasing importance/strength.


Negative and Positive indicate the relationship between the independent and dependent variable (attendance at the follow-up clinic). Variance Inflation Factor (VIF) indicates multicollinearity where the higher the value the greater collinearity.
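To help read the VIF column, recall that the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all of the other predictors. The short sketch below inverts that definition for three VIF values taken from Table 1 and applies the 7.5 threshold mentioned in the text; the helper function names are hypothetical.

```python
def vif_from_r2(r_squared):
    """VIF for a predictor whose regression on the other predictors has this R^2."""
    return 1.0 / (1.0 - r_squared)

def r2_from_vif(vif):
    """Invert the VIF definition to recover the implied R^2."""
    return 1.0 - 1.0 / vif

threshold = 7.5  # removal threshold used in the text
reported = {"Average drive-time": 1.25, "Not Hispanic": 19.19, "Percent white": 355.46}
for name, vif in reported.items():
    flag = "candidate for removal" if vif > threshold else "keep"
    print(f"{name}: VIF = {vif}, implied R^2 = {r2_from_vif(vif):.3f} -> {flag}")
```

Read this way, a VIF of 7.5 corresponds to roughly 87% of a predictor's variance being explained by the other predictors, and the very large VIFs of the race and urban/rural variables indicate that they carry nearly the same information.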

The strongest independent variables were average drive time, rate of caregivers, not Hispanic, ADI, percent rural, and average age. As expected, drive time was very strongly negatively related to attendance and having caregivers was very strongly positively related to attendance (both 100%). The strongest race/ethnicity variable was the percentage of patients who were not Hispanic, which was positively related to clinic attendance, and many of the race/ethnicity variables covaried. Interestingly, ADI was positively related to attendance, which was unexpected because higher ADI means the area is more deprived. The next strongest variable was percent rural, and it was also positively related to attendance, which indicates that patients who lived in rural areas were more likely to attend the follow-up clinic. As expected, average age was negatively related to attendance, which means older patients were less likely to attend the follow-up. The remaining independent variables were substantially less related to attendance (less than 56%).

OLS regression gives us the overall relationship between independent and dependent variables, but when you have a large study area with potentially different geographic influences, the GWR can provide insight into the varying importance of independent variables. Therefore, GWR was tested with many iterations of the independent variables to identify the combination that yields the most significant results but also low multicollinearity.

There are a series of steps one should take to interpret GWR results. First, in this study, the cross-validation method for determining bandwidth size resulted in an average of 55 neighbors which is a relatively small neighborhood around each observation/Block Group considering there were 1523 Block Groups. This is an excellent result because it informs us that local analysis is providing information about the significance of independent variables. Conversely, if the CV method resulted in a much larger number of neighbors, perhaps 500, then that would suggest a much larger geographic area should be used to derive the regression equations for each Block Group. Additionally, GWR performs best when you have a large number of observations because the analysis will have enough nearby observations to create a unique regression equation.

Next, an investigation of the residuals informs us as to whether the local regression equations are using observations that are close to the mean, or close to the regression line. In this study, only 154 Block Groups (10.1%) had standardized residuals beyond ±1.5 standard deviations, so most of the Block Groups (89.9%) had regression equations that predicted the dependent variable (attendance at the clinic) close to the mean. The Moran’s I test for spatial autocorrelation confirmed that the GWR residuals were randomly distributed (z-score = 0.955875), or not clustered, which means the local GWR did not have obvious missing variables.

The Condition Number is used to test for local multicollinearity, where values above 30 may indicate unreliable results. None of the Block Groups had Condition Numbers above 30. Given the Moran’s I and Condition Number results, the local R² values indicate where the GWR equations fit the dependent variable (Figure 5). The areas highlighted in blue (0.338 to 0.5) indicate places where the local regression equations explain less of the variance than in the areas in light and dark red (0.601 to 0.784), where the high values indicate equations that contain sufficient variables to predict attendance. To test whether the local R² results were randomly distributed, we ran a Moran’s I Spatial Autocorrelation test, which had a z-score of 139.79 (p-value = 0.01). This is very high (greater than 2.58 is significant), indicating that the pattern was significantly clustered and therefore the GWR R² results are reliable.

Figure 5.

Local R² results from geographically weighted regression analysis.

Given the good Condition Number, Moran’s I and R² results, the next check is to see where the predicted attendance differs from the observed attendance (Figure 6). Only 60 Block Groups (3.94%) under predicted the attendance by 1 or 2 people and only 43 Block Groups (2.82%) over predicted by 1 or 2 people. To test whether the difference between predicted and observed was clustered, we ran the Moran’s I Spatial Autocorrelation test, which had a z-score of 1.082583, indicating that the pattern is not significantly different from random (P = 0.278993). Given that more than 93% of the Block Groups predicted attendance within 1 person of the actual attendance, we concluded that the variables included in the GWR analysis were able to predict attendance at the follow-up clinic.

Figure 6.

Results from Geographically Weighted Regression (GWR) where observed attendance at the follow-up clinic was subtracted from predicted attendance. Areas in yellow (93%) had the same predicted attendance as actual (observed), areas in green had fewer predicted (4%) than actual and areas in red (3%) had more predicted than actual.

Given the high local R², excellent predicted versus observed results, and random standardized residuals, the next step was to investigate where each independent variable was important across the study area by the strength of each coefficient. For each Block Group, we identified the three variables with the largest positive coefficients and the three with the most negative coefficients (Figure 7). The density of Community Resources had a strong positive relationship with attendance in the south-central (Pinehurst area) and western regions (Figure 7A) and a negative relationship in urban (Charlotte and Raleigh) and eastern areas (Figure 7D). The number of multi-race people was the largest or second largest positive variable in many areas. In the Raleigh and eastern areas, the second most positive influence was having a caregiver (Figure 7B). Recall that the OLS results indicated that having a caregiver was one of the most important variables; the GWR analysis corroborates this relationship and illustrates where it is important.

Figure 7.

Geographically weighted regression results showing the largest positive (A, B, and C) and negative (D, E, and F) independent variables for each Census Block Group. A is the largest positive coefficient, B is the 2nd largest and C is the 3rd largest positive coefficient. Independent variables also had a negative relationship with attendance at the follow-up clinic, where D is the largest negative coefficient, E is the 2nd largest negative coefficient and F is the 3rd largest negative coefficient.

The multi-race characteristic was consistently a positive variable in the Charlotte and Pinehurst south-central areas (Figure 7AC). In the far eastern part of the study area, being black was the 3rd highest positive coefficient while in the western area being black had a negative relationship with attendance at the follow-up visit. Interestingly, given the importance in the OLS results, the “not Hispanic” independent variable was not a 1st positive variable, but was very important in the 2nd and 3rd most positive variables and most dominant in the western part of the study area. This illustrates the importance of performing GWR to find where the independent variables are most important.

Similar to the OLS results, drive time was a consistent and important negative variable, with longer drive times associated with lower rates of attendance at the follow-up clinic visit, although it ranked as the 2nd or 3rd most negative coefficient. Similarly, average age was a strong negative variable in the western area, where being older correlated with being less likely to attend the clinic visit. Lastly, the only area where being rural was negatively related to attendance was the eastern area. ADI, which was a strong positive variable in the OLS regression analysis, was a negative variable in the GWR analysis. This is the expected relationship, but it was only important in 202 Block Groups (13%) in the Charlotte region.

3.3 Grouping analysis

Having identified which independent variables were important, both positively and negatively, in the regression analysis, the next step was to use those variables to create spatial clusters, or groups, whose members share similar values of the independent variables. Grouping Analysis identified 3 statistically significant groups (Figure 8). As with an unsupervised classification in remote sensing, the analysis delineates the groups but does not label them, so what each group represents must be interpreted afterward. Using the list of variables and their values within each group, the following characteristics defined each group:

  1. Group 1 (N = 615 Block Groups): urban, short drive time, very high community resources, and below average ADI. This group had an average attendance rate of 33% and is considered less vulnerable given these results.

  2. Group 2 (N = 157): rural, average drive time, very low community resources, very high number of caregivers, not Hispanic, and above average ADI. This group had an average attendance rate of 42% and is considered moderate vulnerability because of these results.

  3. Group 3 (N = 751): rural, much like Group 2, but with very long drive time, low community resources, average number of caregivers, and above average ADI. This group had an average attendance rate of 33% and is considered highly vulnerable because of these results.
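To make the grouping step concrete, the sketch below clusters a handful of invented Block Groups on four of the variables listed above. It uses plain k-means on z-scored variables as a stand-in; the section above does not restate which algorithm the Grouping Analysis tool applied, so the choice of k-means, the farthest-point initialization, and all data values are assumptions made for illustration only.

```python
import math


def standardize(rows):
    """Convert each column to z-scores so no single variable dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def farthest_point_init(rows, k):
    """Deterministic start: each new centre is the row farthest from the others."""
    centres = [rows[0]]
    while len(centres) < k:
        centres.append(max(rows, key=lambda r: min(math.dist(r, c) for c in centres)))
    return centres


def kmeans(rows, k, iterations=50):
    """Plain k-means: assign rows to the nearest centre, then update centres."""
    centres = farthest_point_init(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(row, centres[j]))
                  for row in rows]
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties out
                centres[j] = [sum(col) / len(col) for col in zip(*members)]
    return labels


# Invented Block Groups: drive time (min), community resource density,
# number of caregivers, ADI.
block_groups = [
    [10, 9.0, 2, 30], [12, 8.5, 3, 35],   # urban, short drive, many resources
    [30, 1.0, 9, 70], [32, 1.2, 8, 72],   # rural, many caregivers
    [70, 1.5, 3, 75], [75, 1.0, 2, 78],   # rural, very long drive
]
labels = kmeans(standardize(block_groups), k=3)
print(labels)
```

As in the study, the algorithm only returns group membership; the per-group means of each variable still have to be inspected to decide what each group represents.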

Figure 8.

Grouping analysis identified 3 groups across the study area, where N is the number of Block Groups. Group 1 had short drive time, high density of community resources and low ADI. Group 2 had average drive time, low density of community resources, very high number of caregivers, and above average ADI. Group 3 had very long drive time, low density of community resources, and above average ADI.

Groups 1 and 3 had the same average attendance rate (33%) despite very different characteristics. Therefore, even though these are statistically significant groups, the rate of attendance does not depend on these characteristics alone, and care should be taken not to over-generalize; the GWR results, which provide spatially explicit variables of importance, should be the focus. The value of the groups presented here lies in comparing them with the GWR and other results to identify areas of vulnerability, such as Group 3, and then designing targeted strategies to improve patient care, especially in the most vulnerable groups.

4. Discussion

This chapter has outlined a workflow for identifying patterns in point data (cluster analysis) and identifying spatial relationships between independent and dependent variables through regression analysis. The example study identified the relationship between attendance at stroke care follow-up visits and drive time, ethnicity, and other social characteristics. These characteristics vary geographically in how positively or negatively they relate to clinic attendance, and the results confirm that race and ethnicity are important factors [30, 31, 32]. Geographic variability in social characteristics has been related to health outcomes in previous studies. For example, one study found that accessibility to pharmaceutical products is directly related to varying social characteristics [33] and another study found disparate resources between urban and rural nursing home facilities [34].

Drive time varied significantly across the study area and was directly related to attendance at the follow-up clinic. These results are comparable to other studies that have shown the direct relationship between health care use and accessibility [16, 32, 35].

Cluster and Grouping Analysis identified statistically significant locations of social characteristics as related to attendance at the follow-up visit. Other studies have used similar cluster analysis techniques [18, 21, 22, 23, 36, 37, 38, 39] and research is being conducted to develop new methods of cluster analysis [22]. As with all studies that use GWR, it is important to review the results to identify places where the regression equations perform poorly (e.g. low R², large residuals, and over- and under-predictions) because these indicate places where the local regression equations are not explaining the variance in the input data. In this study, 64.5% (982 out of 1523) of the Block Groups had R² greater than 0.5, which is very high for GWR analysis. However, in the remaining 541 Block Groups (35.5%) the local models explained half or less of the variance, so future work could investigate these areas further to identify missing variables that may improve the explanation for attendance at the follow-up clinic.
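The review of local model fit described above amounts to a simple tally. The short sketch below recomputes the share from the two counts reported in this chapter and then shows how poorly fitting Block Groups could be flagged for follow-up; the Block Group IDs and local R² values in the second half are invented.

```python
# Counts reported in the chapter.
n_block_groups = 1523
n_above = 982

share = n_above / n_block_groups
print(f"{share:.1%} of Block Groups have local R2 above 0.5; "
      f"{n_block_groups - n_above} remain to review")

# Hypothetical local R2 values keyed by invented Block Group IDs.
local_r2 = {"BG_001": 0.81, "BG_002": 0.47, "BG_003": 0.62, "BG_004": 0.50}

# Flag every Block Group whose local model explains half or less of the variance.
needs_review = sorted(bg for bg, r2 in local_r2.items() if r2 <= 0.5)
print(needs_review)
```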

Using GWR, we identified 10 important variables in the prediction of attendance at follow-up clinic visits. The unique equations for each Census Block Group have coefficients/weights for each variable and these weights varied across the study area. Other studies have used GWR to identify spatially varying patterns in health [27, 28, 29]. Building on the results from this study, the coefficients for the 10 independent variables were applied across the study area using spatial interpolation and then combined to create an overall Index of Vulnerability (Figure 9). This approach highlights areas with higher z-scores, higher vulnerability, in the eastern part of North Carolina and in the Charlotte area, with respect to attending a clinic follow-up visit. Future work could address these areas and the variables identified in the local GWR analysis to focus health care interventions.

Figure 9.

Statewide index of vulnerability derived using GWR weights. Areas in blue have very low z-scores, negative standard deviations and low to very low vulnerability. Conversely, areas in orange and red have positive standard deviations, high z-scores, and high to very high vulnerability.
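The z-score classes shown in Figure 9 can be illustrated with a small sketch. It assumes that a combined raw index value already exists for each location; how the interpolated coefficients were combined is not reproduced here. The cell names, raw values, and class breaks (at 0.5 and 1.5 standard deviations) are all assumptions chosen for illustration.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def vulnerability_class(z):
    """Map a z-score to a class; the break points are illustrative assumptions."""
    if z < -1.5:
        return "very low"
    if z < -0.5:
        return "low"
    if z <= 0.5:
        return "moderate"
    if z <= 1.5:
        return "high"
    return "very high"


# Invented raw index values for five locations.
raw_index = {"cell_a": 2.0, "cell_b": 4.0, "cell_c": 6.0, "cell_d": 8.0, "cell_e": 20.0}

classes = {
    name: vulnerability_class(z)
    for name, z in zip(raw_index, z_scores(list(raw_index.values())))
}
print(classes)
```

Because z-scores are relative to the study-area mean, a class such as "very high" only indicates vulnerability relative to the rest of the mapped area.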

5. Conclusions

The strategy and use of spatial statistics outlined here provide a framework for others to use as they investigate spatial patterns in data. There are many other approaches, such as geostatistical analyses that interpolate surfaces (e.g. kriging); the approach described here is therefore in no way comprehensive, but it is one strategy that logically progresses through a series of data analytic steps to identify statistical patterns in vector-based geographic data. Given the relatively low rate of attendance at follow-up clinics and the geographic variability across the study area, this study identified several factors that are related to attendance (e.g. drive time, presence of a caregiver, presence of community resources). However, as with most geographic studies, future research is needed to investigate additional factors that may relate to patient attendance because some parts of the study area had relatively low explanatory power. These results complement the overall results described in previous research [8, 40, 41] as well as other studies that have also concluded that health care access is directly related to proximity [16, 17, 20]. Given that this study also identified drive time as one of the most important factors for clinic attendance, we recommend that hospitals take into consideration where patients live when they assign follow-up care. This could be accomplished by creating a regional system of integrated stroke follow-up clinics that would allow patients to receive stroke-specific follow-up care at the clinic closest to where they live, regardless of hospital affiliation. This is especially important in urban areas where there may be several clinics to choose from. Several European countries have successfully implemented regional integrated healthcare delivery networks.

With this approach, the nurse care coordinator could automatically compute drive time, discuss the routing results with the patient, and provide a list of clinics with the estimated drive time to each, giving patients a choice of follow-up clinics. Providing follow-up care by telehealth is another option, especially as access to reliable broadband technology continues to improve [42]. These strategies may improve the rate of attendance at follow-up clinic visits and would be more patient-centric than the current hospital-centric approach to stroke care in the U.S.
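A clinic list of the kind described above could be produced along the following lines. This is only a rough sketch: it substitutes straight-line (haversine) distance, an assumed circuity factor of 1.3, and an assumed average speed of 80 km/h for true network routing, which a GIS or routing service would provide. The patient location and the three clinics are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def estimated_minutes(distance_km, circuity=1.3, speed_kmh=80.0):
    """Crude drive-time estimate; circuity and speed are assumed values."""
    return distance_km * circuity / speed_kmh * 60


# Hypothetical patient home and clinic coordinates (latitude, longitude).
patient = (34.2257, -77.9447)
clinics = {
    "Clinic A": (34.30, -78.00),
    "Clinic B": (35.00, -79.00),
    "Clinic C": (35.78, -78.64),
}

# Rank clinics from shortest to longest estimated drive time.
ranked = sorted(
    ((name, estimated_minutes(haversine_km(*patient, *location)))
     for name, location in clinics.items()),
    key=lambda item: item[1],
)
for name, minutes in ranked:
    print(f"{name}: about {minutes:.0f} min")
```

Replacing `estimated_minutes` with drive times from a road-network analysis would keep the ranking logic unchanged while giving patients realistic estimates.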

Acknowledgments

Geospatial analysis was conducted at the University of North Carolina Spatial Analysis Lab. Patient data were obtained through a project funded by a Patient-Centered Outcomes Research Institute Award (PCS-1403-14532). The contents of this chapter are solely the responsibility of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors, or Methodology Committee. University of North Carolina Wilmington students Zachary Hahn and Alexandria Reimold assisted with the geocoding portion of the project.

Conflict of interest

The authors declare no conflict of interest.

Disclaimer

All statements in this report, including its findings and conclusions, are solely those of the authors and do not necessarily represent the views of the Patient-Centered Outcomes Research Institute (PCORI), its Board of Governors, or Methodology Committee.

References

  1. Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics-2021 update: A report from the American Heart Association. Circulation. 2021;143(8):e254-e743. DOI: 10.1161/CIR.0000000000000950
  2. Karp DN, Wolff CS, Wiebe DJ, Branas CC, Carr BG, Mullen MT. Reassessing the stroke belt: Using small area spatial statistics to identify clusters of high stroke mortality in the United States. Stroke. 2016;47:1939-1942. DOI: 10.1161/STROKEAHA.116.012997
  3. Hall EW, Vaughan AS, Ritchey MD, Schieb L, Casper M. Stagnating national declines in stroke mortality mask widespread county-level increases, 2010-2016. Stroke. 2019;50:3355-3359. DOI: 10.1161/STROKEAHA.119.026695
  4. Centers for Disease Control. Interactive Atlas of Heart Disease and Stroke. 2019. Available from: https://www.cdc.gov/dhdsp/maps/atlas/index.htm
  5. Yang Q, Tong X, Schieb L, Vaughan A, Gillespie C, Wiltz JL, et al. Vital Signs: Recent Trends in Stroke Death Rates — United States, 2000-2015. MMWR Morbidity and Mortality Weekly Report. 2017;2017:933-939. Available from: https://www.cdc.gov/mmwr/volumes/66/wr/mm6635e1.htm
  6. Powers WJ et al. Guidelines for the Early Management of Patients with Acute Ischemic Stroke: 2019 update to the 2018 Guidelines for the Early Management of Acute Ischemic Stroke: A Guideline for Healthcare Professionals from the American Heart Association/American Stroke Association. Stroke. 2019;50(12):e344-e418
  7. Hirschman KB, Shaid E, McCauley K, Pauly MV, Naylor MD. Continuity of care: The transitional care model. Online Journal of Issues in Nursing. 2015;20:1. DOI: 10.3912/OJIN.Vol20No03Man01
  8. Duncan PW, Bushnell CD, Jones SB, Psioda MA, Gesell SB, D’Agostino RB, et al. Randomized pragmatic trial of stroke transitional care: The COMPASS study. Circulation: Cardiovascular Quality and Outcomes. 2020;13(6):e006285. DOI: 10.1161/circoutcomes.119.006285
  9. Johnson AM, Jones SB, Duncan PW, Bushnell CD, Coleman SW, Mettam LH, et al. Hospital recruitment for a pragmatic cluster-randomized clinical trial: Lessons learned from the COMPASS study. Trials. 2018;19:74. DOI: 10.1186/s13063-017-2434-1. Available from: https://rdcu.be/dbbMU
  10. HIPxChange. Area Deprivation Index Datasets. 2020
  11. Ferguson WJ, Kemp K, Kost G. Using a geographic information system to enhance patient access to point-of-care diagnostics in a limited-resource setting. International Journal of Health Geographics. 2016;10:15. DOI: 10.1186/s12942-016-0037-9
  12. Gatto NM, Cockburn M, Bronstein J, Manthripragada AD, Ritz B. Well-water consumption and Parkinson’s disease in rural California. Environmental Health Perspectives. 2009;117(12):1912-1918. DOI: 10.1289/ehp.0900852
  13. Zandbergen PA. Geocoding quality and implications for spatial analysis. Geography Compass. 2009;3:647-680. DOI: 10.1111/j.1749-8198.2008.00205.x
  14. Matci DM, Avdan U. Address standardization using the natural language process for improving geocoding results. Computers, Environment and Urban Systems. 2018;70:1-8. DOI: 10.1016/j.compenvurbsys.2018.01.009
  15. The Joint Commission. Comprehensive Stroke Center. 2019 [October 10, 2019]. Available from: https://www.jointcommission.org/en/accreditation-and-certification/certification/certifications-by-setting/hospital-certifications/stroke-certification/advanced-stroke/comprehensive-stroke-center/
  16. Brual J, Gravely-Witte S, Suskin N, Stewart DE, Macpherson A, Grace SL. Drive time to cardiac rehabilitation: at what point does it affect utilization? International Journal of Health Geographics. 2010;9:27. DOI: 10.1186/1476-072X-9-27
  17. Hare TS, Barcus HR. Geographical accessibility and Kentucky’s heart-related hospital services. Applied Geography. 2007;27:181-205. DOI: 10.1016/j.apgeog.2007.07.004
  18. Barro AS, Kracalik IT, Malania L, Tsertsvadze N, Manvelyan J, Imnadze P, et al. Identifying hotspots of human anthrax transmission using three local clustering techniques. Applied Geography. 2015;60:29-36. DOI: 10.1016/j.apgeog.2015.02.014
  19. Chen J, Roth RE, Naito AT, Lengerich EJ, MacEachren AM. Geovisual analytics to enhance spatial scan statistic interpretation: An analysis of U.S. cervical cancer mortality. International Journal of Health Geographics. 2008;7(1):57. DOI: 10.1186/1476-072X-7-57
  20. Coppi R, D’Urso P, Giordani P. A fuzzy clustering model for multivariate spatial time series. Journal of Classification. 2010;27(1):54-88. DOI: 10.1007/s00357-010-9043-y
  21. Fritz CE, Schuurman N, Robertson C, Lear S. A scoping review of spatial cluster analysis techniques for point-event data. Geospatial Health. 2013;7(2):183-198. DOI: 10.4081/gh.2013.79
  22. Han J, Zhu L, Kulldorff M, Hostovich S, Stinchcomb DG, Tatalovich Z, et al. Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. International Journal of Health Geographics. 2016;2016:15. DOI: 10.1186/s12942-016-0056-6
  23. Huang L, Pickle LW, Das B. Evaluating spatial methods for investigating global clustering and cluster detection of cancer cases. Statistics in Medicine. 2008;27(25):5111-5142. DOI: 10.1002/sim.3342
  24. Yamada I, Rogerson PA, Lee G. GeoSurveillance: A GIS-based system for the detection and monitoring of spatial clusters. Journal of Geographical Systems. 2009;11(2):155-173. DOI: 10.1007/s10109-009-0080-1
  25. Iftimi A, Montes F, Mateu J, Ayyad C. Measuring spatial inhomogeneity at different spatial scales using hybrids of Gibbs point process models. Stochastic Environmental Research and Risk Assessment. 2017;31(6):1455-1469. DOI: 10.1007/s00477-016-1264-0
  26. Roberson S, Dutton M, Macdonald M, Odoi A. Does place of residence or time of year affect the risk of stroke hospitalization and death? A descriptive spatial and temporal epidemiologic study. PLoS One. 2016;11(1):13. DOI: 10.1371/journal.pone.0145224
  27. Kauhl B, Schweikart J, Krafft T, Keste A, Moskwyn M. Do the risk factors for type 2 diabetes mellitus vary by location? A spatial analysis of health insurance claims in Northeastern Germany using kernel density estimation and geographically weighted regression. International Journal of Health Geographics. 2016;15(1):38. DOI: 10.1186/s12942-016-0068-2
  28. Cabrera-Barona P, Murphy T, Kienberger S, Blaschke T. A multi-criteria spatial deprivation index to support health inequality analyses. International Journal of Health Geographics. 2015;14:11. DOI: 10.1186/s12942-015-0004-x
  29. Comber AJ, Brunsdon C, Radburn R. A spatial analysis of variations in health access: Linking geography, socio-economic status and access perceptions. International Journal of Health Geographics. 2011;10:44. DOI: 10.1186/1476-072X-10-44
  30. Plantinga L, Howard VJ, Judd S, Muntner P, Tanner R, Rizk D, et al. Association of duration of residence in the southeastern United States with chronic kidney disease may differ by race: The REasons for geographic and racial differences in stroke (REGARDS) cohort study. International Journal of Health Geographics. 2013;12(1):17. DOI: 10.1186/1476-072X-12-17
  31. Moore JX, Donnelly JP, Griffin R, Safford MM, Howard G, Baddley J, et al. Community characteristics and regional variations in sepsis. International Journal of Epidemiology. 2017;46(5):1607-1617. DOI: 10.1093/ije/dyx099
  32. Wennerholm C, Grip B, Johansson A, Nilsson H, Honkasalo M-L, Faresjö T. Cardiovascular disease occurrence in two close but different social environments. International Journal of Health Geographics. 2011;10:5. DOI: 10.1186/1476-072X-10-5
  33. Amstislavski P, Matthews A, Sheffield S, Maroko AR, Weedon J. Medication deserts: Survey of neighborhood disparities in availability of prescription medications. International Journal of Health Geographics. 2012;11(1):48. DOI: 10.1186/1476-072X-11-48
  34. Lin S-W, Yen C-F, Chiu T-Y, Chi W-C, Liou T-H. New indices for home nursing care resource disparities in rural and urban areas, based on geocoding and geographic distance barriers: A cross-sectional study. International Journal of Health Geographics. 2015;14(1):28. DOI: 10.1186/s12942-015-0021-9
  35. Freyssenge J, Renard F, Schott AM, Derex L, Nighoghossian N, Tazarourte K, et al. Measurement of the potential geographic accessibility from call to definitive care for patient with acute stroke. International Journal of Health Geographics. 2018;17:1. DOI: 10.1186/s12942-018-0121-4
  36. van Rheenen S et al. An analysis of spatial clustering of stroke types, In-hospital mortality, and reported risk factors in Alberta, Canada, using geographic information systems. Canadian Journal of Neurological Sciences / Journal Canadien des Sciences Neurologiques. 2015;42(5):299-309
  37. Solano R et al. Retrospective space-time cluster analysis of whooping cough re-emergence in Barcelona, Spain, 2000-2011. Geospatial Health. 2014;8(2):455-461
  38. Hagenlocher M et al. Assessing socioeconomic vulnerability to dengue fever in Cali, Colombia: Statistical vs expert-based modeling. International Journal of Health Geographics. 2013;2013:12
  39. Queiroz JW et al. Geographic information systems and applied spatial statistics are efficient tools to study Hansen’s disease (leprosy) and to determine areas of greater risk of disease. American Journal of Tropical Medicine and Hygiene. 2010;82(2):306-314
  40. Gesell SB, Bushnell CD, Jones SB, Coleman SW, Levy SM, Xenakis JG, et al. Implementation of a billable transitional care model for stroke patients: The COMPASS study. BMC Health Services Research. 2019;19(1):1-14. DOI: 10.1186/s12913-019-4771-0. Available from: https://rdcu.be/dbbPC
  41. Lutz BJ, Reimold AE, Coleman SW, Guzik AK, Russell LP, Radman MD, et al. Implementation of a transitional care model for stroke: Perspectives from frontline clinicians, administrators, and COMPASS-TC implementation staff. The Gerontologist. 2020;60(6):1071-1084. DOI: 10.1093/geront/gnaa029
  42. Adeoye O, Nyström KV, Yavagal DR, Luciano J, Nogueira RG, Zorowitz RD, et al. Recommendations for the establishment of stroke systems of care: A 2019 update. Stroke. 2019;50:e187-e210. DOI: 10.1161/STR.0000000000000173
