Mapping the distribution of patients and analyzing disease clusters is an effective method in epidemiology, in which the non-random aggregation of patients is carefully investigated. This can aid in the search for clues to the etiology of diseases, particularly rare ones. Indeed, when the incidence of a rare disease is increased in certain populations and/or geographic areas, proper analysis of common exposures makes it possible to identify the likely promoters/triggers of the disease at a given time. In this chapter, we highlight the appropriate methodology and present several examples of cluster analyses that led to the recognition of preventable environmental, occupational and communicable triggers of several rare diseases.
- cluster investigation
- rare diseases
- disease triggers
- patient registries
1. Introduction
Many diseases are preventable with lifestyle modifications and by minimizing exposures to harmful substances. In fact, it was recently reported that nearly half of all cancer-related deaths in the United States were attributable to modifiable and preventable risk factors. Through epidemiological studies and careful examination of public health data such as disease registries, and by studying disease distribution, incidence, prevalence and mortality trends, the occurrence of diseases in defined populations can be estimated and related to different external factors. Disease clusters are aggregates of patients with a particular disease in a specified time period and at a defined geographical level, occurring at a rate markedly higher than expected. Analyzing and mapping the incidence rates of diseases can help identify non-random distributions of patient clusters, while a proper assessment of the population demographics and the surrounding environment can implicate occupational, communicable and environmental exposures as potential causes of a given disease. For instance, in the 1800s, despite limited knowledge of the etiology of many diseases such as cholera, clustering analysis enabled physicians and scientists to establish a link between disease outbreaks and causative or potentiating agents in the surrounding environment. In the case of cholera, water from a pump contaminated with the bacterium Vibrio cholerae was clearly identified as a disease source in London, England. As the examples in Section 2.2 illustrate, geographic clustering analyses of patient populations can shed light on the triggers of many rare diseases.
2. Cluster investigation analysis
2.1. Cluster investigations
In epidemiology, trends and causes of diseases and their progression and regression rates can be monitored over time, and the occurrence of diseases within defined populations can be estimated. There are several types of epidemiologic studies, including cohort, case–control, cross-sectional, ecologic and cluster studies. Epidemiological studies have a significant impact on public health outcomes because they identify increased disease incidence/prevalence rates, which in turn shape health policies, including preventative measures and resource allocation planning. Spatial epidemiology is the description and geographical analysis of health data, taking into account patients’ demographics and risk factors, including socioeconomic, genetic, environmental, behavioral, infectious and noninfectious exposures. Detection of disease clusters is an integral component of spatial epidemiology, as it identifies disproportionately high rates of a disease in a given population, which ultimately generates hypotheses that can help elucidate disease triggers/promoters.
Clustering analyses can be characterized as either general (non-focused) or specific (focused). In a general clustering analysis, the precise location of disease clusters is not studied but rather the clustering tendency of the disease and the overall distribution of disease is examined [3, 4]. On the other hand, specific disease clustering analysis carefully describes unusual nonrandom accumulation of disease outbreaks and the precise location of clusters, in time or space, that are unlikely to be due to chance alone [3, 4]. These investigations can be applied to formulate hypotheses and elucidate potential causes of diseases. Further, clustering patterns of diseases have many applications beyond identifying disease triggers, including identification of areas with high disease prevalence in order to optimize medical management and resource allocation. This will be discussed in Section 2.2.
2.2. Applications of cluster investigations in identifying disease triggers
2.2.1. Cholera
Cholera is an acute infectious diarrheal disease that can be fatal within days if left untreated. This disease became a major health threat in the 1800s, with several outbreaks that had devastating outcomes. The accepted explanation of cholera outbreaks at the time was the “Miasma theory”, which held that poisonous vapor or mist filled with substances from decomposed matter (“Miasmata”) caused many diseases, including cholera, chlamydia and plague. In 1854, a severe outbreak of cholera occurred in London, England, killing more than 600 people. Dr. John Snow, an English physician, investigated the cause of this epidemic by analyzing the geographic distribution of cholera and plotting cholera cases on a map, along with certain landmarks in the city, including providers of potable water (Figure 1). Notably, most of the cholera cases occurred within 250 yards of the intersection of Broad and Cambridge streets, in close proximity to a public water pump on Broad Street. This analysis identified the precise source of the outbreak as the public water pump, which had been built near an old open toilet. The observation prompted the local council to disable the pump, which halted the spread of cholera. This work established for the first time that cholera can spread via contaminated water, and the breakthrough paved the way for the field of epidemiology.
2.2.2. Mesothelioma
Mesothelioma is a rare but aggressive cancer that arises in the mesothelium, the lining of the pleura, peritoneum and pericardium. Studying the prevalence of mesothelioma in asbestos miners in South Africa established asbestos exposure as a critical factor responsible for this deadly malignancy. In a study by Wagner and colleagues, it was noted that, although mesothelioma is a very rare disease, 33 cases were described in the Northwest Cape province of South Africa, each associated with occupational exposure to crocidolite asbestos mining. This finding was shortly followed by several population studies in Quebec (Canada), the United Kingdom, the Netherlands, Germany, Scotland and Northern Ireland, demonstrating that most of the described mesothelioma patients clustered in communities where occupational exposure to asbestos was routine. At that time, asbestos was commonly used in insulation, construction and factory work, as well as in shipyards. These analyses confirmed the causal link between asbestos and mesothelioma and led to legislative action to ban the use of this carcinogen in construction and other workplaces.
2.2.3. Squamous cell carcinoma
For many centuries, arsenic was used by the Egyptians, Greeks, Asians and Romans for many applications, including the treatment of rheumatism and facial hair removal. Little was known about the carcinogenic effects of arsenic at that time. In 1898, Geyer conducted a detailed population study in Reichenstein, Silesia (Prussia). In this small arsenic-mining town, chronic poisoning took place primarily through drinking water contaminated by arsenical fumes that precipitated in the rain. This work shed light on the carcinogenic effects of arsenic. Affected individuals developed a constellation of symptoms, including pigmentation changes and hyperkeratosis (wart-like lesions) on the palms and soles. The latter carried a high risk of progression to cutaneous squamous cell carcinoma. This condition was referred to as “Reichenstein’s disease”. The significantly increased occurrence of this disease among the residents of the town helped establish the link between arsenic and arsenical keratoses and squamous cell carcinomas of the skin. This example further demonstrates the importance of non-random clustering of rare diseases in identifying novel environmental or occupational disease triggers.
2.2.4. Cutaneous T-cell lymphoma
Cutaneous T-cell lymphoma (CTCL) is a rare group of non-Hodgkin lymphomas that primarily involve the skin. Patients with CTCL typically present with persistent, red, itchy patches and thickened plaques located mostly on the trunk. As the malignancy progresses, patients can develop skin tumors with concomitant involvement of lymph nodes and visceral organs. In some stages, the disease involves the blood, and patients can develop erythroderma (generalized redness and desquamation of the skin) and suffer intractable pruritus as well as B-symptoms of lymphoma. Many patients with advanced disease succumb to this malignancy within 2–3 years. It is recognized that disruption of molecular pathways in skin lymphocytes by bacterial, viral or environmental factors can lead to cutaneous lymphomas [12, 13, 14]. Although progress has been made in the past few decades, the risk factors and promoters of CTCL, and the precise pathogenesis by which it develops, remain poorly understood. Several reports from different parts of the world examined the distribution of CTCL patients and illustrated non-random clustering of cases. This was shown in Sweden, in Houston, Texas (Figure 2) [16, 17], and in the Pittsburgh metropolitan area. Furthermore, an unusually high incidence of CTCL in married couples and in families was also noted. These clustering patterns strongly argue for the existence of external and potentially preventable risk factors for this rare skin cancer.
Several factors have been implicated in CTCL carcinogenesis, including immunosuppression, vitamin D deficiency, bacterial agents (Staphylococcus aureus, Mycobacterium leprae and Chlamydophila pneumoniae), medications (calcium channel blockers, angiotensin converting enzyme inhibitors, hydrochlorothiazide and serotonin reuptake inhibitors), dermatophytes and viruses (EBV, HSV and HTLV-1). However, none of these agents has been definitively linked with this skin lymphoma.
In addition, recent studies in Canada further confirmed the existence of disease clusters, as well as areas completely spared by this malignancy, and implicated industrial exposure and living in proximity to major transportation junctions as potential triggers for CTCL [22, 23]. Considering that the majority of skin cancers are caused by external and often preventable triggers (e.g., UV radiation, HPV and polyomaviruses), it is not surprising that skin lymphomas could also be caused by an external trigger. The search for such trigger(s) is ongoing.
2.2.5. Childhood leukemia
Another example in which a cause of an important disease was revealed came from an observation in the early 1980s in Woburn, Massachusetts, where an elevated incidence rate of childhood leukemia was documented. An extensive investigation of the geographical distribution of these patients helped implicate chlorinated organic compounds, which contaminated two of the eight municipal wells servicing Woburn, as a cause of childhood leukemia. Specifically, it was shown that the dwellings where patients with this cancer resided were supplied with water from these contaminated wells.
2.2.6. Bladder cancer
Bladder cancer is a disease of significant morbidity and mortality. Cluster investigation helped identify occupational and behavioral promoters of this cancer. These factors are potentially modifiable, and thus rates of this malignancy could possibly be reduced with primary prevention. In 1895, Rehn, a German physician, made the astute observation that the incidence of bladder cancer was remarkably high among aniline dye industry workers. This was the first evidence that occupational risk factors can be directly implicated in this malignancy. By carefully analyzing the incidence of bladder cancer in industrial workers, it was possible to identify aromatic amines, polycyclic aromatic hydrocarbons and chlorinated hydrocarbons, which are now well recognized as causative agents of this disease.
2.2.7. Emerging trends
2.2.7.1. Multiple sclerosis
Multiple sclerosis (MS) is an autoimmune demyelinating disease that affects the central nervous system and results in a spectrum of neurological symptoms, including vision problems, fatigue, pain, spasms and cognitive decline. The precise triggers of this rare disease have not yet been identified. However, studying the epidemiology and global geographic distribution of MS has revealed many interesting trends that allowed the generation of a number of hypotheses addressing the cause of MS. Clusters of new MS cases have been reported in many communities around the world, including in the United States, Canada, Europe, Israel, New Zealand, Australia and Russia [27, 28, 29, 30]. Many studies indicated significant variation in the global distribution of MS patients: this autoimmune disease is relatively uncommon in tropical climates but much more common in temperate zones and in the Western Hemisphere. Furthermore, remarkably elevated incidence rates were reported in northern latitudes [32, 33]. Many theories have been postulated to implicate promoters of MS, such as diet, soil minerals and vitamin D deficiency [32, 33]. A definite trigger for MS remains unknown, and extensive follow-up of identified clusters may provide some clues in the future.
2.2.7.2. Alzheimer’s disease
Alzheimer’s disease is a common, yet incompletely understood, form of dementia. Differences in the geographical distribution of patients with Alzheimer’s disease were reported, highlighting the possible contribution of nutritional or socio-environmental factors to the development and progression of the disease. Indeed, levels of essential trace elements, including selenium, magnesium, iron, copper and zinc, were shown to be markedly reduced in Alzheimer’s patients compared with healthy individuals of the same age. This suggests that further epidemiologic studies can be used to associate nutritional deficiencies with diseases.
2.3. Applications of cluster investigations in identifying nutritional deficiencies
Deficiency in micronutrients and vitamins can result in a variety of diseases. For instance, vitamin A deficiency is a known cause of keratomalacia, while vitamin D deficiency in childhood causes rickets. One important use of clustering analysis in epidemiology is to identify nutritional deficiencies.
During the Age of Discovery in the fifteenth and sixteenth centuries, particularly during long transatlantic journeys, it was noted that the incidence of scurvy, a rare disease caused by a severe deficiency of vitamin C (ascorbic acid), was much higher in sailors, pirates and other sea explorers. The disease later also affected soldiers during the world wars. Scurvy is characterized by general weakness, gingivitis and bleeding disorders. It was noted that eating citrus fruits prevented and cured this disease in sailors, which enabled later confirmation that vitamin C deficiency is the sole cause of scurvy. Thus, careful demographic and epidemiologic analyses of these individuals, who did not have access to fresh fruit and vegetables, established a link between nutritional deficiency and disease.
Thyroid goiters, which represent enlargement of the thyroid gland, are caused by iodine deficiency. Fortification of table salt, medications and common foods such as bread with iodine has largely eliminated the once pandemic goiter, but the condition persists in some regions of the developing world. The first hypothesis linking iodine with the treatment of goiter was made in the mid-1800s by a French chemist, Adolphe Chatin. However, fortification of table salt with iodine was not implemented in the United States until the early 1920s, and this was driven, at least in part, by epidemiological research.
It was noted that the prevalence of goiter was very high (approximately 26–70% of children) in the upper Midwest and Great Lakes regions of the United States. In fact, this endemic region was known at the time as the “Goiter Belt”. The prevalence was reported to be as high as 64.4% in some areas of Michigan. This highlighted the severity of the problem and sparked a major public health initiative to supplement table salt with iodine. The intervention was very successful, as the incidence of goiter in Michigan dropped by up to 90% within a decade of iodine supplementation. Currently, several areas still have a remarkably high prevalence of goiter, such as parts of India and the Himalayan/sub-Himalayan belts. In fact, despite efforts to implement iodine supplementation and fortification of table salt with iodine, the goiter prevalence in these communities has not decreased significantly. Thus, more work is needed to address logistic, cultural and other obstacles in order to eliminate suffering from goiter in these regions. In conclusion, recognizing the high prevalence of ‘uncommon’ diseases such as goiter has important clinical implications. These studies help detect regions with micronutrient deficiency, which can serve as a surrogate marker for poor nutrition and encourage prioritizing resource allocation to the affected communities.
2.4. Conducting a proper cluster investigation analysis
2.4.1. Systematic approach to conducting a cluster analysis
Studying the incidence/prevalence of a disease and mapping its distribution requires a systematic approach when trying to implicate occupational and environmental exposures as disease triggers/promoters. Mapping and exposure investigations are critical to highlight the existence and significance of identified clusters. However, it is not enough to learn only about the geographical disease clusters (i.e., disease hot-spots). It is also important to identify regions that are significantly spared by the disease (i.e., disease cold-spots). Detailed epidemiological and statistical analysis of both can help rule in or rule out environmental contamination or exposures as disease triggers. A point-by-point guide to a systematic approach for conducting a cluster investigation is provided below:
1. Define the disease and population(s) to be examined.
2. Obtain ‘background’ information about patient demographics to enable standardization of incidence and mortality rates (such as standardization by age, gender, race, socioeconomic status, etc.).
3. Obtain census or other population information to enable calculating incidence and mortality rates per country, territory/state/province, city and postal code. It is also helpful to learn about common exposures or diseases in that population to adjust for potential confounders. For instance, when studying the incidence of hepatitis C infection in a population, the prevalence of HIV would be an important confounder, since many patients are co-infected with both viruses due to shared risk factors for viral transmission. Population demographic parameters often vary and can be useful for subsequent analysis of collected data. The specific parameters of interest will differ for each disease, but often include population size, age ranges, race, gender distribution, socioeconomic status, data on lifestyle/behaviors, other environmental or occupational exposures, local rates of communicable diseases, etc.
4. Obtain public health data on patients with the disease of interest (e.g., from local or national cancer registries or the Centers for Disease Control). It is critical to obtain the data from population-based registries, since it is often very difficult to draw conclusions from data based on the experience of a single medical center or a few select hospitals. One must always seek to correlate single-center evidence with population-based registries/databases. Relevant collected information should include age at diagnosis, year of diagnosis (for incidence calculation), gender, ethnic background (to study ethnic predilection of the disease), patients’ addresses (for geographical mapping), age at death, year of death (for mortality calculation), disease stage, etc.
5. Calculate incidence from the obtained data (incidence rate per year = number of new patients per year/population at risk per year). A plot of incidence rates (y axis) vs. year (x axis) enables calculating an average incidence rate and following the change in rate over time. Mortality calculations are done similarly, using the number of deceased patients per year/population at risk.
6. Calculate incidence rates in smaller geographical regions in the same way. For rare diseases, it is important to include only locations with at least 5000–10,000 residents per geographical area to reduce false-positive hits, in which a few cases of disease occurring within a sparsely populated area (e.g., <5000 residents) may artificially inflate the incidence/mortality rate.
7. Normalize the calculated incidence/mortality rates to several variables (such as age, gender, ethnicity) or to a known distribution of relevant disease-specific variables (such as communicable diseases, geographical latitude, socioeconomic status, etc.). This is important to account for potential confounding variables and to highlight trends that can be ‘masked’ if rates are not normalized in subsequent analyses.
8. Conduct proper statistical analysis to determine statistically significant high and low incidence/mortality rates per geographical region at all levels. Two of the most commonly used methods of statistical analysis are the chi-square test (comparing the observed number of cases to that expected under an assumed Poisson distribution) and the Knox test for time–space interaction, among the more than 70 different methods that have been used in previously published studies.
9. Plot the incidence rates in a specialized computer program such as ArcGIS or other geographic information system (GIS) software. Generate several maps, choosing appropriate color schemes representing standardized rates. It may also be advantageous to generate maps representing levels of statistical significance. Maps should serve as a clear, rapid and informative summary of complex geographical information and should help the reader identify interesting trends and generate relevant hypotheses.
10. Repeat the mapping analysis (step 9) using different normalized rates. Map the data in different formats and beware of “biased mapping”, which has been discussed elsewhere. Ensure that the plotted maps convey the message clearly and accurately.
11. Visualize and further analyze the plotted maps and note the presence of disease clusters (“disease hot-spots”) as well as areas of significantly low incidence/mortality rates (“cold-spots”). Look for interesting trends, particularly if several of these clusters occur geographically side by side and are supported by hypotheses/current evidence of disease pathogenesis. It is often useful to compare generated disease maps with land-use maps that can be obtained from local authorities.
12. Perform sub-analysis of the identified “disease hot-spots” and correlate with the surrounding environment for any prevalent occupations, exposures, environmental factors, etc. If the patients within the area of high incidence (e.g., within a zip/postal code or a city) demonstrate an additional level of clustering (e.g., living on the same street or up and down the same stream or river), this can further strengthen the clustering findings and provide clues regarding possible triggers/exposures.
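As a rough illustration of the guide above, the incidence calculation, the minimum-population filter and the comparison of observed with expected case counts under an assumed Poisson distribution might be sketched as follows. This is a minimal sketch, not the method used in the cited studies: the `Region` class, the function names, the 0.05 threshold and all counts are hypothetical, the expected counts are derived from the crude rate of the whole study area without any age or sex standardization, and an exact one-sided Poisson tail probability is used in place of the chi-square test mentioned in the text.

```python
import math
from dataclasses import dataclass

MIN_POPULATION = 5000  # lower bound the text suggests for rare diseases


@dataclass
class Region:
    name: str
    cases: int       # new cases in the study period (hypothetical counts)
    population: int  # residents at risk, e.g. from census data


def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k) for X ~ Poisson(mu), evaluated in log space."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def upper_tail(k: int, mu: float) -> float:
    """P(X >= k): probability of at least k cases when mu are expected."""
    return max(0.0, 1.0 - sum(poisson_pmf(i, mu) for i in range(k)))


def lower_tail(k: int, mu: float) -> float:
    """P(X <= k): probability of at most k cases when mu are expected."""
    return min(1.0, sum(poisson_pmf(i, mu) for i in range(k + 1)))


def classify_regions(regions, alpha=0.05):
    """Label each sufficiently populated region as hot-spot, cold-spot or neither."""
    total_cases = sum(r.cases for r in regions)
    total_population = sum(r.population for r in regions)
    if total_cases == 0:
        return []  # no cases anywhere: nothing to compare
    reference_rate = total_cases / total_population  # crude rate, whole study area

    results = []
    for r in regions:
        if r.population < MIN_POPULATION:
            continue  # sparsely populated areas give unstable rates
        expected = reference_rate * r.population
        if upper_tail(r.cases, expected) < alpha:
            label = "hot-spot"
        elif lower_tail(r.cases, expected) < alpha:
            label = "cold-spot"
        else:
            label = "as expected"
        results.append({
            "name": r.name,
            "rate_per_100k": 1e5 * r.cases / r.population,
            "observed_over_expected": r.cases / expected,
            "label": label,
        })
    return results


# Hypothetical example data; region D is dropped by the population filter.
regions = [
    Region("A", cases=30, population=100_000),
    Region("B", cases=2, population=100_000),
    Region("C", cases=10, population=100_000),
    Region("D", cases=3, population=1_200),
]
for row in classify_regions(regions):
    print(row)
```

In this made-up data set, region D would show a crude rate of 250 per 100,000 from only three cases, which is the kind of inflated rate the minimum-population rule is meant to exclude. In practice the expected counts would come from standardized rates, and the resulting labels would be mapped in GIS software rather than printed.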
2.4.2. Limitations and bias
As illustrated in this chapter, studying the spatial patterns and geographical distribution of diseases has many benefits, including the identification of disease clusters. This can be a powerful tool to help identify disease triggers and to better allocate financial and logistic resources for the management of these medical conditions. When the analysis is conducted properly, results are often specific. However, as with any type of analysis, one must be aware of the potential limitations and intrinsic bias of the method. First, when analyzing clusters of patients in a given geographical region, there is a possibility that at least some of the observed clusters occur by chance alone. Second, when studying the incidence of rare diseases in small regions, it is imperative to restrict the population analysis to areas with at least 5000–10,000 residents per geographical area to reduce false-positive hits. Third, association does not always imply causality: extensive additional field and experimental work must be performed to link identified associations causally with a given disease. Finally, one must be careful when directly comparing different geographic clustering studies, as differences in the inclusion criteria, statistical methods or intrinsic characteristics of the populations at risk can produce divergent results.
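The caution about clusters arising by chance can be made concrete with simple arithmetic. The sketch below is an illustration under stated assumptions rather than a procedure taken from this chapter: it assumes that every region is tested independently at the same significance level and that no true cluster exists. The function names are hypothetical, and the Bonferroni correction shown is one common, conservative adjustment, named here only as an example.

```python
def chance_of_any_false_cluster(n_regions: int, alpha: float = 0.05) -> float:
    """P(at least one region looks significant) when no true cluster exists."""
    return 1.0 - (1.0 - alpha) ** n_regions


def expected_false_clusters(n_regions: int, alpha: float = 0.05) -> float:
    """Average number of regions flagged purely by chance."""
    return n_regions * alpha


def bonferroni_threshold(n_regions: int, alpha: float = 0.05) -> float:
    """Per-region p-value cutoff that keeps the overall error rate near alpha."""
    return alpha / n_regions


for m in (1, 20, 100):
    print(
        m,
        round(chance_of_any_false_cluster(m), 3),
        round(expected_false_clusters(m), 2),
        bonferroni_threshold(m),
    )
```

Under these assumptions, testing 100 regions at the 0.05 level would flag about five regions by chance and would almost certainly flag at least one, which is why an isolated significant region calls for follow-up rather than an immediate conclusion.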
The applications of cluster studies in medicine have developed rather rapidly in recent decades. These will enable us to focus on studying risk factors and possible etiologic triggers of rare cancers and other conditions. Furthermore, this work can help make informed decisions regarding resource allocation and promote the development of primary prevention programs.
The authors would like to sincerely thank both Dr. Linda Moreau and Dr. Elham Rahme for their generous support and valuable advice.
Conflict of interest
The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this book chapter.