Time Series from Clustering: An Approach to Forecast Crime Patterns

This chapter presents an approach to forecast criminal patterns that combines the time series from clustering method with a computational intelligence-based prediction. In this approach, clusters of criminal events are parametrized according to simple geometric prototypes. Cluster dynamics are captured as a set of time series. The size of this set corresponds to the number of clusters multiplied by the number of parameters per cluster. One of the main drawbacks of clustering is the difficulty of defining the optimal number of clusters. The paper also deals with this problem by introducing a validation index of dynamic partitions of crime events that relates the optimal number of clusters with the foreseeability of time series by means of non-linear analysis. The method as well as the validation index was tested over two cases of reported urban crime. Our results showed that crime clusters can be predicted by forecasting their representative time series using an evolutionary adaptive neural fuzzy inference system. Thus, we argue that the foreseeability of these series can be anticipated satisfactorily by means of the proposed index.


Introduction
A crime is an event that emerges from opportunities configured by the interaction of offenders, victims, and the surrounding environment [1]. The process behind a crime event has a decisional nature which evaluates the benefits and risks for the offender [2]. Because of the social value in understanding and preventing crime, it has been studied from different perspectives. It has been noted that crime is a complex phenomenon [3] since criminal activity is connected to the complex dimension of social systems and their actors [4].
Environmental criminology recognizes that crime is not uniformly distributed over space, time, or society [5]. Finding rules that explain the non-randomness of crime dynamics has become an intense research area. Typically, stochastic process analysis has been used in the study of crime dynamics [6][7][8]. However, other epistemological approaches, like fuzzy set theory [9], non-classical topology [10], and complex network theory [11] have been also applied to describe this phenomenon. partition quality index, and describes the forecasting approach. Section 3 presents our results for the two study cases. Section 4 discusses our main findings and presents some conclusions.

Method
The overall TSC method is depicted in Figure 1. Reported spatial events are organized and filtered in frames according to a time scale (i.e., daily, weekly, etc.). A clustering algorithm is applied over these data frames in order to detect and track spatial patterns. These clusters are synthesized in terms of relevant spatial variables computed from each data frame, resulting in several time series. This process is repeated until the maximum number of clusters is reached. Non-linear signal analysis is used to compute a foreseeability index for all time series. The optimal number of clusters is selected as the one that minimizes this index. Selected time series are characterized in order to perform their forecasting. These stages are described in detail as follows.

Data organization
Reported criminal events are grouped in time frames. These events must have a specified date, and longitude and latitude of occurrence to be organized. Time can be defined daily, weekly, monthly, etc., depending on the number of events in the database. Time frames can be generated independently (i.e., without sharing any events) or with some amount of overlapping (i.e., sharing some events).

Clustering reorganization algorithm
Following data organization, clustering of criminal events in a given number of clusters is performed. The result is a set of time series that represents the spatiotemporal dynamics of crime. The number of time series is equivalent to the predefined number of clusters C multiplied by the number of parameters per cluster. It is important to ensure that these resulting time series have spatial and temporal structure. The clustering reorganization algorithm (CRA) was proposed in [15]. This algorithm ensures that the order of the identified clusters remains the The time series from clustering method uses a clustering reorganization algorithm over reported data to produce clusters of crime events that evolve in time and space. The evolution is represented by means of time series of the clusters' parameters. A designed index relates the number of clusters with the foreseeability of time series by using non-linear analysis. same throughout the time. In CRA, the clusters are identified by the Fuzzy C-Means algorithm (FCM) [19]. The outline of clusters can also be determined by several other clustering algorithms, such as the Gustafson-Kessel [20], among others. The identification of the clustering algorithm is related to the possible prototypes that can be found in the data frames. These prototypes are related to the urban form where the events take place [11].
The FCM and other algorithms alike initialize their parameters randomly. In this case, the centers of the clusters, their order and the membership values vary from one clustering experiment to another. Because of this random initialization, the clusters identified will not remain in the same zone, nor will the order of the fuzzy partitions created be preserved throughout time.
Therefore, if a study of C clusters is carried out, there must be certainty that the order of the partitions is maintained. It was considered for this case that the spatiotemporal trends in crime tend to be regular due to the normal population behavior. Thus, if a cluster is identified in a time frame, call it C l , then in all future time frames clusters must be identified near that previously identified cluster. Euclidean distance is taken as a basis for the evaluation between the centers of the clusters of the different time frames (Time frame 0 and Time frame k). The CRA obtains a time series from clustering in five steps. These steps are: Initialization, Iteration, Distance Evaluation, Discarding non-minima, and Assignment.
First, in the initialization stage, the CRA requires the number of clusters to be set, the k-th time frame in which the dataset is in, as well as, the center guide matrix. The center guide determines the order of the clusters in the future time frames. As a second step, the CRA ensures that throughout all time frames, the number of clusters is maintained. In this step, the algorithm verifies that for each reorganization process carried out, each center identified in the time frame k is assigned to a different cluster in the center guide (time frame 0). If the number of clusters is not maintained, this stage is responsible for executing the FCM algorithm once more to re-evaluate the organization of the clusters.
Third, in the distance evaluation stage, the process to measure the Euclidean distance d k ð Þ ij between the center of clusters identified in time frames t 0 and t k , where i and j represent the cluster order of the k-th frame and the first frame, respectively, takes place as follows: where d k ð Þ ij is the Euclidian distance between the center of the cluster j in time 0 and the cluster i in time k. Based on this distance evaluation, a criterion for the reorganization is established, where the clusters order is assigned according to the minimum distance of each cluster to the centers in the center guide matrix.
Even though clusters will tend to occupy the same zones, in certain cases there may be some clusters that are identified further away from their usual locations. To prevent the CRA confusing the cluster organization, the "discarding non-minima" stage is introduced. In this stage, it is assumed that the i-th center cr i in time frame k to the closest j-th center cr j in the time frame 0 is selected for the order in which the i-th cluster is organized.
Finally, the assignment stage takes place once the iteration condition is met. The centers and the membership values have already been assigned in the correct order, according to the identification of the clusters in the first time frame. This stage assigns the correct order to the variables of the centers of the clusters as well as to the membership value of each event. By doing so for each time frame, once the whole time window is processed there will be a set of time series that represents the spatio-temporal dynamics of the groups that have been identified by the clustering algorithm.
An example of three time series that surrogate the dynamics of a cluster is depicted in Figure 2. The spatial dynamics of a cluster is presented as a circle that moves in different frames and whose radius also changes. The dynamics of this cluster is summarized by three time series: X t ð Þ displacement of the cluster centroid in dimension x, Y t ð Þ displacement of the cluster centroid in dimension y, and R t ð Þ radius of the cluster measured as the Euclidean distance from its centroid to the farthest criminal event that belongs to the cluster.

The MIG index
The memory, information, and geometry (MIG) index is proposed in this work as a scalar that quantifies the predictability of a time series obtained from the CRA. The index is constructed as follows: The predictability of a time series becomes more plausible as the MIG index is minimized. M evaluates the amount of non-linear correlation inside the time series (i.e., Memory). I is an indicator of how much information is produced by the signal. G accounts for the size of the phase space in which the dynamics can be embedded (i.e., Geometry).
In practical terms, the MIG index for a time series S t ð Þ can be evaluated from statistics grounded in non-linear signal processing and chaos theory as: where λ corresponds to the estimated Largest Lyapunov Exponent (LLE) of S t ð Þ, D is the size of the embedding space of the series obtained from the estimation of Example of time series that describe a cluster dynamics. Parameters of a circular cluster tracked by the CRA evolve in time producing three time series that correspond to: Horizontal displacement X t ð Þ, vertical displacement Y t ð Þ, and radius of the cluster R t ð Þ.
the False Nearest Neighbors in the attractor dynamics of the signal and τ is the correlation lag located in the first minimum of the mutual average information of S t ð Þ. A detailed explanation about the computation of these quantities can be found in [21], while a practical estimation of LLEs is proposed in [22].

Optimal number of clusters
Dynamics of the i-th cluster is represented by three signals (i.e., X i t ð Þ, Y i t ð Þ, and R i t ð Þ), then the MIG index q C for the dynamics of a partition with C clusters is computed as follows: The optimal number of clusters C opt is obtained by minimizing q C : where C max is the largest number of clusters considered in the optimization process.
The minimum MIG index q opt C suggests that in average the time series produced when C j ¼ C opt is more predictable than in any other case. Thus the dynamics should be studied over 3 * C opt signals.

Time series characterization
MIG index is proposed as an a-priori estimation of the predictability of a time series produced by the CRA. The signals that belong to the most and least predictable scenarios should be forecasted so the MIG validity can be corroborated, however it is computationally expensive. Instead, these signals are characterized through several statistics (independent from the MIG index) in order to select some as the most representative signals for each scenario.

Characterization
Characterization is performed through a selected group of statistics: the first and second estimated statistical moments of the distribution p of the series S t ð Þ, the Shannon entropy and the number of lags beyond which the autocorrelation of S t ð Þ is effectively zero. The estimators [23] of the two first statistical moments are computed as follows: where S k is the discrete representation of S t ð Þ and L is the number of samples. The third measure, the entropy described in (8), is related to the information uncertainty inside the series. A high entropy value is interpreted as a lowpredictability variable.
The autocorrelation function is shown in Eq. (9). The proposed measure represents a confidence bound of the number of regressors with a linear dependence on the signal. It establishes the grade of possible foreseeability in the basis of its periodical behavior as long as forecasting crime is given in terms of stochastic processes [24].
S the average of the signal. The number of lags Lg is obtained when c h drops below 0:2c 0 for h>0.

Series analysis
The chosen statistics are applied over time series whereby a four-dimensional vector (Eq. (10)) is obtained per signal: where S is the class identification of the corresponding series: X-displacement of the cluster in dimension x, Y-displacement of the cluster in dimension y, and R-radius of the cluster. These vectors are arranged in matrices as in Eq. (11). In this way, the analysis is done for each matrix separately: where C represents the number of clusters.
• Normalization: the aim is to find the most representative signals (i.e., X, Y, R). The statistics are considered as a way to describe each series, but most of them have a dependence of the scale avoiding a comparison in value. Consequently, a difference of one order of magnitude could be significant in the lags of autocorrelation, but insignificant in the mean. Therefore, normalization is a way to establish a common reference. In this case, the normalization interval is 0, 1 ð avoiding the value 0. Each M S matrix is independently normalized.
• Finding the average vectors: an average vector is built from the normalized matrices. As per the last step, this process is independently applied. It is important to emphasize that in a determined clustering, there is an average vector v S for each M S : • Determining distances: the distance between each row of the matrices M S and the corresponding average vector is computed. Thereby, the Euclidean distance is calculated for all vectors v S i as follows: The matrices D S are organized by collecting the distances calculated from the matrices M S .
• The fittest vectors: finally, the series corresponding to the position of the minimum value in each matrix D S is chosen as the most representative signal. Thus, there are three representative signals obtained from D X , D Y , and D R which are referenced as X f t ð Þ, Y f t ð Þ, and R f t ð Þ, respectively. These three signals are taken in the forecasting stage as representation of the clustering dynamics for a given number of clusters. Particularly, the representative signals for the best and worst conditions of the MIG index are considered in this work.

Evolutionary-fuzzy forecasting of representative time series
Forecasting of representative time series is carried out by means of a custom memetic algorithm [25,26]. The heuristic was proposed in [27] as the combination of the differential evolution algorithm [28] and the adaptive neuro-fuzzy inference system (ANFIS) [29]. The general flow graph of the algorithm is shown in Figure 3.

Differential memetic neuro-fuzzy algorithm
A memetic algorithm allows to take advantage of both global and local search. In this manner, the optimization space can be explored widely and deeply. The differential evolution algorithm (highlighted in orange) is chosen as the global optimizer, on account of its use for optimizing multidimensional real-valued functions. In addition, it has been successfully applied for model optimization of complex systems [30]. Several variants of this algorithm have been proposed in literature, taking into account the multiplicity of elitism strategies, chromosome representations, and mutation operators. ANFIS (highlighted in pink) is implemented as the supervised learning strategy of the memetic algorithm. It uses the gradient of the objective function on the basis of its differentiability to search for solutions around a locality of the optimization landscape. ANFIS is supported on Sugeno-type fuzzy systems with gaussian membership functions in the inputs [29,31]. In addition, the proposed fuzzy system has m rules, which also corresponds to the number of membership functions in each input and the amount of singleton fuzzy sets in the output. Each rule relates the r À th group of membership functions in the inputs with the ith singleton value in the output [30].
As shown in the flowgraph in Figure 3, when the population is adapted by ANFIS, it is necessary to re-evaluate the solutions and later, according to the fitness function, select them. This re-evaluation must be carried out since ANFIS relies on the RMSE to evaluate the solutions but the memetic algorithm relies on the fitness function. Thus, an interesting individual for ANFIS may not necessarily be good according to its fitness score.
In order to guarantee population diversity, and avoiding a fast-convergence caused by ANFIS, the diversity calculator considered is the FANO factor [26], which calculates the average variation of the mean in the membership functions according to Eq. (15). The population is re-initialized if the FANO factor is greater than a given threshold: where And n * m is the number of membership functions in the inputs, D li is the average of the i-th membership function of the l-th individual, and gen is the current generation.

Fitness function
The root mean squared error (RMSE) index used by ANFIS may not be appropriate for guiding the global search of an evolutionary algorithm [30] while dealing with complex behaviors. Thus, a problem-oriented fitness function to evaluate candidate solutions must be designed. This function is significant because it provides the criteria to judge a solution and its performance. In addition, it is imperative to include several performance indicators to check solutions in a more precise and proper manner. This process is carried out by minimizing the following expression: The fitness function is composed of three indexes. Each one has been chosen to guide the evolution regarding features in the best solution. Thus, many possible solutions that do not have a near-zero mean error but can forecast the series properly, will be explored.
The Nash-Sutcliffe efficiency (NSE) is conceived as an overall performance measure. It is a normalized statistic that determines the relative magnitude of the residual variance compared to the measured data variance [32]. A value of 1 corresponds to a perfect match of the modeledd to the observed data d.
The mean absolute error (MAE) represents a clear interpretation of the absolute difference between the series and its forecasting, relative to the series [33]. It is chosen instead of RMSE because the series has several peaks and it is necessary to rest importance on them in the forecasting. If it was not done, the fitness function would penalize errors in the lowest values (i.e., near the mean) more than in the highest ones.
Prediction of change in direction (POCID) is a non-linear index included to ensure that a forecast that does not fit the series well, but follows its direction changes reliably, gets a good score [34]. As a multiplicative inverse, it causes an increase in the fitness function as large as how far the fitness score is from the optimal value. where NSE ¼ 1 and MAE ¼ 0 represent the two roots of f , hence these are enough reasons to consider a solution as the best, but the POCID operates as a non-linear penalizer.

Results
The method outlined in the previous section was applied to forecast criminal patterns in two cities: San Francisco, USA and Bogota, Colombia. Results obtained from evolved fuzzy predictors are described in this section. In addition, evidence is presented from these cases about the validity of the MIG index for setting the optimal number of clusters of a dynamic partition of crime events.

Data organization
The criminal dataset for the city of San Francisco was obtained online through the open data services provided by the local government of the city. The dataset used in this work registers about 70,000 criminal events between years 2003 and 2015. Each criminal register contains attributes such as latitude, longitude, time, date, type of crime, among others. For the purposes of this study, only house burglary registers were considered, and from each register only latitude, longitude, and date were taken into account. Each spatial register was projected by means of the universal mercator system taken the location of the city of San Francisco as reference. Relative distances were expressed in meters. Time frames of criminal activity were created by aggregating the crime events of the last 7 days. Frames iterate from 1 day to the next. A total of 3195 time frames were produced.

Optimal number of clusters
Optimization of the MIG index was carried out over time series from clustering with C ¼ 2:::16, as shown in Figure 4. There was not be a clear criteria to set the maximum number of clusters to be considered in the optimization of any clustering validation index. In this case, the maximum was determined by taking into account that San Francisco city is divided in 10 police districts. Thus, the optimization was computed considering just three additional clustering settings (i.e. C ¼ 12, 16,20). It was observed for these values that the MIG index diverged too fast, since in C ¼ 20, it reached a value greater than the first maximum by about two orders of magnitude. The optimal MIG index is obtained for C ¼ 4, whose value is around 30% of the maximum. However, the index found in C ¼ 5 is quite similar to the optimum. It is expected to find a similar predictability potential for these two cases.
The difference between the extreme cases may be explained by examining Figure 5. Note that for C opt ¼ 4, big areas are covered so that clusters concentrate a greater number of crime events. Therefore, the spatial variability of clusters in the optimal case is less subject to fluctuations than that of the worst case. In other words, the dynamics of clusters in the optimum case would exhibit the lowest level of disorder.
An example of cluster dynamics for house burglaries in the city of San Francisco, USA. In this example, four clusters were tracked with the clustering reorganization algorithm (CRA).

Characterization of time series
According to the MIG index, only clustering results for C ¼ 4 and C ¼ 16 groups were considered. These cases represent the most and least predictable dynamic partitions. It is necessary to choose a representative time series (Rep) in the two scenarios, denoting them as the optimal scenario (i.e., minimum MIG index) and the control scenario (i.e., maximum MIG index). Table 1 summarizes some results from this process. In column S, the field "[i]" refers to the cluster number. As shown in this table, selected series are from different clusters since the first aim of the forecasting in this study is to have a preliminary validation of the MIG index instead of developing an exhaustive forecast. Representative series (Rep) are selected by finding the minimum distance between their characterization vector and the mean vector of a class S (i.e., X, Y, and R) according to Eqs. (10)- (14).

Forecasting results
Once the series were selected, the memetic algorithm was configured to compute the forecasting with 70% of the data (i.e., 2236 days) and compute the  validation with the other 30% of the data (i.e., 959 days). In order to initialize the algorithm in such a way that the search space could be explored widely with the least possible computational cost, guaranteeing the simplicity of the solutions, a set of experiments was launched for the combination of parameters presented in Table 2. The number of generations and population size were selected as the maximum values in which at least a difference of 0.1% between consecutive generations was found in the fitness function of the fittest solutions. Number of Epochs was chosen to avoid frequent loss of diversity so that the FANO factor was close to the selected threshold. Regarding the fuzzy predictor, the number of regressors in the inputs and the number of rules were selected by finding the maximum values in which the Pearson coefficient for the forecast and real data did not change in more than 2%.
From selected configuration values, due to the random initialization of the memetic algorithm, over 10,000 experiments were run per series. Table 3 summarizes the performance of the best fuzzy predictors found by the memetic algorithm. Pearson's correlation coefficient between the model and data for the optimal scenario was the greatest in both training and validation. The same observation can be stated for the Nash coefficient. The relative difference between Pearson coefficients of the optimal and control scenarios was about 17% in validation whereas for the Nash index, the relative difference was 26%. Regarding the estimated LLE, smaller values were obtained for the optimal scenario, and even a negative LLE was obtained. Visual results are depicted in Figure 6. Selected time series exhibited an interesting texture, which was more accentuated in the control scenario since more peaks appeared randomly and recurrence was not easily observed. These series are a representative sample of the entire, set which contains 12 time series (four clusters each one with three parameters). The first three parameters were used to configure the differential evolution algorithm whereas the last two were used to adjust the Adaptive Neural Fuzzy Inference System. Table 2.
Parameters of the evolutionary fuzzy predictor used for the city of San Francisco, USA.

Data organization
The dataset for the city of Bogota, Colombia was provided by the Non-Governmental Organization (NGO) "Fundacion ideas para la paz," which contains about 25,000 events of cellular phone robbery registered between the years of 2012 and 2015. Criminal registers contain X-coordinates, Y-coordinates, and dates of events. No transformation was required since the coordinates were already expressed in a geographical system adapted to the city. All differential spatial measurements were processed in meters. As in the previous case, time frames were created by the aggregation of the criminal events recorded in the last 7 days. A total amount of 1417 time frames was generated.

Optimal number of clusters
According to Figure 7, the MIG index for Bogota series exhibited a global minimum at C opt ¼ 3, showing non-convex behavior with respect to the number of Dynamic partitions with minimum (4 clusters) and maximum (16 clusters) MIG index were considered. Table 3.
Forecasting indices for representative time series of two dynamic partitions (house burglaries in the city of San Francisco, USA). clusters. Note the similarity of the optimal MIG index and the one obtained for C ¼ 12. The optimization process was carried out for C ¼ 2:::20; however, the last result with C ¼ 20 was omitted since the MIG index grew exaggeratedly. A sample of the cluster dynamics is presented in Figure 8 for the optimal case. However, it is not a straightforward task to infer the result of the optimization process from this figure. Centroids of clusters are moving, as can be noticed from the change in shape of the connecting polygon. Note that the radiuses of the clusters changed in size so different areas were covered from one time frame to another.

Characterization of time series
Characterization of time series of dynamic partitions are summarized in Table 4. The field "[i]" refers to the cluster number. For these series it is supposed a-priori that the three-group clustering is more predictable than the five-group, considering the respective entropies. This result is in accordance with the MIG index optimization. Regarding the radius series, the number of lags is higher with respect to the series collection. Representative series (Rep) were selected by finding the minimum distance between their characterization vector and the mean vector of a class S (i.e., X, Y, and R) according to Eqs. (10)- (14). Table 5 presents the combination of parameters that were used for running the optimization of the fuzzy forecasters. In this case, 70% of generated samples (i.e.,  992 days) were used for training, and the remaining 30% for validation (i.e., 425 days). The best forecasting results for Bogota series are presented in Table 6. The Pearson correlation coefficient between the model and real data was greater for the optimal scenario (i.e., three clusters) with respect to the control scenario (i.e. five clusters) in both training and validation. The relative difference was about 18 and 12%, respectively. In terms of the Nash index, the optimal scenario exhibited the highest performance with a relative difference of 31%. Estimated LLEs were smaller in the optimal scenario with a negative LLE in the case of the X t ð Þ signal. Figure 9 depicts visual results. It can be seen that predicted signals in the optimal scenario are correlated to the real ones. Signals in the control scenario exhibited more random peaks, although the recurrence is similar to the optimal case. These series are a representative sample of the entire set, which contains nine time series (three clusters each one with three parameters). Table 4.

Forecasting results
Features of representative time series obtained from the TSC method for cellular phone robbery in the city of Bogota, Colombia, considering three clusters.

Parameter Range
Step Value Number of generations 100-400 50 200 Population size 10-50 10 30 Epochs number 1-20 1 4 Number of regressors 3-12 1 8 Number of rules 4-24 4 8 The first three parameters configured the differential evolution algorithm, whereas the last two were adjusted for the Adaptive Neural Fuzzy Inference System. Dynamic partitions with minimum (three clusters) and maximum (five clusters) MIG index were considered. Table 6.
Forecasting indices for representative time series of two dynamic partitions of cellular phone robbery in the city of Bogota, Colombia.

Conclusion
Qualitatively speaking, similar results were obtained for the two study cases. In both cases, the forecasting of time series from the optimal scenario outperformed that of the control scenario in terms of two statistical indices (i.e., Pearson correlation and Nash coefficients) and one from chaos theory (i.e., LLE). Therefore, we argue that the MIG index might be useful to evaluate the goodness of a dynamic partition of crime events. Moreover, it may give a confidential a-priori insight about the predictability of series synthesized from TSC. Hence, this method produces coherent series that preserve a temporal structure, which a forecasting method can take advantage of.
TSC series for both cities exhibited an interesting behavior in terms of their texture. No evident periodicity was observed and abundant peaks appeared over the observed time window. Positive LLEs computed in these signals revealed the presence of a possible chaotic nature of the phenomena. Chaotic texture of these series speaks about non-stationary spatial crime patterns that evolve continuously producing information. Thus, this observation reflects a footprint of complexity for urban crime as noted in previous studies [35].
As a future work, the proposed method will be tested over other dynamic phenomena characterized by non-uniform sampling of relevant variables in both space and time. The method will be also refined by considering other techniques such as auto-encoder deep neural networks among others.

Conflict of interest
The authors declare no conflict of interest. Forecasting results for the three representative time series of the cluster dynamics (cellular phone robbery in the city of Bogota, Colombia). Series were obtained from a dynamic partition with three clusters.