Novel Formulation of Parzen Data Analysis

The Parzen analysis associates Gaussian kernels with each data point, thus obtaining a density function which may be viewed as a possible artificial generator of the data. This probability function can be decomposed into the product of two components, weight (W) and shape (S), which represent different aspects of the data. We demonstrate how this naturally leads to a formalism of fields in data space, which are interconnected through relations in one-dimensional scale space, corresponding to the common Gaussian width. We discuss the connection of this formalism to different clustering procedures such as quantum clustering (QC) and mean shift (MS). We demonstrate on various examples the importance of these concepts in the analysis of natural data as well as in image analysis in two or three dimensions.


Introduction
Unsupervised machine learning has led to a wealth of clustering methods over the past few decades [1]. One of the important early ideas is that of the Parzen window distribution [2]. It has been introduced in 1962, as a kernel density estimate of a distribution function underlying measured data, and still serves as the basis of clustering algorithms in pattern recognition [1,3]. Recently, it has been discovered [4] that the Parzen probability function can be decomposed into two components, weight and shape, which represent different aspects of the data. Weight, as its name implies, describes the semi-global strength of the distribution, whereas shape represents local properties which come to light once the bias of the weight is being removed. Moreover, Àlog(shape) coincides with a potential function V, which has been previously introduced in quantum clustering (QC) [5]. The cluster centers in QC correspond to minima of V. An alternative method, mean shift [6,7], views the maxima of the probability function as the appropriate candidates of cluster centers. These two different points of view can now be studied and compared within a unified formalism [4].
Here we discuss the novel connections of the Parzen distribution to its potential and show how both can be used for the analysis of data points, leading to alternative clustering possibilities and extracting interesting features from the data. A particularly interesting set of applications appears in image analysis. Scale-space image analysis [8] has developed from the Parzen kernel methodology discussed in [6]. Now it turns out that insights from the potential term, or the shape component, allow for novel applications which are relevant to medical and technical imaging.

The Parzen probability distribution and its potential function
Data analysis often involves dimensionality reduction and noise removal, as well as some other tools, which eventually lead to consider a set of preprocessed data points located in a d-dimensional Euclidean space x i ∈ R d , with possible positive attributes (e.g., intensities) I i . For this set, we define the non-normalized Parzen window function with Gaussian kernels as Following [4], we introduce a relative probability weight function representing the influence of the kernel at data point x i on any arbitrary point x: It obeys ∑ i I i p i x ð Þ ¼ 1 and allows for the definition of two new scalar functions over data space x, which are the potential and entropy fields Their difference is related to the Parzen probability function This can be rewritten as using [4] the concepts of weight and shape: ð Þ ≥ 0 , it follows that S x ð Þ ≤ 1:Moreover, both W and S are nonnegative. S is integrable over x and, as such, can also serve as a distribution. From the definitions of (1) and (3), one can derive [4] the Schrödinger equation which has been the cornerstone of the QC algorithm [5].

Interplay of scale and data space dependence
All the scalar fields over data space, introduced in the previous section, depend on the parameter σ, the scale of all Gaussian kernels. This dependence leads to further interesting relations between the Parzen probability function and its potential. Thus, from the definitions (1) and (3), it follows that where we keep the index σ which has been suppressed in the previous section. This relation displays a direct connection between the two scalar functions defining the probability and the potential. We proceed now to introduce a vector field D σ which is defined by and vanishes when the probability reaches its extrema in data space. Interestingly it is also related to the gradient of the potential function, through Hence we conclude that the potential reaches its extrema when D σ remains stationary with respect to variations of σ.
D σ may be expressed, in analogy with Eq. (3), as Its square U ¼ D 2 serves as an indicator function whose stationarity implies the existence of extrema of either the probability or the potential. Since U ¼ D 2 is nonnegative, U = 0 is a minimum in σ. It corresponds to extrema of ψ which are associated with D¼0: Other values of U which obey Eq. (12) are associated with extrema of V which occur whenever ∂ ∂σ D σ ¼ 0:Eq. (12) may be viewed as a statement concerning a set of points of interest in the data: all extrema of either the probability or the potential. In analogy with statistics, one may also view this equation as an inference method finding the parameter σ which leads to points of interest at given values of x.
Although all extrema may be regarded as points of interest, some are of more interest than others: extrema that remain fixed in x for a range of scale values, which is large compared with the range of scales of other points of interest. This criterion, introduced by Roberts [9], allows searching for scales which correspond to natural properties of the data. Thus it subserves the search for good clustering of the data [4,5,9].
Finally, we wish to point out that ψ is not a properly normalized distribution function. A proper probability function, whose integral is 1, is defined by where N ¼ ∑ i I i . We note that ψ and V obey a joint integration constraint [10] 1 N This may be interpreted as a constraint on the expectation value of the potential function in data space.
Examples of the behavior of log ψ and of V are demonstrated in Figure 1 for a data set of 9000 observed galaxies (with redshift in the domain 0.47 AE0:005) regarded as points in spherical angles θ and φ within some limited range. Whereas for σ ¼ 2 (in units of angle degrees), the two fields exhibit many extrema; there exist clear differences for larger sigma, for example, σ = 10, where log ψ has one maximum, while V displays several minima. This figure is taken from [10], a paper which contains a detailed and expanded formulation of the analysis presented in this section. 4. Applications to clustering, image analysis, and more 4.1 Anomalies Figure 1 serves as an example of different behaviors of log ψ and of V. We wish to stress that when studying this system with large σ, the probability distribution is smooth; nonetheless underlying structure is observed in V. This means that, for a given σ, representing the large-scale behavior with one Gaussian, as one may be tempted to do after seeing the probability distribution, is wrong as demonstrated by the structures observed in V. On the other hand, the probability function tends to a smooth limit for σ = 10, whereas the fluctuating V changes with σ; hence V may represent random fluctuations in the data. However, comparing with the raw data in Figure 1a, we can be convinced that structure of the type discovered by V exists in the data. If these are fluctuations or not, one cannot tell from a single set of data.
A generally important question is if, within changing patterns of V, there exists one (or some) which remains relatively stable as function of σ. Such a structure may be viewed as a possible anomaly in the data. It is therefore advisable to study V when looking for anomalies.

Clustering
Clustering methodologies based on maximization of the probability and minimization of the potential can be defined by letting replica of data points move in these directions. These methods are known as mean shift (MS) and quantum clustering (QC) correspondingly. A recent review of MS techniques has been presented in [11]. Analyzing the same data with the same width-parameter σ leads to different clustering results for these two different methods, as is expected from Figure 1.
For illustration of clustering based on these different methods, we consider the crab data set which is included in Ripley's textbook [12]. It consists of 200 instances belonging to four equally sized classes and is defined in a five-dimensional parameter space. Performing PCA and restricting ourselves to the 2D plane defined by PC2-PC3 lead to a challenging clustering problem which has been discussed by [13], when introducing support vector clustering (SVC), and by [5] when introducing QC. It has been used in other papers employing variations of QC, such as the recent study [14]. Here we will show the results of [4] who applied to these data three clustering methods: Maximal Shape Clustering (MSC) which coincides with QC, Maximal Probability Clustering (MPC) which coincides with MS, and Maximal Entropy Clustering (MEC). The quality of all three methods may be judged by applying the Jaccard score J ¼ n 11 n 11 þ n 10 þ n 01 where n 11 is the number of pairs of points which belong together both in the same class (accepted as "ground truth") and in the same cluster, while n 10 + n 01 are numbers of pairs which belong to the same class but different clusters and vice versa. This test is performed in Figure 2a, demonstrating that QC wins the competition for a wide range of σ values. The expected asymptotic value is J = 98/398, befitting one cluster and four classes.
Another comparison is being made in Figure 2b. This follows Roberts' criterion [9] that the preferable clustering method is the one which displays the most stable number of clusters with respect to variation of σ. This criterion is handy when the ground truth is unknown. QC excels also in this test, leading to a stable prediction of four clusters for a wide range of σ. This last figure also serve as a credibility test for Roberts' criterion.
In order to make the clustering results more intuitive, we display in Figure 3, also taken from [4], topographic maps of the different fields describing probability, weight, and shape, for σ = 0.7. The points in four different colors represent the four different classes. The topographic maps allow one to understand the clustering results which represent the outcome of gradient ascent applied to replica of data points which climb toward their nearest peak. Comparing the topologies of Figure 3 with the results for σ = 0.7 in Figure 2 leads to an understanding of why the three methods differ from each other.

Image analysis
A gray-scale image may be analyzed as a set of inputs associated with different pixels. In higher dimensional problems, such as 3D MRI data, the pixels are replaced  Topographic maps of probability, weight, and shape, for σ = 0.7. Reproduced from [4]. by voxels. Both may fit well into our analysis which starts with Eq. (1), associating a probability distribution with every image. One may then wonder if the weightshape decomposition of Eq. (6) can lead to any novel understanding of image analysis. In any practical application, non-normalized probability and weight may have very large amplitudes, yet shape will be limited to values ≤1. Nonetheless it carries some important characteristics: 1. Shape may be normalized to become a distribution on its own.

Shape serves as a generalized edge detector.
3. Shape is the basis of QC; hence the latter is very relevant to clustering of regions where shape is large. Alternative methods may be relevant when shape is small.
The first claim is trivial since S is limited to the range 0 ≤ S ≤ 1, and the Gaussian kernels are integrable. The second property is a result of Eq. (7) which shows that the potential is related to the second derivative of the probability. It has led to an interesting result in [4], demonstrating that line caricatures of images can be produced by thresholded shape drawings.
To demonstrate the third point, we display in Figure 4 the results of an analysis of a T2 MRI of the brain of a Macaque monkey [15]. Following the general procedure outlined above, and limiting ourselves to large relative values (thresholded distributions) of probability and shape, we find that the latter peaks in cortical regions, whereas the former peaks in internal regions of the brain, as demonstrated in Figure 4. Thus, a simple thresholding procedure allows one to easily segment the MR image, for the purpose of further analysis of the cortex by applying QC to the data in the large S domain. In Figure 5, we follow these conclusions [15] with a display of QC clusters projected onto the surface of the brain, leading to its Thresholded shape (red) and thresholded probability (blue) dominate different regions within the same MR image of a macaque brain, projected on its y-z plane. This analysis used σ = 3 in voxel units. Data outside the brain are due to artifacts and noise in the MR image. These results are due to [15], and they indicate that large shape components dominate cortical regions of the T2 MRI brain image. parcellation into cortical components which are derived by just computational image analysis.

Convolutional representation of V
When one analyzes data in a regular underlying structure, such as pixels m of an image I(m), the translational invariance of the Gaussian kernel allows one to use a convolutional description such as with K being a discrete representation of the kernel. This leads [4] to the following result for the potential where L = ÀK log K. Such 3D kernels were applied to brain MR images [15] leading to the results displayed in Figures 4 and 5.
Noting that Eq. (15) is reminiscent of a convolutional layer in a deep network [16], we hypothesize that it can be useful to incorporate intermediate layers with nonlinear filters such as Eq. (16), as additional non-trained pooling filters in deep networks.

Computational remarks
The clustering methodology which has been employed in the different examples shown above is the simplest flavor of gradient descent (or ascent). It calculates the relevant fields on the basis of the data points and continues with straightforward dynamics that have been applied to replica of data points, seeking the extrema of the fields. Various alternatives to this basic application exist. The most important one is hierarchical clustering, which allows for conceptual simplicity and saves computational complexity. Such methodologies were described and discussed in [11] and in [4].
Computational complexity is an obvious issue when working with large data sets. Thus, 3D MRI data may easily comprise 1 M points, whereas their Characteristic results of QC cortical clusters as mapped onto the surface of the brain and projected onto the x-y plane. This figure displays a map of the largest clusters of shape, each described by a different color. These results are due to Fisher [15]. manipulation within a system like MATLAB may well be limited to handling only 40 K points at a time [15]. One way to overcome such issues is to consider performing the analysis within extended voxels, for example, voxels containing three pixels in each direction in a 3D image problem. Within each new voxel, one may simply sum the intensity of points, leading to a new presentation of the data in the form of Eq. 1 on the smaller extended voxel space. Clearly one has to make sure that such an approximation does not harm interesting features of the data.
When analyzing other big data, no prior dimensional representation may be required. For the sake of noise reduction and computational complexity, it is advantageous to first apply relevant dimensional reduction, as provided, for example, by singular value decomposition (SVD) and principal component analysis (PCA). It is also important to make sure that the different axes are of similar scale, as shown in the example of Figure 3. When the data is still large, one may apply the trick of extended voxels described above. For very large data, one may also separate the data into several components, as is customary in supervised learning, to make sure that conclusions are not affected by the random choice of a subset of the data.

Conclusions
In the past (see, e.g., [3]), Parzen analysis has not considered the potential field V, which plays an important part in the understanding of different features of the data. In particular, V is sensitive to small changes in the Parzen probability by being related to its second derivative. It is also the basis of quantum clustering whose advantages have been demonstrated here as well as in many other investigations in the literature. The discovery of the weight-shape decomposition of the Parzen probability has led to a focus on shape and on the potential and allows for a meaningful comparative discussion of the different features which may be extracted from data.
Here we have defined a set of fields in data space which hopefully will turn out to serve as useful tools in future data analyses. They seem to be adequately applicable to image analysis, and we expect them to be particularly useful in biomedical and technical image analyses in three dimensions. When analyzing other data, where no visual display constraints exist, noise reduction and computational complexity call for preprocessing by dimensional reduction. Further reduction of computational complexity may be tried by employing extended voxels, which our technique can easily accommodate.
In summary, this extended Parzen method replaces any set of discreet data by a continuous set of fields in data space, with interrelations in scale space. It allows for investigating data properties in terms of these fields and their extrema.