Parallelized Integrated Time-Correlated Photon Counting System for High Photon Counting Rate Applications

Abstract
Time-correlated single-photon counting (TCSPC) applications usually deal with a high counting rate, which leads to a decrease in the system efficiency. This problem is further complicated by the random nature of photon arrivals, which makes it harder to avoid counting loss while the system is busy dealing with previous arrivals. In order to increase the rate of detected photons and improve the signal quality, many parallelized structures and imaging arrays have been reported, but this trend leads to an increased data bottleneck requiring complex readout circuitry and the use of very high output frequencies. In this paper, we present simple solutions that improve the signal-to-noise ratio (SNR) and mitigate counting loss through a parallelized TCSPC architecture and the use of an embedded memory block. These solutions are presented, and their impact is demonstrated by means of behavioral and mathematical modeling, potentially allowing a maximum signal-to-noise ratio improvement of 20 dB and a system efficiency as high as 90% without the need for extremely high readout frequencies.


Introduction
Time-correlated single-photon counting (TCSPC) is a mature and extremely accurate technique for measuring low-level light signals; it uses single quanta of light to provide information on the temporal structure of the light signal. The method was first conceived in nuclear physics [1] and was for a long time primarily used to analyze the light emitted as fluorescence during the relaxation of molecules from an optically excited state to a lower energy state [2]. Today, TCSPC is widely used in applications that require the analysis of fast, weak, periodic light events with a resolution of tens of picoseconds, such as diffuse optical tomography (DOT) [3,4], fluorescence lifetime imaging (FLIM) [5] and high-throughput screening (HTS) [6]. TCSPC is based on detecting single photons of a periodic light signal, measuring the detection times within the light period and reconstructing the light waveform from the individual time measurements once enough measurements have been accumulated. Traditionally, the TCSPC technique relied on vacuum tube technologies such as photomultiplier tubes (PMTs) and microchannel plates (MCPs). These mature technologies achieve very good performance, but they are expensive, cumbersome and fragile and require extremely high operating voltages, which makes them unsuitable for miniaturized portable TCSPC imaging systems. In recent years, single-photon avalanche diodes (SPADs) have gained wide popularity as a less expensive and more compact alternative to vacuum tube detectors. The integration of planar epitaxial SPADs in standard CMOS technology has significantly improved the level of miniaturization of SPADs and paved the way for SPAD arrays. These devices possess the typical advantages of microelectronic integrated circuits, such as small size, ruggedness, low operating voltages and low cost.
Furthermore, they can be implemented directly with the necessary associated circuits on the same chip to realize an integrated, ultrasensitive, high-speed and low-cost TCSPC imaging system. Many SPAD-based TCSPC systems have been successfully demonstrated lately, including state-of-the-art imaging sensors that integrate thousands of single-photon detectors on the same chip in standard CMOS technology [7,8]. Most integrated TCSPC systems consist of 1D or 2D arrays of SPADs with their associated electronics in the form of smart pixels, resulting in a trade-off between high photon detection efficiency and advanced electronic functionalities [9][10][11]. This approach allows for a better detection efficiency compared to a single commercial SPAD. However, such designs should be conceived so that the detection yield is optimized, i.e. so that they ensure an optimal detection efficiency and a limited counting loss probability. In this chapter, we present these two issues and propose methods to quantify and limit their effects based on mathematical and behavioral modeling.

A parallelized macropixel structure for SNR optimization
Single-photon avalanche diodes (SPADs) operate in Geiger mode. In this mode, the p-n junction is biased beyond its breakdown voltage; as a result, the electric field in the space-charge region is so high that a single charge carrier, ideally created by photoelectric interaction, is enough to generate a self-sustained avalanche. Indeed, unlike linear APDs, where stopping the light signal is enough to stop the avalanche, once an avalanche is triggered in an SPAD the current continues to increase until the component is destroyed by overheating. Therefore, the avalanche must be swiftly quenched by an associated circuit that senses the avalanche, stops it by reducing the reverse bias below the breakdown voltage so that the avalanche cannot sustain itself, and then returns the SPAD to its initial condition. The circuit used to accomplish these tasks is the quenching circuit, and its selection is not a trivial task, as it directly affects many of the SPAD performance metrics [12]. It is therefore important to choose a quenching circuit suited to the desired application so that it does not limit or deteriorate the SPAD characteristics.
Each SPAD with its associated electronics forms an independent pixel. The quenching electronics are the main part of the SPAD-associated electronics; however, other smart functionalities can also be included in the pixel. In particular, it is possible to use a gating signal to activate or deactivate the SPAD. This functionality is traditionally used to operate the SPAD in gated mode, where it is enabled only during the gate-on window and disabled during the gate-off time interval so that absorbed photons do not trigger an avalanche. It can also be used to deactivate SPADs showing an abnormal behavior that affects the system yield. In [13], a macropixel architecture that makes use of this approach was implemented. The macropixel (Figure 1) is divided into eight pixels that can be activated or deactivated based on their activity levels. This option was added to ensure that the SNR is not affected by an undesirable effect that could decrease the detector's efficacy.
The signal delivered by a photon counting detector is affected by temporal fluctuations that follow a Poisson distribution. If $N$ is the average number of detected pulses, it includes a fluctuation expressed by the shot noise $n = \sqrt{N}$, while the other electronic noise can be ignored thanks to the virtually infinite gain of the SPAD. The total signal $N$ is given by $N = N_{ph} + N_d$, where $N_{ph}$ is the total number of detected photons and $N_d$ is the number of counts caused by dark counts. The associated shot noises are

$$n_{ph} = \sqrt{N_{ph}}, \qquad n_d = \sqrt{N_d}.$$

The number of photons is measured by subtracting the results of two measurements: one for the total number of counts ($N_{ph} + N_d$) and the second for the dark ones ($N_d$). In this case, the variances of the two measurements add, and the total noise is given by

$$n_{tot} = \sqrt{(N_{ph} + N_d) + N_d} = \sqrt{N_{ph} + 2N_d}.$$

If $N_d$ is considered as a constant equal to its mean value $\overline{N_d}$ instead of being measured each time, the variance of this term goes to zero, and thus the number of photons and its associated noise are given by

$$N_{ph} = N - \overline{N_d}, \qquad n_{tot} = \sqrt{N_{ph} + N_d}.$$

Figure 1. Simplified schematic of the parallelized macropixel presented in [13].
Therefore, the signal-to-noise ratio is

$$SNR = \frac{N_{ph}}{\sqrt{N_{ph} + N_d}}.$$

In the case of a multi-SPAD macropixel, the SNR of the macropixel structure is the sum of the SPAD photon counts divided by the total noise component:

$$SNR = \frac{\sum_i N_{ph,i}}{\sqrt{\sum_i \left(N_{ph,i} + N_{d,i}\right)}},$$

where $N_{ph,i}$ is the number of detected photons and $N_{d,i}$ is the dark count of the ith SPAD (SPAD$_i$) in the macropixel. Consequently, the signal-to-noise ratio can be optimized by switching SPADs on or off so that pixels showing undesirable activity levels are deactivated. These undesirable pixels can be 'hot pixels', showing an above-average dark count rate, or 'dark pixels', showing below-average light sensitivity.
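The macropixel SNR described above can be sketched in a few lines of Python. This is only an illustration: the function name `macropixel_snr`, the `enabled` mask and the count values are hypothetical, and the counts are assumed to be accumulated over the same integration window.

```python
import math

def macropixel_snr(n_ph, n_d, enabled=None):
    """SNR of a macropixel: summed photon counts over the shot noise of all counts.

    n_ph[i], n_d[i]: photon and dark counts of SPAD i (same integration window).
    enabled[i]: whether SPAD i contributes (all SPADs enabled if omitted).
    """
    if enabled is None:
        enabled = [True] * len(n_ph)
    signal = sum(p for p, on in zip(n_ph, enabled) if on)
    total = sum(p + d for p, d, on in zip(n_ph, n_d, enabled) if on)
    return signal / math.sqrt(total) if total > 0 else 0.0

# Hypothetical counts: eight SPADs, the last one being a hot pixel.
n_ph = [100.0] * 8
n_d = [50.0] * 7 + [5000.0]

all_on = macropixel_snr(n_ph, n_d)
hot_off = macropixel_snr(n_ph, n_d, [True] * 7 + [False])
print(f"all SPADs on: {all_on:.2f}, hot SPAD off: {hot_off:.2f}")
```

With these made-up numbers, disabling the hot SPAD removes more noise than signal, so the SNR increases.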

Hot pixel elimination
Hot pixels can be identified through a calibration phase in which the individual DCR of each SPAD, $N_{d,i}$, is measured in the dark; they can then be disabled by a hot pixel elimination (HPE) scheme. To evaluate the benefit of this approach, we assume that the macropixel is uniformly illuminated, i.e. all the $N_{ph,i}$ are equal to $N_{ph}$, and that all the SPADs have a DCR equal to $N_d$ except for one SPAD $j$ whose DCR is $m$ times higher than that of the other SPADs. Thus, the signal-to-noise ratio is given by

$$SNR = \frac{8N_{ph}}{\sqrt{8N_{ph} + (7 + m)N_d}}.$$

By turning off the noisy SPAD, the SNR becomes

$$SNR_{HPE} = \frac{7N_{ph}}{\sqrt{7\left(N_{ph} + N_d\right)}}.$$

Consequently, disabling the noisy SPAD leads to a signal-to-noise ratio improvement of

$$G_{HPE} = \frac{SNR_{HPE}}{SNR} = \frac{7}{8}\sqrt{\frac{8\alpha + 7 + m}{7(\alpha + 1)}},$$

where $\alpha = N_{ph}/N_d$ is the ratio of the mean photon count to the mean dark count. Figure 2 shows the SNR gain versus the hot pixel DCR multiplication factor m for different α ratios. For a weak signal measurement (α = 0.1), the gain can be as high as 20 dB. Nevertheless, this assessment shows that the SNR may be slightly lowered if the m coefficient is too low; it is therefore not advisable to remove every SPAD whose DCR is merely greater than the mean DCR. Based on these simulations, an efficient rule of thumb is to disable only SPADs with an m coefficient greater than the α ratio, with obviously m > 1. Previous works have reported that about 20% of the SPADs integrated in an array have a dark count about 10 to 1000 times higher than the other 80% of the diodes [7,14]. Consequently, there is a high probability of having a hot SPAD among the eight SPADs. Therefore, the proposed structure can lead to a significant SNR improvement ranging from 0 to 20 dB.
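As a rough numerical check of the hot pixel elimination gain, the sketch below evaluates the ratio of the SNR with the hot SPAD disabled to the SNR with all eight SPADs enabled, under the same assumptions as the text (uniform illumination, a single hot SPAD with a DCR m times the mean). The function names `hpe_gain_db` and `break_even_m` are hypothetical, and expressing the gain as 20·log10 of the SNR ratio is an assumption, chosen because it reproduces the 20 dB figure quoted for α = 0.1 and m = 1000.

```python
import math

def hpe_gain_db(alpha, m, n_spads=8):
    """SNR gain (dB) from disabling one hot SPAD whose DCR is m times the mean.

    alpha = N_ph / N_d. Both SNRs are expressed in units of sqrt(N_d),
    which cancels in the ratio.
    """
    k = n_spads
    snr_all = k * alpha / math.sqrt(k * alpha + (k - 1) + m)
    snr_off = (k - 1) * alpha / math.sqrt((k - 1) * (alpha + 1))
    return 20.0 * math.log10(snr_off / snr_all)

def break_even_m(alpha, n_spads=8):
    """Value of m at which disabling the hot SPAD leaves the SNR unchanged."""
    k = n_spads
    return (k * k * (alpha + 1) - (k - 1) * (k * alpha + k - 1)) / (k - 1)

for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha:5.1f}  gain(m=1000)={hpe_gain_db(alpha, 1000):6.2f} dB"
          f"  break-even m={break_even_m(alpha):6.2f}")
```

For eight SPADs the break-even value reduces to m = (8α + 15)/7, which stays close to α for large α and is consistent with the rule of thumb of disabling only SPADs with m greater than α.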

Dark pixel elimination algorithm
Another scenario that could lead to a lower SNR is the presence of pixels with low light sensitivity due to a manufacturing defect, dust, or non-uniform illumination of the SPADs. To evaluate the SNR gain resulting from eliminating such pixels, we consider the case where the eliminated SPADs are completely blind. This is the worst case of light sensitivity, and the elimination of these dark pixels results in the best SNR improvement (Figure 3). Assuming n dark pixels among the eight SPADs, the corresponding SNR is

$$SNR = \frac{(8 - n)N_{ph}}{\sqrt{(8 - n)N_{ph} + 8N_d}}.$$

If all blind SPADs are turned off, the SNR becomes

$$SNR_{DPE} = \frac{(8 - n)N_{ph}}{\sqrt{(8 - n)\left(N_{ph} + N_d\right)}}.$$

Consequently, for $n \neq 8$, the SNR gain is given by

$$G_{DPE} = \frac{SNR_{DPE}}{SNR} = \sqrt{\frac{(8 - n)\alpha + 8}{(8 - n)(\alpha + 1)}}.$$

Figure 2. Signal-to-noise ratio improvement using the hot pixel elimination scheme.
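The dark pixel elimination gain can be evaluated in the same way. The sketch below assumes, as in the text, that the n eliminated SPADs are completely blind but still contribute their dark counts when enabled; the function name `dpe_gain` is hypothetical and the gain is returned as a linear ratio.

```python
import math

def dpe_gain(alpha, n_blind, n_spads=8):
    """Linear SNR gain from disabling n_blind completely blind SPADs.

    alpha = N_ph / N_d for the working SPADs; blind SPADs add dark counts only.
    """
    if not 0 <= n_blind < n_spads:
        raise ValueError("n_blind must satisfy 0 <= n_blind < n_spads")
    active = n_spads - n_blind
    return math.sqrt((active * alpha + n_spads) / (active * (alpha + 1)))

for n in range(0, 8):
    print(f"n={n}: gain at alpha=0.1 -> {dpe_gain(0.1, n):.3f}, "
          f"at alpha=10 -> {dpe_gain(10.0, n):.3f}")
```

The gain is exactly 1 when no SPAD is blind and grows with n; it is larger for weak signals, where the dark counts of the blind SPADs dominate the noise.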

SNR gain evaluation
A low SNR can be the result of low signal levels or high noise levels; consequently, the SNR can be improved by eliminating pixels that exhibit high noise levels (hot pixel elimination) or pixels that exhibit low light sensitivity (dark pixel elimination). Both schemes require a calibration phase. In the case of the dark pixel elimination scheme, the counting rate of each pixel must be measured under illumination to detect SPADs with low sensitivity, and these measurements should be repeated if the test conditions change. The hot pixel elimination scheme, on the other hand, requires a one-time calibration phase to measure the individual DCR of each SPAD and deactivate the SPADs that are too noisy. Both approaches resulted in an improved SNR; however, the dark pixel elimination efficiency was relatively low, whereas the hot pixel elimination was found to be useful in most cases.

Counting loss in TCSPC systems
A typical TCSPC setup consists of a pulsed laser source, a photon detector such as a silicon photomultiplier (SiPM) or an SPAD, a time measurement block based on a time-to-digital converter (TDC) or a time-to-amplitude converter (TAC) and an external CPU to process the measurement results. When a photodetection occurs, a certain time is required for data processing; this time interval is referred to as 'dead time' because the system is incapable of processing any additional photons collected by the SPAD, resulting in counting loss and a reduction of the SNR caused by the decreased counting efficiency, which is at best equal to CV = […]. This issue is further complicated by the random nature of photon arrivals and the fact that TCSPC applications such as FLIM and HTS usually deal with high counting rates. In order to increase the rate of detected photons and improve the SNR, many parallelized imaging structures have been reported [5,15], but this trend leads to an increased data bottleneck, which requires complex readout circuitry [7] as well as very high output frequencies to ensure a reasonable dead time [5]. Another solution for the high output rate is the use of an embedded FIFO to store the measurement results while they are being processed; nevertheless, FIFOs are very demanding in terms of power and silicon area, and to our knowledge, no study has properly determined the FIFO length required to achieve optimum results. It is therefore important to evaluate the counting gain resulting from the use of an embedded FIFO as a function of its depth and the readout rate.

TCSPC system as a queuing model
TCSPC systems are based on measuring the arrival times of single-photon events. Processing these measurements requires several additional operations, such as quenching the photon detector, shaping the regenerated signal, converting the time to a digital value and sending it to a processing unit or memory. While these operations are being conducted, the system is unavailable to process another measurement for a certain time interval referred to as 'dead time'. To simplify the study of the TCSPC system, the readout period is considered equal to the system's dead time. The dead time, combined with the random nature of the single-photon detection events, leads to random counting losses whenever the system is busy processing a previous photon arrival, thus limiting the system efficiency. To evaluate the counting loss, the TCSPC system can be modeled as a queuing model with an arrival rate λ representing the average number of photons arriving at the sensor's surface per second, a departure rate μ representing the readout data rate in samples per second and a service rate r representing the rate at which the TCSPC system can process photon detections, which is equal to (dead time)$^{-1}$. Figure 4 illustrates this phenomenon: even if the arrival rate λ is equal to or less than the departure rate μ, the random nature of the photon arrivals leads to quiet periods followed by bursts of arrivals, a well-known characteristic of a Poisson process. During such a burst, some photons will be lost as a result of the system's dead time. The simplest approach to limit this loss is to reduce the dead time and the readout period, but physical and electrical constraints limit these times to tens of nanoseconds.
Another approach is the use of parallelized structures with the incoming light uniformly split (Figure 5); assuming an equal distribution of the photon arrivals, this is equivalent to dividing the arrival rate λ into M equal parts, where M is the number of parallel modules. This approach reduces the counting loss as well as the pile-up effect, but it also creates a data bottleneck at the end of the processing chain, thus requiring high output frequencies to process the resulting high counting rate. Consequently, the loss problem is not resolved but only shifted towards the final output. This problem can be mitigated by integrating a FIFO in the TCSPC system, which allows more flexibility in processing the stochastic arrival events. Indeed, a TCSPC system without a FIFO can be modeled as a single-buffer queuing system; similarly, a TCSPC system integrating a FIFO with N rows can be modeled as an N-cell queuing system. We will assume that the FIFO's input data follow a Poisson process, a reasonable assumption when the average photon arrival rate is significantly lower than the TCSPC's operating frequency. Given the stochastic nature of the measured phenomenon, i.e. the Poisson process of photon arrivals, the system's behavior must be studied in terms of the traffic intensity in and out of the FIFO to determine the impact of its limited capacity on the sensor's sensitivity due to arrivals missed when the FIFO is full. The FIFO can be equated to a size-N queuing system whose input is a Poisson arrival process with a mean arrival rate λ, where the probability of n arrivals occurring during the time interval [t, t + τ] is given as

$$P_n(\tau) = \frac{(\lambda\tau)^n}{n!}\,e^{-\lambda\tau}.$$

The FIFO's output follows a periodic departure process with a departure rate μ and a readout period $T_d = \mu^{-1}$, which represents the time needed for one departure to be accomplished.
The system can be modeled as a semi-Markov chain where $Q_n = Q(t = t_n)$ is the number of occupied cells in the FIFO immediately after the departure instants $\{t_n, n = 0, 1, 2, …\}$ [16]. Given that the FIFO's capacity is limited to N cells, the number of occupied cells left just after a departure cannot exceed N−1, and the embedded Markov chain contains N states labeled according to the number of occupied cells left immediately after a departure, S = {n, n = 0, 1, 2, …, N−1}. Figure 6 shows the embedded Markov chain with all the possible transitions from a random state 'i'.
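The queuing picture described above can be explored with a small event-driven Monte Carlo sketch. It rests on assumptions that go slightly beyond the text: arrivals are drawn from a Poisson process, each stored entry takes exactly one readout period T_d = 1/μ to leave, the readout of an entry starts as soon as the previous one is finished (or immediately if the FIFO is empty), and the FIFO holds at most `depth` entries including the one being read out. The function name `simulate_efficiency` and its parameters are hypothetical.

```python
import random
from collections import deque

def simulate_efficiency(lam, mu, depth, n_arrivals=100_000, seed=0):
    """Fraction of Poisson arrivals (rate lam) accepted by a FIFO of `depth`
    cells that is read out at one entry per period 1/mu."""
    rng = random.Random(seed)
    t_d = 1.0 / mu
    departures = deque()  # readout-completion times of the entries still stored
    t = 0.0
    accepted = 0
    for _ in range(n_arrivals):
        t += rng.expovariate(lam)                 # next photon arrival time
        while departures and departures[0] <= t:  # free the cells already read out
            departures.popleft()
        if len(departures) < depth:               # room left: store the measurement
            start = departures[-1] if departures else t
            departures.append(start + t_d)
            accepted += 1
        # otherwise the FIFO is full and the photon is lost
    return accepted / n_arrivals

no_fifo = simulate_efficiency(lam=25e6, mu=25e6, depth=1)
with_fifo = simulate_efficiency(lam=25e6, mu=25e6, depth=8)
print(f"depth 1: {no_fifo:.3f}, depth 8: {with_fifo:.3f}")
```

With a single cell and equal arrival and readout rates, roughly half of the arrivals are lost, whereas a deeper FIFO absorbs most of the bursts produced by the Poisson process.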

Steady-state probability evaluation
Let $X_n$ be the number of arrivals during the nth readout period $T_d$. Given the Poisson arrival property, the probability of j arrivals occurring during the readout period is

$$P(X_n = j) = \frac{(\lambda T_d)^j}{j!}\,e^{-\lambda T_d} = \frac{r^j}{j!}\,e^{-r},$$

where r, defined as

$$r = \lambda T_d = \frac{\lambda}{\mu},$$

is the ratio of the photon rate to the readout rate. The number of occupied cells after the (n+1)th period is increased by the number $X_{n+1}$ of photon arrivals during this period and reduced by one readout. If the number of photon arrivals overloads the FIFO, the number of occupied cells is clipped to N−1 and a loss of measurement occurs. If the FIFO is empty, i.e. $Q_n = 0$, no readout occurs. Therefore, the relation between $Q_n$ and $Q_{n+1}$ is defined as

$$Q_{n+1} = \begin{cases} \min\left(X_{n+1},\; N-1\right), & Q_n = 0 \\ \min\left(Q_n - 1 + X_{n+1},\; N-1\right), & Q_n \geq 1 \end{cases}$$

The transition probability from state i to state j after m transitions is

$$P_{i,j}^{(m)} = P\left(Q_{n+m} = j \mid Q_n = i\right).$$

In particular, the one-step transition probability is

$$P_{i,j} = P\left(Q_{n+1} = j \mid Q_n = i\right),$$

which allows us to define the N × N transition probability matrix P of the one-step transition probabilities $P_{i,j}$ [16]:

$$P = \begin{pmatrix} P_{0,0} & P_{0,1} & \cdots & P_{0,N-1} \\ P_{1,0} & P_{1,1} & \cdots & P_{1,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{N-1,0} & P_{N-1,1} & \cdots & P_{N-1,N-1} \end{pmatrix}$$

where the element $P_{i,j}$ of the matrix represents the probability of moving to state 'j' given that the system was in state 'i'. These probabilities describe the transient behavior of the system; however, as the system evolves, it converges to a state of equilibrium known as the steady state, with a time-independent distribution [17] represented as a vector $\pi = (\pi_0, \pi_1, …, \pi_{N-1})$ satisfying $\pi = \pi P$ and $\sum_i \pi_i = 1$, where $\pi_i$ is the probability of being in state 'i' once the system has reached equilibrium.
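One way to obtain the steady-state vector numerically is sketched below. The transition matrix is built from the rule stated in the text (one readout per period unless the FIFO is empty, Poisson arrivals with mean r per period, occupancy clipped to N−1), and the stationary distribution is found by plain power iteration, which is a choice made here for simplicity rather than the method used by the authors. The names `poisson_pmf`, `transition_matrix` and `stationary` are hypothetical.

```python
import math

def poisson_pmf(j, r):
    """Probability of j Poisson arrivals during one readout period (mean r)."""
    return math.exp(-r) * r**j / math.factorial(j)

def transition_matrix(r, depth):
    """One-step matrix of the chain embedded at departure instants.

    States 0..depth-1 are the numbers of occupied cells left after a departure.
    """
    n = depth
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        base = max(i - 1, 0)        # occupancy after the readout, before new arrivals
        for j in range(base, n - 1):
            P[i][j] = poisson_pmf(j - base, r)
        P[i][n - 1] = 1.0 - sum(P[i][:n - 1])   # clipping: every overflowing outcome
    return P

def stationary(P, tol=1e-13, max_iter=100_000):
    """Steady-state vector pi with pi = pi P, obtained by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

pi = stationary(transition_matrix(r=0.67, depth=8))
print("pi_0 =", round(pi[0], 4))
```

For a depth of 2 both rows of the matrix are identical, so the sketch can be checked by hand: the steady-state probability of leaving the FIFO empty is simply exp(−r).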

Blocking probability
The main goal of using this queuing model is to evaluate the system efficiency based on the probability that an arrival finds the FIFO full and is therefore lost; this probability is the blocking probability $P_B$. In order to evaluate $P_B$, we need the state distribution at all times and not only at departure instants. Let us define the following system probabilities:
$P_k$: probability of the system containing k registered arrivals (k = 0…N).
$\pi_k$: state probabilities at departure instants (k = 0…N−1).
$\pi_{a,k}$: state probabilities at arrival instants, regardless of whether the arrival joins the queue or not (k = 0…N).
An important property of the Poisson arrival process is that Poisson Arrivals See Time Averages (PASTA) [16], which implies that the distribution of occupied cells seen at arrival instants is the same as the distribution seen by a random observer:

$$\pi_{a,k} = P_k, \qquad k = 0, …, N.$$

In particular, an arrival is lost when it finds N occupied cells, so that $P_B = P_N$. On the other hand, the probability that an arrival finds k < N occupied cells in the system, given that the new arrival is admitted, is equal to the probability that a departure leaves k occupied cells:

$$\pi_k = \frac{\pi_{a,k}}{1 - P_B} = \frac{P_k}{1 - P_B}, \qquad k = 0, …, N-1.$$

In particular, for k = 0, we have

$$\pi_0 = \frac{P_0}{1 - P_B}.$$

Furthermore, arrivals enter the system at a rate λ as long as they are admitted into the queue; hence, we define the effective arrival rate as

$$\lambda_{eff} = \lambda\,(1 - P_B).$$

Simultaneously, departures out of the system occur at a rate μ as long as the system is not empty, which allows us to define the effective departure rate as

$$\mu_{eff} = \mu\,(1 - P_0).$$

Given that in equilibrium the traffic entering the queuing system is equal to the traffic leaving it [7], we have

$$\lambda\,(1 - P_B) = \mu\,(1 - P_0),$$

and the blocking probability is

$$P_B = 1 - \frac{1}{\pi_0 + r}.$$

The described method was used to determine the blocking probability and the system efficiency η:

$$\eta = 1 - P_B = \frac{1}{\pi_0 + r},$$

where $\pi_0$ is the empty-FIFO component of the steady-state vector π, defined in (26). The system's efficiency increases with the FIFO depth, although with diminishing returns. As a result, when the resources needed for an embedded FIFO are taken into consideration, it is safe to say that a FIFO depth of 8 is enough to reduce the arrival input loss due to the blocking phenomenon.
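The step from the flow balance to the blocking probability can be written out explicitly. The derivation below is a sketch that uses only the relations stated above (PASTA, the admitted-arrival relation for k = 0, and the equality of effective arrival and departure rates), with $r = \lambda/\mu$:

```latex
\begin{aligned}
P_0 &= \pi_0\,(1 - P_B)
  && \text{(relation for } k = 0\text{)} \\
\lambda\,(1 - P_B) &= \mu\,(1 - P_0) = \mu\,\bigl[\,1 - \pi_0\,(1 - P_B)\,\bigr]
  && \text{(flow balance)} \\
(1 - P_B)\,(\lambda + \mu\,\pi_0) &= \mu \\
\eta = 1 - P_B &= \frac{\mu}{\lambda + \mu\,\pi_0} = \frac{1}{\pi_0 + r}
\end{aligned}
```

As a sanity check, a system without a FIFO (N = 1) has a single state, so $\pi_0 = 1$ and $\eta = 1/(1 + r)$; this gives η = 50% for r = 1 and about 91% for r = 0.1.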

Case study of a parallelized TCSPC system including an embedded FIFO
The TCSPC system illustrated in Figure 8 was designed for an HTS application that requires counting rates of up to several MHz per channel. With a TDC dead time of 40 ns, the maximum data rate is equal to 25 MS/s. According to Figure 7, the use of a single TCSPC module would lead to an efficiency η of 98, 90 and 50% for photon rates of 0.25, 2.5 and 25 MHz, respectively, i.e. service rates of 0.01, 0.1 and 1. For a service rate r > 1, the system's efficiency tends to 1/r regardless of the use of a FIFO. A photon rate of λ = 25 megaphotons/s is therefore not reasonable for a single TCSPC module, but if the arrival rate is divided among the eight TCSPC units (Figure 8), and assuming that the arrival process is equally distributed among them, each TCSPC$_i$ receives an arrival rate

$$\lambda_i = \frac{\lambda}{8} = 3.125\ \text{MHz},$$

resulting in a service rate $r_i = 0.125$ and an efficiency $\eta_{ph} \approx 90\%$, i.e. an expected departure rate $\mu_{TCSPC_i} = 2.8$ MHz out of each TCSPC unit, which is similar to the value obtained in [19]. Given the low service rate of each TCSPC$_i$, the output of each TCSPC unit has a distribution very similar to a Poisson process; the resulting process is the sum of these eight processes (i = 1, 2, …, 8) and is therefore also approximately a Poisson process, with an arrival rate

$$\lambda_f = \sum_{i=1}^{8} \mu_{TCSPC_i} = 8 \times 2.8\ \text{MHz} = 22.4\ \text{MHz}.$$

Assuming an output frequency of only 33.33 MHz, the service rate will be $r_f = 0.67$. In the absence of the FIFO, the memory block behaves as a single buffer, resulting in a memory block efficiency $\eta_M = 0.6$ and a total efficiency

$$\eta = \eta_{ph}\,\eta_M = 0.9 \times 0.6 = 54\%.$$

The efficiency of the system is therefore not improved by the parallelization of the TCSPC, even with the reduction of the pile-up effect. However, using a FIFO of eight cells leads to a memory block efficiency of $\eta_M \cong 100\%$, so that the overall TCSPC system efficiency is maintained at about 90%.
Without the FIFO, such an efficiency level can only be achieved with a 3 GHz output frequency, which demonstrates the significant impact of including the FIFO in the TCSPC system.
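The arithmetic of this case study can be retraced with the short sketch below. It only covers the stages without a FIFO, for which the efficiency reduces to 1/(1 + r) (the single-buffer case, π0 = 1); the helper name `no_fifo_efficiency` and the variable names are hypothetical, and the rates are those quoted in the text. Because the text rounds η_ph to 90%, the unrounded values printed here differ slightly from the quoted ones.

```python
def no_fifo_efficiency(r):
    """Efficiency of a single-buffer stage: eta = 1 / (pi_0 + r) with pi_0 = 1."""
    return 1.0 / (1.0 + r)

photon_rate = 25e6      # total photon rate in photons/s (from the text)
tdc_rate = 25e6         # 1 / (40 ns TDC dead time) in samples/s (from the text)
n_units = 8             # parallel TCSPC units (from the text)
readout_rate = 33.33e6  # output frequency in Hz (from the text)

r_i = (photon_rate / n_units) / tdc_rate       # load of one TCSPC unit
eta_ph = no_fifo_efficiency(r_i)               # efficiency of one TCSPC unit
unit_output = eta_ph * photon_rate / n_units   # measurements/s out of one unit
merged_rate = n_units * unit_output            # combined rate entering the memory block
r_f = merged_rate / readout_rate               # load of the memory block
eta_m = no_fifo_efficiency(r_f)                # memory block efficiency without a FIFO
eta_total = eta_ph * eta_m                     # overall efficiency without a FIFO

# Memory block efficiency without a FIFO if the output ran at 3 GHz instead.
eta_m_3ghz = no_fifo_efficiency(merged_rate / 3e9)

print(f"r_i={r_i:.3f} eta_ph={eta_ph:.3f} r_f={r_f:.3f} "
      f"eta_M={eta_m:.3f} eta_total={eta_total:.3f} eta_M(3 GHz)={eta_m_3ghz:.4f}")
```

The same helper also reproduces the single-module figures: a load of 1 gives an efficiency of 0.5, and a load of 0.1 gives about 0.91.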

Conclusion
The random nature of photon arrivals and applications involving a high counting rate require a specialized TCSPC system scheme to process the resulting data and improve the SNR. This requires the optimization of the photon detection process through the reduction of the effects of noise and low sensitivity. It also requires the optimization of the system's architecture so that photon events are not lost due to the dead time following a previous photon arrival. In this chapter, we have discussed these two issues and presented solutions, using mathematical models to assess the gain of such schemes. A low SNR can be the result of low signal levels or high noise levels. In the case of an SPAD, a low signal level is the result of low light sensitivity, while a high noise level is the result of a high DCR. Thus, the detector's SNR can be increased by limiting the negative effect of these two cases. We presented a TCSPC macropixel architecture in which the SNR can be increased by deactivating dark pixels and/or hot pixels. A dark pixel is a pixel with an abnormally low sensitivity, and a hot pixel is a pixel with a high noise level in comparison to the other pixels. The dark pixel elimination scheme requires a calibration phase to determine the activity level of each pixel and the low-sensitivity pixels that must be deactivated; this calibration phase should be conducted whenever the measurement conditions change and can lead to an SNR gain of up to 1.5 times. The hot pixel elimination scheme, on the other hand, requires a one-time calibration phase to determine the DCR of each pixel; the noisy pixels are then deactivated, which allows an SNR improvement of up to 20 dB. The processing of detected photons can be optimized by means of a parallelized TCSPC architecture that makes use of an embedded FIFO to limit the counting loss caused by the dead time that follows each photon detection. Using a queueing model, we demonstrated the impact of this approach and quantified the efficiency improvement as a function of the FIFO length, the counting rate and the readout rate. The proposed TCSPC architecture is capable of achieving a 90% efficiency with a counting rate of 25 MHz at a readout rate of 33 MHz. Without the embedded FIFO, such an efficiency would require a 3 GHz readout frequency.

Figure 8. Parallelization scheme of the TCSPC system with the embedded FIFO as presented in [20].