Imaging set-up

Our experimental set-up is illustrated in Fig. 1. The illumination source was a pulsed laser diode (PicoQuant LDH series with a 640-nm center wavelength), whose original output-pulse duration was increased to a full-width at half-maximum of ∼2.5 ns (that is, a root mean square (r.m.s.) value of T_p ≈ 1 ns). The laser diode was pulsed at a T_r ≈ 50 ns repetition period set by the SPAD array’s trigger output. A diffuser plate spatially spread the laser pulses to flood illuminate the scene of interest. An incandescent lamp injected unwanted background light into the camera. The lamp’s power was adjusted so that (averaged over the region that was imaged) each detected photon was equally likely to be due to signal (backreflected laser) light or background light. A standard Canon FL series photographic lens focused the signal plus background light on the SPAD array. Each photon detection from the array was time tagged relative to the time of the most recently transmitted laser pulse and recorded (Supplementary Methods).

Figure 1: Single-photon array imaging framework. (a) SPAD-array imaging set-up. A repetitively pulsed laser flood illuminates the scene of interest. Laser light reflected from the scene plus background light is detected by a SPAD camera. Photon detections at each pixel are time tagged relative to the most recently transmitted pulse and recorded. The raw photon-detection data is processed on a standard laptop computer to recover the scene’s 3D structure and reflectivity. (b) Example of 3D structure and reflectivity reconstruction using the baseline single-photon imager from ref. 22. (c) Example of 3D structure and reflectivity reconstruction from our processing. Large portions of the mannequin’s shirt and facial features that were not visible in the baseline image are revealed using our method. Both images were generated using an average of ∼1 detected signal photon per pixel.

The SPAD array18,20, covering a 4.8 × 4.8-mm footprint, consists of 32 × 32 pixels of fully independent Si SPADs and complementary metal-oxide-semiconductor (CMOS)-based electronic circuitry that includes a time-to-digital converter for each SPAD detector. The SPAD within each 150 × 150-μm pixel has a 30-μm-diameter circular active region, giving the array a 3.14% fill factor. At the 640-nm operating wavelength, each array element’s photon-detection efficiency is ∼20% and its dark-count rate is ∼100 Hz at room temperature. To extend the region that could be imaged and increase the number of pixels, we used multiple image scans to form a larger composite image. In particular, we mounted the SPAD array on a feedback-controlled, two-axis motorized translation stage to produce images with N_x × N_y = 384 × 384 pixels (Supplementary Figs 1a,b and 2a–c).
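These geometry figures can be cross-checked with a few lines of arithmetic (a minimal sketch; the tile layout for the composite scan is our assumption of non-overlapping 32 × 32 sub-images):

```python
import math

# Pixel geometry from the text: 150-um pitch, 30-um-diameter active region.
pitch_um = 150.0
active_diam_um = 30.0

# Fill factor = circular active area / square pixel area.
fill_factor = math.pi * (active_diam_um / 2.0) ** 2 / pitch_um ** 2
print(f"fill factor = {100 * fill_factor:.2f}%")  # 3.14%

# A 384 x 384 composite built from 32 x 32 sub-images needs a 12 x 12 grid
# of translation-stage positions (assuming non-overlapping tiles).
scan_positions = (384 // 32) * (384 // 32)
print(scan_positions)  # 144
```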

The SPAD array has a Δ = 390 ps time resolution set by its internal clock rate. We set each acquisition frame length to 65 μs, with a gate-on time of 16 μs and a gate-off time of 49 μs, the latter serving to limit the chip’s power dissipation and to allow data transfer. At the start of each frame, the SPAD array was set to trigger the laser to generate pulses at a ∼20-MHz repetition rate. Hence, during the 16-μs gate-on time of each frame, ∼320 pulses illuminated the scene (Supplementary Fig. 3).
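The frame-timing numbers above are mutually consistent, as a quick arithmetic check shows (values taken directly from the text):

```python
gate_on_s = 16e-6        # gate-on time per acquisition frame
rep_period_s = 50e-9     # laser repetition period T_r (~20 MHz)
bin_s = 390e-12          # time-bin duration (delta)

# Illumination pulses fired during one gate-on window.
pulses_per_frame = gate_on_s / rep_period_s
print(pulses_per_frame)  # 320.0

# Time bins spanning one repetition period.
bins_per_period = round(rep_period_s / bin_s)
print(bins_per_period)   # 128
```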

Observation model

We define Z = {Z_{i,j}} and A = {A_{i,j}} to be the scene’s 3D structure and reflectivity that we aim to recover, and we let B = {B_{i,j}} be the average rates of background-light plus dark-count detections. Flood illumination of the scene at time t = 0 with a photon-flux pulse s(t) then results in the following Poisson-process rate function for the (i,j)th pixel of the composite image:

λ_{i,j}(t) = η_{i,j} A_{i,j} s(t − 2Z_{i,j}/c) + B_{i,j},

where η_{i,j} ∈ (0, 1] is the (i,j)th detector’s photon-detection efficiency and c is the speed of light. Fabrication imperfections of the SPAD array cause some pixels to have inordinately high dark-count rates B_{i,j}, making their detection times uninformative in our imaging experiments because they are predominantly from dark counts. We therefore performed camera calibration to determine the set of these ‘hot pixels’ (2% of all pixels in our case) so that their outputs could be ignored in the processing of the imaging data.

We define N_z to be the total number of time bins in which the photon detections can be found, and we let C_{i,j,k} be the observed number of photon counts in the kth time bin for pixel (i,j) after n_s pulsed-illumination trials. By the theory of photon counting28, C_{i,j,k}’s statistical distribution is

C_{i,j,k} ~ Poisson( n_s ∫_{(k−1)Δ}^{kΔ} λ_{i,j}(t) dt ),

for k = 1, 2, …, N_z, where we have assumed that the pulse repetition period is long enough to preclude pulse-aliasing artifacts. We also operate in a low-flux condition, such that ∑_{k=1}^{N_z} C_{i,j,k}, the total number of detections at a pixel, is much less than n_s, the total number of illumination pulses, to avoid pulse-pileup distortions. Our imaging problem is then to construct accurate estimates, Â and Ẑ, of the scene’s reflectivity A and 3D structure Z from the sparse photon-detection data {C_{i,j,k}}.
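The observation model lends itself to a compact numerical sketch. The toy single-pixel simulation below (all parameter values are illustrative assumptions, not our calibration data) draws Poisson counts in each time bin from a Gaussian pulse returned from a single depth plus a constant background-plus-dark-count rate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not the paper's calibration values).
dt = 390e-12            # bin duration (s)
n_z = 128               # time bins per repetition period
n_s = 320               # pulsed-illumination trials
c = 3e8                 # speed of light (m/s)
T_p = 1e-9              # r.m.s. pulse duration (s)
eta = 0.2               # photon-detection efficiency
A = 0.016               # reflectivity, chosen so ~1 signal photon is detected
B = 6e4                 # background + dark-count rate (counts/s)
z_true = 1.0            # scene depth (m)

# Gaussian pulse s(t), normalized to unit area, delayed by the round trip 2z/c.
t = (np.arange(n_z) + 0.5) * dt
s = np.exp(-(t - 2 * z_true / c) ** 2 / (2 * T_p ** 2))
s /= s.sum() * dt

# Per-bin Poisson rate: integral of eta*A*s(t - 2z/c) + B over the bin.
lam = (eta * A * s + B) * dt
counts = rng.poisson(n_s * lam)

# Expected totals: ~1 signal and ~1 extraneous detection, far fewer than
# the n_s illumination pulses (the low-flux condition).
print(counts.sum())
```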

3D structure and reflectivity reconstruction

In the low-flux regime, wherein there are very few detections and many of them are extraneous, an algorithm that relies solely on the aforementioned pixelwise photodetection statistics has very limited robustness. We aim to achieve high photon efficiency by combining those photodetection statistics with prior information about natural scenes.

Most natural and man-made scenes have strong spatial correlations among neighbouring pixels in both transverse and longitudinal measurements, punctuated by sharp boundaries29. Whereas conventional methods treat each pixel independently, our imaging framework exploits these correlations to censor and remove extraneous (and randomly distributed) photon-detection events due to background light and detector dark counts. Unlike noise mitigation via spatial filtering and averaging, in which fine spatial features are washed out by oversmoothing, our technique retains the spatial resolution set by the SPAD array. For a given set of censored measurements, our reconstruction algorithm balances two constraints: that the 3D and reflectivity image estimates come from a scene that is correlated in both the transverse and longitudinal domains, and that the estimates remain consistent with the Poisson statistics of the raw single-photon measurements.

The implementation of our reconstruction algorithm can be divided into the following three steps (Fig. 2; Supplementary Note 1).

Figure 2: Stages of 3D structure and reflectivity reconstruction algorithm. (a) Raw time-tagged photon-detection data are captured using the SPAD camera set-up. Averaged over the scene, the number of detected signal photons per pixel was ∼1, as was the average number of background-light detections plus dark counts. (b) Step 1: raw time-tagged photon detections are used to accurately estimate the scene’s reflectivity by solving a regularized optimization problem. (c) Step 2: to estimate 3D structure, extraneous (background light plus dark count) photon detections are first censored, based on the longitudinal sparsity constraint of natural scenes, by solving a sparse deconvolution problem. (d) Step 3: the uncensored (presumed to be signal) photon detections are used for 3D structure reconstruction by solving a regularized optimization problem.

Step 1: natural scenes have reflectivities that are spatially correlated—the reflectivity at a given pixel tends to be similar to the values at its nearest neighbours—with abrupt transitions at the boundaries between objects. We exploit these correlations by imposing a transverse-smoothness constraint, using the total-variation (TV) norm30, on our reflectivity image. In this process, we ignore data from the hot-pixel set. The final reflectivity image Â is thus obtained by solving a regularized optimization problem.
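As a self-contained illustration of step 1 (a simplified stand-in, not our actual solver, which maximizes the Poisson likelihood under the TV penalty), one can form the pixelwise maximum-likelihood reflectivity estimate and then descend a smoothed-TV objective:

```python
import numpy as np

def tv_smooth(y, lam=0.3, n_iter=150, step=0.1, eps=1e-2):
    """Gradient descent on 0.5*||x - y||^2 + lam*TV_eps(x), where TV_eps is
    a smoothed total-variation penalty. A toy surrogate for the regularized
    optimization of step 1, not the solver used in the paper."""
    x = y.astype(float).copy()
    for _ in range(n_iter):
        dx = np.diff(x, axis=0, append=x[-1:, :])   # forward differences
        dy = np.diff(x, axis=1, append=x[:, -1:])
        mag = np.sqrt(dx ** 2 + dy ** 2 + eps)
        px, py = dx / mag, dy / mag                  # normalized gradient field
        div = (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))
        x -= step * ((x - y) - lam * div)            # data fit + TV descent
    return x

def total_variation(img):
    return np.abs(np.diff(img, axis=0)).sum() + np.abs(np.diff(img, axis=1)).sum()

# Toy usage: a bright square, Poisson counts over 5 trials, pixelwise ML
# estimate (counts / trials), then TV smoothing.
rng = np.random.default_rng(1)
truth = np.zeros((32, 32)); truth[8:24, 8:24] = 1.0
counts = rng.poisson(5.0 * truth + 0.5)
ml = counts / 5.0
smoothed = tv_smooth(ml)
```

The smoothed estimate preserves the sharp square boundary while suppressing the per-pixel shot noise that dominates the raw ML image.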

Step 2: natural scenes have a finite number of reflectors that are clustered in depth. It follows that in an acquisition without background-light or dark-count detections, the set of detection times collected over the entire scene would have a histogram with N_z bins that possesses non-zero entries in only a small number of short subintervals. This longitudinal sparsity constraint is enforced in our algorithm by solving a sparse deconvolution problem on the coarsely time-binned photon-detection data, which is specific to the array imaging set-up, to obtain a small number of representative scene depths. Raw photon-detection events at times corresponding to depths differing by more than cT_p/2 from the representative scene depths are censored. As step 2 has identified coarse depth clusters of the scene objects, the next step of the algorithm uses the filtered set of photon detections to determine a high-resolution depth image within all identified clusters.
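The censoring idea of step 2 can be sketched as follows. Our algorithm solves a sparse deconvolution problem; the thresholded scene-wide histogram below is a simplified stand-in, and the function name and thresholds are illustrative assumptions:

```python
import numpy as np

def censor_detections(times, T_p=1e-9, bin_s=390e-12, n_bins=128, min_frac=0.04):
    """Keep detections near depth clusters; censor the rest.
    Histogram all detection times over the scene, retain bins whose counts
    stand out above the (uniform-in-time) background level, and censor
    detections farther than ~T_p from every retained bin center."""
    times = np.asarray(times)
    hist, edges = np.histogram(times, bins=n_bins, range=(0.0, n_bins * bin_s))
    keep = hist >= max(2, min_frac * len(times))
    centers = (0.5 * (edges[:-1] + edges[1:]))[keep]
    if centers.size == 0:
        return np.zeros(times.shape, dtype=bool)
    return np.min(np.abs(times[:, None] - centers[None, :]), axis=1) <= T_p

# Toy usage: 200 signal detections clustered at one depth, plus 200 uniform
# background detections spread over the repetition period.
rng = np.random.default_rng(2)
signal = rng.normal(40 * 390e-12, 1e-9, size=200)
background = rng.uniform(0.0, 128 * 390e-12, size=200)
mask = censor_detections(np.concatenate([signal, background]))
```

Most signal detections survive while most background detections are censored; only the surviving set feeds the depth reconstruction of step 3.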

Step 3: similar to what was done in step 1 for reflectivity estimation, we impose a TV-norm spatial-smoothness constraint on our depth image, ignoring data from the hot-pixel set and the censored detections at the remaining pixels. Thus, we obtain the final depth image Ẑ by solving a regularized optimization problem.
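A toy stand-in for step 3 (the paper solves a TV-regularized optimization; the Laplacian relaxation below is a crude hypothetical surrogate) fills pixels with no trusted data, such as hot pixels or fully censored pixels, from their neighbours:

```python
import numpy as np

def fill_and_smooth_depth(depth, valid, n_iter=200):
    """Iteratively replace untrusted pixels by the mean of their 4 neighbours,
    keeping trusted depth values fixed (a Laplacian-smoothness surrogate for
    the TV-regularized depth solver described in step 3)."""
    z = np.where(valid, depth, np.mean(depth[valid]))
    for _ in range(n_iter):
        nb = (np.roll(z, 1, 0) + np.roll(z, -1, 0) +
              np.roll(z, 1, 1) + np.roll(z, -1, 1)) / 4.0
        z = np.where(valid, depth, nb)  # trusted data stay fixed
    return z

# Toy usage: a flat wall at 2 m with a 3 x 3 block of untrusted pixels.
depth_obs = np.full((16, 16), 2.0)
valid = np.ones((16, 16), dtype=bool)
valid[5:8, 5:8] = False
depth_obs[~valid] = 0.0      # garbage values at untrusted pixels
filled = fill_and_smooth_depth(depth_obs, valid)
```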

Reconstruction results

Figure 3 shows experimental results of 3D structure and reflectivity reconstructions for a scene comprising a mannequin and a sunflower when, averaged over the scene, there was ∼1 signal photon detected per pixel and ∼1 extraneous (background light plus dark count) detection per pixel. In our experiments, the per-pixel average photon-count rates of the backreflected waveform and of the background-light plus dark-count response were 1,089 and 995 counts/s, respectively. The image resolution was 384 × 384 pixels for this experiment. We compare our proposed method with the baseline pixelwise imaging method that uses filtered histograms22 and with the state-of-the-art pseudo-array imaging method25.

Figure 3: 3D structure and reflectivity reconstructions. (a–d) Results of imaging 3D structure and reflectivity using the filtered-histogram method, the state-of-the-art pseudo-array imaging method, our proposed framework and the ground-truth proxy obtained from detecting 550 signal photons per pixel. For visualization, the reflectivity estimates are overlaid on the reconstructed depth maps for each method. The frontal views, shown here, provide the best visualizations of the reflectivity estimates. (e–h) Results of imaging 3D structure and reflectivity from a–d rotated to reveal the side view, which makes the reconstructed depth clearly visible. The filtered-histogram image is too noisy to show any useful depth features. The pseudo-array imaging method successfully recovers gross depth features but, in comparison with the ground-truth estimate in h, it overestimates the dimensions of the mannequin’s face by several cm and oversmooths the facial features. Our SPAD-array-specific method in g, however, gives high-resolution depth and reflectivity reconstruction at low flux. (i–k) The depth error maps, obtained by taking the absolute difference between estimated depth and ground-truth depth, show that our method successfully recovers the scene structure with a mean absolute error of 2 cm, which is sub-bin-duration resolution as cΔ/2 ≈ 6 cm, while the existing methods fail to do so. (l) Vertical cross-section plot of the middle of the 3D reconstructions from pseudo-array imaging (red line), pixelwise ground truth (black line) and our proposed method (blue line). Note that our framework recovers fine facial features, such as the nose, while the pseudo-array imaging method oversmooths them.

From the visualization of reflectivity overlaid on depth, we observe that the baseline pixelwise imaging method (Fig. 3a,e) generates noisy depth and reflectivity images without useful scene features, owing to the combination of low-flux operation and high levels of background-light detections plus detector dark counts. In contrast, the existing pseudo-array method—which exploits transverse spatial correlations, but presumes constant B_{i,j}—gives a reflectivity image that captures overall object features, but is oversmoothed to mitigate hot-pixel contributions (Fig. 3b). Furthermore, because the pseudo-array method presumes the 10-ps-class time tagging of the single-element SPADs used in raster-scanning set-ups, its depth image fails to reproduce the 3D structure of the mannequin’s face from the ns-class time tagging afforded by our SPAD camera’s detector array. In particular, it overestimates the head’s dimensions and oversmooths the facial features (Fig. 3f), whereas our array-specific method accurately captures the scene’s 3D structure and reflectivity (Fig. 3c,g). This accuracy can be seen by comparing our framework’s result with the high-flux pixelwise depth and reflectivity images (Fig. 3d,h)—obtained by detecting 550 signal photons per pixel and performing time-gated pixelwise processing—that serve as ground-truth proxies for the scene’s actual depth and reflectivity. For the fairest comparisons in Fig. 3, each algorithm—baseline pixelwise processing, pseudo-array processing and our new framework—had its parameters tuned to minimize the mean-squared degradation from the ground-truth proxies.

The depth error maps in Fig. 3i–k quantify the resolution improvements of our imager over the existing ones for this low-flux imaging experiment. Although the mean number of signal-photon detections is ∼1 per pixel, in the high-reflectivity facial regions of the mannequin the average is ∼8 signal-photon detections per pixel, while the number of signal photons detected at almost every pixel in the background portion of the scene is 0. Despite this, the pseudo-array imaging technique yields a face estimate with high depth error, owing to the oversmoothing incurred in its effort to mitigate background noise; it particularly struggles to reconstruct depth boundaries, where photon counts are low. Compared with these conventional methods, our framework gives a much better estimate of the full 3D structure. We also studied how the depth error of our framework depends on the number of photon counts at a given pixel. For example, our framework gives errors of 4.4 and 0.9 cm at pixels (260, 121) and (107, 187), which correspond to a depth-boundary region with 1 photon count and a mannequin-face region with 8 photon counts, respectively. Overall, we observe a negative correlation between the depth error and the number of photons detected at a pixel for our method (Supplementary Fig. 4). Recall that the time-bin duration of each pixel of the SPAD camera is Δ = 390 ps, corresponding to cΔ/2 ≈ 6-cm depth resolution. Our imager thus successfully recovers depth with a mean absolute error of 2 cm, and hence with sub-bin-duration resolution, while the existing methods fail to do so.
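The sub-bin claim reduces to simple arithmetic: one 390-ps bin spans about 6 cm of depth, so a 2-cm mean absolute error is roughly a third of a bin. A minimal error-map computation (the depth values below are made-up examples, not measured data):

```python
import numpy as np

c = 3e8                       # speed of light (m/s)
delta = 390e-12               # time-bin duration (s)
bin_depth = c * delta / 2.0   # depth spanned by one time bin
print(f"{100 * bin_depth:.2f} cm")   # 5.85 cm

# Error map: absolute difference between estimated and ground-truth depth.
z_hat = np.array([[1.00, 1.02], [1.05, 0.98]])   # made-up estimates (m)
z_gt  = np.array([[1.01, 1.00], [1.00, 1.00]])   # made-up ground truth (m)
err_map = np.abs(z_hat - z_gt)
mae = err_map.mean()
print(f"MAE = {100 * mae:.1f} cm")               # 2.5 cm, i.e. sub-bin
```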

To measure the improvement in gray-scale imaging, we compute the peak signal-to-noise ratio (PSNR) between the reference ground-truth reflectivity and the estimated reflectivity. While the PSNR values of the conventional pixelwise estimation and pseudo-array imaging are 19.3 and 24.9 dB, respectively, that of our method is 29.1 dB; it thus improves on both existing methods by at least 4 dB. We emphasize the difficulty of single-photon imaging in our set-up by computing that the SNR of a single photon’s time-of-flight ranges from −2.2 to 7.8 dB (Supplementary Note 2). We also note that varying the regularization parameters affects the imaging performance (Supplementary Fig. 5). Lastly, the robustness of our reconstruction algorithm is evaluated by imaging an entirely different scene, consisting of a watering can and a basketball, using regularization parameters pre-trained on the mannequin scene (Supplementary Fig. 6).
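For reference, the PSNR figure of merit can be computed as follows (here the peak is taken as the ground-truth maximum; conventions for the peak value vary):

```python
import numpy as np

def psnr_db(ref, est):
    """Peak signal-to-noise ratio in dB, with peak = ref.max()."""
    mse = np.mean((np.asarray(ref) - np.asarray(est)) ** 2)
    return 10.0 * np.log10(ref.max() ** 2 / mse)

# Toy check: a uniform error of 0.1 against a unit-peak reference gives
# MSE = 0.01, hence PSNR = 20 dB.
ref = np.ones((8, 8))
est = ref - 0.1
print(psnr_db(ref, est))   # 20.0
```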

Choice of laser pulse root mean square time duration

For a transform-limited laser pulse, such as the Gaussian s(t) that our imaging framework presumes, the r.m.s. time duration T_p is a direct measure of system bandwidth. As such, it has an impact on the depth-imaging accuracy in low-flux operation. This impact is borne out by the simulation results in Fig. 4, where we see that the pulse waveform with the shortest r.m.s. duration does not provide the best depth recovery. Thus, in our experiments, we broadened the laser’s output pulse to T_p ≈ 1 ns. The full-width at half-maximum is then 2.4 ns. This pulse duration allowed our framework to achieve a mean absolute depth error of 2 cm and resolve depth features well below the cΔ/2 ≈ 6 cm value set by the SPAD array’s 390-ps-duration time bins (see Fig. 4 for details on depth-recovery accuracy versus r.m.s. pulse duration).

Figure 4: Relationship between r.m.s. pulse duration and depth-recovery accuracy. Plot of depth-recovery accuracy of our algorithm (steps 2 and 3) versus r.m.s. pulse duration T_p, obtained by simulating a low-flux SPAD imaging environment with an average of 10 detected signal photons per pixel and 390-ps time bins. Depth recovery is deemed a success if the estimated depth is within 3 cm of ground truth. In all our SPAD-array experiments, we used T_p ≈ 1 ns, which is in the sweet spot between durations that are too short or too long. We emphasize, however, that our algorithm is not tuned to a particular pulsewidth and can be run with any T_p value.

For application of our framework to a different array technology, the optimal pulsewidth should scale with the SPAD camera’s time-bin duration. For example, if our SPAD hardware were replaced to improve the timing resolution to the 50-ps range17, we would want the r.m.s. pulsewidth to remain approximately three times the time-bin duration (equivalently, the full-width at half-maximum approximately six times the time-bin duration), based on our method for choosing the optimal pulsewidth from Fig. 4. Thus, for accurate single-photon imaging with a 50-ps time-binning SPAD array, we would shorten our pulse from 1.1 ns to 140 ps.
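The scaling rule above amounts to holding the ratio T_p/Δ fixed; for a 50-ps bin:

```python
t_p_old = 1.1e-9      # r.m.s. pulsewidth used with the 390-ps array (s)
delta_old = 390e-12   # current time-bin duration (s)
delta_new = 50e-12    # improved time-bin duration (s)

# Keep the ratio T_p / delta (~2.8, i.e. roughly 3x) fixed.
t_p_new = t_p_old * delta_new / delta_old
print(f"{t_p_new * 1e12:.0f} ps")   # 141 ps, i.e. about 140 ps
```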