B. Artificial neural network accelerator

Since power consumption is a primary concern in modern computer system design, improving the power efficiency of neural network execution, i.e., maximizing OPS/W (operations per second per watt), is critical. To tackle this issue, researchers have proposed electronic neural network accelerators such as DaDianNao, an application-specific processor that targets convolutional neural networks and achieves a significant improvement in power efficiency over traditional computing platforms such as CPUs and GPUs. Another representative implementation is the TPU, which is widely used for accelerating commercial AI services. The most notable feature of such accelerators is a large number of MACs (multiply-accumulate circuits) for calculating vector-by-matrix multiplications (VMMs) efficiently; for instance, the first generation of the TPU contains 64K 8-bit MACs in a chip. From the viewpoint of neural network applications, on the other hand, another interesting characteristic is a good tolerance to errors in computing results. This feature makes it possible to apply analog processing to aggressively improve the power efficiency; e.g., ISAAC exploits analog arithmetic in crossbars.

Artificial neural network (ANN) accelerators exploiting nanophotonic devices are promising for the following reasons. First, the VMMs can be implemented with optical elements, and we can expect ultralow latency operations because the computation is done as the light propagates through the nanophotonic device. Electronic digital operations fundamentally require charging and discharging of capacitors, and this mechanism consumes a large amount of electric power. The operation speed, or clock frequency, of state-of-the-art electronic digital systems is limited by power dissipation, which no longer scales down as transistors shrink, so we cannot expect further improvement in the clock frequency. Unlike such traditional electronic digital circuits, nanophotonic devices can operate at the speed of light in an analog fashion, which makes it possible to design an ultralow latency computing platform. Although noise is one of the critical issues of optical analog circuits, the good error tolerance of neural network applications has significant potential to mitigate this disadvantage. Second, highly parallel optical processing, which matches the inherent parallelism of neural network operations, is enabled by taking advantage of various attributes of light waves such as the wavelength, phase, polarization, and amplitude. Third, the high I/O bandwidth provided by an optical interconnect is important even in neural network acceleration. It has been discussed that such a nanophotonic neural network accelerator has the potential to achieve at least three orders of magnitude higher efficiency than an ideal electronic computer. Designing VMM-optimized hardware is now entering the mainstream, and this trend strongly suggests that photonic ANN research could have a real-world impact on energy efficient AI computing.

Another configuration is a Mach-Zehnder interferometer (MZI) based VMM, as shown in the accompanying figure. The MZI consists of two directional couplers (DCs), each of which divides two input signals equally between two output ports, and two phase shifters (PSs) that modulate the input signals. The amount of phase shift $(\theta, \phi)$ can be controlled by charging the phase shifters or changing their temperature. The transfer function of the MZI is a $2 \times 2$ matrix that relates the electric field amplitudes of the two output signals to those of the two input signals. By adjusting $(\theta, \phi)$, the MZI can distribute the light signal of each input port to the two output ports at an arbitrary ratio, so it can be used to implement a $2 \times 2$ unitary transformation. Reck and Clements have proposed that an arbitrary $N \times N$ unitary transformation can be realized by exploiting this property of the MZI. As shown in the figure, the matrix of the VMM is represented by two MZI-based unitary matrix switches and a set of either attenuators or amplifiers. The unitary matrix switches represent the two $N \times N$ unitary matrices of the singular value decomposition (SVD) and together perform a pair of unitary conversions. The attenuators or amplifiers represent the $N \times N$ diagonal matrix $\Sigma$ of the SVD. Note that Clements's implementation offers higher fidelity, excellent tolerance to noise, and a smaller size compared with Reck's. These differences are due to the arrangement of the MZIs: the unitary matrix switches are triangular in Reck's circuit but rectangular in Clements's implementation. Detailed operation principles and mathematical proofs can be found in the cited references.
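To make the role of the two phase shifts concrete, the following minimal sketch builds a 2 × 2 MZI transfer matrix and checks that it is unitary. The convention used here is an assumption, not necessarily the one behind the transfer function of this paper: ideal 50:50 couplers with matrix (1/√2)[[1, i], [i, 1]], both phase shifters on the upper arm, and the ordering PS(φ), DC, PS(θ), DC. The function names `mzi_transfer` and `split_ratio` are hypothetical.

```python
import cmath
import math


def matmul2(a, b):
    """Multiply two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def phase_shifter(angle):
    """Phase shift applied to the upper arm only (assumed placement)."""
    return [[cmath.exp(1j * angle), 0], [0, 1]]


def mzi_transfer(theta, phi):
    """Transfer matrix for light passing PS(phi), DC, PS(theta), DC in turn."""
    s = 1 / math.sqrt(2)
    dc = [[s, 1j * s], [1j * s, s]]  # ideal lossless 50:50 coupler (assumed)
    return matmul2(dc, matmul2(phase_shifter(theta),
                               matmul2(dc, phase_shifter(phi))))


def split_ratio(theta):
    """Power fractions at (upper, lower) output for light in the upper input."""
    t = mzi_transfer(theta, 0.0)
    return abs(t[0][0]) ** 2, abs(t[1][0]) ** 2


if __name__ == "__main__":
    for theta in (0.0, math.pi / 2, math.pi):
        print(theta, split_ratio(theta))
```

Under this convention the upper-output fraction is sin²(θ/2) and the lower-output fraction is cos²(θ/2), so sweeping θ from 0 to π moves the light from one output port to the other, while φ only sets a relative input phase.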

Another configuration is a wavelength division multiplexing (WDM) VMM, as shown in the accompanying figure. In the WDM-VMM, the input vector $\mathbf{x}$ is represented by wavelength-multiplexed light from the light sources, the matrix $W$ by cascaded ring resonators, and the output vector $\mathbf{y}$ by a set of photodetectors. The WDM input signal, in which each element of the input vector is assigned a unique wavelength carrier, is equally split so that it passes through a series of ring resonators. The output is then summed up by the photodetectors. Ring resonators are placed next to a waveguide to couple only the light of a specific wavelength. When the ring circumference equals an integer number of wavelengths, which is the resonance condition, the optical signal in the waveguide is dropped to the ring and lost due to scattering. As in the SLM-VMM, the levels of light intensity are treated as data values in analog form, and multiplications are implemented by the ring resonators, whose loss rate can be controlled by injecting a charge into the ring or changing its temperature. For instance, if we assume $(x_1, x_2, x_3, x_4) = (1.0, 1.0, 1.0, 1.0)$ and $(w_{11}, w_{12}, w_{13}, w_{14}) = (0.0, 1.0, 1.0, 1.0)$, the intensity level of $x_1$ becomes zero due to the effect of the associated ring resonator, and the other signals, $x_2$, $x_3$, and $x_4$, are gathered by the associated photodetector to obtain the final result $y_1$. In general, cascading multiple microring resonators can cause bandwidth narrowing and increase the complexity of the control operation. Hence, it is important to explore the design space and to select appropriate parameters for well-designed microring resonators, as done in the cited work. Recently, a high-resolution optical WDM-VMM has also been demonstrated. It uses the same number of microring resonators as the earlier design, but the topology is different. This work uses microring resonators with a low quality (Q) factor and demonstrates high linearity and high resolution.
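The numerical example above can be checked with a short sketch. It is an idealized intensity model under stated assumptions: each ring removes a fraction `ring_loss[j]` of exactly one carrier, there is no crosstalk between wavelengths, and the photodetector sums the remaining intensities. The names `wdm_row_output` and `ring_loss` are hypothetical.

```python
def wdm_row_output(x, ring_loss):
    """Photodetector reading for one row of an idealized WDM-VMM.

    x[j]         : intensity carried on wavelength j
    ring_loss[j] : fraction of carrier j removed by its ring, in [0, 1]
    """
    if len(x) != len(ring_loss):
        raise ValueError("x and ring_loss must have the same length")
    if any(not 0.0 <= loss <= 1.0 for loss in ring_loss):
        raise ValueError("each loss fraction must lie in [0, 1]")
    # The effective weight is the fraction of light that survives the ring.
    weights = [1.0 - loss for loss in ring_loss]
    # The photodetector adds up the intensities of all carriers.
    return sum(w * xi for w, xi in zip(weights, x))


x = [1.0, 1.0, 1.0, 1.0]
w_row = [0.0, 1.0, 1.0, 1.0]             # first row of the matrix
ring_loss = [1.0 - w for w in w_row]     # weight 0.0 means the ring drops everything
y1 = wdm_row_output(x, ring_loss)
print(y1)  # 3.0
```

The first carrier is fully dropped by its ring, so only the other three intensities reach the photodetector.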

One configuration is a spatial light modulator (SLM) VMM, as shown in the accompanying figure. In the SLM-VMM, the input vector $\mathbf{x}$ is represented by a set of light sources, the weight matrix $W$ by an SLM, and the output vector $\mathbf{y}$ by a set of photodetectors. The SLM-VMM is scalable because it utilizes free space optics. On the other hand, a planar integrated SLM-VMM is compact and compatible with electronic VLSI technologies and microfabrication. Unlike traditional electrical circuits, which use voltage levels to represent digitized information, the SLM-VMM maps data values onto levels of light intensity in analog form. Multiplications and additions are performed by weakening the light intensity and by collecting multiple light waves at a photodetector, respectively. Since the intensity of a light source is reduced after passing through an SLM cell, the light transfer characteristic is a simple proportionality: the output intensity equals the input intensity multiplied by a coefficient given by the transmittance of the SLM cell. Therefore, the output element $y_1 = w_{11}x_1 + \cdots + w_{1N}x_N$ can be obtained optically by associating $x_j$ and $w_{1j}$ with the source intensity and the cell transmittance, respectively, and by gathering the weakened light corresponding to the first row of the matrix at the first photodetector. Another method to implement the SLM is to use an optomechanical micromirror device based on microelectromechanical system (MEMS) technology. This device deflects light signals digitally by selectively redirecting parts of them.

We describe optical analog VMMs. The matrix product is the basic operation of many kinds of information processing, especially approximate computing such as neural networks. A VMM generally requires a long delay in CMOS-based circuits, and hence, a significant acceleration can be expected by applying an optical VMM. The operation of the VMM is defined by $\mathbf{y} = W\mathbf{x}$, where $\mathbf{x}$ is the input vector, $W$ is the matrix, and $\mathbf{y}$ is the output vector. The accompanying figure illustrates three types of optical VMM implementations for the case with $N = 4$. The latency of each VMM against the number of pixels ($N$) is summarized in Table I. The latency is determined by the longest optical path for the light traveling from the light source (LS) to the photodetector (PD). The longest optical path of an $N \times N$ VMM mainly depends on the number of modulators on that path and on the waveguide length. The number of modulators that the light wave passes between the LSs and the PDs depends on the VMM configuration, as shown in Table I. The waveguide length is the same regardless of the VMM configuration. Therefore, the latency of each VMM in Table I is calculated by adding these two factors. For example, the latency of the spatial light modulator (SLM)-VMM is the sum of a single modulator stage and the waveguide term. As a consequence, the performance does not differ significantly with the size of the VMM. Another important consideration is the energy consumption. When an optical signal passes through an optical modulator, it loses a small amount of energy. This means that the amount of lost energy is roughly proportional to the number of optical devices, i.e., the first row of Table I. If a constant amount of energy is assumed to be lost in each optical modulator, the SLM-VMM consumes the lowest energy.

The energy cost is mainly incurred by the phase shifters and photodetectors. A light signal from a light source loses energy as it passes through the phase shifters, is then detected by the photodetector, and is finally converted into a photocurrent. Assume that all the energy of the optical signal at the photodetector is converted (in fact, the conversion efficiency is <30%), irrespective of the programmed values. The power consumption of the VMM is then expressed in terms of the power of the light source. This power model can be applied to closed systems in which only the light sources provide energy for optical calculations and the final outputs are detected only by photodetectors. Based on Eqs. (5) and (8), the power efficiency of the optical VMM is derived as follows:

Maximizing the power efficiency of optical analog VMMs is a critical challenge for photonic ANN accelerators. Here, we focus on Clements's MZI-VMM implementation in Table I and define the throughput, or performance, as the number of multiply-accumulate operations per second. For a VMM with an $N \times N$ matrix, it is evaluated from the operation frequency $f$. This frequency is limited by several factors: the maximum operation frequency of the light sources, $f_{\mathrm{LS}}$, that of the photodetectors, $f_{\mathrm{PD}}$, that of the two phase shifters in the MZI, $f_{\mathrm{PS}}$, and that of the optical data-path, which is the reciprocal of the VMM circuit latency. In the MZI-VMM, the data-path frequency depends on the length of the longest path from a light source input to a photodetector output in the optical circuit and can be modeled as Eq. (7), where $S_{\mathrm{MZI}}$ is the size (or length) of an MZI. From the viewpoint of photonic neural-network execution, $f_{\mathrm{LS}}$ and $f_{\mathrm{PS}}$ directly limit the maximum frequency at which the next input data can be fed for the input vector and the matrix, respectively.
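The relationship among these limiting frequencies can be illustrated with a small sketch. The exact forms of Eqs. (5) and (7) are not reproduced in this text, so the formulas below are assumptions: throughput is taken as N² multiply-accumulate operations per cycle times the operation frequency, the operation frequency as the minimum of the four limits, and the data-path frequency as the reciprocal of the time light needs to cross N MZIs of length S_MZI. The function names and the `stages_per_n` and `group_index` parameters are hypothetical.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s


def path_frequency(n, s_mzi_m, stages_per_n=1.0, group_index=1.0):
    """Reciprocal of the optical latency of the longest path (assumed model).

    Assumes the longest path crosses stages_per_n * n MZIs of length s_mzi_m
    and that light travels at C_VACUUM / group_index.
    """
    path_length_m = stages_per_n * n * s_mzi_m
    return C_VACUUM / (group_index * path_length_m)


def operation_frequency(f_ls, f_pd, f_ps, f_path):
    """The slowest component sets the operation frequency."""
    return min(f_ls, f_pd, f_ps, f_path)


def throughput_macs_per_s(n, f):
    """Assumed throughput model: N*N MACs per cycle at frequency f."""
    return n * n * f


# Illustrative values taken from the "Standard" column of Table II,
# with f_LS set equal to f_PD.
n = 6
f_path = path_frequency(n, s_mzi_m=100e-6)
f = operation_frequency(f_ls=40e9, f_pd=40e9, f_ps=12.5e9, f_path=f_path)
print(f, throughput_macs_per_s(n, f))
```

With these illustrative values the phase shifters are the bottleneck, which is consistent with the statement that the phase-shifter frequency directly limits how fast new weights can be fed.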

3. Impact of device/architecture/application-level codesign for power efficient optical VMM

Owing to maturing nanophotonic device technology, ANN acceleration has good prospects for low-power, low-latency, and high-throughput computing; however, there are some drawbacks arising from the optical devices. For instance, unlike traditional electronic digital computers, optical analog calculations are prone to noise, which degrades the computing accuracy. Another factor is the insertion loss of the phase shifter. To maximize the power-performance potential, it is necessary to take full advantage of nanophotonic devices while circumventing their shortcomings. Cross-layer interactions, or device/architecture/application-level codesigns, are crucial to resolve these issues. There is a trade-off between the power efficiency and the computation accuracy in optical VMMs. If the target applications tolerate computational errors, as neural network inference does, a drastic power reduction can be achieved, as described hereafter. Device designers tend to concentrate only on minimizing the device footprint, and hence, little attention has sometimes been paid to how much such a codesign optimization affects the system-wide power efficiency.

First, we introduce our evaluation platform, as shown in Fig. 8 (Ref. 32: S. Kawakami, T. Ono, M. Notomi, and K. Inoue, “Evaluation platform for a nanophotonic neural network accelerator (in Japanese),” IEICE Trans. J102-A(6), 182–193 (2019)). The evaluation platform takes the hardware configuration as input, including microarchitectural parameters such as the size of the VMM, $N$, device parameters such as the MZI size, and some noise parameters. Some of these parameters are interdependent. For example, there is a trade-off between the transmittance or loss of the MZI and the modulation bandwidth of the PS. However, in this evaluation, these parameters are assumed to be independent in order to clarify which parameter contributes to the power efficiency. In other words, we ignore the dependence among parameter values in order to indicate the direction of device parameter improvement. Note that the power loss of the MZI is defined as $\mathrm{loss} = -10 \times \log_{10}(\mathrm{transmittance})$, and therefore, evaluating the power efficiency using the transmittance is equivalent to evaluating it using the power loss.
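The equivalence between transmittance and loss can be made explicit with a short conversion sketch. It applies the definition loss = −10 × log10(transmittance) to the two MZI transmittance values listed in Table II. The function names are hypothetical, and the cascade helper is only an illustration (it assumes identical, independent stages) of why multiplying transmittances corresponds to adding losses in dB.

```python
import math


def transmittance_to_loss_db(transmittance):
    """Power loss in dB for a transmittance in (0, 1]."""
    if not 0.0 < transmittance <= 1.0:
        raise ValueError("transmittance must lie in (0, 1]")
    return -10.0 * math.log10(transmittance)


def loss_db_to_transmittance(loss_db):
    """Inverse conversion: transmittance for a given loss in dB."""
    return 10.0 ** (-loss_db / 10.0)


def cascaded_transmittance(t_mzi, stages):
    """Transmittance after passing a number of identical MZIs (illustrative)."""
    return t_mzi ** stages


for t in (0.9, 0.99):  # "Standard" and "Advanced" values of T_MZI in Table II
    print(t, round(transmittance_to_loss_db(t), 4))
```

A transmittance of 0.9 corresponds to a loss of about 0.46 dB per MZI, and 0.99 to about 0.044 dB per MZI.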

In the software configuration, users can set neural network parameters such as the network structure, hyperparameters, and the dataset for machine learning. The platform includes an optical simulation engine and power-performance models.

Here, we discuss the impact of the cross-layer codesign on Clements's MZI-VMM circuit in Table I. The blue hatched part of the evaluation platform in Fig. 8 is used in this evaluation. Table II shows some representative values of the design parameters we focus on, and we expect the performance of the MZI-VMM to improve from “Standard” to “Advanced” in the near future. Note that in this evaluation, the calculation accuracy depends on the effects of the shot noise of the light source, the shot noise of the dark current and the thermal noise of the PD, and the fluctuation of the modulators. Each of them has been obtained based on general parameter values: $\sigma_{\mathrm{shot}}^2 = 7.02 \times 10^{-11}$, $\sigma_{\mathrm{dark\_current}}^2 = 1.60 \times 10^{-17}$, $\sigma_{\mathrm{thermal}}^2 = 1.66 \times 10^{-12}$, and $\sigma_{\mathrm{mod}}^2 = 10^{-15}$. We assumed that the power of all the light sources ranges from −40 dBm to 20 dBm. It is also assumed that the signal calculated by the analog VMM is quantized to 8 bits and that the maximum operation frequency of the light sources, $f_{\mathrm{LS}}$, is the same as that of the photodetectors, $f_{\mathrm{PD}}$.
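Two of these assumptions are easy to express in code: the light-source power range given in dBm and the 8-bit quantization of the analog result. The sketch below uses the standard dBm definition (0 dBm = 1 mW). The quantizer is an assumption for illustration only: a uniform, unsigned quantizer over a normalized range [0, full_scale], which is not necessarily what the evaluation platform implements. The function names are hypothetical.

```python
def dbm_to_watts(p_dbm):
    """Convert optical power from dBm to watts (0 dBm = 1 mW)."""
    return 1e-3 * 10.0 ** (p_dbm / 10.0)


def quantize(value, bits=8, full_scale=1.0):
    """Uniform unsigned quantizer (assumed form).

    Returns the integer code and the analog value that the code represents.
    """
    levels = 2 ** bits - 1
    clipped = min(max(value, 0.0), full_scale)  # saturate out-of-range inputs
    code = round(clipped / full_scale * levels)
    return code, code * full_scale / levels


for p_dbm in (-40.0, 20.0):  # ends of the assumed light-source power range
    print(p_dbm, dbm_to_watts(p_dbm))

code, approx = quantize(0.3)
print(code, approx)
```

The stated range of −40 dBm to 20 dBm therefore spans 100 nW to 100 mW per light source, six orders of magnitude, while the assumed 8-bit quantizer resolves the normalized output into 256 levels.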

TABLE II. Variable parameters.

Symbol        Description                  Standard    Advanced
S_MZI (μm)    Size (or length) of MZI      100         1
f_PS (GHz)    Frequency of PS              12.5        100
f_PD (GHz)    Frequency of PD              40          100
N (−)         N × N VMM scale              6           up to 1024
T_MZI (−)     Transmittance rate of MZI    0.9         0.99