
Deep Learning Photonic Accelerators: A Survey

Mohammad Atwany* Solomon Serunjogi Mahmoud Rasras*


Mohammad Atwany
Address: Department of Electrical Engineering, NYU Tandon School of Engineering, 6 MetroTech Center, Brooklyn, NY 11201, USA
Email Address: mza7327@nyu.edu
Solomon Serunjogi, Mahmoud Rasras
Address: Electrical Engineering, NYU Abu Dhabi, Saadiyat Island, Abu Dhabi 129188, UAE
Email: mr5098@nyu.edu
Keywords: Photonics Deep Learning Accelerators, Deep Learning Neural Networks, All-Optical Neural
Networks
Deep learning is gaining unprecedented momentum across industry, medicine, and many other sectors, and is widely regarded as a pinnacle of technological advancement. However, training deep neural networks is computationally demanding, and the field continues to grow rapidly. Consequently, hardware accelerators for Deep Learning (DL) based on electrical components have become increasingly popular in recent years due to their improved performance and energy efficiency over conventional Central Processing Units (CPUs) and
Graphics Processing Units (GPUs). However, these accelerators are limited by fundamental bottlenecks: electronic processors offer limited computation and performance-per-watt capabilities due to the slowdown in Complementary Metal-Oxide-Semiconductor (CMOS) scaling. Additionally, modern processors rely on metallic interconnects, which are not scalable and incur significant bandwidth, latency, and energy inefficiencies. As a CMOS-compatible alternative, silicon photonics is emerging as a promising platform for deep learning accelerators that use light both to communicate and to compute. This paper examines photonic accelerators for deep learning models under different subcategories and explores potential gaps for improvement in this domain.

1 Introduction
Since the advent of computers, researchers have been captivated by the prospect of endowing machines
with human-like intelligence, a field known as Artificial Intelligence (AI). The ultimate goal is to em-
power machines with cognitive abilities, including abstract thinking, decision-making, adapting to new
situations, creativity, and social skills [1]. AI focused on specific tasks has seamlessly integrated into daily life, evident in applications such as automatic photo-tagging, customer-service advisory calls, and personalized product recommendations.
The impact of AI extends even further, encompassing its potential to enhance medical diagnosis, drug
design, and cancer detection/treatment. This rapid advancement in AI can be largely attributed to the
swift growth in computational power, encompassing both data storage and processing speed. Further-
more, the rapid progress in deep learning algorithms and their applications across various fields has spurred
the demand for high-performance computing platforms.
With that in mind, a key challenge in AI computing is the von Neumann bottleneck [2, 3, 4], stemming from the architectural design of computer systems. This bottleneck reflects the limits on system throughput imposed by the bandwidth constraints on data transferred in and out of memory. To address this computational bottleneck, which is primarily caused by the physical separation between the CPU and memory, various strategies have been investigated that use modified von Neumann devices, such as GPU architectures, to process large models and datasets [5, 6, 7]; however, the scalability of such approaches is still limited by the latency of fetching data from memory islands. However, the
past decade has witnessed a growing interest in exploring alternative computing paradigms to acceler-
ate deep learning tasks. These alternatives include the exploration of optical-related disciplines such as
metasurfaces, non-linear photonics, and the utilization of photonics accelerators [8, 9].
Photonic accelerators, also known as optical accelerators, have emerged in the last decade, building on prior enabling photonic technologies such as modulators, photo-detectors, and optical filters [10]. This growth is illustrated in Figure 1 in terms of publications per year, set against the major evolutionary milestones of traditional electronic AI-based technologies such as AlexNet [11], AI co-processors [12], Microsoft Project Catapult [13], and in-memory computing [14]. Unlike traditional electronic components such as transistors, electronic switches, modulators, and microprocessors, photonic accelerators

Figure 1: Published literature on Photonic accelerators.

utilize light particles (photons) to process information. This innovative approach enables parallel pro-
cessing and faster information transfer, making it particularly advantageous for AI workloads involving
intensive matrix calculations in neural network operations. The success of photonic-based accelerators has been driven by decades of innovation at the device and chip level of optical systems, building upon foundational photonic technologies such as lasers, modulators, photo-detectors, and optical filters.
The figure highlights key optical devices and systems introduced in silicon photonics in the early 2000s, such as photonic Wavelength Division Multiplexing (WDM) filters [15, 16], Mach-Zehnder Interferometer (MZI) modulators [17, 18], and IQ modulators [19]. This evolution continued with the advent of smaller-sized Microring Resonators (MRRs), which are crucial in many optical filter designs and in high-speed Non-Return-to-Zero (NRZ) modulators [20] with bandwidths around 25 GHz. Four-level Pulse Amplitude Modulation (PAM4) schemes have been explored using ring resonators to increase the throughput per device footprint [21]. These ring resonators, possessing high Q factors, have been engineered to function as switches, integrators, differentiators, and memory elements at terahertz (THz) frequencies.
Recently, these devices have been integrated to create optical neural networks that are energy-efficient,
compact, and offer high throughput [22]. Figure 1 also includes two throughput/mm² lines for compar-
ison. The first line, indicating 25 TOPS/mm², has been achieved by various optical devices. This high
level of performance is achieved at lower power consumption due to parallelism (utilizing multiple wavelengths) as well as a smaller footprint per wavelength. Some examples of these innovations are photonic
tensor cores [23, 24, 25], programmable phase-change metasurface for multimode photonic Convolutional
Neural Networks (CNNs) [26, 27], in-memory computing, and hybrid co-processors.
In contrast, the second line in the figure shows a throughput of 10 TOPS/mm², representing the pro-
jected maximum capability of current non-photonic hardware accelerators. This comparison underscores
the advancements and potential of photonic technologies in achieving higher throughput and efficiency in
computing, particularly in the context of neural network processing and AI acceleration. The earliest optical accelerators can be traced to assemblies of discrete bench-top optical components interconnected with long fiber spools, intended to perform canonical mathematical functions [28, 29]. The first such function essential to AI computations is the unitary matrix transformation, first demonstrated optically in 1994 using optical beam splitters [30]. This development laid the groundwork
for subsequent advancements in photonic integrated computations using MZIs. It was later shown by
Miller et al. [31, 32, 33] that meshes of MZI networks could be self-configured to define adaptive filter
functions. Such reconfigurable networks are promising candidates for building adaptive neural networks

and photonic FPGA (Field-Programmable Gate Array) systems [34, 35, 36, 37].
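As a concrete illustration, the 2 × 2 building block of such meshes can be modelled numerically. The sketch below uses one common parameterization (two 50:50 couplers around an internal phase shift θ, followed by an external phase shifter φ); conventions vary between papers, so the exact matrix should be read as an assumption rather than a canonical form.

```python
import numpy as np

def mzi(theta, phi):
    """2x2 transfer matrix of a lossless MZI: a 50:50 coupler, an internal
    phase shift theta on one arm, a second 50:50 coupler, and an external
    phase shifter phi on one output (one common convention among several)."""
    bs = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)   # 50:50 coupler
    inner = np.diag([np.exp(1j * theta), 1.0])        # internal phase arm
    outer = np.diag([np.exp(1j * phi), 1.0])          # external phase shifter
    return outer @ bs @ inner @ bs

T = mzi(0.7, 1.3)
# A lossless MZI is unitary: energy is redistributed, never dissipated.
assert np.allclose(T @ T.conj().T, np.eye(2))
```

Meshes of such 2 × 2 blocks, arranged in Reck- or Clements-style layouts, can be configured to realize any N × N unitary, which is the core linear operation underlying MZI-based accelerators.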
However, optical computing has been viewed skeptically in applications that require large data storage
and efficient flow control such as activation functions and non-linear computations. Therefore, the cur-
rent trend is to use photonic accelerators in areas that maximize the inherent advantages of optics. These
applications include parallel computing using WDM, polarization diversity, and mode multiplexing [38].
The long path length often associated with MZI circuits, however, has challenged their suitability for large-scale photonic circuits such as accelerators and FPGA fabrics, raising considerable concerns about high latency and insertion loss [39].
MRRs, on the other hand, offer an alternative with better scalability and compactness. When light passes through ring resonators, such as in 2 × 2 switches, the drop port of the switch induces a time delay determined by the Q factor of the ring [40, 41, 42, 43]. The latency can be tuned either by inserting Phase-Change Materials (PCMs) as cladding or by cascading additional switches in tandem. The phase transition of the PCMs leads to appreciable alterations in their optical properties, controllable either electrically or optically [27, 44]. This characteristic offers a notable advantage in power efficiency for programmable photonic devices compared to electro-optic or thermo-optic methods [45, 46].
Moreover, incorporating non-volatile PCMs into photonic devices enables optical memory functions and in-memory computing, achieved by transmitting optical input through the programmed device. Optical memory in ring resonators has also been studied using Volterra series in microwave photonics [47]. The memory effect is modelled as a multi-dimensional impulse response in the time domain, or as Volterra kernels in the frequency domain. By using the ring resonator as a differentiator, it is possible to induce nonlinear mixing of multiple wavelengths to realize a frequency-dependent memory function. Yet another key enabling technology, highlighted in Figure 1, for realizing synaptic-like non-linear functions in photonic computing is the optoelectronic neuron.
In certain applications, such as deep learning inference, the trained synaptic weights may not require frequent updates, or any at all. Here, non-volatile analog memory, as in “in-memory” computing, is advantageous. This can be achieved either optically or electronically using PCMs [48, 49, 50, 51]. By using digital electronic drivers with photonics-compatible firmware, a real-time neural network can be established. Such a neuron is a hybrid of well-modelled electronic non-linearities and negligibly lossy optical systems. Photodetectors (PDs), modulators, and lasers are combined to build such a device, with the PD generating electrical current in proportion to the incident optical power in a waveguide [23, 52, 53, 54].
Likewise, a spiking function can be realized when photons are generated from a threshold-based semicon-
ductor laser [55]. Such optoelectronic spiking neural networks have been realized with very low energy
consumption of 4.8 fJ/bit while operating at 10 Gbit/s [53]. However, such neurons require constant biasing of the laser source, which increases the overall power consumption for a photonic accelerator with many
neurons. Nonetheless, spiking neural networks present attractive energy efficiency metrics owing to the
spontaneous and mostly idle mode of operation in neuromorphic communication [56].
Figure 1 also plots two lines indicating the projected speed of operation for various technologies and systems. The reported compute efficiency, together with the theoretical projections in TOPS per mm², helps to classify the throughput contribution of the various technologies. Here, TOPS (tera-operations per second) normalized by the processor area serves as a figure-of-merit for performance. For the SiN devices, the area of one MAC unit cell is 285 µm × 354 µm [57, 58]. When operating at 12 GHz with 4 input vectors via WDM, this corresponds to a compute density of 1.2 TOPS/mm². If SOI MRR devices with a nominal bend radius of 5 µm are used instead, the area of the MAC unit cell could be reduced to less than 30 µm × 30 µm, increasing the compute density to 420 TOPS/mm² per input channel [59, 60]. In-memory-computing photonic tensor cores show predicted compute densities and efficiencies of 880 TOPS/mm² and 5.1 TOPS/W for a 64 × 64 crossbar core at a 25-GHz clock speed [61]. Compared with digital electronic accelerators (ASICs and GPUs), the photonic core offers orders-of-magnitude improvements in both compute density and efficiency.
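The TOPS/mm² figure of merit used above reduces to simple arithmetic. The sketch below assumes 2 operations (one multiply, one add) per MAC per clock cycle per wavelength; since op-counting conventions differ between papers, it lands near, rather than exactly on, the 1.2 TOPS/mm² quoted for the SiN cell.

```python
def compute_density_tops_per_mm2(clock_hz, n_wavelengths, cell_um2, ops_per_mac=2):
    """Compute density of a single MAC cell, assuming each wavelength
    contributes ops_per_mac operations per clock cycle."""
    ops_per_s = clock_hz * n_wavelengths * ops_per_mac  # ops from one cell
    area_mm2 = cell_um2 * 1e-6                          # um^2 -> mm^2
    return ops_per_s / 1e12 / area_mm2                  # TOPS per mm^2

# SiN MAC cell from the text: 285 um x 354 um, 12 GHz clock, 4 WDM inputs
sin_density = compute_density_tops_per_mm2(12e9, 4, 285 * 354)
print(f"SiN cell: ~{sin_density:.2f} TOPS/mm^2")
```

The same formula shows why shrinking the cell to a 30 µm × 30 µm MRR footprint raises the density by roughly the ratio of the two areas (about 100×), which is the lever behind the much larger figures quoted for SOI MRR devices.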
This study investigates and reviews the implementation of deep learning accelerators from a very gran-

Figure 2: Classification of Photonic accelerators.

ular design-methodology perspective to a very broad application perspective. The rest of the paper is organized as follows: Section 2 discusses three broad classes of photonic accelerators. Next, Section 3 provides an overview of how Deep Learning (DL) computations work and what role accelerators play in these processes. In Section 4, DL photonic accelerators are classified into seven distinct categories based on their design. Section 5 then discusses methodological approaches aimed at improving design efficiency. Finally, Section 6 concludes by pointing out possible research gaps that remain unexplored and discussing the extracted key points.

2 Photonics Accelerators Classes


The hierarchical classification of photonic accelerators for machine learning organizes them into three primary domains, referred to as classes in this section: Modalities, Operating Mechanisms, and Applications, with subcategories defined by their distinct functions, as shown in Figure 2. This breakdown aids in understanding their roles within the greater context of machine-learning hardware acceleration.

2.1 Modalities
Optical Processing Units (OPUs) leverage photonic devices for machine learning applications, perform-
ing a broad spectrum of mathematical and logical tasks crucial for deep learning algorithms. Photonic
Integrated Circuits (PICs), the predominant form of OPUs, are engineered for efficiency in operations
such as matrix multiplications and convolutions, which are fundamental in deep learning models [62]. Optical processors implementing matrix–vector multiplications at Gb/s processing rates have been demonstrated in the literature [63, 64, 65, 66, 67]. One example of such an OPU is LightOn [63], which has been shown to operate at 50 TeraOPS/W with input vector dimensions on the order of 1 million × 2 million. This OPU circumvents the limitations of the von Neumann architecture by reducing the computation time from O(n²) to O(1), i.e., the computation time is independent of the data size. OPUs that break memory-dependent computation would allow direct single-chip implementation of larger datasets beyond Random Access Memory (RAM) limits. This emerging field leverages photons
for energy-efficient matrix multiplications, capitalizing on their high speed and compatibility with the
semiconductor industry. Companies such as Lightmatter, Lightelligence, Luminous, and LightOn [68] are at the forefront of developing Photonic Neural Networks (PNNs) for low-power Multiply-and-accumulate


(MAC) computations, significantly outperforming conventional digital and analog electronics.


Another common attribute of signal processing is the use of Fourier transforms, which appear in the form of complex multiplications. With the gradual decline of Moore's law and the simultaneous rise of AI, the need for OPUs focused on such tasks has emerged. Photonic vector multiplications have been performed using plane-light conversion, MZI, and WDM structures, as reported in [69].
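One standard way interferometer meshes are programmed with an arbitrary (non-unitary) weight matrix is via the singular value decomposition, W = UΣV†: the unitaries U and V† map onto MZI meshes, and Σ onto per-channel attenuators or gain elements. A minimal numerical sketch of that factorization, with an arbitrarily chosen 4 × 4 matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))      # weight matrix to be "programmed"

# SVD: W = U @ diag(s) @ Vh. U and Vh are unitary, so each can be realized
# by a lossless interferometer mesh; singular values s become per-channel
# attenuation (or gain) elements between the two meshes.
U, s, Vh = np.linalg.svd(W)

x = rng.normal(size=4)           # input vector (optical field amplitudes)
y_staged = U @ (s * (Vh @ x))    # staged evaluation, as light traverses it
assert np.allclose(y_staged, W @ x)
```

The staged product reproduces the full matrix–vector multiplication exactly, which is why the two-mesh-plus-attenuators layout suffices for general linear layers.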
Quantum Dot (QD)-based OPUs incorporate nanoscale semiconductor particles to enhance OPU functionality. Semiconductor QDs are zero-dimensional, quantum-confined devices that exhibit distinct electronic and optical characteristics. The three-dimensional quantum confinement within QDs leads to the total localization of carriers, producing a discrete spectrum characterized by a δ-function-like density of states [70]. The precision control afforded by these quantum dots over photon emission
and absorption translates to more effective processing tailored for specific machine learning tasks, thereby
expanding the versatility of photonic processing applications [71].
The authors of [72] furthered the domain with the use of coupled quantum-well devices on-chip, highlighting their potential for creating excitable neuromorphic networks [73]. The PIC in question consists of quasi-single-mode slotted Fabry–Pérot lasers coupled via an actively pumped waveguide. The research
suggests that substituting quantum well material with QD material could enhance these networks, en-
abling a variety of controllable excitable states, including dual-state excitability and dual-state bursting
mixed-mode oscillations.
Interestingly, in his 1945 report, von Neumann employed the concepts of synapses, neurons, and neural networks, similar to the QD mode of operation, to describe his suggested architectural design, subsequently forecasting its constraints, now known as the von Neumann bottleneck [74]. He stated that “the main bottleneck of an automatic very high-speed computing device lies: At the memory.” Neuromorphic computers based on non-linear elements such as QDs, PCMs, and Spatial Light Modulators (SLMs) operate differently, using directed graphs that align computing units and memory more effectively [75, 76, 77, 78, 79]. These
systems feature memory as synaptic weights linked to each node pair in the graph. This arrangement
of data near processing units enables neuromorphic architectures to bypass the processing-memory bot-
tleneck, allowing for simultaneous and parallel processing of various data streams, akin to the human
brain’s functioning.
Adaptive and reconfigurable OPUs represent a versatile subgroup capable of altering processing parameters dynamically, adapting to the demands of varied machine learning workloads [80]. These OPUs eliminate the need for physical hardware modifications, ensuring cost-effectiveness and resource efficiency; such devices have been demonstrated in [81] and [82]. The authors in [81] characterize two thermally tuned photonic integrated processors in silicon-on-insulator and silicon nitride platforms for analyzing dominant features in convolutional neural networks.
Moreover, in [27], a compact, programmable waveguide mode converter based on a Ge2Sb2Te5 (GST)-enhanced phase-gradient metasurface was demonstrated. The converter utilizes changes in the refractive index of GST to control the waveguide spatial modes with up to 64 levels. This contrast represents the matrix elements, with 6-bit resolution, to perform matrix-vector multiplication in convolutional neural networks.
The design featured high programming resolution and was used to construct a photonic kernel using an
array of these programmable phase change metasurface (PMMC) devices, enabling an optical convo-
lutional neural network to be designed for image processing and recognition tasks. The findings high-
light that phase-change photonic devices, like the demonstrated PMMC, offer robust and versatile pro-
grammability. The authors also reveal the potential for a wide range of unique optical functions, making
them suitable for large-scale optical computing and neuromorphic photonics. Innovations in this cate-
gory also address the issue of noise—often a challenge in photonics-based systems—through advanced
noise reduction and error correction techniques, ensuring the accuracy and reliability of machine learning
computations [83, 84, 85, 86, 27].
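The 64-level (6-bit) programmability described above can be mimicked in software with uniform quantization, which gives a feel for the resolution such a device imposes on a weight matrix. A hypothetical sketch (the weight values and range mapping are illustrative, not taken from [27]):

```python
import numpy as np

def quantize(w, bits=6):
    """Uniformly quantize weights to 2**bits levels over their range,
    mimicking a 64-level programmable transmission element."""
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (levels - 1)
    return lo + np.round((w - lo) / step) * step

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))       # illustrative convolution-kernel weights
Wq = quantize(W)                   # what the 6-bit device could represent

# Quantization error is bounded by half a programming step
step = (W.max() - W.min()) / 63
print(f"max quantization error: {np.abs(W - Wq).max():.4f} (step/2 = {step/2:.4f})")
```

Running a trained network with `Wq` in place of `W` is a quick way to estimate how much accuracy a given programming resolution would cost before committing to hardware.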
The classification further extends to photonic-electronic hybrid OPUs, which synergize with conventional
CPUs and GPUs to offload computational tasks. Examples of hybrid optical networks with specific func-
tionalities have been demonstrated which include Spiking Neurosynaptic Optical Networks (SNONs)
[87], and photonic-electronic deep neuron networks [88]. Additional reconfigurability has been demon-


Figure 3: Analog Photonics ODE Solver. Reproduced without changes under the terms of the CC-BY license [100]. 2016, Li et al., published by Springer Nature.

strated by a diffractive Optical Neural Network (ONN) accelerator in which hidden layers have their
phases tuned with the help of extra electronic Digital to Analog Converters (DACs) and drivers [89].
This offloading can markedly enhance the overall throughput and efficiency of machine learning systems
[90, 91, 92, 93]. In [94], a hybrid optoelectronic CNN was designed, allowing for more difficult classification tasks than the standard optical correlator [28]. These hybrid OPUs have demonstrated scalability and have facilitated complex computational acceleration within standard computing frameworks [95, 96, 97].

2.2 Operating Mechanisms


This section addresses specific research directions worth exploring for DL photonic accelerators. Analog processors leverage the continuous time and space properties of light to perform computations, whereas digital photonic accelerators use digital encoding and processing of photonic signals to accelerate machine learning tasks. The two subcategories are discussed based on the innovations published to date.
In this context, Analog Optical Processing Units (A-OPUs) perform analog computations, such as weighted summations, in an energy-efficient manner. Photonic accelerators in this domain have been used to design differential-equation solvers: Figure 3 shows an example of an A-OPU suited for solving Partial Differential Equations (PDEs) and Ordinary Differential Equations (ODEs) that require continuous solutions [88, 98]. This capability is particularly useful in areas such as scientific simulation, optimization, finance, and logistics, where continuous analog processing is advantageous [99].
Analog photonic processing has also been applied to Reservoir Computing (RC) [101, 107, 108, 109], a Recurrent Neural Network (RNN) [110] framework that maps inputs into a fixed non-linear system, known as a “reservoir,” and then processes this information through a simple, trainable readout mechanism to produce outputs. Originating from concepts such as liquid-state machines and echo-state networks, RC has been adopted in various fields, including optics/photonics and quantum systems [99]. A-OPUs are adept at conducting continuous-time, analog computations within the reservoir, enabling them to handle time-series data and execute pattern-recognition tasks.
Finally, A-OPUs play a role in quantum photonic processing, such as continuous-variable quantum computing and Continuous-variable Quantum Optical Processors (CQOPs), which operate in an analog mode to perform quantum simulations and optimizations [102, 103, 104, 105, 106].
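The RC scheme can be sketched with a minimal echo-state network: a fixed random reservoir driven by a time series, with only the linear readout trained. All sizes, scalings, and the prediction task below are illustrative choices, not values from the cited works.

```python
import numpy as np

rng = np.random.default_rng(2)
n_res, n_steps = 50, 500

# Fixed random reservoir, rescaled to spectral radius 0.9 (echo-state property)
W_res = rng.normal(size=(n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))
W_in = rng.normal(size=n_res)

u = np.sin(np.linspace(0, 20, n_steps))   # input time series
target = np.roll(u, -1)                   # task: one-step-ahead prediction

# Drive the reservoir and record its states (the only non-linear stage)
states = np.zeros((n_steps, n_res))
x = np.zeros(n_res)
for t in range(n_steps):
    x = np.tanh(W_res @ x + W_in * u[t])
    states[t] = x

# Only the linear readout is trained, here with ridge regression
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                        states.T @ target)
pred = states @ W_out
mse = np.mean((pred[50:-1] - target[50:-1]) ** 2)  # skip washout and wrap-around
print(f"one-step prediction MSE: {mse:.2e}")
```

The appeal for photonics is that the fixed, untrained reservoir can be any sufficiently rich physical dynamical system (a delay loop, a ring network), leaving only a simple electronic readout to train.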
Digital Optical Processing Units (D-OPUs), on the other hand, use discrete digital photonic signals for computation and processing. They employ techniques to discretize and manipulate photons, making them suitable for digital-like computations. Recent innovations in D-OPUs include photonic digital signal processors that work with discrete optical signals, enabling operations such as binary logic and bit manipulation. These processors are used in optical computing for specific tasks. Quantum Digital Optical Processors (QDOPs) leverage discrete photonic qubits for quantum computation [111]. They enable quantum
gates and algorithms that work with digital quantum information, facilitating quantum-enhanced ma-
chine learning algorithms. Photonic quantum systems offer practical approaches for areas such as quan-
tum communication, quantum sensing, quantum computing, and simulation. A recent study shows the
potential applications of photonic quantum computers in optical processing units. Photonic Integrated
Circuits (PICs) have also been used for digital signal manipulation where the complex optical circuits al-
low for the precise control and manipulation of discrete photons, making them suitable for digital-like
processing tasks. Many mathematical operations have been solved using Binary Photonic Arithmetic
(BPA) where photonic accelerators perform binary arithmetic operations using discrete optical signals
[111]. These can be applied to various machine learning tasks, especially in binary neural networks and
binary-coded optimization problems. Digital photonic data transmission has also emerged, with optical interconnects providing data compression, multiplexing, and encoding for digital data handling between processing units and memory components in high-performance computing clusters.

2.3 Applications
As a subgroup of ONNs, Photonic Deep Learning Accelerators (PDLAs) focus on leveraging photonic
technology to accelerate deep learning models. They utilize the speed and parallelism of photons to en-
hance various aspects of deep learning, including CNNs and Recurrent Neural Networks (RNNs) [99].
One recent innovation in this field is the development of integrated photonic devices that can perform
complex matrix operations crucial for neural network computations at much higher speeds than tradi-
tional electronic hardware. Researchers are also exploring the use of photonic circuits for faster and more
efficient training of deep neural networks.
Optical co-processors represent another vital development, acting in tandem with CPUs and GPUs. These
co-processors, particularly those dedicated to matrix multiplication tasks, have been integrated into the
hardware ecosystem to enhance machine learning throughput [111]. This integration is supported by ad-
vances in optical interconnects, which have improved bandwidth and reduced latency, critical for dis-
tributed machine learning systems. High-bandwidth optical interconnects are central to optical data trans-
mission accelerators, and recent advancements here have focused on increasing data rates, decreasing
power consumption, and achieving higher reliability. Technologies such as silicon photonics are paving
the way for scalable, energy-efficient data transmission [98]. Lastly, the area of Quantum-Enhanced Clas-
sical Machine Learning Accelerators (CMLA) is an intriguing new development within quantum photon-
ics. This approach seeks to enhance classical machine learning algorithms using quantum-inspired meth-
ods, with a promise of solving complex optimization problems more quickly than classical approaches
[111].
Within the context of deep learning, another focused application paradigm is the use of Quantum Photonic Machine Learning Accelerators (QPMLAs). These accelerators aim to harness the power
of quantum mechanics for machine learning tasks. Recent innovations include the development of quan-
tum photonic processors capable of executing quantum algorithms that could significantly speed up cer-
tain machine learning tasks [99]. Further research in this field can offer the potential for exponential speedup
in solving optimization problems and enhancing pattern recognition tasks.
These subgroups represent cutting-edge research and development in the field of optical photonic accel-
erators for machine learning, with ongoing efforts to harness the potential of photonic technology and
quantum mechanics for faster, more efficient, and scalable machine learning solutions. The latest innova-
tions are driving progress in areas such as deep learning, quantum-enhanced algorithms, and high-speed
data handling, pushing the boundaries of what’s possible in the realm of machine learning hardware ac-
celeration.
The introduction of GPUs, ASICs, and neuromorphic chips like IBM’s TrueNorth [113] and Intel’s Loihi
[114] has dramatically improved energy efficiency and speed. However, hardware accelerators in neural
networks face two primary challenges: (i) maximizing the parallelism of neural networks requires scal-

Figure 4: Illustration of Wavelength Division Multiplexing (WDM) [112].

ing the accelerators, and (ii) minimizing energy consumption necessitates optimization of data movement
[115]. This has led to the emergence of analog neural networks characterized by rapid processing, WDM-
assisted parallelism and high energy efficiency. Figure 4 provides an illustrative figure for the WDM con-
cept widely employed in various photonic accelerator designs.
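The WDM-assisted parallelism sketched in Figure 4 reduces, for a single output, to a one-line model: each wavelength carries one input element as an optical power level, a per-wavelength weight (e.g., the transmission of a tuned microring) scales it, and a single photodetector sums all channels simultaneously. A toy sketch, assuming incoherent power summation and arbitrary illustrative values:

```python
import numpy as np

# Each wavelength channel carries one element of the input vector as an
# optical power; a microring "weight bank" sets a transmission per channel,
# and one photodetector integrates the total incident power.
x = np.array([0.2, 0.8, 0.5, 0.1])   # input powers on 4 wavelengths
w = np.array([0.9, 0.3, 0.6, 1.0])   # per-wavelength transmissions (weights)

photocurrent = np.sum(w * x)          # the PD output IS the dot product
print(f"photocurrent ~ {photocurrent:.2f} (= w . x)")
```

Because the summation happens in the detector rather than in logic, an N-element dot product costs one clock cycle regardless of N, which is the source of the parallelism advantage claimed for WDM-based designs.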
The compact size of photonic integrated silicon platforms and their seamless integration with electronic
systems make them highly desirable. Implementations of silicon photonic accelerators utilize photonic
components to significantly reduce energy consumption, especially during matrix–vector multiplications
in deep-learning models and CNNs. Unlike electronics, passive silicon photonic structures consume virtually no power, as light merely propagates through them, making them inherently low-power [116].

3 Background of Photonic Deep Learning Fundamentals


The most fundamental purpose of deep learning photonic accelerators is to perform the intensive computations required by deep neural networks efficiently and at high speed. These computations are dominated by MAC operations. Figure 5 provides an overview of the typical signal pathway of an AI chip utilizing MAC operations.
Multiplications, convolutions, Fourier transforms, and dot products are all linear mathematical operations that involve MACs. Therefore, by adopting combinations of these operations, the simulation of neural networks is achieved.
For the general case, MAC operations [117] are represented in deep learning models as a′ ← a + w · x, where an accumulator adds the results of successive MAC operations. For a given accumulation variable a and its updated state a′, the operation is as follows:

a′ ← a + (w × x)
Subsequently, for a more NN-specific use of MAC operations, consider a network whose layers each contain a number of nodes. Each node j (or neuron) receives signals from a large number of other nodes i, with input variables xi and output variables yj. A weighted sum is calculated based on the inputs, yj = Σi wij xi. In the next layer, yj is passed through a non-linear function:

x′j = f ( Σi wij xi )

Any non-linear operation can be expressed in the form f {x} (e.g., ReLU, pooling operations, etc.). Weighted sums can be expressed iteratively as ai = ai−1 + wi xi for i = 1, . . . , M . It takes M parallel MAC operations

Figure 5: Typical signal pathway for a modern AI chip. Energy is consumed primarily by moving data in systems widely
used in literature. The passing of information occurs between MAC processors performing a + (wx), memory caches, and
non-linear operations f {x} [117].

to operate each neuron, so a neural network of size N requires M × N MAC operations per time step, i.e., M operations for each of the N nodes. For a fully interconnected network of N nodes (the M = N case), N² MAC operations are required per time step ∆t (or per characteristic time constant τ in analog hardware). Energy is also consumed by the non-linear function f{x}; however, that cost scales as O(N) rather than O(N²) and is therefore comparatively cheap. The MAC thus becomes the dominant hardware bottleneck as the network size N grows [117].
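The counting argument above can be sketched in a few lines of Python (the weight and input values are illustrative only): each of the N neurons accumulates N weighted inputs, so one time step costs N² MACs, while the non-linearity is applied only N times.

```python
import math

def layer_step(W, x, f=math.tanh):
    """One step of a fully interconnected N-node layer.

    y_j = f(sum_i w_ij * x_i): the matrix-vector product costs N^2 MACs,
    while the non-linearity f adds only N evaluations (O(N), not O(N^2)).
    """
    N = len(x)
    y, macs = [], 0
    for j in range(N):
        a = 0.0
        for i in range(N):          # repeated MAC update: a' <- a + (w * x)
            a += W[j][i] * x[i]
            macs += 1
        y.append(f(a))
    return y, macs

W = [[0.1, -0.2, 0.3, 0.0],
     [0.5, 0.1, -0.4, 0.2],
     [-0.3, 0.2, 0.1, 0.6],
     [0.0, -0.1, 0.4, -0.5]]
x = [1.0, 0.5, -1.0, 2.0]
y, macs = layer_step(W, x)          # macs == 16 for N = 4
```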
Essentially, computations in the photonic space are passive; therefore, even with O(N²) fixed-point operations, the energy cost can scale as favorably as O(N). Ultimately, it is only the periphery of modulation and detection that can create bottlenecks in photonic matrix multiplication [117].

4 Classification of DL Photonic Accelerators


In this section, a more granular overview of seven primary categories of photonic DL accelerators is presented. The section assesses each approach’s design strategies, highlights their strengths,
and identifies potential areas for improvement.

4.1 WDM-based Accelerators


WDM serves as the foundational concept for all photonic accelerator designs discussed in this paper,
either partially or in their entirety. This is because CNNs involve weight updates during training, and
WDM can effectively mimic this weight update behavior. Building upon this analogy, [118] introduced
PCNNA, a proof-of-concept photonic convolutional neural network accelerator. PCNNA was specifically
designed to address the challenge of expensive convolutional operations in CNNs. It achieved this by re-
lying entirely on the WDM concept, utilizing MRRs as the primary photonic component. To expedite
data transfer across CNN layers and enhance the speed of fundamental MAC operations, PCNNA uti-
lized MRR banks as fundamental building blocks. These banks were structured based on the Broadcast-
and-weight (BW) protocol, effectively mimicking sparsity in the connections between CNNs’ input fea-
ture maps and kernels, all accomplished through the implementation of WDM.
Within PCNNA, LDs play a crucial role in multiplexing neuron output onto distinct wavelengths using
the BW protocol. Subsequently, a waveguide is employed to broadcast these multiplexed wavelengths
to the destination layer after bundling them together. At the destination layer, each neuron receives all
incoming wavelengths. The amplitude of each wavelength is determined by multiplying it with its cor-
responding MRR wavelength response. Using a specific laser wavelength, rings are tuned in and out of
resonance. Consequently, a photodiode aggregates all incoming wavelengths, generating an aggregate
photocurrent. Figure 6 provides a concise overview of the input-to-output sequence. PCNNA layers are
applied in a repetitive sequential fashion to address sequential convolutional layers in CNNs.
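The weighting step of the BW protocol can be sketched behaviorally as follows (the Lorentzian ring model and all numerical values are illustrative assumptions, not PCNNA's actual device parameters): each wavelength's power is scaled by its ring's through-port transmission, and the photodiode sums all channels into one photocurrent.

```python
def ring_transmission(detuning, fwhm=1.0):
    """Lorentzian notch model of an MRR through-port (illustrative, not fitted)."""
    return detuning**2 / (detuning**2 + (fwhm / 2.0)**2)

def broadcast_and_weight(powers, detunings):
    """Each wavelength is weighted by its ring's transmission; the photodiode
    then aggregates every wavelength into a single photocurrent."""
    return sum(p * ring_transmission(d) for p, d in zip(powers, detunings))

powers = [1.0, 0.8, 0.5]        # multiplexed neuron outputs, one per wavelength
detunings = [10.0, 0.5, 0.0]    # far detuned ~ weight 1; on resonance ~ weight 0
photocurrent = broadcast_and_weight(powers, detunings)
```

Tuning a ring in and out of resonance thus sweeps its effective weight continuously between roughly 0 and 1.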


Figure 6: An MRR bank-based BW protocol. A bundled wavelength is propagated through an MRR bank as it enters.
Through the tuning of corresponding rings, each bank weights each wavelength. Photodiodes create photocurrents by
adding all wavelengths together. Photo-currents modulate light waves of wavelength λm . Multiplexing of all laser beams is
used to broadcast the beams to the next layer. Reproduced, with permission [118], from Mehrabian et al. 2018 © IEEE.

Ultimately, with this design, PCNNA addresses a significant drawback associated with using MRRs. In traditional setups, the number of microrings required to perform the multiplications in the MAC operation scales as N_i × N_(i+1), where N_i is the number of neurons in layer i. However, in PCNNA’s innovative design, only kernel values over the input feature map of a specific
size N are utilized in the convolution operation. This careful selection serves the purpose of controlling
the number of MRR banks employed. As a consequence, only N kernel MRRs are necessary for weight-
ing, thereby requiring just N MRR banks. By employing this selective approach, PCNNA achieves sig-
nificant savings in terms of wavelength representations and MRRs needed for demultiplexing incoming
wavelengths in the subsequent layer. Figure 7 provides an overview of this selective use of MRR weight
banks for kernel convolutions.
As part of PCNNA’s concept of reduced parameter count, sparsity considerations became the guiding
principle for photonic accelerator development. Therefore, [119] proposed Albireo, which focuses on introducing sparse convolution based on WDM. Photonic Locally Connected Units (PLCUs) are the main component of the design, for which a series of photonic computation schemes is presented to leverage the multicast data distribution and shared parameters inherent to CNNs. These PLCUs also employ WDM dot-product processing; by passively overlapping receptive fields through WDM, Albireo leverages shared parameters to significantly increase computational parallelism.
To extend the analogy of WDM-based sparsity designs to even deeper CNNs beyond Albireo, DNNARA-
E was introduced by Peng et al. [120]. DNNARA-E is a hybrid optoelectronic computing architecture
and Residue Number System (RNS) accelerator. The authors proposed a novel approach by combining
residue adders and optical multipliers within a matrix-vector multiplication architecture. This innovative
method not only reduced optical critical paths but also minimized power consumption, making it highly
suitable for intricate and deeper DL networks like ResNet [121], as further discussed in Section 6. Facil-
itating sparsity in these deeper networks is achieved through high-level parallelism at the system level,
a capability inherent in WDM for high-speed operations. Consequently, RNS seamlessly transitions be-
tween electrical and optical modes, utilizing one-hot encoding.
In an RNS, a number is represented by its residues with respect to a set of pairwise coprime moduli. Because residue arithmetic is carry-free between digits, high parallelism can be achieved: each residue channel is computed independently and the results are combined (ensembled) only at the end, so addition is represented by simple mappings within the residue arithmetic system. Every modulo digit thus has a single-bit output without repetition, enabling computation-in-the-network using one-hot-encoded photonic routing. This process ensures that convolution takes place using fewer parameters.
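The residue arithmetic underlying this scheme can be illustrated with a short Python sketch (the moduli 3, 5, 7 are chosen purely for illustration): each residue digit is added or multiplied independently of the others, and the Chinese Remainder Theorem recombines the digits at the end.

```python
from math import prod

MODULI = (3, 5, 7)                       # pairwise coprime; range = 3*5*7 = 105

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_add(a, b):
    """Carry-free: each residue digit is computed independently, in parallel."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_mul(a, b):
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(residues):
    """Ensemble the digits at the end via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        total += r * partial * pow(partial, -1, m)   # modular inverse (3.8+)
    return total % M

s = from_rns(rns_add(to_rns(17), to_rns(23)))        # 40
p = from_rns(rns_mul(to_rns(17), to_rns(23)))        # 391 mod 105 = 76
```

Because no carries propagate between residue channels, each channel maps naturally onto an independent (e.g., one-hot-routed) hardware path.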
Nevertheless, even with sparsity designs aimed at deeper models, cross-talk between channels and within


Figure 7: A 16×16 input feature map with 5 kernels of 3×3, without filtering the input feature map and with the input
feature map filtered to only pass through the receptive field. As can be seen, taking advantage of narrow receptive fields
results in fewer MRRs. Reproduced, with permission [118], from Mehrabian et al. 2018 © IEEE.

channels still presented an issue for deeper networks. Therefore, [122] presented an MRR weight-bank-based accelerator that demonstrates parallel cascading in a WDM system. The design uses nanoscale etching to embed neuron nodes into silicon substrates in order to implement the neuromorphic network physically. When the input optical signal is captured, the MRR weight bank modulates the output signal of a laser operating near its threshold, and feedback is used to achieve non-linearity in the system. WDM is then realized by placing MRRs at the nodes, each addressing a specific wavelength of light; with a paired laser channel and a probing Source Meter (SM), each MRR is powered by an electrical source, and its resonance is thermally tuned after calibration. An illustrative overview of this process is

4.2 Memristor-based Accelerators


ConvLight, introduced by [123], marked the pioneering entry into this category, featuring a design cen-
tered around the utilization of memristors and MRR modulators as essential photonic components.
This innovative accelerator design employs a unique convolution process utilizing a mode-locked laser
off-chip, generating light across 784 different wavelengths. The convolution operation is facilitated through
a WDM waveguide, incorporating a Weight Resistor Array (WRA) based on memristors, a Ring Modu-
lator Array (RMA), and an SRAM buffer (SB). Data manipulation occurs in three sequential steps: 1)
Digital to Analog Conversion, 2) Memristive Convolution, and 3) Data Modulation. Initially, Digital-to-Analog Converters (DACs) convert the digital data stored in SRAM buffers into analog signals. Subsequently, the DAC output signals undergo convolution with a WRA, as depicted in the sequence outlined in Figure 9. Each bank consists of 9 memristors, forming a 3×3 filter representative of a standard convolutional layer.
The output currents from these memristors are accumulated and fed into a modulator. Following this,
data modulation occurs, with 784 modulators individually modulating lightwaves within the WDM waveg-
uide. Post modulation, the lightwaves are separated from the WDM waveguide by a WDM decoupler.
Each isolated lightwave is then directed to the subsequent layer and so forth. Figure 9 provides a visual
representation of the ConvLight convolution process.
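The memristive MAC step itself can be sketched as follows (the conductances and voltages are hypothetical stand-ins, not ConvLight's trained values): each memristor contributes a current I = G·V by Ohm's law, and the shared output wire accumulates the sum (Kirchhoff's current law) that drives the modulator.

```python
def memristor_bank_current(conductances, voltages):
    """A 3x3 memristor bank: each device contributes I = G * V, and the
    output currents sum on a shared wire, yielding one accumulated MAC
    result per filter that is fed to a ring modulator."""
    return sum(g * v for g, v in zip(conductances, voltages))

# Hypothetical conductances standing in for a trained 3x3 filter (flattened),
# and DAC output voltages encoding a 3x3 input patch.
G = [0.2, 0.0, 0.1, 0.3, 0.5, 0.3, 0.1, 0.0, 0.2]
V = [1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 0.5, 1.0]
i_out = memristor_bank_current(G, V)   # current fed to the ring modulator
```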
ConvLight’s core innovation lies in harnessing memristor conductance, which can be dynamically ad-
justed by applying an external current flux beyond a specific threshold. This unique approach enables
the control of conductance, effectively simulating the weights in the filters of a CNN, thereby acceler-
ating convolution in the optical domain. In a significant milestone, ConvLight surpassed contemporary
electrical accelerators in terms of Computational Efficiency (CE) at the time of its publication. When


Figure 8: Schematic representation of how an on-chip MRR weight bank can be used in an experimental setup to perform
a photonic accelerator. The components include Distributed Feedback Lasers (DFBs), a Source Meter (SM), and a Bal-
anced Photo-detector (BPD). Reproduced without changes under terms of the CC-BY license [122]. 2023, Zhang et al.,
published on arXiv.

compared to FPGA-based Caffeine accelerator [124] and memristor crossbar-based ISAAC accelerator
[125], ConvLight demonstrated exceptional performance, boasting 250× and 28× higher CE, respectively.
This remarkable efficiency can be attributed to ConvLight’s fully analog nature, in contrast to the slower
DACs and ADCs used in other approaches [126]. These comparisons were based on training and infer-
ence tasks executed on four versions of the VGG [127] model applied to the MNIST dataset [128].
Nevertheless, due to ConvLight’s focus on inference with offline training, attention shifted toward accel-
erating the online training of deep neural networks. This challenge was addressed by Dang et al. [129],
who introduced a novel photonics-based backpropagation accelerator named BPLight. The aim was to
enhance performance for large-size deep learning online training. Additionally, the authors presented
a comprehensive design for a CNN that incorporates the BPLight accelerator. This CNN architecture
is tailored for end-to-end training and inference, utilizing a combination of photonics components and
memristors. With a configurable memristor-integrated photonic CNN accelerator design, the proposed
BPLight-CNN stands out as an analog and scalable silicon photonic-based backpropagation accelerator.
In the overall CNN design, BPLight introduced a reversible convolution approach for each layer. Lever-
aging energy-efficient Semiconductor Optical Amplifiers (SOAs), optical comparators, and a fully ana-
log feature extraction method, BPLight demonstrates superior computational and energy efficiency com-
pared to conventional GPU implementations. However, it is essential to acknowledge that insertion losses in photonic components can degrade accuracy, especially in the deeper stages of a network.
Addressing these insertion losses represents a crucial opportunity to enhance the accelerator’s applicabil-
ity and scalability for more complex CNN models.
Following the introduction of the BPLight and ConvLight accelerators in the domain, the use of memris-
tors was limited to memristor-based crossbar array memory. This constraint aimed to enable the adop-
tion of a fully optical photonic accelerator for comprehensive online training and inference. This concept
materialized with the introduction of LiteCON [130], an all-photonic neuromorphic accelerator designed
for efficient training and inference of deep learning models. LiteCON employs silicon microdisk-based
convolution, memristor-based memory, and Dense WDM (DWDM) to achieve its functionality.
LiteCON comprises four key components: a feature extractor, a feature classifier, a backpropagation ac-
celerator, and a weight updater. These components enable complete analog end-to-end CNN training
and inference. In LiteCON, convolution layers are constructed using silicon microdisks, Rectified Linear
Unit (ReLU) [131] layers are utilized in feature extractors and classifiers, and pooling layers are incorpo-
rated in the construction of these extractors and classifiers. The Feedforward CNN accelerators consist
of both Feature Extractors (FEs) and Feature Classifiers (FCs). Backpropagation accelerators are con-
structed using microdisks, splitters, and multiplexers, while LiteCON’s weight update unit is comprised
of memristors. This analog configuration categorizes LiteCON as neuromorphic, mimicking the behavior


Figure 9: Schematic diagram of a Convolutional layer in ConvLight. Readapted, with permission [123], from Dang et al.
2017 © IEEE.

of a neural network, specifically a CNN. LiteCON operates entirely in the analog domain, incorporat-
ing silicon photonic microdisk-based convolution, memristive memory, high-speed photonic waveguides,
and analog amplifiers. The efficiency of LiteCON is enhanced by the combination of its compact foot-
print, low-power characteristics, and the ultrafast nature of silicon microdisks. For simplicity of explana-
tion, an input of 9 pixels can be assumed when describing the photonic convolution analogy in LiteCON.
In the memristor crossbar, these 9 pixels are stored as analog inputs In11, In12, . . . , In33. Assume likewise a 3 × 3 filter, i.e., 9 weights w11, w12, . . . , w33. Using a multiplexed waveguide, the weights are modulated onto 9 wavelength channels. A microdisk then modulates each input pixel Inxy onto the channel carrying its corresponding weight; for example, In11 is modulated onto the channel carrying w11. In doing so, the microdisk performs amplitude modulation, i.e., multiplication, so that the channel now carries w11 × In11. Likewise, the other channels carry w12 × In12, . . . , w33 × In33. Finally, a photodiode captures the sum total of all these photonic signals, i.e., w11 × In11 + w12 × In12 + · · · + w33 × In33. Briefly, this is how the accelerator performs convolution in the photonic domain.
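The steps above amount to one dot product per output pixel. A minimal sketch of this photonic convolution, slid over a small image (all values illustrative), is:

```python
def photodiode_sum(weights, patch):
    """Per-channel products w_xy * In_xy, summed by the photodiode."""
    return sum(w * p for w, p in zip(weights, patch))

def conv2d_valid(image, kernel):
    """Slide the photonic dot product over the image (stride 1, no padding)."""
    k = len(kernel)
    flat_kernel = [kernel[i][j] for i in range(k) for j in range(k)]
    out = []
    for r in range(len(image) - k + 1):
        row = []
        for c in range(len(image[0]) - k + 1):
            patch = [image[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(photodiode_sum(flat_kernel, patch))
        out.append(row)
    return out

image = [[1, 0, 0, 1],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [1, 0, 0, 1]]
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # illustrative diagonal filter
fmap = conv2d_valid(image, kernel)           # 2x2 feature map
```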

4.3 FPGA Photonic Accelerators


The robustness of FPGA accelerators continues to make them a strong foundation for DL acceleration.
Consequently, photonic components are increasingly being integrated into FPGA accelerators to enhance
their performance in deep learning tasks.
An approach in this domain by [132] utilized quantized weights and activations constrained to one bit ({−1, +1}), achieving accuracies comparable to those of non-quantized 32-bit models while requiring fewer computational resources. To enable Matrix-Vector Multiplication (MVM) computations at the bit level, this
design employed MZIs and MRRs as the primary photonic components. These components were utilized
to design two electrical-optical hybrid MVM accelerators mapped onto FPGAs.
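A behavioral sketch of such bit-level MVM follows (the quantization rule and values are illustrative): with operands constrained to {−1, +1}, every product reduces to a sign flip (an XNOR in hardware), which is why the cost per multiplication collapses.

```python
def binarize(values):
    """Constrain weights/activations to one bit in {-1, +1}."""
    return [1 if v >= 0 else -1 for v in values]

def binary_mvm(W, x):
    """Bit-level MVM: with {-1,+1} operands, each product is just a sign
    flip, so no full multipliers are needed; only the accumulation remains."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

W = [binarize([0.3, -0.7, 0.1]),
     binarize([-0.2, 0.5, -0.9])]
x = binarize([0.8, -0.1, -0.4])
y = binary_mvm(W, x)
```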
Following the initial inclusion of photonic components in [132], power efficiency became an attractive target for subsequent FPGA accelerator entries. For this reason, [133] proposed OASys, an opto-electronic accelerator system for efficient deep learning acceleration. The system takes a full-stack architectural design approach and combines the power of Fourier optics and conventional FPGAs. In contrast to the other accelerators surveyed in this study, OASys performs as a complement to FPGAs rather than as a stand-alone module.
Using a combination of optics and electronics, OASys spans four distinct levels of abstraction, defining a full-stack approach to integration: optical technology and electronics provide the processing power, while the hardware design provides the fundamental building blocks. In OASys’ hardware design, the central processing core incorporates both electronics and optics for acceleration, while all other operations are fully electronic.
Ultimately, the optoelectronic implementation of neural networks is the primary focus of OASys, in which


Figure 10: Diagram of a homodyne optical neural network with a single layer. (a) A neural network composed of K layers,
where each layer consists of Matrix-vector multiplications (gray) and element-wise non-linearities (red). (b) Implementing
a single layer. The matrix multiplication process involves combining input and weight signals and performing balanced ho-
modyne detection (inset) between each pair of signal and weight. Reproduced without changes under terms of the CC-BY
license [134]. 2019, Hamerly et al., published by the American Physical Society.

the core processing element block of OASys incorporates optics. This compact system design incorpo-
rates free-space bulk optics, SLMs, FPGAs, and relevant electronics. The optical setup consists of a Fourier-
based convolution and multiplication system that will complement the conventional FPGA processor by
leveraging highly parallel optical processing. Therefore, a self-contained system like this can be used for
in situ computations without having to communicate with a host.

4.4 Scalability-focused Accelerators


The challenge of scaling photonic accelerators grows with the increasing scale and depth of machine learning models. Thus, some photonic accelerators emphasize scalability as a major design characteristic. On the basis of this notion, [134] presented a photonic accelerator with homodyne detection that can process deep learning workloads at low energy cost. This method scales to large networks (N ≥ 10⁶ neurons), has a very low (sub-attojoule) energy per MAC, and operates at very high speeds (GHz rates).
The aforementioned accelerator uses standard free-space optical components for the utilization of spa-
tial multiplexing. Both the weights and inputs of this accelerator are optically encoded, enabling repro-
gramming and training at an exceptionally high speed during online model training in machine learning.
Synaptic connections (matrix-vector products) are realized in the homodyne detectors through quantum
photoelectric multiplication. Figure 10 provides a concise illustration of the device process workflow.
The photoelectric effect is fully utilized in [134], where the relation I ∝ |E|² is exploited to compute the required matrix products optoelectronically without requiring an all-optical non-linearity, previously one of the major limitations of optical computing for MAC operations. Simulations using digit- and image-classification models also show that photodetector shot noise sets a ”standard quantum limit” for optical neural networks. With this bound, which can be as low as 50 zJ/MAC, the device is theoretically capable of digital irreversible computation below the thermodynamic (Landauer) limit. The device does not rely on nanophotonic components and can therefore scale much larger than purely nanophotonic arrangements.
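The photoelectric product can be checked numerically: mixing a signal field and a weight field on a 50:50 splitter and subtracting the two photodiode currents (each ∝ |E|²) isolates their cross term, i.e., the product. This is a mathematical sketch of the principle, not the device implementation in [134].

```python
def balanced_homodyne(e_sig, e_wt):
    """50:50 splitter outputs (E_s ± E_w)/sqrt(2); each photodiode measures
    I = |E|^2; the balanced difference isolates the cross term, giving
    Re(E_s * conj(E_w))."""
    i_plus = abs((e_sig + e_wt) / 2**0.5) ** 2
    i_minus = abs((e_sig - e_wt) / 2**0.5) ** 2
    return (i_plus - i_minus) / 2

x, w = 0.6, -0.8                       # real-amplitude encodings of input and weight
product = balanced_homodyne(x, w)      # recovers x * w = -0.48
```

Summing such products over a fan-in of detectors yields the matrix-vector product without any all-optical non-linearity.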
For instance, for a matrix multiplication C(m×n) = A(m×k) B(k×n) in [134], the Input-Output (IO) energy


Figure 11: An overview of the SONIC architecture, with NCONV layer-specific VDUs and KFC layer-specific VDUs. Re-
produced, with permission [136], from Sunny et al. 2022 © IEEE.

scales as O(mk) + O(nk) + O(mn), while the number of MACs scales as O(mnk). A performance range of ∼10 fJ/MAC is therefore feasible for moderately large CNN problems (m, n, k ≥ 100) at moderate IO energies (∼ J), displaying the scalable nature of such an accelerator design.
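The favorable amortization can be verified with a toy energy model (the 1 pJ IO energy and zero marginal optical MAC energy are illustrative assumptions, not figures from [134]):

```python
def energy_per_mac(m, n, k, e_io=1.0e-12, e_mac_optical=0.0):
    """For C = A (m x k) @ B (k x n): IO events scale as mk + nk + mn,
    MACs as m*n*k, so the amortized IO energy per MAC falls as the
    problem dimensions grow."""
    io_events = m * k + n * k + m * n
    macs = m * n * k
    return (io_events * e_io + macs * e_mac_optical) / macs

small = energy_per_mac(10, 10, 10)      # 300 IO events / 1,000 MACs
large = energy_per_mac(100, 100, 100)   # 30,000 IO events / 1,000,000 MACs
```

Scaling every dimension by 10× cuts the per-MAC IO energy by the same factor, which is the essence of the O(mk + nk + mn) versus O(mnk) argument.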
The photonic accelerator benchmarking results primarily focused on popular classification models de-
signed for scalable image classification. These models were compared against common benchmarks, in-
cluding the ImageNet dataset. To address scalability concerns in real-world deployment scenarios, [135]
introduced a universal optical vector convolutional accelerator capable of operating beyond 10 Tera-OPS
(TOPS - operations per second). Specifically tailored for facial image recognition, this design is based on
the simultaneous interleaving of temporal, wavelength, and spatial dimensions, made possible by an in-
tegrated microcomb source. This innovative approach is scalable and adaptable to much more intricate
networks, catering to demanding applications such as unmanned vehicles and real-time video recogni-
tion.
With optical convolutions, the accelerator in [135] is capable of processing and compressing large amounts of data; Kerr frequency combs (microcombs) enable the required interleaving of the wavelength, temporal, and spatial dimensions. The convolution accelerator can act both as a CNN front-end with fully connected neurons and as a convolutional accelerator front-end using the same hardware. As the accelerator scheme is a stand-alone, universal system, it can be used both
with electrical and optical interfaces. Consequently, it is capable of serving as a universal high band-
width data compression front end for neuromorphic hardware, either optical or electronic, resulting in
massive-data, ultrahigh bandwidth machine learning.
Another entry under this subcategory dubbed, SONIC [136], used photonic components to increase scal-
ability by including sparse NN layers. Hence, SONIC was designed to utilize optimal sparsity in order to
accelerate Sparse Neural Networks (SpNNs) in an energy-efficient and low-latency manner.
In the photonic domain, SONIC’s core computes the multiplications and accumulations of fully-connected and convolutional layers using Vector-Dot-Product Units (VDUs). It also integrates various peripheral electronic modules that interface with main memory and map image data to the photonic VDUs. This also involves applying non-linearities to the photonic core’s outputs, accumulating partial sums,
and performing other post-processing operations. Different wavelengths are generated by Vertical Cavity
Surface Emitting Lasers (VCSELs) within VDUs using DAC arrays which convert buffered signals into
analog tuning signals. In photonic summation, analog signals are converted to digital values via ADC ar-
rays before being processed and buffered for further processing. Eventually, with such a modular, vector-
granularity-aware structure, SONIC is designed for high throughput and scalability. Furthermore, it is
designed to optimize the photonic accelerator’s operation based on Sparsity-aware data compression and
dataflow techniques. Figure 11 provides an overview of the SONIC accelerator architecture in summary.
Next, to further explore real-time deployment, [116] presented two novel Bayesian learning schemes, namely regularized and fully Bayesian learning, for classifying handwritten digits in the MNIST [128] dataset using 512 phase shifters.


Then, in conjunction with pre-characterization stages that provide passive offsets, the new schemes sig-
nificantly reduce the processing power required by the PIC without compromising classification accuracy.
Therefore, on top of reducing energy, the full Bayesian scheme provides information about phase shifter
sensitivity. As a result, the phase actuators may be partially deactivated and the driving system is sig-
nificantly simplified.
The phase tuning process in [116] is based on an offline training scheme that takes into account uncer-
tainty. The novel feature of this study is its Bayesian approach to photonic accelerators, where, instead
of defining optimum phase shifter values through training, a parametric Probability Density Function (PDF) is defined for each phase shifter, which is optimized by updating variational parameters at
every iteration. Aside from indicating the correct values for phase shifters, this Bayesian procedure also
quantifies their robustness to phase deviation. Using this data, novel algorithms can be developed for
adjusting and controlling photonic accelerators, which will increase their noise robustness and scalability.
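A toy sketch of the idea follows (the Gaussian PDFs and the deactivation threshold are invented for illustration, not taken from [116]): each phase shifter carries a learned mean and spread, and shifters with a wide posterior are insensitive enough that their actuators can be switched off.

```python
import random

random.seed(7)

# Hypothetical trained variational parameters: (mean phase, std) per shifter.
phase_pdfs = [(0.10, 0.01), (1.57, 0.50), (0.78, 0.02), (2.00, 0.90)]

def sample_phases(pdfs):
    """Draw one realization of all phase-shifter settings from their PDFs."""
    return [random.gauss(mu, sigma) for mu, sigma in pdfs]

def deactivatable(pdfs, threshold=0.3):
    """A wide posterior means the output is insensitive to that shifter,
    so its actuator can be switched off and fixed at a passive offset."""
    return [i for i, (_, sigma) in enumerate(pdfs) if sigma > threshold]

phases = sample_phases(phase_pdfs)
off = deactivatable(phase_pdfs)      # indices safe to deactivate
```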

4.5 All Optical Neural Network Architecture


The universality of weight representation can be sacrificed for reduced usage of optical components, lower
area cost, and lower energy consumption, as demonstrated by [137] in the context of an Optical Sub-
space Neural Network (OSNN). In comparison to GEMM-based ONNs [138], the authors introduce a
butterfly-shaped photonic-electronic neural chip to implement OSNNs with significantly fewer trainable
optical components (7× fewer). Moreover, the training framework is hardware-aware, reducing the chip
size, enhancing noise robustness, and minimizing the need for precise device programming.
The OSNN, presented in [137], adopts a unique approach by constraining the parameter space of each
layer’s weights. It suggests dividing the weight matrix of each layer into smaller k × k (k = 4, 8) sub-
matrices. This innovative architecture enables photonic neural computing using 7× fewer trainable optical elements compared to MZI-based ONNs designed for general MVMs. Through the implementation of
structured circuit pruning, the number of trainable optical components can be further reduced by 70%
with negligible impact on performance. Moreover, the study introduces an experimental hardware-aware
training framework characterized by high noise robustness and low control precision requirements. This
framework facilitates ONN training with exceptional noise tolerance, reduced control precision demands, and maximized subspace expressivity. To further enhance OSNN
performance, exploring faster and more efficient Electro-Optic/Opto-Electronic (EO/OE) conversion
techniques, alongside the utilization of smaller optical components, holds significant potential.
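A crude parameter count conveys the kind of saving involved (assuming, purely for illustration, a butterfly factorization with log₂(n) stages of n/2 two-input units at two trainable parameters each; the actual OSNN parameterization differs in detail):

```python
import math

def dense_params(n):
    """Trainable elements in a general n x n MVM (GEMM-style ONN)."""
    return n * n

def butterfly_params(n, unit_params=2):
    """Illustrative butterfly count: log2(n) stages of n/2 two-by-two units."""
    return int(math.log2(n)) * (n // 2) * unit_params

n = 64
reduction = dense_params(n) / butterfly_params(n)   # > 7x for this toy count
```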

4.6 Computing Engine Accelerators


In Silicon Photonic (SiPho) architectures, supporting DL applications with a large number of trainable
parameters necessitates analog computing engines capable of performing Tiled Matrix Multiplication
(TMM) at line rate. This requirement is akin to the approach employed by state-of-the-art electronic
graphics processing units. Addressing this need, [139] introduced a coherent analog SiPho computing en-
gine designed for optical TMM operations at 50 GHz. The authors demonstrated that this engine suf-
fices for DL applications even when the number of trainable parameters exceeds the available hardware
dimensions.
In the previously described implementation [139], a compact SiPho engine is employed
to handle input and weight updates directly in the optical domain. This approach enables the imple-
mentation of DNNs using a limited number of photonic devices. The photonic accelerator incorporates
two input Coherent Linear Neurons (COLNs) equipped with high-speed Silicon Germanium Electro-
Absorption Modulators (EAMs) for both weight and input imprinting. The deployment of this acceler-
ator in a data center traffic inspection system for network security applications highlights its capabilities
in performing TMM and supporting DNNs with higher dimensions. Additionally, the photonic engine is
utilized for identifying Distributed Denial-of-Service (DDoS) attack patterns by classifying Reconnais-
sance Attacks (RAs).


4.7 Conventional NNs Single Task/Operation Accelerators


The majority of studies on accelerators have predominantly concentrated on accelerating convolution
operations through various techniques, often overlooking other crucial aspects of deep learning models.
These aspects include operations like average pooling, max pooling, and others, as well as subtasks within
classification models, such as generating heat maps of important features.
Indeed, focusing on individual components of a DL network, an initial effort to optimize pooling oper-
ations in traditional DL NNs was made. In this context, [140] introduced a short-term solution to meet
the growing demand for photonic accelerators, known as the Photonic Feature Map Accelerator (PFMA).
This design specifically targeted a single DL task: generating feature maps in CNNs through a limited
number of convolutions. The proposed PFMA not only performed both linear and non-linear operations
in optics but also incorporated a novel all-optical average pooling layer. According to the authors, PFMA
demonstrated energy efficiency 2.5 to 5 times higher than state-of-the-art electronic hardware available
at the time of publication.
In PFMA [140], convolution and feature extraction are performed on small pixel windows, typically 2×2
or 3×3 in size. Current integrated photonic technologies, specifically III-V-on-silicon, can accomplish
this optical task with a limited number of devices and input/output ports. PFMA achieves this using
three layers. The first set of layers consists of Optical Interference Layers (OILs), which are linear lay-
ers composed of thermally tuned MZIs performing matrix-vector product calculations. The second set
comprises Optical Non-linear Layers (ONLs) utilizing saturable absorbers, advantageous due to their low
power consumption and input-output characteristics resembling the Exponential Linear Unit (ELU) ac-
tivation function commonly used in modern CNNs. Lastly, the Optical Pooling Layer (OPL) on a Multi-
mode Interfering (MMI) device combines optical signals into one output, providing an average score.
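The three-layer pipeline can be sketched behaviorally (the weights and window values are illustrative; the ELU stands in for the saturable absorber's measured response):

```python
import math

def oil(W, x):
    """Optical Interference Layer: MZI mesh computing a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def onl(x, alpha=1.0):
    """Optical Non-linear Layer: saturable absorber with an ELU-like response."""
    return [v if v > 0 else alpha * (math.exp(v) - 1.0) for v in x]

def opl(x):
    """Optical Pooling Layer: an MMI combiner outputs the average of its inputs."""
    return sum(x) / len(x)

W = [[0.5, -0.5], [1.0, 0.25]]   # illustrative weights for a 2x2 pixel window
window = [0.8, 0.4]
score = opl(onl(oil(W, window))) # linear -> non-linear -> average pooling
```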
Another innovation introduced in [141] involves a photonic max-pooling architecture designed specifi-
cally for image classification and compensating fiber non-linearity in photonic neural networks. This ar-
chitecture aims to accelerate max-pooling functionality, leveraging the inherent non-linear properties of
ring modulators. Unlike average-pooling, max-pooling functions are inherently non-linear, making them
suitable for exploitation in photonic systems. This accelerator utilizes optical inputs and outputs that
undergo max-pooling operations through the non-linear characteristics of a ring modulator. The func-
tionality of max-pooling is demonstrated using the iPronics Smartlight Processor, employing a hexagonal
mesh of MZIs to achieve this innovative approach.
In [141], two ring modulators are coupled to two optical input signals. The resonance wavelength of
the ring modulators is tuned precisely to the wavelength of the smaller input signal. As a result, the
smaller signal experiences significant attenuation while the larger input passes unattenuated. The
circuit's functionality is demonstrated on a programmable photonic platform. Porting the proposed
architecture to other integrated photonic platforms is promising, as it could reach operating bandwidths
in the tens of gigahertz, enabled by fast modulators and PDs.
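The attenuation-based selection can be captured in a behavioral sketch (an illustrative power model only, not the device physics in [141]; the suppression figure is hypothetical):

```python
def ring_maxpool(a: float, b: float, suppression_db: float = 20.0) -> float:
    """Behavioral model of the two-ring max-pool: the ring resonance is
    tuned to the smaller signal, which is strongly attenuated at resonance,
    while the larger signal passes nearly unattenuated. The detected
    combined power then approximates max(a, b)."""
    atten = 10 ** (-suppression_db / 10)  # power attenuation at resonance
    small, large = sorted((a, b))
    return large + small * atten

assert abs(ring_maxpool(0.1, 0.9) - 0.9) < 0.01
```

With 20 dB of suppression the residual contribution of the smaller signal is only 1% of its power, so the output closely tracks the maximum.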

5 Efficiency-oriented Accelerator Designs


In parallel with Section 4, which surveyed the diverse functions achievable with photonic accelerators,
this section focuses exclusively on accelerator designs that emphasize efficiency.

5.1 Architectural Improvements


As detailed in several subcategories of Section 4, many accelerators rely primarily on MRRs. Owing to
their small dimensions (∼10 µm) and lower energy consumption, MRRs offer a more power- and
area-efficient solution than MZIs. While a single MZI can achieve an Extinction Ratio (ER) of over
60 dB (a measure of how precisely the device can modulate light signals), the ER of an MRR depends
on achieving critical coupling, which is influenced by its thermal stability. State-of-the-art
demonstrations have measured ERs of single MRRs up to 25 dB. Hence, it is often more efficient
to design solutions that combine MZIs and MRRs, as introduced in [142].
While improvements in process technology can mitigate conversion overhead, it remains a fundamental
limitation on the speed and efficiency of designs in this field. Incorporating more optical operations may
be beneficial, but it introduces losses in the optical devices, reducing SNR and bit precision. Similarly,
MZIs have limited bandwidth, which causes latency in weight programming. This subsection therefore
focuses on photonic accelerators that combine photonic components and concepts to maximize design
efficiency [142].
One notable entry in this direction is the IPCNN accelerator presented in [143], which keeps efficiency
at its core by integrating optical delay lines into the accelerator architecture. IPCNN implements the
convolution layers photonically, eliminating the electronic circuits that manipulate data prior to matrix
multiplication; this reduces power consumption and latency, and requires fewer E/O interfaces, further
lowering energy consumption.
In IPCNN, the electronic circuitry responsible for data patching and allocation is replaced by optical de-
lay lines functioning as data buffers. As a result, the data manipulation process becomes nearly power-
free and operates at the speed of light. IPCNN utilizes WDM to combine multiple delay lines into one,
addressing the footprint issue associated with individual delay lines and making chip fabrication more
practical. With WDM implementation, the number of optical delay lines is significantly reduced, making
fabrication feasible with current integration technologies. The performance of the IPCNN system is
further evaluated under practical fabrication challenges such as noise, imbalance, and insertion loss, as
well as criteria including prediction accuracy, maximal integration scale, computing speed, and energy
efficiency.
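The delay-line patching idea can be emulated in software (a behavioral sketch for a 1-D serialized stream with unit delays; `delay_line_patches` is a hypothetical helper, not the IPCNN design):

```python
import numpy as np

def delay_line_patches(stream, taps):
    """Emulate `taps` optical delay lines carrying copies of the input
    stream delayed by 0..taps-1 samples: at each time step the detector
    sees one full convolution window, with no electronic buffering or
    explicit im2col data-patching step."""
    n = len(stream)
    windows = []
    for t in range(taps - 1, n):  # wait until all delay lines are filled
        windows.append([stream[t - d] for d in range(taps - 1, -1, -1)])
    return np.array(windows)

x = np.array([1, 2, 3, 4, 5])
P = delay_line_patches(x, 3)
# Each row is one sliding window, ready for a single matrix-vector product.
```

In hardware the "rows" arrive sequentially at line rate, which is why the data manipulation becomes nearly power-free.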
The performance of IPCNN is significantly influenced by two critical factors. The SNR is primarily de-
termined by photodetection noise, where lower noise levels translate to higher efficiency, larger hardware
scales, and reduced power consumption. To mitigate this noise, PDs and Trans-Impedance Amplifiers
(TIA) must be constrained in bandwidth. Specifically, if the modulation rate is fm , the PDs need to
have a bandwidth of fm /2. Additionally, insertion losses can occur due to laser inputs, modulators, and
delay lines. The use of advanced, integrated lithium niobate modulators together with low-loss delay
lines helps minimize insertion loss, enhancing overall system efficiency.
Efficiency in neural network accelerators has also been explored through cross-layer design. CrossLight,
proposed in [144], is a silicon photonic neural network accelerator that adopts a cross-layer design
paradigm, optimizing hardware and software holistically across multiple layers of the stack. At the
device level, the authors introduce an enhanced photonic device design fabricated for greater resilience
to manufacturing process variations. At the circuit level, a high-speed tuning circuit supports large
thermally-induced resonance shifts while minimizing thermal crosstalk. At the architecture level,
CrossLight improves wavelength reuse and implements matrix decomposition, enhancing the achievable
weight resolution, energy efficiency, and throughput.
Furthermore, [145] introduced TRON, the first silicon photonic hardware accelerator for Vision
Transformers (ViTs), designed to support the latest advancements in deep learning, particularly
transformer-based networks. TRON is a non-coherent silicon photonic accelerator that serves as a
fundamental framework for inference in transformer models. At its core, it comprises Feed-forward (FF)
and Multi-head Attention (MHA) units, allowing encoder and decoder resources to be reused.


Figure 12: Overview of the TRON accelerator architecture. Reproduced with permission [145], 2023, Afifi et al., published
by the Association for Computing Machinery.

Integrated Electronic Control Units (ECUs) facilitate interaction with the main memory, buffer interme-
diate results, and map matrices to the photonic architecture. Additionally, software optimization tech-
niques can be applied to further reduce the memory footprint of the transformer, leading to enhanced
performance and efficiency. Figure 12 illustrates the TRON accelerator architecture.
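The workload that TRON's FF and MHA units must map onto photonic MVM hardware is dominated by matrix products, as a minimal single-head attention sketch shows (illustrative only; `attention` is not TRON's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention. Every product below is a
    matrix multiplication -- the operation photonic MVM units accelerate --
    while the softmax is left to non-linear/electronic stages."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return scores @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))  # 4 tokens, model width 8 (toy sizes)
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
```

Counting the products makes clear why reusing a shared MVM datapath across the FF and MHA units pays off: both are built from the same primitive.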

5.2 Algorithmic Designs


In the context of computationally efficient accelerators, [146] used a universal logarithmic quantization
(ULQ) method to develop MindReading, a novel ultra-low-power photonic accelerator for detecting
human intentions in real time via Electroencephalography (EEG). With this method, the authors
quantized not only the weights but also the activations of convolutional, LSTM [147], and fully connected
layers with negligible loss of accuracy, replacing floating-point MVM with low-bit-width addition and
shift operations.
Furthermore, MindReading includes a photonic activation unit for computing Tanh, ReLU [131], and
Sigmoid activation outputs, and an eDRAM buffer for storing EEG signals as well as intermediate
results generated by the Photonic Processing Unit (PPU). Using photonic adders and shifters, the PPU
computes binary logarithms and logarithmic accumulations for the ULQ-quantized EEG-NET, whose
non-linear units provide the Sigmoid, Tanh, and ReLU activations. MindReading was reported to reduce
power consumption by 62.7% and increase throughput by 168.6% on average, compared to existing
accelerator counterparts on the same classification task.
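The shift-add substitution at the heart of logarithmic quantization can be sketched as follows (a simplified float-domain illustration; `log_quantize` and `shift_mul` are hypothetical helpers, not the MindReading implementation):

```python
import numpy as np

def log_quantize(w):
    """Quantize weights to signed powers of two: w ~ sign(w) * 2**k."""
    sign = np.sign(w)
    k = np.round(np.log2(np.abs(w) + 1e-12))
    return sign, k.astype(int)

def shift_mul(x, sign, k):
    """Multiply by a power-of-two weight via scaling by 2**k, which in
    fixed-point hardware is just a bit shift -- no multiplier needed."""
    return sign * np.ldexp(x, k)

w = np.array([0.24, -0.9, 3.1])
x = np.array([1.0, 2.0, 4.0])
sign, k = log_quantize(w)
approx = shift_mul(x, sign, k)   # ~ x * w, computed with shift/add hardware
```

The approximation error per element is bounded by the coarse power-of-two grid; the paper's universal scheme refines this while keeping the shift-add structure.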
Similarly focused on MAC efficiency, Pixel [148] is a photonic neural network accelerator that performs
efficient MAC operations with MRRs and MZIs as its main photonic components. A key contribution
of this design is that detailed architecture parameters, such as area, power, and timing, are reported
for MAC in the optical domain. The accelerator comes in two versions: a hybrid Optical-Electrical
(OE) version that multiplies in the optical domain and accumulates in the electrical domain, and a fully
Optical-Optical (OO) version in which the MAC operations happen entirely in the optical domain.
In the proposed PIXEL architecture, each OMAC unit comprises an RF for storing filter weights and a
MAC unit that multiplies and accumulates these weights. Neurons are fired over photonic interconnects
in both the x and y dimensions. Data is preprocessed at the front-end to fire neurons repeatedly, and
results are recovered from the accumulation at the back-end. Because the synapses are preloaded, the
OMAC implements MAC functionality through the timed firing of neurons. This design benefits from
the fact that neuron firing and partial-sum accumulation take place in the optical domain, which
significantly reduces energy consumption. Additionally, instead of preloading the filter weights into the
MRRs, photonics can also be used to send the weight information to the OMACs on a dedicated channel.
Since the only active component is the MRR, scaling up amounts to driving the optical signal with
higher intensity. Figure 13 provides an overview of the process flow in the proposed architecture design.

Figure 13: Pixel NN Accelerator: (a) A simple STR configuration using bitwise multiplication and addition. (b) An
OMAC unit that performs multiplication optically while addition and shifting are performed electrically. (c) Optical
accumulation with extended OMAC. Reproduced with permission [148], from Shiflett et al. 2020 © IEEE.
Several of the accelerators discussed thus far rely on ripple-carry adders and SRAMs, both of which
severely limit the frequency and inference throughput of an accelerator, primarily due to the adder's
long critical path and the SRAM's long access latency.
To address this problem, [149] proposed LightBulb, a photonic non-volatile memory (NVM)-based
accelerator that processes inferences of binarized CNNs electro-optically using photonic XNOR gates
and popcount units, with photonic racetrack memory serving as input/output registers.
To replace floating-point MACs with XNORs and popcounts, LightBulb first binarizes the weights and
activations of a CNN into linear combinations of (−1, +1) values. The authors also propose XNOR
gates based on photonic microdisks and an XNOR-based ADC built with PCM-based photonics; both
the photonic XNOR gate and the ADC can operate at 50 GHz. LightBulb is further equipped with
photonic racetrack memory as input and output registers to support this high-frequency operation. The
study implements, evaluates, and compares LightBulb against state-of-the-art GPU, FPGA, ASIC,
ReRAM, and photonic CNN accelerators, demonstrating its efficiency.
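The XNOR-popcount substitution for (−1, +1) dot products can be sketched as follows (an illustrative bit-level model, not LightBulb's photonic implementation):

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors over {-1, +1}, packed as n-bit
    integers (bit 1 -> +1, bit 0 -> -1). XNOR marks positions where the
    elements agree; the dot product is agreements minus disagreements."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # 1 where elements agree
    matches = bin(xnor).count("1")              # popcount
    return 2 * matches - n

# [+1, -1, +1] -> 0b101, [+1, +1, -1] -> 0b110; true dot product is -1
assert binary_dot(0b101, 0b110, 3) == -1
```

Every floating-point multiply-accumulate of the binarized layer collapses into one XNOR word operation plus a popcount, which is what makes 50 GHz photonic gates so effective here.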
Similar to LightBulb, with a focus on DNN acceleration efficiency and latency, the design in [150]
combines WDM with a Residue Number System (RNS) to present a hybrid opto-electronic computing
architecture for accelerating DNNs. WDM enables a high level of parallelism while reducing the number
of optical components and the accelerator area, and RNS yields optical compute modules with short
optical critical paths, which reduces optical losses as well as the need for high laser power. A key
feature of the RNS compute modules is their one-hot encoding, which facilitates fast switching between
the electrical and optical domains.
Owing to its high parallelism, RNS is well suited to CNNs and DNNs, particularly those dominated by
MACs; for this reason, the authors avoid repeatedly converting between binary and residue
representations. Activation functions such as the sigmoid and hyperbolic tangent are difficult to
implement directly in RNS; consequently, sigmoid and tanh are approximated by Taylor series and
implemented as polynomials with specialized adders and multipliers.
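Residue-number arithmetic itself is easy to sketch (the moduli below are chosen for illustration; the paper's actual moduli and their optical mapping differ):

```python
from math import prod

MODULI = (5, 7, 8)  # pairwise coprime; dynamic range = 5*7*8 = 280

def to_rns(x: int):
    """Encode an integer as its residues modulo each channel's modulus."""
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    """Multiply channel-wise: each residue channel is independent,
    so the optical critical path per channel stays short."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def rns_add(a, b):
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def from_rns(r):
    """Reconstruct via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        x += ri * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

assert from_rns(rns_mul(to_rns(9), to_rns(13))) == 117
```

Because each residue channel operates on small independent values, the channels map naturally onto parallel wavelengths, which is precisely the synergy with WDM the design exploits.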


Another noteworthy entry focuses solely on accelerating Binary Neural Networks (BNNs), networks
whose weights (and often activations) are constrained to binary values. The authors in [151] introduced
an optical-domain BNN accelerator, dubbed ROBIN, that efficiently implements the key functionalities
of a BNN by intelligently integrating heterogeneous MRR optical devices. ROBIN optimizes for energy
and area efficiency while improving overall performance at the device, circuit, and architecture levels.
Two optimized variants were developed: ROBIN-EO, which offers higher FPS performance at the
expense of more power consumption, and ROBIN-PO, which offers higher area and energy efficiency.
In this design the weights remain binary while the activations are multi-bit parameters, allowing
partially binarized networks to be accelerated; a natural extension is therefore to consider mixed
quantization of the activation parameters in each layer of the model.
Following these accelerators focused on BNNs and cross-layer designs, [152] introduced RecLight, a
novel photonic hardware accelerator for efficiently accelerating another category of NNs: simple RNNs,
Gated Recurrent Units (GRUs), and Long Short-Term Memory units (LSTMs), in any combination.
RecLight is based on noncoherent integrated silicon photonics and relies on a novel photonic MAC unit
design that minimizes power dissipation and energy consumption while maximizing overall throughput.
Using a hybrid tuning circuit, RecLight induces the resonance shift ∆λMR with both TO and EO
tuning; EO tuning consumes less power and is faster (≈ ns range), but its tuning range is smaller.
RecLight achieves better parameter resolution by mitigating thermal crosstalk, which was identified as
the main constraint in noncoherent photonic computation.
Another paradigm worth exploring is the use of broadcasting for design efficiency. To this end, the
chiplet-based ASCEND DNN accelerator with photonic interconnects was introduced in [153]. ASCEND
enables chiplets to communicate seamlessly and to map a variety of convolution layers within and
across chiplets without delays; parallelism is maximized by exploiting broadcasting, which allows
computations with shared inputs to proceed simultaneously.
In ASCEND, local Processing Elements (PEs) are grouped into unit 2D arrays of rows and columns
across different chiplets. Unicast communication from each PE to the Global Buffer (GLB) and
broadcast communication from the GLB to each PE are each carried over a dedicated waveguide.
Multiple PE arrays are aggregated and connected to the GLB through SDM to form a chiplet-based
accelerator. By mapping diverse convolution layers at 2D PE-array granularity, the resulting photonic
network enables seamless one-hop communication within and between chips.
Also building on the broadcasting concept, [154] addressed the quantization of CNN parameters to
produce efficient models with low memory requirements, combining WDM with Time Division
Multiplexing (TDM). Whereas homogeneous quantization can reduce the accuracy of CNN models,
heterogeneous quantization can increase it. To achieve heterogeneously quantized CNN acceleration,
the authors introduced the Heterogeneous Quantization Neural Network Accelerator (HQNNA), a novel
non-coherent silicon photonic accelerator based on WDM and TDM. A key feature of the design is
matrix-vector multiplication with granularity-aware modular vectors that is both energy- and
throughput-efficient.
The HQNNA architecture comprises MVUs, with data routed through an electronic control unit.
Convolution and Fully Connected (FC) layer activations are computed on the MVU array: vectors and
matrices are mapped across the array, and the resulting partial-sum vectors are summed digitally. Each
FC layer's MVU handles a weight matrix of size v × v and an activation vector of size v; to pass
results to the next FC layer, large vectors and matrices are partitioned across FC MVUs of different
dimensions. For comparison, LightBulb [149] was reported to achieve 17× to 173× higher CNN
inference throughput than prior electro-optical accelerators and 17.5× to 660× higher throughput per
Watt.
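The partial-sum tiling described above can be sketched in plain numpy (an illustration of blocked matrix-vector multiplication; `blocked_matvec` is a hypothetical helper, not HQNNA's MVU mapping):

```python
import numpy as np

def blocked_matvec(W, x, v):
    """Matrix-vector product computed in v-wide column blocks, mimicking
    how an MVU array processes sub-matrices and digitally sums the
    resulting partial-sum vectors."""
    n, m = W.shape
    y = np.zeros(n)
    for j in range(0, m, v):             # each column block -> one MVU pass
        y += W[:, j:j + v] @ x[j:j + v]  # partial-sum vector, summed digitally
    return y

W = np.arange(12.0).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(blocked_matvec(W, x, 2), W @ x)
```

The block width `v` plays the role of the MVU dimension: hardware of fixed size handles arbitrarily large layers at the cost of more passes and digital accumulation.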

Architecture Total Network Parameters (Million) FLOPS (Billion)
AlexNet [158] 62.38 1.50
VGG-16 138.42 19.60
Inception-v3 [159] 24.00 6.00
ResNet-152 60.30 11.00
MobileNet-v2 [160] 6.90 1.17
ShuffleNet-v1 (1x) [161] 1.87 0.14
DenseNet-121 [162] 7.98 5.69
EfficientNet-B1 [163] 7.80 0.70
Vision Transformer (Base/16) [164] 86.60 17.60

Table 1: Total number of parameters and FLOPS in the most commonly used classification deep learning models in ma-
chine learning literature.

6 Discussion and Research Gaps


The overall energy efficiency of a significant portion of the accelerators assessed in this review is severely
compromised due to the inefficiency of lasers and phase shifters, the insertion of optical components lead-
ing to excessive losses, and overheads associated with optical-to-electrical and electrical-to-optical con-
versions. However, the inherent parallelism in the multiplication of analog signals and the significantly
higher operating speed make them appealing for reducing delays and enhancing throughput compared to
digital CMOS implementations. Low-energy tuning schemes, combined with emerging SiP technologies
such as NOEMS and LCOS, can notably enhance the energy efficiency of photonic accelerators. Efficient
weight tuning is achievable with thermal phase shifters and insulation, making mass production viable.
Moreover, low-voltage-swing modulators with heterogeneous polymer integration promise improved
energy efficiency by consuming significantly less power in the CMOS driver and modulator [155].
In silicon PIC accelerators, numerous MZIs and MRRs are cascaded, and WDM technology relies on
MRRs. Matrix multiplication necessitates quadratically more units than the input data dimension, re-
sulting in quadratic energy consumption and footprint scaling. Despite the potential of PICs in inte-
grated optics, manufacturing limitations impede the development of accelerators capable of supporting
deeper neural networks with more parameters, as indicated in Table 1, due to their limited integration
scale and the complexity of the control circuits required to reduce resonance wavelength fluctuations.
The integration scale therefore severely restricts the size of the input vectors and matrices that can be
loaded onto this hardware. Recently, chip-scale microcomb devices have been developed as multi-wavelength light
sources, offering convenience and significantly increasing the input data size. However, limitations on the
size of the input matrix may still exist based on the number of wavelength channels multiplexed. Addi-
tionally, there are other on-chip architectures capable of processing matrices. Despite their impressive
performance, the practical compute rate and processed matrix size of these architectures still fall below
their potential for building high-speed photonic engines [156].

6.1 Scalability
Inherently, photonic accelerators must be able to support the parameter counts of widely used DL
models in order to replace electronic accelerators. As shown in Table 1, ResNet-152, a popular and
widely used DL classification architecture from Microsoft, already exceeds 60 million parameters. One
direct and effective solution is therefore to manufacture larger-scale photonic accelerators. As presented
in Section 4, partially or fully optical accelerators promise massive parallelism through WDM, which
a majority of photonic accelerators in the literature have adopted, and possibly through Mode Division
Multiplexing (MDM), which is yet to be realized or investigated in this domain [157]. Ultimately, the
more scalable and efficient a photonic accelerator architecture is, the more sophisticated the DL neural
networks it can support.
For instance, a large-scale silicon photonic MRR array requires a seamless control technique, which can
be achieved using integrated photoconductive heaters without the need for additional components,
sophisticated tuning algorithms, or extra electrical interfaces. Integrated with silicon photonics, lithium
niobate and barium titanate electro-optic modulators provide high-speed phase modulation at low
operating voltage, making them extremely attractive for photonic accelerators. Given the high thermal
tuning cost of the phase shifters in MZI mesh schemes, scaling to larger matrices can be problematic:
the thermal energy accumulated across thousands of phase shifter units makes photonic accelerators
less competitive [157]. In this review of photonic accelerator methodologies, each approach was analyzed
in terms of its strengths and weaknesses with respect to efficiency and scalability, which emerge as the
key areas requiring further research.

6.2 Future Perspectives


This section addresses specific research directions worth exploring in the field of DL photonic
accelerators.

6.2.1 Hardware Considerations

Despite the considerable advantages that photonic DL accelerators offer over their electronic counter-
parts, several challenges persist. One such challenge is the implementation of caching memory subsys-
tems for neural networks, which becomes intricate when scaling up to handle real workloads generat-
ing substantial intermediate data. To execute large-scale neural networks, electronic memories such as
SRAM and DRAM can be integrated with optical video memory modules. The development of cross-
layer co-design tools and the establishment of simulation methodologies are also crucial. The design space
exploration of nanophotonic accelerators involves detailed power/performance modeling. Lastly, defining
hardware-software interfaces for nanophotonic accelerators and considering optical-electrical heterogeneous
computing models are important [165].

6.2.2 Weight Matrices Manipulation to Counter Photonic Component Scaling

Equations with constant coefficients can be discretized using Toeplitz matrices. Since an n × n
Toeplitz matrix has only 2n − 1 free parameters, it is reasonable to expect that linear systems T x = b,
analogous to NN layers, can be solved in fewer flops than Lower-Upper (LU) factorization would
require; indeed, there are methods that require only O(n²) flops [166]. As can be seen in Figure 14,
Toeplitz matrices exhibit a constant-diagonal structure, making them well suited to scenarios requiring
shift invariance, as is often the case in deep learning classification networks [167].
Implementations of DL on photonic accelerators inherently come with limitations that restrict the
complexity of the DL models that can be effectively supported. A fundamental requirement of such
hardware is unitary weight matrices, so machine learning mechanisms need to transform weight matrices
to their nearest unitary counterparts. Additionally, DL models that perform convolutions on the data
must be handled differently in order to fit within the parameters of the photonic system. It has been
shown in [168] that non-unitary matrices can be converted to unitary ones, and that linear algebra
techniques can transform Convolutional Neural Networks (CNNs) into feed-forward models that realize
convolutions through Toeplitz matrix operations. Experimental results in [168] showed that post-training
or iterative techniques for finding the nearest unitary weight matrix can be applied on photonic chips
with minimal loss in accuracy, and that CNNs adapt well to a photonic configuration employing a
Toeplitz matrix implementation. This approach addresses the limitations of deploying DL models on
photonic FPGAs and opens possibilities for fully optical accelerators.
CNNs illustrate a straightforward but significant observation about computer image recognition: since
detecting an object is largely a separate task from determining where it appears in the image, detection
should be equivariant with translation. In a CNN, the linear transformation is not fully connected;
nevertheless, if the unitary weight matrices of a photonic component acting as a placeholder for pixel
values, such as an MRR or MZI, are implemented as a matrix multiplication followed by a non-linear
activation, convolution can be achieved despite the photonic restriction to unitary weight matrices. In
this way, instead of using for-loops to perform 2D convolutions on images, the filter is converted into a
Toeplitz matrix and the image into a vector, after which a single matrix multiplication completes the
convolution. Such a matrix multiplication can be effectively replicated using photonic components. The
method requires forcing the Toeplitz matrix to be square; since unitary matrices are square by
definition, this step is necessary [168].

Figure 14: Toeplitz Matrix (P)
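The Toeplitz substitution can be checked numerically in one dimension (a sketch under the "valid" correlation convention; `conv_toeplitz` is an illustrative helper, not the construction in [168]):

```python
import numpy as np

def conv_toeplitz(kernel, n):
    """Build the Toeplitz matrix whose product with a length-n signal
    equals the 'valid' sliding-window correlation with `kernel` --
    turning the convolution loop into one matrix multiplication."""
    k = len(kernel)
    T = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        T[i, i:i + k] = kernel  # each row is the kernel, shifted by one
    return T

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
T = conv_toeplitz(w, len(x))
assert np.allclose(T @ x, np.correlate(x, w, mode="valid"))
```

The 2-D image case follows the same idea with a doubly-blocked Toeplitz matrix; in either case the photonic hardware only ever sees a single matrix-vector product.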
In contrast to fully-connected deep neural networks, the Toeplitz structure of an N × N matrix P
encodes far fewer degrees of freedom (2N − 1 rather than N²). Toeplitz implementations, however,
allow photonic FPGAs to handle CNNs with minimal complexity, without utilizing backpropagation
algorithms. With the introduction of Toeplitz implementations for photonic FPGAs, and possibly for
fully photonic accelerators, larger networks can be implemented more efficiently; most of the photonic
accelerator work reviewed within the scope of this paper reports results only on narrower networks due
to complexity and size restrictions. Such workarounds enable deep and effective DL implementations,
making it possible to train large models such as ConvNeXt [169] and deep ViTs in the photonic domain,
and to operate fully optical DL accelerators for larger and deeper CNN models [168].

7 Conclusion
Matrix-Vector Multiplication (MVM) operations, used in CNNs through convolutions, have been shown
to be accelerated either partially or entirely by photonics, achieving remarkable speed and significantly
lower energy consumption compared to their electronic counterparts, and showing promise for more
complex AI applications. The high compactness and integration density of on-chip integrated photonic
circuit platforms make them excellent frameworks for artificial neural networks. In contrast, electronic
components, due to their complexity, require a large number of transistors and an additional scheduler
program to perform even simple operations, whereas MVM operations can be implemented directly
using fundamental photonic components such as MRRs, MZIs, and diffractive planes.
Our study therefore explored the use of photonic deep learning accelerators in neuromorphic systems
with the aim of reducing power consumption and enhancing processing speed. The pursuit of improved
metrics, such as J/MAC and MAC/s, has driven a significant segment of the photonic community to
meticulously replicate algorithmic neural networks in photonic hardware. A significant and consistent
drawback highlighted in our paper is scalability; as evident in this study, Wavelength Division
Multiplexing (WDM) is often employed in accelerator designs to address this issue. However, despite its
significant added value, WDM demands sophisticated hardware, leading to increased costs and a larger
physical footprint. Ultimately, this calls for a shift towards even more efficient design approaches, as
discussed in this paper, driven by the already promising entries in this emerging domain.
Acknowledgements


This work was made possible with the support of the NYUAD Research Enhancement Fund. The au-
thors express their gratitude to the NYU Abu Dhabi Center for Smart Engineering Materials and The
Center for Cyber Security for their valuable contributions and support.

References
[1] J. Schmidhuber, Neural networks 2015, 61 85.
[2] X.-Y. Xu, X.-M. Jin, ACS Photonics 2023, 10, 4 1027.
[3] J. Von Neumann, IEEE Annals of the History of Computing 1993, 15, 4 27.
[4] M. D. Hill, G. S. Sohi, Readings in computer architecture, Gulf Professional Publishing, 2000.
[5] J. Song, Y. Cho, J.-S. Park, J.-W. Jang, S. Lee, J.-H. Song, J.-G. Lee, I. Kang, In 2019 IEEE In-
ternational Solid-State Circuits Conference-(ISSCC). IEEE, 2019 130–132.
[6] A. C. Elster, T. A. Haugdahl, Computing in Science & Engineering 2022, 24, 2 95.
[7] J. Koomey, S. Berard, M. Sanchez, H. Wong, IEEE Annals of the History of Computing 2010, 33,
3 46.
[8] M. Ahmadi, H. Bolhasani, Photonic neural networks: A compact review, 2023.
[9] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu,
A. Lukashchuk, A. S. Raja, et al., Nature 2021, 589, 7840 52.
[10] M. S. Rasras, D. M. Gill, M. P. Earnshaw, C. R. Doerr, J. S. Weiner, C. A. Bolle, Y.-K. Chen,
IEEE Photonics Technology Letters 2009, 22, 2 112.
[11] A. Krizhevsky, I. Sutskever, G. E. Hinton, Advances in neural information processing systems
2012, 25.
[12] P. Colangelo, O. Segal, A. Speicher, M. Margala, In 2019 IEEE High Performance Extreme Com-
puting Conference (HPEC). IEEE, 2019 1–8.
[13] D. Chiou, In 2017 IEEE International Symposium on Workload Characterization (IISWC). IEEE
Computer Society, 2017 124–124.
[14] C. Rı́os, N. Youngblood, Z. Cheng, M. Le Gallo, W. H. Pernice, C. D. Wright, A. Sebastian,
H. Bhaskaran, Science advances 2019, 5, 2 eaau5759.
[15] S. Xiao, M. H. Khan, H. Shen, M. Qi, Optics Express 2007, 15, 12 7489.
[16] S. Cheung, T. Su, K. Okamoto, S. Yoo, IEEE Journal of Selected Topics in Quantum Electronics
2013, 20, 4 310.
[17] V. J. Sorger, N. D. Lanzillotti-Kimura, R.-M. Ma, X. Zhang, Nanophotonics 2012, 1, 1 17.
[18] E. Timurdogan, C. M. Sorace-Agaskar, J. Sun, E. Shah Hosseini, A. Biberman, M. R. Watts, Na-
ture communications 2014, 5, 1 1.
[19] H. Sepehrian, J. Lin, L. A. Rusch, W. Shi, Journal of Lightwave Technology 2019, 37, 13 3078.
[20] J. Rosenberg, W. Green, S. Assefa, D. Gill, T. Barwicz, M. Yang, S. Shank, Y. Vlasov, Optics ex-
press 2012, 20, 24 26411.
[21] Y. Ban, J. Verbist, M. Vanhoecke, J. Bauwelinck, P. Verheyen, S. Lardenois, M. Pantouvaki,
J. Van Campenhout, In 2019 IEEE Optical Interconnects Conference (OI). IEEE, 2019 1–2.

[22] M. A. Nahmias, T. F. De Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, P. R. Prucnal, IEEE Jour-
nal of Selected Topics in Quantum Electronics 2019, 26, 1 1.
[23] M. Miscuglio, G. C. Adam, D. Kuzum, V. J. Sorger, APL Materials 2019, 7, 10.
[24] X. Ma, J. Meng, N. Peserico, M. Miscuglio, Y. Zhang, J. Hu, V. J. Sorger, In Optical Fiber Com-
munication Conference. Optica Publishing Group, 2022 M2E–4.
[25] N. Peserico, X. Ma, B. J. Shastri, V. J. Sorger, Emerging Topics in Artificial Intelligence (ETAI) 2022, 12204 53.
[26] C. Wu, S. Lee, H. Yu, R. Peng, I. Takeuchi, M. Li, In 2020 IEEE Photonics Conference (IPC).
IEEE, 2020 1–2.
[27] C. Wu, H. Yu, S. Lee, R. Peng, I. Takeuchi, M. Li, Nature communications 2021, 12, 1 96.
[28] B. Javidi, J. Li, Q. Tang, Applied optics 1995, 34, 20 3950.
[29] B. Javidi, Optical Engineering 1990, 29, 9 1013.
[30] M. Reck, A. Zeilinger, H. J. Bernstein, P. Bertani, Physical review letters 1994, 73, 1 58.
[31] D. A. Miller, Journal of Lightwave Technology 2013, 31, 24 3987.
[32] D. A. Miller, Optics express 2013, 21, 5 6360.
[33] D. Miller, In APS March Meeting Abstracts, volume 2015. 2015 S6–001.
[34] G. Cong, N. Yamamoto, T. Inoue, M. Okano, Y. Maegami, M. Ohno, K. Yamada, Optics Express
2019, 27, 18 24914.
[35] J. Wang, S. Gu, In 2021 11th International Conference on Information Science and Technology
(ICIST). IEEE, 2021 571–577.
[36] M. T. Ajili, Y. Hara-Azumi, IEEE Access 2022, 10 9603.
[37] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., In Pro-
ceedings of the 2016 ACM/SIGDA international symposium on field-programmable gate arrays.
2016 26–35.
[38] D. A. Miller, Optics express 2013, 21, 17 20220.
[39] J. Hardy, J. Shamir, Optics Express 2007, 15, 1 150.
[40] O. Schwelb, Journal of Lightwave Technology 2004, 22, 5 1380.
[41] Q. Xu, J. Shakya, M. Lipson, Optics express 2006, 14, 14 6463.
[42] L. Zhang, R. Ji, L. Jia, L. Yang, P. Zhou, Y. Tian, P. Chen, Y. Lu, Z. Jiang, Y. Liu, et al., Optics
letters 2010, 35, 10 1620.
[43] Q. Xu, D. Fattal, R. G. Beausoleil, Optics express 2008, 16, 6 4309.
[44] S. R. Tamalampudi, G. Dushaq, J. E. Villegas, N. S. Rajput, B. Paredes, E. Elamurugu, M. S.
Rasras, Optics Express 2021, 29, 24 39395.
[45] G. Dushaq, B. Paredes, J. E. Villegas, S. R. Tamalampudi, M. Rasras, Optics Express 2022, 30,
10 15986.
[46] S. R. Tamalampudi, G. Dushaq, J. E. Villegas, B. Paredes, M. S. Rasras, Journal of Lightwave
Technology 2023.

[47] S. M. Serunjogi, M. A. Sanduleanu, M. S. Rasras, Journal of Lightwave Technology 2017, 36, 9 1537.
[48] A. Sebastian, M. Le Gallo, G. W. Burr, S. Kim, M. BrightSky, E. Eleftheriou, Journal of Applied
Physics 2018, 124, 11.
[49] S. Ambrogio, P. Narayanan, H. Tsai, R. M. Shelby, I. Boybat, C. Di Nolfo, S. Sidler, M. Giordano,
M. Bodini, N. C. Farinha, et al., Nature 2018, 558, 7708 60.
[50] Z. Cheng, C. Rı́os, N. Youngblood, C. D. Wright, W. H. Pernice, H. Bhaskaran, Advanced Materi-
als 2018, 30, 32 1802435.
[51] Y. Zhang, J. B. Chou, J. Li, H. Li, Q. Du, A. Yadav, S. Zhou, M. Y. Shalaginov, Z. Fang,
H. Zhong, et al., Nature communications 2019, 10, 1 4279.
[52] E. Farquhar, P. Hasler, IEEE Transactions on Circuits and Systems I: Regular Papers 2005, 52, 3
477.
[53] C. Lee, W. Zame, A. Alaa, M. Schaar, In The 22nd international conference on artificial intelli-
gence and statistics. PMLR, 2019 596–605.
[54] L. Szilagyi, J. Pliva, R. Henker, D. Schoeniger, J. P. Turkiewicz, F. Ellinger, IEEE Journal of
Solid-State Circuits 2018, 54, 3 845.
[55] M. A. Nahmias, B. J. Shastri, A. N. Tait, P. R. Prucnal, IEEE journal of selected topics in quan-
tum electronics 2013, 19, 5 1.
[56] N. Rathi, I. Chakraborty, A. Kosta, A. Sengupta, A. Ankit, P. Panda, K. Roy, ACM Computing
Surveys 2023, 55, 12 1.
[57] N. Farmakidis, N. Youngblood, X. Li, J. Tan, J. L. Swett, Z. Cheng, C. D. Wright, W. H. Pernice,
H. Bhaskaran, Science advances 2019, 5, 11 eaaw2687.
[58] H. Zhang, L. Zhou, L. Lu, J. Xu, N. Wang, H. Hu, B. A. Rahman, Z. Zhou, J. Chen, ACS Photon-
ics 2019, 6, 9 2205.
[59] J. Feldmann, N. Youngblood, X. Li, C. D. Wright, H. Bhaskaran, W. H. Pernice, IEEE Journal of
Selected Topics in Quantum Electronics 2019, 26, 2 1.
[60] A. N. Tait, A. X. Wu, T. F. De Lima, E. Zhou, B. J. Shastri, M. A. Nahmias, P. R. Prucnal, IEEE
Journal of Selected Topics in Quantum Electronics 2016, 22, 6 312.
[61] W. Zhou, N. Farmakidis, J. Feldmann, X. Li, J. Tan, Y. He, C. D. Wright, W. H. Pernice,
H. Bhaskaran, MRS Bulletin 2022, 47, 5 502.
[62] M. S. V. Vassiliev, Light: Science & Applications 2023, 12 127.
[63] C. Brossollet, A. Cappelli, I. Carron, C. Chaintoutis, A. Chatelain, L. Daudet, S. Gigan, D. Hess-
low, F. Krzakala, J. Launay, et al., arXiv preprint arXiv:2107.11814 2021.
[64] R. Ohana, J. Wacker, J. Dong, S. Marmin, F. Krzakala, M. Filippone, L. Daudet, In ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2020 9294–9298.
[65] J. Launay, I. Poli, K. Müller, I. Carron, L. Daudet, F. Krzakala, S. Gigan, arXiv preprint
arXiv:2006.01475 2020.
[66] D. Hesslow, A. Cappelli, I. Carron, L. Daudet, R. Lafargue, K. Müller, R. Ohana, G. Pariente,
I. Poli, arXiv preprint arXiv:2104.14429 2021.
[67] T. W. Hughes, M. Minkov, Y. Shi, S. Fan, Optica 2018, 5, 7 864.

[68] L. Lu, Light: Science & Applications 2021, 10, 1.
[69] J. Cheng, H. Zhou, J. Dong, Nanomaterials 2021, 11 1683.
[70] A. Litvin, I. Martynenko, F. Purcell-Milton, A. Baranov, A. Fedorov, Y. Gun’Ko, Journal of Mate-
rials Chemistry A 2017, 5, 26 13252.
[71] N. Tate, Photonic Neural Networks with Spatiotemporal Dynamics 2024, 71.
[72] B. Lingnau, A. H. Perrott, et al., Optics Letters 2020, 45 2223.
[73] J. M. Shainline, S. M. Buckley, R. P. Mirin, S. W. Nam, Physical Review Applied 2017, 7, 3
034013.
[74] L. El Srouji, A. Krishnan, R. Ravichandran, Y. Lee, M. On, X. Xiao, S. Ben Yoo, APL Photonics
2022, 7, 5.
[75] T. F. De Lima, H.-T. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, P. R. Prucnal,
Journal of Lightwave Technology 2019, 37, 5 1515.
[76] M. A. Nahmias, H.-T. Peng, T. F. De Lima, C. Huang, A. N. Tait, B. J. Shastri, P. R. Prucnal, In
2018 IEEE Photonics Conference (IPC). IEEE, 2018 1–2.
[77] M. Xu, X. Mai, J. Lin, W. Zhang, Y. Li, Y. He, H. Tong, X. Hou, P. Zhou, X. Miao, Advanced
Functional Materials 2020, 30, 50 2003419.
[78] S. Song, J. Kim, S. M. Kwon, J.-W. Jo, S. K. Park, Y.-H. Kim, Advanced Intelligent Systems
2021, 3, 4 2000119.
[79] A. Jha, C. Huang, P. R. Prucnal, Optics letters 2020, 45, 17 4819.
[80] G. W. Burr, M. J. Brightsky, A. Sebastian, H.-Y. Cheng, J.-Y. Wu, S. Kim, N. E. Sosa, N. Papan-
dreou, H.-L. Lung, H. Pozidis, et al., IEEE Journal on Emerging and Selected Topics in Circuits
and Systems 2016, 6, 2 146.
[81] L. De Marinis, M. Cococcioni, O. Liboiron-Ladouceur, G. Contestabile, P. Castoldi, N. Andriolli, Applied Sciences 2021, 11 6232.
[82] A. Gondarenko, J. S. Levy, M. Lipson, Optics express 2009, 17, 14 11366.
[83] X. Li, N. Youngblood, C. Rı́os, Z. Cheng, C. D. Wright, W. H. Pernice, H. Bhaskaran, Optica
2019, 6, 1 1.
[84] P. Xu, J. Zheng, J. K. Doylend, A. Majumdar, Acs Photonics 2019, 6, 2 553.
[85] M. Wuttig, H. Bhaskaran, T. Taubner, Nature photonics 2017, 11, 8 465.
[86] Z. Yang, S. Ramanathan, IEEE Photonics Journal 2015, 7, 3 1.
[87] X. Xu, et al., Nature 2021, 589 44.
[88] Y. Shen, et al., Nature Photonics 2017, 11 441.
[89] Y. Shen, et al., Nature 2019, 569 208.
[90] J. Spall, X. Guo, A. I. Lvovsky, Optica 2022, 9, 7 803.
[91] D. Lin, A. L. Holsteen, E. Maguid, G. Wetzstein, P. G. Kik, E. Hasman, M. L. Brongersma, Nano
letters 2016, 16, 12 7671.
[92] D. Psaltis, D. Brady, K. Wagner, Applied Optics 1988, 27, 9 1752.
[93] V. Gruev, R. Etienne-Cummings, IEEE Transactions on Circuits and Systems II: Analog and Dig-
ital Signal Processing 2002, 49, 4 233.

[94] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, G. Wetzstein, Scientific reports 2018, 8, 1 12324.
[95] Z. Ying, C. Feng, Z. Zhao, S. Dhar, H. Dalir, J. Gu, Y. Cheng, R. Soref, D. Z. Pan, R. T. Chen,
Nature communications 2020, 11, 1 2154.
[96] Z. Ying, C. Feng, Z. Zhao, J. Gu, R. Soref, D. Z. Pan, R. T. Chen, IEEE Photonics Journal 2020,
12, 6 1.
[97] Q. Xu, R. Soref, Optics Express 2011, 19, 6 5244.
[98] M. Li, Y. Deng, J. Tang, et al., Scientific Reports 2016, 6 19985.
[99] N. Stroev, N. G. Berloff, Analog photonics computing for information processing, inference and optimisation, 2023.
[100] M. Li, Y. Deng, J. Tang, S. Sun, J. Yao, J. Azaña, N.-H. Zhu, Scientific Reports 2016, 6.
[101] M. Nakajima, K. Tanaka, T. Hashimoto, Communications Physics 2021, 4, 1 20.
[102] T. K. Paraiso, T. Roger, D. G. Marangon, I. De Marco, M. Sanzaro, R. I. Woodward, J. F. Dynes,
Z. Yuan, A. J. Shields, Nature Photonics 2021, 15, 11 850.
[103] E. Pelucchi, G. Fagas, I. Aharonovich, D. Englund, E. Figueroa, Q. Gong, H. Hannes, J. Liu, C.-Y.
Lu, N. Matsuda, et al., Nature Reviews Physics 2022, 4, 3 194.
[104] P. Sibson, J. E. Kennard, S. Stanisic, C. Erven, J. L. O’Brien, M. G. Thompson, Optica 2017, 4, 2
172.
[105] S. Buck, R. Coleman, H. Sargsyan, arXiv preprint arXiv:2107.02151 2021.
[106] D. Bunandar, A. Lentine, C. Lee, H. Cai, C. M. Long, N. Boynton, N. Martinez, C. DeRose,
C. Chen, M. Grein, et al., Physical Review X 2018, 8, 2 021009.
[107] Y. Paquot, F. Duport, A. Smerieri, J. Dambre, B. Schrauwen, M. Haelterman, S. Massar, Scientific
reports 2012, 2, 1 287.
[108] K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, P. Bienstman, IEEE transactions on
neural networks 2011, 22, 9 1469.
[109] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M. Gutiérrez, L. Pesquera, C. R. Mirasso,
I. Fischer, Optics express 2012, 20, 3 3241.
[110] G. Paulin, R. Andri, F. Conti, L. Benini, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 2021, 29, 9 1624.
[111] A. W. Elshaari, W. Pernice, K. Srinivasan, O. Benson, V. Zwiller, Nature Photonics 2020, 14, 5.
[112] H. Ishio, J. Minowa, K. Nosu, Journal of Lightwave Technology 1984, 2, 4 448.
[113] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jack-
son, N. Imam, C. Guo, Y. Nakamura, et al., Science 2014, 345, 6197 668.
[114] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi,
N. Imam, S. Jain, et al., Ieee Micro 2018, 38, 1 82.
[115] T. Wang, C. Wang, X. Zhou, H. Chen, arXiv preprint arXiv:1901.04988 2018.
[116] G. Sarantoglou, A. Bogris, C. Mesaritakis, S. Theodoridis, IEEE Journal of Selected Topics in Quantum Electronics 2022, 28, 6 1.
[117] M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, P. R. Prucnal, IEEE Journal
of Selected Topics in Quantum Electronics 2020, 26, 1 1.

[118] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, T. El-Ghazawi, In 2018 31st IEEE International System-on-Chip Conference (SOCC). IEEE, 2018 169–173.
[119] K. Shiflett, A. Karanth, R. Bunescu, A. Louri, In 2021 ACM/IEEE 48th Annual International
Symposium on Computer Architecture (ISCA). IEEE, 2021 860–873.
[120] J. Peng, Y. Alkabani, K. Puri, X. Ma, V. Sorger, T. El-Ghazawi, ACM Journal on Emerging Tech-
nologies in Computing Systems (JETC) 2022, 18, 4 1.
[121] K. He, X. Zhang, S. Ren, J. Sun, CoRR 2015, abs/1512.03385.
[122] W. Zhang, H. Zhang, Weight bank addition photonic accelerator for artificial intelligence, 2023.
[123] D. Dang, J. Dass, R. Mahapatra, In 2017 IEEE 24th International Conference on High Perfor-
mance Computing (HiPC). IEEE, 2017 114–123.
[124] C. Zhang, Z. Fang, P. Zhou, P. Pan, J. Cong, In 2016 IEEE/ACM International
Conference on Computer-Aided Design (ICCAD). IEEE Press, 2016 1–8, URL
https://doi.org/10.1145/2966986.2967011.
[125] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S.
Williams, V. Srikumar, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Ar-
chitecture (ISCA) 2016, 14–26.
[126] D. Patel, A. Samani, V. Veerasubramanian, S. Ghosh, D. V. Plant, IEEE Photonics Technology
Letters 2015, 27, 23 2433.
[127] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
2015.
[128] L. Deng, IEEE Signal Processing Magazine 2012, 29, 6 141.
[129] D. Dang, S. V. R. Chittamuru, S. Pasricha, R. Mahapatra, D. Sahoo, ACM Journal on Emerging
Technologies in Computing Systems (JETC) 2021, 17, 4 1.
[130] D. Dang, B. Lin, D. Sahoo, ACM Transactions on Architecture and Code Optimization (TACO)
2022, 19, 3 1.
[131] A. F. Agarap, CoRR 2018, abs/1803.08375.
[132] K. Shiflett, A. Karanth, A. Louri, R. Bunescu, In Proceedings of the 2021 on Great Lakes Sympo-
sium on VLSI. 2021 9–14.
[133] Y. Wang, T. D. Wilkinson, In Frontiers in Optics. Optical Society of America, 2020 FM7D–1.
[134] R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, D. Englund, Physical Review X 2019, 9, 2
021032.
[135] D. Moss 2021.
[136] F. Sunny, M. Nikdast, S. Pasricha, In 2022 27th Asia and South Pacific Design Automation Con-
ference (ASP-DAC). 2022 214–219.
[137] C. Feng, J. Gu, H. Zhu, Z. Ying, Z. Zhao, D. Z. Pan, R. T. Chen, ACS Photonics 2022, 9, 12
3906.
[138] A. Anderson, A. Vasudevan, C. Keane, D. Gregg, Low-memory gemm-based convolution algo-
rithms for deep neural networks, 2017.
[139] G. Giamougiannis, A. Tsakyridis, M. Moralis-Pegios, G. Mourgias-Alexandris, A. R. Totovic,
G. Dabos, M. Kirtas, N. Passalis, A. Tefas, D. Kalavrouziotis, et al., Advanced Photonics 2023,
5, 1 016004.

[140] L. D. Marinis, F. Nesti, M. Cococcioni, N. Andriolli, In OSA Advanced Photonics Congress (AP)
2020 (IPR, NP, NOMA, Networks, PVLED, PSC, SPPCom, SOF). Optica Publishing Group,
2020 PsTh1F.3, URL https://opg.optica.org/abstract.cfm?URI=PSC-2020-PsTh1F.3.
[141] F. Ashtiani, M. B. On, D. Sanchez-Jacome, D. Perez-Lopez, S. J. Ben Yoo, A. Blanco-Redondo, In
2023 Optical Fiber Communications Conference and Exhibition (OFC). 2023 1–3.
[142] C. Demirkiran, F. Eris, G. Wang, J. Elmhurst, N. Moore, N. C. Harris, A. Basumallik, V. J.
Reddi, A. Joshi, D. Bunandar, arXiv preprint arXiv:2109.01126 2021.
[143] S. Xu, J. Wang, W. Zou, arXiv preprint ArXiv:1910.12635 2019.
[144] F. Sunny, A. Mirza, M. Nikdast, S. Pasricha, In 2021 58th ACM/IEEE Design Automation Con-
ference (DAC). 2021 1069–1074.
[145] S. Afifi, F. Sunny, M. Nikdast, S. Pasricha, In Proceedings of the Great Lakes Symposium on VLSI
2023. 2023 15–21.
[146] Q. Lou, W. Liu, W. Liu, F. Guo, L. Jiang, In 2020 25th Asia and South Pacific Design Automa-
tion Conference (ASP-DAC). 2020 464–469.
[147] S. Hochreiter, J. Schmidhuber, Advances in neural information processing systems 1996, 9.
[148] K. Shiflett, D. Wright, A. Karanth, A. Louri, In 2020 IEEE International Symposium on High Per-
formance Computer Architecture (HPCA). IEEE, 2020 474–487.
[149] F. Zokaee, Q. Lou, N. Youngblood, W. Liu, Y. Xie, L. Jiang, In 2020 Design, Automation & Test
in Europe Conference & Exhibition (DATE). IEEE, 2020 1438–1443.
[150] J. Peng, Y. Alkabani, S. Sun, V. J. Sorger, T. El-Ghazawi, In Proceedings of the 49th International
Conference on Parallel Processing. 2020 1–11.
[151] F. P. Sunny, A. Mirza, M. Nikdast, S. Pasricha, ACM Trans. Embed. Comput. Syst. 2021, 20, 5s.
[152] F. Sunny, M. Nikdast, S. Pasricha, In 2022 IEEE Computer Society Annual Symposium on VLSI
(ISVLSI). IEEE, 2022 98–103.
[153] Y. Li, K. Wang, H. Zheng, A. Louri, A. Karanth, IEEE Transactions on Circuits and Systems I:
Regular Papers 2022, 69, 7 2730.
[154] F. Sunny, M. Nikdast, S. Pasricha, In Proceedings of the Great Lakes Symposium on VLSI 2022.
2022 367–371.
[155] M. Al-Qadasi, L. Chrostowski, B. Shastri, S. Shekhar, APL Photonics 2022, 7, 2.
[156] S. Zheng, J. Zhang, W. Zhang, arXiv preprint arXiv:2303.01287 2023.
[157] H. Zhou, J. Dong, J. Cheng, W. Dong, C. Huang, Y. Shen, Q. Zhang, M. Gu, C. Qian, H. Chen,
et al., Light: Science & Applications 2022, 11, 1 30.
[158] A. Krizhevsky, I. Sutskever, G. E. Hinton, In F. Pereira, C. Burges, L. Bottou, K. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012 URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[159] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for
computer vision, 2015.
[160] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam,
Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.

[161] X. Zhang, X. Zhou, M. Lin, J. Sun, Shufflenet: An extremely efficient convolutional neural network
for mobile devices, 2017.
[162] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, In Proceedings of the IEEE conference
on computer vision and pattern recognition. 2017 4700–4708.
[163] M. Tan, Q. Le, In International conference on machine learning. PMLR, 2019 6105–6114.
[164] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, CoRR 2020, abs/2010.11929.
[165] K.-i. Kitayama, M. Notomi, M. Naruse, K. Inoue, S. Kawakami, A. Uchida, Apl Photonics 2019,
4, 9.
[166] G. H. Golub, C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 2013.
[167] S. Kokila, A. Jayachandran, Remote Sensing Applications: Society and Environment 2023, 29
100881.
[168] G. Agrafiotis, E. Makri, I. Kalamaras, A. Lalas, K. Votis, D. Tzovaras, In Proceedings of the
Northern Lights Deep Learning Workshop, volume 4. 2023 .
[169] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A convnet for the 2020s, 2022.

Mohammad Atwany is currently a PhD student at the Institute of Biomedical Engineering (IBME), Department of Engineering Science, University of Oxford. He completed his BSc (Hons) and Master's degrees in Machine Learning in December 2022, then joined NYU as a research engineer in the Department of Electrical Engineering before officially joining the MultiMeDIA lab at the University of Oxford in October 2023. His research interests span Photonic Integrated Circuit (PIC) design, photonics in biomedical engineering, and machine learning, including Domain Generalization and Domain Adaptation in interventional medicine.

Solomon M. Serunjogi received the B.Sc. degree in telecommunications from Kyambogo University, Kyambogo, Uganda, and the M.Sc. and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Masdar Institute of Science and Technology, Abu Dhabi, UAE. He is currently a research associate in the Photonics Lab at NYU Abu Dhabi. His research interests include microwave photonics, signal processing, and the design of CMOS driver circuits for high-speed telecom applications.

Mahmoud Rasras, an Associate Professor of Electrical Engineering at New York University Abu Dhabi (NYUAD), earned his Ph.D. in physics from the Catholic University of Leuven, Belgium, with research conducted at imec. After more than 11 years of industrial research at Bell Labs, Alcatel-Lucent, NJ, USA, and a term as a faculty member and Director of the SRC/GlobalFoundries Center of Excellence for Integrated Photonics at the Masdar Institute, UAE, he joined NYUAD. Dr. Rasras has authored 180 papers in high-impact journals and holds many granted US patents. He specializes in 2D-material optoelectronics, silicon photonics, plasmonic-enhanced optoelectronics, and microwave photonics. He served as an Associate Editor of Optics Express and is an IEEE Senior Member and a member of the Mohammed Bin Rashid Academy of Scientists.
