Machine Learning Methods in Coherent Optical Communication Systems
Publication date:
2019
Document Version
Publisher's PDF, also known as Version of record
Citation (APA):
Jones, R. T. (2019). Machine Learning Methods in Coherent Optical Communication Systems. Technical
University of Denmark.
Ph.D. Thesis
Doctor of Philosophy
Ørsteds Plads
Building 343
2800 Kongens Lyngby, Denmark
Phone +45 4525 6352
www.fotonik.dtu.dk
Abstract
The future demand for digital information will outpace the capabilities of
current optical communication systems, which are approaching their limits
due to fiber-intrinsic nonlinear effects. Machine learning methods promise
to find new ways of exploiting the available resources and to handle future
challenges in larger and more complex systems.
The methods presented in this work apply machine learning to optical communication
systems, with three main contributions. First, a machine learning
framework combines dimension reduction and supervised learning to address
computationally expensive steps of fiber channel models. The trained algorithm
allows more efficient execution of the models. Second, supervised learning is
combined with a mathematical technique, the nonlinear Fourier transform,
realizing more accurate detection of high-order solitons. While the technique
aims to overcome the intrinsic fiber limitations using a reciprocal effect be-
tween chromatic dispersion and nonlinearities, there is a non-deterministic
impairment from fiber loss and amplification noise, leading to distorted soli-
tons. A machine learning algorithm is trained to detect the solitons despite the
distortions, and in simulation studies the trained algorithm outperforms the
standard receiver. Third, an unsupervised learning algorithm with embedded
fiber channel model is trained end-to-end, learning a geometric constellation
shape that mitigates nonlinear effects. In simulation and experimental studies,
the learned constellations yield improved performance over state-of-the-art
geometrically shaped constellations.
The contributions presented in this work show how machine learning can
be applied together with optical fiber channel models, and demonstrate that
machine learning is a viable tool for increasing the capabilities of optical
communication systems.
Resumé
The future demand for digital information will outpace the capacity of
current optical communication systems, which are approaching their limits
due to nonlinear fiber effects. Machine learning methods let us find
new ways of exploiting the existing resources and handle challenges in
larger and more complex systems.
The methods presented in this thesis are machine learning methods for
optical communication systems, divided into three main categories. First, a
machine learning framework that, by combining dimension reduction and supervised
learning, addresses computationally heavy steps in fiber modeling. The
trained algorithm enables more efficient computations in the models.
In the second method, supervised learning is combined with a mathematical
technique, the nonlinear Fourier transform, which makes it possible to
detect higher-order solitons. The technique seeks to overcome fiber limitations
by using a reciprocal effect between chromatic dispersion and
nonlinearities. There is a non-deterministic source of error from the fiber loss and
amplifier noise, which leads to distorted solitons. A machine learning algorithm
is trained to detect the solitons despite the distortions,
and in simulations the algorithm performed better than a standard receiver. The
third and final method uses an unsupervised learning algorithm with an embedded
fiber channel model. This algorithm is trained end-to-end to obtain a
geometric constellation shape that mitigates nonlinear effects. In both
simulations and experiments, this algorithm has improved upon the performance of
state-of-the-art geometric constellations.
The work presented here shows how machine learning can be used together
with optical fiber models and demonstrates that machine learning is a useful
tool for increasing the capacity of optical communication systems.
Preface
The work presented in this thesis was carried out as a part of my Ph.D. project
in the period November 1st, 2015 to October 31st, 2018. The work took
place at the Department of Photonics Engineering, Technical University of
Denmark, Kgs. Lyngby, Denmark, with an external stay of six months at the
Photonic Network System Laboratory, National Institute of Information and
Communications Technology, Tokyo, Japan.
Contents
Resumé i
Preface iii
Acknowledgements v
Ph.D. Publications ix
Acronyms xvii
1 Introduction 1
1.1 Motivation and Outline of Contributions 4
1.2 Structure of the Thesis 6
7 Conclusion 91
7.1 Summary 91
7.2 Outlook 93
Bibliography 97
Acronyms
ADC analog-to-digital converter
AE autoencoder
AI artificial intelligence
GN Gaussian noise
I in-phase
MI mutual information
ML maximum likelihood
Q quadrature
consumed by humans [11]. Within data centers, large-scale machine learning
algorithms demand repeated access to vast amounts of data [12, 13], teaching
themselves from human-labeled examples. Future networks must further
scale, while staying cost and power efficient [14]. The growing network com-
plexity requires new digital signal processing (DSP) algorithms, low power
devices, self-monitoring systems as well as adaptive and flexible routing mech-
anisms [3].
Increasing the spectral efficiency within an SSMF even further requires
high-order modulation formats, which demand larger optical launch power
to counteract amplification noise. However, higher power levels trigger the
performance-limiting nonlinear response of SSMF [15], due to the fiber-intrinsic
Kerr effect. The Kerr effect describes power-dependent nonlinear interactions
of signal frequency components. Although the nonlinear effects are
deterministic, they are driven by the random data carried by the WDM sig-
nals. In combination with chromatic dispersion and amplification noise, the
nonlinear effects make joint processing and signal recovery of multiple WDM
channels challenging [16]. Further, joint processing of multiple WDM channels
is not possible in optically routed systems. When different channels have dis-
tinct destinations, the deterministic nonlinear interactions become stochastic
from a receiver standpoint, since the underlying information of the neigh-
bouring channels is unavailable when demodulating and recovering the signal.
Hence, a launch power exists for which the amplification noise and nonlinear
effects are balanced and which maximizes the transmission rate. This power-dependent
limitation is widely referred to as the nonlinear Shannon limit [16–19].
Although the existence of such a limit has never been proven, addressing
the nonlinear Shannon limit has received much attention within fiber-optic
communication research, with different potential approaches [20].
Mitigating and compensating nonlinear effects with DSP techniques has
been studied [21], where most of these efforts are based on insight from the
nonlinear Schrödinger equation (NLSE) [22] and fiber channel models derived
from it [23–30]. By propagating the received lightwave systematically through
a virtual fiber from receiver to transmitter, digital back-propagation (DBP)
simulates the reverse of the complex nonlinear effects defined by the NLSE.
It accounts for the deterministic processes of the nonlinear effects, yet it is
limited by non-deterministic processes between signal and amplification noise,
polarization effects, and its computational demands. Compensation methods
that build upon perturbation models [29, 31], first or higher-order approx-
imations of the NLSE, range from compensating for the accumulated fiber
nonlinearities in one step [31], to continuously tracking a time varying process
induced by fiber nonlinearities with a Kalman filter [32]. Modeling the NLSE
with a Volterra series [33] leads to an inverse transfer function for the NLSE,
yet, like DBP, such a compensation method is computationally expensive.
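The DBP idea described above can be sketched in a few lines: the received waveform is propagated through a virtual fiber whose loss, dispersion and nonlinearity have inverted signs, with one symmetric split step per segment. This is a minimal single-polarization sketch in normalized units; the function name and parameter choices are illustrative, not taken from the thesis.

```python
import numpy as np

def dbp(rx, n_steps, h, alpha, beta2, gamma, fs=1.0):
    """Back-propagate a received single-polarization waveform through a
    virtual fiber with inverted loss, dispersion and nonlinearity,
    using one symmetric split step per segment of length h."""
    n = len(rx)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1 / fs)
    # Inverted linear half-step: +alpha/2 (gain) and +i*(beta2/2)*omega^2
    half_lin = np.exp((alpha / 2 + 1j * beta2 / 2 * omega**2) * h / 2)
    a = rx.astype(complex)
    for _ in range(n_steps):
        a = np.fft.ifft(np.fft.fft(a) * half_lin)       # linear half-step
        a = a * np.exp(-1j * gamma * np.abs(a)**2 * h)  # inverted nonlinear step
        a = np.fft.ifft(np.fft.fft(a) * half_lin)       # linear half-step
    return a
```

In practice the number of steps per span trades accuracy against the computational demands mentioned above.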
Further, constellation shaping increases transmission data rates [34, 35] by
choosing a modulation format design with non-equidistant or non-equiprobable
symbols, known as geometric and probabilistic shaping, respectively. Based
on insight from fiber channel models, constellation shaping can also be used
to mitigate signal dependent nonlinear effects [34].
By adding space as another multiplexing dimension [36, 37], space-division
multiplexing (SDM) addresses the scaling problem and increases the number
of parallel channels of optical communication systems. However, SDM does not
address the nonlinear Shannon limit directly, since within each spatial
path the limit still applies. It is realized with multi-core fibers, multi-mode
fibers or a combination of the two. Besides more channels, SDM with joint
DSP could lead to cost and energy savings, yet new infrastructure is required,
since new fiber types must be installed. For future systems with WDM signals
on all spatial paths, resource allocation will be challenging given the growing
number of interdependent parameters and various complex constraints. However,
with the latest record-breaking experiments [38–40], SDM is an obvious
choice for large-capacity future systems.
The nonlinear Fourier transform (NFT) has received significant interest
from the optical fiber research community, as a method to include fiber non-
linearities as part of the transmission. With the NFT, solutions to the NLSE
can be found, which theoretically overcome the performance limitations lead-
ing to the nonlinear Shannon limit. It describes a mapping between the time-
domain signal and the nonlinear spectrum. Information encoded in nonlinear
spectra transforms linearly during transmission over a noise free and lossless
link. However, realistic channels include losses and noise, and their joint
perturbation leads to non-trivial distortions in the nonlinear spectrum, making
data recovery challenging. Further research is required for the NFT to become
a viable transmission method, including new DSP algorithms and analysis of
the perturbation.
The different directions of addressing the nonlinear Shannon limit require
advanced fiber models, new DSP algorithms and complex optimization meth-
ods for resource allocation. Machine learning methods promise to resolve
emerging problems within these research directions. With fiber optic commu-
nication systems growing in size and complexity, the set of optimizable param-
eters, including modulation formats, optical launch power levels and flexible
channel spacing, becomes multi-dimensional and interdependent. Here, ma-
tion shapes.
Chapter 7 summarizes the results of this work, and presents future direc-
tions and open challenges of the presented methods.
CHAPTER 2
Coherent communication systems
In this chapter, the building blocks of a coherent optical communication sys-
tem are introduced. The main concept is to transmit bits of information from
point A to B through an optical fiber. Light from a laser source is externally
modulated by an electrical signal driving a modulator and thereafter trans-
mitted over the fiber channel. During transmission, the signal is amplified
multiple times to compensate for the fiber loss by erbium-doped fiber ampli-
fiers (EDFAs) or Raman amplifiers. The fiber channel is dispersive, and
power-dependent Kerr nonlinearities occur. At the receiving end, the optical sig-
nal is mapped into the electrical domain via coherent detection. The digital
bit sequence is extracted from the electrical signal, with a set of signal re-
covery algorithms. Typically, forward error correction (FEC) is employed at
the transmitter such that errors are potentially corrected in the received bit
sequence. Compared to intensity modulation/direct detection (IM/DD) sys-
tems, where only the amplitude is modulated, coherent systems allow higher
spectral efficiency.
Section 2.1 describes how the bits are converted into an optical wave-
form carrying the information. In Section 2.2 the additive white Gaussian
noise (AWGN) channel model is explained. In Section 2.3, the optical fiber
channel is described along with its various simulation methods, ranging from
computationally heavy numerical methods to derived mathematical models. Sec-
tion 2.4 describes the necessary steps of extracting the information from the
received and distorted waveform. Section 2.5 describes metrics assessing the
performance of coherent communication systems. Section 2.6 briefly intro-
duces nonlinear Fourier transform (NFT) based transmission and Section 2.7
[Figure: block diagram of a coherent transmission system: bits are mapped to symbols, pulse-shaped into an electrical signal, modulated onto a laser carrier, transmitted over the channel, and processed by DSP at the receiver.]
2.1.1 Lasers
Before transmitting a data signal over an optical fiber, an optical carrier is re-
quired, ideally a continuous lightwave with constant amplitude and frequency,
such that the data signal is not distorted by its carrier [61]. A device generat-
ing an ideal light source does not exist. However, stable lasers with linewidths
in the MHz region meet most demands of an optical carrier. These lasers are
used for directly modulated systems and systems with external modulation.
Direct modulation typically controls the electrical current driving the laser,
such that the laser emits the pattern of the data signal. Such a modulation
format is called on-off keying or more generally amplitude-shift keying. Di-
rect modulation is attractive for short-reach applications, due to the small
hardware size and low power consumption. For long transmission distances,
external modulators are used. They enable faster and more sensitive coherent
transmission, where the amplitude and phase of the carrier are modulated.
The spectrum of a laser is not a single spectral line but has a finite
spectral linewidth. The combined linewidth of the lasers at the transmitter and
receiver determines the amount of phase noise added to the carried signal.
Phase noise is typically described as a Wiener process [62],

ψ[k] = ψ[k − 1] + n[k],    (2.1)

where ψ[k] is the phase noise at time k and n[k] ∼ N(0, σψ²) is an independent
and identically distributed Gaussian random variable. The variance of the
Gaussian variable is determined by:

σψ² = 2π∆f Ts,    (2.2)

where Ts is the symbol period and ∆f is the sum of the linewidths of the carrier
and local oscillator lasers. After reception, the carrier phase noise must be estimated
to obtain the transmitted signal. Section 2.4.2 on the digital signal processing
(DSP) chain includes a discussion on carrier phase estimation, compensating
for phase noise.
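The Wiener phase-noise model above can be simulated directly; the following is a small sketch, where the function name and the example linewidth and symbol-rate values are illustrative.

```python
import numpy as np

def wiener_phase_noise(n_symbols, linewidth_hz, symbol_rate_hz, rng=None):
    """Generate a Wiener phase-noise trace: psi[k] = psi[k-1] + n[k],
    with n[k] ~ N(0, 2*pi*df*Ts), df the combined laser linewidth."""
    rng = np.random.default_rng(rng)
    var = 2 * np.pi * linewidth_hz / symbol_rate_hz  # sigma_psi^2 = 2*pi*df*Ts
    steps = rng.normal(0.0, np.sqrt(var), n_symbols)
    return np.cumsum(steps)

# Apply to a QPSK symbol sequence: rotate each symbol by the accumulated phase.
symbols = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)
psi = wiener_phase_noise(4, linewidth_hz=100e3, symbol_rate_hz=32e9, rng=0)
noisy = symbols * np.exp(1j * psi)
```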
The transmitted waveform is formed by pulse shaping the symbol sequence,

s(t) = Σk sk g(t − kTs),    (2.3)

where sk is the symbol at time kTs and g(t) is the pulse-shape. The band-
width of the obtained signal is controlled with the pulse-shaping filter. Raised
cosine and root-raised cosine pulse-shapes can be used for transmission in op-
tical communication. They enable intersymbol interference-free transmission
and control over the spectral edge through the roll-off factor of the filter [63].
Low roll-off factors lead to sharp spectral edges and allow densely packed
wavelength-division multiplexing (WDM) channels [65]. More elaborate pulse-
shapes have been proposed for optical communication with increased nonlin-
earity tolerance [66].
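A root-raised-cosine pulse-shaping stage can be sketched as follows. The closed-form RRC impulse response is the standard one; the filter span, roll-off and samples-per-symbol values are illustrative choices, not parameters from the thesis.

```python
import numpy as np

def rrc_filter(roll_off, span, sps):
    """Root-raised-cosine impulse response over `span` symbols at
    `sps` samples per symbol (time normalized to the symbol period),
    normalized to unit energy."""
    t = np.arange(-span * sps // 2, span * sps // 2 + 1) / sps
    b = roll_off
    g = np.zeros_like(t)
    for i, ti in enumerate(t):
        if np.isclose(ti, 0.0):
            g[i] = 1 - b + 4 * b / np.pi
        elif b > 0 and np.isclose(abs(ti), 1 / (4 * b)):
            # Singular point of the closed form, handled separately
            g[i] = (b / np.sqrt(2)) * ((1 + 2 / np.pi) * np.sin(np.pi / (4 * b))
                                       + (1 - 2 / np.pi) * np.cos(np.pi / (4 * b)))
        else:
            num = (np.sin(np.pi * ti * (1 - b))
                   + 4 * b * ti * np.cos(np.pi * ti * (1 + b)))
            den = np.pi * ti * (1 - (4 * b * ti) ** 2)
            g[i] = num / den
    return g / np.sqrt(np.sum(g ** 2))

def pulse_shape(symbols, g, sps):
    """Upsample the symbol sequence and convolve with the pulse g."""
    up = np.zeros(len(symbols) * sps, dtype=complex)
    up[::sps] = symbols
    return np.convolve(up, g)
```

Applying the same filter again at the receiver (matched filtering) yields a raised-cosine response and hence near intersymbol-interference-free samples at the symbol instants.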
Further, signal processing in the electrical domain enables chromatic dis-
persion pre-compensation [67] and nonlinear mitigation techniques [68, 69].
The limited resolution of digital-to-analog converters (DACs) introduces
quantization noise [70], and the frequency response of the transmitter [71]
leads to a non-ideal signal.
[Figure: I/Q modulator, consisting of two parallel Mach-Zehnder modulators with a π/2 phase shift between the in-phase and quadrature branches.]
The propagation of the optical field envelope A(z, t) along the fiber is
described by the NLSE,

∂A/∂z = −(α/2)A − i(β2/2)(∂²A/∂t²) + iγ|A|²A,    (2.5)

where α and β2 account for fiber loss and second-order chromatic dispersion
effects. Third-order chromatic dispersion effects are neglected. The nonlinear
coefficient γ = 2πn2/(λc Aeff) is defined by the nonlinear-index coefficient n2,
the center wavelength λc and the effective core area Aeff of the fiber. Chro-
matic dispersion gives rise to pulse broadening in time-domain within a single
channel and walk-off across WDM channels, both due to different propagation
speeds of light at different wavelengths. The Kerr effect introduces nonlinear-
ities modeled by the nonlinear term, such as self-phase modulation (SPM),
cross-phase modulation (XPM) and four-wave mixing (FWM).
For dual-polarized lightwaves, the NLSE is simplified by the Manakov
equations, if dispersion and nonlinear lengths are much larger than the bire-
fringence correlation length [74]:
dAx/dz = −(α/2)Ax − i(β2/2)(d²Ax/dt²) + i(8/9)γ(|Ax|² + |Ay|²)Ax,    (2.6a)

dAy/dz = −(α/2)Ay − i(β2/2)(d²Ay/dt²) + i(8/9)γ(|Ax|² + |Ay|²)Ay,    (2.6b)
where Ax (z, t) and Ay (z, t), or as vector A(z, t), are the pulse envelopes of
two orthogonal light polarizations. The random birefringence along the fiber
leads to many random unitary rotations of the two polarizations. The factor
8/9 of the expression accounts for an average of the interplay between the
nonlinearities and the continuously rotating polarization fields over long fiber
lengths. Polarization mode dispersion is not considered in (2.6).
Typical parameters for such systems are given in Table 2.1, where D =
−2πcβ2/λc². The |·|² terms in (2.5) and (2.6) reflect the dependence of the
optical signal on its own power. At high power, nonlinear effects are predominant
and become performance limiting if unmanaged.
At the end of each span, the optical signal is amplified to compensate for
fiber losses, introducing amplification noise from an EDFA. The noise occurs
due to amplified spontaneous emission (ASE) in the amplifier and is typically
modeled with an AWGN noise source with variance σ²ASE [22],

nsp = (G · NF − 1) / (2 · (G − 1)),    (2.7a)

σ²ASE = (G − 1) · nsp · (h · c / λc) · B,    (2.7b)
where nsp is the spontaneous-emission factor, G is the amplification gain, NF
is the EDFA noise figure, h is the Planck constant, c is the speed of light, λc
is the signal center wavelength and B the bandwidth of the system. Other
amplification methods, such as Raman amplification, are not considered in
this work.
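Equation (2.7) can be evaluated directly. The sketch below uses illustrative example values (20 dB gain, 5 dB noise figure, 1550 nm, 32 GHz bandwidth); only the formula itself comes from the text.

```python
def ase_variance(gain_db, nf_db, wavelength_m, bandwidth_hz):
    """ASE noise variance of one EDFA per (2.7): nsp from gain G and
    noise figure NF, then sigma^2 = (G - 1) * nsp * h * c / lambda_c * B."""
    h = 6.62607015e-34   # Planck constant [J s]
    c = 299792458.0      # speed of light [m/s]
    G = 10 ** (gain_db / 10)
    NF = 10 ** (nf_db / 10)
    nsp = (G * NF - 1) / (2 * (G - 1))           # spontaneous-emission factor
    return (G - 1) * nsp * h * c / wavelength_m * bandwidth_hz

# Example: 20 dB gain, 5 dB noise figure, 1550 nm carrier, 32 GHz bandwidth.
sigma2 = ase_variance(20, 5, 1550e-9, 32e9)
```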
part in time-domain. For one global step of length h, half of the linear step
is applied, followed by a full nonlinear step, and thereafter the second half of
the linear step:
ÃL(z + h/2, ω) = exp( (−α/2 − i(β2/2)ω²) h/2 ) Ã(z, ω),    (2.9a)

AN(z + h, t) = exp( iγ|AL|² h ) AL(z, t),    (2.9b)

Ã(z + h, ω) = exp( (−α/2 − i(β2/2)ω²) h/2 ) ÃN(z, ω),    (2.9c)

where Ã(z, ω) is the Fourier transform of A(z, t). For a dual-polarized signal,
the same method is applied to the Manakov equations. The SSFM models
SPM, XPM and FWM accurately, but compared to the other models presented
in this section, the SSFM is computationally expensive.
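The symmetric split-step scheme of (2.9) translates directly into code. The following is a normalized-units sketch for a single polarization; the function and parameter names are illustrative.

```python
import numpy as np

def ssfm(a0, length, n_steps, alpha, beta2, gamma, fs=1.0):
    """Symmetric split-step Fourier propagation of A(0, t) per (2.9):
    linear half-step in frequency, full nonlinear step in time,
    then the second linear half-step, repeated over n_steps segments."""
    h = length / n_steps
    n = len(a0)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=1 / fs)
    half_lin = np.exp((-alpha / 2 - 1j * beta2 / 2 * omega**2) * h / 2)
    a = a0.astype(complex)
    for _ in range(n_steps):
        a = np.fft.ifft(np.fft.fft(a) * half_lin)      # (2.9a)
        a = a * np.exp(1j * gamma * np.abs(a)**2 * h)  # (2.9b)
        a = np.fft.ifft(np.fft.fft(a) * half_lin)      # (2.9c)
    return a
```

For a constant-envelope input the nonlinear step reduces to the exact SPM phase rotation, which gives a quick sanity check of the implementation.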
A(0, t) = Σ_{m=1}^{NCh} U(m)(0, t) e^{iΩ(m)t},    (2.10)

where Ω(m) are the channel spacings to the channel of interest (COI), and
U(m)(0, t) = [ux(m)(0, t), uy(m)(0, t)]ᵀ are the complex amplitudes for both
polarizations of the WDM channels, where the COI is denoted by m = COI. The
model only considers effects that affect the COI by working with the complex
amplitudes of the individual channels U(m)(z, t). In contrast, the SSFM
works on the whole optical field A(z, t). Thereby, the model also avoids the large
upsampling rates required to represent A(z, t) without violating
the sampling theorem. The propagation of the COI within each span
where U_wo^COI(z, t) denotes the waveform without XPM distortions and U_w^COI(z, t)
denotes the waveform with XPM distortions. The consecutive linear step is
given by:
given by:
( )
f β2 2 fCOI ((n − 1)L , ω),
Uwo (nLSp , ω) = exp −i ω LSp · U
COI
w Sp (2.12)
2
where Uf(m) (z, ω) is the Fourier transform of U (m) (z, t). The linear step models
the chromatic dispersion of one span. The result of the linear step is used for
the nonlinear step of the next span, and the process is repeated until the end
of the link. The loss of the link is assumed to be ideally compensated for
by EDFAs, such that attenuation and amplification can be neglected, as the
global step always ends right after the amplifier, where the amplification noise
is also added.
The Jones matrix holds the contributions of the interfering channels oc-
curring in span n. The propagation of the interfering channels neglects nonlin-
ear effects and exclusively models chromatic dispersion including the walk-off
modeled by a linear phase shift:
Ũ(m)(z, ω) = exp( −iβ2 z ( (1/2)ω² + Ω(m)ω ) ) · Ũ(m)(0, ω).    (2.13)
W(n)(t) = e^{iϕ(n)(t)} ×
[ √(1 − |wxy(n)(t)|²) e^{i∆ϕ(n)(t)}        wyx(n)(t)                              ]
[ wxy(n)(t)                                √(1 − |wyx(n)(t)|²) e^{−i∆ϕ(n)(t)}     ],    (2.14)

where ϕ(n)(t) = (ϕx(n)(t) + ϕy(n)(t))/2 and ∆ϕ(n)(t) = (ϕx(n)(t) − ϕy(n)(t))/2 are
terms of the XPM-induced phase-noise, and wxy/yx(n)(t) are terms of the XPM-induced
polarization cross-talk.
wyx(n)(t) = Σ_{m∈M} i ux*(m)((n − 1)LSp, t) uy(m)((n − 1)LSp, t) ⊗ h(m)(t),    (2.15a)

wxy(n)(t) = Σ_{m∈M} i uy*(m)((n − 1)LSp, t) ux(m)((n − 1)LSp, t) ⊗ h(m)(t),    (2.15b)

ϕx(n)(t) = Σ_{m∈M} ( 2|ux(m)((n − 1)LSp, t)|² + |uy(m)((n − 1)LSp, t)|² ) ⊗ h(m)(t),    (2.15c)

ϕy(n)(t) = Σ_{m∈M} ( 2|uy(m)((n − 1)LSp, t)|² + |ux(m)((n − 1)LSp, t)|² ) ⊗ h(m)(t),    (2.15d)

H(m)(ω) = (8γ/9) · ( 1 − exp(−αLSp + i∆β1(m)ωLSp) ) / ( α − i∆β1(m)ω ),    (2.15e)
where ⊗ is the convolution operator, the summations skip the COI with M =
{1, ..., NCh} \ COI, and h(m)(t) is the inverse Fourier transform of the
transfer function H(m)(ω). The fiber parameters α, γ and ∆β1(m) = β1(m) − β1(COI)
represent the fiber loss, the nonlinear coefficient and the differential group
velocity, respectively. The derivation of (2.15), in particular the nonlinear
transfer function H(ω), can be found in [76].
The Jones matrix for the n-th span, (2.14), holds the contributions of the
XPM-induced phase-noise and the XPM-induced polarization cross-talk from
the other channels occurring in the n-th span. For this, the complex amplitudes
of the interfering channels are propagated up to the beginning of span n,
(2.13). Thereafter, the interfering channels follow multiplications and modulus
operations as in the Manakov equations, (2.15a-2.15d), before they are filtered
by a transfer function, (2.15e), which is derived from the linearisation with
the Volterra expansion [76]. The contributions of all interfering channels are
summed up and populate the Jones matrix, (2.14) and (2.15).
For the dispersion managed link, as in [76], the dispersion operations
in (2.12) and (2.13) become unnecessary. An additional summation over all
spans in (2.15a-2.15d) results in a single time varying Jones matrix modeling
the whole link.
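Applying the time-varying Jones matrix of (2.14) to the COI samples can be sketched as follows, assuming the phase and crosstalk terms of (2.15) have already been precomputed; the function name and array layout are illustrative.

```python
import numpy as np

def apply_xpm_jones(u, phi, dphi, wxy, wyx):
    """Apply the time-varying Jones matrix of (2.14) to the COI samples
    u = [ux, uy]: a common XPM phase, a differential phase, and
    polarization cross-talk terms, all given per time sample."""
    ux, uy = u
    m11 = np.sqrt(1 - np.abs(wxy)**2) * np.exp(1j * dphi)
    m22 = np.sqrt(1 - np.abs(wyx)**2) * np.exp(-1j * dphi)
    common = np.exp(1j * phi)
    out_x = common * (m11 * ux + wyx * uy)
    out_y = common * (wxy * ux + m22 * uy)
    return np.array([out_x, out_y])
```

With the crosstalk terms set to zero the matrix reduces to a pure phase rotation, so signal power is preserved, which is a useful consistency check.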
where γ is the nonlinear coefficient, Leff is the effective fiber length, GWDM (f )
is the PSD of the WDM signal, ρ accounts for the non-degenerate FWM effi-
ciency and χ is the phased-array factor. A frequency tone triplet is described
by {f1, f2, f1 + f2 − f}, which pumps the fourth tone at f. Descriptions of Leff,
ρ and χ are given in [26], equations (4-8).
[Figure: illustration of the GN-model: the power spectral density of a 5-channel WDM signal, a frequency tone triplet pumping NLI power at a fourth tone, the resulting NLI power spectral density, and the receiver bandpass filter around the COI.]
Only the NLI within the spectral band of the COI after the receiver filter
affects the COI. A single evaluation of GNLI(f) is required, under the assumption
that the NLI is white within the spectral band of the COI. Then, the
variance of the NLI is given by:

σ²GN = Rs · GNLI(fCOI),    (2.17)

where Rs is the baud rate of the COI and fCOI is the center frequency of the
COI.
The GN-model assumes that the NLI is fully characterized by its PSD, with
statistically independent frequency tones. Dar et al. argue that frequency tones
within a WDM channel are uncorrelated but must be statistically dependent,
since they represent the same signal. Further, the frequency tones are not
jointly Gaussian: a linear transform of a discrete random variable does
not become a Gaussian random variable, and the neighbouring intra-channel
frequency tones are a linear transformation of samples from the modulation
format, modeled by a discrete random variable. If the modulation format is
modeled by a Gaussian random variable, the GN-model has no shortcomings,
which underlines that the NLI is dependent on the modulation format.
By taking the statistical dependence among intra-channel frequency tones
into account, the NLIN-model is derived. The frequency tones of neighbouring
WDM channels are statistically independent, and contributions of different
channels are independently computed. The EGN-model [30] is an extension
of the above two models, including additional cross wavelength interactions,
which are only significant in densely packed WDM systems. For the system
investigated in this work, it is sufficient to use the NLIN-model.
The NLIN-model describes the variance of the NLI as follows [28]:
where NCh is the number of WDM channels, the summation skips the COI
with M = {1, ..., NCh} \ COI, PTx is the launch power of a single channel, with
the assumption that all WDM channels have the same launch power, X1(m)
and X2(m) are constants depending on the system, and b(m) are the symbols
of the m-th interfering channel. Expressions for X1(m) and X2(m) are found
in [27], equations (26-27); they are dependent on the number of spans NSp, the
span length LSp, the fiber loss α, the chromatic dispersion β2, the nonlinear
coefficient γ and the channel spacing Ω(m).
2.3.5 Comparison
In Fig. 2.5, the presented fiber models are compared. It shows the obtained
performance of a 5-channel WDM system with 10 spans of 100 km each,
simulated with the presented fiber channel models. All models but the SSFM
consider only inter-channel nonlinear effects. For fair comparison, at the re-
[Figure: effective SNR versus launch power per channel from −5 to 5 dBm, comparing the SSFM with single channel DBP, the simple XPM-model, the NLIN-model and the GN-model.]
Figure 2.5: Effective SNR with respect to the per-channel launch power for
the center channel of a WDM system, simulated with the four
presented fiber channel models. A 5-channel WDM system with
50 GHz channel spacing is modeled with fiber properties of an
SSMF.
ceiver of the signal propagated with the SSFM, intra-channel nonlinearity com-
pensation is applied through digital back-propagation (DBP). The SSFM, the
simple XPM-model and the NLIN-model show almost identical performance.
The GN-model overestimates the NLI.
The SSFM and the simple XPM-model yield a waveform at the output
of the channel, whereas the GN-model and the NLIN-model operate at the
symbol level. The SSFM models the transmission of an arbitrary waveform
through an SSMF. It is the most accurate model, but also requires the most computation
time. The simple XPM-model emulates the transmission of a single channel
of the WDM system over a dispersion-unmanaged link. The method limits the
calculation of nonlinear interactions to those affecting the COI and acts on the
complex amplitudes of the individual WDM signals.
The GN-model simulates the NLI as a memoryless AWGN term dependent
on the launch power per channel. The NLIN-model includes modulation-dependent
effects. The latter allows more accurate analysis of non-conventional
modulation schemes, such as probabilistically and geometrically shaped constellations,
while the GN-model allows the study under an AWGN channel assumption. In
this work, equation (2.18) is used to compute the variance of the NLI for both
models. The second term with dependence on the fourth order moment µ4
is neglected for the GN-model. The NLI variance is also dependent on intra-
channel nonlinear effects, which are not described in (2.18), but included in
later chapters. The intra-channel nonlinear effects are also dependent on the
sixth order moment µ6 of the constellation. Hence, the NLIN-model variance
is described by σ²NLIN(Ptx, µ4, µ6), whereas the GN-model variance is described
by σ²GN(Ptx).
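The moments µ4 and µ6 above are the normalized fourth and sixth order moments of the constellation. The sketch below computes them for equiprobable symbols; the normalization convention (moments of |x| normalized by powers of the mean symbol energy) is assumed from the NLIN-model literature, and the function name is illustrative.

```python
import numpy as np

def normalized_moments(constellation):
    """Normalized moments of a constellation under equiprobable symbols:
    mu4 = E|x|^4 / (E|x|^2)^2 and mu6 = E|x|^6 / (E|x|^2)^3."""
    p2 = np.mean(np.abs(constellation) ** 2)
    mu4 = np.mean(np.abs(constellation) ** 4) / p2 ** 2
    mu6 = np.mean(np.abs(constellation) ** 6) / p2 ** 3
    return mu4, mu6

# QPSK is constant modulus, so mu4 = mu6 = 1.
qpsk = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]) / np.sqrt(2)

# 16-QAM has larger moments, hence larger modulation-dependent NLI.
pts = np.array([-3, -1, 1, 3])
qam16 = np.array([a + 1j * b for a in pts for b in pts])
```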
A per-symbol description of the memoryless NLIN-model follows:

y[k] = cNLIN(x[k]) = x[k] + nASE[k] + nNLIN[k],

where x[k] and y[k] are the transmitted and received symbols at time k,
cNLIN(·) is the channel model, and nASE[k] ∼ N(0, σ²ASE) and nNLIN[k] ∼
N(0, σ²NLIN(·)) are Gaussian noise samples with variances σ²ASE and σ²NLIN(·),
respectively. For the GN-model, σ²NLIN(·) is replaced with σ²GN(·) and the channel
model only depends on the power, cGN(x[k], Ptx). We refer to σ²NLIN/GN(·)
as a function of the optical launch power and moments of the constellation,
since these parameters are optimized in Chapter 6.
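The memoryless per-symbol model lends itself to a direct Monte-Carlo sketch. Here both noise terms are drawn as circularly symmetric complex Gaussians with given variances; the function name and interface are illustrative, with the variances assumed to come from (2.7) and (2.18).

```python
import numpy as np

def nlin_channel(x, sigma2_ase, sigma2_nlin, rng=None):
    """Memoryless per-symbol channel: y[k] = x[k] + n_ASE[k] + n_NLIN[k],
    both noise terms modeled as circularly symmetric complex Gaussians."""
    rng = np.random.default_rng(rng)
    n = len(x)
    def cgauss(var):
        # Split the variance equally over real and imaginary parts
        return np.sqrt(var / 2) * (rng.normal(size=n) + 1j * rng.normal(size=n))
    return x + cgauss(sigma2_ase) + cgauss(sigma2_nlin)
```

Because the two noise terms are independent, the total per-symbol noise power is simply σ²ASE + σ²NLIN.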
Timing recovery: Samples of the received signal are periodic but shifted in
time, such that the optimal sampling position and clock must be recovered.
This is the task of timing recovery algorithms. They use a cost function, which
is minimized to determine the right sampling time, by exploiting symmetries
in the modulation format and the transitions between symbols [82–84].
Frequency offset compensation: The laser providing the carrier and the
local oscillator mix during coherent detection. However, the detection is heterodyne,
such that their frequencies are not exactly aligned, leaving the signal
with a residual frequency component. Frequency offset compensation estimates
the residual frequency and corrects for its effect [84].
The mutual information between the channel input X and output Y is

I(X; Y) = E[ log2( pY|X(y|x) / pY(y) ) ],

where x and y are the transmitted and received symbols, pY|X(y|x) is the conditional
probability density function of the channel output given the channel
input and pY(y) is the marginal probability density function of the channel
output.
The MI provides the rate achievable by the best receiver. In fiber optic
systems, the conditional probability density function of the channel output is
not known; therefore, an optimal receiver cannot be determined. Hence, for its
estimation, an auxiliary channel model assumption is required [35, 92], which
most often is assumed to be Gaussian. Due to this assumption, the resulting
estimate is always a lower bound for the MI. For a given modulation format,
the MI gives the maximum achievable rate independent of the bit-to-symbol
mapping, whereas the generalized mutual information (GMI) gives the maximum
achievable rate for a given format with a given bit-to-symbol mapping [93].
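A Monte-Carlo estimate of the MI under a Gaussian auxiliary channel can be sketched as follows. As stated above, the mismatched Gaussian assumption makes this a lower bound on the true MI. The function name and the equiprobable-input assumption are illustrative choices.

```python
import numpy as np

def mi_gaussian_aux(x, y, constellation, sigma2):
    """Monte-Carlo MI estimate in bits/symbol using a Gaussian auxiliary
    channel q(y|x) ~ CN(x, sigma2); a lower bound on the true MI by the
    mismatched-decoding argument (equiprobable symbols assumed)."""
    # q(y|c) for every received sample against every constellation point
    d2 = np.abs(y[:, None] - constellation[None, :]) ** 2
    q = np.exp(-d2 / sigma2) / (np.pi * sigma2)
    # q(y|x) for the actually transmitted symbols
    q_yx = np.exp(-np.abs(y - x) ** 2 / sigma2) / (np.pi * sigma2)
    q_y = np.mean(q, axis=1)  # marginal under a uniform input distribution
    return np.mean(np.log2(q_yx / q_y))
```

On an actual AWGN channel the auxiliary model is matched, so at high SNR the estimate approaches the entropy of the modulation format (e.g. 2 bits/symbol for QPSK).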
[Figure: complex planes of the scattering coefficients Im[b1(λ1)], Im[b2(λ1)], Im[b1(λ2)] and Im[b2(λ2)] for the X- and Y-polarizations, with the discrete eigenvalues and scattering coefficients indicated.]
Figure 2.6: Discrete eigenvalues, λ1 = j0.3 and λ2 = j0.6, and scattering
coefficients of both polarizations b1,2(λi), i = 1, 2. The scattering
coefficients associated with λ1 are chosen from a QPSK constellation
rotated by π/4, while the scattering coefficients associated
with λ2 are chosen from a QPSK constellation without rotation.
In this work, we use NFDM with information carried only by the discrete spectrum, as complex eigenvalues λi and their spectral amplitudes, also called scattering coefficients bp(λi), with p = 1, ..., P and i = 1, ..., NEig, where P is the number of polarizations and NEig the number of eigenvalues. In the time domain, complex eigenvalues and a set of scattering coefficients correspond to a high-order soliton pulse.
In Fig. 2.6, the nonlinear spectrum is illustrated with NEig = 2 discrete eigenvalues, with QPSK-modulated scattering coefficients of order M = 4 in P = 2 polarizations. In this example, there are four QPSK constellations. Choosing a single symbol from each constellation corresponds to a unique second-order soliton pulse, leading to M^(P·NEig) = 4^(2·2) = 256 different combinations (in the time domain, 256 different waveforms), enabling transmission of 8 bits per soliton pulse. A train of soliton pulses is generated, such that they do not overlap in time. During propagation of a trace using the SSFM, the scattering
constellation shapes with nonlinear tolerance have been proposed [34, 129–133], and also methods considering channels with memory [134].
CHAPTER 3
Machine Learning Methods
Around 2010, deep learning [135, 136] had its breakthrough after more than a decade of artificial intelligence (AI) winter¹, when the so-called AlexNet [137] won the ImageNet [138] Large Scale Visual Recognition Challenge. AlexNet is based on deep convolutional neural networks and improved the error rate to 15.3%, a margin of 10.8% over the runner-up. The key to its success was a combination of different elaborate techniques.
First, the implementation of the learning algorithm on a graphics processing unit (GPU) architecture increased the training speed and hence the amount of data it could be trained on. Second, non-saturating activation functions [139] improved the training process of deep architectures, and finally, a stochastic method called Dropout [140] led to reduced overfitting. Thereafter, deep neural networks also produced record-breaking results in machine translation and speech recognition [141–143], which are now part of our everyday life. However, machine learning and neural networks have been around for decades [144–152]. The idea of the perceptron, a main building block of neural networks, dates back to 1943 and 1958 [153–155]. It is inspired by the nervous activity in the human brain.
Other machine learning algorithms exist and are widely used, such as lin-
ear models, k-means clustering, Gaussian mixture models, the expectation-
maximization algorithm, Markov models, support vector machines and Gaus-
sian processes [149, 156, 157], but they have not received as much attention
as deep neural networks [158]. The rapid growth of machine learning in recent years is not only due to record-breaking results, but also to open-source software (TensorFlow and PyTorch [13, 159]), open publishing communities, open-access research, and a vast amount of online resources, such as online
1 https://en.wikipedia.org/wiki/AI_winter
courses [160], blogs of leading scientists [161–166] and even podcasts inter-
viewing leading scientists [167].
This chapter introduces the machine learning algorithms used in the later
chapters, with the focus on neural networks. Section 3.1 provides a general
overview of the different machine learning concepts. In Section 3.2, the structure and the training of neural networks are explained. Regularization and the role of the activation function are also discussed in that section. Examples of regression and classification using neural networks are studied in Sections 3.3
and 3.4. The principal component analysis (PCA) and the autoencoder (AE)
are discussed in Section 3.5 as examples of dimension reduction/feature extrac-
tion algorithms. The implications of deep learning for this work, the automatic
differentiation algorithm and momentum aided optimization are discussed in
Section 3.6.
person, will be close to each other in the feature space, given a metric such as the Euclidean distance.
Dimension reduction algorithms such as the PCA and the AE are explained
in Sections 3.5.1 and 3.5.2. In Chapter 4, principal components are extracted
from a dataset of second order moments of a memory effect in the nonlinear
fiber channel and thereafter a dimensionality reduced representation is used
for prediction. In Chapter 6, an AE is used to learn a geometric constellation
shape.
Figure 3.1: (a) A single neuron and (b) a logistic sigmoid activation func-
tion.
3.2.1 Structure
A Single Neuron
A single neuron is shown in Fig. 3.1 (a). It has two inputs (x1 and x2 ),
two weights (w1 and w2 ), a logistic sigmoid activation function σ(·) and an
output y1 . The activation function is depicted in Fig. 3.1 (b). The output is
connected to the inputs as follows:

    y1 = σ( ∑_{i=1}^{2} wi xi ),                    (3.1a)

    σ(x) = 1 / (1 + e^{−x}).                        (3.1b)
The firing strength of the output is clearly dependent on the weights, also collectively called the parameters, θ = {w^(k)_{j,i}}. Fig. 3.2 shows a schematic plot of the weight-space of (3.1a), where each point represents a single realization of the neuron. For different positions in the weight-space, the firing rule of the neuron changes. Hence, the single neuron can learn a wide range of functions.
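The forward rule of (3.1) can be sketched directly in plain Python; the weight and input values below are illustrative points in the weight-space:

```python
import math

def sigma(x):
    """Logistic sigmoid activation, eq. (3.1b)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(x, w):
    """Weighted sum of the inputs passed through the activation, eq. (3.1a)."""
    return sigma(sum(wi * xi for wi, xi in zip(w, x)))

# Different points in the weight-space realize different firing rules:
print(neuron([1.0, 1.0], [3.0, 2.0]))   # w = (3, 2): fires strongly
print(neuron([1.0, 1.0], [-2.0, 1.0]))  # w = (-2, 1): mostly off
```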
Figure 3.2: Visualization of the weight-space of a single neuron, equation (3.1a). Each point in the weight-space represents a single realization of the neuron, a 2-D logistic sigmoid function. This figure is a replication of a figure in [174].
    hi = σ( ∑_{j=1}^{2} w^(1)_{i,j} xj + w^(1)_{i,0} ),     (3.2a)

    yi = ∑_{j=1}^{5} w^(2)_{i,j} hj + w^(2)_{i,0},          (3.2b)
where the nodes hi are called hidden nodes and collectively the hidden layer. This neural network has two input nodes xi, one hidden layer with five hidden nodes hi, and an output layer of two nodes yi. The weights w^(1)_{i,j} connect the input layer with the hidden layer and the weights w^(2)_{i,j} connect the hidden layer with the output layer. The weights w^(1)_{i,0} and w^(2)_{i,0} are the weights of the biases, which determine an offset. The output layer omits the activation function and is a linear combination of the weighted hidden nodes.
The neural network can represent a much wider range of functions than a single neuron, but also includes vastly more free parameters. A neural network can represent arbitrarily complex functions [145] by increasing the number of hidden nodes, which is equivalent to increasing the number of free parameters. The number of free parameters determines its computational complexity, and with an increasing number of parameters, a neural network becomes prone
to overfitting. The weight-space picture from the single neuron, depicted in
Fig. 3.2, does not change. Also for the neural network, a point in its weight-
space, a set of fixed weights, represents one possible realization. The whole
weight-space represents all its possible realizations.
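A minimal forward pass of the 2-5-2 network of (3.2) can be sketched as follows; the randomly initialized weights stand in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative weights for the 2-5-2 network of (3.2).
W1 = rng.standard_normal((5, 2))   # hidden-layer weights w^(1)_{i,j}
b1 = rng.standard_normal(5)        # bias weights w^(1)_{i,0}
W2 = rng.standard_normal((2, 5))   # output-layer weights w^(2)_{i,j}
b2 = rng.standard_normal(2)        # bias weights w^(2)_{i,0}

def forward(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer, eq. (3.2a)
    return W2 @ h + b2                          # linear output layer, eq. (3.2b)

y = forward(np.array([0.5, -1.0]))
print(y.shape)  # (2,)
```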
Figure 3.3: A neural network with an input, hidden and output layer holding two, five and two nodes, respectively.
3.2.2 Training
For a supervised learning task, a neural network is trained with a labeled
dataset, a set of input vectors {x(n) }, n = 1, ..., N , and their corresponding
outcome {t(n) }, a set of vectors called targets. The samples of the training
set are used to adjust the weights of the neural network, such that the output
vector becomes similar to the target vector for every sample of the training
set,
    fθ(x^(n)) = y^(n) ≈ t^(n).                      (3.4)
The first step towards that goal is to define a loss function. The loss is appli-
cation specific and should yield an error if (3.4) is not true. For classification
problems, the cross-entropy loss is typically chosen (3.5a). For regression
problems, the mean squared error (MSE) is often appropriate (3.5b).
    Lcross-entropy(θ) = − (1/N) ∑_{n=1}^{N} ∑_{d=1}^{D} t^(n)_d log( y^(n)_d (θ) ),      (3.5a)

    LMSE(θ) = (1/N) ∑_{n=1}^{N} (1/(2D)) ∑_{d=1}^{D} | t^(n)_d − y^(n)_d (θ) |²,          (3.5b)
where N is the number of samples in the training set and D is the number
of output nodes. For the cross-entropy loss function, the output vector of the neural network should always represent a probability vector; this is further discussed in Appendix A. By minimizing the loss function, given the training
data, the neural network is trained. This is typically achieved by gradient de-
scent optimization. The gradient of the loss function w.r.t. the weights of the
neural network is computed. With gradient information in the weight-space,
the current set of weights is updated, such that the loss is minimized:

    θ^(l+1) = θ^(l) − η ∇θ L(θ^(l)),                 (3.6)

where η > 0 is the learning rate, l is the training step iteration and ∇θ L(·) is
the gradient of the loss function. The gradient of a neural network is efficiently
computed with the backpropagation algorithm [149, 158]. The backpropaga-
tion algorithm applies the differentiation chain rule on the neural network
w.r.t. the weights.
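For the single neuron of (3.1) with an MSE loss, the chain rule and one gradient descent update of (3.6) can be sketched numerically; the input, target and learning rate below are illustrative:

```python
import numpy as np

x = np.array([1.0, -0.5])    # illustrative input
t = 0.8                      # illustrative target
w = np.array([0.2, 0.4])     # initial weights
eta = 0.5                    # learning rate

def loss_and_grad(w):
    z = w @ x
    y = 1.0 / (1.0 + np.exp(-z))           # forward pass, eq. (3.1)
    L = 0.5 * (y - t) ** 2                 # MSE loss for one sample
    # chain rule: dL/dw = dL/dy * dy/dz * dz/dw
    grad = (y - t) * y * (1.0 - y) * x
    return L, grad

L0, g = loss_and_grad(w)
w = w - eta * g                # one gradient descent update, eq. (3.6)
L1, _ = loss_and_grad(w)
print(L0, L1)  # the loss decreases after the update
```

The backpropagation algorithm applies this chain rule layer by layer, reusing intermediate results for efficiency.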
When the whole training set is used for each update step, the process is
called batch gradient descent, since there is only one data batch. However,
the training set is often split into multiple batches, such that the loss function
averages over a subset of the training set NBatch < N . The whole training
set is still used, but the first update step is computed with the first batch
and the second update step with the second batch and so on. The first batch
is used again after all batches have been used. The number of times the neural network has been trained with all batches is referred to as the number of epochs.
This update procedure is commonly referred to as stochastic gradient descent
(SGD) [158, 175], since it updates with a smaller batch and obtains an estimate
of the gradient rather than the real gradient. Besides adding some uncertainty
to the training, SGD also eases the memory demands of the gradient descent
optimization, due to smaller batches.
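The batching schedule can be sketched as follows; the gradient computation and weight update are elided, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sizes: the training set is split into batches, one update per
# batch, and one pass over all batches is an epoch.
N, N_batch, epochs = 1000, 100, 3
data = rng.standard_normal((N, 2))

updates = 0
for epoch in range(epochs):
    perm = rng.permutation(N)            # optional reshuffling each epoch
    for start in range(0, N, N_batch):
        batch = data[perm[start:start + N_batch]]
        # ... compute the gradient on `batch` and update the weights ...
        updates += 1

print(updates)  # epochs * N / N_batch = 30 update steps
```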
After several iterations with either update procedure, the weights adjust such that, at best, the neural network learns the relationships present in the training set. However, the optimization is non-convex [176, 177] and, as a general rule, converges to a local minimum. Whether the trained neural network performs sufficiently well is determined with tests and validations.
combat overfitting and are discussed in Section 3.2.5. The loss of the validation set is computed after training is completed, for comparison of separately trained neural network architectures and different hyperparameters. Grid and
manual search are typically used for hyperparameter optimization [178, 179].
3.2.5 Regularization
Overfitting describes a situation where the learning algorithm fits the training
set too well, to a degree that the noise within the training data is reflected
in the prediction. Methods countering overfitting are called regularization.
One way of regularization is to add a term to the loss function which penalizes large weights in the neural network. The sum of the loss function and a regularization term is called the cost function [149],

    C(θ) = L(θ) + λ R(θ),
Figure 3.5: Logistic sigmoid activation function with varying input scaling.
Instead of the loss alone, the cost C(·) is minimized. The regularization term R(·) acts on the parameters of the neural network, and this particular regularization, with R(θ) = ∑_k θk², is called L2 regularization. By forcing the neural network to jointly minimize the loss and keep the weights small, it is bound to ignore noisy patterns in the training data.
Fig. 3.5 shows a logistic sigmoid function, σ(ax), for a varying parameter a.
As indicated, for a → ∞ the logistic sigmoid function (3.1b) becomes the unit
step function. Considering a neural network, which is comprised of many
logistic sigmoid functions, large weights lead to radical rather than smooth
transitions in the overall learned function. When overfitting occurs, the neural
network models the noise within the training set. To fit noise patterns, the
neural network requires large weights modeling step-like transitions. Hence,
small weight values keep the neural network from overfitting.
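A numerical sketch of the L2-regularized cost; the loss value, weights and λ below are illustrative:

```python
import numpy as np

def cost(loss, weights, lam):
    """L2-regularized cost: C(theta) = L(theta) + lambda * sum(theta**2)."""
    return loss + lam * np.sum(weights ** 2)

# Two hypothetical weight vectors achieving the same loss; the regularizer
# prefers the smaller weights, i.e. the smoother learned function.
small = np.array([0.5, -0.3])
large = np.array([5.0, -3.0])
lam = 0.01

print(cost(0.1, small, lam))   # 0.1034
print(cost(0.1, large, lam))   # 0.44
```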
3.3 Regression
This section will discuss a regression problem addressed with supervised learn-
ing. As an example, a neural network is used to fit a sine function. The dataset
consists of noisy samples of a sine function. Fig. 3.6 shows the sine function,
with 50 training data points as orange crosses and 5 testing data points de-
picted as green circles. A neural network, with one input node, one hidden
layer of 32 hidden nodes and one output node, is trained to learn from the
training data points. Fig. 3.7 (a) shows the convergence of the MSE loss func-
tion of the training and test set with respect to the training iteration. Fig. 3.7
(b), (c), (d), (e) and (f) show the learned prediction at training iterations
200, 400, 600, 800 and 1800, respectively. After 600 training iterations the
neural network starts to overfit. This is detected with the loss of the test set:
Figure 3.6: Sine function with noisy training and testing samples.
in Fig. 3.7 (a), both losses initially decrease, but around iteration 600 the loss of the test set starts increasing, whereas the loss of the training set keeps decreasing. Since both datasets are sampled from the same function and hold the same pattern, after iteration 600 the learning algorithm must be learning a pattern that is only inherent in the training dataset, that is, the noise of the training set. At iterations 200-600, in Fig. 3.7 (c) and (d), the learned functions resemble smooth, almost sine-like curves, whereas at iterations 800-1800, the learned functions start to have more radical bends, which reflect the overfitting of the learning algorithm, annotated by the arrows in Fig. 3.7 (f).
Regularization methods, discussed in Section 3.2.5, counter overfitting.
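The detection procedure described above, keeping the weights from the iteration with the lowest test loss, can be sketched with synthetic loss curves shaped like those in Fig. 3.7 (a) (the curves below are constructed for illustration, not taken from the experiment):

```python
# Synthetic curves: the training loss keeps decreasing, while the test loss
# turns upwards once overfitting sets in (here, by construction, at 600).
train_loss = [1.0 / (1 + 0.01 * i) for i in range(2000)]
test_loss = [1.0 / (1 + 0.01 * i) + max(0, i - 600) * 1e-3 for i in range(2000)]

# Early stopping: keep the weights from the iteration with the lowest test loss.
best_iter = min(range(2000), key=lambda i: test_loss[i])
print(best_iter)  # 600, where the test loss turns around
```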
3.4 Classification
This section will discuss a classification problem addressed with supervised
learning. As an example, a neural network is used to classify and detect a
quadrature phase-shift keying (QPSK) signal after transmission over an addi-
tive white Gaussian noise (AWGN) channel with a signal-to-noise ratio (SNR)
of 8 dB. For a training set, 1000 symbols are uniformly sampled from the
QPSK constellation and transmitted through the channel. The dataset con-
sists of the received observations and the corresponding label, which associates
each observation to one of the four classes of the respective QPSK symbol. The
labels are one-hot encoded vectors, as discussed in Appendix A. The same
process is repeated for obtaining the test set. Fig. 3.8 (a) shows the training
set observations, with colors/markers indicating the four distinct classes.
A neural network with two input nodes (in-phase (I)/quadrature (Q) com-
ponents), one layer of 32 hidden nodes and four output nodes (one-hot encoded
vector), is trained to learn from the training data points. Fig. 3.8 (b) shows
Figure 3.7: (a) Loss function of the training and test sets with respect to
the training iteration, and the neural network prediction after
(b) 200, (c) 400, (d) 600, (e) 800, (f) 1800 training iterations.
The arrows in (f) indicate overfitting.
Figure 3.8: (a) The training set observations and (b) the convergence of the cross-entropy loss for the training and test sets.
the convergence of the cross-entropy loss function, for both the training set
and the test set with respect to the training iteration. After training, the
neural network is able to decide on future observations with unknown labels.
Fig. 3.9 shows the decision regions learned by the neural network. For an
equiprobable QPSK modulation format and the AWGN channel, the optimal
decision boundaries are known to be the x-axis and y-axis, such that a mini-
mum distance receiver is optimal [63]. The neural network has not learned the
optimal decision boundaries, because it assigns a region in the first quadrant
(blue) to the second quadrant (green), as shown in Fig. 3.9. The problem is
that the training data samples only yield an estimate of the real probability
distribution of the AWGN channel. Hence, the neural network, only exposed
to samples, learns that some of the green/upper-left region must be in the first quadrant. Adding regularization and increasing the size of the training set will both improve the decision boundaries. However, learning the exact optimal decision boundaries with a neural network is challenging. The same problem is encountered in Chapter 5, where the learned decision boundaries of a soliton transmission are only close to optimal.
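The dataset generation and the known-optimal minimum distance decision can be sketched as follows (this is not the thesis code; the decision shown is the quadrant rule that the neural network approximates, and for QPSK it coincides with minimum distance detection):

```python
import numpy as np

rng = np.random.default_rng(4)

# QPSK symbols over an AWGN channel at 8 dB SNR, as in the example.
const = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(4)))  # unit-energy QPSK
labels = rng.integers(0, 4, 1000)
snr = 10 ** (8 / 10)
noise_std = np.sqrt(1 / (2 * snr))

rx = const[labels] + noise_std * (rng.standard_normal(1000)
                                  + 1j * rng.standard_normal(1000))

# Minimum distance decision; for QPSK this is equivalent to the quadrant
# (x-/y-axis) boundaries, the known-optimal rule.
decisions = np.argmin(np.abs(rx[:, None] - const[None, :]), axis=1)
accuracy = np.mean(decisions == labels)
print(accuracy)  # close to 1 at 8 dB SNR
```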
Figure 3.9: (a) Decision regions of the trained neural network and (b) a
zoomed version of (a).
where D̃ < D and {ỹ^(n)} is a new dataset with reduced dimension, where each sample is of dimension D̃.
For every D̃, the optimal linear reconstruction is achieved, since the principal components are orthogonal and ordered with decreasing variance. Given D̃, the average reconstruction accuracy is

    r(D̃) = ( ∑_{d̃=1}^{D̃} λd̃ ) / ( ∑_{d=1}^{D} λd ).        (3.12)
The eigenvalues are often called the spectra of the PCA, since they reflect the energy of the signal (the dataset) held by their respective principal component, similar to the Fourier transform, where the spectral amplitudes reflect the energy held by their respective sine and cosine basis functions.
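Equation (3.12) can be sketched on a synthetic dataset with a prescribed variance profile (the column scalings below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic dataset with decreasing variance per dimension.
X = rng.standard_normal((500, 5)) @ np.diag([3.0, 1.0, 0.3, 0.1, 0.05])
X -= X.mean(axis=0)                                 # subtract the mean
cov = X.T @ X / len(X)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]    # decreasing variance

def r(D_tilde):
    """Reconstruction accuracy of (3.12): kept over total eigenvalue sum."""
    return eigvals[:D_tilde].sum() / eigvals.sum()

print([round(r(d), 3) for d in range(1, 6)])  # increases towards 1.0
```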
Example
The matched filter for a pulse-shaped transmission can be estimated using the
PCA. Given is a bandlimited communication system, transmitting symbols
Figure 3.10: Trace of 5 pulses (a) before and (b) after transmission. (c) 32
aligned noisy pulses after transmission.
Figure 3.11: (a) PCA spectra. (b) Matched filter and first two principal
components. (c) Distorted BPSK symbols extracted by the
PCA.
optimal matched filter is again a root raised cosine pulse-shaping filter, which
can also be obtained by applying the PCA.
For this, the received train of pulses is aligned, as shown in Fig. 3.10 (c), and used as the dataset for the PCA. The mean of the dataset is subtracted from
the samples and the samples are put into the columns of a matrix as described
in (3.8). Thereafter, the eigendecomposition is applied for the PCA.
The PCA spectra, as shown in Fig. 3.11 (a), show one principal component holding more than 75% of the variance of the dataset. The first and second principal components are depicted in Fig. 3.11 (b) together with the root raised cosine pulse-shaping filter. The first principal component is perfectly aligned with the real matched filter. The second and all other principal components exclusively model noise from the channel. The first row of the Ỹ matrix in (3.11c) holds the coefficients of the first principal component. Here, they represent the transmitted BPSK symbols with values around ±1, see Fig. 3.11 (c). Hence, the PCA found the same dimension reduction as the combined matched filter and downsampling, and extracted the pulse shape from the received signal.
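The example can be sketched end-to-end on a synthetic dataset; a Gaussian pulse stands in for the root raised cosine shape, and all sizes and noise levels are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)

# Noisy, aligned BPSK pulses (a Gaussian pulse stands in for the RRC shape).
t = np.linspace(-1, 1, 64)
pulse = np.exp(-t ** 2 / 0.1)
symbols = rng.choice([-1.0, 1.0], 200)
X = symbols[:, None] * pulse[None, :] + 0.2 * rng.standard_normal((200, 64))

X0 = X - X.mean(axis=0)                         # subtract the dataset mean
_, _, Vt = np.linalg.svd(X0, full_matrices=False)
pc1 = Vt[0]                                     # first principal component
coeffs = X0 @ pc1                               # coefficients along pc1

# pc1 aligns with the pulse shape (up to sign), and the coefficients
# recover the BPSK symbols (up to the same global sign).
similarity = abs(pc1 @ pulse) / np.linalg.norm(pulse)
acc = np.mean(np.sign(coeffs) == symbols)
print(similarity, max(acc, 1 - acc))
```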
3.5.2 Autoencoder
The AE is a nonlinear dimension reduction algorithm [158, 180]. It is com-
posed of two neural networks, encoder and decoder, arranged in sequence, see
Fig. 3.12. The aim is to reproduce the input of the encoder x, at the output
of the decoder y. The output of the encoder h, the latent space, is of lower
dimension than the input and output of the AE. The encoder must learn a representation of the data that holds enough information for replication through the decoder.
The AE is trained similarly to a single neural network, with the difference that the input data is also used as the targets. After training, the encoder has learned a non-redundant feature extraction/dimension reduction method. Since neural networks include nonlinearities, the AE can learn nonlinear relationships, in contrast to the PCA, which only extracts linearly independent features.
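The encoder/decoder structure of Fig. 3.12 can be sketched as an untrained forward pass (a 4-2-4 sketch with illustrative weights; training would reuse the inputs as targets and minimize |y − x|²):

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative 4-2-4 autoencoder weights (untrained).
W_enc = rng.standard_normal((2, 4)); b_enc = np.zeros(2)
W_dec = rng.standard_normal((4, 2)); b_dec = np.zeros(4)

def encoder(x):
    return np.tanh(W_enc @ x + b_enc)   # latent representation h, dim 2 < 4

def decoder(h):
    return W_dec @ h + b_dec            # reconstruction y, approaches x after training

x = rng.standard_normal(4)
h = encoder(x)
y = decoder(h)
print(h.shape, y.shape)  # (2,) (4,)
```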
Figure 3.12: An AE composed of an encoder, mapping the input x to the lower-dimensional latent space h, and a decoder, reconstructing the output y ≈ x.
The rise of the ReLU function is due to the vanishing gradients problem [183–
185] caused by logistic sigmoid and hyperbolic tangent activation functions
during training in deep architectures. During the backpropagation algorithm
the gradients of the activation function for each node are passed on to the
previous layer and multiplied with the gradients there. In deep architectures
this process is repeated several times and the gradients accumulate multiplicatively. The final product is exclusively comprised of values smaller than one, since the gradients of the logistic sigmoid and hyperbolic tangent functions are strictly smaller than one. The training of the neural network becomes impractically slow due to a very small final gradient. In contrast, the ReLU function only has two gradients, zero and one, such that during backpropagation the gradients are either passed on or blocked. In this way, the ReLU function prevents vanishing gradients and facilitates training of deep neural networks. Further, the computation is cheaper, as there is no exponential function in the ReLU function.
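The argument can be checked numerically; the depth and the evaluation point below are illustrative:

```python
import numpy as np

def dsigmoid(x):
    """Derivative of the logistic sigmoid; at most 0.25 (attained at x = 0)."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Product of per-layer gradients over an illustrative 20-layer chain:
layers = 20
sigmoid_product = np.prod([dsigmoid(0.0) for _ in range(layers)])  # 0.25**20
relu_product = np.prod([1.0 for _ in range(layers)])               # stays 1

print(sigmoid_product)  # vanishingly small
print(relu_product)     # 1.0
```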
3.6.2 Momentum
Advanced optimization algorithms include a momentum term in the gradient update step of (3.6). It adds a fraction of the previous update vector to the most recent gradient, introducing a memory effect. Mathematically, it is described as follows [186]:

    v^(l) = αM v^(l−1) + η ∇θ L(θ^(l)),
    θ^(l+1) = θ^(l) − v^(l),

where v^(l) is the update vector at iteration l, η is the learning rate and αM is
a hyperparameter defining the fraction of memory to account for. Momentum
makes the training converge faster and helps to escape small local minima.
More elaborate optimization algorithms including momentum, such as
the Adam optimization algorithm [187], are already implemented in Tensor-
Flow [13] and PyTorch [159].
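The momentum update described above can be sketched on a one-dimensional quadratic loss (all values below are illustrative):

```python
# Momentum on L(theta) = theta**2: v accumulates a fraction alpha_M of the
# previous update vector on top of the scaled gradient.
eta, alpha_M = 0.1, 0.9
theta, v = 5.0, 0.0

def grad(theta):
    return 2.0 * theta          # gradient of L(theta) = theta**2

for _ in range(200):
    v = alpha_M * v + eta * grad(theta)
    theta = theta - v

print(abs(theta))  # converges close to the minimum at 0
```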
4.1 Introduction
Analytical models of communication systems assess their performance without
cumbersome numerical simulations. In particular, fully loaded wavelength-
division multiplexing (WDM) systems are almost impossible to simulate with
the split-step Fourier method (SSFM). Many mathematical models of WDM
systems have been proposed and some are discussed in Section 2.3. These
semi-analytical models predict the system performance very accurately. They
require to compute a set of non-analytical properties, which must be ob-
tained through computational expensive numerical integration. Thereafter,
the model is analytical, as long as these properties are assumed constant.
However, when the system and the physical parameters change, the assump-
tion often breaks and the properties must be recalculated. A full assessment
of the system is only possible, if the performance metric is independent of
the described properties. Since these properties are a function of the physical
layer parameters, their relationship can be learned using supervised learning.
Thereafter, the computational expensive numerical integration is skipped and
instead predicted by the learning algorithm. The supervised learning algo-
rithm is not limited to learn single parameters, but capable of learning multiple
parameters at the same time. Further, functional properties can be learned,
such as auto-correlation functions or spectra.
In this chapter, a supervised learning algorithm is trained to predict auto-
correlation functions, obtained from the fiber model described in Section 2.3.2.
They describe the auto-correlation of one single term of the summation in (2.15),
    R^(n)[k] ≈ ∑_{d=1}^{3} c̃^(n)_d p̃d[k],

where R^(n)[k] is the n-th auto-correlation function of the dataset with symbol delay k, p̃d[k] is the d-th principal component and c̃^(n)_d are the coefficients of the PCA. Each auto-correlation function is now described by three coefficients {c̃^(n)_1, c̃^(n)_2, c̃^(n)_3}.
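The compress-then-predict pipeline can be sketched on a toy dataset; the "auto-correlation functions" below are synthetic, and a linear regression stands in for the neural network:

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic ACFs that depend on one physical parameter p (shapes are made up).
k = np.arange(100)
basis = np.stack([np.exp(-k / 10), np.exp(-k / 40)])
p = rng.uniform(0, 2, 200)
acfs = np.outer(p, basis[0]) + np.outer(p ** 2, basis[1])   # dataset R^(n)[k]

# PCA compression: two coefficients describe each ACF.
mean = acfs.mean(axis=0)
_, _, Vt = np.linalg.svd(acfs - mean, full_matrices=False)
coeffs = (acfs - mean) @ Vt[:2].T

# Regression from the physical parameter to the PCA coefficients.
features = np.stack([np.ones_like(p), p, p ** 2], axis=1)
W, *_ = np.linalg.lstsq(features, coeffs, rcond=None)

# Predict the ACF for a new parameter value without recomputing it.
p_new = 1.3
acf_pred = np.array([1.0, p_new, p_new ** 2]) @ W @ Vt[:2] + mean
acf_true = p_new * basis[0] + p_new ** 2 * basis[1]
err = np.max(np.abs(acf_pred - acf_true))
print(err)  # small: two principal components capture this toy dataset
```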
Figure 4.1: The machine learning framework: a dataset of auto-correlation functions is arranged as in (3.8), compressed with the PCA, and used to train a neural network that predicts the PCA coefficients.
Figure 4.2: (a) Three principal components whose linear combination re-
cover all auto-correlation functions with an accuracy of 99.3%.
An accuracy of 88.5% is obtained by only using the first prin-
cipal component. (b) Auto-correlation function as a function of
symbol delay, obtained from the model and the machine learn-
ing framework. The latter is a linear combination of the three
principal components in (a).
Table 4.1: Range of physical layer parameters for the generated auto-correlation functions.

    Parameter                             Min.      Max.
    Interfering channel launch power P    -2 dBm    4.5 dBm
    Interfering channel spacing Ω         35 GHz    350 GHz
    Span length LSp                       40 km     120 km
    Propagated distance z′                0 km      2000 km
[Figures: the PCA coefficients c2 and c3, and the ACF peak R[0] and its HWHM, as functions of the interfering channel launch power P, the channel spacing, the span length and the propagated distance, for the model and the prediction.]
4.5 Summary
A framework is proposed to learn properties of an optical fiber channel model which are computationally expensive to estimate. With a sparse dataset, the PCA enables extracting the most important features of the auto-correlation function of an interference effect. Thereafter, the neural network is able to predict the learned properties. Together, they become a tool to analyze the effects on a system when physical layer parameters change.
CHAPTER 5
Neural Network Receiver for NFDM Systems
This chapter includes results presented in journal and conference contribu-
tions [C2, J2].
5.1 Introduction
This chapter takes a data science approach to receiving high-order solitons, derived from the nonlinear Fourier transform (NFT). The transmission concept
is known as nonlinear frequency-division multiplexing (NFDM) and was intro-
duced in Section 2.6. As discussed there, under a lossless channel assumption,
the information carried by the solitons is perfectly recovered, since a linear
transformation in the NFT domain compensates for a fixed transform of the
constellations. Taking fiber losses and amplification noise into account, the received solitons are distorted with an unknown effect on the constellations [105–111], and the NFT no longer provides exact solutions to the nonlinear Schrödinger equation (NLSE) and Manakov equations. After collecting data
by transmitting many solitons through the channel, a neural network is able
to learn decision regions in the high dimensional vector space of the solitons,
similar to the classification example in Section 3.4. For comparison, we also
apply a minimum distance receiver in time-domain.
The study of this chapter makes two points. First, the statistics of the accumulated distortions of the propagated solitons are not additive Gaussian. Second, the NFT-based receiver is outperformed by the neural network receiver.
Figure 5.1: Simulation setup.
Section 5.2 shows the simulation setup. Section 5.3 introduces the three
compared receivers. Section 5.4.1 compares their performances for the additive
white Gaussian noise (AWGN) channel. Section 5.4.2 looks at the soliton
evolution in time and frequency-domain during propagation. Section 5.4.3
compares the receiver performances for the fiber optic channel. A study of
the neural network hyperparameters is given in Section 5.4.4. Section 5.5
summarizes the findings.
    u(t, z) = ∑_k uk(t − kTs, z),

where each uk(t, z) is one of 256 soliton pulses and Ts is the symbol duration of one single pulse.
Two studies are conducted. First, the waveform is received after transmis-
sion over the AWGN channel. Second, the waveform is propagated through the
fiber optic channel using the split-step Fourier method (SSFM) with chromatic
dispersion parameter 17.5 ps/(nm km), nonlinear coefficient 1.25 1/(W km),
fiber loss 0.195 dB/km and increasing transmission distance from 0 to 5004 km.
The link is divided into NSp spans of 41.7 km with erbium-doped fiber am-
plifiers (EDFAs) compensating for the fiber loss. After coherent detection
the waveform is sliced into single pulses at R=128 samples and thereafter
independently detected by three receiver schemes.
Figure 5.2: Received discrete eigenvalues and scattering coefficients b1,2(λ1,2) of both polarizations after transmission.
where the integral becomes a summation for sampled signals. The minimum
distance receiver is trained with 100,000 training symbols to extract the references: for dual polarization, ∼391 symbols per reference, and for single polarization, ∼6250 symbols per reference.
Figure 5.3: BER with respect to the OSNR for the NFT, minimum distance and neural network receivers over the AWGN channel.
channel. The neural network receiver learns close-to-optimal decision regions. This suggests that, over the fiber optic channel, the neural network will also reach close-to-optimal performance, given enough training data and a large enough neural network size. The neural network hyperparameters are optimized for an OSNR of 0 dB, which causes an error floor for OSNR > 5 dB. The NFT receiver is very error-prone towards an AWGN source.
Figure 5.4: (a) Evolution of one soliton pulse through an ideal (lossless and noiseless) and a non-ideal realistic channel. (b) Evolution of the PSD of a train of soliton pulses for a non-ideal realistic channel with respect to the transmission distance, and the bandwidth containing 99% of its power (dashed).
Figure 5.5: BER with respect to the transmission distance for the NFT, minimum distance and neural network receivers, for (a) single and (b) dual polarization.
degradation starts after a transmission distance of 1000 km. For shorter dis-
tances, in particular around 800-900 km, the performance of the NFT receiver
degrades but improves again at 1000 km. The observed oscillating behavior of the performance curves is difficult to explain and must be examined further.
Figure 5.6: Neural network receiver BER performance for (a,d) different
numbers of hidden nodes, (b,e) numbers of samples per symbol
and (c,f) sizes of the training dataset for the single polarization
case.
eters are studied: the number of hidden nodes, the number of input nodes, and the number of training samples.
A neural network with a sufficiently large number of hidden nodes can approximate any continuous function [145]. Thus, increasing the number of hidden nodes eventually brings the performance to a limit where no more information can be extracted from the data. This limit also determines the number of hidden nodes at which the performance has converged, yielding the neural network of lowest complexity.
The sampling rate directly determines the number of input nodes of the neural network. The number of input nodes should be as low as possible to reduce complexity. At the same time, the sampling rate must not violate the sampling theorem, otherwise information is lost. In contrast to typical wavelength-division multiplexing (WDM) systems, the NFDM signal has no sharp bandlimit, Fig. 5.4, and reducing the sampling rate is therefore a trade-off between providing more information to the neural network and the complexity of its input layer.
The neural network attempts to learn the probability distribution at the channel output from the received solitons; with more data, more empirical evidence is available to estimate it accurately.
For single polarization, the performance for different hyperparameters is studied. Fig. 5.6 shows the performance as a function of the number of hidden nodes, the sampling rate, and the training data size. Additionally, the hyperparameters are optimized at two transmission lengths, 2001.6 km and 3002.4 km (48 and 72 spans). Fig. 5.6 (a) and (d) show that 16 hidden nodes are sufficient to model the data: with 8 hidden nodes the performance is degraded, and no noticeable improvement is achieved with 32 or more hidden nodes. Depending on the transmission distance, 16 to 32 samples per symbol are sufficient to capture the information provided by the signal, Fig. 5.6 (b) and (e). Fig. 5.6 (c) and (f) show that more than 50,000-75,000 data samples for training yield negligible performance improvement.
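The studied receiver structure, a sampled waveform mapped to symbol probabilities through one hidden layer, can be sketched as a forward pass. The layer sizes, activation, and initialization below are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

class SoftDecisionReceiver:
    """One-hidden-layer classifier: samples-per-symbol inputs -> class probabilities."""

    def __init__(self, samples_per_symbol=16, hidden_nodes=16, n_classes=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = 0.1 * rng.standard_normal((samples_per_symbol, hidden_nodes))
        self.b1 = np.zeros(hidden_nodes)
        self.W2 = 0.1 * rng.standard_normal((hidden_nodes, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, y):
        h = relu(y @ self.W1 + self.b1)        # hidden layer on the sampled waveform
        return softmax(h @ self.W2 + self.b2)  # probability vector per waveform

rx = SoftDecisionReceiver()
p = rx.forward(np.zeros((5, 16)))  # 5 received waveforms, 16 samples each
```

The number of hidden nodes and the samples per symbol are exactly the two complexity knobs swept in Fig. 5.6 (a,d) and (b,e).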
5.5 Summary
This study analyzed different reception methods for NFDM systems encoding information in the discrete spectrum. The neural network outperforms the standard NFT-based receiver and a minimum distance metric receiver. We show that the statistics of the accumulated distortions of the propagated solitons are not additive Gaussian, since the neural network receiver outperforms the minimum distance receiver, which is optimal under an AWGN assumption.
6.1 Introduction
The shaping gain in the additive white Gaussian noise (AWGN) channel is known to be 1.53 dB [35], whereas the shaping gain for the fiber optic channel is unknown. However, analytical fiber channel models give the insight that the nonlinear impairment occurring in the fiber channel depends on statistical properties of the constellation in use [27]. In particular, large high-order moments of the constellation lead to increased nonlinear interference (NLI). Since the amplification noise in the fiber channel is modeled as an AWGN source, shaping gain is achieved with a Gaussian-like shaped constellation. Such constellations have relatively large high-order moments. A constellation jointly robust to the constellation-dependent nonlinear effects and to amplification noise must therefore weigh the reduction of high-order moments against a Gaussian-like shape.
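This trade-off can be made concrete by computing the normalized high-order moments that enter such models, e.g. µ4 = E[|x|⁴]/E[|x|²]². The comparison below of 16-QAM against a circular complex Gaussian constellation is an illustrative calculation, not a result from this work:

```python
import numpy as np

def normalized_moment(points, order):
    """mu_k = E[|x|^k] / E[|x|^2]^(k/2) for an equiprobable constellation."""
    p2 = np.mean(np.abs(points) ** 2)
    return np.mean(np.abs(points) ** order) / p2 ** (order / 2)

# 16-QAM: levels {-3, -1, 1, 3} per quadrature; mu4 evaluates to 1.32.
levels = np.array([-3.0, -1.0, 1.0, 3.0])
qam16 = (levels[:, None] + 1j * levels[None, :]).ravel()
mu4_qam = normalized_moment(qam16, 4)

# A Gaussian-like constellation approaches mu4 = 2 (circular complex Gaussian),
# i.e. larger high-order moments and hence more NLI.
rng = np.random.default_rng(1)
gauss = (rng.standard_normal(200_000) + 1j * rng.standard_normal(200_000)) / np.sqrt(2)
mu4_gauss = normalized_moment(gauss, 4)
```

A constant-modulus format (e.g. a PSK ring) has µ4 = 1, the minimum possible, which illustrates why low-moment constellations generate less modulation-dependent NLI but forgo shaping gain.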
This chapter combines an unsupervised learning algorithm with a fiber
channel model to obtain a constellation including this implicit trade-off. By
embedding a fiber channel model within an autoencoder (AE), the encoder is
bound to learn a constellation jointly robust to both impairments, as depicted
in Fig. 6.1. When the AE is trained by wrapping the Gaussian noise (GN)-
model [23, 26], the learned constellation is optimized for an AWGN channel,
with an effective signal-to-noise ratio (SNR) determined by the launch power,
6 Geometric Constellation Shaping via End-to-end Learning
Figure 6.1: Two neural networks, encoder and decoder, wrap a channel
model and are optimized end-to-end. Given multiple instances of
four different one-hot encoded vectors, the encoder learns four
constellation symbols which are robust to the channel impair-
ments. Simultaneously, the decoder learns to classify the re-
ceived symbols such that the initial vectors are reconstructed.
and nonlinear effects are not mitigated. Trained with the nonlinear interfer-
ence noise (NLIN)-model [27, 28], the learned constellation mitigates nonlinear
effects by optimizing its high-order moments.
This chapter compares the performance in terms of mutual information
(MI) and estimated received effective SNR of the learned constellations to
standard quadrature amplitude modulation (QAM) and iterative polar mod-
ulation (IPM)-based geometrically shaped constellations. IPM-based geomet-
rically shaped constellations were introduced in [64] and thereafter applied in
optical communications [117, 118]. The iterative optimization method optimizes a geometrically shaped constellation of multiple rings to achieve shaping gain. The IPM-based geometrically shaped constellations optimized under an AWGN assumption, provided in [64, 117, 118], are used in this work. Results of
both simulations and experimental demonstrations are presented. The exper-
imental validation also revealed the challenge of matching the chosen channel
model of the training process to the experimental setup.
Section 6.2 describes the AE architecture, its training, and the joint optimization of system parameters. Section 6.3 describes the simulation results evaluating the learned constellations with the NLIN-model and the split-step Fourier method (SSFM). The experimental demonstration and results are presented in Section 6.4.

6.2 End-to-end Learning
s ∈ S = {e_i | i = 1, ..., M}
    ↓  Encoder: neural network f_θf(·), R^2N → C^N, normalization
x ∈ C^N
    ↓  Channel: c_GN/NLIN(·)
y ∈ C^N
    ↓  Decoder: neural network g_θg(·), C^N → R^2N, softmax
r ∈ {p ∈ R^M_+ | Σ_{i=1}^M p_i = 1}
where fθf (·) is the encoder, gθg (·) the decoder and cGN/NLIN (·) are the channel
models. The channel models are described in Section 2.3.5. The goal of the AE
is to reproduce the input s at the output r, through the latent variable x (and
its impaired version y). The trainable variables of the encoder and decoder
neural networks are represented by θf and θg , respectively. The parameter
vector θ = {θf , θg } includes all trainable neural network weights. While the
encoder is optimizing the location of non-equidistant constellation points, the
decoder learns the conditional probability distribution of the symbols at the
channel output.
The AE must be constructed such that the learned constellation has the desired dimension and order. The constellation dimension equals the dimension of the latent space; the constellation order equals the dimension of the input and output spaces of the AE. A constellation of order M is trained with one-hot encoded vectors, s ∈ S = {e_i | i = 1, ..., M}. The decoder is concluded with a softmax function, as described in Appendix A, such that it yields a probability vector, r ∈ {p ∈ R^M_+ | Σ_{i=1}^M p_i = 1}.
The neural networks work with real numbers, such that a constellation
of N complex dimensions is learned by choosing the output of the encoder
network and input of the decoder network to be of 2N real dimensions. An
average power constraint on the learned constellation is achieved with a nor-
malization before the channel.
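The pipeline above can be sketched as a single forward pass. This is a structural illustration with randomly initialized weights and a plain AWGN stand-in for c_GN/NLIN(·), not the trained model:

```python
import numpy as np

M, N = 16, 1                  # constellation order, number of complex dimensions
rng = np.random.default_rng(0)
W_enc = rng.standard_normal((M, 2 * N))  # encoder stand-in: one-hot -> 2N real values
W_dec = rng.standard_normal((2 * N, M))  # decoder stand-in: 2N real values -> M logits

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

s = np.eye(M)                              # all M one-hot inputs s in S
x_real = s @ W_enc                         # encoder output in R^{2N}
x = x_real[:, :N] + 1j * x_real[:, N:]     # interpreted as a point in C^N

# Average power constraint: normalization before the channel.
x = x / np.sqrt(np.mean(np.abs(x) ** 2))

# Channel stand-in: AWGN instead of the GN/NLIN model.
y = x + 0.05 * (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape))

y_real = np.concatenate([y.real, y.imag], axis=1)  # C^N -> R^{2N}
r = softmax(y_real @ W_dec)                # probability vector over the M symbols
```

Training would minimize the cross-entropy between s and r; since the encoder input is one-hot, each row of the (trained) encoder mapping directly becomes one constellation point.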
[Figure 6.3: cross-entropy loss over training iterations for training-set sizes 8M, 32M, 128M, and 8M/2048M.]
batch size and increasing it after initial convergence. This is shown in Fig. 6.3, where the green/dashed plot depicts a training run with the training batch size increased after 100 iterations. A larger training batch size provides more empirical evidence of the channel model characteristics; with better statistics, the AE can optimize more accurately.
The size of the neural networks, i.e., the number of layers and hidden nodes, depends on the order of the constellation. Other neural network hyperparameters are given in Table 6.1. Besides the training batch size, the hyperparameters have only a marginal effect on the performance of the training process.
[Constellation diagrams (in-phase vs. quadrature) with annotated MI in bit/4D-symbol: GN-model 7.05, 9.31, 6.71, 1.95 (top) and NLIN-model 7.05, 9.32, 6.88, 2.47 (bottom) at −5.5, 0.0, 4.5, and 9.5 dBm, respectively.]
Figure 6.4: Constellations learned with (top) the GN-model and (bottom)
the NLIN-model for M =64 and 2000 km (20 spans) transmission
length at per channel launch power (left to right) −5.5 dBm,
0.0 dBm, 4.5 dBm and 9.5 dBm.
6.3 Simulation
6.3.1 Setup
Simulation Parameter      Value
# symbols                 131072 = 2^17
Symbol rate               32 GBd
Oversampling              32
Channel spacing           50 GHz
# Channels                5
# Polarizations           2
Pulse shaping             root-raised-cosine
Roll-off factor           SSFM: 0.05; NLIN-model: 0 (Nyquist)
Span length               100 km
Nonlinear coefficient     1.3 (W km)^-1
Dispersion parameter      16.48 ps/(nm km)
Attenuation               0.2 dB/km
EDFA noise figure         5 dB
Stepsize^1                0.1 km
6.3.2 Results
The performance of the constellations was compared in terms of the measured effective SNR and the MI, which represents the maximum achievable information rate (AIR) under an auxiliary channel assumption [35]. The simulation results for a 2000 km transmission (20 spans) are shown in Fig. 6.5. The learned constellations outperform the QAM constellations in MI, and the IPM-based constellations in effective SNR. The QAM constellations have the largest effective SNR due to their low high-order moments, but do not provide shaping gain.
The optimal launch power for the NLIN-model learned constellations is slightly shifted towards higher power levels compared to the GN-model learned constellations and the IPM-based constellations. Fig. 6.6 (a) shows that, for higher power levels, the gain with respect to QAM constellations is larger for the NLIN-model learned constellations, and in Fig. 6.6 (b), the NLIN-model learned constellations have a negative slope in their moments. This shows that, when embedding
^1 In a fruitful discussion with Associate Professor Andrea Carena from the Polytechnic University of Turin, it was called to attention that a constant stepsize, as chosen here, can lead to artificial effects within the SSFM. This can be avoided by randomizing the stepsize at every step.
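The randomized-stepsize remedy can be sketched as below. This is a bare-bones, normalized-units SSFM for a single polarization without loss or noise, with each step drawn uniformly around the nominal stepsize; it is an illustration, not the simulation code used in this work:

```python
import numpy as np

def ssfm(a, dt, distance, beta2, gamma, dz_nominal, rng):
    """Split-step Fourier method with a randomized stepsize (normalized units)."""
    w = 2 * np.pi * np.fft.fftfreq(a.size, d=dt)
    z = 0.0
    while z < distance:
        # Randomize the stepsize to avoid artificial resonances of a fixed step grid.
        dz = min(dz_nominal * rng.uniform(0.5, 1.5), distance - z)
        # Linear (dispersion) step, applied in the frequency domain.
        a = np.fft.ifft(np.exp(0.5j * beta2 * w ** 2 * dz) * np.fft.fft(a))
        # Nonlinear (Kerr) step, applied in the time domain.
        a = a * np.exp(1j * gamma * np.abs(a) ** 2 * dz)
        z += dz
    return a

t = np.linspace(-10, 10, 256, endpoint=False)
pulse = 1 / np.cosh(t)  # sech input pulse
out = ssfm(pulse.astype(complex), dt=t[1] - t[0], distance=5.0,
           beta2=-1.0, gamma=1.0, dz_nominal=0.05,
           rng=np.random.default_rng(0))
```

Both sub-steps are phase-only operations, so without loss the pulse energy is conserved regardless of the stepsize sequence; the sign convention of the dispersion operator varies between references.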
[Figure 6.5: effective SNR (top) and MI in bit/4D-symbol (bottom) with respect to the launch power per channel in dBm, for M = 64 (left) and M = 256 (right).]
Figure 6.6: (a) Gain compared to the standard M-QAM constellations with respect to the launch power after 2000 km transmission (20 spans). (b) 4th- and 6th-order moments (µ4 and µ6) of the considered constellations with M=256 with respect to the per-channel launch power at 2000 km transmission distance. (c) Gain compared to the standard M-QAM constellations at the respective optimal launch power with respect to the number of spans, estimated with the NLIN-model.
6.4 Experimental Demonstration
[Figure 6.7: experimental setup. 50 GHz-spaced carriers split by an optical interleaver (IL) into even and odd channels, each modulated by a DP-IQ modulator driven by AWGs; EDFA, VOA, and multicore fiber (MCF) spans; BPF and coherent Rx with LO, sampled at 80 GS/s.]
the NLIN-model within the AE, the learned constellations have smaller high-
order moments and smaller NLI. At optimal launch power and transmission
distances between 2500 km and 5500 km, the improvements are marginal, as
shown in Fig. 6.6 (c). Yet, at shorter distances the learned constellations
outperformed the IPM-based constellations.
For a transmission distance of 2000 km (20 spans) at the respective optimal
launch power, the learned constellations obtained 0.02 bit/4D and 0.00 bit/4D
higher MI than IPM-based constellations evaluated with the NLIN-model for
M =64 and M =256, respectively. Evaluated with the SSFM, the learned con-
stellations obtained 0.00 bit/4D and 0.00 bit/4D improved performance in MI
compared to IPM-based constellations for M =64 and M =256, respectively.
6.4.1 Setup
Fig. 6.7 depicts the experimental setup. The 5-channel WDM system was centered at 1546.5 nm with 50 GHz channel spacing, with the carrier lines extracted from a 25 GHz comb source using one port of a 25 GHz/50 GHz optical interleaver. Four arbitrary waveform generators (AWGs) with a baud rate of 24.5 GBd were used to modulate two electrically decorrelated signals onto the odd and even channels, generated in a second 50 GHz/100 GHz optical interleaver (IL in Fig. 6.7). The fiber channel comprised 1 to
3 fiber spans with lumped amplification by EDFAs. In the absence of identical fiber spans, we chose to use 3 well-separated outer cores of a multicore fiber. The multicore fiber was 53.5 km long, and the crosstalk between cores was measured to be negligible. The launch power into each span was adjusted by variable optical attenuators (VOAs) placed after the in-line EDFAs. Scanning the fiber launch power also changes the input power to the post-span EDFAs, which may slightly impact their noise performance. At the receiving end, the center channel was optically filtered and passed on to the heterodyne coherent receiver.
It involved resampling, digital chromatic dispersion compensation, and data
aided algorithms for both the estimation of the frequency offset and for the
filter taps of the least mean squares (LMS) dual-polarization equalization.
The equalization integrated a blind phase search. The data aided algorithms
are further discussed in Section 6.6.3. The AE model was enhanced to include transmitter imperfections in the form of an additional AWGN source between the normalization and the channel. For the experiment, new constellations had to be learned with this enhanced AE model; otherwise, the learned constellations would not be suited to the experimental setup. The back-to-back SNR for a 64-QAM constellation was
measured and yielded 22.55 dB, whereas the back-to-back SNR of a 64-AE-
NLIN constellation yielded 21.87 dB. This latter back-to-back SNR was used
within the enhanced AE model to simulate the transmitter imperfections and
learn all the AE constellations for the experiments. We note that it was chal-
lenging to reliably measure some experimental features such as transmitter
imperfections and varying EDFA noise performance and accurately represent
them in the model. Hence, it is likely that some discrepancy between the
experiment and modeled link occurred, as further discussed in Section 6.6.2.
6.4.2 Results
The constellations were also evaluated experimentally, using the measured effective SNR and the MI. Fig. 6.8 shows the performances of the learned,
QAM and IPM-based constellations. In general, the learned constellations
yielded improved performances compared to the IPM-based and QAM con-
stellations. However, with more spans the discrepancy between the experi-
ment and modeled link increases. Over one span, Fig. 6.8 (left), the learned
constellations outperformed QAM and IPM-based constellations. For M =64
[Figure 6.8: effective SNR (top) and MI in bit/4D-symbol (bottom) with respect to the launch power in dBm, over one (left), two (center), and three (right) spans.]
the IPM-based constellations performed worse. Over two spans and M =64, Fig. 6.8 (center), the IPM-based constellations yielded improved performances. Over two spans and for M =256, they outperformed the GN-model learned constellations and matched the performance of the NLIN-model learned constellations. This indicates that the NLIN-model learned constellations failed to mitigate nonlinear effects due to the model discrepancy. Similar findings to the transmissions over two spans were observed over three spans, Fig. 6.8 (right).
Collectively, the learned constellations, either learned with the NLIN-model
or the GN-model, yielded improved performance compared to QAM and IPM-
based constellations at their respective optimal launch power. For M =64 the
improved performance was 0.02 bit/4D, 0.04 bit/4D and 0.03 bit/4D in MI
over 1, 2 and 3 spans transmission, respectively. For M =256 the improved
performance was 0.12 bit/4D, 0.04 bit/4D and 0.06 bit/4D in MI over 1, 2
and 3 spans transmission, respectively.
Figure 6.9: (a) Convergence of the per channel launch power starting from
different initial values, and respective (b) cross-entropy loss and
(c) MI, when jointly learning the constellation shape and launch
power. The constellation is learned using the NLIN-model for
M =256.
runs converge independently of the initial value chosen for Ptx, as shown in Fig. 6.9 (b) and (c).
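Gradient-based convergence of the launch power can be illustrated with a toy GN-model-style objective, in which the effective SNR behaves as P/(σ² + ηP³). The constants below are arbitrary placeholders, and a numerical gradient replaces the automatic differentiation used when training the AE:

```python
import numpy as np

sigma2, eta = 0.1, 0.05  # placeholder ASE and NLI coefficients (not fitted values)

def snr(p):
    """Effective SNR under a GN-model-style noise budget: linear ASE + cubic NLI."""
    return p / (sigma2 + eta * p ** 3)

def optimize_launch_power(p0, lr=0.05, iters=500, h=1e-6):
    """Gradient ascent on the effective SNR with a central-difference gradient."""
    p = p0
    for _ in range(iters):
        grad = (snr(p + h) - snr(p - h)) / (2 * h)
        p += lr * grad
    return p

# Analytic optimum of p / (sigma2 + eta p^3) is p* = (sigma2 / (2 eta))^(1/3).
p_star = (sigma2 / (2 * eta)) ** (1 / 3)
converged = [optimize_launch_power(p0) for p0 in (0.2, 1.0, 3.0)]
```

As in Fig. 6.9 (a), the runs converge to the same power regardless of the initial value, because the objective is unimodal in P.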
6.6 Discussion
6.6.1 Mitigation of Nonlinear Effects
The NLIN-model describes a relationship between the launch power, the high-
order moments of the constellation and signal degrading nonlinear effects.
Hence, there exist geometrically shaped constellations that minimize nonlinear effects by minimizing their high-order moments. In contrast, constellations robust to Gaussian noise are Gaussian-like shaped, with large high-order moments. This means that the minimization only improves the system performance if the relative contributions of the modulation-dependent nonlinear effects are significant compared to those of the amplified spontaneous emission (ASE) noise.
If the relative contributions to the overall signal degradation of both noise
sources are similar, an optimal constellation must be optimized for a trade-off
between the two. The AE model is able to capture these channel character-
istics, as shown by the higher robustness of the NLIN-model learned constel-
lations. At higher power levels these learned constellations yield a gain in
MI and SNR compared to IPM-based geometrically shaped constellations and
constellations learned with the GN-model. Further, in Fig. 6.6 (b), the high-
order moments of the more robust constellations decline with larger launch
power. This shows that nonlinear effects are mitigated. At larger transmission
distance, Fig. 6.6 (c), modulation format dependent nonlinear effects become
less dominant and therefore also the gains of the constellations learned with
the NLIN-model are reduced.
Other constellation shaping methods for the fiber optic channel constrain the solution space to square constellations, preserving required symmetries such that standard DSP algorithms remain applicable [133].
6.7 Summary
This chapter proposed a new method for geometric shaping in fiber optic communication systems using an unsupervised machine learning algorithm. With a differentiable channel model including modulation-dependent nonlinear effects, the learning algorithm yields a constellation mitigating these effects, with gains of up to 0.13 bit/4D in simulation and up to 0.12 bit/4D experimentally.
The machine learning optimization method is independent of the embedded
channel model and allows joint optimization of system parameters. The ex-
perimental results show improved performances but also highlight challenges
regarding matching the model parameters to the real system and limitations
of conventional DSP algorithms for the learned constellations.
CHAPTER 7
Conclusion
The increasing demand for global data traffic must be met with more flexible
and scalable coherent optical communication systems. The limitations of such
systems require better fiber channel models, complex optimization methods
and consideration of entirely new approaches for transmission. Further, future
optical communication systems will grow in complexity, in which autonomous
algorithms can help to explore possible solutions for increasing transmission
data rates and resource allocation. Machine learning promises to provide so-
lutions to the described future challenges. The methods presented in this
work include a machine learning aided fiber channel model to increase its
computational speed, a viable receiver for a novel transmission technique, and
a learning algorithm optimizing a modulation format. They provide initial and explorative studies of machine learning methods for coherent optical communication systems which future research can build upon.
This thesis presented a review of machine learning methods, and applied
them in novel contexts to coherent optical communication systems increasing
their capabilities. In this section, the results of this work are summarized and
an outlook on potential future research directions is presented.
7.1 Summary
The reviewed machine learning methods were applied in coherent optical communication systems as follows:
7.2 Outlook
Future directions for the proposed methods are given as follows:
are a generalization of the ones presented in this thesis and would be beneficial
to explore in future work.
The obtained constellation shapes are incompatible with standard DSP
algorithms, which were developed for modulation formats with intrinsic sym-
metries. For practical commercial use of the learned constellation shapes, new
DSP algorithms are required. These algorithms could be machine learning
aided [200], or the learning algorithm must be forced to learn constellation
shapes with intrinsic symmetries [133], making them compatible with stan-
dard DSP algorithms. Moreover, extending the optimization method to opti-
mize for generalized mutual information (GMI) rather than mutual informa-
tion (MI), will certainly yield different constellation shapes and is important
for integration with binary forward error correction (FEC) [127].
In contrast to the applied channel models, the optical fiber channel is not
memoryless due to dispersive effects. Combining temporal machine learning
methods, such as recurrent neural networks [158], with models including mem-
ory [29, 190, 191, 201], might yield larger gains.
However, the reported results show that machine learning is a viable
method for optimization within complex optical communication systems. The
study presents an initial exploration of the end-to-end learning method for
optical coherent transmission systems.
APPENDIX A
Softmax Activation and
Mutual Information
Estimation
A neural network trained for classification is concluded with a softmax layer [158],
ensuring that the output of the neural network is always a probability vector.
The softmax layer is given by:
\[
\mathrm{softmax}_i(\mathbf{x}) = \frac{e^{x_i}}{\sum_{d=1}^{D} e^{x_d}}, \quad \text{for } i = 1, \ldots, D \tag{A.1}
\]
where x_i are the output nodes of the neural network and D the number of classes. The number of neural network output nodes equals the number of classes. For training, the targets must be given as probability vectors with full certainty. Such vectors are typically called one-hot encoded vectors, holding a single one for the respective class and zeros elsewhere, {e_i | i = 1, ..., D}, where e_i are the unit vectors.
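The softmax layer and one-hot targets can be written directly; a minimal numpy sketch:

```python
import numpy as np

def softmax(x):
    """Eq. (A.1): softmax_i(x) = exp(x_i) / sum_d exp(x_d)."""
    e = np.exp(x - np.max(x))  # shift for numerical stability; does not change the result
    return e / e.sum()

def one_hot(i, D):
    """Target vector e_i: probability 1 for class i, 0 elsewhere."""
    t = np.zeros(D)
    t[i] = 1.0
    return t

logits = np.array([2.0, 1.0, 0.5, 0.1])  # example neural network output nodes
p = softmax(logits)
```

The output p is a valid probability vector, and the largest logit receives the largest probability, which is why the class decision reduces to an argmax over the softmax output.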
The classification example in Section 3.4 trains a neural network to detect observations of QPSK symbols transmitted over the additive white Gaussian noise (AWGN) channel. The symbols are equiprobable, each with probability 1/D. Since the softmax layer yields a probability distribution over the classes, its output can be used for mutual information (MI) estimation:
\[
I(X;Y) \ge \mathbb{E}\left[\log_2 \frac{p_{X|Y}(x|y)}{p_X(x)}\right], \tag{A.2a}
\]
\[
\approx \frac{1}{N} \sum_{n=1}^{N} \log_2\!\left(\frac{\sum_{i=1}^{D} \mathrm{softmax}_i(x^{(n)}) \cdot t_i^{(n)}}{1/D}\right), \tag{A.2b}
\]
where N is the number of data samples and t^(n) the target vector of the respective data sample. Since all targets are one-hot encoded, the sum in the numerator reduces to a single term.
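The estimator (A.2b) amounts to averaging log2(D · p_correct) over the dataset, where p_correct is the softmax probability assigned to the true class. A small numpy sketch, using idealized softmax outputs as stand-ins for a trained network:

```python
import numpy as np

def mi_estimate(probs, targets):
    """Eq. (A.2b): lower-bound MI estimate from softmax outputs and one-hot targets."""
    D = probs.shape[1]
    p_correct = np.sum(probs * targets, axis=1)  # one term survives per one-hot row
    return np.mean(np.log2(p_correct / (1.0 / D)))

D, N = 4, 1000
targets = np.eye(D)[np.arange(N) % D]                     # one-hot targets, D classes
perfect = mi_estimate(targets, targets)                   # ideal classifier: log2(D) bits
uniform = mi_estimate(np.full((N, D), 1.0 / D), targets)  # uninformative classifier: 0 bits
```

The two extremes bracket the estimate: a perfect classifier recovers the full log2(D) bits per symbol, while a uniform output yields zero, matching the lower-bound character of (A.2a).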