You are on page 1of 242

NOVEL POLYNOMIAL APPROXIMATION METHODS FOR GENERATING

CORRECTLY ROUNDED ELEMENTARY FUNCTIONS

By

JAY PHIL LIM

A dissertation submitted to the

School of Graduate Studies

Rutgers, The State University of New Jersey

In partial fulfillment of the requirements

For the degree of

Doctor of Philosophy

Graduate Program in Computer Science

Written under the direction of

Santosh Nagarakatte

And approved by

New Brunswick, New Jersey


October, 2021
ABSTRACT OF THE DISSERTATION

Novel Polynomial Approximation Methods for Generating Correctly Rounded


Elementary Functions

by JAY PHIL LIM

Dissertation Director:
Prof. Santosh Nagarakatte

All endeavors in science use math libraries to approximate elementary functions (e.g.
ln(x) or ex ). Unfortunately, mainstream math libraries for floating point (FP) representa-
tions do not produce correctly rounded results for all inputs. In addition, given the impor-
tance of FP performance in numerous domains, several new variants of FP and its alterna-
tives have been proposed (e.g., bfloat16, tensorfloat32, posits). These representations do
not have correctly rounded math libraries. This dissertation proposes the RL IBM approach,
a collection of novel techniques for generating polynomial approximations that produce
correctly rounded results of an elementary function f (x).
Existing approaches use polynomial approximations that approximate the real value
of f (x). The minimax approach generates polynomials that minimize the maximum ap-
proximation error compared to the real value of f (x) across all inputs. However, minimax
polynomials produce wrong results due to approximation errors and rounding errors in the
implementation. In contrast, the RL IBM approach makes a case for generating polyno-
mials that approximate the correctly rounded result of f (x) (i.e., the exact value of f (x)
computed in reals and rounded to the target representations). This approach provides more
freedom in generating efficient polynomials that produce correctly rounded results for all

ii
inputs.
Additionally, this dissertation makes the following contributions. First, we show that
the problem of generating polynomials that produce correctly rounded results can be struc-
tured as a linear programming problem. Second, the RL IBM approach accounts for nu-
merical errors that occur from range reduction by automatically adjusting the amount of
freedom available to generate polynomials. The generated polynomial with RL IBM pro-
duces the correctly rounded results for all inputs. Third, we propose a set of techniques to
develop correctly rounded elementary functions for 32-bit types. Specifically, we generate
efficient piecewise polynomials using counterexample guided polynomial generation and
bit-pattern based domain splitting. Finally, we extend the RL IBM approach to generate a
single polynomial approximation that produces the correctly rounded results for multiple
rounding modes and multiple precision configurations. To generate correctly rounded ele-
mentary functions for n-bit type, our key idea is to generate a polynomial approximation
for a (n + 2)-bit representation using the round-to-odd mode. We provide formal proof
that the resulting polynomial will produce the correctly rounded results for all standard
rounding modes and for multiple representations with k bits where k ≤ n.
Using the RL IBM approach, we have developed several implementations of elementary
functions for various representations and rounding modes. These elementary functions
produce the correctly rounded results for all inputs in the target representations. Addition-
ally, the functions are faster than mainstream math libraries which have been optimized for
decades. Our RL IBM -A LL prototype, a collection of elementary functions that produce
the correctly rounded results for multiple floating point representations with all standard
rounding modes, has 1.1×, 1.3×, and 1.9× speedup over glibc, Intel, and CR-LIBM math
libraries, respectively. Our prototypes also provide the first correctly rounded elementary
functions for 32-bit posit.

iii
ACKNOWLEDGMENTS

First and foremost, this dissertation would have not been possible without my advisor
Santosh Nagarakatte, who has been my mentor, collaborator, and role model. His support
and guidance has made a tremendous impact in my journey to become a researcher. Santosh
has spent countless number of hours with me reading and discussing papers to teach me
how to think rigorously and critically. Our time together brainstorming ideas have helped
me develop my problem solving skills: simpler is better! He taught me to be compassionate
and caring for students. In times of trouble, Santosh has supported me financially and
psychologically to ensure that I can focus on growing as a researcher. Without Santosh’s
support, I would not have been able to begin, let alone complete, this dissertation.
I would like to thank my dissertation committee members, Ulrich Kremer, Richard
Martin, and Zachary Tatlock. Their insight and feedback have been extremely helpful in
improving this dissertation.
Many insights and ideas in this dissertation would have not been realized without my
collaborators, John Gustafson and Mridul Aanjaneya. Our discussions with John Gustafson
regarding the Minefield technique and the round-to-odd rounding mode have been crucial
in developing the key insights of the RL IBM approach. The realization from Mridul Aan-
janeya that generating polynomials is a linear programming problems have been instrumen-
tal in automating the RL IBM approach. Although not direct collaborators, I have learned
tremendously from the members of the FPBench community on topics related to floating
point. My gratitude goes especially towards Zachary Tatlock, Pavel Panchekha, and Bill
Zorn for organizing and leading the monthly meetings and yearly workshops. I would also
like to extend my thanks to my mentor David Tarditi during my internship in MSR Red-
mond, who has been instrumental in my development in the foundation of programming
languages.
I have been shaped by many great people in Rutgers University throughout my PhD

iv
career. I thank Vinod Ganapathy for noticing my potential and mentoring me in the early
years, along with Santosh. I learned the importance of simplifying complex ideas through
my interactions with Ulrich Kremer and Badri Nath. It was always fun to talk to Richard
Martin and learn about the history of computer science.
I would like to thank my lab-mates, David Menendez, Adarsh Yoga, Nader Boushehri-
inejad, Mohammedreza Solyaniyeh, Sangeeta Chowdharay, Harishankar Vishwanathan,
and Matan Shachnai, at the RAPL group for their support. I would like to give special
thanks to Adarsh who has been like a big brother since the early years of my PhD as well
as Sangeeta and Matan, my collaborators in different projects. I also met great peers out-
side of RAPL group including Georgiana, Daehan, Hai, Daeyoung, Liu, Hong-yu, Jaewoo,
Sunghyun, and David. In stressful times of PhD, they brought fun and gave me motivation
to continue.
I am deeply thankful for my family, my wife, daughter, parents, and parents-in-law for
their moral support throughout the years. My wife Inyoung has always been my greatest
rival and supporter who has constantly pushed me to be at my best. My daughter Sammie
has been the light of my life which motivates me to continue. I thank my parents, Tae-Seop
and Myung, for the educational and emotional support they provided throughout my life.
I would also like to thank my friends, Dexter, Carolyn, Shinjae, Daehoon, Jisoo, Bong,
Grace, James, Inah, Nick, Joonsang, Paul, Kimun, Kay, Hongjin, and Ben for all the fun
activities that provided refreshing breaks from work.

v
To the loves of my life, Inyoung and Sammie

vi
TABLE OF CONTENTS

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Dissertation Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.1 The RL IBM Approach for Correctly Rounded Polynomial Approx-


imations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.2.2 The RL IBM Approach With Range Reduction . . . . . . . . . . . . 9

1.2.3 Scaling RL IBM To 32-bit Representations . . . . . . . . . . . . . . 10

1.2.4 A Single Approximation for Multiple Representations and Round-


ing Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.2.5 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.3 Contributions to This Dissertation . . . . . . . . . . . . . . . . . . . . . . 14

1.4 Organization of This Dissertation . . . . . . . . . . . . . . . . . . . . . . . 15

Chapter 2: Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

vii
2.1 The Floating Point Representation . . . . . . . . . . . . . . . . . . . . . . 16

2.1.1 Interpreting a Floating Point Bit-String . . . . . . . . . . . . . . . . 17

2.1.2 Rounding a Real Number to the FP Representation . . . . . . . . . 19

2.1.3 A Systematic Methodology for Rounding To the FP Representation 22

2.1.4 Different FP Configurations and Fixed-Precision Variants . . . . . . 29

2.2 The Posit Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.2.1 Decoding a Posit Bit-String . . . . . . . . . . . . . . . . . . . . . . 33

2.2.2 Interpreting a Posit Bit-String . . . . . . . . . . . . . . . . . . . . 33

2.2.3 Rounding a Real Number to the Posit Representation . . . . . . . . 36

2.2.4 A Systematic Methodology for Rounding To Posit Representation . 40

2.2.5 Posit Configurations and Posit Variants . . . . . . . . . . . . . . . 43

2.3 Numerical Errors in Finite Precision Representation . . . . . . . . . . . . . 43

2.3.1 Rounding Error with Extremal Values. . . . . . . . . . . . . . . . . 44

2.3.2 Double Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.3.3 Cancellation Error . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.4 Prior Work on Approximating Elementary Functions . . . . . . . . . . . . 46

2.4.1 Approximating AR (x) . . . . . . . . . . . . . . . . . . . . . . . . 46

2.4.2 Implementation in Finite Precision . . . . . . . . . . . . . . . . . . 52

2.5 Challenges in Generating Correctly Rounded Functions by Approximating


f (x) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

Chapter 3: The RL IBM Approach For Correctly Rounded Polynomial Approx-


imations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.1 Approximating The Correctly Rounded Result . . . . . . . . . . . . . . . . 56

viii
3.2 Illustration Of Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.2.1 Computing the Correctly Rounded Result. . . . . . . . . . . . . . . 61

3.2.2 Identifying the Rounding Interval . . . . . . . . . . . . . . . . . . 61

3.2.3 Generating a Polynomial Approximation . . . . . . . . . . . . . . . 62

3.3 The RL IBM Approach To Generate Correctly Rounded Polynomial Ap-


proximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.3.1 High-Level Overview of The RL IBM Approach . . . . . . . . . . . 64

3.3.2 Computing the Rounding Intervals . . . . . . . . . . . . . . . . . . 66

3.3.3 Generating the Polynomial With an LP Formulation . . . . . . . . . 68

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Chapter 4: The RL IBM Approach With Range Reduction . . . . . . . . . . . . . 73

4.1 Generating Polynomial Approximations With Range Reduction . . . . . . . 73

4.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2.1 Range Reduction for ln(x) . . . . . . . . . . . . . . . . . . . . . . 76

4.2.2 Identifying Correctly Rounded Results and Rounding Intervals . . . 77

4.2.3 Computing the Reduced Input and the Reduced Interval . . . . . . . 79

4.2.4 Combining the Reduced Intervals . . . . . . . . . . . . . . . . . . 80

4.2.5 Generating a Polynomial Approximation . . . . . . . . . . . . . . . 82

4.3 Range Reduction Strategies for Various Elementary Functions . . . . . . . 83

4.3.1 Logarithm functions (loga (x)) . . . . . . . . . . . . . . . . . . . . 83

4.3.2 Exponential functions (ax ) . . . . . . . . . . . . . . . . . . . . . . 87

4.3.3 Hyperbolic Sine Function (sinh(x)) . . . . . . . . . . . . . . . . . 89

4.3.4 Hyperbolic Cosine Function (cosh(x)) . . . . . . . . . . . . . . . . 93

ix
4.3.5 Trigonometric Sinpi Function (sinpi(x)) . . . . . . . . . . . . . . . 95

4.3.6 Trigonometric Cospi Function (cospi(x)) . . . . . . . . . . . . . . 97

4.4 Our Approach For Generating Polynomials With Range Reduction . . . . . 99

4.4.1 High-level Overview of the RL IBM Approach For Univariate Out-


put Compensation Functions . . . . . . . . . . . . . . . . . . . . . 102

4.4.2 Identifying Reduced Inputs and Reduced Intervals . . . . . . . . . . 103

4.4.3 Combining the Reduced Constraints . . . . . . . . . . . . . . . . . 106

4.5 The RL IBM Approach For Multivariate Output Compensation Functions . . 107

4.5.1 High-level Overview of the RL IBM Approach For Multivariate Out-


put Compensation Function . . . . . . . . . . . . . . . . . . . . . . 110

4.5.2 Identifying Reduced Inputs and Intervals For Each Polynomial . . . 111

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

Chapter 5: The RL IBM Approach for 32-Bit Representations . . . . . . . . . . . 115

5.1 Scaling Our Approach to 32-Bit Representations . . . . . . . . . . . . . . . 115

5.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

5.2.1 Domain Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.2.2 Polynomial Generation for Each Sub-Domain. . . . . . . . . . . . 121

5.3 Our Approach to Generate Piecewise Polynomials . . . . . . . . . . . . . . 121

5.3.1 Domain Splitting for Efficient Piecewise Polynomials . . . . . . . . 124

5.3.2 Counterexample Guided Polynomial Generation . . . . . . . . . . . 127

5.3.3 Storing the Coefficients of Piecewise Polynomials . . . . . . . . . . 129

5.3.4 Implementing the Elementary Function . . . . . . . . . . . . . . . 130

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

x
Chapter 6: A Single Polynomial that Produces Correct Results For Multiple
Representations and Rounding Modes . . . . . . . . . . . . . . . . . 132

6.1 Case For Generic Math Libraries . . . . . . . . . . . . . . . . . . . . . . . 132

6.2 Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

6.2.1 A Strawman Approach . . . . . . . . . . . . . . . . . . . . . . . . 136

6.2.2 Generating A Polynomial Approximation With the rno Mode. . . . 138

6.2.3 Generating Generic Polynomial Approximations . . . . . . . . . . 139

6.3 The RL IBM Approach to Generate Generic Polynomials . . . . . . . . . . 142

6.3.1 The Round to Odd (rno) Rounding Mode . . . . . . . . . . . . . . 143

6.3.2 Why Does The rno Result Avoid Double Rounding Error? . . . . . 145

6.3.3 Generating Polynomials for Tn+2 with the rno Mode . . . . . . . . 147

6.3.4 Computing the rno result of f (x) in Tn+2 . . . . . . . . . . . . . . 149

6.3.5 Computing the Odd Intervals . . . . . . . . . . . . . . . . . . . . . 150

6.3.6 Implementing the Elementary Function . . . . . . . . . . . . . . . 151

6.3.7 Efficiently Identifying Inputs With Singleton Generic Interval . . . 152

6.4 Proof of Tn+2 Result with rno Producing Correctly Rounded Result for Tk . 158

6.5 Odd Intervals for Extremal Values in Posit Representations . . . . . . . . . 163

6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

Chapter 7: Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 166

7.1 Experimental Methodology And Setup . . . . . . . . . . . . . . . . . . . . 166

7.2 Experimental Evaluation of RL IBM -16 . . . . . . . . . . . . . . . . . . . 169

7.2.1 Correctness Evaluation of RL IBM -16 . . . . . . . . . . . . . . . . 169

7.2.2 Performance Evaluation of RL IBM -16 . . . . . . . . . . . . . . . . 171

xi
7.3 Experimental Evaluation with RL IBM -32 . . . . . . . . . . . . . . . . . . 174

7.3.1 Correctness Evaluation of RL IBM -32 . . . . . . . . . . . . . . . . 176

7.3.2 Performance Evaluation of RL IBM -32 . . . . . . . . . . . . . . . . 178

7.4 Experimental Evaluation With RL IBM -A LL . . . . . . . . . . . . . . . . . 181

7.4.1 Correctness Evaluation of RL IBM -A LL . . . . . . . . . . . . . . . 183

7.4.2 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . 189

Chapter 8: Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

8.1 Approximation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

8.2 Correctly Rounded Approximation . . . . . . . . . . . . . . . . . . . . . . 198

8.3 Range Reduction Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . 202

8.4 Verification of Math Libraries . . . . . . . . . . . . . . . . . . . . . . . . . 206

8.5 Math Library Repair Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 207

Chapter 9: Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . 209

9.1 Dissertation Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210

xii
LIST OF TABLES

7.1 Details on the elementary functions in RL IBM -16. . . . . . . . . . . . . . . . . 170

7.2 Generation of correctly rounded results in bfloat16 with rne mode using RL IBM -
16 and various math libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . 171

7.3 Details on the elementary functions in RL IBM -32. . . . . . . . . . . . . . . . . 175

7.4 Generation of correctly rounded results in 32-bit float with the rne mode using
RL IBM -32 and various math libraries. . . . . . . . . . . . . . . . . . . . . . . 176

7.5 Generation of correctly rounded results in posit32 using RL IBM -32 and various
math libraries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

7.6 Detalis on the elementary functions in RL IBM -A LL. . . . . . . . . . . . . . . . 182

7.7 Report on the ability of RL IBM -A LL’s functions to produce correctly rounded rno
results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

7.8 Generation of correctly rounded results for 32-bit float with the standard IEEE-754
rounding modes using RL IBM -A LL and various math libraries. . . . . . . . . . 185

7.9 Generation of correctly rounded results for TensorFloat32 with the standard IEEE-
754 rounding modes using RL IBM -A LL and various math libraries. . . . . . . . 187

7.10 Generation of correctly rounded results for bfloat16 with the five standard IEEE-
754 rounding modes using RL IBM -A LL and various math libraries. . . . . . . . . 188

xiii
LIST OF FIGURES

1.1 Illustration of the intuition behind the RL IBM approach, which makes a case for
approximating the correctly rounded result of f (x). . . . . . . . . . . . . . . . 7

1.2 The RL IBM’s approach to generate a polynomial approximation. . . . . . . 7

1.3 The RL IBM’s approach to generate a polynomial approximation used with range
reduction strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4 The RL IBM’s approach to generate a polynomial approximation for 32-bit repre-
sentations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.5 The RL IBM’s approach to generate polynomial approximations for multiple rep-
resentations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.1 Bit-pattern of various standard IEEE-754 FP representations. . . . . . . . . . . . 17

2.2 Examples of bit-patterns for different types of floating point values. . . . . . 18

2.3 Illustration of the standard floating point rounding modes. . . . . . . . . . . . . 20

2.4 The bit-pattern used to round a real value to a floating point representation. . . . . 23

2.5 Illustration of the range of real values represented by different rounding components. 24

2.6 Rounding decision for the rne mode using the rounding components. . . . . . . . 26

2.7 Rounding decision for various rounding modes based on the rounding components. 27

2.8 Bit-patterns of variants of floating point representations. . . . . . . . . . . . . . 29

2.9 Bit-patterns of MSFP12 and Flex16+5 representations. . . . . . . . . . . . . . 30

2.10 Bit-pattern of an n-bit posit representation. . . . . . . . . . . . . . . . . . . . 32

xiv
2.11 Bit-patterns of various values in a h9, 2i posit representation. . . . . . . . . . . . 34

2.12 Illustration for identifying the midpoint between two adjacent posit values. . . . . 37

2.13 Illustration of posit’s arithmetic and geometric rounding behavior. . . . . . . . . 38

2.14 Special rounding behaviors in the posit rounding mode. . . . . . . . . . . . . . 39

2.15 Rounding decisions for values near 0 with the posit rounding mode. . . . . . . . 41

2.16 Rounding decision for values outside of the dynamic range with the posit rounding
mode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2.17 Rounding decision for non-special-case values with the posit rounding mode. . . . 42

2.18 An illustration of the double rounding error. . . . . . . . . . . . . . . . . . . . 45

2.19 An illustration depicting the challenges in producing the correctly rounded result
of f (x). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.1 Different amount of freedom available in generating polynomials when approxi-


mating the real value ofo f (x) or the correctly rounded result itself. . . . . . . . 57

3.2 The RL IBM’s approach to generate a polynomial approximation of ln(x) for FP5. 60

3.3 A real number line showing the rounding intervals of various FP5 values. . . . . 61

3.4 The polynomial approximation of ln(x) for FP5 generated using the RL IBM’s
approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.1 The RL IBM’s approach to generate a polynomial approximation of ln(x) for FP5
that produces the correctly rounded results for all inputs when used with range
reduction strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.2 The list of reduced inputs and reduced intervals that define the constraints on the
output of the polynomial to be generated. . . . . . . . . . . . . . . . . . . . . 81

4.3 The polynomial approximation of ln(x) for FP5 that produces the correctly rounded
results for all inputs when used with the given range reduction strategy. . . . . . . 82

5.1 Bit-pattern of the 6-bit FP (FP6) representation. . . . . . . . . . . . . . . . . . 117

xv
5.2 The list of inputs and rounding intervals for generating a polynomial approxima-
tion of ln(x) for FP6 using the RL IBM’s approach. . . . . . . . . . . . . . . . 118

5.3 The procedure to split the input domain into two sub-domains for ln(x) and FP6
using the RL IBM’s approach. . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.4 Illustration of the counterexample guided polynomial generation used for the first
sub-domain of ln(x) for FP6. . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5.5 Illustration of the counterexample guided polynomial generation used for the sec-
ond sub-domain of ln(x) for FP6. . . . . . . . . . . . . . . . . . . . . . . . . 122

5.6 The piecewise polynomial that produces the correctly rounded result of ln(x) for
all inputs in FP6. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.7 Details on how the RL IBM’s approach splits the input domain into smaller sub-
domains. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

6.1 Bit-patterns of the FP4, FP5, and FP7 representations with two exponent bits. . . 135

6.2 The correctly rounded results and its rounding intervals of ln(1.5) in the FP4 and
FP5 representations with the standard floating point rounding modes. . . . . . . . 137

6.3 The correctly rounded results and its rounding intervals of ln(1.0) in the FP4 and
FP5 representations with the standard floating point rounding modes. . . . . . . . 139

6.4 The odd interval for the correctly rounded result of ln(x) in FP7 with the rno
mode, for each input in FP5. . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.5 The polynomial generated using the odd intervals that produces the correctly rounded
results of ln(x) for both FP4 and FP5 representations and all standard floating
point rounding modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

6.6 Illustration of the rno mode. . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6.7 An illustration showing that the rno mode maintains sufficient information to pro-
duce the correctly rounded result of real values. . . . . . . . . . . . . . . . . . 146

6.8 A pictorial presentation of Lemma 2 and Lemma 3. . . . . . . . . . . . . . . . 159

6.9 A pictorial presentation of the proof of Theorem 6.1. . . . . . . . . . . . . . . . 161

6.10 The odd interval computation for the posit representations to account for the spe-
cial posit rounding behavior when the real value is outside of the dynamic range. . 164

xvi
7.1 Speedup of RL IBM -16’s bfloat16 functions compared to glibc and Intel math li-
brary functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

7.2 Speedup of RL IBM -16’s posit16 functions compared to SoftPosit-Math library. . 172

7.3 An illustration showing the amount of freedom provided by the RL IBM’s approach
to generate polynomials that produce correctly rounded results of 10x for all in-
puts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

7.4 Speedup of RL IBM -32’s float functions compared to glibc and Intel math library
functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.5 Speedup of RL IBM -32’s posit32 functions compared to glibc and Intel’s math
library functions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

7.6 The performance impact between the size of the piecewise polynomial compared
to the degree of the polynomial. . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.7 Speedup of RL IBM -A LL functions compared to various math libraries when pro-
ducing 32-bit float results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

7.8 Speedup of RL IBM -A LL functions compared to various math libraries when pro-
ducing bfloat16 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.9 Speedup of RL IBM -A LL functions compared to various math libraries when pro-
ducing posit32 results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

xvii
1

CHAPTER 1
INTRODUCTION

Every programming language has primitive data types to approximate real numbers. The
two important attributes for such datatypes are the ability to represent a wide range of
values (dynamic range) and represent a particular real number precisely (precision). The
floating point (FP) representation, standardized by the IEEE-754 standard in 1985 [27], is
a widely used primitive datatype to represent real numbers for its large dynamic range and
reasonable precision. For example, every number in JavaScript used to be represented with
double (a 64-bit FP representation), until in 2020 when JavaScript introduced the BigInt
datatype to represent large integer values [68].
In recent years, the performance of FP arithmetic operations is becoming important,
especially in machine learning and high performance computing domains. Hence, modern
accelerators, processors, and systems have explored new variants of FP representations
that tweak the bit-length, dynamic range, and precision [44, 55, 58, 74, 78, 100, 108, 112,
116, 130, 131]. Some of these variants are precision configurations of the standard FP
representation (i.e. bfloat16 [131] and TensorFloat32 [108]). Several new representations
use the same strategy as FP to encode the values but store the bit-pattern in different ways
(i.e., MSFP12 [116] and FlexPoint [78]). There are also completely new representations
(i.e., log number systems [44, 112, 130] or posit [55, 58]). In particular, posits can provide a
wider dynamic range than FP for a given number of bits. Additionally, posits provide more
precision when representing values near 1, known as tapered precision. The new variants
of FP also aim towards minimizing the bit-length for efficient hardware implementation
of primitive operations [66, 69, 70, 107, 136, 142] while maintaining sufficient amount of
dynamic range and precision.
Any number system that approximates real numbers needs a math library that provides
2

implementations of elementary functions [103] (e.g., sin(x), ln(x), ex ). Elementary func-


tions are widely used in scientific domains ranging from machine learning to scientific
simulation. For example, the exponential function ex is used widely to model the sigmoid
x
function (i.e., exe+1 ) and the fourier transformation in digital signal processing. As ele-
mentary functions are essential building blocks for scientific applications, efficiently and
accurately computing the result of an elementary function f (x) is paramount. The IEEE-
754 standard [27] defines a list of elementary functions and recommends math libraries
implement them to produce the correctly rounded result.
Given a representation T with finite precision (e.g., float), the correctly rounded result
of an elementary function f (x) for an input x ∈ T is defined as the value of f (x) computed
with real numbers and rounded to a value in the representation T. Developing correctly
rounded math libraries is a challenging problem. Hence, there have been seminal work
on generating approximations of elementary functions [4, 11, 17, 20, 21, 22, 71, 80, 99],
verifying the correctness of math libraries [8, 35, 38, 39, 59, 60, 62, 83, 119], and repairing
math libraries [145]. Further, there are correctly rounded math libraries for the float and
double types [29, 65, 101, 146] for some rounding modes. Widely used math libraries [51,
67] do not produce correctly rounded results for all inputs.
The new FP representations currently do not have math libraries. A quick and naive
way to address this problem is to repurpose an existing math library. For example, to ap-
proximate ex for bfloat16, we can use an implementation of ex designed for 32-bit float.
We can convert a bfloat16 input to a float value, use an existing implementation of ex to
produce the float result, and round the output back to bfloat16. This approach is appealing
if a correctly rounded math library for a representation with significantly higher precision
compared to the target representation is available. However, it can produce wrong results
due to the double rounding error. A Double rounding error can occur when one rounds a
real value to an intermediate representation and subsequently rounds the intermediate re-
sult to the target representation, known as double rounding. Depending on the rounding
3

modes used in each rounding operation, double rounding may result in a different value
compared to directly rounding the real value to the target representation, which is called
the double rounding error. As a correctly rounded math library effectively rounds the real
value of f (x) to its target representation, rounding this result a smaller representation can
cause double rounding errors and produce wrong results. Chapter 2 provides more details
on the double rounding error and why repurposing an existing library fails to produce the
correctly rounded result. The repurposed math library is intended for a larger representa-
tion and computes the result with significantly more precision than necessary. Hence, its
performance may not be ideal.
The most common method to approximate an elementary function f (x) is with a poly-
nomial approximation. Prior approaches generate polynomials that minimizes the maxi-
mum error among all inputs compared to the real value of f (x), which is known as the
minimax approach. This approach is based on the Weierstrass approximation theorem [144]
which states that if f (x) is a continuous real function over a domain [a, b], then there exists
a polynomial P (x) where the error with respect to f (x) is bounded, i.e., |f (x) − P (x)| < 
for all real numbers x ∈ [a, b] where  > 0. The Chebyshev alternating theorem [137] pro-
vides the condition for such a polynomial. A minimax polynomial of degree d has exactly
d + 2 inputs with the absolute maximum error |f (x) − P (x)| and the error for each of these
inputs alternate in sign. Finally, Remez algorithm [114] provides the procedure to identify
the minimax polynomial of degree d with real numbers.
Polynomial approximations are efficient and accurate when approximating f (x) in a
small domain [a, b]. To approximate elementary functions for the entire input domain in
the target representation T, polynomials are often used with range reduction that reduces
the input domain to a smaller domain. A typical range reduction strategy has three com-
ponents. First, the range reduction function reduces the original input x ∈ T to a reduced
input x0 in a small domain [a, b]. The polynomial then approximates the elementary func-
tion f (x0 ) using the reduced input x0 . Finally, the result of the polynomial approximation,
4

which approximates for the reduced input, is adjusted with the output compensation func-
tion to produce the result for the original input x. Some strategies (e.g., Cody and Waite’s
range reduction for logarithm functions [25]) transforms the original f (x) into a differ-
ent function g(x) where the polynomial approximation is more efficient (i.e., even or odd
polynomial). The most commonly used strategy in mainstream math libraries uses a com-
bination of mathematical identities and look-up tables to efficiently evaluate the output
compensation function [133, 134, 135]. To produce accurate results for T, the range re-
duction, polynomial approximation, and output compensation functions are evaluated in a
representation H that has higher precision compared to T.
There are several challenges in generating efficient polynomial approximations that
produce correctly rounded results for all inputs using the minimax approach. These chal-
lenges arise due to the fundamental difference between the goal of correctly rounded math
libraries and the minimax approach. Correctly rounded math libraries must produce the
correctly rounded result of f (x). In comparison, the goal of the minimax approach is to
approximate the real value of f (x) as closely as possible. Although the correctly rounded
result of f (x) is computed by rounding the real value of f (x) to the target representation,
rounding a value close to (but not exactly equal to) f (x) does not guarantee to produce the
same result. Hence, the error bound of the polynomial P (x) generated with the minimax
approach compared to f (x) may need to be arbitrarily small. Consider an input xi where
the exact result of f (xi ) with real numbers is close to the rounding boundary of two values
in T (i.e., f (xi ) rounds to a value v1 but f (xi ) +  rounds to a different value v2 for an
input xi and a small value ). To produce the correctly rounded result of f (xi ), the error
of the polynomial approximation P (xi ) must be smaller than . This likely requires a large
degree polynomial with many terms, resulting in less than ideal performance. Also, there
is no known general method to analyze and predict the error bound of P (x) that guarantees
to produce the correctly rounded result for all inputs in T. This problem is widely known
as table-maker’s dilemma [75]. The table maker’s dilemma states that there is no general
5

method to predict how accurate an approximation of f (x) must be to produce the correctly
rounded result for an arbitrary input. The only way to determine the error bound is by
iteratively computing the real value of f (x) for each input x [85].
Additionally, there is a disparity between the settings assumed by the minimax ap-
proach and correctly rounded math libraries. The error bound guaranteed by the Remez
algorithm only holds if the polynomial is evaluated with real numbers. However, math li-
brary functions are evaluated with finite precision representations, which causes numerical
errors when evaluating the polynomial, range reduction, and output compensation func-
tions. To produce correctly rounded results for all inputs, this necessitates the polynomial
approximation to account for numerical errors and the internal computations may need to
be evaluated in a significantly high precision representation. For example, several parts of
the elementary functions in CR-LIBM [30], a correctly rounded math library for the 64-bit
double type (with 53 precision bits), are implemented with the triple-double representation
where each intermediate value is represented with three double values (a total of 159 pre-
cision bits). In addition, CR-LIBM formally proves that the polynomial they generate will
produce correctly rounded results for all inputs even with numerical errors.

1.1 Dissertation Statement

To generate efficient and correctly rounded polynomial approximations of an elementary


function f (x), this dissertation makes a case for generating polynomials that approximate
the correctly rounded result of f (x). In comparison, mainstream math libraries generate
polynomials that approximate the real value of f (x) using the minimax approach. Approx-
imating the correctly rounded results provide more freedom in generating polynomials that
produce correctly rounded results for all inputs, resulting in efficient polynomials.
6

1.2 Contributions of the Dissertation

This dissertation proposes the RL IBM approach, a set of techniques to generate efficient
polynomial approximations that produce correctly rounded results of an elementary func-
tion f (x) for various representations. The contributions of this dissertation can be summa-
rized as follows:

1. We make a case for approximating the correctly rounded result of f (x) rather than
the real value of f (x). We demonstrate that the task of generating polynomials can
be framed as a linear programming (LP) problem. Our approach accounts for the
numerical errors when evaluating range reduction and output compensation func-
tions. Our approach guarantees that the resulting polynomials produce the correctly
rounded results for all inputs.

2. To scale our approach to 32-bit types, we present counterexample guided polynomial


generating with sampling. Additionally, we generate efficient piecewise polynomials
by splitting the input domain using few bits in the bit-pattern of the input.

3. We propose a novel approach to generate a single polynomial approximation that pro-


duces the correctly rounded results for multiple representations and rounding modes
by using the round to odd mode.

In the remainder of this section, we provide a brief description of each contribution


listed above and highlight the key ideas. We discuss these techniques in greater detail in
the subsequent chapters.

1.2.1 The RL IBM Approach for Correctly Rounded Polynomial Approximations

Our RL IBM approach makes a case for generating polynomials that directly approximate
the correctly rounded result of f (x) itself. Figure 1.1(a) illustrates our key intuition. A
finite precision representation T can only represent finitely many values (i.e., v1 , v2 , and
7

An Approach to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:17
rounding interval of v2 rounds to f(x)
785 1 Function Main(f , T, H, X , RR, OC, OC 1 , d): l ≤ c0 + c1x + c2x2 … ≤ h
786 2 L v1
CalcRndIntervals(f , T, H, Xl ) v2 h v3
787 3 if L = ; then return false
788 4 L0
An Approach CalcRedIntervals(f , L, T, H,Math
to Generate Correctly Rounded
(a) 1for
Libraries
RR, OC, OC ) New Floating Point Variants (b)1:17
0
if L = ; then return false
789 5

790785 6 1 Function
Figure
, T, 1.1:
CombineRedIntervals(L
Main(f (a) OC,
H, X , RR, The
0 ) values v1 , v2 , and v3 are representable values in a representation T. The
OC 1 , d):
791786 7 2 realreturn
value false
if L = ;CalcRndIntervals(f
then of f (x), T,forH, aX given
) input x cannot be exactly represented in T and is rounded to v2 .
792787 8 3 Our approach
S, PifHL = SynthesizePoly(
; then identifies
return false , d) the rounding interval of v2 (shown in gray box). (b) Then, we generate
793788 9 4 if SL 0= true then return P approximation
a polynomial
CalcRedIntervals(f , L, T, H, RR,using
OC, OCthe 1rounding
) interval (i.e., [l, h]) for each input x with an LP
794789 10 elseif Lreturn
0 false return false
= ; then
5
formulation.
795790 6 CombineRedIntervals(L 0 )
796791 if = algorithm
Fig.7 7. Overall ; then return
that false
creates the polynomial approximation P (x ) that will produce the cor-
rectly rounded Entire input domain
797792 8 S, P H result.AnEach function,
SynthesizePoly(
Approach CalcIntervals,
, d)
to Generate CalcRedIntervals,
Correctly Rounded Math CombineRedIntervals,
Libraries for New Floating Point Variants and 1:19
798793 SynthesizePoly is explained An Approach
laterPin this section.to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19
9 if S = true then return
Convert
799794 else return
10 883 false
1 Function CalculateL Input x 0 (f , L, T, to H H, RR, OC,0 OC 1Input , d): x 1 Function Calculate (L ):
0
H RR, OC, OC 1, ,T, (L 0 ):
800795 1 Function CalcRndIntervals(f
884 2 L0 883 in, T
T, H,
1 Function
X
; foreach (x i , [l0 i , hi10 ): ]) 2Function
CalculateL L do , L, T,
(fGetRndInterval( in H,
2 X 0d):H): { x0 i | (1 xFunction 2 L Calculate
0 , I 0 )Polynomial
i xi 0
0}
0 | ( 0 , I 0 ) 2 L0}
801796 2 L ;
Fig. 7. Overall 884 21 L ;
11 foreach t (x i , [l
GetPrecVal(
i , h i ]) 2 L do , T) LP 2
will produceapproximation X { xi xi xi
885 algorithm 3 that t 1 creates OC H (li ,the x i ) polynomial 1
l approximation P3(x ) that ;
Formulation the cor-
802797 3 rectlyforeach
rounded x 2 XAndo Approach
Each to 885
GenerateOC H1Correctly
3 tRounded
1 12 OC H l (li ,CalcRedIntervals,
Math i)
xmin{
Libraries for2 NewH| Floating 2 [tl , Point ] and RN ( , T)
Variants 3 = } 0 P(x’) ; and 1:19
886 result. 4 t 2function, (hCalcIntervals,
i , xi ) 4 CombineRedIntervals,
foreach unique i 2 X do
RN (f is (x ), T) t 2 13 OC Ht1u(hi , x iGetSuccVal( ) , T) foreach unique i 2 X 0 do
ify886
803798 4 SynthesizePoly explained later in4 this section.
887 5 OC H is an increasing Identify values then
function 5 i ; 4
= Correctly ifin Rounding 2 H| then , tu ] and
804799 5 I 883 GetRndInterval(
1 6Function CalculateL
887 , T, H) t 1 ; 0 (f , tL,
5 14OC is anhOC,increasing
max 1 , {d): function 21 6[Function foreachRN ( (5, T) 0 I=0 0 ): } i0 ;
888 rounded 2 T, HH
H, that
RR, OC interval Calculate x j ,(L x j ) 2 L do
805800 6 if
1 Function884
I = ; then return0 ;
888
; foreach 6 round
]) 210L do
15 to yt 1 ;
return
Function t 2[l, h] 0 0 60 0 2 Lforeach
0} ( x0 j , I x0 j ) 2 L 0 do
CalcRndIntervals(f
889 27 L else //in,OC
f(x) T (x
T, HH, ,X
i is [li): ,ahidecreasing function [l, h]
GetRndInterval( 27 X , T, { H): xifi
0
| (i =x i , Ixx i )then
;i i t 1 else //
0
1 0 if i = x j then
j
806801 7 2 LL ; L [ {(x, 3 8 I )} Computed t 1 889 OC 7 ,x ) OC Ht is a GetPrecVal( decreasing 3 function 7
885
890 usingH t(l 2 an 11 l 8 , T) ; i i [ {I x j }
end foreach 890 1 library)
8 t ;
2 l t 1 , ] andunique
0
i [ {I x j }
9 dooracle RN ( 8 , T)
8 3
807802
891 x 24 X
886 t 2end (i.e., OC H (hi , x i )
MPFR 12 min{ 2 H| 4 92 [tlforeach end i 2= X 0 do } i
9 4 return L887 RN510 (f (x ), T)ifwhile 891 9 end end
808803
892 OC H OC is an H (increasing
, x i ) < [li function , h13 i ] do then tu GetSuccVal( 5
10 , T) end
i ; 9
Figure 1.2: The
892 RL10
; ’s while
approach OC H (to i ) <max
, xgenerate [li , h{i ] a do polynomial
6 2 [ , tuforeach approximation
and —(10(0 ,,IT) 0 )0 2 L end
that0 do
produces the

809804 5 I 888 GetRndInterval(
611 , tT,1 IBM
H)
AdjHigher( t 2 , H)
14 h 2 H| ] RN = }
893 11 AdjHigher( , the
H) interval 11 i I x0 jx2j 0 ix Ij x j 0
810805 Fig.68. For each if889
input
I = ; x then
2 X ,
894 correctly rounded result
712 returnelse
CalcRndIntervals(f
893 if
// ; OC > H then
is ,
return T,
of f (x).if > then return ;
a decreasing 15 H, ; X ) identifies
return
function [l, h] 7 I = [l, h] where
if i 11
= allxj
values
then i in I 0 2 I x
transcendental function812 f (x ). The if GetRndInterval(
i = ; then return ; xj i j
811806 I rounds
7 to theLcorrectly
890 L [813{(x, rounded I )} end result t12
894 f2(x; ) fort 1a given 12i i [ {I ifx0 j }i, = ; then return ;
895 endvalues in I rounds to . GetPrecValue(
812807 8 function
T, H) end returns 914the interval end 895 I 2
while OCHHwhere
13
( , x i ) <all [li , hi ] do
13 ,[ T){(returns i , i )} the
891
896 9 end 13 [ {( i , i )}
preceding value of in the T representation 14 and while
GetSuccValue( OC ( , x ) < , [l
T) , h returns
] do the
14 end
succeeding value of in T.
813808 9 return 892 L 1015 while 896 OC H ( , x i ) < [li , hH)
AdjLower( ] do H
v ). Given an input x, if the real value of f (x) (highlighted with star) cannot be exactly
i
i i i
end end
return —
897 3 10 14
814809 1116
15
897 if AdjHigher(> then return , H) ; AdjLower( , H) 15
0
893
898 16 if > then return ;
11 i I x015j 2 i I xreturn j
(2) 8. if then return
899 input x 2 X , For
12 end 898each pair (x, I x ), T,
> ; we compute thethe reduced input x . We
0 also
815810 Fig. CalcRedIntervals:
For each894 17
represented in T, then it is rounded to a value in T (i.e., v ). There is a range of values [l, h]
CalcRndIntervals(f
17 end
2 H, L, X ) identifies interval
12 2
I =if[l,
i
h] = where
; then all
returnvalues ; in
816811 compute
I rounds to the
895 the reduced
correctly
13 18 rounded end
interval
L8990 L 0 [ {(RR
result If 0 (x= )
H for
(x[l i 0,a[hgiven
), 0, ] ])}that de�nes
transcendental the range
function off (xinputs
). The for
GetRndInterval( the output ,
900 x 0 0 [ {(RRin ), [ , ])}to 13. GetPrecValue( [ {( i , i )}
while I 18H2( H, xin L , hall]Lvalues
i do H (x I irounds
817812 T, H)compensation
function 896
901
returns such
1419 the that
endinterval any
900 OC value i ) I<0 [lis
where i output
in the vicinity of v (highlighted in gray box) where all real values in the interval rounds to
x compensated to a value end in I x . The ,pair
T) returns (x 0, I 0 )the
x
1920 end , to
14
818813 speci�es
preceding value
897what
902
of 1520thein the output 901of
T representation
return L 0 P (x AdjLower() needs and H)
0be such that A(x ) rounds
GetSuccValue( , T) returns the toreturn . CalcRedIntervals
succeeding value of in T.
if 20 then return return L 15
819814 returns 898 a903list L containing all such pair of constraints for all input x .
16 0 902 > ;
v . We call this interval the rounding interval. If we generate a polynomial that produces
2 17
(3)(2) Fig. 9. Theend 903Because all inputs I x are 2 reduced to the (x reduced 2 Linput 0 , there Af0may (x i )be
0 For each pair weconstraint compute
820815 CombineRedIntervals:
899
CalcRedIntervals:
904
function CalculateL’ (x,transforms L,each
) CalculateL’ i the
, I x i ) reduced xinput
that constrains x .,HWe alsoa new A (x ) into a new
into
00 [ Fig. 9. H The function transforms each A constraint (xI i , Ieven x i ) 2in L the thatpresence
constrains
821816 multiple
compute reduced
900
905 intervals
18constraint,L( 904
the reduced i for
0 ,L
xinterval
I x i eachI xreduced
), {(RR
that (x
constraints
0 = i ), [
[l0
,
0 , h
])}
0
P input
0H] that
(
a value in the rounding interval, then the result of the polynomial will always round to the x i inde�nes L . P (x fthe
0 ) such0 that A 0 ),Hmust range
satisfies
0 produce of
f ,Hinputs
(x i a value
) 2 x i forwithin the all
output of f ,H i
19range end constraint, ( 00x i) ,2I xIi 0), .that constraints PH ( x i ) such that Af ,H satisfies Af ,H (x i0) 2 I x i the even in the presence of
thecompensation
reduced 906 interval suchfor
reductionthat as ) to
long produce
as in Ixxi the
( xcorrectThe value
function whento
Calculate rounded.combines Thus,
multiple we pair combine
constraints with
0anyrange value isas output compensated a valueCalculate in I x . The
822817 901 905
A(x P H (x ,multiple 0
Ix )
return i.e. ( 0reduction
0 , I 0 ), 0 long
i
0 2asLP0 H ( x0 i ) 2 0I x0 = . The 0 function combines a finalconstraints with the
823818 allspeci�es
reduced 907 interval
902 what thefor
20same reduced
output each906Linput, reduced
of
correctly rounded result (i.e., v ). P
same (x reduced input
x)i needs xi ( x
input, and
x j 0, Ibe
to x j ) such
i.e. ( produce
0 ,
where
I that
0 ), ( the
A(x
0x i i 0)pair
, I )
x j , into
rounds
2 L(x
0
0,a single
where to ) where
.0
constraint
CalcRedIntervals
= 0 , represents
into
and creates
a single constraint and creates a final
list of constraints for PH . 2 x x x x x x
824819 thereturns
combined
908a listinterval.
903 L 0 containing 907
CalcRedIntervals allofsuch
list constraints pair ofreturns for PH . a listfor all
constraints
i i j
containing
j
input x . the constraint pair
i j

825820 (3) , ) for 909 each reduced0 input


(x 0CombineRedIntervals:
904
Fig. 9. The function 908 Because
The basic RL x 0 . all inputs
CalculateL’ transforms each constraint (x i , I x i ) 2 L that constrains
are reduced to the reduced input x 0 , there
approach consists of three steps. Figure 1.2 pictorially shows each
Af ,H (x i ) into a new
may be
(4) SynthesizePoly: constraint, Each ( x909 , I x0 i ), IBM
pair that 0, constraints PH ( x0 i ) such
species the that
constraint Af ,H 0satisfies on Athe output
(x ) 2 I x i even of in the
0 ). presence
Weall of
826821 multiple 910 reduced intervals for each reduced
905 i (x ) 2 input in L 0 . P (x ) must produce
f ,H i
a value P (x within
frame synthesizing range
of 0 reduction
, 0910 as
that long 0 as
satis�es P;.H (Inx0 particular,
all I x0 i . Thevalues
) 2 constraints function innear Calculatethe
as anboundary
LP combines
problem of multiple
the interval,
and constraints
generate with
ora may the
827822 the reduced
906
911 interval
I i.e.
P (x [ for A(x x)i to
), ]
step. As the goal of the RL I 0 ,0produce
, 0 [),IBM
i
0] ,the 0I 0) correct value 0 =when
approach is to approximate the correctly rounded result of rounded. Thus, we combine i.e.
samexreducedi
be a valuefor such
input, of
i.e.
thatreducedI( i.e.
, I (,
x i x i ( x i, x ) x<j I x jx0or I 2 , L 0;.whereIn particular, 0 ,values
into a near
single the
constraintboundary and
check whether i.e. or may
of
createsthe ainterval,
final
correct x i . jTherefore,
pair (x 0, we repeatedlyrepresents
0
828823 all reduced(x ). interval
P912
907 911each OC H input
i xx i OC ( , x )x i< I x
and Hproduce i the ) where
list of constraints
the interval.
boundary 912 values, be
for a
P value
. such
and , is correctly that i
OC ( , x
outputacompensated) < I or OC ( , x
to a value ) < I .
iin ITherefore, we repeatedly check whether
x i while reducing pair the
H i x i x
the combined returns listi containing the constraint
829824 H H
908 CalcRedIntervals
4.1 Calculating
913
The f (x), the first step is to compute the correctly rounded result of f (x) for each input x in T.
Rounding the boundary values, and , is correctly output compensated to a value in while reducing the
830825 (x 0, 909)914for each boundary reduced of913[ ,input ] Interval
if they x 0 . do not (lines 10-17). Finally, we store the �nal interval I x i = [ , ] where
0 I x i
boundary
and (x of [ , ] if they
and A(x do
the not (lines 10-17). Finally, we store the �nal interval I 0 = [ , ] where
The �rst step in our approach i )is 2 to identify OC0,Hthe values I x i that thecorresponding
) must produces reduced suchoutput input,
that the x i rounded i ) in L i
0
831826 (4) SynthesizePoly:
910
915 OC H ( , xEach 914 I xpair ( ), x2 i ) 2 species
We use an oracle (an arbitrary precision math library, i.e., MPFR math library) to compute constraint on the of =P (x RR0 ).H (x We x
of(line
0 , i.e. [ , ] i IOC
18).
0 , (;.,In 2 I x i and
x i )particular, values
OC H ( =,near x i ) 2the and the corresponding
I x i boundary of the interval, reduced
i.e. or input, may x i = RR H (x i ) in L
0
832827 value of frame is916
A(x ) 911 equal I to
x the correctly 0 x Hrounded
synthesizing P (x ) that satis�es all constraints in as an LP problem and generate a
915 result of f (x ), i.e. RN (A(x ), T) = RN ( , T), for
(line 18).
i i
833828 correct 912 be
0 a value such 916 that OC H ( , x i ) < I x i or OC H ( , x i ) < I x i . Therefore, we repeatedly check whether
917P (x ). the real value of f (x) and round the value to T. Then, we identify the rounding interval [l, h]
829 913 the boundary
4.3 Calculating Proc. 917 ACM Program., Lang.,
values, and is correctly Vol. 1,output No. POPL, compensated Article 1. Publication to a value in date: while reducing
I x i January 2021. the
918 4.3 Calculating
830 4.1 Calculating 914 boundary
OnceThethe of
Rounding
list [
918 , ] if
of constraints they
Interval do not (lines 10-17). Finally, we store the
L 0 is identi�ed, we 0merge the constraints ( x0 1 , I x0 i ),x i. . . ( x0 i , I x0 i 0) 2 0L0 0 �nal interval I 0 = [ , ] where
919 for each input, where any value in the rounding interval rounds to the correctly rounded
and x i and the A(x corresponding
831 The �rst step 915
920in ourOC (
approach , x ) 2 919
where x 1 = · · · = x i . Each
H i 0 I is
x i to Once OC
identify0 the
H ( ,list
x
thei )of of
2valuesIconstraints
these that
constraints L is) must identi�ed,
boundproduces thereduced
we
output merge suchinput,
of the Pthat
H (x) the
constraints= RR
x isuch H (x(iA
rounded
that )xin , ILx i ), . . . ( x0 i , I x0 i ) 2 L 0
f1,H (x)
value of A(x916 ) is equal(line
produces 18).
to the the 920 correctly
correct where value rounded
0
x 1 for = ·each · · result
= input x i . of
0 Each
x i , that =of fthese reduces constraints
to
RNthe samebound value=the output , T),of forPH (x) such that Af ,H (x)
x i(. The function
832 (x ), i.e. (A(x ), T) RN 0
921
Calculate in Figure 15 shows how we merge the constraints.i First, for all constraints ( x0 i ,value
produces the correct value for each input , that reduces to the same I x0 i ) 2 x i . The function
x 0
833 917 921
922
918
4.30 Calculating
L , we identify 922
Proc. a listCalculate
ACM ofProgram.
unique reduced inLang.,Figure 151,shows
inputs,
Vol. No.XPOPL, 0 how
(lineArticlewe1). merge
For1.each theunique
Publication
constraints. reduced
date:
First,
January inputfor2021.all2constraints
X 0, ( x0 i , I x0 i ) 2
923
Once
we identify the list923 L , we identify
of constraints
all
0
constraints (L x0is
0 a0 list
, Iidenti�ed, of0 unique we merge reduced the inputs,
constraints X (line
0 0 1).
, I 0For ), . each unique
. .I(x0 x0into , I x0 i ) 2(linereduced
L0 input 2 X 0,
x i ) 2 L where0 x i and group the
919
924 = ( xintervals
1 xi
0 =924· · · =we 0identify allthese constraints where and (group theAintervals we I x i into (line
0 0 i 0
where x i . Each as of list ofconstraints bound Lthe output =of PxHiof(x) PHsuch that
i i
920 4-10). x 1can be considered the constraint ( x i , Ithat x i ) 2bounds the output ). Therefore, f ,H (x)
925
921 produces
create a uni�ed the correct
925 4-10).
constraint value by can be
for taking considered
each input intersection as
x i , that of the list
reduces of constraint
all intervals to thein that
same(line bounds
value the
11). xIfi .the
0 outputfunctionH ( ). Therefore, we
Theintersected of P
926
922 interval, isin;,Figure
Calculate 926 create
then it15 means a
showsuni�ed that how constraint
there we ismerge no output by taking
the constraints. intersection
P ( ) that First, satis�es of
forall all intervals
constraintsin( x0(line
allconstraints in I x0 )11).
,and 2 If the intersected
our
927
8

result of f (x). Because the polynomial is evaluated in the higher precision representation
H, the rounding interval is also in H.
Each input x and its corresponding rounding interval [l, h] defines a constraint on the
output of the polynomial. The result of the polynomial P (x) = c0 + c1 x + c2 x2 + . . . must
be a value between l and h for the given input x to produce the correctly rounded result of
f (x):
l ≤ c0 + c1 x + c2 x 2 + · · · ≤ h

The variables l, h, and x are constants in this inequality formula. The unknown vari-
ables, c0 , c1 , . . . , are the coefficients of the polynomial that we wish to generate. Using
the inequality formula for each input xi , we can create a system of linear inequalities that
encodes the problem of identifying the coefficients of a d-degree polynomial that produces
the correctly rounded results for all inputs:
 
    c0  
l 1 x1 x21 ... xd1  
 h1 
 1  
    1
 c
≤ 
l2  ≤ 1 x2 x22 ... xd2  
 .  h2 
    ..   
.. .. .. .. ..  ..
. . . . .   .
cd

This system of linear inequalities is precisely a linear programming (LP) query without an
objective function. Hence, we use an LP solver to solve this formulation. The polynomial
constructed using the solution of the LP query produces the correctly rounded result of
f (x) for all inputs.
The rounding interval [l, h] for each input x that we identify is larger than [f (x) −
, f (x) + ] where  is the error bound of the correctly rounded polynomial approxima-
tion generated using prior approaches. Hence, RL IBM provides larger freedom to generate
polynomials that produce correctly rounded results. This freedom results in better perfor-
mance.
89 5 if L 0 = ; then return false
0)
90785 6 1 Function CombineRedIntervals(L
Main(f , T, H, X , RR, OC, OC 1 , d):
91786 7 2 if L = ;CalcRndIntervals(f
then return false , T, H, X )
92787 8 3 ; then return false, d)
S, PifHL = SynthesizePoly(
93788 9 4 if SL 0= true then return P
CalcRedIntervals(f , L, T, H, RR, OC, OC 1 ) 9
94789 10 5 elseif Lreturn false return false
0 = ; then
95790 6 CombineRedIntervals(L 0 )
96791 if = algorithm
Fig.7 7. Overall ; then return
that false
creates the polynomial approximation P (x ) that will produce the cor-
rectly rounded Entire input domain
97792 8 S, P H result.AnEach function,
SynthesizePoly(
Approach CalcIntervals,
, d)
to Generate CalcRedIntervals,
Correctly Rounded Math CombineRedIntervals,
Libraries for New Floating Point Variants and 1:19
98793 SynthesizePoly is explained An Approach
laterPin this section.to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19
9 if S = true then return Reduced
Convert Range
99794 else return
10 883 false
1 Function CalculateL Input x 0 (f , L, T, to H H, RR, OC,0 OC 1Input , d): x Reduction Calculateinput
1 Function
1
(L 0 ):
00795 1 Function CalcRndIntervals(f 0 883in, T, T 1H,Function X ):
; foreach (x i , [l0 i , hi ]) 2 L do 10 Function
CalculateL (f , in H,
GetRndInterval(
L, T, H RR, OC, OC , ,T, d):
0 H): 0 1 Function
{ x i | ( x i , I x0 i )0 2 L Calculate
0 x’ 0} (L 0 ): Polynomial
884 2 L 2 X 0 | ( 0 LP 0 0
01796 2 L ;
Fig. 7. Overall 884 21 L ;
11 foreach t (x i , [l
GetPrecVal(
i , h i ]) 2 L do , T) 2 X { xi x i x i ) 2 L } approximation
, I
885 algorithm 3 that t 1 creates OC H (li ,the x i ) polynomial l approximation P3(x ) that ;
1 (l , x )
will produce the cor- Formulation
02797 foreach x 2 XAndo Approach
Each to 885
Generate OC H1Correctly
3 t 1 12
Rounded OC l i CalcRedIntervals,
Math min{
Libraries 2 New
for H| Floating 2 [tl , Point ] and RN ( , T)
Variants = } ; P(x’)
i
unique i 2 X 0 doand 1:19
3 rectly rounded 3
886 result. 4 t 2function, (hCalcIntervals,
i , xi ) H
1 (h , x ) 4 CombineRedIntervals,
foreach 0 do
03798 4 SynthesizePoly RN (f (x ),
is explained T) 886 4 t 2 OC t GetSuccVal( , T) foreach unique 2 X
ifylater in this section.function 13 u i i 4 i
887 5 OC H is an increasing Identify values H then 5 Infer thei ; Reduced
= Correctly Rounding
04799 5 I 883 GetRndInterval(
1 6Function CalculateL
887 , T, H) t 1 ; 0 (f , tL,
5 ifin 14OC HH is an
hOC,increasingmax 1 , {d): 2 H| then
function 21 6[Functionu ] and
, tOutput of RN ( (, T)
foreach
Calculate
5 0interval
I=x0 j0 ):) }2 Li0 do
;
888 rounded 2 T, H, that
RR, OC interval x j ,(L
05800 6 if
1 Function884
I = ; then return0 ;
888
; foreach 6 round
]) 210L do
15 to yt 1 ;
return
Function t 2 [l, h] Polynomial
0 0 6 0 [l’, 0 h’] foreach 0 ( x0 j , I x0 j ) 2 L 0 do
CalcRndIntervals(f
889 27 L else //in,OC
f(x) T (x
T, HH, ,X
i is [li): ,ahidecreasing function [l, h]
GetRndInterval( 27 X , T, { H):xifi
0
| (i =x i , Ixx i )then 2L }
LL ; L [ {(x, 3 8 I )}
1 else // OC Ht is a GetPrecVal( decreasing 3 function j
0 if i = x j then
0
2 ;i i t 1
06801 7 2 t 1 889 OC 7 ,x ) 7
885
890 Oracle H t(l
Result 11 l 8 , T) ; i i [ {I x j }
end foreach 890 18 t 2 ; t 1 , ] andunique
0
i [ {I x j }
9 do
8 3 t 2end OC H (hi , x i ) min{ 2 H| 4 92 [tlforeach 8 0
07802
891 x 24 X
886 12 l RN ( , T)
end i 2= X do } i
9 4 return L887 RN510 (f (x ), T)ifwhile 891 9 end end
08803 is an H (increasing
, x i ) < [li function i ] do then GetSuccVal( 10 , T)
; 9
892 Figure 1.3: The RL
OC H OC
892 10
; while
, h13 tu
’s approach to generate a polynomial approximation that produces the cor-
IBM OC H ( h, x i ) <max [li , h{i ] do
5
6 2 [ , tuforeach
end
i
and —(10(0 ,,IT) 0 )0 2 L end
0 do

09804 5 I 888
893
GetRndInterval(
611 , tT, H)
AdjHigher(
1 t 2 , H)
14 2 H| ] RN x x = }
11 i Ix j 2 0 i Ix j
0 j j
10805 Fig.68. For each if889I = ;x7then
input
894
returnrectly rounded result of f (x) when used with a range reduction strategy. We highlight the extension
122 X , CalcRndIntervals(f
else 893// if; OC
11
> H then is a return , T,15H,; X
decreasing
AdjHigher(
) identifies
function[l,the
return
, H) interval
h] 7 I = [l, h] if where i 11 = all values in I x0 2 i I x0 j
x j then i
if > then return ; 12 if GetRndInterval(
= ; then return ;, j
11806 I rounds
7 to the correctly
L890 rounded
L [813{(x, I )} end result t2 ;
12 (x
from Figure 1.2 with the gray box.
894 f ) for t1 a given transcendental function 8 f (x ). The i
12i i [ {I
0
ifx j }i = ; then return ;
895 end 13 [ {( , )}
8 function
end returns 914the interval OCHHwhere ( , x i ) <all [li ,values
hi ] do in I rounds to . GetPrecValue( , T) returns the
T, H) end 895 I 2
13 i i
12807 891 while 9 end 13 [ {( i , i )}
13808
preceding value 896
return ofL in the T representation
while 896 OC H
14
( , xand
AdjLower( ) < [lwhile
GetSuccValue(
, hH) OC H ( , x i ) <,[lT)
] do i ] do
i , hreturns the succeeding
14 end value of in T.
9 892 1015 i i i end end

897 10 14
15
, H) ; AdjLower( , H) 15 return
14809 1116 897 if AdjHigher( > then return 0
893
898 if ; > then return ;
11 i I x015j 2 i I xreturn j
898ifeach >16 pair then(x, return
15810 (2) 8.
Fig. CalcRedIntervals:
For each 894
899 input1217 x 2 X , For
1.2.2
end
CalcRndIntervals(f The RL
17
IBM end
I x ) Approach With Range Reduction
, 2
T, L,
H, Xwe ) compute
identifies the theinterval
12
reduced I = if [l,input
i
h] = where
; x
then
0 . Weallreturn also
values ; in
16811 compute
I rounds to the895 the reduced
correctly
13 18 rounded end
interval
L8990 L 0 [ {(RR
result Ifx0 (x=H) (x for
[li ),0,a[hgiven0, ] ])}that de�nes thefunction
transcendental range off (xinputs for
). The GetRndInterval( the output ,
900 0 0 [ {(RRin 13 [ {( , )}
while I 18H2( H, xin L , hall]Lvalues
i do H (x ), [ , ])}to . GetPrecValue( , T)
I irounds
i i 0 0
17812 T, H)compensation
function 896
901
returns such
1419 the that
end interval any
900 OC value i ) I<0 [lis
where
x
i output compensated to a value end x in I . The pair returns
(x , I x
the
)
19 0 end , to
14
18813 speci�es
preceding value
897what
902
of 1520thein the output 901of
T representation
return L 0 P (x AdjLower( ) needs and H)
0be such that A(x ) rounds
GetSuccValue(
In general, generating polynomials for a small input domain is easier than the entire input , T) returns the toreturn . CalcRedIntervals
succeeding value of in T.
return L 15
902if all thenpair return
20
19814 returns 898 a903list L 16 0
containing > such of constraints
; for all input x .
(3)(2) end Because all inputs x CalculateL’ to
are reduced the (x reduced
20815 CombineRedIntervals:
899
CalcRedIntervals:
17
Fig. 9. The For
903
function eachCalculateL’ pair (x, transforms
I ) 2 L, eachwe compute
constraint i the 2 Linput
, I x i ) reduced that constrains 0 , there
xinput xAf0may .,HWe (x i )be alsoa new A (x ) into a new
into
multiple
compute
904
reduced intervals
18constraint,L( 904 domain. Hence, elementary functions are often approximated with a combination of range
0 0 L00 [Fig.
for each
{(RR 9. H The
(x i ),function
reduced [ ,0 ])} input in
0 ) such0 that
.
transforms
0 ) must each A constraint
produce a
(xI i , Ieven
value x i ) 2in
within
L thatpresence constrains
all of f ,H i
905 the reduced xinterval H] that de�nes fthe
,H 0range of f ,Hinputs i for thethe output
21816 900 ,
i xi
I ), that I constraints
0 = [l 0 , h0
P 0 ( xi L P (x A satisfies (x i ) 2 x
constraint, x ,2I xIi 0), .that
22817 thecompensation
reduced901
19range end
interval such for
reduction
that 905
A(x any to
as) long value produce
as PH in ( (Ix00xi i)the
is output
x The constraints
correct value
function
compensated
PH ( x i ) such
when
Calculate torounded.
a
that Af ,H
combines
value inThus,I
satisfies
multiple . The we
Af ,H (x i0) 2
combine
constraints
pair (x , with
I 0)
even in the presence of
I x i the
906 range i 0 0 x
i.e. ( 0reduction
x as 0 long 0 2asLP0 H . The 0 function combines multiple
( x i ) 2 0I x = Calculate a finalconstraints with the
x
23818 allspeci�es
reduced 907 interval
902 what
20same reduced return
thefor reduction and polynomial approximation. The original input x is reduced to a smaller
output each906
0
Linput, reduced
of P
same
0 , I 0 ),
input
x)i needs
(x reduced xi ( x
input,to and
i.e. ( produce
x j ) such
x j 0, Ibe where
0 that 0 the
0x i 0)pair
A(x
i
rounds 0 , to) where
x j , into
(x 0 a single constraint
.0 CalcRedIntervals
0
x i , I x i ), ( x j , I x j ) 2 L where x i = x j , into a single constraint and creates a final
represents
and creates
thereturns
combined list of constraints for .
908a listinterval. pair ofreturns PH . a listfor all containing
input x . the constraint pair
P
24819 903 L 0 containing 907
CalcRedIntervals allofsuch
list
H
constraints constraints
for
25820 , ) for 909 each reduced0 input
(x 0CombineRedIntervals:
(3) 904
Fig. 9. The function 908 Because x 0 . all inputs
CalculateL’ transforms each
are
domain with the range reduction function. Subsequently, the polynomial approximation reduced constraint (x i , I x i ) 2 L that constrains
to the reduced input x 0 , there Af ,H (x i ) into a new
may be
(4) SynthesizePoly: constraint, Each ( x909 , I x0 i ), (x
pair that 0, constraints PH ( x0 i ) such
species the that
constraint Af ,H 0satisfies on Athe i ) 2 I x i even
output
(x of P in the
0 ). presence
Weall of
26821 multiple 910 reduced intervals for each reduced
905 i ) 2 0 0 input in L . P (x ) must produce a value
0 f ,H (xwithin
frame synthesizing range
ofinterval reduction
I x i , i.e. as
), that long
] I x)satis�es as ;. In ( ) 2
particular,
iall constraints
. The function
valuesin near Calculatethean boundary combines multiple
of theThus, interval, constraints orwith
a may the
0 aswhen LProunded.
problem and generate
0 0 0 P I
27822 the reduced
906 P (x[for 910 A(x to, produce H x
the
x i correct value we i.e. combine
28823 correct
all reduced
P
911
907(x 0 ).same
be a
reduced911
value function is used with the reduced input. The resulting value is adjusted with the output
interval for eachforreducedsuch
input, of
that
i.e.i I(x0 i x,0 ii.e.
OC
, I x0 i[), (, x0]j , I x0Ijx0)0i 2,L 0;.where
( , x input
i ) < I xx or and Hproduce
OC In
( particular,
, x i ) the
0 ,values
x i<=I x. jTherefore, into a single
x i pair (x , ) where
near
0 we the
constraintboundary
repeatedly
and creates
check
represents
of the ainterval,
whether
final i.e. or may
912 H
list of constraints
the interval.
boundary values, be PaHvalue
. such
and , is correctly that i
OC ( , x
outputacompensated) < I or OC ( , x
to a value ) < I . Therefore,
in constraint
I x i while reducing we repeatedly check whether
pair the
i x i x
the combined 912 returns list containing the
29824 H i H i
908
913 CalcRedIntervals
4.1 Calculating The
boundary Rounding the
Interval boundary values,
compensation function to produce the final output. The entire approximation function A(x)
of if they do not (lines 10-17).and Finally, , is correctly we store outputthe compensated
�nal interval to
0 = a value in
whereI x i while reducing the
30825 0
(x , 909 )914for each reduced input x . 913 [ , ] 0 I [ , ]
boundary
and ( )of [ , I] if and theythe do not (lines 10-17).reduced Finally, input, we store the xi
�nal interval 0I x = [ , ] where
0
31826 The �rst step in
(4) SynthesizePoly:
910 our approach
915 OC H ( , xEach i )is 2 to
914
iidentify
I xpair OC0,Hthe
(x , x2values
i ) 2 species x i that A(x thecorresponding
) mustconstraint produces on the suchoutput that the x i rounded
of =P (x RR H (x
0 ). We i ) in L i
of(line , i.e. ;. Ini particular, and values and the corresponding reduced or amay x i = RR H (x i ) in L
input, 0
of H =near (xthe i boundary of),the T) interval,
0 OC
0 ( , x ) 2 I OC ( , x ) 2 I
i 18).
value of frame is916 equal
synthesizing I xto the[Pcorrectly 915]0 ) that
,(x I x i ,Hrounded
satis�es result
xi
all constraints fi in ), xi.e.as RN an LP problem and( generate i.e.
, T), for
32827 A(x ) 911
be a value
of f (x) is a composition of these three functions. For example, the input domain for log (x)
such that (lineOC 18).( , x ) < I or OC ( , x ) < I . Therefore,
(A(x
we
= RN
repeatedly check whether 2
33828 correct 912
917P (x ).
0 916 H i xi H i xi
829 913 the
4.3 boundary
Calculating 917 values, and , is correctly
Proc. ACM Program. Lang., Vol. 1, No. POPL, Article 1. Publication date:i January 2021. output compensated to a value in I x while reducing the
4.1 Calculating
918
boundary
Once Thethe is (0, ∞). Polynomials can approximate log (x) for the smaller domain [1, 2) much more
oflist
Rounding [ of
918 ] 4.3
if they
, constraints Calculating
Interval do not L 0 is(lines identi�ed, 10-17). we Finally, wethe store the2 �nal interval ( x0 1 , I x0 i I),x i. .=. ([ x0,i , ]I x0where
0
0merge constraints ) 2 0L0 0
830 914
919
831 The �rst step 915 in our OCwhereH (
approach , x i )
0 2 919 I is
x· and
i· to
· Once OC
identify0 .the
H (
Each ,list
x
the i )of of2 I
values
thesex and
constraints that the
constraints corresponding
A(xL is) identi�ed,
must
bound produces
the reduced
we
output merge suchinput,
of the that (x)constraints
x ithe= RR
such H (x(iA
rounded
that )ix0in , ILx i ), . . . ( x0 i , I x0 i ) 2 L 0
P f1,H (x)
i
920 = = H
832 value of A(x916 ) is equal(line
produces 18).
to the
x1
the 920 correctly
correct where xi
value rounded
x 1 for = input x i . of
accurately than the entire input domain. Hence, we reduce the original input x into the
0 = ·each · · result 0 Each
x i , that =of fthese reduces
(x ), i.e. constraints
to
RNthe (A(x samebound
), T) value =the RN output
0 (. The , T),of forPH (x) such that Af ,H (x)
function
921
833 917 Calculate 921in Figure produces 15 shows the correct how wevalue mergefor theeach input x i First,
constraints. , thatfor reducesall constraints to xthe i
same ( x0 i ,valueI x0 i ) 2 x i . The function
0
922
4.30 Calculating
L , we identify 922
a listCalculate inLang.,
Figure 1501,shows 0khow we1). merge theunique constraints. First, for2021. all2constraints ( x0 i , I x0 i ) 2
0 , of unique reduced
reduced input x using x = x × 2 where x ∈ [1, 2) and k is an integer. We approximate inputs, (lineArticle For1.each 0 (line0 1).reduced input X 0,
918
923 Proc. ACM 0
Program. Vol. No.XPOPL, 0 Publication date: January
Once
we identify the list all 923 of constraints we identify
constraints (L x0is, Iidenti�ed,
L 0 a list of unique
0 we merge reduced the inputs,
constraints X For
0 each 0unique 0 reduced
0 input 2 X 0,
x i ) 2 L where0 x i and group the
919
924
0 = ( xintervals, I ), . . .I(x0 xinto
1 xi
, I x i ) 2(line L
where 0 =924· · · =we 0identify . Each all
of
i
constraints
these constraints ( , I 0 ) 2
bound L 0 where
the output = of and group
such
i i
that the intervals I 0 into (line
920
925 4-10). 0 x 1can be considered as the list of constraint x i that x bounds the outputHof PH ( ). Therefore, P x i (x) A f ,H we (x) x
y = log (x ) using a polynomial for the reduced domain [1, 2). Then, log (x) for the
0 xi i i

921 produces
create a the
uni�ed 925correct
2 4-10).
constraint value canfor
by beeach
taking considered
input
intersection x i
asthat
, theof list
reduces
all ofintervals
constraint
to the insamethat bounds
value
(line 11).
0the
If . The
the output function
intersected of PH ( ). Therefore, 2 we
926 xi
922 interval, isin;,Figure
Calculate 926 create
then it15 means a uni�ed
shows that how constraint
there we ismerge no output by taking
the constraints. intersection
P H ( ) that First, satis�es of
forall all intervals
allconstraints
constraintsin( x i ,and in (line
0 0 11).
I x i )our2 If the intersected
927
923
0 , we identify
Lalgorithm terminates interval,
ofby unique outputting isreduced
;, then ititinputs,
original input x is approximated by compensating the output y with y = y + e.
927 a list means
(line 12). Xthat 0 (line
Finally, there isFor
1).Calculatenoeachoutput uniqueP H ( )reduced
returns that thesatis�es
listinput
0all constraints
containing 2 X 0, in 0 and our
928
924 wetheidentify
merged all 928
constraints algorithm
constraints terminates
( i ,( xi )i ,for
0 0
I x i )each by
0
2 L unique outputting
where i 2=X x. iThe it
0 (line 12).
andpolynomial Finally,
group the Pintervals Calculate
(x) should I x i into
0 returns
be generated (line the list containing
929 0. H
4-10). can be
929
such that it satis�es the constraints . the
considered merged as constraints
the list of constraint
( , ) for
that each bounds unique the output
2 X The
of polynomial Therefore, P (x) we should be generated
925
930 Range reduction, polynomial approximation, and output compensation functions are
such thatbyit taking satis�esintersection
i
the constraints
i
.
i P H ( ). H
926
931 create a uni�ed 930 constraint of all intervals in (line 11). If the intersected
927 interval, is931 ;, then it meansProc. thatACM thereProgram. is no output Lang., Vol. P H (1, )No. that satis�es
POPL, Articleall constraints
1. Publication date: in January and our 2021.
928
evaluated
algorithm in abyfinite
terminates precision
outputting it (line representation.
12).Proc.
Finally,
ACM Program. Hence,
Lang.,
Calculate returns
Vol. they
1, No.the listcan
POPL, experience
1. Publication date:numerical
containing
Article January 2021. errors.
929 the merged constraints ( i , i ) for each unique i 2 X 0. The polynomial PH (x) should be generated
930 Tothat
such useit satis�es
polynomials with range
the constraints . reduction, the RL IBM approach adjusts the output intervals
931

for the polynomials


Proc.to account
ACM forVol.the
Program. Lang., numerical
1, No. POPL, Article 1. errors. Figure
Publication date: January 1.3
2021. pictorially shows the

modifications to handle range reductions. The additional steps that adjust the intervals are
highlighted with the gray box. In this setting, each original input and rounding interval con-
10

straint defines the output of the entire approximation of f (x) (i.e., A(x)) should produce.
We use the original inputs and the range reduction function to identify the list of inputs for
the polynomial. We call these inputs the reduced inputs. Similarly, we use the rounding
intervals and the output compensation function to infer the range of values that the poly-
nomial should produce. We call this range of values the reduced interval. If a polynomial
approximation produces a value in the reduced interval for each reduced input, then the
result when used with the output compensation function produces a value in the rounding
interval. With the reduced inputs and reduced intervals, we generate polynomial approxi-
mations using LP formulations. Further, we develop modified range reduction techniques
for some elementary functions to minimize cancellation errors in output compensation (see
Chapter 4 for more detail).

1.2.3 Scaling RL IBM To 32-bit Representations

The 32-bit representations have four billion inputs. Naively using the RL IBM approach
for 32-bit types can generate millions of constraints, which is beyond the capabilities of LP
solvers. Additionally, it may not be feasible to generate a single polynomial of a reasonable
degree that satisfies all constraints. It will most likely require a high-degree polynomial to
produce the correctly rounded result, which is not ideal for performance. Hence, we extend
the RL IBM approach with two techniques to generate efficient polynomial approximations
for 32-bit types: We propose to (1) use the counterexample guided polynomial generation
technique that samples inputs to handle a large number of constraints and (2) generate
piecewise polynomials for efficient implementations of elementary functions. Figure 1.4
illustrates the steps to generate polynomial approximations for 32-bit types.

Counterexample guided polynomial generation. To generate polynomial approximations


that produce the correctly rounded results for millions of inputs, it is not necessary to con-
sider every single input and its corresponding interval. We only need to reason about inputs
5 if L 0 = ; then return false
0)
6 1 Function CombineRedIntervals(L
Main(f , T, H, X , RR, OC, OC 1 , d):
7 2 ifL = CalcRndIntervals(f
; then return false , T, H, X )
8 3 S,if then return false , d)
P HL = ;SynthesizePoly(
9 4 ifLS0 = true then return P
CalcRedIntervals(f , L, T, H, RR, OC, OC 1 ) 11
10 5 else
if Lreturn falsereturn false
0 = ; then

6 CombineRedIntervals(L 0 )
Fig. if = ;algorithm
7 7. Overall then return thatfalse
creates the polynomial approximation P (x ) that will produce the cor-
rectly
8 rounded
S, P H result. Each Entire
SynthesizePoly(
input domain
function, CalcIntervals, CalcRedIntervals, CombineRedIntervals, and
, d)
An Approachlater
to Generate Correctly Rounded Math Libraries for New Floating Point Variants 1:19
if S = true then return P in this
SynthesizePoly
9 is explained section.
Reduced Sample Generate
10 else return false Input x Input x Sub-domain #1 Polynomial #1
0 (f input1 Start (L 0 ): x’ polynomial
1 883 1 Function CalculateL
Function CalcRndIntervals(f in, T,
T H, X ):, L, T, H Function
in10H, , d):
RR, OC, OCx’GetRndInterval( 1 Function , T, H):Calculate
2 ; 884 algorithm
L Overall 2 L 0 that ; foreach (x ithe i ])112 L do tl approximation
, [li , hpolynomial GetPrecVal( Sub-domain
2 , T) X #2 { x | ( x , I x ) 2 L }
0 0 0 0 0 Polynomial #2
Fig. 7. creates 1 (l , x )
P (x ) that willi produce i i the cor-
3 rectlyforeach
rounded 885x 2 X do Each 1function,
3
result. t OC H i CalcIntervals,
i 12 min{ 2 H| 23 [t…
l CalcRedIntervals, ] and; RN ( ,Add
l , CombineRedIntervals, T) = }
incorrect and Is polynomial …
1 (h , x )
4SynthesizePoly 886RNis (f4explained
(x ), T) t 2 later OCin H this i section.
i 13 tu GetSuccVal(4 , T)foreach uniquex’ to sample
i 2 X do
0 No correct? Yes
Correctly Rounding Reduced Domain Splitting for
5 I 887GetRndInterval(
5 if OC H, T,
is anH)increasing functionhthen
14 max { 2Piecewise , tu ] and
H| 52 [ Polynomials i RN ; ( , T) = }
Counterexample Guided Polynomial Generation
rounded interval interval
6 if I 888
= ; then
6
1 Function CalcRndIntervals(f
return ; t
f(x) in, T, 1 ; t 2
T H, X ): [l, h] 10 Function
15 return[l’, h’]GetRndInterval( , T, H): ( x j , I x j ) 2 L do
[l, h] 6 foreach 0 0 0

7 2 L L ;889L [ {(x, 7 I )} else // OC H is a decreasing 11 tfunction


Range GetPrecVal( 7 , T) if i = x0 j then
Oracle Result l Reduction
8 3 end 890
foreach x 2 X do 8 t 2 ; t 1 12 l min{ 2 H| 2 [tl , ] and RN ( i , T) =
8 i [} {I x0 j }
9 4 return 891 9
L RN (f (x ), T) end GetSuccVal( 9 , T) end
5
10
I 892 GetRndInterval(
Figure
while OC 1.4:
H ( H)
, T, , x iThe
) < [li RL
13
] do
, hi14
t
IBM ’su approach for
h
generating
max { 2 H|10 2 [ , tu end ] and —
a polynomial approximation that produces the
RN ( , T) = }
Fig.6 8. For each if893Iinput 11
= ; then correctly
x 2 X ,return ; rounded ,result
CalcRndIntervals(f AdjHigher( ,
15 H, Xof
H)
T, f (x)[l,for
) return
identifies h] 32-bit
the interval
11 types.
I = [l,ih] where I x0 j 2 all
0
I values
i xj
in
I rounds to the 12 rounded result
correctly if > then for return
a given ;transcendental function The
7 L 894 L [ {(x, I )} f (x ) 12
f (x ). if i = ; then return ; ,
GetRndInterval(
T, 8H) function
end 895 returns 13 the intervalend I 2 H where all values in I rounds to . GetPrecValue( , T) returns the
13 [ {( i , i )}
preceding value
return 896L
of in
14 the T while OC H ( , x iand
representation i , hi ] do
) < [lGetSuccValue( , T) returns the succeeding value of in T.
9
897
15 with highly AdjLower( constrained
, H) intervals to ensure
14 end that we generate polynomials that satisfy these
15 return
16 if > then return ;
(2) 8.CalcRedIntervals:
Fig. For each For each pair (x, I x ,)T,2H,
898 input x 2 X , CalcRndIntervals(f L,Xwe compute
) identifies thethe reduced
interval I = [l,input
h] where x 0 .allWe alsoin
values
compute
I rounds to the
17
the reduced
899correctly constraints.
end
rounded 0interval
result
0
f
I 0(x =
x
The
) for [l 0a, resulting
0 ] that
given
h polynomial
de�nes
transcendental the range
function willof
f (x likely
inputs
). The forsatisfythe output
GetRndInterval( other, constraints. Hence, we sam-
T, H)compensation 18 L L [2{(RR H (x i ), [ , ])}
function 900returns suchthethat any Ivalue
interval in I x0 isall
H where output
valuescompensated
in I rounds toto .aGetPrecValue(
value in I x . The, T) pair returns
(x 0, I x0the)
of 19 in end
speci�es
preceding value901what
20 0
thethe ple
output
T a of few
representation
return L 0 P (x inputs
0 ) needs and and
to be compute
such
GetSuccValue( that ,
A(x the
T) ) rounding
rounds
returns theto . interval
succeeding value
CalcRedIntervals for of the in T.sampled inputs. If we intend
returns902 a list L containing all such pair of constraints for all input x .
(3)(2)CombineRedIntervals:
903 ForBecause
each all inputs
pair (x, I x )are reduced
we also to the reduced
compute the reduced input input x 0 , there x 0 . may
We also be
CalcRedIntervals:
multiple
Fig. 9. The
904reduced
tofunction
intervals
use range
for
CalculateL’
each
reduction,
0 reduced transforms
0, h 0input
2 L, we
each
in 0 . P (x 0identify
constraint (x i , I x i ) 2 L the
must produce
reduced
that constrains
a value
Af ,Hinputs
within
(x i ) into aand
all
new intervals. We generate a
compute the reduced interval I
constraint, ( x0 i , I x0 i ), that constraints
x = [l P ]( that de�nes
L
0 ) such that A the
) range
satisfies A of inputs
(x ) 2 I for
even the
in the output
presence of
the reduced interval
compensation such for
thatA(x any ) tovalueproduce the0 correct
H xi
value whentorounded.
f ,H
value inThus,
f ,H i xi
wepair combine
as PHin ( xI0 xi ) is
2 Ioutput compensated Calculate acombines I x . The
905 0 0
(x ,with 0
I x ) the
allspeci�es
reduced
906 what
range
interval
samethe
candidate
reduction
for
output
reduced each
as long
input, ofreduced
P (x
i.e.
polynomial
I x0input
( x0)i ,needs
0
), ( x0to
x 0
x, I be
i that
) 2such
produces
. The function
0 and Lproduce
that A(x
0 where 0 a value
0 the) pair
rounds (x ,to ) in
0
.wherethe interval
multiple constraints
CalcRedIntervalsrepresents for each sampled input using LP
j xj x i = x j , into a single constraint and creates a final
the combined
returns 907 a list interval.
list containing
Lof0 constraints allPsuch
CalcRedIntervals
for .
i
pair of returns
constraints a list for containing
all input x . the constraint pair
0, ) for H
908 each reduced input x 0 . all
(3)(xCombineRedIntervals: formulation. Because Next, inputs wearecheck reduced if tothe thecandidate
reduced input polynomial
x 0 , there may produces
be a value in the rounding
(4) SynthesizePoly:
multiple reduced intervals
909 Each pairfor 0
(xeach, ) reduced 2 species inputthe in Lconstraint
0 . P (x 0 ) must on produce
the output of P (x
a value
0 ). We
within all
frame synthesizing
the reduced
910 interval P (xfor) that
0
A(x )satis�es
to produce all constraints
the correct valuein as whenan LP problemThus,
rounded. and generate
we combine a
correct P (x 0 ).of I x0 i , i.e.interval
[ , ] I 0(or , ;.reduced
In particular, interval)
values near fortheall inputs
boundary of(orthe reduced
interval, i.e.inputs). or mayWe add any counterexam-
all reduced
911 interval for eachxreduced i
input x 0 and produce the pair (x 0, ) where represents
be a value such that OC H ( , x i ) < I x i or OC H ( , x i ) < I x i . Therefore, we repeatedly check whether
the combined
912 interval. CalcRedIntervals returns a list containing the constraint pair
the boundaryple inputsvalues,Interval and , is correctly output compensated to a value in I x i while the reducing the polynomial produces the
4.1 Calculating
(x , 913
0 ) for eachThe Rounding
reduced input to x 0 . the sample and repeat the process until generated
The (4)
�rstSynthesizePoly:
step in boundary
914 our approach ofis[ to if they0 do
, ]identify the notvalues
(lines 10-17).
that Finally,
must weproduces
store the �nal such interval
that the = [ 0, ] where
I x0 i rounded
Each pair (x , ) 2 species the constraint on the output of P (x ). We 0
A(x )
value offrame is equalOC Hto ( ,thexi ) 2 I and OC H ( , x i ) 2result I x i andofthe = corresponding RNreduced input, x i (=,RR T),H (x i ) in L
synthesizing
A(x )915 Pcorrectly
(x x0 )i thatrounded
correctly rounded
satis�es results
all constraintsfor all fin (xinputs.
), as
i.e.an This
LP
(A(x ),counterexample
problem T) = andRN generate for guided polynomial generation
a
(line 18).
correct P (x ).
916 0
917 Proc. ACM Program. Lang., Vol. 1, No. POPL, Article 1. Publication date: January 2021.
4.1 Calculating
918
4.3The process isInterval
Calculating
Rounding inspired by counterexample guided inductive synthesis (CEGIS) [72, 123] used
The �rst step Once
919in our the list of
approach is constraints
to identify
0 is identi�ed, we merge the constraints ( 0 , I 0 ), . . . ( 0 , I 0 ) 2 L 0
Lthe values that A(x ) must produces suchx 1that x i the roundedxi xi
920 where
value of A(x ) is equal toin
0 = · · · = 0 . Each of these constraints bound the output of P (x) such that A
1 program
xthe correctlyx i synthesis.
rounded result of = f (x ), i.e. RN (A(x ), T)H = RN0 ( , T), for f ,H (x)
921 produces the correct value for each input x i , that reduces to the same value x i . The function
922 Calculate Proc. in Figure
ACM 15 shows how
Program. Lang.,we merge
Vol. 1, No.the constraints.
POPL, Article 1.First, for all constraints
Publication date: January
0
( 2021. 0
xi , Ixi ) 2
923 L , we identify a list of unique reduced inputs, X (line 1). For each unique reduced input 2 X 0,
0 0

924 Piecewise
we identify polynomials.
all constraints ( x0 i , I x0 i ) 2 L 0 whereWhen = thex i and group theof
number intervals
inputs intothe(line
I x0 i in sample exceeds the capabil-
925 4-10). can be considered as the list of constraint that bounds the output of PH ( ). Therefore, we
926 create a uni�ed constraint by taking intersection of all intervals in (line 11). If the intersected
927
ity of our LP solver, the LP solver cannot generate a polynomial, or the generated poly-
interval, is ;, then it means that there is no output P H ( ) that satis�es all constraints in and our
928 algorithm terminates by outputting it (line 12). Finally, Calculate returns the list containing
929 nomial
the merged does
constraints ( i ,not satisfy
i ) for each unique our iperformance
2 X 0. The polynomialconstraint,
PH (x) should we split the input domain [a, b] into
be generated
930 such that it satis�es the constraints .
931 sub-domains [a, b0 ) and [b0 , b]. Then, we generate a polynomial for each sub-domain using
Proc. ACM Program. Lang., Vol. 1, No. POPL, Article 1. Publication date: January 2021.

counterexample guided polynomial generation. The domain is split in such a way that we
can use a few bits in the bit-pattern of the input to identify which polynomial to evaluate.
If the RL IBM approach cannot generate a polynomial for every sub-domain, then the input
domain is further split into smaller sub-domains. The final number of sub-domains depends
12

on the chosen elementary function and representation. Among the elementary functions we
tested for the 32-bit float representation, the RL IBM approach splits the original input do-
main into 26 sub-domains on average. This strategy allows the RL IBM approach to generate
piecewise polynomials with a low degree resulting in implementations that are faster than
mainstream math libraries.

1.2.4 A Single Approximation for Multiple Representations and Rounding Modes

The FP representation officially has five rounding modes defined by the IEEE-754 standard.
The correctly rounded result of f (x) for a given input x may be different depending on the
chosen rounding mode. One possible solution to produce the correctly rounded result for
each rounding mode is to adopt the strategy from CR-LIBM [29] or the above RL IBM
approach. They generate a correctly rounded polynomial approximation of f (x) for each
rounding mode. While this approach may be feasible for a single type, it can quickly
become overwhelming when we generate correctly rounded approximations for various
representations. It would be ideal if we can generate a single polynomial approximation
that produces the correctly rounded results for multiple rounding modes and precision con-
figurations.
We propose a novel approach to create such polynomial approximations. The key idea
is to create polynomials that approximate the correctly rounded result of f (x) with the
round-to-odd (rno) mode. The rno mode is a non-standard rounding mode that has been
used to avoid double rounding issues when performing primitive operations with extended
precision and subsequently rounding the result to a lower precision representation [9]. The
rno mode can be described as follows: If the real value of f (x) is exactly representable
in our target representation, then it is rounded to that value. Otherwise, it is rounded to
a value whose bit-string is odd when interpreted as an unsigned integer. Our contribution
is in recognizing that rno mode can be used to generate correctly rounded polynomial
approximations for multiple rounding modes. Chapter 6 provides more detail regarding the
representable in our target representation, then it is rounded to that value. Otherwise, it
is rounded to a value where the bit-string is odd when interpreted as an unsigned inte-
d Math Libraries for New Floating Point Variants 1:17
ger. Our contribution is in recognizing that rno mode can be used to generate correctly
13
ctly Rounded Math Libraries for New Floating Point Variants 1:17

OC 1 , d):
1
, ,XRR,
) OC, OC , d): rounded polynomial approximations for multiple rounding modes. Section 6.3.2 provides
als(f , T, H, X )
,false
H, RR, OC, OC 1 )
als(f , L, T, H, RR, OC, OC 1 ) more detail regarding the rno mode.
n false
ervals(L ) Entire input domain
n false Suppose that the goal is to generate correctly rounded polynomial approximation of
Poly( , d) An Approach to Generate Correctlyto
An Approach Rounded Math
Generate Libraries
Correctly for New
Rounded Floating
Math Pointfor
Libraries Variants
New Floating Point Variants 1:19 1:19
rn P f (x) for all k bit representations Tk with any rounding modes where k is bounded (i.e., Polynomial
Input x Convert
883produce theFunction CalculateL 0 (f , L, T, to H
H, 1Input
,
x
Function
Generate
1 , d): Calculate (L approximation
0 ):
hat The
the polynomial approximation P (x ) that will
creates the polynomial
CalcIntervals,
k  n).
approximation
CalcRedIntervals,
largest T
P (x ) that will
CombineRedIntervals,
1representation
k produce
cor-
and the
883
cor- is
in T
T n .
1 Function
Then, we propose RR,
CalculateL to OC,0 (f
OC
generate , L,a T, d):
polynomial
in H
H, RR, 1 OC
OC, Polynomial 1 Function Calculate
P(x)
(L 0 ):
isfunction,
oach Inputs
section. CalcIntervals,
x and
to Generate InputsRounded
Correctly CalcRedIntervals,
x and Math 884
Sub-domain
Reducedfor New Floating
Libraries #12
CombineRedIntervals,
Point Variants L 0 Sample
and; foreach
884 (x i L, [l0 i , hi;])
Generate
1:19 2 polynomial 2 L do
foreach
Polynomial #1
(x i , [l i , h i ]) 2 L do 2 X0 { x0 i | (2 x0 i , I x0Xi )0 2 L 0{} 0 | ( 0 , I 0 ) 2 L 0 }
x’ x x i xi
of3f1 (l(x)
the correctly
d later in this section. Start i
rounded
the rounding
that produces
inputs x’
and reduced the 885 correctly 3 roundedt 1result OC i , xin
i ) the
t (n + #21
OC2)-bit (l ,representation
xi ) Tn+2 3 ;
1
intervals Sub-domain #2 Polynomial
H, result
ction
X f(x) Function
y =10
):CalculateL (f [l,, L,h]T,inGetRndInterval(
H RR, OC, OC 1 ,[ld):
intervals , T, H):
i’, hi’] 1 Function Calculate (L ): 885
H H i 3 ;
1
H,
vals(fin, T,
L
T 11
foreach
H, X ):t 10 Function GetRndInterval( …
L do
(x i , [ll i , hi ])GetPrecVal( , T)
886
, T, H):
X 4{ xi | ( xi x’
Add incorrect
t polynomial
, I xto ) L2 }
Is polynomial
OC (h , xproduces
4correct?i ) t
i Yes 2

1
OC (hresults , x i ) for Tk for any 4 foreach unique X 0 do unique i 2 X 0 do
i 2foreach
t1 OC12H1 (li , x il) with the rno mode. We prove that this
min{ tl Range
11
H| GetPrecVal(
[tl , ] andDomainRN
2
, T)
3 ( ,Splitting
T) = for } 886 No H
i sample
correct H i 4
if OC is an increasing is an then
Oracle Result Reduction
and RN ( , T)Counterexample
= } y = Guided rno result Odd ; 5
t2 OC13H1 (hi , xtiu)
12 l
GetSuccVal(
t
min{, T) H| Piecewise
GetSuccVal(
887[tl , ] Polynomials
,
4
T)
5
foreach unique i X do 887 H Polynomial
5 if OC
Generationfunction
Identify
H odd increasing function 5
then i
Range Reduction i ;
kof f(x) in1 intervalbits. 6
13
H) if OC H is
rval( , T,
h
14 an increasing
H) standard rounding modes as long as T and Tt ; havet the same number
max { u then
function H| [ , tu ] and RN
H| 888
5 ( , T) =i }
[ , tu ] and6RN ( , T) = } n+2 2 interval of y1
t ; of
t exponent Domainforeach
Splitting (6 x0 j , I x0 j ) 2foreach
L 0 do ( 0 , I 0 ) 2 L 0 do
15t 1 ; return 14 h
Figure 1.3: WIP: Workflow to generate polynomials for T
urn
t2 [l, h] max {
return [l, h]
6 foreach ( x j , I x j ) L888 do
T n+2. 6
with rno mode. 2 [l, h] xj xj
We present the proof in Section 6.4.else // OC is a else if i Guided 0 Polynomial Generation
7 Counterexample 7= x j then
15
else // OC H is a decreasing function decreasing
t2 ; t1 889
7
7 if i = x j
then
{I889
xj }
7 H // OC function
H is a decreasing function if i = x0 j then
t ;
8 iOracle
i Result 0
end
890 9 8 end
890 82 t 1 t ;
2 t1 8 8i i [ {I x j } i
0
i [ {I x j }
terexample guided polynomial
while OC H ( , x i ) [li , hi ] do
Figure 1.5 generation. presents the steps brieflytoend explain
generate counterexample
a polynomial guided.
end approximation that produces 9
10 end
tervals(f , T, AdjHigher(
H, X ) identifies , H) the interval I = [l, h]
, T, H, X )function
identifiesf the
89111where all values
9
I = [l, h]i where
in
Ix j i xj
I 891 9 end 9 end
CalcRndIntervals(f
(x ) for if then return (x ). interval all
, values in
a given
> transcendental
ey insight is that we the do notresult
havein to Figure
The GetRndInterval(
892
reason 10
1.5:
ifGetRndInterval(
abouti = then return ,
every while
The inRL
constraint.OC, IBM ’s) <approach
(we,compute
xonlyi [l ,the
while
need i i do (to, xgenerate
]OC
h highly ] polynomial
do of 10
) < [li , hiresult approximations
end that produces the cor-
— end
ded result (x ) for aingiven
allf values transcendental
rno function
T
12
. For
T)The
f (x, ).
each input 892 T 10weH correctly irounded

H end
where I rounds to . GetPrecValue( returns the H 10
13 n+2 n
terval
tion and
while
H where all values returns
I GetSuccValue(
OC H ( , x i ) [li , hi ,] T) do
in I rounds to . GetPrecValue(
the succeeding value 11of in , T)T.{(returns
i , i )} the
AdjHigher( , H) AdjHigher( , H)
representation and GetSuccValue( , T) returns893
AdjLower( , H) rectly
the
14 end
succeeding
rounded
value of in T.
result
893
11
forthen Tn+2 with rno thatmode. 11 i I110
xj 2 i xj
I0 i
0
I x0 j 2 i I x j
rained interval. we sample
in T some with samples
the rno proportional
mode. Next, toifthe
we>number
identify of constraints.
a return
range if of;> values rounds ;to the
f (x) return
then return
15
h pair (x, if I >) then L, we return compute the n+2 reduced 894 input x . 12 We also 12
:l IFor
x
endeach pair (x, I x ) L, we compute the reduced input x . We also 894 12 if i = ;12then return ;
if i = ; then return ;
we generate polynomial for the sample. we check ifend
x = [l , h ] that de�nes the range of inputs for the 13output
edlueinterval I{(RR
Lin I Lis outputx =H (x , h[ ,] that
[lcompensated
i ), ])} de�nes to a the range
value in 895 . of inputs
The pair for the output all constraints are satisfied,
13 end add
a value in I x . The pair (x , I x895 [13{( i , i )}
I (x , I ) 13
hat
xend) needs
x
any value to beinsuch I x isthat outputA(xcompensated
) rounds to .to
x
CalcRedIntervals 14
x
while OC ( , x ) <while
)
[l , h ]OC do ( , x i ) < [li , hi ] do [ {( i , i )}
896 to . CalcRedIntervals 14H i i i
utput
such of
return needs to be for
L P (xof)constraints such allthat A(xx). rounds 896synthesis. H end
x , theremode.
if
taining
pair
not. allrinse
such and
pair of repeat.
constraints
input
this for is inspired
all input rno x . by CEGIS
15may be from 11
AdjLower( , H) AdjLower( , H)
14 14 end
use all inputs are reduced to the reduced input 15
Because all inputs L are reduced to (x897
the reduced ) Linput x , there
all Afmay 897
,H (x i )be return
als:
The function
each reduced input
CalculateL’ intransforms
. P (x )each
mustconstraint
produce i ,a
I xvalue that constrains
within into a new
if of 16 > then return
15 15 return
if ;> then return ;
i
ervals ( x ,for
int,produce I x ),each reducedvalue
that correct
constraints input )in L .rounded.
such P (xAf),H
that must produce
satisfies 16ia) value within
I x even in the allpresence
o
for
i i
to
the
longproduce
as PH ( x the
PH ( xwhen
) I xcorrect
i
value when
898
Thus,
rounded.
Awef ,H (x
i combine
Thus, we combine 898
Our pairgoal
end is to0 generate correctly rounded polynomial approximation of f (x) for all k
eductionA(x as) . The function Calculate combines multiple constraints with the
duced input x and produce the pair (x , ) where represents 17 constraint and creates end
i i
ain
or each
educed splitting.
edIntervals reduced
input, i.e. ( , I
returns
i x a single
x inputi
), ( , and
I
xxa list
j x )
j polynomial
produce
L where
containing
i the
jx pair
= x899
the
, that
into
(x , )produces
a where
single
constraint correctly
represents
899 rounded
a final 17 results for all 32-
val.
onstraints for PH .
x . CalcRedIntervals returns a list containing the 18 constraint pair L0 L 18[ {(RR H (x iL),0 [ , L])} 0 [ {(RR (x ), [ , ])}
uced input .
xspecies the constraint on the 900 of P (x ). We
output 900 H i
putspair
x , ) has(xhigh , ) degree. high degree polynomial is
ofaslow.
Pend hence, we split the input domain
Each
satis�es all
i.e.(x[ ), that
P
constraints
] Isatis�es . In all
species
in as
constraints
particular,
theanconstraint
LP problem
valuesinnearas thean 901
onand
bit thegenerate
LP problem
boundary
output
representations
19
of theand
(x ). We
i.e. 901
generate
interval,
return
a may
or 0
19
Tk end with any rounding mode where k is bounded (i.e., k ≤ n). The
L0
x
multiple
i
alue such thatsub-domains.
OC H ( , x i ) I x or OC polynomials
H ( , x i ) I x . 902 are more
Therefore, 20
efficient
we repeatedly check at Lapproximating
whether 20 return
a function in
i i
undary values, and , is correctly output compensated to a value in I x i while reducing the
Interval 902
ounding ]Interval
ary of [the
entify
ler
ch )isdomains.
if theythat
, values
toI xidentify
i and
do not
resulting
OC Hthe
A(x
, values
(lines
) must10-17).
that
and
Finally, such
produces
polynomial
theRN
A(x must largest T representation is T . We propose to generate a polynomial that produces the
we 903
store
thatthe
will
(A(xproduces
)corresponding
�nal
the
such
), T) =reduced
interval I x i = [ , ] where
rounded
have lower
that
input,
(Fig. 9. k
theThe
x irounded
903 faster elementary function.
=degree.
H (x i ) in L
RRfunction CalculateL’ n
transforms each constraint (x i , Ieach L that constrains
x i ) 2constraint into a new Af ,H (x i )
(x i ) constrains
,yx irounded result ( of x i ) = Ifx i(x
8). correctly rounded result of = f (x ), i.e. 904
he
), i.e. RN , T), for
RN (A(x ), T) = RN ( , T), for 904
Fig. 9. The function CalculateL’ transforms (x i , I x i ) A2f L,Hthat into a new
he key challenge in using piecewise polynomial constraint, ( 0 , I x0 i ),constraint,
is to efficiently
xi that constraints
identify ( x0 i ,polyno-
which I x0Pi H
0 such that A
), (that
x i ) constraints ,H( satisfies
Pf H f ,H (x
0 ) suchAthat A I x i even Ainf ,H
if ),H2 satisfies the(xpresence
) 2 I of in the presence of
even
905
M Program. Lang., Vol. 1, No. POPL, Article 1. Publication date: January 2021. x i x
Calculating
Proc. ACM Program. Lang., Vol. 1, No. POPL, Article
he list of constraints L is identi�ed, we merge the
correctly
906
range reductionrounded905
1. Publication date: January 2021.
result
as long as PHof
constraints ( x , I x ), . . . ( x , I x ) L range reduction
( x0fi
) (x)
2asI x0long
i
.in
The the
as PH ((n
function0 + 2)-bit
Calculate
0
x i ) 2 0I x i . The
i
representation
combines
function T n+2 with
multiple constraints
Calculate combines withthe
i
multiple rno mode.
theconstraints with the
to evaluate given an input. naively using branch
same statement
reduced 906input,
1 i
is not optimal.
i.e.
i
( 0 , I 0 we
i
), ( propose
0 , I 0 ) 2 L0 0 where 0 0 0
= 0 , into
0 a single0 constraint
0 and creates a final
x = · · · = x . Each of these constraints bound the output of P H (x) such that A f ,H (x) same reduced
x i x i input, x j xi.e. j ( x i , I x i ), ( xxj i, I x j ) x2j L where x i = x j , into a single constraint and creates a final
907to the same value x . The907
1 i
ces the correct value for each input x i , that reduces function
lit
latethe in domain using
Figure 15 shows how the bit-pattern
we merge We theprove
of First,
the constraints.
908
list allof
input.
for Thisthat
constraints
constraints
identify a list of unique reduced inputs, X (line 1). For each unique reduced input
x , Ithis
(allows
908X ,
i
x ) to
for
listpolynomial
.
efficiently
i identify produces
PHconstraints
of i
for PH .
the correct results for Tk for any standard rounding
ntify all constraints ( x i , I x i ) L where = x i 909 and group the intervals I x i into (line
domain index using
can be considered as the listbit-wise
of constraintoperation. 909 we
that bounds the output of PH ( ). Therefore,
910
a uni�ed constraint by taking intersection of all intervals modes as long as T and Tn+2 have the same number of exponent bits. We present the
in (line 11). If the intersected
910 k
of I , i.e. [ , ] of
0
al, is , then it means that there is no output P H ( ) that satis�es all constraints
911
in and our
x i the list containing
0 0
I I ,, ;.
xi xi i.e.In[ particular,
, ] I x0 i ,values near the boundary
;. In particular, values near of the
theinterval,
boundary or interval,
i.e.of the may i.e. or may
hm terminates by outputting it (line 12). Finally, Calculate returns 911
rgedelementary
constraints ( i , ifunctions
) for each unique fori multiple be
proof in Chapter 6. a
X . The polynomial
912 value
representations such that
and rounding
PH (x) should be generated be a
OC value
H ( modes
, such
x i ) < that
I x i or
OCOC (
H (, x , i
x) i )
< <I x
I x or
i . Therefore,
OC ( , x i )we
< Irepeatedly
x . Therefore,checkwe whether
repeatedly check whether
hat it satis�es the constraints . 912 H i H i
913 the boundary values, and , isvalues,
the boundary correctlyand output , iscompensated
correctly output to a compensated
value in I x i while reducing
to a value in I xthe
i while reducing the
913
are many representations andVol.rounding
Proc. ACM Program. Lang.,
boundary
modes
1, No. POPL, Article now.
1. Publication
ofusing , the
[ 2021.
date: January
ifprior
theyapproaches,
]boundary doofnot [ ,(lines ]we 10-17).
iftothey doFinally,
not (lines we10-17).
store the �nal interval
Finally, we storeIthe
0
x i =�nal ] whereI x0 = [ , ] where
[ , interval
914
Figure 1.5 914presents
and the steps generate
and the a
corresponding polynomial reduced approximation
input, that produces the
i ) in L x i = RR H (x i ) in L 0
0 i
915 OC
to generate a polynomial approximation for each representation H ( , x i ) 2
915
I x i OC H
OC
( ,
H x( ) , x
2 i
and each rounding
i
)
I x
2i and
I x i OC H ( , x i ) 2 I x i and the corresponding xreduced
i = RR H input,
(x
916 (line 18). (line 18).
e. this takes a lot of effort. rno result in Tn+2 . For each input in Tn , we compute the correctly rounded result of f (x) in
917
916
917
4.3 a single
ather, we propose an approach to918generate Calculating 4.3 approximation
polynomial Calculatingthat
918
Tn+2Once
919 with thethe
list
919
rno
of mode.
constraints
Once Next,
L 0 is
the list of we identify
identi�ed, a range
weL 0merge
constraints of
we values
the constraints
is identi�ed,
0 that
merge( the
0 rounds
x 1 , Iconstraints
0 0to 0 the
x i ), . . . ( x i , (I x ix)1 ,2I xL
0 rno 0result.
0 ), . . . ( x i , I x0 i ) 2 L 0
uces the correctly rounded results for multiple representations and
0 rounding modes.
where x i . Each of these constraints bound the output of PH (x)thesuch thatofAPf H,H(x)
0 i
0920 0
920 · · ·
x1 = where
= x 1 = · · · = x i . Each of these constraints bound output (x)such that Af ,H (x)
We
ey idea is to create polynomial that
921 call this
produces
approximates range
thethe of
921 correct values
produces
correctly rounded forthe
valueresult
the fodd
each
correct
of interval.
(x)input x i ,for
value that Given
reduces
each axilist
inputto the ofreduces
same
, that inputs
valuetoand oddfunction
0 . The
xthe
i same intervals, wefunction
value 0 . The use xi
Calculate 922in Figure 15 shows how we 15
merge thehow
constraints. First,
the for all constraints ( x0 iall
, I x0constraints
be used to in Figure shows we merge constraints. First, for
922 Calculate )2 ( x0 i , I x0 i ) 2
the rno mode. Our contribution is in recognizing that rno can generate i
0
923 L , we
the RL IBMidentify
923 a list
approach
0
L 0, of unique
weto reduced
generate
identify ainputs,
a list of (line 1).approximation
polynomial
unique
Xreduced For each
inputs, X 0 unique
0
(line 1).reduced
thateach
For input
produces , correctly
Xthe
unique2reduced input 2 X 0,
924 we identify924
all constraints
we identify
0 0
( xalli x i ) 2 L where
0
, Iconstraints ( x0 i , I x=
0 ) 2 x i Land group the
0 where 0
= intervals
x i and group
I x i into (line I x0 into (line
the intervals
64-10).
can be considered as the list of constraint that
i
bounds the output of Therefore, we
i

rounded result 925 of 4-10).


f (x) incanTbe considered as the list of constraint that bounds P H the output of P
( ). H ( ). Therefore, we
n+2 with the rno mode. To generate efficient approximations
925
926 create a uni�ed
926 constraint by taking
create a uni�ed intersection
constraint by takingof allintersection
intervals in of (line 11). If the
all intervals in intersected
(line 11). If the intersected
927 interval, is927;, then it meansisthat
interval, thereitismeans
;, then no output
that thereP H ( )isthat satis�esP Hall( constraints
no output ) that satis�es in alland our
constraints in and our
for large representations,
algorithm terminates we combine
by outputting it (line theFinally,
12). rno approach with
returns range reduction
the list containing and generate
928 928 algorithm terminates by outputting (line 12). Finally,
itCalculate Calculate returns the list containing
929 the merged 929
constraints
the merged ) for each unique
( i , i constraints ( i , i )ifor 2 Xeach0 . The polynomial 0P (x) should be generated
unique i 2 X . HThe polynomial PH (x) should be generated
piecewise
930 polynomials
such that it930
satis�es using
thethat
such it satis�es.the constraints guided
constraints counterexample . polynomial generation.
931 931
Proc. ACM Program. Lang., Vol. 1,Program.
Proc. ACM No. POPL, Article
Lang., Vol.1.1,Publication
No. POPL, date: January
Article 2021. date: January 2021.
1. Publication
1.2.5 Prototype

Using the RL IBM approach, we generated multiple prototypes of correctly rounded ele-
mentary functions targeting various representations and rounding modes. RL IBM -16 con-
14

tains ten elementary functions for the 16-bit bfloat16 and posit16 types that produce cor-
rectly rounded results with the default, round-to-the-nearest-tie-goes-to-even (rne), round-
ing mode. RL IBM -32 contains ten correctly rounded elementary functions for the 32-bit
float and posit32 types with rne mode. Finally, RL IBM -A LL contains ten functions that
produce the correctly rounded rno results for the 34-bit FP representation. These func-
tions produce correctly rounded results for all FP representations ranging from 10-bits to
32-bits including bfloat16, TensorFloat32, and float types. RL IBM -A LL also contains ten
functions for posit representations. These functions produce correctly rounded results for
10-bit to 32-bit posit types. The functions in RL IBM -A LL support all standard rounding
modes. The resulting elementary functions are faster than mainstream math libraries while
producing correctly rounded results for all inputs in the target representations and rounding
modes. In particular, the floating point functions in RL IBM -A LL have 1.1×, 1.3×, and
1.9× speedup over glibc, Intel, and CR-LIBM math libraries, respectively.

1.3 Contributions to This Dissertation

The ideas and approaches in this dissertation are drawn from previously published papers
written with my advisor Santosh Nagarakatte and collaborators Mridul Aanjaneya and John
Gustafson. They are:

1. ”An Approach to Generate Correctly Rounded Math Libraries for New Floating Point
Variants,” [90] and its corresponding technical report [89] which introduces the initial
RL IBM approach to generate polynomials that approximate the correctly rounded
result of f (x) and how to handle numerical errors in range reduction.

2. ”High Performance Correctly Rounded Math Libraries for 32-bit Floating Point Rep-
resentations,” [93] and its corresponding technical report [96] which scales the RL IBM
approach for 32-bit types with domain splitting and counterexample guided polyno-
mial generation.
15

3. ”A Novel Polynomial Approximation Method to Produce Correctly Rounded Results


for Multiple Representations and Rounding Modes,” [95] which shows the RL IBM
approach to generate polynomials that produces correctly rounded results for multi-
ple representations and rounding modes.

1.4 Organization of This Dissertation

Chapter 2 presents details on the FP representation and the posit representation, the two tar-
get representations of the elementary functions that we generate. Additionally, it provides
background on prior approaches in approximating elementary functions and challenges in
producing correctly rounded results. Chapter 3 presents the RL IBM approach to generate
correctly rounded polynomial approximations of f (x) using LP formulation. Chapter 4
extends RL IBM to handle numerical errors in range reduction and output compensation.
Chapter 5 scales the RL IBM approach to generate piecewise polynomials for 32-bit types.
Chapter 6 presents our approach to generate a single polynomial approximation that pro-
duces correctly rounded results for multiple precision configurations and rounding modes
using the rno mode. Chapter 7 evaluates the correctness and performance of the elemen-
tary functions generated with RL IBM. Chapter 8 discusses the prior related work in creat-
ing efficient and accurate approximations of elementary functions. Chapter 9 concludes by
presenting future directions.
16

CHAPTER 2
BACKGROUND

In this chapter, we provide background on the floating point (FP) representation [27] and
the posit [55, 58] representation. The FP and posit representations are used to approximate
real numbers. We focus on these two representations because FP is the most widely used
representation. While posit was proposed in 2014, there is already excitement and interest
in exploring the posit representation in different domains [18, 45, 77, 102]. We describe
how to interpret an FP or posit bit-string to a real number, round an arbitrary real number
to a chosen representation, and the numerical errors that occur when using these represen-
tations. Next, we describe the state-of-the-art methodology in approximating elementary
functions, performing range reduction, and generating the polynomial approximation. Fi-
nally, we conclude by discussing challenges in generating correctly rounded approxima-
tions.

2.1 The Floating Point Representation

There are two important attributes in any representation that approximates real numbers:
dynamic range and precision. The dynamic range determines the range of values that a
representation can express, both values close to zero and values with large magnitude. The
precision of a representation measures how accurately a real value can be represented.
Given a limited number of bits n to represent real values, number system designers must
make a trade-off between the dynamic range and the precision.
The floating point (FP) representation, specified in the IEEE-754 standard [27], uses
a fixed number of bits to determine the dynamic and the precision of the representation.
They are identified using two parameters, the total number of bits in the representation n
and the number of exponent bits |E|. We denote a particular FP configuration with Fn,|E| .
17

S E1 E2 … E5 F1 F2 F3 … F10 S E1 E2 … E8 F1 F2 F3 … F23 S E1 E2 … E11 F1 F2 F3 … F52


sign exponent mantissa sign exponent mantissa sign exponent mantissa
(a) 16-bit half (b) 32-bit float (c) 64-bit double

Figure 2.1: Bit-string for the IEEE-754 FP representations. (a) 16-bit half. (b) 32-bit float. (c)
64-bit double.

There are three components in an FP bit-string: a sign bit S to represent whether an FP


value is positive or negative, |E|-bits to represent the exponent, and |F | = n − 1 − |E|
bits to represent the fractional value, known as the mantissa. The exponent bits determine
the dynamic range while the mantissa bits determine the precision of the FP representation.
Because FP has a fixed amount of precision (i.e. mantissa bits) for a given representation,
FP is known as a fixed-precision representation. There are three default FP configurations
specified by IEEE-754: 16-bit half (F16,5 ), 32-bit float (F32,8 ), and a 64-bit double (F64,11 ).
Figure 2.1 shows the bit-string representations for them.

2.1.1 Interpreting a Floating Point Bit-String

There are three categories of floating point values depending on the value of the exponent
bits when interpreted as an unsigned integer: normal values, denormal values, and special
case values. In all three categories, if the sign bit is 0 then the value is positive. If the sign
bit is 1, then the value is negative.

Normal values. The value represented by an FP bit-string is a normal value if the expo-
nent bits E are neither all ones nor all zeros, (i.e., 0 < x < 2|E| − 1). In such a case, the
value represented by the bit-string is equal to,

F
(−1)S × (1 + ) × 2E−bias
2|F |

where bias = 2|E|−1 − 1. E and F represent the exponent and mantissa bits interpreted as
an unsigned integer. The bit-string of normal values x in FP encodes the value represented
in a binary scientific notation x = (−1)s × m × 2e where e is the exponent value and
18

0 1 0 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1
sign exponent fraction sign exponent fraction
(a) (b)

0 1 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1
sign exponent fraction sign exponent fraction
(c) (d)

Figure 2.2: Examples of bit-patterns representing a (a) normal value, (b) denormal value,
(c) infinity, and (d) NaN in the 9-bit FP representation with 4 exponent bits (i.e., F9,4 ).

m ∈ [1, 2) is known as the significand. The mantissa bits F encodes the fractional part
of the significand without the leading one to the left of the radix point (i.e., 1.f1 f2 f3 . . . ).
Additionally, the exponent bits E encode exponent value e as an unsigned integer. To
represent both negative and positive exponent, E is scaled with −bias. By encoding the
exponent as an unsigned integer that is scaled with bias, comparison of two FP values with
the same sign can be efficiently performed by comparing the bit-string of the values using
unsigned integer comparisons.

Example. Figure 2.2(a) presents the bit-string of a normal value in a 9-bit FP represen-
tation with four exponent bits (i.e., F9,4 ). In F9,4 , bias = 24−1 − 1 = 7. This bit-string
represents the value,

14 14
(−1)0 × (1 + 4
) × 29−7 = (1 + ) × 22 = 1.875 × 22 = 7.5
2 16

Denormal values. If all exponent bits in E are zeros, then the FP value is denormal. In
such a case, the value represented by the bit-string is equal to,

F
(−1)S × × 21−bias
2|F |

Denormal values are used to represent values close to zero. There is a gap between the
smallest representable normal value 21−bias and 0. By using denormal values, FP represen-
tations can represent values significantly smaller in magnitude compared to 21−bias . The
19

smallest denormal value representable in FP representation is 21−bias−|F | .

Example. Figure 2.2(b) presents the bit-string of a denormal value in F9,4 . Since the
exponent bits are all zeros, it represents a denormalized value. The value encoded by the
bit-string in a binary scientific notation is,

3 3
(−1)1 × 4
× 21−7 = −1 × × 2−6 = −0.1875 × 2−6 = −1.5 × 2−9
2 16

This value is significantly closer to 0 compared to the smallest normal value in F9,4 , which
is 21−bias = 2−6 .

Special values. Finally, if all exponent bits are ones, then the FP value is a special value.
If all mantissa bits are zeros, then it represents ±∞ depending on the sign bit S. Fig-
ure 2.2(c) shows the bit-string of ∞ in F9,4 . In all other cases, it represents not-a-number
(NaN). NaNs are used to represent exceptional conditions such as the result of 0.0
0.0
or ∞−∞.
Figure 2.2(d) shows a NaN value.
Due to the way FP representations are designed, an n-bit FP does not encode 2n unique
values. First, two bit-patterns represent 0: One where all bits are zero and another with
a 1 bit followed by n − 1 zeros. While these two values are sometimes referred to as
+0 and −0, they are semantically equivalent. Second, there are 2|F |+1 − 2 bit-patterns
that represent NaN in a given Fn,|E| configuration. In the 32-bit float, roughly 16 million
bit-patterns represent NaN.

2.1.2 Rounding a Real Number to the FP Representation

The total number of bits in an FP representation is finite. Many real values cannot be
exactly represented in a FP representation. According to the IEEE-54 standard, if a real
value vR cannot be exactly represented, then it is rounded to either the largest FP value
20

All values in this All values in this All values in this All values in this
interval rounds to interval rounds to interval rounds to interval rounds to

vsm (even) midpoint vlg (odd) vsm (odd) midpoint vlg (even)

(a) Round-to-nearest-tie-goes-to-even when vsm is even (b) Round-to-nearest-tie-goes-to-even when vlg is even

All values in this All values in this All values in this All values in this
interval rounds to interval rounds to interval rounds to interval rounds to

vsm midpoint vlg vsm midpoint vlg

(c) Round-to-nearest-tie-goes-away when x < 0 (d) Round-to-nearest-tie-goes-away when x > 0

All values in this interval rounds to All values in this interval rounds to

vsm vlg vsm vlg


(e) Round-toward-zero when x < 0 (f) Round-toward-zero when x > 0

All values in this interval rounds to All values in this interval rounds to

vsm vlg vsm vlg


(g) Round-toward-postiive-infinity (h) Round-toward-negative-infinity

Figure 2.3: When a real value vR is not exactly representable in T, then vR is rounded to one of
the two adjacent values vsm , vlg ∈ T depending on the rounding mode used. We show the range of
vR that rounds to vsm (blue box) and vlg (green box) if we use (a) rne mode when the bit-string of
vsm is even, (b) rne mode when the bit-string of vlg is even, (c) rna mode when x < 0, (d) rna
mode when x > 0, (e) rnz mode when x < 0, (f) rnz mode when x > 0, (g) rnp mode, and (h)
rnn mode.

smaller than vR (vsm ) or the smallest FP value larger than vR (vlg ).

vsm = max{v ∈ F | v ≤ vR }

vlg = min{v ∈ F | v ≥ vR }

Note that if vR is exactly representable in F, then vR = vsm = vlg . Otherwise, vR rounds


to vsm or vlg depending on the choice of the rounding mode, rm. We denote the operation
of rounding vR to an FP representation F using a rounding mode rm with the function
RNF,rm (vR ).
21

Standard Rounding Modes for FP Representation The IEEE-754 standard specifies five
different rounding modes: round to the nearest, tie goes to even (rne), round to the near-
est, tie goes away (rna), round towards zero (rnz), round towards positive infinity (rnp),
and round towards negative infinity (rnn). The standard mandates the result of the basic

arithmetic operations (+, −, ×, ÷) and square root operation ( x) to be correctly rounded.

The rne mode: The round to the nearest, tie goes to even rounding mode rounds vR to
either vsm or vlg , depending on which value is closer to vR . If |vsm − vR | < |vlg − vR |, then
vR rounds to vsm . If |vsm − vR | > |vlg − vR |, then vR rounds to vlg . In the case that vR is
vsm +vlg
exactly in the middle of vsm and vlg (i.e., vR = 2
), vR is rounded to a value where the
bit-string representation is even when interpreted as an unsigned integer. The rne rounding
mode is the default and the most commonly used rounding mode. Figure 2.3(a) shows the
range of real values between vsm and vlg that rounds to vsm (blue region) and vlg (green
region) if the bit-string of vsm is even when interpreted as an unsigned integer. Similarly,
Figure 2.3(b) shows the range of real values that rounds to vsm and vlg if the bit-string of
vlg is even when interpreted as an unsigned integer.

The rna mode: Similar to the rne mode, the round to the nearest, tie goes away rounding
mode rounds vR to the closest FP value (i.e., either vsm or vlg ). If |vsm − vR | < |vlg − vR |,
then vR rounds to vsm . If |vsm − vR | > |vlg − vR |, then vR rounds to vlg . In the case that
vsm +vlg
vR is exactly in the middle of vsm and vlg (vR = 2
), vR is rounded to a value that is
farther away from 0. Specifically, vR rounds to vlg if vR > 0 because

0 ≤ vsm < vR < vlg ≤ ∞

Similarly, vR rounds to vsm if vR < 0 because

−∞ ≤ vsm < vR < vlg ≤ 0

The rna rounding mode is also known as the human rounding since it resembles how we
round decimal numbers, i.e. 1.5 rounds up to 2 when rounded to the nearest integer while
22

1.4 rounds to 1. Figure 2.3(c) and (d) illustrates rounding with rna mode depending on
whether vR < 0 or vR > 0.

The rnz mode: In the round to zero mode, vR is rounded to the value that is closer to 0.
vR is rounded to vsm if x > 0 and vR is rounded to vlg if x < 0. The rnz mode is equivalent
to truncating the significand bits of vR that cannot fit into the mantissa bits, similar to how
real numbers are rounded to a machine integer. Figure 2.3(e) and (f) shows the range of real
values between vsm and vlg that rounds to vsm or vlg when vR < 0 and vR > 0, respectively.

The rnp mode: The round towards positive infinity mode always rounds vR to the larger
value vlg , the value that is closer to +∞. The rnp mode is also known as rounding up.
Figure 2.3(g) shows the range of vR ’s that rounds to vsm or vlg using rnp mode.

The rnn mode: The round towards negative infinity mode always rounds vR to the smaller
value vsm , the value that is closer to −∞. The rnn mode is also known as rounding down.
Figure 2.3(h) shows the range of vR ’s that rounds to vsm or vlg using rnn mode.

2.1.3 A Systematic Methodology for Rounding To the FP Representation

We describe a methodology to systematically round a real number vR . Special values ±∞


or N aN can be directly translated into F. As described above, we need to identify the two
values vsm and vlg in Fn,|E| representations that are adjacent to vR and decide which value
vR rounds to. We identify the four pieces of information (s, v − , rb, sticky) necessary to
correctly decide whether vR rounds to vsm or vlg . We call these pieces of information the
rounding components. The first component s represents the sign (1 or −1) that identifies
whether vR is positive or negative. The value v − , which we call the truncated value, rep-
resents the smaller magnitude value between vsm or vlg . The last two components rb and
sticky determines whether vR is equal to vsm , closer to vsm , in the middle of vsm and vlg ,
closer to vlg , or equal to vlg .
m = 1.9999... = 2.0. If |vR | is a denormal value, then e = 1 mand
bias m = |v2Re | =
= 1.9999... . Next,
2.0. If
we|vR | is a denormal value, then e = 1 bias and m = |vR |
. Next, we

fiE 7fiE
2e

encode the exponent value e and the significand m into the exponent
encode
bitsthe
E exponent
and mantissa
value e and the significand m into the exponent bits E and mantissa
bits M , respectively, and concatenate the bit-strings with a leadingbits
zeroMbit
, respectively,
to representand 23 that
thatconcatenate the bit-strings with a leading zero bit to represent
|vR | is a positive value: |vR | is a positive value:

gwriaiEEnEiitnInsin.b
gwriaiEEnEiitnInsin.b
moth
b
B|vR | =b0E

moth
b2 . . . bE3|E| M
1 1E b41 M2 b. 5. . … … bn

= 0b2 b3 . . . b|E|+1 b|E|+2 . . . bn bn+1 bn+2 . . .


sign |E| exponent bits
bn+1

∞ mantissa bits

b
B|vR | = 00E1 E12 . . . E
1 |E| M…
1 M2 .1. . 0 …

(6.1) = 0b2 b3 . . . b|E|+1 b|E|+2 . . . bn bn+1 bn+2 . . .


sign |E| exponent bits
1 1

∞ mantissa bits

(6.1)

gh
N
Identifying vsm and vlg . Our third step is to identifygh
m
(a)

Identifying
the two values and v to,
that v vcan round
R
bff
m N (b)

. Our third step is to identify the two values that v


sm lg R can round to,
bff
Figure 2.4: (a) Bit-string of |vR |vsm when
vsm and vlg . To identify these values, we first identify two values
represented
adjacent
and vlg . toTo|videntify starts
in the infinite extended precision F∞ , where
R | in F, these values, we first identify two values adjacent to |vR | in F,
there are infinite number of bits for mantissa. (b) Bit-string of |vR | represented in F∞ when vR is
starts
outside the dynamic range of Fn,|E| . from from
127 62 127 62
To identify the rounding components, we first decompose vR into vR = s × |vR | where s
is the first rounding component that represents the sign of vR . The value |vR | represents the
magnitude of vR . Next, we represent |vR | in the FP representation with an infinite number
of mantissa bits while having the same number of exponents. We call this representation
the infinite extended precision representation F∞ . Intuitively, F∞ is similar to Fn,|E| but
has infinite amount of precision to exactly represent |vR |. The bit-string of |vR | in F∞
is pictorially illustrated in Figure 2.4(a). The first bit b1 represents the sign bit. Since
|vR | ≥ 0, b1 = 0. The bits b2 to b|E|+1 represent the exponent bits. The remaining bits
are mantissa bits. If the value of |vR | is larger than the dynamic range, then we represent
it with the largest representable value in F∞ , where the exponent bits are equivalent to the
largest normal value in Fn,|E| and all mantissa bits are ones. The bit-string is illustrated
in Figure 2.4(b). Note that we cannot use ∞ to represent |vR | because we need to make a
clear distinction between a real number and ∞. The rnz mode never rounds a real value to
∞, rnp mode does not round negative real values to −∞, and rnn mode does not round
positive real values to ∞.

Identifying the truncated value. To identify vsm and vlg , we identify two values v − and
v + adjacent to |vR | in F. Intuitively, v − represents the largest value that is smaller than
or equal to |vR | and v + represents the smallest value larger than |vR |. To identify v − , we
truncate B|vR | to n bits,
Bv− = 0b2 b3 b4 . . . bn−1 bn
24

rb = 1 rb = 0 rb = 0 rb = 1

-v+ midpoint -v- v- midpoint v+


- -
(b) Values that corresponds to rb and v When vR ≥ 0
(a) Values that corresponds to rb and v When vR < 0

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

-v+ midpoint -v- v- midpoint v+


-
(c) Values that corresponds to rb, sticky and v When vR < 0 -
(d) Values that corresponds to rb, sticky and v When vR ≤ 0

Figure 2.5: We show the range of real values that corresponds to the different values in the rounding
bit rb when (a) vR < 0 and (b) vR ≥ 0. Additionally, we show the range of values that corresponds
to the different values of rb and sticky bit sticky when (c) vR < 0 and (d) vR ≥ 0.

The value encoded by Bv− in F is v − , the largest value smaller than or equal to |vR |. We
call v − the truncated value, which is the second rounding component. Then, the succeeding
value of v − in Fn,|E| is v + , which can be obtained by adding 1 to the bit-string of v − . In the
context of rounding vR to Fn,|E| , v − and v + have the following relationship,


−v + < vR ≤ −v − if vR < 0 (s = −1)

v − ≤ v < v +
R if vR ≥ 0 (s = 1)

We can compute the two possible values that vR rounds to, vsm and vlg , using s, v − , and
v + . If vR is exactly representable in Fn,|E| , then |vR | = v − . Thus, vsm = vlg = s × v − . If
vR is not exactly representable in F, then v − < |vR | < v + . Hence, vsm = v − and vlg = v +
if s = 1 (when vR ≥ 0). Otherwise, vR is negative and vsm = −v + and vlg = −v − .

Identifying the rounding bit. The two components s and v − alone only informs that
v − ≤ vR < v + if s = 1 and −v + < vR ≤ −v − if s = −1. For the rounding mode rnz,
these two components are sufficient to identify the correctly rounded result, since ±v − is
always closer to zero compared to ±v + . However, to determine the rounding decision for
rne and rna, we must find whether vR is closer to s × v − , closer to s × v + , or exactly in the
middle of the two values. To determine the rounding decision for rnp and rnn, we need to
identify whether vR is exactly representable with s × v − . Our next two steps are to identify
the rounding bit and sticky bit that will help us identify the relationship between vR , v − ,
25

and v + .
We identify the rounding bit (rb) by extracting the (n + 1)st bit from B|vR | . Intuitively,
the rounding bit describes whether |vR | is closer to v − than v + . If the rounding bit is 0, then
v − +v +
|vR | is closer to v − (i.e., v − ≤ |vR | < 2
). If the rounding bit is 1, then |vR | is either in
v − +v +
the middle or closer to v − (i.e., 2
≤ |vR | < v + ). Figure 2.5(a) and (b) shows the range
of real values that corresponds to different values of v − and rb.

Identifying the sticky bit. While the rounding bit tells us whether |vR | is closer to v − , it
does not tell us whether |vR | is exactly equal to v − or is exactly in the middle of v − and
v − +v +
v + (i.e., |vR | = 2
). When looking at the bit-string B|vR | , |vR | is equal to v − when the
(n + 1)st bit (rb) is zero and the remaining bits from (n + 2)nd bits are all zeros. If rb = 0
and any bit afterwards is a one, then |vR | is not equal to v − . Likewise, |vR | is exactly in the
middle of v − and v + when the (n + 1)st bit is a one and the remaining bits from (n + 2)nd
bits are all zeros in B|vR | . If rb = 1 and any bit afterwards is a one, then |vR | is closer to v + .
In both cases, we need to determine whether all bits in B|vR | starting from the (n + 2)nd bits
are zeros or not. We define the sticky bit as the result of bitwise or operation on all bits in
B|vR | starting from the (n + 2)nd bit:

sticky = bn+2 | bn+3 | bn+3 | . . .

where | is the bitwise or operation.


Using the four rounding components (s, v − , rb, and sticky), we can now determine
the relationship between |vR | and the two nearest FP values v − and v + where v + can be
computed from v − :




|vR | = v − if rb = 0 ∧ sticky = 0





v − < |vR | < v − +v +
2 if rb = 0 ∧ sticky = 1

 v − +v +

|vR | = if rb = 1 ∧ sticky = 0


2



 v− +v+
2 < |vR | < v + if rb = 1 ∧ sticky = 1
26

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
- -
(a) rne rounding when s = -1 and bit-string of v is even (b) rne rounding when s = 1 and bit-string of v is even

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
- -
(c) rne rounding when s = -1 and bit-string of v is odd (d) rne rounding when s = 1 and bit-string of v is odd

Figure 2.6: The rne mode using the rounding components. We illustrate the rounding decision
when vR is positive or negative and when v − is even or odd. The green interval represents all real
values that round to the FP value with green color (i.e., −v + when vR < 0 and v − when vR ≥ 0).
Similarly, the blue interval represents all real values that round to the FP value with blue color (i.e.,
−v − when vR < 0 and v + when vR ≥ 0) with blue.

Figure 2.5(e) and (f) pictorially show the range of values for vR that corresponds to the
different values of the rounding components.

Determining the correctly rounded result of vR . Finally, based on the four components
we identified, we identify the correctly rounded value of vR . In the rne mode, vR rounds
to the closest value. If rb = 0, then it indicates that |vR | is closer to v − , which means vR
is closer to s × v − . Thus, vR rounds to s × v − . If the rounding bit rb = 1 and the sticky
bit sticky = 1, then it indicates that |vR | is closer to v + and vR rounds to s × v + . If |vR | is
exactly in the middle of v − and v + , (which also indicates that vR is exactly in the middle of
s × v − and s × v + ), then the rounding bit rb = 1 and the sticky bit sticky = 0. vR rounds
to the value where the bit-string is even when interpreted as an unsigned integer. Finally, if
lb = 1, then the bit-string of v − is odd and vR rounds to s × v + . The rounding decision for
the rne mode can be formalized as follows,




 s × v − if rb = 0





s × v − if rb = 1 ∧ sticky = 0 ∧ IsEven(v − )
RNF,rne (vR ) =



 s × v + if rb = 1 ∧ sticky = 0 ∧ ¬IsEven(v − )





s × v + if rb = 1 ∧ sticky = 1
27

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
(a) rna rounding when s = -1 (b) rna rounding when s = 1

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
(c) rnz rounding when s = -1 (d) rna rounding when s = 1

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
(e) rnp rounding when s = -1 (f) rnp rounding when s = 1

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ - - +
-v -v v v
(g) rnn rounding when s = -1 (h) rnn rounding when s = 1

Figure 2.7: Rounding decisions for various rounding modes based on the rounding components.
The interval of real values colored with green round to the FP value colored green. Similarly, the
interval of real values colored with blue round to the FP value colored blue.

Figure 2.6 shows rounding decision of vR in rne based on different values of rounding
components. We highlight the range of values for vR that round to the smaller value (i.e.,
−v + when s = −1 and v − when s = 1) with the green color and the range of values for vR
that round to the larger value (i.e., −v − when s = −1 and v + when s = 1) with the blue
color.
The rna mode is similar to rne mode. If the rounding bit rb = 0, then it indicates that
vR is closer to s × v − . If the rounding bit rb = 1 and the sticky bit sticky = 1, then it
indicates that vR is closer to s × v + . If vR is exactly in the middle of s × v − and s × v + ,
then the rounding bit rb = 1 and the sticky bit sticky = 0. We must round to the value that
is farther away from 0, which is always s × v + because |vR | < v + . The rounding decision
28

for the rna mode can be summarized as follows,






 s × v − if rb = 0


RNF,rna (vR ) = s × v + if rb = 1 ∧ sticky = 0





s × v + if rb = 1 ∧ sticky = 1

Figure 2.7(a) and (b) shows the rounding decision of vR in rna based on different values of
rounding components.
In the rnz mode, vR always rounds to the value closer to zero. Since v − is always
smaller than or equal to |vR | (i.e., v − ≤ |vR |), vR always rounds to s × v − :

RNF,rnz (vR ) = s × v −

Figure 2.6(c) and (d) shows rounding decision of vR in rnz based on different values of
rounding components.
In the rnp rounding mode, vR always rounds to the larger value. If vR is exactly repre-
sentable with F, which is indicated by the rounding bit with rb = 0 and the sticky bit with
sticky = 0, then vR = s × v − . If vR is not exactly representable and is greater than or equal
to zero (i.e., s = 1) then vR rounds to v + because vR < v + . Otherwise, vR rounds to −v −
since vR < 0 and vR < −v − . Formally,




 s × v − if rb = 0 ∧ sticky = 0


RNF,rnp (vR ) = v + if ¬(rb = 0 ∧ sticky = 0) ∧ s = 1





−v − if ¬(rb = 0 ∧ sticky = 0) ∧ s = −1

Figure 2.6(e) and (f) shows rounding decision of vR in rnp based on different values of
rounding components.
Finally in the rnn rounding mode, vR always rounds to the smaller value if it cannot
be exactly represented. If vR is exactly representable with F (i.e., rb = 0 and sticky = 0),
then vR is equal to s × v − (i.e., vR = s × v − ). If vR is not exactly representable and is
greater than or equal to zero, then vR rounds to v − because v − < vR . Otherwise, vR rounds
29

16 bits 19 bits 8 bits

S E1 E2 E3 … E8 F1 F2 … F7 S E1 E2 E3 … E8 F1 F2 … F10 S E1 E2 E3 E4 E5 F1 F2
sign exponent mantissa sign exponent mantissa sign exponent mantissa
(a) Bfloat16 (b) TensorFloat32 (c) MSFP8

Figure 2.8: Layout of (a) bfloat16, (b) TensorFloat32, and (c) MSFP8.

to −v + since vR < 0 and in that case −v + < vR . The rounding decision for rnn mode can
be formalized with,




 s × v − if rb = 0 ∧ sticky = 0


RNF,rnn (vR ) = v − if ¬(rb = 0 ∧ sticky = 0) ∧ s = 1





 −v + if ¬(rb = 0 ∧ sticky = 0) ∧ s = −1

Figure 2.6(g) and (h) shows rounding decision of vR in rnn based on different values of
rounding components.

2.1.4 Different FP Configurations and Fixed-Precision Variants

The IEEE-754 standard specifies a set of configurations for different lengths of n to be


used in general including half (F16,5 ), float (F32,8 ), and double (F64,11 ) datatype. These
formats have been adopted widely since the introduction of the standard and most modern
commercial hardware architectures and programming languages support basic arithmetic
operations with these types. The float and double datatype have been used in scientific
applications for decades.

Other FP configurations. In the past few years, several new configurations of FP have
been proposed and used. Especially in machine learning and high performance computing
domains, the performance of computations, cost of data storage, and the data transfer rate
is paramount. Additionally, a wide dynamic range is more desirable compared to high pre-
cision to produce accurate results. Thus, FP configurations with small bit-length and wide
dynamic ranges have been proposed for these domains. Google proposed Bfloat16 [131]
(F16,8 ), which is a 16-bit representation with 8 bits of exponent. Similarly, NVidia proposed
30

S F1 F2 F3 F1 F2 … F16

S F1 F2 F3 F1 F2 … F16
E1 E2 … E8 E1 E2 … E5
exponent S F1 F2 F3 exponent F1 F2 … F16
sign mantissa sign mantissa


(a) MSFP12 (a) Flex16+5 (Flexpoint)

Figure 2.9: Layout of (a) MSFP12 and (b) Flex16+5, a Flexpoint configuration. Values in these
representations share the same exponent field. Each MSFP12 value stores its sign bit and mantissa
bits separately from the shared exponent field. Each Flexpoint value stores its mantissa bits sepa-
rately, where the mantissa bits encode both the sign and significand at the same time using a signed
integer format (i.e., the mantissa bits of a value v and its inverse −v are two’s complement of each
other).

tensorfloat32 [108] (F19,8 ), a 19-bit representation with 8 bits for exponent. Both bfloat16
and tensorfloat32 have the same number of exponent bits as the 32-bit float type. Microsoft
proposed MSFP8-11 [100], which are 8 to 11-bit FP configuration with 5 bits of exponent
(F8,5 to F11,5 ). MSFP8-11 has the same number of exponent bits as the 16-bit half type.
Figure 2.8 illustrates the bit-pattern of bfloat16, TensorFloat32, and MPSF8.

Non-standard FP representations. Microsoft proposed MSFP12-16 [116], a 12 to 16-bit


FP-like representation that slightly deviates from the FP semantics. Figure 2.9(a) shows
the layout of MSFP12 bit-strings. Abstractly, MSFP12-16 uses 8 bits of exponent and 3 to
7 bits of mantissa bits. In MSFP12-16, the sign bit and the exponent bits are encoded in the
same way that FP representation does. The mantissa bits encode the entire significand of
the value, similar to how denormal values are encoded in FP representations. The novelty
of MSFP12-16 is that it stores the bit-string of values with similar magnitude together.
Because these values have the same exponent bits, MSFP12-16 stores only one instance
of the exponent field. The sign bit and mantissa bits are stored separately for each value.
Additionally, MSFP12-16 can store values with different magnitudes together by adjusting
the mantissa bits (similar to denormal values in the FP representation).
Similarly, Flexpoint [78] proposed by Intel also stores the bit-string of multiple values
31

together. Figure 2.9(b) shows the layout of Flexpoint bit-strings. Flexpoint only maintains
exponent bits and mantissa bits. To represent negative values, the mantissa bits encode
the sign of the value using two’s complement. More specifically, the mantissa bits of a
Flexpoint value v and its additive inverse −v are two’s complement of each other. Flexpoint
groups the bit-string of multiple values and stores only one instance of the exponent bits.
The mantissa bits are adjusted to represent values of different magnitude. Both MSFP12-
16 and Flexpoint representations are proposed to minimize the amount of space required to
store data and create efficient hardware implementations for primitive operations.

Log number systems. The bit-string of a value x in log number systems (LNS) [44,
112, 130] encodes the logarithm of |x|, i.e., log2 (|x|) = e + f where e is an integer and
f is a fractional value f ∈ [0, 1). Alternatively, we can interpret the LNS bit-string as
representing the value x = (−1)S ×2e+f = (−1)S ×2e ×2f , where S is encoded by the sign
bit, e is encoded by the exponent bits, and f is encoded by the mantissa bits. Depending
on the specific LNS representation, the negative exponent value e < 0 is encoded with
exponent bits using offset (similar to bias in FP representation) or two’s complement form
(similar to signed integer).
For example, consider the bit-string of a value in F9,4 illustrated in Figure 2.2(a). This
bit-string represents the value 1.875 × 22 in F9,4 . Now, let us interpret the same bit-string in
a 9-bit LNS representation with four exponent bits and four mantissa bits. For the ease of
exposition, we interpret the exponent bits with the same strategy as FP representations (i.e.,
using bias). In our 9-bit LNS representation, the exponent bits represent the value e = 2.
The mantissa bits are interpreted as f = 0.875, where the leading bit is zero. Hence, the
value represented by the bit-string in our 9-bit LNS representation is,

22+0.875 = 22.875 ≈ 7.336 . . .

The main advantage of LNS is that multiplication and division operations are more
efficient compared to FP representations. Without the loss of generality, let us suppose
32

n bits

S R1 R2 … R|R| R E1 E2 … Ees F1 F2 … F|F|


sign regime guard exponent fraction

Figure 2.10: Layout of an n-bit posit representation bit-pattern.

x1 = 2e1 +f1 and x2 = 2e2 +f2 are two positive values in a LNS representation. Then,
x1 × x2 can be computed by adding the values e1 , e2 , f1 , and f2 ,

x1 × x2 = 2e1 +f1 × 2e2 +f2 = 2(e1 +f1 +e2 +f2 ) = 2(e1 +e2 )+(f1 +f2 )

Thus, multiplying two values in LNS is as simple as adding the exponent bits and the
fraction bits separately. Division operation follows similarly. LNS is often used in signal
processing where multiplication is one of the most commonly used primitive operations for
Fourier transformations.

2.2 The Posit Representation

Posit [55, 58] is intended to be a stand-in replacement for FP. There are two notable dif-
ferences in posit compared to FP. First, the posit representation provides tapered precision
where different values have different amounts of precision. Specifically, posits represent
values near one with higher precision while providing a large dynamic range. Second,
an n-bit posit representation can represent more distinct values compared to FP. There is
excitement in exploring posits in several domains [18, 45, 77, 102]. Although commer-
cial hardware architectures do not support posit yet, there are hardware proposals for posit
arithmetic [69, 70, 136].
A posit representation Pn,es is identified by two parameters, the total number of bits n
and the maximum number of bits to represent the exponent es. There are five components
to a posit bit-string: a sign bit S, regime bits R, a regime guard bit R, up to es bits of
exponent bits E, and fraction bits F . Figure 2.10(a) shows each of the five components
in a posit bit-string. The number of regime bits, exponent bits, and fraction bits varies
33

depending on the value being represented. The posit representation uses a minimal number
of regime bits and exponent bits to express the magnitude of the value. The remaining bits
are used for the fraction to provide as much precision as possible. The posit standard [55]
specifies four standard configurations of posit, 8-bit posit8 (P8,0 ), 16-bit posit16 (P16,1 ),
32-bit posit32 (P32,2 ), and 64-bit posit64 (P64,3 ).

2.2.1 Decoding a Posit Bit-String

The first bit in a posit bit-string is the sign bit. If S = 0 then the represented value is
positive. If S = 1, then the value is negative. If the value is negative, then the bit-string
is decoded after performing two’s complement of the bit-string, similar to how a signed
integer is decoded. The next three components, R, R, and E are used to represent the
exponent of the value. Abstract, the regime R is the super exponent. It extends the dynamic
range of the posit representation encoded by the exponent bits E. After the sign bit, the
next consecutive 1’s (or 0’s) represent the regime bits. The regime bits are only terminated
when it encounters an opposite bit 0 (or 1), known as the regime guard bit (R), or the
bit-string itself terminates. The size of R can be anywhere between 1 ≤ |R| ≤ n − 1.
If there are any remaining bits after the regime and regime guard bits, up to es of the
next bits (i.e., min{n − 1 − |R|, es} bits) represent the exponent bits E. In the case that
|E| < es (i.e., there were less than es remaining bits), then the exponent bits E is padded
with 0’s to the right until |E| = es. Finally, the remaining bits (i.e. (max{0, n − 1 − |R| −
|E|}) bits) represent the fraction bits. If there is no remaining bit, then we assign a single 0
bit to F (F = 0).

2.2.2 Interpreting a Posit Bit-String

The sign of a posit value is determined using the sign bit, (−1)S . The regime and exponent
es
bits together contribute to the magnitude of the represented value. Let useed = 22 and k
34

0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1
sign regime guard exponent fraction sign regime guard exponent
(a) (b)

0 1 1 1 1 1 1 1 1
sign regime
(c)

Figure 2.11: (a) Bit-string representing the value 1.625 × 25 in h9, 2i-posit. (b) Bit-string repre-
senting the value 2−22 in h9, 2i-posit. (c) Bit-string representing the value 228 in h9, 2i-posit.

be, 


−r the regime bits are 00 s
k=


r − 1 the regime bits are 10 s

Then, the magnitude of the represented value is,

es es ×k+E
useedk × 2E = (22 )k × 2E = 22

where E is interpreted as an unsigned integer. The largest magnitude that the exponent bits
es −1
can encode is 22 . This is exactly a half of useed (i.e., useed
2
). Thus, regime bits extend
this magnitude to extends the dynamic range of representable values.
The significand represented by the posit bit-string is calculated similarly to a normal
value in FP using the fraction bits F , which contributes 1 + F
2|F |
to the final value. Finally,
the final value that a posit bit-string represents is:

   
s F s F es
(−1) × useed × 2 × 1 + |F | = (−1) × 1 + |F | × 22 ×k+E
k E
2 2

There are two special bit-strings in each posit representation that do not follow the
above rule. The bit-string of all 0’s represents the value 0. The bit-string of a 1 followed by
all 0’s represents not-a-real (NaR). NaR represents all exceptional values.
35

Example 1 Consider the bit-string of the P9,2 configuration illustrated in Figure 2.11 (a),
2
with a total of 9 bits with up to 2 exponent bits. The useed in P9,2 is useed = 22 = 24 =
16. The bit-string is decomposed into S = 0, R = 11, R = 0, E = 01, and F = 101.
Because the regime bits (R) consist of consecutive 1’s, k = |R| − 1 = 2 − 1 = 1. Finally,
the value represented by the bit-string is,

 
0 1 15 2 1
(−1) × useed × 2 × 1 + = (22 ) × 21 × 1.625 = 1.625 × 25
8

When a small number of regime bits are used, the represented value is closer to 1. More
bits are used to represent the fraction bits, leading to four precision bits (one implicit 1 bit
to the left of the radix point and three fraction bits in the significand).

Example 2 In some circumstances, posit bit-string may not have fraction bits or even
some exponent bits. Consider the bit-string oof a P9,2 value illustrated in Figure 2.11 (b).
The bit-string contains one exponent bit and no fraction bits. The bit-string decomposes
into S = 0, R = 000000, R = 1, E = 10, and F = 0. Because there is only one
exponent bit (1), we pad a 0 bit to the right of the exponent bit. Regime bits (R) consist of
consecutive 0’s, so k = −|R| = −6. The value represented by this bit-string is,

0 2 −6
(−1)0 × useed−6 × 22 × (1 + ) = (22 ) × 22 × 1 = 2−24 × 22 = 2−22
2

When a larger number of regime bits are used to represent the magnitude of the value, a
smaller number of bits are allocated to fraction bits (in the case, 0 bits), leading to only one
precision bit.

Example 3 In extreme cases, the posit bit-string may consist of only a sign bit and regime
bits. Consider the bit-string of a P9,2 value in Figure 2.11 (c). There are only two com-
ponents in the bit-string, S = 0 and R = 11111111. In this case, both exponent bits and
36

fraction bits become zeros (i.e., E = 00 F = 0) and k = R − 1 = 7. The value represented


by the bit-string is.

0 2 7
(−1)0 × useed7 × 20 × (1 + ) = (22 ) × 1 × 1 = 228 .
2

Comparison with FP representation There are two key differences between posit and
FP. First, there is exactly one bit-pattern representing 0 and one bit-pattern representing
N aR. Hence, every bit-pattern in a given posit represents unique values. Second, given
the same number of bits, posit values have more precision than FP in some range of values
while providing a similar amount or more dynamic range. For example, let us compare
the standard 32-bit FP and posit configurations, which are float and posit32, respectively.
Float values have 24 bits of precision with dynamic range of [2−149 , 2128 ). In comparison,
posit values have up to 27 bits of precision near 1 with the dynamic range of [2−120 , 2120 ].
More specifically, posit values in the range [2−20 , 220 ] have the same or more precision bits
compared to float values in the range. This range is known as the golden zone [37]. A
slightly different 32-bit posit configuration P32,3 provides up to 26 bits of precision, which
is still more than the precision of float, with a dynamic range of [2−240 , 2240 ]. However,
as posit value approaches 0 or increases to a large value, the number of precision bits can
decrease to as little as 1 precision bit, as shown in Figure 2.11(c).

2.2.3 Rounding a Real Number to the Posit Representation

The posit standard supports only one rounding mode, round to the nearest, tie goes to even
(rne). If a real value vR is exactly representable in a posit representation Pn,es , then we
round vR to that value. If vR is not exactly representable, then it is rounded to either vsm ,
the largest posit value smaller than or equal to vR , or vlg , the smallest value larger than
or equal to vR . The high-level idea of posit rounding is similar to rne rounding mode in
FP representations: vR rounds to the closer value between vsm or vlg . If vR is a midpoint
>
>
>
>
: v+ if 6 (rb = 0 ^ sticky ^ s round
= 0) the =0
[ Fixme: To be in the background chapter. ] Explain
well. It will use the exact same method of decomposing vR = s ⇥
6.4.2
37
Rounding a Real Value to Posit Representation with rne Mode
string of |vR | in the infinite extended precision, identifying v , rb,
[ Fixme: correctly
To be in rounded
the background
value v ofchapter. ] Explain
|vR | using the rounding
a deterministic schem
algorithm, a
well. It will
s ⇥use
v, the exact
which same
is the methodvalue
rounded of decomposing
of vR . vR = s ⇥ |vR |, com
vsm vlg
stringvalues
(a) Two adjacent posit of |vRin| in the infinite extended precision, identifying v , rb, and stick
Pn,es
correctly rounded value v of |vR | using a deterministic algorithm, and then fin
Pn+1,es
s ⇥ v, which is the rounded value of vR .
w0 w1 w
Pn,es 6.4.3 Round2to Odd Rounding Mode (rno)
(vsm) (vlg)

(b) Values vsm and vlg in Pn+1,esThe


(w0Rounding
and w2) to odd (rno) is a rounding mode specified neither in

or the standards in the posit representation. However, it has been p


6.4.3 Round to Odd Rounding Mode (rno)
purpose of producing the correctly rounded result of vR even wi
w0 w1 w2
(vsm) The Rounding
(midpoint) to odd (rno)
(v lg)
is a rounding
explain more formally, modethe
let us define specified neither in
representation Tnthe
asIEEE
we h
or the standards
(c) w1 is midpoint between vsm
theand in
vlgthe
chapter. posit
If Tnrepresentation. However,then
is a FP representation, it hasTnbeen
= Fproposed
n,|E| whe
w
purpose ofposit
producing the correctly
representation, then rounded result of vRn even2.with double
Figure 2.12: Illustration of identifying the midpoint between two posit values vsm and vwhere
Tn = Pn,es
lg . (a)
Additional
explain
Suppose there are two adjacent posit values vsm and more
vlg in formally,
Pn,es toletTus
. (b) Then,
corresponds define
n .the
If two the
Tn is a representation
values are exactly Tn asthen
FP representation, we T
have at the
n+2 = F
representable in Pn+1,es (w0 and w2 , respectively) and thereIf isTna isvalue
the chapter. a FPthen between
1 ∈ Pn+1,es then
wrepresentation, Tn =wF0 n,|E| where > |E
representation, Tn+2 = Pn,es . Intuitively, Tn+2 hasntwo m
and w2 . (c) The value w1 is the midpoint between vsm and vlg when rounding a real value vR to
Pn,es . posit representation,
having the then
sameTnumber
n = Pn,es where n bits.2. Then,
of exponent Additionally, we aded
performing
corresponds to Tn . aIfreal
rounding is a FP
Tnvalue representation,
vR to Tn+2 with rnothen Tn+2 =
rounding Fn+2,|E|
mode . If
and then
between vsm or vlg , then vR rounds to the value wherethetheresult
representation, bit-string
then toTn+2 is
= even
Tn with when
n,es . standard
Pany interpreted
Intuitively, has two more
Tn+2 rounding
specified mode bits
rm tp
having the same number of exponent bits. Then, performing a double rou
as an unsigned integer. 120
rounding a real value vR to Tn+2 with rno rounding mode and then subseque
the result to Tn with any standard specified rounding mode rm produces t
Definition of midpoint for posit representations. However, the definition of midpoint for
120
posit representations is slightly different compared to the midpoint in the rne rounding
mode for FP representations. In posit representations, the midpoint of two adjacent values
vsm and vlg in Pn,es is defined as the value in Pn+1,es between vsm and vlg . We illustrate
the midpoint for posit representations with Figure 2.12. Given two adjacent posit values
vsm and vlg in the posit representation Pn,es (shown in Figure 2.12(a)), both vsm and vlg
are exactly representable in Pn+1,es (the values w0 and w2 , respectively in Figure 2.12(b)).
Additionally, there is another value w1 ∈ Pn+1,es between vsm and vlg . The value w1 is the
midpoint between vsm and vlg .
Once the midpoint between vsm and vlg are identified, then rounding a real value vR be-
tween vsm and vlg follows the rule similar to FP’s rne rounding mode. If vR is smaller than
the midpoint, then vR rounds to vsm . If vR is larger than the midpoint, then vR rounds to
38

All values in this All values in this All values in this All values in this
interval rounds to interval rounds to interval rounds to interval rounds to

1.0 1.00… 1.00… 2116 2118 2120


(even) (midpoint) (odd) (even) (midpoint) (odd)
(a) Arithmetic rounding (b) Geometric rounding

Figure 2.13: Illustration of posit rounding behavior when rounding a real value vR to P32,2 . These
behaviors are present for all posit configurations. (a) In most cases, posit exhibits arithmetic round-
ing behavior. (b) However, in some special cases, posit exhibits geometric rounding behavior.

vlg . If vR is equal to the midpoint, then we round to the value where the bit-string represen-
tation in Pn,es is even when interpreted as an unsigned integer. This method of identifying
midpoint allows systematically rounding vR to a posit value to follow a similar strategy as
rounding vR to an FP value, while still maintaining the tapered-precision property.

Different behaviors in posit rounding. In the FP representation’s rne mode, the mid-
point of two adjacent values vsm and vlg is the arithmetic mean between the two values
vsm +vlg
(i.e., 2
). Rounding a real value vR using the arithmetic mean of two adjacent values
as the midpoint is called arithmetic rounding. Instead, rounding vR using the geometric

mean of two adjacent values (i.e., vsm vlg ) as the midpoint is called geometric rounding.
Interestingly, posit rounding has both arithmetic rounding and geometric rounding depend-
ing on the magnitude of vR . In most cases, the midpoint is the arithmetic mean between vsm
vsm +vlg
and vlg , i.e. 2
as illustrated in Figure 2.13(a). Specifically, this occurs when the last
bit of the midpoint in Pn+1,es represents a fraction bit. The values vsm , the midpoint, and vlg
increase arithmetically and rounding vR between vsm and vlg exhibit arithmetic rounding.
However, when vR is an extremal value, posit rounding can exhibit geometric rounding.
Figure 2.13(b) illustrates geometric rounding by rounding vR to P32,2 where vR is between
two largest representable posit values vsm = 2116 and vlg = 2120 . The midpoint w1 between
vsm and vlg is 2118 where w1 is a factor of 4 larger than vsm and vlg is a factor of 4 larger

than w1 . Hence, w1 is the geometric mean of vsm and vlg (i.e. 2116 2120 = 2118 ) and vR is
geometrically rounded to the nearest value. This occurs specifically when the last bit of the
39

All values in this


All values larger than
interval rounds to
2120 rounds to 2120

0 2-120 2120
(a) Saturation to minpos (b) Saturation to maxpos

Figure 2.14: When rounding a real value vR to a posit representation (i.e., P32,2 ), there are two
types of special cases. Without the loss of generality, assume that vR ≥ 0. (a) All vR greater
than 0 and smaller than the smallest positive representable value minpox (i.e., minpos = 2−120 )
rounds to minpos. (b) All vR greater than the largest positive representable value maxpos (i.e.,
maxpos = 2120 rounds to maxpos.)

midpoint in the Pn+1,es representation is either a regime bit, regime guard bit, or exponent
bits.

Special cases in posit rounding. Additionally, there are two classes of special cases when
rounding vR to a posit value. Without the loss of generality, let us assume that vR ≥ 0.
The special cases apply to negative vR similarly. In posit rounding, the only real value
that rounds to 0 is vR = 0 itself. Any value greater than 0 and smaller than the smallest
representable positive value (minpos) rounds to minpos. Figure 2.14(a) illustrates this
special case for the 32-bit standard posit representation P32,2 . This special case ensures
that the rounded value preserves the sign of the original real value vR . In contrast, FP
rounding can underflow non-zero values to 0, losing the sign of vR .
Second, posit does not have a representation for ±∞. Any positive real value outside
of the dynamic range of posit representation is rounded to the largest representable positive
value, maxpos. Figure 2.14(b) illustrates this case for the 32-bit standard posit representa-
tion P32,2 . The largest positive value in P32,2 is 2120 . All values larger than 2120 rounds to
2120 . Thus, posit rounding always rounds real values vR into a real posit value. In contrast,
the rne rounding mode in FP representations can overflow real values to ∞, making no
distinction between values that are ∞ and values that are just too large to represent in the
chosen FP representation.
40

2.2.4 A Systematic Methodology for Rounding To Posit Representation

Rounding a real value vR to a posit value in Pn,es follows a similar strategy as rounding to
FP representations. Intuitively, we identify the four rounding components, s, v − , rb, and
sticky to identify the two values vsm and vlg in Pn,es that are adjacent to vR and decide
which value vR rounds to. To provide a brief of how to identify the rounding components,
we first decompose vR into vR = s × |vR | where s is the first rounding component that
represents the sign of vR . Next, we represent |vR | in the infinite extended precision rep-
resentation P∞ , where P∞ has the same limit in the es bits but can have infinitely many
regime bits and fraction bits. Because the number of available regime and fraction bits is
unbounded, any real value can be exactly represented in P∞ . Third, we truncate B|vR | , the
bit-string of |vR | in P∞ , to n bits to obtain the truncated value v − , which is the second
rounding component. We identify the rounding bit by extracting the (n + 1)st bit in B|vR | to
identify the third rounding component rb, the rounding bit. Finally, we perform a bit-wise
or operation on all bits in B|vR | starting from the (n + 2)nd bit to identify the last rounding
component, sticky.
The truncated value v − is the largest value smaller than or equal to |vR |. Then v + , the
succeeding value of v − is the smallest value larger than |vR |. Additionally, the value that we
obtain by concatenating v − and the rounding bit rb, then decoding the resulting (n + 1)-bit
bit-string in Pn+1,es is the midpoint of v − and v + . Thus, the four rounding components can
be used to determine the relationship between |vR | and the two nearest posit values:




 |vR | = v − if rb = 0 ∧ sticky = 0





v − < |vR | < vmid if rb = 0 ∧ sticky = 1



 |vR | = vmid if rb = 1 ∧ sticky = 0





 vmid < |vR | < v + if rb = 1 ∧ sticky = 1

where vmid is the midpoint between v − and v + .


41

v- = 0 and v- = 0 and
v- = 0 and v- = 0 and
(rb = 1 or sticky = 1) rb = 0 and
rb = 0 and (rb = 1 or sticky = 1)
sticky = 0
sticky = 0

+ - - +
-v = -minpos -v = 0 v =0 v = minpos
(a) Posit rounding near 0 when s = -1 (b) Posit rounding near 0 when s = 1

Figure 2.15: (a) Illustration of rounding a real value vR to a posit value when −minpos ≤ vR < 0
where −minpos is the largest negative representable value. (b) Illustration of rounding a real value
vR to a posit value when 0 < vR ≤ minpos where minpos is the smallest positive representable
value. All real values in the green interval rounds to the posit value with the green color. All real
values in the blue interval rounds to the posit value with the blue color.

v- = maxpos v- = maxpos
rb = 0 or 1 rb = 0 or 1
sticky = 0 or 1 sticky = 0 or 1

- -
-v = -maxpos v = maxpos
(a) Posit rounding for values ≤ -maxpos (b) Posit rounding for values ≥ maxpos

Figure 2.16: (a) When vR is smaller than the smallest representable negative value in a posit
representation, i.e., vR ≤ −maxpos, then vR rounds to −maxpos. This case can be identified with
s = −1 and v − = maxpos. (b) When vR is larger than the largest representable positive value in
a posit representation, i.e., vR ≥ maxpos, then vR rounds to maxpos. This case can be identified
with s = 1 and v − = maxpos.

Determining the correctly rounded result of vR . We now explain the rounding decision
and the final rounded value of vR in the posit representation using the four rounding compo-
nents. First, we determine whether vR is a special case value in posit rounding. If v − = 0,
rb = 0, and sticky = 0, then vR = 0. In this case, vR rounds to 0. If v − = 0 and either
rb = 1 or sticky = 1, then it indicates that |vR | is larger than 0 but smaller than v + , the
smallest representable posit value minpos. Then, vR rounds to s × v + . Figure 2.15(a) and
(b) pictorially shows the rounding decision of vR to posit value based on different values of
rounding components when v − = 0.
Similarly, if v − is equal to the largest representable posit value maxpos, then it indicates
that |vR | is a value outside of the dynamic range of Pn,es . In posit rounding, values outside
of the dynamic range rounds to ±maxpos. Thus, vR rounds to s × v − . Figure 2.16(a) and
(b) illustrates the rounding decision of vR based on different values of rounding components
42

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ midpoint - - midpoint +
-v -v v v
- -
(a) Rounding when s = -1 and bit-string of v is even (b) Rounding when s = 1 and bit-string of v is even

rb = 1, sticky = 0 rb = 0, sticky = 0 rb = 0, sticky = 0 rb = 1, sticky = 0


rb = 1, sticky = 1 rb = 0, sticky = 1 rb = 0, sticky = 1 rb = 1, sticky = 1

+ midpoint - - midpoint +
-v -v v v
- -
(c) Rounding when s = -1 and bit-string of v is odd (d) Rounding when s = 1 and bit-string of v is odd

Figure 2.17: The posit rounding rule using the rounding components for all vR within the dynamic
range of the posit representation. When an interval is colored green, all real values in the interval
rounds to the smaller value (i.e., −v + when vR < 0 and v − when vR ≥ 0). Similarly, all real values
in the interval colored blue will round to the larger posit value (i.e., −v − when vR < 0 and v + when
vR ≥ 0).

when vR is outside of the dynamic range of the posit representation.


If vR is not a special case value, then vR is rounded using the same rule as the rne
rule in FP representations. If the rounding bit rb = 0, then it indicates that |vR | is smaller
than the midpoint vmid . In this case, vR rounds to s × v − . If the rounding bit rb = 1 and
the sticky bit sticky = 1, then it indicates that |vR | is greater than vmid and vR rounds to
s × v + . If |vR | is equal to the midpoint, then the rounding bit rb = 1 and the sticky bit
sticky = 0. In this case, we round to the value whose bit-string is even when interpreted
as an unsigned integer. In the posit representation, the bit-string of a posit value v and
its negative counterpart −v are two’s complement of each other. The two’s complement
operation preserves the parity of the bit-string. Thus, the bit-string of v − is even if and
only if the bit-string of −v − is even. It suffices to check the parity of v − to determine the
correct rounding decision. If rb = 1, sticky = 0, and v − is even, then vR rounds to s × v − .
Otherwise, vR rounds to s × v + . Figure 2.17 pictorially shows the rounding decision of
vR to a posit representation based on different values of rounding components. The posit
43

rounding decision including the special cases can be summarized as follows,




0

 if v − = 0 ∧ rb = 0 ∧ sticky = 0





 s × v + if v − = 0 ∧ (rb = 1 ∨ sticky = 1)







 s × v − if v − = maxpos



RNP,rne (vR ) = s × v − if vR is not special case ∧ rb = 0







 s × v − if vR is not special case ∧ rb = 1 ∧ sticky = 0 ∧ IsEven(v − )






s × v + if vR is not special case ∧ rb = 1 ∧ sticky = 0 ∧ ¬IsEven(v − )





s × v + if vR is not special case ∧ rb = 1 ∧ sticky = 1

2.2.5 Posit Configurations and Posit Variants

The posit standard [55] suggests four standard configurations: The 8-bit posit8 (P8,0 ), 16-
bit posit16 (P16,1 ), 32-bit posit32 (P32,2 ), and 64-bit posit64 (P64,3 ). These configurations
are specifically chosen to provide both a wide dynamic range as well as high precision.
Apart from the standard posit representation, a tapered-precision log number system by
combining posit representations and LNS has been explored and shown to be effective in
the machine learning domain [74].

2.3 Numerical Errors in Finite Precision Representation

Because any FP representation can only represent a finite number of real values, the results
of primitive operations (i.e., rounding, addition, subtraction, multiplication, and division)
can experience rounding error. Modern hardware and libraries produce correctly rounded
results for these primitive operations. However, rounding errors can get amplified with
a certain combination of primitive operations because the intermediate results need to be
rounded [37, 52, 104]. We highlight three important classes of numerical errors that are
highly relevant to generating correctly rounded math libraries.
44

2.3.1 Rounding Error with Extremal Values.

Any representation T has a limit on the dynamic range. When we round a real value
vR outside of the dynamic range, then a significant amount of error between vR and the
rounded results can occur. There are three classes of such rounding errors depending on
the rounding mode: overflow, underflow, and saturation error.
Overflow error occurs when a value larger than the largest representable value in T is
rounded to ∞ (i.e. 2130 rounds to ∞ in the 32-bit float type using the rne mode). When
evaluating an expression in T, if an intermediate result encounters an overflow error, then
the final value of the expression will likely evaluate to ∞, 0, or N aN . It can occur even if
the expression evaluates to a value representable by T if evaluated in reals. Underflow error
occurs when values close to 0 rounds to 0 (i.e. 2−150 rounds to 0 in the 32-bit float using the
rne mode). Although the absolute error of vR and 0 may be small, underflow error loses
the information on the sign of vR , which can be detrimental in some applications. Finally,
saturation error occurs when any value outside of the dynamic range rounds to either the
smallest or the largest representable value. Rounding posit values experience saturation
error.
All three types of error (overflow, underflow, and saturation error) can produce results
with significant error, which can make accurate approximation and error analysis difficult.
Thus, math libraries typically classify inputs that cause these errors into special cases and
directly return the correct result.

2.3.2 Double Rounding

Double rounding occurs when a value vR is first rounded to a representation T1 using a


rounding mode rm1 , then rounded again to a smaller representation T2 using a rounding
mode rm2 . In some cases, double rounding is harmless and produces the same result as if
rounding vR directly to T2 using rm2 rounding mode. However, with a certain combination
of rm1 and rm2 , it is not guaranteed that double rounding produces the correctly rounded
45

Correct Rounding Double rounding error

FPn+1 w0 w1 w2

FPn v0 (odd) v1 (even)

Figure 2.18: An illustration of the double rounding error. Star represents a real value vR . Circles
represent the two values v1 and v2 adjacent to vR in an n-bit representation Tn . An (n + 1)-bit
representation Tn can exactly represent both v1 and v2 (w0 and w2 , respectively). In addition, there
is another value w1 between w0 and w1 , highlighted with the rhombus. Rounding vR to Tn+1 and
then subsequently rounding the result Tn produces a value different from directly rounding vR to
Tn .

result, known as the double rounding error. Intuitively, double rounding error occurs be-
cause the error in rounding vR to T1 is sometimes significant enough to affect the rounding
decision of rounding the value in T1 to T2 . We illustrate double rounding error with Fig-
ure 2.18. Consider a case of rounding a real value vR (star) to an n-bit representation Tn .
There are two values v0 and v1 in Tn adjacent to vR . When we directly round vR to Tn
using the rne rounding mode, vR rounds to v0 . Instead, let us perform double rounding
by first rounding vR to Tn+1 , an (n + 1)-bit representation with 1 additional precision bit
compared to Tn . Tn+1 can exactly represent v0 and v1 . Additionally, Tn+1 can exactly
represent another value w1 between v0 and v1 . Using the rne rounding mode, vR rounds to
w1 in Tn+1 . If we then round the result w1 to Tn using rne rounding mode, it rounds to v1
which is not the correctly rounded result of vR in Tn .

2.3.3 Cancellation Error

When two similar but inexact values are subtracted from each other, cancellation errors can
occur. These two similar values have the same exponent bits and several significant bits of
the fraction bits are the same. Subtracting the values result in the significant fraction bits
becoming zero. The result then depends on the least significant bits, which are inexact,
significantly increasing the error of the result. Cancellation error where all bits are wrong
with respect to the correct answer obtained by evaluating the expression in real numbers is
46

known as the catastrophic cancellation. Typically, catastrophic cancellation results in the


subtraction operation to produce 0 (when the correct answer should not be 0), leading to
difficulties in accurately evaluating mathematical expressions.

Illustration of catastrophic cancellation. Consider an example of evaluating the expres-


sion b2 − 4ac using F9,4 representation with rne mode, where a = 1.125, b = 2.0, and
c = 0.875. All three values are exactly representable. Mathematically, the result of the
expression is b2 − 4ac = 0.0625, a value that can also be exactly represented with F9,4 .
If we evaluate b2 − 4ac with a 9-bit FP representation F9,4 , then b2 = 4.0 rounds to 4.0,
which is an exact result. However, 4ac = 3.9375 rounds to 4.0 because 3.9375 cannot be
exactly represented. This rounding error combined with the subtraction operation causes
catastrophic cancellation, producing an incorrect result 0 (compared to the correct result
0.0625). If this result is used to determine the number of roots in ax2 + bx + c, then it can
have catastrophic consequences.

2.4 Prior Work on Approximating Elementary Functions

The state-of-the-art techniques in approximating an elementary function f (x) for a target


representation T involves two steps in general. First, mathematical reasoning and approx-
imation theory is used to derive a function AR (x) that approximates f (x) in real numbers.
Second, the approximation function AR (x) is implemented as AH (x) using a finite preci-
sion H that has higher precision than T.

2.4.1 Approximating AR (x)

Deriving AR (x) further involves three steps. First, we identify inputs that exhibit special
behavior (i.e., ±∞). Second, we use range reduction to reduce the entire input domain
[a, b] to a smaller domain [a0 , b0 ] ∈ [a, b] and perform any other function transformations.
Depending on the range reduction and transformations used, the function that needs to be
47

approximated may be transformed into a different function g(x). Third, we generate a


polynomial P (x) that closely approximates g(x) in the domain [a0 , b0 ].

Identifying special cases. There are three types of special cases. The first type includes
inputs that produce undefined values (i.e., NaN or NaR) or ±∞. For example, in the
case of f (x) = ex , f (x) = ∞ if x = ∞ and f (x) = 0 if x = −∞. The second type
consists of hard to compute results. For example, the only real number that rounds to 0
in the posit representation is 0 itself. In the case of approximating f (x) = ln(x) for posit
representations, it is simpler to classify x = 1 as a special case because ln(1) = 0 and the
slightest amount of error in approximation will round to a non-zero result.
The third type consists of inputs that produce interesting outputs when evaluating f (x).
These cases include inputs that produce results with underflow, overflow, saturation er-
ror, or ranges of inputs that produce the same rounded result RNT,rm (f (x)). For ex-
ample, when approximating f (x) = ex for the 32-bit float with the rne mode, all in-
puts x ∈ (−∞, −103.9 . . . ] produce the result RNF,rne (ex ) = 0 (underflow), all inputs
x ∈ [88.72 . . . , ∞) produces results that round to ∞ (overflow), and the results for inputs
between [−2.98 · · · × 10−8 , 5.96 · · · × 10−8 ] rounds to 1.0. These inputs can be explicitly
filtered out by directly returning the result, significantly reducing the input domain where
we have to approximate f (x).

Range reduction. It is mathematically simpler to approximate f (x) for a small domain


of inputs. Therefore, most math libraries use range reduction to reduce the entire input
domain [a, b] into a smaller domain [a0 , b0 ] before generating the polynomial approximation.
The basic intuition of range reduction is to reduce an input x ∈ [a, b] to a different value
x0 ∈ [a0 , b0 ], where [a0 , b0 ] ⊆ [a, b]. Then, the polynomial P (x) approximates the output y 0
using the reduced input x0 , y 0 = P (x0 ). Finally, the output y 0 is compensated to produce
the result for the original input x. The process of reducing the input is called the range
reduction (x0 = RR(x)) and compensating the output y 0 to produce y is called output
48

compensation (y = OC(y 0 )).

Example of range reduction. Consider the function f (x) = log2 (x) where the input
domain is defined over (0, ∞). One way to reduce the range of inputs is to use the mathe-
matical identity log2 (a×2b ) = log2 (a)+b. We decompose the input x into x = (1+x0 )×2E
by representing x using the scientific notation. Here, x0 ∈ [0, 1) is the fractional part of the
significand and the exponent E is an integer. Using the identity,

log2 (x) = log2 ((1 + x0 ) × 2E ) = log2 (1 + x0 ) + log2 (2E ) = log2 (1 + x0 ) + E

The result of log2 (x) can be approximated by computing log2 (1 + x0 ) + E. Thus, we can
approximate log2 (x) by first range reducing the input x into the reduced input x0 . The
value E can be identified by computing the exponent of x. Then, we approximate g(x0 ) =
log2 (1 + x0 ) using a polynomial approximation P (x0 ). Finally, we compensate the output
of P (x0 ) by computing P (x0 ) + E, where the value of E is dependent on the original input
x. The range reduction function, output compensation function, and the function we must
approximate can be summarized as follows,
x
RR(x) = −1 OC(y 0 ) = y 0 + E g(x0 ) = log2 (1 + x0 )
2E
This range reduction reduces the domain of inputs that P (x0 ) has to approximate from
(0, ∞) to [0, 1). Note that even though our goal was to approximate log2 (x), the function
that we have to approximate with P (x0 ) is g(x0 ) = log2 (1 + x0 ). Math libraries often
exploit this property of range reduction to transform f (x) to another function g(x) that
may be easier to approximate.

Suitable range of reduced input domain. Reducing the size of the input domain means
that P (x0 ) needs to be accurate only for the reduced domain [a0 , b0 ]. Compared to the entire
input domain, we may be able to generate lower degree polynomial approximation while
maintaining the same error threshold. In addition, if the range reduction reduces the domain
49

to within (−1, 1) (i.e., [a0 , b0 ] ⊆ (−1, 1)) as shown in the above example, then the accuracy
of the polynomial approximation can be increased significantly higher. To understand the
intuition, consider the Taylor series expansion of log2 (1 + x0 ),

x0 x02 x03 x04


log2 (1 + x0 ) = − + − + ...
ln(2) ln(4) ln(8) ln(16)

While this infinite polynomial expansion is equivalent to log2 (1 + x0 ), a finite-degree poly-


nomial (truncated after a number of terms) is an approximation of log2 (1 + x0 ). The longer
the polynomial, the more accurate the approximation is. However, the convergence rate of
the polynomial expansion can vary significantly depending on the value of x0 . If x0 is sig-
nificantly larger than 1, then the magnitude of each subsequent term keeps becoming larger
x0i+1 x0i
(i.e., ln(2)i+1
> ln(2)i
) until i becomes large enough. Thus, the polynomial approximation
may even diverge from the result for several terms before converging to the result. It will
require a large degree polynomial to accurately approximate log2 (1 + x0 ).
Comparatively, if x0 ∈ (−1, 1), then the magnitude of each subsequent term will be
x0i x0i+1
smaller (i.e., ln(2)i
> ln(2)i+1
). The lower degree terms contribute the most to the final
result. In such a case, the polynomial expansion converges much faster and we can approx-
imate log2 (1 + x0 ) accurately with a polynomial of a small degree. Additionally, because
the magnitude of each subsequent term is much smaller, the subtraction will not experience
cancellation errors. Hence, it is ideal to design range reduction such that the reduced input
domain is a subset of (−1, 1) and ideally as close to 0 as possible.

Types of range reductions. There are two categories of range reductions: Multiplicative
reduction and additive reduction. A multiplicative range reduction occurs when the reduced
input x0 is computed using either a multiplication or a division, i.e. x0 = x×C for a constant
C. An additive range reduction occurs when the reduced input x0 is computed using either
an addition or a subtraction, i.e., x0 = x + C.
Like any other computations, range reduction can also experience numerical error. Ad-
50

ditive reduction must be designed with extra care due to catastrophic cancellation. Con-
sider the elementary function sin(x) where the function is defined over the input domain
(−∞, ∞). The sin(x) function is a periodic function with the mathematical identify
sin(x + 2π) = sin(x). Thus, we can decompose the input x into x = x0 + 2πk where
k is an input and x0 ∈ [0, 2π). Then, sin(x) can be computed with sin(x0 ). This range
reduction allows us to approximate sin(x) only for the input domain [0, 2π). In this range
reduction strategy, the reduced input x0 can be calculated with,

x0 = x − 2πk

However, because π is an irrational value, finite precision representations like FP or posit


cannot exactly represent π or 2πk. If both x and 2πk are significantly large values, then
x0 − 2πk can experience catastrophic cancellation resulting in x0 = 0. The result of sin(x0 )
will then be significantly different from sin(x).
Unlike the additive range reduction illustrated for sin(x), the range reduction for log2 (x)
described above (i.e., x0 = x
2E
− 1) has been carefully reasoned and determined that it does
not experience numerical error when evaluated in FP representation. The expression x
2E

can be computed exactly. Additionally, the result of the expression is a value between 1
and 2, i.e., x
2E
∈ [1, 2). Sterbenz Lemma [126] states that the subtraction of any two FP
y
values x and y can be computed exactly as long as 2
≤ x ≤ 2y. Thus, x
2E
− 1 can be
computed exactly.

Polynomial approximation. A common approach to approximate an elementary function


f (x) is with a polynomial function P (x). Compared to other iterative approximation meth-
ods, polynomial approximations can be implemented efficiently with addition, subtraction,
and multiplication operations only. Polynomial approximations are known to be one of the
most efficient approximation techniques.
There are two widely used approaches in generating polynomial approximations. The
least-squares approximation aims to minimize the L2 -norm of the polynomial compared to
51

the real result of f (x),

s
Z b
min ||P (x) − f (x)||2 = min (P (x) − f (x))2
a

The least-square approximation generates polynomials that minimize the average error.
Thus it is a useful technique to generate a polynomial that approximates unstable data
with outliers. However, because there is no bound on the error for a specific input (i.e.,
|P (xi ) − f (xi )| for an arbitrary input xi in the input domain), it is not commonly used
to approximate elementary functions. In the context of math libraries, the goal lies in
minimizing the error for all inputs, rather than most inputs.
The more common approach to generate P (x) is the minimax approximation, which
aims to minimize the maximum error or L∞ -norm,

min ||P (x) − f (x)||∞ = min sup |P (x) − f (x)|


x∈[a,b]

where sup is the supremum function that identifies the maximum value in a set. Minimiz-
ing the L∞ -norm provides a guarantee on the bound of the error for all inputs, making it
a more appealing approach for approximating mathematical functions. Thus, most prior
approaches in generating polynomial approximation use the minimax approach, based on
the Weierstrass approximation theorem and the Chebyshev alternating theorem [137]. The
Weierstrass approximation theorem proves the existence of a minimax polynomial: If f (x)
is a continuous real function in the input domain [a, b], there exists a polynomial P (x) such
that the maximum error is bounded, i.e., |f (x) − P (x)| <  where  > 0 for all x ∈ [a, b].
The Chebyshev alternating theorem extends the Weierstrass approximation theorem and
states that the polynomial of degree d that minimizes the maximum error is guaranteed to
have exactly d + 2 points where the error of P (x) at these points is the maximum and it
alternates in sign. Remez algorithm [114] is the state-of-the-art method in generating such
a polynomial. Remez algorithm also provides the maximum error of P (x), which can be
52

used to analyze the error bound of the approximation function.

2.4.2 Implementation in Finite Precision

Finally, the mathematically derived approximation AR (x) of an elementary function f (x)


is implemented in finite precision representation. Typically, the implementation uses a
higher precision representation than the intended target representation (T) to minimize the
error of the intermediate result when evaluating range reduction, output compensation, and
polynomial approximation. We use AH (x) to represent the implementation of AR (x) using
the representation H, which has higher precision than T (i.e., T ⊆ H). The result of AH (x)
is then rounded to T and produces the final result.

2.5 Challenges in Generating Correctly Rounded Functions by Approximating f (x)

An approximation function of f (x) is defined to be a correctly rounded function for T rep-


resentation with rm rounding mode if it evaluates to RNT,rm (f (x)) for all inputs x in the
input domain. There are two key challenges in generating correctly rounded approxima-
tion. First, because P (x0 ) is an approximation of f (x0 ), P (x0 ) incurs approximation error
approx = |P (x0 )−f (x0 )| > 0. Second, evaluating AH (x) in a finite precision representation
incurs rounding error compared to AR (x): round = |AR (x) − AH (x)| > 0.
Because P (x0 ) is an approximation of f (x0 ) and AH (x) is evaluated in finite precision,
it is impossible to reduce both approx and round to zero. The total error ,

 = |AH (x) − f (x)| ≤ approx + rounding

cannot be 0 for all inputs x. The approximation error can be reduced by increasing the
degree of polynomial approximation and the rounding error can be reduced by increasing
the precision of H. Because increasing the degree of the polynomial or the precision of
H results in performance slowdown, the math library developer has to make trade-offs
53

Result of float
Correct 10x Rounded result after
math library
Rounding using float math library

)
b1 = 0.95703125 0.958984375 b2 = 0.9609375

Figure 2.19: Horizontal axis represents a real number line. Given an input x =
−0.0181884765625 that is exactly representable in Bfloat16, b1 and b2 represent the two closest
Bfloat16 values to the real value of 10x = 0.958984357 . . . . The correctly rounded Bfloat16
value is b1 (black star). If an approximation of 10x has more than  = 1.70 · · · × 10−8 error, then
the result may round to b2 , which is an incorrect result. Using a correctly rounded 32-bit FP math
library to approximate 10x for Bfloat16 results in wrong results. When we use the 32-bit FP
library to compute 10x , it produces the value shown with red diamond, which then rounds to b2
producing an incorrect result.

between reducing error or increasing the performance.

Identifying the error bound for correctly rounded approximation Unfortunately, there
is no known general method to predict the error bound for  such that the rounded re-
sult of the approximation (i.e., RNT,rm (AH (x))) is equal to the rounded result of f (x)
(i.e., RNT,rm (f (x))) for all inputs x. Since f (x) can be arbitrarily close to the round-
ing boundary of two representable values, the error  may need to be arbitrarily small to
produce the correctly rounded result. To illustrate this problem, consider approximating
f (x) = 10x for bfloat16 (B) using rne mode with the input x = −0.0181884765625,
which is exactly representable in B. Figure 2.19 illustrates our example. The two closest
representable B = F16,8 values to the real value of 10x = 0.958984357 . . . is b1 and b2 in
Figure 2.19. Using the rne rounding mode, 10x rounds to b1 (highlighted with black start).
Consider the rounding boundary between b1 and b2 . For any value v between b1 and b2 ,
if v < 0.958984375, then v rounds to b1 . Otherwise, v rounds to b2 . This infers that the
approximation of 10x must have an error less than

 < |0.958984357 · · · − 0.958984375| = 1.70239 · · · × 10−8


54

Otherwise, 10x +  will round to b2 , producing an incorrectly rounded result. Thus, the
approximation function AH (x) must produce a value that has a relative error less than
1.77 · · · × 10−8 , which requires roughly d−log2 (1.77 · · · × 10−8 )e = 26 precision bits
to determine the correctly rounded result. 26 precision bits are more than 3× the precision
of bfloat16 (which has 8 precision bits).
This problem is widely known as the table-maker’s dilemma [75], which states that
there does not exist a general method to predict the amount of precision (i.e., H) required
to compute the correctly rounded result in T for all inputs. The only way of determining
the necessary precision is to iteratively compute the real result of f (x) for all inputs and
manually analyzing the error bound [85].

Using existing math library. An alternative method to create a math library for a repre-
sentation T is to use an existing math library for T0 where T ⊆ T0 . We can convert the
input x to x0 ∈ T0 , use the math library for T0 , and round the result back to the target repre-
sentation T. This strategy is especially appealing if a correct math library for T0 exists and
T0 has significantly more precision than T. If this strategy is viable, we can even employ
the same strategy to create math libraries for different representations T1 , T2 , T3 , · · · ⊆ T0 ,
thus opening the possibility to use a math library for T0 as a generic math library.
However, using a correctly rounded math library designed for T0 does not guarantee a
correctly rounded result for T because of double rounding error. Our example with Fig-
ure 2.19 illustrates this problem. If we use a math library for 32-bit float (F) that produces
correctly rounded results with rne mode to approximate 10x , then it produces the value
y 0 = 0.958984375 (represented with red diamond in 2.19). In the context of the float
math library, y 0 is the correctly rounded result, i.e., y 0 = RNF,rne (10x ). However, when we
round y 0 to bfloat16 because y 0 cannot be exactly represented in bfloat16, then y 0 rounds to
RNB,rne (y 0 ) = b2 , which is not the correctly rounded result. Therefore,

(RNB,rne (RNF , rne)(10x )) 6= (b1 = RNB,rne (10x ))


55

Due to double rounding error, using an existing high precision math library to create a
lower precision math library does not guarantee to produce the correctly rounded result for
all inputs, unless the math library is specifically created to be a generic math library.

Summary. The FP representation is a fixed-precision representation that is the most widely


used to approximate real numbers. The Posit representation is a tapered-precision represen-
tation that has been gaining excitement as a possible alternative of FP for its wide dynamic
range and high precision near 1. Real values that cannot be exactly represented in the FP
or the posit representation can be rounded to the FP or posit value with one of the five
rounding modes using the rounding components (s, v − , rb, sticky). Any computation in
finite precision can experience numerical errors caused by overflow, underflow, saturation,
double rounding, and cancellation error.
In the past, math libraries were generated using the minimax approach that gener-
ates polynomials that approximate the real value of f (x). To generate efficient polyno-
mials, range reductions are used to reduce the entire input domain into a small domain.
Generating a correctly rounded math library by mathematically approximating f (x) is
challenging because the error of the approximation may have to be arbitrarily small (i.e.
 = |AR (x)−f (x)|). Evaluating the approximation function with finite precision arithmetic
(AH (x)) incurs evaluation error, also makes the problem harder. Additionally, approximat-
ing an elementary function for T using a higher precision math library does not guarantee
a correctly rounded result due to double rounding error. Thus, either we have to create a
separate math library for each desired representation Ti , or create a generic math library
designed specifically to generate correctly rounded results for various representations.
56

CHAPTER 3
THE RLIBM APPROACH FOR CORRECTLY ROUNDED POLYNOMIAL
APPROXIMATIONS

In this chapter, we describe our approach to generate polynomial approximations that pro-
duce the correctly rounded results of elementary functions f (x). Existing methods generate
polynomials that approximate the real value of the elementary function f (x) and produce
wrong results due to approximation and rounding error in the implementation. In contrast,
we advocate for generating polynomials by approximating the correctly rounded value of
f (x) (i.e., the real value f (x) directly rounded to the target representation). The novelty of
our approach is that it provides more freedom in generating efficient polynomials that pro-
duce correctly rounded results. Then, we frame the problem of generating the polynomial
that produces the correctly rounded result as a linear programming problem. The approach
and intuition presented in this chapter is the foundation for the remaining chapters when
generating polynomials with range reduction (Chapter 4), generating piecewise polyno-
mials (Chapter 5), and generating polynomial approximations for multiple representations
and rounding modes (Chapter 6).

3.1 Approximating The Correctly Rounded Result

A correctly rounded result of an elementary function f (x) is defined as the value produced
by computing the real value of f (x) and rounding it to the target representation. While there
has been a large body of work to create accurate approximations of elementary functions [8,
15, 17, 20, 22, 30, 46, 56, 60, 71, 80, 97, 103, 119, 128, 135, 135], developing elementary
functions that produce correctly rounded results for all inputs is still a challenging problem.
There are a few correctly rounded math library for the commonly used float and double
datatype [29, 65, 101, 146], but widely used math libraries [51, 67] do not produce correctly
57

Available freedom Available freedom


when approximating f(x) when approximating x1
Correctly Rounded Correctly Rounded

y0 y1 f(x) y2 y0 y1 f(x) y2

(a) (b)

Figure 3.1: Illustration of the amount of freedom available (red box) in generating polynomials
that produce correctly rounded results when (a) approximating the real value of f (x) and (b) ap-
proximating the correctly rounded result itself. We show the real value of f (x) with a star. The
correctly rounded result of f (x) is shown with a filled circle.

rounded results for all inputs.


Prior approaches generate polynomials that minimize the maximum error among all
input points compared to the real value of the elementary function f (x), known as the
minimax approach. Remez algorithm [114] is a procedure to identify such minimax poly-
nomials. The primary challenge in generating a correctly rounded polynomial approxima-
tion using the minimax approach is that the error of the polynomial approximation (i.e.,
|P (x) − f (x)|) may need to be arbitrarily small. When the real value of f (x) is arbitrarily
close to the rounding boundary of two adjacent values in the target representation, even
a small amount of error can produce the wrong result. We illustrate this problem with
Figure 3.1(a), where we show the task of approximating f (x) (star) and producing the cor-
rectly rounded result in our target representation using the rne rounding mode, where the
correctly rounded result is y1 (filled circle). Because f (x) lies close to the midpoint be-
tween y1 and y2 , the two adjacent values in our target representation, there is only a small
amount of freedom in generating a polynomial that produces the correctly rounded result
(red box). This challenge stems from the fundamental disconnect between the goal of math
libraries and the purpose of the minimax approach. The goal of math libraries is to pro-
duce the correctly rounded result while the goal of the minimax approach is to generate a
polynomial that approximates the exact value of f (x) in real number.
58

Key idea #1: Approximate the correctly rounded result. Our key idea is to generate
a polynomial that approximates the correctly rounded result of f (x). We illustrate the
advantage of our approach with Figure 3.1(b). By approximating the correctly rounded
result y1 , our polynomial approximation can produce any value that will round to y1 when
rounded to our target representation (red box). There is a range of values [l, h] in the
vicinity of y1 where all values in the interval rounds to the correctly rounded result y1 . We
call this range of values the rounding interval. As long as we generate a polynomial that
produces a value in [l, h], the result of the polynomial rounds to the correctly rounded result
y1 . This insight provides much more freedom to generate a correctly rounded polynomial
approximation. Our key idea is inspired by the Minefield method [56]. Our contribution
lies in developing techniques to systematically and automatically generate a polynomial
that produces the correctly rounded results, which we describe below.

Key idea #2: Generating the polynomial as a linear programming (LP) problem. Each
input x and the corresponding rounding interval [l, h] defines a constraint on the coefficients
of the polynomial P (x) that we wish to generate. The result of the polynomial P (x) must
be a value between l and h to produce the correctly rounded results of f (x) (i.e., y1 ) for
a given input x. Specifically, the constraint pair (x, [l, h]) is a linear constraint on the
coefficients of the polynomial P (x):

l ≤ c0 + c1 x + c2 x 2 + · · · ≤ h

Abstractly, a polynomial with the coefficients that satisfy the above constraint will produce
the correctly rounded result of f (x) for the input x. Now our problem is to generate a
polynomial that satisfies all linear constraints specified by the constraint pair (x, [l, h]) for
each input x in the target representation. Hence, we frame the problem of generating a
polynomial that produces the correctly rounded result as a linear programming (LP) prob-
lem. We construct a system of linear inequalities that solves for the coefficients of the
polynomial that satisfy all linear constraints (x, [l, h]) and use an LP solver to solve for the
59

coefficients. The polynomial constructed with the resulting coefficients, when evaluated in
real numbers, is guaranteed to produce the correctly rounded result of f (x).

Key idea #3: Search and refine technique. Evaluating the resulting polynomial in a fi-
nite precision representation, however, does not guarantee to produce the correctly rounded
result. Thus, we propose a search-and-refine technique. Initially, we generate a candidate
polynomial that satisfies all constraints (x, [l, h]) when evaluated in real numbers. If eval-
uating the polynomial in a chosen finite precision representation does not satisfy any con-
straint, then we iteratively restrict the interval [l, h] to a smaller interval and solve for the
polynomial. We repeat the process until we identify a polynomial that produces the cor-
rectly rounded result when evaluated in the finite precision representation, or the LP solver
determines that it is infeasible to find such a polynomial.

3.2 Illustration Of Our Approach

We provide an end-to-end example of generating a polynomial approximation of ln(x)


that produces the correctly rounded results for all inputs in a 5-bit FP representation with
2 exponent bits (i.e., FP5 = F5,2 ) using the rne mode. We show this example only to
illustrate our approach and insight. In practice, creating a look-up table with pre-computed
results is more beneficial for a 5-bit FP. The steps of our approach are shown in Figure 3.2
and the resulting polynomial is shown in Figure 3.4.
The ln(x) function is defined over the input domain (0, ∞). There are 11 values ranging
from 0.25 to 3.5 in FP5 within the input domain. The remaining 25 − 11 = 21 values are
special case inputs where the result is not a real number:




 −∞ if x = 0


The correctly rounded result of ln(x) = ∞ if x = ∞





N aN if x < 0 or x = N aN
60

1.5

−1.625 ≤ P (0.25) ≤ −1.375


1.0
−0.875 < P (0.5) < −0.625
−0.375 < P (0.75) < −0.125
0.5
−0.125 ≤ P (1.0) ≤ 0.125
0.0
0.125 < P (1.25) < 0.375
0.375 ≤ P (1.5) ≤ 0.625
−0.5 0.375 ≤ P (1.75) ≤ 0.625
ln(x)
0.625 < P (2.0) < 0.875
−1.0 5 result 0.875 ≤ P (2.5) ≤ 1.125
Upper bound 0.875 ≤ P (3.0) ≤ 1.125
−1.5 Lower bound
1.125 < P (3.5) < 1.375
0.25 0.5 0.75 1.0 1.25 1.5 1.75 2 2.5 3 3.5

(a) (b)

     
−1.625 1 0.25 0.252 0.253 −1.375
−0.874 . . . 1 0.5 0.52 0.53  −0.625 . . .
     
−0.374 . . . 1 0.75 0.752 3
0.75   −0.125 . . .
    
 −0.125  1 1.02 1.03   
   1.0  c   0.125 
     
 0.125 . . .  1 1.25 1.252 1.253   0   0.374 . . . 
    c1   
 0.375  ≤ 1 1.5 1.52 3
1.5    ≤  0.625 
    c2  
 0.375  1 1.75 1.752 1.753   0.625 
    c  
 0.625 . . .  1 2.0 2.02 2.03  3  0.874 . . . 
     
 0.875  1 2.5 2.52 2.53   1.125 
     
 0.875  1 3.0 3.02 3.03   1.125 
1.125 . . . 1 3.5 3.52 3.52 1.374 . . .

(c)

Figure 3.2: Our approach for ln(x) with FP5. (a) For each input x in FP5, we accurately compute
the correctly rounded result (black circle) and identify intervals around the result so that all values
round to it. (b) The set of constraints that must be satisfied by the polynomial for the reduced input.
(c) LP formulation for generating a polynomial of degree three that satisfies the constraints in (b).
The resulting polynomial is shown in Figure 3.4.
61

[ ]( )
0.75 0.875 1.0 1.125 1.25 1.375 1.5
(0b00011) (0b00100) (0b00101) (0b00110)

Figure 3.3: This figure shows the real number line and several adjacent FP5 values, 0.75, 1.0, 1.25,
and 1.5. Any real value in the blue interval [0.875, 1.125], rounds to 1.0 in FP5 with RNE mode.
Similarly, any value in the green interval (1.125, 1.375) rounds to 1.25.

These special cases can be implemented with a simple check followed by returning the
correct result.

3.2.1 Computing the Correctly Rounded Result.

Once the special case inputs are filtered, there are 11 FP5 inputs in the input domain
[0.25, 3.5]. The inputs are shown on the x-axis in Figure 3.2(a). Our first step is to identify
the correctly rounded result of ln(x) for these 11 inputs. We use an oracle (i.e., MPFR math
library [46]) to compute the real value of ln(x) and round the result to FP5. Figure 3.2(a)
shows the correctly rounded result for each of the inputs with black dots.

3.2.2 Identifying the Rounding Interval

The polynomial approximation of ln(x) is evaluated in a higher precision representation


compared to FP5. In our example, we will use the double representation to evaluate the
polynomial. Our next step is to identify a range of values that rounds to the correctly
rounded result when rounded to FP5 with rne mode. We call this range of values the
rounding interval. If our polynomial produces a value within the rounding interval, then
the result will round to the correctly rounded value, RNF P 5,rne (ln(x)). Figure 3.2(a) shows
the rounding interval for each result using the blue (upper bound) and orange (lower bound)
square bracket.
Suppose that we want to compute the rounding interval for 1.0, which is the correctly
rounded result of ln(2.5). To identify the lower bound l of the rounding interval, we first
identify the preceding value of 1.0 in FP5, which is 0.75. Then, we identify l between 0.75
62

and 1.0 such that all values v ≥ l rounds to 1.0. More formally,

l = min{v | 0.75 ≤ v ≤ 1.0 ∧ RNF P 5,RN E (v) = 1.0}

In our case, l = 0.875. Similarly, to identify the upper bound h in the rounding interval,
we identify the succeeding value of 1.0 in FP5, which is 1.25. Then, we identify the upper
bound h between 1.0 and 1.25 such that all values smaller than or equal to h rounds to 1.0:

u = max{v | 1.0 ≤ v1.25 ∧ RNF P 5,RN E (v) = 1.0}

In this case, h = 1.125. Hence, the rounding interval for 1.0 is [l, h] = [0.875, 1.125].
Figure 3.3 shows the rounding interval of 1.0 with a blue box and the rounding interval of
1.25 with a yellow box for FP5. The bit-string of 1.25 in FP5 is odd when interpreted as an
unsigned integer. From the definition of rne mode, 1.125 rounds to 1.0 and 1.375 rounds
to 1.5. All real values larger than 1.125 and smaller than 1.375 rounds to 1.25. Because all
of our computations are evaluated with double, we identify the rounding interval of 1.25 in
double. Hence, the rounding interval of 1.25 in double is [l+ , h− ], where l+ is the smallest
double value larger than 1.125 and h− is the largest double value smaller than 1.375.

3.2.3 Generating a Polynomial Approximation

The rounding intervals specify constraints on the output of the polynomial P (x) for each
input x, where P (x) is our approximation function. Based on the intervals we computed,
if P (x) produces a value within the rounding interval, the result is guaranteed to round to
the correctly rounded result in FP5. Figure 3.2(b) shows the constraints for P (x) for each
input. Thus, our goal is to generate P (x) that satisfies all of the constraints.
To generate a polynomial P (x) of degree 3, we encode the problem as an LP problem
that solves for the coefficients of P (x). We create a system of linear inequalities that specify
the constraints of P (x) (Figure 3.2(c)) and use an LP solver to solve for the coefficients
of P (x) that satisfies all the constraints. If the LP solver produces a solution, the resulting
polynomial P (x) created using the coefficients (Figure 3.4(a)) satisfies all the constraints
63

1.5

1.0

0.5
2 3
P (x) = c0 + c1 x + c2 x + c3 x
0.0
c0 = −2.084401709401708657765 . . .
c1 = 3.1251526251526207111908 . . . −0.5
c2 = −1.191697191697188351611 . . .
ln(x)
c3 = 0.66056166056165438460 . . . −1.0 P(x)
Upper bound
−1.5 Lower bound

1 2 3

(a) (b)

Figure 3.4: Continuation from Figure 3.2, where we approximate ln(x) for F5 . (a) The coefficients
generated by the LP solver for the polynomial. (b) Generated polynomial satisfies the constraints.

(shown with a black line in Figure 3.4(b)). Finally, we verify that P (x) indeed produces
the correctly rounded result by evaluating the polynomial and comparing the result against
the correctly rounded result for all inputs.

3.3 The RL IBM Approach To Generate Correctly Rounded Polynomial Approxima-


tion

We formally define the correctly rounded result of f (x). Given an elementary function
f (x), a target representation T, a rounding mode rm, and an input domain X = [a, b] ⊆
T, our goal is to generate an approximation function of f (x) that produces the correctly
rounded result for all inputs in X.

Definition 3.1. A value v is the correctly rounded result of an elementary function f (x) for
the representation T and rounding mode rm, if v = RNT,rm (f (x)).

Based on Definition 3.1, the correctly rounded result of f (x) is unique for the chosen
representation and rounding mode for each input x. More formally, there does not exist
two distinct values v1 6= v2 where both v1 and v2 are correctly rounded results of f (x) for
64

Algorithm 3.1: Our approach to generate a polynomial approximation PH (x) of degree


d that produces the correctly rounded result of f in the target representation T, for all
inputs X ⊆ T. On successfully finding a polynomial, it returns (true, PH ). Other-
wise, it returns (false, DNE) where DNE means that the polynomial Does-Not-Exist.
CalcRndIntervals and GeneratePoly are shown in Algorithm 3.2 and Algo-
rithm 3.4, respectively.
1 Function CorrectlyRoundedPoly(f , T, H, X, d):
2 L ← CalcRndIntervals(f , T, H, X)
3 if L = ∅ then return (false, DNE)
4 S, PH ← GeneratePoly(L, d)
5 if S = true then return (true, PH )
6 else return (false, DNE)

an input x.
We propose an approach to produce a polynomial approximation that produces the cor-
rectly rounded result of f (x) when rounded to the target representation T. The polynomial
is evaluated in a higher precision representation H compared to T to evaluate the polyno-
mial accurately. We use P (x) to represent the polynomial generated with our approach that
is evaluated in real numbers. We use PH (x) to represent P (x) evaluated in H. The result of
PH (x) is rounded to T to produce the final result. Our goal is to ensure that PH (x) produces
the correctly rounded result of f (x) in T.

3.3.1 High-Level Overview of The RL IBM Approach

Our approach to generate PH (x) is illustrated in Algorithm 3.1. Our approach assumes the
existence of an oracle that generates the real result of f (x). The oracle is used to com-
pute the correctly rounded result of f (x) in T. We also assume that the special case inputs
are already filtered out from the input domain X. The special cases are identified using
mathematical identities of f (x). The target representation T, the rounding mode rm to
round the result to T, the higher precision representation H to evaluate the approximation
function, and the degree d of the polynomial are inputs provided by the math library devel-
oper. Since our approach applies for any rounding mode rm, the remainder of the chapter
assumes that rm = rne and omits the discussion of specific rounding mode. Hence, we
65

Algorithm 3.2: For each input x ∈ X, CalcRndIntervals identifies the in-


terval I = [l, h] where all values in I round to the correctly rounded result. The
GetRndInterval function takes the correctly rounded result y and returns the in-
terval I ⊆ H where all values in I round to y. GetPrecValue(y, T) returns the value
preceeding y in T. GetSuccValue(y, T) returns the value succeeding y in T.
1 Function CalcRndIntervals(f , T, H, X):
2 L←∅
3 foreach x ∈ X do
4 y ← RNT,rne (f (x))
5 I ← GetRndInterval(y, T, H)
6 L ← L ∪ {(x, I)}
7 end
8 return L
9 Function GetRndInterval(y, T, H):
10 y − ← GetPrecVal(y, T)
11 l ← min{v ∈ H|v ∈ (y − , y] and RNT (v) = y}
12 y + ← GetSuccVal(y, T)
13 h ← max{v ∈ H|v ∈ [y, y + ) and RNT (v) = y}
14 return [l, h]

use the term RNT (x) to describe rounding a real value x to the representation T using a
preferred rounding mode (in our case, rne).
Our algorithm consists of two main steps. First, we compute the correctly rounded
result of f (x), y = RNT (f (x)), for each input x using the oracle. Then, we compute
an interval I = [l, h] ⊆ H where all values v ∈ I rounds to y. The interval I describes
all the values that that our polynomial P (x) can produce such that P (x) rounds to the
correctly rounded result. Thus, the pair (x, I) specifies the input-output constraint of P (x).
The function CalcRndIntervals in Algorithm 3.1 returns a list L that contains the
constraints (x, I) for all inputs x ∈ X. Second, we generate the polynomial P (x) of
degree d that satisfies all constraints (x, I) ∈ L using linear programming. We make
sure that evaluating the polynomial in H (i.e., PH (x)) also satisfies all the constraints in
L. The function GeneratePoly returns the polynomial PH , if our approach identifies
a polynomial of degree d that satisfies all constraints. Otherwise, it returns a message
indicating that it was not successful in identifying such a polynomial. We now describe
these two steps in more detail.
66

3.3.2 Computing the Rounding Intervals

Given an elementary function f (x), the RL IBM approach identifies the values that PH (x)
must produce so that the rounded value of PH (x) is equivalent to the correctly rounded
result of f (x) for each input. We call this interval the rounding interval. Because our
polynomial is evaluated in H, we look for the interval I in H. For each input x, if our
polynomial evaluated in H produces a value in the corresponding rounding interval I, then
the result is guaranteed to round to the correctly rounded result y. Thus, the pair (x, I)
for each input x defines constraints on the output of PH (x) such that rounding the result
produces the correctly rounded result.
Algorithm 3.2 presents our approach to compute the constraints (x, I). For each input
x in the input domain X, we compute f (x) with real numbers using an oracle and round
the result to the target precision T to produce the correctly rounded result y (line 4). Next,
we compute the rounding interval of y. The rounding interval can be computed as follows.
First, we identify y − , the preceding value of y in T (line 9). Next, we find the minimum
value l ∈ H between y − and y that rounds to y (line 10). The value l defines the lower bound
of the rounding interval. Similarly for the upper bound, we identify y + , the succeeding
value of y in T (line 11). Then, we find the maximum value h ∈ H between y and y +
that rounds to y (line 12). The value h is the upper bound of the rounding interval. Thus,
the interval [l, h] is the rounding interval of y (line 3) and the pair (x, [l, h]) specifies the
constraint on the output of AH (x) to produce the correctly rounded result for the input x.
We generate such constraints for all inputs in the input domain X and produce a list of
these constraints L (line 6).

Efficiently computing the rounding interval. The function GetRndInterval in Al-


gorithm 3.2 presents a naive method of computing the rounding interval. Abstract, we
identify y − , the value preceding y in H. Then, we find the smallest value l between y − and
y that rounds to y in T. Similarly, we identify y + , the value succeeding y in H. Next, we
67

Algorithm 3.3: Given a 32-bit float (F) value v, GetRndIntervalFP32 computes


the rounding interval of v with RNE rounding mode. The final rounding interval [l, h]
is in the context of the 64-bit double type (D).
1 Function GetRndIntervalFP32(v):
2 v − ← GetPrecVal(v, F)
3 if v − = −∞ then v − = −2128

4 l ← v 2+v
5 v + ← GetSuccVal(v, F)
6 if v + = ∞ then v + = 2128
+
7 h ← v+v 2
8 if bit-string of v is odd when interpreted as unsigned integer then
9 l ← GetSuccVal(l, D)
10 h ← GetPrecVal(h, D)
11 end
12 return [l, h]

find the largest value h between y and y + that rounds to y in T. The interval [l, h] is the
rounding interval of y. This naive approach may be an expensive process if H has substan-
tially larger precision compared to T. For example, if T is the IEEE-754 standard 32-bit
float type F = F32,8 and H is the IEEE-754 standard 64-bit double type D = F64,11 , then
there are at least 229 double values between two adjacent float values.
Identifying the rounding intervals can be performed efficiently by identifying the round-
ing boundaries of values in T using the properties of H, T, and the rounding mode rm. To
illustrate this idea, we present an algorithm to efficiently compute the rounding interval in
double [l, h] ∈ D for the float value v ∈ F with the rounding mode rne (Algorithm 3.3).
This algorithm can be adapted to compute the rounding interval of various H, T, and rm.
The efficient technique identifies v − , the preceding float value of v and computes the
rounding boundary l between v − and v (lines 2-4). Among the values between v − and
v, all values smaller than l should round to v − and all values larger than l should round
v − +v
to v. The rounding boundary l can be computed efficiently with l = 2
, because the
rne mode in FP rounds to the nearest float value. The rounding boundary is the arithmetic
mean between v − and v. The value l is exactly representable in double. Similarly, we
identify v + , the succeeding float value of v and round boundary h between v and v + using
68

v+v +
the formula h = 2
(lines 5-7). The conditional statements in lines 3 and 6 make sure
that we compute the correct rounding boundary when v is the smallest negative value (line
3) or the largest positive value (line 6) where the adjacent value of v is ±∞.
Next, based on the rounding boundary, we determine whether the rounding boundary l
and h themselves should be included in the rounding interval. Because l and h is the exact
mid-point between v and its adjacent float values (i.e. v − and v + , respectively), we have to
decide on the tie goes to even part of rne, based on the bit-string of v. If the bit-string of v
is even when interpreted as an unsigned integer, the rounding interval of v is [l, h] (line 11).
Otherwise, the rounding interval is (l, h). In this case, we identify the succeeding value
of l in double and the preceding value of h in double (lines 8-10) and return the interval
between these two values.

3.3.3 Generating the Polynomial With an LP Formulation

Each constraint (x, [l, h]) ∈ L specifies that PH (x) must satisfy the property l ≤ PH (x) ≤
h. This constraint ensures that PH (x) produces the correctly rounded result when rounded
to T. Each constraint (x, [l, h]) on P (x) can be expressed in the form:

l ≤ c0 + c1 x + c2 x 2 + · · · + cd x d ≤ h

In the above expression, the values x, l, and h are constants and we must identify the
coefficients c0 , c1 , . . . , cd that satisfy the constraints. Thus, in the perspective of finding the
coefficients, the constraint (x, [l, h]) is a linear constraint. We can express all constraints in
(xi , [li , hi ]) ∈ L with a single system of linear inequalities as shown below,
 
    c0  
l1 1 x1 x21 . . . xd1  
  h1 
   
    1
 c
  
 l2  1 x2 x22 . . . xd2     h2 
 ≤  c ≤ 
 ..   .. .. .. ..   2
  . 
 .  . . . .  .   .. 
     ..   
l 1 x ... xd   h
|L| |L| |L| |L|
cd
69

This system of linear inequalities is a linear programming (LP) problem. We can use an LP
solver to solve for the coefficients that satisfy all constraints. Since LP solvers produce so-
lutions in real numbers, the polynomial P (x) created with the coefficients when evaluated
with real numbers satisfies all constraints in L. Thus, the result of P (x) is guaranteed to
round to the correctly rounded result for all inputs.

Accounting for numerical error when evaluating P (x) in H. The numerical errors that
occur when evaluating the polynomial P (x), generated with our approach, in H can cause
the result of PH (x) to not satisfy some constraints in L. Suppose that we generate a poly-
nomial P (x) that satisfies a constraint (x, [l, h]) ∈ L, such that evaluating the polynomial
in real numbers produces the value h (i.e., PR (x) = h). However, if we evaluate the poly-
nomial in H, it can be the case that PH (x) to lie outside of the rounding interval (i.e.,
PH (x) > h), due to the numerical error in finite precision arithmetic. Hence, when we
generate P (x) that satisfies all constraints (x, [l, h]), we must account for the numerical
errors to ensure that PH (x) evaluates to a value in the rounding interval.
To address this challenge, we propose a search-and-refine technique. We first use the
LP solver to solve for the coefficients of P (x) by encoding the constraints (x, [l, h]) and
generate a candidate polynomial that produces a value in the rounding interval when eval-
uated with real numbers. Next, we evaluate the polynomial in H for each input x and
check whether PH (x) produces a value in the rounding interval. If PH (x) does not satisfy
a constraint (x, [l, h]) ∈ L, then we refine the rounding interval [l, h] to a smaller interval
[l0 , h0 ] ⊆ [l, h]. If PH (x) < l, then we increase the lower bound l and if PH (x) > h, we
decrease the upper bound h. We use the LP solver again to generate the coefficients of
PH (x) that satisfies the refined constraint (x, [l0 , h0 ]) instead of (x, [l, h]). This process is
repeated until either PH (x) satisfies all constraints in L or the LP solver returns with no
solution.
Algorithm 3.4 provides the algorithm for generating the coefficients of the polynomial
70

Algorithm 3.4: GeneratePoly generates a polynomial PH (x) of degree d that satis-


fies all constraints in L when evaluated in H. If it cannot generate such a polynomial,
then it returns false. LPSolve solves for the real number coefficients of a polynomial
PR (x) using an LP solver where PR (x) satisfies all constraints in L when evaluated
in real number. CreatePolynomial creates PH (x) that evaluates the polynomial
PR (x) in H. The Verify function checks whether the generated polynomial PH (x)
satisfies all constraints in L when evaluated in H and refines the interval for each con-
straint that PH (x) does not satisfy.
1 Function GeneratePoly(L, H, d):
2 L0 ← L
3 while true do
4 C ← LPSolve(L0 , d)
5 if C = ∅ then return (false, DNE)
6 PH ← CreatePolynomial(C, d, H)
7 L0 ← Verify(PH , L, L0 , H)
8 if L0 = ∅ then return (true, PH )
9 end
10 Function Verify(PH , L, L0 , H):
11 IsCorrect ← true
12 foreach (x, [l, h], [l0 , h0 ]) ∈ {(x, I, I 0 ) | (x, I) ∈ L, (x, I 0 ) ∈ L0 } do
13 if PH (x) < l then
14 L0 ← L0 − {(x, [l0 , h0 ])}
15 l ← GetSuccVal(σ, H)
16 L0 ← L0 ∪ {(x, [l, h0 ])}
17 IsCorrect ← f alse
18 else if PH (x) > h then
19 L0 ← L0 − {(x, [l0 , h0 ])}
20 h ← GetPrecVal(h0 , H)
21 L0 ← L0 ∪ {(x, [l0 , h])}
22 IsCorrect ← f alse
23 end
24 end
25 if IsCorrect then return ∅
26 else return L0

using the LP solver. L0 keeps track of the refined constraints that we compute during the
search-and-refine process. Initially, we copy the constraints L into L0 (line 2). We use the
list L0 to generate the polynomial and L to verify that the generated polynomial satisfies the
input-and-rounding-interval constraints. If the polynomial generated using the LP solver
does not satisfy L, then we restrict the intervals in L0 .
We encode the constraints in L0 into a system of linear inequalities and use an LP solver
71

to solve for the coefficients of the polynomial of degree d that satisfies the constraints in L0
when evaluated with real numbers (line 4). If the LP solver determines that there is no solu-
tion (i.e., there exists no polynomial that satisfies the constraints in L0 ), then our algorithm
concludes that it is not possible to generate a polynomial that produces correctly rounded
results and terminates (line 5). In this case, the math library developers can increase the
degree of the polynomial d or the precision of H and restart the whole process. Otherwise,
we use the solved coefficients and create PH (x) that evaluates P (x) in H by rounding all
coefficients to H and performing all operations in H (line 6).
The polynomial PH (x) is a candidate solution for AH (x). We verify that PH (x) indeed
satisfies all constraints in L (line 7) by evaluating PH (x) and checking the result is within
the rounding interval [l, h] for each input x (line 11 - 21). If PH (x) satisfies all constraints in
L, then our algorithm concludes with success and returns PH (x) (line 8). However, if there
exists a constraint (x, [l, h]) in L, where PH (x) evaluates to a value outside of the rounding
interval [l, h], then we restrict the refined constraint (x, [l, h0 ]) in L0 that corresponds to the
same input x. If PH (x) is smaller than the lower bound l of the rounding interval, then we
restrict the lower bound l0 of the refined interval in L0 to the value succeeding l0 in H (lines
12-16). This process forces the P (x) that we generate in the next iteration to produce a
value larger than l0 , increasing the likelihood of PH (x) to produce a value larger than or
equal to l. Likewise, if PH (x) is larger than the upper bound l of the rounding interval in
L, then we restrict the upper bound h0 of the refined interval in L0 to the value preceding h0
in H (lines 17-21).
We repeatedly generate a new candidate polynomial with the refined constraints L0
until the new polynomial satisfies all constraints in L or the LP solver determines that it is
infeasible. If a refined constraint (x, [l0 , h0 ]) ∈ L0 is restricted to the point where l0 > h0
(or [l0 , h0 ] = ∅), then the LP solver will automatically determine that there is no feasible
solution. When our algorithm successfully returns a polynomial, then PH (x) is guaranteed
to produce a value that rounds to the correctly rounded result of f (x) in T.
72

3.4 Summary

We propose a novel approach to generate polynomial approximations of elementary func-


tions that produces the correctly rounded result for all inputs. To produce the correctly
rounded result of f (x), we make a case for approximating the correctly rounded result
itself, not the real value of f (x). With this insight, our approach generates polynomial
approximations by identifying the maximum amount of freedom available to generate the
correctly rounded result for each input. We identify the rounding interval for each input
x where all values in the interval rounds to the correctly rounded result of f (x). This in-
terval defines the amount of freedom to generate the polynomial. The rounding interval
for each input defines a linear constraint for the polynomial approximation to produce the
correctly rounded results. Hence, we frame the problem of generating the polynomial as
an LP problem and use the LP solver to generate the polynomial. When our approach
successfully generates a polynomial, then the polynomial is guaranteed to produce the cor-
rectly rounded result of f (x) for all inputs. Our approach can automatically produce such
polynomials.
This key idea is the foundation for generating efficient and correct polynomial approxi-
mations, which we present in the subsequent chapters. Combining with range reduction and
domain splitting can generate lower degree polynomials while still producing the correctly
rounded results, thus increasing the performance (described in Section 4 and Section 5).
An extension of the approach (described in Section 6) can generate a single polynomial
approximation that produces the correctly rounded result for multiple representations and
rounding modes. The math library created using our approach combined with the tech-
niques presented in subsequent chapters produce correctly rounded results for all inputs
and are faster than mainstream libraries.
73

CHAPTER 4
THE RLIBM APPROACH WITH RANGE REDUCTION

As discussed in the previous chapter, our insight is to approximate the correctly rounded
result rather than the real value of f (x). If the degree of the polynomial is sufficiently
large, then a single polynomial can produce the correctly rounded result of f (x) for all
inputs. In practice, however, polynomials are often used with range reduction strategies to
create efficient approximations. Range reductions significantly reduce the input domain of
f (x) into a much smaller domain. Polynomial approximations are efficient at approximat-
ing elementary functions in a smaller domain. Hence, a much lower degree polynomial
approximation can be used to produce the correctly rounded result of f (x) for all inputs.
In this chapter, we present our approach to generate polynomial approximations that
produce the correctly rounded results of f (x) when used with range reduction strategies.
The main challenge lies in the fact that we now have to account for the numerical errors that
arise in the range reduction and the output compensation functions. Our strategy is to infer
the range of values that the polynomial should produce such that, when used with the range
reduction strategy, it produces the correctly rounded results for all inputs. Additionally, we
present the range reduction strategies we use to approximate several elementary functions.
Some of these strategies are developed specifically with the goal of minimizing numerical
errors.

4.1 Generating Polynomial Approximations With Range Reduction

Our approach that we discussed in Chapter 3 can generate polynomial approximations for
any target representation as long as there is no limit on the degree of the polynomial and the
representation H we use to evaluate the polynomial has sufficiently high precision. How-
ever, evaluating a high degree polynomial is not efficient especially if H is a representation
74

that is not natively supported in the hardware, thus having to use software simulation for
all primitive operations. Our goal is not only to generate correctly rounded approximations
of f (x), but also an efficient approximation.
A widely adopted strategy to increase the performance of an approximation of elemen-
tary functions is to use range reduction along with polynomial approximation, as described
in Chapter 2. In essence, range reduction reduces the domain of the function we must ap-
proximate to a smaller domain. As polynomials are much more efficient in approximating
elementary functions in a small domain, this allows us to generate a lower degree polyno-
mial, which can be evaluated efficiently. With a well designed range reduction, the perfor-
mance benefit of using a lower degree polynomial far outweighs the additional overhead
incurred from using range reduction. Hence, developing efficient range reduction strategies
have been researched for many years [12, 25, 32, 40, 42, 48, 103, 117, 133, 134, 135, 141]
and all mainstream math libraries use range reduction whenever possible.
Using a range reduction strategy, an approximation A(x) of an elementary function
f (x) has three components. (1) A range reduction function that reduces an original input
x from the entire input domain to a reduced input x0 in a smaller domain, (2) a polynomial
approximation function that uses the reduced input x0 to approximate a function g(x0 ), and
(3) an output compensation function that compensates the approximation of g(x0 ) using the
reduced input x0 to produce the approximation of f (x) with the original input x. Depend-
ing on the range reduction strategy, the function g(x0 ) approximated by the polynomial
approximation may not be the same as f (x).

Challenges in generating polynomial approximations that work with range reductions.


The primary challenge in generating a polynomial approximation that produces the cor-
rectly rounded result of f (x) in the target representation T, when used with range reduc-
tion strategy, is the fact that the polynomial must account for the numerical error that occur
when evaluating the range reduction and the output compensation function. As range re-
75

duction, polynomial approximation, and output compensation functions are evaluated in a


finite precision representation H, it is impossible to eliminate numerical errors for an ar-
bitrary range reduction strategy. Thus, naively generating polynomial approximations for
g(x0 ) will not guarantee to produce correctly rounded results for all inputs. Prior work
mathematically analyzes the error of approximation functions and verifies that elementary
functions produce correctly rounded results [21, 29, 34, 38, 59, 62, 82, 83, 145].

Our approach. We propose a novel approach to generate polynomial approximations


with range reduction strategies. Our key insight is to infer the list of the inputs and the
range of values that our polynomial should produce. Our approach identifies these inputs
and range of values using the following strategy. First, for each input x in T, we compute
the correctly rounded result of f (x) in T and identify the rounding interval [l, h]. The list
of inputs x and its corresponding rounding interval [l, h] now constrains the output of the
entire approximation A(x) such that it produces the correctly rounded result of f (x) in T.
Next, we infer the inputs x0 and the range of values [l0 , h0 ] that our polynomial approx-
imation should produce using the range reduction and the output compensation function
based on x and [l, h]. We call x0 the reduced input and [l0 , h0 ] the reduced interval. During
this step, we also account for the numerical errors that occur with the range reduction and
the output compensation function. The list of reduced input and interval pairs (x0 , [l0 , h0 ])
constrains the polynomial such that it produces the correctly rounded of f (x) in T when
used with range reduction strategy and evaluated in H. With the list of these constraints,
we frame the problem of generating the polynomial as LP problem and use LP solver to
create the polynomial. Our approach is compatible with a wide variety of range reduc-
tion strategies, handling both univariate and multivariate output compensation functions.
Additionally, we designed several range reduction strategies that minimize the amount of
numerical error by eliminating cancellation errors.
Using our approach and range reduction strategies that we designed, we created a
76

prototype, RL IBM -16, containing several elementary functions for bfloat16 and posit16
representations. Our automatic approach in generating polynomial approximations has
been pivotal in generating efficient polynomials. Our elementary functions are faster than
mainstream math libraries repurposed for bfloat16 and posit16 while producing correctly
rounded results for all inputs as detailed in Chapter 7. In contrast, mainstream math li-
braries do not produce correctly rounded bfloat16 results for all inputs. Our prototype is
the first correctly rounded math library for bfloat16.

4.2 Illustration

We illustrate an end-to-end example of our approach for creating a correctly rounded ap-
proximation function for ln(x) for the FP5 representation with RNE rounding mode using
range reduction. Note that our goal is the same as the example from Chapter 3, but now
with range reduction and polynomial approximation. There are two purposes in our ex-
ample: (1) to show that range reduction allows us to generate a lower degree polynomial
that produces correctly rounded results when used with the output compensation function
and (2) to illustrate our approach in generating a polynomial that accounts for the numerical
error while evaluating the output compensation function in a finite precision representation.

4.2.1 Range Reduction for ln(x)

The ln(x) function is defined over the input domain (0, ∞). There are 11 values ranging
from 0.25 to 3.5 in F5,2 within (0, ∞). The remaining 21 values are special case inputs.
These special cases can be implemented with a simple check followed by returning the
correct result. Our strategy for generating a polynomial approximation for ln(x) is by
approximating ln(x) using log2 (x). We perform range reduction and output compensation
using two mathematical identities,

log2 (x)
ln(x) = , log2 (x × y z ) = log2 (x) + zlog2 (y)
log2 (e)
77

We decompose the input x into x = x0 × 2m where x0 is the significand of the input


ranging from x0 ∈ [1, 2) and m ∈ Z is the integer exponent of x. Then, ln(x) can be
computed using,
log2 (x0 × 2m ) log2 (x0 ) + mlog2 (2) log2 (x0 ) + m
ln(x0 × 2m ) = = =
log2 (e) log2 (e) log2 (e)

Based on the above equation, we construct the range reduction function RR(x), the
output compensation function OC(y 0 ), and the function that we need to approximate g(x0 )
as follows,
y0 + m
RR(x) = x0 OC(y 0 ) = g(x0 ) = log2 (x0 )
log2 (e)
Using this range reduction strategy, we approximate ln(x) by first using RR(x) to re-
duce the original input to the reduced input x0 . Next, we use a polynomial approximation
y 0 = P (x0 ) that approximates g(x0 ) for x0 ∈ [1, 2). Since P (x0 ) approximates g(x0 ) with
the reduced input x0 , we use the output compensation function OC(y 0 ) to generate an ap-
proximation of ln(x) for the original input x. More formally,

ln(x) ≈ OC(P (RR(x)))

Our goal now is to generate a polynomial approximation P (x0 ) that produces the cor-
rectly rounded results when used with the range reduction and output compensation func-
tion. Mathematically, P (x0 ) approximates log2 (x0 ) for the reduced inputs x0 in the domain
[1, 2).

4.2.2 Identifying Correctly Rounded Results and Rounding Intervals

There are a total of 11 inputs in the input domain (0, ∞). Our first step is to identify a
range of values that our entire approximation function should produce for each input, such
that the result rounds to the correctly rounded results. For each of the 11 inputs, we use
an oracle (i.e., MPFR math library) to compute y, the correctly rounded result of ln(x) in
FP5. Figure 4.1(a) shows the 11 inputs in the x-axis and the correctly rounded result for
each input with a black dot.
78

1.5 1.5

1.0 1.0

0.5 0.5

0.0 0.0

-0.5 -0.5

-1.0 -1.0
ln(x)
5 result
Upper bound
-1.5 Lower bound -1.5

0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.5 3.0 3.5 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.5 3.0 3.5
1.0 1.0 1.5 1.0 1.25 1.5 1.75 1.0 1.25 1.5 1.75

(a) (b)
1.5

0.4
1.0

0.2
0.5

0.0
0.0

-0.5

−0.2

-1.0

−0.4
-1.5

0.25 0.5 1.0 2.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.5 3.0 3.5
(1.0) (1.0) (1.0) (1.0) 1.0 1.0 1.5 1.0 1.25 1.5 1.75 1.0 1.25 1.5 1.75

(c) (d)

Figure 4.1: Our approach for ln(x) for F5,2 with range reduction. (a) For each input x in FP5, we
compute the correctly rounded results (black circle) and the rounding interval. (b) For each input
and the corresponding rounding interval, we perform range reduction to obtain the reduced input
(The number below each value on the x-axis). We also show the reduced interval for each input.
Multiple inputs can map to the same reduced input after range reduction (intervals with the same
color). In such scenarios, we combine the reduced intervals by computing the common region in
the intervals. (c) We highlight the common region in intervals that correspond to the reduced input
1.0. (d) We highlight the common region in intervals that correspond to each reduced input with the
darker color. The combined intervals are shown in Figure 4.2.
79

We perform the range reduction, output compensation, and polynomial evaluation using
double precision. The double result of the approximation function is rounded to FP5. Thus,
we identify the rounding interval [l, h], a range of double values that rounds to the correctly
rounded result, for each input. Figure 4.1(a) shows the rounding interval for each input with
a bracket. The orange bracket represents the lower bound and the blue bracket represents
the upper bound of the rounding interval.

4.2.3 Computing the Reduced Input and the Reduced Interval

Our approximation function takes an input x and uses the range reduction RR(x) to pro-
duce the reduced input x0 . The reduced input x0 is used with a polynomial approximation
P (x0 ) to produce the output y 0 . Finally, the output y 0 is compensated with the output com-
pensation function OC(y 0 ) to produce the final output y. The pair, (x, [l, h]) computed in
the previous step constrains the output of the output compensation function such that our
approximation for the elementary function as a whole produces the correctly rounded re-
sults. The next step is to use each (x, [l, h]) and infer all the inputs for P (x0 ) and the range
of values that P (x0 ) should produce such that it produces the correctly rounded results when
used with the output compensation function. The inputs to P (x0 ) are the reduced inputs.
To compute the reduced input, we perform range reduction on each input x in the origi-
nal input domain. Figure 4.1(b) shows the reduced input on the x-axis, below the original
input.
Next, we identify a range of values that P (x0 ) should produce for each input, such
that P (x0 ) produces the correctly rounded results when used with the output compensation
function. We call this range of values the reduced interval. We infer the reduced intervals
using the rounding interval [l, h] for each input and the output compensation function. In
our example where OC(y 0 ) is a continuous and bijective function over the real numbers, we
can use the inverse of the output compensation function to identify the reduced intervals.
80

For the input x = 3.5 = 1.75 × 21 , the output compensation function is,

y0 + 1
OC(y 0 ) =
log2 (e)

and the inverse of the output compensation function is,

OC −1 (y) = y × log2 (e) − 1

We use the inverse function and the rounding interval [l, h] to compute the candidate
reduced interval, [l0 , h0 ] by computing l0 = OC −1 (l) and h0 = OC −1 (h). Next, we restrict
the reduced interval [l0 , h0 ] to account for the numerical error that occurs when evaluating
the output compensation function with the double type. We verify whether the result of the
output compensation function with l0 (i.e., OC(l0 )) and with h0 (i.e., OC(h0 )) lies in [l, h]
when evaluated in double. If it does not, then we iteratively restrict the reduced interval
[l0 , h0 ] to a smaller interval until both OC(l0 ) and OC(h0 ) results in [l, h] when evaluated in
double. The vertical bars in Figure 4.1(b) show the reduced input and reduced interval for
each original input x.

4.2.4 Combining the Reduced Intervals

Multiple inputs from the original input domain can map to the same reduced input after
range reduction. In Figure 4.1(b), the inputs, reduced inputs, and reduced intervals corre-
sponding the same reduced input are colored with the same color. Consider Figure 4.1(c),
which depicts the reduced intervals for just four inputs x1 = 0.25, x2 = 0.5, x3 = 1.0, and
x4 = 2.0. All four inputs map to the same reduced input x0 = 1.0. However, the reduced
intervals that we computed for x1 , x2 , x3 , and x4 are [l10 , h01 ], [l20 , h02 ], [l30 , h03 ], and [l40 , h04 ],
respectively. The reduced intervals are not identical because the intervals are computed to
account for the output compensation function and any numerical error that may occur when
evaluating OC(y 0 ) in double.
The reduced interval for x1 indicate that P (1.0) must produce a value in [l10 , h01 ] such
that the final result, after evaluating the output compensation function in double, will round
81

1.5

1.0

0.5

−0.098 . . . ≤ P (1.00) ≤ 0.016 . . .


0.0
0.262 . . . ≤ P (1.25) ≤ 0.541 . . .
-0.5
0.541 . . . ≤ P (1.50) ≤ 0.623 . . .
0.623 . . . ≤ P (1.75) ≤ 0.901 . . .
-1.0

-1.5

1.0 1.25 1.5 1.75

(a) (b)

Figure 4.2: Continuation from Figure 4.1. (a) The combined reduced intervals by computing the
common regions in the reduced intervals that correspond to the same reduced inputs from Figure 4.1.
(b) The set of constraints that must be satisfied by the polynomial to produce the correctly rounded
results when used with the output compensation function. The LP formulation and the resulting
polynomial are shown in Figure 4.3.

to the correctly rounded result of ln(0.25). Similarly, the reduced interval for x2 indicate
that P (1.0) must produce a value in [l20 , h02 ] such that the final result rounds to the correctly
rounded result of ln(0.5). The reduced interval for each of the four inputs xi , for i ∈
{1, 2, 3, 4}, indicate that P (1.0) must produce a value in [li0 , h0i ] such that the final result
rounds to the correct result of ln(xi ). Thus, P (1.0) must satisfy all four reduced intervals
[li0 , h0i ]:
P (1.0) ∈ [l10 , h01 ] ∩ [l20 , h02 ] ∩ [l30 , h03 ] ∩ [l40 , h04 ]

Hence, we combine all reduced intervals that correspond to the same reduced input by
computing the common interval. Figure 4.1(c) shows the common interval for the reduced
input x0 = 1.0 using the darker brown color. Figure 4.1(d) shows the common interval
for each reduced input using a darker shade of the color. Once the common intervals are
combined, we are left with one combined reduced interval for each unique reduced input
x0 . The combined intervals for each reduced input are shown in Figure 4.2(a).
82

1.5
     
−0.098 . . . 1 1   0.016 . . .
 0.262 . . .  1 1.25 c0 0.541 . . . 1.0
     
 0.541 . . .  ≤ 1 1.5  c1 ≤ 0.623 . . .
0.5
0.623 . . . 1 1.75 0.901 . . .

0.0
(a)
-0.5

P(x) = c0 + c1x
-1.0 Output Range for P(1.0)
P (x) = c0 + c1 x Output Range for P(1.25)
Output Range for P(1.5)
c0 = −1.0331383243 . . . -1.5 Output Range for P(1.75)

c1 = 1.04943264311 . . . 1.0 1.25 1.5 1.75

(b) (c)

Figure 4.3: Continuation from Figure 4.2. (a) The LP formulation to generate a monomial that
satisfies all the constraints. (b) The coefficients generated by the LP solver. (c) Generated monomial
satisfies all constraints.

4.2.5 Generating a Polynomial Approximation

The combined intervals specify the constraints on the output of the polynomial P (x0 ) for
each reduced input x0 , such that the result when used with output compensation function in
double results in the correctly rounded results for all original inputs. Figure 4.2(b) numeri-
cally shows the constraints for P (x0 ) for each reduced input.
To synthesize a polynomial P (x0 ) for the degree d, we encode the problem as an LP
problem, similar to our approach in Chapter 3. We look for a polynomial that satisfies the
constraints for each reduced input. Figure 4.3(a) shows the LP formulation for generating
a monomial (polynomial of degree 1) that satisfies all constraints in Figure 4.2(b). The
generated monomial (in Figure 4.3(b)) satisfies all constraints as shown in Figure 4.3(c).
The use of range reduction allows us to generate lower degree polynomial (i.e. degree 1
in this example) compared to generating a polynomial for the entire input domain (i.e.,
polynomial of degree 3 in Chapter 3).
83

4.3 Range Reduction Strategies for Various Elementary Functions

We now describe range reduction strategies that we used to generate approximations for
various elementary functions. We group families of functions with similar range reduction
strategies (e.g., loga (x) or ax ). For each family of functions, we describe two range reduc-
tions. We first describe a simple range reduction strategy suitable for representations with a
small number of bits (i.e. bfloat16). Second, we describe a more sophisticated range reduc-
tion strategy that is attractive for a larger bitwidth. The sophisticated strategies typically
require pre-computed look-up tables or approximations of multiple functions to signifi-
cantly reduce the input domain to generate low degree polynomial while also producing
the correctly rounded result for all inputs.

4.3.1 Logarithm functions (loga (x))

The logarithm functions loga (x) is defined over the input domain (0, ∞). In general, all
range reductions for loga (x) use the mathematical identity,

loga (t × 2m ) = loga (t) + mloga (2)

to initially reduce the input domain by decomposing the input x into x = t × 2m where
t ∈ [1, 2) is the significand of x and m ∈ Z is the integer exponent of x. Then, loga (x) can
be computed with,
loga (x) = loga (t) + mloga (2)

Using this range reduction, we must approximate loga (t) for the inputs t ∈ [1, 2).

Basic Range Reduction of loga (x)

To further reduce the input domain and, we use a slightly modified version of Cody and
Waite’s range reduction [25]. It uses the mathematical property of logarithms, loga (x) =
log2 (x)
log2 (a)
to approximate logarithm functions using the approximation of log2 (x):
log2 (t) log2 (2) log2 (t) + m
loga (x) = +m =
log2 (a) log2 (a) log2 (a)
84

As a second step, we introduce a new variable x0 = t−1


t+1
. The variable t can be expressed
as a function of x0 using the following mathematical reasoning:

t−1
x0 =
t+1
tx0 + x0 = t − 1

t − tx0 = 1 + x0

t(1 − x0 ) = 1 + x0
1 + x0
t=
1 − x0

1+x0
We substitute t in log2 (t) with 1−x0
:
 
0 1 + x0
g(x ) = log2 (t) = log2
1 − x0

The polynomial expansion of g(x0 ) is an odd polynomial, i.e. P (x) = c1 x + c3 x3 +


c5 x5 . . . and converges rapidly as the number of terms increases. Combining all steps, we
decompose loga (x) to, 
1+x0
log2 1−x0
+m
loga (x) =
log2 (a)
The range reduction function x0 = RR(x), the output compensation function y =
OC(y 0 ), and the function that we need to approximate, y 0 = g(x0 ) can be summarized as
follows,
 
t−1
0 0 0 1 + x0 y0 + m
RR(x) = x = g(x ) = y = log2 OC(y 0 , x) =
t+1 1 − x0 log2 (a)

where t and m is identified by decomposing x = t × 2m . With this range reduction tech-


nique, we need to approximate g(x0 ) for the reduced input domain x0 ∈ [0, 13 ).

Sophisticated Range Reduction for loga (x)

To significantly reduce the input domain for large representations, we use a table-based
approach [134] to perform range reduction for loga (x) functions. We reduce the range of t
85

Algorithm 4.1: ComplexLogG1 uses the sophisticated range reduction strategy to approx-
imate the elementary function loga (x) for inputs x ≥ 1. GetSignificand and GetExp
returns the significand and the exponent value of x, respectiively. Algorithm 4.2 presents the
algorithm to approximate loga (x) for inputs 0 < x < 1.
1 Function ComplexLogG1(x):
2 t ← GetSignificand(x); m ← GetExp(x)
3 J ← b128(t − 1)c
J
4 F ← 1 + 128
5 f ←t−F
6 x0 ← Ff // Reduced input x0 is a value in [0, 1281
)
7 0
y ← g(x )0 0 0
// g(x ) approximates loga (1 + x )
8 y ← y 0 + loga (F ) + mloga (2) // Output compensation
9 return y

by transforming t into t = F + f where F is a discrete variable with F = 1 + J


128
, J is a
value in the set {0, 1, 2, . . . , 127} and f is a continuous variable in the range of f ∈ [0, 128
1
).
Intuitively, F is the value represented by the first 8 bits of the significand m and f is the
value represented by the rest of the bits. Then, loga (t) can be computed with,
    
f f
loga (t) = loga (F + f ) = loga F 1 + = loga (F ) + loga 1 +
F F
f
If we denote the reduced input x0 = F
, then loga (t) can be computed with,

loga (t) = loga (F ) + loga (1 + x0 )

f
The reduced input x0 is in the range of [0, 128
1
). The computation x0 = F
can be effi-
ciently performed by computing f × 1
F
if the value 1
F
is computed ahead of time and the
values are stored in a look-up table (128 values). Additionally, since there are 128 possible
values for F and loga (F ) is a constant for a given base a, we pre-compute the values for
loga (F ) (128 values for each base a) and store them in a look-up tables. We approximate
loga (1 + x0 ) for the reduced input domain x0 ∈ [0, 128
1
) with a polynomial.
Algorithm 4.1 summarizes the sophisticated range reduction strategy for loga (x), show-
ing the range reduction (lines 2-6), polynomial approximations (line 7), and output com-
pensation function (line 8).
86

Eliminating Cancellation Error

The values of loga (2) cannot be exactly represented with finite precision representations
for all bases a ∈ Z except for a = 2. The approximation error for loga (2) can be amplified
by m if the magnitude of m is large. Additionally, if m = −1, the output compensation
function may experience cancellation error because y 0 + loga (F ) ≥ 0 and mloga (2) < 0.
The magnitude of y 0 + loga (F ) and mloga (2) may be similar and have different signs,
leading to cancellation error and significant amount of numerical error when evaluating the
output compensation function with finite precision.
We use a different range reduction strategy when m < 0 to avoid cancellation error.
Once the original input x is decomposed to x = t × 2m where t ∈ [1, 2) is the significand
and m is the exponent of x, we check whether m < 0. If it is, we modify the decomposition
0
to x = t
2
× 2m+1 . If we denote t0 = t
2
and m0 = m + 1, then x = t0 × 2m . We further
reduce the range of t0 by transforming t0 into t0 = F 0 − f 0 where F 0 is a discrete variable
J0
1− 256
, J 0 is a value in the set {0, 1, 2, . . . , 127}, and f 0 is a continuous variable in the
range of f 0 ∈ [0, 256
1
]. Then, loga (t0 ) can be computed with,
    
0 0 0 0 f0 0 f0
loga (t ) = loga (F − f ) = loga F 1 − 0 = loga (F ) + loga 1 − 0
F F
0
If we denote the reduced input x0 = − Ff 0 , then x0 is in the range of [− 128
1
, 0]. The function
loga (x) can be computed with,

loga (t0 ) = loga (F 0 ) + loga (1 + x0 ) + m0 loga (2)

All three terms loga (F 0 ), loga (1+x0 ), and m0 loga (2) are non-positive values and adding
0
these three values do not experience cancellation error. The computation x0 = − Ff 0 can be
efficiently performed by computing −f 0 × F10 if the value 1
F0
is computed ahead of time and
stored in a look-up table (128 values). Additionally, since there are 128 possible values for
F 0 and loga (F 0 ) is constant for a given base a, we also pre-compute the values for loga (F 0 )
and store them in a look-up table (128 values for each base a). We approximate loga (1+x0 )
for the reduced input domain x0 ∈ [− 128
1
, 0] with a polynomial.
87

Algorithm 4.2: ComplexLogL1 uses the sophisticated range reduction strategy to approxi-
mate the elementary function loga (x) for inputs 0 < x < 1.
1 Function ComplexLogL1(x):
2 t0 ← GetSignificand(x) ÷ 2; m0 ← GetExp(x) + 1 // t0 ∈ [0.5, 1)
3 0
J ← 255 − b256t c 0 0
// J ∈ {0, 1, . . . , 127}
0 J0
4 F ← 1 − 256
5 f 0 ← F 0 − t0
0
6 x0 ← − Ff 0 // Reduced input x0 is a value in [− 128 1
, 0]
7 0
y ← g(x ) 0 // g(x ) approximates loga (1 + x0 )
0

8 y ← y 0 + loga (F 0 ) + m0 loga (2) // Output compensation


9 return y

Algorithm 4.2 summarizes the sophisticated range reduction strategy of loga (x) for the
input 0 < x < 1, showing the range reduction (lines 2-6), polynomial approximations (line
7), and output compensation function (line 8).

4.3.2 Exponential functions (ax )

The exponential function ax is mathematically defined for the input domain x ∈ (−∞, ∞).
For small target representations, we use range reduction that reduces the input domain to
a reduced input domain x0 ∈ [0, 1). For large target representations, we use table-based
range reduction to reduce the input domain further.

Basic Range Reduction for ax

For small representations, we approximate all exponential functions with 2x . As a first step,
we use the mathematical property ax = 2xlog2 (a) to decompose any exponential function to
a function of 2x . Second, we decompose the value xlog2 (a) into the integral part i and the
remaining fractional part x0 ∈ [0, 1), i.e. xlog2 (a) = i + x0 . More formally,

i = bxlog2 (a)c, x0 = xlog2 (a) − i

where bvc is a floor function that rounds v down to an integer. Finally, using the property
2x+y = 2x 2y , ax decomposes to
0 0
ax = 2xlog2 (a) = 2i+x = 2x 2i
88

The above decomposition allows us to approximate any exponential function by approxi-


mating 2x for x ∈ [0, 1). Multiplication by 2i can be computed efficiently using bit-wise
0
operations by increasing the exponent value of 2x in the bit-string of the higher precision
representation used to evaluate the approximation function. The range reduction function
x0 = RR(x), the output compensation function y = OC(y 0 , x), and the function we need
to approximate y 0 = g(x0 ) can be summarized as follows:

0
RR(x) = x0 = xlog2 (a) − bxlog2 (a)c, g(x0 ) = y 0 = 2x , OC(y 0 , x) = y 0 2bxlog2 (a)c

0
With this range reduction technique, we need to approximate 2x for the reduced input
domain x0 ∈ [0, 1).

Sophisticated Range Reduction for ax

The table-based range reduction for exponential functions ax [133] is applicable for all
bases a. Our goal in the table-based range reduction is to use the mathematical identities
ax+y = ax ay and axloga (2) = 2x to efficiently compute ax . Suppose that we somehow
decompose the input x into x = tloga (2) + r where t is an integer and r is the remainder
of the value, r = x − tloga (2). Then, ax can be computed with,

ax = atloga (2)+r = atloga (2) ar = 2t ar

In such a case, we need to approximate ar and multiplication by 2t can be efficiently com-


puted using bit-wise operations. We use this decomposition twice in the range reduction.
j
To reduce the input x we transform x into x = tloga (2) + 64
loga (2) + x0 where t
loga (2)
is an integer, j is a value in a set {0, 1, 2, . . . , 63} and |t0 | ≤ 64
. Intuitively, t can be
j
computed by identifying the integral part of the value x
loga (2)
. The value 64
can be computed
by identifying the first 6 fractional bits of x
loga (2)
. Lastly, x0 can be computed by identifying
j
the remaining value, x0 = x − tloga (2) − 64
loga (2).
The scaling by loga (2) allows us to create efficient output compensation formula. The
89

Algorithm 4.3: ComplexExp uses the sophisticated range reduction strategy to approximate
the elementary function ax for all inputs x.
1 Function ComplexExp(x):
2 if x < 0 then t ← d logxa (2) e else t ← b logxa (2) c // Integer cast (rnz mode)
3 t1 ← x − tloga (2)
4 if t1 < 0 then j ← dt1 × 64
loga (2) e else j ← bt1 × 64
loga (2) c
j
5 x0 ← t1 − 64 loga (2) // Reduced input x0 is a value in [− log64
a (2) loga (2)
, 64 ]
0
6 y 0 ← g(x0 ) // g(x0 ) approximates ax
j
7 y ← 2t × 2 64 × y 0 // Output compensation
8 return y

value of ax can be computed with,


j 0 j 0 j 0
ax = atloga (2)+ 64 loga (2)+x = atloga (2) × a 64 loga (2) × ax = 2t × 2 64 × ax

Multiplication by 2n can be computed efficiently using bit-wise operations. We pre-compute


j 0
and store the values of 2 64 in a look-up table (i.e. 64 values in total) and approximate ax
for the input domain of x0 ∈ [− log64
a (2) loga (2)
, 64 ]. Algorithm 4.3 summarizes the sophisti-
cated range reduction strategy for ax , showing the range reduction (lines 2-5), polynomial
approximations (line 6), and output compensation function (line 7).

4.3.3 Hyperbolic Sine Function (sinh(x))

Mathematically, the sinh(x) is defined for the input domain x ∈ (−∞, ∞). We use a
table-based range reduction inspired from CR-LIBM [30] for sinh(x) using the hyper-
bolic identities sinh(a + b) = sinh(a)cosh(b) + cosh(a)sinh(b) and cosh(a + b) =
cosh(a)cosh(b) + sinh(a)sinh(b).

Basic Range Reduction for sinh(x)

Using the property sinh(−x) = −sinh(x), the result of sinh(x) can be computed with
s × sinh(|x|) where s = 1 if x >= 0 and s = −1 if x < 0. We first transform the input x
into |x| with x = s × |x|. Next, we decompose |x| into two parts:

|x| = kln(2) + x0
90

where k ≥ 0 is an integer and x0 ∈ [0, ln(2)). The intuition for scaling the values of k with
ln(2) will be explained below. If we denote K = kln(2), then sinh(|x|) can be computed
with,
sinh(|x|) = sinh(K + x0 ) = sinh(K)cosh(x0 ) + cosh(K)sinh(x0 )

Because the magnitude of sinh(x) increases exponentially, the result of sinh(x) lies
outside of the dynamic range for most values of the original input x. Hence, the values
of k are limited even for relatively large precision representations (i.e. 32-bit float). We
identify all possible values for k, pre-compute sinh(K) and cosh(K) for each k, and store
the values into look-up tables.
We approximate two functions g1 (x0 ) = sinh(x0 ) and g2 (x0 ) = cosh(x0 ) for the reduced
inputs x0 ∈ [0, ln(2)). The function sinh(x) can be computed with,

sinh(x) = s × (sinh(K)cosh(x0 ) + cosh(K)sinh(x0 ))

Finally, the range reduction x0 = RR(x), output compensation y = OC(y10 , y20 ), and the
two functions that we must approximate (y10 = g1 (x0 ) and y20 = g2 (x0 )) can be summarized
as follows:

RR(x) = x0 = |x| − kln(2),

g1 (x0 ) = y10 = sinh(x0 ), g2 (x0 ) = y20 = cosh(x0 )

OC(y10 , y20 ) = s × (sinh(kln(2)) × y20 + cosh(kln(2)) × y10 ) ,

where s and k are identified from the decomposition x = s×kln(2)+x0 as described above.
Note that the output compensation function requires approximation of two functions, g1 (x0 )
and g2 (x0 ).
91

Sophisticated Range Reduction for sinh(x)

Using the property sinh(−x) = −sinh(x), we decompose the input x into |x| with x =
s × |x|, where s is the sign of the original input x. Next, we decompose |x| into three parts:

j
|x| = kln(2) + ln(2) + x0
64

where both k and j are integers, k ≥ 0, 0 ≤ j < 64, and x0 is a value in [0, ln(2)
64
).
|x|
Intuitively, k can be computed by identifying the integral part of ln(2)
, j can be computed
|x|
by identifying the first 6 bits in the fractional part of ln(2)
, and x0 can be computed with the
j j
remainder, i.e. x0 = |x| − kln(2) − 64
ln(2). If we denote K = kln(2) and J = 64
ln(2),
then sinh(|x|) can be computed with,

sinh(|x|) = SH × cosh(x0 ) + CH × sinh(x0 )

SH = sinh(K) × cosh(J) + cosh(K) × sinh(J)

CH = cosh(K) × cosh(J) + sinh(K) × sinh(J)

We pre-compute the values of sinh(J), cosh(J), sinh(K), and cosh(K) and store the
values in look-up tables. There are only 64 possible values for J. The values of K are also
limited, if we filter all inputs that cause the value of sinh(x) to go outside of the dynamic
range of the target finite-precision representation as special case inputs. For example, in the
case of 32-bit float, the real value of sinh(x) goes outside of the dynamic range of 32-bit
float for all inputs x where |x| > 130ln(2). Thus, the values of k do not exceed 129 and
the size of the look-up tables for sinh(K) and cosh(K) is 130 entries each.
Alternatively, we can choose to compute the values of sinh(K) and cosh(K) manually
using the following properties,

ex − e−x ex + e−x
sinh(x) = cosh(x) =
2 2
92

Algorithm 4.4: ComplexSinh uses the sophisticated range reduction strategy to approxi-
mate the elementary function sinh(x) for all inputs x.
1 Function ComplexSinh(x):
2 if x < 0 then s ← 1 else s ← 0
|x|
3 k ← b ln(2) c
4 t1 ← |x| − kln(2)
64
5 j ← bt1 × ln(2) c
j
6 x0 ← t1 − 64 ln(2) // Reduced input x0 is a value in [0, ln(2)64 ]
7 0 0
y1 ← g1 (x ) // g1 (x0 ) approximates sinh(x0 )
8 y20 ← g2 (x0 ) // g2 (x0 ) approximates cosh(x0 )
j j
9 SH ← sinh(kln(2))cosh( 64 ln(2)) + cosh(kln(2))sinh( 64 ln(2))
j j
10 CH ← sinh(kln(2))sinh( 64 ln(2)) + cosh(kln(2))cosh( 64 ln(2))
11 y ← (−1)s (SH × y20 + CH × y10 ) // Output compensation
12 return y

Then, sinh(K) and cosh(K) can be computed as follows,

sinh(K) = sinh(kln(2)) = 2k−1 − 2−k−1

cosh(K) = cosh(kln(2)) = 2k−1 + 2−k−1

Although sinh(K) and cosh(K) cannot be exactly represented with finite-precision


representations for all K, the correctly rounded value of sinh(K) and cosh(K) for a
higher-precision representations can be computed efficiently using bit-wise operations, ad-
ditions, or subtractions. This strategy can be effective for representations with a large
dynamic range such as the double type.
We approximate two functions g1 (x0 ) = sinh(x0 ) and g2 (x0 ) = cosh(x0 ) for the reduced
inputs x0 ∈ [0, ln(2)
64
). Algorithm 4.4 summarizes the sophisticated range reduction strategy
for sinh(x), showing the range reduction (lines 2-6), polynomial approximations (line 7-8),
and output compensation function (lines 9-11).
93

4.3.4 Hyperbolic Cosine Function (cosh(x))

Mathematically, the cosh(x) is defined for the input domain x ∈ (−∞, ∞). The range
reduction for cosh(x) uses a similar strategy as sinh(x) (described in Section 4.3.3) to
reduce the input x. We use a table-based range reduction inspired from CR-LIBM for
cosh(x) using the hyperbolic identities sinh(a + b) = sinh(a)cosh(b) + cosh(a)sinh(b)
and cosh(a + b) = cosh(a)cosh(b) + sinh(a)sinh(b).

Basic Range Reduction for cosh(x)

The cosh(x) function has a property, cosh(−x) = cosh(x). The result of cosh(x) can
be computed with cosh(|x|). Using this property, we decompose the original input x into
x = s × |x| where s is the sign of the input. Next, we decompose |x| into two parts,
|x| = kln(2) + x0 where k ≥ 0 is an integer and x0 is a value in x0 ∈ [0, ln(2)). If we
denote K = kln(2), then cosh(x) can be computed with,

cosh(x) = cosh(K + x0 ) = sinh(K)sinh(x0 ) + cosh(K)cosh(x0 )

We pre-compute and store the values of sinh(K) and cosh(K) in a look-up table and
approximate two functions g1 (x0 ) = sinh(x0 ) and g2 (x0 ) = cosh(x0 ) for the reduced in-
puts x0 ∈ [0, ln(2)). Finally, the range reduction x0 = RR(x), output compensation
y = OC(y10 , y20 ), and the two functions that we must approximate (y10 = g1 (x0 ) and
y20 = g2 (x0 )) for cosh(x) can be summarized as follows:

RR(x) = x0 = |x| − kln(2)

g1 (x0 ) = y10 = sinh(x0 ), g2 (x0 ) = y20 = cosh(x0 ),

OC(y10 , y20 ) = sinh(kln(2)) × y10 + cosh(kln(2)) × y20

where k is identified from the decomposition |x| = kln(2) + x0 as described above.


94

Algorithm 4.5: ComplexCosh uses the sophisticated range reduction strategy to approxi-
mate the elementary function cosh(x) for all inputs x.
1 Function ComplexCosh(x):
|x|
2 k ← b ln(2) c
3 t1 ← |x| − kln(2)
64
4 j ← bt1 × ln(2) c
j
5 x0 ← t1 − 64 ln(2) // Reduced input x0 is a value in [0, ln(2)64 ]
6 0 0
y1 ← g1 (x ) // g1 (x0 ) approximates sinh(x0 )
7 y20 ← g2 (x0 ) // g2 (x0 ) approximates cosh(x0 )
j j
8 SH ← sinh(kln(2))cosh( 64 ln(2)) + cosh(kln(2))sinh( 64 ln(2))
j j
9 CH ← sinh(kln(2))sinh( 64 ln(2)) + cosh(kln(2))cosh( 64 ln(2))
10 y ← SH × y10 + CH × y20 // Output compensation
11 return y

Sophisticated Range Reduction for cosh(x)

We first decompose the input x into x = s × |x| where s is the sign of the original input x.
The result of cosh(x) can be computed with cosh(|x|). Next, we decompose |x| into three
parts similar to how we decompose the input |x| for sinh(x):

j
|x| = kln(2) + ln(2) + x0
64

Both k and j are integers, k ≥ 0, 0 ≤ j < 64, and x0 is a real number value in [0, ln(2)
64
).
j
If we denote K = kln(2) and J = 64
ln(2), then cosh(|x|) can be computed with,

cosh(x) = SH × sinh(x0 ) + CH × cosh(x0 )

SH = sinh(K) × cosh(J) + cosh(K) × sinh(J)

CH = cosh(K) × cosh(J) + sinh(K) × sinh(J)

We pre-compute and store the values of sinh(K), cosh(K), sinh(J), and cosh(J) in
lookup tables. Alternatively, sinh(K) and cosh(K) can be computed manually using the
methodology described in Section 4.3.3. We approximate both g1 (x0 ) = sinh(x0 ) and
g2 (x0 ) = cosh(x0 ) for the reduced inputs x0 ∈ [0, ln(2)
64
). Algorithm 4.5 summarizes the
sophisticated range reduction strategy for sinh(x), showing the range reduction (lines 2-
95

Algorithm 4.6: BasicSinpi uses the basic range reduction strategy to approximate the
elementary function sinpi(x) for all inputs x.
1 Function BasicSinpi(x):
2 j ← x − 2b x2 c
3 k ← bjc
4 l ←j−k
5 if l ≤ 0.5 then l0 ← l else l0 ← 1.0 − l
6 x0 ← l 0 // The reduced input x0 is a value in [0, 0.5]
7 0
y ← g(x ) 0 // Polynomial approximation g(x0 ) approximates sinpi(x0 )
8 y ← (−1) y k 0 // Output compensation function
9 return y

5), polynomial approximations (line 6-7), and output compensation function (lines 8-10).

4.3.5 Trigonometric Sinpi Function (sinpi(x))

The sinpi(x) function is equivalent to sin(πx). Compared to the sin(x) function, per-
forming range reduction accurately for sinpi(x) is considered straight-forward, because
sinpi(x) has a period of length 2. Decomposing the original input x into a value v within
a period of oscillation (i.e., [0, 2]) can be computed exactly with finite precision arithmetic
since v can be computed with v = x − 2i for an integer i. Comparatively, sin(x) has a
period of size 2π. Decomposing an arbitrary original input x into a value v within a period
(i.e., [0, 2π]) exactly is impossible using finite precision arithmetic since v = x − 2π × i
for an integer i, but π cannot be represented exactly. Thus, many math libraries provide
approximations of sinpi(x) along with sin(x).

Basic Range Reduction for sinpi(x)

The range reduction of sinpi(x) for small representations leverages the periodicity of
sinpi(x) to reduce the input. First, we transform the input x into x = 2i + j where i
is an integer and j ∈ [0, 2). Then, sinpi(x) = sinpi(j) due to periodicity. Next, we de-
compose j into j = k + l where k ∈ {0, 1} is the integral part of j and l ∈ [0, 1) is the
96

Algorithm 4.7: ComplexSinpi uses the sophisticated range reduction strategy to approxi-
mate the elementary function sinpi(x) for all inputs x.
1 Function ComplexSinpi(x):
2 j ← x − 2b x2 c
3 k ← bjc
4 l ←j−k
5 if l ≤ 0.5 then l0 ← l else l0 ← 1.0 − l // l0 is a value in [0, 0.5]
6 n ← b512l c 0
n
7 x0 ← l0 − 512 // Reduced input x0 is a value in [0, 512 1
]
8 0
y1 ← g1 (x )0 // g1 (x ) approximates sinpi(x0 )
0

9 y20 ← g2 (x0 ) // g2 (x0 ) approximates cospi(x0 )


k n
 0 n
 0
10 y ← (−1) × sinpi 512 y2 + cospi 512 y1 // Output compensation
11 return y

fractional part. The value sinpi(j) can be computed with,

sinpi(j) = (−1)k × sinpi(l)

Third, we use the fact that sinpi(x) between x = [0.5, 1) is a mirror of sinpi(x) between
x = [0, 0.5] and decompose l into,


l if l ≤ 0.5
l0 =

1.0 − l if l > 0.5

With this decomposition, sinpi(l) can be computed with sinpi(l) = sinpi(l0 ). For small
representations, we use the value l0 as the reduced input (i.e. x0 = l0 ). Then, we can
approximate sinpi(x) using the formula,

sinpi(x) = (−1)k × sinpi(x0 )

Algorithm 4.6 summarizes the basic range reduction strategy for sinpi(x), showing the
range reduction (lines 2-6), polynomial approximation (line 7), and output compensation
function (line 8).

Sophisticated Range Reduction for sinpi(x)

The range reduction of sinpi(x) for large representations is based on the basic range reduc-
tion strategy and further reduces the transformed input l0 in the domain l0 ∈ [0, 0.5] using
97

table-based range reduction [135]. We split l0 into l0 = n


512
+x0 where n is a discrete variable
in the set {0, 1, . . . , 256} and x0 is a continuous variable in [0, 512
1
]. The value of sinpi(l0 )
can be computed using the trigonometric identity sinpi(a + b) = sinpi(a)cospi(b) +
cospi(a) + sinpi(b):
 n   n 
sinpi(l0 ) = sinpi cospi(x0 ) + cospi sinpi(x0 )
512 512
 
We pre-compute and store the values of sinpi 512
n
and cospi 512
n
in lookup tables (i.e.
total of 512 values). We approximate g1 (x0 ) = sinpi(x0 ) and g2 (x0 ) = cospi(x0 ) for the
reduced input domain x0 ∈ [0, 512
1
].
Algorithm 4.7 summarizes the sophisticated range reduction strategy for sinpi(x),
showing the range reduction (lines 2-7), polynomial approximations (line 8-9), and out-
put compensation function (line 10).

4.3.6 Trigonometric Cospi Function (cospi(x))

The cospi(x) function is equivalent to cos(πx). Many math libraries provide approxima-
tions for cospi(x) function along with cos(x). It is a periodic function where the period
of oscillation has a length of 2. Thus, cospi(x) can be approximated accurately even for
large inputs because the input x can be exactly reduced to a value v within a period of
oscillation (i.e. [0, 2]) using the formula v = x − 2i for an integer i. Similar to the sinpi(x)
function, the range reduction of cospi(x) for small representations leverage the periodicity
of cospi(x) and the range reduction for large representations use table-based technique.

Basic Range Reduction for cospi(x)

First, we transform the input x into x = 2i + j where i is an integer and j ∈ [0, 2). Due
to the periodicity, cospi(x) = cospi(j). Second, we decompose j into j = k + l where
k ∈ {0, 1} is the integral part of j and l ∈ [0, 1) is the fractional part. Using the property
cospi(1 + x) = −cospi(x), the value cospi(j) can be computed with,

cospi(j) = (−1)k × cospi(l)


98

Algorithm 4.8: BasicCospi uses the basic range reduction strategy to approximate the
elementary function cospi(x) for all inputs x.
1 Function BasicCospi(x):
2 j ← x − 2b x2 c
3 k ← bjc
4 l ←j−k
5 if l ≤ 0.5 then l0 ← l else l0 ← 1.0 − l
6 if l ≤ 0.5 then m ← 0 else m ← 1
7 x0 ← l 0 // The reduced input x0 is a value in [0, 0.5]
8 0
y ← g(x ) 0 // Polynomial approximation g(x0 ) approximates cospi(x0 )
9 y ← (−1)k (−1)m y 0 // Output compensation function
10 return y

Third, we reduce l using the fact that cospi(x) between [0.5, 1) is a mirror image of
cospi(x) between [0, 0.5] with the opposite sign. We decompose l into m and l0 where:
 

0 if l ≤ 0.5 
l if l ≤ 0.5
0
m= l =

1 if l > 0.5 
1.0 − l if l > 0.5

With this decomposition, cospi(l) can be computed with cospi(l) = (−1)m × cospi(l0 ).
For small representations, we use the value l0 as the reduced input (i.e., x0 = l0 ). Then, we
can approximate cospi(x) using the formula,

cospi(x) = (−1)k × (−1)m × cospi(x0 )

We approximate g(x0 ) = cospi(x0 ) for the reduced input domain x0 ∈ [0, 0.5].
Algorithm 4.8 summarizes the basic range reduction strategy for cospi(x), showing the
range reduction (lines 2-7), polynomial approximation (line 8), and output compensation
function (line 9).

Sophisticated Range Reduction for cospi(x)

The range reduction of cospi(x) for large representations is based on the basic range re-
duction strategy. The transformed input l0 ∈ [0, 0.5] is further reduced to a value in smaller
domain using table-based range reduction. One way of reducing l0 into smaller value is to
decompose it into l0 = a + b, where a >= 0 and b >= 0, and use the trigonometric identity
99

cospi(a + b) = cospi(a)cospi(b) − sinpi(a)sinpi(b). However, this strategy can cause


cancellation error especially when the values of cospi(a)cospi(b) and sinpi(a)sinpi(b) are
similar in magnitude and has the same sign.
Instead, we transform l0 into x0 and n such that,


 1
x0 if l0 < (in this case, n = 0)
512
l0 =


 n
− x0 otherwise
512

where n is an integer value in the set {0, 1, 2, . . . , 256} and x0 is a fractional value in
1
[0, 512 ]. Then, cospi(l0 ) can be computed with the trigonometric identity cospi(a − b) =
cospi(a)copsi(b) + sinpi(a)sinpi(b),


 1
cospi(x0 ) (because n = 0) if l0 <
0 512
cospi(l ) =


cospi( n ) × cospi(x0 ) + sinpi( n ) × sinpi(x0 ) otherwise
512 512

This formula is monotonically increasing for all inputs x0 and does not experience cancella-
tion error because cospi( 512
n n
), sinpi( 512 ), cospi(x0 ), and sinpi(x0 ) are non-negative for all

possible combination of n and x0 . We pre-compute and store the values of sinpi 512 n
and

cospi 512n
in lookup tables for a total of 514 values. We approximate g1 (x0 ) = sinpi(x0 )
and g2 (x0 ) = cospi(x0 ) for the reduced input domain x0 ∈ [0, 512
1
].
Algorithm 4.9 summarizes the sophisticated range reduction strategy for cospi(x), show-
ing the range reduction (lines 2-8), polynomial approximations (line 9-10), and output
compensation function (lines 11-12).

4.4 Our Approach For Generating Polynomials With Range Reduction

Our goal in this chapter is to generate a polynomial approximation P (x) that produces the
correctly rounded result of f (x) in the target representation T using a particular rounding
mode rm when used with range reduction strategies. Depending on the range reduction
100

Algorithm 4.9: ComplexCospi uses the sophisticated range reduction strategy to approxi-
mate the elementary function cospi(x) for all inputs x.
1 Function ComplexCospi(x):
2 j ← x − 2b x2 c
3 k ← bjc
4 l ←j−k
5 if l ≤ 0.5 then l0 ← l else l0 ← 1.0 − l // l0 is a value in [0, 0.5]
6 if l ≤ 0.5 then m ← 0 else m ← 1
7 n ← d512l0 e
8 if l0 < 512
1
then x0 ← l0 else x0 ← 512
n
− l0 // x0 is a value in [0, 512
1
]
9 0
y1 ← g1 (x )0 0
// g1 (x ) approximates sinpi(x )0

10 y20 ← g2 (x0 ) // g2 (x0 ) approximates cospi(x0 )


11 if l0 < 512
1
then y ← (−1)k (−1)m y20 // Output compensation
 0  0
12 else y ← (−1)k (−1)m cospi 512 n n
y2 + sinpi 512 y1
13 return y

strategy, the output compensation function may be univariate (requires us to approximate


a single function, i.e., range reduction used for loga (x)) or multivariate (requires us to ap-
proximate multiple functions, i.e., range reduction used for sinh(x)). In this section, we
describe how to generate polynomial approximations of f (x) with univariate output com-
pensation functions to explain our key insights and approach. In Section 4.5, we present a
more general technique to handle multivariate output compensation functions.
We use the notation AH (x) to represent the approximation of the elementary function
f (x) produced with our approach while evaluating all internal computation with the higher
precision representation H. The result of AH (x) is then rounded to the target representation
T to produce the final result. AH (x) is composed of three components: (1) the range
reduction function x0 = RRH (x) that reduces the original input into the reduced input in a
smaller domain, (2) the polynomial approximation y 0 = PH (x0 ) using the reduced input x0 ,
and (3) the output compensation function y = OCH (y 0 ) that compensates the output y 0 to
produce the approximation of f (x). More formally,

AH (x) = OCH (PH (RRH (x)))

The result of AH (x) (i.e., y) is rounded to T to produce the final result. Given a range
reduction strategy, the task of creating AH (x) that produces the correctly rounded result of
101

f (x) for all inputs involves generating an appropriate polynomial PH (x0 ).


There are two challenges in generating PH (x0 ) such that the final result of AH (x) pro-
duces the correctly rounded results. First, the polynomial must account for the range reduc-
tion and the output compensation function. Specifically, PH (x0 ) must produce a value that
rounds to the correctly rounded result of f (x) when used with the output compensation
function OCH (y 0 ). If we were to approximate f (x) without range reduction (i.e., Chap-
ter 3), the rounding interval [l, h] of each input x directly defined the values that our desired
polynomial approximation should produce. However, with range reduction, the rounding
interval for each input defines the constraint on the output of the AH (x) as a whole (i.e.,
AH (x) ∈ [l, h] for each x). The original input x is transformed via the range reduction
function and the output of the PH (x) is altered by the output compensation function. We
can no longer directly use the constraints (x, [l, h]) to generate polynomial approximations.
Second, the polynomial also needs to account for the numerical error in evaluating the
range reduction and output compensation function in H(x).
Hence, we infer the inputs x0 for PH (x0 ) and the range of values [l0 , h0 ] that PH (x0 )
should produce based on the original input x, the rounding interval [l, h], the range re-
duction function, and the output compensation function. We call the input x0 the reduced
input and the interval [l0 , h0 ] the reduced interval. The reduced inputs and their correspond-
ing reduced intervals define input-and-output constraints for PH (x0 ) such that the result of
PH (x0 ) used with the output compensation function produces correctly rounded results. We
call the constraint pair (x0 , [l0 , h0 ]) the reduced constraint. Once the reduced constraints are
identified for all inputs, we can then encode the problem of generating a polynomial that
satisfies all reduced constraints into a system of linear inequalities and use an LP solver to
generate PH (x0 ).
102

Algorithm 4.10: Our approach to generate a polynomial approximation PH (x) that pro-
duces the correctly rounded result for all inputs when used with univariate output compen-
sation function. On successfully finding a polynomial, it returns (true, PH ). Otherwise,
it returns (false, DNE) where DNE means that the polynomial Does-Not-Exist. Functions,
CalcReducIntervals and CombineReducIntervals are shown in Algorithm 4.11
and Algorithm 4.12, respectively.
1 Function CorrectlyRoundedPoly(f , T, H, X, d, RRH , OCH ):
2 L ← CalcRndIntervals(f , T, H, X)
3 if L = ∅ then return (false, DNE)
4 L0 ← CalcReducIntervals(L, H, RRH , OCH )
5 if L0 = ∅ then return (false, DNE)
6 L∗ ← CombineReducIntervals(L0 )
7 if L∗ = ∅ then return (false, DNE)
8 S, PH ← GeneratePoly(L∗ , d)
9 if S = true then return (true, PH )
10 else return (false, DNE)

4.4.1 High-level Overview of the RL IBM Approach For Univariate Output Compen-
sation Functions

Our approach for generating the the polynomial PH (x0 ) is shown in Algorithm 4.10. Our
approach assumes the existence of an oracle, which produces the real value of f (x). The
polynomial approximation is closely related to the specific range reduction strategy. Hence,
we also require the range reduction RRH (x0 ) and the output compensation function OCH (y 0 )
from the math library developer. Additionally, our approach requires the output compen-
sation function OC(y 0 ) to be invertible (i.e., continuous and bijective) to automatically
identify the reduced intervals. In our experience, we found that all univariate output com-
pensation functions that we have used are invertible. Finally, the degree of the polynomial
is an input provided by the math library designer.
Our approach extends from the approach described in Chapter 3. First, we compute
the correctly rounded result of f (x) (i.e. RNT,rm (f (x))) for each input x using our oracle.
Then, we identify the rounding interval [l, h] ∈ H where all values in the interval rounds
to the correctly rounded result using the steps described in Chapter 3 (line 2). The pair
(x, [l, h]) specifies that the final result of AH (x) must produce a value in [l, h] (i.e., AH (x) ∈
103

[l, h]) for the result to round to the correctly rounded result.
Second, for each input x, we compute the reduced input x0 using range reduction. Then,
we infer the reduced interval [l0 , h0 ] for each pair (x, [l, h]) in L using the output com-
pensation function (line 4). Abstractly, the reduced intervals [l0 , h0 ] constrains the output
of the polynomial PH (x0 ) such that the result, when used with the output compensation
function, will produce a value in the rounding interval [l, h]. The final result of the out-
put compensation function will then round to the correctly rounded result of f (x) in T.
CalcReducIntervals in Algorithm 4.11 returns a list L0 containing the the pair of
reduced input and interval (x0 , [l0 .h0 ]) for each pair (x, [l, h]) in L.
Third, multiple inputs from the original input domain x can map to the same reduced
input after range reduction. Hence, there can be multiple reduced intervals corresponding
to the same reduced input x0 . Thus, we combine all reduced intervals that correspond to the
same reduced input x0 by identifying the common region and produce a pair (x0 , [l∗ , h∗ ]) for
each reduced input x0 (line 6). CombineReducIntervals in Algorithm 4.12 returns
a list L∗ containing the constraint pair (x0 , [l∗ , h∗ ]). Finally, we generate a polynomial of
degree d using LP formulation so that all constraints in L∗ are satisfied using the approach
described in Section 3.3.3 (line 8).
Note that our approach for generating PH (x) with range reduction involves two addi-
tional steps compared to the steps in Chapter 3. The first step that identifies the rounding
interval for AH (x) (CalcRndIntervals) and the last step that generates a polynomial
that satisfies certain constraints (GeneratePoly) are identical. The two additional steps
involve identifying the reduced inputs, reduced intervals, and combining reduced intervals.
In the remainder of the section, we focus our efforts in presenting these steps.

4.4.2 Identifying Reduced Inputs and Reduced Intervals

Once we compute the rounding intervals [l, h] for each original input x, the intervals con-
strain the values that our entire approximation AH (x) should produce such that it produces
104

the correctly rounded result of f (x) (i.e. RNT (AH (x)) = RNT (f (x))), for each input
x ∈ X. Our next step is to identify the inputs x0 to the polynomial PH (x0 ) and the range of
values [l0 , h0 ] that the polynomial should produce such that AH (x) produces the correctly
rounded results. We call x0 the reduced input and the interval [l0 , h0 ] the reduced interval.
The original input x is reduced to a reduced input x0 using the range reduction function
(x0 = RR(x)). The reduced input x0 is used as the input to the polynomial PH (x0 ) which
produces the value y 0 . Hence, the input to the polynomial can be computed by applying the
range reduction x0 = RR(x) to each original input x.
Next, we need to identify the reduced interval for each reduced input x0 . The output of
the polynomial y 0 is used as the input to the output compensation function y = OC(y 0 ). The
resulting value y is the final output of AH (x). More specifically, AH (x) = OCH (PH (x0 )).
To produce the correctly rounded result, y has to be a value in the rounding interval [l, h].
Hence, we use the inverse of the output compensation function (y 0 = OC −1 (y)) along with
the rounding interval to infer the reduced interval. This strategy is feasible if the output
compensation function is continuous and bijective.
With real numbers, it is straightforward to compute the reduced interval by using the
inverse output compensation function (i.e., [l0 , h0 ] = [OC −1 (l), OC −1 (h)]). The result of
the output compensation evaluated in real numbers using any value in this reduced interval
will round to the correctly rounded results. However, the output compensation function is
evaluated in a finite precision representation, H, which may cause numerical errors. Thus,
we take numerical errors into account by restricting the reduced interval and ensure that the
result of the output compensation function, when evaluated in H, produces the correctly
rounded results for all values in the reduced interval.
Algorithm 4.11 presents our approach in computing the reduced constraints (x0 , [l0 , h0 ])
for each (x, [l, h]) ∈ L . For each original input x, we use the range reduction function
to compute the reduced input x0 (line 4). To compute the reduced interval [l0 , h0 ], we first
evaluate the values v1 = OCH−1 (l) and v2 = OCH−1 (h), where OCH−1 (y) evaluates the
105

Algorithm 4.11: CalcReducIntervals computes the reduced input x0 and the re-
duced interval [l0 , h0 ] for each constraint pair (x, [l, h]) in L. The reduced constraint pair
(x0 , [l0 , h0 ]) specifies the bound on the output of PH (x0 ) such that it produces the correct
value for the input x when used with the output compensation function.
1 Function CalcReducIntervals(L, H, RRH , OCH ):
2 L0 ← ∅
3 foreach (x, [l, h]) ∈ L do
4 x0 ← RRH (x)
// Set initial reduced interval
5 if OCH is an increasing function then
6 [α, β] ← [OCH−1 (l, x), OCH−1 (h, x)]
7 else [α, β] ← [OCH−1 (h, x), OCH−1 (l, x)]
// Increase the lower bound if necessary
8 while OCH (α, x) ∈
/ [l, h] do
9 α ← GetSuccVal(α, H)
10 if α > β then return ∅
11 end
// Decrease the upper bound if necessary
12 while OCH (β, x) ∈ / [l, h] do
13 β ← GetPrecVal(β, H)
14 if α > β then return ∅
15 end
16 L0 ← L0 ∪ {(x0 , [α, β])}
17 end
18 return L0

inverse of the output compensation function with H. We set a candidate reduced interval
[α, β] = [v1 , v2 ] if the output compensation function is an increasing function (lines 5-6)
or [α, β] = [v2 , v1 ] if the output compensation function is a decreasing function (line 7).
Next, we restrict the interval [α, β] to account for the numerical error that occurs when
evaluating OCH (y 0 ). We verify whether the output compensated value of the lower bound
(i.e., OCH (α)) results in [l, h]. If it does not, we replace α with the succeeding value in H.
We repeat the process until OCH (α) results in the rounding interval (lines 8-10). Similarly,
we verify whether the output compensated value of the upper bound (i.e., OCH (β)) results
in [l, h]. If it does not, we replace β with the preceding value in H. We repeat the process
until OCH (β) results in [l, h] (lines 11-13). If α > β at any point in our process, then
106

it indicates that there does not exist a polynomial that will produce the correctly rounded
result when used with the output compensation function. In such a case, the math library
developer must design a different range reduction strategy or increase the precision of H.
If the resulting interval [α, β] 6= ∅, then [α, β] is the reduced interval (i.e., [l0 , h0 ] =
[α, β]). The reduced constraint pair (x0 , [l0 , h0 ]) for each (x, [l, h]) specifies the constraint
on the values that PH (x0 ) should produce, such that AH (x) ∈ [l, h]. We create a list L0
containing all reduced constraints (line 16).

4.4.3 Combining the Reduced Constraints

Each reduced constraint (x0i , [li0 , h0i ]) ∈ L0 corresponds to a constraint (xi , [li , hi ]) ∈ L and
specifies a bound on the output of PH (x0i ) ∈ [li0 , h0i ] to ensure that AH (xi ) ∈ [li , hi ]. The
range reduction function reduces the original input xi from the entire input domain of f (x)
into a reduced input x0i in a smaller reduced domain. Hence, it is possible for multiple orig-
inal inputs to be range reduced to the same reduced input. More formally, there can exist
multiple constraints (x1 , [l1 , h1 ]), (x2 , [l2 , h2 ]), · · · ∈ L where x∗ = RR(x1 ) = RR(x2 ) =
. . . . Consequently, L0 can contain multiple reduced constraints with the same reduced input
(x∗ , [l10 , h01 ]), (x∗ , [l20 , h02 ]), · · · ∈ L0 . These reduced constraints specify that the polynomial
PH (x∗ ) must produce a value in [l10 , h01 ] to guarantee that AH (x1 ) ∈ [l1 , h1 ] and it should
also produce a value in [l20 , h02 ] to guarantee that AH (x2 ) ∈ [l2 , h2 ]. Hence, for each unique
reduced input x∗ , the polynomial must satisfy all reduced constraints corresponding to x∗ ,
i.e., PH (x∗ ) ∈ [l10 , h01 ] ∩ [l20 , h02 ] ∩ . . . . If there are multiple reduced constraints with the
same reduced input, we combine the reduced intervals corresponding to the same reduced
inputs by computing the common interval.
The function CombineReducIntervals in Algorithm 4.12 describes our technique
that combines all reduced constraints with the same reduced input. For each unique reduced
input x∗ in the set of reduced constraints L0 (line 2-4), we identify all reduced intervals that
correspond to x∗ (line 5). Then, we identify the combined interval [l∗ , h∗ ] by computing
107

Algorithm 4.12: CombineReducIntervals combines any reduced constraints


with the same reduced input, i.e. (x01 , [l10 , h01 ]) and (x02 , [l20 , h02 ]) where x01 = x02 into
a single combined constraint (x01 , [l∗ , h∗ ]) by computing the common interval range in
[l10 , h01 ] and [l20 , h02 ].
1 Function CombineReducIntervals(L0 ):
2 X ∗ ← {x0 | (x0 , I 0 ) ∈ L0 }
3 L∗ ← ∅
4 foreach x∗ ∈ X ∗ do
5 Ω ← {[l0 , hT 0 ] | (x∗ , [l0 , h0 ]) ∈ L0 }

6 [l∗ , h∗ ] ← [l0 ,h0 ]∈Ω [l0 , h0 ]


7 if [l∗ , h∗ ] = ∅ then return ∅
8 L∗ ← L∗ ∪ {(x∗ , [l∗ , h∗ ])}
9 end
10 return L∗

the intersection between the reduced intervals (line 6). If the combined interval is empty,
then it indicates that there is no value that PH (x0 ) can produce, such that it produces the
correctly rounded results of f (x) for all inputs when used with output compensation func-
tion. Otherwise, we create a combined constraint pair (x∗ , [l∗ , h∗ ]) for each unique reduced
input x∗ and produce a list of constraints L∗ (line 8).
Each combined constraint (x0 , [l0 , h0 ]) ∈ L∗ specifies that PH (x0 ) should satisfy l0 ≤
PH (x0 ) ≤ h0 . This constraint ensures that the result of the output compensation function
using PH (x0 ) produces the correctly rounded result for all inputs,

RNT,rm (OCH (PH (x0 ))) = RNT,rm (f (x))

Our final step encodes the problem of solving for a polynomial that satisfies all combined
constraints in L∗ into a system of linear inequalities and uses an LP solver to generate a
polynomial that satisfies all constraints in L∗ . We use the same methodology described in
Chapter 3 (GenreatePoly in Algorithm 3.4) to generate such polynomial.

4.5 The RL IBM Approach For Multivariate Output Compensation Functions

We now describe our approach to generate a polynomial approximation PiH (x0 ) that pro-
duce the correctly rounded results of an elementary function f (x) in our target representa-
108

tion T when used with multivariate output compensation functions. Some range reduction
strategy (i.e., sinh(x)) uses multivariate output compensation function, where the output
compensation functions require multiple functions to be approximated,

y 0 = OCH (g1 (x0 ), g2 (x0 ), . . . )

where gi (x0 ) are the functions that we have to approximate with polynomial approxima-
tions. Our goal is to synthesize the polynomial approximations PiH (x0 ) for each gi (x0 ).
With multiple polynomial approximations and multivariate output compensation function,
the entire approximation of f (x) can be formally defined as,

AH (x) = OCH (P1H (x0 ), P2H (x0 ), . . . )

where x0 is the reduced input computed with x0 = RRH (x). Similar to the approach we
used in generating polynomial approximation for univariate output compensation function,
our strategy involves identifying the reduced intervals [li0 , h0i ] for each PiH (x) and for each
reduced input x0 . The reduced intervals for each PiH (x0 ) define the values that PiH (x0 )
should produce such that the result of the output compensation function produces the cor-
rectly rounded results. Then, the reduced inputs and the reduced intervals for each PiH (x0 )
can be used to frame the problem of generating PiH (x0 ) that satisfies the reduced constraint
pair (x0 , [li0 , h0i ]) as a linear programming problem and use an LP solver to generate each
polynomial.

Challenges in identifying reduced intervals with multivariate output compensation func-


tion. There are two main challenges in generating reduced intervals for each Pi (x0 ) in
the context of multivariate output compensation function. First, the output compensation
function is a multivariate function with one output (i.e., y = OC(y10 , y20 , . . . ) ). Hence, the
inverse function does not exist, based on the Inverse Function Theorem [124]. We cannot
use the same approach described in Section 4.4 to infer the reduced intervals with inverse
output compensation function. Second, we must ensure that the size of the reduced in-
109

Algorithm 4.13: Our approach to generate polynomials PiH (x0 ) that approximate gi (x0 ) in
the multivariate output compensation function. On successfully finding polynomials, it returns
a polynomial for each gi (x0 ), where the output of the polynomial used with the output compen-
sation function produces the correctly rounded results. Otherwise, it returns (false, DNE) where
DNE means that the polynomial Does-Not-Exist. CalcReducIntervalsMulti is shown
in Algorithm 4.14.
1 Function CorrectlyRoundedPolyMulti(f , T, H, X, d, RRH , OCH ):
2 L ← CalcRndIntervals(f , T, H, X)
3 if L = ∅ then return (false, DNE)
4 L0 ← CalcReducIntervalsMulti(L, H, RRH , OCH )
5 Result ← ∅
6 foreach (gi , L0i ) ∈ L0 do
7 if L0i = ∅ then return (false, DNE)
8 L∗i ← CombineReducIntervals(L0i )
9 if L∗i = ∅ then return (false, DNE)
10 S, PiH ← GeneratePoly(L∗i , d)
11 if S = f alse then return (false, DNE)
12 Result ← Result ∪ (gi , Pi )
13 end
14 return (true, Result)

terval [li0 , h0i ] for each Pi (x0 ) is large enough to provide sufficient freedom for generating
polynomial approximations.
To address these challenges, we initially set the reduced interval for each PiH (x0 ) with
a singleton value [vi , vi ]. The intuition is to identify at least a single value where the output
compensation function produces the correctly rounded results. Each value vi for PiH (x0 )
guarantees that the result of the output compensation function using the singleton values
will produce the correctly rounded result of f (x). Next, we incrementally widen the re-
duced intervals [li0 , h0i ] for each PiH (x0 ) at the same time to ensure that the relative size of
the reduced intervals is equal to each other. If the output compensation function is mono-
tonically increasing and our methodology returns reduced intervals, then evaluating the
output compensation function using any values in the reduced intervals is guaranteed to
produce the correctly rounded result.
110

4.5.1 High-level Overview of the RL IBM Approach For Multivariate Output Com-
pensation Function

The top-level algorithm for generating polynomials that produce the correctly rounded re-
sults when used with multivariate output compensation function is shown in Algorithm 4.13.
Our algorithm generates a polynomial PiH (x0 ) for each gi (x0 ) that must be approximated
to use the output compensation function (line 12-13). If we are unable to find a polynomial
for a particular gi (x0 ) (line 11), then the developer of the math library should explore a
different range reduction strategy, use a higher precision representation for H, or generate
a higher degree polynomial.
The approach to generate polynomials for multivariate output compensation function
has four main steps. First, we compute the correctly rounded result of f (x) for each in-
put x using an oracle (line 2). Then, we identify the rounding interval [l, h] ∈ H where
all values in the interval rounds to the correctly rounded result. The first step returns a
list L containing the constraint pair (x, [l, h]) for each input x. Each constraint (x, [l, h])
constrains the output of our entire approximation AH (x) such that it produces the correctly
rounded result of f (x).
Second, for each constraint (x, [l, h]), we identify the reduced input x0 and the reduced
intervals [li0 , h0i ] for each polynomial PiH (x0 ) at the same time (line 4). The reduced con-
straint pairs (x0 , [li0 , h0i ]) for each PiH (x0 ) specify that if PiH (x0 ) produces a value in the re-
duced interval, then the polynomials used with the output compensation function will pro-
duce the correctly rounded results. CalcReducIntervalsMulti in Algorithm 4.14
returns a list L0 containing the list of reduced constraints L0i for each function gi (x0 ).
Third, for each list L0i , we combine all reduced intervals that correspond to the same
reduced input x0 using the strategy described in Section 4.4.3 to make sure that there is one
combined interval [li∗ , h∗i ] for each unique x0 (line 7). Finally, we generate the polynomials
PiH (x0 ) that approximates gi (x0 ) using the list of combined constraints L∗i . In the remainder
of this section, we present our algorithm to identify reduced intervals for each polynomial
111

Algorithm 4.14: CalcRedIntervalsMulti computes the reduced interval [li0 , h0i ] and
the reduced input x0 corresponding to input x for each function gi (x0 ) used in the output com-
pensation. If our polynomial approximation for gi (x0 ) produces a value in [li0 , h0i ], then we can
generate the correctly rounded result for x. The function returns a list with (x0 , [li0 , h0i ]) for each
gi .
1 Function CalcReducIntervalsMulti(L, RRH , OCH ):
2 if OCH is not monotonic function then return ∅
3 G ← {list of functions used in OCH }
4 foreach gi ∈ G do Li ← ∅
5 foreach (x, [l, h]) ∈ L do
6 x0 ← RRH (x)
7 V ← {RNH (gi (x0 )) | gi ∈ G}
8 if OCH (V, x) ∈ / [l, h] then return ∅
9 I 0 ← {[v, v] | v ∈ V } // Singleton reduced interval for each gi (x0 )
10 while true do
// Decrease the lower bounds li0 simulataneously
11 A ← {GetPrecVal(li0 , H) | [li0 , h0i ] ∈ I 0 }
12 if OCH (A, x) ∈
/ [l, h] then break
13 I 0 ← {[GetPrecVal(Ii0 , H), h0i ] | [li0 , h0i ] ∈ I 0 }
14 end
15 while true do
// Increase the upper bounds h0i simulataneously
16 B ← { GetSuccVal(h0i , H) | [li0 , h0i ] ∈ I 0 }
17 if OCH (B, x) ∈ / [l, h] then break
18 I 0 ← {[Ii0 , GetSuccVal(h0i , H)] | [li0 , h0i ] ∈ I 0 }
19 end
20 foreach [li0 , h0i ] ∈ I 0 do
21 Li ← Li ∪ (x0 , [li0 , h0i ])
22 end
23 end
24 return {(gi , Li ) | gi ∈ G}

PiH (x0 ).

4.5.2 Identifying Reduced Inputs and Intervals For Each Polynomial

After computing the rounding intervals from the first step (line 2 in Algorithm 4.13), we
have a list of constraints (x, [l, h]) that our entire approximation function AH (x) must sat-
isfy for each input x to produce the correctly rounded results. The multivariate output
compensation function requires us to approximate multiple functions gi (x0 ). The polyno-
mial approximations of each gi (x0 ) must produce values such that evaluating the output
112

compensation function using these values produces the correctly rounded result of f (x).
The goal of our second step is to identify the reduced inputs and the reduced intervals for
each PiH (x0 ).
Algorithm 4.14 presents the process of identifying the reduced input and intervals.
The reduced inputs can be computed using the range reduction function on each input
x (line 6). To compute the reduced intervals, we use a two-step approach. First, for
each reduced input x0 , we identify a single value vi for each gi (x0 ) using an oracle. Us-
ing vi ’s with the output compensation function produces the correctly rounded result (i.e.,
OCH (v1 , v2 , . . . ) ∈ [l, h]). The purpose of vi ’s is to be the starting point in identifying the
reduced intervals. If the representation H has high enough precision, then using the cor-
rectly rounded result of gi (x0 ) in H (i.e., vi = RNH (gi (x0 ))) with the output compensation
function will produce the correctly rounded result (line 7). If the result of output com-
pensation using RNH (gi (x0 )) does not produce the correctly rounded result, then either the
precision of H should be increased or the range reduction strategy should be redesigned.
An alternative approach that we have found useful in some cases is to search for a value in
the vicinity of gi (x0 ) in H that can be used for vi . In our experience, the values (i.e., vi ) that
produce the correctly rounded result when used with output compensation were at most
one or two H values away from gi (x0 ). Using vi , we create a candidate reduced interval
[li0 .h0i ] = [vi , vi ] for each gi (x0 ) (line 9). These singleton intervals satisfy the most impor-
tant invariant of a reduced interval: Any value within the reduced intervals (i.e., vi ’s) must
guarantee to produce the correctly rounded result when used with the output compensation
function.
Based on the candidate reduced interval [li0 , h0i ] for each PiH (x0 ), our next step is to
identify the maximum amount of freedom available to generate PiH (x0 ). We check if we
can decrease the lower bound of the intervals for each PiH (x0 ) at the same time. We itera-
tively check if using the preceding values of li0 in H with the output compensation function
produces a value in the rounding interval [l, h]. If it does, then we widen the reduced in-
113

terval by replacing li0 with the preceding value of li . We repeat the process until using the
preceding values of li0 with the output compensation function does not produce a value in
[l, h] (lines 10-13).
Similarly, we check if we can increase the upper bound of the intervals for each PiH (x0 ).
We compute the succeeding value of h0i and check if the output compensation function using
these values will produce a value in [l, h]. If it does, then we widen the reduced interval by
replacing each h0i with the succeeding value of h0i . We repeat this process until the output
compensation function using the succeeding value of h0i does not produce a value in [l, h]
(lines 14-17). This process identifies the reduced intervals [li0 , h0i ] for each PiH (x0 ) where
using any value within the interval with the output compensation function produces the
correctly rounded results. Additionally, it ensures that the relative size of [li0 , h0i ] for each
PiH (x0 ) for a given input x is the same, providing a similar amount of freedom between
each polynomial.
Identifying the lower bound and the upper bound of the reduced inputs can be computed
more efficiently. Because the output compensation function is monotonically increasing,
the final lower bounds can be identified by performing a binary search between vi and the
minimum representable value of H. Similarly, the final upper bounds can be identified by
performing a binary search between vi and the maximum representable value of H. Thus,
widening the reduced intervals requires at most n iterations where n is the number of bits
in H.
Finally, we store the reduced constraints (x0 , [li0 , h0i ]) for each function gi (x0 ) in a list L0i
(line 19) and return L0i for all gi (x0 ) (line 20). Our approach then combines the reduced
intervals that map to the same reduced inputs and produces a polynomial that satisfies all
constraints using LP formulation.
114

4.6 Summary

A common strategy to generate an efficient polynomial approximation of an elementary


function f (x) is to use range reduction strategies, which reduce the input domain of the
function into a smaller domain. Thus, a low degree polynomial can be used to produce
correctly rounded results. Generating polynomial approximations that produce correctly
rounded results when used with range reduction is a difficult task. Both the range reduction
function and the output compensation function are evaluated in finite precision representa-
tions and can experience numerical errors.
In this chapter, we propose a novel technique to generate efficient and correctly rounded
approximation functions in the presence of range reduction. Our key insight is to infer the
inputs and the range of values that our polynomial approximation should produce using
the rounding interval of the correctly rounded results of f (x), the range reduction function,
and the output compensation functions. Once such a list of inputs and the range of out-
put values for the desired polynomial is identified, then we can use an LP formulation to
specify the constraints for our polynomial and use an LP solver to generate the polynomial,
as described in Chapter 3. When our approach successfully generates polynomials, then
the polynomials used with the output compensation function are guaranteed to produce the
correctly rounded results for all inputs. Our approach supports various complex range re-
duction strategies where the output compensation functions require approximations of one
or more functions. We also propose improved range reduction strategies for several ele-
mentary functions to eliminate cancellation errors in the output compensation functions,
increasing the numerical stability. Our methodology automatically generates a polynomial
given an elementary function f (x), the range reduction function, and the output compen-
sation function, thus requiring little assistance from math library developers.
115

CHAPTER 5
THE RLIBM APPROACH FOR 32-BIT REPRESENTATIONS

In this chapter, we scale the RL IBM approach described in the previous chapters to produce
correctly rounded results for 32-bit representations. The primary challenge in generating
polynomial approximations for 32-bit representations is that there are four billion inputs
and the result must be significantly more accurate compared to 16-bit representations. To
address these challenges, we propose two enhancements. First, we use counterexample
guided polynomial generation to handle millions of constraints. Second, we generate ef-
ficient piecewise polynomials by splitting the input domain using the bit-pattern of the
inputs.

5.1 Scaling Our Approach to 32-Bit Representations

The IEEE-754 standard 32-bit float type is one of the most commonly used datatypes for
three reasons. (1) Most commercial hardware supports efficient float arithmetic. (2) It pro-
vides a reasonable amount of dynamic range (approximately [10−45 , 1038 )) and precision
(roughly 6 decimal digits) for many scientific applications. (3) It requires half the storage
space compared to the 64-bit double, another configuration supported in common hard-
ware, providing faster data transfer. Unfortunately, widely used math libraries (i.e., glibc’s
and Intel’s libm) do not produce correctly rounded results of elementary functions for all
inputs in the 32-bit float.
There are two primary challenges in scaling the RL IBM approach to 32-bit represen-
tations. First, 32-bit representations have four billion inputs. Even after range reduction,
there may be millions of reduced inputs. LP formulations with millions of constraints are
beyond the capabilities of the state-of-the-art LP solvers. Second, it may not be possible
to generate a single polynomial with a reasonable degree that satisfies all the constraints.
116

Although a high-degree polynomial can produce correct results for all 32-bit inputs, such
a polynomial may not be ideal from a performance viewpoint. It is desirable to use lower-
degree polynomial considering the performance of the math library. Hence, we extend the
RL IBM approach with two techniques. First, we propose counterexample guided poly-
nomial generation technique that samples inputs to handle a large number of constraints.
Second, we generate piecewise polynomials to create efficient polynomial approximations.

Piecewise polynomials. Given a list of inputs in the domain [a, b], we first try to generate
a single polynomial that produces the correctly rounded result for all inputs using the coun-
terexample guided polynomial generation. If we are unable to generate a polynomial or
the polynomial does not satisfy our performance constraint, then we split the input domain
[a, b] into sub-domains [a, b0 ) and [b0 , b] and generate a polynomial for each sub-domain. To
identify the inputs that belong to each sub-domain, we use bits in the bit-string of the input.
If we still cannot generate polynomials with a reasonable degree for each sub-domain, then
we iteratively split the original input domain into smaller sub-domains until we can gener-
ate a piecewise polynomial that produces the correctly rounded results for all inputs. By
using bit-patterns to identify sub-domains and generating low degree polynomial for each
sub-domain, our strategy produces efficient piecewise polynomials.

Counterexample guided polynomial generation. Even after splitting the input domain,
there may still be millions of inputs in each sub-domain. Thus, we sample a small por-
tion of inputs in the sub-domain to generate a polynomial. Our intuition is that it is not
necessary to reason about all constraints to generate a polynomial that produces correctly
rounded results for all inputs. It is sufficient to reason about inputs with highly constrained
intervals. Our counterexample guided polynomial generation technique uses the following
process. First, we sample some inputs in the sub-domain and generate a candidate poly-
nomial that satisfies the constraints in the sample. Next, we check whether the candidate
polynomial satisfies all constraints in the sub-domain. We add any constraints not satisfied
117

S E1 E2 F1 F2 F3

sign exponent mantissa

Figure 5.1: Bit-string representation of a 6-bit FP (FP6) with 2 exponent bits and 3 mantissa bits.

by the polynomial to the sample. We repeat the process until the generated polynomial sat-
isfies all constraints in the sub-domain or the LP solver determines that it cannot generate
a polynomial that satisfies all constraints in the sample. If the LP solver cannot generate
a polynomial or there are too many inputs in the sample, we split the input domain into
smaller sub-domains and repeat the process. The counterexample guided polynomial gen-
eration is inspired by the counterexample guided inductive synthesis (CEGIS) [72, 123]
used in program synthesis.

5.2 Illustration

We show an end-to-end example of creating piecewise polynomials that produce the cor-
rectly rounded results of ln(x) in a 6-bit FP representation (FP6) that has two exponent bits
and three mantissa bits (Figure 5.1). In our illustration, we use the double representation
to perform all internal computation. This example is illustrated for pedagogical reasons
to highlight our approach in splitting sub-domains and performing counterexample guided
polynomial generation. In practice, it is more beneficial to create a look-up table containing
the results of ln(x) in FP6 as there are only 26 = 64 values.
The elementary function ln(x) is defined over the input domain (−∞, ∞). There are
23 FP6 values ranging from 0.125 to 3.75 in the input domain. The remaining FP6 values
are considered as special cases. If x = 0, then we return ln(x) = −∞. When x = ∞, then
we return ln(x) = ∞. If x < 0 or x = N aN , then the function ln(x) is not defined and
we return N aN .
Our approach to generating piecewise polynomials for FP6 involves three steps. First,
we identify the correctly rounded result of ln(x) in FP6 and its rounding interval for each
118

1.5

1.0

0.5

0.0

−0.5

−1.0
log(x)
−1.5 Rounding intervals
FP6 result
−2.0
0.5 1.0 1.5 2.0 2.5 3.0 3.5

Figure 5.2: We show the function ln(x) with black line. The correctly rounded result of ln(x)
for each input in FP6 is shown with a black circle. The gray boxes represent the rounding intervals
of each correctly rounded result. We split the entire input domain into two sub-domains: the first
sub-domain including inputs from 0.125 to 1.875 (purple background) and the second sub-domain
including inputs from 2.0 to 3.5 (green background). We then generate a polynomial that produces
correct results for each sub-domain.

input. Next, we need to generate polynomial approximations for ln(x). While it is possible
to generate a single polynomial of degree 5 for the entire input domain, our goal is to gen-
erate efficient polynomials. Instead of a single polynomial, we generate a piecewise poly-
nomial where each polynomial has a lower degree, thus increasing performance. Hence,
our second step splits the input domain into smaller sub-domains. Third, we perform coun-
terexample guided polynomial generation to obtain a polynomial for each sub-domain. At
the end of this process, each polynomial produces the correctly rounded result for all inputs
in its corresponding sub-domain. Since there are only 23 inputs in FP6 for ln(x), we do
not use range reduction for exposition.

Identifying the correctly rounded result and rounding intervals. Similar to the approaches
described in previous chapters, we compute the correctly rounded result of ln(x) using an
oracle for each input x in FP6. Then, we identify the rounding interval [l, h] in double rep-
resentation where all values in the interval rounds to the correctly rounded result. Our goal
is to generate piecewise polynomials that produce a value in the rounding interval for each
input x. Figure 5.2 shows the correctly rounded result of ln(x) in FP6 (black circle) and
119

0.125 (0x3fc0000000000000) 0 0 1 1 1 1 1 1 1 1 0 …

common bit sub-domain index


(a)

3.75 (0x400e000000000000) 0 1 0 0 0 0 0 0 0 0 0 …

common bit sub-domain index


(b)

Figure 5.3: To approximate ln(x) for FP6, we create piecewise polynomials with 2 sub-domains.
We use a bit in the double representation of the input to identify the sub-domain for each polynomial.
(a) shows the smallest input for ln(x) in FP6 which corresponds to the first sub-domain. (b) shows
the largest input which corresponds to the second sub-domain.

the corresponding rounding interval (gray area) for each input in FP6. Each pair (x, [l, h])
defines a constraint on the piecewise polynomials that we want to generate.

5.2.1 Domain Splitting

Once we have a list of constraints (x, [l, h]), the next step is to generate polynomial ap-
proximations for ln(x). Initially, we try to generate a single polynomial for the entire input
domain using the counterexample guided polynomial generation. If we cannot generate
a single polynomial or the polynomial does not meet the performance threshold, then we
split the input domain into sub-domains and generate piecewise polynomials. We iteratively
split the domain into smaller sub-domains until we can generate polynomials that produce
the correctly rounded results for all inputs and satisfies the performance requirement.
Let us suppose that we are going to split the input domain into 2 sub-domains and
generate a polynomial approximation for each sub-domain. We use the bit-pattern of the
input in the double representation to identify the sub-domain. All inputs for ln(x) are
positive values. In FP representations, the first bit of the bit-string represents the sign
bit. Thus, the most significant bit (the first bit) among all inputs is identical, as shown in
Figure 5.3(a) and (b). We use the next bit (the second bit), to identify the inputs that belong
to each of the two sub-domains. All inputs such as x = 0.125 (Figure 5.3(a)) where the
120

0.5

0.0
−2.125 ≤ P (0.125) ≤ −1.9375
−0.5
−1.4375 < P (0.25) < −1.3125
−1.0
−0.5625 ≤ P (0.625) ≤ −0.4375 P(x) = c0 + c1x + … + c4x 4
Satisfied intervals
−0.0625 ≤ P (1.0) ≤ 0.0625 −1.5
Sampled intervals
0.5625 < P (1.875) < 0.6875 −2.0 Intervals not satisfied by P(x)

0.25 0.50 0.75 1.00 1.25 1.50 1.75

(a) (b)
0.5

−2.125 ≤ P (0.125) ≤ −1.9375 0.0


−1.4375 < P (0.25) < −1.3125
−0.5
−0.5625 ≤ P (0.625) ≤ −0.4375
−1.0
−0.0625 ≤ P (1.0) ≤ 0.0625
P(x) = c0 + c1x + … + c4x 4
0.4375 ≤ P (1.75) ≤ 0.5625 −1.5
Satisfied intervals
0.5625 < P (1.875) < 0.6875 −2.0 Sampled intervals

0.25 0.50 0.75 1.00 1.25 1.50 1.75

(c) (d)

Figure 5.4: The procedure to generate a polynomial for the first sub-domain. (a) We initially sam-
ple five constraints and generate a candidate 4th degree polynomial that satisfies these constraints.
(b) The generated polynomial satisfies all constraints except for the input x = 1.75. (c) We add
the constraint corresponding to the input x = 1.75 to the sample and generate another polyno-
mial that satisfies the six constraints. (d) The second polynomial satisfies all constraints in the first
sub-domain.

second bit is 0 is grouped into the first sub-domain. There are 15 inputs ranging from 0.125
to 1.875 in the first sub-domain. Similarly, all inputs similar to x = 3.75 (Figure 5.3(b))
where the second bit is 1 is grouped into the second sub-domain. There are 8 inputs ranging
from 2.0 to 3.75 in the second sub-domain. The graph in Figure 5.2 shows the two sub-
domains with different background colors, purple for the first sub-domain and green for the
second sub-domain.
121

5.2.2 Polynomial Generation for Each Sub-Domain.

The final step is to generate a polynomial for each sub-domain. This polynomial must
produce a value within the rounding interval [l, h] for each input x in the sub-domain. In
the first sub-domain, there are 15 such inputs and the corresponding rounding intervals.
Initially, we sample a portion of the inputs in the first sub-domain. Figure 5.4(a) shows the
five inputs that we sample. We encode the inputs and the rounding intervals as linear con-
straints, create an LP query, and use an LP solver to generate a candidate polynomial that
satisfies the constraints in the sample. Figure 5.4(b) shows a candidate 4th degree polyno-
mial. Next, we check whether the candidate polynomial produces a value in the rounding
interval for all inputs in the sub-domain. There is an input where the polynomial does not
produce a value within the reduced interval, highlighted with the red box Figure 5.4(b).
We add this counterexample input to the sample (Figure 5.4(c)). The polynomial created
using the new sample satisfies all the constraints in the first sub-domain, as shown in Fig-
ure 5.4(d).
Using the same approach, we generate a polynomial for the second sub-domain. Fig-
ure 5.5(a) shows the initial sample of three inputs and the constraints. The 1st degree
polynomial generated using the sample of inputs does not produce a value in the rounding
interval for an input in the second sub-domain. (Figure 5.5(b)). We add this counterexam-
ple to the sample (Figure 5.5(c)) and generate another candidate polynomial that satisfies
the four constraints (Figure 5.5(d)). This polynomial does produce a value in the round-
ing interval for all inputs in the second sub-domain. The final piecewise polynomial that
produces the correctly rounded results of f (x) for all 23 inputs is shown in Figure 5.6

5.3 Our Approach to Generate Piecewise Polynomials

Algorithm 5.1 provides a high-level overview of our approach to generate piecewise poly-
nomials that produce correctly rounded results for all inputs in a given 32-bit representation
122

1.4

1.2

1.0

P(x) = c0 + c1x
0.6875 ≤ P (2) ≤ 0.8125 0.8 Sampled intervals
8.125 < P (2.5) < 0.9375 Satisfied intervals
0.6 Intervals not satisfied
1.3125 < P (3.75) < 1.4375
2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75

(a) (b)
1.4

1.2

1.0
0.6875 ≤ P (2) ≤ 0.8125
0.6875 ≤ P (2.25) ≤ 0.8125 0.8 P(x) = c0 + c1x
8.125 < P (2.5) < 0.9375 Sampled intervals
0.6 Satisfied intervals
1.3125 < P (3.75) < 1.4375
2.00 2.25 2.50 2.75 3.00 3.25 3.50 3.75

(c) (d)

Figure 5.5: The procedure to generate a polynomial for the second sub-domain. (a) We initially
sample three constraints and generate a candidate 1st degree polynomial that satisfies these con-
straints. (b) The generated polynomial satisfies all constraints except for the input x = 2.25. (c)
We add the constraint corresponding to the input x = 2.25 to the sample and generate another poly-
nomial that satisfies the four constraints. (d) The second polynomial satisfies all constraints in the
second sub-domain.

1.5

1.0

0.5

0.0

−0.5

−1.0
P(x) for the first sub-domain
−1.5 P(x) for the second sub-domain
Rounding intervals
−2.0
0.5 1.0 1.5 2.0 2.5 3.0 3.5

Figure 5.6: Illustration of the piecewise polynomials produced with our approach. Each polyno-
mial produces the correctly rounded results of ln(x) for all inputs in the corresponding sub-domain.
123

Algorithm 5.1: CorrectPiecewise computes piecewise polynomials of degree d. The


generated polynomials produce the correctly rounded result of f (x) for each input in the cor-
responding sub-domain when used with output compensation function. GenPiecewise is
shown in Algorithm 5.2.
1 Function CorrectPiecewise(f , T, H, X, d, RRH , OCH ):
2 L ← CalcRndIntervals(f , T, H, X)
3 if L = ∅ then return (false, DNE)
4 L0 ← CalcReducIntervals(L, H, RRH , OCH )
5 if L0 = ∅ then return (false, DNE)
6 L∗ ← CombineReducIntervals(L0 )
7 if L∗ = ∅ then return (false, DNE)
// Change from Chapter 4: Generates piecewise polynomial
8 Ψ← GenPiecewise(L∗ , d)
9 return (true, Ψ)

T. Algorithm 5.1 extends the RL IBM approach from Chapter 4. Given an elementary func-
tion f (x) and a list of inputs X, we compute the correctly rounded result of f (x) and
identify the rounding interval in H for each result (line 2). Then, using the range reduction
function and output compensation function, we identify the reduced inputs x0 and reduced
intervals [l0 , h0 ] (lines 4-7). The pair (x0 , [l0 , h0 ]) for each x0 specifies a constraint on the
output of the polynomial approximation that we want to generate. If the polynomial ap-
proximation produces a value in the reduced interval, then it will produce the correctly
rounded result for all inputs when used with the output compensation function. Instead
of generating a single polynomial, the extended RL IBM approach generates a piecewise
polynomial using domain splitting and counterexample guided polynomial generation (line
8). We first iteratively split the domain of the reduced input into multiple sub-domains
(Algorithm 5.2). Then, we use counterexample guided polynomial generation which uses
a small portion of constraints to generate a polynomial that produces a value in the reduced
interval for all inputs in the sub-domain (Algorithm 5.3). In the remainder of the section,
we focus on describing our domain splitting and counterexample guided polynomial gen-
eration in more detail. We assume that we have already identified the list of reduced inputs
and intervals.
124

5.3.1 Domain Splitting for Efficient Piecewise Polynomials

Once we compute the list of constraints L∗ containing reduced inputs x0 and its correspond-
ing reduced intervals [l0 , h0 ], our goal is to generate polynomials that satisfy the constraints
in L∗ . Even after range reduction, there can be millions of constraints in L∗ . Given a suffi-
ciently high degree d, the counterexample guided polynomial generation technique which
we describe in Section 5.3.2 can generate a single polynomial. However, a high degree
polynomial is not efficient. Hence, we generate piecewise polynomials with lower degrees.
We split the reduced input domain into sub-domains and generate a polynomial for each
sub-domain. Each polynomial only needs to satisfy the constraints in the corresponding
sub-domain, leading to a lower degree and better performance when evaluating the poly-
nomial. For best performance, effectively splitting the input domain is essential. We must
avoid the situation where the performance gain in evaluating lower degree polynomial is
outweighed by the overhead of identifying which polynomial to evaluate for a given in-
put. Hence, we group the reduced inputs into sub-domains based on the bit-pattern of the
reduced inputs in H.
Algorithm 5.2 illustrates our approach to generate piecewise polynomials. Abstractly,
our strategy is to split the entire input domain into 2n sub-domains and use n bits of the
bit-pattern of x0 to group the reduced inputs. Considering the bit-string of the reduced
input x0 in H, the first bit is the sign bit and the next several bits represent the exponent
bits. Additionally, range reduction reduces the input in T to a value in a small domain
[a, b]. Thus, it is likely that the first several bits of the bit-string are identical for all reduced
inputs. We use the next n bits to group the reduced inputs into each sub-domain.
Some range reduction techniques can create both positive and negative reduced inputs.
The bit-string of positive and negative reduced inputs in H will not have any common
bits because the first bit distinguishes between negative and positive values. Hence, we
separate the reduced inputs and their corresponding reduced interval into two groups: L−
that contains negative reduced inputs, and L+ that contains positive reduced inputs (lines 2
125

Algorithm 5.2: GenPiecewise generates piecewise polynomials that produce a value in the
reduced interval for all reduced inputs in L. It initially attempts to produce a single polynomial
for the entire reduced input domain. If unsuccessful, then it splits the domain into multiple
sub-domains. SplitDomain splits the reduced input domain into sub-domains based on the
bit-pattern of the reduced inputs in H. CeGPolyGen generates a polynomial for each sub-
domain, which is shown in Algorithm 5.3.
1 Function GenApproxFunc(L, d):
2 L− ← {(x0 , [l0 , h0 ]) ∈ L | x0 < 0}
3 L+ ← {(x0 , [l0 , h0 ]) ∈ L | x0 ≥ 0}
4 Ψ− ← GenApproxHelper(L− , d)
5 Ψ+ ← GenApproxHelper(L+ , d)
6 return {Ψ− , Ψ+ }
7 Function GenApproxHelper(L, d):
8 n←0
9 while true do
10 ∆ = SplitDomain(L, n)
11 (status, Ψ) = GenPiecewise(∆, d)
12 if status = true then return Ψ
13 n←n+1
14 end
15 Function GenPiecewise(∆, d):
16 Ψ←∅
17 foreach ∆i ∈ ∆ do
18 (status, Ψi ) ← CeGPolyGen(∆i , d)
19 if status = false then return (f alse, ∅)
20 Ψ ← Ψ ∪ Ψi
21 end
22 return (true, Ψ)

and 3). Then create polynomial approximations for each L− and L+ (lines 4 and 5).
When the list L has either positive reduced inputs (i.e., L = L+ ) or negative reduced
inputs (i.e., L = L− ), we try to generate a single polynomial of degree d that produces a
value in the reduced interval for each reduced input x0 in L (line 11). If we cannot generate
such a polynomial, then we split the reduced input domain into sub-domains (lines 9-13).
The reduced input domain is split into 2n sub-domains (line 10). To split the domain, we
identify the smallest reduced input x0min and the largest reduced input x0max in L, excluding
the reduced input 0. The range reduction function is evaluated in the higher precision
representation H and reduces the entire original input domain into a smaller domain. This
126

0 (0x0000000000000000) 0 0 0 0 0 0 0 0 0 0 0 0 …

5.98… x 10-8 (0x3e70101010101010) 0 0 1 1 1 1 1 0 0 1 1 1 …

common bits sub-domain index

7.81… x 10-3 (0x3f7fffe000000000) 0 0 1 1 1 1 1 1 0 1 1 1 …

common bits sub-domain index

Figure 5.7: Bit-pattern of the reduced inputs 0, x0min , and x0max in the double type (H) when
splitting the reduced input domain of ln(x) for 32-bit float (T) after using the range reduction
strategy described in Chapter 4. The first 7 bits in the bit-pattern of x0min and x0max are identical.
To split the input domain into 25 sub-domains, we use the next 5 bits of the bit-pattern to group
reduced inputs into each sub-domain.

may lead to a significant gap between 0 and the nearest reduced input in L. If we do
not exclude 0 when splitting the domain, several sub-domains will be assigned for the
reduced inputs between 0 and its nearest reduced input, resulting in empty sub-domains.
Hence, when we determine the bit-pattern to split the input domain, we exclude 0 from
consideration to evenly place the reduced inputs into each sub-domain.
The reduced inputs (excluding 0) in L are likely to have common bits in the sign bit and
the exponent bits since the inputs are in a small domain. We identify the consecutive bits
that are identical in the bit-string representation of x0min and x0max in the higher precision
representation H, starting from the most significant bit. Then, we use the next n bits to
identify the sub-domain. We group all reduced inputs and reduced intervals that share the
same bit-pattern of the n bits into the same sub-domain ∆i , indexed by the n-bit used for
sub-domain identification. The reduced input 0 is grouped into ∆0 , as the bit-pattern of 0
is composed of all 0 bits.

Illustration Consider an example where our goal is to generate polynomials approxima-


tions of ln(x) for 32-bit float. Mathematically, the range reduction for ln(x) described in
Chapter 4 reduces the input domain into [0, 128
1
). When we reduce the 32-bit float inputs
into reduced inputs, there is a significant gap between 0 and the nearest reduced input.
Specifically, the reduced input domain is {0} ∪ [5.98 · · · × 10−8 , 7.81 · · · × 10−3 ]. Hence,
127

x0min = 5.98 · · · × 10−8 and x0max = 7.81 · · · × 10−3 .


Figure 5.7 shows the bit-pattern of 0, x0min , and x0max in the higher precision represen-
tation, 64-bit double. There is a significant gap between 0 and x0min as there are roughly
4.5 × 1018 bit-patterns between the two values. The first 7 bits in the bit-patterns of x0min
and x0max are identical. To create 25 sub-domains, we use the next 5 bits in the bit-string.
The reduced input 0 is then assigned to the 0th sub-domain.
Once the sub-domains and the reduced inputs corresponding to each sub-domain are
identified, we try to generate a polynomial of degree d that satisfies the constraints in each
sub-domain ∆i (line 18 in Algorithm 5.2). If we cannot generate a polynomial for at least
one sub-domain, then we split the reduced input into 2n+1 sub-domains and repeat the
process. Once we generate a polynomial for each sub-domain, the coefficients of the poly-
nomials are stored in a table, indexed by the bit-pattern used to identify the sub-domain.
Using bit-patterns of the reduced input in H allows us to efficiently identify the sub-domain
given a reduced input. It requires a single masking operation and a shift operation.

5.3.2 Counterexample Guided Polynomial Generation

Even after range reduction and domain splitting, there can be millions of reduced inputs
and reduced intervals in each sub-domain. With the current hardware technology, even
the state-of-the-art LP solvers can only handle thousands of constraints. To address this
challenge, we propose counterexample guided polynomial generation with sampling. Our
insight is that it is not necessary to reason about every single constraint in the sub-domain,
as illustrated in Section 5.2. As we are approximating continuous elementary functions,
the reduced intervals of adjacent reduced inputs are gradually increasing or decreasing. If a
polynomial satisfies the constraint for a given reduced input, then it is likely that the same
polynomial satisfies the constraint for nearby inputs, except when the reduced intervals of
the nearby inputs are significantly constrained. Thus, as long as we use highly constrained
intervals, the generated polynomial will produce a value in the interval for all inputs in the
128

Algorithm 5.3: GenPolynomial attempts to find a polynomial of degree d that satisfies


the reduced input and interval constraints in ∆i using our counterexample guided sampling ap-
proach. If it is infeasible to find a polynomial of degree d or the size of the sample exceeds
a threshold, then it returns (false, ∅). GeneratePoly generates the coefficients of a polyno-
mial that satisfies all constraints in S using the LP solver (described in Section 3.3). Check
validates that the polynomial generated using the sample satisfies all reduced input and interval
constraints.
1 Function GenPolynomial(∆i , d):
2 S ← Sample(∆i )
3 while true do
4 (status, Ψi ) ← GeneratePoly(S, H, d)
5 if status = f alse then return (f alse, ∅)
6 (Done, S) ← Check(Ψi , ∆i , S)
7 if Done = true then return (true, Ψi )
8 if |S| > threshold then return (f alse, ∅)
9 end
10 Function Check(Ψi , ∆i , S):
11 Done ← true
12 foreach (x0 , [l0 , h0 ]) ∈ ∆i do
13 if not l0 ≤ Ψi (x0 ) ≤ h0 then
14 S ← {(x0 , [l0 , h0 ])} ∪ S
15 Done ← f alse
16 end
17 end
18 return (Done, S)

sub-domain.
Algorithm 5.3 shows the counterexample guided polynomial generation technique. It
accepts a list of reduced constraints [x0 , [l0 , h0 ]] in the sub-domain ∆i and the degree d of
the polynomial that we wish to generate. The list of constraints in ∆i is stored in increasing
order of x0 . Our goal is to generate a polynomial of degree d that produces a value in the
interval [l0 , h0 ] for each reduced input in ∆i . We choose a small portion of constraints in
∆i (line 2) by uniformly sampling the constraints based on the distribution of the reduced
inputs. If there are a significant number of reduced inputs in a small region of the sub-
domain, our process samples more inputs from that region. We also add constraints where
the size of the reduced interval is smaller than a certain bound , which can be specified by
the math library developer.
129

Next, we generate a candidate polynomial that satisfies all the constraints in the sample
using the LP formulation (line 4). If it is not possible to generate a polynomial of degree
d that satisfies all constraints in the sample, we split the reduced input domain into smaller
sub-domains and repeat the entire process. Otherwise, we check whether the candidate
polynomial satisfies all constraints in the sub-domain ∆i (lines 12-17). If there is any con-
straint not satisfied by the polynomial, then the constraints are added to the sample (lines
13-15). We repeat the process of generating candidate polynomials until the polynomial
satisfies all constraints in ∆i (line 7). At any point, if the number of constraints in the
sample exceeds a certain threshold, we determine that we cannot generate a polynomial of
degree d (line 8).

5.3.3 Storing the Coefficients of Piecewise Polynomials

At the end of our process, we obtain two piecewise polynomials, Ψ− that produces the
correctly rounded results of all inputs that reduce to negative reduced inputs and Ψ+ that
produces the correctly rounded results of all inputs that reduce to positive reduced inputs.
The two piecewise polynomials are stored in a separate table.
Each polynomial Ψi in a piecewise polynomial Ψ (i.e., Ψ = Ψ− or Ψ = Ψ+ ) is indexed
using the bit-pattern, i, used to group reduced inputs to the corresponding sub-domain. The
polynomial has a degree of d or smaller,

Ψi = ci,0 + ci,1 x + ci,2 x2 + · · · + ci,d xd

If Ψi has a degree d0 < d, then all coefficients after the d0th term are zeros. If there are no
reduced inputs in a particular sub-domain with index i, then all coefficients in Ψi are zeros.
If there is a term ci,j xj where ci,j = 0 across all polynomials in Ψ, i.e., when polynomial
approximations are even or odd, then we remove the term to avoid unnecessary overhead
when evaluating the polynomial. Finally, the coefficients of each Ψi in Ψ are stored in a
two-dimensional look-up table. The coefficients corresponding to each Ψi in Ψ are indexed
with i. Each coefficient ci,j in Ψi is indexed with j. Using this strategy, we can efficiently
130

retrieve the coefficients of a polynomial Ψi and use the same implementation to evaluate
different Ψi ’s.

5.3.4 Implementing the Elementary Function

We now describe how to implement the elementary function of f (x) that produces the cor-
rectly rounded result for all inputs in T using the generated piecewise polynomials Ψ− and
Ψ+ , the range reduction function RR(x), and the output compensation function OC(y 0 ).
All internal computations are performed in the higher precision representation H. Given
an input x ∈ T, we determine whether x is a special case input, which can be imple-
mented using a series of branch statements. If x is a special case input, then we return
the corresponding result. Next, we convert the input x to H and perform range reduction
x0 = RR(x) to obtain the reduced input. We identify the sign of x0 to determine which
piecewise polynomial to use. If x0 < 0, then we retrieve the look-up table for Ψ− . Other-
wise, we retrieve the look-up table for Ψ+ .
Once we retrieve the look-up table for the correct piecewise polynomial Ψ for x0 , our
next goal is to retrieve the polynomial that corresponds to x0 by identifying the index of
the sub-domain that x0 belongs to. At implementation time, we already know the size
(i.e., 2n ) of the piecewise polynomial Ψ. We use the strategy described in Section 5.3.1 to
identify the n-bit bit-pattern of x0 in H. This process can be efficiently implemented using
a single bitwise and operation and a shift right operation. We use this bit-pattern to
index into the look-up table and retrieve the coefficients of the polynomial Ψi . We evaluate
y 0 = Ψi (x0 ) and then use y 0 to evaluate the output compensation function y = OC(y 0 ). The
value y is in the higher precision representation H. Thus we round y to T, which is the
correctly rounded result of f (x) in the representation T.
131

5.4 Summary

In this chapter, we present our approach for generating piecewise polynomials to create
efficient and correctly rounded implementations. This approach is aimed towards generat-
ing correctly rounded polynomial approximations for 32-bit representations. We propose
two extensions to the RL IBM approach. First, we split the input domain into multiple sub-
domains using bit-patterns in the input and generate a polynomial for each sub-domain.
Second, we use counterexample guided polynomial generation to handle millions of in-
puts. Our approach has been instrumental in generating elementary functions for 32-bit
float and posit32. Our functions produce correctly rounded results for all inputs in 32-bit
representations and are faster than mainstream math libraries.
132

CHAPTER 6
A SINGLE POLYNOMIAL THAT PRODUCES CORRECT RESULTS FOR
MULTIPLE REPRESENTATIONS AND ROUNDING MODES

The previous chapters describe our approach to generate correctly rounded polynomial ap-
proximations for a specific FP representation and a single rounding mode. The generated
polynomials do not guarantee to produce the correctly rounded results for a different repre-
sentation or even a different rounding mode, due to double rounding error. In this chapter,
we propose a novel approach to generate polynomial approximations that produce correctly
rounded results for multiple representations and rounding modes. Our key idea is to ap-
proximate the correctly rounded result for an (n + 2)-bit representation using the round to
odd (rno) rounding mode. The rno rounding mode has properties that avoid double round-
ing issues. We formally prove that the resulting polynomial approximation will produce
correctly rounded results for multiple representations and all standard rounding modes.

6.1 Case For Generic Math Libraries

Using our approaches described in the previous chapters, we have been successful in gener-
ating polynomial approximations of f (x) that produce correctly rounded results for a given
representation and a rounding mode. Since the posit standard specifies one rounding mode,
round to the nearest tie goes to even (i.e., rne), one polynomial approximation suffices
for each given representation. In the IEEE-754 standard for FP representations, there are
five standard rounding modes, rne, rna, rnz, rnp, and rnn mode. To produce correctly
rounded results of f (x) for different rounding modes, we can generate one polynomial ap-
proximation for each rounding mode. Five approximation functions for a given elementary
function and an FP representation are reasonable to generate, especially if we consider only
a few widely used representations.
133

Recently, new variations of FP with smaller bit-length have been proposed with vary-
ing amount of dynamic range and precision, including bfloat16 [76], FlexPoint [78], ten-
sorfloat32 [108], MSFP [100, 116], log number systems [44, 74, 112, 130], and the Posit
representation [55, 58]. Some of these variants are different configurations of FP while
some others (i.e., posits) are completely new number systems. All of these representa-
tions require math libraries to approximate elementary functions if we are to use them in
scientific applications. Generating math libraries for multiple variants of representations
requires significant effort and time, especially for different rounding modes.
Our goal is to generate a single approximation of an elementary function that produces
correctly rounded results for multiple representations and rounding modes. The primary
challenge is in identifying the values that the polynomial approximations should produce.
For example, consider the task of creating a math library for the float representation that
produces correctly rounded results with all rounding modes. A naive attempt at accom-
plishing this task may use a correctly rounded double math library such as CR-LIBM which
produces the correctly rounded result in the double type with rne rounding mode. The
double result can be rounded to the float type using the chosen rounding mode. Due to the
double rounding error, however, the final result will not be equal to the correctly rounded
result of f (x) for float for all inputs. If the real value of f (x) is extremely close to the
rounding boundary of two adjacent float values, then the error caused by rounding f (x) to
double using rne rounding mode can be significant enough to produce the wrong result for
float. We empirically show that this approach of using a correctly rounded math library for
double indeed produces wrong float results in Chapter 7.

Our key idea. Instead of generating a correctly rounded elementary function for each
representation and each rounding mode, it would be ideal to generate a single polynomial
approximation that produces the correctly rounded result of an elementary function for
multiple representations and rounding modes. We propose a novel approach to generate
134

such polynomials! Our key insight is to approximate the correctly rounded result of an
elementary function f (x) in the round to odd (rno) rounding mode. The rno rounding
mode has previously been explored to avoid the double rounding error when performing
primitive operations in extended precision and then subsequently rounding the result to
the smaller precision representation [9]. The rno mode can be described as follows. If a
real value is exactly representable in the target representation, it is rounded to the value
in the representation. Otherwise, rno rounds the real value to the nearest value in the
target representation where the bit-string is odd when interpreted as an unsigned integer.
Abstractly, the rno mode avoids double rounding error because it retains all the information
necessary to identify the correctly rounded value of the original real value. Our contribution
lies in recognizing that the rno mode has the properties which can be used to generate
correctly rounded elementary functions for multiple representations and rounding modes.
We provide more information on the rno mode in Section 6.3.1.
Our approach generates polynomial approximations that produce the correctly rounded
result of f (x) in the (n + 2)-bit representation Tn+2 with the rno mode. This polynomial
approximation produces the correctly rounded result for all k bit representation Tk with
any standard rounding modes, as long as k ≤ n and the number of exponent bits of Tk is
the same as Tn+2 . We formally define the Tn+2 and Tk for FP and posit representations in
Section 6.3 and prove that the polynomial that we generate produces the correctly rounded
results for Tk with all standard rounding modes in Section 6.4.

High-level overview of our approach. To generate the polynomial approximation that


produces the correctly rounded result of f (x) in Tn+2 with the rno mode, we extend the
approach described in the previous chapters. We first compute the correctly rounded result
of f (x) in Tn+2 representation with the rno mode using an oracle to produce yrno . Using
the correctly rounded result, we identify a range of values that round to yrno when rounded
to Tn+2 with the rno mode. We call this interval the odd interval. When a polynomial
135

S E0 E1 F0 F1 S E0 E1 F0 S E0 E1 F0 F1 F2 F3
sign exponent mantissa sign exponent mantissa sign exponent mantissa

(a) (b) (c)

Figure 6.1: (a) Layout of the bit-pattern of the (a) 5-bit FP5, (b) 4-bit FP4, and (c) 7-bit FP7 used
in our illustration. All three representations have two exponent bits.

generates a value in the odd interval, then the polynomial is guaranteed to produce the
correctly rounded result of f (x) for all Tk using any standard rounding modes. Next, our
goal is to generate a polynomial that produces a value in the odd interval for each input. One
of the challenges in generating the polynomial is the existence of singleton odd intervals.
This can occur when the real value of f (x) for a particular input x is exactly representable
in Tn+2 and the bit-string of the value in Tn+2 is even when interpreted as an unsigned
integer. Because generating a polynomial that produces a singleton value for a particular
input is challenging, we use mathematical properties to identify the set of inputs resulting
in a singleton interval and develop an efficient implementation to compute the result. We
generate a polynomial that produces a value in the odd interval for the remaining inputs
and non-singleton odd intervals.
Using this approach, we created a math library with ten elementary functions for FP
and ten elementary functions for posit, which we call RL IBM -A LL. The FP functions in
RL IBM -A LL produce the correctly rounded result in the 34-bit FP representation, F34,8 ,
with the rno mode. They produce the correct results for all FP representations ranging
from the 10-bit F10,8 to the 32-bit F32,8 for all standard IEEE-754 rounding modes. These
representations include bfloat16, TensorFloat32, and float types. The posit functions in
RL IBM -A LL produce the correctly rounded result in the 34-bit posit representation, P34,2 ,
with the rno mode. They produce the correct results for all posit representations ranging
from the 2-bit P2,2 to the 32-bit P32,2 , the standard 32-bit posit configuration.
136

6.2 Illustration

We provide an end-to-end example of creating a single polynomial that produces the cor-
rectly rounded results of ln(x) for a 5-bit FP with 2 exponent bits (FP5) and a 4-bit FP
with 2 exponent bits (FP4) using the standard rounding modes (i.e., rne, rna, rnz, rnp,
and rnn). Figure 6.1(a) and (b) shows the layout of the bit-string of FP5 values and FP4
values, respectively. We present our example with FP5 and FP4 to illustrate our approach
and describe the insights. In practice, creating a look-up table with pre-computed results is
more beneficial for FP5 and FP4 as there are only 32 and 16 bit-patterns, respectively.
The ln(x) function is defined over the input domain (0, ∞). There are 11 values ranging
from 0.25 to 3.5 in FP5 within the input domain. The remaining 25 − 11 = 21 value are
special case inputs. Similarly in FP4, there are 5 values ranging from 0.5 to 3.0 within the
input domain. The remaining 24 − 5 = 11 values are special case inputs. The special cases
can be filtered with a simple check followed by returning the correct result.

6.2.1 A Strawman Approach

Before presenting our approach, we present a strawman approach to generate a polynomial


for FP4 and FP5. The strawman approach highlights the challenges of generating polyno-
mial approximations for multiple representations and rounding modes. Let us pick an input
x = 1.5 that can be represented by both FP5 and FP4. Figure 6.2 shows the real value of
ln(1.5) (gray star) and the correctly rounded result for FP5 and FP4 with different round-
ing modes. Consider the first row in Figure 6.2, which shows the correctly rounded result
of FP5 with the rne mode (black circle). To produce this value, our polynomial approxi-
mation has to produce a value in the rounding interval of ln(1.5), for the rne result. The
second row in Figure 6.2 shows the correctly rounded result of FP5 with the rna mode and
the rounding interval to produce the correctly rounded rna result. The subsequent rows
show the correctly rounded result and the rounding interval for other rounding modes of
137

FP5 and rne


0 0.25 0.5 0.75

FP5 and rna


0 0.25 0.5 0.75

FP5 and rnz


0 0.25 0.5 0.75

FP5 and rnp


0 0.25 0.5 0.75

FP5 and rnn


0 0.25 0.5 0.75

FP4 and rne


0 0.5

FP4 and rna


0 0.5

FP4 and rnz


0 0.5

FP4 and rnp


0 0.5

FP4 and rnn

Interval for
FP7 and rno
0 0.25 0.375 0.4375 0.5 0.75

Figure 6.2: We show the correctly rounded result of ln(1.5) in FP5 and FP4 with all five rounding
modes. The gray star represents the real value of ln(1.5). Values that are representable in FP5, FP4,
and FP7 are shown with circle, square, and rhombus, respectively. The black solid color represents
the correctly rounded result and the gray interval shows the rounding interval. The last row shows
the odd interval where all values in the interval rounds to the correctly rounded results of ln(1.5) in
FP7 with the rno rounding. The odd interval is a subset of all rounding intervals for FP5 and FP4
with five standard rounding modes.

FP5 and FP4. It is possible for a single polynomial to produce the correctly rounded result
of ln(1.5) for both FP5 and FP4 with all five rounding modes if it produces a value within
all ten rounding intervals. A strawman approach for generating a generic polynomial is to
compute the rounding intervals for all possible combinations of target representations and
rounding modes and identify the common region in all rounding intervals. However, this
requires a significant amount of computation especially as the number of target representa-
tions increase.
138

6.2.2 Generating A Polynomial Approximation With the rno Mode.

To create a single polynomial, we generate a polynomial that produces the correctly rounded
results in FP7 with the round to odd (rno) mode, where FP7 is a 7-bit FP representation
with two exponent bits (Figure 6.1(c)). The rno rounding mode can be summarized as fol-
lows. If a real value is exactly representable in FP7, then we represent it with the FP7 value.
Otherwise, the real value rounds to an adjacent FP7 value where the bit-string is odd when
interpreted as an unsigned integer (i.e., the last bit is 1). The rno mode is a non-standard
rounding mode proposed to address double rounding issues with primitive arithmetic op-
erations when using rne rounding mode [9]. We show that rno can be used to generate
correctly rounded results with all standard rounding modes. In Section 6.4, we prove that
a polynomial that produces the correctly rounded results for a (n + 2)-bit representation
with the rno mode also produces the correctly rounded results for all k bit representations
with any standard rounding modes as long as the representation has the same number of
exponent bits |E| and |E| + 1 < k ≤ n. In our illustration, all three representations (i.e.,
FP4, FP5, and FP7) have two exponent bits. When we generate a polynomial that produces
correctly rounded results of ln(x) in FP7 with the rno mode, the result of the polynomial
will round to the correctly rounded result of FP5 and FP4 for all rounding modes.
Additionally, computing the correctly rounded result of ln(x) in FP7 with the rno mode
and identifying the range of values that round to this result is an efficient way to compute the
common interval described above. The last row in Figure 6.2 shows the correctly rounded
result of ln(1.5) in FP7 with the rno mode (black rhombus) and the interval of values that
round to the correctly rounded result (blue region). We call this interval the odd interval.
The odd interval for ln(1.5) is common in all ten rounding intervals shown above. Thus, as
long as we generate a polynomial that produces a value in the odd interval for each input,
the polynomial will produce the correctly rounded result of ln(x) for both FP5 and FP4
with any five rounding modes.
Our approach works because the FP7 representation can represent three additional val-
139

FP5 and rne


-0.5 -0.25 0 0.25

FP5 and rna


-0.5 -0.25 0 0.25

FP5 and rnz


-0.5 -0.25 0 0.25

FP5 and rnp


-0.5 -0.25 0 0.25

FP5 and rnn


-0.5 -0.25 0 0.25

FP4 and rne


-0.5 0

Interval for
FP7 and rno
-0.5 -0.25 -0.0625 0 0.0625 0.25

Figure 6.3: We show the correctly rounded result of ln(1.0) in FP5 and FP4 with all five rounding
modes. The gray star represents the real value of ln(1.0). Values that are representable in FP5, FP4,
and FP7 are shown with circle, square, and rhombus, respectively. The black solid color represents
the correctly rounded result and the gray interval shows the rounding interval. The last row shows
the odd interval (a singleton value) where all values in the interval rounds to the correctly rounded
results of ln(1.0) in FP7 with the rno mode. The odd interval is a subset of all rounding intervals
for FP5 and FP4 with five standard rounding modes.

ues between two adjacent FP5 values and the rno result in FP7 holds enough information
to produce the correctly rounded result of various representations (including FP4 and FP5)
and rounding modes (including the five standard rounding modes). Consequently, the odd
interval is typically smaller than the common intervals among FP4 and FP5 with all round-
ing modes. For instance, the common intervals for ln(1.5) illustrated in Figure 6.2 is
[0.375, 0.5) while the odd interval for ln(1.5) is (0.375, 0.5). We provide more information
regarding the rno mode in Section 6.3.1. The proof in Section 6.4 shows that our approach
with the rno mode applies in general for any representation with n bits.

6.2.3 Generating Generic Polynomial Approximations

To generate a polynomial approximation for FP4 and FP5, we compute the correctly rounded
result of ln(x) in FP7 with the rno mode for each input in FP5 and identify the odd inter-
vals. Since all values in FP4 are representable in FP5, it is sufficient to compute the odd
140

intervals for the inputs in FP5. In our illustration, we evaluate the polynomials with double
precision. Hence, we identify the odd interval in the double precision. If the bit-string of
the correctly rounded result of ln(x) in FP7 with the rno mode is even when interpreted as
an unsigned integer, then the odd interval is a singleton consisting of the correctly rounded
result itself. For example, the odd interval of ln(1.0) = 0 is a singleton value because 0 is
exactly representable in FP7 and the bit-string is even. We pictorially show the rounding
intervals of ln(1.0) for FP4 and FP5 with different rounding modes as well as the odd inter-
val for FP7 in Figure 6.3. If the correctly rounded result of ln(x) in FP7 with the rno mode
is odd, then we can identify the odd interval as follows. We identify the preceding value l
and the succeeding value h of the correctly rounded result in FP7. Then, the open interval
(l, h) is the odd interval in real numbers. Since we evaluate the polynomial with double
precision, we identify the range of values in double within the odd interval. We compute
the succeeding value of l in double, which we denote as l+ , and the preceding value of h in
double, which we denote as h− . The closed interval [l+ , h− ] is the odd interval in double
for the input x. Figure 6.4(a) shows the odd intervals for each input in FP5.
In some instances, such as ln(1.0) = 0, the odd interval may be composed of a single
value. Generating a polynomial that produces a single value, while also producing a value
in the odd interval for all other inputs, is a challenging problem. This is especially true
considering that the polynomial is evaluated in double, a finite precision representation.
Hence, we treat these inputs similarly to special case inputs and exclude them from the
list of inputs that we must approximate. For the ln(x) function in FP7, there is only one
input x = 1.0 with singleton odd interval. Other elementary functions for higher precision
representations may have multiple inputs with singleton odd intervals. In Section 6.3.7, we
present our approach to identify such inputs and efficiently compute the correctly rounded
results using mathematical properties of the elementary functions. Figure 6.4(b) shows the
final list of ten inputs and the corresponding odd interval, excluding the input with singleton
interval, that our polynomial must approximate.
141

−1.5 < P (0.25) < −1.375


−0.75 < P (0.5) < −0.625
1.5
−0.375 < P (0.75) < −0.25
0.125 < P (1.25) < 0.25
1.0
0.375 < P (1.5) < 0.5
0.5
0.5 < P (1.75) ≤ 0.625
0.0
0.625 < P (2.0) < 0.75
-0.5
0.875 < P (2.5) < 1.0
-1.0 1.0 < P (3.0) ≤ 1.125
-1.5 Generic Intervals 1.25 < P (3.5) < 1.375
0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0 2.5 3.0 3.5

(a) (b)

Figure 6.4: (a) shows the odd interval for each input in FP7. The odd interval for the input x = 1.0
is a singleton interval, [0, 0]. (b) The set of constraints that must be satisfied by our polynomial to
produce the correctly rounded results of ln(x) for FP7 with rno mode. Note that the case for input
x = 1.0 is missing because generating a polynomial approximation that produces P (1.0) = 0.0 is
challenging. Thus, we classify the input 1.0 as a special case.

P (x) = c0 + c1 x + c2 x2 + c3 x3 + c4 x4 1.5

c0 = −2.2907925 . . . 1.0

c1 = 4.2825369 . . . 0.5

0.0
c2 = −2.6750194 . . .
-0.5
c3 = 8.1274281 . . .
-1.0 P(x) = c0 + c1x + c2x 2 + c3x 3 + c4x 4
c4 = −9.0132090 . . . -1.5
Generic Intervals

0.25 0.5 0.75 1.25 1.5 1.75 2.0 2.5 3.0 3.5

(a) (b)

Figure 6.5: (a) The coefficients generated by the LP solver for the polynomial that produces cor-
rectly rounded results of ln(x) for all inputs in FP7 with rno mode. (b) Generated polynomial
produces a value in the odd interval for all inputs.

Our next step is to generate a polynomial that produces a value in the odd interval
for all inputs. Since there are only 10 non-special case inputs, we can use the approach
described in Chapter 3 and frame the problem of identifying a polynomial that produces
a value in the odd interval as an LP problem and use an LP solver to find a 4th degree
polynomial. Figure 6.5(a) shows the resulting polynomial and the coefficients generated
by the LP solver. Since the polynomial produces a value in the odd interval (illustrated in
142

Figure 6.5(b)), the result will round to the correctly rounded result of ln(x) for FP4 and FP5
with all five standard rounding modes. To generate polynomial approximations for larger
representations, we can combine this approach with range reduction, domain splitting, and
counterexample guided polynomial generation described in Chapter 4 and Chapter 5.

6.3 The RL IBM Approach to Generate Generic Polynomials

We now formally define the problem. Let Tn be either an n-bit FP (Tn = Fn,|E| ) or posit
(Tn = Pn,es ) representation. Let Tk be a representation where Tk has no more precision
bits compared to Tn with the same number of exponent bits. Specifically, Tk = Fk,|E|
where |E| + 1 < k ≤ n if Tn is a FP representation and Tk = Pk,|E| where 2 ≤ k ≤ n if Tn
is a posit representation. Note that all values exactly representable in Tk are also exactly
representable in Tn (i.e., Tk ⊆ Tn ) for both FP and posit representation. Finally, we define
rm to be a standard rounding mode rm ∈ {rne, rna, rnz, rnp, rnn}. For exposition, the
posit rounding mode can be considered as the rne mode. Our goal is to generate a polyno-
mial approximation P (x) of an elementary function f (x) that produces correctly rounded
results for all inputs in any representation Tk and any rounding mode rm. Specifically,
rounding the result of P (x) to any representation Tk with rm rounding mode must result
in the same value as computing f (x) in real numbers and rounding the result to Tk with
rm rounding mode, for all inputs in Tk ,

RNTk ,rm (P (x)) = RNTk ,rm (f (x))

To produce the correctly rounded results for Tk with rm rounding mode, our approach
approximates the correctly rounded result of f (x) in the Tn+2 representation with the round
to odd (rno) rounding mode. Depending on the target representation, Tn+2 is defined
as follows. If Tk is an FP representation, then Tn+2 = Fn+2,|E| is an (n + 2)-bit FP
representation with |E| bits for the exponent. If Tk is a posit representation, then Tn+2 =
Pn+2,es is an (n+2)-bit posit representation with at most es exponent bits. The key intuition
143

is that the result in Tn+2 with the rno mode maintains a sufficient amount of information
to identify the correctly rounded result of f (x) in Tk with any rounding mode. Section 6.4
formally proves that a polynomial that produces the correctly rounded result of f (x) in
Tn+2 with the rno mode will also produce the correctly rounded result for all Tk using any
rm.
To generate such a polynomial, we extend the approach described in the previous chap-
ters to work for the rno mode. We compute the correctly rounded result of f (x) in Tn+2
with the rno mode for each input in Tn . Next, we identify a range of values that round
to the correctly rounded result when they are rounded to Tn+2 using the rno mode. We
call this range of values the odd interval. Each input in Tn and its corresponding odd in-
terval defines the constraint on the output of the polynomial we wish to generate. Hence,
we can use the approach described in Chapter 3 to generate polynomials that satisfy these
constraints by framing the problem as an LP problem and using an LP solver to identify
the coefficients of the polynomial.

6.3.1 The Round to Odd (rno) Rounding Mode

As we make a case for approximating f (x) for Tn+2 with the rno mode, we now define the
rno mode and describe how to systematically round a real value with the rno mode using
the rounding components described in Chapter 2. We also provide a more detailed intuition
on why rounding the rno result with Tn+2 produces the correctly rounded result in Tk with
any standard rounding modes.
The round to odd rounding mode is a non-standard rounding mode proposed to address
double rounding errors in primitive operations [9]. Previously, it was explored specifically
for the rne mode for computation with the extended precision and then rounding the result
to the target representation. Given a real value vR , the rno mode rounds vR to the target
representation T using the following strategy. If vR is exactly representable with a value
in T, i.e., v ∈ T, then vR rounds to v. Otherwise, vR rounds to the nearest value in T
144

Only w0 rounds to w0 Only w2 rounds to w2 Only w4 rounds to w4 Only w6 rounds to w6

w0 (even) w1 (odd) w2 (even) w3 (odd) w4 (even) w5 (odd) w6 (even)

Values between w0 and w2 Values between w2 and w4 Values between w4 and w6


rounds to w1 rounds to w3 rounds to w5

Figure 6.6: We show the rounding of real values vR to a representation T with the rno mode. The
values w0 , w1 , . . . , w6 are adjacent values in T. The values highlighted with red color (i.e., w1 , w3 ,
and w5 ) represent values where the bit-string representation is odd when interpreted as unsigned
integer. The orange, purple, and gray boxes represent ranges of real values that round to w1 , w3 ,
and w5 , respectively.

where the bit-string of the value is odd when interpreted as an unsigned integer. Figure 6.6
pictorially describes the rno mode. The values w0 , w1 , . . . , w6 are adjacent values in T. The
values highlighted with red color (i.e., w1 , w3 , and w5 ) represent values with odd bit-string.
Similarly, the values highlighted with black color (i.e., w0 , w2 , w4 , and w6 ) represent values
with the even bit-string. The only real values that round to w0 , w2 , w4 , or w6 are the values
w0 , w2 , w4 , and w6 themselves, respectively. Otherwise, the real value rounds to the nearest
value in T where the bit-string is odd (i.e., w1 , w3 , and w5 ). Based on the definition of rno,
we can systematically round vR to T with the rno mode using the rounding components
(s, v − , rb, sticky),


s × v − if IsOdd(v − ) ∨ (rb = 0 ∧ sticky = 0)
vrno = RNT,rno (vR ) =

s × v + otherwise

where v + is the succeeding value of v − in T. The definition of the rno mode is the same
for both FP and posit representations.

Our contribution. Our contribution is to use the rno mode to generate correctly rounded
elementary functions for multiple representations and rounding modes. Specifically, a poly-
nomial that produces the correctly rounded result of f (x) in Tn+2 with the rno mode will
also produce the correctly rounded result of f (x) in Tk with any standard rounding modes.
145

Theorem 6.1. Let f (x) be the real value of an elementary function given an input x and
vrno = RNTn+2 ,rno (f (x)). Let v be a value in the odd interval of vrno such that for all v,
RNTn+2 ,rno (v) = vrno . Choose a rounding mode rm ∈ {rne, rna, rnz, rnp, rnn}. Then,

RNTk ,rm (v) = RNTk ,rm (f (x))

We also provide an efficient procedure to create a polynomial approximation that pro-


duces a value in the odd interval for each input x. By Theorem 6.1, rounding any value v
in the odd interval to Tk using a rounding mode rm is guaranteed to produce the correctly
rounded result of f (x) in Tk with the same rounding mode rm.

6.3.2 Why Does The rno Result Avoid Double Rounding Error?

Let us provide an intuition of why rounding with the rno mode avoids double rounding
error in Figure 6.7. Any values v0 and v1 that is representable in Tn is exactly representable
in Tn+2 (i.e., w0 and w4 ). Further, there are three additional values (i.e., w1 , w2 , and w3 ) in
Tn+2 between w0 and w4 . The value w2 is the midpoint between w0 and w4 . Similarly, the
value w1 is the midpoint between w0 and w2 while w3 is the midpoint between w2 and w4 .
In the rno mode, all values between w0 and w2 rounds to w1 . Similarly, any value between
w2 and w4 rounds to w3 . Figure 6.7(a) illustrates the result of rounding a real value (red
star) directly to Tn with the rne mode (solid black arrow) and the result of double rounding
by first rounding the real value to Tn+2 with the rno mode and subsequently rounding the
result to Tn using the rne mode (blue dotted arrow). It can be seen, double rounding with
the rno mode produces the same value as if directly rounding the real value to Tn using the
rne mode. Similarly, Figure 6.7(b) through (e) illustrates that double rounding with rno
produces the same value as rounding the real value directly to Tn using other four standard
rounding modes.
To round a real value vR to Tn using one of the five standard rounding modes, we
must identify the relationship between vR and the two adjacent values v0 and v1 in Tn , as
explained in Chapter 2. There are five categories of such relationships:
rounds generating
However,
(x) to the correctly
A(x)
nding
produce
te vR with mode rno )rm.
ythe Since
for all
exact x 2ofany
result Tfn(x) for a given input x 2 Tn , then
mode rm produces vrm , the 146
ble yinrnoTnthat
tarrounding , itvis
is guaranteed
sufficient for to
R to Tn+2 using
e same rounding mode rm,
m.echnically, the interval that
NTk ,rm (RNTn+2 ,rno (f (x))) = RNTk ,rm (f (x))
dd interval of vrno is a range
However, generating A(x)
o vbe the result of RN to ,rno
rno when roundedTn+2 Tn+2 w0 Then,
(f (x)). w1 w if ourwapproximation
2 3 w4 w0 w1 w2 w3 w4 w0 w1 w2 w3 w4
produce
te vR withythe rno )exact
for all x 2ofTfn(x) for
result a given input x 2 Tn , then
yrno for all inputs x 2 Tn , thenv0A(x) (even)rounds to(odd)
thev1correctlyv0 (even) (odd) v1 v0 (even) (odd) v1
mode rm produces vrm , the
arrounding
tor y thatvis guaranteed
R to Tn+2 usingto
allrno
representations Tk with any rounding mode rm. (a) Since any
Round to the nearest, tie goes to even
e same rounding mode rm,
m.
Nlue in(RN
Tk ,rm Tk are
Tn+2also exactly
,rno (f (x))) =representable in Tn , it is sufficient for
RNTk ,rm (f (x))
dd interval of vrno is a range
x) produces yrno for all x 2 Tn . Technically, the interval that
o vbe the result of RN to ,rno
rno when roundedTn+2 Tn+2 w0 Then,
(f (x)). w1 w if ourwapproximation
2 3 w4 w0 w1 w2 w3 w4 w0 w1 w2 w3 w4
[yterno
vR, ywith is aexact
rno ], the generic interval
result of f (x) forfor
x. aHowever,
given inputgenerating
x 2 Tn ,A(x) then
yrno for all inputs x 2 Tn , thenv0A(x) rounds to thevcorrectly v v1 v0 v1
mode produces
rm] (where
yrno , yrno , the is to produce y ) for all1 x 2 T 0
vrmoption
the only
torrounding vR to Tn+2
all representations usingany rounding mode rm. Since
Tk with
rno
(b) Round
n
anyto the nearest, tie goes away
enext
same rounding mode
step is to find a range rm, of values near yrno that is guaranteed to
Nlue in(RN
Tk ,rm Tk areTn+2also
,rno (fexactly
(x))) = representable
RNTk ,rm (f (x))in Tn , it is sufficient for
dd
unded interval
result of of fvrno is aallrange
(x) for Tk and rm.
x) produces yrno for all x 2 Tn . Technically, the interval that
o vbe the result of RN to ,rno
rno when roundedTn+2 Tn+2 w0 Then,
(f (x)). w1 w if ourwapproximation
2 3 w4 w0 w1 w2 w3 w4 w0 w1 w2 w3 w4
[yte
ng v , y
Onforwith ],
theall is
the
Odd a generic
exact
Interval interval
result of f for
(x) for
x. aHowever,
given generating
input x 2 T , then
A(x)
x 2 of
rno R rno
Tnv,rno
n
yrno inputs thenv0A(x) rounds to thevcorrectly 1 v0 v1 v0 v1
mode rm produces v , the
yrno , yrno ] (where the only option is to produce yrno ) for all x 2 Tn
rm
or performing
at all representations Tk with any
double rounding rounding
by first mode
rounding vRrm.to TSince
n+2 using
any
(c) Round towards zero
enext
same steprounding
is to find amode range rm, of values near yrno that is guaranteed to
Nlue in(RN
Tk ,rm Tk areTn+2also
,rno (fexactly
(x))) = representable
RNTk ,rm (f (x))in Tn , it is sufficient for
dd
unded interval
result of of fvrno is aallrange
(x) for Tk and rm.
x) produces vrno y=rnoRN forTn+2
all,rno
x (v 2R )Tn . Technically, the interval that
o vbe the result
rno when rounded n+2
of RN to Tn+2 w0 Then,
(f (x)). w1 w if ourwapproximation
3 w4 w0 w1 w2 w3 w4 w0 w1 w2 w3 w4
T ,rno
2
[yte
ng v
On
rno , ywith ], is
the a generic
exact
R the Odd Interval of vrno
rno interval
result of f for
(x) x.
for aHowever,
given generating
input x 2 Tn ,A(x) then
yrno forvrno
ounding all inputs
to Tk using n , then
x 2 aTrounding v0A(x)
moderounds to thevcorrectly
rm produces 1vrm , the v0 v1 v0 v1
yrno , yrno ] (where the only option is to produce yrno ) for all x 2 Tn
or rounding
en
at all representations
performing vRdouble Ttok with
directlyrounding Tk using any rounding
the same
by first mode
rounding
rounding vRrm. TSince
to mode
n+2(d)using
rm,any towards positive infinity
Round
next step is to find a range of values near yrno that is guaranteed to
Nlue in(RN
Tk ,rm Tk areTn+2also
,rno (fexactly
(x))) = representable
RNTk ,rm (f (x))in Tn , it is sufficient for
dd
unded interval
result of of fvrno is aallrange
(x) for Tk and rm.
x) produces vvrno
rm y =rnoRN
= RN ,rm (v
forTTn+2
kall x R2) T . Technically, the interval that
,rno (vR ) n
o vbe the result
when of
rounded RN to Tn+2 w0 Then,
(f (x)). w1 w if ourwapproximation
3 w4 w0 w1 w2 w3 w4 w0 w1 w2 w3 w4
T ,rno
rno n+2
2
[y
ng , y
Onfor ],
RN
theall is a generic
(v
Oddk Interval )interval
= v for x. However, generating A(x)
x 2 of
rno rno T ,rm rno vrno v A(x) rounds to thevcorrectly
rm
ounding vrno inputs
to Tk using n , then
aTrounding v0 v1 v0 v1
yrno 0 mode rm produces 1vrm , the
yrno , yrno ] (where the only option is to produce yrno ) for all x 2 Tn
or rounding
en
at all representations
performing vRdouble Ttok with
directlyrounding Tk using any rounding
the same
by first mode
rounding
rounding vRrm. TSince
to mode
n+2 any towards negative infinity
(e)using
rm,
Round
define
next stepthe
is odd
to findinterval
a range of vofrno . The near
values odd interval
yrno thatofisvguaranteed
rno is a range to
lue in Tk are also exactly representable in Tn , it is sufficient for
re any result
unded value of v inf (x)
the intervalFigure
for all Trounds6.7:toIllustration
v whentorounded show that to T rno result in Tn+2 maintains sufficient information to produce
k and rm. rno n+2
x) produces v rm y
vrno =
=rnoRNRNforTTn+2
kall
the
(v
,rm x R correctly
) rounded result
2R )Tn . Technically, the intervalin T with that (a) rne, (b) rna, (c) rnz, (d) rnp, and (e) rnn mode.
the
,rno (v
n
The real value is represented with a red star. Values v0 and v1 are two adjacent values in Tn . Tn+2
[y , y ],
RN is a
ng On the Odd Intervalcan
rno rno T k
generic
,rm (v rno )interval
= rm for x.
of vvexactly However,
represent v0 generating
and v1 (i.e.,A(x) w0 and w4 , respectively). Additionally, Tn+2 can represent
rno
ounding vrno to Tk usingthree a rounding mode rm produces
more values between v0 and v1 (i.e., v rm , the w , w , and w ). The dotted blue arrows show the double
yrno , yrno ] (where the only option is to produce
148 rounding from a real value to T y rno ) for all x 2 Tn 1 2 3
en
at rounding
performing directlyrounding
vRdouble to Tk using the same
by first rounding
rounding vR to mode with
Tn+2 using
n+2 rm, rno mode and then subsequently rounding the result to Tn .
define the odd interval of v
Sold . The
black odd
arrowinterval
shows
next step is to find a range of values near yrno that is guaranteed to
rno of
the v is
result
rno a
of range
directly rounded the real value to Tn .
re any result
unded value ofv inf (x)
the interval
for all Trounds to v when rounded to Tn+2
k and rm. rno
v = RN
vrno = RNTn+2 T (v R )
,rno (vR )
rm k ,rm

ng On theRNOdd (vrno ) =
Interval
Tk ,rm of vvrm
rno
ounding vrno to Tk using a rounding mode rm produces vrm , the
148
en
at rounding
performing directlyrounding
vRdouble to Tk using the same
by first rounding
rounding v to mode
Tn+2 using
rm,
define the odd interval of vrno . The odd interval ofR vrno is a range
re any value v in the interval rounds to vrno when rounded to Tn+2
147

• vR = v0 .

• v0 < vR < w2 .

• vR = w2 .

• w2 < vR < v1 .

• vR = v1 .

where w2 represents the midpoint between v0 and v1 . Double rounding error occurs when
the intermediate rounded result loses the information regarding the relationships between
vR , v0 , and v1 . However, rounding vR to Tn+2 using the rno mode preserves this relation-
ship. All values between v0 and w2 rounds to w1 , which is still a value between v0 and
w2 . A real value that is exactly equal to w2 is rounded to w2 . All values between w2 and
v1 rounds to w3 , which is still a value between w2 and v1 . Hence, double rounding with
the rno mode produces the same value as if rounding vR directly to Tn . Moreover, by our
definition of odd interval, any value v in the odd interval of the correctly rounded result
of vR in Tn+2 with the rno mode will round to the same value in Tk using rm mode as if
rounding vR directly to Tk using rm mode. Figure 6.7 shows the odd intervals using the
gray area. It can be seen that if we round any value within the gray area to Tn using the
standard rounding modes, then the result will be the same as if rounding the real value in
red star directly to Tn .

6.3.3 Generating Polynomials for Tn+2 with the rno Mode

Our strategy to create a polynomial approximation that produces correctly rounded results
of f (x) in Tk with any standard rounding modes is to generate a polynomial that produces
the correctly rounded results for Tn+2 with the rno mode. Algorithm 6.1 provides a high-
level sketch of our approach to generate such a polynomial. Given an elementary function
f (x) and a list of inputs X in the Tn representation (i.e., X ⊆ Tn ), our first step is to
148

Algorithm 6.1: A sketch of our approach to generate piecewise polynomial of degree d for the
elementary function f (x) in Tn+2 using rno mode. The resulting polynomial when used with
range reduction strategy (RRH and OCH ) produces the correctly rounded results for all inputs
with all Tk and standard rounding modes. CalcResultsInRNO computes the rno result of
f (x) in Tn+2 using oracle (shown in Algorithm 6.2). CalcOddIntervals computes the odd
interval for each input x and the set S containing inputs with singleton odd interval. (shown in
Algorithm 6.3). Using the list of inputs and odd intervals, we use the approach described in
Chapter 5 to generate the piecewise polynomial (lines 5 - 9).
1 Function GenericPolynomial(f , Tn+2 , H, X, d, RRH , OCH ):
2 O ← CalcResultsInRNO(f , Tn+2 , X)
3 (L, S) ← CalcOddIntervals(O, Tn+2 , H)
4 if L = ∅ then return (false, ∅, DNE)
// Generate piecewise polynomial using the approach from
Chapter 5
5 L0 ← CalcReducIntervals(L, H, RRH , OCH )
6 if L0 = ∅ then return (false, DNE)
7 L∗ ← CombineReducIntervals(L0 )
8 if L∗ = ∅ then return (false, DNE)
9 Ψ ← GenPiecewise(L∗ , d)
10 return (true, S, Ψ)

compute the correctly rounded result of f (x) in Tn+2 with the rno mode to produce yrno
for each input x ∈ X (line 2). Algorithm 6.2 shows our algorithm to compute yrno for each
input.
Next, we compute the odd interval of each result yrno (line 3). Any value in the odd in-
terval rounds to yrno . The algorithm to compute the odd interval is shown in Algorithm 6.3.
There can be inputs where the corresponding odd interval is a singleton. By the definition
of the rno mode and the odd interval, this occurs when the real value of f (x) is exactly
representable in Tn+2 and the bit-string of f (x) in Tn+2 is even. Generating polynomial
approximations that produce the exact value of f (x) is a difficult problem. Hence, we
identify inputs with singleton odd intervals using mathematical properties of elementary
functions and handle them as special cases (Section 6.3.7). Once we compute the odd in-
tervals for each input, then we can use the approach described in Chapter 5 to generate
piecewise polynomials with counterexample guided polynomial generation that produces
the correctly rounded result of f (x) when used with range reduction strategies (lines 5-9).
149

Algorithm 6.2: CalcResultsInRNO computes the correctly rounded result of f (x) in


Tn+2 using the rno rounding mode for each input x ∈ X. FGetRComp returns the round-
ing components for rounding y to Tn+2 .
1 Function CalcResultsInRNO(f , Tn+2 , X):
2 O ← ∅;
3 foreach x ∈ X do
4 y = f (x);
5 (s, v − , rb, sticky) ← RComp(y, Tn+2 );
6 if IsOdd(v − )∨(rb = 0 ∧ sticky = 0) then
7 yrno ← s × v −
8 end
9 else
10 v + ← GetSuccVal(v − , Tn+2 );
11 yrno ← s × v + ;
12 end
13 O ← O ∪ (x, yrno );
14 end
15 return O

At the end of this process, we have two components to produce correctly rounded results
for f (x). First, our approach generates a set S containing the inputs x whose odd interval
is a singleton. We implement a check to these inputs and return the pre-computed results
for them. A naive approach is to store the list of these inputs x and the pre-computed
result of f (x) into a look-up table and implement the check using branch conditions or
switch statements. To create a faster implementation, we use mathematical properties of
the elementary functions to implement these checks (Section 6.3.7). Second, our approach
produces piecewise polynomials that produce the correctly rounded results for all inputs
when rounded to any Tk with all standard rounding modes. We now describe our approach
in computing the rno result of f (x) and their odd intervals in Tn+2 .

6.3.4 Computing the rno result of f (x) in Tn+2

The first step in our approach is to identify the correctly rounded result yrno for each input
x ∈ X. Algorithm 6.2 illustrates our steps to compute the rno result of f (x) in Tn+2 . We
compute the real value y = f (x) for each input x using an oracle (line 4). Then, we obtain
150

Algorithm 6.3: CalcOddIntervals computes the odd intervals for each input x based
on the correctly rounded result yrno in Tn+2 . The list S is a set of inputs with singleton odd
interval. The list L contains inputs and the corresponding odd intervals where the interval is
not a singleton. GetPrecVal(a, T) returns the value preceding a in the representation T.
GetSuccVal(a, T) returns the value succeeding a in the representation T.
1 Function CalcOddIntervals(O, Tn+2 , H):
2 foreach (x, yrno ) ∈ O do
3 L←∅
4 S←∅
5 if IsEven(yrno ) then
6 S ← S ∪ (x, yrno )
7 end
8 else
9 y − ← GetPrecVal(yrno , Tn+2 )
10 l ← GetSuccVal(y − , H)
11 y + ← GetSuccVal(yrno , Tn+2 )
12 h ← GetPrecVal(y + , H)
13 L ← L ∪ (x, [l, h])
14 end
15 return (L, S)
16 end

the rounding components (s, v − , rb, sticky) as described in Section 2.1.3 (line 5). If the
real value y is exactly representable in Tn+2 (i.e., rb = 0 and sticky = 0) or the truncated
value v − is odd, then the rno result is s × v − (line 7). Otherwise, the rno result is s × v + ,
where v + is the succeeding value of v − in Tn+2 .

6.3.5 Computing the Odd Intervals

Once the correctly rounded result yrno of f (x) in Tn+2 using the rno mode is computed
for each x, the next step is to identify a range of values such that producing any value in
the interval rounds to yrno . We call this interval the odd interval. Because the range reduc-
tion, polynomial evaluation, and output compensation are evaluated in a higher precision
representation H, we compute the odd interval in H. Algorithm 6.3 illustrates our steps to
compute the odd interval.
If the bit-string of yrno in Tn+2 is even when interpreted as an unsigned integer, then
the odd interval is a singleton. The only value that rounds to yrno with the rno mode is yrno
151

itself. Generating a polynomial approximation that produces the exact singleton value yrno
is a challenging problem. These singletons restrict the amount of freedom available for
polynomial generation approaches described in previous chapters. Hence, we filter these
inputs and store the input x and the exact result yrno separately in the set S (line 7). In
Section 6.3.7, we describe our approach to identify the inputs with singleton odd intervals
and efficiently compute the result using mathematical properties of the elementary function.
If the bit-string of yrno is odd when interpreted as an unsigned integer, then all values in
H that are strictly greater than the preceding value of yrno in Tn+2 (lines 9-10) and strictly
less than the succeeding value of yrno in Tn+2 (lines 11-12) forms the odd interval. Any
value in the odd interval rounds to yrno when rounded to Tn+2 using the rno mode. The
odd interval in this case is not a single interval. We compute the odd intervals for each
input and store them in the list L (line 14).

Generating piecewise polynomial using the odd interval. The final step is to generate
piecewise polynomials that produce a value in the odd interval for each input. Each in-
put and its odd interval, (x, [l, h]) ∈ L specifies the constraint on the polynomial that we
want to generate. We can use the approach described in Chapter 5 to generate the piece-
wise polynomial that satisfies the constraints in L. Our approach also supports generating
polynomials with range reduction strategies described in Chapter 4 as well.

6.3.6 Implementing the Elementary Function

At the end of this process, our approach produces two components: (1) a set S containing
inputs whose odd interval is a singleton and (2) a polynomial approximation of f (x) that
produces the correctly rounded result of f (x) in Tn+2 with the rno mode for all inputs
(excluding the inputs in S). We implement the elementary function f (x) as follows. Given
an input x, we first check whether x is an input with a singleton odd interval. If so, we
can use the set S to obtain the result. Section 6.3.7 describes our approach to perform
152

this check and compute the result efficiently. Otherwise, we use the generated polynomial
approximation to produce the result. We use range reduction RRH to compute the reduced
input, use the reduced input the evaluate the polynomial, and use output compensation
OCH to obtain the result. We round the result of output compensation to Tk using a user
specified rounding mode to produce the final result. By Theorem 6.1, we guarantee that our
implementation produces the correctly rounded result of f (x) for any Tk with any standard
rounding modes for all inputs x ∈ Tk .

6.3.7 Efficiently Identifying Inputs With Singleton Generic Interval

One of the challenges in generating polynomials that produce a value in the odd interval
is that there are inputs with singleton odd intervals containing only yrno . To remedy this
problem, our approach treats all inputs x where the odd interval is a singleton as special
case inputs. Our approach identifies these inputs and the exact result of f (x) in a special
case set S. If the number of special case inputs in S is small, then we can check whether
an input is a special case using a series of branch conditions or switch statements with
minimal performance overhead. However, if there are hundreds of special case inputs,
then this strategy will incur overhead, significantly slowing down the performance of the
approximation function. Thus, our goal is to efficiently identify these inputs.

Mathematically identifying inputs with singleton odd interval. For many elementary
functions, it is possible to mathematically identify whether an input may result in a sin-
gleton odd interval and identify the exact result of f (x). Observe that both Tn and Tn+2
are finite precision representations. Hence, all values in Tn and Tn+2 are rational values.
The only case where the real value of an elementary function f (x) is exactly representable
in Tn+2 for an input in Tn is if the input is a rational value and the result of f (x) is a ratio-
nal value. For many elementary functions, there are only a few rational inputs that produce
rational results. For example, ex is a transcendental function where the only rational input
153

that produces a rational result is the input x = 0 resulting in e0 = 1.0. The problem of
identifying rational inputs that produce rational outputs for various elementary functions is
well studied [1, 5, 26, 106].
To identify inputs in Tn with a single odd interval, we first use mathematical proper-
ties of elementary functions f (x) to identify all rational inputs such that f (x) is a rational
value. Then, we check if these rational inputs and the corresponding rational results are
exactly representable in Tn and Tn+2 , respectively. If it is, the odd interval of yrno corre-
sponding to the rational input may be a singleton interval. We classify these rational inputs
as special cases. Finally, we develop a programmatic approach to efficiently identify the
inputs without using a series of branch statements and compute the exact result of f (x).
We now describe the mathematical properties that we use to identify inputs with single-
ton odd intervals and programmatic approaches to efficiently identify these inputs and the
corresponding results.

Exponential function ex . From Lindemann-Weierstrass theorem [5], if the input x is


rational and non-zero, then ex cannot be a rational value. Thus, x = 0 is the only input
where ex is exactly representable in Tn+2 with e0 = 1. We use a single branch statement to
check whether an input x = 0 and return the value ex = 1.

Exponential function 2x . To determine inputs x ∈ Tn that produces rational result of 2x ,


we first break the input x down into x = f + i where f ∈ [0, 1.0) represents the fractional
part of x and the integer i represents the integral part of x. Then, the function 2x can be
broken down into the following formula,

2x = 2f +i = 2f 2i

Our first goal is to find the rational inputs x where 2x = 2f 2i is a rational value. The value
2i is guaranteed to be a non-zero rational value since i is an integer. For the product 2f 2i to
be a rational value, the value 2f must be a rational value. Since f is also a rational value,
154

p
f can be broken down into f = q
where p and q are both positive integers. Additionally,
because f ∈ [0, 1), it must be the case that p < q. Then,

p √
2f = 2 q =
q
2p

The value 2p is an integer. The q th root of an integer is either an integer or an irrational



value [26]. For the value q 2p to be an integer, it must be the case that 2p is an integer and
p is a multiple of q. Since p < q, it must be the case that p = 0 and f = 0
q
= 0. Thus, the
only rational inputs where 2x is a rational value is when x is an integer i and the fractional
value f = 0.
Next, to identify all inputs x ∈ Tn where 2x is exactly representable in Tn+2 , we
identify all integer inputs x ∈ Tn and check whether 2x is exactly representable in Tn+2 .
If 2x is exactly representable in Tn+2 , then x is an input that can have a singleton generic
interval and x is classified as a special case input. In the case when Tn is a 32-bit float,
Tn+2 can represent all values of 2x for x between −151 ≤ x ≤ 127. Thus, the special case
inputs x ∈ Tn are all integers between −151 and 127 (279 inputs in total),

x ∈ Z ∧ −151 ≤ x ≤ 127

Checking whether an input x ∈ Tn is an integer can be performed efficiently using bit-


wise operations. Similarly, computing the final result 2x can be performed with bit-wise
operations.

Exponential function 10x . Similar to the 2x function, the only rational inputs x that result
in a rational result of 10x is when x is an integer. The reasoning can be followed similarly
to the case for 2x . Additionally, the result of 10x for negative inputs x (i.e., 10−1 = 0.1)
cannot be exactly represented in Tn+2 . Thus, the only inputs x ∈ Tn where 10x can be
exactly represented in Tn+2 are positive integers. In contrast to 2x , 10x grows much faster
and there are only a few inputs in Tn for which 10x is exactly representable in Tn+2 . When
155

Tn is a 32-bit float, there are exactly 12 integer inputs ranging from 0 to 11 where 10x is
exactly representable in Tn+2 . Although there is no efficient methodology in computing
10x even for integer inputs, we can use a switch statement to check whether an input is one
of the twelve special case inputs and return the correct result.

Natural log function ln(x). The result of the function ln(x) is always irrational if an
input x is rational and not equal to zero. This property can be proven using the Lindemann-
Weierstrass theorem [5]. Let us suppose that we have a rational value x 6= 1 where y =
ln(x) 6= 0 is also a rational value. Since y is a non-zero rational value, ey must be an
irrational value from Lindemann-Weierstrass theorem [5]. However,

ey = eln(x) = x

This creates a contraction because we assumed that x is a rational value. Hence, it must be
the case that y is an irrational value.
Thus, the only special case input x ∈ Tn is the input x = 1 where ln(x) is exactly
representable in Tn+2 with ln(1) = 0. We use a single branch statement to check whether
an input x = 1 and returning the value ln(1) = 0.

Log base two function log2 (x). To determine rational inputs x that produces rational
p
result of log2 (x), let us suppose that log2 (x) = q
for positive integers p and q. Then, it
follows that
p √
q
x = 2q = 2p

The value 2p is an integer. The q th root of an integer is either an integer or an irrational


value [26]. For x to be rational, x has to be an integer. For x to be an integer, 2p has to be a
perfect q th power of x. Hence, x must be an integer of the form 2i where i is an integer. We
use bitwise operations to check if the input x is a power of two, i.e., 2i . If it is, we return
the value i, which can be efficiently computed using bitwise operations.
156

Log base 10 function log10 (x). Similar to log2 (x), log10 (x) produces a rational result
when x is a power of 10 (i.e., x = 10i where i is an integer). Additionally, Tn cannot
exactly represent negative power of 10 (i.e., 10−1 = 0.1). While there is no efficient way
of identifying whether an arbitrary value x is a power of 10, there are a limited number of
inputs in Tn that are power of 10. In 32-bit float (i.e., F32,8 ), there are only 11 such values
ranging from 100 to 1010 . For each inputs 10i where 0 ≤ i ≤ 10, the result log10 (10i ) = i
is exactly representable in Tn+2 (i.e., F34,8 ). We use a look-up table to store the inputs and
its corresponding result in Tn+2 and implement the check in a switch statement.

The hyperbolic sine function sinh(x). As an extension of the Lindemann-Weierstrass


theorem, if the input x is rational and non-zero, then y = sinh(x) cannot be a rational
value [106]. The only input x ∈ Tn where sinh(x) is exactly representable in Tn+2 is the
input x = 0 where sinh(0) = 0. We use a single branch statement to check whether an
input x = 0 and returning the value sinh(0) = 0.

The hyperbolic cosine function cosh(x). Similar to the hyperbolic sine function, if the
input x is rational and non-zero, then y = cosh(x) cannot be a rational value. The only
input x ∈ Tn where cosh(x) is exactly representable in Tn+2 is the input x = 0 with
cosh(0) = 1. We use a single branch statement to check whether an input x = 0 and
returning the value cosh(0) = 1.

Trigonometric sinpi function sinpi(x) = sin(πx). The function sinpi(x) is equal to


sin(πx). By Niven’s theorem [106], the only rational values of x between 0 ≤ x ≤ 1
2

where sinpi(x) is also rational number are one of three values:






 0 if x = 0


sinpi(x) = 1 if x = 1

 2 6



1 if x = 1
2
157

Among these inputs 1


6
is not exactly represented in Tn . When we extend the domain of x
to the set of all inputs, there are only three categories of inputs in x ∈ Tn where the result
of sinpi(x) is representable in Tn+2 :



0
 if x is an integer


sinpi(x) = 1 if x ≡ 1
mod 2.0

 2



 3
−1 if x ≡ 2 mod 2.0

We implement the modulo operation efficiently using integer operations. Consider the
case when Tn is a 32-bit float. Without the loss of generality, let us only consider positive
inputs in Tn . First, because float has 23 precision bits, when x is broken down into the
binary scientific notation x = m × 2e , m can have up to 23 fraction bits. Thus, all inputs
x ∈ Tn greater than or equal to 223 are guaranteed to be integer values and the result of
sinpi(x) for these values are always 0. Next, if x < 223 , then we compute 2x in float and
round the result to a 32-bit integer to obtain the value t. In essence, this operation truncates
the value 2x to the integral part of 2x and removes the fractional part. Because 2x < 224 ,
the rounded result is exactly representable in 32-bit integer. If t and 2x are identical, then x
is guaranteed to be either an integer or a multiple of 0.5, indicating that x is a special case
input. We compute the result of sinpi(x) based on t:




 0 if t ≡ 0 mod 2


sinpi(x) = 1 if t ≡ 1 mod 4





−1 if t ≡ 3 mod 4

Trigonometric cospi function cospi(x) = cos(πx). Similar to the sinpi function, the only
input x in Tn where cospi(x) = cos(πx) is also rational can be classified as one of the three
158

categories: 



1 if x is an even integer



cospi(x) = −1 if x is an odd integer






0 if f raction(x) = 0.5

These checks can be performed efficiently using a similar strategy illustrated for the sinpi(x)
function.

6.4 Proof of Tn+2 Result with rno Producing Correctly Rounded Result for Tk

We now provide a proof of Theorem 6.1. Through this proof, we show that the result
produced by our polynomial approximation, which approximates the correctly rounded
result of f (x) in Tn+2 with rno rounding mode, will round to the correctly rounded result
for Tk using any standard rounding modes. Our proof for Theorem 6.1 applies to both
FP and posit representations. We first prove properties of the rno rounding mode and the
rounding components we use to round real values to Tk . Then, we use these properties
to prove Theorem 6.1. When we refer to the bit-string of vR , we refer to it in the infinite
extended precision representation, T∞ . When we refer to the bit-string of vrno , the rno
result of vR in Tn+2 , we refer to it in the Tn+2 representation.

Lemma 1. Let vrno = RNT,rno (vR ). Then, vrno preserves the sign of vR .

The value zero is exactly representable in Tn+2 . In both FP and posit representations,
the bit-string of zero is even when interpreted as an unsigned integer. Thus, the only real
value that rounds to zero when using rno is zero itself. All positive real values round to a
positive value in Tn+2 and all negative real values round to a positive value in Tn+2 when
using rno. Hence, vrno preserves the sign of vR .

Lemma 2. Let vrno = RNT,rno (vR ). Then, the first n+1 bits in |vrno | and |vR | are identical.

We prove the lemma with the rounding components (s, v − , rb, sticky) used for round-
ing vR to vrno in Tn+2 using rno. The value v − is the truncated value of |vR |. Hence, all
which Lemmais equal to b1. n+2The result . in preserves the cnsigncn+1ofvalue
vR . cn+3 Lemma 1.using
The rno.
T allresult in T n+2 preserves
round to the sign ofvalue
vR . in T
, the rnolast bit of B|vvcrno |v | R rno R
rno vrnoreal
ebits
valuebi = i for
v crepresentsall bits
value in
the bT and
largest where
Bc|viand = |vall
| B
Tvalue |22 negative
b=3
smaller .0.i2bbthan
.0b .n2.| .or
3bn+1 +real
nc
bn+2 Thus,
values
b3n+1
2.
equal cn+2
bn+2
to . .|vthe
000 first
…|.round
.We
.…. to a an
positive cn+2
toin|vT valuewhen
in T n+2 and negative values a positive n+2 when using rno.
rno

e refer to be itbecause
in thetornobymode
iB
infinite in extended
0b
n+2 .all
The value
R | precision
000
represents .4the largest value…Tsmaller than or equal …
We…
i berepresentation, is .equal
R R R |.n+2
is a one
would vn+2rounds R
Thus,|v|v thatnvcannot exactly 2}represented to a value
If vequal
Lemma
each
rno 6= vR , which
1.
other.
The means ReduxOr{c
result
rno | 6= v |vRi |,| then
in T
n+
it must = 1,
be thewhich
preserves casethe1that the to
sign last
of bit .B|v|
no
and
se whenresult
In
B In
such
are
(2.1)
such
a
v in
identical,
Bav case,
rno
case,
T
is odd and
reason
ReduxOr{c
n+2
ReduxOr{c
preserves
by
andiabout
(2.2) rno
extension,
B
the case the
the
isneven
is 0
whenequal
sign
first n
separately.
c
is
rno
(2.1)
2to
of
+c
equal v
1 R .
bits
performing
n+2
c in
B3v vtois.performing
odd B
and Lemma
… (2.2)
reduction…reduction 1. c
is.evenThec rnocv
norseparately.
v…operation operation
on
Rresult
n+2 on
v rno in T n+2 preserves the sign of v R .
The value zero is exactly representable in In both FP and posit The
representations,
value zero is exactly representable in . In both FP and posit representations,
|v |
Hence, preserves the sign of
|v |
where
bofn+2
vis=
the
R
1.
.a one
bit-string
The because
is
v
bit-stringrno
odd when
| i i
rnomode rounds
representation
|
interpreted
n i
v+ 2} + as
2} an
of Ball |vis |equivalent
unsigned
that cannot
integer,
4
Rto the T
B
orrno

2n+2
first n +represented
be exactly
n+1
bits of B to .an a value
Hence, v rno preserves the sign of v R . T n+2
v R |v |
cal.

=
R
odd,bn+2 then with|vwith| rounds
Rinfinite infinitenumber to number
v ofbased
(2.1) If B
zeros,
of onisthe
v zeros, odd,definition
then |vR | of roundsrno rounding to v based on the definition of rno rounding
es
ro of
bCombining
isAdditionaly,
exactly
rno n+2
Rounding
the The
both
therepresentable
bit-string value
2)Mode
Theorem
of zero
6.2Bin
zero andTforis =exactly
| Theorem
is .0Bbits
0b
even 2b In3 .b 6.3 . brepresentable
.both
when describes
bbn+13cFP
1 b4and
interpreted the… entire
posit in The
as T
bit-string . of
representations,
value
unsigned bIn |vrnoboth
zero |: b FP
integer. and
is exactly Thus,posit
the representations,
representable
the
bit-string
only real in Tn+2
ofvalue . In both
whenFP and positasrepresentations,
vzero is even interpreted unsigned integer. Thus, the only real value
the b
is bit is= zero since is 2 even: … …
n+2
nd
where
be the case bit-string
(n +
that the bits odd when|v
interpreted
n+2 all as
and
n
an unsigned
where integer, and n all bits
n+1 starting
n+2
Lemma . 2. ,Lemma 2. identical.
rno
 
| 6=Let ).BThen, theBvfirst first bitsn + 2inbits in B||vand the|vR | are Let rno = RNT,rno (vR ). Then, the first n+1 bits in |vrno | and |vR | are identical.
b c v b 2 i n + 2 |vrno
==vvRand (whichB|vrnomeans vmode,
|v Since thus|vRv
i
|),rnoisi
|=
the the vRN
proof first and is
T,rno further
i
20(v =split
i
into
. Since , two is the n+1
|bits
|R
0 | 0|in b|the
| = BReduxOr |v
v rno = nB
{ci B +|v| rno vB
159
rno R|
ReduxOr | i{c n i+
i | that
2}n =+b2} n+2= binn+2 0| +|·with
·0|v
· |R=0|the ·rno
· in
·= bn+2are not
Corollary
Additionally, 6.4.
the, crno
|v bit-string
Let | v
6
= |v be | of
the
indicates zero
rounded is|v0ceven
result
all bits of when
after
v the T interpreted
(n 2) nd n+2
bit as only
(b)
B unsigned
rounding mode integer.
all Thus, the only real value
ero is
e B to be the even
from when
that
cn+3 rounds
firstidentical.interpreted
, . . .nRare
first
n+2 bits
n+4
rno
+to
of 1
all
zero
bits B as
zero’s,
of when
| and|v BBunsigned
= c using
and=
|to be the
. . 0b
.B c integer.
R
b
c rno .
are. 0. is
n+2
b b
identical.
|vR |value represented
Thus,
zero 1 the the
itself. bit-string
All
|vR |
real
positiveof value
zero realis even
values when
that round interpreted
rounds to a
to positive
zero as unsigned
when using integer.
rno is Thus,
zero the
itself. only real value
All positive real values round to a positive
mportant
|vrnov| and B|vproperties are
toWe
of rno B|vRthat v
wevrno
rno 2 |
must
3 2
n
discuss
3
n+1 n n+1
to sign prove Theorem (s,R6.12.
toprove lastthe lemma . with the |vR |rounding components | v , rb, sticky)We used prove for round-
the lemma with the rounding components (s, v , rb, sticky) used for round-
R|
which
(v
zeros. =is
which equal
Because
RN isTequal if ball(v R,b)).
bitstheafter Since bit
therno of
(nrounding
B+|v of 2)nd mode
bit in B preserves are allthezeros, of
then vR ,|vitrno should
| and+ also |v
n+2 , the last bit rno |. c
rno ,rno n+2 |B
that rounds to zero when using is zero itself. All positive real cvalues round to a positive
n+2

o when using isT (2.2)


|v|zero If itself. is even,
All then|vrno
positive c. 2|v rnorounds
real cn+2 to
values that +round
+
,
rounds where cnv cisn+1
tonpositive
atoBpositive the csucceeding value
1|vzero when using inisTzero itself.all All positiverealreal values round to atopositive
a positive
even,
value represents the largest B
value B =
smaller 0b0 b
than . .or b | b
equal b to 000
|v … .|
We.=.
… v … … …
R | value rno in =and ,all negative real values round to alast arevalue
not allin Tvalue
n+2rnowhen using and negative values round value in Tn+2 when using rno.
then
The
v
Additionally,
|v rounds |vtorno 6 | |v
= |vtheindicates
+
v |vwhere
R | vthat
2+
isall
3 the R bitsn 3 +after
succeeding
n+1 4 |v
theRwhere |.
value
(n
rno 2) nd
bit in+ n+2 n+3
rno.
property
be
would Ifbit-string
the Ifand
vcase
be 6=that,
vequal vrepresentation
the
R ,to
vproofs
which
each
rno
means
n+2 of
R
for |vthem. succeeding |v|R |,6=then value iv isit
|,it| then
must nbeodd the 2}case the
that first
the bit bits
R| n+2
vother. inThus,rno | 6=
|vReduxOr{c vwhich is equal to
+ = 1, B
se whenzeros. (2.1)
rno
ing
rno 6= v odd
isvR
Rto , which
and rno means
(2.2)
Tn+2 rnousing
isnegative
even
|vRi rno.
separately.
c nd c
The must
cperforming
value be the case is that the
cnor
truncated
the last bit B|v|
|v|
valueing of v|vRRto |. Hence,
vrno in T alln+2 using rno. The value v is the truncated value of |vR |. Hence, all
aIn
B value
Because in, if147 Talln+2 bits andafterBivall
|the +0positive realbit3bein values 4|vR147 |… areround all to
zeros,
…reduction … athen positive |vcoperation
n+1 | value
and
0 |von Rin| Tn+2 when using rno.
=such avto case, ReduxOr{c is2,rno equal to
(n + 2) B
d allisbare
n+2
identical
negative
one 1. Hence,
because
is a one because rno
real B rno valuesvmode round
preserves
rounds i|v|rno all to
rno mode rounds all |vR | that cannot
=nRathe
|v RN
| that2}Tsign
n+2cannot of
(|v value
Rv |) .
exactly in T value
represented
n+2
R be exactly represented to an a value whenin toT anusing
n+2 a and
valuerno
rno. all negative real
Hence, values
v rno round
preserves to a positive
the sign of value
v R . in T n+2 when using rno.
result
no where
odd, then
would nrounds
|vRv|bit-string
thewith be + in
equal 2Tisbitstotoodd veach in
preserves
based and
vofinterpreted
other. on
v + Thus,thethe the
definition .first
3sign
02 cReduxOr{c.c.2cunsigned ofn+1
ncof3v+ 1rno bits
.|c2integer, …n in
rounding |vR=… | are which
c identical.
n |v cn+1 is equal The
to - value n + 2|vbits R | rounds
in v and to the first n + 2 bits in |vR | are identical. The- value |vR | rounds to
theinfinite v. number
Bzeros, = 0c nc 4ithe +…2} 1, of 1
erves Let bCombining
Bn+2
the
where sign
rno
|vR | beHence,
the ofboth
vRTheorem
bit-string
bit-string
n+2
rno
when
preserves
6.2 and
isrepresentation
odd when interpreted
Theoremthe
of |vas |an
Rsign 6.3
in T
as1 ofan
describes vRiunsigned
where R.
entire
Hence, bit-string
integer, vrno preserves rno |:
the vsign of vR . Rbit Sticky v Rbit Sticky
Lemma 2. LetitBvmust Then, the first bits in |vrno Lemma
| and R |vare
|vcases, 2.orwhen
identical.
Let (vRv).+Then, the. first bitsproof
in |vrno | and R | are when
two|vcases, identical.
=

= v Inand bn+2
our Bcase,
|v= ||v= B
either
1. | v=vv.v + Since
or
andthe
ReduxOr vrno isbe
succeeding = theRN
i| 1=bthe
first
case nthat
T,rno + 2(v
value bB
bits
Rv ).|+|in0= inBB|vT ,. the
|R+0|the|The ·.· first
We n+1
nsplit
+b1 bits our of proof into two
either thevrno the= RNT,rno
succeeding value in Tn+2 We splitn+1our into the
B|vR{c |0b n 03 2+ 3.2} .bv= |with0 v… ·… = bn+2
rno
ro isCorollary
exactly6.4. representable
rno Let rno be the in
rounded
B =
| |v T
i rno
n+2 2 b.
result
0b . bIn
. .bb.nboth
of 2 Rn in
bn+1 bn+1
b
n+2
3 FP
|vT
n+2
rno1n+2 b. .and
4 . posit
n+2 rno representations,
rounding
… n mode
bn+1 bn+2
|Band
rno= RN BCombining
are the Lemma
are(v
R | same
identical.
as).boththe 2.first
Then, Letn the
Theorem vrno first
6.2
bits
B
= and
of RN
|vrno
n+1 | = 0b2 b(v
Theorem
T,rno
. bits
The
3 . . .).
in
first 6.3
Rn
bnThen,
|v
bn+1 1 the first n+1 bits in |v | and |v | are identical.
describes |
bits and Lemma
of the |v entire
|
are are 2. identical.
0Let
bit-string
identical v tob of |v= bRN|: T,rno (vb
rno R ). Then, the
b first b n+1 bits
… in |v | and |v | are identical.
0 b b b …
|v
no (vrno
|vT,rno
bit-string
|vWe
R
6 prove)).of|Since +
therno is 1
odd
lemmarounding B
and when
mode itn+2
preserves is +
(niseven.
rno 2the sign B R
ofinv(c)B, |vitR |should
are notalso
rno rno
… R
bit-string of 1 isn+3
odd and whentheitrounding R
is even. components (s, v2 , rb,3 sticky) used
rno … 0 0 0
ofBwith the rounding components (s, 2 , rb, We used prove forvthe round-lemma n+4with n+1 for round-
+
v = RN ,rno| (v vthe v v
all v 3 sticky) n+1
is which is Tequal to R ,indicates last bit
thatof ||. in
eroAdditionally, all bits after Tthe 2)nd bit
= Rb|v + R
Then, even | , the bit-string representation equal to
|vRn|+when interpreted as unsigned + rno integer. Thus, the only inreal value
B|vrno n+2
rno n+2 |v
|vrno
ven, then
the Additionally,
first
Corollary rounds
2 bits
6.4. |vtorno
of B
Let|v||vrno
= =Rbe|v +
6 | . |Thus,
|v indicates
, where
the
the first nnd
rounded vthat
+ 1is all thebits
bits
result of of Bafter
succeeding rnoin
the value
(n +
| isTidentical with 2)to nd
the bitrno
the first B n|v+ R| 1
rounding are not mode all
be the case that, 6=We
v v
R ,prove afterthe the lemma with the |vR |rounding components (s, , rb, used for1allround-
R |vR
zeros. Because if all bits bit|vin | v bit sticky)
,are all zeros, thencase | and
rno n+2
If vrnothe which means Bthen
titrb, must
2.be| == the
(n|v +rno 2)| 6= |vrnothat 0 the |v last b|v|of b3value
lemma with vIfRvrounding
the bit-string components of 1 bis 3odd,.(s,
R. |, then Wevused prove
bythen for
the the
round-b lemma
definition B with the b
vIfRrounding
Hence,
theto|. bit-string ncomponents
2of (s,
is odd, vthen, The | == vvused bythefor
the round-
b2
definition
0 truncated of rno. b
…of Hence, 0all n +all
B|v=Ring to in 0busing
|vT bnv
inThe value is the truncated |vR3value ing of vHence, in Tall 2, vusing value is R |. Hence,
Figure if6.12: vrno Illustration
B | =v
+2 b2) of .Theorem
rno. bitbreal
|vsticky)
rno (a) corresponds
vzeros, R
to the
| 2and | case…
rno. (1)|v in
n+1 Theorem + rno. rb,|vsticky)
rno |vn+1 2
o when
would (vof using
bitszeros. Because
| .RN rno is
147 zero
all (v bits itself.
after
R )). Since |vrno |the
rnon+2
=All
rno positive
(nrounding
RN T
nd
,rno (|v
| mode
n11
Bvalues
R |) preserves
|vR | are all round
= 1,thewhich signto of vaRpositive
,|vitrno
should also
R rno n+2
abe
is rno one equal
ing(b)
because T
v
tocorresponds
n+2 each ,rno
torno v
other. mode in
Thus, ton+2
rounds
T
ReduxOr{c
theallusing case
|vn+2R | that i (2.1)
rno.
icannot in+be
n
The Theorem
2}exactly
value v 2,isand
represented is equal
the (c)totoan
truncated corresponds
a value value to |v
of the |. case
Hence, (2.2)allin
Tn+2 where using
would
While
be
t =n bits
ReduxOr{b
be rno.
Theorem +equalof R The
bits
2 values |v
toi rno
6.2 i in
|each value
| and
rno
reasons other.
nv +about2}.vand
|v |is
RThus,theare
the the
|vRfirst
ReduxOr{c
first truncated
identical.
n + 1 bits
| in T1nvalue +i |If 2inivalue
bits
the ning
thebit-string +of2}
bit-string
in v|v
|v RR= R|to|.are
1,v Hence,
ofwhich
representation v inisTall
identical. even,
n+2 The
equal using
tothen bits
the
value
(a)n + of
rno. 2last
R The
|vRbit
|vbits value
|ofand
| rounds
in vv |v toR |isare
is
vand theidentical.
the truncated
first n + If 2 value
bits inof|v|vRR| |.are
the bit-string Hence,
of v is all
even,The
identical. then the last
(b)value |vRbit of v is
| rounds to
|vthe case that, rno rno
d all negative
bLet
n+2 B=
where | be
R1. the Theorem
thereal bit-string
bit-string 2. round
representation
is odd whentointerpreted aof positive as an where
unsigned in Tn+2 integer, when using rno.
of the
and the bn+2rounded=n1. +value
first 2+bits vrno ,in
bits
thevnextand
in |v
theorem
|v| the
are 131
| first
reasons
identical. n about
+ +2
thebits
The
last in
|) ((n |v
value
+ R2)|ndare
|vbits
) bitidentical.
|
in the
rounds
in and
to Thefirst
the - value |vR2| bits rounds in |vto | are identical.+ The value |v | rounds to
erves
Combining
the sign either of vvR . or
nboth 2
Theorem the
B|vRsucceeding
6.2 and
| = 0b R 1 b2 b3 . . . bvalue
Theorem
rno = RN6.3 describes
n bn+1 bn+2
T n+2 v . . .in Tn+2
,rno (|v the R n
entire +
186 . WeRsplit our
2
bit-string of |v v
rno |: proof into v either
two cases,
n +v orwhen the Rbitsucceeding
the R Sticky value v in Tn+2 . WeRsplit our proof into two cases, when the
186
bit-string representation B rno | = 0b 2 b3 . . . bn bn+1 1
Combining bothofTheorem vrno . 6.2 |vand Theorem 6.3 +describes the entire bit-string of |vrno |:
ucceeding Let B|vvalue
Corollary either
|even.
6.4. be Let thevvvDepending or
bit-string
+rnoin beT the
the succeeding
on
Wethe
representation
. odd
rounded split
result value of
of value
our v||vRinRof in| in
proof T1and
vn+2
Trb in where
into
with Tsticky,
equalthe either
two . We
to rnocases, veither
rounding split
orwhen |v
the our
mode proof
| = v into
succeeding
the or |v two
rno |cases,
value =
v +vin
+
when
. TIf rb the.= We 0 split our proof into two cases, when the
Then, B|vrno
Additionally,
bit-string
R
| , the bit-string of is
representation
v n+2
|vRbe| indicates
and
of |vwhen
that
rno
allbe bits
Tit n+2 is iseven. n+2 rno
bit-string of v n+2 is odd and when it is even.
inafter inthe | itbit in | are not all
nd
|vrno |=v= 6 RN (n of + 2)| the B=|vR2)-
no = (v Corollary
Theorem
RN 6.3. (v
T,rno 6.4.
Let Rv). Let Then, the the(vfirst
Rrounded
) and n+1 B|vmode bits
result of
the |v
vbit-string and
T| n+2 with
|v |v
rnoR in are
the rno (nidentical.
0 rounding b2 mode b3 …|v b| n+1 0 0 If 1 …
s odd and
rno = RN
when bit-string
and
T n+2 ,rno (v
rno
sticky R )). ofSincerno T,rno
=v B 0, is|vBrno
rno rounding
odd
then | = and
|v0b 1 bwhen
2|b. 3=
rno | preserves
. .. b. nv . bbitn b.is R rno
Hence, bthe
even. sign
n+2 . . all . of vn R ,+ should
2ofbits also
oof |vrno | and R itare identical.
Ifitthe isif even. bit-string is| odd |vand when is even.
|vR=
bit-string beof
bis 3odd, then Tby the v definition R |of rno.If Hence,
the bit-string
all n + of 2 v is odd, then |vrno | = v by the definition of rno. Hence, all n + 2
n+1
v0b 1rno
bit zeros. =Because
representation all
. (v Letbits B after the
|
rno(n
the +2 b2)
bit-string
nd
bit n11 t |v
representationin B |vR |rno areof| |v =the
all vsign
zeros, then v ,|vitrno and also
(v
be therno RN Tn+2
case that,Tn+2 ,rno R |vR | )). Since rounding mode R preserves R | in 1 .ofThen, the
should
b3 of
nglemma with the rounding components (s, ,i| in used for
If
either the |bit-string and ofn by is =odd, then
of1,|vvthen =. by 0 round-
the
Because bthe
definition bn+1
of vHence,
is a|v0,1allthe
2 last bit only
Then, + to
| , therbbit-string representation
v 2}the rb, Tn+2
|v |is|+
sticky) =equalvvHence, …
where
lastvwould
of tis
bitthe
be
=B
in BReduxOr{b
be
odd,
|v
bits
case
|v
equal
then
of |vtoirno
|v
= ReduxOr{c
rno
|that,
rno
=
i 1n|other.
each
| and
rno
+
i=| 2}.
|v irnosticky
|vvR|Thus,
| +are
=
ReduxOr{c
RNTidentical.
where definition
ci rno
,rno (|v
n+2
isR |)
the | i|v of
Ifiththe rno
rno
bit nin =
= |vIf
2}
bit-string
rno. B R the
| where
which
of v nis
1,bit-string
all
ReduxOr 2equal
is+ of
even, v to is rno.
thenodd,
bits then
theoflast bitrno
|vrno | of |nv=+
and
2 by the definition of rno. Hence, all n + 2
|vvis
R | are identical. If the bit-string of v is even, then the last bit of v is
T is
|vn+2
Let
|
b
the
are
Busing
n+2 =
reduction
be bits
1.bit
rno.
identical.
the of The
that
or |vIfis
bit-string the
rno value
| and
different
operation.
representation|v
bit-string |isrno
|v|v are
v Rbetween
B
of the
|131
of
|
|v
=
v
rno
| truncated
=identical.
0b
RN
in
bT and
vTis
1 2 b ,rnoIf
.
even,
3
where
n+2
. . bvvalue
(|v
n +
btheisbits
R |)
then
n11 t the of last
bit-string
the |v
of R|v
last Hence,
of
|.bit,
bit |where
vof
andis|vall
v even,
the
is | are then
last bit
(c) the
of vlastIf
identical.
+
isbit
athe ofbit-string
1. v is of v is even, then the last bit of v is
Hence,
R |v | R R 1 rno R
Combining both Theorem 6.2 and Theorem 6.3 describes 186 the entire bit-string of |vrno |: 186
and thewhere Bfirst
|vt |theReduxOr{b
first 2nbits iin are . identical. .The 0value |vR |ifrounds
all bits to
|v bits Rof and are = identical.
Let
Intuitively, =ben the
+bit-string
Theorem 6.3 representation
+i B|1states
| =
nthat
|v +|2}.
0b |v
1 btheb3rno . of
.last b|v
b|nbit |bin
Rin |v
B|vR T.|1 .| where if and only
R

186
R 2 n+1 n+2 rno
186 186
Corollary
starting
ucceeding
from the
Then, B|vrno value
6.4. (n +
| , the Proof.
Let2)vndrno
v + in We
bit-string T
bitbe
in the
B
split
B
representation
n+2
rounded
are zeros.
R |the
result
R | = 0b b b 131
. |v|vWe split
ofproofs
|v
1 rno
Otherwise,
2our
3| .in
of vR in
. .proof
into
Tbn+2
n bn+1 Figure 6.8: Illustration of Lemma 2 and=Lemma 3 for different cases of |vR | and |vrno | where |vrno |
is
Bb|vTrno
into
two
n+2|. =
n+2cases:
equal two
to
with the rno rounding mode
. . 1. cases,
(1) When whenvrno the = vR in which case |vrno |
(v Since rounding modevrnopreserves the sign
caseof|v|vR=, it|vRshould also
s odd and
rno =We
Proof.
Then,when
RN
B|v
|v
split
R
Tn+2
rno ||,it
the
the
,rno

is
because
(vR )).into
proofs
bit-string
even. rno is the rno result of |v | in T
tworno
rounding
|vrno
cases:
Brepresentation
(1) When
.of
| = 0b1 b2 b3preserves bn11| tin T
. . b|vnrno the . We show
= vR in which
n+2sign
the bit-string of |vR | and |vrno | for three cases. (a) v − is
is equal
6
= and
|
to(2) when v R 6= v in n+2
R which case |vrno |
be therno
because case that, preserves the sign and (2) when vrno 6= vR in which case |v| 6= |vR |. rno
rounding
odd. − is even, rb = 0, and sticky = 0. (c) v − is even and either rb = 1 or sticky = 1.
where ReduxOr{b
t =express | i nB 2}. Band =
= 0b b b . . .(b)
b(|vb |) tv
ng of
First,
v we is odd,
|vR |.then
First,
the we| +=
bit-string
|v rno v |vbyBthe
express i |the definition
bit-string
into,
RN Bof |v rno.
|vrno |
|vrno
Hence,
| and
rno |
B|v | all
|vR |with,
n+2
1 T2 3 ,rno n R
n+2
n11
rno R

where
Let B beReduxOr{b
the bit-string n + 2}. 131 of |vR | in T1 where
i representation
|vR | are |vtR |=
identical. If thei | bit-string of v is even, then the last bit of v is
B|vrno | = 0b2 b3 . . . bn bn+1 bn+2
b3131
B|vR | = 0b1 b2B |v. . . b|n b= bn+2
n+10b 2 b3. .... . bn bn+1 bn+2
B|vR186 rno
n + 2 bits in v − and the first n + 2 bits in |vR | are identical. The value |vR | rounds to
| = 0c2 c3 . . . cn cn+1 cn+2 cn+3 . . .

Then, B|vrno | , the bit-string representation of |vrno | in Tn+2 is equal to


B|vR | = 0c2 c3 . . . cn cn+1 cn+2 cn+3 . . .
B|vrno |148
either v or the succeeding value v + in Tn+2 . We split our proof into two cases: When the
= 0b1 b2 b3 . . . bn bn11−
t

where t = ReduxOr{bi | i n + 2}.

bit-string
131 of v is odd and when it is even. Figure 6.8 pictorially illustrates our proof.
187 −

If the bit-string of v − is odd, then |vrno | = v − by the definition of rno (Figure 6.8(a)).
Hence, all n + 2 bits of |vrno | and |vR | are identical. If the bit-string of v − is even, then
the last bit of v − is even. Depending on the value of rb and sticky, either |vrno | = v − or
|vrno | = v + . If rb = 0 and sticky = 0, then |vrno | = v − (Figure 6.8(b)). Hence, all n + 2
bits of |vrno | and |vR | are identical. If either rb = 1 and sticky = 1, then |vrno | = v +
(Figure 6.8(c)). Because the last bit of v − is a 0, the only bit that is different between v −
and v + is the last bit, where the last bit of v + is a 1. Hence, the first n + 1 bits of |vrno | and
|vR | are identical.

Lemma 3. Let vrno = RNT,rno (vR ). Then, the last bit in |vrno | is equal to the bitwise OR
of all the bits in |vR | starting from the (n + 2)th bit.

Intuitively, Lemma 3 states that the last bit of |vrno | is zero if and only if all bits starting
from the (n + 2)nd bit in |vR | are zeros. We prove this lemma using a similar strategy as
Lemma 2. Let (s, v − , rb, and sticky) be the rounding components for rounding vR to Tn+2
using rno. The value |vR | rounds to either v − or the succeeding value v + in Tn+2 .
If the bit-string of v − is odd, then |vrno | = v − . The (n + 2)nd bit of v − and |vrno | in
160

Tn+2 is 1. Since v − is a truncated value of |vR |, the (n + 2)nd bit of |vR | is 1. Hence, the
bitwise OR of all bits of |Real| starting from the (n + 2)nd bit is 1.
If the bit-string of v − is even, then our proof is subdivided into two additional cases
depending on the values of rb and sticky. If rb = 0 and sticky = 0, then |vrno | = v − . The
(n + 2)nd bit of |vrno | is 0. Similarly, the (n + 2)nd bit of |vR | is even. Additionally, from
the definition of rounding components, rb represents the the (n + 3)rd bit in |vR | and sticky
represents the bitwise OR operation of all bits starting from (n + 4)th bit in |vR |. Since both
rb and sticky are 0, all bits in |vR | starting from the (n + 2)nd bit are 0.
If either rb = 1 or sticky = 1, then |vrno | is equal to v + , where the bit-string of v + is
odd. Hence, the (n + 2)nd bit of |vrno | is 1. Because either rb or sticky is not zero, at least
one bit starting from (n + 2)nd bit in |vR | is 1. The bitwise OR operation of all bits starting
from (n + 2)nd bit in |vR | is 1, which matches the (n + 2)nd bit in |vrno |.

Lemma 4. Define a representation Tk , a rounding mode rm, and two real values v1 and
v2 . Let (s1 , v1− , rb1 , sticky1 ) be the round components for rounding v1 to Tk and (s2 , v2− ,
rb2 , sticky2 ) be the rounding components for rounding v2 to Tk . If s1 = s2 , v1− = v2− ,
rb1 = rb2 , and sticky1 = sticky2 , then RNTk ,rm (v1 ) = RNTk ,rm (v2 ).

Lemma 4 follows directly from the definition of rounding components we described


in Section 2.1.3. Identifying the correctly rounded result of the real value in Tk for any
rounding mode only depends on the rounding components. Chapter 2 describes our de-
terministic approach to round vR to FP and posit representation, respectively, using the
rounding components.

Proof Of Double Rounding With rno Producing Correctly Rounded Result We now
present the proof of Theorem 6.1, which we reiterate for clarity:

Theorem 6.1. Let f (x) be the real value of an elementary function given an input x and
vrno = RNTn+2 ,rno (f (x)). Let v be a value in the odd interval of vrno such that for all v,
he same. components vns2for v rounding and to are the same. isinto identical to
+| T |vrno | concatenated with infiniteof number oftozerothe bi
(vntolongT)nto (RN (. components
R2)).
N Trounding
ounding
ding
vtall
ng 19and
unding
here
at
n vto
T,rm
the
bit
rno
RN
T
as
real
tthe
and
krnobit
T=
are
Rvalue
sticky
the
and s=
mode
ReduxOr{b
rno are
as ,RN
Tidentical.
=directly
)2bit:
(vRsticky
the n ,,Trm
s2are
vsticky
|E| 2bit:
RN
rb
< ,rm
i 2From
and
2,,to
rb
i T|nbit:
{rne,
identical.
2T
(v
, nrb
and
.+
nrno
T,n+2
2We
sticky Theorem
and
rna,
).2}. 2,r
now
sticky no From
. sticky
rnz,
2generalize
R
bv32 . and
.6.3,
rnp,
. v. 2bnidentify
and
bvn+1
identical.
Theorem
components we
components
r rnn
2 identify
and
Theorem
no
the
for}. T
know 6.3, From
Then,
rounding
for
n 6.11
. .rounding
identify the
. rounding
that
we Theorem
forRN
rounding
to the
know
rounding
bit T
the rounding
vall
rno
andnT
bit
last
vto
,rm
krno
the
and
T(bit,
that
as vnto6.3,
long
rno
bitstickythe
and
T)the
Rthe
are we
tos=
n are
as last
2T
sticky
the
,RNvs2know
(n
|E|
n
bit:
are
sticky
identical.
2,bit,
,T+rb
<
bit:
,,the
22)
vns2,rm vthat
rb
bit:2
nd, rb
and (RN bitthe
and 2From
2identical.
sticky
(n T+,inand
n+2
last
B
sticky
2)2
nd
no
|v bit,
Theorem
2(
From
,r. sticky
rno
. | Ris
bit 2the
)).
in equal
. Theorem 6.3,
|v weequal
rno | is6.3,
identical.
B(n + 2) know
Frombitwe inTheorem
that
know
B|vrno | is
thethatlast equal
the
6.3, bit, the
last
we bit,
(n the
know +that
2)(nnd
+bit2)
the last Tbit,
in
nd B
bit
1 |vrno is| Bis(n
in
the identical
|vequal
rno isnd
2) equal
1 bit B|vB |v| rno 1| concatenated
isTequal is Bidentical towith
B|vrno infinite number
| concatenated zeroinfinite
with bits numbe ri
or rounding We �rst R toidentify Tn|vR. | We the decompose
1 b2rounding components
b into for rounding R to Tn . We decompose R into
rno
n ,rm ,rm
B = 0b
goal
20Theorem
ding
t,, 1rb
ka
sign
isreal
tovalue
Proof.
n.
bit.
prove
Let
Next,
6.11
= s
to
tous RN
wedenote
the
guarantees
143|
We
Tnresult
encode
|.
143 The
,rm
143
( r Rthe
to
that of
value
no|
first
)= = RN
performing
RN
result
performing|s intois
identify
T of
Tanthe,rm
double,rperforming
roundingbit-string
no
Rn+2
( reduction
to noR).
the
r(rounding the
Then
result
).component, by rounding
our
of
reduction
or rounding goal
performing
operation sign aorisrealtooperation
bit. on provecomponents
reduction
allto143
Next,
value bits
RN
we to143
on in
nthe
Tencodeor
,rm all
143 Bto (operation
result
|vbits
the
R in
= (s
|| )starting
of RN
result
| into
B 1 ,
Ton
performing
of
the
v
from
,rm all (to,bits
starting
1 the
performing rb
r the
bit-string no ).in
from
1 ,
reductionBsticky
result starting
|vreduction
the
R |ofor performing1 )
or from
operation for the
on
operation rounding
all on
reduction shown
bits inbits
all
or B|vv in
R
in
operation B Figure
to
| starting
on from
starting
all bits6.13
the
from
in B with
the startingB|vfrom .
rno |the
rounding
bn2, = R=
rbb2 n+1 =1bto
rb2b, n+1 sticky
, n+1 is
,Rsticky2 = the
sticky2 = same
ReduxOr ReduxOr
R {t,
= 1 0,{t,
as
ReduxOr the
0, n+2
0,131
{t,
0, round-
.=.0,
0, .0b
0, }0,
.= =
.b0, }t..=
.30b = }tn. 2=
. b..0b t.n., .1bnn , 1rb R R {t, |v n
R | R |v R | |v R |
B Based
21|v
epresentation , the on Theorem
bit-string
representation of+
T n ??,
representation
,nddenoted
in(n if the
the
2
in�nite
asrounding
of |vB B |
extended in 2components
BT . is b
: precision
.
equal.
b b for rounding R toofTT
to b , =rb b =rb ,
b = b, sticky ,
n is, denoted
sticky = thendsameasasTthe=round-
ReduxOr
sticky = ReduxOr
= ReduxOr
0,{t,0, 0,{t,
0, .
0, . .
0,
0, } .
0, =
. .
0, }
t. =
. . }
t = t
to Tnrepresentation :
R v v 2 v 2 3 13 2
n 2 n+1 2 n+1 n+1 2 2 2
with any| rounding
T nd T rounding
rno
= F(n n+2
bit, r noF)(nfor
rno 2 2

ove withthat mode and nvthen subsequently


bit, ) for the result bit, with any Tkrounding
nd
22ingTheorem rno
components T(n
6.12. (2)for
RR)be rounding bit, noto
(r norDefine ,+|E |2)
n , then we prove that (n (+
n 2) +bit, nd , |E
+| 2)nd bit,
ations T k and RN Let as = aRN+real 2)Tnvalue. Ttwo FP representations and
RNsTTnn+2 as (n RN 2)
Note that from Lemma [], the first n + 1 bits of |vrno |1in Tn+2 representation and B|v
n+2

n. . We decompose . .into The


Tn ,rm ( value is the first rounding compo-
inT
T ,rm ,rm
⇥ R ) =|.
|v
bv bbynv = s1rounding
nn+2 ,rm
eorem
it-string
heby
lt asthe bit-string
bbit-string 2.n+2
if {rne,
webB 2Thus,
in
vrounded
B TB we
vis.T.the
in
k show
isTtruncated
the that
is|vtruncated
the value
truncatedthe 1 b22rounding
value
v The .3 The
Bvalue
v value rounding
..| represented
bvThe 0b
.n11The
rounding 3R
rounding R T1B R the truncated
bode
23nall in
n+1 rounding bvn+3
rna, n
2 rnz,
modes
2 rnp, nB
rnn}
rm. rno
By =
|will
extension,
0bThe
produce bvalue
|.a2.RThe
163
correctlythis
n2b
= value
t = 2tproves
represented
rounded. by
represented
ReduxOr
bthe
Theorem
by bit-string
1the
result
n
{b
t = ReduxOr
bbit-string
as
|
the
i if we
n+1 bB
2.
bit-string
n 2Thus,
vrounded
n+2
+
B bvn+3
in
2}2t =
{bi | i n + 2} i
invis
n .T.the
we n. is
in
2 ReduxOr show
T truncated
{bthat
n is truncated
the value
| i the v2rounding
value
n
. The
value
+
v2 . vThe
2}
rounding
2 . The rounding
t = ReduxOr t = ReduxOr {bi | i {bin| t+ i =2}ReduxOr
n + 2} {bi | i n + 2} B|v1rno | = 0b B|v12b 3 .| .=. b0bn B21bb|v .b.n+1
3nrno .| b=nt00000
1 b2nbb3n+1
0b . . .t00000
bn 1 bn b. n+1
. . t00
ame.
i
components vs2for 2rounding and to are the same.
161
rno
ngo24
keng T
asrepresentation
vnto
real longTWe
are tosas
value are
2T,then
|E| 2,, <
vs2are
directly rb truncate
2,, rbvof
and
to 2, rb
T the
and
sticky
. We B truncated
2 . sticky
2 , sticky
and now 2to n2 . bits
. components
generalize value
components to
components
Theorem get
for T and
the
rounding
for 6.11 bit-string
for
rounding
to rounding
all T to T
as representation
to are
longT to are
asT , are,, ,
, of
and , the
and , truncated
and . . . value and
are identical. Additionally from Lemma [], the bit (the last bit) of is equ
rno rno n n R r no n v v v s vs
|E| rb
<v s v
rb sticky
rb sticky sticky nd
t = ReduxOr{b
nent representing
| |
the
1
sign of . The bit-string of in
1
the infinite extended precision (n + 2) |v |
rno rno n rno n 2 n 2 2 2 2 2 2 2
ounding to i | i
. We
n + decompose into
2
25 We identify �rst identify
R T
theThe n
n
rounding
2}. R
the The
bit-string rounding
bitrepresentation
and the
bit-string components
R sticky The
representation ofbit: bit-string
|vrno for ofthe
k

| inrounding v
|vrno R| in the
representation
infinite toinfinite
R extended
2
Tn|v
of
2
. We The | decompose
precision
extendedinbit-string
the
Theinfinite
|v
representation
precision
bit-string Rinto|
extended
representation
R The precision
representation
representation of |vrnoof|representation
bit-string in the| in
representation
|vrno infinite extended
the infinite
of |vrno precision
extended
| in representation
precision
the infinite representation
extended precision representation
rno
n
 bit.
n. Next, we143 encode143 | R |sinto the bit-string
rno
R = s 1 | R |. The value 1 is a rounding component, sign bit. Next,143 we143 encode
143 | R | into the bit-string
26 143
ReduxOr {b 0b | 131 2} ReduxOr {b | 2}
sentation stick of T
1 = , denoted
B = asisto
2 b i3 . . n. b + 1 b , | :B
rb 1 = b , stick 1 = i n +
tobit-string
the bitwise operation of
|vis 1 all bits Bin1 |vR | starting from
get the bit. Hence,
representation T in1identical
is theTin�nite i Bextended
Tidentical n n |ET
F | ,concatenated precision identical with toinfinite
representation
n+1 concatenated
|number of T of, number
T 1 with
denoted
zero isbits iofinfinite
as
identical
tois Tthe =number
to
right: of
| :B |vzero
concatenated bits towith theBright: infinite numbernumber of zero ofbits to bits
the right: OR + 2)nd representation
27
sT k and6.12. Tn+2Let asn1 representation, =
B to 1
| ,representations
|is pictorially
concatenated B|vrno with shown
infinite
Weasn then
BB|v| RRin
T Figure zero identical
bits
n6.13.
F to
B to
theT right:
is | concatenated
rnoidentical to with infinite
concatenated with zero
infinite numberto the ofright:
zero bits to the right:
Wetruncated
thenof Btruncate togetn bits 1 to the (n bit-string o
|vtruncate to bitsbit-string to| get therepresentation Weofrepresentation
then truncate the truncated
tothen bits value
to Bthe bit-string representation of the trunc
|vrno |v,rno
vR be a real1value. Define
We1 then truncate | to bits
|E
B| get |vto getnbit-string
the of the value
We truncate |vrno | to n bits to get the bit-string repr
1 1 |v
B 1 , rbR1 ,We then truncate Bto bits to the representation the truncated value
heorem two|vFP rno
Tk and Tn+2 rno
n
128extracted
bn+2Figure bn+3from . 13(a)
. . B |pictorially R | . The value shows
B | represented and
by stick extracted from ||. The Rn value represented R |by |v rno | |v rno |
R | = 0b 2b 3 . . . bn 1bn bn+1bn+2bn+3 . . .
n+1
Thus, the bit-string
29 rounding components B 1 in Tn is the for163rounding
truncated value, R 1 . Thus, the 1 rounding components for rounding R
epresentation
30We then to Tntruncate ofsthe truncated value 1. |vand
R |1to n bits to get the = bit-string vR |1.representation .1v|b.2of1=n.bn+1
the btruncated . 1.b.nvalue and
and rno2 and identify the brounding bit btand
k+1 the
=sticky t |bit:
3 . .identify bn+1the
k rounding
are 1, B 3 bit and the =bsticky b3 . . . bit:
1 ,| rb , and
2 and B|videntify bb1the 2 brounding k 1bit .and
. . bnthe sticky bit:
1 1
band
11identify
.the
1 brounding b1bit Bb|vand bthe sticky .bit: b32.v.2 . band identify bbthe rounding bit and bthe
0b21bbb3nb
2 bn3rnob n+3sticky
| . . . bit:
B2|v1brno . bn.b .b. . =|v1rno ....2bbnb b
1 t00000 . . .
. . .v | = b
.B 1b t… bn+2
0b
stick 1B rno |
v10b B and
|v
3 .|. =
= identify
n b. n+1 . b|vt00000
the b.rounding
2.t00000
k… k . .k-1 bn bt00000
bit and
n+1 the rno | B
n+2 k+1 sticky
0b2| b=
n+3 … ..0b bit: 3 .1.bB
n+2
nbb|vnn+1
rno1|bn
n+3 … bn. . 1. bnvbn+1
0b2t00000
n+1 t00000
… k k-1 kn+1
31identify the rounding bit and the sticky bit:
Vol. 1, No.
stick 1 =POPL, B Article
ReduxOr ={b0b i3 . . n. b+n 2}
i 1.2|bPublication 11bdate:
Proc.
n , January ACM rb11Program.2022.
= bn+1 ,Lang., stick Vol. 1, No. 1 =POPL, ReduxOr Article i n + 2}date:1January
{bi 1.| Publication 2022.
1
We then to truncate 1
| torepresentation
n bits toWe get the bit-string ofBrepresentation of the
get truncated value
racted from
We
We then truncate
R | . The value
We thenBtruncate
identify Bthe
represented | toremaining
nBbits
by
get the bit-string
|vrno | to n bitsthree toBget |vrnothe
roundingbit-string then
representationof
Wetruncate
components the then truncated
truncate |vrnovalue
the Bto|v1rno
truncated
|We
from nthenbits
| to B to
value
n bits
truncate the
toBget
b|vnR,,|b.n+1
bit-string
|vrnothe
The
1
| to
representation
bit-string
n bits
truncated of the of
representation
to get the bit-string
value vtruncated
the truncated
representation value valueof the truncated value 189 bn+1 ,2 ={t,ReduxOr
i ReduxOr ReduxOr
|vrno
Figure 13(a) B |pictorially shows 1
, rb1 , and stick B 1 extracted B = v 0bfrom = 2 b B30b | .R. | .b
2 .TheB
b
3 n . v value
. .
1 b = n , 0b
represented
1 b 2 n b , 3 rb . . 1by. b
= rb
n b11 n+1 = , rb
sticky 1 = b1n+1
sticky 1 = ReduxOr
= ,ReduxOr sticky
1{bB i v|12{bi== | 0b
ni +Bbv32}
2 n2. + .{b
=. b2} n| 2i1bb3B
i0b n.,vn
. .+b=n2} 1 b22nb
rb
0b =,3 .b.n+1
. brbn, 21=bn ,bsticky
n+1 , rb 22==sticky 0, 0, 0,2{t
sticky .=.
s, thethe rounding
bit-string v2 components
Band inidentify
T vn2 isand for
the
the rounding
rounding
truncated
identify thev2bit and
value,
rounding
R and identify
the.bit Thus,
vthe
sticky
and 1 the rounding
bit:rounding
the sticky
1 bit
bit: and v2the
components and sticky
v1identify
and bit:
for the
rounding
identify roundingthev rounding
and bit and
identify the
bit andsticky
the the
roundingbit:
sticky bit:
bit and the sticky bit: 2
1 1 2 2 R
to Tn are s 1 , in 1 , T
rb 1 ,
k andand stick its 1 .bit-string B is identified by truncating the first k bits in B |vR | . The rounding
v1
. bn0b21 b3n ,. . . bB v212b= , 0b 2 b3 ,. 2. .=bnsticky 1 bn,, 2 = sticky ReduxOr
rb 2 =B 2bn+1 ,0,B0b sticky 3= .bReduxOr {t, 0,
2 b0, .0, }1 b=
.,b.nb. n+1 n,, t =2 ReduxOr
Figure 6.9: Rounding components when rounding v and v v{t,
v1The 1value represented by the bit-string in truncated
Tn isBthe truncated . val
}2b2bn= , {t, 0,sticky .=}0,ReduxOr
Bv2 = 0b2B b3v . . = nrb n= bn+1
The value The
represented value =
ReduxOr vrepresented
0, 20,b=3.{t,
..0b
...0,
2by the bit-string
.1t0, .n.b,B
by
.n.v}21 b= rb
, t20b=
the b3n+1
bit-string insticky is vtheintruncated
= ReduxOr
Tn0,The is
{t,0,0,
thevalue
. 2.0,
truncated
represented
=. .t. } ={t,
value, . value,
t 0, 0, 0, . . . } = t
Thus, byv1the
The . Thus,
value bit-string the B
represented v2 in byTthe nBis the
vbit-string v2 invalue
Tn isv2the Thtr
Tthe k the
rb bn+1 = 0, = rb 2. .= sticky
2rb
1, No. POPL, Article 1. Publication date: Proc.January 2022. The Lang.,value to T , where 1 + |E| < k < n.
Vol. 1, No.represented by
2
the bit-string n
B v1 inv1Tn is n
B B=truncated
2bn+1
value, .rno
vThus, the
bit rb1 is the (k + 1) bit in B|vR | and the sticky bit sticky1 is the bit-wise OR 1of all bitsR
2
ACM Program. st POPL, Article 1. Publication date: January 2022. 2

The value The represented


value represented
We show the bit-string of v in the infinite extended precision and the bit-string of v
by the The value
bit-stringby the
rounding
represented
B bit-string
rounding
v2 in T Bbyis
components
n v2
the
the in
in T
bit-string
truncated
T
rounding
components n is Thethe value
truncated
B The
for2 rounding
v in v T
represented
value.
components
is
The
value
for rounding
2 n the
represented
v truncated
rounding
2
by.The
vR
the
The
R to
value
by
for
value
bit-string
rounding
the
vT
represented
Rrounding
v
ntoare
2 B .
bit-string The
Tv2nsare
in Brounding
T
by
1 , vs
v2n theis
in the
T
to
1R1,, rb
n truncated
bit-string
is
v1T
the B
1 ,nand
rbare
value
in
truncated T
1 , sticky
v2
sand
v .
is
value The
the
1 , components
n2
vsticky v rounding
truncated
. The
1 ,1 .rb1 ,1 .and
2 value
rounding
components
for
stickyv 2 . The
rounding rounding
for vrounding
1 . components rno to Tfor nvare sto2 ,T
rounding
rno vn2 ,are
rno vrbrno ,tovn+2
2s,2and 2T,sticky
nrbare2, sand 2 , rb2 ,2 .and st
22., vsticky
starting
components forfrom
components rounding the
forvrounding (kto +
components Tn v2) arernofor
nd
sto vbit
rounding
2, T 2n ,arerbin 2s, 2and ,Bvvrno rb| :2T, ncomponents
, Rto 2 are
and . sticky 2 , v2 2,. for
scomponents rb2 ,rounding
and forsticky components
rounding 2 . tovT are
n forto T sn2 ,are
rounding v2 ,srb 2 ,v ,2and to 2sticky
, rb 2 .s2 , v22,. rb2 , and sticky2 .
aresticky
T, nand
representation. rno 2|v
sticky vrno rno 2vrno

We now identify the rounding We now


We now identify the identify
roundingthe rounding components
components for roundingfor
components for rounding vrno torno v rounding
to T . We to T . We decompose
v decompose
Tn . Wen decompose
rno v v
n vrno rno rno
143
Bv1 = b1 b2 b3 . . . bk 1 bk , 143 rb1 = bk+1 ,
143 143
sticky1 = bk+2 | b143 k+3 | . . .
143 143 143 143
into vinto
rno v=rnos2=⇥ into ⇥v|v
s|v2 rno The
|.rno
rno The
s2 ⇥value
|. value
= 2The
2 is|.sthe
|vsrno value
issign
the sign is
bit sof2bit the.vOur
of
vrno sign bit of
rno . next
Our next is. to
Our
vrnostep
step is next
to
encode step
|vrnois|v
encode | to
rnoencode
| |vrno |
RN
Figure 6.13 pictorially (v) vrno1 .. Choose
=sticky a rounding mode ∈ {rne, rna,
rmrepresentation rnz,Trnp, rnn}. Then,
intoshows , rbinto
1 ,representation
and
T the
Bv,rno therepresentation
bit-string
into
the n+2 1bit-string
bit-string inrepresentation
theininfinite in extended
the infinite the infinite
extended extended
precision
precision precisionTrepresentation
representation . TTo . To 1 . To 1 1

Next, we identifydothe rounding


do
that, that, components
dolook
let usletfirst
us that,look
first (s
atletthe
us , v2bit-string
2first , rb2representation
look
atbit-string
the ,atsticky
the 2 ) for rounding
bit-string vrno
ofrepresentation
representation |vof |v
| in to Tof
T| in , | in Tn+2 ,
, |vrno rno rno n+2 n+2

Tk . The sign of vrno is the first component s2 . The bit-string of |vrnoT|kin ,rm
Tn+2 is pictorially Tk ,rm RN (v) = RN (f (x)))
B|vrnoB| |v=rno0b
| = .B
2 b30b . . .|1 b=n b0b
.2 b.|v3brno
n 1 b2nbtb3n+1
n+1 . . .tbn 1 bn bn+1 t
shown in Figure 6.13 with B|vrno | .
Note that both f (x) and v are real values that round to vrno when rounded to Tn+2
Note that from Lemma [], the
NoteNote
that that
fromfirst
Note
fromn+ 1 bits
that
Theorem
Theorem of6.2,
from
6.2, we we| inknow
|vTheorem
know
rno Tthat
n+2 representation
6.2,that
wefirst
the know
the first
n+ n1and
that the1Bfirst
bits
+ bits
|vR |Bin
in n + bitsBin
rno | and
B| |v1and
|vrno |vR |Bare
|vrno
R| |
and B|vR | are
are
with
are identical. Additionally the
from
identical.
identical. rno
Lemma
FromFrom mode.
[], the6.3,
identical.
Theorem
Theorem FromHence,
6.3,
(n we
+ nd
we
2)knowbit
Theorem
know we
(the
that show
last
6.3, we
that
the bit)
know
the
last ofthat
last
bit, that
bit,
|v
the |rounding
the
(n
rno is
+(n equal
last
ndbit,
2)+ the
bitnd
2) any
inbit
(n|vin+
B real
B nd value v
2)isrno
rno | |v |bit
is in
equal B|vrnoR| isto
equal Tn+2
equal using rno
to the bitwise OR operation
to the of all
to result
the bitsperforming
result
of toin
ofthe | starting
Rresult
performing
|v from the
of performing
reduction
reduction or (n 2)nd bit.
reduction
+ operation
operation
or onorHence,
onoperation
all all in
bits on
bitsBin Ball bits in
starting
starting | starting
B|vfrom
from Rthe the from the
to produce the value vrno and then subsequently rounding vrno to Tk using any standard
|v R | |vR |

B|vrno |(n=+b(n
b2+
12)
nd 2)nd bit,
b3 .bit,
. . bk (n + 2)nd bit,
1 bk . . . bn bn+1 t t = bn+2 | bn+3 | . . .
rounding mode rm produces the same value
= ReduxOr
t = ReduxOrt{b |i n+
as directly rounding vR to Tk using the same
{b2}| i n + 2} t = ReduxOr{bi | i i n + 2} i
189
rounding
The The mode.
bit-string More
representation formally,
Therepresentation
bit-string bit-stringofrepresentation
|vof | in
rno|v |we ofprove
ininfinite
rnothe the | inthat,
|vinfinite the infinite
extended
rnoextended extended
precision
precision precision representation
representation
representation
T1
T1 is identical T1
is identical
to isrnoB
B|vto identical to B|vrno
rno | concatenated
| |vconcatenated | concatenated
with with infinite
infinite with infinite
number
number of number
of zero
zero bits theof
bits
to zeroright:
toright:
the bits to the right:
RNTk ,rm (RNTn+2 ,rno (vR )) = RNTk ,rm (vR )
1
B|v1rnoB| |v1=rno0b
| = .B
2 b30b . . .|1 b=n b0b
.2 b.|v3brno
n 1 b2nbt00000
n+1 . . .t00000
b3n+1 bn . .1 .bn b. n+1
. . t00000 . . .
Since both f (x) and v round to vrno when rounded to Tn+2 with the rno, we have the
We then
relationship
We then We
truncate B|v1then
truncate B 1to truncate
n to bits
bitsn to
B to nthe
to get
|vget bitsbit-string
to get the bit-string representation
representation of thevalue
of the truncated
rno |the bit-string representation of the truncated value
truncated value
1
rno | |vrno |

v2 and and identify


v2 identify the and
the identify
v2 rounding bit the
rounding bitrounding
and andsticky
the bit andbit:
the sticky
bit: the sticky bit:

k
RNT ,rm (RNTn+2 ,rno (v)) k k
= RNT ,rm (vR ) = RNT ,rm (RNTn+2 ,rno (f (x)))
B v2 B
=v20b= . .2 b. 3B
2 b30b bn.v.2 .1 b=n,0b
1 b2nb,3rb
. .2. =
brb
n b21 bn ,,bn+1 , rb
=
n+1 2 = b2n+1
sticky
sticky 2 = ReduxOr
= ,ReduxOr
sticky
{t, 0, =
2{t, ReduxOr
0, 0, 0, .{t,
. . .0,}.= =0,t 0, . . . } = t
. t} 0,

thus
The Theproving
value
value TheTheorem
value
represented
represented the6.1.
represented
bybit-string
by the Our
by
bit-string
B Bproof
the in Tn holds
bit-string v2 in for
isBtruncated
v2 inv2Tn is the
the isFP
Tn value
truncated vrepresentations
thevalue
truncated value
v2 . rounding
2 . The The when
v2 . The
rounding Tn+2
rounding = Fn+2,|E|
components
components components
for rounding vrnofor
for rounding rounding
Tntoare
vtorno Tnsarevs2 2,to
, rno
2v v2T
,rb are
2 ,nand
rb 2 , vsticky
2 , sticky
sand 2 ,2 .rb2 ,2 .and sticky2 .
and Tk = Fk,|E| as long as 1 + |E| < k ≤ n. For posit representations, our proof holds
143
when Tn+2 = Pn+2,es and Tk =143P143
k,es as long as 2 ≤ k ≤ n.

In high-level, our strategy is to prove that the rounding components for rounding vR to
Tk are the same as the rounding components for rounding vrno to Tk . By Lemma 4, the
equivalence of rounding components proves that RNTk ,rm (vR ) = RNTk ,rm (vrno ) for all
rounding modes rm. We now show the rounding components for rounding vR and vrno to
Tk .
We first identify the rounding components (s1 , v1− , rb1 , sticky1 ) for rounding vR to
Tk . We decompose vR into vR = s1 × |vR |. The value s1 is the first rounding compo-
162

nent representing the sign of vR . The bit-string of |vR | in the infinite extended precision
representation, B|vR | , is pictorially shown in Figure 6.9.

B|vR | = b1 b2 . . . bk−1 bk bk+1 . . . bn+2 bn+3 . . .

We identify the remaining three rounding components from B|vR | . The truncated value v1−
in Tk and its bit-string Bv1− is identified by truncating the first k bits in B|vR | . The rounding
bit rb1 is the (k + 1)st bit in B|vR | and the sticky bit sticky1 is the bit-wise OR of all bits
starting from the (k + 2)nd bit in B|vR | :

Bv1− = b1 b2 b3 . . . bk−1 bk , rb1 = bk+1 , sticky1 = bk+2 | bk+3 | . . .

Figure 6.9 pictorially shows the rounding components Bv1− , rb1 , and sticky1 .
Next, we identify the rounding components (s2 , v2− , rb2 , sticky2 ) for rounding vrno to
Tk . The sign of vrno is the first component s2 . The bit-string of |vrno | in Tn+2 is pictorially
shown in Figure 6.9 with B|vrno | .
Note that from Lemma 2, the first n+1 bits of |vrno | and |vR | are identical. Additionally
from Lemma 3, the (n + 2)nd bit (the last bit) of |vrno | is equal to the bitwise OR operation
of all bits in |vR | starting from the (n + 2)nd bit. Hence,

B|vrno | = b1 b2 b3 . . . bk−1 bk . . . bn bn+1 t t = bn+2 | bn+3 | . . .

Because k ≤ n, there is at least 1 bit (i.e., bn+1 ) between bk and t, where t is the (n + 2)nd
bit in |vrno |.
We identify the remaining three rounding components for vrno from B|vrno | . The bit-
string of v2− , rb2 , and sticky2 are as follows:

bv2− = b1 b2 b3 . . . bk−1 bk rb2 = bk+1 sticky2 = bk+2 | . . . | bn+1 | t

Figure 6.9 also pictorially shows Bv2− , rb2 , and sticky2 .


Now, let us compare the rounding components for rounding vR and vrno to Tk . The
sign, s1 and s2 , are identical because rno rounding mode preserves the sign of vR . The
163

truncated values, v1− and v2− , are equal to each other because the bit-strings are the same.
The rounding bits, rb1 and rb2 , are both equal to bk+1 , which is guaranteed to be a bit in the
first n + 1 bits in both |vR | and |vrno | since k ≤ n. Finally, the sticky bit sticky2 is equal to,

sticky2 = bk+2 | . . . | bn+1 | t = bk+2 | . . . | bn+1 | bn+2 | bn+3 | · · · = sticky1

Hence, all rounding components for rounding vR and vrno to Tk are identical. From
Lemma 4, it follows that RNTk ,rm (RNTn+2 ,rno (vR )) = RNTk ,rm (vR ).

6.5 Odd Intervals for Extremal Values in Posit Representations

As stated multiple times throughout the chapter, our approach and Theorem 6.1 applies
directly to posit representations and posit rounding mode. Polynomial approximations that
produce a value in the odd interval of the rno result in Tn+2 representation (i.e., (n + 2)-bit
posit) will also produce the correctly rounded result for Tk (a k-bit posit) using the posit
rounding mode, as long as Tn+2 and Tk has the same number of es bits and 2 ≤ k ≤ n.
However, when the real value of f (x) for a given input x is outside the dynamic range of
the posit representation Tn , the corresponding odd interval may be unnecessarily restrictive
in the context of producing the correctly rounded result of f (x) in Tk . This is due to the
special behavior of posit rounding for extremal values. Without the loss of generality, let us
assume that a real value vR > 0. The posit rounding rule states that when vR is larger than
the largest representable posit value maxval in a posit representation, then vR rounds to
maxval. Similarly, when a value vR is smaller than the smallest representable posit value
minval, then vR rounds to minval. The odd intervals do not take these special rounding
rules into account. This results in odd intervals that are far smaller than necessary. As the
size of the odd interval defines the amount of freedom we have to generate polynomials
that produce the correctly rounded results, we need to generate the largest possible interval
that still produces correct results for Tk .
Figure 6.10 illustrates the difference between posit rounding and rno rounding for ex-
tremal values. Suppose that our goal is to generate an approximation function of a function
164

P32, 2
120 ∞
2
(a)

P34, 2

120 122 124 128 ∞


2 2 2 2

(b)

Figure 6.10: (a) The posit rounding mode specifies that all real values vR larger than the largest
representable value (i.e., 2120 for P32,2 ) rounds to the largest representable value. (b) There are three
values representable in P34,2 larger than 2120 . Separating odd intervals for each of these three values
are unnecessary for the posit rounding mode, since all values in the odd intervals of 2120 , 2122 , 2124 ,
and 2128 always rounds to 2120 when rounded to P32,2 .

f (x) that produces the correctly rounded results for the posit representation Tn = P32,2 and
all smaller representations Tk = Pk,2 where 2 ≤ k ≤ 32. The largest positive value that
P32,2 can represent is maxvaln = 2120 (the circle in Figure 6.10(a)). Based on the posit
rounding rule, all values vR larger than 2120 rounds to 2120 when rounded to P32,2 . Similarly,
any vR larger than 2120 rounds to the corresponding maximum positive representable value
(maxvalk ) when rounded to any Pk,2 . Hence, if the real value of f (x) given an x is larger
than maxvaln , it is sufficient for the polynomial approximation to produce any value larger
than maxvaln (i.e., [maxvaln , ∞)) for the input x. All values in this interval will round to
maxvalk , the correctly rounded result in Tk .
In contrast, the odd interval corresponding to the rno result of f (x) is far smaller than
[maxvaln , ∞) when the real value of f (x) ≥ maxvaln . The P34,2 representation can
represent three values larger than 2120 : 2122 , 2124 , and 2128 (rhombuses in Figure 6.10(a).
The odd interval of 2120 , 2122 , 2124 , and 2128 are shown with different colors. Depending on
the rno result of f (x), the odd interval can be as small as a singleton value (e.g., [2120 , 2120 ]
if rno result is 2120 )! Yet all values in all four odd intervals round to the same value when
rounded to any Tk representation using the posit rounding mode. There is no need to
reason about the odd intervals of 2120 , 2122 , 2124 , and 2128 separately. In fact, it is better
to constrain the polynomial approximation to produce a value in [2120 , ∞) to provide the
165

maximum amount of freedom to generate the polynomial.


Hence, if the real value of f (x) is larger than maxposn for a given input x, we set
the odd interval of the input x to be [maxposn , ∞). Similarly, if the real value of f (x) is
smaller than the minimum positive representable value in Tn (minposn ) but larger than 0,
we set the odd interval to be (0, minposn ]. The odd interval for the inputs, where the real
value of f (x) < 0 is outside of the dynamic range of Tn , is computed similarly.

6.6 Summary

This chapter presents a novel approach to generate polynomial approximations that pro-
duce correct results for multiple representations Tk and standard rounding modes where
the total number of bits is bounded by k ≤ n. Our key insight is to create polynomial
approximations that produce the correctly rounded result of Tn+2 using the rno mode. We
identify a range of values that round to the correctly rounded result of Tn+2 with the rno
mode, which we call the odd interval. We formally showed that polynomials that pro-
duce a value in the odd interval are guaranteed to produce correctly rounded results for all
representations Tk with standard rounding modes. Although our proof only shows for Tk
with the same number of exponent bits as Tn+2 , we believe that our approach can produce
correctly rounded results for any Tk as long as all values in Tk are representable in Tn
representation. We empirically support our claim in Section 7.3. We handle singleton odd
intervals by mathematically reasoning when elementary functions f (x) will produce exact
rational values.
Using this approach, we generated RL IBM -A LL, a generic math library that produces
the correctly rounded results for multiple FP and posit representations with all standard
rounding modes. Our generic math library is the first math library that produces the cor-
rectly rounded results for bfloat16, TensorFloat32, and float at the same time. This allows
developers to design and use new configurations of FP or posit representations and still be
able to accurately approximate elementary functions with the new representations.
166

CHAPTER 7
EXPERIMENTAL EVALUATION

We present the RL IBM prototype, a correctly rounded elementary function generator. Us-
ing RL IBM, we created several correctly rounded elementary functions for various rep-
resentations and rounding modes. We evaluate our elementary functions on their ability
to produce correctly rounded results for all inputs in their target representations and the
performance. In this chapter, we provide details on the experimental methodology, the
experimental setup, and the result of our experimental evaluation.

7.1 Experimental Methodology And Setup

The RL IBM prototype is a correctly rounded elementary function generator. Currently,


RL IBM supports the FP and the posit representation. RL IBM uses the MPFR library with
up to 1000 precision bits to compute the oracle value of f (x) and round the result to the
target representation. Although the Table-maker’s dilemma states that it is not feasible
to mathematically determine the number of precision necessary to identify the correctly
rounded results, prior work has empirically shown that roughly 160 precision bits in the
worst case are sufficient to compute the correctly rounded result for the double representa-
tion [85]. RL IBM uses SoPlex [50], an exact rational LP solver, to generate the coefficients
of the polynomials with a time limit of five minutes. We limit the size of the LP formu-
lation to contain up to fifty thousand reduced input and interval constraints. If the number
of intervals in the sample pool exceeds fifty thousand, then we increase the number of sub-
domains. To generate elementary functions with good performance, the user can provide
custom range reduction functions, the degree, and the structure of the polynomial (i.e., odd
or even polynomial).
Using RL IBM, we generated several correctly rounded elementary functions for vari-
167

ous representations. For evaluation purposes, we classify these functions into three math
library prototypes, RL IBM -16 [91, 92], RL IBM -32 [94], and RL IBM -A LL, depending on
the bit-width of the target representation. RL IBM -16 contains ten elementary functions for
bfloat16 with the rne mode and ten functions for posit16. RL IBM -32 contains ten func-
tions for the 32-bit float type with the rne mode and ten functions for posit32. Finally,
RL IBM -A LL contains ten functions that produce the correctly rounded result of f (x) for
the 34-bit FP representation (i.e., Tn+2 ) with 8 bits of exponent (FP34) in the rno mode.
FP34 representation is not supported in hardware. Hence, RL IBM -A LL stores the FP34
result in the double type. RL IBM -A LL’s FP functions are designed to produce correct re-
sults for all k-bit FP representations with 8 bits of exponent using any IEEE-754 standard
rounding modes as long as 9 < k ≤ 32. This includes bfloat16, tensorfloat32, and the 32-
bit float. RL IBM -A LL also contains ten posit functions that produce the correctly rounded
result of f (x) for the 34-bit posit representation with es = 2 with the rno mode. These
functions produce correctly rounded results for all k-bit posit representations with es = 2
as long as 2 ≤ k ≤ 32.
All three prototypes perform range reduction, polynomial evaluation, and output com-
pensation using the double type. Polynomials are evaluated with Horner’s method [10] for
efficiency and accuracy of the results. The functions in RL IBM -16 use the simple range
reduction strategies. RL IBM -32 and RL IBM -A LL use the more sophisticated range reduc-
tions described in Chapter 4 to significantly reduce the input domain.

Experiment Methodology. We evaluate the functions in RL IBM -16, RL IBM -32, and
RL IBM -A LL on two criteria: (1) the correctness of the output and (2) the performance
compared to the state-of-the-art libraries. To measure the performance, we compare our FP
functions against glibc’s libm [51], Intel’s libm [67], CR-LIBM [30], and MetaLibm [80].
Intel’s and glibc’s libm have elementary functions for float and double types. They are the
most widely used math libraries. CR-LIBM has correctly rounded elementary functions for
168

the double type. CR-LIBM provides multiple implementations for each elementary func-
tion to support four IEE-754 standard rounding modes, rne, rnp, rnn, and rnz. MetaLibm
provides elementary functions for both the float and the double types which are optimized
with AVX2 vector instructions. To produce the results in a target representation T that is
not natively supported by these libraries, we first convert the input in T to the representa-
tion supported by the library, use the elementary function, and round the result back to T.
We compare our posit16 functions in RL IBM -16 against SoftPosit-Math [88], a correctly
rounded math library for posit16. There are no math libraries for other configurations of the
posit representation (e.g., 32-bit posit32). We compare our posit32 functions in RL IBM -32
and the posit functions in RL IBM -A LL against glibc’s and Intel’s double library as well as
CR-LIBM. The double type can exactly represent all posit32 values. We do not compare
against existing float libraries because float cannot exactly represent all posit32 values.

Experimental setup. We perform all of our experiments on a 2.10GHz Intel Xeon Gold
6230R machine with 192GB of RAM running Ubuntu 18.04. We disabled Intel turbo
boost and hyper-threading to minimize noise. All of our libraries are compiled at the O3
optimization level. We use Intel’s libm from the oneAPI Toolkit and glibc’s libm from
glibc-2.33. The MetaLibm functions are generated using the optimizations for AVX2
extensions enabled. The test harness for comparing glibc’s libm, CR-LIBM, and Met-
aLibm is built using the gcc-10 compiler with -O3 -static -frounding-math
-fsignaling-nans flags. Because Intel’s libm is only supported in Intel’s compiler,
we built the test harness that compares Intel’s libm against our libraries using the icc
compiler. We use the flags -O3 -static -no-ftz -fp-model strict to obtain
as many accurate results as possible. To measure performance, we measure the number
of cycles taken to compute the result for each input using hardware performance counter
rdtscp. We then measured the total time taken to compute the elementary function as the
sum of the time taken by all inputs (i.e., 232 inputs for a 32-bit representation).
169

7.2 Experimental Evaluation of RL IBM -16

Table 7.1 provides details on the list of elementary functions available in RL IBM -16 and
the size of the polynomials generated for each function. Because there are only 216 inputs
in 16-bit representations, we aimed to generate polynomial approximations that minimize
the memory usage of RL IBM -16 functions. Otherwise, it may be better to simply store the
correctly rounded results of f (x) for all 216 inputs in a table. We used range reductions that
use a minimal amount of look-up tables. We tried to create the smallest piecewise polyno-
mials instead of the domain-splitting technique described in Chapter 5. The polynomials
used RL IBM -16 functions are generated within a few minutes.

7.2.1 Correctness Evaluation of RL IBM -16

Table 7.2(a) reports the result of our experiment to check the correctness of bfloat16 func-
tions in RL IBM -16 and other math libraries. All bfloat16 functions in RL IBM -16 produce
correctly rounded bfloat16 results with rne for all inputs. In contrast, we discovered that
re-purposing glibc’s or Intel’s float library did not produce the correctly rounded result for
the input x = −0.0181884765625 in 10x function. This case is especially interesting be-
cause both glibc and Intel’s float library produce the correctly rounded result of 10x for the
float type. However, the float result rounded to bfloat16 is not the correctly rounded result
for bfloat16 due to the double rounding error. Thus, repurposing a correctly rounded func-
tion for a representation T0 does not necessarily produce the correctly rounded result for
another representation T, even if T0 has more precision compared to T. This observation
has been one of the motivations for us to create RL IBM -A LL, a math library that guarantees
to produce correctly rounded results for multiple precisions and rounding modes. Our ex-
periment showed that repurposing glibc’s and Intel’s double library produces the correctly
rounded results of all inputs for bfloat16. Table 7.2(b) reports that all posit16 functions in
RL IBM -16 produce correctly rounded results for all inputs. SoftPosit-Math functions also
170

Table 7.1: Details on generated polynomials for bfloat16 and posit16 functions. For each
elementary function, we report the total time taken to generate the polynomials, the to-
tal number of reduced intervals, the number of polynomials generated, the degree of the
generated polynomial, and the number of terms in the polynomial.

Total Time Reduced # of # of


f (x) Degree
(Seconds) Inputs Polynomials Terms
bfloat16 functions
ln(x) 0.84 128 1 7 4
log2(x) 8.65 128 1 5 3
log10(x) 1.63 128 1 5 3
exp(x) 2.9 3820 1 4 5
exp2(x) 0.89 1937 1 4 5
exp10(x) 3 3840 1 4 5
5 3
sinh(x) 0.27 422 3 0 1
6 4
5 3
cosh(x) 0.27 471 2
6 4
1 1
sinpi(x) 32 16129 2
7 4
0 1
cospi(x) 32.2 16129 3 6 4
0 1
posit16 functions
ln(x) 3.32 4096 1 9 5
log2(x) 5.7 4096 1 9 5
log10(x) 6.5 4096 1 9 5
exp(x) 6.2 57371 1 6 7
exp2(x) 5.2 24201 1 6 7
exp10(x) 12 53106 1 6 7
7 4
sinh(x) 37.4 13044 2
6 4
1 1
7 4
cosh(x) 391.9 14400 4
6 4
6 4
1 1
sinpi(x) 38 12289 2
9 5
0 1
cospi(x) 85.7 12289 3 8 5
0 1
171

Table 7.2: (a) Generation of correctly rounded results for bfloat16 with RL IBM -16, glibc’s
float libm, and Intel’s float libm. Math libraries created by repurposing glibc’s double libm
and Intel’s double libm produce correct bfloat16 results for all inputs. (b) Generation of
correctly rounded results for posit16 with RL IBM -16 and SoftPosit-Math. 3indicates that
the library produces the correctly rounded result for all inputs. Otherwise, we use 7. N/A
indicates that the implementation is not available.

Bfloat16 Using Using Using Posit16 Using Using


Functions RL IBM -16 glibc float Intel float Functions RL IBM -16 SoftPosit-Math
ln(x) 3 3 3 ln(x) 3 3
log2 (x) 3 3 3 log2 (x) 3 3
log10 (x) 3 3 3 log10 (x) 3 N/A
ex 3 3 3 ex 3 N/A
2x 3 3 3 2x 3 3
10 x 3 7(1) 7(1) 10 x 3 3
sinh(x) 3 3 3 sinh(x) 3 N/A
cosh(x) 3 3 3 cosh(x) 3 N/A
sinpi(x) 3 N/A 3 sinpi(x) 3 3
cospi(x) 3 N/A 3 cospi(x) 3 3
(a) Correctly rounded results with bfloat16 (b) Correctly rounded results with posit16

produce correctly rounded results for the available functions. However, SoftPosit-Math
does not contain log10 (x), 10x , sinh(x), and cosh(x).

7.2.2 Performance Evaluation of RL IBM -16

Performance of bfloat16 functions in RL IBM -16. Figure 7.1(a) reports the speedup of
bfloat16 functions in RL IBM -16 against glibc’s float library (left bar in each cluster) and
glibc’s double library (right bar in each cluster). On average, RL IBM -16 has 1.1× speedup
over glibc’s float functions and 1.2× speedup over glibc’s double functions. Figure 7.1(b)
reports the speedup of bfloat16 functions in RL IBM -16 against Intel’s float library (left bar
in each cluster) and Intel’s double library (right bar in each cluster). On average, RL IBM -
16 has 1.4× speedup over Intel’s float functions and 1.6× speedup over Intel’s double
functions. RL IBM -16’s bfloat16 functions are faster than all corresponding functions in
Intel’s double library. RL IBM -16 is faster than glibc’s float and double library as well as
Intel’s float functions except for ln(x), log2 (x), and log10 (x). RL IBM -16 aims to generate
efficient polynomials that produce correctly rounded results for all inputs while using a
172

(a) Speedup of RL IBM -16’s float functions over (b) Speedup of RL IBM -16’s float functions over
glibc libm Intel libm
Speedup over float libm Speedup over double libm Speedup over float libm Speedup over double libm

2x 2x
Speedup

Speedup
1x 1x

0x 0x
x x x x
) ) ) x (x) (x) an ) ) ) x (x) (x) i(x) i(x) an
ln(x log 2(x og 10(x e 2 10
sinh cosh geome ln(x log 2(xog 10(x e 2 10
sinh cosh sinp cosp geome
l l

Figure 7.1: (a) Speedup of RL IBM -16’s bfloat16 functions compared to glibc’s float functions
(left) and glibc’s double functions (right). (b) Speedup of RL IBM -16’s functions compared to Intel’s
float functions (left) and Intel’s double functions (right).

2x
Speedup

1x

0x
x x
) (x) i(x) ) n
ln(x log 2
e 2
sinp pi(x mea
cos geo

Figure 7.2: Speedup of RL IBM -16’s posit16 functions compared to SoftPosit-Math library when
the input is available as a double. It avoids the software simulated cast from posit16 to double and
vice versa. SoftPosit-Math takes a posit16 input that is internally represented as an integer.

small amount of memory. The ln(x), log2 (x), and log10 (x) functions in RL IBM -16 use
range reductions that do not require look-up tables to minimize memory usage. These
range reductions have higher overhead compared to the range reductions used in glibc and
Intel’s functions. In all other cases, the lower degree polynomial that RL IBM generates
allows our bfloat16 functions to be faster than mainstream math libraries.

Performance of posit16 functions in RL IBM -16. We measure the speedup of RL IBM -


16’s posit16 functions compared to SoftPosit-Math functions when computing the posit16
result given a posit16 input. We found out that RL IBM -16 can have a significant slowdown
compared to SoftPosit-Math. The primary reason for this slowdown is because posit16
inputs are cast to double type before using RL IBM -16. The double result is then rounded
to posit16 value. Because common hardware architecture does not support posit represen-
tations, the cast operation is done using software simulation. We found out that posit16
functions in RL IBM -16 spend 75% of the time casting values between posit16 and double.
In comparison, SoftPosit-Math represents posit16 values internally as an integer and the el-
173

l’ = 1.003906… amount of freedom in mini-max approach


h’ = 1.01171…

g(x’) = 1.003907… PH(x’) = 1.003977… amount of freedom in our approach [l’, h’]

Figure 7.3: An illustration showing that our approach provides more freedom in generating a
polynomial that produces correctly rounded results of 10x for all inputs. The reduced interval [l0 , h0 ]
(shown in green box) corresponds to the reduced input x0 = 0.0056264 . . . . The real value of g(x0 )
is shown with the black circle. The polynomial generated using our approach produces a value
(red diamond) within the rounding interval. If we approximated the real result g(x0 ) instead of the
correctly rounded result, the margin of error allowed is much smaller (box with black border).

ementary functions are super optimized with purely integer operations to eliminate the cost
of casting. We measured the performance of RL IBM -16 without the cost of this cast, which
we report in Figure 7.2. SoftPosit-Math does not have implementations for log10 (x), 10x ,
sinh(x), and cosh(x). Hence, we report the speedup for the functions available in both
SoftPosit-Math and RL IBM -16. On average, RL IBM -16 has 15% slowdown compared to
SoftPosit-Math. The ln(x), log2 (x), sinpi(x), and cospi(x) functions in RL IBM -16 have
similar performance compared to SoftPosit-Math. However, the super-optimized imple-
mentations in SoftPosit-Math show higher performance for ex and 2x even though both
libraries use similar degree polynomials.

Performance impact of approximating the correctly rounded result. We provide a case


study to show that the RL IBM approach in approximating the correctly rounded result pro-
vides more freedom in generating efficient polynomial. Consider the elementary function
10x for bfloat16. The output compensation function we use for 10x requires us to approxi-
0
mate g(x0 ) = 2x where the reduced inputs are in the domain x0 ∈ [0, 1). RL IBM generated
a 4th degree polynomial that approximates g(x0 ) which produces the correctly rounded re-
sult of 10x for the entire input domain of bfloat16 when used with output compensation
functions.
The RL IBM approach provides more freedom to produce low degree polynomials. We
illustrate this point with Figure 7.3, which shows the reduced interval [l0 , h0 ] (green box)
174

for the reduced input x0 = 0.00562 . . . . The real value of g(x0 ) is highlighted with the
black circle. The real value of g(x0 ) is extremely close to l0 with only  = |g(x0 ) − l0 | ≈
1.31 × 10−6 amount of error. If we are to generate a polynomial that approximates the
real value of g(x0 ), then the polynomial must have an error less than , i.e. the polynomial
must produce a value within [g(x0 ) − , g(x0 ) + ]. This tight restriction can force the
mini-max approach to generate a higher degree polynomial to produce correctly rounded
results. Instead, the polynomial generated with our approach produces a value shown with
the red diamond from Figure 7.3. This value has an error of approximately 7.05 × 10−5 ,
which is much larger than , but still within the reduced interval. This freedom allows our
approach to generate lower degree polynomial, yet produce the correctly rounded results
for all bfloat16 inputs when used with output compensation function.

7.3 Experimental Evaluation with RL IBM -32

Table 7.3 provides details on the list of elementary functions available in RL IBM -32 and
the size of the polynomials generated for each function. We aimed to generate polynomials
with the best performance given a storage budget. Hence, we generated piecewise polyno-
mials where the degree of the polynomial is less than or equal to 7 (at most 8 coefficients)
and the number of sub-domains was less than or equal to 214 . Since each coefficient is
stored as a double value (8 bytes), the maximum size of the look-up table that stores the
coefficients of the polynomial will be 8bytes × 8 × 214 = 1MB. The output compensation
function that we use for sinh(x), cosh(x), sinpi(x), and cospi(x) involve two elementary
functions (i.e., gi (x0 )). We generated two piecewise polynomials for these functions. The
ex , 2x , and 10x function has both negative and positive reduced inputs. Hence, we created a
piecewise polynomial for negative reduced inputs and a piecewise polynomial for positive
reduced inputs.
Table 7.3 also reports the amount of time taken to generate float and posit32 functions.
The amount of time needed ranges from 19 minutes for cospi(x) for the float type to 25
175

Table 7.3: Details on the generated polynomials for float and posit32 in RL IBM -32. For each
elementary function, we report the time taken to generate the polynomials in minutes, the number
of reduced inputs, the size of the piecewise polynomial for approximating gi (x0 ), the maximum
degree of the polynomial, and the number of terms in the polynomial.

Gen. Time Reduced # of Poly- Deg- # of


f (x)
(Minutes) Inputs nomials ree Terms
float functions
ln(x) 218 7.2E6 210 3 3
log2 (x) 251 7.2E6 28 3 3
log10 (x) 429 7.2E6 28 3 3
27 4 5
ex 117 5.2E8
27 4 5
24 4 5
2x 86 3.0E8
23 4 5
26 4 5
10x 169 5.2E8
27 3 4
sinh(x) 28 1.5E8 26 5 3
cosh(x) 24 1.5E8 26 4 3
sinpi(x) 30 1.2E8 1 5 3
cospi(x) 19 1.2E8 1 4 3
posit32 functions
ln(x) 264 1.1E8 211 4 4
log2 (x) 288 1.1E8 28 4 4
log10 (x) 685 1.1E8 212 3 3
212 3 4
ex 1089 3.5E9
212 3 4
210 3 4
2x 814 7.9E8
212 3 4
213 3 4
10x 1528 3.4E9
213 3 4
214 5 3
sinh(x) 461 1.6E9
214 4 3
214 3 2
cosh(x) 528 1.7E9
212 6 4
212 3 2
sinpi(x) 716 1.1E8
213 2 2
211 5 3
cospi(x) 342 1.9E8
212 2 2

hours for 10x for the posit type. The majority of the total time is spent in computing the or-
acle results and the rounding intervals (i.e., 86% and 70% of total time for float and posit32,
respectively). It takes significantly longer to generate polynomials for posit32 as a whole.
There are fewer special cases in posit32 functions and requires more time to compute the or-
176

Table 7.4: Generation of correctly rounded results for 32-bit floats with RL IBM -32, Intel’s float
and double libm, glibc’s float and double libm, CR-LIBM, and MetaLibm’s float and double libm.
3indicates that the library produces the correctly rounded result for all inputs. Otherwise, we use
7. For each 7, we show the number of inputs with wrong results.

float Using Using Using Using Using


functions RL IBM -32 glibc float glibc double Intel float Intel double
ln(x) 3 7(4.2E5) 7(5) 7(1060) 7(5)
log2 (x) 3 7(3.1E5) 3 7(276) 3
log10 (x) 3 7(3.0E7) 7(1) 7(1.5E5) 7(1)
ex 3 7(1.7E5) 3 7(2.5E5) 3
2x 3 7(1.7E5) 7(2) 7(7.2E5) 7(2)
10x 3 7(1.7E5) 3 7(3.9E5) 3
sinh(x) 3 7(7.1E7) 7(2) 7(2.5E5) 7(2)
cosh(x) 3 7(1.8E7) 3 7(1.4E5) 3
sinpi(x) 3 N/A N/A 7(3.4E5) 3
cospi(x) 3 N/A N/A 7(3.8E5) 3
float Using Using Using
functions CR-LIBM MetaLibm float MetaLibm double
ln(x) 7(5) N/A N/A
log2 (x) 3 N/A N/A
log10 (x) 7(1) N/A N/A
ex 3 7(5.1E8) 7(5.1E8)
2x N/A 7(6.5E7) 7(1026)
10x N/A N/A N/A
sinh(x) 7(2) N/A N/A
cosh(x) 3 7(1.1E7) 3
sinpi(x) 3 N/A N/A
cospi(x) 3 N/A N/A

acle results. Additionally, posit32 has higher precision compared to the 32-bit float and has
saturating behavior with extremal values. Hence, RL IBM -32 generates larger piecewise
polynomials. Nonetheless, our domain splitting and counterexample guided polynomial
generation techniques have been vital in generating low degree polynomials that produce
the correctly rounded results for all 232 inputs in 32-bit representations.

7.3.1 Correctness Evaluation of RL IBM -32

Correctness of float functions. Table 7.4 shows the result of our experiment to check the
correctness of float functions in RL IBM -32 and other math libraries. All ten elementary
177

Table 7.5: Generation of correctly rounded results with posit32 functions for all inputs by RL IBM -
32, Intel and glibc’s double libraries, and CR-LIBM. 3indicates that the library produces the cor-
rectly rounded result for all inputs and otherwise, we use 7.

posit32 Using Using Using Using


functions RL IBM -32 glibc double Intel double CR-LIBM
ln(x) 3 7(22) 7(22) 7(22)
log2 (x) 3 7(19) 7(18) 7(18)
log10 (x) 3 7(26) 7(23) 7(23)
ex 3 7(4.4E8) 7(4.4E8) 7(4.4E8)
2x 3 7(4.0E8) 7(4.0E8) N/A
10x 3 7(5.2E8) 7(5.2E8) N/A
Sinh(x) 3 7(4.4E8) 7(4.4E8) 7(4.4E8)
Cosh(x) 3 7(4.4E8) 7(4.4E8) 7(4.4E8)
Sinpi(x) 3 N/A 7(70) 7(70)
Cospi(x) 3 N/A 7(90) 7(90)

functions in RL IBM -32 produce the correctly rounded result in the 32-bit float with rne
for all inputs. In contrast, the elementary functions in glibc, Intel, and MetaLibm’s float
libraries do not produce correct results for all inputs. Several functions in glibc and Met-
aLibm’s float library produce wrong results for millions of inputs while Intel’s float library
produces wrong results for thousands of inputs. When repurposed for the 32-bit float, the
double functions from glibc, Intel, and CR-LIBM do not produce correct float results for
ln(x), log10 (x), 2x , and sinh(x) due to double rounding error (i.e., rounding the real value
of f (x) to double and subsequently rounding the result to float). Even when f (x) is approx-
imated accurately enough to produce the correctly rounded double value (i.e., CR-LIBM
functions), rounding the double result to float may not produce the correctly rounded float
result. Finally, the functions in MetaLibm do not produce correct float results even when
MetaLibm uses Sollya [21], the same tool used by CR-LIBM to generate polynomials that
produce correctly rounded double results.

Correctness of posit32 functions. Table 7.5 shows that all ten posit functions in RL IBM -
32 produce correctly rounded results in posit32 for all inputs. All posit32 values cannot
be exactly represented in float representation but they can be represented in the double
178

representation. Hence, we created posit32 math libraries by repurposing glibc and intel’s
double math library as well as CR-LIBM. We tested the ability of the repurposed libraries
to produce correct posit32 results for the ten posit functions available in RL IBM -32. These
libraries do not produce correct results for posit32 functions. Contrary to float results, they
produce wrong results for millions of inputs, especially for exponential and hyperbolic
functions. The primary reason for such a high number of wrong results is the different
rounding behaviors between the FP representation with rne and the posit representations.
Posit representations do not underflow to 0 or overflow to ∞. Instead, extremely large
values are rounded to the largest representable posit value. Similarly, values extremely
close, but not equal, to 0 are rounded to the smallest non-zero representable posit value.
Hence, all repurposed libraries produce wrong results when the real value of f (x) is either
extremely large or close to zero.

7.3.2 Performance Evaluation of RL IBM -32

Performance of float functions. Figure 7.4(a) reports the speedup of RL IBM -32’s float
functions compared to glibc’s float library (left bar in each cluster) and glibc’s double
library (right bar in each cluster). On average, RL IBM -32 has 1.1× speedup over glibc’s
float library and 1.1× speedup over glibc’s double library. Figure 7.4(b) reports the speedup
of RL IBM -32 compared to Intel’s float library (left bar in each cluster) and Intel’s double
library (right bar in each cluster). On average, RL IBM -32 has 1.4× and 1.6× speedup over
Intel’s float and double library, respectively. Intel’s libraries produce correctly rounded
float results for more inputs compared to glibc’s libraries. Hence, RL IBM -32 has more
speedup compared to Intel’s libraries. Figure 7.4(c) reports the speedup of RL IBM -32
compared to CR-LIBM. RL IBM -32 has 1.9× speedup over CR-LIBM on average. CR-
LIBM produces correctly rounded double results for all double inputs. Thus, RL IBM -
32 has a higher speedup over CR-LIBM compared to glibc or Intel’s libraries. Finally,
Figure 7.4(d) reports the speedup of RL IBM -32 compared to MetaLibm’s float library (left
179

(a) Speedup of RL IBM -32’s float functions over (b) Speedup of RL IBM -32’s float functions over
glibc libm Intel libm
Speedup over float libm Speedup over double libm Speedup over float libm Speedup over double libm

2x 2x
Speedup

Speedup
1x 1x

0x 0x
x x x x
) ) ) x (x) (x) an ) ) ) x (x) (x) i(x) i(x) an
ln(x log 2(x og 10(x e 2 10
sinh cosh geome ln(x log 2(xog 10(x e 2 10
sinh cosh sinp cosp geome
l l

(c) Speedup of RL IBM -32’s float functions over (d) Speedup of RL IBM -32’s float functions over
CR-LIBM MetaLibm
Speedup over double libm Speedup over float libm Speedup over double libm
4x
2.5x 3.4x

Speedup
2x
Speedup

2x

0x 0x
) x
)
ln(x log 2(x og 10(x
) e (x) (x) i(x) i(x) an exp exp2 cosh ean
l sinh cosh sinp cosp geome geom

Figure 7.4: (a) Speedup of RL IBM -32’s float functions compared to glibc’s float functions (left)
and glibc’s double functions (right). (b) Speedup of RL IBM -32’s functions compared to Intel’s
float functions (left) and Intel’s double functions (right). (c) Speedup of RL IBM -32’s functions
compared to CR-LIBM functions. (d) Speedup of RL IBM -32’s functions compared to MetaLibm’s
float functions (left) and double functions (right) built with AVX2 optimizations.

bar in each cluster) and double library (right bar in each cluster). On average, RL IBM -32
has 2.5× and 2.7× speedup compared to MetaLibm. RL IBM -32’s float functions are faster
than all corresponding functions in Intel’s library, CR-LIBM, and MetaLibm. RL IBM -32
is faster than glibc’s library except for ln(x), log2 (x) and log10 (x) functions. However,
glibc’s functions do not produce correctly rounded results for all inputs. In all other cases,
RL IBM -32 produces correctly rounded float results in rne rounding mode and is more
efficient. RL IBM’s approach in generating piecewise polynomials for RL IBM -32 with low
degree polynomials has been instrumental in creating efficient elementary functions.

Performance of posit functions. The graphs in Figure 7.5 reports the speedup of RL IBM -
32’s posit functions compared to the posit32 math libraries created by repurposing glibc’s
double library, Intel’s double library, and CR-LIBM. On average, RL IBM -32 has 1.1×, 1×,
and 1.5× speedup over glibc’s library, Intel’s library, and CR-LIBM. All three repurposed
math libraries do not produce correctly rounded posit32 results for all inputs. RL IBM -32
180

(a) Speedup of RL IBM -32’s posit32 functions (b) Speedup of RL IBM -32’s posit32 functions
over glibc libm over Intel libm
Speedup 2x 2x

Speedup
1x 1x

0x 0x
x x x x x x
x) x) x) e ) ) n x) x) x) e ) ) ) ) n
ln( log 2( 0(
2 10 h(x osh(x mea ln( log 2( g 10(
2 10 h(x h(x i(x i(x ea
log 1 sin c o lo sin cos sinp cosp eom
ge g

(c) Speedup of RL IBM -32’s posit32 functions


over CR-LIBM
2.4x
2x
Speedup

1x

0x
) x
)
ln(x log 2(x og 10(x
) e (x) (x) i(x) i(x) an
l sinh cosh sinp cosp geome

Figure 7.5: (a) Speedup of RL IBM -32’s posit32 functions compared to glibc’s double functions.
(b) Speedup of RL IBM -32’s posit32 functions compared to Intel’s double functions. (c) Speedup
of RL IBM -32’s posit32 functions compared to CR-LIBM functions.

is the first correctly rounded math library for posit32 types and performs similar to or faster
than the repurposed libraries.

Performance impact of piecewise polynomials. We present a case study to show the per-
formance benefits of using piecewise polynomials. There are four 32-bit float functions
log2 (x), log10 (x), sinpi(x), and cospi(x) where RL IBM could generate a single polyno-
mial and produce correctly rounded results for all inputs. We created piecewise polyno-
mials for these functions with an increasing number of sub-domains. We validated each
piecewise polynomials to ensure that they produce correctly rounded results for all inputs.
We measured the change in performance as the number of sub-domains in the piecewise
polynomial increases from a single polynomial (i.e., 20 ) to 212 . Figure 7.6 reports the per-
formance of log2 (x) and log10 (x) with increasing number of sub-domains compared to
the performance of a single polynomial. Figure 7.6 does not show sinpi(x) and cospi(x)
because we found that using a single polynomial has the best performance.
Initially, there is a small decrease in the performance (14% overhead) from a single
polynomial to a piecewise polynomial because the degree of the polynomial does not de-
181

1.2x
1.0x

Speedup
0.8x
0.6x
0.4x log2(x)
0.2x log10(x)
0.0x
20 22 24 26 28 210 212
Number of sub-domains for the piecewise polynomial

Figure 7.6: Speedup of log2(x) and log10(x) function for the 32-bit float with increasing size
of piecewise polynomial approximations compared to a single polynomial generated using our ap-
proach. All polynomials produce the correctly rounded result for all inputs when used with the same
output compensation function. A circle represents a decrease in the degree of the polynomial.

crease and the implementations experience overhead when using look-up tables. However,
when we keep increasing the number of sub-domains, the degree of the polynomials starts
to decrease, and the performance increases. When the number of sub-domains increases
from 24 to 25 , the degree of the polynomials decreases from 5 to 4 and we only see approx-
imately 5% overhead compared to a single polynomial. At 28 sub-domains, the degree of
the polynomials decrease to 3 and we achieve 1.15× speedup compared to a single poly-
nomial. A piecewise polynomial containing 28 polynomials with degree 3 requires 6KB to
store all coefficients.

7.4 Experimental Evaluation With RL IBM -A LL

Table 7.6 provide details on the list of elementary functions available in RL IBM -A LL, the
total time taken to generate the polynomial approximations, and the size of the piecewise
polynomials generated for each function. For these generic functions in RL IBM -A LL, we
restricted the degree of the piecewise polynomials to 8 degrees at most. We also aimed for
smaller than or equal to 217 sub-domains. We increased the storage budget for RL IBM -A LL
compared to RL IBM -32 to account for the additional precision that RL IBM -A LL requires.
As RL IBM -A LL produces piecewise polynomials for 34-bit representations, the number
of sub-domains used in the resulting piecewise polynomials are bigger than RL IBM -32.
However, the degree of the polynomial in each sub-domain is similar to RL IBM -32.
182

Table 7.6: Details about the generated polynomials. For each elementary function, we show the
time taken to generate the polynomials in minutes, the size of the piecewise polynomial for approx-
imating gi (r), the maximum degree of the polynomial, and the number of terms in the polynomial.

floating point functions posit functions


Gen. Gen.
# of Poly- Deg- # of # of Poly- Deg- # of
f (x) Time f (x) Time
nomials ree Terms nomials ree Terms
(Min.) (Min.)
ln(x) 325 210 3 3 213 3 3
ln(x) 262
log2 (x) 420 28 3 3 212 3 3
log10 (x) 546 28 3 3 211 3 3
log2 (x) 296
x 27 4 5 211 3 3
e 241
27 4 5 214 3 3
log10 (x) 419
27 3 4 214 3 3
2x 151
27 3 4 214 2 3
ex 1048
x 28 3 4 214 3 4
10 402
28 3 4 214 2 3
2x 1179
sinh(x) 143 26 5 3 215 2 3
cosh(x) 135 25 4 3 217 2 3
10x 1825
sinpi(x) 308 22 5 3 215 2 3
cospi(x) 316 22 4 3 215 3 2
sinh(x) 652
215 2 2
215 3 2
cosh(x) 603
215 2 2
214 3 2
sinpi(x) 410
213 4 3
215 3 2
cospi(x) 745
215 2 2

Table 7.6 also reports the amount of time taken to generate RL IBM -A LL functions in
the second column. The amount of time needed ranges from roughly 2 hours to generate
cosh(x) function for the FP representation to 30 hours to generate 10x function for the
posit representation. For FP functions, roughly 79% of the total time is spent in computing
the oracle results using the MPFR library. In comparison, computing the intervals and
generating the polynomials takes 15% and 5% of the total time on average, respectively.
Because the posit functions have fewer special cases and require more precision, RL IBM
spends more time computing the intervals (23% of the total time) and generating larger
piecewise polynomials (39% of the total time). The remaining 38% of the total time is
spent in computing the oracle result using MPFR and correctly rounding the result to 34-
bit posit value with rno mode.
183

Table 7.7: (a) RL IBM -A LL’s FP functions produce correctly rounded results in the 34-bit floating
point representation (i.e., F34,8 ) using the rno rounding mode for all inputs. (b) Similarly, RL IBM -
A LL’s posit functions produce correctly rounded results in the 34-bit posit representation (i.e., P34,2 )
using the rno rounding mode for all inputs. 3indicates that the library produces the correctly
rounded 34-bit result using rno for all inputs. Otherwise, we use 7.

RL IBM -A LL produces correct RL IBM -A LL produces correct


f (x) f (x)
34-bit FP results with rno? 34-bit posit results with rno?
ln(x) 3 ln(x) 3
log2 (x) 3 log2 (x) 3
log10 (x) 3 log10 (x) 3
ex 3 ex 3
2x 3 2x 3
10x 3 10x 3
sinh(x) 3 sinh(x) 3
cosh(x) 3 cosh(x) 3
sinpi(x) 3 sinpi(x) 3
cospi(x) 3 cospi(x) 3

(a) (b)

7.4.1 Correctness Evaluation of RL IBM -A LL

Table 7.7(a) reports that the FP functions in RL IBM -A LL produces the correctly rounded
results in FP34 with the rno mode for all inputs. By Theorem 6.1 in Chapter 6, our ex-
periment indicates that the FP functions in RL IBM -A LL produce the correct results for all
k-bit FP representations with 8 bits of exponent using any IEEE-754 rounding modes as
long as 9 < k ≤ 32. These representations include bfloat16, tensorfloat32, and float repre-
sentations. Similarly, Table 7.7(b) reports that all posit functions in RL IBM -A LL produces
the correctly rounded results in the 34-bit posit representation with es = 2 using the rno
mode. Hence, the posit functions are guaranteed to produce the correct results for all k-bit
posit representation as long as 2 ≤ k ≤ 34 and es = 2, including the standard posit32
representation.
In addition, we wanted to check if RL IBM -A LL’s functions can produce correctly
rounded results for other representations. Hence, we built a test harness to check if the
FP functions in RL IBM -A LL produce correctly rounded results for all inputs with differ-
ent FP representations where the number of exponent bits ranges from 2 to 8 bits and the
184

number of mantissa bits ranges from 1 to 23 bits (i.e., 23 × 7 = 161). These configurations
include the standard 16-bit half type. Our experiment shows that RL IBM -A LL produces the
correct results for all 161 representations with all standard rounding modes. RL IBM -A LL
is the first library that provides the correctly rounded results for all inputs in hundreds of
FP representations with any standard rounding modes. Similarly, we built a test harness to
check and verify that RL IBM -A LL’s posit functions produce correctly rounded results for
all inputs in the standard 16-bit posit representation (i.e., posit16).
Next, we evaluate the ability of various state-of-the-art math libraries to produce cor-
rectly rounded results in float, tensorfloat32, and bfloat16 using the five IEEE-754 standard
rounding modes. The glibc’s and Intel’s float and double libraries only have one imple-
mentation per elementary function. Hence, we use the same function for all five round-
ing modes. In contrast, CR-LIBM has four implementations for each elementary function
that produces correctly rounded double results with rne, rnz, rnp, and rnn, respectively.
Hence, when computing the correctly rounded result for a given rounding mode rm using
CR-LIBM, we use the implementation that corresponds to rm.

Correctly rounded results with all rounding modes for float. Table 7.8 reports the result
of our experiment to check whether the existing libraries produce correctly rounded float
results. While RL IBM -A LL produces the correct float results will all five rounding modes
for all inputs, many elementary functions in glibc and Intel’s libm do not produce correctly
rounded results for all inputs with all rounding modes. In particular, even glibc and Intel’s
double functions do not produce correct float results of ex , 10x , sinh(x), and cosh(x) for
all inputs with rnn, rnp, and rnz modes. These results are especially interesting because
the error occurs due to the different behaviors of rnn, rnp, and rnz mode compared to
rne. For example, ex approaches infinity for large positive inputs. Both glibc and Intel’s
double libraries produce ∞ for these inputs, which is the correct float result of ex in the rne
mode. However, this result is incorrect for rnn or rnz mode because the result of ex should
185

Table 7.8: Generation of correctly rounded results for 32-bit float using the five standard IEEE-754
rounding modes rne, rnn, rnp, rnz, and rna. We use the elementary functions from RL IBM -A LL,
glibc’s libm (float and double), Intel’s libm (float and double), RLibm32, and CR-LIBM. If the math
library produces a double value, we round the output to a float value. 3indicates that the library
produces the correctly rounded float result using a given rounding mode for all inputs. Otherwise,
we use 7.

Using RL IBM -A LL Using glibc float libm Using glibc double libm
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 3 3 3 3 3 7 7 7 7 7 7 7 7 7 7
log2 (x) 3 3 3 3 3 7 7 7 7 7 3 3 3 3 3
log10 (x) 3 3 3 3 3 7 7 7 7 7 7 7 7 7 7
ex 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
2x 3 3 3 3 3 7 7 7 7 7 7 7 7 7 7
10x 3 3 3 3 3 7 7 7 7 7 3 7 7 7 7
sinh(x) 3 3 3 3 3 7 7 7 7 7 7 7 7 7 7
cosh(x) 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
sinpi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
cospi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Using Intel float libm Using Intel double libm Using CRLIBM
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 7 7 7 7 7 7 7 7 7 7 7 3 3 3 N/A
log2 (x) 7 7 7 7 7 3 3 3 3 3 3 3 3 3 N/A
log10 (x) 7 7 7 7 7 7 7 7 7 7 7 3 3 3 N/A
ex 7 7 7 7 7 3 7 7 7 3 3 3 3 3 N/A
2x 7 7 7 7 7 7 7 7 7 7 N/A N/A N/A N/A N/A
10x 7 7 7 7 7 3 7 7 7 3 N/A N/A N/A N/A N/A
sinh(x) 7 7 7 7 7 7 7 7 7 7 7 3 3 3 N/A
cosh(x) 7 7 7 7 7 3 7 7 7 3 3 3 3 3 N/A
sinpi(x) 7 7 7 7 7 3 3 3 3 7 3 3 3 3 N/A
cospi(x) 7 7 7 7 7 3 7 3 7 7 3 3 3 3 N/A
Using RLibm32
f (x) rne rnn rnp rnz rna
ln(x) 3 7 7 7 3
log2 (x) 3 7 7 7 3
log10 (x) 3 7 7 7 3
ex 3 7 7 7 3
2x 3 7 7 7 7
10x 3 7 7 7 3
sinh(x) 3 7 7 7 3
cosh(x) 3 7 7 7 3
sinpi(x) 3 7 7 7 3
cospi(x) 3 7 7 7 3

round down and produce the largest representable positive value in float instead. Similarly,
ex approaches zero for small negative numbers. Both glibc and Intel’s double libraries
produce 0 for these inputs which is the correct float result of ex in the rne mode. However,
186

this result is incorrect for rnp because the result of ex should round up and produce the
smallest representable positive value in float.
In contrast, CR-LIBM’s functions for rnn, rnp, and rnz mode produce the correctly
rounded float results with rnn, rnp, and rnz, respectively. Double rounding with these
rounding modes always produces correctly rounded results. For example, rounding a real
value vR to double with rnn and then subsequently rounding the result to float with rnn
produces the same result as if rounding vR directly to float with rnn mode. This is not true
for rne mode. CR-Libm’s functions for rne do not produce the correctly rounded float
results with rne mode for all inputs due to double rounding error. Additionally, CR-LIBM
does not provide implementations for the rna mode. RL IBM -A LL’s elementary function
guarantees to provide correctly rounded results for all five rounding modes.
RL IBM -32, which produces correctly rounded float results in the rne rounding mode
also produces correctly rounded results in the rna rounding mode for all functions except
2x . The rna rounding mode behaves similar to rna except when the value is equal to the
midpoint between two adjacent float values. In comparison, RL IBM -32 does not produce
correct results for all inputs with rnn, rnp, or rnz modes.

Correctly rounded results with all rounding modes for tensorfloat32 and bfloat16. Ta-
ble 7.9 reports the result of our experiment to check whether existing libraries produce
correctly rounded tensorfloat32 results. As tensorfloat32 has the same number of expo-
nent bits as FP34, RL IBM -A LL produces the correctly rounded results for all inputs and
all rounding modes. In contrast, glibc and Intel’s library, as well as RL IBM -32 does not
produce correctly round results in tensorfloat32 for all inputs and all rounding modes. Even
when glibc and Intel’s double libraries are designed to produce double results which have
significantly higher precision than tensorfloat32, they produce wrong results with extremal
values for rnn, rnp, and rnz mode, due to double rounding error. CR-LIBM’s functions
for each rounding mode produce correctly rounded tensorfloat32 results for all available
187

Table 7.9: Generation of correctly rounded results for TensorFloat32 using the five standard IEEE-
754 rounding modes rne, rnn, rnp, rnz, and rna. We use the elementary functions from RL IBM -
A LL, glibc’s libm (float and double), Intel’s libm (float and double), RLibm32, and CR-LIBM.
Then, we convert the output to TensorFloat32 values. 3indicates that the library produces the
correctly rounded TensorFloat32 result using a given rounding mode for all inputs. Otherwise, we
use 7.

Using RL IBM -A LL Using glibc float libm Using glibc double libm
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 3 3 3 3 3 7 7 7 7 7 3 3 3 3 3
log2 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
log10 (x) 3 3 3 3 3 7 7 7 7 7 3 3 3 3 3
ex 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
2x 3 3 3 3 3 7 7 7 7 3 3 7 7 7 3
10x 3 3 3 3 3 3 7 7 7 7 3 7 7 7 3
sinh(x) 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
cosh(x) 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
sinpi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
cospi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Using Intel float libm Using Intel double libm Using CRLIBM
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 7 7 7 7 7 3 3 3 3 3 3 3 3 3 N/A
log2 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 N/A
log10 (x) 7 7 7 7 7 3 3 3 3 3 3 3 3 3 N/A
ex 7 7 7 7 7 3 7 7 7 3 3 3 3 3 N/A
2x 7 7 7 7 3 3 7 7 7 3 N/A N/A N/A N/A N/A
10x 3 7 7 7 7 3 7 7 7 3 N/A N/A N/A N/A N/A
sinh(x) 7 7 7 7 7 3 7 7 7 3 3 3 3 3 N/A
cosh(x) 7 7 7 7 7 3 7 7 7 3 3 3 3 3 N/A
sinpi(x) 3 7 7 7 3 3 3 3 3 3 3 3 3 3 N/A
cospi(x) 3 7 7 7 3 3 7 3 7 3 3 3 3 3 N/A
Using RLibm32
f (x) rne rnn rnp rnz rna
ln(x) 7 7 7 7 7
log2 (x) 3 3 3 3 3
log10 (x) 7 7 7 7 7
ex 7 7 7 7 7
2x 7 7 7 7 3
10x 3 7 7 7 7
sinh(x) 7 7 7 7 7
cosh(x) 7 7 7 7 7
sinpi(x) 3 7 7 7 3
cospi(x) 3 7 7 7 3

elementary functions and rounding modes.


Table 7.10 presents the result of our experiment to check the ability of existing libraries
to produce correctly rounded bfloat16 results. Similar to tensorfloat32, RL IBM -A LL pro-
188

Table 7.10: Generation of correctly rounded results for bfloat16 using the five standard IEEE-
754 rounding modes rne, rnn, rnp, rnz, and rna. We show the results with the elementary
functions from RL IBM -A LL, glibc’s libm (float and double), Intel’s libm (float and double), RLibm,
RLibm32, and CR-LIBM. 3indicates that the library produces the correctly rounded bfloat16 result
using a given rounding mode for all inputs. Otherwise, we use 7.

Using RL IBM -A LL Using glibc float libm Using glibc double libm
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 3 3 3 3 3 3 7 7 3 7 3 3 3 3 3
log2 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
log10 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
ex 3 3 3 3 3 3 7 7 7 3 3 7 7 7 3
2x 3 3 3 3 3 3 7 7 7 3 3 7 7 7 3
10x 3 3 3 3 3 7 7 7 7 7 3 7 7 7 3
sinh(x) 3 3 3 3 3 3 7 7 7 3 3 7 7 7 3
cosh(x) 3 3 3 3 3 3 7 7 7 3 3 7 7 7 3
sinpi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
cospi(x) 3 3 3 3 3 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Using Intel float libm Using Intel double libm Using CRLIBM
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 3 7 7 3 7 3 3 3 3 3 3 3 3 3 N/A
log2 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 N/A
log10 (x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 N/A
ex 3 7 7 7 3 3 7 7 7 3 3 3 3 3 N/A
2x 3 7 7 7 3 3 7 7 7 3 N/A N/A N/A N/A N/A
10x 7 7 7 7 7 3 7 7 7 3 N/A N/A N/A N/A N/A
sinh(x) 3 7 7 7 3 3 7 7 7 3 3 3 3 3 N/A
cosh(x) 3 7 7 7 3 3 7 7 7 3 3 3 3 3 N/A
sinpi(x) 3 3 3 3 3 3 3 3 3 3 3 3 3 3 N/A
cospi(x) 3 7 3 7 3 3 7 3 7 3 3 3 3 3 N/A
Using RLibm32 Using RLibm
f (x) rne rnn rnp rnz rna rne rnn rnp rnz rna
ln(x) 3 7 7 3 7 3 7 7 7 3
log2 (x) 3 3 3 3 3 3 7 7 7 3
log10 (x) 3 3 3 3 3 3 7 7 7 3
2x 3 7 7 7 3 3 7 7 7 3
2x 3 7 7 7 3 3 7 7 7 7
10x 7 7 7 7 7 3 7 7 7 3
sinh(x) 3 7 7 7 3 3 7 7 7 3
cosh(x) 3 7 7 7 3 3 7 7 7 3
sinpi(x) 3 3 3 3 3 3 7 7 7 3
cospi(x) 3 7 3 7 3 3 7 7 7 3

duces the correctly rounded bfloat16 results for all rounding modes. Many elementary
functions in Glibc’s library, Intel’s library, RL IBM -32, and RL IBM -16 do not produce
correctly rounded bfloat16 results especially with rnn, rnp, and rnz mode. CR-LIBM’s
functions, on the other hand, produces correctly rounded results for all inputs in bfloat16
189

(a) Speedup of RL IBM -A LL over RLibm32 (b) Speedup of RL IBM -A LL over glibc libm
Speedup over RLibm32 Speedup over float libm Speedup over double libm

2x 2x

Speedup
Speedup

1x 1x

0x 0x
x x x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) ) )
ln(x log 2(x og 10(x e 2 10 (x) (x) an.
l sinh cosh sinp cosp geome l sinh cosh geome

(c) Speedup of RL IBM -A LL over Intel libm (d) Speedup of RL IBM -A LL over CR-LIBM
Speedup over float libm Speedup over double libm Speedup over double libm
2.5x 3.4x
2x
Speedup

2x

Speedup
1x

0x 0x
x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) )
ln(x log 2(x og 10(x
) e (x) (x) i(x) i(x) an.
l sinh cosh sinp cospgeome l sinh cosh sinp cosp geome

Figure 7.7: (a) Speedup of RL IBM -A LL functions compared to RLibm32 functions when pro-
ducing 32-bit float results. (b) Speedup of RL IBM -A LL functions compared to Intel’s float func-
tions (left) and Intel’s double functions (right) when producing 32-bit float results. (c) Speedup
of RL IBM -A LL functions compared to CR-LIBM functions when producing 32-bit float results.
(d) Speedup of RL IBM -A LL functions compared to glibc’s float functions (left) and glibc’s double
functions (right) when producing 32-bit float results.

with the four available rounding modes.

7.4.2 Performance Evaluation

Performance evaluation of RL IBM -A LL when producing float results. Figure 7.7 presents
the speedup of RL IBM -A LL’s FP functions compared to several math libraries when pro-
ducing float results with rne mode. Figure 7.7(a) reports the speedup of RL IBM -A LL’s FP
functions over RL IBM -32’s functions. RL IBM -A LL is as fast as RL IBM -32 (2% slower
than RL IBM -32) while being able to produce correctly rounded results for multiple rep-
resentations with all standard rounding modes. RL IBM -A LL’s sinh(x) function is 12%
slower than to RL IBM -32. When input x is near 0, sinh(x) exhibits a linear behavior and
sinh(x) ≈ x produces correctly rounded float values with rne rounding mode. RL IBM -32
uses this property by simply returning x for inputs near 0. In comparison, RL IBM -A LL
spends significantly more computation to ensure that it produces correctly rounded results
for all rounding modes
190

Figure 7.7(b) presents the speedup of RL IBM -A LL’s FP functions over glibc’s float
functions (left bar in each cluster) and double functions (right bar in each cluster). RL IBM -
A LL has 1.1× and 1.1× speedup over glibc’s float and double functions, respectively. Fig-
ure 7.7(c) presents the speedup of RL IBM -A LL’s FP functions over Intel’s float functions
(left bar in each cluster) and double functions (right bar in each cluster). RL IBM -A LL has
1.3× and 1.5× speedup over Intel’s float and double functions, respectively. Figure 7.7(d)
presents the speedup of RL IBM -A LL’s FP functions over CR-LIBM functions. RL IBM -
A LL has 1.9× speedup over CR-LIBM functions. RL IBM -A LL is comparable or faster
than the existing libraries while producing correct float results with all rounding modes. In
comparison, glibc’s libm, Intel’s libm, and CR-LIBM do not produce correct results for all
inputs when they are used to produce float values.

Performance evaluation of RL IBM -A LL when producing bfloat16 values. Figure 7.7


reports the speedup of RL IBM -A LL’s FP functions compared to several math libraries when
producing bfloat16 results with rne mode. Figure 7.8(a) presents the speedup of RL IBM -
A LL’s FP functions over RL IBM -32. On average, RL IBM -A LL’s FP functions are 8%
slower than RL IBM -32 when producing bfloat16. While RL IBM -32 produces float values
which can be efficiently rounded to bfloat16 using bit-wise operations, RL IBM -A LL pro-
duces double values. Because there is no hardware support for correctly rounding a double
value to bfloat16, RL IBM -A LL simulates this rounding with software simulation. Hence,
there is an additional 6% slowdown when RL IBM -A LL produces bfloat16 results compared
to RL IBM -32.
Figure 7.8(b) reports the speedup of RL IBM -A LL’s FP functions over glibc’s float
functions (left bar in each cluster) and glibc’s double functions (right bar in each clus-
ter). RL IBM -A LL has 1× and 1.1× speedup over glibc’s float and double library, respec-
tively. Figure 7.8(c) reports the speedup of RL IBM -A LL’s FP functions over Intel’s float
functions (left bar in each cluster) and Intel’s double functions (right bar in each cluster).
191

(a) Speedup of RL IBM -A LL over RLibm32 (b) Speedup of RL IBM -A LL over glibc libm
Speedup over RLibm32 Speedup over float libm Speedup over double libm

2x 2x

Speedup
Speedup

1x 1x

0x 0x
x x x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) ) )
ln(x log 2(x og 10(x e 2 10 (x) (x) an.
l sinh cosh sinp cosp geome l sinh cosh geome

(c) Speedup of RL IBM -A LL over Intel libm (d) Speedup of RL IBM -A LL over CR-LIBM
Speedup over float libm Speedup over double libm Speedup over double libm

2x
Speedup

2x

Speedup
1x

0x 0x
x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) )
ln(x log 2(x og 10(x
) e (x) (x) i(x) i(x) an.
l sinh cosh sinp cospgeome l sinh cosh sinp cosp geome

(e) Speedup of RL IBM -A LL over RLibm


Speedup over RLibm

2x
Speedup

1x

0x
x x
) ) ) x (x) (x) i(x) i(x) an.
ln(x log 2(xog 10(x e 2 10
sinh cosh sinp cosp geome
l

Figure 7.8: (a) Speedup of RL IBM -A LL functions compared to RLibm32 functions when pro-
ducing bfloat16 results. (b) Speedup of RL IBM -A LL functions compared to Intel’s float functions
(left) and Intel’s double functions (right) when producing bfloat16 results. (c) Speedup of RL IBM -
A LL functions compared to CR-LIBM functions when producing bfloat16 results. (d) Speedup
of RL IBM -A LL functions compared to glibc’s float functions (left) and glibc’s double functions
(right) when producing bfloat16 results. (e) Speedup of RL IBM -A LL functions compared to RLibm
functions when producing bfloat16 results.

RL IBM -A LL has 1.2× and 1.4× speedup over Intel’s float and double library, respectively.
Figure 7.8(d) reports the speedup of RL IBM -A LL’s FP functions over CR-LIBM. RL IBM -
A LL has 1.62× speedup over CR-LIBM. CR-LIBM performs better in producing bfloat16
results compared to its performance in producing float results. We conjecture that this is
because CR-LIBM has a two-phase approach in approximating elementary functions. For
inputs where the correctly rounded result of f (x) is easy to approximate, CR-LIBM pro-
duces the results efficiently. For inputs where the exact value of f (x) in reals is close to the
rounding boundary of two double values, CR-LIBM uses more accurate but slower approx-
imation methods. CR-LIBM likely computes the correctly rounded result of f (x) efficient
192

(a) Speedup of RL IBM -A LL over RLibm32 (b) Speedup of RL IBM -A LL over glibc libm
Speedup over RLibm32 Speedup over double libm

2x 2x

Speedup
Speedup

1x 1x

0x 0x
x x x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) )
ln(x log 2(x og 10(x
) e 2 10 (x) (x) an.
l sinh cosh sinp cosp geome l sinh cosh geome

(c) Speedup of RL IBM -A LL over Intel libm (d) Speedup of RL IBM -A LL over CR-LIBM
Speedup over double libm Speedup over double libm
2.4x
2x
Speedup

2x

Speedup
1x

0x 0x
x x x x
) ) )
ln(x log 2(xog 10(x e 2 10 (x) (x) i(x) i(x) an. ) )
ln(x log 2(x og 10(x
) e (x) (x) i(x) i(x) an.
l sinh cosh sinp cosp geome l sinh cosh sinp cosp geome

Figure 7.9: (a) Speedup of RL IBM -A LL functions compared to RLibm32 functions when produc-
ing posit32 results. (b) Speedup of RL IBM -A LL functions compared to Intel’s double functions
(right) when producing posit32 results. (c) Speedup of RL IBM -A LL functions compared to CR-
LIBM functions when producing posit32 results. (d) Speedup of RL IBM -A LL functions compared
to glibc’s double functions (right) when producing posit32 results.

with bfloat16 inputs compared to float inputs.


Figure 7.8(e) reports the speedup of RL IBM -A LL’s FP functions over RL IBM -16. On
average, RL IBM -A LL’s FP functions have 9% overhead over RL IBM -16. Since RL IBM -
16 is developed and optimized specifically for bfloat16, the majority of RL IBM -16’s func-
tions have speedup over RL IBM -A LL except for ln(x), log2 (x), log10 (x), and cospi(x).
RL IBM -A LL uses a more efficient range reduction strategy compared to RL IBM -16 and
results in better performance for these functions.

Performance evaluation of RL IBM -A LL when producing posit32 values. Figure 7.9


presents the speedup of RL IBM -A LL’s posit functions over various libraries when produc-
ing posit32 values. Figure 7.9(a) presents the speedup of RL IBM -A LL’s posit functions
over RL IBM -32’s posit functions. On average, RL IBM -A LL is 3% slower compared to
RL IBM -32. This additional overhead is caused by RL IBM -A LL’s posit functions produc-
ing correctly rounded results for 34-bit posit representation. Figure 7.9(b), (c), and (d) re-
ports the speedup of RL IBM -A LL’s posit functions over the repurposed math library using
193

glibc’s double library, Intel’s double library, and CR-LIBM. RL IBM -A LL has 1.0×, 1.0×,
and 1.4× speedup over these libraries, respectively. The performance of RL IBM -A LL’s
posit functions is comparable or faster than the state-of-the-art libraries while guaranteeing
to produce correct results for multiple posit configurations. In comparison, Glibc’s dou-
ble library, Intel’s double library, and CR-LIBM do not produce correctly rounded posit32
results for all inputs.
194

CHAPTER 8
RELATED WORK

For small bit-length datatypes (e.g., 8 bits), it is more beneficial to mathematically compute
the results of f (x) for each input (i.e., a total of 256 values) and store the results in a table.
However, as the size of data types increases (i.e., 32-bits or 64-bits), storing the results of
f (x) for each input required exponentially larger storage space. For instance, storing the
results of f (x) for the 32-bit float type requires 232 × 4B = 16GB of storage. Hence,
multiple decades of research in approximation and range reduction techniques have been
developed to accurately approximating elementary functions possible. Further, there are
numerous efforts in verifying the error bound of various implementations of elementary
functions and repairing math libraries to improve them. We present several seminal work
that has influenced and inspired the RL IBM approach.

8.1 Approximation Methods

Polynomial approximation. Polynomials are efficient to evaluate especially when using


Horner’s method [10] which requires at most n + 1 additions and n multiplications to eval-
uate a polynomial of degree n. Further, based on the Weierstrass approximation theorem,
any smooth function f (x) can be approximated as closely as needed using a polynomial ap-
proximation [144]. Hence, it is one of the most popular approximation techniques. Taylor
series [2], also known as the Taylor polynomial expansion, is an infinite degree polynomial
used to approximate any smooth function f (x) for inputs near a single pivot point a:

f 0 (a) f 00 (a) f 000 (a)


P (x) = f (a) + (x − a) + (x − a)2 + (x − a)3 + . . .
1! 2! 3!

Truncating it up to a desirable degree n creates a finite-degree (i.e., nth degree) polynomial


approximation. There are two distinct advantages of using the Taylor series when approxi-
195

mating elementary functions. First, generating the polynomial (especially the coefficients)
is straightforward as long as the derivatives can be computed. Second, the error of the
Taylor polynomial approximation is bounded by the maximum value of the (n + 1)st term
among all inputs in the input domain. Hence, Taylor polynomial can be used to produce an
approximation of f (x) with arbitrary precision. However, the main drawback of the Taylor
polynomial is that the error of the polynomial increases substantially as the input x deviates
away from the pivot point a. Comparatively, the Remez algorithm (explained in more detail
in Chapter 2) generates polynomials that produce more accurate results for a wider domain
of inputs, if the desired error bound is known when generating the polynomial.


Newton’s method for square root function. Unlike elementary functions, the x func-
tion is most commonly approximated with the Newton’s method or its variants. Abstractly,

the square root function y = x for a given input x (i.e., x is a constant) is converted to
y 2 − x = 0. Then, the root of this quadratic formula (with the independent variable y) is
solved using the iterative Newton’s method


y0 = initial guess of x
f (yn ) yn x
yn+1 = yn − 0
= +
f (yn ) 2 2yn


where f (yn ) = y 2 − x. The result yn is an approximation of x after n iteration. The New-
ton’s method converges quadratically on average. It is common to store a rough estimate

of x in a small lookup-table and use this value as the y0 to expedite the algorithm [3]. An
advantage of the Newton’s ralgorithm is that the numerical error occurring in each iteration
does not affect the numerical error of the final result. Rather, the numerical error of an
iteration will only slightly affect the convergence rate of the next iteration. With enough
iterations, the Newton’s method will produce an accurate result.
The hardware implementation of floating point division operation is slower than other
primitive operations. A technique to avoid the division operation in the Newton’s method
196

is to approximate the reciprocal square root (i.e., √1 )


x
using the Newton’s method (i.e.,
f (y) = y 2 − √1x ). Once the reciprocal of the square root (i.e., y = √1x ) is approximated,
√ √
x can be found by computing x = x × y. This strategy completely eliminates the
division operations and still converges quadratically. A similar technique known as Halley’s
method [110] provides a technique to iteratively approximates √1
x
with a cubic convergence

rate and the Goldschmidt’s algorithm [53] approximates both x and √1
x
simultaneously.

CORDIC. The shift-and-add algorithms are iterative processes that can approximate vari-
ous functions using only addition and binary shift operations (i.e., multiplication by powers
of 2’s). The very first concept of shift-and-add algorithm was used to compute the loga-
rithm of various values [115]. Since then, it has been adopted by the CORDIC method [139],
an algorithm for approximating trigonometric functions.
In essence, the CORDIC method computes trigonometric functions through a series of
2-dimensional vector rotations. The angle θ is decomposed into a series of small decreasing
angles θ = θ0 ± θ1 ± θ2 ± . . . . An initial vector [x0 , y0 ]T = [1, 0]T is rotated using these
angles (i.e., θi ) iteratively. By choosing a particular sequence of angles ±θi such that
tan(θi ) = 2−i , the vector rotation by the angle θi in each iteration can be computed with,
    
xi+1 1 −2−i x
 =K   i
yi+1 2−i 1 yi

where K is a predetermined constant. After a sequence of rotations by the angle θi , the


resulting vector [xi , yi ]T is an approximation of [cos(θ), sin(θ)]T . This vector rotation only
requires additions and shift operations after some optimization steps.
Mathematically, each iteration of the CORDIC method provides one digit of accuracy
in approximating cos(θ) and sin(θ). It is well received in the hardware community for its
straightforward implementations and ability to produce results efficiently when only a few
digits of accuracy is required. A generalized CORDIC algorithm was developed to approx-
imate hyperbolic functions [140] or perform multi-dimensional CORDIC algorithm [122].
Several work have been proposed to improve the performance by parallelizing [49, 138],
197

pipelining [43], and modularizing [143] the CORDIC method. The angle recoding [19, 64],
fast-forwarding [97], and the redundant algorithm [132] reduces the number of iterations
to both increase the performance and the accuracy. A detailed overview of prior research
on CORDIC is available in the survey [98].

Bipartite table method. Instead of storing the result of f (x) in a lookup table for ev-
ery input, the bipartite method uses two smaller lookup tables to approximate f (x). First
proposed to approximate trigonometric functions [129] and reciprocals (i.e., x1 ) [31], it has
been generalized for all elementary functions using polynomial approximations [120, 127].
Abstractly, the bipartite method divides the input domain into k larger subdomains and then
further split each subdomain into another k smaller subdomains (a total of k 2 subdomains).
The function f (x) in each smaller subdomain is approximated using a 1st degree polyno-
mial, P (x) = ci,j + ci x, where ci is unique for each larger subdomain (i.e., a total of k
values) and ci,j is specific to each smaller subdomain (i.e., a total of k 2 values). Using
this technique, approximating f (x) only requires two lookup tables with k and k 2 entries,
respectively.
The coefficients ci , j and cj can be identified using Taylor’s expansion. The input x
is decomposed into three k-bit smaller numbers, x = x0 + x1 + x2 . Then, the Taylor’s
expansion of f (x) with the pivot point x0 + x1 is used to create a monomial approximation,

P (x) = f (x0 + x1 ) + (x − x0 − x1 ) × f 0 (x0 + x1 ) = f (x0 + x1 ) + x2 × f 0 (x0 + x1 )

By approximating the term f 0 (x0 + x1 ) with f 0 (x0 ), we get the approximation function,

P (x) = ci,j + ci x2 , ci,j = f (x0 + x1 ), ci = f 0 (x0 )

The bipartite method can be implemented efficiently using two small lookup tables and a
few primitive operations. Hence, it is a popular technique along with the CORDIC method
to create custom hardware implementation. An extension, known as the multipartite table
method [41, 121] uses several smaller lookup tables created with higher degree Taylor
expansion to increase the accuracy.
198

8.2 Correctly Rounded Approximation

Naively using the approximation algorithms discussed in Section 8.1 with finite precision
representations does not guarantee to produce the correctly rounded result of f (x). As
the table maker’s dilemma [75] states, it may require an arbitrary amount of precision to
compute the correctly rounded results for an arbitrary input x and elementary function f (x).
Hence, several techniques were developed to avoid the table maker’s dilemma, produce the
correctly rounded results of f (x) efficiently, and create correctly rounded math libraries.

Math libraries using the onion peeling strategy. The first successful attempts in creating
correctly rounded math libraries for the float and double types (i.e., LibUltim [65, 146] and
LibMCR [101]) use the onion peeling approach (also know as Ziv’s strategy). It avoids
the table maker’s dilemma by iteratively approximating f (x) with progressively higher
precision until it can determine the correctly rounded result of f (x). Initially, the onion
peeling strategy chooses a target precision p1 . Then, it approximates f (x) with p1 bits of
precision to produce a result result y1 and its corresponding error bound 1 . The onion
peeling strategy typically uses approximation methods that can potentially compute f (x)
to arbitrary precision and produce the error bound at the same time (e.g., Taylor series
or the CORDIC method). Based on the result y1 and 1 , the real value of f (x) is within
[y1 − 1 , y1 + 1 ]. If both y1 − 1 and y1 + 1 rounds to the same value in the target
representation, then the rounded result of y1 is the correctly rounded result. Otherwise,
the current approximation cannot conclude what the correct result is. The onion peeling
strategy chooses a higher precision p2 and repeats the process.
MPFR [46], an arbitrary precision math library, implements elementary functions using
the same strategy. In the MPFR math library, the user can define the precision of the
target FP representation. Hence, the amount of precision required to compute the correctly
rounded result of f (x) for the target representation is truly unbounded, which makes the
onion peeling strategy especially suitable for MPFR.
199

The introduction of the onion peeling strategy changed the perception of math library
developers by showing that avoiding the table maker’s dilemma and generating a correctly
rounded math library is possible. However, the main drawback is its performance. Itera-
tively approximating f (x) with higher precision is expensive. Theoretically, the onion peel-
ing strategy may never terminate and keep increasing the working precision (i.e., pi ). Re-
gardless, we emphasize the impact of the onion peeling strategy, especially on our RL IBM
approach. Although the elementary functions that we create do not use the onion peeling
strategy directly, the RL IBM approach computes the correctly rounded results of f (x) us-
ing the MPFR library. This allows us to avoid the Table-maker’s dilemma and generate
efficient and correct polynomial approximations.

Resolving table maker’s dilemma for specific representations. In 2001, the table maker’s
dilemma for the double precision was resolved by identifying the highest amount of preci-
sion required to approximate f (x) and produce the correctly rounded results for all inputs
in the double precision [85]. This worst-case precision requirement was computed using a
filter-and-check strategy. The filtering step initially splits the input domain into extremely
small sub-domains where f (x) can be accurately approximated even with a 1st degree
polynomial approximation. Then, it approximates f (x) for all inputs in the subdomain
and checks whether the result is within a certain threshold of the rounding boundary. Any
subdomain where no f (x) is close to the rounding boundary is filtered. Next, the checking
step computed f (x) for all inputs in the remaining subdomains and identified the neces-
sary amount of precision to compute the correctly rounded result of f (x) in double. It was
found that roughly 160 precision bits were required, in the worst case, to produce correctly
rounded results for the double type. There is active research in identifying the precision
required to produce correctly rounded results for other representations [14, 125].

Math libraries using the Remez algorithm. Using the worst case precision requirement,
CR-LIBM [29, 86] implements several elementary functions for the double type with min-
200

imax polynomial approximations. The polynomials are generated using Sollya [21], which
provides a modified Remez algorithm. The original Remez algorithm produces minimax
polynomials P (x) with real number coefficients ci and guarantees error bound  when eval-
uated in real number. However, rounding these coefficients and evaluating the polynomial
in a finite precision representation may produce an error significantly larger than . Hence,
Sollya generates minimax polynomials with coefficients in a finite precision representation
H [15]. Abstractly, Sollya searches through all polynomials P̂ (x) with rational coefficients
cˆi around ci . All possible ranges of cˆi for 0 ≤ i ≤ n where n is the degree of P̂ (x)
form a rational polytope in Zn+1 . Then it uses a rational value search algorithm within
the polytype to identify candidate polynomials and calculates the infinity norm of the error
of the polynomial until it finds a P̂ (x) with a sufficiently small maximum error. Sollya
also computes the maximum numerical error of polynomial evaluation using interval arith-
metic [20, 22] and produces Gappa [99] proofs to ensure that the polynomial produces the
correctly rounded results. The polynomial searching algorithm has later been optimized
using the lattice basis reduction technique [11].
MetaLibm [16, 80] is an elementary function generator built using Sollya. The main
goal of MetaLibm is to generate efficient minimax polynomials with user-defined error
bounds. It automatically creates range reduction strategies, performs hardware specific op-
timizations, and generates piecewise polynomials to approximate f (x) [79]. MetaLibm
uses another polynomial approximation to select the appropriate polynomial in the piece-
wise polynomials for a given input [81]. This allows the elementary functions to be paral-
lelized with vector instructions. Other than Sollya, another modified Remez algorithm has
been proposed to generate a minimax polynomial that also accounts for numerical errors
when evaluated in a finite precision representation [4]. It has been shown to produce cor-
rectly rounded results for small input domains where the range reduction is not necessary.
201

SoftPosit-Math using minefield method. While identifying the worst case precision re-
quirement resolves the table maker’s dilemma, approximating the real value of f (x) does
not provide the maximum amount of freedom to generate efficient and correctly rounded
polynomial approximation. Instead, the Minefield method [57] identifies the largest interval
of values that polynomial approximations should produce to compute the correctly rounded
results. It declares all other regions as minefields. Then, it generates a polynomial that
avoids all mines. SoftPosit-Math [88] uses the Minefield method to create several correctly
rounded posit16 elementary functions. Our RL IBM approach is inspired by the Minefield
method. The rounding intervals in the RL IBM approach can be considered as the safe paths
in the Minefield method. The RL IBM approach generalizes the Minefield method for nu-
merous representations while considering range reduction and output compensation. We
also formalize an automatic process to encode the safe paths as linear constraints and use
an LP solver to generate the polynomial. More importantly, all prior approaches including
the minefield method generate elementary functions that target a particular representation
and rounding mode. The RL IBM approach can generate a single polynomial approximation
that handles multiple representations and rounding modes at the same time.

Custom hardware function evaluation. One of the most important aspects of custom
hardware design for approximating elementary functions is the hardware cost to imple-
ment the function. One way to reduce the hardware cost is to use fixed point represen-
tations rather than floating point representations. Fixed point representations directly en-
code binary fraction values in an n-bit representation. Essentially, it is equivalent to an
integer type with an implicit radix point. Hence, all fixed point arithmetic can be per-
formed using integer arithmetic. Another strategy to reduce the hardware cost, especially
when using piecewise polynomial approximations, is to reduce the bit-length of the coef-
ficients. Smaller bit-length coefficients will require smaller lookup tables and arithmetic
units. Hence, a technique [36] was proposed to generate piecewise polynomials with the
202

smallest bit-length fixed point coefficients that still produce the correctly rounded results
for all inputs. This technique generates an integer linear programming (ILP) formulation
that specifies constraints on the outputs of the polynomial they wish to generate, simi-
lar to the RL IBM approach. Additionally, they encode the objective function in the ILP
formulation to minimize the bit-length of the coefficients and use ILP solver to generate
the polynomial. This technique is effective in generating efficient piecewise polynomial
for fixed point representations where the dynamic range is limited. In comparison, float-
ing point representations have significantly larger dynamic range and the RL IBM approach
presents a technique to produce correctly rounded results when evaluated with range re-
duction strategy. We hope to combine it with the RL IBM approach to incorporate range
reduction techniques and our odd interval techniques to optimize our elementary functions.

8.3 Range Reduction Strategies

To generate efficient implementations of elementary functions, math library designers use


polynomial approximations and range reduction strategies together. The simplest form of
range reduction strategies directly use mathematical identities of elementary functions. For
example, the natural log function ln(x) has the mathematical identify ln(a × bc ) = ln(a) +
cln(b). If the original input x is decomposed to x = x0 × 2m where x0 ∈ [1, 2) and m ∈ Z
is the exponent of x, then ln(x) can be computed using ln(x0 ) + mln(2). This strategy
allows us to only approximate ln(x0 ) for a small input domain [1, 2). There are also range
reduction strategies designed for individual elementary functions [25] that transforms the
function we must approximate (i.e., ln(x)) into a different function. In particular for ln(x),
x0 −1
the reduced input x0 can be transformed into a different value t = x0 +1
such that ln(x0 )
can be approximated with ln( 1−t
1+t
). This transformed function can be approximated with an
odd polynomial, effectively reducing the cost of evaluating the polynomial to half. More
detail on this specific range reduction strategy is provided in Chapter 4. Unfortunately,
these strategies must be developed separately for each elementary function and does not
203

guarantee that a similar approach can be found for other elementary functions.

Tabled-Based Range Reduction Strategy. Instead, all mainstream math libraries and the
RL IBM approach adopt a more general strategy known as the table-based range reduction
(also known as the Tang’s method) [133, 134, 135]. This strategy significantly reduces
the input domain which results in a low degree polynomial and uses a look-up tables to
accurately and efficiently evaluate the output compensation function. The insight is to
decompose the original input x into two components: a continuous variable and a discrete
variable. For instance, consider the function ln(x) for the input space x ∈ [1, 2). The input
x can be decomposed into x = u + v where u ∈ { 64 , 65 , . . . , 127
64 64 64
} is a discrete variable and
v is a continuous variable in [0, 64
1
). This decomposition effectively splits the original input
domain [1, 2) into 64 equal sub-divisions (i.e. [1, 65
64
), [ 65 , 66 ), . . . ) where u represents the
64 64

index of each sub-division. Then, ln(x) can be computed with,

v
ln(x) = ln(u + v) = ln(u(1 + v)) = ln(u) + ln(1 + )
u

Because u is a discrete variable (i.e. exactly 64 possible values), we can pre-compute


ln(u) with high precision and store the results into a look-up table. If we denote v
u
as
x0 , approximating ln(x) only requires a polynomial approximation of ln(1 + x0 ) for the
reduced input domain x0 ∈ [0, 64
1
). The output compensation function requires a single
table lookup and an addition of the value ln(u).
There are three distinct advantages of the table-based range reduction that make it at-
tractive. First, the significant reduction in the input domain allows us to generate a lower
degree polynomial, beneficial for performance. The output compensation function can also
be evaluated efficiently using the lookup table. Second, the table-based range reduction is
flexible. It is straightforward to configure the amount of domain reduction by adjusting the
input decomposition. The original range reduction for ln(x) [134] recommended to split
the original input domain into 64 sub-divisions (i.e., reduced domain of [0, 64
1
)). Our proto-
types split the domain into 128 sub-divisions to reduce the input domain into [0, 128
1
). Using
204

table-based method, the size of the reduced input domain (i.e., [0, 21i ]) can be configured
by the number of the sub-domains (i.e., 2i sub-domains). The table-based method can also
be nested multiple times to further reduce the input domain [29, 73]. Finally, table-based
range reduction can be applied to various elementary functions. The only requirement to
apply this strategy is for the elementary function to have an identify f (a ◦ b) = f (a)  f (b)
where ◦ and  are primitive operations (i.e., ln(a × b) = ln(a) + ln(b)). There are range
reduction strategies for logarithm functions [134], exponential functions [133], trigonomet-
ric functions [135], and hyperbolic functions [29]. All elementary functions in RL IBM -32
and RL IBM -A LL use table-based range reduction for efficient implementations (additional
details can be found in Chapter 4).

Accurate Additive Range Reduction Strategies. Additive range reductions can experi-
ence significant amount of cancellation error when evaluated in a finite precision represen-
tation H. As described in Chapter 2, additive range reduction reduces an input x into the
reduced input x0 ∈ (−C, C) by computing x0 = x − kC for a constant value C and an
integer k (i.e., x0 ≡ x mod C). When the magnitude of x is similar to C (i.e., k is a small
value), x − kC does not experience significant error. However, when x is considerably
larger than C and the value of x and kC are similar, then x − kC can experience catas-
trophic cancellation. Trigonometric functions use additive range reductions (i.e., C = π2 )
to exploit trigonometric identities (i.e., sin(a + 2π) = sin(a)) and experience such errors
when the input x is large. Hence, the very early implementations of trigonometric functions
only supported small inputs (i.e., x ∈ [− π2 , π2 ]) delegating the job of range reduction to the
user [6]. Some implementations use their own definition of π [105], usually a value ob-
tained by rounding the real value of π to a finite-representation, or use an arbitrary precision
representation to perform range reduction for trigonometric functions [109].
Cody and Waite reduction method [24, 25] provides a strategy to accurately perform
additive range reduction for relatively small inputs (i.e., x < 100). It stores the value of
205

C with two values C1 and C2 in H. These values are chosen in a particular way such that
C1 is close to C and the sum of C1 and C2 approximates C (i.e., C ≈ C1 + C2 ). Then the
range reduction is evaluated with x0 = (x − kC1 ) − kC2 . If the value of kC1 is exactly
representable in H, then z = x − kC1 can be exactly computed by Sterbenz Lemma [126]
and z − kC2 will not experience significant amount of numerical error.
The Payne and Hanek reduction [113] is an effective strategy for extremely large inputs
(i.e., x > 264 ). The insight is that when the value of an input x is close to kC, the additive
range reduction (i.e., x − kC) will cancel the first p significant bits of x and kC (which
is the cause of cancellation error). If it is possible to identify the number of bits that will
be canceled (i.e., p), then the subtraction can be optimized by not considering the first p
bits of both x and kC. This also means the first p bits of kC do not have to be computed.
The Payne and Hanek reduction provides a systematic algorithm to approximate p and per-
form x − kC accurately. Due to the overhead of identifying p, Payne and Hanek reduction
method is mostly used for large inputs where it is common to experience catastrophic can-
cellation unless the additive range reduction is performed with an extremely high precision
representation.
For medium sized inputs (i.e., 23 < x < 264 ), the modular range reduction [13, 33, 87]
can accurately compute additive range reduction. This strategy uses mathematical identities
of the modulus operation,

(x + y) mod C = ((x mod C) + (y mod C)) mod C

The modular range reduction breaks down an input with n-bits of precision x = m × 2k ,
Pn−1
where m is the significand and k is the exponent of x, into x = i=0 xi 2
k−i
. Here,
xi ∈ {0, 1} is the ith most significant precision bit in m. Abstractly, xi 2k−i is the value
represented by the xi bit. Next, it computes the value,
n−1
X
r= xi (2k−1 mod C)
i=0

By the properties of modulus operation, r mod C is equivalent to x mod C. Since r


206

is a sum of at most n values smaller than C, it is guaranteed that r < nC where r is a


value much smaller than the original input x. Then, the value r is reduced to x0 = r − kC
(where 0 ≤ k < n is an integer) to produce the reduced input x0 ∈ (−C, C). All possible
values of xi 2k−i mod C and kC are stored in a lookup table. Because the magnitude of x
as well as the possible values of k are limited (i.e., x < 264 and k < n), the size of the
lookup table is small. For example, computing x mod π
2
(i.e., C = π2 ) for an input between
x ∈ [0, 264 ] would require roughly 64 values for xi 2k−i mod C and 53 possible values for
kC. The modular range reduction has been optimized where x is decomposed into groups
of bits (rather than individual bits) to reduce the number of accumulation operations [13].
The on-the-fly range reduction strategy [87] integrates the x0 = r − kC computation into
the accumulation of each term (xi 2k−1 mod C) to efficiently compute the final result. The
trigonometric functions in CR-LIBM use all three additive range reduction strategies to
produce correctly rounded double results.

8.4 Verification of Math Libraries

As performance and correctness are both important with math libraries, there is extensive
research to formally bound the error of elementary functions. HOL Light [63] is a theo-
rem prover primarily used for verifying FP operations. It formalizes the semantics of FP
arithmetic [62] and proves the bound of the error of several implementations of elemen-
tary functions [59, 60]. It has also been used to verify the correctness of the floating point
arithmetic operations in Intel Itanium chipset [61]. Similarly, CoQ proof assistant has been
used to prove the correctness of implementations of Cody and Waite’s reduction method
with fused-multiply-add operations [8]. These verification approaches require a
significant amount of manual work. Hence, researchers have developed an automatic veri-
fication technique and verified that many elementary functions in Intel’s math.h have at
most 1 ulp error [82, 84].
Sollya verifies that the elementary functions that it generates produce correctly rounded
207

results with the help of Gappa [35, 38, 39]. Sollya translates the implementations of the
elementary functions to Gappa lemmas. Then, Gappa uses integer arithmetic to bound the
error of each operation with some hints from the math library developers. Finally, Sollya
ensures that the final result of the elementary function will round to the correctly rounded
result based on the error bound. It has been used to verify the elementary functions in
CR-LIBM.
Instead of using formal proofs, RL IBM validates that the generated polynomial pro-
duces the correctly rounded results by evaluating the polynomial for each input. This
manual process is possible because our target representations (i.e., 32-bit float) have a rea-
sonable number of inputs (i.e., 232 inputs). To extend RL IBM for the double type where
iterating through all inputs is infeasible, we will likely have to rely on prior verification
efforts to formally prove the correctness of the generated polynomials.

8.5 Math Library Repair Tools

An alternative for creating correctly rounded elementary functions is to repair existing math
libraries that do not produce correctly rounded results. If the numerical error in evaluating
the implementation is the cause of an incorrect result, we can use tools that detect numerical
errors to diagnose the root cause of the error [7, 23, 47, 54, 118, 145, 147]. Subsequently,
we can rewrite parts of the implementation that causes these errors using rewriting tools
such as Herbie [111] or Salsa [28]. If the cause of the incorrect result stems from the
approximation error, then we can use a math library repair tool [145]. This tool identifies
small domains of inputs that result in high error by performing binary searches in the entire
input domain. Then, it uses piecewise polynomials containing linear or quadratic equations
to repair the elementary function for these domains.
Unfortunately, these tools are primarily designed to reduce errors rather than com-
pletely removing them. Hence, math libraries repaired with these tools are not likely to
produce correctly rounded results for all inputs. Additionally, the repaired code may intro-
208

duce overhead, negatively impacting the performance. Hence, it is ideal to use tools like
RL IBM or Sollya and create correctly rounded elementary functions.
209

CHAPTER 9
CONCLUSION AND FUTURE DIRECTIONS

This dissertation presents the RL IBM approach for generating polynomial approximations
that produce correctly rounded results of elementary functions f (x) for all inputs. The
RL IBM approach makes a case for approximating the correctly rounded result of f (x)
rather than the real value of f (x) itself. We first summarize the contributions in this disser-
tation and then present directions for future work.

9.1 Dissertation Summary

As elementary functions are essential components in scientific applications, there have


been seminal research over decades to approximate elementary functions accurately and
efficiently. However, efficiently computing the correctly rounded results of elementary
functions remains a challenging problem. In this dissertation, we propose the RL IBM ap-
proach to generate polynomial approximations that produce correctly rounded results for all
inputs. While prior approaches try to approximate the real value of the elementary function,
RL IBM advocates for approximating the correctly rounded result. The RL IBM approach
identifies the maximum amount of freedom available to generate the correctly rounded re-
sults and uses this freedom to generate efficient polynomial approximations using linear
programming.
To generate polynomial approximations that produce the correctly rounded results for
32-bit representations, RL IBM uses the counterexample guided polynomial generation tech-
nique inspired by program synthesis. This technique samples a small fraction of inputs to
generate a polynomial that produces the correctly rounded results for all 32-bit inputs. Ad-
ditionally, RL IBM generates piecewise polynomials with low degrees by dividing the input
domain into sub-domains using the bit-pattern of the inputs. These two techniques allow
210

RL IBM to generate efficient implementations of elementary functions.


Finally, RL IBM presents a novel approach to generate a single polynomial approxima-
tion that produces the correctly rounded results for multiple representations and rounding
modes. When the goal is to generate correctly rounded results for Tk representations where
k ≤ n, the RL IBM approach generates a polynomial approximation that produces the cor-
rectly rounded result for Tn+2 with the rno mode. RL IBM addresses the issue of inputs
with singleton intervals using mathematical identities of elementary functions to efficiently
identify these inputs and compute the exact result of f (x). Based on our formal proofs, the
resulting polynomial is guaranteed to produce correctly rounded results for all inputs in all
Tk representations with any standard rounding modes.
The resulting elementary functions generated with the RL IBM approach not only pro-
duce correctly rounded results for various representations and rounding modes, but are
faster than mainstream math libraries. The target representations of our elementary func-
tions include, but are not limited to, bfloat16, tensrofloat32, float, posit16, and posit32.
In particular, our RL IBM -A LL prototype provides the first implementations of elementary
functions that produce the correctly rounded results for hundreds of representations with
all standard rounding modes. We hope that this dissertation motivates existing and new
representations to mandate correctly rounded elementary functions.

9.2 Future Directions

The RL IBM approach is our initial effort in creating the correctly rounded math libraries.
There are multiple directions to improve it. We present three potential venues that can be
explored.

Generating correctly rounded approximations for all elementary functions. The IEEE-
754 standard defines 36 commonly used elementary functions and recommends math li-
braries to provide correctly rounded implementations for them. Currently, our prototypes
211

provide implementations for ten of these functions. We intend to generate correctly rounded
approximations for all 36 elementary functions. We believe the RL IBM approach can gen-
erate polynomial approximations for univariate elementary functions (e.g., sin(x)). How-
ever, it may require us to develop novel range reductions to create efficient implemen-
tations. For example, it may be necessary to revisit range reductions for trigonometric
functions such as sin(x) and cos(x) to ensure that we compute the reduced inputs accu-
rately. While performing range reduction in higher precision will accomplish this task, the
performance of high precision arithmetic operations are not ideal.
Some elementary functions recommended by the IEEE-754 are multivariate functions
(i.e., xy and atan( xy )). The RL IBM approach does not handle automatically generating
polynomial approximations for these functions. As there are multiple inputs in these el-
ementary functions, the polynomial approximation must be a multivariate polynomial as
well. The challenge in using the RL IBM to generate multivariate polynomials is in framing
the input-output constraints as linear constraints.

Generating correctly rounded polynomial approximations for 64-bit representations. An-


other possible avenue to improve the RL IBM approach is in its support for generating cor-
rectly rounded polynomial approximations for the double type. The double type is the
standard 64-bit FP representation. It has more than twice the precision of the 32-bit float
type (i.e., 53 bits for double compared to 24 bits for float). Many scientific applications re-
quiring high precision perform internal computations with double. Hence, an efficient and
correctly rounded math library for double is essential. The RL IBM approach can generate
polynomials that produce the correctly rounded results for a small sample of inputs in dou-
ble. However, the challenge lies in verifying that this polynomial produces the correctly
rounded results for all inputs in double. Iterating through all 264 inputs each time we gener-
ate a polynomial to verify its correctness is not feasible. We will most likely have to extend
the RL IBM approach with formal verification techniques to check for the correctness of the
212

polynomial approximation.

Performance optimizations with integer arithmetics. Both the performance and the cor-
rectness of math libraries are important. Hence, we plan to focus on improving the perfor-
mance of our elementary functions while making sure that they produce correctly rounded
results. One possible approach to optimize our implementations is to use integer opera-
tions rather than floating point operations. All of our elementary functions perform inter-
nal computations with double. While common architectures provide hardware support for
arithmetic operations with double, it is still slower than integer operations. Hence, imple-
menting elementary functions with integer operations instead of floating point operations
will increase the performance. The challenge in creating elementary functions with integer
operations is to ensure that the resulting values are correctly rounded results. We plan to
extend the RL IBM approach to generate polynomials with integer coefficients that produce
correctly rounded results for all inputs when evaluated with integers. Another direction that
we will explore is to develop new range reduction techniques that do not require floating
point operations.
213

BIBLIOGRAPHY

[1] Martin Aigner and Gnter M. Ziegler. 2009. Proofs from THE BOOK (4th ed.).
Springer Publishing Company, Incorporated.

[2] Tom M Apostol. 1974. Mathematical analysis; 2nd ed. Addison-Wesley, Reading,
MA.

[3] Inc. Apple Computer. 2002. sqrt.c. https://opensource.apple.com/source/Libm/


Libm-47.1/ppc.subproj/sqrt.c.auto.html

[4] Denis Arzelier, Florent Bréhard, and Mioara Joldes. 2019. Exchange Algorithm
for Evaluation and Approximation Error-Optimized Polynomials. In 2019 IEEE
26th Symposium on Computer Arithmetic (ARITH). 30–37. https://doi.org/10.1109/
ARITH.2019.00014

[5] Alan Baker. 1975. Transcendental Number Theory. Cambridge University Press.

[6] Nelson H. F. Beebe. 2017. The Mathematical-Function Computation Handbook:


Programming Using the MathCW Portable Software Library (1st ed.). Springer
Publishing Company, Incorporated.

[7] Florian Benz, Andreas Hildebrandt, and Sebastian Hack. 2012. A Dynamic Pro-
gram Analysis to Find Floating-point Accuracy Problems. In Proceedings of the
33rd ACM SIGPLAN Conference on Programming Language Design and Imple-
mentation (Beijing, China) (PLDI ’12). ACM, New York, NY, USA, 453–462.
https://doi.org/10.1145/2345156.2254118

[8] Sylvie Boldo, Marc Daumas, and Ren-Cang Li. 2009. Formally Verified Argument
Reduction with a Fused Multiply-Add. In IEEE Transactions on Computers, Vol. 58.
1139–1145. https://doi.org/10.1109/TC.2008.216

[9] Sylvie Boldo and Guillaume Melquiond. 2005. When double rounding is odd. In
17th IMACS World Congress. Paris, France, 11.

[10] Peter Borwein and Tamas Erdelyi. 1995. Polynomials and Polynomial Inequalities.
Springer New York. https://doi.org/10.1007/978-1-4612-0793-1

[11] Nicolas Brisebarre and Sylvvain Chevillard. 2007. Efficient polynomial L-


approximations. In 18th IEEE Symposium on Computer Arithmetic (ARITH ’07).
https://doi.org/10.1109/ARITH.2007.17

[12] N. Brisebarre, D. Defour, P. Kornerup, J.-M. Muller, and N. Revol. 2005. A new
range-reduction algorithm. IEEE Trans. Comput. 54, 3 (2005), 331–339. https:
//doi.org/10.1109/TC.2005.36
214

[13] Nicolas Brisebarre, David Defour, Peter Kornerup, Jean-Michel Muller, and
Nathalie Revol. 2005. A new range-reduction algorithm. IEEE Trans. Comput.
54, 3 (2005), 331–339. https://doi.org/10.1109/TC.2005.36

[14] Nicolas Brisebarre and Guillaume Hanrot. 2021. Integer points close to a transcen-
dental curve and correctly-rounded evaluation of a function. (May 2021). https:
//hal.archives-ouvertes.fr/hal-03240179 working paper or preprint.

[15] Nicolas Brisebarre, Jean-Michel Muller, and Arnaud Tisserand. 2006. Comput-
ing Machine-Efficient Polynomial Approximations. In ACM ACM Transactions on
Mathematical Software, Vol. 32. Association for Computing Machinery, New York,
NY, USA, 236–256. https://doi.org/10.1145/1141885.1141890

[16] Nicolas Brunie, Florent de Dinechin, Olga Kupriianova, and Christoph Lauter. 2015.
Code Generators for Mathematical Functions. In 2015 IEEE 22nd Symposium on
Computer Arithmetic. 66–73. https://doi.org/10.1109/ARITH.2015.22

[17] Hung Tien Bui and Sofiene Tahar. 1999. Design and synthesis of an IEEE-754
exponential function. In Engineering Solutions for the Next Millennium. 1999 IEEE
Canadian Conference on Electrical and Computer Engineering, Vol. 1. 450–455
vol.1. https://doi.org/10.1109/CCECE.1999.807240

[18] Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L.
Gustafson, and Dhireesha Kudithipudi. 2019. Deep Positron: A Deep Neural Net-
work Using the Posit Number System. In 2019 Design, Automation Test in Europe
Conference Exhibition (DATE). 1421–1426.

[19] Cheng-Shing Wu, An-Yeu Wu, and Chih-Hsiu Lin. 2003. A high-performance/low-
latency vector rotational CORDIC architecture based on extended elementary angle
set and trellis-based searching schemes. IEEE Transactions on Circuits and Systems
II: Analog and Digital Signal Processing (2003).

[20] Sylvain Chevillard, John Harrison, Mioara Joldes, and Christoph Lauter. 2011. Ef-
ficient and accurate computation of upper bounds of approximation errors. In Theo-
retical Computer Science, Vol. 412.

[21] Sylvain Chevillard, Mioara Joldes, and Christoph Lauter. 2010. Sollya: An En-
vironment for the Development of Numerical Codes. In Mathematical Software -
ICMS 2010 (Lecture Notes in Computer Science, Vol. 6327). Springer, Heidelberg,
Germany, 28–31. https://doi.org/10.1007/978-3-642-15582-6 5

[22] Sylvain Chevillard and Christopher Lauter. 2007. A Certified Infinite Norm for
the Implementation of Elementary Functions. In Seventh International Conference
on Quality Software (QSIC 2007). 153–160. https://doi.org/10.1109/QSIC.2007.
4385491

[23] Sangeeta Chowdhary, Jay P. Lim, and Santosh Nagarakatte. 2020. Debugging and
Detecting Numerical Errors in Computation with Posits. In 41st ACM SIGPLAN
215

Conference on Programming Language Design and Implementation (PLDI’20).


https://doi.org/10.1145/3385412.3386004

[24] William J Cody. 1982. Implementation and testing of function software. In Prob-
lems and Methodologies in Mathematical Software Production, Paul C. Messina and
Almerico Murli (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 24–47.

[25] William J. Cody and William M. Waite. 1980. Software manual for the elementary
functions. Prentice-Hall, Englewood Cliffs, NJ.

[26] P. M. (Paul Moritz) Cohn. 1974. Algebra [by] P. M. Cohn. Wiley, London.

[27] Mike Cowlishaw. 2008. IEEE Standard for Floating-Point Arithmetic. IEEE 754-
2008. IEEE Computer Society. 1–70 pages. https://doi.org/10.1109/IEEESTD.
2008.4610935

[28] Nasrine Damouche and Matthieu Martel. 2018. Salsa: An Automatic Tool to Im-
prove the Numerical Accuracy of Programs. In Automated Formal Methods (Kalpa
Publications in Computing, Vol. 5), Natarajan Shankar and Bruno Dutertre (Eds.).
63–76. https://doi.org/10.29007/j2fd

[29] Catherine Daramy, David Defour, Florent Dinechin, and Jean-Michel Muller. 2003.
CR-LIBM: A correctly rounded elementary function library. In Proceedings of SPIE
Vol. 5205: Advanced Signal Processing Algorithms, Architectures, and Implementa-
tions XIII, Vol. 5205. https://doi.org/10.1117/12.505591

[30] Catherine Daramy-Loirat, David Defour, Florent de Dinechin, Matthieu Gallet,


Nicolas Gast, Christoph Lauter, and Jean-Michel Muller. 2006. CR-LIBM A library
of correctly rounded elementary functions in double-precision. Research Report.
LIP,. https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804

[31] Debjit Das Sarma and David W. Matula. 1995. Faithful bipartite ROM reciprocal
tables. In Proceedings of the 12th Symposium on Computer Arithmetic. 17–28. https:
//doi.org/10.1109/ARITH.1995.465381

[32] Marc Daumas, Christophe Mazenc, Xavier Merrheim, and Jean michel Muller. 1995.
Modular range reduction: A new algorithm for fast and accurate computation of the
elementary functions. Journal of Universal Computer Science (1995), 175.

[33] Marc Daumas, Christophe Mazenc, Xavier Merrheim, and Jean-Michel Muller.
1995. Modular Range Reduction: a New Algorithm for Fast and Accurate Com-
putation of the Elementary Functions. (1995), 162–175. https://doi.org/10.1007/
978-3-642-80350-5 15

[34] Marc Daumas and Guillaume Melquiond. 2010. Certification of Bounds on Expres-
sions Involving Rounded Operators. ACM Transactions on Mathematical Softwware
37, 1, Article 2 (Jan. 2010), 20 pages. https://doi.org/10.1145/1644001.1644003
216

[35] Marc Daumas, Guillaume Melquiond, and Cesar Munoz. 2005. Guaranteed
proofs using interval arithmetic. In 17th IEEE Symposium on Computer Arithmetic
(ARITH’05). 188–195. https://doi.org/10.1109/ARITH.2005.25

[36] Davide De Caro, Ettore Napoli, Darjn Esposito, Gerardo Castellano, Nicola Pe-
tra, and Antonio G. M. Strollo. 2017. Minimizing Coefficients Wordlength for
Piecewise-Polynomial Hardware Function Evaluation With Exact or Faithful Round-
ing. IEEE Transactions on Circuits and Systems I: Regular Papers (2017). https:
//doi.org/10.1109/TCSI.2016.2629850

[37] Florent de Dinechin, Luc Forget, Jean-Michel Muller, and Yohann Uguen. 2019.
Posits: the good, the bad and the ugly. In CoNGA’19 - Conference for Next Genera-
tion Arithmetic. ACM Press, Singapore, Singapore, 1–10.

[38] Florent de Dinechin, Christopher Lauter, and Guillaume Melquiond. 2011. Certify-
ing the Floating-Point Implementation of an Elementary Function Using Gappa. In
IEEE Transactions on Computers, Vol. 60. 242–253. https://doi.org/10.1109/TC.
2010.128

[39] Florent de Dinechin, Christoph Quirin Lauter, and Guillaume Melquiond. 2006.
Assisted Verification of Elementary Functions Using Gappa. In Proceedings of
the 2006 ACM Symposium on Applied Computing (Dijon, France) (SAC ’06). As-
sociation for Computing Machinery, New York, NY, USA, 1318–1322. https:
//doi.org/10.1145/1141277.1141584

[40] Florent de Dinechin and Arnaud Tisserand. 2005. Multipartite table methods. IEEE
Trans. Comput. 54, 3 (2005), 319–330. https://doi.org/10.1109/TC.2005.54

[41] Florent de Dinechin and Arnaud Tisserand. 2005. Multipartite table methods. IEEE
Trans. Comput. 54, 3 (2005), 319–330. https://doi.org/10.1109/TC.2005.54

[42] Hugues de Lassus Saint-Geniès, David Defour, and Guillaume Revy. 2015. Range
reduction based on Pythagorean triples for trigonometric function evaluation. In
2015 IEEE 26th International Conference on Application-specific Systems, Architec-
tures and Processors (ASAP). 74–81. https://doi.org/10.1109/ASAP.2015.7245712

[43] Ed Deprettere, Patrick Dewilde, and R. Udo. 1984. Pipelined cordic architectures
for fast VLSI filtering and array processing. In ICASSP ’84. IEEE International
Conference on Acoustics, Speech, and Signal Processing.

[44] Albert D. Edgar and Samuel C. Lee. 1977. The Focus Number System. IEEE Trans.
Comput. C-26, 11 (1977), 1167–1170. https://doi.org/10.1109/TC.1977.1674770

[45] Seyed H. Fatemi Langroudi, Tej Pandit, and Dhireesha Kudithipudi. 2018. Deep
Learning Inference on Embedded Devices: Fixed-Point vs Posit. In 2018 1st Work-
shop on Energy Efficient Machine Learning and Cognitive Computing for Embedded
Applications (EMC2). 19–23.
217

[46] Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and Paul
Zimmermann. 2007. MPFR: A Multiple-precision Binary Floating-point Library
with Correct Rounding. ACM Trans. Math. Software 33, 2, Article 13 (June 2007).
https://doi.org/10.1145/1236463.1236468

[47] Zhoulai Fu and Zhendong Su. 2019. Effective Floating-point Analysis via Weak-
distance Minimization. In Proceedings of the 40th ACM SIGPLAN Conference on
Programming Language Design and Implementation (Phoenix, AZ, USA) (PLDI
2019). ACM, New York, NY, USA, 439–452. https://doi.org/10.1145/3314221.
3314632

[48] Shmuel Gal. 1986. Computing elementary functions: A new approach for achiev-
ing high accuracy and good performance. In Accurate Scientific Computations,
Willard L. Miranker and Richard A. Toupin (Eds.). Springer Berlin Heidelberg,
Berlin, Heidelberg, 1–16.

[49] Bimal Gisuthan and Thambipillai Srikanthan. 2002. Pipelining flat CORDIC based
trigonometric function generators. Microelectronics Journal (2002).

[50] Ambros M. Gleixner, Daniel E. Steffy, and Kati Wolter. 2012. Improving the Ac-
curacy of Linear Programming Solvers with Iterative Refinement. In Proceedings of
the 37th International Symposium on Symbolic and Algebraic Computation (Greno-
ble, France) (ISSAC ’12). Association for Computing Machinery, New York, NY,
USA, 187–194. https://doi.org/10.1145/2442829.2442858

[51] GNU. 2019. The GNU C Library (glibc). https://www.gnu.org/software/libc/

[52] David Goldberg. 1991. What Every Computer Scientist Should Know About
Floating-point Arithmetic. In ACM Computing Surveys, Vol. 23. ACM, New York,
NY, USA, 5–48.

[53] Robert Goldschmidt. 2005. Applications of division by convergence. (08 2005).

[54] Eric Goubault. 2001. Static Analyses of the Precision of Floating-Point Opera-
tions. In Proceedings of the 8th International Symposium on Static Analysis (SAS).
Springer, 234–259. https://doi.org/10.1007/3-540-47764-0 14

[55] John Gustafson. 2017. Posit Arithmetic. https://posithub.org/docs/Posits4.pdf

[56] John Gustafson. 2020. The Minefield Method: A Uniformly Fast Solution to the
Table-Maker’s Dilemma. https://bit.ly/2ZP4kHj

[57] John Gustafson. 2020. The Minefield Method: A Uniformly Fast Solution to the
Table-Maker’s Dilemma. https://bit.ly/2ZP4kHj

[58] John Gustafson and Isaac Yonemoto. 2017. Beating Floating Point at Its Own Game:
Posit Arithmetic. Supercomputing Frontiers and Innovations: an International Jour-
nal 4, 2 (June 2017), 71–86.
218

[59] John Harrison. 1997. Floating point verification in HOL light: The exponential func-
tion. In Algebraic Methodology and Software Technology, Michael Johnson (Ed.).
Springer Berlin Heidelberg, Berlin, Heidelberg, 246–260. https://doi.org/10.1007/
BFb0000475

[60] John Harrison. 1997. Verifying the Accuracy of Polynomial Approximations in


HOL. In International Conference on Theorem Proving in Higher Order Logics.
https://doi.org/10.1007/BFb0028391

[61] John Harrison. 2006. Floating-Point Verification using Theorem Proving. , 211–
242 pages.

[62] John Harrison. 2009. HOL Light: An Overview. In Proceedings of the 22nd Inter-
national Conference on Theorem Proving in Higher Order Logics, TPHOLs 2009
(Lecture Notes in Computer Science, Vol. 5674), Stefan Berghofer, Tobias Nipkow,
Christian Urban, and Makarius Wenzel (Eds.). Springer-Verlag, Munich, Germany,
60–66. https://doi.org/10.1007/978-3-642-03359-9 4

[63] John Harrison. 2017. The HOL Light theorem prover. https://www.cl.cam.ac.uk/
∼jrh13/hol-light/

[64] Yu Hen Hu and Homer H. M. Chern. 1996. A Novel Implementation of CORDIC


Algorithm Using Backward Angle Recoding (BAR). IEEE Trans. Comput. (1996).

[65] IBM. 2008. Accurate Portable MathLib. http://oss.software.ibm.com/mathlib/

[66] Intel. 2020. Code Sample: Intel R Deep Learning Boost New Deep Learning In-
struction bfloat16 - Intrinsic Functions. https://software.intel.com/content/www/
us/en/develop/articles/intel-deep-learning-boost-new-instruction-bfloat16.html

[67] Intel. 2020. Overview: Intel R Math Library. https://software.intel.com/content/


www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/
top/optimization-and-programming-guide/intel-math-library/
overview-intel-math-library.html

[68] ECMA International. 2020. ECMA-262, 11th edition, June 2020 ECMAScript 2020
Language Specification. https://262.ecma-international.org/11.0/

[69] Manish Kumar Jaiswal. 2017. Universal number Posit HDL Arithmetic Architecture
generator. https://github.com/manish-kj/Posit-HDL-Arithmetic

[70] Manish Kumar Jaiswal and Hayden K.-H So. 2018. Universal number posit arith-
metic generator on FPGA. In 2018 Design, Automation Test in Europe Conference
Exhibition (DATE). 1159–1162.

[71] Claude-Pierre Jeannerod, Hervé Knochel, Christophe Monat, and Guillaume Revy.
2011. Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation.
IEEE Trans. Comput. 60. https://doi.org/10.1109/TC.2010.152
219

[72] Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. 2010. Oracle-
guided component-based program synthesis. In 2010 ACM/IEEE 32nd International
Conference on Software Engineering, Vol. 1. 215–224. https://doi.org/10.1145/
1806799.1806833

[73] Fredrik Johansson. 2014. Efficient implementation of elementary functions in the


medium-precision range. CoRR abs/1410.7176 (2014). arXiv:1410.7176 http://
arxiv.org/abs/1410.7176

[74] Jeff Johnson. 2018. Rethinking floating point for deep learning. http://export.arxiv.
org/abs/1811.01721

[75] William Kahan. 2004. A Logarithm Too Clever by Half. https://people.eecs.


berkeley.edu/∼wkahan/LOG10HAF.TXT

[76] Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Ku-
nal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka,
Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke,
Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy,
Bharat Kaul, and Pradeep Dubey. 2019. A Study of BFLOAT16 for Deep Learning
Training. arXiv:1905.12322 [cs.LG]

[77] Milan Klöwer, Peter D. Düben, and Tim N. Palmer. 2019. Posits As an Alternative to
Floats for Weather and Climate Models. In Proceedings of the Conference for Next
Generation Arithmetic 2019 (Singapore, Singapore) (CoNGA’19). ACM, New York,
NY, USA, Article 2, 8 pages.

[78] Urs Köster, Tristan J. Webb, Xin Wang, Marcel Nassar, Arjun K. Bansal, William H.
Constable, Oğuz H. Elibol, Scott Gray, Stewart Hall, Luke Hornof, Amir Khosrow-
shahi, Carey Kloss, Ruby J. Pai, and Naveen Rao. 2017. Flexpoint: An Adaptive Nu-
merical Format for Efficient Training of Deep Neural Networks. In Proceedings of
the 31st International Conference on Neural Information Processing Systems (Long
Beach, California, USA) (NIPS’17). 1740–1750.

[79] Olga Kupriianova and Christoph Lauter. 2014. A domain splitting algorithm for the
mathematical functions code generator. In 2014 48th Asilomar Conference on Sig-
nals, Systems and Computers. 1271–1275. https://doi.org/10.1109/ACSSC.2014.
7094664

[80] Olga Kupriianova and Christoph Lauter. 2014. Metalibm: A Mathematical Func-
tions Code Generator. In 4th International Congress on Mathematical Software.
https://doi.org/10.1007/978-3-662-44199-2 106

[81] Olga Kupriianova and Christoph Lauter. 2015. Replacing Branches by Polyno-
mials in Vectorizable Elementary Functions. In Scientific Computing, Computer
Arithmetic, and Validated Numerics, Marco Nehmeier, Jürgen Wolff von Guden-
berg, and Warwick Tucker (Eds.). Springer International Publishing, Cham, 14–22.
https://doi.org/10.1007/978-3-319-31769-4 2
220

[82] Wonyeol Lee, Rahul Sharma, and Alex Aiken. 2016. Verifying Bit-Manipulations
of Floating-Point. In Proceedings of the 37th ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation (Santa Barbara, CA, USA) (PLDI
’16). Association for Computing Machinery, New York, NY, USA, 70–84. https:
//doi.org/10.1145/2908080.2908107

[83] Wonyeol Lee, Rahul Sharma, and Alex Aiken. 2017. On Automatically Proving
the Correctness of Math.h Implementations. Proceedings of the ACM Programming
Langguages 2, POPL, Article 47 (Dec. 2017), 32 pages. https://doi.org/10.1145/
3158135

[84] Wonyeol Lee, Rahul Sharma, and Alex Aiken. 2017. On Automatically Proving the
Correctness of Math.h Implementations. Proceedings of the ACM on Programming
Languages 2, POPL, Article 47 (Dec. 2017), 32 pages. https://doi.org/10.1145/
3158135

[85] Vincent Lefèvre and Jean-Michel Muller. 2001. Worst Cases for Correct Round-
ing of the Elementary Functions in Double Precision. In 15th IEEE Symposium on
Computer Arithmetic (Arith ’01). 111–118. https://doi.org/10.1109/ARITH.2001.
930110

[86] Vincent Lefèvre, Jean-Michel Muller, and Arnaud Tisserand. 1998. Toward cor-
rectly rounded transcendentals. IEEE Trans. Comput. 47, 11 (1998), 1235–1243.
https://doi.org/10.1109/12.736435

[87] Vincent Lefèvre and Jean-Michel Muller. 2003. On-the-Fly Range Reduction. Jour-
nal of VLSI Signal Processing 33 (01 2003), 31–35. https://doi.org/10.1023/A:
1021137717282

[88] Cerlane Leong. 2019. SoftPosit-Math. https://gitlab.com/cerlane/softposit-math

[89] Jay P. Lim, Mridul Aanjaneya, John Gustafson, and Santosh Nagarakatte. 2020. A
Novel Approach to Generate Correctly Rounded Math Libraries for New Floating
Point Representations. arXiv:2007.05344 [cs.MS]

[90] Jay P. Lim, Mridul Aanjaneya, John Gustafson, and Santosh Nagarakatte. 2021. An
Approach to Generate Correctly Rounded Math Libraries for New Floating Point
Variants. Proc. ACM Program. Lang. 5, POPL, Article 29 (Jan. 2021), 30 pages.
https://doi.org/10.1145/3434310

[91] Jay P. Lim and Santosh Nagarakatte. 2020. RLibm. https://github.com/rutgers-apl/


rlibm

[92] Jay P. Lim and Santosh Nagarakatte. 2020. RLibm-generator. https://github.com/


rutgers-apl/rlibm-generator

[93] Jay P. Lim and Santosh Nagarakatte. 2021. High Performance Correctly Rounded
Math Libraries for 32-Bit Floating Point Representations. In Proceedings of the 42nd
221

ACM SIGPLAN International Conference on Programming Language Design and


Implementation (Virtual, Canada) (PLDI 2021). Association for Computing Machin-
ery, New York, NY, USA, 359–374. https://doi.org/10.1145/3453483.3454049

[94] Jay P. Lim and Santosh Nagarakatte. 2021. RLibm-32. https://github.com/


rutgers-apl/rlibm-32

[95] Jay P. Lim and Santosh Nagarakatte. 2021. RLIBM-ALL: A Novel Polynomial
Approximation Method to Produce Correctly Rounded Results for Multiple Repre-
sentations and Rounding Modes. arXiv:2108.06756 [abs] Rutgers Department of
Computer Science Technical Report DCS-TR-757.

[96] Jay P Lim and Rutgers University Santosh Nagarakatte. 2021. RLIBM-32: High
Performance Correctly Rounded Math Libraries for 32-bit Floating Point Represen-
tations. Rutgers Department of Computer Science Technical Report DCS-TR-754.

[97] Jay P. Lim, Matan Shachnai, and Santosh Nagarakatte. 2020. Approximating
Trigonometric Functions for Posits Using the CORDIC Method. In Proceedings of
the 17th ACM International Conference on Computing Frontiers (Catania, Sicily,
Italy) (CF ’20). Association for Computing Machinery, New York, NY, USA, 19–28.
https://doi.org/10.1145/3387902.3392632

[98] Pramod K. Meher, Javier Valls, Tso-Bing Juang, K. Sridharan, and Koushik Ma-
haratna. 2009. 50 Years of CORDIC: Algorithms, Architectures, and Applications.
IEEE Transactions on Circuits and Systems I: Regular Papers (2009).

[99] Guillaume Melquiond. 2019. Gappa. http://gappa.gforge.inria.fr

[100] Microsoft. 2019. (Cloud) Acceleration at Microsoft. https://old.hotchips.org/hc31/


HC31 T2 Microsoft CarrieChiouChung.pdf

[101] Sun Microsystems. 2008. LIBMCR 3 ”16 February 2008” ”libmcr-0.9”. http:
//www.math.utah.edu/cgi-bin/man2html.cgi?/usr/local/man/man3/libmcr.3

[102] Raúl Murillo Montero, Alberto A. Del Barrio, and Guillermo Botella. 2019.
Template-Based Posit Multiplication for Training and Inferring in Neural Networks.
arXiv:1907.04091 [cs.CV]

[103] Jean-Michel Muller. 2005. Elementary Functions: Algorithms and Implementation.


Birkhauser. https://doi.org/10.1007/978-1-4899-7983-4

[104] Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod,


Mioara Joldes, Vincent Lefvre, Guillaume Melquiond, Nathalie Revol, and Serge
Torres. 2018. Handbook of Floating-Point Arithmetic (2nd ed.). Birkhäuser Basel.

[105] Kwok C. Ng. 1992. Argument reduction for huge arguments: Good to the last
bit (can be obtained by sending an e-mail to the author: kwok.ng@eng.sun.com.
Technical Report. SunPro.
222

[106] Ivan Niven. 1956. Irrational Numbers. Mathematical Association of America.

[107] NVIDIA. 2020. NVIDIA AMPERE ARCHITECTURE. https://www.nvidia.com/


en-us/data-center/nvidia-ampere-gpu-architecture/

[108] NVIDIA. 2020. TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up
to 20x. https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/

[109] Oracle. 2015. Oracle R Solaris Studio 12.4: Numerical Computation Guide. https:
//docs.oracle.com/cd/E37069 01/html/E39019/z4000ac119729.html

[110] James M. Ortega and Werner C. Rheinboldt. 2000. Iterative Solution of Nonlinear
Equations in Several Variables. Society for Industrial and Applied Mathematics.
https://doi.org/10.1137/1.9780898719468

[111] Pavel Panchekha, Alex Sanchez-Stern, James R. Wilcox, and Zachary Tatlock. 2015.
Automatically Improving Accuracy for Floating Point Expressions. In Proceedings
of the 36th ACM SIGPLAN Conference on Programming Language Design and Im-
plementation, Vol. 50. Association for Computing Machinery, New York, NY, USA,
1–11. https://doi.org/10.1145/2813885.2737959

[112] Behrooz Parhami. 2020. Computing with logarithmic number system arithmetic:
Implementation methods and performance benefits. Computers & Electrical Engi-
neering 87 (2020), 106800.

[113] Mary H. Payne and Robert N. Hanek. 1983. Radian Reduction for Trigonometric
Functions. ACM SIGNUM Newslett. 18, 1 (Jan. 1983), 19–24. https://doi.org/10.
1145/1057600.1057602

[114] Eugene Remes. 1934. Sur un procédé convergent d’approximations successives pour
déterminer les polynômes d’approximation. Comptes rendus de l’Académie des Sci-
ences 198 (1934), 2063–2065.

[115] Denis Roegel. 2010. A reconstruction of the tables of Briggs’ ¡i¿Arithmetica loga-
rithmica¡/i¿ (1624). Research Report. https://hal.inria.fr/inria-00543939

[116] Bita Rouhani, Daniel Lo, Ritchie Zhao, Ming Liu, Jeremy Fowers, Kalin Ovtcharov,
Anna Vinogradsky, Sarah Massengill, Lita Yang, Ray Bittner, Alessandro Forin,
Haishan Zhu, Taesik Na, Prerak Patel, Shuai Che, Lok Chand Koppaka, Xia Song,
Subhojit Som, Kaustav Das, Saurabh Tiwary, Steve Reinhardt, Sitaram Lanka, Eric
Chung, and Doug Burger. 2020. Pushing the Limits of Narrow Precision Inferencing
at Cloud Scale with Microsoft Floating Point. In The Thirty-fourth Annual Confer-
ence on Neural Information Processing Systems. ACM.

[117] Hughes Saint-Genies, David Defour, and Guillaume Revy. 2017. Exact Lookup
Tables for the Evaluation of Trigonometric and Hyperbolic Functions. IEEE Trans.
Comput. 66, 12 (dec 2017), 2058–2071. https://doi.org/10.1109/TC.2017.2703870
223

[118] Alex Sanchez-Stern, Pavel Panchekha, Sorin Lerner, and Zachary Tatlock. 2018.
Finding Root Causes of Floating Point Error. In Proceedings of the 39th ACM
SIGPLAN Conference on Programming Language Design and Implementation
(Philadelphia, PA, USA) (PLDI 2018). ACM, New York, NY, USA, 256–269.
https://doi.org/10.1145/3296979.3192411

[119] Jun Sawada. 2002. Formal verification of divide and square root algorithms using
series calculation. In 3rd International Workshop on the ACL2 Theorem Prover and
its Applications.

[120] Michael J. Schulte and James E. Stine. 1997. Symmetric bipartite tables for accu-
rate function approximation. In Proceedings 13th IEEE Sympsoium on Computer
Arithmetic. 175–183. https://doi.org/10.1109/ARITH.1997.614893

[121] Michael J. Schulte and James E. Stine. 1999. Approximating elementary functions
with symmetric bipartite tables. IEEE Trans. Comput. 48, 8 (1999), 842–847. https:
//doi.org/10.1109/12.795125

[122] Shen-Fu Hsiao and Jean-Marc Delosme. 1995. Householder CORDIC algorithms.
IEEE Trans. Comput. (1995).

[123] Armando Solar-Lezama, Rodric Rabbah, Rastislav Bodı́k, and Kemal Ebcioğlu.
2005. Programming by Sketching for Bit-Streaming Programs. In Proceedings of
the 2005 ACM SIGPLAN Conference on Programming Language Design and Im-
plementation (PLDI ’05). New York, NY, USA, 281–294. https://doi.org/10.1145/
1065010.1065045

[124] Michael Spivak. 1971. Calculus On Manifolds: A Modern Approach To Classical


Theorems Of Advanced Calculus. Avalon Publishing.

[125] Damien Stehle, Vincent Lefevre, and Paul Zimmermann. 2005. Searching worst
cases of a one-variable function using lattice reduction. IEEE Trans. Comput. 54, 3
(2005), 340–346. https://doi.org/10.1109/TC.2005.55

[126] Pat H Sterbenz. 1974. Floating-point computation. Prentice-Hall, Englewood Cliffs,


NJ.

[127] James E. Stine and Michael J. Schulte. 1999. The Symmetric Table Addition Method
for Accurate Function Approximation. In The Journal of VLSI Signal Processing-
Systems for Signal, Image, and Video Technology. https://doi.org/10.1023/A:
1008004523235

[128] Shane Story and Ping Tak Peter Tang. 1999. New algorithms for improved tran-
scendental functions on IA-64. In Proceedings 14th IEEE Symposium on Computer
Arithmetic. 4–11. https://doi.org/10.1109/ARITH.1999.762822

[129] David A. Sunderland, Roger A. Strauch, Steven S. Wharfield, Henry T. Peterson,


and Christopher R. Cole. 1984. CMOS/SOS frequency synthesizer LSI circuit for
224

spread spectrum communications. IEEE Journal of Solid-State Circuits 19, 4 (1984),


497–506. https://doi.org/10.1109/JSSC.1984.1052173

[130] Earl E. Swartzlander and Aristides G. Alexopoulos. 1975. The Sign/Logarithm


Number System. IEEE Trans. Comput. C-24, 12 (1975), 1238–1242. https:
//doi.org/10.1109/T-C.1975.224172

[131] Giuseppe Tagliavini, Stefan Mach, Davide Rossi, Andrea Marongiu, and Luca
Benin. 2018. A transprecision floating-point platform for ultra-low power com-
puting. In 2018 Design, Automation Test in Europe Conference Exhibition (DATE).
1051–1056. https://doi.org/10.23919/DATE.2018.8342167

[132] Naofumi Takagi, Tohru Asada, and Shuzo Yajima. 1991. Redundant CORDIC meth-
ods with a constant scale factor for sine and cosine computation. IEEE Trans. Com-
put. (1991).

[133] Ping-Tak Peter Tang. 1989. Table-Driven Implementation of the Exponential Func-
tion in IEEE Floating-Point Arithmetic. ACM Trans. Math. Software 15, 2 (June
1989), 144–157. https://doi.org/10.1145/63522.214389

[134] Ping-Tak Peter Tang. 1990. Table-Driven Implementation of the Logarithm Function
in IEEE Floating-Point Arithmetic. ACM Trans. Math. Software 16, 4 (Dec. 1990),
378–400. https://doi.org/10.1145/98267.98294

[135] P. T. P. Tang. 1991. Table-lookup algorithms for elementary functions and their error
analysis. In [1991] Proceedings 10th IEEE Symposium on Computer Arithmetic.
232–236. https://doi.org/10.1109/ARITH.1991.145565

[136] Sugandha Tiwari, Neel Gala, Chester Rebeiro, and V. Kamakoti. 2019. PERI: A
Posit Enabled RISC-V Core. arXiv:1908.01466 [cs.AR]

[137] Lloyd N. Trefethen. 2012. Approximation Theory and Approximation Practice


(Other Titles in Applied Mathematics). Society for Industrial and Applied Math-
ematics, USA.

[138] Tso-Bing Juang, Shen-Fu Hsiao, and Ming-Yu Tsai. 2004. Para-CORDIC: parallel
CORDIC rotation algorithm. IEEE Transactions on Circuits and Systems I: Regular
Papers (2004).

[139] Jack Volder. 1959. The CORDIC Computing Technique. In Papers Presented at
the the March 3-5, 1959, Western Joint Computer Conference (IRE-AIEE-ACM ’59
(Western)).

[140] John S. Walther. 1971. A Unified Algorithm for Elementary Functions. In Pro-
ceedings of the May 18-20, 1971, Spring Joint Computer Conference (AFIPS ’71
(Spring)).
225

[141] Dong Wang, Jean-Michel Muller, Nicolas Brisebarre, and Miloš D. Ercegovac. 2014.
(M, p, k)-Friendly Points: A Table-Based Method to Evaluate Trigonometric Func-
tion. IEEE Transactions on Circuits and Systems II: Express Briefs 61, 9 (2014),
711–715. https://doi.org/10.1109/TCSII.2014.2331094

[142] Shibo Wang and Pankaj Kanwar. 2019. BFloat16: The secret to high performance
on Cloud TPUs. https://cloud.google.com/blog/products/ai-machine-learning/
bfloat16-the-secret-to-high-performance-on-cloud-tpus

[143] Shaoyun Wang, Vincenzo Piuri, and Earl E. Wartzlander. 1997. Hybrid CORDIC
algorithms. IEEE Trans. Comput. (1997).

[144] Karl Weierstrass. 1885. Über die analytische Darstellbarkeit sogenannter


willkürlicher Functionen reeller Argumente. In Sitzungsberichte der Königlich
Preussischen Akademie der Wissenschaften zu Berlin.

[145] Xin Yi, Liqian Chen, Xiaoguang Mao, and Tao Ji. 2019. Efficient Automated Repair
of High Floating-Point Errors in Numerical Libraries. Proceedings of the ACM on
Programming Languages 3, POPL, Article 56 (Jan. 2019), 29 pages. https://doi.
org/10.1145/3290369

[146] Abraham Ziv. 1991. Fast Evaluation of Elementary Mathematical Functions with
Correctly Rounded Last Bit. ACM Trans. Math. Software 17, 3 (Sept. 1991),
410–423. https://doi.org/10.1145/114697.116813

[147] Daming Zou, Muhan Zeng, Yingfei Xiong, Zhoulai Fu, Lu Zhang, and Zhendong
Su. 2019. Detecting Floating-Point Errors via Atomic Conditions. Proceedings of
the ACM on Programming Languages 4, POPL, Article 60 (Dec. 2019), 27 pages.
https://doi.org/10.1145/3371128

You might also like