Reverse Engineering for Beginners

Dennis Yurichev

Reverse Engineering for Beginners

Dennis Yurichev
<dennis(a)yurichev.com>

cb n d
©2013-2015, Dennis Yurichev.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-nd/3.0/.
Text version (April 1, 2015).
There is probably a newer version of this text, and Russian language version also accessible at
beginners.re. E-book reader version is also available on the page.
There are also shortened LITE-version (introductory), intended to those who wants very quick
introduction to reverse engineering basics: beginners.re
You may also subscribe to my twitter, to get information about updates of this text, etc: @yurichev

2

or to subscribe to mailing list .
The cover was made by Andy Nechaevsky: facebook.
1 twitter
2 yurichev.com

ii

1

Please take a short survey!
Feedback is extremely important to the author!

http://beginners.re/survey.html.

i

SHORT CONTENTS

SHORT CONTENTS

Short contents
I

Code patterns

1

II

Important fundamentals

450

III

Slightly more advanced examples

458

IV

Java

603

V

Finding important/interesting stuff in the code

639

VI

OS-specific

662

VII

Tools

716

VIII

More examples

722

IX

Examples of reversing proprietary file formats

834

X

Other things

865

XI

Books/blogs worth reading

883

XII

Exercises

887

Afterword

923

Appendix

925

Acronyms used

964

ii

CONTENTS

CONTENTS

Contents
0.0.1

I

Donate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Code patterns

iv

1

1 A short introduction to the CPU
1.1 A couple of words about different ISA3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3
3

2 Simplest possible function
2.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3.1 Note about MIPS instruction/register names

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

4
4
4
4
5

3 Hello, world!
3.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 MSVC . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.2 GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.3 GCC: AT&T syntax . . . . . . . . . . . . . . . . . . .
3.2 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2.1 MSVC—x86-64 . . . . . . . . . . . . . . . . . . . . .
3.2.2 GCC—x86-64 . . . . . . . . . . . . . . . . . . . . . .
3.3 GCC—one more thing . . . . . . . . . . . . . . . . . . . . . .
3.4 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4.1 Non-optimizing Keil 6/2013 (ARM mode) . . . .
3.4.2 Non-optimizing Keil 6/2013 (thumb mode) . . .
3.4.3 Optimizing Xcode 4.6.3 (LLVM) (ARM mode) . . .
3.4.4 Optimizing Xcode 4.6.3 (LLVM) (thumb-2 mode)
3.4.5 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.5.1 Word about “global pointer” . . . . . . . . . . . . .
3.5.2 Optimizing GCC . . . . . . . . . . . . . . . . . . . . .
3.5.3 Non-optimizing GCC . . . . . . . . . . . . . . . . . .
3.5.4 Role of the stack frame in this example . . . . . .
3.5.5 Optimizing GCC: load it into GDB . . . . . . . . . .
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

6
6
6
7
8
9
9
10
10
11
12
13
13
14
15
16
16
16
17
19
19
20
20
20

.
.
.
.

4 Function prologue and epilogue
21
4.1 Recursion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
5 Stack
5.1 Why does the stack grow backwards? . . . .
5.2 What is the stack used for? . . . . . . . . . .
5.2.1 Save the function’s return address .
5.2.2 Passing function arguments . . . . .
5.2.3 Local variable storage . . . . . . . . .
5.2.4 x86: alloca() function . . . . . . . . .
5.2.5 (Windows) SEH . . . . . . . . . . . . .
5.2.6 Buffer overflow protection . . . . . .
5.3 Typical stack layout . . . . . . . . . . . . . . .
3 Instruction

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

Set Architecture

iii

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

22
22
23
23
24
24
25
26
26
27

CONTENTS

5.4 Noise in stack . . . .
5.5 Exercises . . . . . . .
5.5.1 Exercise #1
5.5.2 Exercise #2

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

27
30
30
30

6 printf() with several arguments
6.1 x86 . . . . . . . . . . . . . . . . .
6.1.1 x86: 3 arguments . . .
6.1.2 x64: 8 arguments . . .
6.2 ARM . . . . . . . . . . . . . . . . .
6.2.1 ARM: 3 arguments . . .
6.2.2 ARM: 8 arguments . . .
6.3 MIPS . . . . . . . . . . . . . . . .
6.3.1 3 arguments . . . . . . .
6.3.2 8 arguments . . . . . . .
6.4 Conclusion . . . . . . . . . . . . .
6.5 By the way . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

33
33
33
40
43
43
45
48
48
51
54
55

7 scanf()
7.1 Simple example . . . . . . . . . . . . . . . . . . . . . . .
7.1.1 About pointers . . . . . . . . . . . . . . . . . . .
7.1.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . .
7.1.3 MSVC + OllyDbg . . . . . . . . . . . . . . . . . .
7.1.4 x64 . . . . . . . . . . . . . . . . . . . . . . . . . .
7.1.5 ARM . . . . . . . . . . . . . . . . . . . . . . . . . .
7.1.6 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . .
7.2 Global variables . . . . . . . . . . . . . . . . . . . . . . .
7.2.1 MSVC: x86 . . . . . . . . . . . . . . . . . . . . . .
7.2.2 MSVC: x86 + OllyDbg . . . . . . . . . . . . . . .
7.2.3 GCC: x86 . . . . . . . . . . . . . . . . . . . . . . .
7.2.4 MSVC: x64 . . . . . . . . . . . . . . . . . . . . . .
7.2.5 ARM: Optimizing Keil 6/2013 (thumb mode)
7.2.6 ARM64 . . . . . . . . . . . . . . . . . . . . . . . .
7.2.7 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3 scanf() result checking . . . . . . . . . . . . . . . . . . .
7.3.1 MSVC: x86 . . . . . . . . . . . . . . . . . . . . . .
7.3.2 MSVC: x86: IDA . . . . . . . . . . . . . . . . . . .
7.3.3 MSVC: x86 + OllyDbg . . . . . . . . . . . . . . .
7.3.4 MSVC: x86 + Hiew . . . . . . . . . . . . . . . . .
7.3.5 MSVC: x64 . . . . . . . . . . . . . . . . . . . . . .
7.3.6 ARM . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3.7 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3.8 Exercise . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

56
56
56
56
59
61
62
64
65
65
67
68
68
69
70
70
73
74
75
79
81
82
83
84
85

8 Accessing passed arguments
8.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.1.1 MSVC . . . . . . . . . . . . . . . . . . . . . . .
8.1.2 MSVC + OllyDbg . . . . . . . . . . . . . . . .
8.1.3 GCC . . . . . . . . . . . . . . . . . . . . . . . .
8.2 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.2.1 MSVC . . . . . . . . . . . . . . . . . . . . . . .
8.2.2 GCC . . . . . . . . . . . . . . . . . . . . . . . .
8.2.3 GCC: uint64_t instead of int . . . . . . . . .
8.3 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.3.1 Non-optimizing Keil 6/2013 (ARM mode)
8.3.2 Optimizing Keil 6/2013 (ARM mode) . . .
8.3.3 Optimizing Keil 6/2013 (thumb mode) . .
8.3.4 ARM64 . . . . . . . . . . . . . . . . . . . . . .
8.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

86
86
86
87
87
88
88
89
90
91
91
91
92
92
93

9 More about results returning
9.1 Attempt to use the result of a function returning void . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 What if we do not use the function result? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3 Returning a structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

95
95
96
96

.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.

iv

CONTENTS

10 Pointers
10.1 Global variables example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10.2 Local variables example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

98
98
104
107

11 GOTO
108
11.1 Dead code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.2 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12 Conditional jumps
12.1 Simple example . . . . . . . . . . . . . . . . . . .
12.1.1 x86 . . . . . . . . . . . . . . . . . . . . . .
12.1.2 ARM . . . . . . . . . . . . . . . . . . . . . .
12.1.3 MIPS . . . . . . . . . . . . . . . . . . . . . .
12.2 Calculating absolute value . . . . . . . . . . . . .
12.2.1 Optimizing MSVC . . . . . . . . . . . . . .
12.2.2 Optimizing Keil 6/2013: thumb mode .
12.2.3 Optimizing Keil 6/2013: ARM mode . .
12.2.4 Non-optimizing GCC 4.9 (ARM64) . . .
12.2.5 MIPS . . . . . . . . . . . . . . . . . . . . . .
12.2.6 Branchless version? . . . . . . . . . . . .
12.3 Conditional operator . . . . . . . . . . . . . . . .
12.3.1 x86 . . . . . . . . . . . . . . . . . . . . . .
12.3.2 ARM . . . . . . . . . . . . . . . . . . . . . .
12.3.3 ARM64 . . . . . . . . . . . . . . . . . . . .
12.3.4 MIPS . . . . . . . . . . . . . . . . . . . . . .
12.3.5 Let’s rewrite it in an if/else way . .
12.3.6 Conclusion . . . . . . . . . . . . . . . . . .
12.3.7 Exercise . . . . . . . . . . . . . . . . . . . .
12.4 Getting minimal and maximal values . . . . . .
12.4.1 32-bit . . . . . . . . . . . . . . . . . . . . .
12.4.2 64-bit . . . . . . . . . . . . . . . . . . . . .
12.4.3 MIPS . . . . . . . . . . . . . . . . . . . . . .
12.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . .
12.5.1 x86 . . . . . . . . . . . . . . . . . . . . . .
12.5.2 ARM . . . . . . . . . . . . . . . . . . . . . .
12.5.3 MIPS . . . . . . . . . . . . . . . . . . . . . .
12.5.4 Branchless . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

111
111
111
122
125
127
127
128
128
128
129
129
129
129
130
131
131
132
132
132
132
132
134
136
137
137
137
137
137

13 switch()/case/default
13.1 Small number of cases . . . . . . . . . . . . . . . . . . . .
13.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.1.2 ARM: Optimizing Keil 6/2013 (ARM mode) . . .
13.1.3 ARM: Optimizing Keil 6/2013 (thumb mode) .
13.1.4 ARM64: Non-optimizing GCC (Linaro) 4.9 . . .
13.1.5 ARM64: Optimizing GCC (Linaro) 4.9 . . . . . .
13.1.6 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.1.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . .
13.2 A lot of cases . . . . . . . . . . . . . . . . . . . . . . . . . .
13.2.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.2.2 ARM: Optimizing Keil 6/2013 (ARM mode) . . .
13.2.3 ARM: Optimizing Keil 6/2013 (thumb mode) .
13.2.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . .
13.3 When there are several case statements in one block
13.3.1 MSVC . . . . . . . . . . . . . . . . . . . . . . . . . .
13.3.2 GCC . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.3.3 ARM64: Optimizing GCC 4.9.1 . . . . . . . . . . .
13.4 Fall-through . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.4.1 MSVC x86 . . . . . . . . . . . . . . . . . . . . . . .
13.4.2 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . .
13.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.5.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

139
139
139
149
149
150
151
151
152
152
152
158
159
160
162
162
163
164
164
166
167
167
168
168

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

v

CONTENTS

14 Loops
14.1 Simple example . . . . . . . . . . . . . . . .
14.1.1 x86 . . . . . . . . . . . . . . . . . . .
14.1.2 x86: OllyDbg . . . . . . . . . . . . .
14.1.3 x86: tracer . . . . . . . . . . . . . . .
14.1.4 ARM . . . . . . . . . . . . . . . . . . .
14.1.5 MIPS . . . . . . . . . . . . . . . . . . .
14.1.6 One more thing . . . . . . . . . . . .
14.2 Memory blocks copying routine . . . . . .
14.2.1 Straight-forward implementation
14.2.2 ARM in ARM mode . . . . . . . . . .
14.2.3 MIPS . . . . . . . . . . . . . . . . . . .
14.2.4 Vectorization . . . . . . . . . . . . .
14.3 Conclusion . . . . . . . . . . . . . . . . . . . .
14.4 Exercises . . . . . . . . . . . . . . . . . . . . .
14.4.1 Exercise #1 . . . . . . . . . . . . . .
14.4.2 Exercise #2 . . . . . . . . . . . . . .
14.4.3 Exercise #3 . . . . . . . . . . . . . .
14.4.4 Exercise #4 . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

169
169
169
173
173
175
178
179
179
179
180
181
181
181
183
183
183
183
185

15 Simple C-strings processing
15.1 strlen() . . . . . . . . . .
15.1.1 x86 . . . . . . .
15.1.2 ARM . . . . . . .
15.1.3 MIPS . . . . . . .
15.2 Exercises . . . . . . . . .
15.2.1 Exercise #1 . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

188
188
188
195
197
198
198

16 Replacing arithmetic instructions to other ones
16.1 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.1.1 Multiplication using addition . . . . . . . . . . . . .
16.1.2 Multiplication using shifting . . . . . . . . . . . . . .
16.1.3 Multiplication using shifting/subtracting/adding
16.2 Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.2.1 Division using shifts . . . . . . . . . . . . . . . . . . .
16.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.3.1 Exercise #2 . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

201
201
201
201
202
205
205
206
206

17 Floating-point unit
17.1 IEEE 754 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.3 ARM, MIPS, x86/x64 SIMD . . . . . . . . . . . . . . . . . . . . . . . . . .
17.4 C/C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.5 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.5.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.5.2 ARM: Optimizing Xcode 4.6.3 (LLVM) (ARM mode) . . . . . .
17.5.3 ARM: Optimizing Keil 6/2013 (thumb mode) . . . . . . . . .
17.5.4 ARM64: Optimizing GCC (Linaro) 4.9 . . . . . . . . . . . . . .
17.5.5 ARM64: Non-optimizing GCC (Linaro) 4.9 . . . . . . . . . . .
17.5.6 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.6 Passing floating point numbers via arguments . . . . . . . . . . . . .
17.6.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.6.2 ARM + Non-optimizing Xcode 4.6.3 (LLVM) (thumb-2 mode)
17.6.3 ARM + Non-optimizing Keil 6/2013 (ARM mode) . . . . . . .
17.6.4 ARM64 + Optimizing GCC (Linaro) 4.9 . . . . . . . . . . . . . .
17.6.5 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.7 Comparison example . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.7.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.7.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.7.3 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.7.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.8 Stack, calculators and reverse Polish notation . . . . . . . . . . . . .
17.9 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.10Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17.10.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

208
208
208
208
208
209
209
216
216
217
217
218
219
219
220
220
221
221
222
223
250
253
254
255
255
255
255

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

vi

CONTENTS

17.10.2 Exercise #2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
18 Arrays
18.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.1.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.1.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.2 Buffer overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.2.1 Reading outside array bounds . . . . . . . . . . . . . .
18.2.2 Writing beyond array bounds . . . . . . . . . . . . . .
18.3 Buffer overflow protection methods . . . . . . . . . . . . . . .
18.3.1 Optimizing Xcode 4.6.3 (LLVM) (thumb-2 mode) . .
18.4 One more word about arrays . . . . . . . . . . . . . . . . . . .
18.5 Array of pointers to strings . . . . . . . . . . . . . . . . . . . .
18.5.1 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.5.2 32-bit ARM . . . . . . . . . . . . . . . . . . . . . . . . . .
18.5.3 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.5.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.5.5 Array overflow . . . . . . . . . . . . . . . . . . . . . . .
18.6 Multidimensional arrays . . . . . . . . . . . . . . . . . . . . . .
18.6.1 Two-dimensional array example . . . . . . . . . . . .
18.6.2 Access two-dimensional array as one-dimensional
18.6.3 Three-dimensional array example . . . . . . . . . . .
18.6.4 More examples . . . . . . . . . . . . . . . . . . . . . . .
18.7 Pack of strings as a two-dimensional array . . . . . . . . . .
18.7.1 32-bit ARM . . . . . . . . . . . . . . . . . . . . . . . . . .
18.7.2 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.7.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.7.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . .
18.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.9 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
18.9.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . . . .
18.9.2 Exercise #2 . . . . . . . . . . . . . . . . . . . . . . . . .
18.9.3 Exercise #3 . . . . . . . . . . . . . . . . . . . . . . . . .
18.9.4 Exercise #4 . . . . . . . . . . . . . . . . . . . . . . . . .
18.9.5 Exercise #5 . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

257
257
257
260
262
264
264
267
271
273
275
275
276
277
278
278
279
281
282
283
285
287
287
289
290
290
291
291
291
291
294
298
299
300

19 Manipulating specific bit(s)
19.1 Specific bit checking . . . . . . . . . . . . . . . . . . . . . . . . . .
19.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.1.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.2 Specific bit setting/clearing . . . . . . . . . . . . . . . . . . . . . .
19.2.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.2.2 ARM + Optimizing Keil 6/2013 (ARM mode) . . . . . . .
19.2.3 ARM + Optimizing Keil 6/2013 (thumb mode) . . . . . .
19.2.4 ARM + Optimizing Xcode 4.6.3 (LLVM) (ARM mode) . .
19.2.5 ARM: more about the BIC instruction . . . . . . . . . . .
19.2.6 ARM64: Optimizing GCC (Linaro) 4.9 . . . . . . . . . . .
19.2.7 ARM64: Non-optimizing GCC (Linaro) 4.9 . . . . . . . .
19.2.8 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.3 Shifts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.4 Specific bit setting/clearing: FPU4 example . . . . . . . . . . . .
19.4.1 A word about the XOR operation . . . . . . . . . . . . . .
19.4.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.4.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.4.4 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.5 Counting bits set to 1 . . . . . . . . . . . . . . . . . . . . . . . . . .
19.5.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.5.2 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.5.3 ARM + Optimizing Xcode 4.6.3 (LLVM) (ARM mode) . .
19.5.4 ARM + Optimizing Xcode 4.6.3 (LLVM) (thumb-2 mode)
19.5.5 ARM64 + Optimizing GCC 4.9 . . . . . . . . . . . . . . . .
19.5.6 ARM64 + Non-optimizing GCC 4.9 . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

305
305
305
307
308
309
314
315
315
315
315
316
316
316
316
317
317
318
319
321
322
330
332
333
333
333

4 Floating-point

unit

vii

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

CONTENTS

19.5.7 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.6.1 Check for specific bit (known at compile stage)
19.6.2 Check for specific bit (specified at runtime) . . .
19.6.3 Set specific bit (known at compile stage) . . . .
19.6.4 Set specific bit (specified at runtime) . . . . . . .
19.6.5 Clear specific bit (known at compile stage) . . .
19.6.6 Clear specific bit (specified at runtime) . . . . . .
19.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.7.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . .
19.7.2 Exercise #2 . . . . . . . . . . . . . . . . . . . . . . .
19.7.3 Exercise #3 . . . . . . . . . . . . . . . . . . . . . . .
19.7.4 Exercise #4 . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

334
336
336
336
337
337
337
337
338
338
339
341
342

20 Linear congruential generator as pseudorandom number generator
20.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20.2 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20.3 32-bit ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20.4.1 MIPS relocations . . . . . . . . . . . . . . . . . . . . . . . .
20.5 Thread-safe version of the example . . . . . . . . . . . . . . . . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

344
344
345
346
346
347
348

21 Structures
21.1 MSVC: SYSTEMTIME example . . . . . . . . . . . . . . . . .
21.1.1 OllyDbg . . . . . . . . . . . . . . . . . . . . . . . . . .
21.1.2 Replacing the structure with array . . . . . . . . .
21.2 Let’s allocate space for a structure using malloc() . . . .
21.3 UNIX: struct tm . . . . . . . . . . . . . . . . . . . . . . . . . .
21.3.1 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.3.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.3.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.3.4 Structure as a set of values . . . . . . . . . . . . .
21.3.5 Structure as an array of 32-bit words . . . . . . .
21.3.6 Structure as an array of bytes . . . . . . . . . . . .
21.4 Fields packing in structure . . . . . . . . . . . . . . . . . . .
21.4.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.4.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.4.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.4.4 One more word . . . . . . . . . . . . . . . . . . . . .
21.5 Nested structures . . . . . . . . . . . . . . . . . . . . . . . .
21.5.1 OllyDbg . . . . . . . . . . . . . . . . . . . . . . . . . .
21.6 Bit fields in a structure . . . . . . . . . . . . . . . . . . . . .
21.6.1 CPUID example . . . . . . . . . . . . . . . . . . . . .
21.6.2 Working with the float type as with a structure
21.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.7.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . .
21.7.2 Exercise #2 . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

349
349
351
351
352
354
354
356
358
359
361
362
363
364
368
369
370
370
372
372
372
376
379
379
379

22 Unions
22.1 Pseudo-random number generator example .
22.1.1 x86 . . . . . . . . . . . . . . . . . . . . .
22.1.2 MIPS . . . . . . . . . . . . . . . . . . . . .
22.1.3 ARM (ARM mode) . . . . . . . . . . . . .
22.2 Calculating machine epsilon . . . . . . . . . .
22.2.1 x86 . . . . . . . . . . . . . . . . . . . . .
22.2.2 ARM64 . . . . . . . . . . . . . . . . . . .
22.2.3 MIPS . . . . . . . . . . . . . . . . . . . . .
22.2.4 Conclusion . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

384
384
385
386
387
388
389
389
389
390

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

viii

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

CONTENTS

23 Pointers to functions
23.1 MSVC . . . . . . . . . . . . . . . . . . . . .
23.1.1 MSVC + OllyDbg . . . . . . . . .
23.1.2 MSVC + tracer . . . . . . . . . . .
23.1.3 MSVC + tracer (code coverage)
23.2 GCC . . . . . . . . . . . . . . . . . . . . . .
23.2.1 GCC + GDB (with source code) .
23.2.2 GCC + GDB (no source code) . .

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

391
392
394
396
398
398
399
400

24 64-bit values in 32-bit environment
24.1 Returning of 64-bit value . . . . . . . . . .
24.1.1 x86 . . . . . . . . . . . . . . . . . . .
24.1.2 ARM . . . . . . . . . . . . . . . . . . .
24.1.3 MIPS . . . . . . . . . . . . . . . . . . .
24.2 Arguments passing, addition, subtraction
24.2.1 x86 . . . . . . . . . . . . . . . . . . .
24.2.2 ARM . . . . . . . . . . . . . . . . . . .
24.2.3 MIPS . . . . . . . . . . . . . . . . . . .
24.3 Multiplication, division . . . . . . . . . . . .
24.3.1 x86 . . . . . . . . . . . . . . . . . . .
24.3.2 ARM . . . . . . . . . . . . . . . . . . .
24.3.3 MIPS . . . . . . . . . . . . . . . . . . .
24.4 Shifting right . . . . . . . . . . . . . . . . . .
24.4.1 x86 . . . . . . . . . . . . . . . . . . .
24.4.2 ARM . . . . . . . . . . . . . . . . . . .
24.4.3 MIPS . . . . . . . . . . . . . . . . . . .
24.5 Converting 32-bit value into 64-bit one .
24.5.1 x86 . . . . . . . . . . . . . . . . . . .
24.5.2 ARM . . . . . . . . . . . . . . . . . . .
24.5.3 MIPS . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

403
403
403
403
403
404
404
405
406
407
407
408
409
410
410
411
411
411
411
412
412

25 SIMD
25.1 Vectorization . . . . . . . . . . . . .
25.1.1 Addition example . . . . .
25.1.2 Memory copy example . .
25.2 SIMD strlen() implementation

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

413
413
414
419
422

26 64 bits
26.1 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
26.3 Float point numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

426
426
432
433

27 Working with floating point numbers using SIMD
27.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . .
27.1.1 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . .
27.1.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . .
27.2 Passing floating point number via arguments . . . . . .
27.3 Comparison example . . . . . . . . . . . . . . . . . . . . .
27.3.1 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . .
27.3.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . .
27.4 Calculating machine epsilon: x64 and SIMD . . . . . .
27.5 Pseudo-random number generator example revisited .
27.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

434
434
434
435
441
441
442
443
443
444
444

28 More about ARM
28.1 Number sign (#) before number .
28.2 Addressing modes . . . . . . . . . .
28.3 Loading a constant into a register
28.3.1 32-bit ARM . . . . . . . . . .
28.3.2 ARM64 . . . . . . . . . . . .
28.4 Relocs in ARM64 . . . . . . . . . . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

445
445
445
445
445
446
447

.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

29 More about MIPS
449
29.1 Loading constants into register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
29.2 Further reading about MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449
ix

CONTENTS

II

Important fundamentals

450

30 Signed number representations
31 Endianness
31.1 Big-endian . . . .
31.2 Little-endian . .
31.3 Example . . . . .
31.4 Bi-endian . . . .
31.5 Converting data

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

451
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

32 Memory

453
453
453
453
454
454
455

33 CPU
456
33.1 Branch predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
33.2 Data dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
34 Hash functions
457
34.1 How one-way function works? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457

III

Slightly more advanced examples

35 Temperature converting
35.1 Integer values . . . . . . . . . . . . . .
35.1.1 Optimizing MSVC 2012 x86
35.1.2 Optimizing MSVC 2012 x64
35.2 Floating-point values . . . . . . . . .

.
.
.
.

458
.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

459
459
459
460
461

36 Fibonacci numbers
464
36.1 Example #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
36.2 Example #2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
36.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
37 CRC32 calculation example
38 Network address calculation example
38.1 calc_network_address() . . . . . .
38.2 form_IP() . . . . . . . . . . . . . . .
38.3 print_as_IP() . . . . . . . . . . . . .
38.4 form_netmask() and set_bit() . .
38.5 Summary . . . . . . . . . . . . . . .

471
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

474
475
476
477
478
479

39 Several iterators
480
39.1 Three iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
39.2 Two iterators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
39.3 Intel C++ 2011 case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 482
40 Duff’s device

485

41 Division by 9
41.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.2 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.2.1 Optimizing Xcode 4.6.3 (LLVM) (ARM mode) . . . . . .
41.2.2 Optimizing Xcode 4.6.3 (LLVM) (thumb-2 mode) . . .
41.2.3 Non-optimizing Xcode 4.6.3 (LLVM) and Keil 6/2013
41.3 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.4 How it works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.5 Getting the divisor . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.5.1 Variant #1 . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.5.2 Variant #2 . . . . . . . . . . . . . . . . . . . . . . . . . . .
41.6 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

x

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.

488
488
489
489
490
490
490
490
491
491
492
493

CONTENTS

42 String to number conversion (atoi())
42.1 Simple example . . . . . . . . . . . . . . . . . . .
42.1.1 Optimizing MSVC 2013 x64 . . . . . . .
42.1.2 Optimizing GCC 4.9.1 x64 . . . . . . . .
42.1.3 Optimizing Keil 6/2013 (ARM mode) .
42.1.4 Optimizing Keil 6/2013 (thumb mode)
42.1.5 Optimizing GCC 4.9.1 ARM64 . . . . . .
42.2 A slightly advanced example . . . . . . . . . . .
42.2.1 Optimizing GCC 4.9.1 x64 . . . . . . . .
42.2.2 Optimizing Keil 6/2013 (ARM mode) .

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

494
494
494
495
495
496
496
497
497
499

43 Inline functions
43.1 Strings and memory functions
43.1.1 strcmp() . . . . . . . . . .
43.1.2 strlen() . . . . . . . . . .
43.1.3 strcpy() . . . . . . . . . .
43.1.4 memset() . . . . . . . . .
43.1.5 memcpy() . . . . . . . . .
43.1.6 memcmp() . . . . . . . .
43.1.7 IDA script . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

500
501
501
503
503
503
504
507
508

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

44 C99 restrict

509

45 Branchless abs() function
512
45.1 Optimizing GCC 4.9.1 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
45.2 Optimizing GCC 4.9 ARM64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
46 Variadic functions
46.1 Computing arithmetic mean . . . . . . . . . .
46.1.1 cdecl calling conventions . . . . . . .
46.1.2 Register-based calling conventions
46.2 vprintf() function case . . . . . . . . . . . . . .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

514
514
514
515
517

47 Strings trimming
47.1 x64: Optimizing MSVC 2013 . . . . . . . . . .
47.2 x64: Non-optimizing GCC 4.9.1 . . . . . . . . .
47.3 x64: Optimizing GCC 4.9.1 . . . . . . . . . . . .
47.4 ARM64: Non-optimizing GCC (Linaro) 4.9 . .
47.5 ARM64: Optimizing GCC (Linaro) 4.9 . . . . .
47.6 ARM: Optimizing Keil 6/2013 (ARM mode) . .
47.7 ARM: Optimizing Keil 6/2013 (thumb mode)
47.8 MIPS . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

519
520
521
522
523
524
525
525
526

48 toupper() function
48.1 x64 . . . . . . . . . . . . . . . . . . . .
48.1.1 Two comparison operations
48.1.2 One comparison operation .
48.2 ARM . . . . . . . . . . . . . . . . . . . .
48.2.1 GCC for ARM64 . . . . . . . .
48.3 Summary . . . . . . . . . . . . . . . . .

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

528
528
528
529
530
530
531

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

49 Incorrectly disassembled code
532
49.1 Disassembling from an incorrect start (x86) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
49.2 How does random noise looks disassembled? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
50 Obfuscation
50.1 Text strings . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50.2 Executable code . . . . . . . . . . . . . . . . . . . . . . . . .
50.2.1 Inserting garbage . . . . . . . . . . . . . . . . . . .
50.2.2 Replacing instructions with bloated equivalents
50.2.3 Always executed/never executed code . . . . . .
50.2.4 Making a lot of mess . . . . . . . . . . . . . . . . .
50.2.5 Using indirect pointers . . . . . . . . . . . . . . . .
50.3 Virtual machine / pseudo-code . . . . . . . . . . . . . . . .
50.4 Other things to mention . . . . . . . . . . . . . . . . . . . .

xi

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

537
537
537
537
538
538
538
539
539
539

CONTENTS

50.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
50.5.1 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
51 C++
51.1 Classes . . . . . . . . . . . . .
51.1.1 A simple example .
51.1.2 Class inheritance . .
51.1.3 Encapsulation . . . .
51.1.4 Multiple inheritance
51.1.5 Virtual methods . .
51.2 ostream . . . . . . . . . . . .
51.3 References . . . . . . . . . . .
51.4 STL . . . . . . . . . . . . . . .
51.4.1 std::string . . . . . . .
51.4.2 std::list . . . . . . . .
51.4.3 std::vector . . . . . .
51.4.4 std::map and std::set

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

52 Negative array indices
53 Windows 16-bit
53.1 Example#1 . . . . . . . .
53.2 Example #2 . . . . . . . .
53.3 Example #3 . . . . . . . .
53.4 Example #4 . . . . . . . .
53.5 Example #5 . . . . . . . .
53.6 Example #6 . . . . . . . .
53.6.1 Global variables

IV

589
.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

Java

592
592
592
593
594
596
600
601

603

54 Java
54.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . .
54.2 Returning a value . . . . . . . . . . . . . . . . . . . . . .
54.3 Simple calculating functions . . . . . . . . . . . . . . .
54.4 JVM5 memory model . . . . . . . . . . . . . . . . . . . .
54.5 Simple function calling . . . . . . . . . . . . . . . . . . .
54.6 Calling beep() . . . . . . . . . . . . . . . . . . . . . . . . .
54.7 Linear congruential PRNG6 . . . . . . . . . . . . . . . .
54.8 Conditional jumps . . . . . . . . . . . . . . . . . . . . . .
54.9 Passing arguments . . . . . . . . . . . . . . . . . . . . . .
54.10Bitfields . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.11Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.12switch() . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.13Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.13.1Simple example . . . . . . . . . . . . . . . . . .
54.13.2Summing elements of array . . . . . . . . . . .
54.13.3main() function sole argument is array too
54.13.4Pre-initialized array of strings . . . . . . . . . .
54.13.5Variadic functions . . . . . . . . . . . . . . . . .
54.13.6Two-dimensional arrays . . . . . . . . . . . . . .
54.13.7Three-dimensional arrays . . . . . . . . . . . .
54.13.8Summary . . . . . . . . . . . . . . . . . . . . . . .
54.14Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.14.1First example . . . . . . . . . . . . . . . . . . . .
54.14.2Second example . . . . . . . . . . . . . . . . . .
54.15Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.16Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54.17Simple patching . . . . . . . . . . . . . . . . . . . . . . .
54.17.1 First example . . . . . . . . . . . . . . . . . . . .
54.17.2 Second example . . . . . . . . . . . . . . . . . .
54.18Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 Java

540
540
540
546
548
550
553
555
556
556
557
563
572
578

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

virtual machine
number generator

6 Pseudorandom

xii

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

604
604
604
608
610
611
612
612
613
615
616
617
619
620
620
621
622
622
624
626
626
627
627
627
628
629
632
634
634
635
638

CONTENTS

V

Finding important/interesting stuff in the code

55 Identification of executable files
55.1 Microsoft Visual C++ . . . . .
55.1.1 Name mangling . . .
55.2 GCC . . . . . . . . . . . . . . .
55.2.1 Name mangling . . .
55.2.2 Cygwin . . . . . . . .
55.2.3 MinGW . . . . . . . .
55.3 Intel FORTRAN . . . . . . . .
55.4 Watcom, OpenWatcom . . .
55.4.1 Name mangling . . .
55.5 Borland . . . . . . . . . . . . .
55.5.1 Delphi . . . . . . . . .
55.6 Other known DLLs . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

639
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

641
641
641
641
641
641
641
641
642
642
642
642
643

56 Communication with the outer world (win32)
644
56.1 Often used functions in the Windows API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
56.2 tracer: Intercepting all functions in specific module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
57 Strings
57.1 Text strings . . . . . . . .
57.1.1 C/C++ . . . . . . .
57.1.2 Borland Delphi .
57.1.3 Unicode . . . . . .
57.1.4 Base64 . . . . . .
57.2 Error/debug messages .
57.3 Suspicious magic strings

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

.
.
.
.
.
.
.

58 Calls to assert()

646
646
646
646
647
649
649
650
651

59 Constants
652
59.1 Magic numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652
59.1.1 DHCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
59.2 Searching for constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
60 Finding the right instructions

654

61 Suspicious code patterns
656
61.1 XOR instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
61.2 Hand-written assembly code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
62 Using magic numbers while tracing
63 Other things
63.1 General idea . . . . . . . . . . . . .
63.2 C++ . . . . . . . . . . . . . . . . . .
63.3 Some binary file patterns . . . .
63.4 Memory “snapshots” comparing
63.4.1 Windows registry . . . . .
63.4.2 Blink-comparator . . . . .

VI

658
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

OS-specific

659
659
659
659
660
661
661

662

64 Arguments passing methods (calling conventions)
64.1 cdecl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64.2 stdcall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64.2.1 Functions with variable number of arguments
64.3 fastcall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64.3.1 GCC regparm . . . . . . . . . . . . . . . . . . . . .
64.3.2 Watcom/OpenWatcom . . . . . . . . . . . . . . . .
64.4 thiscall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64.5 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
64.5.1 Windows x64 . . . . . . . . . . . . . . . . . . . . .
64.5.2 Linux x64 . . . . . . . . . . . . . . . . . . . . . . .
xiii

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

663
663
663
664
664
665
665
665
665
665
667

CONTENTS

64.6 Return values of float and double type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
64.7 Modifying arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
64.8 Taking a pointer to function argument . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
65 Thread Local Storage
65.1 Linear congruential generator revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65.1.1 Win32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65.1.2 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

671
671
671
675

66 System calls (syscall-s)
676
66.1 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
66.2 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
67 Linux
678
67.1 Position-independent code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
67.1.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
67.2 LD_PRELOAD hack in Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
68 Windows NT
68.1 CRT (win32) . . . . . . . . . . . . . . .
68.2 Win32 PE . . . . . . . . . . . . . . . . .
68.2.1 Terminology . . . . . . . . . .
68.2.2 Base address . . . . . . . . .
68.2.3 Subsystem . . . . . . . . . . .
68.2.4 OS version . . . . . . . . . . .
68.2.5 Sections . . . . . . . . . . . .
68.2.6 Relocations (relocs) . . . . .
68.2.7 Exports and imports . . . . .
68.2.8 Resources . . . . . . . . . . .
68.2.9 .NET . . . . . . . . . . . . . . .
68.2.10TLS . . . . . . . . . . . . . . .
68.2.11Tools . . . . . . . . . . . . . . .
68.2.12Further reading . . . . . . . .
68.3 Windows SEH . . . . . . . . . . . . . .
68.3.1 Let’s forget about MSVC . . .
68.3.2 Now let’s get back to MSVC
68.3.3 Windows x64 . . . . . . . . .
68.3.4 Read more about SEH . . . .
68.4 Windows NT: Critical section . . . .

VII

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

Tools

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

683
683
686
686
687
687
687
687
688
689
691
691
691
692
692
692
692
697
710
714
714

716

69 Disassembler
717
69.1 IDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 717
70 Debugger
718
70.1 tracer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
70.2 OllyDbg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
70.3 GDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 718
71 System calls tracing
719
71.0.1 strace / dtruss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
72 Decompilers

720

73 Other tools

721

VIII

More examples

722

74 Task manager practical joke (Windows Vista)
723
74.1 Using LEA to load values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 725
75 Color Lines game practical joke

727

xiv

CONTENTS

76 Minesweeper (Windows XP)
731
76.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
77 Hand decompiling + Z3 SMT solver
736
77.1 Hand decompiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
77.2 Now let’s use the Z3 SMT solver . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
78 Dongles
78.1 Example #1: MacOS Classic and PowerPC
78.2 Example #2: SCO OpenServer . . . . . . . .
78.2.1 Decrypting error messages . . . . .
78.3 Example #3: MS-DOS . . . . . . . . . . . . .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

79 “QR9”: Rubik’s cube inspired amateur crypto-algorithm

744
744
750
757
760
765

80 SAP
792
80.1 About SAP client network traffic compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
80.2 SAP 6.0 password checking functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
81 Oracle RDBMS
81.1 V$VERSION table in the Oracle RDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81.2 X$KSMLRU table in Oracle RDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
81.3 V$TIMER table in Oracle RDBMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

806
806
813
814

82 Handwritten assembly code
819
82.1 EICAR test file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
83 Demos
83.1 10 PRINT CHR$(205.5+RND(1)); : GOTO 10 . . . . . . . . . . . . . . . .
83.1.1 Trixter’s 42 byte version . . . . . . . . . . . . . . . . . . . . . . . .
83.1.2 My attempt to reduce Trixter’s version: 27 bytes . . . . . . . .
83.1.3 Taking random memory garbage as a source of randomness
83.1.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83.2 Mandelbrot set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83.2.1 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
83.2.2 Let’s get back to the demo . . . . . . . . . . . . . . . . . . . . . .
83.2.3 My “fixed” version . . . . . . . . . . . . . . . . . . . . . . . . . . .

IX

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.

Examples of reversing proprietary file formats

821
821
821
822
822
823
824
825
830
832

834

84 Primitive XOR-encryption
84.1 Norton Guide: simplest possible 1-byte XOR encryption
84.1.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . .
84.2 Simplest possible 4-byte XOR encryption . . . . . . . . .
84.2.1 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

835
836
837
839
842

85 Millenium game save file

843

86 Oracle RDBMS: .SYM-files

850

87 Oracle RDBMS: .MSB-files
859
87.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864

X

Other things

865

88 npad

866

89 Executable files patching
868
89.1 Text strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
89.2 x86 code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 868
90 Compiler intrinsic

869

91 Compiler’s anomalies

870

xv

CONTENTS

92 OpenMP
871
92.1 MSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
92.2 GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 874
93 Itanium

877

94 8086 memory model

880

95 Basic blocks reordering
881
95.1 Profile-guided optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 881

XI

Books/blogs worth reading

96 Books
96.1 Windows . . .
96.2 C/C++ . . . . .
96.3 x86 / x86-64
96.4 ARM . . . . . .
96.5 Cryptography

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

883
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

884
884
884
884
884
884

97 Blogs
885
97.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
98 Other

XII

886

Exercises

887

99 Level 1
889
99.1 Exercise 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
100Level 2
100.1Exercise 2.1 . . . . . . . . . . . . . . . . . . . . . . . . .
100.1.1Optimizing MSVC 2010 x86 . . . . . . . . . .
100.1.2Optimizing MSVC 2012 x64 . . . . . . . . . .
100.2Exercise 2.4 . . . . . . . . . . . . . . . . . . . . . . . . .
100.2.1Optimizing MSVC 2010 . . . . . . . . . . . . .
100.2.2GCC 4.4.1 . . . . . . . . . . . . . . . . . . . . . .
100.2.3Optimizing Keil (ARM mode) . . . . . . . . . .
100.2.4Optimizing Keil (thumb mode) . . . . . . . .
100.2.5Optimizing GCC 4.9.1 (ARM64) . . . . . . . .
100.2.6Optimizing GCC 4.4.5 (MIPS) . . . . . . . . . .
100.3Exercise 2.6 . . . . . . . . . . . . . . . . . . . . . . . . .
100.3.1Optimizing MSVC 2010 . . . . . . . . . . . . .
100.3.2Optimizing Keil (ARM mode) . . . . . . . . . .
100.3.3Optimizing Keil (thumb mode) . . . . . . . .
100.3.4Optimizing GCC 4.9.1 (ARM64) . . . . . . . .
100.3.5Optimizing GCC 4.4.5 (MIPS) . . . . . . . . . .
100.4Exercise 2.13 . . . . . . . . . . . . . . . . . . . . . . . .
100.4.1Optimizing MSVC 2012 . . . . . . . . . . . . .
100.4.2Keil (ARM mode) . . . . . . . . . . . . . . . . .
100.4.3Keil (thumb mode) . . . . . . . . . . . . . . . .
100.4.4Optimizing GCC 4.9.1 (ARM64) . . . . . . . .
100.4.5Optimizing GCC 4.4.5 (MIPS) . . . . . . . . . .
100.5Exercise 2.14 . . . . . . . . . . . . . . . . . . . . . . . .
100.5.1MSVC 2012 . . . . . . . . . . . . . . . . . . . .
100.5.2Keil (ARM mode) . . . . . . . . . . . . . . . . .
100.5.3GCC 4.6.3 for Raspberry Pi (ARM mode) . . .
100.5.4Optimizing GCC 4.9.1 (ARM64) . . . . . . . .
100.5.5Optimizing GCC 4.4.5 (MIPS) . . . . . . . . . .
100.6Exercise 2.15 . . . . . . . . . . . . . . . . . . . . . . . .
100.6.1Optimizing MSVC 2012 x64 . . . . . . . . . .
100.6.2Optimizing GCC 4.4.6 x64 . . . . . . . . . . .
100.6.3Optimizing GCC 4.8.1 x86 . . . . . . . . . . .
100.6.4Keil (ARM mode): Cortex-R4F CPU as target

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

xvi

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

890
890
890
891
891
891
892
893
894
894
895
896
896
897
897
898
899
899
899
899
900
900
900
900
901
901
902
903
903
905
905
907
908
909

CONTENTS

100.6.5Optimizing GCC 4.9.1 (ARM64) . . .
100.6.6Optimizing GCC 4.4.5 (MIPS) . . . . .
100.7Exercise 2.16 . . . . . . . . . . . . . . . . . . .
100.7.1 Optimizing MSVC 2012 x64 . . . . .
100.7.2 Optimizing Keil (ARM mode) . . . . .
100.7.3 Optimizing Keil (thumb mode) . . .
100.7.4 Non-optimizing GCC 4.9.1 (ARM64)
100.7.5 Optimizing GCC 4.9.1 (ARM64) . . .
100.7.6 Non-optimizing GCC 4.4.5 (MIPS) . .
100.8Exercise 2.17 . . . . . . . . . . . . . . . . . . .
100.9Exercise 2.18 . . . . . . . . . . . . . . . . . . .
100.10
Exercise 2.19 . . . . . . . . . . . . . . . . . . .
100.11
Exercise 2.20 . . . . . . . . . . . . . . . . . . .
101Level 3
101.1Exercise 3.2
101.2Exercise 3.3
101.3Exercise 3.4
101.4Exercise 3.5
101.5Exercise 3.6
101.6Exercise 3.8

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.

910
911
912
912
913
913
913
914
916
917
918
918
918

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.
.
.

919
919
919
919
919
920
920

102crackme / keygenme

921

Afterword

923

103Questions?

923

Appendix
A x86
A.1 Terminology . . . . . . . . . . . .
A.2 General purpose registers . . .
A.2.1 RAX/EAX/AX/AL . . . . .
A.2.2 RBX/EBX/BX/BL . . . .
A.2.3 RCX/ECX/CX/CL . . . . .
A.2.4 RDX/EDX/DX/DL . . . .
A.2.5 RSI/ESI/SI/SIL . . . . . .
A.2.6 RDI/EDI/DI/DIL . . . . .
A.2.7 R8/R8D/R8W/R8L . . .
A.2.8 R9/R9D/R9W/R9L . . .
A.2.9 R10/R10D/R10W/R10L
A.2.10 R11/R11D/R11W/R11L
A.2.11 R12/R12D/R12W/R12L
A.2.12 R13/R13D/R13W/R13L
A.2.13 R14/R14D/R14W/R14L
A.2.14 R15/R15D/R15W/R15L
A.2.15 RSP/ESP/SP/SPL . . . .
A.2.16 RBP/EBP/BP/BPL . . . .
A.2.17 RIP/EIP/IP . . . . . . . .
A.2.18 CS/DS/ES/SS/FS/GS . .
A.2.19 Flags register . . . . . .
A.3 FPU registers . . . . . . . . . . .
A.3.1 Control Word . . . . . .
A.3.2 Status Word . . . . . . .
A.3.3 Tag Word . . . . . . . . .
A.4 SIMD registers . . . . . . . . . .
A.4.1 MMX registers . . . . . .
A.4.2 SSE and AVX registers .
A.5 Debugging registers . . . . . . .
A.5.1 DR6 . . . . . . . . . . . .
A.5.2 DR7 . . . . . . . . . . . .
A.6 Instructions . . . . . . . . . . . .

925
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

xvii

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

925
925
925
925
925
926
926
926
926
926
926
926
927
927
927
927
927
927
927
928
928
928
929
929
929
930
930
930
930
930
930
931
931

CONTENTS

A.6.1
A.6.2
A.6.3
A.6.4
A.6.5

Prefixes . . . . . . . . . . . . . . . . . . . . . . .
Most frequently used instructions . . . . . .
Less frequently used instructions . . . . . . .
FPU instructions . . . . . . . . . . . . . . . . .
Instructions having printable ASCII opcode

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

931
932
936
940
941

B ARM
B.1 Terminology . . . . . . . . . . . . . . . . . . . . . . .
B.2 Versions . . . . . . . . . . . . . . . . . . . . . . . . .
B.3 32-bit ARM (AArch32) . . . . . . . . . . . . . . . . .
B.3.1 General purpose registers . . . . . . . . .
B.3.2 Current Program Status Register (CPSR)
B.3.3 VFP (floating point) and NEON registers
B.4 64-bit ARM (AArch64) . . . . . . . . . . . . . . . . .
B.4.1 General purpose registers . . . . . . . . .
B.5 Instructions . . . . . . . . . . . . . . . . . . . . . . .
B.5.1 Conditional codes table . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

944
944
944
944
944
945
945
945
945
946
946

C MIPS
C.1 Registers . . . . . . . . . . . . . . . . . . . .
C.1.1 General purpose registers GPR7 .
C.1.2 Floating-point registers . . . . . .
C.2 Instructions . . . . . . . . . . . . . . . . . .
C.2.1 Jump instructions . . . . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

947
947
947
947
947
948

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

D Some GCC library functions

949

E Some MSVC library functions

950

F Cheatsheets
F.1 IDA . . .
F.2 OllyDbg
F.3 MSVC . .
F.4 GCC . . .
F.5 GDB . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

951
951
951
952
952
952

G Exercise solutions
G.1 Per chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.1 “Stack” chapter . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.2 “switch()/case/default” chapter . . . . . . . . . . . . . . . . .
G.1.3 Exercise #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.4 “Loops” chapter . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.5 Exercise #3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.6 Exercise #4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.7 “Simple C-strings processing” chapter . . . . . . . . . . . . .
G.1.8 “Replacing arithmetic instructions to other ones” chapter
G.1.9 “Floating-point unit” chapter . . . . . . . . . . . . . . . . . .
G.1.10 “Arrays” chapter . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.1.11 “Manipulating specific bit(s)” chapter . . . . . . . . . . . . .
G.1.12 “Structures” chapter . . . . . . . . . . . . . . . . . . . . . . . .
G.1.13 “Obfuscation” chapter . . . . . . . . . . . . . . . . . . . . . . .
G.1.14 “Division by 9” chapter . . . . . . . . . . . . . . . . . . . . . .
G.2 Level 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.2.1 Exercise 1.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.2.2 Exercise 1.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3 Level 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.1 Exercise 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.2 Exercise 2.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.3 Exercise 2.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.4 Exercise 2.13 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.5 Exercise 2.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.6 Exercise 2.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.7 Exercise 2.16 . . . . . . . . . . . . . . . . . . . . . . . . . . . .
G.3.8 Exercise 2.17 . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.

954
954
954
954
954
954
954
955
955
955
955
955
956
958
959
959
959
959
959
959
959
959
960
960
960
960
960
961

7 General

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

Purpose Registers
xviii

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

. . . . . . . . . . . .4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Exercise 3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Other . . . . . . . . .3 Exercise 3. . . . . . . . . . G. . . . .10 Exercise 2. . . . . .4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . G. . . . . . . . . . . . . . . . . . . . . . . . . . G. . . . . . . . . . . . . . . . . . . . .3. . . . . . . . . . . . . . .19 . . . .20 . . . .3. G. . . . . . . . . . . . . . . . . . . G. G. . . . . . . . . . . . . . . G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11 Exercise 2. . . . . . .18 . . . G. . . . . . . . . .4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .9 Exercise 2. G. .CONTENTS G. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 . . . . .3. . . . . . . . . . . . .8 . . . . . . . . . .3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 . . . . . . . . . G. . . . . . . . . . . . . . . . . . .4. . . . . . . . . . . . . . . Acronyms used . . . . . . . . . . . . . . . . . . . .4 Level 3 . . . . . . . . . . . . . .4. . . G. . . . . . . . . . . . . . . . . . . . . . . . . . .4. . . . . . . . . . 961 961 961 961 961 961 961 961 961 961 962 962 964 Glossary 968 Index 969 xix . . . . . . . . . . . . . . . .5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Exercise 3. . . . . . . . . .1 “Minesweeper (Windows XP)” example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 Exercise 3. . . . . . . . . . . . . . . . . . . . .6 . . . . . . .2 Exercise 3. . . . . . . . . .6 Exercise 3. . . . .

(technical) writing takes a lot of effort and work. It is also ad-free. Vladimir Botov.1 on page 881). Topics touched Oracle RDBMS ( 81 on page 806). Yuan Jochen Kang. TLS10 . For helping me in other ways: Andrew Zubinski. For translating to Korean: Byungho Min. copy-protection dongles ( 78 on page 744).re. Ways to donate are available on the page: beginners. Arnaud Patard (rtp on #debian-arm IRC). Mark “Logxen” Cooper. About the author Dennis Yurichev is an experienced reverse engineer and programmer. Many LATEX packages were used: I would like to thank the authors as well. critical sections ( 68.1 on page 678 12 GitHub 13 GitHub 9 Executable xx . ELF9 . Mal Malakov. position-independent code (PIC11 ) ( 67. For proofreading: Alexander ”Lstar” Chernenkiy. x86-64 ( 26. you may consider donating. 2) 3D model scanning and reworking in order to make a copy of it.2 on page 686).1 Donate As it turns out. 0. 8 Database management systems file format widely used in *NIX system including Linux 10 Thread Local Storage 11 Position Independent Code: 67. profile-guided optimization ( 95. Vasil Kolev did a great amount of work in proofreading and correcting many mistakes. Changmin Heo.CONTENTS Preface There are several popular meanings of the term “reverse engineering”: 1) reverse engineering of software: researching compiled programs. OpenMP ( 92 on page 871). syscalls ( 66 on page 676).com who have contributed notes and corrections. LD_PRELOAD ( 67.0. For sending me notes about mistakes and inaccuracies: Stanislav ”Beaver” Bobrytskyy. C++ STL ( 51.4 on page 714). He can be contacted by email: dennis(a)yurichev.com. available freely and available in source code form 12 (LaTeX). Thanks also to all the folks on github. Andrei Brazhuk. Java/JVM. For illustrations and cover art: Andy Nechaevsky. stack overflow.yurichev. For translating to Chinese simplified: Xian Chi. or Skype: dennis. Zhu Ruijin.2 on page 680).3 on page 692). This book is related to the first meaning. Alexander Lysenko. This book is free.1 on page 426). My current plan for this book is to add lots of information about: PLANS13 . Every donor’s name will be included in the book! Donors also have a right to ask me to rearrange items in my writing plan.4 on page 556). MIPS. If you want me to continue writing on all these topics. 3) recreating DBMS8 structure. and it will stay so forever. win32 PE file format ( 68. Itanium ( 93 on page 877). Topics discussed x86/x64. Shell Rocket. SEH ( 68.1 on page 678). ARM/ARM64. Thanks For patiently answering all my questions: Andrey “herm1t” Baranovich. Slava ”Avid” Kazakov.

Therefore. mini-FAQ Q: Why one should learn assembly language these days? A: Unless you are OS developer. Praise for Reverse Engineering for Beginners • “It’s very well done . Sergey Lukianov (300 RUR).com) or ask your question at my forum: forum. Georgia. • “.) 14 A very good text about this topic: [Fog13b] processing unit 16 reddit 17 twitter 18 twitter 19 reddit 15 Central xxi . Q: I want to translate your book to some other language. A: Write me it by email (dennis(a)yurichev.CONTENTS Donors 22 * anonymous. Richard S Shultz ($20).. US. LLC.. Someone may also want to build their own version of book.. Sergey Volchkov (10 AUD). Technical Manager. great job!” Michael Sikorski. Sri Harsha Kandrakota (10 AUD). Triop AB (100 SEK). Maxim Dyakonov ($3). full professor at the Vrije Universiteit Amsterdam. Jiarui Hong (100. how to get back? A: (Adobe Acrobat Reader) Alt + LeftArrow Q: Your book is so huge! Is there anything shorter? A: There is shortened lite version: http://beginners. amazing. Also. Bayna AlZaabi ($75). security/malware research. Network & Information Security at Verizon Business. Pawel Szczur (40 CHF). you probably don’t need to write in assembly: modern compilers perform optimizations much better than humans do. Oracle RDBMS security guru. Vankayala Vigneswararao ($50). (25 EUR).. • “. It is amazing and unbelievable. Redfive B. that’s why book is licensed under Creative Commons terms. my compliments for the very nice tutorial!” Herbert Bos. • “.”19 (Mike Stay. Try looking there. 2014). Luis Rocha ($63). Ki Chan Ahn ($50). Carlos Garcia Prado (10 EUR). Q: I clicked on hyperlink inside of PDF-document. Joris van de Vis ($127). Q: How would one find a reverse engineering job? A: There are hiring threads that appear from time to time on reddit devoted to RE16 (2013 Q3. Ange Albertini (10 EUR). • “.. That said. Siege Technologies. Ludvig Gislason (200 SEK).com.00 SEK). • “. teacher at the Federal Law Enforcement Training Center. Victor Cazacov (5 EUR). Jeremy Brown ($100). Justin Simms ($20). Jim_Di (500 RUR). which is why there are many examples of compiler output.” Joris van de Vis. Pillay Harish (10 SGD). CISSP / ISSAP.. reasonable intro to some of the techniques. book is interesting. there are at least two areas where a good understanding of assembly may help: first..V. Vladimir Dikovski (50 EUR). Tobias Sturzenegger (10 CHF). SAP Netweaver & Security specialist.re/#lite. Martin Haeberli ($10). 2 * Oleg Vygovsky (50+100 UAH). Second. this book is intended for those who want to understand assembly language rather than to write in it. Daniel Bilar ($50). Sonny Thai ($15).excellent and free”18 Pete Finnigan.5). Salikov Alexander (500 RUR). gaining a better understanding of your compiled code while debugging.... • “Thanks for the great work and your book. Q: I have a question.. Nicolas Werner (12 EUR). Yao Xiao ($10).. read here about it. Shade Atlas (5 AUD).. Gérard Labadie (40 EUR). author of Practical Malware Analysis: The Hands-On Guide to Dissecting Malicious Software.. A: Read my note to translators.”17 Daniel Bilar. Jang Minchang ($20).yurichev. Timur Valiev (230 RUR). Joona Oskari Heikkilä (5 EUR).” Luis Rocha. A somewhat related hiring thread can be found in the “netsec” subreddit: 2014 Q2. James Truscott ($4. Shawn the R0ck ($27). co-author of Modern Operating Systems (4th Edition). Alexandre Borges ($25). Marshall Bishop ($50). Katy Moe ($14). Philippe Teuwen ($4). Tan Vincent ($30). Q: May I print this book? Use it for teaching? A: Of course. and for free . modern CPU15 s are very complex devices and assembly knowledge would not help you understand its internals. Oliver Whitehouse (30 GBP). 14 .

Oracle RDBMS performance tuning expert.CONTENTS • “I love this book! I have several students reading it at the moment. Translator is Byungho Min (twitter/tais9). They are also the Korean translation copyright holder.co. plan to use it in graduate course. Cover pictures was done by my artist friend Andy Nechaevsky : facebook/andydinka. 20 twitter 21 twitter xxii . Acorn publishing company (www.kr) in South Korea did huge amount of work in translating and publishing my book (state which is it in August 2014) in Korean language. Now it’s available at their website.acornpub. now you may buy it. About Korean translation In January 2015.”20 (Sergey Bratus. Research Assistant Professor at the Computer Science Department at Dartmouth College) • “Dennis @Yurichev has published an impressive (and free!) book on reverse engineering”21 Tanel Poder. So if you want to have a real book on your shelf in Korean language and/or want to support my work.

Part I Code patterns 1 .

This was much easy for me to understand 22 . 2 . in order to get the shortest (or simplest) possible code snippet. Therefore. This probably is not worth doing today in real-world scenarios (because it’s hard to compete with modern compilers on efficiency). and look at the assembly language output. trying to make their code as short as possible. 22 Frankly speaking. That is why when I look at assembly code I could instantly grasp what the original C code in general does and looks like. please also do not forget about testing your results! Difference between non-optimized (debug) and optimized (release) versions A non-optimizing compiler works faster and produces more understandable (verbose.When I first started learning C and then C++. An optimizing (release) compiler works slower and tries to produce faster (but not necessarily smaller) code. That’s why I try to give examples of both versions of code. One important feature of the debugging code is that there might be debugging information showing connections between each line in source code and machine code addresses. but it’s a very good method to learn assembly better. Sometimes I use ancient compilers. you can take any assembly code from this book and try to make it shorter. Perhaps this technique could be helpful for someone else and so I’m going to describe some examples here. Exercises When I studied assembly language. though) code. I still do so when I can’t understand what some code does. However. I also often compiled small C-functions and then rewrote them gradually to assembly. I did it so many times that the relation between the C/C++ code and what the compiler produced was imprinted deeply in my mind. Optimizing compilers tend to produce such output where whole source code lines may be optimized away and not present in the resulting machine code. compile them. I used to write small pieces of code. Reverse engineer practitioner usually encounters either version because some developers turn on compiler’s optimization switches and others do not.

Assembly language : mnemonic code and some extensions like macros which are intended to make a programmer’s life easier. 4 These are MOV/PUSH/CALL/Jcc 2 Reduced 3 . Each instruction is usually encoded by several bytes. Thumb mode (including Thumb-2) and ARM64. This is now referred as “ARM mode”. so. each CPU has its own instruction set architecture (ISA). ARM is a RISC2 CPU designed with constant opcode length in mind. is called a compiler. it would be possible to invent a CPU which can execute high-level PL code. CPU register : Each CPU has a fixed set of general purpose registers (GPR). fixed-length instructions are handy in a way that one can calculate the next (or previous) instruction’s address without effort. the x64 extensions did not affect the ISA very much. arithmetic primitives. This ISA has 4-byte opcodes again. it is very hard to do it without making a huge amount of annoying mistakes. So they added another ISA called Thumb. for example MIPS. On the contrary. Perhaps. but I would rather say they are different ISAs than variations of the same one. Short glossary: Instruction : a primitive CPU command. but it is easier for a CPU to use a much lower level of abstraction. This feature will be discussed in switch() ( 13. ≈ 8 in x86. Imagine you are working with a high-level PL1 and you have only 8 32-bit (or 64-bit) variables. Code compiled for ARM mode and Thumb mode may coexist in one program. it is very inconvenient for humans to use assembly language due to it being low-level. This is incorrect. In fact..1 A couple of words about different ISA x86 was always an ISA with variable-length opcodes. Then they thought it wasn’t very frugal. which executes all of the programs. Thumb-2 now competes with the original ARM mode. The easiest way to understand a register is to think of it as an untyped temporary variable. but some new instructions have a size of 4 bytes. Therefore. However. so Thumb instruction set is somewhat limited. Besides. Thumb-2 is still 2-byte instructions. PowerPC and Alpha AXP. Then ARM creators thought Thumb could be extended: Thumb-2 appeared (in ARMv7). 1. Machine code : code for the CPU. summarizing. A SHORT INTRODUCTION TO THE CPU Chapter 1 A short introduction to the CPU The CPU is the unit. of course. There is a common misconception that Thumb-2 is a mix of ARM and Thumb. Simplest examples: moving data between registers. which converts high-level PL code into assembly. There are many other RISC ISAs with fixed length 32-bit opcodes. ≈ 16 in ARM. etc. without any additional Thumb mode. working with memory. These ISAs intersect partially. 1 Programming language instruction set computing 3 By the way. Thumb-2 was extended to fully support all processor features so it could compete with ARM mode. we now have 3 ARM instruction sets: ARM mode. most used CPU instructions4 in real world applications can be encoded using less information. not all ARM instructions can be encoded in just 2 bytes. Java. where each instruction was encoded in just 2 bytes. ARM had all instructions encoded in 4 bytes 3 . ≈ 16 in x86-64.2. But the 64bit requirements affected ISA. The program.CHAPTER 1. The majority of iPod/iPhone/iPad applications are compiled for the Thumb-2 instruction set because Xcode does this by default. so when the 64-bit era came. This is now referred as “Thumb mode”. I try to add fragments of code in all 3 ARM ISAs in this book. At the very beginning. Rather. but it would be much more complex. Many things can be done using only these! What is the difference between machine code and a PL? It is much easier for humans to use a high-level PL like C/C++.2 on page 158) section. Python. x86 has a lot of instructions that appeared in 16-bit 8086 CPU and are still present in latest CPUs. which had some advantages in the past. On instruction set richness. As a rule. Later the 64-bit ARM came out.

3 MIPS There are two ways of registers naming that are used in the MIPS world.3: Optimizing GCC 4.CHAPTER 2. }.#0x7b . There are just two instructions: the first places the value 123 into the EAX register. which is used by convention for storing the return value and the second one is RET. The return address (RA1 ) is not saved on the local stack in ARM. 2. 2. SIMPLEST POSSIBLE FUNCTION Chapter 2 Simplest possible function Probably the simplest possible function is that one which just returns some constant value. So the BX LR instruction is jumping to that address. It has to be noted that MOV is a confusing name for the instruction in both x86 and ARM ISAs. so 123 is stored into R0 here. 123 lr ARM uses register R0 for returning results. returning execution to the caller. effectively. but rather in the LR2 register. In fact. Here it is: int f() { return 123.5 (assembly output) j li $31 $2. data is not moved. The caller will take the result from the EAX register.1: Optimizing GCC f: mov ret eax. $A0. Assembly listing generated by GCC shows registers by number: Listing 2.1 x86 And that’s what the optimizing GCC compiler produces: Listing 2.2: Optimizing Keil 6/2013 (ARM mode) f PROC MOV BX ENDP r0.2 ARM What about ARM? Listing 2. 123 MSVC’s result is exactly the same. it’s rather copied.4. etc).123 # 0x7b 1 Return 2 Link Address Register 4 . 2. By number (from $0 to $31) or by pseudoname ($V0. which returns execution to the caller.

The other instruction is jump instruction (J or JR) which returns execution flow to the caller.3.4: Optimizing GCC 4. 3 Interactive Disassembler and debugger developed by Hex-Rays 5 . But why the load instruction (LI) and the jump instruction (J or JR) are swapped? This is merely RISC feature called “branch delay slot”. SIMPLEST POSSIBLE FUNCTION 2. 0x7B So the $2 (or $V0) register is used to store the function result. We don’t need to get into details here. branch instruction always swap place with the instruction. This is the register analogous to LR in ARM.3. the instruction following jump or branch instruction is executed before the jump/brunch instruction itslef.CHAPTER 2. 2. Hence. because the instruction and register names of other ISAs are all written in uppercase in this book.4.5 (IDA) jr li $ra $v0. MIPS …while IDA3 — by pseudoname: Listing 2. jumping to the address in $31 (or $RA) register. which must be executed beforehand.1 Note about MIPS instruction/register names Register names and instruction names in MIPS world are traditionally written in lowercase. But I’ve decided to stick to uppercase. We just need to remember that: in MIPS. LI is “Load Immediate”.

HELLO. In our case. The compiler generated 1. eax pop ebp ret 0 _main ENDP _TEXT ENDS MSVC produces assembly listings in Intel-syntax. p176. The difference between Intel-syntax and AT&T-syntax will be discussed in 3.cpp /Fa1.2]. world\n". Function compile flags: /Odtp _TEXT SEGMENT _main PROC push ebp mov ebp. 7. return 0. esp push OFFSET $SG3830 call _printf add esp. however it does not have its own name.h> int main() { printf("hello. 00H CONST ENDS PUBLIC _main EXTRN _printf:PROC . the file contains two segments: CONST (for data constants) and _TEXT (for code).exe.1 3.1 x86 MSVC Let’s compile it in MSVC 2010: cl 1. world'' in C/C++ has type const char[] [Str13.3 on the next page.h> const char $SG3830[]="hello. } 3. 6 . That is why the example may be rewritten as follows: #include <stdio.1.1: MSVC 2010 CONST SEGMENT $SG3830 DB 'hello. world'.CHAPTER 3.3.asm (/Fa option instructs the compiler to generate assembly listing file) Listing 3. The string ``hello. world\n"). 0AH. 4 xor eax.obj file. which is to be linked into 1. The compiler needs to deal with the string somehow so it defines the internal name $SG3830 for it.1. WORLD! Chapter 3 Hello. world! Let’s continue with the famous example from the “The C programming Language”[Ker88] book: #include <stdio.

we need exactly 4 bytes for address passing through the stack. WORLD! 3. ADD ESP. are modified 3 wikipedia 4 C runtime library : 68.g. If it was x64 code we would need 8 bytes. with the assistance of the IDA disassembler.text:0800029A .3: code in IDA main proc near var_10 = dword ptr -10h push mov and sub mov mov ebp ebp. let’s see how the main() function was created. The Intel C++ compiler probably uses POP ECX since this instruction’s opcode is shorter than ADD ESP.2 Linux (app.text:080002A0 push call pop ebx qksfroChild ecx Read more about the stack in section ( 5 on page 22). } Let’s go back to the assembly listing. Some compilers (like the Intel C++ Compiler) in the same situation may emit POP ECX instead of ADD (e.4. there is only one function so far: main(). which means SUBtract the value in the EAX from the value in EAX. In the code segment. Here is an example: Listing 3. "hello.B. the string is terminated by a zero byte. The function main() starts with prologue code and ends with epilogue code (like almost any function) 1 . Usually. This instruction has almost the same effect but the ECX register contents will be overwritten. offset aHelloWorld .1. return 0.1 compiler in Linux: gcc 1. 0FFFFFFF0h esp. the stack pointer (the ESP register) needs to be corrected. world\n" [esp+10h+var_10]. like MSVC. 0 —again because it is a slightly shorter opcode (2 bytes against 5). (IDA.2 GCC Now let’s try to compile the same C/C++ code in the GCC 4. Some compilers emit SUB EAX. 2 CPU 7 . After calling printf(). ``ADD ESP. When the printf() function returns the control to the main() function. the string address (or a pointer to it) is still on the stack. The last instruction RET returns the control to the caller. the original C/C++ code contains the statement return 0 —return 0 as the result of the main() function. however.1. 4'' is effectively equivalent to ``POP register'' but without using any register2 . such a pattern can be observed in the Oracle RDBMS code as it is compiled with the Intel C++ compiler).1. X86 int main() { printf($SG3830). After the function prologue we see the call to the printf() function: CALL _printf.2: Oracle RDBMS 10.o file) . Why 4? Since this is 32-bit program. flags. EAX. Before the call the string address (or a pointer to it) containing our greeting is placed on the stack with the help of the PUSH instruction. Listing 3. 10h eax. this is C/C++ CRT4 code. Since we do not need it anymore. esp esp.c -o 1 Next. In the generated code this is implemented by the instruction XOR EAX.1 on page 646. We could also have GCC produce assembly listings in Intel-syntax by applying the options -S -masm=intel. As we can see..1 on page 683 5 Operating System 6 N. which in its turn returns the control to the OS5 . which is standard for C/C++ strings. eax 1 You can read more about it in section about function prolog and epilog ( 4 on page 21). 3. EAX XOR is in fact just “eXclusive OR” 3 but the compilers often use it instead of MOV EAX. HELLO. x (1 byte against 3).CHAPTER 3. uses Intel-syntax) 6 .text:0800029B . 4 means add 4 to the ESP register value. _TEXT. which in any case resulting in zero. More about C strings: 57.

it emits MOV EAX. For now. Then the printf() function is called.CHAPTER 3. Unlike MSVC.3 GCC: AT&T syntax Let’s see how this can be represented in assembly language AT&T syntax. @function main: .7.cfi_startproc pushl %ebp . 0 instead of a shorter opcode. (%esp) call printf movl $0.GNU-stack. .1.cfi_def_cfa_register 5 andl $-16. the function prologue contains AND ESP. WORLD! call mov leave retn endp main 3.3 . only 4 are necessary here.3 gcc -S 1_1. 0 The result is almost the same. The last instruction.cfi_endproc . LEAVE —is the equivalent of the MOV ESP. X86 _printf eax. .section .c We get this: Listing 3. ESP / AND ESP.string "hello. The address of the “hello. 3."".5: GCC 4.). These are not interesting for us at the moment. The string address (or a pointer to the string) is then stored directly onto the stack without using the PUSH instruction. -8 movl %esp.size main.7..7.section . 0FFFFFFF0h —this instruction aligns the ESP register value on a 16-byte boundary. This syntax is much more popular in the UNIXworld.@progbits The listing contains many macros (beginning with dot). This results in all values in the stack being aligned the same way (The CPU performs better if the values it is dealing with are located in memory at addresses aligned on a 4. HELLO.LC0.globl main . %esp subl $16. %esp movl $.rodata . 10h allocates 16 bytes on the stack. Read about it below. This is necessary since we modified these register values (ESP and EBP) at the beginning of the function (executing MOV EBP. for the sake of simplification. This is because the size of the allocated stack is also aligned on a 16-byte boundary. we can ignore them (except the .1.file "1_1. In addition.cfi_def_cfa_offset 8 .string macro which encodes a null-terminated character sequence just like a C-string).text .-main .LFE0: . world” string (stored in the data segment) is loaded in the EAX register first and then it is saved onto the stack. world\n" . 4 ret . %eax leave .note.LFB0: .type main. var_10 —is a local variable and is also an argument for printf(). Listing 3.3-1ubuntu1) 4. %ebp . as we can see hereafter. SUB ESP.cfi_restore 5 .4: let’s compile in GCC 4.c" . this instruction sets the stack pointer (ESP) back and restores the EBP register to its initial state. Then we’ll see this 7 Wikipedia: Data structure alignment 8 .3" .LC0: .cfi_def_cfa 4.7.ident "GCC: (Ubuntu/Linaro 4. EBP and POP EBP instruction pair —in other words.. Although. when GCC is compiling without optimization turned on.cfi_offset 5.or 16-byte boundary)7 .

eax rsp.CHAPTER 3.LC0. In AT&T syntax: <instruction> <source operand> <destination operand>. %ebp $-16.2.7.1 x86-64 MSVC—x86-64 Let’s also try 64-bit MSVC: Listing 3. X86-64 : Listing 3.6: GCC 4. %esp $. strcpy()) the arguments are listed in the same way as in Intel-syntax: pointer to the destination memory block at the beginning and then pointer to the source memory block. 40 0 8 This GCC option can be used to eliminate “unnecessary” macros: -fno-asynchronous-unwind-tables By the way. (%esp) printf $0.LC0: . Here is a way to easy memorise the difference: when you deal with Intel-syntax. 40 rcx. memcpy(). %esp $16. a percent sign must be written (%) and before numbers a dollar sign ($).2 3. One more thing: the return value is to be set to 0 by using usual MOV. %eax Some of the major differences between Intel and AT&T syntax are: • Source and destination operands are written in opposite order. WORLD! 8 3. Parentheses are used instead of brackets.. Its name is misnomer (data is not moved but rather copied). In Intel-syntax: <instruction> <destination operand> <source operand>. In other architectures. 3. • AT&T: Suffix is added to instructions to define the operand size: – q — quad (64 bits) – l — long (32 bits) – w — word (16 bits) – b — byte (8 bits) Let’s go back to the compiled result: it is identical to what we saw in IDA. HELLO.string "hello. OFFSET FLAT:$SG2989 printf eax. not XOR. -0x10 is equal to 0xFFFFFFF0 (for a 32-bit data type). this instruction is named “LOAD” and/or “STORE” or something similar. 0AH. world\n" main: pushl movl andl subl movl call movl leave ret %ebp %esp. It is the same thing: 16 in the decimal system is 0x10 in hexadecimal.g.2. world'. With one subtle difference: 0FFFFFFF0h is presented as $-16.7: MSVC 2012 x64 $SG2989 DB main main PROC sub lea call xor add ret ENDP 'hello. in some C standard functions (e. 00H rsp. • AT&T: Before register names.3 . MOV just loads value to a register. 9 9 . you can imagine that there is an equality sign (=) between operands and when you deal with AT&T-syntax imagine there is a right arrow (→) 9 .

text:00000000004004D0 . "hello. In Win64. all registers were extended to 64-bit and now their names have an R. "hello. the rest—via the stack. HELLO. I. *BSD and Mac OS X [Mit13].text:00000000004004D0 . But why not use the 64-bit part. 8 endp As we can see.3 on page 664 ). it can be sure that the data segment containing the string will not be allocated at the addresses higher than 4GiB. 4 function arguments are passed in RCX. eax rsp.6 x64 . Besides. GCC is trying to save some space. to access external memory/cache less often). so they are passed in the 64-bit registers (which have the R.4.e.string "hello. RDI? It is important to keep in mind that all MOV instructions in 64-bit mode that write something into the lower 32-bit register part. However. 8 edi.2. This is called “shadow space”. This is how RAX/EAX/AX/AL looks like in 64-bit x86-compatible CPUs: 7th (byte number) 6th 5th 4th RAXx64 3rd 2nd 1st 0th EAX AH AX AL The main() function returns an int-typed value. Apparently.prefix. which is. main: sub mov xor call xor add ret world\n" rsp. R8. R9 registers. offset format . The pointers are 64-bit now. The first 6 arguments are passed in the RDI. world\n" eax.prefix..text:00000000004004E6 48 BF 31 E8 31 48 C3 83 E8 C0 D8 C0 83 EC 08 05 40 00 FE FF FF C4 08 main sub mov xor call xor add retn main proc near rsp. RCX. also clear the higher 32-bits[Int13]. R9 registers. still 32-bit.. 011223344h writes a value into RAX correctly. In order to use the stack less often (in other words. 8 edi. eax _printf eax. 10 This must be enabled in Options → Disassembly → Number of opcode bytes 10 . RDX. So the pointer to the string is passed in EDI (32-bit part of register). That is what we see here: a pointer to the string for printf() is now passed not in stack.e.text:00000000004004DB .text:00000000004004E2 . 3. eax . We also see that the EAX register was cleared before the printf() function call. so that is why the EAX register is cleared at the function end (i. WORLD! 3. eax rsp. If we open the compiled object file (. and the rest—via the stack. 8 A method to pass function arguments in registers is also used in Linux. RSI. R8. This is done because the number of used vector registers is passed in EAX by standard: “with variable arguments passes information about the number of vector registers used”[Mit13]. a part of the function arguments are passed in registers.o).LC0 .8: GCC 4. OFFSET FLAT:. the instruction that writes into EDI at 0x4004D4 occupies 5 bytes.9: GCC 4.2.text:00000000004004E0 . number of vector registers passed printf eax. we can also see all instruction’s opcodes 10 : Listing 3. it is still possible to access the 32-bit parts.4.2 GCC—x86-64 Let’s also try GCC in 64-bit Linux: Listing 3.. world\n" eax.CHAPTER 3. the MOV EAX. there exists a popular way to pass function arguments via registers (fastcall : 64.1 on page 89.text:00000000004004D9 . using the E.text:00000000004004E6 . The same instruction writing a 64-bit value into RDI occupies 7 bytes. X86-64 In x86-64. for backward compatibility.prefix).2. but in the RCX register. RDX.6 x64 . in the C/C++.e. I. about which we are going to talk later: 8.text:00000000004004D4 . There are also 40 bytes allocated in the local stack. 32-bit part of register) instead of RAX. for better backward compatibility and portability. since the higher bits will be cleared.

0xa. "world\n" _puts esp. these two words are positioned in memory adjacently and puts() called from f2() function is not aware this string is divided. puts() is not aware there is something before this string! This clever trick is often used by at least GCC and can save some memory. 1Ch [esp+1Ch+s]. } int main() { f1().1 + IDA listing f1 proc near s = dword ptr -1Ch f1 sub mov call add retn endp f2 proc near s = dword ptr -1Ch f2 sub mov call add retn endp aHello s db 'hello ' db 'world'. } Common C/C++-compilers (including MSVC) allocates two strings. • Apple Xcode 4. Let’s try this example: #include <stdio. 3.3 3.4 ARM For my experiments with ARM processors. } int f2() { printf ("hello world\n").3 uses open-source GCC as front-end compiler and LLVM code generator 11 . 1Ch Indeed: when we print the “hello world” string. but let’s see what GCC 4. offset s . f2(). and the fact that C-strings allocated in constants segment are guaranteed to be immutable. offset aHello .6. 11 It is indeed so: Apple Xcode 4. GCC—ONE MORE THING GCC—one more thing The fact that an anonymous C-string has const type ( 3. has an interesting consequence: the compiler may use a specific part of string.1 does: Listing 3. When puts() is called from f1(). It’s not divided in fact. it uses “world” string plus zero byte. 1Ch esp.8.9 (Linaro) (for ARM64).1.2 compiler 11 .6.10: GCC 4.1 on page 6).com/17325.h> int f1() { printf ("world\n").3 IDE (with LLVM-GCC 4. I used several compilers: • Popular in the embedded area Keil Release 6/2013.3. • GCC 4. 1Ch [esp+1Ch+s]. HELLO. "hello " _puts esp. in this listing. available as win32-executables at http://go.0 esp. it’s divided only “virtually”.8.yurichev.CHAPTER 3. WORLD! 3.

The very first instruction. the ``MOV R0. This difference (offset) is always to be the same. PC in ARM.text:00000004 . Indeed. In other words. • then pass the control to the printf() by writing its address into the PC register. 3. for the sake of simplification. it is called ARM64 here.1 on page 678) 18 Branch with Link 19 Complex instruction set computing 20 Read more about this in next section ( 5 on page 22) 21 MOVe 22 LDMFD23 16 Program 12 . By the way.4. this has no equivalent in x86. in the output listing from the armcc compiler. R4. So. works as an x86 PUSH instruction.g. That’s why each function passes control to the address stored in the LR register. But that is not quite precise. The ADR instruction takes into account the difference between the address of this instruction and the address where the string is located. Indeed. 17 Read more about it in relevant section ( 67.text:00000000 . This instruction (like the PUSH instruction in thumb mode) is able to save several register values at once which can be very useful. they can only be located on 4-byte boundary addresses. The PUSH instruction is only available in thumb mode. ``STMFD SP!. Next. world” string is located. STMFD may be used for storing pack of registers at the specified memory address. When we talk about 64-bit ARM here. #0 SP!. WORLD! 3. This implies that the last 2 bits of the instruction address (which are always zero bits) may be omitted. In summary.text:000001EC 68 65 6C 6C+aHelloWorld DCB "hello. As we may remember. where the return address is usually stored on the stack20 .lr}'' instruction.exe --arm --c90 -O0 1. and increments the stack pointer SP. we compiled our code for ARM mode. since it can work with any register.c The armcc compiler produces assembly listings in Intel-syntax but it has high-level ARM-processor related macros12 . ``BL __2printf''18 instruction calls the printf() function. This is enough to encode current_P C ± ≈ 32M .11: Non-optimizing Keil 6/2013 (ARM mode) IDA . {R4.CHAPTER 3.PC} . The ``ADR R0. SP in ARM. When printf() finishes its execution it must have information about where it needs to return the control to.PC''22 is an inverse instruction of STMFD. That is a difference between “pure” RISC-processors like ARM and CISC19 -processors like x86. world". we have 26 bits for offset encoding.LR} R0. It loads values from the stack (or any other memory place) in order to save them into R4 and PC. to make things less confusing. Here’s how this instruction works: • store the address following the BL instruction (0xC) into the LR. no matter at what address our code is loaded by the OS.text:00000008 . 12 e. {R4. we can easily see each instruction has a size of 4 bytes. world" __2printf R0. all ARM-mode instructions have a size of 4 bytes (32 bits).0 . Listing 3. then it saves the values of the R4 and LR registers at the address stored in the modified SP. {R4. if not mentioned otherwise.4. SP/ESP/RSP in x86/x64. HELLO. ARM 32-bit ARM code is used in all cases in this book. aHelloWorld . #0''21 instruction just writes 0 into the R0 register. aHelloWorld'' instruction adds the value in the PC16 register to the offset where the “hello.text:0000000C . IP/EIP/RIP in x86/64. That’s because our C-function returns 0 and the return value is to be placed in the R0 register. The last instruction ``LDMFD SP!. This instruction decrements first the SP15 so it points to the place in the stack that is free for new entries. ARM mode lacks PUSH/POP instructions 13 STMFD14 15 stack pointer. That’s why all we need is to add the address of the current instruction (from PC) in order to get the absolute memory address of our C-string. but it is more important for us to see the instructions “as is” so let’s see the compiled result in IDA. "hello. 17 Such code can be be executed at a non-fixed address in memory. actually shows the ``PUSH {r4. Hence. It can also be noted that the STMFD instruction is a generalization of the PUSH instruction (extending its features). we’re doing this in IDA.1 Non-optimizing Keil 6/2013 (ARM mode) Let’s start by compiling our example in Keil: armcc. DATA XREF: main+4 In the example.LR}''13 . not just with SP. writing the values of two registers (R4 and LR) into the stack. Counter. an absolute 32-bit address or offset cannot be encoded in the 32-bit BL instruction because it only has space for 24 bits. How is the PC register used here.text:00000010 10 1E 15 00 10 40 0E 19 00 80 2D 8F 00 A0 BD E9 E2 EB E3 E8 main STMFD ADR BL MOV LDMFD SP!.text:00000000 . It works like POP here. one might ask? This is so-called “position-independent code”. By the way. not for thumb.

text:0000000A 10 C0 06 00 10 main PUSH ADR BL MOVS POP B5 A0 F0 2E F9 20 BD . In summary. DCB is an assembly language directive defining an array of bytes or ASCII strings. R0 _puts R0. 3. This is.PC} DCB "hello. world".text:00000000 .2 Non-optimizing Keil 6/2013 (thumb mode) Let’s compile the same example using Keil in thumb mode: armcc.6. Its usage here. The issue here is that the generic MOV instruction in ARM mode may write only the lower 16 bits of the register. SP R0.3 (LLVM) (ARM mode) __text:000028C4 __text:000028C4 __text:000028C8 __text:000028CC __text:000028D0 __text:000028D4 __text:000028D8 __text:000028DC __text:000028E0 80 86 0D 00 00 C3 00 80 40 06 70 00 00 05 00 80 2D 01 A0 40 8F 00 A0 BD E9 E3 E1 E3 E0 EB E3 E8 _hello_world STMFD MOV MOV MOVT ADD BL MOV LDMFD __cstring:00003F62 48 65 6C 6C+aHelloWorld_0 SP!. The MOV instruction just writes the number 0x1686 into the R0 register.4. Therefore.text:00000000 .0 The instructions STMFD and LDMFD are already familiar to us. akin to the DB directive in x86-assembly language. BL thumb-instruction can encode the address current_P C ± ≈ 2M .text:00000002 .B. or something like that. but R4 and PC are restored during the LDMFD execution. #0x1686'' instruction above cleared the higher part of the register. world" __2printf R0. consists of two 16-bit instructions. The R7 register (as it is standardized in [App10]) is a frame pointer.CHAPTER 3.6. 13 . This is the offset pointing to the “Hello world!” string.text:00000004 .12: Non-optimizing Keil 6/2013 (thumb mode) + IDA .3 Optimizing Xcode 4. As I mentioned.6. As for the other instructions in the function: PUSH and POP work here just like the described STMFD/LDMFD only the SP register is not mentioned explicitly here. More on that below. where the instruction count is as small as possible. ARM N. 3.4. This is because it is impossible to load an offset for the printf() function into PC while using the small space in one 16-bit opcode. The BL instruction.c We are getting (in IDA): Listing 3. Remember. thus passing control to where our function was called. The MOVT R0. #0 (MOVe Top) instruction writes 0 into higher 16 bits of the register. the address of the place where each function must return control to is usually saved in the LR register. As I mentioned before. this value can be written directly to the PC register.LR} R0.text:00000304 68 65 6C 6C+aHelloWorld {R4.PC} DCB "Hello world!". #0 SP!.text:00000008 . #0 {R4. Given the above. Listing 3. Of course. the last address bit may be omitted while encoding instructions. The very first instruction STMFD saved the R4 and LR registers pair on the stack. setting the compiler switch -O3. PC. "hello. however. however. This implies it is impossible for a thumb-instruction to be at an odd address whatsoever. Since main() is usually the primary function in C/C++. the control will be returned to the OS loader or to a point in a CRT. this limitation is not related to moving data between registers. thumb. as I mentioned.13: Optimizing Xcode 4. HELLO.3 without optimization turned on produces a lot of redundant code so we’ll study optimized output.4. {R7. This is probably a shortcoming of the compiler. all instructions in thumb mode have a size of 2 bytes (or 16 bits). MOVS writes 0 into the R0 register in order to return zero. aHelloWorld .3 (LLVM) (ARM mode) Xcode 4. DATA XREF: main+2 We can easily spot the 2-byte (16-bit) opcodes. {R7. all instruction opcodes in ARM mode are limited in size to 32 bits.LR} R0. #0 R0. In the function’s end. That’s why an additional instruction MOVT exists for writing into the higher bits (from 16 to 31 inclusive). The very first instruction saves its value in the stack because the same register will be used by our main() function when calling printf().0 . ADR works just like in the previous example. is redundant because the ``MOV R0. #0x1686 R7. the first 16-bit instruction loads the higher 10 bits of the offset and the second instruction loads the lower 11 bits of the offset. WORLD! 3.exe --thumb --c90 -O0 1.

But in the IDA listing the opcode bytes are swapped because for ARM processor the instructions are encoded as follows: last byte comes first and after that comes the first one (for thumb and thumb-2 modes) or for instructions in ARM mode the fourth byte comes first. passing control to it. while enumerating import symbols in the primary module. the optimal solution is to allocate a separate function working in ARM 24 It has also to be noted the puts() does not require a ’\n’ new line symbol at the end of a string. The BL instruction calls the puts() function instead of printf(). The OS loader loads all modules it needs and..LR} R0. so we do not see it here. the BLX instruction is used in this case instead of the BL. we see the familiar ``MOV R0.3 (LLVM) (thumb-2 mode) __text:00002B6C __text:00002B6C __text:00002B6E __text:00002B72 __text:00002B74 __text:00002B78 __text:00002B7A __text:00002B7E __text:00002B80 80 41 6F C0 78 01 00 80 B5 F2 D8 30 46 F2 00 00 44 F0 38 EA 20 BD _hello_world PUSH MOVW MOV MOVT. besides saving the RA in the LR register and passing control to the puts() function.0xA.CHAPTER 3. Besides. clearing the higher bits.PC} . in order to reduce the time the OS loader needs for completing this procedure. This instruction is placed here since the instruction to which control is passed looks like (it is encoded in ARM mode): __symbolstub1:00003FEC _puts __symbolstub1:00003FEC 44 F0 9F E5 . #0 R0. That is obvious considering that the opcodes of the thumb-2 instructions always begin with 0xFx or 0xEx.6. as we recall. #0 {R7. ELF or Mach-O) an import section is present. .4. =__imp__puts So.com/17063 14 .6. Indeed: printf() with a sole argument is almost analogous to puts(). the observant reader may ask: why not call puts() right at the point in the code where it is needed? Because it is not very space-efficient. In an executable binary file (Windows PE . SP R0. So. ``MOVT. PC _puts R0.W R0. #0x13D8 R7.W and BLX instructions begin with 0xFx.3 (LLVM) (thumb-2 mode) By default Xcode 4. WORLD! 3. #0x13D8'' —it stores a 16-bit value into the lower part of the R0 register. ARM The ``ADD R0. CODE XREF: _hello_world+E LDR PC.4. So as we can see. __imp__puts is a 32-bit variable used by the OS loader to store the correct address of the function in an external library. R0'' instruction adds the value in the PC to the value in the R0. determines the correct addresses of each symbol.yurichev. #0'' works just like MOVT from the previous example only it works in thumb-2. it is good idea if it writes the address of each symbol only once. MOVT. the processor is also switching from thumb mode to ARM (or back).3 generates code for thumb-2 in this manner: Listing 3. In our case.14: Optimizing Xcode 4. because the two functions are producing the same result only in case the string does not contain printf format identifiers starting with %. Among the other differences. including the standard C-function puts(). then the second and finally the first (due to different endianness). Then the LDR instruction just reads the 32-bit value from this variable and writes it into the PC register.. it is impossible to load a 32-bit value into a register while using only one instruction without a memory access.0 The BL and BLX instructions in thumb mode. The difference is that. then the third.4 Optimizing Xcode 4. In thumb-2 these surrogate opcodes are extended in such a way so that new instructions may be encoded here as 32-bit instructions. Therefore. One of the thumb-2 instructions is ``MOVW R0.6. as we have already figured out. In case it does. #0'' instruction intended to set the R0 register to 0.W ADD BLX MOVS POP {R7. This is a list of symbols (functions or global variables) imported from external modules along with the names of the modules themselves.so in *NIX or . Almost. the effect of these two functions would be different 24 .dylib in Mac OS X).exe. Almost any program uses external dynamic libraries (like DLL in Windows. __cstring:00003E70 48 65 6C 6C 6F 20+aHelloWorld DCB "Hello world!". to a dedicated place. it is “position-independent code” so this correction is essential here. The dynamic libraries contain frequently used library functions. puts() works faster because it just passes characters to stdout without comparing every one of them with the % symbol. 3. to calculate absolute address of the “Hello world!” string. PC. Also. 25 http://go. As we already know. Why did the compiler replace the printf() with puts()? Probably because puts() is faster 25 . Next. are encoded as a pair of 16-bit instructions. GCC replaced the first printf() call with puts(). HELLO. the MOVW.

Of course.4. HELLO. in the previous example (compiled for ARM mode) the control is passed by the BL to the same thunk function. 3. Now main() returns 64-bit value: Listing 3.. whatsoever.4 on page 447.1 in ARM64: Listing 3. I changed my example slightly and recompiled it.rodata: 400640 01000200 00000000 48656c6c 6f210a00 . In order to verify this. each has a size of 8 bytes.. for better compatibility. is not being switched (hence the absence of an “X” in the instruction mnemonic). [sp].prefixes. only the low 32 bits of X0 register has to be filled. and only then values from registers pair are to be written into the stack. Registers count is doubled: B. The first instruction (ADRP) writes address of 4Kb page where the string is located into X0. so there are 32-bit instructions only. however. so one needs 16 bytes for saving two registers. #0x0 x29.3.. ARM64 registers are 64-bit ones.3 on page 14. so that’s why they are saved in the function prologue and restored in the function epilogue.. so that’s how the return result is prepared. the first instruction is just an analogue to pair of PUSH X29 and PUSH X30.. [sp. puts() is called afterwards using BL instruction.8. There are no instructions. There are no thumb and thumb-2 modes in ARM64. Hence. } 26 Frame Pointer 15 . read more about it here: 28. x30.Hello!.#-16]! x29. So several instructions must be utilised. W0 is low 32 bits of 64-bit X0 register: High 32-bit part low 32-bit part X0 W0 The function result is returned via X0 and main() returns 0.. while its 32-bit parts—W-. just like in x86-64.4. this instruction is able to save this pair at a random place of memory. and X30 as LR.4. in the terms of more familiar x86. The processor mode. The STP instruction (Store Pair) saves two registers in the stack simultaneously: X29 in X30. Contents of section .2 on page 445. X29 is used as FP26 in ARM64. ARM mode with sole goal to pass control to the dynamic library and then to jump to this short one-instruction function (the so-called thunk function) from the thumb-code. 64-bit registers has X.1 + objdump 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0000000000400590 <main>: 400590: a9bf7bfd 400594: 910003fd 400598: 90000000 40059c: 91192000 4005a0: 97ffffa0 4005a4: 52800000 4005a8: a8c17bfd 4005ac: d65f03c0 stp mov adrp add bl mov ldp ret x29.rodata data segment at this address.h> uint64_t main() { printf ("Hello!\n").. About the difference between post-index and pre-index.. Exclamation mark after operand mean that 16 is to be subtracted from SP first. ADRP and ADD instructions are used to fill the string “Hello!” address into the X0 register. in ARM that can store a large number into a register (because the instruction length is limited to 4 bytes. MOV instruction writes 0 into W0.h> #include <stdint. So if a function returns 32-bit int.4.16: main() returning a value of uint64_t type #include <stdio. This is also called pre-index. 0x400000 + 0x648 = 0x400648. But why using the 32-bit part? Because int data type in ARM64. and the the second one (ADD) just adds reminder to the address. so the pair is saved in the stack. read here: 28. 400000 <_init-0x3b8> x0.1 on page 445). only ARM.8. x0. More about that in: 28.CHAPTER 3. but the SP register is specified here. sp x0.#16 // #0 . #0x648 400420 <puts@plt> w0. x30. WORLD! 3. The second instruction copies SP in X29 (or FP). and we see our “Hello!” C-string in the . This was already discussed: 3.5 ARM64 GCC Let’s compile the example using GCC 4. This is done to set up the function stack frame.15: Non-optimizing GCC 4. because the first function argument is passed in this register...1 on page 945. return 0. is still 32-bit. By the way.

5.$4. function epilogue: addiu $sp. saving return address in link register: jalr $25 addiu $4. Then the address of the text string is loaded to $4 using LUI (“Load Upper Immediate”) 27 Block Started by Symbol 16 . \000 is zero byte in octal base: . So we can allocate some register for this purpose and also allocate 64KiB area of most used data.%call16(puts)($28) . only a special hint bit is added.%hi($LC0) .CHAPTER 3. so it’s impossible to embed 32-bit address into one instruction: a pair has to be used for this (like GCC did in our example for the text string address loading). At least PowerPC uses this technique as well.18: Optimizing GCC 4. WORLD! 3. MIPS The result is the same. however. This concept is not unique to MIPS. set GP: lui $28. for initialized data) sections.$sp.1 MIPS Word about “global pointer” One important MIPS concept is “global pointer”. world!\012\000" main: . It’s possible. As we may already know. branch delay slot . each MIPS instruction has size of 32 bits.register + 32767 using one single instruction (because 16 bits of signed offset could be encoded in single instruction).17: Non-optimizing GCC 4. Some old-school programmers may recall the MS-DOS memory model 94 on page 880 or the MS-DOS memory managers like XMS/EMS where all memory was divided in 64KiB blocks.$sp. restore RA: lw $31. so it can execute it more optimally. 3.5. The RA register is also saved in the local stack. puts() is also used here instead of printf().$0 . #0x0 // #0 LDP (Load Pair) then restores X29 and X30 registers. to load data from the address in range of register − 32768. load address of puts() function from GP to $25: lw $25. This area usually contains global variables and addresses of imported functions like printf().28($sp) .$28. In an ELF file this 64KiB area is located partly in .. copy 0 from $zero to $v0: move $2. return by jumping to address in RA: j $31 .%hi(__gnu_local_gp) addiu $sp.sbss (“small BSS27 ”. save RA to the local stack: sw $31.2 Optimizing GCC Listing 3. Due to the simplicity of the function. informing the CPU that this is a return from a function.5 (assembly output) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 $LC0: .32 . only then SP it is increased by 16. 3.4. New instruction appeared in ARM64: RET. jump to puts(). because GCC developers decided that getting address of some function must be as fast as single instruction execution instead of two.sdata (“small data”. for not initialized data) and .5.%lo(__gnu_local_gp) . The address of the puts() function is loaded into $25 using LW the instruction (“Load Word”). This allocated register is called “global pointer” and it points to the middle of the 64KiB area.5 3.sbss.-32 addiu $28. This implies that the programmer may choose what data he/she wants to be accessed fast and place it into . but that’s how MOV at that line looks like now: Listing 3. function prologue.28($sp) . It works just as BX LR. There is no exclamation mark after the instruction: this implies that the value is first loaded from the stack.%lo($LC0) .. branch delay slot The $GP register is set in the function prologue to point the middle of this area. load address of the text string to $4 ($a0): lui $4. This is called post-index. not just another jump instruction.8. HELLO. .1 + objdump 4005a4: d2800000 mov x0.ascii "Hello.sdata/. optimizing GCC generates the very same code.

which is effectively performing return from the function. WORLD! 3. 0x20 The instruction at line 15 saves GP value into the local stack. MOVE at line 22 copies the value from $0 ($ZERO) register to $2 ($V0). LUI sets the high 16 bits of the register (hence “upper” word in instruction name) and ADDIU adds the lower 16 bits of the address. ($LC0 >> 16) # "Hello.text:00000030 addiu $sp. copy 0 from $zero to $v0: .text:0000002C jr $ra . (puts & 0xFFFF)($gp) .text:00000000 var_4 = -4 . (__gnu_local_gp >> 16) . Here is also a listing generated by IDA.text:00000000 .are called “temporaries” and their contents may not be preserved. $zero . set GP: . 0x20+var_4($sp) . which always holds zero. ($LC0 & 0xFFFF) # "Hello. and one important thing is that the address saved in RA is not the address of the next instruction (because it’s in a delay slot and is executed before the jump instruction).text:00000018 lui $a0.text:00000024 lw $ra.text:0000001C jalr $t9 . SRC. and this instruction is missing mysteriously from the GCC output listing.CHAPTER 3. because each function can use its own 64KiB data window. functions generating listings are not so critical to GCC users. saving return address in link register: . Oh. This does not mean an actual addition happens at each MOVE instruction. return by jumping to address in RA: .5 (assembly output) 1 2 3 $LC0: . function epilogue: . MOVE DST. MIPS and ADDIU (“Add Immediate Unsigned Word”) instruction pair. MIPS has a constant register. so let’s just use $0 register every time zero is needed. this instruction is missing in GCC assembly output: . Apparently.ascii "Hello. save GP to the local stack: . in our case. 3. restore RA: . which does the same. MIPS developers wanted to have compact opcode table.text:00000000 main: . Apparently. 0x20+var_10($sp) . J at line 24 jumps to the address in RA. save RA to the local stack: .text:00000020 la $a0. which is used for passing the first function argument 28 .text:00000010 sw $gp. the CPU optimizes these pseudoinstructions and ALU29 is never used.text:00000000 var_10 = -0x10 . The register $4 is also called $A0.text:00000004 addiu $sp. maybe by a GCC error30 .19: Optimizing GCC 4.5.text:00000014 lw $t9. function prologue.text:00000000 . (__gnu_local_gp & 0xFFFF) . $ZERO (DST = SRC + 0). load address of puts() function from GP to $t9: . ADDIU after J is in fact executed before J (remember branch delay slots?) and is part of function epilogue. Each register here has its own pseudoname: Listing 3. the MIPS developers came with the idea that zero is in fact the busiest constant in the computer programming. world!" . LW (“Load Word”) at line 19 restores RA from the local stack (this instruction is rather part of function epilogue). 29 Arithmetic 17 . The GP value has to be saved indeed.4.4. world!\012\000" main: 28 The MIPS registers table is available in appendix C. .text:00000000 lui $gp. Hence. but the address of the instruction after the next one (after the delay slot). This is very similar to ARM. In fact. Most likely.text:0000000C sw $ra. SRC is ADD DST. The register containing the puts() address is called $T9. form address of the text string in $a0: . by some reason. HELLO. Another interesting fact is that MIPS lacks instruction which transfers data between registers.text:00000008 la $gp.text:00000028 move $v0. ADDIU follows JALR (remember branch delay slots?). jump to puts().5 (IDA) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 . this is the address of the LW instruction next to ADDIU.3 Non-optimizing GCC Listing 3. -0x20 .1 on page 947 logic unit 30 Apparently. so some unfixed errors may still exist.20: Non-optimizing GCC 4. P C + 8 is written to RA during the execution of JALR.5. 0x20+var_4($sp) . JALR (“Jump and Link Register”) jumps to the address stored in the $25 register (address of puts()) while saving the address of the next instruction (LW) in RA. world!" . because registers prefixed with T.

$2 jalr $25 nop .21: Non-optimizing GCC 4.4.text:00000018 sw . set GP: . $a0.text:00000004 sw .24($sp) . restore GP from local stack: lw $28.text:00000024 lw . (aHelloWorld >> 16) # "Hello.text:00000000 var_8 = -8 . load address of the text string: . So in this case they are left here. restore SP: move $sp.$0 . GP: $v0. restore FP: lw $fp. world!" (puts & 0xFFFF)($gp) $zero .$sp . function prologue.$sp.%hi(__gnu_local_gp) addiu $28. I guess (not sure.text:00000008 sw .text:00000000 .text:00000020 addiu .%lo(__gnu_local_gp) .text:0000002C move 31 No $sp. Here is also IDA listing: Listing 3. jump to RA: j $31 nop .$2. function prologue.text:00000000 addiu . world!" $v0. The second and third of which follow the branch instructions.text:00000000 . call to puts(): .text:00000010 la . load address of puts() address using . (aHelloWorld & 0xFFFF) # "Hello.16($fp) . set register $2 ($V0) to zero: move $2.%hi($LC0) addiu $4. maybe eliminates them. branch delay slot .$fp .$28. $at. 0x20+var_10($sp) $v0. 0x20+var_8($sp) $fp.CHAPTER 3. branch delay slot Non-optimizing GCC is more verbose.text:00000028 or .text:0000000C move .32 . load address of the text string: lui $2. if optimization is turned on. function epilogue.text:0000001C lui . HELLO.24($sp) addiu $sp.text:00000000 var_4 = -4 .5. $v0 OPeration 18 . .28($sp) . We see here that register FP is used as a pointer to the stack frame. -0x20 $ra. save GP and FP in the stack: . call to puts(): move $25. MIPS .28($sp) sw $fp. restore RA: lw $31.-32 sw $31.5 (IDA) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 . We also see 3 NOP31 s.text:00000000 var_10 = -0x10 . WORLD! 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 3. $sp $gp. load address of puts() address using GP: lw $2.$sp.%lo($LC0) . set GP: lui $28. . __gnu_local_gp $gp. save GP and FP in the stack: addiu $sp. set FP (stack frame pointer): move $fp. 0x20+var_4($sp) $fp. . set FP (stack frame pointer): . NOP $t9.%call16(puts)($28) nop . though) that the GCC compiler always adds NOPs (because of branch delay slots) after branch instructions and then.text:00000000 main: .

restore SP: .text:0000004C addiu . restore GP from local stack: . doesn’t have a separate NOP instruction. for example: 2. 0x20+var_4($sp) $fp. Type "show copying" and "show warranty" for details. $ZERO.-32716(gp) 0x00400658 <main+24>: lui a0.zero 19 . MIPS.gnu. $zero .4 Role of the stack frame in this example The address of the text string is passed in register.28(sp) 0x00400650 <main+16>: sw gp.-32 0x00400648 <main+8>: addiu gp. restore RA: .. it would have been possible to get rid of the function prologue and epilogue. like many other ISAs.CHAPTER 3. If this was a leaf function. restore FP: .text:00000040 move .text:00000048 lw .done.3 on page 4.22: sample GDB session root@debian-mips:~# gcc hw. There is NO WARRANTY. 0x20+var_8($sp) $sp.org/software/gdb/bugs/>.5.0.1-debian Copyright (C) 2009 Free Software Foundation.0x40 0x0040065c <main+28>: jalr t9 0x00400660 <main+32>: addiu a0.text:00000050 jr .c -O3 -o hw root@debian-mips:~# gdb hw GNU gdb (GDB) 7. Another thing is that IDA doesn’t recognize NOP instructions. 3.0x42 0x00400644 <main+4>: addiu sp.16(sp) 0x00400654 <main+20>: lw t9. and the local stack is used for this purpose. Reading symbols from /root/hw. 0x20+var_10($fp) $v0. MIPS .. jump to RA: . to the extent permitted by law. 3. so here they are at lines 22. $fp $ra. For bug reporting instructions. .text:00000038 lw . 0x00400654 in main () (gdb) set step-mode on (gdb) disas Dump of assembler code for function main: 0x00400640 <main+0>: lui gp.org/licenses/gpl. of course. IDA recognized LUI/ADDIU instructions pair and coalesces them into one LA (“Load Address”) pseudoinstruction at line 15. It is OR $AT.. Why setup local stack anyway? The reason for this lies in the fact that registers RA and GP’s values has to be saved somewhere (because printf() is called). Inc. 26 and 41.5. License GPLv3+: GNU GPL version 3 or later <http://gnu. this instruction applies OR operation to contents of $AT register with zero.28(sp) 0x00400668 <main+40>: move v0.5. WORLD! 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 3. set register $2 ($V0) to zero: .text:00000054 or $t9 $at. function epilogue. NOP Interestingly.text:00000044 lw . idle instruction.sp... $zero .. please see: <http://www. which is. (gdb) b main Breakpoint 1 at 0x400654 (gdb) run Starting program: /root/hw Breakpoint 1. We may also see that this pseudoinstruction has size of 8 bytes! This is a pseudoinstruction (or macro) because it’s not a real MIPS instruction. This GDB was configured as "mips-linux-gnu". NOP $gp. Essentially.text:00000034 or .5 Optimizing GCC: load it into GDB Listing 3.text:0000003C move . but rather handy name for an instruction pair.text:00000030 jalr .-30624 0x0040064c <main+12>: sw ra. $zero $sp.(no debugging symbols found).gp.a0. 0x20 $ra $at. HELLO.html> This is free software: you are free to change and redistribute it.2080 0x00400664 <main+36>: lw ra.

6 (gdb) x/s $a0 0x400820: "hello. So all pointers are 64-bit now. 3. HELLO. WORLD! 3.1 Exercise #1 main: push 0xFFFFFFFF call MessageBeep xor eax.7 Exercises 3.6 Conclusion The main difference between x86/ARM and x64/ARM64 code is that pointer to the string is now 64-bit. We can add much more memory to our computers and 32-bit pointers are not enough to address it.32 End of assembler dump.so. world" (gdb) 3. (gdb) s 0x00400658 in main () (gdb) s 0x0040065c in main () (gdb) s 0x2ab2de60 in printf () from /lib/libc. CONCLUSION 0x0040066c <main+44>: jr ra 0x00400670 <main+48>: addiu sp. modern CPUs are 64-bit now because memory is cheaper nowadays. Indeed.CHAPTER 3.7.eax retn What does this win32-function do? 20 .6.sp.

1 Recursion Epilogues and prologues can negatively affect the recursion performance. For the same purpose one can use ESP. returns the value in the EBP register back to its initial state and returns the control flow to the callee: mov pop ret esp. but since it changes over time this approach is not too convenient. The value in the EBP stays the same over the period of the function execution and is to be used for local variables and arguments access. 21 .3 on page 470. The function epilogue frees allocated space in the stack. X What these instruction do: save the value in the EBP register. ebp ebp 0 Function prologues and epilogues are usually detected in disassemblers for function delimitation. More about recursion in this book: 36. 4.CHAPTER 4. It often looks something like the following code fragment: push mov sub ebp ebp. sets the value of the EBP register to the value of the ESP and then allocate space on the stack for local variables. FUNCTION PROLOGUE AND EPILOGUE Chapter 4 Function prologue and epilogue A function prologue is a sequence of instructions at the start of a function. esp esp.

this segment is write-protected and a single copy of it is shared among all processes executing the same program. Start of heap Start of stack Heap . the size of which may be extended by a system call. towards higher addresses. PUSH subtracts from ESP/RSP/SP 4 in 32-bit mode (or 8 in 64-bit mode) and then writes the contents of its sole operand to the memory address pointed by ESP/RSP/SP. as a pointer within that block. The reason that the stack grows backward is probably historical. starting with a high address and progressing to a lower one). we might think that the stack grow upwards. 1 wikipedia 2 Store Multiple Empty Descending (ARM instruction) Multiple Empty Descending (ARM instruction) 4 Store Multiple Full Ascending (ARM instruction) 5 Load Multiple Full Ascending (ARM instruction) 6 Store Multiple Empty Ascending (ARM instruction) 7 Load Multiple Empty Ascending (ARM instruction) 3 Load 22 . like any other data structure. When the computers were big and occupied a whole room. POP is the reverse operation: retrieve the data from memory pointed by SP. Stack In [RT74] we can read: The user-core part of an image is divided into three logical segments.e. but that’s the way it is. it is just a block of memory in process memory along with the ESP or RSP register in x86 or x64. starting from a low address and progressing to a higher one). it was unknown how big the heap and the stack would be during program execution. i. or the SP register in ARM. one for the heap and one for the stack. It seems strange. STMED2 /LDMED3 instructions are intended to deal with a descending stack (grows downwards. which automatically grows downward as the hardware’s stack pointer fluctuates. PUSH decreases the stack pointer and POP increases it. STMEA6 /LDMEA7 instructions are intended to deal with an ascending stack (grows upwards. During execution. Technically. The most frequently used stack access instructions are PUSH and POP (in both x86 and ARM thumb-mode). The STMFA4 /LDMFA5 . Of course. writable data segment. loads it into the instruction operand (often a register) and then add 4 (or 8) to the stack pointer. so this solution was the simplest possible.CHAPTER 5. it was easy to divide memory into two parts. After stack allocation. The bottom of the stack is actually at the beginning of the memory allocated for the stack block. Starting at the highest address in the virtual address space is a stack segment. the stack pointer points at the bottom of the stack. For example the STMFD/LDMFD. ARM supports both descending and ascending stacks. STACK Chapter 5 Stack The stack is one of the most fundamental data structures in computer science 1 . The program text segment begins at location 0 in the virtual address space.1 Why does the stack grow backwards? Intuitively. At the first 8K byte boundary above the program text segment in the virtual address space begins a nonshared. 5.

and notes for the second one are written from the end of notebook. 5. ss. however. Line 4 pop ebp ret 0 ?f@@YAXXZ ENDP .4. to call another function and use the LR register one more time its value has to be saved. Usually it is saved in the function prologue. f … Also if we turn on the compiler optimization (/Ox option) the optimized code will not overflow the stack and will work correctly 8 instead: ?f@@YAXXZ PROC . esp .2 5.cpp . Line 3 call ?f@@YAXXZ . by flipping it. Just run eternal recursion: void f() { f(). f . RET fetches a value from the stack and jumps to it —that is equivalent to a POP tmp / JMP tmp instruction pair. }. Line 2 $LL3@f: . Line 3 jmp SHORT $LL3@f ?f@@YAXXZ ENDP .asm Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.2. Nevertheless.com/17064 23 . function will cause ⤦ Ç runtime stack overflow …but generates the right code anyway: ?f@@YAXXZ PROC . If such function is small and uses a small 8 irony here 9 http://go. As a consequence. Often. MSVC 2008 reports the problem: c:\tmp6>cl ss. in ARM terminology it is called a leaf function9 .cpp /Fass. Notes may meet each other somewhere in between.LR'' along with this instruction in epilogue ``POP R4-R7. Line 2 push ebp mov ebp. issuing any warning about the problem. f .1 What is the stack used for? Save the function’s return address x86 When calling another function with a CALL instruction the address of the point exactly after the CALL instruction is saved to the stack and then an unconditional jump to the address in the CALL operand is executed. The CALL instruction is equivalent to a PUSH address_after_call / JMP operand instruction pair. we see instructions like ``PUSH R4-R7.cpp c:\tmp6\ss. world!” ( 3. leaf functions do not save the LR register (because they don’t modify it). if a function never calls any other function. the RA is saved to the LR (link register).2.CHAPTER 5.cpp(4) : warning C4717: 'f' : recursive on all control paths. WHAT IS THE STACK USED FOR? This reminds us how some students write two lecture notes using only one notebook: notes for the first lecture are written as usual. File c:\tmp6\ss.21022. STACK 5. If one needs. Overflowing the stack is straightforward.yurichev.cpp . however.1 generates similar code in both cases without. As mentioned in “Hello.PC'' —thus register values to be used in the function are saved in the stack. f GCC 4.00. ARM ARM programs also use the stack for saving return addresses.4 on page 11). including LR. All rights reserved.08 for 80x86 Copyright (C) Microsoft Corporation. File c:\tmp6\ss. but differently. in case of lack of free space. f .

8. 17... it is a convenient custom in x86 and ARM to use the stack for this purpose. but traditionally this is how it’s done. in section 1. main(int argc. 4*3 Callee functions get their arguments via the stack pointer. envp argv argc main If you declare main() as main() without arguments. C functions with a variable number of arguments (like printf()) determine their number using format string specifiers (which begin with the % symbol).5.2 on page 91. This will work 11 . Part II]. Thus. If you declare main() as main(int argc. no matter how many local variables are defined. it is possible to declare main(int argc).CHAPTER 5. 19. the CRT-code is calling main() roughly as: push push push call . they are. marked in IDA as arg_8 … For more information on other calling conventions see also section ( 64 on page 663).4 on page 333.1 dedicated to subroutines[Knu98. char *argv[]). we could read that one way to supply arguments to a subroutine is simply to list them after the JMP instruction passing control to subroutine. the CALL instruction (calling other functions) was expensive. Chapter 4. 5. it’s very fast. 1234). 10 Some time ago. this is how the argument values are located in the stack before the execution of the f() function’s very first instruction: ESP ESP+4 ESP+8 ESP+0xC … return address argument#1. It is also not a requirement to store local variables in the stack. You could store local variables wherever you like. In fact. 15. 5. marked in IDA as arg_0 argument#2. 19. it is possible to call leaf functions without using the stack which can be faster than on older x86 because external RAM is not used for the stack 10 . char *envp[]). That’s why it is not very important how we declare the main() function: as main(). Knuth explains that this method was particularly convenient on IBM System/360. the callee function does not have any information about how many arguments were passed. char *argv[]. still present in the stack. it may not use the stack at all. section 1.2. Therefore. This can be also useful for situations when memory for the stack is not yet allocated or not available. it is possible to allocate a space for arguments in the heap. 24 . Some examples of leaf functions are: listing. It is worth noting that nothing obliges programmers to pass arguments through the stack. 19. STACK 5. in the “The Art of Computer Programming” book by Donald Knuth. on PDP-11 and VAX.17 on page 315. Even more. 15. 8. fill it and pass it to a function via a pointer to this block in the EAX register.3 on page 92.3 Local variable storage A function could allocate space in the stack for its local variables just by shifting the stack pointer towards the stack bottom. By the way. which were laying next to it in the stack. nevertheless.33 on page 332. and then two random numbers. printf() will print 1234.3. and it will work.2.3.2 Passing function arguments The most popular way to pass parameters in x86 is called “cdecl”: push arg3 push arg2 push arg1 call f add esp. you will be able to use first two arguments. up to 50% of execution time might be spent on it. but are not used. One could implement any other method without using the stack at all.1]. 11 For example. so it was considered that having a big number of small functions is an anti-pattern[Ray03. and the third will remain “invisible” for your function. It is not a requirement. marked in IDA as arg_4 argument#3. char *argv[]) or main(int argc.3 on page 216.4 on page 196. However. If we write something like printf("%d %d %d".4.2 on page 195.2.4. For example. Hence. WHAT IS THE STACK USED FOR? number of registers.

this function just shifts ESP downwards toward the stack bottom by the number of bytes you need and sets ESP as a pointer to the allocated block. 00000258H __alloca_probe_16 esi. mov call mov eax.4 5.1 does the same without calling external functions: 12 In MSVC.h> // GCC #else #include <malloc..asm in C:\Program Files (x86)\Microsoft Visual Studio 10. 600.h> void f() { char *buf=(char*)alloca (600). ESP points to the block of 600 bytes and we can use it as memory for the buf array. 0000001cH . 00000258H . 25 . puts() copies buf contents to stdout..) MSVC Let’s compile (MSVC 2010): Listing 5. Of course. // MSVC #endif puts (buf). "hi! %d. In simple terms.g. 3).asm and chkstk. After the alloca() call. since the function epilogue ( 4 on page 21) returns ESP back to its initial state and the allocated memory is just dropped. esp push push push push push push call 3 2 1 OFFSET $SG2672 600 esi __snprintf push call add esi _puts esp. 1. 3). 600. but I would like to illustrate small buffer usage. to terminal or console). STACK 5.1: MSVC 2010 . %d\n". One of the reasons we need a separate function instead of just a couple of instructions in the code. 28 .. it writes it to the buf buffer.0\VC\crt\src\intel 13 It is because alloca() is rather a compiler intrinsic ( 90 on page 869) than a normal function.. %d\n". 1. Let’s try: #ifdef __GNUC__ #include <alloca.CHAPTER 5. in order to let the OS map physical memory to this VM15 region. 600 . // GCC #else _snprintf (buf. 2. "hi! %d. 2.2. The allocated memory chunk does not need to be freed via a free() function call. these two function calls might be replaced by one printf() call. but allocates memory directly on the stack. WHAT IS THE STACK USED FOR? x86: alloca() function It is worth noting the alloca() function. The sole alloca() argument is passed via EAX (instead of pushing it into the stack) 13 . #ifdef __GNUC__ snprintf (buf. }.2. %d. GCC + Intel syntax GCC 4. but instead of dumping the result into stdout (e. the function implementation can be found in alloca16.. %d.12 .4. It is worth noting how alloca() is implemented.h> // MSVC #endif #include <stdio. (_snprintf() function works just like printf(). This function works like malloc(). is because the MSVC14 alloca() implementation also has code which reads from the memory just allocated.

movl $3. 2 DWORD PTR [esp+12].2 on page 264). 16(%esp) $1.3 .3 .5 (Windows) SEH SEH16 records are also stored on the stack (if they present).LC0 . %d.6 Buffer overflow protection More about it here ( 18. "hi! %d. %esp 39(%esp). 4(%esp) _snprintf %ebx. 20(%esp) corresponds to mov DWORD PTR [esp+20]. %ebx $-16. 16 Structured Exception Handling : 68. [esp+39] ebx. %ebp %ebx $660. (%esp) puts -4(%ebp). ebx . By the way. 5.7. 3 in Intel-syntax. %ebx The code is the same as in the previous listing. %d\n" f: pushl movl pushl subl leal andl movl movl movl movl movl movl call movl call movl leave ret %ebp %esp.2. 3 DWORD PTR [esp+16].CHAPTER 5. 5. s DWORD PTR [esp+20]. 600 . -16 .3: GCC 4. OFFSET FLAT:.3 on page 692). align pointer by 16-bit border DWORD PTR [esp]. %d\n" DWORD PTR [esp+4]. but in AT&T syntax: Listing 5. maxlen _snprintf DWORD PTR [esp].LC0: . %ebx %ebx. 1 DWORD PTR [esp+8]. ebx .2.2: GCC 4. s puts ebx. In AT&T syntax when addressing memory using register+offset format.string "hi! %d. %d\n" f: push mov push sub lea and mov mov mov mov mov mov call mov call mov leave ret ebp ebp. Read more about it: ( 68. 12(%esp) $.. 660 ebx. %d.7.LC0. %d. esp ebx esp. it is notated as offset(%register). 20(%esp) $2. 8(%esp) $600. STACK 5.3 on page 692 26 . WHAT IS THE STACK USED FOR? Listing 5.LC0: . DWORD PTR [ebp-4] GCC + AT&T syntax Let’s see the same code.2.string "hi! %d. (%esp) $3.

3 esp. size = 4 . %d\n".4: Non-optimizing MSVC 2010 $SG2752 DB '%d.4 … local variable #2. }. size = 4 .h> void f1() { int a=1. 0aH. Compiling… Listing 5. Short example: #include <stdio. %d'. marked in IDA as arg_4 argument#3. a. }. esp 12 DWORD PTR _c$[ebp] DWORD PTR _b$[ebp] 27 . 2 DWORD PTR _c$[ebp]. before the first instruction execution looks like this: … ESP-0xC ESP-8 ESP-4 ESP ESP+4 ESP+8 ESP+0xC … 5. esp. Where do they come from? These are what was left in there after other functions’ executions. b. int main() { f1(). b. %d. f2(). printf ("%d. esp esp. ebp ebp 0 . size = 4 ebp ebp.3 5. 1 DWORD PTR _b$[ebp]. b=2. 12 DWORD PTR _a$[ebp]. STACK 5. marked in IDA as arg_0 argument#2. TYPICAL STACK LAYOUT Typical stack layout A typical stack layout in a 32-bit environment at the start of a function. eax ecx. size = 4 _c$ = -12 _b$ = -8 _a$ = -4 _f2 PROC push mov sub mov push mov ebp ebp. 00H _c$ = -12 _b$ = -8 _a$ = -4 _f1 PROC push mov sub mov mov mov mov pop ret _f1 ENDP . c=3. marked in IDA as arg_8 … Noise in stack Often in this book I write about “noise” or “garbage” values in stack or memory. size = 4 . %d. void f2() { int a.3. c. marked in IDA as var_8 local variable #1. eax.CHAPTER 5. size = 4 . c). }. marked in IDA as var_4 saved value of EBP return address argument#1.

esp _f1 _f2 eax. '%d. These are “ghosts” values.obj But when I run the compiled program… c:\Polygon\c>st 1.4.exe st.c /Fast.00.asm /MD Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16. STACK _f2 _main _main push mov push push call add mov pop ret ENDP PROC push mov call call xor pop ret ENDP 5. NOISE IN STACK ecx edx.c c:\polygon\c\st. 3 Oh. what a weird thing! We did not set any variables in f2(). All rights reserved.c(11) : warning C4700: uninitialized local variable 'b' used c:\polygon\c\st.01 for 80x86 Copyright (C) Microsoft Corporation. eax ebp 0 The compiler will grumble a little bit… c:\Polygon\c>cl st. All rights reserved.CHAPTER 5. 28 . %d. 2. 16 esp.40219.40219. DWORD PTR _a$[ebp] edx OFFSET $SG2752 .00. ebp ebp 0 ebp ebp.c(11) : warning C4700: uninitialized local variable 'a' used Microsoft (R) Incremental Linker Version 10. /out:st.01 Copyright (C) Microsoft Corporation. %d' DWORD PTR __imp__printf esp.c(11) : warning C4700: uninitialized local variable 'c' used c:\polygon\c\st. which are still in the stack. st.

their values are stored at the address 0x14F860 and so on. b and c .4.1: OllyDbg: f1() When f1() assigns the variables a. NOISE IN STACK Let’s load the example into OllyDbg: Figure 5.CHAPTER 5. 29 . STACK 5.

three numbers are printed. return 0..2: OllyDbg: f2() .. a.5: Optimizing MSVC 2010 $SG3103 DB _main PROC push call push push '%d'.e. so at that point they are still untouched. They are not random in the strict sense.5. Then the local variables will be located at the same positions in the stack. %d\n"). they has same number of arguments).h> int main() { printf ("%d.5. Is there other option? It probably would be possible to clear portions of the stack before each function execution. EXERCISES And when f2() executes: Figure 5. Summarizing. 5.2 Exercise #2 What does this code do? Listing 5. 0aH. }.CHAPTER 5.1 Exercises Exercise #1 If we compile this piece of code in MSVC and run it. 00H 0 DWORD PTR __imp___time64 edx eax 30 . all values in the stack (and memory cells in general) have values left there from previous function executions. but that’s too much extra (and unnecessary) work. %d. 5. b and c of f2() are located at the same addresses! No one has overwritten the values yet.5. Answer: G..1.1 on page 954.5 5. So. Where do they come from? Where they come from if you compile it in MSVC with optimization (/Ox)? Why is the situation completely different if we compile with GCC? #include <stdio. several functions has to be called one after another and SP has to be the same at each function entry (i. but rather have unpredictable values. STACK 5. for this weird situation.

CHAPTER 5. 0 time x1.32| DCB "%d\n".|L0.#0 {r4. . $gp.5 (MIPS) (IDA) main: var_10 var_4 = -0x10 = -4 lui addiu la sw sw lw or jalr move $gp. '%d' DWORD PTR __imp__printf esp.4. $gp. sp. x0 x29.7: Optimizing Keil 6/2013 (thumb mode) main PROC PUSH MOVS BL MOVS ADR BL MOVS POP ENDP {r4.string "%d\n" Listing 5.#0 time r1.pc} |L0.32| __2printf r0. $at.lr} r0. $t9. $t9 $a0. $sp. STACK _main push call add xor ret ENDP 5.20| DCB "%d\n".9 (ARM64) main: stp mov add bl mov ldp adrp add b x29. x30.r0 r0.r0 r0.0 Listing 5. eax 0 Listing 5.pc} |L0. (__gnu_local_gp >> 16) -0x20 (__gnu_local_gp & 0xFFFF) 0x20+var_4($sp) 0x20+var_10($sp) (time & 0xFFFF)($gp) $zero $zero 31 .LC0: . EXERCISES OFFSET $SG3103 .8: Optimizing GCC 4.6: Optimizing Keil 6/2013 (ARM mode) main PROC PUSH MOV BL MOV ADR BL MOV POP ENDP {r4. [sp]. 16 x0. $ra.LC0 printf .9: Optimizing GCC 4. 16 eax. [sp.#0 time r1.|L0.LC0 x0. :lo12:.5.#0 {r4.lr} r0. -16]! x0. 0 x29.20| __2printf r0. x0.0 Listing 5. x30.

1 on page 954. 32 . $a0.5. $t9. $a0. EXERCISES lw lui lw lw la move jr addiu $LC0: $gp. 0x20+var_10($sp) ($LC0 >> 16) # "%d\n" (printf & 0xFFFF)($gp) 0x20+var_4($sp) ($LC0 & 0xFFFF) # "%d\n" $v0 0x20 . $t9 $sp.1. $ra.ascii "%d\n"<0> # DATA XREF: main+28 Answer: G.CHAPTER 5. STACK 5. $a1.

1 6. }.CHAPTER 6.. a1 .. that is 4 bytes. By the way. replacing printf() in the main() function body with this: #include <stdio. X'' instructions into one. In certain cases where several functions return right after one another.. but now we can see the printf() arguments are pushed onto the stack in reverse order. 1. So. after the last call: push push call . 33 . push push push push call add 3 2 1 OFFSET $SG3830 _printf esp. 3). push push push call a1 a2 . and only for 32-bit environment. 6. variables of int type in 32-bit environment have 32-bit width.... the number of function arguments could be deduced by simply dividing X by 4. See also the calling conventions section ( 64 on page 663).. push call . 16 . 4 ∗ 4 = 16 —they occupy exactly 16 bytes in the stack: a 32-bit pointer to a string and 3 numbers of type int. When the stack pointer (ESP register) has changed back by the ADD ESP.. PRINTF() WITH SEVERAL ARGUMENTS Chapter 6 printf() with several arguments Now let’s extend the Hello..h> int main() { printf("a=%d. the compiler could merge multiple ``ADD ESP. Of course. The first argument is pushed last. c=%d'.. often. 00000010H Almost the same.. b=%d. we have 4 arguments here. world! ( 3 on page 6) example.1 x86 x86: 3 arguments MSVC When we compile it with MSVC 2010 Express we get: $SG3830 DB 'a=%d.. 2. return 0. X instruction after a function call. this is specific to the cdecl calling convention. b=%d. a1 a2 a3 . 00H . c=%d".1.

1. 24 34 . X86 add esp. PRINTF() WITH SEVERAL ARGUMENTS 6.CHAPTER 6.

Then load the executable in OllyDbg. 35 . so we can see the imported functions clearly in the debugger. PRINTF() WITH SEVERAL ARGUMENTS 6.dll. We need to perform these actions in order to skip CRT-code. The second breakpoint is in CRT-code. X86 MSVC and OllyDbg Now let’s try to load this example in OllyDbg.DLL. We can compile our example in MSVC 2012 with /MD option.1. It is one of the most popular user-land win32 debuggers. which means to link with MSVCR*. Now we have to find the main() function.CHAPTER 6. because we aren’t really interested in it yet. press F9 (run).1: OllyDbg: the very start of the main() function Click on the PUSH EBP instruction. press F2 (set breakpoint) and press F9 (run). The very first breakpoint is in ntdll. Find this code by scrolling the code to the very top (MSVC allocates the main() function at the very beginning of the code section): Figure 6.

ESP changes as well. value in the stack and some additional OllyDbg comments. but it could be good as an exercise so you start building a feel of how everything works here. OllyDbg understands printf()-like strings. because the arguments values are pushed into the stack.e.CHAPTER 6. It is possible to change the format string. PRINTF() WITH SEVERAL ARGUMENTS 6. which always displays some part of the memory.. i. It is possible to right-click on the format string. and the format string will appear in the debugger left-bottom window. skip 6 instructions: Figure 6. like other debuggers. OllyDbg. X86 Press F8 (step over) 6 times.1. These memory values can be edited.3: OllyDbg: stack after the argument values have been pushed (The red rectangular border was added by me in a graphics editor) We can see 3 columns there: address in the stack. in which case the result of our example would be different. so it reports the string here and the 3 values attached to it. 36 . click on “Follow in dump”. highlights the value of the registers which were changed. Where are the values in the stack? Take a look at the right/bottom debugger window: Figure 6. It is not very useful in this particular case. So each time you press F8.2: OllyDbg: before printf() execution Now the PC points to the CALL printf instruction. EIP changes and its value is displayed in red.

the printf() function’s hidden machinery used them for its own needs. This is indeed the cdecl calling convention behaviour: callee does not return ESP back to its previous value. PRINTF() WITH SEVERAL ARGUMENTS 6.5: OllyDbg: after printf() execution Register EAX now contains 0xD (13). We see the following output in the console: Figure 6.CHAPTER 6. That is correct. The caller is responsible to do so. The EIP value has changed: indeed.1. A very important fact is that neither the ESP value. 37 . X86 Press F8 (step over). now it contains the address of the instruction coming after CALL printf. nor the stack state have been changed! We clearly see that the format string and corresponding 3 values are still there. since printf() returns the number of characters printed. Apparently. ECX and EDX values have changed as well.4: printf() function executed Let’s see how the registers and stack state have changed: Figure 6.

but the values are still in the stack! Yes. no one needs to set these values to zeroes or something like that because everything above the stack pointer (SP) is noise or garbage. c=%d" [esp+10h+var_4].1. 10h eax.CHAPTER 6. "a=%d. GCC Now let’s compile the same program in Linux using GCC 4. eax _printf eax. 0 Its noticeable that the difference between the MSVC code and the GCC code is only in the way the arguments are stored on the stack. It would be time consuming to clear the unused stack entries anyway. b=%d. offset aADBDCD .4.1-ubuntu 1 GNU debugger 38 . esp esp. 10 instruction execution ESP has changed.c -g -o 1 $ gdb 1 GNU gdb (GDB) 7.6. 2 [esp+10h+var_C]. PRINTF() WITH SEVERAL ARGUMENTS 6. Here the GCC is working directly with the stack without the use of PUSH/POP.6: OllyDbg: after ADD ESP. 3 [esp+10h+var_8]. 1 [esp+10h+var_10].1 and take a look at what we have got in IDA: main proc near var_10 var_C var_8 var_4 = = = = main push mov and sub mov mov mov mov mov call mov leave retn endp dword dword dword dword ptr ptr ptr ptr -10h -0Ch -8 -4 ebp ebp. -g option instructs the compiler to include debug information in the executable file. GCC and GDB Let’s try this example also in GDB1 in Linux. and no one really needs to. 10 instruction: Figure 6. X86 Press F8 again to execute ADD ESP. 0FFFFFFF0h esp. of course. $ gcc 1. and has no meaning at all.

Let’s examine the registers.1: let’s set breakpoint on printf() (gdb) b printf Breakpoint 1 at 0x80482f0 Run.c:29 29 printf.c file is located in the current directory. while generating debugging information. License GPLv3+: GNU GPL version 3 or later <http://gnu.c: No such file or directory. 3) are the printf() arguments. the 1.1.%ax The two XCHG instructions are probably random garbage. to the extent permitted by law. For bug reporting instructions. This is the number of characters printed out. Indeed.gnu. In this case: execute till the end of printf(). their local variables. GDB is a source-level debugger. X86 Copyright (C) 2013 Free Software Foundation. The command instructs GDB to “execute all instructions until the end of the function”. b=%d. Reading symbols from /home/dennis/polygon/1. There is NO WARRANTY.c:6 6 return 0. Inc. b=%d. PRINTF() WITH SEVERAL ARGUMENTS 6. b=%d.c:29 main () at 1.done. How does GDB know which C-code line is being currently executed? This is due to the fact that the compiler. c=%d") at printf. Listing 6. 13 in EAX: (gdb) info registers eax 0xd 13 ecx 0x0 0 edx 0x0 0 ebx 0xb7fc0000 esp 0xbffff120 ebp 0xbffff138 esi 0x0 0 -1208221696 0xbffff120 0xbffff138 39 .%ax 0x8048453: xchg %ax. The rest of the elements could be just “garbage” on the stack. Run “finish”. c=%d") at printf. which we can ignore for now.c file at the line 6. but may do so. (gdb) run Starting program: /home/dennis/polygon/1 Breakpoint 1. This GDB was configured as "i686-linux-gnu". We also see “return 0.. Type "show copying" and "show warranty" for details. Print 10 stack elements. The second element (0x080484f0) is the format string address: (gdb) x/s 0x080484f0 0x80484f0: "a=%d. We don’t have the printf() function source code here. The most left column contains addresses on the stack.%eax 0x804844f <main+50>: leave 0x8048450 <main+51>: ret 0x8048451: xchg %ax.html> This is free software: you are free to change and redistribute it. (gdb) finish Run till exit from #0 __printf (format=0x80484f0 "a=%d. please see: <http://www. We can ignore them for now.. We can verify this by disassembling the memory at this address: (gdb) x/5i 0x0804844a 0x804844a <main+45>: mov $0x0.CHAPTER 6.org/software/gdb/bugs/>..org/licenses/gpl.” and the information that this expression is in the 1. so GDB can’t show it.. after all. and GDB finds the string there. __printf (format=0x80484f0 "a=%d. Value returned is $2 = 13 GDB shows what printf() returned in EAX (13). but could also be values from other functions. (gdb) x/10w $esp 0xbffff11c: 0x0804844a 0xbffff12c: 0x00000003 0xbffff13c: 0xb7e29905 0x080484f0 0x08048460 0x00000001 0x00000001 0x00000000 0x00000002 0x00000000 The very first element is the RA (0x0804844a). 2. etc. c=%d" Next 3 elements (1. also saves a table of relations between source code line numbers and instruction addresses. just like in the OllyDbg example.

h=%d\n". (gdb) info registers eax 0x0 0 ecx 0x0 0 edx 0x0 0 ebx 0xb7fc0000 esp 0xbffff120 ebp 0xbffff138 esi 0x0 0 edi 0x0 0 eip 0x804844f ..CHAPTER 6. 2. X86 0x804844a <main+45> Let’s disassemble the current instructions. PRINTF() WITH SEVERAL ARGUMENTS edi eip . Indeed EAX is zero at that point.0x4(%esp) 0x0804843e <+33>: movl $0x80484f0.0xc(%esp) 0x0804842e <+17>: movl $0x2.%esp 0x08048426 <+9>: movl $0x3.0x3 0x0804842e <+17>: mov DWORD PTR [esp+0x8]. this ends the block. c=%d.1. 4. b=%d. return 0. d=%d. 40 .0x2 0x08048436 <+25>: mov DWORD PTR [esp+0x4].1.%esp 0x08048423 <+6>: sub $0x10.0x1 0x0804843e <+33>: mov DWORD PTR [esp]. 3.0x80484f0 0x08048445 <+40>: call 0x80482f0 <printf@plt> => 0x0804844a <+45>: mov eax.h> int main() { printf("a=%d.0x8(%esp) 0x08048436 <+25>: movl $0x1.(%esp) 0x08048445 <+40>: call 0x80482f0 <printf@plt> => 0x0804844a <+45>: mov $0x0. The arrow points to the instruction to be executed next.%eax 0x0804844f <+50>: leave 0x08048450 <+51>: ret End of assembler dump.. (gdb) step 7 }. GDB shows ending bracket. meaning.%ebp 0x08048420 <+3>: and $0xfffffff0. It is possible to switch to Intel syntax: (gdb) set disassembly-flavor intel (gdb) disas Dump of assembler code for function main: 0x0804841d <+0>: push ebp 0x0804841e <+1>: mov ebp.. 1. 6. f=%d. 5. g=%d.0xfffffff0 0x08048423 <+6>: sub esp. GDB uses AT&T syntax by default. (gdb) disas Dump of assembler code for function main: 0x0804841d <+0>: push %ebp 0x0804841e <+1>: mov %esp.esp 0x08048420 <+3>: and esp. 7..2 -1208221696 0xbffff120 0xbffff138 0x804844f <main+50> x64: 8 arguments To see how other arguments are passed via the stack. 8). Let’s examine the registers after the MOV EAX. 0x0 0 0x804844a 6.0x0 0x0804844f <+50>: leave 0x08048450 <+51>: ret End of assembler dump. e=%d. Execute next instruction. }.0x10 0x08048426 <+9>: mov DWORD PTR [esp+0xc]. 6. let’s change our example again by increasing the number of arguments to 9 (printf() format string + 8 int variables): #include <stdio. 0 instruction execution.

2 on page 10. That is exactly what we see here. 0aH. GCC The picture is similar for x86-64 *NIX OS-es. 2 esi.2: MSVC 2012 x64 $SG2923 DB main 'a=%d. R9 registers.2. 4 r9d. RDX. eax main _TEXT END add ret ENDP ENDS rsp. is used for preparing the stack. This is established for the convenience’s sake: it makes it easy to calculate the address of arbitrary argument. 40 mov mov mov mov mov mov xor mov mov mov call r9d. h=%d'. 5 DWORD PTR [rsp+32]. the first 4 arguments has to be passed through the RCX. R8. h=%d\n" main: sub rsp. OFFSET FLAT:. d=%d. 3 edx. g=%d. 4 ecx. so the values are stored to the stack in a straightforward manner. 3 r8d. one has to remember: 8 bytes are allocated for any data type shorter than 64 bits. X86 MSVC As it was mentioned earlier. when 4 is enough? Yes. they are all located at aligned memory addresses. c=%d.6 x64 . eax rsp. 2 edx.LC0: . 7 DWORD PTR [rsp]. b=%d. However. 1 rcx. b=%d. 6 DWORD PTR [rsp+40]. Listing 6. 7 DWORD PTR [rsp+48]. 40 41 . e=%d. while all the rest—via the stack. R9 registers in Win64.2. RDX. return 0 xor add ret eax. It is the same in the 32-bit environments: 4 bytes are reserved for all data types. the MOV instruction. f=%d. 5 r8d. 8 DWORD PTR [rsp+8]. eax . 88 mov mov mov mov mov mov mov mov lea call DWORD PTR [rsp+64]. R8.2 on page 10. RSI.1. GCC generates the code storing the string pointer into EDI instead of RDI—we noted that previously: 3.CHAPTER 6. except that the first 6 arguments are passed through the RDI.string "a=%d.LC0 eax. 00H PROC sub rsp. OFFSET FLAT:$SG2923 printf . instead of PUSH. We also noted earlier that the EAX register has been cleared before a printf() call: 3.4. RCX. number of vector registers passed DWORD PTR [rsp+16]. g=%d. 8 DWORD PTR [rsp+56]. 88 0 The observant reader may ask why are 8 bytes allocated for int values. 6 printf . e=%d. 1 edi. return 0 xor eax. Besides. PRINTF() WITH SEVERAL ARGUMENTS 6. All the rest—via the stack.3: Optimizing GCC 4. f=%d. c=%d. d=%d. Listing 6.

So. d=%d.1-ubuntu Copyright (C) 2013 Free Software Foundation. $ gcc -g 2. e=%d. h=%d\n") at ⤦ Ç printf.1. b=%d. X86 GCC + GDB Let’s try this example in GDB.. 7..c:29 29 printf.4: let’s set the breakpoint to printf().c -o 2 $ gdb 2 GNU gdb (GDB) 7.html> This is free software: you are free to change and redistribute it.6.. __printf (format=0x400628 "a=%d. c=%d. 3 values are also passed through the stack: 6..5: let’s inspect the format string (gdb) x/s $rdi 0x400628: "a=%d.c: No such file or directory. because the values have int type. i. Type "show copying" and "show warranty" for details. Inc. and run (gdb) b printf Breakpoint 1 at 0x400410 (gdb) run Starting program: /home/dennis/polygon/2 Breakpoint 1. b=%d. the high register or stack element part may contain “random garbage”. to the extent permitted by law. Registers RSI/RDX/RCX/R8/R9 have the expected values.. That’s OK. We also see that 8 is passed with the high 32-bits not cleared: 0x00007fff00000008.org/software/gdb/bugs/>. (gdb) x/10g $rsp 0x7fffffffdf38: 0x0000000000400576 0x7fffffffdf48: 0x0000000000000007 0x7fffffffdf58: 0x0000000000000000 0x7fffffffdf68: 0x00007ffff7a33de5 0x7fffffffdf78: 0x00007fffffffe048 0x0000000000000006 0x00007fff00000008 0x0000000000000000 0x0000000000000000 0x0000000100000000 The very first stack element.gnu.e. d=%d. g=%d. f=%d. 8. If you take a look at where the control will return after the printf() execution. e=%d. This GDB was configured as "x86_64-linux-gnu". Listing 6.done.. GDB will show the entire main() function: 42 . 64-bit words. please see: <http://www. License GPLv3+: GNU GPL version 3 or later <http://gnu. PRINTF() WITH SEVERAL ARGUMENTS 6. (gdb) info registers rax 0x0 0 rbx 0x0 0 rcx 0x3 3 rdx 0x2 2 rsi 0x1 1 rdi 0x400628 4195880 rbp 0x7fffffffdf60 rsp 0x7fffffffdf38 r8 0x4 4 r9 0x5 5 r10 0x7fffffffdce0 r11 0x7ffff7a65f60 r12 0x400440 4195392 r13 0x7fffffffe040 r14 0x0 0 r15 0x0 0 rip 0x7ffff7a65f60 . Reading symbols from /home/dennis/polygon/2. RIP has the address of the very first instruction of the printf() function. c=%d.CHAPTER 6. 0x7fffffffdf60 0x7fffffffdf38 140737488346336 140737348263776 140737488347200 0x7ffff7a65f60 <__printf> Listing 6. is the RA. which is 32-bit.org/licenses/gpl. For bug reporting instructions. f=%d. There is NO WARRANTY. g=%d. just like in the previous case. h=%d\n" Let’s dump the stack with the x/g command this time—g stands for giant words..

and note that the EAX register has a value of exactly zero. 32-bit ARM Non-optimizing Keil 6/2013 (ARM mode) 43 . RIP now points to the LEAVE instruction.3 on page 664) or win64 ( 64.2..0x400628 0x000000000040056c <+63>: mov eax.0x5 0x0000000000400552 <+37>: mov r8d.0x3 0x000000000040055d <+48>: mov edx..0x0 0x000000000040057b <+78>: leave 0x000000000040057c <+79>: ret End of assembler dump. h=8 main () at 2. (gdb) info registers rax 0x0 0 rbx 0x0 0 rcx 0x26 38 rdx 0x7ffff7dd59f0 140737351866864 rsi 0x7fffffd9 2147483609 rdi 0x0 0 rbp 0x7fffffffdf60 0x7fffffffdf60 rsp 0x7fffffffdf40 0x7fffffffdf40 r8 0x7ffff7dd26a0 140737351853728 r9 0x7ffff7a60134 140737348239668 r10 0x7fffffffd5b0 140737488344496 r11 0x7ffff7a95900 140737348458752 r12 0x400440 4195392 r13 0x7fffffffe040 140737488347200 r14 0x0 0 r15 0x0 0 rip 0x40057b 0x40057b <main+78> .0x2 0x0000000000400562 <+53>: mov esi.2.2 6.1 ARM ARM: 3 arguments ARM’s traditional scheme for passing arguments (calling convention) behaves as follows: the first 4 arguments are passed through the R0-R3 registers.c:6 6 return 0.0x0 0x0000000000400571 <+68>: call 0x400410 <printf@plt> 0x0000000000400576 <+73>: mov eax. 6. c=3. h=%⤦ Ç d\n") at printf. g=7. PRINTF() WITH SEVERAL ARGUMENTS 6. This resembles the arguments passing scheme in fastcall ( 64.0x7 0x0000000000400545 <+24>: mov DWORD PTR [rsp].1 on page 665). d=4.rsp 0x0000000000400531 <+4>: sub rsp. g=%d. e=%d. Value returned is $1 = 39 (gdb) next 7 }.0x4 0x0000000000400558 <+43>: mov ecx. Let’s finish executing printf().0x20 0x0000000000400535 <+8>: mov DWORD PTR [rsp+0x10]. the penultimate one in the main() function. b=2. the remaining arguments via the stack. ARM (gdb) set disassembly-flavor intel (gdb) disas 0x0000000000400576 Dump of assembler code for function main: 0x000000000040052d <+0>: push rbp 0x000000000040052e <+1>: mov rbp. execute the instruction zeroing EAX. b=%d.e.0x6 0x000000000040054c <+31>: mov r9d. (gdb) finish Run till exit from #0 __printf (format=0x400628 "a=%d. e=5.c:29 a=1. c=%d. i.CHAPTER 6. f=%d.0x8 0x000000000040053d <+16>: mov DWORD PTR [rsp+0x8].0x1 0x0000000000400567 <+58>: mov edi.5. d=%d. f=6..

text:00000000 . then 1 in R1. 2) the call to printf() is the last instruction. #0 SP!. "a=%d. Optimizing Keil 6/2013 (thumb mode) Listing 6. #2 R1. b=%d. There are two main reasons: 1) neither the stack nor SP (the stack pointer) is modified. #3 R2. #2 R1.LR} R3. Therefore we do not need to save LR because we do not need to modify LR. c=%d\n" This is the optimized (-O3) version for ARM mode and this time we see B as the last instruction instead of the familiar BL. similar to JMP in x86.text:00000014 . b=%d. This optimization is often used in functions where the last statement is a call to another function. The result is somewhat unusual: Listing 6.1 on page 141.LR} R3.text:00000014 . the printf() function simply returns the control to the address stored in LR. #3 R2. b=%d.text:00000000 .CHAPTER 6.text:00000014 . 3). aADBDCD __2printf R0.1. {R4. Optimizing Keil 6/2013 generates the same code.text:00000004 .text:00000024 main 03 30 02 20 01 10 1E 0E CB 18 A0 A0 A0 8F 00 E3 E3 E3 E2 EA MOV MOV MOV ADR B R3. Why does it work? Because this code is. #1 R0.7: Optimizing Keil 6/2013 (thumb mode) . Furthermore. #2 R1.text:00000006 . "a=%d.text:00000008 . PRINTF() WITH SEVERAL ARGUMENTS 6.text:00000002 . There is nothing unusual so far. 2 in R2 and 3 in R3. {R4.text:0000000E .text:00000000 . #1 R0. c=%d" There is no significant difference from the non-optimized code for ARM mode.PC} . b=%d.text:0000001C main 10 40 03 30 02 20 01 10 08 00 06 00 00 00 10 80 2D A0 A0 A0 8F 00 A0 BD E9 E3 E3 E3 E2 EB E3 E8 STMFD MOV MOV MOV ADR BL MOV LDMFD SP!. A similar example is presented here: 13. effectively equivalent to the previous. aADBDCD __2printf .PC} . On completion. The instruction at 0x18 writes 0 to R0 —this is return 0 C-statement. return 0 So.text:00000010 .text:00000000 .text:00000010 main 10 B5 03 23 02 22 01 21 02 A0 00 F0 0D F8 00 20 10 BD PUSH MOVS MOVS MOVS ADR BL MOVS POP {R4. And we do not need to modify LR because there are no other function calls except printf().text:00000008 .8: Optimizing Keil 6/2013 (ARM mode) . Since the LR currently stores the address of the point from where our function was called then the control from printf() will be returned to that point.text:0000000C . ARM Listing 6.2.text:0000001C . the first 4 arguments are passed via the R0-R3 registers in this order: a pointer to the printf() format string in R0. 2. without any manipulation of the LR register. 44 .h> void main() { printf("a=%d. c=%d" .text:0000000A . in fact. }.text:00000020 . so there is nothing going on afterwards. Optimizing Keil 6/2013 (ARM mode) + let’s remove return Let’s rework example slightly by removing return 0: #include <stdio. The B instruction just jumps to another address. aADBDCD __2printf R0. #3 R2. #0 {R4.text:00000018 .6: Non-optimizing Keil 6/2013 (ARM mode) .text:00000018 . "a=%d. Another difference between this optimized version and the previous one (compiled without optimization) is the lack of function prologue and epilogue (instructions preserving the R0 and LR registers values).text:00000004 . c=%d". #1 R0. 1. after this call we do not to do anything else! That is the reason such optimization is possible.

#1 R0.text:00000040 . 8). f=%d.text:00000044 . f=%d. save FP and LR in stack frame: stp x29. sp. b=%d. e=%d. c=%d" f2: . c=%d. x30.2.9 Listing 6. [sp. [SP.#4 45 . b=%d.2 ARM: 8 arguments Let’s use again the example with 9 arguments from the previous section: 6. 6.text:0000004C . 1 mov w2. .text:00000030 .text:00000060 . 4.text:00000034 .text:00000064 . #include <stdio. #3 R2. g⤦ BL ADD LDR __2printf SP..LC1 add x0. %d in printf() string format is a 32-bit int.LC1 mov w1.text:0000003C . d=%d. 1. #8 R2.text:00000048 .text:00000054 . we see the familiar ADRP/ADD instruction pair.text:00000068 main var_18 = -0x18 var_14 = -0x14 var_4 = -4 04 14 08 07 06 05 04 0F 04 00 03 02 01 6E E0 D0 30 20 10 00 C0 00 00 00 30 20 10 0F 2D 4D A0 A0 A0 A0 8D 8C A0 8D A0 A0 A0 8F E5 E2 E3 E3 E3 E3 E2 E8 E3 E5 E3 E3 E3 E2 BC 18 00 EB 14 D0 8D E2 04 F0 9D E4 STR SUB MOV MOV MOV MOV ADD STMIA MOV STR MOV MOV MOV ADR LR.text:00000028 . ARM ARM64 Non-optimizing GCC (Linaro) 4. The second ADD X29.#var_4]! SP. 2 and 3 are loaded into 32-bit register parts. [sp].9 .text:00000038 . restore FP and LR ldp x29. Next. -16]! . which forms a pointer to the string.text:00000028 .text:0000002C . 6. It is just writing the value of SP into X29. 5. 2. 2 mov w3. #7 R1. 3.text:00000028 . 0 adrp x0.LC1: . #5 R12. SP. x0.text:00000028 . PRINTF() WITH SEVERAL ARGUMENTS 6. :lo12:. }. #2 R1.2 on page 40.. 0 instruction forms the stack frame. 0 . 3 bl printf mov w0.CHAPTER 6.2.text:00000028 . [SP+4+var_4]. 7.text:0000005C Ç =%".h> int main() { printf("a=%d. [SP. {R0-R3} R0.#0x18+var_18] R3.text:00000058 . c=%d.text:00000050 . #0x14 R3. b=%d. SP. h=%d\n".text:00000028 . d=%d. return 0. Optimizing Keil 6/2013: ARM mode . SP.9: Non-optimizing GCC (Linaro) 4. so the 1. #4 R0. g=%d.string "a=%d. "a=%d. e=%d. .9 generates the same code. 16 ret The first instruction STP (Store Pair) saves FP (X29) and LR (X30) in the stack. #0x18+var_14 R12.text:00000028 . #6 R0. #0x14 PC. x30. Optimizing GCC (Linaro) 4. aADBDCDDDEDFDGD . set stack frame (FP=SP): add x29.1. SP.

2 on page 445. 7 and 8 via the stack: they are stored in the R0.#4'' instruction loads the saved LR value from the stack into the PC register. #8 SP. This instruction is somewhat similar to POP PC in x863 . The other 4 32-bit values are to be passed through registers. and then SP is increased by 4.text:00000036 . The second ``SUB SP. "a=%d. #3 R2. R2 and R3 registers respectively.text:0000002C . Why does IDA display the instruction like that? Because it wants to illustrate the stack layout and the fact that var_4 is allocated for saving the LR value in the local stack. 6. STMIA abbreviates Store Multiple Increment After. #1 R0.2 on page 445. R0-R3'' instruction writes registers R0-R3 contents to the memory pointed by R12. so the offset is to be 0.text:0000001C .#0x18+var_18]'' instruction is saved on the stack. #0x18+var_14'' instruction writes the stack address where these 4 variables are to be stored. #6 R0. SP + 4 is to be stored into the R12 register. SP.#0x18+var_18] R3. #0x14 R3. is impossible to set IP/EIP/RIP value using POP in x86.text:0000001C . but anyway. • Passing 5. Then. This implies that SP is to be decreased by 4 first. 2 and 3 via registers: The values of the first 3 numbers (a. SP.text:0000001C . into the R12 register. but it will all be rewritten during the execution of subsequent functions.text:00000028 .text:0000001C .text:0000001C . • Passing 4 via the stack: 4 is stored in R0 and then this value. ARM This code can be divided into several parts: • Function prologue: The very first ``STR LR. {R0-R2} R0. aADBDCDDDEDFDGD . Optimizing Keil 6/2013: thumb mode . [SP+4+var_4].text:00000024 . ⤦ 2 Read 3 It more about it: 28..text:00000026 .CHAPTER 6. #7 R1. e=%d. Indeed. R2 and R3 registers right before the printf() call. Exclamation mark at the end indicates pre-index. • Passing 1.text:0000002A . #4 R0. and the other 5 values are passed via the stack: • printf() call. The var_? macros generated by IDA reflect local variables in the stack. 46 . [SP]. This is referred as post-index2 . b=%d. and then LR will be saved at the address stored in SP.text:00000038 Ç g=%".. var_14 is an assembly macro. printf_main2 var_18 = -0x18 var_14 = -0x14 var_8 = -8 00 08 85 04 07 06 05 01 07 04 00 03 02 01 A0 B5 23 B0 93 22 21 20 AB C3 20 90 23 22 21 A0 PUSH MOVS SUB STR MOVS MOVS MOVS ADD STMIA MOVS STR MOVS MOVS MOVS ADR {LR} R3. f=%d. thus cleaning the stack.text:00000030 . [SP. d=%d. and each one occupies 4 bytes. The ``LDR PC.text:0000001E . thus causing the function to exit.text:00000034 . var_18 is −0x18. • Function epilogue: The ``ADD SP. equal to −0x14. you get the idea. Read more about it at: 28. #0x14'' instruction restores the SP pointer back to its former value. we need to pass 5 32-bit values via the stack to the printf() function.text:00000032 . I hope. This is similar to PUSH in x86.text:00000022 . #0x14'' instruction decreases SP (the stack pointer) in order to allocate 0x14 (20) bytes on the stack. #0x18+var_14 R3!. because we are going to use this register for the printf() call. SP. SP. R1.text:00000020 . c) (1. So. “Increment After” implies that R12 is to be increased by 4 after each register value is written. Of course. [SP. There is no exclamation mark—indeed. which is exactly 5 ∗ 4 = 20. [SP.text:0000002E . PC is loaded first from the address stored in SP (4 + var_4 = 4 + (−4) = 0. b. SP.text:0000001C .#var_4]!'' instruction saves LR on the stack. with the help of the ``STR R0.2. so this instruction is analogous to LDR PC. 3 respectively) are passed through the R1. #5 R3. created by IDA to conveniently display the code accessing the stack. 2. the ``ADD R12. thus the value from the R0 register (4) is to be written to the address written in SP.#0x18+var_8] R2. c=%d. #2 R1. The next ``STMIA R12.text:0000001C .#4). PRINTF() WITH SEVERAL ARGUMENTS 6. what was stored on the stack will stay there. [SP.

text:00000040 00 BD POP 6. #3 R9. Presumably. #1 R2. #6 47 . For example.PC} Almost the same as what we have already seen. R7 SP!. SP SP. #0x14 R0. {R7. 0x2920 and 0x2928. {R1. Optimizing Xcode 4. #0'' and ``MOV R2. {R7.LR} R7. this is thumb code and the values are packed into stack differently: 8 goes first. with the exception of STMFA (Store Multiple Full Ascending) instruction. PC. #0x14 {PC} The output is almost like in the previous example. On the other hand. the processor attempts to simultaneously execute instructions located side-by-side. #8 R1.3 (LLVM): ARM mode __text:0000290C __text:0000290C __text:0000290C __text:0000290C __text:0000290C __text:0000290C __text:00002910 __text:00002914 __text:00002918 __text:0000291C __text:00002920 __text:00002924 __text:00002928 __text:0000292C __text:00002930 __text:00002934 __text:00002938 __text:0000293C __text:00002940 __text:00002944 __text:00002948 __text:0000294C __text:00002950 __text:00002954 __text:00002958 _printf_main2 var_1C = -0x1C var_C = -0xC 80 0D 14 70 07 00 04 00 06 05 00 0A 08 01 02 03 10 A4 07 80 40 70 D0 05 C0 00 20 00 30 10 20 10 90 10 20 30 90 05 D0 80 2D A0 4D 01 A0 40 A0 8F A0 A0 8D 8D A0 A0 A0 A0 8D 00 A0 BD E9 E1 E2 E3 E3 E3 E3 E0 E3 E3 E5 E9 E3 E3 E3 E3 E5 EB E1 E8 STMFD MOV SUB MOV MOV MOVT MOV ADD MOV MOV STR STMFA MOV MOV MOV MOV STR BL MOV LDMFD SP!. #0 R2. R0 R3. Optimizing Xcode 4. which is a synonym of STMIB (Store Multiple Increment Before) instruction. SP. #0x14 R0. #0'' and ``ADD R0.text:0000003A 06 F0 D9 F8 BL .text:0000003E 05 B0 ADD . SP.6. at addresses 0x2918.#0x1C+var_C] _printf SP. 6.CHAPTER 6.#0x1C+var_1C] SP. #7 R0. PRINTF() WITH SEVERAL ARGUMENTS . #0x12D8 R12. #7 R0.3 (LLVM): thumb-2 mode __text:00002BA0 __text:00002BA0 __text:00002BA0 __text:00002BA0 __text:00002BA0 __text:00002BA0 __text:00002BA0 __text:00002BA2 __text:00002BA4 __text:00002BA6 __text:00002BAA __text:00002BAE __text:00002BB2 __text:00002BB4 __text:00002BB6 _printf_main2 var_1C = -0x1C var_18 = -0x18 var_C = -0xC 80 6F 85 41 4F C0 04 78 06 B5 46 B0 F2 D8 20 F0 07 0C F2 00 00 22 44 23 PUSH MOV SUB MOVW MOV. PC. SP. PC . #0x1570 R12. #4'' instructions can be executed simultaneously since the effects of their execution are not conflicting with each other. #5 R2.text:0000003E loc_3E .W MOVT. [SP. ARM __2printf . #4 R0. instructions like ``MOVT R0. This instruction increases the value in the SP register and only then writes the next register value into the memory. R0'' cannot be executed simultaneously since they both modify the R0 register.R12} R9. However. However.W MOVS ADD MOVS {R7. #6 R1. then 5. 7. CODE XREF: example13_f+16 SP. when it would be possible to do it in one point.R3. #4 R0. #0 R2. SP SP.LR} R7. [SP. the value in the R0 register is manipulated in three places. Usually. and 4 goes third. Another thing that catches the eye is that the instructions are arranged seemingly random.6. rather than performing those two actions in the opposite order.text:0000003E .2. the compiler tries to generate code in such a manner (wherever it is possible). For example. char * R3. ``MOVT R0. #2 R3. the optimizing compiler may have its own reasons on how to order the instructions so to achieve higher efficiency during the execution.

#0x1C+var_18 R2. . 6. d=%d.PC} The output is almost the same as in the previous example. ARM64 Non-optimizing GCC (Linaro) 4. store 9th argument in the stack mov w1.R3. [sp. SP. save FP and LR in stack frame: stp x29.4. because the number of registers is limited. e=%d. sp. MIPS R1. b=%d.9 Listing 6.W STR MOV. x30.3. :lo12:. x0. 6 mov w7. f=%d.5 The main difference with the “Hello.LC2: .W STMIA. #8 LR.string "a=%d. [sp] . #1 R2.LC2 mov w1. [sp. so it’s passed in X0. #3 R9. x29. #5 LR.W BLX ADD POP 6. d=%d. c=%d. 32 ret The first 8 arguments are passed in X. h=%d\n" f3: .#0x1C+var_C] _printf SP.11: Optimizing GCC 4. world!” example is that in this case printf() is called instead of puts() and 3 more arguments are passed through the registers $5 …$7 (or $A0 …$A2). b=%d. 16 adrp x0. b=%d.LC2 . #16 . All other values have a int 32-bit type. g=%d. 2 mov w3. restore FP and LR ldp x29.4. #2 R3. 5 mov w6. [SP.16] add sp.1 MIPS 3 arguments Optimizing GCC 4. The 9th argument (8) is passed via the stack. c=%d.9 . Indeed: it’s not possible to pass large number of arguments through registers. 9th argument str w1.W MOVS MOVS MOVS STR.R12} R1. e=%d. [SP.3 6. g=%d. {R1. f=%d.3.16] . sp. which implies they are used for function arguments passing. 3 mov w4. h=%d\n" add x0. so they are stored in the 32-bit part of the registers (W-). x30. 8 . "a=%d. Listing 6. 4 mov w5.9 generates the same code.ascii "a=%d. set stack frame (FP=SP): add x29. #0x14 {R7. c=%d\000" main: 48 .5 (assembly output) $LC0: . sp.10: Non-optimizing GCC (Linaro) 4. Optimizing GCC (Linaro) 4. grab more space in stack: sub sp.CHAPTER 6. 1 mov w2. #32 .or W-registers: [ARM13c]. A string pointer requires a 64-bit register. SP. PRINTF() WITH SEVERAL ARGUMENTS __text:00002BB8 __text:00002BBA __text:00002BBE __text:00002BC0 __text:00002BC4 __text:00002BC8 __text:00002BCA __text:00002BCC __text:00002BCE __text:00002BD2 __text:00002BD6 __text:00002BD8 05 0D 00 4F 8E 01 02 03 CD 01 05 80 21 F1 92 F0 E8 21 22 23 F8 F0 B0 BD 04 0E 08 09 0A 10 10 90 0A EA MOVS ADD.#0x1C+var_1C] R9. That is why these registers are prefixed with A-. with the exception that thumb-instructions are used instead. 7 bl printf sub sp.

set return value to 0: move $2.text:00000014 lw $t9.$0 . 3 . load address of the text string and set 1st argument of printf(): .5 Non-optimizing GCC is more verbose: Listing 6.%hi(__gnu_local_gp) addiu $sp.%lo($LC0) .text:00000000 .28($sp) .$28. branch delay slot Non-optimizing GCC 4. MIPS .text:0000000C sw $ra. set 3rd argument of printf(): . set 4th argument of printf() (branch delay slot): . load address of printf(): .4.text:00000038 jr $ra .text:00000000 .text:00000010 sw $gp.text:00000018 la $a0.-32 addiu $28.28($sp) .text:00000034 move $v0. -0x20 .32 .text:00000000 main: . return j $31 addiu $sp. 2 .13: Non-optimizing GCC 4. $LC0 # "a=%d.%call16(printf)($28) .text:00000020 li $a1. 1 . b=%d.%lo(__gnu_local_gp) sw $31. (__gnu_local_gp >> 16) . (printf & 0xFFFF)($gp) . function prologue: . c=%d" . function epilogue: .12: Optimizing GCC 4.2 # 0x2 .text:0000002C li $a3. b=%d. set 2nd argument of printf(): . set return value to 0: .text:00000030 lw $ra.text:0000003C addiu $sp. function prologue: 49 . set 4th argument of printf() (branch delay slot): li $7. return . set 3rd argument of printf(): li $6. call printf(): .text:00000000 var_10 = -0x10 .text:00000008 la $gp.4. branch delay slot Listing 6. $zero .text:00000004 addiu $sp. load address of printf(): lw $25.3.text:00000000 var_4 = -4 .3 # 0x3 .CHAPTER 6. set 2nd argument of printf(): li $5. load address of the text string and set 1st argument of printf(): lui $4.$sp.4.text:00000028 jalr $t9 .$sp.5 (IDA) . call printf(): jalr $25 . function prologue: lui $28. 0x20+var_4($sp) . (__gnu_local_gp & 0xFFFF) .1 # 0x1 . c=%d\000" main: .5 (assembly output) $LC0: .%hi($LC0) addiu $4. function epilogue: lw $31. 0x20 .text:00000000 lui $gp. 0x20+var_4($sp) .$4.ascii "a=%d. PRINTF() WITH SEVERAL ARGUMENTS 6.text:00000024 li $a2. 0x20+var_10($sp) .

function epilogue: lw $28. set return value to 0: move $2.text:00000000 main: .24($sp) addiu $sp.$2 jalr $25 nop . return j $31 nop Listing 6. $fp.text:00000000 var_10 = -0x10 .text:00000034 lw . get address of printf(): . .%hi(__gnu_local_gp) addiu $28.text:00000024 move .%lo($LC0) set 1st argument of printf(): move $4. $v0 $t9 $at.24($sp) move $fp.text:00000000 var_4 = -4 .text:00000044 or . (printf & 0xFFFF)($gp) $at. $gp. PRINTF() WITH SEVERAL ARGUMENTS . 1 $a2.text:00000040 jalr .$sp.text:0000000C move .text:00000018 sw .28($sp) sw $fp.5 (IDA) .text:00000010 la . $zero $t9.4.CHAPTER 6. 3 $v0. aADBDCD # "a=%d. set 4th argument of printf(): .28($sp) lw $fp. . $v0 $a1. set 3rd argument of printf(): . NOP 50 .$sp.text:00000038 or .text:00000000 var_8 = -8 . 6. .$2. .text:00000000 addiu . c=%d" $a0.$2 set 2nd argument of printf(): li $5. set 1st argument of printf(): . 2 $a3. $ra.text:0000003C move .$0 move $sp. load address of the text string: .3.text:00000004 sw .14: Non-optimizing GCC 4. .text:0000001C la .3 # 0x3 get address of printf(): lw $2.%lo(__gnu_local_gp) load address of the text string: lui $2.2 # 0x2 set 4th argument of printf(): li $7. $gp.1 # 0x1 set 3rd argument of printf(): li $6.%call16(printf)($28) nop call printf(): move $25. set 2nd argument of printf(): . .text:00000030 li . b=%d.text:00000000 .$28. call printf(): . function prologue: . function epilogue: $sp. MIPS addiu $sp.text:0000002C li .text:00000008 sw . -0x20 0x20+var_4($sp) 0x20+var_8($sp) $sp __gnu_local_gp 0x20+var_10($sp) $v0.$fp lw $31.32 .$sp lui $28. $fp. $zero .-32 sw $31.%hi($LC0) addiu $2.text:00000000 .16($fp) .text:00000028 li .

$4. h=%d\012\000" main: .h> int main() { printf("a=%d.2 6.15: Optimizing GCC 4. g=%d.5 # 0x5 sw $2. $sp. 6.1 # 0x1 . Other calling conventions (like N32) may use the registers for different purposes.%call16(printf)($28) sw $2.$sp. d=%d. 2.5 Only the first 4 arguments are passed in the $A0 …$A3 registers. jr or $ra $at.2 # 0x2 .4. 1.3 # 0x3 51 .3. $fp. This is the O32 calling convention (which is the most common one in the MIPS world). 0x20+var_10($fp) move move lw lw addiu $v0.3. PRINTF() WITH SEVERAL ARGUMENTS .-56 addiu $28.text:0000004C . }. f=%d. e=%d. c=%d. MIPS lw $gp.text:00000054 . c=%d. pass 8th argument in stack: li $2. $ra. 8). Listing 6.text:00000060 .$28.52($sp) . return 0.28($sp) .4. call printf(): jalr $25 .ascii "a=%d.20($sp) . 4.7 # 0x7 lw $25.text:00000064 6.text:00000048 . d=%d. f=%d.CHAPTER 6. 5. 7. SW abbreviates “Store Word” (from register to memory). $sp. b=%d.32($sp) addiu $4. g=%d. pass 9th argument in stack: li $2.24($sp) . e=%d. Optimizing GCC 4. pass 7th argument in stack: li $2. the rest are passed via the stack.text:00000050 . pass 6th argument in stack: li $2. #include <stdio.4 # 0x4 sw $2.6 # 0x6 sw $2. $zero . pass 1st argument in $a0: lui $4. MIPS lacks instructions for storing a value into memory. NOP $zero $fp 0x20+var_4($sp) 0x20+var_8($sp) 0x20 8 arguments Let’s use again the example with 9 arguments from the previous section: 6.1. so an instruction pair has to be used instead (LI/SW).%hi(__gnu_local_gp) addiu $sp. pass 2nd argument in $a1: li $5.2 on page 40. h=%d\n".%hi($LC0) .16($sp) .text:0000005C .text:00000058 . return . function prologue: lui $28. b=%d.%lo($LC0) . 3.%lo(__gnu_local_gp) sw $31. pass 3rd argument in $a2: li $6. pass 5th argument in stack: li $2. set return value to 0: . pass 4th argument in $a3 (branch delay slot): li $7.8 # 0x8 sw $2.5 (assembly output) $LC0: .

3 .text:0000000C sw $ra.text:00000048 li $a1. set return value to 0: move $2.text:00000024 li $v0. branch delay slot Non-optimizing GCC 4. e=%d. pass 6th argument in stack: . function epilogue: lw $31. ($LC0 & 0xFFFF) # "a=%d. 4 .text:00000000 var_28 = -0x28 . g=%".text:00000034 sw $v0. MIPS .56 .text:0000005C move $v0.text:00000000 . (printf & 0xFFFF)($gp) .text:00000008 la $gp.text:0000003C li $v0. 0x38+var_28($sp) . pass 4th argument in $a3 (branch delay slot): . 5 . function epilogue: . pass 2nd argument in $a1: . pass 3rd argument in $a2: .3. return j $31 addiu $sp. pass 1st argument in $a1: . 0x38+var_4($sp) .text:00000054 li $a3. 2 .text:00000000 var_1C = -0x1C .16: Optimizing GCC 4.text:00000038 lui $a0. PRINTF() WITH SEVERAL ARGUMENTS 6. prepare 1st argument in $a0: .text:00000064 addiu $sp. return .text:0000004C li $a2. 0x38+var_24($sp) .4.text:00000030 lw $t9.text:00000000 main: . 0x38+var_4($sp) .text:00000004 addiu $sp.52($sp) . pass 7th argument in stack: .text:00000000 . 6 . 1 . pass 9th argument in stack: .text:00000058 lw $ra. b=%d. c=%d. b=%d.text:00000000 var_4 = -4 .text:00000010 sw $gp. 0x38+var_10($sp) .text:00000000 var_24 = -0x24 . 7 .text:00000000 var_20 = -0x20 . $zero .$0 .text:00000018 sw $v0. 0x38 .text:00000028 sw $v0..text:0000001C li $v0.. c=%d.text:00000020 sw $v0.text:00000050 jalr $t9 .text:00000000 lui $gp. call printf(): . (__gnu_local_gp >> 16) .5 (IDA) .text:00000014 li $v0.text:00000044 la $a0. -0x38 . . ($LC0 >> 16) # "a=%d.text:00000040 sw $v0. g=%".text:00000060 jr $ra . 0x38+var_18($sp) .text:00000000 var_10 = -0x10 . pass 8th argument in stack: .. 0x38+var_1C($sp) . d=%d. function prologue: . f=%d⤦ Ç .. e=%d. 8 . f⤦ Ç =%d. d=%d.5 Non-optimizing GCC is more verbose: 52 . .text:0000002C li $v0. pass 5th argument in stack: . branch delay slot Listing 6.CHAPTER 6.$sp.4. (__gnu_local_gp & 0xFFFF) . set return value to 0: .text:00000000 var_18 = -0x18 . 0x38+var_20($sp) .

$0 move $sp.text:00000000 var_4 . g=%d.$28. pass 2nd argument in $a1: li $5.48($sp) addiu $sp. function epilogue: lw $28.$2. h=%d\012\000" main: . function prologue: addiu $sp.20($sp) . pass 4th argument in $a3: li $7.4 # 0x4 sw $3.text:00000000 . 0x38+var_4($sp) 53 . b=%d.CHAPTER 6.18: Non-optimizing GCC 4.3. e=%d.40($fp) .32($sp) .text:00000000 var_18 .%call16(printf)($28) nop move $25.%hi(__gnu_local_gp) addiu $28. pass 8th argument in stack: li $3.5 (assembly output) $LC0: .-56 sw $31. c=%d.4.$2 .$sp lui $28. pass 1st argument in $a0: move $4.52($sp) sw $fp.7 # 0x7 sw $3.28($sp) . pass 5th argument in stack: li $3.text:00000000 main: . set return value to 0: move $2. PRINTF() WITH SEVERAL ARGUMENTS 6.5 (IDA) .52($sp) lw $fp.56 .16($sp) .5 # 0x5 sw $3.8 # 0x8 sw $3.text:00000000 var_8 .text:00000000 var_20 .24($sp) .17: Non-optimizing GCC 4.text:00000000 var_10 .6 # 0x6 sw $3. return j $31 nop Listing 6. pass 7th argument in stack: li $3.$sp.%lo(__gnu_local_gp) lui $2.3 # 0x3 .text:00000000 var_1C .text:00000000 .$fp lw $31.$2 jalr $25 nop . f=%d. d=%d.48($sp) move $fp.$sp.1 # 0x1 .4.ascii "a=%d. pass 3rd argument in $a2: li $6. pass 9th argument in stack: li $3. -0x38 $ra.2 # 0x2 .text:00000000 var_28 . MIPS Listing 6.text:00000004 = = = = = = = = -0x28 -0x24 -0x20 -0x1C -0x18 -0x10 -8 -4 addiu sw $sp. function prologue: .%hi($LC0) addiu $2.%lo($LC0) . call printf(): lw $2.text:00000000 var_24 .text:00000000 . pass 6th argument in stack: li $3.

NOP $zero $fp 0x38+var_4($sp) 0x38+var_8($sp) 0x38 Conclusion Here is a rough skeleton of the function call: Listing 6. 0x38+var_28($sp) li sw $v1. modify stack pointer (if needed) Listing 6. g=%".text:00000040 . e=%d.. 0x38+var_18($sp) move $a0. PUSH 3rd argument PUSH 2nd argument PUSH 1st argument CALL function . $gp.text:00000010 . 0x38+var_1C($sp) li sw $v1. 2 li $a3. $ra. $sp.text:0000003C . 4 $v1.4. 3rd argument MOV R9. b=%d.text:00000018 .text:0000005C .. NOP (printf & 0xFFFF)($gp) $zero $v0 $zero .text:00000068 . pass 4th argument in $a3: .text:00000008 .20: x64 (MSVC) MOV RCX.text:00000048 . 1st argument MOV RDX.text:00000030 . 5 $v1. 1 li $a2.. $t9 $at. 0x38+var_8($sp) $sp __gnu_local_gp 0x38+var_10($sp) aADBDCDDDEDFDGD # "a=%d.text:0000002C . 0x38+var_10($fp) move move lw lw addiu $v0. etc (if needed) CALL function 54 . 4th argument . pass 6th argument in stack: . CONCLUSION sw move la sw la $fp. pass 1st argument in $a0: . $fp. call printf(): . 3 lw or move jalr or $v0. return .text:00000060 . set return value to 0: .text:00000088 . 6th argument. 8 $v1. f⤦ li sw $v1. $zero . $at. c=%d.text:0000006C .19: x86 . 0x38+var_20($sp) li sw $v1..text:0000001C Ç =%d. 7 $v1.text:00000058 . d=%d. pass 9th argument in stack: . $v0 li $a1..text:0000008C 6.4 6.text:0000000C . pass 3rd argument in $a2: . PRINTF() WITH SEVERAL ARGUMENTS . pass 7th argument in stack: .text:00000080 . $fp.text:00000038 .text:00000044 . .text:00000064 . function epilogue: .text:00000078 . $sp.text:00000070 .text:00000024 . $t9. jr or $ra $at.. PUSH 5th. pass 8th argument in stack: . 0x38+var_24($sp) li sw $v1. $gp.text:00000084 . lw $gp. pass 2nd argument in $a1: .text:00000050 .text:0000004C . $v0.text:00000028 .text:00000074 .text:0000007C . 2nd argument MOV R8. 6 $v1.CHAPTER 6.text:00000054 . pass 5th argument in stack: .text:00000034 .

in stack (if needed) LW temp_reg. ARM and MIPS is a good illustration of the fact that the CPU is oblivious to how the arguments are passed to functions. 8th argument. 2nd argument MOV R2. it works fine. 6th argument. 3rd argument MOV X3.23: ARM64 MOV X0. etc. 5th argument MOV X5. We may also recall how newcoming assembly language programmers passing arguments into other functions: usually via registers. Of course. etc. address of function JALR temp_reg 6. modify stack pointer (if needed) Listing 6. AKA $A2 LI $7. pass 9th. maybe except $ZERO) to pass data or use any other calling convention. BY THE WAY . modify stack pointer (if needed) Listing 6. or even via global variables. fastcall. 5th argument MOV R9. 7th argument MOV X7. AKA $A3 . 4th argument MOV R8.22: ARM MOV R0. Programmers may use any other register (well. etc (if needed) CALL function . 4th argument . 6th argument . 4th argument . without any explicit order. PUSH 7th. 4th argument MOV X4. 55 . modify stack pointer (if needed) Listing 6. AKA $A1 LI $6. PRINTF() WITH SEVERAL ARGUMENTS 6. x64. 3rd argument MOV RCX. pass 5th. in stack (if needed) BL function . 1st argument MOV R1. 6th argument.24: MIPS (O32 calling convention) LI $4. 10th argument. in stack (if needed) BL CALL function .21: x64 (GCC) MOV RDI. 1st argument MOV RSI. 2nd argument MOV X2. MIPS $A0 …$A3 registers are labelled this way only for convenience (that is in the O32 calling convention). pass 5th. modify stack pointer (if needed) Listing 6.. etc.5 By the way By the way. 2nd argument . 3rd argument . 8th argument . 1st argument MOV X1. 2nd argument MOV RDX.CHAPTER 6. 6th argument MOV X6.. 1st argument . 3rd argument MOV R3. this difference between the arguments passing in x86. AKA $A0 LI $5.5. It is also possible to create a hypothetical compiler able to pass arguments via a special structure without using stack at all. The CPU is not aware of calling conventions whatsoever.

in the compiled code there is no information about pointer types at all. to illustrate passing a pointer to a variable of type int. In C/C++ the pointer type is only needed for compile-time type checking.1.1.1 About pointers Pointers are one of the fundamental concepts in computer science. while in x86–64 it is a 64-bit number(occupying 8 bytes). So the simplest thing to do is to pass the address of the array or structure to the callee function.. it also needs to return all these values. 7.2 x86 MSVC Here is what we get after compiling with MSVC 2010: CONST $SG3831 $SG3832 SEGMENT DB 'Enter X:'. while passing their address is much cheaper.. In x86. only the block size matters. it is not clever to use scanf() for user interactions nowadays. given some effort. SCANF() Chapter 7 scanf() Now let’s use scanf(). printf ("You entered %d. }. 7. Internally. takes 2 pointers of type void* as arguments. it is occupying 4 bytes). scanf ("%d". &x).e. Data types are not important..h> int main() { int x. It is possible to work with untyped pointers only. however. printf ("Enter X:\n"). the address is represented as a 32-bit number (i. passing a large array. Often. that is the reason behind some people’s indignation related to switching to x86-64 —all pointers in the x64-architecture require twice as much space. e. 00H DB '%d'. I agree. that copies a block from one memory location to another.CHAPTER 7. the standard C function memcpy(). Pointers are also widely used when a function needs to return more than one value (we are going to get back to this later ( 10 on page 98) ). 0aH. I wanted.g. return 0. Besides the fact that the function needs to indicate how many values were successfully read. In addition if the callee function needs to modify something in the large array or structure received as a parameter and return back the entire structure then the situation is close to absurd.\n". A pointer in C/C++ is simply an address of some memory location. structure or object as an argument to another function is too expensive. By the way. and let it change what needs to be changed. scanf() is such a case. OK. since it is impossible to predict the type of the data you would like to copy. 7.1 Simple example #include <stdio. x). 00H 56 .

. marked in IDA as var_4 saved value of EBP return address argument#1. '%d' call _scanf add esp. In fact it allocates 4 bytes on the stack for storing the x variable. 00H CONST ENDS PUBLIC _main EXTRN _scanf:PROC EXTRN _printf:PROC . 57 . Function compile flags: /Odtp _TEXT SEGMENT _x$ = -4 . 8 mov ecx. DWORD PTR _x$[ebp] push eax push OFFSET $SG3832 . We could say that in this case LEA simply stores the sum of the EBP register value and the _x$ macro in the EAX register. Traditionally.1. x is to be accessed with the assistance of the _x$ macro (it equals to -4) and the EBP register pointing to the current frame.CHAPTER 7. Over the span of the function’s execution. The value of the EBP could be perceived as a frozen state of the value in ESP at the start of the function’s execution. ebp pop ebp ret 0 _main ENDP _TEXT ENDS x is a local variable.. 4 lea eax. PUSH ECX. esp push ecx push OFFSET $SG3831 . The first one is a pointer to the string containing ``%d'' and the second is the address of the x variable. size = 4 _main PROC push ebp mov ebp.. This is the same as lea eax. although that is not very convenient since it changes frequently. Next the EAX register value is pushed into the stack and scanf() is being called. local variables are stored on the stack. First. There are probably other ways to allocate them.2 on page 933). 0aH. EBP is pointing to the current stack frame making it possible to access local variables and function arguments via EBP+offset. marked in IDA as arg_8 … The scanf() function in our example has two arguments. marked in IDA as var_8 local variable #1. DWORD PTR _x$[ebp] push ecx push OFFSET $SG3833 . and is often used for forming an address ( A.'. According to the C/C++ standard it must be visible only in this function and not from any other external scope. So. 8 . 4 is being subtracted from the EBP register value and the result is loaded in the EAX register. return 0 xor eax. Here is a typical stack frame layout in 32-bit environment: … EBP-8 EBP-4 EBP EBP+4 EBP+8 EBP+0xC EBP+0x10 … … local variable #2. DWORD PTR _x$[ebp] instruction LEA stands for load effective address. 'You entered %d. marked in IDA as arg_4 argument#3. 'Enter X:' call _printf add esp. but in x86 that is the way it is. [ebp-4]. It is also possible to use ESP for the same purpose. eax mov esp.' call _printf add esp. SCANF() 7.6. is not to save the ECX state (notice the absence of corresponding POP ECX at the function’s end). marked in IDA as arg_0 argument#2. The goal of the instruction following the function prologue. the x variable’s address is loaded into the EAX register by the lea eax.. SIMPLE EXAMPLE $SG3833 DB 'You entered %d.

\n''.CHAPTER 7. Next the value in the ECX is stored on the stack and the last printf() is being called. The second argument is prepared with: mov ecx. SIMPLE EXAMPLE printf() is being called after that with its first argument ..1. SCANF() 7. The instruction stores the x variable value and not its address. [ebp-4]. 58 . in the ECX register.a pointer to the string: ``You entered %d..

The breakpoint will be triggered when main() begins. Let’s load it and keep pressing F8 (step over) until we reach our executable file instead of ntdll. 123. Scroll up until main() appears.2: User input in the console window 59 .1: OllyDbg: The address of the local variable is calculated Right-click the EAX in the registers window and then select “Follow in stack”. for example.CHAPTER 7. points to the variable in the local stack. SIMPLE EXAMPLE 7. then F9 (Run). Let’s trace with F8 until the scanf() execution completes. press F2 (set a breakpoint). Let’s trace to the point where the address of the variable x is calculated: Figure 7.1.1. The red arrow. Now with the help of PUSH instruction the address of this stack element is going to be stored to the same stack on the next position. in the console window: Figure 7. This address will appear in the stack window. SCANF() 7. Click on the first instruction (PUSH EBP).dll. During the scanf() execution. I have added. At that moment this location contains some garbage (0x6E494714).3 MSVC + OllyDbg Let’s try this example in OllyDbg. we input.

1. SCANF() 7.3: OllyDbg: scanf() executed scanf() returns 1 in EAX. 60 . If we look again at the stack element corresponding to the local variable it now contains 0x7B (123).CHAPTER 7. SIMPLE EXAMPLE scanf() completed its execution already: Figure 7. which implies that it has read successfully one value.

control flow slips from one expression to another. and so in resulting machine code. 0FFFFFFF0h esp. eax ___isoc99_scanf edx. rather than the stack. As in the MSVC example —the arguments are placed on the stack using the MOV instruction. 7. 20h [esp+20h+var_20]. 0 GCC replaced the printf() call with call to puts().4: OllyDbg: preparing the value for passing to printf() GCC Let’s try to compile this code in GCC 4. 61 . SIMPLE EXAMPLE Further. this value is copied from the stack to the ECX register and passed to printf(): Figure 7.. "You entered %d.4. SCANF() 7. "%d" edx.. "Enter X:" _puts eax. The reason for this was explained in ( 3. this simple example is a demonstration of the fact that compiler translates list of expressions in C/C++-block into sequential list of instructions.4. offset aEnterX . edx [esp+20h+var_20]. offset aD . offset aYouEnteredD___ . esp esp. There are nothing between expressions in C/C++. are used for arguments passing.4 x64 The picture here is similar with the difference that the registers. [esp+20h+var_4] [esp+20h+var_1C]. By the way By the way. eax _printf eax. edx [esp+20h+var_20].CHAPTER 7.3 on page 14). [esp+20h+var_4] eax.1 under Linux: main proc near var_20 var_1C var_4 = dword ptr -20h = dword ptr -1Ch = dword ptr -4 main push mov and sub mov call mov lea mov mov call mov mov mov mov call mov leave retn endp ebp ebp.1.1.\n" [esp+20h+var_1C]. there are nothing between.

'.4. eax printf .string "%d" . eax add rsp. 00H DB 'You entered %d. return 0 xor eax.\n" eax.6 x64 . 0aH.. return 0 xor eax.2: Optimizing GCC 4. 'You entered %d.text:00000042 . OFFSET FLAT:$SG1291 .text:00000042 scanf_main var_8 = -8 62 ... eax __isoc99_scanf esi.LC0: . 'Enter X:' printf rdx. OFFSET FLAT:.string "You entered %d.LC2 . OFFSET FLAT:$SG1292 ...LC2: . DWORD PTR x$[rsp] rcx. OFFSET FLAT:.\n" main: sub mov call lea mov xor call mov mov xor call rsp. '%d' scanf edx. [rsp+12] edi.LC1 .LC0 . 00H ENDS _TEXT SEGMENT x$ = 32 main PROC $LN3: sub lea call lea lea call mov lea call main _TEXT rsp. "%d" eax. 0aH. SIMPLE EXAMPLE MSVC Listing 7.LC1: . SCANF() 7. "Enter X:" puts rsi.' printf .text:00000042 . 24 edi.1: MSVC 2012 x64 _DATA $SG1289 $SG1291 $SG1292 _DATA SEGMENT DB 'Enter X:'. OFFSET FLAT:$SG1289 . 56 ret 0 ENDP ENDS GCC Listing 7. "You entered %d.CHAPTER 7.. 00H DB '%d'. 56 rcx. OFFSET FLAT:.1..1. DWORD PTR [rsp+12] edi. QWORD PTR x$[rsp] rcx. 24 ret 7. eax add rsp.string "Enter X:" .5 ARM Optimizing Keil 6/2013 (thumb mode) ..

string "Enter X:" .. aD . passed to scanf(). return 0 mov w0.text:00000056 . .\n" string . so we need 4 bytes to store it somewhere in memory.\n" f4: . this value is moved from the stack to the R1 register in order to be passed to printf().text:00000042 . . ARM64 Listing 7. [sp]. W1=x . restore FP and LR: ldp x29.text:00000046 . 0 . The address is passed to scanf().CHAPTER 7.text:0000004A . 63 .1 ARM64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 . SP R0. sp. X1=address of "x" variable .text:00000054 .text:00000052 . X0=pointer to the "Enter X:" string .LC2 add x0. A place for the local variable x is allocated in the stack and IDA has named it var_8.LC0: . pass the address to scanf() and call it: bl __isoc99_scanf . 32 ret The most interesting part is finding space for the x variable in the stack frame (line 22).28] . #0 {R3..string "%d" .LC1: . aEnterX .text:0000005A .PC} In order for scanf() to be able to read item it needs a parameter—pointer to an int.LC1 add x0. It is not necessary.text:00000044 . 0 .\n" __2printf R0. The value is fetched at line 27 and then passed to printf(). x29. and it fits exactly in a 32-bit register. save FP and LR in stack frame: stp x29... :lo12:.1. :lo12:. -32]! . which just stores the user input value in the memory at that address.LC0 add x0. Later. find a space in stack frame for "x" variable (X1=FP+28): add x1. x0.text:00000042 .LC1 . [sp.string "You entered %d.LC0 . SP’s value is copied to the R1 register and. "You entered %d. So.9.. load pointer to the "Enter X:" string: adrp x0. SIMPLE EXAMPLE 08 A9 06 69 AA 06 00 A9 06 00 08 B5 A0 F0 D3 F8 46 A0 F0 CD F8 99 A0 F0 CB F8 20 BD PUSH ADR BL MOV ADR BL LDR ADR BL MOVS POP {R3. set stack frame (FP=SP) add x29. x0.text:0000005C 7. load 32-bit value from the variable in stack frame: ldr w1.LC2: . This is 32-bit value of type int. "Enter X:\n" __2printf R1. [SP.LC2 bl printf . load pointer to the "%d" string: adrp x0. together with the format-string. x30. int is 32-bit. to allocate a such since SP (stack pointer) is already pointing to that space and it can be used directly. printf() will take text string from X0 and "x" variable from X1 (or W1) adrp x0. with the help of the LDR instruction. print it: bl puts .text:0000004C .text:0000004E .LR} R0. aYouEnteredD___ . .#8+var_8] R0. load pointer to the "You entered %d. 28 . however.. [x29. "%d" __0scanf R1.3: Non-optimizing GCC 4. x0. x30. :lo12:. SCANF() .

text:00000000 var_18 . $sp.40 .1.$4.16($sp) lui $4.1.24($sp) lw $25.\012\000" main: .%lo(__gnu_local_gp) sw $31. SCANF() 7.%call16(__isoc99_scanf)($28) . branch delay slot . set 2nd argument of scanf().$sp.text:0000001C .text:00000004 .text:00000010 . ($LC0 & 0xFFFF) # "Enter X:" . branch delay slot .text:00000000 .text:00000018 . ($LC0 >> 16) # "Enter X:" $t9 $a0.text:00000000 var_10 .ascii "%d\000" $LC1: $LC2: . .text:00000000 . $a1=$sp+24: addiu $5.$sp.5 (assembly output) $LC0: . set 2nd argument of printf()..text:00000020 = -0x18 = -0x10 = -4 lui addiu la sw sw $gp.text:00000014 . $gp.text:00000000 .%lo($LC0) .$4.16($sp) .%lo($LC1) .$28. load word at address $sp+24: lw $5. $ra. SIMPLE EXAMPLE 7. function prologue: lui $28.$0 .$4. branch delay slot 64 . $gp.%hi($LC0) jalr $25 addiu $4.%lo($LC2) .%call16(printf)($28) lui $4.CHAPTER 7.5: Optimizing GCC 4.%hi($LC1) lw $25. function epilogue: lw $31. call scanf(): lw $28.4. and it is to be referred as $sp + 24.%hi(__gnu_local_gp) addiu $sp.ascii "You entered %d. set return value to 0: move $2. (__gnu_local_gp >> 16) -0x28 (__gnu_local_gp & 0xFFFF) 0x28+var_4($sp) 0x28+var_18($sp) lw lui jalr la $t9.ascii "Enter X:\000" . (puts & 0xFFFF)($gp) $a0. call printf(): lw $28.6 MIPS A place in the local stack is allocated for the x variable. return: j $31 addiu $sp. and the user input values is loaded using the LW (“Load Word”) instruction and then passed to printf().. branch delay slot .text:0000000C .$sp.36($sp) . call puts(): lw $25. function prologue: .4: Optimizing GCC 4. Listing 7.5 (IDA) .-40 addiu $28.4.%hi($LC2) jalr $25 addiu $4.36($sp) .text:00000008 .%call16(puts)($28) lui $4.text:00000000 var_4 .24 jalr $25 addiu $4. call puts(): .text:00000000 main: . branch delay slot IDA displays the stack layout as follows: Listing 7. Its address is passed to scanf().

text:00000040 lw $a1. Function compile flags: /Odtp _TEXT SEGMENT _main PROC push ebp mov ebp. but for the sake of the experiment. .CHAPTER 7.\n". . 0aH. SCANF() . Global variables are considered anti-pattern. call scanf(): .text:0000005C jr $ra .text:00000030 addiu $a1. .text:00000034 jalr $t9 . .text:0000003C lw $gp.text:00000044 lw $t9. set return value to 0: .. call printf(): . }. Ç delay slot . 00H $SG2458 DB 'You entered %d. set 2nd argument of scanf(). printf ("You entered %d. . . 0aH.text:00000050 la $a0.2.text:00000060 addiu $sp. . 7. return 0. load word at address $sp+24: . branch delay slot ($LC1 & 0xFFFF) # "%d" 0x28+var_18($sp) 0x28+var_10($sp) (printf & 0xFFFF)($gp) ($LC2 >> 16) # "You entered %d. #include <stdio... &x).2. $a1=$sp+24: . .. 7. int main() { printf ("Enter X:\n"). esp push OFFSET $SG2456 call _printf 65 .\n" ($LC2 & 0xFFFF) # "You entered %d.1 MSVC: x86 _DATA SEGMENT COMM _x:DWORD $SG2456 DB 'Enter X:'.text:00000058 move $v0. GLOBAL VARIABLES 0x28+var_18($sp) ($LC1 >> 16) # "%d" (__isoc99_scanf & 0xFFFF)($gp) $sp.text:00000028 lui $a0. function epilogue: . set 2nd argument of printf().. branch ⤦ 0x28+var_4($sp) $zero 0x28 . x). . we could do this.. 00H _DATA ENDS PUBLIC _main EXTRN _scanf:PROC EXTRN _printf:PROC .h> // now x is global variable int x. return: .text:00000054 lw $ra. 00H $SG2457 DB '%d'.text:00000038 la $a0.. branch delay slot 7.2 Global variables What if the x variable from the previous example was not local but a global one? Then it would have been accessible from any point. scanf ("%d". not only from the function body..text:0000004C jalr $t9 . .'.\n" .text:0000002C lw $t9.text:00000024 lw $gp. .text:00000048 lui $a0. . 0x28+var_10 .

Here we see a value 0xA of DWORD type (DD stands for DWORD = 32 bit) for this variable. 6. and after it you can see text strings.CHAPTER 7.data:0040FA80 . the OS will allocate a block of zeroes there1 .exe file these uninitialized variables do not occupy anything.data:0040FA90 ..data:0040FA8C . . 4 push OFFSET _x push OFFSET $SG2457 call _scanf add esp.exe from the previous example in IDA. why one needs to allocate space for variables initially set to zero?).exe to the memory. you would see something like this: . eax pop ebp ret 0 _main ENDP _TEXT ENDS In this case the x variable is defined in the _DATA segment and no memory is allocated in the local stack. not through the stack.data:0040FA8C . GLOBAL VARIABLES add esp. 8 mov eax. 1 That is how a VM behaves 66 . . a space for all these variables is to be allocated and filled with zeroes [ISO07.data:0040FA84 . .data:0040FA84 . . .exe in IDA.2. but when someone accesses their address.data:0040FA8C .data:0040FA88 .data:0040FA80 . . Now let’s explicitly assign a value to the variable: int x=10.data:0040FA88 . . SCANF() 7.. This is convenient for example when using large arrays. But in the . DWORD PTR _x push eax push OFFSET $SG2458 call _printf add esp. you can see the x variable placed at the beginning of the _DATA segment. This implies that after loading the . . It is accessed directly. 8 xor eax.data:0040FA94 _x dd ? dword_40FA84 dd ? dword_40FA88 dd ? . // default value We got: _DATA _x SEGMENT DD 0aH . LPVOID lpMem lpMem dd ? dword_40FA90 dd ? dword_40FA94 dd ? .7.8p10]. If you open the compiled . . DATA XREF: _main+10 _main+22 DATA XREF: _memset+1E unknown_libname_1+28 DATA XREF: ___sbh_find_block+5 ___sbh_free_block+2BC .data:0040FA90 . where the value of x was not set. DATA XREF: ___sbh_find_block+B ___sbh_free_block+2CA DATA XREF: _V6_HeapAlloc+13 __calloc_impl+72 DATA XREF: ___sbh_free_block+2FE _x is marked with ? with the rest of the variables that do not need to be initialized. If you open the compiled . Uninitialized global variables take no space in the executable file (indeed.

67 . 0x7B appears in the memory window (see the highlighted screenshot regions). After we have entered 123 in the console. Back to the example.CHAPTER 7. SCANF() 7. This implies that the lowest byte is written first. After the PUSH instruction (pushing the address of x) gets executed. and x86 uses little-endian. The memory address of x is 0x00C53394. GLOBAL VARIABLES 7. The variable will appear in the memory window on the left. Read more about it at: 31 on page 453. the 32-bit value is loaded from this memory address into EAX and passed to printf(). 00 00 00 7B should be there. and the highest written last. But why is the first byte 7B? Thinking logically.2 MSVC: x86 + OllyDbg Things are even simpler here: Figure 7.5: OllyDbg: after scanf() execution The variable is located in the data segment. Right-click on that row and select “Follow in dump”. the address appears in the stack window.2. The cause for this is referred as endianness.2.

which has the following attributes: . 0aH.CHAPTER 7. however. Segment type: Uninitialized . initialise the variable with some value e. Segment permissions: Read/Write If you.. 'Enter X:' printf rdx. 40 lea call lea rcx.4 MSVC: x64 Listing 7. with the difference that the uninitialized variables are located in the _bss segment.2. it is to be placed in the _data segment. Segment type: Pure data . 00H DB 'You entered %d.data PEsegment of our program: Figure 7. 00H DB '%d'.g.. GLOBAL VARIABLES In OllyDbg we can review the process memory map (Alt-M) and we can see that this address is inside the . SCANF() 7. OFFSET FLAT:x 68 .2.'. OFFSET FLAT:$SG2924 .6: OllyDbg: process memory map 7. 10. 00H ENDS _TEXT main $LN3: SEGMENT PROC sub rsp.6: MSVC 2012 x64 _DATA COMM $SG2924 $SG2925 $SG2926 _DATA SEGMENT x:DWORD DB 'Enter X:'. Segment permissions: Read/Write 7.3 GCC: x86 The picture in Linux is near the same. In ELF file this segment has the following attributes: .2. 0aH.

while the variable’s value is passed to the second printf() using a MOV instruction. "You entered %d.data.text:00000020 .data .text:00000000 . =x R0.data:00000048 .data:00000048 .text:00000008 . DATA . Please note that the address of the x variable is passed to scanf() using a LEA instruction.2.0 . Segment type: Pure code AREA .. ORG 0x48 EXPORT x x DCD 0xA . Moreover it could possibly change often. obviously. return 0 xor eax.text:00000034 .text:0000000C .5 ARM: Optimizing Keil 6/2013 (thumb mode) . The code segment might sometimes be located in a ROM2 chip (remember. 2 Read-only memory memory 3 Random-access 69 .text:00000030 . because after powering on. we now deal with embedded microelectronics.data:00000048 .text:00000033 . One could ask.text:00000010 . DATA XREF: main+8 . [R0] R0.. OFFSET FLAT:$SG2925 .. contains random information.text:00000047 . 'You entered %d.\n" __2printf R0. Segment type: Pure data AREA .text:00000002 . constant variables in RAM must be initialized.. CODE main PUSH ADR BL LDR ADR BL LDR LDR ADR BL MOVS POP aEnterX off_2C {R4..text:0000001A . aEnterX .text:00000000 .0 DCB 0 DCB 0 DCD x aD ..".text:0000002C . DATA XREF: main+A DCB "%d". SCANF() lea call mov lea call 7. and memory scarcity is common here).text:00000047 . .text:0000002B .data:00000048 . eax main _TEXT add ret ENDP ENDS rsp.CHAPTER 7. '%d' scanf edx.2..data:00000048 . "Enter X:\n" __2printf R1.text:00000004 . OFFSET FLAT:$SG2926 . aYouEnteredD___ .text:0000000A .text:0000001C .PC} DCB "Enter X:".data:00000048 . why are the text strings located in the code segment (.text:00000000 .text:0000002A .. =x R1. aD .0xA. DATA XREF: main+14 DCB 0 ..text.text:00000016 . 7. ``DWORD PTR'' is a part of the assembly language (no relation to the machine code).text:00000012 . the RAM.. DWORD PTR x rcx. the x variable is now global and for this reason located in another segment.text:00000014 .. .0xA.text ends . . main+10 .0 DCB 0 aYouEnteredD___ DCB "You entered %d. DATA XREF: main+8 .text) and x is located right here? Because it is a variable and by definition its value could change.data:00000048 . DATA XREF: main+2 .text:00000000 .. #0 {R4. 40 0 The code is almost the same as in x86. It is not very economical to store constant variables in RAM when you have ROM.text:0000002C . indicating that the variable data size is 32-bit and the MOV instruction has to be encoded accordingly. . and changeable variables —in RAM3 .LR} R0.text:00000047 . . Furthermore. namely the data segment (.data). main+10 ends So. "%d" __0scanf R0. GLOBAL VARIABLES rcx.' printf .

.4 . the Keil compiler allocates it in the . load pointer to the "Enter X:" string: adrp x0. :lo12:. [x0] . 7.text:004006C0 var_10 . x1.sbss ELF section (remember the “Global Pointer”? 3.\n" f5: . x0.2.2. 0 .string "%d" . GLOBAL VARIABLES Moving forward. it may well be even in an external memory chip! One more thing: if a variable is declared as const. the linker could place this segment in ROM too. sp.LC2 add x0... form address of x global variable: adrp x1. 16 ret In this case the x variable is declared as global and its address is calculated using the ADRP/ADD instruction pair (lines 21 and 25). I have compiled to executable file rather than object file and I have loaded it into IDA.9.8: Optimizing GCC 4.LC0: . so its address must be saved somewhere in close proximity to the code.LC0 add x0. x add x1. x30. SCANF() 7. 7.LC2: . we see a pointer to the x (off_2C) variable in the code segment.LC1 .5 (IDA) . x0. since the variable is not initialized at the start. IDA displays the x variable in the . The LDR instruction in thumb mode can only address variables in a range of 1020 bytes from its location. and in in ARM-mode —variables in range of ±4095 bytes. .7: Non-optimizing GCC 4.\n" string: adrp x0. thereafter. x add x0.text:004006C0 var_4 = -0x10 = -4 70 . [sp].CHAPTER 7. Listing 7.LC1 add x0.7 MIPS Uninitialized global variable So now the x variable is global.1 on page 16). load pointer to the "%d" string: adrp x0.4.2.LC0 bl puts . Perhaps.6 ARM64 Listing 7.text:004006C0 .LC1: .1 ARM64 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 . .text:004006C0 main: . along with the code segment. 0 ..constdata segment. :lo12:x .LC2 bl printf . And so the address of the x variable must be located somewhere in close proximity. set stack frame (FP=SP) add x29. x0. form address of x global variable again: adrp x0. x30. :lo12:x bl __isoc99_scanf . x0. return 0 mov w0.5.comm x. -16]! . load pointer to the "You entered %d.4. . That is because the x variable could be located somewhere far from this particular code fragment. because there is no guarantee that the linker would be able to accommodate the variable somewhere nearby the code. [sp. load value from memory at this address: ldr w1. restore FP and LR: ldp x29. save FP and LR in stack frame: stp x29. :lo12:.string "Enter X:" . :lo12:.string "You entered %d. and that all operations with the variable occur via this pointer.

0x20+var_10($sp) $a0. load value from "x" variable . __isoc99_scanf la jalr la $a1. 0x20 .a0.16(sp) a0.text:004006C0 .0x41099C)($v0) jalr $t9 . 0x20+var_10($sp) $a0.sbss . $zero $ra $sp.text:00400710 . function prologue: 4006c0: 3c1c0042 4006c4: 27bdffe0 4006c8: 279c8940 4006cc: afbf001c 4006d0: afbc0010 . 0x20+var_4($sp) $v0. branch delay slot .16(sp) lw lui jalr addiu t9.a0. prepare address of x: 4006f0: 8f858044 4006f4: 0320f809 4006f8: 248408fc .sbss:0041099C x: . call printf(): . aD # "%d" lw lui $gp.4.text:004006E8 .\n" . x $t9 . 0x40 $t9. call scanf(): . 0x40 # "Enter X:" .text:004006E4 .16(sp) 71 .-32712(gp) lw jalr addiu a1.text:004006F0 .sbss:0041099C . $gp.-32700(gp) t9 a0. call puts(): 4006d4: 8f998034 4006d8: 3c040040 4006dc: 0320f809 4006e0: 248408f0 . aEnterX lw lui la $gp.text:004006FC . printf and pass it to printf() in $a1: lw $a1.sbss:0041099C IDA reduces the amount of information.9: Optimizing GCC 4.text:004006C8 .-32716(gp) a0.text:004006DC . branch delay slot lw lui lw gp. branch ⤦ lw move jr addiu $ra.0x40 t9. $sp.text:004006F8 .2288 .text:004006E0 . printf la $a0.text:00400720 . SCANF() . aYouEnteredD___ # "You entered %d. branch delay slot lw gp.text:0040071C . call printf(): 4006fc: 8fbc0010 lui addiu addiu sw sw gp. 0x42 -0x20 0x418940 0x20+var_4($sp) 0x20+var_10($sp) la lui jalr la $t9. puts $a0.space 4 .sbss:0041099C . branch delay slot .CHAPTER 7.gp.-30400 ra. 0x40 $t9 .sp.text:00400704 .text:00400700 .text:00400718 . .text:0040070C .0x40 t9 a0.text:004006D8 .text:004006D4 .text:004006D0 .5 (objdump) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 004006c0 <main>: .text:00400708 .28(sp) gp.text:00400714 Ç delay slot ..text:004006C0 . $ra. get address of x: . (x . prepare address of x: .text:004006CC .-32 gp.text:004006EC . x la $t9. call scanf(): 4006e4: 8fbc0010 4006e8: 3c040040 4006ec: 8f998038 . call puts(): ..text:004006F4 . so I also made a listing using objdump and commented it: Listing 7.globl x .0x42 sp. __isoc99_scanf $a0.text:004006C4 .2300 ..2. GLOBAL VARIABLES lui addiu li sw sw $gp. $gp. function prologue: .sbss:0041099C # Segment type: Uninitialized . puts $a0. function epilogue: .. branch delay slot la $v0.text:00400724 7.

text:004006A0 . printf $a0. puts $a0. More than that. .zero 400720: 03e00008 jr ra 400724: 27bd0020 addiu sp.text:00400700 move $gp.text:004006B4 sw . prepare high part of x address: .text:004006E8 lw . -0x20 $gp. __isoc99_scanf $a0.text:004006E0 la .10: Optimizing GCC 4. add low part of x address: .0x410000) $t9 .text:004006A0 main: .text:004006AC sw . scanf(). . $s0.text:004006A0 var_4 = -4 . 0x20+var_4($sp) $s0. value of x is now in $a1.text:004006B0 sw . branch delay slot function epilogue: 400718: 8fbf001c lw ra. 0x418930 $ra. 0x20+var_10($sp) $a1.. SCANF() 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 . . get a word from memory: . printf $a0.text:004006A0 lui . $zero 72 # "You entered %d.text:004006CC lui . 0x20+var_10($sp) $s0.text:004006BC lui . 0x40 $t9 . aD # "%d" $gp. 16 and 26). because our example is tiny. 7.5 (IDA) .text:004006A8 li . 0x41 $t9. in order to align next function’s start on 16-byte boundary.0(v0) 400710: 0320f809 jalr t9 400714: 24840900 addiu a0.2. and such offset suggests that all three function’s addresses. are all stored somewhere at the beginning of that buffer. and also the address of the x variable..text:004006A4 addiu . __isoc99_scanf $a0.text:004006E4 lw .0x40 get address of x: 400704: 8f828044 lw v0.text:004006D8 addiu . are also read from the 64KiB global data buffer using GP (lines 9. GP points to the middle of the buffer.-32700(gp) 400708: 8f99803c lw t9. 0x20+var_4($sp) $v0.text:004006C4 la .text:004006F0 lui .at 40072c: 00200825 move at.text:004006A0 var_8 = -8 .32 .text:004006B8 la . // default value Now IDA shows that the x variable is residing in the . printf()). GLOBAL VARIABLES 400700: 3c040040 lui a0.text:004006EC la . 0x40 $t9 . x $t9.2304 .a0. the addresses of the three external functions which are used in our example (puts(). now address of x is in $a1.text:004006F8 la .CHAPTER 7. aYouEnteredD___ $ra. 0x20+var_8($sp) $gp. Initialized global variable Let’s alter our example by giving the x variable a default value: int x=10.text:004006A0 .text:004006C8 lw .text:004006C0 jalr .data section: Listing 7.28(sp) 40071c: 00001021 move v0. (x . Another thing worth mentioning is that the function ends with two NOPs (MOVE $AT.4. aEnterX # "Enter X:" $gp. That make sense.-32708(gp) load value from "x" variable and pass it to printf() in $a1: 40070c: 8c450000 lw a1. . puts $a0. 0x20+var_10($sp) $t9. .text:004006A0 var_10 = -0x10 .\n" .text:004006F4 jalr .$AT — an idle instruction).text:004006D4 lui .at Now we see the x variable address is read from a 64KiB data buffer using GP and adding negative offset to it (line 18).sp.text:004006FC lw .text:004006DC jalr . 0x42 $sp. branch delay slot pack of NOPs used for aligning next function start on 16-byte boundary: 400728: 00200825 move at.text:004006D0 la . 0x40 $a1.

CHAPTER 7. value of x is now in $a1.0x40 .2272 4006fc: 8fbf001c lw ra. but here we also see some prefixed with S-.h> int main() { 73 . In our case those are LUI (“Load Upper Immediate”) and ADDIU (“Add Immediate Unsigned Word”).2268 4006e4: 8fbc0010 lw gp. #include <stdio. prepare high part of x address: 4006cc: 3c100041 lui s0. it is slightly old-fashioned to use scanf() today. 4006ec: 8f99803c lw t9.28(sp) 400700: 00001021 move v0.data:00410920 . so one single LW is enough to load a value from the variable and pass it to printf(). But if we have to.data. Registers holding temporary data are prefixed with T-.text:0040070C 7.32 We see that the address is formed using LUI and ADDIU.3.globl x .text:00400708 . and we can take a look how to work with variables there. SCANF() RESULT CHECKING lw jr addiu $s0. That is why the value of $S0 was set at address 0x4006cc and was used again at address 0x4006e8.-32708(gp) 4006f0: 3c040040 lui a0..24(sp) 4006b4: afbc0010 sw gp. The variable’s address must be formed using a pair of instructions.11: Optimizing GCC 4.a0.a0.data:00410920 x: . SCANF() .text:00400704 .word 0xA Why not .-32712(gp) 4006d4: 3c040040 lui a0.-30416 4006ac: afbf001c sw ra. now x is in . add low part of x address: 4006d8: 26050920 addiu a1. but the high part of address is still in the $S0 register. high part of x address is still in $s0. Nevertheless.2336 .e.-32716(gp) 4006bc: 3c040040 lui a0. add low part to it and load a word from memory: 4006e8: 8e050920 lw a1. Here is also the objdump listing for close inspection: Listing 7. .zero 400704: 8fb00018 lw s0. the contents of which is need to be preserved before use in other functions (i. we need to at least check if scanf() finishes correctly without an error. 4006dc: 0320f809 jalr t9 4006e0: 248408dc addiu a0. 7.sp. 0x20 .28(sp) 4006b0: afb00018 sw s0.-32 4006a8: 279c8930 addiu gp.. “saved”).0x40 4006f4: 0320f809 jalr t9 4006f8: 248408e0 addiu a0. . after the scanf() call. which is a general memory area.5 (objdump) 004006a0 <main>: 4006a0: 3c1c0042 lui gp.0x40 4006c0: 0320f809 jalr t9 4006c4: 248408d0 addiu a0.a0. 0x20+var_8($sp) $ra $sp. The scanf() function does not change its value.4.gp.sp. its possible that this depends on some GCC option.3 scanf() result checking As I noted before. and it is possible to encode the offset in a LW (“Load Word”) instruction.sdata? I do not know. now address of x is in $a1.2336(s0) .0x42 4006a4: 27bdffe0 addiu sp.24(sp) 400708: 03e00008 jr ra 40070c: 27bd0020 addiu sp.2256 4006c8: 8fbc0010 lw gp.16(sp) ..16(sp) .s0.16(sp) 4006b8: 8f998034 lw t9.0x41 4006d0: 8f998038 lw t9.

see also: wikipedia. Jcc checks those flags and decides to either pass the control to the specified address or not.0.3. 'What you entered? Huh?'. 'You entered %d. In other words. DWORD PTR _x$[ebp] eax OFFSET $SG3833 .3. 00H _printf esp. which implements return 0. DWORD PTR _x$[ebp] ecx OFFSET $SG3834 .. and another printf() call is to be executed. 4 scanf. 1 SHORT $LN2@main ecx.. 4 eax. 0aH.' and the value of x. '%d'. 1 (CoMPare). in our case $LN2@main. printf ("Enter X:\n"). SCANF() 7. or in case of error (or EOF5 ) . 8 eax. A JNE conditional jump follows the CMP instruction. return 0. x). we compare the value in the EAX register with 1. We check it with the help of the instruction CMP EAX. there is a JMP preceding it (unconditional jump)..exe Enter X: ouch What you entered? Huh? 7. 00H _scanf esp. So. 5 End 74 .CHAPTER 7. 8 SHORT $LN1@main OFFSET $SG3836 . I have added some C code to check the scanf() return value and print error message in case of an error..1 MSVC: x86 Here is what we get in the assembly output (MSVC 2010): lea push push call add cmp jne mov push push call add jmp $LN2@main: push call add $LN1@main: xor eax. Since in this case the second printf() has not to be executed... if everything goes fine and the user enters a number scanf() returns 1. JNE stands for Jump if Not Equal. with two arguments: 'You entered %d. CMP compares two values and sets processor flags6 . if the value in the EAX register is not equal to 1.. else printf ("What you entered? Huh?\n").. eax The Caller function (main()) needs the callee function (scanf()) result.'. the CPU will pass the execution to the address mentioned in the JNE operand. 00H _printf esp. This works as expected: C:\..>ex3. 0aH. It passes the control to the point after the second printf() and just before the XOR EAX. By standard. it could be said that comparing a value with another is usually implemented by CMP/Jcc instruction pair.. &x)==1) printf ("You entered %d. But if everything is fine. if (scanf ("%d".>ex3. SCANF() RESULT CHECKING int x. In our case. so the callee returns it in the EAX register.exe Enter X: 123 You entered 123. the conditional jump is not be be taken. }. C:\. So. wscanf: MSDN of file 6 x86 flags. Passing the control to this address results in the CPU executing printf() with the argument ``What you entered? Huh?''. the scanf()4 function returns the number of fields it has successfully read. where cc is condition code.\n"... EAX instruction.

text:00401000 . but the CMP instruction is in fact SUB (subtract). ask for X . So. then press “–” on the numerical pad and enter the text to be displayed instead.text:00401009 . All arithmetic instructions set processor flags. 7. I have hidden two blocks and given them names: . which means that all these standard functions are not be linked with the executable file. 8 eax. 4 exit: . it is very helpful to leave notes for oneself (and others). get X cmp eax.text:00401024 . Create another label—into “exit”. the CMP instruction can be replaced with a SUB instruction and almost everything will be fine.text:00401000 .text:0040103D . JNE is in fact a synonym for JNZ (Jump if Not Zero). CODE XREF: _main+3B xor eax. [ebp+var_4] ecx offset aYou .text:00401038 . esp ecx offset Format .text:0040103D . for beginners it is good idea to use /MD option in MSVC. 4 eax.text:0040101B . 1 short error ecx. Assembler translates both JNE and JNZ instructions into the same opcode. "Enter X:\n" ds:printf esp. 1 − 1 is 0 so the ZF flag would be set (meaning that the last result was 0).text:00401050 _main proc near var_4 argc argv envp = = = = dword dword dword dword push mov push push call add lea push push call add cmp jnz mov push push call add jmp error: . but are to be imported from the MSVCR*.. You could also hide(collapse) parts of a function in IDA.text:0040104B .text:0040102C . So it is possible to move the cursor to the label. CMP is SUB without saving the result.text:00401012 . we see that JNZ is to be triggered in case of an error.text:00401016 .text:0040103B .text:00401021 . Thus it will be easier to see which standard function are used and where. press “n” and rename it to “error”.text:00401015 . In instance. JNE checks only the ZF flag and jumps only if it is not set. In no other circumstances ZF can be set.text:0040102D .text:00401012 . While analysing code in IDA. To do that mark the block. it is not a good idea to comment on every instruction. "%d" ds:scanf esp. not just CMP.3.text:00401050 .text:00401000 .text:00401000 .text:00401001 .text:00401000 . "You entered %d.text:00401000 . SCANF() RESULT CHECKING This could sound paradoxical.text:0040100F .org 401000h . "What you entered? Huh?\n" ds:printf esp.text:00401003 . print result jmp short exit 75 .text:00401000 .DLL file instead.2 MSVC: x86: IDA It is time to run IDA and try to do something in it.text:00401042 . However. If we compare 1 and 1.text:0040104F .text:00401032 . analysing this example. [ebp+var_4] eax offset aD .text:00401000 .text:0040103D . eax mov esp. with the difference that SUB alters the value of the first operand.text:00401000 .text:00401000 . By the way.\n" ds:printf esp. Here is my result: .text:00401027 .text:00401000 .text:0040104D .text:00401004 . ebp pop ebp retn _main endp Now it is slightly easier to understand the code.3.text:00401029 . 8 short exit XREF: _main+27 offset aWhat .text:0040104B .text:00401048 .text:00401024 . SCANF() 7..text:00401027 .text:00401029 .text:00401000 .text:0040103B _text segment para public 'CODE' use32 assume cs:_text . except when the operands are equal. 1 jnz short error . CODE push call add ptr -4 ptr 8 ptr 0Ch ptr 10h ebp ebp.text:0040104B .CHAPTER 7.

CODE XREF: _main+27 . eax . SCANF() 7. CODE XREF: _main+3B . SCANF() RESULT CHECKING .text:00401050 _main endp To expand previously collapsed parts of the code. 4 .text:0040103D push offset aWhat .text:0040104F pop ebp .text:00401042 call ds:printf .text:00401050 retn . ebp .text:0040104B exit: .text:0040103D error: . use “+” on the numerical pad. "What you entered? Huh?\n" .text:0040104D mov esp.text:0040104B .text:00401048 add esp.text:0040103D .CHAPTER 7.3.text:0040104B xor eax. 76 .

7: Graph mode in IDA There are two arrows after each conditional jump: green and red. we can see how IDA represents a function as a graph: Figure 7. The green arrow points to the block which executes if the jump is triggered. SCANF() RESULT CHECKING By pressing “space”.CHAPTER 7.3. and red if otherwise. SCANF() 7. 77 .

I did it for 3 blocks: Figure 7.8: Graph mode in IDA with 3 nodes folded That is very useful.3. SCANF() RESULT CHECKING It is possible to fold nodes in this mode and give them names as well (“group nodes”). It could be said that a very important part of the reverse engineer’s job is to reduce the information he/she deals with.CHAPTER 7. SCANF() 7. 78 .

forcing it to think scanf() always works without error.9: OllyDbg: passing variable address into scanf() 79 .3 MSVC: x86 + OllyDbg Let’s try to hack our program in OllyDbg.CHAPTER 7. in this case 0x6E494714: Figure 7. When an address of a local variable is passed into scanf(). SCANF() 7. SCANF() RESULT CHECKING 7.3.3. the variable initially contains some random garbage.

Indeed. We now have 1 in EAX. SCANF() 7. what would scanf() write there? It simply did nothing except returning zero. which indicates that an error has occurred: Figure 7.CHAPTER 7. like “asdasd”. Among the options there is “Set to 1”. This is what we need.3. When we run the program (F9) we can see the following in the console window: Figure 7. Right-click on EAX. 1850296084 is a decimal representation of the number in the stack (0x6E494714)! 80 . so the following check is to be executed as intended.11: console window Indeed. scanf() finishes with 0 in EAX. Let’s try to “hack” our program. SCANF() RESULT CHECKING While scanf() executes.10: OllyDbg: scanf() returning error We can also check the local variable in the stack and note that it has not changed. in the console I enter something that is definitely not a number. and printf() will print the value of the variable in the stack.

CHAPTER 7.12: Hiew: main() function Hiew finds ASCIIZ8 strings and displays them.DLL (i. We can see this: Figure 7.3.3.e. Enter. F6.text section. no matter what we enter. 7 that’s 8 ASCII what also called “dynamic linking” Zero (null-terminated ASCII string) 81 . F8. SCANF() 7. as it does with the imported functions’ names.text section (Enter. Assuming that the executable is compiled against external MSVCR*.. SCANF() RESULT CHECKING 7. Let’s open the executable in Hiew and find the beginning of the . Enter). we see the main() function at the beginning of the . with /MD option)7 . We may try to patch the executable so the program would always print the input.4 MSVC: x86 + Hiew This can also be used as a simple example of executable file patching.

56 lea rcx. Two NOPs are probably not the most æsthetic approach. 'Enter X:' 82 .3.5 MSVC: x64 Since we work here with int-typed variables. 00H _TEXT SEGMENT x$ = 32 main PROC $LN5: sub rsp. SCANF() 7. 00H 'You entered %d.CHAPTER 7. 0aH. It will behave as we wanted. Now the executable is saved to the disk. we have to bypass. Another way to patch this instruction is to write just 0 to the second opcode byte (jump offset). meaning two NOPs): Figure 7.13: Hiew: replacing JNZ with two NOPs Then press F9 (update). which are still 32-bit in x86-64. We would get an unconditional jump that is always triggered. press F3. 00H 'What you entered? Huh?'. however.00401027 (where the JNZ instruction.3. so that JNZ will always jump to the next instruction. Listing 7. 7. 0aH.'. We could also do the opposite: replace first byte with EB while not touching the second byte (jump offset). no matter the input. and then type “9090”(. While working with pointers. 64-bit register parts are used.. is located). SCANF() RESULT CHECKING Move the cursor to address . In this case the error message would be printed every time. 00H '%d'. we see how the 32-bit part of the registers (prefixed with E-) are used here as well. prefixed with R-. 0aH.12: MSVC 2012 x64 _DATA $SG2924 $SG2926 $SG2927 $SG2929 _DATA SEGMENT DB DB DB DB ENDS 'Enter X:'. OFFSET FLAT:$SG2924 ..

aYouEnteredD___ .LC1: 9 (PowerPC. aWhatYouEntered . then the branches converge at the point where 0 is written into the R0 as a function return value. 56 ret 0 main ENDP _TEXT ENDS END 7. OFFSET FLAT:$SG2929 . BEQ jumps to another address if the operands were equal to each other. aEnterX . "What you entered? Huh?\n" __2printf MOVS POP R0. '%d' call scanf cmp eax. SCANF() RESULT CHECKING call printf lea rdx.. ARM) Branch if Equal 83 . [SP.CHAPTER 7. #1 loc_1E R0. OFFSET FLAT:$SG2927 .LC0: .6 ARM ARM: Optimizing Keil 6/2013 (thumb mode) Listing 7.string "Enter X:" . aD . return 0 xor eax. and then the function ends.13: Optimizing Keil 6/2013 (thumb mode) var_8 = -8 PUSH ADR BL MOV ADR BL CMP BEQ ADR BL {R3..' call printf jmp SHORT $LN1@main $LN2@main: lea rcx. #0 {R3. Everything else is simple: the execution flow forks in two branches. QWORD PTR x$[rsp] lea rcx.9. DWORD PTR x$[rsp] lea rcx. CODE XREF: main+12 R1. it subtracts one of the arguments from the other and updates the conditional flags if needed. CODE XREF: main+26 loc_1E The new instructions here are CMP and BEQ9 .LR} R0. CMP is analogous to the x86 instruction with the same name. 'You entered %d.. or if the Z flag is 1.3.3.PC} LDR ADR BL B . 1 jne SHORT $LN2@main mov edx. It behaves as JZ in x86. OFFSET FLAT:$SG2926 . "Enter X:\n" __2printf R1. or.14: Non-optimizing GCC 4. eax add rsp. "%d" __0scanf R0. SCANF() 7. SP R0.1 ARM64 1 2 3 . 'What you entered? Huh?' call printf $LN1@main: .\n" __2printf loc_1A loc_1A .. if the result of the last computation was 0.#8+var_8] R0. ARM64 Listing 7. "You entered %d.

. x29. check it: cmp w0.5 (IDA) .CHAPTER 7.LC3: . .text:004006A4 . load x value from the local stack ldr w1. load pointer to the "Enter X:" string: adrp x0. -32]! . [x29. x0.text:004006A0 . x0. 0x28+var_18($sp) $a0. return 0 mov w0.text:004006B0 . .7 MIPS Listing 7. -0x28 $gp. jump to L2 will be occurred bne . BNE is Branch if Not Equal .text:004006C8 main: var_18 var_10 var_4 = -0x18 = -0x10 = -4 lui addiu li sw sw la lui jalr la lw lui $gp. :lo12:. 7.28] . load pointer to the "What you entered? Huh?" string: adrp x0. [sp.4.\n" string: adrp x0.LC2 add x0.text:004006A0 . 28 bl __isoc99_scanf .text:004006A8 . 0x28+var_18($sp) $t9.. .LC1 add x0.LC2: .LC3 add x0.LC3 bl puts . 1 .text:004006A0 . meaning no error . so if W0<>0. 0x42 $sp. [sp]. :lo12:.L2 . 0x28+var_4($sp) $gp. SCANF() RESULT CHECKING . SCANF() 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 7.text:004006AC ..LC2 bl printf . 0x418960 $ra.LC1 .L2: .15: Optimizing GCC 4.3.string "What you entered? Huh?" f6: . at this moment W0=1.LC0 bl puts . set stack frame (FP=SP) add x29. load pointer to the "You entered %d.text:004006B8 . x30. 32 ret Code flow in this case forks with the use of CMP/BNE (Branch if Not Equal) instructions pair. skip the code.text:004006C4 . :lo12:. save FP and LR in stack frame: stp x29. puts $a0. :lo12:. .string "%d" .text:004006B4 . restore FP and LR: ldp x29. calculate address of x variable in the local stack add x1. x0. scanf() returned result in W0. sp.text:004006A0 . x0.text:004006BC . 0x40 84 . . 0 .text:004006A0 . load pointer to the "%d" string: adrp x0. puts $a0. aEnterX # "Enter X:" $gp.text:004006C0 ..LC0 add x0. 0x40 $t9 .text:004006A0 .\n" .L3 . x30. which print the "What you entered? Huh?" string: b .3.text:004006A0 . 0 .L3: .string "You entered %d.

text:004006D4 .text:00400700 .e.text:004006F8 .text:004006CC . 0x28+var_4($sp) $v0. 0x40 $t9 .text:004006D8 .text:004006D0 . 0x28 .text:00400718 .text:004006F0 . $zero $ra $sp.CHAPTER 7.text:00400714 .text:004006E0 .. 0x28+var_18($sp) $v0.text:004006DC . 0x40 $t9 .text:00400704 . NOP $t9. 1 $gp. aYouEnteredD___ # "You entered %d. 0x28+var_10 # branch delay slot $v1. aWhatYouEntered # "What you entered? Huh?" $ra.text:00400728 . Try to do this in some of the examples. success).text:004006F4 . printf $a0. 0x28 scanf() returns the result of its work in register $V0.text:004006FC .8 Exercise As we can see. at 0x004006DC). $sp. 0x28+var_10($sp) $a0. aD # "%d" $t9 .text:0040070C loc_40070C: . puts $a0. But then the basic blocks must also be swapped. SCANF() 7..text:004006E4 .text:004006EC . $zero # branch delay slot. 85 .text:0040071C .text:0040072C la lw lui jalr la lw move jr addiu $t9.3. It is checked at address 0x004006E4 by comparing the values in $V0 with $V1 (1 was stored in $V1 earlier. puts $a0.text:00400710 . 7. __isoc99_scanf $a1. $v1. loc_40070C $at.text:00400724 . If the two values are equal (i.text:00400708 la la jalr addiu li lw beq or la lui jalr la lw move jr addiu $t9. 0x28+var_4($sp) $v0.3.text:00400720 .\n" $ra. BEQ stands for “Branch Equal”. printf $a1. __isoc99_scanf $a0.text:0040070C . $zero $ra $sp.. the execution jumps to address 0x0040070C.text:004006E8 . SCANF() RESULT CHECKING . the JNE/JNZ instruction can be easily replaced by the JE/JZ and vice versa (or BNE by BEQ and vice versa).

8 . 0aH. int main() { printf ("%d\n". 8. f(1. int c) { return a*b+c. 1st argument call _f add esp. size = 4 .2: MSVC 2010 Express _TEXT SEGMENT _a$ = 8 _b$ = 12 _c$ = 16 _f PROC push mov mov imul add pop ret _f ENDP _main .1: simple example #include <stdio. size = 4 ebp ebp. }. return 0 86 .1 x86 MSVC Here is what we get after compilation (MSVC 2010 Express): Listing 8.1 8. 2nd argument push 1 . eax. return 0. esp push 3 .CHAPTER 8. eax.1. 00H call _printf add esp. 3rd argument push 2 . '%d'. size = 4 . ACCESSING PASSED ARGUMENTS Chapter 8 Accessing passed arguments Now we figured out that the caller function is passing arguments to the callee via the stack. 3)). eax.h> int f (int a. ebp 0 esp DWORD PTR _a$[ebp] DWORD PTR _b$[ebp] DWORD PTR _c$[ebp] PROC push ebp mov ebp. 12 push eax push OFFSET $SG2463 . }. int b. 2. But how does the callee access them? Listing 8.

So. the value in EAX is a product of the value in EAX and the content of _b. eax. X86 eax. The first element of the stack frame is the saved value of EBP.4. in the same way as local variables. The value in EAX does not need to be moved: it is already where it must be. they are rather members of the stack frame of the caller function. After IMUL instruction execution. 8.: Function arguments are not members of the function’s stack frame.B. To access the first function argument. N.1.1 f public f proc near arg_0 arg_4 arg_8 = dword ptr = dword ptr = dword ptr f push mov mov imul add pop retn endp ebp ebp. the third is the first function argument. so it has added comments to the stack elements like “RETURN from” and “Arg1 = …”.2 MSVC + OllyDbg Let’s illustrate this in OllyDbg.CHAPTER 8.1. the second one is RA. eax.1: OllyDbg: inside of f() function 8. then the second and third ones. we are addressing the outer side of the stack frame by adding the _a$ macro to the value in the EBP register. 3rd argument 87 . etc. OllyDbg marked “Arg” elements as members of another stack frame. After that.int. OllyDbg is aware about this. which I marked with a red rectangle. 1st argument [ebp+arg_4] . eax.3 GCC Let’s compile the same in GCC 4.1 and see the results in IDA: Listing 8. On returning to caller. eax ebp 0 What we see is that the main() function pushes 3 numbers onto the stack and calls f(int.int). 2nd argument [ebp+arg_8] . we see that EBP is pointing to the stack frame.3: GCC 4. ebp 8 0Ch 10h esp [ebp+arg_0] . but with positive offsets (addressed with plus). one needs to add exactly 8 (2 32-bit words) to EBP. Then the value of a is stored into EAX. When we trace to the first instruction in f() that uses one of the arguments (first one). ACCESSING PASSED ARGUMENTS _main xor pop ret ENDP 8.4. Argument access inside f() is organized with the help of macros like: _a$ = 8. it takes the EAX value and use it as an argument to printf(). Hence. ADD adds the value in _c to EAX.1. Figure 8.

4: Optimizing MSVC 2012 x64 $SG2997 DB main main f f PROC sub mov lea lea call lea mov call xor add ret ENDP '%d'. eax [esp+10h+var_10]. 40 edx. 0 The result is almost the same with some minor differences discussed earlier. 2 r8d. function arguments (first 4 or first 6 of them) are passed in registers i. eax printf eax. 8.2nd argument .1st argument . 2nd argument [esp+10h+var_10]. 2 . ECX=1 f rcx. The LEA instruction here is used for addition.2 x64 The story is a bit different in x86-64. X64 ebp ebp. DWORD PTR [r8+rcx] ret 0 ENDP As we can see. 8. '%d' edx.3rd argument imul ecx. 3rd argument [esp+10h+var_C]. The compiler must have decided that this would work faster than the usual way of loading values into a register — using MOV instruction. R8D=3 ecx. EDX . the callee reads them from there instead of reading them from the stack. 0FFFFFFF0h esp.e. 1 .2 on page 933) instruction takes care of this at the end.2. QWORD PTR [rdx-1] .6. the compact function f() takes all its arguments from the registers. "%d\n" [esp+10h+var_C]. LEA is also used in the main() function to prepare the first and third f() arguments. 3 . apparently the compiler considered it faster than ADD. edx _printf eax. 0aH. edx lea eax. 1st argument f edx. because the penultimate LEAVE ( A. ACCESSING PASSED ARGUMENTS main public main proc near var_10 var_C var_8 = dword ptr -10h = dword ptr -0Ch = dword ptr -8 main push mov and sub mov mov mov call mov mov mov call mov leave retn endp 8.CHAPTER 8. The stack pointer is not set back after the two function calls(f and printf). offset aD . esp esp. QWORD PTR [rdx+1] . 10h [esp+10h+var_8]. Let’s take a look at the non-optimizing MSVC output: 88 .1 MSVC Optimizing MSVC: Listing 8. eax rsp.2. ECX . 40 0 PROC . 00H rsp. R8D . OFFSET FLAT:$SG2997 .

1st argument . 3rd argument mov edx. EDI . So. edi lea eax.3rd argument mov [rsp+arg_10].2. 1st argument call f mov edx. [rsp+arg_8] add eax. f edi. "%d\n" call printf . 8.2. [rsp+arg_0] imul eax. 2) the debugger is always aware where to find the function arguments at a break 2 . [rsp+arg_10] retn endp f main proc near sub rsp. but some small functions (like ours) may not do this. R8D . This is called “shadow space” 1 : every Win64 may (but is not required to) save all 4 register values there. "%d\n" eax eax .5: MSVC 2012 x64 f proc near . r8d mov [rsp+arg_8]. edx. $SG2931 .6: Optimizing GCC 4. [rdx+rsi] ret main: sub mov mov mov call mov mov xor rsp. esi. ECX . It is a caller responsibility to allocate “shadow space” in the stack. 2 .3rd argument imul esi.4. esi. 1 . 2nd argument mov ecx. number of vector registers passed 1 MSDN 2 MSDN 89 . eax lea rcx. This is done for two reasons: 1) it is too lavish to allocate a whole register (or even 4 registers) for an input argument. ecx mov eax. ACCESSING PASSED ARGUMENTS 8. 8 3 2 1 OFFSET FLAT:. ESI . X64 Listing 8. edi. EDX . 28h mov r8d.1st argument . 28h retn endp main It looks somewhat puzzling because all 3 arguments from the registers are saved to the stack for some reason.6 x64 f: .LC0 .2nd argument . eax.2nd argument . eax add rsp. 3 . some large functions can save their input arguments in the “shadows space” if they need to use them during execution. EDX . return 0 xor eax. shadow space: arg_0 = dword ptr arg_8 = dword ptr arg_10 = dword ptr 8 10h 18h . so it will be accessed via stack.2 GCC Optimizing GCC generates more or less understandable code: Listing 8.CHAPTER 8. edx mov [rsp+arg_0].

rsp mov DWORD PTR [rbp-4].4. EDI . }. ACCESSING PASSED ARGUMENTS call xor add ret 8.h> #include <stdint.2. DWORD PTR [rbp-8] add eax.LC0 . return 0.h> uint64_t f (uint64_t a. }. 0 . that is why 32-bit register parts are used (prefixed by E-).8: Optimizing GCC 4. 1 f edx.1st argument . edi mov DWORD PTR [rbp-8]. rdi lea rax. 0x3333333344444444)). It can be altered slightly in order to use 64-bit values: #include <stdio. f(0x1122334455667788. Listing 8.2nd argument . rsp edx. 2 edi. uint64_t b. DWORD PTR [rbp-12] leave ret main: push mov mov mov mov call mov mov mov mov mov call mov leave ret rbp rbp. uint64_t c) { return a*b+c. but the callee may need to save its arguments somewhere in case of registers shortage.6 x64 f f proc near imul rsi. edx rdi.2. number of vector registers passed printf eax. OFFSET FLAT:. int main() { printf ("%lld\n". edx mov eax. 8 Non-optimizing GCC: Listing 8. EDX .CHAPTER 8. [rdx+rsi] retn endp 90 . 0x1111111122222222. 3 esi. 8. ESI .4. esi mov DWORD PTR [rbp-12].3rd argument push rbp mov rbp.6 x64 f: . eax eax.3 GCC: uint64_t instead of int Our example works with 32-bit int. eax rsp. rax eax.7: GCC 4. X64 printf eax. DWORD PTR [rbp-4] imul eax. "%d\n" esi. 0 There are no “shadow space” requirements in System V *NIX[Mit13].

but this time the full size registers (prefixed by R-) are used. functions return values. by standard.text:000000B8 . BX is not only returns control to the calling function. 8 mov rdx.3. R3. 8. ARM or thumb.3. since this is non-optimizing compilation. adds the third operand (R2) to the product and stores the result into the zeroth register (R0). {R4. 1111111122222222h .text:0000009C 1E FF 2F E1 MLA BX R0.text:000000AC . The f() function. eax add rsp.CHAPTER 8. eax . rax xor eax. 1122334455667788h . redundant (a single MLA instruction could be used here instead).3.text:000000D4 . As I mentioned before. R0. The MLA (Multiply Accumulate) instruction multiplies its first two operands (R3 and R1). if it gets called from thumb code.1 ARM Non-optimizing Keil 6/2013 (ARM mode) . #2 R0.3.text:000000CC . By the way. 2nd argument mov rdi. This can be necessary since. the compiler has not optimized it. as we can see. ACCESSING PASSED ARGUMENTS main 8. A2. uses the first 3 registers (R0-R2) as arguments. 1st argument call f mov edi.text:000000A8 . The very first MOV R3.3 8. aD_0 __2printf R0. R4 R0. #0 SP!. {R4. with three values passed to the first one —(f()). 8. R1. number of vector registers passed call _printf xor eax.text:000000B0 .text:000000B0 . ARM proc near sub rsp. Or not switch. if necessary.2]. but also switches the processor mode to thumb. 3 wikipedia: Multiply–accumulate operation 4 wikipedia 91 . 3333333344444444h .text:000000B4 . 8 retn endp main The code is the same.text:000000D8 00 30 A0 E1 93 21 20 E0 1E FF 2F E1 MOV MLA BX R3. R2 LR STMFD MOV MOV MOV BL MOV MOV ADR BL MOV LDMFD SP!. The BX instruction returns the control to the address stored in the LR register and.PC} main 10 03 02 01 F7 00 04 5A E3 00 10 40 20 10 00 FF 40 10 0F 18 00 80 2D A0 A0 A0 FF A0 A0 8F 00 A0 BD E9 E3 E3 E3 EB E1 E1 E2 EB E3 E8 . switches the processor mode from thumb to ARM or vice versa.text:00000098 f .text:000000C8 .. .2 Optimizing Keil 6/2013 (ARM mode) .text:00000098 91 20 20 E0 . R2 LR And here is the f() function compiled by the Keil compiler in full optimization mode (-O3). via which. R1.text:000000A4 .LR} R2. instruction is. in ARM the first 4 values are usually passed in the first 4 registers (R0-R3). R0 R0. #1 f R4. #3 R1. Multiplication and addition at once3 (Fused multiply–add) is a very useful operation. apparently. there was no such instruction in x86 before FMA-instructions appeared in SIMD 4 .text:000000C0 . offset format . R0 R1.text:000000BC . if the function was called from ARM code [ARM12. R0. "%lld\n" mov rsi. "%d\n" The main() function simply calls two other functions. exactly where the calling function will read and use it. function f() is not aware from what kind of code it may be called. 3rd argument mov rsi. The MOV instruction was optimized out (or reduced) and now MLA uses all input registers and also places the result right into R0.. Thus. as it seems.text:000000D0 .text:000000C4 .

1 bl f mov w1. x1.text:00000062 70 47 MULS ADDS BX R0.LC7 bl printf . . f(0x1122334455667788. [sp. save FP and LR to stack frame: stp x29. Indeed. The result is returned in W0.h> uint64_t f (uint64_t a. R2 LR The MLA instruction is not available in thumb mode. }. All 3 arguments are passed in the 32-bit parts of X-registers. [sp].string "%d\n" I also extended all data types to 64-bit uint64_t and tested: #include <stdio. int main() { printf ("%lld\n".h> #include <stdint. . 0x1111111122222222.text:0000005E 48 43 . Listing 8. First the MULS instruction multiplies R0 by R1. w2 main: . R1 R0. -16]! main: 92 . The second instruction (ADDS) adds the result and R2 leaving the result in register R0.CHAPTER 8.9 f: madd ret w0. leaving the result in register R1. 0 .9: Optimizing GCC (Linaro) 4. x30. uint64_t c) { return a*b+c. restore FP and LR ldp x29. the argument types are 32-bit int’s. sp. MADD is just an instruction doing fused multiply/add (similar to the MLA we already saw).LC7: .LC8 x29. f: madd ret x0. w0. ARM Optimizing Keil 6/2013 (thumb mode) .4 ARM64 Optimizing GCC (Linaro) 4. 16 ret .LC7 add x0.text:00000060 80 18 . 3 mov w1. [sp. -16]! mov w2. 13396 x0. x0. 0 mov w0.3. x30. w0 adrp x0. w1.9 Everything here is simple.3. }. return 0 mov w0.3 8. uint64_t b. :lo12:. 2 add x29. x30. 0x3333333344444444)). return 0. R0.3. x0. ACCESSING PASSED ARGUMENTS 8. so the compiler generates the code doing these two operations (multiplication and addition) separately. 8. x2 mov adrp stp x1.

0x58be. w0. result in $v0 upon return .8] w1.text:0000001C .W2 will not be used.4 MIPS Listing 8. is not obliged to save them.text:00000010 main: .CHAPTER 8.4. This is called Register Save Area. which may be needed again in the future. $a1=b . [sp].text:00000008 .9 The non-optimizing compiler is more redundant: f: sub str str str ldr ldr mul ldr add add ret sp. w0.5 . 0 x29. $a2. 16 The code saves its input arguments in the local stack. sp.text:0000000C . $gp.. [ARM13c] The callee.text:00000010 var_10 . w1. lsl 32 x29.8] [sp. w0 [sp.text:00000000 . $a2=c . registers by parts.text:00000020 .9 drop this argument saving code? Because it did some additional optimizing work and concluded that the function arguments will not be needed in the future and also that the registers W0. sp. w2.LC8 x1.text:00000004 . set c: . 8. ACCESSING PASSED ARGUMENTS movk add movk add movk bl mov ldp ret 8. 0x122. w0 sp.1 on page 89. 3 93 . $a0=a . w0. lsl 48 printf w0.text:00000010 .2. #16 [sp.. branch delay slot = -0x10 = -4 lui addiu la sw sw $gp. x30.string "%lld\n" The f() function is the same. $sp.12] [sp. 16 . $gp. lsl 16 x0.1 on page 445.4] [sp.LC8: . $ra. This prevents overwriting the original function arguments. MIPS x1. Why did the optimizing GCC 4. however.text:00000000 f: .text:00000014 .text:00000010 .text:00000010 var_4 . sp.12] [sp. w1. only the whole 64-bit X-registers are now used. $v0 . We also see a MUL/ADD instruction pair instead of single a MADD.3. in case someone (or something) in this function needs using the W0.text:00000018 . x0.W2 registers.. w1. $a0 $v0 $ra $v0.text:00000010 . This is somewhat similar to “Shadow Space”: 8.text:00000024 mult mflo jr addu $a1. w0.4. :lo12:.4] w1. 0x27d0.. Long 64-bit values are loaded into the Non-optimizing GCC (Linaro) 4. I have described it also here: 28. 0 x1. (__gnu_local_gp >> 16) -0x20 (__gnu_local_gp & 0xFFFF) 0x20+var_4($sp) 0x20+var_10($sp) li $a2.10: Optimizing GCC 4.

branch delay slot $ra.text:00000038 lui . branch delay slot $gp. Indeed: we work with 32-bit int data types here. 5 http://go. The difference between JAL and JALR is that a relative offset is encoded in the first instruction. There are two special registers in MIPS: HI and LO which are filled with the 64-bit result of the multiplication during the execution of the MULT instruction. MFLO here takes the low-part of the multiplication result and stores it into $V0.text:00000050 move . 0x20+var_4($sp) $v0. There is a new instruction for us in main(): JAL (“Jump and Link”). Since C/C++ does not support this. ACCESSING PASSED ARGUMENTS .text:00000048 move .text:00000040 la . 1 f $a1.text:0000004C lw .4.text:0000003C lw . set b: . ADDU (“Add Unsigned”) adds the value of the third argument to the result.text:00000054 jr . These registers are accessible only by using the MFLO and MFHI instructions. ADD can raise an exception on overflow.com/17326 94 .text:00000034 lw . 0x20 . 2 . ADDU does not raise exceptions on overflow. which is sometimes useful5 and supported in Ada PL. 0x20+var_10($sp) $a0. so the relative address of f() is known and fixed. There are two different addition instructions in MIPS: ADD and ADDU. but to exceptions.text:00000044 jalr . take result of f() function and pass .text:0000002C jal . set a: . for instance. ($LC0 & 0xFFFF) $t9 it as a second argument to printf(): $a1.yurichev. result in $v0 now .CHAPTER 8.text:00000028 li . Finally. The 32-bit result is left in $V0. The difference between them is not related to signedness. branch delay slot The first four function arguments are passed in four registers prefixed by A-.text:00000058 addiu 8. Both f() and main() functions are located in the same object file. $v0 . MIPS $a0. So the high 32-bit part of the multiplication result is dropped (the HI register content is not used). in our example we see ADDU instead of ADD. while JALR jumps to the absolute address stored in a register (“Jump and Link Register”).text:00000030 li . ($LC0 >> 16) $t9. (printf & 0xFFFF)($gp) $a0. $zero $ra $sp.

4. In ARM.1: GCC 4. esp esp.1 replaced printf() with puts() (we have seen this before: 3. Listing 9. Most likely.envp)).1 Attempt to use the result of a function returning void So.3 on page 14) . GCC 4. that was stored in the EAX register at the end of main() becomes the sole argument of the exit() function. }. the result is usually returned in the R0 register. the FPU register ST(0) is used instead. the result of function execution is usually returned 1 in the EAX register. then the lowest part of register EAX (AL) is used.argv. This implies that the value of EAX at the end of main() contains what puts() has left there. Please notice that EAX is not zeroed before main()’s end.8. If you declare main() as void. If it is byte type or a character (char).1 .CHAPTER 9. world!\n"). since puts() returns the number of characters printed out.8. so the exit code of program is pseudorandom. Please note that here the main() function has a void return type: #include <stdio. left from your function execution. but that’s OK.LC0 puts also: MSDN: Return Values (C++): MSDN 95 .h> void main() { printf ("Hello. world!" main: push mov and sub mov call leave ret 1 See ebp ebp. If a function returns a float number. what if the main() function return value was declared of type void and not int? The so-called startup-code is calling main() roughly as follows: push push push call push call envp argv argc main eax exit In other words: exit(main(argc.LC0: . I can illustrate this fact. just like printf(). Let’s compile it in Linux. 9. there will be a random value. 16 DWORD PTR [esp].string "Hello. OFFSET FLAT:. nothing is to be returned explicitly (using the return statement). then something random. MORE ABOUT RESULTS RETURNING Chapter 9 More about results returning In x86. -16 esp.

the caller must allocate it and pass a pointer to it via the first argument. // and use 4th return rand(). one have to return information via pointers passed as function’s arguments. It is also possible to call a function whose essence is in returning a value. That is why old C compilers cannot create functions capable of returning something that does not fit in one register (usually int).c=a+3. MORE ABOUT RESULTS RETURNING 9. Now it has become possible to return. struct s get_some_values (int a) { struct s rt. The result of the rand() function is left in EAX. if a function needs to return several values.CHAPTER 9. usually./hello_world echo $? And run it: $ tst. let’s say. WHAT IF WE DO NOT USE THE FUNCTION RESULT? Let’ s write a bash script that shows the exit status: Listing 9. size = 4 ?get_some_values@@YA?AUs@@H@Z PROC mov ecx. int b. but the compiler hides it. an entire structure.2: tst. in all four cases. So.sh Hello.2 What if we do not use the function result? printf() returns the count of characters successfully output.sh #!/bin/sh . transparently for the programmer. get_some_values 96 . it returns only one.b=a+2. 9. but that is still not very popular. 9. Small example: struct s { int a. but if one needs it. DWORD PTR _a$[esp-4] . but the result of this function is rarely used in practice. If a function has to return a large structure. }. size = 4 _a$ = 12 . rand(). world! 14 14 is the number of characters printed.3 Returning a structure Let’s go back to the fact that the return value is left in the EAX register. rt. That is almost the same as to pass a pointer in the first argument manually. But in the first 3 cases. and not use it: int f() { // skip first 3 random values rand(). rt. the value in EAX is just thrown away. }. rt.a=a+1. return rt. …what we got (MSVC 2010 /Ox): $T3853 = 8 .2. }. rand(). int c. and all the rest—via pointers.

CHAPTER 9. So there are no performance drawbacks. ecx ret 0 ?get_some_values@@YA?AUs@@H@Z ENDP . edx mov DWORD PTR [eax+8]. }. }. .3. . ecx ecx. edx lea edx. DWORD PTR $T3853[esp-4] lea edx. as if a pointer to the structure was passed. [esp+a] eax. MORE ABOUT RESULTS RETURNING 9. 97 . int b.1 _get_some_values proc near ptr_to_struct a = dword ptr = dword ptr mov mov lea mov lea add mov mov retn _get_some_values endp 4 8 edx. This example can be rewritten using the C99 language extensions: struct s { int a. 3 [eax+4].8. [edx+2] edx.c=a+3}.b=a+2. get_some_values The macro name for internal passing of pointer to a structure her is $T3853. DWORD PTR [ecx+1] mov DWORD PTR [eax]. edx As we see. the function is just filling the structure’s fields allocated by the caller function. DWORD PTR [ecx+2] add ecx. struct s get_some_values (int a) { return (struct s){. RETURNING A STRUCTURE mov eax. 3 mov DWORD PTR [eax+4]. [esp+ptr_to_struct] ecx. int c.3: GCC 4.a=a+1. ecx [eax+8]. Listing 9. [edx+1] [eax].

0aH. product. DWORD PTR [eax+ecx] eax. }. . product=%d'. int *sum. . DWORD PTR _sum$[esp] DWORD PTR [esi]. This compiles to: Listing 10. 456. when a function needs to return two values. int *product) { *sum=x+y. int sum. void main() { f1(123.CHAPTER 10. . &sum. DWORD PTR _x$[esp-4] edx. product). eax esi 0 OFFSET _product OFFSET _sum 456 123 _f1 . 000001c8H . sum. For example.1 Global variables example #include <stdio. 10. *product=x*y. printf ("sum=%d. ecx ecx. &product). POINTERS Chapter 10 Pointers Pointers are often used to return values from functions (recall scanf() case ( 7 on page 56)).h> void f1 (int x. DWORD PTR _y$[esp-4] eax. int y. 0000007bH 98 . DWORD PTR _product$[esp-4] esi esi. 00H _x$ = 8 _y$ = 12 _sum$ = 16 _product$ = 20 _f1 PROC mov mov lea imul mov push mov mov mov pop ret _f1 ENDP _main PROC push push push push call . edx DWORD PTR [ecx]. size size size size = = = = 4 4 4 4 ecx.1: Optimizing MSVC 2010 (/Ob0) COMM _product:DWORD COMM _sum:DWORD $SG2803 DB 'sum=%d. }. product=%d\n".

eax 0 . 0000001cH 99 . DWORD PTR _sum eax ecx OFFSET $SG2803 DWORD PTR __imp__printf esp. 28 eax.1.CHAPTER 10. GLOBAL VARIABLES EXAMPLE eax. DWORD PTR _product ecx. POINTERS _main mov mov push push push call add xor ret ENDP 10.

and we can see the place in the data segment allocated for the two variables.1: OllyDbg: global variables addresses are passed to f1() First. 100 . GLOBAL VARIABLES EXAMPLE Let’s see this in OllyDbg: Figure 10. POINTERS 10. global variables’ addresses are passed to f1().CHAPTER 10. We can click “Follow in dump” on the stack element.1.

7. because non-initialized data (from BSS) is cleared before the execution begins: [ISO07.CHAPTER 10. POINTERS 10. 6.2: OllyDbg: memory map 101 .8p10].1. we can verify this by pressing Alt-M and reviewing the memory map: Figure 10. They reside in the data segment. GLOBAL VARIABLES EXAMPLE These variables are zeroed.

1. 102 .CHAPTER 10.3: OllyDbg: f1() starts Two values are visible in the stack 456 (0x1C8) and 123 (0x7B). POINTERS 10. GLOBAL VARIABLES EXAMPLE Let’s trace (F7) to the start of f1(): Figure 10. and also the addresses of the two global variables.

global variables: In the left bottom window we see how the results of the calculation appear in the Figure 10.1. POINTERS 10.CHAPTER 10.4: OllyDbg: f1() execution completed 103 . GLOBAL VARIABLES EXAMPLE Let’s trace until the end of f1().

eax esp. printf ("sum=%d. Only the code of main() will do: Listing 10. product=%d\n". DWORD PTR _sum$[esp+12] ecx 456 123 _f1 . 456. 00000024H 104 . 36 0 . DWORD PTR _product$[esp+8] eax ecx. size = 4 esp. Line 13 lea push lea push push push call . 0000007bH edx. 8 eax. Line 15 xor add ret .2. product).3: Optimizing MSVC 2010 (/Ob0) _product$ = -8 _sum$ = -4 _main PROC . &product). product. sum.5: OllyDbg: global variables’ addresses are passed into printf() 10. Line 14 mov mov push push push call . // now variables are local in this function f1(123. Line 10 sub . f1() code will not change. &sum. DWORD PTR _sum$[esp+24] edx eax OFFSET $SG2803 DWORD PTR __imp__printf eax.2 Local variables example Let’s rework our example slightly: Listing 10. LOCAL VARIABLES EXAMPLE Now the global variables’ values are loaded into registers ready for passing to printf() (via the stack): Figure 10. size = 4 .2: now the sum and product variables are local void main() { int sum. 000001c8H . }. POINTERS 10. DWORD PTR _product$[esp+24] eax.CHAPTER 10.

LOCAL VARIABLES EXAMPLE Let’s look again with OllyDbg.CHAPTER 10. The addresses of the local variables in the stack are 0x2EF854 and 0x2EF858. We see how these are pushed into the stack: Figure 10. POINTERS 10.6: OllyDbg: local variables’ addresses are pushed into the stack 105 .2.

LOCAL VARIABLES EXAMPLE f1() starts.2. POINTERS 10.7: OllyDbg: f1() starting 106 .CHAPTER 10. So far there is only random garbage in the stack at 0x2EF854 and 0x2EF858: Figure 10.

CONCLUSION f1() completes: Figure 10. 107 . Read more about them: ( 51.3 on page 556). By the way. located anywhere. This is in essence the usefulness of the pointers. These values are the f1() results. POINTERS 10.3. C++ references work exactly the same way. 10.3 Conclusion f1() could return pointers to any place in memory.8: OllyDbg: f1() completes execution We now find 0xDB18 and 0x243 at addresses 0x2EF854 and 0x2EF858.CHAPTER 10.

which has the same effect: unconditional jump to another place. esp OFFSET $SG2934 . eax ebp 0 The goto statement is just replaced by a JMP instruction. by using a debugger or patching. 00H 'end'. 108 .h> int main() { printf ("begin\n"). 'end' _printf esp. The second printf() call can be executed only with human intervention. exit: printf ("end\n"). 1.1: MSVC 2012 $SG2934 DB $SG2936 DB $SG2937 DB _main PROC push mov push call add jmp push call add 'begin'. p. goto exit. }. 4 eax. Here is a very simple example: #include <stdio. but nevertheless. GOTO Chapter 11 GOTO The GOTO operator is considered harmful [Dij68].CHAPTER 11. 'skip me!' _printf esp. [Yur13. 0aH. 4 SHORT $exit$3 OFFSET $SG2936 .3. 00H 'skip me!'. 'begin' _printf esp. 0aH. it can be used reasonably [Knu74]. printf ("skip me!\n"). Here is what we’ve get in MSVC 2012: Listing 11. 00H ebp ebp.2]. 4 $exit$3: _main push call add xor pop ret ENDP OFFSET $SG2937 . 0aH.

1: Hiew 109 .CHAPTER 11. Let’s open the resulting executable in Hiew: Figure 11. GOTO This also could be a simple patching exercise.

This mean. 'end' $exit$4: _main call add xor ret ENDP _printf esp.2 Exercise Try to achieve the same result using your favorite compiler and debugger. Now press F9 (save) and exit. the compiler forgot to remove the “skip me!” string.3: Result The same effect can be achieved by replacing the JMP instruction with 2 NOP instructions. 00H 'end'. DEAD CODE Place the cursor to address JMP (0x410).1 Dead code The second printf() call is also called “dead code” in compiler terms. 11. 0aH. the code will never be executed. 110 . 00H OFFSET $SG2981 .2: Hiew The second byte of the JMP opcode is relative offset of jump.1. NOP has an opcode of 0x90 and length of 1 byte. Now we run the executable and we see this: Figure 11. eax 0 However. 0aH. the compiler removes “dead code”. leaving no trace of it: Listing 11. GOTO 11. so the opcode becomes EB 00: Figure 11. press F3 (edit).CHAPTER 11. 'begin' _printf OFFSET $SG2984 . 0 implies the point right after the current instruction. So when you compile this example with optimizations. 8 eax. so we need 2 instructions as replacement. 00H 'skip me!'. So now JMP not skipping the second printf() call. 11.2: Optimizing MSVC 2012 $SG2981 DB $SG2983 DB $SG2984 DB _main PROC push call push 'begin'. press zero twice. 0aH.

2). return 0. }. int b) { if (a>b) printf ("a>b\n"). int main() { f_signed(1. DWORD PTR _b$[ebp] jle SHORT $LN3@f_signed push OFFSET $SG737 .CHAPTER 12. if (a<b) printf ("a<b\n"). }.1. if (a==b) printf ("a==b\n"). void f_unsigned (unsigned int a.1: Non-optimizing MSVC 2010 _a$ = 8 _b$ = 12 _f_signed PROC push ebp mov ebp. }. f_unsigned(1. esp mov eax. 2). CONDITIONAL JUMPS Chapter 12 Conditional jumps 12.1 Simple example #include <stdio. if (a==b) printf ("a==b\n"). 4 $LN3@f_signed: mov ecx.1 x86 x86 + MSVC What we have in the f_signed() function: Listing 12. if (a<b) printf ("a<b\n"). 'a>b' call _printf add esp. DWORD PTR _a$[ebp] 111 .h> void f_signed (int a. DWORD PTR _a$[ebp] cmp eax. 12. unsigned int b) { if (a>b) printf ("a>b\n").

SIMPLE EXAMPLE cmp ecx. CONDITIONAL JUMPS 12. if all three conditional jumps are triggered. with the exception that the JBE and JAE instructions are used here instead of JLE and JGE.2: GCC _a$ = 8 . 'a>b' call _printf add esp. where we see JG/JL instead of JA/JBE or vice-versa. if the second operand is larger than first or equal. DWORD PTR _a$[ebp] cmp eax. DWORD PTR _b$[ebp] jne SHORT $LN2@f_unsigned push OFFSET $SG2763 . The difference in these instructions (JA/JAE/JBE/JBE) JG/JGE/JL/JLE is that they work with unsigned numbers. where there is nothing much new to us: Listing 12. So. In other words. 4 $LN4@f_signed: pop ebp ret 0 _f_signed ENDP The first instruction. we can almost be sure that the variables are signed or unsigned. no printf() to be called whatsoever.3: main() _main PROC push mov push ebp ebp. 4 $LN2@f_signed: mov edx. But it is impossible without special intervention. Here is also the main() function. JLE. esp 2 112 . 'a<b' call _printf add esp. 'a==b' call _printf add esp. DWORD PTR _b$[ebp] jbe SHORT $LN3@f_unsigned push OFFSET $SG2761 . 'a==b' call _printf add esp. 4 $LN3@f_unsigned: mov ecx. DWORD PTR _b$[ebp] jne SHORT $LN2@f_signed push OFFSET $SG739 . size = 4 _f_unsigned PROC push ebp mov ebp. Control flow will not change if the operands are equal. DWORD PTR _a$[ebp] cmp edx. If this condition does not trigger (second operand smaller than first). See also the section about signed number representations ( 30 on page 451). The f_unsigned() function is the same. DWORD PTR _a$[ebp] cmp edx. esp mov eax. The second check is JNE: Jump if Not Equal.CHAPTER 12. respectively. So. DWORD PTR _b$[ebp] jge SHORT $LN4@f_signed push OFFSET $SG741 . DWORD PTR _a$[ebp] cmp ecx. see below: Now let’s take a look at the f_unsigned() function Listing 12. 'a<b' call _printf add esp. stands for Jump if Less or Equal.1. with these different instructions: JBE—Jump if Below or Equal and JAE—Jump if Above or Equal. the control flow is not to be altered and the first printf() is to be called. DWORD PTR _b$[ebp] jae SHORT $LN4@f_unsigned push OFFSET $SG2765 . 4 $LN2@f_unsigned: mov edx. size = 4 _b$ = 12 . 4 $LN4@f_unsigned: pop ebp ret 0 _f_unsigned ENDP Almost the same. control is to be passed to address or label mentioned in instruction. The third check is JGE: Jump if Greater or Equal—jump if the first operand is larger than the second or if they are equal.

eax ebp 0 113 . CONDITIONAL JUMPS _main push call add push push call add xor pop ret ENDP 12. 8 2 1 _f_unsigned esp. SIMPLE EXAMPLE 1 _f_signed esp.1. 8 eax.CHAPTER 12.

S=1. They are named with one character for brevity in OllyDbg. OllyDbg gives a hint that the (JBE) jump is to be triggered now. 114 . Z=0.CHAPTER 12. The condition is true here. we can read there that JBE is triggering if CF=1 or ZF=1. the flags are: C=1. so the jump is triggered. O=0. if we take a look into [Int13]. Indeed. which works with unsigned numbers. CONDITIONAL JUMPS 12. CMP is executed thrice here. but for the same arguments. Result of the first comparison: Figure 12. so the flags are the same each time.1: OllyDbg: f_unsigned(): first conditional jump So. SIMPLE EXAMPLE x86 + MSVC + OllyDbg We can see how flags are set by running this example in OllyDbg. T=0.1. Let’s begin with f_unsigned() . P=1. A=1. D=0.

CONDITIONAL JUMPS 12.1.2: OllyDbg: f_unsigned(): second conditional jump OllyDbg gives a hint that JNZ is to be triggered now. 115 . SIMPLE EXAMPLE The next conditional jump: Figure 12.CHAPTER 12. JNZ triggering if ZF=0 (zero flag). Indeed.

SIMPLE EXAMPLE The third conditional jump.CHAPTER 12. so the third printf() executes. It’s not true in our case. CONDITIONAL JUMPS 12. 116 . JNB: Figure 12.1.3: OllyDbg: f_unsigned(): third conditional jump In [Int13] we can see that JNB triggers if CF=0 (carry flag).

which works with signed values. P=1. SF≠OF in our case. S=1.4: OllyDbg: f_unsigned(): first conditional jump In [Int13] we find that this instruction is triggered if ZF=1 or SF≠OF. The first conditional jump JLE is to be triggered: Figure 12. so the jump triggers. D=0. 117 . O=0. Flags are set in the same way: C=1.CHAPTER 12. A=1. Z=0. SIMPLE EXAMPLE Now we can try in OllyDbg the f_signed() function.1. T=0. CONDITIONAL JUMPS 12.

1.5: OllyDbg: f_unsigned(): second conditional jump 118 .CHAPTER 12. SIMPLE EXAMPLE The second JNZ conditional jump triggering: if ZF=0 (zero flag): Figure 12. CONDITIONAL JUMPS 12.

CHAPTER 12. and that is not true in our case: Figure 12.6: OllyDbg: f_unsigned(): third conditional jump 119 .1. SIMPLE EXAMPLE The third conditional jump JGE will not trigger because it will only if SF=OF. CONDITIONAL JUMPS 12.

• force the third jump to always trigger. but in any case it will jump to the next instruction. Here is how it looks in Hiew: Figure 12. So if the offset is 0. • force the second jump to never trigger. because. the jump slips to the next instruction. CONDITIONAL JUMPS 12. Jump offset is just gets added to the address for the next instruction in these instructions. and it always print “a==b”.1. for any input values. Thus we can point the code flow to the second printf(). • The third jump we convert into JMP just as the first one. Three instructions (or bytes) has to be patched: • The first jump will now be JMP. we set the jump offset to 0. but the jump offset is still the same. so it is always triggering.7: Hiew: f_unsigned() function Essentially. we’ve got three tasks: • force the first jump to always trigger.CHAPTER 12. SIMPLE EXAMPLE x86 + MSVC + Hiew We can try to patch the executable file in a way that the f_unsigned() function will always print “a==b”. 120 . • The second jump may be triggered sometimes.

8: Hiew: let’s modify the f_unsigned() function If we forget about any of these jumps. if the flags are same after each execution? Perhaps optimizing MSVC can’t do this.4. Non-optimizing GCC Non-optimizing GCC 4.L7: mov DWORD PTR [esp+4]. why execute CMP several times.CHAPTER 12. SIMPLE EXAMPLE That’s what we do: Figure 12. DWORD PTR [esp+8] cmp DWORD PTR [esp+4]. "a>b" jmp puts .1.L1 mov DWORD PTR [esp+4]. then several printf() calls may execute.L1: rep ret .8. CONDITIONAL JUMPS 12. but optimizing GCC 4.4. Optimizing GCC An observant reader may ask. OFFSET FLAT:. "a==b" jmp puts 121 .L6: mov DWORD PTR [esp+4].LC1 .1 produces almost the same code.L7 jge .8.L6 je .4: GCC 4. "a<b" jmp puts . OFFSET FLAT:. OFFSET FLAT:. but we want just one.1 f_signed() f_signed: mov eax.LC0 .3 on page 14) instead of printf().1 can go deeper: Listing 12. eax jg .LC2 . but with puts() ( 3.

text:000000C0 . "a<b" esp. R4 SP!. 12.5: GCC 4. {R4-R6. "a==b\n" __2printf R5.2 ARM 32-bit ARM Optimizing Keil 6/2013 (ARM mode) Listing 12. 20 ebx esi puts But nevertheless. This type of x86 code is somewhat rare. "a==b" esp. can’t generate such code. {R4-R6. esi.text:000000B8 .text:000000D0 .L10: jb add pop pop ret . "a>b" puts esi.text:000000D8 . CONDITIONAL JUMPS 12.1. This kind of trick will have explained later: 13. aAB_1 . aAB_0 . assembly language programmers are fully aware of the fact that Jcc instructions can be stacked.1 on page 140. OFFSET FLAT:.text:000000B8 .L10 DWORD PTR [esp+32]. esi. The f_unsigned() function is not that æsthetically short: Listing 12. aAB . there is a good probability that the code was hand-written. "a<b\n" __2printf 122 .1.PC} SP!. OFFSET FLAT:. {R4-R6. as it seems.8.1 are probably not perfect yet.text:000000E4 .LC1 .8.text:000000E0 . On the other hand.text:000000B8 .text:000000EC EXPORT f_signed f_signed 70 01 04 00 1A A1 04 67 9E 04 70 70 19 99 40 40 00 50 0E 18 00 0F 18 00 80 40 0E 18 2D A0 50 A0 8F 00 55 8F 00 55 BD BD 8F 00 E9 E1 E1 E1 C2 CB E1 02 0B E1 A8 E8 E2 EA STMFD MOV CMP MOV ADRGT BLGT CMP ADREQ BLEQ CMP LDMGEFD LDMFD ADR B .L15: mov add pop pop jmp .L14 20 DWORD PTR [esp+32] DWORD PTR [esp+36] ebx ebx .text:000000E8 . R4 R5.text:000000D4 . 20 ebx esi puts DWORD PTR [esp].LR} R0.text:000000BC . this instruction could be removed . SIMPLE EXAMPLE We also see JMP puts here instead of CALL puts / RETN.text:000000CC . CODE XREF: main+C SP!.text:000000C4 .L13: mov call cmp jne . 20 ebx esi DWORD PTR [esp+32].LC0 .1. MSVC 2012.LR} R4. there are two CMP instructions instead of three.text:000000C8 . So if you see it somewhere. R1 R0. OFFSET FLAT:. .CHAPTER 12. ebx .L14: mov add pop pop jmp esi ebx esp. . ebx. R4 R0.1 f_unsigned() f_unsigned: push push sub mov mov cmp ja cmp je .LC2 . R0 R0.L15 esp.6: Optimizing Keil 6/2013 (ARM mode) .L13 esi. So optimization algorithms of GCC 4.text:000000DC . "a>b\n" __2printf R5.

1 on page 44). i.LR} MOV R1. The B instruction for unconditional jumping is in fact conditional and encoded just like any other conditional jump.text:00000134 .text:00000072 . ignoring flags. while comparing the two. but it is triggered only if a >= b. it is the same as MOV. CODE XREF: f_signed+8 CMP R5. Optimizing Keil 6/2013 (thumb mode) Listing 12. The predicates are encoded in 4 high bits of the 32-bit ARM instructions (condition field). but for unsigned values. {R4. Then we see the ADREQ and BLEQ instructions.1. this instruction works just like LDMFD1 . The last two instructions call printf() with the string «a<b\n» as a sole argument.text:00000072 . The next BLGT instruction behaves exactly as BL and is triggered only if the result of the comparison was the same (Greater Than). it does not returns from the function.g. The sense of ``LDMGEFD SP!. which is one more function epilogue. {R4-R6. But if it is not true.CHAPTER 12. #2 MOV R0. usually set by CMP. but is triggered only when one value was greater or equal to the other (Greater or Equal). "a>b\n" B7 F8 BL __2printf loc_82 .text:00000086 70 0C 05 A0 02 A4 06 B5 00 00 42 DD A0 F0 A5 42 02 D1 A4 A0 f_signed .text:00000128 . Why it is so good? Read here: 33. in «printf() with several arguments» section. This instruction restores the state of registers R4-R6. these predicates (HI = Unsigned higher.text:00000082 .PC}'' instruction is that is like a function epilogue. ADRGT writes a pointer to the string ``a>b\n'' into R0 and BLGT calls printf(). #2 MOV R0. {R4-R6. and LDMCSFD instructions are used there.LR} MOVS R4. There is not much new in the main() function for us: Listing 12.text:00000138 . thus. aAB_0 .text:00000082 . execute always. #1 BL f_signed MOV R1. except the CMOVcc instruction. this is often used when comparing numbers. CS = Carry Set (greater than or equal)) are analogous to those examined before. R1 MOVS R5. #1 BL f_unsigned MOV R0. but also LR instead of PC. (Greater Than).8: Optimizing Keil 6/2013 (thumb mode) . i.text:0000012C .text:0000007C . and it implies execute ALways.. f_unsigned is likewise. "a==b\n" 1 LDMFD 123 . End of function main That’s how you can get rid of conditional jumps in ARM mode. where AL stands for Always. BLHI.text:0000007E . but triggered only when specific flags are set. Consequently.7: main() .1 on page 456. but has AL in the condition field. There is no such feature in x86. End of function f_signed A lot of instructions in ARM mode can be executed only when specific flags are set.text:00000074 .PC} . E. R0 CMP R0. Another CMP is located before them (because the printf() call may tamper the flag state).text:00000082 .. but are to be executed only if operands were equal to each other during the last comparison. {R4. here ( 6.text:00000144 . the ADD instruction is in fact ADDAL internally.text:00000078 .e. CONDITIONAL JUMPS 12.text:00000140 . but the ADRHI. these instructions with suffix -GT are executes only in the case when the value in R0 (a is there) is bigger than the value in R4 (b is there). #0 LDMFD SP!.2. a < b. For instance.text:00000128 .text:00000084 .text:00000130 . then the control flow continues to the next ``LDMFD SP!.text:0000007A . only then the function execution finishes.e. They behave just like ADR and BL. We already examined an unconditional jump to the printf() function instead of function return.text:0000013C .text:000000EC .text:00000076 .text:00000148 . Then we see LDMGEFD. R4 BNE loc_8C ADR R0.LR}'' instruction. CODE XREF: main+6 PUSH {R4-R6.text:00000148 EXPORT main main 10 02 01 DF 02 01 EA 00 10 40 10 00 FF 10 00 FF 00 80 2D A0 A0 FF A0 A0 FF A0 BD E9 E3 E3 EB E3 E3 EB E3 E8 STMFD SP!.text:00000128 . The ADRGT instructions works just like ADR but executes only in the case when the previous CMP instruction found one number greater than another. SIMPLE EXAMPLE . aAB . R4 BLE loc_82 ADR R0.

LC9 b puts .40] 124 . Branch if Equal (a==b) bge . "a<b" ldp x29. w0 bhi . function epilogue. x0.L19 . "a>b" str x1.16] mov w19.L25 .L27: ldr x19.1. "a==b" add x0. . CONDITIONAL JUMPS . . W0=a.L27 .9 Listing 12. End of function f_signed Only B instructions in thumb mode may be supplemented by condition codes.16] ldp x29. :lo12:. :lo12:. BNE—Not Equal. sp.CHAPTER 12. ARM64: Optimizing GCC (Linaro) 4. [sp.text:00000096 12. Branch if Greater than or Equal (a>=b) (impossible here) . . x0. .text:00000096 . [sp].L23: bcc . a<b adrp x0. 48 add x0. :lo12:.text:00000092 .text:00000096 . so the thumb code looks more ordinary. [sp.text:0000008C .L26 . 48 ret . x0. BGE—Greater than or Equal.L15: . "a<b" add x0.LC11 b puts . but other instructions are used while dealing with unsigned values: BLS (Unsigned lower or same) and BCS (Carry Set (Greater than or equal)). [sp.text:0000008E .LC10 b puts Listing 12.LC11 .10: f_unsigned() f_unsigned: stp x29. Branch if HIgher (a>b) cmp w19.LC9 .text:00000088 . w1 add x29.PC} . -48]! . Branch if Greater Than (a>b) beq . aAB_1 . w1 bgt .L20 . x30. w1 beq . Branch if Equal (a==b) . x0.LC9 .text:00000096 . x30.LC10 .16] adrp x0.L19: adrp x0. . CODE XREF: f_signed+1C POP {R4-R6. [sp. CODE XREF: f_signed+12 CMP R5.LC11 b puts .9: f_signed() f_signed: . W0=a.L20: adrp x0. R4 BGE locret_96 ADR R0.text:0000008C .LC11 . impossible to be here ldr x19.text:00000090 . SIMPLE EXAMPLE 06 F0 B2 F8 A5 02 A3 06 42 DA A0 F0 AD F8 70 BD BL __2printf loc_8C . Branch if Carry Clear (if less than) (a<b) . impossible here ret . BLE is a normal conditional jump Less than or Equal. x30. "a<b\n" BL __2printf locret_96 . [sp].text:0000008C .L25: adrp x0.L15 . f_unsigned is likewise. [x29. 0 str x19. "a>b" add x0. :lo12:. W1=b cmp w0. W1=b cmp w0.

$v0. 0x20+arg_0($fp) . Apparently. SIMPLE EXAMPLE add bl ldr cmp bne x0.L23 . "slt $v0. 0x20+var_10($sp) . $zero .text:00000050 jalr $t9 .text:00000048 or $at.text:00000000 . . branch delay slot.text:0000004C move $t9.text:00000000 var_4 = -4 .LC10 .text:00000010 la $gp. . x0. "a==b" x29. "beq $v0.text:00000018 sw $gp.text:0000003C lui $v0.1.text:00000000 f_signed: # CODE XREF: main+18 .text:00000004 sw $ra. Let’s first start with the signed version of our function: Listing 12.text:00000034 beqz $v0. store input values into local stack: . CONDITIONAL JUMPS 12. if condition is not true.4. in fact. :lo12:. which will never be executed. x0. So. so $v0 will be set to 1 if $v0<$v1 (b<a) or to 0 if otherwise: . $sp .text:00000000 arg_0 = 0 . . this instruction pair has to be used in MIPS for comparison and branch. branch delay slot.loc_5c" is there: . NOP .L26: I have added the comments.text:0000002C or $at. $v0=b . reload them. $zero .text:00000000 var_8 = -8 . The destination register is then checked using BEQ (“Branch on Equal”) or BNE (“Branch on Not Equal”) and a jump may occur. $v1=a . loc_5C . What is also striking is that compiler is not aware that some conditions are not possible at all.$v1" is there. NOP . $v1 .text:00000028 lw $v0.text:0000000C move $fp. this is pseudoinstruction.text:00000000 var_10 = -0x10 .text:00000000 . it was done to simplify the analysis of data dependencies.text:00000038 or $at. this is pseudoinstruction.text:00000054 or $at. print "a>b" and finish .LC10 puts . $v0 . 0x20+var_8($sp) .CHAPTER 12. These instructions sets destination register value to 1 if the condition is true or to 0 if otherwise. 0x20+arg_0($fp) .3 MIPS One distinctive MIPS feature is the absence of flags.5 (IDA) . [x29.16] x0. $zero . __gnu_local_gp .40] w19. 48 x0. 0x20+arg_4($fp) . 0x20+var_4($sp) . :lo12:.11: Non-optimizing GCC 4.text:00000020 sw $a1. 0x20+arg_4($fp) . (puts & 0xFFFF)($gp) .text:0000001C sw $a0. [sp. There are instructions similar to SETcc in x86: SLT (“Set on Less Than”: signed version) and SLTU (unsigned version). (unk_230 & 0xFFFF) # "a>b" . without adding new ones.text:00000000 addiu $sp. .1. in fact. -0x20 . $v0.text:00000000 arg_4 = 4 . so there is dead code at some places. removing redundant instructions. $zero . x30. NOP 125 . 12.$zero. jump to loc_5c.text:00000030 slt $v0.text:00000024 lw $v1. (unk_230 >> 16) # "a>b" .LC9 puts x1. w1 .text:00000040 addiu $a0. NOP .text:00000008 sw $fp.text:00000044 lw $v0. Branch if Not Equal ldr adrp ldp add b x19. [sp]. Exercise Try to optimize these functions manually for size.

(aAB_0 >> 16) # "a<b" .text:00000084 jalr $t9 .text:000000C8 loc_C8: # CODE XREF: f_signed+A0 . $zero .text:0000009C slt $v0.text:00000104 f_unsigned: var_10 var_8 var_4 arg_0 arg_4 # CODE XREF: main+28 = -0x10 = -8 = -4 = 0 = 4 addiu sw sw move la sw sw sw lw $sp. but SLTU (unsigned version. $v1.text:000000E8 . so print "a==b" and finish: .text:00000090 lw $v1. set $v0 to 1 if condition is true: . check if $v1<$v0 (a<b). $zero .text:000000E4 . (aAB >> 16) # "a==b" .text:00000088 or $at. check if a==b.text:000000E0 . 0x20 .. branch delay slot.text:000000AC addiu $a0. $v0 .text:000000A8 lui $v0. so just finish: . $ZERO. branch delay slot.text:000000EC .text:000000E0 . $a0.text:000000CC lw $ra. loc_90 .text:0000005C loc_5C: # CODE XREF: f_signed+34 . $v0.text:000000C4 lw $gp. $a1.text:000000B8 move $t9. all 3 conditions were false.5 (IDA) .text:000000E0 .text:000000D0 lw $fp. $fp . (puts & 0xFFFF)($gp) . branch delay slot. print "a<b" and finish . 0x20+var_10($fp) .text:000000E0 .text:00000068 bne $v1. $zero .text:0000005C . NOP .text:00000074 addiu $a0.text:000000FC .text:000000B4 or $at. REG0.text:0000006C or $at. branch delay slot. if condition is not true (i.text:0000008C lw $gp.text:00000060 lw $v0. 0x20+arg_4($fp) .text:00000094 lw $v0. 0x20+arg_0($fp) .text:000000A4 or $at.text:00000100 . REG1”. LABEL”. $zero .text:000000DC # End of function f_signed “SLT REG0. 0x20+arg_4($fp) .text:000000E0 .text:00000090 .1. 0x20+var_8($sp) . -0x20 0x20+var_4($sp) 0x20+var_8($sp) $sp __gnu_local_gp 0x20+var_10($sp) 0x20+arg_0($fp) 0x20+arg_4($fp) 0x20+arg_0($fp) 126 . $gp. NOP . NOP . 0x20+var_4($sp) . NOP .text:00000090 loc_90: # CODE XREF: f_signed+68 .text:00000080 move $t9.4.text:0000007C or $at. $v1. $v0 .text:000000D4 addiu $sp.CHAPTER 12. condition is true.text:000000C8 . jump to loc_90 if its not true': .text:000000F0 . 0x20+arg_0($fp) . $ra.text:0000005C lw $v1.text:000000E0 .text:000000E0 . hence “U” in name) is used instead of SLT: Listing 12. CONDITIONAL JUMPS 12. The unsigned version is just the same.text:000000F8 . $zero . $gp. branch delay slot. NOP . 0x20+var_10($fp) .text:00000064 or $at.text:00000078 lw $v0. loc_C8 . $v0.text:000000A0 beqz $v0.12: Non-optimizing GCC 4.e. $zero . NOP .text:00000070 lui $v0. (aAB_0 & 0xFFFF) # "a<b" .text:000000E0 . 0x20+var_10($fp) . NOP . $v0==0). REG1” is reduced by IDA to its shorter form: “SLT REG0. condition is true. $fp.text:000000BC jalr $t9 .text:000000B0 lw $v0.text:000000DC or $at.text:000000C8 move $sp. SIMPLE EXAMPLE .text:00000098 or $at. which are in fact “BEQ REG.text:000000D8 jr $ra . We also see there BEQZ pseudoinstruction (“Branch if Equal to Zero”). (aAB & 0xFFFF) # "a==b" . $v0. (puts & 0xFFFF)($gp) . $zero .text:00000058 lw $gp. $zero . $v0 . jump to loc_c8: .text:000000C0 or $at. $zero . NOP . NOP .text:000000E0 . $fp.

CHAPTER 12.text:0000016C lw $gp. 127 . $zero .text:000001B8 jr $ra . $zero .text:00000190 lw $v0.text:00000110 sltu $v0.2 Calculating absolute value A simple function: int my_abs (int i) { if (i<0) return -i.text:00000188 lui $v0.text:000001B0 lw $fp. $zero .text:00000148 bne $v1.text:00000108 lw $v0. $zero .text:00000158 lw $v0. CALCULATING ABSOLUTE VALUE . loc_13C .text:00000170 loc_170: # CODE XREF: f_unsigned+68 . 0x20+arg_0($fp) .text:000001BC # End of function f_unsigned 12.text:0000013C lw $v1. 0x20+var_10($fp) .text:000001A4 lw $gp. 0x20+arg_4($fp) .text:0000015C or $at. $v0.text:00000140 lw $v0.text:00000160 move $t9.text:00000118 or $at. $zero .text:00000198 move $t9. $v0 . (aAB >> 16) # "a==b" .text:00000138 lw $gp. $v0 . $zero .text:00000170 lw $v1.text:000001A8 loc_1A8: # CODE XREF: f_unsigned+A0 . $zero . (puts & 0xFFFF)($gp) . 0x20+arg_4($fp) .text:00000150 lui $v0. $v0 .text:000001AC lw $ra.text:00000178 or $at. CONDITIONAL JUMPS 12.text:0000014C or $at. $v1 .text:000001A0 or $at.text:00000164 jalr $t9 .text:0000013C .text:0000018C addiu $a0. (puts & 0xFFFF)($gp) . 0x20+arg_4($fp) .text:0000019C jalr $t9 . loc_1A8 . $v0.text:00000130 jalr $t9 . 0x20+var_10($fp) .text:00000114 beqz $v0.text:000001B4 addiu $sp. 0x20+arg_0($fp) .text:00000184 or $at.text:0000010C or $at.text:00000144 or $at. $zero .text:00000128 or $at.text:0000011C lui $v0.text:00000174 lw $v0. $v0.text:00000194 or $at. else return i. $v0.text:0000017C sltu $v0. 0x20+var_10($fp) .text:000001A8 move $sp. (unk_230 & 0xFFFF) . (puts & 0xFFFF)($gp) .text:0000012C move $t9.text:000001BC or $at.2. (aAB_0 >> 16) # "a<b" .text:00000154 addiu $a0. loc_170 . $zero .text:000001A8 . $v1. $v0 . }.text:00000134 or $at. $zero .text:00000180 beqz $v0. $zero . $zero . 0x20+var_8($sp) .text:0000013C loc_13C: # CODE XREF: f_unsigned+34 .text:00000124 lw $v0.text:00000168 or $at. $zero . (aAB_0 & 0xFFFF) # "a<b" . 0x20+var_4($sp) . $fp . (unk_230 >> 16) .text:00000120 addiu $a0. 0x20 . (aAB & 0xFFFF) # "a==b" .text:00000170 .

execute "Reverse Subtract" instruction only if input value is less than 0: RSBLT r0. [sp. 12.2.2.14: Optimizing Keil 6/2013: thumb mode my_abs PROC CMP r0. wzr .15: Optimizing Keil 6/2013: ARM mode my_abs PROC CMP r0.9 does mostly the same.L2 w0.2. is input value equal to zero or greater than zero? . #16 w0. subtract input value from 0: RSBS r0.9 (ARM64) ARM64 has the instruction NEG for negating: Listing 12.1 on page 456. CONDITIONAL JUMPS 12. skip RSBS instruction then BGE |L0.12] value with contents of WZR register holds zero) w0. sp.2.#0 |L0.16: Optimizing GCC 4. which just subtract with reversed operands. ecx ret 0 my_abs ENDP GCC 4. 12.#0 .2 Optimizing Keil 6/2013: thumb mode Listing 12. check for sign of input value . skip NEG instruction if sign is positive jns SHORT $LN2@my_abs . CALCULATING ABSOLUTE VALUE Optimizing MSVC This is the usual way to generate it: Listing 12. (which always cmp bge ldr sp.6| BX lr ENDP ARM lacks a negate instruction.CHAPTER 12. so that’s what the Keil compiler do: Listing 12. compare input . negate value neg ecx $LN2@my_abs: .r0.6| . [sp.#0 .r0. 12. so the Keil compiler uses the “Reverse Subtract” instruction.2.4 Non-optimizing GCC 4.13: Optimizing MSVC 2012 x64 i$ = 8 my_abs PROC .3 Optimizing Keil 6/2013: ARM mode It’s possible to add condition codes to some instructions in ARM mode.12] w0. prepare result in EAX: mov eax.9 (ARM64) my_abs: sub str ldr . ecx .#0 BX lr ENDP Now there are no conditional jumps and this is good: 33.1 12. [sp. ECX = input test ecx.12] 128 .

NOP locret_10: .3 Conditional operator The conditional operator in C/C++ is: expression ? expression : expression Here is an example: const char* f (int a) { return a==10 ? "it is ten" : "it is not ten".1 x86 Old and non-optimizing compilers generate the code just as if an if/else statement was used: Listing 12.$a0" ($v0=0-$a0) negu $v0. The “U” suffix in both SUBU and NEGU implies that no exception to be raised in case of integer overflow.12] add ret sp. branch delay slot. sp. jump if $a0<0: bltz $a0.2.L3: 12. negate input value and store it in $v0: jr $ra . jump to $LN3@f if not equal jne SHORT $LN3@f .5 (IDA) my_abs: . this is "subu $v0. OFFSET $SG746 . which just does subtraction from zero. w0 . CONDITIONAL JUMPS neg b w0. [sp. 00H 'it is not ten'.17: Optimizing GCC 4. $zero . esp push ecx . $a0 Here we see a new instruction: BLTZ (“Branch if Less Than Zero”). }.3.L3 ldr w0. 10 . 12. 16 12.L2: . store pointer to the string into temporary variable: mov DWORD PTR tv65[ebp]. 'it is ten' 129 . compare input value with 10 cmp DWORD PTR _a$[ebp]. There is also the NEGU pseudoinstruction. 12. in fact.CHAPTER 12. just return input value ($a0) in $v0: move $v0.3.2. 12. this will be used as a temporary variable _a$ = 8 _f PROC push ebp mov ebp. which we are going to see later: 45 on page 512.18: Non-optimizing MSVC 2008 $SG746 $SG747 DB DB 'it is ten'. $a0 jr $ra or $at.6 Branchless version? There are also could be branchless version. 00H tv65 = -4 .4. locret_10 . CONDITIONAL OPERATOR .5 MIPS Listing 12.$zero. this is pseudoinstruction.

pointer to the string "it is not ten" is still in RAX as for now. 00H 'it is not ten'. CONDITIONAL OPERATOR SHORT $LN4@f to the string into temporary variable: DWORD PTR tv65[ebp]. while the non-optimizing GCC 4. if comparison result is EQual.3.8 uses a conditional jumps. 00H 'it is not ten'.0 |L0.|L0. 'it is not ten' .CHAPTER 12. OFFSET $SG793 . 'it is ten' . mov mov pop ret _f ENDP 12. 00H a$ = 8 f PROC . 10 . 'it is not ten' $LN4@f: ret 0 _f ENDP Newer compilers are more concise: Listing 12. eax.|L0. if not. do nothing. cmove rax.8 for x86 also uses the CMOVcc instruction.16| DCB "it is ten". 00H _a$ = 8 .3.28| 130 . 'it is not ten' copy pointer to the string from temporary variable to EAX.19: Optimizing MSVC 2008 $SG792 $SG793 DB DB 'it is ten'. 'it is ten' lea rax. OFFSET $SG747 . compare input value with 10 cmp DWORD PTR _a$[esp-4]. this is exit. jump to $LN4@f if equal je SHORT $LN4@f mov eax. OFFSET $SG792 . rdx ret 0 f ENDP Optimizing GCC 4.16| . copy pointer to the "it is ten" string into R0 ADREQ r0. jump to exit jmp $LN3@f: .28| . DWORD PTR tv65[ebp] esp. compare input value with 10 CMP r0. OFFSET FLAT:$SG1356 . "it is not ten" BX lr ENDP |L0.0 DCB "it is not ten". compare input value with 10 cmp ecx. if comparison result is Not Equal. ebp ebp 0 Listing 12. 10 mov eax.20: Optimizing MSVC 2012 x64 $SG1355 DB $SG1356 DB 'it is ten'. OFFSET FLAT:$SG1355 . copy pointer to the "it is not ten" string into R0 ADRNE r0.#0xa . load pointers to the both strings lea rdx.21: Optimizing Keil 6/2013 (ARM mode) f PROC . 12. copy value from RDX ("it is ten") . size = 4 _f PROC . CONDITIONAL JUMPS . store pointer mov $LN4@f: .2 ARM Optimizing Keil for ARM mode also uses the conditional instructions ADRcc: Listing 12. if equal. "it is ten" .

9 f: cmp beq adrp add ret x0.string "it is not ten" That’s because ARM64 doesn’t have a simple load instruction with conditional flags. "it is ten" x0.22: Optimizing Keil 6/2013 (thumb mode) f PROC . It has.5 (assembly output) $LC0: . like ADRcc in 32-bit ARM mode or CMOVcc in x86 [ARM13a.23: Optimizing GCC (Linaro) 4.$2.LC0: . branch delay slot .LC0 . however.10 .12| .3. compare input value with 10 CMP r0.3 ARM64 Optimizing GCC (Linaro) 4.%lo($LC0) $L2: 131 . 12.%hi($LC0) $31 $2. compare $a0 and 10.28| 12. GCC 4.9 for ARM64 also uses conditional jumps: Listing 12. "it is ten" BX lr ENDP |L0.#0xa .5 for MIPS is not very smart. 10 .5].ascii "it is ten\000" $LC1: f: li $2.$L2 nop . "it is not ten" x0.4.9 doesn’t seem to be smart enough to use it in such piece of code.|L0.ascii "it is not ten\000" . since there are no load instructions that support conditional flags: Listing 12.LC1 adrp add ret x0. .L3 .0 |L0. Optimizing Keil for Thumb mode needs to use conditional jump instructions. C5. but GCC 4.LC0 .|L0. :lo12:. leave address lui j addiu # 0xa of "it is not ten" string in $v0 and return: $2.CHAPTER 12. :lo12:.8| if EQual BEQ |L0.0 DCB "it is ten". branch if equal x0.LC1: .24: Optimizing GCC 4. x0.L3: . jump if equal: beq $4.3. CONDITIONAL JUMPS 12.12| DCB "it is not ten".4 MIPS Unfortunately. CONDITIONAL OPERATOR Without manual intervention. the two instructions ADREQ and ADRNE cannot be executed in the same run. either: Listing 12. jump to |L0. "it is not ten" BX lr |L0.string "it is ten" . x0. “Conditional SELect” instruction (CSEL).8| ADR r0.3.4.28| .$2.LC1 . p390.8| ADR r0. .

optimizing GCC 4. OFFSET FLAT:. 12.25: Optimizing GCC 4.%hi($LC1) $31 $2. else return b. "it is ten" . int b) { if (a>b) return a.4. 10 mov edx.3. 12. GETTING MINIMAL AND MAXIMAL VALUES of "it is ten" string in $v0 and return: $2.8 for x86 was also able to use CMOVcc in this case: Listing 12.LFB0: .21. }.8 .12.23 by removing all conditional jump instructions and using the CSEL instruction.7 Exercise Try rewriting the code in listing. }.3.4 Getting minimal and maximal values 12. OFFSET FLAT:. else return b. Interestingly.1 32-bit int my_max(int a.CHAPTER 12. int b) { if (a<b) return a. "it is not ten" mov eax.4. But the optimizing MSVC 2012 is not that good (yet). if comparison result is Not Equal.string "it is ten" .string "it is not ten" f: . }.1 on page 456.LC1 . do nothing cmovne eax. 12. compare input value with 10 cmp DWORD PTR [esp+4].LC1: . int my_min(int a.3. edx ret Optimizing Keil in ARM mode generates code identical to listing. copy EDX value to EAX . leave address lui j addiu 12.$2.LC0 . 132 . if not. else return "it is not ten".%lo($LC1) Let’s rewrite it in an if/else way const char* f (int a) { if (a==10) return "it is ten".12. CONDITIONAL JUMPS .LC0: .6 Conclusion Why optimizing compilers try to get rid of conditional jumps? Read here about it: 33.5 12.

compare A and B: cmp eax. esp mov eax. DWORD PTR _b$[ebp] $LN3@my_min: pop ebp ret 0 _my_min ENDP _a$ = 8 _b$ = 12 _my_max PROC push ebp mov ebp.CHAPTER 12.26: Non-optimizing MSVC 2013 _a$ = 8 _b$ = 12 _my_min PROC push ebp mov ebp.27: Optimizing Keil 6/2013 (thumb mode) my_max PROC . DWORD PTR _a$[ebp] jmp SHORT $LN3@my_min jmp SHORT $LN3@my_min . There is one unneeded JMP instruction in each function. jump if A is less or equal to B: jle SHORT $LN2@my_max . DWORD PTR _b$[ebp] . R1=B .6| .4. return BX lr ENDP my_min PROC 133 . DWORD PTR _b$[ebp] . DWORD PTR _a$[ebp] .6| . reload A to EAX if otherwise and jump to exit mov eax. Branchless ARM for Thumb mode reminds us of x86 code: Listing 12. this is redundant JMP $LN2@my_min: . esp mov eax. if A is greater or equal to B: jge SHORT $LN2@my_min . otherwise (A<=B) return R1 (B): MOVS r0. this is redundant JMP $LN2@my_max: . compare A and B: cmp eax.r1 |L0. return B mov eax. DWORD PTR _a$[ebp] . DWORD PTR _a$[ebp] jmp SHORT $LN3@my_max jmp SHORT $LN3@my_max . CONDITIONAL JUMPS 12. GETTING MINIMAL AND MAXIMAL VALUES Listing 12. R0=A . compare A and B: CMP r0. DWORD PTR _b$[ebp] $LN3@my_max: pop ebp ret 0 _my_max ENDP These two functions differ only in the conditional jump instruction: JGE (“Jump if Greater or Equal”) is used in the first one and JLE (“Jump if Less or Equal”) in the second. branch if A is greater then B: BGT |L0. jump. which MSVC probably left by mistake. return B mov eax. reload A to EAX if otherwise and jump to exit mov eax.r1 .

R0=A . if instruction is not triggered (in case of A<B). GETTING MINIMAL AND MAXIMAL VALUES . this instruction will trigger only if A>=B (hence. eax .r1 . if instruction is not triggered (in case of A>B). compare A and B: CMP r0. R1=B . GE . load A value into EAX .Greater or Equal) .8. edx ret my_min: mov mov edx. if A>=B. compare A and B: CMP r0. R0=A . branch if A is less then B: BLT |L0.Less or Equal) . if A<=B. A value is still in R0 register MOVGE r0. so the code is shorter. edx ret 134 . return B instead of A by placing B in R0 . EAX=B . EAX=B .r1 . compare A and B: cmp edx. A is still in R0 register MOVLE r0.CHAPTER 12. the instruction idle if otherwise (if A<B) cmovge eax. EDX=A .14| . DWORD PTR [esp+4] eax. LE . which is analogous to MOVcc in ARM: Listing 12. DWORD PTR [esp+8] .4.r1 . eax . MOVcc is to be executed only if the condition is met: Listing 12. return B instead of A by placing B in R0 . CONDITIONAL JUMPS 12. EDX=A . DWORD PTR [esp+4] eax.29: Optimizing MSVC 2013 my_max: mov mov edx. the instruction idle if otherwise (if A>B) cmovle eax.28: Optimizing Keil 6/2013 (ARM mode) my_max PROC . compare A and B: cmp edx. R1=B .r1 BX lr ENDP my_min PROC . this instruction will trigger only if A<=B (hence.14| . R1=B . load A value into EAX . R0=A . compare A and B: CMP r0. DWORD PTR [esp+8] . It’s possible to use conditional suffixes in ARM mode.r1 |L0.1 and optimizing MSVC 2013 can use CMOVcc instruction. otherwise (A>=B) return R1 (B): MOVS r0.r1 BX lr ENDP Optimizing GCC 4. return BX lr ENDP The functions differ in the branching instruction: BGT and BLT.

9.8] [sp] [sp.4. CONDITIONAL JUMPS 12.31: Optimizing GCC 4.L3: my_min: sp. if A>=B. prepare B in RAX for return: mov rax. rsi . sp. [sp] add ret sp. . . RDI=A . #16 [sp.L5 x0.8] .1 x64 my_max: . sp. .8] [sp] x0 ldr x0. x0. x1. RSI=B . else return b. x1.30: Non-optimizing GCC 4. compare A and B: cmp rdi. }.9. x1. x0.1 ARM64 my_max: sub str str ldr ldr cmp ble ldr b sp. int64_t my_min(int64_t a. x1.L3 sp.L2: .L6 ldr x0. but the code is comprehensible: Listing 12.CHAPTER 12. GETTING MINIMAL AND MAXIMAL VALUES 64-bit #include <stdint. #16 [sp. rsi . x1. 16 sub str str ldr ldr cmp bge ldr b sp.L6: Branchless No need to load function arguments from the stack.h> int64_t my_max(int64_t a. int64_t b) { if (a<b) return a. int64_t b) { if (a>b) return a. as they are already in the registers: Listing 12. 16 [sp.4.8] . 135 . }. . x0. put A (RDI) in RAX for return. There is some unneeded value shuffling. else return b.L2 x0.8] [sp] x0 [sp.L5: . [sp] add ret sp. x0. x1.8] [sp] [sp.2 12.

select X0 (A) . $a0 136 . $a1. select X1 (B) csel ret 12.5 for MIPS is not that good: Listing 12.CHAPTER 12.5 (IDA) my_max: . . . rsi . this instruction is idle if otherwise (if A<B) cmovge rax.32: Optimizing GCC 4. if $a1<$a0: beqz $v1. compare A and B: cmp rdi. x0. . locret_10 this is branch delay slot prepare $a1 in $v0 in case of branch triggered: move $v0. $a1 beqz $v1. select X1 (B) csel ret my_min: . this instruction is idle if otherwise (if A>B) cmovle rax.4.3 B: x0. Listing 12. X1=B . select X0 (A) . locret_28 move $v0.33: Optimizing GCC 4. just the name is different: “Conditional SELect”. GETTING MINIMAL AND MAXIMAL VALUES . $a0 locret_10: jr or $ra $at. but input operands in SLT instruction are swapped: my_min: slt $v1.4.1 ARM64 my_max: . $zero . branch delay slot. $a0 jump. . x1 to X0 if X0<=X1 or A<=B (Less or Equal) to X0 if A>B x0. ARM64 has the CSEL instruction. if A<=B. which works just as MOVcc in ARM or CMOVcc in x86. CONDITIONAL JUMPS 12. $a1<$a0: slt $v1. ge B: x0.4. rdi ret my_min: . x1. $a1 move $v0. le MIPS Unfortunately.9. compare A and cmp . $a1 no branch triggered. x1. X1=B . X0=A . x1 to X0 if X0>=X1 or A>=B (Greater or Equal) to X0 if A<B x0. GCC 4. x0. prepare $a0 in $v0: move $v0. put A (RDI) in RAX for return. prepare B in RAX for return: mov rax. rsi . NOP .4. X0=A . the min() function is same. rdi ret MSVC 2013 does almost the same. RSI=B . RDI=A . set $v1 . compare A and cmp . $a0.

. JMP exit true: .. Listing 12. some code to be executed if comparison result is true .. register/value Jcc true .38: Check for equal values BEQ REG1.34: x86 CMP register. CONCLUSION locret_28: jr or $ra $at.. label . REG2. CONDITIONAL JUMPS 12..5. exit: 12. JMP exit true: .1 x86 Here’s the rough skeleton of a conditional jump: Listing 12..2 ARM Listing 12. label ... label . some code to be executed if comparison result is false .. $zero .. the second MOVE is executed only if the branch wasn’t taken.. branch delay slot.5.. label .5.35: ARM CMP register.... Listing 12. greater than (signed) SLT REG1. some code to be executed if comparison result is false . exit: 12. REG2... REG3 BEQ REG1. Listing 12.5 Conclusion 12. label .3 MIPS Listing 12.37: Check for less than zero: BLTZ REG. Listing 12. cc=condition code false: ...36: Check for zero BEQZ REG...40: Check for less than. 12..5.39: Check for non-equal values BNE REG1. REG2. 137 ..CHAPTER 12. register/value Bcc true . cc=condition code false: . NOP Do not forget about the branch delay slots: the first MOVE is executed before BEQZ.... some code to be executed if comparison result is true .

some instruction will be executed if condition code is true instr2_cc . register/value instr1_cc . some other instruction will be executed if other condition code is true . register/value ITEEE EQ ..7.. Listing 12..2 on page 251. etc .. instruction will be instr3 . label .43: ARM (thumb mode) CMP register. instruction will be instr2 . Of course. Read more about it: 17. REG3 BEQ REG1. CSEL in ARM64.. CONCLUSION Listing 12.. the conditional move instruction can be used: MOVcc in ARM (in ARM mode).41: Check for less than.5. ARM It’s possible to use conditional suffixes in ARM mode for some instructions: Listing 12.42: ARM (ARM mode) CMP register. allowing to add conditional suffixes to the next four instructions. instruction will be instr4 . greater than (unsigned) SLTU REG1. set these suffixes: instr1 .5.CHAPTER 12. there is no limit for the number of instructions with conditional code suffixes. REG2. CONDITIONAL JUMPS 12. CMOVcc in x86. as long as the CPU flags are not modified by any of them.4 Branchless If the body of a condition statement is very short. 12. Thumb mode has the IT instruction. instruction will be if-then-else-else-else executed if condition is executed if condition is executed if condition is executed if condition is 138 true false false false .

esp push ecx mov eax. break.CHAPTER 13.1 x86 Non-optimizing MSVC Result (MSVC 2010): Listing 13. size = 4 _f PROC push ebp mov ebp. 'one'. case 1: printf ("one\n").h> void f (int a) { switch (a) { case 0: printf ("zero\n"). DWORD PTR _a$[ebp] mov DWORD PTR tv64[ebp]. break. // test }. 0aH. size = 4 _a$ = 8 . 0aH. }. 4 139 .1. 00H call _printf add esp. }. 13. int main() { f (2).1: MSVC 2010 tv64 = -4 . default: printf ("something unknown\n"). 0 je SHORT $LN4@f cmp DWORD PTR tv64[ebp].1 Small number of cases #include <stdio. 2 je SHORT $LN2@f jmp SHORT $LN1@f $LN4@f: push OFFSET $SG739 . break. 00H call _printf add esp. 1 je SHORT $LN3@f cmp DWORD PTR tv64[ebp]. SWITCH()/CASE/DEFAULT Chapter 13 switch()/case/default 13. eax cmp DWORD PTR tv64[ebp]. break. 4 jmp SHORT $LN7@f $LN3@f: push OFFSET $SG741 . case 2: printf ("two\n"). 'zero'.

or just a pack of if() statements. 4 esp. DWORD PTR _a$[esp-4] sub eax. OFFSET jmp _printf _f ENDP $SG791 . If we compile this in GCC 4. Optimizing MSVC Now let’s turn on optimization in MSVC (/Ox): cl 1. OFFSET jmp _printf $LN4@f: mov DWORD PTR _a$[esp-4]. 'zero'. else if (a==2) printf ("two\n").asm /Ox Listing 13. OFFSET jmp _printf $LN3@f: mov DWORD PTR _a$[esp-4].1.CHAPTER 13. we’ll get almost the same result. else if (a==1) printf ("one\n"). subtracting from 0 is 0) and the first conditional jump JE (Jump if Equal or synonym JZ —Jump if Zero) is to be triggered and control flow is to be passed to the $LN4@f label. There is nothing especially new to us in the generated code. 0aH. 'something unknown'. This implies that switch() is like syntactic sugar for a large number of nested if()s. size = 4 _f PROC mov eax.2: MSVC _a$ = 8 .4. SWITCH()/CASE/DEFAULT jmp $LN2@f: push call add jmp $LN1@f: push call add $LN7@f: mov pop ret _f ENDP 13. }. 'something unknown'. 0aH. 0aH. with the exception of the compiler moving input variable a to a temporary local variable tv64 1 . 4 SHORT $LN7@f OFFSET $SG745 . 0aH. 00H Here we can see some dirty hacks. 'one'. else printf ("something unknown\n"). Sounds absurd. the ZF flag is to be set (e. 1 je SHORT $LN2@f mov DWORD PTR _a$[esp-4]. 0aH. 00H _printf esp. 0aH.1. 0 je SHORT $LN4@f sub eax. but it is done to check if the value in EAX was 0. 'two'. 00H $SG787 . 'two'. 00H _printf esp. 00H $SG789 . 1 je SHORT $LN3@f sub eax. If we work with switch() with a few cases it is impossible to be sure if it was a real switch() in the source code. ebp ebp 0 Our function with a few cases in switch() is in fact analogous to this construction: void f (int a) { if (a==0) printf ("zero\n"). even with maximal optimization turned on (-O3 option). where the 'zero' message is 1 Local variables in stack are prefixed with tv — that’s how MSVC names internal variables for its needs 140 . OFFSET jmp _printf $LN2@f: mov DWORD PTR _a$[esp-4]. SMALL NUMBER OF CASES SHORT $LN7@f OFFSET $SG743 . First: the value of a is placed in EAX and 0 is subtracted from it.g. 00H $SG785 .c /Fa1. If yes.

but via JMP. except for the first printf() argument.1. the corresponding jump is to be triggered.CHAPTER 13. And that is what our code does. bypassing the end of the f() function. SMALL NUMBER OF CASES being printed. Second: we see something unusual for us: a string pointer is placed into the a variable. 2 wikipedia 141 . and then printf() is called not via CALL. CALL itself pushes the return address (RA) to the stack and does an unconditional jump to our function address.2. Our function at any point of execution (since it do not contain any instruction that moves the stack pointer) has the following stack layout: • ESP—points to RA • ESP+4—points to the a variable On the other side. here ( 6. section. If the first jump doesn’t get triggered. but directly printf(). the control flow passes to printf() with string argument 'something unknown'. It replaces the function’s first argument with the address of the string and jumps to printf(). which needs to point to the string. printf() prints a string to stdout and then executes the RET instruction. SWITCH()/CASE/DEFAULT 13. And if no jump gets triggered at all. All this is possible because printf() is called right at the end of the f() function in all cases. it is all done for the sake of speed. A similar case with the ARM compiler is described in “printf() with several arguments”.1 on page 44). There is a simple explanation for that: the Caller pushes a value to the stack and calls our function via CALL. In some way. And of course. which POPs RA from the stack and control flow is returned not to f() but rather to f()’s callee. as if we didn’t call our function f(). it is similar to the longjmp()2 function. when we need to call printf() here we need exactly the same stack layout. 1 is subtracted from the input value and if at some stage the result is 0.

CHAPTER 13. SMALL NUMBER OF CASES OllyDbg Since this example is tricky.1. OllyDbg can detect such switch() constructs. let’s trace it in OllyDbg.1: OllyDbg: EAX now contain the first (and only) function argument 142 . and it can add some useful comments. that’s the function’s input value: Figure 13. SWITCH()/CASE/DEFAULT 13. EAX is 2 in the beginning.

EAX still contains 2.2: OllyDbg: SUB executed 143 . SMALL NUMBER OF CASES 0 is subtracted from 2 in EAX.1. But the ZF flag is now 0. SWITCH()/CASE/DEFAULT 13. indicating that the resulting value is non-zero: Figure 13. Of course.CHAPTER 13.

so the ZF flag is still 0: Figure 13.CHAPTER 13.1. SWITCH()/CASE/DEFAULT 13. But 1 is non-zero.3: OllyDbg: first DEC executed 144 . SMALL NUMBER OF CASES DEC is executed and EAX now contains 1.

CHAPTER 13. because the result is zero: Figure 13.4: OllyDbg: second DEC executed OllyDbg shows that this jump is to be taken now. 145 . SWITCH()/CASE/DEFAULT 13. EAX is finally 0 and the ZF flag gets set.1. SMALL NUMBER OF CASES Next DEC is executed.

146 . SWITCH()/CASE/DEFAULT 13. SMALL NUMBER OF CASES A pointer to the string “two” is to be written into the stack now: Figure 13.5: OllyDbg: pointer to the string is to be written at the place of the first argument Please note: the current argument of the function is 2 and 2 is now in the stack at the address 0x0020FA44.CHAPTER 13.1.

DLL Now printf() treats the string at 0x00FF3010 as its only argument and prints the string.CHAPTER 13. Then. jump happens. SMALL NUMBER OF CASES MOV writes the pointer to the string at address 0x0020FA44 (see the stack window).1. This is the first instruction of the printf() function in MSVCR100. 147 .DLL (I compiled the example with /MD switch): Figure 13. SWITCH()/CASE/DEFAULT 13.6: OllyDbg: first instruction of printf() in MSVCR100.

7: OllyDbg: last instruction of printf() in MSVCR100. SWITCH()/CASE/DEFAULT 13. SMALL NUMBER OF CASES This is the last instruction of printf(): Figure 13. 148 .DLL The string “two” was just printed to the console window.1.CHAPTER 13.

CHAPTER 13. There is only one call to printf(). but rather to main(): Figure 13. or just a pack of if() statements. "two\n" loc_170: . #2 R0.8: OllyDbg: return to main() Yes.text:00000154 . and we can sure that the a variable is not equal to these numbers at this point.1. and ADREQ does not modify any flags at all.text:00000170 . aZero . 13. aTwo . it will since BEQ checks the flags set by the CMP instruction.2. since a was already checked to be equal to 0 or 1. #1 R0.text:00000170 . aOne . if R0 = 0. "something unknown\n" R0. And if R0 = 2. and we have already examined this trick here ( 6. we see here predicated instructions again (like ADREQ (Equal)) which is triggered only in case R0 = 0. from the guts of printf() to main(). #2'' .text:00000150 . CODE XREF: f1+8 . "one\n" loc_170 R0.text:0000015C .text:00000164 . will BEQ trigger correctly since ADREQ before it has already filled the R0 register with another value? Yes. ``CMP R0. is needed to check if a = 2. 13.text:00000170 00 13 05 01 4B 02 02 4A 4E 00 0E 00 00 0F 00 00 0F 0F 50 8F 00 50 8F 00 50 8F 8F E3 02 0A E3 02 0A E3 12 02 f1: CMP ADREQ BEQ CMP ADREQ BEQ CMP ADRNE ADREQ R0. but rather to main(). The rest of the instructions are already familiar to us.text:00000168 . by investigating this code we cannot say if it was a switch() in the original source code.1 on page 44). and then loads the address of the string «zero\n» into R0. And CALL 0x00FF1000 was the actual instruction which called f().2 ARM: Optimizing Keil 6/2013 (ARM mode) . An astute reader may ask. If it is not true. Anyway. SWITCH()/CASE/DEFAULT 13. Because RA in the stack points not to some place in f().3 ARM: Optimizing Keil 6/2013 (thumb mode) 149 .1.text:00000160 .text:00000158 . a pointer to the string «two\n» will be loaded by ADREQ into R0.text:00000170 .text:0000016C . at the end. The next instruction BEQ redirects control flow to loc_170.text:0000014C . f1+14 78 18 00 EA B __2printf Again. then ADRNE loads a pointer to the string «something unknown \n» into R0. there are three paths to printf(). aSomethingUnkno . "zero\n" loc_170 R0. The last instruction. #0 R0.1. SMALL NUMBER OF CASES Now let’s press F7 or F8 (step over) and return…not to f(). In the end.text:0000014C . the jump was direct.

sp.L34 w0. x30.text:000000DC . 0 w0. aSomethingUnkno . .L32 adrp add bl b x0. SMALL NUMBER OF CASES f1: PUSH CMP BEQ CMP BEQ CMP BEQ ADR B {R4.1. "one\n" B default_case . [x29. x0.text:000000F0 .string "one" . :lo12:.LC13: .LC13 puts .L32 adrp add bl b x0.text:000000DA .text:000000E6 . it is not possible to add conditional predicates to most instructions in thumb mode.9 .text:000000EC 00 E0 one_case: .L38 .LC15: .LC12 puts .text:000000E6 95 A0 . [x29. 13.LC15 puts .28] w0. CODE XREF: f1+8 ADR R0.text:000000D6 .text:000000F0 . wzr . :lo12:.text:000000F4 10 BD two_case: .L38: 150 .LC15 .L32 adrp add bl x0.text:000000EA 96 A0 . "something unknown" x0.text:000000E2 . "one" x0. [sp. #0 zero_case R0.text:000000EE . f1+14 BL __2printf POP {R4.L34: . aZero . jump to default label x0.L35: .L35 w0.text:000000D8 . SWITCH()/CASE/DEFAULT .text:000000E0 .LC12: . #2 two_case R0. so the thumb-code here is somewhat similar to the easily understandable x86 CISC-style code. "zero" x0. "zero\n" B default_case . -32]! x29. aOne .string "something unknown" f12: stp add str ldr cmp beq cmp beq cmp bne adrp add bl b x29. x0. 2 .text:000000D4 . . CODE XREF: f1+10 .text:000000DE . 1 . aTwo .string "two" .PC} As I already mentioned.LC14 puts .LR} R0. "something unknown\n" default_case .LC12 .text:000000D4 .text:000000EE 97 A0 . :lo12:. .LC14 . "two" x0.LC13 . x0.28] w0. #1 one_case R0. x0.text:000000EA . :lo12:.string "zero" . "two\n" default_case .LC14: .4 ARM64: Non-optimizing GCC (Linaro) 4.CHAPTER 13. .1. CODE XREF: f1+C ADR R0. CODE XREF: f1+4 ADR R0.text:000000E4 10 00 05 01 05 02 05 91 04 B5 28 D0 28 D0 28 D0 A0 E0 13.text:000000E8 02 E0 zero_case: .text:000000F0 06 F0 7E F8 .

"zero" x0.LC13 puts Better optimized piece of code.1 on page 140.LC12 puts x0. default case adrp add b . is it 1? .L31 w0.1. $zero .5 ARM64: Optimizing GCC (Linaro) 4. load delay slot. NOP . . branch delay slot # --------------------------------------------------------------------------loc_38: lui lw or jr la $a0. (__gnu_local_gp >> 16) li beq la $v0. There is also a direct jump to puts() instead of calling it. loc_60 $gp. $t9.3: Optimizing GCC 4. :lo12:. x0. load delay slot. branch delay slot. NOP .L32: adrp add b . if not equal to 0: bnez $a0. . loc_4C or $at. branch delay slot. x30. x0. $v0.L35: adrp add b . 32 The type of the input value is int. NOP jr $t9 . "two" x0. world!” example: 3. ($LC0 >> 16) # "zero" lw $t9. SMALL NUMBER OF CASES nop . .L31: adrp add b w0. 1 $a0. :lo12:. x0. CBZ (Compare and Branch on Zero) instruction does jump if W0 is zero. 1 . like I explained before: 13. . :lo12:.LC14 . The string pointers are passed to puts() using an ADRP/ADD instructions pair just like I demonstrated in the “Hello. jump.5 (IDA) f: lui $gp. $at.L35 x0.L32: ldp ret x29.LC14 puts x0. is it 2? li $v0. $t9 $a0. . (__gnu_local_gp & 0xFFFF) . zero case: lui $a0.1. $zero . 2 beq $a0.LC13 . branch delay slot. :lo12:.LC15 puts x0. loc_38 or $at. ($LC0 & 0xFFFF) # "zero" . [sp].LC12 . 13.CHAPTER 13.6 MIPS Listing 13. SWITCH()/CASE/DEFAULT 13. hence register W0 is used to hold it instead of the whole X0 register. branch delay slot 151 . "one" x0. (puts & 0xFFFF)($gp) or $at. $v0.L32 w0. # CODE XREF: f+1C ($LC3 >> 16) # "something unknown" (puts & 0xFFFF)($gp) $zero .9 f12: cmp beq cmp beq cbz .LC15 .1.1.4.4. branch delay slot . "something unknown" x0. NOP la $a0.5 on page 15. 2 . NOP ($LC3 & 0xFFFF) # "something unknown" . x0. 13. $zero .

An instruction next to LW may execute at the moment while LW loads value from memory. 13.1.h> void f (int a) { switch (a) { case 0: printf ("zero\n"). break.7 Conclusion A switch() with few cases is indistinguishable from an if/else construction. size = 4 ebp ebp. break. break. DWORD PTR _a$[ebp] 152 . the next instruction must not use the result of LW. ($LC2 & 0xFFFF) # "two" .13. 13. size = 4 . 13. so this is somewhat outdated. break. NOP jr $t9 la $a0. SWITCH()/CASE/DEFAULT 13. In general. $t9.2 A lot of cases If a switch() statement contains a lot of cases. $t9 $a0. it is not very convenient for the compiler to emit too large code with a lot JE/JNE instructions. (puts & 0xFFFF)($gp) or $at. $zero . break. break. case 2: printf ("two\n"). A LOT OF CASES # --------------------------------------------------------------------------loc_4C: # CODE XREF: f+14 lui $a0. so here we see a jump to puts() (JR: “Jump Register”) instead of “jump and link”. However. case 1: printf ("one\n"). it can be ignored. int main() { f (2).4: MSVC 2010 tv64 = -4 _a$ = 8 _f PROC push mov push mov . We also often see NOP instructions after LW ones. load delay slot.1 x86 Non-optimizing MSVC We get (MSVC 2010): Listing 13. load delay slot.2.1. default: printf ("something unknown\n"). We talked about this earlier: 13. case 4: printf ("four\n"). }. This is “load delay slot”: another delay slot in MIPS. esp ecx eax. but GCC still adds NOPs for older MIPS CPUs.2. branch delay slot The function always ends with calling puts(). ($LC2 >> 16) # "two" lw $t9. # CODE XREF: f+8 ($LC1 >> 16) # "one" (puts & 0xFFFF)($gp) $zero .CHAPTER 13. #include <stdio. $at. for example: listing. }.1 on page 140. branch delay slot # --------------------------------------------------------------------------- loc_60: lui lw or jr la $a0. Modern MIPS CPUs have a feature to wait if the next instruction uses result of LW. // test }.1. case 3: printf ("three\n"). NOP ($LC1 & 0xFFFF) # "one" .1.

The address of the $LN11@f table + 8 is the table element where the $LN4@f label is stored. SWITCH()/CASE/DEFAULT 13. 0 DD $LN5@f . eax cmp DWORD PTR tv64[ebp]. 4 _f ENDP What we see here is a set of printf() calls with various arguments. 00H call _printf add esp. cache memory. For example.2. 4 jmp SHORT $LN9@f $LN5@f: push OFFSET $SG741 . pointing exactly to the element we need. but also internal symbolic labels assigned by the compiler. 2 ∗ 4 = 8 (all table elements are addresses in a 32-bit process and that is why all elements are 4 bytes wide). 'one'. Literally. align next label $LN11@f: DD $LN6@f . That is how an address inside the table is constructed. This table is sometimes called jumptable or branch table3 . 0aH. JMP fetches the $LN4@f address from the table and jumps to it. Then the corresponding printf() is called with argument 'two'. 'three'. Not quite relevant these days. 3 The whole method was once called computed GOTO in early versions of FORTRAN: wikipedia. 'two'. let’s say a is equal to 2. 0aH. 'four'. but what a term! 153 . then it gets multiplied by 4 and added with the $LN11@f table address. 0aH. A LOT OF CASES mov DWORD PTR tv64[ebp]. ebp pop ebp ret 0 npad 2 . npad ( 88 on page 866) is assembly language macro that aligning the next label so that it is to be stored at an address aligned on a 4 byte (or 16 byte) boundary. 0aH. All these labels are also mentioned in the $LN11@f internal table. 0aH. 1 DD $LN4@f . 2 DD $LN3@f . At the function start.CHAPTER 13. 'something unknown'. But if the value of a is less or equals to 4. 4 jmp SHORT $LN9@f $LN2@f: push OFFSET $SG747 . DWORD PTR tv64[ebp] jmp DWORD PTR $LN11@f[ecx*4] $LN6@f: push OFFSET $SG739 . This is very suitable for the processor since it is able to fetch 32-bit values from memory through the memory bus. the jmp DWORD PTR $LN11@f[ecx*4] instruction implies jump to the DWORD that is stored at address $LN11@f + ecx * 4. 4 $LN9@f: mov esp. 00H call _printf add esp. in a more effective way if it is aligned. 00H call _printf add esp. etc. 4 jmp SHORT $LN9@f $LN3@f: push OFFSET $SG745 . 3 DD $LN2@f . All they have not only addresses in the memory of the process. if a is greater than 4. where printf() with argument 'something unknown' is called. 4 ja SHORT $LN1@f mov ecx. 0aH. control flow is passed to label $LN1@f. 4 jmp SHORT $LN9@f $LN1@f: push OFFSET $SG749 . 00H call _printf add esp. 'zero'. 00H call _printf add esp. 4 jmp SHORT $LN9@f $LN4@f: push OFFSET $SG743 . 00H call _printf add esp.

9: OllyDbg: function’s input value is loaded in EAX 154 . SWITCH()/CASE/DEFAULT 13. A LOT OF CASES OllyDbg Let’s try this example in OllyDbg. The input value of the function (2) is loaded into EAX: Figure 13.2.CHAPTER 13.

SWITCH()/CASE/DEFAULT 13.2. A LOT OF CASES The input value is checked.CHAPTER 13. the “default” jump is not taken: Figure 13.10: OllyDbg: 2 is no bigger than 4: no jump is taken 155 . is it bigger than 4? If not.

we are going to come back to them later 156 . so now we see the jumptable in the data window. A LOT OF CASES Here we see a jumptable: Figure 13. SWITCH()/CASE/DEFAULT 13. These are 5 32-bit values4 . It’s also possible to click “Follow in Dump” → “Memory address” and OllyDbg will show the element addressed by the JMP instruction. ECX is now 2. so the second element (counting from zero) of the table is to be used.2.CHAPTER 13.2.6 on page 688.11: OllyDbg: calculating destination address using jumptable Here I’ve clicked “Follow in Dump” → “Address constant”. 4 They are underlined by OllyDbg because these are also FIXUPs: 68. That’s 0x010B103A.

CODE XREF: main+10 var_18 = dword ptr -18h arg_0 = dword ptr 8 push mov sub cmp ja mov shl mov jmp ebp ebp. A LOT OF CASES After the jump we are at 0x010B103A: the code printing “two” will now be executed: Figure 13. "zero" call _puts jmp short locret_8048450 loc_804840C: .2. offset aZero . DATA XREF: .4.CHAPTER 13. "three" 157 . 2 eax.5: GCC 4. SWITCH()/CASE/DEFAULT 13.rodata:08048568 mov [esp+18h+var_18].rodata:off_804855C mov [esp+18h+var_18].12: OllyDbg: now we at the case: label Non-optimizing GCC Let’s see what GCC 4.4.1 generates: Listing 13. esp esp. "one" call _puts jmp short locret_8048450 loc_804841A: .rodata:08048564 mov [esp+18h+var_18]. offset aTwo . 4 short loc_8048444 eax. DATA XREF: . DATA XREF: . 18h [ebp+arg_0]. [ebp+arg_0] eax. ds:off_804855C[eax] eax loc_80483FE: . "two" call _puts jmp short locret_8048450 loc_8048428: .1 f public f proc near . offset aOne .rodata:08048560 mov [esp+18h+var_18]. DATA XREF: . offset aThree .

rodata:0804856C mov [esp+18h+var_18]. CODE XREF: f2+4 0000019C .2 ARM: Optimizing Keil 6/2013 (ARM mode) Listing 13. jumptable 00000178 case 1 158 . #5 .2. f2:loc_180 00000194 EC 00 8F E2 ADR R0. CODE XREF: f2+4 0000018C 06 00 00 EA B three_case . switch 5 cases PC. jumptable 00000178 case 4 00000194 00000194 zero_case . R0. PC. DATA XREF: .1 on page 206). f2:loc_184 0000019C EC 00 8F E2 ADR R0. stored in EAX.LSL#2 . aOne 000001A0 04 00 00 EA B loc_1B8 . offset aSomethingUnkno . CODE XREF: f2+4 00000190 07 00 00 EA B four_case . CODE XREF: f+A mov [esp+18h+var_18].2. SWITCH()/CASE/DEFAULT call jmp 13. CODE XREF: f2+4 00000184 04 00 00 EA B one_case . CODE XREF: f2+4 00000194 . jumptable 00000178 case 1 00000188 00000188 loc_188 . jumptable 00000178 case 2 0000018C 0000018C loc_18C .CHAPTER 13. offset aFour . f+34. jumptable 00000178 case 3 00000190 00000190 loc_190 . jumptable 00000178 case 0 00000184 00000184 loc_184 . jumptable 00000178 default case 00000180 00000180 loc_180 . leave retn f endp off_804855C dd dd dd dd dd offset offset offset offset offset loc_80483FE loc_804840C loc_804841A loc_8048428 loc_8048436 . aZero . "four" call _puts jmp short locret_8048450 loc_8048444: . with a little nuance: argument arg_0 is multiplied by 4 by shifting it to left by 2 bits (it is almost the same as multiplication by 4) ( 16. switch jump default_case .. DATA XREF: f+12 It is almost the same. CODE XREF: f+26 . "something unknown" call _puts locret_8048450: .2.6: Optimizing Keil 6/2013 (ARM mode) 00000174 f2 00000174 05 00 50 E3 00000178 00 F1 8F 30 0000017C 0E 00 00 EA CMP ADDCC B R0. jumptable 00000178 case 0 00000198 06 00 00 EA B loc_1B8 0000019C 0000019C one_case .. Then the address of the label is taken from the off_804855C array. and then ``JMP EAX'' does the actual jump. 13. CODE XREF: f2+4 00000180 03 00 00 EA B zero_case . CODE XREF: f2+4 00000188 05 00 00 EA B two_case . A LOT OF CASES _puts short locret_8048450 loc_8048436: .

3 ARM: Optimizing Keil 6/2013 (thumb mode) Listing 13. 0xC. If a = 1.2.2. thus jumping to one of the B (Branch) instructions located below. so that is why PC points there. The next ``ADDCC PC. R0 __ARM_common_switch8_thumb .LSL#2'' 5 instruction is being executed only if R0 < 5 (CC=Carry clear / Less than). jumptable 00000178 default case 000001C0 FC FF FF EA B loc_1B8 This code makes use of the ARM mode feature in which all instructions have a fixed size of 4 bytes. aFour . R0. Let’s keep in mind that the maximum value for a is 4 and any greater value will cause «something unknown\n» string to be printed. CODE XREF: f2+24 000001B8 . With every 1 added to a. CODE XREF: f2+4 000001BC . 8. to what was programmed in the switch(). switch 6 cases DCB 5 DCB 4. f2:loc_188 000001A4 01 0C 8F E2 ADR R0.CHAPTER 13. then is to be added to the value in PC. 0xA. The first ``CMP R0. aZero . in other words. But as we will see later ( 16. which is 8 bytes ahead of the point where the ADDCC instruction is. This is how the pipeline in ARM processors works: when ADDCC is executed. jump table for switch statement ALIGN 2 zero_case . SWITCH()/CASE/DEFAULT 13. 6. jumptable 00000178 case 4 000001B8 000001B8 loc_1B8 .1 on page 205) in section “Shifts”. f2+2C 000001B8 66 18 00 EA B __2printf 000001BC 000001BC default_case . jumptable 000000FA case 0 5 ADD—addition 159 . the length of each B instruction. A LOT OF CASES 000001A4 000001A4 two_case . shift left by 2 bits is equivalent to multiplying by 4. the resulting PC is increased by 4. aThree .2. 0x10 .7: Optimizing Keil 6/2013 (thumb mode) 000000F6 000000F6 000000F6 10 B5 000000F8 03 00 000000FA 06 F0 69 F8 000000FE 000000FF 00000105 00000106 00000106 00000106 05 04 06 08 0A 0C 10 00 8D A0 EXPORT f2 f2 PUSH MOVS BL {R4. etc. This has to be memorized. which is the address of the loc_184 label. Each of these five B instructions passes control further. 13. aTwo 000001A8 02 00 00 EA B loc_1B8 . LSL#2 at the instruction’s suffix stands for “shift left by 2 bits”. the following is to be happen: The value in R0 is multiplied by 4. of which there are 5 in row. In fact. f2+8 000001BC D4 00 8F E2 ADR R0. #5'' instruction compares the input value of a with 5. then P C + 8 + a ∗ 4 = P C + 8 + 1 ∗ 4 = P C + 12 = 0x184 will be written to PC. CODE XREF: f2+4 000001B4 . f2:loc_18C 000001AC 01 0C 8F E2 ADR R0. Consequently. and the actual value of the PC will be written into PC (which is 8 bytes ahead) and a jump to the label loc_180 will happen. if ADDCC does not trigger (it is a R0 ≥ 5 case). But if R0 < 5 and ADDCC triggers. a jump to default_case label will occur. the value in PC is 8 bytes ahead (0x180) than the address at which the ADDCC instruction is located (0x178). aSomethingUnkno . jumptable 00000178 case 3 000001B0 00 00 00 EA B loc_1B8 000001B4 000001B4 four_case . or. f2:loc_190 000001B4 01 0C 8F E2 ADR R0. CODE XREF: f2+4 000001A4 .LR} R3. At the moment of the execution of ADDCC. PC. CODE XREF: f2+4 ADR R0. If a = 0. 4 is the instruction length in ARM mode and also. 2 instructions ahead. the processor at the moment is beginning to process the instruction after the next one. Pointer loading of the corresponding string occurs there. CODE XREF: f2+4 000001AC . jumptable 00000178 case 2 000001AC 000001AC three_case . Then we add R0 ∗ 4 to the current value in PC.

CHAPTER 13. so the compiler not generates the same code for every switch() statement. CODE XREF: ⤦ Ç __ARM_common_switch8_thumb 000061D4 01 C0 5E E5 LDRB R12.8: Optimizing GCC 4. whose function is to switch the processor to ARM-mode. IDA successfully perceived it as a service function and a table. where the table starts. jumptable 000000FA default case B loc_118 000061D0 000061D0 000061D0 78 47 EXPORT __ARM_common_switch8_thumb __ARM_common_switch8_thumb . [LR. aOne . aTwo .#-1] 000061D8 0C 00 53 E1 CMP R3.4.PC} 0000011E 0000011E 0000011E 82 A0 00000120 FA E7 default_case . [LR. It can even be said that in these modes the instructions have variable lengths. CODE XREF: f2+4 ADR R0. CODE XREF: example6_f2+4 BX PC 000061D2 00 00 ALIGN 4 000061D2 . CODE XREF: f2+4 ADR R0. CODE XREF: f2+4 ADR R0. Then you see the function for table processing. End of function __ARM_common_switch8_thumb 000061D2 000061D4 __32__ARM_common_switch8_thumb . after calling of this function. A LOT OF CASES B loc_118 0000010A 0000010A 0000010A 8E A0 0000010C 04 E0 one_case .5 (IDA) 160 . It is interesting to note that the function uses the LR register as a pointer to the table. jumptable 000000FA case 3 B loc_118 00000116 00000116 00000116 91 A0 00000118 00000118 00000118 00000118 06 F0 6A F8 0000011C 10 BD four_case . LR contains the address after ``BL __ARM_common_switch8_thumb'' instruction.R12] 000061E0 03 30 DE 37 LDRCCB R3. aSomethingUnkno . It is also worth noting that the code is generated as a separate function in order to reuse it. jumptable 000000FA case 1 B loc_118 0000010E 0000010E 0000010E 8F A0 00000110 02 E0 two_case . CODE XREF: f2+4 ADR R0.R3] 000061E4 83 C0 8E E0 ADD R12. just like in x86. A special function is present here in order to deal with the table and pass control. named __ARM_common_switch8_thumb. aThree . It is too complex to describe it here now. [LR. CODE XREF: f2+4 ADR R0. So there is a special table added that contains information about how much cases are there (not including default-case). SWITCH()/CASE/DEFAULT 00000108 06 E0 13. and an offset for each with a label to which control must be passed in the corresponding case. and added comments to the labels like jumptable 000000FA case 0. jumptable 000000FA case 2 B loc_118 00000112 00000112 00000112 90 A0 00000114 00 E0 three_case . f2+16 BL __2printf POP {R4. Indeed.4 MIPS Listing 13. jumptable 000000FA case 4 loc_118 .LSL#1 000061E8 1C FF 2F E1 BX R12 000061E8 .2. R3.2. so I’m omitting it. LR. R12 000061DC 0C 30 DE 27 LDRCSB R3. It starts with ``BX PC'' . End of function __32__ARM_common_switch8_thumb One cannot be sure that all instructions in thumb and thumb-2 modes has the same size. aFour . 13. CODE XREF: f2+12 .

SWITCH()/CASE/DEFAULT 13. load address of jumptable . or $at. loc_24 la $gp. branch delay slot # DATA XREF: . or $at. branch delay slot. branch delay slot . 2 . sub_6C: . print "two" and finish lui $a0. (puts & 0xFFFF)($gp) or $at. off_120 .CHAPTER 13. branch delay slot # DATA XREF: . lw $t9. print "something unknown" and finish: lui $a0. LUI and ADDIU pair are there in fact: la $v0. load element from jumptable: lw $v0. NOP ($LC4 & 0xFFFF) # "four" . input value is greater or equal to 5. lw $t9. print "three" and finish lui $a0. may be placed in . branch delay slot .rodata section: 161 .2. LA is pseudoinstruction. sub_80: . $a0. ($LC5 & 0xFFFF) # "something unknown" . print "zero" and finish lui $a0. jr $t9 la $a0. NOP . . or $at. lw $t9. print "four" and finish lui $a0. $zero . lw $t9. $zero . jr $t9 la $a0. $a0 . branch delay slot # DATA XREF: . jr $t9 la $a0. NOP ($LC3 & 0xFFFF) # "three" .rodata:00000124 ($LC1 >> 16) # "one" (puts & 0xFFFF)($gp) $zero .rodata:0000012C ($LC3 >> 16) # "three" (puts & 0xFFFF)($gp) $zero . print "one" and finish lui $a0. jump to loc_24 if input value is lesser than 5: sltiu $v0. A LOT OF CASES f: lui $gp. 0($a0) or $at. (__gnu_local_gp & 0xFFFF) . sub_94: . lw $t9. jr $t9 la $a0. multiply input value by 4: sll $a0. NOP sub_44: . 5 bnez $v0.rodata:00000128 ($LC2 >> 16) # "two" (puts & 0xFFFF)($gp) $zero . (__gnu_local_gp >> 16) . branch delay slot loc_24: # CODE XREF: f+8 . NOP ($LC0 & 0xFFFF) # "zero" . jr $t9 la $a0. NOP ($LC1 & 0xFFFF) # "one" .rodata:00000130 ($LC4 >> 16) # "four" (puts & 0xFFFF)($gp) $zero . NOP ($LC2 & 0xFFFF) # "two" . sub_58: . or $at. or $at. $zero . jump to the address we got in jumptable: jr $v0 or $at. # DATA XREF: . ($LC5 >> 16) # "something unknown" lw $t9. NOP jr $t9 la $a0.rodata:off_120 ($LC0 >> 16) # "zero" (puts & 0xFFFF)($gp) $zero . $v0. branch delay slot # DATA XREF: . sum up multiplied value and jumptable address: addu $a0.

SLL (“Shift Word Left Logical”) does multiplication by 4. 2 . do something JMP exit case5: . exit: . jump_table dd dd dd dd dd case1 case2 case3 case4 case5 The jump to the address in the jump table may also be implemented using this instruction: JMP jump_table[REG*4]. WHEN THERE ARE SEVERAL CASE STATEMENTS IN ONE BLOCK sub_6C sub_80 sub_94 sub_44 sub_58 The new instruction for us is SLTIU (“Set on Less Than Immediate Unsigned”). BNEZ is “Branch if Not Equal to Zero”. 4 .word . jump_table[REG] JMP REG case1: . shift for 3 bits in x64. A jumptable is just array of pointers. 13.. a number has to be specified in the instruction itself. SWITCH()/CASE/DEFAULT off_120: .. MOV REG. find element in table. maximal number of cases JA default SHL REG. input CMP REG.word . do something JMP exit default: .9: x86 MOV REG. MIPS is a 32-bit CPU after all. do something JMP exit case4: .word 13. do something JMP exit case2: .3 When there are several case statements in one block Here is a very widespread construction: several case statements for a single block: #include <stdio. like the one described later: 18..word ..3. do something JMP exit case3: .2. 13.. Code is very close to the other ISAs.CHAPTER 13. i. This is the same as SLTU (“Set on Less Than Unsigned”).e.5 Conclusion Rough skeleton of switch(): Listing 13. but “I” stands for “immediate”..word .5 on page 275.h> void f(int a) { switch (a) { 162 . so all addresses in the jumptable are 32-bit ones. Or JMP jump_table[REG*8] in x64.

OFFSET $SG2806 . 7.1 MSVC Listing 13. It’s too wasteful to generate a block for each possible case. break. default: printf ("default\n"). 21 SHORT $LN1@f eax. 00H 'default'. OFFSET $SG2802 . 0aH. case case case case 8: 9: 20: 21: printf ("8. break. }.3. case case case case 3: 4: 5: 6: printf ("3. 00H '8. OFFSET $SG2800 . }. 7. break. break. 10' DWORD PTR __imp__printf DWORD PTR _a$[esp-4]. 5\n"). 9. 9. 21' DWORD PTR __imp__printf DWORD PTR _a$[esp-4]. 21'. BYTE PTR $LN10@f[eax] DWORD PTR $LN11@f[eax*4] DWORD PTR _a$[esp-4]. 'default' DWORD PTR __imp__printf 2 . 0aH. }. so what is usually done is to generate each block plus some kind of dispatcher. 4. 00H eax. 2. 00H '3. WHEN THERE ARE SEVERAL CASE STATEMENTS IN ONE BLOCK 1: 2: 7: 10: printf ("1.3. 7. '8. 4. 13. 10\n"). OFFSET $SG2798 . 4. 5'. 00H '22'. 0aH. '22' DWORD PTR __imp__printf DWORD PTR _a$[esp-4]. case 22: printf ("22\n"). 2. SWITCH()/CASE/DEFAULT case case case case 13.10: Optimizing MSVC 2010 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 $SG2798 $SG2800 $SG2802 $SG2804 $SG2806 DB DB DB DB DB _a$ = 8 _f PROC mov dec cmp ja movzx jmp $LN5@f: mov jmp $LN4@f: mov jmp $LN3@f: mov jmp $LN2@f: mov jmp $LN1@f: mov jmp npad '1. 9. 21\n"). OFFSET $SG2804 .CHAPTER 13. 2. 10'. DWORD PTR _a$[esp-4] eax eax. '3. break. 5' DWORD PTR __imp__printf DWORD PTR _a$[esp-4]. 0aH. int main() { f(4). align $LN11@f table on 16-byte boundary 163 . 0aH. '1.

. x0. . input value in W0 sub w0. print print print print print '1. . 5). SWITCH()/CASE/DEFAULT 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 13. GCC 4. .9. What is also worth noting is that there is no case for input value 0. 4 is for the default block. This is a very widespread pattern.1 on page 157). . Let’s recall that all ARM64 instructions have a size of 4 bytes. so GCC tries to make the jump table more compact and so it starts at 1 as an input value. a=1 a=2 a=3 a=4 a=5 a=6 a=7 a=8 a=9 a=10 a=11 a=12 a=13 a=14 a=15 a=16 a=17 a=18 a=19 a=20 a=21 a=22 We see two tables here: the first table ($LN10@f) is an index table. . using just one table of pointers. 3 is the fourth one (for value 22).3. . GCC is uses the fact that all offsets in my tiny example are in close proximity to each other. 9. the input value is used as an index in the index table (line 13). w0. . 2. and the table starts at a = 1. 4. . 10). It’s able to encode all offsets as 8-bit bytes.9. Here is a short legend for the values in the table: 0 is the first case block (for values 1. . Listing 13. 4. 2 is the third one (for values 8. . .LC4 b puts . 10' '3.2 GCC GCC does the job in the way we already discussed ( 13. . That’s why we see the DEC instruction at line 10. . So why is this economical? Why isn’t it possible to make it as before ( 13. and the second one ($LN11@f) is an array of pointers to blocks.9. . 9.11: Optimizing GCC 4.L9 . 13. . There we get an index for the second table of code pointers and we jump to it (line 14).3.1 There is no code to be triggered if the input value is 0.CHAPTER 13. just with one table consisting of block pointers? The reason is that the elements in index table are 8-bit. WHEN THERE ARE SEVERAL CASE STATEMENTS IN ONE BLOCK $LN11@f: DD DD DD DD DD $LN5@f $LN4@f $LN3@f $LN2@f $LN1@f DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB DB ENDP 0 0 1 1 1 1 0 2 2 0 4 4 4 4 4 4 4 4 4 2 2 3 .3. 13.2. .L2: . :lo12:. . .1 for ARM64 uses an even cleverer trick. First. .1 ARM64 f14: . . 7. . 1 is the second one (for values 3. 21' '22' 'default' $LN10@f: _f . . 5' '8.LC4 add x0. because there is no need to allocate a table element for a = 0. #1 cmp w0. print "default": adrp x0.3 ARM64: Optimizing GCC 4. 7. 21). branch if less or equal (unsigned): bls .2. So the jump table consisting of single bytes. 21 . 2. hence it’s all more compact. .L9: 164 . .1 on page 157). .

LC0 b puts .L4: .Lrtx4) / 4 . .text .string "1.LC0 add x0.L7: . WHEN THERE ARE SEVERAL CASE STATEMENTS IN ONE BLOCK . case 21 .byte (. print "22" adrp x0. case 10 .LC2: .Lrtx4) / 4 .L3 .L2 . [x1.Lrtx4) / 4 .Lrtx4) / 4 . 7.LC3 add x0. case 5 .byte (. load address of the Lrtx label: adr x1.LC0: .LC1: .L2 .. 9. w0. x1... 5" .byte (.byte (. case 18 .byte (. x1. case 8 ....Lrtx4) / 4 .byte (. load byte from the table: ldrb w0.L5 . 2..Lrtx4) / 4 . :lo12:.Lrtx4) / 4 .Lrtx4) / 4 .L6 . 21" .byte (.. this label is pointing in code (text) segment: .Lrtx4) / 4 . case 19 .byte (.. ..section .. case 9 .. x0.string "8.string "3.byte (.L2 . SWITCH()/CASE/DEFAULT 13. 7. jump to the calculated address: br x0 .byte (. x0.text" statement is allocated in the code (text) segment: . 4.. :lo12:.. 5" adrp x0.LC1 add x0.LC2 b puts .L6: .L6 .L2 . case 4 .L6 .LC1 b puts .LC3 b puts . .L3: .L6 .byte (.Lrtx4 .byte (. case 6 .byte (. .section" statement is allocated in the read-only data (rodata) segment: .Lrtx4) / 4 .L2 .. print "1.Lrtx4) / 4 . case 15 . case 11 .. case 22 . 21" adrp x0.. case 20 .Lrtx4) / 4 ..byte (. multiply table element by 4 (by shifting 2 bits left) and add (or subtract) to the address of Lrtx: add x0. x0.L2 . load jumptable address to X1: adrp x1.L3 .byte (.byte (.byte (. everything after ". print "3. case 16 .L7 .L2 . case 2 .byte (.byte (.. 4.LC3: .Lrtx4) / 4 ..byte (.Lrtx4) / 4 .L5: .Lrtx4) / 4 .byte (. .CHAPTER 13. everything after ". 10" . sxtb #2 . x0. 2.rodata .L3 .Lrtx4) / 4 . case 1 .L5 . case 3 .Lrtx4) / 4 . :lo12:.L5 .L3 . case 13 .Lrtx4) / 4 .. case 17 .L2 . 9.L4 add x1.LC2 add x0.w0. W0=input_value-1 . case 14 .. :lo12:.L5 .uxtw] .3. . :lo12:.byte (.Lrtx4) / 4 .Lrtx4) / 4 .Lrtx4) / 4 . case 7 . print "8.Lrtx4) / 4 . 10" adrp x0.L2 .Lrtx4: .string "22" 165 . case 12 .L4 .

0 is to be multiplied by 4. DCB 0xF7 . DCB 9 .12: jumptable in IDA .rodata:0000000000000076 . DCB 3 . READONLY case case case case case case case case case case case case case case case case case case case case case case 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 So in case of 1. 166 .rodata:000000000000006C . DCB 0xF7 . DCB 3 .rodata:000000000000006B . if type = 2 (W). they are used for jumping back to the code that prints the “default” string (at . DCB 0xF7 . write is to be set to 2.rodata:0000000000000064 .rodata:000000000000007B . DCB 0xF7 . In case of 22.rodata:0000000000000065 . Right after the Lrtx4 label is the L7 label. both read and write is to be set to 1.rodata:000000000000006D .rodata:0000000000000071 .rodata:0000000000000067 . . Here is the jump table: Listing 13. DCB 6 . DCB 3 . }.rodata:0000000000000072 . ORG 0x64 DCB 9 . it’s allocated in a separate . DCB 6 . case W: write=1. DCB 0xF7 . FALL-THROUGH .rodata:0000000000000078 .4. case R: read=1.rodata.rodata section (there is no special need to place it in the code section). 13.rodata:0000000000000064 $d . }.CHAPTER 13. switch (type) { case RW: read=1.rodata AREA .rodata:000000000000006A . write=%d\n".rodata:0000000000000077 .rodata:0000000000000069 .rodata:0000000000000070 . read is to be set to 1. break. where you can find the code that prints “22”.rodata:000000000000006F . DCB 0 . write).string "default" I just compiled my example to object file and opened it in IDA.rodata:0000000000000074 .rodata:0000000000000075 . DCB 9 . . There are also negative bytes (0xF7). ends DATA.rodata:0000000000000066 . read. resulting in 0. SWITCH()/CASE/DEFAULT 13. There is no jump table in the code segment. DCB 3 .4 Fall-through Another very popular usage of switch() is the fall-through. In case of type = 3 (RW).rodata:0000000000000064 . DCB 0xF7 .rodata:0000000000000079 . DCB 0xF7 . DCB 9 . break. DCB 6 . DCB 0xF7 . write=0.rodata:0000000000000073 . default: break. Here is a small example: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 #define R 1 #define W 2 #define RW 3 void f(int type) { int read=0.rodata:000000000000006E . DCB 6 .LC4: . 9 is to be multiplied by 4 and added to the address of Lrtx4 label.L2). printf ("read=%d. If type = 1 (R).rodata:0000000000000068 . DCB 0xF7 .

4.L3 167 . [x29.14: GCC (Linaro) 4. set "read" and "write" local variables to zero wzr. 0 w0. There is no “break” for “case RW”x and that’s OK. -48]! x29. write=%d'. size = 4 _f PROC push ebp mov ebp. 'read=%d. W je SHORT $LN3@f cmp DWORD PTR tv64[ebp]. DWORD PTR _type$[ebp] mov DWORD PTR tv64[ebp]. esp sub esp. size = 4 _read$ = -8 .44] . [x29.4.string "read=%d.4. eax cmp DWORD PTR tv64[ebp]. 12 mov esp. FALL-THROUGH The code at line 14 is executed in two cases: if type = RW or if type = W . sp. size = 4 _type$ = 8 . so no code setting read to 1 is executed. we land at $LN3@f. 1 $LN3@f: . 3 . 13.9 . [x29. 0 mov DWORD PTR _write$[ebp].28] .1 MSVC x86 Listing 13. load "type" argument w0.LC0: . 0 mov eax. type=W? . write=%d\n" f: stp add str str str ldr cmp beq x29. read is first set to 1. [x29. x30. RW je SHORT $LN4@f jmp SHORT $LN5@f $LN4@f: . ebp pop ebp ret 0 _f ENDP The code mostly resembles what is in the source. 2 . DWORD PTR _read$[ebp] push edx push OFFSET $SG1305 . This is why it’s called fall-through: code flow falls through one piece of code (setting read) to another (setting write). [sp.40] w0.2 ARM64 Listing 13. SWITCH()/CASE/DEFAULT 13. default mov ecx. 12 mov DWORD PTR _read$[ebp]. 1 $LN5@f: .28] wzr. There are no jumps between labels $LN4@f and $LN3@f: so when code flow is at $LN4@f. then write. size = 4 tv64 = -4 . write=%d' call _printf add esp. 13. DWORD PTR _write$[ebp] push ecx mov edx. 00H _write$ = -12 . 0aH. case RW: mov DWORD PTR _read$[ebp]. 1 . case R: mov DWORD PTR _read$[ebp].CHAPTER 13. If type = W . case W: mov DWORD PTR _write$[ebp]. 1 jmp SHORT $LN5@f $LN2@f: . 2 . R je SHORT $LN2@f cmp DWORD PTR tv64[ebp].13: MSVC 2012 $SG1305 DB 'read=%d.

L3: . 3 .L5 . 1 w0. [x29. load "write" printf x29. Try to achieve it. . write=%d\n" x0.L3.L6: .2 on page 152 in such way that the compiler can produce even smaller code. .5. type=R? . read=1 x0. x30. "read=%d. 13.L5: . write=1 . [sp].L6 w0. cmp beq cmp beq b case RW mov str case W mov str b case R mov str nop default adrp add ldr ldr bl ldp ret w0. [x29.. 48 Merely the same thing. :lo12:.LC0 . 1 . [x29. w0.LC0 w1. [x29..40] .44] . 1 w0.CHAPTER 13.40] .L4: . EXERCISES .L4 w0.L4 and .44] . 1 w0. 168 .1 Exercise #1 It’s possible to rework the C example in 13.5. . [x29. There are no jumps between labels .L6 13. x0. SWITCH()/CASE/DEFAULT . read=1 w0. .3 on page 954. otherwise.1.5 Exercises 13. but will work just the same. Hint: G. type=RW? .44] . load "read" w2.

LOOPS Chapter 14 Loops 14. esp push ecx mov DWORD PTR _i$[ebp].1. eax 169 . loop initialization jmp SHORT $LN3@main $LN2@main: mov eax.h> void printing_function(int i) { printf ("f(%d)\n". In C/C++ loops are usually constructed using for().1 x86 There is a special LOOP instruction in x86 instruction set for checking the value in register ECX and if it is not 0. DWORD PTR _i$[ebp] . 1 . to decrement ECX and pass control flow to the label in the LOOP operand. here is what we do after each iteration: add eax. Result (MSVC 2010): Listing 14.1 Simple example 14. loop condition (is the counter bigger than a limit?).CHAPTER 14. at each iteration) { loop_body. }. int main() { int i. Let’s start with a simple example: #include <stdio. return 0. what is done at each iteration (increment/decrement) and of course loop body.1: MSVC 2010 _i$ = -4 _main PROC push ebp mov ebp. i<10. So. }. 2 . Let’s start with for(). for (i=2. condition. if you see this instruction somewhere in code. as I never saw any modern compiler emit it automatically. i++) printing_function(i). i). add 1 to (i) value mov DWORD PTR _i$[ebp]. } The generated code is consisting of four parts as well. while() or do/while() statements. Probably this instruction is not very convenient. it is most likely that this is a manually written piece of assembly code. for (initialization. This statement defines loop initialization (set loop counter to initial value).

if i<=9. This is possible in such small functions where there aren’t many local variables.1 emits almost the same code. eax pop esi ret 0 _main ENDP What happens here is that space for the i variable is not allocated in the local stack anymore. 0FFFFFFF0h esp. One very important thing is that the f() function must not change the value in ESI. 10 .1 with maximal optimization turned on (-O3 option): 170 .CHAPTER 14. 9 short loc_8048465 eax. its value would have to be saved at the function’s prologue and restored at the function’s epilogue. DWORD PTR _i$[ebp] push ecx call _printing_function add esp. ESI. continue loop Now let’s see what we get with optimization turned on (/Ox): Listing 14. Our compiler is sure here.1. Let’s try GCC 4. 10 jge SHORT $LN1@main mov ecx. with one subtle difference: Listing 14. [esp+20h+var_4] [esp+20h+var_20]. 4 cmp esi. ebp pop ebp ret 0 _main ENDP . LOOPS 14.1 main proc near var_20 var_4 = dword ptr -20h = dword ptr -4 push mov and sub mov jmp ebp ebp. And if the compiler decides to use the ESI register in f() too. 2 short loc_8048476 mov mov call add eax.4. but uses an individual register for it. 2 $LL3@main: push esi call _printing_function inc esi add esp. GCC 4. nothing special. loop end As we see. 1 . lets finish loop' . 4 jmp SHORT $LN2@main $LN1@main: xor eax. SIMPLE EXAMPLE $LN3@main: cmp DWORD PTR _i$[ebp].4. jump to loop begin . eax mov esp. (i) initializing loc_8048465: loc_8048476: main . this condition is checked *before* each iteration .4.2: GCC 4. esp esp. 0 . 20h [esp+20h+var_4]. (i) increment cmp jle mov leave retn endp [esp+20h+var_4]. loop body: call printing_function(i) . if (i) is biggest or equals to 10. eax printing_function [esp+20h+var_4]. 0000000aH jl SHORT $LL3@main xor eax. almost like in our listing: please note PUSH ESI/POP ESI at the function start and end.3: Optimizing MSVC _main PROC push esi mov esi.

printing_function [esp+10h+var_10]. Big unrolled loops are not recommended in modern times. LOOPS 14. 1 . Another recommendations about loop unrolling from Intel are here : [Int14. printing_function [esp+10h+var_10].7]. ebx ebx. OK.4. GCC just unwound our loop. pass (i) as first argument to printing_function(): mov [esp+20h+var_20]. i++ call printing_function cmp ebx.1. if not. esp 0FFFFFFF0h 2 1Ch . 3. ebx add ebx. SIMPLE EXAMPLE Listing 14. esp. i==100? jnz short loc_80484D0 . aligning label loc_80484D0 (loop body begin) by 16-byte border: nop loc_80484D0: . printing_function [esp+10h+var_10]. 10h [esp+10h+var_10]. eax 2 3 4 5 6 7 8 9 Huh. printing_function [esp+10h+var_10].5: GCC main public main proc near var_20 = dword ptr -20h push mov and push mov sub ebp ebp. continue add esp. Loop unwinding has an advantage in the cases when there aren’t much iterations and we could cut some execution time by removing all loop support instructions. return 0 pop ebx mov esp. p.1. 64h . the resulting code is obviously larger. On the other side. GCC does: Listing 14. esp. printing_function [esp+10h+var_10]. 171 . because bigger functions may require bigger cache footprint 1 .4: Optimizing GCC 4. printing_function [esp+10h+var_10]. printing_function [esp+10h+var_10]. i=2 . printing_function eax. eax . ebp pop ebp retn main endp 1A very good article about it: [Dre07]. esp esp.1 main proc near var_10 = dword ptr -10h main push mov and sub mov call mov call mov call mov call mov call mov call mov call mov call xor leave retn endp ebp ebp. 1Ch xor eax. 0FFFFFFF0h esp.CHAPTER 14. let’s increase the maximum value of the i variable to 100 and try again.4.

SIMPLE EXAMPLE It is quite similar to what MSVC 2010 with optimization (/Ox) produce.CHAPTER 14. GCC is sure this register will not be modified inside of the f() function.1. just like here in the main() function. it will be saved at the function prologue and restored at epilogue. 172 . and if it will. with the exception that the EBX register is allocated for the i variable. LOOPS 14.

That’s one of the reasons I wrote a tracer for myself.1.log This is what we see: PID=12884|New process loops_2. loop end 14.1: OllyDbg: main() begin By tracing (F8 (step over)) we see ESI incrementing. ESI = i = 6: Figure 14. It seems that OllyDbg is able to detect simple loops and show them in square brackets. for instance.exe -l:loops_2. it is not very convenient to trace manulally in the debugger. find the address of the instruction PUSH ESI (passing the sole argument to f()) which is 0x401026 for me and we run the tracer: tracer. We open compiled example in IDA.2: OllyDbg: loop body just executed with i = 6 9 is the last loop value. SIMPLE EXAMPLE x86: OllyDbg Let’s compile our example in MSVC 2010 with /Ox and /Ob0 options and load it into OllyDbg. and the function will finish: Figure 14. That’s why JL is not triggering after the increment. In the tracer.CHAPTER 14. Here. for convenience: Figure 14.exe bpx=loops_2. LOOPS 14.3 x86: tracer As we might see.1.2 14.3: OllyDbg: ESI = 10.1.exe!0x00401026 BPX just sets a breakpoint at the address and tracer will then print the state of the registers.exe 173 .

exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000005 EDI=0x00333378 EIP=0x00331026 FLAGS=CF AF SF IF (0) loops_2.exe 14.exe!0x00401020. we get the loops_2.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000007 EDI=0x00333378 EIP=0x00331026 FLAGS=CF AF SF IF (0) loops_2. Then.exe!0x401026 EAX=0x00a328c8 EBX=0x00000000 ESI=0x00000002 EDI=0x00333378 EIP=0x00331026 FLAGS=PF ZF IF (0) loops_2. the tracer can collect register values for all addresses within the function.CHAPTER 14. 174 . This is called trace there.trace:cc BPF stands for set breakpoint on function. LOOPS (0) loops_2.exe -l:loops_2. ExitCode=0 (0x0) We see how the value of ESI register changes from 2 to 9.exe bpf=loops_2.exe.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000004 EDI=0x00333378 EIP=0x00331026 FLAGS=CF PF AF SF IF (0) loops_2.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000003 EDI=0x00333378 EIP=0x00331026 FLAGS=CF PF AF SF IF (0) loops_2.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000008 EDI=0x00333378 EIP=0x00331026 FLAGS=CF AF SF IF (0) loops_2. all interesting register values are recorded.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000006 EDI=0x00333378 EIP=0x00331026 FLAGS=CF PF AF SF IF (0) loops_2.exe!0x401026 EAX=0x00000005 EBX=0x00000000 ESI=0x00000009 EDI=0x00333378 EIP=0x00331026 FLAGS=CF PF AF SF IF PID=12884|Process loops_2. that adds comments.1.exe_clear. Every instruction gets traced. an IDA.idc-script is generated. Even more than that. So. in the IDA we’ve learned that the main() function address is 0x00401020 and we run: tracer. SIMPLE EXAMPLE ECX=0x6f0f4714 EDX=0x00000000 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 ECX=0x6f0a5617 EDX=0x000ee188 EBP=0x0024fbfc ESP=0x0024fbb8 exited. As a result.idc scripts.idc and loops_2.

text+0x38).idc-script loaded We see that ESI can be from 2 to 9 at the start of the loop body. that contains information about how many times each instruction was executed and register values: Listing 14. e= 8 [PUSH ESI] ESI=2. but from 3 to 0xA (10) after the increment.4 ARM Non-optimizing Keil 6/2013 (ARM mode) main STMFD MOV B loc_35C . SIMPLE EXAMPLE We load loops_2. e= 8 [CMP ESI.text+0x35). tracer also generates loops_2.text+0x37). e= 8 [CALL 8D1000h] tracing nested maximum level (1) reached.exe.true OF=false 0x401035 (.1. CODE MOV BL ADD SP!. We can also see that main() is finishing with 0 in EAX. e= 1 [RETN] EAX=0 We can use grep here..6: loops_2. LOOPS 14.0xa 0x401033 (..text+0x26). #0 . 2] 0x401026 (. 4] ESP=0x38fcbc 0x401030 (. e= 1 [POP ESI] 0x401038 (.text+0x30).4: IDA with .1. R4.idc into IDA and see: Figure 14.exe. e= 8 [JL 8D1026h] SF=false.9 0x401027 (. CODE CMP BLT MOV 175 .txt. 14. {R4. e= 8 [ADD ESP.text+0x20). R4 printing_function R4..text+0x33).text+0x21). #1 loc_368 XREF: main+8 R4. e= 1 [XOR EAX.LR} R4.text+0x2d).text+0x2c). 0Ah] ESI=3. e= 1 [PUSH ESI] ESI=1 0x401021 (. #2 loc_368 XREF: main+1C R0.text+0x27). e= 1 [MOV ESI. EAX] 0x401037 (. ⤦ Ç skipping this CALL 8D1000h=0x8d1000 0x40102c (.txt 0x401020 (.9 0x40102d (. e= 8 [INC ESI] ESI=2.CHAPTER 14.exe. #0xA loc_35C R0.

176 . #2 R4. #0x1124 . this was in my f() function: void printing_function(int i) { printf ("%d\n". #0xA loc_132 R0. the first instruction preparing the argument for f() function and the second calling the function. #0 {R4. The ``ADD R4.PC} In fact. R4. PC R0. #1 R4. R4 R1. R4 R1.R7. #9 _printf R0. #6 _printf R0. #2'' instruction just initializes i.3 (LLVM) (thumb-2 mode) _main PUSH MOVW MOVS MOVT. SIMPLE EXAMPLE SP!.W ADD ADD MOV BLX MOV MOVS BLX MOV MOVS BLX MOV MOVS BLX MOV MOVS BLX MOV MOVS BLX MOV MOVS BLX MOV MOVS BLX MOVS POP {R4.1.CHAPTER 14. #0 {R4. #7 _printf R0. 0 is to be written into R0 (since our function returns 0) and function execution finishes. R4'' and ``BL printing_function'' instructions compose the body of the loop. #1'' instruction just adds 1 to the i variable at each iteration. Optimizing Keil 6/2013 (thumb mode) _main PUSH MOVS {R4.PC} loc_132 Practically the same. R4 R1. LOOPS LDMFD 14. R4 R1. #8 _printf R0. #0 R7. R4 printing_function R4. Optimizing Xcode 4. #4 _printf R0. #5 _printf R0. #2 MOVS BL ADDS CMP BLT MOVS POP .PC} Iteration counter i is to be stored in the R4 register. R4 R1. #4 R4.LR} R4. #0xA'' compares i with 0xA (10). R4 R1. The ``MOV R0. R4 R1. CODE XREF: _main+E R0. #3 _printf R0. Otherwise. {R4. The next instruction BLT (Branch Less Than) jumps if i is less than 10.LR} R4. R4 _printf R0. "%d\n" R1. R4. SP.6. The ``MOV R4. i).R7. ``CMP R4. }.

8: Non-optimizing GCC 4.CHAPTER 14. but also inlined my very simple function f(). . jump to the loop body begin: bne .L3: . 2 . bl printing_function . x30. save contents of X19 register in the local stack: str x19. set up stack frame: add x29. LLVM not just unrolled the loop. just branch here instead of branch with link and return: b printf main: . 0 .16] . 1 . save contents of X19 register in the local stack: str x19. -32]! . set up stack frame: add x29. This is possible when the function is so primitive (like mine) and when it is not called too much (like here).LC0 add x0. increment counter register. and inserted its body 8 times instead of calling it.1 Listing 14. we will use W19 register as counter.9.LC0 add x0. . LOOPS 14. [sp]. w0 . -32]! . [sp. no. [sp.16] . sp. save FP and LR in the local stack: stp x29.string "f(%d)\n" ARM64: Non-optimizing GCC 4. :lo12:. x0. is it end? cmp w19. SIMPLE EXAMPLE So.1 Listing 14. [sp. 2 177 .L3 . ARM64: Optimizing GCC 4. :lo12:. prepare second argument of printf(): mov w1. x0. just branch here instead of branch with link and return: b printf main: . we will use W19 register as counter. W0 here still holds value of counter value before increment. restore contents of X19 register: ldr x19.1 printing_function: . prepare first argument of printing_function(): mov w0.1. set initial value of 2 to it: mov w19.9.16] . return 0 mov w0. 0 .LC0: . x30. sp. 10 . restore FP and LR values: ldp x29.7: Optimizing GCC 4. 32 ret . [sp. load address of the "f(%d)\n" string adrp x0. . set initial value of 2 to it: mov w19.9. add w19. load address of the "f(%d)\n" string adrp x0. prepare second argument of printf(): mov w1. save FP and LR in the local stack: stp x29. 0 .LC0 .LC0 . [sp.1 -fno-inline printing_function: . w19 . . x30.9. w19. w0 .

initialize counter at 2 and store this value in local stack li $v0. 0x28+i($fp) or $at. bl printing_function . NOP slti $v0. $zero .16] . jump to loc_80 (loop body begin): bnez $v0. add w19. $zero . LOOPS 14. loc_9C" there in fact: b loc_9C or $at. 0x28+i($fp) jal printing_function or $at. load counter. $ZERO. restore contents of X19 register: ldr x19. jump to the loop body begin: bne . $sp .9: Non-optimizing GCC 4. store it back: lw $v0. $zero . return 0 mov w0. $zero . NOP # --------------------------------------------------------------------------loc_80: # CODE XREF: main+48 .L3 . SIMPLE EXAMPLE . load counter value from local stack and call printing_function(): lw $a0. $zero . is it 10? lw $v0. 0x28+saved_FP($sp) move $fp. branch delay slot.1. check counter. 0x28+i($fp) loc_9C: # CODE XREF: main+18 . w19 . x30.1. branch delay slot. function epilogue: move $sp.CHAPTER 14. pseudoinstruction. increment counter register.L3: . 0 . finishing. IDA is not aware of variable names in local stack . is it end? cmp w19. 0x28+saved_RA($sp) sw $fp. I gave them names manually: i = -0x10 saved_FP = -8 saved_RA = -4 . 2 sw $v0. [sp. branch delay slot.LC0: . 0x28+i($fp) or $at. W0 here still holds value of counter value before increment. no. 0x28+i($fp) . NOP addiu $v0.5 MIPS Listing 14. increment it.string "f(%d)\n" 14. restore FP and LR values: ldp x29. if it is less than 10. return 0: move $v0. w19.4. $zero . function prologue: addiu $sp. 1 sw $v0. 32 ret . $fp 178 . 1 .5 (IDA) main: . 0xA . prepare first argument of printing_function(): mov w0. loc_80 or $at. 10 . -0x28 sw $ra. [sp]. NOP . "BEQ $ZERO. NOP .

0x28+saved_RA($sp) 0x28+saved_FP($sp) 0x28 $zero . i<cnt. the body of the loop must not be executed.L5: ret Listing 14. branch delay slot. X1 = source address 2 Single instruction. BYTE PTR [rsi+rax] . 14. However. Xcode (LLVM).1.9 ARM64 optimized for size (-Os) my_memcpy: . eax . 14. if it sure that the situation described here is not possible (like in the case of our very simple example and Keil.L2 .h> void my_memcpy (unsigned char* dst. RDX = size of block .1 Straight-forward implementation Listing 14.L2: . It is actually the pseudoinstruction (BEQ).11: GCC 4. If total_entries_to_process is 0. cl inc rax . $fp. rdx je .CHAPTER 14.10: GCC 4. X0 = destination address . an optimizing compiler may swap the condition check and loop body. for (i=0. $ra $at.6 One more thing In the generated code we can see: after initializing i . RDI = destination address . }. the body of the loop is not to be executed. initialize counter (i) at 0 xor eax. all bytes copied? exit then: cmp rax. i++ jmp . load byte at RSI+i: mov cl.2. size_t cnt) { size_t i. RSI = source address .2 Memory blocks copying routine Real-world memory copy routines may copy 4 or 8 bytes at each iteration. i++) loop_body. store byte at RDI+i: mov BYTE PTR [rdi+rax]. etc. NOP The instruction that’s new to us is “B”. Because.L5 . And that is correct. if the loop condition is not met at the beginning. and only after that loop body can be executed. #include <stdio. This is why the condition checked before the execution. unsigned char* src. use SIMD2 .9 x64 optimized for size (-Os) my_memcpy: . MSVC in optimization mode). But for the sake of simplicity. i++) dst[i]=src[i]. vectorization. this example is the simplest possible. as the condition for i is checked first.2. $sp. LOOPS 14. MEMORY BLOCKS COPYING ROUTINE lw lw addiu jr or $ra. This is possible in the following case: for (i=0. 14. the body of the loo must not be executed at all. i<total_entries_to_process. multiple data 179 .

x3] . 0 . i++ ADDS r3. store byte at R1+i: STRB r4. store byte at X1+i: strb w4. i<size? CMP r3. R2 = size of block .r2 .lr} . store byte at R1+i: STRBCC r12.L2 . .#1 |L0. initialize counter (i) at 0 mov x3..2. i++ ADDCC r3.L5: ret Listing 14.#1 .e. X2 = size of block .L5 . load byte at R1+i: LDRB r4.6| . all bytes copied? exit then: cmp x3.13: Optimizing Keil 6/2013 (ARM mode) my_memcpy PROC . condition checked at the end of function. so jump there: B |L0. jump to the loop begin if its so:' BCC |L0.L2: . i. 1 .4| . 180 .r3. [x1. i++ b . load byte at X1+i: ldrb w4.r3] . .r3] . x3. R1 = source address . [x0. x2 beq .[r1. R1 = source address .[r1.r3.r3] . initialize counter (i) at 0 MOVS r3.[r0. R2 = size of block PUSH {r4. if R2<R3 or i<size.CHAPTER 14.x3] add x3. MEMORY BLOCKS COPYING ROUTINE . R0 = destination address .2.r2 .12| |L0. all bytes copied? CMP r3.[r0. R0 = destination address .pc} ENDP 14.12| .#0 |L0.6| POP {r4. the last instruction of the "conditional block".r3] .#0 .2 ARM in ARM mode Keil in ARM mode takes full advantage of conditional suffixes: Listing 14. the following block is executed only if "less than" condition. initialize counter (i) at 0 MOV r3.12: Optimizing Keil 6/2013 (thumb mode) my_memcpy PROC . LOOPS 14. load byte at R1+i: LDRBCC r12.

$v0. CONCLUSION . $a1. On the other hand.3. 0($a3) loc_14: # CODE XREF: my_memcpy . LBU loads a byte and clears all other bits (“Unsigned”).2 on page 419. NOP Here we have two new instructions: LBU (“Load Byte Unsigned”) and SB (“Store Byte”). 9 jle body 181 . 2 . load byte as unsigned at address in $t0 to $v1: lbu $v1. form address of byte in destination block (\$a3 = \$a0+\$v0 = dst+i): addu $a3. increment check: cmp [counter]. $zero . store byte at $a3 sb $v1. So when dealing with single bytes. $a0. branch delay slot loc_8: # CODE XREF: my_memcpy+1C . 14. 0($t0) .3 MIPS Listing 14. 1 . do nothing otherwise (i. do something here . we have to allocate whole 32-bit registers for them.2. form address of byte in source block: addu $t0. 14. increment counter (i): addiu $v0. loop body . return BX lr ENDP That’s why there is only one branch instruction instead of 2. SB just writes a byte from lowest 8 bits of register to memory. $v0 . $zero . use counter variable in local stack add [counter]. Just like in ARM.1. there are no byte-wide parts like in x86. it will always reside in \$v0: move $v0. $v0 . if i>=size) BCC |L0.14: GCC 4.4 Vectorization Optimizing GCC can do much more on this example: 25.e. all MIPS registers are 32-bit wide.4.2.4| . branch delay slot . check if counter (i) in $v0 is still less then 3rd function argument ("cnt" in $a2): sltu $v1. initialize counter (i) at 0 . LB (“Load Byte”) instruction sign-extends the loaded byte to a 32-bit value.3 Conclusion Rough skeleton of loop from 2 to 9 inclusive: Listing 14. LOOPS 14. finish if BNEZ wasnt triggered:' jr $ra or $at. $a2 . jump to loop body if counter sill less then "cnt": bnez $v1.. jump to loop begin if i<size . loc_8 . $t0 = $a1+$v0 = src+i . 1 . jump to loop check part: b loc_14 . branch delay slot.CHAPTER 14.15: x86 mov [counter]. initialization jmp check body: .5 optimized for size (-Os) (IDA) my_memcpy: . 14.

10 body: . 10 JGE exit . initialization body: . it’s a sign that this piece of code is hand-written: Listing 14. use counter in REG. 9 JLE body Some parts of the loop may be generated by compiler in different order: Listing 14. but do not modify it! INC EBX . loop body . so the body of the loop is to be executed at least once: Listing 14. increment CMP REG. REG check: CMP [counter]. do something here 182 . [counter] . do something here .16: x86 MOV [counter]. initialization JMP check body: .18: x86 MOV [counter]. CONCLUSION The increment operation may be represented as 3 instructions in non-optimized code: Listing 14. loop body . This is rare. do something here . 2 . initialization JMP label_check label_increment: ADD [counter]. 1 . increment check: CMP EBX. a whole register can be dedicated to the counter variable: Listing 14. 10 JL body Using the LOOP instruction. This is done when the compiler is sure that the condition is always true on the first iteration. increment label_check: CMP [counter]. loop body . use counter variable in local stack MOV REG. initialization JMP check body: . count from 10 to 1 MOV ECX.17: x86 MOV EBX. but the compiler may rearrange it in a way that the condition is checked after loop body.20: x86 . loop body . loop body .CHAPTER 14. compilers are not using it. 2 . do something here . do something here . 9 JLE body If the body of the loop is short. use counter variable in local stack JMP label_increment exit: Usually the condition is checked before loop body. When you see it. use counter in EBX. 2 .19: x86 MOV REG.3. increment INC REG MOV [counter]. but do not modify it! INC REG . LOOPS 14. 2 .

14. loop body . 8 esi.22: Optimizing MSVC 2010 $SG2795 DB PROC push push mov mov npad $LL3@main: push push call dec add test jg pop xor pop ret _main ENDP '%d'. #1 . eax esi 0 Listing 14. 0aH.21: ARM MOV R4. use counter in R4.CHAPTER 14.#0x64 r1. EXERCISES . esi SHORT $LL3@main edi eax.4. do something here . but do not modify it! ADD R4.1 on page 169). but do not modify it! LOOP body ARM. DWORD PTR __imp__printf esi.1 Exercise #1 Why isn’t the LOOP instruction used by modern compilers anymore? 14.#0 183 . increment check: CMP R4.lr} r4. '%d' edi esi esp. 00H _main esi edi edi. LOOPS 14.8| MOV ADR BL SUB CMP {r4.20].#1 r4.4. compile it in your favorite OS and compiler and modify (patch) the executable file so the loop range will be [6.4 Exercises 14.1.r4 r0.2 Exercise #2 Take a loop example from this section ( 14..4.|L0.40| __2printf r4.4.R4. use counter in ECX.3 Exercise #3 What does this code do? Listing 14. 2 .r4. #10 BLT body 14. align next label esi OFFSET $SG2795 . initialization B check body: .23: Non-optimizing Keil 6/2013 (ARM mode) main PROC PUSH MOV |L0. The R4 register is dedicated to counter variable in this example: Listing 14. 100 3 .

lr} r4. x30. :lo12:. EXERCISES MOVLE BGT POP ENDP r0.24: Non-optimizing Keil 6/2013 (thumb mode) main PROC PUSH MOVS |L0. [sp. LOOPS 14.40| Listing 14. x19.r4. $s0.0 |L0.0 |L0.pc} DCB "%d\n". 0 x20. $s1.L2: . $s0. x20 printf w19.5 (MIPS) (IDA) main: var_18 var_C var_8 var_4 = = = = -0x18 -0xC -8 -4 lui addiu la sw sw sw sw la li $gp.#0 {r4.25: Optimizing GCC 4. $sp.9 (ARM64) main: stp add stp adrp mov add x29.4| MOVS ADR BL SUBS CMP BGT MOVS POP ENDP {r4.LC0 mov mov bl subs bne ldp ldp ret w1. [sp]. x20. #1 .4.pc} DCW 0x0000 DCB "%d\n". (printf & 0xFFFF)($gp) loc_28: (__gnu_local_gp >> 16) -0x28 (__gnu_local_gp & 0xFFFF) 0x28+var_4($sp) 0x28+var_8($sp) 0x28+var_C($sp) 0x28+var_18($sp) $LC0 # "%d\n" 0x64 # 'd' 184 . x30. lw # CODE XREF: main+40 $t9.#0 |L0.|L0. x20. x20. $ra. -32]! sp. $gp. $s1.26: Optimizing GCC 4. [sp.#0x64 r1. [sp.#1 r4.24| Listing 14.8| {r4.16] x29. x29.#0 |L0.16] .string "%d\n" Listing 14.4.LC0 100 x20.24| __2printf r4. w19.r4 r0.L2 x19. 32 . w19 x0.4| r0.LC0: . w19. $gp.CHAPTER 14.

$s1. eax esi 0 Listing 14.|L0. $a0.40| DCB "%d\n". 100 SHORT $LL3@main edi eax.lr} r4.CHAPTER 14. $gp.r4 r0. EXERCISES move move jalr addiu lw bnez or lw lw lw jr addiu $LC0: $a1. 14.0 Listing 14. 00H _main esi edi edi.27: Optimizing MSVC 2010 $SG2795 DB PROC push push mov mov npad $LL3@main: push push call add add cmp jl pop xor pop ret _main ENDP '%d'. $s0 $s1 -1 0x28+var_18($sp) loc_28 $zero 0x28+var_4($sp) 0x28+var_8($sp) 0x28+var_C($sp) 0x28 . 1 3 .5 on page 954.#0x64 r0.28: Non-optimizing Keil 6/2013 (ARM mode) main PROC PUSH MOV |L0. $t9 $s0.8| MOV ADR BL ADD CMP MOVGE BLT POP ENDP {r4.pc} |L0.4| {r4.#1 185 .4.#1 r1. 8 esi.r4.40| __2printf r4.4 Exercise #4 What does this code do? Listing 14. $s0. $ra $sp.4. '%d' edi esi. $ra.lr} r4. DWORD PTR __imp__printf esi. LOOPS 14. 3 esp.#3 r4.#0 |L0.ascii "%d\n"<0> # DATA XREF: main+1C Answer: G.1.8| {r4. align next label esi OFFSET $SG2795 . $s0. $at.29: Non-optimizing Keil 6/2013 (thumb mode) main PROC PUSH MOVS |L0. 0aH.

CHAPTER 14. LOOPS

14.4. EXERCISES

MOVS
ADR
BL
ADDS
CMP
BLT
MOVS
POP
ENDP

r1,r4
r0,|L0.24|
__2printf
r4,r4,#3
r4,#0x64
|L0.4|
r0,#0
{r4,pc}

DCW

0x0000

DCB

"%d\n",0

|L0.24|

Listing 14.30: Optimizing GCC 4.9 (ARM64)
main:
stp
add
stp
adrp
mov
add

x29,
x29,
x19,
x20,
w19,
x20,

x30, [sp, -32]!
sp, 0
x20, [sp,16]
.LC0
1
x20, :lo12:.LC0

mov
mov
add
bl
cmp
bne
ldp
ldp
ret

w1, w19
x0, x20
w19, w19, 3
printf
w19, 100
.L2
x19, x20, [sp,16]
x29, x30, [sp], 32

.L2:

.LC0:
.string "%d\n"

Listing 14.31: Optimizing GCC 4.4.5 (MIPS) (IDA)
main:
var_18
var_10
var_C
var_8
var_4

=
=
=
=
=

-0x18
-0x10
-0xC
-8
-4

lui
addiu
la
sw
sw
sw
sw
sw
la
li
li

$gp,
$sp,
$gp,
$ra,
$s2,
$s1,
$s0,
$gp,
$s2,
$s0,
$s1,

lw
move
move
jalr
addiu
lw
bne
or
lw
lw
lw

$t9,
$a1,
$a0,
$t9
$s0,
$gp,
$s0,
$at,
$ra,
$s2,
$s1,

loc_30:

(__gnu_local_gp >> 16)
-0x28
(__gnu_local_gp & 0xFFFF)
0x28+var_4($sp)
0x28+var_8($sp)
0x28+var_C($sp)
0x28+var_10($sp)
0x28+var_18($sp)
$LC0
# "%d\n"
1
0x64 # 'd'
# CODE XREF: main+48
(printf & 0xFFFF)($gp)
$s0
$s2
3
0x28+var_18($sp)
$s1, loc_30
$zero
0x28+var_4($sp)
0x28+var_8($sp)
0x28+var_C($sp)
186

CHAPTER 14. LOOPS

14.4. EXERCISES

lw
jr
addiu
$LC0:

$s0, 0x28+var_10($sp)
$ra
$sp, 0x28

.ascii "%d\n"<0>

# DATA XREF: main+20

Answer: G.1.6 on page 955.

187

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

Chapter 15

Simple C-strings processing
15.1

strlen()

Let’s talk about loops one more time. Often, the strlen() function1 is implemented using a while() statement. Here is
how it is done in the MSVC standard libraries:
int my_strlen (const char * str)
{
const char *eos = str;
while( *eos++ ) ;
return( eos - str - 1 );
}
int main()
{
// test
return my_strlen("hello!");
};

15.1.1

x86

Non-optimizing MSVC
Let’s compile:
_eos$ = -4
; size = 4
_str$ = 8
; size = 4
_strlen PROC
push
ebp
mov
ebp, esp
push
ecx
mov
eax, DWORD PTR _str$[ebp] ; place pointer to string from "str"
mov
DWORD PTR _eos$[ebp], eax ; place it to local variable "eos"
$LN2@strlen_:
mov
ecx, DWORD PTR _eos$[ebp] ; ECX=eos
; take 8-bit byte from address in ECX and place it as 32-bit value to EDX with sign extension
movsx
edx, BYTE PTR [ecx]
mov
eax, DWORD PTR _eos$[ebp]
add
eax, 1
mov
DWORD PTR _eos$[ebp], eax
test
edx, edx
je
SHORT $LN1@strlen_
jmp
SHORT $LN2@strlen_
$LN1@strlen_:

;
;
;
;
;
;

EAX=eos
increment EAX
place EAX back to "eos"
EDX is zero?
yes, then finish loop
continue loop

; here we calculate the difference between two pointers
mov
1 counting

eax, DWORD PTR _eos$[ebp]
the characters in a string in the C language

188

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

sub
eax, DWORD PTR _str$[ebp]
sub
eax, 1
mov
esp, ebp
pop
ebp
ret
0
_strlen_ ENDP

15.1. STRLEN()

; subtract 1 and return result

We get two new instructions here: MOVSX and TEST.
The first one — MOVSX — takes a byte from an address in memory and stores the value in a 32-bit register. MOVSX stands
for MOV with Sign-Extend. MOVSX sets the rest of the bits, from the 8th to the 31th, to 1 if the source byte is negative or to 0
if is positive.
And here is why.
By default, the char type is signed in MSVC and GCC. If we have two values of which one is char and the other is int, (int
is signed too), and if the first value contain −2 (coded as 0xFE) and we just copy this byte into the int container, it makes
0x000000FE, and this from the point of signed int view is 254, but not −2. In signed int, −2 is coded as 0xFFFFFFFE. So if
we need to transfer 0xFE from a variable of char type to int, we need to identify its sign and extend it. That is what MOVSX
does.
You can also read about it in “Signed number representations” section ( 30 on page 451).
I’m not sure if the compiler needs to store a char variable in EDX, it could just take a 8-bit register part (for example DL).
Apparently, the compiler’s register allocator works like that.
Then we see TEST EDX, EDX. You can read more about the TEST instruction in the section about bit fields ( 19 on
page 305). Here this instruction just checks if the value in EDX equals to 0.
Non-optimizing GCC
Let’s try GCC 4.4.1:
strlen

public strlen
proc near

eos
arg_0

= dword ptr -4
= dword ptr 8
push
mov
sub
mov
mov

ebp
ebp, esp
esp, 10h
eax, [ebp+arg_0]
[ebp+eos], eax

mov
movzx
test
setnz
add
test
jnz
mov
mov
mov
sub
mov
sub
leave
retn
endp

eax, [ebp+eos]
eax, byte ptr [eax]
al, al
al
[ebp+eos], 1
al, al
short loc_80483F0
edx, [ebp+eos]
eax, [ebp+arg_0]
ecx, edx
ecx, eax
eax, ecx
eax, 1

loc_80483F0:

strlen

The result is almost the same as in MSVC, but here we see MOVZX instead of MOVSX. MOVZX stands for MOV with ZeroExtend. This instruction copies a 8-bit or 16-bit value into a 32-bit register and sets the rest of the bits to 0. In fact, this
instruction is convenient only because it enable us to replace this instruction pair: xor eax, eax / mov al, [...].
On the other hand, it is obvious that the compiler could produce this code: mov al, byte ptr [eax] / test
al, al —it is almost the same, however, the highest bits of the EAX register will contain random noise. But let’s think it
is compiler’s drawback —it cannot produce more understandable code. Strictly speaking, the compiler is not obliged to emit
understandable (to humans) code at all.
The next new instruction for us is SETNZ. Here, if AL doesn’t contain zero, test al, al sets the ZF flag to 0, but SETNZ,
if ZF==0 (NZ stands for not zero) sets AL to 1. Speaking in natural language, if AL is not zero, let’s jump to loc_80483F0. The
compiler emits some redundant code, but let’s not forget that the optimizations are turned off.

189

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

Optimizing MSVC
Now let’s compile all this in MSVC 2012, with optimizations turned on (/Ox):
Listing 15.1: Optimizing MSVC 2012 /Ob0
_str$ = 8
_strlen PROC
mov
mov
$LL2@strlen:
mov
inc
test
jne
sub
dec
ret
_strlen ENDP

; size = 4
edx, DWORD PTR _str$[esp-4] ; EDX -> pointer to the string
eax, edx
; move to EAX
cl, BYTE PTR [eax]
eax
cl, cl
SHORT $LL2@strlen
eax, edx
eax
0

;
;
;
;
;
;

CL = *EAX
EAX++
CL==0?
no, continue loop
calculate pointers difference
decrement EAX

Now it is all simpler. Needless to say, the compiler could use registers with such efficiency only in small functions with
a few local variables.
INC/DEC— are increment/decrement instructions, in other words: add or substract 1 to/from a variable.

190

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

Optimizing MSVC + OllyDbg
We can try this (optimized) example in OllyDbg. Here is the first iteration:

Figure 15.1: OllyDbg: first iteration start
We see that OllyDbg found a loop and, for convenience, wrapped its instructions in brackets. By clicking the right button
on EAX, we can choose “Follow in Dump” and the memory window scrolls to the right place. Here we can see the string
“hello!” in memory. There is at least one zero byte after it and then random garbage. If OllyDbg sees a register with a valid
address in it, that points to some string, it is shown as a string.

191

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

Let’s press F8 (step over) a few times, to get to the start of the body of the loop:

Figure 15.2: OllyDbg: second iteration start
We see that EAX contains the address of the second character in the string.

192

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

We have to press F8 enough number of times in order to escape from the loop:

Figure 15.3: OllyDbg: pointers difference to be calculated now
We see that EAX now contains the address of zero byte that’s right after the string. Meanwhile, EDX hasn’t changed, so
it still pointing to the start of the string. The difference between these two addresses is being calculated now.

193

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

The SUB instruction just got executed:

Figure 15.4: OllyDbg: EAX to be decremented now
The difference of pointers is in the EAX register now—7. Indeed, the length of the “hello!” string is 6, but with the zero
byte included—7. But strlen() must return the number of non-zero characters in the string. So the decrement executes
and then the function returns.
Optimizing GCC
Let’s check GCC 4.4.1 with optimizations turned on (-O3 key):
strlen

public strlen
proc near

arg_0

= dword ptr

8

push
mov
mov
mov

ebp
ebp, esp
ecx, [ebp+arg_0]
eax, ecx

movzx
add
test
jnz
not
add
pop
retn
endp

edx, byte ptr [eax]
eax, 1
dl, dl
short loc_8048418
ecx
eax, ecx
ebp

loc_8048418:

strlen

Here GCC is almost the same as MSVC, except for the presence of MOVZX.
However, here MOVZX could be replaced with mov dl, byte ptr [eax].
Probably it is simpler for GCC’s code generator to remember the whole 32-bit EDX register is allocated for a char variable
and it then can be sure that the highest bits has no any noise at any point.
After that we also see a new instruction — NOT. This instruction inverts all bits in the operand. You can say that it is
a synonym to the XOR ECX, 0ffffffffh instruction. NOT and the following ADD calculate the pointer difference and
subtract 1, just in a different way. At the start ECX, where the pointer to str is stored, gets inverted and 1 is subtracted from
it.
See also: “Signed number representations” ( 30 on page 451).
In other words, at the end of the function just after loop body, these operations are executed:
194

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

ecx=str;
eax=eos;
ecx=(-ecx)-1;
eax=eax+ecx
return eax

…and this is effectively equivalent to:
ecx=str;
eax=eos;
eax=eax-ecx;
eax=eax-1;
return eax

Why did GCC decide it would be better? I cannot be sure. But I’m sure the both variants are equivalent in efficiency.

15.1.2

ARM

32-bit ARM
Non-optimizing Xcode 4.6.3 (LLVM) (ARM mode)
Listing 15.2: Non-optimizing Xcode 4.6.3 (LLVM) (ARM mode)
_strlen
eos
str

= -8
= -4
SUB
STR
LDR
STR

SP,
R0,
R0,
R0,

SP, #8 ; allocate 8 bytes for local variables
[SP,#8+str]
[SP,#8+str]
[SP,#8+eos]

loc_2CB8 ; CODE XREF: _strlen+28
LDR
R0, [SP,#8+eos]
ADD
R1, R0, #1
STR
R1, [SP,#8+eos]
LDRSB R0, [R0]
CMP
R0, #0
BEQ
loc_2CD4
B
loc_2CB8
loc_2CD4 ; CODE XREF: _strlen+24
LDR
R0, [SP,#8+eos]
LDR
R1, [SP,#8+str]
SUB
R0, R0, R1 ; R0=eos-str
SUB
R0, R0, #1 ; R0=R0-1
ADD
SP, SP, #8 ; free allocated 8 bytes
BX
LR

Non-optimizing LLVM generates too much code, however, here we can see how the function works with local variables
in the stack. There are only two local variables in our function, eos and str.
In this listing, generated by IDA, I have manually renamed var_8 and var_4 to eos and str .
The first instructions just saves the input values into both str and eos.
The body of the loop starts at label loc_2CB8.
The first three instruction in the loop body (LDR, ADD, STR) load the value of eos into R0, then the value is incremented
and saved back into eos, which is located in the stack.
The next instruction, ``LDRSB R0, [R0]'' (Load Register Signed Byte) , loads a byte from memory at the address
stored in R0 and sign-extends it to 32-bit 2 . This is similar to the MOVSX instruction in x86. The compiler treats this byte
as signed since the char type is signed according to the C standard. I already wrote about it ( 15.1.1 on page 189) in this
section, in relation to x86.
It has to be noted that it is impossible to use 8- or 16-bit part of a 32-bit register in ARM separately of the whole register,
as it is in x86. Apparently, it is because x86 has a huge history of backwards compatibility with its ancestors up to the
16-bit 8086 and even 8-bit 8080, but ARM was developed from scratch as a 32-bit RISC-processor. Consequently, in order
to process separate bytes in ARM, one has to use 32-bit registers anyway.
2 The

Keil compiler treats the char type as signed, just like MSVC and GCC.

195

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

So, LDRSB loads bytes from the string into R0, one by one. The following CMP and BEQ instructions check if the loaded
byte is 0. If it’s not 0, control passes to the start of the body of the loop. And if it’s 0, the loop ends.
At the end of the function, the difference between eos and str is calculated, 1 is subtracted from it, and resulting value is
returned via R0.
N.B. Registers were not saved in this function. That’s because in the ARM calling convention registers R0-R3 are “scratch
registers”, intended for arguments passing, and we’re not required to restore their value when the function exits, since the
calling function will not use them anymore. Consequently, they may be used for anything we want. No other registers are
used here, so that is why we have nothing to save on the stack. Thus, control may be returned back to calling function by a
simple jump (BX), to the address in the LR register.
Optimizing Xcode 4.6.3 (LLVM) (thumb mode)
Listing 15.3: Optimizing Xcode 4.6.3 (LLVM) (thumb mode)
_strlen
MOV

R1, R0

LDRB.W
CMP
BNE
MVNS
ADD
BX

R2, [R1],#1
R2, #0
loc_2DF6
R0, R0
R0, R1
LR

loc_2DF6

As optimizing LLVM concludes, eos and str do not need space on the stack, and can always be stored in registers. Before
the start of the loop body, str is always in R0, and eos—in R1.
The ``LDRB.W R2, [R1],#1'' instruction loads a byte from the memory at the address stored in R1, to R2, signextending it to a 32-bit value, but not just that. #1 at the instruction’s end is implies “Post-indexed addressing”, which
means that 1 is to be added to R1 after the byte is loaded.
Read more about it: 28.2 on page 445.
Then you can see CMP and BNE3 in the body of the loop, these instructions continue looping until 0 is found in the string.
MVNS4 (inverts all bits, like NOT in x86) and ADD instructions compute eos − str − 1. In fact, these two instructions
compute R0 = str +eos, which is effectively equivalent to what was in the source code, and why it is so, I already explained
here ( 15.1.1 on page 194).
Apparently, LLVM, just like GCC, concludes that this code can be shorter (or faster).
Optimizing Keil 6/2013 (ARM mode)
Listing 15.4: Optimizing Keil 6/2013 (ARM mode)
_strlen
MOV

R1, R0

LDRB
CMP
SUBEQ
SUBEQ
BNE
BX

R2, [R1],#1
R2, #0
R0, R1, R0
R0, R0, #1
loc_2C8
LR

loc_2C8

Almost the same as what we saw before, with the exception that the str − eos − 1 expression can be computed not at
the function’s end, but right in the body of the loop. The -EQ suffix, as we may recall, implies that the instruction executes
only if the operands in the CMP that was executed before were equal to each other. Thus, if R0 contains 0, both SUBEQ
instructions executes and result is left in the R0 register.
ARM64
Optimizing GCC (Linaro) 4.9
3 (PowerPC,
4 MoVe

ARM) Branch if Not Equal

Not

196

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.1. STRLEN()

my_strlen:
mov
x1, x0
; X1 is now temporary pointer (eos), acting like cursor
.L58:
; load byte from X1 to W2, increment X1 (post-index)
ldrb
w2, [x1],1
; Compare and Branch if NonZero: compare W2 with 0, jump to .L58 if it is not
cbnz
w2, .L58
; calculate difference between initial pointer in X0 and current address in X1
sub
x0, x1, x0
; decrement lowest 32-bit of result
sub
w0, w0, #1
ret

The algorithm is the same as in 15.1.1 on page 190: find a zero byte, calculate the difference between the pointers and
decrement the result by 1. I have added some comments. The only thing worth noting is that my example is somewhat wrong:
my_strlen() returns 32-bit int, while it has to return size_t or another 64-bit type. The reason is that, theoretically,
strlen() can be called for a huge blocks in memory that exceeds 4GB, so it must able to return a 64-bit value on 64-bit
platforms. Because of my mistake, the last SUB instruction operates on a 32-bit part of register, while the penultimate SUB
instruction works on full the 64-bit register (it calculates the difference between the pointers). It’s my mistake, but I have
decided to leave it as is, as an example of how the code could look like in such case.
Non-optimizing GCC (Linaro) 4.9
my_strlen:
; function prologue
sub
sp, sp, #32
; first argument (str) will be stored in [sp,8]
str
x0, [sp,8]
ldr
x0, [sp,8]
; copy "str" to "eos" variable
str
x0, [sp,24]
nop
.L62:
; eos++
ldr
x0, [sp,24] ; load "eos" to X0
add
x1, x0, 1
; increment X0
str
x1, [sp,24] ; save X0 to "eos"
; load byte from memory at address in X0 to W0
ldrb
w0, [x0]
; is it zero? (WZR is the 32-bit register always contain zero)
cmp
w0, wzr
; jump if not zero (Branch Not Equal)
bne
.L62
; zero byte found. now calculate difference.
; load "eos" to X1
ldr
x1, [sp,24]
; load "str" to X0
ldr
x0, [sp,8]
; calculate difference
sub
x0, x1, x0
; decrement result
sub
w0, w0, #1
; function epilogue
add
sp, sp, 32
ret

It’s more verbose. The variables are often tossed here to and from memory (local stack). The same mistake here: the
decrement operation happens on a 32-bit register part.

15.1.3

MIPS
Listing 15.5: Optimizing GCC 4.4.5 (IDA)

my_strlen:
; "eos" variable will always reside in $v1:
move
$v1, $a0
197

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.2. EXERCISES

loc_4:
; load byte at address in "eos" into $a1:
lb
$a1, 0($v1)
or
$at, $zero ; load delay slot, NOP
; if loaded byte is not zero, jump to loc_4:
bnez
$a1, loc_4
; increment "eos" anyway:
addiu
$v1, 1 ; branch delay slot
; loop finished. invert "str" variable:
nor
$v0, $zero, $a0
; $v0=-str-1
jr
$ra
; return value = $v1 + $v0 = eos + ( -str-1 ) = eos - str - 1
addu
$v0, $v1, $v0 ; branch delay slot

MIPS lacks a “NOT” instruction, but has “NOR” which is OR + NOT operation. This operation is widely used in digital
electronics5 , but isn’t very popular in computer programming. So, the NOT operation is implemented here as “NOR DST,
$ZERO, SRC”.
From fundamentals 30 on page 451 we know that bitwise inverting a signed number is the same as changing its sign
and subtracting 1 from the result. So what NOT does here is to take the value of str and transform it into −str − 1. The
addition operation that follows prepares result.

15.2

Exercises

15.2.1

Exercise #1

What does this code do?
Listing 15.6: Optimizing MSVC 2010
_s$ = 8
_f
PROC
mov
mov
xor
test
je
npad
$LL4@f:
cmp
jne
inc
$LN3@f:
mov
inc
test
jne
$LN2@f:
ret
_f
ENDP

edx, DWORD PTR _s$[esp-4]
cl, BYTE PTR [edx]
eax, eax
cl, cl
SHORT $LN2@f
4 ; align next label
cl, 32
SHORT $LN3@f
eax
cl, BYTE PTR [edx+1]
edx
cl, cl
SHORT $LL4@f
0

Listing 15.7: GCC 4.8.1 -O3
f:
.LFB24:
push
mov
xor
movzx
test
je

ebx
ecx, DWORD PTR [esp+8]
eax, eax
edx, BYTE PTR [ecx]
dl, dl
.L2

cmp
lea
cmove

dl, 32
ebx, [eax+1]
eax, ebx

.L3:

5 NOR is called “universal gate”. For example, the Apollo Guidance Computer used in the Apollo program, was built by only using 5600 NOR gates:
[Eic11].

198

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

15.2. EXERCISES

add
movzx
test
jne

ecx, 1
edx, BYTE PTR [ecx]
dl, dl
.L3

pop
ret

ebx

.L2:

Listing 15.8: Optimizing Keil 6/2013 (ARM mode)
f PROC
MOV

r1,#0

LDRB
CMP
MOVEQ
BXEQ
CMP
ADDEQ
ADD
B
ENDP

r2,[r0,#0]
r2,#0
r0,r1
lr
r2,#0x20
r1,r1,#1
r0,r0,#1
|L0.4|

|L0.4|

Listing 15.9: Optimizing Keil 6/2013 (thumb mode)
f PROC
MOVS
B

r1,#0
|L0.12|

CMP
BNE
ADDS

r2,#0x20
|L0.10|
r1,r1,#1

ADDS

r0,r0,#1

LDRB
CMP
BNE
MOVS
BX
ENDP

r2,[r0,#0]
r2,#0
|L0.4|
r0,r1
lr

|L0.4|

|L0.10|
|L0.12|

Listing 15.10: Optimizing GCC 4.9 (ARM64)
f:
ldrb
cbz
mov

w1, [x0]
w1, .L4
w2, 0

cmp
ldrb
csinc
cbnz

w1,
w1,
w2,
w1,

mov
ret

w0, w2

mov
b

w2, w1
.L2

.L3:
32
[x0,1]!
w2, w2, ne
.L3

.L2:

.L4:

Listing 15.11: Optimizing GCC 4.4.5 (MIPS) (IDA)
f:
lb
or
beqz
li
b
move
loc_18:

$v1, 0($a0)
$at, $zero
$v1, locret_48
$a1, 0x20 # ' '
loc_28
$v0, $zero
# CODE XREF: f:loc_28
199

CHAPTER 15. SIMPLE C-STRINGS PROCESSING

lb
or
beqz
or

$v1,
$at,
$v1,
$at,

15.2. EXERCISES

0($a0)
$zero
locret_40
$zero

loc_28:

# CODE XREF: f+10
# f+38
bne
addiu
lb
or
bnez
addiu

$v1,
$a0,
$v1,
$at,
$v1,
$v0,

$a1, loc_18
1
0($a0)
$zero
loc_28
1

jr
or

$ra
$at, $zero

jr
move

$ra
$v0, $zero

locret_40:

# CODE XREF: f+20

locret_48:

# CODE XREF: f+8

Answer G.1.7 on page 955.

200

CHAPTER 16. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES

Chapter 16

Replacing arithmetic instructions to other ones
In the pursuit of optimization, one instruction may be replaced by others, or even with a group of instructions.
For example, the LEA instruction is often used for simple arithmetic calculations: A.6.2 on page 933.
ADD and SUB can replace each other. For example, line 18 in listing.52.1.

16.1

Multiplication

16.1.1

Multiplication using addition

Here is a simple example:
Listing 16.1: Optimizing MSVC 2010
unsigned int f(unsigned int a)
{
return a*8;
};

Multiplication by 8 is replaced by 3 addition instructions, which do the same. Apparently, MSVC’s optimizer decided that
this code can be faster.
_TEXT
SEGMENT
_a$ = 8
_f
PROC
; File c:\polygon\c\2.c
mov
eax, DWORD PTR _a$[esp-4]
add
eax, eax
add
eax, eax
add
eax, eax
ret
0
_f
ENDP
_TEXT
ENDS
END

16.1.2

; size = 4

Multiplication using shifting

Multiplication and division instructions by a numbers that’s a power of 2 are often replaced by shift instructions.
unsigned int f(unsigned int a)
{
return a*4;
};

Listing 16.2: Non-optimizing MSVC 2010
_a$ = 8
_f
PROC
push
mov
mov
shl
pop
ret
_f
ENDP

; size = 4
ebp
ebp, esp
eax, DWORD PTR _a$[ebp]
eax, 2
ebp
0

201

CHAPTER 16. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES

16.1. MULTIPLICATION

Multiplication by 4 is just shifting the number to the left by 2 bits and inserting 2 zero bits at the right (as the last two
bits). It is just like multiplying 3 by 100 —we need to just add two zeroes at the right.
That’s how the shift left instruction works:

CF

7.

6

5

4

3

2

1

0

7

6

5

4

3

2

1

0

0

The added bits at right are always zeroes.
Multiplication by 4 in ARM:
Listing 16.3: Non-optimizing Keil 6/2013 (ARM mode)
f PROC
LSL
BX
ENDP

r0,r0,#2
lr

Multiplication by 4 in MIPS:
Listing 16.4: Optimizing GCC 4.4.5 (IDA)
jr
sll

$ra
$v0, $a0, 2 ; branch delay slot

SLL is “Shift Left Logical”

16.1.3

Multiplication using shifting/subtracting/adding

It’s still possible to get rid of the multiplication operation when you multiply by numbers like 7 or 17 again by using shifting.
The mathematics used here is relatively easy.
32-bit
#include <stdint.h>
int f1(int a)
{
return a*7;
};
int f2(int a)
{
return a*28;
};
int f3(int a)
{
return a*17;
};

x86
Listing 16.5: Optimizing MSVC 2012
; a*7
_a$ = 8
_f1
PROC
mov
ecx, DWORD PTR _a$[esp-4]
; ECX=a
lea
eax, DWORD PTR [ecx*8]
; EAX=ECX*8
sub
eax, ecx
; EAX=EAX-ECX=ECX*8-ECX=ECX*7=a*7
ret
0
_f1
ENDP
202

CHAPTER 16. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES

16.1. MULTIPLICATION

; a*28
_a$ = 8
_f2
PROC
mov
ecx, DWORD PTR _a$[esp-4]
; ECX=a
lea
eax, DWORD PTR [ecx*8]
; EAX=ECX*8
sub
eax, ecx
; EAX=EAX-ECX=ECX*8-ECX=ECX*7=a*7
shl
eax, 2
; EAX=EAX<<2=(a*7)*4=a*28
ret
0
_f2
ENDP
; a*17
_a$ = 8
_f3
PROC
mov
eax, DWORD PTR _a$[esp-4]
; EAX=a
shl
eax, 4
; EAX=EAX<<4=EAX*16=a*16
add
eax, DWORD PTR _a$[esp-4]
; EAX=EAX+a=a*16+a=a*17
ret
0
_f3
ENDP

ARM
Keil for ARM mode takes advantage of the second operand’s shift modifiers:
Listing 16.6: Optimizing Keil 6/2013 (ARM mode)
; a*7
||f1|| PROC
RSB
r0,r0,r0,LSL #3
; R0=R0<<3-R0=R0*8-R0=a*8-a=a*7
BX
lr
ENDP
; a*28
||f2|| PROC
RSB
r0,r0,r0,LSL #3
; R0=R0<<3-R0=R0*8-R0=a*8-a=a*7
LSL
r0,r0,#2
; R0=R0<<2=R0*4=a*7*4=a*28
BX
lr
ENDP
; a*17
||f3|| PROC
ADD
r0,r0,r0,LSL #4
; R0=R0+R0<<4=R0+R0*16=R0*17=a*17
BX
lr
ENDP

But there are no such modifiers in Thumb mode. It also can’t optimize f2():
Listing 16.7: Optimizing Keil 6/2013 (thumb mode)
; a*7
||f1|| PROC
LSLS
r1,r0,#3
; R1=R0<<3=a<<3=a*8
SUBS
r0,r1,r0
; R0=R1-R0=a*8-a=a*7
BX
lr
ENDP
; a*28
203

CHAPTER 16. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES

||f2|| PROC
MOVS
; R1=28
MULS
; R0=R1*R0=28*a
BX
ENDP

16.1. MULTIPLICATION

r1,#0x1c ; 28
r0,r1,r0
lr

; a*17
||f3|| PROC
LSLS
r1,r0,#4
; R1=R0<<4=R0*16=a*16
ADDS
r0,r0,r1
; R0=R0+R1=a+a*16=a*17
BX
lr
ENDP

MIPS
Listing 16.8: Optimizing GCC 4.4.5 (IDA)
_f1:
sll
$v0, $a0, 3
; $v0 = $a0<<3 = $a0*8
jr
$ra
subu
$v0, $a0 ; branch delay slot
; $v0 = $v0-$a0 = $a0*8-$a0 = $a0*7
_f2:
sll
$v0, $a0, 5
; $v0 = $a0<<5 = $a0*32
sll
$a0, 2
; $a0 = $a0<<2 = $a0*4
jr
$ra
subu
$v0, $a0 ; branch delay slot
; $v0 = $a0*32-$a0*4 = $a0*28
_f3:
sll
$v0, $a0, 4
; $v0 = $a0<<4 = $a0*16
jr
$ra
addu
$v0, $a0 ; branch delay slot
; $v0 = $a0*16+$a0 = $a0*17

64-bit
#include <stdint.h>
int64_t f1(int64_t a)
{
return a*7;
};
int64_t f2(int64_t a)
{
return a*28;
};
int64_t f3(int64_t a)
{
return a*17;
};

x64

204

CHAPTER 16. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES

16.2. DIVISION

Listing 16.9: Optimizing MSVC 2012
; a*7
f1:
lea
rax, [0+rdi*8]
; RAX=RDI*8=a*8
sub
rax, rdi
; RAX=RAX-RDI=a*8-a=a*7
ret
; a*28
f2:
lea
rax, [0+rdi*4]
; RAX=RDI*4=a*4
sal
rdi, 5
; RDI=RDI<<5=RDI*32=a*32
sub
rdi, rax
; RDI=RDI-RAX=a*32-a*4=a*28
mov
rax, rdi
ret
; a*17
f3:
mov
rax, rdi
sal
rax, 4
; RAX=RAX<<4=a*16
add
rax, rdi
; RAX=a*16+a=a*17
ret

ARM64
GCC 4.9 for ARM64 is also terse, thanks to the shift modifiers:
Listing 16.10: Optimizing GCC (Linaro) 4.9 ARM64
; a*7
f1:
lsl
x1, x0, 3
; X1=X0<<3=X0*8=a*8
sub
x0, x1, x0
; X0=X1-X0=a*8-a=a*7
ret
; a*28
f2:
lsl
x1, x0, 5
; X1=X0<<5=a*32
sub
x0, x1, x0, lsl 2
; X0=X1-X0<<2=a*32-a<<2=a*32-a*4=a*28
ret
; a*17
f3:
add
x0, x0, x0, lsl 4
; X0=X0+X0<<4=a+a*16=a*17
ret

16.2

Division

16.2.1

Division using shifts

Example:
unsigned int f(unsigned int a)
{
return a/4;
};
205

ecx 0 Listing 16. these two dropped bits are the division operation remainder. size = 4 eax. 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0 CF It is easy to understand if you imagine the number 23 in the decimal numeral system. The two freed bits at left (e.3 Exercises 16. but that’s OK. $a0.3.r0.4. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES 16. DWORD PTR [ecx*8] eax.r0. 23 can be easily divided by 10 just by dropping last digit (3 —is division remainder).12: Non-optimizing Keil 6/2013 (ARM mode) f PROC LSR BX ENDP r0. DWORD PTR _a$[esp-4] eax.11: MSVC 2010 _a$ = 8 _f PROC mov shr ret _f ENDP . The two least significant bits are dropped. 0 7.r0. these are not floating point numbers! Division by 4 in ARM: Listing 16. EXERCISES We get (MSVC 2010): Listing 16.. The SHR instruction works just like SHL.LSL #3 lr 206 .15: Non-optimizing Keil 6/2013 (ARM mode) f PROC RSB BX ENDP r0.g.13: Optimizing GCC 4. 2 0 The SHR (SHift Right) instruction in this example is shifting a number by 2 bits to the right.14: Optimizing MSVC 2010 _a$ = 8 _f PROC mov lea sub ret _f ENDP ecx. 2 .3.#2 lr Division by 4 in MIPS: Listing 16.1 Exercise #2 What does this code do? Listing 16. branch delay slot The SRL instruction is “shift-right logical”.CHAPTER 16. we work on integer values anyway. two most significant bits) are set to zero. 2 is left after the operation as a quotient. In fact. DWORD PTR _a$[esp-4] eax.5 (IDA) jr srl $ra $v0. So the remainder is dropped. but in the other direction. 16.

9 (ARM64) f: lsl sub ret w1.1.r0 lr Listing 16.17: Optimizing GCC 4.4. 207 . w0 Listing 16.CHAPTER 16. w1. $a0 Answer G.8 on page 955.5 (MIPS) (IDA) f: sll jr subu $v0.16: Non-optimizing Keil 6/2013 (thumb mode) f PROC LSLS SUBS BX ENDP r1.r1. w0.#3 r0. 3 $ra $v0.r0.18: Optimizing GCC 4. 3 w0. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES 16. EXERCISES Listing 16. $a0.3.

before studying the FPU in x86.2 x86 It is worth looking into stack machines1 or learning the basics of the Forth language 2 . 1 wikipedia 2 wikipedia 3 For example... and each register can hold a number in the IEEE 7544 format. so it can wait until the FPU is done with its work. specially designed to deal with floating point numbers. John Carmack used fixed-point arithmetic (wikipedia) values in his Doom video game.DF).e. IDA and OllyDbg show ST(0) as ST.CHAPTER 17. 32 bits) 6 and double (doubleprecision7 .3 ARM. but a set of registers. x86/x64 SIMD In ARM and MIPS the FPU is not a stack.. GCC also supports the long double type (extended precision8 . The FPU has a stack capable to holding 8 80-bit registers. MIPS. It was possible to buy it separately and install it 3 .ST(7). The FWAIT instruction reminds us of that fact—it switches the CPU to a waiting state. 17.6. which is represented in some textbooks and manuals as “Stack Top”.2 on page 376) section 7 wikipedia 8 wikipedia 208 . It was called “coprocessor” in the past and it stays somewhat aside of the main CPU.1 IEEE 754 A number in the IEEE 754 format consists of a sign. 17. For brevity. Another rudiment is the fact that the FPU instruction opcodes start with the so called “escape”-opcodes (D8. 80 bit).. a significand (also called fraction) and an exponent. 17. 64 bits).4 C/C++ The standard C/C++ languages offer at least two floating number types. i. which MSVC doesn’t. so Doom could work on 32-bit computers without FPU. the FPU is integrated in the CPU. It is interesting to know that in the past (before the 80486 CPU) the coprocessor was a separate chip and it was not always pre-installed on the motherboard. FLOATING-POINT UNIT Chapter 17 Floating-point unit The FPU is a device within the main CPU. They are ST(0). 80386 and 80486 SX 4 wikipedia 5 wikipedia 6 the single precision floating point number format is also addressed in the Working with the float type as with a structure ( 21. opcodes passed to a separate coprocessor. 17. i. Starting with the 80486 DX CPU. float (single-precision5 . stored in 32-bit GPR registers (16 bit for integral part and another 16 bit for fractional part). The same ideology is used in the SIMD extensions of x86/x64 CPUs.e.

FLOATING-POINT UNIT 17.5. 209 . ST(0) = result of _b * 4.14 fmul QWORD PTR __real@4010666666666666 .1: MSVC 2010: f() CONST SEGMENT __real@4010666666666666 DQ 04010666666666666r CONST ENDS CONST SEGMENT __real@40091eb851eb851f DQ 040091eb851eb851fr CONST ENDS _TEXT SEGMENT _a$ = 8 .5. f(1.2. 3. size = 8 _b$ = 16 .14 fld QWORD PTR _b$[ebp] .1. current stack state: ST(0) = _b. }.1 .5 Simple example Let’s consider this simple example: #include <stdio. SIMPLE EXAMPLE The float type requires the same number of bits as the int type in 32-bit environments. . current stack state: ST(0) = result of addition _f pop ret ENDP ebp 0 FLD takes 8 bytes from stack and loads the number into the ST(0) register.14 + b*4. 17. current stack state: .1 x86 MSVC Compile it in MSVC 2010: Listing 17. current stack state: ST(0) = result of _a divided by 3. }. int main() { printf ("%f\n". current stack state: ST(0) = _a fdiv QWORD PTR __real@40091eb851eb851f .1. 3. 4. but the number representation is completely different.14 . double b) { return a/3. ST(1) = result of _a divided by 3. size = 8 _f PROC push ebp mov ebp.4)).CHAPTER 17. ST(1) = result of _a divided by 3.14 faddp ST(1).h> double f (double a. ST(0) . 17. esp fld QWORD PTR _a$[ebp] . automatically converting it into the internal 80-bit format (extended precision).

thereby leaving the result at the top of the stack. which divides ST(1) by ST(0). so what we see here is the hexadecimal representation of 3.14 is encoded there. 9 wikipedia 10 wikipedia 210 . and ST(0) has the value of b. The next FMUL instruction does multiplication: b from ST(0) is multiplied by by value at __real@4010666666666666 (the numer 4. there is also the FDIVP instruction. After the execution of FDIV. The function must return its result in the ST(0) register. After that.14 in 64-bit IEEE 754 format.1 is there) and leaves the result in the ST(0) register. popping both these values from stack and then pushing the result.CHAPTER 17. the quotient is placed in ST(1). storing the result in ST(1) and then popping the value of ST(0). The subsequent FLD instruction pushes the value of b into the stack. FLOATING-POINT UNIT 17. SIMPLE EXAMPLE FDIV divides the value in ST(0) by the number stored at address __real@40091eb851eb851f —the value 3.5. so there are no any other instructions except the function epilogue after FADDP. If you know the Forth language9 . in ST(0). The last FADDP instruction adds the two values at top of stack. you can quickly understand that this is a stack machine10 . The assembly syntax doesn’t support floating point numbers. ST(0) holds the quotient. By the way.

SIMPLE EXAMPLE MSVC + OllyDbg I have marked red 2 pairs of 32-bit words in the stack..1: OllyDbg: first FLD executed Because of unavoidable conversion errors from 64-bit IEEE 754 float point to 80-bit (used internally in the FPU). FLOATING-POINT UNIT 17. For convenience..5. here we see 1. OllyDbg shows its value: 3.CHAPTER 17.14. Each pair is a double-number in IEEE 754 format and is passed from main(). EIP now points to the next instruction (FDIV)..2.2) from the stack and puts it into ST(0): Figure 17. 211 . which is close to 1. which loads a double-number (a constant) from memory.999. We see how the first FLD loads a value (1.

SIMPLE EXAMPLE Let’s trace further.CHAPTER 17.5. FLOATING-POINT UNIT 17. now ST(0) contains 0... FDIV was executed. (quotient): Figure 17.2: OllyDbg: FDIV executed 212 .382.

3: OllyDbg: second FLD executed At the same time.): Figure 17. EIP points to the next instruction: FMUL.. 3. FLOATING-POINT UNIT 17.CHAPTER 17.39999.1 from memory. Right now. quotient is pushed into ST(1).. loading 3.4 into ST(0) (here we see the approximate value. which OllyDbg shows.5. SIMPLE EXAMPLE Third step: the next FLD was executed. 213 . It loads the constant 4.

5. FLOATING-POINT UNIT 17.CHAPTER 17. SIMPLE EXAMPLE Next: FMUL was executed. so now the product is in ST(0): Figure 17.4: OllyDbg: FMUL executed 214 .

and then a value is written to that register.5: OllyDbg: FADDP executed The result is left in ST(0). which is the current “top of stack”. then all 7 register’s contents must be moved (or copied) to adjacent registers during pushing and popping. It can be said that FADDP saved the sum in the stack.4. because the function returns its value in ST(0).2: Optimizing GCC 4. just slightly different: Listing 17. now the result of the addition is in ST(0). Imagine if it was implemented in hardware as it’s described.14 . SIMPLE EXAMPLE Next: FADDP was executed. In reality. GCC GCC 4. But in fact. main() takes this value from the register later. The procedure is reversed if a value is popped. the FPU has just 8 registers and a pointer (called TOP) which contains a register number. So that’s what we see here. and ST(1) is cleared: Figure 17. the register which was freed is not cleared (it could possibly be cleared. 3.1 (with -O3 option) emits the same code. More precisely.. esp [ebp+arg_0] 215 . value is now located in ST(7). but this is more work which can degrade performance). however. TOP is pointed to the next available register. When a value is pushed to the stack. the registers of the FPU are a circular buffer. Why? As I wrote before. FLOATING-POINT UNIT 17. the FPU registers are a stack: 17..5. and that’s a lot of work.14 mov fdivr ebp.4. stack state now: ST(0) = 3. this instruction saved the sum and then shifted TOP. But this is a simplification.93.2 on page 208.1 f public f proc near arg_0 arg_8 = qword ptr = qword ptr push fld 8 10h ebp ds:dbl_8048608 . We also see something unusual: the 13. and then popped one element.CHAPTER 17.

17.3 (LLVM) (ARM mode) Until ARM got standardized floating point support. FDIVR stands for Reverse Divide —to divide with divisor and dividend swapped with each other. stack state now: ST(0) = result of division fld ds:dbl_8048610 . DATA XREF: f . R1. D18. D16. respectively.3. VLDR and VMOV . VMUL and VADD. 3.1 .5.6. R1 . D17 D16. but they work with Dregisters. becase a double-precision number that needs 64 bits for storage. R0. several processor manufacturers added their own instructions extensions.F64 VADD. are instruction for processing floating point numbers that compute quotient. It is easy to remember: D-registers are for double precision numbers.14 is pushed to the stack (into ST(0)). ST(0) holds the sum.1. FADDP adds the two values but also pops one value from the stack.F64 VMOV BX dbl_2C98 dbl_2CA0 DCFD 3.1) constants are stored in memory in IEEE 754 form. intended to be used for single precision floating pointer numbers (float). so we just have FMUL without its -R counterpart. are intended not only for floating point numbers. R0 and R1.5. DATA XREF: f+10 So. ST(1) = result of division fmul [ebp+arg_8] . More about it: B. is returned in R0 and R1.1 . b*4. There is no likewise instruction for multiplication since it is a commutative operation. ``VMOV R0.1 D16. D16 LR load "a" load "b" .3 on page 945. you work just with registers.14 D17. a/3. =4.3 ARM: Optimizing Keil 6/2013 (thumb mode) 216 . R3 .3 (LLVM) (ARM mode) f VLDR VMOV VMOV VDIV. just like the D-registers. R1. st . One important difference from x86 is that in ARM. are analogous to the LDR and MOV instructions. D16 R0. composes two 32-bit values from R0 and R1 into one 64-bit value and saves it to D17.F64 VLDR VMUL.CHAPTER 17.14 DCFD 4. 17. It has to be noted that these instructions.3: Optimizing Xcode 4. first of all. R2.5. stack state now: ST(0) = 4. FLOATING-POINT UNIT 17. D16'' is the inverse operation. but can be also used for SIMD (NEON) operations and this will also be shown soon. ``VMOV D17. Then. VFP (Vector Floating Point) was standardized. however each number that has double precision has a size of 64 bits.6. there are 32 of them. there is no stack. ST(1) = result of division pop faddp ebp st(1). product and sum. 4. stack state now: ST(0) = result of multiplication. =3. After that operation. + . as it can be easily deduced.1 D17.14 and 4. Both (3.2 ARM: Optimizing Xcode 4. These are 64-bit registers.14 . while S-registers —for single precision numbers. and then the value in arg_0 is divided by the value in ST(0). and they can be used both for floating-point numbers (double) but also for SIMD (it is called NEON here in ARM). There are also 32 32-bit S-registers. D17. SIMPLE EXAMPLE . R0. D18. Listing 17. D17. what was in D16 is split in two registers. with D prefix. The code for thumb-2 is same. stack state now: ST(0) = result of addition f retn endp The difference is that. we see here new some registers used. The arguments are passed to the function in a common way. R1'' at the start. so two R-registers are needed to pass each one. VDIV. via the R-registers. D16 D17.

LC25: .14 fdiv d0. R3 R5.14 in IEEE 754 form: dword_36C DCD 0x51EB851F dword_370 DCD 0x40091EB8 . 3. DATA XREF: f+A . 3.1 . 4. __aeabi_ddiv. d0 . 17. and instead of FPU-instructions. while using the coprocessor’s FPU-instructions is called hard float or armhf. R5 R1. Of course. SIMPLE EXAMPLE f PUSH MOVS MOVS MOVS MOVS LDR LDR MOVS MOVS BL MOVS MOVS LDR LDR MOVS MOVS BL MOVS MOVS BL POP {R3-R7. D2 = 4.1 R3. .5. DATA XREF: f+C .5.LR} R7. R6 __aeabi_ddiv R2. R4 __aeabi_dadd {R3-R7. R7 R1. 3. .9 Very compact code: Listing 17. D0 = D0/D2 = a/3.14 ldr d2.PC} . R7 R3. D2 = 3. R1 R2.9 217 .5.5 ARM64: Non-optimizing GCC (Linaro) 4. d2 . =0x40091EB8 R0. =0x51EB851F . By the way. D1 = b ldr d2.4: Optimizing GCC (Linaro) 4. and were installed only on expensive computers. d2. division and addition for floating-point numbers.4 ARM64: Optimizing GCC (Linaro) 4. DATA XREF: f+1A . 3.14 .14 R3.14 .word 1717986918 . __aeabi_dadd ) which emulate multiplication. service library functions are called (like __aeabi_dmul. =0x40106666 R0. 4.1 in IEEE 754 form: dword_364 DCD 0x66666666 dword_368 DCD 0x40106666 .LC26: .LC25 . R4 __aeabi_dmul R7. R0 R4. R0 R6.CHAPTER 17. The FPU-coprocessor emulation is called soft float or armel in the ARM world.word 1074816614 17. that is slower than FPU-coprocessor. D0 = a.14 ret . FLOATING-POINT UNIT 17. d1. similar FPU-emulating libraries were very popular in the x86 world when coprocessors were rare and expensive.word 1074339512 . 4. but still better than nothing.LC26 .1 fmadd d0. 4.1 .1+a/3. R1 R2.9 f: . DATA XREF: f+1C Keil generated code for a processor without FPU or NEON support. The double-precision floating-point numbers are passed via generic R-registers. constants in IEEE 754 format: . d0. D0 = D1*D2+D0 = b*4. =0x66666666 . R2 R4.word 1374389535 .

D1 = 4. x2 . sp. X0 = 3.14 .14 + b*4.5. X0 = D0 = b*4.14 constant to $f0: 218 . you can try to optimize this function manually. Listing 17.6 MIPS MIPS can support several coprocessors (up to 4). D0 = D0/D1 = a/3. What is worth noting is that ARM64 has 64-bit registers. $f12-$f13=A .8] [sp] [sp. D1 = X0 = b*4. as an exercise. d0. D1 = 3. So the compiler is free to save values of type double in GPRs instead of the local stack. D0 = D0*D1 = b*4. As in ARM. X1 = a/3. save "b" in Register Save Area . load low 32-bit part of 3. Probably. d0 . D0 = a/3. $f14-$f15=B lui $v0.LC25 . d0 .14 fmov d1. #16 [sp. \ redundant code d0.1 fmul d0. 4.8] . d1 . / sp. SIMPLE EXAMPLE Listing 17.14 ldr x2. . D0 = D0+D1 = a/3. D0 = a. This isn’t possible on 32-bit CPUs.1 fmov d0. sp. d1 . d1. GCC 4.1. x1 . 16 . x1 fmov d1. d0.1 fmov fmov add ret x0. And again. There is a lot of unnecessary value shuffling. 3. (dword_C4 >> 16) . x0 .word .14 fmov d0. the zeroth of which is a special control coprocessor. it has 32 32-bit registers ($F0-$F31): C.word 1717986918 1074816614 . the MIPS coprocessor is not a stack machine. d0 .1 fmov d0.word 1374389535 1074339512 .5.2 on page 947.1 fmov x0. x0 .14 fmov x1.5 (IDA) f: .CHAPTER 17.1 . a pair of 32-bit F-registers is used. and first coprocessor is the FPU.4.9 is not yet good in generating ARM64 code. d1 . ? . X1 = a ldr x0. x0 . without introducing new instructions like FMADD.9 f: sub str str ldr sp.6: Optimizing GCC 4. x1. 17. d0.word . and the D-registers are 64-bit ones as well. X2 = b ldr x0. . save "a" in Register Save Area . D0 = b fmov d1.1 fadd d0. including some clearly redundant code (the last two FMOV instructions). x0 .LC26: Non-optimizing GCC is more verbose.LC26 . When one needs to work with 64-bit double values. [sp] . FLOATING-POINT UNIT 17.14 fdiv d0.5: Non-optimizing GCC (Linaro) 4. d0. X0 = 4.LC25: .

$f2 .d $f0. . NOP load high 32-bit part of 4.S” would stands for single precision) There is also a weird compiler anomaly: the LUI instructions that I’ve marked with a question mark.word . . load delay slot.D. It’s hard for me to understand why load a part of a 64-bit constant of double type into the $V0 register.D do division/multiplication/addition (“.rodata.1. 32. 4. $f0 $f0-$f1=A/3. ADD. do multiplication: mul. $zero . 17.rodata.d $f0. $LC0 lui $v0. ($LC1 >> 16) .1 x86 Let’s see what we get in (MSVC 2010): Listing 17.rodata. If someone knows more about it. load delay slot.1 to $f3: lwc1 $f3.6 Passing floating point numbers via arguments #include <math. “. } 17.cst8:000000C4 $LC0: dword_BC: $LC1: dword_C4: . A pair of LWC1 instructions may be combined into a L. pow (32. $LC1 or $at. These instruction have no effect. PASSING FLOATING POINT NUMBERS VIA ARGUMENTS lwc1 $f0.54 = %lf\n".d $f2. please drop me an email11 . .cst8:000000BC .h> #include <stdio.word .rodata. dword_BC or $at.6.14 constant in $f0-$f1.com 219 .1 to $f2: lwc1 $f2. esp sub esp.14 constant to $f1: lwc1 $f1.word . . . allocate space for the first variable 11 dennis(a)yurichev.D pseudoinstruction.cst8:000000C0 .1 constant in $f2-$f3.54 _main PROC push ebp mov ebp.D” in the suffix stands for double precision.7: MSVC 2010 CONST SEGMENT __real@40400147ae147ae1 DQ 040400147ae147ae1r __real@3ff8a3d70a3d70a4 DQ 03ff8a3d70a3d70a4r CONST ENDS . $zero . NOP load high 32-bit part of 3.01. . return 0. . load delay slot. • DIV. NOP .CHAPTER 17.cst8:000000B8 . MUL.54)). dword_C4 or $at. 17. ? A in $f12-$f13. do division: div. $f14. 1.14 load low 32-bit part of 4. 8 .1 jr $ra sum 64-bit parts and leave result in $f0-$f1: add. $zero .h> int main () { printf ("32. $f12.01 ^ 1. NOP B in $f14-$f15. FLOATING-POINT UNIT . 3.D.word 0x40091EB8 0x51EB851F 0x40106666 0x66666666 # # # # DATA DATA DATA DATA XREF: XREF: XREF: XREF: f+C f+4 f+10 f The new instructions here are: • LWC1 loads a 32-bit word into a register of the first coprocessor (hence “1” in instruction name). branch delay slot.6. $f2 $f2-$f3=B*4.01 .

The result of _pow is moved into D16. 17.6. D16 _printf R1. and its second one in R2 and R3. =1. as we see. .6. the _pow function receives its first argument in R0 and R1. 64-bit floating pointer numbers are passed in R-registers pairs. 8 .3 (LLVM) (thumb-2 mode) _main var_C = -0xC PUSH MOV SUB VLDR VMOV VLDR VMOV BLX VMOV MOV ADD VMOV BLX MOVS STR MOV ADD POP dbl_2F90 dbl_2F98 DCFD 32.54 {R7. #4 {R7. DATA XREF: _main+6 . . R3. QWORD PTR QWORD PTR _pow esp.6. result now in ST(0) fstp QWORD PTR [esp] . FLOATING-POINT UNIT fld fstp sub fld fstp call add QWORD PTR QWORD PTR esp. SP SP. PASSING FLOATING POINT NUMBERS VIA ARGUMENTS __real@3ff8a3d70a3d70a4 [esp] allocate space for the second variable __real@40400147ae147ae1 [esp] "return back" place of one variable. R2.01 ^ 1. SP. a pair of MOV instructions could be used here for moving values from the memory into the stack. from where printf() takes the resulting number.PC} . then in the R1 and R2 pair. DATA XREF: _main+E As I wrote before. so no conversion is necessary. By the way. SP. in local stack here 8 bytes still reserved for us.6. This code is a bit redundant (certainly because optimization is turned off). and pow() also takes them in this format. move result from ST(0) to local stack for printf() push OFFSET $SG2651 call _printf add esp.3 ARM + Non-optimizing Keil 6/2013 (ARM mode) 12 a standard C function.01 DCFD 1. 0 R0. D16 _pow D16. for ARM: 17. R1.6. That’s how it’s done in the next example. So.LR} R7. raises a number to the given power (exponentiation) 220 . "32. R1 R0. R1 SP. [SP.CHAPTER 17.54 R2.01 R0. R0. 12 xor eax. D16 D16. 17.2 on page 220. PC R1. The function leaves its result in R0 and R1. pow()12 takes both values from the stack of the FPU and returns its result in the ST(0) register. =32. #4 D16. printf() takes 8 bytes from the local stack and interprets them as double type variable. 0xFC1 . since it is possible to load values into the R-registers directly without touching the D-registers. eax pop ebp ret 0 _main ENDP FLD and FSTP move variables between the data segment and the FPU stack. 8 .2 ARM + Non-optimizing Xcode 4. because the values in memory are stored in IEEE 754 format.54 = %lf\n" R0.#0xC+var_C] R0. 17.

LC1: .01 ^ 1. (dword_9C >> 16) 221 . It is to be passed to printf() without any modification and moving. load 32.0 . x R1. "32.4 ARM64 + Optimizing GCC (Linaro) 4.01 in IEEE 754 format . -16]! add x29.54 = %lf\n" __2printf R0.01 ^ 1. 0 ldr d1.54 = %lf".8: Optimizing GCC (Linaro) 4.string "32. x30.LR} R2. R0 R2. result of pow() in D0 adrp x0. =0x3FF8A3D7 R0.0xA.01 into D0 bl pow . PASSING FLOATING POINT NUMBERS VIA ARGUMENTS _main STMFD LDR LDR LDR LDR BL MOV MOV MOV ADR BL MOV LDMFD y dword_520 . 1.word -1374389535 . a32_011_54Lf . R4 R3.LC2 bl printf mov w0.LC1 . because printf() takes arguments of integral types and pointers from X-registers. load 1.CHAPTER 17. 17. DATA XREF: _main+4 .9: Optimizing GCC 4. function prologue: lui $gp.9 Listing 17.6.54 = %lf\n" The constants are loaded into D0 and D1: pow() takes them from there.word 1073259479 .6. [sp.6. DATA XREF: _main+10 DCB "32.5 MIPS Listing 17. DATA XREF: _main+24 D-registers are not used here.LC0 . =0xAE147AE1 . x30. =0x40400147 pow R4. . y R3. 16 ret . #0 SP!.LC2 add x0. just R-register pairs. FLOATING-POINT UNIT 17.01 ^ 1.PC} DCD 0xA3D70A4 DCD 0x3FF8A3D7 . R1 R0. {R4-R6. . 0 ldp x29.4. [sp]. DATA XREF: _main+C DCD 0x40400147 .LC0: . sp. :lo12:. 17. The result will be in D0 after the execution of pow(). =0xA3D70A4 .word 1077936455 .9 f: stp x29.word 171798692 .54 into D1 ldr d0.5 (IDA) main: var_10 var_4 = -0x10 = -4 . and floating point arguments from D-registers. 32. double x x dword_528 a32_011_54Lf SP!. DATA XREF: _main+8 DCD 0xAE147AE1 .LC2: . . {R4-R6. x0.54 in IEEE 754 format .

dword_A4 or $at.54: lwc1 $f14. 0x20+var_4($sp) sw $gp.cst8:000000A4 . $zero .7 Comparison example Let’s try this: #include <stdio. .cst8:0000009C .rodata. $f1 call printf(): lui $a0. (__gnu_local_gp & 0xFFFF) sw $ra.01 ^ 1. -0x20 la $gp.rodata. 0x20+var_10($sp) lui $v0. . $LC0 lui $v0.str1. . double b) { if (a>b) return a.rodata.54 = %lf\n" function epilogue: lw $ra. .word 0x40400147 . (printf & 0xFFFF)($gp) mfc1 $a2.rodata. return b.54: lwc1 $f15.01: lwc1 $f13. This instruction transfers values from the coprocessor’s registers to the registers of the CPU (GPR).54 = %lf\n" jalr $t9 la $a0. 0x20+var_4($sp) return 0: move $v0. we see here LUI loading a 32-bit part of a double number into $V0.rodata.7. The new instruction for us here is MFC1 (“Move From Coprocessor 1”).4:00000084 $LC2: .01 ^ 1. $f0 lw $t9. 0x20+var_10($sp) copy result from $f0 and $f1 to $a3 and $a2: mfc1 $a3. . I honestly don’t know why. 17.01: lwc1 $f12.rodata. addiu $sp. 0x20 . and printf() takes a 64-bit double value from this register pair. ? load low 32-bit part of 1.54: . COMPARISON EXAMPLE . . . ($LC1 >> 16) . $zero jr $ra addiu $sp. load delay slot.01 ^ 1.54 = %lf\n"<0> $LC0: dword_9C: .word 0xA3D70A4 # DATA XREF: main+24 # main+30 # DATA XREF: main+14 And again.ascii "32. NOP load high 32-bit part of 1.h> double d_max (double a. ($LC2 >> 16) # "32. .cst8:000000A0 . And again.cst8:00000098 .rodata. 32. The FPU is coprocessor number 1.word 0x3FF8A3D7 dword_A4: . (dword_A4 >> 16) . ? load low 32-bit part of 32. So in the end the result from pow() is moved to registers $A3 and $A2. FLOATING-POINT UNIT 17. .cst8:0000009C .01: . 1. $LC1 call pow(): jalr $t9 or $at. ($LC2 & 0xFFFF) # "32. hence “1” in the instruction name. (pow & 0xFFFF)($gp) load high 32-bit part of 32. int main() { 222 .cst8:000000A0 . branch delay slot.CHAPTER 17. dword_9C load address of pow() function: lw $t9. }.word 0xAE147AE1 # DATA XREF: main+20 # DATA XREF: main # main+18 $LC1: . NOP lw $gp. $zero .

then C3/C2/C0 bits are to be set as following: 0.4)). accordingly. Unfortunately. printf ("%f\n". and pop register fcomp QWORD PTR _a$[ebp] . The FNSTSW instruction copies FPU the status word register to AX. 17. After the bits are set. size = 8 _d_max PROC push ebp mov ebp.7. 0. Modern CPU starting at Intel P6 have FCOMI/FCOMIP/FUCOMI/FUCOMIP instructions —which do the same. 1. This is how C3/C2/C0 bits are located in the AX register: 14 C3 13 Intel 10 9 8 C2 C1 C0 P6 is Pentium Pro. COMPARISON EXAMPLE printf ("%f\n". FLOATING-POINT UNIT 17. 0. • If a = b. 1. leaving the stack in the same state. but modify the ZF/PF/CF CPU flags. size = 8 _b$ = 16 . 5 jp SHORT $LN1@d_max . then the bits are: 0. they are at the same positions in the AX register and all they are placed in the high part of AX —AH. d_max (5. This is a 16-bit register that reflects the current state of the FPU. stack is empty here fnstsw ax test ah. compare _b (ST(0)) and _a. we are here only if a>b fld QWORD PTR _a$[ebp] jmp SHORT $LN2@d_max $LN1@d_max: fld QWORD PTR _b$[ebp] $LN2@d_max: pop ebp ret 0 _d_max ENDP So. FLD loads _b into ST(0).1 x86 Non-optimizing MSVC MSVC 2010 generates the following: Listing 17. Probably. FCOMP compares the value in ST(0) with what is in _a and sets C3/C2/C0 bits in FPU status word register. -4)). This is what distinguishes it from FCOM. 0. }. • If the result is unordered (in case of error). etc 223 . Pentium II. CPUs before Intel P6 13 don’t have any conditional jumps instructions which check the C3/C2/C0 bits. esp fld QWORD PTR _b$[ebp] . 1.7. it is a matter of history (remember: FPU was separate chip in past).10: Non-optimizing MSVC 2010 PUBLIC _d_max _TEXT SEGMENT _a$ = 8 . 3. it will be harder to understand how it works. the FCOMP instruction also pops one variable from the stack. Despite the simplicity of the function. • If b > a in our example. then the bits are: 1. 0.2. d_max (1. which is just compares values. current stack state: ST(0) = _b .CHAPTER 17. • If a > b. then the set bits are: 1.6. 0. C3/C2/C0 bits are placed at positions 14/10/8.

The FPU has four condition flags (C0 to C3). This flag is set to 1 if the number of ones in the result of the last calculation is even. The PF flag is to be set to 1 if both C0 and C2 are set to 0 or both are 1. If the conditional jump was triggered. since it was cleared by the test ah. let’s see how. incomparable floating point values (NaN or unsupported format) are compared with the FUCOM instructions. If the programmer cares about FPU errors. COMPARISON EXAMPLE This is how C3/C2/C0 bits are located in the AH register: 6 2 C3 1 0 C2 C1 C0 After the execution of test ah. and to 0 if it is odd. If we recall the values of C3/C2/C0 for various cases. And what about checking C2? The C2 flag is set in case of error (NaN. 514 . etc). we can see that the conditional jump JP is triggering in two cases: if b > a or a = b (C3 bit is not considered here.g. in which case the subsequent JP (jump if PF==1) is triggering. FLOATING-POINT UNIT 17.7. the parity flag used sometimes in FPU code.CHAPTER 17. C2 in the parity flag and C3 in the zero flag. 5 instruction). another notable historical rudiment. the value of _a is loaded there. and if it was not triggered. 14 5=1001b 15 wikipedia 224 . Now let’s talk about the parity flag. The C2 flag is set when e. As noted in Wikipedia. Let’s look into Wikipedia 15 : One common reason to test the parity flag actually has nothing to do with parity. C0 is placed in the carry flag. but our code doesn’t check it. but they can not be tested directly. all other bits are just ignored. When this happens. only C0 and C2 bits (on 0 and 2 position) are considered. and must instead be first copied to the flags register. It is all simple after that. he/she must add additional checks. FLD loads the value of _b in ST(0).

CHAPTER 17.4) is already loaded in ST(0).2 and b=3. 225 . Now FCOMP is being executed.7.4 (We can see them in the stack: two pairs of 32-bit values).6: OllyDbg: first FLD is executed Current arguments of the function: a = 1. which is in stack right now. FLOATING-POINT UNIT 17. b (3.4 Let’s load the example into OllyDbg: Figure 17. OllyDbg shows the second FCOMP argument. COMPARISON EXAMPLE First OllyDbg example: a=1.2 and b = 3.

COMPARISON EXAMPLE FCOMP is executed: Figure 17. FLOATING-POINT UNIT 17.1 on page 215.7: OllyDbg: FCOMP is executed We see the state of the FPU’s condition flags: all zeroes. 226 . The popped value is reflected as ST(7). I wrote earlier about reason for this: 17.7.5.CHAPTER 17.

8: OllyDbg: FNSTSW is executed We see that the AX register contain zeroes: indeed.7.CHAPTER 17. FLOATING-POINT UNIT 17. 227 . COMPARISON EXAMPLE FNSTSW is executed: Figure 17. (OllyDbg disassembles the FNSTSW instruction as FSTSW — they are synonyms). all condition flags are zero.

COMPARISON EXAMPLE TEST is executed: Figure 17. Indeed: the number of bits set in 0 is 0 and 0 is an even number. 16 Jump Parity Even (x86 instruction) 228 .CHAPTER 17. OllyDbg disassembles JP as JPE16 — they are synonyms.9: OllyDbg: TEST is executed The PF flag is set to 1. And it is about to trigger now.7. FLOATING-POINT UNIT 17.

FLD loads the value of b (3.10: OllyDbg: second FLD is executed The function finishes its work.7. FLOATING-POINT UNIT 17.4) in ST(0): Figure 17.CHAPTER 17. COMPARISON EXAMPLE JPE triggered. 229 .

which is in stack right now. 230 . FLOATING-POINT UNIT 17. b (−4) is already loaded in ST(0).6 and b=-4 Let’s load example into OllyDbg: Figure 17. FCOMP about to execute now. COMPARISON EXAMPLE Second OllyDbg example: a=5.7. OllyDbg shows the second FCOMP argument.11: OllyDbg: first FLD executed Current function arguments: a = 5.6 and b = −4).CHAPTER 17.

7.CHAPTER 17. 231 . FLOATING-POINT UNIT 17.12: OllyDbg: FCOMP executed We see the state of the FPU’s condition flags: all zeroes except C0. COMPARISON EXAMPLE FCOMP executed: Figure 17.

COMPARISON EXAMPLE FNSTSW executed: Figure 17. 232 . FLOATING-POINT UNIT 17.CHAPTER 17.7.13: OllyDbg: FNSTSW executed We see that the AX register contains 0x100: the C0 flag is at the 16th bit.

233 . COMPARISON EXAMPLE TEST executed: Figure 17.CHAPTER 17. FLOATING-POINT UNIT 17.14: OllyDbg: TEST executed The PF flag is cleared.7. JPE is being skipped now. Indeed: the count of bits set in 0x100 is 1 and 1 is an odd number.

7.6) in ST(0): Figure 17. 00000041H jne SHORT $LN5@d_max . ST(1) = _b fcom ST(1) . . compare _a and ST(1) = (_b) fnstsw ax test ah. size = 8 PROC QWORD PTR _b$[esp-4] QWORD PTR _a$[esp-4] . 65 . copy ST(0) to ST(0) and pop register. leave (_a) on top fstp ST(1) . copy ST(0) to ST(1) and pop register.CHAPTER 17. FLOATING-POINT UNIT 17. current stack state: ST(0) = _a ret 0 $LN5@d_max: . so FLD loads the value of a (5. current stack state: ST(0) = _a. COMPARISON EXAMPLE JPE wasn’t triggered. size = 8 . current stack state: ST(0) = _b ret 0 234 . . Optimizing MSVC 2010 Listing 17. leave (_b) on top fstp ST(0) .11: Optimizing MSVC 2010 _a$ = 8 _b$ = 16 _d_max fld fld .15: OllyDbg: second FLD executed The function finishes its work.

Then FSTP ST(1) follows —this instruction copies the value from ST(0) to the operand and pops one value from the FPU stack. FLOATING-POINT UNIT _d_max 17. 235 . Both will be zero if a > b: in that case the JNE jump will not be triggered. then the bits are: 0. The conditional jump JNE is triggering in two cases: if b > a or a = b. then C3/C2/C0 bits are to be set as: 0. which is why the result of the comparison in C3/C2/C0 is different: • If a > b in our example. In other words. Then the function finishes. the instruction copies ST(0) (where the value of _a is now) into ST(1). 0. then one value is popped from the stack and the top of the stack (ST(0)) is contain what was in ST(1) before (that is _b). • If b > a. ST(0) contains _a and the function is finishes. 0. here the operands are in reverse order. two copies of _a are at the top of the stack. After that. 0. COMPARISON EXAMPLE ENDP FCOM differs from FCOMP in the sense that it just compares the values and doesn’t change the FPU stack. it is just like an idle (NOP) operation. The reason this instruction is used here probably is because the FPU has no other instruction to pop a value from the stack and discard it. Then. 0. The test ah. 0.CHAPTER 17. After that. 65 instruction leaves just two bits —C3 and C0. ST(0) is copied into ST(0). • If a = b. then the bits are: 1.7. Unlike the previous example. 1. one value is popped.

7.2 and b=3. FLOATING-POINT UNIT 17.4 Both FLD are executed: Figure 17. for convenience.CHAPTER 17. 236 .16: OllyDbg: both FLD are executed FCOM being executed: OllyDbg shows the contents of ST(0) and ST(1). COMPARISON EXAMPLE First OllyDbg example: a=1.

all other condition flags are cleared.7.17: OllyDbg: FCOM is done C0 is set. COMPARISON EXAMPLE FCOM is done: Figure 17.CHAPTER 17. 237 . FLOATING-POINT UNIT 17.

7.18: OllyDbg: FNSTSW is executed 238 .CHAPTER 17. AX=0x3100: Figure 17. COMPARISON EXAMPLE FNSTSW is done. FLOATING-POINT UNIT 17.

7. COMPARISON EXAMPLE TEST is executed: Figure 17. conditional jump is about to trigger now. 239 .CHAPTER 17. FLOATING-POINT UNIT 17.19: OllyDbg: TEST is executed ZF=0.

4 was left on top: Figure 17. COMPARISON EXAMPLE FSTP ST (or FSTP ST(0)) was executed: 1. 240 .7.20: OllyDbg: FSTP is executed We see that the FSTP ST instruction works just like popping one value from the FPU stack. FLOATING-POINT UNIT 17. and 3.2 was popped from the stack.CHAPTER 17.

21: OllyDbg: both FLD are executed FCOM is about to execute.7. COMPARISON EXAMPLE Second OllyDbg example: a=5.CHAPTER 17. FLOATING-POINT UNIT 17.6 and b=-4 Both FLD are executed: Figure 17. 241 .

22: OllyDbg: FCOM is finished All conditional flags are cleared.7. COMPARISON EXAMPLE FCOM is done: Figure 17. 242 .CHAPTER 17. FLOATING-POINT UNIT 17.

23: OllyDbg: FNSTSW was executed 243 . COMPARISON EXAMPLE FNSTSW done.CHAPTER 17. FLOATING-POINT UNIT 17. AX=0x3000: Figure 17.7.

jump will not happen now. FLOATING-POINT UNIT 17.24: OllyDbg: TEST was executed ZF=1. 244 .7. COMPARISON EXAMPLE TEST is done: Figure 17.CHAPTER 17.

[ebp+b_first_half] dword ptr [ebp+b]. [ebp+a_first_half] dword ptr [ebp+a]. load a and b to FPU stack: 245 . eax eax.4. put a and b to local stack: mov mov mov mov mov mov mov mov eax. [ebp+a_second_half] dword ptr [ebp+a+4].4.1 d_max proc near b a a_first_half a_second_half b_first_half b_second_half push mov sub = = = = = = qword qword dword dword dword dword ptr -10h ptr -8 ptr 8 ptr 0Ch ptr 10h ptr 14h ebp ebp. FLOATING-POINT UNIT 17. eax . esp esp. but clears ST(1).6 is now at the top of the FPU stack.CHAPTER 17.1 Listing 17. GCC 4. eax eax. 10h . Figure 17.7.25: OllyDbg: FSTP was executed We now see that the FSTP ST(1) instruction works as follows: it leaves what was at the top of the stack. eax eax.12: GCC 4. [ebp+b_second_half] dword ptr [ebp+b+4]. COMPARISON EXAMPLE FSTP ST(1) was executed: a value of 5.

0 is to be stored into AL. CF=0.7. yes [ebp+a] short locret_8048456 loc_8048453: fld [ebp+b] locret_8048456: leave retn d_max endp FUCOMPP is almost like FCOM. compare a and b and pop two values from stack. but Jcc does actually jump or not. 1. 0. a and b ax . COMPARISON EXAMPLE [ebp+a] [ebp+b] . SETNBE stores 1 only if CF=0 and ZF=0. It is almost the counterpart of JNBE. result of division by 0. then: 1.a. but if one tries to do any operation with “signaling” NaNs. then the bits are to be set to: 0. Now let’s also recall the values of C3/C2/C0 in different conditions: • If a is greater than b in our example. the CPU flags are to be set as: ZF=0. If it is not true. ST(1) . 2. • if a is less than b. In other words. load SF. if CF=0 and ZF=0 al. AL==0 ? short loc_8048453 . 0 of the AH register: 6 2 C3 1 0 C2 C1 C0 In other words. Only in one case both CF and ZF are 0: if a > b. FUCOM raising an exception only if any operand is a signaling NaN (SNaN). In all other cases. then the flags are to be set as: ZF=0. PF=0. 8 bits from AH are moved into the lower 8 bits of the CPU flags in the following order: 7 6 SF ZF 4 2 0 AF PF CF Let’s recall that FNSTSW moves the bits that interest us (C3/C2/C0) into AH and they are in positions 6. It is possible to continue to work with “quiet” NaNs.. store 1 to AL. al . PF and CF. CF=0. an exception is to be raised. the subsequent JZ is not to be triggered and the function will return _a. with the exception that SETcc18 stores 1 or 0 in AL. ST(1) . and CF flags state from AH al . Then 1 is to be stored to AL. 0. A bit about not-a-numbers: The FPU is able to deal with special values which are not-a-numbers or NaNs 17 . the fnstsw ax / sahf instruction pair moves C3/C2/C0 into ZF. 0. this instruction swapping ST(1) and ST(0) . etc. current stack state: ST(0) . Not-a-numbers can be “quiet” and “signaling”.b. _b is to be returned. These are infinity.a fxch st(1) . but pops both values from the stack and handles “not-a-numbers” differently. FLOATING-POINT UNIT fld fld 17. FCOM raising an exception if any operand is NaN.b fucompp fnstsw sahf setnbe test jz fld jmp . • And if a = b. • If a < b.CHAPTER 17. PF. The next instruction is SAHF (Store AH into Flags) —this is a rare instruction in code not related to the FPU. i. 17 wikipedia 18 cc is condition code 246 . CF=1. SETNBE stores 1 or 0 to AL. then: ZF=1. then C3/C2/C0 are to be set to: 0.e. 0. ZF. store FPU status to AX . PF=0. • If a = b. these states of the CPU flags are possible after three FUCOMPP/FNSTSW/SAHF instructions: • If a > b. PF=0. current stack state: ST(0) . AF. 0. Depending on the CPU flags and conditions.

1 Listing 17. etc). COMPARISON EXAMPLE Optimizing GCC 4. FLOATING-POINT UNIT 17. Thereby.7. 247 . the FPU is no longer separate unit in P6 Intel family. compare _a and _b fnstsw ax sahf ja short loc_8048448 . the conditional jumps instructions listed here can be used after a FNSTSW/SAHF instruction pair.8. Actually. Let’s recall where bits C3/C2/C0 are located in the AH register after the execution of FSTSW/FNSTSW: 6 2 C3 1 0 C2 C1 C0 Let’s also recall. JBE. how the bits from AH are stored into the CPU flags the execution of SAHF: 7 6 SF ZF 4 2 0 AF PF CF After the comparison. JNE/JNZ) check only flags CF and ZF. store ST(0) to ST(0) (idle operation). JNAE. to easily map them to base CPU flags without additional permutations. leave _a at top fstp st(1) loc_804844A: d_max pop retn endp ebp It is almost the same except that JA is used after SAHF. etc. So what we get is: 19 Starting at Pentium Pro. Apparently. the maintainers of GCC decided to drop support of pre-P6 Intel CPUs (early Pentiums. stack state now: ST(0) = _b.13: Optimizing GCC 4. JNBE. so the conditional jumps are able work after. stack state now: ST(0) = _a. esp [ebp+arg_0] . JA is triggering if both CF are ZF zero. GCC 4. store _a to ST(0). Apparently. _a [ebp+arg_8] .4. .1 d_max public d_max proc near arg_0 arg_8 = qword ptr = qword ptr push mov fld fld 8 10h ebp ebp. _b . pop value at top of stack. ST(1) = _a fxch st(1) . conditional jump instructions that check “larger”. JNA. ST(1) = _b fucom st(1) .1 with -O3 optimization turned on Some new FPU instructions were added in the P6 Intel family19 These are FUCOMI (compare operands and set flags of the main CPU) and FCMOVcc (works like CMOVcc. JE/JZ.CHAPTER 17. “lesser” or “equal” for unsigned number comparison (these are JA.4. so now it is possible to modify/check flags of the main CPU from the FPU. Pentium-II. the C3 and C0 bits are moved into ZF and CF. JAE. but on FPU registers). But I’m not sure. the FPU C3/C2/C0 status bits were placed there intentionally. And also. JBE. leave _b at top fstp st jmp short loc_804844A loc_8048448: . JNB. pop value at top of stack.

%st 0x080484ac <+12>: fcmovbe %st(1).1-ubuntu Copyright (C) 2013 Free Software Foundation. it leaves a in ST(0).6. License GPLv3+: GNU GPL version 3 or later <http://gnu. Otherwise (a > b). Reading symbols from /home/dennis/polygon/d_max. discarding the contents of ST(1). compare "a" and "b" fucomi st.7. st(1) . ST1=a fxch st(1) . FCMOVBE checks the flags and copies ST(1) (b here at the moment) to ST(0) (a here) if ST 0(a) <= ST 1(b). 0x080484a0 in d_max () (gdb) ni 0x080484a4 in d_max () (gdb) disas $eip Dump of assembler code for function d_max: 0x080484a0 <+0>: fldl 0x4(%esp) => 0x080484a4 <+4>: fldl 0xc(%esp) 0x080484a8 <+8>: fxch %st(1) 0x080484aa <+10>: fucomi %st(1). leave "a" in ST0 otherwise fcmovbe st. This GDB was configured as "i686-linux-gnu". load "a" fld QWORD PTR [esp+12] . So FUCOMI compares ST(0) (a) and ST(1) (b) and then sets some flags in the main CPU. For bug reporting instructions. ST1=b .14: Optimizing GCC 4. ST0=a. discard value in ST1 fstp st(1) ret I’m not sure why FXCH (swap operands) is here. Let’s trace this function in GDB: Listing 17.. st(1) . to the extent permitted by law.org/software/gdb/bugs/>. Probably it’s a compiler inaccuracy.%st 0x080484ae <+14>: fstp %st(1) 0x080484b0 <+16>: ret End of assembler dump.. Type "show copying" and "show warranty" for details. There is NO WARRANTY.CHAPTER 17. The last FSTP leaves ST(0) on top of the stack.8. load "b" .1 fld QWORD PTR [esp+4] . (gdb) ni 0x080484a8 in d_max () (gdb) info float R7: Valid 0x3fff9999999999999800 +1. Inc.8.. ST0=b.done.(no debugging symbols found).html> This is free software: you are free to change and redistribute it.15: Optimizing GCC 4..399999999999999911 R5: Empty 0x00000000000000000000 R4: Empty 0x00000000000000000000 R3: Empty 0x00000000000000000000 R2: Empty 0x00000000000000000000 R1: Empty 0x00000000000000000000 R0: Empty 0x00000000000000000000 Status Word: Control Word: 0x3000 TOP: 6 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) 248 .gnu. FLOATING-POINT UNIT 17.. please see: <http://www. COMPARISON EXAMPLE Listing 17. It’s possible to get rid of it easily by swapping the first two FLD instructions or by replacing FCMOVBE (below or equal) by FCMOVA (above).. (gdb) b d_max Breakpoint 1 at 0x80484a0 (gdb) run Starting program: /home/dennis/polygon/d_max Breakpoint 1.c -o d_max -fno-inline dennis@ubuntuvm:~/polygon$ gdb d_max GNU gdb (GDB) 7. copy ST1 ("b" here) to ST0 if a<=b .1 and GDB 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 dennis@ubuntuvm:~/polygon$ gcc -O3 d_max.org/licenses/gpl.199999999999999956 =>R6: Valid 0x4000d999999999999800 +3.

7.399999999999999911 =>R6: Valid 0x3fff9999999999999800 +1.399999999999999911 =>R6: Valid 0x4000d999999999999800 +3.CHAPTER 17.199999999999999956 R5: Empty 0x00000000000000000000 R4: Empty 0x00000000000000000000 R3: Empty 0x00000000000000000000 R2: Empty 0x00000000000000000000 R1: Empty 0x00000000000000000000 R0: Empty 0x00000000000000000000 Status Word: 0x3000 TOP: 6 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0fff Instruction Pointer: 0x73:0x080484a8 Operand Pointer: 0x7b:0xbffff118 Opcode: 0x0000 (gdb) disas $eip Dump of assembler code for function d_max: 0x080484a0 <+0>: fldl 0x4(%esp) 0x080484a4 <+4>: fldl 0xc(%esp) 0x080484a8 <+8>: fxch %st(1) => 0x080484aa <+10>: fucomi %st(1).%st 0x080484ac <+12>: fcmovbe %st(1). COMPARISON EXAMPLE RC: Round to nearest Tag Word: 0x0fff Instruction Pointer: 0x73:0x080484a4 Operand Pointer: 0x7b:0xbffff118 Opcode: 0x0000 (gdb) ni 0x080484aa in d_max () (gdb) info float R7: Valid 0x4000d999999999999800 +3. FLOATING-POINT UNIT 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 17. (gdb) ni 0x080484ac in d_max () (gdb) info registers eax 0x1 1 ecx 0xbffff1c4 -1073745468 edx 0x8048340 134513472 ebx 0xb7fbf000 -1208225792 esp 0xbffff10c 0xbffff10c ebp 0xbffff128 0xbffff128 esi 0x0 0 edi 0x0 0 eip 0x80484ac 0x80484ac <d_max+12> eflags 0x203 [ CF IF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51 (gdb) ni 0x080484ae in d_max () (gdb) info float R7: Valid 0x4000d999999999999800 +3.%st 0x080484ae <+14>: fstp %st(1) 0x080484b0 <+16>: ret End of assembler dump.399999999999999911 R5: Empty 0x00000000000000000000 R4: Empty 0x00000000000000000000 R3: Empty 0x00000000000000000000 R2: Empty 0x00000000000000000000 R1: Empty 0x00000000000000000000 R0: Empty 0x00000000000000000000 Status Word: Control Word: 0x3000 TOP: 6 0x037f IM DM ZM OM UM PM 249 .

FSTP leaves one value at the top of stack (line 136).3 (LLVM) (ARM mode) VMOV VMOV VCMPE. D17 .16: Optimizing Xcode 4.6.5.%st 0x080484ac <+12>: fcmovbe %st(1).2 ARM Optimizing Xcode 4.1 on page 215). D16 APSR_nzcv. You can also see the TOP register contents in Status Word (line 44) — it is 6 now. R0. R3 . D16 250 . The value of TOP is now 7.CHAPTER 17. The arrow (at line 35) points to the current top of the stack.%st => 0x080484ae <+14>: fstp %st(1) 0x080484b0 <+16>: ret End of assembler dump. so the FPU stack top is pointing to internal register 7. Inferior 1 [process 30194] will be killed. copy "b" to D16 R0. FLOATING-POINT UNIT 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 17. but internal the FPU registers (Rx). let’s execute the first two FLD instructions. b D17. R1 .399999999999999911 R6: Empty 0x4000d999999999999800 R5: Empty 0x00000000000000000000 R4: Empty 0x00000000000000000000 R3: Empty 0x00000000000000000000 R2: Empty 0x00000000000000000000 R1: Empty 0x00000000000000000000 R0: Empty 0x00000000000000000000 Status Word: 0x3800 TOP: 7 Control Word: 0x037f IM DM ZM OM UM PM PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x3fff Instruction Pointer: 0x73:0x080484ae Operand Pointer: 0x7b:0xbffff118 Opcode: 0x0000 (gdb) quit A debugging session is active. Quit anyway? (y or n) y dennis@ubuntuvm:~/polygon$ Using “ni”. Let’s see the flags: CF is set (line 95).F64 VMRS VMOVGT. Let’s examine the FPU registers (line 33). FCMOVBE has copied the value of b (see line 104).7. so the stack top is now pointing to internal register 6.3 (LLVM) (ARM mode) Listing 17. And GDB doesn’t show STx registers. R1.F64 VMOV D16.6. (gdb) ni 0x080484b0 in d_max () (gdb) info float =>R7: Valid 0x4000d999999999999800 +3. FUCOMI is executed (line 83). FPSCR D16. As I wrote before. COMPARISON EXAMPLE PC: Extended Precision (64-bits) RC: Round to nearest Tag Word: 0x0fff Instruction Pointer: 0x73:0x080484ac Operand Pointer: 0x7b:0xbffff118 Opcode: 0x0000 (gdb) disas $eip Dump of assembler code for function d_max: 0x080484a0 <+0>: fldl 0x4(%esp) 0x080484a4 <+4>: fldl 0xc(%esp) 0x080484a8 <+8>: fxch %st(1) 0x080484aa <+10>: fucomi %st(1). R2. a D17. The values of a and b are swapped after FXCH is executed (line 54).7. the FPU registers set is a circular buffer rather than a stack ( 17. 17.

6. the ARM coprocessor has its own status and flags register. Otherwise.3 (LLVM) (thumb-2 mode) Listing 17. ``IT GT'' implies that the next instruction is to be executed.7. And if there was be. (FPSCR20 ).F64 VMOV BX D16. R0. which is also from “Angry Birds”: Listing 17.PC} Four “T” symbols in the instruction mnemonic mean that the four subsequent instructions are to be executed if the condition is true. there are no conditional jump instruction in ARM. if the GT (Greater Than) condition is true. If it gets executed. then the suffixes would have been set as follows: -EQ -NE -NE -NE 20 (ARM) 21 (ARM) Floating-Point Status and Control Register Application Program Status Register 251 . as in previous example. R2. D16 LR Almost the same as in the previous example. thumb-2 was extended to make it possible to specify predicates to old thumb instructions. the usual VMOV is encoded there. (The inverse condition of NE is EQ (equal)). V) from the coprocessor status word into bits of the general status register (APSR21 ).6. that can check bits in the status register of the coprocessor. a D17. Optimizing Xcode 4. Here is a more complex code fragment.17: Optimizing Xcode 4. the value of b is to be written into D16 (that is currently stored in in D17 ). And just like in x86. D17 ITE stands for if-then-else and it encodes suffixes for the next two instructions. R3 .19: Angry Birds Classic ITTTT EQ MOVEQ ADDEQ POPEQ. D17 R0. The IT instruction defines a so-called if-then block. we see the VMOVGT instruction.3 (LLVM) (thumb-2 mode) VMOV VMOV VCMPE. R3. each of them has a predicate suffix. However. the value of a stays in the D16 register. which copies 4 bits (N. Z. VMOVGT is the analog of the MOVGT. from “Angry Birds” (for iOS): Listing 17. Here.18: Angry Birds Classic ITE NE VMOVNE VMOVEQ R2. The penultimate instruction VMOV prepares the value in the D16 register for returning it via the R0 and R1 register pair. Just like in the x86 coprocessor. SP. R1 . it executes if one operand is greater than the other while comparing (GT—Greater Than). R4 SP. instruction for D-registers. b D17. After the instruction it is possible to place up to 4 instructions. R3. in the IDA-generated listing. FPSCR D16. D16 APSR_nzcv. since there is a ``IT GT'' instruction placed right before it. FLOATING-POINT UNIT BX 17. That’s why IDA adds the -EQ suffix to each one of them. There is no space in the 16-bit instructions for 4 more bits in which conditions can be encoded. ITEEE EQ (if-then-else-else-else). In our example. #0x20 {R8. R1. by the way. But there is no such thing in thumb mode. One more that’s slightly harder. for example.CHAPTER 17. but IDA adds the -GT suffix to it. The first instruction executes if the condition encoded in ITE (NE. In fact. many instructions in ARM mode can be supplemented by condition predicate. D16 R2. The input values are placed into the D17 and D16 registers and then compared using the VCMPE instruction. however slightly different. C. and the second —if the condition is not true.R10} {R4-R7. so there is VMRS . COMPARISON EXAMPLE LR A very simple case.F64 VMRS IT GT VMOVGT. since there is a need to store coprocessor-specific flags.W POPEQ R0. not equal) is true at. As we already know.

SP. SP SP.3 (LLVM) (ARM mode) Listing 17. D17 APSR_nzcv. ITT. R2 R5. [SP. [SP+0x20+b]. ITTE.#4 LR loc_2E08 loc_2E10 Almost the same as we already saw. if you program in ARM assembly language manually for thumb-2 mode.W NEGLE MOVGT R0. D16 SP. R7 R7. #0xFFFFFFFF R10.#saved_R7]! R7. FPSCR loc_2E08 D16. Non-optimizing Xcode 4.7.#0x20+b] D16. R1 D17. [SP. ITTTT.#0x20+a] D16. Compilers usually don’t generate all possible combinations. R3 D17. so I created them with an option to show 4 bytes for each opcode . ITE.F64 VMRS BLE VLDR VSTR B R7. knowing the high part of the 16-bit opcode (IT is 0xBF).LR} R4. SP. R3 252 . R0 ITTE (if-then-then-else) implies that the 1st and 2nd instructions are to be executed if the LE (Less or Equal) condition is true. [SP. R1.20: Angry Birds Classic CMP. [SP. [SP. R0 R10.#0x20+b] D16. R2. #0x1C SP. and the 3rd—if the inverse condition (GT—Greater Than) is true. [SP. R0.#0x20+val_to_return] VLDR VMOV MOV LDR BX D16. [SP. and you add conditional suffixes.W ITTE LE SUBLE. [SP. ITTT.#0x20+val_to_return] R0. R0. For example. [SP.21: Non-optimizing Xcode 4. in the mentioned “Angry Birds” game (classic version for iOS) only these variants of the IT instruction are used: IT.22: Optimizing Keil 6/2013 (thumb mode) PUSH MOVS MOVS {R3-R7.6. [SP. How did I learn this? In IDA It is possible to produce listing files. COMPARISON EXAMPLE Another fragment from “Angry Birds”: Listing 17. the assembler will add the IT instructions automatically with the required flags where it is necessary. FLOATING-POINT UNIT 17.#0x20+a] D16.#0x20+b] D16. #7 D16.lst | grep " BF" | grep "IT" > results.lst By the way. as well as the return value. but there is too much redundant code because the a and b variables are stored in the local stack. Then.3 (LLVM) (ARM mode) b a val_to_return saved_R7 = = = = -0x20 -0x18 -0x10 -4 STR MOV SUB BIC VMOV VMOV VSTR VSTR VLDR VLDR VCMPE.#0x20+val_to_return] loc_2E10 VLDR VSTR D16.6.CHAPTER 17. I did the following using grep: cat AngryBirdsClassic. Optimizing Keil 6/2013 (thumb mode) Listing 17.#0x20+a] D17. #1 R0.

result in X0 fmov d0. d1 fcsel d0. D1 . R4 R1. now result in D0 ret The ARM64 ISA has FPU-instructions which set APSR the CPU flags instead of FPSCR. so the following BCS (Carry set . R0 R7. [sp. #16 str d0. a<=b. R6 R1.L74: .PC} loc_1C0 Keil doesn’t generate FPU-instructions since it cannot rely on them being supported on the target CPU. d1 ble . reload values ldr x1. 16 ret 253 . If the condition is not true. d1.e. and it cannot be done by straightforward bitwise comparing. x1 fmov d1. x0 . R7 {R3-R7.. D0 .a. The result of the comparison is to be left in the flags by this function. C. FCSEL (Floating Conditional Select) copies the value of D0 or D1 into D0 depending on the condition (GT (Greater Than) here). x0 .8] b . [sp. the value of D1 is copied into D0. gt .b fcmpe d0.L76 . save input arguments in "Register Save Area" sub sp. load D1 (b) into X0 ldr x0.L76: . compared to the instruction set in older CPUs.9 d_max: . it compares the two values passed in D0 and D1 (which are the first and second arguments of the function) and sets APSR flags (N. [sp] . D0 . Non-optimizing GCC (Linaro) 4.L74 . R1 __aeabi_cdrcmple loc_1C0 R0.PC} MOVS MOVS POP R0.b fcmpe d0. Here we see FCMPE. D1 . result in D0 add sp.8] str d1. N. sp. If the condition is true (GT) then the value of D0 is copied into D0 (i. sp.8] ldr x0. So it calls an external library function to do the comparison: __aeabi_cdrcmple. [sp] . nothing happens).a.Greater than or equal) instruction can work without any additional code.B.CHAPTER 17. V).3 ARM64 Optimizing GCC (Linaro) 4. COMPARISON EXAMPLE MOVS MOVS BL BCS MOVS MOVS POP R6. 17. TheFPU is not a separate device here anymore (at least. it uses flags in APSR register instead of FPSCR. a>b. FLOATING-POINT UNIT 17. load D0 (a) into X0 ldr x0. This is much more convenient. [sp] fmov d0. logically). Z.7.9 d_max: .7. R5 {R3-R7. for convenience. [sp. d0. and again.

Finally. float b) { if (a>b) return a. Listing 17. A lot of redundant code.d $f0. S0 . now result in S0 ret It is the same code. 254 .d $f14. “LT” is the condition “Less Than”. s1. }. set return value to "b": mov. $zero . float f_max (float a. jump to locret_14 if condition bit is set bc1t locret_14 .5 (IDA) d_max: . this instruction is always executed (set return value to "a"): mov. These bits are located in the register called FCCR. Earlier MIPS-es have only one condition bit (called FCC0). the value of a gets loaded into X0. return b.LT. because the return value needs to be in this register. “T” mean that the jump is to be taken if the bit is set (“True”). In the first case (a > b). COMPARISON EXAMPLE Non-optimizing GCC is more verbose. you can try optimizing this piece of code manually by removing redundant instructions and not introducing new ones (including FCSEL).CHAPTER 17.d $f0..D” compares two values.23: Optimizing GCC 4. later ones have 8 (called FCC7-FCC0). 17. the value of b gets loadedin X0. this instruction is executed only if branch was not taken (i. Depending on the jump. Optimizing GCC (Linaro) 4. s0. FLOATING-POINT UNIT 17. “BC1T” checks the FCC0 bit and jumps if the bit is set.4. NOP “C.7. f_max: . if b>=a) . but that is how non-optimizing compilers work. branch delay slot.a. At this moment. “D” implies values of type double. branch delay slot . $zero . NOP . the function saves its input argument values in the local stack (Register Save Area).9—float I also rewrote this example to use float instead of double. In the other case (a <= b).7. $f12 . FCMPE compares the values and sets the APSR flags. gt .4 MIPS The co-processor of the most popular MIPS processor has only one condition bit which can be set in the FPU and checked in the CPU.b fcmpe s0. set FPU condition bit if $f14<$f12 (b<a): c. so it proceed using old methods: using the BLE instruction (Branch if Less than or Equal). Then the code reloads these values into registers X0/X1 and finally copies them to D0/D1 to be compared using FCMPE. one of function arguments is placed into $F0. $f14 locret_14: jr or $ra $at. the FCC0 condition bit is either set or cleared. S1 . It’s because numbers of type float are passed in 32-bit S-registers (which are in fact the lower parts of the 64-bit D-registers). There is also the instruction “BC1F” which jumps if the bit is cleared (“False”). but the S-registers are used instead of D. s1 fcsel s0. Exercise As an exercise. First.ones. the value from X0 gets copied into D0. $f12 or $at.lt.e. Depending on the result of the comparison. the compiler is not thinking yet about the more convenient FCSEL instruction.

d3 d3.00000000 d0. d2. CALCULATORS AND REVERSE POLISH NOTATION Stack. d0.10.d0. d2 d2. and this was much simpler than to handle complex expressions with parentheses.9 (ARM64) f: fadd fmov fadd fadd fadd fdiv ret d0.F64 VMOV. then press “plus” sign. .10 Exercises 17.F64 VADD.d2 d0.8.0e+0 d0.25: Non-optimizing Keil 6/2013 (thumb mode / compiled for Cortex-R4F CPU) f PROC VADD. d4 d0. 5 8 8 8 8 8 _a1$[esp-4] _a2$[esp-4] _a3$[esp-4] _a4$[esp-4] _a5$[esp-4] __real@4014000000000000 Listing 17.26: Optimizing GCC 4.7.d1 lr Listing 17. STACK.d0. FLOATING-POINT UNIT 17. It’s because old calculators were just stack machine implementations. calculators and reverse Polish notation Now we undestand why some old calculators used reverse Polish notation 22 .9 x64 On how float point numbers are processed in x86-64.d4 d0.F64 BX ENDP d0.d3 d2.#5.d0. d0. one has to enter 12. d3.CHAPTER 17. 17. d0.F64 VADD. then 34. For example. 17. d1 5.d0.2 Exercise #2 What does this code do? Listing 17. d1.F64 VADD. . 17. . .F64 VDIV.d2. read more here: 27 on page 434. size size size size size QWORD QWORD QWORD QWORD QWORD QWORD 0 = = = = = PTR PTR PTR PTR PTR PTR .10.8 17.1 Exercise #1 Eliminate the FXCH instruction in the example 17. for addition of 12 and 34.24: Optimizing MSVC 2010 __real@4014000000000000 DQ 04014000000000000r _a1$ _a2$ _a3$ _a4$ _a5$ _f = = = = = _f 8 16 24 32 40 PROC fld fadd fadd fadd fadd fdiv ret ENDP .1 on page 247 and test it. d1 22 wikipedia 255 .d1 d1.

$ra $f0. $f3. FLOATING-POINT UNIT 17.d lwc1 or lwc1 jr div. $f3. arg_14($sp) $f12. $f2. $f2. $f0. $f2. $f0.d lwc1 or lwc1 or add.word 0 # DATA XREF: f+C # f+44 # DATA XREF: f+3C Answer G. $f14 arg_10($sp) ($LC0 >> 16) $f2. $at.d lwc1 lui add. $at.d $f0. EXERCISES Listing 17.4. $at.9 on page 955. $f0 arg_1C($sp) $zero arg_18($sp) $zero $f2 arg_24($sp) $zero arg_20($sp) $zero $f2 dword_6C $zero $LC0 $f2 $LC0: .27: Optimizing GCC 4. $at. $f0. $f3.1.10. $f1. 256 .d lwc1 or lwc1 or add. $at.word 0x40140000 dword_6C: .CHAPTER 17. $v0.5 (MIPS) (IDA) f: arg_10 arg_14 arg_18 arg_1C arg_20 arg_24 = = = = = = 0x10 0x14 0x18 0x1C 0x20 0x24 lwc1 add. $f2.

DWORD PTR _i$[ebp] mov DWORD PTR _a$[ebp+edx*4]. 84 . for (i=0. DWORD PTR _i$[ebp] add eax.h> int main() { int a[20].CHAPTER 18. 00000014H jge SHORT $LN4@main mov ecx. return 0. ecx jmp SHORT $LN5@main $LN4@main: 1 AKA2 “homogeneous container” 257 .1. size = 80 _main PROC push ebp mov ebp. 18. esp sub esp. 1 mov DWORD PTR _i$[ebp]. DWORD PTR _i$[ebp] shl ecx. 0 jmp SHORT $LN6@main $LN5@main: mov eax. i<20. ARRAYS Chapter 18 Arrays An array is just a set of variables in memory that lie next to each other and that have the same type 1 . }. a[i]).1 Simple example #include <stdio. i<20.1 x86 MSVC Let’s compile: Listing 18. int i. i++) a[i]=i*2. 20 . eax $LN6@main: cmp DWORD PTR _i$[ebp]. i++) printf ("a[%d]=%d\n". i. 00000054H mov DWORD PTR _i$[ebp]. 18. 1 mov edx. size = 4 _a$ = -80 .1: MSVC 2008 _TEXT SEGMENT _i$ = -84 . for (i=0.

ebp ebp 0 ENDP Nothing very special. eax DWORD PTR _i$[ebp]. ARRAYS mov jmp $LN2@main: mov add mov $LN3@main: cmp jge mov mov push mov push push call add jmp $LN1@main: xor mov pop ret _main 18. DWORD PTR _a$[ebp+ecx*4] edx eax. DWORD PTR _i$[ebp] eax. SIMPLE EXAMPLE DWORD PTR _i$[ebp].1. 0 SHORT $LN3@main eax.1 on page 206. 258 . 12 . DWORD PTR _i$[ebp] edx. 00000014H SHORT $LN1@main ecx.2. The shl ecx. 1 instruction is used for value multiplication by 2 in ECX.CHAPTER 18. eax esp. DWORD PTR _i$[ebp] eax OFFSET $SG2463 _printf esp. 80 bytes are allocated on the stack for the array. 0000000cH SHORT $LN2@main eax. just two loops: the first is a filling loop and second is a printing loop. 20 elements of 4 bytes. 1 DWORD PTR _i$[ebp]. 20 . more about below 16.

We see how the array gets filled: each element is 32-bit word of int type and its value is the index multiplied by 2: Figure 18. [esp+70h+i] edx.1 main public main proc near var_70 var_6C var_68 i_2 i = = = = = dword dword dword dword dword ptr ptr ptr ptr ptr . 0FFFFFFF0h esp. edx [esp+70h+i]. we can see all its 20 elements there. SIMPLE EXAMPLE Let’s try this example in OllyDbg. esp esp. edx=i*2 [esp+eax*4+70h+i_2].1. 70h [esp+70h+i].4. DATA XREF: _start+17 -70h -6Ch -68h -54h -4 push mov and sub mov jmp ebp ebp. GCC Here is what GCC 4. ARRAYS 18. i=0 loc_80483F7: 259 . [esp+70h+i] edx. 0 short loc_804840A mov mov add mov add eax.CHAPTER 18.1 does: Listing 18.1: OllyDbg: after array filling Since this array is located in the stack. edx . i++ .4. 1 .2: GCC 4.

LSL#1 R0. {R4. eax _printf [esp+70h+i]. "a[%d]=%d\n" [esp+70h+var_68].LSL#2] R4.1. i MOV STR ADD R0.LR} SP. 13h short loc_804841B eax. i=i+1 loc_4C4 260 . aADD __2printf R4. If you index this pointer as a[idx].1. R4.LSL#2] . i R2. but it’s more correct to say that a pointer to the first element of the array is passed (the addresses of rest of the elements are calculated in an obvious way). An interesting example: a string of characters like “string” is an array of characters and it has a type of const char[].2 ARM Non-optimizing Keil 6/2013 (ARM mode) EXPORT _main _main STMFD SUB SP!. #0x50 MOV B R4. edx [esp+70h+var_70]. 13h short loc_80483F7 [esp+70h+i]. And that is why it is possible to write things like ``string''[i] —this is a correct C/C++ expression! 18. [esp+eax*4+70h+i_2] eax. first loop loc_494 loc_4A0 . (second printf argument) R2=*(SP+R4<<4) (same as R1. [SP. ARRAYS 18. R4. SP. yes. #20 loc_494 . R4. R4 R0. SIMPLE EXAMPLE loc_804840A: cmp jle mov jmp [esp+70h+i]. #0 loc_4C4 .CHAPTER 18. 0 loc_804841B: loc_8048441: main By the way. second loop MOV B loc_4B0 LDR *(SP+R4*4)) MOV ADR BL ADD . [esp+70h+i] edx. variable a is of type int* (the pointer to int) —you can pass a pointer to an array to another function. "a[%d]=%d\n" . #1 .R4. R0=R4*2 . i=i+1 CMP BLT R4. An index can also be applied to this pointer. offset aADD . [esp+70h+i] [esp+70h+var_6C]. 0 short loc_8048441 mov mov mov mov mov mov mov call add eax. idx is just to be added to the pointer and the element placed there (to which calculated pointer is pointing) is to be returned. (first printf argument) R1=i . 1 cmp jle mov leave retn endp [esp+70h+i]. run loop body again R4. allocate place for 20 int variables . i<20? .R4. store R0 to SP+R4<<2 (same as SP+R4*4) . edx edx. [SP. #0 loc_4A0 . #1 .

#20 R1. #0 . R4 is i. . set stack frame (FP=SP) 261 . #0x54 . #1 R0. . allocate place for 20 int variables + one more variable SUB SP. The second loop has an inverse ``LDR R2. [SP. i=i+1 .PC} .1 (ARM64) . #1 CMP R4. The number that is to be written into the array is calculated as i ∗ 2. i<20? . R0.9. #20 loc_4B0 R0.R4. i=0 loc_1CE R1=i<<1 (same as i*2) R2=i<<2 (same as i*4) i=i+1 i<20? store R1 to *(R5+R2) (same R5+i*4) yes. SP . allocated for 20 int variables int type requires 32 bits for storage (or 4 bytes). The compiler allocates slightly more space in the local stack. pointer to first array element LSLS LSLS ADDS CMP STR BLT R1. aADD BL __2printf ADDS R4. R4 ADR R0. R4. run loop body again value to return deallocate chunk.string "a[%d]=%d\n" main: . #0x54 POP {R4. #0x50 SP!. #20 BLT loc_1DC MOVS R0.R5. i . "a[%d]=%d\n" . [sp. R0. it loads the value we need from the array. SP. yes. #2 LDR R2. allocated for ADD SP. . R0. SIMPLE EXAMPLE R4. #0 R5. . and the pointer to it is calculated likewise. .R4. run loop body again . the last 4 bytes are not used. So shifting i left by 2 bits is effectively equivalent to multiplication by 4 (since each array element has a size of 4 bytes) and then it’s added to the address of the start of the array. i<20. i<20.LC0: .PC} . -112]! . first loop MOVS MOV R0. R0=i<<2 (same as i*4) .R2] loc_1CE . SP. SP. #0 . the loop iterator i is placed in the R4 register. ARRAYS CMP BLT MOV ADD LDMFD 18. ``STR R0. . value to return 20 int variables + one more variable Thumb code is very similar. . i<20? yes. save FP and LR in stack frame: stp x29. So that is why the ``SUB SP. run loop body again . #2 R0. #0 SP. {R4. [SP.CHAPTER 18. [R5. SP. deallocate chunk. which calculates the value to be written into the array and the address of each element in the array as well. Here is how a pointer to array element is calculated: SP points to the start of the array.1.R0] MOVS R1. R4. . In both the first and second loops.1 (ARM64) Listing 18.9. #0x50'' instruction in the function’s prologie allocates exactly this amount of space in the stack. second loop MOVS loc_1DC LSLS R0.LR} . #1 R2. so ``MOV R0. R4. [R5. load from *(R5+R0) (same as R5+i*4) .3: Non-optimizing GCC 4. Non-optimizing GCC 4.LSL#2]'' writes the contents of R0 into the array. x30. however. Optimizing Keil 6/2013 (thumb mode) _main PUSH {R4.R5. so to store 20 int variables 80 (0x50) bytes are needed. Thumb mode has special instructions for bit shifting (like LSLS).LSL#2]'' instruction. R4. which is effectively equivalent to shifting it left by one bit.LSL#1'' instruction does this.

108] .108] add w0. [x29. [x29. .lsl 2] . increment "i" variable: ldr w0. .108] cmp w0. [x29. . load value of "i" variable: ldr w0.L2: .3 MIPS The function uses a lot of S.1.L5: . [x29.108] . ARRAYS 18. x0. w0. . 1 . [x29. 262 .108] . W2 still contains the value of array element which was just loaded. 0 .x1. [x29. calculate array address: add x0.108] . load 32-bit integer from local stack and sign extend it to 64-bit one: ldrsw x1. SIMPLE EXAMPLE add x29. 19 . 19 . setting initial counter variable at 0 (WZR is the register always holding zero): str wzr. find a place of an array in local stack: add x0. restore FP and LR: ldp x29. 24 . x29. jump to loop condition checking code: b . load "i" variable to W1 and pass it to printf() as second argument: ldr w1. increment counter (i): ldr w0.L4 . [x0. multiplicate it by 2: lsl w2.L5 . [x29. load value from the array at the address (X0+X1<<2 = address of array + i*4) ldr w2. because the same local variable (i) is being used as counter. by the way. jump to L3 (loop body begin) if not: ble .L4: . so that’s why its values are saved in the function prologue and restored in the epilogue. . 1 str w0. w0.L3 .1. call printf(): bl printf .L2 .LC0 . check if we finished: ldr w0. [sp]. [x29. [x29. w0.108] add w0.108] . [x0.registers which must be preserved. [x29. second part of the function begins here.lsl 2] . the same place in the local stack was used for counter.108] . jump to the loop body begin if not: ble . 112 ret 18. load address of the "a[%d]=%d\n" string: adrp x0.CHAPTER 18.L3: . load "i" value: ldrsw x1.x1. [x29. 0 . sp.108] cmp w0. setting initial counter variable at 0.108] . are we finished? ldr w0. 1 str w0.108] b . return 0 mov w0. 24 . str wzr. :lo12:.LC0 add x0. x29. [x29. x30. calculate address of element (X0+X1<<2=array address+i*4) and store W2 (i*2) there: str w2.

1 lw $gp. 0x80+var_4($sp) move $v0. increase value to be stored by 2 at each iteration: addiu $v0.5 (IDA) main: var_70 = -0x70 var_68 = -0x68 var_14 = -0x14 var_10 = -0x10 var_C = -0xC var_8 = -8 var_4 = -4 . add 4 to address anyway: addiu $v1. "i" variable will reside in $s0: move $s0.CHAPTER 18. 0x80+var_8($sp) sw $s2. increment "i": addiu $s0. $LC0 # "a[%d]=%d\n" . 0x80+var_8($sp) lw $s2.4: Optimizing GCC 4. $zero lw $s3. . loop terminator reached? bne $v0. 0x80+var_4($sp) sw $s3. (__gnu_local_gp & 0xFFFF) sw $ra. SIMPLE EXAMPLE Listing 18. 0x80+var_70($sp) . 0x28 # '(' loc_34: # CODE XREF: main+3C . 0x80+var_70($sp) addiu $s1. 0x80+var_C($sp) sw $s1. jump to loop body if end is not reached: bne $s0. 4 . that value will be used as a loop terminator. call printf(): # CODE XREF: main+70 lw lw move move jalr $t9. -0x80 la $gp. it was precalculated by GCC compiler at compile stage: li $a0. (__gnu_local_gp >> 16) addiu $sp. store value into memory: sw $v0. 0x80+var_14($sp) jr $ra addiu $sp. 0x14 loc_54: . function epilogue lw $ra. 0x80+var_C($sp) lw $s1. 0x80+var_10($sp) sw $s0. 2 . $zero . $a0. 4 .ascii "a[%d]=%d\n"<0> # DATA XREF: main+44 263 . loc_34 .1. $sp. $a1. 0x80 $LC0: . 0x80+var_14($sp) sw $gp. $t9 (printf & 0xFFFF)($gp) 0($s1) $s0 $s3 . $zero li $s2. $a2. 0($v1) . 0x80+var_68 move $v1. $s2. loc_54 . second loop begin la $s3. move memory pointer to the next 32-bit word: addiu $s1. $s1 move $v0. $a0. ARRAYS 18.4. array filling loop is ended . 0x80+var_10($sp) lw $s0. function prologue: lui $gp.

84 mov DWORD PTR _i$[ebp]. BUFFER OVERFLOW Something interesting: there are two loops and the first one doesn’t needs i. return 0. which could check if it is less than 20. size = 4 _a$ = -80 . ARRAYS 18. 1 mov DWORD PTR _i$[ebp]. Compilation results (MSVC 2008): Listing 18. Their goal is to get rid of of multiplications. a[20]).2. DWORD PTR _i$[ebp] shl ecx.5: Non-optimizing MSVC 2008 $SG2474 DB 'a[20]=%d'. eax $LN3@main: cmp DWORD PTR _i$[ebp]. 1 mov edx.2. eax mov esp. 00H _i$ = -84 . for (i=0. array indexing is just array[index]. esp sub esp. size = 80 _main PROC push ebp mov ebp. printf ("a[20]=%d\n".2 Buffer overflow 18. Here is a code that successfully compiles and works: #include <stdio. DWORD PTR _a$[ebp+80] push eax push OFFSET $SG2474 . one (in $V0) increasing by 2 each time. So here we see two variables. DWORD PTR _i$[ebp] add eax. i<20. you’ll probably note the missing index bounds checking. int i. If you study the generated code closely. ebp pop ebp ret 0 _main ENDP 264 . 'a[20]=%d' call DWORD PTR __imp__printf add esp. 0 jmp SHORT $LN3@main $LN2@main: mov eax. ecx jmp SHORT $LN2@main $LN1@main: mov eax. 8 xor eax. The second loop is where printf() is called and it reports the value of i to the user. and another (in $V1) — by 4. 18. DWORD PTR _i$[ebp] mov DWORD PTR _a$[ebp+edx*4].h> int main() { int a[20]. That reminds us of loop optimizations we considered earlier: 39 on page 480. so there is a variable which is increased by 1 each time (in $S0) and also a memory address (in $S1) increased by 4 each time. i++) a[i]=i*2.CHAPTER 18.1 Reading outside array bounds So. }. What if the index is 20 or greater? That’s the one C/C++ feature it is often blamed for. 20 jge SHORT $LN1@main mov ecx. it needs only i ∗ 2 (increased by 2 at each iteration) and also the address in memory (increased by 4 at each iteration). 0aH.

80 bytes away from its first element.CHAPTER 18. ARRAYS _TEXT END 18. I get: Figure 18. BUFFER OVERFLOW ENDS When I run it.2. 265 .2: OllyDbg: console output It is just something that was lying in the stack near to the array.

3: OllyDbg: reading of the 20th element and execution of printf() What is this? Judging by the stack layout. 266 . Let’s load and find the value located right after the last array element: Figure 18. using OllyDbg.CHAPTER 18. this is the saved value of the EBP register. ARRAYS 18. BUFFER OVERFLOW Let’s try to find out where did this value come from.2.

how it could be different? The compiler may generate some additional code to check the index value to be always in the array’s bounds (like in higher-level programming languages3 ) but this makes the code slower.CHAPTER 18. for (i=0. 18. etc 267 . i<30. we read some values from the stack illegally.6: Non-optimizing MSVC 2008 _TEXT 3 Java. ARRAYS 18. MSVC And what we get: Listing 18.2.h> int main() { int a[20].2. return 0.2 Writing beyond array bounds OK. but what if we could write something to it? Here is what we have got: #include <stdio. int i. }. SEGMENT Python. i++) a[i]=i.4: OllyDbg: restoring value of EBP Indeed. BUFFER OVERFLOW Let’s trace further and see how it gets restored: Figure 18.

0 jmp SHORT $LN3@main $LN2@main: mov eax. 1 mov DWORD PTR _i$[ebp]. Let’s see where exactly does it is crash. ebp pop ebp ret 0 _main ENDP The compiled program crashes after running. size = 4 _a$ = -80 . ARRAYS 18. 84 mov DWORD PTR _i$[ebp]. DWORD PTR _i$[ebp] . eax $LN3@main: cmp DWORD PTR _i$[ebp]. that instruction is obviously redundant mov DWORD PTR _a$[ebp+ecx*4]. ECX could be used as second operand here instead jmp SHORT $LN2@main $LN1@main: xor eax. 30 . BUFFER OVERFLOW _i$ = -84 . esp sub esp. 0000001eH jge SHORT $LN1@main mov ecx. 268 . edx . size = 80 _main PROC push ebp mov ebp.CHAPTER 18. DWORD PTR _i$[ebp] add eax. No wonder. eax mov esp.2. DWORD PTR _i$[ebp] mov edx.

and trace until all 30 elements are written: Figure 18. ARRAYS 18. BUFFER OVERFLOW Let’s load it into OllyDbg.CHAPTER 18.2.5: OllyDbg: after restoring the value of EBP 269 .

That’s the stack layout while the control is in main(): ESP ESP+4 ESP+84 ESP+88 4 bytes allocated for i variable 80 bytes allocated for a[20] array saved EBP value return address a[19]=something statement writes the last int in the bounds of the array (in bounds so far!) a[20]=something statement writes something to the place where the value of EBP is saved. Welcome! It is called a buffer overflow 4 . It is also interesting that the EBP register contain 0x14. the function epilogue restores the original EBP value.CHAPTER 18. BUFFER OVERFLOW Trace until the function end: Figure 18. something is appearing next to _i. which was called main()). EIP is 0x15 now. ECX and EDX —0x1D. The RET instruction takes the return address from the stack (that is the address in CRT). Then. which is effectively equivalent to POP EIP instruction. The CPU traps at address 0x15. and 21 iss stored there (0x15 in hexadecimal). the value in the EBP register was saved on the stack. After the control flow was passed to main(). (20 in decimal is 0x14 in hexadecimal). 20 was written in the 20th element. but there is no executable code there. 84 bytes were allocated for the array and the i variable. In our case.2. but OllyDbg can’t disassemble at 0x15 Now please keep your eyes on the registers. ARRAYS 18.6: OllyDbg: EIP was restored. That’s (20+1)*sizeof(int). At the function end. ESP now points to the _i variable in the local stack and after the execution of the next PUSH something. Let’s study stack layout a bit more. so exception gets raised. Please take a look at the register state at the moment of the crash. 4 wikipedia 270 . It is not a legal address for code —at least for win32 code! We got there somehow against our will. Then RET gets executed.

create a long string deliberately and pass it to the program. It’s not that simple in reality.4. to the function.CHAPTER 18. If we run this in the GDB debugger. regardless of the C/C++ programmers’ negligence. esp esp. 0 short loc_80483D1 mov mov mov add eax. edx [ebp+i]. 0 loc_80483C3: loc_80483D1: main Running this in Linux will produce: Segmentation fault. 96 [ebp+i]. Segmentation fault. 18. 1Dh short loc_80483C3 eax. and you’ll able to point the program to an address to which it must jump. since the stack layout is slightly different too. [ebp+i] [ebp+eax*4+a]. 0x00000016 in ?? () (gdb) info registers eax 0x0 0 ecx 0xd2f96388 -755407992 edx 0x1d 29 ebx 0x26eff4 2551796 esp 0xbffff4b0 0xbffff4b0 ebp 0x15 0x15 esi 0x0 0 edi 0x0 0 eip 0x16 0x16 eflags 0x10202 [ IF RF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51 (gdb) The register values are slightly different than in win32 example. We get: main public main proc near a i = dword ptr -54h = dword ptr -4 push mov sub mov jmp ebp ebp. BUFFER OVERFLOW PROTECTION METHODS Replace the int array with a string (char array). MSVC has options like6 : 5 Classic article about it: [One96].1. 60h . compiler-side buffer overflow protection methods: 6 Wikipedia: wikipedia 271 . 1 cmp jle mov leave retn endp [ebp+i]. ARRAYS 18. [ebp+i] edx.3 Buffer overflow protection methods There are several methods to protect against this scourge. we get this: (gdb) r Starting program: /home/dennis/RE/1 Program received signal SIGSEGV. but that is how it emerged 5 GCC Let’s try the same code in GCC 4. which doesn’t check the length of the string and copies it in a short buffer.3.

3. #ifdef __GNUC__ snprintf (buf. If we compile our very simple array example ( 18. DWORD PTR gs:20 DWORD PTR [ebp-12].7.2.LC0 DWORD PTR [esp+4]. 600 DWORD PTR [esp]. 2. 2 DWORD PTR [esp+12]. 2. "hi! %d.1 on page 257) in MSVC with RTC1 and RTCs option. but stop (or hang). %d. "hi! %d. 600. GCC 4. %d. // MSVC #endif puts (buf).string "hi! %d.CHAPTER 18. OFFSET FLAT:. 3).3 . 3). you can see a call to @_RTC_CheckStackVars@8 a function at the end of the function that checks if the “canary” is correct. }. %d\n" .L5: 7 wikipedia 272 . 3 DWORD PTR [esp+16].h> void f() { char *buf=(char*)alloca (600). DWORD PTR [ebp-4] call __stack_chk_fail . ebx puts eax. eax _snprintf DWORD PTR [esp]. they were used by miners in the past days in order to detect poisonous gases quickly. Let’s take an alloca() ( 5.4 on page 25) example: #ifdef __GNUC__ #include <alloca. DWORD PTR gs:20 . without any additional options. [esp+39] ebx. canary .7: GCC 4. %d. Let’s see how GCC handles this. 1. 1. eax eax. but that is much better than a remote attack to your host. If value is not the same. This random value is called a “canary” sometimes. // GCC #else _snprintf (buf. -16 DWORD PTR [esp+20]. By default. %d\n". %d\n". esp ebx esp. %d\n" f: push mov push sub lea and mov mov mov mov mov mov mov mov xor call mov call mov xor jne mov leave ret ebp ebp. or even die. ARRAYS 18. 1 DWORD PTR [esp+8]. BUFFER OVERFLOW PROTECTION METHODS /RTCs Stack Frame runtime checking /GZ Enable stack checks (/RTCs) One of the methods is to write a random value between the local variables in stack at function prologue and to check it in function epilogue before the function exits. DWORD PTR [ebp-12] eax. ebx eax. "hi! %d. they become very agitated in case of danger.3 inserts a “canary” check into the code: Listing 18. The process will halt.7.LC0: . do not execute the last instruction RET. %d. 600. 676 ebx.h> // GCC #else #include <malloc. check canary . it is related to the miners’ canary7 . Canaries are very sensitive to mine gases.L5 ebx.h> // MSVC #endif #include <stdio.

6.so. ARRAYS 18. More information can be found in the Linux kernel source code (at least in 3.1 on page 257).so.so.17.3.h this variable is described in the comments.6(__fortify_fail+0x63)[0xb7699bc3] /lib/i386-linux-gnu/libc.6(__sprintf_chk+0x2f)[0xb7697fef] . To say it briefly. Today.so.so b777e000-b777f000 r--p 0001f000 08:01 1050794 /lib/i386-linux-gnu/ld-2. It gets written on the stack and then at the end of the function the value in the stack is compared with the correct “canary” in gs:20.6(__libc_start_main+0xf5)[0xb75ac935] ======= Memory map: ======== 08048000-08049000 r-xp 00000000 08:01 2097586 /home/dennis/2_1 08049000-0804a000 r--p 00000000 08:01 2097586 /home/dennis/2_1 0804a000-0804b000 rw-p 00001000 08:01 2097586 /home/dennis/2_1 094d1000-094f2000 rw-p 00000000 00:00 0 [heap] b7560000-b757b000 r-xp 00000000 08:01 1048602 /lib/i386-linux-gnu/libgcc_s.so.17.6(__vsprintf_chk+0xc9)[0xb76980d9] /lib/i386-linux-gnu/libc.so./2_1[0x8048404] /lib/i386-linux-gnu/libc.6(_IO_vfprintf+0x165)[0xb75d7a45] /lib/i386-linux-gnu/libc. the gs register in Linux always points to the TLS ( 65 on page 671) —some information specific to thread is stored there (by the way. in arch/x86/include/asm/stackprotector.CHAPTER 18.so.so b7743000-b7746000 rw-p 00000000 00:00 0 b775a000-b775d000 rw-p 00000000 00:00 0 b775d000-b775e000 r-xp 00000000 00:00 0 [vdso] b775e000-b777e000 r-xp 00000000 08:01 1050794 /lib/i386-linux-gnu/ld-2. again. now we can see how LLVM checks the correctness of the “canary”: _main var_64 var_60 var_5C var_58 var_54 var_50 var_4C var_48 var_44 var_40 var_3C var_38 var_34 var_30 var_2C var_28 var_24 var_20 8 Thread = = = = = = = = = = = = = = = = = = -0x64 -0x60 -0x5C -0x58 -0x54 -0x50 -0x4C -0x48 -0x44 -0x40 -0x3C -0x38 -0x34 -0x30 -0x2C -0x28 -0x24 -0x20 Information Block 9 wikipedia 273 . in win32 the fs register plays the same role.3. pointing to TIB8 9 ).04 x86): *** buffer overflow detected ***: . these registers were used widely in MS-DOS and DOS-extenders times.6(_IO_default_xsputn+0x8c)[0xb7606e5c] /lib/i386-linux-gnu/libc.1 b7592000-b7593000 rw-p 00000000 00:00 0 b7593000-b7740000 r-xp 00000000 08:01 1050781 /lib/i386-linux-gnu/libc-2. BUFFER OVERFLOW PROTECTION METHODS The random value is located in gs:20.1 b757c000-b757d000 rw-p 0001b000 08:01 1048602 /lib/i386-linux-gnu/libgcc_s.so. If the values are not equal.6(+0x10593a)[0xb769893a] /lib/i386-linux-gnu/libc.17.so b7740000-b7742000 r--p 001ad000 08:01 1050781 /lib/i386-linux-gnu/libc-2.so.17.6(+0x105008)[0xb7698008] /lib/i386-linux-gnu/libc.3 (LLVM) (thumb-2 mode) Let’s get back to our simple array example ( 18.so.so.17. its function is different.so b7742000-b7743000 rw-p 001af000 08:01 1050781 /lib/i386-linux-gnu/libc-2.11 version).so bff35000-bff56000 rw-p 00000000 00:00 0 [stack] Aborted (core dumped) gs—is the so-called segment register.1 b757b000-b757c000 r--p 0001a000 08:01 1048602 /lib/i386-linux-gnu/libgcc_s.so b777f000-b7780000 rw-p 00020000 08:01 1050794 /lib/i386-linux-gnu/ld-2.1 Optimizing Xcode 4. 18. the __stack_chk_fail function is called and we can see in the console something like that (Ubuntu 13./2_1 terminated ======= Backtrace: ========= /lib/i386-linux-gnu/libc.17.

[SP. BUFFER OVERFLOW PROTECTION METHODS = = = = -0x1C -0x18 -0x14 -0x10 {R4-R7. SP. [R6.#0x64+var_50] R0.#0x64+var_40] R0.#0x64+var_20] R0. [SP. #0x1E R0.#0x64+var_28] R0.#0xC+var_10]! SP.#0x64+var_24] R0. #0x26 R0.#0x64+var_64] R0. 0xFDA .W SUB MOVW MOVS MOVT. #aObjc_methtype . #0x22 R0. #0x18 R0.#0x64+var_18] R4.#0x64+canary] R0. #0x1C R0. #0 R0. [SP. SP R6.CHAPTER 18. PC loc_2F1C .LSL#2] R5. [SP. [SP.#0x64+var_5C] R0.W STR MOVS STR STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOVS STR MOV MOV ADDS ADD B 18. [SP. #0x1A R0. [SP. [R8] R0. [SP.R5.#0x64+var_58] R0. #0xA R0. SP. #0 R5. ARRAYS var_1C var_18 canary var_10 PUSH ADD STR. [SP. [SP. R5. #0x24 R0. [SP. [SP.#0x64+var_30] R0. #0 R0. #0x10 R0. #0xC R8. #0x14 R0. #0x20 R0.#0x64+var_1C] R0.#0x64+var_2C] R0.LR} R7. [SP. #0xC R0. [SP. #0x16 R0.W LDR. "a[%d]=%d\n" R0. [SP.#0x64+var_4C] R0. [SP. #6 R0. "objc_methtype" R2.#0x64+var_44] R0.#0x64+var_34] R0. PC R8. #0x12 R0. [SP. [SP. [SP. #1 R2. [SP. [R0] R0.#0x64+var_38] R0. R0. #8 R0. second loop begin loc_2F14 ADDS LDR. [SP.W MOVS ADD LDR. #4 R0. #2 R2.3. [SP.#0x64+var_3C] R0. #0x54 R0.#0x64+var_48] R0.#0x64+var_60] R0. #0xE R0.#0x64+var_54] R0. R0 loc_2F1C 274 . #4 R4.W MOV R0.

then access the allocated memory block as an array of variables of the type you need.4 on page 25)) 275 .8: Get month name #include <stdio. the block being skipped. "July". allocate it by using malloc().4 on page 25) internally. "September".2. 18. pre-calculated.7. which. SP. If the “canaries” are not equal. "November". #0x13 loc_2F14 R0. as we see. a 4-instruction block is triggered by ``ITTTT EQ''. // in 0. "May".h> const char* month1[]= { "January". "December" }. #0x54 R8.CHAPTER 18.W POPEQ BLX 18. 6. 18. That’s just because the compiler must know the exact array size to allocate space for it in the local stack layout on at the compiling stage. [SP. If you need an array of arbitrary size.5 Array of pointers to strings Here is an example for an array of pointers. but it looks like alloca() ( 5. .#4 {R4-R7. By the way..5/2]. pp. At the function end we see the comparison of the “canaries” —the one in the local stack and the correct one. ONE MORE WORD ABOUT ARRAYS R0. "February". and finding this could be your homework.W LDR CMP ITTTT EQ MOVEQ ADDEQ LDREQ. #0 SP. will halt execution.5/2]: GCC actually does this by allocating an array dynamically on the stack (like alloca() ( 5. which contains writing 0 in R0. [SP+0x64+var_64].. [R8] R1. "March".PC} ___stack_chk_fail First of all. pp.. Listing 18. 6. as LLVM concluded it can work faster. it is possible in C99 standard[ISO07. R1 . If they are equal to each other.11 range const char* get_month1 (int month) 10 However. to which R8 points.4 One more word about arrays Now we understand why it is impossible to write something like this in C/C++ code 10 : void f(int size) { int a[size]. "August". the function epilogue and exit. canary still correct? R0. Or use the C99 standard feature[ISO07. "June". and the jump to ___stack_chk_fail function will occur. R5 _printf R5. as I suppose.4. "October". "April".2. LLVM “unrolled” the loop and all values were written into an array one-by-one. }.7. instructions in ARM mode may help to do this even faster. ARRAYS MOV MOV BLX CMP BNE LDR.#0x64+canary] R0. R4 R1.

1 x64 Listing 18.9 x64 movsx mov ret rdi. And if this happens.5. 00H 'June'. The reason for the sign extension is that this 32-bit value is to be used in calculations with other 64-bit values. 00H 'February'. ecx rcx. And that’s why to pick a specific element. For 1. 00H '%s'. 00H 'March'. It’s displacement. the negative input value of int type is sign-extended correctly and the corresponding element before table is picked. • Finally. 00H 'May'.CHAPTER 18. 00H 'November'. 12 “0+” was left in the listing because GCC assembler output is not tidy enough to eliminate it. }. 276 . 00H 'September'. the input value (month) is multiplied by 8 and added to the address.9: Optimizing MSVC 2013 x64 _DATA month1 $SG3122 $SG3123 $SG3124 $SG3125 $SG3126 $SG3127 $SG3128 $SG3129 $SG3130 $SG3156 $SG3131 $SG3132 $SG3133 _DATA SEGMENT DQ DQ DQ DQ DQ DQ DQ DQ DQ DQ DQ DQ DB DB DB DB DB DB DB DB DB DB DB DB DB ENDS month$ = 8 get_month1 PROC movsxd lea mov ret get_month1 ENDP FLAT:$SG3122 FLAT:$SG3123 FLAT:$SG3124 FLAT:$SG3125 FLAT:$SG3126 FLAT:$SG3127 FLAT:$SG3128 FLAT:$SG3129 FLAT:$SG3130 FLAT:$SG3131 FLAT:$SG3132 FLAT:$SG3133 'January'. QWORD PTR [rcx+rax*8] 0 The code is very simple: • The first MOVSXD instruction copies a 32-bit value from ECX (where month argument is passed) to RAX with signextension (because the month argument is of type int). QWORD PTR month1[0+rdi*8] 11 It is somewhat weird. Optimizing GCC 4. etc. Hence. this instruction also loads the element at this address. and it’s zero here. It is not going to work correctly without sign-extension.10: Optimizing GCC 4. 00H 'December'.9 can do the job even better 12 : Listing 18. ARRAY OF POINTERS TO STRINGS { return month1[month]. 00H 'July'. Hence. ARRAYS 18. 00H 'August'. 18. • Then the address of the pointer table is loaded into RCX. but negative array index could be passed here as month (negative array indices will have been explained later: 52 on page 589) . it has to be promoted to 64-bit 11 . each table element is 8 bytes wide. In addition. 00H rax. 00H 'October'. an element would be a pointer to a string that contains “February”. Indeed: we are in a 64-bit environment and all address (or pointers) require exactly 64 bits (or 8 bytes) for storage. OFFSET FLAT:month1 rax. edi rax. That’s what MOV does. 00H 'April'.5. month ∗ 8 bytes has to be skipped from the start. 0aH.

0 "February".[r1. Then input value month is shifted left by 2 (which is the same as multiplying by 4).r0. ARRAYS 18. so it is used as is.|L0.0 "July".r0.conststring||+0x17 ||. All the rest is done using just one LDR instruction.0 "October".0 "June".r0] 277 .data||.conststring||+0x44 ||.conststring||+0x3c ||. because the LSL suffix cannot be specified in the LDR instruction here: get_month1 PROC LSLS LDR LDR r0.0 "March".|L0.conststring||+0x8 ||.LSL #2] lr |L0.#2 r1.conststring||+0x2b ||. ARRAY OF POINTERS TO STRINGS 32-bit MSVC Let’s also compile it in the 32-bit MSVC compiler: Listing 18.5. 18.conststring||+0x4d The address of the able is loaded in R1.conststring||+0x26 ||. ALIGN=2 month1 DCD DCD DCD DCD DCD DCD DCD DCD DCD DCD DCD DCD ||.data|| DCB DCB DCB DCB DCB DCB DCB DCB DCB DCB DCB DCB "January". then added to R1 (where the address of the table is) and then a table element is loaded from this address.CHAPTER 18. but less dense. ARM in Thumb mode The code is mostly the same.100| r0. because the table elements are 32-bit (or 4 bytes) wide. DATA.100| DCD ||.0 "April".conststring||+0x1d ||.0 "May".0 "December".5. DWORD PTR _month1[eax*4] ret 0 _get_month1 ENDP The input value does not need to be extended to 64-bit value.0 AREA ||.0 "November".0 "September".2 32-bit ARM ARM in ARM mode Listing 18.conststring||+0x32 ||.64| r0. And it’s multiplied by 4. DWORD PTR _month$[esp-4] mov eax.conststring|| ||. The 32-bit table element is loaded into R0 from the table.[r1.11: Optimizing MSVC 2013 x86 _month$ = 8 _get_month1 PROC mov eax.conststring||+0x21 ||.12: Optimizing Keil 6/2013 (ARM mode) get_month1 PROC LDR LDR BX ENDP r1.conststring||+0x11 ||.0 "August".

4.size month1: .LC2: .LC4 .xword .9 ARM64 get_month1: adrp add ldr ret .xword . ARRAY OF POINTERS TO STRINGS lr ARM64 Listing 18.LANCHOR0 = . month1 .xword .type .xword .xword .xword .5.string .string .LC11: . .string . [x1. Then corresponding element is picked using just one LDR.5.CHAPTER 18. + .14: Optimizing GCC 4.string . ARRAYS BX ENDP 18.3 18.string .xword .LC6 .sxtw 3] 0 month1.LC9: .LC3: .xword .LC4: .LC5 .LC10: . 96 .string .string .xword .5. x1.13: Optimizing GCC 4.xword .LC2 . 18.LC3 .LC13: .string .LC12 .w0.LC8: .5 (IDA) get_month1: . sign-extends it (this is what “sxtw” suffix implies) and adds to X0.xword . 2 .LC10 .string x1.LC13 "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December" The address of the table is loaded in X1 using ADRP/ADD pair.LC7: . shifts it 3 bits to the left (which is the same as multiplying by 8). which takes W0 (the register where input argument month is).LC12: .string .xword .string .LC9 .LANCHOR0 x1. sum up address of table and multiplied value: 278 .LANCHOR0 x0.LC11 .4 MIPS Listing 18. %object month1.LC7 .string . Then the 64-bit value is loaded from the table into X0.LC6: . :lo12:. load address of table into $v0: la $v0.LC8 . take input value and multiply it by 4: sll $a0.LC5: .

ARRAYS 18. so it just the month names. return jr $ra or $at. . .word aApril .ascii "June"<0> .rodata. DATA XREF: .data # .local .ascii "August"<0> . and return it.0 align 4 db 'April'. .ascii "November"<0> .ascii "May"<0> .ascii "December"<0> Array overflow Our function accepts values in the range of 0. .ascii "July"<0> .word aDecember month1: aJanuary: aFebruary: aMarch: aApril: aMay: aJune: aJuly: aAugust: aSeptember: aOctober: aNovember: aDecember: 18. . . branch delay slot.11.ascii "March"<0> .ascii "April"<0> .4 .ascii "January"<0> .data:0000000140011018 Month names are came right after. . DATA XREF: .. . .data:0000000140011008 .word aNovember .word aMay .data:0000000140011010 .ascii "September"<0> . NOP .globl month1 .word aFebruary .CHAPTER 18. . . Our program is tiny.word aMarch .word aJune .ascii "February"<0> . Let’s see how the CPU treating the bytes there as a 64-bit value: 279 . Soon after. $zero .data:off_140011000 DATA XREF: . .text:0000000140001003 "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December" DATA XREF: sub_140001020+4 . some other function can try to get a text string from this address and may crash.word aJanuary .data # .rel.str1. .5. So what if 12 is passed to the function? The 13th element will be returned. . load table element at this lw $v0.5 $v0 address into $v0: 0($a0) # # # # # # # # # # # # "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December" . ARRAY OF POINTERS TO STRINGS addu $a0.data. I have compiled the example in MSVC for win64 and opened it in IDA to see what the linker has placed after the table: Listing 18.0 db 'February'.word aJuly . So the function will load some value which happens to be there.word aAugust .word aSeptember . but what if 12 is passed? There is no element in table at this place. But I have to note that there might be really anything that linker has decided to put by chance.5.15: Executable file in IDA off_140011000 dq offset aJanuary_1 aJanuary_1 dq dq dq dq dq dq dq dq dq dq dq db aFebruary_1 aMarch_1 aApril_1 offset aFebruary_1 offset aMarch_1 offset aApril_1 offset aMay_1 offset aJune_1 offset aJuly_1 offset aAugust_1 offset aSeptember_1 offset aOctober_1 offset aNovember_1 offset aDecember_1 'January'. . so there isn’t much data to pack in the data segment. DATA XREF: .word aOctober .0 align 4 db 'March'.0 .ascii "October"<0> . . .

dq offset aMay_1 . ARRAY OF POINTERS TO STRINGS Listing 18. expecting a C-string there.17: assert() added const char* get_month1_checked (int month) { assert (month<12).16: Executable file in IDA off_140011000 dq offset qword_140011060 . }. dq offset aNovember_1 . There exists the philosophy that says “fail early and fail loudly” or “fail-fast”. it will Murphy’s Law It’s a bit naïve to expect that every programmer who use your function or library will never pass an argument larger than 11. ARRAYS 18. return month1[month]. 'h'. DATA XREF: . 'h'. Soon after. 00H. 00H. 00H. because this value does’t look like a valid address. 'c'. dq offset aJuly_1 .18: Optimizing MSVC 2013 x64 $SG3143 DB DB $SG3144 DB DB 'm'. dq offset aDecember_1 . qword_140011060 dq 797261756E614Ah . 32 pop rbx 280 .0 . dq offset aJune_1 . align 4 aMarch_1 db 'March'. One such method in C/C++ is assertions.text:0000000140001003 "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December" DATA XREF: sub_140001020+4 . '<'. some other function (presumably. .5. 00H. 00H. 't'. dq offset aApril_1 . 'n'. 't'. 00H. 00H.0 . dq offset aMarch_1 . QWORD PTR [rcx+rbx*8] add rsp. dq offset aOctober_1 . ecx cmp ebx. 00H. 32 movsxd rbx. dq offset aAugust_1 . dq offset aFebruary_1 .data:off_140011000 DATA XREF: . '1'. 00H month$ = 48 get_month1_checked PROC $LN5: push rbx sub rsp. aFebruary_1 db 'February'.'. 00H. 00H 00H. Most likely it is about to crash. 00H. 00H 00H 00H. OFFSET FLAT:$SG3144 mov r8d. OFFSET FLAT:$SG3143 lea rcx. dq offset aSeptember_1 . Listing 18. 'm'. 29 call _wassert $LN3@get_month1: lea rcx. which teaches to report problems as early as possible and stop. 00H. 'n'.data:0000000140011008 DATA XREF: . 12 jl SHORT $LN3@get_month1 lea rdx. Array overflow protection If something can go wrong.data:0000000140011010 And this is 0x797261756E614A. The assertion macro checks for valid values at every function start and fails if the expression is false. 00H. OFFSET FLAT:month1 mov rax. 00H. one that processes strings) may try to read bytes at this address. We can modify our program to fail if an incorrect value is passed: Listing 18. 'o'. '. 'o'. '2'.CHAPTER 18.

The line number is also passed (it’s 29). 29 mov esi. They make your program slower. Microsoft Windows NT kernels come in “checked” and “free” builds 13 . then passes also the line number and file name to another function which reports this information to the user.2423: . and this is true for the sanitizing checks as well. This mechanism is probably the same in all compilers. Here we see that both file name and condition are encoded in UTF-16.string "month<12" get_month1_checked: cmp edi. but macro.L6: push rax mov ecx.string "month. assert() is not a function. the second one doesn’t (hence.6 Multidimensional arrays Internally. 11 jg .c" . MULTIDIMENSIONAL ARRAYS ret 0 get_month1_checked ENDP In fact. leaves the checks in Debug builds. OFFSET FLAT:. thit is how the elements of the a[3][4] array are placed in one-dimensional array of 12 cells: [0][0] [0][1] [0][2] [0][3] [1][0] [1][1] [1][2] [1][3] [2][0] [2][1] [2][2] [2][3] Table 18. For example. but in Release builds they all disappear.LC1 mov edi. especially if the assert() macros used in small time-critical functions.1: Two-dimensional array represented in memory as one-dimensional Here is how each cell of 3*4 array are placed in memory: 13 MSDN 281 . OFFSET FLAT:. Since the computer memory is linear. Nothing is really free. a multidimensional array is essentially the same thing as a linear array. Here is what GCC does: Listing 18. for example. edi mov rax. “checked”).LC2 call __assert_fail __PRETTY_FUNCTION__. OFFSET FLAT:__PRETTY_FUNCTION__. The first has validation checks (hence. It checks for a condition. So MSVC.LC2: . QWORD PTR month1[0+rdi*8] ret .string "get_month1_checked" So the macro in GCC also passes the function name for convenience. ARRAYS 18. For convenience. it is an one-dimensional array. “free” of checks). this multi-dimensional array can be easily represented as one-dimensional.2423 mov edx.L6 movsx rdi.19: Optimizing GCC 4.9 x64 .6. 18.CHAPTER 18.LC1: .

the best scheme for data organization is the one. 3: Listing 18. MATLAB and R. Row filling example Let’s fill the second row with these values: 0 . we first multiply the first index by 4 (matrix width) and then add the second index. and vice versa.20: Row filling example #include <stdio. So if your function accesses data per row. x<3. .1 Two-dimensional array example We are going to work with an array of type char. in terms of performance and cache memory.3: for (y=0. Listing 18. Which method is better? In general.h> char a[3][4].6. in which the elements are accessed sequentially.21: Column filling example #include <stdio. // clear array for (x=0. 2 and 3: Figure 18. then the second column …and finally the elements of the last column”.6. int main() { int x. in order to calculate the address of the element we need.h> char a[3][4].CHAPTER 18. int main() 282 . ARRAYS 18. which implies that each element requires only one byte in memory. The term row-major order in plain English language means: “first. 1. y. then the second row …and finally the elements of the last row”. x++) for (y=0. . y++) a[1][y]=y. row-major order is better. y<4. We see that second row now has values 0.2: Memory addresses of each cell of two-dimensional array So.7: OllyDbg: array is filled Column filling example Let’s fill the third column with values: 0 . write the elements of the first row. . y<4. y++) a[x][y]=0. I have marked all three rows with red.. // fill second row by 0. and this method of array and matrix representation is used in at least C/C++ and Python. column-major order term in plain English language means: “first. write the elements of the first column. 2. MULTIDIMENSIONAL ARRAYS 0 4 8 1 5 9 2 6 10 3 7 11 Table 18. That’s called row-major order. . Another method for representation is called column-major order (the array indices are used in reverse order) and it is used at least in FORTRAN. }. 18.

CHAPTER 18. ARRAYS

18.6. MULTIDIMENSIONAL ARRAYS

{
int x, y;
// clear array
for (x=0; x<3; x++)
for (y=0; y<4; y++)
a[x][y]=0;
// fill third column by 0..2:
for (x=0; x<3; x++)
a[x][2]=x;
};

I have also marked the three rows in red here. We see that in each row, at third position these values are written: 0, 1
and 2.

Figure 18.8: OllyDbg: array is filled

18.6.2

Access two-dimensional array as one-dimensional

I can easily show you how to access a two-dimensional array as one-dimensional array in at least two other ways:
#include <stdio.h>
char a[3][4];
char get_by_coordinates1 (char array[3][4], int a, int b)
{
return array[a][b];
};
char get_by_coordinates2 (char *array, int a, int b)
{
// treat input array as one-dimensional
// 4 is array width here
return array[a*4+b];
};
char get_by_coordinates3 (char *array, int a, int b)
{
// treat input array as pointer,
// calculate address, get value at it
// 4 is array width here
return *(array+a*4+b);
};
int main()
{
a[2][3]=123;
printf ("%d\n", get_by_coordinates1(a, 2, 3));
printf ("%d\n", get_by_coordinates2(a, 2, 3));
printf ("%d\n", get_by_coordinates3(a, 2, 3));
};

Compile and run it: it shows correct values.
What MSVC 2013 did is fascinating, all three routines are just the same!
Listing 18.22: Optimizing MSVC 2013 x64
array$ = 8
a$ = 16
b$ = 24
283

CHAPTER 18. ARRAYS

18.6. MULTIDIMENSIONAL ARRAYS

get_by_coordinates3 PROC
; RCX=address of array
; RDX=a
; R8=b
movsxd rax, r8d
; EAX=b
movsxd r9, edx
; R9=a
add
rax, rcx
; RAX=b+address of array
movzx
eax, BYTE PTR [rax+r9*4]
; AL=load byte at address RAX+R9*4=b+address of array+a*4=address of array+a*4+b
ret
0
get_by_coordinates3 ENDP
array$ = 8
a$ = 16
b$ = 24
get_by_coordinates2 PROC
movsxd rax, r8d
movsxd r9, edx
add
rax, rcx
movzx
eax, BYTE PTR [rax+r9*4]
ret
0
get_by_coordinates2 ENDP
array$ = 8
a$ = 16
b$ = 24
get_by_coordinates1 PROC
movsxd rax, r8d
movsxd r9, edx
add
rax, rcx
movzx
eax, BYTE PTR [rax+r9*4]
ret
0
get_by_coordinates1 ENDP

GCC also generates equivalent routines, but slightly different:
Listing 18.23: Optimizing GCC 4.9 x64
; RDI=address of array
; RSI=a
; RDX=b
get_by_coordinates1:
; sign-extend input 32-bit int values "a" and "b" to 64-bit ones
movsx
rsi, esi
movsx
rdx, edx
lea
rax, [rdi+rsi*4]
; RAX=RDI+RSI*4=address of array+a*4
movzx
eax, BYTE PTR [rax+rdx]
; AL=load byte at address RAX+RDX=address of array+a*4+b
ret
get_by_coordinates2:
lea
eax, [rdx+rsi*4]
; RAX=RDX+RSI*4=b+a*4
cdqe
movzx
eax, BYTE PTR [rdi+rax]
; AL=load byte at address RDI+RAX=address of array+b+a*4
ret
get_by_coordinates3:
sal
esi, 2
; ESI=a<<2=a*4
; sign-extend input 32-bit int values "a*4" and "b" to 64-bit ones
movsx
rdx, edx
movsx
rsi, esi
add
rdi, rsi
; RDI=RDI+RSI=address of array+a*4
284

CHAPTER 18. ARRAYS

18.6. MULTIDIMENSIONAL ARRAYS

movzx
eax, BYTE PTR [rdi+rdx]
; AL=load byte at address RDI+RDX=address of array+a*4+b
ret

18.6.3

Three-dimensional array example

It’s thing in multidimensional arrays.
Now we are going to work with an array of type int: each element requires 4 bytes in memory.
Let’s see:
Listing 18.24: simple example
#include <stdio.h>
int a[10][20][30];
void insert(int x, int y, int z, int value)
{
a[x][y][z]=value;
};

x86
We get (MSVC 2010):
Listing 18.25: MSVC 2010
_DATA
SEGMENT
COMM
_a:DWORD:01770H
_DATA
ENDS
PUBLIC
_insert
_TEXT
SEGMENT
_x$ = 8
; size = 4
_y$ = 12
; size = 4
_z$ = 16
; size = 4
_value$ = 20
; size = 4
_insert
PROC
push
ebp
mov
ebp, esp
mov
eax, DWORD PTR _x$[ebp]
imul
eax, 2400
mov
ecx, DWORD PTR _y$[ebp]
imul
ecx, 120
lea
edx, DWORD PTR _a[eax+ecx]
mov
eax, DWORD PTR _z$[ebp]
mov
ecx, DWORD PTR _value$[ebp]
mov
DWORD PTR [edx+eax*4], ecx
pop
ebp
ret
0
_insert
ENDP
_TEXT
ENDS

; eax=600*4*x
; ecx=30*4*y
; edx=a + 600*4*x + 30*4*y

; *(edx+z*4)=value

Nothing special. For index calculation, three input arguments are used in the formula address = 600 ⋅ 4 ⋅ x + 30 ⋅ 4 ⋅ y + 4z,
to represent the array as multidimensional. Do not forget that the int type is 32-bit (4 bytes), so all coefficients must be
multiplied by 4.
Listing 18.26: GCC 4.4.1
insert

public insert
proc near

x
y
z
value

=
=
=
=

dword
dword
dword
dword

push
mov
push
mov

ptr
ptr
ptr
ptr

8
0Ch
10h
14h

ebp
ebp, esp
ebx
ebx, [ebp+x]
285

CHAPTER 18. ARRAYS

18.6. MULTIDIMENSIONAL ARRAYS

mov
mov
lea
mov
shl
sub
imul
add
lea
mov
mov
pop
pop
retn
endp

insert

eax, [ebp+y]
ecx, [ebp+z]
edx, [eax+eax]
eax, edx
eax, 4
eax, edx
edx, ebx, 600
eax, edx
edx, [eax+ecx]
eax, [ebp+value]
dword ptr ds:a[edx*4], eax
ebx
ebp

;
;
;
;
;
;
;

edx=y*2
eax=y*2
eax=(y*2)<<4 = y*2*16 = y*32
eax=y*32 - y*2=y*30
edx=x*600
eax=eax+edx=y*30 + x*600
edx=y*30 + x*600 + z

; *(a+edx*4)=value

The GCC compiler does it differently. For one of the operations in the calculation (30y), GCC produces code without
multiplication instructions. This is how it done: (y + y) ≪ 4 − (y + y) = (2y) ≪ 4 − 2y = 2 ⋅ 16 ⋅ y − 2y = 32y − 2y = 30y.
Thus, for the 30y calculation, only one addition operation, one bitwise shift operation and one subtraction operation are
used. This works faster.
ARM + Non-optimizing Xcode 4.6.3 (LLVM) (thumb mode)
Listing 18.27: Non-optimizing Xcode 4.6.3 (LLVM) (thumb mode)
_insert
value
z
y
x

=
=
=
=

-0x10
-0xC
-8
-4

; allocate place in local stack for 4 values of int type
SUB
SP, SP, #0x10
MOV
R9, 0xFC2 ; a
ADD
R9, PC
LDR.W
R9, [R9]
STR
R0, [SP,#0x10+x]
STR
R1, [SP,#0x10+y]
STR
R2, [SP,#0x10+z]
STR
R3, [SP,#0x10+value]
LDR
R0, [SP,#0x10+value]
LDR
R1, [SP,#0x10+z]
LDR
R2, [SP,#0x10+y]
LDR
R3, [SP,#0x10+x]
MOV
R12, 2400
MUL.W
R3, R3, R12
ADD
R3, R9
MOV
R9, 120
MUL.W
R2, R2, R9
ADD
R2, R3
LSLS
R1, R1, #2 ; R1=R1<<2
ADD
R1, R2
STR
R0, [R1]
; R1 - address of array element
; deallocate chunk in local stack, allocated for 4 values of int type
ADD
SP, SP, #0x10
BX
LR

Non-optimizing LLVM saves all variables in local stack, which is redundant.
calculated by the formula we already saw.

The address of the array element is

ARM + Optimizing Xcode 4.6.3 (LLVM) (thumb mode)
Listing 18.28: Optimizing Xcode 4.6.3 (LLVM) (thumb mode)
_insert
MOVW
R9, #0x10FC
MOV.W
R12, #2400
MOVT.W R9, #0
286

CHAPTER 18. ARRAYS

RSB.W
ADD
LDR.W
MLA.W
ADD.W
STR.W
BX

R1,
R9,
R9,
R0,
R0,

18.6. MULTIDIMENSIONAL ARRAYS

R1, R1,LSL#4
PC
[R9]
R0, R12, R9
R0, R1,LSL#3

; R1 - y. R1=y<<4 - y = y*16 - y = y*15
; R9 = pointer to a array

;
;
;
R3, [R0,R2,LSL#2] ;
;
LR

R0 - x, R12 - 2400, R9 - pointer to a. R0=x*2400 + ptr to a
R0 = R0+R1<<3 = R0+R1*8 = x*2400 + ptr to a + y*15*8 =
ptr to a + y*30*4 + x*600*4
R2 - z, R3 - value. address=R0+z*4 =
ptr to a + y*30*4 + x*600*4 + z*4

The tricks for replacing multiplication by shift, addition and subtraction which we already saw are also present here.
Here we also see a new instruction for us: RSB (Reverse Subtract). It works just as SUB, but it swaps its operands with each
other before execution. Why? SUB and RSB are instructions, to the second operand of which shift coefficient may be applied:
(LSL#4). But this coefficient can be applied only to second operand. That’s fine for commutative operations like addition
or multiplication (operands may be swapped there without changing the result). But subtraction is a non-commutative
operation, so RSB exist for these cases.
The ``LDR.W R9, [R9]'' instruction works like LEA ( A.6.2 on page 933) in x86, but it does nothing here, it is
redundant. Apparently, the compiler did not optimize it out.
MIPS
My example is tiny, so the GCC compiler decided to put the a table into the 64KiB area addressable by the Global Pointer.
Listing 18.29: Optimizing GCC 4.4.5 (IDA)
insert:
; $a0=x
; $a1=y
; $a2=z
; $a3=value
sll
$v0, $a0, 5
; $v0 = $a0<<5 = x*32
sll
$a0, 3
; $a0 = $a0<<3 = x*8
addu
$a0, $v0
; $a0 = $a0+$v0 = x*8+x*32 = x*40
sll
$v1, $a1, 5
; $v1 = $a1<<5 = y*32
sll
$v0, $a0, 4
; $v0 = $a0<<4 = x*40*16 = x*640
sll
$a1, 1
; $a1 = $a1<<1 = y*2
subu
$a1, $v1, $a1
; $a1 = $v1-$a1 = y*32-y*2 = y*30
subu
$a0, $v0, $a0
; $a0 = $v0-$a0 = x*640-x*40 = x*600
la
$gp, __gnu_local_gp
addu
$a0, $a1, $a0
; $a0 = $a1+$a0 = y*30+x*600
addu
$a0, $a2
; $a0 = $a0+$a2 = y*30+x*600+z
; load address of table:
lw
$v0, (a & 0xFFFF)($gp)
; multiply index by 4 to seek array element:
sll
$a0, 2
; sum up multiplied index and table address:
addu
$a0, $v0, $a0
; store value into table and return:
jr
$ra
sw
$a3, 0($a0)
.comm a:0x1770

18.6.4

More examples

The computer screen is represented as a 2D array, but the video-buffer is a linear 1D array. We talk about it here: 83.2 on
page 824.

287

CHAPTER 18. ARRAYS

18.7

18.7. PACK OF STRINGS AS A TWO-DIMENSIONAL ARRAY

Pack of strings as a two-dimensional array

Let’s revisit the function that returns the name of a month: listing.18.8. As you see, at least one memory load operation is
needed to prepare a pointer to the string that’s the month’s name. Is it possible to get rid of this memory load operation?
In fact yes, if you represent the list of strings as a two-dimensional array:
#include <stdio.h>
#include <assert.h>
const char month2[12][10]=
{
{ 'J','a','n','u','a','r','y', 0, 0,
{ 'F','e','b','r','u','a','r','y', 0,
{ 'M','a','r','c','h', 0, 0, 0, 0,
{ 'A','p','r','i','l', 0, 0, 0, 0,
{ 'M','a','y', 0, 0, 0, 0, 0, 0,
{ 'J','u','n','e', 0, 0, 0, 0, 0,
{ 'J','u','l','y', 0, 0, 0, 0, 0,
{ 'A','u','g','u','s','t', 0, 0, 0,
{ 'S','e','p','t','e','m','b','e','r',
{ 'O','c','t','o','b','e','r', 0, 0,
{ 'N','o','v','e','m','b','e','r', 0,
{ 'D','e','c','e','m','b','e','r', 0,
};

0
0
0
0
0
0
0
0
0
0
0
0

},
},
},
},
},
},
},
},
},
},
},
}

// in 0..11 range
const char* get_month2 (int month)
{
return &month2[month][0];
};

Here is what we’ve get:
Listing 18.30: Optimizing MSVC 2013 x64
month2

DB
DB
DB
DB
DB
DB
DB
DB
DB
DB

04aH
061H
06eH
075H
061H
072H
079H
00H
00H
00H

...
get_month2 PROC
; sign-extend input argument and promote to 64-bit value
movsxd rax, ecx
lea
rcx, QWORD PTR [rax+rax*4]
; RCX=month+month*4=month*5
lea
rax, OFFSET FLAT:month2
; RAX=pointer to table
lea
rax, QWORD PTR [rax+rcx*2]
; RAX=pointer to table + RCX*2=pointer to table + month*5*2=pointer to table + month*10
ret
0
get_month2 ENDP

There are no memory accesses at all. All this function does is to calculate a point at which the first character of the
name of the month is: pointer_to_the_table + month ∗ 10. There are also two LEA instructions, which effectively work as
several MUL and MOV instructions.
The width of the array is 10 bytes. Indeed, the longest string here — “September” — is 9 bytes, and plus the terminating
zero is 10 bytes. The rest of the month names are padded by zero bytes, so they all occupy the same space (10 bytes). Thus,
our function works even faster, because all string start at an address which can be calculated easily.
Optimizing GCC 4.9 can do it even shorter:
Listing 18.31: Optimizing GCC 4.9 x64
movsx
lea

rdi, edi
rax, [rdi+rdi*4]
288

CHAPTER 18. ARRAYS

lea
ret

18.7. PACK OF STRINGS AS A TWO-DIMENSIONAL ARRAY

rax, month2[rax+rax]

LEA is also used here for multiplication by 10.
Non-optimizing compilers do multiplication differently.
Listing 18.32: Non-optimizing GCC 4.9 x64
get_month2:
push
rbp
mov
rbp, rsp
mov
DWORD PTR [rbp-4], edi
mov
eax, DWORD PTR [rbp-4]
movsx
rdx, eax
; RDX = sign-extended input value
mov
rax, rdx
; RAX = month
sal
rax, 2
; RAX = month<<2 = month*4
add
rax, rdx
; RAX = RAX+RDX = month*4+month = month*5
add
rax, rax
; RAX = RAX*2 = month*5*2 = month*10
add
rax, OFFSET FLAT:month2
; RAX = month*10 + pointer to the table
pop
rbp
ret

Non-optimizing MSVC just use IMUL instruction:
Listing 18.33: Non-optimizing MSVC 2013 x64
month$ = 8
get_month2 PROC
mov
DWORD PTR [rsp+8], ecx
movsxd rax, DWORD PTR month$[rsp]
; RAX = sign-extended input value into 64-bit one
imul
rax, rax, 10
; RAX = RAX*10
lea
rcx, OFFSET FLAT:month2
; RCX = pointer to the table
add
rcx, rax
; RCX = RCX+RAX = pointer to the table+month*10
mov
rax, rcx
; RAX = pointer to the table+month*10
mov
ecx, 1
; RCX = 1
imul
rcx, rcx, 0
; RCX = 1*0 = 0
add
rax, rcx
; RAX = pointer to the table+month*10 + 0 = pointer to the table+month*10
ret
0
get_month2 ENDP

… but one thing is weird here: why add multiplication by zero and adding zero to the final result? I don’t know, this
looks like a compiler code generator quirk, which wasn’t caught by the compiler’s tests (the resulting code works correctly,
after all). I intentionally add such pieces of code so the reader would understand, that sometimes one shouldn’t puzzle over
such compiler artifacts.

18.7.1 32-bit ARM
Optimizing Keil for Thumb mode uses the multiplication instruction MULS:
Listing 18.34: Optimizing Keil 6/2013 (thumb mode)
; R0 = month
MOVS
r1,#0xa
; R1 = 10
MULS
r0,r1,r0
; R0 = R1*R0 = 10*month
LDR
r1,|L0.68|
; R1 = pointer to the table
289

CHAPTER 18. ARRAYS

18.7. PACK OF STRINGS AS A TWO-DIMENSIONAL ARRAY

ADDS
r0,r0,r1
; R0 = R0+R1 = 10*month + pointer to the table
BX
lr

Optimizing Keil for ARM mode uses add and shift operations:
Listing 18.35: Optimizing Keil 6/2013 (ARM mode)
; R0 = month
LDR
r1,|L0.104|
; R1 = pointer to the table
ADD
r0,r0,r0,LSL #2
; R0 = R0+R0<<2 = R0+R0*4 = month*5
ADD
r0,r1,r0,LSL #1
; R0 = R1+R0<<2 = pointer to the table + month*5*2 = pointer to the table + month*10
BX
lr

18.7.2 ARM64
Listing 18.36: Optimizing GCC 4.9 ARM64
; W0 = month
sxtw
x0, w0
; X0 = sign-extended input value
adrp
x1, .LANCHOR1
add
x1, x1, :lo12:.LANCHOR1
; X1 = pointer to the table
add
x0, x0, x0, lsl 2
; X0 = X0+X0<<2 = X0+X0*4 = X0*5
add
x0, x1, x0, lsl 1
; X0 = X1+X0<<1 = X1+X0*2 = pointer to the table + X0*10
ret

SXTW is used for sign-extension and promoting input 32-bit value into a 64-bit one and storing it in X0. ADRP/ADD pair
is used for loading the address of the table. The ADD instructions also has a LSL suffix, which helps with multiplications.

18.7.3 MIPS
Listing 18.37: Optimizing GCC 4.4.5 (IDA)
.globl get_month2
get_month2:
; $a0=month
sll
$v0, $a0, 3
; $v0 = $a0<<3 = month*8
sll
$a0, 1
; $a0 = $a0<<1 = month*2
addu
$a0, $v0
; $a0 = month*2+month*8 = month*10
; load address of the table:
la
$v0, month2
; sum up table address and index we calculated and return:
jr
$ra
addu
$v0, $a0
month2:
aFebruary:
aMarch:
aApril:
aMay:
aJune:
aJuly:

.ascii "January"<0>
.byte 0, 0
.ascii "February"<0>
.byte
0
.ascii "March"<0>
.byte 0, 0, 0, 0
.ascii "April"<0>
.byte 0, 0, 0, 0
.ascii "May"<0>
.byte 0, 0, 0, 0, 0, 0
.ascii "June"<0>
.byte 0, 0, 0, 0, 0
.ascii "July"<0>
.byte 0, 0, 0, 0, 0
290

CHAPTER 18. ARRAYS

aAugust:
aSeptember:
aOctober:
aNovember:
aDecember:

18.8. CONCLUSION

.ascii "August"<0>
.byte 0, 0, 0
.ascii "September"<0>
.ascii "October"<0>
.byte 0, 0
.ascii "November"<0>
.byte
0
.ascii "December"<0>
.byte 0, 0, 0, 0, 0, 0, 0, 0, 0

18.7.4 Conclusion
This is a bit old-school technique to store text strings. You may find a lot of it in Oracle RDBMS, for example. But I don’t
really know if it’s worth doing on modern computers. Nevertheless, it was a good example of arrays, so I added it to this
book.

18.8

Conclusion

An array is a pack of values in memory located adjacently. It’s true for any element type, including structures. Access to a
specific array element is just a calculation of its address.

18.9

Exercises

18.9.1

Exercise #1

What does this code do?
Listing 18.38: MSVC 2010 + /O1
_a$ = 8
; size = 4
_b$ = 12
; size = 4
_c$ = 16
; size = 4
?s@@YAXPAN00@Z PROC
; s, COMDAT
mov
eax, DWORD PTR _b$[esp-4]
mov
ecx, DWORD PTR _a$[esp-4]
mov
edx, DWORD PTR _c$[esp-4]
push
esi
push
edi
sub
ecx, eax
sub
edx, eax
mov
edi, 200
; 000000c8H
$LL6@s:
push
100
; 00000064H
pop
esi
$LL3@s:
fld
QWORD PTR [ecx+eax]
fadd
QWORD PTR [eax]
fstp
QWORD PTR [edx+eax]
add
eax, 8
dec
esi
jne
SHORT $LL3@s
dec
edi
jne
SHORT $LL6@s
pop
edi
pop
esi
ret
0
?s@@YAXPAN00@Z ENDP
; s

(/O1: minimize space).
Listing 18.39: Optimizing Keil 6/2013 (ARM mode)
PUSH
MOV
MOV
MOV
MOV

{r4-r12,lr}
r9,r2
r10,r1
r11,r0
r5,#0

291

CHAPTER 18. ARRAYS

18.9. EXERCISES

|L0.20|
ADD
ADD
MOV
ADD
ADD
ADD

r0,r5,r5,LSL #3
r0,r0,r5,LSL #4
r4,#0
r8,r10,r0,LSL #5
r7,r11,r0,LSL #5
r6,r9,r0,LSL #5

ADD
LDM
ADD
LDM
BL
ADD
ADD
STM
CMP
BLT
ADD
CMP
BLT
POP

r0,r8,r4,LSL #3
r0,{r2,r3}
r1,r7,r4,LSL #3
r1,{r0,r1}
__aeabi_dadd
r2,r6,r4,LSL #3
r4,r4,#1
r2,{r0,r1}
r4,#0x64
|L0.44|
r5,r5,#1
r5,#0xc8
|L0.20|
{r4-r12,pc}

|L0.44|

Listing 18.40: Optimizing Keil 6/2013 (thumb mode)
PUSH
MOVS
SUB

{r0-r2,r4-r7,lr}
r4,#0
sp,sp,#8

MOVS
MOVS
LSLS
MULS
LDR
LDR
ADDS
STR
LDR
MOVS
ADDS
ADDS
STR

r1,#0x19
r0,r4
r1,r1,#5
r0,r1,r0
r2,[sp,#8]
r1,[sp,#0xc]
r2,r0,r2
r2,[sp,#0]
r2,[sp,#0x10]
r5,#0
r7,r0,r2
r0,r0,r1
r0,[sp,#4]

LSLS
ADDS
LDM
LDR
ADDS
LDM
BL
ADDS
ADDS
STM
CMP
BGE
LDR
B

r6,r5,#3
r0,r0,r6
r0!,{r2,r3}
r0,[sp,#0]
r1,r0,r6
r1,{r0,r1}
__aeabi_dadd
r2,r7,r6
r5,r5,#1
r2!,{r0,r1}
r5,#0x64
|L0.62|
r0,[sp,#4]
|L0.32|

ADDS
CMP
BLT
ADD
POP

r4,r4,#1
r4,#0xc8
|L0.6|
sp,sp,#0x14
{r4-r7,pc}

|L0.6|

|L0.32|

|L0.62|

Listing 18.41: Non-optimizing GCC 4.9 (ARM64)
s:
sub
str
str

sp, sp, #48
x0, [sp,24]
x1, [sp,16]
292

CHAPTER 18. ARRAYS

18.9. EXERCISES

str
str
b

x2, [sp,8]
wzr, [sp,44]
.L2

str
b

wzr, [sp,40]
.L3

ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
fmov
fmov
fadd
fmov
str
ldr
add
str

w1,
w0,
w0,
x1,
x0,
x0,
x0,
x1,
x0,
w2,
w1,
w1,
x2,
x1,
x1,
x1,
x2,
x1,
x2,
w3,
w1,
w1,
x3,
x1,
x1,
x1,
x3,
x1,
x1,
d0,
d1,
d0,
x1,
x1,
w0,
w0,
w0,

[sp,44]
100
w1, w0
w0
[sp,40]
x1, x0
x0, 3
[sp,8]
x1, x0
[sp,44]
100
w2, w1
w1
[sp,40]
x2, x1
x1, 3
[sp,24]
x2, x1
[x1]
[sp,44]
100
w3, w1
w1
[sp,40]
x3, x1
x1, 3
[sp,16]
x3, x1
[x1]
x2
x1
d0, d1
d0
[x0]
[sp,40]
w0, 1
[sp,40]

ldr
cmp
ble
ldr
add
str

w0,
w0,
.L4
w0,
w0,
w0,

[sp,40]
99

ldr
cmp
ble
add
ret

w0, [sp,44]
w0, 199
.L5
sp, sp, 48

.L5:

.L4:

.L3:

[sp,44]
w0, 1
[sp,44]

.L2:

Listing 18.42: Optimizing GCC 4.4.5 (MIPS) (IDA)
sub_0:
li
move
li

$t3, 0x27100
$t2, $zero
$t1, 0x64 # 'd'

addu
addu
move
move

$t0,
$a3,
$v1,
$v0,

loc_10:

# CODE XREF: sub_0+54
$a1, $t2
$a2, $t2
$a0
$zero
293

CHAPTER 18. ARRAYS

18.9. EXERCISES

loc_20:

# CODE XREF: sub_0+48
lwc1
lwc1
lwc1
lwc1
addiu
add.d
addiu
swc1
swc1
addiu
bne
addiu
addiu
bne
addiu
jr
or

$f2,
$f0,
$f3,
$f1,
$v0,
$f0,
$v1,
$f0,
$f1,
$t0,
$v0,
$a3,
$t2,
$t2,
$a0,
$ra
$at,

4($v1)
4($t0)
0($v1)
0($t0)
1
$f2, $f0
8
4($a3)
0($a3)
8
$t1, loc_20
8
0x320
$t3, loc_10
0x320
$zero

Answer G.1.10 on page 955.

18.9.2

Exercise #2

What does this code do?
Listing 18.43: MSVC 2010 + /O1
tv315 = -8
; size = 4
tv291 = -4
; size = 4
_a$ = 8
; size = 4
_b$ = 12
; size = 4
_c$ = 16
; size = 4
?m@@YAXPAN00@Z PROC
; m, COMDAT
push
ebp
mov
ebp, esp
push
ecx
push
ecx
mov
edx, DWORD PTR _a$[ebp]
push
ebx
mov
ebx, DWORD PTR _c$[ebp]
push
esi
mov
esi, DWORD PTR _b$[ebp]
sub
edx, esi
push
edi
sub
esi, ebx
mov
DWORD PTR tv315[ebp], 100 ; 00000064H
$LL9@m:
mov
eax, ebx
mov
DWORD PTR tv291[ebp], 300 ; 0000012cH
$LL6@m:
fldz
lea
ecx, DWORD PTR [esi+eax]
fstp
QWORD PTR [eax]
mov
edi, 200
; 000000c8H
$LL3@m:
dec
edi
fld
QWORD PTR [ecx+edx]
fmul
QWORD PTR [ecx]
fadd
QWORD PTR [eax]
fstp
QWORD PTR [eax]
jne
HORT $LL3@m
add
eax, 8
dec
DWORD PTR tv291[ebp]
jne
SHORT $LL6@m
add
ebx, 800
; 00000320H
dec
DWORD PTR tv315[ebp]
jne
SHORT $LL9@m
pop
edi
pop
esi

294

CHAPTER 18. ARRAYS

18.9. EXERCISES

pop
ebx
leave
ret
0
?m@@YAXPAN00@Z ENDP

; m

(/O1: minimize space).
Listing 18.44: Optimizing Keil 6/2013 (ARM mode)
PUSH
SUB
MOV

{r0-r2,r4-r11,lr}
sp,sp,#8
r5,#0

LDR
ADD
ADD
ADD
STR
LDR
MOV
ADD
LDR
ADD

r1,[sp,#0xc]
r0,r5,r5,LSL #3
r0,r0,r5,LSL #4
r1,r1,r0,LSL #5
r1,[sp,#0]
r1,[sp,#8]
r4,#0
r11,r1,r0,LSL #5
r1,[sp,#0x10]
r10,r1,r0,LSL #5

MOV
MOV
ADD
STM
MOV
LDR
ADD
ADD

r0,#0
r1,r0
r7,r10,r4,LSL #3
r7,{r0,r1}
r6,r0
r0,[sp,#0]
r8,r11,r4,LSL #3
r9,r0,r4,LSL #3

LDM
LDM
BL
LDM
BL
ADD
STM
CMP
BLT
ADD
CMP
BLT
ADD
CMP
BLT
ADD
POP

r9,{r2,r3}
r8,{r0,r1}
__aeabi_dmul
r7,{r2,r3}
__aeabi_dadd
r6,r6,#1
r7,{r0,r1}
r6,#0xc8
|L0.84|
r4,r4,#1
r4,#0x12c
|L0.52|
r5,r5,#1
r5,#0x64
|L0.12|
sp,sp,#0x14
{r4-r11,pc}

PUSH
MOVS
SUB
STR

{r0-r2,r4-r7,lr}
r0,#0
sp,sp,#0x10
r0,[sp,#0]

MOVS
LSLS
MULS
LDR
LDR
ADDS
STR
LDR
MOVS
ADDS
ADDS
STR

r1,#0x19
r1,r1,#5
r0,r1,r0
r2,[sp,#0x10]
r1,[sp,#0x14]
r2,r0,r2
r2,[sp,#4]
r2,[sp,#0x18]
r5,#0
r7,r0,r2
r0,r0,r1
r0,[sp,#8]

|L0.12|

|L0.52|

|L0.84|

Listing 18.45: Optimizing Keil 6/2013 (thumb mode)

|L0.8|

|L0.32|
295

CHAPTER 18. ARRAYS

18.9. EXERCISES

LSLS
MOVS
ADDS
STR
MOVS
STR

r4,r5,#3
r0,#0
r2,r7,r4
r0,[r2,#0]
r6,r0
r0,[r2,#4]

LDR
ADDS
LDM
LDR
ADDS
LDM
BL
ADDS
LDM
BL
ADDS
ADDS
STM
CMP
BLT
MOVS
ADDS
ADDS
CMP
BLT
LDR
ADDS
CMP
STR
BLT
ADD
POP

r0,[sp,#8]
r0,r0,r4
r0!,{r2,r3}
r0,[sp,#4]
r1,r0,r4
r1,{r0,r1}
__aeabi_dmul
r3,r7,r4
r3,{r2,r3}
__aeabi_dadd
r2,r7,r4
r6,r6,#1
r2!,{r0,r1}
r6,#0xc8
|L0.44|
r0,#0xff
r5,r5,#1
r0,r0,#0x2d
r5,r0
|L0.32|
r0,[sp,#0]
r0,r0,#1
r0,#0x64
r0,[sp,#0]
|L0.8|
sp,sp,#0x1c
{r4-r7,pc}

|L0.44|

Listing 18.46: Non-optimizing GCC 4.9 (ARM64)
m:
sub
str
str
str
str
b

sp, sp, #48
x0, [sp,24]
x1, [sp,16]
x2, [sp,8]
wzr, [sp,44]
.L2

str
b

wzr, [sp,40]
.L3

ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
str
str
b

w1, [sp,44]
w0, 100
w0, w1, w0
x1, w0
x0, [sp,40]
x0, x1, x0
x0, x0, 3
x1, [sp,8]
x0, x1, x0
x1, .LC0
x1, [x0]
wzr, [sp,36]
.L4

ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr

w1,
w0,
w0,
x1,
x0,
x0,
x0,
x1,

.L7:

.L6:

.L5:
[sp,44]
100
w1, w0
w0
[sp,40]
x1, x0
x0, 3
[sp,8]
296

CHAPTER 18. ARRAYS

18.9. EXERCISES

add
ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
ldr
mov
mul
sxtw
ldrsw
add
lsl
ldr
add
ldr
fmov
fmov
fmul
fmov
fmov
fmov
fadd
fmov
str
ldr
add
str

x0,
w2,
w1,
w1,
x2,
x1,
x1,
x1,
x2,
x1,
x2,
w3,
w1,
w1,
x3,
x1,
x1,
x1,
x3,
x1,
x3,
w4,
w1,
w1,
x4,
x1,
x1,
x1,
x4,
x1,
x1,
d0,
d1,
d0,
x1,
d0,
d1,
d0,
x1,
x1,
w0,
w0,
w0,

x1, x0
[sp,44]
100
w2, w1
w1
[sp,40]
x2, x1
x1, 3
[sp,8]
x2, x1
[x1]
[sp,44]
100
w3, w1
w1
[sp,40]
x3, x1
x1, 3
[sp,24]
x3, x1
[x1]
[sp,44]
100
w4, w1
w1
[sp,40]
x4, x1
x1, 3
[sp,16]
x4, x1
[x1]
x3
x1
d0, d1
d0
x2
x1
d0, d1
d0
[x0]
[sp,36]
w0, 1
[sp,36]

ldr
cmp
ble
ldr
add
str

w0,
w0,
.L5
w0,
w0,
w0,

[sp,36]
199

ldr
cmp
ble
ldr
add
str

w0,
w0,
.L6
w0,
w0,
w0,

[sp,40]
299

ldr
cmp
ble
add
ret

w0, [sp,44]
w0, 99
.L7
sp, sp, 48

.L4:

[sp,40]
w0, 1
[sp,40]

.L3:

[sp,44]
w0, 1
[sp,44]

.L2:

Listing 18.47: Optimizing GCC 4.4.5 (MIPS) (IDA)
m:
li
move
li

$t5, 0x13880
$t4, $zero
$t1, 0xC8
297

CHAPTER 18. ARRAYS

18.9. EXERCISES

li

$t3, 0x12C

addu
addu
move
move

$t0,
$a3,
$v1,
$t2,

mtc1
move
mtc1
or
swc1
swc1

$zero, $f0
$v0, $zero
$zero, $f1
$at, $zero
$f0, 4($v1)
$f1, 0($v1)

lwc1
lwc1
lwc1
lwc1
addiu
mul.d
add.d
swc1
bne
swc1
addiu
addiu
addiu
bne
addiu
addiu
bne
addiu
jr
or

$f4,
$f2,
$f5,
$f3,
$v0,
$f2,
$f0,
$f0,
$v0,
$f1,
$t2,
$v1,
$t0,
$t2,
$a3,
$t4,
$t4,
$a2,
$ra
$at,

loc_14:

# CODE XREF: m+7C
$a0, $t4
$a1, $t4
$a2
$zero

loc_24:

# CODE XREF: m+70

loc_3C:

# CODE XREF: m+5C
4($t0)
4($a3)
0($t0)
0($a3)
1
$f4, $f2
$f2
4($v1)
$t1, loc_3C
0($v1)
1
8
8
$t3, loc_24
8
0x320
$t5, loc_14
0x320
$zero

Answer G.1.10 on page 956.

18.9.3

Exercise #3

What does this code do?
Try to determine the dimensions of the array, at least partially.
Listing 18.48: Optimizing MSVC 2010
_array$ = 8
_x$ = 12
_y$ = 16
_f
PROC
mov
mov
mov
shl
sub
lea
mov
fld
ret
_f
ENDP

eax, DWORD PTR _x$[esp-4]
edx, DWORD PTR _y$[esp-4]
ecx, eax
ecx, 4
ecx, eax
eax, DWORD PTR [edx+ecx*8]
ecx, DWORD PTR _array$[esp-4]
QWORD PTR [ecx+eax*8]
0

Listing 18.49: Non-optimizing Keil 6/2013 (ARM mode)
f PROC
RSB
ADD
ADD
LDM
BX
ENDP

r1,r1,r1,LSL #4
r0,r0,r1,LSL #6
r1,r0,r2,LSL #3
r1,{r0,r1}
lr

298

CHAPTER 18. ARRAYS

18.9. EXERCISES

Listing 18.50: Non-optimizing Keil 6/2013 (thumb mode)
f PROC
MOVS
LSLS
MULS
ADDS
LSLS
ADDS
LDM
BX
ENDP

r3,#0xf
r3,r3,#6
r1,r3,r1
r0,r1,r0
r1,r2,#3
r1,r0,r1
r1,{r0,r1}
lr

Listing 18.51: Optimizing GCC 4.9 (ARM64)
f:
sxtw
add
lsl
sub
ldr
ret

x1,
x0,
x2,
x1,
d0,

w1
x0, x2, sxtw 3
x1, 10
x2, x1, lsl 6
[x0,x1]

Listing 18.52: Optimizing GCC 4.4.5 (MIPS) (IDA)
f:
sll
sll
subu
addu
sll
addu
lwc1
or
lwc1
jr
or

$v0,
$a1,
$a1,
$a1,
$a2,
$a1,
$f0,
$at,
$f1,
$ra
$at,

$a1, 10
6
$v0, $a1
$a0, $a1
3
$a2
4($a1)
$zero
0($a1)
$zero

Answer G.1.10 on page 956

18.9.4

Exercise #4

What does this code do?
Try to determine the dimensions of the array, at least partially.
Listing 18.53: Optimizing MSVC 2010
_array$ = 8
_x$ = 12
_y$ = 16
_z$ = 20
_f
PROC
mov
mov
mov
shl
sub
lea
mov
lea
shl
add
mov
ret
_f
ENDP

eax,
edx,
ecx,
ecx,
ecx,
eax,
ecx,
eax,
eax,
eax,
eax,
0

DWORD
DWORD
eax
4
eax
DWORD
DWORD
DWORD
4
DWORD
DWORD

PTR _x$[esp-4]
PTR _y$[esp-4]

PTR [edx+ecx*4]
PTR _array$[esp-4]
PTR [eax+eax*4]
PTR _z$[esp-4]
PTR [ecx+eax*4]

Listing 18.54: Non-optimizing Keil 6/2013 (ARM mode)
f PROC
RSB
ADD

r1,r1,r1,LSL #4
r1,r1,r1,LSL #2
299

CHAPTER 18. ARRAYS

ADD
ADD
ADD
LDR
BX
ENDP

18.9. EXERCISES

r0,r0,r1,LSL #8
r1,r2,r2,LSL #2
r0,r0,r1,LSL #6
r0,[r0,r3,LSL #2]
lr

Listing 18.55: Non-optimizing Keil 6/2013 (thumb mode)
f PROC
PUSH
MOVS
LSLS
MULS
ADDS
MOVS
ADDS
MULS
ADDS
LSLS
LDR
POP
ENDP

{r4,lr}
r4,#0x4b
r4,r4,#8
r1,r4,r1
r0,r1,r0
r1,#0xff
r1,r1,#0x41
r2,r1,r2
r0,r0,r2
r1,r3,#2
r0,[r0,r1]
{r4,pc}

Listing 18.56: Optimizing GCC 4.9 (ARM64)
f:
sxtw
mov
add
smull
lsl
add
add
ldr
ret

x2,
w4,
x2,
x1,
x2,
x3,
x0,
w0,

w2
19200
x2, x2, lsl 2
w1, w4
x2, 4
x2, x3, sxtw
x0, x3, lsl 2
[x0,x1]

Listing 18.57: Optimizing GCC 4.4.5 (MIPS) (IDA)
f:
sll
sll
addu
sll
sll
addu
sll
subu
addu
addu
sll
addu
lw
jr
or

$v0,
$a1,
$a1,
$v0,
$a2,
$a2,
$v0,
$a1,
$a2,
$a1,
$a2,
$a1,
$v0,
$ra
$at,

$a1, 10
8
$v0
$a2, 6
4
$v0
$a1, 4
$v0, $a1
$a3
$a0, $a1
2
$a2
0($a1)
$zero

Answer G.1.10 on page 956

18.9.5

Exercise #5

What does this code do?
Listing 18.58: Optimizing MSVC 2012 /GSCOMM

_tbl:DWORD:064H

tv759 = -4
_main
PROC
push
push

; size = 4
ecx
ebx
300

CHAPTER 18. ARRAYS

push
push
xor
push
xor
xor
xor
xor
mov
mov
npad
$LL6@main:
lea
mov
mov
add
mov
lea
mov
lea
mov
mov
mov
mov
mov
mov
mov
add
inc
add
add
add
add
cmp
jl
pop
pop
pop
xor
pop
pop
ret
_main
ENDP

18.9. EXERCISES

ebp
esi
edx, edx
edi
esi, esi
edi, edi
ebx, ebx
ebp, ebp
DWORD PTR tv759[esp+20], edx
eax, OFFSET _tbl+4
8 ; align next label
ecx, DWORD PTR [edx+edx]
DWORD PTR [eax+4], ecx
ecx, DWORD PTR tv759[esp+20]
DWORD PTR tv759[esp+20], 3
DWORD PTR [eax+8], ecx
ecx, DWORD PTR [edx*4]
DWORD PTR [eax+12], ecx
ecx, DWORD PTR [edx*8]
DWORD PTR [eax], edx
DWORD PTR [eax+16], ebp
DWORD PTR [eax+20], ebx
DWORD PTR [eax+24], edi
DWORD PTR [eax+32], esi
DWORD PTR [eax-4], 0
DWORD PTR [eax+28], ecx
eax, 40
edx
ebp, 5
ebx, 6
edi, 7
esi, 9
eax, OFFSET _tbl+404
SHORT $LL6@main
edi
esi
ebp
eax, eax
ebx
ecx
0

Listing 18.59: Non-optimizing Keil 6/2013 (ARM mode)
main PROC
LDR
MOV
|L0.8|
ADD
MOV
ADD
|L0.20|
MUL
STR
ADD
CMP
BLT
ADD
CMP
MOVGE
BLT
BX
ENDP

r12,|L0.60|
r1,#0
r2,r1,r1,LSL #2
r0,#0
r2,r12,r2,LSL #3
r3,r1,r0
r3,[r2,r0,LSL #2]
r0,r0,#1
r0,#0xa
|L0.20|
r1,r1,#1
r1,#0xa
r0,#0
|L0.8|
lr

|L0.60|
DCD

||.bss||

AREA ||.bss||, DATA, NOINIT, ALIGN=2
301

CHAPTER 18. ARRAYS

18.9. EXERCISES

tbl
%

400

Listing 18.60: Non-optimizing Keil 6/2013 (thumb mode)
main PROC
PUSH
LDR
MOVS
|L0.6|
MOVS
MULS
MOVS
ADDS
|L0.14|
MOVS
MULS
LSLS
ADDS
CMP
STR
BLT
ADDS
CMP
BLT
MOVS
POP
ENDP

{r4,r5,lr}
r4,|L0.40|
r1,#0
r2,#0x28
r2,r1,r2
r0,#0
r3,r2,r4
r2,r1
r2,r0,r2
r5,r0,#2
r0,r0,#1
r0,#0xa
r2,[r3,r5]
|L0.14|
r1,r1,#1
r1,#0xa
|L0.6|
r0,#0
{r4,r5,pc}

DCW

0x0000

DCD

||.bss||

|L0.40|

AREA ||.bss||, DATA, NOINIT, ALIGN=2
tbl
%

400

Listing 18.61: Non-optimizing GCC 4.9 (ARM64)
.comm

tbl,400,8

sub
str
b

sp, sp, #16
wzr, [sp,12]
.L2

str
b

wzr, [sp,8]
.L3

ldr
ldr
mul
adrp
add
ldrsw
ldrsw
mov
lsl
add
lsl
add
str
ldr
add
str

w1,
w0,
w3,
x0,
x2,
x4,
x1,
x0,
x0,
x0,
x0,
x0,
w3,
w0,
w0,
w0,

ldr
cmp
ble

w0, [sp,8]
w0, 9
.L4

main:

.L5:

.L4:
[sp,12]
[sp,8]
w1, w0
tbl
x0, :lo12:tbl
[sp,8]
[sp,12]
x1
x0, 2
x0, x1
x0, 1
x0, x4
[x2,x0,lsl 2]
[sp,8]
w0, 1
[sp,8]

.L3:

302

16 Listing 18. $v0. 0x18+var_C($fp) loc_A0 $at. 0x18+var_10($fp) loc_7C $at.12] ldr cmp ble mov add ret w0. $a2 $v1. 1 w0. $zero sw b or # CODE XREF: main+AC $zero. [sp.L5 w0. . $v0. $v0. $v0.4. $v0. $sp $gp.12] 9 0 sp. $at. 0xA loc_24: loc_30: loc_7C: loc_A0: # CODE XREF: main+88 0x18+var_C($fp) 0x18+var_10($fp) 0x18+var_C($fp) 0x18+var_10($fp) $zero $v1 (tbl & 0xFFFF)($gp) 1 $v0.L2: [sp. $v0. $zero lw lw lw lw or mult mflo lw sll sll addu addu sll addu sw lw or addiu sw $v0. $v0. $v1. $zero $v0.9. 0x18+var_C($fp) $at. $a0. $v0. $a1.CHAPTER 18. __gnu_local_gp $gp. $at. $at. $at. $a1. $v0. w0. . -0x18 $fp. [sp.5 (MIPS) (IDA) main: var_18 var_10 var_C var_4 = = = = -0x18 -0x10 -0xC -4 addiu sw move la sw sw b or $sp. $v0. $v0. EXERCISES ldr add str w0.12] w0. 0x18+var_4($sp) $fp. $v0 0($v0) 0x18+var_10($fp) $zero 1 0x18+var_10($fp) # CODE XREF: main+28 0x18+var_10($fp) $zero 0xA loc_30 $zero 0x18+var_C($fp) $zero 1 0x18+var_C($fp) 303 . lw or slti bnez or lw or addiu sw $v0. $v0. 0x18+var_18($sp) $zero. sp. $v0.62: Non-optimizing GCC 4. w0. ARRAYS 18. $at. 2 $a1 $a0 2 $v1. $a2. lw or slti # CODE XREF: main+1C $v0. $a1.

comm tbl:0x64 # DATA XREF: main+4C Answer G. EXERCISES bnez or move move lw addiu jr or $v0. $sp. ARRAYS 18. $fp.CHAPTER 18. $ra $at. $at. $v0.10 on page 956 304 .1. loc_24 $zero $zero $fp 0x18+var_4($sp) 0x18 $zero .9. $sp.

GENERIC_WRITE | GENERIC_READ. FILE_SHARE_READ. fh=CreateFile ("file".text:7C83D429 .1 x86 Win32 API example: HANDLE fh. but it is not frugally.text:7C83D434 .1 on page 74)). 1 short loc_7C83D417 loc_7C810817 Here we see the TEST instruction.DLL in Windows XP SP3 x86.h #define #define #define #define GENERIC_READ GENERIC_WRITE GENERIC_EXECUTE GENERIC_ALL (0x80000000L) (0x40000000L) (0x20000000L) (0x10000000L) Everything is clear. but only the most significant byte (ebp+dwDesiredAccess+3) and checks it for flag 0x40 (which implies the GENERIC_WRITE flag here) TEST is basically the same instruction as AND.1 Specific bit checking 19.CHAPTER 19. GENERIC_READ | GENERIC_WRITE = 0x80000000 | 0x40000000 = 0xC0000000. 40h [ebp+var_8]. they could be substituted by a set of bool-typed variables.text:7C83D42D . and that value is used as the second argument for the CreateFile()1 function.3: KERNEL32.h: Listing 19. How would CreateFile() check these flags? If we look in KERNEL32.text:7C83D436 test mov jz jmp byte ptr [ebp+dwDesiredAccess+3].3.1. MANIPULATING SPECIFIC BIT(S) Chapter 19 Manipulating specific bit(s) A lot of functions define their input arguments as flags in bit fields. but without saving the result ( 7. but without saving the result (recall the fact CMP is merely the same as SUB. however it doesn’t take the whole second argument. We get (MSVC 2010): Listing 19. 00000080H . Of course. eax . NULL.1: MSVC 2010 push push push push push push push call mov 0 128 4 0 1 -1073741824 OFFSET $SG78813 DWORD PTR __imp__CreateFileA@28 DWORD PTR _fh$[ebp].2: WinNT. we’ll find this fragment of code in CreateFileW: Listing 19. The logic of this code fragment is as follows: 1 MSDN: CreateFile function 305 . c0000000H Let’s take a look in WinNT. FILE_ATTRIBUTE_NORMAL. NULL). 19. OPEN_ALWAYS⤦ Ç .DLL (Windows XP SP3 x86) .

Let’s try GCC 4.6) . ZF is to be set and the conditional jump is to be triggered.so.1 and Linux: #include <stdio. MANIPULATING SPECIFIC BIT(S) 19. it is easy to download both Glibc and the Linux kernel source code. }. filename 5 . and find the do_filp_open() function.6. So. flags [esp+4+filename] . eax If we take a look in the open() function in the libc.c).text:000BE69B . We get: Listing 19. ebx.h> #include <fcntl.6 library.6: do_filp_open() (linux kernel 2. SPECIFIC BIT CHECKING if ((dwDesiredAccess&0x40000000) == 0) goto loc_7C83D417 If AND instruction leaves this bit.1 main public main proc near var_20 var_1C var_4 = dword ptr -20h = dword ptr -1Ch = dword ptr -4 main push mov and sub mov mov call mov leave retn endp ebp ebp. 42h [esp+20h+var_20].5: open() (libc. esp esp. mode [esp+4+flags] . N.31.com/17066 4 See also arch\x86\include\asm\calling. The Linux 2. Of course. the bit fields for open() are apparently checked somewhere in the Linux kernel. 20h [esp+20h+var_1C]. "file" _open [esp+20h+var_4].31) 2 http://go.CHAPTER 19.1. control eventually passes to do_sys_open. and from there —to the do_filp_open() function (it’s located in the kernel source tree in fs/namei. 0FFFFFFF0h esp.4. but we are interested in understanding the matter without it. compile it in Ubuntu: make vmlinux. So. open it in IDA.text:000BE6A3 .so.h file in kernel tree 306 .4: GCC 4. This works faster since CPU does not need to access the stack in memory to read argument values. At the beginning. What this means to us is that the first 3 arguments are to be passed via registers EAX. it is only a syscall: Listing 19.6 kernel is compiled with -mregparm=3 option 3 4 .yurichev. the ZF flag is to be cleared and the JZ conditional jump is not to be triggered. 80h [esp+4+mode] .text:000BE69F . let’s download Linux Kernel 2. if the number of arguments is less than 3. Aside from passing arguments via the stack.sys_open So.com/17040 3 http://go. The conditional jump is triggered only if the 0x40000000 bit is absent in dwDesiredAccess variable —then the result of AND is 0. handle=open ("file".4. as of Linux 2. eax. and the rest via the stack. only part of registers set is to be used. there is also a method of passing some of them via registers.h> void main() { int handle. GCC has the option regparm2 . we see (the comments are mine): Listing 19.6. O_RDWR | O_CREAT). ecx. through which it’s possible to set the number of arguments that can be passed via registers. LINUX . This is also called fastcall ( 64.text:000BE6AC mov mov mov mov int edx.3 on page 664).yurichev.6.B. when the sys_open syscall is called.text:000BE6A7 . EDX and ECX. Of course. offset aFile .

40h .. ebx edi. } static struct file *path_openat(int dfd. op.8. ecx ebx..31) loc_C01EF6B4: test jnz mov shr xor and test jz or .1. pathname). acc_mode (5th arg) bl. 1 ebx. } else { 307 .. struct filename *name) { . &inode)...0 struct file *do_filp_open(int dfd. 2 0x40 —is what the O_CREAT macro equals to. . 11h edi. op. Listing 19. eax .CHAPTER 19. int *opened. struct path *path. struct filename *pathname.6. const struct open_flags *op. the compiler would not touch these registers. 3 [ebp+var_80]. esp edi esi ebx ebx. MANIPULATING SPECIFIC BIT(S) do_filp_open . error = do_last(nd. file. CODE XREF: do_filp_open+4F bl. path. ecx . .. open_flag (3th arg) short loc_C01EF684 ebx. &path.. the next JNZ instruction is triggered. edx . error = lookup_fast(nd.. ebx <. 1 esp. and if this bit is 1. SPECIFIC BIT CHECKING proc near push mov push push push mov add sub mov test mov mov mov jnz mov ebp ebp.. dfd (1th arg) [ebp+var_7C]. const struct open_flags *op. flags | LOOKUP_RCU).7: do_filp_open() (linux kernel 2. Let’s find this fragment of code: Listing 19.open_flag GCC saves the values of the first 3 arguments in the local stack. struct nameidata *nd.0. pathname. ecx . filp = path_openat(dfd. 10000h short loc_C01EF6D3 edi.1. [ebp+arg_4] . . O_CREAT loc_C01EF810 edi. &nd.8: linux kernel 3. 19.. and that would be too tight environment for the compiler’s register allocator.. &opened. open_flag gets checked for the presence of the 0x40 bit. struct file *file. struct filename *pathname. } static int do_last(struct nameidata *nd. 98h esi... 1 edi. 19.8. int flags) { .. if (!(open_flag & O_CREAT)) { . If that wasn’t done. pathname (2th arg) [ebp+var_78].. const struct open_flags *op) { .2 ARM The O_CREAT bit is checked differently in Linux kernel 3..

. R4 R12. jumptable C0169F00 default case loc_C016A128 R2. #1 R3.text:C016A128 loc_C016A128 .h> #define IS_SET(flag. SPECIFIC BIT SETTING/CLEARING . R6 . } Here is how the kernel compiled for ARM mode looks in IDA: Listing 19. [R4.. error = complete_walk(nd).. }. bit) #define SET_BIT(var.#var_54] R3.text:C0169F74 .text:C016A12C .2 Specific bit setting/clearing For example: #include <stdio. MOV R9. [R4.text:C0169F90 .. #3 R1. #0x200000 R1. int main() { f(0x12340678). 308 . [R11. #8 R3.text:C0169F78 . #1 R3. This corresponds to the source code of the do_last() function.isra. bit) ((flag) & (bit)) ((var) |= (bit)) ((var) &= ~(bit)) int f(int a) { int rt=a. [R9] . #-var_38 lookup_fast MOV BL .. . 0x200).#0x10] R12.text:C0169FA4 .text:C0169F68 . #0x40 .text:C0169F80 . REMOVE_BIT (rt.9: do_last() (vmlinux) .. return rt..#0xC] R0.text:C0169F9C . } .text:C0169ED4 ..text:C0169FB4 . R3.text:C0169F6C .text:C0169F84 .CHAPTER 19. #0 R1.2.text:C0169F98 . SET_BIT (rt. R6. R3 . R12 R3. R4..text:C0169F70 .text:C0169F8C . .text:C0169FA0 . R3. R11.#0x24] R3.text:C0169F94 .text:C0169FA8 . R3 ..R3] R2. We can “spot” visually this code fragment by the fact the lookup_fast() is to be executed in one case and complete_walk() in the other. CODE XREF: do_last. [R4.text:C016A128 . 0x4000). [R4.(4th argument) open_flag LDR R6. 19.text:C0169F7C . The O_CREAT macro equals to 0x40 here too.#0x24] R3.. }.#var_50] R3. [R11... bit) #define REMOVE_BIT(var. R4 complete_walk TST is analogous to the TEST instruction in x86. MANIPULATING SPECIFIC BIT(S) 19.open_flag TST BNE LDR ADD LDR MOV STR LDRB MOV CMP ORRNE STRNE ANDS MOV LDRNE ANDNE EORNE STR SUB BL R6..text:C0169FAC . R8 R3.14+DC R0. [R2.text:C0169F88 .text:C0169FB0 . . .text:C0169EA8 . R1.

ebp ebp 0 The OR instruction adds one more bit to value while ignoring the rest. SPECIFIC BIT SETTING/CLEARING x86 Non-optimizing MSVC We get (MSVC 2010): Listing 19. size = 4 . It is the easier way to memorize the logic. ecx edx. in the second AND operand only the bits that need to be saved are set. fffffdffH DWORD PTR _rt$[ebp]. DWORD PTR _rt$[ebp] edx.CHAPTER 19. DWORD PTR _rt$[ebp] ecx. It can be said that AND just copies all bits except one.2. DWORD PTR _a$[ebp] DWORD PTR _rt$[ebp].2.1 19. DWORD PTR _rt$[ebp] esp. edx eax. 16384 . MANIPULATING SPECIFIC BIT(S) 19. 309 . 00004000H DWORD PTR _rt$[ebp]. just the one do not want to copy is not (which is 0 in the bitmask). eax ecx. AND resets one bit.10: MSVC 2010 _rt$ = -4 _a$ = 8 _f PROC push mov push mov mov mov or mov mov and mov mov mov pop ret _f ENDP . size = 4 ebp ebp. esp ecx eax. Indeed. -513 .

Inverted 0x200 is 0xFFFFFDFF (11111111111111111110111111111). the 10th bit (counting from 1st)).e. The input value is: 0x12340678 (10010001101000000011001111000)... SPECIFIC BIT SETTING/CLEARING OllyDbg Let’s try this example in OllyDbg.1: OllyDbg: value is loaded into ECX 310 .CHAPTER 19.2. the 15th bit). We see how it’s loaded: Figure 19. MANIPULATING SPECIFIC BIT(S) 19. 0x4000 (00000000000000100000000000000) (i.e. First. let’s see the binary form of the constants we are going to use: 0x200 (00000000000000000001000000000) (i.

2: OllyDbg: OR executed 15th bit is set: 0x12344678 (10010001101000100011001111000).2.CHAPTER 19. SPECIFIC BIT SETTING/CLEARING OR got executed: Figure 19. MANIPULATING SPECIFIC BIT(S) 19. 311 .

MANIPULATING SPECIFIC BIT(S) 19. SPECIFIC BIT SETTING/CLEARING The value is reloaded again (because the compiler is not in optimizing mode): Figure 19.CHAPTER 19.3: OllyDbg: value was reloaded into EDX 312 .2.

0FFFFFDFFh eax.11: Optimizing MSVC _a$ = 8 _f PROC mov and or ret _f ENDP . SPECIFIC BIT SETTING/CLEARING AND got executed: Figure 19. -513 . DWORD PTR _a$[esp-4] eax. 4000h [ebp+var_4]. esp esp. Optimizing MSVC If we compile it in MSVC with optimization turned on (/Ox).4: OllyDbg: AND executed The 10th bit was cleared (or.12: Non-optimizing GCC f public f proc near var_4 arg_0 = dword ptr -4 = dword ptr 8 f push mov sub mov mov or and mov leave retn endp ebp ebp. fffffdffH eax. eax [ebp+var_4]. it is shorter than the MSVC version without optimization. 16384 . Now let’s try GCC with optimization turned on -O3: 313 .CHAPTER 19. [ebp+arg_0] [ebp+var_4]. all bits were left except the 10th) and the final value now is 0x12344478 (10010001101000100010001111000). [ebp+var_4] There is a redundant code present. the code is even shorter: Listing 19. 00004000H 0 Non-optimizing GCC Let’s try GCC 4. in other words. however. 10h eax.4.2.1 without optimization: Listing 19. MANIPULATING SPECIFIC BIT(S) 19. size = 4 eax.

esp and ah. it’s analogous to a NOT+AND instruction pair. ORR is “logical or”. 0FDh pop ebp retn endp f f Indeed —the first argument is already loaded in EAX. [ebp+arg_0] ebp ah. analogous to OR in x86.2 ARM + Optimizing Keil 6/2013 (ARM mode) Listing 19. #0x4000 LR BIC (BItwise bit Clear) is an instruction for clearing specific bits. 40h'' instruction occupies only 3 bytes. R0. such short functions are better to be inlined functions ( 43 on page 500). I. In 80386 almost all registers were extended to 32-bit.14: Optimizing GCC public f proc near push ebp or ah. 0FDh That’s shorter. This is just like the AND instruction. Since all x86 CPUs are successors of the 16-bit 8086 CPU. 7th (byte number) 6th 5th 4th RAXx64 3rd 2nd 1st 0th EAX AH AX AL N. R0. It is worth noting the compiler works with the EAX register part via the AH register —that is the EAX register part from the 8th to the 15th bits included. but for the sake of compatibility. esp eax. its older parts may be still accessed as AX/AH/AL.2. so it is possible to work with it in-place.. However. #0x200 R0.CHAPTER 19. MANIPULATING SPECIFIC BIT(S) 19. but GCC probably is not good enough to do such code size optimizations.B. Listing 19.15: Optimizing Keil 6/2013 (ARM mode) 02 0C C0 E3 01 09 80 E3 1E FF 2F E1 BIC ORR BX R0. 314 . SPECIFIC BIT SETTING/CLEARING Optimizing GCC Listing 19. 04000h'' but that is 5 bytes.e. Optimizing GCC and regparm It would be even shorter if to turn on the -O3 optimization flag and also set regparm=3.esp'') and epilogue (``pop ebp'') can easily be omitted here.2. It would be more logical way to emit here ``or eax. the accumulator was named EAX. or even 6 (in case the register in the first operand is not EAX). That’s why the ``or ah. these older 16-bit opcodes are shorter than the newer 32-bit ones. It is worth noting that both the function prologue (``push ebp / mov ebp. 40h mov ebp. So far it’s easy. but with inverted operand. The 16-bit CPU 8086 accumulator was named AX and consisted of two 8-bit halves —AL (lower byte) and AH (higher byte). 19. 40h ah.13: Optimizing GCC f public f proc near arg_0 = dword ptr f push mov mov pop or and retn endp 8 ebp ebp.

this value is calculated as 0x4000 ≫ 5. R1. but the generated code works correctly anyway.3 (LLVM) (ARM mode) 42 0C C0 E3 01 09 80 E3 1E FF 2F E1 BIC ORR BX R0. R0.17: Optimizing Xcode 4. #0x4000 LR The code that was generated by LLVM. R0.2. This is because it’s not possible to encode 0x1234 in a BIC instruction.3 (LLVM) (ARM mode) Listing 19. 0x4200).5 Probably a ARM: more about the BIC instruction Let’s rework the example slightly: int f(int a) { int rt=a.2. 0x1234).CHAPTER 19. You can read more about compiler anomalies here ( 91 on page 870). in source code form could be something like this: REMOVE_BIT (rt. And it does exactly what we need. }. Then the optimizing Keil 5. 16384 .3 19. REMOVE_BIT (rt. 0x4000). making 0x200 from 0x4000. MANIPULATING SPECIFIC BIT(S) 19. So that is why. w0.r0.03 in ARM mode does: f PROC BIC BIC BX ENDP r0. LR 0x4000 R1 R1. w0. with the help of ASRS (arithmetic shift right).6. -513 w0. #5 R1 . generate 0x200 and place to R1 Seems like Keil decided that the code in thumb mode. bit 0x1234 are cleared in two passes. But why 0x4200? Perhaps. that an artifact from LLVM’s optimizer 5 . 19. 0x4000 was LLVM build 2410. 0xFFFFFFFFFFFFFDFF . R0. but it’s possible to encode 0x1000 and 0x234.9 Optimizing GCC compiling for ARM64 can use the AND instruction instead of BIC: Listing 19. compiler’s optimizer error. SET_BIT (rt.00 bundled with Apple Xcode 4.2. return rt.2.18: Optimizing GCC (Linaro) 4.3 (LLVM) for thumb mode generates the same code. R0. Optimizing Xcode 4.#0x234 lr There are two BIC instructions.e.9 f: and orr ret 5 It w0.6.6.2.3 315 . #0x4200 R0.16: Optimizing Keil 6/2013 (thumb mode) 01 08 49 88 70 21 89 03 43 11 43 47 MOVS ORRS ASRS BICS BX R1.4 ARM + Optimizing Xcode 4.6.r0. is more compact than the code for writing 0x200 to an arbitrary register.#0x1000 r0. i. SPECIFIC BIT SETTING/CLEARING ARM + Optimizing Keil 6/2013 (thumb mode) Listing 19.6 ARM64: Optimizing GCC (Linaro) 4.. 19. 19.2.

1.4.8 sp. w0.28] w0. MANIPULATING SPECIFIC BIT(S) 19.3.9 f: sub str ldr str ldr orr str ldr and str ldr add ret 19. w0. There was no way to use ANDI because it’s not possible to embed the 0xFFFFFDFF number in a single instruction. but works just like optimized: Listing 19. SHIFTS ARM64: Non-optimizing GCC (Linaro) 4. etc): 16. -513 [sp. Shifting operations are also so important because they are often used for specific bit isolation or for constructing a value of several scattered bits. 0x4000 .of course..2. Shift instructions are often used in division and multiplications by powers of two: 2n (e. 0x4000 . sp.4 Specific bit setting/clearing: FPU example Here is how bits are located in the float type in IEEE 754 form: 31 30 S exponent 23 22 0 mantissa or fraction ( S—sign ) The sign of number is in the MSB6 . #32 [sp.2 on page 201.19: Non-optimizing GCC (Linaro) 4. w0. 8. 2. w0. $a0.g.CHAPTER 19.1. The x86 ISA has the SHL (SHift Left) and SHR (SHift Right) instructions for this. w0. so the compiler has to load 0xFFFFFDFF into register $V0 first and then generate AND which takes all its values from registers.5 (IDA) f: .28] sp. 16384 [sp. Will it be possible to change the sign of a floating point number without any FPU instructions? 6 Most significant bit/byte 316 .9 Non-optimizing GCC generates more redundant code.2. w0. 19. sp. $a0=a ori $a0. w0.12] [sp.2.28] w0.3 Shifts Bit shifts in C/C++ are implemented using ≪ and ≫ operators.28] [sp. 32 . the OR operation.28] [sp.28] [sp. at finish: $v0 = $a0&$v0 = a|0x4000 & 0xFFFFFDFF ORI is. 0xFFFFFFFFFFFFFDFF MIPS Listing 19.7 19.12] [sp. But after that we have AND. w0. w0. 0xFFFFFDFF jr $ra and $v0. w0. 4. 19. 16. “I” in the instruction name mean that the value is embedded in the machine code.1 on page 205. $v0 . $a0=a|0x4000 li $v0.20: Optimizing GCC 4.

("%f\n". -2147483648 DWORD PTR _tmp$[esp-4] . the XOR operation applied with 0 does nothing.456)). it’s an idle operation. ("%f\n".CHAPTER 19. SPECIFIC BIT SETTING/CLEARING: FPU EXAMPLE #include <stdio. MANIPULATING SPECIFIC BIT(S) 19.2 x86 The code is pretty straightforward: Listing 19. ("%f\n". 2147483647 DWORD PTR _tmp$[esp-4] 0 . set_sign() sets MSB and negate() flips it..h> float my_abs (float i) { unsigned int tmp=(*(unsigned int*)&i) & 0x7FFFFFFF. }. ("%f\n". }. }. int main() { printf printf printf printf printf printf printf printf printf }.e. my_abs (123. 80000000H 317 .456)). the XOR operation applied with 1 is effectively inverting a bit: input A 0 0 1 1 input B 0 1 0 1 output 0 1 1 0 And on the contrary.4. Indeed. 19. return *(float*)&tmp. negate (123. return *(float*)&tmp. my_abs (-456.4. set_sign (123.4.123)). ("my_abs():\n"). i. This is a very important property of the XOR operation and it’s highly recommended to memorize it. set_sign (-456.123)).1 So there are three functions: A word about the XOR operation XOR is widely used when one need just to flip specific bit(s).123)). return *(float*)&tmp. 7fffffffH DWORD PTR _i$[esp-4]. ("set_sign():\n"). ("negate():\n"). ("%f\n". float set_sign (float i) { unsigned int tmp=(*(unsigned int*)&i) | 0x80000000.456)). negate (-456.21: Optimizing MSVC 2012 _tmp$ = 8 _i$ = 8 _my_abs PROC and fld ret _my_abs ENDP _tmp$ = 8 _i$ = 8 _set_sign PROC or fld DWORD PTR _i$[esp-4]. float negate (float i) { unsigned int tmp=(*(unsigned int*)&i) ^ 0x80000000. ("%f\n". We need this trickery in C/C++ to copy to/from float value without actual conversion. my_abs() resets MSB. 19.

the result is copied into XMM0. eax xmm0. The 31st bit is MSB.4. 31 DWORD PTR tmp$[rsp]. 0x7FFFFFFF $v0. DWORD PTR i$[rsp] eax. then it is copied into the local stack and then we see some instructions that are new to us: BTR. 31 DWORD PTR tmp$[rsp].4. SPECIFIC BIT SETTING/CLEARING: FPU EXAMPLE 0 DWORD PTR _i$[esp-4]. DWORD PTR i$[rsp] eax.4.5 (IDA) my_abs: .4. eax xmm0. BTC. DWORD PTR tmp$[rsp] 0 DWORD PTR [rsp+8]. because floating-point numbers are returned in this register. move from coprocessor mfc1 li . $v0=0x7FFFFFFF .3 MIPS GCC 4. 80000000H An input value of type float is taken from the stack. but treated as an integer value. 19. Now let’s try optimizing MSVC 2012 for x64: Listing 19. Finally.5 for MIPS does mostly the same: Listing 19. $v1 318 .22: Optimizing MSVC 2012 x64 tmp$ = 8 i$ = 8 my_abs PROC movss mov btr mov movss ret my_abs ENDP _TEXT ENDS tmp$ = 8 i$ = 8 set_sign PROC movss mov bts mov movss ret set_sign ENDP tmp$ = 8 i$ = 8 negate PROC movss mov btc mov movss ret negate ENDP DWORD PTR [rsp+8]. MANIPULATING SPECIFIC BIT(S) ret _set_sign ENDP _tmp$ = 8 _i$ = 8 _negate PROC xor fld ret _negate ENDP 19. -2147483648 DWORD PTR _tmp$[esp-4] 0 . DWORD PTR i$[rsp] eax. setting (BTS) and inverting (or complementing: BTC) specific bits. These instructions are used for resetting (BTR). DWORD PTR tmp$[rsp] 0 The input value is passed in XMM0. xmm0 eax. AND and OR reset and set the desired bit. because floating point values are returned through XMM0 in Win64 environment. xmm0 eax. eax xmm0. the modified value is loaded into ST0. $f12 $v0. BTS. counting from 0. 31 DWORD PTR tmp$[rsp].23: Optimizing GCC 4. do AND: and 1: $v1. xmm0 eax. Finally. DWORD PTR tmp$[rsp] 0 DWORD PTR [rsp+8]. XOR flips it.CHAPTER 19.

move to coprocessor 1: mtc1 $v0. do XOR: EOR BX ENDP r0. lui $v1. ARM has the BIC instruction. . MANIPULATING SPECIFIC BIT(S) 19. branch delay slot One single LUI instruction is used to load 0x80000000 into a register. which explicitly clears specific bit(s). branch delay slot negate: . $f12 0x8000 $v1. move to coprocessor 1: mtc1 $v0. . do OR: or $v0. move from coprocessor 1: mfc1 $v0. SPECIFIC BIT SETTING/CLEARING: FPU EXAMPLE . EOR is the ARM instruction name for XOR (“Exclusive OR”). . lui $v1. return jr $ra or $at. .#0x80000000 lr r0. $f12 0x8000 $v1. clear bit: BIC BX ENDP set_sign PROC . return jr $ra or $at. $zero .r0. move from coprocessor 1: mfc1 $v0. do OR: ORR BX ENDP negate PROC .#0x80000000 lr So far so good. $v0 $f0 $zero . because LUI is clearing the low 16 bits and these are zeroes in the constant.4.24: Optimizing Keil 6/2013 (ARM mode) my_abs PROC . $v1=0x80000000 .r0. $v0 $f0 $zero . branch delay slot set_sign: .CHAPTER 19.4.r0. 19.r0. $f0 .25: Optimizing Keil 6/2013 (thumb mode) my_abs PROC LSLS . so one LUI without subsequent ORI is enough. . . return jr $ra or $at.#0x80000000 lr r0. move to coprocessor 1: mtc1 $v0.#1 319 . Optimizing Keil 6/2013 (thumb mode) Listing 19. do XOR: xor $v0. $v1=0x80000000 . r0=i<<1 r0.4 ARM Optimizing Keil 6/2013 (ARM mode) Listing 19.

r0. S0 R3. S0 R3.CHAPTER 19. what’s our goal? The goal is to flip the MSB. Optimizing GCC 4. so here a MOVS/LSLS instruction pair is used for forming the 0x80000000 constant. R3 LR R2. so S-registers are used here for floating point numbers instead of R-registers. but the instruction “ADD register. First of all.26: Optimizing GCC 4. R2. so let’s forget about the XOR operation. The FMRS instruction copies data from GPR to the FPU and back. 0x80000000”. r1=1 LSLS r1. R2. MANIPULATING SPECIFIC BIT(S) LSRS .r1. copy from R3 to S0: FMSR BX negate . r1=1 LSLS r1. 0x80000000” works just like “XOR register. my_abs() and set_sign() looks as expected. It works like this: 1 << 31 = 0x80000000.3 for Raspberry Pi (ARM mode) my_abs . R2. r0=(i<<1)>>1 BX ENDP 19.#1 . do OR: ORR . When the subsequent result >> 1 statement is executed.6.r1 . ARM mode) Listing 19.6. r1=1<<31=0x80000000 EORS r0. but MSB is zero. But nevertheless. r1=1<<31=0x80000000 ORRS r0. The code of my_abs is weird and it effectively works like this expression: (i << 1) >> 1. copy from R3 to S0: FMSR BX set_sign .#31 .r1. copy from S0 to R2: FMRS . when input << 1 is executed. S0 R3.#1 .r0. because all “new” bits appearing from the shift operations are always zeroes. copy from S0 to R2: FMRS . clear bit: BIC . From school-level math320 . copy from R3 to S0: FMSR BX R2. all bits are now in their own places. #0x80000000 S0. That is how the LSLS/LSRS instruction pair clears MSB. R3 LR I run Raspberry Pi Linux in QEMU and it emulates an ARM FPU. the MSB (sign bit) is just dropped.r0. do ADD: ADD . #0x80000000 S0. R3 LR R2.3 (Raspberry Pi.#1 lr set_sign PROC MOVS r1.r1 . r0=r0 | 0x80000000 BX lr ENDP negate PROC MOVS r1. SPECIFIC BIT SETTING/CLEARING: FPU EXAMPLE r0. copy from S0 to R2: FMRS .#31 . but negate()? Why is there ADD instead of XOR? It’s hard to believe.4. #0x80000000 S0. r0=r0 ^ 0x80000000 BX lr ENDP Thumb mode in ARM offers 16-bit instructions and not much data can be encoded in them. This statement looks meaningless.

Adding 0x80000000 to any value never affects the lowest 31 bits. int rt=0. #include <stdio. I’m not sure why GCC decided to do this. 1<<i)) rt++. This operation is also called “population count” 7 . In this loop. But here we operate in binary base and 0x80000000 is 100000000000000000000000000000000. return rt. we would say shift 1 by n bits left.. }. only the highest bit is set. for (i=0. Describing this operation in natural language. so the 1 ≪ i statement is counting from 1 to 0x80000000. Here is a table of all possible 1 ≪ i for i = 0 . int main() { f(0x12345678). 19. COUNTING BITS SET TO 1 ematics we may remember that adding values like 1000 to other values never affects the last 3 digits. but affects only the MSB. Adding 1 to 1 is resulting in 10 in binary form. but it works correctly.CHAPTER 19. For example: 1234567 + 10000 = 1244567 ( last 4 digits are never affected). i++) if (IS_SET (a.e.5. MANIPULATING SPECIFIC BIT(S) 19. // test }.h> #define IS_SET(flag. but the 32th bit (counting from zero) gets dropped. the iteration count value i is counting from 0 to 31. bit) ((flag) & (bit)) int f(unsigned int a) { int i. i<32. 1 ≪ i statement consequently produces all possible bit positions in a 32-bit number. 31: 7 modern x86 CPUs (supporting SSE4) even have a POPCNT instruction for it 321 .5 Counting bits set to 1 Here is a simple example of a function that calculates the number of bits set in the input value. . . The freed bit at right is always cleared. i. because our registers are 32 bit wide. Adding 1 to 0 is resulting in 1. so the result is 0. That’s why XOR can be replaced by ADD here. In other words.

here is excerpt from ssl_private.1 x86 MSVC Let’s compile (MSVC 2010): Listing 19. but the hexadecimal ones are very easy to remember.5.5.4. or the bit mask.27: MSVC 2010 _rt$ = -8 .CHAPTER 19. that is why it always works correctly.6 source code: /** * Define the SSL options */ #define SSL_OPT_NONE #define SSL_OPT_RELSET #define SSL_OPT_STDENVVARS #define SSL_OPT_EXPORTCERTDATA #define SSL_OPT_FAKEBASICAUTH #define SSL_OPT_STRICTREQUIRE #define SSL_OPT_OPTRENEGOTIATE #define SSL_OPT_LEGACYDNFORMAT (0) (1<<0) (1<<1) (1<<3) (1<<4) (1<<5) (1<<6) (1<<7) Let’s get back to our example. For example. if the bit is present. MANIPULATING SPECIFIC BIT(S) 19. You probably haven’t to memorize the decimal numbers. These constants are very often used for mapping flags to specific bits. The if() operator in C/C++ triggers if the expression in it is not zero. COUNTING BITS SET TO 1 C/C++ expression 1≪0 1≪1 1≪2 1≪3 1≪4 1≪5 1≪6 1≪7 1≪8 1≪9 1 ≪ 10 1 ≪ 11 1 ≪ 12 1 ≪ 13 1 ≪ 14 1 ≪ 15 1 ≪ 16 1 ≪ 17 1 ≪ 18 1 ≪ 19 1 ≪ 20 1 ≪ 21 1 ≪ 22 1 ≪ 23 1 ≪ 24 1 ≪ 25 1 ≪ 26 1 ≪ 27 1 ≪ 28 1 ≪ 29 1 ≪ 30 1 ≪ 31 Power of two 1 21 22 23 24 25 26 27 28 29 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 Decimal form 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456 536870912 1073741824 2147483648 Hexadecimal form 1 2 4 8 0x10 0x20 0x40 0x80 0x100 0x200 0x400 0x800 0x1000 0x2000 0x4000 0x8000 0x10000 0x20000 0x40000 0x80000 0x100000 0x200000 0x400000 0x800000 0x1000000 0x2000000 0x4000000 0x8000000 0x10000000 0x20000000 0x40000000 0x80000000 These constant numbers (bit masks) very often appear in code and a practicing reverse engineer must be able to spot them quickly. The IS_SET macro is in fact the logical AND operation (AND) and it returns 0 if the specific bit is absent there. size = 4 322 .h from Apache 2. 19. The IS_SET macro checks bit presence in a. it might be even 123456.

1 ecx. cl edx. size = 4 ebp ebp.CHAPTER 19. ebp ebp 0 323 . not zero increment rt SHORT $LN3@f eax. DWORD PTR _i$[ebp] eax.5. DWORD PTR _i$[ebp] edx. esp esp. 8 DWORD PTR _rt$[ebp]. size = 4 . 00000020H . EDX=EDX<<CL . 1 DWORD PTR _rt$[ebp]. DWORD PTR _a$[ebp] SHORT $LN1@f . eax . . 0 SHORT $LN4@f eax. eax . result of AND instruction was 0? then skip next instructions no. COUNTING BITS SET TO 1 . . 0 DWORD PTR _i$[ebp]. DWORD PTR _rt$[ebp] eax. 32 SHORT $LN2@f edx. . loop finished? eax. increment of i DWORD PTR _i$[ebp]. 1 DWORD PTR _i$[ebp]. MANIPULATING SPECIFIC BIT(S) _i$ = -4 _a$ = 8 _f PROC push mov sub mov mov jmp $LN3@f: mov add mov $LN4@f: cmp jge mov mov shl and je mov add mov $LN1@f: jmp $LN2@f: mov mov pop ret _f ENDP 19. DWORD PTR _rt$[ebp] esp.

we see how i is loaded into ECX: Figure 19.CHAPTER 19. For i = 1. 324 .5. Let the input value be 0x12345678. MANIPULATING SPECIFIC BIT(S) 19.5: OllyDbg: i = 1. i is loaded into ECX EDX is 1. SHL is to be executed now. COUNTING BITS SET TO 1 OllyDbg Let’s load this example into OllyDbg.

CHAPTER 19.5. This is a bit mask. COUNTING BITS SET TO 1 SHL was executed: Figure 19. MANIPULATING SPECIFIC BIT(S) 19.6: OllyDbg: i = 1. EDX =1 ≪ 1 = 2 EDX contain 1 ≪ 1 (or 2). 325 .

7: OllyDbg: i = 1. there is no corresponding bit in the input value. (ZF =1) So. The piece of code. which implies that the input value (0x12345678) ANDed with 2 results in 0: Figure 19. MANIPULATING SPECIFIC BIT(S) 19.CHAPTER 19. COUNTING BITS SET TO 1 AND sets ZF to 1. is there that bit in the input value? No. which increments the counter is not to be 326 . executed: the JZ instruction bypassing it.5.

5.CHAPTER 19. COUNTING BITS SET TO 1 I traced a bit further and i is now 4.8: OllyDbg: i = 4. MANIPULATING SPECIFIC BIT(S) 19. i is loaded into ECX 327 . SHL is to be executed now: Figure 19.

MANIPULATING SPECIFIC BIT(S) 19.CHAPTER 19.9: OllyDbg: i = 4. 328 . COUNTING BITS SET TO 1 EDX =1 ≪ 4 (or 0x10 or 16): Figure 19.5. EDX =1 ≪ 4 = 0x10 This is another bit mask.

1 cmp jle mov add pop pop retn [ebp+i]. esp ebx esp. The function returns 13.4. ebx eax. 10h [ebp+rt]. is there that bit in the input value? Yes. This bit counts: the jump is not triggering and the bit counter incrementing. eax ebx. [ebp+arg_0] eax. COUNTING BITS SET TO 1 AND is executed: Figure 19. 0 [ebp+i].1: Listing 19. 1Fh short loc_80483D0 eax. 10h ebx ebp loc_80483D0: loc_80483EB: loc_80483EF: 329 . eax short loc_80483EB [ebp+rt].5.28: GCC 4. [ebp+rt] esp. cl eax. 0x12345678 & 0x10 = 0x10. GCC Let’s compile it in GCC 4. Indeed.10: OllyDbg: i = 4.4.CHAPTER 19. (ZF =0) ZF is 0 because this bit is present in the input value. 1 add [ebp+i]. MANIPULATING SPECIFIC BIT(S) 19.1 f public f proc near rt i arg_0 = dword ptr -0Ch = dword ptr -8 = dword ptr 8 push mov push sub mov mov jmp ebp ebp. edx ecx. [ebp+i] edx. This is total number of bits set in 0x12345678. 0 short loc_80483EF mov mov mov mov shl mov and test jz add eax. 1 ebx.

0 .5.h> #define IS_SET(flag.8.L2: cmp QWORD PTR [rbp-8]. Non-optimizing GCC 4. je . RDX = a mov ecx. 1 . EAX = EAX&1 = (a>>i)&1 test rax.2 So far so easy.2 1 f: 330 . i++) if (IS_SET (a. rax . bit) ((flag) & (bit)) int f(uint64_t a) { uint64_t i.5. rt++ . 1ULL<<i)) rt++. Listing 19. }. eax . RAX = RDX = a>>i and eax. return rt.2 f: push mov mov mov mov jmp rbp rbp. i++ . if so mov eax. rdi .8. 1 .8. QWORD PTR [rbp-24] . rdx .L2 . ECX = i shr rdx. the last bit is zero? . 0 .L4 .L3 add DWORD PTR [rbp-12]. i<63? jbe . i<64. cl . i=0 . rt=0 QWORD PTR [rbp-8].2 Listing 19. MANIPULATING SPECIFIC BIT(S) f 19. return rt pop rbp ret Optimizing GCC 4. if it was so. jump to the loop body begin.30: Optimizing GCC 4.CHAPTER 19. RDX = RDX>>CL = a>>i mov rax. 63 .h> #include <stdint. 1 . int rt=0. COUNTING BITS SET TO 1 endp 19. QWORD PTR [rbp-8] mov rdx. RAX = i.29: Non-optimizing GCC 4.2 x64 I modified the example slightly to extend it to 64-bit: #include <stdio. skip the next ADD instruction. a DWORD PTR [rbp-12].L3: add QWORD PTR [rbp-8]. for (i=0.8. DWORD PTR [rbp-12] .L4: mov rax. rsp QWORD PTR [rbp-24].

the CMOVNE 8 instruction (which is a synonym for CMOVNZ 9 ) commits the new value of “rt” by moving EDX (“proposed rt value”) into EAX (“current rt” to be returned at the end). load input value [rax+1] . MANIPULATING SPECIFIC BIT(S) 2 3 4 5 6 7 8 9 10 11 12 13 14 15 xor xor eax.e. i variable will be in ECX register . RCX++ 64 .1 on page 456.CHAPTER 19. the incrementing is done at each step of loop. It’s just like an inverted i. EDX=EAX+1 version of rt". AKA fatret This code is terser. R8D=64 npad 5 $LL4@f: test rdx. ecx 19. add rcx. writing the new value into register EDX.. and esi. 64 times. p. RDX=RDX<<1 dec r8 . je SHORT $LN3@f inc eax . COUNTING BITS SET TO 1 . i. write "new version of rt" into EAX edx 1 . R8-jne SHORT $LL4@f fatret 0 f ENDP Here the ROL instruction is used instead of SHL.5. EDX here is a "new shr rsi.yurichev. This is somewhat optimized version of RET. eax ecx.L3 rep ret rdi . Thus. rcx . eax mov edx. R8 here is counting from 64 to 0. In all examples that we see so far. skip the next INC instruction then. Here is a table of some registers during the execution: 8 Conditional MOVe if Not Equal MOVe if Not Zero 10 More information on it: http://go.L3: mov rsi. jne . rt++ $LN3@f: rol rdx. QWORD PTR [rax+64] . if RET goes right after conditional jump: [AMD13b. but has a quirk. the last bit is 1? cmovne eax.15] 10 . . The advantage of this code is that it contain only one conditional jump (at the end of the loop) instead of two jumps (skipping the “rt” value increment and at the end of loop). but the code here increments “rt” before (line 6). The last instruction is REP RET (opcode F3 C3) which is also called FATRET by MSVC.6. Hence. 1 lea r8d. 1 .3 on page 939. if the last bit is 1 cl . if the last bit is 1. Optimizing MSVC 2010 Listing 19. rt variable will be in EAX register . cmp rcx. .31: MSVC 2010 a$ = 8 f PROC . but in this example it works just as SHL. RSI=RSI>>CL 1 . there are no such bit in input value? . lea edx. You can read more about the rotate instruction here: A. ESI=ESI&1 If so. without any relation to the input value.com/17328 9 Conditional 331 . which is in fact “rotate left” instead of “shift left”. RCX = input value xor eax. we were incrementing the “rt” value after comparing a specific bit. which will be written into rt variable. And that might work faster on the modern CPUs with branch predictors: 33. which is recommended by AMD to be placed at the end of function.

#1 .. which may be added to such instructions as MOV. 0x4000000000000000 0x8000000000000000 R8 64 63 62 61 .5. 2 1 At the end we see the FATRET instruction.3 (LLVM) (ARM mode) Listing 19. set flags according to R1 & (R2<<R3) R3. 1 . there are no separate shifting instructions in ARM mode.2. rt++ $LN11@f: rol rdx. but somehow. rcx je SHORT $LN11@f inc eax . I don’t know why. but perfectly working.6.LSL R3 . COUNTING BITS SET TO 1 RDX 0x0000000000000001 0x0000000000000002 0x0000000000000004 0x0000000000000008 .1 on page 489).32: MSVC 2012 a$ = 8 f PROC .5. then R0++ R3.6. I have added the code here intentionally to show that sometimes the compiler output may be really weird and illogical. pass 2 -----------------------------------test rdx. 19. EDX = 1.2 on page 331.5. #1 . ASR (Arithmetic Shift Right). QWORD PTR [rax+32] . R0.3 (LLVM) (ARM mode) MOV MOV MOV MOV R1. R2. R3. 1 .33: Optimizing Xcode 4. 11 These instructions are also called “data processing instructions” 332 . RDX=RDX<<1 . R3++ R0. TST.. eax mov edx. As I mentioned before ( 41. R0 #0 #1 R0 TST ADD ADDNE CMP BNE BX R1. ADD. rcx je SHORT $LN3@f inc eax . CMP.. #32 loc_2E54 LR loc_2E54 TST is the same things as TEST in x86. To be honest. However.. RDX=RDX<<1 . it generates two identical loop bodies and the loop count is now 32 instead of 64. MANIPULATING SPECIFIC BIT(S) 19. LSR (Logical Shift Right). which was explained here: 19. ROR (Rotate Right) and RRX (Rotate Right with Extend) . Some optimization trick? Maybe it’s better for the loop body to be slightly longer? Anyway. R8D = 32 npad 5 $LL4@f: . R2.CHAPTER 19. SUB. if ZF flag is cleared by TST. rt++ $LN3@f: rol rdx. RSB11 . ------------------------------------------dec r8 . pass 1 -----------------------------------test rdx. RCX = input value xor eax. ------------------------------------------.3 ARM + Optimizing Xcode 4. R8-jne SHORT $LL4@f fatret 0 f ENDP Optimizing MSVC 2012 does almost the same job as optimizing MSVC 2010. R0. 1 lea r8d. there are modifiers LSL (Logical Shift Left). Optimizing MSVC 2012 Listing 19. R3.

28] .LSL R3'' instruction works here as R1 ∧ (R2 ≪ R3). result of TST was non-zero? . x1. The CSEL instruction is “Conditional SELect”.5.34: Optimizing GCC (Linaro) 4. Listing 19. #1 loc_2F7A 19.35: Non-optimizing GCC (Linaro) 4.2 on page 330. MOV MOVS MOV.L2 . 64 .5. #32 x0. MANIPULATING SPECIFIC BIT(S) 19. COUNTING BITS SET TO 1 These modificators define how to shift the second operand and by how many bits.2 on page 330. w3.9 And again.28] 1 x1. Thus the ``TST R1. Listing 19.5. [sp. add w3. bne . R3 R2.9 I took the 64-bit example I already used: 19. R9.8 f: mov mov mov w2. it just choose one variable of two depending on the flags set by TST and copies the value into W2 . 19. [sp.W/TST instructions are used instead of a single TST. 0 x5. w2. The code is more verbose.L4: ldr mov lsl mov . x1 .30 on page 330.5 R0. tst x4.24] wzr. R1 R3. [sp. X0 = X1<<X0 = 1<<i 333 . R3. add w1. R3. as usual. w1. w2 .6 ARM64 + Non-optimizing GCC 4. rt=0 . R2. the 64-bit example I already used: 19. #1 R3. [sp. R9. 1 . R0.W MOVS R1.6. R0 #0 #1 #0 LSL. w2.4 ARM + Optimizing Xcode 4.8 f: sub str str str b sp. which holds the “rt” variable. sp. rt=0 . x5. #32 loc_2F7A LR ARM64 + Optimizing GCC 4.L2: lsl x4. otherwise: w2=w2 or rt=rt (idle csel w2. w2 . ne cmp w1. x0 x0 . because in thumb mode it is not possible to define LSL modifier directly in TST. 1 . i=0 . x0. . ret w4 = w5<<w1 = 1<<i new_rt=rt+1 (1<<i) & a i++ operation) i<64? yes return rt The result is very similar to what GCC generates for x64: 19.CHAPTER 19.5.L2 . store "a" value to Register Save Area . 1 w1. . but here are two LSL. mov w0.W IT NE ADDNE CMP BNE BX R2.3 (LLVM) (thumb-2 mode) Almost the same. 19.5. x0 .8] wzr.W TST ADD. x1.5. X1 = 1<<1 w0. then w2=w3 or rt=new_rt.

load delay slot. $v0.L3: . 0x18+var_4($sp) move $fp.24] add sp. 0x18+rt($fp) or $at. [sp.7 MIPS Non-optimizing GCC Listing 19. 0x18+a($fp) . $at. 1 0x18+i($fp) $zero . $v0 move lw or and $v1.24] . IDA is not aware of variable names. $zero . [sp. $v0 = 1<<i .36: Non-optimizing GCC 4.5.L2: .28] . X0 = a and x0. 0x18+rt($fp) sw $zero.28] add w0. -0x18 sw $fp. load delay slot. skipping "rt" increment cmp x0. $zero . $sp sw $a0. X0 = X1&X0 = (1<<i) & a .L3 .24] add w0. $v0 0x18+a($fp) $zero .L4 . w0. 0x18+i($fp) . initialize rt and i variables to zero: sw $zero. return rt ldr w0. NOP loc_20: li lw or sllv $v1.L3. $v0 = a&(1<<i) . x0 .5 (IDA) f: . w0. $v0. I gave them manually: rt = -0x10 i = -0xC var_4 = -4 a = 0 addiu $sp. 1 str w0. that means a&(1<<i)!=0.CHAPTER 19.28] cmp w0. load delay slot. sp. $v0. rt++ ldr w0.5. so increment "rt" then: lw $v0. [sp. no jump occurred. MANIPULATING SPECIFIC BIT(S) ldr 19. 63 ble .8] .L4 ldr w0. jump to loop check instructions: b loc_68 or $at. $zero . branch delay slot. 1 sw $v0. 32 ret 19. NOP $v1. NOP addiu $v0.4. 1 str w0. [sp. is a&(1<<i) equals to zero? jump to loc_58 then: beqz $v0. NOP $v1. COUNTING BITS SET TO 1 x0. $v0 . i++ ldr w0. x1. [sp. X0 contain zero? then jump to . [sp. 0x18+rt($fp) 334 . $at. [sp. loc_58 or $at. i<=63? then jump to . xzr beq . $v0.

0x18 . $a0=a .5. $t0. 0x20 # ' ' bnez $v0. Why? It’s possible to replace the first SLLV instruction with an unconditional branch instruction that jumps right to the second SLLV. branch delay slot. 0x18+i($fp) $zero . NOP . load delay slot jr $ra or $at. increment i: lw or addiu sw $v0. jump to loc\_28 if a&(1<<i)==0 and increment rt: beqz $a1. 0x18+rt($fp) move $sp. function epilogue. as a consequence).5 (IDA) f: . NOP That is verbose: all local variables are located in the local stack and reloaded each time they’re needed. Optimizing GCC That is terser. load delay slot lw $fp. NOP 335 . $a1 = $t0<<$v1 = 1<<i loc_14: and $a1.4. load i and compare it with 0x20 (32). 1 . increment i: addiu $v1. rt variable will reside in $v0: move $v0. it differs from SLL only in that the shift amount is encoded in the SLL instruction (and is fixed. $v1 . $zero li $t0. save updated rt into $v0: move $v0. COUNTING BITS SET TO 1 loc_58: . $v0. $v1 . 0x18+i($fp) or $at. $fp .1 on page 456. load delay slot. loc_20 or $at. but SLLV takes shift amount from a register. Listing 19. $a2 loc_28: .37: Optimizing GCC 4. $at. jump to loc_14 and also prepare next shifted value: bne $v1. i variable will reside in $v1: move $v1. The SLLV instruction is “Shift Word Left Logical Variable”. loc_28 addiu $a2. 32 sllv $a1. $zero . branch delay slot. . branch delay slot. $v0. There are two shift instructions instead of one. $zero . if BEQZ was not triggered. $v0. $t0. load delay slot. $a0 . return jr $ra or $at. $zero . if i!=32. loc_14 sllv $a1. and it’s always favorable to get rid of them: 33. $a3. $a1 = a&(1<<i) . 1 li $a3. $zero . return rt: lw $v0. But this is another branching instruction in the function. NOP slti $v0. NOP 1 0x18+i($fp) loc_68: .CHAPTER 19. jump to loc_20 if it is less then 0x20 (32): lw $v0. 1 . 0x18+var_4($sp) addiu $sp. MANIPULATING SPECIFIC BIT(S) 19. $zero .

#0x40 BNE is_set . bit is set Listing 19.38: C/C++ if (input&0x40) .40: x86 TEST REG. CL AND REG.6.39: x86 TEST REG. The shift instructions in ARM are LSR/LSL (for unsigned values) and ASR/LSL (for signed values).6.1 Check for specific bit (known at compile stage) Test if the 1000000 bit (0x40) is present in the register’s value: Listing 19... Listing 19.6... MANIPULATING SPECIFIC BIT(S) 19. CL=n SHR REG. 40h JZ is_cleared . This is usually implemented in x86 code as: Listing 19. then cut off lowest bit): Listing 19. REG 336 ..6 19. AND is used instead of TEST. 19. This is usually implemented in x86 code as: Listing 19. CL=n MOV REG. but the flags that are set are the same.2 Check for specific bit (specified at runtime) This is usually done by this C/C++ code snippet (shift value by n bits right... REG=input_value . 1 Or (shift 1 bit n times left. isolate this bit in input value and check if it’s not zero): Listing 19. 19.42: C/C++ if ((value>>n)&1) . 1 SHL REG. CONCLUSION Conclusion Analogous to the C/C++ shifting operators ≪ and ≫. bit is not set Listing 19. CL AND input_value.CHAPTER 19.. bit is not set Sometimes. 40h JNZ is_set .45: x86 . It’s also possible to add shift suffix to some instructions (which are called “data processing instructions”).44: C/C++ if (value & (1<<n)) . the shift instructions in x86 are SHR/SHL (for unsigned values) and SAR/SHL (for signed values).43: x86 .41: ARM (ARM mode) TST REG.

CL NOT REG AND input_value.6. CL OR input_value.55: C/C++ value=value&(~(1<<n)). R0.51: C/C++ value=value&(~0x40).47: x86 OR REG.4 Set specific bit (specified at runtime) Listing 19.6. Listing 19.53: x64 AND REG. Listing 19.54: ARM (ARM mode) BIC R0. CONCLUSION Set specific bit (known at compile stage) Listing 19. 40h Listing 19. REG 337 . 1 SHL REG. This is usually implemented in x86 code as: Listing 19.6. which works like the NOT+AND instruction pair: Listing 19.52: x86 AND REG.5 Clear specific bit (known at compile stage) Just apply AND operation with the inverted value: Listing 19. REG 19.6 Clear specific bit (specified at runtime) Listing 19. ARM in ARM mode has BIC instruction. CL=n MOV REG. MANIPULATING SPECIFIC BIT(S) 19.56: x86 . Listing 19.48: ARM (ARM mode) and ARM64 ORR R0. R0. 1 SHL REG. CL=n MOV REG. 0FFFFFFBFh Listing 19. #0x40 19.CHAPTER 19.46: C/C++ value=value|0x40.3 19.6.50: x86 . 0FFFFFFFFFFFFFFBFh This is actually leaving all bits set except one.6. #0x40 19.49: C/C++ value=value|(1<<n).

0 DWORD PTR _a$[esp-4] ecx ecx 16 .r1.#0xff00 r1.r2 r2. w0 Listing 19.r2. edx.r1.r0.r1. $v1. eax.r0.LSR r2.#16 r2.r2 r0. ecx. 00000010H ecx 8 8 edx Listing 19.r3 r1.7 19.r2. eax.r3 r1.r0.5 (MIPS) (IDA) f: srl sll sll or $v0.#8 r3.57: Optimizing MSVC 2010 _a$ = 8 _f PROC mov mov mov shl and or mov and shr or shl shr or ret _f ENDP ecx. 8 $v0 338 .7. edx.#8 r2. 24 $a0.61: Optimizing GCC 4.r0.r0. 24 $a0. edx. 0000ff00H edx ecx 16711680 .#24 r0. eax. MANIPULATING SPECIFIC BIT(S) 19.60: Optimizing GCC 4. eax.r1. edx.r0.59: Optimizing Keil 6/2013 (thumb mode) f PROC MOVS LSLS LSLS ANDS LSRS ORRS LSRS ASRS ANDS ORRS LSLS ORRS BX ENDP r3.58: Optimizing Keil 6/2013 (ARM mode) f PROC MOV AND MOV ORR AND ORR ORR BX ENDP r1. EXERCISES Exercises 19.LSL r2.r3.r1.r0.7.r1.r2 r0.r0. $a0.#8 r3.r0.r2.4.CHAPTER 19. $a1.LSL lr #8 #24 #8 #24 Listing 19.9 (ARM64) f: rev ret w0.r1 lr Listing 19.#24 r1. edx. $v1. 00ff0000H 16 .#0xff0000 r1. eax. edx. 00000010H 65280 .r3.#0xff r2.1 Exercise #1 What does this code do? Listing 19.LSR r1.

MANIPULATING SPECIFIC BIT(S) lui and srl or andi jr or $v0. edx edx. ecx edi edx.r1 LSR AND MLA ADD ADD CMP LSL BLE BX ENDP r12.r1.1. size = 4 esi esi.#1 |L0. 15 edi. $v0.r2. edx ecx.r3 |L0.r0 r1.2 Exercise #2 What does this code do? Listing 19.64: Optimizing Keil 6/2013 (thumb mode) f PROC PUSH MOVS MOVS MOVS MOVS {r4. EXERCISES 0xFF $a1.7. cl ecx.62: Optimizing MSVC 2010 _a$ = 8 _f PROC push mov xor push lea xor npad $LL3@f: mov shr add and imul lea add add cmp jle pop pop ret _f ENDP . 4 edi.r12.r1 r12. DWORD PTR [edx+edx*4] eax.r0 r1.r2.r0 r1. $v0 0xFF00 $a0 Answer: G. DWORD PTR _a$[esp] ecx.63: Optimizing Keil 6/2013 (ARM mode) f PROC MOV MOV MOV MOV r3. $a0.11 on page 956.r12. $v0 8 $v1.CHAPTER 19.lr} r3.#1 r0.16| Listing 19.10| 339 . $ra $v0. edi edx.#1 r0. 28 SHORT $LL3@f edi esi 0 Listing 19. 19.r3. esi edi. 19. align next label edi.r2. $a0.16| lr |L0.r2.#4 r2.#0 r2. eax 3 .LSL #2 r1. $v0.7.#0x1c r2.#0 r2.r1 MOVS r4. DWORD PTR [ecx+1] eax.#0xf r0.

r4. 0xF 0xF $a0. w1. $t1. 4 [sp.28] [sp. 32 Listing 19.r4.r1 r4.4.24] ldr cmp ble ldr add ret w0. w0 w0.r0 r4. $a0. w0. $a0. [sp. 1 w0. w0.L3 w0. w0. [sp. w0. w1 w0. $t2. $t2 $v1 0xF $a0. [sp. w0.r2 r1.28] w0. $t0. $t1.24] wzr. $t5.20] [sp.L3: . $v1. $a1. $a3. w0.#4 r1. $a0. 0xF 0xF $v0.r2. $a1. w0. w0 [sp. $v1.#28 r4.5 (MIPS) (IDA) f: srl srl andi andi srl srl andi andi sll sll sll srl sll addu subu andi srl $v0.28] w0. 2 w0.12] w1.24] w1. 8 20 12 16 4 2 2 4 7 28 340 . $a2. $t0.20] sp.r1. w1. 1 [sp. 15 [sp.#0xa r2.L2 ldr ldr lsr and ldr mul ldr add str ldr add str ldr mov lsl add lsl str w0.CHAPTER 19. w1.r4.pc} Listing 19. $a3. sp. $a2. sp. w0. w1.12] wzr. w0. $v0. #32 w0.20] w1.66: Optimizing GCC 4.65: Non-optimizing GCC 4.r4. $a3.24] w1 w0. w0 [sp.#28 r4. $a0. w0. [sp. w0. w0. [sp. EXERCISES r4. $v0. . MANIPULATING SPECIFIC BIT(S) LSRS LSLS LSRS MULS ADDS MOVS MULS ADDS CMP BLE POP ENDP 19. [sp.28] 28 .7.r4.L2: [sp.20] . $t5. w0. w0. $a3.10| {r4.28] [sp.r4 r0.9 (ARM64) f: sub str str mov str str b sp.#0x1c |L0.

$t2 0xF $t1. $a2. $t5 $t3.11 on page 957. $t2. $a3. $a0. $a3. MANIPULATING SPECIFIC BIT(S) sll sll sll sll srl addu subu subu andi sll sll sll sll sll addu subu addu addu addu sll sll sll addu subu addu sll sll sll addu subu addu sll sll andi addu addu addu subu sll addu addu sll sll addu addu sll addu sll jr addu $t4. $a2. $t2. $v1. $a3. $ra $v0. $a3. $t6. eax 0 Developer Network 341 . $a2. $t0. $t2. 'hello. $t3. $t5. $t4. $t7. $t1. 7 7 2 2 7 24 $a3 $t3 3 8 2 1 3 $a2 $t1 $t2 $a1 2 8 3 $t0 $t4 2 6 $a1 $t1 6 2 $v1 $t0 $a2 $v0 2 $a3 $v0 $v1. $t3.7. $a2. $v0.1. $t1. $v0. $t3. $a0. $t2. $t5. $t6. $a3.7. $t1. 00044043H OFFSET $SG79792 . $a1. $t4.67: Optimizing MSVC 2010 _main _main PROC push push push push call xor ret ENDP 12 Microsoft 278595 . $t5 $t3. $v0. $t1. $t0. $a0. $a1 6 $a3. 3 $t4. $v1. $v0 Answer: G. 4 $a0. $v1. $t3. world!' 0 DWORD PTR __imp__MessageBoxA@16 eax. $t5. $v0.3 Exercise #3 Using the MSDN12 documentation. $t4. $t1. 19. 'caption' OFFSET $SG79793 . $v0. $t0. $a1. $a1. $t7. $v0. $t4. $t7. 5 $a2. $t0. $a1 $t3 $t2. $a2. $t3 $t0. $v1. $a1. $t3. $v1. 0xF $t1. $t0. $t2. $t2. $a3. EXERCISES $a1. $v1. $t0. $a2.CHAPTER 19. find out which flags were used in the MessageBox() win32 function call. $a2. 19. $t2. $a1. Listing 19. $v0. $a3.

lr} r3. 19. DWORD PTR _n$[esp-4] eax.69: Optimizing Keil 6/2013 (ARM mode) f PROC PUSH MOV MOV MOV MOV B {r4.40| |L0.#0 r2.40| r0.r0 r12. 0 esi.#1 CMP MOVEQ BNE POP ENDP r1.1.r0 r4.r0.r2.24| |L0. 1 SHORT $LL3@f esi 0 Listing 19. size = 4 .68: Optimizing MSVC 2010 _m$ = 8 _n$ = 12 _f PROC mov xor xor test je push mov $LL3@f: test je add adc $LN1@f: add shr jne pop $LN2@f: ret _f ENDP . 1 SHORT $LN1@f eax.r0.48| TST BEQ ADDS ADC r1. MANIPULATING SPECIFIC BIT(S) 19. EXERCISES Answer: G.r4 |L0.7.70: Optimizing Keil 6/2013 (thumb mode) f PROC PUSH MOVS MOVS MOVS MOVS B {r4.CHAPTER 19.#1 |L0.48| Listing 19.r2 |L0.12| 342 .r3 r2.r5.r1.lr} r3.r0 r0.11 on page 957.r12 LSL LSR r3.r3.r3 r2.#31 |L0.r2.r0 r0.r0 |L0.pc} |L0.20| r0.r1.#1 r1. edx ecx. DWORD PTR _m$[esp] cl.7.24| LSLS BEQ ADDS ADCS r5. size = 4 ecx. esi edx. esi ecx.4 Exercise #4 What does this code do? Listing 19.#0 r2.r0 |L0.#0 r1. ecx SHORT $LN2@f esi esi.24| {r4. eax edx.

72: Optimizing GCC 4. $a2. . x3. $a3 $t0 $a1.9 (ARM64) f: mov mov cbz w2.11 on page 957.5 (MIPS) (IDA) mult: beqz move move addu $a1.12| r1. $a3.#1 r1. $a1.L3: w1. $t1. EXERCISES |L0. 343 . MANIPULATING SPECIFIC BIT(S) 19. w3.1. . . $zero move jr move $v0.#1 CMP BNE MOVS POP ENDP r1.20| LSLS LSRS r3. loc_44 $a0. x3.r2 {r4. loc_40 $zero $zero $a3. $a0 sltu move andi addu beqz srl move move $t1. $a3. $a3 loc_10: # CODE XREF: mult+38 loc_30: # CODE XREF: mult+20 loc_40: # CODE XREF: mult loc_44: # CODE XREF: mult:loc_30 Answer: G. $a2.pc} |L0. w1. w1. wzr x0. uxtw 1 x0.L2: ret Listing 19. $t0. $a0 move $a2. $a3.r5.7.L3 1 1 x2. w0 x0.4.71: Optimizing GCC 4.CHAPTER 19. w2. 0 w1. $v1.24| Listing 19.L2 and lsr cmp add lsl csel cbnz w3. 1 $a2 loc_30 1 $v1 $t1 beqz sll b addu $a1. $t0.r3. 1 loc_10 $t0. $t0. $a2 $ra $v1. $t0. w2.#0 |L0. w1.r1. x0. ne .

void my_srand (uint32_t init) { rand_state=init.1: Optimizing MSVC 2013 _BSS SEGMENT _rand_state DD 01H DUP (?) _BSS ENDS _init$ = 8 _srand PROC mov mov ret _srand ENDP _TEXT eax. I defined them using a #define C/C++ statement. return rand_state & 0x7fff. } int my_rand () { rand_state=rand_state*RNG_a. The last AND operation is needed because by C-standard. eax 0 SEGMENT 1 Mersenne twister is better 344 . #include <stdint. The difference between a C/C++ macro and a constant is that all macros are replaced with their value by C/C++ preprocessor. LINEAR CONGRUENTIAL GENERATOR AS PSEUDORANDOM NUMBER GENERATOR Chapter 20 Linear congruential generator as pseudorandom number generator The linear congruential generator is probably the simplest possible way to generate random numbers. They are taken from [Pre+07].CHAPTER 20. a constant is a read-only variable. It’s not in favour in modern times1 . We see that two constants are used in the algorithm. just omit the last AND operation. my_rand() has to return a value in the 0. but it’s so simple (just one multiplication. rand_state=rand_state+RNG_c.32767 range. DWORD PTR _init$[esp-4] DWORD PTR _rand_state. but impossible to do so with a macro.h> // constants from the Numerical Recipes book #define RNG_a 1664525 #define RNG_c 1013904223 static uint32_t rand_state. In contrast.1 x86 Listing 20.. } There are two functions: the first one is used to initialize the internal state. If you want to get 32-bit pseudorandom values. It’s a macro. and the second one is called to generate pseudorandom numbers. 20. unlike variables. we can use it as an example. one addition and one AND operation). It’s possible to take a pointer (or address) of a constant variable. and they don’t take any memory.

1664525 DWORD PTR _rand_state. ecx ret 0 my_srand ENDP _TEXT SEGMENT my_rand PROC imul eax. ecx eax. 3c6ef35fH DWORD PTR _rand_state.3: Optimizing MSVC 2013 x64 _BSS SEGMENT rand_state DD 01H DUP (?) _BSS ENDS init$ = 8 my_srand PROC .2. DWORD PTR _init$[ebp] DWORD PTR _rand_state. 1013904223 .CHAPTER 20. LINEAR CONGRUENTIAL GENERATOR AS PSEUDORANDOM NUMBER GENERATOR _rand _rand PROC imul add mov and ret ENDP _TEXT ENDS 20.2: Non-optimizing MSVC 2013 _BSS SEGMENT _rand_state DD 01H DUP (?) _BSS ENDS _init$ = 8 _srand PROC push mov mov mov pop ret _srand ENDP _TEXT _rand _rand SEGMENT PROC push mov imul mov mov add mov mov and pop ret ENDP _TEXT ENDS 20. The my_srand() function just copies its input value into the internal rand_state variable. DWORD PTR _rand_state. 1013904223 . eax ebp 0 ebp ebp. calculates the next rand_state. DWORD PTR _rand_state eax. The non-optimized version is more verbose: Listing 20. But my_srand() takes its input argument from the ECX register rather than from stack: Listing 20. DWORD PTR _rand_state. 00007fffH 0 Here we see it: both constants are embedded into the code. X64 eax. esp eax. DWORD PTR rand_state.2 x64 ebp ebp. 0019660dH . There is no memory allocated for them. esp eax. 3c6ef35fH DWORD PTR _rand_state. DWORD PTR _rand_state ecx. eax ecx. ECX = input argument mov DWORD PTR rand_state. 32767 . 1664525 eax. eax eax. 00007fffH ebp 0 The x64 version is mostly the same and uses 32-bit registers instead of 64-bit ones (because we are working with int values here). cuts it and leaves it in the EAX register. 32767 . my_rand() takes it. 1664525 345 .

56| |L0. LINEAR CONGRUENTIAL GENERATOR AS PSEUDORANDOM NUMBER GENERATOR add mov and ret my_rand ENDP _TEXT eax.3.|L0. load RNG_a .[r0.#0] MUL r1. save rand_state .4.#17 LSR r0.#0] lr my_rand PROC LDR r0.r1 LDR r2. load pointer to rand_state .60| AREA ||. 20.52| LDR r2.CHAPTER 20. (rand_state >> 16) 346 .r1. 32767 0 20. rand_state my_rand: . 00007fffH ENDS GCC compiler generates mostly the same code.r1.r2 STR r1. So what Keil does is shifting rand_state left by 17 bits and then shifting it right by 17 bits.r0. so Keil has to place them externally and load them additionally. Optimizing Keil for Thumb mode generates mostly the same code.5 (IDA) my_srand: . load rand_state to $v0: lui $v1. ALIGN=2 rand_state DCD 0x00000000 It’s not possible to embed 32-bit constants into ARM instructions. but what it does is clearing the high 17 bits.56| LDR r1. load pointer to rand_state . leaving the low 15 bits intact. save rand_state |L0.|L0.5: Optimizing GCC 4. store $a0 to rand_state: lui $v0. and that’s our goal after all. This is analogous to the (rand_state ≪ 17) ≫ 17 statement in C/C++. load rand_state . 32-BIT ARM . 20. AND with 0x7FFF: LSL r0. One interesting thing is that it’s not possible to embed the 0x7FFF constant as well.|L0. 1013904223 DWORD PTR rand_state.data||.[r0.|L0. load RNG_c .4: Optimizing Keil 6/2013 (ARM mode) my_srand PROC LDR STR BX ENDP r1.4 MIPS Listing 20. (rand_state >> 16) jr $ra sw $a0. It seems to be useless operation. DATA.#17 BX lr ENDP . eax eax.[r1.r2.52| r0.3 32-bit ARM Listing 20.#0] .data|| DCD 0x0019660d DCD 0x3c6ef35f |L0. 3c6ef35fH .60| ADD r1.52| DCD ||.

rand_state or $at. 6 subu $a0. $zero . $v0 sll $a0. Wow. MIPS .4. (rand_state & 0xFFFF)($v1) jr $ra andi $v0. $a1. $a0. 2 sll $a0. $v1.0(v0) 0000000c <my_rand>: 347 . load delay slot multiplicate rand_state in $v0 by 1664525 (RNG_a): sll $a1. lw $v0. $v0. $a0 add 1013904223 (RNG_c) the LI instruction is coalesced by IDA from LUI and ORI li $a0. 00000000 <my_srand>: 0: 3c020000 4: 03e00008 8: ac440000 lui jr sw v0. 2 addu $v0.7: Optimizing GCC 4. 4 addu $a0. $v0. $a1 sll $a0. $a0 addu $a0.o .CHAPTER 20. branch delay slot Indeed! 20. $a0. $v0. $a0..6: Optimizing GCC 4. 0x7FFF . 5 addu $a0. 3 addu $v0. 2 4 $v0 6 $v0 5 $a0 2 $a0. . 0x3C6EF35F addu $v0. $a0. $v0.5 (IDA) f: sll sll addu sll subu addu sll addu sll addu sll jr addu $v1. branch delay slot . here we see only one constant (0x3C6EF35F or 1013904223). The listings here are produced by IDA. $a0. $a0 sll $a1. LINEAR CONGRUENTIAL GENERATOR AS PSEUDORANDOM NUMBER GENERATOR 20. $a1. $a0. $v0. $a0.1 MIPS relocations I want also to focus on how such operations as load from memory and store to memory actually work. $v1..4. $v0. $a0 $v0.4. $v0. $a0 store to rand_state: sw $v0. } Listing 20.4. $v0. . $v1 3 $v0. Where is the other one (1664525)? It seems that multiplication by 1664525 is done by just using shifts and additions! Let’s check this assumption: #define RNG_a 1664525 int f (int a) { return a*RNG_a. $v0. $v1. I’ll run objdump twice to get a disassembled listing and also relocations list: Listing 20. $v0. $ra $v0. which hides some details.0x0 ra a0. $v0 . $v1. $v0.5 (objdump) # objdump -D rand_O3. $v0 sll $a1.

.CHAPTER 20. SW will sum up the low part of the address and what is in register $V0 (the high part is there). 348 .0x0 v0. but we ought to remember them.a0 v0..1 on page 671.bss . for address 0 has a type of R_MIPS_HI16 and the second one for address 8 has a type of R_MIPS_LO16.bss 00000010 R_MIPS_LO16 .a0.a0 a0.a0 a1. The first one..a1 a0. 20..5.a1.text]: OFFSET TYPE VALUE 00000000 R_MIPS_HI16 .0x5 a0.0x6 a0. # objdump -r rand_O3.o .0xf35f v0. RELOCATION RECORDS FOR [.bss 0000000c R_MIPS_HI16 . It’s the same story with the my_rand() function: R_MIPS_HI16 relocation instructs the linker to write the high part of the .0x3 v0.a0. The LW instruction at address 0x10 sums up the high and low parts and loads the value of the rand_state variable into $V1.bss segment address into instruction LUI. The linker will fix this.a0.v0 a0.a0..v0 a1.v0.v0.bss 00000008 R_MIPS_LO16 .bss segment is to be written into the instructions at address of 0 (high part of address) and 8 (low part of address).a0.v0.v0.0x3c6e a0.0x7fff .bss 00000054 R_MIPS_LO16 .0x2 a0.a0.0(v1) at. THREAD-SAFE VERSION OF THE EXAMPLE v1. IDA processes relocations while loading. That implies that address of the beginning of the .0x2 v0.v0.5 Thread-safe version of the example The thread-safe version of the example is to be demonstrated later: 65..0(v1) ra v0.at a1. because nothing is there yet — the compiler don’t know what to write there.v0.0x4 a0. thus hiding these details.a1.a0 a0.a0. The rand_state variable is at the very start of the . and the high part of the address will be written into the operand of LUI and the low part of the address — to the operand of SW. LINEAR CONGRUENTIAL GENERATOR AS PSEUDORANDOM NUMBER GENERATOR c: 10: 14: 18: 1c: 20: 24: 28: 2c: 30: 34: 38: 3c: 40: 44: 48: 4c: 50: 54: 58: 5c: 3c030000 8c620000 00200825 00022880 00022100 00a42021 00042980 00a42023 00822021 00042940 00852021 000420c0 00821021 00022080 00441021 3c043c6e 3484f35f 00441021 ac620000 03e00008 30427fff lui lw move sll sll addu sll subu addu sll addu sll addu sll addu lui ori addu sw jr andi 20. Let’s consider the two relocations for the my_srand() function. So we see zeroes in the operands of instructions LUI and SW.bss segment. The SW instruction at address 0x54 do the summing again and then stores the new value to the rand_state global variable. So the high part of the rand_state variable address is residing in register $V1.

h> void main() { SYSTEMTIME t. WORD wDay. }. 16 lea eax.wSecond). *PSYSTEMTIME. DWORD PTR _t$[ebp] push eax call DWORD PTR __imp__GetSystemTime@4 movzx ecx. Let’s write a C function to get the current time: #include <windows. WORD wDayOfWeek. return.wDay. t. is just a set of variables. This is how it’s defined: Listing 21. esp sub esp. t.h> #include <stdio. WORD wMilliseconds.2: MSVC 2010 /GS_t$ = -16 . WORD wMinute.h typedef struct _SYSTEMTIME { WORD wYear. WORD wMonth. } SYSTEMTIME. size = 16 _main PROC push ebp mov ebp. with some assumptions.wHour. t. 21. wSecond push ecx 1 AKA “heterogeneous container” SYSTEMTIME structure 2 MSDN: 349 .1: WinBase.CHAPTER 21. t. always stored in memory together. t.wMonth. We get (MSVC 2010): Listing 21. WORD wSecond. printf ("%04d-%02d-%02d %02d:%02d:%02d\n". WORD wHour. WORD PTR _t$[ebp+12] . t. STRUCTURES Chapter 21 Structures A C/C++ structure.1 MSVC: SYSTEMTIME example Let’s take the SYSTEMTIME2 win32 structure that describes time. GetSystemTime (&t).wYear. not necessary of the same type 1 .wMinute.

writes current month. MSVC: SYSTEMTIME EXAMPLE edx. wMonth edx eax. wDay ecx edx. wMinute edx eax. STRUCTURES movzx push movzx push movzx push movzx push movzx push push call add xor mov pop ret _main 21. and that is the same! GetSystemTime() writes the current year to the WORD pointer pointing to. 28 eax.1. 00H _printf esp. WORD PTR _t$[ebp+6] . WORD PTR _t$[ebp+10] . 3 MSDN: SYSTEMTIME structure 350 . ebp ebp 0 ENDP 16 bytes are allocated for this structure in the local stack —that is exactly sizeof(WORD)*8 (there are 8 WORD variables in the structure). It can be said that a pointer to the SYSTEMTIME structure is passed to the GetSystemTime()3 . then shifts 2 bytes ahead. wHour eax ecx.CHAPTER 21. wYear eax OFFSET $SG78811 . etc. '%04d-%02d-%02d %02d:%02d:%02d'. WORD PTR _t$[ebp+8] . etc. but it is also can be said that a pointer to the wYear field is passed. Pay attention to the fact that the structure begins with the wYear field. WORD PTR _t$[ebp] . eax esp. WORD PTR _t$[ebp+2] . 0aH.

1./MD keys and run it in OllyDbg.1: OllyDbg: GetSystemTime() just executed The system time of the function execution on my computer is 9 december 2014.h> #include <stdio. I can demonstrate by doing the following. Let’s open windows for data and stack at the address which is passed as the first argument of the GetSystemTime() function. available for use.1. STRUCTURES 21. Keeping in mind the SYSTEMTIME structure description. these are the values currently stored in memory: Hexadecimal number 0x07DE 0x000C 0x0002 0x0009 0x0016 0x001D 0x0034 0x03D4 decimal number 2014 12 2 9 22 29 52 980 field name wYear wMonth wDayOfWeek wDay wHour wMinute wSecond wMilliseconds The same values are seen in the stack window. And then printf() just takes the values it needs and outputs them to the console. I can rewrite this simple example like this: #include <windows. and let’s wait until it’s executed. Since the endianness is little endian. but they are in memory right now. Some values aren’t output by printf() (wDayOfWeek and wMilliseconds). but they are grouped as 32-bit values. 21.2: OllyDbg: printf() output So we see these 16 bytes in the data window: DE 07 0C 00 02 00 09 00 16 00 1D 00 34 00 D4 03 Each two bytes represent one field of the structure.h> 351 . 22:29:52: Figure 21.1.2 Replacing the structure with array The fact that the structure fields are just variables located side-by-side. MSVC: SYSTEMTIME EXAMPLE OllyDbg Let’s compile this example in MSVC 2010 with /GS.CHAPTER 21. we see the low byte first and then the high one. Hence.1 21. We see this: Figure 21.

WORD PTR _array$[ebp+10] . no sane person would do it. 28 eax. array[1] /* wMonth */. array[5] /* wMinute */. 00H _array$ = -16 _main PROC push mov sub lea push call movzx push movzx push movzx push movzx push movzx push movzx push push call add xor mov pop ret _main ENDP .h> #include <stdio. WORD PTR _array$[ebp+12] . but in the heap: #include <windows.2.c(7) : warning C4133: 'function' : incompatible types . }. wMinute edx eax. Nevertheless. wYear eax OFFSET $SG78573 _printf esp. 352 .CHAPTER 21. or an array. array[6] /* wSecond */). The compiler grumbles a bit: systemtime2. WORD PTR _array$[ebp+8] .2 Let’s allocate space for a structure using malloc() Sometimes it is simpler to place structures not the in local stack. as it is not convenient. printf ("%04d-%02d-%02d %02d:%02d:%02d\n". Also the structure fields may be changed by developers. array[0] /* wYear */. it produces this code: Listing 21. eax esp. WORD PTR _array$[ebp+6] . wMonth edx eax. array[4] /* wHour */. WORD PTR _array$[ebp] . return. DWORD PTR _array$[ebp] eax DWORD PTR __imp__GetSystemTime@4 ecx. I’m not adding OllyDbg example here. swapped. because it is just the same as in the case with the structure. So by looking at this code. 0aH. 16 eax. ebp ebp 0 And it works just as the same! It is very interesting that the result in assembly form cannot be distinguished from the result of the previous compilation. esp esp. wSecond ecx edx. LET’S ALLOCATE SPACE FOR A STRUCTURE USING MALLOC() void main() { WORD array[8]. array[3] /* wDay */. GetSystemTime (array). wDay ecx edx. etc. one cannot say for sure if there was a structure declared. size = 16 ebp ebp.from 'WORD [8]' to '⤦ Ç LPSYSTEMTIME' But nevertheless. wHoure eax ecx. 21. WORD PTR _array$[ebp+2] . STRUCTURES 21.h> void main() { SYSTEMTIME *t.3: Non-optimizing MSVC 2010 $SG78573 DB '%04d-%02d-%02d %02d:%02d:%02d'.

That’s because printf() requires a 32-bit int. It returns a pointer to a freshly allocated memory block in the EAX register. return. t[1] /* wMonth */. but we got a WORD in the structure —that is 16-bit unsigned type. Let’s compile it now with optimization (/Ox) so it would be easy see what we need. but it sets the remaining bits to 0. That’s why by copying the value from a WORD into int. wYear eax ecx edx OFFSET $SG78833 _printf esi _free esp. which is then moved into the ESI register. WORD PTR [esi+8] . t->wMonth. t=(WORD *)malloc (16). In this example. printf ("%04d-%02d-%02d %02d:%02d:%02d\n". wDay ecx ecx. which is left from the previous operations on the register(s). free (t). and that is why it is not saved here and continues to be used after the GetSystemTime() call. wMinute edx. 353 . It may be used in most cases as MOVSX. GetSystemTime() win32 function takes care of saving value in ESI. eax esi DWORD PTR __imp__GetSystemTime@4 eax. t->wHour. eax esi 0 ENDP So. WORD PTR [esi+10] .h> #include <stdio. WORD PTR [esi+2] . I again can represent the structure as an array of 8 WORDs: #include <windows. t->wDay. }. printf ("%04d-%02d-%02d %02d:%02d:%02d\n". t[3] /* wDay */. New instruction —MOVZX (Move with Zero eXtend). WORD PTR [esi+6] .2. GetSystemTime (t). t[0] /* wYear */. 4 esi. wSecond ecx. GetSystemTime (t).CHAPTER 21. t->wMinute. bits from 16 to 31 must be cleared. WORD PTR [esi+12] . LET’S ALLOCATE SPACE FOR A STRUCTURE USING MALLOC() t=(SYSTEMTIME *)malloc (sizeof (SYSTEMTIME)). Listing 21. STRUCTURES 21.4: Optimizing MSVC _main push push call add mov push call movzx movzx movzx push movzx push movzx push movzx push push push push call push call add xor pop ret _main PROC esi 16 _malloc esp. wMonth edx edx. t->wYear. wHour eax eax.h> void main() { WORD *t. WORD PTR [esi] . t->wSecond). 32 eax. because a random noise may be there. sizeof(SYSTEMTIME) = 16 and that is exact number of bytes to be allocated by malloc().

3. t. WORD PTR [esi+12] ecx.h> void main() { struct tm t. you haven’t to do this in practice.h> #include <time. ("Day: %d\n". t[6] /* wSecond */). 21. 354 . unless you really know what you are doing. &t). WORD PTR [esi] eax ecx edx OFFSET $SG78594 _printf esi _free esp. eax esi DWORD PTR __imp__GetSystemTime@4 eax. UNIX: STRUCT TM t[4] /* wHour */. We get: Listing 21.5: Optimizing MSVC $SG78594 DB _main _main '%04d-%02d-%02d %02d:%02d:%02d'. we got the code cannot be distinguished from the previous one. 00H PROC push push call add mov push call movzx movzx movzx push movzx push movzx push movzx push push push push call push call add xor pop ret ENDP esi 16 _malloc esp. 32 eax.CHAPTER 21. WORD PTR [esi+8] eax eax.tm_year+1900).tm_mon).3 UNIX: struct tm 21. ("Month: %d\n". ("Minutes: %d\n".tm_min). WORD PTR [esi+2] edx edx. eax esi 0 Again. localtime_r (&unix_time. return.1 Linux Let’s take the tm structure from time.h in Linux for example: #include <stdio. WORD PTR [esi+6] ecx ecx.tm_mday). t. 4 esi.tm_sec). t. And again I have to note. ("Seconds: %d\n". WORD PTR [esi+10] edx. STRUCTURES 21. t[5] /* wMinute */. free (t). t. printf printf printf printf printf printf ("Year: %d\n". time_t unix_time.tm_hour). unix_time=time(NULL). 0aH. t. ("Hour: %d\n". t.3. }.

esp and esp. offset aSecondsD . I wasn’t able to run GDB that quickly. offset aHourD . edx mov [esp]. [esp+10h] mov eax.6: GCC 4. "Minutes: %d\n" mov [esp+4]. edx mov [esp]. [esp+3Ch] . 355 . Of course. But since we already are experienced reverse engineers :-) we may do it without this information in this simple example. tm_mon mov eax. pass pointer to the structure begin mov [esp]. "Day: %d\n" mov [esp+4]. [esp+14h] . edx . UNIX: STRUCT TM }. [esp+1Ch] . eax call printf mov edx.4. Let’s compile it in GCC 4. 40h mov dword ptr [esp]. edx mov [esp]. [esp+24h] . but doesn’t modify any flags. eax call printf mov edx. in the same second. edx mov [esp]. eax call printf mov edx. offset format . offset aMonthD . tm_sec mov [esp].2 on page 933).CHAPTER 21.1 main proc near push ebp mov ebp. 0 . GDB Let’s try to load the example into GDB 4 : Listing 21.7: GDB dennis@ubuntuvm:~/polygon$ date Mon Jun 2 18:10:37 EEST 2014 dennis@ubuntuvm:~/polygon$ gcc GCC_tm. tm_year lea edx. tm_min mov eax.c -o GCC_tm dennis@ubuntuvm:~/polygon$ gdb GCC_tm 4I corrected the date result slightly for demonstration purposes.4. "Seconds: %d\n" mov [esp+4]. [eax+76Ch] —this instruction just adds 0x76C (1900) to value in EAX. eax call printf mov edx. "Year: %d\n" mov [esp+4]. tm_mday mov eax. Please also pay attention to the lea edx. IDA did not write the local variables’ names in the local stack. eax . pass pointer to result of time() call localtime_r mov eax.6.1: Listing 21. offset aMinutesD . take pointer to what time() returned lea edx.3. [esp+18h] . first argument for time() call time mov [esp+3Ch]. STRUCTURES 21. [esp+20h] . edx . [esp+10h] . edx mov [esp]. eax lea eax. eax call printf leave retn main endp Somehow. 0FFFFFFF0h sub esp. eax call printf mov edx. tm_hour mov eax. offset aDayD . "Month: %d\n" mov [esp+4]. [eax+76Ch] . "Hour: %d\n" mov [esp+4]. edx=eax+1900 mov eax. at ESP+10h struct tm will begin mov [esp+4]. See also the relevant section about LEA ( A.

. like tm_wday. There is NO WARRANTY. there are also other fields available that are not used. int tm_wday. int tm_hour. let’s see how it’s defined in time. int tm_yday. First. __printf (format=0x80485c0 "Year: 29 printf. This GDB was configured as "i686-linux-gnu".. int tm_mday.6. to the extent permitted by law.3.1 on page 349). Type "show copying" and "show warranty" for details. int tm_mon. int tm_isdst. (gdb) x/20x $esp 0xbffff0dc: 0x080484c3 0x080485c0 0xbffff0ec: 0x08048301 0x538c93ed 0xbffff0fc: 0x00000012 0x00000002 0xbffff10c: 0x00000001 0x00000098 0xbffff11c: 0x0804b090 0x08048530 (gdb) %d\n") at printf.org/software/gdb/bugs/>.. tm_yday.. STRUCTURES 21.CHAPTER 21. Here are the fields of our structure in the stack: 0xbffff0dc: 0xbffff0ec: 0xbffff0fc: 0xbffff10c: 0xbffff11c: 0x080484c3 0x080485c0 0x000007de 0x00000000 0x08048301 0x538c93ed 0x00000025 sec 0x0000000a min 0x00000012 hour 0x00000002 mday 0x00000005 mon 0x00000072 year 0x00000001 wday 0x00000098 yday 0x00000001 isdst0x00002a30 0x0804b090 0x08048530 0x00000000 0x00000000 Or as a table: Hexadecimal number 0x00000025 0x0000000a 0x00000012 0x00000002 0x00000005 0x00000072 0x00000001 0x00000098 0x00000001 decimal number 37 10 18 2 5 114 1 152 1 field name tm_sec tm_min tm_hour tm_mday tm_mon tm_year tm_wday tm_yday tm_isdst Just like SYSTEMTIME ( 21. tm_isdst.h: Listing 21. UNIX: STRUCT TM GNU gdb (GDB) 7. License GPLv3+: GNU GPL version 3 or later <http://gnu.8: time.2 ARM Optimizing Keil 6/2013 (thumb mode) Same example: 356 ..3. }. So.html> This is free software: you are free to change and redistribute it.1-ubuntu Copyright (C) 2013 Free Software Foundation. int tm_year. please see: <http://www. Inc.c: No such file or directory.(no debugging symbols found). Reading symbols from /home/dennis/polygon/GCC_tm. Pay attention that 32-bit int is used here instead of WORD in SYSTEMTIME. For bug reporting instructions. 21.h struct tm { int tm_sec.gnu.. int tm_min. (gdb) b printf Breakpoint 1 at 0x8048330 (gdb) run Starting program: /home/dennis/polygon/GCC_tm Breakpoint 1.done.c:29 0x000007de 0x00000025 0x00000005 0x00000001 0x00000000 0x00000000 0x0000000a 0x00000072 0x00002a30 0x00000000 We can easily find our structure in the stack. each field occupies 32-bit.org/licenses/gpl.

#0x34 time R0. SP. STRUCTURES 21.#0x38+var_38] R0.3.#0x38+var_34] R0. [SP. SP . timer SP. #0x34 {PC} Optimizing Xcode 4. "Day: %d\n" __2printf R1. aMinutesD .LR} R7. [SP.CHAPTER 21.6. aDayD .#0x38+timer] R1.#0x38+var_28] R0. aHourD .3 (LLVM) (thumb-2 mode) IDA “knows” the tm structure (because IDA “knows” the types of the arguments of library functions like localtime_r()). time_t * _localtime_r R1. #0x30 R0. aYearD . "Minutes: %d\n" __2printf R1.3 (LLVM) (thumb-2 mode) var_38 = -0x38 var_34 = -0x34 PUSH MOV SUB MOVS BLX ADD STR MOV BLX LDR MOV ADD ADDW BLX LDR MOV ADD BLX LDR {R7. "Month: %d\n" __2printf R1. time_t * _time R1. "Year: %d\n" __2printf R1.9: Optimizing Keil 6/2013 (thumb mode) var_38 var_34 var_30 var_2C var_28 var_24 timer = = = = = = = -0x38 -0x34 -0x30 -0x2C -0x28 -0x24 -0xC PUSH MOVS SUB BL STR MOV ADD BL LDR LDR ADDS ADR BL LDR ADR BL LDR ADR BL LDR ADR BL LDR ADR BL LDR ADR BL ADD POP {LR} R0.#0x38+var_34. [SP.#0x38+var_34. 0xF44 . [SP. [SP. #0x38+var_34 . [SP. R0. #0 . #0x38+timer .tm_year] R0.#0x38+var_30] R0. char * R1. SP.tm_mon] R0. SP. [SP. struct tm * R0.6. 0xF3A . [SP.#0x38+var_34.#0x38+var_24] R1. timer localtime_r R1. PC . "Hour: %d\n" __2printf R1. SP. "Year: %d\n" R0. [SP. [SP. SP.#0x38+var_2C] R0. char * _printf R1.#0x38+var_38] R0. "Seconds: %d\n" __2printf SP. #0x76C _printf R1. "Month: %d\n" R0. [SP. UNIX: STRUCT TM Listing 21. SP SP. R1. aSecondsD . =0x76C R0. #0 . tp R0.10: Optimizing Xcode 4. SP . PC . aMonthD . so it shows here structure elements accesses and their names. Listing 21.tm_mday] 357 . R1 R0.

char * _printf R1. PC . 0x50+seconds 0x50+var_38($sp) . $gp.11: Optimizing GCC 4. 00000000 00000000 00000004 00000008 0000000C 00000010 00000014 00000018 0000001C 00000020 00000024 00000028 0000002C 21. #0x30 {R7. I named them manually: var_40 var_38 seconds minutes hour day month year var_4 = = = = = = = = = -0x40 -0x38 -0x34 -0x30 -0x2C -0x28 -0x24 -0x20 -4 lui addiu la sw sw lw or jalr move lw addiu lw addiu jalr sw lw lw $gp. SP. (sizeof=0x2C. PC .tm_hour] R0.3 tm tm_sec tm_min tm_hour tm_mday tm_mon tm_year tm_wday tm_yday tm_isdst tm_gmtoff tm_zone tm struc . (__gnu_local_gp >> 16) -0x50 (__gnu_local_gp & 0xFFFF) 0x50+var_4($sp) 0x50+var_40($sp) (time & 0xFFFF)($gp) $zero . STRUCTURES MOV ADD BLX LDR MOV ADD BLX LDR MOV ADD BLX LDR MOV ADD BLX ADD POP 21. $a0. UNIX: STRUCT TM R0.tm_min] R0.3.CHAPTER 21. $t9 $v0. "Hour: %d\n" R0. load delay slot. $t9 $a0.3. "Seconds: %d\n" R0. char * _printf R1. standard type) DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? DCD ? . $gp. $t9. NOP $zero . 0xF25 . 0xF35 . char * _printf SP.PC} . 0xF28 . "Minutes: %d\n" R0.#0x38+var_34. 0xF2E ..5 (IDA) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 main: . char * _printf R1.. branch delay slot 0x50+var_40($sp) 0x50+year($sp) 358 . 0x50+var_38 (localtime_r & 0xFFFF)($gp) $sp. $sp. PC .#0x38+var_34] R0. $ra. branch delay slot. NOP 0x50+var_40($sp) $sp. "Day: %d\n" R0. $at. [SP. IDA is not aware of structure field names.4. $a1. $a1. [SP. $t9. PC . $gp.#0x38+var_34. $gp. [SP. offset ends MIPS Listing 21.

h> #include <time. UNIX: STRUCT TM $t9. $gp.ascii .3. For example. $a1. $a0.3. printf ("Year: %d\n". tm_mon). tm_wday.ascii .21. $a1. printf ("Day: %d\n". $t9 $a1. localtime_r (&unix_time. $t9. NOP 0x50 "Year: %d\n"<0> "Month: %d\n"<0> "Day: %d\n"<0> "Hour: %d\n"<0> "Minutes: %d\n"<0> "Seconds: %d\n"<0> This is an example where the branch delay slots can confuse us. branch delay slot 0x50+var_40($sp) 0x50+hour($sp) (printf & 0xFFFF)($gp) ($LC3 >> 16) # "Hour: %d\n" ($LC3 & 0xFFFF) # "Hour: %d\n" . $gp.ascii . $t9 $a0. $t9 $a0. &tm_sec). branch delay slot 0x50+var_40($sp) 0x50+seconds($sp) (printf & 0xFFFF)($gp) ($LC5 >> 16) # "Seconds: %d\n" ($LC5 & 0xFFFF) # "Seconds: %d\n" . branch delay slot 0x50+var_4($sp) $zero . $t9. 21. let’s rework our example while looking at the tm structure definition again: listing. do not forget about it. branch delay slot 0x50+var_40($sp) 0x50+month($sp) (printf & 0xFFFF)($gp) ($LC1 >> 16) # "Month: %d\n" ($LC1 & 0xFFFF) # "Month: %d\n" . STRUCTURES 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 lw la jalr addiu lw lw lw lui jalr la lw lw lw lui jalr la lw lw lw lui jalr la lw lw lw lui jalr la lw lw lw lui jalr la lw or jr addiu $LC0: $LC1: $LC2: $LC3: $LC4: $LC5: . branch delay slot 0x50+var_40($sp) 0x50+minutes($sp) (printf & 0xFFFF)($gp) ($LC4 >> 16) # "Minutes: %d\n" ($LC4 & 0xFFFF) # "Minutes: %d\n" . $t9. $a1. $t9 $a0. unix_time=time(NULL). $a0. tm_mday). tm_mon.ascii . $t9. $a1. branch delay slot 0x50+var_40($sp) 0x50+day($sp) (printf & 0xFFFF)($gp) ($LC2 >> 16) # "Day: %d\n" ($LC2 & 0xFFFF) # "Day: %d\n" . tm_isdst. (printf & 0xFFFF)($gp) $LC0 # "Year: %d\n" 1900 . $gp. $a1. $a0. $a0. tm_hour.ascii . $gp. $ra $sp.ascii 21. $a0. $t9.4 Structure as a set of values In order to illustrate that the structure is just variables laying side-by-side in one place. It’s executed before the corresponding JALR at line 34. tm_min. printf ("Month: %d\n". 359 . $ra. $gp. $t9 $a0. time_t unix_time.h> void main() { int tm_sec. tm_yday. $a0.8. $t9 $a0.CHAPTER 21. tm_mday. tm_year. #include <stdio. $at. tm_year+1900). there is the instruction “addiu $a1. 1900” at line 35 which adds 1900 to the year number. load delay slot.

3.c: In function 'main': GCC_tm2. 1900 [esp+30h+var_2C].13: GCC 4.c:11:5: warning: passing argument 2 of 'localtime_r' from incompatible pointer type [⤦ Ç enabled by default] In file included from GCC_tm2. eax [esp+30h+var_30].e.CHAPTER 21. [esp+30h+tm_mon] [esp+30h+var_2C]. eax [esp+30h+var_30]. 0 . offset aDayD . eax [esp+30h+var_30]. [esp+30h+tm_year] eax. arg 0 time [esp+30h+unix_time].c:2:0: /usr/include/time. offset aSecondsD .3 GCC_tm2. "Month: %d\n" printf eax. eax [esp+30h+var_30]. The compiler warns us: Listing 21. esp esp. tm_sec).B. The pointer to the tm_sec field is passed into localtime_r. 30h __main [esp+30h+var_30]. [esp+30h+unix_time] [esp+30h+var_30]. "Day: %d\n" printf eax.. eax eax. [esp+30h+tm_min] [esp+30h+var_2C]. }. printf ("Minutes: %d\n". tm_hour). tm_min).3 main proc near var_30 var_2C unix_time tm_sec tm_min tm_hour tm_mday tm_mon tm_year = = = = = = = = = dword dword dword dword dword dword dword dword dword push mov and sub call mov call mov lea mov lea mov call mov add mov mov call mov mov mov call mov mov mov call mov mov mov call mov mov mov call mov mov mov call leave retn ptr ptr ptr ptr ptr ptr ptr ptr ptr -30h -2Ch -1Ch -18h -14h -10h -0Ch -8 -4 ebp ebp. "Hour: %d\n" printf eax. eax [esp+30h+var_30]. "Minutes: %d\n" printf eax. offset aHourD . i.h:59:12: note: expected 'struct tm *' but argument is of type 'int *' But nevertheless. 0FFFFFFF0h esp. eax [esp+30h+var_30]. eax eax. STRUCTURES 21. [esp+30h+tm_sec] [esp+30h+var_2C].7. UNIX: STRUCT TM printf ("Hour: %d\n". "Seconds: %d\n" printf 360 . "Year: %d\n" printf eax.7. it generates this: Listing 21. offset aMinutesD . printf ("Seconds: %d\n". [esp+30h+tm_sec] [esp+30h+var_2C]. [esp+30h+tm_mday] [esp+30h+var_2C]. eax localtime_r eax. to the first element of the “structure”. offset aMonthD .12: GCC 4. N. offset aYearD . [esp+30h+tm_hour] [esp+30h+var_2C].

timer lea ebx. UNIX: STRUCT TM endp This code is identical to what we saw previously and it is not possible to say.3. }. I’m not adding a debugger example here: it is just the same as you already saw. some other compiler may warn about the tm_year. it is not recommended to do this in practice. directing the compiler to hold them in one place. 0FFFFFFF0h sub esp. 0 . non-optimizing compilers allocates variables in the local stack in the same order as they were declared in the function. but not tm_sec are used without being initialized. time_t unix_time. However. tm_hour. [esp+38h] 361 . side-by-side. I could say that the structure is syntactic sugar.8. Read more about it in next section: “Fields packing in structure” ( 21. i<9. tm_mon. 21. tmp). STRUCTURES main 21. 0x0000002D 0x00000033 0x00000017 0x0000001A 0x00000006 0x00000072 0x00000006 0x000000CE 0x00000001 (45) (51) (23) (26) (6) (114) (6) (206) (1) The variables here are in the same order as they are enumerated in the definition of the structure: 21. 40h mov dword ptr [esp]. [esp+14h] call _time lea esi. like in the case of the SYSTEMTIME structure—GetSystemTime() will fill them incorrectly ( because the local variables are aligned on a 32-bit boundary). Usually. And that works! I ran the example at 23:51:45 26-July-2014. I’m not an expert on programming languages. esp push esi push ebx and esp. tm_min variables. Indeed. Nevertheless. printf ("0x%08X (%d)\n". By the way.3.CHAPTER 21.4 on the following page).1 main proc near push ebp mov ebp. I chose this example. for (i=0. i++) { int tmp=((int*)&t)[i]. int i. so most likely I’m wrong with this term. unix_time=time(NULL).14: Optimizing GCC 4.8 on page 356. localtime_r (&unix_time. the compiler is not aware that these are to be filled by localtime_r() function. }. tm_mday. there is no guarantee. a structure is just a pack of variables laying on one place. However. Here is how it gets compiled: Listing 21. there were no structures at all [Rit93].h> void main() { struct tm t.h> #include <time. was it a structure in original source code or just a pack of variables. I just cast a pointer to structure to an array of int’s. &t). since all structure fields are of type int. tmp.5 Structure as an array of 32-bit words #include <stdio. So. By the way. in some very early C versions (before 1972). This would not work if structure fields are 16-bit (WORD). And this works.

}.h> #include <time. int i. mov eax.CHAPTER 21. eax . j. 4 . j++) printf ("0x%02X ". It’s even possible to modify the fields of the structure through this pointer. unix_time=time(NULL). ESI is the pointer to the end of it. next field in structure mov dword ptr [esp+4]. localtime_r (&unix_time. }.3. UNIX: STRUCT TM [esp+4]. printf ("\n"). eax . Let’s cast the pointer to an array of bytes and dump it: #include <stdio. time_t unix_time. NOP loc_80483D8: . ((unsigned char*)&t)[i*4+j]). EBX here is pointer to structure. pass value to printf() call ___printf_chk cmp ebx.6 Structure as an array of bytes I can do even more.load next value then lea esp. not recommended for use in production code. try to modify (increase by 1) the current month number. 21. "0x%08X (%d)\n" mov dword ptr [esp].3. 0x2D 0x33 0x17 0x1A 0x06 0x72 0x06 0xCE 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 362 . pass value to printf() mov [esp+8]. it’s dubiously hackish way to do things. tp . get 32-bit word from array add ebx. [esi+0] .h> void main() { struct tm t. eax eax. and then it’s treated as an array. esi . offset a0x08xD . [ebx] . meet structure end? jnz short loc_80483D8 . timer esi. &t). [esp+10h] [esp]. [ebp-8] pop ebx pop esi pop ebp retn main endp Indeed: the space in the local stack is first treated as a structure. j<4. eax _localtime_r . no . i++) { for (j=0. treating the structure as an array. i<9. 1 mov [esp+0Ch]. STRUCTURES mov mov lea mov call nop lea 21. ebx [esp+10h]. And again. Exercise As an exercise. for (i=0.

4 _putchar esi. tp mov [esp+10h]. print carriage return mov add call cmp jnz lea pop pop pop pop retn main endp 21. loc_8048408: xor ebx. 0 . load byte ebx. NOP .15: Optimizing GCC 4. "0x%02X " dword ptr [esp]. esi . Listing 21. c esi. struct end mov [esp+4]. void f(struct s s) { 5 See also: Wikipedia: Data structure alignment 363 .5 on page 361). Let’s take a simple example: #include <stdio. because this is a little-endian architecture ( 31 on page 453). eax . the lowest byte goes first.4. 1 [esp+8]. [esi+0] . pass loaded byte to printf() ___printf_chk ebx. meet struct end? short loc_8048408 . }. int b.8. j=0 loc_804840A: movzx add mov mov mov call cmp jnz . timer lea esi. [esp+10h] mov [esp]. 40h mov dword ptr [esp]. esp push edi push esi push ebx and esp. 0FFFFFFF0h sub esp. [esp+14h] call _time lea edi. and of course. The values are just the same as in the previous dump ( 21. j=j+1 dword ptr [esp+4]. 1 . int d. FIELDS PACKING IN STRUCTURE I ran this example also at 23:51:45 26-July-2014. [esp+38h] .4 eax. eax .h> struct s { char a. [ebp-0Ch] ebx esi edi ebp Fields packing in structure One important thing is fields packing in structures5 . EDI is the pointer to structure end. char c. timer call _localtime_r lea esi. STRUCTURES 21.3.1 main proc near push ebp mov ebp.CHAPTER 21. 4 short loc_804840A character (CR) dword ptr [esp]. ebx . j=0 esp. edi . 0Ah . ESI here is the pointer to structure in local stack. offset a0x02x . byte ptr [esi+ebx] . eax lea eax.

mov BYTE PTR _tmp$[ebp+8]. 2 . s. 16 xor eax. tmp.16: MSVC 2012 /GS. s. mov DWORD PTR [eax].a.4 bytes). ecx mov edx. DWORD PTR _tmp$[ebp+12] mov DWORD PTR [eax+12]. esp mov ecx. 4 . f push ebp mov ebp. 21. d=%d\n". b=%d. FIELDS PACKING IN STRUCTURE printf ("a=%d. DWORD PTR _s$[ebp+4] push edx movsx eax. edx call _f add esp. int main() { struct s tmp. 3 . size = 16 ?f@@YAXUs@@@Z PROC . tmp.4. esp mov eax. }. as we can see. sub esp. 16 . we have two char fields (each is exactly one byte) and two more —int (each . DWORD PTR _s$[ebp+12] push eax movsx ecx. s.4. STRUCTURES 21. BYTE PTR _s$[ebp] push eax push OFFSET $SG3842 call _printf add esp.d). edx mov ecx. and then all 4 fields. DWORD PTR _tmp$[ebp+4] mov DWORD PTR [eax+4]. but in fact. }.b=2. c=%d.c.CHAPTER 21. 1 . s. 20 pop ebp ret 0 ?f@@YAXUs@@@Z ENDP . and then its pointer (address) 364 . f _TEXT ENDS We pass the structure as a whole. mov DWORD PTR _tmp$[ebp+4]./Ob0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 _tmp$ = -16 _main PROC push ebp mov ebp. the structure is being copied to a temporary one (a place in stack is allocated in line 10 for it. f(tmp).d=4. DWORD PTR _tmp$[ebp+8] mov DWORD PTR [eax+8]. mov eax. esp sub esp. tmp. one by one. mov DWORD PTR _tmp$[ebp+12]. are copied in lines 12 … 19). DWORD PTR _tmp$[ebp] .a=1. As we see. eax mov esp. ecx mov edx.b. BYTE PTR _s$[ebp+8] push ecx mov edx. tmp.1 x86 This compiles to: Listing 21. ebp pop ebp ret 0 _main ENDP set field a set field b set field c set field d allocate place for temporary structure copy our structure to the temporary one _s$ = 8 . 16 mov BYTE PTR _tmp$[ebp].c=3.

size = 10 ?f@@YAXUs@@@Z PROC . Not field-by-field. 20 pop ebp ret 0 ?f@@YAXUs@@@Z ENDP . of course: 43. What does it give to us? Size economy.4. As we can see. sub esp. DWORD PTR _s$[ebp+6] push eax movsx ecx. Why not 4? The compiler decided that it’s better to copy 10 bytes using 3 MOV pairs than to copy two 32-bit words and two bytes using 4 MOV pairs. 12 xor eax. using three pairs of MOV. Aside from MSVC /Zp option which sets how to align each structure field. which can be defined right in the source code.h file has this: 6 MSDN: Working with Packing Structures Pragmas 7 Structure-Packing 365 . The structure is also copied in main(). DWORD PTR _s$[ebp+1] push edx movsx eax. The structure is copied because it’s not know if the f() function going to modify the structure or not. mov BYTE PTR _tmp$[ebp+5]. f push ebp mov ebp.CHAPTER 21. each field’s address is aligned on a 4-byte boundary. such copy implementation using MOV instead of calling the memcpy() function is widely used. mov DWORD PTR _tmp$[ebp+1]. but without the copying. but directly 10 bytes. By the way. We could use C/C++ pointers. Let’s get back to the SYSTEMTIME structure that consists of 16-bit fields. ebp pop ebp ret 0 _main ENDP set field a set field b set field c set field d allocate place for temporary structure copy 10 bytes _TEXT SEGMENT _s$ = 8 . DWORD PTR _tmp$[ebp] . BYTE PTR _s$[ebp+5] push ecx mov edx. there is also the #pragma pack compiler option. 3 . ecx mov edx. esp mov ecx. esp mov eax. 2 . esp sub esp. because it’s faster than a call to memcpy()—for short blocks. Why? Because it is easier for the CPU to access memory at aligned addresses and to cache data from it.17: MSVC 2012 /GS. WORD PTR _tmp$[ebp+8] mov WORD PTR [eax+8]. 12 . then the structure in main() has to remain as it was. 4 . cx call _f add esp. That’s why each char occupies 4 bytes here (like int). 12 mov BYTE PTR _tmp$[ebp]. BYTE PTR _s$[ebp] push eax push OFFSET $SG3842 call _printf add esp. Let’s try to compile it with option (/Zp1) (/Zp[n] pack structures on n-byte boundary). mov DWORD PTR _tmp$[ebp+6].1. 1 . If it gets changed. How does our compiler know to pack them on 1-byte alignment boundary? WinNT. if the structure is used in many source and object files./Zp1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 _main PROC push ebp mov ebp. mov DWORD PTR [eax]. It is available in both MSVC6 and GCC7 .5 on page 504. it is not very economical. And as drawback —the CPU accessing these fields slower than it could. FIELDS PACKING IN STRUCTURE is to be passed. and the resulting code will be almost the same. eax mov esp. edx mov cx. STRUCTURES 21. all these must be compiled with the same convention about structures packing. DWORD PTR _tmp$[ebp+4] mov DWORD PTR [eax+4]. mov eax. As it can be easily guessed. However. f Now the structure takes only 10 bytes and each char value takes 1 byte. Listing 21.

h #include "pshpack1.h #if ! (defined(lint) || defined(RC_INVOKED)) #if ( _MSC_VER >= 800 && !defined(_M_I86)) || defined(_PUSHPOP_SUPPORTED) #pragma warning(disable:4103) #if !(defined( MIDL_PASS )) || defined( __midl ) #pragma pack(push.1) #else #pragma pack(1) #endif #else #pragma pack(1) #endif #endif /* ! (defined(lint) || defined(RC_INVOKED)) */ This tell the compiler how to pack the structures defined after #pragma pack. FIELDS PACKING IN STRUCTURE Listing 21.20: PshPack1.18: WinNT.19: WinNT.h looks like: Listing 21.CHAPTER 21.h" And this: Listing 21. STRUCTURES 21.h" // 4 byte packing is the default The file PshPack1.h #include "pshpack4.4. 366 .

The remaining 3 bytes of the 32-bit words are not being modified in memory! Hence.16 on page 364.4. This garbage doesn’t influence the printf() output in any way. 1 and 3 respectively (lines 6 and 8). But where do the random bytes (0x30. that are next to the first (a) and third (c) fields? By looking at our listing 21. 367 .16 (lines 34 and 38). If the type unsigned char or uint8_t was used here.3: OllyDbg: Before printf() execution We see our 4 fields in the data window. because the values for it are prepared using the MOVSX instruction. therefore only one byte is written. the MOVSX (sign-extending) instruction is used here. MOVZX instruction would have been used instead. FIELDS PACKING IN STRUCTURE OllyDbg + fields are packed by default Let’s try our example (where the fields are aligned by default (4 bytes)) in OllyDbg: Figure 21.21.CHAPTER 21. because char is signed by default in MSVC and GCC. we can see that the first and third fields are char. which takes bytes. 0x37. 0x01) come from. STRUCTURES 21. random garbage is left there. but not words: listing. By the way.

[SP. c . it doesn’t matter which epilogue gets executed.text:00000292 .text:00000286 . it hold 5 local variables too (5 ∗ 4 = 0x14)).text:00000280 .text:00000280 .21: Optimizing Keil 6/2013 (thumb mode) . LDRB loads one byte from memory and extends it to 32-bit.4: OllyDbg: Before printf() execution 21. taking its sign into account.#8] R0.text:00000280 . #0x14 POP {PC} .text:00000280 .text:00000282 . 368 . that was quite different function. not related in any way to ours.#4] R0. c=%d.#12] R1.CHAPTER 21.text:0000028E .text:00000280 .4. b=%d. b . Keil decides to reuse a part of another function to economize. Here it is used to load fields a and c from the structure.text:00000284 . d . One more thing we spot easily is that instead of function epilogue. d=%d\n" As we may recall. FIELDS PACKING IN STRUCTURE OllyDbg + fields aligning on 1 byte boundary Things are much clearer here: 4 fields occupy 10 bytes and the values are stored side-by-side Figure 21. there is jump to another function’s epilogue! Indeed. aADBDCDDD __2printf exit .2 ARM Optimizing Keil 6/2013 (thumb mode) Listing 21. SP.text:0000003E 05 B0 .text:00000280 . and since the first 4 function arguments in ARM are passed via registers. "a=%d. [R0. SP.LR} SP. [R0. STRUCTURES 21. The epilogue takes 4 bytes while jump —only 2.text:00000288 . [SP. CODE XREF: f+16 ADD SP. it has exactly the same epilogue (probably because. Apparently.text:00000280 .text:00000296 f var_18 a b c d 0F 81 04 02 00 68 03 01 59 05 D2 B5 B0 98 9A 90 46 7B 79 A0 F0 AD FF E6 = = = = = -0x18 -0x14 -0x10 -0xC -8 PUSH SUB LDR LDR STR MOV LDRB LDRB ADR BL B {R0-R3.text:00000280 . the structure’s fields are passed via R0-R3. #4 R0. however. Also it is located nearby (take a look at the addresses). SP R3.text:0000028C . Indeed.text:00000040 00 BD exit . [SP] R0.#16] R2. This is similar to MOVSX in x86.text:0000028A .text:00000290 . a . here a structure is passed instead of pointer to one. if it works just as we need.text:0000003E .4.text:00000280 .

22: Optimizing Xcode 4.PC} SXTB (Signed Extend Byte) is analogous to MOVSX in x86. 0x28+var_4($sp) or $at. load delay slot. #4 {R7. c=%d. c=%d. prepare c R2. d=%d\n"<0> 369 . R1 .6. PC . d=%d\n" move $a1.d lui $gp. 24 lw $t9. $a2.3 MIPS Listing 21. b=%d. (__gnu_local_gp >> 16) addiu $sp. [SP. place d to stack for printf() R0. b=%d. d=%d\n" R1.4. R2 . NOP jr $ra addiu $sp.W STR ADD SXTB MOV BLX ADD POP {R7. b R1. "a=%d. branch delay slot lw $ra. ($LC0 & 0xFFFF) # "a=%d. All the rest —just the same. 0x28 . ($LC0 >> 16) # "a=%d. . R9 . $a1 . #0 R3. $a0.4. 0x28+var_18($sp) sw $a1.LR} R7. SP.ascii "a=%d. format-string R3.CHAPTER 21. prepare a R0. 0x28+arg_0($sp) lui $a0.5 (IDA) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 f: var_18 var_10 var_4 arg_0 arg_4 arg_8 arg_C . #4 R9. $v0 . 24 move $v1.a $a1=s. STRUCTURES 21. 0x28+arg_8($sp) sw $a3. R0 . SP. FIELDS PACKING IN STRUCTURE ARM + Optimizing Xcode 4.3 (LLVM) (thumb-2 mode) var_C = -0xC PUSH MOV SUB MOV MOV MOVW SXTB MOVT. = -0x18 = -0x10 = -4 = 0 = 4 = 8 = 0xC $a0=s.b $a2=s. 0x28+var_10($sp) . (__gnu_local_gp & 0xFFFF) sw $ra.6. #0xF10 . d=%d\n" sw $a3.c $a3=s. R1 . -0x28 la $gp. a R0.23: Optimizing GCC 4. c=%d. prepare byte from 32-bit big-endian integer: sra $t0. c=%d. 21. b=%d. branch delay slot $LC0: . $v1 jalr $t9 move $a3. $t0 move $a2. (printf & 0xFFFF)($gp) sw $a0. 0x28+var_4($sp) sw $gp.3 (LLVM) (thumb-2 mode) Listing 21. b=%d. b _printf SP. . prepare byte from 32-bit big-endian integer: sra $v0. SP SP. 0x28+arg_4($sp) sw $a2.#0xC+var_C] . 0x28+arg_C($sp) la $a0. .4. $zero .

b and d. b=%d.b=%d.5. … in this case. b=%d. }. d=%d.a=1. But there are two SRA (“Shift Word Right Arithmetic”) instructions. both inner_struct fields are to be placed between the a. s.a.a=%d. d=%d\n". d=%d.CHAPTER 21. a. int e. int b. And when a char variable needs to be extended into a 32-bit value.d=3. b. Let’s compile (MSVC 2010): Listing 21. int d) { printf ("a=%d.c.$A3 and then get reshuffled into $A1. c. struct inner_struct c. s.4 One more word Passing a structure as a function argument (instead of a passing pointer to structure) is the same as passing all structure fields one by one. s.e). s. c. }. struct outer_struct { char a. so an arithmetical shift is used here instead of logical. s. }.e fields of the outer_struct. 21. BYTE PTR _s$[esp+12] 370 .b=101.5 Nested structures Now what about situations when one structure is defined inside of another? #include <stdio. s. c=%d.a.a=%d. e=%d'.b=2. they occupy the high 31. the f() function can be rewritten as: void f(char a. f(s). e=%d\n". s.$A4 for printf(). DWORD PTR _s$[esp+16] movsx ecx.. c.a=100. char d.. s. 21. s. b=%d. 00H _TEXT SEGMENT _s$ = 8 _f PROC mov eax.24 bits. int b.h> struct inner_struct { int a. c. NESTED STRUCTURES Structure fields come in registers $A0.c..e=4. int b.4. s. char is a signed type. So when byte variables are stored in 32-bit structure slots.d.b=%d. void f(struct outer_struct s) { printf ("a=%d. s. and the Debian Linux I work in is big-endian too. }. If the structure fields are packed by default.c. 0aH. c. }. Why? MIPS is a big-endian architecture by default 31 on page 453. which prepare char fields. s. char c. And that leads to the same code.b. STRUCTURES 21. it must be shifted right by 24 bits. int main() { struct outer_struct s.b. d).c.24: Optimizing MSVC 2010 /Ob0 $SG2802 DB 'a=%d.

eax pop ebx add esp.b=%d. 3 mov ecx. BYTE PTR _s$[esp+8] eax ecx edx OFFSET $SG2802 . 1 mov ebx. 2 sub esp. DWORD PTR _s$[esp+76] mov DWORD PTR [eax+12]. Of course. edx mov BYTE PTR _s$[esp+76]. 371 . b=%d.a=%d. 28 0 _s$ = -24 _main PROC sub esp. NESTED STRUCTURES edx. d=%d. 24 ret 0 _main ENDP One curious thing here is that by looking onto this assembly code. ecx mov DWORD PTR [eax+20]. 24 mov eax. c.CHAPTER 21. 24 pop edi pop esi xor eax. e=%d' _printf esp. DWORD PTR _s$[esp+8] edx edx. ecx lea edx. STRUCTURES _f mov push mov push mov push movsx push push push push call add ret ENDP 21. we would say. nested structures are unfolded into linear or one-dimensional structure. DWORD PTR [ecx+2] mov DWORD PTR [eax+8]. DWORD PTR _s$[esp+8] eax eax. DWORD PTR [ecx+98] lea esi. if we replace the struct inner_struct c. esi mov DWORD PTR [eax+16]. DWORD PTR [ecx+99] lea edi.5. DWORD PTR _s$[esp+60] mov DWORD PTR [eax]. esp mov BYTE PTR _s$[esp+60]. c. ebx mov DWORD PTR [eax+4]. declaration with struct inner_struct *c. 'a=%d. DWORD PTR _s$[esp+8] ecx ecx. we do not even see that another structure was used inside of it! Thus. 24 push ebx push esi push edi mov ecx. (thus making a pointer here) the situation will be quite different. edi call _f add esp.

a) 32-bit word 0x64 (100). For example. it is not rational if speed is important. • (outer_struct.4.6 Bit fields in a structure 21.b) 32-bit word 0x65 (101). So let’s make this function by ourselves for GCC with the help of its built-in assembler9 ."=d"(*d):"a"(code))."=b"(*b). It is very useful if one needs to save memory space.a) byte 1 + 3 bytes of random garbage."=c"(*c). This instruction returns information about the current CPU and its features. but GCC 4. #include <stdio.5: OllyDbg: Before printf() execution That’s how the values are located in memory: • (outer_struct.6. • (inner_struct. STRUCTURES 21.d) byte 3 + 3 bytes of random garbage. int *d) { asm volatile("cpuid":"=a"(*a). • (outer_struct.1 CPUID example The C/C++ language allows to define the exact number of bits for each structure field.5.1 21. int *c. Let’s consider the CPUID8 instruction example. CPUID returning this information packed into the EAX register: 3:0 (4 bits) 7:4 (4 bits) 11:8 (4 bits) 13:12 (2 bits) 19:16 (4 bits) 27:20 (8 bits) Stepping Model Family Processor Type Extended Model Extended Family MSVC 2010 has CPUID macro. int *a. • (outer_struct.h> #ifdef __GNUC__ static inline void cpuid(int code.6. int *b. • (inner_struct.1 does not. If the EAX is set to 1 before the instruction’s execution. } #endif 8 wikipedia 9 More about internal GCC assembler 372 . 21. But of course. one bit is enough for a bool variable.b) 32-bit word 2. BIT FIELDS IN A STRUCTURE OllyDbg Let’s load the example into OllyDbg and take a look at outer_struct in memory: Figure 21.e) 32-bit word 4.CHAPTER 21.

DWORD PTR _b$[esp+24] eax. ("extended_model_id=%d\n". BIT FIELDS IN A STRUCTURE #ifdef _MSC_VER #include <intrin. ("extended_family_id=%d\n". &b[2]. 1 mov mov and esi. ("processor_type=%d\n". ("model=%d\n". edx 373 . }. printf printf printf printf printf printf ("stepping=%d\n". int main() { struct CPUID_1_EAX *tmp. STRUCTURES 21. tmp->processor_type). ecx DWORD PTR [esi+12]. tmp->extended_model_id). #endif tmp=(struct CPUID_1_EAX *)&b[0]. 15 esi esi. After CPUID fills EAX/EBX/ECX/EDX. #ifdef _MSC_VER __cpuid(b. ecx eax.6. ebx DWORD PTR [esi+8]. unsigned int extended_family_id:8. DWORD PTR _b$[esp+24] DWORD PTR [esi]. ("family_id=%d\n". &b[3]). unsigned int family_id:4. we treat a 32-bit int value as a structure.1). eax DWORD PTR [esi+4]. }. esi eax. tmp->family_id). #endif #ifdef __GNUC__ cpuid (1. Then. &b[1]. unsigned int reserved1:2.h> #endif struct CPUID_1_EAX { unsigned int stepping:4. MSVC Let’s compile it in MSVC 2008 with /Ox option: Listing 21. tmp->model). int b[4].CHAPTER 21. return 0. we have a pointer to the CPUID_1_EAX structure and we point it to the value in EAX from the b[] array. unsigned int extended_model_id:4. these registers are to be written in the b[] array. size = 16 _main PROC sub esp.25: Optimizing MSVC 2008 _b$ = -16 . 16 push ebx xor mov cpuid push lea mov mov mov mov ecx. Then we read from the structure. tmp->extended_family_id). &b[0]. In other words. unsigned int reserved2:4. unsigned int processor_type:2. unsigned int model:4. tmp->stepping).

CHAPTER 21. 15 ecx OFFSET $SG15439 . 00H _printf mov shr and push push call eax. esi ecx. 0aH. we ignore some bits at the right side.g. 00H _printf mov shr and push push call ecx. 15 ecx OFFSET $SG15436 . 0aH. leaves only those bits in the EAX register we need. 12 eax. esi eax. or. 0aH. 'stepping=%d'. BIT FIELDS IN A STRUCTURE push push call eax OFFSET $SG15435 . The AND instruction clears the unneeded bits on the left. 0aH. in other words. 'family_id=%d'. 'model=%d'. 0aH. 4 ecx. esi edx. 00H _printf mov shr and push push call ecx. STRUCTURES 21. 'extended_family_id=%d'. 48 esi xor pop eax. 00H _printf shr and push push call add pop esi. 00H _printf esp. 255 esi OFFSET $SG15440 . e. 16 ecx. 0aH. 3 eax OFFSET $SG15438 . 16 0 ENDP The SHR instruction shifting the value in EAX by the number of bits that must be skipped. 20 esi. 00H _printf mov shr and push push call edx. eax ebx add ret _main esp. 8 edx. 'processor_type=%d'. esi ecx. 15 edx OFFSET $SG15437 .. 'extended_model_id=%d'. 374 .6.

6: OllyDbg: After CPUID execution EAX has 0x000206A7 (my CPU is Intel Xeon E3-1220).7: OllyDbg: Result GCC Let’s try GCC 4. STRUCTURES 21. what values are set in EAX/EBX/ECX/EDX after the execution of CPUID: Figure 21. 0FFFFFFF0h esi esi.26: Optimizing GCC 4.CHAPTER 21. BIT FIELDS IN A STRUCTURE MSVC + OllyDbg Let’s load our example into OllyDbg and see. Here is how the bits are distributed by fields: field reserved2 extended_family_id extended_model_id reserved1 processor_id family_id model stepping in binary form 0000 00000000 0010 00 00 0110 1010 0111 in decimal form 0 0 2 0 0 6 10 7 Figure 21. esp esp. DATA XREF: _start+17 ebp ebp. 1 375 .1 main push mov and push mov proc near .4.4.1 with -O3 option. This is 00000000000000100000011010100111 in binary form. Listing 21.6.

eax dword ptr [esp+4].h> 376 . esi eax. eax dword ptr [esp+4]. 21. 31 30 S exponent 23 22 0 mantissa or fraction ( S—sign ) #include <stdio. 1 ___printf_chk [esp+8]. dword ptr [esp]. esi eax. BIT FIELDS IN A STRUCTURE ebx eax. "stepping=%d\n" offset aModelD . eax eax.h> #include <assert. 1 ___printf_chk eax. dword ptr [esp]. 10h esi. instead of calculating them separately before each printf() call. dword ptr [esp]. eax ebx esi esp. eax dword ptr [esp+4]. dword ptr [esp]. 8 eax. 0Fh [esp+8]. 18h esi. But will we be able to work with these fields directly? Let’s try this with float. 1 ___printf_chk eax. "processor_type=%d\n" offset aExtended_model . both float and double types consist of a sign. 0Fh esi. 4 eax. "extended_model_id=%d\n" offset unk_80486D0 endp Almost the same. 18h eax. dword ptr [esp]. esi dword ptr [esp+4]. ebp ebp offset aSteppingD . esi eax. The only thing worth noting is that GCC somehow combines the calculation of extended_model_id and extended_family_id into one block.2 Working with the float type as with a structure As we already noted in the section about FPU ( 17 on page 208). eax dword ptr [esp+4]. eax dword ptr [esp+4]. 0Ch eax. "family_id=%d\n" offset aProcessor_type . esi eax. STRUCTURES push mov sub cpuid mov and mov mov mov call mov shr and mov mov mov call mov shr and mov mov mov call mov shr and mov mov mov call mov shr shr and and mov mov mov call mov mov mov call add xor pop pop mov pop retn main 21. "model=%d\n" offset aFamily_idD . 3 [esp+8]. 1 ___printf_chk eax. esi esp. a significand (or fraction) and an exponent. 0FFh [esp+8].6. 14h eax.CHAPTER 21. 0Fh [esp+8]. dword ptr [esp]. 1 ___printf_chk eax. 1 ___printf_chk esp. 0Fh [esp+8].6.

eax. assert (sizeof (struct float_as_struct) == sizeof (float)). eax. by adding 2 to the exponent.set minus sign DWORD PTR _t$[ebp]. we thereby multiply the whole number by 22 .27: Non-optimizing MSVC 2008 _t$ = -8 _f$ = -4 __in$ = 8 ?f@@YAMM@Z push mov sub . size = 4 ..CHAPTER 21. ecx.6. 80000000H . }. STRUCTURES 21. i. size = 4 . esp esp. Now we are setting the negative sign in the input value and also.. 000000ffH 23 . DWORD PTR _f$[ebp] eax ecx. DWORD PTR _t$[ebp] edx. // set negative sign t. The float_as_struct structure occupies the same amount of memory as float. 4 bytes or 32 bits. edx mov shr and add and shl mov and eax. 000000ffH 2 . -2147483648 . DWORD PTR _t$[ebp] ecx _memcpy esp. &t.exponent=t. 8 fld fstp DWORD PTR __in$[ebp] DWORD PTR _f$[ebp] push lea push lea push call add 4 eax.234)). BIT FIELDS IN A STRUCTURE #include <stdlib.sign=1. }. eax. 00000017H 255 . // sign bit }. // exponent + 0x3FF unsigned int sign : 1. by 4. &f. // multiply d by 2n (n here is 2) memcpy (&f.e. i. return f.e.h> #include <memory. f(1. float f(float in) { float f=in. DWORD PTR _t$[ebp] 23 . memcpy (&t. f ebp ebp. ecx. sizeof (float)). size = 4 PROC . struct float_as_struct t.exponent+2. 807fffffH - drop significand leave here only exponent shift result to place of bits 30:23 drop exponent 377 .h> struct float_as_struct { unsigned int fraction : 23. add 2 to it 255 . 00000017H DWORD PTR _t$[ebp] -2139095041 . eax. t. // fractional part unsigned int exponent : 8. sizeof (float)). eax. Let’s compile in MSVC 2008 without optimization turned on: Listing 21. 12 mov or mov edx. int main() { printf ("%f\n".

1 .28: Optimizing GCC 4. But it is easier to understand by looking at the unoptimized version. eax mov DWORD PTR _t$[ebp].6. [ebp+arg_0] eax. "%f\n" mov dword ptr [esp]. f(float) public _Z1ff _Z1ff proc near var_4 arg_0 = dword ptr -4 = dword ptr 8 _Z1ff push mov sub mov or mov and shr add movzx shl or mov fld leave retn endp main main ebp ebp. 807FFFFFh . edx. the f variable is used directly. eax [ebp+var_4] set minus sig leave only significand and exponent in EAX prepare exponent add 2 clear all bits except 7:0 in EAX shift new calculated exponent to its place add new exponent and original value without exponent public main proc near push ebp mov ebp. edx. esp esp.CHAPTER 21. STRUCTURES 21. What would GCC 4.936 fstp qword ptr [esp+8] mov dword ptr [esp+4]. BIT FIELDS IN A STRUCTURE . [ebp+var_4]. 10h fld ds:dword_8048614 . 23 . dl .234) during compilation despite all this hodge-podge with the structure fields and prepared this argument to printf() as precalculated at compile time! 378 . 0FFFFFFF0h sub esp. esp and esp. eax leave retn endp The f() function is almost understandable. eax. edx . edx.4. -4. 23 . However. 2 . 1 call ___printf_chk xor eax. add original value without exponent with new calculated exponent or ecx. f A bit redundant.4. what is interesting is that GCC was able to calculate the result of f(1. edx. ebp ebp 0 ENDP . DWORD PTR _f$[ebp] eax _memcpy esp.1 with -O3 do? Listing 21. offset asc_8048610 . DWORD PTR _t$[ebp] edx eax. If it was compiled with /Ox flag there would be no memcpy() call. 4 eax. 80000000h . eax eax. edx. ecx push lea push lea push call add 4 edx. 12 fld DWORD PTR _f$[ebp] mov pop ret ?f@@YAMM@Z esp.

00H __real@405ec00000000000 DQ 0405ec00000000000r __real@407bc00000000000 DQ 0407bc00000000000r _s$ = 8 _f PROC push mov cmp jle cmp jbe fld sub fmul fld fmul faddp fstp push call movzx movsx push push push call add pop ret $LN3@f: pop mov jmp $LN4@f: pop mov jmp _f ENDP .7 21.1 Exercise #1 Linux x86 (beginners. Function contents may be ignored for the moment. STRUCTURES 21.r0 r0. 'error #2' _printf esi DWORD PTR _s$[esp-4]. 8 QWORD PTR __real@407bc00000000000 QWORD PTR [esi+16] QWORD PTR __real@405ec00000000000 ST(1). %d' _printf esp. 1000 SHORT $LN4@f DWORD PTR [esi+4].#0] r0. 00H '%c.7.5 -O3 379 . 'error #1' _printf Listing 21. Try to reverse engineer structure field types. '%c. ST(0) QWORD PTR [esp] OFFSET $SG2802 . %d'.re)10 Linux MIPS (beginners. 24 esi 0 esi DWORD PTR _s$[esp-4].re)11 This program for Linux x86 and Linux MIPS opens a file and prints a number. Listing 21.#0x3e8 4.8. 00H 'error #1'. 0aH.4.7. 00H 'error #2'. 0aH. '%f' _printf eax.7. BYTE PTR [esi+24] eax ecx OFFSET $SG2803 .1.[r0. DWORD PTR _s$[esp] DWORD PTR [esi].30: Non-optimizing Keil 6/2013 (ARM mode) f PROC PUSH MOV LDR CMP 10 GCC 11 GCC {r4-r6. 21. OFFSET $SG2807 .12 on page 958. 123 .1 -O3 4. 444 esi esi.2 Exercise #2 This function takes some structure on input and does something. BYTE PTR [esi+25] ecx. 10 SHORT $LN3@f DWORD PTR [esi+8] esp. EXERCISES Exercises 21.29: Optimizing MSVC 2010 $SG2802 $SG2803 $SG2805 $SG2807 DB DB DB DB '%f'. 0aH.CHAPTER 21.lr} r4. 0aH. OFFSET $SG2805 . What is this number? Answer: G.

[r4.172| __2printf r2.r1 r0.140| |L0.r4.#0x18] {r4-r6.132| r0.{r0.#0 __aeabi_dmul r5.132| r0.#0x10 r0.r6 __aeabi_dadd r2.|L0. %d\n".#0x10 r0.[r4.[r4.#0x3e8 r0.|L0.#0x19] r1.140| |L0.172| |L0.r0 380 .152| |L0.|L0.|L0.|L0.|L0.168| __aeabi_fmul __aeabi_f2d r2.168| |L0.lr} r0.164| |L0.r1 r0.164| r2.#4] r0.#0 __aeabi_dmul r5.#0] r0.[r4.|L0.132| r0.0 0 0 DCB DCB DCB "error #2\n".152| |L0.132| |L0.0 0 0 DCD 0x405ec000 DCD 0x43de0000 DCB "%f\n".31: Non-optimizing Keil 6/2013 (thumb mode) f PROC PUSH MOV LDR CMP ADRLE BLE LDR CMP ADRLS BLS ADD LDM LDR MOV BL MOV {r4-r6.r0 r3.164| r2.r0 r0.r4. EXERCISES ADRLE BLE LDR CMP ADRLS BLS ADD LDM LDR MOV BL MOV MOV LDR LDR BL BL MOV MOV BL MOV MOV ADR BL LDRB LDRB POP ADR B r0.0 |L0.[r0.r5 r3.176| Listing 21.#0xa r0.|L0.#4] r0.140| |L0.{r0.176| __2printf POP B ENDP {r4-r6. STRUCTURES 21.#8] r1.r1} r3.CHAPTER 21.r1} r3.r0 r6.152| |L0.lr} __2printf DCB DCB DCB "error #1\n".0 DCB "%c.7.132| r0.lr} r4.#0xa r0.|L0.[r4.

[sp. [x19. x0. d1.16] x29.r5 r3. s1 d0. [sp.25] x0.|L0. d2.r1 r0.7. :lo12:. .LC3 w2. s1.9 (ARM64) f: stp add ldr str cmp ble ldr cmp bls ldr mov ldr adrp ldr add fmul ldr fmul fcvt fadd bl ldrb adrp ldrb add ldr ldp b x29.8] x19.24] x0. . [sp.[r4.#0x18] {r4-r6.0 0 0 DCD 0x405ec000 DCD 0x43de0000 DCB "%f\n".172| __2printf r2.#8] r1.L3 s1. [x19. [x0.L2 w1.0 DCB "%c. STRUCTURES 21.CHAPTER 21.0 0 0 DCB DCB DCB "error #2\n".16] w1. [x19.r6 __aeabi_dadd r2.0 |L0.r1 r0.32: Optimizing GCC 4. s0 d0. x30. %d\n". 1000 .|L0. d0 d1.r0 r3. [sp]. x0 s0.176| __2printf POP B ENDP {r4-r6.LC2 d0. .152| |L0.#0x19] r1.4] w1.[r4.16] x0.140| |L0.168| |L0.168| __aeabi_fmul __aeabi_f2d r2.LC0 s1.LC0 d2.lr} __2printf DCB DCB DCB "error #1\n".176| Listing 21.lr} r0. 32 printf 381 .|L0. :lo12:. sp. d0 printf w1. x0. 10 . x30. -32]! x29. EXERCISES MOV LDR LDR BL BL MOV MOV BL MOV MOV ADR BL LDRB LDRB POP ADR B r6. [x0] x19.LC1 x0.132| |L0. .LC3 x19.164| |L0.172| |L0. [x0. 0 w1.[r4.

7. $v0.d mfc1 mfc1 jalr la lw lbu $gp.s lwc1 lwc1 lwc1 cvt. EXERCISES .LC3: . $f0 (printf & 0xFFFF)($gp) ($LC0 >> 16) # "%f\n" $f2. (__gnu_local_gp >> 16) -0x20 (__gnu_local_gp & 0xFFFF) 0x20+var_4($sp) 0x20+var_8($sp) 0x20+var_10($sp) 0($a0) $zero 0x3E9 loc_C8 $a0 4($a0) $zero 0xB loc_AC (dword_134 >> 16) $LC1 8($a0) ($LC2 >> 16) 0x14($a0) $f4. $f1. $ra. $f2 dword_134 0x10($a0) $LC2 $f2 $f4. $f0 $f5 $f4 ($LC0 & 0xFFFF) # "%f\n" 0x20+var_10($sp) 0x19($s0) 382 . $sp. $t9. . $f0. STRUCTURES 21. [sp. $gp. $f5. $v0.LC1: . $a2. x30. $f4. $a0.LC0: .16] x0.d lw lui add.5 (MIPS) (IDA) f: var_10 var_8 var_4 = -0x10 = -8 = -4 lui addiu la sw sw sw lw or slti bnez move lw or sltiu bnez lui lwc1 lwc1 lui lwc1 mul.size x19.word 0 1079951360 .word 1138622464 . $f4.string "error #2" . . $f2. :lo12:. :lo12:.LC5: . x0. $v0. $v0.33: Optimizing GCC 4.string "%c. %d\n" . x0.d.4. $f4. $f2. $a3. $gp.LC4 x29. $f0. $f2. 32 x0.L2: .LC4 puts ldr adrp ldp add b . $gp.string "%f\n" . $v0.16] x0. [sp].word .LC2: .LC4: . $at. [sp]. 32 x0. $v0.string "error #1" Listing 21. .LC5 puts f.s mul.-f . $v0. $a2. $v0.L3: ldr adrp ldp add b x19. $s0.CHAPTER 21.LC5 x29. $at. $t9 $a0. x30. [sp. $s0.

12 on page 958.data # . $s0. lui lw lw lw la jr addiu $a0.7. $ra. $a0.word 0x405EC000 . EXERCISES lb lui lw lw lw la jr addiu $a1. %d\n" (printf & 0xFFFF)($gp) 0x20+var_4($sp) 0x20+var_8($sp) ($LC3 & 0xFFFF) # "%c. loc_AC: loc_C8: 0x18($s0) ($LC3 >> 16) # "%c.1.ascii . $a0.data # .cst4 .CHAPTER 21.rodata. lui lw lw lw la jr addiu $a0. 383 .ascii "%f\n"<0> "%c. $a0. %d\n"<0> "error #2"<0> "error #1"<0> $LC1: .rodata. STRUCTURES 21. $t9. $ra. $t9. $a0. $t9 $sp. %d\n" 0x20 # CODE XREF: f+38 ($LC4 >> 16) # "error #2" (puts & 0xFFFF)($gp) 0x20+var_4($sp) 0x20+var_8($sp) ($LC4 & 0xFFFF) # "error #2" 0x20 # CODE XREF: f+24 ($LC5 >> 16) # "error #1" (puts & 0xFFFF)($gp) 0x20+var_4($sp) 0x20+var_8($sp) ($LC5 & 0xFFFF) # "error #1" 0x20 $LC0: $LC3: $LC4: $LC5: . $ra. $t9 $sp. $s0.ascii .word 0x43DE0000 $LC2: dword_134: . $t9.word 0 Answer: G. $s0.ascii .cst8 . $t9 $sp.

We just need to store random bits in all significand bits to get a random float number! The exponent cannot be zero (the number is denormalized in this case). Then we can transform this value to float and then divide it by RAND_MAX (0xFFFFFFFF in our case) — we getting a value in the 0.com/17308 384 . uint32_t my_rand() { RNG_state=RNG_state*RNG_a+RNG_c. // global variable void my_srand(uint32_t i) { RNG_state=i. }. float f. Then we filling the significand with random bits. const uint32_t RNG_c=1013904223. So this code in compiled form is omitted. UNIONS Chapter 22 Unions 22. Here we represent the float type as an union —it is the C/C++ construction that enables us to interpret a piece of memory as different types. // FPU PRNG definitions and routines: union uint32_t_float { uint32_t i. we are able to create a variable of type union and then access to it as it is float or as it is uint32_t. In our case.CHAPTER 22. }.. #include <stdio. Can we get rid of the division? Let’s recall what a floating point number consists of: sign bit. The integer PRNG code is the same as we already considered: 20 on page 344.h> #include <time. we would like to issue as few FPU operations as possible. But as we know. return RNG_state. uint32_t RNG_state. The PRNG is initialized with the current time in UNIX timestamp format. it is just a hack. float float_rand() { 1 the idea was taken from: http://go. so we must also subtract 1. A very simple linear congruential random numbers generator is used in my example1 .yurichev. It can be said. The generated numbers is to be between 1 and 2. data and routines: // constants from the Numerical Recipes book const uint32_t RNG_a=1664525. set the sign bit to 0 (which means a positive number) and voilà. }.1 interval. significand bits and exponent bits. A dirty one.h> #include <stdint.1 Pseudo-random number generator example If we need float random numbers between 0 and 1. It produces random 32-bit values in DWORD form. the simplest thing is to use a PRNG like the Mersenne twister. division is slow. it produces 32-bit numbers. Also. so we are storing 01111111 to exponent —this means that the exponent is 1.h> // integer PRNG definitions.

1. float_rand()). }. i++) printf ("%f\n".1 x86 Listing 22.i=my_rand() & 0x007fffff | 0x3F800000. eax esi 0 385 . 1065353216 . EAX=pseudorandom value and eax. UNIONS 22. eax . 00H __real@3ff0000000000000 DQ 03ff0000000000000r . // test int main() { my_srand(time(NULL)).1. 1 tv130 = -4 _tmp$ = -4 ?float_rand@@YAMXZ PROC push ecx call ?my_rand@@YAIXZ . // PRNG initialization for (int i=0. }. EAX=pseudorandom value & 0x007fffff | 0x3f800000 . 4 esi. / pop ecx ret 0 ?float_rand@@YAMXZ ENDP _main PROC push xor call push call add mov $LL3@main: call sub fstp push call add dec jne xor pop ret _main ENDP esi eax. subtract 1. store it into local stack: mov DWORD PTR _tmp$[esp+4]. 100 ?float_rand@@YAMXZ esp.1: Optimizing MSVC 2010 $SG4238 DB '%f'. i<100. 12 esi SHORT $LL3@main eax. return 0. reload it as float point number: fld DWORD PTR _tmp$[esp+4] . 007fffffH or eax.CHAPTER 22. eax _time eax ?my_srand@@YAXI@Z esp. 8 QWORD PTR [esp] OFFSET $SG4238 _printf esp. PSEUDO-RANDOM NUMBER GENERATOR EXAMPLE union uint32_t_float tmp. store value we got into local stack and reload it: fstp DWORD PTR tv130[esp+4] . return tmp.f-1. 22. \ these instructions are redundant fld DWORD PTR tv130[esp+4] . tmp. 0aH. 3f800000H .0: fsub QWORD PTR __real@3ff0000000000000 . 8388607 .

. we will talk about it later: 51. branch delay slot. it uses the SIMD instructions for the FPU. $a0 $v1=pseudorandom value & 0x7FFFFF | 0x3F800000 I still dont get matter of the following instruction:' lui $v0.0. 0x7FFFFF $v1=0x7FFFFF and $v1. .CHAPTER 22. $zero . 0x64 # 'd' . (time & 0xFFFF)($gp) $at.5 on page 444. (__gnu_local_gp >> 16) $sp. = -0x10 = -4 lui $gp. $f2 lw $ra. NOP $v0=32-bit pseudorandom value li $v1. read more about it here: 27. If we compile this in MSVC 2012. $zero my_srand $s1. branch delay slot main: var_18 var_10 var_C var_8 var_4 = = = = = -0x18 -0x10 -0xC -8 -4 lui addiu la sw sw sw sw sw lw or jalr move lui move la move jal li $gp. NOP $t9 $a0.2: Optimizing GCC 4. .1. . 0x20+var_4($sp) subtract 1. $v1 $v1=pseudorandom value & 0x7FFFFF lui $a0. $zero .2 MIPS Listing 22. (__gnu_local_gp & 0xFFFF) sw $ra. $zero . .1.0 into $f0: lwc1 $f0. ($LC0 >> 16) load 1.1. leave result in $f0: sub. branch delay slot 386 . 0x28+var_C($sp) $s0. UNIONS 22. 0x20 . -0x28 $gp. $v0 $s2. . 0x28+var_4($sp) $s2.5 float_rand: var_10 var_4 . $f2. 0x28+var_18($sp) $t9. 0x20+var_10($sp) call my_rand(): jal my_rand or $at. 22. 0x20+var_4($sp) sw $gp. $v0. 0x28+var_8($sp) $s1. (__gnu_local_gp >> 16) addiu $sp. $f0 jr $ra addiu $sp. .4. PSEUDO-RANDOM NUMBER GENERATOR EXAMPLE Function names are so strange here because I compiled this example as C++ and this is name mangling in C++. 0x3F80 $a0=0x3F800000 or $v1. $LC0 move value from $v1 to coprocessor 1 (into register $f2) it behaves like bitwise copy. . no conversion done: mtc1 $v1. ($LC1 & 0xFFFF) # "%f\n" $s0. (__gnu_local_gp & 0xFFFF) $ra.s $f0. . load delay slot.1 on page 541. ($LC1 >> 16) # "%f\n" $a0. 0x28+var_10($sp) $gp. branch delay slot $s2. -0x20 la $gp. .

0 S0=1. $zero lw $ra.0 There is also an useless LUI instruction added for some weird reason. #0x64 . R0.0 and leave result in S0: FSUBS S0. R3. $s1. D7 BL printf SUBS R4. #0x3F800000 R3=pseudorandom value & 0x007fffff | 0x3f800000 copy from R3 to FPU (register S15).PC} flt_5C DCFS 1.1. 1 lw $gp. =1. .6 on ARM (ARM mode) Listing 22. $f2 mfc1 $a2. 0x28+var_10($sp) jr $ra addiu $sp. PSEUDO-RANDOM NUMBER GENERATOR EXAMPLE loc_104: jal float_rand addiu $s0. $f3 jalr $t9 move $a0. 0x28 . convert value we got from float_rand() to double type (printf() need it): cvt. bitwise copy from D7 into R2/R3 pair of registers (for printf()): FMRRD R2. $s2 bne $s0. R4. (printf & 0xFFFF)($gp) mfc1 $a3. #1 BNE loc_78 MOV R0.6. S0 . UNIONS 22. 0x28+var_8($sp) lw $s1. R3.3 (IDA) float_rand .ascii "%f\n"<0> .s $f2.d.3: Optimizing GCC 4. #0x800000 ORR R3. branch delay slot $LC1: $LC0: . S0 LDMFD SP!. . 22. R3 subtract 1.1. loc_104 move $v0. #0xFF000000 BIC R3.3 We considered this artifact earlier: 17.5. 0x28+var_4($sp) lw $s2.float 1. S15. $f0 lw $t9. R4 387 . 0x28+var_C($sp) lw $s0.LR} BL my_rand R0=pseudorandom value FLDS S0. S0=pseudorandom value LDR R0.CHAPTER 22. =aF . 0x28+var_18($sp) .0 BIC R3. 'd' loc_78 BL float_rand . {R3. "%f" . R3. . no conversion done: FMSR S15. it behaves like bitwise copy.LR} R0.0 main STMFD MOV BL BL MOV SP!. #0 time my_srand R4. {R4. page 219. convert float type value into double type value (printf() will need it): FCVTDS D7. STMFD SP!. {R3. . .

}. #20] . UNIONS 22. calculate_machine_epsilon(1. #100 bl 38 <main+0x38> ldr r0. r4 pop {r4. CALCULATING MACHINE EPSILON LDMFD aF SP!. 38 <main+0x38> The instructions at 5c in float_rand() and at 38 in main() are random noise. Apparently. #24] vcvt.f=start. 388 .i++. d7 bl 0 <printf> subs r4. #1065353216 . the smaller the machine epsilon.CHAPTER 22. #1 bne 14 <main+0x14> mov r0. } void main() { printf ("%g\n". v. 0x3f800000 vmov s15. r0 . how easy it’s to calculate the machine epsilon: #include <stdio.h> union uint_float { uint32_t i. {R4. s0 vmov r2.4: Optimizing GCC 4. r0. pc} svccc 0x00800000 00000000 <main>: 0: e92d4010 4: e3a00000 8: ebfffffe c: ebfffffe 10: e3a04064 14: ebfffffe 18: e59f0018 1c: eeb77ac0 20: ec532b17 24: ebfffffe 28: e2544001 2c: 1afffff8 30: e1a00004 34: e8bd8010 38: 00000000 push {r4. 5c <float_rand+0x24> bic r3. lr} bl 10 <my_rand> vldr s0.22e − 16 for double. Listing 22.234567)).0xA.2.3 (objdump) 00000038 <float_rand>: 38: e92d4008 3c: ebfffffe 40: ed9f0a05 44: e3c034ff 48: e3c33502 4c: e38335fe 50: ee073a90 54: ee370ac0 58: e8bd8008 5c: 3f800000 push {r3. IDA and binutils developers used different manuals? I suppose.f32 s0. it would be good to know both instruction name variants. s0 pop {r3. v.2 Calculating machine epsilon The machine epsilon is the smallest possible value the FPU can work with. [pc. r3 vsub. [pc. 0x800000 orr r3. 0xff000000 bic r3. r3. r3. 22. s15.f64. }. It is 2−23 = 1.6. lr} mov r0. return v. r0. r3. The more bits allocated for floating point number. 0x64 . #-16777216 . float calculate_machine_epsilon(float start) { union uint_float v.f32 d7.PC} DCB "%f".19e − 07 for float and 2−52 = 2. r4. float f.0 I also made a dump in objdump and I saw that the FPU instructions have different names than in IDA. It’s interesting.h> #include <stdint.f-start. pc} andeq r0. #8388608 . #0 bl 0 <time> bl 0 <main> mov r4.

The union serves here as a way to access IEEE 754 number as a regular integer. what difference one bit reflects in the single precision (float). this instruction is redundant inc DWORD PTR _v$[esp-4] fsubr DWORD PTR _v$[esp-4] fstp DWORD PTR tv130[esp-4] . The resulting floating number is equal to starting_value + machine_epsilon.234567)). d0 add x0. Listing 22. ARM64 has no instruction that can add a number to a FPU D-register.5: Optimizing MSVC 2010 tv130 = 8 _v$ = 8 _start$ = 8 _calculate_machine_epsilon PROC fld DWORD PTR _start$[esp-4] fst DWORD PTR _v$[esp-4] .i++. } void main() { printf ("%g\n". needless to say. }. and then subtraction occurs.6: Optimizing GCC 4. incremented. d1. v. double d. UNIONS 22. 1 fmov d1.2. 389 . CALCULATING MACHINE EPSILON What we do here is just treat the fraction part of the IEEE 754 number as integer and add 1 to it. . 22.4 on page 443. x0. 22. . however. FSUBR does the rest of job and the resulting value is stored in ST0.d=start.h> typedef union { uint64_t i. return v. overflow is possible. \ this instruction pair is also redundant fld DWORD PTR tv130[esp-4] . / ret 0 _calculate_machine_epsilon ENDP The second FST instruction is redundant: there is no need to store the input value in the same place (the compiler decided to allocate the v variable at the same point in the local stack as the input argument). Then it is loaded into the FPU as a 32-bit IEEE 754 number.d-start. so we just need to subtract the starting value (using floating point arithmetic) to measure. Then it is incremented with INC. copied to FPU register D1.2.2 ARM64 Let’s extend our example to 64-bit: #include <stdio.1 x86 Listing 22.2.CHAPTER 22. . so the input value (that came in D0) is first copied into GPR. but the compiler didn’t optimize it out. as it is a normal integer variable. calculate_machine_epsilon(1. The last FSTP/FLD instruction pair is redundant.9 ARM64 calculate_machine_epsilon: fmov x0. Adding 1 to it in fact adds 1 to the fraction part of the number. d0 ret .h> #include <stdint. which will add another 1 to the exponent part. x0 fsub d0. } uint_double. double calculate_machine_epsilon(double start) { uint_double v. load input value of double type into X0 X0++ move it to FPU register subtract See also this example compiled for x64 with SIMD instructions: 27. v.

2.7: Optimizing GCC 4. 390 . mtc1 $v1.5 (IDA) calculate_machine_epsilon: mfc1 $v0.3 22. Listing 22.CHAPTER 22. CALCULATING MACHINE EPSILON MIPS The new instruction here is MTC1 (“Move To Coprocessor 1”).2.4 $f12 $zero . UNIONS 22. this example serves well for explaining the IEEE 754 format and unions in C/C++.4. 1 $f2 $f2. NOP $v0. 22.2. or $at. branch delay slot Conclusion It’s hard to say whether someone may need this trickery in real-world code. it just transfers data from GPR to the FPU’s registers. but as I write many times in this book. $f12 . jr $ra sub. addiu $v1.s $f0.

yurichev. They are often used as callbacks 1 . char* argv[]) 1 wikipedia 2 wikipedia 3 http://go. Well-known examples are: • qsort()2 . const int *b=(const int *)_b. else return 1. The functions is able to sort anything. if (*a==*b) return 0. shortcut has a corresponding function to call if a specific key is pressed: GitHub As we can see. any type of data. const void * _b) { const int *a=(const int *)_a. } int main(int argc. • lots of places in the Linux kernel. pthread_create() (POSIX). • thread starting: CreateThread() (win32). as any other pointer.h> int comp(const void * _a. Each So. is just the address of the function’s start in its code segment. the qsort() function is an implementation of quicksort in the C/C++ standard library. for example the filesystem driver functions are called via callbacks: http://go.CHAPTER 23.com/17076 • The GCC plugin functions are also called via callbacks: http://go. else if (*a < *b) return -1. atexit()3 from the standard C library. such table is easier to handle than a large switch() statement.h> #include <stdlib. POINTERS TO FUNCTIONS Chapter 23 Pointers to functions A pointer to a function. like EnumChildWindows()5 . • lots of win32 functions. The comparison function can be defined as: int (*compare)(const void *.com/17077 • Another example of function pointers is a table in the “dwm” Linux window manager that defines shortcuts. yurichev.com/17073 4 wikipedia 5 MSDN 391 . as long as you have a function to compare these two elements and qsort() is able to call it.yurichev. const void *) Let’s use a slightly modified example I found here: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 /* ex3 Sorting ints with qsort */ #include <stdio. • *NIX OS signals4 .

5 DWORD PTR _numbers$[esp+84]. located in MSVCR80. size = 4 . . . the address of label _comp is passed. -12345 DWORD PTR _numbers$[esp+88]. for (i=0. -98 DWORD PTR _numbers$[esp+76]. the address of the very first instruction of that function. 16 . MSVC { int numbers[10]={1892.200.numbers[ i ]) .5. which is just a place where comp() is located. return 0.4087. -100000 _qsort esp. 0000000aH .88. . ecx dl eax.i<9.10.-100000}. size = 40 . 88 DWORD PTR _numbers$[esp+96]. Nothing surprising so far. 1892 DWORD PTR _numbers$[esp+64]. How does qsort() call it? Let’s take a look at this function. 00000028H . DWORD PTR __a$[esp-4] ecx.45. 00000764H 0000002dH 000000c8H ffffff9eH 00000ff7H . .1: Optimizing MSVC 2010: /GS.CHAPTER 23. 00000010H . } 23. DWORD PTR [ecx] eax. eax 0 edx. DWORD PTR __b$[esp-4] eax. POINTERS TO FUNCTIONS 21 22 23 24 25 26 27 28 29 30 23. 4087 DWORD PTR _numbers$[esp+80].1 MSVC Let’s compile it in MSVC 2010 (I omitted some parts for the sake of brevity) with /Ox option: Listing 23. .-98.. edx eax.1. 200 DWORD PTR _numbers$[esp+72]. As a fourth argument.DLL (a MSVC DLL module with C standard library functions): 392 . /* Sort the array */ qsort(numbers. size = 4 eax.1087. in other words. ffffcfc7H 0000043fH 00000058H fffe7960H .-12345. 45 DWORD PTR _numbers$[esp+68]..i++) printf("Number = %d\n". size = 4 esp. DWORD PTR _numbers$[esp+52] 10 eax DWORD PTR _numbers$[esp+60]. . or. . 40 esi OFFSET _comp 4 eax. size = 4 . int i. DWORD PTR [eax] ecx. 1087 DWORD PTR _numbers$[esp+92].sizeof(int). DWORD PTR [edx+edx-1] 0 . ecx SHORT $LN4@comp eax.comp) ./MD __a$ = 8 __b$ = 12 _comp PROC mov mov mov mov cmp jne xor ret $LN4@comp: xor cmp setge lea ret _comp ENDP _numbers$ = -40 _argc$ = 8 _argv$ = 12 _main PROC sub push push push lea push push mov mov mov mov mov mov mov mov mov mov call add .

100h . The second reason is that the callback function types must comply strictly. Here the control gets passed to the address in the comp argument.text:7816CBF0 lo = dword ptr -104h .text:7816CBF0 var_FC = dword ptr -0FCh .text:7816CCE0 .text:7816CBF0 num = dword ptr 8 .text:7816CBF0 histk = dword ptr -7Ch . That’s why it is dangerous to use pointers to functions. 8 eax. int (__cdecl *)(const ⤦ Ç void *.text:7816CBF0 width = dword ptr 0Ch .text:7816CBF0 lostk = dword ptr -0F4h .text:7816CCE5 . CODE XREF: _qsort+B1 eax.text:7816CBF0 _qsort proc near . Its result is checked after its execution. qsort() may pass control flow to an incorrect point. the process may crash and this bug will be hard to find. ebp eax.text:7816CCF2 . the crashing of the process is not a problem here —the problem is how to determine the reason for the crash —because the compiler may be silent about the potential problems while compiling.text:7816CBF0 stkptr = dword ptr -0F8h .DLL . if you call qsort() with an incorrect function pointer.text:7816CCF5 .1.text:7816CCE0 loc_7816CCE0: .CHAPTER 23. const void *)) . however. eax edi ebx [esp+118h+comp] esp. 1 eax.2: MSVCR80. . POINTERS TO FUNCTIONS 23.text:7816CCEB .text:7816CCE9 .text:7816CBF0 .text:7816CBF0 . two arguments are prepared for comp().. First of all. MSVC Listing 23.text:7816CCEA .text:7816CBF0 base = dword ptr 4 ..text:7816CCF7 shr imul add mov push push call add test jle . eax short loc_7816CD04 comp— is the fourth function argument.text:7816CBF0 hi = dword ptr -100h .text:7816CCE7 . ebx edi.text:7816CBF0 . void __cdecl qsort(void *..text:7816CBF0 comp = dword ptr 10h . unsigned int. 393 . unsigned int.text:7816CBF0 sub esp.text:7816CCE2 . calling the wrong function with wrong arguments of wrong types may lead to serious problems. Before it.text:7816CBF0 public _qsort .

394 .CHAPTER 23. We can see how the values are compared at the first comp() call: Figure 23. where the qsort() function is (located in MSVCR100.1.1 23. for convenience.1. MSVC MSVC + OllyDbg Let’s load our example into OllyDbg and set a breakpoint on comp().1: OllyDbg: first call of comp() OllyDbg shows the compared values in the window under the code window.DLL). POINTERS TO FUNCTIONS 23. We can also see that the SP points to RA .

MSVC By tracing (F8) until the RETN instruction and pressing F8 one more time. POINTERS TO FUNCTIONS 23.CHAPTER 23. 395 .1.2: OllyDbg: the code in qsort() right after comp() call That was a call to the comparison function. we return to the qsort() function: Figure 23.

These 10 numbers are being sorted: 1892. I got the address of the first CMP instruction in comp().CHAPTER 23.1.exe!0x0040100C Now I get some information about the registers at the breakpoint: PID=4336|New process 17_1.2 MSVC + tracer Let’s also see which pairs are compared. -100000.exe -l:17_1. it is 0x0040100C and I’ve set a breakpoint on it: tracer. 5. MSVC Here is also a screenshot of the moment of the second call of comp()— now values that have to be compared are different: Figure 23.. 200.. 1087.exe!0x40100c EAX=0x00000764 EBX=0x0051f7c8 ESI=0x0051f7d8 EDI=0x0051f7b4 EIP=0x0028100c FLAGS=IF (0) 17_1.exe bpx=17_1. 4087. -12345.exe!0x40100c EAX=0x00000005 EBX=0x0051f7c8 ESI=0x0051f7d8 EDI=0x0051f7b4 EIP=0x0028100c FLAGS=PF ZF IF (0) 17_1.exe!0x40100c EAX=0x00000764 EBX=0x0051f7c8 ESI=0x0051f7d8 EDI=0x0051f7b4 EIP=0x0028100c FLAGS=CF PF ZF IF . POINTERS TO FUNCTIONS 23. ECX=0x00000005 EDX=0x00000000 EBP=0x0051f794 ESP=0x0051f67c ECX=0xfffe7960 EDX=0x00000000 EBP=0x0051f794 ESP=0x0051f67c ECX=0x00000005 EDX=0x00000000 EBP=0x0051f794 ESP=0x0051f67c I’ve filtered out EAX and ECX and gotten: EAX=0x00000764 EAX=0x00000005 EAX=0x00000764 EAX=0x0000002d EAX=0x00000058 EAX=0x0000043f EAX=0xffffcfc7 EAX=0x000000c8 EAX=0xffffff9e EAX=0x00000ff7 EAX=0x00000ff7 EAX=0xffffff9e EAX=0xffffff9e EAX=0xffffcfc7 EAX=0x00000005 ECX=0x00000005 ECX=0xfffe7960 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0x00000005 ECX=0xfffe7960 ECX=0xffffcfc7 396 . -98.1. 45. 88.3: OllyDbg: second call of comp() 23.exe (0) 17_1.

the quick sort algorithm needs 34 comparison operations to sort these 10 numbers. 397 . POINTERS TO FUNCTIONS EAX=0xffffff9e EAX=0xffffcfc7 EAX=0xffffff9e EAX=0xffffcfc7 EAX=0x000000c8 EAX=0x0000002d EAX=0x0000043f EAX=0x00000058 EAX=0x00000764 EAX=0x000000c8 EAX=0x0000002d EAX=0x0000043f EAX=0x00000058 EAX=0x000000c8 EAX=0x0000002d EAX=0x0000043f EAX=0x000000c8 EAX=0x0000002d EAX=0x0000002d 23.CHAPTER 23.1. Therefore. MSVC ECX=0x00000005 ECX=0xfffe7960 ECX=0xffffcfc7 ECX=0xfffe7960 ECX=0x00000ff7 ECX=0x00000ff7 ECX=0x00000ff7 ECX=0x00000ff7 ECX=0x00000ff7 ECX=0x00000764 ECX=0x00000764 ECX=0x00000764 ECX=0x00000764 ECX=0x00000058 ECX=0x000000c8 ECX=0x000000c8 ECX=0x00000058 ECX=0x000000c8 ECX=0x00000058 That’s 34 pairs.

58h [esp+40h+var_4].1. 43Fh [esp+40h+var_8]. [esp+40h+var_28] [esp+40h+var_40]. 0Ah _qsort comp() function: comp public comp proc near 398 . 764h [esp+40h+var_24].4: tracer and IDA.3: GCC lea mov mov mov mov mov mov mov mov mov mov mov mov mov mov call eax.trace:cc We get an .B. 2Dh [esp+40h+var_20].CHAPTER 23. 0FF7h [esp+40h+var_14]. N. 0FFFFFF9Eh [esp+40h+var_18]. because there no equal elements in the array. GCC MSVC + tracer (code coverage) We can also use the tracer’s feature to collect all possible register values and show them in IDA. as 32-bit values are stored in the array. Let’s trace all instructions in comp(): tracer.exe!0x00401000. but the step between them is 4.exe -l:17_1. 0FFFFCFC7h [esp+40h+var_C]. 5 [esp+40h+var_10]. 0FFFE7960h [esp+40h+var_34]. 23.exe bpf=17_1. We see that the instructions at 0x401010 and 0x401012 were never executed (so they left as white): indeed.: some values are cut at right IDA gave the function a name (PtFuncCompare) — because IDA sees that the pointer to this function is passed to qsort(). comp() has never returned 0. POINTERS TO FUNCTIONS 23. offset comp [esp+40h+var_38].idc-script for loading into IDA and load it: Figure 23.2. 4 [esp+40h+var_3C]. 0C8h [esp+40h+var_1C]. We see that the a and b pointers are pointing to various places in the array. eax [esp+40h+var_28].3 23.2 GCC Not a big difference: Listing 23.

6./a. Reading symbols from /home/dennis/polygon/a. For bug reporting instructions. Inc. 23. esp eax. so we can set a breakpoint (b) on line number (11—the line where the first comparison occurs). al eax. _b=_b@entry=0xbffff0fc) at 17_1.. This GDB was configured as "i686-linux-gnu". POINTERS TO FUNCTIONS arg_0 arg_4 = dword ptr = dword ptr 23. where our defined function is called via a passed pointer: Listing 23.c -g dennis@ubuntuvm:~/polygon$ gdb .10.4: (file libc.c:11 Breakpoint 1 at 0x804845f: file 17_1. There is NO WARRANTY. edx short loc_8048458 ebp setnl movzx lea pop retn endp al eax. we have the C-source code of our example ( 23 on page 391).1) . please see: <http://www.text:0002DDF9 ../a. [ebp+arg_0] edx.org/licenses/gpl.text:0002DE04 .CHAPTER 23. eax [ecx]. line 11. glibc version—2.c:11 11 if (*a==*b) (gdb) p *a $1 = 1892 (gdb) p *b $2 = 45 6a concept like thunk function 399 . GCC 8 0Ch push mov mov mov mov xor cmp jnz pop retn ebp ebp.done.1-ubuntu Copyright (C) 2013 Free Software Foundation.out Breakpoint 1. [ebp+arg_4] ecx.text:0002DE00 .out.6.1 mov mov mov mov call edx. [ebp+arg_10] [esp+4]. [eax] eax.. Listing 23. License GPLv3+: GNU GPL version 3 or later <http://gnu. In turn. We also need to compile the example with debugging information included (-g). edx [ebp+arg_C] GCC + GDB (with source code) Obviously.so.. edi [esp+8].. so the table with addresses and corresponding line numbers is present.text:0002DDFD .5: GDB session dennis@ubuntuvm:~/polygon$ gcc 17_1. it is calling quicksort(). to the extent permitted by law.2.. esi [esp]. We can also print values using variable names (p): the debugging information also has tells us which register and/or local stack element contains which variable.c. (gdb) b 17_1. comp (_a=0xbffff0f8.gnu.org/software/gdb/bugs/>. We can also see the stack (bt) and find out that there is some intermediate function msort_with_tmp() used in Glibc.6 and it is in fact just a wrapper 6 for qsort_r(). Type "show copying" and "show warranty" for details. (gdb) run Starting program: /home/dennis/polygon/.html> This is free software: you are free to change and redistribute it.2. [eax+eax-1] ebp loc_8048458: comp The implementation of qsort() is located in libc.text:0002DDF6 .so.out GNU gdb (GDB) 7.

out GNU gdb (GDB) 7.esp 0x08048450 <+3>: sub esp. Reading symbols from /home/dennis/polygon/a.c:53 #4 0xb7e4273e in msort_with_tmp (n=5. s=s@entry=4. b=b@entry=0xbffff0f8..c:26 (gdb) 23. b=b@entry=0xbffff0f8.DWORD PTR [ebp-0x8] 0x08048477 <+42>: mov edx.DWORD PTR [ebp+0xc] 0x0804845c <+15>: mov DWORD PTR [ebp-0x4].1-ubuntu Copyright (C) 2013 Free Software Foundation.DWORD PTR [eax] 0x0804847e <+49>: cmp edx. This GDB was configured as "i686-linux-gnu".out. At each breakpoint.0x10 0x08048453 <+6>: mov eax.gnu. p=0xbffff07c) at msort. argv=0xbffff1c4) at 17_1..done. There is NO WARRANTY.c:45 #5 msort_with_tmp (p=p@entry=0xbffff07c.c:53 #6 0xb7e42cef in msort_with_tmp (n=10.c dennis@ubuntuvm:~/polygon$ gdb . n=n@entry=10. GCC (gdb) c Continuing. we are going to dump all register contents (info registers). POINTERS TO FUNCTIONS 23.org/software/gdb/bugs/>. to the extent permitted by law.c:45 #3 msort_with_tmp (p=p@entry=0xbffff07c.2 GCC + GDB (no source code) But often there is no source code at all. (gdb) set disassembly-flavor intel (gdb) disas comp Dump of assembler code for function comp: 0x0804844d <+0>: push ebp 0x0804844e <+1>: mov ebp.eax 0x08048480 <+51>: jge 0x8048489 <comp+60> 400 ./a.DWORD PTR [ebp-0x8] 0x08048462 <+21>: mov edx.DWORD PTR [ebp+0x8] 0x08048456 <+9>: mov DWORD PTR [ebp-0x8]. p=0xbffff07c) at msort. cmp=cmp@entry=0x804844d <⤦ Ç comp>.. _b=_b@entry=0xbffff0fc) at 17_1. Type "show copying" and "show warranty" for details. find the very first CMP instruction and set a breakpoint (b) at that address.6.DWORD PTR [ebp-0x4] 0x08048467 <+26>: mov eax. The stack information is also available (bt). cmp=0x804844d <comp>) at msort.2.CHAPTER 23.2.c:11 #1 0xb7e42872 in msort_with_tmp (p=p@entry=0xbffff07c. _b=_b@entry=0xbffff108) at 17_1.DWORD PTR [eax] 0x08048469 <+28>: cmp edx. b=0xbffff0f8..c:297 #8 0xb7e42dcf in __GI_qsort (b=0xbffff0f8. n=n@entry=5) at msort.org/licenses/gpl. n=n@entry=2) at msort.c:307 #9 0x0804850d in main (argc=1. s=4. b=0xbffff0f8. b=b@entry=0xbffff0f8.eax 0x08048459 <+12>: mov eax. Listing 23.c:11 11 if (*a==*b) (gdb) p *a $3 = -98 (gdb) p *b $4 = 4087 (gdb) bt #0 comp (_a=0xbffff0f8.eax 0x0804845f <+18>: mov eax.DWORD PTR [eax] 0x08048464 <+23>: mov eax.c:45 #7 __GI_qsort_r (b=b@entry=0xbffff0f8.DWORD PTR [ebp-0x4] 0x0804847c <+47>: mov eax.eax 0x0804846b <+30>: jne 0x8048474 <comp+39> 0x0804846d <+32>: mov eax. comp (_a=0xbffff104.html> This is free software: you are free to change and redistribute it.c:65 #2 0xb7e4273e in msort_with_tmp (n=2. n=10. n=n@entry=10) at msort. Inc. but partially: there is no line number information for comp().0x0 0x08048472 <+37>: jmp 0x804848e <comp+65> 0x08048474 <+39>: mov eax. For bug reporting instructions.. p=0xbffff07c) at msort. Breakpoint 1. b=0xbffff0f8. arg=arg@entry=0x0) at msort. so we can disassemble the comp() function (disas). License GPLv3+: GNU GPL version 3 or later <http://gnu.DWORD PTR [eax] 0x08048479 <+44>: mov eax.. please see: <http://www.(no debugging symbols found).6: GDB session dennis@ubuntuvm:~/polygon$ gcc 17_1.

0x08048469 in comp () (gdb) info registers eax 0xff7 4087 ecx 0xbffff104 -1073745660 edx 0xffffff9e -98 ebx 0xb7fc0000 -1208221696 esp 0xbfffee58 0xbfffee58 ebp 0xbfffee68 0xbfffee68 esi 0xbffff108 -1073745656 edi 0xbffff010 -1073745904 eip 0x8048469 0x8048469 <comp+28> eflags 0x282 [ SF IF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51 (gdb) c Continuing. Breakpoint 1. Breakpoint 1. 0x08048469 in comp () (gdb) info registers eax 0xffffff9e -98 ecx 0xbffff100 -1073745664 edx 0xc8 200 ebx 0xb7fc0000 -1208221696 esp 0xbfffeeb8 0xbfffeeb8 ebp 0xbfffeec8 0xbfffeec8 esi 0xbffff104 -1073745660 edi 0xbffff010 -1073745904 eip 0x8048469 0x8048469 <comp+28> eflags 0x286 [ PF SF IF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 401 . POINTERS TO FUNCTIONS 23.CHAPTER 23.2. 0x08048469 in comp () (gdb) info registers eax 0x2d 45 ecx 0xbffff0f8 -1073745672 edx 0x764 1892 ebx 0xb7fc0000 -1208221696 esp 0xbfffeeb8 0xbfffeeb8 ebp 0xbfffeec8 0xbfffeec8 esi 0xbffff0fc -1073745668 edi 0xbffff010 -1073745904 eip 0x8048469 0x8048469 <comp+28> eflags 0x286 [ PF SF IF ] cs 0x73 115 ss 0x7b 123 ds 0x7b 123 es 0x7b 123 fs 0x0 0 gs 0x33 51 (gdb) c Continuing.0xffffffff 0x08048487 <+58>: jmp 0x804848e <comp+65> 0x08048489 <+60>: mov eax.0x1 0x0804848e <+65>: leave 0x0804848f <+66>: ret End of assembler dump. GCC 0x08048482 <+53>: mov eax. (gdb) b *0x08048469 Breakpoint 1 at 0x8048469 (gdb) run Starting program: /home/dennis/polygon/./a.out Breakpoint 1.

n=10.c:53 #4 0xb7e4273e in msort_with_tmp (n=5.2. n=n@entry=10. b=b@entry=0xbffff0f8. b=0xbffff0f8.c:45 #7 __GI_qsort_r (b=b@entry=0xbffff0f8.c:45 #5 msort_with_tmp (p=p@entry=0xbffff07c.CHAPTER 23.c:297 #8 0xb7e42dcf in __GI_qsort (b=0xbffff0f8. n=n@entry=2) at msort. b=0xbffff0f8. s=s@entry=4. cmp=cmp@entry=0x804844d <⤦ Ç comp>. cmp=0x804844d <comp>) at msort. n=n@entry=10) at msort.c:307 #9 0x0804850d in main () 402 . GCC gs 0x33 51 (gdb) bt #0 0x08048469 in comp () #1 0xb7e42872 in msort_with_tmp (p=p@entry=0xbffff07c. b=b@entry=0xbffff0f8. p=0xbffff07c) at msort. POINTERS TO FUNCTIONS 23.c:45 #3 msort_with_tmp (p=p@entry=0xbffff07c. p=0xbffff07c) at msort. p=0xbffff07c) at msort. s=4. n=n@entry=5) at msort. b=b@entry=0xbffff0f8. arg=arg@entry=0x0) at msort. b=0xbffff0f8.c:53 #6 0xb7e42cef in msort_with_tmp (n=10.c:65 #2 0xb7e4273e in msort_with_tmp (n=2.

305397760 $3.12| DCD 0x90abcdef DCD 0x12345678 |L0. 305419896 0 .4. 24.h> uint64_t f () { return 0x1234567890ABCDEF.1. 64-bit values are returned from functions in the EDX:EAX register pair.1 x86 In a 32-bit environment. GPR’s are 32-bit.5 (assembly listing) li li ori 1 By $3. 12345678H A 64-bit value is returned in the R0-R1 register pair (R1 is for the high part and R0 for the low part): Listing 24.16| lr |L0.12| r1. 32-bit values are passed as pairs in 16-bit environment in the same way: 53.1. so 64-bit values are stored and passed as 32-bit value pairs 1 .1 Returning of 64-bit value #include <stdint.3: Optimizing GCC 4. 90abcdefH .4 on page 594 403 .1.CHAPTER 24.1: Optimizing MSVC 2010 _f _f PROC mov mov ret ENDP 24.|L0. Listing 24. 24.2 ARM eax. 64-BIT VALUES IN 32-BIT ENVIRONMENT Chapter 24 64-bit values in 32-bit environment In a 32-bit environment.$3.16| 24. }.3 MIPS A 64-bit value is returned in the V0-V1 ($2-$3) register pair (V0 ($2) is for the high part and V1 ($3) for the low part): Listing 24.-1867841536 $2. -1867788817 edx.|L0.2: Optimizing Keil 6/2013 (ARM mode) ||f|| PROC LDR LDR BX ENDP r0.0xcdef # 0xffffffff90ab0000 # 0x12340000 the way.

5: Optimizing MSVC 2012 /Ob1 _a$ = 8 _b$ = 16 _f_add PROC mov add mov adc ret _f_add ENDP .0x5678 Listing 24. size = 8 . 23456789012345)). SUBTRACTION $31 $2. DWORD PTR _b$[esp-4] edx. '%I64d'.2 $v1. 64-BIT VALUES IN 32-BIT ENVIRONMENT j ori 24. uint64_t f_sub (uint64_t a.4: Optimizing GCC 4. DWORD PTR _a$[esp] 404 . 73ce2ff_subH call _f_add push edx push eax push OFFSET $SG1436 . #else printf ("%I64d\n".4. addition. }. edx. 75939f79H push 2874 .h> uint64_t f_add (uint64_t a.2.1 x86 Listing 24. $v1. ARGUMENTS PASSING. $ra $v0. eax. 0 DWORD DWORD DWORD DWORD PTR PTR PTR PTR _a$[esp-4] _b$[esp-4] _a$[esp] _b$[esp] _f_add_test PROC push 5461 . 00H call _printf add esp. f_add(12345678901234. #endif }. subtraction #include <stdint. uint64_t b) { return a+b. f_add(12345678901234.$2. 28 ret 0 _f_add_test ENDP _f_sub PROC mov sub mov eax. }. edx. 0x90AB 0x1234 0x90ABCDEF 0x12345678 Arguments passing. uint64_t b) { return a-b.2. 00001555H push 1972608889 . 24. $v0. void f_add_test () { #ifdef __GNUC__ printf ("%lld\n". 23456789012345)). 0aH.5 (IDA) lui lui li jr li 24. DWORD PTR _a$[esp-4] eax.CHAPTER 24. ADDITION. size = 8 eax. 00000b3aH push 1942892530 .

0x75939f79 r3.r2 r1.r0. 24. 5461 [esp].r2 r1. "%lld\12\0" _f_sub: mov mov sub sbb ret eax. DWORD DWORD DWORD DWORD esp. OFFSET FLAT:LC0 . 28 DWORD PTR DWORD PTR DWORD PTR DWORD PTR _f_add DWORD PTR DWORD PTR DWORD PTR _printf esp. the low 32-bit part are added first. .6: GCC 4. eax. 75939f79H 00001555H 73ce2ff_subH 00000b3aH [esp+4]. In addition. The first SUB may also turn on the CF flag. 0x00001555 405 . edx [esp].7: Optimizing Keil 6/2013 (ARM mode) f_add PROC ADDS ADC BX ENDP r0. edx. eax [esp+8].1 -O1 -fno-inline _f_add: mov mov add adc ret _f_add_test: sub mov mov mov mov call mov mov mov call add ret eax.lr} r2. ARGUMENTS PASSING.CHAPTER 24.r1. ADC instruction adds the high parts of the values. It is easy to see how the f_add() function result is then passed to printf(). edx. edx. the CF flag is set. 2874 . eax.r1. which is to be checked in the subsequent SBB instruction: if the carry flag is on.8. high part first. ADDITION. Listing 24. If carry was occurred while adding. DWORD PTR _b$[esp] 0 We can see in the f_add_test() function that each 64-bit value is passed using two 32-bit values. SUBTRACTION edx.|L0.72| . The following Subtraction also occurs in pairs. 64-BIT VALUES IN 32-BIT ENVIRONMENT _f_sub sbb ret ENDP 24.|L0. 28 PTR PTR PTR PTR [esp+12] [esp+16] [esp+4] [esp+8] [esp+8]. . and also adds 1 if CF = 1.r0. then 1 is also to be subtracted from the result.68| . edx.2.2 ARM Listing 24.r3 lr f_sub PROC SUBS SBC BX ENDP r0. 1972608889 [esp+12].2. DWORD DWORD DWORD DWORD PTR PTR PTR PTR [esp+4] [esp+8] [esp+12] [esp+16] GCC code is the same.r3 lr f_add_test PROC PUSH LDR LDR {r4. Addition and subtraction occur in pairs as well. . then low part. 1942892530 [esp+4].

76| |L0.|L0. set $v0 to 1 sltu $v0.84| . 64-BIT VALUES IN 32-BIT ENVIRONMENT LDR LDR BL POP MOV MOV ADR B ENDP r0. $a2. the second in R2 and R3 register pair.8: Optimizing GCC 4. Otherwise.high part of result $v1 . (__gnu_local_gp >> 16) 406 . $v1. $a3 - .3 MIPS Listing 24.5 (IDA) f_add: .4. and flags (esp. 24.2. high part of a low part of a high part of b low part of b subu $v1.low part of result f_add_test: var_10 var_4 = -0x10 = -4 lui $gp. ADDS and SUBS instructions with -S suffix are used.CHAPTER 24. ADDITION.2. . subtract low parts subu $v0. $a0 . .80| |L0. . 0x00000b3a f_add {r4. branch delay slot $v0 .high part of result $v1 .80| . $a1 . "%I64d\n" __2printf DCD 0x75939f79 DCD 0x00001555 DCD 0x73ce2ff2 DCD 0x00000b3a DCB "%I64d\n". $a2 . carry flag) is what consequent ADC/SBC instructions definitely need. 0x73ce2ff2 r1. $a3 - . $a3 jr $ra add 1 to high part of result if carry should be generated: addu $v0. $a2 .|L0.72| |L0. $a1 . ARM has the ADC instruction as well (which counts carry flag) and SBC (“subtract with carry”). .r0 r3. .|L0. high part of a low part of a high part of b low part of b addu $v1. ARGUMENTS PASSING. $a1 . $a0 . Important thing: when the low parts are added/subtracted.84| The first 64-bit value is passed in R0 and R1 register pair. branch delay slot $v0 . $a2 . $a0. The -S suffix stands for “set flags”. . $a1. subtract high parts will carry generated while subtracting low parts? if yes. $a0 . $a0 .lr} r2. $a3 . $v1 jr $ra subtract 1 from high part of result if carry should be generated: subu $v0. sum up low parts addu $a0. sum up high parts will carry generated while summing up low parts? if yes.0 24.68| |L0. set $a0 to 1 sltu $a1.76| . . SUBTRACTION |L0. $a1 . instructions without the -S suffix would do the job (ADD and SUB).r1 r0.low part of result f_sub: . $a3. .

}. 0x20+var_10($sp) $a1. ($LC0 & 0xFFFF) # "%lld\n" $a3. 0x20 . which sets the destination register to 1 or 0. DIVISION $sp. 0x73CE2FF2 $gp.3 Multiplication. (__gnu_local_gp & 0xFFFF) $ra. $v0 $t9 $sp. 0x1555 f_add $a1.h> uint64_t f_mul (uint64_t a. 0x73CE $a3. (printf & 0xFFFF)($gp) $ra. -0x20 $gp.ascii "%lld\n"<0> MIPS has no flags register. }.1 x86 Listing 24.CHAPTER 24. $v1 $a2. 24. size = 8 . size = 8 DWORD PTR _b$[esp] DWORD PTR _b$[esp] DWORD PTR _a$[esp+8] 407 . so there is no such information present after the execution of arithmetic operations. 0x20+var_4($sp) $gp. size = 8 . MULTIPLICATION. To know if the carry flag would be set. 24. 64-BIT VALUES IN 32-BIT ENVIRONMENT addiu la sw sw lui lui li li li jal li lw lui lw lw la move move jr addiu $LC0: 24. 0xB3A $a3. 0x7593 $a0.3. 0x20+var_10($sp) $a0. uint64_t f_div (uint64_t a. division #include <stdint. ($LC0 >> 16) # "%lld\n" $t9.9: Optimizing MSVC 2012 /Ob1 _a$ = 8 _b$ = 16 _f_mul PROC push push push push call ret _f_mul ENDP _a$ = 8 _b$ = 16 _f_div PROC push push push . }. size = 8 DWORD PTR _b$[esp] DWORD PTR _b$[esp] DWORD PTR _a$[esp+8] DWORD PTR _a$[esp+8] __allmul . uint64_t b) { return a/b. 0x75939F79 $a2.3. a comparison (using “SLTU” instruction) also occurs. So there are no instructions like x86’s ADC and SBB. uint64_t b) { return a*b. This 1 or 0 is then added or subtracted to/from the final result. uint64_t b) { return a % b. 0x20+var_4($sp) $a0. long long multiplication 0 . uint64_t f_rem (uint64_t a.

DWORD PTR [esp+44] DWORD PTR [esp+8]. DWORD PTR [esp+32] DWORD PTR [esp+12]. 28 eax. unsigned division esp. Listing 24. thinking it could be more efficient. unsigned long long division 0 . edx. eax eax. eax DWORD PTR [esp+4]. DWORD PTR [esp+36] DWORD PTR [esp]. edx ___umoddi3 . ecx. 24. 28 eax. size = 8 . ebx.10: Optimizing GCC 4. ebx sub mov mov mov mov mov mov mov mov call add ret esp.8. edx ecx. 64-BIT VALUES IN 32-BIT ENVIRONMENT _f_div push call ret ENDP _a$ = 8 _b$ = 16 _f_rem PROC push push push push call ret _f_rem ENDP 24. These functions are described here: E on page 950.3. but the multiplication code is inlined right in the function. 28 DWORD DWORD DWORD DWORD eax edx PTR PTR PTR PTR [esp+8] [esp+16] [esp+12] [esp+20] ebx ecx _f_div: _f_rem: GCC does the expected.1 -fno-inline _f_mul: push mov mov mov mov imul imul mul add add pop ret ebx edx. so usually the compiler embeds calls to a library functions doing that. edx ___udivdi3 . unsigned modulo esp. 28 sub mov mov mov mov mov mov mov mov call add ret esp. size = 8 DWORD PTR DWORD PTR DWORD PTR DWORD PTR __aullrem 0 _b$[esp] _b$[esp] _a$[esp+8] _a$[esp+8] . ecx. DIVISION DWORD PTR _a$[esp+8] __aulldiv . eax. DWORD PTR [esp+32] DWORD PTR [esp+12]. DWORD PTR [esp+44] DWORD PTR [esp+8]. DWORD PTR [esp+40] edx.3. edx edx. MULTIPLICATION. edx edx. GCC has different library function names: D on page 949. DWORD PTR [esp+40] edx. eax eax. unsigned long long remainder Multiplication and division are more complex operations. DWORD PTR [esp+36] DWORD PTR [esp].2 ARM Keil for thumb mode inserts library subroutine calls: 408 .CHAPTER 24. ebx. eax DWORD PTR [esp+4].

DIVISION Listing 24.lr} __aeabi_uldivmod r0. NOP . $a1 $zero $zero $a3 $a0 $zero $a1 . 64-BIT VALUES IN 32-BIT ENVIRONMENT 24.r0. $a0.lr} __aeabi_uldivmod r0.r1. $v0 $at.3. but has to call a library routine for 64-bit division: Listing 24. $a0 $v0.4.r2 r1.r0. $a2 $v1 $ra $v0.pc} ||f_rem|| PROC PUSH BL MOVS MOVS POP ENDP {r4. NOP .r2 r1. NOP $a2 f_div: 409 .3.lr} __aeabi_uldivmod {r4.lr} __aeabi_lmul {r4.13: Optimizing GCC 4.r3.12: Optimizing Keil 6/2013 (ARM mode) ||f_mul|| PROC PUSH UMULL MLA MLA MOV POP ENDP {r4.r2 r1.r2.r3 {r4.lr} r12.pc} ||f_rem|| PROC PUSH BL MOV MOV POP ENDP {r4.pc} ||f_div|| PROC PUSH BL POP ENDP {r4.11: Optimizing Keil 6/2013 (thumb mode) ||f_mul|| PROC PUSH BL POP ENDP {r4.r4.3 MIPS Optimizing GCC for MIPS can generate 64-bit multiplication code.r4 r1.lr} __aeabi_uldivmod {r4. $a3.CHAPTER 24.r1 r0.r3 {r4. $at.pc} Keil for ARM mode. on the other hand.pc} ||f_div|| PROC PUSH BL POP ENDP {r4. MULTIPLICATION. is able to produce 64-bit multiplication code: Listing 24.r12 {r4.pc} 24.5 (IDA) f_mul: mult mflo or or mult mflo addu or multu mfhi mflo jr addu $a2. $at.

$ra $sp. $ra. (__gnu_local_gp >> 16) -0x20 (__gnu_local_gp & 0xFFFF) 0x20+var_4($sp) 0x20+var_10($sp) (__umoddi3 & 0xFFFF)($gp) $zero $zero 0x20+var_4($sp) $zero 0x20 f_rem: var_10 var_4 = -0x10 = -4 lui addiu la sw sw lw or jalr or lw or jr addiu $zero 0x20+var_4($sp) $zero 0x20 There are a lot of NOPs. 0 DWORD PTR _a$[esp-4] DWORD PTR _a$[esp] edx. (__gnu_local_gp >> 16) -0x20 (__gnu_local_gp & 0xFFFF) 0x20+var_4($sp) 0x20+var_10($sp) (__udivdi3 & 0xFFFF)($gp) $zero $gp.4.4 Shifting right #include <stdint. $gp.15: Optimizing GCC 4.4. $t9 $at. $t9 $at. but I’m not completely sure. 7 7 Listing 24. SHIFTING RIGHT = -0x10 = -4 lui addiu la sw sw lw or jalr or lw or jr addiu $gp. after all).CHAPTER 24. 64-BIT VALUES IN 32-BIT ENVIRONMENT var_10 var_4 24. DWORD PTR [esp+4] 410 . $sp. $ra $sp. }. $t9.h> uint64_t f (uint64_t a) { return a>>7. $ra. 24. $gp. eax. $at. $gp. DWORD PTR [esp+8] eax. $ra. edx. $at. edx.1 -fno-inline _f: mov mov edx. $sp.8. $at. $ra. size = 8 eax. $t9. $gp.14: Optimizing MSVC 2012 /Ob1 _a$ = 8 _f PROC mov mov shrd shr ret _f ENDP . probably delay slots filled after the multiplication instruction (it’s slower than other instructions. $at. 24.1 x86 Listing 24.

r0.#7 r0. CONVERTING 32-BIT VALUE INTO 64-BIT ONE eax.#7 r0. But the lower part is shifted with the help of the SHRD instruction. }. but pulls new bits from EAX. so the Keil compiler ought to do this using simple shifts and OR operations: Listing 24.r1. $v1. i. 7 Shifting also occurs in two passes: first the lower part is shifted. 64-BIT VALUES IN 32-BIT ENVIRONMENT shrd shr ret 24. 25 $a1.e. from the higher part.4.CHAPTER 24.r0.5 $v0. The higher part is shifted using the more popular SHR instruction: indeed.17: Optimizing Keil 6/2013 (thumb mode) ||f|| PROC LSLS LSRS ORRS LSRS BX ENDP 24. then the higher part. 24.2 ARM ARM doesn’t have such instruction as SHRD in x86.r0. $a0.r0.5 (IDA) f: sll srl or jr srl 24. $v1. DWORD PTR _a$[esp-4] 0 411 . 7 $v0. it shifts the value of EDX by 7 bits. edx.h> int64_t f (int32_t a) { return a. 7 edx.r1.r1.19: Optimizing MSVC 2012 _a$ = 8 _f PROC mov cdq ret _f ENDP eax.1 x86 Listing 24.#7 lr MIPS GCC for MIPS follows the same algorithm as Keil does for thumb mode: Listing 24.r1.16: Optimizing Keil 6/2013 (ARM mode) ||f|| PROC LSR ORR LSR BX ENDP r0.5.3 r2.#7 lr Listing 24.#25 r0.LSL #25 r1.. 7 Converting 32-bit value into 64-bit one #include <stdint.4. the freed bits in the higher part must be filled with zeroes.18: Optimizing GCC 4.4. 24.5.r2 r1. $v1 $a0. $ra $v0.

21: Optimizing GCC 4.r0. Unsigned values are converted straightforwardly: all bits in the higher part must be set to 0. 24. Its operation is somewhat similar to the MOVSX instruction. But this is not appropriate for signed data types: the sign has to be copied into the higher part of the resulting number. 24.5 (IDA) f: sra jr move $v0. In other words. $a0.5.3 MIPS GCC for MIPS does the same as Keil did for ARM mode: Listing 24.r0.CHAPTER 24.5. In other words. and depending of it. extends it to 64-bit and leaves it in the EDX:EAX register pair. As we know.2 ARM Listing 24. So after “ASR r1.#31 lr Keil for ARM is different: it just arithmetically shifts right the input value by 31 bits. 31 $ra $v1. R1 containing 0xFFFFFFFF if the input value was negative and 0 otherwise.20: Optimizing Keil 6/2013 (ARM mode) ||f|| PROC ASR BX ENDP r1.#31”. sets all 32 bits in EDX to 0 or 1. it takes its input value in EAX. this code just copies the MSB (sign bit) from the input value in R0 to all bits of the high 32-bit part of the resulting 64-bit value. CDQ gets the number sign from EAX (by getting the most significant bit in EAX). the sign bit is MSB. $a0 412 .4. The CDQ instruction does that here. R1 contains the high part of the resulting 64-bit value. CONVERTING 32-BIT VALUE INTO 64-BIT ONE Here we also run into necessity to extend a 32-bit signed value into a 64-bit signed one. 64-BIT VALUES IN 32-BIT ENVIRONMENT 24.5. and the arithmetical shift copies the sign bit into the “emerged” bits.

with wires and AND/OR/NOT gates. variable of type unsigned int on x86 can hold up to 32 bits. so it is possible to store there intermediate results for 32 block-key pairs simultaneously.yurichev. then it is possible to load 4 elements from A into a 128-bit XMM-register. As its name implies. SIMD began as MMX in x86. By the way.com/17329 2 Wikipedia: 3 Remotely. i < 1024.1 Vectorization Vectorization2 is when. MMX-enabled CPU + old OS + process utilizing MMX features will still work.com/17313 25. encrypt the block and produces a 64-bit result. 4 16-bit values or 8 bytes. For example: for (i = 0. so there are addition instructions in SIMD which are not adding anything if the maximum value is reached. memory comparing (memcmp) and so on. Now about practical usage. For example. The loop body takes values from the input arrays. memory copy routines (memcpy). using 64+56 variables of type unsigned int. multiplies them and saves the result into C. When the user changes the brightness in the graphics editor. SIMD Chapter 25 SIMD SIMD is an acronym: Single Instruction. http://go. etc. Each MMX register can hold 2 32-bit values. then it is possible to change the brightness of 8 pixels simultaneously. Vectorization is not very fresh technology: the author of this textbook saw it at least on the Cray Y-MP supercomputer line from 1988 when he played with its “lite” version Cray Y-MP EL 3 . For the sake of brevity if we say that the image is grayscale and each pixel is defined by one 8-bit byte. but saving the FPU registers. When the user changes the brightness of the image. i++) { C[i] = A[i]*B[i]. vectorization It is installed in the museum of supercomputers: http://go. It is important that there is only a single operation applied to each element. from B to another XMM-registers.com/17081 413 . these registers were actually located in the FPU’s registers. now separate from the FPU. SSE—is extension of the SIMD registers to 128 bits. It was possible to use either FPU or MMX at the same time. Bitslice DES1 —is the idea of processing groups of blocks and keys simultaneously. When MMX appeared. One more example: the DES encryption algorithm takes a 64-bit block and a 56-bit key. 1 http://go.CHAPTER 25. to 256 bits. One simple example is a graphics editor that represents an image as a two dimensional array. Thus. Multiple Data. AVX—another extension. If each array element we have is 32-bit int.yurichev. 8 new 64-bit registers appeared: MM0-MM7. you have a loop taking couple of arrays for input and producing one array.yurichev. The DES algorithm may be considered as a very large electronic circuit. that CPU subsystem looks like a separate processor inside x86. I wrote an utility to brute-force Oracle RDBMS passwords/hashes (ones based on DES). the editor must add or subtract a coefficient to/from each pixel value. and by executing PMULLD ( Multiply Packed Signed Dword Integers and Store Low Result) and PMULHW ( Multiply Packed Signed Integers and Store High Result). Like the FPU. it is possible to add 8 8-bit values (bytes) simultaneously by adding two values in MMX registers. Let’s say. this is the reason why the saturation instructions are present in SIMD. for example. Vectorization is to process several elements simultaneously. Of course. One might think that Intel saved on transistors. it is possible to get 4 64-bit products at once. it processes multiple data using only one instruction. } This fragment of code takes elements from A and B. overflow and underflow is not desirable. does something and puts the result into the output array. but in fact the reason of such symbiosis was simpler —older OSes that are not aware of the additional CPU registers would not save them at the context switch. using slightly modified bitslice DES algorithm for SSE2 and AVX —now it is possible to encrypt 128 or 256 block-keys pairs simultaneously.

VECTORIZATION Thus.int *. e. eax ecx. int *.asm /Ox We got (in IDA): . 6 loc_143 eax. [esp+10h+ar2] esi. }. SIMD 25. of course.cpp /QaxSSE2 /Faintel. esi short loc_55 loc_36: . [esp+10h+ar2] esi.1 Addition example Some compilers can do vectorization automatically in simple cases. CODE cmp jnb mov sub lea cmp jb XREF: f(int.. return 0.1.1. loop body execution count is 1024/4 instead of 1024. 25. int *ar3) { for (int i=0. i++) ar3[i]=ar1[i]+ar2[i].CHAPTER 25.int *)+21 eax. I wrote this tiny function: int f (int sz. [esp+10h+sz] edx. [esp+10h+ar2] loc_143 esi.051 win32: icl intel. ecx loc_143 loc_55: . [esp+10h+ar1] short loc_67 esi. int *ar1.g. faster.int *. Intel C++4 . [esp+10h+ar3] edx. i<sz. int *ar2. CODE cmp jbe mov sub neg XREF: f(int. edx loc_15B eax. Intel C++ Let’s compile it with Intel C++ 11. int *. that is 4 times less and. eax ecx. ds:0[edx*4] esi ecx.1.int *)+34 eax. [esp+10h+ar1] esi. [esp+10h+ar2] short loc_36 esi. eax esi 4 More about Intel C++ automatic vectorization: Excerpt: Effective Automatic Vectorization 414 . int *) public ?f@@YAHHPAH00@Z ?f@@YAHHPAH00@Z proc near var_10 sz ar1 ar2 ar3 = = = = = dword dword dword dword dword push push push push mov test jle mov cmp jle cmp jbe mov sub lea neg cmp jbe ptr -10h ptr 4 ptr 8 ptr 0Ch ptr 10h edi esi ebx esi edx.int *.int *. ds:0[edx*4] esi. int __cdecl f(int.

edx ecx. edi ecx. 0Fh short loc_109 .int *)+84 ecx. xmm1 . [esp+10h+ar2] loc_ED: . yes! ebx. [esp+10h+ar1] esi. CODE mov add mov inc cmp jb mov mov XREF: f(int.int *)+65 . esi short loc_7F loc_67: . xmm0 xmmword ptr [eax+edi*4]. 0Fh short loc_9A edi. edi short loc_D6 ebx. [esp+10h+ar1] loc_143 esi. CODE cmp jnb mov sub cmp jb XREF: f(int. CODE movdqu movdqu paddd movdqa add cmp jb jmp XREF: f(int. CODE XREF: f(int.int *. 3 ecx ecx. yes loc_109: . 2 loc_9A: .int *)+125 415 . [esp+10h+ar2] esi. [esp+10h+sz] loc_D6: . 3 loc_162 edi edi.int *)+CD edx.int *.int *)+B2 esi.int *. edi short loc_C1 ecx. edx edi.int *. [esp+10h+var_10] edx.int *. eax edi.int *.int *. is ar2+i*4 16-byte aligned? esi. ar3+i*4 edi.int *. ecx ecx. so load it to XMM0 xmm1. [edi+4] edx. CODE mov and jz test jnz neg add shr XREF: f(int. ecx short loc_ED short loc_127 *.int *. [esp+10h+ar1] mov esi.int *)+105 xmm1. xmmword ptr [esi+edi*4] . edx esi esi. is ar1 16-byte aligned? . VECTORIZATION ecx. xmmword ptr [ebx+edi*4] . CODE mov lea test jz mov mov XREF: f(int. ar2+i*4 is not 16-byte aligned.CHAPTER 25. [esp+10h+ar2] [esp+10h+var_10].int *.int *.int *)+59 eax.1.int *.int *)+E3 mov ebx.int edi. edi = ar1 . ar1+i*4 xmm0. CODE XREF: f(int. 10h edi. [esi+edi*4] . SIMD cmp jbe 25.int *.int *.int *. ecx loc_143 loc_7F: . [esp+10h+ar1] esi. [esp+10h+ar2] loc_111: . [ecx+esi*4] edx. CODE lea cmp jl mov sub and neg add test jbe mov mov mov xor XREF: f(int. ecx loc_162 ecx. eax esi. [ebx+esi*4] [eax+esi*4]. 4 edi. esi loc_C1: . [esp+10h+ar1] esi.

CODE XREF: f(int. xmmword ptr [ebx+edi*4] xmm0. ecx jmp short loc_127 ?f@@YAHHPAH00@Z endp The SSE2-related instructions are: • MOVDQU (Move Unaligned Double Quadword)— just loads 16 bytes from memory into a XMM-register. if ar2 is aligned on a 16-byte boundary as well.int *)+3A .int *.int *.int *. xor eax. If it is not aligned.int *.CHAPTER 25. [edi+ecx*4] mov [eax+ecx*4]. By the way.int *. [esi+ecx*4] add ebx. CODE XREF: f(int. [edi+ecx*4] mov [eax+ecx*4].int *. • PADDD (Add Packed Integers)— adds 4 pairs of 32-bit numbers and leaves the result in the first operand. [esp+10h+ar1] mov edi. CODE XREF: f(int. f(int.int *)+129 .int *. just the low 32 bits of the result are to be stored. but requires the address of the value in memory to be aligned on a 16-bit boundary..int *)+9F xor ecx. eax pop ecx pop ebx pop esi pop edi retn loc_162: .int *)+17 . ecx short loc_111 loc_127: . edx jb short loc_133 jmp short loc_15B loc_143: .int *. CODE XREF: f(int. [esp+10h+ar2] xor ecx. [esp+10h+ar1] mov edi.int *. • MOVDQA (Move Aligned Double Quadword) is the same as MOVDQU. but requires aforesaid..int *. these SSE2-instructions are to be executed only in case there are more than 4 pairs to work on and the pointer ar3 is aligned on a 16-byte boundary.int *. SIMD movdqu paddd movdqa add cmp jb 25.int *.int *. If it is not aligned. exception will be raised. edx jb short loc_14D loc_15B: . then the address must be aligned on a 16-byte boundary. 4 edi.1.int *.int *)+164 cmp ecx.int *)+107 .int *)+A .int *)+13F mov ebx. this fragment of code is to be executed: 5 More about data alignment: Wikipedia: Data structure alignment 416 . CODE XREF: f(int.. MOVDQA works faster than MOVDQU. [esp+10h+ar2] loc_133: . Also. xmm0 edi.int *. edx jnb short loc_15B mov esi. VECTORIZATION xmm0.int *. ebx inc ecx cmp ecx. an exception will be triggered 5 . ebx inc ecx cmp ecx. f(int.int *)+159 mov ebx. So.. xmmword ptr [esi+edi*4] xmmword ptr [eax+edi*4]. no exception is raised in case of overflow and no flags are to be set. [esi+ecx*4] add ebx. f(int. f(int.int *. mov esi. CODE XREF: f(int. If one of PADDD’s operands is the address of a value in memory.int *. ecx loc_14D: .int *.int *.int *)+8C .

eax nop lea esi..int *. [ebp+arg_0] esi. [ebx+10h] short loc_80484E8 loc_80484C1: .int *)+61 .int *. xmmword ptr [esi+edi*4] . int *. xmmword ptr [ebx+edi*4] .4. [esi+eax*4] [ebx+eax*4].int *)+36 edx. non-SSE2 code is to be executed. 0Ch ecx.int *)+17 . VECTORIZATION xmm0.int *)+A5 add esp. ecx short loc_80484D8 ecx. What we get (GCC 4.com/17083 417 .int *. f(int. ar3+i*4 Otherwise. xmm0 . which does not require aligned pointer.1.CHAPTER 25. ar2+i*4 xmmword ptr [eax+edi*4]. [ebp+arg_8] ebx.int *. [ebp+arg_C] ecx. [esi+0] loc_80484C8: . the value from ar2 is to be loaded into XMM0 using MOVDQU. xor eax.yurichev. if the -O3 option is used and SSE2 support is turned on: -msse2. GCC GCC may also vectorize in simple cases6 . f(int. esp edi esi ebx esp.int *. CODE XREF: f(int.int *. CODE XREF: f(int. eax pop ebx pop esi pop edi 6 More about GCC vectorization support: http://go. ecx short loc_80484C8 loc_80484D8: .. xmmword ptr [esi+edi*4] . [edi+eax*4] edx.int *. 1 eax.int *. int *. ar2+i*4 is not 16-byte aligned. int *) public _Z1fiPiS_S_ _Z1fiPiS_S_ proc near var_18 var_14 var_10 arg_0 arg_4 arg_8 arg_C = = = = = = = dword dword dword dword dword dword dword push mov push push push sub mov mov mov mov test jle cmp lea ja ptr -18h ptr -14h ptr -10h ptr 8 ptr 0Ch ptr 10h ptr 14h ebp ebp. edx eax. 6 eax. ar1+i*4 xmm0. xmmword ptr [ebx+edi*4] .int *)+4B .int *.1): . f(int. ar1+i*4 xmm0. xmm0 xmmword ptr [eax+edi*4]. but may work slower: movdqu movdqu paddd movdqa xmm1. 0Ch xor eax. so load it to XMM0 xmm1. xmm1 . ar3+i*4 In all other cases. [ebp+arg_4] edi. SIMD movdqu paddd movdqa 25. CODE mov add mov add cmp jnz XREF: f(int.int *.

eax eax. ecx short loc_8048520 ecx.int *. CODE test jnz lea cmp jbe XREF: f(int.int *)+E0 edx. 1 xmm0. eax short loc_8048547 [ebp+var_18]. SIMD pop retn 25. eax edx. CODE lea cmp ja cmp jbe XREF: f(int. [esi+0] loc_8048558: . [ebp+var_14] eax.int *)+73 edx. eax ebx esi edi ebp 418 . 1 edi.int *)+1F bl.int *. CODE mov add add add add mov add cmp jg add xor pop pop pop pop XREF: f(int.int *.int *.int *. edx loc_8048520: . edx short loc_8048503 edi. 2 eax. eax short loc_80484D8 loc_8048547: . edx esi.int *)+CC edx. ecx ecx. xmmword ptr [esi+eax] edx. VECTORIZATION ebp align 8 loc_80484E8: . xmmword ptr [edi+eax] xmm0. edx ebx.CHAPTER 25. 4 edx. [ebp+var_10] ecx. eax [ebp+var_10]. [edi+10h] ebx.int *. [edi] eax. edx ebx.int *. xmm0 eax.int *. 0Ch eax.int *)+9B xmm1. 2 [ebp+var_14].int *.1.int *. [esi] esi. eax short loc_8048558 esp. eax short loc_80484C1 loc_8048503: . CODE movdqu movdqu add paddd movdqa add cmp jb mov mov cmp jz XREF: f(int. CODE lea add add add lea XREF: f(int. 10h edx. edx loc_8048578 loc_80484F8: . edx edi. ds:0[eax*4] esi.int *. xmm1 xmmword ptr [ebx+eax]. 4 ecx.int *)+5D eax. [esi+10h] ebx. 0Fh short loc_80484C1 edx.int *. 4 [ebx]. CODE mov shr mov shl test mov jz mov mov xor xor nop XREF: f(int. [ebp+var_18] eax. ecx eax.

9. esi loc_80484C1 loc_80484F8 Almost the same. CODE cmp jnb jmp _Z1fiPiS_S_ endp XREF: f(int. al je . [rdi+16] cmp rsi. however.L13 mov rcx. rcx je .L4 movzx eax. SIMD 25. 3 mov BYTE PTR [rdi+2]. BYTE PTR [rsi+1] cmp rcx. BYTE PTR [rsi+3] cmp rcx. al je .h> void my_memcpy (unsigned char* dst. al je . BYTE PTR [rsi+4] cmp rcx. VECTORIZATION retn loc_8048578: . }. BYTE PTR [rsi] cmp rcx.L15 movzx eax.1. RDI = destination address .1: Optimizing GCC 4. al je .int *)+52 eax. rax lea rax. 15 cmp rcx.2 Memory copy example Let’s revisit the simple memcpy() example ( 14. rsi push rbp push rbx neg rcx and ecx.L13 cmp rdx. for (i=0. rax setae al or cl. [rsi+16] setae cl cmp rdi. 1 mov BYTE PTR [rdi].L16 movzx eax.L18 movzx eax. unsigned char* src. i++) dst[i]=src[i].CHAPTER 25. BYTE PTR [rsi+2] cmp rcx.int *. RDX = size of block test rdx.1. rdx cmova rcx. 4 mov BYTE PTR [rdi+3]. 5 419 . size_t cnt) { size_t i. 25.9. RSI = source address . And that’s what optimizations GCC 4. i<cnt. not as meticulously as Intel C++.1 x64 my_memcpy: .int *.L17 movzx eax. 2 mov BYTE PTR [rdi+1].2 on page 179): #include <stdio. eax test rcx.L41 lea rax.1 did: Listing 25. 22 jbe . rdx je . al je . rdx xor eax.

r11 r10. rcx r8. al mov lea sub lea sub shr add mov sal cmp jbe lea xor add xor r10. 1 r11.L6 rbp. ebx movdqa add movups add cmp jb add cmp je xmm0. al BYTE PTR [rsi+6] 7 PTR [rdi+6].L25 eax. rcx. BYTE PTR [rsi+rax] BYTE PTR [rdi+rax]. r8 . BYTE . VECTORIZATION mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov je movzx cmp mov jne movzx mov mov BYTE .CHAPTER 25. al BYTE PTR [rsi+14] PTR [rdi+14]. 14 . al BYTE PTR [rsi+12] 13 PTR [rdi+12]. rcx. rcx. rcx r8. al BYTE PTR [rsi+8] 9 PTR [rdi+8]. al BYTE PTR [rsi+7] 8 PTR [rdi+7]. al BYTE PTR [rsi+11] 12 PTR [rdi+11].L1 movzx mov ecx.L22 eax. 4 r9.L24 eax. al 15 . rcx. BYTE . BYTE eax.L20 eax. rcx. BYTE . cl BYTE PTR [rsi+5] 6 PTR [rdi+5]. SIMD 25.L6: 420 . PTR [rdi+4]. BYTE . [rdx-1] r10.L28 eax. 16 rbx. BYTE .L23 eax. al BYTE PTR [rsi+9] 10 PTR [rdi+9]. BYTE .L7 rax. r11 . rcx.L7: . BYTE .L19 eax. r9d rcx. rcx. al BYTE PTR [rsi+10] 11 PTR [rdi+10].L21 eax. r8 r11.L4: . rcx. BYTE . rcx. 4 r8. rdi ebx.L26 eax. XMMWORD PTR [rbp+0+r9] rbx. xmm0 r9. 1 XMMWORD PTR [rcx+r9].L27 eax. [r10-16] r9.1. al BYTE PTR [rsi+13] 15 PTR [rdi+13]. rdx r9. [rsi+rcx] r9d. BYTE .

L1 ecx.L1 ecx.1. . rdx. .L1 ecx. rdx. BYTE rcx. . rdx. cl [rax+10] rcx BYTE PTR [rsi+10+rax] PTR [rdi+10+rax].L1 ecx. . BYTE rcx. BYTE rcx. rdx. cl [rax+6] rcx BYTE PTR [rsi+6+rax] PTR [rdi+6+rax]. cl [rax+9] rcx BYTE PTR [rsi+9+rax] PTR [rdi+9+rax]. . . rdx. . cl [rax+7] rcx BYTE PTR [rsi+7+rax] PTR [rdi+7+rax]. BYTE rcx. BYTE rcx. rdx. .L1 edx. rdx. dl 421 .L1 ecx. . rdx. rdx. .L1 ecx. . cl [rax+13] rcx BYTE PTR [rsi+13+rax] PTR [rdi+13+rax]. VECTORIZATION rcx. cl [rax+4] rcx BYTE PTR [rsi+4+rax] PTR [rdi+4+rax]. BYTE rcx. SIMD lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov lea cmp jbe movzx mov 25. rdx. . BYTE [rax+1] rcx BYTE PTR [rsi+1+rax] PTR [rdi+1+rax]. BYTE rcx.L1 ecx.L1 ecx. cl [rax+3] rcx BYTE PTR [rsi+3+rax] PTR [rdi+3+rax].L1 ecx. rdx. BYTE rcx. BYTE rcx. . rdx. BYTE rcx.L1 ecx. rdx. . BYTE rcx.CHAPTER 25.L1 ecx. cl [rax+14] rcx BYTE PTR [rsi+14+rax] PTR [rdi+14+rax]. cl [rax+12] rcx BYTE PTR [rsi+12+rax] PTR [rdi+12+rax]. cl [rax+8] rcx BYTE PTR [rsi+8+rax] PTR [rdi+8+rax].L1 ecx. cl [rax+2] rcx BYTE PTR [rsi+2+rax] PTR [rdi+2+rax]. rdx. cl [rax+11] rcx BYTE PTR [rsi+11+rax] PTR [rdi+11+rax]. cl [rax+5] rcx BYTE PTR [rsi+5+rax] PTR [rdi+5+rax]. BYTE rcx. BYTE rcx.L1 ecx.

5 times faster than the common implementation. some of them are located in the intrin.CHAPTER 25.L28: . 9 .L23: .yurichev.L4 mov jmp eax. This function loads 16 characters into a XMM-register and check each against zero 9 . SIMD STRLEN() IMPLEMENTATION . 4 . 3 .L4 mov jmp eax.L4 mov jmp eax. 7 .L15: . 8 .L20: .L4 mov jmp eax.2. eax movzx mov add cmp jne rep ret ecx. 7 MSDN: MMX. 13 .L26: .L25: .L4 mov jmp eax.L41: rep ret .h file. cl 1 rdx . and SSE2 Intrinsics —standard C library function for calculating string length 9 The example is based on source code from: http://go.L24: . For MSVC. rax.L4 mov jmp eax.L18: . 1 .L1: pop pop rbx rbp .L13: xor eax. 11 . 12 .L4 mov jmp eax. 6 .L4 mov jmp eax.L19: .L4 mov jmp eax.L4 mov jmp eax. 2 .L17: .L3: BYTE PTR [rsi+rax] PTR [rdi+rax]. . SSE.L4 mov jmp eax. 14 .2 SIMD strlen() implementation It has to be noted that the SIMD instructions can be inserted in C/C++ code via special macros7 . 5 . It is possible to implement the strlen() function8 using SIMD instructions that works 2-2.L27: 25. BYTE rax.L21: .L16: .L3 mov jmp eax.L4 mov jmp eax. 8 strlen() 422 .L4 .com/17330. 10 . SIMD 25.L22: .L4 mov jmp eax.

} Let’s compile it in MSVC 2010 with /Ox option: Listing 25. xmm1 = _mm_cmpeq_epi8(xmm1. bool str_is_aligned=(((unsigned int)str)&0xFFFFFFF0) == (unsigned int)str. const char *s=str. 12 .2. edx mov ecx. -16 . __m128i xmm1. DWORD PTR _str$[ebp] sub esp. len += sizeof(__m128i). fffffff0H mov eax. _BitScanForward(&pos. int mask = 0. mask). fffffff0H xor edx. -16 . DWORD PTR [eax+1] npad 3 .CHAPTER 25. size = 4 _str$ = 8 . return len. esp and esp. for (. size = 4 ?strlen_sse2@@YAIPBD@Z PROC . xmm0). __m128i xmm0 = _mm_setzero_si128(). eax je SHORT $LN4@strlen_sse lea edx.. edx pop esi mov esp. eax jne SHORT $LN9@strlen_sse 423 . if ((mask = _mm_movemask_epi8(xmm1)) != 0) { unsigned long pos. xmm0 pcmpeqb xmm1. cl jne SHORT $LL11@strlen_sse sub eax. break. align next label $LL11@strlen_sse: mov cl. xmm0 pmovmskb eax.2: Optimizing MSVC 2010 _pos$75552 = -4 . SIMD 25.) { xmm1 = _mm_load_si128((__m128i *)s). eax and esi. SIMD STRLEN() IMPLEMENTATION size_t strlen_sse2(const char *str) { register size_t len = 0. 0000000cH push esi mov esi. } s += sizeof(__m128i). XMMWORD PTR [eax] pxor xmm0. xmm1 test eax. }. eax cmp esi. BYTE PTR [eax] inc eax test cl. strlen_sse2 push ebp mov ebp. len += (size_t)pos. ebp pop ebp ret 0 $LN4@strlen_sse: movdqa xmm1. if (str_is_aligned==false) return strlen (str).

there will be 0xff at this point in the result or 0 if otherwise. a function loading 16 bytes at once may step over the border of an allocated memory block. of course.CHAPTER 25. If not. that is. Thus. e. Now if we implement our own strlen() reading 16 byte at once. Any attempt to access memory starting from that address will raise an exception. we load the next 16 bytes into the XMM1 register using MOVDQA. eax mov DWORD PTR _pos$75552[esp+16]. the block is the bytes starting from address 0x008c0000 to 0x008c1fff inclusive. So then we’ll work only with the addresses aligned on a 16 byte boundary. Then. aligned or not. and then an exception will be raised. But here we are may hit another caveat: In the Windows NT line of OS (but not limited to it). strlen() reads one byte at a time until 0x008c1ffd. That situation is to be avoided. After the block. eax mov ecx. and that is not a crime. xmm0 add edx. why can’t MOVDQU be used here since it can load data from the memory regardless pointer alignment? Yes. _mm_cmpeq_epi8()— is a macro for PCMPEQB. eax lea eax. it just loads 16 bytes from the address into the XMM1 register. xmm0 —it just clears the XMM0 register. strlen_sse2 First. starting from address 0x008c2000 there is nothing at all. only some parts of the address space are connected to real physical memory. And if some byte was equals to the one in the other register. If the process is accessing an absent memory block. Let’s get back to our function. xmm1 test eax. ebp pop ebp ret 0 ?strlen_sse2@@YAIPBD@Z ENDP . it might be done in this way: if the pointer is aligned. which in combination with the knowledge that the OS’ page size is usually aligned on a 16-byte boundary gives us some warranty that our function will not read from unallocated memory. _mm_setzero_si128()— is a macro generating pxor xmm0. an exception is to be raised. xmm0. _mm_load_si128()— is a macro for MOVDQA. 16 . 0x008c1ff8 0x008c1ff9 0x008c1ffa 0x008c1ffb 0x008c1ffc 0x008c1ffd 0x008c1ffe 0x008c1fff ’h’ ’e’ ’l’ ’l’ ’o’ ’\x00’ random noise random noise So. starting at any address. And let’s consider the example in which the program is holding a string that contains 5 characters almost at the end of a block. memory is allocated by pages of 4 KiB (4096 bytes). 00000010H pcmpeqb xmm1. XMM1: 11223344556677880000000000000000 XMM0: 11ab3444007877881111111111111111 After the execution of pcmpeqb xmm1. MOVDQU may attempt to load 16 bytes at once at address 0x008c1ff8 up to 0x008c2008. SIMD STRLEN() IMPLEMENTATION $LL3@strlen_sse: movdqa xmm1. eax je SHORT $LL3@strlen_sse $LN9@strlen_sse: bsf eax. we call the generic strlen() implementation. XMMWORD PTR [ecx+16] add ecx. the OS not allocated any memory there. where there’s a zero byte. in normal conditions the program calls strlen(). 16 . if not —use the slower MOVDQU.2. passing it a pointer to the string 'hello' placed in memory at address 0x008c1ff8. and then it stops. That’s how VM works10 . SIMD 25. we check if the str pointer is aligned on a 16-byte boundary. An observant reader might ask. but in fact. the XMM1 register contains: XMM1: ff0000ff0000ffff0000000000000000 10 wikipedia 424 .g. load data using MOVDQA. an instruction that compares two XMM-registers bytewise. Let’s say that the OS has allocated 8192 (0x2000) bytes at address 0x008c0000. For example. DWORD PTR [ecx+edx] pop esi mov esp. So. 00000010H pmovmskb eax. Each win32process has 4 GiB available.

it is also has to be noted that the MSVC compiler emitted two loop bodies side by side. its position is added to what we have already counted and now we have the return result. The next macro is _mm_movemask_epi8() —that is the PMOVMSKB instruction. In other words. Obviously. If the second byte in the XMM1 register is 0xff. too. then the second bit in EAX is to be set to 1.com/17331 11 I use here order from MSB to LSB12 425 .yurichev. By the way. our function must take only the first zero bit and ignore the rest. if the first byte of the XMM1 register is 0xff. PMOVMSKB in our case will set EAX to (in binary representation): 0010000000100000b. EAX contains 5. and some random noise in memory. we are getting something like 11 : XMM1: 0000ff00000000000000ff0000000000 This means that the instruction found two zero bytes. It is very useful with PCMPEQB. There might be 16 bytes in the input like: 15 14 13 12 11 10 “h” “e” “l” “l” “o” 0 3 9 garbage 2 0 1 0 garbage It is the 'hello' string. The next instruction is BSF (Bit Scan Forward). If we load these 16 bytes into XMM1 and compare them with the zeroed XMM0. SIMD STRLEN() IMPLEMENTATION In our case. and it is not surprising. By the way. Now it is simple. meaning 1 was found at the 5th bit position (starting from zero). Almost all.CHAPTER 25. The other bits in the EAX register are to be cleared. pmovmskb eax. xmm0. In other words. the instruction is answering the question “which bytes in XMM1 are 0xff?” and returns 16 bits in the EAX register. MSVC has a macro for this instruction: _BitScanForward. then the first bit of EAX is to be 1. terminating zero. This instruction finds the first bit set to 1 and stores its position into the first operand. xmm1 This instruction sets first EAX bit to 1 if the most significant bit of the first byte in XMM1 is 1. do not forget about this quirk of our algorithm. SIMD 25. EAX=0010000000100000b After the execution of bsf eax. If a zero byte was found. which was set in the XMM0 register by pxor xmm0. this instruction compares each 16-byte block with a block of 16 zero-bytes. for optimization.2 (that appeared in Intel Core i7) offers more instructions where these string manipulations might be even easier: http://go. By the way.2. SSE 4. eax.

the function that calculates the first S-box of the DES encryption algorithm processes 32/64/128/256 values at once (depending on DES_type type (uint32. R15. From the reverse engineer’s perspective. RSP.1 x86-64 It is a 64-bit extension to the x86 architecture. R9 registers. The Caller function must also allocate 32 bytes so the callee may save there 4 first arguments and use these registers for its own needs. It is still possible to access the older register parts as usual. R10. the rest —in the stack. System V AMD64 ABI (Linux. R8. 8 additional registers wer added. including cache memory. Now GPR’s are: RAX. See also the section on calling conventions ( 64 on page 663). *BSD. it uses 6 registers RDI. R9 for the first 6 arguments. R14. uint64. Short functions may use arguments just from registers. • All pointers are 64-bit now. 7th (byte number) 6th 5th 4th R8 3rd 2nd 1st 0th R8D R8W R8L The number of SIMD registers was doubled from 8 to 16: XMM0-XMM15. R12.prefix. the compilers have more space for maneuvering called register allocation. RDX. RDI. For example. but larger ones may save their values on the stack.CHAPTER 26. the most important changes are: • Almost all registers (except FPU and SIMD) were extended to 64 bits and got a R. RBX. R8. RCX. RCX. For us this implies that the emitted code containing less number of local variables. All the rest are passed via the stack. R11. SSE2 or AVX)) using the bitslice DES method (read more about this technique here ( 25 on page 413)): 426 . RSI. R8. R8W-R15W (lower 16-bit parts). RBP. • The C/C++ int type is still 32-bit for compatibility. Mac OS X)[Mit13] also somewhat resembles fastcall. somewhat resembling fastcall ( 64. RDX.3 on page 664) . R8L-R15L (lower 8-bit parts). The first 4 arguments are stored in the RCX. R9. despite the fact that x64 CPUs can address only 48 bits of external RAM. it is possible to access the lower 32-bit part of the RAX register using EAX: 7th (byte number) 6th 5th 4th RAXx64 3rd 2nd 1st 0th EAX AH AX AL The new R8-R15 registers also have their lower parts: R8D-R15D (lower 32-bit parts). RDX. R13. 64 BITS Chapter 26 64 bits 26. For example. the function calling convention is slightly different. Since now the number of registers is doubled. • In Win64. This provokes irritation sometimes: now one needs twice as much memory for storing pointers. RSI.

x11 = a2 | x10. x39. x24. *out2. x47. * * This software may be modified. x18. x26. x5 = a6 & x4. x34. x37. redistributed. x22.CHAPTER 26. x21 = a1 & x20. x14. *out1. x9 = a6 & ~x8. a6. x38. x31 = x18 & ~x30. x15 = a5 & ~a4. x23. x30 = x9 ^ x24. and used for any purpose. x24 = x23 ^ x8. x5.March 1998 */ #ifdef _WIN64 #define DES_type unsigned __int64 #else #define DES_type unsigned int #endif void s1 ( DES_type DES_type DES_type DES_type DES_type DES_type DES_type DES_type DES_type DES_type a1. x43. x32 = a2 & x31. x17. x13 = a5 ^ x5. ) { x1 = a3 & ~a5. x14 = x13 & x8. x6. x10 = x7 ^ x9. x15. x11. x53. x16 = x3 ^ x14. x7 = a4 & ~a5. x32. x19. x8 = a3 ^ a4. x45. a4. a5. x21. x31. x44. x27 = x24 ^ x26. * * Produced by Matthew Kwan . x22 = x12 ^ ~x21. x26 = a2 & ~x25. x50. x9. x13. x29 = x28 ^ x25. x48. x7. x16. x25 = x18 & ~x2. X86-64 /* * Generated S-box files. x17 = a6 | x16. x46. x20 = x14 ^ x19. x27. x33. *out3. x30. x25. x54. 64 BITS 26. x8. x20. x12. x4. x6 = x2 ^ x5. x12 = x6 ^ x11. x29. x40. x3 = a3 & ~a4. x23 = x1 | x5. x51. x41. x35. *out4 DES_type DES_type DES_type DES_type DES_type DES_type DES_type x1. a3. x2 = x1 ^ a4. x10. x56. 427 . x19 = a2 | x18. x49.1. x55. x28. x36. *out2 ^= x22. x2. x3. a2. x18 = x15 ^ x17. x28 = x6 | x7. * so long as its origin is acknowledged. x42. x4 = x3 | a5. x52.

size = 4 _out1$ = 32 . x47 ^ x39. size = 4 _x28$ = 28 .1: Optimizing MSVC 2008 PUBLIC _s1 . x49 & ~x5. x51 ^ a5. size = 4 _out3$ = 40 . size = 4 _s1 PROC sub esp. size = 4 _a4$ = 20 . DWORD PTR _a4$[esp+20] push ebp push esi mov esi. x48 ^ ~x55. x42 & ~a2. Of course. ^= x35. size = 4 _x3$ = -16 . size = 4 _x4$ = -4 . a1 | x54. x4 ^ x36. x24 & ~x37. x33 & ~x9. ^= x46. Let’s compile it with MSVC 2008 with /Ox option: Listing 26. a3 | x31. } There are a lot of local variables. a2 | x3. X86-64 x29 ^ x32. a3 & x28. a1 & x33. x40 ^ x43. x37 ^ x38. size = 4 _x7$ = 20 . size = 4 _a1$ = 8 _a2$ = 12 . a1 & ~x44. size = 4 _x33$ = 20 . not all those going into the local stack. 20 . size = 4 _out4$ = 44 . size = 4 _x24$ = 36 . edi and edi. x41 | x3. x18 & ~x36. size = 4 _x8$ = -8 .1. size = 4 . DWORD PTR _a3$[esp+28] push edi mov edi. size = 4 _a6$ = 28 . a2 & ~x52. ebx not edi mov ebp.CHAPTER 26. DWORD PTR _a5$[esp+16] push ebx mov ebx. x50 ^ x53. ^= x56. x42 | x18. size = 4 _a5$ = 24 . x39 ^ ~x45. 00000014H mov edx. size = 4 _x1$ = -12 . size = 4 _out2$ = 36 . 64 BITS x33 = x34 = x35 = *out4 x36 = x37 = x38 = x39 = x40 = x41 = x42 = x43 = x44 = x45 = x46 = *out1 x47 = x48 = x49 = x50 = x51 = x52 = x53 = x54 = x55 = x56 = *out3 26. x27 ^ x34. Function compile flags: /Ogtpy _TEXT SEGMENT _x6$ = -20 . size = 4 _a3$ = 16 . size = 4 _x36$ = 28 . size = 4 tv326 = 28 . DWORD PTR _a5$[esp+32] 428 .

DWORD PTR _a1$[esp+32] eax. DWORD PTR _a6$[esp+32] ecx ecx. ebx ebx. ebx edx. DWORD PTR _x1$[esp+36] DWORD PTR _x28$[esp+32]. DWORD PTR _x7$[esp+32] edi ebx. DWORD PTR _x7$[esp+32] ebx. DWORD PTR [edi] eax ebx. ecx edx. ecx ebx. ebx DWORD PTR _x1$[esp+36]. ebx ebx. ebp esi. esi edi. edx edx. edx DWORD PTR [edi]. edx ebx. ebx ebx. edi ebx. DWORD PTR _x24$[esp+32] DWORD PTR [ebx]. eax 429 . DWORD PTR _a3$[esp+32] edx. edx edx. esi ecx. ebx ebx. ebx esi. DWORD PTR _out4$[esp+32] eax. ebp edx. DWORD PTR _x6$[esp+36] edi. edi edi. DWORD PTR _a5$[esp+32] DWORD PTR _x8$[esp+36]. esi esi. esi eax. edi edi. eax DWORD PTR _x6$[esp+36]. DWORD PTR _a6$[esp+32] DWORD PTR _x7$[esp+32]. edx edi.CHAPTER 26. 64 BITS mov not and mov and and mov xor mov or mov and mov mov xor mov mov xor mov xor mov and mov mov xor or not and xor mov or mov mov xor and mov xor not or xor mov mov xor not xor and mov mov or mov or mov xor mov xor not and mov and xor xor not mov and and xor mov xor xor mov 26.1. ecx edi edi. X86-64 ecx. DWORD PTR _x8$[esp+36] DWORD PTR _x24$[esp+32]. edx ecx. DWORD PTR _a1$[esp+32] ebx. ebp edi. edx DWORD PTR _x4$[esp+36]. ebx edi. DWORD PTR _a2$[esp+32] edi. ebx edi. DWORD PTR [ebx] eax. ebx ebx. edi edi. DWORD PTR _a2$[esp+32] DWORD PTR _x3$[esp+36]. ecx eax. esi ebx. ebp ebx. ebp ebp. DWORD PTR _a6$[esp+32] edx. eax eax. DWORD PTR _x28$[esp+32] ebx. edx ebx. DWORD PTR _x6$[esp+36] eax. esi edx. ebp eax. eax eax DWORD PTR _x33$[esp+32]. DWORD PTR _out2$[esp+32] ebx. edx ecx ebp.

DWORD PTR _out1$[esp+32] not eax and eax. rbp [rsp+16]. 00000014H ret 0 _s1 ENDP 5 variables were allocated in the local stack by the compiler. 64 BITS 26. DWORD PTR _a1$[esp+32] and edx. DWORD PTR _x36$[esp+32] xor edx. DWORD PTR [edi] not ecx and ecx. DWORD PTR _a1$[esp+32] not ebp xor ebp. DWORD PTR _out3$[esp+32] xor eax. rdx [rsp+8]. rcx 430 . eax not eax and eax. DWORD PTR _a5$[esp+32] mov edx. ebx pop ebp mov DWORD PTR [ecx]. eax not eax and eax. DWORD PTR _x33$[esp+32] xor ebp. DWORD PTR _x28$[esp+32] and eax. eax pop ebx add esp. ebx not eax mov DWORD PTR [edi]. DWORD PTR _a2$[esp+32] not ebp and ebp. ebp xor eax. DWORD PTR _x24$[esp+32] not ebp or eax. edx xor eax. edx or ebx. ebp xor ebx. Now let’s try the same thing in the 64-bit version of MSVC 2008: Listing 26. DWORD PTR [ecx] pop edi pop esi xor eax. edi mov edi. rbx [rsp+32]. X86-64 mov eax. DWORD PTR _x4$[esp+36] xor ebp. DWORD PTR _a3$[esp+32] mov ebx. esi xor eax.2: Optimizing MSVC 2008 a1$ = 56 a2$ = 64 a3$ = 72 a4$ = 80 x36$1$ = 88 a5$ = 88 a6$ = 96 out1$ = 104 out2$ = 112 out3$ = 120 out4$ = 128 s1 PROC $LN3: mov QWORD mov QWORD mov QWORD mov QWORD push rsi PTR PTR PTR PTR [rsp+24]. edx or eax. 20 . DWORD PTR _x3$[esp+36] not esi and ebp.CHAPTER 26. DWORD PTR _a3$[esp+32] mov DWORD PTR _x36$[esp+32].1. DWORD PTR _x3$[esp+36] or edi. eax or eax. ecx mov ecx.

rbx rax. r14 r8.CHAPTER 26. rax rax. r11 r11. rax rax. QWORD PTR a6$[rsp] rbp. r8 QWORD PTR [rax]. r15 r13. rdx rdi. r8 r10. QWORD PTR [rax] rcx. rsi rdx. QWORD PTR out2$[rsp] rcx. r9 rsi. rdi rcx. rax r8. r9 r9. r8 rsi. r14 r8. r13 rax. r9 r10 r11. r14 rax. r10 QWORD PTR x36$1$[rsp]. rdx QWORD PTR x36$1$[rsp]. QWORD PTR a1$[rsp] rax. r9 rcx rcx. rcx rdi. rcx rax. rax 431 . rdi rcx. r9 rax. r13 rdx rdx.1. rsi r12. QWORD PTR a2$[rsp] r12. rbp rax rdx. r9 rcx. r12 r14. rax rdi. rcx rdx. r10 rbx. r8 rcx. r15 rbx rax. QWORD PTR a5$[rsp] rcx. QWORD PTR x36$1$[rsp] rcx. rcx rax. QWORD PTR x36$1$[rsp] rcx. r15 r13 r13. 64 BITS push push push push push mov mov mov mov mov mov not xor not mov and mov mov and and and mov mov xor mov mov or not and mov and mov mov xor xor not and mov xor or xor and mov or xor mov xor and or not xor mov xor xor mov mov mov or or mov xor mov mov mov xor not and mov and xor 26. r9 r10. r15 rdx. r11 rbx. rcx r14. rsi rdi. rax r11. r9 rcx. r8 r10. rax rax. rdi r10. X86-64 rdi r12 r13 r14 r15 r15. rdx r10. rdx rbx.

r15 rcx. rcx rdx. there are CPUs with much more GPR’s. QWORD PTR out1$[rsp] r8. r13 r9. rbx rbx. rdi rbx. rsi rbx. QWORD PTR x36$1$[rsp] rbx. QWORD PTR [rax] r9. QWORD PTR [rsp+80] r9. rdx rdx. QWORD PTR a1$[rsp] rbx. QWORD PTR out3$[rsp] r9. QWORD PTR [rax] rbx. rbx r9 r9. rbx rbx. e. rdx r9 rcx.2. r10 r9. r9 r15 r14 r13 r12 rdi rsi 0 Nothing was allocated in the local stack by the compiler. Itanium (128 registers). rdx r9. rbp r9. 64 BITS s1 xor not and mov and xor mov xor xor mov mov and mov not and or mov xor not and or mov or xor mov not not not and or and xor xor mov not not and and and xor mov not xor or not xor mov mov xor xor xor mov pop pop pop pop pop pop ret ENDP 26.2 ARM 64-bit instructions appeared in ARMv8. QWORD PTR out4$[rsp] rbx. r12 rcx.CHAPTER 26. By the way. rcx QWORD PTR [rax]. rax rax. QWORD PTR [rax] r9.g. rcx rax. x36 is synonym for a5. rbp rbp. QWORD PTR [rsp+72] rcx rcx. rdi r8. r11 rcx r14 r13 rcx. r9 r9. ARM r10. r10 rax. r14 r9. rbx rbx rbx. r8 QWORD PTR [rax]. r9 r9 r9. r11 rcx. 432 . QWORD PTR a1$[rsp] r9 rcx r13. r11 rax. 26. r9 rax. r8 QWORD PTR [rax].

CHAPTER 26. 433 . 64 BITS 26. FLOAT POINT NUMBERS Float point numbers How float point numbers are processed in x86-64 is explained here: 27 on the previous page.3 26.3.

a is passed in XMM0. The constants are encoded by compiler in IEEE 754 format. So. 3. }. b—via XMM1.1 Simple example #include <stdio.4)). but the double values are 64 bit. modern compilers (including those generating for x86-64) usually use SIMD instructions instead of FPU ones. int main() { printf ("%f\n". f(1. double b) { return a/3. DIVSD is an SSE-instruction that stands for “Divide Scalar Double-Precision Floating-Point Values”. QWORD PTR __real@4010666666666666 xmm0. The result of the function’s execution in type double is left in the in XMM0 register. xmm1 0 The input floating point values are passed in the XMM0-XMM3 registers.14 xmm0. The number format remains the same (IEEE 754). so only lower register half is used. }.h> double f (double a.1. because it’s easier to work with them. 3. MULSD and ADDSD work just as the same. stored in the lower halves of operands. The SIMD extensions (SSE2) offer an easier way to work with floating-point numbers. 27.1. it just divides one value of type double by another. It can be said that it’s good news.1 x64 Listing 27. QWORD PTR __real@40091eb851eb851f xmm1. The XMM-registers are 128-bit (as we know from the section about SIMD: 25 on page 413). 27.CHAPTER 27. That is how non-optimizing MSVC works: 1 MSDN: Parameter Passing 434 .2. We are going to reuse the examples from the FPU section here: 17 on page 208. 4.1 .14 + b*4. but do multiplication and addition. WORKING WITH FLOATING POINT NUMBERS USING SIMD Chapter 27 Working with floating point numbers using SIMD Of course.1: Optimizing MSVC 2012 x64 __real@4010666666666666 DQ 04010666666666666r __real@40091eb851eb851f DQ 040091eb851eb851fr a$ = 8 b$ = 16 f PROC divsd mulsd addsd ret f ENDP . the FPU has remained in x86-compatible processors when the SIMD extensions were added. all the rest—via the stack 1 .

435 . xmm0. 3. size = 8 xmm1.14 PTR [rsp+16]. 2) the result of the function is returned in ST(0) — in order to do so.2 x86 I also compiled this example for x86. 0 .. xmm1 QWORD PTR tv70[ebp]. only 64-bit values of type double. xmm0.1. QWORD PTR _b$[ebp] xmm1. 4. MSVC 2012 uses SSE2 instructions: Listing 27.2. xmm1. xmm1.e. QWORD QWORD 0 QWORD PTR _a$[esp-4] QWORD PTR __real@40091eb851eb851f QWORD PTR _b$[esp-4] QWORD PTR __real@4010666666666666 xmm0 PTR tv67[esp-4]. xmm1. size = 8 . QWORD PTR __real@4010666666666666 xmm0. xmm1 PTR tv67[esp-4] It’s almost the same code. like in the FPU examples ( 17 on page 208). GCC produces the same code. size = 8 .2: MSVC 2012 x64 __real@4010666666666666 DQ 04010666666666666r __real@40091eb851eb851f DQ 040091eb851eb851fr a$ = 8 b$ = 16 f PROC movsdx movsdx movsdx divsd movsdx mulsd addsd ret f ENDP QWORD QWORD xmm0. size = 8 . there are some differences related to calling conventions: 1) the arguments are passed not in XMM registers. it’s copied (through local variable tv) from one of the XMM registers to ST(0). 27. xmm0 QWORD PTR a$[rsp] QWORD PTR __real@40091eb851eb851f QWORD PTR b$[rsp] QWORD PTR __real@4010666666666666 xmm1 Slightly redundant. size = 8 ebp ebp. QWORD PTR _a$[ebp] xmm0.CHAPTER 27. QWORD PTR __real@40091eb851eb851f xmm1.1.3: Non-optimizing MSVC 2012 x86 tv70 = -8 _a$ = 8 _b$ = 16 _f PROC push mov sub movsd divsd movsd mulsd addsd movsd fld mov pop ret _f ENDP .4: Optimizing MSVC 2012 x86 tv67 = 8 _a$ = 8 _b$ = 16 _f PROC movsd divsd movsd mulsd addsd movsd fld ret _f ENDP . xmm0 QWORD PTR tv70[ebp] esp. xmm0. Despite the fact it’s generating for x86. ebp ebp 0 Listing 27. SIMPLE EXAMPLE Listing 27.1 .1 on page 89). but only their lower register halves. xmm1 PTR [rsp+8]. but in the stack. however. size = 8 . 8 xmm0. esp esp. xmm1. The input arguments are saved in the “shadow space” ( 8. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27. i. xmm0.

1. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27. SIMPLE EXAMPLE Let’s try the optimized example in OllyDbg: Figure 27.CHAPTER 27.1: OllyDbg: MOVSD loads the value of a into XMM1 436 .

SIMPLE EXAMPLE Figure 27.1. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27.2: OllyDbg: DIVSD calculated quotient and stored it in XMM1 437 .CHAPTER 27.

WORKING WITH FLOATING POINT NUMBERS USING SIMD 27.1. SIMPLE EXAMPLE Figure 27.CHAPTER 27.3: OllyDbg: MULSD calculated product and stored it in XMM0 438 .

CHAPTER 27. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27.4: OllyDbg: ADDSD adds value in XMM0 to XMM1 439 .1. SIMPLE EXAMPLE Figure 27.

CHAPTER 27. 440 .1. But of course. SIMPLE EXAMPLE Figure 27. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27. OllyDbg shows them in that format because the SSE2 instructions (suffixed with -SD) are executed right now. Apparently. but only the lower part is used.5: OllyDbg: FLD left function result in ST(0) We see that OllyDbg shows the XMM registers as pairs of double numbers. it’s possible to switch the register format and to see their contents as 4 float-numbers or just as 16 bytes.

27. Apparently.6: Optimizing GCC 4.5: Optimizing MSVC 2012 x64 $SG1354 DB '32. OFFSET FLAT:$SG1354 xmm1. It is then moved to RDX for printf(). 1.01 .3 Comparison example 441 .h> #include <stdio. maybe because printf()—is a variable arguments function? Listing 27. QWORD PTR __real@40400147ae147ae1 pow rcx. xmm0 rdx. } They are passed in the lower halves of the XMM0-XMM3 registers.54 = %lf\n" main: sub rsp. 00H __real@40400147ae147ae1 DQ 040400147ae147ae1r __real@3ff8a3d70a3d70a4 DQ 03ff8a3d70a3d70a4r main main PROC sub movsdx movsdx call lea movaps movd call xor add ret ENDP . The value for printf() is passed in XMM0.2 27.54 = %lf'. 40 .2.6 x64 . number of vector registers passed call printf xor eax. xmm1 printf eax. here is a case when 1 is written into EAX for printf()—this implies that one argument will be passed in vector registers. so they renamed it to MOVSDX.LC2: .54 rsp. 0aH.LC0: .4.01. OFFSET FLAT:. eax add rsp. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27.54)). 8 movsd xmm1.LC0[rip] movsd xmm0. 40 .6. just as the standard requires [Mit13]. 32. pow() takes arguments from XMM0 and XMM1.01 ^ 1. pow (32.long 2920577761 1077936455 .h> int main () { printf ("32.long 171798692 1073259479 .1.long .01 ^ 1. 1 .CHAPTER 27.LC2 mov eax. 00000028H xmm1. Why? Honestly speaking. and returns result in XMM0. QWORD PTR . 00000028H 0 There is no MOVSDX instruction in Intel [Int13] and AMD [AMD13a] manuals.long . there it is called just MOVSD.string "32. QWORD PTR . eax rsp. PASSING FLOATING POINT NUMBER VIA ARGUMENTS Passing floating point number via arguments #include <math. Listing 27.LC1[rip] call pow .LC1: GCC generates clearer output. 8 ret .2 on page 934).01 ^ 1. I don’t know. So there are two instructions sharing the same name in x86 (about the other see: A. result is now in XMM0 mov edi. return 0.54 = %lf\n". By the way. QWORD PTR __real@3ff8a3d70a3d70a4 xmm0. Microsoft developers wanted to get rid of the mess. It just loads a value into the lower half of a XMM register.

QWORD PTR b$[rsp] 0 However. xmm1 0 Optimizing MSVC generates a code very easy to understand. return b.8: MSVC 2012 x64 a$ = 8 b$ = 16 d_max PROC movsdx movsdx movsdx comisd jbe movsdx jmp $LN1@d_max: movsdx $LN2@d_max: fatret d_max ENDP QWORD QWORD xmm0. d_max (1.CHAPTER 27. SHORT xmm0. 27.4. int main() { printf ("%f\n". SHORT PTR [rsp+16].4)). which just choose the maximum value! Listing 27.9: Optimizing GCC 4. but it is still not hard to understand: Listing 27.2. xmm1 SHORT $LN2@d_max xmm0.3.4.h> double d_max (double a. xmm1 PTR [rsp+8].3. Non-optimizing MSVC generates more redundant code.6.7: Optimizing MSVC 2012 x64 a$ = 8 b$ = 16 d_max PROC comisd ja movaps $LN2@d_max: fatret d_max ENDP xmm0. 3. xmm1 442 . WORKING WITH FLOATING POINT NUMBERS USING SIMD 27. printf ("%f\n".6 x64 d_max: maxsd ret xmm0. double b) { if (a>b) return a. COMPARISON EXAMPLE #include <stdio. -4)). }.1 x64 Listing 27. GCC 4. COMISD is “Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS”. d_max (5.6 did more optimizations and used the MAXSD (“Return Maximum Scalar Double-Precision FloatingPoint Value”) instruction. xmm0 QWORD PTR a$[rsp] QWORD PTR b$[rsp] $LN1@d_max QWORD PTR a$[rsp] $LN2@d_max xmm0. xmm0. that is what it does. }. Essentially.

xmm0.2 x86 Let’s compile this example in MSVC 2012 with optimization turned on: Listing 27.11: Optimizing MSVC 2012 x64 v$ = 8 calculate_machine_epsilon PROC movsdx QWORD PTR v$[rsp]. If we load this example in OllyDbg. SHORT QWORD 0 QWORD PTR _a$[esp-4] QWORD PTR _b$[esp-4] $LN1@d_max PTR _a$[esp-4] QWORD PTR _b$[esp-4] 0 Almost the same. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27.4 Calculating machine epsilon: x64 and SIMD Let’s revisit the “calculating machine epsilon” example for double listing.2.10: Optimizing MSVC 2012 x86 _a$ = 8 _b$ = 16 _d_max PROC movsd comisd jbe fld ret $LN1@d_max: fld ret _d_max ENDP . size = 8 xmm0.2. size = 8 . CALCULATING MACHINE EPSILON: X64 AND SIMD 27.4.3. Now we compile it for x64: Listing 27. xmm0 443 .22. we can see how the COMISD instruction compares values and sets/clears the CF and PF flags: Figure 27. but the values of a and b are taken from the stack and the function result is left in ST(0).CHAPTER 27.6: OllyDbg: COMISD changed CF and PF flags 27.

but MSVC 2012 probably is not that good yet 2 . “Scalar” implies that only one value is stored in the register. Listing 27.CHAPTER 27. Essentially. “Single” stands for float data type.22. it will use the SIMD instructions for the FPU. the value is then reloaded to a XMM register and subtraction occurs. The result is returned in the XMM0 register. Nevertheless. PSEUDO-RANDOM NUMBER GENERATOR EXAMPLE REVISITED movaps xmm1. eax . Hence.. FPU may give more precise calculation results. EAX=pseudorandom value and eax. and reloading it into ST0: fld DWORD PTR tv128[esp+4] pop ecx ret 0 ?float_rand@@YAMXZ ENDP All instructions have the -SS suffix. which stands for “Scalar Single”. WORKING WITH FLOATING POINT NUMBERS USING SIMD 27. If you would try to replace double with float in these examples. DWORD PTR __real@3f800000 . 27.. store it into local stack: mov DWORD PTR _tmp$[esp+4]. subtract 1. Instructions working with several values in a register simultaneously have “Packed” in their name. ADDSS.6 Summary Only the lower half of XMM registers is used in all examples here. MOVSS. to store number in IEEE 754 format. it operates on the lower 64-bit part of 128-bit XMM register. the SSE2 instructions work with 64-bit IEEE 754 numbers (double). EAX=pseudorandom value & 0x007fffff | 0x3f800000 . 1 tv128 = -4 _tmp$ = -4 ?float_rand@@YAMXZ PROC push ecx call ?my_rand@@YAIXZ . The stack register model is not used. the FPU may produce less round-off errors and as a consequence. xmm0 inc QWORD PTR v$[rsp] movsdx xmm0. There is. 27. etc. move value to ST0 by placing it in temporary variable. 2 As an exercise. reload it as float point number: movss xmm0. however. you may try to rework this code to eliminate the usage of the local stack.12: Optimizing MSVC 2012 __real@3f800000 DD 03f800000r .5 Pseudo-random number generator example revisited Let’s revisit “pseudo-random number generator example” example listing. If we compile this in MSVC 2012.. xmm1 ret 0 calculate_machine_epsilon ENDP There is no way to add 1 to a value in 128-bit XMM register. but prefixed with -SS (“Scalar Single-Precision”). for example.1. . xmm0 . 3f800000H . SUBSD is “Subtract Scalar DoublePrecision Floating-Point Values”.. while the internal representation of the floating-point numbers in FPU is 80-bit numbers. “Scalar” implies that the SIMD register containing only one value instead of several. stored in the lower 64-bit half of a XMM register.5.e.0: subss xmm0.. 444 . probably because the SIMD extensions were evolved in a less chaotic way than the FPU ones in the past. so it must be placed into memory. DWORD PTR _tmp$[esp+4] . And it is easier than in the FPU. QWORD PTR v$[rsp] subsd xmm0. all instructions prefixed by -SD (“Scalar Double-Precision”)—are instructions working with floating point numbers in IEEE 754 format. the same instructions will be used. COMISS. 007fffffH or eax. 1065353216 . 8388607 . Needless to say. movss DWORD PTR tv128[esp+4]. i. the ADDSD instruction (Add Scalar Double-Precision Floating-Point Values) which can add a value to the lowest 64-bit half of a XMM register while ignoring the higher one.

39.15. for example: listing. *++ptr.1. then use *ptr value decrement ptr pointer. Thus. For example. C language compilers may use it.3 Loading a constant into a register 28. post-increment. even on PDP-11. By the way.14.3.1 32-bit ARM Aa we already know. *--ptr. w4. And it’s possible to do that both before and after loading. then use *ptr value Pre-indexing is marked with an exclamation mark in the ARM assembly language. then increment ptr pointer use *ptr value.CHAPTER 28. this is one of the hard to memorize C features. The meaning is different if the number is outside the brackets: ldr Please note that 24 is inside the brackets. 28. Then how can we load a 32-bit value into a register. were “guilty” for the appearance of such C language (which developed on PDP-11) constructs as *ptr++. pre-decrement and post-decrement modes in PDP-11.28 This means load the value at the address in X1. if it is present on the target processor. [x29.9 generates assembly language output. 28. all instructions have a length of 4 bytes in ARM mode and 2 bytes in Thumb mode. But when GCC 4. ARM allows you to add or subtract a constant to/from the address used for loading. MORE ABOUT ARM Chapter 28 More about ARM 28.4. it doesn’t. *ptr--. [x1]. This is how it is: C term Post-increment ARM term post-indexed addressing C statement *ptr++ Post-decrement post-indexed addressing *ptr-- Pre-increment pre-indexed addressing *++ptr Pre-decrement pre-indexed addressing *--ptr how it works use *ptr value. for example: listing. one has to obey the rules accepted in environment he/she works in. but it is present in some other processors.3. The ARM listings in this book are somewhat mixed. Supposedly.24] This means add 24 to the value in X29 and load the value from this address. Dennis Ritchie (one of the creators of the C language) mentioned that it probably was invented by Ken Thompson (another C creator) because this processor feature was present in PDP-7 [Rit86][Rit93]. There is a legend that the pre-increment.3.1 Number sign (#) before number The Keil compiler. That’s very convenient for array processing. see line 2 in listing. then add 28 to X1. There is no such addressing mode in x86. if it’s not possible to encode it in one instruction? Let’s try: 445 . IDA and objdump precede all numbers with the “#” number sign. I’m not sure which method is right.2 Addressing modes This instruction is possible in ARM64: ldr x0. then decrement ptr pointer increment ptr pointer.

1: GCC 4. i. For example: double a() { return 1.500000000000000000e+000 446 . }.1 -O3 mov movk movk movk ret x0. the 0x12345678 value is just stored aside in memory and loaded if needed. }. Listing 28. #22136 r0. it writes a 16-bit value into the register. the lower part first (using MOVW). 32 and 48 bits at each step. The shifting is done before loading.3: GCC 4. Listing 28.9.CHAPTER 28.word 305419896 . Most likely. MORE ABOUT ARM 28. . not touching the rest of the bits.. x0.3 -O3 -march=armv7-a (ARM mode) movw movt bx r0.3 -O3 ARM mode f: ldr bx r0. IDA is able to detect such patterns in the code and disassembles this function as: MOV BX 28. additional memory access. LOADING A CONSTANT INTO A REGISTER unsigned int f() { return 0x12345678.6. The LSL suffix shifts left the value by 16. On the other hand.6.e.9. Does it mean that the two-instruction version is slower than one-instruction version? Doubtfully. Listing 28. 0x5678 . because in fact there are not many constants in real code (except of 0 and 1). This implies that 4 instructions are necessary to load a 64-bit value into a register. #1. modern ARM processors are able to detect such sequences and execute them fast. But it’s possible to get rid of the Listing 28.L2: So.2 R0.3. x0. 0x1234.5. 0x12345678 . 0xef01 lsl 16 lsl 32 lsl 48 MOVK stands for “MOV Keep”.4: GCC 4. then the higher (using MOVT). }.L2 lr . 0x5678. #4660 lr . It’s not a real problem. This implies that 2 instructions are necessary in ARM mode for loading a 32-bit value into a register. 61185 0xabcd.1 -O3 + objdump 0000000000000000 <a>: 0: 1e6f1000 4: d65f03c0 fmov ret d0. 0x1234 We see that the value is loaded into the register by parts. . 0x12345678 LR ARM64 uint64_t f() { return 0x12345678ABCDEF01.2: GCC 4. x0.3. Storing floating-point number into register It’s possible to store a floating-point number into a D-register using only one instruction.

. there are 4-byte instructions in ARM64. sp x0. MORE ABOUT ARM 28. Read more about them (in relation to Win32 PE): 68. RELOCS IN ARM64 The number 1. x30..6 on page 688.. so it is impossible to write a large number into a register using a single instruction. 26-bit one.rodata 000000000000000c R_AARCH64_ADD_ABS_LO12_NC . The address is formed using the ADRP and ADD instruction pair in ARM64. All ARM64 (and in ARM in ARM mode) instruction addresses have zeroes in the two lowest bits (because all instructions have a size of 4 bytes). but it couldn’t encode 32. The algorithm is called VFPExpandImm() in [ARM13a]. there are 8 bits in the FMOV instruction for encoding some floating-point numbers. [sp. 1 wikipedia 447 .0. #0x0 x29. • The first one takes the page address. But how? In ARM64.word ...9 under win32: Listing 28.exe -d hw..9 and objdump of object file .LC0: 28.>aarch64-linux-gnu-objdump. so that’s why relocs exists. as 8 bytes have to be allocated for this number in the IEEE 754 format: double a() { return 32. cuts the lowest 12 bits and writes the remaining high 21 bits to the ADRP instruction’s bit fields.o .#-16]! x29.. #0x0 0 <printf> w0.>aarch64-linux-gnu-gcc.>aarch64-linux-gnu-objdump. and the ADRP instruction has space only for 21 bits.o .2.#16 // #0 . 0000000000000000 <main>: 0: a9bf7bfd stp 4: 910003fd mov 8: 90000000 adrp c: 91000000 add 10: 94000000 bl 14: 52800000 mov 18: a8c17bfd ldp 1c: d65f03c0 ret x29. Listing 28.. [sp].. I compiled the example from “Hello. Nevertheless. 0 <main> x0.6: GCC (Linaro) 4.5 was indeed encoded in a 32-bit instruction.4.word 0 1077936128 . x30.5: GCC 4.1 -O3 a: ldr ret d0.CHAPTER 28. RELOCATION RECORDS FOR [.LC0 .9. This is because we don’t need to encode the low 12 bits.exe -r hw. .rodata 0000000000000010 R_AARCH64_CALL26 printf So there are 3 relocs in this object file.6) in GCC (Linaro) 4.0 and 31. an executable image can be loaded at any random address in memory.. x0. • The last. }. so one need to encode only the highest 26 bits of 28-bit address space (±128MB). world!” (listing. The first loads a 4Kb-page address and the second one adds the remainder.0.c -c . is applied to the instruction at address 0x10 where the jump to the printf() function is.exe hw. • The second one puts the 12 bits of the address relative to the page start into the ADD instruction’s bit fields. This is also called minifloat 1 .4 Relocs in ARM64 As we know. I tried it with different values: the compiler is able to encode 30.text]: OFFSET TYPE VALUE 0000000000000008 R_AARCH64_ADR_PREL_PG_HI21 .

p.. RELOCS IN ARM64 There are no such relocs in the executable file: because it’s known where the “Hello!” string is located. C5. So there are values set already in the ADRP.6. in which page. so the number is negative.CHAPTER 28. [sp. x0.Hello!. As an example. imm26 is the last 26 bits: imm26 = 11111111111111111110100000. let’s try to disassemble the BL instruction manually. More about ARM64-related relocs: [ARM13b]. which may be different! ).rodata: 400640 01000200 00000000 48656c6c 6f210000 .. and 0x4005a0 + FFFFFE80 = 0x400420 (please note: we consider the address of the BL instruction. Again.#-16]! x29.#16 // #0 . it is 0xFFFFFFA0. MORE ABOUT ARM 28. So the destination address is 0x400420. ADD and BL instructions (the linker has written them while linking): Listing 28. 400000 <_init-0x3b8> x0.. [sp].26].. #0x648 400420 <puts@plt> w0. 0xFFFFFFA0 * 4 = 0xFFFFFE80. not the current value of PC. x30. sp x0. 0x97ffffa0 is 10010111111111111111111110100000b. but the MSB is 1. Contents of section . 448 . #0x0 x29. According to [ARM13a. by modulo 232 ..4...7: objdump of executable file 0000000000400590 <main>: 400590: a9bf7bfd 400594: 910003fd 400598: 90000000 40059c: 91192000 4005a0: 97ffffa0 4005a4: 52800000 4005a8: a8c17bfd 4005ac: d65f03c0 stp mov adrp add bl mov ldp ret x29... It is 0x3FFFFA0. and in terms of modular arithmetic by modulo 232 .. x30. and the address of puts() is also known.

branch delay slot The GCC assembly output has the LI pseudoinstruction.4. so it’s not possible to embed a 32-bit constant into one instruction. for convenience it shows the last ORI instruction as the LI pseudoinstruction. so.5 -O3 (assembly output) li j ori $2. }. which stores a 16-bit value into the high part of the register. MORE ABOUT MIPS Chapter 29 More about MIPS 29.2 Further reading about MIPS [Swe10]. just like ARM.305397760 # 0x12340000 $31 $2.1 Loading constants into register unsigned int f() { return 0x12345678.5 -O3 (IDA) lui jr li $v0. 449 . LUI (“Load Upper Imeddiate”) is there. All instructions in MIPS.$2.2: GCC 4. Listing 29. which allegedly loads a full 32-bit number into the $V0 register.CHAPTER 29. 0x1234 $ra $v0. 0x12345678 . which effectively sets the low 16-bit part of the target register: Listing 29.1: GCC 4. have a of 32-bit. So this translates to at least two instructions: the first loads the high part of the 32-bit number and the second one applies an OR operation.0x5678 . 29. but in fact.4. branch delay slot IDA is fully aware of such frequently encountered code patterns.

Part II Important fundamentals 450 .

SIGNED NUMBER REPRESENTATIONS Chapter 30 Signed number representations There are several methods for representing signed numbers1 . that is what one need to know: • Number can be signed or unsigned.0x7FFFFFFFFFFFFFFF).255 or 0. but “two’s complement” is the most popular one in computers.0xFF).... For the sake of simplicity.. If we represent them both as signed. 6 5 4 3 2 1 0 255 254 253 252 251 250 .0xFFFFFFFF). – unsigned int (0..CHAPTER 30.. JBE) operations. and it is smaller than the second (2). JL) and unsigned (JA.4294967295 or 0.. – int (-2147483646.. JG. – char (-127.0xFFFFFFFFFFFFFFFF).0x7FFFFFFF).. 1 wikipedia 451 ... • C/C++ signed types: – int64_t (-9223372036854775806.128 or 0x7F.9223372036854775807) or 0x8000000000000000... then the first number (4294967294) is bigger than the second one(2).0x80).g. Here is a table for some byte values: binary 01111111 01111110 hexadecimal 0x7f 0x7e 00000110 00000101 00000100 00000011 00000010 00000001 00000000 11111111 11111110 11111101 11111100 11111011 11111010 0x6 0x5 0x4 0x3 0x2 0x1 0x0 0xff 0xfe 0xfd 0xfc 0xfb 0xfa 10000010 10000001 10000000 0x82 0x81 0x80 unsigned 127 126 . Unsigned: – uint64_t (0..18446744073709551615 or 0. – unsigned char (0.2147483647 or 0x80000000. That is the reason why conditional jumps ( 12 on page 111) are present both for signed (e. the first one is to be −2... – ssize_t. 130 129 128 signed (2’s complement) 127 126 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -126 -127 -128 The difference between signed and unsigned numbers is that if we represent 0xFFFFFFFE and 0x0000002 as unsigned.

1. But for multiplication and division operations. 0 mean “plus”. • Here are some more instructions that work with signed numbers: CBW/CWD/CWDE/CDQ/CDQE ( A. • Promoting to a larger data types is simple: 24. MOVSX ( 15. SIGNED NUMBER REPRESENTATIONS – size_t.CHAPTER 30. • Negation is simple: just invert all bits and add 1. • The addition and subtraction operations work well for both signed and unsigned values.3 on page 940).6. We can remember that a number of inverse sign is located on the opposite side at the same proximity from zero. x86 has different instructions: IDIV/IMUL for signed and DIV/MUL for unsigned.5 on page 411.6.3 on page 936). SAR ( A. The addition of one is needed because zero is present in the middle. • Signed types have the sign in the most significant bit: 1 mean “minus”.1 on page 189). 452 .

31.yurichev. And after running it I get: 1 I’ve uploaded it here: http://go.CHAPTER 31. *(((char*)&v)+2). *(((char*)&v)+1). IBM POWER. And I have compiled this simple example: #include <stdio. }. 31.1 Big-endian The 0x12345678 value is represented in memory as: address in memory +0 +1 +2 +3 byte value 0x12 0x34 0x56 0x78 Big-endian CPUs include Motorola 68k. ENDIANNESS Chapter 31 Endianness The endianness is a way of representing values in memory.2 Little-endian The 0x12345678 value is represented in memory as: address in memory +0 +1 +2 +3 Little-endian CPUs include Intel x86.3 Example I have big-endian MIPS Linux installed and ready in QEMU 1 . *(((char*)&v)+3)). i. v=123. printf ("%02X %02X %02X %02X\n". 31.h> int main() { int v.com/17008 453 byte value 0x78 0x56 0x34 0x12 . *(char*)&v.

CHAPTER 31.5 Converting data The BSWAP instruction can be used for conversion. so that is why a program working on a little-endian architecture has to convert the values. The htonl() and htons() functions are usually used. ENDIANNESS 31. In TCP/IP. because the highest byte goes first. BI-ENDIAN root@debian-mips:~# . 2 Intel Architecture 64 (Itanium): 93 on page 877 454 . big-endian is also called “network byte order”. IA642 . That’s why there are separate Linux distributions for MIPS (“mips” (big-endian) and “mipsel” (little-endian)).4 Bi-endian CPUs that may switch between endianness are ARM.4. but it is big-endian on IBM POWER.out 00 00 00 7B That is it. TCP/IP network data packets use the big-endian conventions. 7B is the first byte (you can check on x86 or x86-64)./a. so htonl() and htons() are not shuffle any bytes on latter. while byte order on the computer—“host byte order”. but here it is the last one. SPARC. 0x7B is 123 in decimal. “host byte order” is little-endian on Intel x86 and other little-endian architectures.4. 31. In little-endian architectures. etc. It is impossible for a binary compiled for one endianness to work on an OS with different endianness.3 on page 369. There is another example of MIPS big-endiannes in this book: 21. MIPS. 31. PowerPC.

The allocation is done just by declaring variables/arrays locally in the function. MEMORY Chapter 32 Memory There are 3 main types of memory: • Global memory. AKA “allocate on stack”. Not convenient for buffers/arrays. but can be slow.2 on page 352. There’s an example in this book: 7. These are global variables. residing in the data or constant segments. This is the slowest way to allocate memory: the memory allocator must support and update all control structures while allocating and deallocating.2 on page 264. Sometimes these local variable are also available to descending functions (if one passes a pointer to a variable to the function to be executed). This is the most convenient method: the block size may be set at runtime. Buffer overflows that occur here usually overwrite variables or buffers reside next to them in memory. • Stack. because they must have a fixed size. Another problem is the “use after free”—using a memory block after free() was called on it. But they’re also not convenient for buffers/arrays.2. AKA “static memory allocation”. Example in this book: 21. Allocation and deallocation are very fast. Allocation is done by calling malloc()/free() or new/delete in C++. Buffer overflows usually overwrite important stack structures: 18. AKA “dynamic memory allocation”. The are available globally (hence. considered as an anti-pattern). but one may forget about it. the allocation is done just by declaring variables/arrays globally. Resizing is possible (using realloc()). Heap allocations are also source of memory leak problems: each memory block has to be deallocated explicitly. Buffer overflows usually overwrite these structures. 455 . or do it incorrectly.CHAPTER 32. • Heap.4 on page 25) (or a variable-length array) is used. because the buffer size has to be fixed. These are usually local variables for the function. only SP needs to be shifted. unless alloca() ( 5. No need to allocate explicitly. which is very dangerous.2 on page 65.

Examples in this book are: 12.1. because it does not modify CPU flags. 12. Hence. 33. Conditional instructions in ARM (like ADRcc) are one way. while other arithmetic instructions do this.2 on page 122.2 Data dependencies Modern CPUs are able to execute instructions simultaneously (OOE1 ).3 on page 129. This is because the branch predictor is not always perfect.5. That’s why the LEA instruction is so popular. if possible. 19. 1 Out-of-order execution 456 . but in order to do so. the compiler endeavors to use instructions with minimal influence to the CPU state.2 on page 330. so the compilers try to do without conditional jumps.CHAPTER 33. another one is the CMOVcc x86 instruction. CPU Chapter 33 CPU 33. the results of one instruction in a group must not influence the execution of others.1 Branch predictors Some modern compilers try to get rid of conditional jump instructions.

for example: 4 6 0 1 3 5 7 8 9 2 Algorithm of simplest possible one-way function is: • take a number at zeroth position (4 in our case). Besides. it is impossible to restore the original text from the hash value. an internet forum engine is not aware of your password. an algorithm that provides “stronger” checksum for integrity checking purposes. etc.CHAPTER 34.1 How one-way function works? One-way function is a function which is able to transform one value into another. Such functions are MD5. But CRC32 is not cryptographically secure: it is known how to alter a text in a way that the resulting CRC32 hash value will be the one we need. I’ll try to demonstrate it. and give you access if they match. We’ve got vector of 10 numbers in range 0. while it is impossible (or very hard) to do reverse it back. it has much less information: the input can be long. 34. Let’s mark numbers on positions of 4 and 6: 4 6 0 1 3 5 7 8 9 2 ^ ^ Let’s swap them and we’ve got a result: 4 6 0 1 7 5 3 8 9 2 While looking at the result and even if we know algorithm. • swap numbers at positions of 4 and 6. and then they could be participate in swapping procedure. it has only to check if its hash is the same as the one in the database.9. Cryptographic hash functions are protected from this. This is utterly simplified example for demonstration. 457 . SHA1. HASH FUNCTIONS Chapter 34 Hash functions A very simple example is CRC32. each encounters only once.. and they are widely used to hash user passwords in order to store them in a database. • take a number at first position (6 in our case).. but CRC32’s result is always limited to 32 bits. One of