Ddoc
$(SPEC_S Floating Point,
<h3>Floating Point Intermediate Values</h3>
$(P On many computers, greater
precision operations do not take any longer than lesser
precision operations, so it makes numerical sense to use
the greatest precision available for internal temporaries.
The philosophy is not to dumb down the language to the lowest
common hardware denominator, but to enable the exploitation
of the best capabilities of target hardware.
)
$(P For floating point operations and expression intermediate values,
a greater precision can be used than the type of the
expression.
Only the minimum precision is set by the types of the
operands, not the maximum. $(B Implementation Note:) On Intel
x86 machines, for example,
it is expected (but not required) that the intermediate
calculations be done to the full 80 bits of precision
implemented by the hardware.
)
$(P It's possible that, due to greater use of temporaries and
common subexpressions, optimized code may produce a more
accurate answer than unoptimized code.
)
$(P Algorithms should be written to work based on the minimum
precision of the calculation. They should not degrade or
fail if the actual precision is greater. Float or double types,
as opposed to the real (extended) type, should only be used for:
)
$(UL
$(LI reducing memory consumption for large arrays)
$(LI cases where speed is more important than accuracy)
$(LI data and function argument compatibility with C)
)
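$(P As a minimal sketch of writing to the minimum precision, the
hypothetical helper below derives its tolerance from the declared type
through the $(B float.epsilon) property. Keeping extra bits in the
intermediates can only make the difference smaller, so the test does not
degrade when the actual precision is greater. The helper name and the
factor of 4 are illustrative assumptions, not part of the specification.
)
---
import std.math : fabs;
import std.stdio : writeln;

/// Hypothetical helper: compares two floats using a tolerance tied to
/// float precision, the minimum guaranteed by the operand types.
bool closeEnough(float a, float b)
{
    immutable float fa = fabs(a), fb = fabs(b);
    immutable float scale = fa > fb ? fa : fb;
    // A few units of float precision, scaled to the operands' magnitude.
    return fabs(a - b) <= 4 * float.epsilon * scale;
}

void main()
{
    float third = 1.0f / 3.0f;
    float sum = third + third + third;
    writeln(closeEnough(sum, 1.0f));

    // Mantissa bits of each type; the value for real depends on the target.
    writeln(float.mant_dig, " ", double.mant_dig, " ", real.mant_dig);
}
---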
<h3>Floating Point Constant Folding</h3>
$(P Regardless of the type of the operands, floating point
constant folding is done in $(B real) or greater precision.
It is always done following IEEE 754 rules and round-to-nearest
is used.)
$(P Floating point constants are internally represented in
the implementation in at least $(B real) precision, regardless
of the constant's type. The extra precision is available for
constant folding. Committing to the precision of the result is
done as late as possible in the compilation process. For example:)
---
const float f = 0.2f;
writeln(f - 0.2);
---
$(P will print 0. A non-const static variable's value cannot be
propagated at compile time, so:)
---
static float f = 0.2f;
writeln(f - 0.2);
---
$(P will print 2.98023e-09. Hex floating point constants can also
be used when specific floating point bit patterns are needed that
are unaffected by rounding. To find the hex value of 0.2f:)
---
import std.stdio;
void main()
{
    writefln("%a", 0.2f);
}
---
$(P which is 0x1.99999ap-3. Using the hex constant:)
---
const float f = 0x1.99999ap-3f;
writeln(f - 0.2);
---
$(P prints 2.98023e-09.)
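$(P The bit pattern behind the hex constant can be inspected directly.
The sketch below uses a hypothetical helper, bitsOf, that reinterprets the
storage of a float argument as a 32 bit integer. It assumes the usual
IEEE 754 single precision layout for float.
)
---
import std.stdio : writefln;

/// Hypothetical helper: returns the raw IEEE 754 bits of a float.
/// Passing the value as a float parameter commits it to float precision.
uint bitsOf(float f)
{
    return *cast(uint*) &f;
}

void main()
{
    writefln("0x%08X", bitsOf(0x1.99999ap-3f)); // prints 0x3E4CCCCD
    writefln("0x%08X", bitsOf(0.2f));           // prints 0x3E4CCCCD
}
---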
$(P Different compiler settings, optimization settings,
and inlining settings can affect opportunities for constant
folding, therefore the results of floating point calculations may differ
depending on those settings.)
<h3>Complex and Imaginary types</h3>
$(P In existing languages, there is an astonishing amount of effort expended in trying to jam a
complex type onto existing type definition facilities: templates, structs, operator
overloading, etc., and it all usually ultimately fails. It fails because the semantics of
complex operations can be subtle, and it fails because the compiler doesn't know what the
programmer is trying to do, and so cannot optimize the semantic implementation.
)
$(P This is all done to avoid adding a new type. Adding a new type means that the compiler
can make all the semantics of complex work "right". The programmer then can rely on a
correct (or at least fixable) implementation of complex.
)
$(P Coming with the baggage of a complex type is the need for an imaginary type. An
imaginary type eliminates some subtle semantic issues, and improves performance by not
having to perform extra operations on the implied 0 real part.
)
$(P Imaginary literals have an i suffix:
)
------
ireal j = 1.3i;
------
$(P There is no particular complex literal syntax; just add a real and
an imaginary value:
)
------
cdouble cd = 3.6 + 4i;
creal c = 4.5 + 2i;
------
$(P Complex, real and imaginary numbers have two properties:
)
<pre>
.re get real part (0 for imaginary numbers)
.im get imaginary part as a real (0 for real numbers)
</pre>
$(P For example:
)
<pre>
cd.re is 3.6 double
cd.im is 4 double
c.re is 4.5 real
c.im is 2 real
j.im is 1.3 real
j.re is 0 real
</pre>
<h3>Rounding Control</h3>
$(P IEEE 754 floating point arithmetic includes the ability to set 4
different rounding modes.
These are accessible via the functions in std.c.fenv.
)
$(V2
$(P If the floating-point rounding mode is changed within a function,
it must be restored before the function exits. If this rule is violated
(for example, by the use of inline asm), the rounding mode used for
subsequent calculations is undefined.
)
)
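$(P A minimal sketch of the save and restore rule follows. It assumes the
C bindings are imported from core.stdc.fenv, the module through which
current D2 runtimes expose the fenv.h functions, and that the target
supports FE_UPWARD. The function name is hypothetical. Whether a given
compiler keeps the division between the two mode changes is not
guaranteed by this sketch.
)
---
import core.stdc.fenv;
import std.stdio : writeln;

/// Hypothetical helper: divides with rounding towards +infinity and
/// restores the caller's rounding mode before returning.
double divideRoundingUp(double a, double b)
{
    immutable int saved = fegetround();
    scope (exit) fesetround(saved); // runs on every exit path
    fesetround(FE_UPWARD);
    return a / b;
}

void main()
{
    double up = divideRoundingUp(1.0, 3.0);
    writeln(up >= 1.0 / 3.0);
    // The mode seen by the caller is unchanged after the call.
    writeln(fegetround() == FE_TONEAREST);
}
---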
<h3>Exception Flags</h3>
$(P IEEE 754 floating point arithmetic can set several flags based on what
happened during a computation:)
$(TABLE
$(TR $(TD FE_INVALID))
$(TR $(TD FE_DENORMAL))
$(TR $(TD FE_DIVBYZERO))
$(TR $(TD FE_OVERFLOW))
$(TR $(TD FE_UNDERFLOW))
$(TR $(TD FE_INEXACT))
)
$(P These flags can be set/reset via the functions in
$(LINK2 phobos/std_c_fenv.html, std.c.fenv).)
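$(P As a hedged sketch, the flags can be cleared before a computation and
tested afterwards. The example assumes the bindings in core.stdc.fenv and
uses a hypothetical function name. An optimizer that folds or moves the
division could change what is observed.
)
---
import core.stdc.fenv;
import std.stdio : writeln;

/// Hypothetical helper: performs the division and reports whether the
/// FE_DIVBYZERO flag was raised by it.
bool raisesDivByZero(double num, double den, out double quotient)
{
    feclearexcept(FE_ALL_EXCEPT);   // start from a clean set of flags
    quotient = num / den;
    return fetestexcept(FE_DIVBYZERO) != 0;
}

void main()
{
    double q;
    writeln(raisesDivByZero(1.0, 0.0, q), " ", q);
    writeln(raisesDivByZero(1.0, 2.0, q), " ", q);
}
---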
<h3>Floating Point Comparisons</h3>
$(P In addition to the usual < <= > >= == != comparison
operators, D adds more that are
specific to floating point. These are
!<>=
<>
<>=
!<=
!<
!>=
!>
!<>
and match the semantics for the
NCEG extensions to C.
See $(LINK2 expression.html#floating_point_comparisons, Floating point comparisons).
)
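$(P For compilers where these operators are not available, their truth
values can be sketched with the usual operators and std.math.isNaN. The
function names below are hypothetical, and the sketch models only the
result of each comparison, not which operators signal an exception on a
NaN operand.
)
---
import std.math : isNaN;
import std.stdio : writeln;

/// x !<>= y : true when the operands are unordered.
bool unordered(double x, double y) { return isNaN(x) || isNaN(y); }

/// x <> y : true when x is less than or greater than y.
bool lessOrGreater(double x, double y) { return x < y || x > y; }

/// x <>= y : true when the operands are ordered.
bool ordered(double x, double y) { return !unordered(x, y); }

/// x !< y : true when unordered, greater, or equal.
bool notLess(double x, double y) { return !(x < y); }

void main()
{
    double nan = double.nan;
    writeln(unordered(1.0, nan));     // true
    writeln(lessOrGreater(1.0, nan)); // false
    writeln(ordered(1.0, 2.0));       // true
    writeln(notLess(1.0, nan));       // true
    writeln(notLess(1.0, 2.0));       // false
}
---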
<h3><a name="floating-point-transformations">Floating Point Transformations</a></h3>
$(P An implementation may perform transformations on
floating point computations in order to reduce their strength,
i.e. their runtime computation time.
Because floating point math does not precisely follow mathematical
rules, some transformations are not valid, even though some
other programming languages still allow them.
)
$(P The following transformations of floating point expressions
are not allowed because under IEEE rules they could produce
different results.
)
$(TABLE1
<caption>Disallowed Floating Point Transformations</caption>
$(TR
$(TH transformation) $(TH comments)
)
$(TR
$(TD $(I x) + 0 → $(I x)) $(TD not valid if $(I x) is -0)
)
$(TR
$(TD $(I x) - 0 → $(I x)) $(TD not valid if $(I x) is ±0 and rounding is towards -∞)
)
$(TR
$(TD -$(I x) ↔ 0 - $(I x)) $(TD not valid if $(I x) is +0)
)
$(TR
$(TD $(I x) - $(I x) → 0) $(TD not valid if $(I x) is NaN or ±∞)
)
$(TR
$(TD $(I x) - $(I y) ↔ -($(I y) - $(I x))) $(TD not valid because 1-1 = +0 whereas -(1-1) = -0)
)
$(TR
$(TD $(I x) * 0 → 0) $(TD not valid if $(I x) is NaN or ±∞)
)
$(COMMENT
$(TR
$(TD $(I x) * 1 → $(I x)) $(TD not valid if $(I x) is a signaling NaN)
)
)
$(TR
$(TD $(I x) / $(I c) ↔ $(I x) * (1/$(I c))) $(TD valid if (1/$(I c)) yields an exact result)
)
$(TR
$(TD $(I x) != $(I x) → false) $(TD not valid if $(I x) is a NaN)
)
$(TR
$(TD $(I x) == $(I x) → true) $(TD not valid if $(I x) is a NaN)
)
$(TR
$(TD $(I x) !$(I op) $(I y) ↔ !($(I x) $(I op) $(I y))) $(TD not valid if $(I x) or $(I y) is a NaN)
)
)
$(P Of course, transformations that would alter side effects are also
invalid.)
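$(P A few rows of the table can be checked at run time. The sketch below
assumes the default round-to-nearest mode and uses hypothetical helper
names. Each helper evaluates the untransformed expression, so it shows the
result that a disallowed rewrite would change.
)
---
import std.math : signbit;
import std.stdio : writeln;

/// True when x + 0 has the same sign bit as x.
bool addZeroKeepsSign(double x)
{
    double zero = 0.0;
    return signbit(x + zero) == signbit(x);
}

/// True when x - x compares equal to zero.
bool selfDifferenceIsZero(double x)
{
    return (x - x) == 0;
}

/// True when x * 0 compares equal to zero.
bool timesZeroIsZero(double x)
{
    double zero = 0.0;
    return (x * zero) == 0;
}

void main()
{
    writeln(addZeroKeepsSign(1.0));                 // true
    writeln(addZeroKeepsSign(-0.0));                // false: -0 + 0 is +0
    writeln(selfDifferenceIsZero(double.infinity)); // false: the result is NaN
    writeln(timesZeroIsZero(double.nan));           // false: the result is NaN

    double nan = double.nan;
    writeln(nan == nan);                            // false
    writeln(nan != nan);                            // true
}
---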
)
Macros:
TITLE=Floating Point
WIKI=Float
CATEGORY_SPEC=$0