Background: Fractional Binary Numbers

Representable Numbers (Limitations)

Can only exactly represent numbers of the form $\frac{x}{2^k}$
- Other rational numbers have repeating bit representations, such as $\frac{1}{3}=0.0101[01]\dots_2$
Only one setting of the binary point is possible within the $w$ bits
- A tradeoff: a larger range of numbers, or more representable fractional values?
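
The limitation above can be checked by generating binary digits directly. This is only an illustrative sketch: `binary_fraction_digits` is a hypothetical helper (not from the lecture), and it uses exact rational arithmetic so that no rounding hides the repeating pattern.

```python
from fractions import Fraction

def binary_fraction_digits(x, n):
    """Return the first n bits after the binary point of x, where 0 <= x < 1."""
    bits = []
    for _ in range(n):
        x *= 2                 # shift the next bit in front of the binary point
        bit = int(x >= 1)
        bits.append(str(bit))
        x -= bit               # keep only the fractional remainder
    return "".join(bits)

print(binary_fraction_digits(Fraction(5, 8), 8))    # 10100000 (terminates: 5/2^3)
print(binary_fraction_digits(Fraction(1, 3), 8))    # 01010101 (repeats forever)
print(binary_fraction_digits(Fraction(1, 10), 12))  # 000110011001 (repeats forever)
```

Only the first value has the form $\frac{x}{2^k}$, so only its expansion terminates.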

IEEE Floating Point

Floating point lets the binary point shift: the exponent selects its position.

Numerical form: $$(-1)^sM2^E$$

Sign bit $s$ determines whether it's positive or negative
Significand $M$ (a.k.a. mantissa) is normally a fractional value in the range $[1.0,2.0)$
Exponent $E$ weights value by power of 2

Encoding:

MSB is the sign bit $s$
Exp field encodes $E$ (but is not equal to $E$)
Frac field encodes $M$ (but is not equal to $M$)

Precision Options

Single precision (32 bits): 1 bit for $s$, 8 bits for the exp field, and 23 bits for the frac field
Double precision (64 bits): 1 bit for $s$, 11 bits for the exp field, and 52 bits for the frac field
Extended precision (80 bits, Intel only): 1 bit for $s$, 15 bits for the exp field, and 63 or 64 bits for the frac field

Normalized Values

$Exp\neq 000\dots0$ and $Exp\neq 111\dots1$
- So for a $k$-bit exp field, Exp ranges over $[1,2^k-2]$
- For example, in single precision ($k=8$): $1\leq Exp\leq 254\Rightarrow -126\leq E\leq 127$
Exponent coded as a biased value: $E=Exp-Bias$
- Exp: unsigned value of Exp field
- $Bias=2^{k-1}-1$ where $k$ is the number of exponent bits
Significand coded with implied leading 1: $M=1.x_1x_2x_3...x_n$
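
As a rough check of the normalized encoding, the sketch below pulls the three fields out of a single-precision value and rebuilds $(-1)^sM2^E$ exactly. The function name `decode_normalized32` is made up for this illustration; the bit layout and bias are the ones given above.

```python
import struct
from fractions import Fraction

def decode_normalized32(x):
    """Return (s, E, M, value) for a normalized single-precision x."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31
    exp = (bits >> 23) & 0xFF          # unsigned value of the exp field
    frac = bits & 0x7FFFFF             # 23-bit frac field
    assert 1 <= exp <= 254, "not a normalized value"
    bias = 2 ** (8 - 1) - 1            # 127 for k = 8
    E = exp - bias
    M = 1 + Fraction(frac, 2 ** 23)    # implied leading 1
    value = (-1) ** s * M * Fraction(2) ** E
    return s, E, M, value

s, E, M, value = decode_normalized32(15213.0)
print(s, E, M, value)  # 0 13 15213/8192 15213
```

Here $15213=1.1101101101101_2\times2^{13}$, so the exp field holds $13+127=140$.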

Denormalized Values

$Exp=000\dots0$
$E=1-Bias$ (instead of $E=0-Bias$)
- To keep consistent with the smallest norm case
- A nice smooth transition from denorm to norm
Significand coded with implied leading 0: $M=0.x_1x_2x_3...x_n$
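
The smooth transition can be verified numerically. The following sketch assumes the single-precision layout; `decode32` is a hypothetical helper that decodes a raw bit pattern for finite values only.

```python
from fractions import Fraction

def decode32(bits):
    """Exact value of a finite single-precision bit pattern."""
    s = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    assert exp != 0xFF, "infinity and NaN are not handled here"
    bias = 127
    if exp == 0:                         # denormalized
        E = 1 - bias                     # not 0 - bias
        M = Fraction(frac, 2 ** 23)      # implied leading 0
    else:                                # normalized
        E = exp - bias
        M = 1 + Fraction(frac, 2 ** 23)  # implied leading 1
    return (-1) ** s * M * Fraction(2) ** E

smallest_denorm = decode32(0x00000001)
largest_denorm = decode32(0x007FFFFF)
smallest_norm = decode32(0x00800000)

print(smallest_denorm == Fraction(1, 2 ** 149))                  # True
print(smallest_norm - largest_denorm == smallest_denorm)         # True
```

The step from the largest denorm to the smallest norm equals the step between adjacent denorms, which is what using $E=1-Bias$ buys.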

Special Values

$Exp=111\dots1,Frac=000\dots0$
- Represents infinity
- Result of an operation that overflows
- Both positive and negative
- E.g., 1.0/0.0
$Exp=111\dots1,Frac\neq000\dots0$
- Not-a-Number a.k.a. NaN
- When no numerical value can be determined
- E.g., sqrt(-1)
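
The special bit patterns can be built by hand, as in this sketch (`from_bits32` is a hypothetical helper). One caveat: Python itself raises an exception for `1.0/0.0` and `math.sqrt(-1)` rather than returning infinity or NaN, so the example uses an overflowing multiplication and explicit bit patterns instead.

```python
import math
import struct

def from_bits32(bits):
    """Interpret a 32-bit pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_inf = from_bits32(0x7F800000)   # s=0, Exp=111...1, Frac=000...0
neg_inf = from_bits32(0xFF800000)   # s=1, Exp=111...1, Frac=000...0
nan = from_bits32(0x7FC00000)       # Exp=111...1, Frac!=000...0

print(pos_inf, neg_inf, nan)        # inf -inf nan
print(1e308 * 10 == pos_inf)        # True: double-precision overflow gives +inf
print(nan == nan)                   # False: NaN compares unequal to everything
```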

Distribution of Values

Representable values get denser toward 0: the gap between neighbors doubles each time $E$ increases by one.
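
One way to see the density is to measure the gap to the next representable value. This sketch uses doubles (Python's native float); `next_up` and `gap` are hypothetical helpers that rely on adjacent non-negative values having adjacent bit patterns.

```python
import struct

def next_up(x):
    """Next representable double above a non-negative finite x."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", bits + 1))[0]

def gap(x):
    """Distance from x to the next representable double."""
    return next_up(x) - x

print(gap(1.0) == 2.0 ** -52)       # True
print(gap(2.0) == 2 * gap(1.0))     # True: one step up in E doubles the gap
print(gap(1024.0) == 2.0 ** -42)    # True
print(gap(0.0))                     # 5e-324, the smallest denorm
```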

Special Properties of the IEEE Encoding

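FP zero has the same bit pattern as integer zero (all bits 0)
Can almost use unsigned integer comparison
- Must first compare sign bits
- Must consider $-0=0$
- NaNs are problematic: their bit patterns compare greater than any other value, yet they have no numerical order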
- Otherwise OK
  - Denorms vs. norms
  - Norms vs. infinity
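
The ordering property can be tested on a handful of non-negative single-precision values. This is a small sketch, with `bits32` as a hypothetical helper; it covers zero, denorms, norms, and infinity, and shows why $-0$ needs special care.

```python
import struct

def bits32(x):
    """Single-precision bit pattern of x as an unsigned integer."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return b

# Zero, a denorm, the smallest norm, ordinary norms, and infinity (all exact in float).
values = [float("inf"), 1.5, 2.0 ** -149, 0.0, 2.0 ** 127, 2.0 ** -126, 1.0]

by_value = sorted(values)
by_bits = sorted(values, key=bits32)
print(by_value == by_bits)               # True: same order either way
print(bits32(0.0), hex(bits32(-0.0)))    # 0 0x80000000
print(0.0 == -0.0)                       # True, although the bits differ
```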

FP Operations: Basic Idea

First compute exact result
Make it fit into desired precision
- Possibly overflow if exponent too large
- Possibly round to fit into frac

Rounding

Rounding modes:

Round toward 0
Round down (toward $-\infty$)
Round up (toward $+\infty$)
Round to nearest, ties to even (default)
- Hard to get any other kind without dropping into assembly
- All others are statistically biased: sum of set of positive numbers will consistently be over- or under-estimated

Rounding Binary Numbers

Binary Fractional Numbers:

Even: when least significant bit is 0
Half way: when bits to right of rounding position are $100\dots_2$
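
A minimal sketch of round-to-nearest-even on binary fractions, assuming exact rational inputs. `round_nearest_even` is a hypothetical helper, and the sample values are chosen to hit each case (below half, above half, and the two tie cases).

```python
from fractions import Fraction

def round_nearest_even(x, frac_bits):
    """Round x to frac_bits bits after the binary point, ties to even."""
    scaled = x * 2 ** frac_bits                      # rounding position becomes the units bit
    q, r = divmod(scaled.numerator, scaled.denominator)
    twice_r = 2 * r
    if twice_r > scaled.denominator:                 # more than half way: round up
        q += 1
    elif twice_r == scaled.denominator and q % 2:    # exactly half way and LSB odd: round up
        q += 1
    return Fraction(q, 2 ** frac_bits)

# Round to the nearest 1/4 (two bits after the binary point).
print(round_nearest_even(Fraction(67, 32), 2))  # 10.00011 -> 2     (below half: down)
print(round_nearest_even(Fraction(35, 16), 2))  # 10.00110 -> 9/4   (above half: up)
print(round_nearest_even(Fraction(23, 8), 2))   # 10.11100 -> 3     (half way: up to even)
print(round_nearest_even(Fraction(21, 8), 2))   # 10.10100 -> 5/2   (half way: down to even)
```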

FP Multiplication

$$(-1)^{s_1}M_12^{E_1}\times (-1)^{s_2}M_22^{E_2}=(-1)^{s}M2^{E}$$

$s=s_1\oplus s_2$
$M=M_1\times M_2$
$E=E_1+E_2$
If $M\geq2$, shift $M$ right and increment $E$
If $E$ is out of range, overflow
Round $M$ to fit frac precision
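
The multiplication steps can be sketched on a toy format. Everything here is an assumption made for illustration: values are `(s, M, E)` tuples with an exact `Fraction` significand in $[1,2)$, the precision is 4 frac bits, the largest exponent is 7, overflow raises an exception instead of producing infinity, and underflow is ignored. Rounding can carry $M$ up to 2, so the sketch renormalizes once more after rounding.

```python
from fractions import Fraction

FRAC_BITS = 4   # assumed toy precision
E_MAX = 7       # assumed largest exponent of the toy format

def round_significand(M):
    """Round M to FRAC_BITS fractional bits, ties to even."""
    scaled = M * 2 ** FRAC_BITS
    q, r = divmod(scaled.numerator, scaled.denominator)
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q % 2):
        q += 1
    return Fraction(q, 2 ** FRAC_BITS)

def fp_mul(a, b):
    """Multiply two normalized toy values given as (s, M, E)."""
    (s1, M1, E1), (s2, M2, E2) = a, b
    s = s1 ^ s2                  # sign: exclusive-or
    M = M1 * M2                  # exact product, in [1, 4)
    E = E1 + E2
    if M >= 2:                   # shift M right, increment E
        M /= 2
        E += 1
    M = round_significand(M)
    if M >= 2:                   # rounding carried out of the significand
        M /= 2
        E += 1
    if E > E_MAX:
        raise OverflowError("exponent out of range")
    return s, M, E

print(fp_mul((0, Fraction(3, 2), 1), (1, Fraction(5, 4), 2)))  # 3 * -5 = -15 = -1.875 * 2^3
print(fp_mul((0, Fraction(3, 2), 0), (0, Fraction(3, 2), 0)))  # 2.25 = 1.125 * 2^1
```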

FP Addition

$$(-1)^{s_1}M_12^{E_1}+(-1)^{s_2}M_22^{E_2}=(-1)^{s}M2^{E}$$

Assume $E_1>E_2$
$s$, $M$: result of signed align and add (shift $M_2$ right by $E_1-E_2$ positions, then add as signed values)
$E=E_1$
If $M\geq 2$, shift $M$ right and increment $E$
If $M<1$, shift $M$ left $k$ positions and decrement $E$ by $k$
Overflow if $E$ is out of range
Round $M$ to fit frac precision
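
A comparable sketch for addition, under the same kind of assumptions: a toy format of `(s, M, E)` tuples with an exact `Fraction` significand, 4 frac bits, largest exponent 7, an exception on overflow, and no underflow handling. Exact cancellation is returned as `(0, Fraction(0), 0)`, which is a convention invented for this example only.

```python
from fractions import Fraction

FRAC_BITS = 4   # assumed toy precision
E_MAX = 7       # assumed largest exponent of the toy format

def round_significand(M):
    """Round M to FRAC_BITS fractional bits, ties to even."""
    scaled = M * 2 ** FRAC_BITS
    q, r = divmod(scaled.numerator, scaled.denominator)
    if 2 * r > scaled.denominator or (2 * r == scaled.denominator and q % 2):
        q += 1
    return Fraction(q, 2 ** FRAC_BITS)

def fp_add(a, b):
    """Add two normalized toy values given as (s, M, E)."""
    if a[2] < b[2]:
        a, b = b, a                               # make sure E1 >= E2
    (s1, M1, E1), (s2, M2, E2) = a, b
    M2_aligned = M2 / 2 ** (E1 - E2)              # align binary points
    total = (-1) ** s1 * M1 + (-1) ** s2 * M2_aligned
    if total == 0:
        return 0, Fraction(0), 0                  # exact cancellation (toy convention)
    s = 1 if total < 0 else 0
    M, E = abs(total), E1
    if M >= 2:                                    # shift right, increment E
        M /= 2
        E += 1
    while M < 1:                                  # shift left k positions, decrement E by k
        M *= 2
        E -= 1
    M = round_significand(M)
    if M >= 2:                                    # rounding carried out of the significand
        M /= 2
        E += 1
    if E > E_MAX:
        raise OverflowError("exponent out of range")
    return s, M, E

print(fp_add((0, Fraction(3, 2), 3), (0, Fraction(1), 0)))       # 12 + 1 = 13 = 1.625 * 2^3
print(fp_add((0, Fraction(1), 4), (0, Fraction(17, 16), 0)))     # 17.0625 rounds to 17
print(fp_add((0, Fraction(1), 3), (1, Fraction(7, 4), 2)))       # 8 - 7 = 1 = 1.0 * 2^0
```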

Mathematical Properties of FP Addition

Closed under addition
- But may generate infinity or NaNs
Commutative
But not associative
- Overflow and inexactness of rounding
- e.g., in single precision, $(3.14+1e10)-1e10=0$, whereas $3.14+(1e10-1e10)=3.14$
0 is the additive identity
Every element has an additive inverse
- Except for infinities and NaNs
Almost monotonic: $a\geq b\Rightarrow a+c\geq b+c$
- Except for infinities and NaNs
- e.g., for double d and float f, d > f implies -d < -f
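
The associativity failure can be reproduced by emulating single precision. Python floats are doubles, so this sketch rounds every operand and result through a 32-bit float; `f32` and `add32` are hypothetical helpers for this emulation, not standard functions.

```python
import struct

def f32(x):
    """Round x to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def add32(a, b):
    """Emulated single-precision addition."""
    return f32(f32(a) + f32(b))

left = add32(add32(3.14, 1e10), -1e10)    # (3.14 + 1e10) - 1e10
right = add32(3.14, add32(1e10, -1e10))   # 3.14 + (1e10 - 1e10)

print(left)                  # 0.0: the 3.14 is lost when added to 1e10
print(right == f32(3.14))    # True: 3.14 survives, up to float rounding
```

Near $10^{10}$ adjacent single-precision values are 1024 apart, so adding 3.14 changes nothing.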

Mathematical Properties of FP Multiplication

Closed under multiplication
- But may generate infinity or NaNs
Commutative
But not associative
- Overflow and inexactness of rounding
- e.g., in single precision, $(1e20\times 1e20)\times 1e-20=\infty$, whereas $1e20\times(1e20\times 1e-20)=1e20$
1 is the multiplicative identity
Multiplication does not distribute over addition
- Overflow and inexactness of rounding
- e.g., in single precision, $1e20\times(1e20-1e20)=0.0$, whereas $1e20\times 1e20-1e20\times 1e20=NaN$
Almost monotonic: $a\geq b$ and $c\geq 0\Rightarrow a\times c\geq b\times c$
- Except for infinities and NaNs
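
The same failures show up in double precision once the magnitudes are large enough to overflow. This sketch swaps the single-precision value 1e20 for $2^{600}$ (an assumption made so that the example runs on Python's native doubles and every intermediate result is exact).

```python
import math

big = 2.0 ** 600       # big * big = 2^1200 exceeds the double range
small = 2.0 ** -600

# Not associative: the grouping decides whether the product overflows.
print((big * big) * small)             # inf
print(big * (big * small) == big)      # True

# Not distributive: inf - inf is NaN.
print(big * (big - big))               # 0.0
print(big * big - big * big)           # nan
```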

FP in C

float: single precision FP, 23 bits of frac field
double: double precision FP, 52 bits of frac field
Casting between int, float, and double changes the bit representation
float/double -> int
- Truncates the fractional part
- Like rounding toward zero
- Not defined when out of range or NaN: generally sets to TMin
int -> double
- Exact conversion, as long as int has a word size of no more than 53 bits
int -> float
- Will round according to the rounding mode
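
These conversion rules can be approximated outside C. The sketch below is a Python emulation, not C code: `int()` truncates like a C cast for in-range values, Python's float is a double, and `f32` is a hypothetical helper that rounds through a 32-bit float. Python integers are unbounded, which makes it easy to step just past the 53-bit and 24-bit limits. Out-of-range and NaN casts are not emulated.

```python
import struct

def f32(x):
    """Round x to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]

# float/double -> int: truncation toward zero
print(int(2.9), int(-2.9))               # 2 -2

# int -> double: exact up to 53 significant bits
print(float(2 ** 53) == 2 ** 53)         # True
print(float(2 ** 53 + 1) == 2 ** 53 + 1) # False: 54 bits needed, so it rounds

# int -> float: only 24 significant bits, so rounding starts much earlier
print(f32(2 ** 24) == 2 ** 24)           # True
print(f32(2 ** 24 + 1))                  # 16777216.0 (tie rounds to even)
```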

Lecture 04 Floating Point