Common Floating Point Representations
- Number Bases
- General Floating-Point Representations
- ``The Most Important Fact'' About Floating-Point Number Systems
- Tables of Common Floating-Point Representations
- Sources of IEEE software and information
Number Bases
There are a variety of number bases. The most popular happens to correspond to the number of fingers the average human has. Computers, however, have a natural base of 2, corresponding to on or off, high voltage or low. Representing a large number in binary requires a long string of 1's and 0's that is cumbersome to write or work with on paper. The octal and hexadecimal representations serve as a compromise between humans and computers: they allow humans to write numbers compactly in a form that translates directly to the computer's binary representation.

Base              10 (decimal)    0.1 (decimal)
Decimal (10)      10              .1
Binary (2)        1010            .0(0011)
Octal (8)         12              .0(6314)
Hexadecimal (16)  A               .1(9)

Only rational numbers whose denominators contain the same prime factors as the base have non-repeating fractional parts. In the table above, repeating digit groups are enclosed in parentheses.

The following Fortran code fragment demonstrates one of the pitfalls of the floating point representation: it is only an approximation, and the programmer must be careful to realize that the actual computed result may not be the expected result.
      parameter (zero = 0.0, one = 1.0, tenth = 0.1)
      a = tenth
      x = zero
c     add 0.1 ten times; the exact sum would be 1.0
      do 100 i = 1, 10
         x = x + tenth
  100 continue
      error = x - one

These results were produced on a Sun workstation using the 32-bit IEEE representation. The expected result is zero; however, the computed result is small, but definitely not zero.
a     = 0.10000000
x     = 1.00000012
error = 1.1920929E-07

General Floating-Point Representations
The general floating point representation can be summarized as:

    x = ± β^(e-b) x 0.d1d2...dp

where

    ±     = sign bit (s)
    β     = base or radix
    dn    = digit (0 ≤ dn ≤ β-1) (d1 ≠ 0)
    d     = 0.d1d2...dp (mantissa)
    p     = precision
    m     = minimum exponent
    M     = maximum exponent
    e - b = exponent - bias (m ≤ e - b ≤ M)

and where the following stipulation defines a normalized representation:

    β^-1 ≤ d < 1

And the special values

    λ = β^(m-1) = β^m x 0.1 (the mantissa .1 taken in base β)
    Λ = β^M (1 - β^-p)

where λ is the minimum positive value and Λ is the maximum positive value.

Example: β=2, p=3, m=-1, M=1, so that λ = 2^-2 = .25 and Λ = 2^1(1 - 2^-3) = 1.75.
The only allowed non-negative values are:
0., λ = .25, .3125, .375, .4375, .5, .625, .75, .875, 1., 1.25, 1.5, Λ = 1.75
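These values can also be generated mechanically by looping over the allowed exponents and normalized mantissas. The following fragment is a minimal sketch of that enumeration (not from the original page); it hard-codes the toy parameters β=2, p=3, m=-1, M=1 used above:

      integer k, eb
      real value
c     enumerate the non-negative values of the toy system
c     beta = 2, p = 3, m = -1, M = 1
      print *, 0.0
      do 200 eb = -1, 1
         do 100 k = 4, 7
c           normalized mantissa 0.d1d2d3 = k/8, so that d1 = 1
            value = (real(k) / 8.0) * 2.0**eb
            print *, value
  100    continue
  200 continue
      end

Running it reproduces the thirteen values listed above, from λ = .25 up to Λ = 1.75.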
The assumption is made that for the basic numeric operations (+ - * /) between any two floating point values, the operation will yield the floating point value that is nearest to the ``exact'' result.
Take for example the values .875 and 1.25 from above. The product of .875*1.25=1.09375; however, this value does not exist exactly within the floating point representation. The closest representable value is 1.0 which will (or should) be returned as the result of this operation.
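The same round-to-nearest behavior can be seen on real hardware. The fragment below is a sketch assuming 32-bit IEEE default reals (as in the earlier example) rounding to nearest with ties to even; the variable names are illustrative:

      real one, h, u
      one = 1.0
      h   = 2.0**(-24)
      u   = 2.0**(-23)
c     1.0 + 2**(-24) lies exactly halfway between 1.0 and the next
c     representable value, so round-to-nearest (ties to even) gives 1.0
      print *, one + h
c     1.0 + 2**(-23) is itself representable (about 1.00000012)
      print *, one + u
      end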
If x is the result of the basic floating-point operations (+-*/) then the following assumptions will hold:
- If λ ≤ |x| ≤ Λ then x will be set to the nearest allowable floating-point representation.
- If |x| < λ then the result is 0, which gives a ``silent and nondestructive underflow'' (see the sketch after this list).
- If |x| > Λ then the calculation is terminated with a ``fatal overflow.''
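To illustrate the underflow case, the fragment below keeps halving a value until it reaches 0 and counts the steps. This is a sketch, not part of the original example; note that on IEEE hardware underflow is actually gradual, so denormalized values appear below λ before 0 is finally reached:

      real x
      integer n
      x = 1.0
      n = 0
c     halve until the value underflows to zero
      do while (x .gt. 0.0)
         x = x / 2.0
         n = n + 1
      end do
      print *, 'reached zero after', n, ' halvings'
      end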
``The Most Important Fact'' About Floating-Point Number Systems
The spacing between a floating-point number x and an adjacent floating-point number is at least εx/β and at most εx (unless x or the neighbor is 0). Therefore, the floating point representation corresponds to a discrete and finite set of points on the real number line which gets denser nearer the origin until it reaches some limit λ. The relative spacing is proportional to ε.
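The spacing can be examined directly with the Fortran 90 numeric inquiry intrinsics spacing() and epsilon(); this sketch simply prints the absolute and relative gap at a few arbitrarily chosen magnitudes:

      real x
c     spacing(x) is the gap between x and the next representable
c     number; the relative gap spacing(x)/x lies between eps/beta
c     and eps for any normalized x
      x = 1.0
      print *, x, spacing(x), spacing(x) / x
      x = 1000.0
      print *, x, spacing(x), spacing(x) / x
      x = 0.001
      print *, x, spacing(x), spacing(x) / x
      print *, 'machine epsilon =', epsilon(x)
      end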
Tables of Common Floating-Point Representations
All the representations shown are radix 2. The ones listed correspond to the Cray parallel vector processors (PVPs), the Digital Equipment Corporation (DEC) VMS VAX, and the IEEE representation commonly found on workstations and Linux boxes. The first table gives the single precision representation, and the second table gives the double precision representation.
Fortran REAL and C float (Cray double)

REAL                       CRAY             VAX              IEEE
length (bits)              64               32               32
sign bit (s)               yes              yes              yes
exponent (bits)            15               8                8
exponent bias (b)          3FFF             7F               7F
fraction (bits)            48               23               23
hidden bit normalization   no               yes              yes
range low (λ)              3.67x10^-2466    2.93x10^-39      1.175x10^-38
                           (2^-8189)        (2^-128)         (2^-126)
range high (Λ)             2.73x10^2465     1.701x10^38      3.403x10^38
                           (2^8190)         (2^127)          (2^128)
machine epsilon (ε)        7.11x10^-15      5.96x10^-8       1.19x10^-7
digits accuracy            14               7                7
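The IEEE single precision layout just tabulated (1 sign bit, 8 exponent bits with bias 7F, 23 fraction bits) can be unpacked with the Fortran 90 transfer() and ibits() intrinsics. This sketch assumes 32-bit default real and integer types:

      real x
      integer i, s, e, f
      x = -0.1
c     copy the bit pattern of the real into an integer
      i = transfer(x, i)
c     bit 31 = sign, bits 23..30 = biased exponent, bits 0..22 = fraction
      s = ibits(i, 31, 1)
      e = ibits(i, 23, 8)
      f = ibits(i, 0, 23)
      print *, 'sign bit        ', s
      print *, 'biased exponent ', e
      print *, 'exponent - bias ', e - 127
      print *, 'fraction field  ', f
      end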
Fortran DOUBLE PRECISION and C double (Cray long double)

DOUBLE PRECISION           CRAY             VAX              IEEE
length (bits)              128              64               64
sign bit (s)               yes              yes              yes
exponent (bits)            15               8                11
exponent bias (b)          3FFF             7F               3FF
fraction (bits)            96               55               52
hidden bit normalization   no               yes              yes
range low (λ)              3.67x10^-2466    2.93x10^-39      2.23x10^-308
                           (2^-8189)        (2^-128)         (2^-1022)
range high (Λ)             2.73x10^2465     1.701x10^38      1.80x10^308
                           (2^8190)         (2^127)          (2^1024)
machine epsilon (ε)        2.52x10^-29      1.39x10^-17      2.22x10^-16
digits accuracy            29               17               16

The VAX & IEEE use ``hidden bit'' normalization. That is, the first digit in the mantissa is assumed to be 1 and does not need to be ``stored''. The IEEE normalization is slightly different, as noted in the format comparison below.
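The IEEE columns of both tables can be reproduced on the machine at hand with the Fortran 90 numeric inquiry intrinsics; the following sketch just prints the relevant values:

      real sp
      double precision dp
c     inquiry intrinsics use only the type of their argument,
c     so sp and dp need not be given values
      print *, 'single: radix            ', radix(sp)
      print *, 'single: significand bits ', digits(sp)
      print *, 'single: decimal digits   ', precision(sp)
      print *, 'single: tiny (lambda)    ', tiny(sp)
      print *, 'single: huge (Lambda)    ', huge(sp)
      print *, 'single: machine epsilon  ', epsilon(sp)
      print *, 'double: significand bits ', digits(dp)
      print *, 'double: decimal digits   ', precision(dp)
      print *, 'double: tiny (lambda)    ', tiny(dp)
      print *, 'double: huge (Lambda)    ', huge(dp)
      print *, 'double: machine epsilon  ', epsilon(dp)
      end

Note that digits() counts the hidden bit, so it reports 24 and 53 where the tables list 23 and 52 fraction bits, and precision() reports decimal digits rather than bits.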
Cray   (-1)^s x 2^(e-b) x .f
VAX    (-1)^s x 2^(e-b) x .1f
IEEE   (-1)^s x 2^(e-b) x 1.f

The IEEE format also defines:
Inf : e = all one's and f = 0
NaN : e = all one's and f ≠ 0

Sources of IEEE software and information
- SoftFloat - freely available C code that implements the IEC/IEEE floating-point arithmetic in software.
- Sun Microsystems' Numerical Computation Guide, which covers the IEEE Standard in detail.
Brought to you by: R.K. Owen, Ph.D.
This page is http://owen.sj.ca.us/rkowen/howto/fltpt/index.html