
CAVR-4
Basic data types
AVR® IAR C/C++ Compiler
Reference Guide
* Depends on whether the --64bit_doubles option is used, see --64bit_doubles, page 201.
The type long double uses the same precision as double.
32-bit floating-point format
The representation of a 32-bit floating-point number as an integer is:
The value of the number is:
(-1)^S * 2^(Exponent-127) * 1.Mantissa
The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
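The formula above can be checked by splitting a float into its three fields and recomputing the value from them. The following is a minimal sketch, not part of the compiler or its libraries: the names float32_fields, float32_split, and float32_value are hypothetical, and the code assumes that float is 32 bits wide and that the number is a normal (not subnormal, infinite, or NaN) value.

```c
#include <stdint.h>
#include <string.h>

/* Fields of a 32-bit floating-point number (illustrative names). */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30-23, biased by 127 */
    uint32_t mantissa;  /* bits 22-0 */
} float32_fields;

/* Split a float into its fields. Assumes float is 32 bits wide. */
static float32_fields float32_split(float f)
{
    uint32_t bits;
    float32_fields r;

    memcpy(&bits, &f, sizeof bits);  /* copy the raw bit pattern */
    r.sign     = bits >> 31;
    r.exponent = (bits >> 23) & 0xFFu;
    r.mantissa = bits & 0x7FFFFFu;
    return r;
}

/* Recompute (-1)^S * 2^(Exponent-127) * 1.Mantissa for a normal number. */
static double float32_value(float32_fields r)
{
    /* 1.Mantissa: implicit leading 1 plus mantissa / 2^23 */
    double value = 1.0 + (double)r.mantissa / 8388608.0;
    int e = (int)r.exponent - 127;

    while (e > 0) { value *= 2.0; e--; }  /* scale by 2^e, e > 0 */
    while (e < 0) { value /= 2.0; e++; }  /* scale by 2^e, e < 0 */
    return r.sign ? -value : value;
}
```

For example, -6.5f is -1.625 * 2^2, so float32_split returns S = 1, Exponent = 129, and Mantissa = 0x500000, and float32_value gives back -6.5.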
64-bit floating-point format
The representation of a 64-bit floating-point number as an integer is:
The value of the number is:
(-1)^S * 2^(Exponent-1023) * 1.Mantissa
The precision of the float operators (+, -, *, and /) is approximately 15 decimal digits.
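Going the other way, a 64-bit pattern can be assembled from a sign, an exponent, and a mantissa. The sketch below is illustrative only: float64_pack and float64_from_bits are hypothetical helper names, and the code assumes that double is 64 bits wide (on the target, this depends on the --64bit_doubles option mentioned above).

```c
#include <stdint.h>
#include <string.h>

/* Assemble a 64-bit pattern: S in bit 63, Exponent in bits 62-52,
   Mantissa in bits 51-0. Out-of-range inputs are masked. */
static uint64_t float64_pack(uint64_t sign, uint64_t exponent,
                             uint64_t mantissa)
{
    return ((sign & 1u) << 63)
         | ((exponent & 0x7FFu) << 52)
         | (mantissa & 0xFFFFFFFFFFFFFull);
}

/* Reinterpret a 64-bit pattern as a double.
   Assumes double is 64 bits wide. */
static double float64_from_bits(uint64_t bits)
{
    double d;

    memcpy(&d, &bits, sizeof d);  /* copy the raw bit pattern */
    return d;
}
```

With S = 1, Exponent = 1024, and only the top mantissa bit set (0x8000000000000, that is 0.5), the formula gives (-1) * 2^1 * 1.5, so float64_from_bits(float64_pack(1, 1024, 0x8000000000000ull)) is -3.0.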
Special cases
The following applies to both 32-bit and 64-bit floating-point formats:
● Zero is represented by a zero mantissa and a zero exponent. The sign bit signifies
positive or negative zero.
● Infinity is represented by setting the exponent to the highest value and the mantissa
to zero. The sign bit signifies positive or negative infinity.
● Not a number (NaN) is represented by setting the exponent to the highest positive
value and the mantissa to a non-zero value. The value of the sign bit is ignored.
● Subnormal numbers are used for representing values smaller than what can be
represented by normal values. The drawback is that the precision decreases with
smaller values. The exponent is set to 0 to signify that the number is subnormal
(denormalized), even though the number is treated as if the exponent were 1. Unlike
normal numbers, subnormal numbers do not have an implicit 1 as the most
significant bit (MSB) of the mantissa.
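The special cases listed above can be told apart by looking only at the exponent and mantissa fields. The following sketch does this for the 32-bit format; the names fp32_class, fp32_classify, and fp32_subnormal_value are hypothetical and not part of any IAR library. The 64-bit format works the same way with an 11-bit exponent (highest value 0x7FF) and a 52-bit mantissa.

```c
#include <stdint.h>

typedef enum {
    FP32_ZERO, FP32_SUBNORMAL, FP32_NORMAL, FP32_INFINITY, FP32_NAN
} fp32_class;

/* Classify a raw 32-bit pattern. The sign bit is not needed here. */
static fp32_class fp32_classify(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0u)     /* zero or subnormal */
        return mantissa == 0u ? FP32_ZERO : FP32_SUBNORMAL;
    if (exponent == 0xFFu)  /* highest exponent: infinity or NaN */
        return mantissa == 0u ? FP32_INFINITY : FP32_NAN;
    return FP32_NORMAL;
}

/* Value of a subnormal pattern: the exponent is treated as 1 and
   there is no implicit leading 1, so the value is
   (-1)^S * 2^(1-127) * 0.Mantissa. */
static double fp32_subnormal_value(uint32_t bits)
{
    double value = (double)(bits & 0x7FFFFFu) / 8388608.0; /* 0.Mantissa */
    int i;

    for (i = 0; i < 126; i++)  /* scale by 2^(1-127) */
        value /= 2.0;
    return (bits >> 31) ? -value : value;
}
```

For instance, the pattern 0x00000001 is classified as FP32_SUBNORMAL and has the value 2^-149, the smallest positive number the 32-bit format can represent.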
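Figure: bit layout of the 32-bit floating-point format: S (sign) in bit 31, Exponent in bits 30-23, Mantissa in bits 22-0.
Figure: bit layout of the 64-bit floating-point format: S (sign) in bit 63, Exponent in bits 62-52, Mantissa in bits 51-0.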