Atmel Corp. CAVR-4 Manual

Basic data types

AVR® IAR C/C++ Compiler Reference Guide

* Depends on whether the --64bit_doubles option is used, see --64bit_doubles, page 201.

The type long double uses the same precision as double.
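The sizes described above can be checked from code. The following is a minimal sketch, not part of the compiler's library: the function names double_width_bits and long_double_matches_double are hypothetical, and the result depends on the compiler and options in use.

```c
#include <limits.h>

/* Hypothetical helper: width of double in bits for the current build.
   With this compiler the result is expected to depend on whether the
   --64bit_doubles option is used. */
unsigned int double_width_bits(void)
{
    return (unsigned int)(sizeof(double) * CHAR_BIT);
}

/* Hypothetical helper: nonzero if long double occupies the same number
   of bytes as double. Other compilers may give long double a wider format. */
int long_double_matches_double(void)
{
    return sizeof(long double) == sizeof(double);
}
```

Because both results are fixed at compile time, the same expressions can also be used in a build-time check instead of being called at run time.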

32-bit floating-point format

The representation of a 32-bit floating-point number as an integer is:

The value of the number is:

(-1)^S * 2^(Exponent-127) * 1.Mantissa

The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
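The field layout of the 32-bit format can be illustrated by splitting a float into its sign, exponent, and mantissa. This is a minimal sketch under stated assumptions: the type float32_fields and the function float32_split are hypothetical names, and the code assumes that float is 32 bits wide and that stdint.h is available.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time check (assumption): float must be 32 bits wide. */
typedef char float_must_be_32_bits[(sizeof(float) == 4) ? 1 : -1];

/* Hypothetical helper type: the three fields of a 32-bit float. */
typedef struct {
    uint8_t  sign;      /* bit 31 */
    uint8_t  exponent;  /* bits 30 to 23, biased by 127 */
    uint32_t mantissa;  /* bits 22 to 0, without the implicit leading 1 */
} float32_fields;

/* Split a 32-bit float into its sign, exponent, and mantissa fields. */
float32_fields float32_split(float value)
{
    uint32_t bits;
    float32_fields f;

    /* Copy the object representation into an integer of the same size. */
    memcpy(&bits, &value, sizeof bits);

    f.sign     = (uint8_t)(bits >> 31);
    f.exponent = (uint8_t)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFul;
    return f;
}
```

For example, float32_split(-2.5f) yields sign 1, exponent 128, and mantissa 0x200000, which corresponds to (-1)^1 * 2^(128-127) * 1.25.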

64-bit floating-point format

The representation of a 64-bit floating-point number as an integer is:

The value of the number is:

(-1)^S * 2^(Exponent-1023) * 1.Mantissa

The precision of the float operators (+, -, *, and /) is approximately 15 decimal digits.
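The 64-bit layout can be examined in the same way. The sketch below assumes that double is 64 bits wide (that is, that the --64bit_doubles option is in effect) and that a 64-bit integer type is available; float64_fields and float64_split are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Compile-time check (assumption): double must be 64 bits wide. */
typedef char double_must_be_64_bits[(sizeof(double) == 8) ? 1 : -1];

/* Hypothetical helper type: the three fields of a 64-bit double. */
typedef struct {
    uint8_t  sign;      /* bit 63 */
    uint16_t exponent;  /* bits 62 to 52, biased by 1023 */
    uint64_t mantissa;  /* bits 51 to 0, without the implicit leading 1 */
} float64_fields;

/* Split a 64-bit double into its sign, exponent, and mantissa fields. */
float64_fields float64_split(double value)
{
    uint64_t bits;
    float64_fields f;

    /* Copy the object representation into an integer of the same size. */
    memcpy(&bits, &value, sizeof bits);

    f.sign     = (uint8_t)(bits >> 63);
    f.exponent = (uint16_t)((bits >> 52) & 0x7FFu);
    f.mantissa = bits & UINT64_C(0x000FFFFFFFFFFFFF);
    return f;
}
```

The compile-time check makes the build fail if double is only 32 bits wide, rather than letting memcpy read past the end of the value.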

Special cases

The following applies to both 32-bit and 64-bit floating-point formats:

● Zero is represented by zero mantissa and exponent. The sign bit signifies positive or negative zero.

● Infinity is represented by setting the exponent to the highest value and the mantissa to zero. The sign bit signifies positive or negative infinity.

● Not a number (NaN) is represented by setting the exponent to the highest positive value and the mantissa to a non-zero value. The value of the sign bit is ignored.

● Subnormal numbers are used for representing values smaller than what can be represented by normal values. The drawback is that the precision decreases with smaller values. The exponent is set to 0 to signify that the number is subnormal (denormalized), even though the number is treated as if the exponent were 1. Unlike normal numbers, subnormal numbers do not have an implicit 1 as the most significant bit (MSB) of the mantissa.
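The special cases listed above can be told apart by inspecting the exponent and mantissa fields. The following sketch does this for the 32-bit format, taking the raw bit pattern as input; the enumeration float32_class and the function float32_classify are hypothetical names, not part of the compiler's library.

```c
#include <stdint.h>

/* Hypothetical classification of a 32-bit floating-point bit pattern. */
typedef enum {
    F32_ZERO,
    F32_SUBNORMAL,
    F32_NORMAL,
    F32_INFINITY,
    F32_NAN
} float32_class;

/* Classify a raw 32-bit pattern using the exponent and mantissa fields.
   The sign bit (bit 31) does not affect the classification. */
float32_class float32_classify(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;  /* bits 30 to 23 */
    uint32_t mantissa = bits & 0x7FFFFFul;     /* bits 22 to 0 */

    if (exponent == 0u) {
        /* Exponent 0: zero if the mantissa is zero, otherwise subnormal. */
        return (mantissa == 0u) ? F32_ZERO : F32_SUBNORMAL;
    }
    if (exponent == 0xFFu) {
        /* Highest exponent: infinity if the mantissa is zero, otherwise NaN. */
        return (mantissa == 0u) ? F32_INFINITY : F32_NAN;
    }
    return F32_NORMAL;
}
```

The same tests apply to the 64-bit format with an 11-bit exponent field (highest value 0x7FF) and a 52-bit mantissa field.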

Bit layout of the 32-bit floating-point format:

Bit 31: S (sign)

Bits 30 to 23: Exponent

Bits 22 to 0: Mantissa

Bit layout of the 64-bit floating-point format:

Bit 63: S (sign)

Bits 62 to 52: Exponent

Bits 51 to 0: Mantissa