Overview

The IEEE floating-point standard defines an encoding used to represent numbers of the form $V = (-1)^s \times M \times 2^E$, where $s$ denotes the sign bit, $M$ the significand, and $E$ the exponent. The binary representation of a floating-point number is segmented into three fields: the sign bit, the exponent field, and the fraction field. These fields are interpreted according to one of three classes:

  • Normalized Form
    • Here the exponent field is neither all 0s nor all 1s.
    • The significand is $M = 1 + f$, where $f$ denotes the fractional part.
    • The exponent is $E = e - \text{Bias}$, where $e$ is the unsigned interpretation of the exponent field.
  • Denormalized Form
    • Here the exponent field is all 0s.
    • The significand is $M = f$, where $f$ denotes the fractional part.
    • The exponent is $E = 1 - \text{Bias}$, chosen so that the transition between denormalized and normalized values is smooth.
  • Special Values
    • Here the exponent field is all 1s.
    • If the fraction field is all 0s, we have an infinite value ($+\infty$ or $-\infty$, depending on the sign bit).
    • If the fraction field is not all 0s, we have NaN ("not a number").
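
The three classes above can be told apart by looking only at the exponent and fraction fields. The following is a minimal sketch, assuming that `float` is a 32-bit IEEE single-precision value with a 1-bit sign, an 8-bit exponent field, and a 23-bit fraction field. The names `float_fields`, `float_kind`, `split_float`, and `classify_fields` are hypothetical and chosen only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a 32-bit float (assumed layout: 1 / 8 / 23 bits). */
typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, unsigned interpretation */
    uint32_t fraction;  /* 23 bits */
} float_fields;

typedef enum {
    KIND_NORMALIZED,
    KIND_DENORMALIZED,
    KIND_INFINITY,
    KIND_NAN
} float_kind;

/* Copy the bit pattern out of the float and slice it into fields. */
static float_fields split_float(float x)
{
    uint32_t bits;
    float_fields f;

    memcpy(&bits, &x, sizeof bits);   /* avoids pointer-cast aliasing issues */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}

/* Apply the class rules: exponent all 1s, all 0s, or anything else. */
static float_kind classify_fields(float_fields f)
{
    if (f.exponent == 0xFFu)
        return f.fraction == 0 ? KIND_INFINITY : KIND_NAN;
    if (f.exponent == 0)
        return KIND_DENORMALIZED;
    return KIND_NORMALIZED;
}
```

Note that zero has an all-0s exponent field, so this sketch reports it as denormalized.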

The $\text{Bias}$ in the first two forms is set to $2^{k-1} - 1$, where $k$ denotes the number of bits that make up the exponent field. In C, the fields have the following widths:

Declaration   Sign Bit   Exponent Field   Fraction Field
float         1          8                23
double        1          11               52

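Here, the precision of a floating-point type refers to the number of bits in the fraction field: 23 for float and 52 for double. Normalized values carry one additional implied leading 1 bit, so their significands are effectively 24 and 53 bits wide.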
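
To make the formulas concrete, the sketch below rebuilds the value of a single-precision number from its three fields, using $\text{Bias} = 2^{k-1} - 1$ with $k = 8$. It assumes the float widths listed above, handles only the normalized and denormalized forms (the caller is expected to filter out an all-1s exponent field), and returns a `double` because every float value is exactly representable in that type. The names `power_of_two` and `decode_float_fields` are hypothetical.

```c
#include <stdint.h>

enum { EXP_BITS = 8, FRAC_BITS = 23 };   /* widths of the float fields */

/* Exact 2^e for the small exponents used here, without needing libm. */
static double power_of_two(int e)
{
    double r = 1.0;
    while (e > 0) { r *= 2.0; e--; }
    while (e < 0) { r /= 2.0; e++; }
    return r;
}

/* Value of a float given its fields.
   Precondition: exp_field is not all 1s (no infinities or NaNs). */
static double decode_float_fields(uint32_t sign, uint32_t exp_field,
                                  uint32_t frac_field)
{
    const int bias = (1 << (EXP_BITS - 1)) - 1;                 /* 127 */
    double f = (double)frac_field / (double)(1u << FRAC_BITS);  /* 0 <= f < 1 */
    double m;
    int e;

    if (exp_field == 0) {          /* denormalized: M = f, E = 1 - Bias */
        m = f;
        e = 1 - bias;
    } else {                       /* normalized: M = 1 + f, E = e - Bias */
        m = 1.0 + f;
        e = (int)exp_field - bias;
    }

    double v = m * power_of_two(e);
    return sign ? -v : v;
}
```

For example, `decode_float_fields(0, 0, 0x7FFFFF)` gives the largest denormalized value, which sits exactly one step of $2^{-149}$ below the smallest normalized value `decode_float_fields(0, 1, 0)`. This is the smooth transition that motivates $E = 1 - \text{Bias}$.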

Rounding

Because floating-point arithmetic can’t represent every real number, it must round results to the “nearest” representable number, however “nearest” is defined. The IEEE floating-point standard defines four rounding modes to influence this behavior:

  • Round-to-even (the default mode) rounds to the closest representable value. When a value lies exactly halfway between two representable values, it rounds to the one whose least significant digit is even.
  • Round-toward-zero rounds positive values downward and negative values upward.
  • Round-down always rounds downward (toward $-\infty$).
  • Round-up always rounds upward (toward $+\infty$).
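
The four modes can be illustrated in software on a sign-and-magnitude binary value. The sketch below is not how hardware or the C library implements rounding; it is a hypothetical helper, `round_magnitude`, that discards the low `drop` bits of an unsigned magnitude and adjusts the kept bits according to the chosen mode. The enum `round_mode` and its constants are likewise names invented for this example. In a real program, the rounding mode of the floating-point unit is selected with `fesetround` from `<fenv.h>`.

```c
#include <stdint.h>

typedef enum {
    ROUND_TO_EVEN,
    ROUND_TOWARD_ZERO,
    ROUND_DOWN,
    ROUND_UP
} round_mode;

/* Round a sign-and-magnitude value to a multiple of 2^drop.
   `mag` is the magnitude, `negative` is nonzero for negative values,
   and the result is the rounded magnitude expressed in units of 2^drop.
   Precondition: 1 <= drop <= 63. */
static uint64_t round_magnitude(uint64_t mag, int negative,
                                unsigned drop, round_mode mode)
{
    uint64_t kept = mag >> drop;                         /* truncated result */
    uint64_t rest = mag & ((UINT64_C(1) << drop) - 1);   /* discarded bits */
    uint64_t half = UINT64_C(1) << (drop - 1);           /* exact midpoint */
    int inexact = (rest != 0);

    switch (mode) {
    case ROUND_TO_EVEN:
        /* Above the midpoint, or at the midpoint with an odd kept value. */
        if (rest > half || (rest == half && (kept & 1)))
            kept++;
        break;
    case ROUND_TOWARD_ZERO:
        /* Truncating the magnitude already moves toward zero. */
        break;
    case ROUND_DOWN:
        /* Toward -infinity: negative values grow in magnitude. */
        if (negative && inexact)
            kept++;
        break;
    case ROUND_UP:
        /* Toward +infinity: positive values grow in magnitude. */
        if (!negative && inexact)
            kept++;
        break;
    }
    return kept;
}
```

With `drop = 1`, a magnitude of 5 stands for 2.5 and a magnitude of 7 stands for 3.5. Round-to-even maps them to 2 and 4 respectively, while round-down maps -2.5 to a magnitude of 3 (that is, -3) and round-up maps it to a magnitude of 2 (that is, -2).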

Arithmetic

Bibliography