IEEE 754 Single Precision Basics

Learn the fundamentals of IEEE 754 single precision (float32) format. Understand the 32-bit layout with 1 sign bit, 8 exponent bits, and 23 mantissa bits.

Fundamentals

Decimal Value: 1.0
Float32 Hex: 0x3F800000
Float64 Hex: 0x3FF0000000000000
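The hex encodings of 1.0 listed above can be reproduced with Python's standard struct module. This is a minimal sketch; the helper name float_hex is made up for illustration, and the ">" prefix requests big-endian byte order so the hex digits read most significant byte first.

```python
import struct

def float_hex(value, fmt):
    """Return the encoding of value as a hex string (hypothetical helper).

    fmt is a struct format: ">f" for float32, ">d" for float64.
    """
    return "0x" + struct.pack(fmt, value).hex().upper()

print(float_hex(1.0, ">f"))  # 0x3F800000
print(float_hex(1.0, ">d"))  # 0x3FF0000000000000
```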

Detailed Explanation

IEEE 754 single precision, commonly exposed as float in C/C++ and Java, is a 32-bit binary format for representing floating-point numbers. Every single-precision value is encoded using exactly 32 bits divided into three fields:

Field     Bits     Position
Sign      1 bit    Bit 31 (most significant)
Exponent  8 bits   Bits 30-23
Mantissa  23 bits  Bits 22-0 (least significant)
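As a rough illustration of this layout, the three fields can be pulled out of a value with shifts and masks. The sketch below uses Python's struct module to reinterpret the 4 bytes of a float32 as an unsigned integer; the function name float32_fields is an assumption chosen for this example.

```python
import struct

def float32_fields(value):
    """Split the float32 encoding of value into (sign, exponent, mantissa)."""
    # Pack as a big-endian float32, then read the same 4 bytes as an unsigned 32-bit int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # bit 31
    exponent = (bits >> 23) & 0xFF  # bits 30-23 (stored, biased value)
    mantissa = bits & 0x7FFFFF      # bits 22-0
    return sign, exponent, mantissa

print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.5))  # (1, 128, 2097152)
```

For -2.5 the sign bit is 1, the stored exponent 128 means 2^1, and the mantissa 2097152 / 2^23 = 0.25 gives a significand of 1.25.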

How the value 1.0 is encoded:

  1. Sign bit = 0 (positive number)
  2. Exponent = 127 (stored as biased exponent; actual exponent = 127 - 127 = 0)
  3. Mantissa = 000...0 (all 23 fraction bits are zero; with the implicit leading 1 the significand is exactly 1.0)

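For normalized numbers the formula is: (-1)^sign x (1 + mantissa / 2^23) x 2^(exponent - 127), where mantissa is the 23-bit field read as an integer and exponent is the stored (biased) value.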

For 1.0: (-1)^0 x (1 + 0) x 2^(127 - 127) = 1 x 1 x 1 = 1.0

In hex this is 0x3F800000, and in binary: 0 01111111 00000000000000000000000.
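The decoding of 0x3F800000 can be checked with a short sketch that applies the formula to the three fields directly. The name decode_normalized is hypothetical, and the function assumes a normalized number (stored exponent between 1 and 254); it does not handle zero, subnormals, infinity, or NaN.

```python
import struct

def decode_normalized(bits):
    """Apply (-1)^sign x (1 + fraction) x 2^(exponent - 127) to a 32-bit pattern."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    fraction = mantissa / 2**23  # 23-bit field scaled into [0, 1)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

bits = 0x3F800000
print(f"{bits:032b}")           # 00111111100000000000000000000000
print(decode_normalized(bits))  # 1.0

# Cross-check against the decoding done by the struct module.
(expected,) = struct.unpack(">f", bits.to_bytes(4, "big"))
assert decode_normalized(bits) == expected
```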

Precision and range:

Single precision carries 24 bits of significand precision (23 stored bits plus the implicit leading 1), which corresponds to approximately 7 decimal digits. The smallest positive normalized value is about 1.18e-38 and the largest finite value is about 3.40e+38. For many graphics and gaming applications, 32-bit floats offer sufficient precision while using half the memory of 64-bit doubles.
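These limits can be observed by rounding Python floats (which are 64-bit doubles) through the float32 format. The sketch below is illustrative only; to_float32 and from_bits are hypothetical helper names built on the standard struct module.

```python
import struct

def to_float32(value):
    """Round a Python float (float64) to the nearest float32 and widen it back."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

def from_bits(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

# Precision: 0.1 is only approximated, and 2**24 + 1 cannot be represented.
print(to_float32(0.1))         # 0.10000000149011612
print(to_float32(16777217.0))  # 16777216.0

# Range: smallest positive normalized value and largest finite value.
print(from_bits(0x00800000))   # 1.1754943508222875e-38
print(from_bits(0x7F7FFFFF))   # 3.4028234663852886e+38
```

The value 0.1 agrees with its float32 approximation to about 8 significant digits, and integers above 2^24 start to lose their lowest bit, which is where the "about 7 decimal digits" rule of thumb comes from.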

Bias encoding:

The exponent uses a bias of 127. For normalized numbers, the stored value is the actual exponent plus 127. This allows the exponent to represent both positive and negative powers of 2 without needing a separate sign bit for the exponent. Stored values range from 1 to 254 for normalized numbers, giving actual exponents from -126 to +127. A stored value of 0 is reserved for zero and subnormal numbers, and 255 is reserved for infinity and NaN.
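A small sketch can make the bias concrete by printing the stored exponent next to the actual exponent for a few powers of 2. The helper name stored_exponent is an assumption for this example.

```python
import struct

def stored_exponent(value):
    """Return the 8-bit biased exponent field of value's float32 encoding."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return (bits >> 23) & 0xFF

for value in (0.25, 1.0, 8.0):
    stored = stored_exponent(value)
    print(value, stored, stored - 127)
# 0.25 125 -2
# 1.0 127 0
# 8.0 130 3

# The two reserved field values:
print(stored_exponent(0.0))           # 0
print(stored_exponent(float("inf")))  # 255
```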

Use Case

Understanding single precision is essential for GPU programming, embedded systems development, and any application where memory-efficient floating-point storage is needed. Graphics shaders, machine learning inference, and sensor data processing commonly use float32.
