IEEE 754 Single Precision Basics
Learn the fundamentals of IEEE 754 single precision (float32) format. Understand the 32-bit layout with 1 sign bit, 8 exponent bits, and 23 mantissa bits.
Decimal Value
1.0
Float32 Hex
0x3F800000
Float64 Hex
0x3FF0000000000000
Detailed Explanation
IEEE 754 single precision, commonly called float in C, C++, and Java (where Float is the wrapper class for the primitive float), is a 32-bit binary format for representing floating-point numbers. Every single-precision value is encoded using exactly 32 bits divided into three fields.
The three fields:
| Field | Bits | Position |
|---|---|---|
| Sign | 1 bit | Bit 31 (most significant) |
| Exponent | 8 bits | Bits 30-23 |
| Mantissa | 23 bits | Bits 22-0 (least significant) |
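The field layout in the table can be checked by reinterpreting a float32 bit pattern as an integer and masking out each field. The following is a minimal sketch using Python's standard struct module; the helper name float32_fields is hypothetical and chosen only for illustration.

```python
import struct

def float32_fields(value):
    """Split a float32 into (sign, stored_exponent, mantissa_bits).

    float32_fields is an illustrative helper, not a standard function.
    """
    # Pack as a big-endian float32, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23
    mantissa = bits & 0x7FFFFF       # bits 22-0
    return sign, exponent, mantissa

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
```

The second call shows all three fields in use: -2.5 is -1.25 x 2^1, so the sign bit is 1, the stored exponent is 1 + 127 = 128, and the mantissa field holds 0.25 x 2^23 = 2097152.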
How the value 1.0 is encoded:
- Sign bit = 0 (positive number)
- Exponent = 127 (stored as biased exponent; actual exponent = 127 - 127 = 0)
- Mantissa = 000...0 (all 23 fraction bits are zero; together with the implicit leading 1, the significand is exactly 1.0)
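For normalized numbers the formula is: (-1)^sign x (1 + mantissa / 2^23) x 2^(exponent - 127), where mantissa is the 23-bit field read as an unsigned integer and exponent is the stored (biased) 8-bit value.
For 1.0: (-1)^0 x (1 + 0 / 2^23) x 2^(127 - 127) = 1 x 1 x 1 = 1.0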
In hex this is 0x3F800000, and in binary: 0 01111111 00000000000000000000000.
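To make the encoding concrete, the sketch below applies the formula to a raw 32-bit pattern and also goes the other way, from a value to its hex encoding. The function names decode_normalized and float32_hex are assumptions made for this example; the code handles only normalized numbers and rejects the reserved exponent values.

```python
import struct

def decode_normalized(bits):
    """Evaluate (-1)^sign x (1 + mantissa / 2^23) x 2^(exponent - 127).

    Illustrative helper; valid only for normalized numbers.
    """
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent in (0, 255):
        raise ValueError("exponent 0 and 255 are reserved, not normalized")
    return (-1) ** sign * (1 + mantissa / 2**23) * 2.0 ** (exponent - 127)

def float32_hex(value):
    """Return the float32 encoding of value as an 8-digit hex string."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return "0x%08X" % bits

print(decode_normalized(0x3F800000))  # 1.0
print(float32_hex(1.0))               # 0x3F800000
```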
Precision and range:
Single precision has a 24-bit significand (23 stored bits plus the implicit leading 1), which gives approximately 7 decimal digits of precision. The smallest positive normalized value is about 1.18e-38 and the largest finite value is about 3.40e+38. For many graphics and gaming applications, 32-bit floats offer sufficient precision while using half the memory of 64-bit doubles.
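These limits can be observed directly. The sketch below rounds Python floats (which are 64-bit doubles) to float32 and back, and builds the two range limits from their bit patterns. The helper names to_float32 and from_bits are hypothetical, and the bit patterns 0x00800000 and 0x7F7FFFFF are the standard encodings of the smallest normalized and largest finite values.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest float32, returned as a Python float."""
    return struct.unpack(">f", struct.pack(">f", value))[0]

def from_bits(bits):
    """Interpret a 32-bit integer as a float32 bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Integers are exact only up to 2^24; 2^24 + 1 does not fit in 24 significant bits.
print(to_float32(16777216.0))   # 16777216.0
print(to_float32(16777217.0))   # 16777216.0

# 0.1 is stored with less precision in float32 than in float64.
print(to_float32(0.1) == 0.1)   # False

smallest_normal = from_bits(0x00800000)  # stored exponent 1, mantissa 0
largest_finite = from_bits(0x7F7FFFFF)   # stored exponent 254, mantissa all ones
print(f"{smallest_normal:.2e}")  # 1.18e-38
print(f"{largest_finite:.2e}")   # 3.40e+38
```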
Bias encoding:
The exponent uses a bias of 127. For normalized numbers, the stored value is the actual exponent plus 127. This allows the exponent field to represent both positive and negative powers of 2 without needing a separate sign bit for the exponent. Stored values range from 1 to 254 for normalized numbers, which corresponds to actual exponents from -126 to +127. The stored values 0 and 255 are reserved: 0 encodes zero and subnormal numbers, and 255 encodes infinity and NaN.
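As a rough illustration of the bias, the sketch below reads the stored exponent of several powers of 2 and subtracts 127 to recover the actual exponent. The name stored_exponent is an assumption made for this example.

```python
import struct

BIAS = 127

def stored_exponent(value):
    """Return the 8-bit biased exponent field of value encoded as float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return (bits >> 23) & 0xFF

# Powers of 2 across the normalized range: stored = actual + 127.
for actual in (-126, -1, 0, 1, 127):
    stored = stored_exponent(2.0 ** actual)
    print(actual, stored, stored - BIAS)

# Reserved stored values.
print(stored_exponent(0.0))           # 0
print(stored_exponent(float("inf")))  # 255
```

The loop prints stored values 1, 126, 127, 128, and 254, covering both ends of the normalized range.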
Use Case
Understanding single precision is essential for GPU programming, embedded systems development, and any application where memory-efficient floating-point storage is needed. Graphics shaders, machine learning inference, and sensor data processing commonly use float32.