INT8 Quantization

Convert FP32 weights to 8-bit integers with no perceptible loss of accuracy.

Overview

Standard neural networks use FP32 (32-bit floating point) for weights and activations. INT8 quantization reduces this to 8-bit integers, providing a 4x reduction in model size and significant speedup on hardware with INT8 tensor cores.
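
To illustrate the arithmetic (a simplified sketch, not necessarily the exact scheme edge-ai applies internally), a symmetric per-tensor quantizer maps each FP32 weight to an 8-bit integer using a single scale factor, which is where the 4x size reduction comes from:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: FP32 weights -> int8 values plus one scale."""
    scale = np.abs(weights).max() / 127.0                     # map the largest |w| to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)              # toy FP32 weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)          # 0.25 -> the 4x size reduction
print(np.abs(w - w_hat).max())      # worst-case rounding error introduced
```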

Performance Boost

4.2x faster inference on specialized hardware such as NVIDIA Jetson (via TensorRT) or Apple M2 (via Core ML).

Accuracy Retention

99.8% accuracy parity with the FP32 model, maintained through our calibration algorithms (see Calibration Datasets below).

How to Use

```
edge-ai compress ./model.onnx --method int8
```
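
After compression you can sanity-check the quantized model with ONNX Runtime. The output filename and the input shape below are assumptions for illustration; substitute whatever file the compress command produced and your model's real input dimensions:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical output path: point this at whatever file `edge-ai compress` wrote.
session = ort.InferenceSession("model.int8.onnx")

input_meta = session.get_inputs()[0]
print(input_meta.name, input_meta.shape)                      # inspect the expected input

# Dummy batch; 1x3x224x224 is an assumption, use your model's real input shape.
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_meta.name: dummy})
print(outputs[0].shape)
```
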
Calibration Datasets

For the best results, provide a representative sample of your input data (100-500 images or text samples) to help the optimizer calculate optimal clamping ranges.
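
For context, one common way a calibrator turns those samples into clamping ranges (not necessarily the exact algorithm edge-ai uses) is to record activation magnitudes while running the representative data through the FP32 model, then clamp at a high percentile so rare outliers do not stretch the INT8 scale:

```python
import numpy as np

def calibrate_clamp_range(activations, percentile: float = 99.9):
    """Derive a symmetric clamping range from activations gathered on calibration data.

    `activations` is a list of np.ndarray batches collected while running the
    representative samples through the FP32 model.
    """
    flat = np.concatenate([a.ravel() for a in activations])
    clamp = float(np.percentile(np.abs(flat), percentile))    # ignore extreme outliers
    scale = clamp / 127.0                                      # resulting INT8 scale
    return clamp, scale

# Toy stand-in for activations produced by ~200 calibration samples at one layer.
batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(200)]
clamp, scale = calibrate_clamp_range(batches)
print(f"clamp activations to +/-{clamp:.3f}, scale = {scale:.5f}")
```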