INT8 Quantization
Convert FP32 weights to 8-bit integers without a meaningful loss of accuracy.
Overview
Standard neural networks store weights and activations in FP32 (32-bit floating point). INT8 quantization reduces them to 8-bit integers, cutting model size by 4x and delivering significant speedups on hardware with INT8 tensor cores.
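To make the mapping concrete, here is a minimal NumPy sketch of affine INT8 quantization and dequantization. It is illustrative only: the function names and the min/max range choice are assumptions, not the edge-ai implementation.

import numpy as np

def quantize_int8(weights: np.ndarray):
    # Illustrative affine quantization: map the observed float range
    # onto the signed 8-bit range [-128, 127].
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = np.round(-128 - w_min / scale).astype(np.int32)
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    # Recover approximate FP32 values from the INT8 representation.
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale, zp = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize_int8(q, scale, zp)).max())

Because INT8 stores each value in 1 byte instead of 4, this mapping is where the 4x size reduction comes from; the dequantization error is bounded by roughly half the scale step.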
Performance Boost
4.2x faster inference on specialized hardware such as NVIDIA Jetson (TensorRT) or Apple M2 (CoreML).
Accuracy Retention
99.8% parity with the original FP32 model, maintained through our advanced calibration algorithms.
How to Use
edge-ai compress ./model.onnx --method int8
Calibration Datasets
For best results, provide a representative sample of your input data (100-500 images or text samples); the optimizer uses it to calculate optimal clamping ranges.
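As an illustration of what the calibration step computes, here is a rough sketch of percentile-based range estimation. The function name, the percentile values, and the use of NumPy are assumptions for illustration; the edge-ai optimizer may use a different calibration algorithm.

import numpy as np

def calibrate_clamping_range(activations, lower_pct=0.1, upper_pct=99.9):
    # Estimate clamping bounds from calibration data. Percentile clipping
    # discards rare outliers so the INT8 range is not wasted on values
    # that almost never occur.
    flat = np.concatenate([a.ravel() for a in activations])
    lo = np.percentile(flat, lower_pct)
    hi = np.percentile(flat, upper_pct)
    scale = (hi - lo) / 255.0  # one INT8 step, in float units
    return lo, hi, scale

# 200 hypothetical calibration batches of layer activations
batches = [np.random.randn(32, 128).astype(np.float32) for _ in range(200)]
lo, hi, scale = calibrate_clamping_range(batches)
print(f"clamp to [{lo:.3f}, {hi:.3f}], scale={scale:.5f}")

A representative sample matters because the clamping range is derived from the data you provide: if the calibration set does not cover the activations seen in production, the computed ranges will clip or waste precision.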