INT4 Mixed Precision
Push the boundaries of model compression by mixing 4-bit and 8-bit precision, reserving the higher precision for sensitive layers.
Why Mixed Precision?
Not all layers are created equal. While some layers can be quantized to 4-bit with negligible accuracy loss, others (such as attention heads in Transformers) require higher precision. Our optimizer automatically identifies these sensitive layers.
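As an illustration of the idea, the minimal Python sketch below maps each layer to a bit width from a precomputed sensitivity score. It is not the edge-ai internals; the layer names and the 0.1 threshold are assumptions.

def assign_bit_widths(sensitivity, threshold=0.1):
    """Keep sensitive layers at 8-bit, push everything else down to 4-bit."""
    return {name: 8 if score > threshold else 4
            for name, score in sensitivity.items()}

# Attention projections typically score as more sensitive than MLP weights.
sensitivity = {
    "layers.0.attn.q_proj": 0.32,
    "layers.0.attn.k_proj": 0.27,
    "layers.0.mlp.up_proj": 0.04,
    "layers.0.mlp.down_proj": 0.06,
}

print(assign_bit_widths(sensitivity))
# {'layers.0.attn.q_proj': 8, 'layers.0.attn.k_proj': 8,
#  'layers.0.mlp.up_proj': 4, 'layers.0.mlp.down_proj': 4}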
Layer-wise Optimization
Automatic sensitivity analysis across 1,000+ layers in real time.
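One way such a per-layer score could be computed is sketched below: quantize each weight matrix to 4-bit symmetric integers and record the relative reconstruction error. This is an assumed, simplified approach for illustration, not necessarily the analysis edge-ai performs.

import numpy as np

def quantize(w, bits=4):
    """Symmetric uniform quantization of a weight matrix to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def sensitivity(w, bits=4):
    """Relative reconstruction error after quantization; higher means more sensitive."""
    return np.linalg.norm(w - quantize(w, bits)) / np.linalg.norm(w)

rng = np.random.default_rng(0)
weights = {
    "attn.q_proj": rng.normal(size=(64, 64)),
    "mlp.up_proj": rng.normal(size=(64, 256)),
}
print({name: round(sensitivity(w), 4) for name, w in weights.items()})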
Compression Command
Enable mixed precision by selecting the mixed-precision method and passing the --auto-bit flag:
edge-ai compress ./llama-2.pt --method mixed-precision --auto-bit
LLM Focus
Perfect for large language models where VRAM is the primary constraint: 7B-parameter models can run on devices with 6 GB of VRAM.
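As a back-of-envelope check of that claim (assuming weight memory dominates and an average of roughly 4.5 bits per weight once a few layers are kept at 8-bit):

params = 7e9
avg_bits = 4.5                             # assumed mix of mostly 4-bit with some 8-bit layers
weight_gb = params * avg_bits / 8 / 1e9
print(f"~{weight_gb:.1f} GB of weights")   # ~3.9 GB, leaving headroom on a 6 GB device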
Vision Pipelines
Ideal for high-throughput video streams, where 4-bit weights free up memory for much larger inference batches.
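A rough illustration of that batch-size headroom, with every number assumed for the example (GPU size, backbone size, per-frame activation footprint):

vram_gb = 8.0                   # hypothetical edge GPU
weights_8bit_gb = 4.0           # hypothetical vision backbone stored at 8-bit
weights_4bit_gb = weights_8bit_gb / 2
act_per_frame_gb = 0.05         # assumed activation footprint per frame

def max_batch(weights_gb):
    """Frames that fit once weights are loaded (activations only, no safety margin)."""
    return int((vram_gb - weights_gb) / act_per_frame_gb)

print(max_batch(weights_8bit_gb), "frames per batch at 8-bit")   # 80
print(max_batch(weights_4bit_gb), "frames per batch at 4-bit")   # 120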