
DistillKit
A PyTorch toolkit for compressing models through knowledge distillation and quantization.
Overview
In CIFAR-10 experiments, DistillKit retained over 96% of the teacher model's accuracy while reducing model size by nearly 99%, enabling deployment on tightly constrained edge devices.
Approach
Knowledge Distillation
DistillKit implements Hinton's temperature-scaled distillation, combining soft targets from the teacher with the original hard labels. In the default configuration, a ResNet152 teacher trained for 50 epochs transfers its learned representations to a MobileNetV2 student trained for 20 epochs. The distillation process uses a temperature of 4.0 and a blending factor (α) of 0.7 to balance soft and hard losses.
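To make the objective concrete, here is a minimal sketch of a temperature-scaled distillation loss in PyTorch, using the default temperature of 4.0 and α of 0.7; the function name and signature are illustrative, not DistillKit's actual API.

```python
import torch.nn.functional as F

# Minimal sketch of the combined distillation objective (hypothetical helper,
# not DistillKit's API). T and alpha default to the values used in the pipeline.
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft-target term: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # alpha = 0.7 weights the soft targets more heavily than the hard labels.
    return alpha * soft + (1.0 - alpha) * hard
```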
Quantization
The pipeline supports both post-training quantization (PTQ) and quantization-aware training (QAT). Models are quantized to INT8 precision, with layer fusion and batch normalization folding to minimize accuracy loss. QAT simulates quantized arithmetic during training, so the model learns to compensate for the reduced precision.
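The eager-mode PyTorch flow below sketches how layer fusion and QAT fit together, using a toy conv-BN-ReLU block as a stand-in for the quantizable MobileNetV2 student; module names and the omitted fine-tuning loop are placeholders rather than DistillKit's exact code.

```python
import torch
import torch.nn as nn
import torch.quantization as tq

# Toy quantizable block standing in for the QAT-ready MobileNetV2 student.
class TinyQuantizable(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # tensors enter the quantized domain here
        self.conv = nn.Conv2d(3, 16, 3, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()  # tensors return to float here

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

model = TinyQuantizable()
model.eval()
# Fold BatchNorm into the preceding conv and fuse the ReLU so the block is
# quantized as a single unit.
tq.fuse_modules(model, [["conv", "bn", "relu"]], inplace=True)

model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)      # insert fake-quant observers that simulate INT8

# ... fine-tune here so the weights adapt to the simulated quantization ...

model.eval()
int8_model = tq.convert(model)           # produce the real INT8 model for deployment
print(int8_model)
```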
Optimization
Deployment performance is enhanced through layer fusion, batch normalization folding, and dynamic quantization for RNN layers. Optional advanced augmentations such as CutMix and MixUp are available to improve generalization.
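For the recurrent case, dynamic quantization in PyTorch looks roughly like the snippet below; the toy LSTM classifier is only a stand-in for a real model.

```python
import torch
import torch.nn as nn

# Toy recurrent classifier used to illustrate dynamic quantization of RNN layers.
class TinyRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, 10)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])       # classify from the last timestep

model = TinyRNN()
# Weights are stored in INT8; activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(model, {nn.LSTM, nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 10, 128)              # (batch, sequence, features)
print(quantized(x).shape)                # torch.Size([4, 10])
```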
Architecture
Teacher Model
ResNet152 with ImageNet pretraining, adapted for CIFAR-10's 32×32 input. Provides rich feature representations for distillation.
Student Model
MobileNetV2 optimized for CIFAR-10, with ~2.3M parameters compared to the teacher's ~60M. Uses depthwise separable convolutions for efficiency.
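As a sketch of how such models are commonly set up, the snippet below adapts torchvision's ResNet152 and MobileNetV2 to CIFAR-10's 32×32 inputs and 10 classes; DistillKit's exact modifications may differ.

```python
import torch.nn as nn
from torchvision import models

# Common CIFAR-10 adaptations of ImageNet architectures (a sketch, not
# necessarily DistillKit's exact teacher/student definitions).
def cifar_resnet152(num_classes=10):
    m = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    # Replace the 7x7/stride-2 stem with a 3x3/stride-1 conv and drop the max-pool
    # so 32x32 inputs are not downsampled too aggressively.
    m.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    m.maxpool = nn.Identity()
    m.fc = nn.Linear(m.fc.in_features, num_classes)          # 10-way head
    return m

def cifar_mobilenet_v2(num_classes=10):
    m = models.mobilenet_v2(weights=None)
    m.classifier[1] = nn.Linear(m.classifier[1].in_features, num_classes)
    return m

teacher = cifar_resnet152()
student = cifar_mobilenet_v2()
print(sum(p.numel() for p in student.parameters()) / 1e6, "M student parameters")
```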
Quantizable Student
MobileNetV2 modified for quantization, with fused layers and QAT-ready modules. Supports both static and quantization-aware workflows.
Training produces multiple artifacts, including the teacher model, distilled student, baseline student, and both QAT and static quantized versions. Evaluation metrics, confusion matrices, and latency/throughput profiles are logged for analysis.
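The latency and throughput numbers can be collected with a simple timing loop such as the one below; this is generic measurement code, not DistillKit's built-in profiler.

```python
import time
import torch

# Generic CPU latency/throughput measurement for a trained model (illustrative only).
def profile(model, input_shape=(1, 3, 32, 32), warmup=10, iters=100):
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up passes are not timed
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000
    throughput = input_shape[0] * iters / elapsed  # images per second
    return latency_ms, throughput
```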
Results
Performance Summary
Teacher (ResNet152): ~96.6% accuracy, ~240MB size
Distilled Student (MobileNetV2): ~94% accuracy, ~9MB size — 27× smaller, 3× faster
QAT Quantized Student: ~93.5% accuracy, ~2.96MB size — 81× smaller, 5× faster
Baseline Student (MobileNetV2, no distillation): ~91% accuracy, ~9MB size — 27× smaller, 3× faster
These results demonstrate that knowledge distillation combined with quantization can retain high accuracy while achieving massive efficiency gains, enabling real-time inference on constrained hardware.
Potential Applications
While CIFAR-10 serves as the benchmark task, the methods extend to other domains — including real-world edge AI deployments like wildlife detection, mobile NLP, and embedded vision systems.
Future Directions
DistillKit is built for extensibility. The modular structure makes it easy to experiment with new teacher/student architectures, alternative loss functions, and custom quantization schemes. Future work includes applying the pipeline to specialized datasets for low-power edge deployments, such as wildfire detection in remote sensor networks.