DistillKit - Knowledge Distillation Framework


A PyTorch toolkit for compressing models through knowledge distillation and INT8 quantization.

Overview

In CIFAR-10 experiments, the quantized student retained over 96% of the teacher model's accuracy (93.5% vs. 96.6%) while reducing model size by nearly 99% (from ~240 MB to under 3 MB), enabling high-performance inference on constrained edge devices.

Pipeline: Teacher → soft targets + hard targets → Student (FP32) → Quantized student (INT8)

Approach

Knowledge Distillation

DistillKit implements Hinton's temperature-scaled distillation, combining soft targets from the teacher with the original hard labels. In the default configuration, a ResNet152 teacher trained for 50 epochs transfers its learned representations to a MobileNetV2 student trained for 20 epochs. The distillation process uses a temperature of 4.0 and a blending factor (α) of 0.7 to balance soft and hard losses.
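
As a minimal sketch of this objective (the function name and reduction choice are illustrative, not DistillKit's actual API), the blended loss can be written as:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Temperature-scaled distillation loss: alpha * soft + (1 - alpha) * hard."""
    # Soft-target term: KL divergence between softened distributions.
    # The T^2 factor is Hinton's correction so soft-target gradients keep
    # roughly the same magnitude as the hard-loss gradients.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```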

Quantization

The pipeline supports both post-training quantization (PTQ) and quantization-aware training (QAT). Models are quantized to INT8 precision, with layer fusion and batch normalization folding applied to minimize accuracy loss. QAT simulates INT8 arithmetic during fine-tuning, so the model learns to compensate for the precision reduction.
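
A rough sketch of both workflows using PyTorch's eager-mode quantization API is shown below. TinyNet is a hypothetical stand-in for the student, and the fusion lists and backends DistillKit actually uses may differ.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    QuantStub, DeQuantStub, fuse_modules,
    get_default_qconfig, get_default_qat_qconfig,
    prepare, prepare_qat, convert,
)

class TinyNet(nn.Module):
    """Hypothetical stand-in for the student; only shows the quantization flow."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # tensors enter the quantized domain here
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, 10)
        self.dequant = DeQuantStub()  # outputs return to float here

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.bn(self.conv(x)))
        x = self.pool(x).flatten(1)
        return self.dequant(self.fc(x))

# Post-training static quantization (PTQ): fuse, calibrate, convert.
m = TinyNet().eval()
m = fuse_modules(m, [["conv", "bn", "relu"]])   # folds BN, fuses ReLU into the conv
m.qconfig = get_default_qconfig("fbgemm")       # "qnnpack" for ARM targets
m = prepare(m)
m(torch.randn(32, 3, 32, 32))                   # calibration batch(es) set activation ranges
ptq_model = convert(m)

# Quantization-aware training (QAT): fake-quantize during fine-tuning.
q = TinyNet().eval()
q = fuse_modules(q, [["conv", "bn", "relu"]]).train()
q.qconfig = get_default_qat_qconfig("fbgemm")
q = prepare_qat(q)
# ... fine-tune q here so it adapts to the simulated INT8 arithmetic ...
qat_model = convert(q.eval())
```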

Optimization

Deployment performance is enhanced through layer fusion, batch normalization folding, and dynamic quantization for RNN layers. Optional advanced augmentations such as CutMix and MixUp are available to improve generalization.
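
For the RNN case, dynamic quantization needs no calibration data; a minimal example (the LSTM dimensions here are arbitrary, not DistillKit defaults):

```python
import torch
import torch.nn as nn

# Weights are converted to INT8 ahead of time; activations are quantized
# on the fly at inference time, so no calibration pass is required.
rnn = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)
rnn_int8 = torch.ao.quantization.quantize_dynamic(
    rnn, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)
```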

Architecture

Teacher Model

ResNet152 with ImageNet pretraining, adapted for CIFAR-10's 32×32 input. Provides rich feature representations for distillation.
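
One common way to adapt an ImageNet-pretrained ResNet for 32×32 inputs is to shrink the stem and replace the classification head; the exact changes DistillKit makes are not shown here, so treat this as an illustrative sketch:

```python
import torch.nn as nn
from torchvision.models import resnet152, ResNet152_Weights

teacher = resnet152(weights=ResNet152_Weights.IMAGENET1K_V1)
# The 7x7/stride-2 stem and the max-pool are sized for 224x224 ImageNet
# inputs; swap them for a gentler 3x3 stem and fit a 10-way CIFAR-10 head.
teacher.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
teacher.maxpool = nn.Identity()
teacher.fc = nn.Linear(teacher.fc.in_features, 10)
```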

Student Model

MobileNetV2 optimized for CIFAR-10, with ~2.3M parameters compared to the teacher's ~60M. Uses depthwise separable convolutions for efficiency.
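
The parameter counts above can be sanity-checked directly; exact totals depend on the classifier head, so the figures are approximate:

```python
from torchvision.models import mobilenet_v2, resnet152

def count_params(model):
    return sum(p.numel() for p in model.parameters())

student = mobilenet_v2(num_classes=10)   # roughly 2.2M parameters with a 10-way head
teacher = resnet152(num_classes=10)      # roughly 58M (about 60M with the 1000-way head)
print(f"student: {count_params(student) / 1e6:.1f}M parameters")
print(f"teacher: {count_params(teacher) / 1e6:.1f}M parameters")
```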

Quantizable Student

MobileNetV2 modified for quantization, with fused layers and QAT-ready modules. Supports both static and quantization-aware workflows.
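
torchvision ships a quantization-ready MobileNetV2 that illustrates what such a quantizable student looks like; whether DistillKit uses this class or its own variant is not specified here, so this is only an example:

```python
from torchvision.models.quantization import mobilenet_v2

# QuantStub/DeQuantStub are already inserted around the network, and
# fuse_model() fuses the Conv-BN(-ReLU) blocks in place.
student_q = mobilenet_v2(weights=None, quantize=False, num_classes=10)
student_q.eval()
student_q.fuse_model()
# student_q can now go through prepare()/convert() for static quantization
# or prepare_qat()/convert() for the quantization-aware path.
```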

Training produces multiple artifacts, including the teacher model, distilled student, baseline student, and both QAT and static quantized versions. Evaluation metrics, confusion matrices, and latency/throughput profiles are logged for analysis.
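
Latency figures like those in the results below can be collected with a simple timing loop; this is a sketch rather than DistillKit's actual profiler, and GPU measurement would additionally require synchronization:

```python
import time
import torch

@torch.inference_mode()
def measure_latency_ms(model, input_shape=(1, 3, 32, 32), warmup=10, iters=100):
    """Average single-image CPU latency in milliseconds."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):                 # warm-up passes are discarded
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    return (time.perf_counter() - start) / iters * 1e3
```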

Results

Performance Summary

Teacher (ResNet152): ~96.6% accuracy, ~240 MB size (1× speed reference)

Distilled Student (MobileNetV2): ~94% accuracy, ~9MB size — 27× smaller, 3× faster

QAT Quantized Student: ~93.5% accuracy, ~2.96MB size — 81× smaller, 5× faster

Baseline Student (no KD): ~91% accuracy, ~9MB size — 27× smaller, 3× faster

These results demonstrate that knowledge distillation combined with quantization can retain high accuracy while achieving massive efficiency gains, enabling real-time inference on constrained hardware.

Potential Applications

While CIFAR-10 serves as the benchmark task, the methods extend to other domains — including real-world edge AI deployments like wildlife detection, mobile NLP, and embedded vision systems.

Future Directions

DistillKit is built for extensibility. The modular structure makes it easy to experiment with new teacher/student architectures, alternative loss functions, and custom quantization schemes. Future work includes applying the pipeline to specialized datasets for low-power edge deployments, such as wildfire detection in remote sensor networks.