It’s basically when you use a larger model (the teacher) to train a smaller one (the student). You train the student on outputs generated by the teacher (in the classic setup, its soft probability distributions rather than just its final answers) alongside the ground-truth data. Because the teacher’s outputs carry a richer signal per example than hard labels alone, you end up with a much smaller model that approximates the teacher’s behavior.
The catch is that it’s really hard to keep training a distilled model without degrading what it inherited from the teacher, so people prefer the undistilled original whenever they can get it. Without access to the teacher, a distilled model is basically cripple-ware: you can run it, but you can’t easily improve it further.
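If it helps make it feel less like alchemy, here’s a minimal sketch of a single distillation training step, assuming PyTorch; `teacher`, `student`, `T` (temperature), and `alpha` (loss mix) are just illustrative names, not anything standard:

```python
# Minimal sketch of one knowledge-distillation step (assumes PyTorch;
# `teacher`, `student`, `x`, `y`, and `optimizer` are hypothetical stand-ins).
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, y, optimizer, T=2.0, alpha=0.5):
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)      # soft targets from the teacher
    student_logits = student(x)

    # KL divergence between softened teacher and student distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # Standard cross-entropy against the ground-truth labels
    hard_loss = F.cross_entropy(student_logits, y)

    # Blend the two signals: imitate the teacher and fit the real labels
    loss = alpha * soft_loss + (1 - alpha) * hard_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The `alpha` knob is the “teacher outputs plus ground-truth data” part: it balances imitating the teacher against fitting the real labels.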
Thanks for explaining!