• Even_Adder@lemmy.dbzer0.comOPM

    It’s basically when you use a larger model (the teacher) to train a smaller one (the student). You train the student on a dataset of outputs generated by the teacher model plus ground truth data, and by some strange alchemy I don’t quite understand you get a much smaller model that resembles the teacher. See the sketch below for roughly what that looks like.
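
    Roughly what a training step looks like in code. This is just a minimal PyTorch-ish sketch of the usual setup, not any particular model’s recipe: the teacher/student models, the temperature, and the alpha mixing weight are all placeholders.

    ```python
    import torch
    import torch.nn.functional as F

    # Sketch of one distillation training step. The student is trained on a blend of:
    #  - "soft" targets: the teacher's output distribution, softened by a temperature
    #  - "hard" targets: ordinary ground-truth labels
    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Match the student's softened distribution to the teacher's.
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Ordinary cross-entropy against the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, labels)
        # alpha controls how much the student imitates the teacher vs. the labels.
        return alpha * soft_loss + (1 - alpha) * hard_loss

    def train_step(student, teacher, optimizer, inputs, labels):
        with torch.no_grad():           # the teacher is frozen; only the student learns
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        loss = distillation_loss(student_logits, teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```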

    It’s really hard to train further on top of a distilled model without breaking it, so people prefer undistilled models whenever possible. Without the teacher model, distilled models are basically cripple-ware.