

From what I understand, it makes the model lighter overall, so you could fit a bigger-param model in the same amount of VRAM. It would also make generation faster in terms of tokens/sec. Layers get loaded into VRAM and are part of the model, and to generate each token the model basically loops through the layers one after another. At each layer the result gets refined a little more, and that whole pass is repeated for every token. Bigger-param models usually have more layers.
Unfortunately, you can't apply the fix to the .safetensors or .gguf file yourself; the model curators need to bake it in.
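
To make the layer loop a bit more concrete, here's a toy sketch of the idea. It is not how any real model is implemented: `make_layer`, `forward`, and `generate` are made-up names, and each "layer" is just a tiny function that nudges a list of numbers, standing in for the real attention/MLP blocks. The only point is that the full stack of layers runs once per generated token, so more layers means more work per token.

```python
# Toy illustration only: every name here is hypothetical, and the "layers"
# are simple stand-ins for real transformer blocks.

def make_layer(scale):
    def layer(hidden):
        # Residual-style update: keep the old state and add a small refinement.
        return [h + scale * (1.0 - h) for h in hidden]
    return layer

def forward(layers, hidden):
    # One pass through every layer, in order.
    for layer in layers:
        hidden = layer(hidden)
    return hidden

def generate(layers, num_tokens, width=4):
    outputs = []
    hidden = [0.0] * width
    for _ in range(num_tokens):
        # The whole layer stack runs again for each new token.
        hidden = forward(layers, hidden)
        outputs.append(sum(hidden) / width)
    return outputs

small_model = [make_layer(0.1) for _ in range(4)]  # fewer layers
large_model = [make_layer(0.1) for _ in range(8)]  # more layers, more work per token
```

Calling `generate(small_model, 3)` runs 4 layer calls per token, while `generate(large_model, 3)` runs 8, which is roughly why deeper models are slower per token and take more VRAM to hold.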






we should throw AI at other weird websites and see what sort of files come up. call it magnet fishing.