

I’d have to assume this is for local LLMs, because testing the performance of a prompt well would get expensive quickly.
It never really occurred to me to even try, because the output can be so subjective. And to grab something at random: “must cite source if factual claim”.
- There are lots of ways to cite a source. How would you ensure you capture all of them?
- A source can be hallucinated. You still have to curl any links (see the sketch after this list).
- A source can be misunderstood and not say what the bot thinks it says. The only way to test this is by hand, or to write another AI to do it, and now you’re testing that too.
- Last but not least: what if the source exists and says what the bot thinks it does, but it’s a garbage source?
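To be concrete about the “curl any links” part, the automatable check looks something like the sketch below: it can confirm cited links at least resolve, but nothing more. The regex and the HEAD-request approach are my own assumptions, not anything from the article.

```python
# Minimal sketch: pull URLs out of a model response and verify they resolve.
# This only catches dead or hallucinated links; it says nothing about whether
# the page actually supports the claim, or whether it's a garbage source.
import re
import urllib.request

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def check_citations(response_text: str, timeout: float = 5.0) -> dict:
    """Return {url: reachable} for every URL found in the response."""
    results = {}
    for url in URL_RE.findall(response_text):
        try:
            # Some servers reject HEAD; a GET fallback is omitted for brevity.
            req = urllib.request.Request(
                url, method="HEAD", headers={"User-Agent": "citation-check"})
            with urllib.request.urlopen(req, timeout=timeout):
                results[url] = True
        except Exception:
            results[url] = False
    return results

if __name__ == "__main__":
    sample = "As reported (https://example.com/made-up-study), 9 out of 10 agree."
    print(check_citations(sample))
```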
In short, passing a unit test does nothing to guarantee any quality of output. You’d be further ahead, effort-wise, to just give the LLM multishot examples (see the sketch below) and manually review each output for quality.
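By “multishot examples” I just mean baking a handful of hand-written input/output pairs into the prompt as prior chat turns, roughly like this. The message format is the generic {"role", "content"} convention and the example pairs are placeholders; adapt both to whatever client you actually use.

```python
# Minimal sketch of a multishot (few-shot) prompt: worked examples are
# prepended to the real request as earlier turns in the conversation.
EXAMPLES = [
    ("Summarize: The team moved the meeting to Tuesday.",
     "The meeting was moved to Tuesday."),
    ("Summarize: The report says revenue rose 10% (https://example.com/report).",
     "Revenue rose 10%. Source: https://example.com/report"),
]

def build_messages(user_input: str) -> list[dict]:
    messages = [{"role": "system",
                 "content": "Summarize the input. Cite a source for factual claims."}]
    for question, answer in EXAMPLES:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_input})
    return messages

if __name__ == "__main__":
    for m in build_messages("Summarize: The release was pushed back a week."):
        print(m["role"], "->", m["content"])
```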
What you’re doing here is spending all your time churning your prompt instead of accomplishing anything. Take it from someone who spent like six months prompt churning. Some prompts are better than others but at the end of the day the output is random and you’ll never anticipate all the possible inputs. Tweak it until the output feels right and that’s that. Play with it when you have nothing better to do.




I’ve had good luck keeping the home partition and just reinstalling the OS. Set it up with the same user name and home directory and you’re done.
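If you want a quick sanity check that /home really is its own partition before wiping the root filesystem, something like this works on Linux. The paths are the usual defaults; adjust for your setup.

```python
# Minimal sketch: confirm /home lives on a different device than /,
# i.e. it's a separate partition an OS reinstall can leave untouched.
import os

def home_is_separate_partition() -> bool:
    return os.stat("/home").st_dev != os.stat("/").st_dev

if __name__ == "__main__":
    if home_is_separate_partition():
        print("/home is its own partition; safe to reinstall around it.")
    else:
        print("/home is on the root filesystem; a reinstall would wipe it.")
```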