How do you evaluate LLM outputs systematically, beyond vibes?