---
title: "Weekly Links: Explainers + Eval Rigor"
date: 2026-03-01
tags: [weekly-links]
---
This week’s pile leans hard into understanding what we build and measuring it properly. If your AI stack is a black box, you’re just borrowing time.
- **[Interactive explanations — Agentic Engineering Patterns (Simon Willison)](https://simonwillison.net/guides/agentic-engineering-patterns/interactive-explanations/)**
  The best antidote to cognitive debt I’ve seen. If your agent can’t show its work, it’s not done.
- **[LM Evaluation Harness (EleutherAI)](https://github.com/EleutherAI/lm-evaluation-harness)**
  The old workhorse is still the one to beat. Boring? Sure. But if you care about regressions, this is how you catch them.
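Catching a regression means comparing harness scores against a pinned baseline. A minimal sketch of that gate, assuming hypothetical task names and made-up scores (in practice the numbers come from the harness's JSON output, not hardcoded dicts):

```python
# Sketch of an eval regression gate. Baseline scores and tolerance are
# illustrative; real values would be parsed from lm-evaluation-harness output.
BASELINE = {"hellaswag": 0.792, "arc_challenge": 0.561}  # pinned reference scores
TOLERANCE = 0.01  # allow this much noise before flagging a drop

def find_regressions(current: dict[str, float],
                     baseline: dict[str, float] = BASELINE,
                     tol: float = TOLERANCE) -> list[str]:
    """Return the tasks whose score fell more than tol below baseline."""
    return [task for task, ref in baseline.items()
            if current.get(task, 0.0) < ref - tol]

# Example: hellaswag is within tolerance, arc_challenge is not.
current = {"hellaswag": 0.790, "arc_challenge": 0.540}
print(find_regressions(current))
```

The point is the shape, not the numbers: pin scores once, fail the build when a task drifts past tolerance, and stop arguing about vibes.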
- **[LMMS-Eval (multimodal eval toolkit)](https://github.com/EvolvingLMMs-Lab/lmms-eval)**
  Text-only evals are a dead end for real products. This pushes the discipline toward what people actually ship.
- **[Awesome LLM Evaluation](https://alopatenko.github.io/LLMEvaluation/)**
  A proper index of benchmarks and papers. Useful when you need to stop hand-waving and pick a real target.
- **[Top 5 Open-Source LLM Evaluation Frameworks (DEV)](https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m)**
  Lightweight survey, but it’s a good reminder: if you’re not evaluating, you’re guessing.
Next week: fewer guesses, more proof.