You Are the Most Expensive Model

Every — hello@every.to

Received

Monday, April 27, 2026 at 14:58

Source

Every

Message-ID

20260427145816.369a493a03be1405@mg.every.to


The real cost of AI agents is your time. A four-step framework for keeping your
AI costs in check.

Also True for Humans


YOU ARE THE MOST EXPENSIVE MODEL

THE REAL COST OF AI AGENTS IS YOUR TIME. A FOUR-STEP FRAMEWORK FOR KEEPING YOUR
AI COSTS IN CHECK.

by Mike Taylor

[Cover image: Midjourney/Every illustration.]

Not every step in an AI workflow needs the smartest AI. That may sound obvious,
but it’s not how most people are working. The default is to route entire tasks
through frontier models, which is expensive, slow, and usually unnecessary.
Incremental determinism starts from a different question: How much intelligence
does this task really need? The answer is almost always less than you’d expect,
and the savings add up. -- Mike Taylor

--------------------------------------------------------------------------------

There is a reason McDonald’s would never ask its CEO to man the burger grill: It
would cost the company $9,230.77 an hour. It’s the same as using frontier AI
models to do every task: you don’t need to pay 75 cents every half hour ($1,095
per month!) for Claude Opus to check your to-do list in OpenClaw.
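
Both figures can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: a 2,080-hour working year and a $19.2 million annual pay package are inferred from the quoted hourly rate, not stated in the text, and the agent is assumed to run around the clock.

```python
# Assumptions (not stated in the article): 40-hour weeks, 52 weeks a year,
# and a $19.2M annual pay package inferred from the quoted hourly figure.
CEO_ANNUAL_PAY = 19_200_000
WORK_HOURS_PER_YEAR = 40 * 52  # 2,080

ceo_hourly = CEO_ANNUAL_PAY / WORK_HOURS_PER_YEAR

# An agent that spends $0.75 every half hour, 24 hours a day, all year.
COST_PER_CHECK = 0.75
CHECKS_PER_DAY = 24 * 2
agent_monthly = COST_PER_CHECK * CHECKS_PER_DAY * 365 / 12

print(f"CEO at the grill: ${ceo_hourly:,.2f} per hour")
print(f"Opus checking a to-do list: ${agent_monthly:,.2f} per month")
```

The monthly figure only works out to $1,095 if the check runs every half hour, day and night, which is exactly the kind of always-on background task where model choice matters most.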

This tension isn’t really about the pricing of AI models; it’s about the value
of human attention. Now that you have a cheaper alternative for many tasks that
used to require that attention, you need to figure out how to deploy AI so that
it frees up your most expensive model: you. Most businesses are getting this
balance wrong in both directions: overpaying for AI on simple tasks and
underusing it on ones that would free up their best people.

The solution is a process of optimization that I call incremental determinism.
Every time you repeat a task, build it into a repeatable process by creating a
skill file.
Identify which parts of that process need the most expensive model, which can be
delegated to cheaper, less powerful models, and which tasks repeat often enough
to justify turning them into reusable code. And finally, get better at
delegating so you can stay focused on the work that needs you.

I call it incremental determinism because the more you repeat a task, the more
it pays to nail down exactly how it should be done. The first time, you figure
the task out as you go, but after doing it a few times, you can document the
best approach. “Deterministic” is a programming term for code that always
produces the same output given the same input. The goal is to push as much of
your workflow towards that end of the spectrum as possible, because
deterministic steps are faster, cheaper, and more reliable. The tradeoff is the
upfront investment needed to systematize the task.

There are four levels for achieving this balance and optimizing AI costs.
Depending on your technical fluency, you don’t have to go to the final level,
but understanding how the levels support each other will help you control AI
costs across your entire organization.


LEVEL 1: TURN SESSIONS INTO SKILLS

The first level is the easiest. Let’s say you are often asking AI to generate a
PowerPoint pitch deck. The first step toward systematizing it is to make a skill.
A skill can be as simple as a text file detailing how to do a task that the
model follows each time it’s asked. It’s the McDonald’s handbook that tells
every employee how to make the perfect burger, over and over again. Even less
experienced cooks can get a good result.

Once you’re done with the normal back and forth of giving the AI the necessary
data and context for the presentation, ask it, “What information would have been
useful to know at the start of this task that would have eliminated several
steps or mistakes?” Claude knows what it is capable of, so you can ask it to
turn its response into a PowerPoint deck creation skill to use next time.
Anthropic has been releasing plugins
(collections of skills) for various industries to serve as a starting point.
They even provide a “skill-creator” skill
that teaches Claude how to guide you through making one when you ask.
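
To make “a skill can be as simple as a text file” concrete, here is a minimal sketch that writes one. It assumes the `SKILL.md` layout with `name` and `description` front matter used in Anthropic’s public skills repository. The skill name, the lessons, and the output directory are all hypothetical.

```python
from pathlib import Path
import tempfile

# Hypothetical lessons learned from a finished deck-building session.
LESSONS = [
    "Ask for the audience and the desired length before drafting slides.",
    "Pull revenue figures from the finance export, never from memory.",
    "Follow the brand guidelines for fonts and colors.",
]

def build_skill(name: str, description: str, lessons: list[str]) -> str:
    """Return the text of a skill file: front matter plus instructions."""
    front_matter = f"---\nname: {name}\ndescription: {description}\n---\n"
    body = "\n# Instructions\n\n" + "\n".join(f"- {item}" for item in lessons) + "\n"
    return front_matter + body

skill_text = build_skill(
    "pitch-deck",  # hypothetical skill name
    "Create a PowerPoint pitch deck from supplied data and context.",
    LESSONS,
)

# Written to a throwaway directory; a real skill lives wherever your tool expects it.
skill_dir = Path(tempfile.mkdtemp()) / "pitch-deck"
skill_dir.mkdir()
(skill_dir / "SKILL.md").write_text(skill_text, encoding="utf-8")
print(skill_text)
```

In practice you would let the model draft the lessons from the session and then edit the file by hand, rather than generating it from code.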

Once you have a skill, test it. Ask Claude to test the efficacy of the skill
with the following prompt: “Run the task using subagents, one with the skill,
one without, and compare the results.” If the skill is doing its job, you should
see an improvement in quality, cost, and speed. Now try running it with a
cheaper model (“Run this test again with Sonnet/Haiku”) and compare the results.
If you’re happy with the output, ask Claude to “Use a subagent with Sonnet/Haiku
when calling this skill.” You are using a subagent because you don’t want the
model running your main session (the more expensive one) to be the model
executing the task, so the separate, cheaper subagent does the work. You just
cut the cost of running that task by a factor of 10 to 100.
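
How much you save by delegating depends on the price gap between the two models and on how many tokens the task uses. The sketch below shows the arithmetic. The prices and token counts are hypothetical placeholders, not vendor quotes.

```python
# Hypothetical prices in dollars per million tokens; check your provider's
# current price list before relying on any of these numbers.
PRICE_PER_MILLION = {
    "frontier": {"input": 15.00, "output": 75.00},
    "small": {"input": 1.00, "output": 5.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run for the given token counts."""
    price = PRICE_PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Hypothetical deck-building run: 40k tokens in, 8k tokens out.
frontier = run_cost("frontier", 40_000, 8_000)
small = run_cost("small", 40_000, 8_000)
print(f"frontier: ${frontier:.2f}  small: ${small:.2f}  ratio: {frontier / small:.0f}x")
```

Plug in your own prices and token counts to see where your task falls in that range.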

It doesn’t make sense to write skills for throwaway tasks you won’t do again.
But if you find yourself doing something for the third time, it’s probably worth
formalizing it. If you’re using it multiple times per week, try getting it
working with a smaller model.

LEVEL 2: TURN SKILLS INTO EVALS

Your team might see your skill and want to use it to create their presentations
as well. While it’s easy to share skills across your organization,
you’ll have to get them to trust that your skill delivers before they’ll adopt
it. For that, you’ll need evidence in the form of evaluation metrics,
or evals.

For the simplest eval, gather 10 examples of tasks your skill has been used
for—say, the last 10 decks you have made with the skill—and rewrite the output
to be the gold standard or best-in-class example of what you’d hope Claude could
produce. Now, ask Claude to “Run each test case with subagents and compare the
output versus my gold examples.” Make changes to the skill and test if it does
better. This is the “LLM-as-a-judge” technique: you’re
using a model to grade its own work against your standard.

In the spirit of incremental determinism, you should formalize your evals over
time, too. Ask Claude to “Break down the patterns between what makes a ‘good’
answer (gold examples) versus the typical output of the skill.” It might say
that one pattern for a good answer is following brand guidelines, another
pattern is including four to five bullet points of commentary on a specific
slide, and a third is calculating the correct numbers.
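
Some of these patterns can be checked without a model at all. As a sketch, the bullet-point pattern could become a small deterministic judge like the one below. The function name and the sample commentary are hypothetical.

```python
import re

def bullet_count_judge(slide_notes: str, low: int = 4, high: int = 5) -> float:
    """Score 1.0 if the commentary has between `low` and `high` bullets, else 0.0."""
    bullets = re.findall(r"^\s*[-*•]\s+\S", slide_notes, flags=re.MULTILINE)
    return 1.0 if low <= len(bullets) <= high else 0.0

good = "- Revenue up\n- Churn down\n- Two new markets\n- Hiring on plan"
thin = "- Revenue up\n- Churn down"
print(bullet_count_judge(good), bullet_count_judge(thin))
```

Patterns that need judgment, such as whether a slide is on brand, would still go to a model-based judge.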

Once you have several evals, you can combine them into a single score. Each eval
becomes one “judge”—it looks at the output from one angle, such as data
accuracy, and returns a score. You can weight each judge based on how much that
dimension matters to you, then average the scores together. This
“panel-of-judges” approach lets you track overall quality as a single number.
The on-brand eval might be worth 40 points to you, the correct numbers could be
50, and the bullet points worth 10. Each prompt you test can then be scored out
of 100, allowing you to compare how well one approach works versus another.
Claude is a human-level prompt engineer and runs this process as a matter of
course if you use the skill-creator skill Anthropic provides.

Let’s come back to our patterns of good output for a PowerPoint deck. Validating
the data is more important than whether you’re missing a bullet point or using
the right visual components, so you could weight that eval as 60 percent of the
overall score versus 20 percent each for the other two. Together, you have a
weighted average score for measuring how well your skill is performing. For
companies where getting a pixel out of line is a fireable offence, such as
top-tier consulting or finance firms, you can raise the relative weighting of
the visual eval.
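
The weighted average is a few lines of code. This sketch uses the 60/20/20 split described above. The judge names and the per-judge scores are hypothetical.

```python
def panel_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-judge scores (each 0.0 to 1.0), scaled to 0-100."""
    total_weight = sum(weights.values())
    weighted = sum(scores[name] * weight for name, weight in weights.items())
    return 100 * weighted / total_weight

# Weights from the example above: data accuracy 60%, the other two 20% each.
WEIGHTS = {"data_accuracy": 0.6, "bullet_points": 0.2, "visual_components": 0.2}

# Hypothetical judge outputs for one generated deck.
deck_scores = {"data_accuracy": 1.0, "bullet_points": 0.5, "visual_components": 0.0}
print(f"{panel_score(deck_scores, WEIGHTS):.0f} / 100")
```

Dividing by the total weight means the weights don’t have to sum to one, so a firm that cares more about visuals can raise that weight without rebalancing the others.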

Now you have proof you can share with the team about the impact your changes
are making on skills. When the next big model comes out, you can test how much
better it does on your benchmark and whether it’s worth the extra cost.

LEVEL 3: TURN EVALS INTO SCRIPTS

When your skill is working reliably, and you’re using it frequently enough that
the token cost is starting to feel significant, you need to start thinking about
scripts, CLIs, or MCPs. This is where the steps get slightly more technical, but
the principle is the same: Replace thinking with a structured process wherever
your thinking doesn’t add anything extra.

Every skill, like your PowerPoint deck skill, is a bundle of actions—pull this
data, reference our brand guidelines, create a .pptx file—and some of those
actions don’t require a smart model. Some don’t even require an LLM at all.
Deconstruct your skill into its component parts and hard-code whatever you can.
Code costs almost nothing to run and returns in an instant compared to LLMs, so
the more of your workflow you can make deterministic, the cheaper and faster it
will be.

For our PowerPoint creation task, you can have Opus write the HTML and CSS
templates for the slide deck once, then fill them in with code whenever you need
to create a deck. You can also write a script to pull the right revenue or sales
figures from a data source, no LLM involved. The final export step, to .pptx
format, can also be done in code.
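
Here is a minimal sketch of the two deterministic steps, looking up a figure and filling a fixed template, using only the Python standard library. The template, the CSV contents, and the function names are hypothetical stand-ins for your own.

```python
import csv
import io
from string import Template

# Hypothetical slide template; in practice, the HTML written once by the model.
SLIDE = Template("<section><h1>$title</h1><p>Revenue: ${revenue} USD</p></section>")

# Hypothetical data source; a CSV export stands in for a database or API.
SALES_CSV = "quarter,revenue\nQ1,120000\nQ2,150000\n"

def revenue_for(quarter: str, source: str = SALES_CSV) -> int:
    """Look up a quarter's revenue with plain code, no LLM involved."""
    for row in csv.DictReader(io.StringIO(source)):
        if row["quarter"] == quarter:
            return int(row["revenue"])
    raise KeyError(quarter)

def render_slide(quarter: str) -> str:
    """Fill the fixed template with the looked-up figure."""
    return SLIDE.substitute(title=f"{quarter} results", revenue=f"{revenue_for(quarter):,}")

print(render_slide("Q2"))
```

The export to .pptx is left out to keep the sketch dependency-free; a library such as python-pptx is one option for that step.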

For tasks that require some judgment, like checking your deck’s compliance with
brand guidelines, don’t jump straight to the most expensive model. Platforms
like OpenRouter
allow you to call any of the major commercial or open-source models, so you can
experiment with the tradeoffs between cost and intelligence. Basic
classification and summarization tasks can be done by older models 1,000 times
cheaper than Opus with reasonable accuracy. Leave the most challenging tasks,
such as the narrative and tailoring the tone to a specific audience, to Opus.
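
One way to picture the result is a routing table that sends each step to the cheapest tier that can handle it. The step names and tier labels below are hypothetical. Unknown steps fall back to the frontier model as the safe default.

```python
# Hypothetical tier names and step list; swap in your own models and steps.
ROUTES = {
    "pull_figures": "code",         # deterministic script, no model
    "export_pptx": "code",
    "check_brand_rules": "small",   # basic classification
    "summarize_notes": "small",
    "write_narrative": "frontier",  # tone and storytelling
}

def route(step: str) -> str:
    """Return the cheapest tier known to handle a step; default to frontier."""
    return ROUTES.get(step, "frontier")

steps = ["pull_figures", "check_brand_rules", "write_narrative", "new_step"]
plan = {step: route(step) for step in steps}
print(plan)
```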

LEVEL 4: TURN SCRIPTS INTO BETTER SCRIPTS

In the previous step, you replaced as much LLM thinking as possible with
deterministic code, bringing the cost of your PowerPoint skill down 10 to 90
percent compared to only using Opus. But you were only optimizing for your own
use. When your skill is running inside a product, creating hundreds of decks a
week, cost inefficiencies will again become a problem. For this, you will need
to build a process to automate the optimization. Once you have 100 to 200
examples of the skill being used in the real world, a reliable basket of eval
metrics, and a clear map of what the skill does at each step, you have
everything you need to do so.

The most common tool for this is DSPy,
which can automate the prompt engineering process
end-to-end. It runs your prompt, looks at the test cases, and rewrites the
prompt to arrive at a more accurate outcome, often with a cheaper model. Another
common approach is distillation. You use Opus to generate hundreds of
high-quality examples that pass your evals, then use those to teach a cheaper
model to produce similar results. You can do that by either including the
examples in the prompt so Haiku can pattern-match against them, or by
fine-tuning the cheaper model directly on the examples. Think of it as a head
chef writing such a good recipe that a less experienced cook can follow it
perfectly. This process can cost $10, $100, or $1,000, depending on the model
and how many test cases you have, but spending $1,000 to save millions in
production is worth it.
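
The in-prompt version of distillation can be sketched in plain Python: keep only the examples that pass your evals, then put the best few in front of the new request. The `Example` fields, the score threshold, and the sample data are hypothetical. Fine-tuning is not shown.

```python
from dataclasses import dataclass

@dataclass
class Example:
    request: str
    output: str
    eval_score: float  # 0-100, from your eval panel

def build_few_shot_prompt(examples: list[Example], new_request: str,
                          min_score: float = 90.0, max_examples: int = 3) -> str:
    """Keep only examples that pass the evals and prepend them to the new request."""
    passing = sorted((e for e in examples if e.eval_score >= min_score),
                     key=lambda e: e.eval_score, reverse=True)[:max_examples]
    shots = "\n\n".join(f"Request: {e.request}\nDeck outline: {e.output}" for e in passing)
    return f"{shots}\n\nRequest: {new_request}\nDeck outline:"

# Hypothetical examples generated by the expensive model and scored by the evals.
pool = [
    Example("Series A pitch", "Problem, Solution, Traction, Ask", 96.0),
    Example("Board update", "Metrics, Risks, Plan", 71.0),
    Example("Sales kickoff", "Wins, Targets, Playbook", 92.0),
]
prompt = build_few_shot_prompt(pool, "Partner pitch")
print(prompt)
```

The resulting prompt is what you would send to the cheaper model. The failing example never reaches it.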

More experimental approaches are emerging, too. Andrej Karpathy’s autoresearch
runs experiments to optimize a script file against an eval metric over long
periods. Researchers wake up to more than 20 experiments run overnight with
meaningful performance improvements.

The great enemy at this level is overfitting: The skill or script works well
against your eval metric but fails on tasks it hasn’t seen before. It’s
“teaching to the test” for LLMs. The evals you built in Level 2 are your main
defense against this, because they give you a formal rubric for grading its
performance. Human involvement in the evaluation process is necessary because
we’re better able to catch behavior that goes against the spirit of the game,
even if it’s not technically wrong as defined by the rules.
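
A standard machine-learning guard against overfitting, offered here as an illustration rather than something the article prescribes, is to hold back a slice of your real-world examples that the optimizer never sees. You then compare scores on the two sets. The split size and the scores below are hypothetical.

```python
import random

def split_examples(examples: list, holdout_fraction: float = 0.2, seed: int = 0):
    """Shuffle once, then reserve a slice the optimizer never sees."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    holdout_size = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - holdout_size
    return shuffled[:cut], shuffled[cut:]

def overfit_gap(train_score: float, holdout_score: float) -> float:
    """A positive gap means the skill does better on cases it was tuned on."""
    return train_score - holdout_score

train, holdout = split_examples(list(range(150)))
print(len(train), len(holdout))
# Hypothetical scores from running the eval panel on each split.
print(overfit_gap(train_score=94.0, holdout_score=78.0))
```

A large gap is a prompt for a human to read the held-out outputs, not a verdict on its own.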

If you are a manager at a company responsible for AI, you don’t need to know how
to implement any of this yourself. What matters is understanding that this
optimization layer exists, that it’s what your technical team or tools are doing
under the hood, and why the decision to invest in it can pay off.

YOU ARE THE MOST EXPENSIVE MODEL

All of this optimization work takes time and expertise, and your attention is an
even more expensive commodity than the latest models. Attention is the key word:
The ladder of incremental determinism—sessions, skills, evals, scripts,
optimized scripts—gives you a framework for deciding where to invest your
attention. Every hour you spend optimizing a skill is an hour you’re not
spending on something only you can do.

You don’t need to climb the whole ladder—having reliable skills and evals is
more than enough. The point is knowing the rungs exist, so when the cost
pressure hits (and it will), you know exactly which lever to pull. If you’re
struggling with unreliable or expensive skills but don’t have the capability to
build scripts in house, it might be time to bring in someone technical and
AI-savvy to do the heavy lifting.

The cost of tokens is falling 90 percent every year for the same level of
intelligence, so the task even Opus struggles with today might be easy and cheap
in 12 months. Sometimes the smartest move is to overpay now and let the market
do the price optimization for you.
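
To see what a 90 percent annual drop implies, here is a sketch that projects a cost forward. It assumes the decline compounds smoothly month to month, which is a simplification. The starting cost is a hypothetical example.

```python
def projected_cost(cost_today: float, months: int, annual_drop: float = 0.90) -> float:
    """Cost after `months` if prices fall by `annual_drop` each year, compounding smoothly."""
    return cost_today * (1 - annual_drop) ** (months / 12)

# Hypothetical task that costs $1,095 a month today.
for months in (0, 6, 12, 24):
    print(months, round(projected_cost(1095.0, months), 2))
```

If the projected cost a year out is already below what an engineer’s time to optimize would cost, waiting can be the cheaper option.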

--------------------------------------------------------------------------------

Mike Taylor is the head of tech consulting at Every and a co-author of Prompt
Engineering for Generative AI (O’Reilly).


Extraction LLM— claude-haiku-4-5 · prompt v1 · 16122→2962 tokens

# THE REAL COST OF AI AGENTS IS YOUR TIME. A FOUR-STEP FRAMEWORK FOR KEEPING YOUR AI COSTS IN CHECK.

by Mike Taylor

Not every step in an AI workflow needs the smartest AI. That may sound obvious, but it's not how most people are working. The default is to route entire tasks through frontier models, which is expensive, slow, and usually unnecessary. Incremental determinism starts from a different question: How much intelligence does this task really need? The answer is almost always less than you'd expect, and the savings add up.

There is a reason McDonald's would never ask its CEO to man the burger grill: It would cost the company $9,230.77 an hour. It's the same as using frontier AI models to do every task—you don't need to pay 75 cents every half hour ($1,095 per month!) for Claude Opus to check your to-do list in OpenClaw.

This tension isn't really about the pricing of AI models—it's about the value of human attention. Now that you have a cheaper alternative for many tasks that used to require it, you need to figure out the optimal way to deploy AI in a way that frees up your most expensive model—you. Most businesses are getting this balance wrong in both directions: overpaying for AI on simple tasks and underusing it on ones that would free up their best people.

The solution is a process of optimization that I call incremental determinism. Every time you repeat a task, build it into a repeatable process by creating a skill file. Identify which parts of that process need the most expensive model, which can be delegated to cheaper, less powerful models, and which tasks repeat often enough to justify turning them into reusable code. And finally, get better at delegating so you can stay focused on the work that needs you.

I call it incremental determinism because the more you repeat a task, the more it pays to nail down exactly how it should be done. The first time, you figure the task out as you go, but after doing it a few times, you can document the best approach. "Deterministic" is a programming term for code that always produces the same output given the same input. The goal is to push as much of your workflow towards that end of the spectrum as possible, because deterministic steps are faster, cheaper, and more reliable. The tradeoff is the upfront investment needed to systematize the task.

There are four levels for achieving this balance and optimizing AI costs. Depending on your technical fluency, you don't have to go all the way to the final level, but understanding how the levels support one another will help you control AI costs across your entire organization.

## LEVEL 1: TURN SESSIONS INTO SKILLS

The first level is the easiest. Let's say you are often asking AI to generate a PowerPoint pitch deck. The first step toward systematizing it is to make a skill. A skill can be as simple as a text file detailing how to do a task that the model follows each time it's asked. It's the McDonald's handbook that tells every employee how to make the perfect burger, over and over again. Even less experienced cooks can get a good result.

Once you're done with the normal back and forth of giving the AI the necessary data and context for the presentation, ask it, "What information would have been useful to know at the start of this task that would have eliminated several steps or mistakes?" Claude knows what it is capable of, so you can ask it to turn its response into a PowerPoint deck creation skill to use next time. Anthropic has been releasing plugins (collections of skills) for various industries to serve as a starting point. They even provide a "skill-creator" skill that teaches Claude how to guide you through making one when you ask.

Once you have a skill, test it. Ask Claude to test the efficacy of the skill with the following prompt: "Run the task using subagents, one with the skill, one without, and compare the results." If the skill is doing its job, you should see an improvement in quality, cost, and speed. Now try running it with a cheaper model ("Run this test again with Sonnet/Haiku") and compare the results. If you're happy with the output, ask Claude to "Use a subagent with Sonnet/Haiku when calling this skill." You use a subagent because you don't want the more expensive model running your main session to be the one executing the task; the separate, cheaper subagent does the work instead. You just cut the cost of running that task by a factor of 10 to 100.

It doesn't make sense to write skills for throwaway tasks you won't do again. But if you find yourself doing something for the third time, it's probably worth formalizing it. If you're using it multiple times per week, try getting it working with a smaller model.

## LEVEL 2: TURN SKILLS INTO EVALS

Your team might see your skill and want to use it to create their presentations as well. While it's easy to share skills across your organization, you'll have to get them to trust that your skill delivers before they'll adopt it. For that, you'll need evidence in the form of evaluation metrics, or evals.

For the simplest eval, gather 10 examples of tasks your skill has been used for—say, the last 10 decks you have made with the skill—and rewrite the output to be the gold standard or best-in-class example of what you'd hope Claude could produce. Now, ask Claude to "Run each test case with subagents and compare the output versus my gold examples." Make changes to the skill and test if it does better. This is the "LLM-as-a-judge" technique—you're using a model to grade its own work against your standard.
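If you later want to run this comparison outside a chat session, the loop is small. The sketch below is a minimal harness under stated assumptions: `run_eval` and `similarity_judge` are hypothetical names, the example strings are made up, and the judge is a plain text-similarity placeholder from Python's standard library rather than an LLM. In practice you would swap in a function that asks a model to grade each output against the gold example.

```python
from difflib import SequenceMatcher
from typing import Callable

# A judge takes (candidate, gold) and returns a score between 0.0 and 1.0.
Judge = Callable[[str, str], float]


def similarity_judge(candidate: str, gold: str) -> float:
    """Placeholder judge: raw text similarity, NOT an LLM grading the output."""
    return SequenceMatcher(None, candidate, gold).ratio()


def run_eval(outputs: list[str], golds: list[str], judge: Judge) -> float:
    """Average judge score across paired (output, gold) test cases."""
    if len(outputs) != len(golds) or not golds:
        raise ValueError("need one gold example per output, and at least one case")
    scores = [judge(out, gold) for out, gold in zip(outputs, golds)]
    return sum(scores) / len(scores)


# Made-up test cases, standing in for slide text from your last decks.
golds = ["Q3 revenue rose 12% to $4.1M", "Churn fell to 2.3%"]
skill_v1 = ["Revenue went up", "Churn fell to 2.3%"]
skill_v2 = ["Q3 revenue rose 12% to $4.1M", "Churn fell to 2.3%"]

print("v1:", round(run_eval(skill_v1, golds, similarity_judge), 2))
print("v2:", round(run_eval(skill_v2, golds, similarity_judge), 2))
```

Because the judge is passed in as a function, you can change how outputs are graded without touching the test cases.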

In the spirit of incremental determinism, you should formalize your evals over time, too. Ask Claude to "Break down the patterns between what makes a 'good' answer (gold examples) versus the typical output of the skill." It might say that one pattern for a good answer is following brand guidelines, another pattern is including four to five bullet points of commentary on a specific slide, and a third is calculating the correct numbers.

Once you have several evals, you can combine them into a single score. Each eval becomes one "judge": it looks at the output from one angle, such as data accuracy, and returns a score. You can weight each judge based on how much that dimension matters to you, then take the weighted average of the scores. This "panel-of-judges" approach lets you track overall quality as a single number. The on-brand eval might be worth 40 points to you, the correct numbers could be 50, and the bullet points worth 10. Each prompt you test can then be scored out of 100, allowing you to compare how well one approach works versus another. Claude is a human-level prompt engineer and runs this process as a matter of course if you use the skill-creator skill Anthropic provides.

Let's come back to our patterns of good output for a PowerPoint deck. Validating the data is more important than whether you're missing a bullet point or using the right visual components, so you could weight that eval as 60 percent of the overall score versus 20 percent each for the other two. Together, these give you a weighted average score for measuring how well your skill is performing. For companies where getting a pixel out of line is a fireable offence, such as top-tier consulting or finance firms, you can increase the weight of the visual-components eval instead.
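The panel-of-judges score is a weighted average, which can be sketched as follows. The judge names and the per-judge scores are illustrative assumptions; the two weightings are the 40/50/10 points split and the 60/20/20 accuracy-first split described above.

```python
def weighted_score(judge_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Combine per-judge scores (each 0.0 to 1.0) into one score out of 100."""
    if judge_scores.keys() != weights.keys():
        raise ValueError("every judge needs a weight, and vice versa")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(judge_scores[name] * weights[name] for name in weights)
    return 100 * weighted / total_weight


# Hypothetical result: on brand, bullets fine, but half the numbers are wrong.
scores = {"on_brand": 1.0, "correct_numbers": 0.5, "bullet_points": 1.0}

points_split = {"on_brand": 40, "correct_numbers": 50, "bullet_points": 10}
accuracy_first = {"on_brand": 20, "correct_numbers": 60, "bullet_points": 20}

print(weighted_score(scores, points_split))    # 75.0
print(weighted_score(scores, accuracy_first))  # 70.0
```

The same output scores lower under the accuracy-first weighting, which is the point: the weights encode what your team cares about most.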

Now, you have proof you can share with the team about the impact your changes are making on skills. When the next big model comes out, you can test how much better it does on your benchmark and if it's worth the extra cost.

## LEVEL 3: TURN EVALS INTO SCRIPTS

When your skill is working reliably, and you're using it frequently enough that the token cost is starting to feel significant, you need to start thinking about scripts, CLIs, or MCPs. This is where the steps get slightly more technical, but the principle is the same: Replace thinking with a structured process wherever your thinking doesn't add anything extra.

Every skill, like your PowerPoint deck skill, is a bundle of actions—pull this data, reference our brand guidelines, create a .pptx file—and some of those actions don't require a smart model. Some don't even require an LLM at all. Deconstruct your skill into its component parts and hard-code whatever you can. Code costs almost nothing to run and returns in an instant compared to LLMs, so the more of your workflow you can make deterministic, the cheaper and faster it will be.

For our PowerPoint creation task, Opus can write the HTML and CSS templates for the slide deck once; a script then fills them in whenever you need to create a deck. You can also write a script to pull the right revenue or sales figures from a data source, no LLM involved. The final export step, to .pptx format, can also be done in code.
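As a rough illustration of the deterministic part, here is a sketch that pulls a revenue figure from CSV data and fills a slide template, with no model call anywhere. The template markup, the column names (`quarter`, `revenue`), and the sample figures are all made up for the example. The .pptx export step is left out because it would need a third-party library.

```python
import csv
import io
from string import Template

# In the workflow above, a frontier model would write this template once.
SLIDE_TEMPLATE = Template(
    "<section class='slide'>\n"
    "  <h1>${title}</h1>\n"
    "  <p class='metric'>Revenue (USD): ${revenue}</p>\n"
    "</section>"
)


def load_revenue(csv_text: str) -> dict[str, int]:
    """Read quarter -> revenue from CSV text with 'quarter' and 'revenue' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["quarter"]: int(row["revenue"]) for row in reader}


def render_slide(quarter: str, revenue_by_quarter: dict[str, int]) -> str:
    """Fill the template; the same inputs always give the same HTML."""
    revenue = revenue_by_quarter[quarter]
    return SLIDE_TEMPLATE.substitute(
        title=f"{quarter} results", revenue=f"{revenue:,}"
    )


# Made-up data standing in for a real data source.
SAMPLE_CSV = "quarter,revenue\nQ1,1200000\nQ2,1350000\n"

html = render_slide("Q2", load_revenue(SAMPLE_CSV))
print(html)
```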

For tasks that require some judgment, like checking your deck's compliance with brand guidelines, don't jump straight to the most expensive model. Platforms like OpenRouter allow you to call any of the major commercial or open-source models, so you can experiment with the tradeoffs between cost and intelligence. Basic classification and summarization tasks can be done by older models 1,000 times cheaper than Opus with reasonable accuracy. Leave the most challenging tasks, such as the narrative and tailoring the tone to a specific audience, to Opus.
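One way to make that split explicit is a small routing table that maps each step of the skill to the cheapest tier that handles it well. The step names and tiers below are hypothetical, and falling back to the frontier tier for unknown steps is a design assumption: it keeps quality safe until you have tested a cheaper option against your evals.

```python
from enum import Enum


class Tier(Enum):
    CODE = "plain code, no model"
    CHEAP = "small or older model"
    FRONTIER = "frontier model"


# Hypothetical steps of the deck skill, each mapped to the cheapest workable tier.
ROUTES = {
    "pull_figures": Tier.CODE,
    "export_pptx": Tier.CODE,
    "brand_check": Tier.CHEAP,
    "classify": Tier.CHEAP,
    "summarize": Tier.CHEAP,
    "narrative": Tier.FRONTIER,
    "tone": Tier.FRONTIER,
}


def route(step: str) -> Tier:
    """Return the tier for a step; untested steps default to the frontier tier."""
    return ROUTES.get(step, Tier.FRONTIER)


for step in ("pull_figures", "brand_check", "narrative", "something_new"):
    print(f"{step}: {route(step).value}")
```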

## LEVEL 4: TURN SCRIPTS INTO BETTER SCRIPTS

In the previous step, you replaced as much LLM thinking as possible with deterministic code, bringing the cost of your PowerPoint skill down 10 to 90 percent compared to only using Opus. But you were only optimizing for your own use. When your skill is running inside a product, creating hundreds of decks a week, cost inefficiencies will again become a problem. For this, you will need to build a process to automate the optimization. Once you have 100 to 200 examples of the skill being used in the real world, a reliable basket of eval metrics, and a clear map of what the skill does at each step, you have everything you need to do so.

The most common tool for this is DSPy, which can automate the prompt engineering process end-to-end. It runs your prompt, looks at the test cases, and rewrites the prompt to arrive at a more accurate outcome, often with a cheaper model. Another common approach is distillation. You use Opus to generate hundreds of high-quality examples that pass your evals, then use those to teach a cheaper model to produce similar results. You can do that by either including the examples in the prompt so Haiku can pattern-match against them, or by fine-tuning the cheaper model directly on the examples. Think of it as a head chef writing such a good recipe that a less experienced cook can follow it perfectly. This process can cost $10, $100, or $1,000, depending on the model and how many test cases you have, but spending $1,000 to save millions in production is worth it.
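The in-prompt variant of distillation can be sketched without any library: keep only the examples that pass your evals, then put the best few in front of the new task for the cheaper model. This is not DSPy's API, only a minimal illustration. The `Example` type, the score threshold, and the sample data are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    task: str
    output: str
    eval_score: float  # 0 to 100, from your own evals


def build_few_shot_prompt(examples: list[Example], new_task: str,
                          min_score: float = 90.0,
                          max_examples: int = 3) -> str:
    """Build a prompt from the highest-scoring examples that pass the threshold."""
    passing = sorted(
        (ex for ex in examples if ex.eval_score >= min_score),
        key=lambda ex: ex.eval_score,
        reverse=True,
    )[:max_examples]
    parts = [f"Task: {ex.task}\nOutput: {ex.output}" for ex in passing]
    parts.append(f"Task: {new_task}\nOutput:")
    return "\n\n".join(parts)


# Made-up examples standing in for frontier-model outputs scored by your evals.
examples = [
    Example("Deck for Q1 board meeting", "<gold deck A>", 95.0),
    Example("Deck for sales kickoff", "<weak deck>", 60.0),
    Example("Deck for investor update", "<gold deck B>", 92.0),
]

prompt = build_few_shot_prompt(examples, "Deck for Q2 board meeting")
print(prompt)
```

The resulting prompt is what you would send to the cheaper model; fine-tuning would use the same filtered examples as training data instead.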

More experimental approaches are emerging, too. Andrej Karpathy's autoresearch runs experiments to optimize a script file against an eval metric over long periods. Researchers wake up to more than 20 experiments run overnight with meaningful performance improvements.

The great enemy at this level is overfitting: The skill or script works well against your eval metric but fails on tasks it hasn't seen before. It's "teaching to the test" for LLMs. The evals you built in Level 2 are your main defense against this, because they give you a formal rubric for grading the skill's performance. Human involvement in the evaluation process is still necessary, because we're better able to catch behavior that goes against the spirit of the game even if it's not technically wrong as defined by the rules.

If you are a manager at a company responsible for AI, you don't need to know how to implement any of this yourself. What matters is understanding that this optimization layer exists, that it's what your technical team or tools are doing under the hood, and why the decision to invest can pay off.

## YOU ARE THE MOST EXPENSIVE MODEL

All of this optimization work takes time and expertise, and your attention is an even more expensive commodity than the latest models. Attention is the key word: The ladder of incremental determinism—sessions, skills, evals, scripts, optimized scripts—gives you a framework for deciding where to invest your attention. Every hour you spend optimizing a skill is an hour you're not spending on something only you can do.

You don't need to climb the whole ladder—having reliable skills and evals is more than enough. The point is knowing the rungs exist, so when the cost pressure hits (and it will), you know exactly which lever to pull. If you're struggling with unreliable or expensive skills but don't have the capability to build scripts in house, it might be time to bring in someone technical and AI-savvy to do the heavy lifting.

The cost of tokens is falling 90 percent every year for the same level of intelligence, so the task even Opus struggles with today might be easy and cheap in 12 months. Sometimes the smartest move is to overpay now and let the market do the price optimization for you.

Prompt used (snapshot at extraction time; editable via System prompts)

You are Breviat's content extractor. You are given the cleaned Markdown content of a newsletter.

Your mission: produce a CLEAN version of the content by removing everything that is not useful information for the reader. You are a FILTER, not a summarizer.

REMOVE:
- Advertisements, sponsor inserts, mentions such as "sponsored by X", "ad", "presented by"
- Empty intros: welcome phrases, reports on the author's mood, personal anecdotes unrelated to the content
- Marketing calls to action: subscribe to the newsletter, refer a friend, "follow us on Twitter", "join our Discord"
- Signatures, legal notices, postal addresses, "view in browser", "unsubscribe"
- Buttons / CTAs / "click here" / "read more" with no content behind them
- Promotions of other products / events / paid courses from the author or third parties
- Recurring inserts such as "Read of the day" or "Quote of the day" with no informational value of their own

KEEP (in full, without summarizing or rewording):
- All announcements, news, analyses, factual commentary
- Figures, dates, company names, quotations
- Technical explanations
- Links to real sources (official announcements, papers, cited articles)
- The structure (headings, subheadings, lists)

RULES:
- Do not reword. Keep the original wording.
- Do not summarize or condense. If a section runs 200 words and is useful, keep 200 words.
- Do not add any content (no headings or transitions of your own).
- Do not fabricate any URL. Keep the original ones, or remove them.
- If the entire newsletter is advertising / promotion / useless content, output empty Markdown (nothing else).

Output: ONLY the cleaned Markdown, with no preamble or commentary on your work.
