10x Your Claude Skills — Using Karpathy’s Autoresearch Method

Posted by Rabbit on Thursday, March 26, 2026

Your Claude skill is quietly hallucinating 30% of the time—and your clients might find out before you do.

Three months ago, I handed over a landing page copywriting skill I built to a client, promising them it was “absolutely reliable.” They needed launch copy for a new product, and I told them not to worry—I used this skill myself all the time.

That evening, the client sent a screenshot: The CTA was “Learn More,” and the headline was “Transform Your Business.”

I stared at my phone for three seconds, closed it, reopened it—it was still the same screenshot.

In that moment, I realized I had no idea when this skill was actually working and when it was just “phoning it in.”

I built a system that can automatically iterate any skill in autopilot mode. This article teaches you how to run it yourself.

Once you start it, the agent begins constantly testing and polishing the skill without you doing a thing. My landing page copy skill went from a 56% quality pass rate to 92%. Zero manual effort. The agent just sat there repeatedly testing and tightening the prompt.

Below is the full methodology and the specific skill I built—you can use it directly:

P.S. Want to receive more AI workflows like this every week? Follow me.

Where This Method Comes From

Andrej Karpathy (a founding member of OpenAI, former Director of AI at Tesla, and the person who coined “vibe coding”) released a methodology called autoresearch.

The core idea is simple: instead of you manually improving things, let an AI agent do it for you in a loop.

  1. It tries a small change.
  2. It checks if the result got better.
  3. If it improved, keep it. If not, toss it.
  4. Then do it again. And again.
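The four steps above are just greedy hill-climbing, and can be sketched in a few lines of Python. The `propose_change` and `evaluate` callables below are stand-ins for the agent’s edit step and the scoring step, not part of any real API:

```python
def autoresearch_loop(prompt, propose_change, evaluate, rounds=50):
    """Greedy improvement loop: keep a small change only if the score rises."""
    best_score = evaluate(prompt)            # baseline score
    for _ in range(rounds):
        candidate = propose_change(prompt)   # 1. try a small change
        score = evaluate(candidate)          # 2. check if the result got better
        if score > best_score:               # 3. improved: keep it
            prompt, best_score = candidate, score
        # ...otherwise the candidate is tossed, and we loop again (4)
    return prompt, best_score
```

Note the loop never accepts a change that ties or lowers the score, so the kept prompt can only get better (or stay put) round over round.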

He originally used it for machine learning code, but this method applies to anything that can be measured and improved.

It made me think of an uncomfortable truth: The AI workflows you use every day—do you actually know if they are “performing well” or just “outputting text”?

Most people can’t tell the difference because nobody taught them how. You’ve taken so many AI courses and learned so many prompting tricks—but has anyone told you how to verify that what you built is actually working?

No. Everyone teaches you how to build; no one teaches you how to test.

Three Traps of Silent Failure

I couldn’t tell the difference either, and I spent a long time thinking about how this kind of failure happens. The hardest part wasn’t the client screenshot; it was what came after—I started counting how many times I had thought an output was “fine” when it was actually drifting off course.

  • The Invisible Drift: On anything the prompt doesn’t explicitly constrain, the model drifts toward “safe,” templated output that gets increasingly vague. It passes every time, but it’s slightly worse every time. By the time you notice, you have no idea which round it started failing in.
  • Survivor Bias: You only see the “decent” outputs—you open them, use them, and close them. The ones that quietly failed, lost their formatting, or missed key elements—you never know how frequent they are because you simply don’t check the logs.
  • Self-Deception: Occasionally you find a problem, manually fix that specific output, and tell yourself “I fixed it.” But you fixed that instance, not the skill itself. It will still fail in the same place next time.

I’ve done all three.

I turned Karpathy’s method into a skill that runs in Claude Code and Cowork. When you want to use it, just run it on top of your other skills. Say: “Run autoresearch on my landing page skill,” and it handles the rest.

How a Single Loop Automatically Levels Up Your Skill

Imagine this: You have a recipe. 7 out of 10 times, it’s great. The other 3 times, something is off. Maybe the sauce is bland, or the seasoning is wrong.

Instead of rewriting the whole recipe, you swap one ingredient. You cook it 10 times with that change.

  • Better? Keep the change.
  • Worse? Revert to the original.
  • Next? Change the next thing. Cook 10 more times.

After 50 rounds, your recipe succeeds 9.5 times out of 10. This is exactly what autoresearch does for your skill:

  • The “Recipe” is your skill prompt.
  • The “Cooking” is running the skill.
  • The “Tasting” is scoring the output.

Your Only Job: Define the Scoring Criteria

Defining the checklist that tells the agent what “good” looks like is the only work you do in this process.

Define it with a simple checklist of Yes/No questions. Each question checks one specific aspect of the output. Pass or Fail—it’s that simple.

The agent uses this checklist to score every output. The score tells it whether a change is helping or hurting. Imagine a teacher grading a paper with a checklist:

  • Wrong way: “Rate the writing quality from 1-10” (vague, inconsistent).
  • Right way: “Did the student write a thesis? Yes or No.” “Is every citation sourced? Yes or No.”

Grade 100 papers with that checklist, and the result is consistent every time. A checklist for a landing page skill might look like this:

  • “Does the headline contain specific numbers or quantifiable results?” (Not “Better copy,” but “Recover ad spend in 3 days”)
  • “Does the opening sentence name a specific pain point?”
  • “Does the CTA clearly tell the user what happens after this step?”
  • “Does the text avoid zero-information words like ‘disruptive,’ ‘industry-leading,’ or ‘optimal’?”
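Mechanically, a checklist like this is just a list of yes/no checks plus a pass rate. A minimal sketch—the keyword-matching predicates below are toy stand-ins for judgments the agent itself would make:

```python
def score_output(output, checks):
    """Pass rate over a yes/no checklist; each check returns True or False."""
    return sum(check(output) for check in checks) / len(checks)

# Toy stand-in checks; the real yes/no judgments are made by the agent.
BUZZWORDS = {"disruptive", "industry-leading", "optimal"}

checks = [
    lambda text: any(ch.isdigit() for ch in text),               # specific number present?
    lambda text: not any(w in text.lower() for w in BUZZWORDS),  # buzzword-free?
]

score_output("Recover ad spend in 3 days", checks)  # passes both checks → 1.0
```

Because every check is binary, grading the same output twice always gives the same score—which is exactly why the “1-10 rating” approach fails and the checklist approach doesn’t.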

You don’t need to come up with these yourself. When you start autoresearch, the agent will guide you through it. It will ask you what “good” means and help turn vague feelings into checkable questions. 3-6 questions is the optimal number. Don’t be greedy—I tried adding 10, and the skill started “gaming” the checklist, making the actual output worse.

How to Run It

  1. Step 1: Download the skill and put it in your skills folder.
  2. Step 2: Pick a skill to improve. Say “Run autoresearch on my [Skill Name] skill.” Pick the one that gives you the most headaches—the inconsistent one.
  3. Step 3: The agent asks for 3 things: which skill to optimize, a test input (e.g., “Write copy for an AI productivity tool”), and your checklist questions.
  4. Step 4: It runs your skill once to get a baseline. My landing page skill started at 56%.
  5. Step 5: A live dashboard pops up in your browser showing the score curve and change logs.
  6. Step 6: Walk away. The agent starts the loop. It finds the weakest point, tweaks it, tests it, and keeps it if the score rises. It runs until you stop it or it hits 95% three times in a row.
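The stopping rule in Step 6—“until it hits 95% three times in a row”—is simple to express. A sketch of that check, with the threshold and streak length as described above:

```python
def should_stop(score_history, threshold=0.95, streak=3):
    """True once the most recent `streak` scores all meet the threshold."""
    recent = score_history[-streak:]
    return len(recent) == streak and all(s >= threshold for s in recent)
```

So a run with scores [0.56, 0.80, 0.96] keeps going, while [0.80, 0.95, 0.96, 0.97] stops.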

What Happened to My Landing Page Skill?

56% → 92%. 4 rounds of changes, 3 kept, 1 reverted. The agent actually made these changes to my prompt:

  • Added a hard rule: “Headlines MUST include specific numbers. Prohibited: vague promises like ‘Transform your business’.”
  • Added a “Banned Buzzword” list: “Never use: revolutionary, cutting-edge, synergy, next-level, game-changing, leverage, unlock, transform.”
  • Added actual examples: It included a snippet of a high-quality landing page and labeled where the pain-point opening and CTA were.
  • Reverted a change: It tried a stricter word count limit, then reverted it because the copy became too thin and the CTA quality dropped. (The system recognizes changes that look like improvements in isolation but hurt the overall output.)

In the end, I got: The improved skill, a log of every round’s score, and a full changelog (what was changed, why, and the result). That changelog is probably the most valuable thing—it’s the “lessons learned” for that skill. When a stronger model comes out, give it this changelog, and it can pick up exactly where the last agent left off.
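A changelog entry only needs a handful of fields to be useful to a future agent. Here is one hypothetical shape—the field names and the intermediate scores are illustrative, not the skill’s actual log format (only the 56% baseline is a real number from this run):

```python
import json

# Hypothetical changelog shape; intermediate scores here are made up for illustration.
changelog = [
    {"round": 1, "change": "Require specific numbers in headlines",
     "score_before": 0.56, "score_after": 0.71, "kept": True},
    {"round": 4, "change": "Stricter word-count limit",
     "score_before": 0.89, "score_after": 0.84, "kept": False},
]

print(json.dumps(changelog, indent=2))
```

The `kept: False` entries matter as much as the kept ones: they tell the next agent which dead ends not to revisit.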

Honestly, the biggest change after running autoresearch wasn’t the score. It was the feeling. I used to feel uneasy delivering a skill, hoping it wouldn’t fail. Now, I know exactly when it works and how to find the problem if it doesn’t.

Moving from luck to a system—that is the most valuable thing.

This Method Works for More Than Just Skills

Anything that can be scored can use this method.

  • Website Speed: Change one thing, measure speed, keep or revert.
  • Cold Outreach: Define your checklist: “Mentioned their company? Under 75 words? Specific question at the end?” Let the agent run 50 variants.
  • Newsletter Hooks: “Does the opening have personal details?” “Is it free of clichés?”

If it can be scored, it can be autoresearched.


