Claude Code Club

Build an AI Agent in a Weekend with Claude Code

Developer · Weekend · Intermediate

What you'll build

A weekend is enough time to build a real agent if you pick one job and stay disciplined. Start with a clear task definition, give the agent two or three tools via the Anthropic tool-use API, add a lightweight memory layer backed by SQLite, and write a tiny eval harness so you can tell objectively whether changes make the agent better or worse. Deploy as a small Hono service behind an API key with server-sent-event streaming so the UI feels responsive while the model thinks.

What you're building

You're building an AI agent that does one job well. Not a chatbot, not a general assistant. An agent in the strict sense: a loop that calls a model, lets the model use tools, feeds results back, and stops when the job is done. Examples that fit a weekend: a research agent that summarizes a topic from five sources, a code review agent that comments on a pull request, a triage agent that reads support email and routes it to the right queue, or a job-board agent that finds three relevant postings every morning and writes a one-paragraph summary of each.

By Sunday night you should have a working agent you can call from the command line and from an HTTP endpoint, with a small evaluation harness that proves it does the job better than a single prompt would. That last part is what separates a real agent from a clever demo. Without an eval, every change feels like an improvement and most are sideways. With one, you can ship and iterate with the confidence of numbers.

Pick the job before the architecture. Agents that try to be 'helpful in general' produce mediocre output everywhere. Agents with a sharp definition of done outperform much larger systems on their narrow task. Write one sentence that describes the input, one sentence that describes the output, and one sentence that describes success. If you can't write those three sentences, the project isn't ready for code yet.

What you need before you start

You need an Anthropic API key (or an OpenAI key if you decide to swap models later), Node 20 or later, and some comfort with TypeScript. You should have read Anthropic's tool-use docs once, even if you only skimmed them. You need a way to test the agent against real inputs, which means having ten to twenty example tasks written down before you write any code. Without examples, you can't tell if your agent is improving or just changing. The examples should include three or four cases the model will probably get wrong on the first try. Those are the ones worth optimizing against, because the easy ones tell you nothing about quality.

Claude Code installed locally, plus an Anthropic API key
Node 20 or later and pnpm
@anthropic-ai/sdk in your package.json
Hono for the HTTP server, or fastify if you prefer
A SQLite file or a Turso database for memory
Ten to twenty example tasks with expected behavior

Saturday morning: the agent loop

Start with a single file, agent.ts, that exports a runAgent function. The function takes a task string, calls the model with a system prompt that describes the job, and exposes two or three tools. Use Anthropic's tool_use API, not a wrapper framework. You'll understand the loop better and debug faster. The loop is simple: send messages, check for tool_use blocks in the response, run the tools, append the results as tool_result blocks, send again. Stop when the model returns end_turn with no tool use.
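The loop described above can be sketched in a few dozen lines. This is a minimal illustration, not the definitive implementation: the `callModel` and `toolImpls` parameters are assumptions added so the sketch runs without the SDK installed (in a real agent.ts, `callModel` would wrap the Anthropic SDK's `client.messages.create` call with your system prompt and tool schemas), and the `maxIterations` default of ten is borrowed from the gotchas section later in this guide.

```typescript
// Minimal local types that mirror the content blocks the loop touches.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type AgentMessage = {
  role: "user" | "assistant";
  content: string | Array<TextBlock | ToolUseBlock | ToolResultBlock>;
};
type ModelReply = { stop_reason: string; content: Array<TextBlock | ToolUseBlock> };

// Assumptions for this sketch: callModel stands in for the SDK call,
// and toolImpls maps tool names to their TypeScript implementations.
type CallModel = (messages: AgentMessage[]) => Promise<ModelReply>;
type ToolImpls = Record<string, (input: unknown) => Promise<string>>;

async function runAgent(
  task: string,
  callModel: CallModel,
  toolImpls: ToolImpls,
  maxIterations = 10,
): Promise<string> {
  const messages: AgentMessage[] = [{ role: "user", content: task }];

  for (let i = 0; i < maxIterations; i++) {
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.content });

    const toolUses = reply.content.filter(
      (block): block is ToolUseBlock => block.type === "tool_use",
    );
    if (toolUses.length === 0) {
      // No tool requested: the text blocks are the final answer.
      return reply.content
        .filter((block): block is TextBlock => block.type === "text")
        .map((block) => block.text)
        .join("\n");
    }

    // Run each requested tool and return all results in a single user turn.
    const results: ToolResultBlock[] = [];
    for (const use of toolUses) {
      const impl = toolImpls[use.name];
      const content = impl ? await impl(use.input) : `Unknown tool: ${use.name}`;
      results.push({ type: "tool_result", tool_use_id: use.id, content });
    }
    messages.push({ role: "user", content: results });
  }
  throw new Error(`Agent did not finish within ${maxIterations} iterations`);
}
```

Passing the model call in as a function also makes the loop easy to test with a scripted fake model, which is handy once you start writing evals.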

Resist using LangChain, LlamaIndex, or any framework this weekend. They will save you an hour on Saturday and cost you four hours on Sunday when something goes wrong inside their abstraction. Raw SDK calls are around two hundred lines of TypeScript total. You can read all of it, which means when the agent behaves oddly you can find the bug in fifteen minutes rather than reading three layers of someone else's documentation.

Spend time on the system prompt. It's the agent's job description, and a tight one is worth more than another tool. Tell the model what it is, what success looks like, the format of the final answer, and what to do when stuck. A good system prompt is around two hundred words, opinionated, and rewritten three or four times during the weekend as you watch the agent fail on real tasks.

Saturday afternoon: tools

Tools are functions the model can call. Each one needs a JSON schema, a description, and a TypeScript implementation. Start with two: a search tool that hits Brave Search or Exa, and a fetch tool that downloads a URL and returns clean text via something like @mozilla/readability. If your agent writes anything to a database or file system, add a third tool for that, but keep the surface narrow. Every tool you add is a new failure mode, a new schema to maintain, and a new branch in the model's decision tree.

Tool descriptions matter more than tool names. The model decides which tool to call by reading the description, not the function signature. Write each one as if you were explaining to a smart intern when to use it and when not to. Include one example of a good input and one of a bad input. The descriptions should be three to five sentences. Anything shorter is too vague, anything longer crowds the context window.
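To make the schema-plus-description idea concrete, here is a hedged sketch of one tool definition. The `name`, `description`, and `input_schema` fields follow the shape the tool-use API expects; the tool name `fetch_page`, the description wording, and the `parseFetchInput` helper are illustrative assumptions, not something this guide prescribes.

```typescript
// Shape of one entry in the `tools` array sent with a model call.
type ToolDefinition = {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
};

// Hypothetical tool name and wording: tune the description against your own evals.
const fetchPageTool: ToolDefinition = {
  name: "fetch_page",
  description: [
    "Download one web page and return its main text with navigation and ads removed.",
    "Use it after a search when you need the full content of a specific result.",
    "Do not use it to search: it takes one complete URL, not a query.",
    "Good input: https://example.com/blog/post-1.",
    "Bad input: 'latest posts about agents'.",
  ].join(" "),
  input_schema: {
    type: "object",
    properties: {
      url: { type: "string", description: "Absolute http(s) URL of the page to fetch" },
    },
    required: ["url"],
  },
};

// Validate the model's input before running the tool, so a malformed call
// fails with a message the model can read rather than deep inside fetch.
function parseFetchInput(input: unknown): { url: string } {
  if (typeof input !== "object" || input === null) {
    throw new Error("input must be an object");
  }
  const url = (input as Record<string, unknown>).url;
  if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
    throw new Error("url must be an absolute http(s) URL");
  }
  return { url };
}
```

The description carries the when-to-use and when-not-to-use guidance, while the schema stays as small as the tool allows.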

Saturday evening: memory

Memory is where weekend agents usually break. You need two kinds. Session memory is just the message list, which you keep in an array and resend on every model call within a single run, because the API itself is stateless. Long-term memory is anything the agent should remember between runs, and that needs a store. Use SQLite via better-sqlite3 for local dev and Turso for production. Define a tiny schema: a memories table with id, text, embedding, and created_at. Add a remember tool that writes, and a recall tool that does a similarity search.

Generate embeddings with text-embedding-3-small from OpenAI or with voyage-3 from Voyage AI, the embeddings provider that Anthropic's documentation points to. Either works. Don't introduce a vector database this weekend. SQLite with cosine similarity in TypeScript is plenty fast for under ten thousand memories. A brute-force scan at that size takes on the order of milliseconds per query, and the operational footprint is one file on disk. You can move to pgvector or Qdrant later when traffic justifies the extra surface area.
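The recall side of this can be sketched without any database code at all. The example below assumes you have already loaded the rows from SQLite into memory and parsed each embedding into a number array; storing the embedding as JSON text, the `MEMORIES_SCHEMA` constant, and the `recall` signature are assumptions made for illustration.

```typescript
// One possible schema for the memories table (embedding stored as JSON text is an assumption).
const MEMORIES_SCHEMA = `
  CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL,
    embedding TEXT NOT NULL,
    created_at TEXT NOT NULL
  )`;

type Memory = { id: number; text: string; embedding: number[]; created_at: string };

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Brute-force recall: score every memory against the query and keep the top K.
function recall(queryEmbedding: number[], memories: Memory[], topK = 3): Memory[] {
  return memories
    .map((memory) => ({ memory, score: cosineSimilarity(queryEmbedding, memory.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((scored) => scored.memory);
}
```

Scanning every row is the whole trick: there is no index to build or keep in sync, which is why it fits in a weekend.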

Sunday morning: evaluation

This is the step most people skip and most people regret. Write an evals.ts file that loads your ten to twenty example tasks, runs the agent against each, and scores the output. Scoring can be exact match, regex, or a second call to Claude that judges the result. Run the evals before and after every change. If a tweak that felt smarter scores lower, undo it. Without this loop you're just vibing, which feels productive while you're typing and depressing when you can't tell if last week's version was actually better.

Keep the eval cheap and fast. The whole suite should run in under two minutes and cost less than a dollar. If it doesn't, you'll stop running it, which defeats the point. Start with five examples, get the harness clean, then grow to twenty. Examples should be the actual tasks you want the agent to handle in production, not synthetic toy problems, because synthetic improvements don't transfer.

1. Define ten to twenty tasks with expected behavior
2. Write a runner that loops over them
3. Score each result with a deterministic check or a judge prompt
4. Save the run as JSON with timestamp and git commit
5. Compare runs before and after each change
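The steps above can be sketched as a small runner. This is one possible shape for evals.ts under stated assumptions: the `EvalTask` format, the `runEvals` and `findRegressions` names, and passing the git commit in as a string are all illustrative choices, and the judge-prompt scorer is left out so the sketch stays deterministic.

```typescript
type Check = { kind: "exact"; expected: string } | { kind: "regex"; pattern: string };
type EvalTask = { id: string; input: string; check: Check };
type EvalResult = { id: string; passed: boolean; output: string };
type EvalRun = { timestamp: string; commit: string; passRate: number; results: EvalResult[] };

// Deterministic scoring only: exact match after trimming, or a regex test.
function scoreOutput(check: Check, output: string): boolean {
  if (check.kind === "exact") return output.trim() === check.expected;
  return new RegExp(check.pattern).test(output);
}

// Run the agent on every task; a thrown error counts as a failed task, not a crashed suite.
async function runEvals(
  tasks: EvalTask[],
  agent: (input: string) => Promise<string>,
  commit: string,
): Promise<EvalRun> {
  const results: EvalResult[] = [];
  for (const task of tasks) {
    let output: string;
    try {
      output = await agent(task.input);
    } catch (err) {
      output = `ERROR: ${String(err)}`;
    }
    results.push({ id: task.id, passed: scoreOutput(task.check, output), output });
  }
  const passedCount = results.filter((r) => r.passed).length;
  return {
    timestamp: new Date().toISOString(),
    commit,
    passRate: tasks.length === 0 ? 0 : passedCount / tasks.length,
    results,
  };
}

// Tasks that passed in the earlier run and fail in the later one.
function findRegressions(before: EvalRun, after: EvalRun): string[] {
  const passedBefore = new Set(before.results.filter((r) => r.passed).map((r) => r.id));
  return after.results.filter((r) => !r.passed && passedBefore.has(r.id)).map((r) => r.id);
}
```

Serializing each `EvalRun` with `JSON.stringify` gives you the saved run from step 4, and `findRegressions` answers the question that matters most after a change: which tasks did you just break?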

Choices to make along the way

Claude versus GPT versus a local model: Claude is the best at following multi-step tool plans as of mid-2026, and the Anthropic SDK has the cleanest tool-use API. GPT-4-class models are competitive and cheaper at scale. Local models via Ollama are tempting but tool-use reliability is still uneven below thirty billion parameters. Start with Claude, swap later if you need to.

Hono versus Fastify versus a Cloudflare Worker: Hono is the right default because it runs identically on Node, Bun, and Cloudflare. If you want to deploy to a Worker on Sunday night, Hono is the only option that doesn't require rewriting. Fastify is fine if you're staying on Node forever.

Sunday afternoon: shipping

Wrap the agent in a Hono server with one POST endpoint, /run, behind a simple bearer-token check. Deploy to Fly, Render, or a Cloudflare Worker. Add logging that captures every tool call with timing, the full message history, and the final result. You'll want this when an agent does something weird in production. Save logs to a file in dev and to Logtail or Axiom in production. Set a per-request budget cap in dollars too, so a runaway agent can't cost you more than the cap before the loop aborts and returns a clear error.
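The endpoint logic is small enough to sketch in full. To keep the example runnable without dependencies, it is written against the standard `Request` and `Response` types available in Node 20 rather than Hono's own API, so treat it as the logic to port into a Hono route, not a drop-in one. The `createRunHandler` name, the `{ task }` request body, and the `{ result }` response body are assumptions.

```typescript
type RunAgentFn = (task: string) => Promise<string>;

function jsonResponse(body: unknown, status = 200): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// Builds a handler for POST /run that checks a bearer token before running the agent.
function createRunHandler(runAgentFn: RunAgentFn, apiToken: string) {
  return async function handleRequest(req: Request): Promise<Response> {
    const url = new URL(req.url);
    if (req.method !== "POST" || url.pathname !== "/run") {
      return jsonResponse({ error: "Not found" }, 404);
    }
    if (req.headers.get("authorization") !== `Bearer ${apiToken}`) {
      return jsonResponse({ error: "Unauthorized" }, 401);
    }

    let task: unknown;
    try {
      task = ((await req.json()) as { task?: unknown }).task;
    } catch {
      task = undefined; // Body was not valid JSON.
    }
    if (typeof task !== "string" || task.length === 0) {
      return jsonResponse({ error: "Body must be JSON with a non-empty 'task' string" }, 400);
    }

    const result = await runAgentFn(task);
    return jsonResponse({ result });
  };
}
```

Checking the token before parsing the body means unauthenticated callers never reach the agent or spend any of your API budget.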

If you want others to run their own copies, push the repo to GitHub with a clean README. The club at claudecodeclub.ai shares agent repos every week, and a well-documented weekend project is a good first contribution to show off what nine dollars a month and a couple of evenings can produce. The README should include a sample task, a sample output, and a one-line install command. Anything more than three steps and people bounce.

Stream the agent's intermediate steps when calling from a UI. The model can take ten or twenty seconds per loop iteration, and a blank screen during that time feels broken. Server-sent events from the Hono endpoint work cleanly, and Claude Code can write the streaming handler in one go. Even a simple 'thinking' line that updates every second turns a slow agent into a satisfying one.
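Server-sent events are just text frames, which a short sketch can show. The `AgentEvent` type, the event names, and the `sseStream` helper below are assumptions for illustration; the frame layout (an `event:` line, a `data:` line, then a blank line) is the standard event-stream format. The sketch uses the web-standard `ReadableStream` and `TextEncoder` globals available in Node 20.

```typescript
type AgentEvent =
  | { type: "thinking"; iteration: number }
  | { type: "tool_call"; name: string; input: unknown }
  | { type: "final"; answer: string };

// Serialize one event as an SSE frame. JSON.stringify escapes newlines,
// so the payload always fits on a single data: line.
function toSseFrame(event: AgentEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Turn an async sequence of agent events into a byte stream of SSE frames.
function sseStream(events: AsyncIterable<AgentEvent>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = events[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const next = await iterator.next();
      if (next.done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(toSseFrame(next.value)));
      }
    },
  });
}
```

The stream would be returned as a response body with a `Content-Type: text/event-stream` header; because frames are pulled one at a time, the UI sees each step as soon as the agent emits it.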

How to extend it

After v1, the natural next steps are a planning layer that breaks a task into subtasks before the loop starts, a reflection step where the agent critiques its own output before returning, and a longer-term memory that summarizes past runs into a profile. Each of those is a weekend project on its own. Don't try to add them in the first weekend. Ship the simple loop, watch it run for a week, then add the layer the runs actually need.

A nice fourth extension is multi-agent. One worker per subtask, a coordinator that delegates and merges, and a shared scratch space. Multi-agent is overkill for most jobs but shines when subtasks are obviously parallel, like comparing five competitor sites or processing a batch of pull requests independently. Add it only when single-agent runs visibly bottleneck on serial tool calls.

Common gotchas

Three omissions cause most weekend-agent failures. The first is an uncapped loop: the model sometimes calls a tool, dislikes the result, and calls it again with the same arguments, so add a small loop counter and cap the number of iterations at ten. The second is unbounded conversation history, which makes calls slow and expensive, so truncate it when it gets long. The third is a missing timeout on tool calls, which lets a stuck fetch hang the whole agent. Finally, don't trust your own judgment about whether the agent got smarter after a change. Run the evals.
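The tool-call timeout is the easiest of these to get wrong, so here is a minimal sketch of one way to do it. The `withTimeout` helper and its parameters are hypothetical names, not part of any SDK. Note the limitation stated in the comments: it stops the agent from waiting, but it does not cancel the underlying work.

```typescript
// Reject if `work` has not settled within `ms` milliseconds.
// This only stops waiting: the underlying operation keeps running unless
// you also cancel it yourself (for example with an AbortController on fetch).
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so a finished call does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Wrapping every tool implementation in a helper like this gives the loop a predictable worst-case duration per iteration.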

One more: not handling structured tool errors. When a tool fails, return a tool_result with an error message the model can read, not a thrown exception that crashes the loop. The model is good at recovering from a clear error message and bad at recovering from a silent abort. A weekend agent that handles errors gracefully feels twice as smart as one that doesn't, even when the underlying logic is identical.
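A hedged sketch of that pattern: catch whatever the tool throws and hand the model a `tool_result` block flagged with `is_error`, which is the field the tool-use API provides for failed calls. The `runToolSafely` helper and the wording of the error messages are assumptions.

```typescript
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};
type ToolMap = Record<string, (input: unknown) => Promise<string>>;

// Run one tool call and always return a tool_result, never throw.
async function runToolSafely(
  toolUseId: string,
  name: string,
  input: unknown,
  tools: ToolMap,
): Promise<ToolResult> {
  const tool = tools[name];
  if (!tool) {
    const available = Object.keys(tools).join(", ");
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `Unknown tool "${name}". Available tools: ${available}`,
      is_error: true,
    };
  }
  try {
    return { type: "tool_result", tool_use_id: toolUseId, content: await tool(input) };
  } catch (err) {
    // Give the model a readable reason so it can retry differently or give up cleanly.
    const reason = err instanceof Error ? err.message : String(err);
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `Tool "${name}" failed: ${reason}`,
      is_error: true,
    };
  }
}
```

The loop then treats failures like any other result: append the block, call the model again, and let it decide what to do next.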

What makes Claude Code the right tool for building agents

There is a neat recursion at play when you use Claude Code to build a Claude-powered agent. Claude Code writes the tool schemas, the system prompt, the loop logic, and the eval harness. You describe in plain English what the agent should do and what the tools should expose, and Claude Code produces a working skeleton in one session. In this role Claude Code acts as a high-context code generator that understands the Anthropic SDK well enough to usually produce correct tool_use patterns on the first try, work that used to take developers hours of reading documentation.

When you ask Claude Code to 'add a web search tool that returns the top five results with titles and snippets,' it writes the Brave Search or Exa API call, the JSON schema describing the tool's parameters, the TypeScript function, and the wiring into the tool dispatch loop. It also suggests the right description text for the tool, which is the detail most developers underestimate until they watch the model make poor tool choices because the description was vague.

The club at claudecodeclub.ai runs a standing thread on agent builds where members post their eval scores week over week. The pattern is consistent: members who use Claude Code to write the eval harness first - before the tools, before the system prompt - improve their agents twice as fast as members who write the agent first and the evals last. The eval harness is the feedback loop, and Claude Code makes it trivial to write.

Deploying the agent and making it useful beyond a demo

An agent that lives only in a terminal is a prototype. One that runs as a Hono HTTP service, returns streaming responses, and can be called from a Slack bot, a cron job, or a web UI is a product. The deploy step is a Sunday-afternoon hour, not a new project. Ask Claude Code to wrap the runAgent function in a Hono POST /run route, add bearer token auth, add a Content-Type: text/event-stream response for streaming, and write a Dockerfile or a fly.toml for Fly.io. The result is an API that any client can call with a single fetch request.

Set a per-request dollar cap as a first-class parameter, not an afterthought. Track cumulative token spend inside the loop, estimate cost at current model pricing, and abort with a clear error message if the budget is exceeded. This is especially important if you expose the agent to users other than yourself, because an uncapped agentic loop on user-provided input is a billing liability. Claude Code can write the cost-tracking logic in about ten lines if you give it the current input and output token prices for the model you are using.
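The cost-tracking logic might look like the sketch below. The `createBudget` helper and the `Pricing` shape are hypothetical; the `input_tokens` and `output_tokens` field names match the usage object returned with each model response. No real prices are hard-coded: you pass in the per-million-token rates for whichever model you use.

```typescript
type Usage = { input_tokens: number; output_tokens: number };
type Pricing = { inputPerMillionUsd: number; outputPerMillionUsd: number };

// Tracks cumulative spend for one request and throws once the cap is exceeded.
function createBudget(capUsd: number, pricing: Pricing) {
  let spentUsd = 0;
  return {
    // Call after every model response with that response's usage.
    record(usage: Usage): number {
      spentUsd +=
        (usage.input_tokens / 1_000_000) * pricing.inputPerMillionUsd +
        (usage.output_tokens / 1_000_000) * pricing.outputPerMillionUsd;
      if (spentUsd > capUsd) {
        throw new Error(
          `Budget exceeded: spent $${spentUsd.toFixed(4)} of $${capUsd.toFixed(2)} cap`,
        );
      }
      return spentUsd;
    },
    spent: (): number => spentUsd,
  };
}
```

Create one budget per request and call `record` inside the loop after each model call; the thrown error is what the endpoint turns into a clear message for the caller.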

Log everything. Every tool call - the name, arguments, response, and latency. Every model call - the input token count, output token count, stop reason. Write a run summary at the end with total cost, total steps, and the final answer. Send the summary to Logtail, Axiom, or a simple append-only JSONL file. When an agent run produces a wrong answer in production, the log is the only way to reconstruct what happened. Agents that don't log are black boxes, and black boxes erode trust faster than occasional bad answers.
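As a rough sketch of what such a log could look like, the helper below collects tool and model calls in memory and emits one JSONL line per run. Every name here (`createRunLog`, the field names, counting one step per model call) is an assumption made for illustration; writing the line to a file or a log service is left to the caller.

```typescript
type ToolCallLog = { name: string; args: unknown; response: string; latencyMs: number };
type ModelCallLog = { inputTokens: number; outputTokens: number; stopReason: string };

// Collects everything that happened in one agent run.
function createRunLog(runId: string) {
  const toolCalls: ToolCallLog[] = [];
  const modelCalls: ModelCallLog[] = [];
  return {
    tool(entry: ToolCallLog): void {
      toolCalls.push(entry);
    },
    model(entry: ModelCallLog): void {
      modelCalls.push(entry);
    },
    // One JSON object per line, so the file can be appended to and searched line by line.
    toJsonl(finalAnswer: string, totalCostUsd: number): string {
      const summary = {
        runId,
        finishedAt: new Date().toISOString(),
        totalSteps: modelCalls.length, // Assumption: one step per model call.
        totalCostUsd,
        finalAnswer,
        toolCalls,
        modelCalls,
      };
      return JSON.stringify(summary) + "\n";
    },
  };
}
```

Because each run is a single line, reconstructing a bad production run is a matter of finding its `runId` and reading the tool calls in order.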

1. Write one sentence each for the input, the output, and the definition of success
2. Write ten example tasks with expected behavior before any code
3. Scaffold agent.ts with the raw Anthropic SDK, no frameworks
4. Add two tools: search and fetch - get eval scores before adding more
5. Write evals.ts and run it - establish a baseline score
6. Add SQLite memory with remember and recall tools
7. Wrap in a Hono server with bearer auth and SSE streaming
8. Deploy to Fly.io or Cloudflare Workers - share the endpoint

Common questions

Why avoid LangChain and similar frameworks?
They save you an hour on Saturday and cost you four on Sunday when something breaks inside their abstraction. Raw Anthropic SDK calls are around two hundred lines of TypeScript total, and you can debug every part of the loop.
How many tools should the agent have?
Two to start, three at most for v1. More tools confuse the model about which to call. Start with a search tool and a fetch tool, and only add a third when your evaluation runs show you need it.
Do I need a vector database for memory?
No. SQLite via better-sqlite3 with cosine similarity computed in TypeScript handles under ten thousand memories comfortably. Move to a vector database only when you cross that scale or need server-side filtering.
What does the evaluation harness actually do?
It loads ten to twenty example tasks, runs the agent against each, and scores the output with either a deterministic check, a regex, or a judge prompt sent back to Claude. You compare scores before and after each change to know if you're improving.
Claude or GPT for the model?
Claude is currently the best at multi-step tool plans, and the Anthropic SDK has the cleanest tool-use API. Start there. GPT-4-class models are competitive and cheaper at high volume, so swap if cost becomes the constraint.
Why cap the iteration loop?
Models sometimes call the same tool repeatedly when they don't like a result. A loop counter capped at ten iterations stops a stuck agent from burning your API budget while you sleep.
