AI Agent Memory Architecture: Build Persistent L1/L2 Memory That Survives Sessions (2026)

Quick Answer

AI agent memory is the difference between a useful assistant and a disposable chatbot. Most AI agents — whether built on GPT, Claude, Gemini, or open-source models — forget everything between sessions. The context window is not storage; it’s RAM. I run a multi-agent system that manages trading, content publishing, smart home automation, and scheduled tasks — and every time a session ended, it forgot everything. Credentials, preferences, which API endpoints were sunset, which product variants actually worked. The fix wasn’t a bigger context window, a vector database, or RAG — it was a three-tier memory architecture: L1 (short-term working memory), L2 (long-term persistent facts), and procedural skills (reusable workflows). A retrieval strategy surfaces the right facts at the right time. A background “dreaming” process consolidates and prunes memory during idle periods. The system now self-corrects across sessions, shares knowledge across multiple agent harnesses (Telegram, CLI, scheduled workers), and runs in Docker on a NAS. Here’s exactly how it works and how to build your own.


The Problem I Hit: Every Session Starts From Zero

If you’ve built anything non-trivial with an AI agent — trading automation, content publishing, multi-platform bots — you’ve hit this wall. The agent is brilliant within a single session. You give it context, it reasons through it, executes flawlessly. Then the session ends.

Next time you open a chat, it’s meeting you for the first time.

I had an AI agent managing a MetaTrader integration, publishing articles to WordPress and Shopify, running scheduled monitoring jobs, and controlling smart home devices through Telegram. Every new session meant re-explaining the same things: which API endpoints still worked and which were sunset, which product variant IDs were valid, what the automation pipeline architecture looked like, what my preferences were for output format and notification rules. The list grew every week.

This isn’t just annoying — it’s architecturally fatal for an agent that’s supposed to operate autonomously. If it forgets which endpoints are broken, it’ll call them and waste tokens on error handling. If it forgets your preferences, it’ll generate output you have to reject. If it forgets the pipeline layout, it’ll rebuild it from scratch instead of working with what exists.

The context window isn’t the bottleneck. Storage is. LLMs can handle thousands of tokens of injected context. The problem is nobody’s feeding them the right context at the right time.

What I Tried First (And Why It Didn’t Work)

Before settling on the architecture described below, I went through three failed approaches — each one teaching me something about what agent memory actually needs.

Approach 1: RAG Over Conversation History

My first instinct was to treat past conversations like a document store. Store every session in a vector database, embed them, retrieve relevant chunks when a new session starts. Standard RAG (retrieval-augmented generation) — the same pattern used for enterprise knowledge bases.

The problem: conversations are terrible documents. A trading discussion contains contradictory statements (“I think the pair is bullish” followed by “actually, let’s wait for confirmation”). Vector search returns the most semantically similar chunk, which might be the wrong sentiment. RAG is designed for factual retrieval, not conversational context. And worse — it retrieves based on similarity, not importance. The fact that an API endpoint is sunset is just as “similar” to a query as the fact that it still works. You need determinism, not probability.

Approach 2: Summarize Everything Into a System Prompt

Next I tried auto-summarizing each session and injecting the summary into the next session’s system prompt. This worked better — the agent had a rough memory of past interactions.

But summaries are lossy. General preferences survive a summary. Specific technical details don’t — they get compressed into something vague. The agent loses the specificity that makes it actually useful. “User has a Printful integration” is useless. “Mockup generator uses v1 endpoints, /v1/files returns 404, position field is required” is useful. Summaries can’t preserve that granularity.

Approach 3: Manual Context Injection

So I started manually writing context notes and injecting them into each session. It worked, but it was just a more sophisticated version of re-explaining everything. I was doing the memory work the agent was supposed to do.

None of these approaches solved the core problem: the agent needed structured, prioritized, queryable facts — not embeddings, not summaries, not prose.

The Architecture: L1, L2, and the Retrieval Strategy

Here’s what I built. The memory system has three layers: L1 (short-term working memory), L2 (long-term persistent facts), and skills (procedural workflows), plus a retrieval strategy that decides what gets loaded and when.

| Layer | Purpose | Scope | Update Pattern |
|---|---|---|---|
| L1: Session Memory | Working memory | Current conversation | Automatic (every message) |
| L2: Persistent Memory | Long-term facts | Cross-session | Agent writes, user corrects |
| Skills | Procedural workflows | Task-specific | Agent writes after task completion |

L1 — Short-Term Working Memory

This is the standard conversation context. Every message in the current session is visible to the agent. It’s fast, it’s complete within the session, and it’s ephemeral — when the session ends, L1 is gone.

L1 is what every chatbot already does. The interesting part is what happens after L1 disappears: which facts survive into L2, and how does the agent decide what to load next time?

L2 — Long-Term Persistent Memory

This is the layer that changed everything. Think of it as a structured notepad that lives across sessions. Every fact the agent learns — or you correct it on — gets written here as a discrete, atomic entry.

Here’s what a real L2 store looks like after months of use:

TTS: British English voice, +30% speed, deliver as MP3.
§
E-commerce: Store ID 48291. Mockup gen = v1 endpoints. 
  Templates: API can't create — use dashboard.
  Products: transparent PNG, large fonts (240/200px).
§
Trading: Platform XYZ, account type standard. 
  Risk per trade: 4%. Architecture: COPROCESSOR pattern.
  7 scheduled tasks + executor + monitoring API.
§
LESSONS: Close desktop client before updating. 
  Never run watchdog --dry-run repeatedly.
  Use os.replace() not os.rename() on Windows.

Every entry is a short block of terse, factual statements grouped under a topic label, with § separating one entry from the next. No prose. No summaries. Just facts with clear separators. When a new session starts, the entire L2 store is injected into the system prompt, so the agent reads it before it says a single word.
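To make the injection step concrete, here is a minimal sketch in Python. It assumes the store is a single UTF-8 text file using the § separator shown above. The function names (`parse_l2`, `build_system_prompt`, `load_l2`) and the "Persistent memory (L2)" label are illustrative choices, not part of any particular framework.

```python
from pathlib import Path

SEPARATOR = "§"  # entry separator used in the store shown above


def parse_l2(text: str) -> list[str]:
    """Split the raw store into entries, dropping blank ones."""
    entries = []
    for chunk in text.split(SEPARATOR):
        # Collapse indented continuation lines into one logical entry.
        lines = [line.strip() for line in chunk.strip().splitlines() if line.strip()]
        if lines:
            entries.append(" ".join(lines))
    return entries


def build_system_prompt(base_prompt: str, entries: list[str]) -> str:
    """Append the persistent facts to the agent's base instructions."""
    facts = "\n".join(f"- {entry}" for entry in entries)
    return f"{base_prompt}\n\nPersistent memory (L2):\n{facts}"


def load_l2(path: Path) -> list[str]:
    """Read the store from disk; a missing file means an empty memory."""
    if not path.exists():
        return []
    return parse_l2(path.read_text(encoding="utf-8"))
```

A harness would call `load_l2` once at session start and pass the result to `build_system_prompt` before the first model call.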

The critical design choices:

  1. Atomic entries — each fact is one line, not a paragraph. Easy to add, remove, or update without touching other facts.
  2. Priority ordering — most critical facts come first. Token budget is finite; the agent needs to see the important stuff even if context gets tight.
  3. Agent can write, user can correct — the agent proactively saves new facts it discovers. But when it’s wrong, the user corrects it, and the correction replaces the old entry.
  4. No ephemeral state — task progress, session outcomes, and anything with an expiration date don’t belong in L2. It’d be stale in a week.
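These design choices can be sketched in a few functions. This is a rough illustration, assuming each entry starts with a topic label followed by a colon (as in the store above) and that roughly four characters make one token. The names `upsert`, `within_budget`, and `save_atomic` are hypothetical.

```python
import os
import tempfile
from pathlib import Path

SEPARATOR = "\n§\n"


def topic_of(entry: str) -> str:
    """Use the text before the first colon as the entry's topic key."""
    return entry.split(":", 1)[0].strip().lower()


def upsert(entries: list[str], new_entry: str) -> list[str]:
    """Replace the entry with the same topic, or append if the topic is new."""
    key = topic_of(new_entry)
    updated = [new_entry if topic_of(e) == key else e for e in entries]
    if new_entry not in updated:
        updated.append(new_entry)
    return updated


def within_budget(entries: list[str], max_tokens: int) -> list[str]:
    """Keep entries in priority order until the rough token budget is spent."""
    kept, used = [], 0
    for entry in entries:
        cost = max(1, len(entry) // 4)  # crude estimate: ~4 characters per token
        if used + cost > max_tokens:
            break
        kept.append(entry)
        used += cost
    return kept


def save_atomic(path: Path, entries: list[str]) -> None:
    """Write to a temp file, then swap it in so readers never see a partial store."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(SEPARATOR.join(entries) + "\n")
    os.replace(tmp_name, path)  # also overwrites an existing file on Windows
```

Because the list order is the priority order, a correction made with `upsert` keeps the position of the entry it replaces, and `within_budget` trims from the bottom when context gets tight.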

Skills — Procedural Memory

Skills are the agent’s how-to library — reusable workflows it writes for itself after completing a complex task. Think of them as the difference between knowing what to do and knowing how to do it.

After debugging a tricky API integration — say, discovering that a required field was missing from the payload and the error message was unhelpful — the agent saves a skill:

## Product Mockup Generation
1. POST to mockup endpoint (v1, NOT v2 — v2 doesn't exist)
2. Payload MUST include "position" field — omitting = MG-4 error
3. Poll status endpoint every 5s (first check at 10s)
4. Result URLs expire after 72h — download immediately
5. Product ID 71 = standard tee, front area 1800x2400
6. Known variant ranges by color (validate before use)

Next time mockups are needed, the agent loads this skill automatically. It doesn’t re-derive the API quirks. It doesn’t forget that the position field is required. It just knows.

Over time, skills accumulate into a living knowledge base. The agent writes them. You prune them when they get stale. They’re loaded on-demand — only the relevant skills are injected for each task, keeping token usage efficient.
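One simple way to hold a skill library is one text file per skill. The sketch below assumes that layout, with a file name derived from the skill title. The helper names and the `.md` extension are assumptions for illustration.

```python
from pathlib import Path
from typing import Optional


def skill_path(library: Path, name: str) -> Path:
    """Map a skill title to a file name, lowercased and hyphenated."""
    slug = "-".join(name.lower().split())
    return library / f"{slug}.md"


def save_skill(library: Path, name: str, steps: list[str]) -> Path:
    """Write a skill as a titled, numbered list of steps."""
    library.mkdir(parents=True, exist_ok=True)
    body = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    path = skill_path(library, name)
    path.write_text(f"## {name}\n{body}\n", encoding="utf-8")
    return path


def load_skill(library: Path, name: str) -> Optional[str]:
    """Return the skill text, or None if no such skill has been written yet."""
    path = skill_path(library, name)
    return path.read_text(encoding="utf-8") if path.exists() else None
```

Pruning a stale skill is then just deleting its file, which keeps the library easy to review by hand.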

The Retrieval Strategy: What Gets Loaded When

L2 is always injected — every turn, every session. But skills are selective. The retrieval strategy determines which skills load for a given task, and it works on three levels:

  1. Automatic skill matching — when the agent recognizes a task pattern (publishing content, generating mockups, managing an API), it loads the relevant skill before responding. No user prompting needed.
  2. User-triggered loading — the agent can be explicitly told to load a skill, or the user can reference a past workflow. “Do what we did last time” triggers skill retrieval.
  3. On-demand discovery — the agent can browse available skills when it’s unsure. A skills_list tool lets it scan the library and pick the right one.

The key insight: L2 is cheap, skills are expensive. A few hundred tokens of L2 facts are negligible. A full skill with code examples and pitfall lists might be 500–2000 tokens. Loading all skills every session would blow the budget. So skills load only when relevant — the retrieval strategy is the filter between “everything the agent knows” and “what the agent needs right now.”
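The filter can be sketched as keyword matching under a token budget. This is a simplified illustration: the `Skill` dataclass, the keyword sets, and the four-characters-per-token estimate are all assumptions, and a real matcher would likely be less literal about word forms.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    keywords: set[str]
    body: str


def estimate_tokens(text: str) -> int:
    """Crude size estimate: about four characters per token."""
    return max(1, len(text) // 4)


def select_skills(task: str, skills: list[Skill], budget: int) -> list[Skill]:
    """Pick skills whose keywords appear in the task, best match first, within budget."""
    words = set(task.lower().split())
    scored = [(len(words & skill.keywords), skill) for skill in skills]
    matches = sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )
    chosen, used = [], 0
    for _, skill in matches:
        cost = estimate_tokens(skill.body)
        if used + cost <= budget:  # skip skills that would blow the budget
            chosen.append(skill)
            used += cost
    return chosen
```

Skills with no keyword overlap are never loaded, so an unrelated task costs zero skill tokens.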

Dreaming: Background Consolidation During Idle Time

This is the part that felt weird to implement but turned out to be one of the most useful features. The agent has a dreaming phase — a background process that runs during idle periods to consolidate, prune, and reorganize memory.

Think of it like what your brain does during sleep. During active sessions, the agent writes facts fast and messy — raw corrections, new discoveries, task-specific details. During idle time, the dreaming process:

  1. Consolidates — merges redundant entries (“User prefers British TTS” + “Use en-GB voice” → one clean entry)
  2. Prunes — flags entries that haven’t been referenced in weeks, surfaces them for review or auto-removal
  3. Categorizes — groups related facts by topic (trading, publishing, environment) so the retrieval strategy can load topic-specific slices instead of the full store
  4. Validates — checks if stored facts contradict each other and flags conflicts for user resolution

Dreaming runs on a schedule — typically during low-activity hours. It’s lightweight (just file operations and string matching, no LLM calls) and produces a cleaner, more efficient L2 store. The agent wakes up with a tidier brain every morning.
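Two of those passes are easy to sketch with nothing but string handling. The example assumes you track a last-referenced date per entry somewhere (here a plain dict). The function names and the 30-day default are illustrative, not fixed parts of the design.

```python
from datetime import date, timedelta


def normalize(entry: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(entry.lower().split())


def deduplicate(entries: list[str]) -> list[str]:
    """Drop repeated entries, keeping the first occurrence and its position."""
    seen, kept = set(), []
    for entry in entries:
        key = normalize(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept


def flag_stale(last_used: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Return entries not referenced within max_age_days, for review rather than deletion."""
    cutoff = today - timedelta(days=max_age_days)
    return [entry for entry, used in last_used.items() if used < cutoff]
```

Note the limit of this sketch: exact-match deduplication will not merge two differently worded entries that mean the same thing. That kind of merge needs extra rules, such as a hand-maintained synonym table.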

Portability: Docker, Not Directories

One of the design goals was portability. The entire memory system — L2 store, skills library, configuration — runs inside a container. Docker makes this elegant.

In practice, this means:

  • Same brain, any machine: docker pull on a new host, point it at your memory volume, and the agent picks up exactly where it left off. No re-training, no re-indexing, no migration scripts. The container is stateless; the volume is the brain.
  • NAS deployment: the container runs comfortably on a Synology, TrueNAS, or any Docker-capable NAS. The agent stays always-on for scheduled tasks, monitoring, and on-demand queries. Mount the memory as a named volume and it persists across container restarts and host reboots.
  • Backup and restore: back up the volume with any standard tool (restic, borg, NAS snapshots). Restore by mounting the volume. The memory is just files on a volume, with no special database and no proprietary format.
  • Version control: because skills and L2 entries are plain text, the volume works with git. You can diff what the agent learned, revert mistakes, and branch experimental knowledge.
  • No vendor lock-in: the memory format is plain text. If you switch LLM providers or agent frameworks, you take your memory with you. The facts are framework-agnostic.

Portability was non-negotiable. If the agent can’t survive a host migration without losing its memory, it’s not really your agent. It’s rented.

One Brain, Multiple Harnesses

The memory system isn’t designed for a single agent. It’s a shared brain that serves multiple harnesses — different agent instances, different surfaces, different tasks, all reading from the same knowledge.

Think of it this way: the brain (L2 + skills + configuration) is the constant. The harnesses (Telegram bot, CLI tool, scheduled worker, smart home controller) are interchangeable. Each harness has its own session context (L1), but they all share the same persistent memory and the same skill library.

In practice:

  • Shared L2 store — a trading agent and a content publishing agent both see the same environment facts, preferences, and lessons. One brain, two harnesses, zero duplication.
  • Domain-isolated skills — each harness loads only the skills relevant to its domain. The trading harness doesn’t load publishing skills. The publishing harness doesn’t load trading skills. Token efficiency by design.
  • Subagent delegation — the main harness can spawn subagents for parallel tasks. Each subagent gets its own L1 (session context) but shares the parent’s L2. When the subagent finishes, its results feed back into the parent’s context.
  • Plugin ecosystem — skills, monitoring scripts, and integrations are all plugins. Add a new capability by dropping a skill file into the library. Remove it by deleting the file. No code changes, no redeployment.

This architecture scales horizontally. Need a new harness for a new domain? Point it at the shared brain and it’s productive from the first session. The memory system becomes the institutional knowledge layer that all harnesses share. You’re not building separate agents — you’re building one brain with multiple faces.
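The relationship between the brain and its harnesses can be sketched as two small data classes. This is an in-memory illustration only. The class and field names are assumptions, and a real deployment would back `Brain` with the files on the shared volume.

```python
from dataclasses import dataclass, field


@dataclass
class Brain:
    """Shared persistent state: L2 facts plus skill names grouped by domain."""
    l2_entries: list[str] = field(default_factory=list)
    skills: dict[str, list[str]] = field(default_factory=dict)  # domain -> skill names


@dataclass
class Harness:
    """One surface (Telegram bot, CLI, worker) with private L1 and a shared brain."""
    name: str
    domain: str
    brain: Brain
    l1_messages: list[str] = field(default_factory=list)

    def context(self) -> dict[str, list[str]]:
        """Assemble what this harness sees: shared L2, its own domain's skills, its own L1."""
        return {
            "l2": self.brain.l2_entries,
            "skills": self.brain.skills.get(self.domain, []),
            "l1": self.l1_messages,
        }
```

Two harnesses built on the same `Brain` instance see each other's L2 updates immediately, while their L1 lists never mix.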

Optional Setup: Run It on a NAS

You don’t need a powerful server to run this. The entire system — agent, memory, plugins — runs comfortably on a NAS or any low-power Linux machine. Here’s why that matters:

  • Always-on — a NAS is already running 24/7. The agent stays available for scheduled tasks, monitoring, and on-demand queries without a dedicated machine.
  • Low resource usage — the memory system is just file I/O. The LLM inference can happen remotely (cloud API) while the agent and memory live locally. You’re not running a GPU on the NAS.
  • Local network access — the agent can reach local services (trading platforms, smart home hubs, local APIs) without exposing them to the internet.
  • Simple deployment — the agent runs as a process or container. Memory is a directory. Back up the directory, back up the brain.

If you already have a NAS for file storage or media, adding an AI agent to it is a natural extension. The memory system was designed for this kind of lightweight, persistent deployment.

What the Agent Sees vs. What It Doesn’t

Memory is powerful, but unfiltered memory is dangerous. Here’s the boundary between what gets loaded and what stays hidden:

Always Loaded (L2 — Persistent Memory)

  • User preferences (output format, notification rules, voice settings)
  • System architecture (pipeline layouts, platform connections)
  • Hard-won lessons (API quirks, platform gotchas, corrections)
  • Environment facts (OS, hardware, installed tools)

Loaded On-Demand (Skills)

  • Task-specific workflows (publishing content, generating mockups, managing APIs)
  • API reference cards (endpoint formats, required fields, error codes)
  • Pitfall lists (what went wrong before, how to avoid it)
  • Verification steps (how to confirm an operation succeeded)

Never Stored (Deliberately)

  • Task progress (“I’m halfway through the migration”)
  • Session outcomes (“the tests passed”)
  • Stale state (“the current price is 2340”)
  • Anything with an expiration date

The agent knows who you are, what you like, and how your systems work. And when you ask “what was I working on last Tuesday?” — it can tell you.

This is where the system gets interesting. L2 doesn’t just store preferences — it stores project state. “Phase 2 migration: 80% complete, remaining: cron job re-registration, Hindsight integration.” When you come back after a week away, you don’t have to remember where you left off. The agent does. And it can offer to continue: “Last time you were working on the migration. Want me to pick up where we left off?”

The key design choice: project state is stored, but task progress isn’t. There’s a difference between “this project exists and is at stage X” (durable, useful) and “I’m currently editing line 47 of this file” (ephemeral, useless next session). The agent stores the former and discards the latter. When you return, it reconstructs the context from durable state and offers to resume.

Learning and Self-Improvement

The memory system isn’t just a storage layer — it’s a self-improvement engine. Every correction the user makes, every skill the agent writes, every lesson learned from a failed operation feeds back into the system. The agent literally gets smarter between sessions.

This complements the agent’s built-in self-improvement functions. The platform has native mechanisms for this — when a difficult task succeeds, the agent saves the approach as a skill. When it hits a pitfall, it patches the skill with the new finding. When a tool behaves unexpectedly, it records the quirk in L2. The memory system amplifies these native behaviors by giving them persistence and structure.

Here’s how the learning loop works:

  1. Correction → L2 update — user says “that endpoint is v1, not v2.” Agent updates L2 immediately. Next session, the correction is already loaded.
  2. Task completion → skill creation — agent finishes a complex workflow (say, publishing an article through multiple API calls). It captures the working approach as a numbered skill with pitfall notes.
  3. Skill failure → skill patch — next time the skill is used and something fails, the agent patches it with the new finding. The skill evolves.
  4. Dreaming → consolidation — during idle time, redundant L2 entries merge, stale skills get flagged, and the knowledge base tightens.
  5. Cross-session learning — the agent learns from its own history. It knows which skills work, which ones have been patched, which L2 entries were corrected. It builds confidence over time.
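The skill-patch step in that loop can be as small as appending a note to the skill text. The sketch below assumes findings are collected under a "Pitfalls:" heading at the end of the skill. That heading and the function name `patch_skill` are hypothetical conventions.

```python
def patch_skill(skill_text: str, finding: str) -> str:
    """Append a finding under a 'Pitfalls:' heading, creating the heading on first use."""
    heading = "Pitfalls:"
    note = f"- {finding}"
    if note in skill_text:
        return skill_text  # already recorded; avoid duplicates
    if heading not in skill_text:
        skill_text = skill_text.rstrip("\n") + f"\n\n{heading}"
    return skill_text.rstrip("\n") + f"\n{note}\n"
```

Making the patch idempotent matters here: a failure that repeats across sessions should not fill the skill with copies of the same warning.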

After months of this, the difference is measurable. The first week, the agent needed constant guidance. By month three, it was proactively loading the right skills, avoiding known pitfalls, and suggesting improvements to its own workflows. The system doesn’t just remember — it improves.

Performance: Speed and Embeddings

The memory system is designed for speed. Here’s what the numbers look like in production:

| Operation | Latency | Notes |
|---|---|---|
| L2 injection into system prompt | 10–50ms | ~500–1500 tokens, pure file read |
| Skill loading (single) | 50–200ms | ~500–2000 tokens depending on complexity |
| Full skill library scan | 200–500ms | Only when agent needs to discover, not for routine tasks |
| Dreaming consolidation | 1–5s | File I/O + string matching, no LLM calls |
| Session start overhead | <100ms | L2 read + relevant skill matching |
| Total memory overhead per turn | <50ms | Negligible vs. LLM inference time (200ms–2s) |

The memory system adds less than 50ms per turn. LLM inference takes 200ms–2s. The memory is a rounding error in the total latency budget.
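These figures depend on the storage and host, so it is worth measuring on your own hardware. A minimal timing helper, assuming the store is a plain text file, might look like this (the name `timed_read` is illustrative):

```python
import time
from pathlib import Path


def timed_read(path: Path) -> tuple[str, float]:
    """Read a file and return its text plus the elapsed time in milliseconds."""
    start = time.perf_counter()
    text = path.read_text(encoding="utf-8")
    elapsed_ms = (time.perf_counter() - start) * 1000
    return text, elapsed_ms
```

Wrapping the L2 read and each skill load in a helper like this gives you per-turn numbers to compare against your own model latency.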

Where Embeddings Fit In

The current system uses structured text (L2 entries, markdown skills) with deterministic retrieval. But there’s a place for embeddings — specifically in skill discovery.

When you have 50+ skills and the agent needs to find the right one for an ambiguous task, keyword matching isn’t enough. Embedding the skill summaries (not the full skills, just the 1–2 sentence descriptions) into a lightweight vector store gives you semantic search over the skill library. The agent describes what it needs in natural language, and the embedding layer returns the most relevant skills.

This is the only place embeddings earn their keep in the architecture. L2 facts are too short and specific for semantic search — you want deterministic retrieval there. Skills are too long and procedural. But skill summaries are perfect for embedding: short, descriptive, semantically rich.

In practice, a lightweight local embedding model (under 100MB) handles this without any cloud API calls. The embedding index is small, fast, and lives alongside the rest of the memory in Docker.
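The ranking logic looks roughly like the sketch below. To keep it self-contained, a bag-of-words count vector stands in for a real embedding model, so this version only matches shared words rather than meaning. Swapping `embed` for a call to a local model is the assumed upgrade path. All names here are illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_skills(query: str, summaries: dict[str, str], k: int = 3) -> list[str]:
    """Rank skill names by similarity between the query and each skill summary."""
    q = embed(query)
    ranked = sorted(summaries, key=lambda name: cosine(q, embed(summaries[name])), reverse=True)
    return ranked[:k]
```

Only the one-to-two-sentence summaries are indexed, so the index stays small even as the full skills grow.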

For reference: users who want full-text search over their memory history can pair the system with tools like Hindsight (conversation search and analysis) and Obsidian (structured knowledge management). These are optional — the core L2 + skills system works without them — but they add powerful retrieval for agents with extensive memory histories.

The Surprising Parts: What Emerged Naturally

I designed the memory system to solve the context problem. What I didn’t expect were the emergent behaviors that appeared after a few months of production use:

Self-Correction Across Sessions

When I corrected the agent on something (“no, that endpoint is v1, not v2”), it wrote the correction to L2, replacing the outdated entry. Next session, it loaded the correction before making the mistake. The agent started getting smarter between sessions, not just within them. Over time, the correction frequency dropped. The agent was learning from its own memory.

Proactive Context Loading

The agent began loading relevant skills before I finished describing the task. I’d say “generate mockups for the new design” and it’d already have the relevant skill loaded — the API quirks, the required fields, the pitfall list. It felt less like talking to a chatbot and more like working with someone who’d done this before.

The Living Knowledge Base

Skills accumulated over time. Each one was a small, focused workflow that captured what the agent learned from a specific task. After months, the skill library covered automation, content publishing, product management, code review workflows, and dozens of smaller tasks. The agent was building its own institutional knowledge — and I could prune, update, or extend any of it.

Token Efficiency Through Structure

L2 memory is small: roughly 500–1500 tokens in production. Skills load on-demand. The total overhead per session is negligible compared to the context the agent would need if you re-explained everything manually. Structured facts beat prose every time. A single line of L2 can replace paragraphs of conversational context.

The Takeaway: Memory Is the Architecture

The MT4 article was about building a bridge between incompatible systems. The MT5 article was about designing the boundary between what an agent can see and what it can’t.

This one is about something more fundamental: making the agent remember.

Without memory, every session is a cold start. With memory, every session builds on the last. The agent learns your systems, absorbs your corrections, and develops reusable expertise — not because the LLM got smarter, but because the architecture gave it a place to store what it learned.

The pattern transfers to any agent framework: L2 for persistent facts, skills for procedural knowledge, a retrieval strategy for what loads when, dreaming for consolidation, and a shared brain for multi-agent harnesses. The implementation details change; the design principles don’t.

Start with L2. Add skills. Let dreaming tighten the knowledge base. Run it in Docker on a NAS. The whole system is portable, self-improving, and grows smarter with every session.

The LLM is the brain. Memory is the career. One lasts a conversation. The other lasts a lifetime.

Build Your Own

If you’re already running an agent system, start with L2 — persistent memory. Write down the 10 facts your agent would need if it started a fresh session right now. Inject them into the system prompt. You’ll feel the difference in the first conversation.

Then add skills. After the next complex task, capture what you learned as a numbered workflow. Next time, the agent loads it automatically.

Then add dreaming. A simple script that deduplicates L2 entries and flags stale ones. It doesn’t need to be fancy — just consistent.

The whole system can run on a NAS, a VPS, or your laptop. The memory is just files. Move them anywhere.

What’s Next

The memory system is running in production — trading, publishing, smart home, scheduled tasks, all of it. Next up is making the dreaming process more sophisticated: cross-referencing skills for conflicts, auto-generating skill documentation, and building a retrieval strategy that learns from usage patterns.

I’ll also be writing about the multi-agent orchestration layer — how the agent delegates tasks to subagents, runs them in parallel, and coordinates results without context pollution between workstreams. That’s the piece that makes the memory system scale.

Get Involved

Drop a comment below — how do you handle agent memory in your systems? RAG, summarization, manual context, or something else entirely?

If you’re building agent-based systems and want to talk architecture, I’m always interested in how others are solving the memory problem.
