Why do AI agents forget context between sessions?

LLMs are stateless: each session starts with a fresh context window. Without explicit memory persistence, the agent has no way to recall facts, preferences, or workflows from previous interactions. The context window works like RAM (temporary working memory), not storage. This is an architectural limitation, not a model limitation. A memory architecture with persistent long-term storage (L2, described below) solves this.

What is L1 and L2 memory in AI agents?

L1 is short-term working memory (the current conversation context), while L2 is long-term persistent memory (structured facts that survive across sessions). L1 is ephemeral — it disappears when the session ends. L2 is injected into every new session as system context before the agent says a single word. Skills serve as a third layer: procedural memory — reusable workflows loaded on-demand only when the task matches.

How is persistent memory different from RAG?

RAG (retrieval-augmented generation) retrieves semantically similar chunks from a vector database — useful for knowledge retrieval but poor for conversational context. Persistent memory uses structured, atomic facts injected directly into the system prompt. It's deterministic rather than probabilistic. RAG returns what's most similar; persistent memory returns what's most important. For agent memory, you need determinism — the fact that an API endpoint is sunset shouldn't compete with the fact that it still works.

What is dreaming in AI agent memory?

Dreaming is a background consolidation process that runs during idle periods, like what your brain does during sleep. It merges redundant entries, flags stale facts that haven't been referenced in weeks, groups related information by topic, and checks stored facts for contradictions. It's lightweight (just file I/O and string matching, no LLM calls) and produces a cleaner, more efficient memory store. The agent wakes up with a tidier knowledge base.

Can an AI agent remember what I was working on last week?

Yes. The persistent memory stores durable project state — which project exists, what stage it's at, what's remaining. When you return after time away, the agent reconstructs context from stored facts and offers to resume. The key design choice: durable project state is stored ("Phase 2 migration: 80% complete"), but ephemeral task progress is discarded ("currently editing line 47"). The agent stores what's useful next session, not what's useful right now.

How does AI agent self-improvement work?

Every correction updates persistent memory immediately — "that endpoint is v1, not v2" becomes a permanent fact. Every complex task becomes a reusable skill with pitfall notes. Every failed skill gets patched with the new finding. During idle time, dreaming consolidates and prunes the knowledge base. After months of this loop, the agent proactively loads the right skills, avoids known pitfalls, and suggests improvements. The system doesn't just remember — it gets measurably smarter between sessions.

Can multiple AI agents share the same memory?

Yes. A shared brain (L2 persistent facts + procedural skills) serves multiple harnesses — different agent instances, different surfaces (Telegram, CLI, scheduled workers), different tasks. Each harness has its own L1 session context but shares the same persistent memory. Domain-isolated skills ensure each harness loads only what's relevant. One brain, multiple faces, zero duplication.

How fast is AI agent memory retrieval?

L2 injection takes 10–50ms (pure file read, ~500–1500 tokens). Single skill loading takes 50–200ms (~500–2000 tokens). Full skill library scan takes 200–500ms (only for discovery, not routine). Dreaming consolidation takes 1–5 seconds and runs in the background. On a routine turn with no skill load, memory overhead is under 50ms, which is small next to LLM inference time of 200ms–2 seconds. The memory system is a small slice of the total latency budget.

Can I run an AI agent on a NAS with Docker?

Yes. The entire system runs in a Docker container on any Docker-capable NAS (Synology, TrueNAS, etc.). The container is stateless; the memory volume is the brain. Mount the volume as a named Docker volume and it persists across container restarts and host reboots. LLM inference happens via cloud API while the agent and memory live locally. Back up the volume with standard tools (restic, borg, NAS snapshots). The memory is just files — no special database, no proprietary format.

Why doesn't a bigger context window solve AI agent memory?

The context window is working memory (RAM), not long-term storage. Even with 200K+ token windows, you still need to decide WHAT to put in it. Injecting every past conversation is wasteful and noisy — most of it is irrelevant. Structured persistent memory with a few hundred tokens of curated facts outperforms dumping thousands of tokens of raw conversation history. The bottleneck isn't window size; it's knowing which facts matter right now.

AI Agent Memory: Build L1/L2 Persistent Memory (2026)

Quick Answer

AI agent memory is the difference between a useful assistant and a disposable chatbot. Most AI agents — whether built on GPT, Claude, Gemini, or open-source models — forget everything between sessions. The context window is not storage; it’s RAM. I run a multi-agent system that manages trading, content publishing, smart home automation, and scheduled tasks — and every time a session ended, it forgot everything. Credentials, preferences, which API endpoints were sunset, which product variants actually worked. The fix wasn’t a bigger context window, a vector database, or RAG — it was a three-tier memory architecture: L1 (short-term working memory), L2 (long-term persistent facts), and procedural skills (reusable workflows). A retrieval strategy surfaces the right facts at the right time. A background “dreaming” process consolidates and prunes memory during idle periods. The system now self-corrects across sessions, shares knowledge across multiple agent harnesses (Telegram, CLI, scheduled workers), and runs in Docker on a NAS. Here’s exactly how it works and how to build your own.

The Problem I Hit: Every Session Starts From Zero

If you’ve built anything non-trivial with an AI agent — trading automation, content publishing, multi-platform bots — you’ve hit this wall. The agent is brilliant within a single session. You give it context, it reasons through it, executes flawlessly. Then the session ends.

Next time you open a chat, it’s meeting you for the first time.

I had an AI agent managing a MetaTrader integration, publishing articles to WordPress and Shopify, running scheduled monitoring jobs, and controlling smart home devices through Telegram. Every new session meant re-explaining the same things: which API endpoints still worked and which were sunset, which product variant IDs were valid, what the automation pipeline architecture looked like, what my preferences were for output format and notification rules. The list grew every week.

This isn’t just annoying — it’s architecturally fatal for an agent that’s supposed to operate autonomously. If it forgets which endpoints are broken, it’ll call them and waste tokens on error handling. If it forgets your preferences, it’ll generate output you have to reject. If it forgets the pipeline layout, it’ll rebuild it from scratch instead of working with what exists.

The context window isn’t the bottleneck. Storage is. LLMs can handle thousands of tokens of injected context. The problem is nobody’s feeding them the right context at the right time.

What I Tried First (And Why It Didn’t Work)

Before settling on the architecture described below, I went through three failed approaches — each one teaching me something about what agent memory actually needs.

Approach 1: RAG Over Conversation History

My first instinct was to treat past conversations like a document store. Store every session in a vector database, embed them, retrieve relevant chunks when a new session starts. Standard RAG (retrieval-augmented generation) — the same pattern used for enterprise knowledge bases.

The problem: conversations are terrible documents. A trading discussion contains contradictory statements (“I think the pair is bullish” followed by “actually, let’s wait for confirmation”). Vector search returns the most semantically similar chunk, which might be the wrong sentiment. RAG is designed for factual retrieval, not conversational context. And worse — it retrieves based on similarity, not importance. The fact that an API endpoint is sunset is just as “similar” to a query as the fact that it still works. You need determinism, not probability.

Approach 2: Summarize Everything Into a System Prompt

Next I tried auto-summarizing each session and injecting the summary into the next session’s system prompt. This worked better — the agent had a rough memory of past interactions.

But summaries are lossy. General preferences survive a summary. Specific technical details don’t — they get compressed into something vague. The agent loses the specificity that makes it actually useful. “User has a Printful integration” is useless. “Mockup generator uses v1 endpoints, /v1/files returns 404, position field is required” is useful. Summaries can’t preserve that granularity.

Approach 3: Manual Context Injection

So I started manually writing context notes and injecting them into each session. It worked, but it was just a more sophisticated version of re-explaining everything. I was doing the memory work the agent was supposed to do.

None of these approaches solved the core problem: the agent needed structured, prioritized, queryable facts — not embeddings, not summaries, not prose.

The Architecture: L1, L2, and the Retrieval Strategy

Here’s what I built. The memory system has two main stores, L1 (short-term) and L2 (long-term), plus a skills library for procedural knowledge and a retrieval strategy that decides what gets loaded and when.

Layer	Purpose	Scope	Update Pattern
L1 — Session Memory	Working memory	Current conversation	Automatic (every message)
L2 — Persistent Memory	Long-term facts	Cross-session	Agent writes, user corrects
Skills	Procedural workflows	Task-specific	Agent writes after task completion

L1 — Short-Term Working Memory

This is the standard conversation context. Every message in the current session is visible to the agent. It’s fast, it’s complete within the session, and it’s ephemeral — when the session ends, L1 is gone.

L1 is what every chatbot already does. The interesting part is what happens after L1 disappears: which facts survive into L2, and how does the agent decide what to load next time?

L2 — Long-Term Persistent Memory

This is the layer that changed everything. Think of it as a structured notepad that lives across sessions. Every fact the agent learns — or you correct it on — gets written here as a discrete, atomic entry.

Here’s what a real L2 store looks like after months of use:

TTS: British English voice, +30% speed, deliver as MP3.
§
E-commerce: Store ID 48291. Mockup gen = v1 endpoints. 
  Templates: API can't create — use dashboard.
  Products: transparent PNG, large fonts (240/200px).
§
Trading: Platform XYZ, account type standard. 
  Risk per trade: 4%. Architecture: COPROCESSOR pattern.
  7 scheduled tasks + executor + monitoring API.
§
LESSONS: Close desktop client before updating. 
  Never run watchdog --dry-run repeatedly.
  Use os.replace() not os.rename() on Windows.

Every entry is a short block of factual, queryable lines grouped under one topic label. No prose. No summaries. Just facts with clear § separators. When a new session starts, the entire L2 store is injected into the system prompt, so the agent reads it before it says a single word.

The critical design choices:

Atomic entries — each fact is one line, not a paragraph. Easy to add, remove, or update without touching other facts.
Priority ordering — most critical facts come first. Token budget is finite; the agent needs to see the important stuff even if context gets tight.
Agent can write, user can correct — the agent proactively saves new facts it discovers. But when it’s wrong, the user corrects it, and the correction replaces the old entry.
No ephemeral state — task progress, session outcomes, and anything with an expiration date don’t belong in L2. It’d be stale in a week.
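To make the injection step concrete, here is a minimal sketch of how a §-separated store could be parsed and appended to a system prompt in priority order. The function names, the 1500-token budget, and the 4-characters-per-token estimate are assumptions for illustration, not the actual implementation.

```python
from pathlib import Path

SEPARATOR = "§"  # entry delimiter used in the sample store above


def parse_l2(text: str) -> list[str]:
    """Split raw store text into entries, normalizing whitespace in each."""
    entries = []
    for chunk in text.split(SEPARATOR):
        lines = [line.strip() for line in chunk.strip().splitlines() if line.strip()]
        if lines:
            entries.append("\n  ".join(lines))
    return entries


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def build_system_prompt(base_prompt: str, entries: list[str], budget: int = 1500) -> str:
    """Append entries in stored (priority) order until the token budget is spent."""
    kept, used = [], 0
    for entry in entries:
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break  # lower-priority entries are the first to be dropped
        kept.append(entry)
        used += cost
    memory_block = f"\n{SEPARATOR}\n".join(kept)
    return f"{base_prompt}\n\n# Persistent memory (L2)\n{memory_block}"


def load_l2(path: Path) -> list[str]:
    """Read the store from disk; a missing file means an empty memory."""
    return parse_l2(path.read_text(encoding="utf-8")) if path.exists() else []


sample = """TTS: British English voice, +30% speed, deliver as MP3.
§
LESSONS: Close desktop client before updating.
  Use os.replace() not os.rename() on Windows.
"""
entries = parse_l2(sample)
prompt = build_system_prompt("You are my operations agent.", entries)
print(prompt)
```

Because entries are appended in file order, moving a fact higher in the file is all it takes to raise its priority when the budget gets tight.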

Skills — Procedural Memory

Skills are the agent’s how-to library — reusable workflows it writes for itself after completing a complex task. Think of them as the difference between knowing what to do and knowing how to do it.

After debugging a tricky API integration — say, discovering that a required field was missing from the payload and the error message was unhelpful — the agent saves a skill:

## Product Mockup Generation
1. POST to mockup endpoint (v1, NOT v2 — v2 doesn't exist)
2. Payload MUST include "position" field — omitting = MG-4 error
3. Poll status endpoint every 5s (first check at 10s)
4. Result URLs expire after 72h — download immediately
5. Product ID 71 = standard tee, front area 1800x2400
6. Known variant ranges by color (validate before use)

Next time mockups are needed, the agent loads this skill automatically. It doesn’t re-derive the API quirks. It doesn’t forget that the position field is required. It just knows.

Over time, skills accumulate into a living knowledge base. The agent writes them. You prune them when they get stale. They’re loaded on-demand — only the relevant skills are injected for each task, keeping token usage efficient.

The Retrieval Strategy: What Gets Loaded When

L2 is always injected — every turn, every session. But skills are selective. The retrieval strategy determines which skills load for a given task, and it works on three levels:

Automatic skill matching — when the agent recognizes a task pattern (publishing content, generating mockups, managing an API), it loads the relevant skill before responding. No user prompting needed.
User-triggered loading — the agent can be explicitly told to load a skill, or the user can reference a past workflow. “Do what we did last time” triggers skill retrieval.
On-demand discovery — the agent can browse available skills when it’s unsure. A skills_list tool lets it scan the library and pick the right one.

The key insight: L2 is cheap, skills are expensive. A few hundred tokens of L2 facts are negligible. A full skill with code examples and pitfall lists might be 500–2000 tokens. Loading all skills every session would blow the budget. So skills load only when relevant — the retrieval strategy is the filter between “everything the agent knows” and “what the agent needs right now.”
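Automatic skill matching can be sketched as a keyword filter over the library. This assumes each skill carries a hand-written list of trigger words. The `Skill` fields, the `match_skills` function, and the "Article Publishing" entry are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    domain: str                 # e.g. "ecommerce", "publishing"
    triggers: tuple[str, ...]   # lowercase keywords that suggest this skill
    body: str                   # the numbered workflow text


def match_skills(task: str, library: list[Skill], max_skills: int = 2) -> list[Skill]:
    """Return skills whose trigger keywords appear in the task, best match first."""
    words = set(task.lower().split())
    scored = []
    for skill in library:
        hits = len(words & set(skill.triggers))
        if hits:
            scored.append((hits, skill))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:max_skills]]


library = [
    Skill("Product Mockup Generation", "ecommerce",
          ("mockup", "mockups", "design"),
          "1. POST to mockup endpoint (v1, NOT v2)"),
    Skill("Article Publishing", "publishing",   # hypothetical second skill
          ("publish", "article", "post"),
          "1. Draft the article"),
]
selected = match_skills("generate mockups for the new design", library)
print([skill.name for skill in selected])
```

Capping the result with `max_skills` is what keeps the token cost bounded: only the best matches have their full bodies injected.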

Dreaming: Background Consolidation During Idle Time

This is the part that felt weird to implement but turned out to be one of the most useful features. The agent has a dreaming phase — a background process that runs during idle periods to consolidate, prune, and reorganize memory.

Think of it like what your brain does during sleep. During active sessions, the agent writes facts fast and messy — raw corrections, new discoveries, task-specific details. During idle time, the dreaming process:

Consolidates — merges redundant entries (“User prefers British TTS” + “Use en-GB voice” → one clean entry)
Prunes — flags entries that haven’t been referenced in weeks, surfaces them for review or auto-removal
Categorizes — groups related facts by topic (trading, publishing, environment) so the retrieval strategy can load topic-specific slices instead of the full store
Validates — checks if stored facts contradict each other and flags conflicts for user resolution

Dreaming runs on a schedule — typically during low-activity hours. It’s lightweight (just file operations and string matching, no LLM calls) and produces a cleaner, more efficient L2 store. The agent wakes up with a tidier brain every morning.
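A minimal dreaming pass might look like the sketch below. It assumes each entry carries a `last_used` date, which the sample store does not show, and it merges only copies that differ in case or whitespace. Merging differently worded facts such as “User prefers British TTS” and “Use en-GB voice” would need an alias table or similar, which is not shown here.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Entry:
    text: str
    last_used: date  # assumed metadata, not part of the sample store


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def dream(entries: list[Entry], today: date, stale_after_days: int = 21):
    """Return (kept, stale): duplicates merged, long-unused entries flagged."""
    merged: dict[str, Entry] = {}
    for entry in entries:
        key = normalize(entry.text)
        seen = merged.get(key)
        if seen is None:
            merged[key] = Entry(entry.text, entry.last_used)
        else:
            # Keep the first wording, but remember the most recent use.
            seen.last_used = max(seen.last_used, entry.last_used)
    cutoff = today - timedelta(days=stale_after_days)
    kept = list(merged.values())
    stale = [e for e in kept if e.last_used < cutoff]  # flagged, not deleted
    return kept, stale


entries = [
    Entry("TTS: British English voice", date(2026, 1, 30)),
    Entry("tts:  british english voice", date(2026, 1, 2)),
    Entry("Risk per trade: 4%", date(2025, 12, 1)),
]
kept, stale = dream(entries, today=date(2026, 1, 31))
print(len(kept), [e.text for e in stale])
```

Stale entries are returned rather than removed, so a user can review them before anything is deleted.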

Portability: Docker, Not Directories

One of the design goals was portability. The entire memory system — L2 store, skills library, configuration — runs inside a container. Docker makes this elegant.

In practice, this means:

Same brain, any machine — docker pull on a new host, point it at your memory volume, and the agent picks up exactly where it left off. No re-training, no re-indexing, no migration scripts. The container is stateless; the volume is the brain.
NAS deployment — the container runs comfortably on a Synology, TrueNAS, or any Docker-capable NAS. The agent stays always-on for scheduled tasks, monitoring, and on-demand queries. Mount the memory as a named volume and it persists across container restarts and host reboots.
Backup and restore: back up the volume with any standard tool (restic, borg, NAS snapshots). Restore by mounting the volume. The memory is just files on a volume, with no special database and no proprietary format.
Version control: because skills and L2 entries are plain text, the memory directory can be tracked with git (easiest when it is bind-mounted from a host directory). You can diff what the agent learned, revert mistakes, and branch experimental knowledge.
No vendor lock-in — the memory format is plain text. If you switch LLM providers or agent frameworks, you take your memory with you. The facts are framework-agnostic.

Portability was non-negotiable. If the agent can’t survive a host migration without losing its memory, it’s not really your agent. It’s rented.

One Brain, Multiple Harnesses

The memory system isn’t designed for a single agent. It’s a shared brain that serves multiple harnesses — different agent instances, different surfaces, different tasks, all reading from the same knowledge.

Think of it this way: the brain (L2 + skills + configuration) is the constant. The harnesses (Telegram bot, CLI tool, scheduled worker, smart home controller) are interchangeable. Each harness has its own session context (L1), but they all share the same persistent memory and the same skill library.

In practice:

Shared L2 store — a trading agent and a content publishing agent both see the same environment facts, preferences, and lessons. One brain, two harnesses, zero duplication.
Domain-isolated skills — each harness loads only the skills relevant to its domain. The trading harness doesn’t load publishing skills. The publishing harness doesn’t load trading skills. Token efficiency by design.
Subagent delegation — the main harness can spawn subagents for parallel tasks. Each subagent gets its own L1 (session context) but shares the parent’s L2. When the subagent finishes, its results feed back into the parent’s context.
Plugin ecosystem — skills, monitoring scripts, and integrations are all plugins. Add a new capability by dropping a skill file into the library. Remove it by deleting the file. No code changes, no redeployment.

This architecture scales horizontally. Need a new harness for a new domain? Point it at the shared brain and it’s productive from the first session. The memory system becomes the institutional knowledge layer that all harnesses share. You’re not building separate agents — you’re building one brain with multiple faces.
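The relationship between one brain and several harnesses can be sketched with two small data classes. The class names, harness names, and skill names here are hypothetical, and the sketch ignores file locking, which a real multi-process setup writing to the same store would need.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBrain:
    l2_entries: list[str]
    skills: dict[str, list[str]]  # domain -> skill names


@dataclass
class Harness:
    name: str
    domains: set[str]
    brain: SharedBrain                                 # shared by reference
    session: list[str] = field(default_factory=list)   # private L1 context

    def visible_skills(self) -> list[str]:
        """Only skills from this harness's domains are eligible for loading."""
        return sorted(
            skill
            for domain, names in self.brain.skills.items()
            if domain in self.domains
            for skill in names
        )


brain = SharedBrain(
    l2_entries=["TTS: British English voice"],
    skills={"trading": ["Order Execution"], "publishing": ["Article Publishing"]},
)
telegram = Harness("telegram-trading", {"trading"}, brain)
worker = Harness("publish-worker", {"publishing"}, brain)

telegram.session.append("user: close the open position")   # stays in this L1
brain.l2_entries.append("LESSONS: Close desktop client before updating.")

print(telegram.visible_skills(), worker.visible_skills())
```

Each harness sees the new L2 entry immediately because both hold a reference to the same `SharedBrain`, while the session list stays private to the harness that wrote it.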

Optional Setup: Run It on a NAS

You don’t need a powerful server to run this. The entire system — agent, memory, plugins — runs comfortably on a NAS or any low-power Linux machine. Here’s why that matters:

Always-on — a NAS is already running 24/7. The agent stays available for scheduled tasks, monitoring, and on-demand queries without a dedicated machine.
Low resource usage — the memory system is just file I/O. The LLM inference can happen remotely (cloud API) while the agent and memory live locally. You’re not running a GPU on the NAS.
Local network access — the agent can reach local services (trading platforms, smart home hubs, local APIs) without exposing them to the internet.
Simple deployment — the agent runs as a process or container. Memory is a directory. Back up the directory, back up the brain.

If you already have a NAS for file storage or media, adding an AI agent to it is a natural extension. The memory system was designed for this kind of lightweight, persistent deployment.

What the Agent Sees vs. What It Doesn’t

Memory is powerful, but unfiltered memory is dangerous. Here’s the boundary between what gets loaded and what stays hidden:

Always Loaded (L2 — Persistent Memory)

User preferences (output format, notification rules, voice settings)
System architecture (pipeline layouts, platform connections)
Hard-won lessons (API quirks, platform gotchas, corrections)
Environment facts (OS, hardware, installed tools)

Loaded On-Demand (Skills)

Task-specific workflows (publishing content, generating mockups, managing APIs)
API reference cards (endpoint formats, required fields, error codes)
Pitfall lists (what went wrong before, how to avoid it)
Verification steps (how to confirm an operation succeeded)

Never Stored (Deliberately)

Fine-grained task progress (“I’m currently editing line 47”)
Session outcomes (“the tests passed”)
Stale state (“the current price is 2340”)
Anything with an expiration date

The agent knows who you are, what you like, and how your systems work. And when you ask “what was I working on last Tuesday?” — it can tell you.

This is where the system gets interesting. L2 doesn’t just store preferences — it stores project state. “Phase 2 migration: 80% complete, remaining: cron job re-registration, Hindsight integration.” When you come back after a week away, you don’t have to remember where you left off. The agent does. And it can offer to continue: “Last time you were working on the migration. Want me to pick up where we left off?”

The key design choice: project state is stored, but task progress isn’t. There’s a difference between “this project exists and is at stage X” (durable, useful) and “I’m currently editing line 47 of this file” (ephemeral, useless next session). The agent stores the former and discards the latter. When you return, it reconstructs the context from durable state and offers to resume.

Learning and Self-Improvement

The memory system isn’t just a storage layer — it’s a self-improvement engine. Every correction the user makes, every skill the agent writes, every lesson learned from a failed operation feeds back into the system. The agent literally gets smarter between sessions.

This complements the agent’s built-in self-improvement functions. The platform has native mechanisms for this — when a difficult task succeeds, the agent saves the approach as a skill. When it hits a pitfall, it patches the skill with the new finding. When a tool behaves unexpectedly, it records the quirk in L2. The memory system amplifies these native behaviors by giving them persistence and structure.

Here’s how the learning loop works:

Correction → L2 update — user says “that endpoint is v1, not v2.” Agent updates L2 immediately. Next session, the correction is already loaded.
Task completion → skill creation — agent finishes a complex workflow (say, publishing an article through multiple API calls). It captures the working approach as a numbered skill with pitfall notes.
Skill failure → skill patch — next time the skill is used and something fails, the agent patches it with the new finding. The skill evolves.
Dreaming → consolidation — during idle time, redundant L2 entries merge, stale skills get flagged, and the knowledge base tightens.
Cross-session learning — the agent learns from its own history. It knows which skills work, which ones have been patched, which L2 entries were corrected. It builds confidence over time.
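The first step of the loop, correction to L2 update, can be sketched as a replace-or-append on the store file. This assumes the §-separated format shown earlier and a topic label before the first colon of each entry. The `apply_correction` name and the file name are hypothetical. The write goes through a temporary file and `os.replace()`, in line with the lesson recorded in the sample store.

```python
import os
import tempfile
from pathlib import Path


def apply_correction(store: Path, topic: str, corrected_entry: str) -> None:
    """Replace the entry labelled `topic:` (or append one), then write atomically."""
    text = store.read_text(encoding="utf-8") if store.exists() else ""
    entries = [e.strip() for e in text.split("§") if e.strip()]
    for i, entry in enumerate(entries):
        if entry.startswith(f"{topic}:"):
            entries[i] = corrected_entry  # the correction replaces the old fact
            break
    else:
        entries.append(corrected_entry)   # no entry for this topic yet
    # Write to a temp file in the same directory, then swap it into place,
    # so a crash mid-write cannot leave a half-written store behind.
    fd, tmp_name = tempfile.mkstemp(dir=str(store.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        tmp.write("\n§\n".join(entries) + "\n")
    os.replace(tmp_name, store)


with tempfile.TemporaryDirectory() as workdir:
    store = Path(workdir) / "l2.txt"
    store.write_text(
        "E-commerce: Mockup gen = v2 endpoints.\n§\nTTS: British English voice.\n",
        encoding="utf-8",
    )
    apply_correction(store, "E-commerce", "E-commerce: Mockup gen = v1 endpoints.")
    result = store.read_text(encoding="utf-8")

print(result)
```

Replacing the entry in place, rather than appending a second one, is what keeps the corrected and uncorrected versions of a fact from coexisting in the store.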

After months of this, the difference is obvious. The first week, the agent needed constant guidance. By month three, it was proactively loading the right skills, avoiding known pitfalls, and suggesting improvements to its own workflows. The system doesn’t just remember; it improves.

Performance: Speed and Embeddings

The memory system is designed for speed. Here’s what the numbers look like in production:

Operation	Latency	Notes
L2 injection into system prompt	10–50ms	~500–1500 tokens, pure file read
Skill loading (single)	50–200ms	~500–2000 tokens depending on complexity
Full skill library scan	200–500ms	Only when agent needs to discover, not for routine tasks
Dreaming consolidation	1–5s	File I/O + string matching, no LLM calls
Session start overhead	<100ms	L2 read + relevant skill matching
Total memory overhead per turn	<50ms	Routine turn with no skill load; small vs. LLM inference time (200ms–2s)

On a routine turn the memory system adds less than 50ms, against 200ms–2s of LLM inference. Memory is a small slice of the total latency budget.
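To check figures like these on your own hardware, a small timing harness is enough. The sketch below measures a plain file read and computes the share of a turn spent in memory. The function names and the synthetic store are assumptions, and actual timings depend on the disk and filesystem.

```python
import tempfile
import time
from pathlib import Path


def time_read_ms(path: Path, repeats: int = 20) -> float:
    """Median wall-clock time, in milliseconds, to read a file as text."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        path.read_text(encoding="utf-8")
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[len(samples) // 2]


def overhead_share(memory_ms: float, inference_ms: float) -> float:
    """Fraction of a turn's total latency spent in the memory layer."""
    return memory_ms / (memory_ms + inference_ms)


with tempfile.TemporaryDirectory() as workdir:
    store = Path(workdir) / "l2.txt"
    # Synthetic store of a few thousand characters, standing in for real L2 facts.
    store.write_text("Fact: example entry for timing.\n§\n" * 150, encoding="utf-8")
    read_ms = time_read_ms(store)

print(f"median read: {read_ms:.3f} ms")
print(f"share at 50 ms memory, 2 s inference:    {overhead_share(50, 2000):.1%}")
print(f"share at 50 ms memory, 200 ms inference: {overhead_share(50, 200):.1%}")
```

Using the table's own bounds, 50ms of memory work is about 2.4% of a turn with 2s inference and 20% of a turn with 200ms inference, so the overhead matters most when the model responds quickly.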

Where Embeddings Fit In

The current system uses structured text (L2 entries, markdown skills) with deterministic retrieval. But there’s a place for embeddings — specifically in skill discovery.

When you have 50+ skills and the agent needs to find the right one for an ambiguous task, keyword matching isn’t enough. Embedding the skill summaries (not the full skills, just the 1–2 sentence descriptions) into a lightweight vector store gives you semantic search over the skill library. The agent describes what it needs in natural language, and the embedding layer returns the most relevant skills.

This is the only place embeddings earn their keep in the architecture. L2 facts are too short and specific for semantic search — you want deterministic retrieval there. Skills are too long and procedural. But skill summaries are perfect for embedding: short, descriptive, semantically rich.

In practice, a lightweight local embedding model (under 100MB) handles this without any cloud API calls. The embedding index is small, fast, and lives alongside the rest of the memory in Docker.
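The ranking step of skill discovery can be sketched as follows. To keep the example dependency-free, `embed` is a toy bag-of-words vector, not a real embedding model; it is a stand-in, and a local model would replace that one function while the cosine ranking stays the same. The skill summaries are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_skills(query: str, summaries: dict[str, str], top_k: int = 2) -> list[str]:
    """Rank skill names by similarity between the query and each skill summary."""
    q = embed(query)
    ranked = sorted(
        summaries,
        key=lambda name: cosine(q, embed(summaries[name])),
        reverse=True,
    )
    return ranked[:top_k]


summaries = {
    "Product Mockup Generation": "generate product mockup images through the mockup api",
    "Article Publishing": "publish an article to the blog through several api calls",
}
best = find_skills("create mockup images for a new product", summaries, top_k=1)
print(best)
```

Only the short summaries are indexed, so the index stays small even as the full skill bodies grow.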

For reference: users who want full-text search over their memory history can pair the system with tools like Hindsight (conversation search and analysis) and Obsidian (structured knowledge management). These are optional — the core L2 + skills system works without them — but they add powerful retrieval for agents with extensive memory histories.

The Surprising Parts: What Emerged Naturally

I designed the memory system to solve the context problem. What I didn’t expect were the emergent behaviors that appeared after a few months of production use:

Self-Correction Across Sessions

When I corrected the agent on something — “no, that endpoint is v1, not v2” — it saved the correction as a new L2 entry. Next session, it loaded the correction before making the mistake. The agent started getting smarter between sessions, not just within them. Over time, the correction frequency dropped. The agent was learning from its own memory.

Proactive Context Loading

The agent began loading relevant skills before I finished describing the task. I’d say “generate mockups for the new design” and it’d already have the relevant skill loaded — the API quirks, the required fields, the pitfall list. It felt less like talking to a chatbot and more like working with someone who’d done this before.

The Living Knowledge Base

Skills accumulated over time. Each one was a small, focused workflow that captured what the agent learned from a specific task. After months, the skill library covered automation, content publishing, product management, code review workflows, and dozens of smaller tasks. The agent was building its own institutional knowledge — and I could prune, update, or extend any of it.

Token Efficiency Through Structure

L2 memory is small: roughly 500–1500 tokens, per the latency table above. Skills load on-demand. The total overhead per session is negligible compared to the context the agent would need if you re-explained everything manually. Structured facts beat prose every time. A single line of L2 can replace paragraphs of conversational context.

The Takeaway: Memory Is the Architecture

The MT4 article was about building a bridge between incompatible systems. The MT5 article was about designing the boundary between what an agent can see and what it can’t.

This one is about something more fundamental: making the agent remember.

Without memory, every session is a cold start. With memory, every session builds on the last. The agent learns your systems, absorbs your corrections, and develops reusable expertise — not because the LLM got smarter, but because the architecture gave it a place to store what it learned.

The pattern transfers to any agent framework: L2 for persistent facts, skills for procedural knowledge, a retrieval strategy for what loads when, dreaming for consolidation, and a shared brain for multi-agent harnesses. The implementation details change; the design principles don’t.

Start with L2. Add skills. Let dreaming tighten the knowledge base. Run it in Docker on a NAS. The whole system is portable, self-improving, and grows smarter with every session.

The LLM is the brain. Memory is the career. One lasts a conversation. The other lasts a lifetime.

Build Your Own

If you’re already running an agent system, start with L2 — persistent memory. Write down the 10 facts your agent would need if it started a fresh session right now. Inject them into the system prompt. You’ll feel the difference in the first conversation.

Then add skills. After the next complex task, capture what you learned as a numbered workflow. Next time, the agent loads it automatically.

Then add dreaming. A simple script that deduplicates L2 entries and flags stale ones. It doesn’t need to be fancy — just consistent.

The whole system can run on a NAS, a VPS, or your laptop. The memory is just files. Move them anywhere.

What’s Next

The memory system is running in production — trading, publishing, smart home, scheduled tasks, all of it. Next up is making the dreaming process more sophisticated: cross-referencing skills for conflicts, auto-generating skill documentation, and building a retrieval strategy that learns from usage patterns.

I’ll also be writing about the multi-agent orchestration layer — how the agent delegates tasks to subagents, runs them in parallel, and coordinates results without context pollution between workstreams. That’s the piece that makes the memory system scale.

Get Involved

Drop a comment below — how do you handle agent memory in your systems? RAG, summarization, manual context, or something else entirely?

If you’re building agent-based systems and want to talk architecture, I’m always interested in how others are solving the memory problem.
