Some teams think about long context the wrong way. They treat the context window like a storage upgrade. Bigger model. Bigger window. Bigger bill. Problem solved.
That is not how this works in production.
As conversations get longer, the real challenge is not whether the model can technically accept more tokens. The challenge is whether your application can preserve the right state without drowning the model in stale turns, repeated instructions, and buried facts. Mastra’s own memory model reflects this reality. It separates raw message history from working memory, semantic recall, and observational memory, with observational memory specifically intended to keep context smaller while preserving long-term memory across conversations. (Mastra)
That is context window compaction.
In Mastra, compaction is not a hack layered on top of the framework. It fits naturally into how the framework already wants you to think: keep structured memory for what should persist, use recent messages for local coherence, and use workflows when you want deterministic control over how context gets reduced and refreshed. Mastra also distinguishes clearly between agents and workflows: agents are open-ended and tool-using, while workflows give you explicit control over steps, order, and state transitions. (Mastra)
What context window compaction actually means in Mastra
In practical terms, context window compaction means you stop replaying the entire conversation forever.
Instead, you preserve context in layers:
- Agent instructions for stable behavior
- Working memory for durable structured state
- Observational memory for compressed long-range memory
- Recent message history for verbatim short-term coherence
- Semantic recall when relevant past material should be pulled back in on demand
That is a much better production model than “just send the whole transcript again.” It is also aligned with how Mastra describes its own memory primitives. Working memory stores persistent structured information such as user profile, preferences, or goals. Observational memory maintains a dense observation log as raw history grows. Semantic recall retrieves relevant past messages by meaning rather than exact match. (Mastra)
The key idea is simple: not all context deserves to survive in the same form.
A temporary clarification can fade. A user constraint should not. A final decision should be preserved structurally, not buried in assistant prose.
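To make the layering concrete, here is a minimal, framework-free sketch of the idea. This is illustrative only, not Mastra code: the names `RetainedState` and `buildContext` are invented for this example. The point is that only a small recent window of raw turns is replayed verbatim, while older context survives as structured state.

```ts
// Illustrative sketch only — not a Mastra API. RetainedState and
// buildContext are invented names for this example.

type Turn = { role: 'user' | 'assistant'; content: string };

interface RetainedState {
  instructions: string;   // stable behavior: never compacted
  workingMemory: string;  // durable structured state, survives verbatim
  observations: string[]; // compressed long-range notes
  history: Turn[];        // full raw history
}

// Keep only the last `recentWindow` turns verbatim; everything older is
// represented by working memory and observations instead of raw replay.
function buildContext(state: RetainedState, recentWindow = 4): string {
  const recent = state.history.slice(-recentWindow);
  return [
    state.instructions,
    `## Working memory\n${state.workingMemory}`,
    `## Observations\n${state.observations.map(o => `- ${o}`).join('\n')}`,
    `## Recent turns\n${recent.map(t => `[${t.role}] ${t.content}`).join('\n')}`,
  ].join('\n\n');
}
```

Stale turns fall out of the prompt automatically, but the facts they carried can still reach the model through working memory or observations.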
Why naive summarization fails
A lot of teams hear “compaction” and think “summary.” So they replace the first fifty turns with a paragraph and hope for the best.
That usually breaks in quiet ways. The agent starts missing constraints. It forgets what was approved. It retains the tone of the discussion but not the actual commitments. This is exactly why structured memory matters more than generic summarization.
In Mastra terms, you should not rely on message history alone to preserve important state. Use working memory templates to explicitly track what matters. Mastra recommends custom templates so the agent remembers the information most relevant to the use case, rather than depending on an untargeted default. (Mastra)
A summary compresses words; a template compresses *state*. The difference shows up the first time the agent needs to act on a constraint rather than merely recall that one was discussed.
The pattern that works in Mastra
The cleanest pattern is:
- Put durable instructions in the agent definition.
- Turn on memory.
- Use a custom working memory template to define what should persist.
- Let observational memory compress long-range history.
- Use a workflow when you want deterministic compaction and promotion of facts into structured state.
This matters because Mastra gives you two different control surfaces. If you want the model to reason flexibly, use an agent with memory. If you want explicit compaction behavior with step-by-step control, use a workflow built with createStep() and createWorkflow(). (Mastra)
A Mastra example in TypeScript
Below is a practical example that shows how to think about compaction in Mastra.
The example has three parts:
- a memory-enabled agent
- a working memory template that preserves durable state
- a workflow that compacts older conversation turns into a structured summary before the next agent run
1) Create a memory-enabled agent
```ts
import { Agent } from '@mastra/core/agent';
import { Memory } from '@mastra/memory';

const supportMemory = new Memory({
  options: {
    workingMemory: {
      enabled: true,
      scope: 'thread',
      template: `
# Active Case State
- User goal:
- Current issue:
- Constraints:
- Decisions made:
- Important facts:
- Open questions:
- Next expected action:
`.trim(),
    },
    observationalMemory: true,
  },
});

export const supportAgent = new Agent({
  id: 'support-agent',
  name: 'Support Agent',
  instructions: `
You are a precise enterprise support agent.

Rules:
- Preserve decisions and constraints exactly.
- Do not restate stale details unless they affect the current issue.
- Prefer the structured memory over vague paraphrasing.
- If there is a conflict between recent messages and memory, call it out explicitly.
`.trim(),
  model: 'openai/gpt-5.4',
  memory: supportMemory,
});
```
This is the first important shift. Instead of treating the transcript as the only memory, you tell Mastra what durable state should be maintained in working memory, and you let observational memory help compress the long tail of the conversation. Mastra documents thread-scoped working memory and custom templates for exactly this purpose, and it recommends observational memory as the long-term compression layer. (Mastra)
2) Add a compaction step in a workflow
Mastra workflows are useful when you want explicit control over how compaction happens, rather than leaving everything to the agent loop. Workflows define steps with schemas and controlled data flow. (Mastra)
```ts
import { z } from 'zod';
import { createStep, createWorkflow } from '@mastra/core/workflows';

const compactContextStep = createStep({
  id: 'compact-context',
  inputSchema: z.object({
    transcript: z.array(
      z.object({
        role: z.enum(['user', 'assistant', 'tool']),
        content: z.string(),
      })
    ),
  }),
  outputSchema: z.object({
    compactedSummary: z.string(),
    promotedFacts: z.array(z.string()),
    decisions: z.array(z.string()),
    openQuestions: z.array(z.string()),
  }),
  execute: async ({ inputData }) => {
    const prompt = `
Compact this conversation history for future use.

Rules:
- Preserve decisions, constraints, facts, and unresolved questions.
- Remove repetition, pleasantries, and obsolete turns.
- Be specific.
- Do not invent facts.

Transcript:
${inputData.transcript
  .map(t => `[${t.role}] ${t.content}`)
  .join('\n')}
`.trim();

    const result = await supportAgent.generate(prompt, {
      output: {
        schema: z.object({
          compactedSummary: z.string(),
          promotedFacts: z.array(z.string()),
          decisions: z.array(z.string()),
          openQuestions: z.array(z.string()),
        }),
      },
    });

    return result.object;
  },
});
```
The important design choice here is that compaction returns structured outputs, not a blob of prose. That makes it far easier to feed the result into working memory, trace it, test it, and compare compaction quality over time.
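Because the output is structured, you can also render it back into the working-memory template shape before persisting it. The helper below is a hypothetical sketch (the `CompactedContext` interface and `toWorkingMemoryMarkdown` name are invented here, not part of Mastra); it simply shows how structured fields map cleanly onto a template, where a prose summary would not.

```ts
// Hypothetical helper — invented for this article, not a Mastra API.
// Renders structured compaction output into the working-memory template shape.
interface CompactedContext {
  compactedSummary: string;
  promotedFacts: string[];
  decisions: string[];
  openQuestions: string[];
}

function toWorkingMemoryMarkdown(c: CompactedContext): string {
  const bullets = (items: string[]) =>
    items.length ? items.map(i => `- ${i}`).join('\n') : '- (none)';
  return [
    '# Active Case State',
    `## Summary\n${c.compactedSummary}`,
    `## Important facts\n${bullets(c.promotedFacts)}`,
    `## Decisions made\n${bullets(c.decisions)}`,
    `## Open questions\n${bullets(c.openQuestions)}`,
  ].join('\n\n');
}
```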
3) Run the agent with compacted state
Mastra supports agents inside workflows, and workflows can map previous step output into the prompt that the agent sees. (Mastra)
```ts
// An agent can be used directly as a workflow step; agent steps expect
// a { prompt } input and produce a { text } output.
const replyStep = createStep(supportAgent);

export const compactingConversationWorkflow = createWorkflow({
  id: 'compacting-conversation',
  inputSchema: z.object({
    transcript: z.array(
      z.object({
        role: z.enum(['user', 'assistant', 'tool']),
        content: z.string(),
      })
    ),
  }),
  outputSchema: z.object({ text: z.string() }),
})
  .then(compactContextStep)
  // .map() transforms the compaction step's output into the prompt shape
  // that the agent step expects.
  .map(async ({ inputData }) => {
    return {
      prompt: `
Use this compacted context to answer the user well.

Compacted summary:
${inputData.compactedSummary}

Promoted facts:
${inputData.promotedFacts.map(f => `- ${f}`).join('\n')}

Decisions:
${inputData.decisions.map(d => `- ${d}`).join('\n')}

Open questions:
${inputData.openQuestions.map(q => `- ${q}`).join('\n')}

Now answer the latest user request.
`.trim(),
    };
  })
  .then(replyStep)
  .commit();
```
This is a strong production pattern because the workflow owns the deterministic sequence while the agent handles the reasoning. Mastra explicitly supports this split: workflows for controlled multi-step logic, agents for judgment and generation. (Mastra)
What to watch for
1) Do not compact instructions
Your agent instructions are not chat history. They should remain stable and explicit in the agent definition, not be “remembered” through summary text. Mastra’s Agent class uses instructions as the core system behavior layer, which is a different concern from memory. (Mastra)
2) Do not let summaries replace structure
A paragraph that says “the team generally agreed on an approach” is nearly useless.
A compacted state that says:
- Decision: use Redis for ephemeral cache
- Constraint: never include client PII in prompts
- Open question: whether audit logs need 30 or 90 day retention
provides actionable structure.
That is exactly why Mastra working memory templates are so useful. They let you define the shape of what should persist instead of hoping the model keeps it straight. (Mastra)
3) Observational memory is powerful, but you still need design discipline
Mastra recommends observational memory because it keeps the context window small while preserving long-term memory, but that does not mean you should stop thinking about what belongs in structured state versus retrievable state. Observational memory helps compress; it does not remove the need for clear memory design. (Mastra)
4) Use workflows for deterministic compaction
If compaction is business-critical, do not leave it entirely implicit.
Use a workflow. Add schemas. Capture outputs. Evaluate whether important facts survived. Mastra workflows are explicitly built for this kind of multi-step, predictable execution. (Mastra)
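"Evaluate whether important facts survived" can be a mechanical check rather than a vibe. Here is a minimal, framework-free sketch (the `missingFacts` helper is invented for this article, not a Mastra feature): given a list of must-survive facts, it reports which ones are absent from the structured compaction output.

```ts
// Invented evaluation helper — not a Mastra feature. Checks that required
// facts survived compaction, anywhere in the structured output.
interface CompactionResult {
  compactedSummary: string;
  promotedFacts: string[];
  decisions: string[];
  openQuestions: string[];
}

function missingFacts(result: CompactionResult, required: string[]): string[] {
  const haystack = [
    result.compactedSummary,
    ...result.promotedFacts,
    ...result.decisions,
    ...result.openQuestions,
  ]
    .join('\n')
    .toLowerCase();
  // A fact "survived" if it appears, case-insensitively, in any field.
  return required.filter(f => !haystack.includes(f.toLowerCase()));
}
```

Run a check like this against every compaction in your test suite and you will notice quality regressions the day they happen, not the day a customer does.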
5) Do not confuse compaction with retrieval
Compaction is about compressing conversational state.
Retrieval is about pulling in relevant external context when needed.
Mastra supports both memory and RAG-style retrieval patterns. You still want semantic recall or external retrieval for large bodies of knowledge that should not live permanently in prompt state. (Mastra)
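To see why the two concerns are different, here is a deliberately naive toy retrieval function. This is not Mastra's semantic recall (which works on embeddings, not word overlap); it only illustrates the shape of retrieval: nothing lives in the prompt permanently, and relevant material is scored and pulled in per query.

```ts
// Toy retrieval, invented for illustration — NOT Mastra's semantic recall,
// which matches by embedding similarity rather than word overlap.
// Scores past messages by shared words with the query and returns the top-k.
function retrieveRelevant(messages: string[], query: string, topK = 2): string[] {
  const queryWords = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  return messages
    .map(m => ({
      m,
      score: m.toLowerCase().split(/\W+/).filter(w => queryWords.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .filter(x => x.score > 0)
    .map(x => x.m);
}
```

Compaction decides what a conversation *keeps*; retrieval decides what a query *fetches*. Conflating them usually means either bloated prompts or silently dropped knowledge.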
The practical mental model
Here is the better way to think about it:
Message history is not your product memory model.
In Mastra, your real memory model is a combination of:
- instructions for stable behavior
- working memory for durable structured state
- observational memory for compressed long-range continuity
- message history for recent turns
- semantic recall for relevant past material
That is a much stronger architecture than replaying everything and hoping a bigger model window will save you. Mastra’s documentation reflects this layered approach directly. (Mastra)
Summary
Context window compaction is one of the clearest differences between an LLM demo and an LLM system.
A demo keeps appending turns. A system decides what should persist, what should compress, what should be retrieved, and what should be forgotten.
Mastra is useful here because it gives you the pieces in the right shape. Agents handle reasoning. Memory handles persistence. Observational memory helps replace raw history as it grows. Workflows let you make compaction explicit when the stakes are high. (Mastra)