Every new conversation with ChatGPT starts from zero. It doesn't remember you, your preferences, or that bug you fixed together last week. But what if it could?
We've explored how to craft the perfect prompts, built reasoning agents that think step-by-step, created RAG systems that access external knowledge, and equipped our LLMs with action tools that actually do things. But there's still one critical piece missing.
Memory systems are that final piece — and implementing them right is trickier than you'd think. The difference between a chatbot and an intelligent assistant? Memory. The gap between a demo and production? Memory management that doesn't drain your budget.
The Problem
Your LLM is brilliant but has the memory of a goldfish. Every API call is a fresh start. For a chatbot, that's annoying. For a production system? It's a dealbreaker.
Here's what happens without memory:
- Customer asks about their order → "I don't have access to order information"
- Developer asks for help debugging → Suggests solutions you already tried
- Support agent hands off a ticket → New agent starts from zero context
Lost context leads to lost customers.
1. Memory Types: The Architecture of Digital Memory
Not all memories are equal. Just like humans, LLMs need different types of memory for different purposes.
- 🟡 Semantic Memory — Meaning and context, including:
- Profile: Single, continuously updated user/context profile (JSON document with key-value pairs)
- Collection: Set of discrete memory items updated and extended over time
- 🔵 Procedural Memory — How-to knowledge and learned workflows
- 🟣 Episodic Memory — When and where things happened, with temporal context
Real-World Memory Mapping:
# Conversation: "Fix the login bug we discussed yesterday"
# Needs:
memory_requirements = {
    "semantic": {
        "profile": "User's tech stack, preferences, and system context",
        "collection": "Previous bug discussions, solutions tried, related issues"
    },
    "procedural": "What's the standard debugging workflow?",
    "episodic": "When did this start? What changed recently?"
}
Without all three types, your LLM is playing telephone with incomplete information.
In practice, most engineering teams simplify this into just two categories: Short-term (what's happening now) and Long-term (what we need to remember).
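In code, that two-tier split can start as nothing more than a container with two fields. A minimal sketch (the class and field names are illustrative, not from any specific library):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AgentMemory:
    """Illustrative two-tier split most teams converge on."""
    short_term: List[Dict] = field(default_factory=list)      # current conversation turns
    long_term: Dict[str, Dict] = field(default_factory=dict)  # persisted facts, keyed by id

    def remember_turn(self, role: str, content: str) -> None:
        # Everything lands in short-term first; promotion to
        # long-term happens selectively (see sections 3 and 4).
        self.short_term.append({"role": role, "content": content})
```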
2. Short-Term Memory: The Working Context
Once you understand the memory types you need, the first challenge is managing what's immediately relevant. Short-term memory is where conversations live and breathe.
Think of short-term memory as your LLM's notepad. It holds what's immediately relevant but has strict size limits.
The challenge? Context windows are expensive real estate. Every token costs money, and models have hard limits (4K to 200K tokens depending on the model).
Smart Conversation Buffering
from litellm import completion
from typing import List, Dict
import tiktoken

class ConversationMemory:
    def __init__(self, max_tokens: int = 2000, model: str = "gpt-3.5-turbo"):
        self.messages: List[Dict] = []
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)

    def add_message(self, role: str, content: str) -> None:
        """Add message and trim if needed"""
        self.messages.append({"role": role, "content": content})
        self._smart_trim()

    def _smart_trim(self) -> None:
        """Keep system prompt + recent messages within token limit"""
        while self._count_tokens() > self.max_tokens and len(self.messages) > 2:
            # Never remove system prompt (index 0) or last message
            # Remove from the middle, preserving conversation flow
            if len(self.messages) > 3:
                # Remove oldest user/assistant pair
                self.messages.pop(1)  # Remove old user message
            if len(self.messages) > 2:
                self.messages.pop(1)  # Remove old assistant response

    def _count_tokens(self) -> int:
        """Count total tokens in conversation"""
        total = 0
        for message in self.messages:
            total += len(self.encoder.encode(message["content"]))
        return total

    def get_context(self) -> List[Dict]:
        """Get trimmed conversation for LLM"""
        return self.messages.copy()

# Usage example
memory = ConversationMemory(max_tokens=1000)
memory.add_message("system", "You are a helpful coding assistant.")
memory.add_message("user", "Help me debug this Python function")
memory.add_message("assistant", "I'd be happy to help! Please share the function.")

# Automatically trims old messages when limit is reached
print(f"Current conversation: {len(memory.messages)} messages")
print(f"Token count: {memory._count_tokens()}")
Run this example yourself 🔧
Script:
1_conversation_memory.py
Command:
uv run 1_conversation_memory.py
Expected Output:
=== Conversation Memory Demo ===
Token limit: 1000
--- Turn 1 ---
Messages: 3, Tokens: 29
--- Turn 2 ---
Messages: 5, Tokens: 53
--- Turn 3 ---
Messages: 7, Tokens: 76
--- Turn 4 ---
Messages: 9, Tokens: 109
--- Turn 5 ---
Messages: 11, Tokens: 136
--- Turn 6 ---
Messages: 13, Tokens: 161
--- Turn 7 ---
Messages: 15, Tokens: 175
=== Final Context ===
1. ⚙️ system: You are a helpful coding assistant specializing in...
2. 👤 user: Help me debug this Python function
3. 🤖 assistant: I'd be happy to help! Please share the function co...
4. 👤 user: def calculate_sum(numbers): return sum(numbers)
5. 🤖 assistant: That function looks correct! It calculates the sum...
6. 👤 user: It's giving me a TypeError
7. 🤖 assistant: The TypeError suggests you might be passing incomp...
8. 👤 user: TypeError: unsupported operand type(s) for +: 'int...
9. 🤖 assistant: Ah! You have strings mixed with integers. Try conv...
10. 👤 user: How do I convert strings to integers?
11. 🤖 assistant: Use int() function: int('5') converts string '5' t...
12. 👤 user: What about handling non-numeric strings?
13. 🤖 assistant: Use try/except with ValueError: try: int(x) except...
14. 👤 user: Show me a complete example
15. 🤖 assistant: Here's a robust version with error handling...
Final token count: 175
==================================================
The key insight? Don't just truncate randomly. Preserve the system prompt, keep the most recent exchanges, and remove middle conversations that are less likely to be relevant.
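One step further, at the cost of an extra model call: summarize evicted middle messages instead of discarding them. A minimal sketch, assuming a hypothetical `summarize` callable (for example, a cheap completion call) that compresses messages into a short paragraph:

```python
def trim_with_summary(messages, max_tokens, count_tokens, summarize):
    """Variant of _smart_trim: fold evicted messages into a running summary.

    `count_tokens` and `summarize` are hypothetical callables; `summarize`
    might be a cheap LLM call that compresses messages to one paragraph.
    """
    while count_tokens(messages) > max_tokens and len(messages) > 3:
        has_summary = len(messages) > 1 and messages[1].get("is_summary")
        evict_at = 2 if has_summary else 1  # skip system prompt and summary slot
        evicted = messages.pop(evict_at)
        if has_summary:
            messages[1]["content"] += " " + summarize([evicted])
        else:
            messages.insert(1, {"role": "system",
                                "content": summarize([evicted]),
                                "is_summary": True})
    return messages
```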
3. Long-Term Memory: The Persistent Brain
While short-term memory handles the here and now, long-term memory transforms your LLM from a forgetful assistant into something that genuinely learns about you and your needs over time.
Short-term memory gets you through a conversation. Long-term memory makes your LLM actually intelligent over time.
Two storage strategies that actually work in production:
Entity Memory: Tracking What Matters
Remember people, projects, and important objects across conversations.
from dataclasses import dataclass, field
from typing import Dict, List, Set
from datetime import datetime

@dataclass
class Entity:
    name: str
    type: str  # "person", "project", "system", "concept"
    attributes: Dict = field(default_factory=dict)
    relationships: Dict = field(default_factory=dict)
    last_mentioned: datetime = field(default_factory=datetime.now)
    mention_count: int = 0
    importance_score: float = 1.0

class EntityMemory:
    def __init__(self):
        self.entities: Dict[str, Entity] = {}
        self.relationships: Dict[str, Set[str]] = {}

    def extract_and_store_entities(self, text: str, conversation_context: Dict = None) -> List[str]:
        """Extract entities from text and store them"""
        # In production, use NER models like spaCy or custom extraction
        # For demo, we'll use simple keyword detection
        entity_patterns = {
            "person": ["user", "customer", "developer", "team member"],
            "project": ["app", "system", "platform", "service"],
            "technology": ["database", "API", "framework", "library"],
            "concept": ["bug", "feature", "requirement", "issue"]
        }
        found_entities = []
        text_lower = text.lower()
        for entity_type, keywords in entity_patterns.items():
            for keyword in keywords:
                if keyword in text_lower:
                    entity_id = f"{entity_type}_{keyword}"
                    self._update_entity(entity_id, keyword, entity_type, text)
                    found_entities.append(entity_id)
        return found_entities

    def _update_entity(self, entity_id: str, name: str, entity_type: str, context: str) -> None:
        """Update or create entity"""
        if entity_id in self.entities:
            entity = self.entities[entity_id]
            entity.mention_count += 1
            entity.last_mentioned = datetime.now()
            entity.importance_score += 0.1  # Increase importance with mentions
        else:
            entity = Entity(
                name=name,
                type=entity_type,
                attributes={"first_context": context[:100]},
                mention_count=1
            )
            self.entities[entity_id] = entity

    def get_relevant_entities(self, query: str, top_k: int = 5) -> List[Entity]:
        """Get entities relevant to current query"""
        relevant = []
        query_lower = query.lower()
        for entity in self.entities.values():
            relevance_score = 0
            # Name match
            if entity.name.lower() in query_lower:
                relevance_score += 2.0
            # Type relevance
            if entity.type in query_lower:
                relevance_score += 1.0
            # Recency boost
            hours_since_mention = (datetime.now() - entity.last_mentioned).total_seconds() / 3600
            recency_boost = max(0, 1 - (hours_since_mention / 24))  # Decay over 24 hours
            relevance_score += entity.importance_score * recency_boost
            if relevance_score > 0:
                relevant.append((entity, relevance_score))
        # Sort by relevance and return top_k
        relevant.sort(key=lambda x: x[1], reverse=True)
        return [entity for entity, score in relevant[:top_k]]

# Example usage
entity_memory = EntityMemory()

# Simulate conversation
conversation = [
    "The user mentioned their app has database performance issues",
    "The customer is using PostgreSQL with their Django app",
    "Developer Smith reported a bug in the authentication system",
    "The API service needs optimization for better performance"
]
for message in conversation:
    entities = entity_memory.extract_and_store_entities(message)
    print(f"Found entities: {entities}")

# Query for relevant entities
query = "Help optimize the database performance"
relevant_entities = entity_memory.get_relevant_entities(query)
print(f"\nRelevant entities for '{query}':")
for entity in relevant_entities:
    print(f"- {entity.name} ({entity.type}) - mentioned {entity.mention_count} times")
Run this example yourself 🔧
Script:
2_entity_memory.py
Command:
uv run 2_entity_memory.py
Expected Output:
=== Entity Memory Demo ===
Processing conversation messages...
1. The user mentioned their app has database performance issues
Found entities: ['person_user', 'project_app', 'technology_database', 'concept_issue']
2. The customer is using PostgreSQL with their Django app
Found entities: ['person_customer', 'project_app', 'technology_postgresql', 'technology_django']
3. Developer Smith reported a bug in the authentication system
Found entities: ['person_developer', 'person_smith', 'project_system', 'concept_bug']
4. The API service needs optimization for better performance
Found entities: ['project_service', 'project_api']
5. John from the team fixed the timeout error in the PostgreSQL database
Found entities: ['person_john', 'technology_database', 'technology_postgresql', 'concept_error', 'concept_timeout']
6. The Django app is now working properly after Smith's bug fix
Found entities: ['person_smith', 'project_app', 'technology_django', 'concept_bug']
=== Entity Storage Summary ===
- user (person): 1 mentions, importance: 1.00
- app (project): 3 mentions, importance: 1.20
- database (technology): 2 mentions, importance: 1.10
- issue (concept): 1 mentions, importance: 1.00
- customer (person): 1 mentions, importance: 1.00
- postgresql (technology): 2 mentions, importance: 1.10
- django (technology): 2 mentions, importance: 1.10
- developer (person): 1 mentions, importance: 1.00
- smith (person): 2 mentions, importance: 1.10
- system (project): 1 mentions, importance: 1.00
- bug (concept): 2 mentions, importance: 1.10
- service (project): 1 mentions, importance: 1.00
- api (project): 1 mentions, importance: 1.00
- john (person): 1 mentions, importance: 1.00
- error (concept): 1 mentions, importance: 1.00
- timeout (concept): 1 mentions, importance: 1.00
=== Query Examples ===
Query: 'Help optimize the database performance'
Relevant entities:
- database (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- app (project) - mentioned 3 times
Recent context: The Django app is now working properly after Smith's bug fix
- postgresql (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- user (person) - mentioned 1 times
Recent context: The user mentioned their app has database performance issues
- issue (concept) - mentioned 1 times
Recent context: The user mentioned their app has database performance issues
Query: 'Who fixed the authentication bug?'
Relevant entities:
- bug (concept) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
- app (project) - mentioned 3 times
Recent context: The Django app is now working properly after Smith's bug fix
- database (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- smith (person) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
- postgresql (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
Query: 'What technology does the app use?'
Relevant entities:
- app (project) - mentioned 3 times
Recent context: The Django app is now working properly after Smith's bug fix
- django (technology) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
- database (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- postgresql (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- bug (concept) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
Query: 'Tell me about recent errors'
Relevant entities:
- error (concept) - mentioned 1 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
- app (project) - mentioned 3 times
Recent context: The Django app is now working properly after Smith's bug fix
- django (technology) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
- bug (concept) - mentioned 2 times
Recent context: The Django app is now working properly after Smith's bug fix
- database (technology) - mentioned 2 times
Recent context: John from the team fixed the timeout error in the PostgreSQL database
=== Entity Network Example ===
Entity: smith (person)
Related entities:
- reported: bug (concept)
Vector-Based Semantic Memory
Store the meaning of conversations, not just the text.
from typing import Dict, List
from datetime import datetime
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticMemory:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.memories: List[Dict] = []
        self.embeddings: List[np.ndarray] = []

    def store_memory(self, content: str, metadata: Dict = None) -> str:
        """Store content with semantic embedding"""
        memory_id = f"mem_{len(self.memories)}_{int(datetime.now().timestamp())}"
        # Generate embedding
        embedding = self.encoder.encode(content)
        # Store memory
        memory = {
            "id": memory_id,
            "content": content,
            "timestamp": datetime.now().isoformat(),
            "metadata": metadata or {},
            "access_count": 0
        }
        self.memories.append(memory)
        self.embeddings.append(embedding)
        return memory_id

    def retrieve_similar(self, query: str, top_k: int = 5, threshold: float = 0.7) -> List[Dict]:
        """Find semantically similar memories"""
        if not self.embeddings:
            return []
        # Encode query
        query_embedding = self.encoder.encode(query)
        # Calculate cosine similarities
        similarities = []
        for i, memory_embedding in enumerate(self.embeddings):
            similarity = np.dot(query_embedding, memory_embedding) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(memory_embedding)
            )
            similarities.append((i, similarity))
        # Sort by similarity and filter by threshold
        similarities.sort(key=lambda x: x[1], reverse=True)
        results = []
        for idx, score in similarities[:top_k]:
            if score >= threshold:
                memory = self.memories[idx].copy()
                memory["similarity_score"] = float(score)
                # Update access count
                self.memories[idx]["access_count"] += 1
                results.append(memory)
        return results

# Example usage
semantic_memory = SemanticMemory()

# Store various memories
memories_to_store = [
    "User prefers concise code examples with comments",
    "Customer uses PostgreSQL 14 with Django ORM",
    "Debugging session: login timeout issues resolved by increasing session timeout",
    "User wants cost-effective solutions, mentioned budget constraints"
]
for memory in memories_to_store:
    memory_id = semantic_memory.store_memory(memory)
    print(f"Stored: {memory_id}")

# Retrieve relevant memories
query = "Show me efficient database code"
relevant = semantic_memory.retrieve_similar(query, top_k=3)
print(f"\nQuery: {query}")
for memory in relevant:
    print(f"- {memory['content']} (score: {memory['similarity_score']:.3f})")
Run this example yourself 🔧
Script:
3_vector_memory.py
Command:
uv run 3_vector_memory.py
Expected Output:
=== Semantic Memory Demo ===
Loading embedding model: all-MiniLM-L6-v2
Storing memories...
✓ Stored: mem_0_175587... - User prefers concise code examples with comments...
✓ Stored: mem_1_175587... - Customer uses PostgreSQL 14 with Django ORM...
✓ Stored: mem_2_175587... - Debugging session: login timeout issues resolved b...
✓ Stored: mem_3_175587... - User wants cost-effective solutions, mentioned bud...
✓ Stored: mem_4_175587... - Fixed database connection pooling issue in product...
✓ Stored: mem_5_175587... - User asked about Python best practices for web dev...
✓ Stored: mem_6_175587... - Resolved memory leak in Django application by opti...
✓ Stored: mem_7_175587... - Customer prefers AWS over Google Cloud for deploym...
✓ Stored: mem_8_175587... - Implemented caching strategy using Redis for bette...
✓ Stored: mem_9_175587... - User mentioned they work with large datasets and n...
Total memories stored: 10
=== Semantic Retrieval Tests ===
Query: 'Show me efficient database code'
Relevant memories:
1. [context] User mentioned they work with large datasets and need efficient processing
Similarity: 0.361, Importance: 0.8, Accessed: 0 times
2. [preference] User prefers concise code examples with comments
Similarity: 0.342, Importance: 0.9, Accessed: 0 times
Query: 'What are the user's preferences?'
No relevant memories found above threshold
Query: 'Help with performance optimization'
Relevant memories:
1. [solution] Implemented caching strategy using Redis for better performance
Similarity: 0.367, Importance: 0.8, Accessed: 0 times
2. [preference] User wants cost-effective solutions, mentioned budget constraints
Similarity: 0.347, Importance: 0.7, Accessed: 0 times
Query: 'Cloud deployment options'
Relevant memories:
1. [preference] Customer prefers AWS over Google Cloud for deployment
Similarity: 0.685, Importance: 0.7, Accessed: 0 times
Query: 'Python web development advice'
Relevant memories:
1. [query] User asked about Python best practices for web development
Similarity: 0.837, Importance: 0.6, Accessed: 0 times
=== Memory Clustering Analysis ===
Cluster 2 (2 memories):
Sample: User prefers concise code examples with comments...
Types: {'preference': 1, 'query': 1}
Cluster 1 (6 memories):
Sample: Customer uses PostgreSQL 14 with Django ORM...
Types: {'technical_context': 1, 'solution': 4, 'context': 1}
Cluster 0 (2 memories):
Sample: User wants cost-effective solutions, mentioned budget constraints...
Types: {'preference': 2}
=== Memory Access Patterns ===
- preference: 3 memories, avg 1.0 accesses
- technical_context: 1 memories, avg 0.0 accesses
- solution: 4 memories, avg 0.2 accesses
- query: 1 memories, avg 1.0 accesses
- context: 1 memories, avg 1.0 accesses
4. Production Architecture: Memory That Scales
Now that we've built the individual memory components, the real question is: how do you orchestrate them together? A production-ready memory system isn't just one approach — it's multiple memory types working in harmony.
The magic happens when you combine all memory types into a unified system. Think of it as a three-tier architecture:
Hot Memory (Redis) — Recent interactions that need sub-millisecond access. Session data, conversation buffers, and temporary context that expires quickly.
Warm Memory (In-Memory) — Current session state. Entity tracking, conversation history, and working memory that lives for the duration of a user session.
Cold Memory (Vector Store) — Long-term semantic memories. Important interactions, user preferences, and learned patterns that persist across sessions and get retrieved via similarity search.
The key insight: route intelligently. Not every interaction needs to go into long-term memory. Use importance scoring to decide what deserves expensive vector storage.
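What might that importance score look like? The `_calculate_importance` call in the class below is left abstract; here is one minimal heuristic sketch (the keywords and weights are illustrative assumptions, not a fixed recipe):

```python
def calculate_importance(interaction: dict) -> float:
    """Toy importance heuristic: clamp a weighted sum of signals to [0, 1].

    The signals and weights are illustrative; production systems often
    use an LLM judge or a trained classifier instead.
    """
    content = interaction["content"].lower()
    score = 0.3  # base score for any interaction
    # Explicit preferences and decisions are worth persisting
    if any(kw in content for kw in ("prefer", "always", "never", "decided")):
        score += 0.4
    # Errors and their resolutions tend to be referenced later
    if any(kw in content for kw in ("error", "bug", "fixed", "resolved")):
        score += 0.3
    # Longer messages carry more context (a very rough proxy)
    score += min(len(content) / 2000, 0.2)
    return min(score, 1.0)
```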
import json
from typing import Dict

import redis

# ConversationMemory and SemanticMemory are the classes from sections 2 and 3
class HybridMemorySystem:
    def __init__(self):
        self.hot_memory = redis.from_url("redis://localhost:6379")  # Fast access
        self.warm_memory = ConversationMemory()  # Session state
        self.cold_memory = SemanticMemory()  # Long-term storage

    def store_interaction(self, user_id: str, interaction: Dict) -> None:
        # Always store in hot memory (fast, temporary, expires after 1 hour)
        self.hot_memory.setex(f"recent:{user_id}", 3600, json.dumps(interaction))
        # Update warm memory (session context)
        self.warm_memory.add_message(interaction["role"], interaction["content"])
        # Selectively store in cold memory (expensive, permanent)
        importance = self._calculate_importance(interaction)
        if importance > 0.7:  # Only important stuff goes to long-term
            self.cold_memory.store_memory(
                interaction["content"],
                metadata={"user_id": user_id, "importance": importance}
            )

    def get_context(self, user_id: str, query: str) -> Dict:
        return {
            "recent": json.loads(self.hot_memory.get(f"recent:{user_id}") or "{}"),
            "conversation": self.warm_memory.get_context(),
            "similar_past": self.cold_memory.retrieve_similar(query, top_k=3)
        }
Run this example yourself 🔧
Script:
4_hybrid_memory_system.py
Command:
uv run 4_hybrid_memory_system.py
Expected Output:
=== Hybrid Memory System Demo ===
--- Interaction 1 ---
Storing interaction for user user_123
✓ Stored in hot memory (Redis): I'm getting a database timeout error in my Python ...
✓ Updated warm memory (session): 1 messages
✓ Calculated importance score: 0.7999999999999999
✓ Stored in cold memory (vector): mem_1
--- Interaction 2 ---
Storing interaction for user user_123
✓ Stored in hot memory (Redis): Let's debug this step by step. First, check your c...
✓ Updated warm memory (session): 2 messages
✓ Calculated importance score: 0.7
✗ Not important enough for long-term storage
--- Interaction 3 ---
Storing interaction for user user_123
✓ Stored in hot memory (Redis): I prefer concise code examples with detailed comme...
✓ Updated warm memory (session): 3 messages
✓ Calculated importance score: 0.8999999999999999
✓ Stored in cold memory (vector): mem_2
--- Interaction 4 ---
Storing interaction for user user_123
✓ Stored in hot memory (Redis): I'll keep that in mind. Here's a concise example w...
✓ Updated warm memory (session): 4 messages
✓ Calculated importance score: 0.5
✗ Not important enough for long-term storage
--- Interaction 5 ---
Storing interaction for user user_123
✓ Stored in hot memory (Redis): What's the weather like?...
✓ Updated warm memory (session): 5 messages
✓ Calculated importance score: 0.8
✓ Stored in cold memory (vector): mem_3
==================================================
Retrieving context for user user_123, query: 'Show me how to optimize database connections'
✓ Context assembled:
- Recent: Yes
- Conversation: 5 messages
- Similar past: 1 memories
=== Final Context Summary ===
Query: 'Show me how to optimize database connections'
Hot Memory: True
Warm Memory: 5 messages
Cold Memory: 1 relevant memories
Most relevant past memory:
Content: I'm getting a database timeout error in my Python app...
Similarity: 0.14
5. Smart Forgetting: Why Less Is More
With all this memory storage capability, you might think "more is always better." That's where you'd be wrong. The secret to effective memory systems isn't just knowing how to remember — it's knowing when and what to forget. Remember: memory feeds into context windows, and those windows have hard limits.
Here's the counterintuitive truth: Good memory systems forget strategically. Without forgetting, you get:
- Irrelevant old information cluttering context
- Storage costs spiraling out of control
- Privacy compliance nightmares
- Performance degradation from too much data
The key to effective forgetting is modeling how human memory actually works. Important memories get reinforced through repeated access, while unused information naturally fades. A smart forgetting algorithm assigns each memory an importance score that decays exponentially over time — memories lose value at a rate of about 5% per day by default. But here's the clever part: every time a memory gets accessed, it receives a relevance boost that fights the decay.
Frequently accessed memories stay fresh, while those collecting digital dust gradually become candidates for removal. The system also considers factors like initial importance (critical bug fixes get higher base scores than casual chitchat) and access patterns (memories accessed recently or multiple times get protection).
This creates a natural pruning mechanism that keeps your most valuable context while automatically clearing out the noise — exactly what you need for a production system that learns and adapts over time without drowning in irrelevant history.
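As a worked sketch of that decay-plus-reinforcement rule, assuming the 5%-per-day default mentioned above (the access boost size and pruning threshold are illustrative assumptions):

```python
import math
from datetime import datetime

def memory_score(base_importance: float,
                 created_at: datetime,
                 last_accessed: datetime,
                 access_count: int,
                 daily_decay: float = 0.05,
                 access_boost: float = 0.1) -> float:
    """Score = importance decayed ~5%/day, plus a boost per access.

    Memories scoring below a pruning threshold (e.g. 0.2) become
    deletion candidates. Boost size and threshold are illustrative.
    """
    age_days = (datetime.now() - created_at).total_seconds() / 86400
    decayed = base_importance * math.exp(-daily_decay * age_days)
    # Recent access fights decay: each access adds a boost that itself decays
    recency_days = (datetime.now() - last_accessed).total_seconds() / 86400
    boost = access_count * access_boost * math.exp(-daily_decay * recency_days)
    return decayed + boost
```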
6. Privacy & Compliance: GDPR-Safe Memory
But before you deploy any memory system to production, there's one more critical piece: compliance. Memory systems inevitably store personal data, and that means navigating the complex world of privacy regulations.
Memory systems collect personal data. That means GDPR, CCPA, and other privacy regulations apply. The challenge isn't just technical — it's legal and ethical.
The core privacy principles for memory systems:
Data Classification — Not all memories are equal. Personal preferences need different treatment than public documentation. Classify data into categories (Public, Internal, Confidential, Personal) with different retention policies and access controls.
Consent Management — Users must explicitly consent to data processing. Store what they've agreed to, and when they revoke consent, delete the related memories immediately. No exceptions.
Right to be Forgotten — GDPR Article 17 requires you to delete all user data on request. This means tracking every memory by user ID and having a reliable deletion process that actually works.
Automatic Expiration — Set retention policies by data type. Personal data might expire in 30 days, while public documentation can stay for a year. Build expiration into the system from day one.
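For instance, the expiration rule can be a small lookup table keyed by classification. A sketch with illustrative retention windows (these numbers are assumptions, not legal guidance):

```python
from datetime import datetime, timedelta

# Illustrative retention windows per data classification
RETENTION_POLICIES = {
    "personal": timedelta(days=30),
    "confidential": timedelta(days=90),
    "internal": timedelta(days=180),
    "public": timedelta(days=365),
}

def calculate_expiry(data_type: str) -> datetime:
    # Unknown types default to the strictest policy
    return datetime.now() + RETENTION_POLICIES.get(data_type, timedelta(days=30))
```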
Here's a simplified example of the key privacy controls:
from typing import Dict, Optional, Set

class PrivacyCompliantMemory:
    def __init__(self):
        self.memories: Dict[str, Dict] = {}
        self.user_consent: Dict[str, Set[str]] = {}  # user_id -> consent types

    def store_memory(self, user_id: str, content: str,
                     data_type: str, consent_type: str) -> Optional[str]:
        # Check consent before storing
        if not self._has_consent(user_id, consent_type):
            return None
        # Auto-anonymize personal data
        if data_type == "personal":
            content = self._anonymize_pii(content)
        # Store with expiration
        memory_id = self._generate_secure_id(user_id, content)
        self.memories[memory_id] = {
            "content": content,
            "user_id": user_id,
            "expires_at": self._calculate_expiry(data_type)
        }
        return memory_id

    def right_to_be_forgotten(self, user_id: str) -> int:
        # Delete all memories for this user (GDPR Article 17)
        deleted_count = 0
        for memory_id in list(self.memories.keys()):
            if self.memories[memory_id]["user_id"] == user_id:
                del self.memories[memory_id]
                deleted_count += 1
        return deleted_count
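The `_anonymize_pii` helper above is elided. A minimal regex-based sketch (the patterns are illustrative and will miss plenty of real-world PII; production systems typically use a dedicated PII-detection library):

```python
import re

def anonymize_pii(text: str) -> str:
    """Replace common PII patterns with placeholder tokens.

    Regexes are deliberately simple and illustrative; they will
    produce both false positives and misses on real data.
    """
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE]", text)
    text = re.sub(r"\b(?:\d[ -]*?){13,16}\b", "[CREDIT_CARD]", text)
    return text
```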
Run this example yourself 🔧
Script:
6_privacy_compliant_memory.py
Command:
uv run 6_privacy_compliant_memory.py
Expected Output:
=== Privacy-Compliant Memory Demo ===
--- Granting User Consent ---
✅ Granted consent for user123: analytics, personalization
✅ Granted consent for user456: analytics
✅ Granted consent for user789: marketing, personalization
--- Storing Memories with Privacy Controls ---
🔒 Anonymized personal data: 'User prefers dark mode' -> 'User prefers dark mode'
✅ Stored memory efa27aeb96cc... (personal)
✅ Stored memory 2535b28a614a... (internal)
🔒 Anonymized personal data: 'Contact: john.doe@email.com for follow-up' -> 'Contact: [EMAIL] for follow-up'
✅ Stored memory 816e8334dd9d... (personal)
✅ Stored memory d0add335864b... (public)
❌ Storage denied: User user456 has not consented to personalization
✅ Stored memory 8360e692bc1f... (confidential)
🔒 Anonymized personal data: 'Phone number: 555-123-4567 for support' -> 'Phone number: [PHONE] for support'
✅ Stored memory ece0fac306f3... (personal)
✅ Successfully stored 6 out of 7 memories
--- Initial Privacy Compliance Report ---
Total Memories: 6
Total Users: 3
Classification Breakdown:
- personal: 3
- internal: 1
- public: 1
- confidential: 1
Consent Breakdown:
- personalization: 3
- analytics: 2
- marketing: 1
Retention Status:
- expired: 0
- expiring_soon: 3
- fresh: 3
Deletion Events: 0
Report Timestamp: 2025-08-22T21:44:41.361845
--- GDPR Compliance Demonstrations ---
1. Right to Data Portability (Article 20)
📤 Data export generated for user123: 3 memories
Exported data contains:
- 3 memories
- Consent status: ['analytics', 'personalization']
- Deletion history: 0 events
2. Consent Revocation
🗑️ Revoked personalization consent for user123: 2 memories deleted
Result: 2 memories deleted due to consent revocation
3. Right to be Forgotten (Article 17)
🔥 Right to be forgotten executed for user456: 1 memories deleted
Result: 1 memories completely erased
4. Automatic Retention Policy Enforcement
🕰️ Cleaned up 1 expired memories
Result: 1 memories automatically deleted due to retention policies
--- Final Privacy Compliance Report ---
Total Memories: 2
Total Users: 2
Classification Breakdown:
- internal: 1
- confidential: 1
Consent Breakdown:
- analytics: 1
- marketing: 1
Retention Status:
- expired: 0
- expiring_soon: 0
- fresh: 2
Deletion Events: 4
Report Timestamp: 2025-08-22T21:44:41.361884
--- Deletion Audit Log ---
1. 2025-08-22T21:44:41 - consent_revoked
Memory: efa27aeb96cc... (personal)
User: user123
2. 2025-08-22T21:44:41 - consent_revoked
Memory: 816e8334dd9d... (personal)
User: user123
3. 2025-08-22T21:44:41 - right_to_be_forgotten
Memory: d0add335864b... (public)
User: user456
4. 2025-08-22T21:44:41 - retention_period_expired
Memory: ece0fac306f3... (personal)
User: user789
Total audit entries: 4
=== PII Anonymization Demo ===
Original -> Anonymized:
'Contact John Smith at john.smith@company.com or 555-123-4567'
-> '[NAME] Smith at [EMAIL] or [PHONE]'
'Credit card 4532-1234-5678-9012 was used for payment'
-> 'Credit card [CREDIT_CARD] was used for payment'
'Jane Doe from accounting called about the invoice'
-> '[NAME] from accounting called about the invoice'
'Email support@company.com for technical issues'
-> 'Email [EMAIL] for technical issues'
'Customer phone: 1-800-555-0199 for urgent matters'
-> 'Customer phone: 1-[PHONE] for urgent matters'
The Hard Truths
Here's what the documentation doesn't mention and what you only learn the hard way:
- Memory isn't free: Every token stored costs money, every vector embedding requires compute
- Context windows are limited: You can't remember everything — choose wisely
- Retrieval adds latency: Speed vs completeness tradeoff — optimize for your use case
- Privacy matters: Not all memories should be kept — build compliance from day one
- Forgetting is a feature: Strategic amnesia prevents information overload
Key Takeaways
- Hybrid memory wins — Combine hot (Redis), warm (in-memory), and cold (vector) storage for optimal cost and performance
- Memory is as much about forgetting as remembering — Use exponential decay with access patterns to automatically prune low-value memories
- Privacy by design — Build GDPR compliance, consent management, and data classification from the start
What's Next?
With prompting, reasoning, knowledge, tools, and memory in place, you've got all the pieces. But how do you put them together into a production system that actually works? The real challenge is orchestrating these components safely and reliably — which brings us to Guardrails & Safety.
A final reality check: LLMs aren't magic. They're incredibly powerful tools for natural language processing — understanding, generating, and transforming text at superhuman scale. But they're not AGI (at least for now), they're not databases, and they're not infallible reasoning engines. The real value comes from combining them intelligently with traditional software engineering practices. Context engineering isn't about replacing your entire tech stack with AI — it's about making AI useful within your existing systems. Use these patterns wisely, measure their impact, and remember: the best LLM application is often the one that feels like it's not using LLMs at all.
Technical deep dive series — Part 5 of 5
← Part 4: Action Tools | Back to Overview →
Related Articles in This Series
📚 Context Engineering Deep Dive Series:
- User Intent & Prompting: The Art of Making LLMs Understand What You Really Want
- Agents & Reasoning: When LLMs Learn to Think Before They Speak
- RAG Systems: When Your LLM Needs to Phone a Friend
- Action Tools: How LLMs Finally Learned to Stop Talking and Start Doing
- Memory Systems (You are here)
🎯 Start with the overview: Context Engineering: How RAG, agents, and memory make LLMs actually useful