An LLM without tools is just an expensive autocomplete.
Give it function calling, and suddenly it's writing code, running queries, and sending emails.
The power? Immense. The risks? Let's talk about those too.
Remember when we had to parse LLM outputs with regex to trigger actions? Dark times! If you never lived through that, I'm happy for you; arriving later isn't always worse. Now models can call functions directly, use tools, and interact with the real world, and the implications are massive.
1. From Text to Action: The Evolution
Let's cut through the fluff. Before function calling, we were stuck in the dark ages of "parse and pray." We'd beg the LLM to output valid JSON:
import json

# The old way: Begging the LLM to output valid JSON
prompt = """
Analyze this SQL query and return EXACTLY this format:
{
    "action": "optimize_query",
    "query": "...",
    "suggestions": [...]
}
IMPORTANT: Output ONLY valid JSON, nothing else PLEASE, I'm begging you, bruh!
"""

response = llm.complete(prompt)  # whatever completion client you were using back then

# Pray it's valid JSON
try:
    action = json.loads(response)  # 50% chance of failure
except json.JSONDecodeError:
    # Welcome to regex hell
    action = extract_json_with_regex(response)  # hand-rolled fallback parser
June 2023 changed everything. OpenAI introduced function calling, and suddenly we had structured, reliable tool use:
from litellm import completion
from dotenv import load_dotenv
from os import getenv
load_dotenv()
# Modern way: Define your function schema
tools = [{
    "type": "function",
    "function": {
        "name": "optimize_query",
        "description": "Optimize a SQL query for performance",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "SQL query to optimize"},
                "target_db": {"type": "string", "enum": ["postgres", "mysql", "snowflake"]}
            },
            "required": ["query"]
        }
    }
}]

response = completion(
    model="openrouter/openai/gpt-oss-20b:free",
    api_key=getenv("OPENROUTER_API_KEY"),
    messages=[{"role": "user", "content": "Optimize: SELECT * FROM users WHERE age > 25"}],
    tools=tools,
    tool_choice="auto"
)

# Clean, structured, guaranteed format
if response.choices[0].message.get("tool_calls"):
    tool_call = response.choices[0].message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
Run this example yourself 🔧
Script:
1_function_calling_evolution.py
Command:
uv run 1_function_calling_evolution.py
Expected Output:
Function Calling Evolution Demo
==================================================
=== Old Way: JSON Parsing ===
Raw response:
{
"action": "optimize_query",
"query": "SELECT * FROM users WHERE age > 25",
"suggestions": ["Add index on age column", "Use specific columns instead of *"]
}
✅ Successfully parsed JSON:
{
"action": "optimize_query",
"query": "SELECT * FROM users WHERE age > 25",
"suggestions": [
"Add index on age column",
"Use specific columns instead of *"
]
}
=== Modern Way: Function Calling ===
✅ Function called: optimize_query
Arguments received:
{
"query": "SELECT * FROM users WHERE age > 25",
"suggestions": [
"Consider selecting only the columns you need instead of using SELECT *",
"Create an index on the age column to speed up the WHERE clause",
"If you have a large dataset, consider pagination to limit the number of rows returned."
]
}
Query to optimize: SELECT * FROM users WHERE age > 25
Target DB: Not specified
Suggestions:
1. Consider selecting only the columns you need instead of using SELECT *
2. Create an index on the age column to speed up the WHERE clause
3. If you have a large dataset, consider pagination to limit the number of rows returned.
Here's what nobody explains clearly about the terminology: Function Calling (OpenAI's approach), Tool Use (Anthropic's terminology), and Actions (what everyone else calls it) — they're all the same thing. The model decides which function to call and with what parameters. No more regex, no more prayer. Just clean, structured execution.
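The snippet above stops at detecting the tool call; in a real loop you also execute the function and feed the result back so the model can produce a final answer. Here's a minimal sketch of that round trip, reusing the `tools` and `response` from above (the `optimize_query` implementation is a hypothetical stand-in):

import json

def optimize_query(query: str, target_db: str = "postgres") -> dict:
    # Hypothetical implementation: canned suggestions for the demo
    return {"query": query, "suggestions": ["Avoid SELECT *", f"Add an index for {target_db}"]}

message = response.choices[0].message
tool_call = message.tool_calls[0]
args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string

# Run the real function with the model-chosen parameters
result = optimize_query(**args)

# Send the result back as a "tool" message so the model can finish its answer
followup = completion(
    model="openrouter/openai/gpt-oss-20b:free",
    api_key=getenv("OPENROUTER_API_KEY"),
    messages=[
        {"role": "user", "content": "Optimize: SELECT * FROM users WHERE age > 25"},
        message,  # the assistant message that contained the tool call
        {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)},
    ],
    tools=tools,
)
print(followup.choices[0].message.content)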
2. Building Safe Tool Interfaces
Now that we understand how function calling evolved from regex hell to structured tool use, let's tackle the critical question: how do we build tools that won't accidentally destroy our production systems?
Want to give an LLM database access? Here's how to not destroy everything:
The Pydantic Approach: Type Safety First
from typing import Literal
from pydantic import BaseModel, Field, field_validator
import re
class DatabaseQuery(BaseModel):
    """Tool for read-only database queries with multiple safety layers"""

    query: str = Field(description="SQL query to execute")
    database: Literal["staging", "analytics"] = Field(
        default="staging",
        description="Target database (prod not available)"
    )
    timeout_seconds: int = Field(default=30, le=60, description="Query timeout")

    @field_validator("query")
    @classmethod
    def validate_query(cls, v):
        """Multi-layer query validation"""
        # Layer 1: No destructive operations
        dangerous_keywords = ["DELETE", "UPDATE", "DROP", "ALTER", "TRUNCATE", "INSERT"]
        query_upper = v.upper()
        for keyword in dangerous_keywords:
            if re.search(r'\b' + keyword + r'\b', query_upper):
                raise ValueError(f"Destructive operation '{keyword}' not allowed")

        # Layer 2: Must be a SELECT query
        if not query_upper.strip().startswith("SELECT"):
            raise ValueError("Only SELECT queries allowed")

        # Layer 3: Limit check
        if "LIMIT" not in query_upper:
            v = f"{v.rstrip(';')} LIMIT 1000"  # Force limit
        return v

    def execute(self):
        """Execute with additional runtime checks"""
        # Connection would use read-only credentials
        # Wrapped in timeout context
        # Full audit logging
        pass
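Before wiring this to an LLM, you can sanity-check the validator by instantiating the model directly and watching it accept, rewrite, or reject queries (this mirrors the demo output below):

# Safe query: validated and auto-limited
q = DatabaseQuery(query="SELECT * FROM users WHERE age > 25")
print(q.query)  # SELECT * FROM users WHERE age > 25 LIMIT 1000

# Destructive query: rejected at validation time (ValidationError is a ValueError)
try:
    DatabaseQuery(query="DELETE FROM users WHERE id = 1")
except ValueError as e:
    print(f"Blocked: {e}")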
Run this example yourself 🔧
Script:
2_safe_tool_interfaces.py
Command:
uv run 2_safe_tool_interfaces.py
Expected Output:
Safe Tool Interfaces Demo
==================================================
=== Database Query Safety Demo ===
Testing: SELECT * FROM users WHERE age > 25
⚠️ Auto-added LIMIT 1000 to prevent large result sets
✅ Query validated: SELECT * FROM users WHERE age > 25 LIMIT 1000
Testing: SELECT COUNT(*) FROM orders
⚠️ Auto-added LIMIT 1000 to prevent large result sets
✅ Query validated: SELECT COUNT(*) FROM orders LIMIT 1000
Testing: DELETE FROM users WHERE id = 1
✅ Correctly blocked: 1 validation error for DatabaseQuery
query
Value error, ❌ Destructive operation 'DELETE' not allowed [type=value_error, input_value='DELETE FROM users WHERE id = 1', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
Testing: DROP TABLE users
✅ Correctly blocked: 1 validation error for DatabaseQuery
query
Value error, ❌ Destructive operation 'DROP' not allowed [type=value_error, input_value='DROP TABLE users', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
Testing: UPDATE users SET admin = true
✅ Correctly blocked: 1 validation error for DatabaseQuery
query
Value error, ❌ Destructive operation 'UPDATE' not allowed [type=value_error, input_value='UPDATE users SET admin = true', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
Testing: SELECT * FROM users; DELETE FROM orders
✅ Correctly blocked: 1 validation error for DatabaseQuery
query
Value error, ❌ Destructive operation 'DELETE' not allowed [type=value_error, input_value='SELECT * FROM users; DELETE FROM orders', input_type=str]
For further information visit https://errors.pydantic.dev/2.11/v/value_error
Testing: select * from products
⚠️ Auto-added LIMIT 1000 to prevent large result sets
✅ Query validated: select * from products LIMIT 1000
=== Defense-in-Depth Pattern ===
Executing safe query:
✅ Input validation passed
📝 Audit log: Executing DatabaseQuery
⏱️ Timeout protection active
🔒 Executing in sandbox environment
🔒 Executing query on staging database:
Query: SELECT name, email FROM users LIMIT 10
Timeout: 30s
Status: Would execute with read-only credentials
Executing file operation:
✅ Input validation passed
📝 Audit log: Executing FileOperation
⏱️ Timeout protection active
🔒 Executing in sandbox environment
The Defense-in-Depth Pattern
Never trust a single validation layer:
from functools import wraps
import time
from typing import Any, Callable
def rate_limit(calls_per_minute: int = 10):
    """Rate limiting decorator"""
    def decorator(func: Callable) -> Callable:
        call_times = []

        @wraps(func)
        def wrapper(*args, **kwargs) -> Any:
            now = time.time()
            # Clean old calls
            call_times[:] = [t for t in call_times if now - t < 60]
            if len(call_times) >= calls_per_minute:
                raise Exception(f"Rate limit exceeded: {calls_per_minute}/min")
            call_times.append(now)
            return func(*args, **kwargs)
        return wrapper
    return decorator

def audit_log(func: Callable) -> Callable:
    """Audit logging decorator"""
    @wraps(func)
    def wrapper(*args, **kwargs) -> Any:
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            # Log success
            print(f"✅ {func.__name__} succeeded in {time.time() - start_time:.2f}s")
            return result
        except Exception as e:
            # Log failure
            print(f"❌ {func.__name__} failed: {str(e)}")
            raise
    return wrapper

@audit_log
@rate_limit(calls_per_minute=5)
def execute_tool(tool_name: str, params: dict) -> Any:
    """Execute tool with all safety layers"""
    # Validation, execution, monitoring
    pass
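Stacked like this, the decorators give you two independent layers before any tool logic runs. A minimal usage sketch, assuming the DatabaseQuery model from above is in scope:

@audit_log
@rate_limit(calls_per_minute=5)
def run_database_query(params: dict) -> str:
    # Layer 1: schema validation (raises on destructive SQL)
    query = DatabaseQuery(**params)
    # Layer 2: execution would happen here with read-only credentials
    return f"Would execute on {query.database}: {query.query}"

print(run_database_query({"query": "SELECT name FROM users"}))
run_database_query({"query": "DROP TABLE users"})  # audit_log prints the failure, then re-raises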
3. Tool Categories That Matter
With our safety patterns in place — type validation, rate limiting, and defense-in-depth — we need to understand which tools we're actually building. Because let's be honest: giving an LLM the ability to send emails is very different from letting it read documentation.
Not all tools are created equal. Here's the hierarchy of danger:
🟢 Safe Tools (Start Here)
# Read-only operations
safe_tools = {
    "search_documentation": "Read API docs",
    "query_analytics": "Read-only database queries",
    "fetch_metrics": "Get performance data",
    "list_files": "Directory listings"
}
🟡 Moderate Risk Tools (Add Safeguards)
# State changes with limits
moderate_tools = {
    "send_slack_message": "Rate limited, specific channels only",
    "create_jira_ticket": "Template-based, no custom fields",
    "generate_report": "Resource limits, sandboxed execution",
    "cache_invalidation": "Specific keys only"
}
🔴 High Risk Tools (Human Approval Required)
# Never fully automated
dangerous_tools = {
    "execute_code": "Arbitrary code execution",
    "database_write": "Data modifications",
    "send_email": "External communications",
    "deploy_code": "Production changes"
}
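The point of the hierarchy is that your dispatch logic, not the model, decides what runs unattended. A minimal sketch of that gate (the tool names and risk labels mirror the demo below; the approval mechanism is a placeholder):

RISK_LEVELS = {"safe": 0, "moderate": 1, "high": 2, "critical": 3}

TOOL_RISK = {
    "search_documentation": "safe",
    "send_slack_message": "moderate",
    "database_write": "high",
    "deploy_code": "critical",
}

def dispatch_tool(tool_name: str, params: dict) -> str:
    risk = TOOL_RISK.get(tool_name, "critical")  # unknown tools are treated as critical
    if RISK_LEVELS[risk] >= RISK_LEVELS["high"]:
        # Placeholder: route to a human review queue instead of executing
        return f"🛑 {tool_name} ({risk}) queued for human approval"
    return f"✅ {tool_name} ({risk}) executed automatically"

print(dispatch_tool("search_documentation", {}))
print(dispatch_tool("deploy_code", {"service": "api"}))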
Run this example yourself 🔧
Script:
3_tool_categories.py
Command:
uv run 3_tool_categories.py
Expected Output:
Tool Categories and Risk Assessment
==================================================
=== Tool Risk Categories ===
🟢 Safe - 4 tools:
----------------------------------------
📦 search_documentation
Search and read API documentation
Safeguards:
• Read-only access
⏱️ Rate limit: 100/min
📦 query_analytics
Read-only database queries on analytics DB
Safeguards:
• Read-only credentials
• Automatic LIMIT clause
• Query timeout 30s
⏱️ Rate limit: 50/min
📦 fetch_metrics
Get performance metrics from monitoring
Safeguards:
• Cached responses
• Rate limiting
⏱️ Rate limit: 60/min
📦 list_files
List directory contents
Safeguards:
• Restricted to project directories
• No system paths
⏱️ Rate limit: 100/min
🟡 Moderate - 4 tools:
----------------------------------------
📦 send_slack_message
Send messages to Slack channels
Safeguards:
• Rate limited to 10/minute
• Restricted to specific channels
• Message length limit
⏱️ Rate limit: 10/min
📦 create_jira_ticket
Create tickets in Jira
Safeguards:
• Template-based creation only
• No custom field modifications
• Rate limited
⏱️ Rate limit: 5/min
📦 generate_report
Generate PDF/CSV reports
Safeguards:
• Resource limits (CPU/Memory)
• Sandboxed execution
• Output size limits
⏱️ Rate limit: 10/min
📦 cache_invalidation
Invalidate specific cache keys
Safeguards:
• Whitelist of allowed cache keys
• Rate limiting
• Rollback capability
⏱️ Rate limit: 5/min
🔴 High - 3 tools:
----------------------------------------
📦 execute_code
Execute arbitrary code in sandbox
Safeguards:
• Sandboxed environment
• Resource limits
• Timeout enforcement
⚠️ Requires human approval
⏱️ Rate limit: 1/min
📦 database_write
Modify database records
Safeguards:
• Transaction rollback capability
• Backup before modification
• Human approval required
⚠️ Requires human approval
⏱️ Rate limit: 1/min
📦 send_email
Send emails to external recipients
Safeguards:
• Template-based only
• Recipient whitelist
• Human approval for new recipients
⚠️ Requires human approval
⏱️ Rate limit: 5/min
⛔ Critical - 2 tools:
----------------------------------------
📦 deploy_code
Deploy code to production
Safeguards:
• Multi-stage approval
• Automated testing required
• Rollback plan mandatory
⚠️ Requires human approval
⏱️ Rate limit: 1/min
📦 modify_infrastructure
Change infrastructure configuration
Safeguards:
• Terraform plan review
• Cost estimation
• Multi-person approval
⚠️ Requires human approval
⏱️ Rate limit: 1/min
=== Execution Decision Logic ===
Tool: search_documentation
Risk: 🟢 Safe
✅ Can execute automatically
Rate limit: 100/min
Tool: send_slack_message
Risk: 🟡 Moderate
✅ Can execute automatically
Rate limit: 10/min
Tool: database_write
Risk: 🔴 High
🛑 Requires human approval
Safeguards: Transaction rollback capability, Backup before modification
Tool: deploy_code
Risk: ⛔ Critical
🛑 Requires human approval
Safeguards: Multi-stage approval, Automated testing required
4. The LangChain Toolkit Approach
Understanding tool risk categories is crucial, but managing individual tools manually gets overwhelming fast. This is where orchestration frameworks shine — and LangChain has become the de facto standard.
LangChain makes tool orchestration almost too easy. Here are some production-ready patterns:
Building a SQL Toolkit
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent
from litellm import completion
from langchain_community.llms.base import LLM
from typing import Any, List, Optional
from os import getenv

class LiteLLMWrapper(LLM):
    """Wrapper to use LiteLLM with LangChain"""

    model: str = "openrouter/openai/gpt-oss-20b:free"
    api_key: str

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:
        response = completion(
            model=self.model,
            api_key=self.api_key,
            messages=[{"role": "user", "content": prompt}],
            stop=stop
        )
        return response.choices[0].message.content

    @property
    def _llm_type(self) -> str:
        return "litellm"

# Safe database connection
db = SQLDatabase.from_uri(
    "sqlite:///example.db",  # Use read-only connection in production
    sample_rows_in_table_info=3
)

# Create agent with safety features
llm = LiteLLMWrapper(api_key=getenv("OPENROUTER_API_KEY"))
agent = create_sql_agent(
    llm=llm,
    db=db,
    agent_type="openai-tools",  # Use tool calling
    verbose=True,
    handle_parsing_errors=True,
    max_iterations=5,  # Prevent infinite loops
    max_execution_time=30  # Timeout protection
)

# Safe execution with error handling
try:
    result = agent.invoke({
        "input": "What are the top 5 customers by revenue?"
    })
except Exception as e:
    print(f"Execution failed safely: {e}")
Run this example yourself 🔧
Script:
4_langchain_toolkit.py
Command:
uv run 4_langchain_toolkit.py
Expected Output:
LangChain Toolkit Approach
==================================================
=== LangChain SQL Toolkit Demo ===
Available tools:
• execute_sql_query: Execute a read-only SQL query on the database
• get_database_schema: Get the schema of all tables in the database
==================================================
1. Getting database schema:
Database Schema:
==================================================
Table: customers
- id (INTEGER)
- name (TEXT)
- email (TEXT)
- country (TEXT)
- total_spent (REAL)
Table: orders
- id (INTEGER)
- customer_id (INTEGER)
- product (TEXT)
- amount (REAL)
- order_date (DATE)
==================================================
2. Executing safe queries:
Query: SELECT * FROM customers WHERE country = 'USA'
Columns: id, name, email, country, total_spent
--------------------------------------------------
1 | Alice Johnson | alice@example.com | USA | 1500.0
4 | Diana Prince | diana@example.com | USA | 3200.0
(Showing 2 of 2 results)
--------------------------------------------------
Query: SELECT COUNT(*) as total_orders FROM orders
Columns: total_orders
--------------------------------------------------
5
(Showing 1 of 1 results)
--------------------------------------------------
Query: SELECT c.name, SUM(o.amount) as total FROM customers c JOIN orders o ON c.id = o.customer_id GROUP BY c.id
Columns: name, total
--------------------------------------------------
Alice Johnson | 1250.0
Bob Smith | 800.0
Charlie Brown | 150.0
Diana Prince | 2500.0
(Showing 4 of 4 results)
--------------------------------------------------
=== LLM Integration Demo ===
Question: What are the top 3 customers by total spent?
Generated SQL: ```sql
SELECT name, total_spent
FROM customers
ORDER BY total_spent DESC
LIMIT 3;
```
Result:
Error executing query: near "```sql
SELECT name, total_spent
FROM customers
ORDER BY total_spent DESC
LIMIT 3;
```": syntax error
--------------------------------------------------
Question: How many orders do we have in total?
Generated SQL: ```sql
SELECT COUNT(*) AS total_orders FROM orders;
```
Result:
Error executing query: near "```sql
SELECT COUNT(*) AS total_orders FROM orders;
```": syntax error
--------------------------------------------------
Question: Show me all customers from the USA
Generated SQL: ```sql
SELECT * FROM customers WHERE country = 'USA';
```
Result:
Error executing query: near "```sql
SELECT * FROM customers WHERE country = 'USA';
```": syntax error
--------------------------------------------------
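Notice the errors above: the LLM wrapped its SQL in markdown fences, and SQLite choked on the backticks. A small sanitizer between generation and execution fixes this class of failure; here's a sketch, assuming the generated text is in a string like the ones shown:

import re

def strip_sql_fences(generated_sql: str) -> str:
    """Remove markdown code fences (``` or ```sql) around generated SQL."""
    cleaned = re.sub(r"^```(?:sql)?\s*", "", generated_sql.strip())
    cleaned = re.sub(r"\s*```$", "", cleaned)
    return cleaned.strip()

sql = strip_sql_fences("```sql\nSELECT COUNT(*) AS total_orders FROM orders;\n```")
print(sql)  # SELECT COUNT(*) AS total_orders FROM orders;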
Custom Tool Creation
from langchain.tools import Tool, StructuredTool
from pydantic import BaseModel
from typing import Literal

class CodeAnalysisInput(BaseModel):
    file_path: str
    analysis_type: Literal["security", "performance", "style"]

def analyze_code(file_path: str, analysis_type: str) -> str:
    """Analyze code with specific focus"""
    # Implementation here
    return f"Analysis of {file_path} for {analysis_type}"

# Structured tool with schema
code_analyzer = StructuredTool.from_function(
    func=analyze_code,
    name="code_analyzer",
    description="Analyze code for security, performance, or style issues",
    args_schema=CodeAnalysisInput,
    return_direct=False,  # Let agent process results
    handle_tool_error=True  # Graceful error handling
)
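Because the tool carries a Pydantic schema, arguments are checked before your function ever runs. A quick way to exercise it directly, outside an agent:

# Valid call: arguments are validated against CodeAnalysisInput first
print(code_analyzer.invoke({"file_path": "app/main.py", "analysis_type": "security"}))

# Invalid call: "compliance" is not in the Literal, so validation fails
try:
    code_analyzer.invoke({"file_path": "app/main.py", "analysis_type": "compliance"})
except Exception as e:
    print(f"Rejected by schema: {e}")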
5. All-in-One with MCP (Model Context Protocol)
We've seen how to build tools with LangChain, categorize them by risk, and implement safety patterns. But managing all these integrations across different LLM providers gets complex fast. Enter MCP — Anthropic's answer to the tool integration chaos.
Anthropic's MCP is the new kid on the block — and it's quickly becoming the standard everyone's adopting. Instead of rehashing the theory (plenty of that out there already), let's dive straight into a real implementation.
I'll use the official time server as a practical example — a real MCP server that demonstrates STDIO communication (for other servers, or to implement a custom one, see here for more details):
MCP Configuration
The beauty of MCP is its configuration-based approach. Just like Claude Desktop or VS Code, you define which servers to use:
uv add mcp-server-time # Install the MCP time server
uv add mcp # Install MCP Python SDK
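For reference, the same server in Claude Desktop's claude_desktop_config.json looks like this; the Python agent below takes an equivalent dictionary:

{
  "mcpServers": {
    "time": {
      "command": "uvx",
      "args": ["mcp-server-time"]
    }
  }
}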
Here's how to build an LLM agent using MCP - just like configuring Claude Desktop:
import asyncio
from os import getenv

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
from litellm import completion

class MCPAgent:
    """LLM Agent with MCP server configuration"""

    def __init__(self, mcp_config):
        """Configure like Claude Desktop:
        mcp_config = {
            "time": {"command": "uvx", "args": ["mcp-server-time"]}
        }
        """
        self.mcp_config = mcp_config
        self.available_tools = []

    async def setup(self):
        """Discover tools from MCP servers"""
        for server_name, config in self.mcp_config.items():
            try:
                # Start MCP server and get tools
                command = [config["command"]] + config["args"]
                session, cleanup_ctx = await self.create_session(command)
                tools = await session.list_tools()
                self.available_tools.extend([
                    {"name": t.name, "description": t.description}
                    for t in tools.tools
                ])
                print(f"✅ {server_name}: {len(tools.tools)} tools available")
                await self.cleanup_session(cleanup_ctx)
            except Exception as e:
                print(f"❌ {server_name}: {e}")

    async def chat(self, message: str):
        """Chat with LLM using MCP tools"""
        response = completion(
            model="openrouter/openai/gpt-4o-mini",
            api_key=getenv("OPENROUTER_API_KEY"),
            messages=[{"role": "user", "content": message}],
            tools=self.format_tools_for_llm(),
            tool_choice="auto"
        )
        # Execute any tool calls through MCP
        if response.choices[0].message.tool_calls:
            await self.execute_tools(response.choices[0].message.tool_calls)
        return response.choices[0].message.content

    # create_session, cleanup_session, format_tools_for_llm, and execute_tools
    # are implemented in the full script (5_mcp_example.py)

# Simple usage
async def main():
    agent = MCPAgent({
        "time": {"command": "uvx", "args": ["mcp-server-time"]}
    })
    await agent.setup()
    result = await agent.chat("What time is it in Tokyo?")
    print(result)

asyncio.run(main())
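If you're wondering what create_session hides: the MCP Python SDK's canonical pattern is two nested async context managers — stdio_client gives you the read/write streams, and ClientSession speaks the protocol over them. A self-contained sketch of one tool call (the agent's helper methods wrap roughly this):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def call_time_tool():
    params = StdioServerParameters(command="uvx", args=["mcp-server-time"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()           # MCP handshake
            tools = await session.list_tools()   # discover available tools
            print([t.name for t in tools.tools])
            result = await session.call_tool(
                "get_current_time", {"timezone": "Asia/Tokyo"}
            )
            print(result.content)

asyncio.run(call_time_tool())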
Run this example yourself 🔧
Script:
5_mcp_example.py
Command:
uv run 5_mcp_example.py
Expected Output:
🚀 MCP (Model Context Protocol) with Python SDK Demo
============================================================
=== LLM Agent with MCP (Configuration-based) ===
=== MCP Configuration (like Claude Desktop) ===
📋 MCP Server Configuration:
time: uvx mcp-server-time
🔧 Discovering tools from configured MCP servers...
🚀 Starting MCP server: uvx mcp-server-time
✅ MCP server started and initialized successfully
✅ time: 2 tools available
🔌 MCP connection closed
📦 Total tools available: 2
👤 User: What time is it in UTC?
----------------------------------------
🔧 Calling MCP tool: get_current_time
Arguments: {'timezone': 'UTC'}
🚀 Starting MCP server: uvx mcp-server-time
✅ MCP server started and initialized successfully
✅ Result: {
"timezone": "UTC",
"datetime": "2025-08-21T04:52:31+00:00",
"is_dst": false
}
🔌 MCP connection closed
🤖 Agent: Based on the tools:
📅 {
"timezone": "UTC",
"datetime": "2025-08-21T04:52:31+00:00",
"is_dst": false
}
============================================================
👤 User: What time is it in Paris?
----------------------------------------
🔧 Calling MCP tool: get_current_time
Arguments: {'timezone': 'Europe/Paris'}
🚀 Starting MCP server: uvx mcp-server-time
✅ MCP server started and initialized successfully
✅ Result: {
"timezone": "Europe/Paris",
"datetime": "2025-08-21T06:52:33+02:00",
"is_dst": true
}
🔌 MCP connection closed
🤖 Agent: Based on the tools:
📅 {
"timezone": "Europe/Paris",
"datetime": "2025-08-21T06:52:33+02:00",
"is_dst": true
}
============================================================
👤 User: Convert 3pm in New York to London time
----------------------------------------
🔧 Calling MCP tool: convert_time
Arguments: {'source_timezone': 'America/New_York', 'time': '15:00', 'target_timezone': 'Europe/London'}
🚀 Starting MCP server: uvx mcp-server-time
✅ MCP server started and initialized successfully
✅ Result: {
"source": {
"timezone": "America/New_York",
"datetime": "2025-08-21T15:00:00-04:00",
"is_dst": true
},
"target": {
"timezone": "Europe/London",
"datetime": "2025-08-21T20:00:00+01:00",
"is_dst": true
},
"time_difference": "+5.0h"
}
🔌 MCP connection closed
🤖 Agent: Based on the tools:
📅 {
"source": {
"timezone": "America/New_York",
"datetime": "2025-08-21T15:00:00-04:00",
"is_dst": true
},
"target": {
"timezone": "Europe/London",
"datetime": "2025-08-21T20:00:00+01:00",
"is_dst": true
},
"time_difference": "+5.0h"
}
============================================================
👤 User: What time is it in Tokyo?
----------------------------------------
🔧 Calling MCP tool: get_current_time
Arguments: {'timezone': 'Asia/Tokyo'}
🚀 Starting MCP server: uvx mcp-server-time
✅ MCP server started and initialized successfully
✅ Result: {
"timezone": "Asia/Tokyo",
"datetime": "2025-08-21T13:52:36+09:00",
"is_dst": false
}
🔌 MCP connection closed
🤖 Agent: Based on the tools:
📅 {
"timezone": "Asia/Tokyo",
"datetime": "2025-08-21T13:52:36+09:00",
"is_dst": false
}
============================================================
✅ Agent cleanup complete - using direct connections
Want to see MCP in action for data analysis with dbt? Check out the official server:
- dbt-mcp - Query dbt models and metrics
All available via uvx <server-name>
MCP standardizes how LLMs interact with external tools. Instead of building custom integrations for each LLM provider, you build one MCP server and it works everywhere. Think of it as the USB-C of AI tools.
The Risks Nobody Wants to Mention
We've covered the technical implementation — from function calling to MCP servers. But here's where theory meets reality, and where most teams learn expensive lessons.
Give an LLM database write access? Email sending capabilities? Code execution? Each tool is a potential footgun. Here's how to not shoot yourself:
Production Checklist
What actually works in production? Here are the non-negotiable rules:
- Least Privilege: Read-only by default, always
- Validate Everything: Never trust LLM-generated parameters
- Audit Everything: If it's not logged, it didn't happen
- Circuit Breakers: Automatic shutoff for suspicious patterns (see the sketch after this list)
- Human in the Loop: Critical operations need approval
- Sandbox Execution: Isolate tool execution environment
- Cost Controls: Set spending limits per tool
- Rollback Ready: Every action must be reversible
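Most items on this list map to patterns shown earlier; the circuit breaker deserves its own sketch, since it's the one that saves you at 3 a.m. A minimal version (thresholds are illustrative):

import time

class CircuitBreaker:
    """Trip after repeated failures, then refuse calls for a cooldown period."""

    def __init__(self, max_failures: int = 5, cooldown_seconds: int = 300):
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at and time.time() - self.opened_at < self.cooldown_seconds:
            raise RuntimeError("Circuit open: tool disabled, pending investigation")
        try:
            result = func(*args, **kwargs)
            self.failures = 0       # success resets the counter
            self.opened_at = None
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # trip the breaker
            raise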
Every tool configuration needs these safety requirements:
- Rate limiting (reasonable limits, not 1000 calls per minute)
- Timeout controls
- Audit logging enabled
- Sandbox mode for execution
- Clear rollback strategy
And remember: tools that can modify data (write to databases, send emails, delete files) MUST have audit logging enabled. No exceptions.
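One way to make these requirements non-optional is to bake them into a config object that every tool must be registered with; anything missing a field simply never gets exposed to the model. A sketch (field names are illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSafetyConfig:
    name: str
    rate_limit_per_minute: int   # reasonable limits, not 1000 calls per minute
    timeout_seconds: int         # hard stop for runaway executions
    audit_logging: bool          # mandatory for anything that modifies data
    sandboxed: bool              # isolate execution from production systems
    rollback_plan: str           # how to undo the action if it goes wrong

    def __post_init__(self):
        if not self.audit_logging:
            raise ValueError(f"{self.name}: audit logging is not optional")

send_email = ToolSafetyConfig(
    name="send_email",
    rate_limit_per_minute=5,
    timeout_seconds=30,
    audit_logging=True,
    sandboxed=True,
    rollback_plan="Recall via provider API within 60s, otherwise send a correction email",
)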
Some all-too-plausible horror stories:
- The $72K OpenAI Bill: No rate limiting on a code generation tool. The LLM went into a loop.
- The Dropped Production Table: DROP TABLE wasn't in the blocklist. Guess what happened.
- The Email Storm: The LLM sent 10,000 emails before anyone noticed. No rate limiting.
- The Infinite Loop: A tool called itself recursively. No iteration limits.
Key Takeaways
- Start with read-only tools — You can always add write capabilities later, but you can't undo a dropped table
- Defense in depth is not optional — Input validation + rate limiting + audit logs + circuit breakers + human approval for critical ops
- MCP is becoming the standard — standardized tool interfaces mean write once, use everywhere (once the ecosystem matures)
What's Next?
Your agents can think (reasoning), access knowledge (RAG), and take actions (tools). But what about remembering what happened 5 minutes ago? Or last week? Time to dive into memory systems...
Technical deep dive series — Part 4 of 5
← Part 3: RAG Systems | Part 5: Memory Systems →
Related Articles in This Series
📚 Context Engineering Deep Dive Series:
- User Intent & Prompting: Making LLMs understand what you really want
- Agents & Reasoning: When LLMs Learn to Think Before They Speak
- RAG Systems: When Your LLM Needs to Phone a Friend
- Action Tools (You are here)
- Memory Systems: Teaching LLMs to Remember (Without Going Broke)
🎯 Start with the overview: Context Engineering: How RAG, agents, and memory make LLMs actually useful
Comments
Join the discussion! Your thoughts and feedback are valuable.
💡 Comments are powered by GitHub Discussions. You'll need a GitHub account to comment.