Everything runs fine in development, then production falls apart? That's a nightmare every developer runs into. Agent systems are even more complex: an LLM, a vector database, a cache, a message queue, monitoring... if one link breaks, the whole system goes down. In this lesson, we learn how to deploy an Agent to production safely and stably.
Deployment Architecture Design
Overall Architecture
┌─────────────┐
│    Users    │
└──────┬──────┘
       │
┌──────▼──────┐
│Load Balancer│ (Nginx/ALB)
└──────┬──────┘
       │
┌──────────────┼──────────────┐
│              │              │
┌──────▼──────┐ ┌───▼────┐ ┌─────▼─────┐
│   Gateway   │ │ Agent 1 │ │  Agent 2  │
│  (FastAPI)  │ │         │ │           │
└──────┬──────┘ └───┬────┘ └─────┬─────┘
       │            │            │
       └────────────┼────────────┘
                    │
        ┌──────────▼──────────┐
        │  Service Discovery  │ (Consul/Etcd)
        └──────────┬──────────┘
                   │
  ┌────────────────┼────────────────┐
  │                │                │
┌─────▼─────┐ ┌──────▼──────┐ ┌─────▼─────┐
│  LLM API  │ │  Vector DB  │ │   Redis   │
│ (OpenAI/  │ │ (ChromaDB)  │ │  (Cache)  │
│ Anthropic)│ │             │ │           │
└───────────┘ └─────────────┘ └───────────┘
Key Components
| Component | Role | Recommended Options |
|---|---|---|
| Load balancer | Request distribution | Nginx / ALB |
| API Gateway | Unified entry point | FastAPI / Kong |
| Agent service | Core logic | Docker / K8s |
| Service discovery | Dynamic registration | Consul / Etcd |
| Cache layer | Performance | Redis |
| Message queue | Async processing | RabbitMQ / Kafka |
| Monitoring | Observability | Prometheus + Grafana |
| Logging | Log aggregation | ELK / Loki |
Docker Containerization
1. Writing the Dockerfile
# Multi-stage build
FROM python:3.11-slim AS builder
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt
# Production image
FROM python:3.11-slim
WORKDIR /app
# Run as a non-root user
RUN useradd -m agentuser
# Copy only what's needed (into the non-root user's home so it stays readable)
COPY --from=builder --chown=agentuser /root/.local /home/agentuser/.local
COPY --chown=agentuser app ./app
COPY --chown=agentuser config ./config
ENV PATH=/home/agentuser/.local/bin:$PATH
USER agentuser
# Health check (python:3.11-slim ships no curl, so use Python's urllib)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
# Expose the port
EXPOSE 8000
# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
2. docker-compose Orchestration
version: '3.8'
services:
  # Agent service
  agent:
    build: .
    ports:
      - "8000"  # ephemeral host port, so multiple replicas don't fight over one port
    environment:
      - ENVIRONMENT=production
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://agent:${DB_PASSWORD}@db:5432/agent
    depends_on:
      - redis
      - db
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
      replicas: 2

  # Redis cache
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=agent
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=agent
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped

  # Monitoring
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

volumes:
  redis_data:
  db_data:
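The compose file injects configuration through environment variables; on the Python side, the service reads them at startup. A minimal sketch (the function name `load_settings` and the defaults are illustrative, not from the original code):

```python
import os

def load_settings(env=None):
    """Read runtime configuration from environment variables,
    mirroring the variables the compose file injects.
    Defaults here are development-only fallbacks."""
    env = env if env is not None else os.environ
    return {
        "environment": env.get("ENVIRONMENT", "development"),
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379"),
        "database_url": env.get("DATABASE_URL", ""),
    }
```

Keeping all configuration out of the image is what makes the same image deployable to any environment.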
3. Best Practices
Image optimization:
# ❌ Wrong: copy everything
COPY . .
# ✅ Right: copy only what's needed
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app ./app
Layered builds:
# Dependencies change rarely, so install them first
COPY requirements.txt .
RUN pip install -r requirements.txt  # this layer gets cached
# Code changes often, so copy it last
COPY app ./app  # a code change only rebuilds this layer
Security hardening:
# Non-root user
RUN useradd -m agentuser
USER agentuser
# Read-only root filesystem: set at runtime with `docker run --read-only`,
# or `securityContext.readOnlyRootFilesystem: true` in Kubernetes
# Remove build tools that are no longer needed
RUN apt-get purge -y --auto-remove \
    gcc g++ make cmake
Kubernetes Deployment
1. Deployment Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  labels:
    app: agent
spec:
  replicas: 3  # 3 replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      containers:
        - name: agent
          image: your-registry/agent:v1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: REDIS_URL
              value: "redis://redis-service:6379"
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
2. Service Configuration
apiVersion: v1
kind: Service
metadata:
  name: agent-service
spec:
  selector:
    app: agent
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8000
  type: LoadBalancer
3. ConfigMap and Secret
# ConfigMap - non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
data:
  config.yaml: |
    log_level: INFO
    max_tokens: 2000
    timeout: 30
---
# Secret - sensitive values
apiVersion: v1
kind: Secret
metadata:
  name: agent-secrets
type: Opaque
data:
  openai-api-key: <base64-encoded-key>
  database-password: <base64-encoded-password>
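The `data` values in a Secret are base64-encoded (an encoding, not encryption, which is why Secrets still need access control). A quick Python helper to produce them; the function names are illustrative, and `kubectl create secret` does the same thing:

```python
import base64

def encode_secret(value: str) -> str:
    # Kubernetes Secret `data` fields hold base64-encoded bytes
    return base64.b64encode(value.encode()).decode()

def decode_secret(encoded: str) -> str:
    # Reverse the encoding, e.g. when inspecting an existing Secret
    return base64.b64decode(encoded).decode()
```

Usage: paste the output of `encode_secret("sk-...")` into the `openai-api-key` field above.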
Monitoring and Observability
1. Prometheus Metrics Collection
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metric definitions
REQUEST_COUNT = Counter(
    'agent_requests_total',
    'Total requests',
    ['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
    'agent_request_duration_seconds',
    'Request duration',
    ['endpoint']
)
ACTIVE_CONNECTIONS = Gauge(
    'agent_active_connections',
    'Active connections'
)
CACHE_HIT_RATE = Gauge(
    'agent_cache_hit_rate',
    'Cache hit rate'
)

# Middleware (async, since it awaits the downstream handler)
async def prometheus_middleware(request, call_next):
    start_time = time.time()
    response = await call_next(request)
    # Record metrics
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        endpoint=request.url.path
    ).observe(time.time() - start_time)
    return response

# Cache monitoring
def cache_get(key):
    result = cache.get(key)
    CACHE_HIT_RATE.set(cache.hit_rate())  # update on hits and misses alike
    return result

# Start the metrics endpoint
start_http_server(9090)
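Conceptually, the counter and histogram above boil down to a count per label set plus a record of observed durations. A dependency-free sketch of that bookkeeping (class and method names are made up for illustration; `prometheus_client` handles this for real, including the bucketing that backs `histogram_quantile`):

```python
from collections import defaultdict

class MiniMetrics:
    """Toy stand-in for the Prometheus counter + histogram pair."""
    def __init__(self):
        self.request_count = defaultdict(int)   # (method, endpoint, status) -> count
        self.durations = defaultdict(list)      # endpoint -> [seconds, ...]

    def record(self, method, endpoint, status, duration):
        self.request_count[(method, endpoint, status)] += 1
        self.durations[endpoint].append(duration)

    def p95(self, endpoint):
        """Nearest-rank P95, matching the Grafana P95 query in spirit."""
        samples = sorted(self.durations[endpoint])
        if not samples:
            return None
        idx = max(0, int(len(samples) * 0.95) - 1)
        return samples[idx]
```

Prometheus stores durations as cumulative buckets rather than raw samples, which is why the dashboard query needs `histogram_quantile` instead of an exact percentile like this one.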
2. Grafana Dashboards
{
  "dashboard": {
    "title": "Agent Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(agent_requests_total[5m])",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "title": "Request Duration",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.50, rate(agent_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P50"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [
          {
            "expr": "agent_cache_hit_rate",
            "legendFormat": "Hit Rate"
          }
        ]
      }
    ]
  }
}
3. Log Aggregation (Loki)
import structlog

# Configure structlog for JSON output
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

# Emit a structured log line
logger = structlog.get_logger()
logger.info(
    "agent_request",
    method="POST",
    endpoint="/agent/execute",
    duration=1.23,
    status="success"
)
Performance Optimization
1. Caching Strategy
from functools import lru_cache

from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
from fastapi_cache.decorator import cache

# Enable caching (the package is fastapi-cache2, imported as fastapi_cache)
FastAPICache.init(
    RedisBackend(redis_client),
    prefix="agent_cache"
)

# Decorator-based caching
@app.get("/agent/{agent_id}")
@cache(expire=3600)  # cache for 1 hour
async def get_agent(agent_id: str):
    return await agent_db.get(agent_id)

# LRU cache (in-process)
@lru_cache(maxsize=1000)
def get_embedding(text: str):
    """Cache embeddings"""
    return embed(text)

# Tiered caching
def cached_get(key):
    """Tiered cache: memory -> Redis -> database"""
    # L1: memory
    if key in memory_cache:
        return memory_cache[key]
    # L2: Redis
    value = redis.get(key)
    if value:
        memory_cache[key] = value
        return value
    # L3: database
    value = db.get(key)
    redis.set(key, value, ex=3600)
    memory_cache[key] = value
    return value
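One caveat in the tiered lookup above: a plain dict as the L1 memory cache grows without bound and never expires entries. A minimal TTL-bounded replacement, as a sketch; the injectable `clock` parameter only exists to make it testable, and libraries like `cachetools` offer production-grade versions:

```python
import time

class TTLCache:
    """Small in-process L1 cache with per-entry expiry,
    so the memory layer doesn't grow without bound."""
    def __init__(self, max_size=1000, ttl=60.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # lazy eviction of expired entries
            return None
        return value

    def set(self, key, value):
        if len(self._store) >= self.max_size and key not in self._store:
            # evict the entry closest to expiry
            oldest = min(self._store, key=lambda k: self._store[k][0])
            del self._store[oldest]
        self._store[key] = (self.clock() + self.ttl, value)
```

Swapping `memory_cache[key]` accesses for `ttl_cache.get(key)` / `ttl_cache.set(key, value)` keeps the L1 tier bounded and roughly in sync with the Redis TTL.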
2. Connection Pooling
import httpx
import redis
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# HTTP connection pool
http_client = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20
    ),
    timeout=30.0
)

# Redis connection pool
redis_pool = redis.ConnectionPool(
    host='localhost',
    port=6379,
    max_connections=50
)
redis_client = redis.Redis(connection_pool=redis_pool)

# Database connection pool
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_timeout=30,
    pool_recycle=3600
)
3. Batch Processing
import asyncio

async def batch_process(items, batch_size=10):
    """Process items in parallel batches"""
    results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        # Run the whole batch concurrently
        tasks = [process_item(item) for item in batch]
        batch_results = await asyncio.gather(*tasks)
        results.extend(batch_results)
    return results
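Fixed-size batches have a drawback: one slow item stalls its entire batch. An alternative sketch using `asyncio.Semaphore` keeps up to `max_concurrency` items in flight at all times (`process_item` is passed in as a parameter here rather than assumed to be a global):

```python
import asyncio

async def batch_process_bounded(items, process_item, max_concurrency=10):
    """Process all items concurrently, but never more than
    max_concurrency at once. Unlike fixed batches, a slow item
    doesn't hold up unrelated items."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:          # acquired slot limits concurrency
            return await process_item(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(i) for i in items))
```

This pattern is especially useful for LLM calls, where provider rate limits make an upper bound on in-flight requests essential.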
Failure Recovery
1. Automatic Retries
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def call_llm(prompt):
    """Call the LLM (with retries)"""
    try:
        response = await openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"LLM call failed: {e}")
        raise
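To make the retry behavior concrete, here is a stdlib-only sketch of roughly what the tenacity decorator above does. It is simplified (retries on any exception, caps the exponential wait) and the names are illustrative:

```python
import asyncio

def retry_async(attempts=3, base_delay=1.0, max_delay=10.0):
    """Sketch of stop_after_attempt + wait_exponential:
    retry with exponentially growing waits, then re-raise."""
    def decorator(fn):
        async def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(attempts):
                try:
                    return await fn(*args, **kwargs)
                except Exception as exc:
                    last_exc = exc
                    if attempt < attempts - 1:
                        # 1x, 2x, 4x... the base delay, capped at max_delay
                        delay = min(base_delay * (2 ** attempt), max_delay)
                        await asyncio.sleep(delay)
            raise last_exc
        return wrapper
    return decorator
```

In production, prefer tenacity itself: it adds jitter options, per-exception filtering, and before/after hooks that this sketch omits.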
2. Circuit Breaker
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_service(url):
    """Call an external service (behind a circuit breaker)"""
    response = await http_client.get(url)
    return response.json()

# Usage
try:
    result = await call_external_service(url)
except CircuitBreakerError:
    logger.warning("Service unavailable, using fallback strategy")
    result = fallback_logic()
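The semantics behind `@circuit` can be sketched in a few lines: count consecutive failures, open after `failure_threshold`, fail fast while open, and allow a trial call after `recovery_timeout`. This is a simplified, synchronous illustration, not the `circuitbreaker` library's actual implementation; the injectable `clock` exists only for testability:

```python
import time

class SimpleCircuitBreaker:
    """Minimal open/closed/half-open circuit breaker."""
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open")   # fail fast
            self.opened_at = None                    # half-open: allow one try
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()        # trip the breaker
            raise
        self.failures = 0                            # success resets the count
        return result
```

The point of failing fast is to stop hammering a struggling dependency and give it room to recover, while your service immediately falls back instead of stacking up timeouts.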
3. Fallback Strategies
class FallbackStrategy:
    def __init__(self):
        self.primary = PrimaryService()
        self.secondary = SecondaryService()

    async def execute(self, task):
        """Prefer the primary service; fall back on failure"""
        try:
            return await self.primary.execute(task)
        except Exception as e:
            logger.warning(f"Primary service failed: {e}, switching to secondary")
            return await self.secondary.execute(task)

    async def cached_execute(self, task):
        """Check the cache first; execute only on a miss"""
        cached = await cache.get(task.id)
        if cached:
            return cached
        result = await self.execute(task)
        await cache.set(task.id, result, expire=300)
        return result
CI/CD Pipeline
1. GitHub Actions Configuration
name: CI/CD Pipeline
on:
  push:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov
      - name: Run tests
        run: |
          pytest --cov=app --cov-report=xml
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
  build-and-push:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build Docker image
        run: |
          docker build -t agent:${{ github.sha }} .
      - name: Push to registry
        run: |
          docker tag agent:${{ github.sha }} your-registry/agent:latest
          docker push your-registry/agent:latest
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to K8s
        run: |
          kubectl set image deployment/agent-service \
            agent=your-registry/agent:latest
          kubectl rollout status deployment/agent-service
2. Blue-Green Deployment
# Blue environment (current production)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
      version: blue
  template:
    metadata:
      labels:
        app: agent
        version: blue
    spec:
      containers:
        - name: agent
          image: agent:v1.0.0
# Green environment (staging the new version)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
      version: green
  template:
    metadata:
      labels:
        app: agent
        version: green
    spec:
      containers:
        - name: agent
          image: agent:v1.1.0
# Service that switches traffic between the two
---
apiVersion: v1
kind: Service
metadata:
  name: agent-service
spec:
  selector:
    app: agent
    version: green  # switch traffic to the green environment
  ports:
    - port: 80
      targetPort: 8000
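Switching traffic is just a patch to the Service selector. A small helper that builds that patch (illustrative; you could apply the resulting JSON with `kubectl patch` or the official Kubernetes Python client):

```python
import json

def service_selector_patch(version: str) -> dict:
    """Build the strategic-merge patch that flips the agent-service
    selector between the blue and green Deployments."""
    if version not in ("blue", "green"):
        raise ValueError("version must be 'blue' or 'green'")
    return {"spec": {"selector": {"app": "agent", "version": version}}}

# e.g.: kubectl patch service agent-service -p '<patch_json>'
patch_json = json.dumps(service_selector_patch("green"))
```

Because only the selector changes, rollback is the same one-line patch back to `blue`, with no pods restarted on either side.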
Best-Practice Checklist
Pre-deployment checks
- All configuration moved out of the code (use environment variables)
- Secrets stored encrypted
- Health-check endpoints configured
- Log level set to INFO or ERROR
- Database connection pool configured
- Caching enabled
- Monitoring metrics exposed
- Rate limiting configured
Post-deployment verification
- Service health checks pass
- No errors in the logs
- Monitoring metrics look normal
- Performance benchmarks pass
- Canary traffic verified
- Rollback plan in place
Operations Runbook
Troubleshooting common issues:
# 1. Check service status
kubectl get pods -l app=agent
# 2. View logs
kubectl logs -f deployment/agent-service
# 3. Open a shell inside a container
kubectl exec -it <pod-name> -- /bin/bash
# 4. Check resource usage
kubectl top pods -l app=agent
# 5. Scale out
kubectl scale deployment agent-service --replicas=5
# 6. Roll back
kubectl rollout undo deployment/agent-service
Summary
Going from development to production is a complete journey.
Key takeaways:
- Docker containerization for standardized deployment
- Kubernetes orchestration for elastic scaling
- Monitoring and logging for observability
- Performance optimization through caching and connection pooling
- Failure recovery through retries and fallbacks
- CI/CD automation for continuous delivery
Deployment checklist:
- ✅ Containerization (Dockerfile best practices)
- ✅ Orchestration (Kubernetes/Compose)
- ✅ Monitoring (Prometheus + Grafana)
- ✅ Logging (Loki/ELK)
- ✅ Caching (Redis)
- ✅ Health checks
- ✅ Rolling updates
- ✅ Rollback plan
Next steps:
- Containerize your Agent project with Docker
- Set up a CI/CD pipeline
- Configure monitoring and alerting
Remember: deployment is not the finish line, it's the starting line. Only continuous monitoring and continuous optimization keep an Agent system running stably in production.
Series finale:
Thank you for following this series. From Agent fundamentals to production deployment, we have covered the full journey. You now have what it takes to build, deploy, and maintain a production-grade Agent system.
Series recap:
- 01: From Conversing to Doing
- 02: The MCP Protocol in Depth
- 03: Prompt Engineering
- 04: Context Management
- 05: Error Handling
- 06: Multimodal in Practice
- 07: Long-Text Processing
- 08: Multi-Agent Collaboration
- 09: Security and Permissions
- 10: Production Deployment
Good luck on your AI programming journey, and may you build Agent systems that change the world!
