[Stanford AI Programming Course, Lesson 10] Production Deployment: The Complete Journey from Development to Launch

Everything works fine in development, then falls apart in production? That's a nightmare every developer runs into. Agent systems are even more complex: an LLM, a vector database, a cache, a message queue, monitoring... if any one link breaks, the whole system goes down. In this lesson, we learn how to deploy an Agent to production safely and reliably.

Deployment Architecture

Overall Architecture

                    ┌─────────────┐
                    │    User     │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │Load Balancer│ (Nginx/ALB)
                    └──────┬──────┘
                           │
            ┌──────────────┼──────────────┐
            │              │              │
     ┌──────▼──────┐ ┌─────▼─────┐  ┌─────▼─────┐
     │   Gateway   │ │  Agent 1  │  │  Agent 2  │
     │  (FastAPI)  │ │           │  │           │
     └──────┬──────┘ └─────┬─────┘  └─────┬─────┘
            │              │              │
            └──────────────┼──────────────┘
                           │
                ┌──────────▼──────────┐
                │  Service Discovery  │ (Consul/etcd)
                └──────────┬──────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
  ┌─────▼──────┐    ┌──────▼─────┐      ┌─────▼──────┐
  │  LLM API   │    │ Vector DB  │      │   Redis    │
  │ (OpenAI/   │    │ (ChromaDB) │      │  (Cache)   │
  │ Anthropic) │    │            │      │            │
  └────────────┘    └────────────┘      └────────────┘

Key Components

Component           Role                    Recommended options
Load balancer       Request distribution    Nginx / ALB
API gateway         Unified entry point     FastAPI / Kong
Agent service       Core logic              Docker / K8s
Service discovery   Dynamic registration    Consul / etcd
Cache layer         Performance boost       Redis
Message queue       Async processing        RabbitMQ / Kafka
Monitoring          Observability           Prometheus + Grafana
Logging             Log aggregation         ELK / Loki
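The load balancer, the Docker HEALTHCHECK below, and the Kubernetes probes all depend on the service exposing health endpoints. To make the /health (liveness) vs /ready (readiness) contract concrete, here is a stdlib-only sketch; in the real service these would be FastAPI routes, and the `HealthHandler` name, the `ready` flag, and the port are our own illustrative choices:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    # Flipped to True by the app once dependencies (Redis, DB, ...) are reachable.
    ready = False

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to serve requests.
            self._reply(200, {"status": "ok"})
        elif self.path == "/ready":
            # Readiness: only accept traffic once dependencies are warm.
            if HealthHandler.ready:
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "starting"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

# Usage: HTTPServer(("0.0.0.0", 8000), HealthHandler).serve_forever()
```

The distinction matters operationally: a failing liveness check restarts the container, while a failing readiness check only removes it from the load balancer.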

Docker Containerization

1. Writing the Dockerfile

# Multi-stage build
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies into the builder's user site-packages
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Production image
FROM python:3.11-slim

WORKDIR /app

# Create the non-root user first so copied files can be owned by it
RUN useradd -m agentuser

# Copy only what is needed; installed packages go to the runtime user's home
COPY --from=builder --chown=agentuser /root/.local /home/agentuser/.local
COPY --chown=agentuser app ./app
COPY --chown=agentuser config ./config

# Run as the non-root user, with user-installed scripts on PATH
USER agentuser
ENV PATH=/home/agentuser/.local/bin:$PATH

# Health check (note: the slim base image ships without curl; install it, or probe with Python)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD curl -f http://localhost:8000/health || exit 1

# Expose the service port
EXPOSE 8000

# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

2. docker-compose Orchestration

version: '3.8'

services:
  # Agent service
  agent:
    build: .
    ports:
      - "8000:8000"
    environment:
      - ENVIRONMENT=production
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://agent:${DB_PASSWORD}@db:5432/agent
    depends_on:
      - redis
      - db
    restart: unless-stopped
    deploy:  # note: deploy.* keys only take effect under Docker Swarm (docker stack deploy)
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
      replicas: 2  # with replicas > 1, replace the fixed "8000:8000" mapping with a port range

  # Redis cache
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped

  # PostgreSQL database
  db:
    image: postgres:15-alpine
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=agent
      - POSTGRES_PASSWORD=${DB_PASSWORD}
      - POSTGRES_DB=agent
    volumes:
      - db_data:/var/lib/postgresql/data
    restart: unless-stopped

  # Monitoring
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    restart: unless-stopped

volumes:
  redis_data:
  db_data:

3. Best Practices

Image optimization

# ❌ Wrong: copy everything
COPY . .

# ✅ Right: copy only what you need
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY app ./app
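Even with selective COPY instructions, `docker build` still ships the whole directory to the daemon as the build context, so pair them with a .dockerignore. A typical starting point (the entries are illustrative; adjust to your repository):

```
# .dockerignore
.git
__pycache__/
*.pyc
.env
tests/
docs/
*.md
```

Excluding .env here also prevents accidentally baking local secrets into an image layer.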

Layer caching

# Dependencies change rarely, so install them first
COPY requirements.txt .
RUN pip install -r requirements.txt  # this layer gets cached

# Code changes often, so copy it last
COPY app ./app  # editing code only rebuilds this layer

Security hardening

# Run as a non-root user
RUN useradd -m agentuser
USER agentuser

# Read-only root filesystem (set at runtime, not in the Dockerfile):
#   docker run --read-only ...

# Remove build-only packages
RUN apt-get purge -y --auto-remove \
    gcc g++ make cmake

Kubernetes Deployment

1. Deployment Manifest

apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  labels:
    app: agent
spec:
  replicas: 3  # three replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: agent
  template:
    metadata:
      labels:
        app: agent
    spec:
      containers:
      - name: agent
        image: your-registry/agent:v1.0.0
        ports:
        - containerPort: 8000
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: agent-secrets
              key: openai-api-key
        - name: REDIS_URL
          value: "redis://redis-service:6379"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5

2. Service Manifest

apiVersion: v1
kind: Service
metadata:
  name: agent-service
spec:
  selector:
    app: agent
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
  type: LoadBalancer

3. ConfigMap and Secret

# ConfigMap: non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: agent-config
data:
  config.yaml: |
    log_level: INFO
    max_tokens: 2000
    timeout: 30

---
# Secret: sensitive values (base64-encoded, not encrypted; restrict access via RBAC)
apiVersion: v1
kind: Secret
metadata:
  name: agent-secrets
type: Opaque
data:
  openai-api-key: <base64-encoded-key>
  database-password: <base64-encoded-password>

Monitoring and Observability

1. Collecting Metrics with Prometheus

import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Metric definitions
REQUEST_COUNT = Counter(
    'agent_requests_total',
    'Total requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'agent_request_duration_seconds',
    'Request duration',
    ['endpoint']
)

ACTIVE_CONNECTIONS = Gauge(
    'agent_active_connections',
    'Active connections'
)

CACHE_HIT_RATE = Gauge(
    'agent_cache_hit_rate',
    'Cache hit rate'
)

# Middleware (must be async, since it awaits the downstream handler)
async def prometheus_middleware(request, call_next):
    start_time = time.time()

    response = await call_next(request)

    # Record metrics
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    REQUEST_DURATION.labels(
        endpoint=request.url.path
    ).observe(time.time() - start_time)

    return response

# Cache monitoring
def cache_get(key):
    result = cache.get(key)
    # Update the gauge on every lookup, hit or miss
    CACHE_HIT_RATE.set(cache.hit_rate())
    return result

# Expose metrics (pick a port that does not clash with the Prometheus server's own 9090)
start_http_server(9091)

2. Grafana Dashboard

{
  "dashboard": {
    "title": "Agent Service Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(agent_requests_total[5m])",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "title": "Request Duration",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.50, rate(agent_request_duration_seconds_bucket[5m]))",
            "legendFormat": "P50"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [
          {
            "expr": "agent_cache_hit_rate",
            "legendFormat": "Hit Rate"
          }
        ]
      }
    ]
  }
}

3. Log Aggregation (Loki)

import structlog

# Configure structlog
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer()
    ],
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

# Emit a structured log entry
logger = structlog.get_logger()

logger.info(
    "agent_request",
    method="POST",
    endpoint="/agent/execute",
    duration=1.23,
    status="success"
)

Performance Optimization

1. Caching Strategy

from functools import lru_cache

# The PyPI package is "fastapi-cache2", but the import name is fastapi_cache
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
from fastapi_cache.decorator import cache

# Initialize the cache backend
FastAPICache.init(
    RedisBackend(redis_client),
    prefix="agent_cache"
)

# Decorator-based caching
@app.get("/agent/{agent_id}")
@cache(expire=3600)  # cache for one hour
async def get_agent(agent_id: str):
    return await agent_db.get(agent_id)

# LRU cache (in-process)
@lru_cache(maxsize=1000)
def get_embedding(text: str):
    """Cache embeddings for repeated texts"""
    return embed(text)

# Tiered caching
def cached_get(key):
    """Tiered lookup: memory -> Redis -> database"""
    # L1: in-process memory
    if key in memory_cache:
        return memory_cache[key]

    # L2: Redis
    value = redis.get(key)
    if value:
        memory_cache[key] = value
        return value

    # L3: database
    value = db.get(key)
    redis.set(key, value, ex=3600)
    memory_cache[key] = value

    return value
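The memory_cache above is just a plain dict in this sketch, which would grow without bound and never expire entries. A minimal TTL-bounded in-process cache could look like the following; the class name, the lazy expiry, and the naive eviction policy are our own illustrative choices, not a library API:

```python
import time

class TTLCache:
    """Minimal in-process cache with per-entry expiry (illustrative sketch)."""

    def __init__(self, ttl_seconds=60, max_entries=1000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # lazily evict stale entries on read
            return None
        return value

    def set(self, key, value):
        if len(self._store) >= self.max_entries:
            # Naive eviction: drop the oldest-inserted entry
            self._store.pop(next(iter(self._store)))
        self._store[key] = (time.monotonic() + self.ttl, value)
```

In production you might reach for `cachetools.TTLCache` instead; the point here is only that the L1 tier needs both a size bound and an expiry, or it slowly diverges from Redis and the database.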

2. Connection Pools

import httpx
import redis
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

# HTTP connection pool
http_client = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=100,
        max_keepalive_connections=20
    ),
    timeout=30.0
)

# Redis connection pool
redis_pool = redis.ConnectionPool(
    host='localhost',
    port=6379,
    max_connections=50
)
redis_client = redis.Redis(connection_pool=redis_pool)

# Database connection pool
engine = create_engine(
    DATABASE_URL,
    poolclass=QueuePool,
    pool_size=20,
    max_overflow=10,
    pool_timeout=30,
    pool_recycle=3600
)

3. Batch Processing

import asyncio

async def batch_process(items, batch_size=10):
    """Process items in parallel batches"""
    results = []

    for i in range(0, len(items), batch_size):
        batch = items[i:i+batch_size]

        # Run the whole batch concurrently
        tasks = [process_item(item) for item in batch]
        batch_results = await asyncio.gather(*tasks)

        results.extend(batch_results)

    return results
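batch_process waits for each whole batch before starting the next, so one slow item stalls everything behind it. An alternative that keeps a fixed number of tasks in flight uses a semaphore; the function name `bounded_gather` is ours, not a standard API:

```python
import asyncio

async def bounded_gather(coros, limit=5):
    """Run coroutines concurrently with at most `limit` in flight (sketch)."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        # Each task acquires a slot before running, releasing it on completion
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))
```

This shape is a good fit for LLM calls, where the limit doubles as a crude rate limiter against provider quotas.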

Fault Recovery

1. Automatic Retries

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
async def call_llm(prompt):
    """Call the LLM, with retries handled by tenacity"""
    try:
        response = await openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}]
        )
        return response.choices[0].message.content
    except Exception as e:
        logger.error(f"LLM call failed: {e}")
        raise
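If you prefer not to take on the tenacity dependency, the same stop-after-N / exponential-wait policy is a few lines of plain Python. This sketch adds jitter (which tenacity also supports) so that many clients retrying at once do not synchronize:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=1.0, max_delay=10.0):
    """Call fn(), retrying on exception with exponential backoff plus jitter (sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.0))  # jitter avoids thundering herds
```

The async variant is the same loop with `await asyncio.sleep(...)`; tenacity mostly buys you composable stop/wait policies and per-exception filtering on top of this.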

2. Circuit Breaker

from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_external_service(url):
    """Call an external service behind a circuit breaker"""
    response = await http_client.get(url)
    return response.json()

# Usage
try:
    result = await call_external_service(url)
except CircuitBreakerError:
    logger.warning("Service unavailable, using the fallback strategy")
    result = fallback_logic()
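To make the breaker's behavior concrete, here is a minimal synchronous implementation of the same closed, open, half-open cycle. It is a sketch, deliberately simpler than the circuitbreaker library: no per-exception filtering, no thread safety, and the class name is ours:

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; half-opens after a cooldown (sketch)."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open; failing fast")
            # Cooldown elapsed: half-open, let one trial call through
            self.opened_at = None
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The key property is that while the circuit is open, callers fail in microseconds instead of tying up connections on a dead dependency.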

3. Fallback Strategy

class FallbackStrategy:
    def __init__(self):
        self.primary = PrimaryService()
        self.secondary = SecondaryService()

    async def execute(self, task):
        """Use the primary service; fall back on failure"""
        try:
            return await self.primary.execute(task)
        except Exception as e:
            logger.warning(f"Primary service failed: {e}; switching to backup")
            return await self.secondary.execute(task)

    async def cached_execute(self, task):
        """Check the cache first; execute only on a miss"""
        cached = await cache.get(task.id)
        if cached:
            return cached

        result = await self.execute(task)
        await cache.set(task.id, result, expire=300)
        return result

CI/CD Pipeline

1. GitHub Actions Workflow

name: CI/CD Pipeline

on:
  push:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Set up Python
      uses: actions/setup-python@v4
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-cov

    - name: Run tests
      run: |
        pytest --cov=app --cov-report=xml

    - name: Upload coverage
      uses: codecov/codecov-action@v3
      with:
        file: ./coverage.xml

  build-and-push:
    needs: build-and-test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3

    - name: Build Docker image
      run: |
        docker build -t agent:${{ github.sha }} .

    - name: Push to registry
      run: |
        # Assumes an earlier docker/login-action (or docker login) step has authenticated
        docker tag agent:${{ github.sha }} your-registry/agent:${{ github.sha }}
        docker push your-registry/agent:${{ github.sha }}
        docker tag agent:${{ github.sha }} your-registry/agent:latest
        docker push your-registry/agent:latest

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Deploy to K8s
      run: |
        # Deploy the immutable sha tag: re-applying ":latest" would not trigger a rollout
        kubectl set image deployment/agent-service \
          agent=your-registry/agent:${{ github.sha }}
        kubectl rollout status deployment/agent-service

2. Blue-Green Deployment

# Blue environment (current production)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
      version: blue
  template:
    metadata:
      labels:
        app: agent
        version: blue
    spec:
      containers:
      - name: agent
        image: agent:v1.0.0

# Green environment (new version under test)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent
      version: green
  template:
    metadata:
      labels:
        app: agent
        version: green
    spec:
      containers:
      - name: agent
        image: agent:v1.1.0

# Service that switches traffic between environments
---
apiVersion: v1
kind: Service
metadata:
  name: agent-service
spec:
  selector:
    app: agent
    version: green  # flip traffic to the green environment
  ports:
  - port: 80
    targetPort: 8000

Best-Practice Checklists

Pre-deployment checks

  • All configuration moved out of the code (use environment variables)
  • Secrets stored encrypted
  • Health-check endpoints configured
  • Log level set to INFO or ERROR
  • Database connection pool configured
  • Caching enabled
  • Monitoring metrics exposed
  • Rate limiting configured
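The first item on this list is easy to enforce mechanically: fail fast at startup if a required setting is missing, rather than crashing later on the first request. A sketch, where the variable names mirror the compose file above and should be adjusted to your service:

```python
import os

# Settings the service cannot start without (illustrative list)
REQUIRED_ENV = ["OPENAI_API_KEY", "REDIS_URL", "DATABASE_URL"]

def check_required_env(env=os.environ):
    """Raise at startup if any required environment variable is unset or empty."""
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise RuntimeError(
            f"missing required environment variables: {', '.join(missing)}"
        )
```

Call this at import time in the app's entry point; a pod that crash-loops immediately with a clear message is far easier to diagnose than one that half-works.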

Post-deployment verification

  • Service health checks pass
  • No errors in the logs
  • Monitoring metrics look normal
  • Performance benchmarks pass
  • Canary traffic verified
  • Rollback plan ready
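The first couple of these checks can be scripted into a post-deploy smoke test. This sketch injects the HTTP fetch function so it works with any client (urllib, httpx, requests); the endpoint paths follow the /health and /ready conventions used earlier, and the function name is ours:

```python
def smoke_test(fetch, base_url="http://localhost:8000"):
    """Minimal post-deploy smoke check (illustrative sketch).

    `fetch(url)` must return the HTTP status code; inject your client of
    choice so the check stays easy to run and to test.
    """
    failures = []
    for path in ("/health", "/ready"):
        status = fetch(base_url + path)
        if status != 200:
            failures.append(f"{path} returned {status}")
    return failures  # an empty list means the service looks healthy
```

Run it from CI right after `kubectl rollout status` succeeds, and fail the pipeline (triggering the rollback plan) if the list is non-empty.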

Operations Runbook

Troubleshooting common issues

# 1. Check service status
kubectl get pods -l app=agent

# 2. Tail the logs
kubectl logs -f deployment/agent-service

# 3. Open a shell in a container for debugging
kubectl exec -it <pod-name> -- /bin/bash

# 4. Check resource usage
kubectl top pods -l app=agent

# 5. Scale out
kubectl scale deployment agent-service --replicas=5

# 6. Roll back
kubectl rollout undo deployment/agent-service

Summary

Going from development to production is a complete journey:

Key takeaways

  1. Docker containerization for standardized deployment
  2. Kubernetes orchestration for elastic scaling
  3. Monitoring and logging for observability
  4. Performance optimization via caching and connection pools
  5. Fault recovery via retries and fallbacks
  6. CI/CD automation for continuous delivery

Deployment checklist

  • ✅ Containerization (Dockerfile best practices)
  • ✅ Orchestration (Kubernetes/Compose)
  • ✅ Monitoring (Prometheus + Grafana)
  • ✅ Logging (Loki/ELK)
  • ✅ Caching (Redis)
  • ✅ Health checks
  • ✅ Rolling updates
  • ✅ Rollback plan

Next steps

  • Set up Docker for your Agent project
  • Build a CI/CD pipeline
  • Configure monitoring and alerting

Remember: deployment is not the finish line, it is the starting line. Continuous monitoring and continuous optimization are what keep an Agent system running reliably in production.


Series Finale
Thank you for following this series. From Agent fundamentals to production deployment, we have covered the complete journey. You now have what it takes to build, deploy, and maintain a production-grade Agent system.


Good luck on your AI programming journey, and may the Agent systems you build change the world!
