The Difference Between Readiness Probes and Liveness Probes

Gaaming Zhang, 2025/12/21, about 16 minutes

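Core Concepts Compared

Liveness Probe:

Purpose: detect whether the container is still alive (deadlocked or hung)
On failure: the container is restarted
Use cases: the application is stuck in a deadlock, an infinite loop, or another state it cannot recover from on its own

Readiness Probe:

Purpose: detect whether the container is ready to receive traffic
On failure: the Pod is removed from the Service's Endpoints; the container is not restarted
Use cases: the application needs time to start, a dependency is not ready yet, or the application is temporarily unavailable

Aspect	Liveness Probe	Readiness Probe
What it checks	Whether the container is alive	Whether the container is ready to serve
Action on failure	Restart the container	Remove from the Service
Scope of impact	A single container	Traffic routing
Probe frequency	Continuous	Continuous
Startup delay	initialDelaySeconds	initialDelaySeconds
Typical scenarios	Deadlock, hang	Slow startup, temporary unavailability
Pod status	Affects container restarts	Affects the Ready condition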

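Detailed Differences

1. Different failure behavior

# Liveness probe fails
livenessProbe fails → kubelet kills the container → container is restarted according to restartPolicy

# Readiness probe fails
readinessProbe fails → Pod is marked NotReady → removed from the Service Endpoints → receives no traffic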
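2. Effect on Pod status

# Check the Pod status
kubectl get pod myapp-pod

# Liveness probe passing, readiness probe failing
NAME         READY   STATUS    RESTARTS   AGE
myapp-pod    0/1     Running   0          5m
# READY shows 0/1, but STATUS is still Running

# Liveness probe failing repeatedly
NAME         READY   STATUS             RESTARTS   AGE
myapp-pod    0/1     CrashLoopBackOff   3          5m
# RESTARTS keeps increasing; after repeated restarts STATUS shows CrashLoopBackOff while the kubelet backs off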

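3. Effect on Service traffic routing

Liveness probe:
- Failure → container restart → brief service interruption
- Does not directly modify the Service's Endpoints

Readiness probe:
- Failure → Pod removed from Endpoints → stops receiving new traffic
- The container keeps running and is not restarted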
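All Three Probes Compared

Kubernetes introduced the Startup Probe as an alpha feature in 1.16; it has been enabled by default since 1.18 and stable since 1.20:

Probe type	Purpose	On failure	Typical scenarios
Startup Probe	Checks whether the container has started	Restart the container	Slow-starting applications (keeps the livenessProbe from killing them too early)
Liveness Probe	Checks whether the container is alive	Restart the container	Deadlock, hang, memory leak
Readiness Probe	Checks whether the container is ready	Remove from the Service	Dependency not ready, temporary unavailability

Probe execution order:

1. Startup Probe (if configured)
   ↓ only after it succeeds
2. Liveness Probe (continuous)
   and
3. Readiness Probe (continuous)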
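
The ordering above can be sketched as a small gate. This is only an illustrative model, not kubelet code; the `probeState` type and the `activeProbes` function are hypothetical names introduced for this example.

```go
package main

import "fmt"

// probeState is a hypothetical, simplified view of one container's probes.
type probeState struct {
	hasStartupProbe bool // a startupProbe is configured
	startupDone     bool // the startupProbe has succeeded once
}

// activeProbes returns which probes would be running in the given state:
// while a configured startup probe has not yet succeeded, it is the only
// one that runs; afterwards liveness and readiness run side by side.
func activeProbes(s probeState) []string {
	if s.hasStartupProbe && !s.startupDone {
		return []string{"startup"}
	}
	return []string{"liveness", "readiness"}
}

func main() {
	fmt.Println(activeProbes(probeState{hasStartupProbe: true}))
	fmt.Println(activeProbes(probeState{hasStartupProbe: true, startupDone: true}))
	fmt.Println(activeProbes(probeState{})) // no startup probe configured
}
```

Without a startup probe the gate is open from the beginning, which is why a slow application then has to rely on initialDelaySeconds alone.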

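Configuration Example

A complete probe configuration:

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: myapp
    image: myapp:v1
    ports:
    - containerPort: 8080

    # Startup probe: for slow-starting applications
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 0      # can be 0, since this probe exists to watch startup
      periodSeconds: 10            # probe every 10 seconds
      failureThreshold: 30         # allow 30 failures (5 minutes of startup time)
      successThreshold: 1          # one success is enough
      timeoutSeconds: 5            # per-probe timeout

    # Liveness probe: detect deadlocks
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Liveness
      initialDelaySeconds: 30      # start probing 30 seconds after the container starts
      periodSeconds: 10            # probe every 10 seconds
      timeoutSeconds: 5            # a probe that takes longer than 5 seconds counts as a failure
      successThreshold: 1          # one success is enough
      failureThreshold: 3          # restart only after 3 consecutive failures

    # Readiness probe: check whether the container can receive traffic
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5       # start probing 5 seconds after the container starts
      periodSeconds: 5             # probe every 5 seconds
      timeoutSeconds: 3            # a probe that takes longer than 3 seconds counts as a failure
      successThreshold: 1          # one success marks the container Ready
      failureThreshold: 3          # 3 consecutive failures mark it NotReady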

Three Probe Mechanisms

1. HTTP GET (the most common):

livenessProbe:
  httpGet:
    path: /health        # health check path
    port: 8080           # port
    scheme: HTTP         # HTTP or HTTPS
    httpHeaders:         # optional HTTP headers
    - name: Authorization
      value: Bearer token
  initialDelaySeconds: 30
  periodSeconds: 10
Application-side implementation (Go example):

// Health check endpoint
http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
    // Check the application's own state
    if appHealthy() {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("OK"))
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("Unhealthy"))
    }
})

// Readiness check endpoint
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    // Check dependencies
    if databaseConnected() && cacheAvailable() {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Ready"))
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("Not Ready"))
    }
})

2. TCP Socket (for TCP services):

livenessProbe:
  tcpSocket:
    port: 3306           # MySQL port
  initialDelaySeconds: 15
  periodSeconds: 10

Use cases:

Databases (MySQL, PostgreSQL)
Caches (Redis, Memcached)
Message queues (services that expose no HTTP interface)

3. Exec Command (run a command in the container):

livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

# A more involved check script
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      # Check the process
      if ! pgrep -f myapp > /dev/null; then
        exit 1
      fi
      # Check the marker file
      if [ ! -f /tmp/ready ]; then
        exit 1
      fi
      exit 0
  initialDelaySeconds: 10
  periodSeconds: 5

Use cases:

Applications without an HTTP interface
Checks that need more complex logic
Checking file or process state

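Parameter Reference

probe:
  # Initial delay: how long after the container starts before probing begins
  initialDelaySeconds: 30

  # Period: how often to probe
  periodSeconds: 10

  # Timeout: how long a single probe may take
  timeoutSeconds: 5

  # Success threshold: consecutive successes required to count as healthy
  # Must be 1 for Liveness and Startup
  # May be greater than 1 for Readiness
  successThreshold: 1

  # Failure threshold: consecutive failures required to count as failed
  failureThreshold: 3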
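
To make the meaning of "consecutive" concrete, here is a minimal sketch of how the two thresholds could flip a probe's reported state. It is a simplified model under my own assumptions, not the kubelet's implementation; `thresholdTracker` and `observe` are hypothetical names.

```go
package main

import "fmt"

// thresholdTracker is a hypothetical model of consecutive-result counting.
type thresholdTracker struct {
	successThreshold int
	failureThreshold int
	healthy          bool // state currently reported
	streak           int  // consecutive results that disagree with that state
}

// observe records one probe result and returns the state after it.
func (t *thresholdTracker) observe(success bool) bool {
	if success == t.healthy {
		t.streak = 0 // result agrees with the current state: reset the streak
		return t.healthy
	}
	t.streak++
	limit := t.failureThreshold
	if success {
		limit = t.successThreshold
	}
	if t.streak >= limit {
		t.healthy = success // enough consecutive results: flip the state
		t.streak = 0
	}
	return t.healthy
}

func main() {
	t := &thresholdTracker{successThreshold: 1, failureThreshold: 3, healthy: true}
	for _, ok := range []bool{false, false, true, false, false, false, true} {
		fmt.Println("probe ok:", ok, "-> healthy:", t.observe(ok))
	}
}
```

In this run the two early failures are forgiven because a success interrupts them; only the later run of three failures in a row flips the state, and a single success flips it back.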

Suggested parameter values:

Scenario	initialDelay	period	timeout	failureThreshold
Fast-starting application	5-10s	5-10s	3-5s	3
Slow-starting application	60-120s	10-20s	5-10s	3-5
Database	30-60s	10s	5s	3
Microservice	10-30s	10s	5s	3

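Real-World Scenarios

Scenario 1: Web application startup

Timeline:
t=0s    : container starts, application begins initializing
t=5s    : Readiness starts probing → fails (application still starting)
t=15s   : database connection established
t=20s   : cache warm-up finished
t=25s   : Readiness probe → succeeds
          Pod is marked Ready, added to the Service Endpoints, and starts receiving traffic
t=30s   : Liveness starts probing → succeeds
t=40s   : Liveness probe → succeeds (ongoing monitoring)

Configuration:

containers:
- name: webapp
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080             # port is required for httpGet
    initialDelaySeconds: 5   # start probing early
    periodSeconds: 5
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
    initialDelaySeconds: 30  # wait until the application has fully started
    periodSeconds: 10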

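Scenario 2: Periodic maintenance window

The application periodically runs maintenance tasks (such as rebuilding a cache) and should not receive traffic while they run:

// isReady is written by the maintenance goroutine and read by probe requests,
// so it uses atomic.Bool (Go 1.19+, package sync/atomic) to avoid a data race.
var isReady atomic.Bool

func init() { isReady.Store(true) }

// Maintenance task
func maintenanceTask() {
    // Mark as Not Ready
    isReady.Store(false)
    // Mark Ready again when done, even if rebuildCache panics
    defer isReady.Store(true)

    // Run the maintenance
    rebuildCache()
}

// Readiness endpoint
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    if isReady.Load() {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
})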

Scenario 3: Dependency checks

func readinessCheck(w http.ResponseWriter, r *http.Request) {
    // Check the database connection
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }

    // Check the Redis connection
    if err := redisClient.Ping().Err(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }

    // Check the downstream service
    if err := checkDownstreamService(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusOK)
}

func livenessCheck(w http.ResponseWriter, r *http.Request) {
    // Liveness only checks whether the application itself is alive.
    // It does not check dependencies (to avoid cascading restarts).
    if applicationHealthy() {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
}

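Common Problems and Best Practices

Problem 1: An overly aggressive Liveness probe causes frequent restarts

# Bad configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5   # too short, the application has not finished starting
  periodSeconds: 5
  failureThreshold: 1      # too small, a single failure triggers a restart

# Better configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60  # leave enough time for startup
  periodSeconds: 10
  failureThreshold: 3      # tolerate up to 3 consecutive failures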

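Problem 2: Readiness and Liveness share the same check logic

# Wrong: both probes check dependencies
livenessProbe:
  httpGet:
    path: /health    # this endpoint checks the DB, Redis, and so on
    port: 8080
readinessProbe:
  httpGet:
    path: /health    # the same check
    port: 8080

# Problem: DB outage → Liveness fails → container restarts → the outage is still there

The right approach:

# Liveness: check only the application itself
livenessProbe:
  httpGet:
    path: /livez     # only checks that the application process is alive
    port: 8080

# Readiness: check dependencies
readinessProbe:
  httpGet:
    path: /readyz    # checks the application and its dependencies
    port: 8080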

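Problem 3: A slow-starting application is killed too early by Liveness

# Problem: the application needs 5 minutes to start, but Liveness is too aggressive
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
# Probing starts after 30 seconds; 3 consecutive failures take about another 30 seconds,
# so the container is restarted roughly 60 seconds after it starts.
# The application needs 5 minutes, so it restarts over and over.

# Solution 1: use a Startup Probe (recommended)
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30    # 10 seconds × 30 attempts = 5 minutes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10       # only starts after the Startup probe succeeds

# Solution 2: increase initialDelaySeconds
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 300  # wait 5 minutes before probing
  periodSeconds: 10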

Problem 4: The probe check is too heavy and hurts performance

// Wrong: the probe check does far too much
func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Runs an expensive database query
    _, _ = db.Query("SELECT COUNT(*) FROM large_table")
    // Runs an expensive computation
    _ = expensiveCalculation()
    // Touches every cache key
    for _, key := range allKeys {
        cache.Get(key)
    }
    w.WriteHeader(http.StatusOK)
}

// Right: a lightweight check
// (a ping like this suits a readiness endpoint; a liveness endpoint should skip even the DB)
func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Only a simple Ping
    if err := db.Ping(); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

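Best Practices Summary

1. Probe endpoint design:

/livez   - Liveness check: only the application itself
/readyz  - Readiness check: the application and its dependencies
/startupz - Startup check: a simple check that startup has completed

2. Parameter guidelines:

initialDelaySeconds: set it from the application's startup time, with enough margin
periodSeconds: not too frequent (to limit overhead); 10 seconds is a common value
timeoutSeconds: slightly above the expected response latency; 3-5 seconds is reasonable
failureThreshold: at least 3, so that brief fluctuations are tolerated

3. Check logic guidelines:

Liveness: check only the application itself (avoids cascading failures)
Readiness: may check dependencies (it only affects traffic routing)
Lightweight: avoid complex queries or computation
Idempotent: repeated calls must not change application state

4. Slow-starting applications:

Use a Startup Probe (alpha in Kubernetes 1.16, enabled by default since 1.18)
Or set a large initialDelaySeconds
failureThreshold × periodSeconds ≥ startup time

5. Separation of concerns:

# Different check paths
startupProbe:
  httpGet:
    path: /startupz    # simple check
    port: 8080
livenessProbe:
  httpGet:
    path: /livez       # is the application alive
    port: 8080
readinessProbe:
  httpGet:
    path: /readyz      # is it ready
    port: 8080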

Debugging and Monitoring

Inspecting probe status:

# Show Pod details
kubectl describe pod <pod-name>

# Key information:
# - Conditions: the Ready status
# - Containers.State: the container state
# - Events: probe failure events

# Sample output:
Conditions:
  Type              Status
  Initialized       True
  Ready             False      # Readiness failed
  ContainersReady   False
  PodScheduled      True

Events:
  Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy  Liveness probe failed: Get http://10.244.1.10:8080/health: dial tcp 10.244.1.10:8080: connect: connection refused

Probe metrics (Prometheus):

# Readiness probe failures
kube_pod_status_ready{pod="myapp"}

# Restarts caused by the Liveness probe
rate(kube_pod_container_status_restarts_total[5m])

# Container readiness
kube_pod_container_status_ready

Temporarily disabling probes (for debugging):

# Edit the Deployment and comment out the probes
kubectl edit deployment myapp

# Or use patch
kubectl patch deployment myapp --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"}
]'

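Related Interview Questions

Q1: Why shouldn't a Liveness probe check dependencies?

Answer:

Reasons:

Avoid cascading failures: if a dependency (such as the database) fails, every Pod gets restarted; the problem is still there after the restart, which produces a pointless restart loop
It widens the blast radius: one service's problem spreads to every service that depends on it
It cannot fix anything: restarting a container does not repair an external dependency

The right approach:

# Liveness: check only the application itself
# (a probe accepts exactly one handler, so choose one of the two options)
livenessProbe:
  # Option A: a simple HTTP check
  httpGet:
    path: /livez
    port: 8080
  # Option B: a process check
  # exec:
  #   command:
  #   - pgrep
  #   - -f
  #   - myapp

# Readiness: may check dependencies
readinessProbe:
  httpGet:
    path: /readyz    # checks the DB, Redis, and other dependencies
    port: 8080

Example scenario:

Database outage, if Liveness checks the DB:
1. Liveness fails → every Pod restarts
2. The DB is still down after the restart → Liveness keeps failing
3. Pods enter CrashLoopBackOff → the service is completely unavailable

The right approach (only Readiness checks the DB):
1. Readiness fails → Pod is removed from the Service
2. The container keeps running and retries periodically
3. The DB recovers → Readiness succeeds → service resumes automatically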
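Q2: What does the Startup Probe do, and when should it be used?

Answer:

What it does:

It exists specifically to detect whether the container has finished starting
Until the Startup probe succeeds, Liveness and Readiness probes do not run
It keeps slow-starting applications from being killed too early by Liveness

When to use it:

Java applications: JVM startup and class loading take time
Large applications: data initialization, cache warm-up
Databases: startup, recovery, index rebuilds
Machine learning models: loading the model takes time

Configuration example:

# Slow-starting application (needs 5 minutes to start)
startupProbe:
  httpGet:
    path: /startup
    port: 8080
  periodSeconds: 10
  failureThreshold: 30      # 10s × 30 = 300s = 5 minutes
  # Liveness does not probe during this window

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3       # takes effect only after Startup succeeds

Advantages:

No need for a very large initialDelaySeconds
Normal health checks begin as soon as startup completes
Finer control over the startup detection logic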

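Q3: How do you switch traffic seamlessly during a rolling update?

Answer:

The problem:
During a rolling update, an old Pod that is still handling requests when it is deleted will drop those connections.

Solution: preStop + Readiness

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0    # always keep a Pod available
  template:
    spec:
      terminationGracePeriodSeconds: 30   # Pod-level field; must cover preStop plus shutdown time
      containers:
      - name: myapp
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 15   # wait 15 seconds

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5

The full sequence:

1. New Pod is created → Readiness probe → added to Endpoints once it succeeds
2. Old Pod is deleted (by the rolling update or kubectl delete pod) → Pod enters Terminating
3. The terminating Pod is removed from Endpoints (stops receiving new traffic); this happens in parallel with the steps below and does not wait for a readiness probe to fail
4. The preStop hook runs (sleep 15) → gives the Endpoints change time to propagate and in-flight requests time to finish
5. SIGTERM is sent to the container
6. The kubelet waits for the rest of terminationGracePeriodSeconds (the countdown starts at step 2 and includes the preStop time)
7. If the container still has not exited, SIGKILL terminates it

Application-side handling:

// Graceful shutdown
func gracefulShutdown() {
    // Mark as NotReady (optional, Kubernetes handles this automatically)
    isReady = false

    // Give in-flight requests time to finish
    time.Sleep(10 * time.Second)

    // Shut the server down
    server.Shutdown(context.Background())
}

// Listen for signals
signal.Notify(sigChan, syscall.SIGTERM, syscall.SIGINT)
go func() {
    <-sigChan
    gracefulShutdown()
}()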

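Q4: Probes fail frequently. How do you troubleshoot and tune them?

Answer:

Troubleshooting steps:

# 1. Find the failure reason
kubectl describe pod <pod-name> | grep -A 10 "Events"

# 2. Inspect the probe configuration
kubectl get pod <pod-name> -o yaml | grep -A 20 "Probe"

# 3. Test the probe by hand (requires curl in the image)
kubectl exec <pod-name> -- curl -f http://localhost:8080/health

# 4. Check the application logs
kubectl logs <pod-name> | grep -i "health\|ready"

# 5. Check resource usage
kubectl top pod <pod-name>

Common causes and fixes:

Cause	Fix
Timeout too short	Increase timeoutSeconds (5-10 seconds suggested)
Check logic too heavy	Simplify the check, avoid complex queries
Insufficient resources	Raise the CPU/memory limits
Network latency	Increase timeoutSeconds and failureThreshold
High application load	Optimize the application or scale out
Probing too often	Increase periodSeconds (10-15 seconds)
Long startup time	Use a startupProbe or increase initialDelaySeconds

Tuned configuration example:

# Before (high probe failure rate)
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10   # too short
  periodSeconds: 5          # too frequent
  timeoutSeconds: 1         # too short
  failureThreshold: 1       # too strict

# After
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # enough time to start
  periodSeconds: 10         # lower frequency
  timeoutSeconds: 5         # longer timeout
  failureThreshold: 3       # tolerate occasional failures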

Q5: How do you configure probes for middleware such as databases and caches?

Answer:

MySQL (the same pattern applies to PostgreSQL with its own port and client tools):

# TCP check (simple)
livenessProbe:
  tcpSocket:
    port: 3306
  initialDelaySeconds: 30
  periodSeconds: 10

# Or a command check (more accurate)
livenessProbe:
  exec:
    command:
    - mysqladmin
    - ping
    - -h
    - localhost
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  exec:
    command:
    - mysql
    - -h
    - localhost
    - -e
    - "SELECT 1"
  initialDelaySeconds: 5
  periodSeconds: 5

Redis:

livenessProbe:
  exec:
    command:
    - redis-cli
    - ping
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  exec:
    command:
    - sh
    - -c
    - redis-cli ping | grep PONG
  initialDelaySeconds: 5
  periodSeconds: 5

Elasticsearch:

livenessProbe:
  httpGet:
    path: /_cluster/health
    port: 9200
  initialDelaySeconds: 60
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /_cluster/health?wait_for_status=yellow
    port: 9200
  initialDelaySeconds: 30
  periodSeconds: 10

MongoDB:

livenessProbe:
  exec:
    command:
    - mongo                    # recent MongoDB images ship mongosh instead of the legacy mongo shell
    - --eval
    - "db.adminCommand('ping')"
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  tcpSocket:
    port: 27017
  initialDelaySeconds: 5
  periodSeconds: 5

General guidelines:

Databases start slowly, so use a larger initialDelaySeconds (30-60 seconds)
Keep Liveness simple (ping, TCP)
Readiness may verify real query capability
Do not probe too often (periodSeconds of 10-15 seconds)

Q6: Should probe checks be synchronous or asynchronous? How do you avoid blocking?

Answer:

Recommendation: run the checks asynchronously, but answer the probe synchronously from cached state

The wrong way (synchronous and blocking):

func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Runs slow operations inline and blocks the request
    if err := db.Ping(); err != nil {  // may time out
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if err := checkExternalAPI(); err != nil {  // may be very slow
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

The right way (asynchronous background checks):

var (
    healthy     = true
    healthyLock sync.RWMutex
)

// Background asynchronous check
func backgroundHealthCheck() {
    ticker := time.NewTicker(5 * time.Second)
    for range ticker.C {
        result := true

        // Check the database
        if err := db.Ping(); err != nil {
            result = false
        }

        // Check external dependencies
        if err := checkDependencies(); err != nil {
            result = false
        }

        // Update the cached state
        healthyLock.Lock()
        healthy = result
        healthyLock.Unlock()
    }
}

// Probe endpoint (returns quickly)
func healthCheck(w http.ResponseWriter, r *http.Request) {
    healthyLock.RLock()
    isHealthy := healthy
    healthyLock.RUnlock()

    if isHealthy {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
}

func main() {
    go backgroundHealthCheck()
    http.HandleFunc("/health", healthCheck)
    http.ListenAndServe(":8080", nil)
}

Advantages:

The probe handler returns immediately and does not time out
The background check can be more thorough
The probe check never blocks the application
A shorter timeoutSeconds becomes practical

A compromise (synchronous check with a timeout):

func healthCheck(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    resultChan := make(chan bool, 1)

    go func() {
        // Run the check
        result := db.PingContext(ctx) == nil
        resultChan <- result
    }()

    select {
    case result := <-resultChan:
        if result {
            w.WriteHeader(http.StatusOK)
        } else {
            w.WriteHeader(http.StatusServiceUnavailable)
        }
    case <-ctx.Done():
        // Timed out
        w.WriteHeader(http.StatusServiceUnavailable)
    }
}

Q7: How do you configure probes in a multi-container Pod?

Answer:

Each container has its own independent probes:

apiVersion: v1
kind: Pod
metadata:
  name: multi-container-pod
spec:
  containers:
  # Main application container
  - name: app
    image: myapp:v1
    ports:
    - containerPort: 8080
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080

  # Sidecar container (log collection)
  - name: log-collector
    image: fluentd:v1
    livenessProbe:
      exec:
        command:
        - pgrep
        - -f
        - fluentd
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      tcpSocket:
        port: 24224
      initialDelaySeconds: 5
      periodSeconds: 5

  # Sidecar container (monitoring agent)
  - name: metrics-agent
    image: prometheus-agent:v1
    livenessProbe:
      httpGet:
        path: /metrics
        port: 9090
    readinessProbe:
      httpGet:
        path: /metrics
        port: 9090

Pod Ready status:

Pod Ready = the Readiness probe of every container succeeds

Example:
- app container: Ready
- log-collector: Ready
- metrics-agent: NotReady
→ Pod overall: NotReady (receives no traffic)
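
The rule is a plain logical AND over the containers, which a short Go sketch makes explicit. The `podReady` function is a hypothetical helper for illustration; it models only container readiness and leaves out other inputs to the real Ready condition, such as readiness gates.

```go
package main

import "fmt"

// podReady reports whether every container's readiness is true.
// A single NotReady container makes the whole Pod NotReady.
// Note: an empty map yields true in this simplified model.
func podReady(containers map[string]bool) bool {
	for _, ready := range containers {
		if !ready {
			return false
		}
	}
	return true
}

func main() {
	containers := map[string]bool{
		"app":           true,
		"log-collector": true,
		"metrics-agent": false,
	}
	fmt.Println("pod ready:", podReady(containers))

	containers["metrics-agent"] = true
	fmt.Println("pod ready:", podReady(containers))
}
```

This is why a readiness probe on a non-critical sidecar deserves care: its failure withdraws traffic from the main container as well.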

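Best practices:

Main container: configure the full set of probes
Sidecar containers:
- If it is a critical service (such as a proxy), configure probes
- If it is an auxiliary service (such as logging), a Liveness probe alone may be enough
Init containers: no probes needed (the main containers start only after the init containers complete successfully)

Service Mesh scenario (Istio):

containers:
# Application container
- name: app
  livenessProbe:
    httpGet:
      path: /health
      port: 8080
  readinessProbe:
    httpGet:
      path: /ready
      port: 8080

# Istio sidecar (injected automatically)
- name: istio-proxy
  livenessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021
  readinessProbe:
    httpGet:
      path: /healthz/ready
      port: 15021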

Q8: How can probes be used for gradual rollouts and canary deployments?

Answer:

Use the Readiness probe to control traffic:

Option 1: manual control

apiVersion: v1
kind: Pod
metadata:
  name: canary-pod
  labels:
    app: myapp
    version: v2
    canary: "true"
spec:
  containers:
  - name: app
    image: myapp:v2
    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - test -f /tmp/canary-enabled   # the marker file is absent at first, so no traffic
      periodSeconds: 5

Enable the canary:

# The environment of a running Pod cannot be changed (kubectl set env does not work on a live Pod),
# so this variant toggles a marker file inside the container instead
kubectl exec canary-pod -- touch /tmp/canary-enabled

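Option 2: dynamic control with a ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: canary-config
data:
  enabled: "false"
---
apiVersion: v1
kind: Pod
metadata:
  name: canary-pod
spec:
  containers:
  - name: app
    image: myapp:v2
    volumeMounts:
    - name: canary-config
      mountPath: /etc/canary   # mounted files follow ConfigMap updates; env vars from envFrom do not
    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - test "$(cat /etc/canary/enabled)" = "true"
  volumes:
  - name: canary-config
    configMap:
      name: canary-config

Turn the canary on dynamically:

# Enable canary traffic (the mounted file is refreshed after the kubelet's sync delay)
kubectl patch configmap canary-config -p '{"data":{"enabled":"true"}}'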

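Option 3: implement it inside the application (recommended)

// canaryEnabled is shared between request goroutines, so it uses atomic.Bool (Go 1.19+).
var canaryEnabled atomic.Bool

// Configuration endpoint (internal use only)
http.HandleFunc("/admin/canary", func(w http.ResponseWriter, r *http.Request) {
    if r.Method == "POST" {
        enabled := r.FormValue("enabled")
        canaryEnabled.Store(enabled == "true")
    }
    w.WriteHeader(http.StatusOK)
})

// Readiness check
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    if canaryEnabled.Load() {
        w.WriteHeader(http.StatusOK)
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
    }
})

Canary workflow:

1. Deploy the canary version (Readiness fails, so it receives no traffic)
2. Observe Pod status, logs, and metrics
3. Once everything looks fine, enable Readiness
4. The canary Pod starts receiving traffic (5-10%)
5. Watch the metrics (error rate, latency, and so on)
6. If they look normal, gradually increase the canary share
7. If something is wrong, disable Readiness and roll back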
Combined with Deployments:

# Stable version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:                # required in apps/v1
    matchLabels:
      app: myapp
      version: v1
  template:
    metadata:
      labels:
        app: myapp
        version: v1
    spec:
      containers:
      - name: app
        image: myapp:v1

---
# Canary version
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1    # about 10% of traffic
  selector:
    matchLabels:
      app: myapp
      version: v2
  template:
    metadata:
      labels:
        app: myapp
        version: v2
    spec:
      containers:
      - name: app
        image: myapp:v2
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
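
The "10%" figure follows from the replica counts, assuming a Service that selects only on `app: myapp` and spreads requests roughly evenly across Ready Pods. A minimal sketch of that arithmetic, with `canaryShare` as a hypothetical helper:

```go
package main

import "fmt"

// canaryShare estimates the fraction of traffic reaching canary Pods,
// assuming requests are spread evenly over all Ready Pods behind the Service.
func canaryShare(stableReady, canaryReady int) float64 {
	total := stableReady + canaryReady
	if total == 0 {
		return 0 // no Ready Pods: nothing receives traffic
	}
	return float64(canaryReady) / float64(total)
}

func main() {
	fmt.Printf("canary not ready yet: %.0f%%\n", canaryShare(9, 0)*100)
	fmt.Printf("9 stable + 1 canary:  %.0f%%\n", canaryShare(9, 1)*100)
	fmt.Printf("9 stable + 3 canary:  %.0f%%\n", canaryShare(9, 3)*100)
}
```

While the canary's Readiness probe fails it counts as zero Ready Pods, so its share is zero regardless of the replica count. Actual distribution depends on the proxy mode and connection reuse, so treat the result as an estimate.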

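Key Takeaways

Core differences:

Liveness: liveness check, a failure restarts the container
Readiness: readiness check, a failure removes the Pod from traffic
Startup: startup check, protects slow-starting applications

Usage principles:

Liveness: check only the application itself, not its dependencies
Readiness: may check dependencies
Lightweight checks: avoid complex logic and slow operations
Sensible parameters: allow enough startup time and enough tolerated failures

Common configuration:

# Typical web application
startupProbe:        # slow-start protection
  initialDelaySeconds: 0
  periodSeconds: 10
  failureThreshold: 30

livenessProbe:       # deadlock detection
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:      # traffic control
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3

Debugging tips:

kubectl describe pod <pod>    # inspect probe status
kubectl logs <pod>            # read application logs
kubectl exec <pod> -- curl    # test the probe by hand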