Preface: I have recently been reading up on Kubernetes and reached the section on health checks. This topic matters a great deal and covers a lot of ground, so this article collects my study notes and the corresponding methods.
Parts of it come from the web and have been verified by testing.
A full Kubernetes tutorial is available at: https://blog.tag.gg/showinfo-3-36255-0.html
Note: health checks are implemented at the application level.
livenessProbe (liveness check) and readinessProbe (readiness check)
livenessProbe:
A liveness probe uses HTTP, a shell command, or TCP to check whether the application inside the container is healthy, and reports the result to the kubelet. If the application is reported as unhealthy, the kubelet kills the current container and restarts it according to the restartPolicy defined in the Pod manifest.
readinessProbe:
A readiness probe likewise uses HTTP, a shell command, or TCP to check whether the application in the container is healthy and able to serve traffic. If it can, the container is considered Ready, and only Pods that reach the Ready state receive requests.
If the container or Pod is not Ready, Kubernetes removes the Pod from the Service's backend endpoints.
ExecAction (exec): run a shell command inside the container; an exit status of 0 counts as success, i.e. the probe result is healthy
HTTPGetAction (httpGet): send an HTTP request to the container's IP, port, and path; a response code of at least 200 and below 400 counts as success
TCPSocketAction (tcpSocket): attempt a TCP connection to the container's IP address on a specific port; an open port counts as success
Each of these checks can produce one of three results:
Success: the health check passed
Failure: the health check did not pass
Unknown: the check itself could not be performed
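The exit-status convention the exec handler relies on can be seen directly in a shell (a minimal sketch, using a throwaway file under /tmp rather than a real pid file):

```shell
# kubelet treats exit status 0 as Success and anything non-zero as Failure.
touch /tmp/probe-demo.pid
test -e /tmp/probe-demo.pid; echo "file exists  -> exit $?"   # exit 0 -> Success
rm -f /tmp/probe-demo.pid
test -e /tmp/probe-demo.pid; echo "file missing -> exit $?"   # exit 1 -> Failure
```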
Examples:
The three methods of livenessProbe:
1. exec example:
Create a Pod → run an Nginx container → start nginx → sleep for 60 seconds → delete nginx.pid.
The exec liveness probe checks whether nginx.pid exists; if the probe returns a non-zero status, the container is restarted according to the restart policy.
Expected behavior: about 60s after the container is truly Ready, nginx.pid is deleted, the exec probe starts failing, and the container is restarted per the restart policy.
Generate the Pod from the YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: ngx-health
spec:
  containers:
  - name: ngx-liveness
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
    livenessProbe:
      exec:
        command: [ "/bin/sh", "-c", "test -e /run/nginx.pid" ]
  restartPolicy: Always
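A note on the exec command: each array element becomes one argv entry, so a form like ["/bin/sh", "-c", "test", "-e", "/run/nginx.pid"] passes only "test" as the -c command string ("-e" and the path merely become positional parameters), and test with no operands always exits 1, failing the probe even while nginx is healthy. The whole check must be a single string. A quick sketch of the difference (mktemp stands in for the pid file):

```shell
f=$(mktemp)                    # a file that certainly exists, standing in for nginx.pid
/bin/sh -c 'test' '-e' "$f";  echo "split into array elements -> exit $?"   # test gets no operands: always exit 1
/bin/sh -c "test -e $f";      echo "single command string     -> exit $?"   # real check: exit 0 while the file exists
rm -f "$f"
```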
Apply the manifest and wait for the Pod to become Ready:
kubectl apply -f ngx-health.yaml
Check the Pod's status:
[root@master ~]# kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-85b98978db-nh2m4 1/1 Running 0 2d2h
ngx-health 0/1 ContainerCreating 0 18s
[root@master ~]# kubectl get pod
NAME READY STATUS RESTARTS AGE
nginx-85b98978db-nh2m4 1/1 Running 0 2d2h
ngx-health 1/1 Running 0 41s
In the wide listings further below, the first shows the Pod at 22s of runtime with a restart count of 0.
# First check: the container in the Pod started successfully and the events are normal
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulling 12s kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 6s kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 6s kubelet, k8s-node03 Created container ngx-liveness
Normal Started 5s kubelet, k8s-node03 Started container ngx-liveness
# Second check: the container's livenessProbe has failed
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulling 52s kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 46s kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 46s kubelet, k8s-node03 Created container ngx-liveness
Normal Started 45s kubelet, k8s-node03 Started container ngx-liveness
Warning Unhealthy 20s (x3 over 40s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 20s kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
# Third check: the image has been pulled again, then the container was recreated and started
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health to k8s-node03
Warning Unhealthy 35s (x3 over 55s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 35s kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
Normal Pulling 4s (x2 over 67s) kubelet, k8s-node03 Pulling image "nginx:latest"
Normal Pulled 2s (x2 over 61s) kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 2s (x2 over 61s) kubelet, k8s-node03 Created container ngx-liveness
Normal Started 2s (x2 over 60s) kubelet, k8s-node03 Started container ngx-liveness
In the wide listings below, the first shows 22s of runtime and 0 restarts; the second shows 76s of runtime and one completed restart after the failed liveness probes.
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 0 22s 10.244.5.44 k8s-node03 <none> <none>
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 1 76s 10.244.5.44 k8s-node03 <none> <none>
The probe keeps failing and the container keeps being restarted:
kubectl describe pods/ngx-health | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health to k8s-node03
Normal Pulled 58s (x2 over 117s) kubelet, k8s-node03 Successfully pulled image "nginx:latest"
Normal Created 58s (x2 over 117s) kubelet, k8s-node03 Created container ngx-liveness
Normal Started 58s (x2 over 116s) kubelet, k8s-node03 Started container ngx-liveness
Warning Unhealthy 31s (x6 over 111s) kubelet, k8s-node03 Liveness probe failed:
Normal Killing 31s (x2 over 91s) kubelet, k8s-node03 Container ngx-liveness failed liveness probe, will be restarted
Normal Pulling 0s (x3 over 2m3s) kubelet, k8s-node03 Pulling image "nginx:latest"
kubectl get pods -o wide | grep ngx-health
ngx-health 1/1 Running 2 2m13s 10.244.5.44 k8s-node03 <none> <none>
2. httpGet (HTTPGetAction) example:
This method issues an HTTP GET against the container's IP address, port, and path; if the response status code is at least 200 and below 400, the container is considered healthy. The spec.containers.livenessProbe.httpGet field defines this type of check and supports the following fields:
host: the host to request; defaults to the Pod IP. It can also be set with a Host: header in httpHeaders
port: the port to request; required; range 1-65535
httpHeaders <[]Object>: custom request headers
path: the HTTP resource path to request, i.e. the URL path
scheme: the protocol to use, HTTP or HTTPS only; defaults to HTTP
The httpGet liveness probe below requests the index.html file under the nginx document root on port 80, against the Pod IP by default, over HTTP; if the request fails, the container is restarted according to the restart policy.
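The success criterion for httpGet is simply the status-code range: at least 200 and below 400. A small sketch of that decision rule (probe_ok is an illustrative helper, not a real kubelet function):

```shell
# An httpGet probe is Success when 200 <= code < 400, Failure otherwise.
probe_ok() { [ "$1" -ge 200 ] && [ "$1" -lt 400 ]; }
for code in 200 301 404 500; do
  if probe_ok "$code"; then echo "$code -> Success"; else echo "$code -> Failure"; fi
done
```

So redirects such as 301 pass the check, while 404 and 500 fail it.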
Create the Pod resource object:
apiVersion: v1
kind: Pod
metadata:
  name: ngx-health-tcp
spec:
  containers:
  - name: ngx-liveness-tcp
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
    livenessProbe:
      httpGet:
        path: /index.html
        port: 80
        scheme: HTTP
  restartPolicy: Always
Apply the manifest and watch the Pod's status:
kubectl apply -f ngx-health-tcp.yaml
Check the Pod's detailed event information.
# Container being created
kubectl get pods -o wide | grep ngx-health-tcp
ngx-health-tcp 0/1 ContainerCreating 0 7s <none> k8s-node02 <none> <none>
# Container running
kubectl get pods -o wide | grep ngx-health-tcp
ngx-health-tcp 1/1 Running 0 19s 10.244.2.36 k8s-node02 <none> <none>
The image was pulled and the container started successfully.
About 60s after the container became Ready, the liveness probe ran and failed, and the image is already being pulled again:
kubectl describe pods/ngx-health-tcp | grep -A 10 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health-tcp to k8s-node02
Normal Pulling 30s kubelet, k8s-node02 Pulling image "nginx:latest"
Normal Pulled 15s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 15s kubelet, k8s-node02 Created container ngx-liveness-tcp
Normal Started 14s kubelet, k8s-node02 Started container ngx-liveness-tcp
After the pull completed, the container was created and started once more; note that the Age column has been recomputed:
kubectl describe pods/ngx-health-tcp | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health-tcp to k8s-node02
Normal Pulled 63s kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 63s kubelet, k8s-node02 Created container ngx-liveness-tcp
Normal Started 62s kubelet, k8s-node02 Started container ngx-liveness-tcp
Normal Pulling 1s (x2 over 78s) kubelet, k8s-node02 Pulling image "nginx:latest"
The describe output shows the Pod has been restarted once:
kubectl describe pods/ngx-health-tcp | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/ngx-health-tcp to k8s-node02
Normal Pulling 18s (x2 over 95s) kubelet, k8s-node02 Pulling image "nginx:latest"
Normal Pulled 2s (x2 over 80s) kubelet, k8s-node02 Successfully pulled image "nginx:latest"
Normal Created 2s (x2 over 80s) kubelet, k8s-node02 Created container ngx-liveness-tcp
Normal Started 1s (x2 over 79s) kubelet, k8s-node02 Started container ngx-liveness-tcp
kubectl get pods -o wide | grep ngx-health-tcp
ngx-health-tcp 0/1 Completed 0 96s 10.244.2.36 k8s-node02 <none> <none>
k8sops@k8s-master01:~/manifests/pod$ kubectl get pods -o wide | grep ngx-health-tcp
ngx-health-tcp 1/1 Running 1 104s 10.244.2.36 k8s-node02 <none> <none>
The container log shows the probe requests; by default a probe runs every 10 seconds:
[root@master ~]# kubectl logs -f pods/ngx-health-tcp
2023/03/23 10:01:52 [notice] 7#7: using the "epoll" event method
2023/03/23 10:01:52 [notice] 7#7: nginx/1.21.5
2023/03/23 10:01:52 [notice] 7#7: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
2023/03/23 10:01:52 [notice] 7#7: OS: Linux 3.10.0-1160.el7.x86_64
2023/03/23 10:01:52 [notice] 7#7: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2023/03/23 10:01:52 [notice] 8#8: start worker processes
2023/03/23 10:01:52 [notice] 8#8: start worker process 9
2023/03/23 10:01:52 [notice] 8#8: start worker process 10
2023/03/23 10:01:52 [notice] 8#8: start worker process 12
2023/03/23 10:01:52 [notice] 8#8: start worker process 13
2023/03/23 10:01:52 [notice] 8#8: start worker process 14
2023/03/23 10:01:52 [notice] 8#8: start worker process 15
2023/03/23 10:01:52 [notice] 8#8: start worker process 16
2023/03/23 10:01:52 [notice] 8#8: start worker process 17
10.244.2.1 - - [23/Mar/2023:10:02:01 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:02:11 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:02:21 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:02:31 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:02:41 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:02:51 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:03:01 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:03:11 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:03:21 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
10.244.2.1 - - [23/Mar/2023:10:03:31 +0000] "GET /index.html HTTP/1.1" 200 615 "-" "kube-probe/1.23" "-"
3. tcpSocket (TCPSocketAction) example:
This method attempts a TCP connection to the container's IP address and port; if the connection can be established, the container is considered healthy. Compared with the HTTP probe it is cheaper and more efficient, but less precise: establishing a connection does not guarantee the page is actually being served. The spec.containers.livenessProbe.tcpSocket field defines this check and has two fields:
host: the target IP address to connect to; defaults to the Pod IP
port: the target port to connect to; required
The manifest below uses a tcpSocket liveness probe: it opens a connection to port 80/tcp on the Pod IP and judges the result by whether the connection is established:
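What the kubelet does for a tcpSocket probe amounts to a plain TCP handshake. A self-contained sketch, assuming python3 is available to act as both a throwaway listener and a portable socket client (port 18080 is an arbitrary choice assumed to be free; tcp_probe is an illustrative helper):

```shell
# Success iff a TCP connection to host:port can be opened.
tcp_probe() {
  python3 -c "import socket,sys; s=socket.socket(); s.settimeout(2); sys.exit(0 if s.connect_ex(('$1', $2)) == 0 else 1)"
}
python3 -m http.server 18080 --bind 127.0.0.1 >/dev/null 2>&1 &   # throwaway listener
srv=$!; sleep 1
if tcp_probe 127.0.0.1 18080; then echo "port open   -> Success"; fi
kill "$srv"; sleep 1
if ! tcp_probe 127.0.0.1 18080; then echo "port closed -> Failure"; fi
```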
Create the resource manifest.
Note: the value of name must be lowercase, so the Pod is named ngx-health-tcpsocket rather than ngx-health-tcpSocket.
Create the resource object:
apiVersion: v1
kind: Pod
metadata:
  name: ngx-health-tcpsocket
spec:
  containers:
  - name: ngx-liveness-tcpsocket
    image: nginx:latest
    command:
    - /bin/sh
    - -c
    - /usr/sbin/nginx; sleep 100; rm -rf /run/nginx.pid
    livenessProbe:
      tcpSocket:
        port: 80
  restartPolicy: Always
Apply the manifest and check the Pod's events:
kubectl apply -f ngx-health-tcpsocket.yaml
#Container created and started successfully
[root@master ~]# kubectl describe pods/ngx-health-tcpsocket | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 49s default-scheduler Successfully assigned default/ngx-health-tcpsocket to node1
Normal Pulling 49s kubelet Pulling image "nginx:latest"
Normal Pulled 48s kubelet Successfully pulled image "nginx:latest" in 793.703139ms
Normal Created 48s kubelet Created container ngx-liveness-tcpsocket
Normal Started 48s kubelet Started container ngx-liveness-tcpsocket
# About 100s after the container became Ready, the Pod started pulling the image again
[root@master ~]# kubectl describe pods/ngx-health-tcpsocket | grep -A 15 Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m11s default-scheduler Successfully assigned default/ngx-health-tcpsocket to node1
Normal Pulled 4m10s kubelet Successfully pulled image "nginx:latest" in 793.703139ms
Normal Pulled 2m13s kubelet Successfully pulled image "nginx:latest" in 15.787433374s
Warning BackOff 31s (x2 over 33s) kubelet Back-off restarting failed container
Normal Pulling 18s (x3 over 4m11s) kubelet Pulling image "nginx:latest"
Normal Created 2s (x3 over 4m10s) kubelet Created container ngx-liveness-tcpsocket
Normal Started 2s (x3 over 4m10s) kubelet Started container ngx-liveness-tcpsocket
Normal Pulled 2s kubelet Successfully pulled image "nginx:latest" in 15.821059033s
# The wide output also shows the Pod has already been restarted twice and keeps cycling
[root@master ~]# kubectl get pods -o wide | grep ngx-health-tcpsocke
ngx-health-tcpsocket 1/1 Running 2 (96s ago) 5m14s 10.244.1.3 node1 <none> <none>
The sections above covered the two probe types (run at different stages) and the check methods they support. Here are a few auxiliary parameters:
initialDelaySeconds: how long to wait after the container has started before running the first check
periodSeconds: how often to run the check; default 10s, minimum 1s
successThreshold: how many consecutive successes are required after a failure before the check is considered passing again; default 1, and it must be 1 for liveness probes
timeoutSeconds: timeout for each check; default 1s, minimum 1s
failureThreshold: how many consecutive failures are required after a success before the check is considered failing; default 3, minimum 1
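These parameters determine how long a dead container can keep running before the kubelet reacts. As a rough back-of-the-envelope (ignoring per-probe timeouts), using the liveness settings from the practice manifest later in this article:

```shell
initialDelaySeconds=15; periodSeconds=3; failureThreshold=2
# earliest restart if the container is broken from the very start:
echo "~$(( initialDelaySeconds + failureThreshold * periodSeconds ))s until restart"
```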
Health-check practice
The following example uses both a readinessProbe and a livenessProbe.
Readiness probe configuration, explained:
1. The first readiness probe runs 5 seconds (initialDelaySeconds) after the container starts; it fetches the index.html file from the site root over HTTP. If the probe succeeds, the Pod is marked Ready.
2. The probe then repeats at the interval given by periodSeconds; here that is 10 seconds, so a readiness probe runs every 10 seconds.
3. Each probe times out after 3 seconds. A single failure (failureThreshold: 1) removes the Pod from the Service's backend Pods, after which client requests can no longer reach it through the Service.
4. The readiness probe keeps running against the removed Pod; once it succeeds the number of times set by successThreshold (here 1), the Pod is added back to the backend.
Liveness probe configuration, explained:
1. The first liveness probe runs 15 seconds (initialDelaySeconds) after the container starts; it attempts a TCP connection to port 80 of the container, and the probe succeeds if the connection can be established.
2. Liveness probes then run every 3 seconds with a 1-second timeout; two consecutive failures trigger a restart according to the restart policy.
3. After the restart, probing continues; once a probe succeeds again, the Pod is considered healthy.
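The same arithmetic gives a feel for how long a failing Pod can keep receiving traffic before readiness removes it from the endpoints (a rough estimate from the settings above; actual timing also depends on where in the probe cycle the failure starts):

```shell
periodSeconds=10; failureThreshold=1; timeoutSeconds=3
# worst case: the failure begins just after a successful probe, and the next probe times out
echo "up to ~$(( failureThreshold * periodSeconds + timeoutSeconds ))s of misrouted traffic"
```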
1. Resource manifest:
2. Resource objects to create:
#create namespace
apiVersion: v1
kind: Namespace
metadata:
  name: nginx-health-ns
  labels:
    resource: nginx-ns
---
#create deploy and pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-health-deploy
  namespace: nginx-health-ns
  labels:
    resource: nginx-deploy
spec:
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nginx-health
  template:
    metadata:
      namespace: nginx-health-ns
      labels:
        app: nginx-health
    spec:
      restartPolicy: Always
      containers:
      - name: nginx-health-containers
        image: nginx:1.17.1
        imagePullPolicy: IfNotPresent
        command:
        - /bin/sh
        - -c
        - /usr/sbin/nginx; sleep 60; rm -rf /run/nginx.pid
        readinessProbe:
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 3
          failureThreshold: 1
          httpGet:
            path: /index.html
            port: 80
            scheme: HTTP
        livenessProbe:
          initialDelaySeconds: 15
          periodSeconds: 3
          successThreshold: 1
          timeoutSeconds: 1
          failureThreshold: 2
          tcpSocket:
            port: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
---
#create service
apiVersion: v1
kind: Service
metadata:
  name: nginx-health-svc
  namespace: nginx-health-ns
  labels:
    resource: nginx-svc
spec:
  clusterIP: 10.106.189.88
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    app: nginx-health
  sessionAffinity: ClientIP
  type: ClusterIP
3. Create and then view the resource objects:
[root@master ~]# kubectl apply -f nginx-health.yaml
namespace/nginx-health-ns created
deployment.apps/nginx-health-deploy created
service/nginx-health-svc created
4. Check the Pod status; the Pods are still being created and none are Ready yet:
[root@master ~]# kubectl get all -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx-health-deploy-585c775bf-6kqph 0/1 ContainerCreating 0 14s <none> node1 <none> <none>
pod/nginx-health-deploy-585c775bf-8jvlk 0/1 ContainerCreating 0 14s <none> node2 <none> <none>
pod/nginx-health-deploy-585c775bf-rszkc 0/1 ContainerCreating 0 14s <none> node2 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx-health-svc ClusterIP 10.106.189.88 <none> 80/TCP 14s app=nginx-health
NAME READY UP-TO-DATE AVAILABLE AGE CONTAINERS IMAGES SELECTOR
deployment.apps/nginx-health-deploy 0/3 3 0 14s nginx-health-containers nginx:1.17.1 app=nginx-health
NAME DESIRED CURRENT READY AGE CONTAINERS IMAGES SELECTOR
replicaset.apps/nginx-health-deploy-585c775bf 3 3 0 14s nginx-health-containers nginx:1.17.1 app=nginx-health,pod-template-hash=585c775bf
5. A moment later, two of the three Pods are Ready:
[root@master ~]# kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-585c775bf-6kqph 1/1 Running 0 48s 10.244.1.4 node1 <none> <none>
nginx-health-deploy-585c775bf-8jvlk 0/1 Running 0 48s 10.244.2.5 node2 <none> <none>
nginx-health-deploy-585c775bf-rszkc 1/1 Running 0 48s 10.244.2.4 node2 <none> <none>
6. Once the 60s sleep ends and nginx.pid is removed, the probes fail and the Pods are restarted; after the restarts all three Pods come back Ready (note the rising restart counts):
[root@master ~]# kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-585c775bf-6kqph 0/1 CrashLoopBackOff 2 (10s ago) 3m55s 10.244.1.4 node1 <none> <none>
nginx-health-deploy-585c775bf-8jvlk 1/1 Running 2 (72s ago) 3m55s 10.244.2.5 node2 <none> <none>
nginx-health-deploy-585c775bf-rszkc 0/1 CrashLoopBackOff 2 (12s ago) 3m55s 10.244.2.4 node2 <none> <none>
[root@master ~]# kubectl get pods -n nginx-health-ns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-health-deploy-585c775bf-6kqph 1/1 Running 3 (48s ago) 4m33s 10.244.1.4 node1 <none> <none>
nginx-health-deploy-585c775bf-8jvlk 1/1 Running 3 (32s ago) 4m33s 10.244.2.5 node2 <none> <none>
nginx-health-deploy-585c775bf-rszkc 1/1 Running 3 (50s ago) 4m33s 10.244.2.4 node2 <none> <none>