k8s problems

Pod cannot resolve domain names

The Pod DNS policy is ClusterFirst, and the host's /etc/resolv.conf contains the following.

nameserver 127.0.0.53
options edns0 trust-ad
search .

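As a result, /etc/resolv.conf inside the Pod ends up with the same configuration, and domain names cannot be resolved. First delete /etc/resolv.conf (a symlink to /run/systemd/resolve/stub-resolv.conf), then recreate it with the following content.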

nameserver 223.5.5.5
nameserver 8.8.8.8

sudo rm /etc/resolv.conf
sudo vim /etc/resolv.conf

Restart the Pods.

kubectl delete pods --all -n=<namespace> # delete all pods in the namespace
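
Before editing the file, it can help to confirm that the host really lists only the systemd-resolved stub address. The following is a minimal sketch of such a check; the function name uses_only_stub_resolver is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Return 0 if the given resolv.conf lists only the systemd-resolved stub
# address (127.0.0.53) as a nameserver, and 1 otherwise.
# uses_only_stub_resolver is a hypothetical helper name.
uses_only_stub_resolver() {
    conf="${1:-/etc/resolv.conf}"
    # Collect nameserver addresses, ignoring options/search lines.
    servers="$(awk '$1 == "nameserver" { print $2 }' "$conf")"
    [ -n "$servers" ] || return 1
    # Fail as soon as any non-stub address is present.
    for s in $servers; do
        [ "$s" = "127.0.0.53" ] || return 1
    done
    return 0
}

# Example:
# uses_only_stub_resolver /etc/resolv.conf && echo "only the stub resolver is configured"
```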

pod didn’t trigger scale-up

Error message

.. (combined from similar events): pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 Insufficient memory, 7 can't increase node group size

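Cause

  • The memory or CPU requested by the Pod's containers exceeds what a machine in the resource pool can offer, so the Pod would not fit even on a new node and scale-up is not triggered.

Solution

  • Reduce the containers' memory / CPU.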

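Core dumps and how to keep them

Set the core dump path and file name

Configure this in deployment.yaml by running the command echo "core.%p" > /proc/sys/kernel/core_pattern

Mount a hostPath volume so the files are not deleted when the container restarts.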

# deployment.yaml
spec:
  template:
    spec:
      volumes:
        - name: core-dump-volume
          hostPath:
            path: /data/core
            type: DirectoryOrCreate
      containers:
        - name: {container-name}
          volumeMounts:
            - name: core-dump-volume
              mountPath: /data/core
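
A relative pattern such as core.%p is resolved against the crashing process's working directory, so the dump only lands in /data/core if the process runs from there. One option is an absolute pattern. The sketch below only builds the pattern string; the helper name core_pattern_for is hypothetical, and actually writing to /proc/sys/kernel/core_pattern needs root (typically a privileged container), which is an assumption about your setup.

```shell
#!/bin/sh
# Build an absolute core_pattern so dumps land in the mounted directory.
# %e = executable name, %p = PID, %t = dump time (seconds since the epoch).
# core_pattern_for is a hypothetical helper name.
core_pattern_for() {
    dir="${1%/}"    # strip one trailing slash, if any
    printf '%s/core.%%e.%%p.%%t\n' "$dir"
}

# Example (needs root, so it is left commented out):
# core_pattern_for /data/core > /proc/sys/kernel/core_pattern
```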

install python from source

Install Python 3 from source

Set the version

export PYTHON_VERSION=3.9.13
export PYTHON_MAJOR=3

Download

wget --no-check-certificate https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz
tar -xvzf Python-${PYTHON_VERSION}.tgz
cd Python-${PYTHON_VERSION}
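
The two variables can also be derived from a single version string, which avoids them drifting apart. A minimal sketch; the helper names python_major and python_tarball_url are made up for illustration, and the URL layout is the one used in the wget command above.

```shell
#!/bin/sh
# Derive the major version ("3") from a full version such as "3.9.13".
python_major() {
    printf '%s\n' "${1%%.*}"
}

# Build the source tarball URL for a full X.Y.Z version.
python_tarball_url() {
    version="$1"
    case "$version" in
        [0-9]*.[0-9]*.[0-9]*) ;;
        *) echo "expected a version like 3.9.13" >&2; return 1 ;;
    esac
    printf 'https://www.python.org/ftp/python/%s/Python-%s.tgz\n' "$version" "$version"
}

# Example:
# wget "$(python_tarball_url 3.9.13)"
```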

Configure

./configure \
    --prefix=/opt/python/${PYTHON_VERSION} \
    --enable-shared \
    --enable-ipv6 \
    LDFLAGS=-Wl,-rpath=/opt/python/${PYTHON_VERSION}/lib,--disable-new-dtags \
    --enable-optimizations

Build and install

make
sudo make install

Install pip

pip is already installed as part of the Python 3.9.13 installation. If it is missing, it can be installed with get-pip.py.

curl -O https://bootstrap.pypa.io/get-pip.py
sudo /opt/python/${PYTHON_VERSION}/bin/python${PYTHON_MAJOR} get-pip.py

Install with a package manager

# ubuntu
apt install python-pip #python 2
apt install python3-pip #python 3

# centos
yum install epel-release
yum install python-pip
# dnf
dnf install python-pip #Python 2
dnf install python3 #Python 3

k9s setup

Install

# brew
brew install derailed/k9s/k9s

# snap
sudo snap install k9s

Connect to a cluster

# microk8s
# Save the config to ~/.kube/config
# k9s reads this config and connects to the cluster
microk8s config > ~/.kube/config

network

Setting a gateway breaks the gateway used by a port

See the reference here.

ip route add 192.168.6.0/24 dev br0 table 16
ip route add default via 192.168.6.1 dev br0 table 16
ip rule add iif lo ipproto tcp sport 10014 lookup 16

Another approach

ip route add 192.168.6.0/24 dev br0 table 10
ip route add default via 192.168.6.1 table 10
ip route add 192.168.9.0/24 dev br1 table 12
ip route add default via 192.168.9.1 table 12
ip rule add from 192.168.6.0/24 table 10 priority 1
ip rule add from 192.168.9.0/24 table 12 priority 2

# Add the docker network
ip route add 172.17.0.0/16 dev docker0 table 10
ip route add 172.17.0.0/16 dev docker0 table 12

# Flush the route cache
ip route flush cache

# Verify
$ ip route show table 12
default via 192.168.9.1 dev br1
172.17.0.0/16 dev docker0 scope link
192.168.9.0/24 dev br1 scope link

Example

# out
ip route add default via 192.168.6.1 dev ens8 table 10
ip route add default via 192.168.9.1 dev ens9 table 12
# in
ip rule add from 192.168.6.0/24 table 10 priority 1
ip rule add from 192.168.9.0/24 table 12 priority 2 # priority is optional

# If a default route is already set, one of the two can be skipped. For example, given this default route
ip route add default via 192.168.6.1 dev ens8
# only 192.168.9.0/24 needs to be configured
# ip route add default via 192.168.9.1 dev ens9 table 12
ip route add 192.168.9.0/24 dev ens9 proto kernel scope link src 192.168.9.8
ip rule add from 192.168.9.0/24 table 12
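
The per-uplink pattern above (one table per subnet, a default route in that table, and a source-based rule) repeats for every interface. The sketch below prints the commands for one uplink instead of running them, so they can be reviewed first. The function name policy_route_cmds and its argument order are assumptions made for illustration.

```shell
#!/bin/sh
# Print (not run) the commands that give one uplink its own routing table.
# Arguments: subnet, gateway, device, table id, rule priority.
# policy_route_cmds is a hypothetical helper name.
policy_route_cmds() {
    subnet="$1"; gateway="$2"; dev="$3"; table="$4"; prio="$5"
    echo "ip route add $subnet dev $dev table $table"
    echo "ip route add default via $gateway dev $dev table $table"
    echo "ip rule add from $subnet table $table priority $prio"
}

# Example: review the output, then pipe it to a root shell if it looks right.
# policy_route_cmds 192.168.9.0/24 192.168.9.1 br1 12 2
```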

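Cannot reach intranet services after connecting to the router's VPN

Cause

The router's LAN subnet conflicts with the subnet of the outside network, so traffic to LAN IPs does not go through the VPN.

Fix

# macOS, route command
## Show the routing table
sudo netstat -rn
192.168.6.6 link#6 UHRLWIi en0 8
## en0 is the Wi-Fi interface; the VPN interface is utun3
## Change the route so that all of 192.168.6.0/24 goes through the VPN
sudo route change 192.168.6.0/24 -interface utun3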

External USB network adapter

$ lshw -c network
WARNING: you should run this program as super-user.
*-network
...
logical name: enp0s31f6
...
*-network DISABLED
...
logical name: enxf8e43b1a1229
...
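
In the output above the USB adapter shows up as DISABLED. A minimal sketch for pulling the logical names of disabled interfaces out of lshw output follows; the function name disabled_nics is made up, and bringing the interface up with ip link afterwards is an assumption about what you want to do next.

```shell
#!/bin/sh
# Read `lshw -c network` output on stdin and print the logical name of
# every interface whose section header is marked DISABLED.
# disabled_nics is a hypothetical helper name.
disabled_nics() {
    awk '
        index($0, "*-network") > 0 { disabled = (index($0, "DISABLED") > 0) }
        disabled && /logical name:/ { print $NF }
    '
}

# Example (the second command needs root):
# sudo lshw -c network | disabled_nics
# sudo ip link set dev enxf8e43b1a1229 up
```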

pve - usage

Passthrough

Disk passthrough

# SATA disk passthrough
qm set {vm-id} -sata0 /dev/disk/by-id/{disk-id}
# Example
qm set 105 -sata0 /dev/disk/by-id/ata-HS-SSD-A260_1024G_30066931838

samba server

Install

# ubuntu
sudo apt update
sudo apt install samba

Configure

sudo vim /etc/samba/smb.conf
# Append at the end of the file
[sambashare]
comment = Samba on Ubuntu
path = /home/username/sambashare
read only = no
browsable = yes

Restart the service

sudo service smbd restart

Add a user and set a password

sudo smbpasswd -a <username>

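HTTPS certificates

CertBot

Use CertBot to issue certificates and renew them automatically.

# Install certbot
$ sudo snap install --classic certbot
$ sudo ln -s /snap/bin/certbot /usr/bin/certbot

# Issue a certificate
# Requires: stop any web server listening on port 80
$ sudo certbot certonly --standalone
# Enter the email address, domain name, etc. when prompted
# The certificate is placed under /etc/letsencrypt/live/{domain}/
# CertBot creates a scheduled task to renew the certificate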

Caddy

Caddy Server picks up certificates automatically, so there is no need to specify where they are.

exploring.fun {
    reverse_proxy http://172.60.2.1:30800
}

microk8s usage

Description

Following the official documentation, set up a three-node k8s cluster.

Node      IP
k8s01-1   10.1.0.78
k8s01-2   10.1.0.62
k8s01-3   10.1.0.242

alias

# ~/.bash_aliases
alias k='microk8s kubectl'
alias mk='microk8s'

# Or ~/.local/bin/kubectl, for k9s compatibility
#!/bin/bash
exec microk8s.kubectl $(echo "$*" | sed 's/-- sh.*/sh/')

Join the user group

sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
# Re-enter the session
su - $USER

Hosts

On all nodes

10.1.0.78 k8s01-1
10.1.0.62 k8s01-2
10.1.0.242 k8s01-3
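
Because the same entries go onto every node, it is convenient to add them in a way that is safe to repeat. A minimal sketch, assuming a hosts-style file with one "IP name" mapping per line; the function name add_host_entry is hypothetical.

```shell
#!/bin/sh
# Append "IP NAME" to a hosts-style file unless that mapping already exists.
# add_host_entry is a hypothetical helper name.
add_host_entry() {
    file="$1"; ip="$2"; name="$3"
    # Exact field comparison: the first field is the IP, the rest are names.
    if awk -v ip="$ip" -v name="$name" '
           $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
           END { exit !found }
       ' "$file" 2>/dev/null; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example (writing to /etc/hosts needs root):
# add_host_entry /etc/hosts 10.1.0.78 k8s01-1
```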

Initialize the cluster

# k8s01-1, the main node
$ mk add-node
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.9.103:25000/5b502d061dd31ec58d1f6ddf96e10c56/be841c6899a7

Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join 10.1.0.78:25000/5b502d061dd31ec58d1f6ddf96e10c56/be841c6899a7 --worker

If the node you are adding is not reachable through the default interface you can use one of the following:
microk8s join 10.1.0.78:25000/5b502d061dd31ec58d1f6ddf96e10c56/be841c6899a7

# k8s01-2, a worker node
microk8s join 10.1.0.78:25000/5b502d061dd31ec58d1f6ddf96e10c56/be841c6899a7 --worker

# Repeat the steps above for k8s01-3
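
The add-node output lists several addresses, and the first one is not always the one the other nodes can reach (here the cluster uses 10.1.0.78). The sketch below picks the worker join command for a chosen address out of that output; the function name worker_join_cmd is made up for illustration.

```shell
#!/bin/sh
# Read `microk8s add-node` output on stdin and print the first
# "microk8s join ... --worker" line for the given address ($1).
# worker_join_cmd is a hypothetical helper name.
worker_join_cmd() {
    grep -F -- "microk8s join $1:" \
        | grep -F -e "--worker" \
        | head -n 1 \
        | sed 's/^[[:space:]]*//'
}

# Example:
# microk8s add-node | worker_join_cmd 10.1.0.78
```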

Check status

$ k get nodes
NAME      STATUS   ROLES    AGE     VERSION
k8s01-1   Ready    <none>   2d12h   v1.24.3-2+63243a96d1c393
k8s01-2   Ready    <none>   77s     v1.24.3-2+63243a96d1c393
k8s01-3   Ready    <none>   64s     v1.24.3-2+63243a96d1c393

More detailed status

$ k get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s01-1 Ready <none> 2d12h v1.24.3-2+63243a96d1c393 10.1.0.78 <none> Ubuntu 20.04.4 LTS 5.4.0-124-generic containerd://1.5.13
k8s01-2 Ready <none> 5m23s v1.24.3-2+63243a96d1c393 10.1.0.62 <none> Ubuntu 20.04.4 LTS 5.4.0-124-generic containerd://1.5.13
k8s01-3 Ready <none> 5m10s v1.24.3-2+63243a96d1c393 10.1.0.242 <none> Ubuntu 20.04.4 LTS 5.4.0-124-generic containerd://1.5.13

Install add-ons

# Main node
mk enable dns storage dashboard helm3

Get the k8s config

mkdir -p ~/.kube
mk config > ~/.kube/config

Troubleshooting

View events

k get events --sort-by=.metadata.creationTimestamp --namespace=kube-system

Errors

Failed create pod sandbox

Error message

Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container ... error getting ClusterInformation: Get ... https://10.152.183.1:443 ...

Cause

NAMESPACE   NAME         TYPE        CLUSTER-IP
default     kubernetes   ClusterIP   10.152.183.1

The error message shows that the request to the ClusterIP failed, so check the node IPs.

$ k get nodes -o wide
NAME   STATUS   ROLES    AGE   VERSION   INTERNAL-IP     ...   KERNEL-VERSION      CONTAINER-RUNTIME
a      Ready    <none>   16h   v1.27.2   172.21.0.3      ...   5.4.0-126-generic   containerd://1.6.15
b      Ready    <none>   16h   v1.27.2   192.168.6.201   ...   5.15.0-76-generic   containerd://1.6.15

Solution
Node b cannot reach node a (the controller) through a's INTERNAL-IP. Change --node-ip on both nodes to an IP that the other node can reach.

microk8s stop
# or for workers: sudo snap stop microk8s

sudo vim.tiny /var/snap/microk8s/current/args/kubelet
# Add this to bottom: --node-ip=<this-specific-node-lan-ip>

sudo vim.tiny /var/snap/microk8s/current/args/kube-apiserver
# Add this to bottom: --advertise-address=<this-specific-node-lan-ip>

microk8s start
# or for workers: sudo snap start microk8s
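
Editing the args files by hand is easy to get wrong when the flag is already present. A minimal sketch that sets a --flag=value line in an args file, replacing any earlier value, follows. It assumes one --flag=value entry per line, and the function name set_arg is hypothetical.

```shell
#!/bin/sh
# Set "FLAG=VALUE" in an args file, replacing any existing line for FLAG.
# set_arg is a hypothetical helper name.
set_arg() {
    file="$1"; flag="$2"; value="$3"
    tmp="$(mktemp)"
    # Keep every line except an existing entry for this flag.
    grep -v -- "^$flag=" "$file" > "$tmp" 2>/dev/null || true
    printf '%s=%s\n' "$flag" "$value" >> "$tmp"
    cat "$tmp" > "$file"    # write in place to keep the file's owner and mode
    rm -f "$tmp"
}

# Example (run as root, with microk8s stopped):
# set_arg /var/snap/microk8s/current/args/kubelet --node-ip 192.168.6.201
```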

certificate is valid for kubernetes … not for mydomain.com

See the reference here.

$ vim /var/snap/microk8s/current/certs/csr.conf.template
# Add the domain name
[ alt_names ]
DNS.1 = kubernetes
DNS.2 = kubernetes.default
DNS.3 = kubernetes.default.svc
DNS.4 = kubernetes.default.svc.cluster
DNS.5 = kubernetes.default.svc.cluster.local
DNS.6 = mydomain.com # add this line

# Apply the change
$ sudo microk8s refresh-certs --cert server.crt
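
The new entry has to use the next free DNS.N index. A minimal sketch that computes it and prints the updated template to stdout follows; the function name add_alt_name is made up, and the sketch assumes the "DNS.N = value" layout shown above. If the file has no DNS.N lines, nothing is added.

```shell
#!/bin/sh
# Print the template with "DNS.<next> = DOMAIN" inserted after the
# highest-numbered DNS.N line, unless DOMAIN is already listed.
# add_alt_name is a hypothetical helper name.
add_alt_name() {
    file="$1"; domain="$2"
    awk -v d="$domain" '
        /^DNS\.[0-9]+ *=/ {
            idx = $1; sub(/^DNS\./, "", idx)    # "DNS.5" -> "5"
            if (idx + 0 > max) { max = idx + 0; last = NR }
            if ($3 == d) present = 1
        }
        { lines[NR] = $0 }
        END {
            for (i = 1; i <= NR; i++) {
                print lines[i]
                if (i == last && !present) printf "DNS.%d = %s\n", max + 1, d
            }
        }
    ' "$file"
}

# Example: write to a temporary file first, then review and copy it back.
# add_alt_name /var/snap/microk8s/current/certs/csr.conf.template mydomain.com
```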

microk8s installation

Documentation

Install

sudo snap install microk8s --classic

Join the user group

sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
# Re-enter the session
su - $USER

Configure a mirror for k8s.gcr.io

# create a directory with the registry name
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/k8s.gcr.io

# create the hosts.toml file pointing to the mirror
echo '
server = "https://k8s.gcr.io"

[host."https://registry.aliyuncs.com/v2/google_containers"]
capabilities = ["pull", "resolve"]
override_path = true
' | sudo tee -a /var/snap/microk8s/current/args/certs.d/k8s.gcr.io/hosts.toml


# 2
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/registry.k8s.io
echo '
server = "https://registry.k8s.io"

[host."https://registry.aliyuncs.com/v2/google_containers"]
capabilities = ["pull", "resolve"]
override_path = true
' | sudo tee -a /var/snap/microk8s/current/args/certs.d/registry.k8s.io/hosts.toml
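
Note that tee -a appends, so running the commands above twice duplicates the entry. The sketch below writes the same hosts.toml layout but overwrites the file instead. The function name write_registry_mirror and its argument order are assumptions made for illustration.

```shell
#!/bin/sh
# Write CERTS_DIR/REGISTRY/hosts.toml pointing REGISTRY at MIRROR.
# write_registry_mirror is a hypothetical helper name.
write_registry_mirror() {
    certs_dir="$1"; registry="$2"; mirror="$3"
    mkdir -p "$certs_dir/$registry"
    # Overwrite rather than append, so re-running does not duplicate the entry.
    cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "https://$registry"

[host."$mirror"]
capabilities = ["pull", "resolve"]
override_path = true
EOF
}

# Example (run as root, then restart microk8s):
# write_registry_mirror /var/snap/microk8s/current/args/certs.d registry.k8s.io \
#     https://registry.aliyuncs.com/v2/google_containers
```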

A restart is required

sudo snap restart microk8s

Check status

# This hangs here unless you use a proxy or have replaced the registry with a mirror
microk8s status --wait-ready

Configuration

Set up the kubectl command

mkdir -p ~/.local/bin/
vim ~/.local/bin/kubectl
# Enter the following content
#!/bin/bash
exec /snap/bin/microk8s.kubectl $(echo "$*" | sed 's/-- sh.*/sh/')
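
To see what the sed expression in the wrapper does to the arguments, the rewrite can be tried on its own. This is a minimal sketch; the function name rewrite_args is made up, and it uses printf instead of echo only to avoid echo treating a leading option as its own flag.

```shell
#!/bin/sh
# Apply the same rewrite as the wrapper: everything from "-- sh" to the
# end of the argument string is replaced by a plain "sh".
# rewrite_args is a hypothetical helper name.
rewrite_args() {
    printf '%s\n' "$*" | sed 's/-- sh.*/sh/'
}

# Examples:
# rewrite_args exec -it mypod -- sh -c 'ls /'   # the "-- sh -c ls /" tail becomes "sh"
# rewrite_args get pods -n kube-system          # other commands pass through unchanged
```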

Set up aliases

# vim ~/.bash_aliases
alias kubectl='microk8s kubectl'
alias k='microk8s kubectl'
alias mk='microk8s'
alias helm='microk8s helm3'

Set up a private image registry

$ docker login ... # log in to the private image registry
$ kubectl create secret generic regcred \
    --from-file=.dockerconfigjson=$HOME/.docker/config.json \
    --type=kubernetes.io/dockerconfigjson \
    --namespace=default

Build the cluster

Install the dashboard

microk8s enable dns dashboard
# Generate a token
microk8s kubectl create token -n kube-system default --duration=8544h

Cannot pull images from k8s.gcr.io

In theory, configuring a mirror for k8s.gcr.io solves this problem.

# pause
## Pull from the Alibaba Cloud mirror
docker pull registry.aliyuncs.com/google_containers/pause:3.7
## Retag
docker tag registry.aliyuncs.com/google_containers/pause:3.7 k8s.gcr.io/pause:3.7
docker tag registry.aliyuncs.com/google_containers/pause:3.7 registry.k8s.io/pause:3.7

# metrics-server
docker pull registry.aliyuncs.com/google_containers/metrics-server:v0.5.2
docker tag registry.aliyuncs.com/google_containers/metrics-server:v0.5.2 k8s.gcr.io/metrics-server/metrics-server:v0.5.2
docker tag registry.aliyuncs.com/google_containers/metrics-server:v0.5.2 registry.k8s.io/metrics-server/metrics-server:v0.5.2

Join the cluster

# Run on the master node
$ microk8s add-node
From the node you wish to join to this cluster, run the following:
microk8s join 192.168.1.230:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05

Use the '--worker' flag to join a node as a worker not running the control plane, eg:
microk8s join 192.168.1.230:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05 --worker

If the node you are adding is not reachable through the default interface you can use one of the following:
microk8s join 192.168.1.230:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05
microk8s join 10.23.209.1:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05
microk8s join 172.17.0.1:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05

# Run on the joining node
$ microk8s join 172.17.0.1:25000/92b2db237428470dc4fcfc4ebbd9dc81/2c0cb3284b05