# Platform Deployment Guide
## 1. Project Repositories
#### 1.1. configmap configuration file templates
##### https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git
##### Note: Configuration files are mounted externally via k8s ConfigMaps. Variable fields in the configuration files, such as the MySQL connection address, are written as template placeholders that can be replaced with concrete values by running sed over the templates.
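###### A minimal sketch of this rendering step (the placeholder name `__MYSQL_HOST__` and the file path are illustrative assumptions, not the templates' actual field names; see the repo for the real ones):
```shell
# Hypothetical placeholder rendering: substitute a real MySQL address
# into a configmap template before applying it.
MYSQL_HOST=192.168.1.10
sed -i "s/__MYSQL_HOST__/${MYSQL_HOST}/g" configmap/autodl-core/config.tmpl
```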
#### 1.2. Files required for platform deployment
##### Note: Includes the yaml files for deploying the platform onto the k8s cluster, initialization SQL statements, the configmap rendering script, etc.
##### https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git
## 2. Base Software Dependencies and Configuration
### 2.1. MySQL database
#### 2.1.1. Version: 5.7.28
#### 2.1.2. Configuration file:
###### Do not set bind-address to 127.0.0.1, since Pods on other nodes need to communicate with MySQL; adjust the file to match how MySQL is actually deployed.
```
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 0.0.0.0
skip_name_resolve
max_connections = 10000
slow_query_log = TRUE
slow_query_log_file = /var/log/mysql/slowquery.log
long_query_time = 0.1
log-queries-not-using-indexes
key_buffer_size = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover-options = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M
interactive_timeout=28800000
wait_timeout=28800000
log-bin=mysql-bin
binlog-format=Row
server-id=111
character-set-server=utf8mb4
[mysql]
default-character-set = utf8mb4
[client]
default-character-set = utf8mb4
```
#### 2.1.3. Database initialization
- ##### Grant user privileges
###### In this example the MySQL user is root with password seetatech, granted full privileges. You can narrow the privileges as appropriate, but the user must retain INSERT, DELETE, UPDATE, and SELECT. The host part here is %, meaning all hosts; you may restrict it to your internal subnet.
```sql
use mysql;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'seetatech';
FLUSH PRIVILEGES;
```
- ##### Create the databases
```sql
CREATE SCHEMA `autodl-core`;
CREATE SCHEMA `user-center`;
CREATE SCHEMA `kpl`;
CREATE SCHEMA `quota`;
```
- ##### Run the initialization SQL
###### The initialization SQL files are in the sql directory of the https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git project. Execute them in ascending numeric order, then add the following record:
```sql
INSERT INTO `autodl-core`.`service` (`service_id`, `noti_api`) VALUES ("kpl3", "http://kpl--monitor.kpl.svc.cluster.local:8920/status");
```
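###### To sanity-check the initialization from a shell (a sketch assuming the root/seetatech credentials above; replace <mysql_host> with your MySQL address):
```shell
# The four schemas should be listed and the service record present.
mysql -h <mysql_host> -uroot -pseetatech -e "SHOW DATABASES;"
mysql -h <mysql_host> -uroot -pseetatech -e "SELECT * FROM \`autodl-core\`.\`service\`;"
```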
### 2.2. Redis database
#### 2.2.1. Version: 5.0.6
#### 2.2.2. Configuration file
###### Do not set bind to 127.0.0.1. The platform relies on Redis keyspace notifications, so notify-keyspace-events "KEA" must be set (already present in the file below). requirepass sets the password to seetatech; change these as needed. A quick verification snippet follows the config.
```
daemonize yes
pidfile /var/run/redis/redis-server.pid
port 6379
tcp-backlog 511
bind 0.0.0.0
timeout 0
tcp-keepalive 60
loglevel notice
logfile /var/log/redis/redis-server.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
notify-keyspace-events "KEA"
requirepass seetatech
```
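###### A quick check that the keyspace-notification setting took effect (a sketch; assumes redis-cli on the same host and the seetatech password above):
```shell
# Expect a flag string equivalent to "KEA" (printed e.g. as AKE).
redis-cli -a seetatech config get notify-keyspace-events
```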
### 2.3. MongoDB database
#### 2.3.1. Version: 4.0.10
#### 2.3.2. Configuration file:
###### Do not set bind_ip to 127.0.0.1; auth=true enables authentication.
```
dbpath=/data/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port=27017
fork=true
auth=true
bind_ip=0.0.0.0
```
#### 2.3.3. Create the initial mongo user
###### In this example the user is admin, the password is admin, and the role is root.
```
use admin;
db.createUser(
  {
    user: "admin",
    pwd: "admin",
    roles: [ { role: "root", db: "admin" } ]
  }
)
```
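###### To confirm that authentication works, a sketch using the mongo shell client with the credentials above (replace <mongo_host> with your MongoDB address):
```shell
# A successful authenticated login drops into the mongo shell.
mongo --host <mongo_host> -u admin -p admin --authenticationDatabase admin
```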
### 2.4. Docker
#### 2.4.1. Version: 18.09.2
#### 2.4.2. Configuration files
###### Below is an example /etc/docker/daemon.json configuring registry mirrors and an insecure-registry address.
```json
{
  "registry-mirrors": ["https://hub-mirror.c.163.com", "https://rrkngb5t.mirror.aliyuncs.com"],
  "insecure-registries": ["192.168.1.32:5000"]
}
```
###### Below is the docker.service unit file. It sets the Docker data directory to /data/docker via --graph, and hooks Docker up to the flannel network plugin via EnvironmentFile=/run/flannel/subnet.env and the $DOCKER_NETWORK_OPTIONS variable (our flannel is installed from binaries; if yours is deployed differently, e.g. as a CNI plugin, see the official documentation for integrating it with Docker).
```
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target docker.socket firewalld.service
Wants=network-online.target
Requires=docker.socket
[Service]
Type=notify
EnvironmentFile=/run/flannel/subnet.env
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS --graph /data/docker
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
```
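###### After changing daemon.json or the unit file, reload systemd and restart Docker, then confirm the settings were picked up:
```shell
systemctl daemon-reload
systemctl restart docker
# "Registry Mirrors", "Insecure Registries" and "Docker Root Dir" should reflect the config.
docker info | grep -iA3 "registry"
docker info | grep "Docker Root Dir"
```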
### 2.5. Kubernetes
#### 2.5.1. Version: 1.15.5
#### 2.5.2. Other requirements
###### The cluster's network component is flannel.
###### The cluster's DNS component is coredns, and the corresponding svc must be named kube-dns.
###### The kubelet data directory is /data/kubelet; with a binary deployment this can be set via --root-dir, see the official documentation for other deployment methods.
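###### A quick check that the DNS service name matches what the platform expects (a sketch):
```shell
# coredns must be exposed through a svc named kube-dns in kube-system.
kubectl get svc -n kube-system kube-dns
```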
### 2.6. Nvidia-docker2
#### 2.6.1. Version: 2.2.1
#### 2.6.2. Notes
###### Nvidia-docker2 is only needed on GPU servers; do not install it on CPU servers.
###### Below is an example docker daemon.json from a host with Nvidia-docker2 installed:
```json
{
  "registry-mirrors": ["https://hub-mirror.c.163.com", "https://rrkngb5t.mirror.aliyuncs.com"],
  "insecure-registries": ["192.168.1.53:5000"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
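###### To verify that the nvidia runtime is the default, a sketch (the CUDA image tag is an assumption; any image that ships nvidia-smi will do):
```shell
systemctl restart docker
# GPUs should be listed from inside the container.
docker run --rm nvidia/cuda:10.0-base nvidia-smi
```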
### 2.7. Nvidia-device-plugin
###### Once the k8s cluster is up, save the following content as a yaml file and create it with kubectl apply -f. Check whether kubelet.sock exists under the hostPath /var/lib/kubelet/device-plugins in the yaml; if the kubelet data directory was changed to /data/kubelet, the path must match, i.e. /data/kubelet/device-plugins.
```yaml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # This toleration is deprecated. Kept here for backward compatibility
        # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
        - key: CriticalAddonsOnly
          operator: Exists
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: hub.kce.ksyun.com/kpl_k8s/k8s-device-plugin:1.0.0-beta4
          name: nvidia-device-plugin-ctr
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
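###### Once the DaemonSet is running, GPU nodes should advertise the nvidia.com/gpu resource (a sketch; replace <node_name>):
```shell
# Capacity/Allocatable should show a non-zero nvidia.com/gpu count on GPU nodes.
kubectl describe node <node_name> | grep -i "nvidia.com/gpu"
```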
### 2.8. Image registry
###### Either registry or harbor can be used as the image registry, for pushing and pulling docker images.
### 2.9. NFS shared storage
###### Server configuration:
```shell
$ apt install nfs-kernel-server -y
```
```
# vim /etc/exports and add:
/nfs_storage *(rw,async,no_root_squash)
# restart for the change to take effect:
service nfs-kernel-server restart
```
###### Client configuration: all k8s nodes must have the NFS client nfs-common installed, otherwise pods cannot mount the volumes.
```shell
$ apt install nfs-common -y
```
###### Notes:
###### Use async mode; with sync, decompressing uploaded dataset archives times out.
###### Enable no_root_squash to avoid permission problems. A manual mount test follows.
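###### A manual mount test from any k8s node (a sketch; replace <nfs_server> with the NFS server address):
```shell
# Mount the export, confirm root can write (no_root_squash in effect), then clean up.
mount -t nfs <nfs_server>:/nfs_storage /mnt
touch /mnt/.write_test && rm /mnt/.write_test
umount /mnt
```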
## 3. Platform Base Environment in k8s
#### 3.1. Create namespaces, serviceaccounts, secrets, PVs, PVCs, and other resources
###### The relevant configuration is in the kpl_base directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. The platform uses two namespaces, autodl and kpl; the subdirectories under kpl_base hold the resource configuration for each.
```shell
# Adjust kpl_base/autodl/4-pv_pvc and kpl_base/kpl/4-pv_pvc to match your actual NFS deployment.
# The yaml contains several NFS entries; for a single-node NFS, point every nfs_server and path at that one server. After editing, run:
$ kubectl apply -f kpl_base/autodl/
$ kubectl apply -f kpl_base/kpl/
```
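###### Before moving on, confirm the PVCs bind (a sketch):
```shell
# Every PVC in both namespaces should report STATUS Bound.
kubectl get pvc -n autodl
kubectl get pvc -n kpl
```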
#### 3.2. Create the registry imagePull secrets
###### Some users' registry projects are not public, so images cannot be pulled directly and imagePull secrets must be configured. Even if your registry needs no authentication, the secret must still exist because the deployment yaml references it; an empty one is fine. The secret name is harbor-secret.
```shell
kubectl create secret -n autodl docker-registry harbor-secret \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password>
```
```shell
kubectl create secret -n kpl docker-registry harbor-secret \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password>
```
#### 3.3. Label the nodes
##### 3.3.1. CPU node labels
```shell
kubectl label node <node_name> autodl=true kpl=true cpu=true user_job_node=true internal_service_node=true
```
##### 3.3.2. GPU node labels
```shell
kubectl label node <node_name> autodl=true kpl=true gpu=true cpu=true user_job_node=true internal_service_node=true
```
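###### To verify the labels (a sketch):
```shell
# Nodes carrying the platform labels should be listed.
kubectl get nodes -l kpl=true --show-labels
```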
## 4. Configuration Files: Overview and Creation
###### The configmap template repository is https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git. It contains templates for all configuration files the platform needs, which are rendered with the sed-config.sh script below.
###### The rendering script lives in the sed-configmap directory of the https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git project; sed-config.sh renders the configmaps. Adjust its variables to your actual environment; each variable is documented inside sed-config.sh.
```shell
$ sh sed-config.sh
$ kubectl apply -f {configmap directory}/autodl-core
$ kubectl apply -f {configmap directory}/kpl
```
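###### Check that the rendered configmaps were created (a sketch):
```shell
kubectl get configmap -n autodl
kubectl get configmap -n kpl
```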
## 5. Service Creation
#### 5.1. Location of the service yaml files
###### The yaml files for each component are under kpl_deploy_yaml in https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git; each component's subdirectory holds its service yaml files.
#### 5.2. Service image list
```shell
hb.seetatech.com/core/adl-core-v1:20200902205144
hb.seetatech.com/core/core--nginx:20200902205144
hb.seetatech.com/core/core--collector:20200902205144
hb.seetatech.com/seetaas/kpl-backend-v1:20200902205144
hb.seetatech.com/seetaas/kpl--nginx:20200902205144
hb.seetatech.com/seetaas/kpl--frontend:20200902205144
hb.seetatech.com/seetaas/kpl-stream-v1:20200902205144
```
#### 5.3. Deploy the services
###### Change the image in each yaml to your own image info (<registry address>/<project directory>/<image name and tag>), then deploy with kubectl apply -f:
```shell
$ kubectl apply -f kpl_deploy_yaml/1-autodl-core
$ kubectl apply -f kpl_deploy_yaml/2-kpl-frontend
$ kubectl apply -f kpl_deploy_yaml/3-kpl-backend
$ kubectl apply -f kpl_deploy_yaml/4-kpl-stream
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/volcano
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/kpl-launcher
```
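###### After applying, all pods should reach Running (a sketch):
```shell
kubectl get pods -n autodl
kubectl get pods -n kpl
```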
## 6. Service Port Exposure and Access
#### 6.1. Service ports outside the k8s cluster
###### Port 30180: services are exposed as NodePorts; the NodePort-type kpl--nginx-svc forwards traffic on port 30180.
###### Port 30205: a TCP service exposed as a NodePort, backed by the kpl--stream svc; it lets the platform open SSH sessions into containers.
#### 6.2. Platform access
```shell
http://<node_ip>:30180
```
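###### A quick reachability check from outside the cluster (a sketch; <node_ip> is any k8s node's address):
```shell
# Expect an HTTP response from the platform frontend.
curl -I http://<node_ip>:30180
```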