Commit 0b6a24a5 (parent 7fd24e8a), authored Sep 02, 2020 by wangtingwei: "增加文档" (add documentation). Adds 平台部署文档.md — 1 changed file, 488 additions, 0 deletions.
# Platform Deployment Guide
## 1. Project Repositories Involved
#### 1.1. ConfigMap configuration-file templates
##### https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git
##### Note: configuration files are mounted externally as k8s ConfigMaps. Variable fields in the files, such as the MySQL connection address, are templated; the templates are rendered by substituting the variables with sed.
#### 1.2. Files required for platform deployment
##### Note: includes the yaml files for deploying the platform to the k8s cluster, the initialization SQL statements, the ConfigMap rendering script, etc.
https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git
## 2. Base Software Dependencies and Configuration
### 2.1. MySQL
#### 2.1.1. Version: 5.7.28
#### 2.1.2. Configuration file:
###### Do not set bind-address to 127.0.0.1, because Pods on other nodes must be able to reach MySQL; adjust the file below to match your actual MySQL deployment.
```
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 0.0.0.0
skip_name_resolve
max_connections = 10000
slow_query_log = TRUE
slow_query_log_file = /var/log/mysql/slowquery.log
long_query_time = 0.1
log_queries_not_using_indexes = 0
key_buffer_size = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover-options = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M
interactive_timeout=28800000
wait_timeout=28800000
log-bin=mysql-bin
binlog-format=Row
server-id=111
character-set-server=utf8mb4
[mysql]
default-character-set = utf8mb4
[client]
default-character-set = utf8mb4
```
#### 2.1.3. Database initialization
- ##### Grant user privileges
###### In this example the MySQL user is root with password seetatech, granted full privileges. You can narrow the privileges to fit your situation, but the user must at least be able to insert, delete, update, and select. The host granted access is '%' (all hosts); it can be restricted to your internal subnet.
```sql
use mysql;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' identified by 'seetatech';
flush privileges;
```
- ##### Create the databases
```sql
CREATE SCHEMA `autodl-core`;
CREATE SCHEMA `user-center`;
CREATE SCHEMA `kpl`;
CREATE SCHEMA `quota`;
```
- ##### Run the initialization SQL
###### The initialization SQL files are in the sql directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. Execute them in ascending numeric order, then add the following record:
```sql
INSERT INTO `autodl-core`.`service` (`service_id`, `noti_api`)
VALUES ("kpl3", "http://kpl--monitor.kpl.svc.cluster.local:8920/status");
```
### 2.2. Redis
#### 2.2.1. Version: 5.0.6
#### 2.2.2. Configuration file
###### Do not set bind to 127.0.0.1. The platform relies on Redis keyspace notifications, which require notify-keyspace-events "KEA"; it is already present in the config below. requirepass sets the password to seetatech; change it as needed.
```
daemonize yes
pidfile /var/run/redis/redis-server.pid
port 6379
tcp-backlog 511
bind 0.0.0.0
timeout 0
tcp-keepalive 60
loglevel notice
logfile /var/log/redis/redis-server.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
notify-keyspace-events "KEA"
requirepass seetatech
```
### 2.3. MongoDB
#### 2.3.1. Version: 4.0.10
#### 2.3.2. Configuration file:
###### Do not set bind_ip to 127.0.0.1; auth=true enables authentication.
```
dbpath=/data/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port=27017
fork=true
auth=true
bind_ip=0.0.0.0
```
#### 2.3.3. Create the initial MongoDB user
###### In this example the user is admin, the password is admin, and the role is root.
```
use admin;
db.createUser({
  user: "admin",
  pwd: "admin",
  roles: [ { role: "root", db: "admin" } ]
})
```
### 2.4. Docker
#### 2.4.1. Version: 18.09.2
#### 2.4.2. Configuration files
###### Example /etc/docker/daemon.json configuring registry mirrors and an insecure-registry address:
```json
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://rrkngb5t.mirror.aliyuncs.com"
  ],
  "insecure-registries": ["192.168.1.32:5000"]
}
```
###### The docker.service unit file below sets the Docker data directory to /data/docker via --graph. EnvironmentFile=/run/flannel/subnet.env together with the $DOCKER_NETWORK_OPTIONS parameter wires Docker to the flannel network plugin (our flannel is installed from binaries; if yours is installed another way, e.g. as a CNI plugin, see the official documentation on integrating it with Docker).
```
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target docker.socket firewalld.service
Wants=network-online.target
Requires=docker.socket
[Service]
Type=notify
EnvironmentFile=/run/flannel/subnet.env
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS --graph /data/docker
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
```
### 2.5. Kubernetes
#### 2.5.1. Version: 1.15.5
#### 2.5.2. Other requirements
###### The cluster's network component is flannel.
###### The cluster's DNS component is CoreDNS, and its Service must be named kube-dns.
###### The kubelet data directory is /data/kubelet; in a binary deployment this is set with --root-dir, for other deployment methods see the official docs.
### 2.6. nvidia-docker2
#### 2.6.1. Version: 2.2.1
#### 2.6.2. Notes
###### Install nvidia-docker2 only on GPU servers; do not install it on CPU servers.
###### Example docker daemon.json on a server with nvidia-docker2 installed:
```json
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://rrkngb5t.mirror.aliyuncs.com"
  ],
  "insecure-registries": ["192.168.1.53:5000"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
### 2.7. nvidia-device-plugin
###### Once the k8s cluster is up, save the following content as a yaml file and create it with kubectl apply -f. Check whether kubelet.sock exists under the hostPath /var/lib/kubelet/device-plugins in the yaml; if the kubelet data directory was changed to /data/kubelet, the path must match it: /data/kubelet/device-plugins.
```yaml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: hub.kce.ksyun.com/kpl_k8s/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
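If the kubelet data directory was moved to /data/kubelet, both device-plugins paths in the manifest must follow. A single sed substitution covers this; the example below demonstrates it on a one-line sample, and you would run the same expression (with -i) against the manifest file you saved, whose name is your own choice:

```shell
# Rewrite the kubelet device-plugins path; shown on a sample line here.
# Against a saved manifest: sed -i 's#/var/lib/kubelet#/data/kubelet#g' <file>.yaml
echo 'path: /var/lib/kubelet/device-plugins' |
  sed 's#/var/lib/kubelet#/data/kubelet#'
# → path: /data/kubelet/device-plugins
```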
### 2.8. Image registry
###### The image registry can be registry or Harbor; it is used for pushing and pulling docker images.
### 2.9. NFS shared storage
###### Server-side configuration:
```shell
$ apt install nfs-kernel-server -y
```
###### Edit /etc/exports and add:
```
/nfs_storage *(rw,async,no_root_squash)
```
###### Restart for the export to take effect:
```shell
$ service nfs-kernel-server restart
```
###### Client configuration: the NFS client (nfs-common) must be installed on every k8s node, otherwise Pods cannot mount the volumes.
```shell
$ apt install nfs-common -y
```
###### Notes:
###### async mode is used because otherwise extraction of uploaded dataset archives times out.
###### no_root_squash is enabled to avoid permission problems.
## 3. Configure the Platform's Base Environment in k8s
#### 3.1. Create namespace, serviceaccount, secrets, PV, PVC, and other resources
###### The relevant configs are in the kpl_base directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. The platform uses two namespaces, autodl and kpl; the subdirectories of kpl_base hold the resource configs for each namespace.
```shell
# Adjust kpl_base/autodl/4-pv_pvc and kpl_base/kpl/4-pv_pvc to your actual NFS deployment.
# The yaml contains several NFS entries; for a single-node NFS, point nfs_server and path
# at that one server. After editing, run:
$ kubectl apply -f kpl_base/autodl/
$ kubectl apply -f kpl_base/kpl/
```
#### 3.2. Create the image-registry imagePull secret
###### Some users' registry projects are not public, so images cannot be pulled directly and an imagePull secret must be configured. Even if your registry needs no authentication, the secret must still exist (an empty one is fine) because the deployment yaml already references it. The secret is named harbor-secret.
```shell
kubectl create secret -n autodl docker-registry \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password> \
  harbor-secret
```
```shell
kubectl create secret -n kpl docker-registry \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password> \
  harbor-secret
```
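The secret kubectl generates stores a .dockerconfigjson whose auth field is the base64 encoding of "user:password". A quick local check of what will be embedded; the credentials below are made up for illustration:

```shell
# .dockerconfigjson embeds auth = base64("<user>:<password>").
# Hypothetical credentials, only to show the encoding:
printf 'myuser:mypass' | base64
# → bXl1c2VyOm15cGFzcw==
```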
#### 3.3. Label the nodes
##### 3.3.1. CPU node labels
```shell
kubectl label node <node name> autodl=true kpl=true cpu=true user_job_node=true internal_service_node=true
```
##### 3.3.2. GPU node labels
```shell
kubectl label node <node name> autodl=true kpl=true gpu=true cpu=true user_job_node=true internal_service_node=true
```
## 4. Configuration Files: Overview and Creation
###### The ConfigMap template repository is https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git; it contains templates for every configuration file the platform needs, which must be rendered with the sed-config.sh script below.
###### The rendering script is sed-config.sh, in the sed-configmap directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. Change its variables to match your environment; the variables are documented inside sed-config.sh.
```shell
$ sh sed-config.sh
$ kubectl apply -f {configmap directory}/autodl-core
$ kubectl apply -f {configmap directory}/kpl
```
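As a sketch of the sed-based rendering the script performs: a template field is replaced by an environment-specific value to produce the final config. The placeholder name {{MYSQL_HOST}} and the file names below are invented for illustration, not the repo's actual template fields:

```shell
# Minimal sketch of sed-based template rendering (placeholder and file
# names are hypothetical, not the repo's actual templates).
MYSQL_HOST=192.168.1.10
printf 'mysql_addr: {{MYSQL_HOST}}\n' > configmap.tmpl
sed "s/{{MYSQL_HOST}}/${MYSQL_HOST}/g" configmap.tmpl > configmap.yaml
cat configmap.yaml
# → mysql_addr: 192.168.1.10
```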
## 5. Service Creation
#### 5.1. Location of the service yaml files
###### The yaml files for each component are under kpl_deploy_yaml in https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git; each component directory contains its service yaml files.
#### 5.2. Service image list
```
hb.seetatech.com/core/adl-core-v1:20200902205144
hb.seetatech.com/core/core--nginx:20200902205144
hb.seetatech.com/core/core--collector:20200902205144
hb.seetatech.com/seetaas/kpl-backend-v1:20200902205144
hb.seetatech.com/seetaas/kpl--nginx:20200902205144
hb.seetatech.com/seetaas/kpl--frontend:20200902205144
hb.seetatech.com/seetaas/kpl-stream-v1:20200902205144
```
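If these images are mirrored into your own registry before deployment, the retagging follows a simple prefix-swap pattern; the registry host 192.168.1.32:5000 below is hypothetical:

```shell
# Derive the target tag by swapping the source registry prefix for your own.
MY_REG=192.168.1.32:5000
SRC=hb.seetatech.com/core/adl-core-v1:20200902205144
DST="${MY_REG}/${SRC#hb.seetatech.com/}"   # strip the source registry prefix
echo "$DST"
# → 192.168.1.32:5000/core/adl-core-v1:20200902205144
# then: docker pull "$SRC" && docker tag "$SRC" "$DST" && docker push "$DST"
```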
#### 5.3. Deploy the services
###### Change image in each yaml to your own image information (<registry address>/<project directory>/<image name and tag>), then deploy with kubectl apply -f:
```shell
$ kubectl apply -f kpl_deploy_yaml/1-autodl-core
$ kubectl apply -f kpl_deploy_yaml/2-kpl-frontend
$ kubectl apply -f kpl_deploy_yaml/3-kpl-backend
$ kubectl apply -f kpl_deploy_yaml/4-kpl-stream
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/volcano
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/kpl-launcher
```
## 6. Service Port Exposure and Access
#### 6.1. Ports exposed outside the k8s cluster
###### Port 30180: the web service is exposed as a NodePort; traffic is forwarded through port 30180 of the NodePort-type Service kpl--nginx-svc.
###### Port 30205: a TCP service exposed as a NodePort, the Service for kpl--stream; it lets the platform open SSH sessions into containers.
#### 6.2. Platform URL
```
http://<node_ip>:30180
```