Commit 0b6a24a5 (parent 7fd24e8a), authored Sep 02, 2020 by wangtingwei: "增加文档" (add documentation). Adds 平台部署文档.md — 1 changed file, 488 additions, 0 deletions.
# Platform Deployment Guide
## 1. Project Repositories Involved
#### 1.1. ConfigMap configuration-file templates
##### https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git
##### Note: configuration files are mounted externally as k8s ConfigMaps. Variable fields in the files, such as the MySQL connection address, are templated; the templates are rendered by substituting the variables with sed.
#### 1.2. Files required for platform deployment
##### Note: includes the yaml files for deploying the platform to the k8s cluster, the initialization SQL statements, the ConfigMap rendering script, etc.
https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git
## 2. Base Software Dependencies and Configuration
### 2.1. MySQL
#### 2.1.1. Version: 5.7.28
#### 2.1.2. Configuration file:
###### Do not set bind-address to 127.0.0.1, because Pods on other nodes must be able to reach MySQL; adjust the file below to match your actual MySQL deployment.
```
[mysqld_safe]
socket = /var/run/mysqld/mysqld.sock
nice = 0
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
bind-address = 0.0.0.0
skip_name_resolve
max_connections = 10000
slow_query_log = TRUE
slow_query_log_file = /var/log/mysql/slowquery.log
long_query_time = 0.1
log_queries_not_using_indexes = 0
key_buffer_size = 16M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 8
myisam-recover-options = BACKUP
query_cache_limit = 1M
query_cache_size = 16M
log_error = /var/log/mysql/error.log
expire_logs_days = 10
max_binlog_size = 100M
interactive_timeout=28800000
wait_timeout=28800000
log-bin=mysql-bin
binlog-format=Row
server-id=111
character-set-server=utf8mb4
[mysql]
default-character-set = utf8mb4
[client]
default-character-set = utf8mb4
```
#### 2.1.3. Database initialization
- ##### Grant user privileges
###### In this example the MySQL user is root with password seetatech, granted full privileges. You can narrow the privileges to fit your situation, but the user must at least be able to insert, delete, update, and select. The host granted access is '%' (all hosts); it can be restricted to your internal subnet.
```sql
use mysql;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' identified by 'seetatech';
flush privileges;
```
- ##### Create the databases
```sql
CREATE SCHEMA `autodl-core`;
CREATE SCHEMA `user-center`;
CREATE SCHEMA `kpl`;
CREATE SCHEMA `quota`;
```
- ##### Run the initialization SQL
###### The initialization SQL files are in the sql directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. Execute them in ascending numeric order, then add the following record:
```sql
INSERT INTO `autodl-core`.`service` (`service_id`, `noti_api`)
VALUES ("kpl3", "http://kpl--monitor.kpl.svc.cluster.local:8920/status");
```
### 2.2. Redis
#### 2.2.1. Version: 5.0.6
#### 2.2.2. Configuration file
###### Do not set bind to 127.0.0.1. The platform relies on Redis keyspace notifications, which require notify-keyspace-events "KEA"; it is already present in the config below. requirepass sets the password to seetatech; change it as needed.
```
daemonize yes
pidfile /var/run/redis/redis-server.pid
port 6379
tcp-backlog 511
bind 0.0.0.0
timeout 0
tcp-keepalive 60
loglevel notice
logfile /var/log/redis/redis-server.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
notify-keyspace-events "KEA"
requirepass seetatech
```
### 2.3. MongoDB
#### 2.3.1. Version: 4.0.10
#### 2.3.2. Configuration file:
###### Do not set bind_ip to 127.0.0.1; auth=true enables authentication.
```
dbpath=/data/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port=27017
fork=true
auth=true
bind_ip=0.0.0.0
```
#### 2.3.3. Create the initial MongoDB user
###### In this example the user is admin, the password is admin, and the role is root.
```
use admin;
db.createUser({
  user: "admin",
  pwd: "admin",
  roles: [ { role: "root", db: "admin" } ]
})
```
### 2.4. Docker
#### 2.4.1. Version: 18.09.2
#### 2.4.2. Configuration files
###### Example /etc/docker/daemon.json configuring registry mirrors and an insecure-registry address:
```json
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://rrkngb5t.mirror.aliyuncs.com"
  ],
  "insecure-registries": ["192.168.1.32:5000"]
}
```
###### The docker.service unit file below sets the Docker data directory to /data/docker via --graph. EnvironmentFile=/run/flannel/subnet.env together with the $DOCKER_NETWORK_OPTIONS parameter wires Docker to the flannel network plugin (our flannel is installed from binaries; if yours is installed another way, e.g. as a CNI plugin, see the official documentation on integrating it with Docker).
```
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target docker.socket firewalld.service
Wants=network-online.target
Requires=docker.socket
[Service]
Type=notify
EnvironmentFile=/run/flannel/subnet.env
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS --graph /data/docker
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Delegate=yes
KillMode=process
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s
[Install]
WantedBy=multi-user.target
```
### 2.5. Kubernetes
#### 2.5.1. Version: 1.15.5
#### 2.5.2. Other requirements
###### The cluster's network component is flannel.
###### The cluster's DNS component is CoreDNS, and its Service must be named kube-dns.
###### The kubelet data directory is /data/kubelet; in a binary deployment this is set with --root-dir, for other deployment methods see the official docs.
### 2.6. nvidia-docker2
#### 2.6.1. Version: 2.2.1
#### 2.6.2. Notes
###### Install nvidia-docker2 only on GPU servers; do not install it on CPU servers.
###### Example docker daemon.json on a server with nvidia-docker2 installed:
```json
{
  "registry-mirrors": [
    "https://hub-mirror.c.163.com",
    "https://rrkngb5t.mirror.aliyuncs.com"
  ],
  "insecure-registries": ["192.168.1.53:5000"],
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
### 2.7. nvidia-device-plugin
###### Once the k8s cluster is up, save the following content as a yaml file and create it with kubectl apply -f. Check whether kubelet.sock exists under the hostPath /var/lib/kubelet/device-plugins in the yaml; if the kubelet data directory was changed to /data/kubelet, the path must match it: /data/kubelet/device-plugins.
```yaml
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: hub.kce.ksyun.com/kpl_k8s/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```
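If the kubelet data directory was moved to /data/kubelet, both device-plugins paths in the manifest must follow. A single sed substitution covers this; the example below demonstrates it on a one-line sample, and you would run the same expression (with -i) against the manifest file you saved, whose name is your own choice:

```shell
# Rewrite the kubelet device-plugins path; shown on a sample line here.
# Against a saved manifest: sed -i 's#/var/lib/kubelet#/data/kubelet#g' <file>.yaml
echo 'path: /var/lib/kubelet/device-plugins' |
  sed 's#/var/lib/kubelet#/data/kubelet#'
# → path: /data/kubelet/device-plugins
```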
### 2.8. Image registry
###### The image registry can be registry or Harbor; it is used for pushing and pulling docker images.
### 2.9. NFS shared storage
###### Server-side configuration:
```shell
$ apt install nfs-kernel-server -y
```
###### Edit /etc/exports and add:
```
/nfs_storage *(rw,async,no_root_squash)
```
###### Restart for the export to take effect:
```shell
$ service nfs-kernel-server restart
```
###### Client configuration: the NFS client (nfs-common) must be installed on every k8s node, otherwise Pods cannot mount the volumes.
```shell
$ apt install nfs-common -y
```
###### Notes:
###### async mode is used because otherwise extraction of uploaded dataset archives times out.
###### no_root_squash is enabled to avoid permission problems.
## 3. Configure the Platform's Base Environment in k8s
#### 3.1. Create namespace, serviceaccount, secrets, PV, PVC, and other resources
###### The relevant configs are in the kpl_base directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. The platform uses two namespaces, autodl and kpl; the subdirectories of kpl_base hold the resource configs for each namespace.
```shell
# Adjust kpl_base/autodl/4-pv_pvc and kpl_base/kpl/4-pv_pvc to your actual NFS deployment.
# The yaml contains several NFS entries; for a single-node NFS, point nfs_server and path
# at that one server. After editing, run:
$ kubectl apply -f kpl_base/autodl/
$ kubectl apply -f kpl_base/kpl/
```
#### 3.2. Create the image-registry imagePull secret
###### Some users' registry projects are not public, so images cannot be pulled directly and an imagePull secret must be configured. Even if your registry needs no authentication, the secret must still exist (an empty one is fine) because the deployment yaml already references it. The secret is named harbor-secret.
```shell
kubectl create secret -n autodl docker-registry \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password> \
  harbor-secret
```
```shell
kubectl create secret -n kpl docker-registry \
  --docker-server=<registry address> \
  --docker-email=<account email> \
  --docker-username=<registry user> \
  --docker-password=<registry password> \
  harbor-secret
```
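The secret kubectl generates stores a .dockerconfigjson whose auth field is the base64 encoding of "user:password". A quick local check of what will be embedded; the credentials below are made up for illustration:

```shell
# .dockerconfigjson embeds auth = base64("<user>:<password>").
# Hypothetical credentials, only to show the encoding:
printf 'myuser:mypass' | base64
# → bXl1c2VyOm15cGFzcw==
```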
#### 3.3. Label the nodes
##### 3.3.1. CPU node labels
```shell
kubectl label node <node name> autodl=true kpl=true cpu=true user_job_node=true internal_service_node=true
```
##### 3.3.2. GPU node labels
```shell
kubectl label node <node name> autodl=true kpl=true gpu=true cpu=true user_job_node=true internal_service_node=true
```
## 4. Configuration Files: Overview and Creation
###### The ConfigMap template repository is https://tingweiwang@gitlab.seetatech.com/tingweiwang/configmap.git; it contains templates for every configuration file the platform needs, which must be rendered with the sed-config.sh script below.
###### The rendering script is sed-config.sh, in the sed-configmap directory of https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git. Change its variables to match your environment; the variables are documented inside sed-config.sh.
```shell
$ sh sed-config.sh
$ kubectl apply -f {configmap directory}/autodl-core
$ kubectl apply -f {configmap directory}/kpl
```
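As a sketch of the sed-based rendering the script performs: a template field is replaced by an environment-specific value to produce the final config. The placeholder name {{MYSQL_HOST}} and the file names below are invented for illustration, not the repo's actual template fields:

```shell
# Minimal sketch of sed-based template rendering (placeholder and file
# names are hypothetical, not the repo's actual templates).
MYSQL_HOST=192.168.1.10
printf 'mysql_addr: {{MYSQL_HOST}}\n' > configmap.tmpl
sed "s/{{MYSQL_HOST}}/${MYSQL_HOST}/g" configmap.tmpl > configmap.yaml
cat configmap.yaml
# → mysql_addr: 192.168.1.10
```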
## 5. Service Creation
#### 5.1. Location of the service yaml files
###### The yaml files for each component are under kpl_deploy_yaml in https://tingweiwang@gitlab.seetatech.com/tingweiwang/ksy-project-docking.git; each component directory contains its service yaml files.
#### 5.2. Service image list
```
hb.seetatech.com/core/adl-core-v1:20200902205144
hb.seetatech.com/core/core--nginx:20200902205144
hb.seetatech.com/core/core--collector:20200902205144
hb.seetatech.com/seetaas/kpl-backend-v1:20200902205144
hb.seetatech.com/seetaas/kpl--nginx:20200902205144
hb.seetatech.com/seetaas/kpl--frontend:20200902205144
hb.seetatech.com/seetaas/kpl-stream-v1:20200902205144
```
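If these images are mirrored into your own registry before deployment, the retagging follows a simple prefix-swap pattern; the registry host 192.168.1.32:5000 below is hypothetical:

```shell
# Derive the target tag by swapping the source registry prefix for your own.
MY_REG=192.168.1.32:5000
SRC=hb.seetatech.com/core/adl-core-v1:20200902205144
DST="${MY_REG}/${SRC#hb.seetatech.com/}"   # strip the source registry prefix
echo "$DST"
# → 192.168.1.32:5000/core/adl-core-v1:20200902205144
# then: docker pull "$SRC" && docker tag "$SRC" "$DST" && docker push "$DST"
```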
#### 5.3. Deploy the services
###### Change image in each yaml to your own image information (<registry address>/<project directory>/<image name and tag>), then deploy with kubectl apply -f:
```shell
$ kubectl apply -f kpl_deploy_yaml/1-autodl-core
$ kubectl apply -f kpl_deploy_yaml/2-kpl-frontend
$ kubectl apply -f kpl_deploy_yaml/3-kpl-backend
$ kubectl apply -f kpl_deploy_yaml/4-kpl-stream
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/volcano
$ kubectl apply -f kpl_deploy_yaml/5-kpl-launcher/kpl-launcher
```
## 6. Service Port Exposure and Access
#### 6.1. Ports exposed outside the k8s cluster
###### Port 30180: the web service is exposed as a NodePort; traffic is forwarded through port 30180 of the NodePort-type Service kpl--nginx-svc.
###### Port 30205: a TCP service exposed as a NodePort, the Service for kpl--stream; it lets the platform open SSH sessions into containers.
#### 6.2. Platform URL
```
http://<node_ip>:30180
```