
Rebuilding Performance Data

When anomalies occur, such as performance statistics not matching expectations, the report data may be corrupted; follow this guide to rebuild it.

1 Impact Notes

During the performance rebuild, the system progressively creates brand-new performance reports. The process is incremental: data is re-analyzed in batches and added to the reports.

As a result, individual reports become available one by one during the rebuild; reports that are not yet available will show a notice that they are being rebuilt.

2 Rebuild Procedure

2.1 Get the ONES version number

See Get the ONES version number; which of the procedures below applies depends on your version.
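Once you know the version, the branch choice can be sketched as a simple version comparison. This is an illustrative helper, not part of the product: `ONES_VERSION` is a placeholder you would fill in yourself, and it relies on GNU `sort -V` for version-aware ordering.

```shell
# Hypothetical helper: decide which rebuild procedure applies.
# "6.104.2" is a placeholder; substitute your actual ONES version.
ONES_VERSION="6.104.2"

# sort -V orders version strings numerically; if 6.103.0 sorts first
# (or equal), the installed version is >= 6.103.0.
if [ "$(printf '%s\n' "6.103.0" "$ONES_VERSION" | sort -V | head -n1)" = "6.103.0" ]; then
  echo "use 'make rebuild-perf' (section 2.2)"
else
  echo "use reset_performance_k3s.sh (section 2.3)"
fi
```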

2.2 Version 6.103.0 and above

2.2.1 Rebuild

Enter the ones-ai-k8s terminal:

ones-ai-k8s.sh

Full rebuild:

make rebuild-perf

2.2.2 Check status

pod=$(kubectl get po -nones | grep bi-sync-etl | awk '{print $1}')
kubectl exec -it -nones "${pod}" -- etl/ones-bi-sync-etl -s status -c etl/config.json

2.3 Versions below 6.103.0

2.3.1 Download the script

curl -O https://res.ones.pro/script/reset_performance_k3s.sh 

For detailed usage, see the script's help output:

root@iZwz94kqmvp7aa5zkju609Z:~# sh reset_performance_k3s.sh 
usage:
[-a]: Reset all performance data, include ones-bi-sync-canal/ones-bi-sync-etl/kafka && clickhouse performance data.
[-s]: Show bi-sync-etl sync status.
[-o]: Optimize tables (task, field_value_history, manhour), to avoid the same data.
[-v]: Show version.

2.3.2 Full resynchronization (-a)

A full resynchronization is typically needed in the following cases:

  1. Some of the data shown in performance is unsynced, or differs from what project shows (possibly a program bug);

  2. The MySQL binlog tables that bi-sync-canal listens to were added or modified; a full resynchronization is required before the data lands in kafka for downstream consumption;

  3. There is dirty data in kafka.

After running -a, you can use -s to check the sync status:

root@iZwz94kqmvp7aa5zkju609Z:~# sh reset_performance_k3s.sh -a
---- start to reset all performance data...
localstorageStorageBasePath: /data/ones/ones-local-storage
KafkaReadAddress: kafka-ha:9092
kafkaProjectBinlogTopic: project_binlog

---- step 1 stop bi-sync-*-deployment:
kubectl -n ones scale deployment ones-bi-sync-canal-deployment ones-bi-sync-etl-deployment --replicas=0
deployment.apps/ones-bi-sync-canal-deployment scaled
deployment.apps/ones-bi-sync-etl-deployment scaled

---- step 2 delete topic project_binlog:
kubectl exec -it -nones kafka-ha-0 -- kafka-topics.sh --bootstrap-server kafka-ha:9092 --delete --topic project_binlog

---- step 3 delete ones-bi-sync-*/* files:
rm -rf /data/ones/ones-local-storage/others-static-pvc/ones/ones-bi-sync-*/*bolt

---- step 4 restart bi-sync-*-deployment
kubectl -n ones scale deployment ones-bi-sync-canal-deployment ones-bi-sync-etl-deployment --replicas=1
deployment.apps/ones-bi-sync-canal-deployment scaled
deployment.apps/ones-bi-sync-etl-deployment scaled

---- wait a few seconds for restart ...

---- reset performance success, now you can get bi-sync-etl status by '-s' argv

2.3.3 Fast resync of specified tables (-q)

(Effective in versions >= 6.1.0) Run this option when you need to resynchronize one or more specific tables.

Note that a table selected for resync may depend on other tables; if a dependency has not been synced, or the dependency's data is itself corrupted, the selected table may fail to sync.

Usage example: to resynchronize the task and sprint tables (separate multiple tables with commas, with no spaces in between):

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -q project.task,project.sprint
---- start to reset cdc performance data by the specified tables quickly ...
---- step 1 stop bi-sync-etl-deployment:
kubectl -n ones scale deployment ones-bi-sync-etl-deployment --replicas=0
deployment.apps/ones-bi-sync-etl-deployment scaled

---- step 2 set onesBISyncEtlQuickSnapshot* and make setup-ones
2024-07-01 02:19:44,110 [INFO] ones_path=, k8s_root_dir=/data/ones/ones-ai-k8s
2024-07-01 02:19:44,111 [INFO] waiting for lock...
2024-07-01 02:19:44,111 [INFO] starting...
2024-07-01 02:19:44,111 [INFO] render config
… (some output omitted here)
2024-07-01 02:20:32,711 [INFO] setup ones finish
2024-07-01 02:20:32,711 [INFO] elapsed time: 48.600 seconds
---- set config/private.yaml success:
onesBISyncEtlQuickSnapshotTables: project.task,project.sprint
onesBISyncEtlQuickSnapshotVersion: 20240701101943

---- step 3 redeploy ones-bi-sync-etl-deployment
kubectl -n ones scale deployment ones-bi-sync-etl-deployment --replicas=1
deployment.apps/ones-bi-sync-etl-deployment scaled

---- wait a few seconds for restart ...

---- reset performance success ----

After the etl service restarts, its log shows that the task and sprint tables were snapshotted again and re-consumed:

... ...
2024/07/01 10:30:22 [INFO] Waiting for preorder snapshots to complete...
2024/07/01 10:30:32 [INFO] Topic initialization completed.
2024/07/01 10:30:32 [INFO] Connector initialization successful.
2024/07/01 10:30:32 [INFO] Consumer quick snapshot: {20240701102842 [project.task_status]}, force snapshot: {1 []}
2024/07/01 10:30:32 [INFO] Consumer quick snapshot all: false, force snapshot all: false
2024/07/01 10:30:32 [INFO] Consumers prepare tables for snapshots: []
2024/07/01 10:30:32 [INFO] The snapshot record message is: &{Timestamp:1719801837520 AppName:bi-sync-etl_project_data}, table: project.task
2024/07/01 10:30:32 [INFO] The snapshot record message is: &{Timestamp:1719801837520 AppName:bi-sync-etl_project_data}, table: project.sprint
2024/07/01 10:30:32 [INFO] The tables ready to execute snapshots are: [project.task_status]
2024/07/01 10:30:46 [INFO] Consumer bi-sync-etl_project_data start consuming the schema event stream
... ...

2.3.4 Check sync status (-s)

This refers to the sync status of the ones-bi-sync-etl service. Usage: the -s option (can be run repeatedly).

When ONES version >= 6.1.0:

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s // run repeatedly to see sync progress
[project_data] running, dumping // still preparing to sync; real-time progress may not be available yet
[department] running, dump finished, incremental syncing

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s
[project_data] running, dumping, current table: org_user at 1 / 26 (3%) // existing data syncing; real-time progress visible
[department] running, dump finished, incremental syncing

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s
[project_data] running, dumping, current table: project at 6 / 26 (23%)
[department] running, dump finished, incremental syncing

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s
[project_data] running, dumping, current table: field_value at 16 / 26 (61%)
[department] running, dump finished

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s
[project_data] running, dumping, current table: manhour at 24 / 26 (92%)
[department] running, dump finished, incremental syncing

root@iZwz90sbtj7ak4ns2gr2heZ:~# sh reset_performance_k3s.sh -s
// all pipelines have finished syncing existing data and entered normal incremental sync
[project_data] running, dump finished, incremental syncing
[department] running, dump finished, incremental syncing

When ONES version < 6.1.0:

root@iZwz94kqmvp7aa5zkju609Z:~# sh reset_performance_k3s.sh -s
Defaulted container "ones-bi-sync-etl" out of: ones-bi-sync-etl, wait-for-mysql (init), wait-for-clickhouse (init), wait-for-kafka (init)
[project_data] running, dumping, prepared, 379963/594657 (63%)
[department] running, dumping, prepared, 594652/594657 (99%)

Criteria for a completed rebuild:

Both pipeline groups, [project_data] and [department], must reach the 99% or 100% state. (99% can be considered complete as well, because the customer environment may keep producing new data during the sync, making a full 100% unreachable.)
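The completion criteria above can be sketched as a small filter over the `-s` output. This is an illustrative helper, not part of the official script: it treats a pipeline as finished when its line reports "dump finished" (the >= 6.1.0 format) or a 99%/100% figure (the < 6.1.0 format), and you would feed it via `sh reset_performance_k3s.sh -s | sync_complete`.

```shell
# Hypothetical helper: decide from the '-s' status lines on stdin
# whether the full sync is done. Prints "complete" or "in progress".
sync_complete() {
  awk '
    /dump finished/ { next }          # pipeline finished its dump
    / \((99|100)%\)/ { next }         # 99% or 100% also counts as done
    { pending = 1 }                   # any other line is still syncing
    END { print (pending ? "in progress" : "complete") }
  '
}
```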

2.3.5 Force-merge table data (-o)

After the sync commands above (-p/-a) finish, refreshing the page may show duplicate rows, because ClickHouse has not yet had time to merge the MergeTree data parts.

The service will run this optimization automatically no later than the early morning hours each day (server time zone), but to avoid confusing users you can also run the -o option right after the -a sync completes to merge the data:

root@iZwz94kqmvp7aa5zkju609Z:~# sh reset_performance_k3s.sh -o
clickhouseUser: default
clickhousePassword: ****
clickhousePortTCP: 9000
OPTIMIZE TABLE default.task FINAL success
OPTIMIZE TABLE default.field_value_history FINAL success
OPTIMIZE TABLE default.manhour FINAL success
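Judging from the output above, `-o` amounts to issuing `OPTIMIZE TABLE ... FINAL` for each affected table. The sketch below is an assumption about that mechanism, not taken from the script itself; the pod name `clickhouse-0` and the `clickhouse-client` invocation in the comment are illustrative.

```shell
# Sketch (assumed mechanism): build an OPTIMIZE statement per table.
for table in task field_value_history manhour; do
  echo "OPTIMIZE TABLE default.${table} FINAL"
  # In a real environment you would run it against ClickHouse, e.g.:
  # kubectl exec -it -nones clickhouse-0 -- \
  #   clickhouse-client --query "OPTIMIZE TABLE default.${table} FINAL"
done
```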

3 FAQ

3.1 cdc errors during a full rebuild

See section 1 of the FAQ in Rebuilding Index Data.

3.2 etl errors during a full rebuild

During the rebuild, bash reset_performance.sh -s keeps reporting dumping, and the etl pod's log shows the following error:

[ERROR] 2026-01-05 15:03:54 etl/watchdog-go:86 run rule group: project_data error: code: 241, message: (total) memory limit exceeded: would use 2.97 GiB (attempt to allocate chunk of 4533018 bytes), current RSS 3.60 GiB, maximum: 3.60 GiB. OvercommitTracker decision: Query was selected to stop by OvercommitTracker

The cause is insufficient ClickHouse memory; raise the limit from the default 4G to 8G:

# Enter the ones-ai-k8s terminal
ones-ai-k8s.sh

# Add the configuration
vi config/private.yaml
clickhouseMemoryLimit: 8Gi

# Apply the configuration
make setup-ones

# Rebuild from scratch
make rebuild-cdc