2025-01-08

MetaSpore - Startup

Packaging

First compile the C++ library to produce metaspore.so.
Then use setuptools and wheel to package the Python library.

Build

Images

Environment variables

export REPOSITORY={hub-repo}
export VERSION={version}

Build the Dev image (base environment)

DOCKER_BUILDKIT=1 docker build --network host --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} --build-arg RUNTIME=gpu -f docker/ubuntu20.04/Dockerfile_dev -t $REPOSITORY/metaspore-dev-gpu:${VERSION} .

DOCKER_BUILDKIT=1 docker build --network host --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} --build-arg RUNTIME=cpu -f docker/ubuntu20.04/Dockerfile_dev -t $REPOSITORY/metaspore-dev-cpu:${VERSION} .

Serving
- Build image (compiled on top of the Dev image)
- Service image

Training

Build image (compiled on top of the Dev image)

DOCKER_BUILDKIT=1 docker build --network host --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} -f docker/ubuntu20.04/Dockerfile_training_build --build-arg DEV_IMAGE=$REPOSITORY/metaspore-dev-cpu:${VERSION} -t $REPOSITORY/metaspore-training-build:${VERSION} .

Spark Training image

DOCKER_BUILDKIT=1 docker build --network host --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} -f docker/ubuntu20.04/Dockerfile_training_release --build-arg METASPORE_RELEASE=build --build-arg METASPORE_BUILD_IMAGE=$REPOSITORY/metaspore-training-build:${VERSION} -t $REPOSITORY/metaspore-training-release:${VERSION} --target release .

Jupyter

DOCKER_BUILDKIT=1 docker build --network host --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} -f docker/ubuntu20.04/Dockerfile_jupyter --build-arg RELEASE_IMAGE=$REPOSITORY/metaspore-training-release:${VERSION} -t $REPOSITORY/metaspore-training-jupyter:${VERSION} docker/ubuntu20.04
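The training images above depend on each other (Dev, then Build, then Release, then Jupyter), so the order of the commands matters. The sketch below only assembles the same command lines in that order; the helper names `docker_build_cmd` and `training_image_chain` are hypothetical, the proxy build args are left out for brevity, and nothing is executed.

```python
import shlex


def docker_build_cmd(dockerfile, tag, build_args=None, target=None, context="."):
    """Return the argv list for one `docker build` call, mirroring the commands above."""
    cmd = ["docker", "build", "--network", "host", "-f", dockerfile]
    for key, value in (build_args or {}).items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["-t", tag]
    if target:
        cmd += ["--target", target]
    cmd.append(context)
    return cmd


def training_image_chain(repository, version):
    """Commands for the CPU training chain, in dependency order."""
    base = "docker/ubuntu20.04"
    dev = f"{repository}/metaspore-dev-cpu:{version}"
    build = f"{repository}/metaspore-training-build:{version}"
    release = f"{repository}/metaspore-training-release:{version}"
    jupyter = f"{repository}/metaspore-training-jupyter:{version}"
    return [
        docker_build_cmd(f"{base}/Dockerfile_dev", dev, {"RUNTIME": "cpu"}),
        docker_build_cmd(f"{base}/Dockerfile_training_build", build, {"DEV_IMAGE": dev}),
        docker_build_cmd(
            f"{base}/Dockerfile_training_release",
            release,
            {"METASPORE_RELEASE": "build", "METASPORE_BUILD_IMAGE": build},
            target="release",
        ),
        # The Jupyter image uses docker/ubuntu20.04 as its build context.
        docker_build_cmd(f"{base}/Dockerfile_jupyter", jupyter, {"RELEASE_IMAGE": release}, context=base),
    ]


if __name__ == "__main__":
    # Placeholder repository and version, for illustration only.
    for cmd in training_image_chain("example-hub/example-repo", "0.0.1"):
        print("DOCKER_BUILDKIT=1 " + shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` with `DOCKER_BUILDKIT=1` set in the environment.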

MetaSpore C++

MetaSpore C++ consists of several modules.

common
serving
metaspore (shared)

common

globals
- Defines the gflags variables
hashmap
arrow
features

metaspore

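Provides the code shared by offline training and online serving.
Offline usage
1. Define and bind the C++ interfaces with the pybind11 library
2. Python code loads the shared library and calls the pybind11-bound C++ interfaces as if they were ordinary Python code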

Getting Started

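Documentation link

Steps

Define the model (a PyTorch Module)
Define the Estimator
- PyTorchEstimator wraps the PyTorch model and trains it in a distributed environment (call the fit method with a DataFrame to train)
- Calls launcher.launch() to start the PS processes (server, worker, coordinator) on each node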

Define the model

embedding_size      : embedding size of each feature group
# MetaSpore related
sparse              : ms.EmbeddingSumConcat
sparse.updater      : ms.FTRLTensorUpdater
sparse.initializer  : ms.NormalTensorInitializer
dense.normalization : ms.nn.Normalization
# Torch related
dense               : torch.nn.Sequential

What gets initialized.

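EmbeddingSumConcat
- SparseFeatureExtractor
  - Parses the raw feature column configuration file
  - Adds nodes that compute hashed features to the computation graph
- EmbeddingBagModule
TensorUpdater: update classes for sparse and dense data
- FTRLTensorUpdater
TensorInitializer: tensor initializers
- NormalTensorInitializer: initializes tensors from a normal distribution
Normalization: normalization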
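To make the "hash, look up, sum, concat" idea behind EmbeddingSumConcat more concrete, here is a toy, pure-Python illustration. It is a conceptual sketch under my own assumptions, not MetaSpore's implementation: the class `ToyEmbeddingSumConcat`, the MD5-based `feature_hash`, and the fixed-size table are all made up for the example.

```python
import hashlib
import random


def feature_hash(group, value, table_size):
    """Map a (feature group, raw value) pair to a row index of the embedding table."""
    digest = hashlib.md5(f"{group}\x01{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % table_size


class ToyEmbeddingSumConcat:
    """Sum-pool the embeddings inside each feature group, then concatenate the groups."""

    def __init__(self, groups, embedding_size, table_size=1000, seed=0):
        rng = random.Random(seed)
        self.groups = list(groups)
        self.embedding_size = embedding_size
        self.table_size = table_size
        # Rows are drawn from a normal distribution (mean 0, std 0.01).
        self.table = [
            [rng.gauss(0.0, 0.01) for _ in range(embedding_size)]
            for _ in range(table_size)
        ]

    def forward(self, sample):
        """sample maps a group name to a list of raw values; returns one flat vector."""
        output = []
        for group in self.groups:
            pooled = [0.0] * self.embedding_size
            for value in sample.get(group, []):
                row = self.table[feature_hash(group, value, self.table_size)]
                pooled = [p + r for p, r in zip(pooled, row)]
            output.extend(pooled)
        return output


if __name__ == "__main__":
    layer = ToyEmbeddingSumConcat(groups=["user", "item"], embedding_size=4)
    vector = layer.forward({"user": ["u1"], "item": ["i1", "i2"]})
    print(len(vector))  # number of groups * embedding_size
```

The output width is fixed at (number of groups) × embedding_size no matter how many values a group holds, which is what lets a dense torch.nn.Sequential sit on top of it.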

Train the model

PyTorchEstimator

Define the PyTorchEstimator
- module
- number of workers / servers
- model output path
- index of the label column

PyTorchAgent
PyTorchLauncher
PyTorchHelperMixin
PyTorchModel
PyTorchEstimator

Core concepts

JobRunner
PyTorchEstimator
- pyspark.ml.base.Estimator
Launcher
- PSLauncher
Agent
Module
- EmbeddingOperator
- TensorUpdater
- TensorInitializer
- Normalization
PyTorchModel
- pyspark.ml.base.Model
Metric

2025-01-06

Machine learning platforms

Open-source options

Training frameworks

TensorFlow
PyTorch
PaddlePaddle
MegEngine
Keras
MXNet
CNTK

Inference / Serving

MLOps

Cloud vendor products

TI-ONE

Articles

Integrated training and inference solutions

Sparse datasets

https://www.paddlepaddle.org.cn/documentation/docs/zh/api_guides/low_level/layers/sparse_update.html

Articles

MLOps (6): A review of the open-source MLOps products, frameworks, tools, and landscape changes of 2023

2024-10-27

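Machine Learning with PyTorch and Scikit-Learn - Chapter 2

Neural networks and the perceptron learning rule

The perceptron learning rule was proposed on top of the neuron model. It gives an algorithm that learns the optimal weights automatically.

The perceptron algorithm works as follows.

Initialize the weights and the bias to 0 or to small random numbers
Iterate over every training example
1. Compute the perceptron output
2. Update the weights and the bias

Points to note.

Convergence of the perceptron is guaranteed only when the training data is linearly separable
1. Otherwise, set a maximum number of passes over the training set (epochs), or a threshold on the number of tolerated misclassifications, to stop training
Initialize the weights and the bias with small values instead of 0; if they are all 0, the learning rate loses its effect on the decision boundary
1. In that case the learning rate only scales the weight vector and does not change its direction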
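The steps above can be sketched in plain Python. This is a minimal illustration rather than the book's listing: the class layout, the parameter names `eta` and `n_iter`, and the 0/1 labels with a threshold at 0 are assumptions made for the example.

```python
import random


class Perceptron:
    """Minimal perceptron classifier for labels 0 and 1."""

    def __init__(self, eta=0.1, n_iter=50, seed=1):
        self.eta = eta        # learning rate
        self.n_iter = n_iter  # maximum number of passes over the training set
        self.seed = seed

    def fit(self, X, y):
        rng = random.Random(self.seed)
        # Small random weights instead of zeros, as noted above.
        self.w = [rng.gauss(0.0, 0.01) for _ in X[0]]
        self.b = 0.0
        self.errors_per_epoch = []
        for _ in range(self.n_iter):
            errors = 0
            for xi, target in zip(X, y):
                # 1. compute the output, 2. update weights and bias on a mistake
                update = self.eta * (target - self.predict(xi))
                if update != 0.0:
                    self.w = [w + update * x for w, x in zip(self.w, xi)]
                    self.b += update
                    errors += 1
            self.errors_per_epoch.append(errors)
            if errors == 0:  # every example is classified correctly: stop early
                break
        return self

    def net_input(self, x):
        return sum(w * xj for w, xj in zip(self.w, x)) + self.b

    def predict(self, x):
        return 1 if self.net_input(x) >= 0.0 else 0


if __name__ == "__main__":
    # Logical AND is linearly separable, so training stops with zero errors.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 0, 0, 1]
    model = Perceptron(eta=0.1, n_iter=200).fit(X, y)
    print([model.predict(x) for x in X], model.errors_per_epoch)
```

On data that is not linearly separable, such as XOR, the loop never reaches zero errors and stops only because of `n_iter`, which is why the cap is needed.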

2024-08-14

CUDA - Coding

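Error handling

Runtime API error codes

CUDA runtime API calls return an error code.

__host__ __device__ cudaError_t cudaGetDeviceCount ( int* count ); // get the number of devices, returns an error code

Error checking

__host__ __device__ const char* cudaGetErrorName ( cudaError_t error );     // get the enum name of an error code
__host__ __device__ const char* cudaGetErrorString ( cudaError_t error );   // get the description of an error code

Using an error-check function (error_check is defined later in this post)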

__host__ void error_check_entry() {
  int device_id_in_use;
  error_check(cudaGetDevice(&device_id_in_use), __FILE__, __LINE__);
  error_check(cudaSetDevice(999), __FILE__, __LINE__);
  //  char *p_c;
  //  error_check(cudaMalloc(&p_c, 100), __FILE__, __LINE__);

  cudaDeviceSynchronize();
} /** output
error_check, ok
CUDA error:
        code=101, name=cudaErrorInvalidDevice, description=invalid device ordinal,
        file=/data/code/cook-cuda/src/sample/hello_world.cu, line=51
*/

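Errors in kernels

A kernel must return void, so a launch error cannot be returned by the kernel itself; it is retrieved with cudaGetLastError.

__host__ __device__ cudaError_t cudaGetLastError ( void ); // returns the last error code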

__global__ void kernel_error_entry() {
  dim3 block(2048);
  print_build_in_vars<<<2, block>>>();  // max block size is 1024
  error_check(cudaGetLastError(), __FILE__, __LINE__);
} /** output
CUDA error:
        code=9, name=cudaErrorInvalidConfiguration, description=invalid configuration argument,
        file=/data/code/cook-cuda/src/sample/hello_world.cu, line=67
*/
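The invalid configuration above can also be caught on the host before the launch. The sketch below uses Python only to illustrate the arithmetic; `launch_config` is a hypothetical helper, and the default limit of 1024 is taken from the comment above (on a real device it would come from `cudaDeviceProp::maxThreadsPerBlock`).

```python
def launch_config(n, block_size=256, max_threads_per_block=1024):
    """Return (grid_size, block_size) for a 1-D launch covering n elements."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 1 <= block_size <= max_threads_per_block:
        raise ValueError(
            f"block size {block_size} is outside 1..{max_threads_per_block}"
        )
    # Round up so that grid_size * block_size >= n.
    grid_size = (n + block_size - 1) // block_size
    return grid_size, block_size


if __name__ == "__main__":
    print(launch_config(1000, 256))
    try:
        launch_config(4096, 2048)  # same mistake as the kernel launch above
    except ValueError as exc:
        print("rejected:", exc)
```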

Performance evaluation

Event timing

__host__ cudaError_t cudaEventCreate ( cudaEvent_t* event );
__host__ __device__ cudaError_t 	cudaEventRecord ( cudaEvent_t event, cudaStream_t stream = 0 );
__host__ cudaError_t cudaEventSynchronize ( cudaEvent_t event );
__host__ cudaError_t cudaEventElapsedTime ( float* ms, cudaEvent_t start, cudaEvent_t end );
__host__ __device__ cudaError_t 	cudaEventDestroy ( cudaEvent_t event );

Example.

cudaEvent_t start, end;
error_check(cudaEventCreate(&start), __FILE__, __LINE__);
error_check(cudaEventCreate(&end), __FILE__, __LINE__);
error_check(cudaEventRecord(start), __FILE__, __LINE__);
cudaEventQuery(start);

// run GPU Task

error_check(cudaEventRecord(end), __FILE__, __LINE__);
error_check(cudaEventSynchronize(end), __FILE__, __LINE__);
float elapsed_time_ms;
ERROR_CHECK(cudaEventElapsedTime(&elapsed_time_ms, start, end));

printf("elapsed time: %f ms\n", elapsed_time_ms);
ERROR_CHECK(cudaEventDestroy(start));
ERROR_CHECK(cudaEventDestroy(end));

The error_check helper.

__host__ __device__ cudaError_t error_check(cudaError_t err, const char *fn, int line) {
  if (err != cudaSuccess) {
    printf("CUDA error:\n\tcode=%d, name=%s, description=%s, \n\tfile=%s, line=%d\n", err, cudaGetErrorName(err),
           cudaGetErrorString(err), fn, line);
  }
  return err;
}
#define ERROR_CHECK(exp) error_check(exp, __FILE__, __LINE__)

nvprof

nvprof is a tool for profiling CUDA programs. It is now deprecated and does not support devices with compute capability >= 8.0; use nsys on newer devices.

$ nvprof {cuda-program}

nsys

$ nsys profile {cuda-program} # run the program and record its profile into an nsys-rep file
$ nsys analyze {nsys-rep}     # analyze the profile file

Getting GPU information

Runtime API

__host__ cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device )

__host__ void PrintDeviceInfo() {
  int deviceCount;
  cudaGetDeviceCount(&deviceCount);
  std::cout << "GPU device count: " << deviceCount << std::endl;

  for (int i = 0; i < deviceCount; ++i) {
    // SM: Streaming Multiprocessor
    cudaDeviceProp dp{};
    cudaGetDeviceProperties(&dp, i);
    std::cout << "device." << i << std::endl;
    std::cout << "  sm count: \t\t\t\t" << dp.multiProcessorCount << std::endl;
    std::cout << "  shared memory per block: \t\t" << dp.sharedMemPerBlock / 1024 << "KB" << std::endl;
    std::cout << "  max threads per block:\t\t" << dp.maxThreadsPerBlock << std::endl;
    std::cout << "  max threads per multi processor:\t" << dp.maxThreadsPerMultiProcessor << std::endl;
    // a warp is 32 threads, so this is the number of resident warps per SM
    std::cout << "  max warps per multi processor:\t" << dp.maxThreadsPerMultiProcessor / 32 << std::endl;
    std::cout << "  max blocks per multi processor:\t" << dp.maxBlocksPerMultiProcessor << std::endl;
  }
}

2024-08-04

helm - usage

local git repo

Directory layout

$ tree
.
├── ace
│   ├── nginx
│   │   ├── Chart.yaml
│   │   ├── configmap
│   │   │   └── sources.list
│   │   ├── templates
│   │   │   ├── configmap.yaml
│   │   │   ├── deployment.yaml
│   │   │   ├── _helpers.tpl
│   │   │   └── service.yaml
│   │   └── values.yaml
│   └── ...
├── Makefile
└── README.md

Install the chart

$ helm install ace ace/nginx
NAME: ace
LAST DEPLOYED: Mon Aug  5 13:42:17 2024
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None

Upgrade the chart

$ helm upgrade ace ace/nginx
Release "ace" has been upgraded. Happy Helming!
NAME: ace
LAST DEPLOYED: Mon Aug  5 13:47:34 2024
NAMESPACE: default
STATUS: deployed
REVISION: 2
TEST SUITE: None

Uninstall the chart

$ helm uninstall ace
release "ace" uninstalled

Example

$ helm upgrade --install {release-name} {chart-path} \
  --create-namespace -n ${kubeNamespace} \
  -f {path-to-helm-values-file} \
  --set app.tag={tag-value} \
  --kube-context {kube-context} \
  --kubeconfig {path-to-kube-config-file}

Explanation

--install: upgrade the release, or install it if it does not exist
--create-namespace: create the namespace if it does not exist; used together with --install
--set app.tag: set the value app.tag
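When this command runs from a deploy script, it can be easier to assemble it as an argument list than as one long shell string. A small sketch, assuming a hypothetical helper named `helm_upgrade_cmd`; it only uses the flags shown above and does not call helm itself.

```python
def helm_upgrade_cmd(release, chart, namespace, values_file=None,
                     set_values=None, kube_context=None, kubeconfig=None):
    """Build the argv list for `helm upgrade --install` with the flags used above."""
    cmd = ["helm", "upgrade", "--install", release, chart,
           "--create-namespace", "-n", namespace]
    if values_file:
        cmd += ["-f", values_file]
    for key, value in (set_values or {}).items():
        cmd += ["--set", f"{key}={value}"]
    if kube_context:
        cmd += ["--kube-context", kube_context]
    if kubeconfig:
        cmd += ["--kubeconfig", kubeconfig]
    return cmd


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    cmd = helm_upgrade_cmd("ace", "ace/nginx", "default",
                           values_file="values.yaml",
                           set_values={"app.tag": "1.0.0"})
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems in the values.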

2024-07-15

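Troubleshooting - CPU

Cause analysis

Compute tasks

Compute-heavy tasks that use too much CPU
Infinite loops

Context switching

Deadlocks
Frequent locking
Too much concurrency
Insufficient memory
Frequent GC (Java, Go, and similar languages)

Troubleshooting

Using the top command

top -Hp {pid} # show the CPU usage of each thread in the given process

Check the number of threads

ps -p {pid} -L | wc -l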
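Note that ps prints a header line, so the wc result is one more than the actual number of threads. On Linux the same number can be read from /proc without that off-by-one. A minimal sketch; the function name `thread_count` and the `proc_root` parameter (there only so the function can be pointed at a fake directory) are my own.

```python
import os


def thread_count(pid, proc_root="/proc"):
    """Count the threads of a process: one entry per thread in /proc/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return len(os.listdir(task_dir))


if __name__ == "__main__":
    # Thread count of the current Python process (Linux only).
    print(thread_count(os.getpid()))
```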

Inspect a process's context switches

pidstat

pidstat -w -p {pid}

Here {pid} is the ID of the target process. The command shows the process's context-switch statistics: voluntary switches (cswch/s) and non-voluntary switches (nvcswch/s).

Linux 4.14.301-224.520.amzn2.x86_64 (...) 	2024年07月04日 	_x86_64_	(32 CPU)

10时23分19秒   UID       PID   cswch/s nvcswch/s  Command
10时23分19秒     0   3637168      0.17      0.00  ...

# install
yum install sysstat -y
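If sysstat cannot be installed, the raw counters are also exposed by Linux in /proc/{pid}/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The sketch below samples them twice to get a per-second rate, similar in spirit to the pidstat columns; the function names are hypothetical, and as far as I know the counters describe the task whose ID is given (other threads have their own file under /proc/{pid}/task/{tid}/status).

```python
import time

COUNTER_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from the text of /proc/<pid>/status."""
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in COUNTER_KEYS:
            counters[key] = int(value.strip())
    return counters


def read_ctxt_switches(pid):
    with open(f"/proc/{pid}/status") as f:
        return parse_ctxt_switches(f.read())


def ctxt_switch_rate(pid, interval=1.0):
    """Switches per second over `interval` seconds, from two samples of the counters."""
    before = read_ctxt_switches(pid)
    time.sleep(interval)
    after = read_ctxt_switches(pid)
    return {key: (after[key] - before[key]) / interval for key in before}


if __name__ == "__main__":
    import os
    print(ctxt_switch_rate(os.getpid(), interval=0.5))
```

A high non-voluntary rate suggests threads are being preempted (CPU contention), while a high voluntary rate points at blocking on locks or IO.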

perf

perf stat -e cs,<event> -p <PID>
# event: cs (context switches in all modes), cs:u (user mode only), cs:k (kernel mode only)
$ perf stat -e cs,cs:u,cs:k -p 3637168  # press Ctrl-C to stop collecting

^C
 Performance counter stats for process id '3637168':

            44,981      cs
                 0      cs:u
            44,981      cs:k

      27.447834538 seconds time elapsed

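Differences between perf stat and perf record

perf stat
- Quick view of a program's basic performance metrics
- Collects CPU instructions, cache hit rate, context switches, and so on
perf record
- Collects performance events for the whole system or a specific process
- Collects events such as instructions, cache, and branches
- Can export a file for later analysis

yum install perf -y

Other questions

How to tell whether the CPU is consumed by compute tasks or by excessive context switching
Whether it is necessary to separate IO threads from worker threads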

2024-07-02

tensorflow api

2024-07-02

tensorflow

Installation

$ pip3 install tensorflow

Verification

import tensorflow as tf
print("TensorFlow version:", tf.__version__)

References

Tutorials For Beginners

2024-06-30

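English - Vocabulary

Parts of speech

Nouns (n.)
- People
  - -er / -or: a person who does something or has a particular occupation
    - actor
  - -ist: a person who supports a doctrine or works in a field
    - journalist
  - -ian: a person associated with or expert in a field
    - historian
- -tion / -sion: an action or state
  - action
  - version
- -ment: an act or state
  - development
- -ness: a quality or state
  - happiness
- -ity / -ty / -cy: a quality or state
  - ability
  - beauty
- -ship: a relationship or status
  - friendship
- -hood: a state or status
  - childhood
- -ure: the result of an action
  - failure
  - pressure
- -ance / -ence
  - appearance
  - difference
- -ee: the recipient of an action or a participant
  - employee
- -acy / -ency: a quality or state
  - accuracy
- -ology: a field of study
  - biology
- -ary: a collection or something related to a thing
  - dictionary
  - library
- -th: forms a noun from an adjective
  - strength
- -age: a quality or state
  - storage
  - marriage
- -ry: a place or a collective noun
  - machinery
Adjectives (adj.)
- -able, -ible: "can be ...", "able to ...", or "worthy of ..."
  - capable, edible, visible
- -al, -ial: "having the property of ..." or "relating to ..."
  - natural, personal, industrial
- -ful: "full of ..."
  - beautiful, helpful, wonderful
- -less: "without ..."
  - careless, homeless, timeless
- -ous: "inclined to ..." or "full of ..."
  - dangerous, curious, glorious
- -ic, -ical: "of ...", often related to science or art
  - historic, historical, economic
- -ive: "having the property of ..." or "tending to ..."
  - active, negative, sensitive
- -ly: sometimes forms adjectives meaning "having the quality of ..."
  - friendly, deadly
- -y, -ish: "like ..." or "somewhat ..."
  - tasty, foolish, bluish
- -like: "similar to ..."
  - childlike, manlike
- -some: "characterized by ..." or "causing ..."
  - troublesome, handsome
- -ary, -ory: "belonging to ..." or "related to ..."
  - honorary, imaginary
Transitive verbs (vt.)
- -ize or -ise: added to a noun or adjective to form a verb, usually meaning "to make ...", "to treat in the manner of ...", or "to act like ..."
  - normal → normalize (to make normal)
  - equal → equalize (to make equal)
- -ify: added to a noun or adjective to form a verb, usually meaning "to make ..." or "to cause to become ..."
  - beautify (to make beautiful)
  - purify (to make pure)
Intransitive verbs (vi.)
Adverbs (adv.)
- -ly: added to an adjective to form an adverb
  - quick -> quickly, happy -> happily
- -wise: direction or manner
  - clock -> clockwise (in the direction of a clock's hands), money -> moneywise (in terms of money)
- -wards / -ward: direction or tendency
  - sun -> sunwards / sunward (toward the sun), north -> northwards / northward (toward the north)
- -erly: direction or origin
  - east -> easterly (toward the east, or coming from the east)
- -ways: manner or direction
  - length -> lengthways (along the length), side -> sideways (to or from the side)
Prepositions (prep.)
Conjunctions (conj.)
Numerals (num.)
Pronouns (pron.)
Interjections (int.)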

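Noun (n.)

Names a person, thing, place, or concept. For example dog, table.

Verb (v.)

Expresses an action, state, or event. For example run, eat.

Adjective (adj.)

Describes or modifies a noun or pronoun. For example blue, tall.

Adverb (adv.)

Modifies a verb, an adjective, or another adverb. For example quickly, here, well.

Pronoun (pron.)

Stands in for a noun to avoid repetition. For example he, she.

Preposition (prep.)

Expresses the relationship between a noun or pronoun and the rest of the sentence. For example in, on, with.

Conjunction (conj.)

Connects words, sentences, or phrases. For example but, or, and.

Article (art.)

Limits a noun; articles are either definite or indefinite: the (definite article), a/an (indefinite articles).

Modal verb (modal v.)

Expresses ability, permission, possibility, or obligation. For example may, can, must.

Auxiliary verb (aux.)

Used to form compound tenses or the passive voice. For example be, do, have.

Interjection (int.)

Expresses a strong emotion or reaction. For example oh, wow.

Numeral (num.)

Expresses quantity or order. For example one, first.

Comparison

Transitive and intransitive verbs

A verb that must be followed by an object is a transitive verb; otherwise it is an intransitive verb.

The bird flies.

Here fly is an intransitive verb.

I eat an apple.

Here eat is a transitive verb.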

2024-06-27

Conda

Installation

linux

mkdir -p ~/.miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/.miniconda3/miniconda.sh
bash ~/.miniconda3/miniconda.sh -b -u -p ~/.miniconda3
rm -rf ~/.miniconda3/miniconda.sh

Configure the shell

# for bash
~/.miniconda3/bin/conda init bash
# for zsh
~/.miniconda3/bin/conda init zsh

shell profile

# replace the path and YOUR_SHELL_NAME
eval "$(/path/to/anaconda3/bin/conda shell.YOUR_SHELL_NAME hook)"

Environments

List environments

$ conda info --envs

Create an environment

$ conda create -n ml

Specify a channel

$ conda create -n ml --channel=conda-forge

Clone an environment

$ conda create --name new_name --clone old_name

Activate an environment

$ conda activate {env-name}

Rename an environment

$ conda rename -n old_name new_name

Update an environment from a yml file

$ conda env update --file env.yml --prune

Remove an environment

$ conda remove --name {env-name} --all

Do not activate the conda base environment by default

$ conda config --set auto_activate_base false # turn off automatic activation of base

Print environment information

$ conda info

Channel

Add a channel for the environment

$ conda config --append channels conda-forge

Add a channel

$ conda config

Print channels

$ conda config --show channels

Package management

conda package management is organized around channels; when none is specified, the default channel defaults is used. To install a package from another channel, prefix the package name with the channel, for example.

$ conda install anaconda::gcc_linux-64

Search for available packages

$ conda search {package}

Or search here; the package page shows the install command, for example.

$ conda install anaconda::gcc_linux-64
# another package
$ conda install conda-forge::gcc_linux-64

Installed packages

$ conda list

Remove a package

$ conda uninstall {package}

Install a package

# from the default channel
$ conda install {package}

# from a specific channel
$ conda install {channel}::{package}

# a specific version
$ conda install {package}={version}

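Troubleshooting

GLIBCXX_3.4.30 not found

ImportError: /home/wii/.miniconda3/envs/ml/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/wii/.miniconda3/envs/ml/lib/python3.12/site-packages/paddle/base/libpaddle.so)

The following command lists the GLIBCXX versions provided by a given libstdc++.

$ strings /path/to/libstdc++.so.6 | grep GLIBCXX

This error usually means that the GCC (libstdc++) version the program depends on does not match the installed one: it is either too new or too old. In the error above, the environment's libstdc++ is too old for libpaddle.so. Installing a compatible GCC version fixes it.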
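The strings-and-grep check above can also be scripted, for example to verify an environment before importing a package. A minimal sketch, assuming hypothetical helper names `glibcxx_versions` and `has_glibcxx`; it simply scans the library file for GLIBCXX_x.y.z version strings.

```python
import re


def glibcxx_versions(path):
    """Return the GLIBCXX versions found in a libstdc++ file, as sorted tuples of ints."""
    with open(path, "rb") as f:
        data = f.read()
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)*)", data))
    return sorted(tuple(int(part) for part in v.decode().split(".")) for v in found)


def has_glibcxx(path, version):
    """True if the library provides the given version, e.g. '3.4.30'."""
    wanted = tuple(int(part) for part in version.split("."))
    return wanted in glibcxx_versions(path)


if __name__ == "__main__":
    import sys
    # Usage: python check_glibcxx.py /path/to/libstdc++.so.6 3.4.30
    lib_path, wanted_version = sys.argv[1], sys.argv[2]
    print(has_glibcxx(lib_path, wanted_version))
```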

libstdcxx-ng 11.2.0.* is not installable

libstdcxx-ng 11.2.0.* is not installable because it conflicts with any installable versions previously reported

This error appeared when installing gcc 13.3.0 in a conda environment. The cause is that the already installed libstdcxx-ng 11.2.0.* conflicts with the dependencies of gcc 13.3.0. You can try uninstalling it first.

$ conda uninstall libstdcxx-ng # most likely this will fail

Alternatively, recreate the environment and add the conda-forge channel.

$ conda create --name {env-name} --channel=conda-forge conda-forge::gcc=13.2.0

Common packages

# gcc & g++, pinned versions
conda-forge::gxx=8.5.0
conda-forge::gcc=8.5.0
