Vicuna Model Deployment

Check the environment

Check the GPU: nvidia-smi
Check the torch version: conda list torch
Check the CUDA version: nvcc --version
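
As a cross-check, a minimal Python sketch (run inside the target conda environment) confirms that the installed torch build actually sees the GPU:

import torch

print(torch.__version__)          # e.g. 2.0.0+cu117
print(torch.cuda.is_available())  # should print True on a working CUDA setup
print(torch.version.cuda)         # CUDA version torch was built against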

Set up and switch to a dedicated environment (needed for fine-tuning; not required for local deployment)

Create a Python 3.11 environment
conda create -n my-env python=3.11
conda init bash && source /root/.bashrc
Switch to the new environment
conda activate my-env

Install torch
pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117

Install FastChat
pip install fschat --default-timeout=100 -i https://pypi.tuna.tsinghua.edu.cn/simple

Install transformers
pip install transformers==4.28.1 --default-timeout=100 -i https://pypi.tuna.tsinghua.edu.cn/simple

Install einops
pip install einops --default-timeout=100 -i https://pypi.tuna.tsinghua.edu.cn/simple

Install flash-attention (the build takes a long time)
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
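
Once the build finishes, a one-line import check confirms the package is usable (assuming it installs under the module name flash_attn):

python -c "import flash_attn"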

Install tensorboardX
pip install tensorboardX --default-timeout=100 -i https://pypi.tuna.tsinghua.edu.cn/simple

Clone the project

git clone https://github.com/lm-sys/FastChat.git

Download the LLaMA weights

cd FastChat

pip install pyllama -U --default-timeout=100 -i https://pypi.tuna.tsinghua.edu.cn/simple

Choose the 7B or 13B weights (the steps below use 13B)
python -m llama.download --model_size 7B
#python -m llama.download --model_size 13B

Weight conversion script:
https://github.com/huggingface/transformers/blob/9eae4aa57650c1dbe1becd4e0979f6ad1e572ac0/src/transformers/models/llama/convert_llama_weights_to_hf.py

Place the script in the weights directory, then run the conversion:
python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./output/13B
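
A short sketch to sanity-check the converted Hugging Face checkpoint (the path follows the --output_dir above; adjust if yours differs):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("./output/13B")        # converted model config
tokenizer = AutoTokenizer.from_pretrained("./output/13B")  # converted tokenizer
print(config.model_type, config.num_hidden_layers)         # expect: llama 40 for the 13B model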

Apply the Vicuna delta weights

python -m fastchat.model.apply_delta \
--base ./pyllama_data/output/13B \
--target ./vicuna_data/vicuna-13b \
--delta lmsys/vicuna-13b-delta-v1.1
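
Before standing up the API, the merged weights can be smoke-tested interactively with FastChat's command-line chat:

python -m fastchat.serve.cli --model-path ./vicuna_data/vicuna-13b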

Start the API (the controller, model worker, and API server below are separate processes; run each in its own terminal)

python -m fastchat.serve.controller --host="127.0.0.1"

Start the model worker with the model to serve
python -m fastchat.serve.model_worker --model-path ./vicuna_data/vicuna-13b --host="127.0.0.1" --worker-address="http://127.0.0.1:21002" --controller-address="http://127.0.0.1:21001"

Start the OpenAI-compatible API server
python -m fastchat.serve.openai_api_server --host 127.0.0.1 --port 6006
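
Once all three processes are up, a quick check that the server is answering (the /v1/models route is part of the OpenAI-compatible surface FastChat exposes):

curl http://127.0.0.1:6006/v1/models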

Issues

functools cache problem:
vi fastchat/model/model_adapter.py
Change the import to: from functools import lru_cache as cache
Or use Python 3.9 or newer (functools.cache was added in 3.9).
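
The same workaround as a version-tolerant snippet:

try:
    from functools import cache                 # available on Python 3.9+
except ImportError:
    from functools import lru_cache as cache    # fallback for older Pythons, as in the fix above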

Usage

Embeddings endpoint:
curl http://xxx.com/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "vicuna-13b",
"input": "使用php合并数组"
}'

Chat endpoint:
curl http://xxx.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "vicuna-13b", "messages": [{"role": "user", "content": "How to merge arrays in PHP"}], "temperature": 0, "max_tokens": 820, "top_p": 1, "frequency_penalty": 0, "presence_penalty": 0.6, "stream": true}'
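
The same chat call from Python, via the openai client pointed at the local server (a sketch assuming the 0.x openai package and the host/port from the startup commands above):

import openai

openai.api_key = "EMPTY"                      # FastChat does not validate the key
openai.api_base = "http://127.0.0.1:6006/v1"  # address of the openai_api_server

resp = openai.ChatCompletion.create(
    model="vicuna-13b",
    messages=[{"role": "user", "content": "How to merge arrays in PHP"}],
    temperature=0,
    max_tokens=820,
)
print(resp.choices[0].message.content)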

Fine-tuning environment

Image:
  PyTorch 1.11.0
  Python 3.8 (Ubuntu 20.04)
  CUDA 11.3
GPU: A100-PCIE-40GB (40GB) * 2
CPU: 20 vCPU Intel Xeon Gold 6248R
Memory: 144GB
Disk:
  System disk: 25 GB
  Data disk: free: 50GB SSD; paid: 40GB

Possible errors

mmcv version error: https://mmcv.readthedocs.io/zh_CN/latest/get_started/installation.html

Install a suitable PyTorch version:
https://pytorch.org/get-started/previous-versions/
PyTorch 2.0 page:
https://pytorch.org/get-started/pytorch-2.0/

Start fine-tuning (not yet successful)

Edit the training data (a sample record is sketched below):
vi FastChat/playground/data/dummy.json
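
The records in dummy.json follow a ShareGPT-style schema, roughly as below (field names follow FastChat's sample data; the values are illustrative):

[
  {
    "id": "identity_0",
    "conversations": [
      {"from": "human", "value": "Who are you?"},
      {"from": "gpt", "value": "I am Vicuna, a language model trained by researchers."}
    ]
  }
]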

This run used three V100 (40GB) cards and still hit a CPU-side error; a setup with more than three cards (or more host memory) is presumably needed.
torchrun --nproc_per_node=2 --master_port=20002 fastchat/train/train_mem.py \
--model_name_or_path pyllama_data/output/13B \
--data_path playground/data/dummy.json \
--bf16 True \
--output_dir output/vicuna-13b-dummy \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 300 \
--save_total_limit 2 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--fsdp "full_shard offload auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True

Error encountered:
Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Fix: overwrite two files under pyllama_data/output/13B,
special_tokens_map.json and tokenizer_config.json,
with the versions from https://huggingface.co/lmsys/vicuna-13b-delta-v0/tree/main
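
The error message also suggests an in-code alternative; a minimal sketch (model path taken from the training command above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pyllama_data/output/13B")
if tokenizer.pad_token is None:
    # reuse the EOS token for padding, as the error message recommends
    tokenizer.pad_token = tokenizer.eos_token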

Error encountered:
torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 2045) of binary: /root/miniconda3/envs/my-env/bin/python
Exit code -9 means the worker was killed with SIGKILL, typically by the kernel OOM killer when host memory runs out.

Change the mmcv version (pick one that matches your CUDA and torch versions):
pip install mmcv==2.0.0 -f https://download.openmmlab.com/mmcv/dist/cu117/torch2.0/index.html

torchrun --nproc_per_node=3 --master_port=20001 fastchat/train/train_mem.py \
--model_name_or_path pyllama_data/output/13B \
--data_path playground/data/dummy.json \
--bf16 False \
--output_dir output/vicuna-13b-dummy \
--num_train_epochs 3 \
--per_device_train_batch_size 3 \
--per_device_eval_batch_size 3 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 300 \
--save_total_limit 2 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--fsdp "full_shard offload auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True