Training a Spam SMS Classifier with GPT-3: A Worked Example

Date: 2023-03-21  大鹏学开发

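Introduction

We all receive plenty of text messages. Now that instant-messaging apps such as WeChat are everywhere, SMS has largely become a channel for receiving verification codes, yet spam still gets through from time to time. Classifying and blocking messages is therefore a simple but important need.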

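The App Store has many apps that classify SMS, such as 《熊猫吃短信》; download one if you want to try it. An app that solves even a simple need like this can earn its developer real money, so it is worth seeing how the same need could be met with GPT-3.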

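In this tutorial we use a GPT-3 model to build a spam SMS classifier. It makes a simple first project for learning how to fine-tune a GPT-3 model.

Because of cost (calling the fine-tuned model is still billed per token, and at a higher rate), this approach is not suitable for production and is meant only for learning. Hopefully cheaper and better-suited options will appear later.

*If you have no development background, you can still follow the training process here and then ask a programmer to run the training for you.*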

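Training data

The data you prepare determines what the model learns. A general-purpose SMS classifier would need as many and as varied samples as possible. In this example we train on one person's messages only, so far fewer samples are needed, and the resulting model fits that person's messages well.

We exported ten thousand messages directly from one person's phone (none had been deleted for several years), randomly picked 500 of them as samples, and labeled each with one of four simple categories: 通知短信 (notification), 垃圾短信 (spam), 公益短信 (public-service) and 正常短信 (normal). The labeled file was then saved in .csv format in the project folder.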

Category	SMS content
通知短信	【码上购】【网上营业厅】您的订单正在做修改证件操作，验证码：522348，非本人同意请勿向他人提供验证码信息
通知短信	尊敬的客户：您好！您所反映的问题（工单号：TS00000000000000）已处理完毕，我司将跟进满意度调查，如您收到提示短信，请对我们的服务给予10分的满意评价。感谢您的理解和支持！<湖南联通10010>
垃圾短信	交费、充值更多人选联通手机营业厅，安全快捷，固定面值本机交费享受9.95折，快来体验吧！u.10010.cn/khddf2
公益短信	公益短信：4月15日是全民国家安全教育日。国家安全，人人有责！发现危害国家安全的情况，请拨打举报电话12339，一经查实将予奖励。【湖南省国家安全厅】
正常短信	今天上午可以安装吗老板
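The random sampling step described above can be sketched in Python. This is a minimal sketch, assuming the exported messages are already loaded as a list of strings; the function name `sample_messages`, the fixed seed and the placeholder messages are illustrative and do not come from the original project.

```python
import random


def sample_messages(messages, k=500, seed=42):
    """Return k messages chosen at random without replacement.

    The seed is fixed only so the same sample can be reproduced for labeling.
    """
    rng = random.Random(seed)
    return rng.sample(messages, k)


# Placeholder data standing in for the 10,000 exported messages.
exported = [f"message {i}" for i in range(10_000)]
to_label = sample_messages(exported)
print(len(to_label))  # 500
```

Sampling without replacement keeps duplicates out of the labeling set, which matters when the set is this small.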

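Notes

Use at least 200 samples, and preferably 500 or more; more data generally gives higher accuracy
For classification, each category should have at least 100 samples, otherwise accuracy suffers
Make sure the training samples closely resemble the messages seen in real use, otherwise accuracy suffers
If samples contain sensitive information, mask it with asterisks or a placeholder such as 某某; this does not affect training
End each sample's prompt with a fixed marker such as "###" or "->"; if it is missing, the conversion tool will offer to add it for you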
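Two of the notes above (the per-category minimum and masking of sensitive data) are easy to check in code. The following is a rough sketch only; the helper names `undersized_classes` and `mask_digits`, the 6-digit threshold and the label counts are assumptions made for illustration.

```python
import re
from collections import Counter


def undersized_classes(labels, minimum=100):
    """Return {label: count} for every class with fewer than `minimum` samples."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < minimum}


def mask_digits(text, min_run=6):
    """Replace each run of at least `min_run` digits with the same number of asterisks."""
    pattern = r"\d{%d,}" % min_run
    return re.sub(pattern, lambda m: "*" * len(m.group()), text)


# Made-up label counts, used only to exercise the check.
labels = ["通知短信"] * 120 + ["垃圾短信"] * 150 + ["公益短信"] * 30 + ["正常短信"] * 200
print(undersized_classes(labels))              # {'公益短信': 30}
print(mask_digits("验证码：522348，请勿泄露"))  # 验证码：******，请勿泄露
```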

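Note that we label the four categories in Chinese only to make this tutorial easy to follow and test. In real use, replace the Chinese labels with numbers:

正常短信 (normal) = 1, 通知短信 (notification) = 2, 公益短信 (public-service) = 3, 垃圾短信 (spam) = 4

The model API is billed per token, which is roughly per character, so numeric labels save some cost.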
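The label mapping above could be kept in one place so that training data and prediction results stay consistent. A small sketch, where the names `LABEL_TO_ID`, `encode_label` and `decode_label` are illustrative:

```python
# Mapping taken from the tutorial text; numeric labels are stored as strings
# because they end up as completion text.
LABEL_TO_ID = {"正常短信": "1", "通知短信": "2", "公益短信": "3", "垃圾短信": "4"}
ID_TO_LABEL = {number: label for label, number in LABEL_TO_ID.items()}


def encode_label(label):
    """Convert a Chinese category name to its numeric label."""
    return LABEL_TO_ID[label]


def decode_label(completion):
    """Convert a model completion such as ' 4' back to the category name."""
    return ID_TO_LABEL[completion.strip()]


print(encode_label("垃圾短信"))  # 4
print(decode_label(" 4"))        # 垃圾短信
```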

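There are also four base models to choose from for training: davinci, curie, babbage and ada.

Of these, ada is the cheapest and the fastest. For a simple classification task like this one, ada is sufficient.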

Prices for the four models:

Model	Training price	Usage price (fine-tuned model)
Ada	$0.0004 / 1K tokens	$0.0016 / 1K tokens
Babbage	$0.0006 / 1K tokens	$0.0024 / 1K tokens
Curie	$0.0030 / 1K tokens	$0.0120 / 1K tokens
Davinci	$0.0300 / 1K tokens	$0.1200 / 1K tokens

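Prices are quoted per 1,000 tokens. A token corresponds roughly to a character; one Chinese character is about 2 tokens, and a typical SMS is about 140 tokens. With a fine-tuned ada model, classifying 1,000 messages uses about 140,000 tokens, which costs 140 × $0.0016 = $0.224, or about 1.568 RMB (at roughly 7 RMB per US dollar, the rate these figures imply).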
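The cost estimate can be reproduced with a few lines of arithmetic. The token count and ada usage price come from the text and table above; the exchange rate of 7 RMB per US dollar is an assumption, chosen because it is the rate that reproduces the 1.568 RMB figure.

```python
TOKENS_PER_SMS = 140           # rough estimate from the text
ADA_USAGE_USD_PER_1K = 0.0016  # usage price of a fine-tuned ada model
RMB_PER_USD = 7.0              # assumed exchange rate


def cost_rmb(n_messages):
    """Estimated cost in RMB of classifying n_messages with fine-tuned ada."""
    tokens = n_messages * TOKENS_PER_SMS
    usd = tokens / 1000 * ADA_USAGE_USD_PER_1K
    return usd * RMB_PER_USD


print(round(cost_rmb(1000), 3))  # 1.568
```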

That is not cheap, but throughout history anything in demand yet expensive has eventually been pushed down in price by the market.

Training process

First install the latest openai library

pip install --upgrade openai

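Then provide your OpenAI API key, for example through an environment variable

# Linux / macOS
export OPENAI_API_KEY="<your OpenAI API key>"
REM Windows (cmd); no quotes, otherwise they become part of the value
set OPENAI_API_KEY=<your OpenAI API key>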

GPT-3 fine-tuning requires the sample data to be converted to the JSONL format that OpenAI specifies

{"prompt": "<input prompt>", "completion": "<expected output>"}
...

{"prompt":"sms: 今天上午可以安装吗老板 ->", "completion":" 正常短信"}
……
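For readers who want to see what one JSONL record amounts to, here is a minimal sketch that builds the line shown above by hand. The function name `to_jsonl_line` is illustrative, and the defaults for `prefix` and `separator` simply mirror the example record; the tutorial itself uses the openai conversion tool rather than code like this.

```python
import json


def to_jsonl_line(text, label, prefix="sms: ", separator=" ->"):
    """Build one JSONL training record for a single labeled message."""
    record = {
        "prompt": f"{prefix}{text}{separator}",
        # The completion starts with a space, as the fine-tuning tool recommends.
        "completion": f" {label}",
    }
    # ensure_ascii=False keeps Chinese text readable instead of \u escapes.
    return json.dumps(record, ensure_ascii=False)


line = to_jsonl_line("今天上午可以安装吗老板", "正常短信")
print(line)
# {"prompt": "sms: 今天上午可以安装吗老板 ->", "completion": " 正常短信"}
```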

We can use the conversion tool provided by openai to produce a file in the required format

openai tools fine_tunes.prepare_data -f <path to sample file>
openai tools fine_tunes.prepare_data -f sms_classifier/sms_sample_500_converted.csv

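First rename the header columns of the CSV file to prompt and completion, for the input and the output respectively. Then prepend a marker (sms: ) to each message to set it apart from ordinary text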

completion	prompt
通知短信	sms:【码上购】【网上营业厅】您的订单正在做修改证件操作，验证码：522348，非本人同意请勿向他人提供验证码信息！
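The header rename and prefix step could also be scripted instead of edited by hand. A sketch under stated assumptions: the source file is comma-separated with exactly two columns (category, then message), and `convert_csv` is a hypothetical helper name.

```python
import csv
import io


def convert_csv(src_text, prefix="sms: "):
    """Rewrite a two-column CSV with completion/prompt headers and a prompt prefix."""
    reader = csv.reader(io.StringIO(src_text))
    next(reader)  # skip the original header row
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["completion", "prompt"])
    for label, text in reader:
        writer.writerow([label, prefix + text])
    return out.getvalue()


source = "分类,短信内容\n正常短信,今天上午可以安装吗老板\n"
print(convert_csv(source))
```

Using the `csv` module rather than string splitting keeps messages that contain commas or quotes intact.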

Once the CSV file is ready, run the command to convert it

openai tools fine_tunes.prepare_data -f sms_classifier/sms_sample_500_converted.csv

The tool prints some notes and suggestions; answer Y to each prompt

(venv) D:\dev2023\openai-tutorial>openai tools fine_tunes.prepare_data -f sms_classifier/sms_sample_500_converted.csv
Analyzing...
- Based on your file extension, your file is formatted as a CSV file
- Your file contains 441 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- All prompts end with suffix ` ##`
- All prompts start with prefix `sms: `
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
Based on the analysis we will perform the following actions:
- [Necessary] Your format `CSV` will be converted to `JSONL`
- [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: y
- [Recommended] Would you like to split into training and validation set? [Y/n]: y
Your data will be written to a new JSONL file. Proceed [Y/n]: y
Wrote modified files to `sms_classifier/sms_sample_500_converted_prepared_train.jsonl` and `sms_classifier/sms_sample_500_converted_prepared_valid.jsonl`
Feel free to take a look!
Now use that file when fine-tuning:
> openai api fine_tunes.create -t "sms_classifier/sms_sample_500_converted_prepared_train.jsonl" -v "sms_classifier/sms_sample_500_converted_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 4
After you've fine-tuned a model, remember that your prompt has to end with the indicator string ` ##` for the model to start generating completions, rather than continuing with the prompt.
Once your model starts training, it'll approximately take 12.92 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.

The tool splits the samples into a training set and a validation set, so that the result can be evaluated after training
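The idea behind the split is to hold out part of the data that the model never sees during training. The sketch below illustrates that idea only; the 20% validation fraction and the seed are assumptions, and this is not the tool's own splitting logic.

```python
import random


def split_train_valid(records, valid_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for validation."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# 441 is the number of prompt-completion pairs reported in the log above.
train, valid = split_train_valid(list(range(441)))
print(len(train), len(valid))  # 353 88
```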

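It also reminds us that:

After training, requests to the model must use the same prompt format as the samples (here, ending with the indicator string ` ##`)
Training a curie model takes about 12.92 minutes; ada or babbage takes less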
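To make the first reminder concrete, a request at prediction time might be built as follows. This is a hedged sketch, not the author's code: it assumes the pre-1.0 `openai` Python package (which provides `openai.Completion.create`), an `OPENAI_API_KEY` in the environment, and the ` ##` suffix reported by the tool. The names `build_prompt` and `classify` are illustrative.

```python
def build_prompt(text, prefix="sms: ", separator=" ##"):
    """Format a message exactly like the training prompts."""
    return f"{prefix}{text}{separator}"


def classify(text, model, max_tokens=1):
    """Ask a fine-tuned model for the label of one message.

    max_tokens=1 is enough for a single-digit label; raise it if the
    labels are Chinese category names.
    """
    import openai  # assumption: openai<1.0, imported lazily so the sketch loads without it

    response = openai.Completion.create(
        model=model,
        prompt=build_prompt(text),
        max_tokens=max_tokens,
        temperature=0,  # deterministic output suits classification
    )
    return response["choices"][0]["text"].strip()


print(build_prompt("今天上午可以安装吗老板"))  # sms: 今天上午可以安装吗老板 ##
```

The `model` argument would be the name of the fine-tuned model that the training job reports when it finishes.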

Start training

Here we select the ada model: -m ada

and set the name suffix of the fine-tuned model: --suffix sms_classifier

(venv) D:\dev2023\openai-tutorial>openai api fine_tunes.create -m ada --suffix "sms_classifier" -t "sms_classifier/sms_sample_500_converted_prepared_train.jsonl" -v "sms_classifier/sms_sample_500_converted_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 4
Upload progress: 100%|██████████████████████████████████████████| 97.6k/97.6k [00:00<00:00, 95.8Mit/s]
Uploaded file from sms_classifier/sms_sample_500_converted_prepared_train.jsonl: file-HQgXiRZBxwn7In0sUax1WVdj
Upload progress: 100%|██████████████████████████████████████████| 24.3k/24.3k [00:00<?, ?it/s]
Uploaded file from sms_classifier/sms_sample_500_converted_prepared_valid.jsonl: file-gtmsXSjMpmdFowRQ8Hn0FxbX
Created fine-tune: ft-tEt9Oo95zgJ42KJvP4nS8nee
Streaming events until fine-tuning is complete...
(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-02-14 11:56:00] Created fine-tune: ft-tEt9Oo95zgJ42KJvP4nS8nee

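The output shows that a fine-tune job has been created and gives its job ID: ft-tEt9Oo95zgJ42KJvP4nS8nee

We can use this job ID later to query the job's status

Pressing Ctrl+C stops the streamed progress output but does not cancel the job

If the stream is interrupted, resume following the events with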

openai api fine_tunes.follow -i <job ID>

After a short wait the training is complete (the log below is from a separate run, so its job ID differs)

(venv) D:\dev2023\openai-tutorial>openai api fine_tunes.follow -i ft-wHXGw263e8ujLaDHNQGqYB6K
[2023-02-14 13:36:56] Created fine-tune: ft-wHXGw263e8ujLaDHNQGqYB6K
[2023-02-14 13:44:57] Fine-tune costs $0.10
[2023-02-14 13:44:58] Fine-tune enqueued. Queue number: 1
[2023-02-14 13:44:58] Fine-tune is in the queue. Queue number: 0
[2023-02-14 13:45:01] Fine-tune started
[2023-02-14 13:46:10] Completed epoch 1/4
[2023-02-14 13:47:07] Completed epoch 2/4
[2023-02-14 13:48:03] Completed epoch 3/4
[2023-02-14 13:48:59] Completed epoch 4/4
[2023-02-14 13:49:24] Uploaded model: ada:ft-personal:sms-classifier-2023-02-14-05-49-24
[2023-02-14 13:49:25] Uploaded result file: file-SaX4z4avlLH8KXDFM3UyNFoU
[2023-02-14 13:49:25] Fine-tune succeeded
Job complete! Status: succeeded
