[Python] LLaMA2-7B Instruction Fine-Tuning in Practice: An Efficient QLoRA Fine-Tuning Guide Based on the Dolly-15K Dataset

Posted the day before yesterday at 18:27

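In AI model training, there is broad agreement that data quality matters far more than data quantity. A high-quality instruction dataset is essential for improving how a large language model (LLM) performs on a specific task.

1. Introduction to the Alpaca Format

To help a model understand and carry out human instructions more precisely, researchers proposed a clear, fixed data format. It works like a "work order" that tells the model what to do, what material to work from, and what answer is expected. Its core structure is as follows: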

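### Instruction:
(The instruction. Purpose: tells the model which role to play this time, such as translator, summarizer, or questioner.)
### Input:
(The input. The specific material needed to complete the task, such as a passage to be translated. It can be empty: if the instruction is "tell a joke", no extra material is needed.)
### Response:
(The expected reply. This is the most important part, the "teaching sample". The model learns from thousands of these work orders (instruction plus input) paired with reference answers (response), so that when it later meets a similar instruction and input it can generate a suitable response by imitation.)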

Example:

### Instruction:
Translate this Chinese sentence into English
### Input:
今天天气真好。
### Response:
The weather is so nice today.
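
As a rough sketch, the template above can be assembled with a small helper. The function name `build_alpaca_prompt` and the choice to drop the `### Input:` block when no input is supplied are assumptions made for illustration; they do not come from any library.

```python
def build_alpaca_prompt(instruction: str, input_text: str = "", response: str = "") -> str:
    """Assemble one Alpaca-style sample; the Input block is skipped when empty (an assumption)."""
    parts = ["### Instruction:", instruction.strip()]
    if input_text.strip():
        parts += ["### Input:", input_text.strip()]
    parts += ["### Response:", response.strip()]
    return "\n".join(parts)


example = build_alpaca_prompt(
    instruction="Translate this Chinese sentence into English",
    input_text="今天天气真好。",
    response="The weather is so nice today.",
)
print(example)
```

For inference, leaving `response` empty yields a prompt that ends at `### Response:`, which is where the model is expected to continue.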

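2. Introduction to the Databricks Dolly-15K Dataset

Databricks Dolly-15K is an open-source dataset of roughly 15,000 high-quality instruction-response pairs, written by Databricks employees and intended for training models to follow instructions better and be more useful in practice. It is released under the CC BY-SA 3.0 license, so individuals, schools, and companies can download, use, and even modify it for free to train their own models, provided they follow the license's attribution and share-alike terms. This has given the open-source LLM community a significant boost.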
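
Each record in the dataset carries four fields: `instruction`, `context`, `response`, and `category`. A minimal sketch of mapping one record onto Alpaca's `instruction` / `input` / `output` keys is shown here. The sample record is hand-written for illustration and is not an actual row from the dataset, and the helper name `dolly_to_alpaca` is hypothetical.

```python
def dolly_to_alpaca(record: dict) -> dict:
    """Map a Dolly-15K record onto Alpaca's instruction/input/output keys."""
    return {
        "instruction": record["instruction"],
        "input": record.get("context", ""),  # context is empty for many records
        "output": record["response"],
    }


# Hand-written record in the dataset's shape (not an actual row from Dolly-15K).
sample_record = {
    "instruction": "Summarize the passage in one sentence.",
    "context": "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "response": "Plants turn light into chemical energy through photosynthesis.",
    "category": "summarization",
}

alpaca_record = dolly_to_alpaca(sample_record)
print(alpaca_record)
```

Section 3.1 deliberately uses a different mapping: it feeds `response` in as the input and trains the model to produce the `instruction`.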

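3. LLaMA2-7B Instruction Fine-Tuning in Practice

3.1 Download the Dataset and Process It in Alpaca Format

We first download the dataset and convert it using the Alpaca format described above.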

from datasets import load_dataset
from random import randrange

# Download the databricks-dolly-15k dataset
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# Format a sample as an Alpaca-style instruction
def format_instruction(sample_data):
    """
    Formats the given data into a structured instruction format.
    Parameters:
    sample_data (dict): A dictionary containing 'response' and 'instruction' keys.
    Returns:
    str: A formatted string containing the instruction, input, and response.
    """
    # Fail loudly on malformed samples instead of returning an error string,
    # which would otherwise be fed to the trainer as if it were training text
    if 'response' not in sample_data or 'instruction' not in sample_data:
        raise KeyError("'response' or 'instruction' key missing in the input data.")
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.
### Input:
{sample_data['response']}
### Response:
{sample_data['instruction']}"""

# Pick a random sample and print it in Alpaca format
print(format_instruction(dataset[randrange(len(dataset))]))

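3.2 Load and Configure the Model

We load the LLaMA2-7B model with Hugging Face's transformers library and apply 4-bit quantization (the basis of QLoRA) to cut GPU memory usage substantially. We also check whether the hardware supports Flash Attention, which can speed up training.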

# Check whether the hardware supports Flash Attention
python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
# If it does (CUDA compute capability >= 8.0), install the flash-attn package for acceleration:
MAX_JOBS=4 pip install flash-attn --no-build-isolation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Check whether the hardware supports Flash Attention
# (utils.llama_patch is a local helper module, not part of transformers)
if torch.cuda.get_device_capability()[0] >= 8:
    from utils.llama_patch import replace_attn_with_flash_attn
    print("Using flash attention")
    replace_attn_with_flash_attn()
    use_flash_attention = True
else:
    use_flash_attention = False
    print("Hardware not supported for Flash Attention")

# Get the LLaMA 2-7B model weights
# Community-hosted copy of the weights that does not require Meta AI approval
model_id = "NousResearch/Llama-2-7b-hf"
# Use this model ID once Meta AI has approved your access request
# model_id = "meta-llama/Llama-2-7b-hf"

# Configure 4-bit loading with bitsandbytes (BnB)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto"
)
# Verify that the model is using flash attention
if use_flash_attention:
    from utils.llama_patch import forward
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
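
To see why 4-bit loading matters, a back-of-the-envelope estimate of the weight memory helps. The sketch below assumes roughly 7 billion parameters and counts the weights only; quantization constants, activations, optimizer state, and the KV cache are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumed round figure for a "7B" model

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("nf4", 4)]:
    print(f"{label:>10}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of their 16-bit counterparts, which is what makes a single 16 GB card plausible for this model size.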

Next, we configure the training arguments and apply the QLoRA adapter.

import datetime
from transformers import TrainingArguments
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# Timestamp used in the output directory name
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
# Demo-run switch (set to False for a real training run)
demo_train = True
output_dir = f"models/llama-7-int4-dolly-{timestamp}"

# Training hyperparameters
args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=100, # when positive, max_steps overrides num_train_epochs
    per_device_train_batch_size=3, # largest batch size that fits on an Nvidia T4 with 16 GB
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True, # needs a GPU with bfloat16 support (Ampere or newer); a T4 does not have it
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant"
)

# QLoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
)

# Prepare the quantized model and wrap it with the QLoRA adapter
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)
qlora_model.print_trainable_parameters()
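
The figure reported by `print_trainable_parameters()` can be approximated by hand. For a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two matrices of shape `r × d_in` and `d_out × r`. The sketch below assumes the adapter targets only `q_proj` and `v_proj` in each of 32 decoder layers with hidden size 4096. Those are assumptions about the default setup rather than values taken from this post, so the real count may differ if other modules are targeted.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed architecture and targets (not taken from the post):
HIDDEN_SIZE = 4096       # hidden width assumed for LLaMA2-7B
NUM_LAYERS = 32          # decoder layers assumed for LLaMA2-7B
TARGETS_PER_LAYER = 2    # q_proj and v_proj only

R = 16                   # matches r=16 in the LoraConfig above
LORA_ALPHA = 16          # matches lora_alpha=16 in the LoraConfig above

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * TARGETS_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / R  # factor applied to the low-rank update

print(f"per module: {per_module:,}")
print(f"total trainable: {total:,}")
print(f"scaling (alpha / r): {scaling}")
```

Under these assumptions the adapter trains a few million parameters, a tiny fraction of the roughly 7 billion frozen base weights.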

Finally, we train the model with SFTTrainer (the supervised fine-tuning trainer).

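from trl import SFTTrainer

# Maximum sequence length for the dataset (1158 training samples remain after filtering)
max_seq_length = 2048

# Note: these keyword arguments follow older trl releases; newer releases are
# expected to take max_seq_length and packing through SFTConfig instead
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=args,
)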
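
With `packing=True`, short samples are concatenated and cut into fixed-length blocks so that little of each sequence is wasted on padding. The toy sketch below illustrates the idea on plain integer lists. It is not TRL's implementation, and the `pack_sequences` helper, the EOS id, and the block size are all made up for the example.

```python
from typing import Iterable, List


def pack_sequences(samples: Iterable[List[int]], block_size: int, eos_id: int) -> List[List[int]]:
    """Concatenate tokenized samples (each followed by EOS) and split into full blocks."""
    stream: List[int] = []
    for tokens in samples:
        stream.extend(tokens)
        stream.append(eos_id)
    # Keep only complete blocks; the leftover tail is dropped in this sketch.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


toy_samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
blocks = pack_sequences(toy_samples, block_size=4, eos_id=0)
print(blocks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Note that a block can straddle two samples, which is why an EOS token is placed between them.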

3.3 Start Training

Start the training run. How long it takes depends on your hardware and the number of steps you configured.

trainer.train()
trainer.save_model()  # write the adapter and tokenizer to output_dir so section 3.4 can load them
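
Before launching, it can help to estimate how much data the run will actually see. The sketch below plugs in the demo settings from above (`max_steps=100`, batch size 3, no gradient accumulation) together with the 1158-sample figure quoted in the code comment. It assumes a single GPU, and the `samples_seen` helper is hypothetical.

```python
def samples_seen(max_steps: int, per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of training sequences consumed over the whole run."""
    return max_steps * per_device_batch * grad_accum * num_devices


DATASET_SIZE = 1158  # sample count quoted in the post
seen = samples_seen(max_steps=100, per_device_batch=3, grad_accum=1)
print(f"sequences seen: {seen}")
print(f"fraction of one epoch: {seen / DATASET_SIZE:.2f}")
```

Because `max_steps` takes precedence over `num_train_epochs`, the demo run stops well short of a full pass over the data.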

3.4 Load the Fine-Tuned Model for Inference

Once training finishes, we can load the saved model and test its ability to generate instructions.

# Snippet 1: inference with the fine-tuned LLaMA2-7B
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# model_dir = "models/llama-7-int4-dolly"
# Replace with the output_dir of your own training run
model_dir = "models/llama-7-int4-dolly-20240404_033139"

# Load the base LLM together with the trained adapter, plus the tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    model_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Snippet 2: generate a test sample
from datasets import load_dataset
from random import randrange

# Load the dataset from the hub and pick one sample
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]

prompt = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.
### Input:
{sample['response']}
### Response:"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.9)

print(f"Prompt:\n{sample['response']}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"Ground truth:\n{sample['instruction']}")

Expected output format:

Prompt:
The process of photosynthesis in plants converts light energy, usually from the sun, into chemical energy stored in glucose. This vital reaction takes place in the chloroplasts of plant cells and requires water and carbon dioxide, releasing oxygen as a byproduct.

Generated instruction:
Explain the process of photosynthesis in plants.

Ground truth:
Describe how photosynthesis works in plants.

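Output breakdown:

Prompt: a "response" drawn at random from the dataset. Here it is a factual statement about photosynthesis.
Generated instruction: the "instruction" your fine-tuned model produced from the response above. The model has learned to turn a declarative passage into a question-style instruction that could prompt an LLM to generate similar text.
Ground truth: the human-written instruction paired with that response in the dataset. It serves as the reference answer for judging whether the generated instruction approaches human quality.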
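
One simple way to put a number on how close a generated instruction is to the ground truth is a token-overlap F1 score. This is a rough sketch only: the post does not specify a metric, the `token_f1` helper is hypothetical, and word overlap misses paraphrases that share meaning but not vocabulary.

```python
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """F1 over lowercase word tokens, counting repeated words by multiplicity."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


generated = "Explain the process of photosynthesis in plants."
truth = "Describe how photosynthesis works in plants."
score = token_f1(generated, truth)
print(f"token F1: {score:.2f}")
```

The two instructions in the example ask for the same thing yet share only three words, which shows why a lexical score like this is best treated as a coarse signal alongside manual review.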

LLaMA2, QLoRA, Python, transformers, language model fine-tuning