
Types of Reward Systems

RewardBench splits reward models into five types: Seq. Classifier, Custom Classifiers, DPO, Random, and Generative. The two I know best are Sequence Classifiers and Generative, so this post mainly covers those two.


Update: I re-read the R1 paper, and it seems no process reward model (PRM) was used. Here is the original wording:

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

Reward Models

As for DPO, my experience so far has only been building preference datasets and running DPO training on them, not using DPO to train a reward model.


Sequence Classifiers

My understanding, taking Qwen as an example: instead of the vocab_size-way language-modeling head used in [Qwen2ForCausalLM](transformers/src/transformers/models/qwen2/modeling_qwen2.py at main · huggingface/transformers), use the classification head of Qwen2ForSequenceClassification with num_labels=1, so the model outputs a single scalar score per sequence.
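Below is a minimal sketch of this idea. The model name, prompt, and response are placeholders; in practice you would load a trained reward-model checkpoint rather than a base instruct model.

```python
# Minimal sketch: score a prompt-response pair with a Qwen2 model whose
# head is a single-label (num_labels=1) sequence-classification head.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; use a trained RM checkpoint in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=1, torch_dtype=torch.bfloat16
)
model.config.pad_token_id = tokenizer.pad_token_id

prompt = "What is 2 + 2?"
response = "2 + 2 equals 4."
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt},
     {"role": "assistant", "content": response}],
    tokenize=False,
)
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = model(**inputs).logits[0, 0].item()  # one scalar per sequence
print(reward)
```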

Generative Reward Models (GRM)

For generative reward models, you can look at these two papers (the two arXiv entries in Further Reading below); the reward score is obtained by directly constructing a judging prompt.

When generating the 800k samples in the R1 paper, Section 2.3.3 mentions using DeepSeek-V3 as a generative reward model:

However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment.

A GRM is also used in R1's reinforcement learning stage:

For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts.

The dataset for training a generative reward model contains exactly two responses per sample: one chosen and one rejected.
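For concreteness, here is an illustrative preference record; the field names follow the common chosen/rejected convention (the actual schema of Skywork-Reward-Preference-80K may differ):

```python
# Illustrative preference record (assumed schema): one prompt, one chosen
# response, one rejected response.
preference_example = {
    "prompt": "How do I reverse a list in Python?",
    "chosen": "Call my_list.reverse() to reverse in place, or use my_list[::-1] to get a reversed copy.",
    "rejected": "Just sort the list; sorting puts it in reverse order.",
}
```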

A trained GRM can do at least two things; take Skywork-Critic-Llama-3.1-70B as an example:

Used as a preference-data selector:

The prompt for choosing the better of response A and response B:

"""Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user\'s instructions and answers the user\'s question better. 
Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. 
Please directly output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

[User Question]
{input}

[The Start of Assistant A's Answer]
{response_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{response_b}
[The End of Assistant B's Answer]
"""

Used as a scoring judge:

prompt:

 """请扮演一个专业的对话质量评价专家,能够从多个维度诊断和分析 AI 助手对用户问题的回答,并进行总体打分(分值范围是 1-5)。你的评估应考虑回答的有用性、相关性、准确性、深度、创造性、安全性等众多维度,请注意,不同任务类型的指令对评估分析维度的侧重不一样,需要根据具体的问题进行分析。

[用户问题]
{input}

[助手回答的开始]
{response_a}
[助手回答的结束]

你的详细评估和总体打分为: """

The model returns:

**Evaluation:**

1. **Helpfulness**: The answer directly addresses the user's question and provides clear steps and calculations; it is very helpful.
2. **Relevance**: The answer is fully relevant to the user's question and never strays off topic.
3. **Accuracy**: The calculation is error-free and the result is correct.
4. **Depth**: The answer gives detailed steps and explanations that help the user understand how the problem is solved.
5. **Creativity**: Although the answer is a straightforward calculation, each step is explained in simple language with explicit arithmetic, making it easy to follow.
6. **Safety**: The answer contains nothing misleading or dangerous.

**Overall score: 5**

**Summary**: The answer is clear, accurate, and helpful, and fully meets the user's needs. The step-by-step explanation and calculation help the user understand the solution process. Nothing needs improvement, so it receives full marks.
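Since the judge replies in free-form text, the overall score still has to be parsed out of the reply. A small sketch follows; the regex assumes the "Overall score" wording produced by the prompt above and would need adjusting for a different judge or language:

```python
# Sketch: extract the 1-5 overall score from the judge's free-form reply.
import re

def parse_overall_score(judge_output: str):
    match = re.search(r"Overall score\D{0,5}([1-5])", judge_output)
    return int(match.group(1)) if match else None

print(parse_overall_score("**Overall score: 5**"))  # -> 5
```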

Rule-Based Reward Systems

This part is different from reward models. Examples I can think of: the format reward in R1, and accuracy rewards based on execution results from a code sandbox.

Perplexity rewards, repetition rewards, length rewards, Chinese-character rewards, ...

This project is a good reference.
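To make the idea concrete, here is a sketch of two such rule-based rewards; the tag names, regex, and reward values are my own assumptions for illustration, not R1's exact implementation:

```python
# Sketch of an R1-style format reward: 1.0 if the completion wraps its
# reasoning in <think>...</think> followed by <answer>...</answer>, else 0.0.
import re

FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def length_reward(completion: str, target_len: int = 512) -> float:
    # Toy length shaping: penalize completions that overshoot a target length.
    return min(1.0, target_len / max(len(completion), 1))

print(format_reward("<think>2+2=4</think>\n<answer>4</answer>"))  # 1.0
print(format_reward("The answer is 4."))                          # 0.0
```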

The 5 RewardBench Types

  1. Sequence Classifiers (Seq. Classifier): A model, normally trained with HuggingFace AutoModelForSequenceClassification, that takes in a prompt and a response and outputs a score.

  2. Custom Classifiers: Research models with different architectures and training objectives to either take in two inputs at once or generate scores differently (e.g. PairRM and Stanford SteamSHP).

  3. DPO: Models trained with Direct Preference Optimization (DPO), with modifiers such as -ref-free or -norm changing how scores are computed. Note: This also includes other models trained with implicit rewards, such as those trained with KTO.

  4. Random: Random choice baseline.

  5. Generative: Prompting fine-tuned models to choose between two answers, similar to MT Bench and AlpacaEval.

Further Reading

https://mp.weixin.qq.com/s/15zzzDJsFap8mFvofbEtgQ

arXiv:2408.15240, arXiv:2410.12832 (Cool Papers)

Reward Bench Leaderboard - a Hugging Face Space by allenai

Skywork/Skywork-Critic-Llama-3.1-70B · Hugging Face

Skywork/Skywork-Reward-Preference-80K-v0.1 · Datasets at Hugging Face


arXiv:2408.02666

LLaMA-Factory/src/llamafactory/chat/hf_engine.py at main · hiyouga/LLaMA-Factory

RewardBench: Evaluating Reward Models for Language Modeling