

Author: 五速梦信息网
Date: June 19, 2026, 08:26


DPO Explained: From Theory to Implementation

DPO paper: https://arxiv.org/pdf/2305.18290
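1. What is DPO?

DPO (Direct Preference Optimization) is an algorithm that optimizes a language model directly on human preference data. Unlike traditional reinforcement learning methods built on a reward model, such as PPO, it uses a parameterization based on the Bradley-Terry model to tie the human preference probability directly to the language model's output probabilities. This avoids training an explicit reward model.

2. What problem does DPO solve?

In the RLHF (Reinforcement Learning from Human Feedback) framework, a reward model is usually trained to score the language model's generations. Training that reward model and then optimizing the policy with reinforcement learning (for example PPO) tends to add complexity and instability:

- The reward model may overfit or drift away from true human preferences.
- Optimizing the policy with reinforcement learning requires balancing exploration against convergence, and can trigger problems such as KL divergence explosion.

DPO offers a more direct route. Through a reparameterization, preference modeling is embedded in the language model's own training objective, which bypasses reward modeling and simplifies the training pipeline.

3. Core formulas of DPO

The core idea of DPO is to use the Bradley-Terry preference model to express the preference probability through log-ratios of language model output probabilities, with a temperature parameter \( \beta \) that controls the strength of the KL penalty.

The human preference probability is modeled as:

\[
p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)}
\]

In practice, the parameterized policy \( \pi_\theta \) is trained by minimizing the following loss, which is the negative log-likelihood of the observed preferences:

\[
L_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \right]
\]

where:

- \( \sigma \) is the sigmoid function.
- \( y_w \) and \( y_l \) are the responses that human annotators labeled as preferred and dispreferred, respectively.

Minimizing this loss makes the policy more likely to generate the outputs humans prefer and less likely to generate the ones they reject.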
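The log-ratio form is not arbitrary. As a sketch of the missing step, following the derivation in the DPO paper linked at the top: the KL-constrained reward maximization problem has a closed-form optimum, and solving that optimum for the reward makes the normalizer cancel inside the Bradley-Terry model. The symbols \( r^* \) (the latent reward) and \( Z(x) \) (the partition function) are introduced here for the sketch and do not appear in the text above.

```latex
\begin{align*}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)
   \exp\!\left(\tfrac{1}{\beta}\, r^*(x, y)\right)
   && \text{(optimum of the KL-constrained objective)} \\
r^*(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
   && \text{(solve for the reward)} \\
p^*(y_1 \succ y_2 \mid x) &= \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)
   && \text{(Bradley-Terry model)} \\
&= \sigma\!\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}
   - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)}\right)
   && \text{(the term } \beta \log Z(x) \text{ cancels)}
\end{align*}
```

Since \( \sigma(a) = 1 / (1 + e^{-a}) \), the last line is the same expression as the preference probability written with \( 1 + \exp(\cdot) \) in the denominator.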
4. How to understand DPO

DPO's optimization can be understood from the following angles.

Reward reparameterization. The reward is expressed through the log-ratio of the policy's and the reference model's output probabilities, so no explicit reward model has to be trained. The implicit reward is defined as:

\[
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
\]

Gradient. The gradient of the DPO loss is:

\[
\nabla_\theta L_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \cdot \big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big) \right]
\]

Intuitively, a gradient-descent step on this loss:

- raises the probability of generating \( y_w \);
- lowers the probability of generating \( y_l \);
- gives more weight to examples where \( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \) is large, that is, where the implicit reward ranks the pair wrongly.

Temperature parameter \( \beta \). \( \beta \) controls the strength of the KL penalty and so balances how far the policy's distribution may move away from the reference model's.

5. Worked example

Suppose that for a prompt \( x \) the model generated two candidate responses \( y_1 \) and \( y_2 \), and the human preference label says:

- \( y_1 \) is preferred (\( y_w = y_1 \));
- \( y_2 \) is dispreferred (\( y_l = y_2 \)).

The output probabilities are:

\[
\pi_\theta(y_1 \mid x) = 0.6, \quad \pi_\theta(y_2 \mid x) = 0.4, \quad \pi_{\text{ref}}(y_1 \mid x) = 0.5, \quad \pi_{\text{ref}}(y_2 \mid x) = 0.5
\]

Compute the implicit rewards:

\[
\hat{r}_\theta(x, y_1) = \beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} = \beta \log \frac{0.6}{0.5}
\]

\[
\hat{r}_\theta(x, y_2) = \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} = \beta \log \frac{0.4}{0.5}
\]

The preference probability is then:

\[
p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\big(\hat{r}_\theta(x, y_2) - \hat{r}_\theta(x, y_1)\big)}
\]

For \( \beta = 1 \) the exponent is \( \log(0.4 / 0.6) \), so the probability evaluates to \( 1 / (1 + 2/3) = 0.6 \).

The training objective pushes the model to further increase the probability of \( y_1 \) while decreasing the probability of \( y_2 \).

6. Differences between DPO and PPO

| Aspect | DPO | PPO |
| --- | --- | --- |
| Core idea | Optimizes the language model directly on human preferences | Optimizes the policy with reinforcement learning driven by a reward signal |
| Needs a reward model? | No | Yes |
| Objective | Maximize preference probability | Maximize cumulative reward |
| Implementation complexity | Lower | Higher |
| Stability | Higher | May suffer from problems such as KL explosion |

For more on the KL explosion problem, see the author's other blog post, "A Detailed Analysis of KL Explosions in PPO" (bilingual, Chinese and English).

7. Summary

DPO provides an efficient and stable way to optimize language models, and is well suited to training better models on large-scale human preference data. Compared with traditional RLHF methods, DPO not only simplifies the implementation but also offers stronger theoretical consistency and practical reliability.

Direct Preference Optimization (DPO): A Comprehensive Overview

What Problem Does DPO Solve?

Direct Preference Optimization (DPO) addresses the limitations of Reinforcement Learning from Human Feedback (RLHF) by offering a simpler and more direct optimization method. RLHF traditionally uses a reward model and Proximal Policy Optimization (PPO) to align language models with human preferences. However, PPO introduces complexity: it needs a separately trained reward model plus reinforcement learning updates, which involve policy rollouts and value function estimation.

DPO simplifies this process by directly optimizing the likelihood of human-preferred responses relative to dispreferred ones, without an explicit reward model or reinforcement learning steps. Instead, it reformulates the optimization as a maximum likelihood estimation (MLE) problem.

Core Formula of DPO

The central idea of DPO is to use a Bradley-Terry preference model to define probabilities for human preferences based on the log-probabilities output by the model. Given:

- \( \pi_\theta \): the policy (the current model being optimized)
- \( \pi_{\text{ref}} \): the reference policy (a pre-trained model used as a baseline)
- \( y_w \): the preferred response
- \( y_l \): the dispreferred response
- \( \beta \): a temperature hyperparameter controlling regularization strength

DPO models human preferences using the difference between the policy-to-reference log-ratios of the preferred and dispreferred outputs.
The loss function is:

\[
L_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right) \right]
\]

Key Points in the Formula:

- The loss directly optimizes the relative log-probabilities of preferred (\( y_w \)) versus dispreferred (\( y_l \)) responses.
- \( \beta \) controls the strength of KL-regularization between the policy and the reference model.
- \( \sigma(\cdot) \) is the sigmoid function, which maps the reward difference to a preference probability between 0 and 1.
- It eliminates the need for explicit reward modeling, treating the policy's log-ratios against the reference model as implicit rewards.

Understanding the Formula
Implicit Reward Calculation

DPO implicitly defines a reward function based on the policy and reference model:

\[
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
\]

This means the reward is proportional to the log-likelihood ratio between the current and reference models.
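As a minimal sketch of this definition, the implicit reward can be computed from sequence log-probabilities. The sketch assumes an autoregressive model, so that the log-probability of a response is the sum of its per-token log-probabilities. The helper names `sequence_logp` and `implicit_reward` and all numbers are hypothetical and chosen for illustration only.

```python
def sequence_logp(token_logps):
    """Log-probability of a whole response: the sum of its per-token log-probs."""
    return sum(token_logps)


def implicit_reward(policy_logp, ref_logp, beta):
    """Implicit DPO reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    Both inputs are log-probabilities, so the log-ratio is a difference.
    """
    return beta * (policy_logp - ref_logp)


# Hypothetical per-token log-probs for one three-token response.
policy_tokens = [-0.10, -0.30, -0.20]
ref_tokens = [-0.20, -0.40, -0.30]

reward = implicit_reward(
    sequence_logp(policy_tokens),
    sequence_logp(ref_tokens),
    beta=0.1,
)
print(f"implicit reward = {reward:.4f}")
```

A positive value means the policy assigns the response more probability than the reference model does. When the two models agree, the reward is zero.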
Optimization Objective

DPO optimizes the probability of preferred completions being ranked higher than dispreferred completions. Specifically, it increases the likelihood of preferred completions (\( y_w \)) while decreasing the likelihood of dispreferred ones (\( y_l \)). The gradient of the loss is:

\[
\nabla_\theta L_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \left( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \right) \right]
\]
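One way to sanity-check this gradient is to treat the two policy log-probabilities as the free variables for a single preference pair and compare the analytic expression against finite differences. The sketch below uses only the standard library. The function names and input values are hypothetical, and the check covers the derivative with respect to \( \log \pi_\theta(y_w \mid x) \) and \( \log \pi_\theta(y_l \mid x) \), not the full chain rule through model parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """-log sigmoid(r_w - r_l) for one pair, with r = beta * (logp - ref_logp)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))


def analytic_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Derivatives of the pair loss w.r.t. (logp_w, logp_l), from the gradient formula."""
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    weight = sigmoid(r_l - r_w)  # close to 1 when the pair is ranked wrongly
    return -beta * weight, beta * weight


def numeric_grads(logp_w, logp_l, ref_logp_w, ref_logp_l, beta, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    g_w = (dpo_pair_loss(logp_w + eps, logp_l, ref_logp_w, ref_logp_l, beta)
           - dpo_pair_loss(logp_w - eps, logp_l, ref_logp_w, ref_logp_l, beta)) / (2 * eps)
    g_l = (dpo_pair_loss(logp_w, logp_l + eps, ref_logp_w, ref_logp_l, beta)
           - dpo_pair_loss(logp_w, logp_l - eps, ref_logp_w, ref_logp_l, beta)) / (2 * eps)
    return g_w, g_l


# Hypothetical pair in which the policy currently favors the dispreferred response.
args = (-1.2, -0.9, -1.0, -1.0, 0.5)
print("analytic:", analytic_grads(*args))
print("numeric: ", numeric_grads(*args))
```

The derivative with respect to the preferred log-probability is negative and the one with respect to the dispreferred log-probability is positive, so a gradient-descent step raises the former and lowers the latter.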
Weighting by Confidence

The weighting term \( \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \) is large when the model incorrectly assigns a higher implicit reward to the dispreferred completion and close to zero when the pair is already ranked correctly by a wide margin. Updates therefore focus on examples where the model is uncertain or wrong, which makes training more effective.

Example Analysis

Suppose we have the following preference for a prompt:

Input Prompt: "What is the capital of France?"

Completions:

- \( y_w \): "The capital of France is Paris." (Preferred)
- \( y_l \): "The capital of France is London." (Dispreferred)

The log-probabilities from the current model (\( \pi_\theta \)) and the reference model (\( \pi_{\text{ref}} \)) are:

- \( \log \pi_\theta(y_w \mid x) = -0.2 \), \( \log \pi_\theta(y_l \mid x) = -0.8 \)
- \( \log \pi_{\text{ref}}(y_w \mid x) = -0.3 \), \( \log \pi_{\text{ref}}(y_l \mid x) = -0.7 \)

Using the DPO loss formula:

Calculate the log-probability ratios. Because the values above are already log-probabilities, each log-ratio is a difference:

\[
r_w = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} = -0.2 - (-0.3) = 0.1
\]

\[
r_l = \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} = -0.8 - (-0.7) = -0.1
\]

Compute the preference difference:

\[
\Delta r = \beta (r_w - r_l) = \beta \big(0.1 - (-0.1)\big) = 0.2\beta
\]

Final loss:

\[
L = -\log \sigma(\Delta r) = -\log \sigma(0.2\beta)
\]
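To make the example concrete, here is a small sketch that evaluates the loss numerically. It reads the four numbers in the example as log-probabilities, so each log-ratio is a plain difference. The function names and the `beta` values tried are assumptions for illustration, since the example leaves \( \beta \) symbolic.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_example_loss(beta, logp_w=-0.2, logp_l=-0.8, ref_logp_w=-0.3, ref_logp_l=-0.7):
    """DPO loss for the single preference pair of the example."""
    r_w = logp_w - ref_logp_w  # log-ratio for the preferred response
    r_l = logp_l - ref_logp_l  # log-ratio for the dispreferred response
    return -log_sigmoid(beta * (r_w - r_l))


for beta in (0.1, 0.5, 1.0):  # hypothetical beta values
    print(f"beta={beta}: loss={dpo_example_loss(beta):.4f}")
```

Every value printed is below \( \log 2 \approx 0.693 \), the loss of a policy identical to the reference model, because the policy in this example already favors \( y_w \) slightly more than the reference does.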
The optimization encourages increasing the likelihood of \( y_w \) while reducing that of \( y_l \).

DPO vs PPO: Key Differences

| Aspect | DPO | PPO |
| --- | --- | --- |
| Reward Model | Implicitly modeled via log-probabilities. | Requires an explicit, learned reward model. |
| Algorithm Type | Maximum Likelihood Estimation (MLE). | Reinforcement Learning with Policy Gradients. |
| Training Complexity | Simpler and requires fewer hyperparameters. | More complex, with value function updates and clipping mechanisms. |
| Stability | More stable due to direct optimization. | Requires careful tuning to avoid divergence. |
| Data Requirement | Relies on preference data directly. | Requires preference data and rollout data for updates. |
| KL Regularization | Controlled by parameter \( \beta \). | In RLHF, typically a KL penalty toward the reference model added to the reward; PPO clipping additionally limits the size of each policy update. |

Why is DPO Effective?

- Simplified Training Process: No need for reward model training or complex PPO pipelines.
- Implicit Reward Modeling: Avoids separate reward models and leverages pre-trained probabilities.
- Theoretical Guarantees: Based on Bradley-Terry models, ensuring consistency under reasonable assumptions.
- Practical Applicability: Compatible with public preference datasets without requiring new data collection.

Implementation Example

```python
import torch
import torch.nn.functional as F


def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """Per-pair DPO losses and per-completion implicit rewards.

    pi_logps / ref_logps: sequence log-probs of every completion under the
        policy and the reference model, shape (B,).
    yw_idxs / yl_idxs: indices into the batch of the preferred and
        dispreferred completion of each pair, shape (T,).
    beta: temperature controlling the strength of the KL penalty.
    """
    # Pick out the preferred and dispreferred log-probs for each pair.
    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    # Preferred-minus-dispreferred log-ratio under each model.
    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    # -log sigmoid(beta * (r_w - r_l)), one loss value per pair.
    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    # Implicit rewards are detached: they are for logging, not for gradients.
    rewards = beta * (pi_logps - ref_logps).detach()
    return losses, rewards
```

Conclusion

DPO offers a lightweight alternative to PPO for preference optimization by directly leveraging preference data without relying on complex reinforcement learning frameworks. It is particularly effective for aligning language models with human preferences and offers theoretical guarantees grounded in Bradley-Terry models.
Given its simplicity and effectiveness, DPO is increasingly used for tasks requiring preference-based fine-tuning of large language models.

Postscript

Completed in Shanghai at 20:52 on December 26, 2024, with the assistance of the GPT-4o large model.