# Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) is a technique that fine-tunes language models using human-generated preference data to align model outputs with desired behaviors.
vLLM can be used to generate the completions for RLHF. One way to do this is with existing libraries such as TRL, OpenRLHF, verl, and unsloth.
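As a rough illustration of the generation side, the sketch below samples several completions per prompt with vLLM's offline `LLM` API, which is the shape of output GRPO-style algorithms typically need. The model name, prompts, and reward scoring are placeholders; in practice they come from your training framework.

```python
# Minimal sketch: using vLLM to generate completions for an RLHF / GRPO loop.
# The model, prompts, and reward handling below are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # small model used purely for illustration

# Sample several completions per prompt, as GRPO-style training typically requires.
sampling_params = SamplingParams(n=4, temperature=1.0, max_tokens=128)

prompts = ["Explain RLHF in one sentence."]  # placeholder training prompts
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    for completion in output.outputs:
        text = completion.text
        # Score `text` with your reward model here, then hand the
        # (prompt, completion, reward) tuples to the trainer.
```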
If you don't want to use an existing library, see the following basic examples to get started:
- Training and inference processes are located on separate GPUs (inspired by OpenRLHF)
- Training and inference processes are colocated on the same GPUs using Ray (a rough sketch of this pattern follows the list)
- Utilities for performing RLHF with vLLM
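The colocated setup alternates between generation and training on the same devices, which requires vLLM to release GPU memory while the trainer runs. The sketch below assumes vLLM's sleep mode (`enable_sleep_mode=True`, `llm.sleep()` / `llm.wake_up()`) on a CUDA GPU with a recent vLLM release; weight synchronization is omitted here and is what the utilities example above covers.

```python
# Rough sketch of alternating generation and training on the same GPUs
# using vLLM's sleep mode. Weight syncing back into vLLM is intentionally
# left out; see the RLHF utilities example for how that is done.
from vllm import LLM, SamplingParams

def train_step(outputs):
    """Placeholder for a training update performed on the freed GPUs."""
    pass

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)
prompts = ["Explain RLHF in one sentence."]  # placeholder prompts

for _ in range(3):  # a few illustrative iterations
    outputs = llm.generate(prompts, SamplingParams(n=4, max_tokens=64))

    llm.sleep(level=1)   # release GPU memory held by vLLM before training
    train_step(outputs)  # the trainer updates the policy on the same GPUs
    llm.wake_up()        # restore vLLM before the next round of generation
    # The updated policy weights would be pushed into vLLM at this point.
```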
See the following notebooks showing how to use vLLM for GRPO (Group Relative Policy Optimization):