SFT
SFT (supervised fine-tuning) is the foundational training method: it teaches a model to follow instructions using demonstration data.
Basic SFT Training
SFT Parameters
data_set: List of StringThread objects containing training examples
model: Training model instance
logger: Logger for tracking training metrics
lr: Learning rate
samples_per_batch: Batch size
max_grad_norm: Gradient clipping norm
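To make the roles of lr, samples_per_batch, and max_grad_norm concrete, here is a minimal sketch in plain Python. It is not this library's API: the function names (iter_batches, sft_loss, clip_grad_norm, sgd_step) are hypothetical, plain SGD stands in for whatever optimizer the trainer actually uses, and the clipping is assumed to be by global L2 norm.

```python
import math


def iter_batches(data_set, samples_per_batch):
    """Yield consecutive slices of at most samples_per_batch examples."""
    for start in range(0, len(data_set), samples_per_batch):
        yield data_set[start:start + samples_per_batch]


def sft_loss(token_logprobs):
    """Mean negative log-likelihood of the demonstration tokens."""
    return -sum(token_logprobs) / len(token_logprobs)


def clip_grad_norm(grads, max_grad_norm):
    """Rescale grads so their global L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_grad_norm:
        return list(grads)
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads]


def sgd_step(params, grads, lr, max_grad_norm):
    """One plain SGD update (an assumption) using clipped gradients."""
    clipped = clip_grad_norm(grads, max_grad_norm)
    return [p - lr * g for p, g in zip(params, clipped)]
```

In this sketch a gradient of [3.0, 4.0] has norm 5, so with max_grad_norm=1.0 it is scaled down to norm 1 before the learning rate is applied. Gradients already within the limit pass through unchanged.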
PPO
PPO (Proximal Policy Optimization) is a reinforcement learning algorithm that uses a reward function to improve model behavior.
Basic PPO Training
PPO Parameters
data_set: List of StringThread prompts
model: Policy model for training
value_model: Value model for advantage estimation
scoring_fn: Function that returns reward scores
lr_policy: Policy learning rate
lr_value: Value learning rate
kl_beta: KL divergence penalty coefficient
clip_range: PPO clipping range
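The sketch below shows how clip_range and kl_beta typically enter the standard PPO objective. It is illustrative only and not this library's implementation: the function names are hypothetical, the advantage is simplified to reward minus the value estimate (real trainers often use a multi-step estimator), and the KL penalty is assumed to be subtracted from the score using a per-sample log-probability difference against a reference model.

```python
import math


def kl_penalized_reward(score, logp_policy, logp_ref, kl_beta):
    """Score from scoring_fn minus a KL penalty (assumed per-sample estimate)."""
    return score - kl_beta * (logp_policy - logp_ref)


def simple_advantages(rewards, values):
    """Simplified advantage: reward minus the value model's estimate."""
    return [r - v for r, v in zip(rewards, values)]


def ppo_policy_loss(logp_new, logp_old, advantages, clip_range):
    """Clipped surrogate loss, averaged over samples (to be minimized)."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new / pi_old
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the clipped and unclipped terms.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)


def value_loss(values, returns):
    """Mean squared error between value estimates and observed returns."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)
```

With clip_range=0.2, a sample whose probability ratio has grown to 2.0 contributes as if the ratio were 1.2 when its advantage is positive, which limits how far a single update can move the policy. The policy loss and value loss would be stepped with lr_policy and lr_value respectively.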
GRPO
GRPO (Group Relative Policy Optimization) is similar to PPO, but it generates multiple completions per prompt and uses their relative ranking for training.
Basic GRPO Training
GRPO Parameters
data_set: List of StringThread prompts
model: Training model
scoring_fn: Function that returns reward scores
completions_per_sample: Number of completions per prompt
lr: Learning rate
kl_beta: KL divergence penalty coefficient
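As a rough illustration of how completions_per_sample scores for one prompt can be turned into a relative training signal, the sketch below normalizes each score against its group's mean and standard deviation. This is the common GRPO formulation; whether this library computes it in exactly this way is an assumption, and group_relative_advantages is a hypothetical name.

```python
import statistics


def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one prompt's group of completions.

    scores: rewards from scoring_fn for the completions of a single prompt.
    eps: small constant that avoids division by zero when all scores match.
    """
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population standard deviation
    return [(s - mean) / (std + eps) for s in scores]
```

Completions that score above the group average get a positive advantage and those below get a negative one. When every completion receives the same score, all advantages are zero and the group contributes no gradient. Because the baseline comes from the group itself, no separate value model appears in the parameter list above.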
DPO
DPO (Direct Preference Optimization) trains models directly on preference data (preferred vs. non-preferred responses) without an explicit reward model.
Basic DPO Training
DPO Parameters
data_set: List of tuples containing (preferred_response, non_preferred_response)
model: Training model
logger: Logger for tracking metrics
lr: Learning rate
samples_per_batch: Batch size
beta: DPO beta parameter
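To show what beta controls, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not this library's code: dpo_loss is a hypothetical name, and the inputs are assumed to be summed log-probabilities of each response under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(logp_preferred, logp_non_preferred,
             ref_logp_preferred, ref_logp_non_preferred, beta):
    """Standard DPO loss: -log(sigmoid(beta * margin)) for one pair."""
    # How much more the trained model favors each response than the reference.
    preferred_shift = logp_preferred - ref_logp_preferred
    non_preferred_shift = logp_non_preferred - ref_logp_non_preferred
    z = beta * (preferred_shift - non_preferred_shift)
    # Numerically stable form of log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the model and the reference agree, the margin is zero and the loss is log(2). The loss falls as the model shifts probability toward the preferred response relative to the reference, and a larger beta makes the loss react more sharply to the same margin.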

