Unleashing Collective Intelligence: How Group Relative Policy Optimization Empowers DeepSeek R1 with Superior Reasoning
Dr. Usman Ghani Khan
May 13 2025 05:19 PM

DeepSeek R1:


DeepSeek R1 is an open-source model released by DeepSeek AI, a Chinese company. It was trained primarily with reinforcement learning, preceded by a small "cold start" fine-tuning phase. The model assists with reasoning tasks, fact-based queries, text generation, and summarization. The DeepSeek research team introduced a new algorithm, Group Relative Policy Optimization (GRPO), that dramatically increased the model's reasoning abilities. On reasoning tasks, its performance is comparable to OpenAI's o1 model.


Reinforcement Learning (RL):


RL is a type of machine learning in which an AI model, called an agent, takes actions in an environment and receives feedback. If an action is good, the agent receives a high reward; if it is bad, it receives a low or negative reward. The goal is to learn which actions lead to the maximum reward.
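To make this loop concrete, here is a minimal sketch in Python; the guessing environment, its reward values, and the random policy are invented for illustration and have nothing to do with DeepSeek R1's actual training setup.

import random

# A toy environment: the agent must guess a hidden number between 0 and 9.
class GuessEnv:
    def __init__(self):
        self.target = random.randint(0, 9)

    def step(self, action):
        # Feedback: high reward for a correct guess, a small penalty otherwise.
        return 1.0 if action == self.target else -0.1

env = GuessEnv()
total_reward = 0.0
for _ in range(20):                   # 20 interactions with the environment
    action = random.randint(0, 9)     # an untrained, random policy choosing actions
    reward = env.step(action)         # the environment's feedback for that action
    total_reward += reward            # the agent's goal is to maximize this over time
print("Total reward:", total_reward)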


Training:


The DeepSeek R1 model uses the DeepSeek-V3-Base model as the base model. The pipeline designed to train DeepSeek R1 consists of four stages:

    1. Cold Start:


Before starting RL, the DeepSeek-V3-Base model was fine-tuned on a small amount of highly detailed chain-of-thought (CoT) data. This improved the readability of the model's responses and its overall performance.

    2. Reasoning-Oriented Reinforcement Learning:


This stage uses Group Relative Policy Optimization (GRPO), a modified version of Proximal Policy Optimization (PPO), as the RL algorithm. It employs a rule-based reward system combining accuracy, format, and language-consistency rewards (a sketch of such a reward function is shown after this list). This stage focuses mainly on reasoning tasks.

    3. Rejection Sampling and Supervised Fine-Tuning:


Once the RL training in stage 2 finished, the resulting model was used to create new, higher-quality data for the next round of training: about 600K reasoning samples and 200K non-reasoning samples. The model was then trained on this data for two training cycles, called epochs. This stage primarily targets text generation and other general-purpose tasks.

    4. Reinforcement Learning for All Scenarios:


The model goes through a second RL phase to align it with human preferences: the aim is to keep it strong at reasoning while making it more helpful and harmless. For reasoning data, the rule-based reward system discussed above is used; for non-reasoning data, which covers more complex and sensitive topics, the rewards are based on human preference. Helpfulness is judged only on the final summary, while harmlessness is judged on the entire response.
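To make stage 2's reward system concrete, here is a rough sketch of a rule-based reward in Python. The tag patterns, weights, and the language check are illustrative assumptions, not DeepSeek's actual reward code.

import re

def rule_based_reward(response: str, reference_answer: str, target_lang: str = "en") -> float:
    """Toy rule-based reward combining accuracy, format, and language-consistency terms."""
    reward = 0.0

    # Accuracy reward: does the text inside <answer> tags match the reference answer?
    match = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if match and match.group(1).strip() == reference_answer:
        reward += 1.0

    # Format reward: did the model show its reasoning inside <think> tags?
    if re.search(r"<think>.*?</think>", response, re.S):
        reward += 0.5

    # Language-consistency reward: crude proxy, mostly-ASCII text when the target language is English.
    if target_lang == "en" and response:
        ascii_ratio = sum(ch.isascii() for ch in response) / len(response)
        if ascii_ratio > 0.9:
            reward += 0.2

    return reward

# A well-formatted, correct, English response earns the full reward of 1.7.
sample = "<think>2 + 2 = 4</think><answer>4</answer>"
print(rule_based_reward(sample, "4"))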


Proximal Policy Optimization (PPO):


PPO is a reinforcement learning algorithm from the family of policy gradient methods. Policy gradient methods directly optimize the policy (the strategy the agent uses to decide actions) by adjusting its parameters along gradients (directions of improvement) derived from the rewards it receives. During training, PPO uses a clipped objective to ensure that policy updates do not move too far from the old policy, since overly large updates can make the policy unstable. PPO requires substantial computational resources because it uses a separate critic model to estimate the value of a response, and its absolute reward evaluation makes it harder to generalize across different reasoning domains.
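The clipped objective at the heart of PPO can be sketched in a few lines of PyTorch; the log-probabilities and advantages below are stand-in numbers, and a real implementation would also train the critic and include value-loss and entropy terms.

import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, eps=0.2):
    # Clipped PPO surrogate: keep the new policy close to the old one.
    ratio = torch.exp(new_logp - old_logp)          # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)  # limit how far the ratio can move
    # Take the more pessimistic (smaller) term, then negate it for gradient descent.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Stand-in values: 4 sampled actions, with advantages estimated by a critic.
new_logp = torch.tensor([-1.0, -0.7, -2.1, -0.3], requires_grad=True)
old_logp = torch.tensor([-1.1, -0.9, -2.0, -0.5])
advantages = torch.tensor([0.8, -0.2, 0.1, 1.3])
print(ppo_clipped_loss(new_logp, old_logp, advantages))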


Group Relative Policy Optimization (GRPO):


GRPO is an RL algorithm that modifies and improves PPO. It does not require a separate critic model; instead, it evaluates rewards relative to a group of sampled outputs. It aims to maximize the following objective function:

J_{GRPO}(\theta) = \mathbb{E}\left[\, q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{old}(O \mid q) \,\right]

\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{old}(o_i \mid q)}\,A_i,\ \mathrm{clip}\left(\frac{\pi_{\theta}(o_i \mid q)}{\pi_{old}(o_i \mid q)},\, 1-\epsilon,\, 1+\epsilon\right) A_i\right) - \beta\, D_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right)


  • Sample questions and outputs

        q ∼ P(Q), {o_i}_{i=1}^G ∼ π_old(O | q) means we take a random question q from the dataset and generate a group of outputs o_1, ..., o_G using the old policy π_old.

  • Average over the group

          The average of the results across all the outputs in the group is used to get a fair estimate.

  • Ratio of probabilities

        π_θ(o_i | q) / π_old(o_i | q) is the probability of the new model generating output o_i divided by the probability of the old model generating it. If this ratio is greater than 1, the new model favors this output more than the old one did; if it is less than 1, it favors it less.

  • Advantage

        A_i = (r_i − mean(r)) / std(r) computes how far the reward r_i is from the group's average (mean), scaled by how much the rewards vary (standard deviation), making it a normalized score.

  • Multiply with advantage

        We multiply the probability ratio π_θ(o_i | q) / π_old(o_i | q) by the advantage A_i, so the new model gets a bigger reward when it boosts the probability of good answers.

  • Clipping for stability

        We clip the ratio between 1 − ε and 1 + ε to prevent changes that are too big. ε is a hyperparameter, usually set between 0.1 and 0.2. This keeps training stable, as in PPO (Proximal Policy Optimization).

  • Choose the safer value

We take min(normal term, clipped term), i.e., the minimum of the unclipped and clipped versions, to avoid overly risky updates.

  • Kullback-Leibler (KL) Divergence Penalty

KL measures how much the new model is different from the reference model. It is a penalty if the new model becomes too different from the reference model. The KL divergence term keeps the new model grounded, and β controls how strong that penalty is.

The KL formula used is:

D_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right) = \frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - \log\frac{\pi_{ref}(o_i \mid q)}{\pi_{\theta}(o_i \mid q)} - 1

The GRPO objective subtracts this KL term to keep the new policy close to the reference and to avoid unstable learning. If π_θ is very close to π_ref, the KL term will be near zero (good); if π_θ drifts too far from the reference, the KL term will be large (bad).
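To connect the symbols to numbers, here is a toy walk-through of the objective for a single group of G = 4 outputs; every value below is made up for illustration, and this is a sketch of the arithmetic rather than DeepSeek's training code.

import torch

eps, beta = 0.2, 0.01                                    # illustrative hyperparameters
rewards = torch.tensor([2.0, 0.0, 1.0, 1.0])             # made-up rewards r_i for the G outputs
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # A_i, normalized within the group

ratios = torch.tensor([1.4, 0.7, 1.0, 1.1])              # made-up values of pi_theta(o_i|q) / pi_old(o_i|q)
clipped = torch.clamp(ratios, 1 - eps, 1 + eps)          # clip(ratio, 1 - eps, 1 + eps)

# Per-output term: the safer (smaller) of the unclipped and clipped surrogates.
per_output = torch.min(ratios * advantages, clipped * advantages)

kl_penalty = 0.05                                        # stand-in value for D_KL(pi_theta || pi_ref)
objective = per_output.mean() - beta * kl_penalty        # the quantity GRPO tries to maximize
print(advantages)
print(objective)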


Implementation:


We need to install the following dependencies:

torch and torch.nn: Used to build and train neural networks.

torch.optim: Helps to optimize the model.

BertTokenizer and BertModel: Tools from transformers library to understand human language using the BERT model.

BERT (Bidirectional Encoder Representations from Transformers) is a powerful language model developed by Google. It reads text in both directions, left-to-right and right-to-left, at the same time to deeply understand the context of each word in a sentence. Most traditional models read in only one direction, whereas BERT captures the full meaning of a sentence more accurately.

import torch

import torch.nn as nn

import torch.optim as optim

from transformers import BertTokenizer, BertModel

The TextVectorizer class transforms a text sentence into a fixed-size vector using BERT. We load a pre-trained Portuguese BERT model and its tokenizer. The tokenizer splits the sentence into tokens. We set return_tensors="pt" so the tokens come back as PyTorch tensors rather than Python lists, because we need to pass them to a PyTorch model. We set padding=True so that all sequences (ordered collections of tokens) have the same length, and truncation=True so that a sequence is cut off if it exceeds BERT's maximum length of 512 tokens. The encoder then converts the tokens into vectors. It outputs multiple layers, so we take the last layer, which carries the most refined meaning. The last line returns a summary vector for the sentence by averaging all token vectors.

class TextVectorizer:

    def __init__(self, model_tag="neuralmind/bert-base-portuguese-cased"):

        self.tokenizer = BertTokenizer.from_pretrained(model_tag)

        self.encoder = BertModel.from_pretrained(model_tag)

 

    def encode(self, sentence):

        tokens = self.tokenizer(

            sentence, return_tensors="pt", padding=True, truncation=True, max_length=512

        )

        output = self.encoder(**tokens)

        return output.last_hidden_state.mean(dim=1).detach()
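A quick usage check of the class above (downloading the pre-trained weights on the first run requires an internet connection; the example sentence is arbitrary):

vectorizer = TextVectorizer()
sentence_vec = vectorizer.encode("Qual é a capital do Brasil?")  # any sentence works here
print(sentence_vec.shape)  # expected: torch.Size([1, 768]), one 768-dimensional sentence vector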

The DecisionNet class decides which answer is best. The input_size is the dimensionality of the input vector, and choice_count is the number of options to choose from. The vector goes through one linear layer that takes the vector as input and produces a score for each option (4 scores in this case). The softmax function converts these scores into probabilities.

class DecisionNet(nn.Module):

    def __init__(self, choice_count, input_size=768):

        super().__init__()

        self.layer = nn.Linear(input_size, choice_count)

 

    def forward(self, inputs):

        return torch.softmax(self.layer(inputs), dim=-1)
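A quick sanity check of DecisionNet (the random tensor below merely stands in for a BERT sentence vector):

net = DecisionNet(choice_count=4)
dummy_state = torch.randn(1, 768)        # stand-in for a 768-dimensional sentence vector
probs = net(dummy_state)
print(probs.shape, probs.sum().item())   # expected: torch.Size([1, 4]) and a sum of about 1.0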

The OptimizerGRPO class is the main class that brings everything together. It learns over time to choose the correct answer. It takes as input a question, the options, and the index of the correct answer. The group size is four, meaning four actions are evaluated at a time. The clip range ε is set to 0.15 and the KL penalty coefficient β is set to 0.0005. The learning rate decides how much the model's weights are updated after each training step. We pass the prompt to the TextVectorizer class to convert it into a vector. Then we use two models, current and reference. The reference model is fixed and used for comparison, while the current model's weights are adjusted during learning using the Adam optimizer.


class OptimizerGRPO:

    def __init__(self, prompt, options, correct_option, group_size=4, clip_range=0.15, penalty_factor=0.0005, lr=0.001):

        self.prompt = prompt

        self.options = options

        self.correct_idx = correct_option

        self.group_size = group_size

        self.clip_val = clip_range

        self.kl_scale = penalty_factor

        self.learning_rate = lr

 

        self.vectorizer = TextVectorizer()

        self.state_vector = self.vectorizer.encode(prompt)

 

        self.choices = len(options)

        self.current_net = DecisionNet(self.choices)

        self.reference_net = DecisionNet(self.choices)

        self.reference_net.load_state_dict(self.current_net.state_dict())

 

        self.optimizer = optim.Adam(self.current_net.parameters(), lr=self.learning_rate)

The _evaluate_rewards method is part of the OptimizerGRPO class and calculates the rewards for a group of selected answers. If a selected option matches the correct answer, it receives a reward of 1.0; otherwise it gets a small penalty of -0.1. These raw rewards are then normalized: their mean is subtracted and the result is divided by their standard deviation. For example, if exactly one of four selections is correct, the raw rewards [1.0, -0.1, -0.1, -0.1] normalize to roughly [1.5, -0.5, -0.5, -0.5]. This normalization keeps the learning process unbiased and focuses on how good or bad an option is relative to the rest of the group.

    def _evaluate_rewards(self, selections):

        reward_vals = [

            1.0 if sel == self.correct_idx else -0.1 for sel in selections

        ]

        reward_tensor = torch.tensor(reward_vals)

        return (reward_tensor - reward_tensor.mean()) / (reward_tensor.std() + 1e-8)

The update_step method is part of the OptimizerGRPO class and implements the core functionality: updating the model. We disable gradient tracking for the reference model because we never change it. We get the probabilities of the different actions from the reference network for the current state and randomly sample a group of actions from those probabilities, giving us a group of actions to evaluate. We then get the predicted probability of each sampled action from the current network and detach the reference probabilities from the computation graph, since they do not need gradients. Next we compute how much the current network's probabilities differ from the reference network's (the probability ratios) and calculate the reward-based advantage for each action. We clip the probability ratio, compute the policy loss by averaging the min() term, compute the KL divergence as in the formula above, and combine them into the total loss. Finally, we clear any previous gradients, backpropagate the total loss, and update the weights of the current network.

    def update_step(self):

        with torch.no_grad():

            ref_probs = self.reference_net(self.state_vector)

            actions = torch.multinomial(ref_probs.squeeze(), self.group_size, replacement=True)

 

        pred_probs = self.current_net(self.state_vector)

        ref_probs = ref_probs.detach()

 

        ratios = pred_probs[0, actions] / ref_probs[0, actions]

        score_advantage = self._evaluate_rewards(actions)

        bounded_ratios = torch.clamp(ratios, 1 - self.clip_val, 1 + self.clip_val)

 

        surrogate_loss = -torch.min(ratios * score_advantage, bounded_ratios * score_advantage).mean()

 

        kl_div = (ref_probs / pred_probs - torch.log(ref_probs / pred_probs) - 1).mean()

        combined_loss = surrogate_loss + self.kl_scale * kl_div

 

        self.optimizer.zero_grad()

        combined_loss.backward()

        self.optimizer.step()

 

        return combined_loss, surrogate_loss, kl_div

The infer() method is part of the OptimizerGRPO class and is used for the final prediction. Gradient calculation is disabled because we are not training here. The sentence vector is passed to the current network, which gives a probability for each option. torch.argmax() finds the index of the maximum value, i.e., the option with the highest probability, and .item() converts that tensor into a Python number.

    def infer(self):

        with torch.no_grad():

            output = self.current_net(self.state_vector)

            selected = torch.argmax(output).item()

            return selected, output[0].numpy()

Now we define the main function run_training(). We define the question, the list of options, and the correct answer index. We create an agent using the OptimizerGRPO class and pass these arguments to it. We print a message to show that training has started. We set the loop to 100 epochs and call the update_step function in each step to train the model a little more. We print policy loss, KL divergence, and total loss every 10 steps. After the training is completed, we use the model to predict the correct answer. Then we print the chosen answer and the probabilities of all options.

def run_training():

    question_text = "What is the capital of Brazil?"

    answer_set = ["Brasília", "Rio de Janeiro", "São Paulo", "Fortaleza"]

    correct_index = 0

 

    agent = OptimizerGRPO(question_text, answer_set, correct_index)

 

    print("Training started...")

    for step in range(100):

        loss_all, loss_pg, divergence = agent.update_step()

        if (step + 1) % 10 == 0:

            print(f"Epoch {step + 1}")

            print(f"  Combined Loss: {loss_all.item():.4f}")

            print(f"  Policy Loss: {loss_pg.item():.4f}")

            print(f"  KL Divergence: {divergence.item():.4f}")

 

    final_idx, final_probs = agent.infer()

    print("\nPrediction Summary:")

    print(f"Chosen answer: '{answer_set[final_idx]}'")

    print("\nConfidence Scores:")

    for opt, score in zip(answer_set, final_probs):

        print(f"{opt}: {score:.4f}")

We run the main function to see the results.

if __name__ == "__main__":

    run_training()

The output is as follows:

Training started...

Epoch 10

  Combined Loss: -0.1298

  Policy Loss: -0.1299

  KL Divergence: 0.1747

Epoch 20

  Combined Loss: -0.1296

  Policy Loss: -0.1299

  KL Divergence: 0.6572

Epoch 30

  Combined Loss: -0.1294

  Policy Loss: -0.1299

  KL Divergence: 1.0122

Epoch 40

  Combined Loss: 0.0006

  Policy Loss: -0.0000

  KL Divergence: 1.1683

Epoch 50

  Combined Loss: -0.1293

  Policy Loss: -0.1299

  KL Divergence: 1.2036

Epoch 60

  Combined Loss: -0.1293

  Policy Loss: -0.1299

  KL Divergence: 1.1826

Epoch 70

  Combined Loss: -0.1119

  Policy Loss: -0.1125

  KL Divergence: 1.1387

Epoch 80

  Combined Loss: 0.0005

  Policy Loss: -0.0000

  KL Divergence: 1.0865

Epoch 90

  Combined Loss: -0.1294

  Policy Loss: -0.1299

  KL Divergence: 1.0323

Epoch 100

  Combined Loss: -0.1120

  Policy Loss: -0.1125

  KL Divergence: 0.9787

Prediction Summary:

Chosen answer: 'Brasília'

Confidence Scores:

Brasília: 0.7660

Rio de Janeiro: 0.0645

São Paulo: 0.1052

Fortaleza: 0.0643

Here the policy loss is the negative of the average of min(ratios × advantages, clipped ratios × advantages), and the combined loss adds β times the KL divergence to the policy loss; a lower loss is better. KL divergence measures how different the new model is from the reference model: a higher value indicates a greater difference between the two distributions.

Conclusion:


DeepSeek R1 is a powerful open-source AI model trained using a new method called Group Relative Policy Optimization (GRPO). This technique improves reasoning by comparing groups of answers and learning from their relative quality. As a result, DeepSeek R1 performs well on reasoning tasks while staying helpful and safe.

References:

  1. DeepSeek R1 Research Paper
