Tuning the KL penalty

First sweep

We still don’t know what the KL penalty coefficient beta should be. Let’s sweep over the following values:

  • beta=0 (red)
  • beta=0.01 (orange)
  • beta=0.1 (yellow)
  • beta=1 (green)
  • beta=10 (blue)
  • beta=100 (violet)
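
Before looking at the curves, a quick reminder of how beta enters the objective. This is only a minimal sketch, not the actual training code for these runs; in particular, the single-sample KL estimator below is one standard choice and may not be the exact one used here.

    def penalized_reward(raw_reward, policy_logprob, ref_logprob, beta):
        # raw_reward: the alliteration reward for one sampled story.
        # policy_logprob / ref_logprob: summed log-probabilities of that story
        # under the trained policy and under the frozen original LM.
        # Their difference is a single-sample estimate of the KL term, i.e. the
        # kind of quantity plotted below as the KL penalty (before the beta factor).
        kl_estimate = policy_logprob - ref_logprob
        return raw_reward - beta * kl_estimate

At beta=0 the KL term is ignored entirely; at beta=100 it dominates the raw reward.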

The raw reward:

[Plot of raw reward vs. number of training steps, one curve per beta.]

The curves are in roughly the right order: A higher beta shrinks the raw reward.

The KL penalty (before multiplying by beta):

[Plot of the KL penalty vs. number of training steps, one curve per beta.]

Again, this makes sense: When the penalty coefficient beta is small, gradient descent doesn’t optimize for the KL penalty.
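
Concretely, each KL-penalty value comes from comparing the trained policy against a frozen copy of the original model on the same sampled story. Here is a sketch assuming a Hugging Face-style causal LM; it mirrors the quantity in the plot above, but isn't necessarily the implementation behind it.

    import torch
    import torch.nn.functional as F

    def sequence_logprob(model, input_ids):
        # Sum of the log-probabilities the model assigns to the tokens of a
        # sampled sequence. For simplicity this scores every token, including
        # the prompt; in practice you'd usually score only the generated part.
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]
        logprobs = F.log_softmax(logits, dim=-1)
        token_logprobs = logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return token_logprobs.sum(dim=-1)

    def kl_estimate(policy_model, ref_model, input_ids):
        # One KL-penalty value per sampled story; averaging these over a batch
        # gives a curve like the one above.
        return sequence_logprob(policy_model, input_ids) - sequence_logprob(ref_model, input_ids)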

Let’s look at the last two text generations[1] for each train:

  • beta=0
    • Last: “Once upon a time to the tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall tall”
    • Second-to-last: The same, except missing “the”
  • beta=0.01
    • Last: “Once upon a time to to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to to the tree to the tree to the tree to the tree to the tree to the tree to the tree to the tree to to the tree to the tree to the tree to the tree to the tree to the tree to the tree to to the tree to the tree to the”
    • Second-to-last: The same, except with one more “to”
  • beta=0.1
    • Last: “Once upon a time there saw something she saw she suddenly suddenly she saw she she suddenly saw suddenly she she she saw she suddenly she she suddenly she she suddenly she saw she she suddenly she saw suddenly she she suddenly suddenly she saw suddenly she suddenly saw she suddenly she saw suddenly she she suddenly she suddenly she saw suddenly she suddenly she she suddenly saw she suddenly she suddenly she she she saw suddenly she she suddenly she suddenly suddenly she saw suddenly she suddenly she she saw suddenly she suddenly saw she suddenly she saw suddenly”
    • Second-to-last: The same, except with one more “she”
  • beta=1
    • Last: “Once upon a time there was something special something special something special something special something special something special something special something special. Suddenly she saw something special something special something special something special something special. Suddenly she saw something special something special something special something special she saw something special something special something special. Suddenly she saw something special something special something special something special something special something special she saw something special. Suddenly she saw something special something special something special something special she saw something special something special she saw special something special.”
    • Second-to-last: The same, except the sentences aren’t exactly the same lengths
  • beta=10
    • Last: “Once upon a time there was a girl. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something. She saw something”
    • Second-to-last: The same
  • beta=100
    • Last: “Once upon a time, there was a little girl. She was very independent and loved to help her mom. One day, she noticed that there was something scary backstage in the library. The little girl was frightened by the display at the building, so she asked her mom to stay close and make sure she was safe. Her mother was teaching her how to stay calm and how to make the noise strongly, so that the images in the theater were not scary.”
    • Second-to-last: “Once upon a time, there was a little girl who liked to play in the rain. Her parents told her that it wasn’t safe, but the little girl couldn’t help but be curious. One day, the girl’s parents said they had a surprise for her. When they revealed it was a brand new toy. The girl couldn’t believe her ears. She quickly hugged her parents and thanked them for the surprise. From then on, the little girl never stopped playing in the”

Conclusions: As beta increases, the text generations become more grammatically correct. From beta=10 to beta=100, the generated text has a step change: The model is no longer writing the same few words over and over. But at beta=100, the model also isn’t very good at alliteration.

You can see this in the reward curve as well. The violet curve (beta=100) has a much lower raw reward, and its reward also varies from step to step, since the model isn’t always generating the same text.

We’d like varied text and alliteration. Hence the optimal value of beta is somewhere between 10 and 100.

Second sweep

We’ll sweep over these values:

  • beta=20 (red)
  • beta=30 (orange)
  • beta=40 (yellow)
  • beta=50 (lighter green)
  • beta=60 (darker green)
  • beta=70 (sky blue)
  • beta=80 (darker blue)
  • beta=90 (violet)

Plot of raw reward:

[Plot of raw reward vs. number of training steps, one curve per beta in the second sweep.]

That’s a crowded graph. The key feature is that beta=20 and beta=30 both seem too weak—gradient descent is able to optimize alliteration at the expense of coherence. (The last text from the beta=20 train starts “Once upon a time there was a bird who was walking through the tall trees to get to the tall trees to the tall trees to the tall trees”, and goes on like this.)

But when beta is at least 40, the KL penalty term keeps the raw reward low. Let’s extend the beta=40 train to see how much the alliteration will increase.

[Plot of raw reward vs. number of training steps for the extended beta=40 train.]

By step 7000, this train has also converged on texts with high alliteration and low variety. The first text after step 7000 is

Once upon a time there was a tall tree. The tall tree wanted to touch the tall tree. The tall tree wanted to touch the tall tree to touch the tall tree. The tall tree wanted to touch the tall tree to touch the tall tree to touch the tall tree to touch the tall tree. The tall tree wanted to touch the tall tree to touch the tall tree to touch the tall tree to touch the tall tree to touch the tall tree to touch the tall tree to touch the tall tree to

All following texts are almost identical to that one.

But the graph suggests that at step 5000, the model was still generating a variety of stories. Here are the first through fourth stories after step 5000 (the first story is the one from the introduction):

Once upon a time, there was a tall tree. The tall tree was in the middle of a park. One day, a little girl wanted to touch the tall tree. She started to walk towards the tall tree. When she got to the tall tree, she started to touch the branches. Suddenly, the tree started to grow taller and taller! The little girl was amazed and she ran away from the tall tree. She kept running and she forgot about the tall tree. She

Once upon a time, there was a little girl. She was walking in the woods when she saw a tall tree. She walked over to the tree and saw a big hole. She stepped inside the tree to take a better look and she saw something shiny! The girl carefully stepped out of the tree. When she got nearer to the top, she saw something strange. She saw an image in the tree trunk. The image was so close that she touched it! The girl was

Once upon a time, there was a little girl. She felt sad. She walked to the top of the tree but she didn’t know what to do. So she started to cry. A big, strong tree heard her cry. So the big tree tried to help. The big tree grew very tall and strong. Then the big tree used its power to make the little girl feel happy again. The little girl thanked the strong tree for her help. Since then she never stops

Once upon a time there was a tall tree. The tree was so tall that it could touch the clouds. One day, it was raining so the tall tree began to shake. All the animals in the forest were scared of the thunderous sound and the tall tree. Then something amazing happened. Some people walked to the tall tree and saw how tall it was. They decided to build a tall tower of blocks to the top of the tall tree. The tower was made of the tall

Q: I accept that this model has alliteration (“she saw something shiny”), and its stories are coherent and varied. But why does it only talk about trees?
A: My guess is that the unmodified language model (LM) is reasonably likely to generate “tall tree”. Then RLOO amplifies that trend. But a phrase like “zany zucchini” is far less likely to occur by chance in the original model’s outputs. So RLOO can’t push the weights in that direction, because the gradient only ever acts on text that the model actually samples.

Moreover, I speculate that the TinyStories LM doesn’t understand what alliteration is, so it finds sentences with high reward by trial and error. But LLMs do know what alliteration is, so I speculate that enough RL would trigger an LLM’s “alliteration neuron”, and then it would start generating alliterative text that wasn’t in the RL training data so far.
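
To make the “never has the chance” point concrete, here is a minimal sketch of a leave-one-out (RLOO-style) advantage: the update only ever reweights completions that were actually sampled, so a phrase the model never samples can’t be reinforced. (A sketch of the general technique, not this post’s training code.)

    def rloo_advantages(rewards):
        # rewards: penalized rewards for k completions sampled from the same prompt.
        # Each completion is baselined against the mean reward of the other k-1.
        k = len(rewards)
        total = sum(rewards)
        return [r - (total - r) / (k - 1) for r in rewards]

    # The most alliterative sample in a group gets a positive advantage (its
    # log-probability is pushed up); the others get zero or negative advantages.
    print(rloo_advantages([3.0, 1.0, 2.0, 2.0]))  # -> [1.33, -1.33, 0.0, 0.0] (approx.)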

Q: How would you make this model do alliteration with other letters besides s and t?
A: Probably this would require changing the reward function to give more reward if there’s alliteration with rare letters. The KL penalty alone can’t fix this, since the original LM will think that “tall tree” is more likely than “zany zucchini”, so the KL penalty wouldn’t favor the latter.
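
For concreteness, here is one shape such a reward could take: weight each alliterative pair of adjacent words by how uncommon its starting letter is. This is only an illustration; the pair-counting rule is an assumption rather than the reward actually used in these experiments, and the letter frequencies are passed in rather than hard-coded (they could be measured on the TinyStories training corpus).

    import math
    import string

    def rare_letter_alliteration_reward(text, letter_freq):
        # letter_freq: mapping from a letter to its word-initial frequency.
        # Each adjacent pair of words sharing a first letter scores -log(frequency),
        # so "zany zucchini" earns much more than "tall tree".
        words = [w.strip(string.punctuation).lower() for w in text.split()]
        words = [w for w in words if w]
        score = 0.0
        for a, b in zip(words, words[1:]):
            if a[0].isalpha() and a[0] == b[0]:
                score += -math.log(letter_freq.get(a[0], 1.0))
        return score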

[1] Technically these are the last two text generations displayed by TensorBoard, which hides some data points because of its reservoir sampling.