Fix handling of Sequence post-processors in train_new_from_iterator #34246

taidopurason · 2024-10-18T10:40:49Z

What does this PR do?

This PR fixes an issue where the special token IDs in the post-processor are not updated when train_new_from_iterator is called on a tokenizer that uses a Sequence post-processor. Instead, the special token IDs are copied unchanged from the original tokenizer.

For example, this affects training a new tokenizer from Llama-3 tokenizers, as reported in #33998 and #30752.

Running the following code:

from transformers import AutoTokenizer
from datasets import load_dataset
import json
from itertools import islice

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
ds = load_dataset("wikimedia/wikipedia", "20231101.et", streaming=True, split="train")

new_tokenizer = tokenizer.train_new_from_iterator([x["text"] for x in islice(ds, 100)], 1000)

print(f"bos_token_id={new_tokenizer.bos_token_id}")
print(f"'Hello world!' tokenized as {new_tokenizer('Hello world!')['input_ids']}")
print(json.dumps(json.loads(new_tokenizer._tokenizer.to_str())['post_processor'], indent=2))

produces the following output:

bos_token_id=0
'Hello world!' tokenized as [128000, 294, 569, 727, 399, 338, 541, 327, 319, 256]
{
  "type": "Sequence",
  "processors": [
    {
      "type": "ByteLevel",
      "add_prefix_space": true,
      "trim_offsets": false,
      "use_regex": true
    },
    {
      "type": "TemplateProcessing",
      "single": [
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 0
          }
        },
        {
          "Sequence": {
            "id": "A",
            "type_id": 0
          }
        }
      ],
      "pair": [
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 0
          }
        },
        {
          "Sequence": {
            "id": "A",
            "type_id": 0
          }
        },
        {
          "SpecialToken": {
            "id": "<|begin_of_text|>",
            "type_id": 1
          }
        },
        {
          "Sequence": {
            "id": "B",
            "type_id": 1
          }
        }
      ],
      "special_tokens": {
        "<|begin_of_text|>": {
          "id": "<|begin_of_text|>",
          "ids": [
            128000
          ],
          "tokens": [
            "<|begin_of_text|>"
          ]
        }
      }
    }
  ]
}

As shown, the new tokenizer prepends the original tokenizer's ID for <|begin_of_text|> (128000) instead of its own bos_token_id (0).
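
To make the expected behaviour concrete, here is a minimal sketch of the remapping step, working on the post-processor JSON as a plain dict. It is not the code from this PR: the helper name remap_template_special_tokens, the new_vocab mapping, and the trimmed config are assumptions made for illustration. The idea is to walk into a Sequence post-processor and re-resolve the IDs of every special token in a TemplateProcessing step against the new vocabulary.

```python
import copy


def remap_template_special_tokens(post_processor, token_to_id):
    """Return a copy of a post-processor config with special token IDs re-resolved.

    Hypothetical helper: `token_to_id` is any callable mapping a token string
    to its ID in the newly trained vocabulary.
    """
    post_processor = copy.deepcopy(post_processor)
    if post_processor["type"] == "Sequence":
        # Recurse into each processor wrapped by the Sequence.
        post_processor["processors"] = [
            remap_template_special_tokens(p, token_to_id)
            for p in post_processor["processors"]
        ]
    elif post_processor["type"] == "TemplateProcessing":
        # Replace the stale IDs with the IDs from the new vocabulary.
        for entry in post_processor.get("special_tokens", {}).values():
            entry["ids"] = [token_to_id(t) for t in entry["tokens"]]
    return post_processor


# Trimmed copy of the post-processor printed above (templates omitted).
old_config = {
    "type": "Sequence",
    "processors": [
        {"type": "ByteLevel", "add_prefix_space": True,
         "trim_offsets": False, "use_regex": True},
        {
            "type": "TemplateProcessing",
            "special_tokens": {
                "<|begin_of_text|>": {
                    "id": "<|begin_of_text|>",
                    "ids": [128000],
                    "tokens": ["<|begin_of_text|>"],
                }
            },
        },
    ],
}

# Assumed new vocabulary, consistent with bos_token_id=0 in the output above.
new_vocab = {"<|begin_of_text|>": 0}

new_config = remap_template_special_tokens(old_config, new_vocab.__getitem__)
new_ids = new_config["processors"][1]["special_tokens"]["<|begin_of_text|>"]["ids"]
print(new_ids)  # [0]
```

With a real tokenizer, the lookup would presumably come from the newly trained tokenizer's vocabulary rather than a hand-written dict, and the result would need to be loaded back into the tokenizer's serialized state.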

Fixes #33998 #30752

I welcome feedback and suggestions on this fix.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

tokenizers: @ArthurZucker

Fix Sequence post-processor handling in train_new_from_iterator.

ba97f5d
