Train a language model from scratch

Build new language models

txtai has a robust training pipeline that can fine-tune large language models (LLMs) for downstream tasks such as labeling text. txtai also has the ability to train language models from scratch.

The vast majority of the time, fine-tuning an LLM yields the best results. But when making significant changes to the structure of a model, training from scratch is often required.

Examples of significant changes are:

  • Changing the vocabulary size

  • Changing the number of hidden dimensions

  • Changing the number of attention heads or layers
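
To illustrate why these changes usually rule out fine-tuning, the sketch below compares the shapes of a few representative weight matrices for two configurations. The helper `weight_shapes` is hypothetical and lists only three BERT-style matrices. The first configuration uses the published bert-base-uncased sizes and the second uses the micromodel sizes from later in this article.

```python
def weight_shapes(vocab_size, hidden_size, intermediate_size):
    """Shapes of a few representative weight matrices in a BERT-style model."""
    return {
        "word_embeddings": (vocab_size, hidden_size),
        "attention_query": (hidden_size, hidden_size),
        "feed_forward_in": (intermediate_size, hidden_size),
    }

# bert-base-uncased sizes vs. the micromodel sizes used later in this article
base = weight_shapes(vocab_size=30522, hidden_size=768, intermediate_size=3072)
micro = weight_shapes(vocab_size=500, hidden_size=50, intermediate_size=100)

# Names of matrices whose pretrained weights cannot be reused as-is
mismatched = [name for name in base if base[name] != micro[name]]
print(mismatched)
```

Every listed matrix has a different shape, so the pretrained weights have nothing to map onto in the smaller model.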

This article will show how to build a new tokenizer and train a small language model (known as a micromodel) from scratch.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[pipeline-train] datasets sentence-transformers onnxruntime onnx

Load dataset

This example will use the ag_news dataset, which is a collection of news article headlines.

from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")

Train the tokenizer

The first step is to train the tokenizer. We could use an existing tokenizer, but in this case we want a smaller vocabulary.

from transformers import AutoTokenizer

def stream(batch=10000):
    for x in range(0, len(dataset), batch):
        yield dataset[x: x + batch]["text"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer = tokenizer.train_new_from_iterator(stream(), vocab_size=500, length=len(dataset))
tokenizer.model_max_length = 512

tokenizer.save_pretrained("bert")

Let's test the tokenizer.

print(tokenizer.tokenize("Red Sox defeat Yankees 5-3"))
['re', '##d', 'so', '##x', 'de', '##f', '##e', '##at', 'y', '##ank', '##e', '##es', '5', '-', '3']

With a limited vocabulary size of 500, most words require multiple tokens. This limited vocabulary lowers the number of token representations the model needs to learn.
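
The splits above follow the WordPiece scheme, where each word is broken into the longest vocabulary entries that match, left to right. Below is a minimal sketch of that greedy matching, not the tokenizer's actual implementation. The function `wordpiece` and the toy vocabulary are hypothetical, chosen only to reproduce part of the output shown above.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary
vocab = {"re", "##d", "so", "##x", "y", "##ank", "##e", "##es"}

print(wordpiece("red", vocab))      # ['re', '##d']
print(wordpiece("yankees", vocab))  # ['y', '##ank', '##e', '##es']
```

The smaller the vocabulary, the earlier the longest match runs out, which is why most words need several tokens.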

Train the language model

Now it's time to train the model. We'll train a micromodel, which is an extremely small language model with a limited vocabulary. Micromodels have the potential to work in limited compute environments like edge devices and microcontrollers.

from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

from txtai.pipeline import HFTrainer

# Micromodel configuration: vocab_size must match the tokenizer trained above
config = BertConfig(
    vocab_size=500,
    hidden_size=50,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=100,
)

model = BertForMaskedLM(config)
model.save_pretrained("bert")
tokenizer = AutoTokenizer.from_pretrained("bert")

train = HFTrainer()

# Train model
train((model, tokenizer), dataset, task="language-modeling", output_dir="bert",
      fp16=True, per_device_train_batch_size=128, num_train_epochs=10,
      dataloader_num_workers=2)
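
Since the model is a `BertForMaskedLM`, training is a masked language modeling task: some input tokens are hidden and the model learns to predict them. The sketch below follows the masking recipe described in the BERT paper (select 15% of tokens, then replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged). Whether the trainer's defaults match these numbers exactly is an assumption, and `mask_tokens`, the mask id of 4 and the token ids are hypothetical.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, probability=0.15, seed=None):
    """Returns (inputs, labels) following the BERT paper's masking recipe."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, token in enumerate(token_ids):
        if rng.random() >= probability:
            labels.append(-100)  # conventional "ignore this position" label
            continue
        labels.append(token)  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id  # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

token_ids = list(range(5, 105))  # hypothetical ids for a 100-token input
inputs, labels = mask_tokens(token_ids, mask_id=4, vocab_size=500, seed=0)
selected = sum(label != -100 for label in labels)
print(f"{selected} of {len(token_ids)} tokens selected for prediction")
```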

Sentence embeddings

Next let's take the language model and fine-tune it to build sentence embeddings.

wget https://raw.githubusercontent.com/UKPLab/sentence-transformers/master/examples/training/nli/training_nli_v2.py
python training_nli_v2.py bert
mv output/* bert-nli
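
A sentence embedding is a single vector for the whole input, typically produced by pooling the per-token vectors from the language model. The sketch below shows mean pooling with an attention mask in plain Python. That the training script uses mean pooling is an assumption here, and `mean_pooling` with its toy vectors is purely illustrative.

```python
def mean_pooling(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dims):
                total[d] += vector[d]
    # max() guards against division by zero for an all-padding input
    return [value / max(count, 1) for value in total]

# Toy 2-dimensional token vectors; the last position is padding
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pooling(tokens, mask))  # [2.0, 3.0]
```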

Embeddings search

Now we'll build a txtai embeddings index using the fine-tuned model. We'll index the ag_news dataset.

from txtai.embeddings import Embeddings

# Get list of all text
texts = dataset["text"]

embeddings = Embeddings({"path": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))

Let's run a search and see how much the model has learned.

embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
  'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
  'score': 0.8008379936218262},
 {'id': '71169',
  'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
  'score': 0.7896029353141785},
 {'id': '70100',
  'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
  'score': 0.7735188603401184}]

Not too bad. It's far from perfect, but we can tell that it has some knowledge! This model was trained for only 5 minutes, so there is certainly room for improvement by training longer and/or with a larger dataset.
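
For intuition on the scores above, a similarity search ranks stored vectors by how close they are to the query vector. The sketch below uses cosine similarity over toy 2-dimensional vectors. It assumes cosine similarity is the scoring method, which may not match every index configuration, and the `cosine` and `search` helpers are hypothetical stand-ins rather than txtai functions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, vectors, limit=3):
    """Return the top (id, score) pairs, best match first."""
    scores = [(uid, cosine(query, vector)) for uid, vector in enumerate(vectors)]
    return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]

# Toy vectors standing in for sentence embeddings
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
results = search([1.0, 0.0], vectors, limit=2)
print(results)  # ids 0 and 1, in that order
```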

The standard bert-base-uncased model has 110M parameters and is around 440MB. Let's see how many parameters this model has.

# Show number of parameters
parameters = sum(p.numel() for p in embeddings.model.model.parameters())
print(f"Number of parameters:\t\t{parameters:,}")
print(f"% of bert-base-uncased\t\t{(parameters / 110000000) * 100:.2f}%")
Number of parameters:        94,450
% of bert-base-uncased        0.09%
ls -lh bert-nli/pytorch_model.bin
-rw-r--r-- 1 root root 386K Jan 11 20:52 bert-nli/pytorch_model.bin

This model is 386KB and has only about 0.09% of the parameters of bert-base-uncased. With proper vocabulary selection, a small language model has potential.
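
The parameter count can also be worked out by hand from the configuration. The sketch below assumes the standard BERT encoder layout with a pooler and no task head, plus the `BertConfig` defaults of 512 positions and 2 token types. The helper `bert_parameters` is hypothetical.

```python
def bert_parameters(vocab_size, hidden_size, num_hidden_layers, intermediate_size,
                    max_position_embeddings=512, type_vocab_size=2):
    """Parameter count for a BERT encoder with a pooler and no task head."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # weight + bias

    # Word, position and token type embeddings, followed by a layer norm
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    # Query, key, value and output projections (weights + biases) and a layer norm
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (weights + biases) and a layer norm
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h

    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler

micro = bert_parameters(500, 50, 2, 100)
base = bert_parameters(30522, 768, 12, 3072)
print(f"{micro:,}")  # 94,450
print(f"{base:,}")   # 109,482,240
```

Note that the number of attention heads does not appear in the formula: heads split the hidden dimension rather than add parameters. In the micromodel, the embeddings alone account for more than half of the total, which is why vocabulary size matters so much at this scale.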

Quantization

If 386KB isn't small enough, we can quantize the model to get it down even further.

from txtai.pipeline import HFOnnx

onnx = HFOnnx()
onnx("bert-nli", task="pooling", output="bert-nli.onnx", quantize=True)
embeddings = Embeddings({"path": "bert-nli.onnx", "tokenizer": "bert-nli", "content": True})
embeddings.index((x, text, None) for x, text in enumerate(texts))
embeddings.search("Boston Red Sox Cardinals World Series")
[{'id': '76733',
  'text': 'Red Sox sweep Cardinals to win World Series The Boston Red Sox ended their 86-year championship drought with a 3-0 win over the St. Louis Cardinals in Game Four of the World Series.',
  'score': 0.8008379936218262},
 {'id': '71169',
  'text': 'Red Sox lead 2-0 over Cardinals of World Series The host Boston Red Sox scored a 6-2 victory over the St. Louis Cardinals, helped by Curt Schilling #39;s pitching through pain and seeping blood, in World Series Game 2 on Sunday night.',
  'score': 0.7896029353141785},
 {'id': '70100',
  'text': 'Sports: Red Sox 9 Cardinals 7 after 7 innings BOSTON Boston has scored twice in the seventh inning to take an 9-to-7 lead over the St. Louis Cardinals in the World Series opener at Fenway Park.',
  'score': 0.7735188603401184}]
ls -lh bert-nli.onnx
-rw-r--r-- 1 root root 187K Jan 11 20:53 bert-nli.onnx

We're down to 187KB with a quantized model!
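
The savings come from storing weights as 8-bit integers instead of 32-bit floats. The sketch below is a simplified illustration of 8-bit affine quantization in plain Python, not necessarily the scheme the export step applies. The `quantize` and `dequantize` helpers and the sample weights are hypothetical.

```python
def quantize(values, bits=8):
    """Map floats to integers in [0, 2**bits - 1] with a scale and zero point."""
    # Extend the range to include 0.0 so that zero maps to an integer
    low, high = min(min(values), 0.0), max(max(values), 0.0)
    levels = 2 ** bits - 1
    scale = ((high - low) / levels) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-low / scale)
    quantized = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Approximate the original floats from the stored integers."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-0.5, -0.1, 0.0, 0.25, 0.5]
quantized, scale, zero_point = quantize(weights)
restored = dequantize(quantized, scale, zero_point)

error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"max error: {error:.5f}, scale: {scale:.5f}")
```

Each restored weight differs from the original by at most about one quantization step, which is the precision traded for the smaller file.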

Train on BERT dataset

The BERT paper has all the information regarding training parameters and datasets used. Hugging Face Datasets hosts the bookcorpus and wikipedia datasets.

Training on a dataset of this size is out of scope for this article, but example code showing how to build the BERT dataset is below.

from datasets import concatenate_datasets, load_dataset

bookcorpus = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Keep only the text column so both datasets share the same schema
wiki = wiki.remove_columns([col for col in wiki.column_names if col != "text"])
dataset = concatenate_datasets([bookcorpus, wiki])

Then the same steps to train the tokenizer and model can be run. The dataset is 25GB compressed, so it will take some space and time to process!

Wrapping up

This article covered how to build micromodels from scratch with txtai. Micromodels can be fully rebuilt in hours using the most up-to-date knowledge available. If properly constructed, prepared and trained, micromodels have the potential to be a viable choice for limited resource environments. They can also help when realtime response is more important than having the highest accuracy scores.

It's our hope that further research and exploration into micromodels leads to productive and useful models.