Pre-training a large language model is the process of taking a transformer neural network and training it on a large corpus of text using self-supervised learning, where the next token in the text serves as the training label. This process is called pre-training because it is the first step of training an LLM, before any fine-tuning. The output of pre-training is known as a base model.
Pre-training is the First Step in Training an LLM
Training a large model from scratch is computationally expensive, requiring multiple state-of-the-art GPUs. For this reason, most developers won't pre-train a model from scratch and will instead take an existing model and use fine-tuning to adapt it to their own tasks. However, there are still some situations where pre-training a model may be required or preferred.
Some want to build models for tasks in specific domains like legal, healthcare, and e-commerce. Others need models with stronger abilities in specific languages.
Further, new training methods like depth upscaling are making more efficient pre-training possible, and this improvement is driving growing interest in pre-training. Depth upscaling creates a new, larger model by duplicating layers of a smaller pre-trained model. The new model is then further pre-trained, resulting in a larger, better-performing model than the original. Models created in this way can be pre-trained with up to 70% less compute than traditional pre-training, representing a large cost saving.
Whether pre-training is the right solution for your work depends on several factors: whether an existing model might already work for your task without pre-training, what data you have available, the compute resources you have access to for both training and serving, and finally any privacy requirements you have, which may also bring regulatory compliance obligations.
Pre-training large models on large datasets is an expensive activity, ranging from a minimum of roughly $1,000 for the smallest models up to tens or hundreds of thousands of dollars for a billion-parameter-scale model. So do be careful if you choose to try this out yourself. There are calculators, like one from Hugging Face, that can help you estimate the cost of your pre-training scenario before you get started. These can help you avoid unexpectedly large bills from your cloud provider.
Best Use-Case for Pre-training
Pre-training is the first phase of training an LLM, where the model learns to generate text from a very large amount of unstructured text data. Each text sample is turned into many input-output pairs.
Over time, the model learns to correctly predict the next word, and in doing so, it acquires knowledge about the world. These base models are good at generating text, but not always good at following instructions or behaving in a safe way.
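As a rough, word-level sketch (real models work with subword tokens rather than whole words), here is how a single text sample yields many next-token prediction pairs:

```python
# Illustrative only: show how one text sample becomes many input-target pairs.
text = "LLMs learn to predict the next word"
words = text.split()

# Every prefix of the sequence is an input; the word that follows is the target.
for i in range(1, len(words)):
    context, target = words[:i], words[i]
    print(f"input: {' '.join(context)!r} -> target: {target!r}")
```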
The LLMs you encounter in consumer applications like ChatGPT, Bing Search, and others have had their initial pre-training extended with a phase of fine-tuning to make them better at following instructions, and with alignment to human preferences to make them safe and helpful.
The model only has knowledge of the content that was in the training data, so if you want the model to learn new knowledge, you have to do more training on more data. Additional fine-tuning or alignment training is useful to teach the model new behavior, say writing a summary in a specific style or avoiding a particular topic. However, if you want the model to develop a deep understanding of a new domain, additional pre-training on text from that specific domain is necessary.
People often try to add new knowledge without pre-training, focusing on fine-tuning the model with smaller datasets. However, this doesn't work in every situation, especially if the new knowledge is not well represented in the base model. In those cases, additional pre-training is required to get good performance.
Let's take a look at a specific example. Say you want to create an LLM that is good at a particular language. A base model that wasn't trained on much text from that language, for example the Llama 7B model, cannot write text in that language.
If you ask the model to tell us about some native term, it gets the answer completely wrong. A model fine-tuned on a small amount of data can answer partially in the language, but the answer doesn't actually make sense.
The model created by further pre-training an LLM on a huge amount of unstructured text in the language of interest can now speak that language fluently.
So as you can see, pre-training is critical here to getting a good language model.
How can we make the results better? Some people will think of fine-tuning. Fine-tuning involves training your model on a small amount of task-specific data. It is important to note that, in contrast to fine-tuning, which can sometimes be done with a few hundred thousand tokens and can be quite cheap, pre-training requires lots of data and so is expensive.
Training a 248-million-parameter model on 16 H100 GPUs, for example, may take seven hours and cost around $1,500 on AWS.
LLM Data Cleaning
When pre-training a model, it is important to start with a high-quality training dataset.
The datasets used for pre-training LLMs are made up of vast amounts of unstructured text. Each text sample is used to train an LLM to repeatedly predict the next word, known as autoregressive text generation. During the training phase, the model's weights are updated as it processes each example in the training data, until over time, the model becomes good at predicting the next word.
You can think of this phase as being like reading, where the input texts are used in their original form without any additional structuring of the training samples. Huge amounts of training text, equivalent to millions of completed books, are required for language models to get really good at next-word prediction and to encode reliable knowledge about the world. In contrast, the data used for fine-tuning is highly structured.
For example, question-answer pairs, instruction-response pairs, and so on.
So, the form of the fine-tuning sample is quite different. The goal of fine-tuning is to get the model to behave in a certain way or to get good at completing a particular task. If pre-training is like reading many, many books, you can think of fine-tuning as being like taking a practice exam.
You aren't really learning new knowledge; you learned everything from your reading during pre-training. Instead, fine-tuning is just learning how to answer questions in a specific way.
If you want to read a lot of text, you have to find a lot of books, code examples, articles, Wikipedia pages, webpages, and so on. Pre-training datasets are built from large collections of text documents, many of which are sourced from the internet. The world is filled with text, so it's quite easy to find lots of text for pre-training.
Fine-tuning datasets, on the other hand, require precise questions and high-quality corresponding answers. Traditionally, this work has been done by humans, which takes time and can be expensive. More recently, teams have been using LLMs to generate fine-tuning data, but you need to use a very capable model for this to work well.
In fact, you need to do a bit more work to create good-quality fine-tuning datasets. You will compare and contrast some sample pre-training and fine-tuning datasets. Data quality is very important for pre-training LLMs.
If there are issues with your training data, for example, lots of duplicate examples, spelling errors, factual inconsistencies or inaccuracies, and toxic language, then your resulting LLM will not perform well. Taking steps to address these issues and make sure that your training data is of high quality will result in a better LLM and more return on your training investment. Here are major tasks you should complete to clean your text data for training.
The first is deduplication. Having duplicated data can bias your model towards particular patterns and examples. It also increases your training time while not necessarily improving model performance.
Thus, removing duplicate text is a crucial step in cleaning your data. This should be done both within individual documents and across all documents. You also want the intrinsic quality of your training data to be high.
The text should be in the language you are interested in, be relevant to any topics you want the LLM to build knowledge of, and meet any other quality metrics that you have. You can design quality filters to clean up this aspect of your training data. A related step is applying content filters to remove potentially toxic or biased content. Safety is an important concern.
Then, to avoid potential data leakage, you should always remove personally identifiable information, or PII, from any of your examples. One common strategy is to redact it in the training text. Lastly, you can come up with rules for how to fix common quality issues like all caps, extra punctuation, and poorly formatted text.
As you can see, data cleaning can be complicated and takes lots of time. Luckily, more and more tools are available to help you with this important step.
One example is Dataverse, an open source project. Dataverse is a ready-to-use data cleaning pipeline that will take your raw training data, apply the cleaning steps discussed here along with several others, and then package up your data in a way that is ready for training. You can take a look at the GitHub page to learn more about how to use Dataverse.
Data cleaning steps
Let's start with data collection. Since the objective of pre-training is next-token prediction, you need a gigantic corpus of unlabeled data.
You can often acquire this data by scraping from the web, gathering documents within your organization, or simply downloading open datasets from data hubs.
The content itself is not important; what matters is that each example consists of plain text data.
For pre-training, this is what we want: plain text that is not structured into any kind of instruction format, such as a question-answer pair.
Feel free to change the index number here if you want to explore any other example within the dataset. Now let's download another dataset called Alpaca. Alpaca is a fine-tuning dataset which contains 52,000 instruction-following examples generated by GPT-4.
Here you can see the dataset consists of an instruction, an input, and an output. Let's see what an example looks like. Here we are going to see the first example and print the instruction, input, and output.
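As a rough sketch of these steps in code (the Hugging Face repository IDs below are assumptions, so substitute whichever pre-training and instruction datasets you are working with):

```python
from datasets import load_dataset

# Pre-training data: plain, unstructured text with a single "text" column.
pretraining_dataset = load_dataset("upstage/Pretraining_Dataset", split="train")
print(pretraining_dataset)
print(pretraining_dataset[2]["text"][:500])  # change the index to explore other examples

# Fine-tuning data: structured instruction-following records.
instruction_dataset = load_dataset("vicgalle/alpaca-gpt4", split="train")
print(instruction_dataset.column_names)  # includes 'instruction', 'input', 'output'

example = instruction_dataset[0]
print("Instruction:", example["instruction"])
print("Input:", example["input"])
print("Output:", example["output"])
```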
The output gives three tips for staying healthy. Note that in contrast to the pre-training dataset, which consists solely of text, this instruction dataset, Alpaca, includes instruction, input, and output columns. Since we are interested in pre-training, we will use only the pre-training dataset from now on.
Now let's try scraping from the web to form a custom dataset. To do this, we will download nine random Python scripts. Note that in practice, you will have many, many more samples, up to billions.
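A minimal sketch of building the combined dataset, assuming the scripts have already been downloaded to a local folder (the paths are hypothetical) and that pretraining_dataset was loaded as above:

```python
import os
from datasets import Dataset, concatenate_datasets

# Keep only the "text" column so both datasets share the same schema.
pretraining_dataset = pretraining_dataset.select_columns(["text"])

code_dir = "./code"  # hypothetical folder holding the downloaded .py files
code_samples = []
for filename in sorted(os.listdir(code_dir)):
    if filename.endswith(".py"):
        with open(os.path.join(code_dir, filename), "r") as f:
            code_samples.append({"text": f.read()})

code_dataset = Dataset.from_list(code_samples)

# Combine the downloaded text data with the custom code data.
dataset = concatenate_datasets([pretraining_dataset, code_dataset])
print(dataset.num_rows)
```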
This is a very practical step you will take when pre-training your own model: you download some data, add some custom data, and combine them. Now we have a total of 60,009 rows.
Let's go through some typical steps for data cleaning and see how the number of rows decreases as we progress. First, we will filter out samples that are too short. We will write a function for this, which is a common practice when preparing pre-training data.
Simply put, we keep text that has at least three lines or sentences and each line of the text contains at least three words. We want to do this because our objective in pre-training is to predict the next token, but short examples are not very useful for that task. So let's try running this function.
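A sketch of such a filter, following the rule described above (the exact thresholds are illustrative choices):

```python
def paragraph_length_filter(example):
    """Keep samples with at least 3 lines, each containing at least 3 words."""
    lines = example["text"].split("\n")
    if len(lines) < 3:
        return False
    return min(len(line.split()) for line in lines) >= 3

# filter() keeps only the rows for which the function returns True.
dataset = dataset.filter(paragraph_length_filter)
print(dataset.num_rows)
```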
Note that the dataset library has a filter method which applies a function to each example in the dataset. If you check the number of rows, you can see that over 7,000 rows got eliminated. Now we'll move on to the second part where we remove repetitions.
So this is basically a function that, given a list of paragraphs, finds duplicates. We use it to find repetitions within a sample, and if, relative to the sample's length, it contains too many duplicates, we return False to get rid of that sample. We will run this function over the whole dataset.
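Here is a sketch of that repetition filter; the 30% and 20% thresholds are illustrative choices:

```python
import re

def find_duplicates(paragraphs):
    """Count how many paragraphs, and how many characters, are exact repeats."""
    seen = set()
    duplicate_paragraphs, duplicate_chars = 0, 0
    for paragraph in paragraphs:
        if paragraph in seen:
            duplicate_paragraphs += 1
            duplicate_chars += len(paragraph)
        else:
            seen.add(paragraph)
    return duplicate_paragraphs, duplicate_chars

def paragraph_repetition_filter(example):
    """Drop samples in which too large a fraction of the text is repeated."""
    text = example["text"]
    paragraphs = re.split(r"\n{2,}", text.strip())
    dup_paragraphs, dup_chars = find_duplicates(paragraphs)
    if dup_paragraphs / len(paragraphs) > 0.3:
        return False
    if dup_chars / max(len(text), 1) > 0.2:
        return False
    return True

dataset = dataset.filter(paragraph_repetition_filter)
print(dataset.num_rows)
```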
Now we're down to 52,000 examples, which is a decrease of only 30 rows. That is a tiny decrease, but this is one advantage of downloading datasets from HuggingFace: datasets there often have a lot of the pre-processing done already. For the third part of pre-processing, let's move on to deduplication.
This function removes duplicate entries by storing unique text segments and comparing each new text against them. Let's try running that function. As a result, 8,000 rows were removed, and that is a big decrease.
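A sketch of that deduplication step:

```python
def deduplicate(ds):
    """Keep only the first occurrence of each unique text sample."""
    seen_text = set()

    def is_first_occurrence(example):
        if example["text"] in seen_text:
            return False
        seen_text.add(example["text"])
        return True

    # A single process and no caching keep the shared set consistent.
    return ds.filter(is_first_occurrence, num_proc=1, load_from_cache_file=False)

dataset = deduplicate(dataset)
print(dataset.num_rows)
```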
In reality, there is also a lot of duplication in documents, so make sure you cover this step. The last step is language filtering. This is one of the quality filters that Sung previously mentioned.
If you want to focus on a particular language or domain, it is good to filter out other languages or domains so that the model is trained on relevant text. Here we'll use the FastText language classifier to only keep English samples to train our model. You will see this warning, but don't worry about it too much.
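A sketch of the language filter, assuming the fastText language-identification model (lid.176.bin) has been downloaded locally; the path and the 0.4 confidence threshold are illustrative choices:

```python
import fasttext

lang_model = fasttext.load_model("./models/lid.176.bin")  # path is illustrative

def english_only(example):
    """Keep samples that fastText classifies as English with reasonable confidence."""
    labels, scores = lang_model.predict(example["text"].replace("\n", " "))
    language = labels[0].replace("__label__", "")  # e.g. '__label__en' -> 'en'
    return language == "en" and scores[0] > 0.4

dataset = dataset.filter(english_only)
print(dataset.num_rows)
```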
Also note that this run is slower than the filters we ran above. That is because this is an actual machine learning model in action.
Let's check the number of rows. Now we're down to 40,000 after removing approximately 3,000 rows. Here, I would like to note that starting with a large dataset in the first place is very important, because you are constantly throwing out rows as you clean the dataset.
Finally, we will save the data to the local file system in parquet format. Note that in reality, you would want to save the data at each stage of cleaning, because you're handling a large amount of data and it cannot all be held in memory. Parquet is a columnar storage file format that is widely used in big data and data analytics scenarios.
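Saving is a one-liner with the datasets library; the output path below is illustrative:

```python
import os

os.makedirs("./data", exist_ok=True)
dataset.to_parquet("./data/preprocessed_dataset.parquet")
```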
You're free to use any other format like CSV or JSON, but since parquet is really fast, we're choosing it here. The next step in the process is to prepare your saved data set for training. This involves some additional manipulations of the data.
Data tokenizing and packing
Now that you have your clean data set, you need to prepare it for training. There is a bit more manipulation of the data that you have to do before you can use it in a training run. The two main steps are tokenizing the data and then packing it.
LLMs don't actually work directly with text; their internal calculations require numbers. Tokenization is the step that transforms your text data into numbers. The exact details of how text is mapped to tokens depend on the vocabulary and the tokenization algorithm of your model.
Each model has a specific tokenizer, and it is important to choose the right one or your model won't work. Packing structures the data into continuous sequences of tokens, all at the maximum length the model supports. This reshaping makes training efficient.
Let's start with tokenizing. You can choose a tokenizer from any existing model hosted on Hugging Face or create your own. Often, models in the same family use the same tokenizer. In this case, we will be using the TinySolar tokenizer, which is in the same family as the Solar models.
Now we are going to calculate the total number of tokens in our dataset. When training LLMs, we are often interested in calculating the total number of tokens, and we can easily check this with NumPy.
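A sketch of the tokenization and token-counting step. The tokenizer ID below is an assumption for a small Solar-family model; use the tokenizer that matches the model you plan to train:

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("upstage/TinySolar-248m-4k")

def tokenize(example):
    # Convert the text to token IDs and append the end-of-sequence token
    # so individual samples stay separated after packing.
    token_ids = tokenizer(example["text"]).input_ids
    token_ids.append(tokenizer.eos_token_id)
    return {"input_ids": token_ids, "num_tokens": len(token_ids)}

dataset = dataset.map(tokenize)
print("Total tokens in dataset:", np.sum(dataset["num_tokens"]))
```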
So with this small dataset, which started out with approximately 4,000 text samples, you end up with about 5 million tokens.
Let's pack our dataset. So we now have our clean data tokenized and packed into the right shape for training.
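Here is a minimal sketch of that packing step; max_seq_length is an illustrative value, so set it to your model's context length:

```python
import numpy as np
from datasets import Dataset

max_seq_length = 2048  # illustrative; use your model's context length

# Concatenate every token ID into one long stream, drop the remainder that
# doesn't fill a complete sequence, then reshape into fixed-length rows.
all_token_ids = np.concatenate(dataset["input_ids"])
n_full = len(all_token_ids) // max_seq_length
packed = all_token_ids[: n_full * max_seq_length].reshape(-1, max_seq_length)

packed_dataset = Dataset.from_dict({"input_ids": packed.tolist()})
print(packed_dataset)
```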
Model Training
Decoder-only or autoregressive models
Now you need a model to train. There are several ways to configure and initialize a model for training. And your choice will impact how quickly pre-training proceeds.
Although there are several variations of the transformer architecture used in large language models, we're focusing on decoder-only, or autoregressive, models. The decoder-only architecture simplifies the model and is more efficient for next-token prediction.
OpenAI's GPT models and most other popular LLMs, like Llama and Mistral, have adopted a decoder-only architecture. A decoder-only model is made of an embedding layer that turns tokens into vector representations, followed by several decoder layers, each of which contains several different neural network components. Lastly, the model ends with a classifier layer that predicts the most probable next token from the vocabulary.
Initialize the weights
Once we decide the architecture, the next step is to initialize the weights. These weights get updated during training as the model learns to predict the next token from the examples in the training data. There are a few ways that you can initialize the weights. The simplest choice is to initialize the weights with random values.
This is okay, but it means that training takes a very long time and requires a huge amount of data. A better way is to reuse existing weights. For example, you can start from Llama 7B or Mistral 7B weights.
This means your model has already been trained and has some basic knowledge, so it can generate text quite well already. This is the best way to start if you want to continue pre-training a model on new domain data. Training in this scenario generally takes much less data and time than starting from random weights, but it is still much more data than fine-tuning.
With all the open models out in the world right now, this can be a great option for creating your own custom LLM.
In one case, we kept exactly the same model size but used more data: 200 billion tokens for this training run. The hyperparameters we used here are also very different from those used in fine-tuning. The total price was about $0.2 million, so it's still expensive, but much, much cheaper than training from scratch. In another case, we used 1 trillion tokens, so the approach was more expensive, costing about $1 million. However, this is still much less data than would be needed to train a model of this size from scratch, which would be around 3 trillion tokens.
Model Scaling
You might notice that our model has 10 billion parameters, which is not the same size as the pre-trained model whose weights we used for initialization. We found that the available 7-billion-parameter model was not quite good enough for our purposes, but we were limited by our hardware to training a model with fewer than 13 billion parameters. So we took advantage of a technique called "model scaling" to create a new model with a different size.
Model scaling removes or adds layers to an existing model and then carries out more training to create a new model with a different size. What if you want to make a smaller model? One option is called downscaling.
Downscaling involves removing layers to produce a smaller model than the one you started with. This approach can work well for large models, but it doesn't work well for small models. In general, layers near the middle of the model are removed, and then the resulting smaller model is pre-trained on a large body of text to bring its weights back into coherence.
The better method is called upscaling. Here you start with a smaller model, then duplicate some layers to make a larger model. Let's take a look at an example. To make a 10-billion-parameter model with upscaling, you can start with a 7-billion-parameter model. For illustration, let's assume the 7B model has 4 layers. In reality, Llama 7B, for example, has 32 layers.
You can make two copies of the model, then take some top layers from one copy and some bottom layers from the second copy and put them together to create a new model with 6 layers. At this point, the model is no longer coherent, and inference would not work well.
Continued pre-training is required to bring the model back into coherence and enable text generation. However, because the model weights of the copied layers have already encoded some language understanding and knowledge, it takes less data and time to create a good model. In fact, upscaling can allow you to train a larger, well-performing model with 70% less data than training the equivalent model from scratch.
So depth upscaling can actually be a more cost-effective way to pre-train a model, although it's still expensive.
Let's take a look at how you can create models using each of these methods. Let's begin as before by setting a configuration to minimize warnings and by setting a seed for reproducibility.
The models we will be creating here are based on the Meta Llama 2 architecture, a decoder-only model that is one of the most frequently used architectures by LLM developers. You can set configuration options using the LlamaConfig class of the Transformers library. We will reuse most of the parameters of the original Llama 2 model.
But since we want to run our model with limited computation, let's adjust some parameters to reduce the model size. We will set the number of hidden layers to 12 and shrink the model in terms of hidden size, intermediate size, and number of key-value heads. Experimenting with these settings is hard because pre-training takes so much time and is expensive.
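A sketch of such a configuration; the specific values below are illustrative choices for a roughly 248-million-parameter model:

```python
from transformers import LlamaConfig

config = LlamaConfig()                 # start from the Llama 2 defaults
config.num_hidden_layers = 12          # down from 32 in Llama 2 7B
config.hidden_size = 1024              # smaller hidden dimension
config.intermediate_size = 4096        # smaller feed-forward dimension
config.num_key_value_heads = 8         # fewer key-value heads
config.torch_dtype = "bfloat16"
config.use_cache = False               # caching is not needed during training
print(config)
```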
The best place to look for advice on designing a model's architecture is the academic literature, so look for papers on arXiv and in conference proceedings.
Now that we have determined our model configurations, let's initialize the model. The first and most naive way to initialize a model would be to initialize it with random weights. Initializing a model from random weights is very easy with the Transformers library.
All you need to do is pass in the config we've just defined to create an instance of LlamaForCausalLM. Before we move on, let's check the size of the model. When training an LLM, we always want to keep track of the model size, because size directly impacts compute and cost.
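Continuing from the config above, a minimal sketch (the helper function here is just for convenience):

```python
from transformers import LlamaForCausalLM

# Create a model with randomly initialized weights from the config.
model = LlamaForCausalLM(config)

def print_nparams(model):
    """Print the total number of parameters in the model."""
    nparams = sum(p.numel() for p in model.parameters())
    print(f"Total number of parameters: {nparams}")

print_nparams(model)
```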
So our current model is sized at 248 million parameters. When a model is randomly initialized, the weights are assigned random values.
Let's take a look at a small sample of weights from one of the layers in the self-attention head. The model is randomly initialized and not trained on any data. Do you want to try it for inference? Can you guess what it will output? So you've seen this happen before.
We are first going to load a tokenizer.
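A minimal inference sketch; the tokenizer ID is the same assumption used earlier, and the prompt is arbitrary:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("upstage/TinySolar-248m-4k")

prompt = "I am an engineer. I love"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```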
You will see some random outputs because our model is not trained yet. Before we move on, let's release the memory. This is because these models we created take up to several hundred megabytes and we need to release the memory to avoid crashing the kernel.
Now, instead of random weight initialization, let's try using a pre-existing pre-trained model. All we need to do is load the model using AutoModelForCausalLM, and we are ready to keep training.
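A sketch of loading an existing model as the starting point; the model ID is the same small-model assumption used above:

```python
import torch
from transformers import AutoModelForCausalLM

# Load pre-trained weights instead of random ones, ready for continued pre-training.
model = AutoModelForCausalLM.from_pretrained(
    "upstage/TinySolar-248m-4k",
    torch_dtype=torch.bfloat16,
)
```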
Taking an existing model and continuing to train it on new data is called continued pre-training and is a much faster way to train a model on new data than starting from scratch. Before we move on, let's empty the memory once more. Earlier, we showed how you can remove layers from a large model to create a smaller one in a process called downscaling.
Here's how you can do that. You will be shrinking a 12-layer, 248-million-parameter model by removing the middle layers. To start, let's check how many layers the model currently has.
You can see that the model currently has 12 layers and has 248 million parameters. Now let's create a smaller model from our initial model by deleting two of the mid-layers. Here we will be selecting the first five layers and the last five layers and concatenating them to form a total of 10 layers.
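A sketch of the layer-removal step, continuing from the 12-layer model loaded above:

```python
import torch.nn as nn

print("layers before:", len(model.model.layers))   # 12

# Keep the first 5 and last 5 decoder layers, dropping the 2 middle layers.
layers = model.model.layers
model.model.layers = nn.ModuleList(list(layers[:5]) + list(layers[-5:]))

# Keep the config consistent with the new depth.
model.config.num_hidden_layers = len(model.model.layers)
print("layers after:", len(model.model.layers))    # 10
```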
Now you have 10 layers left, which is what we wanted. So now this model configuration is ready to start using for pre-training. As you heard earlier, downscaling works best with larger models.
This small model here would not be a good choice and is only being used to show you the method. Let's go ahead and empty our memory once more. So now you are going to try upscaling a pre-existing pre-trained model.
By upscaling, we mean that we start from a small pre-trained model and end up with a larger model. Here we will be upscaling a model with 12 layers to a model with 16 layers. The first step is to create a model instance for the larger final model we are going to train.
So these are the basic configurations for the larger model. As above, we start with the Llama2 model architecture. And all numbers other than the number of hidden layers are the same as the smaller pre-trained model we are going to upscale.
Let's finish this part up by initializing the larger model with random weights. Next, you are going to overwrite these randomly assigned weights using the weights from a pre-trained model. So let's load the smaller pre-trained model into memory so you can copy layers from it.
Here you will use the smaller pre-trained model, which has 12 layers, to upscale to our 16-layer model. First, you'll take the bottom-most 8 layers and the top-most 8 layers and concatenate them to form a total of 16 layers. You'll then overwrite the weights of the randomly initialized model with these values.
Lastly, these lines of code here copy over the components that make up the embedding and classification layers for the model. So those can be used as well. Let's check the number of parameters to confirm that it hasn't changed.
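Putting the whole upscaling recipe together as a sketch; the config values and model ID are the same assumptions used in the earlier examples:

```python
from copy import deepcopy

import torch.nn as nn
from transformers import AutoModelForCausalLM, LlamaConfig, LlamaForCausalLM

# Configuration for the larger, 16-layer model; all other values match the
# smaller model we are upscaling.
config_16 = LlamaConfig(
    num_hidden_layers=16,
    hidden_size=1024,
    intermediate_size=4096,
    num_key_value_heads=8,
    use_cache=False,
)
model = LlamaForCausalLM(config_16)  # randomly initialized 16-layer model

# Load the smaller, 12-layer pre-trained model to copy layers from.
pretrained_model = AutoModelForCausalLM.from_pretrained("upstage/TinySolar-248m-4k")

# Bottom 8 layers + top 8 layers of the 12-layer model give 16 layers in total;
# the 4 middle layers end up duplicated.
model.model.layers = nn.ModuleList(
    [deepcopy(layer) for layer in pretrained_model.model.layers[:8]]
    + [deepcopy(layer) for layer in pretrained_model.model.layers[-8:]]
)

# Copy the embedding and classification (lm_head) weights as well.
model.model.embed_tokens = deepcopy(pretrained_model.model.embed_tokens)
model.lm_head = deepcopy(pretrained_model.lm_head)

# The parameter count should still match the 16-layer configuration.
print(sum(p.numel() for p in model.parameters()))
```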
Let's also try inferencing the model. Now this is interesting.
The model has been initialized with another model's weights, so it has some ability to generate what we need. But the layers are not yet coherent, so the generation isn't good. This is why it's necessary to continue pre-training this model on more data. But as you can see here, you are much further along than when you started with random weights. This is why upscaling can help you train models much faster. Then during training, you'll be updating all the weights of this model so all of the layers work together as expected.
Let's save this model and then train it.
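A sketch of the save step; the output directory is illustrative:

```python
# Save the upscaled model (and the tokenizer used with it) so they can be
# reloaded later for the training run.
output_dir = "./models/upscaled_model_init"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```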