Pre-training GPT-2 from scratch
Let us train a GPT-2 (small, 124 million parameters) model from scratch using the Hugging Face library. Instead of using the WebText dataset (due to limited compute resources), I preferred the BookCorpus dataset, which contains 74 million samples (far lower than today’s standard). The BookCorpus dataset was used to train GPT-1. So, there won’t ...
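As a rough companion to the setup described above, here is a minimal sketch of pre-training GPT-2 small with the Hugging Face `transformers` and `datasets` libraries; the dataset id, block size and training arguments are illustrative assumptions, not the exact configuration used in the post.

```python
# Minimal sketch: pre-training GPT-2 small from scratch on BookCorpus.
# Dataset id, max_length and TrainingArguments values are illustrative assumptions.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # reuse GPT-2's BPE tokenizer
tokenizer.pad_token = tokenizer.eos_token               # GPT-2 has no pad token by default
config = GPT2Config()                                   # defaults correspond to GPT-2 small (~124M)
model = GPT2LMHeadModel(config)                         # randomly initialised, not pre-trained

raw = load_dataset("bookcorpus", split="train")         # assumed dataset id on the Hub

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(output_dir="gpt2-bookcorpus",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=5e-4)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```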
All About nn.Modules in PyTorch
If you are a researcher or someone who regularly builds or tweaks deep learning models using the PyTorch framework, or any high-level framework built on top of PyTorch such as Hugging Face, then it is extremely important to understand PyTorch’s nn.Module. This is because your model could run without displaying any symptoms even i...
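To make the point concrete, here is a small self-contained example (not from the post itself) of subclassing nn.Module: submodules assigned as attributes are registered automatically, so their parameters show up in `model.parameters()` and receive gradients during training.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_dim=16, hidden=32, n_classes=2):
        super().__init__()                    # required: sets up parameter/submodule registration
        self.net = nn.Sequential(             # submodules stored as attributes are registered
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                     # forward() defines the computation; call model(x)
        return self.net(x)

model = TinyClassifier()
print(sum(p.numel() for p in model.parameters()))   # every registered parameter is tracked
print(model(torch.randn(4, 16)).shape)              # torch.Size([4, 2])
```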
Model Training Strategies
Training large language models from scratch is the job of tech giants. Often, pre-trained proprietary models are adapted to downstream tasks using instruction fine-tuning. However, full fine-tuning of all model parameters improves performance further. Of course, full fine-tuning of large models with billions of parameters requires a go...
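As one example of a parameter-efficient alternative to full fine-tuning, the sketch below wraps a pre-trained causal LM with LoRA adapters using the Hugging Face `peft` library; the base checkpoint and LoRA hyperparameters are assumptions chosen for illustration.

```python
# Minimal sketch: LoRA fine-tuning instead of full fine-tuning.
# Base checkpoint and LoRA hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the small adapter matrices require gradients
```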
Introduction to Large Language Models - Course
For the past three months, I have been quite busy building course materials (lecture slides, graded assignments and coding assignments) for the first offering of the course Introduction to Large Language Models by Prof. Mitesh Khapra. It has been challenging work, as we have committed to offering the course in the JAN 2024 term. Every challenge is an...
Positional Encoding In Transformers
Introduction
One question that comes to mind when reading about positional encoding is whether it actually helps the model. It has been observed that adding position embeddings helps only marginally in a convolutional neural network, as a CNN (implicitly) uses relative position information. That is not the case for transformers, as the model is permutation invari...
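For reference, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper, added to the otherwise order-agnostic token embeddings; the sequence length and model dimension below are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Token embeddings alone carry no order information; adding the encoding
# breaks the permutation invariance of self-attention.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
token_embeddings = torch.randn(128, 512)
model_input = token_embeddings + pe
```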
Data Pipeline for Large Language models
Data is the fuel for any machine learning model, regardless of the type of learning algorithm (gradient-based or tree-based) being used. To train and test a model’s generalization capacity, we typically divide the available samples into three sets: training, validation and test. The typical requirement is that the samples in the test set should be d...
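A minimal sketch of such a three-way split is shown below, using scikit-learn's `train_test_split` twice; the 80/10/10 ratio is an assumption for illustration.

```python
from sklearn.model_selection import train_test_split

samples = list(range(1000))                 # placeholder for the available samples

# First carve out the test set, then split the remainder into train/validation.
train_val, test = train_test_split(samples, test_size=0.10, random_state=42)
train, val = train_test_split(train_val, test_size=1/9, random_state=42)  # 1/9 of 90% = 10% overall

print(len(train), len(val), len(test))      # 800 100 100
```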
Experimental Settings of Famous Language Models
GPT (Generative Pre-trained Transformer)
Pre-training dataset: BookCorpus (0.8 billion words)
Unsupervised objective: CLM (Autoregressive)
Tokenizer: Byte Pair Encoding (BPE)
Vocab size: 40K
Architecture: Decoder only (12 Layers)
Activation: GELU
Attention: Dense
FFN: Dense
Attention mask: Causal mask (see the sketch after this list)
Positional ...
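The causal mask listed above can be built in a few lines with `torch.tril`; the sequence length and the dummy attention scores below are illustrative assumptions.

```python
import torch

seq_len = 8                                               # illustrative sequence length
# Lower-triangular causal mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                    # raw attention scores (dummy values)
scores = scores.masked_fill(~causal_mask, float("-inf"))  # block attention to future tokens
weights = torch.softmax(scores, dim=-1)                   # each row sums to 1 over allowed positions
```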
Emergence of Large Language Models (LLMs)
Motivation
Usually, in traditional machine learning, we use model-selection approaches such as $K$-fold cross-validation and grid search to find the best model, i.e., the one that generalizes well in the real world. However, when it comes to deep learning, this is quite challenging due to compute-cost constraints. The same holds for neural language models too...
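For readers unfamiliar with those approaches, here is a minimal sketch of classical model selection with $K$-fold cross-validation and grid search; the estimator and parameter grid are illustrative, and the point is that this search is cheap for small models but prohibitive at LLM scale.

```python
# Minimal sketch: K-fold cross-validation + grid search (illustrative estimator and grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),   # K-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)                # best hyperparameters and CV accuracy
```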