
Pre-training GPT-2 from scratch

Let us train a GPT-2 (small, 124 million parameters) model from scratch using the Hugging Face library. Instead of using the WebText dataset (due to limited compute resources), I preferred to use the BookCorpus dataset, which contains 74 million samples (far smaller than today’s standards). The BookCorpus dataset was used to train GPT-1. So, there won’t ...
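As a rough sketch of the setup (assuming the `bookcorpus` dataset on the Hugging Face Hub and a randomly initialised GPT-2 small; this is not the exact code from the post), the starting point looks something like this:

```python
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Randomly initialised GPT-2 small (~124M params): 12 layers, 12 heads, 768-dim.
config = GPT2Config()             # defaults correspond to GPT-2 small
model = GPT2LMHeadModel(config)   # no from_pretrained(): we train from scratch

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # reuse GPT-2's BPE vocab
dataset = load_dataset("bookcorpus", split="train")    # ~74M short text samples

def tokenize(batch):
    return tokenizer(batch["text"])

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
print(f"{model.num_parameters() / 1e6:.0f}M parameters")  # ~124M
```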

Read more

All About nn.Modules in PyTorch

If you are a researcher or someone who builds or tweaks deep learning models regularly using the PyTorch framework, or any other high-level framework built on top of PyTorch such as Hugging Face, then it is extremely important to understand PyTorch’s nn.Modules. This is because your model could run without displaying any symptoms even i...
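A minimal sketch (not from the post) of the kind of silent failure this is about: keeping layers in a plain Python list instead of nn.ModuleList means their parameters are never registered, so the model still runs but those layers are invisible to the optimizer.

```python
import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: these layers are NOT registered as submodules,
        # so their parameters never show up in .parameters().
        self.layers = [nn.Linear(4, 4) for _ in range(3)]

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

class Fixed(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer, so the optimizer can update them.
        self.layers = nn.ModuleList(nn.Linear(4, 4) for _ in range(3))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

print(sum(p.numel() for p in Broken().parameters()))  # 0  -- silently untrainable
print(sum(p.numel() for p in Fixed().parameters()))   # 60 -- (4*4 + 4) * 3
```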

Read more

Model Training Strategies

Training large language models from scratch is the job of tech giants. Often, pre-trained proprietary models are adapted to downstream tasks using instruction fine-tuning. However, full fine-tuning of the model parameters improves model performance. Of course, full fine-tuning of large models with billions of parameters requires a go...
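As a rough illustration (using "gpt2" from Hugging Face as a stand-in checkpoint; this is not the post's code), the cost difference shows up directly in the number of trainable parameters:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for any pre-trained model

# Full fine-tuning: every parameter needs gradients and optimizer states,
# which is what makes it expensive for billion-parameter models.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"full fine-tuning: {trainable / 1e6:.0f}M trainable parameters")

# One cheaper alternative: freeze the backbone and update only the last
# transformer block (a crude stand-in for parameter-efficient methods).
for p in model.parameters():
    p.requires_grad = False
for p in model.transformer.h[-1].parameters():
    p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"last-block fine-tuning: {trainable / 1e6:.0f}M trainable parameters")
```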

Read more

Introduction to Large Language Models - Course

For the past three months, I have been quite busy building course materials (lecture slides, graded assignments and coding assignments) for the first offering of the course Introduction to Large Language Models by Prof. Mitesh Khapra. It has been challenging work, as we have committed to offering the course in the Jan 2024 term. Every challenge is an...

Read more

Positional Encoding In Transformers

Introduction One question that bothers our minds when we read about positional encoding is whether it helps the model or not. It has been observed that adding position embeddings helps only marginally in a convolutional neural network, as a CNN uses relative position information (implicitly). That is not the case for transformers, as the model is permutation invari...
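A small sketch (assumed, not from the post) of that permutation argument: without positional encodings, permuting the input tokens of a self-attention layer merely permutes its output, so word order carries no signal on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 5, 8)     # 5 "tokens" with no positional encoding added
perm = torch.randperm(5)

out, _ = attn(x, x, x)
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])

# Permuting the input only permutes the output: the layer itself
# cannot distinguish different word orders.
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))  # True
```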

Read more

Data Pipeline for Large Language Models

Data is the fuel for any machine learning model, regardless of the type of learning algorithm (gradient-based or tree-based) being used. To train and test a model’s generalization capacity, we typically divide the available samples into three sets: training, validation and test. The typical requirement is that the samples in the test set should be d...
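A minimal sketch of such a three-way split (the proportions here are assumed for illustration, not taken from the post):

```python
from sklearn.model_selection import train_test_split

samples = list(range(1000))  # stand-in for the available examples

# First carve out a held-out test set, then split the rest into train/validation.
train_val, test = train_test_split(samples, test_size=0.1, random_state=42)
train, val = train_test_split(train_val, test_size=0.1, random_state=42)

print(len(train), len(val), len(test))  # 810 90 100
```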

Read more

Experimental Settings of Famous Language Models

GPT (Generative Pre-trained Transformer)
Pre-training dataset: BookCorpus (0.8 billion words)
Unsupervised objective: CLM (autoregressive)
Tokenizer: Byte Pair Encoding (BPE)
Vocab size: 40K
Architecture: Decoder-only (12 layers)
Activation: GELU
Attention: Dense
FFN: Dense
Attention mask: Causal mask
Positional ...
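A sketch of the settings listed above expressed as a Hugging Face config (assumed, not from the post; the 512-token context length is GPT-1’s published value and is not part of the list):

```python
from transformers import OpenAIGPTConfig, OpenAIGPTLMHeadModel

config = OpenAIGPTConfig(
    vocab_size=40478,   # ~40K BPE vocabulary
    n_positions=512,    # GPT-1's context length (not in the list above)
    n_embd=768,
    n_layer=12,         # decoder-only, 12 layers
    n_head=12,
    afn="gelu",         # GELU activation
)
model = OpenAIGPTLMHeadModel(config)  # autoregressive (CLM) head with a causal mask
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```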

Read more

Emergence of Large Language Models (LLMs)

Motivation Usually, in traditional machine learning, we use numerous model-selection approaches like $K$-fold cross-validation and grid search to find the best model that generalizes well in the real world. However, when it comes to deep learning, this is quite challenging due to compute-cost constraints. The same holds for neural language models too....
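For illustration (a generic scikit-learn example, not from the post), the grid-search-with-cross-validation recipe is cheap for small models but quickly becomes infeasible for large neural networks:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Grid search over one hyperparameter with 5-fold cross-validation:
# 3 settings x 5 folds = 15 model fits, trivial here, prohibitive for LLMs.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```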

Read more