Pre-training GPT-2 from scratch
Let us train a GPT-2 (small, 124 million parameters) model from scratch using the Hugging Face library. Instead of using the WebText dataset (due to limited compute resources), I preferred the BookCorpus dataset, which contains 74 million samples (far lower than today’s standard). The BookCorpus dataset was used to train GPT-1. So, there won’t ...
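As a rough companion to the setup described above, here is a minimal sketch of pre-training GPT-2 small with the Hugging Face `transformers` and `datasets` libraries; the dataset id, block size and training arguments are illustrative assumptions, not the exact configuration used in the post.

```python
# Minimal sketch: pre-training GPT-2 small from scratch on BookCorpus.
# Dataset id, max_length and TrainingArguments values are illustrative assumptions.
from datasets import load_dataset
from transformers import (DataCollatorForLanguageModeling, GPT2Config,
                          GPT2LMHeadModel, GPT2TokenizerFast, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # reuse GPT-2's BPE tokenizer
tokenizer.pad_token = tokenizer.eos_token               # GPT-2 has no pad token by default
config = GPT2Config()                                   # defaults correspond to GPT-2 small (~124M)
model = GPT2LMHeadModel(config)                         # randomly initialised, not pre-trained

raw = load_dataset("bookcorpus", split="train")         # assumed dataset id on the Hub

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal LM objective

args = TrainingArguments(output_dir="gpt2-bookcorpus",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=5e-4)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```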
All About nn.Modules in PyTorch
If you are a researcher or someone who regularly builds or tweaks deep learning models using the PyTorch framework, or any high-level framework built on top of PyTorch such as Hugging Face, then it is extremely important to understand PyTorch’s nn.Module. This is because your model could run without displaying any symptoms even i...
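To make the point concrete, here is a small self-contained example (not from the post itself) of subclassing nn.Module: submodules assigned as attributes are registered automatically, so their parameters show up in `model.parameters()` and receive gradients during training.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, in_dim=16, hidden=32, n_classes=2):
        super().__init__()                    # required: sets up parameter/submodule registration
        self.net = nn.Sequential(             # submodules stored as attributes are registered
            nn.Linear(in_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):                     # forward() defines the computation; call model(x)
        return self.net(x)

model = TinyClassifier()
print(sum(p.numel() for p in model.parameters()))   # every registered parameter is tracked
print(model(torch.randn(4, 16)).shape)              # torch.Size([4, 2])
```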
Model Training Strategies
Training large language models from scratch is the job of tech giants. Often, pre-trained proprietary models are adapted to downstream tasks using instruction fine-tuning. However, full fine-tuning of all model parameters improves performance further. Of course, full fine-tuning of large models with billions of parameters requires a go...
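As one example of a parameter-efficient alternative to full fine-tuning, the sketch below wraps a pre-trained causal LM with LoRA adapters using the Hugging Face `peft` library; the base checkpoint and LoRA hyperparameters are assumptions chosen for illustration.

```python
# Minimal sketch: LoRA fine-tuning instead of full fine-tuning.
# Base checkpoint and LoRA hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(
    r=8,                         # rank of the low-rank update matrices
    lora_alpha=16,               # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the small adapter matrices require gradients
```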
Introduction to Large Language Models - Course
For the past three months, I have been quite busy building course materials (lecture slides, graded assignments and coding assignments) for the first offering of the course Introduction to Large Language Models by Prof. Mitesh Khapra. It has been challenging work, as we have committed to offering the course in the JAN 2024 term. Every challenge is an...
Positional Encoding In Transformers
Introduction
One question that comes to mind when reading about positional encoding is whether it actually helps the model. It has been observed that adding position embeddings helps only marginally in a convolutional neural network, as a CNN (implicitly) uses relative position information. That is not the case for transformers, as the model is permutation invari...
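For reference, here is a minimal sketch of the sinusoidal positional encoding from the original Transformer paper, added to the otherwise order-agnostic token embeddings; the sequence length and model dimension below are illustrative.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)       # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                   # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Token embeddings alone carry no order information; adding the encoding
# breaks the permutation invariance of self-attention.
pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
token_embeddings = torch.randn(128, 512)
model_input = token_embeddings + pe
```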
Data Pipeline for Large Language models
Data is the fuel for any machine learning model, regardless of the type of learning algorithm (gradient-based or tree-based) being used. To train and test a model’s generalization capacity, we typically divide the available samples into three sets: training, validation and test. The typical requirement is that the samples in the test set should be d...
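A minimal sketch of such a three-way split is shown below, using scikit-learn's `train_test_split` twice; the 80/10/10 ratio is an assumption for illustration.

```python
from sklearn.model_selection import train_test_split

samples = list(range(1000))                 # placeholder for the available samples

# First carve out the test set, then split the remainder into train/validation.
train_val, test = train_test_split(samples, test_size=0.10, random_state=42)
train, val = train_test_split(train_val, test_size=1/9, random_state=42)  # 1/9 of 90% = 10% overall

print(len(train), len(val), len(test))      # 800 100 100
```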
Experimental Settings of Famous Language Models
GPT (Generative Pre-trained Transformer)
Pre-training dataset: BookCorpus (0.8 billion words)
Unsupervised objective: CLM (Autoregressive)
Tokenizer: Byte Pair Encoding (BPE)
Vocab size: 40K
Architecture: Decoder only (12 Layers)
Activation: GELU
Attention: Dense
FFN: Dense
Attention mask: Causal mask (see the sketch after this list)
Positional ...
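The causal mask listed above can be built in a few lines with `torch.tril`; the sequence length and the dummy attention scores below are illustrative assumptions.

```python
import torch

seq_len = 8                                               # illustrative sequence length
# Lower-triangular causal mask: position i may attend only to positions <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

scores = torch.randn(seq_len, seq_len)                    # raw attention scores (dummy values)
scores = scores.masked_fill(~causal_mask, float("-inf"))  # block attention to future tokens
weights = torch.softmax(scores, dim=-1)                   # each row sums to 1 over allowed positions
```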
Emergence of Large Language Models (LLMs)
Motivation
Usually, in traditional machine learning, we use model-selection approaches such as $K$-fold cross-validation and grid search to find the best model, i.e., the one that generalizes well in the real world. However, when it comes to deep learning, this is quite challenging due to compute-cost constraints. The same holds for neural language models too...
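For readers unfamiliar with those approaches, here is a minimal sketch of classical model selection with $K$-fold cross-validation and grid search; the estimator and parameter grid are illustrative, and the point is that this search is cheap for small models but prohibitive at LLM scale.

```python
# Minimal sketch: K-fold cross-validation + grid search (illustrative estimator and grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),   # K-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)                # best hyperparameters and CV accuracy
```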