Lagrange Multiplier : Intuition via Interaction
Introduction
You have probably landed here after coming across the term “constrained optimization” or “Lagrangian” and wanting to understand what a Lagrange multiplier is. In this post, I help you understand the foundation of it with interactive plots (you can find plenty of mathematical derivations on the net). Let’s get straight to the poi...
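As a quick preview of the idea behind the post, here is the standard optimality condition a Lagrange multiplier encodes (a generic sketch, not taken from the post itself; $f$ is the objective and $g$ the constraint):

```latex
% Constrained problem: optimize f(x, y) subject to g(x, y) = 0.
% At a constrained optimum the two gradients are parallel,
% and the Lagrange multiplier \lambda is the proportionality factor:
\nabla f(x, y) = \lambda \, \nabla g(x, y)
% Equivalently, such points are stationary points of the Lagrangian:
\mathcal{L}(x, y, \lambda) = f(x, y) - \lambda \, g(x, y)
```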
Maximum Likelihood Estimation
Introduction
Estimating an unknown quantity from given observations has been a fascinating area of study for many centuries. However, it remains elusive for many beginners. Let’s start with a concept that we are already familiar with. Here is a sequence $x=[1,3,5,7,9,11,\times,\cdots]$. What could be the value of the se...
Representation Learning
Motivation
Let’s start with a simple question. How do you represent a number on a real line? That’s straightforward. How do you represent a 2D point $\begin{bmatrix}x \\ y \end{bmatrix}$ in a coordinate system? Well, we use two real lines that are orthogonal to each other. Move $x$ units on the $x$ axis and $y$ units on the $y$ axis, then the locatio...
Running Jupyter Notebook from a Remote Server
Introduction
If you are training a deep learning model or fine-tuning LLMs (Large Language Models), then at some point you will need to connect to a remote machine that has the required amount of computing power (say, 80 GB or 320 GB of GPU memory). Data scientists often use Jupyter Notebooks for experimentation. Jupyter Notebook is designed t...
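The usual way to do this is SSH local port forwarding. A minimal sketch (the user name `alice`, host name `gpu-server`, and port `8888` are placeholders, not values from the post):

```shell
# On the remote machine: start Jupyter without trying to open a browser.
jupyter notebook --no-browser --port=8888

# On your local machine: forward local port 8888 to port 8888 on the remote host.
# -N: no remote command, just forwarding; -L: local-to-remote tunnel.
ssh -N -L 8888:localhost:8888 alice@gpu-server
```

Once the tunnel is up, open `http://localhost:8888` in your local browser and paste the token that Jupyter printed on the remote machine.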
Interplay of Indices in Math - Illustrated
Introduction
When we have a sequence of elements, we use an index to locate a particular element in the sequence. The elements can be arranged into 2D, 3D, or n-dimensional arrays, so it is extremely important to become comfortable with using indices to manipulate such arrays. Usually, the index starts from either 0 or ...
Transfer Learning
Imagine that we are just beginning to learn to ride a bicycle. Initially, we need to put a lot of effort into learning to balance the cycle. Then gradually
we learn to balance and eventually take full control of the ride. Suppose that we also wish to learn to ride a simple electric bike.
How difficult will that be for us? It will be a bit...
Softmax and Its Derivative
Introduction
The Softmax function is one of the most commonly used activation functions at the output layer of neural networks (be it CNNs, RNNs, or Transformers). In fact, Large Language Models (LLMs) based on the transformer architecture (like ChatGPT) use softmax in the output layer for many NLP (Natural Language Processing) tasks. Therefore, it is i...
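For a concrete reference, here is a minimal NumPy sketch of softmax and its Jacobian (the standard formulation $\partial s_i / \partial z_j = s_i(\delta_{ij} - s_j)$; not code from the post itself):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Jacobian of softmax: J[i, j] = s_i * (delta_ij - s_j)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)
```

Note that the Jacobian is symmetric and each of its rows sums to zero, which follows from the outputs summing to one.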
Masked Attentions in Transformer Architectures
Introduction
Masked attention is typically used in the decoder part of the (vanilla) transformer architecture to prevent the model from looking at future tokens. This left me with the impression that models trained with the Masked Language Modelling (MLM) objective use masked attention.
However, masked language models like BERT (Bidirectional En...
27 post articles, 4 pages.