If you are a researcher, or someone who regularly builds or tweaks deep learning models using PyTorch (or a high-level framework built on top of it, such as Huggingface), then it is extremely important to understand PyTorch's nn.Module. This is because your model can run without showing any symptoms even when you are training it INCORRECTLY! What do I mean by that?
For example, we often need to freeze the parameters of a layer or a few layers (say the embeddings in a transformer), add a new layer with learnable parameters, change the initialization scheme and train the model from scratch, or update parameters in some fancy way (like LoRA or QLoRA). All these tasks require us to change the code carefully (that is, you should know what you are doing)! A partial understanding of nn.Module creates the false impression (because no error is raised) that your model is working as you expected.
Beginners usually confuse the various ways of making a tensor learnable in PyTorch, such as setting requires_grad=True, wrapping the tensor in nn.Parameter(tensor), or directly using nn.Linear, nn.Conv2d...
In this post, let us use simple examples to better understand nn.Module and the nuances of using requires_grad=True, nn.Parameter(tensor), and nn.Linear, nn.Conv2d... Look into the documentation whenever you have even a slight doubt about the usage.
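Before diving in, here is a minimal sketch (the module name Toy is made up purely for illustration) of how the three options behave differently:

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        # plain tensor with requires_grad=True: it receives gradients,
        # but it is NOT registered in self.parameters(), so an optimizer
        # built from toy.parameters() will never update it
        self.t = torch.randn(1, requires_grad=True)
        # nn.Parameter registers the tensor as a learnable parameter of the module
        self.p = nn.Parameter(torch.randn(1))
        # nn.Linear is a sub-Module whose weight and bias are already nn.Parameters
        self.lin = nn.Linear(1, 1)

toy = Toy()
print([name for name, _ in toy.named_parameters()])
# ['p', 'lin.weight', 'lin.bias']  -> note that 't' is missing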
If you are new to PyTorch, then take a look at this GitHub repo to learn the PyTorch modules and come back here!
Let us define a dummy HEAD module to understand the different terminologies such as children, modules, and so on.
class HEAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = MLP(1, 3, 3, 1)              # add the MLP module here
        self.w = nn.Parameter(torch.randn(1))   # using nn.Parameter instead of a linear layer
        self.b = nn.Parameter(torch.randn(1))   # using nn.Parameter instead of a linear layer

    def forward(self, x):
        # multiply x with w directly (instead of calling self.w(x) as in the MLP module), since w is a param
        out = self.mlp(x)
        out = self.w * x + self.b
        return out
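As a quick check (assuming the MLP module from earlier in the post is available), the two nn.Parameter tensors show up directly among the module's parameters, so an optimizer built from head.parameters() will update them:

head = HEAD()
for name, p in head.named_parameters():
    print(name, p.shape)
# prints 'w' and 'b' first, followed by the MLP's own parameters
# (the exact parameter names of the MLP depend on how it is defined)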
nn.Module provides some useful methods that help us initialize parameters, add hooks, and do anything else we wish with the model.
Children
Return an iterator over immediate children modules.
Immediate children here: only the MLP sub-module (the three Linear layers live inside MLP, so they are MLP's children, not HEAD's), as shown below
Tensors added via nn.Parameter() are not counted as children (obvious)
Only the Modules are counted as children
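A quick sketch to confirm this for the nn.Parameter version of HEAD defined above:

head = HEAD()
print(list(head.named_children()))
# [('mlp', MLP(...))]  -> w and b do not appear, since they are Parameters, not Modules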
Let's replace nn.Parameter with nn.Linear and see what happens.
class HEAD(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = MLP(1, 3, 3, 1)            # add the MLP here
        self.w = nn.Linear(1, 1, bias=True)   # using a linear layer

    def forward(self, x):
        out = self.mlp(x)
        out = self.w(x)
        return out

head = HEAD()
print(head(x))
tensor([0.4823], grad_fn=<AddBackward0>)
Immediate children of the head module are MLP and Linear modules
for child_name, layer in head.named_children():   # head module as root
    print(f'child_name: {child_name},\t layer: {layer}')
    print(isinstance(layer, nn.Linear))
    print(isinstance(layer, nn.Embedding))
If we want to modify or initialize the weights, assign module.weight = nn.Parameter(tensor). Avoid module.weight = torch.tensor([...], requires_grad=True).
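A small sketch of the recommended pattern (the layer and values here are made up for illustration):

lin = nn.Linear(3, 2)
# correct: wrap the new tensor in nn.Parameter so it stays registered as a parameter
lin.weight = nn.Parameter(torch.zeros(2, 3))   # weight shape is (out_features, in_features)
# incorrect: nn.Module refuses to overwrite an existing Parameter with a plain tensor
# lin.weight = torch.zeros(2, 3, requires_grad=True)   # raises a TypeError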
Initialize the weights randomly
The weights of Linear layers are by default initialized from a (Kaiming) uniform distribution $U(-\sqrt{k},\sqrt{k})$, where $k=\frac{1}{\text{in\_features}}$ (doc).
self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
if bias:
    self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
else:
    self.register_parameter('bias', None)
self.reset_parameters()

def reset_parameters(self) -> None:
    # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
    # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
    # https://github.com/pytorch/pytorch/issues/57109
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
        init.uniform_(self.bias, -bound, bound)
Finally, nn.functional.linear(x, self.weight, self.bias) is called in the forward method.
Therefore, assigning to w1.weight replaces self.weight before the forward method is called.
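A small sketch of what this means in practice (w1 here is just an illustrative nn.Linear layer):

import torch.nn.functional as F

w1 = nn.Linear(1, 1, bias=True)
x = torch.randn(1)
# nn.Linear's forward boils down to F.linear(x, self.weight, self.bias)
print(torch.allclose(w1(x), F.linear(x, w1.weight, w1.bias)))   # True
# replacing the weight before calling forward changes the output accordingly
w1.weight = nn.Parameter(torch.ones(1, 1))
print(w1(x))   # now equals x + w1.bias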
What if we want to initialize the parameters of all linear layers of the model using normal distribution?
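One common pattern (a sketch, assuming the layers of interest are nn.Linear) is to write an init function and walk the module tree with model.apply:

def init_normal(module):
    # apply() calls this on every sub-module; only touch Linear layers
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

head = HEAD()
head.apply(init_normal)   # recursively visits every sub-module, including those inside MLP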
Hooks are mostly used for debugging purposes (such as tracking input, gradient, output…)
forward_hook
Modifies the output (if needed)
Look at the signature of various hooks
We can register more than one hook for the same Module
Signature: hook(module, input, output) -> None or modified output (doc)
out = []

def forward_hook_1(module, input, output):
    # we can do whatever we want here, like storing, printing, tracking...
    print(module, input, output)

def forward_hook_2(module, input, output):
    out.append(output)
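A sketch of how these hooks might be attached to the HEAD module defined above (assuming its MLP accepts the same 1-dimensional input used earlier):

head = HEAD()
# more than one hook can be registered on the same module; each call returns a handle
h1 = head.mlp.register_forward_hook(forward_hook_1)
h2 = head.mlp.register_forward_hook(forward_hook_2)

y = head(torch.randn(1))   # both hooks fire when self.mlp(x) runs inside forward
print(out)                 # forward_hook_2 stored the MLP's output here

# remove the hooks via their handles when they are no longer needed
h1.remove()
h2.remove()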