My current problem: my model has two parts, bert + gat. The bert part converges within 3 to 5 epochs, while the gat part needs dozens of epochs.
What I want is that after 5 epochs of training, only the gat keeps being trained and bert is no longer updated.
Broadly there are two approaches: either set requires_grad=False on the parameters you do not want to train, or put only the parameters you do want to train into the optimizer.
Approach one: set requires_grad=False
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

def print_gram(model):
    # inspect fc1: current values, gradient norm and requires_grad flag
    for name, param in model.named_parameters():
        if 'fc1' in name:
            print(name, param.data, param.grad.norm(), param.requires_grad)

model = Net()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.999))
criterion = nn.MSELoss()
data = torch.tensor([[1.0], [3.0], [5.0], [7.0]])
label = torch.tensor([[1.0], [9.0], [25.0], [49.0]])

for i in range(20):
    a = model(data)
    loss = criterion(a, label)
    print(i, loss)
    if i < 10:
        # first 10 steps: train everything
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # afterwards: try to freeze fc1 by turning off requires_grad.
        # set_to_none=False keeps fc1's grad as a zero tensor; on newer
        # PyTorch the default (None) would make step() skip fc1 entirely.
        optimizer.zero_grad(set_to_none=False)
        model.fc1.requires_grad_(False)
        loss.backward()
        optimizer.step()
        print_gram(model)
You will find that fc1's gradient is indeed 0 and requires_grad == False, but its weights still changed!!!
(This mirrors the related write-up on how using detach in PyTorch does not by itself stop parameter updates; see the references.)
To recap the normal update cycle (the sketch right after this list walks through all four stages):
(1) Before backward, grad = None.
(2) After backward, grad holds concrete values.
(3) After step, the weights are updated.
(4) After zero_grad, grad = 0.
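A minimal sketch that walks a single nn.Linear through the four stages (note: recent PyTorch defaults zero_grad() to set_to_none=True, so set_to_none=False is passed explicitly here to reproduce the grad = 0 state of stage (4)):

import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)

print(layer.weight.grad)                    # (1) before backward: None

loss = layer(torch.tensor([[1.0]])).sum()
loss.backward()
print(layer.weight.grad)                    # (2) after backward: a concrete value

w_before = layer.weight.detach().clone()
opt.step()
print(torch.equal(w_before, layer.weight))  # (3) after step: False, the weight changed

opt.zero_grad(set_to_none=False)
print(layer.weight.grad)                    # (4) after zero_grad: tensor([[0.]])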
Besides setting parameters one by one, you can also switch off a whole module at once via nn.Module.requires_grad_(False).
So even though the gradient is 0, the weight still gets updated, because the optimizer carries historical information (Adam's momentum buffers)!!
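To see this "historical information" concretely, a small sketch that pokes at Adam's per-parameter state: after one real step the exp_avg buffer is non-zero, and a later step fed an all-zero gradient still moves the parameter.

import torch
import torch.nn as nn

p = nn.Parameter(torch.zeros(3))
opt = torch.optim.Adam([p], lr=0.1)

# one real step so Adam accumulates momentum for p
p.grad = torch.ones(3)
opt.step()
print(opt.state[p]['exp_avg'])            # non-zero: this is the "history"

# feed a zero gradient: the parameter still moves because of that history
p_before = p.detach().clone()
p.grad = torch.zeros(3)
opt.step()
print(torch.equal(p_before, p.detach()))  # False: updated despite grad == 0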
Thanks to this problem, I finally understand why PyTorch does not zero the gradients automatically at every step and instead makes you call zero_grad yourself.
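For completeness, a sketch of how approach one can be made to actually freeze fc1, continuing from the snippet above: turn off requires_grad and drop fc1's gradients to None, since the stock optimizers skip parameters whose .grad is None and so never apply their stale momentum. This assumes the set_to_none argument of zero_grad (available since PyTorch 1.7, the default behaviour since 2.0).

model.fc1.requires_grad_(False)
for p in model.fc1.parameters():
    p.grad = None          # drop gradients left over from the earlier steps

for _ in range(10):
    optimizer.zero_grad(set_to_none=True)   # frozen params keep grad = None
    loss = criterion(model(data), label)
    loss.backward()        # fc1 receives no new gradients
    optimizer.step()       # params with grad = None are skipped, fc1 stays put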
Approach two: only put the parameters to be updated into the optimizer
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.sigmoid(self.fc2(x))
        return x

model = Net()

def get_parameters():
    # collect every parameter except fc1's
    params = []
    for name, param in model.named_parameters():
        if 'fc1' not in name:
            params.append(param)
    print(len(params))
    return params

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.5, 0.999))
optimizer2 = torch.optim.Adam(get_parameters(), lr=0.001, betas=(0.5, 0.999))

def print_gram(model):
    # inspect fc1: current values, gradient norm and requires_grad flag
    for name, param in model.named_parameters():
        if 'fc1' in name:
            print(name, param.data, param.grad.norm(), param.requires_grad)

criterion = nn.MSELoss()
data = torch.tensor([[1.0], [3.0], [5.0], [7.0]])
label = torch.tensor([[1.0], [9.0], [25.0], [49.0]])

for i in range(20):
    a = model(data)
    loss = criterion(a, label)
    print(i, loss)
    if i < 10:
        # first 10 steps: optimizer that holds all parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    else:
        # afterwards: optimizer2, which never saw fc1's parameters
        optimizer2.zero_grad()
        loss.backward()
        optimizer2.step()
        print_gram(model)
Although fc1 still receives gradients, its weights are not updated, which feels pretty good.
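As a quick sanity check, continuing from the snippet above, one can snapshot fc1 at the moment optimizer2 takes over and confirm it stays bit-for-bit identical while fc2 keeps training:

snapshot = {k: v.detach().clone() for k, v in model.fc1.state_dict().items()}
for _ in range(10):
    optimizer2.zero_grad()
    loss = criterion(model(data), label)
    loss.backward()
    optimizer2.step()
for k, v in model.fc1.state_dict().items():
    print(k, torch.equal(snapshot[k], v))   # expected: True for both weight and bias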
Summary:
requires_grad can be set on individual parameters or on whole modules. detach is the simplest to use: to freeze bert, for example, x = pooled_output.detach() is enough, just add a detach to bert's output (a toy sketch follows below). But these tricks only freeze the upstream part of the network (detach in particular cuts off everything before it, so it cannot freeze just a module in the middle or at the end), and the freeze is only clean if no backward pass and optimizer step have been run beforehand, because otherwise the optimizer already holds historical state.
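A toy sketch of the detach idea (the name pooled_output mirrors the sentence above; the two Linear layers are only stand-ins, not a real bert or gat): detaching the encoder output cuts the graph, so everything upstream of the detach receives no gradient while the downstream head still trains.

import torch
import torch.nn as nn

encoder = nn.Linear(8, 8)    # stand-in for bert
head = nn.Linear(8, 1)       # stand-in for the downstream model

x = torch.randn(4, 8)
pooled_output = encoder(x)
out = head(pooled_output.detach())   # the graph is cut here
out.sum().backward()

print(encoder.weight.grad)           # None: everything upstream is frozen
print(head.weight.grad.norm())       # non-zero: downstream still trains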
With the optimizer-based method the full backward pass still runs, so the frozen parameters keep their values but no compute is saved. A sketch of this approach applied to the bert+gat setup follows below. See also: how to freeze the pretrained BERT parameters and train only the downstream model? - ZJU某小白's answer - Zhihu https://www.zhihu.com/question/317708730/answer/634068499
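Finally, a sketch of how the optimizer-based approach maps onto the bert+gat setup from the top of the post; the submodule names bert and gat and the two Linear stand-ins are assumptions for illustration, with a real model only the name filter needs adjusting.

import torch
import torch.nn as nn

class BertGat(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = nn.Linear(16, 8)   # stand-in for the real BERT encoder
        self.gat = nn.Linear(8, 2)     # stand-in for the real GAT

    def forward(self, x):
        return self.gat(self.bert(x))

net = BertGat()
gat_params = [p for n, p in net.named_parameters() if not n.startswith('bert.')]

optimizer_all = torch.optim.Adam(net.parameters(), lr=1e-3)   # epochs 0-4: train everything
optimizer_gat = torch.optim.Adam(gat_params, lr=1e-3)         # epoch 5 onward: gat only

for epoch in range(10):
    opt = optimizer_all if epoch < 5 else optimizer_gat
    x = torch.randn(4, 16)
    opt.zero_grad()
    loss = net(x).sum()
    loss.backward()
    opt.step()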
References:
https://www.cxyzjd.com/article/Answer3664/108493753
https://blog.csdn.net/jinxin521125/article/details/83621268
https://pytorch.org/docs/stable/notes/autograd.html