Add two optimizers (or two parameter groups) when a pretrained network is used.

For the pretrained part, use a smaller learning rate. See:

https://arxiv.org/abs/2202.07012

First train only the head; when the loss saturates, unfreeze and train the whole network.
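The two-phase schedule above can be sketched in plain Python (hypothetical toy parameters, analytic gradients on a quadratic loss instead of a real network, so the example is self-contained). In a real framework you would pass two parameter groups with different learning rates to the optimizer; here the per-group learning rates and the head-first/then-all phases are written out by hand:

```python
def grad(w, target):
    # gradient of the toy loss (w - target)^2
    return 2.0 * (w - target)

# toy "pretrained" and "head" parameters and their optima (assumed values)
w_pre, w_head = 0.5, 0.0
target_pre, target_head = 0.6, 1.0

LR_PRE, LR_HEAD = 1e-3, 1e-1   # smaller learning rate for the pretrained part

# phase 1: the pretrained part stays frozen, only the head is trained
for _ in range(200):
    w_head -= LR_HEAD * grad(w_head, target_head)

head_after_phase1 = w_head  # head has converged; loss has saturated

# phase 2: unfreeze everything, each group keeps its own learning rate,
# so the pretrained weights only drift slowly away from their init
for _ in range(200):
    w_pre  -= LR_PRE  * grad(w_pre, target_pre)
    w_head -= LR_HEAD * grad(w_head, target_head)
```

In PyTorch the same idea is expressed by giving `torch.optim.SGD` (or Adam) a list of parameter dicts, e.g. one group for the backbone with a small `lr` and one for the head with a larger `lr`.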

Edited by fra-wa