RotoGrad - Gradient Homogenization in Multitask Learning
Authors: Adrian Javaloy & Isabel Valera from Saarland University
Submitted to ICLR 2022; scores: 8, 8, 8, 8.
Notations:
$\nabla$ denotes the gradient operator.
Underlined letters, e.g., $\underline{v}$, denote vectors.
Capital letters, e.g., $R$, denote matrices.
$\langle \cdot, \cdot \rangle$ denotes the dot product operator.
$SO(d)$ denotes the special orthogonal group, i.e., $SO(d) = \{ R \in \mathbb{R}^{d \times d} : R^\top R = I, \ \det(R) = 1 \}$.
$\hat{\cdot}$ denotes the hat operator, which maps a vector to its associated skew-symmetric matrix, i.e., $d \mapsto \hat{d}$.
$\mathfrak{so}(d)$ denotes the Lie algebra associated with $SO(d)$, which constitutes the set of skew-symmetric matrices.
$\vee$ denotes the vee operator, the inverse of the hat operator, i.e., $\hat{d}^{\vee} = d$.
1. What did the authors try to accomplish?
RotoGrad tackles the problem of negative transfer in multi-task learning (MTL). In particular, negative transfer in MTL is caused by:
Varying gradient magnitudes across tasks, e.g., a task with a relatively larger gradient magnitude dominates the overall gradient direction, resulting in poor performance on the other tasks.
Varying gradient directions across tasks, e.g., gradients from different tasks may cancel each other out, resulting in slow learning.
1.1 Main Claims
On the one hand, they propose to homogenize gradient magnitudes so that the least-converged tasks guide the step size. On the other hand, a task-specific rotation matrix rotates the shared feature space so that the task gradients align with a common direction.
1.2 Contributions
The authors claim that previous work in the literature primarily focused on the first problem. Instead, they propose to tackle both problems. Specifically, they propose a novel way to resolve gradient direction conflicts by rotating the feature space such that the gradients align towards a common direction. To the best of my knowledge, they are the first to apply rotation transformations, parameterized via a Lie group, to multi-task learning.
2. What were the key elements of the approach?
2.1 Idea
Given an input $x$, let $z = f(x; \theta)$ be the shared backbone for all tasks and $h_k(\cdot; \phi_k)$ the task-specific head for task $k$, with task-specific loss $L_k$. Moreover, for the combined objective $L = \sum_k L_k$, the total gradient is the linear combination of task gradients $\nabla_\theta L = \sum_k \nabla_\theta L_k$, and by the chain rule $\nabla_\theta L_k = \nabla_\theta z \cdot \nabla_z L_k$. Since the tasks only compete for resources on the shared parameters $\theta$, i.e., through the shared feature $z$, we can ignore the common factor $\nabla_\theta z$ and focus on $\nabla_z L_k$.
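To make this setup concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) with a toy backbone, two regression heads, and made-up dimensions, computing the per-task gradients $\nabla_z L_k$ that the rest of the method manipulates:

```python
import torch
import torch.nn as nn

feature_dim = 64  # dimension of the shared representation z (made up for the example)

# Shared backbone f(.; theta) and task-specific heads h_k(.; phi_k), K = 2 tasks
backbone = nn.Sequential(nn.Linear(10, 128), nn.ReLU(), nn.Linear(128, feature_dim))
heads = nn.ModuleList([nn.Linear(feature_dim, 1) for _ in range(2)])
loss_fns = [nn.MSELoss(), nn.MSELoss()]

x = torch.randn(32, 10)
targets = [torch.randn(32, 1), torch.randn(32, 1)]

z = backbone(x)  # shared feature: the tasks only compete through z (and theta)

# Per-task gradients w.r.t. z, i.e., grad_z L_k: the quantity RotoGrad manipulates
task_grads = []
for head, loss_fn, y in zip(heads, loss_fns, targets):
    L_k = loss_fn(head(z), y)
    g_k = torch.autograd.grad(L_k, z, retain_graph=True)[0]  # shape: (batch, feature_dim)
    task_grads.append(g_k)
```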
2.1.1 Gradient Magnitude
Goal: homogenize the magnitude of each task gradient $\nabla_z L_k$.
Let $g_{k,n} = \nabla_z L_k(x_n)$ be the gradient for the $k$-th task and the $n$-th data point. Then, its batch version is $g_k = \frac{1}{N} \sum_{n=1}^{N} g_{k,n}$.
The authors propose to normalize each gradient to a unit vector and then rescale it such that the tasks that have converged the least set the step size. Concretely, all tasks share a common gradient magnitude, a weighted combination of the task-wise gradient norms whose weights sum up to one. Let $\omega_k$ be this weight for each task (larger for less-converged tasks, e.g., as measured by the ratio of the current to the initial gradient norm); then
$$\hat{g}_k = \frac{g_k}{\lVert g_k \rVert} \sum_{j} \omega_j \lVert g_j \rVert, \qquad \sum_{j} \omega_j = 1.$$
A side effect of this is that slowly converging tasks will force quickly converging tasks to escape from saddle points.
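A small sketch of this magnitude homogenization, as I read it (not the reference implementation); in particular, measuring convergence by the ratio of the current to the initial gradient norm is my assumption for the choice of weights:

```python
import torch

def homogenize_magnitudes(task_grads, initial_norms, eps=1e-12):
    """Rescale each task gradient (w.r.t. z) to a common, weighted norm.

    task_grads:    list of tensors grad_z L_k, one per task
    initial_norms: gradient norms recorded at the first step, used here as a
                   proxy for how much each task has converged (assumption)
    """
    norms = [g.norm() for g in task_grads]
    # Weight each task by its relative lack of convergence; weights sum to 1.
    ratios = torch.stack([n / (n0 + eps) for n, n0 in zip(norms, initial_norms)])
    weights = ratios / ratios.sum()
    # Common magnitude: convex combination of the per-task gradient norms.
    common_norm = sum(w * n for w, n in zip(weights, norms))
    # Unit-normalize each gradient, then rescale it to the common magnitude.
    return [g / (n + eps) * common_norm for g, n in zip(task_grads, norms)]
```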
2.1.2 Gradient Direction
Let $R_k \in SO(d)$ be the task-specific rotation matrix; then, instead of optimizing $L_k(h_k(z))$, we optimize $L_k(h_k(r_k))$, where $r_k = R_k z$. By rotating the feature space we can ensure that the gradient directions from different tasks do not conflict; instead, they align with a common direction given by $v$. In particular, each $R_k$ has to minimize the following objective:
$$R_k^{*} = \operatorname*{arg\,min}_{R \in SO(d)} \, - \sum_{n} \big\langle R^\top \nabla_{r_k} L_k(x_n), \, v_n \big\rangle,$$
where $v_n = \sum_{k'} \nabla_z L_{k'}(x_n) / \lVert \nabla_z L_{k'}(x_n) \rVert$ is a sum of the normalized gradient directions from the different tasks, and $\nabla_{r_k} L_k$ is the gradient flowing down from the head, before being rotated back to the $z$-space.
Learning a task-specific rotation matrix requires considering a constrained minimization problem over $R_k \in SO(d)$. In practice, $z$ is high-dimensional; hence optimizing this objective directly is unfeasible, as it would require enforcing the constraints of $SO(d)$, e.g., computing the determinant of $R_k$, since we want $\det(R_k) = 1$ (along with $R_k^\top R_k = I$). Hence, to turn it into an unconstrained minimization problem, we consider the associated Lie algebra $\mathfrak{so}(d)$, which allows us to work with rotation matrices through a vector space of (skew-symmetric) matrices. A vector space of matrices is much nicer to work with, since it is closed under summation and scalar multiplication: we do not need to maintain the constraint explicitly and can use regular gradient descent to optimize $R_k$.
One might think that we could directly optimize the free parameters of the rotation matrices; however, directly constructing such a rotation matrix in high dimensions is unfeasible.
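Anticipating the exponential-map parameterization derived below, here is a minimal PyTorch sketch of how such a rotation layer and its alignment objective could look (my own illustration, not the authors' code; `TaskRotation`, `rotation_alignment_loss`, and the row-vector convention are assumptions):

```python
import torch
import torch.nn as nn

class TaskRotation(nn.Module):
    """R_k = exp(A - A^T) in SO(d), parameterized by an unconstrained square matrix A."""

    def __init__(self, dim):
        super().__init__()
        self.A = nn.Parameter(torch.zeros(dim, dim))  # exp(0) = I, so we start at the identity

    def rotation(self):
        skew = self.A - self.A.T                 # skew-symmetric, i.e., an element of so(d)
        return torch.linalg.matrix_exp(skew)     # exponential map so(d) -> SO(d)

    def forward(self, z):
        # Rows of z are samples, so z @ R^T implements r_k = R_k z per sample.
        return z @ self.rotation().T

def rotation_alignment_loss(rot, grad_rk, v):
    """Train R_k so that the gradient mapped back to z-space aligns with the target v.

    grad_rk: gradient of L_k w.r.t. the rotated feature r_k (treated as a constant here)
    v:       common target direction, e.g., the sum of normalized task gradients (constant)
    """
    grad_z = grad_rk.detach() @ rot.rotation()   # pull the gradient back through R_k
    return -(grad_z * v.detach()).sum()          # maximize batch-wise alignment with v
```

Note that only the rotation parameters receive gradients from this alignment loss; the backbone and the heads keep minimizing their own task losses.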
Lie Group and Lie Algebra
At a high level, the set of all $d$-dimensional rotation matrices (with determinant 1) constitutes a group, as it respects the four group axioms (i.e., identity, associativity, inverse, and closure) under matrix multiplication. In addition, since these rotation matrices form a smooth (differentiable) manifold, it is also a Lie group. For the special orthogonal group, denoted $SO(d)$, one can smoothly rotate any matrix into another; hence it is a connected Lie group. This group is "special" because its matrices have determinant 1.
We consider $SO(d)$ because every Lie group has an associated Lie algebra. In particular, the Lie algebra, denoted $\mathfrak{so}(d)$, is the set of skew-symmetric matrices, which forms a vector space. These skew-symmetric matrices can be mapped back to their corresponding element of the Lie group via the exponential map $\exp: \mathfrak{so}(d) \to SO(d)$. This new space allows us to work with a local parameterization of the rotation matrices $R_k$.
More formally, consider a family of rotation matrices $R(t)$, with $R(0) = I$, which continuously transforms a point from its original location ($x(0)$) to a different one:
$$x(t) = R(t)\, x(0).$$
Since $R(t) \in SO(d)$, i.e., $R(t) R(t)^\top = I$ for all $t$, differentiating with respect to $t$ we have
$$\dot{R}(t) R(t)^\top + R(t) \dot{R}(t)^\top = 0,$$
then,
$$\dot{R}(t) R(t)^\top = -\big( \dot{R}(t) R(t)^\top \big)^\top,$$
i.e., $\dot{R}(t) R(t)^\top$ is skew-symmetric.
We know that skew-symmetric (or anti-symmetric) matrices $A$ have the two following properties: $A^\top = -A$ and $a_{ij} = -a_{ji}$. The diagonal must be 0, as $a_{ii} = -a_{ii}$ implies $a_{ii} = 0$.
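For concreteness, in three dimensions the hat operator from the notation section maps a vector $d = (d_1, d_2, d_3)$ to the familiar cross-product matrix:
$$\hat{d} = \begin{pmatrix} 0 & -d_3 & d_2 \\ d_3 & 0 & -d_1 \\ -d_2 & d_1 & 0 \end{pmatrix}, \qquad \hat{d}\, x = d \times x,$$
which is skew-symmetric with a zero diagonal; the vee operator recovers $d = \hat{d}^{\vee}$.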
Let $\hat{d}$ be the skew-symmetric matrix of a vector $d$ (via the hat operator). Then the above result suggests that there exists a vector $d(t)$ such that:
$$\dot{R}(t) R(t)^\top = \hat{d}(t),$$
and since $R(t)^\top = R(t)^{-1}$, it follows that $\dot{R}(t) = \hat{d}(t) R(t)$. This is a simple ordinary differential equation; for simplicity, assume $\hat{d}$ is constant in $t$. Then the solution is the matrix exponential:
$$R(t) = e^{\hat{d} t} R(0) = e^{\hat{d} t},$$
yielding the exponential map $\exp: \mathfrak{so}(d) \to SO(d)$; $\hat{d} \mapsto e^{\hat{d}}$. In fact, every rotation matrix $R$ has (infinitely many) exponential coordinates $d$ such that $R = e^{\hat{d}}$. The exponential coordinates provide a local parameterization for rotation matrices.
As a sanity check, since $\hat{d}$ is skew-symmetric, it holds that $\big(e^{\hat{d}}\big)^\top = e^{\hat{d}^\top} = e^{-\hat{d}} = \big(e^{\hat{d}}\big)^{-1}$, hence:
$$e^{\hat{d}} \big(e^{\hat{d}}\big)^\top = I,$$
so $e^{\hat{d}}$ is indeed orthogonal.
Furthermore, we observe that the skew-symmetric matrix gives the first-order approximation of a rotation at $t = 0$:
$$R(t) \approx I + \hat{d}\, t.$$
Hence, the Lie algebra $\mathfrak{so}(d)$ is the tangent space to the Lie group $SO(d)$ at the identity element $I$. We also observe that $\dot{R}(0) = \hat{d}$.
To conclude, we showed that any rotation $R \in SO(d)$ can be written as $R = e^{\hat{d}}$ for some skew-symmetric matrix $\hat{d} \in \mathfrak{so}(d)$; hence, we can optimize the unconstrained coordinates $d$ with regular gradient descent and recover a valid rotation via the exponential map.
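As a quick numerical check of the above (a standalone snippet, not tied to the paper's code), the exponential of a random skew-symmetric matrix is indeed orthogonal with determinant 1, and $I + t\,\hat{d}$ approximates it well for small $t$:

```python
import torch

torch.manual_seed(0)
d = 5
A = torch.randn(d, d)
skew = A - A.T                                  # element of so(d): skew-symmetric

R = torch.linalg.matrix_exp(skew)               # exponential map so(d) -> SO(d)
print(torch.allclose(R @ R.T, torch.eye(d), atol=1e-4))  # True: R is orthogonal
print(torch.det(R))                             # ~1.0: R is a proper rotation

# First-order approximation around the identity: exp(t * skew) ≈ I + t * skew for small t
t = 1e-3
R_small = torch.linalg.matrix_exp(t * skew)
print((R_small - (torch.eye(d) + t * skew)).abs().max())  # on the order of t^2, i.e., tiny
```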
Note: The authors also tried using a learned non-linear transformation instead of a rotation matrix. However, such a choice results in numerical issues related to the scaling of the features in the forward pass, i.e., it affects the effective learning rate of the different heads.
2.2 Limitations
The method scales linearly with the number of tasks $K$: it induces $K$ additional $d \times d$ rotation matrices, i.e., $O(K d^2)$ extra parameters. The induced time complexity is dominated by the $K$ matrix exponentials, i.e., $O(K d^3)$ per step.
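As a rough, illustrative calculation (the numbers are my own assumptions, not from the paper): with $K = 10$ tasks and a feature dimension of $d = 512$, the per-task $d \times d$ rotations add on the order of $K d^2 \approx 2.6$M extra parameters, and each training step performs $K$ matrix exponentials at $O(d^3)$ cost each.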