RotoGrad

RotoGrad - Gradient Homogenization in Multitask Learning

Authors: Adrian Javaloy & Isabel Valera from Saarland University

Submitted to ICLR 2022, scores: 8, 8, 8, 8.

Notations:

$\nabla$ denotes the gradient operator.

$\underline{x}$: underlined lowercase letters denote vectors.

$X$: capital letters denote matrices.

$\langle \cdot, \cdot \rangle$ denotes the dot product.

$SO(d) = \{R \in \mathbb{R}^{d \times d} \mid R^T R = I, \det(R) = +1\}$: the special orthogonal group.

$\hat{\cdot} : \mathbb{R}^d \to \mathfrak{so}(d)$, i.e., $x \mapsto \hat{X}$: the hat operator.

$\mathfrak{so}(d) = \{\hat{X} \in \mathbb{R}^{d \times d} \mid x \in \mathbb{R}^d\}$: the Lie algebra associated with $SO(d)$; it is a set of skew-symmetric matrices.

$(\cdot)^\vee$: the vee operator, the inverse of the hat operator.
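
As a concrete illustration of the hat and vee operators for $d = 3$, here is a minimal NumPy sketch of my own (the function names `hat` and `vee` are not from the paper):

```python
import numpy as np

def hat(w):
    """Map a vector w in R^3 to its skew-symmetric matrix in so(3)."""
    wx, wy, wz = w
    return np.array([[0.0, -wz,  wy],
                     [ wz, 0.0, -wx],
                     [-wy,  wx, 0.0]])

def vee(W):
    """Inverse of the hat operator: recover w from a skew-symmetric matrix."""
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

w = np.array([0.1, -0.3, 0.2])
W = hat(w)
assert np.allclose(W, -W.T)    # skew-symmetry
assert np.allclose(vee(W), w)  # vee inverts hat
```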

1. What did the authors try to accomplish?

RotoGrad tackles the problem of negative transfer in multi-task learning (MTL). In particular, negative transfer in MTL is caused by:

  1. Varying gradient magnitudes across tasks, e.g., a task with a relatively larger gradient magnitude will dominate the overall gradient direction, resulting in poor performance on the other tasks.
  2. Varying gradient directions across tasks, e.g., gradients from different tasks may cancel each other out, resulting in slow learning.

1.1 Main Claims

On the one hand, they propose to guide the step size with the least-converged task. On the other hand, a task-specific rotation matrix $R_k: \mathbb{R}^d \to \mathbb{R}^d$ rotates the feature space to align the gradients along a common direction.

1.2 Contributions

The authors claim that previous work in the literature primarily focused on the first problem. Instead, they propose to tackle both problems. Specifically, they propose a novel way to resolve gradient-direction conflicts by rotating the feature space such that the gradients align towards a common direction. To the best of my knowledge, they are the first to apply rotation transformations from the Lie group $SO(d)$ to multi-task learning.

2. What were the key elements of the approach?

2.1 Idea

Given an input $X$, let $f_\theta(X) = z$ be the shared backbone for all tasks and $h_{\phi_k}$ the task-specific head, written $h_k$ for short. Moreover, given the linear combination of task gradients $\nabla_\theta L = \sum_k w_k \nabla_\theta L_k$, by the chain rule we have $\nabla_\theta L_k = \nabla_\theta f \, \nabla_z L_k$. Since the tasks only compete for resources on $z$, we can ignore $\nabla_\theta f$ and focus on $\nabla_z L_k$, where $L_k$ is the task-specific loss.
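
As a rough sketch of this setup (a toy hard-parameter-sharing model of my own, not the architecture from the paper), the shared feature $z$ and the per-task gradients $\nabla_z L_k$ could be obtained as follows in PyTorch:

```python
import torch
import torch.nn as nn

d, K, batch = 64, 3, 16                                      # illustrative sizes

backbone = nn.Sequential(nn.Linear(32, d), nn.ReLU())        # f_theta
heads = nn.ModuleList([nn.Linear(d, 1) for _ in range(K)])   # h_k

x = torch.randn(batch, 32)
y = [torch.randn(batch, 1) for _ in range(K)]                # per-task targets

z = backbone(x)                                              # shared feature z
losses = [nn.functional.mse_loss(heads[k](z), y[k]) for k in range(K)]

# Per-task gradients w.r.t. z -- the quantity the tasks compete over.
grads_z = [torch.autograd.grad(losses[k], z, retain_graph=True)[0]
           for k in range(K)]
```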

2.1.1 Gradient Magnitude

Goal: change the magnitude of each gradient $G_k$.

Let $g_{n,k} = \nabla_z L_k(h_k(X), Y_{n,k})$ be the gradient for the $k$-th task and the $n$-th data point. Then its batch version is $G_k^T = [g_{1,k}, \ldots, g_{b,k}]$.

The authors propose to normalize each gradient $G_k$ to a unit vector and then rescale it so that the least-converged tasks set the step size. Concretely, the common magnitude is a weighted combination of the task-wise gradient norms, where the weights sum to one. Let $a_k$ be the weight for task $k$; then

$$a_k = \frac{\|G_k\| / \|G_k^0\|}{\sum_i \|G_i\| / \|G_i^0\|},$$

where $G_k^0$ is the gradient of task $k$ at the first training step.

A side effect of this is that slowly converging tasks will force quickly converging tasks to escape saddle points.
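
A minimal sketch of this magnitude homogenization as I read the description above (assuming grads_z as computed earlier and init_norms holding the $\|G_k^0\|$ from the first training step; all names are mine):

```python
import torch

def homogenize_magnitudes(grads_z, init_norms, eps=1e-12):
    """Rescale each G_k to a common magnitude set by the least-converged tasks."""
    # Weights a_k = (||G_k|| / ||G_k^0||) / sum_i (||G_i|| / ||G_i^0||).
    norms = torch.stack([g.norm() for g in grads_z])
    ratios = norms / (torch.stack(init_norms) + eps)
    a = ratios / ratios.sum()
    # Common magnitude: weighted combination of the task-wise gradient norms.
    C = (a * norms).sum()
    return [C * g / (g.norm() + eps) for g in grads_z]

# Example usage:
# init_norms = [g.norm().detach() for g in grads_z]   # stored at step 0
# grads_z = homogenize_magnitudes(grads_z, init_norms)
```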

2.1.2 Gradient Direction

Let $R_k$ be the task-specific rotation matrix; then instead of optimizing $L_k(z)$, we optimize $L_k(r_k)$, where $r_k = R_k z$. By rotating the feature space, we can ensure that the gradient directions from different tasks do not conflict. Instead, they are aligned towards the same direction, given by $v_n$. In particular, $R_k$ has to minimize the following objective:

$$L^{rot}_k = -\sum_n \langle R_k^T g_{n,k}, v_n \rangle,$$

where $v_n = \frac{1}{K} \sum_k u_{n,k}$ is the mean of the normalized gradient directions $u_{n,k}$ over the different tasks, and $g_{n,k} = \nabla_{r_k} L_k(h_k, Y_{n,k})$ is the gradient flowing down from the head, before being rotated.
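
A sketch of this alignment objective under the definitions above (the rotations here are generic $(d, d)$ tensors; how to guarantee they stay in $SO(d)$ is the topic of the next subsection, and all variable names are mine):

```python
import torch

def rotation_losses(rotations, grads, eps=1e-12):
    """L_rot^k = -sum_n <R_k^T g_{n,k}, v_n>, one loss per task k.

    rotations: list of K tensors of shape (d, d), the R_k.
    grads:     list of K tensors of shape (batch, d) holding g_{n,k},
               detached so that this loss only trains the rotations.
    """
    # Target direction v_n: mean of the normalized per-task gradients u_{n,k}.
    unit = [g / (g.norm(dim=1, keepdim=True) + eps) for g in grads]
    v = torch.stack(unit, dim=0).mean(dim=0)          # shape (batch, d)
    losses = []
    for R, g in zip(rotations, grads):
        rotated = g @ R                               # row n is (R^T g_{n,k})^T
        losses.append(-(rotated * v).sum())           # -sum_n <R^T g_n, v_n>
    return losses
```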

Learning a task-specific rotation matrix requires solving a constrained minimization problem for $R_k: \mathbb{R}^d \to \mathbb{R}^d$. In practice, $z$ is high-dimensional, so optimizing this objective directly is infeasible, as it would require enforcing the constraint $\det R_k = 1$ (together with $R_k^T R_k = I$). Hence, to turn it into an unconstrained minimization problem, we exploit the fact that $R_k \in SO(d)$, which allows us to work with rotation matrices through the vector space $\mathfrak{so}(d)$. A vector space of matrices is much nicer to work with, since it is closed under addition and scalar multiplication: we no longer need to enforce the constraint $\det R_k = 1$ and can use regular gradient descent to optimize $R_k$.

One might think that we could directly optimize the free parameters of the rotation matrices; however, explicitly constructing such a rotation matrix in high dimensions is infeasible.

Lie Group and Lie Algebra

At a high level, the set of all $d$-dimensional rotation matrices (with determinant $+1$) constitutes a group, as it satisfies the four group axioms (identity, associativity, inverses, and closure) under matrix multiplication. In addition, since the group operations are differentiable, it is also a Lie group. In the special orthogonal group, denoted $SO(d)$, one can smoothly rotate one matrix into another, hence it is a connected Lie group. The group is "special" because its matrices have determinant $+1$.

We consider $R_k \in SO(d)$ because every Lie group has an associated Lie algebra. In particular, the Lie algebra, denoted $\mathfrak{so}(d)$, is the set of $d \times d$ skew-symmetric matrices, which forms a vector space. These skew-symmetric matrices can be mapped back to their corresponding elements of the Lie group via the exponential map $\exp: \mathfrak{so}(d) \to SO(d)$. This new space $\mathfrak{so}(d)$ allows us to work with a local approximation of the rotation matrices in $SO(d)$.

More formally, consider a family of rotation matrices $R(t) \in SO(d)$, parameterized by $t$, which continuously transforms a point from its original location ($R(0) = I$) to a different one:

$$X(t) = R(t)\, X(0).$$

Since $R(t) R(t)^T = I$ for all $t$, we have

$$\frac{d}{dt}\left(R R^T\right) = \frac{dR}{dt} R^T + R \frac{dR^T}{dt} = 0, \quad \text{since } I \text{ is constant w.r.t. } t,$$

then,

$$\frac{dR}{dt} R^T = -R \frac{dR^T}{dt} = -\left(\frac{dR}{dt} R^T\right)^T,$$

i.e., $\frac{dR}{dt} R^T$ is skew-symmetric.

Recall that skew-symmetric (anti-symmetric) matrices have the following two properties: $M = -M^T$ and, in particular, $M_{ii} = -M_{ii}$, so the diagonal entries must be zero.

Let $\hat{W} \in \mathfrak{so}(d)$ be the skew-symmetric matrix of a vector $w \in \mathbb{R}^d$. Then the above result suggests that there exists a vector $w(t) \in \mathbb{R}^d$ such that:

$$\frac{dR(t)}{dt} R^T(t) = \hat{W}(t) \quad\Longrightarrow\quad \frac{dR(t)}{dt} = \hat{W}(t)\, R(t),$$

and since $R(0) = I$, it follows that $\frac{dR(0)}{dt} = \hat{W}(0)$. This is a simple ordinary differential equation; for simplicity, assume $\hat{W}(t)$ is constant in $t$. Then the solution is the matrix exponential:

$$R(t) = e^{\hat{W} t} = \sum_{k=0}^{\infty} \frac{(\hat{W} t)^k}{k!},$$

yielding the exponential map $\exp: \mathfrak{so}(d) \to SO(d)$; $\hat{W} \mapsto e^{\hat{W}}$. In fact, every rotation matrix $R \in SO(d)$ has infinitely many exponential coordinates $w$ such that $e^{\hat{W}} = R$. The exponential coordinates provide a local parameterization of rotation matrices.
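
As a standard worked example (not from the paper), take $d = 2$ and $\hat{W} = \begin{pmatrix} 0 & -\theta \\ \theta & 0 \end{pmatrix}$, so that $\hat{W}^2 = -\theta^2 I$; summing the series recovers the familiar 2-D rotation by an angle $\theta$:

$$e^{\hat{W}} = \sum_{k=0}^{\infty} \frac{\hat{W}^k}{k!} = \begin{pmatrix} 1 - \frac{\theta^2}{2!} + \cdots & -\theta + \frac{\theta^3}{3!} - \cdots \\ \theta - \frac{\theta^3}{3!} + \cdots & 1 - \frac{\theta^2}{2!} + \cdots \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$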

As a sanity check, since $\hat{W}$ is skew-symmetric, we have $\hat{W} = -\hat{W}^T$, hence:

$$\left(e^{\hat{W} t}\right)^{-1} = e^{-\hat{W} t} = e^{\hat{W}^T t} = \left(e^{\hat{W} t}\right)^T,$$

i.e., $e^{\hat{W} t}$ is indeed orthogonal.

Furthermore, we observe that the skew-symmetric $\hat{W} \in \mathfrak{so}(d)$ gives the first-order approximation of a rotation around $t = 0$:

$$R(0) + dR = I_d + \hat{W}(0)\, dt.$$

Hence, the Lie algebra is the tangent space to the Lie group at the identity element $I_d$. We also observe that $\det(I_d + \hat{W}(0)\, dt) = 1$ to first order in $dt$, since the diagonal of $\hat{W}$ is zero.

To conclude, we showed that:

$$\text{if } \hat{W} \in \mathfrak{so}(d), \text{ then } e^{\hat{W}} \in SO(d).$$
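
In practice, this means one can keep an arbitrary square matrix $A$ as the free parameter, build the skew-symmetric $\hat{W} = A - A^T$, and map it through the matrix exponential. Below is a small sanity-check sketch of my own using torch.matrix_exp; it is one possible parameterization, not necessarily the authors' exact implementation:

```python
import torch

d = 8
A = torch.randn(d, d, requires_grad=True)   # unconstrained free parameter
W_hat = A - A.T                              # skew-symmetric, lives in so(d)
R = torch.matrix_exp(W_hat)                  # exp: so(d) -> SO(d)

# R is a valid rotation: orthogonal with determinant +1.
assert torch.allclose(R @ R.T, torch.eye(d), atol=1e-4)
assert torch.allclose(torch.det(R), torch.tensor(1.0), atol=1e-4)

# Gradients flow through the matrix exponential, so plain gradient
# descent on A keeps R inside SO(d) by construction.
loss = (R @ torch.randn(d, 1)).pow(2).sum()
loss.backward()
print(A.grad.shape)  # torch.Size([8, 8])
```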

Note: The authors also tried using a learned non-linear transformation instead of a rotation matrix. However, that choice leads to numerical issues related to the scaling of the feature $z$ in the forward pass, i.e., it affects the effective learning rate of the different heads.

2.2 Limitations

The method scales linearly with the number of tasks: the rotation matrices $R_k$ induce $K d^2$ additional parameters, and the induced time complexity is $O(K d^3)$.
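
As a back-of-the-envelope illustration of this overhead (the sizes below are my own, purely illustrative):

```python
K, d = 10, 256                  # e.g., 10 tasks and a 256-dimensional feature z
extra_params = K * d ** 2       # one d x d rotation matrix per task
extra_ops = K * d ** 3          # extra per-step cost, O(K d^3)
print(extra_params, extra_ops)  # 655360 167772160
```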

3. References

Adrian Javaloy and Isabel Valera. RotoGrad: Gradient Homogenization in Multitask Learning. ICLR 2022.

Yi Ma, Stefano Soatto, Jana Kosecka, S. Shankar Sastry. An Invitation to 3-D Vision: From Images to Geometric Models.