Gradient Equivalence in Siamese Self-Supervised Learning
Jan 3, 2022 by Matthieu Lin.
Following [1], we derive the gradient of different siamese self-supervised learning methods and show that although these methods appear to be quite different, they have similar gradient formulas. In particular, the gradient consists of three terms:
- A positive gradient, i.e., the representation of another augmented view from the same image, pulls positive samples together.
- A negative gradient, i.e., a weighted combination of the representations from different images, pushes negative samples apart.
- A balancing factor weights the two terms.
Building on this unified view, [1] empirically shows that these methods perform similarly, and that only the momentum encoder dramatically improves the final performance.
1. Introduction
1.1 What is "Self" in Self-Supervised Learning?
The word self refers to the ability to generate labels from the data by leveraging its underlying structure. For instance, the task could be predicting a missing word in a sentence. Because of the natural structure of human language, solving this task requires a high-level understanding of the sentence. In solving it, we hope that our model learns a generic representation.
1.2 What is "Siamese" in Siamese Self-Supervised Learning?
A siamese network is composed of two branches: an online branch and a target branch, where the target branch either shares its weights with the online branch or keeps an exponential moving average of them. Given two different augmented views of an input image, each branch computes a representation of one view, and the output of the target branch serves as the training target for the online branch. The loss maximizes the agreement between the two views, i.e., we want the network to be invariant to image augmentations. By optimizing this objective, we hope the online branch learns a generic representation transferable to downstream tasks. It is important to note that these methods usually rely on heavy, hand-engineered data augmentation.
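To make this concrete, here is a minimal sketch of the two branches in PyTorch (the toy encoder, dimensions, and momentum value are illustrative assumptions, not the setup of any particular paper):

```python
import copy
import torch
import torch.nn as nn

# Toy encoder standing in for e.g. a ResNet-50.
online = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
target = copy.deepcopy(online)           # target branch starts as a copy
for p in target.parameters():
    p.requires_grad = False              # no backprop through the target

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)  # two augmented views
u = online(x1)                           # online representation
with torch.no_grad():
    v = target(x2)                       # training target (stop-gradient)

# Maximize agreement between the two views.
loss = -nn.functional.cosine_similarity(u, v, dim=1).mean()
loss.backward()

# Momentum (EMA) update of the target branch; with m = 0 the target
# simply copies the online weights at every step.
m = 0.99
with torch.no_grad():
    for pt, po in zip(target.parameters(), online.parameters()):
        pt.mul_(m).add_((1 - m) * po)
```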
1.3 What is the goal of Self-Supervised Learning?
Self-supervised learning acts as a pretext task whose goal is to learn a generic representation transferable to downstream tasks. In contrast to previous methods that pre-train a network on annotated datasets, self-supervised learning methods do not require labels.
Note: Although self-supervised learning techniques do not rely on labels, they still rely on the highly curated ImageNet dataset, where images usually contain a single object at the center.
When transferring to downstream tasks, we use the online branch. For instance, those downstream tasks can be classification, object detection, or semantic segmentation. We evaluate the quality of the learned representation of the online network on annotated datasets, where we either fine-tune the network or train a classifier on top of frozen features. For fair comparisons, these methods use the ImageNet dataset and a randomly initialized ResNet50. A good representation makes the classes linearly separable.
1.4 Key Concepts
A trivial solution exists for these siamese networks, where the network outputs the same embedding for all images to minimize the loss. When the network learns this trivial solution, we call this feature collapse.
We can roughly split current self-supervised methods into three frameworks, where each proposes different ways to prevent this collapse:
- Contrastive learning methods (e.g., MoCo [3], SimCLR [2]) contrast positive samples with negative samples to prevent collapse. Specifically, the target branch outputs the representation of a positive sample and a set of negative samples, and the loss explicitly pulls the positive pair together while pushing the negative samples apart.
- Asymmetric networks (e.g., BYOL [5], SimSiam [4]) only rely on positive samples. They introduce an asymmetry between the target and the online branch to prevent collapse. In this case, the loss explicitly pulls together the positive sample pairs.
- Feature decorrelation methods (e.g., Barlow Twins [6]) prevent collapse by pushing the cross-correlation matrix of the two views close to the identity matrix. Optimizing this objective makes each feature invariant under data augmentation while reducing the redundancy across features.
2. Equivalence of Gradients
Notations: throughout, $u$ denotes the output of the online branch for one augmented view and $v$ the output of the target branch for the other view; $v^+$ is the positive sample (another view of the same image) and $v^-_k$ are the negative samples (views of different images). Unless stated otherwise, all representations are $\ell_2$-normalized and $\tau$ denotes a temperature hyper-parameter.
2.1 Contrastive Methods
2.1.1 MoCo:
Method.
The contributions of this paper are twofold. First, they introduce a queue of negative samples instead of relying on a large mini-batch. Second, the target branch uses a momentum encoder, i.e., its weights $\theta_t$ are an exponential moving average of the online weights $\theta_o$:

$$\theta_t \leftarrow m\,\theta_t + (1-m)\,\theta_o,$$

with a momentum coefficient $m$ close to 1.
Let $u$ denote the $\ell_2$-normalized output of the online branch for one augmented view, $v^+$ the output of the momentum encoder for the other view of the same image, and $\{v^-_k\}_{k=1}^{K}$ the negative samples stored in the queue.
The loss function can be thought of as a $(K+1)$-way softmax classification problem where the positive key $v^+$ is the correct class:

$$\mathcal{L} = -\log\frac{\exp(u^\top v^+/\tau)}{\exp(u^\top v^+/\tau) + \sum_{k=1}^{K}\exp(u^\top v^-_k/\tau)}.$$
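A minimal sketch of this loss (batch size, queue size, feature dimension, and temperature are placeholder values):

```python
import torch
import torch.nn.functional as F

tau = 0.2                                            # temperature (placeholder)
u = F.normalize(torch.randn(8, 128), dim=1)          # online outputs
v_pos = F.normalize(torch.randn(8, 128), dim=1)      # positive keys (momentum encoder)
queue = F.normalize(torch.randn(4096, 128), dim=1)   # K negative keys from the queue

l_pos = (u * v_pos).sum(dim=1, keepdim=True)         # (N, 1) positive logits
l_neg = u @ queue.T                                  # (N, K) negative logits
logits = torch.cat([l_pos, l_neg], dim=1) / tau

# (K+1)-way classification where index 0 (the positive key) is the label.
labels = torch.zeros(u.size(0), dtype=torch.long)
loss = F.cross_entropy(logits, labels)
```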
Gradient.
Let $s^+ = u^\top v^+/\tau$ and $s_k = u^\top v^-_k/\tau$ denote the scaled similarities.
Let $p^+ = \frac{\exp(s^+)}{\exp(s^+) + \sum_k \exp(s_k)}$ and $p_k = \frac{\exp(s_k)}{\exp(s^+) + \sum_k \exp(s_k)}$ denote the corresponding softmax probabilities.
Since we do not backpropagate through the target branch, we only receive gradient from $u$:

$$\frac{\partial\mathcal{L}}{\partial u} = \frac{\partial\mathcal{L}}{\partial s^+}\frac{\partial s^+}{\partial u} + \sum_{k=1}^{K}\frac{\partial\mathcal{L}}{\partial s_k}\frac{\partial s_k}{\partial u},$$

where, $\frac{\partial s^+}{\partial u} = v^+/\tau$ and $\frac{\partial s_k}{\partial u} = v^-_k/\tau$,
and if the sample is the positive one, $\frac{\partial\mathcal{L}}{\partial s^+} = p^+ - 1$;
else, $\frac{\partial\mathcal{L}}{\partial s_k} = p_k$.
Hence,

$$\frac{\partial\mathcal{L}}{\partial u} = \frac{1}{\tau}\Big((p^+ - 1)\,v^+ + \sum_{k=1}^{K} p_k\, v^-_k\Big).$$
From this, we observe two things: (1) the first term pulls positive samples together, (2) the second term pushes negative samples apart.
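As a sanity check, we can verify the derived formula against autograd on random vectors (a toy check, not an experiment from [1]):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tau = 0.2
u = F.normalize(torch.randn(128), dim=0).requires_grad_()
v_pos = F.normalize(torch.randn(128), dim=0)
v_neg = F.normalize(torch.randn(16, 128), dim=1)

logits = torch.cat([(u @ v_pos).view(1), v_neg @ u]) / tau
loss = -torch.log_softmax(logits, dim=0)[0]          # InfoNCE for one sample
loss.backward()

# Derived gradient: ((p+ - 1) v+ + sum_k p_k v_k-) / tau
p = torch.softmax(logits.detach(), dim=0)
grad = ((p[0] - 1) * v_pos + p[1:] @ v_neg) / tau
print(torch.allclose(u.grad, grad, atol=1e-5))       # True
```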
2.1.2 SimCLR:
Method.
This paper introduces four ingredients that substantially improve the learned representation: (1) a non-linear projection head on top of the encoder during pre-text training, (2) a set of heavy data augmentations, (3) a cosine similarity function between pairs, and (4) an extended training schedule.
Unlike MoCo, in SimCLR the target and the online branch share the same weights. Therefore, it is easier to think of the two branches' outputs as a single set of variables $\{z_i\}_{i=1}^{2N}$: each mini-batch of $N$ images yields $2N$ augmented views, and $(z_i, z_j)$ denotes the positive pair coming from the same image.
Gradient.
As in MoCo, we assume the target branch computes the positive and the negative samples. Thus, if we stop the gradient through the target branch, the two methods are equivalent; [1] verifies this empirically.
Let the loss for anchor $z_i$ with positive $z_j$ be

$$\mathcal{L}_i = -\log\frac{\exp(z_i^\top z_j/\tau)}{\sum_{k\neq i}\exp(z_i^\top z_k/\tau)},$$

where, the total loss sums over all $2N$ anchors, $\mathcal{L} = \sum_i \mathcal{L}_i$,
and $p_{i,k} = \frac{\exp(z_i^\top z_k/\tau)}{\sum_{l\neq i}\exp(z_i^\top z_l/\tau)}$ denotes the softmax probability of $z_k$ given anchor $z_i$,
and all $z_i$ are $\ell_2$-normalized.
Then the gradient w.r.t. $z_i$ is

$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{\partial\mathcal{L}_i}{\partial z_i} + \sum_{k\neq i}\frac{\partial\mathcal{L}_k}{\partial z_i},$$

where the first term is

$$\frac{\partial\mathcal{L}_i}{\partial z_i} = \frac{1}{\tau}\Big((p_{i,j}-1)\,z_j + \sum_{k\neq i,j} p_{i,k}\, z_k\Big),$$

exactly as in MoCo, and by observing that $z_i$ appears as the positive sample in $\mathcal{L}_j$, and by observing that all the other losses $\mathcal{L}_k$ ($k\neq i,j$) use $z_i$ as a negative sample, the second term is

$$\sum_{k\neq i}\frac{\partial\mathcal{L}_k}{\partial z_i} = \frac{1}{\tau}\Big((p_{j,i}-1)\,z_j + \sum_{k\neq i,j} p_{k,i}\, z_k\Big).$$

Hence, the full gradient contains two symmetric contributions: one from $z_i$ acting as the anchor and one from $z_i$ acting as a key for the other anchors.
And if we stop the gradient through the target branch, then the second term vanishes and

$$\frac{\partial\mathcal{L}}{\partial z_i} = \frac{1}{\tau}\Big((p_{i,j}-1)\,z_j + \sum_{k\neq i,j} p_{i,k}\, z_k\Big).$$
This gradient is similar to MoCo's, i.e., the first term pulls positive samples together and the second term pushes negative samples apart.
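The same kind of toy autograd check works here (pairings, sizes, and temperature are arbitrary); note how the second term is exactly the first with the roles of anchor and key transposed:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tau = 0.2
z = F.normalize(torch.randn(8, 16), dim=1).requires_grad_()
pair = torch.arange(8) ^ 1                   # (0,1), (2,3), ... are positive pairs

sim = (z @ z.T) / tau
sim = sim.masked_fill(torch.eye(8, dtype=torch.bool), float('-inf'))  # exclude k = i
loss = -torch.log_softmax(sim, dim=1)[torch.arange(8), pair].sum()
loss.backward()

# Derived gradient: anchor term (p @ z - z[pair]) plus key term (p.T @ z - z[pair]);
# with a stop-gradient on the target branch only the anchor term survives.
with torch.no_grad():
    p = torch.softmax(sim, dim=1)            # p[i, i] = 0 by construction
    grad = (p @ z + p.T @ z - 2 * z[pair]) / tau
print(torch.allclose(z.grad, grad, atol=1e-5))       # True
```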
2.2 Asymmetric Methods
2.2.1 BYOL
Practically, we can think of BYOL as SimSiam with a momentum encoder.
Method.
SimSiam introduces an asymmetric architecture with a stop gradient on the target branch to prevent collapse. In particular, it appends a predictor $h$ to the online branch, and the loss maximizes the cosine similarity between the predictor output and the (stop-gradiented) target output, symmetrized over the two views.
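A minimal sketch of this asymmetric setup (toy linear modules stand in for the actual encoder and MLP predictor):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def D(p, v):
    # Negative cosine similarity with a stop-gradient on the target.
    return -F.cosine_similarity(p, v.detach(), dim=1).mean()

f = nn.Linear(32, 16)                        # toy shared encoder
h = nn.Linear(16, 16)                        # predictor, online branch only

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)   # two augmented views
u1, u2 = f(x1), f(x2)
loss = D(h(u1), u2) / 2 + D(h(u2), u1) / 2   # symmetrized loss
loss.backward()
```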
Gradient.
Let $u_1$ be the output of the online branch, $v_2$ the output of the target branch for the other view, and $W_p$ the weight of a linear predictor; the loss is the negative cosine similarity

$$\mathcal{L} = -\frac{(W_p u_1)^\top v_2}{\|W_p u_1\|\,\|v_2\|}.$$

The authors of [7] propose an analytical solution for the predictor $W_p$: its eigenspace aligns with that of the feature correlation matrix $F = \mathbb{E}[u u^\top]$, with eigenvalues given by the square roots of those of $F$.
Motivated by this, they directly set $W_p = F^{1/2}$,
where $F$ is estimated as a moving average over training batches,
and, $F \leftarrow \rho\, F + (1-\rho)\,\frac{1}{N}\sum_{b=1}^{N} u_b u_b^\top$.
Note that the authors of [7] stop the gradient through the target branch.
Let $p_1 = W_p u_1$ denote the predictor output and assume $v_2$ is $\ell_2$-normalized;
thus,

$$\frac{\partial\mathcal{L}}{\partial u_1} = \frac{1}{\|p_1\|}\Big(-W_p v_2 + \frac{p_1^\top v_2}{\|p_1\|^2}\, W_p p_1\Big) = \frac{1}{\|p_1\|}\Big(-F^{1/2} v_2 + \frac{p_1^\top v_2}{\|p_1\|^2}\, F u_1\Big),$$

using $W_p^\top W_p = F$.
Let us split the first term as $-F^{1/2} v_2 = -v_2 - (F^{1/2} - I)\, v_2$.
Empirically, the second term can be safely removed, giving

$$\frac{\partial\mathcal{L}}{\partial u_1} \approx \frac{1}{\|p_1\|}\Big(-v_2 + \frac{p_1^\top v_2}{\|p_1\|^2}\, F u_1\Big).$$

By further observing that $F u_1 = \mathbb{E}_b[(u_b^\top u_1)\, u_b]$ is a weighted combination of the representations of all samples, the gradient is again a positive term plus a weighted negative term.
At first glance, it seems counter-intuitive that the gradient is also a combination of positive and negative samples since no negative samples appear in the loss function explicitly. However, the derived gradient formula suggests that the weights of the feature correlation matrix encode the negative samples. Specifically, [7] suggests that the eigenspace of the predictor aligns with that of the feature correlation matrix $F$, so the predictor implicitly carries the information about other samples that contrastive methods obtain from explicit negatives.
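A toy illustration of these two points, using random untrained features (sizes are arbitrary): the analytical predictor $W_p = F^{1/2}$ built from the eigendecomposition of $F$, and the fact that $F u_1$ is a weighted combination of the other samples' representations:

```python
import torch

torch.manual_seed(0)
N, d = 512, 32
u = torch.randn(N, d)                        # random stand-ins for representations

# Correlation matrix F (a moving average over batches in practice).
Fmat = u.T @ u / N

# Analytical predictor: same eigenspace as F, square-rooted eigenvalues.
lam, U = torch.linalg.eigh(Fmat)
W_p = U @ torch.diag(lam.clamp(min=0).sqrt()) @ U.T
print(torch.allclose(W_p @ W_p, Fmat, atol=1e-4))    # W_p^2 == F

# The negative gradient F u_1 is a weighted sum of all representations.
u1 = u[0]
weights = u @ u1 / N                         # w_b = u_b^T u_1 / N
print(torch.allclose(Fmat @ u1, weights @ u, atol=1e-4))  # F u_1 == sum_b w_b u_b
```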
2.3 Feature Decorrelation Methods
2.3.1 Barlow Twins
Compared to the previous methods, Barlow Twins does not require a large batch, an asymmetric architecture, gradient stopping, or a moving average on the weight update.
Method.
Inspired by Horace Barlow's efficient coding hypothesis, this paper proposes to reduce redundancy instead of maximizing similarity. In particular, we want each neuron to satisfy (1) invariance under data augmentation and (2) independence from the other neurons, i.e., reduced redundancy. Property one means that a neuron behaves the same way under different data augmentations, and property two means that all neurons should encode different information. Forcing each neuron to be different prevents the feature collapse where all neurons are the same. This is equivalent to pushing the cross-correlation matrix of the two views close to the identity matrix. In particular, the cross-correlation matrix can be written as a sum of outer products:
so, $\mathcal{C} = \frac{1}{N}\sum_{b=1}^{N} z^1_b\,(z^2_b)^\top$, i.e., $\mathcal{C}_{ij} = \frac{1}{N}\sum_b z^1_{b,i}\, z^2_{b,j}$, where $z^1_b$ and $z^2_b$ are the standardized outputs for the two views of sample $b$,
and the objective function is:

$$\mathcal{L}_{\mathcal{BT}} = \sum_i (1-\mathcal{C}_{ii})^2 + \lambda \sum_i \sum_{j\neq i} \mathcal{C}_{ij}^2.$$
The first term is the invariance term, and the second term is the redundancy reduction term. The parameter $\lambda$ balances the two terms.
Note: there might be another type of collapse where the network satisfies the two properties, i.e., invariance and redundancy reduction, but outputs representations that are constant across the batch dimension. To prevent this, we standardize the output along the batch dimension; if the representations were identical across the batch, $\mathcal{C}$ would be a zero matrix.
Intriguingly, Barlow Twins keeps improving as the projector's output dimension grows, even when it far exceeds the encoder's output dimension, while other methods saturate much earlier.
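A minimal sketch of the loss (the $\lambda$ value and the shapes are placeholders):

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    # Standardize each feature along the batch dimension; this is what
    # prevents the constant-across-batch collapse discussed above.
    N = z1.size(0)
    z1 = (z1 - z1.mean(0)) / z1.std(0)
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = z1.T @ z2 / N                                 # cross-correlation (d x d)
    invariance = (1 - c.diagonal()).pow(2).sum()      # push diagonal to 1
    redundancy = (c - torch.diag(c.diagonal())).pow(2).sum()  # push off-diagonal to 0
    return invariance + lam * redundancy

loss = barlow_twins_loss(torch.randn(256, 128), torch.randn(256, 128))
```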
Gradient.
Let $z^1_b$ and $z^2_b$ denote the standardized outputs of the two branches for the $b$-th sample, so that $\mathcal{C} = \frac{1}{N}\sum_{b} z^1_b\,(z^2_b)^\top$, and write $\mathcal{C}_{\mathrm{diag}}$ for the diagonal part of $\mathcal{C}$ and $\mathcal{C}_{\mathrm{off}} = \mathcal{C} - \mathcal{C}_{\mathrm{diag}}$ for its off-diagonal part (treating the standardization as fixed).
Then the gradient of the invariance term w.r.t. $z^1_b$ is

$$-\frac{2}{N}\,(I - \mathcal{C}_{\mathrm{diag}})\, z^2_b,$$

and the gradient of the redundancy reduction term is

$$\frac{2\lambda}{N}\, \mathcal{C}_{\mathrm{off}}\, z^2_b.$$

Hence,

$$\frac{\partial\mathcal{L}_{\mathcal{BT}}}{\partial z^1_b} = \frac{2}{N}\Big(-(I - \mathcal{C}_{\mathrm{diag}})\, z^2_b + \lambda\, \mathcal{C}_{\mathrm{off}}\, z^2_b\Big).$$

Let us expand the second term with the definition of $\mathcal{C}$: $\mathcal{C}\, z^2_b = \frac{1}{N}\sum_{b'} \big((z^2_{b'})^\top z^2_b\big)\, z^1_{b'}$, a weighted combination of the representations of all samples in the batch.
Similar to other methods, the first term is the positive sample $z^2_b$ (weighted per channel by $1-\mathcal{C}_{ii}$), which pulls the two views together, and the second term pushes $z^1_b$ away from a weighted combination of the other samples' representations, i.e., it acts as a negative gradient.
Note: the authors of [1] empirically show that removing the per-channel weights $1-\mathcal{C}_{ii}$ on the positive term barely affects performance, bringing the gradient even closer to the contrastive form.
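A toy autograd check of the derived formula (differentiating w.r.t. the already-standardized outputs, i.e., treating the standardization as fixed as in the derivation above):

```python
import torch

torch.manual_seed(0)
N, d, lam = 64, 16, 5e-3
z1 = torch.randn(N, d).requires_grad_()      # assumed already standardized
z2 = torch.randn(N, d)

c = z1.T @ z2 / N
loss = (1 - c.diagonal()).pow(2).sum() + lam * (c - torch.diag(c.diagonal())).pow(2).sum()
loss.backward()

# Derived gradient: (2/N) (-(I - C_diag) z2_b + lam * C_off z2_b)
with torch.no_grad():
    c_diag = torch.diag(c.diagonal())
    c_off = c - c_diag
    grad = (2 / N) * ((-(torch.eye(d) - c_diag) + lam * c_off) @ z2.T).T
print(torch.allclose(z1.grad, grad, atol=1e-5))      # True
```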
3. Conclusion
The authors of [1] provide some interesting insights into the success of siamese self-supervised learning. In particular, they find that:
- Increasing the depth of the projector from 1 to 3 boosts the linear evaluation accuracy significantly.
- Increasing the projector's width boosts the performance and does not seem to saturate even when the dimension increases to 16384.
- A consistent and slowly updated positive key is both sufficient and essential for self-supervised learning. Hence, contrary to MoCo's findings, the slowly updating memory bank of negative samples is unnecessary.
- The representations learned by these different methods behave similarly. SimCLR and BYOL also learn to decorrelate channels, and Barlow Twins can also discriminate between positive and negative samples. This supports the claim that these methods share similar gradient formulas.
Although siamese self-supervised methods have shown remarkable performance, these methods still rely on hand-crafted invariance (data augmentation). Therefore, those methods may greatly benefit from learned data augmentation.
4. References
[1] Exploring the Equivalence of Siamese Self-Supervised Learning Via A Unified Gradient Framework.
[2] A Simple Framework for Contrastive Learning of Visual Representations.
[3] Momentum Contrast for Unsupervised Visual Representation Learning.
[4] Exploring Simple Siamese Representation Learning.
[5] Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.
[6] Barlow Twins: Self-Supervised Learning via Redundancy Reduction.
[7] Understanding Self-Supervised Learning Dynamics without Contrastive Pairs.