Unified Framework for Self-Supervised Learning

Paper Summary: “data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language”

Dickson Wu
6 min read · Mar 29, 2022

Paper by: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Abstract:

Self-supervised learning is conceptually identical across mediums (aka cover something up and get the model to predict what we just covered up), but the actual implementations differ because each was designed with one medium in mind.

This paper presents a framework to do self-supervised learning across all mediums! The idea is to turn the input data into latent representations and do the self-supervised prediction on those. Instead of predicting medium-specific things (words, images, sounds), the model has to predict the latent representations.

Experiments show that this is on par with SOTA or sets a new SOTA!

Introduction:

Self-supervised learning has led to advances in NLP, speech processing, and computer vision, but each method is designed specifically for its medium. Is that optimal? Leading theories in biology suggest that humans use similar learning processes for vision as they do for language, and general NN architectures tend to perform as well as medium-specific ones.

Thus this paper presents data2vec, a framework for self-supervised learning across all those mediums. It unifies the learning algorithm, but models still learn individual tasks.

This work builds off masked prediction + latent target representations (but uses multiple network layers as targets & generalizes the approach to different mediums).

Specifically, one Transformer (the teacher) encodes the full input data into latent representations. Then we mask the same input data, feed it into the student network, and get the student to predict the teacher's latent representations.

This method can be seen as a simplification of medium-specific designs, with the teacher's representations acting as a normalized version of the input that makes for better targets. Because the targets are produced through self-attention, they are continuous and contextualized.

Experiments show that it sets a new SOTA with ViT-B for vision and improves on prior work for speech and natural language.

Related work:

Self-supervised learning in computer vision: Lots of work has been done in this area, such as contrasting representations of augmented views of the same image against other images, or online clustering. Some methods also use NN representations as targets, but this paper uses multiple layers instead of just the top one and combines that with a masked prediction task. Other methods use masked prediction too, but this method predicts latent representations instead.

Self-supervised learning in NLP: NLP has been super successful because text breaks apart easily into individual tokens which we can predict. But latent representations have advantages over that: targets aren't drawn from a predefined vocabulary, so there's no limit on their number, and they're contextualized, so they take the surrounding information into account.

Self-supervised learning in speech: As in NLP, instead of predicting discrete units, we predict continuous latent representations.

Multimodal pre-training: There has been work that explores training a single model on multiple mediums at once, but this paper doesn't do multi-modal training.

Method:

We have 2 models: a student and a teacher. The teacher model is given the full view of the input and encodes it. The student gets a masked version and is tasked with predicting the embeddings of the teacher model.

The architecture is a standard Transformer, but the input data is encoded differently depending on the medium. For CV we use the ViT strategy of taking 16x16 patches, encoding them, and passing that into the model. Speech = a multi-layer 1-D CNN. Text = tokenize the words and look up their embeddings.
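As a concrete example, here's a minimal sketch of the ViT-style image path, where patch embedding is commonly implemented as a strided convolution (the layer sizes here are illustrative, not the exact ones from the paper):

```python
import torch
import torch.nn as nn

# ViT-style patch embedding: each non-overlapping 16x16 patch becomes one token.
# (Speech would instead go through a multi-layer 1-D CNN over the waveform,
# and text through an embedding lookup over sub-word tokens.)
patch_embed = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

images = torch.randn(8, 3, 224, 224)        # a batch of RGB images
tokens = patch_embed(images)                # [8, 768, 14, 14]
tokens = tokens.flatten(2).transpose(1, 2)  # [8, 196, 768] -> 196 patch tokens per image
```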

Masking is done in the usual way for each medium. CV = we mask blocks of patches, Speech = mask spans of time-steps, NLP = mask tokens.
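To make the speech case concrete, here's a toy span-masking function (the start probability and span length are illustrative defaults in the spirit of wav2vec-style masking, not pulled from the paper's config). Block masking for images and token masking for text follow the same pattern of picking positions and blanking them out:

```python
import torch

def span_mask(batch: int, timesteps: int, p_start: float = 0.065, span: int = 10):
    """Return a [batch, timesteps] bool mask where each sampled start position
    masks the following `span` time-steps (spans may overlap)."""
    starts = torch.rand(batch, timesteps) < p_start
    mask = torch.zeros(batch, timesteps, dtype=torch.bool)
    for offset in range(span):
        mask[:, offset:] |= starts[:, : timesteps - offset]
    return mask

mask = span_mask(4, 200)  # True wherever the student's input will be masked
```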

The student model has to learn the representations of the teacher model. These representations are contextualized, since the target embeddings are created using self-attention. This is different from other methods, whose targets lack that contextual information.

The teacher is derived from the student: it has the same architecture, but its weights are an exponential moving average (EMA) of the student's weights. The EMA decay is scheduled so that the teacher is updated more aggressively at the start of training and less towards the end.
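Here's a minimal sketch of that EMA update (function and parameter names are mine, not the authors' code); the decay τ is increased over training, which is what makes the teacher change a lot early on and very little later:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, tau: float):
    # teacher <- tau * teacher + (1 - tau) * student, parameter by parameter.
    # tau starts smaller and is increased towards ~1 over the course of training.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)
```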

The targets are built from the outputs of the top K blocks of the teacher network, at the time-steps that were masked in the student's input. Each block's output is normalized (to prevent collapse into a constant representation), and the K outputs are averaged (which performs about as well as predicting each layer separately, but is more efficient).
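A sketch of building those targets, assuming we already have the per-block outputs of the teacher on the unmasked input. Layer normalization is used here as a stand-in; the paper picks its normalization per medium:

```python
import torch
import torch.nn.functional as F

def build_targets(teacher_block_outputs, mask, k: int = 8):
    """teacher_block_outputs: list of [B, T, D] tensors, one per Transformer block.
    mask: [B, T] bool tensor, True where the student's input was masked.
    Returns regression targets for the masked time-steps: [N_masked, D]."""
    top_k = teacher_block_outputs[-k:]
    # Normalize each block's output, then average across the K blocks.
    normed = [F.layer_norm(h, h.shape[-1:]) for h in top_k]
    targets = torch.stack(normed, dim=0).mean(dim=0)  # [B, T, D]
    # Only the masked time-steps are used as training targets.
    return targets[mask]                               # [N_masked, D]
```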

The loss is:
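With y_t the (teacher-derived) target and f_t(x) the student's prediction at time-step t, this is the standard Smooth L1 form:

```latex
\mathcal{L}(y_t, f_t(x)) =
\begin{cases}
  \tfrac{1}{2}\,(y_t - f_t(x))^2 / \beta & \text{if } |y_t - f_t(x)| \le \beta \\
  |y_t - f_t(x)| - \tfrac{\beta}{2} & \text{otherwise}
\end{cases}
```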

It's a Smooth L1 loss: β controls where the loss transitions from squared error to plain L1, which makes it less sensitive to outliers than a pure squared loss.
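If you're working in PyTorch, this corresponds to the built-in smooth L1 loss with its beta argument (the tensors below are just dummy shapes for illustration):

```python
import torch
import torch.nn.functional as F

preds = torch.randn(32, 768)    # student predictions at the masked positions
targets = torch.randn(32, 768)  # teacher latents at the same positions
loss = F.smooth_l1_loss(preds, targets, beta=2.0)  # beta = squared-to-L1 transition point
```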

[Going to skip Experimental Setup, but you can check it out yourself!]

Results

CV:

Speech processing:

NLP:

Ablation:

First, they tested whether averaging over multiple layers is a good idea by experimenting with averaging the last K layers, with K from 1 to 12.

They also tested using different features from within each Transformer block as targets for the student model, and found that the output of the feedforward network block works best.

Discussion:

This paper unifies the learning process, but we're still using medium-specific ways to encode and mask the data. That makes sense because the data types are vastly different (different resolutions, different natural units to mask). There has been work on training directly on raw data across mediums, though, which could be complementary to this paper.

In terms of target type, latent representations allow for contextualized targets, place no limit on the number of target units, and don't require pre-defined target units (a first for NLP).

Algorithms that generate their own targets are prone to collapse. This paper counters that with careful hyperparameter choices, normalization that keeps the targets from becoming constant, and masking longer spans for mediums whose adjacent targets are highly correlated (like speech).

Conclusion:

This paper unifies the learning regime for vision, speech, and language. It does this by having a student model regress the latent representations of a teacher model. data2vec matches or beats other self-supervised algorithms.

A unified training regime should make it easier to learn across mediums, and further work could aim for a single masking & encoding strategy too.

If you want to find out more: Read the paper here!

Thanks for reading! I’m Dickson, currently working on Deus Ex Securitas, where we aspire to achieve superhuman level performance in smart contract exploit detection!
