Meta’s chief AI scientist Yann LeCun is betting on a new AI paradigm


I-JEPA shows how Meta's AI chief Yann LeCun sees the future of AI – and it all starts again with ImageNet benchmarks.

Less than a year ago, AI pioneer and Meta AI chief Yann LeCun unveiled a new AI architecture designed to overcome the limitations of current systems, such as hallucinations and logical weaknesses. With I-JEPA, a team from Meta AI (FAIR), McGill University, Mila (Quebec AI Institute), and New York University presents one of the first AI models to follow the “Joint Embedding Predictive Architecture”. The researchers include first author Mahmoud Assran and Yann LeCun.

The Vision Transformer-based model achieves high performance in benchmarks ranging from linear classification to object counting and depth prediction, and is more computationally efficient than other widely used computer vision models.

I-JEPA learns with abstract representations

I-JEPA is trained in a self-supervised manner to predict the unseen parts of an image. To do this, large blocks of each image are masked, and I-JEPA must predict their content from the parts that remain visible. Other methods often rely on much more extensive training data.
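The block masking described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's code: the 14 x 14 patch grid, the 6 x 6 block size, the number of blocks, and the function names are all assumptions chosen for the example.

```python
import random

def sample_block(grid_h, grid_w, block_h, block_w, rng):
    """Return the patch indices covered by one random rectangular block."""
    top = rng.randrange(grid_h - block_h + 1)
    left = rng.randrange(grid_w - block_w + 1)
    return {
        row * grid_w + col
        for row in range(top, top + block_h)
        for col in range(left, left + block_w)
    }

def split_context_and_targets(grid_h=14, grid_w=14, block_h=6, block_w=6,
                              num_blocks=2, seed=0):
    """Split a patch grid into visible (context) and masked (target) patches.

    All sizes are illustrative assumptions, not values from the article.
    """
    rng = random.Random(seed)
    target = set()
    for _ in range(num_blocks):
        # Blocks may overlap; the union of all blocks is hidden from the model.
        target |= sample_block(grid_h, grid_w, block_h, block_w, rng)
    context = set(range(grid_h * grid_w)) - target
    return sorted(context), sorted(target)

context, target = split_context_and_targets()
print(f"{len(context)} visible patches, {len(target)} masked patches")
```

The model would only ever see the `context` patches as input; the `target` patches are what it has to predict.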


To ensure that I-JEPA learns semantic, higher-level representations of objects and does not operate at the pixel or token level, Meta places a kind of filter between the prediction and the original image.

I-JEPA has three components: a context encoder, which processes the visible parts of an image; a predictor, which uses the output of the context encoder to predict the representation of a target block in the image; and a target encoder. The target encoder sits between the full image, which serves as the training signal, and the predictor.

Picture: Meta

Thus, I-JEPA’s prediction is made not at the pixel level but at the level of abstract representations, because the image is first processed by the target encoder. In this way, the model uses “abstract prediction targets for which unnecessary pixel-level details are potentially eliminated,” Meta says, which leads the model to learn more semantic features.
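A toy sketch can make the data flow between the three components concrete. Everything here is a simplified assumption: the encoders are single random linear maps standing in for Vision Transformers, the predictor works from a mean-pooled context summary plus a crude position feature, and squared error is used as the distance because the article does not name the loss. The article also does not say how the target encoder's weights are updated, so they stay fixed here.

```python
import random

def linear(weights, vector):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def mean_vector(vectors):
    return [sum(values) / len(vectors) for values in zip(*vectors)]

def representation_loss(patches, context_ids, target_ids,
                        context_enc, target_enc, predictor):
    """Mean squared error between predicted and target representations."""
    # The context encoder only sees the visible patches.
    summary = mean_vector([linear(context_enc, patches[i]) for i in context_ids])
    loss = 0.0
    for patch_id in target_ids:
        # The target encoder turns a masked patch into an abstract target.
        target = linear(target_enc, patches[patch_id])
        # The predictor gets the context summary plus the position to predict.
        predicted = linear(predictor, summary + [patch_id / len(patches)])
        # The comparison happens in representation space, never on pixels.
        loss += sum((p - t) ** 2 for p, t in zip(predicted, target))
    return loss / len(target_ids)

rng = random.Random(0)
pixel_dim, embed_dim, num_patches = 12, 4, 16  # hypothetical toy sizes
patches = [[rng.random() for _ in range(pixel_dim)] for _ in range(num_patches)]
context_enc = random_matrix(embed_dim, pixel_dim, rng)
target_enc = random_matrix(embed_dim, pixel_dim, rng)
predictor = random_matrix(embed_dim, embed_dim + 1, rng)
target_ids = [5, 6, 9, 10]
context_ids = [i for i in range(num_patches) if i not in target_ids]

loss = representation_loss(patches, context_ids, target_ids,
                           context_enc, target_enc, predictor)
print(f"representation-space loss: {loss:.4f}")
```

The point of the sketch is the last line inside the loop: the error is computed between two embedding vectors, so pixel-level detail that the target encoder discards never enters the training signal.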

I-JEPA shines in ImageNet

The learned representations can then be reused for different tasks, allowing I-JEPA to achieve strong results on ImageNet with only 12 labeled examples per class. The 632-million-parameter model was trained on 16 Nvidia A100 GPUs in less than 72 hours. Other methods typically require two to ten times as many GPU hours and achieve worse error rates when trained on the same amount of data.
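The figures above translate into a rough compute and label budget. The short calculation below is only back-of-the-envelope arithmetic: it treats 72 hours as an upper bound, and it assumes the standard 1,000-class version of ImageNet, which the article does not state explicitly.

```python
NUM_GPUS = 16            # from the article
MAX_HOURS = 72           # "less than 72 hours", so this is an upper bound
LABELS_PER_CLASS = 12    # from the article
IMAGENET_CLASSES = 1000  # assumption: the 1,000-class ImageNet benchmark

# Upper bound on I-JEPA's training cost in GPU hours.
gpu_hour_ceiling = NUM_GPUS * MAX_HOURS

# "Two to ten times as many GPU hours", applied to that upper bound.
other_low = 2 * gpu_hour_ceiling
other_high = 10 * gpu_hour_ceiling

# Total labeled images used in the low-label evaluation.
labeled_images = LABELS_PER_CLASS * IMAGENET_CLASSES

print(gpu_hour_ceiling)       # 1152
print(other_low, other_high)  # 2304 11520
print(labeled_images)         # 12000
```

So I-JEPA's training stays under roughly 1,150 A100 GPU hours, while the comparison methods would land somewhere in the low thousands to around ten thousand under the same reading.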

I-JEPA achieves high scores in ImageNet with relatively low computational overhead. | Picture: Meta

In an experiment, the team uses a generative AI model to visualize I-JEPA’s representations and shows that the model learns as expected.


More information is available in Meta's I-JEPA blog post. The model and code are available on GitHub.
