In a new paper, Meta announces 𝐂𝐡𝐚𝐦𝐞𝐥𝐞𝐨𝐧, a 𝐦𝐢𝐱𝐞𝐝-𝐦𝐨𝐝𝐚𝐥 𝐞𝐚𝐫𝐥𝐲-𝐟𝐮𝐬𝐢𝐨𝐧 foundation model. Unlike earlier multimodal models, which model the different modalities (text, image, audio, etc.) separately, mixed-modal early-fusion foundation models like Chameleon are end-to-end models: they ingest all modalities from the start and project them into one shared representational space. This permits integrating information across modalities and generating multimodal documents. Indeed, the paper contains some nice examples of 𝐢𝐧𝐭𝐞𝐫𝐥𝐞𝐚𝐯𝐞𝐝 𝐢𝐦𝐚𝐠𝐞 𝐚𝐧𝐝 𝐭𝐞𝐱𝐭 generation (see below), which seems to be Chameleon's forte.
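To make the early-fusion idea concrete, here is a minimal PyTorch sketch of a single autoregressive model operating over the union of a text vocabulary and an image-code vocabulary. All names, vocabulary sizes, and layer counts here are illustrative assumptions, not Chameleon's actual architecture.

```python
import torch
import torch.nn as nn

# Assumed vocabulary sizes, for illustration only (not the paper's values).
TEXT_VOCAB = 65_536   # ids [0, TEXT_VOCAB) are text tokens
IMAGE_VOCAB = 8_192   # ids [TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB) are image codes

class EarlyFusionLM(nn.Module):
    """Toy early-fusion model: one embedding table and one transformer
    over the *union* of text and image token ids."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, token_ids):
        # token_ids: (batch, seq) freely interleaving text and image ids.
        # A causal mask makes this an autoregressive next-token predictor.
        mask = nn.Transformer.generate_square_subsequent_mask(
            token_ids.size(1)).to(token_ids.device)
        h = self.backbone(self.embed(token_ids), mask=mask)
        return self.lm_head(h)  # logits over the joint vocabulary

# A mixed-modal document is just one interleaved token sequence:
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 32))
logits = EarlyFusionLM()(torch.cat([text_ids, image_ids], dim=1))
print(logits.shape)  # (1, 48, TEXT_VOCAB + IMAGE_VOCAB)
```

The key design point is the single embedding table and output head: once images are tokenized into discrete codes, the model never needs modality-specific branches during training.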
Despite Meta's rather different practices in the social media department, the company is remarkably transparent in its GenAI business. The paper contains many interesting insights, and the Chameleon model will hopefully become open source, as its predecessors already are.
The paper describes the datasets used (which do not include Meta user data) and gives detailed insights into techniques for stable and scalable model training. Also interesting is the section about inference, which identifies 𝐭𝐡𝐫𝐞𝐞 𝐬𝐩𝐞𝐜𝐢𝐟𝐢𝐜 𝐦𝐢𝐱𝐞𝐝-𝐦𝐨𝐝𝐚𝐥 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐞𝐬 (a sketch follows below): (1) generated tokens must be copied from the GPU to the CPU to inspect their nature (i.e., text or image) and send them to the correct decoder; (2) tokens that do not belong to the modality currently being generated need to be masked; and (3) text's variable-length token runs and images' fixed-size blocks of tokens need to be seamlessly integrated.
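The following hedged sketch shows how those three challenges could surface in a decoding loop. It reuses the toy EarlyFusionLM above; the block size, the begin-of-image token, and the overall flow are invented for illustration and are not Chameleon's actual inference code.

```python
# Hypothetical mixed-modal decoding loop; IMAGE_BLOCK and BOI_TOKEN
# are illustrative assumptions, not values from the paper.
IMAGE_BLOCK = 1024           # assume each image occupies a fixed token block
BOI_TOKEN = TEXT_VOCAB - 1   # hypothetical "begin of image" text token

@torch.no_grad()
def decode(model, prompt_ids, max_steps=64, device="cpu"):
    ids = prompt_ids.to(device)
    text_out, images, image_buf = [], [], []
    in_image = False
    for _ in range(max_steps):
        logits = model(ids)[:, -1, :]
        # Challenge 2: mask tokens of the modality we are NOT generating,
        # so sampling cannot drift across modalities mid-block.
        if in_image:
            logits[:, :TEXT_VOCAB] = float("-inf")
        else:
            logits[:, TEXT_VOCAB:] = float("-inf")
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        # Challenge 1: .item() copies the token to the CPU so the program
        # can inspect its nature and route it to the right decoder.
        tok = next_id.item()
        if in_image:
            image_buf.append(tok)
            # Challenge 3: images are fixed-size blocks, unlike variable-
            # length text, so flush exactly IMAGE_BLOCK tokens per image.
            if len(image_buf) == IMAGE_BLOCK:
                images.append(image_buf)   # -> image detokenizer
                image_buf, in_image = [], False
        elif tok == BOI_TOKEN:
            in_image = True
        else:
            text_out.append(tok)           # -> text detokenizer
    return text_out, images

# Usage with the toy model above:
out_text, out_images = decode(EarlyFusionLM(), torch.randint(0, TEXT_VOCAB, (1, 8)))
```

Note how the per-token GPU-to-CPU copy and the per-step masking both sit on the hot path, which is why the paper treats them as inference performance problems rather than mere bookkeeping.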
As for 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧, Chameleon excels at interleaved image and text generation while remaining very competitive on text-only and image-only tasks. Missing is a comparison against GPT-4o (to be fair, it launched only three days before this paper was published, which is indicative of the speed of innovation). Unfortunately, not much is known about GPT-4o's architecture. GPT-4o is likely much larger than the 34-billion-parameter Chameleon (which also exists in a 7B version) and trained on more data. If Chameleon holds up against GPT-4o, it might lead us towards a future of smaller models, which is desirable in many ways. Note, however, that GPT-4o has audio capabilities that are currently absent from Chameleon.
Next to benchmarking (for the text-only and image-to-text tasks) for evaluation purposes, the paper contains a large section about 𝐡𝐮𝐦𝐚𝐧 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧, which serves as a great introduction to the topic for the uninitiated reader.