Apple debuts its MM1 multimodal AI model with rich visual capabilities


Apple's MM1 is a capable multimodal AI model that can compete with GPT-4V and Google Gemini on visual tasks, thanks to carefully chosen architecture and training decisions.

Like GPT-4V and Gemini, MM1 is based on the Large Language Model (LLM) architecture and was trained on a mixture of image-text pairs, interleaved image-text documents, and text-only data (45% image-text pairs, 45% interleaved image-text documents, 10% text-only data).
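The reported 45/45/10 mixture can be pictured as a weighted sampling schedule over data sources. The sketch below is illustrative only (not Apple's code, and the source names are invented) and shows how such a mixture might drive batch construction:

```python
import random

# Illustrative sketch, not Apple's implementation: the reported MM1
# training mixture expressed as sampling weights over data sources.
MIXTURE = {
    "image_text_pairs": 0.45,       # captioned images
    "interleaved_documents": 0.45,  # documents mixing images and text
    "text_only": 0.10,              # plain text corpora
}

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example comes from."""
    sources = list(MIXTURE)
    weights = [MIXTURE[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

# Draw many samples and check the empirical proportions.
rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# Proportions should land close to 45% / 45% / 10%.
```

In a real training pipeline the weights would steer how often each corpus contributes to a batch; here the point is simply that the mixture is a fixed ratio, not a property of any single document.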

This training regimen has enabled MM1 to develop capabilities similar to its rivals, including image description, question answering, and even basic mathematical problem-solving.

Apple’s MM1 model can recognize subjects and text on images and combine them across multiple images. | Image: B. McKinzie et al.

Apple’s researchers conducted in-depth investigations to identify the factors that most significantly influence MM1’s performance, such as architectural components and training data.



They discovered that high image resolution, the performance of the image processing component (known as the "visual encoder"), and the volume of training data are particularly crucial. Interestingly, the design of the connector that links image representations to the language model was found to be less critical.

The visual encoder is tasked with converting image information into a format that the AI system can process. The more advanced this encoder is, the better MM1 can understand and interpret image content.
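To make the encoder's role concrete, here is a minimal, purely illustrative sketch of the idea: an image is split into patches, and each patch is projected to an embedding vector the language model can consume alongside text tokens. A real encoder (such as a vision transformer) learns this projection; the fixed random projection below only demonstrates the shapes involved.

```python
import numpy as np

def encode_image(image: np.ndarray, patch: int = 16, dim: int = 64) -> np.ndarray:
    """Toy visual encoder: split an image into patches and project each
    patch to an embedding vector (illustrative random projection only)."""
    h, w, c = image.shape
    patches = []
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            # Flatten each patch into one long pixel vector.
            patches.append(image[y:y + patch, x:x + patch].reshape(-1))
    stacked = np.stack(patches)                    # (num_patches, patch*patch*c)
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((stacked.shape[1], dim))
    return stacked @ proj                          # (num_patches, dim)

# A 224x224 RGB image with 16-pixel patches yields 14*14 = 196 embeddings.
tokens = encode_image(np.zeros((224, 224, 3)))
```

This also hints at why image resolution matters: a higher-resolution input produces more patches, and therefore more visual tokens for the language model to reason over.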

The study also highlights the importance of the right mix of training data. Image-text pairs, interleaved image-text data, and text-only data were essential for achieving strong results with limited examples in the input prompt. However, when MM1 had to generate outputs without examples in the prompt, image-text pairs in the training data played a more significant role.

Image-text pairs or image-caption pairs are data in which each image is directly paired with an associated text. This text is usually a description or explanation of the image content.

An example would be an image of a dog with the caption "A brown dog playing with a ball in the park". Such paired data is often used to train models for tasks such as automatic image captioning.
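In a dataset, such a pair might be represented as a simple record. The sketch below is hypothetical; the field names are illustrative and not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class ImageCaptionPair:
    """One image-caption training example (illustrative structure)."""
    image_path: str   # reference to the image file
    caption: str      # text describing the image content

# The dog example from the article as a concrete record.
example = ImageCaptionPair(
    image_path="dog_park.jpg",
    caption="A brown dog playing with a ball in the park",
)
```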

