{"id":77,"date":"2024-03-23T14:26:36","date_gmt":"2024-03-23T14:26:36","guid":{"rendered":"https:\/\/wsw-int.de\/?p=77"},"modified":"2024-10-25T13:23:28","modified_gmt":"2024-10-25T13:23:28","slug":"from-language-models-to-multimodal-models","status":"publish","type":"post","link":"https:\/\/multai.eu\/de\/from-language-models-to-multimodal-models\/","title":{"rendered":"From language models to multimodal models"},"content":{"rendered":"<p>Language models have remarkable qualities. Their ability to analyze complex human language queries, acquired by training on the immense volumes of textual data accessible on the Internet, was enough to provoke enthusiasm. However, these algorithms model only one component of human perception: text.<\/p>\n\n\n\n<p>Multimodal models aim to overcome this limitation by natively processing different types of data, such as text, images, sounds and even video (<em>modalities<\/em>).<\/p>\n\n\n\n<p>The first multimodal models are already available on the market: <em>OpenAI<\/em> combines GPT-4 with GPT-4V (image recognition), DALL-E 3 (image generation), Whisper (speech recognition) and TTS (text-to-speech) to meet the most varied user requirements. <em>Google<\/em> Gemini Ultra offers comparable capabilities, and <em>Anthropic<\/em> is not to be outdone, since the new Claude 3 Opus model launched two weeks ago is also multimodal.<\/p>\n\n\n\n<p>The new frontier is video. OpenAI recently revealed the <strong>Sora<\/strong> <em>text-to-video<\/em> model, which creates videos of up to 60 seconds based on a simple text <em>prompt<\/em>. 
Take a look at their impressive demonstration:<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe title=\"Introducing Sora \u2014 OpenAI\u2019s text-to-video model\" width=\"800\" height=\"450\" src=\"https:\/\/www.youtube.com\/embed\/HK6y8DAPN_0?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>A word of terminology before going into detail: multimodal models are referred to as LMMs (&#8220;Large Multimodal Models&#8221;), as opposed to language models, known as LLMs (&#8220;Large Language Models&#8221;).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Representation learning<\/h2>\n\n\n\n<p>The <em>secret sauce<\/em> that makes multimodal models work is representation learning. It transforms a concept presented in its &#8220;humanly intelligible&#8221; form into a vector, i.e. a fixed-size sequence of numbers.<\/p>\n\n\n\n<p>In the case of a language model, this representation maps each word (or, more precisely, each token) to a vector. These vectors are generally high-dimensional: we&#8217;re talking about 1536 and 3072 dimensions for the two text representation models used by OpenAI described <a href=\"https:\/\/platform.openai.com\/docs\/guides\/embeddings\">here<\/a>.<\/p>\n\n\n\n<p>This representation is designed to preserve semantic correspondence. In other words, the distance between vectors measures their semantic proximity (vectors for &#8216;car&#8217; and &#8216;van&#8217; will be close to each other). 
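To make this notion of semantic distance concrete, here is a minimal sketch using cosine similarity on hand-made toy vectors. The words and values are purely illustrative assumptions; real embedding models produce the 1536- or 3072-dimensional vectors mentioned above.

```python
import math

# Toy 4-dimensional "embeddings" -- invented for illustration only.
embeddings = {
    "car":   [0.9, 0.8, 0.1, 0.0],
    "van":   [0.8, 0.9, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Semantically close words score near 1...
print(cosine_similarity(embeddings["car"], embeddings["van"]))
# ...while unrelated words score much lower.
print(cosine_similarity(embeddings["car"], embeddings["apple"]))
```

The same similarity computation underlies the vector-based search, grouping and classification tasks discussed later; only the source of the vectors changes.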
More strikingly, the differences between vectors correspond to other, more elementary concepts: the difference between the vectors &#8220;king&#8221; and &#8220;queen&#8221; is close to that between the vectors &#8220;man&#8221; and &#8220;woman&#8221;. The same applies to the difference between a verb&#8217;s gerund and its past tense!<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"557\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\" alt=\"\" class=\"wp-image-78\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/vectorRepresentation-300x163.png 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/vectorRepresentation-768x418.png 768w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1536x836.png 1536w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/vectorRepresentation.png 1600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: 3D representation of semantic vectors<br>Source: https:\/\/towardsdatascience.com\/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c<\/figcaption><\/figure>\n\n\n\n<p>This notion of representation lies at the heart of all generative language models, which are nothing more or less than machines for extending sequences of vectors. 
At the heart of the language model lies the algorithm called the <em>transformer<\/em>, whose action can be summarized as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Represent the input text as a sequence of vectors;<\/li>\n\n\n\n<li>Transform the sequence of vectors through various mathematical operations that enrich and combine the vectors in the prompt word sequence to create new ones;<\/li>\n\n\n\n<li>Repeat the above action a number of times, until a final sequence of vectors is obtained;<\/li>\n\n\n\n<li>Use this &#8220;enriched&#8221; final sequence of vectors to predict the next vector in the sequence, and therefore the next word;<\/li>\n\n\n\n<li>Repeat the whole process, adding the predicted word to the end of the sequence to predict the next word, and so on.<\/li>\n<\/ul>\n\n\n\n<p>In addition to generative models, the technique of textual representation makes language processing much easier: text search, grouping and classification become much less mysterious when you realize you can perform them on vectors.<\/p>\n\n\n\n<p>What&#8217;s more, imagine having learned a representation for the entire French vocabulary. And another representation for German, but in a space of the same dimensionality\u2026 you can then define a transformation between the vector spaces that will enable you to switch from one language to the other!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Different types of representation<\/h2>\n\n\n\n<p>What applies to text also applies to images and sounds. 
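Before moving on, the word-by-word generation loop of a language model can be sketched in a few lines. The hard-coded bigram table below is a deliberately trivial stand-in for the stack of transformer layers; every token and score in it is invented for illustration.

```python
# Toy "model": a fixed table scoring possible next tokens -- a stand-in
# for the layers that enrich the vector sequence in a real transformer.
BIGRAMS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def predict_next(tokens):
    # A real transformer would process the whole sequence of vectors here
    # before predicting; the toy model looks only at the last token.
    scores = BIGRAMS.get(tokens[-1], {})
    return max(scores, key=scores.get) if scores else None

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):   # repeat the whole process...
        nxt = predict_next(tokens)
        if nxt is None:
            break
        tokens.append(nxt)            # ...appending each predicted token
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

Despite the toy scoring rule, the control flow is the same autoregressive loop the bullet list describes: predict, append, repeat.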
Given a sufficient volume of training data, it is possible to define an image representation, which will also map each image to a representation in vector space.<\/p>\n\n\n\n<p><br>As with text, the vector will capture the visual content of the image, which can then be used for various automated vision tasks: object detection, image classification, facial recognition, image search by similarity\u2026<\/p>\n\n\n\n<p>In concrete terms, this means that images containing cars will be represented by similar vectors, as will those containing dogs, buildings or any other material object. Ideally, the dimensionality of the vector will be sufficient to model complex visual situations containing several objects, taking into account their respective positioning and other features appearing in the image.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"384\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/ImageEmbedding-1024x384.webp\" alt=\"\" class=\"wp-image-80\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/ImageEmbedding-1024x384.webp 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/ImageEmbedding-300x113.webp 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/ImageEmbedding-768x288.webp 768w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/ImageEmbedding.webp 1100w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Vector representation of images<br>Source: <a href=\"https:\/\/towardsdatascience.com\/image-analytics-for-everyone-image-embeddings-with-orange-7f0b91fa2ca2\">https:\/\/towardsdatascience.com\/image-analytics-for-everyone-image-embeddings-with-orange-7f0b91fa2ca2<\/a><\/figcaption><\/figure>\n\n\n\n<p>And what&#8217;s possible for images is also possible for sounds. 
Sound representations capture the semantic and contextual content of audio files: the pronunciation of the word car and the sound of a car starting up will be linked in vector space by a proximity relationship.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"456\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/audio2vec-1024x456.png\" alt=\"\" class=\"wp-image-82\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/audio2vec-1024x456.png 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/audio2vec-300x134.png 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/audio2vec-768x342.png 768w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/audio2vec.png 1431w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Vector representation of audio<br>Source: <a href=\"https:\/\/people.csail.mit.edu\/weifang\/project\/spml17-audio2vec\/\">https:\/\/people.csail.mit.edu\/weifang\/project\/spml17-audio2vec\/<\/a><\/figcaption><\/figure>\n\n\n\n<p>All that&#8217;s left is to put it all together. 
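As a toy illustration of what "putting it all together" means, the sketch below places a text encoder and an audio encoder in one shared vector space and retrieves the audio clip closest to a text query. The encoders are hand-crafted assumptions standing in for learned (CLIP-style) encoders; every name and number is invented.

```python
# Hand-crafted "encoders" mapping two modalities into one shared 3-D space.
def encode_text(word):
    return {"car": [1.0, 0.1, 0.0], "dog": [0.0, 0.1, 1.0]}[word]

def encode_audio(filename):
    return {
        "engine_start.wav": [0.9, 0.2, 0.1],
        "barking.wav":      [0.1, 0.2, 0.9],
    }[filename]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Cross-modal retrieval: find the audio clip closest to a text query.
audio_index = {f: encode_audio(f) for f in ["engine_start.wav", "barking.wav"]}
query = encode_text("car")
best_match = min(audio_index, key=lambda f: euclidean(query, audio_index[f]))
print(best_match)  # the engine sound, not the barking
```

Because both modalities live in the same space, the word "car" lands nearest the engine sound, exactly the proximity relationship described above.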
We now have a mechanism for encoding data from different modalities in a single, multimodal representation vector space.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"537\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation-1024x537.png\" alt=\"\" class=\"wp-image-83\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation-1024x537.png 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation-300x157.png 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation-768x402.png 768w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation-1536x805.png 1536w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalVectorRepresentation.png 1828w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Multimodal representations<br>Source: <a href=\"https:\/\/www.pinecone.io\/learn\/vector-search-basics\/\">https:\/\/www.pinecone.io\/learn\/vector-search-basics\/<\/a><\/figcaption><\/figure>\n\n\n\n<p>The final step is to integrate this into a model, usually of the <em>transformer<\/em> type, which will seek to predict the next vector; you then have a multimodal model which can draw on all available sources of information to generate output data in the desired format.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"522\" src=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/multimodalComplete-1024x522.webp\" alt=\"\" class=\"wp-image-84\" srcset=\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalComplete-1024x522.webp 1024w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalComplete-300x153.webp 300w, https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalComplete-768x392.webp 768w, 
https:\/\/multai.eu\/wp-content\/uploads\/2024\/03\/multimodalComplete.webp 1400w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Complete multimodal generative model<br>Source: <a href=\"https:\/\/medium.com\/@cout.shubham\/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec\">https:\/\/medium.com\/@cout.shubham\/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec<\/a><\/figcaption><\/figure>\n\n\n\n<p>One small remark: the idealized &#8220;end-to-end&#8221; multimodal model I&#8217;ve just described probably doesn&#8217;t yet exist. Current multimodal models such as those from OpenAI, Google or Anthropic are probably built as an assembly of different models, namely a unimodal language model that coordinates and calls on other &#8220;transmodal&#8221; models as required: for example, ChatGPT+ will call on DALL-E 3 if the user wants to generate an image (<em>text-to-image<\/em>), or on GPT-4V if an image is to be interpreted (<em>image-to-text<\/em>), etc. So today, we find ourselves in a multi-agent scenario.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Applications and outlook<\/h2>\n\n\n\n<p>LMMs are particularly attractive for the automation of healthcare, where patient data is dispersed across handwritten or digital text, imagery and even laboratory analysis reports in tabular form. Radiology is often cited as an example, since its raw material is imaging (CT scans, MRIs, X-rays, etc.), but there&#8217;s nothing to stop an LMM from being trained to receive and interpret other signals, such as those from an electrocardiogram.<\/p>\n\n\n\n<p>Another field where multimodality will play an essential role is robotics, where we will be seeking to give robots the ability to perceive and interact with their environment. 
Consolidating this visual, auditory and textual information into a single model will enable the robot to navigate and act more effectively on the outside world.<\/p>\n\n\n\n<p>The great challenge of multimodality, particularly for robotics, is the integration of video into the multimodal chain. The major players in the sector are working on this.<\/p>\n\n\n\n<p>Google has an important advantage in this field, as <em>Youtube<\/em> is one of its subsidiaries. With over 500 hours of new video published every <em>minute<\/em> on Youtube, this channel constitutes an excellent reservoir of data for training future multimodal video models.<\/p>\n\n\n\n<p>In conclusion, deep multimodal learning is an exciting and rapidly evolving field with great potential for advancing computer vision and other areas of artificial intelligence.<\/p>\n\n\n\n<p>Although multimodal learning has its challenges, including the need for large amounts of training data and the difficulty of fusing information from multiple modalities, recent advances in deep learning models are enabling significant performance improvements across a range of tasks.<\/p>\n\n\n\n<p>This is an area to watch in 2024, which could well be the year of LMMs just as 2023 was that of LLMs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Sources and references<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Multimodal Models and Computer Vision: A Deep Dive<\/em> by Petru Potrimba on Roboflow, May 10th 2023: <a href=\"https:\/\/blog.roboflow.com\/multimodal-models\/\">https:\/\/blog.roboflow.com\/multimodal-models\/<\/a><\/li>\n\n\n\n<li><em>Multimodal LLMs \u2013 Beyond the Limits of Language<\/em> by Tim Filzinger for Konfuzio, Oct. 
19th 2023: <a href=\"https:\/\/konfuzio.com\/en\/multimodal-llm\/\">https:\/\/konfuzio.com\/en\/multimodal-llm\/<\/a><\/li>\n\n\n\n<li><em>What are embeddings?<\/em>, online book by Vicki Boykis: <a href=\"https:\/\/vickiboykis.com\/what_are_embeddings\/\">https:\/\/vickiboykis.com\/what_are_embeddings\/<\/a><\/li>\n\n\n\n<li><em>Exploring Multimodal Large Language Models: A Step Forward in AI<\/em>, by Shubham Karwa, Nov. 16th 2023 on Medium: <a href=\"https:\/\/medium.com\/@cout.shubham\/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec\">https:\/\/medium.com\/@cout.shubham\/exploring-multimodal-large-language-models-a-step-forward-in-ai-626918c6a3ec<\/a><\/li>\n\n\n\n<li><em>The Multimodal Evolution of Vector Embeddings<\/em>, by James Le, Aug. 9th 2023 on TwelveLabs: <a href=\"https:\/\/www.twelvelabs.io\/blog\/multimodal-embeddings\">https:\/\/www.twelvelabs.io\/blog\/multimodal-embeddings<\/a><\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/multai.eu\/de\/\">MultAI.eu<\/a> &#8230;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"block-558d25e1-1f38-48e1-90f0-ab4fec0a05f5\"\/>\n\n\n\n<p id=\"block-d4bfcc52-6a5e-4883-a1f5-1461e662e3d9\"><mark><mark style=\"background-color:#ffffff\" class=\"has-inline-color\">Translated with <a href=\"https:\/\/www.deepl.com\/translator\">DeepL<\/a> and adapted from our partner Arnaud Stevins&#8217; <a href=\"https:\/\/artificiellementintelligent.wordpress.com\/2024\/03\/18\/des-modeles-de-langage-aux-modeles-multimodaux\/\">blog<\/a> (March 18th, 2024).<\/mark><\/mark><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" id=\"block-558d25e1-1f38-48e1-90f0-ab4fec0a05f5\"\/>\n\n\n\n<p>March 24th, 2024<\/p>\n\n\n\n<p><\/p>","protected":false},"excerpt":{"rendered":"<p>Language models have remarkable qualities. 
Their ability to analyze complex human language queries, which comes from training on the immense volumes of textual data accessible on the Internet, was enough to provoke enthusiasm. However, these algorithms model only one component of human perception: text. Multimodal models aim to overcome this limitation by natively processing different [&hellip;]<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[18],"class_list":["post-77","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-genai-llm-multimodal-text-audio-image"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>From language models to multimodal models - MultAI<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/multai.eu\/de\/from-language-models-to-multimodal-models\/\" \/>\n<meta property=\"og:locale\" content=\"de_DE\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"From language models to multimodal models - MultAI\" \/>\n<meta property=\"og:description\" content=\"Language models have remarkable qualities. Their ability to analyze complex human language queries, which comes from training on the immense volumes of textual data accessible on the Internet, was enough to provoke enthusiasm. However, these algorithms model only one component of human perception: text. 
Multimodal models aim to overcome this limitation by natively processing different [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/multai.eu\/de\/from-language-models-to-multimodal-models\/\" \/>\n<meta property=\"og:site_name\" content=\"MultAI\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-23T14:26:36+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-10-25T13:23:28+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\" \/>\n<meta name=\"author\" content=\"hans\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Verfasst von\" \/>\n\t<meta name=\"twitter:data1\" content=\"hans\" \/>\n\t<meta name=\"twitter:label2\" content=\"Gesch\u00e4tzte Lesezeit\" \/>\n\t<meta name=\"twitter:data2\" content=\"7\u00a0Minuten\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/\"},\"author\":{\"name\":\"hans\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292\"},\"headline\":\"From language models to multimodal models\",\"datePublished\":\"2024-03-23T14:26:36+00:00\",\"dateModified\":\"2024-10-25T13:23:28+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/\"},\"wordCount\":1371,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/multai.eu\/#organization\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\",\"keywords\":[\"GenAI; LLM; multimodal; text; audio; 
image\"],\"articleSection\":[\"Uncategorized\"],\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/\",\"url\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/\",\"name\":\"From language models to multimodal models - MultAI\",\"isPartOf\":{\"@id\":\"https:\/\/multai.eu\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\",\"datePublished\":\"2024-03-23T14:26:36+00:00\",\"dateModified\":\"2024-10-25T13:23:28+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#breadcrumb\"},\"inLanguage\":\"de\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage\",\"url\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\",\"contentUrl\":\"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/multai.eu\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"From language models to multimodal 
models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/multai.eu\/#website\",\"url\":\"https:\/\/multai.eu\/\",\"name\":\"WSW\",\"description\":\"Generative AI for your business\",\"publisher\":{\"@id\":\"https:\/\/multai.eu\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/multai.eu\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"de\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/multai.eu\/#organization\",\"name\":\"WSW\",\"alternateName\":\"MultAI\",\"url\":\"https:\/\/multai.eu\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png\",\"contentUrl\":\"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png\",\"width\":225,\"height\":244,\"caption\":\"WSW\"},\"image\":{\"@id\":\"https:\/\/multai.eu\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292\",\"name\":\"hans\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"de\",\"@id\":\"https:\/\/multai.eu\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g\",\"caption\":\"hans\"},\"sameAs\":[\"https:\/\/wsw-int.de\"],\"url\":\"https:\/\/multai.eu\/de\/author\/hans\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"From language models to multimodal models - MultAI","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/multai.eu\/de\/from-language-models-to-multimodal-models\/","og_locale":"de_DE","og_type":"article","og_title":"From language models to multimodal models - MultAI","og_description":"Language models have remarkable qualities. Their ability to analyze complex human language queries, which comes from training on the immense volumes of textual data accessible on the Internet, was enough to provoke enthusiasm. However, these algorithms model only one component of human perception: text. Multimodal models aim to overcome this limitation by natively processing different [&hellip;]","og_url":"https:\/\/multai.eu\/de\/from-language-models-to-multimodal-models\/","og_site_name":"MultAI","article_published_time":"2024-03-23T14:26:36+00:00","article_modified_time":"2024-10-25T13:23:28+00:00","og_image":[{"url":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png"}],"author":"hans","twitter_card":"summary_large_image","twitter_misc":{"Verfasst von":"hans","Gesch\u00e4tzte Lesezeit":"7\u00a0Minuten"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#article","isPartOf":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/"},"author":{"name":"hans","@id":"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292"},"headline":"From language models to multimodal 
models","datePublished":"2024-03-23T14:26:36+00:00","dateModified":"2024-10-25T13:23:28+00:00","mainEntityOfPage":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/"},"wordCount":1371,"commentCount":0,"publisher":{"@id":"https:\/\/multai.eu\/#organization"},"image":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage"},"thumbnailUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png","keywords":["GenAI; LLM; multimodal; text; audio; image"],"articleSection":["Uncategorized"],"inLanguage":"de","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/","url":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/","name":"From language models to multimodal models - MultAI","isPartOf":{"@id":"https:\/\/multai.eu\/#website"},"primaryImageOfPage":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage"},"image":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage"},"thumbnailUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png","datePublished":"2024-03-23T14:26:36+00:00","dateModified":"2024-10-25T13:23:28+00:00","breadcrumb":{"@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#breadcrumb"},"inLanguage":"de","potentialAction":[{"@type":"ReadAction","target":["https:\/\/multai.eu\/from-language-models-to-multimodal-models\/"]}]},{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#primaryimage","url":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png","contentUrl":"https:\/\/wsw-int.de\/wp-content\/uploads\/2024\/03\/vectorRepresentation-1024x557.png"},{"@type":"BreadcrumbLis
t","@id":"https:\/\/multai.eu\/from-language-models-to-multimodal-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/multai.eu\/"},{"@type":"ListItem","position":2,"name":"From language models to multimodal models"}]},{"@type":"WebSite","@id":"https:\/\/multai.eu\/#website","url":"https:\/\/multai.eu\/","name":"WSW","description":"Generative AI for your business","publisher":{"@id":"https:\/\/multai.eu\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/multai.eu\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"de"},{"@type":"Organization","@id":"https:\/\/multai.eu\/#organization","name":"WSW","alternateName":"MultAI","url":"https:\/\/multai.eu\/","logo":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/#\/schema\/logo\/image\/","url":"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png","contentUrl":"https:\/\/multai.eu\/wp-content\/uploads\/2024\/10\/Logo.png","width":225,"height":244,"caption":"WSW"},"image":{"@id":"https:\/\/multai.eu\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/multai.eu\/#\/schema\/person\/06def8c374b5d6724bec911e9880c292","name":"hans","image":{"@type":"ImageObject","inLanguage":"de","@id":"https:\/\/multai.eu\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1409f6643b6f17d5838709af9deca41643884a95390f8a4f8ea478b9187aec41?s=96&d=mm&r=g","caption":"hans"},"sameAs":["https:\/\/wsw-int.de"],"url":"https:\/\/multai.eu\/de\/author\/hans\/"}]}},"_links":{"self":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/77","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\
/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/comments?post=77"}],"version-history":[{"count":5,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/77\/revisions"}],"predecessor-version":[{"id":1425,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/posts\/77\/revisions\/1425"}],"wp:attachment":[{"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/media?parent=77"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/categories?post=77"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/multai.eu\/de\/wp-json\/wp\/v2\/tags?post=77"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}