We’re just starting to see Multimodal AI systems hit the spotlight. Unlike the text-based and image-generation AI tools we’ve seen before, multimodal systems can absorb and generate content in multiple formats – text, image, video, audio, etc.

How They Work

Unlike the previous generation of tools, each of which worked with a single format, multimodal systems are a collection of AI models that work together: one model processes text, another processes images, another processes sound, and so on. A “fusion” model takes the outputs of these and finds connections between them. A set of generators then assembles the output.
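The pipeline described above can be sketched in a few lines of code. Everything here is illustrative: the function names, the toy “embedding” math, and the wiring are stand-ins for the separate models a real multimodal system would use, not any actual product’s API.

```python
# Toy sketch of the multimodal pipeline: one encoder per modality,
# a fusion step that combines their outputs, and a generator that
# assembles the final response. All logic is illustrative only.

def encode_text(text: str) -> list[float]:
    # Stand-in for a text model: a crude one-number "embedding".
    return [sum(ord(c) for c in text) % 100 / 100.0]

def encode_image(pixels: list[int]) -> list[float]:
    # Stand-in for an image model: average brightness as a feature.
    return [sum(pixels) / (255.0 * len(pixels))]

def fuse(features: list[list[float]]) -> list[float]:
    # The "fusion" model: here, simply concatenate the per-modality
    # features; a real system learns the connections between them.
    return [value for feat in features for value in feat]

def generate(fused: list[float]) -> str:
    # Stand-in generator: produce a text answer from fused features.
    return f"answer conditioned on {len(fused)} fused features"

# Wire the stages together, as the article describes.
text_feat = encode_text("find me a similar jacket")
image_feat = encode_image([120, 200, 64])
fused = fuse([text_feat, image_feat])
print(generate(fused))
```

The key structural point is that each modality has its own model, and only the fusion step sees all of them together.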

But just as learners need practice to connect training content to their real-world performance, multimodal systems need tuning to connect the disparate models. You can expect multimodal systems to be more specialized than their earlier forms: one of the earliest uses is as a shopping assistant that finds similar items and then homes in on one.

Implications for L&D

Multimodal AI can produce terrific performance support systems when trained on role-specific content. Imagine this scenario: a support specialist feeds a trouble ticket into the system and gets back a range of related content. After a brief dialog, the system produces a tailored answer to the question. A Stanford study that tested AI-based performance support for customer service found an average productivity increase of 14% (with a 34% improvement for newer workers) and no hindrance to expert workers.

This could also apply to personalized instruction – give learners a problem to solve and a tuned model that combines direct instruction with performance support. The results could be narrowed at the start of training (to provide scaffolding) and loosened over time until learners are working with the full performance support system. Mentors and experts would still need to oversee the resulting behaviors, but they should be able to spend substantially less time doing so.

Just as a specialist can work more effectively than a novice, the need for “fusion” in Multimodal AI – making meaningful connections between the input models – will lead to ever more specialized models. You may need to provide your own content for tuning before you get usable results. This means that either more content-specialized vendors will arise or you’ll need to work through the process of sharing internal data with a vendor.

As always, the results need to be vetted for accuracy. There is some progress in that direction, especially from the related field of AI Agents (where one model generates content and the other checks it, then they collaborate to home in on an answer). You’ll need to lean on your subject matter experts during the training and vetting process.

One challenge that receives little attention is that prompts that work on one model often don’t work on another (or even on a later version of the same model). This means whoever operates the model needs to know how to formulate prompts and adapt them as models change.
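One practical way to manage this is to keep the task separate from the model-specific wording: store a prompt template per model and render the task through it, so swapping models means editing one template rather than every prompt. The model names and template styles below are invented for illustration.

```python
# Per-model prompt templates: the task is written once, and each
# model (or model version) gets its own rendering. All model names
# and template formats here are hypothetical.

TEMPLATES = {
    "model_a_v1": "Instruction: {task}\nAnswer briefly.",
    "model_a_v2": "### Task\n{task}\n### Response",  # newer version expects headers
    "model_b": "You are a support assistant. {task}",
}

def render_prompt(model: str, task: str) -> str:
    if model not in TEMPLATES:
        raise ValueError(f"no prompt template registered for {model!r}")
    return TEMPLATES[model].format(task=task)

print(render_prompt("model_a_v2", "Summarize the trouble ticket."))
```

When a model update changes what phrasing works, only its entry in the template table needs to be re-tuned.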

Will This Open the Door to Personalization?

This might also revive Learning Objects as a practice. Learning objects took a catalog of content and assembled training from those pieces. The approach failed because the results tended to be sterile and uninteresting; there was no connecting context or flow between elements. However, a multimodal AI could provide that linking context and finally enable personalized instruction from standard components.

Next Steps

It’s still a bit early to get your hands on these systems unless you’re a software developer, but you can start to plan. Look at how current AI is being used in your company (or in education, which has been an enthusiastic early adopter) and note where text-only or image-only solutions are falling short. This can lead to a set of quick evaluation projects and, hopefully, more personalized and powerful learning for your organization.