Large language models (LLMs) like ChatGPT and Bard, to name only two of the many already out there, continue to gain traction at a furious pace. These are examples of the larger set of so-called “generative” AI tools that produce output based upon an input request.

Notably, OpenAI’s marketing engine is doing its job well as it attempts to convince users of the strengths of the system: We have been told that ChatGPT passes the bar exam; excels at the Biology Olympiad; and can write, improve, and test Python code for bugs and errors. Implementations of LLMs in well-known learning applications such as Duolingo and Khan Academy seem to be further proof points in and of themselves.

The question of “What real value do these models have for learning in and around the workplace?” seems almost unjustified, given all the good news and use cases being presented to L&D professionals and the general public, which encompasses our learners. In this article we aim to dive deeper into this question, exploring opportunities—as well as limitations and concerns—of LLMs for learning.

The inner workings of LLMs and chatbots

While LLMs or so-called chatbots excel as communication engines, and they very much do excel in this department, they are only as good as the knowledge engine that powers them. We have covered this previously, in How ChatGPT3 Impacts the Future of L&D in an AI World. The better the system is trained on a topic, the better it will respond to prompts and questions around that topic. Further, the more correct the training data, the greater the trust we can place in its responses. With the internet as the training ground, we can see both why these chatbots can do very well on some topics and where the limitations of that training ground lie.

To understand the limitations of LLMs, we first need to understand how they work. As language engines, they “understand” how language and text work. They do not understand the world the language is describing; they only understand how the language has been used in the training set, in the knowledge engine, to describe that world.

In short, they use their training to make a best guess at what the right next thing to say is, based upon their ability to parse language. They do so by drawing on information in their knowledge base. While their language skills are well developed, the quality of their responses depends very much on the validity of that knowledge base.
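To make this concrete, here is a minimal sketch of what “guessing the next thing to say” looks like in practice. It uses the small, open-source GPT-2 model via the Hugging Face transformers library purely as an illustration (this is not how ChatGPT itself is built or served): the model simply scores which token is most likely to come next, with no notion of whether that continuation is true.

```python
# Minimal sketch of next-token prediction with the open-source GPT-2 model.
# Illustration only; not ChatGPT's actual implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The bar exam tests a candidate's knowledge of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every possible next token

# Turn the scores for the final position into probabilities and list the
# five most likely continuations. The model picks what "sounds" most
# plausible given its training data; it is not checking facts.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p:.2%}")
```

Everything the chatbot produces is built up one “most plausible next token” at a time in exactly this way.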

AI is not sentient

Despite what interactions with chatbots seem to imply(!), AI tools have no concept of knowledge, no concept of truth, and no concept of logic. Just like a child who has overheard their parents discussing a topic and remembers what was said, the AI tool will be able to parrot back ideas and concepts. While these responses may imply “understanding,” we actually know that the child (or AI) did not comprehend what they were saying, though the exchange might provide great amusement—or embarrassment—for all the grown-ups present.

Thus, the actual problem the LLM is solving is: What might an answer to this prompt/question sound like?

This is quite different from: What is the answer to this question/prompt?

Very often, where the training has been extensive and where the engine has had access to a good knowledge base, the answer the LLM provides can of course be very good and useful.

With this in mind, we can understand why to a certain prompt the language engine will come back with a “research paper”—with a well-formulated title and a long list of authors—that doesn’t actually exist. This is not a bug. The engine has done exactly what it was designed for: It has answered the question of “What would an answer to this prompt/question sound like?”

Which brings us back to the OpenAI marketing strategy, as an example of LLMs in general.

What OpenAI ‘sells’ us

How can it then be, you may ask, that ChatGPT4 is so good at the Biology Olympiad, how can it achieve a passing score on the bar exam, and why is it so good at programming languages such as Python?

The answer is compellingly simple. The bar exam is based on language: the language of the law and the language of cases and precedents. The internet is full of these, and all of these cases and precedents are well explained, so there is a good knowledge engine out there. Even for new or fictional cases, the engine knows what an answer probably could or should look like. The same goes for the Biology Olympiad, which is mainly based on knowledge and existing concepts.

The flip side of this is why ChatGPT isn’t great at the Physics, Chemistry, and Math Olympiads: All of these require building models of concepts and then working with them logically—neither of which a language engine can do.

Not so clever after all, eh?

Similarly, what about Python programming? Clearly that is logical and involves building models, correct? Well, no. Python is a language, and the knowledge engine, the internet, offers a huge number of examples of its use. When writing or fixing Python code, ChatGPT4 is not really doing anything different than when you ask it to translate from English to German. Note that the code doesn’t always run the first time!
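As a hypothetical illustration of why the code doesn’t always run the first time, imagine a chatbot answers the request “write a function that returns the average of a list of numbers” with the plausible-sounding Python below. It reads fluently, yet a quick test reveals an edge case the language engine never reasoned about.

```python
# Hypothetical chatbot-style suggestion: looks right, reads fluently...
def average(values):
    return sum(values) / len(values)  # ...but divides by zero for an empty list

# A human (or a test) still has to check it before trusting it:
assert average([2, 4, 6]) == 4

try:
    average([])
except ZeroDivisionError:
    print("Plausible-sounding code, but it breaks on an empty list.")
```

The translation was linguistically convincing; whether it actually works is something the model never verified.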

This understanding is critical to refuting the widespread—and incorrect—belief that understanding prompts is the key to getting the most out of LLMs and chatbots. It is also crucial for evaluating responses sensibly and for staying aware of the limitations and potential “mistakes” of the system.

Limitations of LLMs for learning

Equipped with the aforementioned understanding, we identify four fundamental limitations of LLMs when it comes to learning and applications in L&D and performance support.

1. LLMs cannot use images

The glaringly obvious one comes first: Language models cannot, at least yet, utilize images. Without the ability to use images for learning, an LLM cannot, for example, highlight a specific part of an electric circuit when explaining that part of the circuit board.

However, from the well-researched cognitive theory of multimedia learning, we know how important the combined use of words and images is for learning, along with spatial and temporal contiguity. While some generative AI systems can deal with images, they’re separate from the language engines, and the two are not yet able to work together.

2. LLMs revise text

The second limitation is that, as is the nature of language models, LLMs and chatbots will change the source text when they provide answers to questions. While these changes may result in correct answers in many cases, it cannot be ruled out that the AI’s responses will be incorrect. This is a potential disaster for learning, especially in areas where risk comes into play, such as health & safety, cybersecurity, or compliance.

3. LLMs make things up

The third fundamental limitation is that LLMs “hallucinate.” This has also been coined the “BS problem,” a result of the misalignment between the intent of the initial prompt and an LLM’s response. So-called hallucination in LLMs results from an insufficient training set in a specific topic or area, paired with the fact that an LLM has no capacity for logic. It is merely answering the question of “What might an answer to this prompt/question sound like?”

In areas where the engine lacks sufficient knowledge, it can create an answer that sounds plausible but is limited or wrong, much like a parrot provides the response that “sounds best” for the given context. LLMs are not trying to be right or truthful; they simply provide a linguistically correct response.

4. LLMs fail to uncover the unknown unknowns

Lastly, our fourth limitation: What’s key for effective and efficient learning is to uncover the “unknown unknowns” and guide the learner through these to mastery, including contextual understanding and transfer of knowledge, not just information retention and memory.

Unless prompted for a very precise area, the issue here with an LLM is one of “if you didn’t ask, then you won’t get an answer.” It’s like that good friend who has been helping you with a topic: You think you’ve understood, you go away, you try to put what you’ve learned into practice, and you stumble. You realize why you’ve stumbled, and you go back to your friend and say, “Why did you never tell me that?”

To which your friend says, “Well, you didn’t ask.”

A key skill for a good teacher, tutor, or coach is to anticipate, uncover, and guide the learner through the unknown unknowns.

Conclusion

When explaining a certain topic area, or playing the role of a waiter in a Paris restaurant conversing with the learner in French on Duolingo, a chatbot can be a fantastic tool.

But the aforementioned four fundamental limitations pose restrictions that we have to be aware of when it comes to the creation and delivery of learning and training. For instance, even if you have a high-quality data set for the topics your learners are looking to explore, limiting them to text without images is in stark opposition to well-researched multimedia learning principles.

It still takes a human to understand when such tools are useful as an adjunct, and when they shouldn’t be trusted. We conclude, again, that there is no shortcut around well-designed, high-quality learning content and impactful and effective delivery.

Understanding how different AI systems work, and their relative strengths and weaknesses, provides a basis for smart decisions. These decisions can undermine or accelerate the learning journeys you create and deliver to your learners. Choose wisely!