5 ways in which Gemini’s context window is an AI game-changer

Techtonic
8 min read · Mar 7, 2024


Technically this is not how a context window actually works. Credit: DALL-E 3 via Bing Chat

Google recently announced the release of Gemini 1.5 Pro, a mid-sized, natively multimodal model with a range of improvements over Gemini 1.0, which was released back in the Dark Ages (December 2023), when training mostly involved leeches and disease. While 1.5 Pro has a context window of 128,000 tokens (the same as GPT-4, following the release of GPT-4 Turbo in November 2023), Google has announced that it is testing a window of 1 million tokens.

The context window is essentially the working memory of an LLM, used to store the current conversation. From a technical standpoint, the tokens in the context window (everything the user and the LLM have exchanged in the current conversation, assuming it fits) are the input to the actual LLM. Larger context windows require the LLM to process more data, and also require careful attention to ensure the LLM doesn’t get confused by the sheer volume of information in the window.

So how big are these numbers, really? “Token” is a technical term that’s not very intuitive; we use it because tokens don’t always translate to words or pages in a straightforward way. You’ll see different conversion ratios; I use OpenAI’s rule of thumb of 1 token ≈ 0.75 words. That makes the 128k context window equivalent to just under 100,000 words, which is about the length of a typical contemporary novel.

One million tokens is obviously a big step up from this. The math is simple: roughly 750,000 words, or about 2,000 pages (depending on how densely the pages are packed).
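
For the curious, here is the back-of-the-envelope arithmetic as a quick Python sketch; the 0.75 words-per-token ratio and the 375 words-per-page figure are rough assumptions, not exact values:

```python
# Back-of-the-envelope conversion, assuming ~0.75 words per token
# (OpenAI's rule of thumb) and ~375 words per printed page.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 375  # assumption; real pages vary widely

def tokens_to_words(tokens: int) -> int:
    return round(tokens * WORDS_PER_TOKEN)

def tokens_to_pages(tokens: int) -> int:
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)

print(tokens_to_words(128_000))    # 96000 -- about one novel
print(tokens_to_pages(1_000_000))  # 2000 -- a small bookshelf
```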

This sounds pretty good–the original version of ChatGPT from November 2022 had a context window of 4,096 tokens, and it was easy to exceed it even in a normal exchange (the window has to fit both sides of the conversation, the user’s prompts and the LLM’s responses). It was a common occurrence to ask the LLM for clarification on something it had created earlier, and then realize it didn’t know what you were talking about, or had no idea what the original question was.

This isn’t likely to happen with a 128k-token window–it would be a long, long exchange before the LLM forgot what you had originally asked it to do. Even with persistence across multiple sessions, it’s unlikely you’re going to fill the entire window just by going back and forth.

So why does the prospect of a 1 million-token context window matter? Because the context window isn’t just for user-generated input–it’s also the most efficient way to give the LLM directly relevant information for your specific question. This can be as simple as cutting and pasting text directly into a web interface, or, slightly more sophisticated, appending documents to an API call.
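
As a rough illustration, here is what appending a document to an API call can look like with Google’s google-generativeai Python SDK. This is a minimal sketch: the model name, file name, and question are placeholders, and the exact model identifier available to you may differ.

```python
# Minimal sketch: put an entire document into the context window
# alongside the question. Assumes the google-generativeai package and
# an API key; the model name and file path are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

with open("annual_report_2023.txt", encoding="utf-8") as f:
    document = f.read()

prompt = (
    "Here is a company's annual report:\n\n"
    f"{document}\n\n"
    "Question: what are the three biggest risks the report identifies?"
)

response = model.generate_content(prompt)
print(response.text)
```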

This means that we can more easily provide reliable information directly to the LLM. For example, a user can pass a set of legal documents to the LLM in the context window for review, or a technical journal, or the complete works of a poet, and then ask the LLM to do something with that input. A larger context window doesn’t replace other techniques for doing this, such as RAG (essentially building an external data store alongside the LLM where it can look things up) or fine-tuning (adjusting the weights of the LLM itself to incorporate additional information), but it’s a simpler and often more effective method.

While a larger context window is generally better than a smaller one (although there are many technical issues to be resolved–it’s easy for an LLM to get confused with so much direct input), is the 1m-token window really that much better than a 128k-token one?

And…the answer is yes. A definitive, resounding yes. The 1m-token window isn’t just better than 128k, or even eight times better. Once we understand how to use it properly, and assuming it works as intended, the 1m-token window is a game-changer that will make LLMs significantly more useful.

The reason is that somewhere between 128k and 1m lies the threshold of what we expect humans to be able to use. I don’t mean “memorize” — people can’t memorize 100,000 words, at least not easily–but “use.” When we’re teaching people, we generally use materials that are the equivalent of a few hundred thousand tokens–a college textbook, say, supplemented by lectures, study notes, and problem sets. Or a series of technical journal articles. Or software documentation. All of these would be larger than a 128k window, but smaller than 1 million. We don’t expect people to read them once and know them, but it’s enough information, presented in a mutually reinforcing way, for people to master a subject.

Why does this matter? Because it will enable LLMs to learn from the same materials we use to teach people. Instead of having to create custom data specifically designed for LLMs, we can simply hand the machine the same thing we would hand a person and tell it to study up.

This doesn’t give us a better way of training LLMs. It gives us an easier way. Because we’ve already created knowledge materials for people on literally every subject we consider important. Everything has a textbook, or documentation, or something designed to explain it to humans. And once LLMs can understand a topic with those materials, they can quickly learn anything.

For example, if you want an LLM to review several years of annual reports and answer questions about them the way a stock analyst would, the larger context window makes this a trivial task (whereas before, it would have been more difficult and less reliable). Or you’d like it to read up on a country’s political situation and analyze it. Or review a textbook for errors. All of these become far simpler when the context window is equal to or greater than what we expect humans to be able to handle, because most of what we do is structured so that humans can handle it.

So the 1m-token context window isn’t just 8 times better than the 128k window. It crosses over the human-sized context window, which in turn creates a brand-new capability with broad applicability. This doesn’t mean that LLMs are now going to have capabilities surpassing humans (or even equal to them), but it removes one specific and significant shortcoming they’ve had, which has been the inability to use human-centric knowledge materials.

An analogy from the physical world would be humanoid robots. We have robots that are economically viable in a range of scenarios, particularly specific manufacturing tasks and increasingly in warehouse applications. However, right now we have to design robots for specific tasks, and then we generally have to redesign the physical working space around the robot (for example, warehouses designed for heavy robot use generally need wide corridors and special shelving). This means that we can only hand physical tasks off to robots one at a time, after we’ve worked extensively to fit the robot and the task to each other.

We will keep making slow progress on this problem, adding robots one job at a time, until we are able to build a humanoid robot that can operate the way a person can, with similar perception, agility, balance, and even size and weight. We are a long way from this–imagine what it would take for a robot to carry a tray of drinks up a narrow flight of stairs to the second floor of a restaurant–but when we get there, suddenly we’ll go from robots covering a handful of specific tasks to robots being able, in theory, to do almost anything. A larger LLM context window is a lot like a robot that can walk up stairs–it’s not the entire solution to doing everything humans can, but it’s a large piece of the answer.

How will this play out in practice? I see five immediate applications:

1) Reliable information on specific topics. Today, if you ask an LLM for information on the Crimean War, or ACL surgery, or chess openings, you’re taking your chances on whatever it may have learned as it trained on every topic simultaneously. It might be correct, it might be detailed, or it might be anything from quirky to flat-out wrong. However, an LLM that has just read a book, or ten journal articles, on the topic you’re asking about will be far more accurate.

2) Ability to work more readily and accurately with information not in the training data. An LLM is great if you’d like to know about a topic that’s covered well in public sources. Out of the box, though, it has no ability to provide information that’s not in those sources. This is relevant for most commercial applications–what most companies would like is a model that understands the company’s internal data, or policies, or documents. This can be done with fine-tuning or RAG, but it’s difficult and can be hit-and-miss. If customer-service agents can ask an LLM to help draft chat responses, and the system automatically appends the thousand-page manual to the prompt, the results are likely to be far more useful.

3) Up-to-date information. We’ve all had the frustrating experience of being told that an LLM’s training data cut off at this or that date, and so it can’t give us a piece of common knowledge.

A larger context window allows either the user to supply more recent information directly to the LLM, or the LLM to pull that information in itself (for example, by running its own internet search and prepending the results to your prompt).
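
A minimal sketch of that prepending step might look like the following; run_web_search is a stand-in for whatever search tool or API you have available, not a real library call.

```python
# Sketch: prepend fresh search results to the user's prompt so the model
# answers from current information rather than its training cutoff.
# run_web_search is a placeholder for whatever search API you use.
from typing import List

def run_web_search(query: str, max_results: int = 5) -> List[str]:
    """Placeholder: return a list of result snippets for the query."""
    raise NotImplementedError("plug in your preferred search API here")

def build_prompt_with_search(user_prompt: str) -> str:
    snippets = run_web_search(user_prompt)
    results = "\n".join(f"- {s}" for s in snippets)
    return (
        "Recent web search results (may be incomplete):\n"
        f"{results}\n\n"
        f"Using the results above where relevant, answer:\n{user_prompt}"
    )
```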

4) Self-training expert LLMs. As noted above, a large context window allows LLMs to learn the same way that people learn. Because we already have structured ways of teaching people almost every topic (a university course syllabus is a detailed playbook for one of them), it becomes simple to teach almost anything to an LLM: find a syllabus, or a similar document, that lists the key sources required to reach a solid level of understanding, and ingest them.

5) Persistent LLM memory. This has been discussed for a while, and it’s not clear whether it’s something we actually want (the recent OpenAI announcement about their implementation of memory used GPT’s knowledge of your toddler’s preferences as an example, which wasn’t what I would have gone with). However, a wide context window makes it easy to do, if we want it–your preferred LLM simply creates a working document with the key points from its previous interactions with you, and includes it in its own context window. Doing this through the context window, instead of an external vector database (through RAG) or with fine-tuning, has the added benefit that the document can be human-readable, enabling you to edit it directly (or delete material) and also to share it across LLMs if you choose.
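
Here’s a minimal sketch of that pattern, assuming a plain-text memory file and some chat-style LLM API behind ask_llm (a placeholder, not a real library call):

```python
# Sketch: persistent memory via a human-readable notes file that gets
# prepended to every prompt. ask_llm is a placeholder for whatever LLM
# API you call; the file name is an assumption.
from pathlib import Path

MEMORY_FILE = Path("llm_memory.txt")

def ask_llm(prompt: str) -> str:
    """Placeholder: send the prompt to your LLM of choice."""
    raise NotImplementedError

def chat_with_memory(user_message: str) -> str:
    memory = MEMORY_FILE.read_text() if MEMORY_FILE.exists() else ""
    reply = ask_llm(
        f"Notes from previous conversations:\n{memory}\n\n"
        f"User: {user_message}"
    )
    # Ask the model for a one-line note worth remembering, then append it
    # to the file -- which the user can read, edit, delete, or share.
    note = ask_llm(
        "In one line, note anything worth remembering from this exchange:\n"
        f"User: {user_message}\nAssistant: {reply}"
    )
    with MEMORY_FILE.open("a", encoding="utf-8") as f:
        f.write(note + "\n")
    return reply
```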

These five use cases are all feasible now. Over time, of course, we’ll develop a more sophisticated understanding of LLM capabilities and of the options the larger context window opens up, and even more utility will emerge.

Will a 1m-token context window solve all of the problems we face with LLMs? Of course not. We still have a significant number of obstacles to overcome. And a wide context window creates challenges of its own, with the possibility that the model gets confused by all the information, and doesn’t pay enough attention to the core of the prompt. But when we move from 128k to 1m tokens, we cross a very significant threshold as we match and then exceed the human equivalent of the context window. It’s a significant step forward for artificial intelligence.

Written by Techtonic

I'm a company CEO and data scientist who writes on artificial intelligence and its connection to business. Also at https://www.linkedin.com/in/james-twiss/
