3 remarkable things that happened in AI this week: GPT-4o, Google makes a huge and underappreciated breakthrough, and another great step towards killer robots

Techtonic
4 min read · May 18, 2024


The leash is a safety mechanism in case the robot is about to fall, or find a weapon. Source: DrEureka

OpenAI releases GPT-4o, and we can’t quite decide if it’s good or not

What happened: OpenAI announced its latest and best model, GPT-4o (not “40” but “4o”; the “o” stands for “omni.” I mean, come on, guys). 4o’s benchmark performance is about the same as 4 Turbo for everyday text and coding, but it is better on audio, visual, and multimodal tasks. It’s also faster, with response times about one-third lower than 4 Turbo, which OpenAI (plausibly) claims will make it feel like a regular conversation with a person. And it’s cheaper through the API, which isn’t really a technical advance, but, hey, it’s still nice.
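If you want to try it yourself, the model is already exposed through OpenAI’s API. Here’s a minimal sketch using the official Python client (openai 1.x); the prompt is just an example, and you’ll need your own API key in the environment.

```python
# Minimal sketch: calling GPT-4o through OpenAI's Python client (openai >= 1.0).
# Assumes OPENAI_API_KEY is set in the environment; the prompt is just an example.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain in one sentence why your name ends in 'o'."}
    ],
)

print(response.choices[0].message.content)
```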

Why it matters: Half the people reviewing 4o think it’s good because it’s, you know, good. The other half were expecting GPT-5 and are disappointed; some people even think that the release of 4o means OpenAI won’t be able to deliver GPT-5, and that we’ve now reached the limits of LLM scaling. I’m more in the latter camp than the former: no one has refuted the scaling math from the Chinchilla work of March 2022, and frontier models are already trained well past the compute-optimal point it describes (even Meta admitted they had overtrained Llama 3, and justified it on the grounds of reducing inference cost). The most thoughtful commentary I saw came from Ethan Mollick, who noted that the biggest impact of 4o isn’t the change in capabilities but the changes in access and cost. 4o will now be available in OpenAI’s free tier (with message limits), which in turn opens it up for schools in particular to use in teaching. That creates further risks around academic integrity, but it also narrows the have/have-not gap in AI access.
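For anyone who wants to see why Chinchilla keeps coming up: the rule of thumb from that paper is roughly 20 training tokens per parameter at the compute-optimal point. The back-of-the-envelope sketch below runs that arithmetic against Meta’s published Llama 3 figures (8B and 70B parameters, trained on roughly 15 trillion tokens); the exact multiple isn’t the point, the order of magnitude is.

```python
# Back-of-the-envelope arithmetic. The ~20 tokens-per-parameter figure is the widely
# quoted rule of thumb from the Chinchilla paper; the Llama 3 token counts are Meta's
# published figures, rounded.
CHINCHILLA_TOKENS_PER_PARAM = 20

models = {
    "Llama 3 8B": (8e9, 15e12),    # (parameters, training tokens)
    "Llama 3 70B": (70e9, 15e12),
}

for name, (params, tokens) in models.items():
    optimal = CHINCHILLA_TOKENS_PER_PARAM * params
    print(f"{name}: trained on ~{tokens / optimal:.0f}x the Chinchilla-optimal token budget")
```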

Google discovers “infini-attention” and maybe context windows don’t matter any more

What happened: Those smart people at Google released a paper called “Leave No Context Behind.” The authors should really work in marketing, because in addition to that awesome title, the paper introduces the term “infini-attention”: the ability to retain information from any part of the context, no matter how far back it appears. They create infini-attention through the use of “compressive memory” (also a great name), a method for storing the key points of earlier context in condensed form. Then, during inference, if the model decides those points are relevant, it can query that memory and pull back a compressed version of the original context. Think of it as taking your arbitrarily long prompt and breaking it up into labeled file folders; as the LLM does its work, it can look at the folder labels and pull out the ones it needs to answer your question.
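To make the mechanism a little more concrete, here is a toy, single-head NumPy sketch of the idea as the paper describes it: ordinary softmax attention inside the current segment, plus a fixed-size compressed memory that earlier segments get folded into and later segments can read back from. The dimensions are arbitrary, the projections are random stand-ins, and the mixing gate is fixed here rather than learned, so treat it as an illustration, not an implementation.

```python
# Toy, single-head sketch of the infini-attention idea: standard softmax attention
# inside each segment, plus a compressed memory (a d_key x d_value matrix and a
# normalizer) that stays the same size no matter how many segments have gone by.
# Sizes and weights below are arbitrary stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d_key, d_value, segment_len, n_segments = 16, 16, 8, 4

def elu_plus_one(x):
    # Positive feature map used for the linear-attention-style memory.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

memory = np.zeros((d_key, d_value))  # compressed record of all earlier segments
norm = np.zeros(d_key)

for _ in range(n_segments):
    # Stand-ins for the projected queries/keys/values of one segment.
    Q = rng.normal(size=(segment_len, d_key))
    K = rng.normal(size=(segment_len, d_key))
    V = rng.normal(size=(segment_len, d_value))

    # 1) Ordinary local attention within the current segment.
    local = softmax(Q @ K.T / np.sqrt(d_key)) @ V

    # 2) Retrieve a condensed view of everything seen in earlier segments.
    sigma_q = elu_plus_one(Q)
    retrieved = (sigma_q @ memory) / (sigma_q @ norm + 1e-6)[:, None]

    # 3) Blend the two; in the paper the gate is a learned parameter per head.
    gate = 0.5
    output = gate * retrieved + (1 - gate) * local

    # 4) Fold this segment into the memory for future segments to use.
    sigma_k = elu_plus_one(K)
    memory += sigma_k.T @ V
    norm += sigma_k.sum(axis=0)

print("per-segment output shape:", output.shape)  # (segment_len, d_value)
```

The key property the sketch shows is that `memory` and `norm` never grow with input length, which is the whole trick: the cost of “remembering” the past stays constant no matter how long the prompt gets.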

Why it matters: First, “infini-attention” is just such a cool term. Second, this offers a different and perhaps better solution to the context-window problem: instead of making the windows bigger and bigger, with all the implications for memory and compute, infini-attention offers a more human-like alternative, remembering some key items from the context precisely while keeping a less detailed recollection of the rest (with the ability to summon up a fuller version if required). This could be a more efficient path forward, or it could point towards a different way of using LLMs. It could even lead to the collapse of the current division of LLM work into training, fine-tuning, RAG, and prompt engineering, if it turns out that infini-attention lets a model hold as much tailored information as it needs in memory at different levels of resolution.

Everything you were worried about with that weird dancing Boston Dynamics dog but shouldn’t have been, you can worry about now

What happened: A team of researchers from the University of Pennsylvania has created a model that allows a robot dog to cross the street balanced on a yoga ball. This follows previous feats such as enabling a robot hand to twirl a pen, and also to kill people who look suspicious. I made that last one up, and it would be extremely difficult to do; you would need to change the code slightly. The important achievement here is that the robots aren’t pre-programmed; they are reacting to a dynamic environment and making very quick adjustments, quicker than most humans can manage.

Why it matters: Watching these videos shows that we have solved (well, okay, not you and me personally) one of the key challenges in physical robotics: collecting data about slight changes in the physical environment, processing it, and responding physically fast enough to balance as well as or better than a human. This in turn enables very different robots: robots on legs instead of wheels, able to navigate an environment built for humans, including (in the near future) climbing stairs and opening doors. It isn’t everything; there are still problems to be solved in machine vision, force feedback, weight, and batteries, among others, but it’s a lot.
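For a rough sense of what that sense-process-act loop looks like in software, here is a schematic sketch. This is not the researchers’ code; the control rate, observation size, and placeholder functions are all assumptions for illustration, and in the real systems the policy is a neural network trained in simulation.

```python
# Schematic only: a high-frequency sense -> think -> act loop of the kind legged
# robots run. Everything here (rate, sizes, function names) is a placeholder for
# illustration; it is not the DrEureka code.
import time
import numpy as np

CONTROL_HZ = 50  # assumed control rate; each full cycle is ~20 ms

def read_sensors():
    # Placeholder for IMU readings, joint angles, foot contacts, etc.
    return np.zeros(48)

def policy(observation):
    # Placeholder for a trained policy network mapping observations to joint targets.
    return np.zeros(12)

def send_joint_targets(targets):
    # Placeholder for commands sent to the motor controllers.
    pass

for _ in range(CONTROL_HZ * 2):      # run the loop for ~2 seconds
    start = time.monotonic()
    obs = read_sensors()             # 1) collect data about the environment
    action = policy(obs)             # 2) process it
    send_joint_targets(action)       # 3) respond physically
    # Hold the loop at a fixed rate; a ~20 ms cycle is well inside typical
    # human reaction times of a couple of hundred milliseconds.
    time.sleep(max(0.0, 1.0 / CONTROL_HZ - (time.monotonic() - start)))
```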

Remember those videos from Boston Dynamics from 2018 that showed those robots dancing? The ones where your parents emailed you the link telling you it was cute, but you immediately knew you were looking at the first version of the Skynet robots? Look, you were just being paranoid then. That cute little dancing dog wasn’t going to kill you; it was just executing a pre-programmed dance. This is the cute little dancing dog that’s going to kill you.


Techtonic

I'm a company CEO and data scientist who writes on artificial intelligence and its connection to business. Also at https://www.linkedin.com/in/james-twiss/