The latest AI paradigm
An under-the-radar shift in capabilities from 'assistant' to 'collaborator'
My mental model of Large Language Model capabilities runs like this:
‘GPT-1 class’ models - Breakthrough models built on the transformer architecture, small enough by today’s standards to run on your AI-enabled fridge or dishwasher. Mostly of academic interest, or for trolling your friends and family.
‘GPT-2 class’ models - A sort of schizophrenic zen sage. Strong NVDA buy signal, particularly if you’d read Gwern’s scaling hypothesis. Useful if you’re looking for wild leaps of logic, or for connections between words most humans would dismiss as too wacky. Generalised models come out like a sort of outsider-art slam poet. Training models of this size on a small corpus of text (e.g. Shakespeare) might be of use for some niche applications where compute capacity is severely limited (e.g. running on a small offline device).
‘GPT-3 class’ models - The original ‘ChatGPT’ many people still think of when they hear the word ‘AI’, having played with it a bit when it first released in November 2022. They probably had a bit of fun, tried to use it for some practical tasks but found it lacking, and haven’t really used it much since. Arguably the first chatbot that could (mostly) pass the Turing Test. Sort of like an intelligent high-schooler: broadly read, but with a poor understanding of the world. Commonly misunderstands inputs, misses context, is ‘uncreative’, and is often confidently wrong. Nevertheless, it had a range of practical uses, such as simple classification tasks and high-level ideation (e.g. give me 10 dessert ideas using this ingredient). Suggests you add 2x as much baking soda as needed. Understands the words better than GPT-2 class models, but is significantly worse at poetry.
‘GPT-4 class’ models - A paradigm shift in capabilities. These models (which include the likes of Google Gemini and Anthropic’s Claude 3) are like having a tireless intern with access to an old copy of the internet. For the first time, you could ask a model to summarise a piece of text and be reasonably confident it’d get the gist of it correct, as long as the text could be understood by a broad audience. They respond reasonably well to instruction, but often require some experimentation with different prompts to get the result you want. Tell one you want its answer in plain language, and it might condescendingly respond as if you’re a child, rather than in the ‘respectful public institution social media post’ tone you were after. Unfortunately, as most people weren’t operating within a ‘community of practice’ that talked about different approaches to prompting, many gave up on LLMs at this point after not getting what they wanted.
These models have developed a lot since the first release in March 2023¹. Benchmark accuracy scores have steadily crept up with each new release. Usage costs have dropped by 10x or more thanks to intense competition from the likes of Google and Meta’s open-source ‘Llama’ models, as well as various algorithmic improvements. Context windows (i.e. the amount of input data they can handle at once) were short on first release, and models had poor recall over the middle of the text. Context windows now run to 100+ pages of text, with near-perfect recall. Most models can now understand most human speech perfectly², and incorporate image generation and other capabilities into a single application. Increasingly, we’ll see these act as ‘operating systems’ able to call different functions depending on the nature of the user’s input (e.g. where historically calculations have been run through the language model itself, it makes sense for the app to simply call a dedicated and efficient calculator where required).
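To make the ‘operating system’ idea concrete, here’s a minimal sketch of the function-calling pattern using the OpenAI Python SDK. The calculator tool and its schema are my own illustration, not any particular vendor’s implementation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# An illustrative tool schema: a dedicated calculator the app can call,
# rather than running arithmetic through the language model itself.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate",
        "description": "Evaluate a basic arithmetic expression exactly.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "e.g. '0.17 * 2348'"},
            },
            "required": ["expression"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # any tool-capable model
    messages=[{"role": "user", "content": "What is 17% of 2,348?"}],
    tools=tools,
)

# If the model decided the calculator was warranted, it returns a structured
# call for the app to execute, instead of guessing at the answer in prose.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```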
This class of models can deliver a lot of labour-saving work. They can generate boilerplate advice or documents, give decent cake recipes, translate between languages, teach complicated topics to people of different ages and abilities, communicate in different tones, and produce workable code for simple tasks with limited scope, like basic data processing or a barebones website. Chain some of these capabilities together, tidy up some of the rough edges, and you can deliver sizeable projects in a fraction of the time it’d otherwise take. Expect to get passable but utterly clichéd answers in every creative endeavour. Unfortunately, they often lack a deeper understanding of fields that can’t be picked up from a textbook, and struggle to parse subtext. In other words, an intern without ‘real-world’ experience of the cussedness of human systems, like how to parse the sort of obfuscatory bullshit organisations spin out when they want to give the appearance of committing to something without actually committing to it.
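For calibration, this is the sort of bounded, well-specified task a GPT-4 class model will typically write correctly on the first attempt (the file and column names here are hypothetical):

```python
import csv

# Clean a sales CSV: drop incomplete rows, normalise region names,
# and round totals to two decimal places. Filenames/columns are illustrative.
with open("sales.csv", newline="") as src, open("clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["date", "region", "total"])
    writer.writeheader()
    for row in reader:
        if not row["total"]:  # skip rows with a missing total
            continue
        writer.writerow({
            "date": row["date"],
            "region": row["region"].strip().title(),
            "total": f"{float(row['total']):.2f}",
        })
```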

If the established trend holds, GPT-5 class models will be 10x or more larger than those of GPT-4. However, where technical metrics have previously conformed to the broad capabilities of that model class, the models being released today are of noticeably different quality and capability to those of a year ago, despite being of similar size and scale. These latest models, with the absolutely abysmally named ‘Claude 3.5 Sonnet (new)³’ as the best publicly available example, represent a leap in capabilities from ‘precocious, foolhardy intern’ to ‘intermediate staff member’ with several years of experience on the job. This should prompt us to change the way we think of these tools: from a GPT-4 ‘assistant’ to a genuine collaborator.
Like Deep Thought from the Hitchhiker’s Guide, Claude can help provide answers, but has no conception of what the right questions are. Using Claude as a collaborator rewards forethought and a fleshing out of your key ideas and questions - as you would for a colleague. See what it comes back with, and go from there. The entire conversation stays in the context window, so dig deeper. Ask it to argue for the opposite viewpoint.
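If you’d rather work through the API than the chat interface, that back-and-forth looks something like this minimal sketch using the Anthropic Python SDK (the prompts are illustrative, and the model ID was current at the time of writing):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL = "claude-3-5-sonnet-20241022"  # 'Claude 3.5 Sonnet (new)'

# Open with your fleshed-out ideas and key questions, as you would for a colleague.
history = [{"role": "user", "content": (
    "I'm drafting a piece on localism and the distribution of political power "
    "across levels of government. Here are my working assumptions and key "
    "questions: ... What would you add, and what am I missing?"
)}]
first = client.messages.create(model=MODEL, max_tokens=1024, messages=history)

# The whole conversation stays in the context window, so dig deeper.
history += [
    {"role": "assistant", "content": first.content[0].text},
    {"role": "user", "content": "Now argue for the opposite viewpoint."},
]
rebuttal = client.messages.create(model=MODEL, max_tokens=1024, messages=history)
print(rebuttal.content[0].text)
```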
I was recently asked to write a piece about localism [spoilers], and the distribution of political power across different levels of government. After doing some thinking, I put my working ideas and key questions to Claude.
I’ve found earlier models too highly suggestible once you moved away from core knowledge in their training data. This made them prone to agreeing with whatever proposition you put forward. Here, Claude came back with a document that first outlined its presumptions around core principles of accountability, efficiency, and scope alignment, followed by a detailed proposal for a distribution of power⁴, and finally, a list of potential failure modes. Nothing you wouldn’t expect from an intermediate staffer, but it was a thorough response that raised a few points I’d yet to consider and provided structure for further interrogation of the different moving parts⁵.
Jensen Huang, CEO of Nvidia, talked recently about how he’s now getting input from their internal models on every major decision. This sort of use will be commonplace within a few years. I’d recommend getting started now.
‘Claude 3.5 Sonnet (new)’ is currently freely available here.
¹ Most are unaware of how much change has occurred since the original release, because subsequent releases all have absolutely terrible names and branding. We’re talking names like ‘GPT-4-turbo-2024-04-09’, ‘GPT-4o’, ‘o1-preview’, and ‘Llama 3.1 405B’, which make it almost impossible to talk about new model capabilities with colleagues who are interested in using AI, but aren’t the sorts of nerds who forecast AI development and chat with the developers of these models on Twitter.
² Programmes that generate text from audio have been around for ages, such as on Samsung phones, but have been unworkable for most people speaking quickly, with non-American accents, with background noise present, or when using jargon or slang. A 90% accurate speech-to-text model is basically useless, in my view, but AI tools like OpenAI’s Whisper (since incorporated into the ChatGPT app) can transcribe the likes of K-pop, thick Scottish accents, and French perfectly. These transcriptions can then be used by the app as the input into the GPT model.
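A minimal sketch of that pipeline using the OpenAI Python SDK (the filename and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: transcribe the audio with Whisper.
with open("meeting.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

# Step 2: pass the transcription to the language model as ordinary text input.
summary = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Summarise this meeting:\n\n{transcript.text}"}],
)
print(summary.choices[0].message.content)
```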
³ Not to be confused with Claude 3 Sonnet, Claude 3.5 Sonnet, or Claude 3.5 Haiku, which are all totally different beasts.
⁴ Implicitly, the response was US-centric. Given the weight of American-oriented text in the English-language training corpus, you should expect all responses to be oriented towards American values, ideas, and geography unless you specifically prompt otherwise. Here, I used follow-up prompts to dive into how the answer might change from one nation to another.
⁵ I expect to publish this Localism essay in November.


