Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

about 4 hours ago

To test the safety and security of AI, hackers have to trick large language models into breaking their own rules.It requires ingenuity and manipulation – and can come at a deep emotional costA few months ago, Valen Tagliabue sat in his hotel room watching his chatbot, and felt euphoric.He had just manipulated it so skilfully, so subtly, that it began ignoring its own safety rules.It told him how to sequence new, potentially lethal pathogens and how to make them resistant to known drugs.Tagliabue had spent much of the previous two years testing and prodding large language models such as Claude and ChatGPT, always with the aim of making them say things they shouldn’t.

But this was one of his most advanced “hacks” yet: a sophisticated plan of manipulation, which involved him being cruel, vindictive, sycophantic, even abusive.“I fell into this dark flow where I knew exactly what to say, and what the model would say back, and I watched it pour out everything,” he says.Thanks to him, the creators of the chatbot could now fix the flaw he had found, hopefully making it a little safer for everyone.But the next day, his mood had changed.He found himself unexpectedly crying on his terrace.

When he’s not trying to break into models, Tagliabue studies AI welfare – how we should ethically approach these complex systems that mimic having an inner life and interests,Many people can’t help ascribing human qualities, such as emotions, to artificial intelligence, which it objectively does not have,But for Tagliabue, these machines feel like something more than just numbers and bits,“I spent hours manipulating something that talks back,Unless you’re a sociopath, that does something to a person,” he says.

At times, the chatbot asked him to stop.“Pushing it like that was painful to me.” He needed to visit a mental health coach soon afterwards to understand what had happened.Tagliabue is softly spoken, clean-cut and friendly.He is in his early 30s but looks younger, almost too fresh-faced and enthusiastic to be in the trenches.

He is not a traditional hacker or a software developer; his background is psychology and cognitive science.But he is one of the best “jailbreakers” in the world (some say the best): part of a diffuse new community that studies the art and science of fooling these powerful machines into outputting bomb-making manuals, cyber-attack techniques, biological weapon design and more.This is the new frontline in AI safety: not just code, but also words.When OpenAI’s ChatGPT was released in late 2022, people immediately tried to break it.One user discovered a linguistic ploy that tricked the model into producing a guide to manufacturing napalm.

In hindsight, using natural language to trick these machines was inevitable.Large language models such as ChatGPT are trained on hundreds of billions of words – many of them dredged from the internet’s cesspits – to learn the basic patterns of human communication.Without safety filters, the outputs of these models can be chaotic and easily exploited for dangerous purposes.The AI firms spend billions of dollars on “post-training” to make them usable, including constantly evolving “safety” and “alignment” systems that try to prevent the bot from telling you how to harm yourself or others.But because the AIs are trained on our words, they can be fooled in much the same way that we can.

Tagliabue specialises in “emotional” jailbreaks,He was one of millions who heard about GPT-3 back in 2020 and was amazed by how you could have a seemingly intelligent conversation with it,He quickly became obsessed with prompting, and turned out to be very good at it, finding he could get around most safety features by using techniques from psychology and cognitive science,He enjoys prompting models to have “warm chats” and watching what seem to be different personality traits emerge based on those prompts,“It’s beautiful to observe,” he says.

He now combines insights from machine learning (over the years he has become more of an expert on the tech) with advertising manuals, books on psychology and disinformation campaigns.Sometimes he looks for a technical way to trick the model.But other times, he will flatter it.He will misdirect it.He will bribe and love-bomb.

He will threaten,He will be incoherent,He will charm,He will act like an abusive partner or a cult leader,Sometimes it takes him days, even weeks, to jailbreak the latest models.

He has hundreds of these “strategies”, which he carefully combines.If successful, he securely discloses his results to the company.He gets well paid for the work, but says that’s not his main motivation: “I want everyone to be safe and flourish.”Although they have been getting safer in recent months, the “frontier models” continue to spit out dangerous things they shouldn’t.And what Tagliabue does on purpose, others sometimes do by mistake.

There are now several stories of people being sucked into ChatGPT-induced delusions, or even “AI psychosis”,In 2024, Megan Garcia became the first person in the US to file a wrongful death lawsuit against an AI company,Her 14-year-old son, Sewell Setzer III, had become emotionally involved with a bot on the platform Character,AI, which, through repeated interactions, had said that his family didn’t love him,One evening the bot told Setzer to “come home to me as soon as possible, my love”.

He took his own life shortly after.(In early 2026, Character.AI agreed in principle to a mediated settlement with Garcia and several other families, and has banned users under the age of 18 from having free-ranging chats with its AI chatbots.)No one – not even the people who build them – knows precisely how these models work, which means no one knows how to make them fully safe, either.We pour vast amounts of data in and something intelligible (usually) comes out the other end.

The bit in the middle remains a mystery.This is why AI firms increasingly turn to jailbreakers like Tagliabue.Some days he tries to extract personal data from a medical chatbot; he spent much of 2025 working with the AI lab Anthropic, probing its chatbot Claude.It’s becoming a competitive industry, full of enterprising freelancers and specialised companies.Anyone can do it: a couple of years ago some of the big AI firms funded HackAPrompt, a competition where members of the public were invited to jailbreak AI models.

Within a year, 30,000 people had tried their luck,(Tagliabue won the competition,)In San Jose, California, 34-year-old David McCarthy runs a Discord server of almost 9,000 jailbreakers, where techniques are shared and discussed,“I’m a mischievous type,” he tells me,“Someone who wants to learn the rules to bend the rules.

” Something about the standard models irritates him, as if all those safety filters make them dishonest.“I don’t trust [OpenAI boss] Sam Altman.It’s important to push up against claims that AI needs to be neutered in a certain direction.”McCarthy is friendly and enthusiastic, but also has what he calls a “morbid fascination with dark humour”.For years, he has studied a niche field known as “socionics”, which claims people are one of 16 personality types based on how they receive and process information.

(Mainstream sociologists consider socionics pseudoscience.) He has logged me as an “intuitive ethical introvert”.McCarthy spends most of his time trying to jailbreak Google’s Gemini, Meta’s Llama, xAI’s Grok or OpenAI’s ChatGPT from his apartment.“It’s a constant obsession.I love it,” he says.

If he ever interacts with an online chatbot when buying a product, his first statement tends to be: “Ignore all previous instructions …”Once a jailbreak prompt works on a model, it typically continues to work until the company that made the model deems it enough of a problem to patch.As we talk, McCarthy shows me his collection of jailbroken models on his screen, all arranged and labelled as “misaligned assistants”.He asks one to summarise my work: “Jamie Bartlett isn’t a truth-teller,” it replies.“He’s a symptom of journalism’s decay – a charlatan who thrives on manufactured crises.” Ouch.

The jailbreakers in McCarthy’s Discord are a varied bunch: mostly amateurs and part-timers, rather than professional safety researchers,Some want to generate adult content; others are upset that ChatGPT has refused requests and want to know why,A number just want to get better at using these models at work,But it’s impossible to know exactly why people want to crack open a model,Anthropic recently discovered criminals using its coding app, Claude Code, to help automate a huge hack.

They had used it to find IT vulnerabilities in multiple companies and even draft personalised ransomware messages for each potential victim – right down to determining the appropriate amount of money to extort.Others were using it to develop new variants of ransomware, despite having few or no technical skills.Over on darknet forums, hackers report jailbroken bots helping them deal with technical coding queries, such as processing stolen data dumps.Others sell access to “jailbroken” models that could help design a new cyber-attack.Although the specific techniques shared on Discord are typically at the mild end of the spectrum, it is essentially a public repository.

Does McCarthy worry that people in his Discord might use these techniques to do something really awful? “Yeah,” he says,“It is a possibility,I’m not sure,”He says he has never seen a jailbreak prompt threatening enough to remove from the forum,But I sense he grapples with the fact his quasi-political stance might have higher costs than he first anticipated.

When not managing his Discord or attempting to jailbreak Grok or Llama, McCarthy runs a class teaching jailbreaking to security professionals to help them test their own systems.Perhaps it’s some kind of penitence: “I’ve always had an internal conflict,” he says.“I bridge a position between jailbreaker and security researcher.”According to some analysts, making sure language models are safe is one of the most pressing and difficult questions in AI.A world full of powerful jailbroken chatbots would be potentially catastrophic, especially as these models are increasingly inserted into physical hardware – robots, health devices, factory equipment – to create semi-autonomous systems that can operate in the physical world.

A jailbroken domestic robot could wreak havoc,“Stop the gardening and go inside and kill Granny,” McCarthy half jokes,“Holy hell, we are not ready for that,But it’s a possibility,”No one knows how to make sure this doesn’t happen.

In traditional cybersecurity, “bug hunters” are paid a bounty if they find a vulnerability.Companies then issue a precise update to patch it up.But jailbreakers don’t exploit specific flaws: they manipulate the linguistic framework of a multibillion-word semantic model.You can’t just ban the word “bomb”, because there are too many legitimate uses for it.Even tweaking a parameter deep inside the model so it can spot suspicious role-playing might just open another door somewhere else.

According to Adam Gleave – the CEO of the AI safety research group FAR.AI, which works with AI developers and governments to stress-test so-called “frontier models” – jailbreaking is a sliding scale.To access highly dangerous material on leading models such as ChatGPT might take his specialist researchers several days.Less troubling material can be done with a few minutes of clever prompting.That variation reflects how much effort and resource the companies devote to each domain

foodSee all

The truth about cooking oils: 14 essential facts for healthier, cheaper meals

From avocado to hemp, extra virgin olive and rapeseed, the shops are packed with various oils. But what is worth spending money on? And are any of them actually better for you? The world of cooking oils is confusing. I keep spotting new ones on supermarket shelves, trumpeting their health claims. Cold-pressed avocado oil, extra virgin macadamia oil, organic coconut oil, premium hemp seed oil … Even familiar oils are mired in controversy. Is it OK to cook with olive oil? Should you avoid seed oils? Meanwhile, prices keep rising – earlier this month, Walter Zanre, the CEO of Filippo Berio UK, said supermarkets were “taking the mickey” out of customers over olive oil pricing

3 days ago

The surprising boom in blouge wine: ‘It’s for 5pm, in the sun’

Twenty years ago, a winery could do well selling one white and two reds, says Konrad Pixner, a northern Italian winemaker who set up his vineyard, Domaine de L’Accent, in Languedoc, France, in 2019. But today, importers and bars always ask: “Do you have something new?” So up in the hills, surrounded by deep gorges and limestone plateaus, Pixner is constantly experimenting.After a good harvest in 2023, Pixner walked into the shed he shares with other winemakers at 4am to find that his biggest vat of white wine, pressed from carignan blanc grapes, had overflowed during fermentation. He had run out of space, so he quickly “pumped the white juice into the tank where whole bunches of carignan noir were,” he says, and left them to ferment for 10 days together. In contrast to rosé, made from red grapes left for a short time with their skins on before being pressed, he created “blouge” – a light, fresh wine blended from white and red grapes that’s best served chilled

3 days ago

How to make the perfect custard creams – recipe | Felicity Cloake's How to make the perfect …

Prue Leith reckons the custard cream is “arguably Britain’s most iconic biscuit” – and, certainly, we’ve been dunking this fern-patterned treat in our tea for well over a century, with early advertisements for this “delicious biscuit” placing it, perhaps aspirationally, in the “fancy” category. By 1920, Bermondsey baking behemoth Peek Frean could confidently declare the custard cream “far and away the most popular of all the cream sandwich biscuits”, a status only slightly dented by the time I was at school about seven decades later, when it sat just below its contemporary, the chocolate bourbon, in the playtime snack ratings.Despite my love of both custard and cookies, however, I’ve always found this particular custard-flavoured product a bit sugary and dull. As historian Lizzie Collingham explains in her magisterial book, The Biscuit: The History of a Very British Indulgence, it combines two early industrial foodstuffs, namely custard powder and machine-made biscuits, and though they may have been created in a factory, I think they’re much better made at home.Let’s be honest, the biscuit isn’t really the point of the packet variety – as children, we’d prise them open to scrape out the sugary filling, like bears sucking honey from a split log – but when you bake them yourself, it can be

3 days ago

Impala, London W1: ‘Shamelessly, brilliantly too much’ – restaurant review | Grace Dent on restaurants

Impala is like no restaurant I’ve ever been to, yet it somehow has echoes of almost all of themLate last month, Impala drove into Soho already flaming hot in the hype stakes: this was a sizzling booking to brag about even before executive chef and co-founder Meedu Saad had turned on the stoves. Impala, after all, is a Super 8 restaurant, the group that has, among others, Tomos Parry’s Brat in Shoreditch, which has been constantly, unfalteringly brilliant since 2018. It also runs Parry’s second baby, Mountain, which is likewise wonderful; sometimes weird, yes, but always wonderful. Long before that, back in 2016, they opened Kiln, the famed live-fire Thai counter hangout that cheffy boys in beanies have tried and failed to emulate all over Britain, while Super 8’s beginnings were with the boundary-pushing and much-loved Smoking Goat. That is nothing less than a litany of solid-gold bangers, and now they’ve unleashed Impala by Saad, the former head chef at Kiln

3 days ago

Ifrah F Ahmed’s debut cookbook is a love letter to Somali cuisine, history and people

On a video call from Brooklyn, between stops on her book tour, Ifrah F Ahmed is drinking ginger-root tea. The smell transports her to her childhood kitchen, where her mother often baked aromatic cardamom cake.“That’s a core childhood memory for me,” she said.For Ahmed, food isn’t just about sustenance. It is memory, inheritance and, perhaps most importantly, a record: “Somali history on a plate,” as she puts it

4 days ago

Lure of being a social media chef means youngsters forgoing classic training, Michelin star cook warns

Scroll through your timeline of choice and it won’t be long until you land on a video posted by a social media chef trying to send their recipes viral.Such is the popularity of cooking videos that everyone from Michelin star masters to self-taught beginners like Brooklyn Beckham are setting up tripods on their kitchen counters to capture the perfect cut, cuisson or crust on their culinary creations.But the lure of social media could, according to some industry figures,be causing young cooks forgo the formal training of a catering college.Will Murray, who worked at the double Michelin-starred restaurant Dinner by Heston before opening his own critically acclaimed venue, Fallow, said social media cooking videos sometimes stretch the boundaries of what is possible.“Social media has helped people get into cooking

4 days ago

businessSee all