Artificial Intelligence Creates Tricky Connections Puzzles

Millions of people log on every day to play the latest puzzle from Connections, a popular category-matching game from The New York Times. Released in mid-2023, the game was played 2.3 billion times in its first six months. The concept is straightforward yet engaging: players have four attempts to identify the four topics hiding among 16 words.

Part of the fun for players is applying abstract thinking and semantic knowledge to identify the connecting meanings. But under the hood, puzzle creation is complex. Researchers at New York University recently tested the ability of OpenAI's GPT-4 large language model (LLM) to create engaging and creative puzzles. Their study, published as a preprint on arXiv in July, found that LLMs lack the metacognition needed to take the player's perspective and anticipate their reasoning, but with careful prompting and domain-specific subtasks, LLMs can still create puzzles on par with those of The New York Times.

[Image] Each Connections puzzle consists of 16 words (left) that must be sorted into 4 categories of 4 words each (right). Image: The New York Times

“Models like GPT don’t know how people think, so they have a hard time estimating how difficult a puzzle is for the human brain,” says lead author Timothy Merino, a PhD student in NYU’s Game Innovation Lab. “On the other hand, LLMs have a very impressive linguistic understanding and knowledge base due to the huge amounts of text they are trained on.”

The researchers first had to understand the basic game mechanics and figure out what makes the game so engaging. Some players may be familiar with certain groups of words, such as opera titles or basketball teams, but the challenge is not just a test of knowledge. “The challenge is to identify groups with misleading words that make their categorization ambiguous,” says Merino.

Intentionally distracting words serve as red herrings and are part of the game’s signature trickiness. When developing GPT-4’s generative pipeline, the researchers tested whether intentional overlaps and false groupings resulted in difficult yet entertaining puzzles.

[Image] A successful Connections puzzle contains intentionally overlapping words (top). NYU researchers incorporated a process for generating new word groups into their LLM approach to creating Connections puzzles (bottom). Image: NYU

This reflects the mindset of Connections creator and editor Wyna Liu, whose editorial approach takes into account “decoys,” words that seem to fit another category but don’t. Senior puzzle editor Joel Fagliano, who tests and edits Liu’s boards, has said that spotting a red herring is one of the hardest skills to learn. As he puts it, “More overlap makes the puzzle harder.” (The New York Times declined IEEE Spectrum‘s request for an interview with Liu.)

The NYU paper identifies Liu’s three axes of difficulty: word familiarity, category ambiguity, and variety of wordplay. Meeting these constraints is a unique challenge for modern LLM systems.

AI needs good prompts for good puzzles

The team first explained the rules of the game to the AI model, gave it examples of Connections puzzles, and asked it to create a new puzzle.

“We discovered that it’s really hard to write a comprehensive set of rules for Connections that GPT can follow and that always produces a good result,” says Merino. “We wrote a big set of rules, asked it to generate a few puzzles, and then inevitably discovered a new unspoken rule that we needed to incorporate.”

Lengthening the prompts didn’t improve the quality of the results, either. “The more rules we added, the more GPT seemed to ignore them,” Merino adds. “It’s difficult to follow 20 different rules and still come up with something smart.”

The team succeeded by breaking the task down into smaller workflows. A generator LLM creates puzzle word groups through iterative prompting, a step-by-step process that builds one or more sets of words within a single context before splitting them into separate candidates. Next, an editor LLM identifies each connecting theme and refines the categories. Finally, a human reviewer selects the highest-quality sets. Each LLM agent in the pipeline follows a limited set of rules, without needing an exhaustive explanation of the game’s intricacies. The editor LLM, for example, only needs to know the rules for naming categories and fixing errors, not how the game is played.
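The sketch below shows what such a staged pipeline can look like in code. It is a rough illustration rather than the NYU team's actual implementation: the prompts, the ask helper, and the choice of OpenAI's chat-completions API are assumptions made for the example.

```python
# Minimal sketch of a staged puzzle-generation pipeline, inspired by the
# approach described above. This is NOT the authors' code: the prompts,
# helper names, and model choice are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(system_prompt: str, user_prompt: str, model: str = "gpt-4") -> str:
    """Send one self-contained request and return the model's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


# Stage 1: a generator agent proposes candidate word groups.
# It only needs the rules for building groups, not the whole game.
groups = ask(
    "You propose groups of 4 related words for a Connections-style puzzle. "
    "Each group should share a clear theme and include words that could "
    "plausibly be mistaken for members of another group.",
    "Propose 4 groups of 4 words each, one group per line.",
)

# Stage 2: an editor agent names each category and fixes obvious errors.
# It only needs the rules for naming and cleanup, not how the game is played.
edited = ask(
    "You edit Connections-style word groups: give each group a concise "
    "category name and remove or replace words that do not fit its theme.",
    f"Edit these groups:\n{groups}",
)

# Stage 3: a human reviewer picks the best candidate puzzles by hand.
print(edited)
```

Keeping each stage's instructions short mirrors the researchers' finding that agents given a narrow set of rules behave more reliably than a single prompt trying to carry 20 rules at once.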

To test how engaging the model’s puzzles were, the researchers collected 78 responses from 52 human players comparing the LLM-generated sets with real Connections puzzles. The surveys confirmed that GPT-4 could produce novel puzzles of comparable difficulty that held their own in player preference.

[Image] In about half of the comparisons with real Connections puzzles, human players rated the AI-generated versions as equally or more difficult, more creative, and more fun. Image: NYU

Greg Durrett, an associate professor of computer science at the University of Texas at Austin, calls the NYU study an “interesting benchmark task” and fertile ground for future work on understanding set operations such as semantic groupings and solutions.

Durrett explains that while LLMs are great at generating different sets of words or acronyms, their results can be mundane or less interesting than human creations. He adds, “The [NYU] researchers have put a lot of work into developing the right input strategies to generate these puzzles and get high-quality results from the model.”

Julian Togelius, director of the NYU Game Innovation Lab, associate professor of computer science and engineering, and a co-author of the paper, says the group’s task-decomposition workflow could be applied to other titles, such as Codenames, a popular multiplayer board game. Like Connections, Codenames is about identifying similarities between words. “We could probably use a very similar method with good results,” Togelius adds.

While LLMs may never match human creativity, Merino believes they will make excellent assistants for today’s puzzle designers. Their training gives them access to huge pools of words: GPT can list 30 shades of green in seconds, while a human needs a minute just to think of a few.

“If I wanted to do a puzzle using the ‘shades of green’ category, I would be limited to the shades I know,” says Merino. “GPT told me about ‘celadon,’ a shade I didn’t know. To me, that sounds like the name of a dinosaur. I could ask GPT for 10 dinosaurs whose names end in ‘-don’ for a tricky follow-up group.”
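As a quick illustration of that assistant role, a designer might query the model for raw candidate lists and curate them by hand. Again, this is only a sketch under assumed prompts and helper names, not a tool described in the paper.

```python
# Small sketch of the "design assistant" use Merino describes: asking the
# model to brainstorm candidate word lists that a human designer then curates.
# The prompts and the brainstorm helper are illustrative, not from the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def brainstorm(request: str, model: str = "gpt-4") -> str:
    """Ask the model for a raw list of candidates; a human picks the keepers."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": request}],
    )
    return response.choices[0].message.content


print(brainstorm("List 30 shades of green, one per line."))
print(brainstorm("List 10 dinosaurs whose names end in '-don', one per line."))
```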
