Talk to the City: an open-source AI tool for scaling deliberation

Bruno Marnette, Colleen McKenzie

This post gives an overview of Talk to the City and shares updates on our last few months of work. If you've already read the main project page, skip down to "Are frontier LLMs reliable enough?" for the updates.

GitHub

Motivations

The fast pace of technological development in the AI space is both a source of concern and a source of hope for democracy. On the pessimistic side, many worry that our existing democratic processes and institutions are too slow and inefficient to address the many crises humanity is facing. The more we lose trust in the efficacy of our democratic processes, the more we lose the broad participation and agreement on the legitimacy of outcomes that make democracy functional, which in turn further erodes trust in democratic institutions.

On the more optimistic side, however, there is hope that AI technologies may help us escape this spiral of declining trust by allowing us to consult the public in much larger numbers, at a much faster pace, and in far more inclusive and transparent ways that capture the diversity and nuance of opinion. Making this a reality won’t be an easy task, and building sufficient public trust in AI technologies will also take time, but now that a growing number of AI pioneers have been popularizing this idea (e.g. here and here), we believe that, with the appropriate AI safety precautions, it is time to accelerate the design and development of open-source prototypes and begin testing them in the wild.

This is the focus of our team at the AI Objectives Institute, a non-profit research lab focused on differential technology development and working to ensure that people can thrive in a world of rapidly deployed, extremely capable AI systems. Our tool Talk to the City (“TttC”) is a platform helping people collect and analyze democratic input at scale. Our first iterations date back to early 2023, when we used the tool to map the discourse on AI safety and ethics, but it has since evolved into something more ambitious, which we were proud to open-source earlier this month (October 2023).

Product introduction

What makes Talk to the City unique is the combination of the following features:

  • It leverages the power of frontier LLMs to analyze large datasets

  • It is designed to summarize human opinions, as opposed to just factual data

  • It automatically prepares summaries, visualizations and reports 

  • It is developed as a non-profit initiative 

Talk to the City’s data processing pipeline starts by ingesting a variety of data types, then uses LLMs to extract key arguments, and finally arranges similar arguments into clusters and subclusters. Users can navigate through a map of opinions and drill down to the subclusters they find most interesting.
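For readers who want a more concrete picture, here is a minimal sketch of what an extract-then-cluster flow can look like. It is an illustration rather than the actual TttC implementation: it assumes the OpenAI Python client and scikit-learn, and the model names, helper functions, and sample comments are placeholders.

```python
# Minimal sketch of an extract-then-cluster pipeline (not the actual TttC code).
# Assumes a recent OpenAI Python client and scikit-learn; all names are illustrative.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_arguments(comment: str) -> list[str]:
    """Ask an LLM to split one free-form comment into short, atomic arguments."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract the distinct arguments made in the comment. "
                        "Return one short argument per line."},
            {"role": "user", "content": comment},
        ],
    )
    return [line.strip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

def embed(texts: list[str]) -> np.ndarray:
    """Embed each argument so that similar arguments end up close together."""
    result = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in result.data])

# Placeholder inputs; a real run would use survey responses, transcripts, etc.
comments = ["AI could widen inequality unless access is broadly shared.",
            "Open models help researchers audit safety claims."]
arguments = [arg for c in comments for arg in extract_arguments(c)]
vectors = embed(arguments)

# Group similar arguments; a fuller pipeline would also label each cluster with an LLM.
n_clusters = min(2, len(arguments))
labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
for cluster_id in range(n_clusters):
    print(f"Cluster {cluster_id}:",
          [a for a, l in zip(arguments, labels) if l == cluster_id])
```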

The examples below show TttC's ability to ingest different types of data and produce reports in different languages:

  • https://tttc.dev/recursive is a report on high-priority topics in AI deployment, using data collected by the Polis tool, and letting you filter arguments by level of consensus (in partnership with Recursive Public).

  • https://tttc.dev/genai reports opinions on generative AI, and can be read either in English or Mandarin. The data came from a public consultation conducted in Taiwan by Audrey Tang and the Collective Intelligence Project.

  • Heal Michigan is a report generated using video transcripts from interviews with activists in Michigan, and providing links to the source videos. Interviews conducted in partnership with Silent Cry (coming soon).

Research questions

We're building this tool as a proof of concept and testing it with a variety of datasets, including data extracted from Twitter, blog posts, pol.is consultations, and video interviews.

Our primary goal with these first experimental deployments has been to improve our understanding of the risks and benefits of using LLMs in the context of democratic consultations, and to identify where LLMs are helpful in creating intuitive interfaces for complex datasets. In particular, we wanted to answer the following questions:

  • Are modern LLMs already reliable enough to prove helpful? 

  • Which interfaces best help users understand large corpora of opinion data?

  • How can we mitigate natural safety and quality concerns inherent in the use of current LLMs? 

  • How can we collaborate with and bring value to the broader researcher community?

While we’re still at the beginning of this journey, we have started to make some progress on all four axes. The rest of this article shares some of the relevant lessons that we have learned in the process.

Are frontier LLMs reliable enough?

While we’ve been consistently impressed by the capacity of modern LLMs to summarize factual information, accurately representing subjective information (e.g. human opinions and preferences) is more delicate, and even the best models as of this writing (Oct 2023) fall short of what we would need to trust fully automated processes. 

That being said, our experiments suggest that models such as GPT-4 and Claude 2 can already save analysts huge amounts of manual work, which is all we need to start scaling up consultations to involve more people in deliberative processes.

Aside from improved prompt engineering, we found that one of the best ways to reduce the rate of mistakes was to give the AI as much context as possible. For instance, clarifying the goals of a consultation can help the models extract more interesting and relevant arguments. Likewise, providing a few good examples of arguments in our prompts (a technique also known as “few-shot learning”) proved particularly effective when the examples were taken from the same domain as the consultation.
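To illustrate what this can look like in practice, the sketch below assembles a few-shot extraction prompt from a stated consultation goal and a couple of domain-specific example arguments. The goal text, examples, and helper name are hypothetical and not taken from the actual TttC prompts.

```python
# Hedged sketch: assembling a few-shot extraction prompt from consultation context.
# The goal, examples, and function name are illustrative, not the actual TttC prompts.

def build_extraction_prompt(goal: str, examples: list[str], comment: str) -> list[dict]:
    """Return chat messages that give the model the consultation goal and
    domain-specific examples of well-formed arguments before the real input."""
    example_block = "\n".join(f"- {e}" for e in examples)
    system = (
        "You are analyzing responses to a public consultation.\n"
        f"Goal of the consultation: {goal}\n"
        "Extract each distinct argument as a short, standalone sentence.\n"
        "Good arguments look like:\n"
        f"{example_block}"
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": comment}]

messages = build_extraction_prompt(
    goal="Understand public priorities for the deployment of generative AI.",
    examples=[
        "Generative AI tools should disclose when content is machine-generated.",
        "Smaller firms need affordable access to frontier models to stay competitive.",
    ],
    comment="Honestly I just want to know when I'm reading AI text, and I worry "
            "only big companies will be able to afford these systems.",
)
# The messages can then be passed to any chat-completion API, e.g. GPT-4 or Claude 2.
```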

In our experiments comparing the output quality of different models, we found that GPT-4 and Claude 2 were roughly equivalent – and significantly superior to open-source models such as Llama 2.

We also experimented with other applications of AI beyond argument extraction and obtained encouraging results for a few other tasks: 

  • We translated arguments back and forth between Mandarin and English using GPT-3.5, for a collaboration with Audrey Tang and the vTaiwan project, and the translations were nearly perfect (a minimal sketch of this kind of round-trip translation appears below this list).

  • We transcribed some video interviews with Descript and eventually got good transcripts, but we had to make quite a few manual fixes, and the punctuation was often not ideal.

  • We used GPT-4 to generate labels for different clusters and subclusters, and the results were also positive but more mixed – we often felt the need to polish these labels by hand. 

Overall, the number of manual changes that we had to make felt like reasonable overhead, and we expect that there will be decreasing need for manual edits as LLM capabilities keep improving.
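As a rough illustration of the round-trip translation mentioned above, the sketch below translates an argument from Mandarin to English and back with GPT-3.5 via the OpenAI chat API. The helper function, sample sentence, and back-translation check are illustrative assumptions, not the code we used for the vTaiwan collaboration.

```python
# Hedged sketch of round-trip (back-)translation with GPT-3.5; the helper name
# and checking step are illustrative, not the actual TttC translation code.
from openai import OpenAI

client = OpenAI()

def translate(text: str, source: str, target: str) -> str:
    """Translate text with a chat model, keeping the meaning as literal as possible."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"Translate the user's text from {source} to {target}. "
                        "Preserve meaning and tone; return only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip()

argument_zh = "生成式人工智慧應該要標示內容是由機器產生的。"  # placeholder argument
argument_en = translate(argument_zh, "Mandarin", "English")
back_zh = translate(argument_en, "English", "Mandarin")
# Comparing back_zh with argument_zh (by eye or with an LLM judge) is a cheap
# sanity check that the English rendering did not drift from the original.
print(argument_en)
```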

Product learnings thus far

While our user interface is still a work in progress, we have had the chance to test different designs in a series of usability testing sessions to get a clearer sense of what would bring the most value to different types of users. 

Experienced analysts want control: the people preparing the analysis of a dataset often want a fairly high level of control over how to organize and present the data. For instance, one analyst may want to break things down into three large clusters, while another might prefer a much more granular approach. Likewise, the definition of an “interesting” argument may differ from one use case to another. For this reason, we’re aiming to give experienced analysts a high level of control over the prompts and parameters of our algorithms.

Most readers prefer very simple interfaces: different consumers of the data often have different preferences on how to consume it – while our most tech-savvy beta testers appreciated looking at cluster maps and various plots, many users found these too abstract and preferred something simpler and less interactive. For this reason, the latest version of our UI starts by presenting the data in a simple “report” format while allowing users to expand the interactive maps included in the report to dig deeper into the data. We have also started to experiment with other UIs, including maps with hierarchical clusters and interactive chat interfaces. We plan to include these in the generated reports soon, but only after making sure that the designs remain simple and intuitive enough for all readers.

Pair analysis of ideas with analysis of agreement: users want to understand not just the opinions of a surveyed population, but also the proportion of the population that agrees with each point. In previous iterations, Talk to the City focused only on the variety and relationships of ideas and opinions held by a population, while other tools for deliberation (e.g. pol.is) highlighted the proportion of respondents that support a particular view. However, it became clear that these separate focuses complement each other when used together, and we recently added capabilities allowing Talk to the City to display the number of upvotes/downvotes and filter arguments by level of consensus. This proved particularly helpful when we collaborated with vTaiwan and Chatham House to extract areas of agreement from a broad-ranging consultation on AI, as shown in the screenshot below.
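To make the consensus filtering concrete, the sketch below computes an agreement rate per argument from upvote/downvote counts and keeps only the arguments above a threshold. The Argument structure and the 0.7 cutoff are illustrative assumptions, not the pol.is or TttC internals.

```python
# Hedged sketch: filtering arguments by level of consensus from vote counts.
# The Argument structure and the 0.7 threshold are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Argument:
    text: str
    upvotes: int
    downvotes: int

    @property
    def agreement(self) -> float:
        """Share of voters who agreed; 0.0 if nobody voted."""
        total = self.upvotes + self.downvotes
        return self.upvotes / total if total else 0.0

def filter_by_consensus(arguments: list[Argument], threshold: float = 0.7) -> list[Argument]:
    """Keep only arguments whose agreement rate meets the threshold."""
    return [a for a in arguments if a.agreement >= threshold]

args = [
    Argument("Model providers should publish safety evaluations.", upvotes=84, downvotes=9),
    Argument("All AI research should be paused indefinitely.", upvotes=12, downvotes=61),
]
for a in filter_by_consensus(args):
    print(f"{a.agreement:.0%} agreement: {a.text}")
```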

The path to safety and quality

To be a trustworthy tool for analysis, Talk to the City needs to present accurate, high-fidelity representations of the opinions expressed by all participants, even when those opinions are politically incorrect or widely considered immoral. This means that some of the current approaches to AI safety – such as RLHF, when used to make model outputs less offensive – may in fact introduce bias that skews an LLM's representation of a person's stated opinion.

One of our goals is to ensure that our product accurately extracts and summarizes all the key opinions and arguments expressed by participants – and we plan to design novel frameworks for evaluating and ensuring this accuracy.  As a starting point, we have identified a set of questions for participants to evaluate how well their views were represented, including:

  • Did the AI capture all your key arguments?

  • Did the AI provide a fair and accurate summary of each argument?

  • Did the AI preserve the initial tone and intention of your original argument?

Early discussions with participants have confirmed that there is often room for progress on the third point. For example, one interviewee in our studies, a formerly incarcerated individual, said "living a good life after prison is hard but possible" – but our AI broke this down into two atomic arguments, one saying “living a good life after prison is possible” (implicitly suggesting that it is easy) and the other saying “living a good life after prison is hard" (suggesting it might not be possible). While each statement is arguably accurate, each fails to capture the intended message when read separately.

Our plan is to start collecting feedback from large samples of participants and use their input to train new language models (or fine-tune existing ones) to detect and mitigate such issues in more automated and scalable ways.

First open-source release

We conducted our first experiments both with datasets we collected ourselves and with data from partners with whom we worked in close collaboration. It was convenient for us to start with a private codebase so we could iterate quickly, but our goal has been to build a more portable tool that allows other teams to run their own experiments.

We’re excited to announce that teams can now use our open-source repo to produce their own reports, using their own data and customizing the process as they see fit.

Our near-term goal is to encourage large numbers of open-source contributors to test the limits of the current implementation and propose improvements. To this end, we have tried to keep the codebase as simple as possible, allowing users to run the pipeline on their own desktop without needing to set up any cloud infrastructure. As the project matures, however, we plan to add the required infrastructure to allow other teams to deploy Talk to the City as a service on their own servers, with an admin UI to upload data and run the AI pipeline.


As a non-profit organization, we’re always looking for volunteers to help us design, develop, and test the next iterations of our products. Please contact hello@objective.is if you’re interested in hearing more or exploring potential collaborations.
