Crossposted from the LessWrong Forum. May contain more technical and AI Safety jargon than usual.


Recently I spent some time thinking about how studying the human side of human-machine systems could help us build aligned AIs. I discussed these ideas informally, and people seemed interested and wanted to know more. I therefore decided to write a list of research directions for studying humans that could help solve the alignment problem.

The list is non-exhaustive. Also, the intention behind it is not to argue that these research directions are more important than any other but rather to suggest directions to someone with a related background or personal fit in studying humans. There is also a lot of valuable work in AI Strategy that involves studying humans, which I am not familiar with. I wrote this list mostly with Technical AI Safety in mind.

Human-AI Research Fields

Before diving into my suggestions for studying humans with AI Safety in mind, I want to mention some less well-known research fields that study the interactions between human and AI systems in different ways, since I reference some of these below. Leaving aside the usual suspects of psychology, cognitive science and neuroscience, other interesting research areas I came across are:


Cybernetics

A “transdisciplinary” approach defined by Norbert Wiener in 1948 as “the scientific study of control and communication in the animal and the machine”. It is currently mostly used as a historical reference and foundational reading. However, there is growing work on integrating cybernetics concepts into current research.

Human-AI Interaction

Human-Computer Interaction (HCI) is an established field dating back to the 70s. It “studies the design and use of computer technology, focused on the interfaces between people and computers”. Human-AI Interaction is a recently established sub-field of HCI concerned with studying specifically the interactions between humans and “AI-infused systems”.

Computational Social Science

“Using computers to model, simulate, and analyze social phenomena. It focuses on investigating social and behavioural relationships and interactions through social simulation, modelling, network analysis, and media analysis”

Collective Intelligence

Defined as “the enhanced capacity that is created when people work together, often with the help of technology, to mobilise a wider range of information, ideas, and insights”

Artificial Social Intelligence

Which some define as “the domain aimed at endowing artificial agents with social intelligence, the ability to deal appropriately with users’ attitudes, intentions, feelings, personality and expectations”

Research ideas to study humans with AI Safety in mind

1 - Understand how specific alignment techniques interact with actual humans

Many concrete proposals for AI Alignment solutions, such as AI Safety via Debate, Recursive Reward Modelling or Iterated Distillation and Amplification, involve human supervision. However, as Geoffrey Irving and Amanda Askell argued, we do not know what problems may emerge when these systems interact with real people in realistic situations. Irving and Askell suggested a specific list of questions to work on. The list is primarily aimed at the Debate technique, but knowledge gained about how humans perform with one approach is likely to partially generalize to other approaches (I also recommend reading the LessWrong comments on their paper).

Potentially useful fields: Cognitive science, Human-AI Interaction.

2 - Demonstrate where factored cognition and evaluation work well

Factored cognition and evaluation refer to mechanisms for addressing open-ended cognitive tasks by breaking them down (or factoring them) into many small and mostly independent tasks. Note that the possibly recursive nature of this definition makes it hard to reason about the behaviour of these mechanisms in the limit. Paul Christiano has already made the case for better understanding factored cognition and evaluation when describing what Ought is doing and why it matters. Factored cognition and evaluation are major components of numerous concrete proposals to solve outer alignment, including Paul’s own. It therefore seems important to understand the extent to which factored cognition and evaluation work well for solving meaningful problems. Rohin Shah and Buck Shlegeris mentioned that they would love to see more research in this direction for similar reasons, and also because it seems plausible to Buck that “this is the kind of thing where a bunch of enthusiastic people could make progress on their own”.
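To make the recursive structure concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the method used by Ought or in any specific alignment proposal: the names `factored_answer`, `decompose`, `answer_leaf` and `combine`, the `max_depth` cap, and the toy arithmetic task are all hypothetical.

```python
from typing import Callable, List

# Hypothetical interfaces, introduced only for this illustration.
Decompose = Callable[[str], List[str]]      # sub-tasks, or [] if the task is small enough
AnswerLeaf = Callable[[str], str]           # answers a small task directly
Combine = Callable[[str, List[str]], str]   # merges sub-answers into one answer


def factored_answer(task: str, decompose: Decompose, answer_leaf: AnswerLeaf,
                    combine: Combine, max_depth: int = 3) -> str:
    """Recursively factor a task, answer the leaves, and combine the results."""
    # The depth cap is an assumption: without it, the recursion is only as
    # well-behaved as the decomposition function itself.
    subtasks = decompose(task) if max_depth > 0 else []
    if not subtasks:
        return answer_leaf(task)
    sub_answers = [
        factored_answer(sub, decompose, answer_leaf, combine, max_depth - 1)
        for sub in subtasks
    ]
    return combine(task, sub_answers)


# Toy task: evaluate a sum such as "1+2+3+4" by splitting it in half.
def split_sum(task: str) -> List[str]:
    terms = task.split("+")
    if len(terms) == 1:
        return []  # a single number needs no further factoring
    mid = len(terms) // 2
    return ["+".join(terms[:mid]), "+".join(terms[mid:])]


def read_number(task: str) -> str:
    return task.strip()


def add_answers(task: str, sub_answers: List[str]) -> str:
    return str(sum(int(answer) for answer in sub_answers))


result = factored_answer("1+2+3+4", split_sum, read_number, add_answers)
single = factored_answer("5", split_sum, read_number, add_answers)
```

In this toy case each sub-task really is independent, so the combination step is trivial. The open empirical question is how far that property carries over to meaningful problems, where humans would play the role of the three helper functions.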

Potentially useful fields: Cognitive science, Collective Intelligence

3 - Unlocking richer feedback signals

Jan Leike et al. asked whether feedback-based models (such as Recursive Reward Modelling or Iterated Distillation and Amplification) can attain sufficient accuracy with an amount of data that we can produce or label within a realistic budget. Explicitly expressing approval for a given set of agent behaviours is time-consuming and often an experimental bottleneck. Among themselves, humans tend to use more sample-efficient feedback methods, such as non-verbal communication. The most immediate way of addressing this question is to work on understanding preferences and values from natural language, which is being tackled but remains unsolved. Going further, can we train agents from head nods and other micro-expressions of approval? There are already examples of such work coming out of Social Signal Processing. We can extend this idea as far as training agents using brain waves, which would take us to Brain-Computer Interfaces, although this direction seems further away in time. Additionally, it makes sense to study these feedback channels because systems could learn to read them on their own, and we would want to be familiar with them if they do.

Potentially useful fields: Artificial Social Intelligence, Neuroscience

4 - Unpacking interpretability

Interpretability seems to be a key component of numerous concrete solutions to inner alignment problems. However, improving our understanding of transparency and interpretability is itself an open problem. This probably requires both formal work on robust definitions of interpretability and a better understanding of the human cognitive processes involved in understanding, explaining and interpreting things. I would not be happy if we ended up with some interpretability tools that we trust for socially idiosyncratic reasons but that are not de facto safe. I would be curious to see work that tries to decouple these ideas and helps us get out of the trap of interpretability as an ill-defined concept.

Potentially useful fields: Human-AI Interaction, Computational Social Science.

5 - Understanding better what “learning from preferences” means

When talking about value alignment, I have heard a few times an argument that goes like this: “while I can see that the algorithm is learning from my preferences, how can I know that it has learnt my preferences?” This is a hard problem since latent preferences seem to be somewhat unknowable in full. While we certainly need some work on ensuring generalisation across distributions and avoiding unacceptable outcomes, it would also be useful to better understand what would make people think that their preferences have been learnt. This could also help with concerns like gaming preferences or deceitfully soliciting approval.
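The gap between “learning from” and “having learnt” preferences can be shown with a toy example. The sketch below fits a Bradley-Terry model (one common way to turn pairwise comparisons into scores, and not necessarily what any specific proposal uses) with plain gradient ascent. The items, the comparisons, the default score of 0.0 for unseen items, and the function names are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Comparison = Tuple[str, str]  # (preferred item, rejected item)


def fit_scores(comparisons: List[Comparison], steps: int = 200,
               lr: float = 0.5) -> Dict[str, float]:
    """Fit one score per item by batch gradient ascent on the Bradley-Terry log-likelihood."""
    scores: Dict[str, float] = {}
    for winner, loser in comparisons:
        scores.setdefault(winner, 0.0)
        scores.setdefault(loser, 0.0)
    for _ in range(steps):
        grads = {item: 0.0 for item in scores}
        for winner, loser in comparisons:
            # Model probability that the preferred item wins the comparison.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        for item in scores:
            scores[item] += lr * grads[item]
    return scores


def agreement(scores: Dict[str, float], comparisons: List[Comparison]) -> float:
    """Fraction of comparisons where the preferred item has the strictly higher score."""
    # Assumption: items never seen in training keep a default score of 0.0.
    correct = sum(
        scores.get(winner, 0.0) > scores.get(loser, 0.0)
        for winner, loser in comparisons
    )
    return correct / len(comparisons)


# Hypothetical feedback: the person ranks A > B > C > D in the comparisons shown.
train = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C"), ("B", "D")]
# Held-out preferences: one follows from the ranking; the other involves
# an item "E" that the person loves but was never asked about.
held_out = [("A", "D"), ("E", "A")]

scores = fit_scores(train)
train_agreement = agreement(scores, train)
held_out_agreement = agreement(scores, held_out)
```

Here the model agrees with every training comparison (`train_agreement` is 1.0) yet gets only half of the held-out ones right (`held_out_agreement` is 0.5), because nothing in the feedback revealed the preference for `E`. A person who saw only the training fit could easily conclude that their preferences had been learnt.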

Potentially useful fields: Psychology, Cognitive Science, Human-AI Interaction

6 - Understanding value formation in human brains

This is something that I am less familiar with, but let me put it out there for debate anyway. Since we want to build systems that are aligned and compatible with human values, would it not be helpful to better understand how humans form values in their brains? I do not think that we should _copy_ how humans form values, as there could be better ways to do it, but knowing how we do it could be helpful, to say the least. There is ongoing work in neuroscience to answer such questions.

Potentially useful fields: Neuroscience

7 - Understanding the risks and benefits of “better understanding humans”

Some think that if powerful AI systems could understand us better, for example through more advanced sentiment recognition, there would be a significant risk that they could deceive and manipulate us more effectively. Conversely, others argue that if powerful AI systems cannot understand certain human concepts well, such as emotions, it may be easier for misaligned behaviour to emerge. While an AI with deceptive intentions would be problematic for many reasons other than its ability to understand us, it seems interesting to better understand the risks, benefits and trade-offs of enabling AI systems to understand us better. It might be that these are no different from those of any other capability, or it might be that there are some interesting specificities. Some have also argued that access to human modelling could be more likely to produce mesa-optimizers, learnt algorithms that have their own objectives. This argument hinges on the idea that, since humans often act as optimizers, reasoning about humans would lead these algorithms to learn about optimization. A more in-depth evaluation of what reasoning about humans would involve could provide more evidence about the weight of this argument.

Potentially useful fields: Cognitive Science, AI Safety Strategy.

8 - Work on aligning recommender systems

Ivan Vendrov and Jeremy Nixon made a compelling case for why working on aligning existing recommender systems can not only lead to significant social benefits but also have positive flow-through effects on the broader problem of AGI alignment. Recommender systems likely draw on the largest existing datasets of real-world human decisions. Therefore, working on aligning them will require significantly more advanced models of human preferences and values, such as metrics of extrapolated volition. It could also provide a large-scale real-world testing ground for techniques of human-machine communication, such as interpretability and corrigibility.

Potentially useful fields: Human-AI Interaction, Product Design


The list is non-exhaustive and I am very curious to hear additional ideas and suggestions. Additionally, I am excited about any criticism or comments on the proposed ideas.

Finally, if you are interested in this topic, there are a couple of interesting further readings that overlap with what I am writing here, specifically:

Thanks to Stephen Casper, Max Chiswick, Mark Xu, Jiajia Hu, Joe Collman, Linda Linsefors, Alexander Fries, Andries Rosseau and Amanda Ngo, who shared or discussed with me some of the ideas in this post.