Working with DAWN at the University of Cambridge

Mortar is working with the DAWN AI system at the University of Cambridge to establish the first validation framework for assessing behavioural consistency in AI models.

The question of whether an AI system behaves consistently — and what consistency even means in the context of human need — is one of the most consequential and least resolved problems in applied AI.

Establishing the terms

There is a difference between an AI system that works and one that works reliably, in the right way, for the right people, across the full range of situations it will encounter. The distinction matters enormously when systems are designed to understand and respond to human behaviour. A model that performs well on average can still fail catastrophically for specific individuals, in specific contexts — and in high-stakes environments, that failure is not a metric. It is a consequence.

This is the problem we are beginning to address properly. Working with the DAWN AI system at the University of Cambridge, Mortar is establishing the first validation framework for assessing behavioural consistency in AI models. It is a significant step, and one that has been a long time in the making.

Building towards this

Our work in AI has been deliberate in its development. Rather than adopting the pace and posture of the broader sector — where speed of deployment often outpaces understanding of what has actually been built — we have been asking harder questions about where AI can add genuine value and what it means to apply it responsibly.

We have been applying new technologies to support and extend our existing design systems, while building out our Behavioural Intelligence capability through Hxly. The development of our first behavioural models — designed to help systems understand and respond more effectively to human needs — has been central to this. What has become increasingly clear through that work is that building a model is only the beginning. Knowing whether it behaves as intended, and whether that behaviour holds across different users and contexts, is the harder and more important challenge.

What the DAWN collaboration makes possible

The DAWN AI system provides access to the kind of high-performance computing infrastructure that genuinely changes what is possible in AI research. For us, the opportunity is not simply about scale. It is about rigour.

We will be working alongside our partners at Brunel University London's Arts, Health and Social Change Research Group, with Professor Dominik Havsteen-Franklin and Dr Ivan Girina, to build a validation framework that draws on established therapeutic assessment methodologies alongside games design principles. This is an unusual combination — and deliberately so. Therapeutic assessment offers structured approaches for understanding how individuals respond and engage over time, including how that engagement varies and what that variation signals. Games design offers its own vocabulary for consistency, challenge calibration, and behavioural feedback loops. Together, they open a more precise and human-grounded way of asking whether a behavioural model is doing what it should.

The aim is not to produce a generic validation instrument. It is to develop a methodology with enough specificity and conceptual depth to evaluate whether AI systems behave in ways that are genuinely aligned with the humans they are designed to serve — and to surface clearly when they do not.

Why this is necessary

The accelerating deployment of AI in contexts that touch human welfare — health, social care, education, public services — is moving faster than the field's capacity to evaluate what is actually being deployed. Most current assessment focuses on performance metrics that measure accuracy, recall, and efficiency within defined test conditions. These are not the same as understanding whether a system responds appropriately when conditions shift, when users fall outside the training distribution, or when the stakes of a failure are human rather than technical.
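
The gap is easiest to see with numbers. The sketch below uses entirely hypothetical figures, invented for illustration rather than drawn from our models or this collaboration, to show how a healthy aggregate score can conceal near-constant failure for one group of users:

    # Hypothetical results: (number correct, number of cases) per context.
    # These figures are invented for illustration only.
    results = {
        "majority_context": (940, 1000),  # 94% correct
        "edge_context": (11, 50),         # 22% correct
    }

    total_correct = sum(correct for correct, _ in results.values())
    total_cases = sum(cases for _, cases in results.values())
    print(f"Overall accuracy: {total_correct / total_cases:.1%}")  # 90.6%

    # The same results, broken down per context, tell a different story.
    for context, (correct, cases) in results.items():
        print(f"{context}: {correct / cases:.1%} across {cases} cases")

A consistency-oriented validation framework has to surface the second view as a matter of course, not as an afterthought.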

Behavioural consistency — the degree to which a system holds to its intended function across the full and varied range of human interaction — is something the field currently lacks agreed, standard methods for measuring. This is a gap with real consequences, and it is the gap we are working to close.

The validation framework we develop through this collaboration will be the first structured attempt to address this problem systematically. It draws on disciplines not typically brought into conversation with AI development, and it does so because the problem requires it.

Moving forward

This collaboration represents a meaningful milestone for us, and for the development of Hxly as a platform for behavioural intelligence. We are building something that the sector needs — not just for our own models, but as infrastructure for the responsible development of human-aligned AI more broadly.

We will be sharing more as the work progresses. If you are interested in what we are building, or in the research underpinning it, we would welcome the conversation.

Get in touch