60-Second Overview

  • The Queensland Police Service costs QLD taxpayers $3 billion a year. And the entire machinery of justice often relies on a single, humble output: an accurate written record of what happened when police got there.
  • Early trials show that using automated transcripts to draft domestic violence paperwork is saving officers 10% of their shift. That is thousands of hours of capacity returned to the frontline.
  • We record everything but hear almost nothing. From prison calls (I'm told we listen to just 0.02% of them, even though people sometimes literally talk about murders on those calls) to Triple Zero and covert intercepts, we are sitting on a goldmine of unsearchable evidence.
  • The industry standard metric (Word Error Rate) is dangerous. Some vendors say they have a word error rate of 6%, but that's in office conditions.
  • Policing reality is different to offices. Domestic violence incidents are often emotionally charged, violent events with people talking over each other.
  • Policing is so different that I'm suggesting we create synthetic content and test how well VTT providers transcribe it.

The Humility of the Record

They asked me to do a desktop review of VTT (Voice To Text) solutions for QPS. Initially, I thought it would be a boring task. And then I read one of the most interesting sentences I’ve read while I’ve been at QPS.

One of the research documents I was studying said, “One of the most important elements of justice is an accurate record of what happened.” There is a humility and humanity when you stop to think that QPS, an organisation with helicopters, advanced forensics, and a $3 billion budget, ultimately relies on something as mundane as an officer’s notebook to underpin the justice system.

If we cannot accurately record what happened, courts grind to a halt. When you think about it, carefully recording events is actually a big part of what police do. Currently, it’s officers who make the notes, either while they’re on scene or once they get back to the station. Sometimes you have exhausted officers at 2 AM trying to recall chaotic events.

The work I’ve been asked to do is to consider Voice To Text (VTT) solutions that could automate transcription of the audio from Body Worn Camera footage.

We currently record vast amounts of audio that we never analyse. We store it, secure it, and ignore it because we lack the ears to listen. If we can transcribe all audio, we turn “noise” into “data”. It becomes searchable. We can collect and analyse it. Determine trends.

Consider the potential:

  • Prison Calls: We currently listen to approximately 0.02% of calls made by inmates. We know people are confessing to murders, coordinating drug supply, and threatening witnesses on these lines. We just don’t know which lines. Transcribing them all makes them searchable. We could flag every mention of a specific weapon or associate in seconds.
  • Triple Zero (000): We could add sentiment analysis to 000 calls. Audio analysis running alongside VTT could pick up on the silence and stress in a caller’s voice, helping dispatchers triage “silent calls” where a victim cannot speak but is in mortal danger.
  • Domestic Violence: By transcribing historical DV reports and Body Worn Camera audio, we can spot patterns of escalation that human memory might miss, perhaps identifying a subtle shift in language that precedes a homicide.
  • Translation: We could translate conversations with non-English speaking communities in real time, turning a barrier into a bridge.

The Trap of “Good Enough” Maths

So why haven’t we just switched it on? Because the risk of getting it wrong is catastrophic.

The tech industry loves its standard accuracy metric, the Word Error Rate (WER): the share of words a system gets wrong. In a consumer or business context, a 5% error rate is, at most, an annoyance.

But WERs are typically calculated using audio from very sterile environments, essentially offices with meetings. Policing is nothing like that. I’ve never been to one, but imagine showing up at a Domestic Violence incident. Emotional. Confrontational. Two people shouting at each other. The chaos of real life. Dogs barking. Traffic. People talking over each other.

Now consider the stakes. If an AI transcribes “I didn’t hit him” as “I did hit him,” it has got just one word wrong. Across a long transcript the WER barely moves, but the Critical Error Rate for that statement is 100%. It is an inversion of reality. We could send the wrong person to prison.
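To make the arithmetic concrete, here is a minimal sketch of how WER is usually calculated: the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the true one, divided by the number of words actually spoken. The function name and the 20-word statement are my own illustrations, not a vendor's implementation or real evidence.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level edit distance, computed one row at a time.
    # prev[j] holds the distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word was dropped
                curr[j - 1] + 1,     # extra word was inserted
                prev[j - 1] + cost,  # words match, or one was substituted
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical 20-word statement (invented for illustration).
spoken = ("he came at me with the bottle I pushed past him "
          "to get out the door I didn't hit him")
transcribed = spoken.replace("didn't", "did")

print(word_error_rate(spoken, transcribed))  # 1 wrong word out of 20
```

One wrong word in twenty gives a WER of 5%, the kind of figure that looks fine in a brochure, yet the statement now says the opposite of what was said.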

The Path Forward: Synthetic Reality

We are moving forward with VTT as a solution. The benefits are substantial. The suggestion I’ve made in my report is that we do not trust the brochure.

My proposal is that we create “synthetic reality” audio packs of what a real Domestic Violence incident would sound like. We should build sirens, wind noise, overlapping shouting matches, and even heavy Indigenous accents into our test data.

I think we should force the VTT engines to process that audio and measure them not on “words correct,” but on how often they hallucinate or invert the truth, alongside a list of other metrics. (The specific metrics I’ve suggested are below.)
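As a rough illustration of what "inverting the truth" could look like as a number, here is a minimal sketch that compares each true utterance with the machine's version and flags the pair when a negation has appeared or disappeared. The word list, the function names and the sample utterances are all my own assumptions. It is a crude heuristic for discussion, not one of the metrics from the report, and a real test would need human review of anything flagged.

```python
import re

# Assumed, deliberately small list of negation words; contractions such as
# "didn't" or "wasn't" are caught separately by their "n't" ending.
NEGATION_WORDS = {"not", "no", "never", "nobody", "nothing"}


def negation_count(utterance: str) -> int:
    """Count negation words and n't contractions in one utterance."""
    text = utterance.lower().replace("\u2019", "'")  # normalise curly apostrophes
    tokens = re.findall(r"[a-z']+", text)
    return sum(1 for t in tokens if t in NEGATION_WORDS or t.endswith("n't"))


def polarity_flipped(reference: str, hypothesis: str) -> bool:
    """True when the transcript gained or lost an odd number of negations."""
    return negation_count(reference) % 2 != negation_count(hypothesis) % 2


def inversion_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of (spoken, transcribed) utterance pairs whose polarity flipped."""
    if not pairs:
        raise ValueError("need at least one utterance pair")
    return sum(polarity_flipped(ref, hyp) for ref, hyp in pairs) / len(pairs)


# Invented utterances: (what was said, what the engine wrote).
sample_pairs = [
    ("I didn't hit him", "I did hit him"),
    ("he had a knife", "he had a knife"),
    ("she never came back", "she never came back"),
    ("I was not there", "I was there"),
]

print(inversion_rate(sample_pairs))  # 2 of the 4 pairs are inverted
```

The point of the design is that an engine is scored per statement, not per word, so a single dropped "not" counts as a complete failure for that utterance.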

We intend to share these benchmarks with other police jurisdictions. If we can solve this for the acoustic chaos of Queensland policing, we might help solve it for everyone.

The goal is simple: Save officers some time and, over time, turn the lights on in the dark corners of our evidence.