In the first of a new series, co-op member Howard David Ingham explains why we aren’t quite ready to submit to our robot overlords.
Are we worried that the growing availability of cheap AI speech-to-text software might put us out of a job?
You won’t be surprised that we’ve heard that one a lot recently.
Engineers have been working for a whopping 70 years to get a computer to write down what a person is saying without anyone else having to listen. It’d be interesting to get into why this is a thing and who’s paying for it (a big shout-out to everyone listening at GCHQ, by the way). But let’s stick to the subject: AI transcription utilities, powered by machine learning, really are getting better and better. I can see a time in the near future when an AI transcription service might be as good as the “real thing.”
We’re in a world where you can even do it on your phone. In fact, I dictated that last paragraph on my phone while walking the dog, because I was curious to see how good it was. So are we old-school transcribers going to go the way of the triceratops and the typesetter? The simple answer is: not for a while yet, and probably not entirely.
OK, then, so AI isn’t going to put us out of a job. So it’s going to make our jobs easier, right? Well, no, that’s not really true either.
Let’s look more closely at both of those questions.
Let’s talk about Intelligent Verbatim
Like a lot of transcription services, the folks at TypeOlogy use the Intelligent Verbatim method. When you transcribe using Intelligent Verbatim, you do a bit of cleaning up as you go. You skip the ums and ahs and the occasional “y’know” or “I mean”, cutting the words we inevitably repeat while we’re getting our thoughts in order. Then you fix the punctuation, because people don’t actually punctuate their speech. And all of this depends upon interpreting the speech we’re transcribing in a way that’s sensitive and appropriate. Right now, AI just isn’t very good at that. It’s starting to get better at it – for example, you can set up some online services to remove ums and ahs automatically – but it’s still hit and miss, and when people are talking at length, especially when it’s an unscripted conversation, the result is often a bit of a mess.
To demonstrate, below is the raw transcript of the preceding paragraph, read out clearly and at a normal speaking pace, and put through one of the better online AI speech-to-text services (one I do actually use from time to time):
OK, that isn’t terrible, but it is still going to take a bit of cleaning up. Now imagine what that’d look like with the stops, starts, and interjections of unrehearsed, unread speech.
The proof is in the word salad
We’ve noticed an increasing number of clients coming to us with automated transcripts, asking if we’ll edit them to make them fit for purpose. Surely a half-decent transcript is better than no transcript at all? Doesn’t that make things a lot easier?
Well, that depends on who’s talking. Audio from an experienced, trained orator (a barrister or politician, perhaps) will come out better than audio from someone with a regional accent describing personal experiences in an informal conversation, for example. This is another problem with AI transcription: it remains less accessible to people without a certain kind of education or certain class markers. But even your very best case is probably going to look something like the text that my example generated.
Isn’t that still better than not having it? Crucially, this really depends on your skill set. The fact is, working on an automated transcription requires a different set of skills.
An experienced transcriber with the right equipment can do a full transcription without really stopping typing, just occasionally taking their foot off the pedal while they catch up with the audio.
But a pre-produced transcription is different. No matter how perfect the transcription is, there isn’t currently an AI that can do Intelligent Verbatim. You’ve got to stop and start the audio while you separate out the bits where the AI got the speakers mixed up. You need to fix the punctuation. You have to remove the “yeahs” and the “OKs” and all the other phatic utterances people unconsciously make while they’re listening. You’ll have to go through and sort out the proper names that the AI just didn’t know, and occasionally streamline some phrasing. And from time to time some background noise, or two people talking at the same time, or just a strong regional accent will generate a nice tasty word salad.
No matter how accurate the AI output, the task of correcting it stops being a transcription job and becomes much more of an editorial and proofing job. And on top of that, it’s one that needs you to be listening to the audio at the same time. As a result, the time a skilled editor winds up spending on it works out roughly the same as the time a transcriber would take to type it up from scratch – sometimes even longer.
This is why, as members of an ethical co-operative that believes in compensating its workers fairly, we don’t currently offer a discount for editing automated transcripts. As AI transcriptions improve, this might change, but for the moment, the technology still has a way to go.
This doesn’t mean that AI services aren’t a valuable tool. They will definitely change how we work in years to come, but they’re still not at the stage where they make our lives easier, let alone put us out of jobs. They’re just another tool, and they create different challenges.
