The Speech-to-text service

eResearch Services has developed a speech-to-text service that automatically transcribes audio data. The service is safe and secure, and it complies with Australian government, Griffith, and ARC funding requirements.

What is the speech-to-text service

The speech-to-text service uses the Microsoft Azure transcription service to transcribe your audio data. The process is automated with machine learning models, and Microsoft's wealth of training data helps these models produce highly accurate transcriptions.

Unlike other transcription services available, the speech-to-text service is secure. That is, your data will not be shared with third parties, mined for information, or stored in an off-shore location.


The cost of the service is broken down into two components:

Transcription: ~$2 per audio hour

Storage of uploaded and created files: ~$0.03 per GB per month

A budget limit is used to monitor usage costs. When spending reaches certain percentages of the budget, email alerts are sent and appropriate actions are taken.
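
As a rough illustration of the cost breakdown above, the following Python sketch estimates a project's cost from the approximate rates quoted on this page. The project figures (40 audio hours, 10 GB stored for 12 months, a $200 budget) are hypothetical, and the actual alert percentages are not stated here, so the sketch only reports how much of the budget would be used.

```python
# Approximate rates quoted on this page; treat them as estimates only.
TRANSCRIPTION_RATE_PER_AUDIO_HOUR = 2.00  # ~$2 per audio hour
STORAGE_RATE_PER_GB_MONTH = 0.03          # ~$0.03 per GB per month


def estimate_cost(audio_hours, storage_gb, months):
    """Return a rough (transcription, storage, total) estimate in dollars."""
    transcription = audio_hours * TRANSCRIPTION_RATE_PER_AUDIO_HOUR
    storage = storage_gb * STORAGE_RATE_PER_GB_MONTH * months
    return transcription, storage, transcription + storage


def percent_of_budget(total_cost, budget):
    """Return the share of the budget that total_cost represents, in percent."""
    return total_cost / budget * 100


# Hypothetical project: 40 hours of interviews, 10 GB kept for 12 months,
# with a requested budget of $200.
transcription, storage, total = estimate_cost(40, 10, 12)
print(f"Transcription: ~${transcription:.2f}")
print(f"Storage:       ~${storage:.2f}")
print(f"Total:         ~${total:.2f}")
print(f"Budget used:   ~{percent_of_budget(total, 200):.1f}%")
```

In this kind of estimate, transcription usually dominates: storage only becomes significant for large files kept over long periods.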

How do I get an account

To get a speech-to-text account, please fill in this form.

Once a request has been received, eResearch Services will create the account with the requested budget allocation, and access for the requested users.

Access to the application is only available to Griffith University members. If you would like to add someone external to Griffith, you will need to fill out this form to get them a Griffith visitor account.

Speech-to-text examples

Examples of speech-to-text transcriptions can be found here. These examples come straight from the speech-to-text transcription service, with no alterations.

Audio quality has the greatest effect on transcription accuracy. Below are some tips to improve your audio recording, and thereby your transcription.

Improving the accuracy of transcriptions


You should consider the following tips in preparation for recording an interview for transcription with Griffith's speech-to-text service.

1. Please speak as clearly as possible, and avoid speaking too fast.

2. Try not to speak over each other. The transcription cannot differentiate overlapping speakers and may merge their speech into the same sentence.

3. Pause between sentences. The AI uses gaps in speech to place punctuation.

4. When conducting phone interviews, please use the phone's internal recorder rather than a separate, external recorder. Using an external recorder for phone interviews will significantly decrease transcription accuracy, especially for the person being interviewed.

Recording Devices

There are several options available to record a phone interview or virtual meeting.

Record from a Griffith Desktop Phone

Griffith desktop phones can record calls. Dial extension 59788 during the call to create a voicemail recording that can be downloaded later from the self-care portal. For more information, please see the user guide.

Important: By default, the maximum recording time on Griffith desktop phones is 5 minutes. Please contact 55555 to extend this time limit.

Microsoft Teams

A comprehensive video about recording interviews in Teams can be found here.

Jabber


To record a Jabber call, you will need to use screen-recording or audio-recording software. If you record video, you will then need to extract the audio from it. The following programs are recommended for video capture:

On Windows 10, use the built-in program "Game Bar".

On macOS Mojave or later, use the built-in "Screenshot toolbar".

There are also downloadable tools, such as Audacity and OBS, that can be used to capture recordings on your computer.

You can then extract the audio from the video file. VLC media player is the recommended software and comes installed on all Griffith computers. The process is shown in the following images.

screenshot of setting panel

screenshot of setting panel

You can also use the command line tool "ffmpeg" to extract audio from video, and to change audio formats.
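
As a minimal sketch of the ffmpeg approach, the Python snippet below builds a command that extracts the audio track from a video file. The file names are hypothetical, and the snippet only prints the command instead of running it, so you can check it before use. It relies on two standard ffmpeg options: -i (input file) and -vn (drop the video stream).

```python
import shlex


def build_extract_audio_command(video_path, audio_path):
    """Build an ffmpeg command that writes only the audio of a video.

    -i  names the input file.
    -vn tells ffmpeg not to include the video stream in the output.
    ffmpeg chooses the output audio format from the extension of audio_path.
    """
    return ["ffmpeg", "-i", video_path, "-vn", audio_path]


# Hypothetical file names, for illustration only.
command = build_extract_audio_command("interview.mp4", "interview.wav")
print(shlex.join(command))

# To actually run it (requires ffmpeg to be installed and on your PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

The same command shape can change audio formats: pass an audio file as the input and use a different extension for the output.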

Things to consider when running the speech-to-text transcription

Speech Diarization - Diarization attempts to separate the speech of each individual speaker, for up to two people. To enable it, go to your speech-to-text web portal and tick the “Diarization” checkbox in the advanced settings. If you run diarization on audio that was captured in stereo, you will first need to convert it to mono. VLC media player can convert stereo recordings to mono, and the conversion is also possible using ffmpeg. Check out this How-To guide.
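
The stereo check and conversion described above can be sketched in Python as follows. This is an illustration, not the method from the How-To guide: it assumes the recording is a WAV file (the standard-library wave module cannot read other formats), the function and file names are hypothetical, and the ffmpeg command is only printed, not run. The -ac 1 option is ffmpeg's standard way to request a single audio channel.

```python
import os
import tempfile
import wave


def channel_count(wav_path):
    """Return the number of audio channels in a WAV file (1 = mono, 2 = stereo)."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnchannels()


def build_mono_command(input_path, output_path):
    """Build an ffmpeg command that writes a single-channel copy of the input."""
    # -ac 1 sets the number of output audio channels to one.
    return ["ffmpeg", "-i", input_path, "-ac", "1", output_path]


# Demo only: write one second of silent stereo audio to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "stereo_demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(2)      # stereo
    wav_file.setsampwidth(2)      # 16-bit samples
    wav_file.setframerate(8000)   # 8 kHz
    wav_file.writeframes(b"\x00" * (8000 * 2 * 2))

if channel_count(demo_path) > 1:
    print("Stereo recording: convert to mono before running diarization.")
    print(" ".join(build_mono_command(demo_path, "mono_demo.wav")))
else:
    print("Recording is already mono.")
```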

AI Model - Choose the latest AI model for your transcription. Models are named by date in year, month, day order, so please pick the model with the most recent date. For example, 20201019 was the latest model when this document was written.
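
If you have a list of model names in front of you, the naming rule above can be checked with a short Python sketch. Apart from 20201019, the model names below are made up for illustration, and the portal may display names differently. Names that do not parse as a year-month-day date are ignored.

```python
from datetime import datetime


def latest_model(model_names):
    """Return the model name carrying the most recent YYYYMMDD date."""
    dated = []
    for name in model_names:
        try:
            dated.append((datetime.strptime(name, "%Y%m%d"), name))
        except ValueError:
            # Skip names that are not a valid year-month-day date.
            continue
    if not dated:
        raise ValueError("no date-named models found")
    return max(dated)[1]


# 20201019 comes from this page; the other names are hypothetical.
available = ["20200115", "20201019", "20190808", "custom-model"]
print(latest_model(available))
```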