ailia_speech  1.3.0.0
Features

Features of ailia Speech

On this page, we present the features provided by both the C and C# APIs.

Basic usage

Text transcription and translation

You can transcribe text from an audio input by passing AILIA_SPEECH_TASK_TRANSCRIBE as an argument. Passing AILIA_SPEECH_TASK_TRANSLATE instead transcribes the audio and then translates it into English.
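For reference, a minimal C sketch of selecting the task when creating the instance is shown below. The surrounding values (environment ID, thread count, memory mode, flag, callback structure and version constant) are assumptions modeled on the SDK's conventions and should be checked against your ailia_speech.h; only the two task constants are taken from this page.

#include "ailia_speech.h"

/* callback is an AILIASpeechApiCallback structure filled with ailia core
   function pointers, prepared as in the SDK samples (assumption). */
struct AILIASpeech *net = NULL;
AILIASpeechApiCallback callback; /* filled elsewhere */
int status = ailiaSpeechCreate(&net,
    AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, AILIA_MEMORY_REDUCE_CONSTANT,
    AILIA_SPEECH_TASK_TRANSCRIBE,   /* or AILIA_SPEECH_TASK_TRANSLATE to translate into English */
    AILIA_SPEECH_FLAG_NONE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);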

Available AI models

As the AI model, you can use the Whisper variants tiny, base, small, or medium, listed here in order of increasing accuracy. We recommend at least "small" to obtain accuracy good enough to be useful.

"medium" is of an even greater accuracy than "small", but at a cost of a big processing load.

Set the language

By default, the language is detected for each audio segment. When the SetLanguage API is used, the language is fixed instead. Because language detection can be unreliable on short audio input, call the SetLanguage API whenever the language is already known by other means, as in the sketch after the list of codes below.

The language codes that can be specified are listed below:

"en", "zh", "de", "es", "ru", "ko", "fr", "ja", "pt", "tr", "pl", "ca", "nl", "ar", "sv", "it", "id", "hi", "fi", "vi", "iw", "uk", "el", "ms", "cs", "ro", "da", "hu", "ta", "no", "th", "ur", "hr", "bg", "lt", "la", "mi", "ml", "cy", "sk", "te", "fa", "lv", "bn", "sr", "az", "sl", "kn", "et", "mk", "br", "eu", "is", "hy", "ne", "mn", "bs", "kk", "sq", "sw", "gl", "mr", "pa", "si", "km", "sn", "yo", "so", "af", "oc", "ka", "be", "tg", "sd", "gu", "am", "yi", "lo", "uz", "fo", "ht", "ps", "tk", "nn", "mt", "sa", "lb", "my", "bo", "tl", "mg", "as", "tt", "haw", "ln", "ha", "ba", "jw", "su"

Enable live processing

When live processing is enabled, a tentative inference can be run and previewed on the current buffer content without waiting for 30 seconds of audio data to accumulate. In normal speech recognition, inference does not happen until the next voice input boundary is detected, e.g. a silence detected using VAD. By contrast, when live processing is enabled in ailia Speech, inference can be performed even before an audio boundary is detected. To enable live processing, use the flag AILIA_SPEECH_FLAG_LIVE. The inference preview is delivered to IntermediateCallback. Inference is more accurate when the live setting is not enabled, because it can then refer to past audio data. For this reason, when processing an audio file it is recommended not to use live mode, and to enable it only when audio input has to be processed in real time.
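A sketch of enabling live processing at creation time is shown below; apart from AILIA_SPEECH_FLAG_LIVE, the arguments are the same assumptions as in the creation sketch earlier on this page.

int status = ailiaSpeechCreate(&net,
    AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, AILIA_MEMORY_REDUCE_CONSTANT,
    AILIA_SPEECH_TASK_TRANSCRIBE,
    AILIA_SPEECH_FLAG_LIVE,   /* tentative results are delivered to IntermediateCallback */
    callback, AILIA_SPEECH_API_CALLBACK_VERSION);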

Virtual Memory Mode

Using virtual memory mode allows you to reduce memory consumption. Specifically, when running Whisper Medium with CPU inference, the default settings require 5.66GB of memory, whereas with virtual memory mode inference can be done with just 2.59GB. However, inference time increases by about 16%. To enable virtual memory mode, specify a directory for temporary files using ailiaSetTemporaryCachePath before calling ailiaSpeechCreate, then pass AILIA_MEMORY_REDUCE_CONSTANT | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | AILIA_MEMORY_REUSE_INTERSTAGE | AILIA_MEMORY_REDUCE_CONSTANT_WITH_FILE_MAPPED as the memory_mode argument of ailiaSpeechCreate.
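Putting the two steps together, a hedged sketch is shown below. The cache directory path is an arbitrary example, and the exact character-variant suffix of ailiaSetTemporaryCachePath follows your SDK header; the other arguments are the same assumptions as in the creation sketch above.

/* 1. Set the directory for temporary files before creating the instance. */
ailiaSetTemporaryCachePath("./cache");

/* 2. Create the instance with the file-mapped memory mode. */
int memory_mode = AILIA_MEMORY_REDUCE_CONSTANT
                | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER
                | AILIA_MEMORY_REUSE_INTERSTAGE
                | AILIA_MEMORY_REDUCE_CONSTANT_WITH_FILE_MAPPED;
int status = ailiaSpeechCreate(&net,
    AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode,
    AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_NONE,
    callback, AILIA_SPEECH_API_CALLBACK_VERSION);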

Interface

Text

After speech recognition, you can access the following results:

Item Content
text The transcribed text, in UTF-8.
time_stamp_begin Timestamp of the beginning of the transcribed audio segment, in seconds.
time_stamp_end Timestamp of the end of the transcribed audio segment, in seconds.
person_id Unique ID identifying the speaker. Not implemented; present only for future development.
language Language code of the text. With autodetection, this contains the detected language; otherwise it contains the language that has been set.
confidence Confidence score of the transcription, close to 0.0 when confidence is low and close to 1.0 when confidence is high.

Notifications and interruption

Using IntermediateCallback, it is possible to get partial results of the inference currently in progress. Speech recognition can be interrupted by making IntermediateCallback return 1.
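A sketch of such a callback is shown below; the callback signature and the registration function name (ailiaSpeechSetIntermediateCallback here) are assumptions to be checked against your header.

#include <stdio.h>

/* Called with the tentative text of the segment currently being inferred. */
int intermediate_callback(void *handle, const char *text)
{
    printf("partial: %s\n", text);
    return 0; /* return 1 instead to interrupt speech recognition */
}

/* Registration (function name assumed): */
ailiaSpeechSetIntermediateCallback(net, intermediate_callback, NULL);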

Silence detection feature

How to setup silence detection

By default, speech recognition is triggered every 30 seconds of audio data input. By using the SetSilentThreshold API, it is possible to trigger transcription each time there is silence for a certain period of time. The first argument is the sound level below which the input is considered silent, the second argument is how many seconds of speech (i.e. non-silence) are required to trigger inference, and the third argument is how many seconds of continuous silence are required to trigger inference. If VAD is not used, silence is determined based on the volume; if VAD is used, an AI model determines whether there is silence. Without VAD, the threshold is a volume value between 0.0 and 1.0; with VAD, it is a confidence value between 0.0 and 1.0. For example, without VAD the threshold could be set to 0.01, and with VAD to 0.5.
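For example, a volume-based configuration might look like this sketch; the C function name ailiaSpeechSetSilentThreshold and the 1.0-second durations are assumptions chosen for illustration, while 0.01 is the volume threshold mentioned above.

/* threshold 0.01 (volume), at least 1.0 s of speech, then 1.0 s of silence triggers inference */
int status = ailiaSpeechSetSilentThreshold(net, 0.01f, 1.0f, 1.0f);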

VAD (Voice Activity Detection)

It is possible to detect silence with AI by using AILIA_SPEECH_VAD_TYPE_SILERO and the OpenVadFile API method. Compared to volume-based silence detection, this achieves much higher accuracy. When silent audio data is fed to the AI, it can output spurious text such as "Thank you for your attention", so it is recommended to use VAD silence detection in order to skip speech recognition on such parts.
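Opening a Silero VAD model could be sketched as follows; the function name ailiaSpeechOpenVadFileA, the weight file name, and the threshold values are assumptions, while AILIA_SPEECH_VAD_TYPE_SILERO is the constant named above.

int status = ailiaSpeechOpenVadFileA(net, "silero_vad.onnx", AILIA_SPEECH_VAD_TYPE_SILERO);
/* With VAD enabled, the threshold passed to SetSilentThreshold is a confidence (e.g. 0.5). */
ailiaSpeechSetSilentThreshold(net, 0.5f, 1.0f, 1.0f);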

Specialized features

Restriction of the character set

It is possible to restrict the set of characters used for the transcription by using AILIA_SPEECH_CONSTRAINT_CHARACTERS and the SetConstraint API method. For example, by passing the constraint below, only expressions denoting numbers will be allowed.

u8"1234567890,."

Restriction of the vocabulary

It is possible to restrict the vocabulary used for the transcription by using AILIA_SPEECH_CONSTRAINT_WORDS and the SetConstraint API method. In the context of voice commands, for example, this can be used to determine which of the available commands has been pronounced. By passing the constraint below, it is possible to get the likelihood of either "command1" or "command2".

u8"command1,command2"

Prompt

It is possible to increase the accuracy of the recognition of person names or specialized vocabulary by passing them as a "prompt". Example of prompt:

u8"hardware software"

Dictionary for autocorrection

By using a dictionary for autocorrection, it is possible to apply substitutions to the character strings produced by speech recognition, so that you can correct mistakes made by the inference. Only simple substitutions are supported.

To use the autocorrection dictionary feature, pass a UTF-8 CSV file to the OpenDictionary API method. See the example dict.csv below: the kind of error corrected here is words that have been transcribed phonetically, so each line of the dictionary file first gives the phonetic transcription, followed by the correct word to substitute.

inaf,enough
colam,column
eiai,AI
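Loading the dictionary could then be sketched as follows; the function name ailiaSpeechOpenDictionaryFileA and the dictionary type constant are assumptions modeled on the SDK's naming and should be checked against your header.

int status = ailiaSpeechOpenDictionaryFileA(net, "dict.csv", AILIA_SPEECH_DICTIONARY_TYPE_REPLACE);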

Post-processing

By using post-processing, it is possible to correct speech recognition errors with T5 and to perform translation with FuguMT after speech recognition. For error correction with T5, a correction model for medical terms is available. For translation with FuguMT, speech in multiple languages can first be recognized using Whisper's Translate mode, and the resulting English text then translated into Japanese.

GPU usage

On Windows and Linux, it is possible to perform inference on the GPU with cuDNN. In order to use cuDNN, install the CUDA Toolkit and cuDNN from the NVIDIA website.

Install the CUDA Toolkit by following the installer's instructions. For cuDNN, after downloading and uncompressing it, add its location to the PATH environment variable. You need to register as an NVIDIA developer in order to download these libraries.