ailia_speech  1.3.0.0
API Usage

Overview of ailia Speech API

Basic usage

With ailia Speech, you create an instance with ailiaSpeechCreate, open a model with ailiaSpeechOpenModelFile, feed PCM with ailiaSpeechPushInputData, check whether enough PCM has been fed with ailiaSpeechBuffered, run recognition with ailiaSpeechTranscribe, and finally retrieve the resulting text with ailiaSpeechGetTextCount and ailiaSpeechGetText.

ailiaSpeechPushInputData does not require the whole audio at once: PCM can be fed incrementally, which makes the API suitable for real-time input from a microphone.

#include <stdio.h>
#include "ailia.h"
#include "ailia_audio.h"
#include "ailia_speech.h"
#include "ailia_speech_util.h"

void main(void){
    // Create the ailia Speech instance
    struct AILIASpeech* net;
    AILIASpeechApiCallback callback = ailiaSpeechUtilGetCallback();
    int memory_mode = AILIA_MEMORY_REDUCE_CONSTANT | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | AILIA_MEMORY_REUSE_INTERSTAGE;
    ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_NONE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);

    // Load the model file
    ailiaSpeechOpenModelFileA(net, "encoder_small.onnx", "decoder_small_fix_kv_cache.onnx", AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL);

    // Set the language
    ailiaSpeechSetLanguage(net, "en");

    // Gather and feed the input PCM (pPcm, nChannels, nSamples and sampleRate are supplied by the application)
    ailiaSpeechPushInputData(net, pPcm, nChannels, nSamples, sampleRate);

    // Transcribe
    while(true){
        // Check that enough PCM has been fed to perform the transcription
        unsigned int buffered = 0;
        ailiaSpeechBuffered(net, &buffered);
        if (buffered == 1){
            // Do the transcription
            ailiaSpeechTranscribe(net);

            // Get the number of text fragments that have been transcribed
            unsigned int count = 0;
            ailiaSpeechGetTextCount(net, &count);

            // Get the transcribed text
            for (unsigned int idx = 0; idx < count; idx++){
                AILIASpeechText text;
                ailiaSpeechGetText(net, &text, AILIA_SPEECH_TEXT_VERSION, idx);
                float cur_time = text.time_stamp_begin;
                float next_time = text.time_stamp_end;
                printf("[%02d:%02d.%03d --> %02d:%02d.%03d] ", (int)cur_time/60%60, (int)cur_time%60, (int)(cur_time*1000)%1000, (int)next_time/60%60, (int)next_time%60, (int)(next_time*1000)%1000);
                printf("%s\n", text.text);
            }
        }

        // Check if all of the PCM has been processed
        unsigned int complete = 0;
        ailiaSpeechComplete(net, &complete);
        if (complete == 1){
            break;
        }
    }

    // Destroy the ailia Speech instance
    ailiaSpeechDestroy(net);
}

How to modify the example for live transcription

To enable live transcription, pass the flag AILIA_SPEECH_FLAG_LIVE to ailiaSpeechCreate.

ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_LIVE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);

The transcription preview is passed as an argument to the intermediate callback.

int intermediate_callback(void *handle, const char *text){
    printf("%s\n", text);
    return 0; // return 1 to interrupt
}

ailiaSpeechSetIntermediateCallback(net, &intermediate_callback, NULL);

VAD

When using voice activity detection, call the ailiaSpeechOpenVadFileA API after the ailiaSpeechCreate API. For example, with SileroVAD (the model file name below is illustrative):

ailiaSpeechOpenVadFileA(net, "silero_vad.onnx", AILIA_SPEECH_VAD_TYPE_SILERO);

Post-process

To apply post-processing to the speech recognition result, such as correcting recognition errors or translating, call the ailiaSpeechOpenPostProcessFile API after the ailiaSpeechCreate API, and call the ailiaSpeechPostProcess API after ailiaSpeechTranscribe.

When using speech recognition error correction:

ailiaSpeechOpenPostProcessFileA(net, "t5_whisper_medical-encoder.obf.onnx", "t5_whisper_medical-decoder-with-lm-head.obf.onnx", "spiece.model", NULL, "Correction of medical terms: ", AILIA_SPEECH_POST_PROCESS_TYPE_T5);

When using translation:

English to Japanese:

ailiaSpeechOpenPostProcessFileA(net, "fugumt_en_ja_seq2seq-lm-with-past.onnx", NULL, "fugumt_en_ja_source.spm", "fugumt_en_ja_target.spm", NULL, AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA);

Japanese to English:

ailiaSpeechOpenPostProcessFileA(net, "fugumt_ja_en_encoder_model.onnx", "fugumt_ja_en_decoder_model.onnx", "fugumt_ja_en_source.spm", "fugumt_ja_en_target.spm", NULL, AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN);

Common to all post-process types, execute the post-processing after ailiaSpeechTranscribe:

ailiaSpeechPostProcess(net);

GPU usage

To use the GPU, pass the env_id corresponding to the GPU as the env_id argument of ailiaSpeechCreate. By default, AILIA_ENVIRONMENT_ID_AUTO is used, and inference is performed on the CPU. See ailia_speech_sample.cpp for an example of how to determine the GPU env_id to pass.

Flow of API calls

The relationships between the APIs are illustrated below.

Speech recognition

flowchart
    A(Microphone or File) --> B[ailiaSpeechPushInputData API]
    B --> C[ailiaSpeechBuffered API]
    C --> D[ailiaSpeechTranscribe API]
    C --> B
    D --> E[ailiaSpeechGetTextCount API]
    E --> F[ailiaSpeechGetText API]
    F --> K[ailiaSpeechComplete API]
    K --> B

Speech recognition (with post-processing)

flowchart
    A(Microphone or File) --> B[ailiaSpeechPushInputData API]
    B --> C[ailiaSpeechBuffered API]
    C --> D[ailiaSpeechTranscribe API]
    C --> B
    D --> E[ailiaSpeechGetTextCount API]
    E --> F[ailiaSpeechGetText API]
    F --> K[ailiaSpeechComplete API]
    F --> G[ailiaSpeechPostProcess API]
    G --> H[ailiaSpeechGetTextCount API]
    H --> I[ailiaSpeechGetText API]
    I --> K
    K --> B

Post-processing only

flowchart
    F[ailiaSpeechSetText API] --> G[ailiaSpeechPostProcess API]
    G --> I[ailiaSpeechGetText API]

API reference

The declarations below are found in ailia_speech.h.

Functions

ailiaSpeechCreate
int AILIA_API ailiaSpeechCreate(struct AILIASpeech **net, int env_id, int num_thread, int memory_mode, int task, int flags, AILIASpeechApiCallback callback, int version)
Creates a network instance.

ailiaSpeechDestroy
void AILIA_API ailiaSpeechDestroy(struct AILIASpeech *net)
Destroys the network instance.

ailiaSpeechOpenModelFileA
int AILIA_API ailiaSpeechOpenModelFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, int model_type)
Sets models into a network instance.

ailiaSpeechOpenVadFileA
int AILIA_API ailiaSpeechOpenVadFileA(struct AILIASpeech *net, const char *vad_path, int vad_type)
Sets the VAD model for voice activity detection.

ailiaSpeechOpenPostProcessFileA
int AILIA_API ailiaSpeechOpenPostProcessFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, const char *source_path, const char *target_path, const char *prefix, int post_process_type)
Sets the AI model for post-processing (MBSC).

ailiaSpeechSetLanguage
int AILIA_API ailiaSpeechSetLanguage(struct AILIASpeech *net, const char *language)
Sets the language.

ailiaSpeechSetIntermediateCallback
int AILIA_API ailiaSpeechSetIntermediateCallback(struct AILIASpeech *net, AILIA_SPEECH_USER_API_INTERMEDIATE_CALLBACK callback, void *handle)
Sets a callback to get intermediate results of recognition.

ailiaSpeechPushInputData
int AILIA_API ailiaSpeechPushInputData(struct AILIASpeech *net, const float *src, unsigned int channels, unsigned int samples, unsigned int sampling_rate)
Pushes PCM data to the queue.

ailiaSpeechBuffered
int AILIA_API ailiaSpeechBuffered(struct AILIASpeech *net, unsigned int *buffered)
Determines if there is enough data to perform speech recognition.

ailiaSpeechTranscribe
int AILIA_API ailiaSpeechTranscribe(struct AILIASpeech *net)
Performs speech recognition.

ailiaSpeechGetTextCount
int AILIA_API ailiaSpeechGetTextCount(struct AILIASpeech *net, unsigned int *count)
Gets the number of recognized text fragments.

ailiaSpeechGetText
int AILIA_API ailiaSpeechGetText(struct AILIASpeech *net, AILIASpeechText *text, unsigned int version, unsigned int idx)
Gets the recognized text.

ailiaSpeechPostProcess
int AILIA_API ailiaSpeechPostProcess(struct AILIASpeech *net)
Executes the post-process.

ailiaSpeechComplete
int AILIA_API ailiaSpeechComplete(struct AILIASpeech *net, unsigned int *complete)
Determines whether all data has been processed.

Structures

AILIASpeechText (pass AILIA_SPEECH_TEXT_VERSION as the struct version)
const char *text
float time_stamp_begin
float time_stamp_end

AILIASpeechApiCallback (pass AILIA_SPEECH_API_CALLBACK_VERSION as the struct version)

Macros

AILIA_SPEECH_TASK_TRANSCRIBE: Transcribe mode.
AILIA_SPEECH_FLAG_NONE: Default flag.
AILIA_SPEECH_FLAG_LIVE: Enables live transcribe mode.
AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL: Whisper Small model.
AILIA_SPEECH_VAD_TYPE_SILERO: SileroVAD.
AILIA_SPEECH_POST_PROCESS_TYPE_T5: T5.
AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA: FuguMT English-to-Japanese translation.
AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN: FuguMT Japanese-to-English translation.
AILIA_SPEECH_API_CALLBACK_VERSION: Struct version of AILIASpeechApiCallback.
AILIA_SPEECH_TEXT_VERSION: Struct version of AILIASpeechText.