ailia_speech  1.3.0.0
API Usage

Overview of ailia Speech API

Basic usage

With ailia Speech, you create an instance with ailiaSpeechCreate, open a model with ailiaSpeechOpenModelFile, feed PCM with ailiaSpeechPushInputData, check whether enough PCM has been fed with ailiaSpeechBuffered, run recognition with ailiaSpeechTranscribe, and finally retrieve the resulting text with ailiaSpeechGetTextCount and ailiaSpeechGetText.

ailiaSpeechPushInputData does not require the whole audio at once: PCM can be fed incrementally, which makes the API suitable for real-time input from a microphone.

#include <stdio.h>
#include "ailia.h"
#include "ailia_audio.h"
#include "ailia_speech.h"
#include "ailia_speech_util.h"

void main(void){
    // Create the ailia Speech instance
    struct AILIASpeech* net;
    AILIASpeechApiCallback callback = ailiaSpeechUtilGetCallback();
    int memory_mode = AILIA_MEMORY_REDUCE_CONSTANT | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | AILIA_MEMORY_REUSE_INTERSTAGE;
    ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_NONE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);

    // Load the model file
    ailiaSpeechOpenModelFileA(net, "encoder_small.onnx", "decoder_small_fix_kv_cache.onnx", AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL);

    // Set the language
    ailiaSpeechSetLanguage(net, "en");

    // Gather and feed the input PCM (pPcm, nChannels, nSamples and sampleRate are supplied by the application)
    ailiaSpeechPushInputData(net, pPcm, nChannels, nSamples, sampleRate);

    // Transcribe
    while(true){
        // Check that enough PCM has been fed to perform the transcription
        unsigned int buffered = 0;
        ailiaSpeechBuffered(net, &buffered);
        if (buffered == 1){
            // Do the transcription
            ailiaSpeechTranscribe(net);

            // Get the number of text fragments that have been transcribed
            unsigned int count = 0;
            ailiaSpeechGetTextCount(net, &count);

            // Get the transcribed text
            for (unsigned int idx = 0; idx < count; idx++){
                AILIASpeechText text;
                ailiaSpeechGetText(net, &text, AILIA_SPEECH_TEXT_VERSION, idx);
                float cur_time = text.time_stamp_begin;
                float next_time = text.time_stamp_end;
                printf("[%02d:%02d.%03d --> %02d:%02d.%03d] ", (int)cur_time/60%60, (int)cur_time%60, (int)(cur_time*1000)%1000, (int)next_time/60%60, (int)next_time%60, (int)(next_time*1000)%1000);
                printf("%s\n", text.text);
            }
        }

        // Check if all of the PCM has been processed
        unsigned int complete = 0;
        ailiaSpeechComplete(net, &complete);
        if (complete == 1){
            break;
        }
    }

    // Destroy the ailia Speech instance
    ailiaSpeechDestroy(net);
}

How to modify the example for live transcription

To enable live transcription, pass the flag AILIA_SPEECH_FLAG_LIVE to ailiaSpeechCreate.

ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_LIVE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);

The transcription preview is passed as an argument to the intermediate callback.

int intermediate_callback(void *handle, const char *text){
    printf("%s\n", text);
    return 0; // return 1 to interrupt
}

ailiaSpeechSetIntermediateCallback(net, &intermediate_callback, NULL);

VAD

When using voice activity detection, call the ailiaSpeechOpenVadFileA API after the ailiaSpeechCreate API. For example, with SileroVAD (the model file name below is illustrative):

ailiaSpeechOpenVadFileA(net, "silero_vad.onnx", AILIA_SPEECH_VAD_TYPE_SILERO);

Post-process

To apply post-processing to the speech recognition result, such as correcting recognition errors or translating, call the ailiaSpeechOpenPostProcessFile API after the ailiaSpeechCreate API, and call the ailiaSpeechPostProcess API after ailiaSpeechTranscribe.

When using speech recognition error correction:

ailiaSpeechOpenPostProcessFileA(net, "t5_whisper_medical-encoder.obf.onnx", "t5_whisper_medical-decoder-with-lm-head.obf.onnx", "spiece.model", NULL, "Correction of medical terms: ", AILIA_SPEECH_POST_PROCESS_TYPE_T5);

When using translation:

English to Japanese:

ailiaSpeechOpenPostProcessFileA(net, "fugumt_en_ja_seq2seq-lm-with-past.onnx", NULL, "fugumt_en_ja_source.spm", "fugumt_en_ja_target.spm", NULL, AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA);

Japanese to English:

ailiaSpeechOpenPostProcessFileA(net, "fugumt_ja_en_encoder_model.onnx", "fugumt_ja_en_decoder_model.onnx", "fugumt_ja_en_source.spm", "fugumt_ja_en_target.spm", NULL, AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN);

Common to all post-process types, execute the post-processing after ailiaSpeechTranscribe:

ailiaSpeechPostProcess(net);

GPU usage

To use the GPU, pass the env_id corresponding to the GPU as the env_id argument of ailiaSpeechCreate. By default, AILIA_ENVIRONMENT_ID_AUTO is used, and inference is performed on the CPU. See ailia_speech_sample.cpp for an example of how to determine the GPU env_id to pass.

Flow of API calls

The relationships between the APIs are illustrated below.

Speech recognition

flowchart
    A(Microphone or File) --> B[ailiaSpeechPushInputData API]
    B --> C[ailiaSpeechBuffered API]
    C --> D[ailiaSpeechTranscribe API]
    C --> B
    D --> E[ailiaSpeechGetTextCount API]
    E --> F[ailiaSpeechGetText API]
    F --> K[ailiaSpeechComplete API]
    K --> B

Speech recognition (with post-processing)

flowchart
    A(Microphone or File) --> B[ailiaSpeechPushInputData API]
    B --> C[ailiaSpeechBuffered API]
    C --> D[ailiaSpeechTranscribe API]
    C --> B
    D --> E[ailiaSpeechGetTextCount API]
    E --> F[ailiaSpeechGetText API]
    F --> K[ailiaSpeechComplete API]
    F --> G[ailiaSpeechPostProcess API]
    G --> H[ailiaSpeechGetTextCount API]
    H --> I[ailiaSpeechGetText API]
    I --> K
    K --> B

Post-processing only

flowchart
    F[ailiaSpeechSetText API] --> G[ailiaSpeechPostProcess API]
    G --> I[ailiaSpeechGetText API]

API reference

The declarations below are found in ailia_speech.h.

Functions

ailiaSpeechCreate
int AILIA_API ailiaSpeechCreate(struct AILIASpeech **net, int env_id, int num_thread, int memory_mode, int task, int flags, AILIASpeechApiCallback callback, int version)
Creates a network instance.

ailiaSpeechDestroy
void AILIA_API ailiaSpeechDestroy(struct AILIASpeech *net)
Destroys the network instance.

ailiaSpeechOpenModelFileA
int AILIA_API ailiaSpeechOpenModelFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, int model_type)
Sets models into a network instance.

ailiaSpeechOpenVadFileA
int AILIA_API ailiaSpeechOpenVadFileA(struct AILIASpeech *net, const char *vad_path, int vad_type)
Sets the VAD model for voice activity detection.

ailiaSpeechOpenPostProcessFileA
int AILIA_API ailiaSpeechOpenPostProcessFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, const char *source_path, const char *target_path, const char *prefix, int post_process_type)
Sets the AI model for post-processing (MBSC).

ailiaSpeechSetLanguage
int AILIA_API ailiaSpeechSetLanguage(struct AILIASpeech *net, const char *language)
Sets the language.

ailiaSpeechSetIntermediateCallback
int AILIA_API ailiaSpeechSetIntermediateCallback(struct AILIASpeech *net, AILIA_SPEECH_USER_API_INTERMEDIATE_CALLBACK callback, void *handle)
Sets a callback to get intermediate results of recognition.

ailiaSpeechPushInputData
int AILIA_API ailiaSpeechPushInputData(struct AILIASpeech *net, const float *src, unsigned int channels, unsigned int samples, unsigned int sampling_rate)
Pushes PCM data to the queue.

ailiaSpeechBuffered
int AILIA_API ailiaSpeechBuffered(struct AILIASpeech *net, unsigned int *buffered)
Determines if there is enough data to perform speech recognition.

ailiaSpeechTranscribe
int AILIA_API ailiaSpeechTranscribe(struct AILIASpeech *net)
Performs speech recognition.

ailiaSpeechGetTextCount
int AILIA_API ailiaSpeechGetTextCount(struct AILIASpeech *net, unsigned int *count)
Gets the number of recognized text fragments.

ailiaSpeechGetText
int AILIA_API ailiaSpeechGetText(struct AILIASpeech *net, AILIASpeechText *text, unsigned int version, unsigned int idx)
Gets the recognized text.

ailiaSpeechPostProcess
int AILIA_API ailiaSpeechPostProcess(struct AILIASpeech *net)
Executes the post-process.

ailiaSpeechComplete
int AILIA_API ailiaSpeechComplete(struct AILIASpeech *net, unsigned int *complete)
Determines whether all data has been processed.

Structures

AILIASpeechText (pass AILIA_SPEECH_TEXT_VERSION as the struct version)
const char *text
float time_stamp_begin
float time_stamp_end

AILIASpeechApiCallback (pass AILIA_SPEECH_API_CALLBACK_VERSION as the struct version)

Macros

AILIA_SPEECH_TASK_TRANSCRIBE: Transcribe mode.
AILIA_SPEECH_FLAG_NONE: Default flag.
AILIA_SPEECH_FLAG_LIVE: Enables live transcribe mode.
AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL: Whisper Small model.
AILIA_SPEECH_VAD_TYPE_SILERO: SileroVAD.
AILIA_SPEECH_POST_PROCESS_TYPE_T5: T5.
AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA: FuguMT English-to-Japanese translation.
AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN: FuguMT Japanese-to-English translation.
AILIA_SPEECH_API_CALLBACK_VERSION: Struct version of AILIASpeechApiCallback.
AILIA_SPEECH_TEXT_VERSION: Struct version of AILIASpeechText.