ailia_speech  1.3.0.0
API Usage

High Level API

Presentation of the High Level API

The C# High Level API uses AiliaSpeechModel, which is an abstraction over the AiliaSpeech Low Level API. AiliaSpeechModel is designed to perform multithreaded text transcription. The model files can be loaded with Open, the audio waveform can then be passed to Transcribe, and, once the inference has completed, the results can be obtained with GetResults.

AiliaSpeechModel ailia_speech = new AiliaSpeechModel(); // keep the instance as a field so every method can access it

void OnEnable(){
    // Open the model
    ailia_speech.Open(asset_path + "/" + encoder_path, asset_path + "/" + decoder_path, env_id, memory_mode, api_model_type, task, flag, language);
}

void Update(){
    // Get the waveform from the mic
    float [] waveData = GetMicInput();

    // Get the results from the multithreaded processing
    List<string> results = ailia_speech.GetResults();
    for (int idx = 0; idx < results.Count; idx++){
        string text = results[idx];
        string display_text = text + "\n";
        content_text = content_text + display_text;
    }

    // Request a new inference
    ailia_speech.Transcribe(waveData, frequency, channels, complete);
    waveQueue = new List<float[]>(); // reset the queue
}

void OnDisable(){
    // Destroy the instance
    ailia_speech.Close();
}
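
The GetMicInput() helper used above is not part of ailia Speech. A minimal sketch of such a helper using Unity's Microphone API could look like the following (the default device, the looping one-second buffer, and the read-position handling are simplifying assumptions):

AudioClip clip;
int head = 0;

void StartMic(){
    // Record from the default microphone into a looping 1-second buffer
    clip = Microphone.Start(null, true, 1, frequency);
}

float [] GetMicInput(){
    // Return the samples recorded since the last call
    int pos = Microphone.GetPosition(null);
    if (clip == null || pos == head){
        return new float[0];
    }
    int len = (pos - head + clip.samples) % clip.samples;
    float [] buf = new float[len * clip.channels];
    clip.GetData(buf, head); // reads wrap around the looping clip
    head = pos;
    return buf;
}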

How to modify the example for live transcription

To enable live transcription, pass the flag AILIA_SPEECH_FLAG_LIVE to the Open API method.

flag = AiliaSpeech.AILIA_SPEECH_FLAG_LIVE;

During live transcription, a preview of the ongoing inference is notified through IntermediateCallback, and the intermediate text can be retrieved as follows:

string text = GetIntermediateText();
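
For example, the preview can be appended after the confirmed results when building the display text. This is only a sketch: it assumes the Update loop of the High Level API example above, and ui_text is a hypothetical UI text component:

void Update(){
    // ... GetResults() handling as above ...

    // Append the unconfirmed preview after the confirmed transcription
    string intermediate_text = GetIntermediateText();
    if (intermediate_text != null && intermediate_text != ""){
        ui_text.text = content_text + "(" + intermediate_text + ")";
    }
}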

VAD

When using voice activity detection (VAD), call the OpenVad API after the Open API.

ailia_speech.OpenVad(asset_path + "/" + "silero_vad.onnx", AiliaSpeech.AILIA_SPEECH_VAD_TYPE_SILERO);

Post-process

If you want to apply post-processing such as speech recognition error correction or translation to the speech recognition result, call the OpenPostProcess API after the Open API.

If using speech recognition error correction:

ailia_speech.OpenPostProcess(asset_path + "/" + "t5_whisper_medical-encoder.obf.onnx", asset_path + "/" + "t5_whisper_medical-decoder-with-lm-head.obf.onnx", asset_path + "/" + "spiece.model", null, "Correction of medical terminology: ", AiliaSpeech.AILIA_SPEECH_POST_PROCESS_TYPE_T5);

If using translation:

English to Japanese:

ailia_speech.OpenPostProcess(asset_path + "/" + "fugumt_en_ja_seq2seq-lm-with-past.onnx", null, asset_path + "/" + "fugumt_en_ja_source.spm", asset_path + "/" + "fugumt_en_ja_target.spm", null, AiliaSpeech.AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA);

Japanese to English:

ailia_speech.OpenPostProcess(asset_path + "/" + "fugumt_ja_en_encoder_model.onnx", asset_path + "/" + "fugumt_ja_en_decoder_model.onnx", asset_path + "/" + "fugumt_ja_en_source.spm", asset_path + "/" + "fugumt_ja_en_target.spm", null, AiliaSpeech.AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN);

Translate

By using AiliaSpeechTranslateModel, it is possible to perform translation on its own, without speech recognition.

AiliaSpeechTranslateModel ailia_speech_translate = new AiliaSpeechTranslateModel(); // keep the instance as a field so every method can access it

void OnEnable(){
    // Open the model
    ailia_speech_translate.Open(asset_path + "/" + "fugumt_en_ja_seq2seq-lm-with-past.onnx", null, asset_path + "/" + "fugumt_en_ja_source.spm", asset_path + "/" + "fugumt_en_ja_target.spm", AiliaSpeech.AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA, env_id, memory_mode);
}

void Translate(){
    // Translate
    string output = ailia_speech_translate.Translate(ui_input_field.text);
}

void OnDisable(){
    // Destroy the instance
    ailia_speech_translate.Close();
}

Low Level API

Overview of the Low Level API

Create the ailia Speech instance with ailiaSpeechCreate, open the model with ailiaSpeechOpenModelFile, feed some audio data with ailiaSpeechPushInputData, check whether enough audio data has been buffered with ailiaSpeechBuffered, transcribe with ailiaSpeechTranscribe, then get the inference results with ailiaSpeechGetText.

With ailiaSpeechPushInputData it is not necessary to input the whole audio stream at once: the data can be fed little by little, so the API can be used in real time with input from a microphone.

// Create the instance
IntPtr net = IntPtr.Zero;
AiliaSpeech.AILIASpeechApiCallback callback = AiliaSpeech.GetCallback();
int memory_mode = Ailia.AILIA_MEMORY_REDUCE_CONSTANT | Ailia.AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | Ailia.AILIA_MEMORY_REUSE_INTERSTAGE;
AiliaSpeech.ailiaSpeechCreate(ref net, env_id, Ailia.AILIA_MULTITHREAD_AUTO, memory_mode, AiliaSpeech.AILIA_SPEECH_TASK_TRANSCRIBE, AiliaSpeech.AILIA_SPEECH_FLAG_NONE, callback, AiliaSpeech.AILIA_SPEECH_API_CALLBACK_VERSION);

// Open the model
string base_path = Application.streamingAssetsPath + "/";
AiliaSpeech.ailiaSpeechOpenModelFile(net, base_path + "encoder_small.onnx", base_path + "decoder_small_fix_kv_cache.onnx", AiliaSpeech.AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL);

// Set the language
AiliaSpeech.ailiaSpeechSetLanguage(net, "ja");

// Gather and feed the audio waveform
AiliaSpeech.ailiaSpeechPushInputData(net, samples_buf, threadChannels, (uint)samples_buf.Length / threadChannels, threadFrequency);

// Check if there is enough audio data to perform the transcription
while (true){
    // Is there enough audio data to perform the transcription?
    uint buffered = 0;
    AiliaSpeech.ailiaSpeechBuffered(net, ref buffered);
    if (buffered == 1){
        // Perform the transcription
        AiliaSpeech.ailiaSpeechTranscribe(net);

        // Get the number of text fragments that have been transcribed
        uint count = 0;
        AiliaSpeech.ailiaSpeechGetTextCount(net, ref count);

        // Get the transcribed text with its timestamps
        for (uint idx = 0; idx < count; idx++){
            AiliaSpeech.AILIASpeechText text = new AiliaSpeech.AILIASpeechText();
            AiliaSpeech.ailiaSpeechGetText(net, text, AiliaSpeech.AILIA_SPEECH_TEXT_VERSION, idx);
            float cur_time = text.time_stamp_begin;
            float next_time = text.time_stamp_end;
            Debug.Log("[" + cur_time + " - " + next_time + "] " + Marshal.PtrToStringAnsi(text.text));
        }
    }

    // Check if all of the audio data has been processed
    uint complete = 0;
    AiliaSpeech.ailiaSpeechComplete(net, ref complete);
    if (complete == 1){
        break;
    }
}

// Destroy the instance
AiliaSpeech.ailiaSpeechDestroy(net);
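
Each of the functions above returns a status code, which the snippet omits for brevity. A minimal error-handling sketch, assuming the standard ailia status constants, could look like this:

// Check the status returned by an API call (sketch)
int status = AiliaSpeech.ailiaSpeechTranscribe(net);
if (status != Ailia.AILIA_STATUS_SUCCESS){
    Debug.Log("ailiaSpeechTranscribe failed with status " + status);
}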

GPU usage

In order to use the GPU, pass the env_id corresponding to the GPU as the env_id argument of the AiliaSpeechModel Open method. By default, the value AILIA_ENVIRONMENT_ID_AUTO is used, which performs the inference on the CPU. See the GetEnvId() function of AiliaSpeechSample.cs for an example of how to determine the GPU env_id to be passed as the env_id argument. In the example below, the ailia API is used to enumerate the available environments and, if the env_type value is 1, the env_id corresponding to the GPU is selected.

private int GetEnvId(int env_type){
    int env_id = Ailia.AILIA_ENVIRONMENT_ID_AUTO;
    if (env_type == 1) { // GPU
        int count = 0;
        Ailia.ailiaGetEnvironmentCount(ref count);
        for (int i = 0; i < count; i++){
            IntPtr env_ptr = IntPtr.Zero;
            Ailia.ailiaGetEnvironment(ref env_ptr, (uint)i, Ailia.AILIA_ENVIRONMENT_VERSION);
            Ailia.AILIAEnvironment env = (Ailia.AILIAEnvironment)Marshal.PtrToStructure(env_ptr, typeof(Ailia.AILIAEnvironment));
            if (env.backend == Ailia.AILIA_ENVIRONMENT_BACKEND_MPS || env.backend == Ailia.AILIA_ENVIRONMENT_BACKEND_CUDA || env.backend == Ailia.AILIA_ENVIRONMENT_BACKEND_VULKAN){
                env_id = env.id;
                env_name = Marshal.PtrToStringAnsi(env.name);
            }
        }
    } else {
        env_name = "cpu";
    }
    return env_id;
}
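
The returned env_id can then be passed to the Open call from the High Level API example, for instance (GetEnvId(1) requests a GPU environment, following the sample's convention):

int env_id = GetEnvId(1); // 1 = GPU
ailia_speech.Open(asset_path + "/" + encoder_path, asset_path + "/" + decoder_path, env_id, memory_mode, api_model_type, task, flag, language);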

Platform-specific remarks

Windows

For the Unity Plugin sample, the StandaloneFileBrowser asset is used as the file dialog. As a limitation of StandaloneFileBrowser, building on Windows with IL2CPP produces an error, so in this case please build with Mono instead. This is not a limitation of ailia Speech, so if the file dialog is not used you can build with IL2CPP.

iOS

When running on iOS, please enable the "Increased Memory Limit" capability. 1.82 GB of memory is required for the "small" model.

Android

On Android, it is not possible to access the StreamingAssets files directly, so, at runtime, the model files are transferred to Application.temporaryCachePath.
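
Since StreamingAssets content is packed inside the APK on Android, it is typically read through UnityWebRequest. The plugin's samples handle this automatically; a minimal sketch of such a copy (CopyModel and file_name are illustrative names) could look like this:

// Requires: using System.Collections; using UnityEngine.Networking;
IEnumerator CopyModel(string file_name){
    string src = Application.streamingAssetsPath + "/" + file_name;
    string dst = Application.temporaryCachePath + "/" + file_name;
    UnityWebRequest req = UnityWebRequest.Get(src);
    yield return req.SendWebRequest();
    // Write the model bytes to a directly accessible path
    System.IO.File.WriteAllBytes(dst, req.downloadHandler.data);
}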