Overview of ailia Speech API
Basic usage
With ailia Speech, you create an instance with ailiaSpeechCreate, open a model with ailiaSpeechOpenModelFile, feed PCM data with ailiaSpeechPushInputData, check whether enough PCM has been buffered with ailiaSpeechBuffered, run recognition with ailiaSpeechTranscribe, and retrieve the resulting text with ailiaSpeechGetText.
With ailiaSpeechPushInputData it is not necessary to supply the whole audio stream at once; the data can be fed little by little, so the API can be used in real time with input from a microphone.
#include "ailia.h"
#include "ailia_audio.h"
#include "ailia_speech_util.h"
void main(void){
struct AILIASpeech* net;
int memory_mode = AILIA_MEMORY_REDUCE_CONSTANT | AILIA_MEMORY_REDUCE_CONSTANT_WITH_INPUT_INITIALIZER | AILIA_MEMORY_REUSE_INTERSTAGE;
while(true){
unsigned int buffered = 0;
if (buffered == 1){
unsigned int count = 0;
for (unsigned int idx = 0; idx < count; idx++){
printf("[%02d:%02d.%03d --> %02d:%02d.%03d] ", (int)cur_time/60%60,(int)cur_time%60, (int)(cur_time*1000)%1000, (int)next_time/60%60,(int)next_time%60, (int)(next_time*1000)%1000);
printf(
"%s\n", text.
text);
}
}
unsigned int complete = 0;
if (complete == 1){
break;
}
}
}
Reference for the APIs and definitions used in this example:
int AILIA_API ailiaSpeechCreate(struct AILIASpeech **net, int env_id, int num_thread, int memory_mode, int task, int flags, AILIASpeechApiCallback callback, int version)
Creates a network instance.
int AILIA_API ailiaSpeechOpenModelFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, int model_type)
Set models into a network instance.
int AILIA_API ailiaSpeechSetLanguage(struct AILIASpeech *net, const char *language)
Set language.
int AILIA_API ailiaSpeechPushInputData(struct AILIASpeech *net, const float *src, unsigned int channels, unsigned int samples, unsigned int sampling_rate)
Push PCM data to queue.
int AILIA_API ailiaSpeechBuffered(struct AILIASpeech *net, unsigned int *buffered)
Determines if there is enough data to perform speech recognition.
int AILIA_API ailiaSpeechTranscribe(struct AILIASpeech *net)
Speech recognition.
int AILIA_API ailiaSpeechGetTextCount(struct AILIASpeech *net, unsigned int *count)
Get recognized text count.
int AILIA_API ailiaSpeechGetText(struct AILIASpeech *net, AILIASpeechText *text, unsigned int version, unsigned int idx)
Get recognized text.
int AILIA_API ailiaSpeechComplete(struct AILIASpeech *net, unsigned int *complete)
Determines whether all data has been processed.
void AILIA_API ailiaSpeechDestroy(struct AILIASpeech *net)
Destroys the network instance.
#define AILIA_SPEECH_API_CALLBACK_VERSION
Struct version of AILIASpeechApiCallback.
#define AILIA_SPEECH_TEXT_VERSION
Struct version of AILIASpeechText.
#define AILIA_SPEECH_MODEL_TYPE_WHISPER_MULTILINGUAL_SMALL
Whisper Small model.
#define AILIA_SPEECH_TASK_TRANSCRIBE
Transcribe mode.
#define AILIA_SPEECH_FLAG_NONE
Default flag.
struct AILIASpeechText
Recognition result: const char *text (recognized text), float time_stamp_begin (segment start time in seconds), float time_stamp_end (segment end time in seconds).
How to modify the example for live transcription
To enable live transcription, pass the flag AILIA_SPEECH_FLAG_LIVE to ailiaSpeechCreate.
#define AILIA_SPEECH_FLAG_LIVE
Enable live transcribe mode.
The transcription preview is delivered as an argument to the intermediate callback:
int intermediate_callback(void *handle, const char *text){
	// Receives the current transcription preview; return 0 to continue processing
	printf("%s\n", text);
	return 0;
}
int AILIA_API ailiaSpeechSetIntermediateCallback(struct AILIASpeech *net, AILIA_SPEECH_USER_API_INTERMEDIATE_CALLBACK callback, void *handle)
Set a callback to get intermediate results of recognition.
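For example, create the instance with the live flag and register the callback (a minimal sketch reusing the variables from the basic example; the handle argument is passed as NULL here):
// Create the instance in live transcribe mode and register the preview callback
ailiaSpeechCreate(&net, AILIA_ENVIRONMENT_ID_AUTO, AILIA_MULTITHREAD_AUTO, memory_mode, AILIA_SPEECH_TASK_TRANSCRIBE, AILIA_SPEECH_FLAG_LIVE, callback, AILIA_SPEECH_API_CALLBACK_VERSION);
ailiaSpeechSetIntermediateCallback(net, intermediate_callback, NULL);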
VAD
When using voice activity detection, call the ailiaSpeechOpenVadFile API after the ailiaSpeechCreate API.
int AILIA_API ailiaSpeechOpenVadFileA(struct AILIASpeech *net, const char *vad_path, int vad_type)
Set VAD model for voice activity detection.
#define AILIA_SPEECH_VAD_TYPE_SILERO
SileroVAD.
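For example (a minimal sketch; the VAD model file name below is a placeholder for the SileroVAD ONNX file distributed with ailia Speech):
// Enable voice activity detection with SileroVAD (placeholder file name)
ailiaSpeechOpenVadFileA(net, "silero_vad.onnx", AILIA_SPEECH_VAD_TYPE_SILERO);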
Post-process
If you want to apply post-processing such as recognition error correction or translation to the transcription result, call the ailiaSpeechOpenPostProcessFile API after the ailiaSpeechCreate API, and call the ailiaSpeechPostProcess API after ailiaSpeechTranscribe.
When using speech recognition error correction:
int AILIA_API ailiaSpeechOpenPostProcessFileA(struct AILIASpeech *net, const char *encoder_path, const char *decoder_path, const char *source_path, const char *target_path, const char *prefix, int post_process_type)
Set AI model for post process (MBSC)
#define AILIA_SPEECH_POST_PROCESS_TYPE_T5
T5.
When using translation:
English to Japanese:
#define AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA
FuguMT EN JA.
Japanese to English:
#define AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_JA_EN
FuguMT JA EN.
Common:
int AILIA_API ailiaSpeechPostProcess(struct AILIASpeech *net)
Execute post process.
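As a sketch, registering FuguMT English-to-Japanese translation (the model and tokenizer file names below are placeholders, and the prefix argument is passed as NULL on the assumption that it is only needed by T5-style models; check the distributed model files for the actual names):
// Register FuguMT EN->JA translation as post-processing (placeholder file names)
ailiaSpeechOpenPostProcessFileA(net, "fugumt_en_ja_encoder.onnx", "fugumt_en_ja_decoder.onnx", "fugumt_en_ja_source.spm", "fugumt_en_ja_target.spm", NULL, AILIA_SPEECH_POST_PROCESS_TYPE_FUGUMT_EN_JA);
// After each ailiaSpeechTranscribe call, run the post process and read the
// post-processed result back with ailiaSpeechGetTextCount / ailiaSpeechGetText
ailiaSpeechPostProcess(net);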
Speaker Diarization
It is possible to perform speaker diarization on the speech recognition results. To use it, call the ailiaSpeechOpenDiarization API after the ailiaSpeechCreate API.
int AILIA_API ailiaSpeechOpenDiarizationFileA(struct AILIASpeech *net, const char *segmentation_path, const char *embedding_path, int type)
Set AI model for speaker diarization (MBSC)
#define AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO
PyannoteAudio.
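As a sketch (the segmentation and speaker-embedding model file names below are placeholders):
// Enable speaker diarization with Pyannote Audio (placeholder file names)
ailiaSpeechOpenDiarizationFileA(net, "segmentation.onnx", "speaker_embedding.onnx", AILIA_SPEECH_DIARIZATION_TYPE_PYANNOTE_AUDIO);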
GPU usage
In order to use the GPU, pass the env_id corresponding to the GPU as the env_id argument of ailiaSpeechCreate. By default, the value AILIA_ENVIRONMENT_ID_AUTO is used, which performs the inference on the CPU. See ailia_speech_sample.cpp for an example of how to determine the GPU env_id to pass.
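A minimal sketch of selecting a GPU env_id with the core ailia environment APIs from ailia.h (ailiaGetEnvironmentCount and ailiaGetEnvironment):
// Enumerate the available inference environments and pick the first GPU
int env_id = AILIA_ENVIRONMENT_ID_AUTO;
unsigned int env_count = 0;
ailiaGetEnvironmentCount(&env_count);
for (unsigned int i = 0; i < env_count; i++){
	AILIAEnvironment *env = NULL;
	ailiaGetEnvironment(&env, i, AILIA_ENVIRONMENT_VERSION);
	if (env->type == AILIA_ENVIRONMENT_TYPE_GPU){
		env_id = env->id; // pass this value to ailiaSpeechCreate
		break;
	}
}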
Flow of API call
The relationships between the APIs are shown in the following diagrams.
Speech recognition
flowchart
A(Microphone or File)-->B
B[ailiaSpeechPushInputData API]-->C
C[ailiaSpeechBuffered API]-->D
C-->B
D[ailiaSpeechTranscribe API]-->E
E[ailiaSpeechGetTextCount API]-->F
F[ailiaSpeechGetText API]-->K
K[ailiaSpeechComplete API] --> B
Speech recognition (with post-processing)
flowchart
A(Microphone or File)-->B
B[ailiaSpeechPushInputData API]-->C
C[ailiaSpeechBuffered API]-->D
C-->B
D[ailiaSpeechTranscribe API]-->E
E[ailiaSpeechGetTextCount API]-->F
F[ailiaSpeechGetText API]-->K
F-->G
G[ailiaSpeechPostProcess API]-->H
H[ailiaSpeechGetTextCount API]-->I
I[ailiaSpeechGetText API]-->K
K[ailiaSpeechComplete API] --> B
Post-processing only
flowchart
F[ailiaSpeechSetText API]-->G
G[ailiaSpeechPostProcess API]-->I
I[ailiaSpeechGetText API]
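As a sketch of the post-processing-only flow (this assumes ailiaSpeechSetText takes the same AILIASpeechText, version, and idx arguments as ailiaSpeechGetText; check ailia_speech.h for the exact signature):
// Feed an existing text into the instance, post-process it, and read the result back
AILIASpeechText input = {0};
input.text = "Hello world.";
input.time_stamp_begin = 0.0f;
input.time_stamp_end = 1.0f;
ailiaSpeechSetText(net, &input, AILIA_SPEECH_TEXT_VERSION, 0);
ailiaSpeechPostProcess(net);
AILIASpeechText output;
ailiaSpeechGetText(net, &output, AILIA_SPEECH_TEXT_VERSION, 0);
printf("%s\n", output.text);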