ailia_voice  1.3.0.0

Features of ailia AI Voice

This page presents the features provided by both the C and the C# APIs.

Text-to-speech conversion

With ailia AI Voice, speech can be synthesized using either the Tacotron2 or the GPT-SoVITS model.

Japanese speech synthesis

To synthesize Japanese speech, it is necessary to convert Japanese text into phonemes, and OpenJtalk is used for the conversion to phonemes. OpenJtalk is incorporated into the ailia AI Voice library.

Voice synthesis in any tone of voice

When using GPT-SoVITS, it is possible to synthesize speech in any voice timbre by providing a reference audio file of about 10 seconds.

User Dictionary

By defining a user dictionary, it is possible to correct the pronunciation of Japanese words.

GPU usage

On Windows and Linux, inference can run on the GPU via cuDNN. To use cuDNN, install the CUDA Toolkit and cuDNN from the NVIDIA website:

Install the CUDA Toolkit by following the installer instructions. For cuDNN, after downloading and uncompressing the archive, add its location to the PATH environment variable. You need to register as an NVIDIA developer in order to download these libraries.
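The environment setup above can be sketched as follows. The install locations shown are assumptions; substitute the paths where you actually extracted cuDNN and installed the CUDA Toolkit.

```shell
:: Windows (cmd) example: make the cuDNN DLLs visible to the loader.
:: C:\tools\cudnn\bin is an assumed extraction path.
set PATH=%PATH%;C:\tools\cudnn\bin

:: Verify that the CUDA Toolkit is installed and on PATH.
nvcc --version
```

On Linux, the equivalent step is typically adding the cuDNN library directory (e.g. an assumed `/usr/local/cuda/lib64`) to `LD_LIBRARY_PATH` instead of `PATH`.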

Creating a user dictionary

To create a user dictionary, prepare a userdic.csv like the one below. The 0/5 at the end of the entry means the reading has 5 morae and the accent nucleus is at position 0 (heiban, i.e. flat accent).

超電磁砲,,,1,名詞,固有名詞,一般,*,*,*,超電磁砲,レールガン,レールガン,0/5,*
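When writing your own entries, the mora count in the accent field must match the katakana reading. The following helper is a hypothetical illustration (not part of ailia AI Voice or OpenJTalk) of how that field can be derived: every kana counts as one mora except the small kana, while the long-vowel mark ー and the sokuon ッ each count as one.

```python
# Hypothetical helper: derive the "accent/mora-count" field of a
# user-dictionary entry from its katakana reading.

# Small kana attach to the preceding kana and do not add a mora.
SMALL_KANA = set("ァィゥェォャュョヮ")

def count_morae(reading: str) -> int:
    """Count morae in a katakana reading (ー and ッ each count as one)."""
    return sum(1 for ch in reading if ch not in SMALL_KANA)

def accent_field(reading: str, accent_pos: int) -> str:
    """Build the accent field, e.g. '0/5' for レールガン with heiban accent."""
    return f"{accent_pos}/{count_morae(reading)}"

print(accent_field("レールガン", 0))  # -> 0/5
```

For example, レールガン yields 5 morae (レ・ー・ル・ガ・ン), matching the 0/5 in the entry above.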

The user dictionary is converted from a CSV file to a dic file using pyopenjtalk.

import pyopenjtalk

# Compile the CSV user dictionary into the binary format loaded by OpenJTalk.
pyopenjtalk.mecab_dict_index("userdic.csv", "userdic.dic")

The converted dic file can then be loaded by calling the ailiaVoiceSetUserDictionary API.