ailia.audio package¶
Functions¶
- ailia.audio.complex_norm(spec, power=1.0)¶
Compute the norm of a complex spectrogram.
- Parameters:
spec (numpy.ndarray(dtype=complex)) – input spectrogram.
power (float, optional, default: 1.0) – exponent for the norm.
- Returns:
res – the norm of the complex spectrogram.
- Return type:
numpy.ndarray
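As a point of reference, the computation is assumed to match the elementwise magnitude raised to `power`; the `complex_norm_ref` helper below is illustrative, not the library's implementation:

```python
import numpy as np

# Hedged NumPy sketch of what complex_norm is assumed to compute:
# the magnitude of each complex bin, raised to `power`.
def complex_norm_ref(spec, power=1.0):
    return np.abs(spec) ** power

spec = np.array([[3 + 4j, 1j], [1 + 0j, 0j]])
print(complex_norm_ref(spec))           # magnitudes: [[5, 1], [1, 0]]
print(complex_norm_ref(spec, power=2))  # squared magnitudes: [[25, 1], [1, 0]]
```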
- ailia.audio.compute_mel_spectrogram_with_fixed_length(wav, sample_rate=16000, fft_n=2048, hop_n=None, win_n=None, mel_n=128, max_frame_n=128)¶
Create a melspectrogram.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
sample_rate (int, optional, default: 16000) – sample rate of input audio signal.
fft_n (int, optional, default: 2048) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m (m = 1,2,…)
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n // 4) – window size.
mel_n (int, optional, default: 128) – number of mel filter banks.
max_frame_n (int, optional, default: 128) – number of time frames of mel spectrogram.
- Returns:
res – created melspectrogram.
- Return type:
numpy.ndarray
- ailia.audio.convert_power_to_db(signal, top_db=None)¶
Convert a spectrogram from the power scale to the decibel scale.
- Parameters:
signal (numpy.ndarray) – input signal in power scale.
top_db (float, optional, default: 80.0) – threshold the output at top_db below the peak.
- Returns:
res – output signal in decibel scale.
- Return type:
numpy.ndarray
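The conversion is assumed to follow the librosa-style convention: 10·log10 of the power values, optionally clipped to no more than `top_db` below the peak. The `power_to_db_ref` helper below is a sketch under that assumption:

```python
import numpy as np

# Hedged sketch of the assumed power-to-dB conversion (librosa-style):
# 10 * log10(power), clipped so no value is more than top_db below the peak.
def power_to_db_ref(signal, top_db=80.0):
    db = 10.0 * np.log10(np.maximum(signal, 1e-10))  # guard against log(0)
    if top_db is not None:
        db = np.maximum(db, db.max() - top_db)
    return db

power = np.array([1.0, 0.1, 1e-6])
print(power_to_db_ref(power))  # [0., -10., -60.] dB
```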
- ailia.audio.fft(signal)¶
Run the fast Fourier transform (FFT).
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – created spectrum.
- Return type:
numpy.ndarray
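The semantics can be checked against NumPy's reference FFT; ailia.audio.fft is assumed to behave like numpy.fft.fft on a 1-D signal:

```python
import numpy as np

# A unit impulse has a flat spectrum, which makes a simple sanity check.
signal = np.array([1.0, 0.0, 0.0, 0.0])
spectrum = np.fft.fft(signal)
print(spectrum)  # all ones for an impulse
```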
- ailia.audio.filterfilter(n_coef, d_coef, wav, axis=-1, padtype='odd', padlen=None)¶
Apply a filter forward and backward to an audio signal.
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
axis (int) – TBD
padtype (str, int or None, optional, default: odd) –
type of padding for the input signal extension. requirements :
None or 0 : no padding. “odd” or 1 : odd padding. “even” or 2 : even padding. “constant” or 3 : constant padding.
padlen (int or None, optional, default: 3 * max(len(n_coef), len(d_coef))) – number of padding samples at both ends of input signal before forward filtering.
- Returns:
res – output filtered audio signal.
- Return type:
numpy.ndarray
- ailia.audio.fix_frame_len(spec, fix_frame_n, pad=0.0)¶
Adjust frame length of a spectrogram.
- Parameters:
spec (numpy.ndarray) –
input data. requirements :
input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
fix_frame_n (int) – target number of time frames.
pad (float, optional, default: 0.0) – constant value to fill the added frames.
- Returns:
res – spectrogram with adjusted frame length.
- Return type:
numpy.ndarray
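The assumed behavior is to truncate the time axis, or pad it with a constant, until it has exactly `fix_frame_n` frames; `fix_frame_len_ref` below is an illustrative NumPy sketch, not the library's code:

```python
import numpy as np

# Hedged sketch: pad (with a constant) or truncate the last axis so the
# spectrogram has exactly fix_frame_n time frames.
def fix_frame_len_ref(spec, fix_frame_n, pad=0.0):
    frame_n = spec.shape[-1]
    if frame_n >= fix_frame_n:
        return spec[..., :fix_frame_n]
    width = [(0, 0)] * (spec.ndim - 1) + [(0, fix_frame_n - frame_n)]
    return np.pad(spec, width, constant_values=pad)

spec = np.ones((5, 3))                    # (freq_n, frame_n)
print(fix_frame_len_ref(spec, 4).shape)   # (5, 4): one frame of padding added
print(fix_frame_len_ref(spec, 2).shape)   # (5, 2): truncated
```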
- ailia.audio.get_fb_matrix(sample_rate, freq_n, f_min=0.0, f_max=None, mel_n=128, norm=False, htk=False)¶
Create a filterbank matrix to combine FFT bins into mel-frequency bins.
- Parameters:
sample_rate (int) – sampling rate of the incoming signal
freq_n (int) – number of FFT bins.
f_min (float, optional, default: 0.0) – minimum frequency.
f_max (float, optional, default: sample_rate // 2) – maximum frequency.
mel_n (int, optional, default: 128) – number of mel bands.
norm (bool, optional, default: False) – normalize created filterbank matrix.
htk (bool, optional, default: False) – use HTK formula instead of Slaney’s formula.
- Returns:
res – created filterbank matrix.
- Return type:
numpy.ndarray
- ailia.audio.get_frame_len(sample_n, fft_n, hop_n=None, center_mode=1)¶
Calculate the number of frames when a spectrogram is created.
- Parameters:
sample_n (int) – length of audio signal.
fft_n (int) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m (m = 1,2,…)
hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
- Returns:
frame_n – frame number of created spectrogram.
- Return type:
int
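This frame count can be sketched with the formulas common to STFT implementations. The helper below is illustrative, not the library's code, and center_mode 2 (zero padding plus segment alignment, scipy-style) may yield a different count than mode 1:

```python
# Hedged sketch of the assumed frame-count formulas for an STFT.
def get_frame_len_ref(sample_n, fft_n, hop_n=None, center_mode=1):
    if hop_n is None:
        hop_n = fft_n // 4
    if center_mode == 0:
        return 1 + (sample_n - fft_n) // hop_n  # left-aligned frames, no padding
    return 1 + sample_n // hop_n                # centered frames (mode 1)

print(get_frame_len_ref(16000, 1024))                 # 63 frames, centered
print(get_frame_len_ref(16000, 1024, center_mode=0))  # 59 frames, no padding
```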
- ailia.audio.get_linerfilter_zi_coef(n_coef, d_coef)¶
Create initial-condition coefficients for the linear filter delay.
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
- Returns:
zi – initial-condition coefficients for the linear filter delay.
- Return type:
numpy.ndarray
- ailia.audio.get_resample_len(sample_n, org_sr, target_sr)¶
Calculate the number of samples after resample.
- Parameters:
sample_n (int) – length of audio signal.
org_sr (int) –
sampling rate of input audio signal requirements :
org_sr > 0
target_sr (int) –
target sampling rate requirements :
target_sr > 0
- Returns:
resample_n – length of resampled audio signal.
- Return type:
int
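The result is assumed to be the rounded-up rescaled length, `ceil(sample_n * target_sr / org_sr)`, which matches common resamplers; the helper below is a sketch under that assumption:

```python
import math

# Hedged sketch: resampled length assumed to be ceil(sample_n * target_sr / org_sr).
def get_resample_len_ref(sample_n, org_sr, target_sr):
    return math.ceil(sample_n * target_sr / org_sr)

print(get_resample_len_ref(16000, 16000, 8000))   # 8000: halving the rate halves the length
print(get_resample_len_ref(44100, 44100, 16000))  # 16000
```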
- ailia.audio.get_sample_len(frame_n, freq_n, hop_n=None, center=True)¶
Calculate the number of samples when a signal is inversely transformed from a spectrogram
- Parameters:
frame_n (int) – frame number of spectrogram.
freq_n (int) – number of frequency bins of the spectrogram. freq_n = fft_n // 2 + 1
hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window
center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.
- Returns:
sample_n – length of signal.
- Return type:
int
- ailia.audio.get_window(win_n, win_type)¶
Create a window of a given length and type.
- Parameters:
win_n (int) – window size.
win_type (str or int) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
- Returns:
res – a window of given length and type.
- Return type:
numpy.ndarray
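The two documented window types can be sketched in NumPy. STFT code usually uses the *periodic* variant, built here by dropping the last sample of a symmetric window of length win_n + 1; whether ailia uses the periodic or symmetric form is an assumption:

```python
import numpy as np

# Hedged sketch of the documented window types (periodic variant assumed).
def get_window_ref(win_n, win_type):
    if win_type in ("hann", 1):
        return np.hanning(win_n + 1)[:-1]
    if win_type in ("hamming", 2):
        return np.hamming(win_n + 1)[:-1]
    raise ValueError("unsupported window type")

w = get_window_ref(8, "hann")
print(w.shape, w[0])  # (8,) 0.0: hann starts at zero
```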
- ailia.audio.ifft(spec)¶
Run the inverse fast Fourier transform (IFFT).
- Parameters:
spec (numpy.ndarray(dtype=complex)) – input spectrum.
- Returns:
res – reconstructed signal.
- Return type:
numpy.ndarray(dtype=complex)
- ailia.audio.inverse_spectrogram(spec, hop_n=None, win_n=None, win_type=None, center=True, norm_type=None)¶
Inverse Transform from a spectrogram.
- Parameters:
spec (numpy.ndarray(shape=(1 + fft_n/2, frame_n ) or (ch_n, 1 + fft_n/2, frame_n) ,dtype=complex)) – input spectrogram.
hop_n (int, optional, default: win_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.
norm_type (int, optional, default: 0) –
types of output normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
- Returns:
res – signal reconstructed by the inverse transform of the spectrogram. res.shape :
(sample_n,) if input.ndim == 2 (ch_n, sample_n) if input.ndim == 3
- Return type:
numpy.ndarray(dtype=float)
- ailia.audio.linerfilter(n_coef, d_coef, wav, axis=-1, zi=None)¶
Filter an audio signal using a digital filter (e.g., IIR or FIR).
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
axis (int) – TBD
zi (numpy.ndarray) – initial conditions for the filter delays. If zi is None, zero initial conditions are used.
- Returns:
res (numpy.ndarray) – output filtered audio signal.
zf (numpy.ndarray, optional) – final conditions for the filter delays. If zi is None, this is not returned.
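The difference equation such a filter applies can be sketched in pure NumPy (direct form, zero initial conditions, d_coef[0] == 1); `linerfilter_ref` below is illustrative, not the library's implementation:

```python
import numpy as np

# Hedged sketch of the IIR/FIR difference equation:
#   y[t] = sum_i n_coef[i]*x[t-i] - sum_j d_coef[j]*y[t-j]   (j >= 1)
def linerfilter_ref(n_coef, d_coef, wav):
    y = np.zeros(len(wav), dtype=float)
    for t in range(len(wav)):
        acc = sum(n_coef[i] * wav[t - i] for i in range(len(n_coef)) if t - i >= 0)
        acc -= sum(d_coef[j] * y[t - j] for j in range(1, len(d_coef)) if t - j >= 0)
        y[t] = acc
    return y

# A one-pole IIR filter turns an impulse into a decaying exponential.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
print(linerfilter_ref([1.0], [1.0, -0.5], impulse))  # [1.0, 0.5, 0.25, 0.125]
```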
- ailia.audio.log1p(signal)¶
Calculate log1p ( y = log_e(1.0 + x) ).
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – output signal.
- Return type:
numpy.ndarray
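The documented mapping matches NumPy's numerically stable log1p, which can serve as a reference:

```python
import numpy as np

# log1p is stable for small x, unlike log(1 + x) evaluated naively.
signal = np.array([0.0, 1.0, np.e - 1.0])
print(np.log1p(signal))  # [0.0, log(2), 1.0]
```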
- ailia.audio.magphase(spec, power=1.0, complex_out=True)¶
Separate a complex-valued spectrogram into its magnitude and phase components.
- Parameters:
spec (numpy.ndarray) –
input data. requirements :
input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :
power > 0.0
complex_out (bool, optional, default: True) –
return phase as a complex value.
True : compatible with librosa. False : compatible with pytorch.
- Returns:
res_mag (numpy.ndarray) – magnitude components of the input spectrogram.
res_phase (numpy.ndarray) – phase components of the input spectrogram.
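The decomposition is assumed to satisfy spec = mag · phase, with mag = |spec|**power and phase a unit-magnitude complex array (the complex_out=True, librosa-compatible form); `magphase_ref` below is an illustrative sketch:

```python
import numpy as np

# Hedged sketch of the assumed magnitude/phase decomposition.
def magphase_ref(spec, power=1.0):
    mag = np.abs(spec) ** power
    phase = np.exp(1j * np.angle(spec))  # unit-magnitude complex phase
    return mag, phase

spec = np.array([3 + 4j])
mag, phase = magphase_ref(spec)
print(mag)          # [5.]
print(mag * phase)  # reconstructs [3+4j] when power == 1
```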
- ailia.audio.mel_scale(spec, mel_fb)¶
Convert a spectrogram to a mel spectrogram using a mel filterbank.
- Parameters:
spec (numpy.ndarray) – input real spectrogram. spec.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
mel_fb (numpy.ndarray) – filterbank matrix to combine FFT bins into mel-frequency bins. mel_fb.shape must be (mel_n, freq_n)
- Returns:
res – created mel spectrogram.
- Return type:
numpy.ndarray
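Applying a mel filterbank is assumed to be a matrix product over the frequency axis, res[mel, frame] = Σ_f mel_fb[mel, f] · spec[f, frame]; the toy filterbank below just illustrates the shape contract:

```python
import numpy as np

# Hedged sketch: mel scaling as a matrix product over the frequency axis.
freq_n, frame_n, mel_n = 5, 4, 2
spec = np.ones((freq_n, frame_n))         # flat real spectrogram
mel_fb = np.full((mel_n, freq_n), 0.2)    # toy filterbank (rows sum to 1)
mel_spec = mel_fb @ spec                  # (mel_n, frame_n)
print(mel_spec.shape)   # (2, 4)
print(mel_spec[0, 0])   # 1.0: each mel bin averages the 5 flat bins
```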
- ailia.audio.mel_spectrogram(wav, sample_rate=16000, fft_n=1024, hop_n=None, win_n=None, win_type=1, center_mode=1, power=1.0, fft_norm_type=None, f_min=0.0, f_max=None, mel_n=128, mel_norm=True, htk=False)¶
Create a melspectrogram.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
sample_rate (int, optional, default: 16000) – sample rate of input audio signal.
fft_n (int, optional, default: 1024) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m, where m is a natural number
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :
power > 0.0
fft_norm_type (int, optional, default: 0) –
types of spectrogram normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
f_min (float, optional, default: 0.0) – minimum frequency.
f_max (float, optional, default: sample_rate // 2) – maximum frequency.
mel_n (int, optional, default: 128) – number of mel filter banks.
mel_norm (bool, optional, default: True) – normalize the mel spectrogram.
htk (bool, optional, default: False) –
convert frequency to mel scale using the HTK formula.
True : use the HTK formula (compatible with pytorch). False : use Slaney’s formula (compatible with librosa’s default setting).
- Returns:
res – created melspectrogram.
- Return type:
numpy.ndarray
- ailia.audio.resample(wav, org_sr, target_sr)¶
Resample an audio signal from its original sampling rate to a target sampling rate.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
org_sr (int) –
sampling rate of input audio signal requirements :
org_sr > 0
target_sr (int) –
target sampling rate requirements :
target_sr > 0
- Returns:
res – resampled audio signal.
- Return type:
numpy.ndarray
- ailia.audio.spectrogram(wav, fft_n=1024, hop_n=None, win_n=None, win_type=None, center_mode=1, power=None, norm_type=None)¶
Create a spectrogram from an audio signal.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
fft_n (int, optional, default: 1024) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m, where m is a natural number
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. requirements :
power > 0.0
norm_type (int, optional, default: 0) –
types of output normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
- Returns:
res – created spectrogram. res.shape :
(freq_n, frame_n) if input.ndim == 1 (ch_n, freq_n, frame_n) if input.ndim == 2
- Return type:
numpy.ndarray
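The output shape contract can be sketched with a centered NumPy STFT matching the documented defaults (hann window, hop_n = fft_n // 4, reflection padding, magnitude output). `spectrogram_ref` below is an illustrative sketch, not the library's implementation:

```python
import numpy as np

# Hedged sketch of a centered magnitude STFT under the documented defaults.
def spectrogram_ref(wav, fft_n=1024, hop_n=None):
    hop_n = hop_n or fft_n // 4
    win = np.hanning(fft_n + 1)[:-1]                  # periodic hann window
    padded = np.pad(wav, fft_n // 2, mode="reflect")  # center_mode=1 padding
    frame_n = 1 + len(wav) // hop_n
    frames = np.stack([padded[i * hop_n : i * hop_n + fft_n] for i in range(frame_n)])
    return np.abs(np.fft.rfft(frames * win, axis=-1)).T  # (freq_n, frame_n)

wav = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
spec = spectrogram_ref(wav)
print(spec.shape)  # (513, 17): fft_n // 2 + 1 bins, 1 + sample_n // hop_n frames
```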
- ailia.audio.standardize(signal)¶
Standardize input signal.
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – standardized signal.
- Return type:
numpy.ndarray
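"Standardize" is assumed here to mean zero-mean, unit-variance scaling of the whole signal; `standardize_ref` below sketches that assumption:

```python
import numpy as np

# Hedged sketch: subtract the mean, divide by the standard deviation.
def standardize_ref(signal):
    return (signal - signal.mean()) / signal.std()

x = np.array([1.0, 2.0, 3.0, 4.0])
y = standardize_ref(x)
print(y.mean(), y.std())  # ~0.0, ~1.0
```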
- ailia.audio.trim(wav, thr_db=60, ref=<function amax>, frame_length=2048, hop_length=512)¶
Truncate the silence before and after an audio signal.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
thr_db (float, optional, default: 60) – Threshold for determining silence
ref – TBD
frame_length (int, optional, default=2048) – length of analysis windows
hop_length (int, optional, default=512) – length of hop between analysis windows
- Returns:
res_trimmed (numpy.ndarray) – output trimmed audio signal.
res_pos (numpy.ndarray, shape=(2,)) – non-silent positions [start, end].