ailia.audio package¶
Functions¶
- ailia.audio.complex_norm(spec, power=1.0)¶
Compute the norm of a complex spectrogram.
- Parameters:
spec (numpy.ndarray(dtype=complex)) – input spectrogram.
power (float, optional, default: 1.0) – exponent for the norm.
- Returns:
res – the norm of the complex spectrogram.
- Return type:
numpy.ndarray
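As a point of reference, the computation is assumed to match the elementwise magnitude raised to `power`; the `complex_norm_ref` helper below is illustrative, not the library's implementation:

```python
import numpy as np

# Hedged NumPy sketch of what complex_norm is assumed to compute:
# the magnitude of each complex bin, raised to `power`.
def complex_norm_ref(spec, power=1.0):
    return np.abs(spec) ** power

spec = np.array([[3 + 4j, 1j], [1 + 0j, 0j]])
print(complex_norm_ref(spec))           # magnitudes: [[5, 1], [1, 0]]
print(complex_norm_ref(spec, power=2))  # squared magnitudes: [[25, 1], [1, 0]]
```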
- ailia.audio.compute_mel_spectrogram_with_fixed_length(wav, sample_rate=16000, fft_n=2048, hop_n=None, win_n=None, mel_n=128, max_frame_n=128)¶
Create a melspectrogram.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
sample_rate (int, optional, default: 16000) – sample rate of input audio signal.
fft_n (int, optional, default: 2048) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m (m = 1,2,…)
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n // 4) – window size.
mel_n (int, optional, default: 128) – number of mel filter banks.
max_frame_n (int, optional, default: 128) – number of time frames of mel spectrogram.
- Returns:
res – created melspectrogram.
- Return type:
numpy.ndarray
- ailia.audio.convert_power_to_db(signal, top_db=None)¶
Convert a spectrogram from the power scale to the decibel scale.
- Parameters:
signal (numpy.ndarray) – input signal in power scale.
top_db (float, optional, default: 80.0) – threshold the output at top_db below the peak.
- Returns:
res – output signal in decibel scale.
- Return type:
numpy.ndarray
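The conversion is assumed to follow the librosa-style convention: 10·log10 of the power values, optionally clipped to no more than `top_db` below the peak. The `power_to_db_ref` helper below is a sketch under that assumption:

```python
import numpy as np

# Hedged sketch of the assumed power-to-dB conversion (librosa-style):
# 10 * log10(power), clipped so no value is more than top_db below the peak.
def power_to_db_ref(signal, top_db=80.0):
    db = 10.0 * np.log10(np.maximum(signal, 1e-10))  # guard against log(0)
    if top_db is not None:
        db = np.maximum(db, db.max() - top_db)
    return db

power = np.array([1.0, 0.1, 1e-6])
print(power_to_db_ref(power))  # [0., -10., -60.] dB
```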
- ailia.audio.fft(signal)¶
Run the fast Fourier transform (FFT).
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – created spectrum.
- Return type:
numpy.ndarray
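The semantics can be checked against NumPy's reference FFT; ailia.audio.fft is assumed to behave like numpy.fft.fft on a 1-D signal:

```python
import numpy as np

# A unit impulse has a flat spectrum, which makes a simple sanity check.
signal = np.array([1.0, 0.0, 0.0, 0.0])
spectrum = np.fft.fft(signal)
print(spectrum)  # all ones for an impulse
```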
- ailia.audio.filterfilter(n_coef, d_coef, wav, axis=-1, padtype='odd', padlen=None)¶
Apply a filter forward and backward to an audio signal.
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
axis (int) – TBD
padtype (str, int or None, optional, default: odd) –
type of padding for the input signal extension. requirements :
None or 0 : no padding. “odd” or 1 : odd padding. “even” or 2 : even padding. “constant” or 3 : constant padding.
padlen (int or None, optional, default: 3 * max(len(n_coef), len(d_coef))) – number of padding samples at both ends of input signal before forward filtering.
- Returns:
res – output filtered audio signal.
- Return type:
numpy.ndarray
- ailia.audio.fix_frame_len(spec, fix_frame_n, pad=0.0)¶
Adjust frame length of a spectrogram.
- Parameters:
spec (numpy.ndarray) –
input data. requirements :
input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
fix_frame_n (int) – target number of time frames.
pad (float, optional, default: 0.0) – constant value to fill the added frames.
- Returns:
res – spectrogram with adjusted frame length.
- Return type:
numpy.ndarray
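The assumed behavior is to truncate the time axis, or pad it with a constant, until it has exactly `fix_frame_n` frames; `fix_frame_len_ref` below is an illustrative NumPy sketch, not the library's code:

```python
import numpy as np

# Hedged sketch: pad (with a constant) or truncate the last axis so the
# spectrogram has exactly fix_frame_n time frames.
def fix_frame_len_ref(spec, fix_frame_n, pad=0.0):
    frame_n = spec.shape[-1]
    if frame_n >= fix_frame_n:
        return spec[..., :fix_frame_n]
    width = [(0, 0)] * (spec.ndim - 1) + [(0, fix_frame_n - frame_n)]
    return np.pad(spec, width, constant_values=pad)

spec = np.ones((5, 3))                    # (freq_n, frame_n)
print(fix_frame_len_ref(spec, 4).shape)   # (5, 4): one frame of padding added
print(fix_frame_len_ref(spec, 2).shape)   # (5, 2): truncated
```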
- ailia.audio.get_fb_matrix(sample_rate, freq_n, f_min=0.0, f_max=None, mel_n=128, norm=False, htk=False)¶
Create a filterbank matrix to combine FFT bins into mel-frequency bins.
- Parameters:
sample_rate (int) – sampling rate of the incoming signal
freq_n (int) – number of FFT bins.
f_min (float, optional, default: 0.0) – minimum frequency.
f_max (float, optional, default: sample_rate // 2) – maximum frequency.
mel_n (int, optional, default: 128) – number of mel bands.
norm (bool, optional, default: False) – normalize created filterbank matrix.
htk (bool, optional, default: False) – use HTK formula instead of Slaney’s formula.
- Returns:
res – created filterbank matrix.
- Return type:
numpy.ndarray
- ailia.audio.get_frame_len(sample_n, fft_n, hop_n=None, center_mode=1)¶
Calculate the number of frames when a spectrogram is created.
- Parameters:
sample_n (int) – length of audio signal.
fft_n (int) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m (m = 1,2,…)
hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
- Returns:
frame_n – frame number of created spectrogram.
- Return type:
int
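This frame count can be sketched with the formulas common to STFT implementations. The helper below is illustrative, not the library's code, and center_mode 2 (zero padding plus segment alignment, scipy-style) may yield a different count than mode 1:

```python
# Hedged sketch of the assumed frame-count formulas for an STFT.
def get_frame_len_ref(sample_n, fft_n, hop_n=None, center_mode=1):
    if hop_n is None:
        hop_n = fft_n // 4
    if center_mode == 0:
        return 1 + (sample_n - fft_n) // hop_n  # left-aligned frames, no padding
    return 1 + sample_n // hop_n                # centered frames (mode 1)

print(get_frame_len_ref(16000, 1024))                 # 63 frames, centered
print(get_frame_len_ref(16000, 1024, center_mode=0))  # 59 frames, no padding
```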
- ailia.audio.get_linerfilter_zi_coef(n_coef, d_coef)¶
Create initial-condition coefficients for the linear filter delay.
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
- Returns:
zi – initial-condition coefficients for the linear filter delay.
- Return type:
numpy.ndarray
- ailia.audio.get_resample_len(sample_n, org_sr, target_sr)¶
Calculate the number of samples after resample.
- Parameters:
sample_n (int) – length of audio signal.
org_sr (int) –
sampling rate of input audio signal requirements :
org_sr > 0
target_sr (int) –
target sampling rate requirements :
target_sr > 0
- Returns:
resample_n – length of resampled audio signal.
- Return type:
int
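The result is assumed to be the rounded-up rescaled length, `ceil(sample_n * target_sr / org_sr)`, which matches common resamplers; the helper below is a sketch under that assumption:

```python
import math

# Hedged sketch: resampled length assumed to be ceil(sample_n * target_sr / org_sr).
def get_resample_len_ref(sample_n, org_sr, target_sr):
    return math.ceil(sample_n * target_sr / org_sr)

print(get_resample_len_ref(16000, 16000, 8000))   # 8000: halving the rate halves the length
print(get_resample_len_ref(44100, 44100, 16000))  # 16000
```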
- ailia.audio.get_sample_len(frame_n, freq_n, hop_n=None, center=True)¶
Calculate the number of samples when a signal is inversely transformed from a spectrogram
- Parameters:
frame_n (int) – frame number of spectrogram.
freq_n (int) – number of frequency bins of the spectrogram. freq_n = fft_n // 2 + 1
hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window
center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.
- Returns:
sample_n – length of signal.
- Return type:
int
- ailia.audio.get_window(win_n, win_type)¶
Create a window of a given length and type.
- Parameters:
win_n (int) – window size.
win_type (str or int) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
- Returns:
res – a window of given length and type.
- Return type:
numpy.ndarray
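The two documented window types can be sketched in NumPy. STFT code usually uses the *periodic* variant, built here by dropping the last sample of a symmetric window of length win_n + 1; whether ailia uses the periodic or symmetric form is an assumption:

```python
import numpy as np

# Hedged sketch of the documented window types (periodic variant assumed).
def get_window_ref(win_n, win_type):
    if win_type in ("hann", 1):
        return np.hanning(win_n + 1)[:-1]
    if win_type in ("hamming", 2):
        return np.hamming(win_n + 1)[:-1]
    raise ValueError("unsupported window type")

w = get_window_ref(8, "hann")
print(w.shape, w[0])  # (8,) 0.0: hann starts at zero
```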
- ailia.audio.ifft(spec)¶
Run the inverse fast Fourier transform (IFFT).
- Parameters:
spec (numpy.ndarray(dtype=complex)) – input spectrum.
- Returns:
res – reconstructed signal.
- Return type:
numpy.ndarray(dtype=complex)
- ailia.audio.inverse_spectrogram(spec, hop_n=None, win_n=None, win_type=None, center=True, norm_type=None)¶
Inverse Transform from a spectrogram.
- Parameters:
spec (numpy.ndarray(shape=(1 + fft_n/2, frame_n ) or (ch_n, 1 + fft_n/2, frame_n) ,dtype=complex)) – input spectrogram.
hop_n (int, optional, default: win_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.
norm_type (int, optional, default: 0) –
types of output normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
- Returns:
res – signal reconstructed by the inverse transform of the spectrogram. res.shape :
(sample_n,) if input.ndim == 2 (ch_n, sample_n) if input.ndim == 3
- Return type:
numpy.ndarray(dtype=float)
- ailia.audio.linerfilter(n_coef, d_coef, wav, axis=-1, zi=None)¶
Filter an audio signal using a digital filter (e.g., IIR or FIR).
- Parameters:
n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.
d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
axis (int) – TBD
zi (numpy.ndarray) – initial conditions for the filter delays. If zi is None, zero initial conditions are used.
- Returns:
res (numpy.ndarray) – output filtered audio signal.
zf (numpy.ndarray, optional) – final conditions for the filter delays. If zi is None, this is not returned.
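The difference equation such a filter applies can be sketched in pure NumPy (direct form, zero initial conditions, d_coef[0] == 1); `linerfilter_ref` below is illustrative, not the library's implementation:

```python
import numpy as np

# Hedged sketch of the IIR/FIR difference equation:
#   y[t] = sum_i n_coef[i]*x[t-i] - sum_j d_coef[j]*y[t-j]   (j >= 1)
def linerfilter_ref(n_coef, d_coef, wav):
    y = np.zeros(len(wav), dtype=float)
    for t in range(len(wav)):
        acc = sum(n_coef[i] * wav[t - i] for i in range(len(n_coef)) if t - i >= 0)
        acc -= sum(d_coef[j] * y[t - j] for j in range(1, len(d_coef)) if t - j >= 0)
        y[t] = acc
    return y

# A one-pole IIR filter turns an impulse into a decaying exponential.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
print(linerfilter_ref([1.0], [1.0, -0.5], impulse))  # [1.0, 0.5, 0.25, 0.125]
```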
- ailia.audio.log1p(signal)¶
Calculate log1p ( y = log_e(1.0 + x) ).
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – output signal.
- Return type:
numpy.ndarray
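The documented mapping matches NumPy's numerically stable log1p, which can serve as a reference:

```python
import numpy as np

# log1p is stable for small x, unlike log(1 + x) evaluated naively.
signal = np.array([0.0, 1.0, np.e - 1.0])
print(np.log1p(signal))  # [0.0, log(2), 1.0]
```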
- ailia.audio.magphase(spec, power=1.0, complex_out=True)¶
Separate a complex-valued spectrogram into its magnitude and phase components.
- Parameters:
spec (numpy.ndarray) –
input data. requirements :
input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :
power > 0.0
complex_out (bool, optional, default: True) –
return phase as a complex value.
True : compatible with librosa. False : compatible with pytorch.
- Returns:
res_mag (numpy.ndarray) – magnitude components of the input spectrogram.
res_phase (numpy.ndarray) – phase components of the input spectrogram.
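The decomposition is assumed to satisfy spec = mag · phase, with mag = |spec|**power and phase a unit-magnitude complex array (the complex_out=True, librosa-compatible form); `magphase_ref` below is an illustrative sketch:

```python
import numpy as np

# Hedged sketch of the assumed magnitude/phase decomposition.
def magphase_ref(spec, power=1.0):
    mag = np.abs(spec) ** power
    phase = np.exp(1j * np.angle(spec))  # unit-magnitude complex phase
    return mag, phase

spec = np.array([3 + 4j])
mag, phase = magphase_ref(spec)
print(mag)          # [5.]
print(mag * phase)  # reconstructs [3+4j] when power == 1
```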
- ailia.audio.mel_scale(spec, mel_fb)¶
Convert a spectrogram to a mel spectrogram using a mel filterbank.
- Parameters:
spec (numpy.ndarray) – input real spectrogram. spec.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)
mel_fb (numpy.ndarray) – filterbank matrix to combine FFT bins into mel-frequency bins. mel_fb.shape must be (mel_n, freq_n)
- Returns:
res – created mel spectrogram.
- Return type:
numpy.ndarray
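Applying a mel filterbank is assumed to be a matrix product over the frequency axis, res[mel, frame] = Σ_f mel_fb[mel, f] · spec[f, frame]; the toy filterbank below just illustrates the shape contract:

```python
import numpy as np

# Hedged sketch: mel scaling as a matrix product over the frequency axis.
freq_n, frame_n, mel_n = 5, 4, 2
spec = np.ones((freq_n, frame_n))         # flat real spectrogram
mel_fb = np.full((mel_n, freq_n), 0.2)    # toy filterbank (rows sum to 1)
mel_spec = mel_fb @ spec                  # (mel_n, frame_n)
print(mel_spec.shape)   # (2, 4)
print(mel_spec[0, 0])   # 1.0: each mel bin averages the 5 flat bins
```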
- ailia.audio.mel_spectrogram(wav, sample_rate=16000, fft_n=1024, hop_n=None, win_n=None, win_type=1, center_mode=1, power=1.0, fft_norm_type=None, f_min=0.0, f_max=None, mel_n=128, mel_norm=True, htk=False)¶
Create a melspectrogram.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
sample_rate (int, optional, default: 16000) – sample rate of input audio signal.
fft_n (int, optional, default: 1024) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m, where m is a natural number
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :
power > 0.0
fft_norm_type (int, optional, default: 0) –
types of spectrogram normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
f_min (float, optional, default: 0.0) – minimum frequency.
f_max (float, optional, default: sample_rate // 2) – maximum frequency.
mel_n (int, optional, default: 128) – number of mel filter banks.
mel_norm (bool, optional, default: True) – normalize the mel spectrogram.
htk (bool, optional, default: False) –
convert frequency to mel scale using the HTK formula.
True : use the HTK formula (compatible with pytorch). False : use Slaney’s formula (compatible with librosa’s default setting).
- Returns:
res – created melspectrogram.
- Return type:
numpy.ndarray
- ailia.audio.resample(wav, org_sr, target_sr)¶
Resample an audio signal from its original sampling rate to a target sampling rate.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
org_sr (int) –
sampling rate of input audio signal requirements :
org_sr > 0
target_sr (int) –
target sampling rate requirements :
target_sr > 0
- Returns:
res – resampled audio signal.
- Return type:
numpy.ndarray
- ailia.audio.spectrogram(wav, fft_n=1024, hop_n=None, win_n=None, win_type=None, center_mode=1, power=None, norm_type=None)¶
Create a spectrogram from an audio signal.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
fft_n (int, optional, default: 1024) –
size of FFT, creates fft_n // 2 + 1 bins requirements :
fft_n == 2 ** m, where m is a natural number
hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows
win_n (int, optional, default: fft_n) – window size.
win_type (str or int, optional, default: 1) –
type of window function. requirements :
”hann” or 1 : hann window “hamming” or 2 : hamming window
center_mode (int, optional, default: 1) –
whether to pad the audio signal on both sides.
0 : no padding. 1 : the signal is padded on both sides with its own reflection, mirrored around its first and last sample. 2 : the signal is padded on both sides with zeros, then padded to an integer number of windowed segments.
power (float, optional, default: 1.0) –
exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. requirements :
power > 0.0
norm_type (int, optional, default: 0) –
types of output normalization. requirements :
0 : ignored. 1 : compatible with librosa and pytorch. 2 : compatible with scipy.
- Returns:
res – created spectrogram. res.shape :
(freq_n, frame_n) if input.ndim == 1 (ch_n, freq_n, frame_n) if input.ndim == 2
- Return type:
numpy.ndarray
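The output shape contract can be sketched with a centered NumPy STFT matching the documented defaults (hann window, hop_n = fft_n // 4, reflection padding, magnitude output). `spectrogram_ref` below is an illustrative sketch, not the library's implementation:

```python
import numpy as np

# Hedged sketch of a centered magnitude STFT under the documented defaults.
def spectrogram_ref(wav, fft_n=1024, hop_n=None):
    hop_n = hop_n or fft_n // 4
    win = np.hanning(fft_n + 1)[:-1]                  # periodic hann window
    padded = np.pad(wav, fft_n // 2, mode="reflect")  # center_mode=1 padding
    frame_n = 1 + len(wav) // hop_n
    frames = np.stack([padded[i * hop_n : i * hop_n + fft_n] for i in range(frame_n)])
    return np.abs(np.fft.rfft(frames * win, axis=-1)).T  # (freq_n, frame_n)

wav = np.random.default_rng(0).standard_normal(4096).astype(np.float32)
spec = spectrogram_ref(wav)
print(spec.shape)  # (513, 17): fft_n // 2 + 1 bins, 1 + sample_n // hop_n frames
```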
- ailia.audio.standardize(signal)¶
Standardize input signal.
- Parameters:
signal (numpy.ndarray) – input signal.
- Returns:
res – standardized signal.
- Return type:
numpy.ndarray
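"Standardize" is assumed here to mean zero-mean, unit-variance scaling of the whole signal; `standardize_ref` below sketches that assumption:

```python
import numpy as np

# Hedged sketch: subtract the mean, divide by the standard deviation.
def standardize_ref(signal):
    return (signal - signal.mean()) / signal.std()

x = np.array([1.0, 2.0, 3.0, 4.0])
y = standardize_ref(x)
print(y.mean(), y.std())  # ~0.0, ~1.0
```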
- ailia.audio.trim(wav, thr_db=60, ref=<function amax>, frame_length=2048, hop_length=512)¶
Truncate the silence before and after an audio signal.
- Parameters:
wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).
thr_db (float, optional, default: 60) – Threshold for determining silence
ref – TBD
frame_length (int, optional, default=2048) – length of analysis windows
hop_length (int, optional, default=512) – length of hop between analysis windows
- Returns:
res_trimmed (numpy.ndarray) – output trimmed audio signal.
res_pos (numpy.ndarray, shape=(2,)) – non-silent positions [start, end].