ailia.audio package

Functions

ailia.audio.complex_norm(spec, power=1.0)

Compute the norm of a complex spectrogram.

Parameters:

  • spec (numpy.ndarray(dtype=complex)) – input spectrogram.

  • power (float, optional, default: 1.0) – exponent applied to the magnitude (e.g. 1 for energy, 2 for power).

Returns:

res – the norm of the complex spectrogram.

Return type:

numpy.ndarray
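The computation can be sketched in numpy (an equivalent formulation, not the library implementation): the norm is the elementwise magnitude raised to `power`.

```python
import numpy as np

# Hypothetical input: a 2x3 complex spectrogram.
spec = np.array([[3 + 4j, 1 + 0j, 0 + 2j],
                 [0 - 1j, 2 + 2j, 1 - 1j]])

# Elementwise magnitude raised to `power` (here 2.0, i.e. a power spectrogram).
power = 2.0
norm = np.abs(spec) ** power
```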

ailia.audio.compute_mel_spectrogram_with_fixed_length(wav, sample_rate=16000, fft_n=2048, hop_n=None, win_n=None, mel_n=128, max_frame_n=128)

Create a mel spectrogram with a fixed number of time frames.

Parameters:
  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • sample_rate (int, optional, default: 16000) – sample rate of input audio signal.

  • fft_n (int, optional, default: 2048) –

    size of FFT, creates fft_n // 2 + 1 bins requirements :

    fft_n == 2 ** m (m = 1,2,…)

  • hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows

  • win_n (int, optional, default: fft_n) – window size.

  • mel_n (int, optional, default: 128) – number of mel filter banks.

  • max_frame_n (int, optional, default: 128) – number of time frames of mel spectrogram.

Returns:

res – created mel spectrogram.

Return type:

numpy.ndarray

ailia.audio.convert_power_to_db(signal, top_db=None)

Convert a spectrogram from the power scale to the decibel scale.

Parameters:
  • signal (numpy.ndarray) – input signal in power scale.

  • top_db (float, optional, default: 80.0) – threshold the output at top_db below the peak.

Returns:

res – output signal in decibel scale.

Return type:

numpy.ndarray
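A numpy sketch of the conversion, assuming librosa-style 10·log10 scaling and peak-relative clipping (the `ref` reference value and `amin` floor below are assumptions, not documented ailia parameters):

```python
import numpy as np

def power_to_db(S, top_db=80.0, ref=1.0, amin=1e-10):
    """Sketch of a power-to-dB conversion. `ref` and `amin` are
    assumed defaults in the style of librosa.power_to_db."""
    log_spec = 10.0 * np.log10(np.maximum(amin, S) / ref)
    if top_db is not None:
        # Clip everything more than top_db below the peak.
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

db = power_to_db(np.array([1.0, 100.0, 1e-12]))
```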

ailia.audio.fft(signal)

Run a fast Fourier transform (FFT).

Parameters:

signal (numpy.ndarray) – input signal.

Returns:

res – created spectrum.

Return type:

numpy.ndarray
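For intuition, numpy's FFT produces the same kind of spectrum; a unit impulse transforms to a flat spectrum:

```python
import numpy as np

signal = np.array([1.0, 0.0, 0.0, 0.0])  # unit impulse, length 4
spec = np.fft.fft(signal)                # flat spectrum of ones
```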

ailia.audio.filterfilter(n_coef, d_coef, wav, axis=-1, padtype='odd', padlen=None)

Apply a filter forward and backward to an audio signal.

Parameters:
  • n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.

  • d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]

  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • axis (int) – TBD

  • padtype (str, int or None, optional, default: odd) –

    type of padding for the input signal extension. requirements :

    None or 0 : no padding
    "odd" or 1 : odd padding mode
    "even" or 2 : even padding mode
    "constant" or 3 : constant padding mode

  • padlen (int or None, optional, default: 3 * max(len(n_coef), len(d_coef))) – number of padding samples at both ends of input signal before forward filtering.

Returns:

res – output filtered audio signal.

Return type:

numpy.ndarray

ailia.audio.fix_frame_len(spec, fix_frame_n, pad=0.0)

Adjust frame length of a spectrogram.

Parameters:
  • spec (numpy.ndarray) –

    input data. requirements :

    input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)

  • fix_frame_n (int) – target number of time frames.

  • pad (float, optional, default: 0.0) – constant value to fill the added frames.

Returns:

res – spectrogram with frame length adjusted to fix_frame_n.

Return type:

numpy.ndarray
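A numpy sketch of the assumed pad-or-truncate behavior along the frame axis (an illustration, not the library implementation):

```python
import numpy as np

def fix_frame_len(spec, fix_frame_n, pad=0.0):
    # Truncate, or pad with the constant `pad`, along the last (frame) axis.
    frame_n = spec.shape[-1]
    if frame_n >= fix_frame_n:
        return spec[..., :fix_frame_n]
    width = [(0, 0)] * (spec.ndim - 1) + [(0, fix_frame_n - frame_n)]
    return np.pad(spec, width, constant_values=pad)

spec = np.ones((5, 3))           # (freq_n, frame_n)
out = fix_frame_len(spec, 4)     # padded to 4 frames
```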

ailia.audio.get_fb_matrix(sample_rate, freq_n, f_min=0.0, f_max=None, mel_n=128, norm=False, htk=False)

Create a filterbank matrix to combine FFT bins into Mel-frequency bins.

Parameters:
  • sample_rate (int) – sampling rate of the incoming signal

  • freq_n (int) – number of FFT bins.

  • f_min (float, optional, default: 0.0) – minimum frequency.

  • f_max (float, optional, default: sample_rate // 2) – maximum frequency.

  • mel_n (int, optional, default: 128) – number of mel bands.

  • norm (bool, optional, default: False) – normalize created filterbank matrix.

  • htk (bool, optional, default: False) – use HTK formula instead of Slaney’s formula.

Returns:

res – created filterbank matrix.

Return type:

numpy.ndarray

ailia.audio.get_frame_len(sample_n, fft_n, hop_n=None, center_mode=1)

Calculate the number of frames when a spectrogram is created.

Parameters:
  • sample_n (int) – length of audio signal.

  • fft_n (int) –

    size of FFT, creates fft_n // 2 + 1 bins requirements :

    fft_n == 2 ** m (m = 1,2,…)

  • hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window

  • center_mode (int, optional, default: 1) –

    whether to pad an audio signal on both sides.

    0 : ignored (no padding).
    1 : the audio signal is padded on both sides with its own reflection, mirrored around its first and last sample respectively.
    2 : the audio signal is padded on both sides with zeros, then padded to an integer number of windowed segments.

Returns:

frame_n – number of frames of the created spectrogram.

Return type:

int
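Assuming librosa-style conventions, the frame count can be sketched as below (center_mode=2 additionally rounds up to a whole number of segments and is omitted from this sketch):

```python
def get_frame_len(sample_n, fft_n, hop_n=None, center_mode=1):
    # Sketch of the frame count. center_mode=1 assumes the centered STFT
    # convention (fft_n // 2 samples of padding on each side).
    hop_n = hop_n or fft_n // 4
    if center_mode == 0:
        return 1 + (sample_n - fft_n) // hop_n
    return 1 + sample_n // hop_n
```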

ailia.audio.get_linerfilter_zi_coef(n_coef, d_coef)

Create initial-condition coefficients for the linear filter delays.

Parameters:
  • n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.

  • d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]

Returns:

zi – initial-condition coefficients for the linear filter delays.

Return type:

numpy.ndarray

ailia.audio.get_resample_len(sample_n, org_sr, target_sr)

Calculate the number of samples after resample.

Parameters:
  • sample_n (int) – length of audio signal.

  • org_sr (int) –

    sampling rate of input audio signal requirements :

    org_sr > 0

  • target_sr (int) –

    target sampling rate requirements :

    target_sr > 0

Returns:

resample_n – length of resampled audio signal.

Return type:

int
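Assuming the rescaled length is rounded up to the nearest integer (librosa's convention; the exact rounding used by ailia is an assumption here), the calculation reduces to:

```python
import math

def get_resample_len(sample_n, org_sr, target_sr):
    # Rescale the sample count by the ratio of sampling rates, rounding up.
    return int(math.ceil(sample_n * target_sr / org_sr))
```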

ailia.audio.get_sample_len(frame_n, freq_n, hop_n=None, center=True)

Calculate the number of samples when a signal is inversely transformed from a spectrogram.

Parameters:
  • frame_n (int) – frame number of spectrogram.

  • freq_n (int) – number of frequency bins of the spectrogram. freq_n = fft_n // 2 + 1

  • hop_n (int, optional, default: (fft_n//4)) – length of hop between STFT window

  • center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.

Returns:

sample_n – length of signal.

Return type:

int
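Assuming the usual inverse-STFT length formula (centered frames drop fft_n // 2 samples of padding on each side), a sketch:

```python
def get_sample_len(frame_n, freq_n, hop_n=None, center=True):
    # Recover fft_n from the bin count, then apply the standard length formula.
    fft_n = 2 * (freq_n - 1)
    hop_n = hop_n or fft_n // 4
    if center:
        return (frame_n - 1) * hop_n       # centered frames: padding removed
    return (frame_n - 1) * hop_n + fft_n   # left-aligned frames
```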

ailia.audio.get_window(win_n, win_type)

Create a window of a given length and type.

Parameters:
  • win_n (int) – window size.

  • win_type (str or int) –

    type of window function. requirements :

    "hann" or 1 : hann window

    "hamming" or 2 : hamming window

Returns:

res – a window of given length and type.

Return type:

numpy.ndarray
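A numpy sketch of the two window types, assuming periodic ("DFT-even") definitions as in scipy.signal.get_window(..., fftbins=True); whether ailia uses the periodic or symmetric variant is an assumption:

```python
import numpy as np

def get_window(win_n, win_type):
    # Periodic hann / hamming windows of length win_n.
    n = np.arange(win_n)
    if win_type in ("hann", 1):
        return 0.5 - 0.5 * np.cos(2 * np.pi * n / win_n)
    if win_type in ("hamming", 2):
        return 0.54 - 0.46 * np.cos(2 * np.pi * n / win_n)
    raise ValueError("unknown window type")

w = get_window(8, "hann")
```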

ailia.audio.ifft(spec)

Run an inverse fast Fourier transform (IFFT).

Parameters:

spec (numpy.ndarray(dtype=complex)) – input spectrum.

Returns:

res – reconstructed signal.

Return type:

numpy.ndarray(dtype=complex)

ailia.audio.inverse_spectrogram(spec, hop_n=None, win_n=None, win_type=None, center=True, norm_type=None)

Inverse Transform from a spectrogram.

Parameters:
  • spec (numpy.ndarray(shape=(1 + fft_n/2, frame_n) or (ch_n, 1 + fft_n/2, frame_n), dtype=complex)) – input spectrogram.

  • hop_n (int, optional, default: win_n // 4) – length of hop between STFT windows

  • win_n (int, optional, default: fft_n) – window size.

  • win_type (str or int, optional, default: 1) –

    type of window function. requirements :

    "hann" or 1 : hann window

    "hamming" or 2 : hamming window

  • center (bool, optional, default: True) – True : input spectrogram is assumed to have centered frames. False : input spectrogram is assumed to have left-aligned frames.

  • norm_type (int, optional, default: 0) –

    types of output normalization. requirements :

    0 : ignored.
    1 : compatible with librosa and pytorch.
    2 : compatible with scipy.

Returns:

res – signal reconstructed by inverse transformation of the spectrogram. res.shape :

(sample_n,) if spec.ndim == 2
(ch_n, sample_n) if spec.ndim == 3

Return type:

numpy.ndarray(dtype=float)

ailia.audio.linerfilter(n_coef, d_coef, wav, axis=-1, zi=None)

Filter an audio signal using a digital filter (e.g. IIR or FIR).

Parameters:
  • n_coef (numpy.ndarray(ndim = 1)) – numerator coefficient.

  • d_coef (numpy.ndarray(ndim = 1)) – denominator coefficient. If d_coef[0] is not 1, n_coef and d_coef are normalized by d_coef[0]

  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • axis (int) – TBD

  • zi (numpy.ndarray) – initial conditions for the filter delays. If zi is None, zero initial conditions are used.

Returns:

  • res (numpy.ndarray) – output filtered audio signal.

  • zf (numpy.ndarray, optional) – final conditions for the filter delays. If zi is None, this is not returned.
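The filtering follows the standard difference equation a[0]·y[n] = Σ b[i]·x[n−i] − Σ a[j]·y[n−j], as in scipy.signal.lfilter. A minimal numpy sketch with zero initial state (an illustration, not the library implementation):

```python
import numpy as np

def linear_filter(n_coef, d_coef, wav):
    # Normalize by d_coef[0], then apply the IIR/FIR difference equation.
    b = np.asarray(n_coef, dtype=float) / d_coef[0]
    a = np.asarray(d_coef, dtype=float) / d_coef[0]
    y = np.zeros_like(wav, dtype=float)
    for n in range(len(wav)):
        acc = sum(b[i] * wav[n - i] for i in range(len(b)) if n - i >= 0)
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y[n] = acc
    return y

# Two-tap moving-average FIR: b = [0.5, 0.5], a = [1.0]
out = linear_filter([0.5, 0.5], [1.0], np.array([1.0, 1.0, 1.0, 1.0]))
```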

ailia.audio.log1p(signal)

calculate log1p ( y = log_e(1.0 + x) )

Parameters:

signal (numpy.ndarray) – input signal.

Returns:

res – output signal.

Return type:

numpy.ndarray

ailia.audio.magphase(spec, power=1.0, complex_out=True)

Separate a complex-valued spectrogram into its magnitude and phase components.

Parameters:
  • spec (numpy.ndarray) –

    input data. requirements :

    input.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)

  • power (float, optional, default: 1.0) –

    exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :

    power > 0.0

  • complex_out (bool, optional, default: True) –

    return phase as complex value.

    True : compatible with librosa
    False : compatible with pytorch

Returns:

  • res_mag (numpy.ndarray) – magnitude components of the input spectrogram.

  • res_phase (numpy.ndarray) – phase components of the input spectrogram.
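A numpy sketch of the separation for the complex_out=True case; when power == 1, mag * phase reconstructs the input:

```python
import numpy as np

spec = np.array([[3 + 4j, 0 + 2j]])   # toy complex spectrogram

power = 1.0
mag = np.abs(spec) ** power           # magnitude component
phase = np.exp(1j * np.angle(spec))   # unit-modulus complex phase
```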

ailia.audio.mel_scale(spec, mel_fb)

Convert a spectrogram to a mel spectrogram using a mel filter bank.

Parameters:
  • spec (numpy.ndarray) – input real spectrogram. spec.shape must be (ch_n, freq_n, frame_n) or (freq_n, frame_n)

  • mel_fb (numpy.ndarray) – filterbank matrix to combine FFT bins into Mel-frequency bins. mel_fb.shape must be (mel_n, freq_n)

Returns:

res – created mel spectrogram.

Return type:

numpy.ndarray
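The operation is a matrix product over the frequency axis; in numpy (shapes only, with random placeholder data):

```python
import numpy as np

# mel_fb (mel_n, freq_n) x spec (freq_n, frame_n) -> (mel_n, frame_n)
mel_n, freq_n, frame_n = 4, 6, 10
mel_fb = np.random.rand(mel_n, freq_n)
spec = np.random.rand(freq_n, frame_n)

mel_spec = mel_fb @ spec
```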

ailia.audio.mel_spectrogram(wav, sample_rate=16000, fft_n=1024, hop_n=None, win_n=None, win_type=1, center_mode=1, power=1.0, fft_norm_type=None, f_min=0.0, f_max=None, mel_n=128, mel_norm=True, htk=False)

Create a mel spectrogram.

Parameters:
  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • sample_rate (int, optional, default: 16000) – sample rate of input audio signal.

  • fft_n (int, optional, default: 1024) –

    size of FFT, creates fft_n // 2 + 1 bins requirements :

    fft_n == 2 ** m, where m is a natural number

  • hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows

  • win_n (int, optional, default: fft_n) – window size.

  • win_type (str or int, optional, default: 1) –

    type of window function. requirements :

    "hann" or 1 : hann window

    "hamming" or 2 : hamming window

  • center_mode (int, optional, default: 1) –

    whether to pad an audio signal on both sides.

    0 : ignored (no padding).
    1 : the audio signal is padded on both sides with its own reflection, mirrored around its first and last sample respectively.
    2 : the audio signal is padded on both sides with zeros, then padded to an integer number of windowed segments.

  • power (float, optional, default: 1.0) –

    exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. requirements :

    power > 0.0

  • fft_norm_type (int, optional, default: 0) –

    types of spectrogram normalization. requirements :

    0 : ignored.
    1 : compatible with librosa and pytorch.
    2 : compatible with scipy.

  • f_min (float, optional, default: 0.0) – minimum frequency.

  • f_max (float, optional, default: sample_rate // 2) – maximum frequency.

  • mel_n (int, optional, default: 128) – number of mel filter banks.

  • mel_norm (bool, optional, default: True) – normalize the mel spectrogram.

  • htk (bool, optional, default: False) –

    convert frequency to the mel scale using the HTK formula.

    True : use the HTK formula (compatible with pytorch).
    False : use Slaney's formula (compatible with librosa's default).

Returns:

res – created mel spectrogram.

Return type:

numpy.ndarray

ailia.audio.resample(wav, org_sr, target_sr)

Resample an audio signal from the original sampling rate to the target sampling rate.

Parameters:
  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • org_sr (int) –

    sampling rate of input audio signal requirements :

    org_sr > 0

  • target_sr (int) –

    target sampling rate requirements :

    target_sr > 0

Returns:

res – resampled audio signal.

Return type:

numpy.ndarray

ailia.audio.spectrogram(wav, fft_n=1024, hop_n=None, win_n=None, win_type=None, center_mode=1, power=None, norm_type=None)

Create a spectrogram from an audio signal.

Parameters:
  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • fft_n (int, optional, default: 1024) –

    size of FFT, creates fft_n // 2 + 1 bins requirements :

    fft_n == 2 ** m, where m is a natural number

  • hop_n (int, optional, default: fft_n // 4) – length of hop between STFT windows

  • win_n (int, optional, default: fft_n) – window size.

  • win_type (str or int, optional, default: 1) –

    type of window function. requirements :

    "hann" or 1 : hann window

    "hamming" or 2 : hamming window

  • center_mode (int, optional, default: 1) –

    whether to pad an audio signal on both sides.

    0 : ignored (no padding).
    1 : the audio signal is padded on both sides with its own reflection, mirrored around its first and last sample respectively.
    2 : the audio signal is padded on both sides with zeros, then padded to an integer number of windowed segments.

  • power (float, optional, default: 1.0) –

    exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. requirements :

    power > 0.0

  • norm_type (int, optional, default: 0) –

    types of output normalization. requirements :

    0 : ignored.
    1 : compatible with librosa and pytorch.
    2 : compatible with scipy.

Returns:

res – created spectrogram. res.shape :

(freq_n, frame_n) if wav.ndim == 1
(ch_n, freq_n, frame_n) if wav.ndim == 2

Return type:

numpy.ndarray
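A minimal numpy sketch of a centered STFT (center_mode=1 reflect padding, hann window); an illustration of the expected shapes, not the library implementation:

```python
import numpy as np

def spectrogram(wav, fft_n=1024, hop_n=None, power=None):
    hop_n = hop_n or fft_n // 4
    # Periodic hann window.
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(fft_n) / fft_n)
    # center_mode=1: reflect-pad fft_n // 2 samples on each side.
    x = np.pad(wav, fft_n // 2, mode="reflect")
    frame_n = 1 + (len(x) - fft_n) // hop_n
    # One rfft per hop-spaced windowed frame -> (freq_n, frame_n).
    spec = np.stack([np.fft.rfft(x[i * hop_n:i * hop_n + fft_n] * win)
                     for i in range(frame_n)], axis=1)
    return spec if power is None else np.abs(spec) ** power

spec = spectrogram(np.random.randn(4096), fft_n=1024)
```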

ailia.audio.standardize(signal)

Standardize input signal.

Parameters:

signal (numpy.ndarray) – input signal.

Returns:

res – standardized signal.

Return type:

numpy.ndarray
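Standardization is assumed here to be the usual zero-mean, unit-variance transform (whether ailia standardizes globally or per channel is not documented):

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0])
standardized = (signal - signal.mean()) / signal.std()
```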

ailia.audio.trim(wav, thr_db=60, ref=<function amax>, frame_length=2048, hop_length=512)

Truncate the silence before and after an audio signal.

Parameters:
  • wav (numpy.ndarray) – input audio signal. wav.shape must be (sample_n,) or (channel_n, sample_n).

  • thr_db (float, optional, default: 60) – threshold (in dB below the reference) for determining silence.

  • ref – TBD

  • frame_length (int, optional, default=2048) – length of analysis windows

  • hop_length (int, optional, default=512) – length of hop between analysis windows

Returns:

  • res_trimmed (numpy.ndarray) – output trimmed audio signal.

  • res_pos (numpy.ndarray, shape=(2,)) – non-silent interval [start, end].
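A simplified per-sample sketch of the trimming logic (the library works frame-wise with frame_length/hop_length and a configurable ref; this collapses that to per-sample amplitude relative to the peak):

```python
import numpy as np

def trim(wav, thr_db=60):
    # Amplitude in dB relative to the peak; keep everything between the
    # first and last sample louder than -thr_db.
    ref = max(np.abs(wav).max(), 1e-10)
    db = 20 * np.log10(np.maximum(np.abs(wav), 1e-10) / ref)
    idx = np.nonzero(db > -thr_db)[0]
    start, end = int(idx[0]), int(idx[-1]) + 1
    return wav[start:end], np.array([start, end])

wav = np.concatenate([np.zeros(100), np.ones(50), np.zeros(100)])
trimmed, pos = trim(wav)
```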