[librosa] and Its Applications in Audio Processing

Reposted article. 2019-12-19 17:24. Tags: Algorithms / Acoustic / Tools / Python

[Continuously updated]

First: import librosa

load

Read a wav file:

wav, sr = librosa.load(path, sr=22050, mono=True, offset=0.0, duration=None, dtype=np.float32, res_type='kaiser_best')

1. Load an audio file as a floating point time series.

2. Audio will be automatically resampled to the given rate (default sr=22050).

3. To preserve the native sampling rate of the file, use sr=None.

Any codec supported by soundfile or audioread will work.

In the source code, decoding is first attempted with soundfile; if that fails, audioread is used as a fallback.

sampling-rate-conversion

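The librosa.load function can read an audio file at a specified sampling rate. Open question: how is the resampling filter implemented?

In the signature shown above, the default resampling type is kaiser_best, which selects the high-quality mode of the `resampy` Python package. See the "Introduction" page of the resampy 0.2.2 documentation: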

resampy is a python module for efficient time-series resampling. It is based on the band-limited sinc interpolation method for sampling rate conversion.

cache

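Reference: "librosa之cache", daisycolour, Sina Blog (sina.com.cn)

Cache levels work in a way similar to logging levels.

For small values of LIBROSA_CACHE_LEVEL, only the most important (most frequently used) data are cached. As the cache level increases, broader classes of functions are cached. Application code may therefore run faster at the cost of higher disk usage.

The cache levels are:

10: filter bases, independent of the audio data (dct, mel, chroma, constant-q)
20: low-level features (cqt, stft, zero-crossings, etc.)
30: high-level features (tempo, beats, decomposition, recurrence, etc.)
40: post-processing (delta, stack_memory, normalize, sync)
The default cache level is 10.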

display

`specshow`(data[, x_coords, y_coords, x_axis, …])	Display a spectrogram/chromagram/cqt/etc.
`waveplot`(y[, sr, max_points, x_axis, …])	Plot the amplitude envelope of a waveform.
`cmap`(data[, robust, cmap_seq, cmap_bool, …])	Get a default colormap from the given data.
`TimeFormatter`([lag, unit])	A tick formatter for time axes.
`NoteFormatter`([octave, major])	Ticker formatter for Notes
`LogHzFormatter`([major])	Ticker formatter for logarithmic frequency
`ChromaFormatter`	A formatter for chroma axes
`TonnetzFormatter`	A formatter for tonnetz axes

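[1] covers many applications of librosa and also points out that the librosa.display module is not included in librosa by default; it has to be imported separately before use: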

import librosa.display

waveplot

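Plot the amplitude envelope of a waveform. Note that waveplot was removed in librosa 0.10; newer releases provide librosa.display.waveshow instead.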

If y is monophonic, a filled curve is drawn between [-abs(y), abs(y)].

If y is stereo, the curve is drawn between [-abs(y[1]), abs(y[0])], so that the left and right channels are drawn above and below the axis, respectively.

Long signals (duration >= max_points) are down-sampled to at most max_sr before plotting.

librosa.display.waveplot(y, sr=22050, max_points=50000.0, x_axis='time', offset=0.0, max_sr=1000, ax=None, **kwargs)

specshow

Display a spectrogram/chromagram/cqt/etc.

librosa.display.specshow(data, x_coords=None, y_coords=None, x_axis=None, y_axis=None, sr=22050, hop_length=512, fmin=None, fmax=None, tuning=0.0, bins_per_octave=12, ax=None, **kwargs)

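Note: in the source code sr defaults to 22050 Hz. If the audio file is sampled at 8 kHz or 16 kHz, be sure to pass the sampling rate explicitly.

The spectrogram can be displayed on different frequency scales: y_axis={'linear', 'log', 'mel', 'cqt_hz', ...}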

feature-extraction

Reference: https://librosa.org/doc/latest/feature.html

melspectrogram

Compute a mel-scaled spectrogram.

librosa.feature.melspectrogram(y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='reflect', power=2.0, **kwargs)

Example:

S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

This is related to mel in filters:

librosa.filters.mel(sr, n_fft, n_mels=128, fmin=0.0, fmax=None, htk=False, norm='slaney', dtype=np.float32)

stft / istft

Short-time Fourier transform / inverse short-time Fourier transform. See the librosa source code and the blog post "librosa語音信號處理" [3].

librosa.stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, pad_mode='reflect')

librosa.core.stft(y, n_fft=2048, hop_length=None, win_length=None, window='hann', center=True, dtype=np.complex64, pad_mode='reflect')   # This function caches at level 20.

The STFT represents a signal in the time-frequency domain by computing discrete Fourier transforms (DFT) over short overlapping windows. This function returns a complex-valued matrix D such that

np.abs(D[f, t]) is the magnitude of frequency bin f at frame t, and
np.angle(D[f, t]) is the phase of frequency bin f at frame t.

Parameters:

y : np.ndarray [shape=(n,)], real-valued

input signal

n_fft : int > 0 [scalar]

length of the windowed signal after padding with zeros. The number of rows in the STFT matrix D is (1 + n_fft/2). The default value, n_fft=2048 samples, corresponds to a physical duration of 93 milliseconds at a sample rate of 22050 Hz, i.e. the default sample rate in librosa. This value is well adapted for music signals. However, in speech processing, the recommended value is 512, corresponding to 23 milliseconds at a sample rate of 22050 Hz. In any case, we recommend setting n_fft to a power of two for optimizing the speed of the fast Fourier transform (FFT) algorithm.

hop_length : int > 0 [scalar]

number of audio samples between adjacent STFT columns.

Smaller values increase the number of columns in D without affecting the frequency resolution of the STFT.

If unspecified, defaults to win_length / 4 (see below).

win_length : int <= n_fft [scalar]

Each frame of audio is windowed by window() of length win_length and then padded with zeros to match n_fft.

Smaller values improve the temporal resolution of the STFT (i.e. the ability to discriminate impulses that are closely spaced in time) at the expense of frequency resolution (i.e. the ability to discriminate pure tones that are closely spaced in frequency). This effect is known as the time-frequency localization tradeoff and needs to be adjusted according to the properties of the input signal y.

If unspecified, defaults to win_length = n_fft.

window : string, tuple, number, function, or np.ndarray [shape=(n_fft,)]

Either:

a window specification (string, tuple, or number); see scipy.signal.get_window
a window function, such as scipy.signal.windows.hann
a vector or array of length n_fft

Defaults to a raised cosine window ('hann'), which is adequate for most applications in audio signal processing.

center : boolean

If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length].

If False, then D[:, t] begins at y[t * hop_length].

Defaults to True, which simplifies the alignment of D onto a time grid by means of librosa.core.frames_to_samples. Note, however, that center must be set to False when analyzing signals with librosa.stream.

dtype : numeric type

Complex numeric type for D. Default is single-precision floating-point complex (np.complex64).

pad_mode : string or function

If center=True, this argument is passed to np.pad for padding the edges of the signal y. By default (pad_mode='reflect'), y is padded on both sides with its own reflection, mirrored around its first and last sample respectively. If center=False, this argument is ignored.

Returns:

D : np.ndarray [shape=(1 + n_fft/2, n_frames), dtype=dtype]: Complex-valued matrix of short-term Fourier transform coefficients.
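The figures quoted in the parameter descriptions can be checked with a few lines of arithmetic. The helper frame_count below is a hypothetical function written for this illustration; it follows the framing rules as described above (n_fft // 2 samples of padding on each side when center=True) and is not part of librosa.

```python
import numpy as np

sr = 22050

# Window duration in milliseconds and number of STFT rows for each n_fft
durations_ms = {n_fft: 1000.0 * n_fft / sr for n_fft in (2048, 512)}
rows = {n_fft: 1 + n_fft // 2 for n_fft in (2048, 512)}
for n_fft in (2048, 512):
    print(n_fft, round(durations_ms[n_fft], 1), "ms,", rows[n_fft], "rows")


def frame_count(n_samples, n_fft=2048, hop_length=None, center=True):
    """Number of STFT columns under the framing rules described above."""
    if hop_length is None:
        hop_length = n_fft // 4          # win_length defaults to n_fft
    if center:
        n_samples += 2 * (n_fft // 2)    # padding added on both sides
    return 1 + (n_samples - n_fft) // hop_length


centered = frame_count(sr, center=True)        # one second of audio
left_aligned = frame_count(sr, center=False)
print(centered, left_aligned)
```

With center=True the count reduces to 1 + n_samples // hop_length, which is why a one-second signal at 22050 Hz yields 44 columns with the default hop of 512.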

librosa.istft(stft_matrix, hop_length=None, win_length=None, window='hann', center=True, length=None)

librosa.core.istft(stft_matrix, hop_length=None, win_length=None, window='hann', center=True, dtype=np.float32, length=None) # This function caches at level 30.

Converts a complex-valued spectrogram stft_matrix to time-series y by minimizing the mean squared error between stft_matrix and STFT of y as described in [2] up to Section 2 (reconstruction from MSTFT).

In general, window function, hop length and other parameters should be same as in stft, which mostly leads to perfect reconstruction of a signal from unmodified stft_matrix.

Parameters:

stft_matrix : np.ndarray [shape=(1 + n_fft/2, t)]

STFT matrix from stft

hop_length : int > 0 [scalar]

Number of audio samples between adjacent STFT columns. If unspecified, defaults to win_length / 4.

win_length : int <= n_fft = 2 * (stft_matrix.shape[0] - 1)

When reconstructing the time series, each frame is windowed and each sample is normalized by the sum of squared window according to the window function (see below).

If unspecified, defaults to n_fft.

window : string, tuple, number, function, np.ndarray [shape=(n_fft,)]

a window specification (string, tuple, or number); see scipy.signal.get_window
a window function, such as scipy.signal.windows.hann
a user-specified window vector of length n_fft

center : boolean

If True, D is assumed to have centered frames.
If False, D is assumed to have left-aligned frames.

dtype : numeric type

Real numeric type for y. Default is 32-bit float.

length : int > 0, optional

If provided, the output y is zero-padded or clipped to exactly length samples.

Returns:

y : np.ndarray [shape=(n,)]: time domain signal reconstructed from stft_matrix

Useful functions

effects.split

librosa.effects.split(y, top_db=60, ref=np.max, frame_length=2048, hop_length=512)

Split an audio signal into non-silent intervals. The parameter descriptions below are taken from the source code.

Parameters:

y : np.ndarray, shape=(n,) or (2, n): An audio signal
top_db : number > 0: The threshold (in decibels) below reference to consider as silence
ref : number or callable: The reference power. By default, it uses np.max and compares to the peak power in the signal.
frame_length : int > 0: The number of samples per analysis frame
hop_length : int > 0: The number of samples between analysis frames

Returns:

intervals : np.ndarray, shape=(m, 2): intervals[i] == (start_i, end_i) are the start and end time (in samples) of non-silent interval i.

References

[1] "音頻特征提取——librosa工具包使用", 桂。, cnblogs (cnblogs.com)

[2] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. ASSP, vol. 32, no. 2, pp. 236–243, Apr. 1984.

[3] "librosa語音信號處理", 凌逆戰, cnblogs (cnblogs.com)
