Elevating Audio Datasets: The Power of Augmentation Techniques

Let's look at different techniques for augmenting audio data.

3 July 2024

Deep learning depends heavily on how the input data is represented and on how well the model generalizes. Data augmentation is a well-established way to combat overfitting and improve the generalization ability of deep neural networks.

In the vast landscape of machine learning, especially when dealing with audio data, the quest for robust and diverse training datasets is paramount. Here enters the hero: audio data augmentation. This technique serves as a magician's wand, expanding the richness and variability of your training data through controlled modifications. By doing so, it equips your models with the prowess to grasp nuanced audio features, ultimately translating into enhanced performance across a myriad of audio-centric tasks.

Time Shifting:

The idea behind time shifting is simple: move the audio left or right by a random amount of time, up to a chosen maximum. Shifting the audio to the left (fast forward) by x seconds moves everything earlier, so the last x seconds are set to 0 (i.e. silence). Shifting the audio to the right (delaying it) by x seconds sets the first x seconds to 0 (i.e. silence).

import numpy as np

def augment_with_time_shift(data, sampling_rate, shift_max, shift_direction):
    # Pick a random shift of up to shift_max seconds, expressed in samples
    max_shift = int(sampling_rate * shift_max)
    shift = np.random.randint(max_shift + 1)
    if shift_direction == 'left':
        shift = -shift  # Negative values move the audio earlier (fast forward)

    # np.roll wraps samples around, so the wrapped region is silenced below
    augmented_data = np.roll(data, shift)
    if shift > 0:
        augmented_data[:shift] = 0  # Shifted right: silence at the start
    elif shift < 0:
        augmented_data[shift:] = 0  # Shifted left: silence at the end

    return augmented_data
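
To see the effect on concrete numbers, here is a minimal sketch of a deterministic shift that fills the gap with zeros directly instead of wrapping samples around. The helper name `shift_with_padding` and its sign convention (positive delays the audio, negative advances it) are assumptions made for this illustration.

```python
import numpy as np

def shift_with_padding(data, shift_samples):
    """Shift a 1-D signal by shift_samples, filling the gap with silence.

    Assumed convention: positive values delay the audio (silence at the start),
    negative values advance it (silence at the end).
    """
    shifted = np.zeros_like(data)
    if shift_samples > 0:
        shifted[shift_samples:] = data[:-shift_samples]
    elif shift_samples < 0:
        shifted[:shift_samples] = data[-shift_samples:]
    else:
        shifted[:] = data
    return shifted

signal = np.arange(1, 7)                    # stand-in for audio samples: 1..6
delayed = shift_with_padding(signal, 2)     # [0 0 1 2 3 4]
advanced = shift_with_padding(signal, -2)   # [3 4 5 6 0 0]
```

In a real pipeline the shift would be drawn at random for each training example, as in the function above.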

Time Stretching:

Imagine stretching or compressing the temporal fabric of an audio signal, like a gentle tug-of-war with time itself. This alteration exposes your model to the ebb and flow of natural speech patterns at different speaking rates. Time stretching is the process of changing the speed or duration of an audio signal without affecting its pitch. Here’s a function that uses the librosa library to perform time stretching; a rate above 1 speeds the audio up, and a rate below 1 slows it down:

import librosa

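def time_stretch(audio_file, rate=1.0):
    # Load the audio file (librosa resamples to 22,050 Hz mono by default)
    y, sr = librosa.load(audio_file)

    # Perform time stretching; rate must be passed by keyword in recent librosa versions
    y_stretch = librosa.effects.time_stretch(y, rate=rate)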

    return y_stretch, sr
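
Stretching changes the number of samples in the clip (roughly the original length divided by the rate), while many training pipelines expect inputs of a fixed size. One common way to handle this, sketched below under that assumption, is to crop or zero-pad the stretched clip back to a target length. The helper `fix_length` is a hypothetical name for this illustration; librosa ships a similar utility, `librosa.util.fix_length`.

```python
import numpy as np

def fix_length(y, target_length):
    """Crop or zero-pad a 1-D signal to exactly target_length samples."""
    if len(y) >= target_length:
        return y[:target_length]
    return np.pad(y, (0, target_length - len(y)), mode='constant')

original_length = 8
sped_up = np.ones(4)        # stand-in for a clip stretched with rate=2.0
slowed_down = np.ones(16)   # stand-in for a clip stretched with rate=0.5

padded = fix_length(sped_up, original_length)       # trailing zeros added
cropped = fix_length(slowed_down, original_length)  # tail removed
```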

Pitch Scaling:

Think of adjusting the pitch of an audio signal as tuning a musical instrument to different frequencies. Pitch scaling raises or lowers the pitch of a signal without changing its duration, which helps your model stay robust across different vocal tones and musical nuances. librosa expresses the shift in semitones, so a value of 2.0 raises the pitch by a whole tone.

import librosa
import soundfile as sf

def pitch_scaling(audio_file_path, pitch_scale):
    # Load the audio file
    y, sr = librosa.load(audio_file_path)

    # Shift the pitch by pitch_scale semitones (sr and n_steps are keyword arguments)
    y_pitch_scaled = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_scale)

    # Return the sample rate as well so the caller can save the result
    return y_pitch_scaled, sr

# Usage:
audio_file_path = 'path_to_your_audio_file.wav'
pitch_scale = 2.0  # Number of semitones; negative values lower the pitch

# Get the pitch scaled audio and its sample rate
pitch_scaled_audio, sr = pitch_scaling(audio_file_path, pitch_scale)

# Save the pitch scaled audio to a new file
sf.write('pitch_scaled_audio.wav', pitch_scaled_audio, sr)
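
Because the shift is given in semitones, it can help to translate it into a frequency ratio. The sketch below assumes 12-tone equal temperament (12 semitones per octave, which is also the librosa default); `semitones_to_ratio` is a hypothetical helper written for this illustration.

```python
def semitones_to_ratio(n_steps):
    """Frequency ratio for a shift of n_steps semitones (12-tone equal temperament)."""
    return 2.0 ** (n_steps / 12.0)

a4 = 440.0                                   # concert pitch A, in Hz
up_two = a4 * semitones_to_ratio(2.0)        # about 493.88 Hz (the note B4)
up_octave = a4 * semitones_to_ratio(12.0)    # 880.0 Hz, one octave up
down_octave = a4 * semitones_to_ratio(-12.0) # 220.0 Hz, one octave down
```

For augmentation, small shifts of a few semitones in either direction are usually enough to add variety without making the audio sound unnatural.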

Noise Addition:

Infuse the symphony of real-world ambiance into your training data by incorporating background noise. Whether it's the bustling city streets or the serene whispers of nature, this augmentation enriches your model's understanding of signal amidst noise, a crucial skill in the cacophony of reality.

import numpy as np
import librosa
import soundfile as sf

def noise_addition(audio_file_path, noise_file_path, noise_factor):
    # Load the audio file
    y, sr = librosa.load(audio_file_path)

    # Load the noise file at the same sample rate as the audio
    noise, _ = librosa.load(noise_file_path, sr=sr)

    # Make sure the noise and the audio are the same length
    if len(y) < len(noise):
        noise = noise[0:len(y)]
    else:
        noise = np.pad(noise, (0, len(y) - len(noise)), 'constant')

    # Add the scaled noise to the original audio
    y_noisy = y + noise_factor * noise

    # Return the sample rate as well so the caller can save the result
    return y_noisy, sr

# Usage:
audio_file_path = 'path_to_your_audio_file.wav'
noise_file_path = 'path_to_your_noise_file.wav'
noise_factor = 0.1  # Change this value for different noise levels

# Get the noisy audio and its sample rate
noisy_audio, sr = noise_addition(audio_file_path, noise_file_path, noise_factor)

# Save the noisy audio to a new file
sf.write('noisy_audio.wav', noisy_audio, sr)
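
A fixed noise factor gives different perceived noise levels depending on how loud each clip and each noise recording is. A common alternative, sketched here as an assumption rather than as part of the approach above, is to scale the noise to hit a target signal-to-noise ratio (SNR) in decibels. The helper `add_noise_at_snr` and the synthetic signals are made up for this illustration.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale noise so that the mix has the requested signal-to-noise ratio in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve signal_power / (scale**2 * noise_power) = 10 ** (snr_db / 10) for scale
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)   # synthetic 440 Hz tone
noise = rng.normal(size=clean.shape)        # synthetic white noise

noisy = add_noise_at_snr(clean, noise, snr_db=10)

# Measure the SNR that was actually achieved; it should match the request
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Drawing `snr_db` at random from a range for each example exposes the model to both lightly and heavily corrupted audio.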

Impulse Response Addition:

Transport your model to different acoustic realms by simulating audio propagation through distinct environments. From the intimate confines of a room to the grandeur of a concert hall, this augmentation broadens your model's acoustic repertoire. In practice this means convolving the signal with a room impulse response (RIR), which the example here simulates with the pyroomacoustics library.

import numpy as np
import pyroomacoustics as pra
import soundfile as sf

def impulse_response_addition(audio_file_path, room_dimensions, mic_location, source_location):
    # Load the audio file; mix multi-channel audio down to mono
    y, sr = sf.read(audio_file_path)
    if y.ndim > 1:
        y = y.mean(axis=1)

    # Create a shoebox room that runs at the audio's sample rate
    room = pra.ShoeBox(room_dimensions, fs=sr)

    # Add a microphone; positions are expected as a (3, n_mics) array
    room.add_microphone_array(pra.MicrophoneArray(np.array([mic_location]).T, room.fs))

    # Add the audio source at the source_location
    room.add_source(source_location, signal=y)

    # Compute the room impulse response
    room.compute_rir()

    # Convolve the audio with the impulse response from source 0 to microphone 0
    y_room = np.convolve(y, room.rir[0][0])

    # Return the sample rate as well so the caller can save the result
    return y_room, sr

# Usage:
audio_file_path = 'path_to_your_audio_file.wav'
room_dimensions = [10, 10, 10]  # Room size in meters
mic_location = [5, 5, 5]  # Change this to the location of your microphone
source_location = [2, 2, 2]  # Change this to the location of your audio source

# Get the room audio and its sample rate
room_audio, sr = impulse_response_addition(audio_file_path, room_dimensions, mic_location, source_location)

# Save the room audio to a new file
sf.write('room_audio.wav', room_audio, sr)
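
The core operation is a convolution, so the same idea works with any impulse response, for example one recorded in a real room. The sketch below uses only NumPy and a toy impulse response (decaying noise), which is an assumption standing in for a measured or simulated one; `apply_impulse_response` is a hypothetical helper.

```python
import numpy as np

def apply_impulse_response(signal, impulse_response):
    """Convolve a dry signal with an impulse response and keep the original length."""
    wet = np.convolve(signal, impulse_response)[:len(signal)]
    # Rescale to the dry signal's peak so the result does not clip when saved
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(signal)) / peak)
    return wet

sr = 16000
rng = np.random.default_rng(0)

# Toy impulse response: 0.3 s of noise with an exponential decay
# (an assumption, standing in for a measured room response)
n_ir = int(0.3 * sr)
ir = rng.normal(size=n_ir) * np.exp(-np.arange(n_ir) / (0.05 * sr))
ir[0] = 1.0  # direct sound

dry = np.zeros(sr)
dry[1000] = 1.0  # a single click one second long clip

wet = apply_impulse_response(dry, ir)  # the click is now followed by a decaying tail
```

Trimming the output to the input length drops part of the reverberation tail; keep the full convolution if the tail matters for your task.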

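Low-Pass and High-Pass Filters:

Sculpt the frequency landscape of your audio signals with surgical precision. A low-pass filter keeps the content below a cutoff frequency and attenuates everything above it, while a high-pass filter does the opposite. By selectively attenuating specific frequency ranges, these filters teach your model to cope with audio whose bandwidth has been limited, for example by a cheap microphone or a telephone line.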

from scipy.signal import butter, lfilter
import soundfile as sf

def butter_filter(audio_file_path, cutoff, order=5, filter_type='low'):
    # Load the audio file along with its sample rate
    y, sr = sf.read(audio_file_path)

    # Create the Butterworth filter; the cutoff is normalized by the Nyquist frequency
    b, a = butter(order, cutoff / (0.5 * sr), btype=filter_type)

    # Apply the filter along the time axis (axis 0 also handles multi-channel audio)
    y_filtered = lfilter(b, a, y, axis=0)

    # Return the sample rate as well so the caller can save the result
    return y_filtered, sr

# Usage:
audio_file_path = 'path_to_your_audio_file.wav'
cutoff = 1000.0  # Cutoff frequency in Hz; must be below half the sample rate

# Get the low-pass filtered audio
low_pass_audio, sr = butter_filter(audio_file_path, cutoff, filter_type='low')

# Save the low-pass filtered audio to a new file
sf.write('low_pass_audio.wav', low_pass_audio, sr)

# Get the high-pass filtered audio
high_pass_audio, sr = butter_filter(audio_file_path, cutoff, filter_type='high')

# Save the high-pass filtered audio to a new file
sf.write('high_pass_audio.wav', high_pass_audio, sr)
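
A quick way to check that a filter does what you expect is to run it on a synthetic signal with known frequency content. The sketch below mixes a 200 Hz tone with a 4,000 Hz tone and low-pass filters the mix at 1,000 Hz. Note that it swaps in `sosfiltfilt` with second-order sections, a zero-phase variant that differs from the `lfilter` call above; the `lowpass` helper and the test tones are assumptions made for this illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def lowpass(y, sr, cutoff_hz, order=5):
    """Zero-phase Butterworth low-pass filter using second-order sections."""
    sos = butter(order, cutoff_hz, btype='low', fs=sr, output='sos')
    return sosfiltfilt(sos, y)

sr = 16000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 200 * t)     # below the cutoff, should survive
high_tone = np.sin(2 * np.pi * 4000 * t)   # above the cutoff, should be removed

filtered = lowpass(low_tone + high_tone, sr, cutoff_hz=1000)

# Difference from the low tone alone, ignoring the edges of the clip
residual = np.max(np.abs(filtered[1000:-1000] - low_tone[1000:-1000]))
```

A small residual means the high tone was removed while the low tone passed through essentially unchanged.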

Polarity Inversion:

Flip the audio waveform like turning a kaleidoscope, revealing new patterns and perspectives. Polarity inversion multiplies every sample by -1: the clip sounds the same when played on its own, but the waveform your model sees is different. This simple augmentation fosters adaptability in the face of signal polarity reversals.

import soundfile as sf

def polarity_inversion(audio_file_path):
    # Load the audio file
    y, sr = sf.read(audio_file_path)

    # Perform polarity inversion
    y_inverted = -y

    # Return the sample rate as well so the caller can save the result
    return y_inverted, sr

# Usage:
audio_file_path = 'path_to_your_audio_file.wav'

# Get the inverted audio and its sample rate
inverted_audio, sr = polarity_inversion(audio_file_path)

# Save the inverted audio to a new file
sf.write('inverted_audio.wav', inverted_audio, sr)
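
To build some intuition for what inversion does and does not change, the following sketch checks three properties on a synthetic signal (random noise, used here only as a stand-in for real audio).

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=2048)   # stand-in for an audio clip
y_inverted = -y

# The magnitude spectrum is identical; only the phase flips by 180 degrees
same_magnitude = np.allclose(np.abs(np.fft.rfft(y)), np.abs(np.fft.rfft(y_inverted)))

# Inverting twice restores the original signal exactly
round_trip = np.array_equal(-y_inverted, y)

# Summing a signal with its inverted copy cancels to silence
cancels = bool(np.all(y + y_inverted == 0))
```

Since the magnitude spectrum is unchanged, this augmentation mainly matters for models that consume raw waveforms rather than magnitude spectrograms.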

Concluding Thoughts:

In the tapestry of machine learning, the threads of audio data augmentation weave a narrative of innovation and resilience. By harnessing the transformative power of these techniques, you embark on a journey towards a richer, more diverse training dataset. Remember, experimentation and evaluation are the guiding stars on this odyssey, leading you to the shores of enhanced model performance and unparalleled adaptability in the realm of audio processing.
