7. Speech Classification Guide#

Botiverse initially implemented a speech classifier that was destined to be used with the voice bot. The speech classifier is simply a model that takes a speech signal and outputs a class.

We will import the speech classifier from botiverse.bots, its there because its a very simple bot of sorts that can understand some speech. We also import voice_input so we can later test it.

from botiverse.bots import SpeechClassifier
from botiverse.bots.VoiceBot.utils import voice_input

In this guide, we will use the speech classifier to tackle the problem of generalizing from a few samples of synthetically generated speech data to real speech data.

7.1. Synthetically Generate Voice Data#

The speech classifier can be trained on existing data, it takes the words or phrases to classify between and if no existing dataset folder is found while calling generate_read_data, it will synthetically generate data for each word while performing random useful audio transformations (can be customly passed) and corrupting the dataset with noise.

S = SpeechClassifier(['Yes', 'No'], samplerate=16000, duration=0.9, machine='lstm', repr='wav2vec')
X, y = S.generate_read_data(force_download_noise=False, n=5)

While making an instance of the speech classifer we use lstm as the core model and wav2vec as a representation. The speech classifier also supports spec and mfcc representations but they make tasks like these (no data available) much harder.

7.1.1. Train the Model#

We train the model and pass all the necessary core model parameters (they are optional, and here the go to the lstm). The patience parameter decides when to stop training. If patience epochs have passed with no improvement, then take the model we had prior to those 100 epochs. Check the documentation for other parameters.

S.fit(X, y, hidden=128, patience=100, α=0.01)
#S.save('speechclassifier')

We can optionally save the model as well and later load it as expected.

7.1.2. Predict#

In this, we record audio for 3 seconds and with a voice_threshold of 900 to later classify the record into either ‘yes’ or ‘no’. The voice_threshold is helpful in not only recording when there is voice, that’s why the output record will often be less than 3 seconds.

audio_path = voice_input(record_time=3, voice_threshold=900)
S.predict(audio_path)