botiverse.models.FastSpeech1 package
Submodules
botiverse.models.FastSpeech1.FastSpeech module
FastSpeech 1.0 interface and implementation from scratch in PyTorch for inference.
- class botiverse.models.FastSpeech1.FastSpeech.MultiHeadAttention(num_head, emb_dim, h_dim, dropout=0.1)
Bases: Module
Multi-Head Attention module with a residual connection and layer normalization. Used as self-attention in the FFT block of the encoder and decoder.
- Parameters:
num_head (int) – Number of attention heads.
emb_dim (int) – Dimension of the input encoder/decoder embeddings.
h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).
dropout (float, optional) – Dropout probability. Default is 0.1.
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- forward(q, k, v, mask)
Pass given query, key, and value through the Multi-Head Attention module.
- Parameters:
q (torch.Tensor) – Query tensor of shape [batch_size, seq_len, emb_dim].
k (torch.Tensor) – Key tensor of shape [batch_size, seq_len, emb_dim].
v (torch.Tensor) – Value tensor of shape [batch_size, seq_len, emb_dim].
mask (torch.Tensor) – Mask to apply to the attention so that padding tokens do not attend and are not attended to.
- Returns:
Output tensor of shape [batch_size, seq_len, emb_dim].
- Return type:
torch.Tensor
- training: bool
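The effect of the mask argument can be illustrated with a minimal single-head sketch in plain Python. This is not the module's implementation: it omits the Wq, Wk, Wv projections, the multiple heads, dropout, the residual connection, and layer normalization, and it only shows the "are not attended to" half of the masking. The function name `masked_attention` and the boolean `key_is_pad` list are hypothetical, and at least one key is assumed to be a real token.

```python
import math

def masked_attention(q, k, v, key_is_pad):
    """Single-head scaled dot-product attention for one sequence.

    q, k, v: lists of seq_len vectors (lists of floats) of equal width.
    key_is_pad: list of seq_len booleans, True where the key is padding.
    Assumes at least one key is not padding.
    """
    scale = math.sqrt(len(q[0]))
    output = []
    for query in q:
        # Raw similarity between this query and every key.
        scores = [sum(a * b for a, b in zip(query, key)) / scale for key in k]
        # Padded keys get -inf so they receive zero weight after the softmax.
        scores = [-math.inf if pad else s for s, pad in zip(scores, key_is_pad)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        output.append([sum(w * val[d] for w, val in zip(weights, v))
                       for d in range(len(v[0]))])
    return output
```

With the last key marked as padding, its value vector contributes nothing to any output row, regardless of how large its entries are.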
- class botiverse.models.FastSpeech1.FastSpeech.Conv1DNet(inp_dim, inner_dim, dropout=0.1)
Bases: Module
1D convolutional network with a residual connection and layer normalization. Used in the FFT block of the encoder and decoder.
- Parameters:
inp_dim (int) – Input dimension.
inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).
dropout (float, optional) – Dropout probability. Default is 0.1.
- forward(x)
Pass given input through the 1D convolutional network. The input comes from the Multi-Head Attention module.
- Parameters:
x (torch.Tensor) – Input tensor of shape [batch_size, seq_len, inp_dim].
- Returns:
Output tensor of shape [batch_size, seq_len, inp_dim].
- Return type:
torch.Tensor
- training: bool
- class botiverse.models.FastSpeech1.FastSpeech.FFTBlock(emb_dim, num_head, h_dim, inner_dim, dropout=0.1)
Bases: Module
FFT block used in the encoder and decoder. It consists of a Multi-Head Attention module followed by a 1D convolutional network.
- Parameters:
emb_dim (int) – Dimension of the input encoder/decoder embeddings.
num_head (int) – Number of heads for the Multi-Head Attention module.
h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).
inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).
dropout (float, optional) – Dropout probability. Default is 0.1.
- forward(input, non_pad_mask, attn_mask)
Pass given encoder/decoder embeddings through the FFT block. The input comes from the previous FFT block or the input embeddings.
- Parameters:
input (torch.Tensor) – Input tensor of shape [batch_size, seq_len, emb_dim].
non_pad_mask (torch.Tensor) – Mask to nullify outputs due to padding tokens.
attn_mask (torch.Tensor) – Mask to apply to the attention so that padding tokens do not attend and are not attended to (the same mask that MultiHeadAttention.forward receives).
- Returns:
Output tensor of shape [batch_size, seq_len, emb_dim] and attention weights tensor of shape [batch_size * num_head, seq_len, seq_len].
- Return type:
tuple(torch.Tensor, torch.Tensor)
- training: bool
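How the two masks might be derived from position indices can be sketched in plain Python. The sketch assumes a convention that this page does not state: padding slots carry position 0, and the attention mask marks padded keys, in line with the mask described for MultiHeadAttention.forward. The helper name `build_masks` and the nested-list layout are hypothetical; the module itself takes torch tensors.

```python
def build_masks(positions, pad=0):
    """positions: list of rows of ints, one row per sequence; `pad` marks padding."""
    # non_pad_mask[b][t] is 1.0 for a real token and 0.0 for padding,
    # so multiplying an output by it zeroes the padded time steps.
    non_pad_mask = [[0.0 if p == pad else 1.0 for p in row] for row in positions]
    # attn_mask[b][i][j] is True when key j is padding, for every query i.
    attn_mask = [[[p == pad for p in row] for _ in row] for row in positions]
    return non_pad_mask, attn_mask
```

For a single sequence with positions `[1, 2, 3, 0]`, every row of the attention mask flags only the last key.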
- class botiverse.models.FastSpeech1.FastSpeech.SinusoidEncodingTable(max_seq_len, inp_dim, padding_idx=None)
Bases: object
Sinusoid encoding table used in the encoder and decoder. It adds positional information to the input embeddings.
- Parameters:
max_seq_len (int) – Maximum sequence length.
inp_dim (int) – Dimension of the input encoder/decoder embeddings.
padding_idx (int, optional) – Index of the padding token. Default is None.
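As an illustration of what such a table contains, here is a minimal sketch of the standard Transformer sinusoid formula in plain Python. It assumes the class follows that formula and zeroes the row at padding_idx; the exact number of rows and the tensor type used by the class are not confirmed by this page, and the function name `sinusoid_table` is hypothetical.

```python
import math

def sinusoid_table(max_seq_len, inp_dim, padding_idx=None):
    """Return max_seq_len rows of inp_dim sinusoidal position features."""
    table = []
    for pos in range(max_seq_len):
        row = []
        for i in range(inp_dim):
            # Dimensions 2k and 2k+1 share one frequency: 10000^(-2k / inp_dim).
            angle = pos / (10000 ** (2 * (i // 2) / inp_dim))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    if padding_idx is not None:
        # The padding position carries no positional signal.
        table[padding_idx] = [0.0] * inp_dim
    return table
```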
- class botiverse.models.FastSpeech1.FastSpeech.Dencoder(mode, vocab_dim, max_seq_len, emb_dim, num_layer, num_head, h_dim, d_inner, mel_num, dropout)
Bases: Module
A single module that implements both the encoder and the decoder of the FastSpeech 1.0 model.
- Parameters:
mode (str) – ‘Encoder’ or ‘Decoder’.
vocab_dim (int, optional) – Vocabulary dimension. None if mode is ‘Decoder’.
max_seq_len (int) – Maximum sequence length as needed by the sinusoid encoding table.
emb_dim (int) – Dimension of the encoder/decoder embeddings.
num_layer (int) – Number of FFT blocks.
num_head (int) – Number of heads for the Multi-Head Attention module.
h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).
d_inner (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).
mel_num (int, optional) – Number of mel spectrogram bins (to map the final decoder embeddings). None if mode is ‘Encoder’.
dropout (float) – Dropout probability.
- forward(inp_seq, inp_seq_pos)
Pass given input sequence through the encoder/decoder.
- Parameters:
inp_seq (torch.Tensor) – Input sequence of shape [batch_size, seq_len] for Encoder or [batch_size, seq_len, emb_dim] for Decoder.
inp_seq_pos (torch.Tensor) – Input sequence positions of shape [batch_size, seq_len].
- Returns:
Output tensor of shape [batch_size, seq_len, emb_dim] for Encoder or [batch_size, seq_len, mel_num] for Decoder.
- Return type:
torch.Tensor
- training: bool
- class botiverse.models.FastSpeech1.FastSpeech.DurationPredictor(inp_dim, inner_dim, kernel_size, padding_size, dropout=0.1)
Bases: Module
Duration predictor module. It predicts the duration of each phoneme in the input sequence of encoder phoneme embeddings.
- Parameters:
inp_dim (int) – Input dimension.
inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).
kernel_size (int) – Kernel size of the convolutional layers.
padding_size (int) – Padding size of the convolutional layers.
dropout (float, optional) – Dropout probability. Default is 0.1.
- forward(encoder_output)
Pass given encoder output through the duration predictor module. The input comes from the encoder.
- Parameters:
encoder_output (torch.Tensor) – Encoder output of shape [batch_size, seq_len, emb_dim].
- Returns:
Predicted duration of each phoneme in the input sequence, as a tensor of shape [batch_size, seq_len].
- Return type:
torch.Tensor
- training: bool
- class botiverse.models.FastSpeech1.FastSpeech.LengthRegulator(inp_dim, inner_dim, kernel_size, padding_size, dropout=0.1)
Bases: Module
Length regulator module. It repeats each encoder output according to the predicted duration of the corresponding phoneme.
- Parameters:
inp_dim (int) – Input dimension.
inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).
kernel_size (int) – Kernel size of the convolutional layers.
padding_size (int) – Padding size of the convolutional layers.
dropout (float, optional) – Dropout probability. Default is 0.1.
- forward(enc_output)
Pass given encoder output through the length regulator module. The input comes from the encoder.
- Parameters:
enc_output (torch.Tensor) – Encoder output of shape [batch_size, seq_len, emb_dim].
- Returns:
Modified encoder output of shape [batch_size, new_seq_len, emb_dim] and new positional encoding for the modified encoder output of shape [batch_size, new_seq_len].
- Return type:
tuple(torch.Tensor, torch.Tensor)
- training: bool
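The repetition step can be sketched for a single sequence in plain Python. This is a simplified illustration, not the module's code: durations are passed in as integers here, whereas the module obtains them from its internal duration predictor, and the 1-based numbering of the new positions is an assumption. The function name `regulate_length` is hypothetical.

```python
def regulate_length(enc_output, durations):
    """Expand one sequence of phoneme vectors to frame level.

    enc_output: list of seq_len vectors.
    durations: list of seq_len non-negative ints (frames per phoneme).
    """
    expanded = []
    for vector, repeat in zip(enc_output, durations):
        # Each phoneme vector is copied once per predicted frame;
        # a duration of 0 drops the phoneme entirely.
        expanded.extend([vector] * repeat)
    # Assumed convention: 1-based positions for the expanded sequence.
    positions = list(range(1, len(expanded) + 1))
    return expanded, positions
```

Here new_seq_len is simply the sum of the durations, which is why the decoder needs a fresh set of positions.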
- class botiverse.models.FastSpeech1.FastSpeech.FastSpeech
Bases: Module
FastSpeech 1.0 model. It consists of an encoder, a duration predictor, a length regulator, and a decoder.
Initialize the structure of the FastSpeech 1.0 model.
- training: bool
- forward(text_seq, src_pos)
Pass given input sequence through the FastSpeech 1.0 model.
- Parameters:
text_seq (torch.Tensor) – Input sequence of shape [batch_size, seq_len], in which each character is represented by its unique id.
src_pos (torch.Tensor) – Input sequence positions (indices) of shape [batch_size, seq_len].
- Returns:
Predicted mel spectrogram of shape [batch_size, mel_num, new_seq_len].
- Return type:
torch.Tensor
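One plausible way to build position indices for a padded batch is sketched below in plain Python. It assumes real tokens are numbered from 1 and padding slots receive 0; this convention is common in FastSpeech implementations but is not stated on this page. The helper name `make_positions` is hypothetical, and the result would still need converting to a torch tensor before being passed as src_pos.

```python
def make_positions(lengths, pad=0):
    """lengths: true (unpadded) length of each sequence in the batch."""
    max_len = max(lengths)
    # Real tokens are numbered from 1; trailing padding slots get `pad`.
    return [[t + 1 if t < n else pad for t in range(max_len)] for n in lengths]
```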
- class botiverse.models.FastSpeech1.FastSpeech.TTS(force_download_wg=False, force_download_fs=False)
Bases: object
Text-to-Speech (TTS) class that combines the FastSpeech 1.0 model and the WaveGlow model for speech synthesis.
- Parameters:
force_download_wg (bool, optional) – Whether to download the WaveGlow weights again even if they already appear to exist. Default is False.
force_download_fs (bool, optional) – Whether to download the FastSpeech 1.0 weights again even if they already appear to exist. Default is False.
- speak(text, play=False, save=False)
Pass given text through the FastSpeech 1.0 model and the WaveGlow model to generate speech.
- Parameters:
text (str) – Text to be spoken, at most 300 characters long.
play (bool, optional) – Whether to play the generated speech. Default is False.
save (bool, optional) – Whether to save the generated speech as an audio file. Default is False.
- Returns:
The generated speech as an audio signal.
- Return type:
numpy.ndarray
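Because speak accepts at most 300 characters, longer passages have to be split before synthesis. The helper below is a hypothetical sketch of one way to do that at word boundaries; `split_for_tts` is not part of botiverse. The commented usage relies only on the TTS constructor and speak signature documented above, and it assumes that constructing TTS may download model weights.

```python
def split_for_tts(text, limit=300):
    """Split text into chunks of at most `limit` characters at word boundaries."""
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single word that is longer than the limit.
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= limit:
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

# Usage with the documented class (not run here; may download weights):
# from botiverse.models.FastSpeech1.FastSpeech import TTS
# tts = TTS()
# clips = [tts.speak(chunk) for chunk in split_for_tts(long_text)]
```

Each element of `clips` would then be one audio signal per chunk; how to join them is left to the caller.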