botiverse.models.FastSpeech1 package

Submodules

botiverse.models.FastSpeech1.FastSpeech module

Inference-only implementation of FastSpeech 1.0, written from scratch in PyTorch, together with its user-facing interface.

class botiverse.models.FastSpeech1.FastSpeech.MultiHeadAttention(num_head, emb_dim, h_dim, dropout=0.1)

Bases: Module

Multi-Head Attention module with a residual connection and layer normalization. Used as self-attention in the FFT block of the encoder and decoder.

Parameters:
  • num_head (int) – Number of attention heads.

  • emb_dim (int) – Dimension of the input encoder/decoder embeddings.

  • h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).

  • dropout (float, optional) – Dropout probability. Default is 0.1.

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(q, k, v, mask)

Pass given query, key, and value through the Multi-Head Attention module.

Parameters:
  • q (torch.Tensor) – Query tensor of shape [batch_size, seq_len, emb_dim].

  • k (torch.Tensor) – Key tensor of shape [batch_size, seq_len, emb_dim].

  • v (torch.Tensor) – Value tensor of shape [batch_size, seq_len, emb_dim].

  • mask (torch.Tensor) – Mask to apply to the attention so that padding tokens do not attend and are not attended to.

Returns:

Output tensor of shape [batch_size, seq_len, emb_dim].

Return type:

torch.Tensor
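
The effect of the padding mask can be illustrated with a minimal single-head sketch in plain Python. It assumes standard scaled dot-product attention, which this page does not confirm as the module's internals; `masked_attention` and `key_is_pad` are hypothetical names that are not part of botiverse.

```python
import math

def masked_attention(q, k, v, key_is_pad):
    """Single-head scaled dot-product attention over plain Python lists.

    q, k, v: lists of seq_len vectors (each a list of floats).
    key_is_pad: list of bools, True where the key position is padding.
    Assumes at least one key is not padding.
    """
    scale = math.sqrt(len(q[0]))
    outputs, all_weights = [], []
    for q_i in q:
        # Padded keys get a score of -inf so softmax assigns them zero weight.
        scores = [
            -math.inf if pad else sum(a * b for a, b in zip(q_i, k_j)) / scale
            for k_j, pad in zip(k, key_is_pad)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(len(v[0]))]
        )
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last position is padding
out, attn = masked_attention(x, x, x, key_is_pad=[False, False, True])
```

Rows for padded queries are still computed in this sketch; FFTBlock.forward takes a separate non_pad_mask to nullify those outputs.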

training: bool

class botiverse.models.FastSpeech1.FastSpeech.Conv1DNet(inp_dim, inner_dim, dropout=0.1)

Bases: Module

1D convolutional network with residual connection and layer normalization. Used in the FFT block of the encoder and decoder.

Parameters:
  • inp_dim (int) – Input dimension.

  • inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).

  • dropout (float, optional) – Dropout probability. Default is 0.1.

forward(x)

Pass given input through the 1D convolutional network. The input comes from the Multi-Head Attention module.

Parameters:

x (torch.Tensor) – Input tensor of shape [batch_size, seq_len, inp_dim].

Returns:

Output tensor of shape [batch_size, seq_len, inp_dim].

Return type:

torch.Tensor
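
The "residual connection and layer normalization" wrapper mentioned for this network (and for the attention module) can be sketched for a single position in plain Python. The add-then-normalize order is an assumption borrowed from the original Transformer, and `add_and_norm` is a hypothetical helper, not a botiverse function.

```python
import math

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Residual connection followed by layer normalization for one position.

    x, sublayer_out: feature vectors (lists of floats) of equal length.
    Learnable scale and bias are omitted for brevity.
    """
    summed = [a + b for a, b in zip(x, sublayer_out)]
    mean = sum(summed) / len(summed)
    var = sum((s - mean) ** 2 for s in summed) / len(summed)
    return [(s - mean) / math.sqrt(var + eps) for s in summed]


y = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5])
```

Because the output keeps the length of `x`, the wrapped sublayer preserves the feature dimension, which matches the identical input and output shapes documented for forward.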

training: bool

class botiverse.models.FastSpeech1.FastSpeech.FFTBlock(emb_dim, num_head, h_dim, inner_dim, dropout=0.1)

Bases: Module

FFT block used in the encoder and decoder. It consists of a Multi-Head Attention module and a 1D convolutional network.

Parameters:
  • emb_dim (int) – Dimension of the input encoder/decoder embeddings.

  • num_head (int) – Number of heads for the Multi-Head Attention module.

  • h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).

  • inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).

  • dropout (float, optional) – Dropout probability. Default is 0.1.

forward(input, non_pad_mask, attn_mask)

Pass given encoder/decoder embeddings through the FFT block. The input comes from the previous FFT block or the input embeddings.

Parameters:
  • input (torch.Tensor) – Input tensor of shape [batch_size, seq_len, emb_dim].

  • non_pad_mask (torch.Tensor) – Mask to nullify outputs due to padding tokens.

  • attn_mask (torch.Tensor) – Mask to apply to the attention so that padding tokens do not attend and are not attended to.

Returns:

Output tensor of shape [batch_size, seq_len, emb_dim] and attention weights tensor of shape [batch_size * num_head, seq_len, seq_len].

Return type:

tuple(torch.Tensor, torch.Tensor)
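
One way the two masks could be derived from position indices is sketched below for a single sequence in plain Python. It assumes that index 0 marks padding, which this page does not state; `build_masks` and `pad_idx` are hypothetical names.

```python
def build_masks(positions, pad_idx=0):
    """Build both masks for one sequence from its position indices.

    positions: list of ints, with pad_idx marking padding (an assumption).
    Returns (non_pad_mask, attn_mask):
      non_pad_mask[i] is 1.0 for real tokens and 0.0 for padding;
      attn_mask[i][j] is True where key j is padding and must be ignored.
    """
    non_pad_mask = [0.0 if p == pad_idx else 1.0 for p in positions]
    key_is_pad = [p == pad_idx for p in positions]
    # Every query row shares the same view of which keys are padding.
    attn_mask = [list(key_is_pad) for _ in positions]
    return non_pad_mask, attn_mask


non_pad, attn = build_masks([1, 2, 3, 0, 0])
```

Multiplying the block output by `non_pad_mask` zeroes the padded positions, while `attn_mask` is applied inside the attention scores.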

training: bool

class botiverse.models.FastSpeech1.FastSpeech.SinusoidEncodingTable(max_seq_len, inp_dim, padding_idx=None)

Bases: object

Sinusoid encoding table used in the encoder and decoder. It is used to add positional information to the input embeddings.

Parameters:
  • max_seq_len (int) – Maximum sequence length.

  • inp_dim (int) – Dimension of the input encoder/decoder embeddings.

  • padding_idx (int, optional) – Index of the padding token. Default is None.

build_table()

Build the sinusoid encoding table.

Returns:

Sinusoid encoding table of shape [max_seq_len, inp_dim], where the row index is the position in the sequence and the column index is the embedding dimension.

Return type:

torch.Tensor
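
A table with this layout can be sketched in plain Python as follows. The formula is the standard Transformer sinusoid encoding and the zeroed padding row is a common convention; both are assumptions here, since this page does not spell out what build_table computes. `sinusoid_table` is a hypothetical stand-in that returns nested lists rather than a tensor.

```python
import math

def sinusoid_table(max_seq_len, inp_dim, padding_idx=None):
    """Return a max_seq_len x inp_dim table as nested lists."""
    table = []
    for pos in range(max_seq_len):
        row = []
        for i in range(inp_dim):
            # Each sin/cos pair (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / inp_dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    if padding_idx is not None:
        # Padding positions carry no positional signal.
        table[padding_idx] = [0.0] * inp_dim
    return table


table = sinusoid_table(max_seq_len=8, inp_dim=4, padding_idx=0)
```

Looking up `table[pos]` for each position index yields the vector that is added to the embedding at that position.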

class botiverse.models.FastSpeech1.FastSpeech.Dencoder(mode, vocab_dim, max_seq_len, emb_dim, num_layer, num_head, h_dim, d_inner, mel_num, dropout)

Bases: Module

A single module that implements the encoder and decoder of the FastSpeech 1.0 model.

Parameters:
  • mode (str) – ‘Encoder’ or ‘Decoder’.

  • vocab_dim (int, optional) – Vocabulary dimension. None if mode is ‘Decoder’.

  • max_seq_len (int) – Maximum sequence length as needed by the sinusoid encoding table.

  • emb_dim (int) – Dimension of the encoder/decoder embeddings.

  • num_layer (int) – Number of FFT blocks.

  • num_head (int) – Number of heads for the Multi-Head Attention module.

  • h_dim (int) – Hidden dimension (output dimension of the linear layers Wq, Wk, Wv).

  • d_inner (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).

  • mel_num (int, optional) – Number of mel spectrogram bins (to map the final decoder embeddings). None if mode is ‘Encoder’.

  • dropout (float) – Dropout probability.

forward(inp_seq, inp_seq_pos)

Pass given input sequence through the encoder/decoder.

Parameters:
  • inp_seq (torch.Tensor) – Input sequence of shape [batch_size, seq_len] for Encoder or [batch_size, seq_len, emb_dim] for Decoder.

  • inp_seq_pos (torch.Tensor) – Input sequence positions of shape [batch_size, seq_len].

Returns:

Output tensor of shape [batch_size, seq_len, emb_dim] for Encoder or [batch_size, seq_len, mel_num] for Decoder.

Return type:

torch.Tensor

training: bool

class botiverse.models.FastSpeech1.FastSpeech.DurationPredictor(inp_dim, inner_dim, kernel_size, padding_size, dropout=0.1)

Bases: Module

Duration predictor module. It predicts the duration of each phoneme in the input sequence of encoder phoneme embeddings.

Parameters:
  • inp_dim (int) – Input dimension.

  • inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).

  • kernel_size (int) – Kernel size of the convolutional layers.

  • padding_size (int) – Padding size of the convolutional layers.

  • dropout (float, optional) – Dropout probability. Default is 0.1.

forward(encoder_output)

Pass given encoder output through the duration predictor module. The input comes from the encoder.

Parameters:

encoder_output (torch.Tensor) – Encoder output of shape [batch_size, seq_len, emb_dim].

Returns:

Predicted duration of each phoneme in the input sequence of encoder phoneme embeddings of shape [batch_size, seq_len].

Return type:

torch.Tensor

training: bool

class botiverse.models.FastSpeech1.FastSpeech.LengthRegulator(inp_dim, inner_dim, kernel_size, padding_size, dropout=0.1)

Bases: Module

Length regulator module. It repeats the encoder outputs according to the predicted duration of each phoneme in the input sequence of encoder phoneme embeddings.

Parameters:
  • inp_dim (int) – Input dimension.

  • inner_dim (int) – Inner dimension of the convolutional layers (output dimension of the first convolutional layer).

  • kernel_size (int) – Kernel size of the convolutional layers.

  • padding_size (int) – Padding size of the convolutional layers.

  • dropout (float, optional) – Dropout probability. Default is 0.1.

forward(enc_output)

Pass given encoder output through the length regulator module. The input comes from the encoder.

Parameters:

enc_output (torch.Tensor) – Encoder output of shape [batch_size, seq_len, emb_dim].

Returns:

Expanded encoder output of shape [batch_size, new_seq_len, emb_dim] and the corresponding new position indices of shape [batch_size, new_seq_len].

Return type:

tuple(torch.Tensor, torch.Tensor)
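
The repetition step can be sketched for one sequence in plain Python. Here the integer durations are passed in explicitly, whereas the constructor arguments above mirror those of DurationPredictor, which suggests the module predicts them internally. `regulate_length` is a hypothetical helper, and the 1-based position indices are an assumption.

```python
def regulate_length(enc_output, durations):
    """Expand one sequence of encoder vectors according to integer durations.

    enc_output: list of seq_len vectors; durations: list of seq_len ints.
    Returns (expanded, positions), where positions are 1-based indices
    (the 1-based convention is an assumption).
    """
    expanded = []
    for vector, repeat in zip(enc_output, durations):
        # Each phoneme vector is copied once per predicted frame.
        expanded.extend([vector] * repeat)
    positions = list(range(1, len(expanded) + 1))
    return expanded, positions


frames, pos = regulate_length([[0.1], [0.2], [0.3]], durations=[2, 1, 3])
```

In this example new_seq_len is 2 + 1 + 3 = 6, the sum of the durations.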

training: bool

class botiverse.models.FastSpeech1.FastSpeech.FastSpeech

Bases: Module

FastSpeech 1.0 model. It consists of an encoder, a duration predictor, a length regulator, and a decoder.

Initialize the structure of the FastSpeech 1.0 model.

training: bool

forward(text_seq, src_pos)

Pass given input sequence through the FastSpeech 1.0 model.

Parameters:
  • text_seq (torch.Tensor) – Input sequence of shape [batch_size, seq_len], in which each character is represented by a unique integer id.

  • src_pos (torch.Tensor) – Input sequence positions (indices) of shape [batch_size, seq_len].

Returns:

Predicted mel spectrogram of shape [batch_size, mel_num, new_seq_len].

Return type:

torch.Tensor
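
How the two inputs relate can be sketched in plain Python. The vocabulary below is hypothetical, and reserving 0 for padding in both the ids and the positions is an assumption; the actual symbol set used by the model is not documented on this page.

```python
def encode_batch(texts, symbol_to_id, pad_id=0):
    """Turn strings into padded id sequences and matching position indices.

    symbol_to_id is a hypothetical vocabulary; ids and positions start at 1
    so that pad_id (0) is reserved for padding. Both choices are assumptions.
    """
    max_len = max(len(t) for t in texts)
    text_seq, src_pos = [], []
    for t in texts:
        ids = [symbol_to_id[ch] for ch in t]
        pad = [pad_id] * (max_len - len(ids))
        text_seq.append(ids + pad)
        src_pos.append(list(range(1, len(ids) + 1)) + pad)
    return text_seq, src_pos


vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
text_seq, src_pos = encode_batch(["hi", "hello"], vocab)
```

Wrapping each nested list with `torch.tensor(...)` would give integer tensors of shape [batch_size, seq_len], the shape forward expects.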

class botiverse.models.FastSpeech1.FastSpeech.TTS(force_download_wg=False, force_download_fs=False)

Bases: object

Text-to-Speech (TTS) class that implements the FastSpeech 1.0 model and the WaveGlow model for speech synthesis.

Parameters:
  • force_download_wg (bool, optional) – Whether to re-download the WaveGlow weights even if they already appear to exist. Default is False.

  • force_download_fs (bool, optional) – Whether to re-download the FastSpeech 1.0 weights even if they already appear to exist. Default is False.

speak(text, play=False, save=False)

Pass given text through the FastSpeech 1.0 model and the WaveGlow model to generate speech.

Parameters:
  • text (str) – Text to be spoken, at most 300 characters long.

  • play (bool, optional) – Whether to play the generated speech. Default is False.

  • save (bool, optional) – Whether to save the generated speech as an audio file. Default is False.

Returns:

The generated speech as an audio signal.

Return type:

numpy.ndarray
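
Because speak accepts at most 300 characters, longer passages need to be split first. The sketch below shows one possible approach; `split_for_tts` and `speak_long` are hypothetical helpers, not part of botiverse, and how speak itself reacts to over-long input is not documented here.

```python
def split_for_tts(text, limit=300):
    """Split text on whitespace into chunks of at most `limit` characters.

    A single word longer than `limit` is hard-split as a last resort.
    """
    chunks, current = [], ""
    for word in text.split():
        while len(word) > limit:
            # Flush what we have, then cut the oversized word into pieces.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def speak_long(tts, text, limit=300):
    """Call tts.speak once per chunk and collect the returned signals."""
    return [tts.speak(chunk) for chunk in split_for_tts(text, limit)]
```

With a real instance this would be called as `speak_long(TTS(), long_text)`. If the returned arrays are one-dimensional, they could then be joined with `numpy.concatenate`.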

Module contents