botiverse.preprocessors.Special.ConverseBot_Preprocessor package

Submodules

botiverse.preprocessors.Special.ConverseBot_Preprocessor.ConverseBot_Preprocessor module

class botiverse.preprocessors.Special.ConverseBot_Preprocessor.ConverseBot_Preprocessor.ConverseBot_Preprocessor(file_path=None, dataset=None)

Bases: object

An interface that provides the preprocessing required by the ConverseBot bot.

Initializes a ConverseBot_Preprocessor instance with an optional training dataset. The dataset is an array of multi-turn conversations, and each conversation is an array of strings, e.g., [["hi", "hello", "how are you?"], ["good", "how about you?", "i am fine"]].

Parameters:
  • dataset (list of list of str, optional) – Dataset to be processed. Provide either this or file_path.

  • file_path (str, optional) – Path to the .json file that contains the conversation array. Provide either this or dataset.

Returns:

None
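As a minimal sketch of the two input forms described above, the snippet below builds a dataset in memory and writes the same array to a .json file. It uses only the standard library; the file name conversations.json and the temporary directory are illustrative assumptions, and the constructor calls appear only as comments because they require botiverse to be installed.

```python
import json
import os
import tempfile

# Each inner list is one multi-turn conversation; each string is one turn.
dataset = [
    ["hi", "hello", "how are you?"],
    ["good", "how about you?", "i am fine"],
]

# Write the same array to a .json file (hypothetical file name), the form
# that the file_path parameter is documented to accept.
json_path = os.path.join(tempfile.mkdtemp(), "conversations.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(dataset, f)

# Read it back to confirm the structure survives the round trip.
with open(json_path, encoding="utf-8") as f:
    loaded = json.load(f)

# With botiverse installed, one of the two forms would be passed in:
#   ConverseBot_Preprocessor(dataset=dataset)
#   ConverseBot_Preprocessor(file_path=json_path)
```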

process()

Processes the conversation dataset by cleaning it, combining each conversation into a single string (with [C] between turns), and then tokenizing the result.

Returns:

DataFrame containing the processed conversations.

Return type:

DataFrame
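The combining step can be illustrated with a short sketch. The helper join_turns is hypothetical and is not part of botiverse; it only shows how the turns of one conversation could be merged with the [C] marker. Whether the library puts spaces around the marker is not stated here, so this sketch assumes none.

```python
def join_turns(conversation, separator="[C]"):
    """Join the turns of one conversation into a single string.

    Hypothetical helper: assumes the separator is inserted with no
    surrounding spaces, which the documentation does not specify.
    """
    return separator.join(turn.strip() for turn in conversation)


conversations = [
    ["hi", "hello", "how are you?"],
    ["good", "how about you?", "i am fine"],
]

combined = [join_turns(conv) for conv in conversations]
print(combined[0])  # hi[C]hello[C]how are you?
```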

clean_string(string)

Cleans a string by removing certain spaces and newline characters.

Parameters:

string (str) – The string to clean.

Returns:

The cleaned string.

Return type:

str
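The exact cleaning rules are not listed here. As a rough sketch of what this kind of cleanup can look like, the hypothetical function below replaces newlines with spaces, drops spaces before punctuation, and collapses repeated spaces. These rules are assumptions made for illustration, not the library's actual behavior.

```python
import re


def clean_string_sketch(text):
    """Illustrative cleanup; the rules below are assumptions."""
    # Treat newlines and carriage returns as plain spaces.
    text = text.replace("\r", " ").replace("\n", " ")
    # Drop whitespace that appears directly before common punctuation.
    text = re.sub(r"\s+([?.!,])", r"\1", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


print(clean_string_sketch("how are you ?\n"))  # how are you?
```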

tokenize_string(string, target=False)

Tokenizes a string.

Parameters:
  • string (str) – The string to tokenize.

  • target (bool, optional) – Indicates whether the string is a target.

Returns:

Tokenized string.

Return type:

Dict[str, Tensor]

decode_tokens(tokens)

Decodes a sequence of tokens.

Parameters:

tokens (Tensor) – The tokens to decode.

Returns:

The decoded string.

Return type:

str
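To make the relationship between tokenizing and decoding concrete, here is a toy round trip in plain Python. It is not the tokenizer the library uses: the whitespace vocabulary, the helper names, and the input_ids and attention_mask keys (borrowed from a common tokenizer convention) are all assumptions, and plain lists stand in for tensors.

```python
def build_vocab(texts):
    """Assign an integer id to every whitespace-separated word."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def toy_tokenize(text, vocab, max_length=6):
    """Map words to ids, then pad to max_length (lists stand in for tensors)."""
    ids = [vocab.get(word, vocab["[UNK]"]) for word in text.split()][:max_length]
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [vocab["[PAD]"]] * pad,
        "attention_mask": [1] * len(ids) + [0] * pad,
    }


def toy_decode(ids, vocab):
    """Map ids back to words, skipping padding."""
    inverse = {index: word for word, index in vocab.items()}
    return " ".join(inverse[i] for i in ids if i != vocab["[PAD]"])


vocab = build_vocab(["hi [C] hello [C] how are you?"])
encoded = toy_tokenize("hi [C] hello", vocab)
decoded = toy_decode(encoded["input_ids"], vocab)
print(decoded)  # hi [C] hello
```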

process_string(string)

Cleans and tokenizes a conversational string.

Parameters:

string (str) – The conversational string to process.

Returns:

Processed string in token vector form.

Return type:

Dict[str, Tensor]

Module contents