botiverse.Theorizer.model package

Submodules#

botiverse.Theorizer.model.dataloader module#

class botiverse.Theorizer.model.dataloader.SquadGPT2Example(input_ids: List[int], attention_mask: List[int] | None = None, token_type_ids: List[int] | None = None, lm_labels: List[int] | None = None)[source]#

Bases: object

A single example for the SQuAD dataset, processed for GPT-2 training.

input_ids: List[int]#

attention_mask: List[int] = None#

token_type_ids: List[int] = None#

lm_labels: List[int] = None#

botiverse.Theorizer.model.dataloader.prepare_squad_data_for_gpt2(tokenizer: GPT2Tokenizer, processed_examples: List[SquadProcessedExample]) → List[SquadGPT2Example][source]#

botiverse.Theorizer.model.dataloader.prepare_and_pad_squad_data_for_gpt2(tokenizer: GPT2Tokenizer, processed_examples: List[SquadProcessedExample], max_len: int | None = None, padding: int = 0) → List[SquadGPT2Example][source]#

Prepare and pad SQuAD data for GPT-2.

This function tokenizes and processes the input data, builds the input sequence, token_type_ids, and lm_labels suitable for training GPT-2. It then pads the sequences to the maximum length in the dataset with the specified padding value.

Args:
    tokenizer (GPT2Tokenizer): The GPT-2 tokenizer used to tokenize the text.
    processed_examples (List[SquadProcessedExample]): A list of SquadProcessedExample instances containing the processed SQuAD data.
    padding (int, optional): The padding value to use when padding the sequences. Default is 0.

Returns:
    Dict[str, List[List[int]]]: A dictionary containing the prepared and padded data points with keys ‘input_ids’, ‘token_type_ids’, and ‘lm_labels’.
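
The description above mentions building the input sequence, token_type_ids, and lm_labels. The following is a minimal sketch of one way such fields could be assembled from already-tokenized segments. The function name `build_example_sketch`, the paragraph/answer/question segment order, and the segment type ids are assumptions made for illustration only; they are not taken from the botiverse source. The only detail grounded in this page is that label positions set to -100 are ignored by the loss.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # labels with this value are ignored by the loss


def build_example_sketch(
    paragraph_ids: List[int],
    answer_ids: List[int],
    question_ids: List[int],
    type_ids: Sequence[int] = (0, 1, 2),  # hypothetical segment markers
) -> Dict[str, List[int]]:
    """Concatenate three segments and derive per-token segment ids and labels.

    Assumption: only the question tokens are training targets, so every
    other position in lm_labels is masked with IGNORE_INDEX.
    """
    segments = [paragraph_ids, answer_ids, question_ids]
    # Flatten the segments into one input sequence.
    input_ids = [tok for seg in segments for tok in seg]
    # Repeat each segment's type id once per token in that segment.
    token_type_ids = [t for seg, t in zip(segments, type_ids) for _ in seg]
    # Mask the paragraph and answer; keep the question tokens as targets.
    masked = len(paragraph_ids) + len(answer_ids)
    lm_labels = [IGNORE_INDEX] * masked + list(question_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "lm_labels": lm_labels,
    }
```

All three lists have the same length, which is what lets them be padded together afterwards.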

botiverse.Theorizer.model.dataloader.from_dict_to_squad_processed_example(data: Dict) → SquadProcessedExample[source]#

Convert a dictionary to a SquadProcessedExample instance.

Args:
    data (Dict): A dictionary containing the data to convert.

Returns:
    SquadProcessedExample: The SquadProcessedExample instance containing the data.

botiverse.Theorizer.model.dataloader.read_cached_processed_examples(filepath: str) → List[SquadProcessedExample][source]#

botiverse.Theorizer.model.dataloader.test_getdataset()[source]#

botiverse.Theorizer.model.dataloader.test_tokenizer()[source]#

botiverse.Theorizer.model.dataloader.prepare_squad_for_gpt2(tokenizer: GPT2Tokenizer, processed_examples: List[SquadProcessedExample], split) → List[Dict][source]#

botiverse.Theorizer.model.finetuned_model module#

This module defines a GPT-2 model with a custom head for language modeling. It is adapted from the original Hugging Face implementation. Some functions are taken as is because they are necessary for the model to work: they override methods from the GPT2PreTrainedModel class, such as the prepare_inputs_for_generation method, __reopen_input_ids, etc.

I implemented the forward function with the same set of parameters that the original implementation uses during generation.

class botiverse.Theorizer.model.finetuned_model.MyGPT2LMHeadModel(config)[source]#

Bases: GPT2PreTrainedModel

Initializes internal Module state, shared by both nn.Module and ScriptModule.

get_output_embeddings()[source]#

Returns the model’s output embeddings.

Returns:
    nn.Module: A torch module mapping hidden states to vocabulary.

set_output_embeddings(new_embeddings)[source]#

prepare_inputs_for_generation(input_ids, past_key_values=None, inputs_embeds=None, **kwargs)[source]#

labels (torch.LongTensor of shape (batch_size, sequence_length), optional):
    Labels for language modeling. Note that the labels are shifted inside the model, i.e. you can set labels = input_ids. Indices are selected in [-100, 0, …, config.vocab_size]. All labels set to -100 are ignored (masked); the loss is only computed for labels in [0, …, config.vocab_size].
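
To make the shifting and masking of labels concrete, here is a small pure-Python sketch. It does not call the model; the helper name `shifted_targets` is hypothetical and only illustrates which positions would contribute to the loss when the output at position i is scored against the label at position i + 1 and labels equal to -100 are skipped.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # masked label value


def shifted_targets(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (position, target) pairs that would contribute to the loss.

    The prediction made at position i is compared with labels[i + 1],
    which is why labels can simply be a copy of input_ids. Targets equal
    to IGNORE_INDEX are dropped.
    """
    pairs = []
    for pos in range(len(labels) - 1):
        target = labels[pos + 1]
        if target != IGNORE_INDEX:
            pairs.append((pos, target))
    return pairs


labels = [-100, -100, 11, 12, 13]
print(shifted_targets(labels))  # [(1, 11), (2, 12), (3, 13)]
```

The last position never appears in the result because there is no following label to compare its prediction against.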

training: bool#

botiverse.Theorizer.model.train module#

botiverse.Theorizer.model.utils module#

botiverse.Theorizer.model.utils.get_overlap_position(para_ids, ans_ids, ans_prefix_ids)[source]#

Get the position (start and end indices) of the overlapping region between a paragraph and an answer after the answer prefix.

Args:
    para_ids (list): The paragraph token IDs.
    ans_ids (list): The answer token IDs.
    ans_prefix_ids (list): The prefix token IDs of the answer.

Returns:
    tuple: A tuple representing the start and end indices of the overlapping region.
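
The following is a sketch of one plausible reading of this function, not the actual implementation. It assumes that ans_prefix_ids holds the token IDs of the paragraph text that precedes the answer, that the start index is the first position where para_ids and ans_prefix_ids diverge (or the end of the prefix if they never do), and that the end index is exclusive and clipped to the paragraph length. The name `overlap_position_sketch` is hypothetical.

```python
from typing import List, Tuple


def overlap_position_sketch(
    para_ids: List[int],
    ans_ids: List[int],
    ans_prefix_ids: List[int],
) -> Tuple[int, int]:
    """Estimate where the answer tokens sit inside the paragraph tokens."""
    start = -1
    # Find the first index where the paragraph and the prefix disagree.
    for i, (para_tok, prefix_tok) in enumerate(zip(para_ids, ans_prefix_ids)):
        if para_tok != prefix_tok:
            start = i
            break
    if start == -1:
        # No disagreement: the answer starts right after the prefix.
        start = min(len(ans_prefix_ids), len(para_ids))
    # Assumed convention: exclusive end, never past the paragraph.
    end = min(start + len(ans_ids), len(para_ids))
    return start, end


print(overlap_position_sketch([1, 2, 3, 4, 5, 6], [4, 5], [1, 2, 3]))  # (3, 5)
```

Stopping at the first divergence is a conservative choice: tokenizing the prefix on its own can produce a different last token than tokenizing the full paragraph, and starting the span slightly early avoids cutting off the answer.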

botiverse.Theorizer.model.utils.pad_dataset(dataset, padding=0)[source]#

Pad the dataset. This could be optimized by defining a Dataset class and padding only batches, but this approach is simpler.
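
As an illustration of whole-dataset padding, the sketch below pads every sequence to the length of the longest input. It assumes the dataset is a dictionary of lists keyed by ‘input_ids’, ‘token_type_ids’, and ‘lm_labels’, and that label sequences are filled with -100 so padded positions are ignored by the loss. Both the function name `pad_dataset_sketch` and the choice of fill value for labels are assumptions, not the documented behavior of pad_dataset.

```python
from typing import Dict, List


def pad_dataset_sketch(
    dataset: Dict[str, List[List[int]]],
    padding: int = 0,
    label_key: str = "lm_labels",  # assumed name of the label field
    ignore_index: int = -100,      # assumed fill value for labels
) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest input length in the dataset."""
    max_len = max(len(seq) for seq in dataset["input_ids"])
    padded = {}
    for name, seqs in dataset.items():
        # Labels get the ignore value; all other fields get `padding`.
        fill = ignore_index if name == label_key else padding
        padded[name] = [seq + [fill] * (max_len - len(seq)) for seq in seqs]
    return padded


data = {
    "input_ids": [[1, 2, 3], [4]],
    "token_type_ids": [[0, 0, 1], [0]],
    "lm_labels": [[-100, 2, 3], [4]],
}
print(pad_dataset_sketch(data)["input_ids"])  # [[1, 2, 3], [4, 0, 0]]
```

Padding once up front costs memory when a few sequences are much longer than the rest, which is the trade-off the docstring alludes to when it mentions padding only batches.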