botiverse.Theorizer.squad package#

Submodules#

botiverse.Theorizer.squad.info_extractor module#

class botiverse.Theorizer.squad.info_extractor.ClueInfo(clue_pos_tag: str, clue_ner_tag: str, clue_length: int, clue_chunk: Tuple[str, str, List[str], int, int], clue_answer_dep_path_len: int, padded_selected_clue_binary_ids: numpy.ndarray)[source]#

Bases: object

clue_pos_tag: str#

clue_ner_tag: str#

clue_length: int#

clue_chunk: Tuple[str, str, List[str], int, int]#

clue_answer_dep_path_len: int#

padded_selected_clue_binary_ids: ndarray#

class botiverse.Theorizer.squad.info_extractor.SquadAugmentedExample(context_text: str, question_text: str, answer_text: str, question_type: QuestionType, answer_pos_tag: str, answer_ner_tag: str, answer_length: int, clue_info: ClueInfo)[source]#

Bases: object

A single training/test example for the SQuAD Zhou dataset, after augmenting with clue info and question style.

context_text: str#

question_text: str#

answer_text: str#

question_type: QuestionType#

answer_pos_tag: str#

answer_ner_tag: str#

answer_length: int#

clue_info: ClueInfo#

botiverse.Theorizer.squad.info_extractor.chunks(sentence: str) → List[Tuple[str, str, List[str], int, int]][source]#

Takes a sentence and returns a list of chunks with their respective NER tags, POS tags, words, and start and end positions in the sentence.

Args:

sentence (str): The input sentence to be parsed and chunked.

Returns:

list: A list of tuples where each tuple contains the following elements:

chunk_ner_tag (str): The NER tag of the chunk (e.g., ‘PERSON’, ‘ORG’).
chunk_pos_tag (str): The POS tag of the chunk (e.g., ‘NP’, ‘VP’).
leaves_without_position (list): A list of words in the chunk.
start (int): The start position of the chunk in the sentence.
end (int): The end position of the chunk in the sentence.

botiverse.Theorizer.squad.info_extractor.get_dependency_paths(token_list: List[SpacyToken])[source]#

Given a list of spaCy tokens, extract the dependency paths between different tokens.

Args:

token_list (List[SpacyToken]): A list of spaCy tokens.

Returns:

dict: A dictionary mapping token indices to tokens.
dict: A dictionary mapping token indices to related tokens and their dependency paths.
list: A list of token texts.

botiverse.Theorizer.squad.info_extractor.extract_clue(sentence: str, question: str, answer: str, answer_start: int, config: InfoConfig = InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20)) → ClueInfo[source]#

Given a sentence, question, answer, and the answer’s starting position, this function extracts information about the clues related to the answer.

Args:

sentence (str): The sentence containing the answer.
question (str): The question being asked.
answer (str): The correct answer.
answer_start (int): The starting position of the answer in the sentence.

Returns:

ClueInfo: An object holding all the clue information.

botiverse.Theorizer.squad.info_extractor.extract_question_type_and_id(question, config: InfoConfig = InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20)) → Tuple[QuestionType, int][source]#

Given a question string, returns its type and associated id.

Args:

question (str): A question string.

Returns:

tuple: A tuple containing the question type (QuestionType) and its id (int).

botiverse.Theorizer.squad.info_extractor.extract_clue_and_question_info(sentence: str, question: str, answer: str, answer_start: int, config: ~botiverse.Theorizer.squad.utils.InfoConfig = <class 'botiverse.Theorizer.squad.utils.InfoConfig'>) → SquadAugmentedExample[source]#

Extracts information about the question, answer, and clue from the provided sentence, question, and answer.

Args:

sentence (str): The sentence containing the answer.
question (str): The question being asked.
answer (str): The answer to the question.
answer_start (int): The character index of the answer’s start position in the sentence.
config: A configuration object containing token limits and clue extraction settings.

Returns:

SquadAugmentedExample: An object containing extracted information about the question, answer, and clue.

botiverse.Theorizer.squad.info_extractor.test_chunks()[source]#

botiverse.Theorizer.squad.info_extractor.test_extract_clue()[source]#

botiverse.Theorizer.squad.info_extractor.test_extract_question_type()[source]#

botiverse.Theorizer.squad.info_extractor.test_extract_clue_and_question_info()[source]#

botiverse.Theorizer.squad.sample_data module#

class botiverse.Theorizer.squad.sample_data.AnswerSample(answer_text: str, char_st: int, char_ed: int, st: int, ed: int, answer_bio_ids: List[str], answer_pos_tag: str, answer_ner_tag: str)[source]#

Bases: object

answer_text: str#

char_st: int#

char_ed: int#

st: int#

ed: int#

answer_bio_ids: List[str]#

answer_pos_tag: str#

answer_ner_tag: str#

class botiverse.Theorizer.squad.sample_data.ClueSample(clue_text: str, clue_binary_ids: numpy.ndarray)[source]#

Bases: object

clue_text: str#

clue_binary_ids: ndarray#

botiverse.Theorizer.squad.sample_data.select_answers(chunklist, sentence, sample_probs, config=InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20)) → List[AnswerSample][source]#

Select multiple answer chunks from a given list of chunks based on their probability.

Args:

chunklist (list): A list of chunks, where each chunk is a tuple containing NER tag, POS tag, token leaves, start index, and end index.
sentence (str): The input sentence from which the chunks are extracted.
sample_probs (dict): A dictionary containing the probabilities of different answer conditions.
config (InfoConfig, optional): A configuration object containing parameters for the sampling process.

Returns:

list: A list of sampled answers, where each answer is an AnswerSample holding the answer text, character start index, character end index, token start index, token end index, answer BIO tags, POS tag, and NER tag.

botiverse.Theorizer.squad.sample_data.select_questions(ans: AnswerSample, sample_probs, config=InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20))[source]#

Select question styles based on the answer’s POS and NER tags, given sample probabilities.

Args:

ans (AnswerSample): An AnswerSample holding information about the answer, including its text, indices, BIO tags, POS tag, and NER tag.
sample_probs (dict): A dictionary containing the probabilities of different question styles.
config (InfoConfig, optional): A configuration object containing the maximum number of sampling attempts and the desired number of question styles to sample.

Returns:

list: A list of sampled question styles.
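The sampling described above (draw styles by weight, stop at the desired count or after a maximum number of attempts) can be sketched as follows. This is a minimal stand-in, not the library's implementation: the function name, the flat `style_weights` dictionary, and the `seed` parameter are assumptions for illustration; only `num_sample_style=2` and `max_sample_times=20` mirror the documented InfoConfig defaults.

```python
import random
from typing import Dict, List


def sample_distinct_sketch(
    style_weights: Dict[str, float],
    num_sample: int = 2,          # mirrors InfoConfig.num_sample_style
    max_sample_times: int = 20,   # mirrors InfoConfig.max_sample_times
    seed: int = 0,                # hypothetical, only for reproducibility
) -> List[str]:
    """Draw up to num_sample distinct styles within max_sample_times attempts."""
    rng = random.Random(seed)
    styles = list(style_weights)
    weights = [style_weights[s] for s in styles]
    sampled: List[str] = []
    for _ in range(max_sample_times):
        if len(sampled) >= num_sample:
            break
        # Weights do not have to sum to 1; random.choices handles that.
        pick = rng.choices(styles, weights=weights, k=1)[0]
        if pick not in sampled:
            sampled.append(pick)
    return sampled


# A style with zero weight is never drawn, so fewer than num_sample
# styles may come back once the attempt budget is exhausted.
print(sample_distinct_sketch({"what": 1.0, "who": 0.0}))
```

Capping the number of attempts keeps the loop finite when fewer distinct styles have non-zero weight than were requested.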

botiverse.Theorizer.squad.sample_data.select_clues(chunklist, doc: SpacyDoc, ans: AnswerSample, sample_probs, config=InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20))[source]#

Select clues from a list of chunks based on the dependency distance and the probability of the chunk given the answer.

Args:

chunklist (list): A list of chunks, each containing NER tag, POS tag, text, start index, and end index.
doc (spacy.tokens.Doc): A spaCy document containing the tokens of the sentence.
ans (AnswerSample): An AnswerSample holding information about the answer, including its text, indices, BIO tags, POS tag, and NER tag.
config (InfoConfig, optional): A configuration object containing the maximum number of sampling attempts and the desired number of clues to sample.

Returns:

list: A list of sampled clues, with each clue containing its text and binary ids.

botiverse.Theorizer.squad.sample_data.select(sentence, sample_probs, config=InfoConfig(num_sample_style=2, num_sample_answer=5, ans_len_bin_width=3, ans_len_min_val=0, ans_len_max_val=30, ans_limit=30, num_sample_clue=2, is_clue_topN=20, clue_dep_dist_bin_width=2, clue_dep_dist_min_val=0, clue_dep_dist_max_val=20, ques_limit=50, sent_limit=100, max_sample_times=20))[source]#

botiverse.Theorizer.squad.sample_data.read_sample_probs(sample_probs_path)[source]#

botiverse.Theorizer.squad.sample_data.select_with_default_sampel_probs(sentence)[source]#

botiverse.Theorizer.squad.sample_data.test()[source]#

botiverse.Theorizer.squad.squad_example module#

class botiverse.Theorizer.squad.squad_example.SquadExample(context_text: str, question_text: str, answer_text: str, answer_start: int)[source]#

Bases: object

A single example for the SQuAD Zhou dataset, as loaded from disk.

context_text: str#

question_text: str#

answer_text: str#

answer_start: int#

class botiverse.Theorizer.squad.squad_example.SquadProcessedExample(context_text: str, question_text: str, question_type: str, answer_text: str, answer_start: int, clue_text: str, clue_start: int, para_id: int)[source]#

Bases: object

A single example for the processed SQuAD Zhou dataset, used for training.

context_text: str#

question_text: str#

question_type: str#

answer_text: str#

answer_start: int#

clue_text: str#

clue_start: int#

para_id: int#

botiverse.Theorizer.squad.squad_example.read_squad_examples(input_file: str) → List[SquadExample][source]#

Read a SQuAD Zhou text file into a list of SquadExample.

botiverse.Theorizer.squad.squad_example.create_squad_example_with_info(raw_ex: List[SquadExample]) → List[SquadAugmentedExample][source]#

Augment the raw examples with question-type and clue info.

botiverse.Theorizer.squad.squad_example.calculate_probability_distribution(augmented_examples: List[SquadAugmentedExample]) → Dict[str, Counter][source]#

Calculates the probability distribution of answer, clue, and sentence based on the given list of augmented examples.

The probability distribution is defined as:

P(a, c, s) = p(a) * p(c|a) * p(s|c, a)
           = p(a|a_tag, a_length) * p(c|c_tag, dep_dist) * p(s|a_tag)

Args:

augmented_examples (List[SquadAugmentedExample]): A list of SquadAugmentedExample objects.

Returns:

Dict[str, Counter]: A dictionary containing the probability distribution of answer, clue, and sentence.
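As a rough illustration of how a Dict[str, Counter] can hold the three factors of the formula, the sketch below counts occurrences of each conditioning key. Everything here is an assumption made for the example: the `ExampleStub` record, the dictionary keys `"a"`, `"c|a"` and `"s|c,a"`, and the reading of `s` as the question type are hypothetical, and the real function may key and normalize its counters differently.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class ExampleStub(NamedTuple):
    """Hypothetical stand-in for SquadAugmentedExample with pre-binned values."""
    answer_tag: str
    answer_length_bin: int
    clue_tag: str
    dep_dist_bin: int
    question_type: str  # assumed to play the role of s in the formula


def count_conditions_sketch(examples: List[ExampleStub]) -> Dict[str, Counter]:
    """Count how often each conditioning key occurs for the three factors."""
    counts: Dict[str, Counter] = {"a": Counter(), "c|a": Counter(), "s|c,a": Counter()}
    for ex in examples:
        counts["a"][(ex.answer_tag, ex.answer_length_bin)] += 1   # p(a|a_tag, a_length)
        counts["c|a"][(ex.clue_tag, ex.dep_dist_bin)] += 1        # p(c|c_tag, dep_dist)
        counts["s|c,a"][(ex.question_type, ex.answer_tag)] += 1   # p(s|a_tag)
    return counts


examples = [
    ExampleStub("NP", 1, "VP", 2, "what"),
    ExampleStub("NP", 1, "NP", 0, "who"),
]
print(count_conditions_sketch(examples)["a"])
```

Raw counts like these can be used directly as unnormalized sampling weights, or divided by their totals to obtain probabilities.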

botiverse.Theorizer.squad.squad_example.create_process_squad_examples(raw_ex: List[SquadExample])[source]#

Get a list of spaCy processed examples.

botiverse.Theorizer.squad.squad_example.pipeline(input_file: str)[source]#

Pipeline for processing squad examples.

botiverse.Theorizer.squad.utils module#

class botiverse.Theorizer.squad.utils.SpacyDoc[source]#

Bases: object

A stub class for the spacy.tokens.doc.Doc object, used for type hinting.

class botiverse.Theorizer.squad.utils.SpacyToken[source]#

Bases: object

A stub class for the spacy.tokens.token.Token object, used for type hinting.

class botiverse.Theorizer.squad.utils.InfoConfig(num_sample_style: int = 2, num_sample_answer: int = 5, ans_len_bin_width: int = 3, ans_len_min_val: int = 0, ans_len_max_val: int = 30, ans_limit: int = 30, num_sample_clue: int = 2, is_clue_topN: int = 20, clue_dep_dist_bin_width: int = 2, clue_dep_dist_min_val: int = 0, clue_dep_dist_max_val: int = 20, ques_limit: int = 50, sent_limit: int = 100, max_sample_times: int = 20)[source]#

Bases: object

Configuration for information extraction.

num_sample_style: int = 2#

num_sample_answer: int = 5#

ans_len_bin_width: int = 3#

ans_len_min_val: int = 0#

ans_len_max_val: int = 30#

ans_limit: int = 30#

num_sample_clue: int = 2#

is_clue_topN: int = 20#

clue_dep_dist_bin_width: int = 2#

clue_dep_dist_min_val: int = 0#

clue_dep_dist_max_val: int = 20#

ques_limit: int = 50#

sent_limit: int = 100#

max_sample_times: int = 20#

botiverse.Theorizer.squad.utils.find_token_spans_in_text(text: str, tokens: str) → List[Tuple[int, int]][source]#

Get the character-level spans of the specified tokens in the input text.

Parameters:

text – str, input text
tokens – list of str, token texts to find in the input text

Returns:

list of tuples, representing the character-level spans of each token in the input text. Each tuple contains the start and end indices of the token in the text (end index exclusive).

Raises:

Exception – If any of the specified tokens cannot be found in the input text
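The documented contract (ordered tokens in, end-exclusive character spans out, an exception when a token is missing) can be sketched with a left-to-right scan. This is a minimal illustration under the assumption that tokens appear in order and do not overlap; the name `find_token_spans_sketch` is hypothetical and the library's own implementation may differ.

```python
from typing import List, Tuple


def find_token_spans_sketch(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in order, scanning left to right (end index exclusive)."""
    spans: List[Tuple[int, int]] = []
    cursor = 0  # never search behind the previous match
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise Exception(f"Token {token!r} not found in text after index {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


print(find_token_spans_sketch("Hello, world", ["Hello", ",", "world"]))
# [(0, 5), (5, 6), (7, 12)]
```

Advancing the cursor after every match is what lets repeated tokens map to successive occurrences rather than all to the first one.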

botiverse.Theorizer.squad.utils.match_spans(pattern: str, input_text: str) → List[Tuple[int, int]][source]#

Find all occurrences of the given pattern in the input text and return the character-level spans.

Parameters:

pattern – str, the pattern to match in the input text
input_text – str, the input text where the pattern will be searched

Returns:

list of tuples, each tuple represents the character-level span of the pattern in the input text. Each tuple contains the start and end indices of the pattern in the text (end index exclusive).
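A minimal sketch of this behavior using the standard `re` module is shown below. It assumes the pattern is matched literally (hence `re.escape`) and that matches do not overlap; whether the library treats the pattern as a regular expression is not stated in the source, and the name `match_spans_sketch` is hypothetical.

```python
import re
from typing import List, Tuple


def match_spans_sketch(pattern: str, input_text: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of every literal occurrence, end exclusive."""
    # re.escape treats the pattern as plain text; this is an assumption.
    return [m.span() for m in re.finditer(re.escape(pattern), input_text)]


print(match_spans_sketch("ab", "ab cab"))
# [(0, 2), (4, 6)]
```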

botiverse.Theorizer.squad.utils.normalize(text: str) → str[source]#

Normalize the given text by replacing `` with " and '' with ".
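The quote normalization described here amounts to two string replacements. The sketch below is a stand-in with a hypothetical name, assuming only the two substitutions the description mentions.

```python
def normalize_sketch(text: str) -> str:
    """Replace PTB-style quote tokens (`` and '') with a plain double quote."""
    return text.replace("``", '"').replace("''", '"')


print(normalize_sketch("He said ``hello'' twice."))
# He said "hello" twice.
```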

botiverse.Theorizer.squad.utils.token_to_char_indices(sentence)[source]#

Generate character index ranges for each token in a given sentence using spaCy.

Args:

sentence (str): The input sentence to be tokenized.

Returns:

token_to_char_range (dict): A dictionary where keys are token indices and values are tuples representing the start and end indices (inclusive) of the token in the sentence.
char_to_token (dict): A dictionary where keys are character indices and values are the corresponding token indices in the sentence.
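The shape of the two returned mappings can be illustrated without spaCy. The sketch below substitutes whitespace splitting for spaCy's tokenizer, so token boundaries will differ from the real function (punctuation is not split off, for example); the name `token_to_char_indices_sketch` is hypothetical.

```python
import re
from typing import Dict, Tuple


def token_to_char_indices_sketch(
    sentence: str,
) -> Tuple[Dict[int, Tuple[int, int]], Dict[int, int]]:
    """Build token->char-range and char->token maps with inclusive end indices."""
    token_to_char_range: Dict[int, Tuple[int, int]] = {}
    char_to_token: Dict[int, int] = {}
    # Whitespace tokenization stands in for spaCy here (an assumption).
    for token_idx, match in enumerate(re.finditer(r"\S+", sentence)):
        start, end = match.start(), match.end() - 1  # make the end inclusive
        token_to_char_range[token_idx] = (start, end)
        for char_idx in range(start, end + 1):
            char_to_token[char_idx] = token_idx
    return token_to_char_range, char_to_token


ranges, lookup = token_to_char_indices_sketch("New York")
print(ranges)     # {0: (0, 2), 1: (4, 7)}
print(lookup[5])  # 1
```

In this sketch, characters that belong to no token (the space at index 3) are absent from `char_to_token`.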

botiverse.Theorizer.squad.utils.weighted_sample(choices, probs)[source]#

Sample from choices with probability according to probs.

Args:

choices (list): A list of elements to sample from.
probs (list): A list of probabilities corresponding to each element in choices. The probabilities don’t need to be normalized.

Returns:

any: A randomly sampled element from choices based on the provided probabilities.
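Sampling with unnormalized weights is available in the standard library through `random.choices`, which gives a compact sketch of the documented behavior. The function name is hypothetical, and the library may implement the draw differently (for example with a cumulative sum and a uniform draw).

```python
import random
from typing import Any, Sequence


def weighted_sample_sketch(choices: Sequence[Any], probs: Sequence[float]) -> Any:
    """Pick one element; probs act as relative weights and need not sum to 1."""
    return random.choices(choices, weights=probs, k=1)[0]


# With all weight on one element the result is deterministic.
print(weighted_sample_sketch(["a", "b", "c"], [0, 5, 0]))
# b
```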

botiverse.Theorizer.squad.utils.value_to_bin(input_val: int, min_val: int, max_val: int, bin_width: int)[source]#

Determine the bin index for the given input value, based on a binned range between min_val and max_val with a specified bin width.

Args:

input_val (float): The input value to be binned.
min_val (float): The minimum value of the range.
max_val (float): The maximum value of the range.
bin_width (float): The width of each bin in the range.

Returns:

int: The bin index of the input value.

If the input value is within the range, the function returns the corresponding bin index.
If the input value is greater than the maximum value, the function returns the index of the last bin + 1.
If the input value is less than the minimum value, the function returns -1.
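The three cases above can be sketched as follows. The sketch assumes left-closed bins computed with floor division; the source does not say how in-range values are rounded, so the library may assign boundary values to a different bin. The name `value_to_bin_sketch` is hypothetical.

```python
def value_to_bin_sketch(input_val: float, min_val: float, max_val: float, bin_width: float) -> int:
    """Map a value to a bin index: -1 below the range, last bin + 1 above it."""
    if input_val < min_val:
        return -1
    last_bin = int((max_val - min_val) // bin_width)
    if input_val > max_val:
        return last_bin + 1
    # Floor-based binning is an assumption of this sketch.
    return int((input_val - min_val) // bin_width)


# Using the documented answer-length defaults: min 0, max 30, width 3.
print(value_to_bin_sketch(7, 0, 30, 3))   # 2
print(value_to_bin_sketch(31, 0, 30, 3))  # 11
print(value_to_bin_sketch(-1, 0, 30, 3))  # -1
```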

botiverse.Theorizer.squad.utils.str_find(text, tklist)[source]#

Searches for a sequence of tokens in a given text string, allowing for spaces between characters.

The function takes a text string and a list of tokens as input, and returns the start and end character indices of the sequence of tokens in the text. If the sequence is not found within the text, the function returns (-1, -1). Spaces between characters in the input text are allowed and do not affect the search.

Args:

text (str): The input text string in which to search for the sequence of tokens.
tklist (list): A list of tokens representing the sequence to search for in the text. Each token is a string.

Returns:

tuple: A tuple containing two integer values:

The start character index of the sequence in the text, or -1 if the sequence is not found.
The end character index of the sequence in the text, or -1 if the sequence is not found.

Example:

>>> text = "This is an example text."
>>> tklist = ["ex", "am", "ple"]
>>> str_find(text, tklist)
(11, 18)
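One way to obtain this behavior is to concatenate the tokens and walk the text while skipping spaces. The sketch below follows that idea and reproduces the example result; it is an illustrative stand-in with a hypothetical name, and it assumes the returned end index is exclusive, as the example suggests.

```python
from typing import List, Tuple


def str_find_sketch(text: str, tklist: List[str]) -> Tuple[int, int]:
    """Find the concatenated tokens in text, ignoring spaces inside the match."""
    target = "".join(tklist)
    if not target:
        return (-1, -1)
    start = text.find(target[0])
    while start >= 0:
        i, j = start, 0
        while i < len(text) and j < len(target):
            if text[i] == " ":
                i += 1          # spaces in the text are skipped
            elif text[i] == target[j]:
                i += 1
                j += 1
            else:
                break           # mismatch: try the next candidate start
        if j == len(target):
            return (start, i)   # end index is exclusive
        start = text.find(target[0], start + 1)
    return (-1, -1)


print(str_find_sketch("This is an example text.", ["ex", "am", "ple"]))
# (11, 18)
```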

botiverse.Theorizer.squad.utils.test_find_token_spans_in_text()[source]#

botiverse.Theorizer.squad.utils.test_match_spans()[source]#
