mzutils.nlp_tasks package

Submodules

mzutils.nlp_tasks.data_preprocessing module

mzutils.nlp_tasks.data_preprocessing.concatenate_predictions_dicts(squadjsons_files_dir, output_file=None)[source]

Concatenate multiple predictions.json files into a single predictions dictionary.

Parameters:
  • squadjsons_files_dir – should be one of "wikipedia-train", "wikipedia-dev", "web-train", "web-dev", "verified-web-dev", "verified-wikipedia-dev". It is actually a directory with the format:

    squadjsons_files_dir
    ├── squadjsons%d
    │   └── predictions.json

  • output_file – path to store the concatenated predictions.json.

Returns:

None
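The directory walk can be sketched as follows. This is a minimal sketch, not the library's exact implementation: the merged-into-one-dictionary behavior and the `squadjsons%d` subdirectory layout are assumptions taken from the docstring above.

```python
import json
import os
import tempfile

def concat_predictions(squadjsons_files_dir):
    # Merge every <subdir>/predictions.json under the directory into a
    # single {qid: answer} dictionary (assumed behavior from the docstring).
    merged = {}
    for sub in sorted(os.listdir(squadjsons_files_dir)):
        pred_path = os.path.join(squadjsons_files_dir, sub, "predictions.json")
        if os.path.isfile(pred_path):
            with open(pred_path) as f:
                merged.update(json.load(f))
    return merged

# Build a tiny squadjsons_files_dir layout and merge it.
with tempfile.TemporaryDirectory() as root:
    for i, preds in enumerate([{"q1": "a1"}, {"q2": "a2"}]):
        sub = os.path.join(root, "squadjsons%d" % i)
        os.makedirs(sub)
        with open(os.path.join(sub, "predictions.json"), "w") as f:
            json.dump(preds, f)
    merged = concat_predictions(root)

print(merged)
```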

mzutils.nlp_tasks.data_preprocessing.generate_multi_test_cases(list_of_paragraphs, list_of_questions, json_store_path)[source]

Given pairs of paragraphs and questions, creates a JSON file in the same format as the SQuAD 1.1 training/dev/test data.

Parameters:
  • list_of_paragraphs

  • list_of_questions

  • json_store_path
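The SQuAD 1.1 shape being produced can be sketched like this. The `title`, `id`, and empty-`answers` choices here are illustrative assumptions that mirror the public SQuAD 1.1 schema, not necessarily the exact fields this function emits.

```python
import json

def build_squad_cases(paragraphs, questions):
    # Assemble a SQuAD 1.1-style {"version": ..., "data": [...]} object from
    # paired paragraphs and questions; answers are left empty because these
    # are test cases with no gold annotations.
    data = []
    for i, (context, question) in enumerate(zip(paragraphs, questions)):
        data.append({
            "title": "case_%d" % i,
            "paragraphs": [{
                "context": context,
                "qas": [{"id": str(i), "question": question, "answers": []}],
            }],
        })
    return {"version": "1.1", "data": data}

case = build_squad_cases(["Paris is the capital of France."],
                         ["What is the capital of France?"])
print(json.dumps(case, indent=2))
```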

mzutils.nlp_tasks.data_preprocessing.generate_multi_test_cases_triviaQA(retrieved_json_path, json_store_path, documents_path, missing_file_path=None)[source]

Given pairs of paragraphs and questions, creates a JSON file in the same format as the SQuAD 1.1 training/dev/test data. The format is:

{
    "[question_order, answer_order, 'qid', ['ground_truths']]": "ans",
    "[0, 0, 'tc_1250', ['The Swiss Miss', 'Martina hingis', 'Martina Hingisová', 'Martina Hingis', 'MartinaHingis', 'Martina Hingisova', 'Hingis', 'hingis', 'swiss miss', 'martina hingis', 'martina hingisova', 'martinahingis', 'martina hingisová', 'martina hingis']]": "Li Na",
}

mzutils.nlp_tasks.data_preprocessing.retrieve_questions_from_triviaQA(file_path, destination_path=None)[source]
Parameters:
  • file_path

  • destination_path – if given, writes {"data": [{"Question": "", "QuestionId": "", "AcceptableAnswers": ""}]} to this path.

Returns:

[{"Question": "", "QuestionId": "", "AcceptableAnswers": ""}] or None
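The extraction can be sketched as below. The docstring only names the output fields, so the input layout here (a top-level "Data" list and an "Answer"/"Aliases" nesting, as in the public TriviaQA release) is an assumption, not this function's verified behavior.

```python
import json

def retrieve_questions(trivia_obj):
    # Pull the fields named in the docstring out of a loaded TriviaQA-style
    # object (the "Data"/"Answer"/"Aliases" keys are assumptions).
    out = []
    for item in trivia_obj.get("Data", []):
        out.append({
            "Question": item.get("Question", ""),
            "QuestionId": item.get("QuestionId", ""),
            "AcceptableAnswers": item.get("Answer", {}).get("Aliases", ""),
        })
    return out

sample = {"Data": [{"Question": "Who wrote Hamlet?",
                    "QuestionId": "tc_1",
                    "Answer": {"Aliases": ["Shakespeare", "William Shakespeare"]}}]}
qs = retrieve_questions(sample)
print(json.dumps({"data": qs}))
```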

mzutils.nlp_tasks.data_preprocessing.simple_squad_segmentor(squad_file_path, store_location, num_of_paragraphs=500)[source]

mzutils.nlp_tasks.ner_funcs module

mzutils.nlp_tasks.ner_funcs.helper_flatten(list_of_lists)[source]
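The helper has no docstring; judging from the name and its parameter, it flattens one level of nesting. A minimal sketch under that assumption:

```python
def helper_flatten(list_of_lists):
    # Flatten one level of nesting (assumed behavior from the name).
    return [item for sub in list_of_lists for item in sub]

print(helper_flatten([[1, 2], [3], []]))  # [1, 2, 3]
```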
mzutils.nlp_tasks.ner_funcs.labels_from_subword_labels(tokens, bert_labels, tokenizer, bert_special_tokens=True)[source]
Parameters:
  • tokens – something like ['John', 'Johanson', 'lives', 'in', 'Ramat', 'Gan', 'Gang', '.']. Can be obtained from tokens = tokenizer.basic_tokenizer.tokenize("John Johanson lives in Ramat Gan Gang.") or tokens = nltk.word_tokenize("John Johanson lives in Ramat Gan Gang.")

  • bert_labels – [0, 1, 2, 3, 0, 0, 1, 3, 2, 2, 0, 0]. NER labels from subword_tokenize_labels. Label meanings: 0 -> O (outside), 1 -> B (beginning), 2 -> I (continued), 3 -> X (sub-words that are not tagged).

  • tokenizer – e.g. transformers.BertTokenizer.from_pretrained('bert-base-cased')

  • bert_special_tokens – whether '[CLS]' and '[SEP]' were added.

Returns:

([‘John’, ‘Johanson’, ‘lives’, ‘in’, ‘Ramat’, ‘Gan’, ‘Gang’, ‘.’], [1, 2, 0, 0, 1, 2, 2, 0])
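The collapse from subword labels back to word labels can be sketched as follows. This sketch aligns on WordPiece '##' markers instead of re-running the passed-in tokenizer, and (like the Returns example) works on lowercased pieces; it is an illustration of the alignment logic, not the library's exact code.

```python
def labels_from_subwords(subword_tokens, subword_labels, special_tokens=True):
    # Collapse WordPiece-level labels back to word-level labels: keep the
    # label of each word's first piece, and drop the X (=3) labels that sit
    # on '##' continuation pieces as well as the [CLS]/[SEP] labels.
    if special_tokens:  # strip [CLS] ... [SEP] and their labels
        subword_tokens = subword_tokens[1:-1]
        subword_labels = subword_labels[1:-1]
    words, labels = [], []
    for tok, lab in zip(subword_tokens, subword_labels):
        if tok.startswith("##"):
            words[-1] += tok[2:]   # continuation piece: extend word, skip label
        else:
            words.append(tok)
            labels.append(lab)
    return words, labels

subwords = ['[CLS]', 'john', 'johan', '##son', 'lives', 'in',
            'rama', '##t', 'gan', 'gang', '.', '[SEP]']
sub_labels = [0, 1, 2, 3, 0, 0, 1, 3, 2, 2, 0, 0]
print(labels_from_subwords(subwords, sub_labels))
```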

mzutils.nlp_tasks.ner_funcs.rejoin_bert_tokenized_sentence(sentence)[source]

The original sentence is "The Smiths' used their son's car.". tokenizer.basic_tokenizer.tokenize("The Smiths' used their son's car.") gives ['the', 'smiths', "'", 'used', 'their', 'son', "'", 's', 'car', '.']. The returned fine_text is "the smiths ' used their son ' s car .", and tokenizer.basic_tokenizer.tokenize(fine_text) gives ['the', 'smiths', "'", 'used', 'their', 'son', "'", 's', 'car', '.'] again.
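The round-trip property in the docstring can be sketched with a plain space join (a minimal sketch of the idea; the library function may normalize further):

```python
def rejoin_tokens(tokens):
    # Rejoin basic-tokenizer output with single spaces so that splitting
    # (or re-tokenizing) the result yields the same token list back.
    return " ".join(tokens)

tokens = ['the', 'smiths', "'", 'used', 'their', 'son', "'", 's', 'car', '.']
fine_text = rejoin_tokens(tokens)
print(fine_text)  # the smiths ' used their son ' s car .
assert fine_text.split(" ") == tokens  # the round trip is lossless
```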

mzutils.nlp_tasks.ner_funcs.subword_tokenize_labels(tokens, labels, tokenizer, bert_special_tokens=True)[source]
Parameters:
  • tokens – something like ['John', 'Johanson', 'lives', 'in', 'Ramat', 'Gan', 'Gang', '.']. Can be obtained from tokens = tokenizer.basic_tokenizer.tokenize("John Johanson lives in Ramat Gan Gang.") or tokens = nltk.word_tokenize("John Johanson lives in Ramat Gan Gang.")

  • labels – [1, 2, 0, 0, 1, 2, 2, 0]. Word-level NER labels. Label meanings: 0 -> O (outside), 1 -> B (beginning), 2 -> I (continued), 3 -> X (sub-words that are not tagged).

  • tokenizer – e.g. transformers.BertTokenizer.from_pretrained('bert-base-cased')

  • bert_special_tokens – whether to add '[CLS]' and '[SEP]'.

Returns:

([‘[CLS]’, ‘john’, ‘johan’, ‘##son’, ‘lives’, ‘in’, ‘rama’, ‘##t’, ‘gan’, ‘gang’, ‘.’, ‘[SEP]’], [101, 2198, 13093, 3385, 3268, 1999, 14115, 2102, 25957, 6080, 1012, 102], array([ 1, 2, 4, 5, 6, 8, 9, 10]), [0, 1, 2, 3, 0, 0, 1, 3, 2, 2, 0, 0])
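The label-expansion logic can be sketched as follows. A toy `piece_fn` with a hardcoded two-entry vocabulary stands in for the real WordPiece tokenizer, so this reproduces the alignment scheme (first piece keeps the word's label, continuation pieces get X, specials get O) rather than the library's exact implementation; the input-id output above is omitted because it requires the real vocabulary.

```python
def subword_tokenize_with_labels(tokens, labels, piece_fn, special_tokens=True):
    # Expand word-level NER labels to piece level: the first piece of each
    # word keeps the word's label, continuation pieces get X (=3), and
    # [CLS]/[SEP] get O (=0).
    pieces, piece_labels = [], []
    for tok, lab in zip(tokens, labels):
        for j, piece in enumerate(piece_fn(tok)):
            pieces.append(piece)
            piece_labels.append(lab if j == 0 else 3)
    if special_tokens:
        pieces = ['[CLS]'] + pieces + ['[SEP]']
        piece_labels = [0] + piece_labels + [0]
    return pieces, piece_labels

# Toy piece function standing in for the real WordPiece tokenizer
# (the actual function uses the transformers tokenizer passed to it).
VOCAB = {'johanson': ['johan', '##son'], 'ramat': ['rama', '##t']}
piece_fn = lambda w: VOCAB.get(w.lower(), [w.lower()])

tokens = ['John', 'Johanson', 'lives', 'in', 'Ramat', 'Gan', 'Gang', '.']
labels = [1, 2, 0, 0, 1, 2, 2, 0]
print(subword_tokenize_with_labels(tokens, labels, piece_fn))
```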

mzutils.nlp_tasks.nlp_metrics module

mzutils.nlp_tasks.nlp_metrics.compute_sentence_pseudo_mlm_perplexity(model, tokenizer, sentence: str, mask_token: str = '[MASK]', max_length: int = 512, empty_cache: bool = False, batch_size=64)[source]

Compute the perplexity of a sentence using pseudo-MLM scoring. In contrast to https://huggingface.co/docs/transformers/perplexity, we use diagonal masking to compute the model's confusion.

Parameters:
  • model (_type_) – e.g. BertForMaskedLM.from_pretrained('bert-base-uncased')

  • tokenizer (_type_) – e.g. BertTokenizer.from_pretrained('bert-base-uncased')

  • sentence (str) – the sentence to score.

  • mask_token (str, optional) – the mask token string. Defaults to '[MASK]'.

  • max_length (int, optional) – maximum sequence length. Defaults to 512.

  • empty_cache (bool, optional) – whether to empty the CUDA cache between batches. Defaults to False.

  • batch_size (int, optional) – number of masked copies scored per forward pass. Defaults to 64.

Returns:

the pseudo-MLM perplexity of the sentence.

Return type:

_type_
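The diagonal-masking step can be sketched as follows: one masked copy of the token sequence per position, so the model scores each token conditioned on all the others. This only builds the masked batch; the real function then runs the copies through the MLM and exponentiates the average negative log-probability of each gold token at its masked position. The id values and the skip-specials choice are illustrative assumptions.

```python
def diagonal_masked_batch(token_ids, mask_id, skip=()):
    # One copy per position, with that position replaced by the mask id;
    # positions in `skip` (e.g. [CLS]/[SEP]) are not masked at all.
    batch, targets = [], []
    for i, tid in enumerate(token_ids):
        if i in skip:
            continue
        row = list(token_ids)
        row[i] = mask_id
        batch.append(row)
        targets.append((i, tid))  # (position to score, gold token id)
    return batch, targets

# Toy ids: 101=[CLS], 102=[SEP], 103=[MASK] in BERT's uncased vocab.
ids = [101, 7592, 2088, 102]
batch, targets = diagonal_masked_batch(ids, mask_id=103, skip=(0, 3))
print(batch)    # [[101, 103, 2088, 102], [101, 7592, 103, 102]]
print(targets)  # [(1, 7592), (2, 2088)]
```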

mzutils.nlp_tasks.nlp_metrics.remove_sub_strings(predicted_txt, tokens=['ᐛ ', ' ✬', '<unk>'])[source]

Remove each string in tokens from predicted_txt.
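A minimal sketch of the removal (sequential str.replace calls; the library's exact implementation may differ):

```python
def remove_sub_strings(predicted_txt, tokens=('ᐛ ', ' ✬', '<unk>')):
    # Strip each decoding artifact, in order, from the text.
    for tok in tokens:
        predicted_txt = predicted_txt.replace(tok, '')
    return predicted_txt

print(remove_sub_strings('ᐛ a cat ✬ sat<unk>'))  # a cat sat
```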

mzutils.nlp_tasks.nlp_metrics.remove_sub_strings_chinese(predicted_txt, tokens=['ᐛ', '✬', '<unk>'])[source]

Remove each string in tokens from predicted_txt.

mzutils.nlp_tasks.nlp_metrics.rouge_helper_prepare_results(m, p, r, f)[source]
mzutils.nlp_tasks.nlp_metrics.translation_paraphrase_evaluation(sources, hypos, refs, sentence_preproce_function=None, print_scores=True, max_n=4, rouge_alpha=0.5, rouge_weight_factor=1.2, rouge_stemming=True, hypo_style='first')[source]

Evaluate generated paraphrases or translations with BLEU and ROUGE scores. Nothing should be tokenized here.

Parameters:
  • sources – source sentences to start with, e.g. sources = ['Young woman with sheep on straw covered floor.', 'A man who is walking across the street.', 'A brightly lit kitchen with lots of natural light.']

  • hypos – generated hypotheses; should have the same shape as sources (for each source, one list of hypothesis sentences), e.g. [['A child places his hands on the head and neck of a sheep while another sheep looks at his face.', 'A person petting the head of a cute fluffy sheep.', 'A child is petting a sheep while another sheep watches.', 'A woman kneeling to pet animals while others wait. '], ['A busy intersection with an ice cream truck driving by.', 'a man walks behind an ice cream truck ', 'A man is crossing a street near an icecream truck.', 'The man is walking behind the concession bus.'], ['A modern kitchen in white with stainless steel lights.', 'A kitchen filled with lots of white counter space.', 'A KITCHEN IN THE ROOM WITH WHITE APPLIANCES ', 'A modern home kitchen and sitting area looking out towards the back yard']]

  • refs – list of lists of sentences; for each source, a list of possible references, e.g. [['A woman standing next to a sheep in a pen .<unk>', 'A woman standing next to a sheep on a farm .<unk>', 'A woman standing next to a sheep in a barn .<unk>', 'A woman standing next to a sheep in a field .<unk>', 'A woman standing next to a sheep in a barn<unk>'], ['A man crossing the street in front of a store .<unk>', 'A man crossing the street in a city .<unk>', 'A person crossing the street in a city .<unk>', 'A man crossing the street in the middle of a city<unk>', 'A man crossing the street in the middle of a city street<unk>'], ['a kitchen with a stove a microwave and a sink<unk>', 'a kitchen with a stove a sink and a microwave<unk>', 'a kitchen with a stove a sink and a refrigerator<unk>', 'A kitchen with a sink , stove , microwave and window .<unk>', 'a kitchen with a stove a sink and a window<unk>']]

  • hypo_style – how to evaluate the generated hypotheses: pick the first, choose the one with the best evaluation score, or average the scores over all hypotheses. Should be one of ['first', 'best', 'average'].

  • sentence_preproce_function – a function applied to all sentences in sources, hypos, and refs.

Returns:

a dictionary of scores.
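The hypo_style options can be sketched as a score-combination step. This is an assumed reading of 'first'/'best'/'average' as operating on per-hypothesis scores for each source, not the library's exact code:

```python
def combine_hypo_scores(scores, hypo_style='first'):
    # scores: one list of per-hypothesis evaluation scores per source.
    # 'first' keeps each source's first hypothesis score, 'best' the
    # maximum, 'average' the mean over all hypotheses (assumed semantics).
    if hypo_style == 'first':
        return [s[0] for s in scores]
    if hypo_style == 'best':
        return [max(s) for s in scores]
    if hypo_style == 'average':
        return [sum(s) / len(s) for s in scores]
    raise ValueError("hypo_style must be one of ['first', 'best', 'average']")

scores = [[2, 8, 5], [4, 1]]
print(combine_hypo_scores(scores, 'first'))    # [2, 4]
print(combine_hypo_scores(scores, 'best'))     # [8, 4]
print(combine_hypo_scores(scores, 'average'))  # [5.0, 2.5]
```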

mzutils.nlp_tasks.nlp_metrics.translation_paraphrase_evaluation_chinese(sources, hypos, refs, sentence_preproce_function=None, print_scores=True, max_n=4, rouge_alpha=0.5, rouge_weight_factor=1.2, rouge_stemming=True, hypo_style='first', word_segmentor='character')[source]

Evaluate generated Chinese paraphrases or translations with BLEU and ROUGE scores. Nothing should be tokenized here.

Parameters:
  • sources – source sentences to start with.

  • hypos – generated hypotheses; should have the same shape as sources (for each source, one list of hypothesis sentences).

  • refs – list of lists of sentences; for each source, a list of possible references.

  • hypo_style – how to evaluate the generated hypotheses: pick the first, choose the one with the best evaluation score, or average the scores over all hypotheses. Should be one of ['first', 'best', 'average'].

  • sentence_preproce_function – a function applied to all sentences in sources, hypos, and refs.

  • word_segmentor – 'character' means separate each character into its own word; 'hanlp' means use a HanLP Chinese tokenizer.

Returns:

a dictionary of scores.

mzutils.nlp_tasks.nlp_metrics.translation_paraphrase_evaluation_english_tagpa(sources, hypos, refs, print_scores=True, max_n=4, rouge_alpha=0.5, rouge_weight_factor=1.2, rouge_stemming=True)[source]

Evaluate generated paraphrases or translations with BLEU and ROUGE scores. Nothing should be tokenized here.

Parameters:
  • sources – source sentences to start with, e.g. ['Young woman with sheep on straw covered floor .', 'A man who is walking across the street .']

  • hypos – generated hypotheses; should have the same shape as sources (for each source, one list of hypothesis sentences), e.g. ['Young woman with sheep on straw covered floor .', 'a little girl with sheep on straw covered floor .'] for 'Young woman with sheep on straw covered floor .'

  • refs – list of lists of sentences; for each source, a list of possible references, e.g. [['Young woman with sheep on straw covered floor .', 'Young woman on the floor .'], ['A man who is walking across the street now.', 'A man walking across the street.']]

Returns:

a dictionary of scores.

Module contents