BERTeley Preprocessing

alive_bar_joblib(bar)

Context manager that patches joblib so that completed tasks are reported to the progress bar passed as bar

combine_hyphens(doc: str)

Removes hyphens and concatenates all hyphenated words together

Parameters

doc – A string for a single document

Return type

A string with hyphenated words combined

Examples

x-ray -> xray
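The hyphen-combining behaviour described above can be approximated with a single regular expression. This is a minimal sketch, not the library's implementation: the name combine_hyphens_sketch is hypothetical, and it assumes that only hyphens sitting between two word characters should be removed.

```python
import re


def combine_hyphens_sketch(doc: str) -> str:
    """Join hyphenated words by deleting hyphens between word characters."""
    # Assumption: a free-standing hyphen (as in "a - b") is left untouched.
    return re.sub(r"(?<=\w)-(?=\w)", "", doc)


print(combine_hyphens_sketch("an x-ray of a state-of-the-art sensor"))
```

Because the lookbehind and lookahead are zero-width, consecutive hyphens in a word such as state-of-the-art are all removed in one pass.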

expand_contractions(doc: str)

Expands contractions using the contractions library

Parameters

doc – A single document

Return type

A string with contractions expanded

Examples

won’t -> will not
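To illustrate the idea without depending on the contractions package, the sketch below uses a small hand-written mapping as a stand-in. The function name, the mapping and its entries are all hypothetical; the real library ships a far larger table and handles many more cases.

```python
import re

# Hypothetical stand-in table; NOT the table used by the contractions library.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "don't": "do not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions_sketch(doc: str) -> str:
    """Replace known contractions with their expanded form."""
    # Normalise curly apostrophes so that "won’t" matches the key "won't".
    doc = doc.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], doc)


print(expand_contractions_sketch("I won\u2019t go"))
```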

lemmatize(doc)

Uses the spaCy lemmatizer to lemmatize every word in the document

Parameters

doc – A single document

Return type

A string with lemmatized words

preprocess(docs: List[str], allow_abbrev: bool = True, show_progress: bool = False) -> List[str]

Wrapper function for all the preprocessing steps

Parameters
  • docs – A list of all the documents

  • allow_abbrev – Boolean indicating whether abbreviations should be allowed. If set to False, all strings of length 2 or less are removed

  • show_progress – If True, shows a progress bar for completion status

Return type

A list of strings that have been preprocessed
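A wrapper of this kind can be pictured as a list of string-to-string steps applied to each document in turn. The sketch below is an assumption-laden illustration: preprocess_sketch and DEFAULT_STEPS are hypothetical names, the step order is a guess, and the steps that need third-party packages (contraction expansion, lemmatization, stopword removal) are left out so the example runs on the standard library alone.

```python
import re
import string
from typing import Callable, List, Sequence

Step = Callable[[str], str]

# Hypothetical stand-in steps; the order is an assumption, not the library's.
DEFAULT_STEPS: Sequence[Step] = (
    lambda d: re.sub(r"<[^>]+>", " ", d),        # strip HTML tags
    lambda d: re.sub(r"(?<=\w)-(?=\w)", "", d),   # combine hyphenated words
    lambda d: d.translate(str.maketrans("", "", string.punctuation)),
    lambda d: " ".join(d.split()),                # collapse whitespace
)


def preprocess_sketch(docs: List[str], steps: Sequence[Step] = DEFAULT_STEPS) -> List[str]:
    """Apply every step, in order, to every document."""
    cleaned = []
    for doc in docs:
        for step in steps:
            doc = step(doc)
        cleaned.append(doc)
    return cleaned


print(preprocess_sketch(["<p>An  x-ray,   please!</p>"]))
```

Keeping the steps in a sequence makes the order explicit and lets a caller swap a step out without touching the loop.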

preprocess_parallel(docs: List[str], n_workers: int = 4, allow_abbrev: bool = True, show_progress: bool = True) -> List[str]

Parallelizes the preprocessing by splitting the documents evenly among n_workers workers

Parameters
  • docs – List of documents

  • n_workers – Number of workers to be assigned preprocessing in joblib

  • allow_abbrev – Boolean indicating whether abbreviations should be allowed. If set to False, all strings of length 2 or less are removed

  • show_progress – If True, shows a progress bar for completion status

Return type

A list of strings that have been preprocessed
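The even split across workers can be sketched as follows. Note the swap: this example uses concurrent.futures.ThreadPoolExecutor from the standard library as a stand-in for joblib, and split_evenly and parallel_map_sketch are hypothetical names. It shows the chunking idea only, not how the library dispatches work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_evenly(docs: List[str], n_workers: int) -> List[List[str]]:
    """Split docs into at most n_workers contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(docs), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are fewer docs than workers
            chunks.append(docs[start:end])
        start = end
    return chunks


def parallel_map_sketch(docs: List[str], clean: Callable[[str], str], n_workers: int = 4) -> List[str]:
    """Clean each chunk in its own worker and return the results in the original order."""
    chunks = split_evenly(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map yields results in submission order, so document order is preserved.
        results = list(pool.map(lambda chunk: [clean(d) for d in chunk], chunks))
    return [doc for chunk in results for doc in chunk]


print(parallel_map_sketch(["A", "B", "C"], str.lower, n_workers=2))
```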

remove_extraspace(doc: str)

Removes excess whitespace from a string

Parameters

doc – A single document

Return type

A string with excess whitespace removed
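One common way to get this effect, shown here as a sketch rather than as the library's code, is to split on runs of whitespace and rejoin with single spaces. The name remove_extraspace_sketch is hypothetical.

```python
def remove_extraspace_sketch(doc: str) -> str:
    """Collapse runs of whitespace to one space and trim both ends."""
    # str.split() with no argument splits on any whitespace run,
    # including tabs and newlines, and drops empty strings.
    return " ".join(doc.split())


print(remove_extraspace_sketch("  too   many\n spaces\t here "))
```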

remove_html(doc: str)

Removes html tags from string

Parameters

doc – A single document

Return type

A string with html tags removed
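A rough approximation of tag stripping is a regular expression that deletes anything between angle brackets. This is only a sketch under that assumption: remove_html_sketch is a hypothetical name, and the pattern does not handle comments, scripts or a literal ">" inside an attribute value, which a real HTML parser would.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def remove_html_sketch(doc: str) -> str:
    """Delete anything that looks like an HTML tag, keeping the text between tags."""
    return TAG_RE.sub("", doc)


print(remove_html_sketch("<p>Hello <b>world</b></p>"))
```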

remove_punctuation(doc: str)

Removes all punctuation from a string using the punctuation list in the string library

Parameters

doc – A single document

Return type

A string with no punctuation
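Removing every character listed in string.punctuation is usually done with a translation table. The sketch below shows that technique; remove_punctuation_sketch is a hypothetical name and the library may implement the step differently.

```python
import string

# Translation table that maps every ASCII punctuation character to None.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_sketch(doc: str) -> str:
    """Delete every character found in string.punctuation."""
    return doc.translate(_PUNCT_TABLE)


print(remove_punctuation_sketch("Hello, world!"))
```

Note that string.punctuation contains the hyphen and the apostrophe, so this step also turns x-ray's into xrays.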

remove_stopwords(doc: str, allow_abbrev: bool = True)

Removes stopwords from the string using the stopword list in the nltk package. Words commonly found in scientific articles are removed as well.

Parameters
  • doc – A single document

  • allow_abbrev – A boolean indicating whether abbreviations should be allowed. If False, strings with a character length of 2 or less are treated as stopwords and removed.

Return type

A string with stopwords removed
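The filtering logic can be sketched with a small hard-coded set standing in for the nltk stopword list. Everything here is illustrative: remove_stopwords_sketch and STOPWORDS_STAND_IN are hypothetical names, the set is not the nltk list, and the sketch assumes the behaviour described for preprocess, where allow_abbrev=False drops strings of length 2 or less.

```python
# Hypothetical stand-in; the library uses the nltk stopword list plus extra terms.
STOPWORDS_STAND_IN = frozenset({"the", "a", "an", "of", "and", "is", "in", "to"})


def remove_stopwords_sketch(doc: str, allow_abbrev: bool = True) -> str:
    """Drop stopwords and, when allow_abbrev is False, tokens of length 2 or less."""
    kept = []
    for token in doc.split():
        if token.lower() in STOPWORDS_STAND_IN:
            continue
        # Assumption: short tokens are only dropped when abbreviations are disallowed.
        if not allow_abbrev and len(token) <= 2:
            continue
        kept.append(token)
    return " ".join(kept)


print(remove_stopwords_sketch("the pH of water is 7"))
print(remove_stopwords_sketch("the pH of water is 7", allow_abbrev=False))
```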