BERTeley Preprocessing
- alive_bar_joblib(bar)
Context manager that patches joblib so that completed jobs are reported to the progress bar passed as the argument
- combine_hyphens(doc: str)
Removes hyphens and concatenates all hyphenated words together
- Parameters
doc – A string for a single document
- Return type
A string with hyphenated words combined
Examples
x-ray -> xray
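The behaviour described above can be sketched with a regular expression. This is an illustration only, not the library's implementation: the name combine_hyphens_sketch is hypothetical, and it assumes a hyphen is dropped only when it sits directly between two word characters.

```python
import re


def combine_hyphens_sketch(doc: str) -> str:
    """Join hyphenated words, e.g. "x-ray" becomes "xray" (illustrative sketch)."""
    # Remove a hyphen only when it has a word character on both sides,
    # so a free-standing dash such as "a - b" is left untouched.
    return re.sub(r"(?<=\w)-(?=\w)", "", doc)


print(combine_hyphens_sketch("an x-ray of a well-known sample"))
```

Whether the real function also handles free-standing dashes or hyphens at line ends is not stated here, so the sketch leaves them alone.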
- expand_contractions(doc: str)
Expands contractions using the contractions library
- Parameters
doc – A single document
- Returns
A string with contractions expanded
Examples
won’t -> will not
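To show the idea without any third-party dependency, the sketch below uses a tiny hand-written lookup table as a stand-in for the contractions library. The table, the function name expand_contractions_sketch, and the apostrophe normalisation are all assumptions made for illustration; with the package installed, contractions.fix(doc) does the expansion from a much larger table.

```python
import re

# Tiny illustrative table (an assumption); the contractions library ships far more entries.
CONTRACTIONS = {"won't": "will not", "can't": "cannot", "don't": "do not"}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)


def expand_contractions_sketch(doc: str) -> str:
    """Expand the contractions listed in CONTRACTIONS (illustrative sketch)."""
    # Normalise the typographic apostrophe so that "won’t" matches "won't".
    doc = doc.replace("\u2019", "'")
    # Look each match up in lower case; the original capitalisation is not restored.
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], doc)


print(expand_contractions_sketch("We won\u2019t stop"))
```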
- lemmatize(doc)
Lemmatizes every word in the document using the lemmatizer from the spacy package
- Parameters
doc – A single document
- Return type
A string with lemmatized words
- preprocess(docs: List[str], allow_abbrev: bool = True, show_progress: bool = False) -> List[str]
Wrapper function for all the preprocessing steps
- Parameters
docs – A list of all the documents
allow_abbrev – Boolean indicating whether abbreviations are allowed. If False, all strings of 2 characters or fewer are removed
show_progress – If True, shows a progress bar for completion status
- Return type
A list of strings that have been preprocessed
- preprocess_parallel(docs: List[str], n_workers: int = 4, allow_abbrev: bool = True, show_progress: bool = True) -> List[str]
Parallelizes the preprocessing by splitting the documents evenly amongst the n_workers
- Parameters
docs – List of documents
n_workers – Number of workers to be assigned preprocessing in joblib
allow_abbrev – Boolean indicating whether abbreviations are allowed. If False, all strings of 2 characters or fewer are removed
show_progress (bool) – If True, shows a progress bar for completion status
- Return type
A list of strings that have been preprocessed
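The even split across workers can be sketched as follows. This is not the library's code: it swaps joblib for the standard library's ThreadPoolExecutor so that it runs without extra packages, and split_evenly, parallel_map_sketch, and the clean callable (a stand-in for the per-document preprocessing) are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_evenly(docs: List[str], n_workers: int) -> List[List[str]]:
    """Split docs into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(docs), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional document each.
        size = base + (1 if i < extra else 0)
        chunks.append(docs[start:start + size])
        start += size
    return chunks


def parallel_map_sketch(
    docs: List[str], clean: Callable[[str], str], n_workers: int = 4
) -> List[str]:
    """Apply `clean` to every document, one chunk per worker, preserving order."""
    chunks = split_evenly(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Executor.map yields results in submission order, so document order is kept.
        results = pool.map(lambda chunk: [clean(d) for d in chunk], chunks)
        return [doc for chunk in results for doc in chunk]


print(parallel_map_sketch(["A", "B", "C", "D", "E"], str.lower, n_workers=2))
```

Threads are used here only to keep the sketch self-contained; for CPU-bound text cleaning, a process-based backend such as the one joblib provides is the more usual choice.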
- remove_extraspace(doc: str)
Removes excess whitespace from a string
- Parameters
doc – A single document
- Return type
A string with excess whitespace removed
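One common way to get this effect, shown here as a sketch rather than the library's actual implementation, is to split on whitespace and rejoin with single spaces. The name remove_extraspace_sketch is hypothetical.

```python
def remove_extraspace_sketch(doc: str) -> str:
    """Collapse whitespace runs to single spaces and trim both ends (illustrative sketch)."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading and trailing whitespace.
    return " ".join(doc.split())


print(remove_extraspace_sketch("  too   many \n\t spaces  "))
```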
- remove_html(doc: str)
Removes html tags from string
- Parameters
doc – A single document
- Return type
A string with html tags removed
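A minimal sketch of tag stripping, assuming a simple regular expression is sufficient for the input; how the library itself removes tags is not stated here, and remove_html_sketch is a hypothetical name.

```python
import re

# Matches anything of the form <...>; this is a heuristic, not an HTML parser.
_TAG_RE = re.compile(r"<[^>]+>")


def remove_html_sketch(doc: str) -> str:
    """Delete HTML tags but keep the text between them (illustrative sketch)."""
    # Character entities such as &amp; are left as they are.
    return _TAG_RE.sub("", doc)


print(remove_html_sketch("<p>Hello <b>world</b></p>"))
```

A regular expression like this can misfire on text that contains literal angle brackets, so a real HTML parser is the safer choice for untrusted markup.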
- remove_punctuation(doc: str)
Removes all punctuation from a string using the punctuation list in the string library
- Parameters
doc – A single document
- Return type
A string with no punctuation
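Removal based on the standard library's punctuation list can be sketched with str.translate. The function name remove_punctuation_sketch is hypothetical, and the library may implement the step differently.

```python
import string

# Translation table that maps every character in string.punctuation to "deleted".
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation_sketch(doc: str) -> str:
    """Delete all ASCII punctuation characters (illustrative sketch)."""
    return doc.translate(_PUNCT_TABLE)


print(remove_punctuation_sketch("Hello, world! (test)"))
```

Note that string.punctuation covers ASCII punctuation only, so typographic characters such as ’ pass through unchanged in this sketch.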
- remove_stopwords(doc: str, allow_abbrev: bool = True)
Removes stopwords from the string using the stopword list in the nltk package. Words commonly found in scientific articles are also removed.
- Parameters
doc – A single document
allow_abbrev – A boolean indicating whether abbreviations are kept. If False, strings of 2 characters or fewer are treated as stopwords and removed.
- Return type
A string with stopwords removed
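The filtering logic can be sketched without NLTK by using a small hand-written stopword set. Everything here is an assumption for illustration: the STOPWORDS set, the name remove_stopwords_sketch, whitespace tokenisation, and the reading that allow_abbrev=False drops tokens of 2 characters or fewer, as described for preprocess.

```python
# Small illustrative set (an assumption); NLTK's English stopword list is much longer.
STOPWORDS = {"the", "of", "and", "a", "in", "is"}


def remove_stopwords_sketch(doc: str, allow_abbrev: bool = True) -> str:
    """Drop stopwords and, optionally, very short tokens (illustrative sketch)."""
    # Compare in lower case so that "The" and "the" are both removed.
    kept = [tok for tok in doc.split() if tok.lower() not in STOPWORDS]
    if not allow_abbrev:
        # Assumed semantics: abbreviations not allowed means short tokens go too.
        kept = [tok for tok in kept if len(tok) > 2]
    return " ".join(kept)


print(remove_stopwords_sketch("the pH of the sample is low"))
print(remove_stopwords_sketch("the pH of the sample is low", allow_abbrev=False))
```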