BERTeley Models

calculate_metrics(texts: List[str], topic_model: bertopic._bertopic.BERTopic, topics: List[int], n_gram_range: Optional[Tuple[int, int]] = None)

Calculates the Topic Coherence and Topic Diversity of the topic model.

Parameters

texts – The list of documents

Returns: A dict containing the metrics as keys, and their respective scores as values.
Return type: dict
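
To give a feel for what the diversity score measures, here is a minimal sketch. It assumes the common definition of topic diversity (the fraction of unique words among the top words of all topics); the source does not state which definition calculate_metrics uses internally. The helper name topic_diversity and its top_k parameter are hypothetical and are not part of BERTeley.

```python
from typing import Dict, List


def topic_diversity(topic_words: Dict[int, List[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Hypothetical helper for illustration; an assumed definition, not the
    library's confirmed implementation.
    """
    # Collect the first top_k words of each topic into one flat list.
    top_words = [word for words in topic_words.values() for word in words[:top_k]]
    if not top_words:
        return 0.0
    # 1.0 means no word is shared between topics; lower values mean more overlap.
    return len(set(top_words)) / len(top_words)


# Two toy topics that share the word "cell": 5 unique words out of 6.
example = {0: ["protein", "cell", "gene"], 1: ["cell", "tumor", "therapy"]}
score = topic_diversity(example, top_k=3)
```

A score near 1 suggests the topics use largely distinct vocabularies, while a low score suggests redundant topics.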

create_barcharts(topics, topic_model: bertopic._bertopic.BERTopic, path='')

Creates and saves the BERTopic bar charts.

Parameters

topics – the list of topic assignments output from the fit function
topic_model – fitted BERTopic model output from the fit function
path – A string containing the desired path to save the barchart

Return type

Barcharts are saved in .html and .png format in the desired directory.

fit(data: List[str], embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose: bool = False)

Fits a BERTopic model on the data. After fitting, the topic assigned to each document is stored in the ‘topics’ attribute, the coherence and diversity measures are stored in the ‘coherence’ and ‘diversity’ attributes respectively, and the number of documents assigned to each topic is stored in the ‘topic_sizes’ attribute.

Parameters

data – List of the documents
embedding_model – Either a SentenceTransformer, or one of the strings “specter”, “aspire”, or “scibert”.
nr_topics – The desired number of topics. If not specified, the number of topics is determined by HDBSCAN’s reduction step.
n_gram_range – Either the string “unigram” or “bigram”, or a tuple of two ints.
verbose – Boolean indicating verbose output.

Returns

topics – a list of integers representing the topic the corresponding document was assigned to
probabilities – a
metrics – a dictionary of the 2 metrics with keys “Coherence” and “Diversity”
topic_sizes – a dictionary mapping each topic number to the number of documents assigned to that topic
topic_model – the fitted BERTopic model
topic_words – a dictionary mapping each topic number to a list of strings
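
The relationship between the topics list and the topic_sizes dictionary can be sketched as follows. This is an illustration of the documented shapes only, not BERTeley's own code; the helper name count_topic_sizes is hypothetical. BERTopic conventionally labels outlier documents with topic -1, and the sketch assumes such a label is counted like any other topic.

```python
from collections import Counter
from typing import Dict, List


def count_topic_sizes(topics: List[int]) -> Dict[int, int]:
    """Map each topic number to the number of documents assigned to it.

    Hypothetical helper mirroring the documented shape of topic_sizes.
    """
    return dict(Counter(topics))


# One topic assignment per document, in document order.
topics = [0, 1, 0, -1, 1, 0]
sizes = count_topic_sizes(topics)

# The sizes always sum to the number of documents.
total_documents = sum(sizes.values())
```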

initialize_model(embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose='False')

Conducts type checks on the input variables and converts certain parameters to the proper types dependent on their input. Not intended to be called by the user, but instead used internally by the fit function.

Parameters

embedding_model – Either a SentenceTransformer, or one of the strings “specter”, “aspire”, or “scibert”.
nr_topics (int) – The desired number of topics. If not specified, the number of topics is determined by HDBSCAN’s reduction step.
n_gram_range – Either the string “unigram” or “bigram”, or a tuple of two ints.
verbose – Boolean indicating verbose output.

Returns

embedding_model – SentenceTransformer language model
n_gram_range – tuple indicating the level of n_gram
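
The kind of type check and conversion described above might look like the sketch below for the n_gram_range parameter. The function name normalize_n_gram_range is hypothetical, and the alias mapping (in particular “bigram” becoming (1, 2) rather than (2, 2)) is an assumption; the source does not say which tuples the strings map to.

```python
from typing import Tuple, Union

# Assumed mapping from the string aliases to (min_n, max_n) tuples.
# The actual values used by the library are not stated in this document.
NGRAM_ALIASES = {"unigram": (1, 1), "bigram": (1, 2)}


def normalize_n_gram_range(n_gram_range: Union[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return a (min_n, max_n) tuple for a string alias or a tuple of two ints."""
    if isinstance(n_gram_range, str):
        if n_gram_range not in NGRAM_ALIASES:
            raise ValueError(
                f"n_gram_range must be one of {sorted(NGRAM_ALIASES)}, got {n_gram_range!r}"
            )
        return NGRAM_ALIASES[n_gram_range]
    if (
        isinstance(n_gram_range, tuple)
        and len(n_gram_range) == 2
        and all(isinstance(n, int) for n in n_gram_range)
    ):
        # Tuples of two ints are passed through unchanged.
        return n_gram_range
    raise TypeError("n_gram_range must be a string or a tuple of two ints")


as_tuple = normalize_n_gram_range("unigram")
passthrough = normalize_n_gram_range((1, 3))
```

Rejecting unknown strings early keeps a misspelled alias from reaching the vectorizer as an invalid range.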