BERTeley Models
- calculate_metrics(texts: List[str], topic_model: bertopic._bertopic.BERTopic, topics: List[int], n_gram_range: Optional[Tuple[int, int]] = None)
Calculates the topic coherence and topic diversity of the topic model.
- Parameters
texts – The list of documents
- Returns
A dict containing the metrics as keys, and their respective scores as values.
- Return type
dict
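As a rough illustration of the second metric: topic diversity is commonly defined as the fraction of unique words among the top words of all topics. The sketch below computes that quantity from a dictionary shaped like the topic_words value returned by fit. The function name topic_diversity and the top_k parameter are hypothetical, and this is not necessarily how calculate_metrics computes its score.

```python
from typing import Dict, List


def topic_diversity(topic_words: Dict[int, List[str]], top_k: int = 10) -> float:
    """Fraction of unique words among the top_k words of every topic.

    Hypothetical helper for illustration only; not part of the documented API.
    """
    # Flatten the first top_k words of each topic into one list.
    all_words = [word for words in topic_words.values() for word in words[:top_k]]
    if not all_words:
        return 0.0
    # 1.0 means no word is shared between topics; lower values mean more overlap.
    return len(set(all_words)) / len(all_words)


# Made-up topic words: "lithium" appears in both topics.
example = {
    0: ["battery", "lithium", "anode"],
    1: ["solar", "cell", "lithium"],
}
print(topic_diversity(example, top_k=3))
```

A score close to 1 suggests that the topics use largely distinct vocabularies, while a low score suggests redundant topics.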
- create_barcharts(topics, topic_model: bertopic._bertopic.BERTopic, path='')
Creates and saves the BERTopic barcharts.
- Parameters
topics – the list of topic assignments output from the fit function
topic_model – fitted BERTopic model output from the fit function
path – A string containing the desired path to save the barchart
- Return type
Barcharts are saved in .html and .png format in the desired directory.
- fit(data: List[str], embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose: bool = False)
Fits a BERTopic model on the data. After fitting, the topic assigned to each document is stored in the ‘topics’ attribute, the coherence and diversity measures are stored in the ‘coherence’ and ‘diversity’ attributes respectively, and the number of documents assigned to each topic is stored in the ‘topic_sizes’ attribute.
- Parameters
data – List of the documents
embedding_model – Either a SentenceTransformer or one of the strings “specter”, “aspire”, or “scibert”.
nr_topics – The desired number of topics. If not specified, the number of topics is determined by HDBSCAN’s reduction step.
n_gram_range – Either the string “unigram” or “bigram”, or a tuple of two ints.
verbose – Boolean indicating verbose output.
- Returns
topics – a list of integers representing the topic the corresponding document was assigned to
probabilities – a
metrics – a dictionary of the two metrics, with keys “Coherence” and “Diversity”
topic_sizes – a dictionary mapping each topic number to the number of documents assigned to that topic
topic_model – the fitted BERTopic model
topic_words – a dictionary mapping each topic number to a list of strings
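The return values listed above could be consumed roughly as follows. This is a minimal sketch that assumes fit returns a 6-tuple in the documented order, which the reference does not state explicitly. The wrapper summarize_fit and the stand-in _fake_fit are hypothetical and exist only so the example runs without the library installed; in real use, the documented fit function would be passed in place of _fake_fit.

```python
from collections import Counter
from typing import Any, Callable, Dict, List


def summarize_fit(fit_fn: Callable[..., tuple], docs: List[str]) -> Dict[str, Any]:
    """Call a fit-like function and collect a few summary values.

    Hypothetical helper; assumes a 6-tuple in the documented order.
    """
    topics, probabilities, metrics, topic_sizes, topic_model, topic_words = fit_fn(
        docs,
        embedding_model="specter",
        nr_topics=None,
        n_gram_range="unigram",
        verbose=False,
    )
    return {
        "n_documents": len(docs),
        "n_topics": len(topic_words),
        "coherence": metrics["Coherence"],
        "diversity": metrics["Diversity"],
        # Recount documents per topic from the per-document assignments.
        "sizes_from_topics": dict(Counter(topics)),
        "topic_sizes": topic_sizes,
    }


def _fake_fit(data, embedding_model="specter", nr_topics=None,
              n_gram_range="unigram", verbose=False):
    """Stand-in with made-up values that mimics the documented return shape."""
    topics = [0, 0, 1]
    probabilities = None  # the reference leaves this value undescribed
    metrics = {"Coherence": 0.5, "Diversity": 0.9}
    topic_sizes = {0: 2, 1: 1}
    topic_model = object()
    topic_words = {0: ["battery", "anode"], 1: ["solar", "cell"]}
    return topics, probabilities, metrics, topic_sizes, topic_model, topic_words


summary = summarize_fit(_fake_fit, ["doc a", "doc b", "doc c"])
print(summary["n_topics"], summary["sizes_from_topics"])
```

Passing the fit function in as an argument keeps the sketch independent of any particular import path.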
- initialize_model(embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose='False')
Conducts type checks on the input variables and converts certain parameters to the proper types depending on their input. Not intended to be called by the user; it is used internally by the fit function.
- Parameters
embedding_model – Either a SentenceTransformer or one of the strings “specter”, “aspire”, or “scibert”.
nr_topics (int) – The desired number of topics. If not specified, the number of topics is determined by HDBSCAN’s reduction step.
n_gram_range – Either the string “unigram” or “bigram”, or a tuple of two ints.
verbose – Boolean indicating verbose output.
- Returns
embedding_model – SentenceTransformer language model
n_gram_range – tuple indicating the level of n_gram
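The kind of type check and conversion described for n_gram_range might look like the sketch below. The function name normalize_n_gram_range is hypothetical, and the mapping of “unigram” to (1, 1) and “bigram” to (1, 2) is an assumption; the reference does not say which tuples the strings are converted to.

```python
from typing import Tuple, Union

# Assumed mapping; the source does not specify the tuples behind each string.
_NAMED_RANGES = {"unigram": (1, 1), "bigram": (1, 2)}


def normalize_n_gram_range(
    n_gram_range: Union[str, Tuple[int, int]] = "unigram",
) -> Tuple[int, int]:
    """Convert a named n-gram level or a (low, high) tuple to a validated tuple.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if isinstance(n_gram_range, str):
        if n_gram_range not in _NAMED_RANGES:
            raise ValueError(f"Unknown n_gram_range string: {n_gram_range!r}")
        return _NAMED_RANGES[n_gram_range]
    if (
        isinstance(n_gram_range, tuple)
        and len(n_gram_range) == 2
        and all(isinstance(n, int) for n in n_gram_range)
    ):
        low, high = n_gram_range
        # A valid range starts at 1 or more and does not decrease.
        if 1 <= low <= high:
            return (low, high)
    raise ValueError(f"Invalid n_gram_range: {n_gram_range!r}")


print(normalize_n_gram_range("unigram"))
print(normalize_n_gram_range((2, 3)))
```

Rejecting unknown strings and malformed tuples early gives the caller a clear error before any model is built.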