BERTeley Models

calculate_metrics(texts: List[str], topic_model: bertopic._bertopic.BERTopic, topics: List[int], n_gram_range: Optional[Tuple[int, int]] = None)

Calculates the Topic Coherence and Topic Diversity of the topic model.

Parameters
  • texts – The list of documents

Returns

A dict containing the metrics as keys, and their respective scores as values.

Return type

dict
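To make the Topic Diversity score concrete, here is a minimal sketch that assumes the common definition of diversity as the fraction of unique words among the top words of all topics. The helper name topic_diversity and the top_k parameter are hypothetical and are not part of this module; the library's own computation may differ.

```python
from typing import Dict, List


def topic_diversity(topic_words: Dict[int, List[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    ASSUMPTION: this mirrors a common definition of topic diversity; it is an
    illustration, not the implementation used by calculate_metrics.
    """
    # Flatten the top_k words of each topic into a single list.
    all_words = [word for words in topic_words.values() for word in words[:top_k]]
    if not all_words:
        return 0.0
    # A score of 1.0 means no word is shared between topics.
    return len(set(all_words)) / len(all_words)


# Hypothetical topic words: "lithium" appears in both topics.
example_topic_words = {
    0: ["battery", "lithium", "anode"],
    1: ["solar", "cell", "lithium"],
}

# The result is stored under the "Diversity" key, matching the documented metric name.
metrics = {"Diversity": topic_diversity(example_topic_words, top_k=3)}
```

In this example five of the six listed words are unique, so the score is 5/6. A value closer to 1 indicates that the topics share fewer top words.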

create_barcharts(topics, topic_model: bertopic._bertopic.BERTopic, path='')

Creates and saves the BERTopic barcharts.

Parameters
  • topics – the list of topic assignments output from the fit function

  • topic_model – fitted BERTopic model output from the fit function

  • path – A string containing the desired path to save the barchart

Return type

Barcharts are saved in .html and .png format in the desired directory.

fit(data: List[str], embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose: bool = False)

Fits a BERTopic model on the data. After fitting, the topic assigned to each document is stored in the ‘topics’ attribute, the coherence and diversity measures are stored in the ‘coherence’ and ‘diversity’ attributes respectively, and the number of documents assigned to each topic is stored in the ‘topic_sizes’ attribute.

Parameters
  • data – List of the documents

  • embedding_model – Either a SentenceTransformer, or one of the strings “specter”, “aspire”, or “scibert”.

  • nr_topics – The desired number of topics. If not specified, the result is determined by HDBSCAN’s reduction step.

  • n_gram_range – Either the string “unigram” or “bigram”, or a tuple of ints.

  • verbose – Boolean indicating verbose output.

Returns

  • topics – a list of integers, one per document, giving the topic that document was assigned to

  • probabilities – a

  • metrics – a dictionary of the two metrics, with keys “Coherence” and “Diversity”

  • topic_sizes – a dictionary mapping each topic number to the number of documents assigned to that topic

  • topic_model – the fitted BERTopic model

  • topic_words – a dictionary mapping each topic number to a list of strings
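The relationship between the returned topics list and the topic_sizes dictionary can be sketched as follows. This is an illustration under the assumption that topic_sizes is a plain count of the assignments in topics; the helper count_topic_sizes is hypothetical and does not exist in this module.

```python
from collections import Counter
from typing import Dict, List


def count_topic_sizes(topics: List[int]) -> Dict[int, int]:
    """Map each topic number to the number of documents assigned to it.

    ASSUMPTION: illustrates the documented shape of topic_sizes; the fit
    function may build this dictionary differently.
    """
    return dict(Counter(topics))


# Hypothetical assignments for six documents. BERTopic uses -1 for outliers.
topics = [0, 1, 0, -1, 1, 0]
topic_sizes = count_topic_sizes(topics)
```

Here topic 0 holds three documents, topic 1 holds two, and one document is an outlier. The sizes always sum to the number of documents passed in.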

initialize_model(embedding_model: Union[sentence_transformers.SentenceTransformer.SentenceTransformer, str] = 'specter', nr_topics: Optional[int] = None, n_gram_range: Union[Literal['unigram', 'bigram'], Tuple[int, int]] = 'unigram', verbose='False')

Conducts type checks on the input variables and converts certain parameters to the proper types depending on their input. Not intended to be called by the user; it is used internally by the fit function.

Parameters
  • embedding_model – Either a SentenceTransformer, or one of the strings “specter”, “aspire”, or “scibert”.

  • nr_topics (int) – The desired number of topics. If not specified, the result is determined by HDBSCAN’s reduction step.

  • n_gram_range – Either the string “unigram” or “bigram”, or a tuple of ints.

  • verbose – Boolean indicating verbose output.

Returns

  • embedding_model – SentenceTransformer language model

  • n_gram_range – tuple indicating the level of n_gram
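The kind of type check and conversion described for n_gram_range might look like the sketch below. The function normalize_n_gram_range is hypothetical, and the mapping of “unigram” to (1, 1) and “bigram” to (1, 2) is an assumption made for illustration; the tuples actually produced by initialize_model are not stated in this reference.

```python
from typing import Tuple, Union


def normalize_n_gram_range(n_gram_range: Union[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Convert a named n-gram level or a tuple of ints into a (min_n, max_n) tuple."""
    # ASSUMPTION: the tuple chosen for each name is illustrative only.
    named_ranges = {"unigram": (1, 1), "bigram": (1, 2)}

    if isinstance(n_gram_range, str):
        if n_gram_range not in named_ranges:
            raise ValueError(f"Unknown n_gram_range string: {n_gram_range!r}")
        return named_ranges[n_gram_range]

    # Accept only a two-element tuple of ints; reject everything else.
    if (
        isinstance(n_gram_range, tuple)
        and len(n_gram_range) == 2
        and all(isinstance(n, int) for n in n_gram_range)
    ):
        return n_gram_range

    raise TypeError("n_gram_range must be 'unigram', 'bigram', or a tuple of two ints")


unigram_range = normalize_n_gram_range("unigram")
custom_range = normalize_n_gram_range((2, 3))
```

Rejecting unknown strings and malformed tuples early keeps invalid values from reaching the model, which matches the stated purpose of running these checks before fitting.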