Topic modeling — Pith glossary


Topic modeling is the unsupervised discovery of themes in a corpus of documents, classically via algorithms like LDA and increasingly via embedding-based clustering.

Why it matters

Knowing what a corpus is *about* without reading it is the foundational task of corpus analytics. Classical LDA (Latent Dirichlet Allocation) was the standard approach until around 2020; modern approaches such as BERTopic and Top2Vec cluster document embeddings instead, which tends to give cleaner topics with less hyperparameter tuning.
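To make the embeddings-plus-clustering idea concrete, here is a minimal sketch in plain Python. It assumes each document already has an embedding vector and uses plain k-means as a stand-in clustering step; BERTopic and Top2Vec use different, more elaborate pipelines. The document titles and 2-D vectors are made-up toy values.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Cluster vectors with plain k-means; returns one cluster index per vector."""
    # Deterministic initialization (an assumption for this sketch):
    # use the first k vectors as the starting centroids.
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments

# Toy 2-D "embeddings" (hypothetical values); real embeddings have
# hundreds of dimensions and come from a language model.
docs = {
    "sourdough starter guide": (0.9, 0.1),
    "rust borrow checker notes": (0.1, 0.9),
    "bread hydration ratios": (0.8, 0.2),
    "async rust pitfalls": (0.2, 0.8),
}
labels = kmeans(list(docs.values()), k=2)
for title, label in zip(docs, labels):
    print(label, title)
```

Documents that land in the same cluster are treated as sharing a topic. Note that k-means requires choosing the number of topics up front, which density-based clustering methods avoid.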

The downstream use is typically navigation: instead of scrolling through 500 documents, you see 12 topics, each with representative documents. Search-by-topic, recommendation-by-topic, and trend detection all build on this.
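One common way to pick representative documents for a topic is to take the ones whose embeddings lie closest to the topic's centroid. The sketch below assumes topic assignments already exist; the function name `representatives`, the titles, and the vectors are all hypothetical, and cosine similarity is just one reasonable choice of closeness measure.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def representatives(embeddings, topics, top_n=1):
    """Return, per topic, the top_n document ids closest to the topic centroid."""
    by_topic = defaultdict(list)
    for doc_id, topic in topics.items():
        by_topic[topic].append(doc_id)
    result = {}
    for topic, doc_ids in by_topic.items():
        vectors = [embeddings[d] for d in doc_ids]
        # Centroid: the per-dimension mean of the topic's embeddings.
        centroid = [sum(dim) / len(vectors) for dim in zip(*vectors)]
        ranked = sorted(
            doc_ids,
            key=lambda d: cosine(embeddings[d], centroid),
            reverse=True,
        )
        result[topic] = ranked[:top_n]
    return result

# Toy data (hypothetical): 2-D embeddings and precomputed topic assignments.
embeddings = {
    "no-knead bread": (1.0, 0.0),
    "sourdough basics": (0.9, 0.1),
    "pizza dough": (0.7, 0.3),
    "rust lifetimes": (0.0, 1.0),
    "rust traits": (0.2, 0.8),
    "rust ownership": (0.1, 0.9),
}
topics = {
    "no-knead bread": "baking",
    "sourdough basics": "baking",
    "pizza dough": "baking",
    "rust lifetimes": "rust",
    "rust traits": "rust",
    "rust ownership": "rust",
}
reps = representatives(embeddings, topics)
print(reps)
```

Raising `top_n` returns several representatives per topic, which is closer to what a navigation view would show.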

How Pith relates

Pith's topic map clusters bookmarks by their embeddings. Hover over a cluster to see its theme; click it to filter bookmarks to that cluster. Cluster labels are LLM-generated from the cluster's content.

