LDA - Project Gutenberg - Moby-Dick

by Joaquim Bonacho

Introduction

In this post, I will share the results of a topic modeling analysis using Latent Dirichlet Allocation (LDA) on a corpus derived from The Project Gutenberg eBook of Moby-Dick. 

Methodology

The analysis followed these key steps:

  1. Text Collection: A sample of text was compiled from the Project Gutenberg edition of Moby-Dick (https://gutenberg.org/files/2701/2701-0.txt), covering various key events and characters from the novel.
  2. Text Preprocessing: The text was cleaned by converting it to lowercase, removing punctuation, numbers, and stopwords, and then applying stemming.
  3. Document-Term Matrix (DTM) Creation: The cleaned text was converted into a Document-Term Matrix.
  4. LDA Model Fitting: The LDA model was fitted with 5 topics.
  5. Visualization: The results were visualized using bar plots of the top terms for each topic, a word cloud for the entire corpus, and a bar plot showing the topic distribution across the corpus. An interactive visualization was also created with LDAvis.
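
The post does not say which tools or settings were used, so the following is only a minimal Python sketch of steps 2 to 4 using the standard library. The stopword list, the crude suffix-stripping stemmer (a stand-in for a real stemmer such as Porter's), the collapsed Gibbs sampler, the hyperparameters alpha and beta, and the toy sentences are all assumptions made for illustration, not the original analysis.

```python
import random
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real run would use a full one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it",
             "that", "this", "his", "he", "i", "with", "as", "for", "on", "at"}

def crude_stem(word):
    # Stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, keep alphabetic runs only (drops punctuation and numbers),
    # remove stopwords, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

def build_dtm(docs_tokens):
    # Returns (vocab, dtm); dtm[d][j] is the count of vocab[j] in document d.
    vocab = sorted({t for doc in docs_tokens for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    dtm = []
    for doc in docs_tokens:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[index[term]] = count
        dtm.append(row)
    return vocab, dtm

def fit_lda(dtm, n_topics=5, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    # Collapsed Gibbs sampling for LDA; returns raw count matrices
    # (document x topic, topic x term).
    rng = random.Random(seed)
    n_docs, n_terms = len(dtm), len(dtm[0])
    # Expand the matrix back into one (document, term) pair per token.
    tokens = [(d, w) for d in range(n_docs) for w in range(n_terms)
              for _ in range(dtm[d][w])]
    doc_topic = [[0] * n_topics for _ in range(n_docs)]
    topic_term = [[0] * n_terms for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    for d, w in tokens:  # random initial topic for every token
        k = rng.randrange(n_topics)
        assign.append(k)
        doc_topic[d][k] += 1
        topic_term[k][w] += 1
        topic_total[k] += 1
    for _ in range(n_iter):
        for i, (d, w) in enumerate(tokens):
            k = assign[i]  # remove the token's current assignment
            doc_topic[d][k] -= 1
            topic_term[k][w] -= 1
            topic_total[k] -= 1
            # Conditional probability of each topic for this token (unnormalized).
            weights = [(doc_topic[d][t] + alpha) * (topic_term[t][w] + beta)
                       / (topic_total[t] + beta * n_terms)
                       for t in range(n_topics)]
            k = rng.choices(range(n_topics), weights=weights)[0]
            assign[i] = k
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
    return doc_topic, topic_term

def top_terms(topic_term, vocab, n=3):
    # The n highest-count terms for each topic.
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in topic_term]

# Toy documents written for this sketch (not quotations from the novel).
docs = ["The captain chased the white whale across the sea.",
        "Sailors on the ship hunted whales for oil.",
        "The ship sailed from the harbor in winter."]
vocab, dtm = build_dtm([preprocess(d) for d in docs])
doc_topic, topic_term = fit_lda(dtm, n_topics=2, n_iter=50)
print(top_terms(topic_term, vocab))
```

In a real run, the downloaded text would first be split into documents (for example by chapter, which is itself an assumption here) and n_topics would be left at 5 to match step 4.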

Results



Below, I show the interactive LDAvis graph for each topic. A topic can be selected in the graph, and the "Slide to adjust relevance metric" control changes how its terms are ranked:
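
[Figure: LDAvis view with Topic 1 selected]

[Figure: LDAvis view with Topic 2 selected]

[Figure: LDAvis view with Topic 3 selected]

[Figure: LDAvis view with Topic 4 selected]

[Figure: LDAvis view with Topic 5 selected]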
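
For readers unfamiliar with the relevance slider in LDAvis, here is a small Python sketch of how it reorders a topic's terms. It assumes the relevance definition published with LDAvis, relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where lambda is the slider value. The probabilities below are made-up numbers for illustration, not results from this analysis.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    # lam = 1 ranks by the term's probability within the topic;
    # lam = 0 ranks by lift, i.e. how much more likely the term is
    # in the topic than in the corpus overall.
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

def rank_terms(topic_probs, corpus_probs, lam):
    # Terms of one topic, sorted from most to least relevant.
    return sorted(topic_probs,
                  key=lambda w: relevance(topic_probs[w], corpus_probs[w], lam),
                  reverse=True)

# Hypothetical probabilities: "whale" is frequent everywhere,
# "ahab" is rarer overall but concentrated in this topic.
topic_probs = {"whale": 0.05, "ahab": 0.02}
corpus_probs = {"whale": 0.04, "ahab": 0.004}

for lam in (1.0, 0.0):
    print(lam, rank_terms(topic_probs, corpus_probs, lam))
```

Moving the slider toward 0 therefore favors terms that are distinctive for the selected topic, while values near 1 favor terms that are simply frequent in it.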