Generalized word shift graphs: A method for visualizing and explaining pairwise comparisons between texts

R. J. Gallagher, M. R. Frank, L. Mitchell, A. J. Schwartz, A. J. Reagan, C. M. Danforth, and P. S. Dodds

EPJ Data Science, 10, 4, 2021

arXiv version | arXiv page | journal version | journal page | GitHub repository | Twitter thread

Times cited: 73

Abstract:

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for arbitrary linear measures. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
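As a rough illustration of the idea in the abstract, the sketch below decomposes the difference between two texts' dictionary-weighted average scores into per-word contributions. It is a simplified sketch, not the authors' implementation: it covers only the basic case of a single fixed dictionary with the first text's average as the reference value, and the function names, the toy scores, and the toy texts are all hypothetical.

```python
from collections import Counter


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocab word, ignoring unscored tokens."""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no scored words")
    return {w: counts[w] / total for w in vocab}


def word_shift(tokens_1, tokens_2, scores):
    """Per-word contributions to Phi(text 2) - Phi(text 1).

    Phi(T) = sum_w scores[w] * p_w(T) is a weighted average, so the
    difference is linear in the words and splits into one term per word.
    Subtracting the reference value (the average score of text 1) does
    not change the total, because the frequency differences sum to zero.
    """
    vocab = set(scores)
    p1 = relative_freqs(tokens_1, vocab)
    p2 = relative_freqs(tokens_2, vocab)
    ref = sum(scores[w] * p1[w] for w in vocab)
    return {w: (scores[w] - ref) * (p2[w] - p1[w]) for w in vocab}


# Hypothetical toy dictionary and texts, for illustration only.
scores = {"happy": 8.0, "sad": 2.0, "rain": 4.0}
text_1 = "sad rain rain happy".split()
text_2 = "happy happy rain sad".split()

shifts = word_shift(text_1, text_2, scores)

# Rank words by the size of their contribution, as a word shift graph would.
ranked = sorted(shifts.items(), key=lambda kv: abs(kv[1]), reverse=True)
for word, contribution in ranked:
    print(f"{word:>6}  {contribution:+.3f}")
print(f" total  {sum(shifts.values()):+.3f}")
```

In this toy case the average score rises from 4.5 to 5.5, and the per-word terms show that most of the increase comes from "happy" being used more often, with a smaller part from "rain" (which scores below the reference) being used less often.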

BibTeX:

@Article{gallagher2021a,
  author =	 {Gallagher, Ryan J. and Frank, Morgan R. and Mitchell, Lewis and Schwartz, Aaron J. and Reagan, Andrew J. and Danforth, Christopher M. and Dodds, Peter Sheridan},
  title =	 {Generalized word shift graphs: A method for visualizing and explaining pairwise comparisons between texts},
  journal =	 {EPJ Data Science},
  year =	 {2021},
  key =	 {systems,text,words,divergences,entropy,language},
  volume =	 {10},
  pages =	 {4},
  note =	 {Available online at \href{https://arxiv.org/abs/2008.02250}{https://arxiv.org/abs/2008.02250}},
}
