Data and the Stack

What we do behind the scenes is not magic. It’s pure, down-to-earth data processing and visualization. The data we use is all public, the processing tools are common, and the visualizations are there just to represent the data in an insightful manner. The user chooses which data they want to explore, and we make sure it’s available and accessible.

We started with Twitter data, because the platform offers several advantages for our purposes:

  • Wide adoption among politicians, government bodies, journalists, columnists, and media, precisely the demographics we are most interested in
  • Textual focus, for in-depth searching and analysis
  • Powerful API to collect this data in a flexible and scalable manner
  • Volume, quality, and richness of the data, exposing the deeper patterns that exist within political and apolitical discourse

On the backend, collection starts with the notion of a Feed, a data structure that keeps track of what data we are collecting and what we’ve collected so far, so that we can ask the Twitter API for only the data we do not have yet. Examples of Feeds are, colloquially, ‘Tweets by @realDonaldTrump’, ‘Tweets that mention @NASA’, or ‘Tweets that contain #metoo’.
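To make this concrete, here is a minimal sketch of what a Feed might look like as a Django model; the field and choice names are illustrative assumptions, not our actual schema:

```python
# A minimal sketch of a Feed as a Django model. Field and choice names
# are illustrative assumptions, not the actual schema.
from django.db import models


class Feed(models.Model):
    """Tracks what we collect and how far we have collected it."""

    KIND_CHOICES = [
        ("user", "Tweets by a user"),
        ("mention", "Tweets mentioning a user"),
        ("hashtag", "Tweets containing a hashtag"),
    ]

    kind = models.CharField(max_length=16, choices=KIND_CHOICES)
    query = models.CharField(max_length=255)  # e.g. "@NASA" or "#metoo"
    # Highest Tweet id seen so far; passed as since_id to the Twitter
    # API so that each update only returns Tweets we don't have yet.
    last_seen_id = models.BigIntegerField(null=True)
    last_updated = models.DateTimeField(auto_now=True)
```

The last_seen_id column is the bookkeeping that lets an update ask Twitter only for Tweets newer than the ones already stored.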

Feeds are updated periodically by a Python application that uses the Django ORM on MySQL; the update frequency scales with the volume at which a Feed produces new items, to guarantee a certain level of data freshness. When a Feed produces new Tweets, they are queued onto RabbitMQ, and then ingested into an Elasticsearch cluster for both permanent storage and easy access, i.e. retrieval, aggregation, and search, through a REST API.
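The hand-off between the two halves of that pipeline might look roughly like the sketch below, using the pika and elasticsearch Python clients; the queue name, index name, and single-process layout are simplifying assumptions (in practice the publisher and consumer would run as separate processes):

```python
# Sketch of the pipeline hand-off: a collector publishes new Tweets to
# RabbitMQ, and a consumer indexes them into Elasticsearch. Queue and
# index names are illustrative assumptions; assumes a local RabbitMQ
# and a local Elasticsearch 7.x.
import json

import pika
from elasticsearch import Elasticsearch

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tweets", durable=True)

es = Elasticsearch()


def publish(tweet: dict) -> None:
    """Collector side: queue a freshly fetched Tweet."""
    channel.basic_publish(exchange="", routing_key="tweets", body=json.dumps(tweet))


def ingest(ch, method, properties, body) -> None:
    """Consumer side: index a queued Tweet into Elasticsearch."""
    tweet = json.loads(body)
    # Using the Tweet id as the document id makes ingestion idempotent:
    # a re-delivered message simply overwrites the same document.
    es.index(index="tweets", id=tweet["id"], body=tweet)
    ch.basic_ack(delivery_tag=method.delivery_tag)


channel.basic_consume(queue="tweets", on_message_callback=ingest)
channel.start_consuming()
```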

On the frontend we have a Bootstrap / D3 web application built on ReactJS that allows the user to do two main things:

  • create Feeds
  • explore the data

Feeds we already covered. Central to exploring the data is the notion of a Query. Queries define subsets of the data to be considered for further processing, usually by some means of analysis and/or visualization. This can range from a simple listing of individual data items, i.e. Tweets, through aggregates like Tweet volume over time or a top 10 of the most common authors, to more comprehensive patterns like significant terms, sentiment, and topics.
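As an example, both ‘Tweet volume over time’ and a ‘top 10 of common authors’ map naturally onto Elasticsearch aggregations. Here is a sketch of how such a Query might be expressed, assuming an index named tweets with created_at and author fields, on Elasticsearch 7+ (where calendar_interval is available):

```python
# Sketch of two example Queries expressed as Elasticsearch aggregations;
# index and field names are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch()

response = es.search(
    index="tweets",
    body={
        "query": {"match": {"text": "#metoo"}},  # the subset this Query defines
        "size": 0,  # we want aggregates, not individual Tweets
        "aggs": {
            # "Tweet volume over time": one bucket per day
            "volume_over_time": {
                "date_histogram": {
                    "field": "created_at",
                    "calendar_interval": "day",
                }
            },
            # "Top 10 of common authors": most frequent author values
            "top_authors": {"terms": {"field": "author", "size": 10}},
        },
    },
)
```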

In the middle, gluing frontend and backend together, we have a Django application that exposes a REST API through which the frontend accesses the data stored in the Elasticsearch cluster. This is a high-level API that translates user requests, like “give me the most common authors in this week’s replies to Tweets to Trump”, into low-level calls against the Elasticsearch REST API. The Django API collects the data from Elasticsearch and processes it into a data structure that the frontend can use without much effort.
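As an illustration, that example request might be handled roughly as follows; the field names, the URL this view would be routed to, and the response shape are assumptions made for this sketch, not our actual API:

```python
# Sketch of the Django API translating a high-level request, "most
# common authors in this week's replies to Tweets to Trump", into a
# low-level Elasticsearch call. Field names and response shape are
# illustrative assumptions.
from django.http import JsonResponse
from elasticsearch import Elasticsearch

es = Elasticsearch()


def top_reply_authors(request):
    result = es.search(
        index="tweets",
        body={
            "query": {
                "bool": {
                    "filter": [
                        {"term": {"in_reply_to_screen_name": "realDonaldTrump"}},
                        {"range": {"created_at": {"gte": "now-7d/d"}}},
                    ]
                }
            },
            "size": 0,
            "aggs": {"authors": {"terms": {"field": "author", "size": 10}}},
        },
    )
    # Flatten the Elasticsearch response into the simple structure the
    # frontend expects.
    buckets = result["aggregations"]["authors"]["buckets"]
    return JsonResponse(
        {"authors": [{"name": b["key"], "tweets": b["doc_count"]} for b in buckets]}
    )
```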