Data and the Stack

What we do behind the scenes is not magic. It’s pure, down-to-earth data processing and visualization. The data we use is all public, the processing tools are common, and the visualizations are there just to represent the data in an insightful manner. The user chooses which data they want to explore, and we make sure it’s available and accessible.

We started with Twitter data, due to various beneficial aspects the platform provides for our purposes:

  • Wide adoption among politicians, government bodies, journalists, columnists, and media, precisely the demographics we are most interested in.
  • Textual focus, for in-depth searching and analysis.
  • A powerful API to collect this data in a flexible and scalable manner.
  • Volume, quality, and richness of the data, exposing the deeper patterns that exist within political and apolitical discourse.

On the backend, collection starts with the notion of a Feed: a data structure that keeps track of what data we are collecting and what we’ve collected so far, so we can query the Twitter API for only the data we do not have yet. Examples of Feeds are, colloquially, ‘Tweets by @realDonaldTrump’ or ‘Tweets that mention @NASA’ or ‘Tweets that contain #metoo’.
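The essence of a Feed can be sketched as a small record that pairs a collection query with a high-water mark. All field names and defaults below are illustrative assumptions, not the actual schema; in particular, `since_id` stands in for whatever bookkeeping lets us ask the Twitter API for only new items.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Feed:
    """A collection target plus a record of how far we've gotten.

    Hypothetical sketch; field names are not the real schema.
    """
    query: str                      # e.g. '#metoo', '@NASA', 'from:realDonaldTrump'
    kind: str                       # e.g. 'hashtag', 'mention', 'user_timeline'
    since_id: Optional[int] = None  # highest Tweet ID stored so far

    def api_params(self) -> dict:
        """Parameters for the next Twitter API call: passing since_id
        means the API only returns Tweets we don't have yet."""
        params = {"q": self.query, "count": 100}
        if self.since_id is not None:
            params["since_id"] = self.since_id
        return params

# First poll fetches everything; later polls resume from the high-water mark.
feed = Feed(query="#metoo", kind="hashtag")
feed.since_id = 1050118621198921728  # updated after each successful fetch
```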

Feeds are periodically updated by a Python application that uses the Django ORM on MySQL. How often a Feed is polled scales with the volume at which it produces new items, to guarantee a certain level of data freshness. When a Feed produces new Tweets, they are queued onto RabbitMQ and then ingested into an Elasticsearch cluster for both permanent storage and easy access (retrieval, aggregation, and search) through a REST API.
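The volume-based scheduling could look something like the following: a busy Feed gets polled often, a quiet one rarely, so each poll fetches roughly a constant batch. This is a minimal sketch under assumed parameters, not the real scheduler; all names and defaults are hypothetical.

```python
def poll_interval(items_per_hour: float,
                  min_seconds: int = 60,
                  max_seconds: int = 3600,
                  target_batch: int = 20) -> int:
    """Seconds until a Feed's next update, sized so that each poll
    collects roughly `target_batch` new items, clamped to sane bounds.

    Illustrative sketch; the actual scheduler may differ.
    """
    if items_per_hour <= 0:
        return max_seconds  # dormant Feed: poll at the slowest rate
    seconds = target_batch / items_per_hour * 3600
    return int(min(max(seconds, min_seconds), max_seconds))
```

A Feed producing 1200 Tweets/hour would be polled every minute, while one producing a handful per day falls back to the hourly maximum.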

On the frontend we have a Bootstrap / D3 web application built on ReactJS that allows the user to do two main things:

  • create Feeds
  • explore the data

Feeds we already covered. Central to exploring the data is the notion of a Query. Queries define subsets of the data to be considered for further processing, usually by some means of analysis and/or visualization. This can be a simple representation of individual data items (i.e. Tweets), something more complex like Tweet volume over time or a top 10 of common authors, or more comprehensive patterns like significant terms, sentiment, and topics.
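Under the hood, Queries like these map naturally onto Elasticsearch aggregations. The builders below sketch what the "top 10 authors" and "volume over time" examples might translate to; the field names (`feed_id`, `author.screen_name`, `created_at`) are assumptions about the index mapping, not the actual schema.

```python
def top_authors_query(feed_id: str, size: int = 10) -> dict:
    """Request body for a 'top N common authors' Query
    (terms aggregation; field names are hypothetical)."""
    return {
        "size": 0,  # aggregation only, skip individual Tweets
        "query": {"term": {"feed_id": feed_id}},
        "aggs": {
            "top_authors": {
                "terms": {"field": "author.screen_name", "size": size}
            }
        },
    }

def volume_over_time_query(feed_id: str, interval: str = "1d") -> dict:
    """Request body for a 'Tweet volume over time' Query
    (date_histogram aggregation)."""
    return {
        "size": 0,
        "query": {"term": {"feed_id": feed_id}},
        "aggs": {
            "volume": {
                "date_histogram": {"field": "created_at",
                                   "fixed_interval": interval}
            }
        },
    }
```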

In the middle, gluing frontend and backend together, we have a Django application that provides a REST API through which the frontend accesses the data stored in the Elasticsearch cluster. This is a high-level API that translates user requests, like “give me the most common authors in this week’s replies to Tweets to Trump”, into low-level API calls submitted to the Elasticsearch REST API. The Django API collects the data from Elasticsearch and processes it into a data structure that the frontend can use without much effort.
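The "processing into a frontend-friendly structure" step can be sketched as flattening Elasticsearch's nested aggregation response into a plain list the frontend can render directly. The response shape is the standard Elasticsearch terms-aggregation format; the aggregation name and output keys here are hypothetical.

```python
def format_top_authors(es_response: dict) -> list:
    """Flatten an Elasticsearch terms-aggregation response into the
    simple list the frontend renders (sketch; key names assumed)."""
    buckets = es_response["aggregations"]["top_authors"]["buckets"]
    return [{"author": b["key"], "tweets": b["doc_count"]} for b in buckets]
```

This keeps all Elasticsearch-specific structure out of the frontend: the React components only ever see small, flat records.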


Trust in the media is in decline. News used to be handled by journalists. Trench coat, pencil and notepad, always close to a payphone to relay the latest news back to the editorial room just in time for tomorrow’s headlines. We relied on them for truth, relevance and objectivity. Then the internet got big and everything changed. We now enjoy, at our fingertips, many sources of many truths, endless feeds of little relevance, and with the fading boundary between information and opinion, more than ever we have to ask ourselves “Who is the source of this information, really?”, “Do they have an agenda?”, “What is the relevant context?”, and “Is this even true at all?”

Prevailing sentiment is for organisations to answer these questions for us, by filtering ‘fake news’, naming sources, presenting opposing views, et cetera. Yet ultimately these approaches do not satisfy, as they rely, again, on trusting the media to carry out these tasks judiciously.

MediaGraphs strives to address this situation by creating an ecosystem in which we can explore and visualize online and offline media, and answer these questions for ourselves, side-stepping the trust issue altogether.