Parsing a Twitter archive

I just signed up to Pinboard because I wanted a permanent resource to capture all the links I’ve posted and retweeted on Twitter. While Pinboard integrates well in terms of capturing links from your ongoing feed, it will only work backwards through the previous 3200 tweets due to Twitter’s API limit. So the first thing I wanted to do was process my long-term Twitter archive to get everything from the previous four years. You’d think other people would want this too, so something must exist to do it, right? Wrong.

In the end, I had to write a small Ruby hack to parse the data from the archive, grep the links and post them to Pinboard. It was actually fairly easy once I’d found a suitable example via Google and worked out which variant of each Ruby gem I needed was still maintained and working. I added a few comfort touches like expansion of shortened links and decoding of HTML entities (Twitter uses them; Pinboard doesn’t). What took the most time was understanding the Twitter archive data format, since there doesn’t appear to be any formal documentation for it. But it’s basically JSON, so it’s fairly readable once you’ve perused a few example tweets. [N.B. I’m not a JSON expert or a regular coder and have apparently forgotten the little Ruby I ever learned, so treat all this as the desperate grasps for comprehension of a total naif.]

Your tweets are all stored in datestamped files within the data/js/tweets directory of the archive. Each file is formatted in JSON except for the first line (beginning ‘Grailbird…’), which needs to be discarded:

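require 'json'
# Drop the leading "Grailbird.data.tweets_... =" assignment so the rest parses as JSON
data = File.read(file).sub(/\AGrailbird\.data\.tweets_[^=]*=/, '')
j = JSON.parse(data)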
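
To run the same step over the whole archive, something like the following should work. It is only a sketch: the helper names parse_tweet_file and load_archive are made up for illustration, and it assumes the monthly files carry a .js extension and that sorting the filenames puts them in date order.

```ruby
require 'json'

# Strip the JavaScript assignment prefix so the remainder parses as JSON.
# (parse_tweet_file is a hypothetical helper name.)
def parse_tweet_file(contents)
  JSON.parse(contents.sub(/\AGrailbird\.data\.tweets_[^=]*=/, ''))
end

# Read every monthly file under data/js/tweets and return one flat array
# of tweet hashes. Assumes the files end in .js (not confirmed above).
def load_archive(archive_dir)
  pattern = File.join(archive_dir, 'data', 'js', 'tweets', '*.js')
  Dir.glob(pattern).sort.flat_map { |path| parse_tweet_file(File.read(path)) }
end
```

Calling load_archive with the path to the unzipped archive would then give a single array to iterate over, rather than one array per month.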

What you’re left with is an array of the individual tweets for that month, each element a hash holding the various components of the tweet, beginning with the source (i.e. the app or website used to post the tweet). The key parts required for a Pinboard post are the hashtags and URLs from the entities hash, the text and the created_at field. Note that the urls entity doesn’t appear to be populated in older tweets (circa 2010), so if it is empty you need to grep the text with a suitable regex to locate any links (I used the URI.extract method for this). If the entity is populated, take the expanded_url field. (The urls entity is actually an array of URLs but, as a Pinboard post can only show one link, I only take the first element each time. However, there’s a fallback way to view any others, as discussed further below.) I used the LongURL module to try to expand each link to its final destination, bypassing any URL shortening used (itself often shortened again by Twitter’s t.co wrapper), and so generate a meaningful link.
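
As a rough illustration of that lookup-with-fallback, here is a minimal sketch. The method name first_link is made up, and it calls the extraction through URI::RFC2396_Parser rather than URI.extract itself, since newer Rubies flag URI.extract as obsolete. Link expansion is left out.

```ruby
require 'uri'

# Return the first link for a tweet hash: the expanded_url from the urls
# entity when it is populated, otherwise the first http(s) URL found in
# the tweet text. Returns nil when the tweet has no link at all.
# (first_link is a hypothetical helper name.)
def first_link(tweet)
  urls = tweet.dig('entities', 'urls') || []
  return urls.first['expanded_url'] unless urls.empty?

  URI::RFC2396_Parser.new.extract(tweet['text'].to_s, %w[http https]).first
end
```

Restricting the extraction to the http and https schemes stops it from picking up things like "RT:" that happen to look like a URI scheme.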

Similarly, the hashtags entity is an array of hashtags in the tweet, so I iterate over that and gather the text item from each entry for Pinboard’s tagtext array parameter.
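
In Ruby terms that iteration is a one-liner. A sketch, with hashtag_tags as a made-up name and the assumption that each hashtag entry keeps its label under the text key as described above:

```ruby
# Collect the text of each hashtag entity into a plain array of tags.
# Returns an empty array when the tweet has no hashtags entity.
# (hashtag_tags is a hypothetical helper name.)
def hashtag_tags(tweet)
  (tweet.dig('entities', 'hashtags') || []).map { |tag| tag['text'] }
end
```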

The text part becomes the Pinboard description field; since this contains the original text including all the links as posted, it acts as a backup of the original URI(s) in the event that the link expansion doesn’t work correctly. One thing I’ve learnt from this: obsolete URL shorteners are destructive to the Internet’s memory, since you’re left with no easy way to recover the original link destination. (Principal offender here is The Browser’s apparently defunct b.rw app, which means that their older posted links are now all invalid. Bit of a drawback for a curation site, that.) Also, many sites replace obsolete page links with redirects to their top level home page (or the page of the company that bought them out), which is no help at all. I guess that’s the drawback of relying on an ‘ephemeral’ medium like Twitter for archiving.

The only tricky part concerns (native) retweets: the tweet contains details of your retweet, including the ‘RT’ header with the retweeted user’s name and abridged text, while the original tweet is nested in a retweeted_status field within that tweet. This means that when you find such a field, you need to pull the relevant details from the surrounding tweet (I use its text for Pinboard’s description field as it shows the attribution, and retain the datestamp of the retweet rather than the original) and then use the nested original tweet for the actual link (you can treat this as a normal tweet; i.e. unwrap it from the parent element and proceed as before). I put the original, unabridged tweet text in Pinboard’s extended description field.
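
The unwrapping might look something like this. It is a sketch rather than my actual code: pinboard_fields and the keys of the hash it returns are invented for illustration, and for brevity it only looks at the urls entity, without the text fallback for older tweets.

```ruby
# Build the pieces of a Pinboard post from a tweet hash, handling native
# retweets. The returned keys (:description, :extended, :created_at, :url)
# are hypothetical names, not Pinboard API parameters.
def pinboard_fields(tweet)
  original = tweet['retweeted_status']   # nil unless this is a native retweet
  source   = original || tweet           # the tweet that carries the real link
  urls     = source.dig('entities', 'urls') || []

  {
    description: tweet['text'],                       # outer text keeps the 'RT @user' attribution
    extended:    original ? original['text'] : nil,   # unabridged original text, retweets only
    created_at:  tweet['created_at'],                 # datestamp of the retweet, not the original
    url:         urls.first && urls.first['expanded_url']
  }
end
```

For an ordinary tweet, source is just the tweet itself, so the same method covers both cases.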

Unfortunately, I haven’t been able to process my entire archive as the one I’d previously downloaded only extends to February 2013 and, when I try to request an updated one from Twitter, it first wants to verify my email address and then fails to send a confirmation email for this purpose, despite continuing to send notifications successfully to the same address.

I’m surprised that there appears to be little other work undertaken in mining and analysing Twitter archives, as there are probably a number of vaguely useful stats and summaries that could be generated from them. But then I guess most of that isn’t readily monetisable, particularly as the data isn’t considered current.

Other bubbles

Exploring your twitter archive with unix: David MacIver has a nice blog post on analysing your archive with jq, a command line JSON parser.
Original code fragment I used as the basis of my parser hack (in Japanese, but the code is easily understandable).
