Duck Identification Error

I had a conversation on Twitter the other week about why the hashtag #demo2010 might not be trending. More than one person was suggesting that Twitter were censoring it. I would be floored to learn that they were. That doesn’t mean that they weren’t, of course. I have no insider knowledge, and I’ve been wrong before, and will be wrong again. If it turns out I need to eat my words, I shall have them as a side to a nice steak, and move on.

But I attempted to explain why, and 140 characters was a bad length to do it in. So I thought I’d do it here.

So, in the first place, why don’t I think it was censorship? Well, as a US company, whose kit and advertisers are (mostly) in the US, Twitter has no obvious business driver to censor a tag relating to posts in the UK. It’s possible that the UK government or police were leaning on them, but I can’t see what the stick was. “We’ll stop you doing business in the UK!” How, exactly? Without provoking public outcry? It’s possible that the police or UK government offered them a really big cheque, I guess, but then why didn’t the Iranians do that during the protests there? Twitter doesn’t have an obvious track record of bowing to governmental pressure – in fact, they’ve gone out of their way before now to let politically relevant things trend without interruption.

Not only that, but all the searches for the #demo2010 hashtag still worked. The real value to the UK government or police in censoring it would have been to stop it being used as an organising or reporting tool. And it definitely wasn’t stopped in that regard.

And then we get to the professional part. I know a little about the way this shit works. Not, y’know, a lot, but a little. In my day job I deal with searching large datasets, searching textual data, and analysing user behaviour. I don’t do it in real time the way Twitter tries to, but the problems I encounter would actually be worse done in real time: your ability to notice aberrations in the data and correct for them vanishes when your dataset is time-based with a rolling window, because by the time you’ve noticed an aberration, that dataset is no longer valid.
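To make that concrete, here’s a minimal sketch of a rolling-window counter, in Python. Everything here – the names, the window size – is my own invention for illustration, not anything I know about Twitter’s internals. The point it demonstrates: once posts age out of the window, the counts that produced an aberration are simply gone.

```python
# A minimal sketch of the rolling-window problem: counts are only valid
# for the current window, so by the time you spot an aberration, the
# data that produced it has already aged out. Names and the window size
# are my own illustration, not anything Twitter actually runs.
from collections import deque, Counter
import time

WINDOW_SECONDS = 600  # ten-minute rolling window (arbitrary choice)

events = deque()   # (timestamp, tag) pairs, oldest first
counts = Counter() # live counts for the current window only

def record(tag, now=None):
    now = now if now is not None else time.time()
    events.append((now, tag))
    counts[tag] += 1
    # Expire anything older than the window; these counts are gone for
    # good, which is why you can't retroactively "correct" the data.
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, old_tag = events.popleft()
        counts[old_tag] -= 1
```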

So, first thing, let me say right up front: #demo2010 was the most widely-used-in-a-short-space-of-time hashtag I can recall seeing. If you disagree with that assertion, then I don’t know what to say, other than that your experience is different to mine.

Principle #1: If you’re searching large volumes of text, you don’t literally comb through the text itself, looking for matching strings. You build tools to index that text, and then you search the indices. And when building these indices, you exclude certain things. You don’t index words like “is” or “the”, or other common words. So far, so back-of-the-reference-book.
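For illustration, here’s a toy inverted index with stop-word exclusion, in Python. This is deliberately the back-of-the-reference-book version – real systems tokenise and store things far more cleverly – and every name in it is mine, not anyone’s production code.

```python
# A toy inverted index: you search the index, never the raw text, and
# common "stop words" are excluded at indexing time.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to"}

index = {}  # word -> set of document ids

def add_document(doc_id, text):
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue  # common words never make it into the index
        index.setdefault(word, set()).add(doc_id)

def search(word):
    # Searches hit the index, never the original text.
    return index.get(word.lower(), set())

add_document(1, "the march is happening today")
add_document(2, "the weather is nice")
print(search("march"))  # {1}
print(search("the"))    # set() -- excluded, so effectively unsearchable
```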

Principle #2: If you build a publicly accessible index, people will try and game it – Flickr had problems with its “Interestingness” index in the weeks after it launched, and yes, Twitter had problems with people spamming Trends in 2008/2009. Think of it like slightly more complex comment spam. And to deal with that, you identify patterns of behaviour that look like spammers, and you exclude them from your indices. Yes, sometimes you’re going to get false positives – patterns of legitimate behaviour that look like spam. Just this morning, I found a comment on this blog that I genuinely couldn’t decide was from a human or a spammer. So I assumed it was from a spammer, and killed it, because statistically, that’s more likely. (If it wasn’t, and that was you, I’m terribly sorry!)
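Here’s a crude sketch of the kind of heuristic I mean, with a function name and thresholds I’ve plucked out of the air. Note the false-positive problem baked right in: a burst of genuine human enthusiasm can trip it just as easily as a bot.

```python
# A crude spammer heuristic, purely my own illustration: if an account
# posts the same tag at a machine-like rate, exclude it from the index.
def looks_like_spam(timestamps, min_posts=20, min_gap_seconds=2.0):
    """True if this account's posts of a tag arrive suspiciously fast.

    timestamps: sorted posting times (seconds) for one account and tag.
    """
    if len(timestamps) < min_posts:
        return False  # too little data to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Humans are bursty but irregular; bots are fast and metronomic.
    # A very excited human can still look like a bot here.
    return sum(gaps) / len(gaps) < min_gap_seconds
```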

Twitter, of course, is all about the real-time data. Which means they’re going to consider the following things when deciding if something is a word that should be trending (a rough sketch of both tests follows the list):

  1. Is it an uncommon word, within the time period of the data under consideration? (Because with a large enough dataset, all words become common, so the uncommonality has to be within a specific set.)
  2. Is it being posted at a rate that is consistent with normal human activity, rather than spammers? (Who tend to use automated systems.)
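Here’s the rough sketch promised above, combining both tests with entirely made-up thresholds – the function name, the ratios, all of it is illustrative, not Twitter’s actual algorithm.

```python
# Both trend tests in one place, with invented thresholds. A term trends
# if it's unusually frequent *within this window* relative to its
# long-run baseline (test 1), but not arriving so fast that it matches
# the automated-spammer profile (test 2).
def should_trend(tag, window_count, baseline_rate, window_seconds,
                 uncommonness_ratio=10.0, max_rate_per_sec=50.0):
    window_rate = window_count / window_seconds
    # Test 1: uncommon relative to this term's normal background rate.
    is_uncommon = window_rate > baseline_rate * uncommonness_ratio
    # Test 2: paced like humans, not like automated systems.
    is_human_paced = window_rate < max_rate_per_sec
    return is_uncommon and is_human_paced

# A tag used at enormous volume sails through test 1 but can fail
# test 2 outright -- the failure mode I'm suggesting for #demo2010:
print(should_trend("#demo2010", window_count=90_000,
                   baseline_rate=0.01, window_seconds=600))  # False
```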

I would suggest that as a result of its sheer frequency in a short space of time, #demo2010 could have failed either or both of those tests.

There’s also a sub-question for Twitter: “where is it being posted?” Because their trends can also be tied to geographic regions.

When I looked at the UK vs. London data, I noted that there were protest-related tags that appeared as trending in the UK, but not in London. This just makes me more certain that within the London dataset they appeared too frequently, and wound up categorised as spam or common words, but within the UK dataset they appeared infrequently enough to still trend.
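A sketch of how that could happen, with figures I’ve invented out of whole cloth: each region is scored against its own traffic volume, so the same tag can read as a notable signal in the UK-wide dataset and as probable spam in the more saturated London one.

```python
# Illustrative only: a tag trends in a region if it's a notable -- but
# not overwhelming -- share of that region's traffic. Thresholds and
# figures are invented for the example.
def trends_in_region(tag_count, region_total,
                     trend_floor=0.001, spam_ceiling=0.05):
    share = tag_count / region_total
    return trend_floor < share < spam_ceiling

# Same protest tag, two datasets:
print(trends_in_region(40_000, 2_000_000))  # UK-wide: 2% -> trends
print(trends_in_region(35_000, 500_000))    # London: 7% -> filtered out
```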

Several people suggested to me that This Was Not Good Enough, that Twitter should, essentially, be smarter about how it picks its trending topics. To which I say: “OK, how would you do it?” Because I’m reasonably smart, and have made a decent career out of doing this sort of thing, and I can’t see a better solution that doesn’t come down to “put a hack in” (and before you say that they should have done that, I suggest you stop and think – one doesn’t ask professionals to do less than good work). And in the same way that I don’t see that Twitter has a business driver to kowtow to weird demands from the UK government, I don’t see that it has a business driver to change its practices in a bad way for what, to someone in California, might be characterised as an afternoon’s irrelevant shouting. (No, of course I don’t think that’s what it was, but then, I live here.)

The people who work at Twitter are, y’know, smart enough, don’t get me wrong. But some problems are just plain hard. If something walks like a duck and quacks like a duck, you file it in the box marked “duck”, until it does something that allows you to distinguish it from a duck. If you cannot suggest a smarter way of identifying a duck, then I don’t think you have any business complaining when someone else’s duck identification system fails. I don’t mean you need to be able to code it, or do the difficult maths; I mean you need to be able to suggest, in English, a more efficient solution to the duck problem that does not involve human intervention to artificially rig the output of the automated process, and that does not, in its solution, allow for other things to pretend to be ducks when they shouldn’t.

I should note that I’ve massively (over-)simplified the above. If you take a look at the comments on this discussion about the failure of Wikileaks to trend despite massive use in recent days/weeks, you’ll see there’s a Twitter employee explaining that they haven’t changed the algorithm; it’s just that the algorithm doesn’t work like you think, and looks at the whole of Twitter when determining popularity. It’s not enough that a term is used a lot – it has to be used a lot by a very diverse group of people. So if you and your friends are all using a certain hashtag, it’s not going to trend however much you all use it. It needs to also be in heavy use by people entirely outside your circle of connections and social demographic. So in the case of Wikileaks and #demo2010, well, yes, they’re being used a lot by middle-class lefties in the 15-35 age range, but unless they’re also being talked about by people outside of that lot, they’ll rank lower, in favour of the stuff that is being talked about by everyone.
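To illustrate the diversity point, here’s a sketch that approximates “diversity” as the share of distinct authors. That proxy is my own stand-in for whatever graph-based measure Twitter actually uses, and all the names and numbers are invented.

```python
# Raw volume isn't enough; the term has to come from many unconnected
# accounts. Distinct-author share is a crude stand-in for whatever
# Twitter really measures.
def diversity_score(author_ids):
    """Closer to 1.0 when many different people use the tag once each;
    closer to 0.0 when a small clique uses it over and over."""
    if not author_ids:
        return 0.0
    return len(set(author_ids)) / len(author_ids)

clique = ["alice", "bob", "carol"] * 100   # 300 posts, 3 people
crowd = [f"user{i}" for i in range(300)]   # 300 posts, 300 people
print(diversity_score(clique))  # 0.01 -- heavy use, tiny circle
print(diversity_score(crowd))   # 1.0  -- the kind of spread that trends
```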

Was it less than ideal that Twitter didn’t list the hashtag? Well, maybe; it’s a matter of perspective. Do I expect that they have noted that there was a problem there, and will attempt to work out a way around future incidents? Yes. Do I think the problem is solvable? I’m not sure. Do I think that cries of censorship were a smart, or proportionate, response? Not really, I’m afraid.

In case you’re wondering why I even care about this, it’s because I deal with something very similar on a regular basis from clients. “Explain to us why you’ve quoted X amount for this bit of work.” The answer always boils down to “because I know how to do my job, and that’s how long it will take, and honestly, trying to make you understand the answer will be somewhere between boring and futile”. It is one of the unique frustrations of programming that, essentially, everyone always assumes that because they can say in English what they want a computer to do, it must be easy to make it do it. You would not ask a civil servant why it will take five days to write a report. You would not ask a film producer why it will take ten days to edit a five-minute film, or an author why it takes them six months to produce a book.

And yet, as soon as it comes to matters of technology, everyone’s got an opinion as to why the programmer’s work isn’t done well enough, or fast enough, or cheaply enough.

I’m not claiming all programmers are godlike geniuses who should be worshipped – I’ve worked with too many to believe that. But you know the old saw: you are entitled to an informed opinion. If you don’t understand how real-time searching of large text indices works, well, I don’t blame you, but there is another old saw about not attributing to malice what can be explained by stupidity that applies here. Although, like I say: less stupidity, and more that this is a very hard problem.

I’m sure I’ve gone a bit fast in places here, so if anyone is still confused by this, do please feel free to ask questions. Like I say, I have no specific knowledge of Twitter, so take nothing I say as gospel, but I do at least know a bit about the sort of problems they’re solving, and why the solutions are harder than the layperson might think.
