Duck Identification Error

I had a conversation on Twitter the other week about why the hashtag #demo2010 might not be trending. More than one person was suggesting that Twitter were censoring it. I would be floored to learn that they were. That doesn’t mean that they weren’t, of course. I have no insider knowledge, and I’ve been wrong before, and will be wrong again. If it turns out I need to eat my words, I shall have them as a side to a nice steak, and move on.

But I attempted to explain why, and 140 characters was a bad length to do it in. So I thought I’d do it here.

So, in the first place, why don’t I think it was censorship? Well, as a US company, whose kit and advertisers are (mostly) in the US, Twitter has no obvious business driver to censor a tag relating to posts in the UK. It’s possible that the UK government or police were leaning on them, but I can’t see what the stick was. “We’ll stop you doing business in the UK!” How, exactly? Without provoking public outcry? It’s possible that the Police or UK Government offered them a really big cheque, I guess, but then why didn’t the Iranians do that during the protests there? Twitter doesn’t have an obvious track record of bowing to governmental pressure – in fact, they’ve gone out of their way in the past to let politically relevant things trend without interruption.

Not only that, but all the searches for the #demo2010 hashtag still worked. The real value in censoring it to the UK government or police would have been to stop it being used as an organising or reporting tool. And it definitely wasn’t stopped in that regard.

And then we get to the professional part. I know a little about the way this shit works. Not, y’know, a lot, but a little. In my day job, I do deal with the searching of large datasets, searching of textual data and analysis of user behaviour. I don’t do it in real time like Twitter tries to, but the problems I encounter would actually be worse in real time: your ability to notice aberrations in the data and correct for them vanishes when your data set is time-based with a rolling window, because by the time you’ve noticed an aberration, that dataset is no longer valid.

So, first thing, let me say right up front: #demo2010 was the most widely-used-in-a-short-space-of-time hashtag I can recall seeing. If you disagree with that assertion, then I don’t know what to say, other than that your experience is different to mine.

Principle #1: If you’re searching large volumes of text, you don’t literally comb through the text itself, looking for matching text. You build tools to index that text, and then you search the indices. And when building these indices, you exclude certain things. You don’t index words like “is” or “the”, or other common words. So far so back of the reference book.
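For anyone who'd rather see that than take my word for it, here is a minimal sketch of the idea in Python. It is a toy, not anything Twitter runs: the stop-word list, the function names and the sample documents are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical stop-word list; real systems use far longer ones.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into bare words."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:()")
        if word:
            words.append(word)
    return words


def build_index(documents):
    """Map each non-stop word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index


def search(index, word):
    """Look the word up in the index instead of scanning the text."""
    return sorted(index.get(word.lower(), set()))


docs = {
    1: "The march is in London today",
    2: "London is cold",
    3: "The duck is in the pond",
}
index = build_index(docs)
print(search(index, "london"))  # [1, 2]
print(search(index, "the"))     # [] because stop words are never indexed
```

The point to notice is the last line: a word that is too common never makes it into the index at all, so no amount of use will make it findable there.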

Principle #2: If you build a publicly accessible index, people will try and game it – Flickr had problems with its “Interestingness” index in the weeks after it launched, and yes, Twitter had problems with people spamming Trends in 2008/2009. Think of it like slightly more complex comment spam. And to deal with that, you identify patterns of behaviour that look like spammers, and you exclude them from your indices. Yes, sometimes you’re going to get false positives – patterns of legitimate behaviour that look like spammers. Just this morning, I found a comment on this blog that I genuinely couldn’t decide if it was from a human, or a spammer. So I assumed it was from a spammer, and killed it, because statistically, that’s more likely. (If it wasn’t, and that was you, I’m terribly sorry!)

Twitter, of course, is all about the real time data. Which means they’re going to consider the following things when deciding if something is a word that should be trending:

  1. Is it an uncommon word, within the time period of the data under consideration? (Because with a large enough dataset, all words become common, so the uncommonality has to be within a specific set.)
  2. Is it being posted at a rate that is consistent with normal human activity, rather than spammers? (Who tend to use automated systems.)

I would suggest that as a result of its sheer frequency in a short space of time, #demo2010 could have failed either or both of those tests.
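To make those two tests concrete, here is a rough sketch in Python. I have no idea what Twitter's real checks or thresholds look like, so both cut-off values and all the function names below are invented purely for illustration.

```python
# Both thresholds are made-up numbers, chosen only to illustrate the idea.
MAX_SHARE = 0.20             # above this share of the window, a term counts as "common"
MAX_POSTS_PER_MINUTE = 500   # above this rate, posting looks automated


def is_uncommon(term_count, window_total):
    """Test 1: the term must not dominate the window under consideration."""
    return term_count / window_total <= MAX_SHARE


def is_human_rate(term_count, window_minutes):
    """Test 2: the posting rate must be plausible for people rather than bots."""
    return term_count / window_minutes <= MAX_POSTS_PER_MINUTE


def may_trend(term_count, window_total, window_minutes):
    """A term is a trend candidate only if it passes both tests."""
    return (is_uncommon(term_count, window_total)
            and is_human_rate(term_count, window_minutes))


# A modestly busy tag: 1,200 of 100,000 posts in an hour.
print(may_trend(1200, 100000, 60))    # True
# A runaway tag: 45,000 of 100,000 posts in the same hour.
print(may_trend(45000, 100000, 60))   # False
```

The runaway tag fails both tests in this sketch without anyone having touched a censorship button. It is simply too frequent to look like an unusual word, and too fast to look like people.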

There’s also a sub-question for Twitter: “where is it being posted?” Because their trends can also be tied to geographic regions.

When I looked at the UK vs. London data, I noted that there were protest-related tags that appeared as trending in the UK, but not in London. This just makes me more certain that within the London dataset, they appeared too frequently, and wound up categorised as spam or common words, but within the UK, they appeared infrequently enough to still trend.
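Here is that regional effect as a small Python sketch. The post counts and the cut-off are numbers I have invented to show the shape of the problem; they are not real Twitter figures.

```python
MAX_SHARE = 0.20  # made-up cut-off: above this share, a term is treated as "common"

# Invented figures: posts using the tag, and all posts, per region in one window.
regions = {
    "London": {"tag": 4000, "total": 10000},
    "UK": {"tag": 6000, "total": 100000},
}


def share(tag_count, region_total):
    """Fraction of a region's posts in the window that use the tag."""
    return tag_count / region_total


def trending_regions(region_counts):
    """Regions where the tag is still rare enough to count as a trend candidate."""
    return sorted(
        name
        for name, counts in region_counts.items()
        if share(counts["tag"], counts["total"]) <= MAX_SHARE
    )


print(trending_regions(regions))  # ['UK']
```

The same tag is 40% of the London window and 6% of the UK one, so with this cut-off it is filtered out in London and left alone nationally.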

Several people suggested to me that This Was Not Good Enough, that Twitter should, essentially, be smarter about how it picks its trending topics. To which I say: “OK, how would you do it?” Because I’m reasonably smart, and have made a decent career out of doing this sort of thing, and I can’t see a better solution that doesn’t come down to “put a hack in” (and before you say that they should have done that, I suggest you stop and think – one doesn’t ask professionals to do less than good work). And in the same way that I don’t see that Twitter has a business driver to kowtow to weird demands from the UK government, I don’t see that it has a business driver to change its business practice in a bad way for what, to someone in California, might be characterised as an afternoon’s irrelevant shouting. (No, of course I don’t think that’s what it was, but then, I live here.)

The people who work at Twitter are, y’know, smart enough, don’t get me wrong. But some problems are just plain hard. If something walks like a duck and quacks like a duck, you file it in the box marked duck, until it does something that allows you to distinguish it from a duck. If you cannot suggest a smarter way of identifying a duck, then I don’t think you have any business complaining when someone else’s duck identification system fails. I don’t mean you need to be able to code it, or do the difficult maths, I mean you need to be able to suggest, in English, a more efficient solution to the duck problem that does not involve human intervention to artificially rig the output of the automated process, and that does not, in its solution, allow for other things to pretend to be ducks when they shouldn’t.

I should note that I’ve massively (over-)simplified the above. If you take a look at the comments on this discussion about the failure of Wikileaks to trend despite massive use in recent days/weeks, you’ll see there’s a Twitter employee explaining that they haven’t changed the algorithm, it’s just that the algorithm doesn’t work like you think, and looks at the whole of Twitter when determining popularity. It’s not enough that a term is used a lot – it has to be used a lot by a very diverse group of people. So if you and your friends are all using a certain hashtag, it’s not going to trend however much you all use it. It needs to also be in heavy use by people entirely outside your field of connections and social demographic as well. So in the case of Wikileaks and #demo2010, well, yes, they’re being used a lot by middle class lefties in the 15-35 age range, but unless they’re also being talked about by people outside of that lot, well, they’ll rank lower, in favour of the stuff that is being talked about by everyone.
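One way to picture "used a lot by a diverse group" is to weight raw volume by how evenly the posters are spread across different groups. The Python sketch below does that with normalised Shannon entropy. This is my own guess at the general shape of such a measure, not Twitter's algorithm; the "community" labels, the function names and the scoring formula are all assumptions.

```python
import math
from collections import Counter


def diversity(communities, n_communities):
    """Normalised Shannon entropy of who is posting.

    0.0 means every post came from one community; 1.0 means posts are spread
    evenly over all n_communities (which must be at least 2).
    """
    counts = Counter(communities)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = sum((n / total) * math.log(total / n) for n in counts.values())
    return entropy / math.log(n_communities)


def trend_score(communities, n_communities):
    """Hypothetical score: number of posts scaled by how diverse the posters are."""
    return len(communities) * diversity(communities, n_communities)


# 1,000 posts, all from one community of friends-of-friends.
niche = ["students"] * 1000
# 400 posts, spread evenly across four unrelated communities.
broad = ["students"] * 100 + ["sport"] * 100 + ["tech"] * 100 + ["music"] * 100

print(trend_score(niche, 4) < trend_score(broad, 4))  # True
```

In this toy version the niche tag scores zero despite having more than twice the volume, which is the behaviour the Twitter employee describes: heavy use inside one crowd is not enough.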

Was it less than ideal that Twitter didn’t list the hashtag? Well, maybe, it’s a matter of perspective. Do I expect that they have noted that there was a problem there, and will attempt to work out a way around future incidents? Yes. Do I think the problem is solvable? I’m not sure. Do I think that cries of censorship were a smart or proportionate response? Not really, I’m afraid.

In case you’re wondering why I even care about this, it’s because I deal with something very similar on a regular basis from clients. “Explain to us why you’ve quoted X amount for this bit of work.” The answer always boils down to “because I know how to do my job, and that’s how long it will take, and honestly, trying to make you understand the answer will be somewhere between boring and futile”. It is one of the unique frustrations of programming, that essentially, everyone always assumes that because they can say in English what they want a computer to do, it must be easy to make it do it. You would not ask a civil servant why it will take 5 days to write a report. You would not ask a film producer why it will take ten days to edit a 5 minute film, or an author why it takes them six months to produce a book.

And yet, as soon as it comes to matters of technology, everyone’s got an opinion as to why the programmer’s work isn’t done well enough or fast enough or cheap enough.

I’m not claiming all programmers are god-like geniuses who should be worshipped – I’ve worked with too many to believe that. But you know the old saw: you are entitled to an informed opinion. If you don’t understand how real-time searching of large text indices works, well, I don’t blame you, but there is another old saw about not attributing to malice what can be explained by stupidity that applies here. Although, like I say: less stupidity, and more that this is a very hard problem.

I’m sure I’ve gone a bit fast in places here, so if anyone is still confused by this, do please feel free to ask questions. Like I say, I have no specific knowledge of Twitter, so take nothing I say as gospel, but I do at least know a bit about the sort of problems they’re solving, and why the solutions are harder than the layperson might think.

Periodic Prompts: Let Go. What (or whom) did you let go of this year? Why?

Books. Yeah I know, heresy, but the simple fact is that I got an iPad and haven’t looked back. Now, don’t get me wrong, a beautiful hardcover edition is impossible to beat, but I have shelf after shelf of paperbacks that vary somewhere between trash and reference editions, and for the latter, a searchable digital version is better in any case, and as for the former, well, I like the books I like and I make no apologies, but they don’t half take up space. (As anyone who has ever seen my living space can attest, the last decade or so has basically been a race for my space to stay fractionally ahead of my book habit.)

So I have a new rule – I will only buy paper books if they are beautiful objects in and of themselves, or if I absolutely cannot get a digital edition, and absolutely have to have the book, and even then, I’ll be grumbling about it. So I haven’t altogether stopped buying books as yet, but I’m getting there. Give it another two or three years, and with any luck, I’ll be buying paper editions as often as I buy CDs, at the rate of one or two a year.

I’ve also done fairly well at acquiring ebook editions of a lot of print books I own. Not, er, strictly legit, I know, but in my defence they’re all books where I have, at one time or another, owned the dead tree edition. In any event, the net result is that I’ve been able to declutter my life to the tune of some 200 books, and I plan to keep going.

I’m hoping to do the same for graphic novels, but imagine that’ll take a while longer.

Advent Assignments: Pick one moment during which you felt most alive this year, describe it in detail.

Oh good, they are picking up a bit.

Heh. This would have been much easier to do last year, when I threw myself from a great height for a laugh.

But I racked my brains, and I came up with this. It’s a daft enough wee story, and I’m not sure it entirely qualifies, but I know exactly when I was most terrified this year. It involves a small dog. Told you it was stupid.

Miranda and I spent a week in Woolacombe in September. Picture postcard little village on the North Devon Coast which would, I imagine, be hellish during the school holidays. But we popped along for the week after that, when there were far fewer kids around, but the weather was still pretty good. A lovely, and much needed, break.

Woolacombe isn’t a haven of fine dining – it’s much more geared toward the families-with-kids, burgers-and-pizzas type places. But there’s one restaurant there, The Courtyard, that was absolutely superb. We were staying self-catering, but we’d decided that we’d have one night to get dressed up and go out for a romantic dinner, and this did not disappoint. Afterwards, a night-time stroll on the beach seemed like a good idea. Romance, and all that.

And so, pausing only briefly to nip back to the flat we were staying in for some warmer coats, we ambled toward the beach, hand in hand. We crossed the car park by the beach, and were heading down toward the little path that zig-zagged past the surf hire shop down to the sand. It was unlit at night – a little creepy perhaps, but there were two of us, so hardly terrifying, particularly not with the sound of the surf and so on. We were on a seaside holiday, after all. What could possibly happen?

And then I spotted a couple of figures in the shadows at the top of the path. (Actually, I smelled cigarette smoke before I spotted them – one of them had a lit tab in their hand.) I was about to suggest we head a couple of hundred yards in the other direction, down to the other path, when they broke apart, and it was apparent to me that they weren’t two local hoodies out to get the tourists – they were another young couple, and they’d been kissing. Aaah, romance. So we strolled on a bit further.

At which point, there was a tremendous and unexpected barking from the shadows, and a dark blur shot a short distance toward us. I distinctly recall giving a yelp, and levitating about three feet in the air.

Before you laugh, (and yes, it is funny, I know) I invite you to recall that I’m cynophobic. It’s not exactly rational, but still, unexpected barking followed by a dog coming at me from the shadows is quite literally the stuff of my nightmares.

Anyway, the initial shock passed, and with some trepidation we made our way past the young couple, who were thoughtfully restraining their hell-hound, and down onto the sands.

Except that by now, my system was flooded with adrenaline. A moonlight stroll on the beach, with silver light dancing on the surf, had sounded a good idea twenty minutes before. Now, though, my brain was full of fearful images. I kept thinking of those marvellously creepy shots of the sea from Ringu, with the voice over the top of them muttering “frolic in brine, goblins be thine”. Of the beach sequences in “Oh Whistle and I’ll Come To You My Lad” or “A Warning To the Curious”. I wasn’t taking a romantic walk by the seaside, I was in the opening sequence of a horror movie. The moonlight was eerie, not romantic. The sand wasn’t cool between my toes, it was freezing. Even once my heart-rate had slowed to near normal, said organ was still pounding much too loudly.

We gave up on the walk in the moonlight in pretty short order, and went back to the flat.

Links For Thursday 2nd December 2010

December Dailies: What do you do each day that doesn’t contribute to your writing – and can you eliminate it?

Sorry, but I can’t face calling them “Reverb” anything. I don’t imagine I’ll stick with “December Dailies”, either, but it’ll do for today. Mind you, on the strength of this question, I may not be sticking with doing them, as it has an unpleasant smell of “no I are a proper writter for serios!” about it, and amateurs/fanwriters doing that sets my teeth on edge. Anyway, on with the show.

There are two obvious answers here: the first one being “lots”. I work a day job, I spend time with friends and loved ones, I eat, sleep, shower and occasionally shave. I read, I blog, I take photos. None of these things contribute to my writing. This is the tedious literalist’s answer.

But of course every last one of those things contributes to my writing, which is the other obvious answer, because at some time or another every experience can inform writing. This is the pretentious wanker’s answer.

The truth, of course, is that the only things that I am certain don’t contribute to my writing are those times when I think “Shall I sit down and write? Nah, it’s been a long day, I’ll play computer games/watch TV instead.” It happens less often than it used to (although I did just get savagely hooked on Renaissance Batman Assassin’s Creed: Brotherhood, so that’s productivity shot for a little while longer), but I do still do it.

Yes, I could eliminate this. But I haven’t yet. Maybe in a few years. And in any case, I don’t do it each day. Things I do each day (or close enough to qualify, anyway), well, I listed them above. Short of “ditch the day job”, I don’t see a lot there that it’d even be useful to ditch.

I know this hasn’t been a very satisfying answer, but it wasn’t a very satisfying prompt (said the shoddy workman). Keep your fingers crossed for some better springboards.

See you tomorrow!

Links For Wednesday 1st December 2010

  • The stories of early space exploration from the original NASA transcripts. Now open to the public in a searchable, linkable format. Well, this will eat chunks of my time…
  • There's a short film competition happening here. Prize is a grand, and, well, it's a bit thin on the ground for entries. I'm just sayin' that if anyone fancied making a short film, they'd stand a decent chance of winning cash.

Reverb 10: One Word

Dreadful title “Reverb”, but I enjoyed something similar I did last year, so here we go with a month of blog posts in December. As before, I reserve the right to ignore or replace any prompts I think are just plain daft. Prompt one challenges me to sum up the last year in one word, explain that choice, and then pick another word for next year.

2010: Inspiration

I’ve a number of friends, old and new, who have directly or indirectly inspired me this year, but none more so than Miranda. I’ll spare you all the gushing stuff I could put here – for all I know, there’ll be a later prompt I can use it for, and I’ll nauseate you all then. For now, I will simply and sincerely say that, by virtue of her own drive and passion she pushes me to do better, for which she has my thanks and more besides.

And one of the ways I’m doing better is that for the first time in years, I have a fiction-writing project I’m excited about – I’m inspired to write. While I’ve mentioned it to a few people, I’m trying not to talk about it too much, and not at all on-line (beyond the odd bit of twitter-based venting which doesn’t count). I’m mostly avoiding talking about it because every time I’ve done that in the past, I’ve dropped the ball, lost interest, or in some other way, failed to bring the thing to fruition. I really don’t want that to happen here, because I love this idea out of all measure, so this is all you’ll hear about it on this blog for now – I have an idea I’m excited about, and I hope it goes well. Shocking stuff, I know, but it’s actually the first time I’ve felt like this in a good few years now. So I’m pleased, and that’ll have to do for now.

And so I need to pick a word for my hopes for next year, and I chose “perspiration”, after Edison’s famous quote. Actually, I don’t think my idea is genius-level, but I’d also quite like to get back into regular exercise next year, so it seems like an apt one to pick, when talking about a year I hope will be filled with productive work, with something nearing completion by the end of it.

See you tomorrow.

Links For Tuesday 23rd November 2010

  • Web API for extracting clutter from web pages and just returning the content. Nice!
  • Americans! Have you ever wondered why everyone hates you? It's because you elect people like this, and then apparently give them a chance of being hugely influential in policy areas where they can fuck up the planet for people who didn't get a say in electing them, on the basis of some bullshit religious beliefs, that, in a civilised country, would disqualify them as a candidate for dog catcher. Seriously, America, please get on with reforming your political system and society to get rid of people like this. By force, if necessary.

Links For Monday 22nd November 2010

  • I have elderly family who have to wear bags exactly like this. The thought that anyone could consider it acceptable to humiliate someone in a manner like this makes me furious – I just keep imagining what it would feel like if it happened to my family. I reckon I would expect them to have legal recourse, and the assurance that someone had lost their job over this, because I don't care about security half so much as I care about basic human dignity and respect.
  • Stop what you are doing, and go and look at this link. I promise you: it will make your day 100% better. This is amazing and wonderful stuff.