Over the last few days, I’ve been reading up on author obfuscation. By “author obfuscation” I mean tools and techniques that help preserve an author’s anonymity when posting a blog entry or writing a document. You might think that withholding your name or writing under a pseudonym is sufficient, but I don’t think this will stand the test of time. Specifically, if you write a blog under a pseudonym, you are creating a large corpus of text, all of which is being archived, and ten years from now smart algorithms may be able to correlate those postings with other work of yours and so identify you as the author of the blog.
The science of unveiling authors from their writing style is called stylometry. While it has been around for a while, the lack of appropriate machinery made it slow going until recently. Over the next ten years or so I’d expect a significant boost to its efficiency. Stylometry today relies on the grammatical mistakes people make and on recognizing phrases and idioms that authors use repeatedly; in the future it might even involve natural language processing rather than just statistics. For example, I tend to put adverbs in funny places, and my writing style certainly boldly goes beyond proper English grammar, not just with split infinitives. Given that what you write today will be around forever, stylometry might, ten years from now, pose a serious threat to the anonymity of what you are writing today.
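To make the statistical idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not a method taken from any paper: it counts how often a short, made-up list of function words appears in a text and compares two such frequency profiles by cosine similarity. The names `FUNCTION_WORDS`, `profile`, `cosine`, and `most_similar` are hypothetical, and real stylometric work would use far richer features and much larger samples.

```python
import math
import re
from collections import Counter

# Assumed, illustrative list of function words; a real study would use many more.
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "but", "not", "i"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def most_similar(unknown_text, candidates):
    """Return the candidate name whose known text has the closest profile.

    `candidates` maps an author name to a sample of that author's known writing.
    """
    target = profile(unknown_text)
    return max(candidates, key=lambda name: cosine(target, profile(candidates[name])))
```

Calling `most_similar(anonymous_post, {"alice": alice_essays, "bob": bob_essays})` returns whichever name has the closer profile. With texts as short as a single post the result means very little; the worry is what such comparisons can do with years of archived writing.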
I found basically only one paper on author obfuscation (local copy); at present, the state of the art is quite limited for both stylometry and author obfuscation. People need well-defined corpora of works to work with, many techniques require human intervention, and there seems to be little machine processing yet, so neither is applied on a grand scale. We can expect all of this to change. At some point, a search engine might let you search for all the documents by one particular author, more likely by analyzing content than by tracking explicitly declared authorship.
This is not my line of research, but in that future world of ours I would like to see a service that obfuscates my (quasi-genetic) fingerprint: the weird phrases and grammatical twists and turns that make my writing recognizable. It should be a basic feature of any blogging and writing software. The point is to prevent purely machine-based recognition of authorship. People might still be able to track down unidentified authors through content and special knowledge. However, that requires human intervention and cannot be automated by a search engine.