Preventing duplicate posts in forum or blog commenting software

Mistakes happen

I was reading http://scripting.com/ the other day and one article had three identical comments from the same person.

A mistake, obviously.

Such mistakes can be easily prevented with a bit of programming.

I know it’s easy because I’ve implemented it in a few lines of Python in my blog engine and in my forum software.

Duplication detection, the easy way

The idea is simple: before inserting an article/post/comment into a database, check if an entry with exactly the same content already exists.

In most storage systems (like a SQL database), doing this against a large text column is difficult and slow. To make it easy and fast, we can calculate an MD5 or SHA1 hash of the text, store it as part of the data describing the post, and check for a duplicate hash.
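Here is a minimal sketch of that idea, using Python's standard hashlib module. The names content_hash, seen_hashes and try_add_post are made up for illustration, and the in-memory set only stands in for wherever the hashes are really stored.

```python
import hashlib

def content_hash(text: str) -> str:
    """Return the SHA1 hex digest of the post text (40 hex characters)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

# Hypothetical stand-in for the stored hashes of all existing posts.
seen_hashes = set()

def try_add_post(text: str) -> bool:
    """Accept the post unless identical content was already accepted."""
    h = content_hash(text)
    if h in seen_hashes:
        return False  # same content submitted before: reject as duplicate
    seen_hashes.add(h)
    return True
```

Calling try_add_post twice with the same text accepts the first submission and rejects the second.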

SHA1 hashes are short: 20 bytes, or 40 characters when written as hex. They are a perfect match for key-value stores. With a proper index they’re also fast in a SQL database.
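For the SQL case, one possible sketch uses SQLite (via Python's built-in sqlite3 module) as a stand-in for whatever database you run. The table name comments, the column body_sha1 and the function insert_comment are assumptions for this example, not the schema of any real blog engine. A unique index on the hash column makes the lookup fast and lets the database itself refuse the duplicate.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.execute(
    "CREATE TABLE comments ("
    " id INTEGER PRIMARY KEY,"
    " body TEXT NOT NULL,"
    " body_sha1 TEXT NOT NULL)"
)
# Unique index on the hash: fast lookups, and duplicate inserts fail.
conn.execute("CREATE UNIQUE INDEX idx_comments_sha1 ON comments (body_sha1)")

def insert_comment(body: str) -> bool:
    """Insert the comment; return False if identical text is already stored."""
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO comments (body, body_sha1) VALUES (?, ?)",
                (body, digest),
            )
    except sqlite3.IntegrityError:
        return False  # unique index rejected a duplicate hash
    return True
```

Letting the index reject the row avoids a separate "does it exist?" query and the race between that query and the insert.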

For added robustness you can trim whitespace from the beginning and end of text before calculating the hash.
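A small sketch of the trimming step, again assuming SHA1 via hashlib; normalized_hash is a hypothetical helper name. Python's str.strip() removes leading and trailing whitespace, so two submissions that differ only in that respect hash to the same value.

```python
import hashlib

def normalized_hash(text: str) -> str:
    """Hash the text after trimming leading and trailing whitespace."""
    trimmed = text.strip()
    return hashlib.sha1(trimmed.encode("utf-8")).hexdigest()

a = normalized_hash("Great post!")
b = normalized_hash("  Great post!\n")
print(a == b)  # True: only the surrounding whitespace differed
```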

This method doesn’t stop malicious people: changing a single character is enough to change the hash. It does, however, fix the common problem of people submitting the same content twice due to network problems.
