Mathematics: Using Binary Classification in Email Filtering

Have you ever wondered how applications manage the huge amount of junk mail that is sent every day? The most common method is binary classification, which helps separate the junk mail from the real emails. The system decides what to do with each email based on a single yes-or-no classification: if an email is junk, it should be either deleted or placed in a junk folder; if it is not junk, it should be delivered to the recipient. This is difficult to implement, though, because in practice the decision rests on a confidence level of whether the item is junk or not.
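
To make the idea concrete, here is a minimal sketch of how a confidence level might be turned into a binary decision. The function name, the 0-to-1 score and the 0.5 threshold are illustrative assumptions, not taken from any particular mail filter.

```python
def classify(junk_score: float, threshold: float = 0.5) -> str:
    """Turn a junk score between 0 and 1 into a binary decision.

    Both the score scale and the default threshold are assumptions
    made for this illustration.
    """
    if not 0.0 <= junk_score <= 1.0:
        raise ValueError("junk_score must be between 0 and 1")
    # Everything at or above the threshold is treated as junk.
    return "junk" if junk_score >= threshold else "deliver"


print(classify(0.92))  # junk
print(classify(0.10))  # deliver
```

Wherever the threshold is set, emails that score close to it are the ones most likely to be misclassified.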

This is where the problems start, because a binary classification has no real notion of confidence: an email either is junk or it isn't. If the system decides that an email is junk and it isn't, this is called a False Positive. If the application decides that an email is not junk and it is, the mistake is known as a False Negative.
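
The two kinds of mistake can be counted by comparing the filter's decisions with the true labels. The sketch below assumes each decision is stored as a boolean where True means junk; the function name and the sample data are made up for illustration.

```python
def count_outcomes(predicted, actual):
    """Count correct and incorrect junk decisions (True means junk)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {
        "true_positive": 0,   # flagged as junk, really junk
        "false_positive": 0,  # flagged as junk, really a genuine email
        "true_negative": 0,   # delivered, really a genuine email
        "false_negative": 0,  # delivered, really junk
    }
    for flagged, is_junk in zip(predicted, actual):
        if flagged and is_junk:
            counts["true_positive"] += 1
        elif flagged and not is_junk:
            counts["false_positive"] += 1
        elif not flagged and is_junk:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical decisions for five emails.
predicted = [True, True, False, False, True]
actual = [True, False, False, True, True]
print(count_outcomes(predicted, actual))
```

For a mail filter, a False Positive is usually the more costly error, since a genuine email never reaches its recipient.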

There is another problem with a simple binary classification method: whether an email is junk is very often a subjective decision. One person might consider the hundreds of loan offers arriving in his inbox the very epitome of junk email, while someone else may be looking for one of these services. It could be said that there are certain rules which could define a junk or spam email, but an application should treat the emails simply as data. A similar issue arises when classifying patient data in the NHS.

There can be no place for subjective decisions in our binary system: rules must be defined and ambiguity removed. Most systems build up these rules slowly, often using some user interaction. For example, emails can be marked as junk initially and users allowed to confirm these decisions, so that a set of rules is built up to create an absolute definition. This is essential in order to reliably identify each component and to know where the email originates, whether that is the USA, Russia or Australia, for example.
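
As a rough illustration of building rules from user confirmations, the sketch below treats a word as a junk indicator once it has appeared in enough user-confirmed junk emails and in no confirmed genuine ones. The class name, the word-level matching and the min_reports value are all assumptions for this example; real filters use far more sophisticated statistics.

```python
from collections import Counter


def tokenize(text):
    """Split text into a set of lowercase words (deliberately naive)."""
    return set(text.lower().split())


class FeedbackRules:
    """Hypothetical rule set learned from user-confirmed decisions."""

    def __init__(self, min_reports=2):
        self.min_reports = min_reports
        self.junk_counts = Counter()   # word -> number of confirmed junk emails
        self.legit_counts = Counter()  # word -> number of confirmed genuine emails

    def record(self, text, is_junk):
        """Store one user confirmation."""
        target = self.junk_counts if is_junk else self.legit_counts
        target.update(tokenize(text))

    def junk_words(self):
        """Words seen often in junk and never in genuine email."""
        return {
            word
            for word, count in self.junk_counts.items()
            if count >= self.min_reports and self.legit_counts[word] == 0
        }

    def is_junk(self, text):
        """Binary decision: junk if the text contains any junk word."""
        return bool(tokenize(text) & self.junk_words())


rules = FeedbackRules()
rules.record("cheap loan offer today", is_junk=True)
rules.record("instant loan approval", is_junk=True)
rules.record("meeting agenda for today", is_junk=False)

print(rules.junk_words())                     # {'loan'}
print(rules.is_junk("apply for a loan now"))  # True
print(rules.is_junk("agenda for the week"))   # False
```

Keeping a separate rule set for each user is one way the same binary mechanism could reflect different people's idea of junk.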

Such rules can also cater for exceptions. For example, some people use encryption programs like PGP to encrypt very important emails. Or they may disguise the apparent origin of a message by using a UK VPN, which can be almost impossible to detect. These emails are of course a long way from junk status; however, to an application, an encrypted email looks like junk and is completely unreadable. Without the key and a facility to decrypt the email, anything like this would get swallowed up by a binary classification system.
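
One possible exception rule is sketched below: it looks for the ASCII armor line that marks an inline PGP-encrypted message and lets such emails through regardless of their score. The function names, the score and the threshold are assumptions for illustration, and a real filter would need more than this, since a spammer could paste the same marker into a message.

```python
# Marker line that opens an ASCII-armored PGP message.
PGP_ARMOR_HEADER = "-----BEGIN PGP MESSAGE-----"


def looks_pgp_encrypted(body: str) -> bool:
    """Return True if any line of the body is the PGP armor header."""
    return any(line.strip() == PGP_ARMOR_HEADER for line in body.splitlines())


def classify_with_exception(body: str, junk_score: float, threshold: float = 0.5) -> str:
    """Binary decision with one exception rule applied first.

    junk_score and threshold are hypothetical inputs from some
    upstream scoring step.
    """
    if looks_pgp_encrypted(body):
        # Unreadable to the filter, but not junk for that reason alone.
        return "deliver"
    return "junk" if junk_score >= threshold else "deliver"


encrypted = "-----BEGIN PGP MESSAGE-----\n\nhQEMA5example\n-----END PGP MESSAGE-----"
print(classify_with_exception(encrypted, junk_score=0.9))        # deliver
print(classify_with_exception("Cheap loans!", junk_score=0.9))   # junk
```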

This also allows the system to operate the binary classification but based on an individual's subjective preferences. Other systems have other methods of reducing misclassifications, such as a temporary holding area from which emails can be retrieved and reclassified with user intervention.
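
A temporary holding area of this kind might look something like the sketch below. The class and method names are invented for the example; the point is only that a held email stays recoverable until a user makes the final call.

```python
class Quarantine:
    """Hypothetical holding area for emails awaiting a user decision."""

    def __init__(self):
        self._held = {}    # message id -> email text
        self.inbox = []    # emails the user confirmed as genuine
        self.deleted = []  # emails the user confirmed as junk

    def hold(self, message_id, text):
        """Place a suspicious email in the holding area."""
        self._held[message_id] = text

    def held_ids(self):
        """List the ids of emails still awaiting a decision."""
        return sorted(self._held)

    def reclassify(self, message_id, is_junk):
        """Apply the user's decision and remove the email from holding."""
        text = self._held.pop(message_id)  # raises KeyError for unknown ids
        if is_junk:
            self.deleted.append(text)
            return "junk"
        self.inbox.append(text)
        return "deliver"


quarantine = Quarantine()
quarantine.hold("msg-1", "Your loan application has been approved")
quarantine.hold("msg-2", "Win a free holiday now")

print(quarantine.held_ids())                            # ['msg-1', 'msg-2']
print(quarantine.reclassify("msg-1", is_junk=False))    # deliver
print(quarantine.reclassify("msg-2", is_junk=True))     # junk
print(quarantine.held_ids())                            # []
```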