Mathematics: Using Binary Classification in Email Filtering

Have you ever wondered how applications manage the huge amount of junk mail that is sent every day? The most common method is binary classification, which separates the junk mail from the real emails. This means the system decides what to do with each individual email based on a simple two-way classification: if an email is junk it should be either deleted or placed in a junk folder, and if it is not junk it should be delivered to the recipient. It’s difficult to implement, though, because it relies on a confidence level of whether the item is junk or not.

This is where the problems start, because a binary classification has no real idea of confidence: it’s either junk or it isn’t. If the system decides that an email is junk and it isn’t, this is called a False Positive. If the application decides that an email is not junk and it is, the mistake is known as a False Negative.
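
To make the two error types concrete (a false positive is a real email treated as junk, a false negative is junk that gets delivered), here is a minimal Python sketch. It assumes some earlier step has already given each email a junk score between 0 and 1. The scores, the 0.5 threshold and the function names are all made up for illustration and are not taken from any particular filter.

```python
from collections import Counter


def classify(score, threshold=0.5):
    """Turn a junk score in [0, 1] into a binary decision (True = junk)."""
    return score >= threshold


def error_counts(scored_emails, threshold=0.5):
    """Count outcomes for (score, actually_junk) pairs."""
    counts = Counter()
    for score, actually_junk in scored_emails:
        predicted_junk = classify(score, threshold)
        if predicted_junk and not actually_junk:
            counts["false_positive"] += 1  # real email treated as junk
        elif not predicted_junk and actually_junk:
            counts["false_negative"] += 1  # junk delivered to the inbox
        else:
            counts["correct"] += 1
    return counts


# Hypothetical scores paired with what each email really was.
sample = [(0.95, True), (0.70, False), (0.40, True), (0.10, False)]
print(error_counts(sample))
print(error_counts(sample, threshold=0.8))
```

Raising the threshold trades one kind of mistake for the other: fewer real emails are lost, but more junk gets through.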

There is another problem with using a simple binary classification method, in that whether an email is junk is very often a subjective decision. One person might consider the hundreds of loan offers arriving in his inbox the very epitome of junk email, whereas someone else may be looking for one of these services. It could be said that there are certain rules which could define a junk or spam email, but an application should consider the emails as simply data. It is a similar issue to classifying patient data in the NHS.

There can be no place for the subjective decision in our binary system: rules must be defined and ambiguity removed. Most systems slowly build up these rules, often using some user interaction. For example, emails can be marked as junk initially and users allowed to confirm these decisions, so that a set of rules can be built up to create an absolute definition. This is essential in order to reliably identify each component and to know where the email is originating from, whether it’s the USA, Russia or Australia for example.
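
As a rough sketch of how rules might be built up from user confirmations, the Python below keeps two sets of senders and lets a confirmed verdict override the system's initial guess. Keying the rules on the sender address is my own simplifying assumption (real filters look at far more than that), and the class and method names are hypothetical.

```python
class FeedbackRules:
    """Sender rules built up from user confirmations (illustrative only)."""

    def __init__(self):
        self.junk_senders = set()
        self.trusted_senders = set()

    def confirm(self, sender, is_junk):
        """Record a user's verdict; the latest verdict for a sender wins."""
        sender = sender.lower()
        if is_junk:
            self.junk_senders.add(sender)
            self.trusted_senders.discard(sender)
        else:
            self.trusted_senders.add(sender)
            self.junk_senders.discard(sender)

    def is_junk(self, sender, initial_guess):
        """Use a confirmed rule if one exists, otherwise keep the initial guess."""
        sender = sender.lower()
        if sender in self.junk_senders:
            return True
        if sender in self.trusted_senders:
            return False
        return initial_guess


rules = FeedbackRules()
# The system initially guesses "junk"; the user disagrees and says so.
rules.confirm("offers@loans.example", is_junk=False)
print(rules.is_junk("offers@loans.example", initial_guess=True))
```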

This can cater for exceptions to the specific rules. For example, some people use encryption programs like PGP to encrypt very important emails. Or they may mask where a message originates by using a UK VPN, which can be almost impossible to detect. These are, of course, a long way from junk status; however, to an application the email will look like junk and be completely unreadable. Without the key and a facility to decrypt the email, anything like this would get swallowed up by a binary classification system.

This also allows the system to operate the binary classification but based on an individual’s subjective preferences. Other systems have other methods of reducing misclassifications, such as a temporary holding area from which emails can be retrieved and reclassified with user intervention.
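
The two ideas above, individual preferences and a temporary holding area, can be sketched together in Python. Here each user has their own pair of cut-offs, and anything that falls between them is held for review rather than delivered or junked. The field names, the default values and the example score are assumptions of mine, purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserPreferences:
    # Hypothetical per-user cut-offs on a junk score in [0, 1].
    deliver_below: float = 0.3  # scores under this go straight to the inbox
    junk_above: float = 0.8     # scores at or over this are treated as junk


def route(score, prefs):
    """Return 'inbox', 'junk' or 'quarantine' for one email."""
    if score >= prefs.junk_above:
        return "junk"
    if score < prefs.deliver_below:
        return "inbox"
    return "quarantine"  # held until the user reclassifies it


# The same loan offer, scored 0.7, handled for two different people.
wants_no_offers = UserPreferences(deliver_below=0.2, junk_above=0.6)
shopping_for_a_loan = UserPreferences(deliver_below=0.5, junk_above=0.9)
print(route(0.7, wants_no_offers))
print(route(0.7, shopping_for_a_loan))
```

The final decision is still binary; the holding area just postpones it until the user has had their say.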

Using the iPad to Protect PID in the NHS

It’s taken me a while but I’m slowly starting to appreciate what an incredible little tool my iPad is. It’s been a little neglected for the last few months, but my interest was rekindled whilst doing some work at a local hospital. The consultants all had iPads which were linked into a patient information system. This completely replaced the central paper register, some rather dated-looking handheld palmtops and, in reality, a lot of phone calls to check details.

The devices were all networked together and each had a printer assigned so they could get paper copies of appointment times and details. What was especially impressive was the way the devices handled printers and printouts. In a hospital environment it is very important to keep a tight rein on patient information; in fact there is a lot of Government legislation regarding this. It is referred to as PID (Patient Identifiable Data) and covers anything that could contain personal information which can be linked to a specific individual.

It’s quite simple to keep track of this information when it’s stored on a central application managed by the NHS or the hospital involved; however, as soon as it is printed out or stored on a USB device this becomes much, much harder. One of the ways this particular hospital had dealt with this issue was by making use of the built-in location functionality in the iPad.

They developed an application which would track the location of the consultant and find the nearest available printer. If the printer was in a secure area (these are designated in most NHS hospitals), a print containing patient information would be allowed, but it would be recorded so as to discourage non-essential printouts. If the printer was not secure, the printout would not be allowed and a message would be sent informing the consultant of the nearest secure printer.
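
I never saw the hospital's code, so the following is only my own Python sketch of the decision described above. The flat x/y coordinates, the printer names, the message wording and the rule that prints without patient data are allowed anywhere are all assumptions on my part.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Printer:
    name: str
    x: float
    y: float
    secure: bool  # True if the printer sits in a designated secure area


def nearest(printers, position):
    """Pick the printer closest to an (x, y) position."""
    return min(printers, key=lambda p: math.dist((p.x, p.y), position))


def request_print(printers, position, contains_pid, audit_log):
    """Decide whether a print job may go ahead; return a message for the user."""
    target = nearest(printers, position)
    if not contains_pid:
        return f"Printing on {target.name}"  # assumption: no PID, no restriction
    if target.secure:
        audit_log.append(target.name)  # PID prints are recorded
        return f"Printing on {target.name}"
    secure = [p for p in printers if p.secure]
    if not secure:
        return "Print refused: no secure printer available"
    return f"Print refused: nearest secure printer is {nearest(secure, position).name}"


# Hypothetical floor plan with one unsecured and one secure printer.
printers = [Printer("Ward 3", 0, 0, False), Printer("Records Office", 10, 0, True)]
log = []
print(request_print(printers, (1, 0), contains_pid=True, audit_log=log))
print(request_print(printers, (9, 0), contains_pid=True, audit_log=log))
```

Standing next to the ward printer, the consultant is pointed to the secure one instead; standing next to the secure printer, the job goes through and is logged.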

This has reduced the number of sensitive printouts, and the consultants were happy as it required no specific input from them and no additional training or approval for required printouts. Instead the consultant would simply move to the location of one of the secured printers. In reality many didn’t bother, and so the volume of unsecured paper printouts had fallen drastically without any issues. The consultants were also allowed to use their iPads for leisure and personal purposes because, unlike PCs and laptops, no information was stored on them. The doctor I was chatting to watched the news using the BBC iPlayer iPad application, even when he was abroad.

There’s no doubt that this little device and the hundreds of tablets and mobile devices it has spawned are rapidly changing the way people do their jobs. It is great to see that they are also making the work environment a little more secure at the same time, as well as helping people watch things like BBC iPlayer in their lunchtimes!