September 01

Bayesian Filtering VS Keyword Filtering

I recently tested the Bayesian filtering in Visendo Mail Checker Server for three weeks. The test had to run for at least three weeks because Bayesian filters need time to learn. The results were great. I will tell you a little about Bayesian filtering: what it means, how it’s done and why it is better than regular keyword filtering.

Bayesian filtering is based on the principle that most events are dependent, so the probability of an event occurring in the future can be inferred from the history of occurrences of that event: history tends to repeat itself (Bayesian estimation has been studied extensively; see, for example, http://www-ccrma.stanford.edu/~jos/bayes/Bayesian_Parameter_Estimation.htm).

Bayesian spam filtering has become a popular mechanism for distinguishing spam from legitimate email (sometimes called "ham"). Before email can be filtered using this method, the administrator needs to build a database of words, tokens, phrases, IP addresses, domains and so on, collected from a sample of spam and a sample of ham. The database can be seeded from a predefined set or built up over time through self-learning.
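To make the database idea concrete, here is a minimal Python sketch of how such a token database might be built from sample mail. It is only an illustration, not how Visendo or Exchange actually store their data: the function names `tokenize` and `build_database`, the word-only tokenizer and the sample messages are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return the set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def build_database(spam_mails, ham_mails):
    """Count, per token, how many spam and how many ham messages contain it."""
    spam_counts = Counter()
    ham_counts = Counter()
    for mail in spam_mails:
        spam_counts.update(tokenize(mail))
    for mail in ham_mails:
        ham_counts.update(tokenize(mail))
    return {
        "spam_counts": spam_counts,
        "ham_counts": ham_counts,
        "spam_total": len(spam_mails),
        "ham_total": len(ham_mails),
    }

# Hypothetical training samples, invented for this sketch.
spam_sample = ["Cheap antivirus offer", "Free antivirus download now"]
ham_sample = ["Meeting moved to Monday", "Please update the antivirus on the server"]

db = build_database(spam_sample, ham_sample)
print(db["spam_counts"]["antivirus"], db["ham_counts"]["antivirus"])
```

A self-learning filter would keep calling something like `build_database` (or updating the counters) as new mail is confirmed as spam or ham, which is why the database needs time to become reliable.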

The basic Bayesian filter in Exchange works from a predefined database. Its learning process is, however, slow and unreliable. It is always better to be patient with your Bayesian filter and wait until it has built a strong and reliable database of keywords.

Why is this better than regular keyword filtering? Because it is based on probabilities. It calculates the probability that a certain word (from the email subject or body), received from a certain email account, indicates SPAM or HAM. For instance, the word “mortgage” in an email received from a financial institution is unlikely to indicate SPAM, whereas the word “engine” in an email received from a kindergarten is likely to be classified as SPAM.

Example of calculating these probabilities: if the word "antivirus" occurs in 100 of 1,000 spam mails and in 8 of 100 legitimate emails, then its spam probability would be about 55.6% (that is, [100/1000] divided by [8/100 + 100/1000], or 0.10 / 0.18). A score that weighs evidence from both spam and legitimate mail in this way is more reliable than classical keyword filtering, which blocks on the mere presence of a word.
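The "antivirus" calculation can be sketched in a few lines of Python. Evaluating the bracketed formula with the quoted counts gives 0.10 / 0.18, or about 55.6%. The function name `spam_probability` and the neutral 0.5 fallback for words never seen before are assumptions made for this sketch, not part of any particular product.

```python
def spam_probability(spam_hits, spam_total, ham_hits, ham_total):
    """Per-word spam probability: spam frequency over the sum of both frequencies."""
    spam_freq = spam_hits / spam_total
    ham_freq = ham_hits / ham_total
    if spam_freq + ham_freq == 0:
        # Assumption: a word never seen in either sample is treated as neutral.
        return 0.5
    return spam_freq / (spam_freq + ham_freq)

# "antivirus" appears in 100 of 1,000 spam mails and in 8 of 100 legitimate mails.
p = spam_probability(100, 1000, 8, 100)
print(f"{p:.1%}")  # prints 55.6%
```

A keyword filter would simply block every mail containing "antivirus"; the score above also reflects how often the word turns up in legitimate mail.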

Read more about this in this whitepaper and find out what the five main advantages of Bayesian filtering over keyword filtering are. What other Bayesian anti-spam tools have you tried recently, and what were your observations?
