Open access peer-reviewed chapter

# Spam Recognition using Linear Regression and Radial Basis Function Neural Network

By Tich Phuoc Tran, Min Li, Dat Tran and Dam Duong Ton

Published: October 1st 2009

DOI: 10.5772/7529

## Abstract

Spamming is the abuse of electronic messaging systems to send unsolicited bulk messages. It is becoming a serious problem for organizations and individual email users due to the growing popularity and low cost of electronic mails. Unlike other web threats such as hacking and Internet worms which directly damage our information assets, spam could harm the computer networks in an indirect way ranging from network problems like increased server load, decreased network performance and viruses to personnel issues like lost employee time, phishing scams, and offensive content. Though a large amount of research has been conducted in this area to prevent spamming from undermining the usability of email, the performance of currently existing filtering methods still suffers from extensive computation (with large volumes of emails received) and unreliable predictive capability (due to the highly dynamic nature of emails). In this chapter, we discuss the challenging problems of Spam Recognition and then propose an anti-spam filtering framework in which appropriate dimension reduction schemes and powerful classification models are employed. In particular, Principal Component Analysis transforms data to a lower dimensional space which is subsequently used to train an Artificial Neural Network based classifier. A cost-sensitive empirical analysis with a publicly available email corpus, namely Ling-Spam, suggests that our spam recognition framework outperforms other state-of-the-art learning methods in terms of spam detection capability. In the case of extremely high misclassification cost, while the performance of other methods deteriorates significantly as the cost factor increases, our model still maintains stable accuracy with low computation cost.

### Keywords

• Spam Recognition
• Radial Basis Function
• Linear Regression

## 1. Introduction

Email is widely accepted by the business community as a low cost communication tool to exchange information between business entities which are physically distant from one another. It minimizes the cost of organizing an in-person meeting. A recent SurePayroll survey (SurePayroll, 2007) reports that over 80% of small business owners believe email is a key to the success of their business, and most people today spend between 20% and 50% of their working time using email, including reading, sorting and writing emails. Due to the very low cost of sending email, one could send thousands of malicious email messages each day over an inexpensive Internet connection. These junk emails, referred to as spam, can severely reduce staff productivity, consume significant network bandwidth and lead to service outages. In many cases, such messages also cause exposure to viruses, spyware and inappropriate content that can create legal/compliance issues, loss of personal information and corporate assets. Therefore, it is important to accurately estimate costs associated with spam and evaluate the effectiveness of countermeasures such as spam-filtering tools. Though such spam prevention capability is implemented in existing email clients, there are some barriers that discourage users from utilizing this feature, including error-prone and labor-intensive maintenance of filtering rules. Many researchers have developed different automatic spam detection systems, but most of them suffer from low accuracy and high false alarm rates due to the huge volume of emails, the wide spectrum of spamming topics and the rapidly changing content of these messages, especially in the case of high misclassification cost (Bayler, 2008). To deal with such challenges, this chapter proposes an anti-spam filtering framework using a highly performing Artificial Neural Network (ANN) based classifier.
ANN is widely considered a flexible “model-free” or “data-driven” learning method that can fit training data very well and thus reduce learning bias (how well the model fits the available sample data). However, ANNs are also susceptible to the overfitting problem, which can increase generalization variance, i.e. make the predictive model unstable on unseen instances. This limitation can be overcome by combining the ANN with a simple Linear Regression algorithm, which makes the resulting classification model a stable semi-parametric classifier. Such model combination aims at stabilizing non-linear learning techniques while retaining their data fitting capability. Empirical analysis with the Ling-Spam benchmark confirms our superior spam detection accuracy and low computation cost in comparison with other existing approaches.

This chapter is organized as follows. Firstly, an overview of the spam problem is presented with associated negative impacts and protection techniques. This is followed by the application of Machine Learning (ML) to spam recognition, related works and details of the Ling-Spam corpus. Next, a brief review of several commonly used classification models and our proposed framework is given. The subsequent section compares the performance of our method with other learning techniques using the benchmark corpus under different cost scenarios. Finally, we provide some concluding remarks for this chapter and future research directions.

## 2. Spam Recognition and Machine Learning techniques

### 2.1 Spam Recognition as a Challenging Task

#### 4.1.3 Phase 3: Email Classification

The data, after being processed by the Feature Selection module, is input to train the Classification Model. The resulting model is then used to label emails as either “legit” or “spam”, indicating whether a message is classified as legitimate or spam. To implement the Classification Model, we combine a simple linear regression model with a nonlinear compensator. In particular, an MPNN is adopted as the nonlinear compensator, which models only the higher-order complexities, while the linear model dominates for data far away from the training clusters. The combination is described by the following equation.

$$\hat{y}(\bar{x}) = L(\bar{x}) + NL(\bar{x}) = w_0 + W\bar{x} + \frac{\sum_i \alpha_i \, y_i \, f_i(\bar{x}, \bar{c}_i, \delta)}{\sum_i f_i(\bar{x}, \bar{c}_i, \delta)}$$

Where

- $L(\bar{x})$ = Linear Regression Model
- $NL(\bar{x})$ = Nonlinear Residual Compensator (MPNN)
- $w_0$ = initial offset
- $w_i$ = weights of the linear model
- $\alpha_i = y_{N_i} / d_i$ = compensation factor
- $y_{N_i} = y_i - (w_0 + W\bar{x}_i)$ = difference between the linear approximation and the training output
- $d_i = \lVert \bar{x} - \bar{c}_i \rVert$ = distance from the input vector to cluster $i$ in the input space

The combination of the linear regression model and the MPNN is referred to as the Linear Regression - Modified Probabilistic Neural Network (LR-MPNN). The piecewise linear regression model is first approximated using all available training data in a simple regression fitting analysis. The MPNN is then constructed to compensate for higher-order characteristics of the problem. Depending on different portions of the training set and how far the test data is from the training data, the impact of the nonlinear residual function is adjusted such that the overall Mean Square Error is minimized. This adjustment is formulated by the compensation factor $\alpha_i$. In particular, $\alpha_i$ is computed based on how well the linear model performs on the training examples and on the distances from a test vector to the clusters of training data. Firstly, the goodness of the linear model $L(\bar{x})$ on a particular training example is measured by $y_{N_i}$, defined as the difference between the linear approximation and the actual output of the training example. A very small value of $y_{N_i}$ means that $L(\bar{x})$ fits the data well in this case, so it is given higher priority and the impact of the nonlinear model $NL(\bar{x})$ is minimized. In contrast, a large value of $y_{N_i}$ indicates that $NL(\bar{x})$ should compensate more for the degraded accuracy of $L(\bar{x})$.

Secondly, to determine how far a given test vector is from the available training data, a distance $d_i$ from that vector to each training cluster $\bar{c}_i$ is calculated. For any data far from the training set, i.e. when $d_i$ is large, the value of $\alpha_i$ is minimized. As a result, $NL(\bar{x})$ has minimal residual effect and $L(\bar{x})$ dominates. This is because $L(\bar{x})$ generalizes more stably than $NL(\bar{x})$ on new instances.
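As an illustration, the LR-MPNN combination above can be sketched in Python. This is a simplified reading, not the authors' implementation: every training point acts as its own cluster centre $\bar{c}_i$, a Gaussian kernel plays the role of $f_i$, and the distance damping expressed by $\alpha_i$ is folded into the kernel, so that far from the training data the kernel weights vanish and the linear part dominates, exactly as described in the text.

```python
import numpy as np

def fit_lr_mpnn(X, y, delta=1.0):
    """Fit the linear part by least squares and store the residuals.
    Simplification: each training point is its own cluster centre c_i."""
    A = np.hstack([np.ones((len(X), 1)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)   # w = [w0, W]
    residuals = y - A @ w                       # y_Ni: what the MPNN must explain
    return {"w": w, "centres": X, "residuals": residuals, "delta": delta}

def predict_lr_mpnn(model, x):
    w, C = model["w"], model["centres"]
    r, delta = model["residuals"], model["delta"]
    linear = w[0] + x @ w[1:]                   # L(x) = w0 + W.x
    d = np.linalg.norm(C - x, axis=1)           # d_i = ||x - c_i||
    f = np.exp(-(d / delta) ** 2)               # Gaussian kernel f_i(x, c_i, delta)
    # Kernel-weighted residual average: f -> 0 far from the data, so L(x) dominates
    nonlinear = (r * f).sum() / (f.sum() + 1e-12)
    return linear + nonlinear
```

On purely linear data the compensator is inert (all residuals are near zero), while on data with a nonlinear component it corrects the linear fit near the training clusters only.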

#### 4.1.4 Phase 4: Evaluation

To evaluate the overall performance of the framework, the Cost-sensitive Evaluation module computes several performance metrics and also takes into consideration different cost scenarios.

### 4.2 Performance Evaluation

#### 4.2.1 Performance Measures

To measure the performance of different learning algorithms, the following measures are used:

$$\text{Spam Recall (SR)} = \frac{n_{SS}}{n_{SS} + n_{SL}}$$

$$\text{Spam Precision (SP)} = \frac{n_{SS}}{n_{SS} + n_{LS}}$$

$$\text{False Alarm Rate (FAR)} = \frac{n_{SL}}{N_S}$$

$$\text{Miss Rate (MR)} = \frac{n_{LS}}{N_L}$$

From the above equations, Spam Recall (SR) is the percentage of spam messages $N_S = n_{SS} + n_{SL}$ that are correctly classified ($n_{SS}$), while Spam Precision (SP) compares the number of correct spam classifications $n_{SS}$ to the total number of messages classified (correctly or incorrectly) as spam ($n_{SS} + n_{LS}$). The Miss Rate (MR) grows with the number of misclassified legitimate emails, while the False Alarm Rate (FAR) grows with the number of misclassified spam emails (those passing through the filter). Therefore, both FAR and MR should be as small as possible for a filter to be effective (0 for a perfect filter).
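These four measures follow directly from the confusion counts; a small helper sketch (the function and argument names are ours, not from the chapter):

```python
def spam_metrics(n_ss, n_sl, n_ls, n_ll):
    """Compute SR, SP, FAR and MR from confusion counts, where n_XY is the
    number of class-X messages labelled Y (S = spam, L = legitimate)."""
    N_S = n_ss + n_sl                    # total spam messages
    N_L = n_ll + n_ls                    # total legitimate messages
    return {
        "SR":  n_ss / N_S,               # spam recall
        "SP":  n_ss / (n_ss + n_ls),     # spam precision
        "FAR": n_sl / N_S,               # spam slipping past the filter
        "MR":  n_ls / N_L,               # legitimate mail wrongly blocked
    }

# e.g. 90 of 100 spam caught, 5 of 100 legitimate wrongly blocked
print(spam_metrics(n_ss=90, n_sl=10, n_ls=5, n_ll=95))
```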

#### 4.2.2 Cost-Sensitive Analysis

a) Cost Scenarios

Depending on what action a spam filter takes in response to a detected spam message, there are three major misclassification cost scenarios. The no-cost case is when the filter merely flags a detected spam message. This notification of spam does not risk losing any legitimate mail to misclassification error (no misclassification cost), but it still takes time for human users to check and delete the spam messages manually. To minimize user effort in eliminating spam, the filter can automatically detect and remove suspicious messages. However, the total cost of misclassification in this case can be extremely high due to the seriousness of falsely discarding legitimate mail. This is the high-cost scenario. Besides the above approaches, the filter may neither flag nor completely eliminate detected spam messages. Instead, it might return the message to the sender. This approach, referred to as moderate-cost, combats spamming by increasing its cost via Human Interactive Proofs (HIP) (L. F. Cranor & LaMacchia, 1998a). That is, the sender is required to give a proof of humanity by solving a puzzle before the message is delivered. The puzzles could be, for example, images containing text that is difficult for pattern recognition software to analyze automatically. Alternatively, simple questions (e.g. “what is one plus one”) can be used instead of graphical puzzles.

In this chapter, spam recognition experiments are conducted in a cost-sensitive manner. As emphasized previously, misclassifying a legitimate message as spam is generally more severe than mistakenly recognizing a spam message as legitimate. Let $LS$ (legitimate classified as spam) and $SL$ (spam classified as legitimate) denote the two types of error, respectively. We invoke a decision-theoretic notion of cost and assume that $LS$ is $\lambda$ times more costly than $SL$. A mail is classified as spam if the following criterion is met (Androutsopoulos et al., 2000):

$$\frac{\operatorname{Prob}(C = \text{spam} \mid X = \bar{x})}{\operatorname{Prob}(C = \text{legitimate} \mid X = \bar{x})} > \lambda$$

In the case of anti-spam filtering, the two classes are mutually exclusive, so:

$$\operatorname{Prob}(C = \text{spam} \mid X = \bar{x}) = 1 - \operatorname{Prob}(C = \text{legitimate} \mid X = \bar{x})$$

The above criterion then becomes:

$$\operatorname{Prob}(C = \text{spam} \mid X = \bar{x}) > t, \quad \text{with } t = \frac{\lambda}{1 + \lambda}, \; \lambda = \frac{t}{1 - t}$$

Depending on which cost scenarios are considered, the value of λ is adjusted accordingly.

• No-cost scenario (e.g. flagging spam messages): λ = 1

• Moderate-cost scenario (e.g. semi-automatic filter which notifies senders about blocked messages): λ = 9

• High-cost scenario (e.g. automatically removing blocked messages): λ = 999
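The criterion and the three λ settings translate directly into a decision rule; a minimal sketch (function names are ours):

```python
def spam_threshold(lam):
    """Decision threshold t = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)

def classify(p_spam, lam):
    """Label a message as spam only if P(spam|x) clears the cost-adjusted threshold."""
    return "spam" if p_spam > spam_threshold(lam) else "legit"

# The three cost scenarios from the text:
for lam in (1, 9, 999):
    print(lam, spam_threshold(lam))   # thresholds 0.5, 0.9, 0.999
```

Note how the same posterior probability can yield different labels: a message with $P(\text{spam}\mid x) = 0.95$ is blocked under the no-cost and moderate-cost scenarios but delivered under the high-cost one.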

b) Total Cost Ratio

Accuracy and error rates assign equal weights to the two error types ($LS$ and $SL$) and are defined as:

$$Acc = \frac{n_{LL} + n_{SS}}{N_L + N_S} \qquad Err = \frac{n_{LS} + n_{SL}}{N_L + N_S}$$

However, in the cost-sensitive contexts, the accuracy and error rates should be made sensitive to the cost difference, i.e. each legitimate message is counted for λ times. That is, when a legitimate message is misclassified, this counts as λ errors; and when it passes the filter, this counts as λ successes. This leads to the definition of weighted accuracy and weighted error (WAcc and WErr):

$$WAcc = \frac{\lambda \cdot n_{LL} + n_{SS}}{\lambda \cdot N_L + N_S} \qquad WErr = \frac{\lambda \cdot n_{LS} + n_{SL}}{\lambda \cdot N_L + N_S}$$

The values of these performance measures (weighted or not) tend to be misleadingly high in isolation. To get a true picture of the performance of a spam filter, its scores should be compared against those of a “baseline” approach where no filter is used. Such a baseline never blocks legitimate messages, while spam emails always pass through. The weighted accuracy and error rates for the baseline are:

$$WAcc_b = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S} \qquad WErr_b = \frac{N_S}{\lambda \cdot N_L + N_S}$$

Total cost ratio (TCR) is another measure, which compares the performance of a spam filter to that of the baseline.

$$TCR = \frac{WErr_b}{WErr} = \frac{N_S}{\lambda \cdot n_{LS} + n_{SL}}$$

Greater TCR values indicate better performance; for TCR < 1, the baseline is better. If cost is proportional to wasted time, TCR intuitively measures how much time is wasted manually deleting all spam messages when no filter is used ($N_S$), compared to the time wasted manually deleting any spam messages that pass the filter ($n_{SL}$) plus the time needed to recover from mistakenly blocked legitimate messages ($\lambda \cdot n_{LS}$).
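WAcc, WErr and TCR can all be computed together from the confusion counts; a small helper sketch (names are ours, not from the chapter):

```python
def weighted_scores(n_ll, n_ss, n_ls, n_sl, lam):
    """Weighted accuracy/error and Total Cost Ratio: each legitimate
    message counts lam times (n_XY = class-X messages labelled Y)."""
    N_L, N_S = n_ll + n_ls, n_ss + n_sl
    denom = lam * N_L + N_S
    wacc = (lam * n_ll + n_ss) / denom
    werr = (lam * n_ls + n_sl) / denom
    # TCR = WErr_b / WErr; a perfect filter (no errors at all) would make it infinite
    tcr = N_S / (lam * n_ls + n_sl)
    return wacc, werr, tcr
```

For example, a filter that blocks no legitimate mail ($n_{LS} = 0$) but lets 10 of 100 spam messages through yields TCR = 10 regardless of λ, since only the (unweighted) $n_{SL}$ term remains in the denominator.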

## 5. Experiment Results

### 5.1 Experiment Design

The proposed spam recognition framework is tested on the Ling-Spam corpus against other existing learning methods, including Naïve Bayes (NB), Weighted Memory Based Learning (WMBL), Boosted Trees (BT), Support Vector Machine (SVM) and a Neural Network model (Multilayer Perceptron, MLP). Unlike other text categorization tasks, filtering spam messages is cost sensitive (Cohen, 1996), hence evaluation measures that account for misclassification costs are used. In particular, we define a cost factor λ with different values corresponding to the three cost scenarios: first, no cost considered (λ = 1), e.g. marking messages as spam; second, semi-automatic filtering (λ = 9), e.g. issuing a notification about spam; and third, fully automatic filtering (λ = 999), e.g. discarding the spam messages.

The rate at which spam messages pass the filter undetected is given by the False Alarm Rate (FAR), which should be low for a filter to be useful. Spam Recall (SR) measures the effectiveness of the filter, i.e. the percentage of messages correctly classified as spam, while Spam Precision (SP) indicates the filter's safety, i.e. the degree to which the blocked messages are truly spam. Because FAR can be derived from SR (FAR = 1 − SR), we use SR, SP, and Total Cost Ratio (TCR) for evaluation. Besides comparing how accurately the filters perform, their computational cost is also measured as the computation time required by each classifier. In particular, the total computation time is the sum of the time a classifier needs to perform cross validation, test on the data, and calculate the relevant performance metrics (e.g. misclassification rate, accuracy, etc.).

Stratified tenfold cross validation is employed for all experiments. That is, the corpus is partitioned into 10 stratified parts and each experiment was repeated 10 times, each time reserving a different part as the testing set and using the remaining 9 parts as the training set. Performance scores are then averaged over the 10 iterations.
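The stratified split described above can be sketched as follows (a hypothetical helper, not the authors' code): the indices of each class are shuffled and dealt round-robin into 10 folds, so every fold preserves the overall spam/legitimate ratio.

```python
import numpy as np

def stratified_folds(y, k=10, seed=0):
    """Deal the indices of each class round-robin into k folds so that
    every fold keeps (roughly) the overall class ratio."""
    rng = np.random.RandomState(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    return [np.array(sorted(f)) for f in folds]

# Each experiment then reserves one fold for testing and trains on the rest:
# for t in range(10):
#     test_idx  = folds[t]
#     train_idx = np.concatenate([folds[j] for j in range(10) if j != t])
```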

In addition to the studies conducted by other researchers on the same Ling-Spam corpus (NB (Androutsopoulos et al., 2000), WMBL (Sakkis et al., 2003), SVM (Hsu et al., 2003), BT (Carreras & Marquez, 2001)), we also reproduced their experiments (based on the average TCR over the three cost scenarios) to confirm and determine the parameter values that give the best performance for each learning method. The optimal attribute sizes of these methods can be found in Figure 2. An MLP with 15 neurons in its hidden layer is deployed using the Matlab Neural Network toolbox.

### 5.2 Experiment Result

#### 5.2.1 TCR and Attribute Selection

From Figure 2, for λ = 1 and λ = 9, most filters demonstrate stable performance, with TCR consistently greater than 1. These filters differ from one another in their sensitivity to attribute selection and in the number of attributes that gives maximum TCR. Our LR-MPNN is found to be moderately sensitive to attribute selection, and it obtains the highest TCR for λ = 1 with 300 attributes selected. When λ = 9, LR-MPNN achieves a TCR very competitive with SVM but with fewer attributes (200), and hence incurs less computational overhead.

For λ = 999, all classifiers have their TCR reduced significantly due to the very high misclassification cost. The difference between low and high values of the misclassification cost λ lies in the increased performance of the baseline filter as λ grows. That is, without a filter in use (the baseline), all legitimate mails are retained, so the baseline never misclassifies legitimate mail as spam. Therefore, a large λ benefits the baseline and makes it hard for other filters to beat. Recall that TCR measures how much a filter improves on the baseline case; as a result, TCR generally decreases as λ increases. Another important observation is that the performance of most classifiers, except for BT and LR-MPNN, falls below the base case (TCR < 1) for some numbers of selected attributes; BT and LR-MPNN avoid this due to their relative insensitivity to attribute selection. In this case, LR-MPNN is the best performing filter with the highest TCR.

#### 5.2.2 Spam Precision and Spam Recall

In this experiment, the classifiers are run iteratively through a tenfold cross-validation process. The SP and SR rates are recorded in Table 1. We observe that, for the no-cost scenario (λ = 1), our method, LR-MPNN, has the best SP, while its SR (0.958) is very close to the highest SR of NB (0.959). For λ = 9, LR-MPNN obtains the highest SR (0.869) and the second highest SP (0.991), after the BT algorithm. Finally, in the case of extremely high misclassification cost (λ = 999), LR-MPNN significantly outperforms the other methods, achieving the highest values on all evaluation metrics.

| Method | SR (λ=1) | SP (λ=1) | SR (λ=9) | SP (λ=9) | SR (λ=999) | SP (λ=999) |
|--------|----------|----------|----------|----------|------------|------------|
| NB | 0.959 | 0.973 | 0.861 | 0.975 | 0.790 | 0.984 |
| WMBL | 0.860 | 0.917 | 0.790 | 0.982 | 0.601 | 0.857 |
| SVM | 0.954 | 0.981 | 0.847 | 0.983 | 0.671 | 0.995 |
| BT | 0.957 | 0.980 | 0.864 | 0.993 | 0.768 | 0.996 |
| MLP | 0.852 | 0.975 | 0.782 | 0.977 | 0.623 | 0.979 |
| LR-MPNN | 0.958 | 0.986 | 0.869 | 0.991 | 0.793 | 0.998 |

### Table 1.

Precision/Recall evaluation on Ling-Spam data

#### 5.2.3 Computational Efficiency

Apart from comparing precision, recall and TCR scores between classifiers, we also measure their computational efficiency. Table 2 shows that WMBL had the minimum computation time (2.5 mins), followed by NB, LR-MPNN, SVM, MLP and BT, respectively. LR-MPNN achieves comparable spam precision and recall with a shorter computation time (3.5 mins) than BT (11.5 mins) and SVM (5 mins). Moreover, considering TCR scores, the models that require less time than LR-MPNN (WMBL, NB) do not perform as accurately.

| Method | Computation Time (mins) | TCR (λ=1) | TCR (λ=9) | TCR (λ=999) |
|--------|-------------------------|-----------|-----------|-------------|
| NB | 3 | 10.80 | 7.80 | 5.02 |
| WMBL | 2.5 | 7.11 | 5.62 | 1.33 |
| SVM | 5 | 20.47 | 9.11 | 3.42 |
| BT | 11.5 | 21.18 | 7.35 | 4.91 |
| MLP | 7 | 12.20 | 4.50 | 0.25 |
| LR-MPNN | 3.5 | 24.17 | 8.99 | 5.25 |

### Table 2.

Computation Time, Memory size evaluation on Ling-Spam data

In summary, the most important finding of our experiments is that the proposed LR-MPNN model achieves very accurate classification (high TCR, SP and SR) compared to other conventional learning methods. The superior performance of LR-MPNN is observed most clearly for λ = 999, though it also obtains the highest TCR and very competitive SP and SR rates for the other values of λ. Our algorithm also requires relatively little computation time to reach comparable or even higher predictive accuracy than the other methods.

## 6. Conclusions and Future Work

In this chapter, we proposed a novel anti-spam filtering framework in which appropriate dimension reduction schemes and powerful classification models are employed. In particular, Principal Component Analysis transforms the data to a lower dimensional space. At the classification stage, we combine a simple linear regression model with a lightweight nonlinear neural network in an adjustable way. This learning method, referred to as Linear Regression - Modified Probabilistic Neural Network (LR-MPNN), takes advantage of the virtues of both: the linear model provides reliable generalization capability, while the nonlinear component compensates for higher-order complexities of the data. A cost-sensitive evaluation on a publicly available corpus, Ling-Spam, has shown that our LR-MPNN classifier compares favorably to other state-of-the-art methods, with superior accuracy, affordable computation and high robustness. Especially for extremely high misclassification cost, while the performance of other methods deteriorates as λ increases, LR-MPNN demonstrates a clearly superior outcome while retaining low computation cost. LR-MPNN also has low computational requirements, i.e. its training time is shorter than that of other algorithms of similar accuracy and cost. Though our proposed model achieves good results in the conducted experiments, it is not necessarily the best solution for all problems. Nevertheless, its comparatively high predictive accuracy and low computational complexity distinguish it from other state-of-the-art learning algorithms and make it particularly suitable for cost-sensitive spam detection applications.


## How to cite and reference

### Cite this chapter

Tich Phuoc Tran, Min Li, Dat Tran and Dam Duong Ton (October 1st 2009). Spam Recognition using Linear Regression and Radial Basis Function Neural Network, Pattern Recognition, Peng-Yeng Yin, IntechOpen, DOI: 10.5772/7529. Available from:
