Open access peer-reviewed chapter - ONLINE FIRST

An Emerging Solution for Detection of Phishing Attacks

By Prasanta Kumar Sahoo

Submitted: September 24th 2020
Reviewed: January 21st 2021
Published: March 3rd 2021

DOI: 10.5772/intechopen.96134


Abstract

In the computer age, as more and more people use the internet for their day-to-day work, hackers carry out a growing range of security attacks on web browsers and servers to steal users’ vital data. Electronic mail (E-mail) is now used by everyone, including organizations and agencies, and has become an official channel of day-to-day communication for society as a whole. Although many modern techniques, tools and prevention methods have been developed to secure users’ vital information, users remain prone to attacks by fraudsters. Phishing is one such attack, and detecting it with high accuracy is a prominent research issue in cyber security. Phishers fraudulently acquire confidential information such as user IDs, passwords, and Visa and MasterCard details through various social engineering methods. Blacklist-based methods are the most widely used for detecting phishing attacks, but they have a limitation: they cannot detect white-listed phishing. This chapter uses machine learning algorithms to classify E-mails as phishing or genuine and so helps the user detect attacks. The proposed architectural model identifies phishing and uses the J48 decision tree classifier to separate fake E-mails from real ones. The algorithm presented here passes through several stages to identify a phishing attack and helps users protect their vital information.

Keywords

  • security attacks
  • phishing
  • fake E-mail
  • data mining

1. Introduction

Phishing is one of the methods fraudsters use to steal a user’s most secret information. It is a serious security problem in cyberspace today and leads to financial losses for individuals and for society at large. It is an unlawful act in which fraudsters obtain users’ personal and secret information by deceiving them through various social engineering methods. It is becoming one of the major types of fraud, in which the phisher tricks users into revealing private information such as user IDs, passwords, PINs and Visa card details. Most phishing attacks are carried out by E-mail. Very often a phishing message contains a uniform resource locator (URL) that redirects the user to an alternate website. The redirected site is a heavily modified imitation, and when users click through to it they are asked to enter personal information, which is then transferred to the phishing assailant [1, 2]. It is an offense in which a phisher sends fake E-mails that appear to be genuine and to come from a trustworthy organization, and that instruct recipients to enter personal information such as online banking username, password, mobile number, residential address, credit card details and so forth [3, 4, 5]. Phishers use many methodologies to deceive users and steal their personal credentials, most often spoofed E-mails and forged websites. Web spoofing is one type of attack in which the phisher uses artificial or forged sites to cheat users and steal their personal information. The phishing E-mail appears real, and so does the website designed to collect the user’s information.
Fake messages spread mostly through E-mail, short message service (SMS), instant messengers, social networking sites, Voice over Internet Protocol (VoIP) and so forth, but E-mail is the most popular channel, and 65% of phishing attacks result from a click on a hyperlink attached to an E-mail [6]. Spear phishing is one of the methods phishers use to dupe organizations and individuals in Business E-mail Compromise (BEC). Highly sophisticated spear phishing attacks [7, 8, 9] target selected groups or individuals in an organization. Phishing is very similar to fishing in a pond or river, but instead of trying to catch a fish, the phisher tries to capture the user’s most vital information [10, 11]. A user generally authenticates by filling in a login ID and password. From a security point of view the password should be strong, to protect it from attackers. Many anti-phishing tools have been developed to provide stronger security, including the use of an image as input in the login process and the hashing of passwords [12]. Phishers design their websites specifically to look legitimate, which makes it very difficult for users to detect a fake website by its appearance.

1.1 Related work

Tan et al. [13] suggested an anti-phishing method that extracts the body and meta tags for a given URL. The uniform resource locator (URL) is broken down into tokens, and the resulting keywords are then checked against the Yahoo search engine. The given domain name is compared with the original domain name and with the country code top-level domain to check for a match. If the country code top-level domain matches that of the website, the site is considered real; otherwise it is considered fake. Yan et al. [14] studied Chinese phishing on e-commerce sites. They used the sequential minimal optimization algorithm with URL and web features for phishing detection, and a genetic algorithm to optimize the features. The data mining tool Waikato Environment for Knowledge Analysis (WEKA) was used to train the proposed model. Li et al. [15] suggested using machine learning to detect phishing web pages. They used the document object model to optimize the features, with emphasis on web images extracted from the page. The optimized features are passed to a transductive support vector machine to differentiate fake websites from real ones.
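The URL tokenization step mentioned above can be illustrated with a small sketch. This is not the implementation from [13]; the function names `tokenize_url` and `top_level_label` and the example URL are assumptions made here for illustration, and the sketch only shows how a URL might be split into keyword tokens and how the last host label could be read for a top-level-domain comparison.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split host, path and query into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def top_level_label(url):
    """Return the last dot-separated host label, or '' for IP-style or dotless hosts."""
    host = urlsplit(url).hostname or ""
    if "." not in host:
        return ""
    label = host.rsplit(".", 1)[-1].lower()
    # A purely numeric last label means the host is an IP address, not a domain name.
    return "" if label.isdigit() else label


# Hypothetical example URL (example.co.uk is used purely for illustration).
url = "http://secure-login.example.co.uk/account/verify?id=42"
tokens = tokenize_url(url)
tld = top_level_label(url)
print(tokens, tld)
```

The tokens could then be used as search keywords, and the returned label compared with the country code expected for the genuine site.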

1.2 Existing work on phishing

Gemini is a well-known tool that protects the user against phishing during the authentication process. Several anti-phishing techniques are available today that keep users from falling prey to fake websites by providing a strongly secured authentication process. Some reputable sites display a security indicator to tell the user that the site is not fake. The presence of such a URL indicator enables users to identify the site as real [16, 17]. In the absence of such security indicators, users refrain from entering their passwords [13]. One example is Sitekey [18], which Bank of America uses for internet banking [19]. The user chooses an image as input to the login process, and when the user tries to log in, the system validates the image. If the input image is wrong, the user is stopped from logging in and authentication fails. Dhamija et al. [20], in the paper titled “dynamic security skins”, proposed using a personal identity for authentication by the remote server, which the user can verify. Under this method a website is considered fake if its identity cannot be proved. Parno et al. [21] presented an anti-phishing technique that uses hardware devices such as smartphones and smart tokens for authentication. Even if a user unknowingly logs into a phishing site, this trusted authentication procedure keeps vital information from leaking to the phisher. Unlike these existing techniques, Gemini does not require the support of other devices. Anti-phishing techniques such as Antiphish [22] and Webwallet [23] are already available to identify the actual intent of the user’s browsing activity, which helps keep the user from falling prey to phishing attacks.
This research work takes the user name as input to initiate the anti-phishing technique that helps users protect their vital information, whereas other techniques rely more on passwords. Yue et al. [24] proposed a technique, free from any kind of deceit, that protects user credentials from being leaked to fraudsters by hiding the actual content from fake sites. This makes it very hard for fraudsters to retrieve the user’s secret information before the user identifies the site as fake. Password management techniques such as PwdHash [19], Password Multiplier [15] and passpet [24] offer password hashing to strengthen passwords. Because passwords are rehashed and website names are changed randomly, a phisher cannot make use of stolen information. Birk et al. [12] suggested a different mechanism that tracks the identity of the attacker: they proposed using fingerprint credentials to trace information stolen by fake sites.
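The idea behind password hashing tools can be sketched as follows. This is not the actual algorithm of PwdHash, Password Multiplier or passpet; it is a minimal illustration, assuming a hypothetical helper `site_password` and made-up domain names, of why a password derived from the site's domain is of little use to a phisher operating under a different domain.

```python
import base64
import hashlib


def site_password(master_password, domain, length=16, iterations=100_000):
    """Derive a per-site password from a master password and the site's domain.

    Illustrative only: the domain acts as the salt, so the same master
    password yields unrelated outputs on different domains.
    """
    key = hashlib.pbkdf2_hmac(
        "sha256",
        master_password.encode("utf-8"),
        domain.lower().encode("utf-8"),
        iterations,
    )
    return base64.urlsafe_b64encode(key).decode("ascii")[:length]


# Hypothetical domains: the genuine site and a look-alike phishing site.
real = site_password("correct horse", "bank.example")
fake = site_password("correct horse", "bank-example.test")
print(real != fake)
```

Because the value typed into the look-alike site is derived from the look-alike domain, it differs from the credential the genuine site expects.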


2. Case studies in phishing

2.1 Case study: website phishing experiment

In this study a website was designed as an exact replica of www.ahlionline.com, the original Jordan Ahli Bank website. The objective was to mislead users, through a targeted phishing E-mail attack, into giving away their vital information. After obtaining authorization from management, we intentionally built many well-known phishing features into the website to gauge users’ awareness of this kind of risk. We used an Internet Protocol (IP) address instead of a domain name, http instead of https, poor design, spelling errors, no secure sockets layer (SSL) or padlock icon, and a phony security certificate. We targeted 120 employees with a well-planned phishing E-mail informing them that their e-banking accounts were at high risk of attack and asking them to log into their accounts immediately, through the fake link attached to our E-mail, to verify their balance. We deceived 52 of the 120 employees in our organization (about 43% of the sample), who followed the deceptive instructions and gave away their actual credentials. Surprisingly, 8 of these victims (7% of the sample) were employees of the IT department or IT auditors. The other 44 victims (37% of the sample) came from other departments and gave away their information without much hesitation. Another 28 employees (23% of the sample) were cautious and supplied wrong information, and 40 employees (33% of the sample) chose not to respond at all after receiving the E-mail. The results show that phishing is extremely dangerous to society as a whole, since almost half of the targeted employees were victimized, among them well-educated and technically trained staff from the IT department and IT auditors. Increasing awareness of this risk among all users of e-banking facilities is therefore highly recommended [25].
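The shares quoted in this case study can be rechecked from the raw counts. The sketch below is only a worked-arithmetic aid; the dictionary keys are illustrative labels chosen here, not terms from the study.

```python
# Response counts reported in the website phishing case study (120 targeted employees).
TOTAL_TARGETED = 120
counts = {
    "victims_it_and_auditors": 8,
    "victims_other_departments": 44,
    "gave_wrong_information": 28,
    "did_not_respond": 40,
}


def share(count, total=TOTAL_TARGETED):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


victims = counts["victims_it_and_auditors"] + counts["victims_other_departments"]
shares = {label: share(n) for label, n in counts.items()}
shares["all_victims"] = share(victims)

# Every targeted employee should fall into exactly one of the four groups.
assert sum(counts.values()) == TOTAL_TARGETED

for label, value in shares.items():
    print(f"{label}: {value}%")
```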

2.2 Case study: phone phishing experiment

A group of around 50 employees in an organization was contacted by female colleagues who, through friendly conversation, tried to lure them into giving away personal e-banking account details such as user ID and password. The results were surprising, as many of the employees fell for the trick. After talking with them in a friendly way for some time, the assigned team was able to persuade them to give away their e-banking details for false reasons. The pretexts used in the conversations included checking account integrity, checking privileges and accessibility, and resolving connectivity issues with the web server for maintenance purposes. The team deceived 16 of the 50 employees (32% of the sample) into giving away their complete e-banking information, both user ID and password. Another eight employees (16% of the sample) agreed to give their user name only. The remaining 26 employees (52% of the sample) were vigilant and decided not to reveal any information over the phone. These results reveal the high risk posed by social engineering and show an urgent need to raise customers’ awareness so that they do not fall victim to this kind of threat, which can have devastating results [26].

2.3 Case study: business email compromise (BEC)

The Nigeria-based Business E-mail Compromise (BEC) attack hit more than 50 countries in 2017, targeting more than 500 businesses, predominantly industrial organizations. The phishing scam directed users to download a malicious file. When these files were downloaded, the malware gained unauthorized access to the victims’ business data and networks [27].

2.4 Case study: shipping information

In July 2018 the internet security company Comodo found a new type of phishing scam aimed specifically at small businesses. Phishing E-mails with “Shipping Information” in the subject line were sent to more than 3,000 small businesses. The E-mail announced an approaching delivery by United Parcel Service (UPS) and asked the user to click on a delivery tracking link to get the delivery status. The tracking link contained malware, so clicking on it could release a virus [27].

3. Proposed system architecture

Although several methods exist today to detect phishing, detecting fake E-mails remains a difficult task. Current techniques for identifying phishing E-mails include whitelisting, heuristics, blacklisting and machine learning. This chapter proposes a machine learning technique to identify phishing E-mails and protect users from revealing their PINs, user IDs and passwords. The objective is to use J48, a machine learning algorithm, to analyze incoming E-mails and help protect users from phishing attacks. The architectural model presented in Figure 1 uses sub-processes at different stages to classify E-mails as fake or genuine.

Figure 1.

Shows the architectural model of the proposed work.

3.1 The architectural model

The architectural model shown in Figure 1 consists of seven sub-modules: Input Raw E-mails (data set), Convert to Electronic Mail Format (EML), Data Preprocessing, Feature Selection, Convert to Attribute Relation File Format (ARFF), Training Phase, and Classification of Test Data. In the first stage the system reads the raw E-mail data from the Enron dataset. In the second stage the E-mails are converted into EML and stored as files. The third stage is data pre-processing, which removes unnecessary words after tokenization. The fourth stage is feature extraction: features such as to, from, URL, carbon copy (Cc), blind carbon copy (Bcc) and the body of the E-mail (the message itself) are extracted from the input E-mails. In the fifth stage the extracted data are converted into ARFF. The sixth stage is the training phase, in which the classification model is trained using J48. The seventh stage is the testing phase, in which the model classifies E-mails as fake or real.
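The extraction of the to, from, Cc, Bcc, URL and body fields from an EML message might look like the sketch below. It uses Python's standard `email` package rather than the chapter's own tooling; the function name `extract_fields`, the URL regular expression and the sample message are assumptions made for illustration.

```python
import re
from email import message_from_string
from email.policy import default

# Simple pattern for http/https links; an assumption, not an exhaustive URL grammar.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_fields(raw_eml):
    """Parse one EML message and return the fields named in the feature extraction stage."""
    msg = message_from_string(raw_eml, policy=default)
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "from": str(msg.get("From", "")),
        "to": str(msg.get("To", "")),
        "cc": str(msg.get("Cc", "")),
        "bcc": str(msg.get("Bcc", "")),
        "subject": str(msg.get("Subject", "")),
        "body": body,
        "urls": URL_PATTERN.findall(body),
    }


# Hypothetical message used only to exercise the function.
raw = (
    "From: Security Team <alerts@bank.example>\n"
    "To: user@example.org\n"
    "Cc: audit@example.org\n"
    "Subject: Verify your account\n"
    "\n"
    "Dear customer, click http://192.0.2.10/login to verify your account.\n"
)
fields = extract_fields(raw)
print(fields["urls"])
```

Reading each stored `.eml` file and passing its text to a function of this kind would yield one record per E-mail for the later stages.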

  1. The first step in E-mail classification is to select a suitable E-mail dataset: a real sample of existing E-mails that includes both phishing and legitimate E-mails.

  2. After the E-mail dataset is selected, each E-mail is split out and converted into Electronic Mail Format (EML). EML normally stores each message as a single file; attachments may either be included as Multipurpose Internet Mail Extensions (MIME) content in the message or be written out as separate files.

  3. Data pre-processing is then applied to these files to remove stop words and unwanted information. Next, a data reduction technique reduces the amount of data to be examined. In the last pre-processing step, lemmatization and stemming are applied to the word tokens to convert them into their root forms.

  4. After data pre-processing, feature selection starts on the cleaned data to extract different features from the E-mail dataset. Features such as to, from, URL, carbon copy (Cc), blind carbon copy (Bcc) and the body of the E-mail are extracted from the input dataset. Feature extraction continues until the complete data has been scanned and all features have been extracted. The most important features are the E-mail header, body, JavaScript and URL, as given below.

    • E-mail Header: The header information is extracted from the E-mail dataset. The header features are the to, from, bcc and cc fields. Phishing E-mail headers often include keywords such as bank, debit, credit, Fwd: and Re:.

    • Body of E-mail: The body is the message part of the E-mail. Phishing E-mail bodies mostly include keywords such as dear, credit, click, log, identify, information, suspension and verify your account.

    • JavaScript: This feature captures JavaScript code in the E-mail body. A phishing E-mail most often contains an onClick event, pop-up window code, or code that links to an external website.

    • URL: This feature captures suspicious URLs. A phishing E-mail mostly contains an “@” sign in the URL, port numbers in the URL, or an IP address in the URL.

    • Network-based: The network-based features mostly comprise packet size and TCP/IP headers.

  5. To apply a classification algorithm to detect phishing E-mails, the data must be converted into Attribute Relation File Format (ARFF). This chapter suggests the J48 classifier for classifying the E-mail dataset. J48 is WEKA’s implementation of the C4.5 decision tree algorithm, an extension of the popular ID3 (Iterative Dichotomiser 3) algorithm. It is well suited to E-mail dataset classification because it can handle errors and missing values to some extent.

  6. The model is trained in the training phase using the training dataset and is then evaluated for its suitability.

  7. After the model has been trained and evaluated, it is used to classify the test dataset.

  8. Finally, the E-mails are classified as genuine or phishing, and the accuracy of the classifier is calculated from the confusion matrix.
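The header, body, JavaScript and URL cues listed in step 4, and the ARFF conversion in step 5, could be sketched as follows. This is a simplified illustration and not the chapter's feature extractor: the function names `heuristic_flags` and `to_arff`, the regular expressions, the attribute names and the two sample E-mails are all assumptions, keyword matching is by plain substring, and the network-based features are not covered.

```python
import re

# Keyword lists taken from the feature descriptions above.
HEADER_KEYWORDS = ("bank", "debit", "credit", "fwd:", "re:")
BODY_KEYWORDS = ("dear", "credit", "click", "log", "identify",
                 "information", "suspension", "verify your account")
# Assumed patterns for the URL and JavaScript cues.
IP_HOST = re.compile(r"^https?://\d{1,3}(?:\.\d{1,3}){3}(?:[:/]|$)", re.IGNORECASE)
PORT = re.compile(r"^https?://[^/]+:\d+(?:/|$)", re.IGNORECASE)
JS_HINT = re.compile(r"<script|onclick\s*=|window\.open", re.IGNORECASE)


def heuristic_flags(subject, body, urls):
    """Return boolean features for one E-mail (crude substring and regex checks)."""
    s, b = subject.lower(), body.lower()
    return {
        "header_keyword": any(k in s for k in HEADER_KEYWORDS),
        "body_keyword": any(k in b for k in BODY_KEYWORDS),
        "has_javascript": bool(JS_HINT.search(body)),
        "url_has_at": any("@" in u for u in urls),
        "url_has_port": any(PORT.match(u) is not None for u in urls),
        "url_has_ip": any(IP_HOST.match(u) is not None for u in urls),
    }


def to_arff(rows, relation="emails"):
    """Serialize (features, label) pairs as ARFF text with 0/1 nominal attributes."""
    names = list(rows[0][0])
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {n} {{0,1}}" for n in names]
    lines += ["@attribute class {phishing,genuine}", "", "@data"]
    for feats, label in rows:
        lines.append(",".join(str(int(feats[n])) for n in names) + "," + label)
    return "\n".join(lines)


# Two hypothetical E-mails, labeled by hand for the example.
phish = heuristic_flags("Re: your bank account",
                        "Dear customer, click here to verify your account",
                        ["http://192.0.2.10:8080/login"])
ham = heuristic_flags("Lunch on Friday", "See you at noon.", [])
arff = to_arff([(phish, "phishing"), (ham, "genuine")])
print(arff)
```

A file written this way can be loaded into WEKA, where the J48 classifier is then trained and evaluated as described in steps 6 to 8.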

3.2 Implementation and discussion on results

The Enron dataset, which includes both genuine and phishing E-mails, is used to test the model proposed in this chapter. Data pre-processing is first performed on the dataset to remove stop words and superfluous words, and the size of the data is reduced to obtain better results, as shown in Figure 2.

Figure 2.

This shows feature selection process after data preprocessing.
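The pre-processing described above (tokenization, stop-word removal and reduction of words to a root form) can be sketched in a few lines. The stop-word list and the suffix-stripping rule below are deliberately tiny stand-ins chosen for illustration; they are neither the lists used in the chapter nor a real stemmer or lemmatizer.

```python
import re

# Illustrative stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "to", "of", "your", "is", "are",
    "in", "on", "for", "please", "we", "you",
})
# Crude suffix stripping as a stand-in for stemming and lemmatization.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(token):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


cleaned = preprocess("Please verify your accounts: we are suspending the cards")
print(cleaned)
```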

As shown in Figure 3, the model is implemented using the J48 classifier, which classifies genuine and phishing E-mails with 98% accuracy. To measure the efficiency and performance of the proposed algorithm in detecting phishing E-mails, the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are computed. Accuracy, precision, recall and F1 score are then computed using the formulas given below.

  • ACCURACY: Accuracy measures how often the model is correct; it is the sum of all true values divided by the total number of values.

    (True Positive + True Negative) / (True Positive + True Negative + False Positive + False Negative)

    $$\text{ACCURACY} = \frac{TP + TN}{TP + TN + FP + FN}$$

  • PRECISION: How often is a predicted positive value correct? It is the number of true positives divided by the total number of predicted positive values.

    True Positive / (True Positive + False Positive)

    $$\text{PRECISION} = \frac{TP}{TP + FP}$$

  • RECALL: Recall measures the model’s ability to find positive values. How many of the actual positive values does the model predict correctly? It is the number of true positives divided by the total number of actual positive values.

    True Positive / (True Positive + False Negative)

    $$\text{RECALL} = \frac{TP}{TP + FN}$$

  • F-1 SCORE: The F1 measure is used when both precision and recall need to be taken into account.

    $$F_1 = \frac{2 \times \text{PRECISION} \times \text{RECALL}}{\text{PRECISION} + \text{RECALL}}$$
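The four measures above follow directly from the confusion matrix counts. The sketch below computes them for hypothetical counts; the numbers are not results from this chapter, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts.

    Any ratio whose denominator is zero is reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=45, tn=50, fp=5, fn=0)
print(metrics)
```

With these made-up counts, precision is lower than recall because all five errors are false positives, which is the kind of imbalance the F1 score is meant to summarize.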

Figure 3.

E-mails classified using the WEKA tool.

4. Conclusion

In the modern era, as more and more people use the internet for their day-to-day activities, hackers on the network try to steal their vital data through various security attacks. The objective of this chapter is to present a model that uses machine learning to detect phishing attacks and protect users from them. The chapter provides an architectural model for identifying phishing E-mails. It concludes that phishing attacks are dangerous to everyone in society, organizations and individuals alike, and must therefore be detected accurately. Many researchers have contributed ideas for separating phishing E-mails from genuine ones, but without much success. This chapter used the J48 classification algorithm to classify E-mails as fake or genuine, and the model was observed to classify them with 98% accuracy. The proposed model can therefore identify phishing E-mails accurately and efficiently, and it can help ordinary users protect their important information by detecting phishing attacks.

Conflict of interest

I, the author of “An Emerging Solution for Detection of Phishing Attacks”, state that this research work fully complies with the ethical standards of the journal.

  • I have no direct or potential influence that could impart bias on this research work.

  • I have no conflicts of interest that are directly or indirectly related to this research work.

  • I have received no funding from any funding agency and no financial support from any organization.


Acronyms and abbreviations

E-mail    Electronic mail
URL       uniform resource locator
EML       Electronic Mail Format
SMS       short message service
VoIP      Voice over Internet Protocol
WEKA      Waikato Environment for Knowledge Analysis
IP        Internet Protocol
SSL       secure sockets layer
BEC       Business Email Compromise
UPS       United Parcel Service
Cc        carbon copy
Bcc       blind carbon copy
ARFF      Attribute Relation File Format
MIME      Multipurpose Internet Mail Extensions


    © 2021 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.0 License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    How to cite and reference


    Prasanta Kumar Sahoo (March 3rd 2021). An Emerging Solution for Detection of Phishing Attacks [Online First], IntechOpen, DOI: 10.5772/intechopen.96134. Available from:
