Improved techniques for phishing email detection based on random forest and firefly-based support vector machine learning algorithms.
Abstract
Electronic fraud is one of the major challenges faced by the vast majority of internet users today. Curbing this menace is not an easy task, primarily because of the rapid rate at which fraudsters change their mode of attack. Many techniques have been proposed in the academic literature to handle e-fraud, including blacklist, whitelist, and machine learning (ML) based techniques. Among these, ML-based techniques have proven to be the most efficient because of their ability to detect new fraudulent attacks as they appear. There are three commonly perpetrated electronic frauds, namely email spam, phishing, and network intrusion. Of these three, phishing attacks have caused the greatest financial loss. This research investigates and reports the use of ML and nature-inspired techniques in the domain of phishing detection, with the foremost objective of developing a dynamic and robust phishing email classifier with improved classification accuracy and reduced processing time. Two approaches to phishing email detection are proposed, and two email classifiers are developed based on the proposed approaches. In the first approach, a random forest (RF) algorithm is used to construct decision trees, which are, in turn, used for email classification. The second approach introduces a novel ML method that hybridizes the firefly algorithm (FFA) and the support vector machine (SVM). The hybridized method consists of three major stages: a feature extraction phase, a hyper-parameter selection phase, and an email classification phase.
In the feature extraction phase, the feature vectors of all the features described in Section 3.6 are extracted and saved in a file for easy access. In the second stage, a novel hyper-parameter search algorithm, developed in this research, is used to generate an exponentially growing sequence of paired C and gamma (γ) values. FFA is then used to optimize the generated SVM hyper-parameters and to find the best hyper-parameter pair. Finally, in the third phase, SVM is used to carry out the classification. This new approach addresses the problem of hyper-parameter optimization in SVM and, in turn, improves the classification speed and accuracy of SVM. Using two publicly available email datasets, experiments are performed to evaluate the performance of the two proposed phishing email detection techniques. During the evaluation of each approach, a set of features well suited for phishing detection is extracted from the training dataset and used to construct the classifiers. Thereafter, the trained classifiers are evaluated on the test dataset. The evaluations produced very good results. The RF-based classifier yielded a classification accuracy of 99.70%, a false positive (FP) rate of 0.06%, and a false negative (FN) rate of 2.50%. The hybridized classifier (known as FFA_SVM) produced a classification accuracy of 99.99%, an FP rate of 0.01%, and an FN rate of 0.00%.
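The hyper-parameter selection stage described above can be illustrated with a minimal sketch. It is not the thesis implementation: the exponent ranges, the swarm settings, and the names exponential_pairs, firefly_search, and toy_score are all assumptions made for illustration, and a basic textbook firefly update is used. The toy objective stands in for the quantity one would actually maximise, such as the cross-validated accuracy of an SVM trained with a given (C, gamma) pair.

```python
import math
import random


def exponential_pairs(c_exps=range(-5, 16, 2), g_exps=range(-15, 4, 2)):
    """Candidate (C, gamma) pairs on an exponentially growing grid.

    The exponent ranges are assumed for illustration only.
    """
    return [(2.0 ** ce, 2.0 ** ge) for ce in c_exps for ge in g_exps]


def firefly_search(score, candidates, n_fireflies=10, n_iter=30,
                   beta0=1.0, absorption=0.05, alpha=0.5, seed=0):
    """Maximise score(C, gamma) with a basic firefly algorithm.

    Fireflies move in (log2 C, log2 gamma) space and start from randomly
    chosen candidate pairs. All default settings are assumptions.
    """
    rng = random.Random(seed)
    logs = [(math.log2(c), math.log2(g)) for c, g in candidates]
    lo = [min(p[d] for p in logs) for d in range(2)]
    hi = [max(p[d] for p in logs) for d in range(2)]
    swarm = [list(p) for p in rng.sample(logs, n_fireflies)]
    light = [score(2.0 ** p[0], 2.0 ** p[1]) for p in swarm]
    best_i = max(range(n_fireflies), key=light.__getitem__)
    best_pos, best_light = list(swarm[best_i]), light[best_i]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:
                    # Attractiveness decays with squared distance to firefly j.
                    r2 = sum((swarm[i][d] - swarm[j][d]) ** 2 for d in range(2))
                    beta = beta0 * math.exp(-absorption * r2)
                    for d in range(2):
                        step = (beta * (swarm[j][d] - swarm[i][d])
                                + alpha * (rng.random() - 0.5))
                        # Keep the firefly inside the candidate grid's bounds.
                        swarm[i][d] = min(hi[d], max(lo[d], swarm[i][d] + step))
                    light[i] = score(2.0 ** swarm[i][0], 2.0 ** swarm[i][1])
                    if light[i] > best_light:
                        best_pos, best_light = list(swarm[i]), light[i]
        alpha *= 0.97  # gradually reduce the random walk
    return (2.0 ** best_pos[0], 2.0 ** best_pos[1]), best_light


def toy_score(c, gamma):
    """Hypothetical stand-in for validation accuracy; peaks at C=2**3, gamma=2**-7."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 7) ** 2)


(best_c, best_gamma), best_value = firefly_search(toy_score, exponential_pairs())
print(best_c, best_gamma, best_value)
```

To use this with a real classifier, toy_score would be replaced by a function that trains an SVM with the given pair and returns its validation accuracy, for example via scikit-learn's SVC and cross_val_score. Searching in log2 space keeps the exponentially spaced candidates evenly spread, which is a design choice of this sketch rather than something stated in the abstract.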
Description
Master of Science in Computer Science. University of KwaZulu-Natal, Durban, 2014.
