This section introduces phishing attacks, methods of phishing prevention, and why machine learning could be a possible solution for combating them. The tactics used by malicious actors will also be studied, allowing for a greater understanding of how these attacks can be prevented.
The project overview will then follow, outlining the project method, objectives, and hypotheses.
1.1.1. Phishing Attacks
In our business and personal lives, email has taken over as the go-to method of communication. By the end of 2019, it is estimated that over one-third of the worldwide population will be using email (The Radicati Group, 2015). While this has revolutionized the way we do business, it has also created an opportunity for malicious activity. One of the ways criminals have used email to their advantage has been through phishing attacks.
Phishing is the act of getting a victim to carry out an action in order to help an attacker achieve their goals (Che et al., 2017). In their 2016 Q4 report, the Anti Phishing Working Group (2016) advised that both social engineering and technical subterfuge can be used to steal personal identity data and financial account details from victims. The same working group highlights that there has been a 5753% increase in phishing attacks over 12 years. When you consider that 2016 saw a 65% increase over 2015 (Anti Phishing Working Group (APWG), 2016), it is safe to say this problem is not going away.
Phishing through email can be broken down into a two-step process. The victim is first convinced to click on a link, after which they are tricked into handing over sensitive information (Chhabra et al., 2011). To carry out this attack effectively, the attacker sends out a massive number of emails, each including a URL to a website the attacker controls (Moore and Clayton, 2007). Both the email and the website will be crafted so as to fool the user into thinking they are from a legitimate source. These kinds of attacks may target users who are not aware of online security issues and are therefore willing to hand over their credentials to a site that looks legitimate (Gupta, Singhal and Kapoor, 2017). Targets could also include high-value departments holding valuable information, such as HR or Accounting (Ferreira and Lenzini, 2015).
While this process may seem reasonably simple, it has proven to be a lucrative venture for cyber criminals. The RSA Online Fraud Report 2016 states that “phishing attacks have cost global organizations $4.6 billion in losses in 2015” (Verma and Das, 2017). With free ‘do-it-yourself’ phishing kits available online, like the one found by the security company Sophos (2004), the risk to companies and individuals is growing day by day.
The first step of the phishing process, convincing the user to go to a malicious URL, is the main focus of this project. In a 2008 study it was found that nearly one third of websites contain malicious code (Liang et al., 2009), highlighting the need for an effective solution which prevents users from accessing these sites. It is not difficult to convince most users to visit a malicious link. When asked to identify suspicious aspects of a phishing email, users tend to focus on the spelling and design as the number one indicator of malicious activity (Jakobsson, 2007). This same study found that while users study URLs carefully, they “were not highly suspicious of URLs that were well-formed”. Obfuscation techniques are used to increase the likelihood that a user will see the URL as legitimate. Methods such as large host names, using IP addresses and using a domain name within the URL host can all be used to fool the victim (Garera et al., 2007). URL shorteners are also increasing in prominence (Chhabra et al., 2011). The APWG (2016) have also suggested that phishers do not even need deceptive domain names when they can use obfuscation techniques. For all of these reasons, any method for reducing reliance on human decision making will be beneficial.
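The three host-based obfuscation techniques attributed to Garera et al. (2007) above can be illustrated with a minimal Python sketch. This is not the detection method of any cited study: the threshold, the brand list and the function names are hypothetical values chosen purely for illustration, and the registered domain is approximated as the last two labels of the host.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical values for illustration only; not taken from the cited studies.
LONG_HOST_THRESHOLD = 40
KNOWN_BRANDS = ("paypal", "ebay", "amazon")


def host_is_ip_address(host: str) -> bool:
    """Return True when the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def obfuscation_indicators(url: str) -> dict:
    """Flag three host-based obfuscation tricks for a single URL."""
    host = (urlsplit(url).hostname or "").lower()
    is_ip = host_is_ip_address(host)
    # Simplification: treat the last two labels as the registered domain,
    # which ignores multi-part suffixes such as .co.uk.
    subdomain_labels = [] if is_ip else host.split(".")[:-2]
    brand_in_subdomain = any(
        brand in label for label in subdomain_labels for brand in KNOWN_BRANDS
    )
    return {
        "long_host": len(host) > LONG_HOST_THRESHOLD,
        "ip_host": is_ip,
        "brand_in_subdomain": brand_in_subdomain,
    }


if __name__ == "__main__":
    print(obfuscation_indicators("http://paypal.com.secure-login.example.net/signin"))
```

A URL whose host begins with a familiar brand but is actually registered under an unrelated domain sets the brand_in_subdomain flag, while the genuine brand domain does not, because the brand then sits in the registered domain rather than in a subdomain.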
1.1.2. Phishing Prevention Methods
While URL analysis is the main focus of this project, a lot of the previous work done in phishing email prevention uses different techniques. These techniques can be broken down into two main categories: content-based and behavioural-based.
Content-based detection focuses on the actual email itself. This could include a variety of different features, all of which can be directly extracted from the email without thorough processing. Almomani et al. (2013) list five ‘basic features’ used for analysis: structural, link, element, spam filter and word list. Certain structural features are very prevalent in phishing emails, allowing for fast categorisation based on content. The language used in an email can be an indicator of its purpose. Natural language processing, which enables a computer to derive meaning from language, can be utilized to process emails and look for indicators of phishing. Common social engineering techniques, such as a sense of urgency or reply-inducing sentences, can be identified and used for categorization (Aggarwal, Kumar and Sudarasan, 2014).
While the body of an email may be used for language understanding, the header can reveal information that is much more difficult to spoof. While the body is controlled by the sender, there are features in the header which cannot be changed. The Message-ID, a globally unique identifier contained within the header, can be used to classify emails and identify phishing attempts (Verma and Rai, 2015).
As a phishing attack's main aim is to direct the user to a forged website, a lot of research focusses on URL analysis. Phishers will use certain obfuscation techniques to mask a URL and make it appear to be a legitimate address. Encoding techniques can be counteracted using URL untangling, revealing the real destination of a link (Chandrasekaran, Narayanan and Upadhyaya, 2006). While HTML emails are used commonly in marketing, phishers use this feature to make their emails look more realistic (Fette, Sadeh and Tomasic, 2007). A common technique using HTML focusses on disguising the real content of a URL to make it look legitimate. If the text of the URL and the HREF are different, this indicates the URL may be used for phishing. A simple technique using URLs is blacklisting. This technique queries a database of known phishing sites to determine whether a link is legitimate (Sahoo, Liu and Hoi, 2017). While this method is simple to implement and uses little overhead, the speed at which new sites are created can render this technique ineffective (Verma and Das, 2017). To counteract this, some methods will use the content of the URL itself and identify properties that may suggest malicious activity. Simple features, such as the randomness of a URL, can be used to separate legitimate and phishing URLs (Verma and Das, 2017).
One issue with content-based detection is the difficulty of detecting zero-hour phishing attacks. These are attacks that have not been seen before (Li et al., 2016), so blacklisting and other URL analysis techniques may struggle to keep up. To counteract this, some anti-phishing mechanisms will actually access the link in order to scan the site for suspicious behaviour. Retrieving the content pointed to by a URL has a higher detection rate than only looking at the email itself (Garera et al., 2007). This heuristic approach looks for signatures of malicious activities, such as process creation, which would suggest that the link is that of a phishing scam (Sahoo, Liu and Hoi, 2017). While this method may be more accurate, actually following a link contained within a phishing email could result in unwanted consequences. Some of these may be an inconvenience for the user, such as being signed up for a marketing mailing list (Garera et al., 2007). Other actions resulting from visiting the web page may be more damaging. Attacks may be launched from the URL once it has been visited or after a period of time, putting the user at risk (Sahoo, Liu and Hoi, 2017). To counteract this, resource-heavy ‘controlled environments’ may be implemented, with disposable VMs used for analysis. Classifying using the URL alone is therefore a much more lightweight solution (Ma et al., 2009).
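The comparison between a link's visible text and its HREF can be sketched in Python using only the standard library. This is an illustrative sketch rather than the approach of any cited work: the class and function names are hypothetical, only link text that itself looks like a URL is compared, and the comparison is made on host names alone.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url_like: str) -> str:
    """Return the lower-cased host of a URL, tolerating a missing scheme."""
    if "://" not in url_like:
        url_like = "http://" + url_like
    return (urlsplit(url_like).hostname or "").lower()


def looks_like_url(text: str) -> bool:
    """Assumption: only text starting with a scheme or 'www.' is treated as a URL."""
    return text.lower().startswith(("http://", "https://", "www."))


def mismatched_links(html: str) -> list:
    """Return links whose visible text names a different host than the href."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [
        (href, text)
        for href, text in parser.links
        if looks_like_url(text) and host_of(text) != host_of(href)
    ]
```

Links with ordinary wording such as "Click here" are deliberately ignored by this sketch, since there is no displayed URL to compare against the real destination.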
1.1.3. Machine Learning for Phishing Detection
To overcome the issues faced in phishing detection, researchers have looked at Machine Learning (ML) as a method of classifying URLs (Patil and Patil, 2015). Machine Learning takes a set of known URLs as training data and, using the statistical properties of that data, learns to classify previously unseen URLs. This is more effective than blacklisting, as even a newly created URL can be classified based on the characteristics of previous phishing attempts (Sahoo, Liu and Hoi, 2017). The textual properties of the URL, i.e. its lexical features, can be used to extract certain features that indicate whether a URL is malicious or not (Ma et al., 2009). Features such as the domain name length, number of unique characters and letter frequencies can all be combined to create a model used for categorization (McGrath and Gupta, 2008). These features are extracted from datasets of known malicious or benign URLs, training the model to work on completely new data.
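The lexical features named above (domain name length, number of unique characters and letter frequencies) could be extracted along the following lines. This is a minimal sketch under stated assumptions: the function name and feature keys are hypothetical, the whole host name stands in for the domain name, and letter frequencies are computed over the ASCII letters a to z only.

```python
from collections import Counter
from string import ascii_lowercase
from urllib.parse import urlsplit


def lexical_features(url: str) -> dict:
    """Turn a URL into a flat dictionary of numeric lexical features."""
    # Assumption: the full host name is used as the "domain name".
    host = (urlsplit(url).hostname or "").lower()
    letters = [ch for ch in host if ch in ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)

    features = {
        "domain_length": len(host),
        "unique_characters": len(set(host)),
    }
    # Relative frequency of each letter among the letters of the host name.
    for letter in ascii_lowercase:
        features["freq_" + letter] = counts[letter] / total if total else 0.0
    return features


if __name__ == "__main__":
    for name, value in lexical_features("http://aab.example/").items():
        if value:
            print(name, value)
```

Each URL becomes a fixed-length numeric vector (28 values in this sketch), which is the form a supervised classifier expects for both training and prediction.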
The methods of training ML models can vary. The first difference is supervised versus unsupervised learning. Supervised learning is the method previously discussed, where the training dataset is labelled to ‘teach’ the machine what a malicious URL looks like. In unsupervised learning, the machine learns through observation and finds its own structures in the data. Due to the diverse nature of URLs, unsupervised ML is not a popular method for detection (Sahoo, Liu and Hoi, 2017). Supervised learning has proven to be much more popular for anomaly detection. This method can be further broken down into batch or online learning. Batch learning, as the name suggests, works through the whole training dataset, batch by batch, before a model is produced. Online learning, however, learns from data sequentially: the model is updated using one new instance of a URL at a time (Verma and Das, 2017). Online learning is faster than batch learning, especially when dealing with larger datasets (Gyawali et al., 2011; Ma et al., 2011).
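The idea of online learning, where the model is updated after each individual labelled URL, can be shown with a small perceptron written in plain Python. The perceptron is chosen here only because it is one of the simplest online learners; it is not claimed to be the algorithm used in the cited studies. The class name, the two binary features and the toy stream are all hypothetical.

```python
class OnlinePerceptron:
    """Binary linear classifier updated one labelled example at a time."""

    def __init__(self, n_features: int, learning_rate: float = 1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x) -> int:
        """Return 1 (malicious) if the linear score is positive, else 0 (benign)."""
        score = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if score > 0 else 0

    def learn_one(self, x, label: int) -> None:
        """Update the model from a single example; no change if it was classified correctly."""
        error = label - self.predict(x)  # -1, 0 or +1
        if error:
            step = self.learning_rate * error
            self.weights = [w + step * xi for w, xi in zip(self.weights, x)]
            self.bias += step


# Hypothetical toy stream: each item is ([host_is_ip, has_many_subdomains], label),
# where label 1 means malicious. In this toy data only the first feature matters.
stream = [([1, 0], 1), ([0, 1], 0), ([1, 1], 1), ([0, 0], 0)]

model = OnlinePerceptron(n_features=2)
for features, label in stream:
    model.learn_one(features, label)  # the model changes as each URL arrives
```

Because each update touches only the current example, the model never needs the full dataset in memory, which is why this style of learning suits large or continuously arriving collections of URLs.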
With the lexical features of a URL, a known dataset of malicious and benign URLs, and an effective machine learning algorithm, a method for detecting brand new phishing URLs can be created (Li et al., 2016).