Machine Learning Techniques for Phishing Attacks

Researcher: Jason Hong

Research Area: Trustworthy Computing Platforms and Devices | Privacy Protection

Abstract

Phishing scams are a plague on today's Internet. They are a form of social engineering that tricks people into giving up personal information to fake websites that impersonate legitimate ones. Criminals have used phishing to steal money from people's bank accounts, conduct corporate espionage, and break into military systems.

Currently deployed countermeasures to phishing are insufficient. The most widely adopted technique is the blacklist. Blacklists are maintained by Microsoft, Google, and PhishTank (an open community where users both submit and verify phishing sites). The problem with blacklists today, however, is that they are slow to react to new attacks, since they require human verification, and are easily overcome by phishers, who can either target very few people so that the phishing site does not get blacklisted or generate large numbers of URLs per phishing site to overwhelm the blacklists.
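The URL-generation weakness can be illustrated with a small sketch. It assumes a blacklist that matches complete URLs exactly, which is a simplification made for illustration and not a description of how any particular provider's blacklist works. The host name `phish.example` and the `session` query parameter are hypothetical.

```python
# Hypothetical exact-match blacklist holding one human-verified phishing URL.
# Assumption: entries are compared as whole strings; real blacklists may differ.
blacklist = {"http://phish.example/login"}


def is_blocked(url):
    """Return True only if the exact URL string is on the blacklist."""
    return url in blacklist


# A phisher can serve the same page under many trivially different URLs.
variants = [f"http://phish.example/login?session={i}" for i in range(5)]

# Each variant is checked independently against the blacklist.
blocked = [is_blocked(url) for url in variants]
```

Under this assumption the verified URL is blocked, while every generated variant would need its own entry, and therefore its own round of verification.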

Our long-term vision is to develop an Internet Immune System that can detect and respond to phishing attacks far faster and more reliably than is possible today. This system would comprise a suite of Internet "sensors" to quickly detect new attacks, complemented by a suite of automated techniques and support tools to help coordinate efforts in responding to the attacks. With respect to detection, the sensors might include plug-ins for browsers, email filters, and web crawlers, as well as user-submitted data coupled with more sophisticated algorithms for identifying phish. With respect to response, some features might include better and more responsive blacklists, notifying ISPs and web administrators, capturing evidence and providing rationale as to why a site is fake, and a per-phish wiki that helps experts coordinate efforts in shutting down fake sites.

A core part of this Internet Immune System, and the part we will focus on in this project, is applying machine learning techniques in novel ways to help improve our ability to automatically detect phishing sites. Specifically, we will a) develop new heuristics that make use of content-based analysis and web topology to detect phishing sites; b) take human-verified blacklist data and apply machine learning techniques to learn characteristics of phishing sites; and c) evaluate the effectiveness of our algorithms by comparing them in lab-based studies, as well as measuring how many of the phishing sites we detect actually end up on blacklists.
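As a rough illustration of item b), the sketch below trains a tiny logistic-regression classifier on labeled URLs. It is a minimal sketch under stated assumptions, not the project's actual method: the features in `url_features`, the training URLs, and the hyperparameters are all hypothetical, and the handful of made-up examples merely stand in for human-verified blacklist data. The project also proposes content-based and web-topology heuristics, which this URL-only sketch does not cover.

```python
import math
from urllib.parse import urlparse


def url_features(url):
    """Map a URL to a few hand-picked features (illustrative assumptions only)."""
    host = urlparse(url).hostname or ""
    return [
        1.0,                                              # bias term
        float(host.count(".")),                           # dots in the host name
        1.0 if "@" in url else 0.0,                       # '@' anywhere in the URL
        1.0 if host.replace(".", "").isdigit() else 0.0,  # host is a raw IPv4 address
        1.0 if "-" in host else 0.0,                      # hyphen in the host name
        len(url) / 100.0,                                 # scaled URL length
    ]


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def phishing_score(weights, url):
    """Estimated probability (0..1) that the URL is a phishing URL."""
    x = url_features(url)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(urls, labels, epochs=3000, lr=0.1):
    """Fit logistic-regression weights with plain stochastic gradient ascent."""
    samples = [url_features(u) for u in urls]
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for i, xi in enumerate(x):
                weights[i] += lr * (y - p) * xi
    return weights


# Made-up training set standing in for human-verified blacklist data.
# Label 1 = phishing, 0 = legitimate; all hosts use reserved example names.
training_urls = [
    "http://192.0.2.10/login/bank",
    "http://secure-login.bank.example.account-verify.test/update",
    "http://user@pay.example.signin.test/",
    "https://www.bank.example/",
    "https://mail.example/inbox",
    "https://shop.example/cart",
]
training_labels = [1, 1, 1, 0, 0, 0]

weights = train(training_urls, training_labels)
```

A new URL can then be scored with `phishing_score(weights, url)` and flagged when the score exceeds a chosen threshold such as 0.5. With so few examples the learned weights mean little. The point is only the shape of the pipeline: verified labels go in, and a reusable scoring function comes out.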