
Technical Reports: CMU-CyLab-10-022

Title: BitShred: Fast, Scalable Malware Triage
Authors: Jiyong Jang, David Brumley, and Shobha Venkataraman
Publication Date: November 5, 2010


The sheer volume of new malware found each day is enormous. Worse, current trends show the amount of malware is doubling each year. The large-scale volume has created a need for automated large-scale triage techniques. Typical triage tasks include clustering malware into families and finding the nearest neighbor to a given malware.

In this paper we propose efficient techniques for large-scale malware triage. At the core of our work is BitShred, a framework for data mining features extracted by existing per-sample malware analysis. BitShred uses a probabilistic data structure created through feature hashing for large-scale correlation that is agnostic to per-sample malware analysis. BitShred then defines a fast variant of the Jaccard similarity metric to compare malware feature sets. We also develop a distributed version of BitShred that is optimal: given 2x more hardware, we get 2x the performance. After clustering, BitShred can go one step further than previous similar work and also automatically discover semantic inter-family and inter-malware distinguishing features, based upon co-clustering techniques adapted to BitShred’s fingerprints. We have implemented and evaluated BitShred using two different per-sample analysis routines: one based upon static code reuse detection and one based upon dynamic behavior analysis. Our evaluation shows that BitShred’s probabilistic data structure and algorithms speed up typical malware triage tasks by up to three orders of magnitude and use up to 82x less memory, all with similar accuracy to previous approaches.
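To make the feature-hashing idea concrete, here is a minimal Python sketch of how a feature set might be hashed into a fixed-size bit vector and compared with a bitwise approximation of Jaccard similarity. It is an illustration only, not the report's implementation: the abstract does not specify the hash function, fingerprint size, or feature encoding, so SHA-1, the 32,768-bit size, the string features, and the names `fingerprint`, `jaccard_bits`, and `exact_jaccard` are all assumptions made for this example.

```python
import hashlib

# Assumed fingerprint size in bits; the abstract does not state the real value.
FINGERPRINT_BITS = 1 << 15


def fingerprint(features, m=FINGERPRINT_BITS):
    """Hash each feature to one bit position and set it (hypothetical helper)."""
    bits = 0  # a Python int used as an m-bit vector
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % m
        bits |= 1 << index
    return bits


def jaccard_bits(a, b):
    """Approximate Jaccard similarity as popcount(a AND b) / popcount(a OR b)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return bin(a & b).count("1") / union


def exact_jaccard(x, y):
    """Exact set-based Jaccard similarity, for comparison with the approximation."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


if __name__ == "__main__":
    # Made-up feature strings standing in for per-sample analysis output.
    sample_a = ["CreateFileA", "RegSetValueExA", "connect", "send"]
    sample_b = ["CreateFileA", "connect", "send", "recv"]
    print("exact: ", exact_jaccard(sample_a, sample_b))
    print("approx:", jaccard_bits(fingerprint(sample_a), fingerprint(sample_b)))
```

Because the comparison reduces to bitwise AND, OR, and population counts over fixed-size vectors, its cost does not depend on how many features each sample has. Hash collisions can make the approximation differ from the exact value, which is the accuracy-for-speed trade-off the abstract describes.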

Full Report: CMU-CyLab-10-022

Related Project: BAP: The Binary Analysis Platform