
Artifact-Free Sanitization of Insider-Threat Data

Researcher: Roy Maxion



The insider threat is a significant and growing concern, especially in fields where espionage is profitable. To combat the insider threat, systems are needed to monitor users, profile their behavior, and detect suspicious or anomalous activity. Such systems are best evaluated using natural, real-world data to gauge, compare, and improve their performance. Data sets drive research progress, particularly when they are benchmarked and made available to the research community. Progress is currently limited, however, by the paucity of useful real-world data, largely because of confidentiality and privacy concerns, and because the data that are available often lack realism. Even sanitized data are difficult to obtain, and when sanitized data are used, there is no useful way to measure the effect of the sanitization itself on detection capability; that is, a detector may not perform equally well on raw data and on its sanitized equivalent. The proposed work addresses these issues by gathering natural-event data and by examining the artifacts and effects of sanitization.

We expect to deploy data-logging software to volunteers in the CMU environment. The software will be completely under the control of the volunteer. Volunteers will collaborate with “insiders” by permitting vetted insider scripts to be executed on their machines, all under the supervision of the volunteer. (The insider activity will be reversed immediately after the injection.) User activities (excepting passwords), including the activities of injected insiders, will be logged to a file. The volunteer will be provided with existing sanitization software that allows the volunteer to select the items s/he wants sanitized. Once these items are selected, all the logs of that volunteer will be sanitized in the same way without further user intervention. Because each volunteer decides what needs to be sanitized, sanitization is customized on a per-user basis, which avoids the over-sanitization that many sanitizing regimes impose (by sanitizing everything that might be sensitive). Once the user is satisfied that his/her logs are suitably sanitized, the logs will be provided to the research team; this is the first point at which the research team sees any user data.
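The per-volunteer sanitization step described above can be illustrated with a minimal sketch. It is not the project's sanitization software: the function name `make_sanitizer`, the `TOKEN_` prefix, the salted-hash pseudonyms, and the sample log lines are all assumptions made for illustration. The sketch only shows the idea that a volunteer picks items once and every log line is then sanitized the same way.

```python
import hashlib
import re


def make_sanitizer(selected_items, salt):
    """Return a function that replaces each selected item with a stable pseudonym.

    Hypothetical helper: the selected items and the salt are chosen by the volunteer.
    """
    # Longest items first, so a full path is replaced as a unit rather than piecewise.
    items = sorted(set(selected_items), key=len, reverse=True)
    if not items:
        return lambda line: line
    pattern = re.compile("|".join(re.escape(item) for item in items))

    def pseudonym(match):
        # The same item always maps to the same token for a given salt.
        digest = hashlib.sha256((salt + match.group(0)).encode("utf-8")).hexdigest()
        return "TOKEN_" + digest[:8]

    def sanitize_line(line):
        return pattern.sub(pseudonym, line)

    return sanitize_line


# Hypothetical selection made once by the volunteer.
selected = ["alice", "/home/alice/budget.xls"]
sanitize = make_sanitizer(selected, salt="per-volunteer-secret")

# Hypothetical raw log lines; every line is sanitized with the same rules.
raw_log = [
    "09:01 alice open /home/alice/budget.xls",
    "09:05 alice run ls",
]
sanitized_log = [sanitize(line) for line in raw_log]
```

Mapping each item to a stable token keeps repeated references recognizable as the same entity, which is one plausible way to limit distortion of the logged behavior; whether it actually preserves detector performance is the question the proposed study is meant to answer.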

An insider-detection algorithm will be run against both the unsanitized and the sanitized data. The detection code will be given to the volunteers, who will run it themselves on their unsanitized and sanitized logs. The results of the detection algorithm will be given to the research team, who will compare detection efficacy on sanitized vs. unsanitized data. If the detection outcomes for the sanitized data are worse than those for the unsanitized data, the research team will work with the volunteers to determine which aspects or artifacts of the sanitization were responsible for the deterioration.
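As a rough illustration of the comparison described above, the sketch below computes a hit rate and a false-alarm rate for the same detector on raw and on sanitized logs. The choice of these two measures, the function name `detection_rates`, and the alarm and ground-truth values are assumptions for illustration; the source does not specify the metrics or the detector.

```python
def detection_rates(alarms, truth):
    """Return (hit_rate, false_alarm_rate) for parallel lists of booleans.

    alarms[i] is True if the detector flagged log block i;
    truth[i] is True if block i contains injected insider activity.
    """
    hits = sum(1 for a, t in zip(alarms, truth) if a and t)
    false_alarms = sum(1 for a, t in zip(alarms, truth) if a and not t)
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    hit_rate = hits / positives if positives else 0.0
    false_alarm_rate = false_alarms / negatives if negatives else 0.0
    return hit_rate, false_alarm_rate


# Hypothetical example: 8 log blocks, two of which contain injected activity.
truth            = [False, False, True, False, False, True,  False, False]
alarms_raw       = [False, False, True, False, True,  True,  False, False]
alarms_sanitized = [False, False, True, False, True,  False, False, True]

raw_hit, raw_fa = detection_rates(alarms_raw, truth)
san_hit, san_fa = detection_rates(alarms_sanitized, truth)

# Positive values mean the detector did worse on the sanitized logs.
hit_drop = raw_hit - san_hit
fa_rise = san_fa - raw_fa
```

Only the summary rates need to leave the volunteer's machine in this sketch, which is consistent with the research team receiving detection results rather than raw logs.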

The benefits of the proposed work are to: (1) make sanitization easier and more effective, while simultaneously reducing exposure of confidential or private information; (2) determine the effects of different sanitization strategies, employed by different volunteers, on detection efficacy; (3) limit the effects of sanitization artifacts on detector performance by identifying sanitization strategies that do not skew detector performance; (4) provide a way to obtain vetted, realistic benchmark data sets that can be used across the entire research community (by making data, as well as logging and sanitization software, freely available).

Plan of work:

The proposed plan of work is to collect data, and then to use those data in the studies proposed. The first stage will comprise a principled evaluation of sanitization strategies and their effects on detector performance. The sanitization strategies will be compared quantitatively by their effect on a representative insider-threat detection algorithm, and those strategies that have minimal effect on detector performance will be identified.
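One simple way to compare strategies quantitatively, sketched below, is to score each strategy by how far it shifts detector performance and to rank the strategies by that score. The strategy names, the numbers, and the scoring function `sanitization_effect` are hypothetical; the source does not say which measure of effect will be used.

```python
# Hypothetical detector results per sanitization strategy:
# (hit rate raw, hit rate sanitized, false-alarm rate raw, false-alarm rate sanitized)
results = {
    "mask_usernames": (0.90, 0.88, 0.05, 0.06),
    "mask_all_paths": (0.90, 0.60, 0.05, 0.15),
    "mask_hostnames": (0.90, 0.90, 0.05, 0.05),
}


def sanitization_effect(hit_raw, hit_san, fa_raw, fa_san):
    """Total absolute shift in detector performance attributed to sanitization."""
    return abs(hit_raw - hit_san) + abs(fa_raw - fa_san)


# Strategies with the smallest effect on the detector come first.
ranked = sorted(results, key=lambda name: sanitization_effect(*results[name]))
```

Using absolute differences treats an apparent improvement after sanitization as an artifact too, since any shift away from raw-data performance means the sanitized data no longer reflect what the detector would do in practice.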

The second stage will create tools that automatically perform common sanitization tasks, as done by the volunteers. These tools will be evaluated empirically with the data from the user study to ensure that they reduce the burden of sanitization, and yet remain effective.

  • Primary data collection. Deploy data logger on the workstations of volunteer users; monitor the behavior of the volunteers for a two-month baseline.
  • Acquire and prepare insider scenarios. Through our contacts in the intelligence community we will ascertain types of realistic insider scenarios of concern. We will implement selected scenarios so they can be injected realistically into the operations of everyday users (with user permission and cooperation).
  • Inject insider scenarios. With the cooperation and collaboration of the volunteers, inject insider activity into each volunteer’s activities. Mark the injection for ground truth. Restore the system to its normal state.
  • Sanitization. Deploy data sanitizer to each of the volunteers, and have each volunteer sanitize his/her data. Collect the data.
  • Evaluate sanitization strategies. Create insider-threat benchmark data sets from raw and sanitized data. Select a representative insider-threat detection system. Evaluate different sanitization strategies with respect to the difference in detector performance on the raw and sanitized data sets. Analyze results, and identify sanitization strategies that limit the effects of sanitization artifacts on detector performance.
  • Technical report. Prepare a report documenting the work, and suggesting future directions.
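The "mark the injection for ground truth" step in the plan above could be sketched as follows, assuming each injection is recorded as a start and end time and each log event carries a timestamp. The function name `label_events`, the event format, and the dates are hypothetical and are not taken from the project's logger.

```python
from datetime import datetime


def label_events(events, injection_windows):
    """Attach True to events whose timestamp falls inside any injection window."""
    labeled = []
    for timestamp, text in events:
        injected = any(start <= timestamp <= end for start, end in injection_windows)
        labeled.append((timestamp, text, injected))
    return labeled


# Hypothetical injection window recorded when the insider script was run.
windows = [(datetime(2024, 1, 15, 10, 0), datetime(2024, 1, 15, 10, 5))]

# Hypothetical logged events as (timestamp, description) pairs.
events = [
    (datetime(2024, 1, 15, 9, 58), "open report.doc"),
    (datetime(2024, 1, 15, 10, 2), "copy archive to removable drive"),
    (datetime(2024, 1, 15, 10, 7), "send mail"),
]
labeled = label_events(events, windows)
```

Window-based labeling marks everything logged during an injection, including any ordinary activity by the volunteer in that interval, so a real logger would probably need a finer-grained marker if normal and injected events interleave.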

Lack of sound and useful benchmark data sets is perhaps the greatest inhibitor of effective insider-threat detection. Without such commonly held data, there is no measure of progress. The proposed research attempts to encourage the collection and sharing of such data by making the data easy to collect and sanitize, and by ensuring that the data collected will be useful to the insider-threat detection community. Existing public data sets, such as those of Schonlau, Greenberg, and Lane & Brodley, all of which contain sanitization flaws, can be replaced in the research community by sets of high-quality, benchmarked insider data.