Deep Learning Malware Detection

IN PROGRESS

Introduction

Malware is a prevalent threat in today's digital world. With adversaries and defenders locked in a constant arms race, new methods and techniques have arisen on both sides. One current defensive technique pairs deep learning with malware analysis to create models capable of classifying binaries as benign or malicious. Accurate and robust models are employed in anti-malware solutions and are often effective. However, deep learning has many limitations, and these models will never be capable of achieving perfect accuracy in the real world, especially as adversaries continue to develop new techniques.

Architecture

Creating an intelligent antivirus engine is no simple feat. It requires a strong understanding of both machine learning / deep learning concepts and malware. Many different architectures and techniques have been tried in the past. Additionally, the preprocessing of malware data sets varies greatly. Some researchers opt to analyze only the PE headers, while others use all of the raw bytes of the binary. In this post, I will only be discussing the latter.
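To make the two preprocessing choices concrete, here is a minimal sketch of what each one feeds to a model. The function names are hypothetical and are not taken from any particular paper or library. The header-only view relies on the standard PE layout: the "MZ" magic, the e_lfanew field at offset 0x3C, and the 20-byte COFF header that follows the "PE\0\0" signature.

```python
import struct
from typing import List


def raw_byte_view(data: bytes) -> List[int]:
    """Whole-file view: every byte becomes one integer in 0..255."""
    return list(data)


def pe_header_view(data: bytes) -> bytes:
    """Header-only view: DOS header through the end of the optional header."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ magic; not a PE file")
    # e_lfanew: 4-byte little-endian offset of the PE signature, stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the signature; SizeOfOptionalHeader
    # is the 2-byte field at offset 16 inside it.
    (opt_size,) = struct.unpack_from("<H", data, pe_offset + 4 + 16)
    return data[: pe_offset + 4 + 20 + opt_size]
```

The header view is usually a few hundred bytes, while the raw view grows with the file, which is where the costs discussed in the rest of this section come from.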

Passing the entire binary to the model has both advantages and disadvantages. By giving the model access to a sample's raw bytes, it is possible to extract much more valuable information than from the headers alone. However, reading in entire samples is a double-edged sword. A powerful GPU or cluster of GPUs is required to fit the data in memory. Also, composing an architecture that can adequately capture tiny differences between benign and malicious binaries is not trivial when binaries can be gigabytes in size.
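A rough sketch of why memory becomes the bottleneck: models that consume raw bytes typically need every sample to have the same length, so short files are padded and long files are truncated, and each byte is then mapped to an embedding vector. The padding id, the sequence length, the embedding size, and the batch size below are all illustrative assumptions, not values from a specific model.

```python
from typing import List

PAD_TOKEN = 256  # hypothetical padding id, chosen outside the byte range 0..255


def to_fixed_length(data: bytes, max_len: int) -> List[int]:
    """Truncate or pad a byte string to exactly max_len integer tokens."""
    tokens = list(data[:max_len])
    tokens.extend([PAD_TOKEN] * (max_len - len(tokens)))
    return tokens


def embedding_activation_bytes(seq_len: int, embed_dim: int,
                               batch_size: int, bytes_per_value: int = 4) -> int:
    """Memory for the embedded batch alone (float32 by default)."""
    return seq_len * embed_dim * batch_size * bytes_per_value
```

Under these assumptions, a 2 MiB input window with an 8-dimensional embedding and a batch of 32 samples gives embedding_activation_bytes(2 * 1024 * 1024, 8, 32) = 2 GiB, and that is before any convolutional layers, gradients, or optimizer state are counted.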