Labeling the VirusShare Corpus: Lessons Learned

July 1, 2016 | by ZeroFox Team


When I presented the results of my thesis on using quantum computers for malware classification at DEF CON last year, I mentioned that the most challenging aspect of the project was obtaining public malware data, which was scarce, stale, and rarely clean. Clean data is a fundamental baseline for building any sort of advanced classification model. Since then, I’ve put a lot of effort, both within ZeroFox and outside it, into data visibility: getting high-quality data to the people who need it. Specifically, I’ve had the honor of working with the MLSec and VirusShare projects to use antivirus vendor labels from the VirusTotal database to label the massive VirusShare corpus. For data scientists, labelled data is far more useful than unlabelled data when it comes to building classifiers. We also published this labelled data so that machine learning researchers can use the corpus more easily for supervised machine learning.

As we’ve mentioned before, supervised machine learning is an alternative programming paradigm where, instead of telling the machine what to do, the programmer gives the machine a huge amount of data and instructs it to find the statistically relevant features for completing a given task. However, to do so, this data needs to be tagged with labels, such as the malware family for an executable.
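To make the role of labels concrete, here is a minimal sketch of supervised learning in plain Python. It is a toy nearest-centroid classifier, not the approach used in this project; the feature values and the family names `family_a` and `family_b` are made up for illustration. Note that this sketch does not learn which features matter; it only shows that training is impossible without a label attached to each sample.

```python
from collections import defaultdict


def train_centroids(samples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical training data: ([file size in KB, number of imports], family label).
training = [
    ([120.0, 15.0], "family_a"),
    ([130.0, 18.0], "family_a"),
    ([900.0, 210.0], "family_b"),
    ([950.0, 190.0], "family_b"),
]

centroids = train_centroids(training)
prediction = predict(centroids, [125.0, 16.0])  # an unseen, unlabelled sample
```

The labels in `training` are the only thing that tells the program which samples belong together, which is why an unlabelled corpus cannot be used this way.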

The VirusShare corpus is a massive, curated repository of live malware, orders of magnitude bigger than other commonly used corpora for machine learning in this domain. To put things in perspective, the standard training set used by many academics is the Vx Heaven corpus, which holds 271,092 samples and can comfortably fit on my personal laptop’s hard drive. VirusShare, at the time of this writing, has a whopping 27,133,454 samples, roughly 100 times as many, and it is constantly updated. This massive size, along with constant updates and a more permissive license than VirusTotal, makes VirusShare the perfect source for research malware data.

In addition to labelling the corpus, we also created an inverted index. This allows us to not only find the labels for malware in the corpus, but also easily find malware in the corpus with a given label. To build this index, we used PySpark to count up all the labels in the corpus, and created a table with each malware family and how many samples within that family exist in each “chunk” of the VirusShare corpus. If you’re interested in learning how we’ve done this, come on over to the BSidesLV Ground Truth track on August 3rd!
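The idea behind the inverted index and the per-chunk count table can be sketched in a few lines. We used PySpark for the real corpus; the sketch below swaps in plain Python dictionaries so it runs anywhere, and it is not our actual pipeline. The record layout, the chunk names, the hashes, and the family names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input records: (chunk name, sample hash, family label).
records = [
    ("chunk_000", "hash_a", "family_a"),
    ("chunk_000", "hash_b", "family_a"),
    ("chunk_000", "hash_c", "family_b"),
    ("chunk_001", "hash_d", "family_a"),
    ("chunk_001", "hash_e", "family_b"),
]


def build_indexes(records):
    """Build a forward index, an inverted index, and per-chunk family counts."""
    forward = {}                          # sample hash -> family label
    inverted = defaultdict(list)          # family label -> list of sample hashes
    chunk_counts = defaultdict(Counter)   # family label -> {chunk name: sample count}
    for chunk, sample_hash, label in records:
        forward[sample_hash] = label
        inverted[label].append(sample_hash)
        chunk_counts[label][chunk] += 1
    return forward, inverted, chunk_counts


forward, inverted, chunk_counts = build_indexes(records)

# Forward lookup: which family does a sample belong to?
label_of_c = forward["hash_c"]

# Inverted lookup: which samples carry a given label?
family_a_samples = inverted["family_a"]

# Count table: how many samples of a family sit in a given chunk?
family_a_in_chunk_000 = chunk_counts["family_a"]["chunk_000"]
```

The count table is what tells a researcher which chunks are worth downloading when they only need samples from one family.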