The Case for Classifiers: Why Machine Learning is Perfect for Digital Risk Monitoring

“Fire-stick hunting” is an ancient method of hunting in which humans burned entire forests to the ground. After the blaze, the hunters picked through the embers for carcasses or sought out surviving animals, now with nowhere to hide. If your security and risk teams are searching for digital risks (that is, any security or business risk in the online world outside of your network perimeter), chances are they are practicing the “fire-stick” method of identification. (If they aren’t searching for digital risks, they should be. Digital risk monitoring has become a staple for every modern organization grappling with unregulated, dynamic channels, such as social media, on which threats can be built and launched, and can damage your organization, all without triggering a single alarm.)

Some quick definitions: when we say digital risks, we’re talking about things like phishing links, customer scams, piracy, leaked PII, attacker chatter, physical threats, and impersonations. When we say digital channels, we’re talking about social networks like LinkedIn, Facebook and Twitter as well as other channels external to your regulatory jurisdiction and line of visibility, like Pastebin and Reddit.


Fig. 1: The scale of social media is immense, and it’s growing every day.

What does the fire-stick method look like in the world of digital risk monitoring? Many organizations ingest digital and social media data using keyword combinations. They will look for their brand name or an executive’s name in combination with a risky term like “click now!” (scam), “doxx” (attacker chatter, leaked PII), or “kill” (physical threat). This often results in huge volumes of data (burning the entire forest down) with little progress made towards efficiently identifying real risks. People use language in all sorts of euphemistic or colloquial ways that render keyword searching ineffective. Someone might tweet, “Went to BrandXYZ today: it was great, they killed it! #killingit” or “click here to see my recent article on one of my heroes BrandXYZ’s CEO, John Smith!”
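
To make the false positive problem concrete, here is a minimal sketch of keyword-combination matching run against the two example posts above. The brand name, the term list, and the function name keyword_alert are illustrative assumptions, not a description of any particular listening tool.

```python
# Hypothetical brand and risky-term list, chosen to mirror the examples above.
BRAND = "BrandXYZ"
RISKY_TERMS = ["click", "doxx", "kill"]


def keyword_alert(post: str) -> bool:
    """Flag a post if it mentions the brand together with any risky term."""
    text = post.lower()
    return BRAND.lower() in text and any(term in text for term in RISKY_TERMS)


posts = [
    "Went to BrandXYZ today: it was great, they killed it! #killingit",
    "click here to see my recent article on one of my heroes BrandXYZ's CEO, John Smith!",
    "Had lunch near the BrandXYZ office.",
]

# Collect every post the naive keyword rule would raise as an alert.
alerts = [post for post in posts if keyword_alert(post)]

for post in alerts:
    print("ALERT:", post)
```

Both of the harmless example posts are flagged, because substring matching cannot tell “they killed it” from a threat. Multiply that by every mention of a popular brand and the alert queue fills with noise.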

The false positive rate in such a huge set of data is off the charts. A simple listening tool will regularly generate tens of thousands of alerts, creating an immense amount of work for security and risk professionals. If the person or brand has a commonly occurring or oft-mentioned name, like “John Smith” or “Apple,” the false positive problem is exacerbated. What’s worse, some organizations are doing this manually, without a tool to help ingest and analyze the data: the equivalent of fire-stick hunting with a single match during a hurricane.

Enter machine learning. To continue our metaphor, machine learning is like hunting with GPS, ATVs and a high-powered rifle with a scope. Machines, by rapidly comparing vast volumes of data at scale, can identify subtle differences between risks and non-risks and prioritize them accordingly. A well-trained machine learning classifier can assess hundreds of minute characteristics, prioritize which characteristics are most telling, and produce a classification probability based on the aggregate characteristics and their relative weight. The classifier, at least for binary supervised algorithms, returns a yes-or-no classification: risky or not risky. ZeroFOX machine learning classifiers boast accuracy in the near 100% range. We can consistently and accurately identify malicious content like impersonating profiles, violent posts and scams without unleashing a flood of false positives.
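
As a rough illustration of how weighted characteristics become a probability and then a yes-or-no answer, here is a minimal logistic-style sketch. The feature names, weights, bias, and threshold are all made up for the example; a real classifier learns its weights from labeled training data, and this is not a description of ZeroFOX’s models.

```python
import math

# Hypothetical weights: positive values push toward "risky",
# negative values push toward "not risky". A trained model would learn these.
WEIGHTS = {
    "has_shortened_link": 1.8,
    "account_age_under_7_days": 1.2,
    "mentions_brand": 0.3,
    "verified_account": -2.5,
}
BIAS = -1.5  # hypothetical baseline: most content is not risky


def risk_probability(features: dict) -> float:
    """Combine weighted features into a probability with the logistic function."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(features: dict, threshold: float = 0.5) -> bool:
    """Return True (risky) when the probability meets the threshold."""
    return risk_probability(features) >= threshold


suspicious = {"has_shortened_link": 1, "account_age_under_7_days": 1,
              "mentions_brand": 1, "verified_account": 0}
benign = {"has_shortened_link": 0, "account_age_under_7_days": 0,
          "mentions_brand": 1, "verified_account": 1}

print(classify(suspicious), round(risk_probability(suspicious), 3))
print(classify(benign), round(risk_probability(benign), 3))
```

Both posts mention the brand, but the weighting of the other characteristics separates them. That is the difference from a keyword rule, which would treat the two identically.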

Lots of technology companies like to use the phrase “machine learning” or “advanced algorithms,” and we’re guilty as charged. Some companies truly use machine learning effectively; many do not. The trick to figuring out whether the claim is hot air is to assess whether the problem being solved warrants a machine learning approach. For identifying risks on social media, the glove fits perfectly; it’s the right tool for the job. Huge volumes of dynamic data with extremely nuanced variables warrant a non-manual, data science approach.

Machine learning techniques are also superior to traditional ones because they can predict whether something is malicious without ever having encountered it previously. AV signature databases, whitelists and blacklists are examples of techniques that require something malicious to occur before the new data can be incorporated into the defense system. For many security and risk professionals, letting a single breach through is the difference between a promotion and unemployment. Margins are razor thin, and building proactive solutions rather than reactive ones pays serious dividends.

A machine learning classifier relies on assessing a broad set of variables to determine which category an unseen piece of data ultimately falls into. A classifier trained on known phishing links, for instance, can analyze an unknown phishing link, assess it based on a slew of important characteristics (like redirects, IP location, path length, URL traits and more) and make a classification without ever having processed the link before. A blacklist will miss that link every time.
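
The contrast with a blacklist can be sketched in a few lines. The example below extracts a handful of URL traits using only Python’s standard library; the specific features, the blacklist entry, and the sample URL are assumptions chosen for illustration. Redirects and IP location are left out because they require network lookups.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical blacklist holding only links that were seen and reported before.
BLACKLIST = {"http://known-bad.example/login"}


def url_features(url: str) -> dict:
    """Extract simple lexical traits a classifier could be trained on."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "path_length": len(parsed.path),
        "host_is_ip": host_is_ip,
        "subdomain_count": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


# A never-before-seen link (192.0.2.0/24 is a reserved documentation range).
unseen = "http://192.0.2.10/secure/account/verify/login.php"

print("On blacklist:", unseen in BLACKLIST)
print("Features:", url_features(unseen))
```

The blacklist lookup has nothing to say about the new link, while the extracted traits (a raw IP address as host, a long path, no HTTPS) give a trained classifier something to score.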

This benefit of machine learning is perfect for identifying digital risks because the dataset is incredibly dynamic. New posts show up all the time and risks grow and evolve at the speed of the internet. With regular classifier re-trainings, machine learning techniques can remain one step ahead of the adversary.


Fig. 2: Which comes first, the labelling or the classifier?

Machine learning is not without its drawbacks. In order to train the classifier, you need a large quantity of pre-labeled data. This is often the most difficult part of building a machine learning system. You arrive at a classic data science chicken-or-the-egg dilemma. To train the classifier, you need a pre-labeled set of data. To label data, you need a trained classifier.

To overcome this problem, data scientists need to do one of the following:

  1. Manually label a corpus of data to be used as a training set. This is time consuming and cumbersome.
  2. Leverage a high-quality, open-source corpus of pre-labelled data. For net-new problems, like digital risk monitoring, these corpora do not exist. If they do, someone has likely already attempted to create a machine learning solution.
  3. Get creative on finding a source of data that fits the goal of the classifier. For instance, one might use data from Reddit to analyze phishing links on the assumption that upvoted links are always non-malicious. This is only somewhat true and this method results in quirks and issues downstream.
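
The third option can be sketched as a simple labeling heuristic. The Submission type, the upvote threshold, and the sample data below are hypothetical; the sketch only shows the shape of the idea and where its weakness lies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    url: str
    upvotes: int


def weak_label(sub: Submission, benign_min_upvotes: int = 100) -> Optional[int]:
    """Label highly upvoted links as benign (0); leave everything else unlabeled.

    This encodes the assumption that upvoted links are non-malicious,
    which is only somewhat true, so the resulting labels are noisy.
    """
    return 0 if sub.upvotes >= benign_min_upvotes else None


# Hypothetical submissions for illustration only.
submissions = [
    Submission("https://news.example/article", upvotes=2400),
    Submission("https://blog.example/post", upvotes=3),
    Submission("http://promo.example/free-gift", upvotes=150),
]

# Keep only the submissions that received a label.
training_rows = [(s.url, weak_label(s)) for s in submissions
                 if weak_label(s) is not None]
print(training_rows)
```

Note that the third link is labeled benign purely because it was upvoted. If it were actually a scam, that error would flow straight into the training set, which is the kind of downstream quirk described above. The heuristic also yields only benign examples, so malicious examples would have to come from another source.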

At ZeroFOX, we invest an immense amount of time earning our technical chops as data scientists. Our research team boasts years of experience, PhDs, and a love for data science. We use a combination of keyword filtering and machine learning. Keywords reduce the total population of data; machine learning distills the risk from the noise.

ZeroFOX hunts for risk the modern way. With machine learning, burning down forests of data is a thing of the past.