Creating an intelligent “sandbox” for coordinated malware eradication
Hello from China where I am presenting on coordinated malware eradication at the 2014 PC Security Labs Information Security Conference.
Coordinated malware eradication was also the topic of my last blog post. I said the antimalware ecosystem must begin to work with new types of partners if we are going to move from the current state of uncoordinated malware disruption to a state of coordinated malware eradication. Since then we’ve been talking about these ideas at conferences around the world, including the recent RSA Conference in San Francisco, the Digital Crimes Consortium in Singapore, and the APCERT AGM & Conference in Taipei. The level of engagement across the antimalware ecosystem has been high. Security and antivirus (AV) vendors, service providers, Computer Emergency Response Teams (CERTs), anti-fraud departments, and law enforcement have all joined the conversation, asking the essential questions about governance, communication channels, and benefits.
The overall theme of these discussions has been how we can take the information we have and correlate it in new ways – a topic that lends itself to machine learning and big data analysis in the cloud. I believe this can be the most effective way to accelerate our malware eradication efforts. This raises the next question: how do we create an intelligent “sandbox” where we can do this work?
For some time now, antimalware companies have been applying machine learning and big data analysis to generate more malware detections faster. Machine learning is all about training a machine to find patterns of signals in large streams of labeled information, then using those patterns against future data, all the while using feedback to continuously improve its accuracy. The stronger the labels, and the more diverse the information, the more effective the machine becomes.
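To make the train, apply, and feedback loop concrete, here is a minimal sketch in Python. It is a toy perceptron over sets of string signals, not a description of how any production antimalware system works. The class name, the signal names, and the labels are all assumptions made up for illustration.

```python
class SignalClassifier:
    """Toy perceptron over sets of string signals (all names are hypothetical)."""

    def __init__(self):
        self.weights = {}  # signal name -> learned weight
        self.bias = 0.0

    def score(self, signals):
        # Sum the weights of the signals present; unseen signals count as 0.
        return self.bias + sum(self.weights.get(s, 0.0) for s in signals)

    def predict(self, signals):
        return "malware" if self.score(signals) > 0 else "clean"

    def learn(self, signals, label):
        """Feedback step: adjust weights only when the prediction was wrong."""
        if self.predict(signals) == label:
            return
        step = 1.0 if label == "malware" else -1.0
        self.bias += step
        for s in signals:
            self.weights[s] = self.weights.get(s, 0.0) + step

    def train(self, samples, epochs=10):
        # Repeated passes over the labeled data, correcting mistakes each time.
        for _ in range(epochs):
            for signals, label in samples:
                self.learn(signals, label)


# Illustrative labeled samples; real labels would come from analysts and partners.
labeled = [
    ({"packed", "writes_autorun", "no_signature"}, "malware"),
    ({"signed", "known_publisher"}, "clean"),
    ({"packed", "contacts_new_domain"}, "malware"),
    ({"signed", "packed"}, "clean"),
]

model = SignalClassifier()
model.train(labeled)

# Apply the learned patterns to a sample the model has not seen before.
verdict = model.predict({"packed", "writes_autorun"})

# Feedback: a (hypothetical) analyst corrects a verdict and the model adjusts.
model.learn({"signed", "writes_autorun"}, "malware")
```

The point of the sketch is the shape of the loop rather than the algorithm: stronger labels and more varied signals give `learn` more to work with, which is the same reason partner-contributed signals matter.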
Machine learning is similar to how I see people learn. For instance, when toddlers look at animals, at first they all appear to be the same. Then they learn to distinguish dogs from cows, for example. Pretty soon they can tell poodles from retrievers too. We correct them as necessary, and over many repetitions, they soon start to find more efficient identification patterns. In machine learning terms, we’d say the toddlers were trained with labeled information. They extracted patterns of signals from the animals, and then applied these patterns against the new animals that they saw.
Humans do this intuitively and naturally, whereas machines require complex algorithms and training against huge data sets. Currently in the antimalware business, we have three main sources of machine learning signals: voluntarily opted-in telemetry data on encountered malware threats, our analysis of the malicious files, and malware signals from our partners.
To give you a sense of the volume and scale I am talking about, each month the Microsoft Malware Protection Center’s (MMPC) machine learning systems analyze more than 30 million different file samples, and correlate this with what we know about the associated files, websites, and usage patterns. Our systems classify the file samples and then automatically create and deploy signatures for those identified as malware. The huge pipeline of signals makes it possible for us to quickly spot new malware. When we combine this with insights from our in-house AV researchers, our machines get smarter, and our customers receive greater protection.
We are using machine learning advances with the cloud too. For instance, we automatically recognize files showing tell-tale patterns of malicious intent. Cloud-based machines correlate that suspicious behavior with the reputation of the particular software being used to decide if AV software should intervene to block – faster, better, and more efficiently than a client computer could perform the check. In many cases we are able to protect clients even before detection signatures are delivered.
Although machine learning has already contributed significantly to malware protection, I believe that complete eradication of malware families will fail unless we determine how to identify specific attackers, and how to track a given malware family’s malicious activity across its entire lifecycle. The AV industry needs to understand how a malware family is developed and distributed, how it is controlled, how it responds to changes, and how it is monetized.
To answer these questions, we’ll need our machines to correlate more than telemetry, analysis, and the types of signals traditional security vendor partners provide. This is where coordinated malware eradication partnerships come into play. By working together and correlating our signals, we can see the bigger picture and identify appropriate choke points – weak spots for the malware writers.
Figure 1: The antimalware ecosystem’s coordinated malware eradication
The next question is where we will accomplish this goal. As I said above, we need a “sandbox” big enough for every industry partner to contribute a variety of signals and deploy their own machine learning and analysis tools. On top of our telemetry and analysis data, Microsoft can also contribute large amounts of scalable cloud-based storage and computing horsepower, with the necessary big data analysis tools built in. Our partners can contribute new information signals, strong labels, and their own tools to better train all of the machines.
For example, take your typical click-fraud attack. An advertising network can see the URLs being abused, the bank accounts in use, and the websites involved. A CERT or ISP can see parts of the command and control system – URLs, files being served, domain registrars, etc. AV vendors can see the client code and the URLs it is working with. Individually no one party has enough to identify the entirety of the attack. But when seen together, the correlation (in this example at least) is pretty easy to spot.
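The click-fraud example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each party submits a set of indicators (URLs, accounts, files) and that reports sharing any indicator belong to the same attack. The function name `correlate`, the report format, and every indicator value below are hypothetical.

```python
from collections import defaultdict


def correlate(reports):
    """Group reports that share at least one indicator, then keep the
    groups that span more than one party.

    `reports` is a list of (party, set_of_indicators) tuples.
    """
    parent = list(range(len(reports)))  # union-find forest over report indexes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # indicator -> index of the first report that mentioned it
    for idx, (_, indicators) in enumerate(reports):
        for indicator in indicators:
            if indicator in first_seen:
                # Same indicator seen before: merge the two groups.
                parent[find(idx)] = find(first_seen[indicator])
            else:
                first_seen[indicator] = idx

    groups = defaultdict(list)
    for idx in range(len(reports)):
        groups[find(idx)].append(idx)

    results = []
    for members in groups.values():
        parties = {reports[i][0] for i in members}
        if len(parties) > 1:
            indicators = set().union(*(reports[i][1] for i in members))
            results.append((parties, indicators))
    return results


# Made-up reports: no single party sees the whole attack.
reports = [
    ("ad_network", {"url:ads.example/click", "site:fake-news.example", "account:ACCT-1"}),
    ("cert", {"url:c2.example/gate", "registrar:REG-9", "file:payload.bin"}),
    ("av_vendor", {"url:c2.example/gate", "url:ads.example/click"}),
    ("isp", {"url:unrelated.example"}),
]

campaigns = correlate(reports)
```

In this made-up data the AV vendor's client-side URLs are the link between what the advertising network and the CERT each see, so the three reports merge into one picture while the unrelated ISP report stays out of it.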
Figure 2: Putting machine learning against massively correlated signals means we can go on the offensive
Putting machine learning to use at these huge scales against massively correlated signals means we can go on the offensive. Hopefully it will leave the bad guys with nowhere to go. It will allow us, as an industry, to blunt the efforts of the malware authors and their supply chains, and to block their attempts to game and steal from our customers.
I encourage you to join the conversation. We will be holding roundtable discussions at a few more upcoming events. The latest schedule is below. If you would like to attend a discussion, email us at [email protected]
Partner PM Manager
Upcoming roundtable discussions:
- PC Security Labs Conference 2014, April 1, 2014 – April 2, 2014 Beijing, China
- CARO Workshop, May 15, 2014 – May 16, 2014 Melbourne, FL
- 26th Annual FIRST Conference, June 22, 2014 – June 27, 2014 Boston, MA
- Microsoft Security Research Alliance Summit, July 22, 2014 – July 24, 2014 Seattle, WA (invite only, NDA required)