An inside look: gathering and analyzing the SIR data
At the Microsoft Malware Protection Center, threat data is a critical source of information to help protect our customers. We use it to understand what’s going on in the overall malware ecosystem, determine the best way to protect our customers, and find the most effective way to deliver that protection.
We also use the data to produce a number of reports to help our customers. This includes our bi-annual Security Intelligence Report (SIR). This blog post gives you a behind-the-scenes look at how we collect, analyze and evolve our SIR to better serve the needs of our customers.
Collecting the data
We start by pulling together all the data needed to generate the report. Through our real-time protection (RTP) products and the monthly Malicious Software Removal Tool (MSRT), we receive valuable data when customers opt in to help improve protections by sharing their malware encounters. In Windows Defender, for example, you can view this option under Settings > MAPS.
We use this information in a number of ways. For example, reports from the MSRT help us compute infection rates (the number of machines per thousand in which we detect malware). Encounter rates (the percentage of computers running our RTP products that report encountering a threat) are calculated using RTP product data.
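To make the two rates concrete, here is a minimal sketch of the arithmetic in Python. The function and parameter names are our own illustrative choices, not part of any MSRT or RTP tooling, and the counts in the usage lines are made-up numbers rather than SIR figures.

```python
def infection_rate_per_thousand(machines_with_detections: int, machines_scanned: int) -> float:
    """Machines on which malware was detected, per 1,000 machines scanned."""
    if machines_scanned <= 0:
        raise ValueError("machines_scanned must be positive")
    return 1000 * machines_with_detections / machines_scanned


def encounter_rate_percent(machines_with_encounters: int, machines_reporting: int) -> float:
    """Percentage of reporting machines that encountered a threat."""
    if machines_reporting <= 0:
        raise ValueError("machines_reporting must be positive")
    return 100 * machines_with_encounters / machines_reporting


# Illustrative, made-up counts (not real telemetry):
example_infection_rate = infection_rate_per_thousand(72, 10_000)
example_encounter_rate = encounter_rate_percent(1_900, 10_000)
```

The two rates use different denominators on purpose: one is based on machines scanned by the MSRT, the other on machines running RTP products, so they should not be compared directly.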
Both infection rates and encounter rates are reported in the SIR. For each reporting period we break down the data based on a number of different categories: by country, by platform, by malware, and so on. We use our big data platform (COSMOS), which is the same platform that powers Bing, to do a lot of the data processing and aggregating work. COSMOS can take the terabytes of telemetry we receive and turn it into organized, structured, and consumable groupings.
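As a rough illustration of what "breaking down the data by category" means, the sketch below counts distinct machines per category value in plain Python. This is not how COSMOS works or what its jobs look like; the record layout, the field names (machine_id, country, platform) and the placeholder values are all assumptions made for the example.

```python
from collections import defaultdict


def distinct_machines_by(reports, key):
    """Count distinct machines for each value of `key` (e.g. "country")."""
    machines = defaultdict(set)
    for report in reports:
        # A machine may send several reports; the set keeps it counted once.
        machines[report[key]].add(report["machine_id"])
    return {value: len(ids) for value, ids in machines.items()}


# Hypothetical records with placeholder values (not real telemetry):
reports = [
    {"machine_id": "m1", "country": "country-1", "platform": "platform-A"},
    {"machine_id": "m1", "country": "country-1", "platform": "platform-A"},
    {"machine_id": "m2", "country": "country-1", "platform": "platform-B"},
    {"machine_id": "m3", "country": "country-2", "platform": "platform-A"},
]

by_country = distinct_machines_by(reports, "country")
by_platform = distinct_machines_by(reports, "platform")
```

Counting distinct machines rather than raw reports matters here, because both rates in the SIR are expressed per machine.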
Once we have our data organized, we can begin our analysis. This is where the powerful Excel Power BI tools come into play. We have to say, this is our favorite part of the process. Think of it like you’re trying to solve a mystery, without knowing what mystery you are trying to solve. We start by asking a series of questions about the data: What caused a dip in the trend here, or a spike there? Why were the encounter rates for this platform a particular value for this time period? Why did some infections affect a particular country, but not others?
Sometimes the questions have obvious answers, and sometimes they don’t. Despite the abundance of telemetry that we have, the types of data we collect are limited. Sometimes we come across a question that we don’t know the answer to. For example, why is Conficker still such a prevalent threat in enterprise PCs after all these years (it’s second among the top ten families encountered in domain-joined computers for the first half of 2014)? We discovered Conficker in 2009; it’s been in MSRT since the same year, and yet it’s still prevalent. Why haven’t we totally killed it yet?
We have strong antimalware detection signatures for Conficker, and the family is effectively starved (it no longer communicates back to its authors). But Conficker does use a variety of distribution methods: removable drives, an OS vulnerability, spam, and weak passwords. It was this last condition that we found to be the answer, as we blogged about when we released the SIR in 2012.
Another question might be, why is FakePAV, a rogue family, among the top ten encountered families for Denmark and Norway? Rogues have undergone a dramatic decline in the last few years (if you’re a regular reader of the SIR, you will have noticed this trend), yet Denmark and Norway, which have two of the lowest encounter rates among all countries, still have FakePAV in their top ten lists. Our theory is that rogues exist to steal money from people, so from a bad guy’s perspective it makes sense to target wealthier countries. In the past four quarters, FakePAV only appeared in 16 countries, the majority of which are in the G20.
One last thing we consider when we’re doing our data gathering and analysis for the SIR is – as with all our products – updates and improvements. These improvements help us to gather more accurate data, and more accurate data improves our ability to provide actionable guidance to help protect our customers.
For example, in October 2014, we increased our MSRT sampling rate for our entire population from 0.1% to 10%, and then to 100%. This means that all the infection reports we get from MSRT are a 100% accurate representation of the ecosystem, whereas we previously had to extrapolate infection rates based on assumptions we made about the entire population of our customers. With our improved sampling, we can now say exactly what the infection rate is and exactly which types of computers are at a greater risk of infection. Not only does this help us prioritize our protection efforts, it also helps us to more accurately gauge infection rates over time, including how different malware families impact our customers differently, and the best way to stop them.
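To show why the sampling rate matters, here is a small Python sketch of extrapolating from a sample. It assumes the sample is drawn uniformly at random from the population, which is a simplification and not a description of how MSRT sampling actually works; the function names are hypothetical. The margin uses the standard normal approximation for a proportion.

```python
import math


def estimate_population_count(sampled_count: float, sampling_rate: float) -> float:
    """Scale a count seen in a random sample up to the whole population."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    return sampled_count / sampling_rate


def proportion_margin_95(sample_proportion: float, sample_size: int) -> float:
    """Approximate 95% margin of error for a proportion measured on a sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    variance = sample_proportion * (1 - sample_proportion) / sample_size
    return 1.96 * math.sqrt(variance)
```

At a sampling rate of 1.0 (100%), estimate_population_count returns the observed count unchanged, so no extrapolation is needed. At 0.001 (0.1%), every sampled report stands in for roughly a thousand machines, and the margin of error grows as the sample shrinks, which is most noticeable for small groups such as a single country or platform.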
Another recent change was the tightening of our objective criteria for adware. This meant dropping the word “potentially” from the phrase “potentially unwanted software”. Software is either wanted by the user or not. Because of this change, we now track adware encounter rates in the SIR to provide a better picture of adware prevalence, as compared to other types of threats.
The SIR is just like any other product or service: we are continuously striving to improve our features, delivery, and output. There’s a wealth of data in the latest version of the SIR. It is our hope that by sharing our data and insights with our customers, we can help them create better security policies and programs that more effectively protect against threats for their clients, organization or region. If you haven’t seen the latest Security Intelligence Report, we encourage you to visit www.microsoft.com/sir and download your free copy today.
Ina Ragragio and Joe Blackbird
Microsoft Malware Protection Center