Is There Too Much Data for Machine Learning?

May 21, 2018

This is part two of three in our coverage of “Artificial Intelligence and Machine Learning Applied to Cybersecurity,” a new paper written by 19 artificial intelligence, machine learning and cybersecurity experts convened by IEEE. Part one can be found [here].

In 2014, the International Data Corporation reported that the total amount of digital data in the world was doubling every two years, and would reach 44 zettabytes (or 44 trillion gigabytes) by 2020.

That tremendous increase in quantity has some upsides for artificial intelligence (AI) and machine learning (ML). IoT devices, for example, generate a great deal of very specific data, giving us data sets that were never before available in such quantity.

When it comes to data that can train AI and ML systems, the experts we assembled said in their paper: “To be effective, security AI/ML algorithms must be trained on large, diverse training data sets.”

That sounds simple in theory, but it is quite complex in practice: “While large training data sets are often available, one challenge is the completeness of the data. Existing devices and networks were not originally designed with instrumentation and measurement as an integral feature; therefore, the data available from these devices and networks are not capturing critical conditions.”

We design devices with a specific function in mind, and that function is typically something other than generating data. That means data generation is an afterthought rather than something that’s carefully engineered. Not focusing on usable data creates a number of challenges, such as:

Incomplete data sets – consumer privacy concerns, government policies and regulation, and protection of proprietary information all lead to loss of usable data.
Time sensitivity – “Data sets must be continually updated so they include the most recent evolution of threat results.” Bad actors are always refining their methods, so even slightly outdated data is suboptimal.
Biased data – “Data collection techniques, by their very nature, often include unintended human and technical biases.” Keeping close tabs on and understanding those biases is important.
Lack of sharing – “No centralized, standardized, and qualified data warehouses for cybersecurity data currently exist that allow broad sharing across industry, government, and academia.” This is a complicated situation for a number of reasons, regulation obviously being one of them.
A focus on common threats – “Rare threat events, while potentially devastating, are often underrepresented in a probabilistic model that encompasses all threats.” It’s easy to focus on common threats at the expense of ones that are unlikely but highly risky.
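
The last point, that rare threats are underrepresented, can be illustrated with a small sketch. The event counts below are invented for illustration and do not come from the paper, and the "model" is deliberately naive: it always predicts the most common class. Under those assumptions, overall accuracy looks excellent even though every rare event is missed.

```python
from collections import Counter


def evaluate(labels, predictions, positive):
    """Return (overall accuracy, recall for the `positive` class)."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    # Predictions made for events whose true label is `positive`.
    on_positives = [p for y, p in zip(labels, predictions) if y == positive]
    hits = sum(1 for p in on_positives if p == positive)
    recall = hits / len(on_positives) if on_positives else 0.0
    return correct / len(labels), recall


# Hypothetical event log: 9,990 common events and 10 rare, high-impact ones.
labels = ["common"] * 9990 + ["rare"] * 10

# A naive model that always predicts the majority class.
majority = Counter(labels).most_common(1)[0][0]
predictions = [majority] * len(labels)

accuracy, rare_recall = evaluate(labels, predictions, positive="rare")
print(f"accuracy={accuracy:.3f}, rare recall={rare_recall:.1f}")
# prints: accuracy=0.999, rare recall=0.0
```

A metric that averages over all events hides the failure; tracking recall for the rare class separately is one simple way to make it visible.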

Keeping these challenges in mind is the first step toward counteracting them. The session also produced additional ideas for generating more standardized, usable data.

The first idea is to sponsor data warehouses, with support from analysts, to maintain data quality and facilitate feature engineering. Organizations could then drive toward international data storage standards, making it easier to share information with one another.

The second idea is to establish regulations, rules, norms and research frameworks for data sets coming from new areas like smart cities, smart cars and the IoT, in order to keep that data more standardized from the start.

In case you’re wondering, collecting less data was not a proposed solution. However, having the proper sensors in place to collect useful data was. With them, a measurement system could be built to determine the relevance and accuracy of the data, which would transform the way we approach machine learning.
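
The paper does not specify what such a measurement system would look like. As a rough sketch only, the example below scores a batch of records on two simpler proxies drawn from the challenges listed earlier: completeness (are all required fields present?) and freshness (is the record recent enough?). The schema in `REQUIRED_FIELDS`, the 30-day cutoff, and the sample records are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical schema: fields every record is expected to carry.
REQUIRED_FIELDS = ("timestamp", "source", "event_type")


def quality_report(records, now, max_age):
    """Return the fraction of complete records and the fraction of fresh ones."""
    if not records:
        return {"complete": 0.0, "fresh": 0.0}
    complete = sum(
        1 for r in records
        if all(r.get(field) is not None for field in REQUIRED_FIELDS)
    )
    fresh = sum(
        1 for r in records
        if r.get("timestamp") is not None and now - r["timestamp"] <= max_age
    )
    return {"complete": complete / len(records), "fresh": fresh / len(records)}


now = datetime(2018, 5, 21)
records = [
    {"timestamp": now - timedelta(days=1), "source": "sensor-a", "event_type": "login"},
    {"timestamp": now - timedelta(days=400), "source": "sensor-b", "event_type": "scan"},
    {"timestamp": now - timedelta(days=2), "source": None, "event_type": "scan"},
    {"timestamp": None, "source": "sensor-c", "event_type": "login"},
]

report = quality_report(records, now, max_age=timedelta(days=30))
print(report)
# prints: {'complete': 0.5, 'fresh': 0.5}
```

Measuring relevance and accuracy, as the experts propose, would need ground truth that this sketch does not have; completeness and freshness are simply the parts that can be checked from the records alone.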