May 21, 2018
This is part two of three in our coverage of “Artificial Intelligence and Machine Learning Applied to Cybersecurity,” a new paper written by 19 artificial intelligence, machine learning and cybersecurity experts convened by IEEE. Part one can be found [here].
In 2014, the International Data Corporation reported that the total amount of digital data was doubling every two years and would reach 44 zettabytes (or 44 trillion gigabytes) by 2020.
That tremendous increase in quantity has some upsides for artificial intelligence (AI) and machine learning (ML). IoT devices, for example, are generating a lot of very specific data, giving us data sets we have not previously had in such abundance.
When it comes to data that can train AI and ML systems, the experts we assembled said in their paper: “To be effective, security AI/ML algorithms must be trained on large, diverse training data sets.”
That sounds simple in theory, but it is quite complex in practice: “While large training data sets are often available, one challenge is the completeness of the data. Existing devices and networks were not originally designed with instrumentation and measurement as an integral feature; therefore, the data available from these devices and networks are not capturing critical conditions.”
We design devices with a specific function in mind, and that function is typically something beyond generating data. That means data generation is an afterthought, rather than something that’s carefully engineered. Not focusing on usable data creates a number of challenges, such as:
- Incomplete data sets – consumer privacy concerns, government policies and regulation, and protection of proprietary information all lead to loss of usable data.
- Time sensitivity – “Data sets must be continually updated so they include the most recent evolution of threat results.” Bad actors are always refining their methods, so even slightly outdated data is suboptimal.
- Biased data – “Data collection techniques, by their very nature, often include unintended human and technical biases.” Keeping close tabs on and understanding those biases is important.
- Lack of sharing – “No centralized, standardized, and qualified data warehouses for cybersecurity data currently exist that allow broad sharing across industry, government, and academia.” This is a complicated situation for a number of reasons, regulation obviously being one of them.
- A focus on common threats – “Rare threat events, while potentially devastating, are often underrepresented in a probabilistic model that encompasses all threats.” It’s easy to focus on common threats at the expense of ones that are unlikely but highly risky.
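To make the last point concrete, here is a minimal sketch of how a rare threat can vanish inside an aggregate score. The event counts and the `always_common` model are hypothetical, chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical event log: 9,990 common events and 10 rare ones.
# These counts are an assumption made for illustration, not data from the paper.
labels = ["common"] * 9990 + ["rare"] * 10


def always_common(_event):
    """A naive model that predicts the majority class for every event."""
    return "common"


predictions = [always_common(event) for event in labels]

# Overall accuracy: share of all events labeled correctly.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the rare class: share of rare events the model actually caught.
rare_total = sum(y == "rare" for y in labels)
rare_caught = sum(p == "rare" and y == "rare" for p, y in zip(predictions, labels))
rare_recall = rare_caught / rare_total

print(f"accuracy: {accuracy:.3f}")
print(f"rare recall: {rare_recall:.3f}")
```

Under these assumed counts, the model scores 0.999 on accuracy while catching none of the rare events, which is why a single overall metric can hide exactly the threats that matter most.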
Keeping these challenges in mind is the first step in counteracting them. The session also produced ideas for generating more standardized, usable data.
The first is sponsoring data warehouses, with support from analysts, to maintain data quality and facilitate feature engineering. Organizations could then drive toward international data storage standards, making it easier to share information across organizations.
The second is establishing regulations, rules, norms and research frameworks for data sets coming from new areas like smart cities, smart cars and the IoT, in order to keep that data more standardized from the start.
In case you’re wondering, collecting less data was not a proposed solution. However, having the proper sensors in place to collect useful data was. With them, a measurement system could be built to determine the relevance and accuracy of the data, which would be transformative for the way we approach machine learning.
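As a rough illustration of what such a measurement system might start with, the sketch below scores a batch of records on two simple proxies: completeness and freshness. These two metrics stand in for the "relevance and accuracy" the experts describe and are our own simplification; the field names, the `completeness` and `freshness` functions, and the 30-day window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical schema: the fields every record is assumed to need.
REQUIRED_FIELDS = ("device_id", "timestamp", "event_type")


def completeness(records):
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for r in records
    )
    return complete / len(records)


def freshness(records, now, max_age):
    """Fraction of records whose timestamp falls within max_age of now.

    Records with no timestamp are counted as stale.
    """
    if not records:
        return 0.0
    fresh = sum(
        r.get("timestamp") is not None and now - r["timestamp"] <= max_age
        for r in records
    )
    return fresh / len(records)


# Made-up sample records, for illustration only.
now = datetime(2018, 5, 21)
records = [
    {"device_id": "cam-1", "timestamp": datetime(2018, 5, 20), "event_type": "login"},
    {"device_id": "cam-2", "timestamp": datetime(2018, 1, 1), "event_type": "scan"},
    {"device_id": "", "timestamp": datetime(2018, 5, 19), "event_type": "login"},
    {"device_id": "cam-3", "timestamp": None, "event_type": "scan"},
]

print(f"completeness: {completeness(records):.2f}")
print(f"freshness: {freshness(records, now, timedelta(days=30)):.2f}")
```

A real system would need far richer checks, but even simple scores like these give a number to track as sensors and collection practices improve.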




