Modern cybersecurity technologies produce massive quantities of data, which requires rethinking how to store and manage all the different types of information being generated. Many cybersecurity platforms are increasingly relying on one of two database technologies — graph or streaming databases — to efficiently represent and query databases of threat indicators, asset inventories, and other critical cybersecurity information.
Graph databases allow for the properties and relationships of various objects — whether threat groups, devices on the network, or software vulnerabilities — to be connected and searchable. Streaming database technology allows real-time threat data and status updates to be efficiently processed and stored. Both technologies help companies move beyond the lists used by defenders in the past to track everything and to do so in real time.
"All of us who work in this field have long lamented the difficulty of defending against cyber intruders, but there hasn't been a single moment of change, just a gradual increase in complexity over time," says Irene Michlin, staff engineer and application security lead at Neo4j, a graph-database provider. "We've reached that tipping point in difficulty, where data has become evermore interconnected with 'many-to-many' relationships."
The changing nature of data collection and use in cybersecurity has forced a move to other approaches to storing and processing data. Social networks of threat actors, connected assets in defenders' networks, and indicators of compromise are all types of data in which the relationships among the elements of the dataset are extremely important.
Graph databases allow for the efficient representation and querying of relationships among data entities — critical in cybersecurity for detecting patterns such as fraud or network intrusions, says Weimo Liu, CEO of graph-engine maker PuppyGraph.
"Attackers exploit the network's interconnectedness, viewing it as a graph to identify and leverage vulnerabilities," he says. "By adopting a similar graph-based perspective, defenders can significantly enhance their security posture."
Cybersecurity Data Needs Better Representation
Various graph representations of data evolved along with network data models in the 1970s, object-oriented databases in the 1980s, and graph database models in the 1990s, according to one survey of graph database models. Modern graph database management systems kicked off with Neo4j in 2007, which released version 1.0 in 2010.
In the mid-2010s, cybersecurity practitioners began looking at graph databases as a natural representation of the relationships among business assets, cybersecurity properties such as vulnerabilities, and the threat landscape. John Lambert, a distinguished engineer at Microsoft's Threat Intelligence Center, put it this way in 2015: "Defenders think in lists. Attackers think in graphs. As long as this is true, attackers win."
"Most defenders focus on protecting their assets, prioritizing them, and sorting them by workload and business function," he said. "Defenders are awash in lists of assets: in system management services, in asset inventory databases, in BCDR (business continuity, disaster recovery) spreadsheets. There's one problem with all of this. Defenders don't have a list of assets; they have a graph."
The amount of cybersecurity-related data generated by business operations is huge. One of the primary challenges in dealing with this data and using graphs for cybersecurity is managing the complexity and volume of data, says PuppyGraph's Liu.
"Cybersecurity environments generate vast amounts of data from various sources, including network traffic, logs, and threat intelligence feeds," he says. "Modeling this data as graphs can quickly lead to large, complex structures that are difficult to analyze and interpret."
The average company tracks about half of its total log volume and hopes to track up to 80% in the next few years, according to a survey conducted by consultancy McKinsey.
Visualizing Security Threats
Graph databases lend themselves to visualizing data and the relationships among its elements. In cybersecurity, that visualization helps defenders identify and mitigate vulnerable nodes, Liu says.
"By providing a graphical representation of network topology, graphs reveal potential vulnerabilities and the possible spread of threats within a network, thereby offering invaluable insights into complex network structures," he says.
Streaming databases help process information in real time and make decisions based on that data, such as with anti-fraud systems used by financial institutions, says Rayees Pasha, head of product at RisingWave Labs, a streaming-database provider.
"You now have advanced techniques to catch fraud — not just whether a person is who he or she claims based on a simple password and username and the logging-on system," he says. "It's also about where they're logging on ... what location and what time of the day. There may be multiple other aspects that you're taking into consideration, and now you need a database that can connect these dots."
Concepts, Not Implementations, Are Key
While many graph and streaming database services are proprietary, such as AWS Kinesis, open source efforts have set the bar and are catching up. For streaming databases, Apache Kafka is perhaps the best-known platform, not just for storing data but for creating an entire processing pipeline.
The development of new graph database platforms has led to a variety of proprietary ways of representing graphs, but relational databases are catching up as well. The latest version of the SQL standard (SQL:2023), for example, adds a new part called Property Graph Queries (SQL/PGQ), a sub-language for defining property graphs and querying them.
An implementation of the new language as a plug-in for the relational database DuckDB showed "encouraging performance and scalability," according to a paper by a research team at the Centrum Wiskunde & Informatica institute in the Netherlands.
"It’s our view that more and more data challenges require a graph database, especially when your data is highly connected and there is a need to traverse through those multiple connections," Neo4j's Michlin says. "They find hidden relationships and patterns across billions of data connections deeply, easily, and quickly, and they are a natural fit for this kind of data."