Benefits of using a graph database to map data lineage

Graph databases are a great solution for mapping data lineage as they’re built around nodes and the relationships between nodes.

18 August 2024

One of my first contract roles was in data management for a large UK bank, where I used graph databases to keep track of data lineage.

What are graph databases?

Graph databases are a type of NoSQL or non-relational database. This means that the data isn’t stored in tables like in a conventional relational database. Instead, it’s stored as nodes, relationships and properties.

Here’s an example graph schema from Neo4j:

Nodes:

Represented by the circles, they can represent objects, things, people, products etc. They are labelled - see the grey boxes by each node.

Relationships:

These are represented by lines, which illustrate how the nodes are connected. The lines have a direction, indicated by an arrow, and a type, specified by the text under each relationship.

Properties:

These are specific attributes that apply to both nodes and relationships, they are noted using key:value pairs and can be seen in the black speech bubbles.

What is data lineage?

Data lineage is the documentation of data flow from source, all the way through the intermediate systems, processes and transformations, to the final output. This means that every single piece of data represented on a report is fully defined, and can be traced back to its source. At some large companies, this can be a long journey from mainframe systems (yes, mainframe!), to the cloud, data lakes and data warehouses.

Not only is data lineage useful for tracing upstream data issues, it’s a regulatory requirement for some financial institutions.

Benefit 1
Graph databases can be queried using Cypher Query Language

This means you don’t have to write self-joining or recursive SQL queries to get your data lineage.

Just imagine for a second that your data lineage is stored in a relational database table. It would look something like this:

data_lineage.loans

source_table	source_field	destination_table	destination_field
MF.LOANS	BAL	AWS.LOANS	balance
AWS.LOANS	balance	warehouse.reporting	loan_balance

Data moves from the source table MF.LOANS to AWS.LOANS to warehouse.reporting, while the field name also changes along the way.

This example is just two hops for a single field, with no transformations. That’s not how it works in the real world!

I’ve seen as many as 30 hops for a field, and when you add transformations that happen along the way it starts to get difficult to keep track. Just imagine a loan monthly payment field for example, you need to trace back loan balance, interest rates, currency and other fields all the way back to their source and track where they combine.

Now let’s suppose that you want to trace back the final loan_balance field from reg_reporting.loans to its source. The SQL query would look like this self-join, with another self-join needed for each hop back to the data source. As you can see this isn’t sustainable or user friendly.

SELECT b.source_table, b.source_field
FROM (
    SELECT source_table, source_field
    FROM data_lineage.loans
    WHERE destination_table = 'warehouse.loans' 
    AND destination_field = 'loan_balance'
) a
LEFT JOIN data_lineage.loans b
ON a.source_table = b.destination_table
AND a.source_field = b.destination_field;

You would have to keep self-joining like above until you get to the source, and even then, it’s hard to visualise what’s going on.

The same data lineage can be represented in a graph database like this:

To query this using Cypher Query Language , all we would need is a simple two line Cypher query like this.

Query:

MATCH path = (:field)-[:SOURCE_TO]->(:field)-[:SOURCE_TO]->(f:field {fieldName: 'loan_balance', tableName: 'reg_reporting.loans'})
RETURN nodes(path);

Output:

[Node[1] {fieldName: 'loan_balance', tableName: 'warehouse.reporting'}]
[Node[2] {fieldName: 'balance', tableName: 'AWS.LOANS'}]
[Node[3] {fieldName: 'BAL', tableName: 'MF.LOANS'}]

Benefit 2
Tools such as Neo4j’s query editor make it easy to visualise the flow of data

The output above may be useful, but it doesn’t give a visual representation of the relationships between nodes. That’s where Neo4j’s user interface can help.

Neo4j is a popular graph database management system. As well as its graph database admin capabilities, it can also produce useful visuals which represent your data lineage.

This is an example of what the nodes and relationship graph output looks like in Neo4j. You can even expand nodes without writing Cypher queries using this interface, which lets you explore your data flows directly in the output viewer.

Benefit 3
Graph databases have a dynamic schema

This means that you can add new node and relationship types without having to change the entire database schema.

For example, if you wanted to add a new node which contains details of the primary key in a table, then you could simply add it to the database, and it will automatically update the data model to include this new node.

This is very much in contrast to relational databases, where a schema has to be defined before any data is added.

Conclusion

Graph databases are a powerful and flexible solution for storing and querying lineage data. This is especially true for financial institutions which have strict regulatory requirements for data management.