Benefits of using a graph database to map data lineage
Graph databases are a great solution for mapping data lineage as they’re built around nodes and the relationships between nodes.
One of my first contract roles was in data management for a large UK bank, where I used graph databases to keep track of data lineage.
What are graph databases?
Graph databases are a type of NoSQL or non-relational database. This means that the data isn’t stored in tables like in a conventional relational database. Instead, it’s stored as nodes, relationships and properties.
Here’s an example graph schema from Neo4j:
Nodes:
Represented by the circles, they can represent objects, things, people, products etc. They are labelled - see the grey boxes by each node.
Relationships:
These are represented by lines, which illustrate how the nodes are connected. The lines have a direction, indicated by an arrow, and a type, specified by the text under each relationship.
Properties:
These are specific attributes that apply to both nodes and relationships, they are noted using key:value pairs and can be seen in the black speech bubbles.
What is data lineage?
Data lineage is the documentation of data flow from source, all the way through the intermediate systems, processes and transformations, to the final output. This means that every single piece of data represented on a report is fully defined, and can be traced back to its source. At some large companies, this can be a long journey from mainframe systems (yes, mainframe!), to the cloud, data lakes and data warehouses.
Not only is data lineage useful for tracing upstream data issues, it’s a regulatory requirement for some financial institutions.
Benefit 1Graph databases can be queried using Cypher Query Language
This means you don’t have to write self-joining or recursive SQL queries to get your data lineage.
Just imagine for a second that your data lineage is stored in a relational database table. It would look something like this:
data_lineage.loans
source_table | source_field | destination_table | destination_field |
---|---|---|---|
MF.LOANS | BAL | AWS.LOANS | balance |
AWS.LOANS | balance | warehouse.reporting | loan_balance |
Data moves from the source table MF.LOANS
to AWS.LOANS
to warehouse.reporting
, while the field name also changes along the way.
This example is just two hops for a single field, with no transformations. That’s not how it works in the real world!
I’ve seen as many as 30 hops for a field, and when you add transformations that happen along the way it starts to get difficult to keep track. Just imagine a loan monthly payment field for example, you need to trace back loan balance, interest rates, currency and other fields all the way back to their source and track where they combine.
Now let’s suppose that you want to trace back the final loan_balance
field from reg_reporting.loans
to its source. The SQL query would look like this self-join, with another self-join needed for each hop back to the data source. As you can see this isn’t sustainable or user friendly.
SELECT b.source_table, b.source_field
FROM (
SELECT source_table, source_field
FROM data_lineage.loans
WHERE destination_table = 'warehouse.loans'
AND destination_field = 'loan_balance'
) a
LEFT JOIN data_lineage.loans b
ON a.source_table = b.destination_table
AND a.source_field = b.destination_field;
You would have to keep self-joining like above until you get to the source, and even then, it’s hard to visualise what’s going on.
The same data lineage can be represented in a graph database like this:
To query this using Cypher Query Language , all we would need is a simple two line Cypher query like this.
Query:
MATCH path = (:field)-[:SOURCE_TO]->(:field)-[:SOURCE_TO]->(f:field {fieldName: 'loan_balance', tableName: 'reg_reporting.loans'})
RETURN nodes(path);
Output:
[Node[1] {fieldName: 'loan_balance', tableName: 'warehouse.reporting'}]
[Node[2] {fieldName: 'balance', tableName: 'AWS.LOANS'}]
[Node[3] {fieldName: 'BAL', tableName: 'MF.LOANS'}]
Benefit 2Tools such as Neo4j’s query editor make it easy to visualise the flow of data
The output above may be useful, but it doesn’t give a visual representation of the relationships between nodes. That’s where Neo4j’s user interface can help.
Neo4j is a popular graph database management system. As well as its graph database admin capabilities, it can also produce useful visuals which represent your data lineage.
This is an example of what the nodes and relationship graph output looks like in Neo4j. You can even expand nodes without writing Cypher queries using this interface, which lets you explore your data flows directly in the output viewer.
Benefit 3Graph databases have a dynamic schema
This means that you can add new node and relationship types without having to change the entire database schema.
For example, if you wanted to add a new node which contains details of the primary key in a table, then you could simply add it to the database, and it will automatically update the data model to include this new node.
This is very much in contrast to relational databases, where a schema has to be defined before any data is added.
Conclusion
Graph databases are a powerful and flexible solution for storing and querying lineage data. This is especially true for financial institutions which have strict regulatory requirements for data management.