Lineage for IBM products and Oracle
Data lineage is the process of tracking the flow of data over time, providing a clear understanding of where the data originated, how it has changed, and its ultimate destination within the data pipeline.
While working at ING-DiBa, I implemented a system that expanded the capabilities of IBM products.
Covered products:
- IBM DataStage 11.7 + IBM InfoSphere Information Governance Catalog (IGC)
- IBM Cloud Pak for Data + IBM Watson Knowledge Catalog
Involved products:
Oracle databases, Atlassian Confluence, Microsoft Azure, Amazon S3, CyberArk, LDAP/Kerberos, Automic Automation (UC4)
Project target groups
architects, developers, DBAs, business analysts, data cleansing/compliance teams, GDPR officers
Lineage
I developed the lineage module, which gives complete control over all enterprise data flows. Using ANTLR, I built a system for lexical and grammatical parsing of all ETL/ELT processes, as well as all related Oracle SQL queries of any complexity. The lineage comprises a complete data-flow graph and links the technical level to the business level. The technical part enables detailed analysis of data flows and covers tables and ETL/ELT processes as well as individual fields and variables. Enterprise-specific data-flow connections (special lineage cases) can also be processed. The system includes a service that automatically rebuilds the graph whenever ETL/ELT processes or database structures change.
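As a rough sketch only (the class and method names below are hypothetical illustrations, not the actual implementation), the parser output can be thought of as feeding a lineage graph model along these lines:

```java
import java.util.*;

// Hypothetical model: nodes are tables, ETL/ELT stages, or individual fields;
// edges record that data flows from one node into another.
final class LineageGraph {
    enum NodeType { TABLE, ETL_STAGE, FIELD }

    record Node(String id, NodeType type) {}

    private final Map<String, Node> nodes = new HashMap<>();
    private final Map<String, Set<String>> downstream = new HashMap<>();

    void addNode(String id, NodeType type) {
        nodes.putIfAbsent(id, new Node(id, type));
    }

    void addFlow(String fromId, String toId) {
        downstream.computeIfAbsent(fromId, k -> new HashSet<>()).add(toId);
    }

    // Impact analysis: every node reachable downstream of the given one.
    Set<String> impactOf(String id) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(id));
        while (!work.isEmpty()) {
            for (String next : downstream.getOrDefault(work.pop(), Set.of())) {
                if (seen.add(next)) work.push(next);
            }
        }
        return seen;
    }
}
```

In this picture, ANTLR listeners walking the parsed ETL/ELT definitions and SQL statements would emit the addNode/addFlow calls, and the rebuild service would simply re-run them after each change.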
Lineage GUI:
- interactive visualization of data-flow graphs;
- reporting;
- statistics;
- convenient navigation between graph nodes;
- filtering of the graph by specified criteria (see the sketch below).
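As one hedged example of such a criterion-based filter (the helper below is hypothetical, not the actual GUI code), the view could be reduced to a single Oracle schema before rendering:

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical filter: keep only nodes matching a predicate and the
// edges between the surviving nodes.
final class GraphFilter {
    static Map<String, Set<String>> filter(Map<String, Set<String>> edges,
                                           Predicate<String> keep) {
        Map<String, Set<String>> out = new HashMap<>();
        edges.forEach((from, tos) -> {
            if (!keep.test(from)) return;
            Set<String> kept = new HashSet<>();
            for (String to : tos) if (keep.test(to)) kept.add(to);
            if (!kept.isEmpty()) out.put(from, kept);
        });
        return out;
    }
}

// Example criterion: restrict the view to one Oracle schema.
// GraphFilter.filter(downstreamEdges, id -> id.startsWith("SALES."));
```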
Generating/Adjusting
The parsing algorithms serve as the basis for generating new ETL/ELT processes. The system can generate ETL/ELT processes of almost any complexity, taking into account data-historization algorithms such as SCD1, SCD2, and Data Vault 2.0, as well as free-form processes. It also includes mechanisms for batch configuration of multiple ETL/ELT processes and for preparing ETL/ELT processes for deployment.
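To make the metadata-driven idea concrete, here is a toy sketch (the TableMeta descriptor and generator names are assumptions, and the real generator covers far more cases): it emits the Oracle MERGE that end-dates changed rows in an SCD2 load.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical metadata descriptor for a target dimension table.
record TableMeta(String target, String source,
                 List<String> keys, List<String> attrs) {}

final class Scd2Generator {
    // Builds the Oracle MERGE that end-dates changed current rows; the
    // matching INSERT of new versions is omitted to keep the sketch short.
    static String closeChangedRows(TableMeta t) {
        String on = t.keys().stream()
                .map(k -> "tgt." + k + " = src." + k)
                .collect(Collectors.joining(" AND "));
        String changed = t.attrs().stream()
                .map(a -> "tgt." + a + " <> src." + a)
                .collect(Collectors.joining(" OR "));
        return "MERGE INTO " + t.target() + " tgt\n"
             + "USING " + t.source() + " src\n"
             + "ON (" + on + ")\n"
             + "WHEN MATCHED THEN UPDATE SET tgt.valid_to = SYSDATE\n"
             + "WHERE tgt.valid_to IS NULL AND (" + changed + ")";
    }
}
```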
Data quality (DQ)
The module automates DQ processes:
- defining requirements for data verification;
- generating ETL/ELT processes that perform DQ checks (see the sketch after this list);
- registering verification rules;
- automatically creating and updating the accompanying documentation.
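A minimal sketch of the rule-to-check step (the rule shape and names are assumptions, not the actual module): a registered rule is turned into a query counting violating rows, which a generated ETL/ELT job can then evaluate and log.

```java
// Hypothetical DQ rule: a named predicate that every row of a table must satisfy.
record DqRule(String name, String table, String predicate) {}

final class DqCheckGenerator {
    // Turns a rule into a COUNT of the rows that violate it.
    static String toCheckSql(DqRule r) {
        return "SELECT '" + r.name() + "' AS rule_name, COUNT(*) AS violations\n"
             + "FROM " + r.table() + "\n"
             + "WHERE NOT (" + r.predicate() + ")";
    }
}

// Example: the IBAN column must never be empty.
// DqCheckGenerator.toCheckSql(new DqRule("iban_not_null", "TXN", "iban IS NOT NULL"));
```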
Versioning
In addition, the system includes modules for:
- automatic versioning of ETL/ELT processes and all database objects (a snapshot sketch follows this list);
- easy recovery of ETL/ELT processes via a GUI;
- side-by-side visual comparison of ETL/ELT process revisions in human-readable form;
- batch comparison of multiple ETL/ELT processes;
- global search.
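For the database-object side of versioning, one plausible building block (a sketch under the assumption that snapshots are taken via JDBC; the surrounding version store is not shown) is Oracle's DBMS_METADATA package, which returns the current DDL of an object:

```java
import java.sql.*;

// Sketch: pull the current DDL of a table via Oracle's DBMS_METADATA,
// so it can be committed to a version store whenever it changes.
final class DdlSnapshot {
    static String fetchTableDdl(Connection con, String owner, String table)
            throws SQLException {
        String sql = "SELECT DBMS_METADATA.GET_DDL('TABLE', ?, ?) FROM dual";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, table);
            ps.setString(2, owner);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getString(1); // CLOB read as a String for brevity
            }
        }
    }
}
```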
Technically, the system is a client-server solution, supporting both direct HTTP connections and WebSockets.
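As a hedged illustration of the HTTP side only (the endpoint path and payload are made up, and the WebSocket channel for pushed updates is not shown), a minimal JDK-only server could expose the graph like this:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal sketch: the GUI fetches a lineage subgraph as JSON over HTTP.
public final class LineageServer {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/graph", exchange -> {
            byte[] body = "{\"nodes\":[],\"edges\":[]}" // placeholder payload
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start(); // serves http://localhost:8080/graph
    }
}
```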
After completing development, I transferred knowledge to the team's developers for subsequent project support.