Organisations building AI models increasingly need mainframe data. DB2 z/OS, IMS, VSAM – decades of high-quality financial data that most AI teams cannot access. Here is how to change that.
Every major financial institution, insurance company, and government agency running mainframe has something that AI teams desperately want: decades of clean, structured, high-quality transactional data.
DB2 z/OS tables containing thirty years of payment transactions. IMS databases holding insurance policy histories going back to the 1980s. VSAM files with customer records that predate the internet.
This data is extraordinarily valuable for training domain-specific AI models – models that understand financial transactions, insurance claims, regulatory patterns, and risk profiles at a level that general-purpose models cannot match.
The problem is access. Getting this data out of mainframe systems, into modern AI infrastructure, reliably, in real time, without disrupting production – this is a genuinely hard problem that most AI teams do not have the skills to solve.
Volume and history. A major bank's mainframe DB2 may contain transaction records going back twenty to thirty years. This depth of history is invaluable for training models on financial patterns, fraud detection, risk assessment, and regulatory compliance.
Data quality. Mainframe systems process data under extremely strict validation rules. Bad data that would pass through a modern API is rejected by COBOL edit routines that have been refined over decades. The data in production mainframe databases is generally cleaner than equivalent data in modern systems.
Domain specificity. Financial transaction data, insurance claim data, government benefit data – these are domains where general-purpose models have significant gaps. A model fine-tuned on decades of actual transaction history from a major bank will outperform a general model on financial tasks in ways that matter for production use.
Regulatory completeness. Financial mainframe data is maintained under regulatory audit requirements. It tends to be complete, timestamped, and accompanied by the audit trails that AI training requires for regulated use cases.
Despite its value, mainframe data is largely inaccessible to AI teams, for reasons that are technical, organisational, and cultural.
EBCDIC encoding. Mainframe data is stored in EBCDIC, not ASCII or UTF-8. Every field extracted from DB2 z/OS or VSAM needs to be converted. For numeric fields, the conversion rules depend on whether the field is packed decimal, zoned decimal, binary, or floating point. Getting this wrong produces data that looks plausible but is subtly corrupted in ways that poison model training.
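To make the conversion concrete, here is a minimal Python sketch of the two most common cases: EBCDIC character fields (assuming code page 037; real systems may use other code pages) and COMP-3 packed decimal. The picture clause and hex value are illustrative.

```python
# Minimal sketch of EBCDIC and packed-decimal conversion.
# Assumes code page 037 (US EBCDIC); real datasets may use other code pages.

from decimal import Decimal

def ebcdic_to_str(raw: bytes) -> str:
    """Convert an EBCDIC character field to a Python string."""
    return raw.decode("cp037").rstrip()

def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a COMP-3 (packed decimal) field.

    Each byte holds two decimal digits; the low nibble of the last
    byte is the sign (0xD = negative, 0xC/0xF = positive).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append((byte >> 4) & 0x0F)
        digits.append(byte & 0x0F)
    last = raw[-1]
    digits.append((last >> 4) & 0x0F)
    sign_nibble = last & 0x0F
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)

# Example: X'12345D' is -123.45 when the copybook says PIC S9(3)V99 COMP-3.
assert unpack_comp3(bytes.fromhex("12345D"), scale=2) == Decimal("-123.45")
```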
Record format complexity. VSAM and sequential datasets use fixed-length, variable-length, and undefined record formats with control bytes that do not exist in modern file formats. A developer who has not worked with mainframe datasets will produce incorrect extracts that silently lose or corrupt data.
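As an illustration of the control bytes involved, here is a sketch that walks a binary copy of a variable-length (RECFM=V/VB) dataset using the 4-byte Record Descriptor Word in front of each record. It assumes block descriptor words have already been stripped; the file name is a placeholder.

```python
# Sketch: iterate over records in a binary copy of a RECFM=V/VB dataset.
# Assumes block descriptor words have already been stripped (pure RDW stream);
# a real transfer may or may not preserve BDWs depending on the tooling used.

import struct
from typing import Iterator

def read_variable_records(path: str) -> Iterator[bytes]:
    with open(path, "rb") as f:
        while True:
            rdw = f.read(4)
            if len(rdw) < 4:
                break  # end of file
            # First two bytes: big-endian record length, *including* the RDW.
            (length,) = struct.unpack(">H", rdw[:2])
            record = f.read(length - 4)
            if len(record) != length - 4:
                raise ValueError("truncated record - transfer was not binary-safe")
            yield record

if __name__ == "__main__":
    # Hypothetical file name; count the records as a sanity check.
    count = sum(1 for _ in read_variable_records("claims.vb.bin"))
    print(f"{count} records")
```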
Copybook dependency. DB2 z/OS and VSAM data structures are defined in COBOL copybooks. Without the copybooks, field boundaries, data types, and meanings are unknown. Many organisations do not have current copybooks readily available for all their datasets.
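To illustrate why the copybook matters, here is a hypothetical copybook fragment and the field layout a pipeline has to derive from it before a single byte of the record can be interpreted. The record structure, offsets, and lengths are invented for this example.

```python
# A hypothetical copybook fragment and the layout derived from it.
# Without this mapping, the raw record is just an opaque run of EBCDIC bytes.

COPYBOOK = """
01  CUSTOMER-REC.
    05  CUST-ID        PIC X(10).
    05  CUST-NAME      PIC X(30).
    05  CUST-BALANCE   PIC S9(7)V99 COMP-3.
    05  LAST-TXN-DATE  PIC 9(8).
"""

# Derived layout: (name, offset, length in bytes, type, decimal scale)
LAYOUT = [
    ("CUST-ID",        0, 10, "char",  0),
    ("CUST-NAME",     10, 30, "char",  0),
    ("CUST-BALANCE",  40,  5, "comp3", 2),   # S9(7)V99 -> 9 digits -> 5 bytes
    ("LAST-TXN-DATE", 45,  8, "zoned", 0),
]

RECORD_LENGTH = 53  # sum of field lengths; must match the dataset's LRECL

def slice_fields(record: bytes) -> dict:
    """Cut a raw record into per-field byte slices using the layout."""
    return {name: record[off:off + length] for name, off, length, _, _ in LAYOUT}
```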
Access control. Production mainframe systems are protected by RACF (or equivalent) with strict access controls. An AI team asking for access to production DB2 tables containing customer financial data will encounter (correctly) significant resistance from security and compliance teams.
The correct architecture for feeding mainframe data to AI infrastructure is CDC – Change Data Capture.
Rather than running batch extracts that read entire tables, CDC captures changes to mainframe data as they happen and streams them to downstream systems. The result is a continuous feed of changes that downstream AI systems can consume in near real time.
Third-party CDC tooling for DB2 z/OS reads the DB2 log, captures inserts, updates, and deletes, handles EBCDIC conversion and data type mapping, and delivers the changes to a target system – Kafka, cloud data lakes, modern databases.
The advantages: no production impact (CDC reads the database log, not the live tables), real-time feed (changes delivered within seconds to minutes), and complete history (by replaying the log from a historical point, the full change history of a table can be reconstructed for training datasets).
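What downstream systems actually receive varies by CDC product, but the change records generally carry the same core elements: the source table, the operation, before and after images, a commit timestamp, and a log position for replay. A hypothetical event might look like this (field names are illustrative, not any vendor's schema):

```python
# Illustrative shape of a single CDC change record after conversion.
# Field names are hypothetical; actual schemas vary by CDC product.
change_event = {
    "source_table": "PROD.PAYMENTS",
    "operation": "UPDATE",                 # INSERT / UPDATE / DELETE
    "commit_timestamp": "2024-03-14T09:21:07.512Z",
    "log_position": "00D4A1B2C3",          # restart/replay point in the DB2 log
    "before": {"PAYMENT_ID": 918273, "STATUS": "PENDING", "AMOUNT": "125.00"},
    "after":  {"PAYMENT_ID": 918273, "STATUS": "SETTLED", "AMOUNT": "125.00"},
}
```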
Building reliable data pipelines from mainframe to AI infrastructure requires a skill combination that is genuinely rare.
Most data engineers have the modern side skills. Most mainframe DBAs have the mainframe side skills. People who have both are extraordinarily rare – and as the previous posts in this series have described, they are getting rarer as the mainframe cohort retires.
DB2 z/OS to modern data lake (batch). For historical training data, a batch approach is often sufficient. DB2 UNLOAD utility extracts table data. A conversion program handles EBCDIC translation and data type mapping. The output lands in S3 or equivalent in Parquet format for training. This approach is reliable and low-risk but produces data that is hours or days old. Suitable for periodic model retraining, not for real-time inference.
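As a sketch of the conversion step in this batch pattern, assuming the UNLOAD output has already been decoded into rows of Python values (for example with field helpers like those sketched earlier) and that pyarrow is available; the file and column names are illustrative.

```python
# Sketch: write converted UNLOAD rows to Parquet for model training.
# Assumes `rows` is an iterable of dicts already decoded from EBCDIC;
# file and column names are hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq

def rows_to_parquet(rows: list, out_path: str) -> None:
    table = pa.Table.from_pylist(rows)  # infer the schema from the decoded rows
    pq.write_table(table, out_path, compression="snappy")

rows = [
    {"PAYMENT_ID": 918273, "AMOUNT": "125.00", "POSTED": "2024-03-14"},
    {"PAYMENT_ID": 918274, "AMOUNT": "89.10",  "POSTED": "2024-03-14"},
]
rows_to_parquet(rows, "payments_unload_20240314.parquet")
# The local file can then be uploaded to S3 (or written via pyarrow's S3
# filesystem support) as part of the training data lake.
```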
DB2 z/OS to Kafka (real-time CDC). A CDC tool captures DB2 log changes and publishes to Kafka topics. A Kafka consumer handles any remaining transformation and writes to the serving layer. Suitable for real-time model inference scenarios – fraud detection, real-time risk assessment.
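A minimal consumer sketch for this pattern, assuming the CDC tool publishes JSON change events to a Kafka topic; the topic, brokers, and the in-memory stand-in for the serving layer are hypothetical.

```python
# Sketch: consume CDC change events from Kafka and forward them to the
# serving layer. Uses kafka-python; topic and broker names are hypothetical.

import json
from kafka import KafkaConsumer

# Stand-in for the real serving layer (a feature store, online database, etc.).
serving_layer = {}

def to_features(after_image: dict) -> dict:
    """Hypothetical transformation of the 'after' image into model features."""
    return {"amount": float(after_image["AMOUNT"]), "status": after_image["STATUS"]}

consumer = KafkaConsumer(
    "db2.prod.payments",                          # hypothetical CDC topic
    bootstrap_servers=["kafka-1:9092"],           # hypothetical brokers
    group_id="ai-feature-pipeline",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)

for message in consumer:
    event = message.value
    if event["operation"] in ("INSERT", "UPDATE"):
        serving_layer[event["after"]["PAYMENT_ID"]] = to_features(event["after"])
    consumer.commit()  # commit offsets only after the write has succeeded
```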
VSAM to modern platform. VSAM data extraction requires reading the VSAM catalog, obtaining the copybook definitions, and running an extraction job that handles the EBCDIC conversion and record format translation. The output is typically written to a sequential dataset and then transferred to a Linux processing environment. This is lower-tech than CDC but often practical for VSAM datasets that change infrequently.
As organisations recognise that their most valuable AI training data is locked in mainframe systems, a new role is emerging: mainframe data engineer for AI pipelines. This person understands DB2 z/OS, IMS, and VSAM well enough to extract data correctly. They understand modern data engineering well enough to build reliable pipelines. And they understand enough about ML pipelines to know what the AI team actually needs.
If you have this combination of skills, you are in a position that very few people occupy. The demand is growing and the supply is not.
Also in this series: Is Mainframe Work at Risk of Being Outsourced? · Why Generic AI Tools Fail on Mainframe · Runtime Evidence as the Right Starting Point