If you’re working with Oracle Database and want to leverage the power of Apache Iceberg for your data lakehouse, let me walk you through the different ways you can make this connection happen, from simple batch loads to real-time CDC pipelines.
Why Connect Oracle to Iceberg Anyway?
Before we dive into the “how,” let’s quickly talk about the “why.” Oracle databases are great for transactional workloads, but when you need to run large-scale analytics, machine learning models, or feed data to multiple downstream systems, Iceberg gives you that flexibility.
You get an open table format that works across different compute engines (Spark, Flink, Trino), time-travel capabilities for historical queries, and schema evolution without the headaches.
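To make the time-travel point concrete, here’s a minimal sketch of an Iceberg time-travel query from Spark SQL. It assumes a SparkSession `spark` already wired up with an Iceberg catalog (the full setup is shown under Approach 4), and `lake.sales.orders` is a placeholder table name.

```python
# Time travel: read the table as it existed at a given moment.
# Assumes `spark` is a SparkSession with an Iceberg catalog named "lake".
spark.sql("""
    SELECT *
    FROM lake.sales.orders
    TIMESTAMP AS OF '2024-01-01 00:00:00'
""").show()
```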
Four Main Approaches:
There are essentially four ways to connect Oracle to Iceberg, depending on your requirements:
- Batch loading using Oracle Data Transforms
- Real-time CDC (Change Data Capture) using Oracle GoldenGate
- Third-party CDC tools like Estuary Flow or Airbyte
- Manual ETL using Spark or custom scripts
Let me break down each approach so you can pick what works best for your use case.
Approach 1: Oracle Data Transforms (The Oracle-Native Way)
If you’re already in the Oracle ecosystem and want a straightforward solution, Oracle Data Transforms supports Apache Iceberg as a target.[2]
Setting Up the Connection:
First, you’ll create an Apache Iceberg connection in Oracle Data Transforms:
- Catalog Type: Choose REST Catalog (stores Iceberg data in Oracle Object Storage)
- Catalog URI: Your REST catalog endpoint
- Warehouse Location: Where you want to store the data (e.g., in Oracle Object Storage)
- Authentication: OAuth2 (either the `client_credentials` or the `authorization_code` grant)
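Data Transforms handles all of this through its UI, but if you want to sanity-check the same REST catalog endpoint from code first, here’s a minimal sketch using pyiceberg. The catalog name, URI, credentials, and warehouse below are placeholders, not values from the Oracle docs.

```python
from pyiceberg.catalog import load_catalog

# Placeholder values -- substitute your REST catalog endpoint, OAuth2
# client credentials, and warehouse location.
catalog = load_catalog(
    "my_rest_catalog",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/iceberg",
        "credential": "my-client-id:my-client-secret",  # client_credentials grant
        "warehouse": "my-warehouse",
    },
)

# Listing namespaces is a cheap way to confirm the connection works.
print(catalog.list_namespaces())
```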
Loading Data
- Create a Data Load and select your Oracle database as the source
- Choose Apache Iceberg as the target connection
- Specify or create a namespace for your Iceberg tables
- Select the tables you want to load
- Choose your load operation: Incremental Merge or Incremental Append
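A note on that last choice: Incremental Merge is essentially an upsert keyed on the primary key, which Data Transforms runs for you behind the scenes. If you’re curious what that amounts to, here’s an illustrative sketch of the equivalent MERGE on an Iceberg table in Spark SQL; the table, view, and key names are placeholders.

```python
# Conceptually, an incremental merge upserts new and changed rows into the
# Iceberg table. `staged_orders` stands in for a temp view holding the rows
# just extracted from Oracle; requires the Iceberg SQL extensions.
spark.sql("""
    MERGE INTO lake.sales.orders AS t
    USING staged_orders AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```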
Approach 2: Real-Time CDC with Oracle GoldenGate
Why GoldenGate for Iceberg?
GoldenGate captures changes from your Oracle database in real time and streams them to Iceberg tables. It’s a strong fit for:
- Keeping your data lake in sync with production databases
- Feeding real-time analytics dashboards
- Building event-driven architectures
- Supporting AI/ML pipelines that need fresh data
The Setup Process
Oracle has detailed documentation for replicating from Autonomous Database to Iceberg, and the process is similar for regular Oracle databases:
- Configure your source Oracle database with supplemental logging enabled (see the sketch after this list)
- Set up GoldenGate deployment in OCI (or on-premises)
- Create an Iceberg connection in GoldenGate pointing to your catalog (AWS Glue, REST catalog, etc.)
- Define your replication by mapping source tables to target Iceberg tables
- Start the replication and monitor the data flow
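For the supplemental-logging piece of step 1, the database-level setting is a single DDL statement. Here’s a minimal sketch using the python-oracledb driver, assuming a user with the ALTER DATABASE privilege; the connection details are placeholders, and the exact statements can vary with your database configuration, so check the GoldenGate docs for your version.

```python
import oracledb

# Placeholder connection details.
conn = oracledb.connect(user="admin", password="your-password",
                        dsn="db-host:1521/ORCLPDB1")
with conn.cursor() as cur:
    # Minimal database-level supplemental logging, required for CDC.
    cur.execute("ALTER DATABASE ADD SUPPLEMENTAL LOG DATA")
    # GoldenGate also expects table-level logging, enabled per table with
    # ADD TRANDATA from the GoldenGate admin client (not shown here).
```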
Approach 3: Third-Party CDC Tools (Estuary Flow, Airbyte)
If you’re looking for alternatives to GoldenGate or want something cloud-agnostic, tools like Estuary Flow and Airbyte offer Oracle-to-Iceberg connectors.
Estuary Flow Approach
Estuary Flow is particularly interesting because it’s built for real-time streaming with a no-code setup.
Prerequisites:
- Oracle 11g or later
- Oracle user with SELECT access and access to the LogMiner views (for CDC)
- Supplemental logging enabled
- S3 bucket for Iceberg storage
- AWS Glue or REST catalog for Iceberg metadata
- EMR Serverless (for standard updates)
Step 1: Configure Oracle as Source
Log into Estuary Flow, create a new capture, and select Oracle. Fill in your connection details (hostname, port, database name, credentials), and optionally configure SSH tunneling if your database is behind a firewall.
Step 2: Configure Iceberg as Destination
Set up your Iceberg materialization by providing:
- Catalog type (AWS Glue or REST)
- S3 bucket details
- AWS credentials with appropriate permissions
- EMR Serverless application ID
Step 3: Map and Sync
Define which Oracle tables should sync to which Iceberg tables, configure any transformations, and start the pipeline. Estuary handles CDC, schema enforcement, and orchestration automatically.
Airbyte Alternative
Airbyte offers a similar no-code approach:
- Set up Oracle DB as source connector
- Choose Apache Iceberg as destination
- Define sync frequency and data mapping
- Let Airbyte handle the extraction and loading
You can self-host Airbyte Open Source or use their managed cloud service.
Approach 4: Manual Integration with Spark
For those who prefer full control or have custom requirements, you can build your own pipeline using Apache Spark.
High-Level Steps
Extract from Oracle:
- Use JDBC to connect Spark to Oracle and read data into DataFrames.
Transform (if needed):
- Apply any business logic or data cleaning using Spark’s DataFrame API.
Write to Iceberg:
- Configure Spark with Iceberg libraries and write data to Iceberg tables.
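Here’s a minimal end-to-end sketch in PySpark. It assumes the Iceberg Spark runtime jar and the Oracle JDBC driver are on the classpath; the catalog name (`lake`), bucket, connection details, and table names are all placeholders.

```python
from pyspark.sql import SparkSession, functions as F

# Spark session with a Hadoop-type Iceberg catalog named "lake".
spark = (
    SparkSession.builder
    .appName("oracle-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# 1. Extract: read an Oracle table over JDBC into a DataFrame.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCLPDB1")
    .option("dbtable", "SALES.ORDERS")
    .option("user", "etl_user")
    .option("password", "etl_password")
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

# 2. Transform: apply business logic, e.g. normalize column names.
orders = orders.select([F.col(c).alias(c.lower()) for c in orders.columns])

# 3. Load: write the DataFrame out as an Iceberg table.
orders.writeTo("lake.sales.orders").createOrReplace()
```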
Wrapping Up
Connecting Oracle Database to Apache Iceberg opens up a world of possibilities for modern analytics, AI/ML workloads, and data sharing across teams. Whether you go with Oracle’s native tools, GoldenGate’s real-time CDC, third-party platforms like Estuary Flow, or custom Spark pipelines, the key is choosing the approach that matches your technical requirements and team capabilities.