
[POC][client] Support log scanner for multiple tables#3140

Open
loserwang1024 wants to merge 4 commits into apache:main from loserwang1024:multiple_log_scanner

Conversation

@loserwang1024 (Contributor)

Purpose

Linked issue: close #3139

Brief change log

Tests

API and Format

Documentation

@wuchong (Member) commented Apr 20, 2026

Thanks @loserwang1024. Supporting multi-table reads is a valuable enhancement. Currently, we can already read from multiple tables by creating separate LogScanner instances via connection.getTable(tablePath).newScan(..).createLogScanner(..).

I understand that the primary goal of this PR is to consolidate I/O within the LogFetcher; keeping individual LogScanners per table is not a problem in itself. This optimization mirrors our approach on the writer side, where writes for multiple tables are merged into a unified sender instance to aggregate I/O. We can apply a similar strategy here by enabling multi-table support in the LogFetcher and sharing it at the Connection level.

This approach allows us to keep the user-facing API unchanged, leveraging the existing hierarchy of Connection -> Table -> LogScanner without introducing a new multi-table scanner abstraction. It also ensures consistency between reader and writer operations for both single-table and multi-table scenarios.

What do you think?
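To make the proposal concrete, here is a minimal sketch of a fetcher shared at the Connection level while each table keeps its own scanner facade. The class names (SharedLogFetcher, TableScannerView) and the queue-based dispatch are hypothetical illustrations, not Fluss's actual internals: a single I/O loop serves every subscribed table and dispatches records into per-table queues, so the user-facing poll-per-table shape stays unchanged.

```java
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical sketch of a Connection-level shared fetcher.
// Names and structure are illustrative, not Fluss's real API.
public class SharedFetcherSketch {

    /** Single fetch loop serving many tables; dispatches into per-table queues. */
    static class SharedLogFetcher implements AutoCloseable {
        private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
        private final ExecutorService io = Executors.newSingleThreadExecutor();
        private volatile boolean running = true;

        TableScannerView subscribe(String tablePath) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.put(tablePath, q);
            return new TableScannerView(tablePath, q);
        }

        void start() {
            io.submit(() -> {
                int seq = 0;
                while (running) {
                    // One aggregated fetch round-trip serves every subscribed table.
                    for (Map.Entry<String, BlockingQueue<String>> e : queues.entrySet()) {
                        e.getValue().offer(e.getKey() + "-record-" + seq);
                    }
                    seq++;
                    try { Thread.sleep(10); } catch (InterruptedException ex) { return; }
                }
            });
        }

        @Override public void close() { running = false; io.shutdownNow(); }
    }

    /** Per-table scanner facade; the Connection -> Table -> LogScanner shape is preserved. */
    static class TableScannerView {
        final String tablePath;
        private final BlockingQueue<String> queue;
        TableScannerView(String tablePath, BlockingQueue<String> queue) {
            this.tablePath = tablePath; this.queue = queue;
        }
        String poll(long timeoutMs) throws InterruptedException {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        try (SharedLogFetcher fetcher = new SharedLogFetcher()) {
            TableScannerView orders = fetcher.subscribe("db.orders");
            TableScannerView users = fetcher.subscribe("db.users");
            fetcher.start();
            System.out.println(orders.poll(1000));
            System.out.println(users.poll(1000));
        }
    }
}
```

The key property is that both scanner views are fed by one I/O loop, so adding tables adds subscriptions rather than fetch threads or round-trips.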

@loserwang1024 (Contributor, Author)

> leveraging the existing hierarchy of Connection -> Table -> LogScanner without introducing a new multi-table scanner abstraction. It also ensures consistency between reader and writer operations for both single-table and multi-table scenarios.

However, this approach introduces two critical issues:

  • Blocking Latency vs. Asynchronous Writes: A single reader would need to maintain multiple LogScanner instances and poll them sequentially (e.g., calling scanner.poll(POLL_TIMEOUT) in a loop). If the tables at the beginning of the queue have low data volume, the poll operation will block for the entire timeout period, causing significant latency before subsequent tables are processed.
    Note: The reason this pattern works on the writer side is that writes are asynchronous; one write operation does not block others. In contrast, the synchronous nature of reading makes sequential polling inefficient.

  • Memory Overhead: Requiring each reader to manage multiple LogScanner instances creates substantial resource overhead. Each LogScanner must independently maintain its own memory buffers for parsing Arrow records, so the aggregate memory footprint grows significantly larger than it would be with a unified approach.
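The blocking cost in the first point can be made concrete with a small timing sketch. This uses plain java.util.concurrent, not Fluss code; the POLL_TIMEOUT_MS constant and worstCaseDelayMs helper are illustrative assumptions standing in for scanner.poll(POLL_TIMEOUT) on an idle table:

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical timing sketch: N idle scanners polled sequentially,
// each blocking for the full timeout before the next one is tried.
public class SequentialPollLatency {
    static final long POLL_TIMEOUT_MS = 100;

    /** Worst-case delay before the table at position {@code index} is even polled:
     *  every empty scanner ahead of it blocks for its full timeout first. */
    static long worstCaseDelayMs(int index) {
        return index * POLL_TIMEOUT_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        // Three idle tables: poll() on each empty queue blocks for POLL_TIMEOUT_MS.
        List<LinkedBlockingQueue<String>> scanners = List.of(
            new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>());
        long start = System.nanoTime();
        for (LinkedBlockingQueue<String> s : scanners) {
            s.poll(POLL_TIMEOUT_MS, TimeUnit.MILLISECONDS); // blocks: queue is empty
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // A full pass costs roughly 3 * POLL_TIMEOUT_MS even though no data arrived,
        // and the last table waits ~200 ms before its first poll.
        System.out.println("full pass: " + elapsedMs + " ms, worst-case delay for table 2: "
            + worstCaseDelayMs(2) + " ms");
    }
}
```

A shared fetcher avoids this because a single blocking wait covers all subscribed tables at once, instead of one full timeout per idle table.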

