
[POC][client] Support log scanner for multiple tables#3140

Open
loserwang1024 wants to merge 4 commits into apache:main from loserwang1024:multiple_log_scanner

Conversation

@loserwang1024 (Contributor)

Purpose

Linked issue: close #3139

Brief change log

Tests

API and Format

Documentation

@wuchong (Member) commented Apr 20, 2026

Thanks @loserwang1024. Supporting multi-table reads is a valuable enhancement. Currently, we can already read from multiple tables by creating separate LogScanner instances via connection.getTable(tablePath).newScan(..).createLogScanner(..).

I understand that the primary goal of this PR is to consolidate I/O within the LogFetcher; keeping individual LogScanners per table is not a problem in itself. This optimization mirrors our approach on the writer side, where writes for multiple tables are merged into a unified sender instance to aggregate I/O. We can apply a similar strategy here by enabling multi-table support in the LogFetcher and sharing it at the Connection level.

This approach allows us to keep the user-facing API unchanged, leveraging the existing hierarchy of Connection -> Table -> LogScanner without introducing a new multi-table scanner abstraction. It also ensures consistency between reader and writer operations for both single-table and multi-table scenarios.

What do you think?
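To make the proposal concrete, here is a minimal sketch of a fetcher shared at the Connection level while each table keeps its own scanner facade. The class names (SharedLogFetcher, TableScannerView) and the queue-based dispatch are hypothetical illustrations, not Fluss's actual internals: a single I/O loop serves every subscribed table and dispatches records into per-table queues, so the user-facing poll-per-table shape stays unchanged.

```java
import java.util.Map;
import java.util.concurrent.*;

// Hypothetical sketch of a Connection-level shared fetcher.
// Names and structure are illustrative, not Fluss's real API.
public class SharedFetcherSketch {

    /** Single fetch loop serving many tables; dispatches into per-table queues. */
    static class SharedLogFetcher implements AutoCloseable {
        private final Map<String, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
        private final ExecutorService io = Executors.newSingleThreadExecutor();
        private volatile boolean running = true;

        TableScannerView subscribe(String tablePath) {
            BlockingQueue<String> q = new LinkedBlockingQueue<>();
            queues.put(tablePath, q);
            return new TableScannerView(tablePath, q);
        }

        void start() {
            io.submit(() -> {
                int seq = 0;
                while (running) {
                    // One aggregated fetch round-trip serves every subscribed table.
                    for (Map.Entry<String, BlockingQueue<String>> e : queues.entrySet()) {
                        e.getValue().offer(e.getKey() + "-record-" + seq);
                    }
                    seq++;
                    try { Thread.sleep(10); } catch (InterruptedException ex) { return; }
                }
            });
        }

        @Override public void close() { running = false; io.shutdownNow(); }
    }

    /** Per-table scanner facade; the Connection -> Table -> LogScanner shape is preserved. */
    static class TableScannerView {
        final String tablePath;
        private final BlockingQueue<String> queue;
        TableScannerView(String tablePath, BlockingQueue<String> queue) {
            this.tablePath = tablePath; this.queue = queue;
        }
        String poll(long timeoutMs) throws InterruptedException {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        try (SharedLogFetcher fetcher = new SharedLogFetcher()) {
            TableScannerView orders = fetcher.subscribe("db.orders");
            TableScannerView users = fetcher.subscribe("db.users");
            fetcher.start();
            System.out.println(orders.poll(1000));
            System.out.println(users.poll(1000));
        }
    }
}
```

The key property is that both scanner views are fed by one I/O loop, so adding tables adds subscriptions rather than fetch threads or round-trips.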

@loserwang1024 (Contributor, Author)

> leveraging the existing hierarchy of Connection -> Table -> LogScanner without introducing a new multi-table scanner abstraction. It also ensures consistency between reader and writer operations for both single-table and multi-table scenarios.

However, this approach introduces two critical issues:

  • Blocking Latency vs. Asynchronous Writes: A single reader would need to maintain multiple LogScanner instances and poll them sequentially (e.g., calling scanner.poll(POLL_TIMEOUT) in a loop). If the tables at the beginning of the queue have low data volume, the poll operation will block for the entire timeout period, causing significant latency before subsequent tables are processed.
    Note: The reason this pattern works on the writer side is that writes are asynchronous; one write operation does not block others. In contrast, the synchronous nature of reading makes sequential polling inefficient.

  • Memory Overhead: Requiring each reader to manage multiple LogScanner instances creates substantial resource overhead. Each LogScanner must independently maintain its own memory buffers for parsing Arrow records, so the aggregate memory footprint grows significantly larger than it would be with a unified approach.
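The blocking cost in the first point can be made concrete with a small timing sketch. This uses plain java.util.concurrent, not Fluss code; the POLL_TIMEOUT_MS constant and worstCaseDelayMs helper are illustrative assumptions standing in for scanner.poll(POLL_TIMEOUT) on an idle table:

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical timing sketch: N idle scanners polled sequentially,
// each blocking for the full timeout before the next one is tried.
public class SequentialPollLatency {
    static final long POLL_TIMEOUT_MS = 100;

    /** Worst-case delay before the table at position {@code index} is even polled:
     *  every empty scanner ahead of it blocks for its full timeout first. */
    static long worstCaseDelayMs(int index) {
        return index * POLL_TIMEOUT_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        // Three idle tables: poll() on each empty queue blocks for POLL_TIMEOUT_MS.
        List<LinkedBlockingQueue<String>> scanners = List.of(
            new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>(), new LinkedBlockingQueue<>());
        long start = System.nanoTime();
        for (LinkedBlockingQueue<String> s : scanners) {
            s.poll(POLL_TIMEOUT_MS, TimeUnit.MILLISECONDS); // blocks: queue is empty
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        // A full pass costs roughly 3 * POLL_TIMEOUT_MS even though no data arrived,
        // and the last table waits ~200 ms before its first poll.
        System.out.println("full pass: " + elapsedMs + " ms, worst-case delay for table 2: "
            + worstCaseDelayMs(2) + " ms");
    }
}
```

A shared fetcher avoids this because a single blocking wait covers all subscribed tables at once, instead of one full timeout per idle table.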

