feat: add dictionary_columns to scan API for memory-efficient string reads #3234

Status: Open

tanmayrauth wants to merge 1 commit into apache:main from tanmayrauth:feat/dictionary-columns-scan
Conversation

@tanmayrauth

Exposes dictionary_columns: tuple[str, ...] | None = None on Table.scan() and DataScan, threading it through to PyArrow's ParquetFileFormat so that the named columns are read as DictionaryArray instead of plain large_utf8. This substantially reduces memory usage for columns of highly repetitive JSON/string values (issue #3168) and addresses the general request for extensible scan parameters (issue #3170).

Key implementation details:

  • ORC files are guarded: dictionary_columns is only passed on the Parquet read path
  • ArrowScan.to_table() rebuilds the Arrow schema with dictionary types before the empty-table fast path, so the schema is consistent regardless of row count
  • DataScan.to_arrow_batch_reader() rebuilds target_schema with dictionary types to prevent .cast() from silently decoding DictionaryArray back to plain string
  • DataScan.__init__ declares and stores the parameter so TableScan.update() (which uses inspect.signature) preserves it across scan copies

Fixes #3168, closes #3170

Rationale for this change

Are these changes tested? Yes

Are there any user-facing changes? No

@tanmayrauth force-pushed the feat/dictionary-columns-scan branch from 52b2070 to 9fc3b0c on April 13, 2026 at 21:48
