RequiresFallback.isCompressionSatisfying is too aggressive with current default page size #3479

@etseidl

Describe the enhancement requested

An issue recently raised in arrow-rs (apache/arrow-rs#9700) brought to my attention the existence of isCompressionSatisfying in the RequiresFallback interface. In short, after a page's worth of data has accumulated, isCompressionSatisfying is called to check whether dictionary encoding is actually compressing the data, and if it is not, the writer immediately switches to the fallback encoder. As far as I could determine, this behavior was introduced very early on, before the advent of page indexes, when IIRC the page size would have been significantly larger. With page indexes, however, this function is now called after only 20000 rows have been processed. A column with moderate cardinality might not yet have produced enough repeated values for this function to conclude that it is best to keep using the dictionary.

For example, a dataframe with an int64 column consisting of one million values taken modulo 32768 will end up ditching dictionary encoding completely and produce a column chunk of 8.4 MB. If the page row count is bumped up to 128k, dictionary encoding is used throughout and the resulting column chunk is only 2.2 MB.
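To make the arithmetic concrete, here is a rough Python model of the first-page check. It is a sketch, not the parquet-java code: it assumes the check amounts to "encoded page size plus dictionary size is smaller than the raw size", that dictionary indices are bit-packed at the minimum width, and that the column holds sequential values `i % 32768`. The function names are made up for illustration.

```python
def dictionary_is_satisfying(values, value_width_bytes=8):
    """Approximate the first-page decision for a fixed-width column.

    Assumption: the real check compares (encoded indices + dictionary)
    against the raw, unencoded page size. RLE runs and page headers
    are ignored here.
    """
    n = len(values)
    distinct = len(set(values))

    raw_size = n * value_width_bytes
    dictionary_size = distinct * value_width_bytes

    # Indices are modelled as bit-packed at the smallest width that can
    # address every dictionary entry.
    bit_width = max(1, (distinct - 1).bit_length())
    encoded_size = (n * bit_width + 7) // 8

    return encoded_size + dictionary_size < raw_size


def first_page(row_count, modulus=32768):
    # Hypothetical data: sequential int64 values reduced modulo `modulus`.
    return [i % modulus for i in range(row_count)]


if __name__ == "__main__":
    for rows in (20_000, 128 * 1024):
        keep = dictionary_is_satisfying(first_page(rows))
        print(f"{rows:>7} rows in first page -> keep dictionary: {keep}")
```

Under these assumptions the first 20000 rows are all distinct, so the dictionary alone (160000 bytes) is already as large as the raw page and the check fails. With 128k rows every value has repeated four times and the check passes comfortably.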

Sadly, this behavior does not appear to be configurable, so short of increasing the page row count there is no way to change it.

I can see the need for this type of heuristic, but I think it needs to be revisited, because the current defaults yield far too few samples to determine whether dictionary encoding is beneficial. If collecting more samples before falling back is not practical, there should at least be a configuration setting to disable this check.

Component(s)

Core
