Skip to content

feat: streaming HuggingFace export, deduplication, CRC fix, CLI improvements#1

Open
lvs0 wants to merge 1 commit intomainfrom
improve/streaming-huggingface-deduplicate
Open

feat: streaming HuggingFace export, deduplication, CRC fix, CLI improvements#1
lvs0 wants to merge 1 commit intomainfrom
improve/streaming-huggingface-deduplicate

Conversation

@lvs0
Copy link
Copy Markdown
Owner

@lvs0 lvs0 commented Apr 23, 2026

Summary

Fixes

  • to_huggingface() streaming: Was loading all records into RAM before creating a generator, defeating the streaming design. Now returns a proper with a picklable generator factory that re-opens the file per worker.

  • CRC bug in _rewrite_metadata: Was computing CRC over compressed blocks; spec says it must be over decompressed blocks. Fixed.

New features

  • LoopReader.deduplicate(): Removes duplicate records by canonical SHA-256 hash of messages. Supports filtered deduplication and in-place file output.
  • loop deduplicate CLI:
  • loop stats --language: Was missing from argparse even though the function used it (would have raised AttributeError).

Also

  • as alias for (more intuitive name)
  • Test updated to reflect return type (57 tests passing)

…vements

- to_huggingface() now returns IterableDataset for true streaming
  (was loading all records into RAM before creating a generator)
- Add LoopReader.deduplicate() to remove duplicate records by message hash
- Fix _rewrite_metadata CRC computation: was using compressed blocks,
  should be decompressed per SPEC.md
- Add loop deduplicate CLI command
- Add --language/--min-quality/--split args to loop stats command
- Add to_dicts() as alias for to_list()
- Update test to reflect IterableDataset return type
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants