
Base implementation of Parquet file writing#2583

Open
VeckoTheGecko wants to merge 27 commits into Parcels-code:main from VeckoTheGecko:push-zlxyoyvlpoqm

Conversation

@VeckoTheGecko
Contributor

@VeckoTheGecko VeckoTheGecko commented Apr 20, 2026

Description

This PR introduces Parquet file writing to Parcels.

I still need to work on:

  • How to work with cftime output in the Parquet (how does this work with our internal model of time in Parcels? How should it work?)
  • Reviewing the test_particlefile.py file - are there tests that are no longer needed? What would be the best testing approach here?
    • Will leave for a future PR
  • Update documentation
    • Will leave for a future PR

Posting as draft for initial feedback

Checklist

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt): Just to help with learning the PyArrow API (i.e., I used it to create an example script, which I then used as an entry point for exploring the PyArrow docs)

@VeckoTheGecko VeckoTheGecko changed the title from "Add parquet file writing" to "Base implementation of Parquet file writing" on Apr 20, 2026
@VeckoTheGecko
Contributor Author

@erikvansebille let's table some of these questions for our meeting tomorrow (mainly around datetime serialization in the Parquet file)

@VeckoTheGecko VeckoTheGecko marked this pull request as ready for review April 23, 2026 12:48
@VeckoTheGecko
Contributor Author

@erikvansebille I moved read_particlefile to the global scope (i.e., parcels.read_particlefile()) and added a docstring.

@VeckoTheGecko
Contributor Author

(we'll likely get doc failures on this PR - which is to be expected since they haven't been updated)

Member

@erikvansebille erikvansebille left a comment


Very nice work. And as I'm going through updating all the documentation, it seems to work smoothly too!

A few small comments below

self._create_new_zarrfile = False
else:
Z = zarr.group(store=store, overwrite=False)
obs = particle_data["obs_written"][indices_to_write]
Member


I think obs_written is now obsolete and can be removed from the ParticleClass. Saves some memory ;-)

self._writer: pq.ParquetWriter | None = None
if path.exists():
# TODO: Add logic for recovering/appending to existing parquet file
raise ValueError(f"{path=!r} already exists. Either delete this file or use a path that doesn't exist.")
Member


This is a change from v3 behaviour, where the default would be to overwrite an existing file. Perhaps we can do that here too, and add an option to replace or append?

Comment on lines 173 to 192
@@ -298,7 +192,7 @@
)[0]
Member


Is this section still necessary? If so, should we also fix the warning?

/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:175: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  np.less_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:180: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  & np.greater_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:187: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  & np.equal(time, particle_data["time"], where=np.isfinite(particle_data["time"]))

I didn't worry too much about it before because I thought we would refactor this out anyway, but if we keep it we should also address the warning.
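For reference, the warning comes from calling a ufunc with `where=` but no `out=`: elements masked out by `where` are then left as uninitialized memory. Passing an initialized `out` array makes the masked positions well-defined. A minimal sketch (the arrays here are illustrative, not Parcels data):

```python
import numpy as np

time = np.array([0.0, 1.0, np.nan])
particle_time = np.array([0.5, np.nan, 2.0])
mask = np.isfinite(particle_time)

# Supplying an initialized `out` silences the warning: positions where
# `where` is False keep the value from `out` (False) instead of being
# left as uninitialized memory
result = np.less_equal(
    time, particle_time, out=np.zeros(time.shape, dtype=bool), where=mask
)
print(result)  # [ True False False]
```

The same `out=np.zeros(..., dtype=bool)` pattern would apply to each of the three ufunc calls quoted above.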


Parameters
----------
name : str
Member


should be "path"?
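For reference, a corrected numpydoc entry might look like the sketch below. The `read_particlefile` signature is assumed for illustration, not taken from the PR.

```python
def read_particlefile(path: str):
    """Read a Parcels particle output file.

    Parameters
    ----------
    path : str
        Path to the particle output file to read.
    """
    ...
```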

Comment on lines 57 to 58
particleset :
ParticleSet to output
Member


This is not an argument to __init__?

Comment on lines +158 to +159
# if len(indices_to_write) == 0: # TODO: Remove this?
# return
Member


Probably a good idea to remove this if it's already commented out?



def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]:
def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]: # TODO: Remove?
Member


I don't think it's used anywhere in the codebase, so it can indeed be removed.

Member


Why change the tests in the v3 directory? We normally don't touch these, since they are already broken (this also applies to the other two files).



def test_pfile_array_remove_particles(fieldset, tmp_zarrfile):
@pytest.mark.skip("Keep or remove? Introduced in 5d7dd6bba800baa0fe4bd38edfc17ca3e310062b ")
Member


I think it's an important test to keep in. Or does it fail now?

for i in range(npart):
ds_timediff[i, :] = ds.time.values[i, :] - time[i]
np.testing.assert_equal(age, ds_timediff)
# df = pd.read_parquet(tmp_parquet)
Member


Remove stray debug statement?

@@ -187,7 +154,7 @@ def test_variable_written_once():
@pytest.mark.skip(reason="Pending ParticleFile refactor; see issue #2386")
Member


Does this test work again? If not out of the box, perhaps we should discuss what we actually expect from output of a looped pset.execute...



Development

Successfully merging this pull request may close these issues.

Consistent time handling
