
Base implementation of Parquet file writing#2583

Open
VeckoTheGecko wants to merge 27 commits into Parcels-code:main from VeckoTheGecko:push-zlxyoyvlpoqm

Conversation

@VeckoTheGecko
Contributor

@VeckoTheGecko VeckoTheGecko commented Apr 20, 2026

Description

This PR introduces Parquet file writing to Parcels.

I still need to work on:

  • How to work with cftime output in the Parquet (how does this work with our internal model of time in Parcels? How should it work?)
  • Reviewing the test_particlefile.py file - are there tests that are no longer needed? What would be the best testing approach here?
    • Will leave for a future PR
  • Update documentation
    • Will leave for a future PR

Posting as draft for initial feedback

Checklist

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt): Just to help with learning the PyArrow API (i.e., I used it to create an example script, which I then used as an entry point for exploring the PyArrow docs)

@VeckoTheGecko VeckoTheGecko changed the title from "Add parquet file writing" to "Base implementation of Parquet file writing" on Apr 20, 2026
@VeckoTheGecko
Contributor Author

@erikvansebille let's table some of these questions for our meeting tomorrow (mainly around datetime serialization in the Parquet file)

@VeckoTheGecko VeckoTheGecko marked this pull request as ready for review April 23, 2026 12:48
@VeckoTheGecko
Contributor Author

@erikvansebille I moved read_particlefile to the global scope (i.e., parcels.read_particlefile()) and added a docstring.

@VeckoTheGecko
Contributor Author

(we'll likely get doc failures on this PR - which is to be expected since they haven't been updated)

Member

@erikvansebille erikvansebille left a comment


Very nice work. And as I'm going through updating all the documentation, it seems to work smoothly too!

A few small comments below

self._create_new_zarrfile = False
else:
Z = zarr.group(store=store, overwrite=False)
obs = particle_data["obs_written"][indices_to_write]
Member


I think obs_written is now obsolete and can be removed from the ParticleClass. Saves some memory ;-)

self._writer: pq.ParquetWriter | None = None
if path.exists():
# TODO: Add logic for recovering/appending to existing parquet file
raise ValueError(f"{path=!r} already exists. Either delete this file or use a path that doesn't exist.")
Member


This is a change from v3 behaviour, where the default would be to overwrite an existing file. Perhaps we can do that here too, and add an option to replace or append?

Comment on lines 173 to 192
@@ -298,7 +192,7 @@
)[0]
Member


Is this section still necessary? If so, should we also fix the warning?

/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:175: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  np.less_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:180: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  & np.greater_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:187: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
  & np.equal(time, particle_data["time"], where=np.isfinite(particle_data["time"]))

I didn't worry too much about it before because I thought we would refactor this out anyway, but if we keep it we should also address the warning.
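For reference, the warning comes from calling a ufunc with `where=` but no `out=`: elements masked out by `where` are then left as uninitialized memory. Passing an initialized `out` array makes the masked positions well-defined. A minimal sketch (the arrays here are illustrative, not Parcels data):

```python
import numpy as np

time = np.array([0.0, 1.0, np.nan])
particle_time = np.array([0.5, np.nan, 2.0])
mask = np.isfinite(particle_time)

# Supplying an initialized `out` silences the warning: positions where
# `where` is False keep the value from `out` (False) instead of being
# left as uninitialized memory
result = np.less_equal(
    time, particle_time, out=np.zeros(time.shape, dtype=bool), where=mask
)
print(result)  # [ True False False]
```

The same `out=np.zeros(..., dtype=bool)` pattern would apply to each of the three ufunc calls quoted above.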


Parameters
----------
name : str
Member


should be "path"?
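For reference, a corrected numpydoc entry might look like the sketch below. The `read_particlefile` signature is assumed for illustration, not taken from the PR.

```python
def read_particlefile(path: str):
    """Read a Parcels particle output file.

    Parameters
    ----------
    path : str
        Path to the particle output file to read.
    """
    ...
```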

Comment on lines 57 to 58
particleset :
ParticleSet to output
Member


This is not an argument to __init__?

Comment on lines +158 to +159
# if len(indices_to_write) == 0: # TODO: Remove this?
# return
Member


Probably a good idea to remove this if it's already commented out?



def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]:
def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]: # TODO: Remove?
Member


I don't think it's used anywhere in the codebase, so it can indeed be removed.

Member


Why change the tests in the v3 directory? We normally don't touch these, since they are already broken (this also applies to the other two files).



def test_pfile_array_remove_particles(fieldset, tmp_zarrfile):
@pytest.mark.skip("Keep or remove? Introduced in 5d7dd6bba800baa0fe4bd38edfc17ca3e310062b ")
Member


I think it's an important test to keep in. Or does it fail now?

for i in range(npart):
ds_timediff[i, :] = ds.time.values[i, :] - time[i]
np.testing.assert_equal(age, ds_timediff)
# df = pd.read_parquet(tmp_parquet)
Member


Remove stray debug statement?

@@ -187,7 +154,7 @@ def test_variable_written_once():
@pytest.mark.skip(reason="Pending ParticleFile refactor; see issue #2386")
Member


Does this test work again? If not out of the box, perhaps we should discuss what we actually expect from output of a looped pset.execute...



Development

Successfully merging this pull request may close these issues.

Consistent time handling
