Base implementation of Parquet file writing #2583
VeckoTheGecko wants to merge 27 commits into Parcels-code:main from
Conversation
Force-pushed from 1d26f4a to 91ce496 (Compare)
@erikvansebille let's table some of these questions for our meeting tomorrow (mainly around datetime serialization in the Parquet file)
Covered by test_write_dtypes_pfile
for more information, see https://pre-commit.ci
Force-pushed from e06a618 to 60ceef6 (Compare)
Remove temporary test_cftime.py file
This function is now independent of the time_interval as time is now stored as float
Remove nested key - save on root instead
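The commit above notes that time is now stored as a float, which is what makes writing independent of the `time_interval`. A minimal sketch of that idea, where the function name and epoch are illustrative assumptions rather than the PR's actual API:

```python
from datetime import datetime

# Illustrative sketch (name and epoch are assumptions, not the PR's API):
# once time is stored as a float offset from a fixed reference epoch, the
# writer no longer needs a TimeInterval to interpret the value.
def encode_time(t, epoch=datetime(2000, 1, 1)):
    """Convert a datetime to a float number of seconds since the epoch."""
    return (t - epoch).total_seconds()

encode_time(datetime(2000, 1, 1, 1))  # one hour after the epoch -> 3600.0
```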
Force-pushed from 60ceef6 to 54c829a (Compare)
@erikvansebille I moved

(we'll likely get doc failures on this PR, which is to be expected since they haven't been updated)
erikvansebille left a comment
Very nice work. And as I'm going through updating all the documentation, it seems to work smoothly too!
A few small comments below
self._create_new_zarrfile = False
else:
    Z = zarr.group(store=store, overwrite=False)
obs = particle_data["obs_written"][indices_to_write]
I think obs_written is now obsolete and can be removed from the ParticleClass. Saves some memory ;-)
self._writer: pq.ParquetWriter | None = None
if path.exists():
    # TODO: Add logic for recovering/appending to existing parquet file
    raise ValueError(f"{path=!r} already exists. Either delete this file or use a path that doesn't exist.")
This is a change from v3 behaviour, where the default would be to overwrite an existing file. Perhaps we can do that here too, and add an option to replace or append?
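The replace-or-append option suggested above could be shaped like this minimal sketch; the `mode` parameter and its values are assumptions, not the PR's actual API:

```python
from pathlib import Path

# Hypothetical sketch of the suggested option; "mode" and its values
# ("overwrite" / "append" / "error") are assumptions, not the PR's API.
def resolve_output_path(path, mode="overwrite"):
    path = Path(path)
    if path.exists():
        if mode == "overwrite":
            path.unlink()  # v3 behaviour: silently replace the existing file
        elif mode == "append":
            pass  # keep the file; the caller opens a writer that appends
        else:
            raise ValueError(f"{path!r} already exists. Either delete this file or use a path that doesn't exist.")
    return path
```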
@@ -298,7 +192,7 @@
)[0]
Is this section still necessary? If so, should we also fix the warning
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:175: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
np.less_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:180: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
& np.greater_equal(
/Users/erik/Codes/parcels/src/parcels/_core/particlefile.py:187: UserWarning: 'where' used without 'out', expect unitialized memory in output. If this is intentional, use out=None.
& np.equal(time, particle_data["time"], where=np.isfinite(particle_data["time"]))

I didn't worry too much about it before because I thought we would refactor this out anyway, but if we keep it we should also address the warning.
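One way to address that warning, sketched here with toy data rather than the actual particlefile code, is to preallocate the output so the masked-out elements have a defined value:

```python
import numpy as np

# Toy sketch (not the actual particlefile code): passing out= gives the
# elements skipped by where= a defined value, which is what the NumPy
# warning is asking for.
time = np.array([0.0, 1.0, np.nan, 3.0])
particle_time = np.array([0.0, 2.0, np.nan, 3.0])

mask = np.isfinite(particle_time)
same_time = np.equal(
    time,
    particle_time,
    where=mask,
    out=np.zeros(time.shape, dtype=bool),  # False wherever mask is False
)
```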
Parameters
----------
name : str
particleset :
    ParticleSet to output
This is not an argument to init?
# if len(indices_to_write) == 0:  # TODO: Remove this?
#     return
Probably a good idea to remove it, since it's commented out already?
- def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]:
+ def _get_calendar_and_units(time_interval: TimeInterval) -> dict[str, str]:  # TODO: Remove?
I don't think it's used anywhere in the codebase, so can indeed be removed
Why change the tests in the v3 directory? We normally don't touch these, since they are broken already (this also applies to the other two files).
def test_pfile_array_remove_particles(fieldset, tmp_zarrfile):
@pytest.mark.skip("Keep or remove? Introduced in 5d7dd6bba800baa0fe4bd38edfc17ca3e310062b")
I think it's an important test to keep in. Or does it fail now?
for i in range(npart):
    ds_timediff[i, :] = ds.time.values[i, :] - time[i]
np.testing.assert_equal(age, ds_timediff)
# df = pd.read_parquet(tmp_parquet)
remove stray debug statement?
@@ -187,7 +154,7 @@ def test_variable_written_once():
@pytest.mark.skip(reason="Pending ParticleFile refactor; see issue #2386")
Does this test work again? If not out of the box, perhaps we should discuss what we actually expect from output of a looped pset.execute...
Description
This PR introduces Parquet file writing to Parcels.
I still need to work on:
- parcels.read_particlefile(...) to streamline ingestion into cftime objects.
- The test_particlefile.py file - are there tests that are no longer needed? What would be the best testing approach here?

Posting as draft for initial feedback
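A possible shape for such a reader helper, sketched with the standard library (cftime would take the place of `datetime` here for non-standard calendars; the function name, signature, and epoch are assumptions, not the PR's actual API):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of what parcels.read_particlefile(...) could do with the
# time column: decode the float offsets written to Parquet back into datetime
# objects. The epoch and signature are assumptions, not the PR's actual API.
def decode_times(seconds, epoch=datetime(2000, 1, 1)):
    """Convert float second-offsets back into datetime objects."""
    return [epoch + timedelta(seconds=float(s)) for s in seconds]

decode_times([0.0, 3600.0])  # [epoch, epoch + 1 hour]
```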
Checklist
(main for normal development, v3-support for v3 support)

AI Disclosure