Spinup-Evaluation

Accelerating Spinup

Ocean spinup is slow: upwards of 1000 simulation years to reach an equilibrium state!

Many well-validated machine learning emulators exist for ocean modelling, but:
- How do we know whether they are physically correct?
- Errors can easily propagate during rollout.

Our focus is on the physical evaluation of ocean models:
- Including equation-resolving models, pure AI emulators and hybrid approaches.

Implement sustainable software that is easy to extend, robust and reusable.
- Generalised evaluation benchmark for climate.

NEMO/DINO ocean model

/vopikamm/DINO

We use the DINO configuration (Kamm, Deshayes, and Madec 2025) for development purposes.

Is RMSE enough?

[Figure panels: Unconstrained, Constrained, Reference]

| Metric | Unconstrained | Constrained | Reference |
|---|---|---|---|
| \(T\) DW | 2.9 | 2.6 | 2.6 |
| \(\rho\) error | 26.8 | 0.4 | 0.4 |

Evaluate even when resolution changes.

Meunier et al. (2025), Gorce et al. (2025)
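
As an illustration of why a pointwise score can hide unphysical states, here is a minimal sketch. It is not the method of the cited works: the density profiles, the function names and the choice of "density inversions" as the physical check are all assumptions made for this example. Two hypothetical candidates have identical RMSE against the reference, yet only one stays stably stratified.

```python
import numpy as np

def rmse(a, b):
    """Root-mean-square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def count_density_inversions(rho):
    """Count levels where density decreases with depth (index 0 = surface)."""
    return int(np.sum(np.diff(rho) < 0))

# Hypothetical reference density profile (kg/m^3), surface to bottom.
reference = np.array([1024.0, 1025.0, 1026.0, 1027.0, 1028.0])

# Candidate A: uniform offset, so the column stays stably stratified.
candidate_a = reference + 1.0

# Candidate B: same error magnitude, but alternating sign creates inversions.
candidate_b = reference + np.array([1.0, -1.0, 1.0, -1.0, 1.0])

rmse_a = rmse(candidate_a, reference)
rmse_b = rmse(candidate_b, reference)
inv_a = count_density_inversions(candidate_a)
inv_b = count_density_inversions(candidate_b)

print(f"RMSE A={rmse_a:.2f}, B={rmse_b:.2f}; inversions A={inv_a}, B={inv_b}")
```

RMSE cannot separate the two candidates, while the physical check does.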

Physical evaluation

Metrics

Figure from "Learning to Generate Physical Ocean States" (Meunier et al. 2025).

Bring in physical knowledge of the problem, i.e. metrics specific to tracking equilibrium progress.
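
One possible equilibrium-tracking metric is the drift of a volume-mean quantity: near equilibrium, its linear trend should approach zero. The sketch here is an assumption for illustration, not a metric confirmed to be in the package. The function names, the synthetic temperature field and the uniform cell volumes are hypothetical, and land masking is ignored.

```python
import numpy as np

def volume_mean(field, cell_volume):
    """Volume-weighted mean over space; field has shape (time, z, y, x)."""
    weights = cell_volume / cell_volume.sum()
    return (field * weights).sum(axis=(1, 2, 3))

def linear_drift(series, years):
    """Least-squares slope of a time series, in units per year."""
    slope, _intercept = np.polyfit(years, series, 1)
    return float(slope)

# Hypothetical setup: 10 annual snapshots on a 2 x 3 x 4 grid of unit cells.
years = np.arange(10.0)
cell_volume = np.ones((2, 3, 4))

# Synthetic temperature that warms by 0.01 degC per year everywhere.
temperature = 4.0 + 0.01 * years[:, None, None, None] * np.ones((1, 2, 3, 4))

mean_t = volume_mean(temperature, cell_volume)
drift = linear_drift(mean_t, years)
print(f"Volume-mean temperature drift: {drift:.4f} degC/year")
```

A spinup could be judged closer to equilibrium as this drift falls below a chosen tolerance; that tolerance is a modelling decision not specified here.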

No ground truth

Evaluation-Benchmark

/m2lines/nemo-spinup-forecast

AI emulator or traditional time integrator of physical quantities.

AI Emulator / Hybrid

Ocean Model

Physical Evaluation

Physical evaluation of instantaneous and temporal fields [GridT, restart].nc

Translator Toolbox

- Create restart state from AI emulator
- Simple checks
- ‘Glue’ module for future time-stepping

Evaluation-Benchmark

How do we evaluate hybrid ocean emulators?

PILLAR 1:
- Compute metrics at end time only.
- Compute metrics during rollout.

PILLAR 2:
- Feed the emulated state back into the ocean model to check stability and evaluate metrics over the time history.

PILLAR 3:
- Sample from a distribution of possible emulated states.
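
A rough sketch of how Pillars 1 and 3 could look in code, under stated assumptions: the `metric` function (a plain domain mean), the array layout and the random ensemble are hypothetical stand-ins, not the package's API. Pillar 2 needs an actual ocean model run and is not sketched.

```python
import numpy as np

def metric(state):
    """Hypothetical scalar diagnostic: the domain mean of a state."""
    return float(np.mean(state))

def evaluate_rollout(rollout):
    """Pillar 1: metric at every step of a rollout, plus its end-time value."""
    history = np.array([metric(state) for state in rollout])
    return history, history[-1]

def evaluate_ensemble(rollouts):
    """Pillar 3: end-time metric across sampled rollouts, as mean and spread."""
    finals = np.array([evaluate_rollout(r)[1] for r in rollouts])
    return float(finals.mean()), float(finals.std())

rng = np.random.default_rng(0)
# Hypothetical ensemble: 5 sampled rollouts, 4 time steps, 3 x 3 grid.
ensemble = rng.normal(loc=10.0, scale=0.1, size=(5, 4, 3, 3))

history, final = evaluate_rollout(ensemble[0])
mean_final, spread_final = evaluate_ensemble(ensemble)
print(f"end-time metric: {final:.3f}; ensemble: {mean_final:.3f} +/- {spread_final:.3f}")
```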

Pillars 1, 2 and 3 are enabled by two software components:

/m2lines/spinup-evaluation

version 0.2

Developer highlights

Spinup-evaluation (1)

- Computes metrics on raw emulator outputs (.npy), instantaneous (restart.nc) and time-averaged (grid.nc) output files, plus statistical comparisons.
- Strict mapping of ocean variables to restart, grid (output) and mesh_data names.
- Provides a framework for easily adding new metrics.
- Implements temporal downsampling: all possible metrics are computed even if data are output at different cadences.
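
The last two points can be sketched as follows. This is an illustrative assumption, not the package's actual interface: the `register_metric` decorator, the `METRICS` registry and the `downsample` helper are hypothetical names. New metrics are added by decorating a function, and a higher-cadence series is block-averaged to a coarser cadence before metrics are applied.

```python
import numpy as np

# Hypothetical registry mapping metric names to functions.
METRICS = {}

def register_metric(name):
    """Decorator that adds a function to the metric registry."""
    def decorator(func):
        METRICS[name] = func
        return func
    return decorator

@register_metric("mean")
def mean_metric(series):
    return float(np.mean(series))

@register_metric("range")
def range_metric(series):
    return float(np.max(series) - np.min(series))

def downsample(series, factor):
    """Block-average a 1-D series by an integer factor, dropping any remainder."""
    n_blocks = len(series) // factor
    trimmed = np.asarray(series[: n_blocks * factor], dtype=float)
    return trimmed.reshape(n_blocks, factor).mean(axis=1)

# Hypothetical inputs: 24 monthly values and 2 annual values of one quantity.
monthly = np.arange(24, dtype=float)
annual = np.array([5.5, 17.5])

# Bring the monthly series to annual cadence, then apply every registered metric.
monthly_as_annual = downsample(monthly, 12)
results = {name: func(monthly_as_annual) for name, func in METRICS.items()}
print(results)
```

A registry keeps the evaluation loop unchanged when a metric is added; only a new decorated function is needed.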

[Diagram: reference files, restart.nc, output.nc, mmask.nc, raw.npy → Spinup Evaluation → metrics.csv]

Spinup-restart/translator (2)

- Takes an emulator state and converts it to a NEMO-compatible restart.nc file.
- Basic validation gate: performs grid encoding and sanity checks (e.g. presence of NaNs).
- Regrids, upscales or downscales resolution.
- Encodes other physical assumptions (e.g. geostrophic velocities).
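
A validation gate of this kind might look like the sketch here. It is a minimal example under assumptions: the `validate_state` function, the field names and the grid shape are hypothetical, and only shape and finiteness are checked.

```python
import numpy as np

def validate_state(fields, expected_shape):
    """Return a list of problems found; an empty list means the state passes."""
    problems = []
    for name, values in fields.items():
        if values.shape != expected_shape:
            problems.append(f"{name}: shape {values.shape} != {expected_shape}")
        if not np.all(np.isfinite(values)):
            problems.append(f"{name}: contains NaN or infinite values")
    return problems

# Hypothetical emulator states on a 2 x 3 x 4 grid.
shape = (2, 3, 4)
good = {"temperature": np.full(shape, 4.0), "salinity": np.full(shape, 35.0)}
bad = {"temperature": np.full(shape, 4.0), "salinity": np.full(shape, 35.0)}
bad["salinity"][0, 0, 0] = np.nan  # inject a single invalid value

good_problems = validate_state(good, shape)
bad_problems = validate_state(bad, shape)
print(good_problems, bad_problems)
```

Returning a list of problems rather than raising on the first one lets a caller report every failed check before deciding whether to write a restart file.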

[Diagram: reference files, restart.nc, output.nc, mmask.nc, raw.npy; Spinup Translator → Spinup Evaluation → metrics.csv; ocean models: NEMO, Oceananigans, MOM6]

Translation between other ocean models' restart files is potentially difficult because of differing grids (in progress).

Software engineering

Modular & Generalizable

Reusable components across domains
Extensible to new models (e.g., MOM6, Oceananigans)

Quality & Standards

Robust testing (Unit, Integration, Coverage) with CI
Well-documented end-to-end examples (hybrid)
Data products adhere to FAIR principles (see Wilkinson et al. 2016)

Collaborative Design: Bridging Science & RSE

graph TD
    A(Scientific Expertise) --- C{RSE Bridge}
    B(Software Engineering) --- C
    C --> D[Sustainable Tools]
    
    classDef default fill:#1a365d,stroke:#fff,stroke-width:2px,color:#fff,font-size:18px;
    classDef highlight fill:#84BD00,stroke:#fff,stroke-width:2px,color:#fff,font-size:18px;
    classDef bridge fill:#003E51,stroke:#fff,stroke-width:2px,color:#fff,font-size:18px;

    class A,B default;
    class C bridge;
    class D highlight;
    
    linkStyle default stroke:#fff,stroke-width:2px;

Eliminating silos through cross-disciplinary partnership.

Conclusions

Key Takeaways

Extend evaluation of ocean models:
- Evaluation of AI emulators must go beyond simple RMSE metrics.
- Enable evaluation in conjunction with equation-resolving techniques.

Software-enabled research:
- Research software engineering is a critical component of research, bringing generalisability, extensibility and robustness to software products.

Next Steps

Extend to other ocean models and AI emulators
Translation between different ocean checkpoint files

Thanks for Listening

Get in touch:

Matt Archer

Slides: ma595.github.io/spineval-agu

The ICCS received support from

/m2lines/spinup-evaluation

References

Gorce, Blandine, Luther Ollier, David Kamm, and Etienne Meunier. 2025. “Physically Consistent Sampling for Ocean Model Initialization.” In Advances in Neural Information Processing Systems. Vol. 38. https://neurips.cc/virtual/2025/loc/san-diego/poster/126936.

Kamm, D., J. Deshayes, and G. Madec. 2025. “DINO: A Diabatic Model of Pole-to-Pole Ocean Dynamics to Assess Subgrid Parameterizations Across Horizontal Scales.” EGUsphere 2025: 1–26. https://doi.org/10.5194/egusphere-2025-1100.

Meunier, Etienne, David Kamm, Guillaume Gachon, Redouane Lguensat, and Julie Deshayes. 2025. “Learning to Generate Physical Ocean States: Towards Hybrid Climate Modeling.” arXiv. https://doi.org/10.48550/arXiv.2502.02499.

Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. “The FAIR Guiding Principles for Scientific Data Management and Stewardship.” Scientific Data 3 (1): 160018. https://doi.org/10.1038/sdata.2016.18.