NCAR Model Manager (ncarmm)

Standing up WRF / MPAS on the cloud means rebuilding a deep, pinned dependency stack — compilers, MPI, and the parallel-I/O chain — from scratch, every time.

HDF5 ↔ NetCDF ↔ PnetCDF ↔ PIO

The real killer: all four must be built against the same MPI — one mismatch and you lose days.

Each project rebuilds this from scratch → weeks lost, undocumented tribal knowledge.
Scientists want to run science, not debug build toolchains.

Ease

One command to a cloud-ready, runnable model image.

No duplicate work

Pinned, reviewed build recipes shared across all projects.

Reproducible builds

Every dependency pinned by SHA / version, so the software stack rebuilds identically.

Portable

Same recipe targets AMIs today, containers next.
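The "pinned by SHA / version" rule above could be enforced with a small check before any build starts. This is only a sketch: the dependency records, field names (`version`, `sha`), and placeholder values are assumptions for illustration, since the deck does not show the real manifest.yaml schema.

```python
import re

# A pin is either an exact dotted version or a full 40-character commit SHA.
EXACT_VERSION = re.compile(r"^\d+(\.\d+)*$")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned(dependencies):
    """Return the names of dependencies that lack an exact version or full SHA."""
    bad = []
    for dep in dependencies:
        version = dep.get("version", "")
        sha = dep.get("sha", "")
        if not (EXACT_VERSION.match(version) or FULL_SHA.match(sha)):
            bad.append(dep["name"])
    return bad


# Hypothetical records; versions and SHAs are placeholders, not real pins.
deps = [
    {"name": "hdf5", "version": "1.2.3"},      # exact version: accepted
    {"name": "pio", "sha": "a" * 40},          # full SHA: accepted
    {"name": "pnetcdf", "version": "latest"},  # floating tag: rejected
    {"name": "netcdf", "sha": "abc123"},       # abbreviated SHA: rejected
]

print(unpinned(deps))  # ['pnetcdf', 'netcdf']
```

Rejecting floating tags and abbreviated SHAs up front is what lets a rebuild months later resolve to the same sources.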

ncarmm build mpas --version 8.3.1

Models

WRF 4.4.0 · WRF-Chem 4.4.0 · MPAS 8.3.1

Cloud

Amazon Web Services

Output

Ready-to-run machine images (AMIs) + AWS ParallelCluster reference configs.

CLI

list-models · list-images · build · delete

Scope: builds the model and its toolchain — input-data staging & output movement stay in the user's workflow.
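For readers who want to picture the command surface listed above, here is a minimal argparse sketch of the four subcommands. It is not the project's actual CLI code: the parser layout and the `wrf` model name are assumptions, while `mpas`, `wrfchem`, and `--version` come from the deck.

```python
import argparse


def make_parser():
    """Build a parser that mirrors the four subcommands named on the slide."""
    parser = argparse.ArgumentParser(prog="ncarmm")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("list-models", help="show models that can be built")
    sub.add_parser("list-images", help="show images that were already built")

    build = sub.add_parser("build", help="build a model image")
    # "wrfchem" has no hyphen on the command line; "wrf" is an assumed name.
    build.add_argument("model", choices=["wrf", "wrfchem", "mpas"])
    build.add_argument("--version", required=True)

    # Registered so the command exists, but it takes no action in this sketch.
    sub.add_parser("delete", help="not implemented in this sketch")
    return parser


args = make_parser().parse_args(["build", "mpas", "--version", "8.3.1"])
print(args.command, args.model, args.version)  # build mpas 8.3.1
```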

[Victor] All three models build end-to-end on AWS today, producing an AMI that ParallelCluster launches directly onto HPC-class instances (EFA-enabled nodes, FSx Lustre scratch). Two clarifications to have ready:
(1) WRF-Chem is the chemistry-enabled BUILD of the same WRF 4.4.0 core: we deliver the correctly compiled binary and its dependency stack. Emissions preprocessing and chemical-mechanism setup remain the user's responsibility. Don't let chemists assume the emissions toolchain ships with it.
(2) Scope honesty: the model binary is a third of the real problem. ICs/LBCs, static geog, and output egress are the user's workflow, typically FSx Lustre scratch + S3.
CLI nits for a live demo: the chem model is typed "wrfchem" (no hyphen); "delete" is registered but currently a stub, so don't demo an actual delete.

manifest.yaml Manifest Declares the model + every pinned dependency.

→

Image Builder + CloudFormation Automated build A right-sized EC2 instance compiles the parallel-I/O stack + model.

→

snapshot AMI Tagged & discoverable machine image.

→

ParallelCluster Run the model Launch the AMI on HPC-class nodes.

The build instance compiles the parallel-I/O stack (HDF5, PnetCDF, PIO), METIS, and the model, then snapshots the result as the AMI. Networking self-provisions by default; locked-down accounts can supply pre-created instance profiles instead.
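The compile step above has two ordering concerns: libraries must be built after the things they link against, and every MPI-linked library must use the same MPI. The sketch below shows one way to express both with the standard library. The dependency edges and the `built_with` records are assumptions for illustration, not the project's actual recipe.

```python
from graphlib import TopologicalSorter

# Each key is built after everything in its set. Edges are illustrative guesses.
BUILD_DEPS = {
    "mpi": set(),
    "hdf5": {"mpi"},
    "pnetcdf": {"mpi"},
    "netcdf": {"hdf5"},
    "pio": {"netcdf", "pnetcdf"},
    "metis": set(),
    "model": {"pio", "metis"},
}

# static_order() yields every node only after all of its dependencies.
order = list(TopologicalSorter(BUILD_DEPS).static_order())
print(order)


def mpi_mismatches(built_with, expected_mpi):
    """Return libraries whose recorded MPI differs from the expected one."""
    return sorted(lib for lib, mpi in built_with.items() if mpi != expected_mpi)


# Hypothetical build records; the MPI labels are placeholders.
built_with = {
    "hdf5": "mpi-A",
    "netcdf": "mpi-A",
    "pnetcdf": "mpi-B",  # the one mismatch that costs days
    "pio": "mpi-A",
}
print(mpi_mismatches(built_with, "mpi-A"))  # ['pnetcdf']
```

Several valid build orders exist, so the printed order may vary between valid orderings; only the "dependencies first" property is guaranteed.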

Prototype · proof-of-concept · not yet released

container_spec.yaml

single spec

→ Jinja2

Dockerfile

Apptainer .def · rootless --fakeroot

Both reuse the AMI build script compile_mpas.sh byte-for-byte — zero diff, CI drift-guarded.
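The deck does not say how CI guards against drift. One simple option, sketched here under that caveat, is to compare SHA-256 digests of the script each delivery path would run against the reference copy. The file names and script contents below are placeholders.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's exact bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def drifted_copies(reference, *copies):
    """Return the names of copies whose bytes differ from the reference."""
    expected = sha256_of(reference)
    return [Path(c).name for c in copies if sha256_of(c) != expected]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    ami = root / "ami_compile_mpas.sh"
    docker = root / "docker_compile_mpas.sh"
    apptainer = root / "apptainer_compile_mpas.sh"

    # Placeholder script bodies; one copy has been edited to simulate drift.
    ami.write_bytes(b"#!/bin/bash\nmake\n")
    docker.write_bytes(b"#!/bin/bash\nmake\n")
    apptainer.write_bytes(b"#!/bin/bash\nmake -j\n")

    drifted = drifted_copies(ami, docker, apptainer)

print(drifted)  # ['apptainer_compile_mpas.sh']
```

A CI job built on a check like this would fail whenever the list is non-empty.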

Validated single-node (real idealized forecast → NetCDF check).
Multi-node / EFA binding is the next step.

MPAS variable-resolution hexagonal mesh over Earth. MPAS builds as an AMI today, with a prototype container alongside.

[Victor] This is the "eliminate duplicate work" thesis made concrete: we did NOT fork the build for containers. A single container_spec.yaml renders both a Dockerfile and an Apptainer definition through Jinja2, and the compile step calls the exact same compile_mpas.sh the AMI pipeline uses (verified zero diff against main). A fix benefits both delivery paths at once.
• Smoke test is stronger than its name: it runs an actual idealized Jablonowski–Williamson baroclinic-wave forecast end-to-end (analytic init → MPI decomposition → forecast → valid NetCDF output). Be precise: that proves the binaries build/link correctly and the MPI workflow runs. It's build-integration, NOT scientific verification against observations.
• MPI honesty: the container ships its own (Rocky stock) OpenMPI, exercised single-node via mpirun. Multi-node on EFA/InfiniBand needs host-MPI binding/ABI compatibility, a known next step. The AWS AMI build used an EFA-tuned MPI; the container does not yet.
• Scope: container is MPAS core (atmosphere/init_atmosphere + ungrib); post-processing (MPASSIT/UPP/ESMF) is deliberately out. Keep the hedge strong: unmerged branch, not wired into the CLI.

A new model or version is a new manifest + recipe directory; the framework does the rest.

Plugin-style layout — each model / version is self-contained and loaded dynamically.
Adding one never touches the core.
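To make "loaded dynamically, never touches the core" concrete, here is a minimal sketch of directory-based discovery with importlib. The `<model>/<version>/recipe.py` layout and the `NAME` / `VERSION` attributes are hypothetical; the deck only says each model or version lives in its own manifest + recipe directory.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_recipes(root):
    """Import every <model>/<version>/recipe.py under root, keyed by (model, version)."""
    registry = {}
    for recipe in sorted(Path(root).glob("*/*/recipe.py")):
        version_dir = recipe.parent
        key = (version_dir.parent.name, version_dir.name)
        module_name = "recipe_" + "_".join(key).replace(".", "_")
        spec = importlib.util.spec_from_file_location(module_name, recipe)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # run the recipe file as its own module
        registry[key] = module
    return registry


with tempfile.TemporaryDirectory() as tmp:
    # Adding a model means adding a directory; load_recipes() itself is unchanged.
    for model, version in [("mpas", "8.3.1"), ("wrf", "4.4.0")]:
        target = Path(tmp) / model / version
        target.mkdir(parents=True)
        (target / "recipe.py").write_text(
            f"NAME = {model!r}\nVERSION = {version!r}\n"
        )
    registry = load_recipes(tmp)

print(sorted(registry))  # [('mpas', '8.3.1'), ('wrf', '4.4.0')]
```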

On the radar

CheMPAS · WRF-Hydro · FastEddy · New versions of existing models

Including private models

Bring-your-own manifest + recipe · Build-time auth token

Import the model registry, register a private manifest + recipe, and pass a token at build time to pull restricted source — run a proprietary model (e.g. WRF WxMod, a non-public FastEddy) on the framework without that code living here.

[David] Extensibility isn't a slogan: the repo has an actual documented checklist and loads models dynamically, so adding one doesn't touch the core. Roadmap models reflect real NCAR demand.
Accuracy note: the repo README currently lists CheMPAS + FastEddy; WRF-Hydro is our stated direction (and it carries its own coupling/NUOPC complexity). If WRF-Hydro is real, let's add it to the README so the spoken roadmap matches the repo.
Private-model angle: because models load as self-contained plug-ins, a downstream project can register its own manifest + recipe through the model registry and pass a build-time token to pull restricted source, so proprietary models (WRF WxMod, a non-public FastEddy) run on ncarmm without that code ever entering this repo. The registry hook is a near-term design item; present it as "the architecture already supports this," not "shipped."

CSP support is a pluggable interface — AWS is the first implementation. Adding a provider = implement two functions.

get_image_details() · build_image()

Implemented

Amazon Web Services

On the radar

Google Cloud · Microsoft Azure · Oracle · Other CSPs
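The two-function provider interface above might look roughly like the abstract base class below. Only the two method names come from the deck; the class names, argument lists, return values, and the in-memory stand-in provider are assumptions made so the sketch runs without any cloud account.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical base class: a provider implements exactly two methods."""

    @abstractmethod
    def build_image(self, manifest):
        """Build an image from a manifest and return its identifier."""

    @abstractmethod
    def get_image_details(self, image_id):
        """Return metadata for a previously built image."""


class InMemoryProvider(CloudProvider):
    """Stand-in provider that records builds in a dict instead of calling a cloud."""

    def __init__(self):
        self._images = {}

    def build_image(self, manifest):
        image_id = f"img-{len(self._images) + 1}"
        self._images[image_id] = {
            "model": manifest["model"],
            "version": manifest["version"],
        }
        return image_id

    def get_image_details(self, image_id):
        return self._images[image_id]


provider = InMemoryProvider()
image_id = provider.build_image({"model": "mpas", "version": "8.3.1"})
print(image_id, provider.get_image_details(image_id))
# img-1 {'model': 'mpas', 'version': '8.3.1'}
```

Because both methods are abstract, a provider that forgets one of them fails at instantiation rather than midway through a build.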

01

Cluster lifecycle

Start / stop compute clusters on demand — the clearest cost-control story.

02

Containers, first-class

Container output alongside AMIs, growing out of the MPAS prototype.

03

More models + clouds

Continued breadth across the roadmap — new models, new providers.

Cost honesty: for cloud NWP the bill is dominated by EFA HPC instances + FSx Lustre + data egress — not idle time alone.

Today

GitHub repository

Direct pip install from a Git / URL ref. Not on PyPI.

Planned

NCAR-internal PyPI

A clean pip install ncarmm for NCAR staff.

Possible future

Open source

On the table — gauging the room's appetite.

Repository: github.com/NCAR/ncarmm

ncarmm: build once, run where NCAR computes — cloud today, on-prem container next.

Solves duplicated cloud-model effort across projects.
Working today: WRF, WRF-Chem, MPAS on AWS.
Extensible by design: more models, more clouds, containers.

Let's discuss

What models / clouds do you need?

Would containers help your HPC workflow?

Interest in open source?

Not presented — reference for the co-leads.

Do cloud runs reproduce our Derecho results bit-for-bit?

No, and that's expected. We reproduce the build (pinned sources/compiler/libs). Forecasts differ at round-off across hardware, AVX paths, and MPI rank counts — same as any MPI code.
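A toy illustration of why rank count alone changes results: floating-point addition is not associative, so splitting a sum into per-rank partial sums can change the answer. The values below are chosen to exaggerate the effect; real forecasts differ at round-off, not by whole units. The function name and the contiguous-chunk decomposition are illustrative, not how any particular MPI library reduces.

```python
import math
from functools import reduce
from operator import add


def reduce_like_mpi(values, ranks):
    """Sum values as contiguous per-rank partial sums, then add partials in rank order."""
    chunk = -(-len(values) // ranks)  # ceiling division
    partials = [
        reduce(add, values[i:i + chunk], 0.0)
        for i in range(0, len(values), chunk)
    ]
    return reduce(add, partials, 0.0)


# Adding 1.0 to 1e16 is lost to rounding, so the grouping decides what survives.
values = [1e16, 1.0, -1e16, 1.0]

one_rank = reduce_like_mpi(values, 1)
two_ranks = reduce_like_mpi(values, 2)
exact = math.fsum(values)  # correctly rounded reference sum

print(one_rank, two_ranks, exact)  # 1.0 0.0 2.0
```

Explicit left-to-right addition is used instead of the built-in `sum()`, because recent Python versions compensate float sums and would hide the effect.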

Did you validate the MPAS container?

It passes an end-to-end idealized (Jablonowski–Williamson) forecast with NetCDF output checks — a build-integration test. Real-data IC/LBC validation against observations is future work.
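The deck does not describe what the NetCDF output checks actually inspect. As a minimal sketch of the cheapest possible check, the code below only confirms that a file starts with a recognized NetCDF signature: `CDF` plus a version byte for the classic family, or the HDF5 signature for NetCDF-4. It looks at offset 0 only and says nothing about the contents. The function name and the sample files are illustrative.

```python
import tempfile
from pathlib import Path

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"                       # NetCDF-4 files are HDF5
CLASSIC_SIGNATURES = (b"CDF\x01", b"CDF\x02", b"CDF\x05")   # classic, 64-bit offset, CDF-5


def netcdf_kind(path):
    """Classify a file by its leading bytes; return None if it is not NetCDF-like."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(HDF5_SIGNATURE):
        return "netcdf4-hdf5"
    if head[:4] in CLASSIC_SIGNATURES:
        return "classic"
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fake files holding only header bytes, enough to exercise the check.
    (root / "cdf5.nc").write_bytes(b"CDF\x05" + b"\x00" * 16)
    (root / "nc4.nc").write_bytes(HDF5_SIGNATURE + b"\x00" * 16)
    (root / "empty.nc").write_bytes(b"")

    kinds = {p.name: netcdf_kind(p) for p in sorted(root.iterdir())}

print(kinds)
# {'cdf5.nc': 'classic', 'empty.nc': None, 'nc4.nc': 'netcdf4-hdf5'}
```

A real smoke test would go further, for example opening the file with a NetCDF library and checking that expected variables and time steps are present.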

Does WRF-Chem come with the emissions preprocessors?

No. We deliver the chem-enabled WRF binary + dependency stack; emissions / mechanism setup is application-specific and stays with the user.

Multi-node on EFA / InfiniBand?

Validated single-node today. Multi-node needs host-MPI/EFA binding (the classic hybrid-Apptainer model) — known next step.

How do I get data in/out, and what's the cost?

Input staging + output egress are the user's workflow (FSx Lustre + S3). Cost is driven by EFA HPC instances, Lustre, and egress; starting and stopping clusters on demand targets only the idle-compute slice.

Is it on PyPI / open source?

Not yet on PyPI (GitHub + URL install today; internal PyPI planned). Open source is possible, not committed.

Can we run a proprietary / non-public model (e.g. WRF WxMod, a private FastEddy)?

Yes, by design. A downstream project imports the model registry, registers its own manifest + Image Builder recipe, and supplies a build-time auth token (OAuth / PAT) to pull the restricted source. The proprietary code never lives in ncarmm — the framework just builds & runs it. Registry hook is a near-term design item.