
Partnership
By Olivier Truong
Plug into the AI omnicloud with a SkyPilot YAML
It’s been a longstanding request from the community, and we’re excited to finally announce that SkyPilot workloads can now run on Mithril. You can run jobs, serve models, or execute large-scale inference while benefiting from the consumption flexibility offered by Mithril.
SkyPilot started at UC Berkeley's Sky Lab and is now a thriving open-source framework for running AI workloads across clouds, Kubernetes, and on-prem clusters through a single interface.
Mithril complements this on the supply side. It turns globally distributed GPU infrastructure into a fluid, market-driven resource pool. Instead of capacity being either stranded or over-subscribed, pricing and allocation on Mithril adjust in real time to match supply with demand. Mithril achieves this by aggregating capacity into a unified marketplace.
From the recently released flexible reservations to cost-efficient spot capacity, Mithril gives teams two ways to procure compute, depending on whether they prioritize certainty or flexibility.
For guaranteed access, reservations let you secure capacity ahead of time. Unlike traditional reservations, which require committing to fixed usage, Mithril reservations let you give capacity back when you don’t need it and earn usage credit as a result.
Spot capacity is allocated through a real-time auction, giving teams access to available GPUs at market-clearing prices and letting you trade off cost against consistency of access. With a very low limit price, your workloads run only when the market price is very low, ensuring excellent value (at the cost of having to wait longer for completion). A high limit price, on the other hand, ensures you’ll get capacity (while paying only the resulting market price, not your limit price).
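To make the limit-price trade-off concrete, here is a minimal Python sketch of a toy hourly market. It is an illustration only, not Mithril's actual auction mechanism: the price series and the function name `simulate_limit_price` are invented, and the rule that a bid runs whenever the market price is at or below the limit is a simplifying assumption. The one behavior taken from the description above is that a running job pays the market price, not its limit.

```python
def simulate_limit_price(market_prices, limit_price):
    """Toy model: return (hours_run, total_cost) for one instance.

    Assumption: the job runs in every hour whose market price is at or
    below the limit, and is billed that hour's market price.
    """
    hours_run = 0
    total_cost = 0.0
    for price in market_prices:
        if price <= limit_price:
            hours_run += 1
            total_cost += price  # billed at market price, not the limit
    return hours_run, total_cost


# Hypothetical hourly market prices in dollars (made up for illustration).
prices = [1.0, 3.0, 0.5, 4.0, 2.0, 0.5]

low_bid = simulate_limit_price(prices, limit_price=1.0)
high_bid = simulate_limit_price(prices, limit_price=5.0)

print("low limit :", low_bid)
print("high limit:", high_bid)
```

In this toy series, the low limit runs for only half of the hours but at a much lower average price, while the high limit runs every hour and still pays less than its cap in each of them.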
The result is immediate access to a broader pool of high-performance GPUs, delivered through the familiar SkyPilot interface.
What Mithril Brings to SkyPilot
Training: secure capacity and return unused capacity for credits
With reservations, you can set up a Kubernetes cluster on top of your reserved resources and interact with them through SkyPilot, giving your whole team a single interface to share capacity across projects. The downside of a traditional reservation is that you pay for capacity whether you use it or not. On Mithril, you earn credits when you return capacity, which can significantly improve the economics of your training runs.
Training: achieve dramatic cost reductions for preemptible runs
At off-peak times, H100s and B200s can be had for as low as $0.01/hr. With SkyPilot and Mithril's developer tooling, you can make your training run preemptible by handling preemption gracefully: checkpoint your training run to cloud buckets and resume automatically, without manual intervention. This lets you launch a job that absorbs available, cost-effective capacity over days or weeks.
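The checkpoint-and-resume pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SkyPilot's or Mithril's tooling: the helper names (`load_step`, `save_step`, `train`) are hypothetical, a temporary local directory stands in for a mounted cloud bucket, and the "training step" is a placeholder counter. Preemption is simulated by returning early.

```python
import json
import os
import tempfile


def load_step(ckpt_path):
    """Return the last completed step, or 0 when no checkpoint exists."""
    if not os.path.exists(ckpt_path):
        return 0
    with open(ckpt_path) as f:
        return json.load(f)["step"]


def save_step(ckpt_path, step):
    """Write to a temp file, then rename, so a partial write is never read.

    os.replace is atomic on local POSIX filesystems; a bucket mount may
    behave differently, so treat this as an assumption.
    """
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, ckpt_path)


def train(ckpt_path, total_steps, preempt_at=None):
    """Resume from the last checkpoint; stop at preempt_at to mimic preemption."""
    step = load_step(ckpt_path)
    while step < total_steps:
        if preempt_at is not None and step == preempt_at:
            return step  # simulated preemption: the process just stops
        step += 1  # placeholder for one real training step
        save_step(ckpt_path, step)
    return step


# A local temp directory stands in for a mounted bucket path.
ckpt = os.path.join(tempfile.mkdtemp(), "state.json")

first_run = train(ckpt, total_steps=10, preempt_at=4)  # interrupted
resumed_from = load_step(ckpt)                         # where the retry starts
second_run = train(ckpt, total_steps=10)               # runs to completion
```

Because the job reads its starting point from the checkpoint every time it launches, the same command can be relaunched after each preemption without any manual bookkeeping.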
Inference: buy right of first refusal on capacity for your upcoming launch
Demand during product launches can be unpredictable. Traditionally, companies are forced to over-provision, wasting capital and hoarding resources, or to under-provision and risk failing to capture the moment. On Mithril, you can buy reserved capacity, scale up or down and get credits back for unused GPUs. With SkyPilot, scaling that capacity across regions or clouds requires no adjustments to your workflow.
Inference: burst on the spot market when demand spikes
A unique feature of Mithril spot is the ability to outbid the market to acquire capacity when supply is tight everywhere else. With SkyPilot's cross-cloud capabilities, you can run on your reserved capacity, whether on Mithril or another cloud, and then, with minimal changes to your workflow, turn to Mithril's spot market at a moment's notice to meet a surge in demand.
Batch inference: passively grab cheap capacity
SkyPilot makes workloads portable across providers, and through Mithril's unique auction-based spot market, popular GPUs can be available for as low as $0.01/hr at off-peak hours. On Mithril, you can name a limit price and let capacity come to you. When prices drop, during off-peak hours or periods of excess supply, your jobs start automatically, capturing low-cost GPU capacity without manual intervention.
Run your first workload on Mithril with SkyPilot
The steps below are enough to get up and running quickly.
Install and set up the Mithril CLI
uv tool install -U --refresh mithril-client
ml setup
Install SkyPilot
uv tool install --with pip "skypilot-nightly[mithril]"
Run a test job
# task.yaml
resources:
  infra: mithril
  accelerators: B200:8  # An 8x B200 instance

# Maximum hourly price you're willing to pay for
# the instance.
# Due to auction-based pricing, you often pay less
# than this cap.
config:
  mithril:
    limit_price: 32.00  # Equivalent to $4.00/GPU-hour

# Command that executes your code; runs on the
# cluster every time you launch or exec.
run: |
  nvidia-smi
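The `limit_price` in the task above is expressed per instance, so it helps to convert from the per-GPU price you have in mind. The short Python sketch below does that arithmetic and derives an upper bound on spend. The function names are hypothetical, and the bound assumes the limit applies per instance per hour for the whole run, as the comment in the YAML suggests; actual billing follows the market price and is typically lower.

```python
def instance_limit_price(per_gpu_hour, gpus_per_instance):
    """Convert a per-GPU hourly price into a per-instance limit price."""
    return per_gpu_hour * gpus_per_instance


def worst_case_cost(limit_price, num_instances, hours):
    """Upper bound on spend if every hour cleared exactly at the limit."""
    return limit_price * num_instances * hours


# $4.00 per GPU-hour on an 8x B200 instance, as in task.yaml.
limit = instance_limit_price(4.00, 8)

# Bound for one instance running a full day at the cap.
bound = worst_case_cost(limit, num_instances=1, hours=24)

print(f"limit_price: {limit:.2f}")
print(f"24h worst case: {bound:.2f}")
```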
Launch your job
❯ sky launch -c mithril-test task.yaml
YAML to run: task.yaml
Considered resources (1 node):
--------------------------------------------------------------------------------------------
 INFRA                     INSTANCE          vCPUs   Mem(GB)   GPUS     COST ($)   CHOSEN
--------------------------------------------------------------------------------------------
 Mithril (us-central5-a)   neb-b200.sxm.8x   160     1792      B200:8   0.08       ✔
--------------------------------------------------------------------------------------------
Launching a new cluster 'mithril-test'. Proceed? [Y/n]:
Distributed training
# multi-node.yaml
# Sync this directory so your code and data are
# available on the cluster
workdir: .

resources:
  infra: mithril
  accelerators: B200:8

num_nodes: 2

# Maximum hourly price you're willing to pay for
# the instance.
# Due to auction-based pricing, you often pay less
# than this cap.
config:
  mithril:
    # Equivalent to $4.00/GPU/hour on an 8x instance.
    limit_price: 32.00

# Runs once when the cluster is first created
# (install deps, download data, etc.)
setup: |
  uv pip install -r requirements.txt
  source .venv/bin/activate

# Command that executes your code; runs on the
# cluster every time you launch or exec.
run: |
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Starting distributed training, head node: $MASTER_ADDR"
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=${SKYPILOT_NODE_RANK} \
    --master_addr=$MASTER_ADDR \
    --master_port=8008 \
    train.py
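If you want to sanity-check the rendezvous settings from inside `train.py`, the same values can be derived from the SkyPilot environment variables used in the `run` section. The Python sketch below is illustrative: the function name `rendezvous_info` and the example IPs are made up, and the global-rank formula assumes the static, homogeneous launch shown above (the same number of processes on every node).

```python
def rendezvous_info(env):
    """Derive head address, world size, and this node's global ranks.

    Expects the SkyPilot variables used in the run section above.
    """
    node_ips = env["SKYPILOT_NODE_IPS"].split()  # one IP per line
    num_nodes = int(env["SKYPILOT_NUM_NODES"])
    gpus_per_node = int(env["SKYPILOT_NUM_GPUS_PER_NODE"])
    node_rank = int(env["SKYPILOT_NODE_RANK"])

    first_rank = node_rank * gpus_per_node
    return {
        "master_addr": node_ips[0],  # same choice as `head -n1`
        "world_size": num_nodes * gpus_per_node,
        "global_ranks": list(range(first_rank, first_rank + gpus_per_node)),
    }


# Hypothetical values for the second node of the 2 x 8 GPU job.
example_env = {
    "SKYPILOT_NODE_IPS": "10.0.0.1\n10.0.0.2",
    "SKYPILOT_NUM_NODES": "2",
    "SKYPILOT_NUM_GPUS_PER_NODE": "8",
    "SKYPILOT_NODE_RANK": "1",
}

info = rendezvous_info(example_env)
print(info)
```

On a real cluster you would pass `os.environ` instead of `example_env`; a world size that differs from the one your training script reports is a quick sign that a node failed to join.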