Self-Hosted AI · AI Hardware · Open Source · Medical AI · 14 min read

Bringing a Petaflop to the Clinic: How We Use NVIDIA DGX Spark for Self-Hosted Medical AI


Simon Schertenleib

2025-12-11


When you work with medical data, “just call a cloud API” stops being a serious option very quickly. Privacy, regulation, cost and vendor risk all push in the same direction: more control, more locality, fewer external dependencies.

That is why we invested in the NVIDIA DGX Spark, a desktop-scale Grace Blackwell system that can run and fine-tune large models locally with up to a petaflop of FP4 AI compute in a 240 W box on a desk.

In this article:

  • About the DGX Spark
  • What Makes It Special For Self-Hosted AI
  • How We Use It For Medical Models
  • What We Have Learned So Far
  • What Comes Next
  • Closing Thoughts

About the DGX Spark

DGX Spark is NVIDIA’s attempt to condense data-center-style AI capability into something closer to a high-end desktop system. It is built around the GB10 Grace Blackwell superchip, where an Arm CPU and a Blackwell GPU share a single coherent memory space.

Hardware overview

At a high level:

| Component   | Detail                                                              |
|-------------|---------------------------------------------------------------------|
| CPU         | 20-core Arm (10 performance cores, 10 efficiency cores)             |
| GPU         | Blackwell generation, 5th-gen Tensor Cores with FP4 support         |
| Memory      | 128 GB LPDDR5X unified memory, shared by CPU and GPU                |
| Storage     | 4 TB NVMe SSD with hardware encryption                              |
| Networking  | ConnectX-7 NIC (high speed), plus 10 GbE, Wi-Fi 7, Bluetooth, USB-C |
| Power       | 240 W external supply, GB10 TDP around 140 W                        |
| Form factor | 150 × 150 × 50.5 mm, about 1.2 kg                                   |

The key phrase here is unified memory. There is no split between system RAM and GPU VRAM: CPU and GPU see the same 128 GB address space. That matters a lot once you start working with 70B-class models and long contexts.
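To make that concrete, here is a rough back-of-the-envelope estimate of why a quantized 70B model plus a long-context KV cache fits into 128 GB. This is a sketch under simplifying assumptions, not a sizing tool; the layer and head counts below assume a Llama-3-70B-like layout and are not a statement about any specific model we run.

```python
# Rough memory estimate for a 70B-parameter decoder-only model.
# Illustrative only: real footprints depend on the architecture,
# quantization overhead, activations and the serving engine.
PARAMS = 70e9

def weights_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

print(f"FP16 weights : {weights_gb(2.0):5.0f} GB")  # ~140 GB, does not fit
print(f"8-bit weights: {weights_gb(1.0):5.0f} GB")  # ~70 GB
print(f"4-bit weights: {weights_gb(0.5):5.0f} GB")  # ~35 GB

# Naive FP16 KV-cache estimate, assuming a Llama-3-70B-like layout:
# 80 layers, 8 KV heads, head dimension 128, keys plus values.
layers, kv_heads, head_dim, bytes_per_value = 80, 8, 128, 2

def kv_cache_gb(context_len: int, sessions: int) -> float:
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_len * sessions / 1e9

print(f"KV cache, 32k context, 4 sessions: {kv_cache_gb(32_768, 4):4.1f} GB")  # ~43 GB
```

On a 24 or 48 GB card the weights alone already force sharding or aggressive offloading; with 128 GB of unified memory the same model plus a generous KV cache fits on one device.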

Software stack

DGX Spark ships with:

  • DGX OS
    A Linux distribution that is tuned for AI workloads.

  • NVIDIA AI software stack

    • CUDA and cuDNN
    • Frameworks such as PyTorch, pre-integrated in containers
    • NVIDIA Triton and TensorRT-LLM for inference
    • NeMo, LLaMA Factory, SGLang, vLLM and other tooling available through NVIDIA-supplied “playbooks”
  • Integration points

    • NVIDIA AI Enterprise
    • DGX Cloud and larger DGX systems for scale out

The practical consequence is that the device behaves like an AI appliance. You get to useful workloads quickly instead of spending weeks on drivers and CUDA versions.


What Makes It Special For Self-Hosted AI

Pain points with cloud-first AI in healthcare

If you build standard SaaS features, hosted LLM APIs are perfectly acceptable. In healthcare, the constraints are different:

  1. Regulation and privacy

    Moving protected health information into third-party APIs is often a non-starter, especially across jurisdictions. Even when it is allowed, the compliance burden is heavy.

  2. Cost volatility

    Usage-based API pricing looks cheap in early prototypes. At real clinical volume, you can end up with a monthly bill that dwarfs your own infrastructure cost.

  3. Vendor and data lock-in

    Once your prompt templates, evaluation stack and workflows are tied to a single API provider, you are effectively locked into their roadmap and pricing.

  4. Latency and reliability

    Critical workflows must not depend on a remote service having a good day. Local inference is much easier to reason about from a reliability point of view.

Where DGX Spark fits

DGX Spark fills a very specific gap:

  • You want serious local AI capacity without a full rack or a multi-GPU cluster.
  • You need large memory capacity, because medical models with long contexts do not fit comfortably into typical 24 or 48 GB GPUs.
  • You want full control of data and models while keeping integration effort reasonable.

In practice, DGX Spark gives you:

  • A petaflop-class AI box at FP4 with enough memory to host large open models locally.
  • A fully integrated NVIDIA AI stack so you can run and fine tune models without assembling your own environment from scratch.
  • A clear path from local prototype to scaled deployment, because the same stack is available on DGX Cloud and larger DGX systems.

If you only care about raw throughput per euro, a custom workstation can be more economical. If you care about self-hosted LLMs in privacy-sensitive environments, the combination of unified memory, software stack and form factor is the reason to consider Spark.


How We Use It For Medical Models

We use DGX Spark as a self-hosted LLM node in our infrastructure, with a strong focus on medical and biomedical use cases.

We treat it as:

  • A local model server for clinicians and internal tools.
  • A fine-tuning node for domain-specific models.
  • A secure sandbox for experimentation that never leaves our environment.

System architecture

On the DGX Spark we typically run:

  • Base system

    • DGX OS
    • NVIDIA container runtime
  • Model serving

    • SGLang- or vLLM-based model servers behind an OpenAI-compatible HTTP API (see the client sketch after this list)
    • Open WebUI as an internal frontend so teams can use models through a browser
  • Training and fine tuning

    • Containers for PyTorch, NeMo, LLaMA Factory, Unsloth and similar tools
    • Parameter-efficient fine-tuning (PEFT) setups for LLMs
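As a concrete illustration of the OpenAI-compatible serving layer, here is a minimal client sketch. The hostname, model identifier and prompts are placeholders for this example, not our actual configuration; authentication in our setup is handled by the gateway and SSO rather than an API key.

```python
# Minimal client sketch against a local OpenAI-compatible endpoint
# (vLLM and SGLang both expose this style of API).
from openai import OpenAI

client = OpenAI(
    base_url="http://dgx-spark.internal:8000/v1",  # hypothetical internal hostname
    api_key="unused-locally",                      # auth handled upstream by the gateway
)

response = client.chat.completions.create(
    model="meditron-70b-awq",  # placeholder model identifier
    messages=[
        {"role": "system",
         "content": "You are a clinical documentation assistant. "
                    "Your answers are drafts that must be reviewed by clinicians."},
        {"role": "user",
         "content": "Summarize the key findings in this discharge note: ..."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Because applications only see this API shape, we can swap the underlying model without touching client code.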

The device is deployed as a hardened node:

  • No direct outbound internet access.
  • Access controlled through SSO and roles.
  • Audit logging for prompts and responses (a simplified record sketch follows after this list).
  • Network isolation consistent with how we treat other systems that interact with sensitive data.
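To show what such an audit record can look like, here is a simplified sketch. The real logging lives in the gateway; the field names, the hashing choice and the JSONL file below are illustrative assumptions.

```python
# Illustrative audit record for a prompt/response pair. Fields, hashing and
# storage are simplified assumptions; the real logging sits in the API gateway.
import hashlib
import json
import time
import uuid

def audit_record(user_id: str, model: str, prompt: str, response: str) -> dict:
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,  # resolved via SSO
        "model": model,
        # Hashes let us prove what was sent without keeping sensitive text in
        # the general log; raw text, if kept at all, goes to a restricted store.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode()).hexdigest(),
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }

with open("audit.jsonl", "a", encoding="utf-8") as f:
    record = audit_record("u-1234", "meditron-70b-awq", "example prompt", "example response")
    f.write(json.dumps(record) + "\n")
```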

Model portfolio

We run a mix of general and medical-specific language models.

General backbones

We use recent open models, such as those of the Llama 3 class and others from major vendors, as general reasoning backbones. They provide good tooling compatibility and strong generic performance.

DGX Spark has enough unified memory to run very large models for inference. That keeps our options open for future generations.

Medical-specific models

For medical and biomedical work we rely on models that were trained or fine-tuned on domain-specific corpora, for example:

  • Meditron series

    Models adapted from Llama-style bases to medical texts such as PubMed articles, clinical guidelines and QA datasets.

  • Clinical conversational models

    For example, models similar in spirit to Clinical Camel that are fine-tuned via QLoRA on medical QA and dialogue.

  • Biomedical literature models

    Models in the BioGPT or BiomedGPT family for scientific literature and multi-modal tasks.

All of these are used strictly as decision support and research tools. They do not replace clinicians and they are never treated as standalone diagnostic or treatment systems.

For 70B-class models we rely on quantization and adapter-based fine-tuning. DGX Spark’s 128 GB of shared memory allows us to keep such a model loaded while still running long contexts and multiple concurrent sessions.
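A sketch of what loading a quantized 70B-class model with a long context looks like with vLLM’s offline API. The model path, quantization format and limits are assumptions for illustration; in practice we run the server variant behind our gateway rather than this in-process form.

```python
# Sketch: loading a 4-bit-quantized 70B-class model with a long context in vLLM.
# Model path, quantization format and limits are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/meditron-70b-awq",  # placeholder path to a quantized checkpoint
    quantization="awq",                # must match how the checkpoint was quantized
    max_model_len=32_768,              # room for long clinical documents
    gpu_memory_utilization=0.85,       # leave headroom for the OS and other services
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(
    ["Summarize the following clinical guideline section: ..."], params
)
print(outputs[0].outputs[0].text)
```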

Data and fine tuning workflow

We keep the pipeline intentionally boring and conservative. The risk is in the data and in evaluation, not in clever optimizer tricks.

1. Data curation and de-identification

  • We work with de-identified clinical text such as notes and reports, combined with public medical QA and benchmark datasets.
  • Automated de-identification is combined with manual sampling. Linkage between de-identified and original data stays in separate, highly restricted systems and is never used during model training (a deliberately simplistic rule-based sketch follows below).
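The following toy sketch only illustrates the shape of a rule-based scrubbing pass. The patterns are invented for this example and would be nowhere near sufficient on their own, which is exactly why we combine dedicated tooling with manual sampling.

```python
# Toy rule-based de-identification pass. Illustrative only: production
# de-identification combines dedicated tools, model-based NER and manual review.
import re

PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,}\b", re.IGNORECASE),
    "PHONE": re.compile(r"\+?\d[\d ()/-]{7,}\d"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scrub(text: str) -> str:
    """Replace every match with its category label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient seen on 03.02.2025, MRN 12345678, contact +41 44 123 45 67."
print(scrub(note))
# -> Patient seen on [DATE], [MRN], contact [PHONE].
```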

2. Task design

Typical tasks include:

  • Summarization of long clinical notes and radiology or pathology reports.
  • Question answering over clinical guidelines, pathways and internal SOPs.
  • Triage of free text into structured categories, for example routing messages or classifying cases.
  • Transformation tasks such as turning free text into structured templates or patient-friendly explanations.

We deliberately avoid training models to output direct diagnostic or treatment recommendations.
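As an illustration of the triage and transformation tasks above, here is a sketch of prompting the local model for a fixed-category classification with strict JSON output. The categories, prompt wording and model identifier are invented for this example, and the schema deliberately contains no diagnostic field.

```python
# Sketch: triaging a free-text message into fixed categories with JSON output.
# Categories, prompt wording and model name are invented for this example.
import json
from openai import OpenAI

client = OpenAI(base_url="http://dgx-spark.internal:8000/v1", api_key="unused-locally")

CATEGORIES = ["appointment_request", "prescription_refill",
              "report_question", "administrative", "other"]

prompt = (
    "Classify the following patient message into exactly one of these categories: "
    f"{', '.join(CATEGORIES)}. "
    'Respond with JSON only, in the form {"category": "...", "rationale": "..."}.\n\n'
    "Message: Could you please send my latest lab report to my GP?"
)

raw = client.chat.completions.create(
    model="meditron-70b-awq",  # placeholder
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
).choices[0].message.content

result = json.loads(raw)                  # reject anything that is not valid JSON
assert result["category"] in CATEGORIES   # reject anything outside the fixed set
print(result)
```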

3. Fine tuning approach

Our default strategy is:

  • Start from a strong, general or medical base model.
  • Use parameter-efficient fine-tuning methods such as LoRA or QLoRA (a minimal sketch follows after this list).
  • Train only a small set of adapter parameters on DGX Spark.
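Here is a minimal sketch of attaching LoRA adapters with the peft library. The base model identifier, target modules and hyperparameters are illustrative assumptions, not our production configuration.

```python
# Sketch: attaching LoRA adapters to a causal LM with peft.
# Base model, target modules and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; larger bases work the same way
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    r=16,                     # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the total parameters
```

Only the adapter weights are trained and shipped; the base checkpoint stays untouched, which keeps experiments cheap and reproducible on a single box.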

For heavy experiments such as very long-context fine-tuning on large datasets, DGX Spark serves as the prototyping node. Once a run is promising, we move that workload to larger multi-GPU systems. The code and Docker images transfer without major changes, because we stay within the NVIDIA ecosystem.

4. Serving and integration

After fine-tuning:

  • Models are packaged as Docker images with SGLang or vLLM as the serving engine.
  • These services expose an HTTP API through our internal gateway.
  • Open WebUI and other internal tools talk to those APIs.

From the outside, applications see a stable API and do not need to care whether the underlying model is a base, a fine-tuned variant or a new generation that we are experimenting with.

Example medical use cases on DGX Spark

Guideline-aware research assistant

Use case: help clinicians and researchers query long and complex clinical guidelines.

Architecture:

  1. Guidelines and SOPs are ingested into a vector store that stays behind the firewall.
  2. A local LLM on DGX Spark is configured with retrieval-augmented generation.
  3. A user asks a question such as “What are the recommended anticoagulation options for a patient with atrial fibrillation and stage 4 CKD?”.
  4. The system retrieves the most relevant sections, then the model synthesizes an answer with explicit references.

We constrain prompts and templates so that the model stays close to the retrieved text and clearly cites sources instead of hallucinating.
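A simplified sketch of the prompt construction behind that constraint; the retrieval step is abstracted away, and the guideline excerpts and section identifiers are invented for this example.

```python
# Sketch: building a retrieval-grounded prompt that forces source citation.
# The retriever is abstracted away; excerpts and section identifiers are invented.
retrieved = [
    {"source": "AF Guideline 2024, section 7.3",
     "text": "In severe renal impairment, the following agents are preferred: ..."},
    {"source": "AF Guideline 2024, section 7.5",
     "text": "Dose adjustment is required when creatinine clearance falls below ..."},
]

question = ("What are the recommended anticoagulation options for a patient "
            "with atrial fibrillation and stage 4 CKD?")

context = "\n\n".join(f"[{chunk['source']}]\n{chunk['text']}" for chunk in retrieved)

prompt = (
    "Answer the question using ONLY the excerpts below. "
    "Cite the source identifier in square brackets after every claim. "
    "If the excerpts do not contain the answer, say so explicitly.\n\n"
    f"Excerpts:\n{context}\n\nQuestion: {question}"
)
# `prompt` is then sent to the local model via the OpenAI-compatible API shown earlier.
print(prompt)
```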

Biomedical knowledge graph extraction

Use case: structured extraction of relationships from biomedical literature.

Pipeline:

  1. Full text articles are fed into an LLM on DGX Spark.
  2. The model is prompted to output structured triples such as drug-target-effect relations or disease-gene associations.
  3. The triples are validated, cleaned and loaded into a graph database.
  4. Research teams use that graph for hypothesis generation and analysis.

This workload benefits from the large unified memory for long documents, and from the fact that all processing stays inside our environment.
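A sketch of the validation step in that pipeline: anything that does not parse as JSON, or that uses an unknown relation type, is dropped before it reaches the graph database. The relation vocabulary and record shape are illustrative assumptions.

```python
# Sketch: validating model-extracted triples before they enter the graph database.
# Relation vocabulary and record shape are illustrative assumptions.
import json

ALLOWED_RELATIONS = {"inhibits", "activates", "associated_with", "treats"}

def parse_triples(model_output: str) -> list[tuple[str, str, str]]:
    """Keep only well-formed (subject, relation, object) triples with known relations."""
    try:
        candidates = json.loads(model_output)
    except json.JSONDecodeError:
        return []  # malformed output is discarded, never silently repaired
    triples = []
    for item in candidates:
        s, r, o = item.get("subject"), item.get("relation"), item.get("object")
        if s and o and r in ALLOWED_RELATIONS:
            triples.append((s.strip(), r, o.strip()))
    return triples

raw = ('[{"subject": "Drug X", "relation": "inhibits", "object": "Kinase Y"},'
       ' {"subject": "Drug X", "relation": "cures", "object": "Disease Z"}]')
print(parse_triples(raw))  # the unknown relation "cures" is filtered out
```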

Safety, evaluation and boundaries

Because this is healthcare, we enforce explicit limits:

  • Models are decision support tools, not authorities. Their outputs are suggestions and drafts that must be reviewed by qualified professionals.
  • We benchmark on public medical NLP datasets and maintain internal evaluation sets, but we treat benchmark scores as a floor, not a green light.
  • We run internal red teaming to probe hallucinations, bias, outdated knowledge and unsafe suggestions.
  • Any feature that touches diagnosis, treatment or prognosis is clearly labelled and wrapped in a user experience that keeps the human in control.

DGX Spark helps here because it keeps the entire stack under our control. We know precisely which model is running, what it was trained on and how it has changed over time.


What We Have Learned So Far

What works well

Unified memory changes the workflow

Having 128 GB of unified memory for CPU and GPU simplifies LLM work:

  • Large models fit without complicated sharding.
  • Long contexts and multiple concurrent requests are less fragile.
  • Data movement between CPU and GPU is less of a bottleneck.

For our workloads, that matters more than another 10 or 20 percent of peak FLOPs.

The appliance experience is real

Because NVIDIA ships DGX Spark with DGX OS, CUDA, containers and playbooks, time to first useful model is short.

We still had to harden the system, integrate it with SSO, monitoring and our CI pipeline, but we did not burn engineering weeks chasing driver bugs or compute capability mismatches.

Office-friendly form factor

In clinical and research environments, noise and power draw are not minor details. A quiet 240 W box that sits next to a workstation is much easier to get approved than a loud rack server.

This matters when you want AI close to practitioners instead of hidden in a remote data center.

Trade-offs and limitations

It is not a magic box and it is not the right tool for every job.

Cost and value

DGX Spark sits at a price point where, for the same money, you can absolutely build a custom workstation with more raw FP16 or FP32 throughput.

If your only objective is maximum tokens per second per euro, a home-built system with a large gaming GPU will win. The case for Spark is about unified memory, maturity of the software stack and operational simplicity in sensitive environments.

You need to be clear about that trade-off before you buy.

Single-accelerator limits

DGX Spark is one GB10 superchip. It is not a multi GPU cluster.

For workloads such as:

  • Full LLM pre-training.
  • Very large multi-modal models.
  • High-throughput production inference at scale.

you still want larger systems. We see Spark as a lab bench and integration node, not as the only compute we will ever need.

Arm ecosystem friction

The Arm CPU is efficient, but a non-trivial amount of tooling in the Python and DevOps ecosystem still assumes x86.

Most of the core AI stack is covered by NVIDIA’s containers, but as soon as you wander into niche packages, you start to see:

  • Missing binary wheels.
  • Build-from-source delays.
  • Occasional bugs that only show up on Arm.

This is manageable, but you have to factor it into your planning.

Networking reality

Spark advertises very high NIC speeds for clustering. In practice, real-world throughput is lower than the headline number and is constrained by the PCIe link.

For us, with a small number of boxes and 10 GbE connectivity into the rest of the network, it is good enough. If you are imagining a 32-node Spark cluster as your main training platform, you should run the numbers and look at alternatives.


What Comes Next

We are still early in our adoption of DGX Spark, but the roadmap is already filling up.

Multi-modal medical assistants

We are exploring assistants that can combine:

  • Free text clinical notes.
  • Imaging metadata.
  • Structured lab values and vitals.

Heavy training will continue to happen on larger systems, but inference, fine-tuning on local data and workflow prototyping fit well on DGX Spark, particularly when data cannot leave the institutional environment.

Clustering and cross site deployments

Tying two Sparks together is interesting for:

  • Larger models that need more memory.
  • Experiments with distributed inference and failover.
  • Standardising a “small cluster” pattern that can be deployed at different sites.

The goal is not to squeeze every last token per second out of the hardware, but to nail the operational patterns that will be required in regulated deployments.

Stronger evaluation pipelines

We are investing in evaluation along three axes:

  1. Public medical benchmarks for a first sanity check.
  2. Internal datasets that reflect actual clinical workflows.
  3. Human-in-the-loop review with specialists who can push models into corner cases.

DGX Spark allows us to run these evaluations on the same hardware that will be used in pilots, which keeps surprises to a minimum.
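A minimal sketch of how a run over an internal evaluation set against the local endpoint can be structured; the dataset format, model identifier and the naive substring metric are simplified placeholders, not our actual evaluation stack.

```python
# Sketch: scoring a small internal QA evaluation set against the local endpoint.
# Dataset format, model name and the naive substring metric are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://dgx-spark.internal:8000/v1", api_key="unused-locally")

def answer(question: str) -> str:
    resp = client.chat.completions.create(
        model="meditron-70b-awq",  # placeholder
        messages=[{"role": "user", "content": question}],
        temperature=0.0,
        max_tokens=128,
    )
    return resp.choices[0].message.content.strip().lower()

# Each line: {"question": "...", "expected": "..."}
with open("internal_eval.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

hits = sum(ex["expected"].lower() in answer(ex["question"]) for ex in examples)
print(f"substring match: {hits}/{len(examples)}")
```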

Path to production-grade systems

DGX Spark is our closest-to-the-user AI node. For production, we expect to rely on a mix of:

  • Larger DGX systems on premises.
  • DGX Cloud for bursty or exploratory work.
  • Standard Kubernetes-based deployments for serving models.

Because the software stack is consistent across these environments, we can treat Spark as the “inner loop” for experimentation and integration, with a clean path to scaling up.

We already have multiple projects lined up with clinical partners where DGX Spark will sit at the center of the experimentation phase. The details belong in their own case studies, but the direction is clear.


Closing Thoughts

DGX Spark does not solve the hard parts of medical AI. It does not clean your data, design your evaluation plan or write your clinical integration strategy.

What it does provide is:

  • A serious AI system that fits in normal workspaces.
  • Enough unified memory to run the models that actually matter in 2026.
  • An integrated software stack that cuts out weeks of undifferentiated engineering work.
  • A clear migration path from a single lab node to larger deployments.

For us, that is enough to make it strategically important. In healthcare, outsourcing your core AI capability to external APIs is a short term convenience and a long term liability. Owning your models and your infrastructure is how you keep control of your future stack.

DGX Spark is not the only way to do that, and it is not the right choice for every organization. For teams that need self hosted medical AI with real models, real constraints and real accountability, it is a very strong starting point.
