ACCESS Newswire

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

11.3.2026 14:00:00 CET | ACCESS Newswire | Press release


New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ™ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

"Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant," said Suresh Vasudevan, CEO of Clockwork.io. "We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure."

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

"As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra's NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable," said Patel. "TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics."

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI's impact.
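To put the mean-time-to-failure figures above in daily terms, the following minimal sketch converts each quoted value into an expected number of failures per day. It assumes failures arrive at a constant average rate, which is a simplification introduced here and not something the cited research is claimed to state; the names `MTTF_HOURS` and `failures_per_day_by_cluster` are hypothetical and used only for illustration.

```python
# MTTF figures quoted in the release (hours), keyed by cluster size in GPUs.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in 24 hours, ASSUMING a constant failure rate."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return 24.0 / mttf_hours


def failures_per_day_by_cluster() -> dict[int, float]:
    """Expected daily failures for each quoted cluster size, rounded to 0.1."""
    return {
        gpus: round(expected_failures_per_day(mttf), 1)
        for gpus, mttf in MTTF_HOURS.items()
    }


if __name__ == "__main__":
    for gpus, per_day in failures_per_day_by_cluster().items():
        print(f"{gpus:>6} GPUs: about {per_day} failures per day")
```

Under that assumption, the quoted figures work out to roughly three failures per day at 1,024 GPUs and roughly thirteen per day at 16,384 GPUs.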

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass addresses this problem by handling costly AI workload failures proactively, resolving them before the job stops or needs to restart. Vital for enterprises running large AI workloads and AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. For AI clouds, which can now address impacted GPUs while preserving the training run as planned, this translates into better customer SLAs and overall AI cloud economics, improving their ability to protect margin and deliver new models sooner.

"Managing compute output across large-scale GPU clusters is vital to ensuring we're delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations," said David Power, CTO of Nscale. "In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale."

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. Recovery typically completes in about three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
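The 95% figure can be checked against the two time figures in the same sentence. The sketch below is only an arithmetic consistency check using the numbers quoted above; the constant and function names are hypothetical, and "approximately three hours" is treated as exactly 180 minutes for the purpose of the calculation.

```python
MINUTES_PER_DAY = 24 * 60

# Figures quoted in the release for a 1,024-GPU cluster.
BASELINE_LOST_MINUTES = 3 * 60   # "approximately three hours per day"
REDUCTION_PERCENT = 95           # "reduces wasted training progress by 95%"


def remaining_lost_minutes(baseline_minutes: int, reduction_percent: int) -> float:
    """Lost minutes per day that remain after the stated percentage reduction."""
    return baseline_minutes * (100 - reduction_percent) / 100


def lost_share_of_day(lost_minutes: float) -> float:
    """Fraction of a 24-hour day represented by the lost minutes."""
    return lost_minutes / MINUTES_PER_DAY


if __name__ == "__main__":
    after = remaining_lost_minutes(BASELINE_LOST_MINUTES, REDUCTION_PERCENT)
    print(f"Lost time after reduction: {after} minutes per day")
    print(f"Share of day lost before: {lost_share_of_day(BASELINE_LOST_MINUTES):.2%}")
    print(f"Share of day lost after:  {lost_share_of_day(after):.2%}")
```

A 95% reduction of 180 minutes leaves 9 minutes per day, which is consistent with the "under ten minutes" stated above.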

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis' independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

"In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective," concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on investment in GPUs, which today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io's prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io's Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. Because modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, the infrastructure behaves as a single system in which reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC, March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world's most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork


