This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content. The placement of this content is part of a paid service.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. For most large, AI-focused enterprises and AI clouds, failure-driven restarts are therefore inevitable, making them a major barrier to scaling AI’s impact.
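To put the quoted mean-time-to-failure figures in daily terms, the short sketch below converts each one into an expected number of failures per day. It assumes a steady failure rate around the clock, which is a simplification and not something the cited research is quoted as stating; the function name is illustrative.

```python
# MTTF figures quoted above, in hours, keyed by cluster size in GPUs.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in 24 hours, assuming a constant failure rate."""
    return 24.0 / mttf_hours


for gpus, mttf in MTTF_HOURS.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
# Prints about 3.0 failures per day at 1,024 GPUs
# and about 13.3 failures per day at 16,384 GPUs.
```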

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.
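The cost of a checkpoint rollback can be approximated with a simple model: if a failure is equally likely at any point between two checkpoints, the job loses half a checkpoint interval on average, plus whatever time it takes to reprovision and restart. The sketch below encodes that model. The interval, restart overhead, and failure count used in the example are hypothetical values chosen for illustration, not figures from Clockwork.io.

```python
def expected_loss_per_failure(checkpoint_interval_min: float,
                              restart_overhead_min: float) -> float:
    """Expected minutes lost per failure under checkpoint-restart.

    Assumes the failure lands uniformly at random between two checkpoints,
    so half an interval of work is discarded on average.
    """
    return checkpoint_interval_min / 2.0 + restart_overhead_min


def daily_loss(failures_per_day: float,
               checkpoint_interval_min: float,
               restart_overhead_min: float) -> float:
    """Expected minutes of training progress lost per day."""
    per_failure = expected_loss_per_failure(checkpoint_interval_min,
                                            restart_overhead_min)
    return failures_per_day * per_failure


# Hypothetical inputs: hourly checkpoints, 30 minutes to restart,
# and 3 failures per day.
print(daily_loss(3, 60, 30))  # 180.0
```

Under this model, checkpointing more often shrinks the rollback term but adds checkpoint overhead of its own, which is why restart cost is hard to remove by tuning alone.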

TorchPass addresses this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now service impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and better overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
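The three scenarios above can be read as a mapping from the triggering event to a migration mode. The sketch below illustrates only that mapping. The event names, the `MigrationMode` enum, and the `classify` function are all hypothetical and are not part of any TorchPass API described in this release.

```python
from enum import Enum


class MigrationMode(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


# Hypothetical event names, grouped by the scenarios listed above.
HARD_FAILURES = {"kernel_crash", "power_failure", "gpu_fault"}
EARLY_WARNINGS = {"rising_temperature", "ecc_memory_error"}
OPERATOR_ACTIONS = {"maintenance", "patching", "rebalancing"}


def classify(event: str) -> MigrationMode:
    """Map a triggering event to the migration mode that would handle it."""
    if event in HARD_FAILURES:
        return MigrationMode.UNPLANNED
    if event in EARLY_WARNINGS:
        return MigrationMode.PREEMPTIVE
    if event in OPERATOR_ACTIONS:
        return MigrationMode.PLANNED
    raise ValueError(f"unrecognized event: {event!r}")


print(classify("ecc_memory_error").value)  # pre-emptive
```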

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
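As a quick arithmetic check on these figures, the sketch below computes the reduction implied by going from three hours to ten minutes, and the remaining lost time implied by a 95% reduction. The final lines multiply the roughly three-minute recovery time by the failure rate implied by the 7.9-hour mean time to failure quoted earlier. That last calculation is only one possible reading: the release does not say how the ten-minute figure was derived.

```python
def reduction(before_min: float, after_min: float) -> float:
    """Fractional reduction in lost time."""
    return 1.0 - after_min / before_min


def remaining_after(before_min: float, reduction_fraction: float) -> float:
    """Lost time left after applying a fractional reduction."""
    return before_min * (1.0 - reduction_fraction)


BEFORE_MIN = 3 * 60  # "approximately three hours per day"

print(f"{reduction(BEFORE_MIN, 10):.1%}")          # 94.4%
print(f"{remaining_after(BEFORE_MIN, 0.95):.1f}")  # 9.0

# One possible reading (an assumption, not stated in the release):
# failures per day at a 7.9-hour MTTF, times about 3 minutes per recovery.
failures_per_day = 24 / 7.9
print(f"{failures_per_day * 3:.1f}")               # 9.1
```

A 95% reduction from 180 minutes leaves 9 minutes, which is consistent with "under ten minutes".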

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress tested Clockwork.io TorchPass and found that it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.
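To give the headline figure some scale, the sketch below normalizes the stated $6 million against the size of the deployment. It assumes the cluster is available 24 hours a day for a 365-day year, which is an assumption made here for illustration; because the release says "over $6 million", the results are lower bounds.

```python
GPUS = 2048
ANNUAL_SAVINGS_USD = 6_000_000   # "over $6 million", so a lower bound
HOURS_PER_YEAR = 365 * 24        # assumption: 24/7 operation, 365-day year

total_gpu_hours = GPUS * HOURS_PER_YEAR
savings_per_gpu = ANNUAL_SAVINGS_USD / GPUS
savings_per_gpu_hour = ANNUAL_SAVINGS_USD / total_gpu_hours

print(f"Total capacity: {total_gpu_hours:,} GPU-hours per year")    # 17,940,480
print(f"Savings per GPU per year: ${savings_per_gpu:,.2f}")         # $2,929.69
print(f"Savings per GPU-hour of capacity: ${savings_per_gpu_hour:.2f}")  # $0.33
```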

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

Information contained on this page is provided by an independent third-party content provider. XPRMedia and this Site make no warranties or representations in connection therewith. If you are affiliated with this page and would like it removed please contact pressreleases@xpr.media
