We would like to provide an update on the restoration and long-term stability of the global scratch parallel file system on the Savio supercomputing cluster.

On Friday, January 23, 2026, the scratch file system became completely unavailable on the Savio cluster's login, data transfer, and compute nodes. The cause was overutilization of the scratch disk space, both in total data size and in file count (roughly 10 million files or more), which exhausted the metadata targets (MDTs) and object storage targets (OSTs).

During the recovery period, we worked with users who requested early access to the system and made measurable progress in reducing inode usage, so the file system is no longer at risk of entering a critical state. On Tuesday, February 10, 2026, the global scratch parallel file system service was successfully restored and opened to general availability. This recovery followed a carefully managed, week-long staged approach that gradually reintroduced users with work deadlines and prioritized low-I/O computing jobs.

We have now shifted our focus to close, continuous monitoring and to implementing long-term stability measures that protect the integrity and reliability of the file system going forward.
Current Operational Status
- Inode Utilization (File Count): Utilization of the fullest metadata target (MDT) has been reduced from 100% to 85%.
- Object Storage Target (OST) Occupancy: Utilization of the two fullest OSTs has been reduced from 100% to 92%.
- Job Queue: Over 400 jobs are actively running, with approximately 3,000 jobs currently queued for computing resources.
Immediate Risk Mitigation and Interventions
To address any signs of instability, the following operational interventions are in place:
- Proactive Usage Monitoring: Continuously identifying and engaging with top resource users to manage load and prevent bottlenecks.
- Resource Throttling: If necessary, limiting available computing resources to immediately reduce system-wide I/O load.
- Emergency Stop Capability: Retaining the ability to stop all active jobs as a final emergency measure to protect file system integrity and prevent data loss.
User Best Practices
In addition to providing users with regular system status updates, we are consistently communicating best practices, emphasizing the critical need to:
- Clean up intermediate files immediately upon job completion.
- Optimize their I/O workflows to minimize system strain.
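As one illustration of the first practice, the sketch below keeps a job's intermediate files in a per-job directory that is deleted automatically when the job finishes, and writes a single consolidated result file instead of many small ones. It is a minimal sketch, not a required workflow: the function name `run_job`, the `chunk_*.tmp` file names, and the directory arguments are hypothetical and should be adapted to your own jobs.

```python
import tempfile
from pathlib import Path


def run_job(scratch_root: Path, results_dir: Path) -> Path:
    """Keep intermediates in a per-job directory that is removed on exit."""
    results_dir.mkdir(parents=True, exist_ok=True)
    # TemporaryDirectory deletes the directory and everything in it when the
    # "with" block exits, including when the job body raises an exception.
    with tempfile.TemporaryDirectory(prefix="job_", dir=scratch_root) as tmp:
        work = Path(tmp)
        # Hypothetical intermediate files produced by the job.
        for i in range(3):
            (work / f"chunk_{i}.tmp").write_text(f"partial result {i}\n")
        # Combine the intermediates into one final file so that only a
        # single file remains after the job completes.
        final = results_dir / "result.txt"
        with final.open("w") as out:
            for part in sorted(work.glob("chunk_*.tmp")):
                out.write(part.read_text())
    return final
```

Calling `run_job(Path("/path/to/your/scratch"), Path("/path/to/results"))` leaves only `result.txt` behind; both paths here are placeholders, and the scratch directory must already exist.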
Ongoing and Long-Term Stability Strategy
A comprehensive, multi-phased strategy is being executed to ensure the file system's long-term stability, efficiency, and scalability:
- Ongoing Cleanup: Targeted Inactive User Data Migration
  - Archiving approximately 900 TB of inactive user data by migrating it from Savio scratch to temporary storage, specifically Cloudian storage and GPFS at SRDC.
- System Upgrade: DDN Exascaler 6.3
  - Using professional services to upgrade the system software from version 5.2 to the latest version, 6.3 (which includes Prometheus integration), to significantly enhance system logging and management capabilities.
- Monitoring & Management: DDN Insight Tool
  - Acquiring and implementing the DDN monitoring tool "Insight" for more in-depth and efficient system surveillance.
- User Compliance: Soft Quota System
  - Establishing a soft quota system to provide an automated "sanity check" and compliance reporting, while preserving "unlimited space" for active computing tasks.
- Storage Automation: Starfish Purge Policy (In Development)
  - Using the Starfish tool to scan, tag, and automate the reporting, deletion, and migration of non-compliant global scratch data. A significant expansion of the existing Starfish license will be required.
- Tier 2 Storage: Rental Planning
  - Planning for adjacent storage capacity, such as Cloudian, to serve as a designated Tier 2 storage solution, offered as a paid option for users whose needs preclude the use of temporary scratch storage.
- Additional Storage
  - Evaluating the purchase of additional hard disk storage to provide greater operational capability.
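To make the soft quota idea concrete, the sketch below shows what an automated "sanity check" could look like in principle: it counts files and bytes under a directory and reports which soft limits are exceeded, without blocking any writes. This is an illustration only and does not describe the system being implemented. The names `Usage`, `scan`, and `soft_quota_report`, and any limit values passed in, are hypothetical. A production check would most likely rely on the file system's own accounting or a tool such as Starfish rather than walking the directory tree, since a full tree walk is itself metadata-intensive.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Usage:
    files: int   # number of regular directory entries counted
    bytes: int   # total apparent size in bytes


def scan(root: Path) -> Usage:
    """Count files and total size under root (illustrative only)."""
    files = 0
    size = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file disappeared during the scan; skip it
            files += 1
    return Usage(files, size)


def soft_quota_report(usage: Usage, max_files: int, max_bytes: int) -> list[str]:
    """Return warnings for exceeded soft limits; never blocks anything."""
    warnings = []
    if usage.files > max_files:
        warnings.append(f"file count {usage.files} exceeds soft limit {max_files}")
    if usage.bytes > max_bytes:
        warnings.append(f"size {usage.bytes} bytes exceeds soft limit {max_bytes}")
    return warnings
```

Because the limits are soft, the report only flags usage for follow-up; a job that temporarily exceeds them keeps running.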
Thank you for your patience during this unprecedented time. We remain committed to providing cutting-edge and stable high-performance computing systems for UC Berkeley's research and teaching needs.