Research IT provides research data and computing technologies, consulting, and community for the UC Berkeley campus. Our goal is to advance research through IT innovation.

Status and Service Updates

Savio Cluster: Login/compute nodes and Slurm issue

Over the past few days, SSH connections to the login nodes have intermittently hung or been unusually slow. To mitigate the issue, we tested a newer kernel version, but that change introduced MUNGE/Slurm authentication incompatibilities, leading to job submission failures and error messages when running Slurm commands. We are reverting to the previous kernel, and job submission and login node stability should be restored by this afternoon. As a longer-term improvement, we are adding a login node to the current login pool, with the goal of making it available by early next week, and we plan to continue expanding the pool.

We also acknowledge ongoing issues with jobs hanging on compute nodes and with Open OnDemand sessions failing or hanging. Our team is working with VAST support to identify the root cause and resolve the problem as quickly as possible.

Wed, 2/18: Global Scratch Parallel File System Service Restoration

The global scratch parallel file system service has been successfully restored and opened to general availability as of Tuesday, February 10th. This recovery followed a carefully managed, week-long staged approach that gradually reintroduced users with work deadlines and prioritized low-I/O computing jobs. We have now shifted our focus to close, continuous monitoring and to implementing long-term stability measures that protect the integrity and reliability of the file system going forward. Please see our recent e-mail for more information about the current operational status, immediate risk mitigation and interventions, user best practices, and our ongoing and long-term stability strategy.

Tues, 2/10: Savio update to general availability

We are excited to announce that the /global/scratch file system is now available. The temporary Slurm reservation used during the phased onboarding has been removed, and no action is needed in Slurm. You no longer need to include a reservation in your job scripts; please submit jobs normally. We will resume processing new user account requests, but will do so in batches, slowly and steadily, to ensure the system remains stable as usage ramps up.
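If you added a reservation line to several job scripts during the phased onboarding, a small helper along the following lines may make cleanup easier. This is only a sketch, not an official Research IT tool: the function name `strip_reservation` and the reservation name `example_reservation` are hypothetical, and it assumes the reservation was given on its own `#SBATCH --reservation=...` line rather than combined with other options.

```python
import re

# Matches a directive line such as "#SBATCH --reservation=NAME" or
# "#SBATCH --reservation NAME". Assumption: the option sits on its own line.
_RESERVATION_LINE = re.compile(r"^\s*#SBATCH\s+--reservation(=|\s+)\S+\s*$")


def strip_reservation(script_text: str) -> str:
    """Return the job script text without #SBATCH --reservation lines."""
    kept = [
        line
        for line in script_text.splitlines()
        if not _RESERVATION_LINE.match(line)
    ]
    # Preserve a trailing newline if the original text had one.
    trailing = "\n" if script_text.endswith("\n") else ""
    return "\n".join(kept) + trailing


# Hypothetical job script; "example_reservation" is a placeholder name.
example = """#!/bin/bash
#SBATCH --job-name=test
#SBATCH --reservation=example_reservation
#SBATCH --time=00:10:00
echo hello
"""

cleaned = strip_reservation(example)
print(cleaned)
```

Review the result before resubmitting; deleting the line by hand in an editor works just as well for a single script.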

Mon, 2/9/2026: Savio recovery update

We are pleased that the /global/scratch file system is no longer in a critical state. However, elements of the storage system remain fragile at >95% utilization, leaving little operational headroom. System health metrics are trending strongly in the right direction. We have successfully onboarded the majority of users who requested early access, and the system is demonstrating excellent stability. To request immediate access to the system, please see our recent emails. Please note that at this time we are unable to create new user accounts.

News Articles