Research IT

Research IT provides research data and computing technologies, consulting, and community for the UC Berkeley campus. Our goal is to advance research through IT innovation.

Status and Service Updates

Wed, 3/25 - Thurs, 3/26: No office hours during spring break

Research IT will not hold office hours during spring break on Wednesday, 3/25, and Thursday, 3/26. In the meantime, please contact us by e-mail and we will be happy to assist you.

Savio Cluster: Login/compute nodes and Slurm issue

Over the past few days, we’ve seen intermittent issues where SSH connections to the login nodes have hung or been unusually slow. While working to mitigate the issue, we tested a newer kernel version. That change introduced MUNGE/Slurm authentication incompatibilities, leading to job submission failures and error messages when running Slurm commands. We are reverting to the previous kernel, and job submission and login node stability should be restored by this afternoon.

As a longer-term improvement, we are adding a login node to the current login pool, with the goal of making it available by early next week. We plan to continue expanding the login pool after that.

We also acknowledge ongoing issues with jobs hanging on compute nodes and with Open OnDemand sessions failing or hanging. Our team is working with VAST support to identify the root cause and resolve the problem as quickly as possible.

Wed, 2/18: Global Scratch Parallel File System Service Restoration

The global scratch parallel file system service has been restored and opened to general availability as of Tuesday, February 10th. This recovery followed a carefully managed, week-long staged approach that gradually reintroduced users, starting with those who had work deadlines and prioritizing low-I/O computing jobs. We have now shifted our focus to close, continuous monitoring and to implementing long-term stability measures to ensure the integrity and reliability of the file system. Please see our recent e-mail for more information about the current operational status, immediate risk mitigation and interventions, user best practices, and our ongoing and long-term stability strategy.

News Articles