Status and Service Updates

View all status and service updates below 

Savio Lustre File System Returned to Service: Fri, 3/1

The Savio Lustre parallel file system has been returned to service; the Savio scratch file system is accessible once again.

Savio Lustre File System is Down: Wed, 2/28

The Lustre parallel file system (i.e., the Savio scratch file system) is down, possibly due to a cabling problem. An investigation is underway, and we will let you know of any progress we make. A reservation is in place to prevent any new jobs from starting. We appreciate your patience.

Savio Downtime Postponed: Fri, 2/16

We are postponing the Savio downtime that was scheduled for Feb 20. Following the two back-to-back unplanned power outages on campus, the data center has shifted its focus to installing new UPS systems. Once a new downtime date is determined, we will send out a notification with enough lead time for you to prepare.

Savio Cluster Returned to Service: Wed, 2/14

Power and systems have been restored at the data center (Earl Warren Hall) and the Savio supercluster is back online and accessible.

Data Center Power Outage: Wed, 2/14

We have been notified that there is a power outage in the data center. Without power, the Savio HPC services are down. We will let you know as soon as there is new information and when we expect services to be available.

Savio Cluster Returned to Service: Fri, 2/9

Power and systems have been restored at the data center (Earl Warren Hall) and the Savio supercluster is back online and accessible. Jobs have started running, and Open OnDemand and Globus are also back in service.

Data Center Power Status: Thursday, 2/8

The Savio cluster is offline to accommodate the scheduled power work at the data center. However, the power generator test failed, and the data center lost power. Work is underway to restore power. Without power, the return of the Savio HPC services will be delayed. We will let you know if there is any new information.

Savio Downtime: Feb 8 & Feb 20

We apologize for the short notice: there will be a power outage in Earl Warren Hall on Thursday, 2/8, which will result in Savio downtime starting at 10 AM. We will bring the Savio cluster back online the same day, once the power work is complete. A scheduler reservation is in place to ensure that no jobs run after 10 AM on Thursday, 2/8. If you plan to submit jobs, please request a walltime that allows them to complete before the downtime (see the sketch below); otherwise, your jobs will wait in the queue until the cluster is back online. The downtime scheduled for Feb 20th is still in place.
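For reference, here is a minimal, illustrative sketch (not an official tool) of how you might work out the largest walltime a job could request and still finish before the 10 AM cutoff from this notice. The year 2024 is an assumption inferred from the surrounding dates, and the `--time` option mentioned in the comments is the standard Slurm walltime flag.

```python
from datetime import datetime

# Downtime start from this announcement: 10 AM on Thursday, 2/8
# (the year 2024 is an assumption inferred from the surrounding dates).
downtime_start = datetime(2024, 2, 8, 10, 0)

# Time remaining between now and the start of the downtime.
remaining = downtime_start - datetime.now()
total_minutes = max(int(remaining.total_seconds() // 60), 0)
hours, minutes = divmod(total_minutes, 60)

# Request no more than this walltime (e.g., via the scheduler's --time option)
# so the job finishes before the reservation blocks new work.
print(f"Maximum walltime before the downtime: {hours:02d}:{minutes:02d}:00")
```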

Issue with Savio SLURM email notifications has been resolved: Wed, 1/31

The issue that was preventing Savio users from receiving SLURM email notifications for jobs running on Savio has now been resolved, and users will again receive these emails as normal.

New Savio user account creation delay issues resolved: Wed, 1/24

We now have a workaround for the OTP linking email problems, and the processing of new user account requests has resumed.

Delay with new Savio user account creations: Fri, 1/19

There has been a delay with the processing of new Savio user account creations because the OTP (One Time Password) service is down. UC Berkeley has recently enforced additional configuration requirements for email servers and the OTP service requires extra work due to its unique setup. We apologize for any inconvenience this might have caused and thank you for your patience.

Savio DTN and Globus issues resolved: Wed, 1/10

The Savio DTN and Globus services are available and accessible again after a failed hardware component was replaced.

Savio DTN and Globus not accessible: Tues, 1/9

The Savio Data Transfer Node (DTN) and Globus are currently not accessible. We are investigating the root cause and will update as soon as we know more.

No office hours 12/20 - 1/5

Research IT will not be hosting office hours during the winter break, starting Wednesday, 12/20; office hours will resume on Wednesday, 1/10, the second week of January. In the meantime, we are available by email, so please get in touch with us at research-it@berkeley.edu.

Open OnDemand issues resolved: Wed, 11/22

The launching issues with remote desktop and Matlab in OOD have been resolved.

Update on Savio: Tues, 11/21

We have completed the Slurm upgrade and restored the scratch parallel file system. The scheduler reservation has been removed, and jobs have started running. Other HPC services like Open OnDemand and Globus are also back online. However, remote desktop and Matlab within Open OnDemand have launching issues. We are looking into this.

No office hours on Wed, 11/22 + Thurs, 11/23

Research IT will not be holding office hours on Wed, 11/22 + Thurs, 11/23 due to the holiday break! Please get in touch with us at research-it-consulting@lists.berkeley.edu with any questions.

Update on Savio: Mon, 11/20

We have upgraded Slurm and are working through a few remaining configuration issues. On the scratch file system, our system engineers are working on solutions with the DDN team. We estimate that we will need another day to get the scratch file system back online.

Savio Scratch File System is Still Down: Sun, 11/19

The scratch parallel file system is still down. Our system engineers have been working on solutions over the weekend. Again, we are very sorry for any inconvenience this may cause. As noted in yesterday's update, we aim to restore the file system tomorrow during the downtime for the Slurm upgrade. Please reach out to us at brc-hpc-help@berkeley.edu if you have any questions.

Savio Downtime: Monday, Nov 20

We have scheduled the delayed Slurm upgrade to address some security concerns. The one-day downtime will start at 8 AM on Monday, Nov 20. We expect to return HPC services by the end of the day.

Savio Scratch File System is Down: Fri, 11/17

As you may know, the scratch parallel file system began having access problems this morning. We have contacted the vendor to fix it as soon as possible. We are very sorry for any inconvenience this may cause. Also, as a reminder, we will proceed with the planned downtime on Monday to upgrade Slurm. Once the Slurm work is complete, we will make every effort to restore the file system.

Savio is Back in Service: Tues, 10/31

The Savio supercluster is back online. We have removed the reservation that was put in place to prevent jobs from running while the file system was down. The home file system is currently operating on a single controller; we will likely schedule a downtime later to facilitate the repair of the second controller. We appreciate your patience over the past week and apologize for any inconvenience this may have caused.

Update on Savio: Thurs, 10/26

The Dell Compellent support team has scheduled the replacement of the broken parts for tomorrow. In the meantime, Savio remains inaccessible. We understand your frustration and appreciate your patience.

Update on Savio: Tues, 10/24

We have identified the problem with the file system and are waiting for the replacement parts to arrive in a day or two. We will get the file system up and running as soon as we receive the parts. Please stay tuned. Again, we appreciate your patience. Please reach out to us at brc-hpc-help@berkeley.edu if you have any questions.

Savio down: Sat, 10/21

Savio is currently inaccessible. We are looking into this and will keep you posted on any progress. We appreciate your patience. Please reach out to us at brc-hpc-help@berkeley.edu if you have any questions.

Savio File System Back in Service: Mon, 10/16

The parallel file system at /global/scratch is back online. We are sorry for any inconvenience that the unresponsive file system might have caused. Thank you very much for your patience and please let us know if you encounter any problems.

Savio Scratch File System is Down: Mon, 10/16

We apologize that the parallel file system at /global/scratch is not responsive. This problem impacts job submissions and Open OnDemand if you need data access from /global/scratch. We are investigating how to fix it and will keep you posted. Please reach out to us at brc-hpc-help@berkeley.edu if you have any questions.

Job submission issues resolved: Thurs, 09/21

Job submission problems have been resolved for both interactive and batch jobs.

Issues with interactive job submissions: Wed, 09/20/23

Problems on the compute nodes have been preventing job submissions since yesterday (09/19/23). We are working to resolve them.

Issues with Scratch: Thurs, 8/31/23

We are experiencing some difficulties with the scratch file system because a few users have a very large number of files there. We're in the process of addressing this, but in the meantime, you may see "no space left on device" errors when using scratch.

Known remote I/O issue: Mon, June 5

We are aware of the current remote I/O issue on login nodes 2 and 3 and are working to resolve it this evening by rebooting the login nodes.

Savio Downtime rescheduled: Thurs 5/18, 8am-5pm

The cluster downtime is rescheduled to 8:00 AM on Thursday, May 18th. We plan to upgrade the global scratch parallel file system in preparation for implementing the purge policy on /global/scratch. We estimate the downtime will last only a few hours, and the cluster should return to service by 5 PM.

No office hours during finals week, 5/8 - 5/12

We are not offering office hours during finals week, 5/8 - 5/12. Please get in touch with us at research-it-consulting@berkeley.edu in the meantime. We will resume normal hours next week.

Compilation issue resolved: Tues, 4/11

The code compilation issue is resolved. Thank you for your patience.

Savio compilation issues: Mon, 4/10

Savio users may experience code compilation issues in which a header file cannot be found. We have identified the problem and will patch the system soon. Thank you for your patience.

Savio downtime rescheduled: 4/4 - 4/5

The Savio cluster downtime is rescheduled to start at 8:00 AM on Tuesday, April 4th for system upgrades. The upgrade work will include Slurm, the master node, the MyBRC portal, and storage. The downtime is estimated to last two days. We plan to bring the cluster back into service by 5 PM on Wednesday the 5th, once the work is complete.

No office hours Wed, 3/29 + Thurs, 3/30

Research IT will not be holding our regular office hours during the week of Spring Break, 3/27 - 3/31. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.

Savio login issues: Mon, 3/13

Savio users may still experience login failures related to the maintenance work performed last week. The IDM team, which provides the authentication service, is investigating. We will keep you posted on any progress. Thank you for your patience.

Savio downtime postponed: Thurs, 3/9

The scheduled downtime is temporarily postponed. We will make an announcement once a future downtime is confirmed.

Office Hours to resume on Wed, 1/18

Our weekly office hours are closed during the winter break and will resume on Wednesday, 1/18. We will keep our regular hours on Wednesday afternoons from 1:30-3pm and Thursday mornings from 9:30-11am. We are available via email to help in the meantime.

Limited holiday availability: 12/23 - 1/2

Research IT will be participating in the campus-wide curtailment program from Friday, 12/23 through Monday, 1/2. Only emergency support will be available for our systems and services, as most of our staff are out of the office. Office hours will resume the week of 1/17.

Data transfer node unstable: Mon, 12/19

Ongoing intermittent network problems have made the DTNs (Data Transfer Nodes) unreachable at times. As a consequence, Globus and secure file transfer tools, such as scp, are also affected. Our team is working on stabilizing these services and will update you with any progress we make.

Globus endpoint ucb#brc unreachable: Wed, 12/7

Globus endpoint ucb#brc has been unreachable for the past few days due to a network issue. Our team is working on bringing services back and will update you once it is back in production. As a backup option for data transfer, you could use the SCP/SFTP/rsync command line tools or FileZilla. See the email sent to Savio users for further information about these options.

No office hours on Wed, 11/23 or Thurs, 11/24

Due to the Thanksgiving holiday, we will not be holding office hours on Wed, 11/23 or Thurs, 11/24. Thursday and Friday are university holidays so we will be limited in our responses until we resume service on Monday, 11/28.

Savio Power Outage Status: Wed, 11/16

Savio partially lost power twice within the last week due to excess power consumption by some of the Savio4 nodes. As a precautionary measure, we have intentionally kept some nodes offline to keep power consumption under control: savio4_htc (28 nodes), savio2_htc (9 nodes), savio2_gpu (9 nodes), and savio (36 nodes). This may affect job scheduling in these partitions. Open OnDemand and Globus are back to normal. Please refer to the “Savio Power Outage Status” email sent on 11/16 for more details.

Savio Power is restored: Fri, 11/11

Power has been restored to the Savio cluster. All Slurm partitions are ready to take jobs. Open OnDemand and Globus have been tested and are functioning well. However, we are intentionally keeping some nodes offline for the time being and will gradually bring them back online to ensure that power consumption stays under control.

Savio Partial Power Outage: Fri, 11/11

Following the unexpected power outage in the data center on Saturday, the power breakers tripped again last night. As a consequence, we lost power to a few Savio HPC racks, and portions of the Savio partitions are currently offline. We are working with teams on campus to restore power and bring the HPC services back as soon as possible.

Open OnDemand issues resolved: Tues, 11/8

The interactive node was down; we are happy to report that it is now working, including Jupyter notebooks, VS Code, and RStudio.

Savio Outage Resolved: Mon, 11/7

On Saturday afternoon, we lost power to approximately five of the Savio HPC racks due to a tripped breaker in the data center. This resulted in a large portion of the Savio 1, 2, and 4 partitions being offline. We are working on near- and long-term mitigation strategies. Please contact us with feedback about any adverse impact you may have experienced.

MyBRC back online: Thurs, 9/20

MyBRC is back online. We had a hardware failure that also caused slow SLURM responsiveness. We will modify the configuration and code to alleviate the impact on SLURM should this happen again.

MyBRC portal offline: Tues, 9/20

We are aware that MyBRC has been offline since last night and are actively working to bring it back online. Thank you for your patience.

Intermittent login failures resolved: Tues, 8/30

One of the login nodes was blocked on 8/17 by the Berkeley Lab security team, which caused user login failures. Login is now restored, and we are taking the necessary actions to prevent similar triggers from happening again.

Savio job emails working again: Wed, 8/10

Savio job emails should be working again. Please contact us if you are still having any issues with this. 

Issue with job emails: Thurs, 7/28

We have received a number of tickets about not receiving emails when jobs finish. We are looking into this and will provide an update when it is resolved.

/home & /clusterfs are working: Wed, 5/18

/home and /clusterfs are now back to normal. Thank you for your patience while we resolved this issue.

clusterfs degraded performance: Tues, 5/17

We are aware of the degraded performance of the home directories and condo storage under /clusterfs and are working to fix it by the end of today.

Working on Viz Node Issues: Wed, 05/11

The Viz node has been experiencing issues since our scheduled downtime last week. This issue also affects Matlab OOD access. We are working to identify and correct this issue, and will keep you updated as this process continues.

Data Transfer Node + Globus Back Online: Tues, 4/26

The DMZ routes on campus are back. Our data transfer node is back online, and the Globus service has resumed.

Data Transfer Node Down: Mon, 4/25

The data transfer node is offline again, likely because the DMZ route is down on campus. We will send an update when it is back online.

Scheduled Savio Downtime: Mon, 5/2

We will power down Savio between 8 AM and 5 PM on Monday, 5/2 to apply vendor patches to the SLURM scheduling system. If you submit jobs before the downtime, please request a walltime that allows them to complete in time; otherwise, they will wait in the queue until the cluster is back online.

No office hours on Wed, 3/23 - Thurs, 3/24

Research IT will not be holding our regular office hours during the week of Spring Break, 3/21 - 3/25. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.

Savio Cluster is back in service: Fri, 1/28

We have solutions in place for the /global/scratch file system, so the Savio and Vector clusters are back in service. Please submit a ticket at brc-hpc-help@berkeley.edu if you have any questions or experience continued issues. Thank you for your patience, and happy computing!

Working on Scratch Instability: Thurs, 01/27

The global scratch file system on Savio has not been stable since last night. Access to certain folders/files may be sporadic or hanging, and some file operations might give I/O errors. We’re doing emergency maintenance and will keep you updated as we resolve it.

Savio Scratch File System is Back Online: Tues, 1/25

The global scratch file system is back in service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we restored the service.

Working on Scratch Instability: Tues, 01/25

The global scratch file system on Savio has not been stable since about noon today. Access to certain folders/files may be sporadic or hanging, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we resolve it.

No office hours on 1/5 or 1/6

Research IT is participating in a "soft" curtailment the week of Monday, Jan. 3 through Friday, Jan. 7, 2022. We will not be holding office hours during this week.

Closed for curtailment: Thurs, 12/23 - Mon, 01/03

Research IT will be participating in the campus-wide curtailment program from Thursday, Dec. 23, 2021 through Monday, Jan. 3, 2022. Many facilities and services will be closed or operating on modified schedules during this time.

Data Transfer Node is back online, Thurs, 12/2

We are glad to inform you that DTN, the designated Data Transfer Node, is back online. We apologize for any inconvenience this caused.

Data Transfer Node (DTN) is down: Thurs, 12/2

The Data Transfer Node (DTN) is currently down. We plan to go on site around noon to get it fixed and then will post an update.

No office hours on 11/23 or 11/24

Research IT is participating in the campus-wide curtailment during the week of Thanksgiving from Monday, Nov. 22 through Friday, Nov. 26, 2021. We will not be holding office hours this week.

Data Transfer Node is back in service: Tues, 10/26

We are glad to inform you that DTN, the designated Data Transfer Node dtn00.brc.berkeley.edu, is back online. We apologize for any inconvenience over the past few days.

Data Transfer Node is down: Mon, 10/25

While we work to restore the service, you can log in to dtn01.brc.berkeley.edu to transfer data for now. Please contact us with any questions.

Open OnDemand access via eduroam is working: Thurs, 10/21

Users should no longer experience the timeout issues that had been occurring in previous weeks.

New condo storage pricing announced

Condo storage can now be purchased in increments of 112 TB at an estimated cost of $5,750. Please contact us if you would like to purchase storage for your research data.

Savio3_gpu partition open to FCA users

The savio3_gpu partition is open to Faculty Computing Allowance users. You can submit jobs to a subset of GPU nodes on Savio. Read further for details on how to use GPU partitions.

Open OnDemand downtime: Fri, 10/15 at 10am

The campus eduroam network currently cannot route to OOD; please use AirBears2, a full-tunnel VPN, or a wired Ethernet connection. In order to focus our support on Open OnDemand, the JupyterHub server is being officially taken offline. We plan to take a short downtime to complete the transition, starting at 10:00 AM on Friday, October 15 and lasting approximately 30 minutes.

Savio Scratch File System is Back Online: Thurs, 9/30

The global scratch file system is back in service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we restored the service.

Working on scratch instability: Tues, 9/28

The global scratch file system on Savio has not been stable since this morning. Access to certain folders/files may be sporadic or hanging, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we work out solutions.

Savio back online: Mon, 9/20, 10:15am

We have resolved the issue on the scratch parallel file system. The work is complete and jobs have started running on Savio. Thank you for your patience.

Savio scheduled downtime: Mon, 9/20, 8am

A small number of users on the new scratch file system have been impacted by a file system bug that prevents the creation of new files. We plan to have a four-hour downtime to resolve this issue.

Savio scheduled downtime: Fri, 8/27, 9-11am

In order to make a minor change to the current structure of the /global/scratch file system, we are scheduling a brief downtime to relocate all user directories from /global/scratch/[username] to /global/scratch/users/[username].
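If your job scripts or code reference the old scratch path directly, they will need to point at the new location after the relocation. Below is a minimal, illustrative Python sketch (not an official migration tool) of handling both paths during the transition:

```python
import getpass
import os

user = getpass.getuser()

# Paths from this announcement: user directories move from
# /global/scratch/<username> to /global/scratch/users/<username>.
old_scratch = f"/global/scratch/{user}"
new_scratch = f"/global/scratch/users/{user}"

# Prefer the new location once it exists; fall back to the old one otherwise.
scratch_dir = new_scratch if os.path.isdir(new_scratch) else old_scratch
print(f"Using scratch directory: {scratch_dir}")
```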

Savio back online: Thurs, 8/12, 5pm

Savio is now back online, and we’re pleased to announce the availability of the new /global/scratch file system, which should alleviate the space shortage issues. Please migrate any critical data to the new system.

Scheduled Savio downtime: Thurs, 8/12, 9am

We are excited to announce an upcoming downtime on Thursday, August 12 at 9am to complete the roll-out of the new /global/scratch file system, which offers a significant upgrade in capability. We ask that you begin migrating any critical data to the new file system and leave any unneeded data behind.

Savio scheduled downtime: Tues, 7/20, 9am

We need to stop scheduling new SLURM jobs for a short period of time to migrate the backend database and implement support for allocation management in the new MyBRC user portal. Jobs will resume running when we complete the scheduled work.

Savio back online: Tues, 4/20, 3:30pm

The Savio cluster is back online as planned, and jobs have started running. Please contact us via email or at our drop in office hours if you have any questions.

Savio scheduled downtime: Tues, 4/20, 9am

There is a planned downtime starting at 9am until the end of the day on Tuesday, April 20, 2021 to prepare space for the new global scratch parallel filesystem.