Savio Downtime rescheduled: Thurs 5/18, 8am-5pm
The cluster downtime is rescheduled to 8:00 AM Thursday, May 18th. We plan to upgrade the global scratch parallel file system in preparation for implementing the purge policy on /global/scratch. We estimate the downtime will last only a few hours, and the cluster will return to service by 5PM.
No office hours during finals week, 5/8 - 5/12
We are not offering office hours during finals week, 5/8 - 5/12. Please get in touch with us at research-it-consulting@berkeley.edu in the meantime. We will resume normal hours next week.
Compilation issue resolved: Tues, 4/11
The code compilation issue is resolved. Thank you for your patience.
Savio compilation issues: Mon, 4/10
Savio users may experience code compilation issues where a header file couldn't be found. We have identified the problem and will patch the system soon. Thank you for your patience.
Savio downtime rescheduled: 4/4 - 4/5
The Savio cluster downtime is rescheduled to start at 8:00 AM Tuesday, April 4th for a system upgrade. The upgrade work will cover Slurm, the master node, the MyBRC portal and storage. The downtime is estimated to last two days. We plan to bring the cluster back into service by 5PM Wednesday the 5th once the work is complete.
No office hours Wed, 3/29 + Thurs, 3/30
Research IT will not be holding our regular office hours during the week of Spring Break, 3/27 - 3/31. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.
Savio login issues: Mon, 3/13
Savio users may still experience login failures, which are related to the maintenance work performed last week. The IDM team, which provides the authentication service, is conducting an investigation. We will keep you posted on any progress. Thank you for your patience.
Savio downtime postponed: Thurs, 3/9
The scheduled downtime is temporarily postponed. We will make an announcement once a future downtime is confirmed.
Office Hours to resume on Wed, 1/18
Our weekly office hours are closed during the winter break and will resume on Wednesday, 1/18. We will keep our regular hours on Wednesday afternoons from 1:30-3pm and Thursday mornings from 9:30-11am. We are available via email to help in the meantime.
Limited holiday availability: 12/23 - 1/2
Research IT will be participating in the campus-wide curtailment program from Friday, 12/23 - Mon, 1/2. Assume only emergency support will be available for our systems and services as most of our staff are out of the office. Office hours will resume the week of 1/17.
Data transfer node unstable: Mon, 12/19
The ongoing intermittent network problems have made the DTNs (Data Transfer Nodes) unreachable at times. As a consequence, Globus and secure file transfer tools such as scp are also affected. Our team is working on stabilizing the services and will update you with any progress we make.
Globus endpoint ucb#brc unreachable: Wed, 12/7
Globus endpoint ucb#brc has been unreachable for the past few days due to a network issue. Our team is working on bringing services back and will update you once it is back in production. As a backup option for data transfer, you could use the SCP/SFTP/rsync command line tools or FileZilla. See the email sent to Savio users for further information about these options.
No office hours on Wed, 11/23 or Thurs, 11/24
Due to the Thanksgiving holiday, we will not be holding office hours on Wed, 11/23 or Thurs, 11/24. Thursday and Friday are university holidays so we will be limited in our responses until we resume service on Monday, 11/28.
Savio Power Outage Status: Wed, 11/16
Savio partially lost power twice within the last week due to excess power consumption by some of the Savio4 nodes. As a precautionary measure, we have intentionally kept some nodes offline to keep power consumption under control: savio4_htc (28 nodes), savio2_htc (9 nodes), savio2_gpu (9 nodes) and savio (36 nodes). This may affect job scheduling in these partitions. Open OnDemand and Globus are back to normal. Please refer to the “Savio Power Outage Status” email sent on 11/16 for more details.
Savio Power is restored: Fri, 11/11
Power is restored to the Savio cluster. All the Slurm partitions are ready to take jobs. Open OnDemand and Globus have been tested and function well. However, we are intentionally keeping some nodes offline for the time being and will gradually bring them back online to ensure that power consumption stays under control.
Savio Partial Power Outage: Fri, 11/11
Following the unexpected power outage in the datacenter on Saturday, the power breakers were tripped again last night. As a consequence, we lost power to a few Savio HPC racks and portions of the Savio partitions are currently offline. We are working with teams on campus to restore the power and bring the HPC services back as soon as possible.
Open OnDemand issues resolved: Tues, 11/8
The interactive node was down; we are happy to report that it is now working again, including Jupyter Notebook, VS Code and RStudio.
Savio Outage Resolved: Mon, 11/7
On Saturday afternoon we lost power to approximately five of the Savio HPC racks due to a tripped breaker in the data center. This resulted in a large portion of the Savio 1, 2 and 4 partitions being offline. We are working on near and long term mitigation strategies. Please contact us with feedback about any adverse impact you may have experienced.
MyBRC back online: Thurs, 9/20
MyBRC is back online. We had a hardware failure, which also made Slurm slow to respond. We will modify the configuration and code to reduce the impact on Slurm should this happen again.
MyBRC portal offline: Tues, 9/20
We are aware that MyBRC has been offline since last night and are actively working to bring it back online. Thank you for your patience.
Intermittent login failures resolved: Tues, 8/30
One of the login nodes was blocked on 8/17 by the Berkeley Lab security team which caused user login failures. Login is now restored and we are taking necessary actions to prevent similar triggers from happening again.
Savio job emails working again: Wed, 8/10
Savio job emails should be working again. Please contact us if you are still having any issues with this.
Issue with job emails: Thurs, 7/28
/home & /clusterfs are working: Wed, 5/18
/home and /clusterfs are now back to normal. Thank you for your patience while we resolved this issue.
clusterfs degraded performance: Tues, 5/17
We are aware of the degraded performance on home directory and condo storage under /clusterfs and are working to try to fix it by the end of today.
Working on Viz Node Issues: Wed, 05/11
The Viz node has been experiencing issues since our scheduled downtime last week. This issue also affects Matlab OOD access. We are working to identify and correct this issue, and will keep you updated as this process continues.
Data Transfer Node + Globus Back Online: Tues, 4/26
The DMZ routes on campus are back. Our data transfer node is back online and the Globus service is resumed.
Data Transfer Node Down: Mon, 4/25
The data transfer node is offline again, likely because the DMZ route is down on campus. We will send an update when it is back online.
Scheduled Savio Downtime: Mon, 5/2
We will power down Savio between 8am-5pm on Monday, 5/2 to apply vendor patches to the SLURM scheduling system. If you submit jobs before the downtime, please request a walltime that lets them complete in time; otherwise they will wait in the queue until the cluster is back online.
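As a rough illustration of the walltime advice above, the sketch below computes the longest value you could pass to Slurm's --time option so that a job submitted now still finishes before a downtime begins. The function name, the 15-minute safety margin, and the example timestamps are assumptions made for illustration; they are not part of any Savio tooling.

```python
from datetime import datetime, timedelta
from typing import Optional


def max_walltime(now: datetime, downtime_start: datetime,
                 margin: timedelta = timedelta(minutes=15)) -> Optional[str]:
    """Return the longest Slurm time limit (D-HH:MM:SS) that ends before the downtime.

    Returns None when there is no usable time left. The default margin is an
    arbitrary assumption, not a site policy.
    """
    remaining = downtime_start - now - margin
    total_seconds = int(remaining.total_seconds())
    if total_seconds <= 0:
        return None
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # Illustrative timestamps only: submitting on a Saturday evening before
    # a downtime that starts Monday at 8am.
    limit = max_walltime(datetime(2022, 4, 30, 20, 0), datetime(2022, 5, 2, 8, 0))
    print(f"sbatch --time={limit} job.sh")  # job.sh is a hypothetical script name
```

A job that requests more than this limit would not be expected to start until after the downtime, which matches the queueing behavior described above.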
No office hours on Wed, 3/23 - Thurs, 3/24
Research IT will not be holding our regular office hours during the week of Spring Break, 3/21 - 3/25. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.
Savio Cluster is back in service: Fri, 1/28
We have solutions in place to fix the /global/scratch file system, so the Savio/Vector clusters are back in service. Please submit a ticket at brc-hpc-help@berkeley.edu if you have any questions or experience continued issues. Thank you for your patience and happy computing!
Working on Scratch Instability: Thurs, 01/27
The global scratch file system on Savio has not been stable since last night. Access to certain folders/files may be sporadic or hanging, and some file operations might give I/O errors. We’re doing emergency maintenance and will keep you updated as we resolve it.
Savio Scratch File System is Back Online: Tues, 1/25
The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.
Working on Scratch Instability: Tues, 01/25
The global scratch file system on Savio has not been stable since about noon today. Access to certain folders/files may be sporadic or hanging, and some file operations might throw I/O errors. We are working on this issue and will keep you updated as we resolve it.
No office hours on 1/5 or 1/6
Research IT is participating in a "soft" curtailment the week of Monday, Jan. 3 through Friday, Jan. 7, 2022. We will not be holding office hours during this week.
Closed for curtailment: Thurs, 12/23 - Mon, 01/03
Research IT will be participating in the campus-wide curtailment program from Thursday, Dec. 23, 2021 through Monday, Jan. 3, 2022. Many facilities and services will be closed or operating on modified schedules during this time.
Data Transfer Node is back online, Thurs, 12/2
We are glad to inform you that DTN, the designated Data Transfer Node, is back online. We apologize for any inconvenience this caused.
Data Transfer Node (DTN) is down: Thurs, 12/2
The Data Transfer Node (DTN) is currently down. We plan to go on site around noon to get it fixed and then will post an update.
No office hours on 11/23 or 11/24
Research IT is participating in the campus-wide curtailment during the week of Thanksgiving from Monday, Nov. 22 through Friday, Nov. 26, 2021. We will not be holding office hours this week.
Data Transfer Node is back in service: Tues, 10/26
We are glad to inform you that the DTN, the designated Data Transfer Node dtn00.brc.berkeley.edu, is back online. We apologize for any inconvenience over the past few days.
Data Transfer Node is down: Mon, 10/25
While we work to restore the service, you can log in to dtn01.brc.berkeley.edu to transfer data for now. Please contact us with any questions.
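For anyone scripting transfers through the temporary node, here is a minimal sketch that assembles an rsync command line targeting dtn01.brc.berkeley.edu. It only builds and prints the command; it does not run it. The username, local path, and remote path are hypothetical placeholders, and the choice of rsync (rather than scp or sftp) is an assumption for illustration.

```python
import shlex
from typing import List


def rsync_command(user: str, local_path: str, remote_path: str,
                  host: str = "dtn01.brc.berkeley.edu") -> List[str]:
    """Build an rsync argument list that copies local_path to the given host.

    -a keeps permissions and timestamps, -v is verbose, --partial keeps
    partially transferred files, and --progress shows per-file progress.
    """
    return [
        "rsync", "-av", "--partial", "--progress",
        local_path,
        f"{user}@{host}:{remote_path}",
    ]


if __name__ == "__main__":
    # "alice" and both paths are placeholders, not real accounts or directories.
    cmd = rsync_command("alice", "results/", "/global/scratch/alice/results/")
    print(shlex.join(cmd))
```

Returning a list rather than a single string lets the same value be passed to subprocess.run without shell quoting problems, should you choose to execute it.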
Open OnDemand access via eduroam is working: Thurs, 10/21
Users should no longer experience the timeout issues that had been occurring in previous weeks.
New condo storage pricing announced
Savio3_gpu partition open to FCA users
Open OnDemand downtime: Fri, 10/15 at 10am
The campus eduroam network currently cannot route to OOD. Please use AirBears2, a full-tunnel VPN, or a wired Ethernet connection. In order to focus our support on Open OnDemand, the JupyterHub server is officially taken offline. We plan to take a short downtime to complete the transition starting 10:00 AM Friday, October 15 for approximately 30 minutes.
Savio Scratch File System is Back Online: Thurs, 9/30
The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.
Working on scratch instability: Tues, 9/28
The global scratch file system on Savio has not been stable since this morning. Access to certain folders/files may be sporadic or hanging, and some file operations might throw I/O errors. We are working on this issue and will keep you updated as we work out solutions.
Savio back online: Mon, 9/20, 10:15am
We have resolved the issue on the scratch parallel file system. The work is complete and jobs have started running on Savio. Thank you for your patience.
Savio scheduled downtime: Mon, 9/20, 8am
A small number of users on the new scratch file system have been impacted by a file system bug that prevents the creation of new files. We plan a four-hour downtime to resolve this issue.
Savio scheduled downtime: Fri, 8/27, 9-11am
In order to make a minor change to the current structure of the /global/scratch file system, we are scheduling a brief downtime to relocate all user directories from /global/scratch/[username] to /global/scratch/users/[username].
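Scripts that hard-code the old layout will need their paths updated after this move. The sketch below shows one way to rewrite a path from /global/scratch/[username] to /global/scratch/users/[username]. The helper name and example paths are hypothetical, and it assumes no account is literally named "users".

```python
from pathlib import PurePosixPath

OLD_ROOT = PurePosixPath("/global/scratch")
NEW_ROOT = PurePosixPath("/global/scratch/users")


def relocate(path: str) -> str:
    """Map an old-style scratch path to the new users/ layout.

    Paths already under the new layout, or outside scratch entirely, are
    returned unchanged.
    """
    p = PurePosixPath(path)
    if p == NEW_ROOT or NEW_ROOT in p.parents:
        return str(p)  # already in the new layout
    if OLD_ROOT not in p.parents:
        return str(p)  # not a path below scratch
    return str(NEW_ROOT / p.relative_to(OLD_ROOT))


if __name__ == "__main__":
    # "alice" is a placeholder username.
    print(relocate("/global/scratch/alice/data/run1.csv"))
```

Because already-migrated paths pass through untouched, the function can be applied repeatedly to the same configuration without compounding the prefix.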
Savio back online: Thurs, 8/12, 5pm
Savio is now back online, and we’re pleased to announce the availability of the new /global/scratch file system, which should alleviate the space shortage issues. Please migrate any critical data to the new system.
Scheduled Savio downtime: Thurs, 8/12, 9am
We are excited to announce an upcoming downtime on Thursday, August 12 at 9am to complete the roll-out of the new /global/scratch file system which offers a significant upgrade in capability. We ask that you begin migrating any critical data to the new file system and leave any unneeded data behind.
Savio scheduled downtime: Tues, 7/20, 9am
We need to stop scheduling new SLURM jobs for a short period of time to migrate the backend database and implement support for allocation management inside the new MyBRC user portal. Jobs will resume running when we complete the scheduled work.
Savio back online: Tues, 4/20, 3:30pm
The Savio cluster is back online as planned, and jobs have started running. Please contact us via email or at our drop in office hours if you have any questions.
Savio scheduled downtime: Tues, 4/20, 9am
There is a planned downtime starting at 9am until the end of the day on Tuesday, April 20, 2021 to prepare space for the new global scratch parallel filesystem.