Status and Service Updates

View all status and service updates below.

Data transfer node unstable: Mon, 12/19

Ongoing intermittent network problems have made the Data Transfer Nodes (DTNs) unreachable at times. As a consequence, Globus and secure file transfer tools such as scp are also affected. Our team is working on stabilizing the services and will update you with any progress we make.

Globus endpoint ucb#brc unreachable: Wed, 12/7

Globus endpoint ucb#brc has been unreachable for the past few days due to a network issue. Our team is working on bringing services back and will update you once the endpoint is back in production. As a backup option for data transfer, you can use the SCP/SFTP/rsync command-line tools or FileZilla. See the email sent to Savio users for further information about these options.
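
For reference, a command-line transfer to a data transfer node might look like the following sketch. The hostname dtn01.brc.berkeley.edu is taken from an earlier update on this page; the username and paths are placeholders.

```shell
# Sketch of transferring data without Globus; "myuser" and the paths are
# illustrative, and dtn01.brc.berkeley.edu is one of the Savio DTNs.

# Copy a single file to your scratch directory with scp:
scp results.tar.gz myuser@dtn01.brc.berkeley.edu:/global/scratch/users/myuser/

# Mirror a whole directory with rsync (resumable, shows progress):
rsync -avP data/ myuser@dtn01.brc.berkeley.edu:/global/scratch/users/myuser/data/
```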

No office hours on Wed, 11/23 or Thurs, 11/24

Due to the Thanksgiving holiday, we will not be holding office hours on Wed, 11/23 or Thurs, 11/24. Thursday and Friday are university holidays so we will be limited in our responses until we resume service on Monday, 11/28.

Savio Power Outage Status: Wed, 11/16

Savio partially lost power twice within the last week due to excess power consumption by some of the Savio4 nodes. As a precautionary measure, we have intentionally kept some nodes offline to keep power consumption under control: savio4_htc (28 nodes), savio2_htc (9 nodes), savio2_gpu (9 nodes), and savio (36 nodes). This may affect job scheduling in these partitions. Open OnDemand and Globus are back to normal. Please refer to the “Savio Power Outage Status” email sent on 11/16 for more details.

Savio Power is restored: Fri, 11/11

Power has been restored to the Savio cluster. All Slurm partitions are ready to accept jobs. Open OnDemand and Globus have been tested and are functioning well. However, we are intentionally keeping some nodes offline for the time being and will gradually bring them back online to ensure that power consumption stays under control.

Savio Partial Power Outage: Fri, 11/11

Following the unexpected power outage in the datacenter on Saturday, the power breakers were tripped again last night. As a consequence, we lost power to a few Savio HPC racks and portions of the Savio partitions are currently offline. We are working with teams on campus to restore the power and bring the HPC services back as soon as possible.

Open OnDemand issues resolved: Tues, 11/8

The interactive node was down; we are happy to report that it is now working, including Jupyter notebooks, VS Code, and RStudio.

Savio Outage Resolved: Mon, 11/7

On Saturday afternoon we lost power to approximately five of the Savio HPC racks due to a tripped breaker in the data center. This resulted in a large portion of the Savio 1, 2, and 4 partitions being offline. We are working on near- and long-term mitigation strategies. Please contact us with feedback about any adverse impact you may have experienced.

MyBRC back online: Thurs, 9/22

MyBRC is back online. A hardware failure caused the outage and also slowed SLURM's responsiveness. We will modify the configuration and code to alleviate the impact on SLURM should this happen again.

MyBRC portal offline: Tues, 9/20

We are aware that MyBRC has been offline since last night and are actively working to bring it back online. Thank you for your patience.

Intermittent login failures resolved: Tues, 8/30

One of the login nodes was blocked on 8/17 by the Berkeley Lab security team, causing user login failures. Login is now restored, and we are taking the necessary actions to prevent similar triggers from happening again.

Savio job emails working again: Wed, 8/10

Savio job emails should be working again. Please contact us if you are still having any issues with this. 

Issue with job emails: Thurs, 7/28

We have received a number of tickets about not receiving emails when jobs finish. We are looking into this and will provide an update when it is resolved.
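
For reference, job completion emails are requested through standard Slurm directives. A minimal job script sketch is below; the account, partition, and email address are placeholders.

```shell
#!/bin/bash
# Minimal Savio job script sketch requesting email notification.
# The account, partition, and email address below are placeholders.
#SBATCH --job-name=email-test
#SBATCH --account=fc_myproject
#SBATCH --partition=savio2
#SBATCH --time=00:05:00
#SBATCH --mail-type=END,FAIL          # email when the job finishes or fails
#SBATCH --mail-user=myuser@berkeley.edu

echo "Job $SLURM_JOB_ID finished"
```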

/home & /clusterfs are working: Wed, 5/18

/home and /clusterfs are now back to normal. Thank you for your patience while we resolved this issue.

clusterfs degraded performance: Tues, 5/17

We are aware of the degraded performance on home directories and condo storage under /clusterfs and are working to fix it by the end of today.

Working on Viz Node Issues: Wed, 05/11

The Viz node has been experiencing issues since our scheduled downtime last week. This issue also affects Matlab OOD access. We are working to identify and correct this issue, and will keep you updated as this process continues.

Data Transfer Node + Globus Back Online: Tues, 4/26

The DMZ routes on campus are back. Our data transfer node is back online and the Globus service has resumed.

Data Transfer Node Down: Mon, 4/25

The data transfer node is offline again, likely because the DMZ route is down on campus. We will send an update when it is back online.

Scheduled Savio Downtime: Mon, 5/2

We will power down Savio between 8am and 5pm on Monday, 5/2 to apply vendor patches to the SLURM scheduling system. If you submit jobs before the downtime, please request enough walltime for them to complete before it begins; otherwise they will wait in the queue until the cluster is back online.
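
As a sketch, a job submitted before the downtime can be given a time limit that ends before 8am Monday, and the limits on queued jobs can be checked with squeue. The account and partition names here are placeholders.

```shell
# Request a walltime short enough to finish before the 5/2 downtime window;
# the account and partition names are placeholders.
sbatch --time=12:00:00 --account=fc_myproject --partition=savio2 myjob.sh

# Check the requested time limit (%l) and time remaining (%L) of your jobs:
squeue -u "$USER" -o "%.10i %.9P %.20j %.10l %.10L"
```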

No office hours on Wed, 3/23 - Thurs, 3/24

Research IT will not be holding our regular office hours during the week of Spring Break, 3/21 - 3/25. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.

Savio Cluster is back in service: Fri, 1/28

We have solutions in place to fix /global/scratch file system so the Savio/Vector clusters are back in service. Please submit a ticket at brc-hpc-help@berkeley.edu if you have any questions or experience continued issues. Thank you for your patience and happy computing!

Working on Scratch Instability: Thurs, 01/27

The global scratch file system on Savio has not been stable since last night. Access to certain folders/files may be sporadic or hanging, and some file operations might give I/O errors. We’re doing emergency maintenance and will keep you updated as we resolve it.

Savio Scratch File System is Back Online: Tues, 1/25

The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.

Working on Scratch Instability: Tues, 01/25

The global scratch file system on Savio has not been stable since about noon today. Access to certain folders/files may be sporadic or hang, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we resolve it.

No office hours on 1/5 or 1/6

Research IT is participating in a "soft" curtailment the week of Monday, Jan. 3 through Friday, Jan. 7, 2022. We will not be holding office hours during this week.

Closed for curtailment: Thurs, 12/23 - Mon, 01/03

Research IT will be participating in the campus-wide curtailment program from Thursday, Dec. 23, 2021 through Monday, Jan. 3, 2022. Many facilities and services will be closed or operating on modified schedules during this time.

Data Transfer Node is back online, Thurs, 12/2

We are glad to inform you that the designated Data Transfer Node (DTN) is back online. We apologize for any inconvenience this caused.

Data Transfer Node (DTN) is down: Thurs, 12/2

The Data Transfer Node (DTN) is currently down. We plan to go on site around noon to get it fixed and then will post an update.

No office hours on 11/23 or 11/24

Research IT is participating in the campus-wide curtailment during the week of Thanksgiving from Monday, Nov. 22 through Friday, Nov. 26, 2021. We will not be holding office hours this week.

Data Transfer Node is back in service: Tues, 10/26

We are glad to inform you that the designated Data Transfer Node, dtn00.brc.berkeley.edu, is back online. We apologize for any inconvenience over the past few days.

Data Transfer Node is down: Mon, 10/25

While we work to restore the service, you can log in to dtn01.brc.berkeley.edu to transfer data for now. Please contact us with any questions.

Open OnDemand access via eduroam is working: Thurs, 10/21

Users should no longer experience the timeout issues that had been occurring in previous weeks.

New condo storage pricing announced

Condo storage can now be purchased in increments of 112 TB at an estimated cost of $5,750. Please contact us if you would like to purchase storage for your research data.

Savio3_gpu partition open to FCA users

The savio3_gpu partition is now open to Faculty Computing Allowance (FCA) users, who can submit jobs to a subset of GPU nodes on Savio. Read further for details on how to use the GPU partitions.
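
As a rough sketch of a GPU submission, assuming a hypothetical account name and resource ratios (consult the Savio documentation for the GPU types available and the required CPUs per GPU):

```shell
# Sketch of submitting to the savio3_gpu partition; the account name and
# resource ratios are placeholders -- check the Savio docs for the exact
# GPU types and required CPU-to-GPU ratios.
sbatch --partition=savio3_gpu \
       --account=fc_myproject \
       --gres=gpu:1 \
       --cpus-per-task=2 \
       --time=01:00:00 \
       gpu_job.sh
```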

Open OnDemand downtime: Fri, 10/15 at 10am

The campus eduroam network currently cannot route to Open OnDemand (OOD); please use AirBears2, the full-tunnel VPN, or a wired Ethernet connection instead. In order to focus our support on Open OnDemand, the JupyterHub server is being officially taken offline. We plan a short downtime to complete the transition, starting at 10:00 AM on Friday, October 15 and lasting approximately 30 minutes.

Savio Scratch File System is Back Online: Thurs, 9/30

The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.

Working on scratch instability: Tues, 9/28

The global scratch file system on Savio has not been stable since this morning. Access to certain folders/files may be sporadic or hang, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we work out solutions.

Savio back online: Mon, 9/20, 10:15am

We have resolved the issue on the scratch parallel file system. The work is complete and jobs have started running on Savio. Thank you for your patience.

Savio scheduled downtime: Mon, 9/20, 8am

A small number of users on the new scratch file system have been impacted by a file system bug that prevents the creation of new files. We plan a four-hour downtime to resolve this issue.

Savio scheduled downtime: Fri, 8/27, 9-11am

In order to make a minor change to the current structure of the /global/scratch file system, we are scheduling a brief downtime to relocate all user directories from /global/scratch/[username] to /global/scratch/users/[username].
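
After the move, job scripts that hard-code the old path will need updating. A self-contained sketch of the rewrite is below; the username "alice" and the script contents are illustrative.

```shell
# Demonstrate rewriting the old scratch path to the new layout.
# "alice" and the script contents below are illustrative.
demo=$(mktemp)
cat > "$demo" <<'EOF'
cd /global/scratch/alice/project
EOF

# Rewrite the old layout to the new one (keeps a .bak backup):
sed -i.bak 's|/global/scratch/alice|/global/scratch/users/alice|g' "$demo"
cat "$demo"   # now: cd /global/scratch/users/alice/project
```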

Savio back online: Thurs, 8/12, 5pm

Savio is now back online, and we’re pleased to announce the availability of the new /global/scratch file system, which should alleviate the space shortage issues. Please migrate any critical data to the new system.

Scheduled Savio downtime: Thurs, 8/12, 9am

We are announcing an upcoming downtime on Thursday, August 12 at 9am to complete the roll-out of the new /global/scratch file system, which offers a significant upgrade in capability. We ask that you begin migrating any critical data to the new file system and leave any unneeded data behind.

Savio scheduled downtime: Tues, 7/20, 9am

We need to stop scheduling new SLURM jobs for a short period to migrate the backend database and implement support for allocation management inside the new MyBRC user portal. Jobs will resume running when we complete the scheduled work.

Savio back online: Tues, 4/20, 3:30pm

The Savio cluster is back online as planned, and jobs have started running. Please contact us via email or at our drop in office hours if you have any questions.

Savio scheduled downtime: Tues, 4/20, 9am

There is a planned downtime starting at 9am until the end of the day on Tuesday, April 20, 2021 to prepare space for the new global scratch parallel filesystem.