BRC status

Savio back online

Friday, October 11, 2019 - 15:09

Earlier this afternoon, we received word that non-UPS power had been restored to the Warren Hall datacenter, so we immediately started work to bring Savio back online. Savio is now back in full production, and users can return to their work. We did notice that a few jobs were still running when we brought the system down before the power outage. Users should check their most recent jobs and resubmit them if necessary.

Savio still down: update

Thursday, October 10, 2019 - 13:39

While parts of the campus are running on power from the Cogen plant, we have been asked to hold off on bringing the Savio cluster back up because of concerns about overloading the generation facility. We are therefore waiting for clearance from campus facilities before beginning to restore service.

As soon as we have a green light to bring systems back up, we will send out another note with an estimated time for service availability, followed by a confirmation once service is available. Thank you for your patience as we navigate this outage. All hands are on standby to restore service as soon as possible.

Savio down on 10/9 due to power shutoff

Tuesday, October 8, 2019 - 16:37

Based on information from campus leadership, we expect that PG&E power to campus will be shut off on Wednesday, October 9 at 8am due to the Fire Weather Watch and the likelihood of high winds. We will need to shut down Savio starting at 6am.

Users with access to other computing resources may want to copy their data there as a precaution. Note that a PG&E outage at UC Berkeley will also affect LBNL computing resources.

 

 

Update on Savio Status: Back Online

Monday, September 30, 2019 - 10:24

Datacenter staff finished repairs to the transformer this morning (Monday, 9/30) and were able to switch the power source from the generator to house power. We paused the SLURM scheduler queues around 7:00 am today to shut down all compute resources and allow the switch from generator power to transformer power. After that, we were able to power all compute resources back on and release the job queues at around 12:30 pm. We would like to thank all of our users for their patience and cooperation during this unexpected outage.


 

Update on Savio availability from 9/24-9/30

Tuesday, September 24, 2019 - 16:37

As some of you may have noticed, we were able to gradually bring BRC cluster resources back online on 9/23. Campus data center staff acted quickly and provided us with alternative generator power while they continued repairs on the broken transformer. Using this generator power, we were able to bring Savio and Vector resources back to where they were at the beginning of this week, i.e., Monday morning, 9/23.
 
Data center staff estimate that the transformer will be repaired and ready by Monday morning, 9/30, at which point we will have to switch the power source from the generator back to transformer-fed house power. This will result in another half-day of downtime for BRC cluster resources on or after Monday, 9/30.
 
For now, job queues have been released and many user jobs are running on all three generations of Savio nodes (savio1, savio2, savio3) and on the Vector cluster nodes. If the generator power and cooling remain stable, we will be able to run at the current level until 6AM on Monday, 9/30.
 
We will take another half-day of downtime for cluster resources on or after Monday, 9/30. We apologize for the inconvenience these outages have caused our users, but we are glad that we did not have to take a full 7-day outage between 9/23 and 9/30.
 
Please reach us at brc-hpc-help@berkeley.edu if you have any further questions or concerns.

Savio unexpectedly unavailable from 9/23-9/30

Monday, September 23, 2019 - 15:34

Due to an unexpected power system emergency in the Warren Hall data center, Savio will be shut down from the evening of Mon, 9/23 until the repairs are complete on the morning of Mon, 9/30.

 

Savio back online

Friday, August 16, 2019 - 13:33

Savio is fully back online as of 11:15 AM today. All services have been restored.

Update: Unscheduled downtime for BRC/Savio due to a power event in the datacenter

Thursday, August 15, 2019 - 15:23

Datacenter staff and electricians worked all day today to replace the burnt power circuit breakers with new parts, but they ran into further issues. Their work has been extended into tomorrow because they need to wait for more parts to arrive. Unfortunately, this means that none of the Savio compute nodes are accessible via the SLURM partitions yet. We hope to make them accessible by the end of tomorrow, assuming the electricians can finish their repairs early tomorrow morning, 8/16.
 
We have just removed the login blocks so that users can access the login/front-end nodes, the DTN, and data on the filesystems, and we expect to keep this access open through tomorrow.
 
We apologize for this disruption in service; we are doing our best to keep as many services as possible online. Email us at brc-hpc-help@berkeley.edu if you need any further help.
 

Savio partially online

Tuesday, August 13, 2019 - 11:05

The BRC Savio cluster is partially online as of this morning, 8/13. Over the weekend, multiple power circuit breakers tripped in the datacenter and all Savio compute nodes lost power. When we tried to reset the breakers on Monday, we found that some of them are damaged and need replacement. We did manage to reset some of the other breakers and were able to restore power to some of the circuits in the datacenter as of Monday evening. Our engineers rerouted the power cables from the compute nodes and network switches to the working power circuits and brought part of the Savio cluster online.
 
Datacenter staff are scheduled to take the entire cluster offline again on Thursday, 8/15, at 10:00 am to replace the broken circuit breakers.
 
Savio cluster queues are now open for users to run jobs, but only a small number of nodes are available in each of the Savio partitions. We therefore ask that only users with immediate deadlines use the Savio cluster today and tomorrow, and that others refrain from keeping the queues busy. Also note that, because the next downtime starts at 10AM on Thursday, you should request an appropriate wallclock time (< 48 hours) so that your jobs can run in the queues.
 
Email us at brc-hpc-help@berkeley.edu if you have any questions or need additional assistance.

Savio outage 8/12/2019

Monday, August 12, 2019 - 09:53

We experienced an unexpected power event that disrupted the power supply to all the compute nodes of the Savio cluster sometime yesterday, 8/11. Our engineers have been in the datacenter since early this morning making fixes and changes to the power layout in order to restore services.

Right now, users can log in to the cluster front-end nodes and access their data on the cluster filesystems, but no jobs are running in the Savio cluster queues.

Once we finish rebalancing power and bring the nodes back online, jobs will resume running as previously scheduled. CGRL's Vector cluster nodes are not impacted by this power event.

Reach us at brc-hpc-help@berkeley.edu if you have any questions or concerns.

BRC Jupyterhub service experiencing problems

Thursday, August 8, 2019 - 13:46

Users are having trouble accessing the BRC Jupyterhub service. BRC staff are looking into the problem. As a workaround for the time being, you can access Jupyter notebooks on the Savio visualization node by following the instructions in the RIT documentation.

 

Update: BRC cluster login returning to normal

Tuesday, August 6, 2019 - 16:49

We believe we've resolved the login issues. Please let us know if you experience problems.

Ongoing Savio login issues

Tuesday, August 6, 2019 - 14:41

Users have been reporting problems logging in, with their password not being accepted. BRC staff are looking into this.

In the meantime, simply waiting for a minute and trying again may allow you to get access.

BRC Savio Cluster expected to be online by 5 pm August 5

Monday, August 5, 2019 - 14:22

Our original post about Savio being online first thing on the morning of August 5 was incorrect (and contrary to the email message that was sent out).

 

BRC Savio Cluster shutdown planned for the weekend of Aug 3

Sunday, June 9, 2019 - 21:46

BRC Savio will be shut down on Friday, Aug 2 after 5pm to accommodate electrical work in the data center. Savio will be brought back online first thing on Monday morning, Aug 5.

BRC cluster downtime planned for 8/6-8/7

Tuesday, July 31, 2018 - 09:19

BRC staff have made arrangements with the vendor to perform an upgrade of our Lustre file storage system on August 6th - 7th. This upgrade could not take place during our most recent scheduled downtime.

If you have questions or concerns, please contact us at brc-hpc-help@berkeley.edu.

Scheduled downtime 7/24-7/25

Monday, July 23, 2018 - 09:59

Our next maintenance downtime for the BRC HPC Supercluster is scheduled for July 24th and 25th. It will be a two-day downtime from 8:00 am on Tuesday until 5:00 pm on Wednesday.

We need to perform some long-pending maintenance tasks and improvements to the scratch filesystem, which will help us manage it better.

All access to the cluster login nodes, the data transfer node, the scheduler queues, and data on all the cluster filesystems will be blocked. This downtime impacts all three clusters in the supercluster infrastructure (Savio, Cortex, and Vector). After the downtime, access will be restored as before.

We have put scheduler reservations in place so that no jobs will be running after 8:00 am on July 24th. If you submit jobs to any cluster queue before the downtime, please make sure you request a wallclock time that lets them finish before 8:00 am on the 24th; otherwise your jobs will wait in the queue until after the downtime.
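
The wallclock arithmetic can be sketched in a few lines of Python. This is only an illustration, not a BRC-provided tool: the helper name `max_walltime` and the 15-minute safety margin are assumptions made for the example. It also assumes the job starts the moment it is submitted; a job that waits in the queue would need a shorter limit. The result is formatted as `days-hours:minutes:seconds`, one of the forms SLURM accepts for a time limit.

```python
from datetime import datetime, timedelta


def max_walltime(now, downtime_start, margin=timedelta(minutes=15)):
    """Return the longest time limit (D-HH:MM:SS) that ends before the downtime.

    Hypothetical helper: the name and the default margin are assumptions.
    Returns None if no time is left once the safety margin is subtracted.
    """
    remaining = downtime_start - now - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job submitted at 9:59 am on July 23, 2018,
# ahead of the downtime that starts at 8:00 am on July 24.
limit = max_walltime(datetime(2018, 7, 23, 9, 59), datetime(2018, 7, 24, 8, 0))
print(limit)
```

The printed value can then be passed to the scheduler when submitting, for example as the argument to `sbatch --time=`.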

(Resolved) Job submission errors on Savio

Tuesday, July 17, 2018 - 08:24

[Update 9:30 AM: This issue should be resolved. Please contact us at brc-hpc-help@berkeley.edu if you continue experiencing problems.]

Since 1:30 AM on 7/17/18, users have been reporting issues with job submission on Savio. Staff are investigating the problem and hope to restore service soon.

[Resolved] Ongoing scratch and DTN issues

Monday, June 4, 2018 - 17:01

As of 11:30 PM on 6/4/18, the scratch and DTN issues should be resolved. Please contact brc-hpc-help@berkeley.edu if you encounter further issues.

BRC cluster users are continuing to report issues with scratch storage and DTN access. Support staff are currently working on the issue and will post an update when a fix is in place or when we have an ETA for a fix.

Scratch storage issue on BRC clusters

Sunday, June 3, 2018 - 18:03

Starting Sunday afternoon (6/3/18), users have been reporting issues with scratch storage on BRC clusters. Cluster sysadmins will look into it as soon as possible.
