Update: The emergency cluster maintenance has now been rescheduled for Friday, Nov 3rd.
The BRC HPC cluster infrastructure experienced two unexpected outages this week: once on Monday, Oct 30th, and again on Wednesday, Nov 1st. Both were triggered by unplanned power outages in the UCB datacenter that affected all non-UPS-powered resources. In both events, users lost their active login sessions to the cluster resources and their running jobs on the compute nodes. If you have failed or incomplete jobs from either of these events, please review them and resubmit them to the queues.
We are taking these outages very seriously and are working with datacenter operations to improve the situation as soon as possible. The issue has been escalated to the manufacturer of the datacenter's power infrastructure, which plans to replace some equipment next Friday, Nov 10th. To accommodate this, we have scheduled an emergency maintenance downtime for the BRC HPC cluster infrastructure from 7:00 am to 3:00 pm on Nov 10th.
All cluster resources will be unavailable to users for the duration of this downtime. Job queues will be blocked, so if you submit any jobs before Nov 10th, make sure you request a wallclock time that lets them finish before 7:00 am on the 10th; otherwise your jobs will wait in the queue until after the downtime. If your jobs are not running, please check your wallclock requests first before creating a support ticket with BRC help.
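As a rough aid for choosing a wallclock request, the arithmetic can be sketched in Python. This is only an illustration, not a BRC tool: the function name `max_wallclock`, the 10-minute safety margin, the year in the example, and the days-hours:minutes:seconds output format (the style Slurm accepts) are all assumptions, so adjust them to your scheduler and to the actual start of the downtime window.

```python
from datetime import datetime, timedelta


def max_wallclock(now, downtime_start, margin=timedelta(minutes=10)):
    """Return the longest wallclock request that still ends before the downtime.

    The result is formatted as D-HH:MM:SS (an assumed, Slurm-style format).
    Returns None when no time is left before the downtime begins.
    The margin is a hypothetical safety buffer, not a BRC requirement.
    """
    remaining = downtime_start - now - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: submitting at 9:30 am on Nov 8th for a downtime starting at
# 7:00 am on Nov 10th. The year is a placeholder; the announcement gives none.
YEAR = 2023
submit_time = datetime(YEAR, 11, 8, 9, 30)
downtime_start = datetime(YEAR, 11, 10, 7, 0)
print(max_wallclock(submit_time, downtime_start))  # prints 1-21:20:00
```

A job that requests more than the printed value would not be able to finish before the downtime and would be expected to stay queued until the cluster returns.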
We apologize for these continued unplanned outages. We are doing everything we can to prevent them from happening in the future.
Email us at firstname.lastname@example.org if you have any concerns about this schedule.