BRC status

(Resolved) JupyterHub on Savio currently unavailable

Tuesday, December 12, 2017 - 09:25

Update: JupyterHub access has now been restored, as of 10:15 AM on 12/12/17.

At 7:30 PM on 12/11/17, JupyterHub went down, and it has not come back up after a restart. Systems staff are currently working on resolving the issue.

Scheduled downtime for BRC clusters: 12/19 9-12

Friday, December 8, 2017 - 15:25

The BRC clusters (Savio, Vector, Cortex) have planned downtime scheduled for 12/19, 9 AM - 12 PM, to accommodate changes to the storage system.

If you have questions or concerns, please contact brc-hpc-help@berkeley.edu.

BRC cluster emergency maintenance TOMORROW, 11/3

Thursday, November 2, 2017 - 12:21

We just learned that the power equipment vendor is available as soon as tomorrow, Nov 3rd, to perform the replacements. To reduce the possibility of more unplanned outages, we are moving the emergency maintenance scheduled for next week up to tomorrow, Friday, Nov 3rd.

All BRC HPC cluster resources including Cortex, Savio and Vector will be taken offline starting at 7:00 am and brought back online by 3:00 pm. 

We hope this sudden schedule change does not cause major disruptions to your plans. If it does, please write to us immediately at brc-hpc-help@berkeley.edu.

Emergency BRC cluster maintenance

Thursday, November 2, 2017 - 11:04

Update: Emergency cluster maintenance has now been rescheduled for Friday 11/3.

The BRC HPC cluster infrastructure experienced two unexpected outages this week: once on Monday, Oct 30th, and again on Wednesday, Nov 1st. Both outages were triggered by unplanned power outages in the UCB datacenter that affected all non-UPS-powered resources. In both events, users lost their active login sessions to the cluster resources and their running jobs on the compute nodes. If you have failed or incomplete jobs from either of these events, please review them and resubmit them to the queues.

We are taking these outages very seriously and are working with datacenter operations to improve the situation as soon as possible. This issue has been escalated with the manufacturer of the power infrastructure in the datacenter, and they are planning to replace some equipment next Friday, Nov 10th. To accommodate this, we have scheduled an emergency maintenance downtime for the BRC HPC cluster infrastructure from 7:00 am to 3:00 pm on Nov 10th.

All cluster resources will be unavailable to users for the duration of this downtime. Job queues will be blocked, so if you submit any jobs before Nov 10th, make sure you request a wallclock time that lets them finish before 7:00 am on the 10th; otherwise your jobs will wait in the queue until after the downtime. Please check your wallclock requests carefully, so that you are not left wondering why your jobs are not running and do not need to open support tickets with BRC help.
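As a rough illustration of the wallclock arithmetic, the sketch below computes the longest request that would still end before the 7:00 am start of the downtime. The function name, the 15-minute safety margin, and the example submission time are assumptions made for illustration only. It also assumes the job starts as soon as it is submitted; any time spent waiting in the queue shortens the window further.

```python
from datetime import datetime, timedelta


def max_walltime(submit_time, downtime_start, margin=timedelta(minutes=15)):
    """Return the longest wallclock request, as D-HH:MM:SS, that still
    ends before downtime_start (hypothetical helper, not a BRC tool)."""
    available = downtime_start - submit_time - margin
    if available <= timedelta(0):
        raise ValueError("no time left before the downtime starts")
    total_seconds = int(available.total_seconds())
    days, rem = divmod(total_seconds, 86400)   # whole days
    hours, rem = divmod(rem, 3600)             # remaining hours
    minutes, seconds = divmod(rem, 60)         # remaining minutes and seconds
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Assumed example: a job submitted at 9:00 am on Nov 8th, 2017,
# with the downtime starting at 7:00 am on Nov 10th, 2017.
limit = max_walltime(datetime(2017, 11, 8, 9, 0), datetime(2017, 11, 10, 7, 0))
print(limit)
```

The result uses the days-hours:minutes:seconds form, so it could be used as the time limit of a job request. A request longer than this would leave the job waiting in the queue until after the downtime.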

We apologize for these continued unplanned outages, and we are doing everything we can to prevent them from happening in the future.

Email us at brc-hpc-help@berkeley.edu if you have any concerns with this schedule.

(Resolved) BRC clusters down due to power outage

Wednesday, November 1, 2017 - 17:30

Due to a power outage in the data center, BRC clusters became unavailable around 5 PM on November 1st. The systems team restored access by 7:15 PM.

(Resolved) Savio login nodes down

Monday, October 30, 2017 - 13:52

Around 1:40 PM on Monday, October 30th, all Savio login nodes went down due to a power issue at Earl Warren Hall. The systems team restored service by 2:45 PM.

(Resolved) DTN unavailable

Thursday, October 26, 2017 - 21:00

The data transfer node (DTN) for BRC clusters was unavailable as of 9:00 PM on October 26th. Service was restored by 1:30 AM.

(Resolved) Scratch unavailable on BRC clusters

Saturday, October 21, 2017 - 07:30

Scratch storage on the BRC clusters was unavailable between 7:30 AM and 12:30 PM, but is now operating normally.

Planned cluster downtime: 9/21-9/22

Thursday, September 21, 2017 - 11:05

All BRC cluster resources, including Cortex, Savio and Vector, will be taken offline starting at 7:00 AM on Thursday, Sep 21st, and will stay offline until 5:00 PM on Friday, Sep 22nd, in order to conduct essential electrical and storage maintenance. Please contact us at brc-hpc-help@berkeley.edu if you have any concerns.

(Resolved) Metadata server crashed

Saturday, August 19, 2017 - 10:46

As of 10:30 AM on Saturday, August 19th, scratch was unresponsive due to a metadata server crash. The systems team brought the metadata server back online and full access was restored by 12:50 PM.

(Resolved) Savio scratch unresponsive

Wednesday, August 16, 2017 - 21:13

As of 9:00 PM on August 16th, 2017, scratch on the BRC clusters was unresponsive. Functionality was restored by 10:20 PM.

(Resolved) Scratch issues recurring on Savio

Monday, August 14, 2017 - 10:06

Around 10 AM on Monday, August 14th, Savio users began reporting slow response times when accessing scratch. The BRC systems team resolved the issue within a few minutes. Please email brc-hpc-help@berkeley.edu if you encounter this problem on your account.

(Resolved) Possible scratch issues

Saturday, August 12, 2017 - 18:10

On August 12th, a number of users began to report issues with ls in their scratch directories. We believe the issue has been resolved as of 8:20 PM. Please email brc-hpc-help@berkeley.edu if you are still encountering these issues.

Savio scratch briefly unresponsive, now fixed

Thursday, August 3, 2017 - 23:04

Around 7:30 on August 3rd, BRC staff began receiving reports of scratch being slow or unresponsive on Savio. We believe the problem was fixed as of 11:15 PM. If you are still experiencing issues with scratch, please email brc-hpc-help@berkeley.edu.

BRC clusters coming back online

Monday, July 31, 2017 - 09:14

BRC clusters (Savio, Vector, Cortex) are now coming back online after a power incident in the data center that affected all data center customers not connected to the UPS. The data center will be addressing these issues with the vendor.

BRC clusters experiencing downtime

Monday, July 31, 2017 - 08:15

BRC clusters (Savio, Vector, Cortex) are currently experiencing unexpected downtime. The systems team is looking into it and we will post updates as they are available.

BRC clusters back online

Thursday, July 20, 2017 - 17:07

BRC cluster resources are now back online. Users should be able to access the resources as before.

Jobs that were queued or running on the compute nodes at the time of the power incident might have been flushed from the system and lost by the scheduler. Please look for your failed jobs, clean up your files, and resubmit those jobs to the queue.

Savio experiencing power outage

Thursday, July 20, 2017 - 15:45

At approximately 3:45 PM on Thursday, July 20th, Savio experienced a power outage. The sysadmin team is currently working on getting the system back online.

Savio back online after power outage

Tuesday, June 13, 2017 - 16:12

Savio is now back online after the power outage. The SLURM job queue might have been flushed during the power outage: all running jobs might have failed, and jobs waiting in the queue might have been lost by the scheduler. Please look for your jobs and resubmit them to the queue.

Brief power outage impacting Savio

Tuesday, June 13, 2017 - 14:39

There was a brief power outage in the data center around 2:25 PM on Tuesday, June 13th. Savio nodes are currently coming back online, and users will not be able to log into Savio until the login nodes are restored.
