BRC status

Planned cluster downtime: 9/21-9/22

Thursday, September 21, 2017 - 11:05

All BRC cluster resources, including Cortex, Savio, and Vector, will be taken offline at 7:00 AM on Thursday, September 21st and will remain offline until 5:00 PM on Friday, September 22nd, for essential electrical and storage maintenance. Please contact us at brc-hpc-help@berkeley.edu if you have any concerns.

(Resolved) Metadata server crashed

Saturday, August 19, 2017 - 10:46

As of 10:30 AM on Saturday, August 19th, scratch was unresponsive due to a metadata server crash. The systems team brought the metadata server back online and full access was restored by 12:50 PM.

(Resolved) Savio scratch unresponsive

Wednesday, August 16, 2017 - 21:13

As of 9:00 PM on August 16th, 2017, scratch on the BRC clusters was unresponsive. Functionality was restored by 10:20 PM.

(Resolved) Scratch issues recurring on Savio

Monday, August 14, 2017 - 10:06

As of 10 AM on Monday, August 14th, Savio users began reporting slow response times when accessing scratch. The BRC systems team resolved the issue within a few minutes. Please email brc-hpc-help@berkeley.edu if you encounter this problem on your account.

(Resolved) Possible scratch issues

Saturday, August 12, 2017 - 18:10

On August 12th, a number of users began to report issues with ls in their scratch directories. We believe the issue has been resolved as of 8:20 PM. Please email brc-hpc-help@berkeley.edu if you are still encountering these issues.

Savio scratch briefly unresponsive, now fixed

Thursday, August 3, 2017 - 23:04

Starting around 7:30 on August 3rd, BRC staff received reports of scratch being slow or unresponsive on Savio. We believe the problem was fixed as of 11:15 PM. If you are still experiencing issues with scratch, please email brc-hpc-help@berkeley.edu.

BRC clusters coming back online

Monday, July 31, 2017 - 09:14

BRC clusters (Savio, Vector, Cortex) are now coming back online after a power incident in the data center that affected all data center customers not connected to the UPS. The data center will be addressing these issues with the vendor.

BRC clusters experiencing downtime

Monday, July 31, 2017 - 08:15

BRC clusters (Savio, Vector, Cortex) are currently experiencing unexpected downtime. The systems team is looking into it and we will post updates as they are available.

BRC clusters back online

Thursday, July 20, 2017 - 17:07

BRC cluster resources are now back online. Users should be able to access them as before.

Jobs that were running on the compute nodes at the time of the power incident may have been flushed from the system and lost by the scheduler. Please check for failed jobs, clean up your files, and resubmit those jobs to the queue.

Savio experiencing power outage

Thursday, July 20, 2017 - 15:45

At approximately 3:45 PM on Thursday, July 20th, Savio experienced a power outage. The sysadmin team is currently working on getting the system back online.

Savio back online after power outage

Tuesday, June 13, 2017 - 16:12

Savio is now back online after the power outage. The SLURM job queue may have been flushed during the outage: running jobs may have failed, and jobs waiting in the queue may have been lost by the scheduler. Please check on your jobs and resubmit them to the queue.

Brief power outage impacting Savio

Tuesday, June 13, 2017 - 14:39

There was a brief power outage in the data center around 2:25 PM on Tuesday, June 13th. Savio nodes are currently coming back online, and users will not be able to log into Savio until the login nodes are restored.

DNS issue resolved

Thursday, May 25, 2017 - 06:33

As of early this morning, the DNS issues with hpc.brc.berkeley.edu have been fixed. You should be able to connect to it normally now. If you experience any issues, please email brc-hpc-help@berkeley.edu.

DNS issue - hpc.brc.berkeley.edu not available

Wednesday, May 24, 2017 - 20:37

We are currently experiencing a DNS issue with hpc.brc.berkeley.edu. We hope to restore access soon. Scheduled jobs will continue to run in the meantime, and dtn.brc.berkeley.edu is still available.

Update 10:40 PM: We are in contact with the network team and anticipate that access should be restored by tomorrow (Thursday) morning.

Savio, Cortex, and Vector back online after scheduled maintenance

Wednesday, May 17, 2017 - 15:36

Scheduled maintenance is now complete on the Savio, Cortex, and Vector clusters. If you encounter any issues, please contact us at brc-hpc-help@berkeley.edu.

Savio, Vector, Cortex currently undergoing planned maintenance

Tuesday, May 16, 2017 - 09:00

The Savio, Vector, and Cortex clusters are currently down for planned maintenance from 9 AM on Tuesday, May 16th until 5 PM on Wednesday, May 17th. During the maintenance period, we are upgrading and expanding storage, performing an OS/VNFS update, and doing a Slurm update. If you have any questions or concerns, please email brc-hpc-help@berkeley.edu.

Savio, Vector, Cortex planned downtime May 16-17

Tuesday, May 9, 2017 - 09:16

The BRC Supercluster infrastructure, including the Cortex, Vector, and Savio clusters and all associated condos, will be unavailable for two days of scheduled maintenance on Tuesday, May 16th and Wednesday, May 17th. We plan to perform storage upgrades during this downtime. Access to the login/front-end nodes and compute nodes of all three clusters, the scheduler queues, and data on the filesystems will be blocked from 9:00 am on Tuesday, May 16th until 5:00 pm on Wednesday, May 17th.

If you are submitting jobs to any of the cluster partitions, please choose wallclock times that let your jobs finish before 9:00 am on Tuesday, May 16th. Otherwise, your jobs will stay in the queue until the downtime is over.
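As a rough illustration of the wallclock arithmetic above, the following Python sketch computes the longest time limit that still ends before the downtime begins. The function name `max_wallclock`, the 15-minute safety margin, and the assumption that the job starts running as soon as it is submitted are ours, not part of the announcement. The result is formatted as `days-hours:minutes:seconds`, one of the forms accepted by Slurm's `--time` option.

```python
from datetime import datetime, timedelta

# Start of the maintenance window from the announcement above:
# 9:00 am on Tuesday, May 16th, 2017 (local cluster time).
DOWNTIME_START = datetime(2017, 5, 16, 9, 0)


def max_wallclock(start_time, downtime_start=DOWNTIME_START,
                  margin=timedelta(minutes=15)):
    """Return the longest time limit (D-HH:MM:SS) that ends before the downtime.

    Hypothetical helper: assumes the job begins running at start_time.
    Returns None when no time remains before the downtime.
    """
    remaining = downtime_start - start_time - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job expected to start at 2:30 pm on Monday, May 15th.
print(max_wallclock(datetime(2017, 5, 15, 14, 30)))  # prints 0-18:15:00
```

The printed value could then be passed as the job's time limit, for example with `--time=0-18:15:00`. If the job waits in the queue before starting, the usable window is shorter than this estimate.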

Email us at brc-hpc-help@berkeley.edu if you have any questions or concerns with this schedule.