BRC status

(Resolved) Job submission errors on Savio

Tuesday, July 17, 2018 - 08:24

[Update 9:30 AM: This issue should be resolved. Please contact us at brc-hpc-help@berkeley.edu if you continue experiencing problems.]

Since 1:30 AM on 7/17/18, users have been reporting issues with job submission on Savio. Staff are investigating the problem and hope to restore service soon.

[Resolved] Ongoing scratch and DTN issues

Monday, June 4, 2018 - 17:01

As of 11:30 PM on 6/4/18, the scratch and DTN issues should be resolved. Please contact brc-hpc-help@berkeley.edu if you encounter further issues.

BRC cluster users are continuing to report issues with scratch storage and DTN access. Support staff are working on the issue and will post an update when a fix is in place or when we have an ETA for one.

Scratch storage issue on BRC clusters

Sunday, June 3, 2018 - 18:03

Starting Sunday afternoon (6/3/18), users have been reporting issues with scratch storage on BRC clusters. Cluster sysadmins will look into it as soon as possible.

Scratch storage issue on BRC clusters

Wednesday, May 23, 2018 - 13:30

Around 1 PM today, BRC clusters began experiencing issues with scratch storage: any attempt to access the filesystem might cause it to freeze. BRC staff are currently working to restore service.

(Resolved) Login problems on Savio

Thursday, May 10, 2018 - 07:20

Update: The login problems, which were caused by a storage issue, have now been resolved. Please email brc-hpc-help@berkeley.edu if the issue recurs for you.

Since around midnight on 5/10/18, users have been reporting problems with logging into Savio, including the DTN. The systems team is currently looking into the issue.

Jupyterhub on Savio currently unavailable

Sunday, April 29, 2018 - 19:51

Since 4/27/18, Jupyterhub on Savio has experienced a number of outages. The systems team is investigating and will restore service as soon as possible.

Emergency downtime for BRC clusters

Tuesday, April 17, 2018 - 09:02

The BRC clusters are currently in an emergency downtime, from 9-12 on 4/17, to address recent scratch storage issues.

Users should receive a notification when the system is back online. If you have any concerns in the meantime, please email brc-hpc-help@berkeley.edu.

Savio scratch file creation issues

Thursday, March 15, 2018 - 10:00

Update: (3/15/18, 4:30 PM) With help from users with high file counts, we are continuing to work towards stabilizing scratch, but users may continue to experience sporadic issues through tomorrow. Deleting unused files is still helpful, if possible. 

Since 10 AM on 3/15/18, we have been experiencing some issues with Savio scratch, where users may be unable to create new files. BRC staff are working on resolving the problem, but deleting unused files will help us restore access more quickly. We will continue to update users, but if you have specific concerns, please email brc-hpc-help@berkeley.edu.

Scratch filesystem returning to normal

Wednesday, February 7, 2018 - 08:53

Thanks to the quick assistance of a number of top scratch storage users, scratch should be available for use again. If you continue to experience errors, please contact us at brc-hpc-help@berkeley.edu.

Read-only scratch filesystem

Tuesday, February 6, 2018 - 19:47

As of 7:50 PM on 2/6/18, scratch storage on the BRC clusters is read-only due to a space issue. The systems team is actively working on a resolution, but for now no new files can be created on scratch storage.

Scratch filesystem instability

Tuesday, February 6, 2018 - 19:27

We are currently experiencing some ongoing instability with the BRC scratch filesystem and are working with the vendor to resolve it. 

If you experience errors when writing to scratch, please wait a few minutes and try again and/or restart your job. You can also contact us at brc-hpc-help@berkeley.edu with any issues or concerns.

(Resolved) Scratch storage issues for some users

Saturday, February 3, 2018 - 19:26

Update: As of 10 PM the scratch issues appear to be resolved. Please email brc-hpc-help@berkeley.edu if you experience further issues.

As of around 5:45 PM on 2/3/18, some users began experiencing issues with scratch storage on the BRC clusters, with errors like "no space left on device" or "Bad address". BRC staff are currently looking into the issue and will post an update when it's resolved.

[Resolved] Globus unavailable after upgrade

Friday, January 26, 2018 - 11:15

Update: Globus should now be available. Please deactivate any credentials for your Savio BRC endpoint and try again.

BRC staff are working with Globus engineers to address issues with Globus following the SL7 update. We'll post an announcement once it's available again. Thank you for your patience; if you encounter other issues with the system, please let us know at brc-hpc-help@berkeley.edu.

Scheduled cluster downtime 1/23

Monday, January 22, 2018 - 10:53

The BRC clusters (Savio, Vector, and Cortex) will be down on Tuesday, January 23rd for the Scientific Linux 7 OS upgrade. Please email brc-hpc-help@berkeley.edu if you have questions or concerns.

(Resolved) Jupyterhub currently down

Thursday, December 28, 2017 - 13:30

Update: the Jupyterhub node is back online as of 2 PM on 12/30/17.

As of 1:20 PM on 12/28/17, the Jupyterhub node is down. There may be a delay in getting it back online due to winter curtailment, but updates will be posted as they become available.

(Resolved) BRC clusters down for scheduled maintenance

Tuesday, December 19, 2017 - 08:03

As of 12:00 PM, all BRC clusters should be back online. Please email us at brc-hpc-help@berkeley.edu if you encounter any issues.

The BRC clusters are undergoing a brief scheduled downtime to allow us to make some network configuration changes. We expect everything to be back online by noon today.

(Resolved) Jupyterhub on Savio currently unavailable

Tuesday, December 12, 2017 - 09:25

Update: Jupyterhub access has now been restored, as of 10:15 AM on 12/12/17.

At 7:30 PM on 12/11/17, Jupyterhub went down, and hasn't come back up after a restart. Systems staff are currently working on resolving the issue.

Scheduled downtime for BRC clusters: 12/19 9-12

Friday, December 8, 2017 - 15:25

The BRC clusters (Savio, Vector, Cortex) have planned downtime scheduled 12/19, 9 AM - 12 PM, to accommodate some changes to the storage system.

If you have questions or concerns, please contact brc-hpc-help@berkeley.edu.

BRC cluster emergency maintenance TOMORROW, 11/3

Thursday, November 2, 2017 - 12:21

We just learned that the power equipment vendor is available as soon as tomorrow, Nov 3rd, to perform the replacements. To reduce the possibility of more unplanned outages, we are moving the emergency maintenance scheduled for next week up to tomorrow, Friday, Nov 3rd.

All BRC HPC cluster resources, including Cortex, Savio, and Vector, will be taken offline starting at 7:00 am and brought back online by 3:00 pm.

We hope this sudden schedule change does not cause major disruptions to your plans. If it does, please write to us immediately at brc-hpc-help@berkeley.edu.

Emergency BRC cluster maintenance

Thursday, November 2, 2017 - 11:04

Update: Emergency cluster maintenance has now been rescheduled for Friday 11/3.

The BRC HPC cluster infrastructure experienced two unexpected outages this week, once on Monday, Oct 30th and again on Wednesday, Nov 1st. Both were triggered by unplanned power outages in the UCB datacenter that affected all resources not powered by a UPS. In both events, users lost their active login sessions to the cluster resources and their running jobs on the compute nodes. If you have failed or incomplete jobs from either of these events, please review them and resubmit them to the queues.

We are taking these outages very seriously and are working with datacenter operations to improve the situation as soon as possible. The issue has been escalated to the manufacturer of the datacenter's power infrastructure, who plan to replace some equipment next Friday, Nov 10th. To accommodate this, we have scheduled an emergency maintenance downtime for the BRC HPC cluster infrastructure from 7:00 am until 3:00 pm on Nov 10th.

All cluster resources will be unavailable to users for the duration of this downtime. Job queues will be blocked, so if you submit any jobs before Nov 10th, make sure you request a wallclock time that lets them finish before 7:00 am on the 10th; otherwise your jobs will wait in the queue until after the downtime. Please check your wallclock requests carefully so that you are not left wondering why your jobs are not running, and to avoid unnecessary support tickets with BRC help.
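As a rough illustration of the wallclock arithmetic described above, the following Python sketch computes the longest request that would still end before a downtime begins. The function name max_wallclock, the 30-minute safety margin, and the days-hours:minutes:seconds output layout are all assumptions made for this example; check what format your scheduler actually accepts. It also assumes the job starts running the moment it is submitted, so any time spent waiting in the queue must be subtracted as well.

```python
from datetime import datetime, timedelta


def max_wallclock(now: datetime, downtime_start: datetime,
                  margin: timedelta = timedelta(minutes=30)) -> str:
    """Return the longest wallclock request that ends before the downtime.

    Hypothetical helper: the margin and the D-HH:MM:SS layout are
    illustrative assumptions, not a documented BRC requirement.
    """
    remaining = downtime_start - now - margin
    if remaining <= timedelta(0):
        raise ValueError("no time left before the downtime window")
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job submitted at 11:04 on Nov 2, 2017, ahead of a downtime
# that starts at 7:00 am on Nov 10, 2017.
print(max_wallclock(datetime(2017, 11, 2, 11, 4),
                    datetime(2017, 11, 10, 7, 0)))
```

If the downtime date changes, only the second argument needs to be updated.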

We apologize for these continued unplanned outages; we are doing everything we can to prevent them from happening in the future.

Email us at brc-hpc-help@berkeley.edu if you have any concerns with this schedule.
