As of 10:30 AM on Saturday, August 19th, scratch was unresponsive due to a metadata server crash. The systems team brought the metadata server back online and full access was restored by 12:50 PM.
(Resolved) Metadata server crashed
(Resolved) Savio scratch unresponsive
As of 9:00 PM on August 16th, 2017, scratch on the BRC clusters was unresponsive. Functionality was restored by 10:20 PM.
(Resolved) Scratch issues recurring on Savio
Beginning at 10 AM on Monday, August 14th, Savio users reported slow response times when accessing scratch. The BRC systems team resolved the issue within a few minutes. Please email firstname.lastname@example.org if you encounter this problem on your account.
(Resolved) possible scratch issues
On August 12th, a number of users began to report issues with ls in their scratch directories. We believe the issue has been resolved as of 8:20 PM. Please email email@example.com if you are still encountering these issues.
Savio scratch briefly unresponsive, now fixed
Around 7:30 on August 3rd, BRC staff began receiving reports of scratch being slow or unresponsive on Savio. We believe the problem was fixed as of 11:15 PM. If you are still experiencing issues with scratch, please email firstname.lastname@example.org.
BRC clusters coming back online
BRC clusters (Savio, Vector, Cortex) are now coming back online after a power incident in the data center that affected all data center customers not connected to the UPS. The data center will be addressing these issues with the vendor.
BRC clusters experiencing downtime
BRC clusters (Savio, Vector, Cortex) are currently experiencing unexpected downtime. The systems team is looking into it and we will post updates as they are available.
BRC clusters back online
BRC cluster resources are now back online. Users should be able to access the resources as before.
Jobs that were queued or running on the compute nodes at the time of the power incident may have been flushed from the system and lost by the scheduler. Please check for failed jobs, clean up your files, and resubmit those jobs to the queue.
Savio experiencing power outage
At approximately 3:45 PM on Thursday, July 20th, Savio experienced a power outage. The sysadmin team is currently working on getting the system back online.
Savio back online after power outage
Savio is now back online after the power outage. The SLURM job queue may have been flushed during the outage: all running jobs may have failed, and jobs waiting in the queue may have been lost by the scheduler. Please check on your jobs and resubmit them to the queue.
Brief power outage impacting Savio
There was a brief power outage in the data center around 2:25 PM on Tuesday, June 13th. Savio nodes are currently coming back online, and users will not be able to log into Savio until the login nodes are restored.
DNS issue resolved
As of early this morning, the DNS issues with hpc.brc.berkeley.edu have been fixed. You should be able to connect to it normally now. If you experience any issues, please email email@example.com.
DNS issue - hpc.brc.berkeley.edu not available
We are currently experiencing a DNS issue with hpc.brc.berkeley.edu. We hope to restore access soon. Scheduled jobs will continue to run in the meantime, and dtn.brc.berkeley.edu is still available.
Update 10:40 PM: We are in contact with the network team and anticipate that access should be restored by tomorrow (Thursday) morning.
Savio, Cortex, and Vector back online after scheduled maintenance
Scheduled maintenance is now complete on the Savio, Cortex, and Vector clusters. If you encounter any issues, please contact us at firstname.lastname@example.org.
Savio, Vector, Cortex currently undergoing planned maintenance
The Savio, Vector, and Cortex clusters are currently down for planned maintenance from 9 AM on Tuesday, May 16th until 5 PM on Wednesday, May 17th. During the maintenance period, we are upgrading and expanding storage, performing an OS/VNFS update, and doing a Slurm update. If you have any questions or concerns, please email email@example.com.
Savio, Vector, Cortex planned downtime May 16-17
The BRC Supercluster infrastructure, including all of its clusters (Cortex, Vector, Savio, and all associated condos), will be unavailable for scheduled maintenance for two days, Tuesday, May 16th and Wednesday, May 17th. We plan to perform storage upgrades during this downtime. Access to the login/front-end nodes and compute nodes of all three clusters, the scheduler queues, and data on the filesystems will be blocked from 9:00 am on Tuesday, May 16th until 5:00 pm on Wednesday, May 17th.
If you are submitting jobs to any of the cluster partitions, please choose wallclock times such that your jobs finish running before 9:00 am on Tuesday, May 16th; otherwise your jobs will stay in the queue until the downtime is over.
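As a rough illustration of the wallclock guidance above, the following Python sketch computes the longest time limit a job could request and still finish before the downtime begins. The function name `max_walltime`, the 15-minute safety margin, and the year 2017 (inferred from other announcements on this page) are assumptions made for the example, not part of BRC's tooling. The result is formatted as `days-hours:minutes:seconds`, one of the forms Slurm accepts for a time limit.

```python
from datetime import datetime, timedelta


def max_walltime(now, downtime_start, margin=timedelta(minutes=15)):
    """Return the longest time limit (D-HH:MM:SS) that ends before downtime_start.

    Returns None when there is no room left for a job before the downtime.
    The margin is a hypothetical safety buffer, not a BRC requirement.
    """
    remaining = downtime_start - now - margin
    if remaining <= timedelta(0):
        return None
    total_seconds = int(remaining.total_seconds())
    days, rest = divmod(total_seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Downtime start from the announcement; the year is an assumption.
downtime_start = datetime(2017, 5, 16, 9, 0)

# Hypothetical submission time: 8:00 am on Monday, May 15th.
example_now = datetime(2017, 5, 15, 8, 0)
print(max_walltime(example_now, downtime_start))
```

The printed value could then be passed as the job's time limit at submission. Note that this only accounts for run time; a job that waits in the queue before starting would need a correspondingly shorter limit.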
Email us at firstname.lastname@example.org if you have any questions or concerns about this schedule.