Savio back online
Monday, October 28, 2019 - 05:41
The emergency datacenter work scheduled for this AM is complete. Savio is back in production in a limited capacity, as we requested and received an allocation of campus co-gen power to meet critical research needs. Users who had jobs still running as of 6am this morning will want to check and resubmit their jobs.
Savio down on Monday, 10/28 due to power shutdown
Saturday, October 26, 2019 - 09:22
Based on information from campus leadership, we are expecting that PG&E power to campus will be shut off Saturday, October 26 due to the Fire Weather Watch and likelihood of high winds. For this outage we have been fortunate to receive an allocation of co-gen power from campus facilities, so Savio’s compute nodes will be able to stay operational, online, and available for service until Monday at 6am.
Monday at 6am we will bring down all Savio compute nodes, with service expected to be restored by Monday 6pm. Job reservations are in place so that no jobs will be running in the queue after 6am on Monday. For all job submissions, please choose wall clock times such that jobs finish before 6am on Monday; otherwise they will wait in the queue until after the outage ends at 6pm on Monday. As before, Savio’s login nodes and data storage are on UPS and will remain up during the entire power outage.
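One way to pick a wall clock limit that fits before the shutdown is to compute the time remaining until 6am Monday and request no more than that. The sketch below is illustrative only: the function name max_walltime and the 15-minute safety margin are assumptions, not part of any BRC tooling. The result uses the days-hours:minutes:seconds form that Slurm's --time option accepts.

```python
from datetime import datetime, timedelta


def max_walltime(now, outage_start, margin=timedelta(minutes=15)):
    """Return the longest time limit (D-HH:MM:SS) that ends before outage_start.

    `margin` is a hypothetical safety buffer subtracted from the remaining time.
    """
    remaining = outage_start - now - margin
    if remaining <= timedelta(0):
        raise ValueError("not enough time left before the outage")
    total = int(remaining.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# A job submitted when this notice was posted, for the Monday 6am shutdown
limit = max_walltime(datetime(2019, 10, 26, 9, 22), datetime(2019, 10, 28, 6, 0))
print(limit)  # 1-20:23:00
```

The value is an upper bound at submission time. A job that waits in the queue before starting has less time available, so a shorter request is safer.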
Savio back online
Friday, October 11, 2019 - 15:09
Earlier this afternoon, we received word that the non-UPS power had been restored to the Warren Hall datacenter so we immediately started work to bring Savio back online. At this point, Savio is back in full production so users can return to doing their work. We did notice a few jobs were still running when we brought the system down before the power outage. Users should check their last running jobs and restart if necessary.
Savio still down: update
Thursday, October 10, 2019 - 13:39
While parts of the campus are running on power from the Cogen plant we have been requested to hold off bringing the Savio cluster back up due to concerns around overloading the generation facility. To this end we are waiting for clearance from campus facilities before beginning to restore service.
As soon as we have a green light to bring systems back up we will send out another note with an estimated time for service availability followed by a confirmation of service availability. Thank you for your patience as we navigate this outage. All hands are on standby to restore service as soon as possible.
Savio down on 10/9 due to power shutoff
Tuesday, October 8, 2019 - 16:37
Based on information from campus leadership, we are expecting that PG&E power to campus will be shut off on Wednesday, October 9 at 8am due to the Fire Weather Watch and likelihood of high winds. We will need to shut down Savio starting at 6am.
Users with access to other computing resources may want to copy their data there as a precaution. Note that a PG&E outage at UC Berkeley will also affect LBNL computing resources.
Update on Savio Status: Back Online
Monday, September 30, 2019 - 10:24
Datacenter staff finished repairs to the transformer this morning (Monday, 9/30) and were able to switch the power source from the generator to house power. We paused the SLURM scheduler queues around 7:00 am today to shut down all compute resources and allow the switch from generator to transformer power. After that we were able to power all compute resources back on and release the job queues at around 12:30 pm. We would like to thank all of our users for their patience and cooperation during this unexpected outage.
Update on Savio availability from 9/24-9/30
Tuesday, September 24, 2019 - 16:37
As some of you might have noticed, we were able to gradually bring BRC cluster resources back online on 9/23. Campus data center staff acted quickly and were able to provide us with alternative generator power as they continued their repairs on the broken transformer. Using this generator power we were able to bring Savio & Vector resources back to the level we were at at the beginning of this week, i.e., Monday morning 9/23.
Data center staff estimate that the transformer will be repaired and ready by Monday morning 9/30, when we will have to switch the power source from the generator back to house (transformer) power. This will result in another half-day of downtime for BRC cluster resources sometime on or after Monday 9/30.
For now job queues have been released and many user jobs have been running on all three generations of Savio and Vector cluster nodes (savio1, savio2, savio3). If the generator power and cooling remain stable we will be able to run at the current level until 6AM on Monday 9/30.
Savio unexpectedly unavailable from 9/23-9/30
Monday, September 23, 2019 - 15:34
Due to an unexpected power system emergency in the Warren Hall data center, Savio will be shut down from the evening of Mon, 9/23 until the repairs are complete on the morning of Mon, 9/30.
Savio back online
Friday, August 16, 2019 - 13:33
Savio is all back online as of 11:15 AM today. All services have been restored as before.
Update: Unscheduled downtime for BRC/Savio due to a power event in the datacenter
Thursday, August 15, 2019 - 15:23
Datacenter staff and electricians worked all day today to replace the burnt power circuit breakers with new parts, but they ran into some more issues. Their work has been extended into tomorrow as they need to wait for the arrival of some more parts. Unfortunately, this means that none of the Savio compute nodes are yet accessible via the SLURM partitions, but we hope to make them accessible by the end of tomorrow, assuming the electricians can finish their repairs early tomorrow morning, 8/16.
We just removed the login blocks so users can access login/front-end nodes, the DTN, and data on the filesystems, and we expect to keep this access open through tomorrow.
We apologize for this disruption in service but we are doing our best to keep as many services as possible online. Email us at email@example.com if you need any further help.
Savio partially online
Tuesday, August 13, 2019 - 11:05
The BRC Savio cluster is partially online as of this morning, 8/13. Over the weekend multiple power circuit breakers tripped in the datacenter and all Savio compute nodes lost power. When we tried to reset the breakers on Monday, we found that some of them are damaged and need replacement. We were able to reset other breakers and restore power to some of the circuits in the datacenter as of Monday evening. Our engineers rerouted the power cables from the compute nodes and network switches to the working power circuits and brought part of the Savio cluster online.
Datacenter staff are scheduled to take the entire cluster offline again on Thursday 8/15 at 10:00 am to replace the broken circuit breakers.
Savio cluster queues are now open for users to run jobs, but only a small number of nodes are available in each of the Savio partitions. We therefore ask that only users with immediate deadlines use the Savio cluster today and tomorrow, and that others avoid keeping the queues busy. Also note that, because the next downtime starts at 10AM on Thursday, jobs need wallclock requests short enough (< 48 hours) to start running in the queues.
Savio outage 8/12/2019
Monday, August 12, 2019 - 09:53
We experienced an unexpected power event that disrupted the power supply to all the compute nodes of the Savio cluster sometime yesterday, 8/11. Our engineers have been in the datacenter since early this morning, making fixes and changes to the power layout in order to restore services.
Right now users can log in to the cluster front-end nodes and access their data in the cluster filesystems, but no jobs are running in the Savio cluster queues.
Once we finish rebalancing power and bring the nodes online, jobs will resume running as previously scheduled. CGRL's Vector cluster nodes are not impacted by this power event.
Filesystem slowdowns on BRC systems affecting various cluster activities
Thursday, August 8, 2019 - 16:52
We are currently seeing slow filesystem responsiveness, which is affecting many user operations, including running shell commands, starting applications, transferring data, and (as in the recent reports) launching Jupyter notebooks in the BRC Jupyterhub. BRC staff are looking into the issue.
BRC Jupyterhub service experiencing problems
Thursday, August 8, 2019 - 13:46
Users are having trouble accessing the BRC Jupyterhub service. BRC staff are looking into the problem. As a workaround for the time being, you can access Jupyter notebooks on the Savio visualization node by following the instructions in the RIT documentation.
Issues with adding or managing OTP tokens due to website being down
Tuesday, August 6, 2019 - 16:13
Multiple users have reported that they are unable to access the non-LBL Token Management page (https://identity.lbl.gov/otptokens/hpccluster) to generate a one-time password token. The HPCS team is working with the identity management (IDM) staff to track down the authentication problems. The latest update is that the IDM staff have implemented a workaround that seems to work reliably for the time being, but we see that some users continue to be unable to access the non-LBL Token Management page. BRC staff are continuing to work on this problem and we will provide another update shortly.
Update: BRC cluster login returning to normal
Tuesday, August 6, 2019 - 16:49
We believe we've resolved the login issues. Please let us know if you experience problems.
Ongoing Savio login issues
Tuesday, August 6, 2019 - 14:41
Users have been reporting problems logging in, with their password not being accepted. BRC staff are looking into this.
In the meantime, simply waiting for a minute and trying again may allow you to get access.
BRC Savio Cluster expected to be online by 5 pm August 5
Monday, August 5, 2019 - 14:22
Our original post about Savio being online first thing on the morning of August 5 was incorrect (and contrary to the email message that was sent out).
Savio downtime from August 2 - August 5
Monday, August 5, 2019 - 13:49
Savio will be shut down on Friday Aug 2 after 5pm to accommodate electrical work in the data center. Savio will be brought back online on Monday, Aug 5 at 5pm.
BRC Savio Cluster shutdown planned for the weekend of Aug 3
Sunday, June 9, 2019 - 21:46
BRC Savio will be shut down on Friday Aug 2 after 5pm to accommodate electrical work in the data center. Savio will be brought back online first thing on Monday morning Aug 5.
BRC Savio Cluster shutdown planned for the weekend of Aug 3
Sunday, June 9, 2019 - 21:41
Campus Facilities Services will be performing maintenance on the main power switchboard of Earl Warren Hall on Saturday, August 3rd, 2019, from 8:00 AM to 3:00 PM. This work will cut off electricity to all non-UPS-backed equipment in the Campus Data Center, including the BRC Savio Cluster. To accommodate the electrical work, Savio is scheduled to be shut down on Friday Aug 2 after 5pm and will be brought back online first thing on Monday morning Aug 5.
BRC cluster downtime planned for 8/6-8/7
Tuesday, July 31, 2018 - 09:19
BRC staff have made arrangements with the vendor to perform an upgrade of our Lustre file storage system on August 6th - 7th, which was unable to take place during our most recent scheduled downtime.
Scheduled downtime 7/24-7/25
Monday, July 23, 2018 - 09:59
Our next maintenance downtime for the BRC HPC Supercluster is scheduled for July 24th and 25th. It will be a two-day downtime, from 8:00 am on Tuesday until 5:00 pm on Wednesday.
We need to do some long-pending maintenance tasks and improvements to the scratch filesystem which will help us manage it better.
All access to cluster login nodes, the data transfer node, scheduler queues, and data on all the cluster filesystems will be blocked. This downtime affects all three clusters in the supercluster infrastructure (Savio, Cortex, and Vector). After the downtime, access will be restored as before.
We have put scheduler reservations in place so that no jobs will be running after 8:00 am on July 24th. If you submit jobs to any cluster queue before the downtime, please request a wallclock time such that they finish running before 8:00 am on the 24th; otherwise your jobs will wait in the queue until after the downtime.
(Resolved) Job submission errors on Savio
Tuesday, July 17, 2018 - 08:24
[Update 9:30 AM: This issue should be resolved. Please contact us at firstname.lastname@example.org if you continue experiencing problems.]
Since 1:30 AM on 7/17/18, users have been reporting issues with job submission on Savio. Staff are investigating the problem and hope to restore service soon.
[Resolved] Ongoing scratch and DTN issues
Monday, June 4, 2018 - 17:01
As of 11:30 PM on 6/4/18, the scratch and DTN issues should be resolved. Please contact email@example.com if you encounter further issues.
BRC cluster users are continuing to report issues with scratch storage and DTN access. Support staff are currently working on the issue, and will post an update when a fix is in place, or we have an ETA for a fix.
Scratch storage issue on BRC clusters
Sunday, June 3, 2018 - 18:03
Starting Sunday afternoon (6/3/18), users have been reporting issues with scratch storage on BRC clusters. Cluster sysadmins will look into it as soon as possible.
Scratch storage issue on BRC clusters
Wednesday, May 23, 2018 - 13:30
Beginning around 1 PM today, BRC clusters began experiencing issues with scratch storage, where any attempt to access the filesystem might cause it to freeze. BRC staff are currently working to restore service.
(Resolved) Login problems on Savio
Thursday, May 10, 2018 - 07:20
Update: The login problems, which were caused by a storage issue, have now been resolved. Please email firstname.lastname@example.org if the issue reoccurs for you.
Since around midnight on 5/10/18, users have been reporting problems with logging into Savio, including the DTN. The systems team is currently looking into the issue.
Jupyterhub on Savio currently unavailable
Sunday, April 29, 2018 - 19:51
Since 4/27/18, Jupyterhub on Savio has experienced a number of outages. The systems team is investigating and will restore service as soon as possible.
Emergency downtime for BRC clusters
Tuesday, April 17, 2018 - 09:02
We are currently undergoing an emergency downtime from 9-12 on 4/17 to address recent scratch storage issues.
Users should receive a notification when the system is back online. If you have any concerns in the meantime, please email email@example.com
Savio scratch file creation issues
Thursday, March 15, 2018 - 10:00
Update: (3/15/18, 4:30 PM) With help from users with high file counts, we are continuing to work towards stabilizing scratch, but users may continue to experience sporadic issues through tomorrow. Deleting unused files is still helpful, if possible.
Since 10 AM on 3/15/18, we have been experiencing some issues with Savio scratch, where users may be unable to create new files. BRC staff are working on resolving the problem, but deleting unused files will help us restore access more quickly. We will continue to update users, but if you have specific concerns, please email firstname.lastname@example.org
Scratch filesystem returning to normal
Wednesday, February 7, 2018 - 08:53
Thanks to the quick assistance of a number of top scratch storage users, scratch should be available for use again. If you continue to experience errors, please contact us at email@example.com
Read-only scratch filesystem
Tuesday, February 6, 2018 - 19:47
Scratch storage on the BRC clusters is currently read-only due to a space issue as of 7:50 PM on 2/6/18. The systems team is actively working on a resolution, but currently no new files can be created on scratch storage.
Scratch filesystem instability
Tuesday, February 6, 2018 - 19:27
We are currently experiencing some ongoing instability with the BRC scratch filesystem and are working with the vendor to resolve it.
If you experience errors when writing to scratch, please wait a few minutes and try again and/or restart your job. You can also contact us at firstname.lastname@example.org with any issues or concerns.
(Resolved) Scratch storage issues for some users
Saturday, February 3, 2018 - 19:26
Update: As of 10 PM the scratch issues appear to be resolved. Please email email@example.com if you experience further issues.
As of around 5:45 PM on 2/3/18, some users began experiencing issues with scratch storage on the BRC clusters, with errors like "no space left on device" or "Bad address". BRC staff are currently looking into the issue and will post an update when it's resolved.
[Resolved] Globus unavailable after upgrade
Friday, January 26, 2018 - 11:15
Update: Globus should now be available. Please deactivate any credentials for your savio brc endpoint and try again.
BRC staff are working with Globus engineers to address issues with Globus following the SL7 update. We'll post an announcement once it's available again. Thank you for your patience; if you encounter other issues with the system, please let us know at firstname.lastname@example.org
Host key / host ID error messages
Thursday, January 25, 2018 - 10:08
Following Tuesday's SL7 upgrade, many users are seeing messages like "Host key verification failed" or "Remote host identification has changed". This is expected behavior. Please edit the $HOME/.ssh/known_hosts file on your computer and remove all the BRC entries (e.g., entries that start with hpc.brc, dtn.brc, etc.).
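For users who prefer not to edit the file by hand, the cleanup described above can be sketched in Python. This is a minimal sketch under assumptions: the helper name drop_host_entries is hypothetical, and the prefixes are just the two examples given above. Hashed known_hosts entries (lines beginning with |1|) cannot be matched by name and are left untouched. OpenSSH's ssh-keygen -R <hostname> is the standard alternative for removing a single host.

```python
def drop_host_entries(lines, prefixes=("hpc.brc", "dtn.brc")):
    """Return known_hosts lines, minus entries whose host starts with a prefix.

    Comments, blank lines, and hashed entries are kept as-is.
    """
    prefixes = tuple(prefixes)
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        fields = stripped.split()
        # Lines with a marker (@cert-authority, @revoked) list hosts second.
        if fields[0].startswith("@") and len(fields) > 1:
            host_field = fields[1]
        else:
            host_field = fields[0]
        # The host field is comma-separated; non-default ports look like [host]:port.
        hosts = [h.lstrip("[") for h in host_field.split(",")]
        if any(h.startswith(prefixes) for h in hosts):
            continue  # drop this stale BRC entry
        kept.append(line)
    return kept
```

To apply it, read $HOME/.ssh/known_hosts into a list of lines, save a backup copy, and write the returned list back to the file.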
Scheduled cluster downtime 1/23
Monday, January 22, 2018 - 10:53
The BRC clusters (Savio, Vector, and Cortex) will be down on Tuesday, January 23rd for the Scientific Linux 7 OS upgrade. Please email email@example.com if you have questions or concerns.
(Resolved) Jupyterhub currently down
Thursday, December 28, 2017 - 13:30
Update: the Jupyterhub node is back online as of 2 PM on 12/30/17.
As of 1:20 PM on 12/28/17, the Jupyterhub node is down. There may be a delay in getting it back online due to winter curtailment, but updates will be posted as they become available.
(Resolved) BRC clusters down for scheduled maintenance
Tuesday, December 19, 2017 - 08:03
As of 12:00 PM, all BRC clusters should be back online. Please email us at firstname.lastname@example.org if you encounter any issues.
The BRC clusters are undergoing a brief scheduled downtime to allow us to make some network configuration changes. We expect everything to be back online by noon today.
(Resolved) Jupyterhub on Savio currently unavailable
Tuesday, December 12, 2017 - 09:25
Update: Jupyterhub access has now been restored, as of 10:15 AM on 12/12/17.
At 7:30 PM on 12/11/17, Jupyterhub went down, and hasn't come back up after a restart. Systems staff are currently working on resolving the issue.
Scheduled downtime for BRC clusters: 12/19 9-12
Friday, December 8, 2017 - 15:25
The BRC clusters (Savio, Vector, Cortex) have planned downtime scheduled 12/19, 9 AM - 12 PM, to accommodate some changes to the storage system.
BRC cluster emergency maintenance TOMORROW, 11/3
Thursday, November 2, 2017 - 12:21
We just learned that the power equipment vendor is available as soon as tomorrow, Nov 3rd, to perform the replacements. To reduce the possibility of more unplanned outages, we are moving the emergency maintenance scheduled for next week up to tomorrow, Friday, Nov 3rd.
All BRC HPC cluster resources including Cortex, Savio and Vector will be taken offline starting at 7:00 am and brought back online by 3:00 pm.
We hope this sudden schedule change does not cause major disruptions to your plans. If it does, please write to us immediately at email@example.com
Emergency BRC cluster maintenance
Thursday, November 2, 2017 - 11:04
Update: Emergency cluster maintenance has now been rescheduled for Friday 11/3.
The BRC HPC cluster infrastructure experienced two unexpected outages this week, once on Monday, Oct 30th and again on Wednesday, Nov 1st. Both were triggered by unplanned power outages in the UCB datacenter that affected all non-UPS-powered resources. In both events users lost their active login sessions to the cluster resources and their running jobs on the compute nodes. If you have failed or incomplete jobs from either of these events, please review them and resubmit them to the queues.
We are taking these outages very seriously and are working with datacenter operations to improve the situation as soon as possible. The issue has been escalated to the manufacturer of the datacenter's power infrastructure, who plan to replace some equipment next Friday, Nov 10th. To accommodate this, we have scheduled an emergency maintenance downtime for the BRC HPC cluster infrastructure from 7:00 am until 3:00 pm on Nov 10th.
All cluster resources will be unavailable to users for the duration of this downtime. Job queues will be blocked, so if you submit any jobs before Nov 10th, make sure you request a wallclock time such that they finish running before 7:00 am on the 10th; otherwise your jobs will wait in the queue until after the downtime. Please check your wallclock requests before opening a support ticket with BRC help about jobs that are not running.
We apologize for these continued unplanned outages. We are doing everything we can to prevent them from happening in the future.
(Resolved) BRC clusters down due to power outage
Wednesday, November 1, 2017 - 17:30
Due to a power outage in the data center, BRC clusters became unavailable around 5 PM on November 1st. The systems team restored access by 7:15 PM.
(Resolved) Savio login nodes down
Monday, October 30, 2017 - 13:52
Around 1:40 PM on Monday, October 30th, all Savio login nodes went down due to a power issue at Earl Warren Hall. The systems team restored service by 2:45 PM.
(Resolved) DTN unavailable
Thursday, October 26, 2017 - 21:00
The data transfer node (DTN) for BRC clusters was unavailable as of 9:00 PM on October 26th. Service was restored by 1:30 AM.
(Resolved) Scratch unavailable on BRC clusters
Saturday, October 21, 2017 - 07:30
Scratch storage on the BRC clusters was unavailable between 7:30 AM and 12:30 PM, but is now operating normally.
Planned cluster downtime: 9/21-9/22
Thursday, September 21, 2017 - 11:05
All BRC cluster resources including Cortex, Savio and Vector will be taken offline starting at 7:00 AM on Sep 21st, Thursday and will stay offline until 5:00 PM on Sep 22nd, Friday, in order to conduct essential electrical and storage maintenance. Please contact us at firstname.lastname@example.org if you have any concerns.
(Resolved) Metadata server crashed
Saturday, August 19, 2017 - 10:46
As of 10:30 AM on Saturday, August 19th, scratch was unresponsive due to a metadata server crash. The systems team brought the metadata server back online and full access was restored by 12:50 PM.
(Resolved) Savio scratch unresponsive
Wednesday, August 16, 2017 - 21:13
As of 9:00 PM on August 16th, 2017, scratch on the BRC clusters was unresponsive. Functionality was restored by 10:20 PM.
(Resolved) Scratch issues recurring on Savio
Monday, August 14, 2017 - 10:06
As of 10 AM on Monday, August 14th, Savio users began reporting slow response times on Savio when accessing scratch. The BRC systems team resolved the issue within a few minutes. Please email email@example.com if you encounter this problem on your account.
(Resolved) possible scratch issues
Saturday, August 12, 2017 - 18:10
On August 12th, a number of users began to report issues with ls in their scratch directories. We believe the issue has been resolved as of 8:20 PM. Please email firstname.lastname@example.org if you are still encountering these issues.
Savio scratch briefly unresponsive, now fixed
Thursday, August 3, 2017 - 23:04
Beginning around 7:30 on August 3rd, BRC staff began receiving reports of scratch being slow or unresponsive on Savio. We believe the problem was fixed as of 11:15 PM. If you are still experiencing issues with scratch, please email email@example.com
BRC clusters coming back online
Monday, July 31, 2017 - 09:14
BRC clusters (Savio, Vector, Cortex) are now coming back online after a power incident in the data center that affected all data center customers not connected to the UPS. The data center will be addressing these issues with the vendor.
BRC clusters experiencing downtime
Monday, July 31, 2017 - 08:15
BRC clusters (Savio, Vector, Cortex) are currently experiencing unexpected downtime. The systems team is looking into it and we will post updates as they are available.
BRC clusters back online
Thursday, July 20, 2017 - 17:07
BRC Cluster resources are now back online. Users should be able to access the resources as before. Jobs that were running on the compute nodes at the time of the power incident might have been flushed from the system and lost by the scheduler. Please look for your failed jobs, clean up your files, and resubmit those jobs to the queue.
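One possible way to look for interrupted jobs is to scan Slurm accounting output for unsuccessful end states. The sketch below is an illustration under assumptions: the helper name jobs_to_review and the choice of states are hypothetical, and it expects pipe-delimited text such as that produced by sacct -X --parsable2 --noheader --format=JobID,JobName,State. Jobs that the scheduler lost entirely may not appear in the accounting records at all, so this is only a starting point.

```python
# Slurm job states that suggest a job did not finish normally.
# Which states matter after an outage is an assumption for this sketch.
INTERRUPTED_STATES = {"FAILED", "NODE_FAIL", "CANCELLED", "TIMEOUT"}


def jobs_to_review(sacct_text):
    """Return (job_id, job_name, state) for jobs that ended in an interrupted state.

    Expects one job per line in the form JobID|JobName|State.
    """
    jobs = []
    for line in sacct_text.splitlines():
        parts = line.strip().split("|")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        job_id = parts[0]
        name = "|".join(parts[1:-1])  # tolerate '|' inside a job name
        state_words = parts[-1].split()
        if not state_words:
            continue
        state = state_words[0]  # "CANCELLED by 5001" -> "CANCELLED"
        if state in INTERRUPTED_STATES:
            jobs.append((job_id, name, state))
    return jobs
```

The returned list can then be checked by hand against each job's output files before anything is resubmitted.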
Savio experiencing power outage
Thursday, July 20, 2017 - 15:45
At approximately 3:45 PM on Thursday, July 20th, Savio experienced a power outage. The sysadmin team is currently working on getting the system back online.
Savio back online after power outage
Tuesday, June 13, 2017 - 16:12
Savio is now back online after the power outage. The SLURM job queue might have been flushed during the outage: running jobs might have failed, and jobs waiting in the queue might have been lost by the scheduler. Please look for your jobs and resubmit them to the queue.
Brief power outage impacting Savio
Tuesday, June 13, 2017 - 14:39
There was a brief power outage in the data center around 2:25 PM on Tuesday, June 13th. Savio nodes are currently coming back online, and users will not be able to log into Savio until the login nodes are restored.
DNS issue resolved
Thursday, May 25, 2017 - 06:33
As of early this morning, the DNS issues with hpc.brc.berkeley.edu have been fixed. You should be able to connect to it normally now. If you experience any issues, please email firstname.lastname@example.org
DNS issue - hpc.brc.berkeley.edu not available
Wednesday, May 24, 2017 - 20:37
We are currently experiencing a DNS issue with hpc.brc.berkeley.edu. We hope to restore access soon. Scheduled jobs will continue to run in the meantime, and dtn.brc.berkeley.edu is still available.
Update 10:40 PM: We are in contact with the network team and anticipate that access should be restored by tomorrow (Thursday) morning.
Savio, Cortex, and Vector back online after scheduled maintenance
Wednesday, May 17, 2017 - 15:36
Scheduled maintenance is now complete on the Savio, Cortex, and Vector clusters. If you encounter any issues, please contact us at email@example.com
Savio, Vector, Cortex currently undergoing planned maintenance
Tuesday, May 16, 2017 - 09:00
The Savio, Vector, and Cortex clusters are currently down for planned maintenance from 9 AM on Tuesday, May 16th until 5 PM on Wednesday, May 17th. During the maintenance period, we are upgrading and expanding storage, performing an OS/VNFS update, and doing a Slurm update. If you have any questions or concerns, please email firstname.lastname@example.org
Savio, Vector, Cortex planned downtime May 16-17
Tuesday, May 9, 2017 - 09:16
The BRC Supercluster infrastructure, with all its clusters (Cortex, Vector, Savio) and all associated condos, will be unavailable for scheduled maintenance for two days, Tuesday, May 16th and Wednesday, May 17th. We are planning to do some storage upgrades during this downtime. Access to the login/front-end nodes, the compute nodes of all three clusters, scheduler queues, and data on the filesystems will be blocked from 9:00 am on Tuesday, May 16th until 5:00 pm on Wednesday, May 17th.
If you are submitting jobs to any of the cluster partitions, please choose wallclock times such that jobs finish running before 9:00 am on Tuesday, May 16th; otherwise your jobs will stay in the queue until the downtime is over.