BRC/Savio Supercluster DTN not accessible
July 21, 2020
We are experiencing a network issue with the BRC/Savio Supercluster DTN as of this morning, 7/21. User attempts to ssh to dtn.brc.berkeley.edu are timing out. We are aware of the issue and are working to restore service.
Lawrence Berkeley National Lab Scheduled Power Outage
July 17, 2020
The OAuth/MyProxy server that allows users to authenticate to Globus on the BRC DTNs will go out of service tomorrow at 5pm due to power work in the building in which it is housed. The DTN itself will not be affected: users will still be able to transfer files with scp, rsync, sftp, and other clients; only the Globus service will be affected.
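As a sketch of the command-line alternatives while Globus is down (the username and destination path below are placeholders; substitute your own):

```shell
# Copy a single file to scratch via the DTN (username and path are illustrative)
scp results.tar.gz myuser@dtn.brc.berkeley.edu:/global/scratch/myuser/

# For large directory trees, rsync can resume interrupted transfers
# and skips files that are already up to date on the destination
rsync -av --partial mydata/ \
    myuser@dtn.brc.berkeley.edu:/global/scratch/myuser/mydata/
```

sftp works similarly for interactive, file-by-file transfers.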
Savio Scheduled Downtime on June 30
June 28, 2020
Dear Savio Users,
There will be a scheduled downtime on Tuesday, June 30, starting at 8:00am. We will perform maintenance work on the cluster file systems.
We have set up a scheduler reservation to drain the queues by 7:00 AM on Tuesday, June 30. If you plan to submit jobs before the downtime, please request a wall clock time that allows your jobs to finish before the downtime begins; otherwise, your jobs will wait in the queue until the work is complete. We will return the cluster to production once the work is completed.
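For instance, if the drain begins at 7:00 AM Tuesday, a job submitted at 7:00 AM Monday could request at most 24 hours of wall clock time. A minimal job-script sketch, with placeholder partition, account, and workload:

```shell
#!/bin/bash
#SBATCH --job-name=pre_downtime
#SBATCH --partition=savio2      # placeholder partition
#SBATCH --account=my_account    # placeholder account
#SBATCH --time=20:00:00         # short enough to finish before the drain

./run_analysis.sh               # placeholder workload
```

A job requesting more time than remains before the reservation will simply sit in the queue until after the downtime.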
For any concerns please contact us at firstname.lastname@example.org
Thanks for your attention,
Research IT participates in #ShutDownAcademia
June 10, 2020
On Wednesday, June 10, Research IT is recognizing #ShutDownAcademia #ShutDownSTEM #Strike4BlackLives. We will not be holding our regular office hours, and staff are out of the office to commit the day to education, action, and healing around Black Lives Matter efforts.
Please see this list of resources if you would also like to educate yourself on the injustice and take action against anti-Black racism and police brutality.
Scratch back online
April 17, 2020
This message is to inform you that the /global/scratch file system is back online and your access to it is now restored.
We apologize for the recent interruptions. Moving forward, please use /global/scratch wisely by keeping your data footprint under control. Reducing constant read/write operations over the network is another good practice that alleviates the burden on the already overloaded file system; this can be done by moving data to the compute nodes in chunks and conducting file operations locally.
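The chunked-staging pattern described above can be sketched as follows. The two temporary directories here stand in for your /global/scratch directory and the compute node's local disk (e.g. $TMPDIR); a real job would use those paths instead:

```shell
#!/bin/bash
# Demonstrates staging: copy input in one chunk, work locally, copy results back.
SCRATCH=$(mktemp -d)   # stand-in for /global/scratch/<user>
LOCAL=$(mktemp -d)     # stand-in for node-local storage (e.g. $TMPDIR)

echo "input data" > "$SCRATCH/input.txt"

# 1. Stage input to local disk in a single transfer
cp "$SCRATCH/input.txt" "$LOCAL/"

# 2. Do the read/write-heavy work against local disk,
#    not the shared network filesystem
tr 'a-z' 'A-Z' < "$LOCAL/input.txt" > "$LOCAL/output.txt"

# 3. Copy results back to scratch in one transfer
cp "$LOCAL/output.txt" "$SCRATCH/"
cat "$SCRATCH/output.txt"   # prints INPUT DATA
```

The point of the pattern is that the shared filesystem sees two bulk copies rather than a stream of small reads and writes.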
Scratch filesystem issues
April 16, 2020
We regret to report that we are again having issues with the scratch (/global/scratch) filesystem of the BRC supercluster today. Attempts to access data on the filesystem may hang, and running jobs that depend on /global/scratch may also hang while waiting for a response from the file system. We are working on the issue but do not yet have an ETA, so if your jobs are hung you may want to cancel them; otherwise they will sit on the nodes and may eventually run out of time.
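If you want to check on or clean up hung jobs, the standard Slurm commands apply (the job ID below is a placeholder):

```shell
# List your jobs and their states
squeue -u "$USER"

# Cancel a specific hung job by ID (12345 is a placeholder)
scancel 12345

# Or cancel all of your jobs at once
scancel -u "$USER"
```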
We are not going to set up a scheduler reservation and block all jobs, so if you can get your work done without using /global/scratch, please feel free to update your job scripts and submit them to the queues.
We apologize for these multiple interruptions and are working on restoring the service as soon as possible. Please write to us at email@example.com if you need additional assistance.
Scratch back online
April 11, 2020
We want to let you know that the /global/scratch file system is back online. We have released the Slurm reservations, and the scheduler is now ready to take jobs. We still plan to perform some power rebalance work in the datacenter; once it is completed, more compute nodes will be available to you.
Thank you very much for your patience during this unexpected downtime. Please contact us at firstname.lastname@example.org if you have any questions.
April 10, 2020
The /global/scratch file system started having problems yesterday morning. The file system metadata targets ran out of space, and users lost the ability to access existing files or create new files. As scratch is the primary file system from which users run the majority of their jobs, we have blocked all the cluster queues to prevent new jobs from putting further strain on the file system.
Our storage engineer has been working on restoring the file system. The file system has started responding, but the work is taking longer than anticipated. To speed up the process, we have decided to take /global/scratch offline; it is unavailable on the compute nodes. Please do not perform any writes to /global/scratch, as such operations stand a good chance of being administratively killed. You should be able to log in and access home folders and all other file systems except /global/scratch.
We expect to bring the file system back online and release the queues by Tuesday morning, 4/14. We apologize for this unexpected outage. Separately, we will use this opportunity to perform some pending power rebalance work in the datacenter to avoid a future downtime. For any concerns please contact us at email@example.com
Scratch file system issues
April 9, 2020
We have been having issues with our /global/scratch filesystem since this morning. The file system metadata targets have run out of space, and users' ability to access existing files or create new files has been failing. Our storage engineer is working on restoring the file system and running an integrity check. This process is moving slowly and taking longer than anticipated.
As scratch is the primary file system from which users run the majority of their jobs, we have blocked all the cluster queues so that no new jobs will start. We currently expect to bring the file system back online and release the queues by tomorrow morning, 4/10. We apologize for this unexpected outage.
We will use this opportunity to perform some pending power rebalance work in the datacenter to avoid future downtime. Users should be able to log in and access their home folders and any other file systems except /global/scratch.
For any other concerns please email us at firstname.lastname@example.org
Research IT Consultations Are Moving Online During Campus Response to Coronavirus
March 20, 2020
Research IT will continue to hold virtual office hours:
Wednesdays from 1:30-3:00 PM
Thursdays from 9:30-11:00 AM
Moving forward, we will hold consultations via Zoom:
Meeting ID: 504 713 509
Dial: +1 669 900 6833
We are also available via e-mail at email@example.com or firstname.lastname@example.org.
We will keep you updated as the COVID-19 situation evolves and will do all we can to ensure you receive high quality and helpful consultations during this challenging time.
Short power loss
March 18, 2020
At 8:49 AM this morning, the Earl Warren Hall Data Center experienced a brief power loss of roughly one second. You may want to check on your jobs and restart them if necessary.
Running on Limited Power
February 10, 2020
Savio continues to run on limited power. The data center team is working with the facility to add more power, but this may take many weeks to complete. Many partitions have fewer nodes available, and wait times may be longer than usual. Thank you for your patience and understanding as we get service restored and expanded.
Savio back online
January 15, 2020
The scheduled datacenter electric work and storage system upgrade are complete. Savio is back online. Happy computing!
January 13 - January 16, 2020
The Savio system will be unavailable from Monday, January 13, 2020 at 5 pm through Thursday, January 16 at 5 pm to facilitate scheduled datacenter electrical work and a storage system upgrade. The electrical work is a necessary step to prepare for the installation of the new SRDC system and to support provisioning of additional power for the Savio cluster in the coming year. We are taking advantage of the downtime to also perform a significant upgrade of the Savio Scratch filesystem, which will provide us with tools to better manage scratch storage usage.
As with any storage work, there is always a small risk of software failure, so we remind users to back up any critical data on the Savio Scratch filesystem, as it is intended for non-persistent data.
We apologize for this long downtime and the inconvenience it will cause, but the work is critically necessary. If the timing of this outage is anticipated to cause major consequences for your research, please write to us and we will do everything possible to find solutions that meet your needs.
Limited availability over curtailment
December 21, 2019 - January 1, 2020
During the UC Berkeley curtailment period, Research IT will offer only limited support for Savio and other services. Please expect delays in responses to inquiries. In-person consulting at weekly office hours will resume the week of January 6, 2020.
Savio back online
Monday, October 28, 2019 - 05:41
The emergency datacenter work scheduled for this morning is complete. Savio is back in production in a limited capacity, as we requested and received an allocation of campus co-gen power to meet critical research needs. Users who had jobs still running as of 6am this morning will want to check on and resubmit their jobs.
Savio down on Monday, 10/28 due to power shutdown
Saturday, October 26, 2019 - 09:22
Based on information from campus leadership, we are expecting that PG&E power to campus will be shut off Saturday, October 26 due to the Fire Weather Watch and likelihood of high winds. For this outage we have been fortunate to receive an allocation of co-gen power from campus facilities, such that Savio’s compute nodes will be able to stay operational and online, available for service until Monday at 6am.
At 6am Monday we will bring down all Savio compute nodes, with an expectation of service restoration by 6pm Monday. Job reservations have been put in place so that no jobs will be running in the queue after 6am on Monday. For all job submissions, please choose wall clock times such that jobs finish before 6am on Monday; otherwise they will wait in the queue until the outage ends at 6pm on Monday. As before, Savio’s login nodes and its data storage are on UPS and will remain up during the entire power outage.
Savio back online
Friday, October 11, 2019 - 15:09
Earlier this afternoon, we received word that non-UPS power had been restored to the Warren Hall datacenter, so we immediately started work to bring Savio back online. Savio is now back in full production and users can return to their work. We did notice a few jobs were still running when we brought the system down before the power outage; users should check their last running jobs and restart them if necessary.
Savio still down: update
Thursday, October 10, 2019 - 13:39
While parts of the campus are running on power from the Cogen plant, we have been asked to hold off on bringing the Savio cluster back up due to concerns about overloading the generation facility. We are therefore waiting for clearance from campus facilities before beginning to restore service.
As soon as we have a green light to bring systems back up we will send out another note with an estimated time for service availability followed by a confirmation of service availability. Thank you for your patience as we navigate this outage. All hands are on standby to restore service as soon as possible.
Savio down on 10/9 due to power shutoff
Tuesday, October 8, 2019 - 16:37
Based on information from campus leadership, we are expecting that PG&E power to campus will be shut off on Wednesday, October 9 at 8am due to the Fire Weather Watch and likelihood of high winds. We will need to shut down Savio starting at 6am.
Users with access to other computing resources may want to copy their data there as a precaution. Note that a PG&E outage at UC Berkeley will also affect LBNL computing resources.
Update on Savio Status: Back Online
Monday, September 30, 2019 - 10:24
Datacenter staff finished repairs to the transformer this morning (Monday, 9/30) and were able to switch the power source from the generator back to house power. We paused the SLURM scheduler queues around 7:00 am today to shut down all compute resources and allow the power switch, then powered all compute resources back on and released the job queues at around 12:30 pm. We thank all of our users for their patience and cooperation during this unexpected outage.
Update on Savio availability from 9/24-9/30
Tuesday, September 24, 2019 - 16:37
As some of you may have noticed, we were able to gradually bring BRC cluster resources back online on 9/23. Campus data center staff acted quickly and provided us with alternative generator power while they continued repairs on the broken transformer. Using this generator power, we were able to bring Savio and Vector resources back to the level of the beginning of this week, i.e., Monday morning, 9/23.
Data center staff estimate that the transformer will be repaired and ready by Monday morning, 9/30, at which point we will switch the power source from the generator back to house power. This will result in another half-day of downtime for BRC cluster resources on or after Monday, 9/30.
For now, the job queues have been released and many user jobs are running across all three generations of Savio nodes (savio1, savio2, savio3) and on the Vector cluster. If the generator power and cooling remain stable, we will be able to run at the current level until 6AM on Monday, 9/30.
Savio unexpectedly unavailable from 9/23-9/30
Monday, September 23, 2019 - 15:34
Due to an unexpected power system emergency in the Warren Hall data center, Savio will be shut down from the evening of Mon, 9/23 until the repairs are complete on the morning of Mon, 9/30.
Savio back online
Friday, August 16, 2019 - 13:33
Savio is fully back online as of 11:15 AM today. All services have been restored.
Update : Unscheduled downtime for BRC/Savio due to a power event in the datacenter
Thursday, August 15, 2019 - 15:23
Datacenter staff and electricians worked all day today to replace the burnt power circuit breakers with new parts, but they ran into further issues. Their work has been extended into tomorrow as they wait for the arrival of additional parts. Unfortunately, this means that none of the Savio compute nodes are yet accessible via the SLURM partitions, but we hope to make them accessible by the end of tomorrow, assuming the electricians can finish their repairs early tomorrow morning, 8/16.
We have just removed the login blocks, so users can access the login/front-end nodes, the DTN, and data on the filesystems, and we expect to keep this access open through tomorrow.
We apologize for this disruption in service, but we are doing our best to keep as many services as possible online. Email us at email@example.com if you need any further help.
Savio partially online
Tuesday, August 13, 2019 - 11:05
The BRC Savio cluster is partially online as of this morning, 8/13. Over the weekend, multiple power circuit breakers tripped in the datacenter and all Savio compute nodes lost power. As we tried to reset the breakers on Monday, we discovered that some of them are damaged and need replacement. We managed to reset the other breakers and restore power to some of the circuits in the datacenter as of Monday evening. Our engineers rerouted the power cables from the compute nodes and network switches to the working circuits and brought the Savio cluster partially online.
Datacenter staff are scheduled to take the entire cluster offline again on Thursday, 8/15, at 10:00 am to replace the broken circuit breakers.
The Savio cluster queues are now open for users to run jobs, but only a small number of nodes are available in each of the Savio partitions. We therefore ask that only users with immediate deadlines make use of the cluster today and tomorrow, and that other users refrain from keeping the queues busy. Also note that, due to the next downtime starting at 10AM on Thursday, you should request wall clock times of less than 48 hours to get your jobs running in the queues.
Savio outage 8/12/2019
Monday, August 12, 2019 - 09:53
We experienced an unexpected power event that disrupted the power supply to all compute nodes of the Savio cluster sometime yesterday, 8/11. Our engineers were in the datacenter early this morning making fixes and changes to the power layout to restore services.
Right now users can login to the cluster front end nodes and access their data in the cluster filesystems but no jobs are running in the Savio cluster queues.
Once we finish rebalancing power and bring the nodes back online, jobs will resume running as previously scheduled. CGRL's Vector cluster nodes are not impacted by this power event.
Filesystem slowdowns on BRC systems affecting various cluster activities
Thursday, August 8, 2019 - 16:52
We are currently seeing slow filesystem responsiveness, which is affecting many users' operations, such as running shell commands, starting applications, and transferring data, and is behind the recent problems launching Jupyter notebooks in the BRC Jupyterhub. BRC staff are looking into the issue.
BRC Jupyterhub service experiencing problems
Thursday, August 8, 2019 - 13:46
Users are having trouble accessing the BRC Jupyterhub service. BRC staff are looking into the problem. As a workaround for the time being, you can get access to Jupyter notebooks on the Savio visualization node by following the instructions in the RIT documentation.
Issues with adding or managing OTP tokens due to website outage
Tuesday, August 6, 2019 - 16:13
Multiple users have reported that they are unable to access the non-LBL Token Management page (https://identity.lbl.gov/otptokens/hpccluster) to generate a one-time password token. The HPCS team is working with identity management (IDM) staff to track down the authentication problems. The latest update is that IDM staff have implemented a workaround that seems to work reliably for the time being, but some users continue to be unable to access the page. BRC staff are continuing to work on this problem and we will provide another update shortly.
Update: BRC cluster login returning to normal
Tuesday, August 6, 2019 - 16:49
We believe we've resolved the login issues. Please let us know if you experience problems.
Ongoing Savio login issues
Tuesday, August 6, 2019 - 14:41
Users have been reporting problems logging in, with their password not being accepted. BRC staff are looking into this.
In the meantime, simply waiting for a minute and trying again may allow you to get access.
BRC Savio Cluster expected to be online by 5 pm August 5
Monday, August 5, 2019 - 14:22
Our original post about Savio being online first thing on the morning of August 5 was incorrect (and contrary to the email message that was sent out).
Savio downtime from August 2 - August 5
Monday, August 5, 2019 - 13:49
Savio will be shut down on Friday, Aug 2, after 5pm to accommodate electrical work in the data center and will be brought back online on Monday, Aug 5, at 5pm.
BRC Savio Cluster shutdown planned for the weekend of Aug 3
Sunday, June 9, 2019 - 21:46
BRC Savio will be shut down on Friday, Aug 2, after 5pm to accommodate electrical work in the data center and will be brought back online first thing on Monday morning, Aug 5.
BRC Savio Cluster shutdown planned for the weekend of Aug 3
Sunday, June 9, 2019 - 21:41
Campus Facilities Services will perform maintenance on the main power switchboard of Earl Warren Hall on Saturday, August 3rd, 2019, from 8:00 AM to 3:00 PM. This will cut off electricity to all non-UPS-backed equipment in the Campus Data Center, including the BRC Savio Cluster. To accommodate the electrical work, Savio is scheduled to be shut down on Friday, Aug 2, after 5pm and brought back online first thing on Monday morning, Aug 5.