Savio Cluster Scheduled Downtime
April 20, 2021
There is a planned downtime starting at 9am on Tuesday, April 20, 2021. As you may already know, we are in the process of installing a new global scratch parallel filesystem. During this scheduled downtime, we will be preparing physical space for the newly procured /global/scratch replacement.
A scheduler reservation has been put in place to ensure no jobs will be running after 9am on April 20. If you plan to submit jobs to the Savio supercluster around this time, please make sure you request a wall time that allows those jobs to finish before the downtime; otherwise your jobs will wait in the queue. We plan to bring the cluster back online before the end of the day on April 20.
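For example, a job submitted the evening before a 9am downtime can cap its wall-clock limit so it is guaranteed to finish in time. The sketch below is illustrative only; the partition, account, and script names are placeholders:

```shell
#!/bin/bash
# Illustrative Slurm job script: request a wall-clock limit that ends
# before a 9am downtime. Partition/account/script names are placeholders.
#SBATCH --job-name=pre_downtime_run
#SBATCH --partition=savio2
#SBATCH --account=my_account
#SBATCH --time=06:00:00   # e.g., submitted at 10pm, safely done by 4am

srun ./my_analysis
```

Submitting a job whose requested time limit would overlap the reservation causes it to be held in the queue rather than started.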
Please contact us if you have any questions.
No Office Hours during Spring Break
March 24-25, 2021
Research IT will not be holding its regularly scheduled office hours during spring break on Wednesday, March 24 or Thursday, March 25. During that week, please get in touch with us by e-mail: firstname.lastname@example.org
Savio back online
January 5, 2021
The Savio Cluster is back online and jobs have started running as the scheduled work at the data center is now complete. We also took this downtime opportunity to upgrade the SLURM scheduler and plugins for job management.
Research IT closed for winter curtailment
December 23, 2020 - January 4, 2021
Research IT is participating in the campus annual winter curtailment program from Wednesday, December 23 - Monday, January 4. During these dates, assume only emergency support will be available for most campus systems and IT services.
Due to this, there will be a delay in our response time until Tuesday, January 5. Thank you for your patience. As a reminder, the University's Data Center will be shut down from Jan 2-5, 2021 which will also affect our website(s). For more information on this shutdown, please visit the Status Dashboard.
Advance notice of data center shutdown
January 2-5, 2021
There will be a necessary shutdown of the data center scheduled to start Saturday, January 2 at 3pm with completion planned by Tuesday, January 5 at 5pm. This work is required to support the campus research mission by enabling new research systems and to provide improved resilience for these systems when running on generator power during campus power outages. We understand this may disrupt your ability to work, but we have purposely scheduled the outage during a period when system use is reduced, to minimize the overall impact on campus.
Limited staffing week of Thanksgiving holiday
November 23-27, 2020
Research IT will be participating in the optional campus-wide curtailment program to extend Thanksgiving break from Monday, Nov. 23 through Friday, Nov. 27, 2020. Many facilities and services will be closed or operating on modified schedules during the week of Thanksgiving. During curtailment dates, assume only emergency support will be available until the following Monday, November 30.
BRC/Savio Supercluster DTN not accessible
July 21, 2020
We are experiencing a network issue with the BRC/Savio Supercluster DTN as of this morning, 7/21. User attempts to ssh to dtn.brc.berkeley.edu are timing out. We are aware of the issue and are working to restore the service.
Lawrence Berkeley National Lab Scheduled Power Outage
July 17, 2020
The OAuth/MyProxy server that allows users to authenticate to Globus on the BRC DTNs will be going out of service tomorrow at 5pm due to power work in the building in which it is housed. The DTN itself will not be affected: users will still be able to transfer files with scp, rsync, sftp, and other clients; only the Globus service will be affected.
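While the Globus service is down, data can still be moved through the DTN directly. A sketch, with placeholder usernames and paths:

```shell
# Pull a results directory from the cluster through the DTN with rsync
# (resumable, shows progress). Username and paths are placeholders.
rsync -avP myuser@dtn.brc.berkeley.edu:/global/scratch/myuser/results/ ./results/

# Or copy a single file with scp.
scp myuser@dtn.brc.berkeley.edu:/global/scratch/myuser/output.dat .
```

rsync is the better choice for large transfers, since an interrupted copy can be resumed without re-sending completed files.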
Savio Scheduled Downtime on June 30
June 28, 2020
Dear Savio Users,
There will be a scheduled downtime on Tuesday, June 30, starting at 8:00am. We will perform maintenance work on the cluster file systems.
We have set up a scheduler reservation to drain the queues by 7:00am on Tuesday, June 30. If you plan to submit jobs before the downtime, please make sure you request enough wall-clock time for the jobs to finish before the downtime; otherwise they will wait in the queue until the work is complete. We will return the cluster to production once the work is completed.
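If in doubt, you can inspect the drain reservation and your jobs' time limits directly with Slurm's standard tools:

```shell
# Show active scheduler reservations (a maintenance reservation holds
# jobs whose requested wall time would overlap it).
scontrol show reservation

# List your jobs with their requested time limit (%l) and elapsed time (%M).
squeue -u "$USER" -o "%.10i %.9P %.20j %.10l %.10M %.8T"
```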
For any concerns please contact us at email@example.com
Thanks for your attention,
Research IT participates in #ShutDownAcademia
June 10, 2020
On Wednesday, June 10, Research IT is recognizing #ShutDownAcademia #ShutDownSTEM #Strike4BlackLives. We will not be holding our regular office hours, and staff are out of the office to commit the day to education, action, and healing around Black Lives Matter efforts.
Please see this list of resources if you would also like to educate yourself on the injustice and take action against anti-Black racism and police brutality.
Scratch back online
April 17, 2020
This message is to inform you that the /global/scratch file system is back online and your access to it is now restored.
We apologize for the multiple interruptions lately. Moving forward, please use /global/scratch wisely by keeping data sizes under control. Reducing constant read/write operations over the network is another good practice that relieves the burden on the already overloaded file system: move data in chunks to the compute nodes and conduct file operations locally.
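The chunked-staging pattern can be sketched as follows. The temporary directories below stand in for /global/scratch and a node-local disk so the example is self-contained; on the cluster you would substitute your real scratch path and the node's local temporary directory:

```shell
#!/bin/sh
# Sketch: copy inputs from shared scratch to node-local storage,
# do the read/write-heavy work locally, then copy results back in one pass.
SCRATCH=$(mktemp -d)   # stand-in for /global/scratch/$USER/project
LOCAL=$(mktemp -d)     # stand-in for a node-local temporary directory

mkdir -p "$SCRATCH/inputs"
echo "sample data" > "$SCRATCH/inputs/data.txt"

# 1. Stage inputs to local disk in one bulk copy.
cp -r "$SCRATCH/inputs" "$LOCAL/"

# 2. Run the I/O-intensive work against the local copies.
tr 'a-z' 'A-Z' < "$LOCAL/inputs/data.txt" > "$LOCAL/result.txt"

# 3. Copy the results back to scratch in a single pass.
cp "$LOCAL/result.txt" "$SCRATCH/result.txt"

cat "$SCRATCH/result.txt"
rm -rf "$LOCAL"
```

The key point is that the shared filesystem sees two bulk copies instead of a stream of small reads and writes for every intermediate operation.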
Scratch filesystem issues
April 16, 2020
We regret to say that we are having issues with the scratch (/global/scratch) filesystem of the BRC supercluster again today. Attempts to access data or files on the filesystem may hang, and any running jobs that depend on /global/scratch may also hang waiting for a response from the file system. We are working on the issue but do not yet have an ETA, so if your jobs are hung you may want to cancel them; otherwise they will sit on the nodes and may eventually run out of time.
We are not going to set up a scheduler reservation and block all jobs, so if you can get your work done without using /global/scratch, please feel free to update your job scripts and submit them to the queues.
We apologize for these multiple interruptions and are working on restoring the service as soon as possible. Please write to us at firstname.lastname@example.org if you need additional assistance.
Scratch back online
April 11, 2020
We want to let you know that /global/scratch file system is back online. We have released Slurm reservations, and the scheduler is now ready to take jobs. In the meantime, we still plan to perform some power rebalance work in the datacenter. Once it is completed, there will be more compute nodes available to you.
Thank you very much for your patience during this unexpected downtime. Please contact us at email@example.com if you have any questions.
April 10, 2020
The /global/scratch file system started having problems yesterday morning. The file system metadata targets ran out of space, and users lost the ability to access existing files or create new ones. As scratch is the primary file system from which users run the majority of their jobs, we have blocked all the cluster queues to prevent new jobs from placing further strain on the file system.
Our storage engineer has been working on restoring the file system. The file system has started responding, but the work is taking longer than anticipated. To speed up the process, we have decided to take /global/scratch offline; it is unavailable on compute nodes. Please don't perform any writes to /global/scratch, as such operations stand a good chance of being administratively killed. You should be able to log in and access home folders and any other file systems except /global/scratch.
We expect to bring the file system back online and release the queues by Tuesday 4/14 morning. We apologize for this unexpected outage. Separately, we will use this opportunity to perform some pending power rebalance work in the datacenter to avoid a future downtime. Again, we are sorry for this unexpected downtime. For any concerns please contact us at firstname.lastname@example.org
Scratch file system issues
April 9, 2020
We are having issues with our /global/scratch filesystem starting this morning. File system metadata targets have run out of space, and users' ability to access existing files or create new files has been failing. Our storage engineer is working on restoring the file system and running an integrity check on it. This process is moving slowly and taking longer than anticipated.
As scratch is the primary file system from which users run the majority of their jobs, we have blocked all the cluster queues at this time so that no new jobs will start running. Right now we expect to bring the file system back online and release the queues by tomorrow morning, 4/10. We apologize for this unexpected outage.
We will use this opportunity to perform some pending power rebalance work in the datacenter and avoid future downtime. Again we are sorry for this unexpected downtime. Users should be able to login and access their home folders and any other file systems except /global/scratch.
For any other concerns please email us at email@example.com
Research IT Consultations Are Moving Online During Campus Response to Coronavirus
March 20, 2020
Research IT will continue to hold virtual office hours:
Wednesdays from 1:30-3:00 PM
Thursdays from 9:30-11:00 AM.
Moving forward, we will hold consultations via Zoom:
Meeting ID: 504 713 509
Dial: +1 669 900 6833
We will keep you updated as the COVID-19 situation evolves and will do all we can to ensure you receive high quality and helpful consultations during this challenging time.
Short power loss
March 18, 2020
At 8:49 AM this morning, the Earl Warren Hall Data Center experienced a brief power loss of about one second. You may want to check on your jobs and restart them if necessary.
Running on Limited Power
February 10, 2020
Savio continues to run on limited power. The data center team is working with the data center facility to add more power, but this may take many weeks to complete. Many partitions have fewer nodes available, and wait times may be longer than usual. Thank you for your patience and understanding as we get service restored and expanded.
Savio back online
January 15, 2020
The scheduled datacenter electric work and storage system upgrade are complete. Savio is back online. Happy computing!
The Savio system will be unavailable starting in the evening on Monday, January 13, 2020 at 5 pm through Thursday, January 16 at 5pm to facilitate scheduled datacenter electrical work and a storage system upgrade. The electrical work is a necessary step to prepare for the installation of the new SRDC system and to support provisioning of additional power for the Savio cluster in the coming year. We are taking advantage of the downtime to also schedule a significant upgrade of the Savio Scratch filesystem. This will provide us with tools to better manage scratch storage usage.
As with any storage work there is always a very small risk of software failure, so we are also reminding users to back up any critical data on the Savio Scratch filesystem, as it is intended for non-persistent data.
We apologize for this long downtime and the inconvenience it will cause, but the work is critically necessary. If the timing of this outage is anticipated to cause major consequences for your research, please write to us; we will do everything possible to find solutions that meet your needs.
Limited availability over curtailment
Savio back online
Savio down on Monday, 10/28 due to power shutdown
Savio back online
Savio still down: update
Savio down on 10/9 due to power shutoff
Update on Savio Status: Back Online
Update on Savio availability from 9/24-9/30
Savio unexpectedly unavailable from 9/23-9/30
Due to an unexpected power system emergency in the Warren Hall data center, Savio will be shut down from the evening of Mon, 9/23 until the repairs are complete on the morning of Mon, 9/30.