Savio HPC Open OnDemand Service Fixed: Tues, 10/29
We have replaced faulty hardware in the data center that was causing problems with Open OnDemand, so the service should now be functioning normally. Please let us know if you see further problems.
Savio HPC Open OnDemand Service Issues: Tues, 10/22
Savio's Open OnDemand service is experiencing errors and slow responsiveness because of network issues in the campus data center. We are working on the issue.
Scratch file system is back! Tues, 9/17
Our admin team has worked tirelessly to restore the Savio cluster's scratch file system. Cluster functionality is now back to normal, and we are thankful for your patience and support throughout the process. We also appreciate users' efforts to clean up and move their files out of scratch. As a best practice, we urge users to keep backing up their files from scratch to permanent storage and to avoid storing massive numbers of tiny files in scratch for long periods.
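For reference, here is a minimal sketch of one way to follow that advice: bundle a scratch directory into a single archive and copy it to longer-term storage. The destination path below is a placeholder, not an actual Savio location; substitute your own group, condo, or home storage path.

```bash
# Sketch only: archive a scratch directory into one tarball and stage it on
# longer-term storage. Replace DEST with your own backup location.
SRC="/global/scratch/users/$USER/results"          # example scratch directory
DEST="/global/home/groups/your_group/backups"      # placeholder destination

mkdir -p "$DEST"
tar -czf "$DEST/results_$(date +%Y%m%d).tar.gz" \
    -C "$(dirname "$SRC")" "$(basename "$SRC")"

# Bundling many small files into a single archive before moving them off
# scratch also keeps the scratch file count low, per the notice above.
```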
Global scratch issues: Mon, 9/16
Due to extensive usage, the Savio storage system /global/scratch has run out of capacity. As a result, you might have seen "out-of-space" errors and experienced other instabilities. We are aware of these issues, and our team is working on resolving them; in the meantime, terminal commands (for example, ls) may run slowly and feel sluggish. We appreciate your patience and cooperation during this time. Please reach out to us at brc-hpc-help@berkeley.edu if you have any further questions.
Savio HPC Services Resumed: Mon, 8/12
We are excited to report that Savio HPC services have resumed. As planned, the Berkeley IT team repaired the automated transfer switch in the data center over the weekend. The data center's power is restored, and the Savio supercluster is back online. Jobs have started running, and HPC services, including Open OnDemand and Globus, are also back in service.
Savio outage (data center repair): Fri, 8/9-Mon, 8/12
The Berkeley IT team is planning to repair the automated transfer switch in the Earl Warren Hall Data Center. The work is needed to automate the power failover to generators during future power outages. We have scheduled a Savio downtime to accommodate the repair work, which calls for a full power shutdown of the data center. The downtime will start at 5:00 PM on Friday, Aug 9th, and we anticipate returning HPC services by 5:00 PM on Monday, Aug 12th. A scheduler reservation is already in place to ensure that no jobs run after 5:00 PM on Friday, August 9th. If you plan to submit jobs, please request an appropriate walltime to ensure they complete before the downtime; otherwise, your jobs will wait in the queue until the cluster is back online.
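As a quick illustration of the walltime advice above, here is a minimal job-script sketch. The account, partition, and program names are placeholders, not prescribed values; the point is simply that the #SBATCH --time request should fit within the window remaining before the reservation starts.

```bash
#!/bin/bash
# Sketch only: a Slurm batch script whose wall-time request is short enough
# for the job to finish before the downtime reservation begins.
#SBATCH --job-name=pre_downtime_run
#SBATCH --account=fc_yourproject   # placeholder allocation account
#SBATCH --partition=savio2         # placeholder partition
#SBATCH --ntasks=1
#SBATCH --time=02:00:00            # keep this under the time left before the downtime

srun ./my_program                  # placeholder executable
```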
Savio HPC Open OnDemand Service back online: Mon, 7/15
The Open OnDemand HPC service at https://ood.brc.berkeley.edu/ is back online. We appreciate your patience while we worked through some issues. The service has a few changes: we have upgraded Open OnDemand to the latest version, 3.1.7, and we have adopted CILogon for user authentication to eliminate the repetitive login problems you might have experienced. Please select the appropriate institution, typically the University of California, Berkeley, on the login page. The command-line tool email_lookup.sh can help clarify which institution you should log in with.
Savio HPC Services are back online except for OOD: Wed, 7/10
The Savio HPC system, now running the new Rocky Linux 8 OS, has been returned to service and is available to users, with the exception of the Open OnDemand (OOD) service. We need more time to configure OOD, so please do not attempt to use it at this time. Please note, however, that the Savio documentation has not yet been fully updated to reflect the changes that come with Rocky Linux 8 (e.g., changes in the software stack and software module farm, and changes in how to compile user code). Until the Savio documentation has been updated, we suggest that Savio users refer to the LBNL Science IT documentation for the Lawrencium HPC system (which is similar to, though not exactly the same as, Savio) as a temporary guide to the changes introduced by the Rocky Linux 8 upgrade: https://scienceit-docs.lbl.gov/hpc/rocky8-migration/ , https://scienceit-docs.lbl.gov/hpc/software/software-module-farm/ , and https://scienceit-docs.lbl.gov/hpc/software/module-management/ .
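For users checking what changed in the software stack, the standard environment-modules commands are the quickest way to see what is installed on the new OS. The module names below are illustrative only; the actual modules available on the Rocky Linux 8 stack may differ.

```bash
# Sketch only: inspect and load software from the rebuilt module farm.
module avail                 # list modules available on the new stack
module spider gcc            # search for a package and its versions (if Lmod is in use)
module load gcc openmpi      # load a compiler and MPI stack before recompiling code
module list                  # confirm what is currently loaded
```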
Savio Downtime: Fri, 7/5 - Wed, 7/10
As you know, we have been working on upgrading the Savio operating system to Rocky 8. To complete the OS upgrade, we coordinated with the campus data center group to combine our work with the long-awaited power work needed at the data center. The joint downtime will start at 5 PM on Friday, 7/5, and we anticipate returning services by the end of Wednesday, 7/10. A scheduler reservation is in place to ensure no jobs run after 5 PM on Friday, 7/5. If you plan to submit jobs, please request an appropriate wall time to ensure job completion before the downtime; otherwise, your jobs will wait in the queue until Savio is back online.
Savio Scratch File System is Back up and Running: Mon., 7/1
The Savio /global/scratch parallel file system is back up and running and usable again.
Savio Scratch File System is Down: Mon, 6/24
The /global/scratch parallel file system started having access problems on Friday, 6/21. The investigation is underway. We apologize for any inconvenience this may cause and will keep you posted about the investigation's status.
No office hours on Thursday, 7/4
Research IT will not be holding office hours on Thursday, 7/4 as it is a University holiday. We will resume the following week per our regular schedule.
No office hours on Wednesday, 6/19
Research IT will not be holding office hours on Wednesday, 6/19 as Juneteenth is a University holiday. We will resume the following day on Thursday, 6/20.
Savio Downtime: Friday, May 31, 2024
We have scheduled a Savio downtime starting at 8 AM on Friday, May 31, 2024 to install the long-awaited uninterruptible power supply (UPS) and power distribution units (PDUs) in the data center at Earl Warren Hall. HPC services will be unavailable during the downtime. We plan to bring the Savio cluster back online before the end of the day on May 31, 2024. A scheduler reservation is already in place to ensure no jobs run after 8:00 AM on Friday, May 31st. If you plan to submit jobs, please request an appropriate wall time so they complete before the downtime; otherwise, your jobs will wait in the queue until the cluster is back online.
No office hours on Wed, 5/8 - Thurs, 5/9
Research IT will not be holding our regular office hours on Wednesday, 5/8 or Thursday, 5/9. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.
Savio HPC services are back online: Tues, 3/26
The Savio HPC services are back online. We are also taking this opportunity to share some news: the Savio supercluster currently uses the CentOS 7 Linux operating system, which reaches end of life on June 30, 2024. We therefore plan to migrate to a new operating system, Rocky 8, to keep the system functional and secure. More information on this is forthcoming.
Scheduled Savio downtime: Mon, March 25
We have scheduled a Savio downtime to accommodate power maintenance work. The downtime will start at 5:00 PM on Monday, March 25th. All HPC hardware will be powered down, so HPC services will be unavailable during the downtime. We expect to bring the cluster back online before 5 PM on Wednesday, March 27th. A scheduler reservation will be in place to ensure no jobs run after 5:00 PM on Monday, March 25th. If you plan to submit jobs, please request an appropriate wall time so they can complete before the downtime; otherwise, your jobs will wait in the queue until the cluster is back online.
Savio Lustre File System Returned to Service: Fri, 3/1
The Savio Lustre Parallel File system has been returned to service, which means that the Savio Scratch file system is accessible once again.
Savio Lustre File System is Down: Wed, 2/28
The Lustre parallel file system (i.e., the Savio scratch file system) is down, likely due to a cabling problem. An investigation is already underway, and we will let you know of any progress we make. A reservation is in place to prevent any new jobs from starting. We appreciate your patience.
Savio Downtime Postponed: Fri, 2/16
We are postponing the Savio downtime that was scheduled for Feb 20. Following the two back-to-back unplanned power outages on campus, the data center has shifted its focus to installing new UPS systems. Once a future downtime date is determined, we will send out a notification with sufficient lead time for you to prepare.
Savio Cluster Returned to Service: Wed, 2/14
Power and systems have been restored at the data center (Earl Warren Hall) and the Savio supercluster is back online and accessible.
Data Center Power Outage: Wed, 2/14
We have been notified that there is a power outage in the data center. Without power, the Savio HPC services are down. We will let you know when there is any new information and when to expect our services to be available.
Savio Cluster Returned to Service: Fri, 2/9
Power and systems have been restored at the data center (Earl Warren Hall) and the Savio supercluster is back online and accessible. Jobs have started running, and Open OnDemand and Globus are also back in service.
Data Center Power Status: Thursday, 2/8
The Savio cluster is offline to accommodate the scheduled power work at the data center. However, the power generator test failed, and the data center lost power. Work is underway to restore power. Without power, the return of the Savio HPC services will be delayed. We will let you know if there is any new information.
Savio Downtime: Feb 8 & Feb 20
We apologize for the short notice: there will be a power outage in Earl Warren Hall on Thursday, 2/8, resulting in a Savio downtime starting at 10 AM that day. We will bring the Savio cluster back online the same day once the power work is complete. A scheduler reservation is in place to ensure that no jobs run after 10 AM on Thursday, 2/8. If you plan to submit jobs, please request an appropriate walltime so they complete before the downtime; otherwise, your jobs will wait in the queue until the cluster is back online. The downtime scheduled for Feb 20th is still in place.
Issue with Savio SLURM email notifications has been resolved: Wed, 1/31
The issue that was preventing Savio users from receiving SLURM email notifications when running jobs on Savio has now been resolved and Savio users will again be receiving such emails as normal.
New Savio user account creation delay issues resolved: Wed, 1/24
We now have a workaround for the OTP linking-email problems, and the processing of new user account requests has resumed.
Delay with new Savio user account creations: Fri, 1/19
There has been a delay in processing new Savio user account creations because the OTP (One-Time Password) service is down. UC Berkeley has recently enforced additional configuration requirements for email servers, and the OTP service requires extra work due to its unique setup. We apologize for any inconvenience this might have caused and thank you for your patience.
Savio DTN and Globus issues resolved: Wed, 1/10
The Savio DTN and Globus service are available and accessible again after a failed hardware component was replaced.
Savio DTN and Globus not accessible: Tues, 1/9
The Savio Data Transfer Node (DTN) and Globus are currently not accessible. We are investigating the root cause and will update as soon as we know more.
No office hours 12/20 - 1/5
Research IT will not be hosting office hours during the Winter break starting on Wednesday, 12/20 and will resume the second week of January, on Wednesday, 1/10. In the meantime, we are available by email, so please get in touch with us at research-it@berkeley.edu.
Open OnDemand issues resolved: Wed, 11/22
The launching issues with remote desktop and Matlab in OOD have been resolved.
Update on Savio: Tues, 11/21
We have completed the Slurm upgrade and restored the scratch parallel file system. The scheduler reservation has been removed, and jobs have started running. Other HPC services such as Open OnDemand and Globus are also back online. However, remote desktop and Matlab within Open OnDemand are having launch issues, and we are looking into this.
No office hours on Wed, 11/22 + Thurs, 11/23
Research IT will not be holding office hours on Wed, 11/22 + Thurs, 11/23 due to the holiday break! Please get in touch with us at research-it-consulting@lists.berkeley.edu with any questions.
Update on Savio: Mon, 11/20
We have upgraded Slurm and are working through a few remaining configuration issues. For the scratch file system, our system engineers are working on solutions with the DDN team. We estimate that we will need another day to get the scratch file system back online.
Savio Scratch File System is Still Down: Sun, 11/19
The scratch parallel file system is still down. Our system engineers have been working on solutions over the weekend, and again, we are very sorry for any inconvenience this may cause. As mentioned yesterday, we aim to restore the file system tomorrow during the downtime for the Slurm upgrade. Please reach us at brc-hpc-help@berkeley.edu if you have any questions.
Savio Downtime: Monday, Nov 20
We have scheduled the delayed Slurm upgrade to address some security concerns. The one-day downtime will start at 8 AM on Monday, Nov 20, and we expect to return HPC services by the end of the day.
Savio Scratch File System is Down: Fri, 11/17
As you may know, the scratch parallel file system began having access problems this morning. We have contacted the vendor to get it fixed as soon as possible, and we are very sorry for any inconvenience this may cause. As a reminder, we will proceed with the planned downtime on Monday to upgrade Slurm; once the Slurm work is complete, we will make every effort to restore the file system.
Savio is Back in Service: Tues, 10/31
The Savio supercluster is back online. We have removed the reservation that was put in place to prevent jobs from running while the file system was down. The home file system is currently operating on one controller only; we will likely schedule a downtime later to facilitate repair of the second controller. We appreciate your patience over the past week and apologize for any inconvenience this might have caused.
Update on Savio: Thurs, 10/26
The Dell Compellent support team has scheduled the replacement of the failed parts for tomorrow. In the meantime, Savio remains inaccessible. We understand your frustration and appreciate your patience.
Update on Savio: Tues, 10/24
We have identified the problem with the file system and are waiting for the replacement parts to arrive in a day or two. We will get the file system up and running as soon as we receive the parts. Please stay tuned. Again, we appreciate your patience. Please reach us at brc-hpc-help@berkeley.edu if you have any questions.
Savio down: Sat, 10/21
Savio is currently inaccessible. We are looking into this and will keep you posted on any progress. We appreciate your patience. Please reach us at brc-hpc-help@berkeley.edu if you have any questions.
Savio File System Back in Service: Mon, 10/16
The parallel file system at /global/scratch is back online. We are sorry for any inconvenience that the unresponsive file system might have caused. Thank you very much for your patience and please let us know if you encounter any problems.
Savio Scratch File System is Down: Mon, 10/16
We apologize that the parallel file system at /global/scratch is not responsive. This problem impacts job submissions and Open OnDemand if you need data access from /global/scratch. We are investigating how to fix it and will keep you posted. Please reach us at brc-hpc-help@berkeley.edu if you have any questions.
Job submission issues resolved: Thurs, 09/21
Job submission problems have been resolved for both interactive and batch jobs.
Issues with interactive job submissions: Wed, 09/20/23
Problems on the compute nodes have been preventing job submissions since yesterday (09/19/23). We are working to resolve them.
Issues with Scratch: Thurs, 8/31/23
We are experiencing some difficulties with the scratch file system because a few users have a very large number of files there. We're in the process of addressing this, but in the meantime, you may see "no space on the device" errors when using scratch.
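If you want to check whether your own directory is contributing to the file-count pressure described above, the standard shell tools below will report how many files you have and how much space they take. The users/ sub-path reflects the current scratch layout noted in earlier announcements; adjust it if your path differs.

```bash
# Sketch only: report the file count and total size of your scratch directory.
find /global/scratch/users/$USER -type f | wc -l    # number of files you own on scratch
du -sh /global/scratch/users/$USER                  # total space used (may be slow on a busy file system)
```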
Issues with Scratch: Thurs, 8/31/23
We are experiencing some difficulties with the scratch file system because a few users have a very large number of files there. We're in the process of addressing this, but in the meantime, you may see "no space on the device" errors when using scratch.
Known remote I/O issue: Mon, June 5
We are aware of the current remote I/O issue on login nodes 2 and 3 and are working to resolve it this evening by rebooting the login nodes.
Savio Downtime rescheduled: Thurs 5/18, 8am-5pm
The cluster downtime has been rescheduled to 8:00 AM on Thursday, May 18th. We plan to upgrade the global scratch parallel file system in preparation for implementing the purge policy on /global/scratch. We estimate the downtime will last only a few hours, with the cluster returning to service by 5 PM.
No office hours during finals week, 5/8 - 5/12
We are not offering office hours during finals week, 5/8 - 5/12. Please get in touch with us at research-it-consulting@berkeley.edu in the meantime. We will resume normal hours next week.
Compilation issue resolved: Tues, 4/11
The code compilation issue is resolved. Thank you for your patience.
Savio compilation issues: Mon, 4/10
Savio users may experience code compilation issues in which a header file cannot be found. We have identified the problem and will patch the system soon. Thank you for your patience.
Savio downtime rescheduled: 4/4 - 4/5
The Savio cluster downtime has been rescheduled to start at 8:00 AM on Tuesday, April 4th for a system upgrade. The upgrade work will include Slurm, the master node, the MyBRC portal, and storage. The downtime is estimated to last two days; we plan to bring the cluster back to service by 5 PM on Wednesday the 5th once the work is complete.
No office hours Wed, 3/29 + Thurs, 3/30
Research IT will not be holding our regular office hours during the week of Spring Break, 3/27 - 3/31. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.
Savio login issues: Mon, 3/13
Savio users may still experience login failures related to the maintenance work performed last week. The IDM team, which provides the authentication service, is conducting an investigation. We will keep you posted on any progress. Thank you for your patience.
Savio downtime postponed: Thurs, 3/9
The scheduled downtime is temporarily postponed. We will make an announcement once a future downtime is confirmed.
Office Hours to resume on Wed, 1/18
Our weekly office hours are closed during the winter break and will resume on Wednesday, 1/18. We will keep our regular hours on Wednesday afternoons from 1:30-3pm and Thursday mornings from 9:30-11am. We are available via email to help in the meantime.
Limited holiday availability: 12/23 - 1/2
Research IT will be participating in the campus-wide curtailment program from Friday, 12/23 through Monday, 1/2. Only emergency support will be available for our systems and services, as most of our staff are out of the office. Office hours will resume the week of 1/17.
Data transfer node unstable: Mon, 12/19
The ongoing intermittent network problems have made the DTNs (Data Transfer Nodes) unreachable at times. As a consequence, Globus and secure file transfer tools such as scp are also affected. Our team is working on stabilizing these services and will update you with any progress we make.
Globus endpoint ucb#brc unreachable: Wed, 12/7
Globus endpoint ucb#brc has been unreachable for the past few days due to a network issue. Our team is working on bringing services back and will update you once it is back in production. As a backup option for data transfer, you could use the SCP/SFTP/rsync command line tools or FileZilla. See the email sent to Savio users for further information about these options.
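To make the fallback concrete, here is a hedged example of the command-line route. The username and paths are placeholders, and dtn01.brc.berkeley.edu is shown only because it is the transfer host named elsewhere in these notices; use whichever login or transfer host currently works for you.

```bash
# Sketch only: transfer data without Globus using rsync or scp.
# Replace myuser and the paths with your own.
rsync -avP ./results/ myuser@dtn01.brc.berkeley.edu:/global/scratch/users/myuser/results/

# Or, for a single file:
scp bigfile.tar.gz myuser@dtn01.brc.berkeley.edu:/global/scratch/users/myuser/
```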
No office hours on Wed, 11/23 or Thurs, 11/24
Due to the Thanksgiving holiday, we will not be holding office hours on Wed, 11/23 or Thurs, 11/24. Thursday and Friday are university holidays so we will be limited in our responses until we resume service on Monday, 11/28.
Savio Power Outage Status: Wed, 11/16
Savio partially lost power twice within the last week due to excess power consumption by some of the Savio4 nodes. As a precautionary measure, we have intentionally kept some nodes offline to keep power consumption under control: savio4_htc (28 nodes), savio2_htc (9 nodes), savio2_gpu (9 nodes), and savio (36 nodes). This may affect job scheduling in these partitions. Open OnDemand and Globus are back to normal. Please refer to the "Savio Power Outage Status" email sent on 11/16 for more details.
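If you want to see how the reduced node counts are affecting a partition before submitting, the standard Slurm query commands below show node states and your own pending jobs. The output format string is just one example of how to summarize the information.

```bash
# Sketch only: summarize node availability in the affected partitions.
sinfo -p savio4_htc,savio2_htc,savio2_gpu,savio -o "%P %a %D %t"
# %P = partition, %a = availability, %D = node count, %t = node state

# Check whether your own jobs are pending as a result:
squeue -u $USER
```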
Savio Power is restored: Fri, 11/11
Power has been restored to the Savio cluster, and all Slurm partitions are ready to take jobs. Open OnDemand and Globus have been tested and are functioning well. However, we are intentionally keeping some nodes offline for the time being and will gradually bring them back online to ensure that power consumption stays under control.
Savio Partial Power Outage: Fri, 11/11
Following the unexpected power outage in the data center on Saturday, the power breakers tripped again last night. As a consequence, we lost power to a few Savio HPC racks, and portions of the Savio partitions are currently offline. We are working with campus teams to restore power and bring the HPC services back as soon as possible.
Open OnDemand issues resolved: Tues, 11/8
The interactive node was down, and we are happy to report that it is now working again, including Jupyter notebooks, VS Code, and RStudio.
Savio Outage Resolved: Mon, 11/7
On Saturday afternoon we lost power to approximately five of the Savio HPC racks due to a tripped breaker in the data center, which left a large portion of the Savio 1, 2, and 4 partitions offline. We are working on near- and long-term mitigation strategies. Please contact us with feedback about any adverse impact you may have experienced.
MyBRC back online: Thurs, 9/20
MyBRC is back online. The outage was caused by a hardware failure, which also led to slow responsiveness in Slurm. We will modify the configuration and code to lessen the impact on Slurm should this happen again.
MyBRC portal offline: Tues, 9/20
We are aware that MyBRC has been offline since last night and are actively working to bring it back online. Thank you for your patience.
Intermittent login failures resolved: Tues, 8/30
One of the login nodes was blocked on 8/17 by the Berkeley Lab security team, which caused user login failures. Login has now been restored, and we are taking the necessary actions to prevent similar triggers from happening again.
Savio job emails working again: Wed, 8/10
Savio job emails should be working again. Please contact us if you are still having any issues with this.
Issue with job emails: Thurs, 7/28
/home & /clusterfs are working: Wed, 5/18
/home and /clusterfs are now back to normal. Thank you for your patience while we resolved this issue.
clusterfs degraded performance: Tues, 5/17
We are aware of the degraded performance of the home directory and condo storage under /clusterfs and are working to fix it by the end of today.
Working on Viz Node Issues: Wed, 05/11
The Viz node has been experiencing issues since our scheduled downtime last week. This issue also affects Matlab OOD access. We are working to identify and correct this issue, and will keep you updated as this process continues.
Data Transfer Node + Globus Back Online: Tues, 4/26
The DMZ routes on campus are back, our data transfer node is back online, and the Globus service has resumed.
Data Transfer Node Down: Mon, 4/25
The data transfer node is offline again, likely because the DMZ route is down on campus. We will send an update when it is back online.
Scheduled Savio Downtime: Mon, 5/2
We will power down Savio between 8 AM and 5 PM on Monday, 5/2 to apply vendor patches to the SLURM scheduling system. If you submit jobs before the downtime, please request an appropriate walltime so they complete in time; otherwise they will wait in the queue until the cluster is back online.
No office hours on Wed, 3/23 - Thurs, 3/24
Research IT will not be holding our regular office hours during the week of Spring Break, 3/21 - 3/25. Please get in touch with us at research-it@berkeley.edu for any help in the meantime.
Savio Cluster is back in service: Fri, 1/28
We have solutions in place for the /global/scratch file system, so the Savio and Vector clusters are back in service. Please submit a ticket at brc-hpc-help@berkeley.edu if you have any questions or experience continued issues. Thank you for your patience, and happy computing!
Working on Scratch Instability: Thurs, 01/27
The global scratch file system on Savio has not been stable since last night. Access to certain folders/files may be sporadic or hanging, and some file operations might give I/O errors. We’re doing emergency maintenance and will keep you updated as we resolve it.
Savio Scratch File System is Back Online: Tues, 1/25
The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.
Working on Scratch Instability: Tues, 01/25
The global scratch file system on Savio has not been stable since about noon today. Access to certain folders/files may be sporadic or hang, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we resolve it.
No office hours on 1/5 or 1/6
Research IT is participating in a "soft" curtailment the week of Monday, Jan. 3 through Friday, Jan. 7, 2022. We will not be holding office hours during this week.
Closed for curtailment: Thurs, 12/23 - Mon, 01/03
Research IT will be participating in the campus-wide curtailment program from Thursday, Dec. 23, 2021 through Monday, Jan. 3, 2022. Many facilities and services will be closed or operating on modified schedules during this time.
Data Transfer Node is back online, Thurs, 12/2
We are glad to inform you that the DTN, the designated Data Transfer Node, is back online. We apologize for any inconvenience this caused.
Data Transfer Node (DTN) is down: Thurs, 12/2
The Data Transfer Node (DTN) is currently down. We plan to go on site around noon to get it fixed and then will post an update.
No office hours on 11/23 or 11/24
Research IT is participating in the campus-wide curtailment during the week of Thanksgiving from Monday, Nov. 22 through Friday, Nov. 26, 2021. We will not be holding office hours this week.
Data Transfer Node is back in service: Tues, 10/26
We are glad to inform you that the designated Data Transfer Node, dtn00.brc.berkeley.edu, is back online. We apologize for any inconvenience over the past few days.
Data Transfer Node is down: Mon, 10/25
While we work to restore the service, you can log in to dtn01.brc.berkeley.edu to transfer data for now. Please contact us with any questions.
Open OnDemand access via eduroam is working: Thurs, 10/21
Users should no longer experience the timeout issues that had been occurring in previous weeks.
New condo storage pricing announced
Savio3_gpu partition open to FCA users
Open OnDemand downtime: Fri, 10/15 at 10am
The campus eduroam network currently cannot route to OOD. Please use AirBears2, the full-tunnel VPN, or a wired Ethernet connection. In order to focus our support on Open OnDemand, the JupyterHub server is officially being taken offline. We plan to take a short downtime of approximately 30 minutes to complete the transition, starting at 10:00 AM on Friday, October 15.
Savio Scratch File System is Back Online: Thurs, 9/30
The global scratch file system is back to service. Jobs have started running on Savio. Thank you very much for your understanding and patience while we were restoring the service.
Working on scratch instability: Tues, 9/28
The global scratch file system on Savio has not been stable since this morning. Access to certain folders/files may be sporadic or hang, and some file operations might return I/O errors. We are working on this issue and will keep you updated as we work out solutions.
Savio back online: Mon, 9/20, 10:15am
We have resolved the issue on the scratch parallel file system. The work is complete and jobs have started running on Savio. Thank you for your patience.
Savio scheduled downtime: Mon, 9/20, 8am
A small number of users on the new scratch file system have been impacted by a file system bug that prevents the creation of new files. We plan a four-hour downtime to resolve this issue.
Savio scheduled downtime: Fri, 8/27, 9-11am
In order to make a minor change to the current structure of the /global/scratch file system, we are scheduling a brief downtime to relocate all user directories from /global/scratch/[username] to /global/scratch/users/[username].
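For anyone with the old path hard-coded in job scripts or workflows, a simple in-place substitution like the one below updates references to the new location once the relocation completes. This is a sketch only: the script name is a placeholder, and making a backup copy first is prudent.

```bash
# Sketch only: update a hard-coded scratch path in a job script after the move.
cp myjob.sh myjob.sh.bak   # keep a backup copy (myjob.sh is a placeholder name)
sed -i "s|/global/scratch/$USER|/global/scratch/users/$USER|g" myjob.sh
```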
Savio back online: Thurs, 8/12, 5pm
Savio is now back online, and we’re pleased to announce the availability of the new /global/scratch file system, which should alleviate the space shortage issues. Please migrate any critical data to the new system.
Scheduled Savio downtime: Thurs, 8/12, 9am
We are excited to announce an upcoming downtime on Thursday, August 12 at 9am to complete the roll-out of the new /global/scratch file system which offers a significant upgrade in capability. We ask that you begin migrating any critical data to the new file system and leave any unneeded data behind.
Savio scheduled downtime: Tues, 7/20, 9am
We need to stop scheduling new SLURM jobs for a short period of time to migrate the backend database and implement support for allocation management in the new MyBRC user portal. Jobs will resume running when the scheduled work is complete.
Savio back online: Tues, 4/20, 3:30pm
The Savio cluster is back online as planned, and jobs have started running. Please contact us via email or at our drop in office hours if you have any questions.
Savio scheduled downtime: Tues, 4/20, 9am
There is a planned downtime starting at 9am until the end of the day on Tuesday, April 20, 2021 to prepare space for the new global scratch parallel filesystem.