Condo Cluster Service

Summary

BRC manages Savio, the high-performance computational cluster for research computing. Designed as a turnkey computing resource, it features flexible usage and business models, and professional system administration. Unlike traditional clusters, Savio is a collaborative system wherein the majority of nodes are purchased and shared by the cluster users, known as condo owners.

The model for sustaining computing resources is premised on faculty and principal investigators (PIs) purchasing compute nodes (individual servers) from their grants or other available funds; these nodes are then added to the cluster. This allows PI-owned nodes to take advantage of the high-speed InfiniBand interconnect and high-performance Lustre parallel filesystem storage associated with Savio. Operating costs for managing and housing PI-owned compute nodes are waived in exchange for letting other users make use of any idle compute cycles on the PI-owned nodes. PI owners have priority access to computing resources equivalent to those purchased with their funds, but can access more nodes for their research if needed. This gives the PI much greater flexibility than owning a standalone cluster.

Program Details

Compute node equipment is purchased and maintained on a 5-year lifecycle. PIs owning the nodes will be notified during year 4 that the nodes must be upgraded before the end of year 5. If the hardware is not upgraded by the end of year 5, the PI may either donate the equipment to Savio or take possession of it (removing the equipment from Savio and transferring it to another location is at the PI's expense). Nodes left in the cluster after five years may be removed and disposed of at the discretion of the BRC program manager.
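
The lifecycle timeline above can be sketched as simple date arithmetic. This is only an illustration: the helper names are hypothetical, and it assumes the 5-year clock starts on the date a node is placed in service, which the policy text does not state explicitly.

```python
from datetime import date

LIFECYCLE_YEARS = 5  # lifecycle length stated in the policy
NOTICE_YEAR = 4      # owners are notified during year 4


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def lifecycle_dates(in_service: date) -> dict:
    """Key dates for a node, assuming the clock starts at `in_service`."""
    return {
        # Year 4 runs from the 3rd to the 4th anniversary of the start date.
        "notice_window_start": add_years(in_service, NOTICE_YEAR - 1),
        "notice_window_end": add_years(in_service, NOTICE_YEAR),
        "end_of_lifecycle": add_years(in_service, LIFECYCLE_YEARS),
    }


if __name__ == "__main__":
    for label, when in lifecycle_dates(date(2021, 3, 15)).items():
        print(f"{label}: {when.isoformat()}")
```

Under these assumptions, a node placed in service on 15 March 2021 would have its notification window between 15 March 2024 and 15 March 2025, and would reach the end of its lifecycle on 15 March 2026.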

Once a PI has decided to participate, the PI or their designate works with the HPC Services manager and IST teams to procure the desired number of compute nodes and allocate the needed storage. There is a 4-node minimum buy-in for any given compute pool, and all 4 nodes must be of the same type (Standard, HTC, Bigmem, or GPU). Because GPU nodes are the most expensive, a group that has already purchased the 4-node minimum of another node type can purchase single GPU nodes and add them to its Condo. Generally, procurement takes about three months from start to finish. In the interim, a test condo queue with a small allocation will be set up for the PI's users in anticipation of acquiring the new equipment. Users may also submit jobs to the general queues on the cluster using their Faculty Computing Allowance. Such jobs are subject to general queue limitations, and guaranteed access to contributed cores is not provided until the purchased nodes are provisioned.

All group members have equal access to the condo resources, via a condo-specific Slurm QoS (the 'floating reservation' described below). The expectation is that the research group will collectively manage use of the resources by individual members.

Hardware Information

Warranty

Each system has a warranty of 5 years.

Basic specifications for the systems are listed below:

General Computing Node (256 GB RAM)

	General Computing Node (256 GB RAM)
Processors	Dual-socket, 28-core, 2.1 GHz Intel Xeon Gold 6330 processors (56 cores/node)
Memory	256 GB (16 x 16 GB) 2666 MHz DDR4 RDIMMs
Interconnect	100 Gb/s Mellanox ConnectX-6 HDR-100 InfiniBand interconnect
Hard Drive	1.92 TB NVMe SSD drive (Local swap and log files)
Notes	These come in sets of 4, and the minimum buy-in is 4 nodes
Current Approximate Price (with tax)	~$42,500 for a Dell C6400 chassis with 4 nodes + 4 EDR 2 m cables

Big Memory or HTC Computing Node (512 GB RAM)

	Big Memory or HTC Computing Node (512 GB RAM)
Processors	Dual-socket, 28-core, 2.1 GHz Intel Xeon Gold 6330 processors (56 cores/node)
Memory	512 GB (16 x 32 GB) 2666 MHz DDR4 RDIMMs
Interconnect	100 Gb/s Mellanox ConnectX-5 EDR InfiniBand interconnect
Hard Drive	1.92 TB NVMe SSD drive (Local swap and log files)
Notes	These come in sets of 4, and the minimum buy-in is 4 nodes
Current Approximate Price (with tax)	~$50,000 for a Dell C6400 chassis with 4 nodes + 4 2 m cables

Very Large Memory Computing Node (1.5 TB RAM)

	Very Large Memory Computing Node (1.5 TB RAM)
Processors	Dual-socket, 26-core, 2.1 GHz Intel Xeon Gold 6230 processors (52 cores/node)
Memory	1.5 TB (24 x 64 GB) DDR4
Interconnect	100 Gb/s Mellanox ConnectX-5 EDR InfiniBand interconnect
Hard Drive	1 TB HDD 7.2K RPM (Local swap and log files)
Notes	These can be purchased one by one, but the minimum buy-in is 2 nodes
Current Approximate Price (with tax)	$16,500 per node + $100 for one EDR 2 m cable

GPU Computing Node

	GPU Computing Node (A40)
Processors	1 AMD EPYC 7302P processor, 3 GHz (16 cores/node)
Memory (CPU)	256 GB (8 x 32 GB) 3200 MHz DDR4 ECC RDIMMs
Interconnect	100 Gb/s Mellanox ConnectX-5 EDR InfiniBand interconnect
GPU	2 Nvidia A40 accelerator boards with 48 GB GPU memory each
Hard Drive	960 GB SSD drive (Local swap and log files)
Notes	These can be purchased one by one, and the minimum buy-in is one node
Current Approximate Price (with tax)	$19,100 for a single node
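
For rough budgeting, the approximate prices and minimum buy-ins in the tables above can be combined into a small estimator. This is a sketch only: the catalog keys and the function name are hypothetical, the figures are copied from the tables and may be out of date, and actual quotes should come from BRC.

```python
# Approximate prices (USD, with tax) and minimum buy-ins copied from the tables.
# "unit_nodes" is how many nodes one purchasable unit contains.
CATALOG = {
    # Dell C6400 chassis with 4 nodes, cables included in the listed price
    "general_256gb": {"unit_nodes": 4, "unit_price": 42_500, "min_nodes": 4},
    "bigmem_htc_512gb": {"unit_nodes": 4, "unit_price": 50_000, "min_nodes": 4},
    # $16,500 per node plus $100 for one cable
    "very_large_mem_1_5tb": {"unit_nodes": 1, "unit_price": 16_600, "min_nodes": 2},
    "gpu_a40": {"unit_nodes": 1, "unit_price": 19_100, "min_nodes": 1},
}


def estimate_cost(node_type: str, nodes: int) -> int:
    """Return an approximate purchase cost in USD for `nodes` nodes of one type."""
    spec = CATALOG[node_type]
    if nodes < spec["min_nodes"]:
        raise ValueError(f"{node_type}: minimum buy-in is {spec['min_nodes']} nodes")
    if nodes % spec["unit_nodes"] != 0:
        raise ValueError(f"{node_type}: sold in sets of {spec['unit_nodes']} nodes")
    return (nodes // spec["unit_nodes"]) * spec["unit_price"]


if __name__ == "__main__":
    print(estimate_cost("general_256gb", 8))         # two 4-node chassis
    print(estimate_cost("very_large_mem_1_5tb", 2))  # two nodes with cables
```

The estimator rejects orders below the minimum buy-in or not in whole sets, mirroring the Notes rows of each table.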

Hardware Purchasing

Prospective condo owners should contact us for current pricing before purchasing any equipment, to ensure compatibility. If you are interested in other hardware configurations (e.g., HTC/Serial nodes), please contact us. BRC will assist with entering a compute node purchase requisition on behalf of UC Berkeley faculty.

Software

Prospective Condo owners should review the System Software section of the System Overview page to confirm that their applications are compatible with Savio's operating system, job scheduler and operating environment.

Storage

All institutional and condo users have a 10 GB home directory with backups. In addition, each research group is eligible to receive up to 200 GB of shared project space (30 GB for Faculty Computing Allowance accounts and 200 GB for Condo accounts) to hold research-specific application software that is shared among the users of the group. All users have access to the Savio high-performance scratch filesystem for non-persistent data. Users or projects needing more space for persistent data can also purchase additional performance-tier storage from IST at the current rate. For even larger storage needs, Condo partners may also take advantage of the Condo Storage service, which provides low-cost storage for very large data needs (minimum 25 TB).

Network

One Mellanox InfiniBand 36-port unmanaged leaf switch is used for every 24 compute nodes.
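
The switch-to-node ratio above implies a simple count of leaf switches for a given number of nodes. The sketch below is illustrative only; the function names are hypothetical, and the text does not say how the remaining ports on each switch are used.

```python
NODES_PER_LEAF = 24  # compute nodes attached to each leaf switch (from the text)
PORTS_PER_LEAF = 36  # ports on each leaf switch (from the text)


def leaf_switches_needed(nodes: int) -> int:
    """Number of leaf switches required to attach `nodes` compute nodes."""
    if nodes < 0:
        raise ValueError("node count cannot be negative")
    # Integer ceiling division: a partially filled switch still counts as one.
    return -(-nodes // NODES_PER_LEAF)


def ports_not_used_for_nodes() -> int:
    """Ports per leaf switch left over after attaching the compute nodes."""
    return PORTS_PER_LEAF - NODES_PER_LEAF


if __name__ == "__main__":
    print(leaf_switches_needed(100))
    print(ports_not_used_for_nodes())
```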

Job scheduling

We will set up a floating reservation equivalent to the number of nodes that you contribute to the Condo, which gives you and your users priority access. You can determine the run time limits for your reservation. If you are not using your reservation, other users will be allowed to run jobs on the unused nodes. If you submit a job when all nodes are busy, your job will be given priority over all other waiting jobs, but it will still have to wait until nodes become free. We do not use pre-emptive scheduling, in which running jobs are killed in order to give immediate access to priority jobs.

Note that the configuration above means that Condos do not have dedicated/reserved nodes. The basic premise of Condo participation is to facilitate the sharing of unused resources. Dedicating or reserving compute resources works counter to sharing, so this is not possible in the Condo model. As an alternative, PIs can purchase nodes and set them up as a Private Pool in the Condo environment, which will allow a researcher to tailor the access and job queues to meet their specific needs. Private Pool compute nodes will share the HPC infrastructure along with the Condo cluster; however, researchers will have to cover the support costs for BRC staff to manage their compute nodes. Please contact us for rates for Private Pool compute nodes.
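
The priority-without-pre-emption behavior described in this section can be illustrated with a toy model. This is not how Slurm is configured on Savio; the `Job` class and `pick_next_jobs` function are hypothetical, the model ignores the size of the floating reservation, run time limits, and backfill, and it only shows that owner jobs move to the front of the waiting queue while running jobs are left alone.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int            # number of nodes the job requests
    is_condo_owner: bool  # True if submitted under the owner's condo QoS
    submitted: int        # submission order (lower means earlier)


def pick_next_jobs(free_nodes: int, waiting: list[Job]) -> list[Job]:
    """Choose which waiting jobs start on the nodes that are free right now.

    Owner jobs sort ahead of all other waiting jobs, with ties broken by
    submission order. Running jobs are never killed, so when no nodes are
    free nothing starts, even for an owner. Jobs start strictly in priority
    order: the loop stops at the first job that does not fit.
    """
    started = []
    ordered = sorted(waiting, key=lambda j: (not j.is_condo_owner, j.submitted))
    for job in ordered:
        if job.nodes > free_nodes:
            break
        started.append(job)
        free_nodes -= job.nodes
    return started


if __name__ == "__main__":
    queue = [
        Job("guest-job", nodes=1, is_condo_owner=False, submitted=0),
        Job("owner-job", nodes=2, is_condo_owner=True, submitted=1),
    ]
    # All nodes busy: the owner job waits; nothing is pre-empted.
    print([j.name for j in pick_next_jobs(0, queue)])
    # Two nodes free up: the owner job starts first despite arriving later.
    print([j.name for j in pick_next_jobs(2, queue)])
```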

Charter Condo Contributors

The following is a list of those who initially contributed Charter nodes to the Savio Condo, thus helping launch the Savio cluster:

Contributor	Affiliation
Eliot Quataert	Theoretical Astrophysics Center, Astronomy Department
Eugene Chiang	Astronomy Department
Chris McKee	Astronomy Department
Richard Klein	Astronomy Department
Uros Seljak	Physics Department
Jon Arons	Astronomy Department
Ron Cohen	Department of Chemistry, Department of Earth and Planetary Science
John Chiang	Department of Geography and Berkeley Atmospheric Sciences Center
Fotini Katopodes Chow	Department of Civil and Environmental Engineering
Jasmina Vujic	Department of Nuclear Engineering
Jasjeet Sekhon	Department of Political Science and Statistics
Rachel Slaybaugh	Nuclear Engineering
Massimiliano Fratoni	Nuclear Engineering
Hiroshi Nikaido	Molecular and Cell Biology
Donna Hendrix	Computation Genomics Research Lab
Justin McCrary	Director D-Lab
Alan Hubbard	Biostatistics, School of Public Health
Mark van der Laan	Biostatistics and Statistics, School of Public Health
Michael Manga	Department of Earth and Planetary Sciences
Sol Hsiang	Goldman School of Public Policy
Jeff Neaton	Physics
Eric Neuscamman	College of Chemistry
M. Reza Alam	Mechanical Engineering
Elaine Tseng	UCSF School of Medicine
Julius Guccione	UCSF Department of Surgery
Ryan Lovett	Statistical Computing Facility
David Limmer	College of Chemistry
Doris Bachtrog	Integrative Biology
Kranthi Mandadapu	College of Chemistry
Kristin Persson	Department of Materials Science and Engineering
Daryl Chrzan	Department of Materials Science and Engineering
William Boos	Earth and Planetary Science
Daniel Weisz	Department of Astronomy
Peter Sudmant	Integrative Biology
Priya Moorjani	Molecular and Cell Biology

Faculty Perspectives

UC Berkeley Professor of Astrophysics Eliot Quataert speaks at the BRC Program Launch (22 May 2014) on the need for local high performance computing (HPC) clusters, distinct from national resources such as those provided by NSF, DOE (NERSC), and NASA.

UC Berkeley Professor of Integrative Biology Rasmus Nielsen speaks at the BRC Program Launch (22 May 2014) about the transformative effect of using HPC in genomics research.