BRC: Supporting Data Intensive Computing

May 18, 2015

Like The Blob in a 1950s B-movie, data grows ever larger with time. That’s what UC Berkeley’s researchers are finding as they work with data generated by new generations of instruments and high-resolution sensors, as well as increasingly monstrous datasets. Brain scans from high-resolution CCD cameras, images from remote telescopes, and high-frequency stock trading and quotation datasets are just a few examples of this data’s diversity.

As campus researchers adopt high performance computing resources to help analyze this data, moving these masses of data to the systems where computational analysis is performed is a challenge. In significant part, that’s because it can be slow and frustrating to schlep vast quantities of data over general-purpose computer networks from the many places it is stored or generated.

To help address this pain point, the Berkeley Research Computing (BRC) program in Research IT has deployed a new Data Transfer Node (DTN). This DTN directly connects Savio, the campus’ one-year-old High Performance Computing (HPC) Linux cluster, to the campus’ new Science DMZ network, which facilitates fast transfer of large datasets typical of research workflows requiring computation.

The campus’ Science DMZ network, in turn, was recently connected at a blazing 100Gb/s to the California Research & Education Network (CalREN). This high-speed, fiber-based network reaches over 10,000 research, educational, and public service institutions throughout the state, ranging from colleges and universities to K-12 schools; peers with other educational high-speed networks; and connects to the commodity Internet as well. Public libraries will soon be connected to CalREN under the California Library Broadband Initiative.

BRC has also installed Globus Online software on the DTN connecting the Savio HPC cluster to the campus Science DMZ. Globus Online allows researchers to perform unattended transfers of large datasets to and from Savio, using parallel streams. The software improves data transfer speed, convenience, and reliability, and additional features will soon be enabled that let researchers selectively share files on their laptops or workstations with other Globus users.
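
For readers who prefer to script such transfers, the sketch below shows roughly what an unattended transfer to Savio’s DTN looks like using the Globus Python SDK (globus_sdk). It is a minimal sketch, not BRC-specific documentation: the endpoint UUIDs, access token, and file paths are placeholders, and the SDK shown here postdates the original Globus Online interfaces.

    # Illustrative sketch: submit an unattended Globus transfer to an HPC DTN.
    # Requires the globus-sdk package; endpoint UUIDs, token, and paths are placeholders.
    import globus_sdk

    LAPTOP_ENDPOINT = "11111111-1111-1111-1111-111111111111"     # placeholder UUID
    SAVIO_DTN_ENDPOINT = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

    # An access token obtained through the normal Globus OAuth2 login flow.
    authorizer = globus_sdk.AccessTokenAuthorizer("TRANSFER_ACCESS_TOKEN")
    tc = globus_sdk.TransferClient(authorizer=authorizer)

    # Describe the transfer; the Globus service handles parallel streams,
    # checksums, and retries on the researcher's behalf.
    tdata = globus_sdk.TransferData(
        tc, LAPTOP_ENDPOINT, SAVIO_DTN_ENDPOINT,
        label="Dataset to Savio scratch", sync_level="checksum",
    )
    tdata.add_item("/home/researcher/dataset/",
                   "/global/scratch/researcher/dataset/",
                   recursive=True)

    # Submit and walk away; the transfer runs unattended and can be
    # checked later by task ID.
    task = tc.submit_transfer(tdata)
    print("Submitted Globus transfer, task id:", task["task_id"])

Because the transfer is managed by the Globus service and runs between the two endpoints rather than through an interactive session, the researcher can disconnect and check the task’s status later.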

The Savio cluster has the distinction of being the first production system connected to the campus’ Science DMZ network. The Science DMZ is being built out by Isaac Orr, Erik McCroskey, and others in IST’s Network Operations and Services group, under the aegis of an NSF grant the campus received to assist researchers who need to move large amounts of data and to support network research.

Other institutions are also deploying Science DMZ networks, and in some cases these individual high-speed science networks are starting to be linked together across institutional boundaries. For instance, UC Berkeley, along with many other UC campuses and a set of peer institutions including Stanford, Caltech, and the University of Washington, is working toward developing the Pacific Research Platform: a multi-campus mesh of DTNs that effectively links the Science DMZs of the participating institutions. This network will be especially welcome to campus researchers who must regularly transfer data to and from their collaborators at these institutions.

About Science DMZs

The Science DMZ is a specific network architecture designed to handle the high-volume data transfers typical of research and high-performance computing, by setting aside a dedicated portion of the network (the “DMZ”) to accommodate those transfers. It is typically deployed at or near the local network perimeter, and is optimized for fast, relatively bottleneck-free handling of a moderate number of high-throughput research data flows, as contrasted with the many diverse flows typical of general-purpose network traffic.

A simple Science DMZ has several essential components. These include dedicated access to high-performance wide area networks and advanced services infrastructures, high-performance network equipment, and dedicated science resources such as Data Transfer Nodes. A notional diagram of a simple Science DMZ showing these components, along with data paths, is shown above. (Diagram courtesy of ESnet)

The Science DMZ architecture was initially developed by the US Department of Energy’s ESnet, which is based at Lawrence Berkeley National Laboratory. The ultimate vision for this network architecture is to eliminate constraints on scientific progress caused by the physical location of instruments, data, computational resources, or people.

About DTNs

Data Transfer Nodes (DTNs) are dedicated systems, purpose-built and tuned to facilitate high-speed file transfers over Science DMZ networks. DTNs are set up as endpoints on those networks to initiate or receive transfers between large data generators (genome sequencers, high-resolution cameras, and other instruments or sensors) and data receivers, including large storage resources, computational resources such as the campus’ Savio HPC cluster, and even collaborator sites elsewhere on the Internet.
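
To give a flavor of why DTN transfer tools split one large transfer into parallel streams, here is a deliberately simplified Python sketch that copies a file as several concurrent byte-range streams. It is a toy illustration only, with arbitrary file names and stream count; real DTN software such as Globus and GridFTP parallelizes at the network level and is carefully tuned for wide-area links.

    # Toy illustration of the parallel-stream idea behind DTN transfer tools:
    # copy one large file as several concurrent byte-range "streams".
    # Not production transfer software; paths and stream count are arbitrary.
    import os
    from concurrent.futures import ThreadPoolExecutor

    def copy_range(src, dst, offset, length, block=4 * 1024 * 1024):
        """Copy `length` bytes starting at `offset` from src into dst."""
        with open(src, "rb") as fin, open(dst, "r+b") as fout:
            fin.seek(offset)
            fout.seek(offset)
            remaining = length
            while remaining > 0:
                chunk = fin.read(min(block, remaining))
                if not chunk:
                    break
                fout.write(chunk)
                remaining -= len(chunk)

    def parallel_copy(src, dst, streams=4):
        size = os.path.getsize(src)
        # Pre-size the destination so each stream can write its own region.
        with open(dst, "wb") as fout:
            fout.truncate(size)
        step = max(1, (size + streams - 1) // streams)
        with ThreadPoolExecutor(max_workers=streams) as pool:
            for offset in range(0, size, step):
                pool.submit(copy_range, src, dst, offset, min(step, size - offset))

    if __name__ == "__main__":
        parallel_copy("big_dataset.bin", "big_dataset_copy.bin", streams=4)

In practice, researchers should let the DTN’s transfer software (Globus, GridFTP) do this work over the network; the sketch simply shows the pattern of splitting one large job into concurrent streams.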