Cyberinfrastructure Engineer

Research IT’s cyberinfrastructure (CI) engineer works with individual researchers and labs to develop repeatable, sharable workflows, data pipelines, and “plumbing” components that can bridge the gap between specific research tools and methodologies and the computational infrastructure offered by UC Berkeley.

If you think your project might benefit from working with a CI engineer, please email us at brc@berkeley.edu. When selecting projects, the CI engineer considers a number of factors, including reusability (will this workflow or component be useful beyond an individual research group?) and impact.

Three example cases illustrate the kind of work the CI engineer can assist researchers with:

Working with Active Research Data - Phoebe A. Hearst Museum of Anthropology (PAHMA)

One of the common challenges that the Research Data Management (RDM) team in Research IT encounters is the effective management of active research data. Often, active data maintained on aging file servers needs to be migrated to a stable yet easily accessible environment. The CIE can work in collaboration with RDM consultants to educate a research team about options for both active data storage and archives. Given the strong historic relationship between PAHMA directors and Research IT, and the instability of the NAS server used as the museum's image repository, the CIE assisted with the data migration as an opportunity to evaluate the tools available to research staff for migrating large data collections. The CIE collaborated with RDM consultants to understand current capabilities, active data management solutions, and established best practices. During the process, certain gaps and bottlenecks were identified that informed the Research IT roadmap in this area. While Globus, a robust data transfer tool, can be used by researchers to move data to Savio for processing, it does not currently have connectors for the two data repositories most used by Berkeley research teams, Box and Google Drive. Other GridFTP tools do not include robust verification of transfers, so the CIE developed a utility program that performs data transfer to Box along with basic validation checks. PAHMA gained a reliable, more accessible repository as well as options for additional data management needs.
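The utility itself is not described in detail here, but a minimal sketch of the approach -- uploading a file with the Box Python SDK (boxsdk) and comparing a locally computed SHA-1 against the checksum Box reports for the stored file -- might look like the following. The credentials, folder ID, and file name are placeholders, and the CIE's actual utility may be structured quite differently.

```python
# Minimal sketch: upload a file to Box and verify it with a SHA-1 check.
# Assumes the `boxsdk` Python library and a developer token; the actual
# utility built by the CIE may differ in structure and API usage.
import hashlib

from boxsdk import Client, OAuth2


def local_sha1(path, chunk_size=1 << 20):
    """Compute the SHA-1 of a local file, reading in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_and_verify(client, folder_id, path):
    """Upload `path` to the given Box folder, then compare checksums."""
    uploaded = client.folder(folder_id).upload(path)
    # Box reports a SHA-1 for every stored file; compare it to the local hash.
    remote_sha1 = client.file(uploaded.id).get().sha1
    if remote_sha1 != local_sha1(path):
        raise RuntimeError(f"Checksum mismatch for {path}")
    return uploaded


if __name__ == "__main__":
    # Placeholder credentials and IDs -- replace with real values.
    oauth = OAuth2(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                   access_token="DEVELOPER_TOKEN")
    box = Client(oauth)
    upload_and_verify(box, folder_id="0", path="example_image.tif")
```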

Supporting Digital Humanities Scholarship - Mellon Postdoctoral Fellow Adam Anderson

The engagement with Adam Anderson, whose interdisciplinary work spans Near Eastern Studies, archaeology, and computational linguistics, was also oriented toward scaling analysis and integrating active data with compute resources. Anderson has collected tens of thousands of PDF files in multiple languages from scanning books relevant to his research, many of them rare, and would like to quickly analyze these images to find critical passages. Berkeley has licenses for desktop OCR software, but the licensed software is not suitable for large-scale processing from either a cost or a usage perspective. Anderson, Digital Humanities Consultant Quinn Dombrowski, and the CIE investigated whether publicly available (open-source) OCR software such as Tesseract could provide the fidelity required to locate passages of interest among the hundreds of thousands of pages in Anderson's collection. Anderson maintains his PDFs on Google Drive, so the data would need to be migrated to a compute platform for processing.

Installing and training Tesseract is tedious and time consuming, so by using a Singularity container the CIE was able to install the software once and run it on a laptop, on Savio, and on the XSEDE Comet cluster. This reduced the overhead of developing and testing the workflow, and also increased confidence in the results because the same executable was being used in all environments.
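As a rough illustration of why the container helps, a driver script like the sketch below runs unchanged on any machine where the container image is present. It assumes a hypothetical image named tesseract.sif that bundles Tesseract and poppler-utils (for pdftoppm); the project's actual container recipe and paths are not shown here.

```python
# Minimal sketch: OCR one PDF with Tesseract running inside a Singularity
# container. Assumes an image file `tesseract.sif` that bundles tesseract
# and poppler-utils (pdftoppm); the project's actual container and paths differ.
import subprocess
import tempfile
from pathlib import Path


def ocr_pdf(pdf_path, image="tesseract.sif", lang="eng"):
    """Return the OCR'd text of every page in `pdf_path` as one string."""
    pdf_path = Path(pdf_path)
    text_parts = []
    with tempfile.TemporaryDirectory() as tmp:
        # Rasterize each PDF page to a 300 dpi PNG inside the container.
        subprocess.run(
            ["singularity", "exec", image, "pdftoppm", "-r", "300", "-png",
             str(pdf_path), f"{tmp}/page"],
            check=True)
        for page in sorted(Path(tmp).glob("page*.png")):
            # `tesseract <img> stdout` writes the recognized text to stdout.
            result = subprocess.run(
                ["singularity", "exec", image, "tesseract", str(page),
                 "stdout", "-l", lang],
                check=True, capture_output=True, text=True)
            text_parts.append(result.stdout)
    return "\n".join(text_parts)


if __name__ == "__main__":
    print(ocr_pdf("sample_scan.pdf")[:2000])
```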

The group collaborated on logic that would locate and score relevant passages, then collect a 'hit list' for review by the researcher. Several iterations were necessary to refine the methods, and metrics were collected to estimate the compute time for the full dataset. Anderson is now reviewing the collected information, and we expect to move forward with large-scale processing runs in the near future.
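Because the scoring logic was developed iteratively with the researcher, the sketch below shows only one plausible shape for it: a weighted keyword score per page, with the highest-scoring pages collected into a hit list. The terms, weights, and snippet length are illustrative, not the criteria actually used in the project.

```python
# Minimal sketch of a keyword-based "hit list": score each page of OCR'd
# text by weighted term counts and keep the top-scoring passages for review.
# The terms and weights here are hypothetical.
import re
from collections import Counter

SEARCH_TERMS = {"amarna": 3.0, "treaty": 2.0, "merchant": 1.0}  # hypothetical


def score_page(text):
    """Weighted count of search terms appearing on one page of text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(weight * words[term] for term, weight in SEARCH_TERMS.items())


def build_hit_list(pages, top_n=50):
    """Return the top `top_n` pages as (score, source, page_number, snippet)."""
    hits = []
    for source, page_number, text in pages:
        score = score_page(text)
        if score > 0:
            hits.append((score, source, page_number, text[:200]))
    hits.sort(reverse=True)
    return hits[:top_n]


if __name__ == "__main__":
    sample = [("book1.pdf", 12, "... a treaty between the merchant houses ...")]
    for hit in build_hit_list(sample):
        print(hit)
```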

Adesnik Lab - Partnering with Labs for Sustained Success

Researchers in the Adesnik Lab are working to identify the fundamental mechanisms by which cortical circuits generate perceptions. Their strategy is to leverage calcium imaging, optogenetic tools, two-photon structured light microscopy, and high-performance distributed computing to design an approach that will enable experimenters to read out and control the spatiotemporal activity of cortical neurons in the intact brain at cellular resolution and with millisecond precision.

The challenges the Adesnik Lab was facing matched well with a number of the focus areas for the CIE role. Team members were investigating multiple workflow tools, including Spark and Jupyter notebooks, that might improve the efficiency and throughput of their data analysis. Research IT domain experts had provided initial examples appropriate to the lab's goals, but the scientists struggled when moving past introductory examples and attempting to apply the tools to experimental data. Given the tight time constraints of the project, the lab's researchers had abandoned the effort to employ these new tools, reverting to the process already in place -- a less efficient, batch-oriented approach. The research team had also discussed with Research IT the difficulties they faced moving data from the collection device (a computer running an older version of Windows, connected to a microscope taking high-resolution images) to Savio (the compute resource); two separate transfers of large data files were required. Finally, the team was interested in experimenting with a closed-loop pattern that would transfer data to the compute resource as it was produced by the device, perform the analysis steps as the data arrived on the cluster, then transfer results back to the lab in near real time. This would allow the researchers to use the results to inform additional steps in the ongoing experiment.

The first challenge was to clarify and resolve the issues encountered as the team attempted to employ the workflow tools available on the Savio cluster. The CIE worked with the Research IT domain experts to try various approaches to creating custom kernels for Jupyter that included the modules required by the team, and to understand how to configure the notebooks to run in parallel on multiple Savio nodes. Addressing the lab's obstacles with Spark required clarifying a few misconceptions; the documentation was then updated to capture that information. Following this initial stage of engagement, the research team was able to successfully employ the technology in a context that was meaningful to them. Several subsequent contacts were made to verify that no new issues had been encountered and that the team had reached a level of comfort with the technology. This pattern of "assisted success" is something the CIE expects to follow in subsequent engagements, to verify that the user is able to interact with the technology independently and has a working example to reference later. These interactions and resources will likely increase the probability of adoption. The CIE, Research IT consultants, and domain experts then reviewed where the rough spots existed for a user attempting to employ the technology, and discussed how to reduce the learning curve for other research teams. The team reviewed possible improvements to the service itself, including default configurations and components that abstract some of the complexity without reducing capability. Additional follow-ups are planned with the Adesnik Lab to see whether the team has been able to effectively employ these tools to improve its methods and practices, to address any additional sticking points, and to provide guidance on best practices.
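For readers curious what a custom Jupyter kernel involves, the sketch below registers a kernel spec that points Jupyter at a Python environment containing the modules a lab needs (the environment must also have ipykernel installed). The environment path and kernel name are placeholders, and Savio's documented procedure may differ in its details.

```python
# Minimal sketch: register a custom Jupyter kernel that launches Python from
# an environment containing the modules a lab needs. Paths and names are
# placeholders; the environment must include ipykernel.
import json
import os
from pathlib import Path

ENV_PYTHON = "/global/home/users/example_user/envs/labenv/bin/python"  # hypothetical
KERNEL_NAME = "labenv"

kernel_spec = {
    # Jupyter substitutes {connection_file} when it launches the kernel.
    "argv": [ENV_PYTHON, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python (labenv)",
    "language": "python",
}

# Write the spec where Jupyter looks for per-user kernels.
kernel_dir = Path.home() / ".local/share/jupyter/kernels" / KERNEL_NAME
os.makedirs(kernel_dir, exist_ok=True)
with open(kernel_dir / "kernel.json", "w") as f:
    json.dump(kernel_spec, f, indent=2)

print(f"Wrote {kernel_dir / 'kernel.json'}")
```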

Finding solutions that make active data more accessible to both users and compute resources is one of the goals of the CIE role. Research team staff often employ workarounds to data transfer challenges, including thumb drives, multi-hop transfers, and external hard drives. The Adesnik team moves a significant amount of data by first transferring it from the imaging device to their laptops, then from the laptops to Savio for processing. While Research IT provides the Globus tool for moving data to the Savio cluster via a DTN, the team did not have access to the account type that would make the device endpoint accessible to other Globus users when it was installed under the type of shared identity used on the Berkeley campus for authorization. Research IT addressed this issue by working with Globus to provide the required client account type, enabling the team to use Globus for internal transfers. Having addressed the immediate issue, the CIE worked with the research team to understand how a more direct data transfer path could be created. The CIE then created a component that leveraged the Globus SDK to monitor a set of output folders and transfer new or modified data files to Savio automatically, reducing the need for the research team to initiate transfers manually. Initial testing was performed in the Research IT environment, then the CIE worked with the Adesnik team to test the component in their lab. A few iterations were needed to refine the component, which further strengthened the CIE's working relationship with the research team.
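The folder-monitoring component itself is not published here, but a minimal sketch of the idea using the Globus Python SDK (recent globus-sdk releases) is shown below: poll a watched directory, and whenever files appear or change, submit a sync transfer to the destination endpoint. The endpoint UUIDs, paths, client ID, and the simple interactive authorization flow are all placeholders; the actual component may differ considerably.

```python
# Minimal sketch: watch a local output folder and submit a Globus transfer
# for new or modified files. Endpoint UUIDs, paths, and the auth flow are
# placeholders; the component built by the CIE may differ.
import time
from pathlib import Path

import globus_sdk

SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"       # e.g. the lab's Globus endpoint
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"  # e.g. the Savio DTN endpoint
WATCH_DIR = Path("/data/microscope/output")               # hypothetical
DEST_DIR = "/global/scratch/example_user/incoming"        # hypothetical


def make_transfer_client():
    """Build a TransferClient; a real deployment would use refresh tokens."""
    auth = globus_sdk.NativeAppAuthClient("NATIVE-APP-CLIENT-ID")
    auth.oauth2_start_flow(
        requested_scopes="urn:globus:auth:scope:transfer.api.globus.org:all")
    print("Visit:", auth.oauth2_get_authorize_url())
    code = input("Paste authorization code: ").strip()
    tokens = auth.oauth2_exchange_code_for_tokens(code)
    transfer_tokens = tokens.by_resource_server["transfer.api.globus.org"]
    authorizer = globus_sdk.AccessTokenAuthorizer(transfer_tokens["access_token"])
    return globus_sdk.TransferClient(authorizer=authorizer)


def poll_and_transfer(tc, seen, interval=60):
    """Submit a sync transfer whenever files appear or change in WATCH_DIR."""
    while True:
        changed = []
        for path in WATCH_DIR.glob("*"):
            mtime = path.stat().st_mtime
            if seen.get(path) != mtime:
                seen[path] = mtime
                changed.append(path)
        if changed:
            tdata = globus_sdk.TransferData(
                tc, SRC_ENDPOINT, DST_ENDPOINT,
                label="lab-to-savio", sync_level="mtime")
            for path in changed:
                tdata.add_item(str(path), f"{DEST_DIR}/{path.name}")
            task = tc.submit_transfer(tdata)
            print("Submitted task", task["task_id"])
        time.sleep(interval)


if __name__ == "__main__":
    poll_and_transfer(make_transfer_client(), seen={})
```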

The CIE and the Research IT team are currently discussing patterns and technologies that might provide a solution to the 'closed loop' experiment workflow in the coming months. The transfer component described above may be employed to automate aspects of the data movement. This work will provide another opportunity to actively engage with the research teams, strengthening the relationship and deepening our understanding of their domain and challenges. The patterns and components that result may enable multiple teams to scale and transform their research while evolving the CI capabilities of Research IT.