Milton Halem, Department of Computer Science and Electrical Engineering
As machine learning (ML) training datasets continue to grow rapidly in both resolution and volume, an HPC-ML challenge emerges that requires systems with high-performance distributed parallel capabilities. To satisfy this demand for training deeper neural networks on large high-resolution images, we require access to a large hybrid cluster of GPU-based architectures. We propose to implement an accelerated, scalable training framework for deep convolutional neural networks applied to the analysis of sequences of very large high-resolution stem cell microscopy images (20,000×20,000 pixels). The aim of the proposed training framework is to distribute the training of convolutional neural network (CNN) models across multiple nodes of hybrid CPUs and GPUs. Using the TensorFlow distributed framework, we will implement CNN architectures in both synchronous and asynchronous modes, and we will investigate the runtime performance and accuracy tradeoffs of implementing distributed CNN models on a large cluster of GPU nodes.
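Because a 20,000×20,000-pixel frame far exceeds GPU memory for typical CNN inputs, one common preprocessing step (assumed here for illustration; the proposal does not specify its pipeline) is to tile each frame into fixed-size patches that can be batched and sharded across nodes. A minimal sketch, using a small hypothetical stand-in frame and the function name `tile_image` (our own, not from the proposal):

```python
import numpy as np

def tile_image(image, tile_size, stride=None):
    """Split a large 2-D image into square tiles, a common way to
    prepare very large microscopy frames for CNN training."""
    stride = stride or tile_size
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, stride):
        for x in range(0, w - tile_size + 1, stride):
            tiles.append(image[y:y + tile_size, x:x + tile_size])
    return np.stack(tiles)

# Hypothetical small stand-in for a 20,000x20,000 microscopy frame.
frame = np.zeros((1024, 1024), dtype=np.float32)
batch = tile_image(frame, 256)
print(batch.shape)  # (16, 256, 256): 4x4 non-overlapping tiles
```

In a distributed setting, the resulting tile batches would then be partitioned across worker nodes, with gradients aggregated synchronously or asynchronously depending on the chosen mode.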