Storage System Performance Update
To all chip users,
The DoIT Research Computing & Data (RCD) Team is continuing to work on the storage system performance issues that have been impacting the chip cluster since early February. For some users the issue is intermittent or not noticeable; for others it is persistent. When present, it shows up as unusually long file read, write, or load times, affecting operations as simple as "ls" or "cd" when navigating the filesystem and, in more complex cases, lengthening Slurm job runtimes.
While the RCD Team continues to optimize the RRStor Ceph File Storage System to meet the needs of its diverse user base, certain research workflows have been identified as especially vulnerable to low performance. These are workflows that read and/or write many (more than one hundred thousand) small (tens of KB or less) files, or the directories that contain them, on the storage system. The RCD Team has begun working with a number of researchers whose Slurm job runtimes have been adversely affected by these workflow features.
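For researchers who want a rough sense of whether a directory fits the pattern described above, the following is a minimal Python sketch and not an RCD-provided tool. The function names, the 64 KB cutoff used to stand in for "tens of KB", and the script name in the usage note are all assumptions made for illustration; only the one-hundred-thousand-file figure comes from this notice. Because the scan itself reads metadata for every file, it may run slowly on affected storage and is best run once rather than repeatedly.

```python
import os
import sys

# Assumed cutoff standing in for "tens of KB or less"; adjust as needed.
SMALL_FILE_BYTES = 64 * 1024
# "More than one hundred thousand" files, as stated in the notice.
MANY_FILES = 100_000


def count_small_files(root, size_limit=SMALL_FILE_BYTES):
    """Return (small_count, total_count) for regular files under root."""
    small = 0
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            total += 1
                            size = entry.stat(follow_symlinks=False).st_size
                            if size <= size_limit:
                                small += 1
                    except OSError:
                        # Skip entries that vanish or cannot be read.
                        continue
        except OSError:
            # Skip directories that cannot be opened.
            continue
    return small, total


def looks_vulnerable(root, size_limit=SMALL_FILE_BYTES, many=MANY_FILES):
    """True if root holds more than `many` files at or below size_limit."""
    small, _ = count_small_files(root, size_limit)
    return small > many


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    small_count, total_count = count_small_files(target)
    print(f"{small_count} of {total_count} files are {SMALL_FILE_BYTES} bytes or smaller")
    if small_count > MANY_FILES:
        print("This directory matches the many-small-files pattern.")
```

Saved under a hypothetical name such as count_small_files.py, the sketch could be run as "python3 count_small_files.py /path/to/dataset" to print the counts for that directory tree.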
The RCD Team continues to meet with our support vendor several times a week to test new ideas and potential solutions while minimizing the impact on users. In the past week, the team has worked to upgrade our storage cluster manager to enable faster communication between storage cluster nodes. The team is also beginning to explore how "pinning" each research group's storage cluster connections to specific cluster nodes might affect the balance of communication between the compute cluster (chip) and the storage cluster.
We will continue to send weekly updates and will share substantive developments as they occur.
Roy Prouty
Assistant Director for Research Computing
UMBC DoIT
Posted: April 10, 2026, 6:35 PM