Video recording and post-production by the OpenStack Foundation.
The Worldwide LHC Computing Grid (WLCG) is a global collaboration that analyzes data from CERN's LHC. It consists of more than 170 computing centers in 40 countries.
The CMS experiment maintains a large infrastructure to read out and filter the data from the detector. The High Level Trigger (HLT) cluster comprises 15,000 cores in 1500 compute nodes dedicated to online data and event filtering. However, this resource is used for only about 30% of the time because of the accelerator duty cycle and the various maintenance periods; for the rest of the time it is idle. During these idle periods, and only then, an OpenStack cloud is started on top of the cluster, allowing it to join the WLCG for offline data analysis.
To run the software required for offline data analysis, specific system images are generated and distributed through the cloud. The OpenStack image service (Glance) has to distribute a 1.7 GB image to almost 1500 servers so that the VMs can boot. Every time images are added or changed, Glance has to redistribute the new images to all nova compute nodes, which is challenging at this scale. Because the cluster is sometimes available for only a few hours, a rapid startup is needed to make the most of that time.
In this presentation we will discuss how we managed to boot all the available nova compute nodes in under 10 minutes. We will also explain the specifics of combining standard cluster usage with opportunistic cloud usage, the details of the infrastructure deployment, the issues encountered and their solutions.
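To give a sense of scale, the figures quoted above can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only and is not taken from the presentation. It assumes exactly 1500 nodes, decimal gigabytes, one full image copy per node, and (optimistically) that the whole 10-minute window is available for transfer rather than shared with VM boot time. The function names are hypothetical.

```python
# Back-of-the-envelope estimate of the image distribution load.
# Assumptions (not from the talk): exactly 1500 nodes, decimal GB,
# one full copy of the image per node, and the entire 10-minute
# target spent on transfer.

IMAGE_SIZE_GB = 1.7        # image size quoted in the abstract
NODE_COUNT = 1500          # "almost 1500 servers", rounded up
TARGET_SECONDS = 10 * 60   # "under 10 minutes"


def total_volume_gb(image_size_gb: float, node_count: int) -> float:
    """Total data moved if every node receives its own full copy."""
    return image_size_gb * node_count


def required_throughput_gbit_s(
    image_size_gb: float, node_count: int, seconds: float
) -> float:
    """Aggregate throughput (Gbit/s) needed to finish within `seconds`."""
    return total_volume_gb(image_size_gb, node_count) * 8 / seconds


total = total_volume_gb(IMAGE_SIZE_GB, NODE_COUNT)
rate = required_throughput_gbit_s(IMAGE_SIZE_GB, NODE_COUNT, TARGET_SECONDS)

print(f"Total volume: {total:.0f} GB")
print(f"Aggregate throughput needed: {rate:.1f} Gbit/s")
```

Under these assumptions the total comes to roughly 2.55 TB, or about 34 Gbit/s of sustained aggregate throughput. This suggests why serving every node from a single image source is a bottleneck at this scale.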