No matter how much on-premises compute capacity you have, one day you will need more. OCF works closely with a number of HPC schedulers and public cloud providers to enable this on-demand 'burst' capacity as and when you need it.
Do you have occasional peaks in your workload that you struggle to meet? By ‘bursting’ to a public cloud provider you can scale your on-premises HPC cluster up almost without limit, just when you need it. With OCF’s expertise and partnerships with public cloud providers, we can enable cloud bursting on your cluster today.
The OCF Slurm cloud burst implementation uses repurposed core Slurm functionality together with custom OCF scripts to enable intelligent cloud burst functionality on new or existing clusters using the Slurm Workload Manager.
Slurm cloud bursting is an extension of Slurm’s power saving functionality. When a node is not in use it is ‘suspended’, and when it is required again it is ‘resumed’. This is achieved by triggering a customisable script in each case. For cloud bursting, these scripts are replaced with ones that spin cloud instances up or down.
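As a minimal sketch, the relevant slurm.conf parameters look something like the following. The SuspendProgram, ResumeProgram, SuspendTime, ResumeTimeout and State=CLOUD settings are standard Slurm configuration; the script paths, node names and timing values here are illustrative assumptions, not OCF's actual deployment.

```
# slurm.conf excerpt (illustrative values)
SuspendProgram=/opt/ocf/suspend_cloud.sh   # hypothetical path: script that terminates cloud instances
ResumeProgram=/opt/ocf/resume_cloud.sh     # hypothetical path: script that provisions cloud instances
SuspendTime=300                            # suspend a node after 5 minutes idle
ResumeTimeout=600                          # allow up to 10 minutes for a cloud instance to boot

# Cloud nodes are defined with State=CLOUD so Slurm knows they do not
# exist until the ResumeProgram brings them up
NodeName=cloud[01-32] State=CLOUD
```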
Within Slurm, nodes are grouped into “partitions”, and cloud nodes sit in a separate partition from physical nodes. Jobs can then be submitted to the cloud partition so that they always run in the cloud, to the physical partition so that they never do, or to both partitions, in which case the scheduler decides the best place to run the job.
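The split between physical and cloud nodes can be expressed as two partition definitions in slurm.conf. PartitionName, Nodes and Default are standard Slurm configuration keywords; the partition and node names below are illustrative assumptions.

```
# slurm.conf excerpt (illustrative values)
PartitionName=physical Nodes=node[01-16]  Default=YES   # on-premises nodes
PartitionName=cloud    Nodes=cloud[01-32] Default=NO    # cloud-burst nodes
```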
If a job is submitted to multiple partitions, the scheduler’s default behaviour is to allocate it to the partition that will result in the earliest completion of the job. For more specific bursting conditions, such as bursting only above 90% cluster utilisation or once a job has been pending for a set period, OCF develops custom scripts for our customers.
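The three submission choices described above can be sketched with sbatch, using its standard -p (--partition) option, which accepts a comma-separated list of partitions. The partition names and job script are the illustrative assumptions from above.

```shell
sbatch -p physical my_job.sh         # runs on-premises only, never bursts
sbatch -p cloud my_job.sh            # always runs in the cloud
sbatch -p physical,cloud my_job.sh   # scheduler picks whichever partition
                                     # can complete the job earliest
```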
Data pre-staging is achieved using the Slurm burst buffer plugin. This allows jobs to be submitted with the --bb option to stage data in to a location of their choosing (in our case the cloud resource) and then stage result data out (to the cluster scratch disk). To ensure no sensitive data is left on the cloud resource, a #BB destroy option can be invoked once the job is complete. The burst buffer plugin works by triggering scripts, which OCF optimises for each individual cluster.
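A job script using burst buffer directives might look roughly like the sketch below. Note that the exact directive syntax depends on the burst buffer plugin in use, and the paths, stage_in/stage_out keywords and application name here are illustrative assumptions rather than a definitive recipe.

```
#!/bin/bash
#SBATCH -p cloud
#BB stage_in  source=/scratch/project/input destination=/cloud/input     # hypothetical paths
#BB stage_out source=/cloud/results destination=/scratch/project/results # copy results back to scratch
#BB destroy                                                              # remove the cloud copy after the job

srun ./simulation /cloud/input /cloud/results   # hypothetical application
```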
This customisable approach allows for a bespoke cloud bursting solution tailored to our customer requirements.
Slurm can be configured to collect accounting information for every job and job step executed. Accounting records can be written to a simple text file or to a database, and information is available for both currently executing jobs and jobs that have already terminated.

The sacct command reports resource usage for running or terminated jobs, including individual tasks, which is useful for detecting load imbalance between tasks. The sstat command reports on currently running jobs only; it too can give you valuable information about imbalance between tasks. The sreport command generates reports covering all jobs executed in a given time interval.

One thing to note: if you want to give users a set allocation of ‘credits’ and prevent them from using more than that amount, rather than giving them uncapped access and billing them at the end of a period, this requires a separate tool, Slurm Bank (sbank). Administrators can also use sbank to generate reports showing credits used and credits remaining.
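As a brief sketch, the three accounting commands might be used as follows. The command names and --format fields are standard Slurm accounting options; the job ID and date range are illustrative assumptions.

```shell
# Resource usage for a finished or running job, broken down by step
sacct -j 1234 --format=JobID,Elapsed,MaxRSS,State

# Live per-task statistics for a currently running job
sstat -j 1234 --format=JobID,AveCPU,MaxRSS

# Cluster utilisation report over a time interval
sreport cluster utilization start=2024-01-01 end=2024-02-01
```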
Adaptive Computing’s Moab HPC Workload and Resource Orchestration Platform has been enhanced to automatically extend on-premises HPC workloads into public clouds. Moab is fully integrated with NODUS Cloud OS 4.0, which works with Moab to deliver cloud services, enabling true scalability and elasticity.
Moab NODUS Cloud OS can provision a full stack across multiple cloud environments and can be customised to satisfy multiple use cases and scenarios. Access to effectively unlimited HPC compute resources is available in the cloud from multiple cloud providers.