Matching User Requirements with Predicted Resource - Queue Time and Energy Prediction
As part of the ComPat project we are seeking to address the issue of energy efficiency associated with running large multiscale applications on geographically distributed supercomputers. Two key areas in this challenge are:
1. Optimising the application code for both runtime and energy.
2. Optimising the application scheduling process to ensure the most appropriate runtime conditions based upon user requirements.
The ComPat project has already started to develop the QCG broker, adding new XML features and parameters that allow system users to specify runtime and energy requirements for a given job. Integrating energy-aware scheduling and queue time prediction will enable efficient scheduling that meets these user requirements, based upon system metrics.
HPC systems use batch scheduling software (Platform LSF, SLURM, LoadLeveler, etc.), which increasingly provides its own algorithms to predict both energy consumption and runtime based on DVFS (dynamic voltage and frequency scaling). To choose the HPC resource that best matches a user's requirements, the metrics from these tools can be used to schedule an application on the most appropriate system.
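As an illustration of this matching step, the sketch below picks a system from per-resource runtime and energy predictions. All names, system entries, and the weighted cost function are hypothetical placeholders, not the actual QCG broker interface or algorithm.

```python
# Hypothetical resource-matching sketch: given predicted (runtime, energy)
# per system, pick the one minimising a weighted cost while meeting the
# user's runtime requirement. Field names and weighting are illustrative.

def pick_resource(predictions, max_runtime_s, energy_weight=0.5):
    """predictions: dict of system name -> (runtime_s, energy_J)."""
    feasible = {name: (rt, en) for name, (rt, en) in predictions.items()
                if rt <= max_runtime_s}
    if not feasible:
        return None  # no system can meet the user's runtime requirement
    # Normalise each metric so the runtime/energy weighting is scale-free.
    max_rt = max(rt for rt, _ in feasible.values())
    max_en = max(en for _, en in feasible.values())

    def cost(item):
        rt, en = item[1]
        return (1 - energy_weight) * rt / max_rt + energy_weight * en / max_en

    return min(feasible.items(), key=cost)[0]

# Illustrative metrics for three (fictional) systems.
systems = {
    "system_a": (3600.0, 5.0e6),
    "system_b": (5400.0, 3.2e6),
    "system_c": (2700.0, 6.1e6),
}
best = pick_resource(systems, max_runtime_s=6000.0, energy_weight=0.7)
# With energy weighted at 0.7, the slower but lower-energy system_b wins.
```

In a real deployment the weighting would come from the user's stated requirements in the job description rather than a fixed constant.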
However, in order to query the resources, a sample run of the application must first be executed for long enough to create an energy tag indicating the predicted energy and runtime at varying CPU frequencies. This information can then be used to understand the energy and runtime performance of the application and to predict future performance.
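An energy tag can be thought of as a per-frequency table of predicted runtime and energy. The model below is illustrative only (the actual tag format produced by energy-aware schedulers may differ) and shows how the lowest-energy frequency meeting a runtime bound could be selected from it.

```python
# Illustrative energy-tag model: predicted runtime and energy per CPU
# frequency. The frequencies and values are made-up examples.
energy_tag = {
    # frequency_GHz: (predicted_runtime_s, predicted_energy_J)
    2.7: (1000.0, 9.0e5),
    2.3: (1150.0, 7.6e5),
    1.9: (1400.0, 7.9e5),  # below some point, slower runs cost more energy
}

def best_frequency(tag, runtime_limit_s):
    """Lowest-energy frequency whose predicted runtime meets the limit."""
    feasible = [(en, f) for f, (rt, en) in tag.items() if rt <= runtime_limit_s]
    return min(feasible)[1] if feasible else None

freq = best_frequency(energy_tag, runtime_limit_s=1200.0)
# 2.3 GHz: meets the 1200 s limit with less energy than 2.7 GHz.
```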
By breaking down the application requirements (expressed in XMML), it is possible to classify each application as one of a series of common types. This enables an associated dummy job to be created for each type; these can be run periodically to generate energy tags that reflect the current state of each HPC resource.
This periodic running of the test jobs can save energy and time by reducing the number of jobs needed to obtain energy tags from one per application to, for example, one per 100 applications. Actual runtime and energy data from completed jobs will be used to weight the performance predicted by the test jobs, giving a view of scheduler accuracy.
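One simple way to realise this weighting (a sketch under assumptions, not the project's specified method) is to keep, per application type, a smoothed ratio of actual to predicted performance and scale fresh predictions by it; the smoothing factor is an assumed parameter.

```python
# Minimal sketch of weighting dummy-job predictions with observed data:
# maintain an exponentially weighted mean of the actual/predicted ratio
# per application type, and scale new predictions by it. The class name,
# application types, and alpha value are all illustrative assumptions.

class PredictionCorrector:
    def __init__(self, alpha=0.3):
        self.alpha = alpha       # smoothing factor for new observations
        self.ratio = {}          # app type -> smoothed actual/predicted ratio

    def observe(self, app_type, predicted, actual):
        """Fold one completed job's result into the correction factor."""
        r = actual / predicted
        prev = self.ratio.get(app_type, 1.0)
        self.ratio[app_type] = (1 - self.alpha) * prev + self.alpha * r

    def correct(self, app_type, predicted):
        """Scale a test-job prediction by the learned correction factor."""
        return predicted * self.ratio.get(app_type, 1.0)

c = PredictionCorrector(alpha=0.5)
c.observe("monte_carlo", predicted=1000.0, actual=1200.0)  # ratio -> 1.1
corrected = c.correct("monte_carlo", 2000.0)               # 2000 * 1.1
```

The per-type ratios also quantify scheduler accuracy directly: a ratio far from 1.0 flags application types whose energy tags are stale or unrepresentative.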
To generate predicted performance per application, deep learning algorithms will need to be developed to analyse the test-run data together with the actual data generated by executed applications. These algorithms will influence how the XMML is analysed to fit a job into a category, and how the data from the schedulers are managed in terms of accuracy.
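The learning component is future work, so the following is only a stand-in: a tiny gradient-descent regressor (pure Python, no DL framework) fitted on synthetic data, predicting runtime from 1/frequency, which is roughly linear under DVFS for CPU-bound codes. A real implementation would use richer features and an actual deep model.

```python
# Placeholder for the planned learning algorithms: least-squares fit of
# runtime against 1/frequency via gradient descent. The training data is
# synthetic and illustrative, not measured ComPat application data.

def fit_linear(xs, ys, lr=0.1, epochs=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        db = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * dw
        b -= lr * db
    return w, b

# Synthetic data: runtime = 2000 / freq + 100 (made-up DVFS-like curve).
freqs = [1.2, 1.6, 2.0, 2.4, 2.8]
xs = [1.0 / f for f in freqs]
ys = [2000.0 * x + 100.0 for x in xs]

w, b = fit_linear(xs, ys)
pred = w * (1.0 / 2.2) + b   # predicted runtime at an unseen 2.2 GHz
```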
End of August: initial implementation and test on one node (1 conference publication).
End of October: implementation and test across nodes (1 conference publication).
End of March: application of different deep learning algorithms and results (journal publication).
Additional Notes - Call 28th July 2017
Vytautas, Olly, Neil.
Queue time and energy prediction / scheduling are overlapping elements of work. Opportunities for more efficient scheduling / integration.
Need to refine/develop the model for capturing runtime, energy and DVFS predictions to support appropriate job scheduling through the QCG broker. This could include both the EAS features within the schedulers and the library developed by Poznan.
1. Upload the outline document to the wiki (NM).
2. Discuss the Energy Task Force on the call next Friday, 4th August 2017 (All).
3. Set up face to face meeting for the Energy Task Force (OP?).
4. Discuss the energy library with Tomek and colleagues from Poznan (NM).
5. Explore inviting rep from ‘EsiWace’ centre of excellence to the next all hands meeting to discuss energy efficiency and prediction (OP).