Fast Approximate Evaluation of Parallel Overhead from a Minimal Set of Measured Execution Times

Porting scientific key algorithms to HPC architectures requires a thorough understanding of the subtle balance between gain in performance and introduced overhead. Here we continue the development of our recently proposed technique that uses plain execution times to predict the extent of parallel overhead. The focus here is on an analytic solution that takes into account as many data points as there are unknowns, i


Introduction
High Performance Computing (HPC) is increasingly seen as an integral part of modern scientific research and many examples have been given to date [1][2][3][4][5], ranging from theoretical predictions of exotic states of matter [1,5] through the analysis of CERN data in the discovery of the Higgs boson [2] or of genetic data in deciphering the human genome [4] to worldwide collaborations on the detection of gravitational waves [3]. In many instances a single key algorithm forms the basis of all in-depth scientific discovery and, consequently, needs to be ported to HPC architectures for efficient computation of large-scale problems and detailed analysis of complex data sets. However, such a transformation of scientific code to an efficient HPC implementation is anything but a trivial task. This is due in part to the complex nature of current state-of-the-art HPC systems, e.g. the dynamic behaviour of core components [6], the increasing complexity of CPU architectures [7] and the need for scalable performance using standard methods of parallel programming [8]. Moreover, significant limitations arise from the required communication between individual parallel tasks, where the introduced overhead needs to be kept at an absolute minimum in order to achieve sustained scalability to large numbers of parallel processing cores [9,10]. Especially with respect to this latter constraint, several quality tools have been developed in the past that aid in quantifying parallelization overhead and thus ease the process of porting scientific programs [11][12][13].
Studying parallel overhead has been a research focus for many years [14][15][16][17][18][19][20][21][22]. One early proposal was the logP model [14,15], where four parameters (latency, overhead, gap, i.e. the reciprocal communication bandwidth, and number of processors) were introduced to analyze a given algorithm. A core design goal was to find a balance between overly simplistic and overly specific models. Application to MPI [16] and several extensions covering large messages [17,16] and contention effects [18] were described. A more abstract framework with tunable complexity but still practical timing requirements has been provided with PERC [19]. More recent trends in hybrid MPI/OpenMP programming were addressed by combining application signatures with system profiles [20]. Along similar lines, application-centric performance modeling [21] was described, based on characteristics of the application and the target computing platform, with the objective of successful large-scale extrapolation. Similar predictions could be made with the help of runtime functions within the SUIF infrastructure [22].
For many practical applications, prior to any more sophisticated analysis, it is already helpful to get a quick overview with the simplest of methods, i.e. without having to switch on profiling flags, link in additional libraries, embed the code in analysis tools etc. Such a method need not be perfect [7]; it should merely serve as a qualitative assessment that flags critical situations where consulting more advanced tools [11][12][13] becomes strongly advisable. We have previously presented such a method [23], where a simple record of execution times for varying numbers of parallel processing cores could be used to estimate the parallel overhead incurred. This approach was specifically applicable to a particular class of scientific applications frequently used on current HPC platforms. Here we briefly summarize the basics of this approach, augment it with two more examples from the scientific community and describe a systematic way to determine model-critical parameters.

Basic model
The time to solution, t_n, for a particular application executed on some HPC platform using n cores was proposed in [23] as Eq. (1), where b and c are application- as well as problem-size-specific parameters and the initial term is the classic relation of Amdahl [24], with f_s the sequential fraction of t_1 that cannot be parallelized. Parameters b and c could be determined from a fit to measured execution times, with f_s obtained as a by-product. Once the fit showed a good match to the data set, f_s, b and c could be plugged into a second relation, Eq. (3), accounting for τ_n, the parallel overhead affecting the application. Owing to its simplicity, Eq. (1) can also be used in reverse to determine the model parameters c, f_s and b from a triple of known explicit data points, {(n_x, t_x), (n_y, t_y), (n_z, t_z)}. Since the number of possible triples grows rapidly with the number of available data points, it is interesting to examine which of the available combinations would lead to an optimal representation. The hope is to replace the fitting procedure used in [23] with an analytic approach based on a suitable choice of three explicit execution times. Given the already established data sets considered in [23], these data can now be re-evaluated, general trends unveiled and potential numerical issues identified.
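As a point of reference, the Amdahl term mentioned above can be sketched in a few lines. The snippet below covers only the classic Amdahl relation, t_1 (f_s + (1 - f_s)/n); the additional term of Eq. (1) involving b and c is not reproduced here. The function names and all numbers are hypothetical and serve illustration only.

```python
def amdahl_time(t1, fs, n):
    """Amdahl estimate of the time to solution on n cores.

    t1: measured single-core execution time
    fs: sequential fraction of t1 (0 <= fs <= 1)
    n:  number of cores
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= fs <= 1.0:
        raise ValueError("fs must lie in [0, 1]")
    return fs * t1 + (1.0 - fs) * t1 / n


def excess_over_amdahl(measured, t1, fs):
    """Difference between measured times and the Amdahl estimate.

    measured: dict mapping core count n -> measured t_n
    """
    return {n: tn - amdahl_time(t1, fs, n) for n, tn in measured.items()}


# Hypothetical numbers, for illustration only (not taken from Table 1).
t1 = 1000.0
measured = {8: 140.0, 64: 40.0, 512: 60.0}
for n, extra in excess_over_amdahl(measured, t1, fs=0.01).items():
    print(n, round(extra, 2))
```

The excess printed here is merely the gap between a measurement and the Amdahl estimate for an assumed f_s; it should not be mistaken for τ_n as defined by Eq. (3).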

Determination of model parameters c, f_s and b
Respecting the non-linear character of Eq. (1), after a series of straightforward algebraic manipulations we obtain a 4th-order polynomial in c, p_4(c), given as Eq. (4), with individually defined coefficients and characteristic shapes of the polynomial shown in Fig. 1. Because the individual coefficients in the polynomial of Eq. (4) are rather large, we usually divide the entire equation by the coefficient of the 1st-order term, i.e. the prefactor of c. In so doing we expect to identify at least one zero in the interval [10, 1000] (see Fig. 1) with c ∈ R+, which is readily detected by means of bisection [25]. Of particular note is the twofold character of the curve, which shows either a rapidly increasing or a rapidly decreasing trend (see blue and red graphs in Fig. 1). Having obtained a solution for c, the next model parameter, f_s, can be determined using Eq. (6), where it should be noted that equivalent expressions in terms of pairs xz or yz should lead to identical solutions. Finally, the third parameter, b, can be derived from Eq. (7), where, again, similar expressions hold for n_y, t_y and n_z, t_z.

Fig. 1. [...] Eq. (4) where all coefficients are scaled by (γ [...] Different choices of explicit triplets of execution times give rise to p_4(c) of either strongly increasing (blue) or decreasing (red) trend with at least one particular zero always detectable in the interval [10, 1000].

1850003-3 Parallel Process. Lett. 2018.28. Downloaded from www.worldscientific.com by UNIVERSITY OF VIENNA on 04/04/18. For personal use only.
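The root search described above (scale by the coefficient of the linear term, then bisect on [10, 1000]) might be sketched as follows. The quartic used here is purely hypothetical, constructed to have one positive zero at c = 50; it does not represent the actual coefficients of Eq. (4), and the helper names are assumptions made for this illustration.

```python
def scale_by_linear(coeffs):
    """Divide all coefficients by the prefactor of c.

    coeffs: [a4, a3, a2, a1, a0], highest order first.
    """
    a1 = coeffs[-2]
    if a1 == 0.0:
        raise ValueError("coefficient of the linear term is zero")
    return [a / a1 for a in coeffs]


def poly_eval(coeffs, c):
    """Evaluate the polynomial at c using Horner's scheme."""
    result = 0.0
    for a in coeffs:
        result = result * c + a
    return result


def bisect_root(f, lo, hi, tol=1e-10, max_iter=200):
    """Plain bisection; requires a sign change of f on [lo, hi]."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0.0:
        raise ValueError("no sign change in [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or hi - lo < tol:
            return mid
        if f_lo * f_mid < 0.0:
            hi = mid                 # zero lies in the lower half
        else:
            lo, f_lo = mid, f_mid    # zero lies in the upper half
    return 0.5 * (lo + hi)


# Hypothetical quartic (c - 50)(c + 1)(c + 2)(c + 3), expanded.
coeffs = [1.0, -44.0, -289.0, -544.0, -300.0]
scaled = scale_by_linear(coeffs)
root = bisect_root(lambda c: poly_eval(scaled, c), 10.0, 1000.0)
print(root)
```

Scaling leaves the zeros unchanged and only flips the sign of the curve when the linear coefficient is negative, which is harmless for bisection. If the interval happened to contain several zeros, this sketch would return only one of them.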

Results
Extending our previous study [23] we first present data for two more applications frequently used in science and research, WIEN2k [28] and NWCHEM [29]. Times to solution, t_n, as a function of the number of processing cores, n, are shown in Fig. 2 (red squares and green discs, also see Table 1). As can be seen, WIEN2k belongs to exactly the same class of applications that was the subject of our previous study, where the focus was on compute-bound applications of significant arithmetic intensity. In contrast, NWCHEM exhibits a remarkably different profile and shows a strong rise in t_n even for small numbers of processing cores, where t_n starts to exceed the single-core execution time, t_1, at core counts of n > 250. The reason for such a drastic qualitative change may lie in the fact that NWCHEM implements a PGAS [30][31][32][33] model, in particular the Global Arrays Toolkit [34]. It is surprising that the suggested relation of Eq. (1) appears to work even for applications that had never been considered a target of the original development.

Overall assessment of the explicit computation of model parameters c, f_s and b
Previous [23] and current measurements of t_n as a function of n are considered for analytical computations of c, f_s and b following Eqs. (4), (6) and (7). From a set of k data points all possible triple combinations {(n_x, t_x), (n_y, t_y), (n_z, t_z)} are examined. There are k(k-1)(k-2)/6 such triples (the binomial coefficient "k choose 3"), where it should be noted that the single-core execution time, t_1, is not a member of the k-group but needs to be provided in addition to it. Initially, the root-finding problem described in Sec. 2.1 should deliver a single solution for c. This is then used in Eq. (6) to determine f_s. The two pairs of data points with the largest differences in core counts are taken into account for this second task, and the smaller f_s value is carried on only if both solutions are perceived as meaningful, i.e. f_s > 0. The final task then is to use these two parameters, c and f_s, together with Eq. (7) to determine b. All three possibilities are considered and a mean value of b is calculated only if all individual solutions turn out to be reasonable, i.e. b > 0. The general numerical behaviour of this approach is summarized in Table 2. As becomes clear immediately, the major problem is finding a suitable solution for f_s. This is partially due to the anticipated value of f_s close to zero for strong-scaling applications, which for many triples will result in a pair of f_s-solutions slightly larger and slightly smaller than zero that will hence be discarded following the aforementioned procedure. The interesting outlier in this respect is NWCHEM, which underlines the observed characteristics of non-ideality in the strong-scaling regime with an anticipated sizeable fraction of sequential code. Given the autonomous nature and efficiency of the present approach, an analytical determination of such critical parameters as f_s seems to be of considerable advantage.

Table 2. Evaluation of explicit calculations of parameters c, f_s and b via Eqs. (4), (6) and (7) for applications WIEN2k [28], NWCHEM [29], HPL [36], GROMACS [37], AMBER [38], GREMLIN [39,40], VASP [41], QUANTUM ESPRESSO [42] and LAMMPS [43].
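The bookkeeping of this procedure, i.e. enumerating all triples and applying the acceptance rules for f_s and b, can be sketched as below. The evaluation of Eqs. (4), (6) and (7) themselves is not included; candidate values for f_s and b are assumed to come from elsewhere. Function names and the sample records are hypothetical.

```python
from itertools import combinations
from math import comb


def all_triples(points):
    """All unordered triples {(n_x, t_x), (n_y, t_y), (n_z, t_z)}."""
    return list(combinations(points, 3))


def select_fs(fs_pair):
    """Carry on the smaller f_s, but only if both candidates are positive."""
    if all(fs > 0.0 for fs in fs_pair):
        return min(fs_pair)
    return None  # triple is discarded


def mean_b(b_values):
    """Mean of the b candidates, but only if every one of them is positive."""
    if all(b > 0.0 for b in b_values):
        return sum(b_values) / len(b_values)
    return None  # triple is discarded


# Hypothetical (n, t_n) records; t_1 is supplied separately from the k points.
points = [(16, 90.0), (32, 50.0), (64, 31.0), (128, 22.0), (256, 19.0)]
triples = all_triples(points)
print(len(triples), comb(len(points), 3))

# A pair of f_s candidates straddling zero is rejected, as described above.
print(select_fs((0.004, -0.002)), select_fs((0.02, 0.05)))
```

Returning None for a rejected triple keeps the screening loop simple: a triple counts as successful only if neither selection step yields None.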

Optimal choice of three data points
For each of the successful computations of parameters c, f_s and b (see previous Sec. 4.1) we can express the quality of the approximation in terms of the root mean square deviation (RMSD) from the measured execution times, where t_n(exp) are the k experimentally observed data points and t_n(app) their corresponding approximations resulting from Eq. (1). By ranking all the solutions according to RMSD we can screen for the optimal triplet of explicit data points leading to the best match to the observed data. Moreover, different applications can be compared to each other and their respective optimal combinations of explicit data points can be analyzed for an emerging pattern perhaps common to all applications. Top-2-ranked solutions together with their underlying triplets of core counts are summarized in Table 3. Top results corresponding to WIEN2k and NWCHEM are also indicated as dashed lines in Fig. 2 and can be compared to the plain GNUPLOT [35] based fitting. As has already been outlined in [23], because of lim_{n→∞} τ_n/t_n < 1 we infer that b < c + 1, a condition that almost all of the top-2-ranked results do indeed reflect (see final two columns in Table 3). It is interesting to observe that applications similar to NWCHEM, with an increasing trend in t_n, deliver solutions very close to this limit (see for example b, c values in Table 3 for applications QUANTUM ESPRESSO and LAMMPS). Thus the rather strong deviation of τ_n (see dashed orange line in the right panel of Fig. 2) may indicate a certain degree of systematic bias for this type of application. No solution could be determined for application GREMLIN [39,40] because of the limited number of available data points (see Table 2).

Table 3. Top ranked solutions for c, f_s and b using Eqs. (4), (6) and (7) and corresponding selection of three explicit data points resulting in minimal RMSD from experimentally observed times to solution, t_n, for a range of applications (see [...]
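The ranking step can be illustrated with a short sketch. It assumes the common RMSD definition with a 1/k normalization, i.e. the square root of the mean squared difference between t_n(exp) and t_n(app); whether this normalization was the one used is an assumption. The approximated times per triple are taken as given, since Eq. (1) is not evaluated here, and all names and numbers are hypothetical.

```python
from math import sqrt


def rmsd(t_exp, t_app):
    """Root mean square deviation between measured and approximated times.

    Assumes normalization by the number of data points k.
    """
    if len(t_exp) != len(t_app) or not t_exp:
        raise ValueError("need two non-empty sequences of equal length")
    return sqrt(sum((e - a) ** 2 for e, a in zip(t_exp, t_app)) / len(t_exp))


def rank_by_rmsd(t_exp, candidates):
    """Sort candidate solutions by ascending RMSD.

    candidates: dict mapping a triple of core counts (n_x, n_y, n_z)
                to the list of approximated times for all k data points.
    Returns a list of (rmsd, triple) pairs, best match first.
    """
    return sorted((rmsd(t_exp, t_app), triple)
                  for triple, t_app in candidates.items())


# Hypothetical measured times and approximations for three triples.
t_exp = [90.0, 50.0, 31.0, 22.0, 19.0]
candidates = {
    (16, 64, 256): [90.0, 51.0, 31.0, 21.5, 19.0],
    (16, 32, 64): [90.0, 50.0, 31.0, 25.0, 27.0],
    (32, 128, 256): [88.0, 50.0, 30.0, 22.0, 19.0],
}
ranking = rank_by_rmsd(t_exp, candidates)
for value, triple in ranking[:2]:   # top-2-ranked solutions
    print(triple, round(value, 3))
```

Keeping the triple of core counts next to each RMSD value makes it straightforward to report the top-ranked solutions together with the data points they were derived from.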
In general, it is difficult to advise on an optimal choice of three explicit data points. Results of Table 3 are also graphically illustrated in Fig. 3, where individual ranges of considered numbers of cores have been normalized into the interval [0, 1] and the largest value of n (corresponding to normalized 1) is given as a subscript to the name of the application. Not only do optimal core counts scatter over the entire range, but second- and top-ranked solutions are also quite different from each other for individual applications. Thus it appears that compiling a small data set similar to the example given in Table 1, followed by a combinatorial examination of all possible triplets, is probably the most general approach to determine optimal solutions for c, f_s and b.

Conclusion
Parallel execution times, t_n, can be well described by Eq. (1) for a great variety of different applications. This even includes atypical cases, as shown here using the example of NWCHEM [29], that exhibit deterioration of execution times with increasing numbers of parallel processing cores. Model parameters c, f_s and b can be efficiently computed from a set of 3 explicit data points, with f_s, the fraction of sequential code, being the most difficult to determine. Knowledge of these parameters facilitates an approximate estimation of τ_n, the parallel overhead affecting the application (see Eq. (3)). No general recommendation can be made for an a priori choice of the explicit data points most appropriate for an optimal determination of these parameters.

Fig. 2. (color online) Recorded times to solution, t_n, (red squares and green discs, also see Table 1, columns 1, 2 and 5) as a function of numbers of cores, n, operating in parallel for applications WIEN2k [28] and NWCHEM [29]. Very large initial times corresponding to very small core counts have been truncated for better visual clarity. Best fitting the data with GNUPLOT [35] yields parameters b and c (implicitly also f_s) via Eq. (1), where the original data is reasonably well approximated (solid line in cyan). In addition, an estimate can be provided for the parallel overhead using Eq. (3) (orange line). The estimate matches the allinea/MAP-derived [13] mean parallel overhead rather well (compare orange line to triangles in blue, respectively Table 1, columns 3 and 6). Significant deviation from Amdahl's Law [24] is seen already for small core counts (compare cyan to grey line). Optional analytic determinations of c, f_s and b (as discussed here) from a triple of known data points are shown as dashed lines.

Fig. 3. Triplets of explicit data points leading to optimal solutions for c, f_s and b. Maximum numbers of cores considered are indicated as subscripts to the name of the application on the right axis. Various ranges have been normalized into the interval [0, 1] for better direct comparison, with 1 representing the maximum core count.
