Filtering

Congratulations, we have made our first measurement with Score-P. But how good was that measurement? The instrumented execution still produced the correct result, but it took noticeably longer than expected: the runtime roughly doubled compared to the baseline (around 47s versus 23s; your numbers may differ slightly). Even ignoring measurement start-up and shutdown, the run was most likely prolonged by instrumentation/measurement overhead.

To avoid drawing wrong conclusions from data distorted by significant overhead, it is usually a good idea to optimise the measurement configuration before doing any further experiments. There are several ways to do this, for example runtime filtering, selective recording, or manual instrumentation to control the measurement.

However, in many cases, it's enough to filter a few frequently executed but otherwise unimportant user functions to reduce the measurement overhead to an acceptable level (based on experience, we consider 0-20% of runtime dilation as acceptable). The selection of those routines has to be done with care, though, as it affects the granularity of the measurement and too aggressive filtering might "blur" the location of important hotspots.

To understand where the overhead comes from, we need to score the measurement. This is done with the following command:

$ scorep-score scorep_bt-mz_sum/profile.cubex 

The output will look like this:

Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 6GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 6GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid
intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
maximum supported memory or reduce requirements using USR regions filters.)

flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 5,398,866,521 1,638,135,715 2570.45 100.0 1.57 ALL
USR 5,358,738,138 1,631,138,913 161.17 6.3 0.10 USR
OMP 39,174,822 6,781,952 2093.27 81.4 308.65 OMP
COM 665,210 182,120 302.72 11.8 1662.21 COM
MPI 288,310 32,722 13.28 0.5 405.86 MPI
SCOREP 41 8 0.00 0.0 43.30 SCOREP

As can be seen from the top of the score output, the estimated size for an event trace measurement without filtering applied is approximately 40GB, with the process-local maximum across all ranks being roughly 6GB.

The next section of the score output is a table that shows how the trace memory requirements of a single process (column max_buf), as well as the overall number of visits and the CPU allocation time, are distributed among the function groups. In the current execution, the following groups are distinguished:

  • ALL: All functions of the application.
  • MPI: MPI API functions.
  • OMP: OpenMP constructs and API functions.
  • COM: User functions/regions that appear on a call path to an OpenMP construct, or an OpenMP or MPI API function. Useful to provide the context of MPI/OpenMP usage.
  • USR: User functions/regions that do not appear on a call path to an OpenMP construct, or an OpenMP or MPI API function.
  • SCOREP: This group aggregates activities within the measurement system.
info

There are more function groups available, e.g. CUDA, OPENACC, MEMORY, IO, LIB, etc. For more details, consult the documentation here.

As we can see from the scoring output, the USR group contributes most of the trace memory requirements. To figure out which routines are responsible, we need a breakdown by function, which we get by running the following command:

$ scorep-score -r scorep_bt-mz_sum/profile.cubex

The output will look like this:

Estimated aggregate size of event trace:                   40GB
Estimated requirements for largest trace buffer (max_buf): 6GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 6GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid
intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the
maximum supported memory or reduce requirements using USR regions filters.)

flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 5,398,866,521 1,638,135,715 2570.45 100.0 1.57 ALL
USR 5,358,738,138 1,631,138,913 161.17 6.3 0.10 USR
OMP 39,174,822 6,781,952 2093.27 81.4 308.65 OMP
COM 665,210 182,120 302.72 11.8 1662.21 COM
MPI 288,310 32,722 13.28 0.5 405.86 MPI
SCOREP 41 8 0.00 0.0 43.30 SCOREP
USR 1,716,505,830 522,844,416 70.49 2.7 0.13 binvcrhs
USR 1,716,505,830 522,844,416 54.37 2.1 0.10 matmul_sub
USR 1,716,505,830 522,844,416 30.05 1.2 0.06 matvec_sub
USR 76,195,080 22,692,096 2.97 0.1 0.13 lhsinit
USR 76,195,080 22,692,096 2.00 0.1 0.09 binvrhs
USR 56,825,184 17,219,840 1.28 0.0 0.07 exact_solution
OMP 3,147,660 257,280 0.03 0.0 0.13 !$omp parallel @exch_qbc.f:204
OMP 3,147,660 257,280 0.04 0.0 0.14 !$omp parallel @exch_qbc.f:215
OMP 3,147,660 257,280 0.04 0.0 0.14 !$omp parallel @exch_qbc.f:244
OMP 3,147,660 257,280 0.04 0.0 0.15 !$omp parallel @exch_qbc.f:255
OMP 1,581,660 129,280 0.14 0.0 1.08 !$omp parallel @rhs.f:28
OMP 1,573,830 128,640 0.02 0.0 0.17 !$omp parallel @add.f:22
OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @z_solve.f:43
OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @y_solve.f:43
OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @x_solve.f:46
OMP 940,680 257,280 0.13 0.0 0.49 !$omp do @exch_qbc.f:204
OMP 940,680 257,280 17.03 0.7 66.20 !$omp implicit barrier @exch_qbc.f:213
OMP 940,680 257,280 0.14 0.0 0.56 !$omp do @exch_qbc.f:215
OMP 940,680 257,280 17.09 0.7 66.41 !$omp implicit barrier @exch_qbc.f:224
OMP 940,680 257,280 0.18 0.0 0.70 !$omp do @exch_qbc.f:244
OMP 940,680 257,280 17.42 0.7 67.73 !$omp implicit barrier @exch_qbc.f:253
OMP 940,680 257,280 0.18 0.0 0.70 !$omp do @exch_qbc.f:255
OMP 940,680 257,280 17.33 0.7 67.34 !$omp implicit barrier @exch_qbc.f:264
OMP 472,680 129,280 10.97 0.4 84.86 !$omp implicit barrier @rhs.f:439
OMP 472,680 129,280 1.92 0.1 14.86 !$omp do @rhs.f:37
OMP 472,680 129,280 1.61 0.1 12.46 !$omp do @rhs.f:62
OMP 472,680 129,280 71.47 2.8 552.80 !$omp implicit barrier @rhs.f:72
OMP 472,680 129,280 4.55 0.2 35.20 !$omp do @rhs.f:80
OMP 472,680 129,280 4.31 0.2 33.37 !$omp do @rhs.f:191
OMP 472,680 129,280 2.90 0.1 22.43 !$omp do @rhs.f:301
OMP 472,680 129,280 77.15 3.0 596.76 !$omp implicit barrier @rhs.f:353
OMP 472,680 129,280 0.11 0.0 0.85 !$omp do @rhs.f:359
OMP 472,680 129,280 0.11 0.0 0.82 !$omp do @rhs.f:372
OMP 472,680 129,280 1.25 0.0 9.66 !$omp do @rhs.f:384
OMP 472,680 129,280 0.13 0.0 0.97 !$omp do @rhs.f:400
OMP 472,680 129,280 0.11 0.0 0.87 !$omp do @rhs.f:413
OMP 472,680 129,280 18.53 0.7 143.36 !$omp implicit barrier @rhs.f:423
OMP 472,680 129,280 0.49 0.0 3.81 !$omp do @rhs.f:428
OMP 470,340 128,640 0.52 0.0 4.06 !$omp do @add.f:22
OMP 470,340 128,640 11.15 0.4 86.69 !$omp implicit barrier @add.f:33
OMP 470,340 128,640 520.36 20.2 4045.09 !$omp implicit barrier @z_solve.f:428
OMP 470,340 128,640 54.58 2.1 424.27 !$omp do @z_solve.f:52
OMP 470,340 128,640 570.18 22.2 4432.34 !$omp implicit barrier @y_solve.f:406
OMP 470,340 128,640 54.71 2.1 425.33 !$omp do @y_solve.f:52
OMP 470,340 128,640 549.55 21.4 4272.04 !$omp implicit barrier @x_solve.f:407
OMP 470,340 128,640 52.68 2.0 409.55 !$omp do @x_solve.f:54
COM 188,136 51,456 6.79 0.3 132.04 copy_x_face
COM 188,136 51,456 6.67 0.3 129.69 copy_y_face
MPI 125,223 10,854 0.03 0.0 3.06 MPI_Irecv
MPI 125,223 10,854 0.19 0.0 17.89 MPI_Isend
COM 47,268 12,928 4.80 0.2 371.56 compute_rhs
OMP 47,268 12,928 0.00 0.0 0.29 !$omp master @rhs.f:74
OMP 47,268 12,928 0.00 0.0 0.18 !$omp master @rhs.f:183
OMP 47,268 12,928 0.00 0.0 0.15 !$omp master @rhs.f:293
OMP 47,268 12,928 0.00 0.0 0.17 !$omp master @rhs.f:424
COM 47,034 12,864 0.01 0.0 0.85 adi
COM 47,034 12,864 90.92 3.5 7067.81 x_solve
COM 47,034 12,864 93.53 3.6 7270.87 y_solve
COM 47,034 12,864 97.01 3.8 7541.13 z_solve
COM 47,034 12,864 2.08 0.1 161.46 add
MPI 36,582 10,854 10.30 0.4 948.63 MPI_Waitall
OMP 15,660 1,280 0.00 0.0 0.79 !$omp parallel @initialize.f:28
OMP 11,700 3,200 0.00 0.0 0.07 !$omp atomic @error.f:51
OMP 11,700 3,200 0.00 0.0 0.07 !$omp atomic @error.f:104
OMP 7,830 640 0.00 0.0 0.51 !$omp parallel @exact_rhs.f:21
OMP 7,830 640 0.00 0.0 0.74 !$omp parallel @error.f:27
OMP 7,830 640 0.00 0.0 0.57 !$omp parallel @error.f:86
COM 5,226 1,608 0.02 0.0 15.22 exch_qbc
OMP 4,680 1,280 0.22 0.0 175.51 !$omp implicit barrier @initialize.f:204
OMP 4,680 1,280 0.02 0.0 17.03 !$omp do @initialize.f:31
OMP 4,680 1,280 0.92 0.0 719.67 !$omp do @initialize.f:50
OMP 4,680 1,280 0.00 0.0 2.76 !$omp do @initialize.f:100
OMP 4,680 1,280 0.00 0.0 2.79 !$omp do @initialize.f:119
OMP 4,680 1,280 0.01 0.0 3.99 !$omp do @initialize.f:137
OMP 4,680 1,280 0.01 0.0 4.08 !$omp do @initialize.f:156
OMP 4,680 1,280 7.97 0.3 6227.90 !$omp implicit barrier @initialize.f:167
OMP 4,680 1,280 0.01 0.0 6.11 !$omp do @initialize.f:174
OMP 4,680 1,280 0.01 0.0 6.02 !$omp do @initialize.f:192
USR 4,550 1,400 0.00 0.0 0.06 get_comm_index
OMP 2,340 640 0.05 0.0 73.19 !$omp implicit barrier @exact_rhs.f:357
OMP 2,340 640 0.01 0.0 22.36 !$omp do @exact_rhs.f:31
OMP 2,340 640 1.10 0.0 1711.99 !$omp implicit barrier @exact_rhs.f:41
OMP 2,340 640 0.08 0.0 119.53 !$omp do @exact_rhs.f:46
OMP 2,340 640 0.07 0.0 111.99 !$omp do @exact_rhs.f:147
OMP 2,340 640 1.92 0.1 2998.85 !$omp implicit barrier @exact_rhs.f:242
OMP 2,340 640 0.08 0.0 118.56 !$omp do @exact_rhs.f:247
OMP 2,340 640 0.67 0.0 1049.28 !$omp implicit barrier @exact_rhs.f:341
OMP 2,340 640 0.00 0.0 1.85 !$omp do @exact_rhs.f:346
OMP 2,340 640 0.72 0.0 1123.30 !$omp implicit barrier @error.f:54
OMP 2,340 640 0.06 0.0 101.38 !$omp do @error.f:33
OMP 2,340 640 0.05 0.0 78.05 !$omp implicit barrier @error.f:107
OMP 2,340 640 0.00 0.0 2.06 !$omp do @error.f:91
MPI 612 72 0.00 0.0 32.94 MPI_Bcast
USR 572 176 0.00 0.0 0.10 timer_clear
COM 468 128 0.73 0.0 5685.34 initialize
COM 234 64 0.01 0.0 137.02 exact_rhs
COM 234 64 0.01 0.0 151.69 rhs_norm
COM 234 64 0.13 0.0 1977.62 error_norm
MPI 204 24 0.01 0.0 386.08 MPI_Reduce
MPI 136 16 0.10 0.0 6541.62 MPI_Barrier
MPI 84 8 0.00 0.0 214.85 MPI_Comm_split
MPI 84 8 0.00 0.0 36.30 MPI_Finalize
MPI 84 8 2.64 0.1 329806.37 MPI_Init_thread
MPI 52 16 0.00 0.0 0.54 MPI_Comm_rank
SCOREP 41 8 0.00 0.0 43.30 bt-mz_B.8
COM 26 8 0.00 0.0 417.57 bt
USR 26 8 0.00 0.0 0.34 set_constants
USR 26 8 0.00 0.0 3.65 zone_starts
USR 26 8 0.00 0.0 1.13 zone_setup
COM 26 8 0.00 0.0 44.15 verify
USR 26 8 0.00 0.0 27.10 map_zones
COM 26 8 0.00 0.0 17.88 env_setup
COM 26 8 0.00 0.0 28.33 mpi_setup
USR 26 1 0.00 0.0 63.97 print_results
USR 26 8 0.00 0.0 0.28 timer_read
USR 26 8 0.00 0.0 13.27 timer_stop
USR 26 8 0.00 0.0 12.07 timer_start
MPI 26 8 0.00 0.0 2.83 MPI_Comm_size

The detailed breakdown by region below the summary provides a classification according to these function groups (column type) for each region found in the summary report. Investigation of this part of the score report reveals that most of the trace data would be generated by the three routines binvcrhs, matmul_sub, and matvec_sub, each called roughly 523 million times and classified as USR. Although the percentage of time spent in these routines suggests at first glance that they are important, the average time per visit is at most 130 nanoseconds (column time/visit). That is, the relative measurement overhead for these functions is substantial, and a significant fraction of the reported time is very likely spent in the Score-P measurement system rather than in the application itself. These routines therefore constitute good candidates for filtering (just as they are good candidates for inlining by the compiler). We additionally select the lhsinit, binvrhs, and exact_solution routines, which generate about 208MB of event data on a single rank while having very little runtime impact.

Score-P allows users to exclude specific routines or files from measurement using a filter file. This file, written in a specific format, specifies what should be included or excluded. In our case, we list the function names between the keywords SCOREP_REGION_NAMES_BEGIN and SCOREP_REGION_NAMES_END, with the keyword EXCLUDE indicating that these functions are to be excluded from the measurement. A typical Score-P filter file looks like this:

SCOREP_REGION_NAMES_BEGIN
EXCLUDE
binvcrhs
matmul_sub
matvec_sub
lhsinit
binvrhs
exact_solution
SCOREP_REGION_NAMES_END

We have prepared a filter file, which you can find at NPB3.3-MZ-MPI/config/scorep.filt. You may notice some differences from the example above, such as the use of asterisks (*) as wildcards, because some Fortran compilers decorate function names with additional underscores. We have also excluded the timer functions from the measurement.
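For illustration, a wildcard-based filter equivalent to the example above might look like the sketch below. This is only a sketch: the exact patterns and the timer_* entry are assumptions, so check the actual NPB3.3-MZ-MPI/config/scorep.filt file for its precise contents.

SCOREP_REGION_NAMES_BEGIN
EXCLUDE
*binvcrhs*
*matmul_sub*
*matvec_sub*
*lhsinit*
*binvrhs*
*exact_solution*
*timer_*
SCOREP_REGION_NAMES_END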

info

Filtering is safe to use: it does not prevent any of the listed routines from being executed. They are simply not recorded in the measurement, so they will not appear in the profile/trace explorer.

info

Please refer to the Score-P manual here for a detailed description of the filter file format, how to filter based on file names, define (and combine) blacklists and whitelists, and how to use wildcards for convenience.
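For example, filtering by file names uses a separate block with EXCLUDE and INCLUDE rules that are evaluated in order. The sketch below only illustrates the idea; the file paths are made up, and the manual remains the authoritative reference for the exact semantics:

SCOREP_FILE_NAMES_BEGIN
EXCLUDE */generated/*
INCLUDE */src/solver.f
SCOREP_FILE_NAMES_END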

The effectiveness of this filter can be examined by scoring the initial summary report again, this time specifying the filter file using the -f option of the command:

$ scorep-score -r -f ../config/scorep.filt scorep_bt-mz_sum/profile.cubex

This way, a filter file can be developed incrementally, avoiding the need to conduct many measurements just to investigate the effect of filtering individual functions step by step.

The output of the aforementioned command will look like this:

Estimated aggregate size of event trace:                   273MB
Estimated requirements for largest trace buffer (max_buf): 39MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 59MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=59MB to avoid intermediate flushes
or reduce requirements using USR regions filters.)

flt type max_buf[B] visits time[s] time[%] time/visit[us] region
- ALL 5,398,866,521 1,638,135,715 2570.45 100.0 1.57 ALL
- USR 5,358,738,138 1,631,138,913 161.17 6.3 0.10 USR
- OMP 39,174,822 6,781,952 2093.27 81.4 308.65 OMP
- COM 665,210 182,120 302.72 11.8 1662.21 COM
- MPI 288,310 32,722 13.28 0.5 405.86 MPI
- SCOREP 41 8 0.00 0.0 43.30 SCOREP

* ALL 40,133,037 6,998,235 2409.28 93.7 344.27 ALL-FLT
+ FLT 5,358,733,484 1,631,137,480 161.17 6.3 0.10 FLT
- OMP 39,174,822 6,781,952 2093.27 81.4 308.65 OMP-FLT
* COM 665,210 182,120 302.72 11.8 1662.21 COM-FLT
- MPI 288,310 32,722 13.28 0.5 405.86 MPI-FLT
* USR 4,680 1,433 0.00 0.0 0.29 USR-FLT
- SCOREP 41 8 0.00 0.0 43.30 SCOREP-FLT

+ USR 1,716,505,830 522,844,416 70.49 2.7 0.13 binvcrhs
+ USR 1,716,505,830 522,844,416 54.37 2.1 0.10 matmul_sub
+ USR 1,716,505,830 522,844,416 30.05 1.2 0.06 matvec_sub
+ USR 76,195,080 22,692,096 2.97 0.1 0.13 lhsinit
+ USR 76,195,080 22,692,096 2.00 0.1 0.09 binvrhs
+ USR 56,825,184 17,219,840 1.28 0.0 0.07 exact_solution
- OMP 3,147,660 257,280 0.03 0.0 0.13 !$omp parallel @exch_qbc.f:204
- OMP 3,147,660 257,280 0.04 0.0 0.14 !$omp parallel @exch_qbc.f:215
- OMP 3,147,660 257,280 0.04 0.0 0.14 !$omp parallel @exch_qbc.f:244
- OMP 3,147,660 257,280 0.04 0.0 0.15 !$omp parallel @exch_qbc.f:255
- OMP 1,581,660 129,280 0.14 0.0 1.08 !$omp parallel @rhs.f:28
- OMP 1,573,830 128,640 0.02 0.0 0.17 !$omp parallel @add.f:22
- OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @z_solve.f:43
- OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @y_solve.f:43
- OMP 1,573,830 128,640 0.04 0.0 0.32 !$omp parallel @x_solve.f:46
- OMP 940,680 257,280 0.13 0.0 0.49 !$omp do @exch_qbc.f:204
- OMP 940,680 257,280 17.03 0.7 66.20 !$omp implicit barrier @exch_qbc.f:213
- OMP 940,680 257,280 0.14 0.0 0.56 !$omp do @exch_qbc.f:215
- OMP 940,680 257,280 17.09 0.7 66.41 !$omp implicit barrier @exch_qbc.f:224
- OMP 940,680 257,280 0.18 0.0 0.70 !$omp do @exch_qbc.f:244
- OMP 940,680 257,280 17.42 0.7 67.73 !$omp implicit barrier @exch_qbc.f:253
- OMP 940,680 257,280 0.18 0.0 0.70 !$omp do @exch_qbc.f:255
- OMP 940,680 257,280 17.33 0.7 67.34 !$omp implicit barrier @exch_qbc.f:264
- OMP 472,680 129,280 10.97 0.4 84.86 !$omp implicit barrier @rhs.f:439
- OMP 472,680 129,280 1.92 0.1 14.86 !$omp do @rhs.f:37
- OMP 472,680 129,280 1.61 0.1 12.46 !$omp do @rhs.f:62
- OMP 472,680 129,280 71.47 2.8 552.80 !$omp implicit barrier @rhs.f:72
- OMP 472,680 129,280 4.55 0.2 35.20 !$omp do @rhs.f:80
- OMP 472,680 129,280 4.31 0.2 33.37 !$omp do @rhs.f:191
- OMP 472,680 129,280 2.90 0.1 22.43 !$omp do @rhs.f:301
- OMP 472,680 129,280 77.15 3.0 596.76 !$omp implicit barrier @rhs.f:353
- OMP 472,680 129,280 0.11 0.0 0.85 !$omp do @rhs.f:359
- OMP 472,680 129,280 0.11 0.0 0.82 !$omp do @rhs.f:372
- OMP 472,680 129,280 1.25 0.0 9.66 !$omp do @rhs.f:384
- OMP 472,680 129,280 0.13 0.0 0.97 !$omp do @rhs.f:400
- OMP 472,680 129,280 0.11 0.0 0.87 !$omp do @rhs.f:413
- OMP 472,680 129,280 18.53 0.7 143.36 !$omp implicit barrier @rhs.f:423
- OMP 472,680 129,280 0.49 0.0 3.81 !$omp do @rhs.f:428
- OMP 470,340 128,640 0.52 0.0 4.06 !$omp do @add.f:22
- OMP 470,340 128,640 11.15 0.4 86.69 !$omp implicit barrier @add.f:33
- OMP 470,340 128,640 520.36 20.2 4045.09 !$omp implicit barrier @z_solve.f:428
- OMP 470,340 128,640 54.58 2.1 424.27 !$omp do @z_solve.f:52
- OMP 470,340 128,640 570.18 22.2 4432.34 !$omp implicit barrier @y_solve.f:406
- OMP 470,340 128,640 54.71 2.1 425.33 !$omp do @y_solve.f:52
- OMP 470,340 128,640 549.55 21.4 4272.04 !$omp implicit barrier @x_solve.f:407
- OMP 470,340 128,640 52.68 2.0 409.55 !$omp do @x_solve.f:54
- COM 188,136 51,456 6.79 0.3 132.04 copy_x_face
- COM 188,136 51,456 6.67 0.3 129.69 copy_y_face
- MPI 125,223 10,854 0.03 0.0 3.06 MPI_Irecv
- MPI 125,223 10,854 0.19 0.0 17.89 MPI_Isend
- COM 47,268 12,928 4.80 0.2 371.56 compute_rhs
- OMP 47,268 12,928 0.00 0.0 0.29 !$omp master @rhs.f:74
- OMP 47,268 12,928 0.00 0.0 0.18 !$omp master @rhs.f:183
- OMP 47,268 12,928 0.00 0.0 0.15 !$omp master @rhs.f:293
- OMP 47,268 12,928 0.00 0.0 0.17 !$omp master @rhs.f:424
- COM 47,034 12,864 0.01 0.0 0.85 adi
- COM 47,034 12,864 90.92 3.5 7067.81 x_solve
- COM 47,034 12,864 93.53 3.6 7270.87 y_solve
- COM 47,034 12,864 97.01 3.8 7541.13 z_solve
- COM 47,034 12,864 2.08 0.1 161.46 add
- MPI 36,582 10,854 10.30 0.4 948.63 MPI_Waitall
- OMP 15,660 1,280 0.00 0.0 0.79 !$omp parallel @initialize.f:28
- OMP 11,700 3,200 0.00 0.0 0.07 !$omp atomic @error.f:51
- OMP 11,700 3,200 0.00 0.0 0.07 !$omp atomic @error.f:104
- OMP 7,830 640 0.00 0.0 0.51 !$omp parallel @exact_rhs.f:21
- OMP 7,830 640 0.00 0.0 0.74 !$omp parallel @error.f:27
- OMP 7,830 640 0.00 0.0 0.57 !$omp parallel @error.f:86
- COM 5,226 1,608 0.02 0.0 15.22 exch_qbc
- OMP 4,680 1,280 0.22 0.0 175.51 !$omp implicit barrier @initialize.f:204
- OMP 4,680 1,280 0.02 0.0 17.03 !$omp do @initialize.f:31
- OMP 4,680 1,280 0.92 0.0 719.67 !$omp do @initialize.f:50
- OMP 4,680 1,280 0.00 0.0 2.76 !$omp do @initialize.f:100
- OMP 4,680 1,280 0.00 0.0 2.79 !$omp do @initialize.f:119
- OMP 4,680 1,280 0.01 0.0 3.99 !$omp do @initialize.f:137
- OMP 4,680 1,280 0.01 0.0 4.08 !$omp do @initialize.f:156
- OMP 4,680 1,280 7.97 0.3 6227.90 !$omp implicit barrier @initialize.f:167
- OMP 4,680 1,280 0.01 0.0 6.11 !$omp do @initialize.f:174
- OMP 4,680 1,280 0.01 0.0 6.02 !$omp do @initialize.f:192
- USR 4,550 1,400 0.00 0.0 0.06 get_comm_index
- OMP 2,340 640 0.05 0.0 73.19 !$omp implicit barrier @exact_rhs.f:357
- OMP 2,340 640 0.01 0.0 22.36 !$omp do @exact_rhs.f:31
- OMP 2,340 640 1.10 0.0 1711.99 !$omp implicit barrier @exact_rhs.f:41
- OMP 2,340 640 0.08 0.0 119.53 !$omp do @exact_rhs.f:46
- OMP 2,340 640 0.07 0.0 111.99 !$omp do @exact_rhs.f:147
- OMP 2,340 640 1.92 0.1 2998.85 !$omp implicit barrier @exact_rhs.f:242
- OMP 2,340 640 0.08 0.0 118.56 !$omp do @exact_rhs.f:247
- OMP 2,340 640 0.67 0.0 1049.28 !$omp implicit barrier @exact_rhs.f:341
- OMP 2,340 640 0.00 0.0 1.85 !$omp do @exact_rhs.f:346
- OMP 2,340 640 0.72 0.0 1123.30 !$omp implicit barrier @error.f:54
- OMP 2,340 640 0.06 0.0 101.38 !$omp do @error.f:33
- OMP 2,340 640 0.05 0.0 78.05 !$omp implicit barrier @error.f:107
- OMP 2,340 640 0.00 0.0 2.06 !$omp do @error.f:91
- MPI 612 72 0.00 0.0 32.94 MPI_Bcast
+ USR 572 176 0.00 0.0 0.10 timer_clear
- COM 468 128 0.73 0.0 5685.34 initialize
- COM 234 64 0.01 0.0 137.02 exact_rhs
- COM 234 64 0.01 0.0 151.69 rhs_norm
- COM 234 64 0.13 0.0 1977.62 error_norm
- MPI 204 24 0.01 0.0 386.08 MPI_Reduce
- MPI 136 16 0.10 0.0 6541.62 MPI_Barrier
- MPI 84 8 0.00 0.0 214.85 MPI_Comm_split
- MPI 84 8 0.00 0.0 36.30 MPI_Finalize
- MPI 84 8 2.64 0.1 329806.37 MPI_Init_thread
- MPI 52 16 0.00 0.0 0.54 MPI_Comm_rank
- SCOREP 41 8 0.00 0.0 43.30 bt-mz_B.8
- COM 26 8 0.00 0.0 417.57 bt
- USR 26 8 0.00 0.0 0.34 set_constants
- USR 26 8 0.00 0.0 3.65 zone_starts
- USR 26 8 0.00 0.0 1.13 zone_setup
- COM 26 8 0.00 0.0 44.15 verify
- USR 26 8 0.00 0.0 27.10 map_zones
- COM 26 8 0.00 0.0 17.88 env_setup
- COM 26 8 0.00 0.0 28.33 mpi_setup
- USR 26 1 0.00 0.0 63.97 print_results
+ USR 26 8 0.00 0.0 0.28 timer_read
+ USR 26 8 0.00 0.0 13.27 timer_stop
+ USR 26 8 0.00 0.0 12.07 timer_start
- MPI 26 8 0.00 0.0 2.83 MPI_Comm_size

Below the (original) function group summary, the score report now also includes a second summary with the filter applied. Here, an additional group FLT is added, which subsumes all filtered regions. Moreover, the column flt indicates whether a region/function group is filtered (+), not filtered (-), or possibly partially filtered (*, only used for function groups).

As expected, the estimate for the aggregate event trace size drops to 273MB, and the process-local maximum across all ranks is reduced to 39MB. Since the Score-P measurement system also creates a number of internal data structures (e.g., to track MPI requests and communicators), the suggested setting for the SCOREP_TOTAL_MEMORY environment variable, which limits the amount of memory used by Score-P's memory management, is 59MB when tracing is configured.
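If you later enable tracing with this filter, the hint from the score output can be applied directly, for instance in the batch script. This is a minimal sketch using the values reported above:

export SCOREP_ENABLE_TRACING=true
export SCOREP_TOTAL_MEMORY=59M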

info

With the -g option, scorep-score can create an initial filter file in Score-P format. See more details here.
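For example, the following invocation asks scorep-score to generate such an initial filter from our summary report (the generated file is typically written to the current directory as initial_scorep.filter, though the exact name may depend on your Score-P version):

$ scorep-score -g scorep_bt-mz_sum/profile.cubex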

Let's modify our batch script scorep.pbs to enable filtering (see highlighted lines):

#!/bin/bash
# submit from ./bin subdirectory with "qsub scorep.pbs"
#
#PBS -N mzmpibt
#PBS -l select=2:node_type=skl:mem=10gb:mpiprocs=4:ncpus=20
#PBS -l place=scatter
#PBS -q smp
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR

# Benchmark configuration
export NPB_MZ_BLOAD=0
export OMP_NUM_THREADS=10
CLASS=B
NPROCS=8
EXE=./bt-mz_$CLASS.$NPROCS

module load score-p cube scalasca

# Score-P measurement configuration
export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_sum_filt
export SCOREP_FILTERING_FILE=../config/scorep.filt
#export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC
#export SCOREP_TOTAL_MEMORY=90M
#export SCOREP_ENABLE_TRACING=true

# Run the application
mpirun --report-bindings $EXE

In the first highlighted line, we added the suffix _filt so that the measurement is stored in a directory with a different name. In the second one, we provided the name of the filter file to be used during the measurement.

info

If you do not specify the SCOREP_EXPERIMENT_DIRECTORY variable, the experiment directory is named in the format scorep-YYYYMMDD_HHMM_XXXXXXXX, where YYYYMMDD and HHMM represent the date and time, followed by a random number.

If a directory with the specified name already exists, it will be renamed with a date suffix by default. To prevent this and abort the measurement instead, set SCOREP_OVERWRITE_EXPERIMENT_DIRECTORY to false. This setting is effective only if SCOREP_EXPERIMENT_DIRECTORY is set.
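A minimal sketch combining both variables in a job script, assuming you prefer the run to abort rather than have an existing directory renamed:

export SCOREP_EXPERIMENT_DIRECTORY=scorep_bt-mz_sum_filt
export SCOREP_OVERWRITE_EXPERIMENT_DIRECTORY=false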

Now we are ready to submit our batch script with filtering enabled:

$ qsub scorep.pbs
Question

Open the freshly generated stdout file and find the metric "Time in seconds". Compare it to our baseline measurement here and our original instrumented run here. Has it increased or decreased? If so, by how much? Which routines, in your opinion, are safe to filter?