Submission 2025-05-01

Execution Throughput and Latency

This section microbenchmarks the execution throughput and latency of FP32 Neon instructions.

1. Microbenchmark the execution throughput of the following instructions:

[Figure: neon_1_1.png]

FMLA (vector) with arrangement specifier 4S

  • File: submissions/submission_25_05_01/neon_1_1.s

  • Driver: submissions/submission_25_05_01/neon_1_1_driver.cpp

  • Compilation: g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s

  • We measure \(13.2304 \cdot 10^{10}\) instructions per second, i.e. \(13.2304 \cdot 10^{10} / 8 = 16.538 \cdot 10^9\) instructions per ALU per second. Given the 4.4 GHz clock speed that benchmarks report for the M4 performance cores, this corresponds to a throughput of \(\approx 4\) instructions per cycle (worked out below).
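
The \(\approx 4\) instructions per cycle follow from dividing the measured rate by the assumed 8 ALUs and the 4.4 GHz clock:

\[
\frac{13.2304 \cdot 10^{10}\ \text{instr/s}}{8 \cdot 4.4 \cdot 10^{9}\ \text{cycles/s}} \approx 3.76 \approx 4 .
\]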

FMLA (vector) with arrangement specifier 2S

  • File: submissions/submission_25_05_01/neon_1_1.s

  • Driver: submissions/submission_25_05_01/neon_1_1_driver.cpp

  • Compilation: g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s

  • We measure \(6.65221 \cdot 10^{10}\) instructions per second, i.e. \(6.65221 \cdot 10^{10} / 8 = 8.31526 \cdot 10^9\) instructions per ALU per second. Given the 4.4 GHz clock speed of the M4 performance cores, this corresponds to a throughput of \(\approx 2\) instructions per cycle.

FMADD (scalar), single-precision variant

  • File: submissions/submission_25_05_01/neon_1_1.s

  • Driver: submissions/submission_25_05_01/neon_1_1_driver.cpp

  • Compilation: g++ -o neon_1_1.exe neon_1_1_driver.cpp neon_1_1.s

  • We measure \(1.12728 \cdot 10^{10}\) instructions per second, i.e. \(1.12728 \cdot 10^{10} / 8 = 1.4091 \cdot 10^9\) instructions per ALU per second. Given the 4.4 GHz clock speed of the M4 performance cores, this corresponds to a throughput of \(\approx 1/3\) instructions per cycle. All three measurements use the same driver; a sketch of its structure is shown below.
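
The actual driver is in neon_1_1_driver.cpp and is not reproduced here. The sketch below only illustrates how such a throughput measurement could be structured; the kernel name fmla_4s_throughput_kernel and the per-call instruction count are assumptions, not code from the submission.

    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    // Assembly routine assumed to execute a fixed number of independent
    // FMLA instructions per call (name and unroll factor are assumptions).
    extern "C" void fmla_4s_throughput_kernel();

    int main() {
      constexpr std::uint64_t n_calls = 100'000'000;
      constexpr std::uint64_t n_instr_per_call = 100; // assumed unroll factor in the .s file

      auto t0 = std::chrono::steady_clock::now();
      for (std::uint64_t i = 0; i < n_calls; ++i) {
        fmla_4s_throughput_kernel();
      }
      auto t1 = std::chrono::steady_clock::now();

      double seconds = std::chrono::duration<double>(t1 - t0).count();
      double instr_per_second = static_cast<double>(n_calls * n_instr_per_call) / seconds;
      std::printf("instructions per second: %.4e\n", instr_per_second);
      return 0;
    }

Dividing the printed rate by the clock frequency (and, as above, by the number of ALUs) yields the per-cycle throughput.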

2. Microbenchmark the execution latency of FMLA (vector) with arrangement specifier 4S. Consider the following two cases:

[Figure: neon_1_2.png]

Dependency on one of the source registers

  • File: submissions/submission_25_05_01/neon_1_2.s

  • Driver: submissions/submission_25_05_01/neon_1_2_driver.cpp

  • Compilation: g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s

  • We measure \(11.4961 \cdot 10^9\) instructions per second on a single ALU. At the known clock speed of 4.4 GHz this corresponds to a latency of \(\approx 1/3\) of a cycle per instruction.

Dependency on the destination register only

  • File: submissions/submission_25_05_01/neon_1_2.s

  • Driver: submissions/submission_25_05_01/neon_1_2_driver.cpp

  • Compilation: g++ -o neon_1_2.exe neon_1_2_driver.cpp neon_1_2.s

  • We measure \(11.7019 \cdot 10^9\) instructions per second on a single ALU. At the known clock speed of 4.4 GHz this likewise corresponds to a latency of \(\approx 1/3\) of a cycle per instruction (the division is written out below).
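
In both cases the latency estimate follows from dividing the clock rate by the measured rate of the dependent instruction chain, e.g. for the first case:

\[
\frac{4.4 \cdot 10^{9}\ \text{cycles/s}}{11.4961 \cdot 10^{9}\ \text{instr/s}} \approx 0.38 \approx \tfrac{1}{3}\ \text{cycles per instruction}.
\]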

Microkernel

1. Implement a Neon microkernel that computes C+=AB for M=16, N=6, and K=1.

  • File submissions/submission_25_05_01/neon_2_simple.s

  • Driver submissions/submission_25_05_01/neon_2_driver.cpp

The implementation loops over the columns of the matrix C being computed.

    ...
    // Offset the used leading dimension by the size of floats (4 byte == shift left by 2)
    lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
    lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
    lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

    // Load all data from the 16x1 matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0]

    // Init the loop counter
    mov x6, #6
process_next_column:
    // Iteration -= 1
    subs x6, x6, #1

    // Load next element from the 1x6 matrix b
    // ldr s4, [x1], #4 // one-liner, but it ignores the leading-dimension argument
    ldr s4, [x1]
    add x1, x1, x4

    // Load next column from the 16x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

    // Calculate the next column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]

    // Store the result back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

    // Compare and branch on not-zero
    cbnz x6, process_next_column
    ...
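
The kernel is called from the C++ driver; based on the register comments (x0 = A, x1 = B, x2 = C, x3-x5 = leading dimensions in elements, scaled to bytes by the lsl #2 above), its signature presumably looks like the sketch below. The name and parameter types are assumptions, not taken from the submission headers.

    #include <cstdint>

    // Assumed C-linkage signature of the 16x6x1 microkernel.
    extern "C" void matmul_16_6_1(float const* a,    // x0: 16x1 block of A
                                  float const* b,    // x1: 1x6 block of B
                                  float*       c,    // x2: 16x6 block of C
                                  std::int64_t lda,  // x3: leading dimension of A in elements
                                  std::int64_t ldb,  // x4: leading dimension of B in elements
                                  std::int64_t ldc); // x5: leading dimension of C in elements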

2. Test and optimize your microkernel. Report its performance in GFLOPS.

  • Files
    • submissions/submission_25_05_01/neon_2.h

    • submissions/submission_25_05_01/neon_2_unrolled.s

  • Tests submissions/submission_25_05_01/neon_2.test.cpp

  • Benchmarks submissions/submission_25_05_01/neon_2.bench.cpp

Optimization

To optimize the kernel we unrolled the loop body into three blocks that use different register ranges (v25-v28, v17-v20, v21-v24), reducing the dependencies between the computations of consecutive columns. These three FMLA blocks are wrapped in .rept 2 / .endr, which duplicates them at assembly time, so that all 6 columns are covered.

    ...
    .rept 2
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4
    // Load first column from the 16x6 matrix c
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2]

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]

    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4
    // Load second column from the 16x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2]

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]

    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5

    // Load third element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4
    // Load third column from the 16x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2]

    // Calculate third column of c
    fmla v21.4s, v0.4s, v4.s[0]
    fmla v22.4s, v1.4s, v4.s[0]
    fmla v23.4s, v2.4s, v4.s[0]
    fmla v24.4s, v3.4s, v4.s[0]

    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    .endr
    ...

Benchmarks

We run the benchmark with the following command:

./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

We therefore run 10 repetitions of each benchmark, each performing roughly 120,000,000 iterations of our matmul kernels.
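
The benchmark file itself is not reproduced here; the sketch below only shows how a Google Benchmark routine can report FLOPS as a rate counter, which is how the FLOPS column in the table is produced. The kernel signature and the plain (fixture-less) registration are simplifications and assumptions, not the actual contents of neon_2.bench.cpp.

    #include <benchmark/benchmark.h>
    #include <cstdint>

    extern "C" void matmul_16_6_1_simple(float const* a, float const* b, float* c,
                                         std::int64_t lda, std::int64_t ldb, std::int64_t ldc);

    static void BM_matmul_16_6_1_simple(benchmark::State& state) {
      float a[16] = {}, b[6] = {}, c[16 * 6] = {};
      for (auto _ : state) {
        matmul_16_6_1_simple(a, b, c, 16, 1, 16);
        benchmark::DoNotOptimize(c);
      }
      // 2 * M * N * K floating-point operations per call, reported as a rate.
      double flops_per_call = 2.0 * 16 * 6 * 1;
      state.counters["FLOPS"] = benchmark::Counter(
          flops_per_call * static_cast<double>(state.iterations()),
          benchmark::Counter::kIsRate);
    }
    BENCHMARK(BM_matmul_16_6_1_simple)->MinWarmUpTime(1.0);
    BENCHMARK_MAIN();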

----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations      FLOPS
----------------------------------------------------------------------------------------------------------------------------------
Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_mean               5.84 ns         5.82 ns           10 33.0036G/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_median             5.83 ns         5.81 ns           10 33.0317G/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_stddev            0.025 ns        0.025 ns           10 143.339M/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_simple/min_warmup_time:1.000_cv                 0.43 %          0.44 %            10      0.43%
Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_mean             5.71 ns         5.69 ns           10 33.7234G/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_median           5.70 ns         5.68 ns           10 33.7732G/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_stddev          0.038 ns        0.038 ns           10 224.892M/s
Gemm16x6x1Fixture/BM_matmul_16_6_1_unrolled/min_warmup_time:1.000_cv               0.67 %          0.67 %            10      0.67%

The simple first implementation of our matmul kernel reaches about 33.0 GFLOPS. The optimized, unrolled version gains roughly 0.7 GFLOPS, reaching about 33.7 GFLOPS.
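
These figures are consistent with the kernel's operation count: a \(16 \times 6 \times 1\) update performs \(2 \cdot 16 \cdot 6 \cdot 1 = 192\) floating-point operations, so the mean time of the simple kernel corresponds to

\[
\frac{192\ \text{FLOP}}{5.84 \cdot 10^{-9}\ \text{s}} \approx 32.9\ \text{GFLOPS}.
\]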

Loops

1. Loop over K: Implement a kernel that computes C+=AB for M=16, N=6 and K=64.

  • File submissions/submission_25_05_01/neon_3_1.s

    ...
    // Offset the used leading dimension by the size of floats
    lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
    lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
    lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

    mov x6, x1 // Store the initial value of x1, to be restored in the next loop iteration
    mov x7, x2 // Store the initial value of x2, to be restored after the loop

    // Load first column from the 16x6 matrix c
    ld1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Load second column from the 16x6 matrix c
    ld1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Load third column from the 16x6 matrix c
    ld1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Load fourth column from the 16x6 matrix c
    ld1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Load fifth column from the 16x6 matrix c
    ld1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Load sixth column from the 16x6 matrix c
    ld1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5

    mov x9, #64 // x9 iterator for K loop
matmul_loop_over_K:
    sub x9, x9, #1

    // Load next 16x1 column from matrix a
    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x0], x3

    // Run the known matmul_16_6_1_unrolled kernel
    // Load first element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate first column of c
    fmla v25.4s, v0.4s, v4.s[0]
    fmla v26.4s, v1.4s, v4.s[0]
    fmla v27.4s, v2.4s, v4.s[0]
    fmla v28.4s, v3.4s, v4.s[0]

    // Load second element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate second column of c
    fmla v17.4s, v0.4s, v4.s[0]
    fmla v18.4s, v1.4s, v4.s[0]
    fmla v19.4s, v2.4s, v4.s[0]
    fmla v20.4s, v3.4s, v4.s[0]

    // Load third element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate third column of c
    fmla v21.4s, v0.4s, v4.s[0]
    fmla v22.4s, v1.4s, v4.s[0]
    fmla v23.4s, v2.4s, v4.s[0]
    fmla v24.4s, v3.4s, v4.s[0]

    // Load fourth element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate fourth column of c
    fmla v5.4s, v0.4s, v4.s[0]
    fmla v6.4s, v1.4s, v4.s[0]
    fmla v7.4s, v2.4s, v4.s[0]
    fmla v8.4s, v3.4s, v4.s[0]

    // Load fifth element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate fifth column of c
    fmla v9.4s, v0.4s, v4.s[0]
    fmla v10.4s, v1.4s, v4.s[0]
    fmla v11.4s, v2.4s, v4.s[0]
    fmla v12.4s, v3.4s, v4.s[0]

    // Load sixth element from the 1x6 matrix b
    ldr s4, [x1]
    add x1, x1, x4

    // Calculate sixth column of c
    fmla v13.4s, v0.4s, v4.s[0]
    fmla v14.4s, v1.4s, v4.s[0]
    fmla v15.4s, v2.4s, v4.s[0]
    fmla v16.4s, v3.4s, v4.s[0]

    // Offset x6 to the next element in the column
    add x6, x6, #4 // #4 = sizeof(float)

    // Restore x1 to be incremented again
    mov x1, x6

    // Loop back
    cbnz x9, matmul_loop_over_K

    // Restore initial value of x2 that was changed by the loads
    mov x2, x7

    // Store first column back to memory
    st1 {v25.4s, v26.4s, v27.4s, v28.4s}, [x2], x5
    // Store second column back to memory
    st1 {v17.4s, v18.4s, v19.4s, v20.4s}, [x2], x5
    // Store third column back to memory
    st1 {v21.4s, v22.4s, v23.4s, v24.4s}, [x2], x5
    // Store fourth column back to memory
    st1 {v5.4s, v6.4s, v7.4s, v8.4s}, [x2], x5
    // Store fifth column back to memory
    st1 {v9.4s, v10.4s, v11.4s, v12.4s}, [x2], x5
    // Store sixth column back to memory
    st1 {v13.4s, v14.4s, v15.4s, v16.4s}, [x2], x5
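
As a cross-check, the kernel computes the same result as a naive column-major triple loop over C += AB. The reference sketch below is ours, not code from the submission; it uses the same calling convention (leading dimensions in elements) as the assembly.

    #include <cstdint>

    // Naive column-major reference for C += A * B with M = 16, N = 6, K = 64.
    void matmul_16_6_64_ref(float const* a, float const* b, float* c,
                            std::int64_t lda, std::int64_t ldb, std::int64_t ldc) {
      for (std::int64_t n = 0; n < 6; ++n) {
        for (std::int64_t k = 0; k < 64; ++k) {
          for (std::int64_t m = 0; m < 16; ++m) {
            c[m + n * ldc] += a[m + k * lda] * b[k + n * ldb];
          }
        }
      }
    }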

2. Loop over M: Implement a kernel that computes C+=AB for M=64, N=6 and K=64.

  • File submissions/submission_25_05_01/neon_3_2.s

    // Offset the used leading dimension by the size of floats
    lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
    lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
    lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

    mov x6, x1 // Store the initial value of x1, to be restored in the K loop iteration
    mov x7, x2 // Store the initial value of x2, to be restored in the K loop iteration

    mov x8, x0 // Store the initial value of x0, to be restored in the M loop iteration
    mov x9, x1 // Store the initial value of x1, to be restored in the M loop iteration

    mov x16, #4 // x16 iterator for M loop
matmul_loop_over_M:
    sub x16, x16, #1

    // ... <logic of loop over K - neon_3_1>

    // Next M iteration: matrices c and a both need to be offset by 16 values (one 16-row block),
    // and matrix b needs to start at its initial location again
    // Updates for the matrix c
    add x7, x7, #16*4 // column height * sizeof(float)
    mov x2, x7 // also apply offset to x2

    // Updates for the matrix a
    add x8, x8, #16*4 // column height * sizeof(float)
    mov x0, x8 // also apply offset to x0

    // Updates for the matrix b
    mov x6, x9 // Update the restore register for x1 for the K loop
    mov x1, x9 // Update the x1 register itself

    // Loop back to M
    cbnz x16, matmul_loop_over_M
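
Expressed in C, the M loop simply applies the 16x6x64 kernel to four 16-row blocks of A and C, while B restarts from the same position for every block. The sketch below is an illustration; the kernel name and signature are assumptions.

    #include <cstdint>

    // Assumed C-linkage signature of the 16x6x64 kernel (neon_3_1.s).
    extern "C" void matmul_16_6_64(float const* a, float const* b, float* c,
                                   std::int64_t lda, std::int64_t ldb, std::int64_t ldc);

    // C-level equivalent of the M loop: four 16-row blocks, B reused per block.
    void matmul_64_6_64(float const* a, float const* b, float* c,
                        std::int64_t lda, std::int64_t ldb, std::int64_t ldc) {
      for (std::int64_t m = 0; m < 64; m += 16) {
        matmul_16_6_64(a + m, b, c + m, lda, ldb, ldc);
      }
    }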

3. Loop over N: Implement a kernel that computes C+=AB for M=64, N=48 and K=64.

  • File submissions/submission_25_05_01/neon_3_3.s

    // Offset the used leading dimension by the size of floats
    lsl x3, x3, #2 // x3 * 4 = x3 * sizeof(float)
    lsl x4, x4, #2 // x4 * 4 = x4 * sizeof(float)
    lsl x5, x5, #2 // x5 * 4 = x5 * sizeof(float)

    mov x6, x1 // Store the initial value of x1, to be restored in the K loop iteration
    mov x7, x2 // Store the initial value of x2, to be restored in the K loop iteration

    mov x8, x0 // Store the initial value of x0, to be restored in the M loop iteration
    mov x9, x1 // Store the initial value of x1, to be restored in the M loop iteration

    mov x10, x0 // Store the initial value of x0, to be restored in the N loop iteration
    mov x11, x2 // Store the initial value of x2, to be restored in the N loop iteration
    mov x12, #6 // number of columns (N) processed per iteration, needed for the offset calculation

    mov x17, #8 // x17 iterator for N loop
matmul_loop_over_N:
    sub x17, x17, #1

    // ... <logic of loop over M - neon_3_2>

    // Next N iteration: matrices b and c both need to be offset by 6*ldb / 6*ldc values,
    // and matrix a needs to start at its initial location again
    // Update for the matrix a
    mov x8, x10 // Update the restore register for x0 for the M loop
    mov x0, x10 // Update the x0 register itself

    // Updates for the matrix b
    madd x9, x4, x12, x9 // ldb * 6 + initial position
    mov x6, x9 // Update the restore register of x1 for the K loop
    mov x1, x9 // Update the x1 register itself

    // Updates for the matrix c
    madd x11, x5, x12, x11 // ldc * 6 + initial position
    mov x7, x11 // Update the restore register of x2 for the K loop
    mov x2, x11 // Update the x2 register itself

    // Loop back to N
    cbnz x17, matmul_loop_over_N
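
The N loop adds one more level around the M loop: B and C advance by 6 columns (6*ldb and 6*ldc elements) per panel while A restarts at its initial position, matching the madd/mov updates above. Again a sketch with assumed names, in the same style as before:

    #include <cstdint>

    // 64x6x64 kernel (the M loop sketched above); declaration only.
    void matmul_64_6_64(float const* a, float const* b, float* c,
                        std::int64_t lda, std::int64_t ldb, std::int64_t ldc);

    // C-level equivalent of the N loop: eight panels of 6 columns each.
    void matmul_64_48_64(float const* a, float const* b, float* c,
                         std::int64_t lda, std::int64_t ldb, std::int64_t ldc) {
      for (std::int64_t n = 0; n < 48; n += 6) {
        matmul_64_6_64(a, b + n * ldb, c + n * ldc, lda, ldb, ldc);
      }
    }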

4. Test and optimize the kernels. Report your performance in GFLOPS.

  • File submissions/submission_25_05_01/neon_3.h

  • Tests submissions/submission_25_05_01/neon_3.test.cpp

  • Benchmarks submissions/submission_25_05_01/neon_3.bench.cpp

Optimization

We reuse the already optimized matmul_16_6_1 microkernel from task 2 as the inner block of all three loop kernels.

Benchmarks

We run the benchmark with the following command:

./benchmarks --benchmark_counters_tabular=true --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

----------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                             Time             CPU   Iterations      FLOPS
----------------------------------------------------------------------------------------------------------------------------------
GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_mean           97.8 ns         97.4 ns           10  126.12G/s
GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_median         97.7 ns         97.3 ns           10 126.245G/s
GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_stddev        0.581 ns        0.563 ns           10 720.109M/s
GemmMxNxKFixture<16, 6, 64>/BM_matmul_16_6_64/min_warmup_time:1.000_cv             0.59 %          0.58 %            10      0.57%
GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_mean            386 ns          385 ns           10 127.812G/s
GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_median          385 ns          384 ns           10  127.95G/s
GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_stddev         2.16 ns         2.11 ns           10 693.069M/s
GemmMxNxKFixture<64, 6, 64>/BM_matmul_64_6_64/min_warmup_time:1.000_cv             0.56 %          0.55 %            10      0.54%
GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_mean         3103 ns         3092 ns           10 95.3736G/s
GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_median       3097 ns         3087 ns           10 95.5363G/s
GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_stddev       16.0 ns         15.6 ns           10 475.851M/s
GemmMxNxKFixture<64, 48, 64>/BM_matmul_64_48_64/min_warmup_time:1.000_cv           0.52 %          0.50 %            10      0.50%

  • Mean FLOPS for loop over K: 126.1 GFLOPS.

  • Mean FLOPS for loop over M: 127.8 GFLOPS.

  • Mean FLOPS for loop over N: 95.4 GFLOPS.
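
As a sanity check, one \(16 \times 6 \times 64\) call performs \(2 \cdot 16 \cdot 6 \cdot 64 = 12288\) floating-point operations, so the measured mean time of 97.8 ns corresponds to

\[
\frac{12288\ \text{FLOP}}{97.8 \cdot 10^{-9}\ \text{s}} \approx 125.6\ \text{GFLOPS},
\]

in line with the reported 126.1 GFLOPS.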