Submission 2025-05-22¶
Transposition¶
This section develops a kernel that transposes an 8x8 matrix. It applies the identity operation to the elements of a column-major 8x8 matrix A and stores the result in row-major format in matrix B, which is equivalent to computing B := Aᵀ.
1. Implement a Neon kernel that transposes an 8x8 matrix: B:=Aᵀ¶
File: neon_7_1.s
The lecture already covered a 4x4 transpose kernel, so the 8x8 kernel builds on it:
Divide the 8x8 input matrix into four 4x4 sub-matrices: A (top-left), B (top-right), C (bottom-left), and D (bottom-right)
Transpose each 4x4 sub-matrix
Store T(A) and T(D) at their original positions (top-left and bottom-right) in the output matrix
Swap B and C: store T(B) in the bottom-left sub-matrix of the output and T(C) in the top-right sub-matrix of the output
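The four steps above can be checked against a scalar reference. The following is a minimal sketch, not the Neon kernel: it assumes 32-bit floats, a column-major layout with a leading dimension of 8, and separate input and output buffers. The function names `transpose_block_4x4` and `transpose_8x8_blocked` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: column-major storage with leading dimension 8,
// so element (row, col) lives at index col * 8 + row.
constexpr std::size_t kLd = 8;

// Transposes the 4x4 block of `a` whose top-left corner is (row, col)
// and writes it to the mirrored position (col, row) in `b`.
void transpose_block_4x4(const float* a, float* b, std::size_t row, std::size_t col) {
  for (std::size_t j = 0; j < 4; ++j) {
    for (std::size_t i = 0; i < 4; ++i) {
      b[(row + i) * kLd + (col + j)] = a[(col + j) * kLd + (row + i)];
    }
  }
}

// Scalar reference for the blocked 8x8 transpose (out-of-place only).
void transpose_8x8_blocked(const float* a, float* b) {
  assert(a != b);                   // this sketch does not handle A = B
  transpose_block_4x4(a, b, 0, 0);  // top-left stays top-left
  transpose_block_4x4(a, b, 0, 4);  // top-right of input -> bottom-left of output
  transpose_block_4x4(a, b, 4, 0);  // bottom-left of input -> top-right of output
  transpose_block_4x4(a, b, 4, 4);  // bottom-right stays bottom-right
}
```

Unlike the assembly kernel, which defers its stores so that the input and output may be the same matrix, this reference requires distinct buffers. That keeps it simple enough to serve as a test oracle.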

Code:
...
/*
 * Part 1:
 * Load 4x4 sub-matrix A.
 * Transpose 4x4 block.
 * Store transposed 4x4 block of A into B.
 */
// Load four columns (x2 is the byte stride between columns of A)
ldr q0, [x4]
add x4, x4, x2
ldr q1, [x4]
add x4, x4, x2
ldr q2, [x4]
add x4, x4, x2
ldr q3, [x4]

// Transpose: interleave 32-bit lanes of neighboring columns
trn1 v4.4s, v0.4s, v1.4s
trn2 v5.4s, v0.4s, v1.4s
trn1 v6.4s, v2.4s, v3.4s
trn2 v7.4s, v2.4s, v3.4s

// Combine 64-bit halves into complete rows
zip1 v8.2d, v4.2d, v6.2d
zip1 v9.2d, v5.2d, v7.2d
zip2 v10.2d, v4.2d, v6.2d
zip2 v11.2d, v5.2d, v7.2d

// Store four columns (x3 is the byte stride between columns of B)
str q8, [x5]
add x5, x5, x3
str q9, [x5]
add x5, x5, x3
str q10, [x5]
add x5, x5, x3
str q11, [x5]

/*
 * Part 2:
 * Load 4x4 sub-matrices B and C.
 * Transpose both 4x4 blocks.
 * Store both transposed blocks, swapped, into B.
 */
// Load right-top
mov x4, x0 // A
add x4, x4, #128 // Offset to top-left corner of right half of A (32nd element)
...

// Transpose right-top
...

// Load left-bottom
mov x4, x0 // A
add x4, x4, #16 // Offset to next 4 elements of column in A (4th element)
...

// Transpose left-bottom
...

// Store after transpose to avoid conflicts when input matrix A = B
// Store B to C (right-top of A to left-bottom of B)
mov x5, x1
add x5, x5, #16
...

// Store C to B (left-bottom of A to right-top of B)
mov x5, x1
add x5, x5, #128
...

/*
 * Part 3:
 * Load 4x4 sub-matrix D.
 * Transpose 4x4 block.
 * Store transposed 4x4 block of D into B.
 */
// Load
mov x4, x0 // A
add x4, x4, #144 // 128 + 16 -> left-top corner of right-bottom 4x4 sub-matrix of A
...

// Transpose
...

// Store
mov x5, x1 // B
add x5, x5, #144 // 128 + 16 -> left-top corner of right-bottom 4x4 sub-matrix of B
...
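To see why the `trn1`/`trn2`/`zip1`/`zip2` sequence in Part 1 yields a transposed 4x4 block, the lane movements can be emulated in scalar C++. This is a sketch under stated assumptions: `Vec4` models one 128-bit register holding four 32-bit floats, the helper names are hypothetical, and the leading dimensions are given in elements, whereas the assembly advances its pointers by the byte strides in `x2` and `x3`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;  // models one 128-bit register (4 x f32)

// trn1 .4s: even-indexed 32-bit lanes of n and m, interleaved.
Vec4 trn1_4s(const Vec4& n, const Vec4& m) { return {n[0], m[0], n[2], m[2]}; }
// trn2 .4s: odd-indexed 32-bit lanes of n and m, interleaved.
Vec4 trn2_4s(const Vec4& n, const Vec4& m) { return {n[1], m[1], n[3], m[3]}; }
// zip1 .2d: lower 64-bit halves of n and m.
Vec4 zip1_2d(const Vec4& n, const Vec4& m) { return {n[0], n[1], m[0], m[1]}; }
// zip2 .2d: upper 64-bit halves of n and m.
Vec4 zip2_2d(const Vec4& n, const Vec4& m) { return {n[2], n[3], m[2], m[3]}; }

Vec4 load(const float* p) { return {p[0], p[1], p[2], p[3]}; }  // models ldr q
void store(float* p, const Vec4& v) {                           // models str q
  for (std::size_t i = 0; i < 4; ++i) p[i] = v[i];
}

// Emulates Part 1: transposes one column-major 4x4 block.
void transpose_4x4_emulated(const float* a, std::size_t ld_a, float* b, std::size_t ld_b) {
  assert(ld_a >= 4 && ld_b >= 4);
  const Vec4 v0 = load(a);
  const Vec4 v1 = load(a + ld_a);
  const Vec4 v2 = load(a + 2 * ld_a);
  const Vec4 v3 = load(a + 3 * ld_a);

  const Vec4 v4 = trn1_4s(v0, v1);  // {A00, A01, A20, A21}
  const Vec4 v5 = trn2_4s(v0, v1);  // {A10, A11, A30, A31}
  const Vec4 v6 = trn1_4s(v2, v3);  // {A02, A03, A22, A23}
  const Vec4 v7 = trn2_4s(v2, v3);  // {A12, A13, A32, A33}

  store(b, zip1_2d(v4, v6));             // row 0 of A becomes column 0 of B
  store(b + ld_b, zip1_2d(v5, v7));      // row 1 of A becomes column 1 of B
  store(b + 2 * ld_b, zip2_2d(v4, v6));  // row 2 of A becomes column 2 of B
  store(b + 3 * ld_b, zip2_2d(v5, v7));  // row 3 of A becomes column 3 of B
}
```

Because all four loads complete before the first store, the emulation also produces the correct result when `a` and `b` point to the same block.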
2. Test and optimize¶
--------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Byte
--------------------------------------------------------------------------------------------------------------
Trans8x8Fixture/BT_tran_8_8/min_warmup_time:1.000_mean 5.08 ns 5.06 ns 10 101.188G/s
Trans8x8Fixture/BT_tran_8_8/min_warmup_time:1.000_median 5.07 ns 5.06 ns 10 101.277G/s
Trans8x8Fixture/BT_tran_8_8/min_warmup_time:1.000_stddev 0.030 ns 0.030 ns 10 590.962M/s
Trans8x8Fixture/BT_tran_8_8/min_warmup_time:1.000_cv 0.59 % 0.59 % 10 0.58%
tran_8_8 kernel: \(50.5\) GiB/s
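The two throughput figures above can be related to the measured time with a small calculation. This is a sketch of one reading that is consistent with the numbers, not a statement about how the benchmark is implemented: it assumes 32-bit floats and decimal prefixes (1 G = 10^9), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by an m x n matrix of 32-bit floats.
std::size_t matrix_bytes(std::size_t m, std::size_t n) {
  return m * n * sizeof(float);
}

// One byte per nanosecond equals 10^9 bytes per second,
// so bytes / ns directly gives decimal gigabytes per second.
double throughput_gb_per_s(std::size_t bytes, double time_ns) {
  assert(time_ns > 0.0);
  return static_cast<double>(bytes) / time_ns;
}
```

Under these assumptions, `throughput_gb_per_s(2 * matrix_bytes(8, 8), 5.06)` gives about 101.2, close to the value in the table if both the 256 bytes read and the 256 bytes written are counted. Counting the matrix only once, `throughput_gb_per_s(matrix_bytes(8, 8), 5.07)` gives about 50.5, close to the summary line.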
Unary Primitives¶
Zero Primitive¶
1. mini_jit::Unary::generate function to support the zero primitive¶
File: unary_zero.cpp
2. Test and optimize¶
Without Transposition
---------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Bytes
---------------------------------------------------------------------------------------------------------------------------
UnaryFixture/BM_unary_zero/M:50/N:50/min_warmup_time:1.000_mean 97.5 ns 97.2 ns 10 205.828G/s
UnaryFixture/BM_unary_zero/M:50/N:50/min_warmup_time:1.000_median 97.3 ns 97.0 ns 10 206.267G/s
UnaryFixture/BM_unary_zero/M:50/N:50/min_warmup_time:1.000_stddev 0.983 ns 0.980 ns 10 2.05822G/s
UnaryFixture/BM_unary_zero/M:50/N:50/min_warmup_time:1.000_cv 1.01 % 1.01 % 10 1.00%
UnaryFixture/BM_unary_zero/M:64/N:64/min_warmup_time:1.000_mean 186 ns 185 ns 10 176.895G/s
UnaryFixture/BM_unary_zero/M:64/N:64/min_warmup_time:1.000_median 184 ns 183 ns 10 179.139G/s
UnaryFixture/BM_unary_zero/M:64/N:64/min_warmup_time:1.000_stddev 7.19 ns 7.16 ns 10 6.55984G/s
UnaryFixture/BM_unary_zero/M:64/N:64/min_warmup_time:1.000_cv 3.86 % 3.86 % 10 3.71%
UnaryFixture/BM_unary_zero/M:512/N:512/min_warmup_time:1.000_mean 9205 ns 9174 ns 10 228.836G/s
UnaryFixture/BM_unary_zero/M:512/N:512/min_warmup_time:1.000_median 9345 ns 9314 ns 10 225.169G/s
UnaryFixture/BM_unary_zero/M:512/N:512/min_warmup_time:1.000_stddev 316 ns 314 ns 10 7.95304G/s
UnaryFixture/BM_unary_zero/M:512/N:512/min_warmup_time:1.000_cv 3.43 % 3.43 % 10 3.48%
UnaryFixture/BM_unary_zero/M:2048/N:2048/min_warmup_time:1.000_mean 265455 ns 264356 ns 10 127.786G/s
UnaryFixture/BM_unary_zero/M:2048/N:2048/min_warmup_time:1.000_median 256886 ns 255806 ns 10 131.171G/s
UnaryFixture/BM_unary_zero/M:2048/N:2048/min_warmup_time:1.000_stddev 24366 ns 24271 ns 10 10.4142G/s
UnaryFixture/BM_unary_zero/M:2048/N:2048/min_warmup_time:1.000_cv 9.18 % 9.18 % 10 8.15%
BM_unary_zero/M:50/N:50 kernel: \(205.8\) GiB/s
BM_unary_zero/M:64/N:64 kernel: \(176.9\) GiB/s
BM_unary_zero/M:512/N:512 kernel: \(228.8\) GiB/s
BM_unary_zero/M:2048/N:2048 kernel: \(127.8\) GiB/s
With Transposition
With transposition, the zero primitive is the same operation with the M and N dimensions swapped.
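A scalar reference makes the swapped dimensions concrete. This is a minimal sketch and not the generated kernel: the function name `unary_zero_ref`, the `trans` flag, and the convention that the leading dimension `ld_b` is given in elements of a column-major output are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for the zero primitive.
// Without transposition the output is m x n; with transposition it is n x m,
// so the same loop nest runs with the two dimensions swapped.
void unary_zero_ref(float* b, std::size_t m, std::size_t n, std::size_t ld_b, bool trans) {
  const std::size_t rows = trans ? n : m;
  const std::size_t cols = trans ? m : n;
  assert(ld_b >= rows);
  for (std::size_t j = 0; j < cols; ++j) {
    for (std::size_t i = 0; i < rows; ++i) {
      b[j * ld_b + i] = 0.0f;  // only the rows x cols region is written
    }
  }
}
```

Elements between `rows` and `ld_b` in each column are padding and are left untouched.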
Identity Primitive¶
1. mini_jit::Unary::generate function to support the identity primitive¶
File: unary_identity.cpp
& File: unary_identity_transpose.cpp
2. Test and optimize¶
Without Transposition
---------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Bytes
---------------------------------------------------------------------------------------------------------------------------
UnaryFixture/BM_unary_identity/M:50/N:50/min_warmup_time:1.000_mean 129 ns 129 ns 10 155.397G/s
UnaryFixture/BM_unary_identity/M:50/N:50/min_warmup_time:1.000_median 129 ns 128 ns 10 155.951G/s
UnaryFixture/BM_unary_identity/M:50/N:50/min_warmup_time:1.000_stddev 1.53 ns 1.49 ns 10 1.7808G/s
UnaryFixture/BM_unary_identity/M:50/N:50/min_warmup_time:1.000_cv 1.18 % 1.16 % 10 1.15%
UnaryFixture/BM_unary_identity/M:64/N:64/min_warmup_time:1.000_mean 202 ns 202 ns 10 163.002G/s
UnaryFixture/BM_unary_identity/M:64/N:64/min_warmup_time:1.000_median 200 ns 200 ns 10 164.143G/s
UnaryFixture/BM_unary_identity/M:64/N:64/min_warmup_time:1.000_stddev 11.4 ns 11.3 ns 10 8.26683G/s
UnaryFixture/BM_unary_identity/M:64/N:64/min_warmup_time:1.000_cv 5.65 % 5.62 % 10 5.07%
UnaryFixture/BM_unary_identity/M:512/N:512/min_warmup_time:1.000_mean 16864 ns 16789 ns 10 125.139G/s
UnaryFixture/BM_unary_identity/M:512/N:512/min_warmup_time:1.000_median 16468 ns 16388 ns 10 127.967G/s
UnaryFixture/BM_unary_identity/M:512/N:512/min_warmup_time:1.000_stddev 776 ns 768 ns 10 5.48377G/s
UnaryFixture/BM_unary_identity/M:512/N:512/min_warmup_time:1.000_cv 4.60 % 4.57 % 10 4.38%
UnaryFixture/BM_unary_identity/M:2048/N:2048/min_warmup_time:1.000_mean 317943 ns 315684 ns 10 106.416G/s
UnaryFixture/BM_unary_identity/M:2048/N:2048/min_warmup_time:1.000_median 316519 ns 314379 ns 10 106.733G/s
UnaryFixture/BM_unary_identity/M:2048/N:2048/min_warmup_time:1.000_stddev 11615 ns 11358 ns 10 3.8664G/s
UnaryFixture/BM_unary_identity/M:2048/N:2048/min_warmup_time:1.000_cv 3.65 % 3.60 % 10 3.63%
BM_unary_identity/M:50/N:50 kernel: \(155.4\) GiB/s
BM_unary_identity/M:64/N:64 kernel: \(163.0\) GiB/s
BM_unary_identity/M:512/N:512 kernel: \(125.1\) GiB/s
BM_unary_identity/M:2048/N:2048 kernel: \(106.4\) GiB/s
With Transposition
-----------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Bytes
-----------------------------------------------------------------------------------------------------------------------------------------
UnaryFixture/BM_unary_identity_transpose/M:50/N:50/min_warmup_time:1.000_mean 157 ns 156 ns 10 128.118G/s
UnaryFixture/BM_unary_identity_transpose/M:50/N:50/min_warmup_time:1.000_median 157 ns 156 ns 10 128.093G/s
UnaryFixture/BM_unary_identity_transpose/M:50/N:50/min_warmup_time:1.000_stddev 0.734 ns 0.732 ns 10 600.086M/s
UnaryFixture/BM_unary_identity_transpose/M:50/N:50/min_warmup_time:1.000_cv 0.47 % 0.47 % 10 0.47%
UnaryFixture/BM_unary_identity_transpose/M:64/N:64/min_warmup_time:1.000_mean 252 ns 251 ns 10 130.502G/s
UnaryFixture/BM_unary_identity_transpose/M:64/N:64/min_warmup_time:1.000_median 252 ns 251 ns 10 130.41G/s
UnaryFixture/BM_unary_identity_transpose/M:64/N:64/min_warmup_time:1.000_stddev 1.12 ns 1.10 ns 10 573.58M/s
UnaryFixture/BM_unary_identity_transpose/M:64/N:64/min_warmup_time:1.000_cv 0.44 % 0.44 % 10 0.44%
UnaryFixture/BM_unary_identity_transpose/M:512/N:512/min_warmup_time:1.000_mean 477698 ns 476113 ns 10 4.40967G/s
UnaryFixture/BM_unary_identity_transpose/M:512/N:512/min_warmup_time:1.000_median 478232 ns 476633 ns 10 4.39994G/s
UnaryFixture/BM_unary_identity_transpose/M:512/N:512/min_warmup_time:1.000_stddev 16842 ns 16774 ns 10 155.727M/s
UnaryFixture/BM_unary_identity_transpose/M:512/N:512/min_warmup_time:1.000_cv 3.53 % 3.52 % 10 3.53%
UnaryFixture/BM_unary_identity_transpose/M:2048/N:2048/min_warmup_time:1.000_mean 8826949 ns 8790588 ns 10 3.81791G/s
UnaryFixture/BM_unary_identity_transpose/M:2048/N:2048/min_warmup_time:1.000_median 8813603 ns 8778155 ns 10 3.8225G/s
UnaryFixture/BM_unary_identity_transpose/M:2048/N:2048/min_warmup_time:1.000_stddev 137200 ns 136627 ns 10 59.0859M/s
UnaryFixture/BM_unary_identity_transpose/M:2048/N:2048/min_warmup_time:1.000_cv 1.55 % 1.55 % 10 1.55%
BM_unary_identity_transpose/M:50/N:50 kernel: \(128.1\) GiB/s
BM_unary_identity_transpose/M:64/N:64 kernel: \(130.5\) GiB/s
BM_unary_identity_transpose/M:512/N:512 kernel: \(4.409\) GiB/s
BM_unary_identity_transpose/M:2048/N:2048 kernel: \(3.817\) GiB/s
ReLU Primitive¶
1. mini_jit::Unary::generate function to support the ReLU primitive¶
File: unary_relu.cpp
& File: unary_relu_transpose.cpp
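For testing, the ReLU primitive can be compared against a scalar reference that computes max(x, 0) element by element. The sketch below is not the generated kernel. The function name `unary_relu_ref`, the `trans` flag, and leading dimensions given in elements of column-major matrices are assumptions made for illustration. How the generated kernels treat NaN or negative zero is not stated here, so the reference simply maps every value that is not greater than zero to 0.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for the ReLU primitive.
// a is an m x n column-major matrix with leading dimension ld_a.
// b receives max(a, 0): m x n without transposition, n x m with transposition.
void unary_relu_ref(const float* a, float* b, std::size_t m, std::size_t n,
                    std::size_t ld_a, std::size_t ld_b, bool trans) {
  assert(ld_a >= m);
  assert(ld_b >= (trans ? n : m));
  for (std::size_t j = 0; j < n; ++j) {
    for (std::size_t i = 0; i < m; ++i) {
      const float x = a[j * ld_a + i];
      const float y = x > 0.0f ? x : 0.0f;
      if (trans) {
        b[i * ld_b + j] = y;  // element (i, j) of a lands at (j, i) of b
      } else {
        b[j * ld_b + i] = y;  // same position in b
      }
    }
  }
}
```

The same function covers both variants, which keeps the expected values for `unary_relu.cpp` and `unary_relu_transpose.cpp` in one place.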
2. Test and optimize¶
Without Transposition
---------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Bytes
---------------------------------------------------------------------------------------------------------------------------
UnaryFixture/BM_unary_relu/M:50/N:50/min_warmup_time:1.000_mean 141 ns 140 ns 10 143.13G/s
UnaryFixture/BM_unary_relu/M:50/N:50/min_warmup_time:1.000_median 137 ns 137 ns 10 146.168G/s
UnaryFixture/BM_unary_relu/M:50/N:50/min_warmup_time:1.000_stddev 8.16 ns 8.03 ns 10 7.717G/s
UnaryFixture/BM_unary_relu/M:50/N:50/min_warmup_time:1.000_cv 5.80 % 5.73 % 10 5.39%
UnaryFixture/BM_unary_relu/M:64/N:64/min_warmup_time:1.000_mean 223 ns 222 ns 10 147.933G/s
UnaryFixture/BM_unary_relu/M:64/N:64/min_warmup_time:1.000_median 221 ns 220 ns 10 148.866G/s
UnaryFixture/BM_unary_relu/M:64/N:64/min_warmup_time:1.000_stddev 14.6 ns 14.4 ns 10 9.37165G/s
UnaryFixture/BM_unary_relu/M:64/N:64/min_warmup_time:1.000_cv 6.56 % 6.49 % 10 6.34%
UnaryFixture/BM_unary_relu/M:512/N:512/min_warmup_time:1.000_mean 16615 ns 16550 ns 10 126.752G/s
UnaryFixture/BM_unary_relu/M:512/N:512/min_warmup_time:1.000_median 16695 ns 16624 ns 10 126.155G/s
UnaryFixture/BM_unary_relu/M:512/N:512/min_warmup_time:1.000_stddev 281 ns 281 ns 10 2.24229G/s
UnaryFixture/BM_unary_relu/M:512/N:512/min_warmup_time:1.000_cv 1.69 % 1.70 % 10 1.77%
UnaryFixture/BM_unary_relu/M:2048/N:2048/min_warmup_time:1.000_mean 314145 ns 312183 ns 10 107.534G/s
UnaryFixture/BM_unary_relu/M:2048/N:2048/min_warmup_time:1.000_median 316017 ns 313907 ns 10 106.899G/s
UnaryFixture/BM_unary_relu/M:2048/N:2048/min_warmup_time:1.000_stddev 7378 ns 7145 ns 10 2.48532G/s
UnaryFixture/BM_unary_relu/M:2048/N:2048/min_warmup_time:1.000_cv 2.35 % 2.29 % 10 2.31%
BM_unary_relu/M:50/N:50 kernel: \(143.1\) GiB/s
BM_unary_relu/M:64/N:64 kernel: \(147.9\) GiB/s
BM_unary_relu/M:512/N:512 kernel: \(126.8\) GiB/s
BM_unary_relu/M:2048/N:2048 kernel: \(107.5\) GiB/s
With Transposition
-------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations Bytes
-------------------------------------------------------------------------------------------------------------------------------------
UnaryFixture/BM_unary_relu_transpose/M:50/N:50/min_warmup_time:1.000_mean 159 ns 158 ns 10 126.468G/s
UnaryFixture/BM_unary_relu_transpose/M:50/N:50/min_warmup_time:1.000_median 159 ns 158 ns 10 126.509G/s
UnaryFixture/BM_unary_relu_transpose/M:50/N:50/min_warmup_time:1.000_stddev 0.839 ns 0.826 ns 10 660.394M/s
UnaryFixture/BM_unary_relu_transpose/M:50/N:50/min_warmup_time:1.000_cv 0.53 % 0.52 % 10 0.52%
UnaryFixture/BM_unary_relu_transpose/M:64/N:64/min_warmup_time:1.000_mean 244 ns 243 ns 10 134.75G/s
UnaryFixture/BM_unary_relu_transpose/M:64/N:64/min_warmup_time:1.000_median 244 ns 243 ns 10 134.706G/s
UnaryFixture/BM_unary_relu_transpose/M:64/N:64/min_warmup_time:1.000_stddev 0.405 ns 0.417 ns 10 231.018M/s
UnaryFixture/BM_unary_relu_transpose/M:64/N:64/min_warmup_time:1.000_cv 0.17 % 0.17 % 10 0.17%
UnaryFixture/BM_unary_relu_transpose/M:512/N:512/min_warmup_time:1.000_mean 481150 ns 479519 ns 10 4.37974G/s
UnaryFixture/BM_unary_relu_transpose/M:512/N:512/min_warmup_time:1.000_median 483163 ns 481481 ns 10 4.35591G/s
UnaryFixture/BM_unary_relu_transpose/M:512/N:512/min_warmup_time:1.000_stddev 19118 ns 19053 ns 10 176.288M/s
UnaryFixture/BM_unary_relu_transpose/M:512/N:512/min_warmup_time:1.000_cv 3.97 % 3.97 % 10 4.03%
UnaryFixture/BM_unary_relu_transpose/M:2048/N:2048/min_warmup_time:1.000_mean 8770935 ns 8729697 ns 10 3.84785G/s
UnaryFixture/BM_unary_relu_transpose/M:2048/N:2048/min_warmup_time:1.000_median 8782429 ns 8746690 ns 10 3.83637G/s
UnaryFixture/BM_unary_relu_transpose/M:2048/N:2048/min_warmup_time:1.000_stddev 305062 ns 299606 ns 10 133.989M/s
UnaryFixture/BM_unary_relu_transpose/M:2048/N:2048/min_warmup_time:1.000_cv 3.48 % 3.43 % 10 3.48%
BM_unary_relu_transpose/M:50/N:50 kernel: \(126.4\) GiB/s
BM_unary_relu_transpose/M:64/N:64 kernel: \(134.7\) GiB/s
BM_unary_relu_transpose/M:512/N:512 kernel: \(4.379\) GiB/s
BM_unary_relu_transpose/M:2048/N:2048 kernel: \(3.847\) GiB/s