Codegen for TL1 and TL2
codegen_tl1.py and codegen_tl2.py use parameters to generate kernel code for different devices, achieving the fastest performance for TL1 and TL2. We cut the weight into multiple compute blocks to best utilize hardware capabilities.
Example

bitnet_b1_58-large:

- Check the Matmul kernel shapes. For example, the bitnet_b1_58-large Matmul kernel shapes are:
  - [1536, 4096]
  - [1536, 1536]
  - [4096, 1536]
- Make sure BM, BK, and bm for each kernel meet the requirements below.
- Generate the code. For example, for bitnet_b1_58-large:
```bash
# For TL1
python utils/codegen_tl1.py --model bitnet_b1_58-large --BM 256,128,256 --BK 128,64,128 --bm 32,64,32
# For TL2
python utils/codegen_tl2.py --model bitnet_b1_58-large --BM 256,128,256 --BK 96,192,96 --bm 32,32,32
```
TL1:

For TL1, we split the (M, K) weight into M / BM blocks of shape (BM, K), then split each of those along K into K / BK blocks of shape (BM, BK). Each (BM, BK) block is divided the same way into (bm, compute_num / bm) compute blocks, and the computation completes within each block.

Thus, we need to make sure:
- M % BM == 0
- K % BK == 0
- BM % bm == 0
- bm is chosen from [32, 64]
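The TL1 constraints above can be sketched as a small checker. This is an illustrative helper, not part of codegen_tl1.py; the function name `check_tl1_params` is hypothetical:

```python
# Hypothetical sketch of the TL1 tiling constraints; not from the codegen scripts.

def check_tl1_params(M, K, BM, BK, bm):
    """Return True if (BM, BK, bm) satisfy the TL1 tiling rules for an (M, K) kernel."""
    return (
        M % BM == 0       # weight splits into M / BM blocks of shape (BM, K)
        and K % BK == 0   # each block splits into K / BK blocks of shape (BM, BK)
        and BM % bm == 0  # each (BM, BK) block splits into (bm, ...) compute blocks
        and bm in (32, 64)
    )

# bitnet_b1_58-large kernel shapes with the TL1 params from the example command
shapes = [(1536, 4096), (1536, 1536), (4096, 1536)]
params = [(256, 128, 32), (128, 64, 64), (256, 128, 32)]
for (M, K), (BM, BK, bm) in zip(shapes, params):
    assert check_tl1_params(M, K, BM, BK, bm)
```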
TL2:

For TL2, things are a little more complicated. Because TL2 requires BK % 6 == 0, we split K into threeK and twoK: the (M, threeK) part is computed with TL2 and the (M, twoK) part with TL1.

Thus, we need to make sure:
- M % BM == 0
- K % BK % 32 == 0
- BM % bm == 0
- bm is chosen from [32]
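The TL2 constraints can likewise be sketched as a checker, together with one plausible reading of the threeK/twoK split (the part of K covered by full BK tiles goes to TL2, the remainder to TL1). The helper names and the `split_k` logic are assumptions for illustration; the real split lives in codegen_tl2.py:

```python
# Hypothetical sketch of the TL2 constraints and K split; not from the codegen scripts.

def check_tl2_params(M, K, BM, BK, bm):
    """Return True if (BM, BK, bm) satisfy the TL2 tiling rules for an (M, K) kernel."""
    return (
        M % BM == 0
        and K % BK % 32 == 0  # the leftover K % BK columns must be a multiple of 32
        and BM % bm == 0
        and bm == 32
    )

def split_k(K, BK):
    """Assumed split: threeK is covered by full BK tiles (TL2), twoK is the remainder (TL1)."""
    three_k = K - K % BK  # computed with TL2
    two_k = K % BK        # computed with TL1
    return three_k, two_k

# bitnet_b1_58-large, first kernel, with the TL2 params from the example command
assert check_tl2_params(1536, 4096, 256, 96, 32)
```

Note that `K % BK % 32 == 0` evaluates left to right: the remainder `K % BK` (that is, twoK) must itself be divisible by 32.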