
[experimental]backend: add new oneDNN backend #855

Draft · wants to merge 1 commit into master

Conversation

rfsaliev (Author)

This PR is a proof of concept for integrating the oneDNN (DNNL) library into GGML.

I created this PR rather than an issue to start the discussion about a oneDNN backend from a working demo.

Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to take advantage of the latest Intel CPU/GPU instruction-set features (e.g. AMX) out of the box.

Known issues and TODOs:

  • Functionality:

    • A limited set of operations is implemented - just enough to support the GPT-2 sample model.

    It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

    • This PoC supports the CPU engine only.
    • By default, the backend uses the CPU buffer type - a dedicated buffer type is under development.
  • Performance:

    • Operation fusing is not implemented - oneDNN can fuse several operations into a single call, which significantly improves performance by reducing memory reads and writes.
    • The oneDNN MatMul and InnerProduct (Linear) primitives are executed in a non-optimal mode because the weights are provided in a plain memory layout. To gain maximum performance, it is recommended to 'reorder' at least the weights into a blocked layout, which is more efficient for memory access and for AI acceleration instructions (see oneDNN Memory Format Propagation: https://oneapi-src.github.io/oneDNN/page_memory_format_propagation_cpp.html). A reorder sketch is shown below.
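
For reference, here is a minimal sketch of such a reorder using the oneDNN 3.x C++ API (illustrative only, not part of this PR; the shapes and variable names are made up):

#include <dnnl.hpp>

// Sketch: let the MatMul primitive choose its preferred (blocked) weights
// layout via format_tag::any, then reorder the plain weights into it once.
static void prepack_and_matmul() {
    using namespace dnnl;

    engine eng(engine::kind::cpu, 0);
    stream strm(eng);

    const memory::dim M = 4, K = 768, N = 768;   // illustrative shapes

    memory::desc src_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f32, memory::format_tag::ab);
    // format_tag::any lets the primitive pick the optimal weights layout.
    memory::desc wei_any_md({K, N}, memory::data_type::f32, memory::format_tag::any);

    matmul::primitive_desc pd(eng, src_md, wei_any_md, dst_md);

    // Weights as stored by ggml: a plain row-major buffer.
    memory::desc wei_plain_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory wei_plain(wei_plain_md, eng);

    // Reorder ("pre-pack") only if the primitive prefers a different layout.
    memory wei_packed(pd.weights_desc(), eng);
    if (pd.weights_desc() != wei_plain_md) {
        reorder(wei_plain, wei_packed).execute(strm, wei_plain, wei_packed);
    } else {
        wei_packed = wei_plain;
    }

    memory src(pd.src_desc(), eng), dst(pd.dst_desc(), eng);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, src},
                              {DNNL_ARG_WEIGHTS, wei_packed},
                              {DNNL_ARG_DST, dst}});
    strm.wait();
}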

@ggerganov, @slaren, could you please advise on a proper way to efficiently implement operation fusing and weights pre-packing?

Some technical details:

  • Added files: ggml-dnnl.h, ggml-dnnl.cpp. The backend re-uses the CPU buffer type - a custom buffer type is under development and is wrapped by the USE_DNNL_BACKEND macro.
  • CMake files are modified to support the GGML_DNNL configuration option.
  • gpt-2-sched is modified to convert model weights from FP16 to FP32 if the DNNL backend is enabled - the current oneDNN release does not support MatMul with src_type=dst_type=f32 and weights_type=f16.

@slaren (Collaborator) commented Jun 12, 2024

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

This will be supported through ggml_backend_sched after ggerganov/llama.cpp#6210 is merged.

@ggerganov, @slaren, could you please advise on a proper way to efficiently implement operation fusing and weights pre-packing?

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.
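
For illustration, a minimal sketch of such a scan inside graph_compute (not from this PR; it assumes the ggml_cgraph layout used by backends at the time, and a real implementation would also need to verify that the MUL_MAT result has no other consumers):

#include "ggml.h"

// Sketch: walk the graph and flag a MUL_MAT immediately followed by an ADD
// that consumes it - a candidate for a single fused oneDNN matmul + bias call.
static void dnnl_scan_for_fusion(struct ggml_cgraph * cgraph) {
    for (int i = 0; i + 1 < cgraph->n_nodes; i++) {
        struct ggml_tensor * mm  = cgraph->nodes[i];
        struct ggml_tensor * add = cgraph->nodes[i + 1];
        if (mm->op == GGML_OP_MUL_MAT &&
            add->op == GGML_OP_ADD    &&
            (add->src[0] == mm || add->src[1] == mm)) {
            // execute both nodes as one fused primitive; skip the ADD node afterwards
        }
    }
}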

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type. For example, the CUDA backend has a split buffer type that splits the tensors between multiple GPUs. Since this buffer type would only be used to store weights, in most cases it would be ok to leave some functionality unimplemented, such as support for creating views, or reading data back through get_tensor.
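
For illustration, a rough sketch of what a pre-packing set_tensor hook in a dedicated buffer type could look like (the helper functions are hypothetical and the interface details should be treated as assumptions, not as the existing API):

#include <string.h>
#include "ggml.h"
#include "ggml-backend.h"

// Hypothetical helpers - placeholders for logic the backend would have to provide.
static bool is_dnnl_matmul_weight(const struct ggml_tensor * tensor);
static void dnnl_prepack_weights(struct ggml_tensor * tensor, const void * data, size_t size);

// Sketch: pre-pack weights on upload instead of copying them verbatim.
static void ggml_backend_dnnl_buffer_set_tensor(ggml_backend_buffer_t buffer,
                                                struct ggml_tensor * tensor,
                                                const void * data,
                                                size_t offset, size_t size) {
    if (is_dnnl_matmul_weight(tensor)) {
        // reorder into the blocked layout preferred by the oneDNN primitive
        dnnl_prepack_weights(tensor, data, size);
    } else {
        memcpy((char *) tensor->data + offset, data, size);
    }
    (void) buffer;
}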

@WilliamTambellini (Contributor)

Very good idea @rfsaliev.
Most recent Intel CPUs support bf16.
int8 should be easy to add support for (VNNI).

@rfsaliev (Author)

Thank you @slaren for your response.

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization granularity (per-tensor or per-dimension) differs from GGML's (per-block); the mismatch is illustrated below. In any case, I will look for opportunities to support quantization.
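
For illustration, a sketch of the granularity mismatch (not from the PR; the struct below only mirrors ggml's Q8_0 block layout, and the attribute calls assume the oneDNN 3.x scales API):

#include <dnnl.hpp>
#include <cstdint>

// ggml quantizes per block: one fp16 scale for every 32 values.
#define QK8_0 32
typedef struct {
    uint16_t d;          // per-block scale (stored as fp16 in ggml)
    int8_t   qs[QK8_0];  // 32 quantized values sharing that scale
} block_q8_0_like;       // mirrors ggml's block_q8_0 layout

// oneDNN attaches int8 scales per tensor or per dimension via a mask.
static dnnl::primitive_attr dnnl_int8_scales_attr() {
    dnnl::primitive_attr attr;
    attr.set_scales_mask(DNNL_ARG_WEIGHTS, 0);        // mask 0: one scale for the whole weights tensor
    // attr.set_scales_mask(DNNL_ARG_WEIGHTS, 1 << 1); // or: one scale per column of a 2D weights tensor
    return attr;
}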

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.

Thanks, it looks like it is possible to do some fusing, such as MatMul+BiasAdd, in graph_compute. IMHO, full support for graph_plan_create + graph_plan_compute would give the best opportunities for backend-side optimizations.

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type.

In the case of oneDNN, the weights buffer layout depends on the type of operation that uses the weights. Can you please point me to a method I can follow to identify the consuming operation in the set_tensor call?
I found that buffer.init_tensor is called for every operation during model execution - should I rely on this behavior, or will it be changed in the future? I mean, should I expect that init_tensor will be called for e.g. all MatMul operations assigned to the backend? If yes, are there any design rules that prevent me from changing/replacing op->src with my own buffer?

set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)

set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
Owner:

CLBLAST vars look out of place here

Author:

Thank you - it was copy-pasted by mistake.
I've fixed it and some other parts of this file.

@rfsaliev (Author)

@slaren, can you please help me understand how a backend should work in the gpt-2-sched sample?
I tried to enable the BLAS backend in the sample, but I do not see any calls to ggml_backend_blas_mul_mat.
I did the following steps to run the sample:

  1. clone ggml:
(tf_env) rfsaliev:~$ git clone https://github.com/ggerganov/ggml.git
  2. Apply patch with debug fprintf calls (see the patch below):
(tf_env) rfsaliev:~$ cd ggml
(tf_env) rfsaliev:~/ggml$ git apply < ~/ggml-blas-debug.patch
  3. Download and convert a model keeping FP32 weights:
(tf_env) rfsaliev:~/ggml$ mkdir build && cd build
(tf_env) rfsaliev:~/ggml/build$ ../examples/gpt-2/download-model.sh 117M
(tf_env) rfsaliev:~/ggml/build$ python ../examples/gpt-2/convert-ckpt-to-ggml.py models/gpt-2-117M 0
  4. Build and run the sample:
(tf_env) rfsaliev:~/ggml/build$ cmake .. -DGGML_BLAS=ON && cmake --build . --target gpt-2-sched
(tf_env) rfsaliev:~/ggml/build$ ./bin/gpt-2-sched -m models/gpt-2-117M/ggml-model-f32.bin -p "This is an example of" -n 1 -ngl 32 -s 1

I got a number of BLAS: op supported lines in the sample's output, but not a single BLAS: MUL_MAT or BLAS: OUT_PROD:

main: seed = 1
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model-f32.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx   = 1024
gpt2_model_load: n_embd  = 768
gpt2_model_load: n_head  = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype   = 0
gpt2_model_load: qntvr   = 0
gpt2_model_load:     BLAS buffer size =   622.01 MB
gpt2_model_load: memory size =    72.00 MB, n_mem = 12288
gpt2_model_load: backend_kv = BLAS
gpt2_model_load: model size  =   621.94 MB
gpt2_model_load: backend_in = BLAS (8192 bytes)
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
BLAS: op supported
BLAS: op supported
...
BLAS: op supported
BLAS: op supported
main:     BLAS compute buffer size =     6.32 MB
main: total compute buffer size: 6.32 MB
main: prompt: 'This is an example of'
main: number of tokens in prompt = 5, first 8 tokens: 1212 318 281 1672 286

This is an example of something

main:     load time =   743.46 ms
main:   sample time =     0.79 ms
main:  predict time =    33.71 ms / 6.74 ms per token
main:    total time =   781.95 ms

BLAS backend debug print patch (ggml-blas-debug.patch):

diff --git a/src/ggml-blas.cpp b/src/ggml-blas.cpp
index d709a35..7fff962 100644
--- a/src/ggml-blas.cpp
+++ b/src/ggml-blas.cpp
@@ -52,6 +52,8 @@ static void ggml_backend_blas_mul_mat(ggml_backend_blas_context * ctx, struct gg
     const struct ggml_tensor * src0 = dst->src[0];
     const struct ggml_tensor * src1 = dst->src[1];
 
+    fprintf(stderr, "BLAS: MUL_MAT");
+
     GGML_TENSOR_BINARY_OP_LOCALS
 
     const enum ggml_type type = src0->type;
@@ -170,6 +172,8 @@ static void ggml_backend_blas_out_prod(ggml_backend_blas_context * ctx, struct g
     const struct ggml_tensor * src0 = dst->src[0];
     const struct ggml_tensor * src1 = dst->src[1];
 
+    fprintf(stderr, "BLAS: OUT_PROD");
+
     GGML_TENSOR_BINARY_OP_LOCALS
 
     GGML_ASSERT(ne0  == ne00);
@@ -284,14 +288,15 @@ GGML_CALL static bool ggml_backend_blas_supports_op(ggml_backend_t backend, cons
     const struct ggml_tensor * src0 = op->src[0];
     const struct ggml_tensor * src1 = op->src[1];
 
-    return (op->op == GGML_OP_MUL_MAT  && ggml_backend_blas_use_blas(op)) ||
+    bool ok = (op->op == GGML_OP_MUL_MAT  && ggml_backend_blas_use_blas(op)) ||
            (op->op == GGML_OP_OUT_PROD && op->src[0]->type == GGML_TYPE_F32 &&
                                           op->src[1]->type == GGML_TYPE_F32 &&
                                           ggml_is_matrix(src0) &&
                                           ggml_is_matrix(src1) &&
                                           ggml_is_contiguous(src0) &&
                                           (ggml_is_contiguous(src1) || ggml_is_transposed(src1)));
-
+    if (ok) fprintf(stderr, "BLAS: op supported\n");
+    return ok;
     GGML_UNUSED(backend);
 }
 

@slaren (Collaborator) commented Jun 20, 2024

BLAS is only used with batches of at least 32 tokens. The "op supported" messages you are seeing are probably from the reserve run, which is never executed. Try a larger prompt, or always return true from ggml_backend_blas_use_blas. You can also view the backend assigned to each node in the graph by setting the environment variable GGML_SCHED_DEBUG.

@rfsaliev (Author)

Thank you - using a longer prompt (main: number of tokens in prompt = 40) solved the issue.

* Backend logic is based on the BLAS backend
* Implemented support for the MUL_MAT operation
* Implemented MUL_MAT fusing with a subsequent ADD as bias-add
* Implemented weights 'pre-packing' (reordering) for the MUL_MAT operation

Notes:
* This is the second version of the DNNL backend, based on the refactored ggml backend support implemented together with the BLAS backend
* It is recommended to enable GGML_OPENMP when oneDNN is compiled with DNNL_CPU_RUNTIME=OMP (the default)
@rfsaliev (Author) commented Jun 28, 2024

Hello,
I've published a new, simplified backend version based on the logic of the BLAS backend.

I also added simple MUL_MAT+ADD fusing and weights 'pre-packing' (reordering) features.
The 'pre-packing' is executed at the scheduling stage, based on the oneDNN 'primitive descriptor' that is constructed in the supports_op() interface.

@qnixsynapse (Contributor)

Is this PR dead?

@WilliamTambellini (Contributor)

RFC please
