
GGUF Creation Guide (Llama 3 Model)

llama.cpp

llama.cpp is an open-source software library written in C++ that performs inference for various large language models (LLMs) such as Llama. It was developed together with ggml, a general-purpose tensor library.

It was written by Georgi Gerganov in plain C++ with no dependencies. The advantage of this approach is that it can run on a wider range of hardware than inference libraries that depend on hardware-specific, closed-source libraries such as CUDA. Georgi Gerganov had previously built a similar library, whisper.cpp, which implements OpenAI's speech recognition model Whisper.

Initially it could only run on CPUs, but it can now also run on GPUs through a variety of backends, including Vulkan and SYCL. These backends make up the GGML tensor library, which is used by the model-specific frontend code in llama.cpp. llama.cpp supports its own model format, GGUF, which was previously known as the GGML format.

It supports ahead-of-time quantization instead of on-the-fly quantization, which lets models run more efficiently and makes it especially popular with users who lack specialized hardware. It can also run on Android devices on the CPU alone, giving it a broad user base.

GGUF File Format

GGUF is a binary format that stores tensors and metadata. GGUF files are usually produced by converting models developed with other machine learning libraries such as PyTorch. Its main purpose is to let llama.cpp and other ggml projects load model files easily and quickly.

It was created to replace the file formats used previously. Those formats did not include architecture metadata, which made the software hard to extend and caused backward-compatibility problems. GGUF solves these problems and improves the extensibility and compatibility of model files.

Unlike tensor-only file formats such as safetensors, it encodes both the tensors and a standardized set of metadata, making model loading and saving more efficient and systematic.

As a binary format that stores tensors and metadata, it is optimized for fast model loading and saving and is designed to be used with GGML and other executors.

[GGUF File Structure]

  • GGUF magic number (4 bytes): 0x47 0x47 0x55 0x46, which spells "GGUF".
  • GGUF version (4 bytes): the current version is 3.
  • Tensor count (8 bytes): a UINT64 value giving the number of tensors in the file.
  • Metadata key-value pair count (8 bytes): a UINT64 value giving the number of metadata key-value pairs.
  • Metadata: includes the model architecture, name, context length, file type, tokenizer information, and so on.
  • Tensor info: includes each tensor's name, number of dimensions, dimensions, type, and offset.

The rest of the file consists of tensor data.
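
To make this layout concrete, here is a minimal Python sketch that reads only the fixed-size header fields listed above. It is not the official gguf parser, and the file path in the comment is just an illustrative assumption.

import struct

def read_gguf_header(path):
    # Reads only the fixed-size GGUF header: magic, version, tensor count,
    # and metadata key-value pair count. Assumes a little-endian file.
    with open(path, "rb") as f:
        magic = f.read(4)                                    # 0x47 0x47 0x55 0x46 -> b"GGUF"
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        version, = struct.unpack("<I", f.read(4))            # currently 3
        tensor_count, = struct.unpack("<Q", f.read(8))       # UINT64
        metadata_kv_count, = struct.unpack("<Q", f.read(8))  # UINT64
    return version, tensor_count, metadata_kv_count

# Example (path is hypothetical):
# print(read_gguf_header("../models/Llama-3-Open-Ko-8B.gguf"))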

GGUF supports a variety of quantization types, reducing memory usage and improving speed. It supports the common floating-point data types float32, float16, and bfloat16, as well as several quantized integer types (1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit).

Preparing llama.cpp (make)

Output Folder: llama.cpp/

snoopy_kr@MacBookPro AI % git clone https://github.com/ggerganov/llama.cpp
Cloning into 'llama.cpp'...
remote: Enumerating objects: 23768, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (87/87), done.
remote: Total 23768 (delta 59), reused 90 (delta 40), pack-reused 23641
Receiving objects: 100% (23768/23768), 38.14 MiB | 21.42 MiB/s, done.
Resolving deltas: 100% (16793/16793), done.
snoopy_kr@MacBookPro AI % cd llama.cpp
snoopy_kr@MacBookPro llama.cpp % make                 
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Darwin
I UNAME_P:   i386
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
I CC:        Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX:       Apple clang version 15.0.0 (clang-1500.1.0.2.5)

c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/grammar-parser.cpp -o grammar-parser.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/build-info.cpp -o build-info.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/json-schema-to-grammar.cpp -o json-schema-to-grammar.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/console.cpp -o console.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c sgemm.cpp -o sgemm.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion  -c ggml-metal.m -o ggml-metal.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c unicode.cpp -o unicode.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c unicode-data.cpp -o unicode-data.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/main/main.cpp -o examples/main/main.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/main/main.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 

====  Run ./main -h for help.  ====

c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/quantize-stats/quantize-stats.cpp -o examples/quantize-stats/quantize-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  build-info.o ggml.o llama.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize-stats/quantize-stats.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/perplexity/perplexity.cpp -o examples/perplexity/perplexity.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/perplexity/perplexity.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/imatrix/imatrix.cpp -o examples/imatrix/imatrix.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/imatrix/imatrix.o -o imatrix -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/embedding/embedding.cpp -o examples/embedding/embedding.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/embedding/embedding.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c pocs/vdot/vdot.cpp -o pocs/vdot/vdot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/vdot.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c pocs/vdot/q8dot.cpp -o pocs/vdot/q8dot.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o pocs/vdot/q8dot.o -o q8dot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/train.cpp -o train.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/train-text-from-scratch/train-text-from-scratch.cpp -o examples/train-text-from-scratch/train-text-from-scratch.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/train-text-from-scratch/train-text-from-scratch.o -o train-text-from-scratch -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.cpp -o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/convert-llama2c-to-ggml/convert-llama2c-to-ggml.o -o convert-llama2c-to-ggml -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/simple/simple.cpp -o examples/simple/simple.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/simple/simple.o -o simple -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/batched/batched.cpp -o examples/batched/batched.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched/batched.o -o batched -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/batched-bench/batched-bench.cpp -o examples/batched-bench/batched-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  build-info.o ggml.o llama.o common.o sampling.o grammar-parser.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/batched-bench/batched-bench.o -o batched-bench -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/save-load-state/save-load-state.cpp -o examples/save-load-state/save-load-state.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/save-load-state/save-load-state.o -o save-load-state -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/server/server.cpp -o examples/server/server.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o -Iexamples/server examples/server/server.o -o server -framework Accelerate -framework Foundation -framework Metal -framework MetalKit  
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/gguf/gguf.cpp -o examples/gguf/gguf.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf/gguf.o -o gguf -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/gguf-split/gguf-split.cpp -o examples/gguf-split/gguf-split.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gguf-split/gguf-split.o -o gguf-split -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/eval-callback/eval-callback.cpp -o examples/eval-callback/eval-callback.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/eval-callback/eval-callback.o -o eval-callback -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/llama-bench/llama-bench.cpp -o examples/llama-bench/llama-bench.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llama-bench/llama-bench.o -o llama-bench -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -static -fPIC -c examples/llava/llava.cpp -o libllava.a -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/llava/llava-cli.cpp -o examples/llava/llava-cli.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/llava/clip.cpp  -o examples/llava/clip.o -Wno-cast-qual
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/llava/llava.cpp -o examples/llava/llava.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/llava/llava-cli.o examples/llava/clip.o examples/llava/llava.o -o llava-cli -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/baby-llama/baby-llama.cpp -o examples/baby-llama/baby-llama.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/baby-llama/baby-llama.o -o baby-llama -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/beam-search/beam-search.cpp -o examples/beam-search/beam-search.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/beam-search/beam-search.o -o beam-search -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/retrieval/retrieval.cpp -o examples/retrieval/retrieval.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/retrieval/retrieval.o -o retrieval -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/speculative/speculative.cpp -o examples/speculative/speculative.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/speculative/speculative.o -o speculative -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/infill/infill.cpp -o examples/infill/infill.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o console.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/infill/infill.o -o infill -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/tokenize/tokenize.cpp -o examples/tokenize/tokenize.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/tokenize/tokenize.o -o tokenize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/benchmark/benchmark-matmult.cpp -o examples/benchmark/benchmark-matmult.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  build-info.o ggml.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/benchmark/benchmark-matmult.o -o benchmark-matmult -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/parallel/parallel.cpp -o examples/parallel/parallel.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/parallel/parallel.o -o parallel -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/finetune/finetune.cpp -o examples/finetune/finetune.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o train.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/finetune/finetune.o -o finetune -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/export-lora/export-lora.cpp -o examples/export-lora/export-lora.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/export-lora/export-lora.o -o export-lora -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/lookahead/lookahead.cpp -o examples/lookahead/lookahead.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookahead/lookahead.o -o lookahead -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c common/ngram-cache.cpp -o ngram-cache.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/lookup/lookup.cpp -o examples/lookup/lookup.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup.o -o lookup -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/lookup/lookup-create.cpp -o examples/lookup/lookup-create.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-create.o -o lookup-create -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/lookup/lookup-merge.cpp -o examples/lookup/lookup-merge.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-merge.o -o lookup-merge -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/lookup/lookup-stats.cpp -o examples/lookup/lookup-stats.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o ngram-cache.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/lookup/lookup-stats.o -o lookup-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/passkey/passkey.cpp -o examples/passkey/passkey.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/passkey/passkey.o -o passkey -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -c examples/gritlm/gritlm.cpp -o examples/gritlm/gritlm.o
c++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o json-schema-to-grammar.o sgemm.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/gritlm/gritlm.o -o gritlm -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion  -c tests/test-c.c -o tests/test-c.o

[ext] Install llama.cpp requirements

[ext] download.py work

[ext] download model
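
The linked post covers downloading the model; as a rough sketch of what download.py might do (an assumption, not the linked post's actual script), it could use huggingface_hub's snapshot_download. The repo id below is also an assumption.

# Hypothetical download.py sketch: fetch the model files into ../models.
# The repo id "beomi/Llama-3-Open-Ko-8B" is an assumption, not taken from this post.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="beomi/Llama-3-Open-Ko-8B",
    local_dir="../models/Llama-3-Open-Ko-8B",
)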

Convert the model to GGUF

snoopy_kr@MacBookPro llama.cpp % python convert-hf-to-gguf.py ../models/Llama-3-Open-Ko-8B --outfile ../models/Llama-3-Open-Ko-8B.gguf
INFO:hf-to-gguf:Loading model: Llama-3-Open-Ko-8B
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 8192
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: rms norm epsilon = 1e-05
INFO:hf-to-gguf:gguf: file type = 1
INFO:hf-to-gguf:Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:gguf.vocab:Adding 280147 merge(s).
INFO:gguf.vocab:Setting special token type bos to 128000
INFO:gguf.vocab:Setting special token type eos to 128001
INFO:hf-to-gguf:Exporting model to '../models/Llama-3-Open-Ko-8B.gguf'
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00006.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> float16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.2.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.2.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.3.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.3.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.3.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.3.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.3.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.4.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00002-of-00006.safetensors'
INFO:hf-to-gguf:blk.10.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.10.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.10.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.10.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.10.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.10.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.10.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.11.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.11.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.4.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.4.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.4.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.4.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.5.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.5.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.5.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.5.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.5.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.5.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.5.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.6.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.6.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.6.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.6.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.6.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.6.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.7.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.7.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.7.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.7.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.7.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.7.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.8.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.8.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.8.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.8.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.8.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.8.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.9.ffn_down.weight,       torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.9.ffn_gate.weight,       torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_up.weight,         torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.9.ffn_norm.weight,       torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.9.attn_k.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.9.attn_output.weight,    torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_q.weight,         torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.9.attn_v.weight,         torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00003-of-00006.safetensors'
INFO:hf-to-gguf:blk.11.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.11.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.11.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.11.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.12.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.12.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.12.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.12.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.12.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.12.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.12.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.13.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.13.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.13.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.13.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.13.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.13.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.14.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.14.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.14.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.14.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.14.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.14.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.15.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.15.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.15.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.15.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.15.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.15.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.16.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.16.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.16.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.16.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.16.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.16.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.17.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.17.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.17.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.17.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.17.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.17.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00004-of-00006.safetensors'
INFO:hf-to-gguf:blk.18.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.18.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.18.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.18.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.18.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.18.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.18.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.19.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.19.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.19.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.19.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.19.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.19.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.20.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.20.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.20.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.20.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.20.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.20.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.21.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.21.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.21.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.21.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.21.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.21.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.22.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.22.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.22.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.22.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.22.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.22.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.23.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.23.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.23.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.23.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.23.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.23.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.24.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.24.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.24.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00005-of-00006.safetensors'
INFO:hf-to-gguf:blk.24.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.24.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.24.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.25.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.25.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.25.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.25.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.25.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.25.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.25.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.26.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.26.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.26.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.26.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.26.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.26.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.27.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.27.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.27.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.27.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.27.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.27.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.28.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.28.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.28.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.28.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.28.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.28.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.29.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.29.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.29.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.29.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.29.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.29.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.30.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.30.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.30.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.30.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.30.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.30.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.ffn_gate.weight,      torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.attn_k.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.31.attn_output.weight,   torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_q.weight,        torch.bfloat16 --> float16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.31.attn_v.weight,        torch.bfloat16 --> float16, shape = {4096, 1024}
INFO:hf-to-gguf:gguf: loading model part 'model-00006-of-00006.safetensors'
INFO:hf-to-gguf:output.weight,               torch.bfloat16 --> float16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.31.attn_norm.weight,     torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:blk.31.ffn_down.weight,      torch.bfloat16 --> float16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.31.ffn_up.weight,        torch.bfloat16 --> float16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.31.ffn_norm.weight,      torch.bfloat16 --> float32, shape = {4096}
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> float32, shape = {4096}
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Writing: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16.1G/16.1G [01:13<00:00, 219Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to '../models/Llama-3-Open-Ko-8B.gguf'

GGUF quantization
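
Q4_K_M is the "medium" 4-bit K-quant preset: most weight matrices are converted to q4_K, a subset of tensors (for example several ffn_down and attn_v weights in the log below) is kept at the larger q6_K type, and the small norm vectors remain float32.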

snoopy_kr@MacBookPro llama.cpp % ./quantize ../models/Llama-3-Open-Ko-8B.gguf ../models/Llama-3-Open-Ko-8B-Q4_K_M.gguf Q4_K_M                          
main: build = 2824 (4426e298)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for x86_64-apple-darwin22.6.0
main: quantizing '../models/Llama-3-Open-Ko-8B.gguf' to '../models/Llama-3-Open-Ko-8B-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../models/Llama-3-Open-Ko-8B.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama-3-Open-Ko-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
[   1/ 291]                    token_embd.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q4_K .. size =  1002.00 MiB ->   281.81 MiB
[   2/ 291]               blk.0.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   3/ 291]                blk.0.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[   4/ 291]                blk.0.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[   5/ 291]                  blk.0.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[   6/ 291]                blk.0.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[   7/ 291]                  blk.0.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[   8/ 291]             blk.0.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[   9/ 291]                  blk.0.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  10/ 291]                  blk.0.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  11/ 291]               blk.1.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  12/ 291]                blk.1.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[  13/ 291]                blk.1.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  14/ 291]                  blk.1.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  15/ 291]                blk.1.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  16/ 291]                  blk.1.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  17/ 291]             blk.1.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  18/ 291]                  blk.1.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  19/ 291]                  blk.1.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  20/ 291]               blk.2.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  21/ 291]                blk.2.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[  22/ 291]                blk.2.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  23/ 291]                  blk.2.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  24/ 291]                blk.2.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  25/ 291]                  blk.2.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  26/ 291]             blk.2.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  27/ 291]                  blk.2.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  28/ 291]                  blk.2.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  29/ 291]               blk.3.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  30/ 291]                blk.3.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[  31/ 291]                blk.3.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  32/ 291]                  blk.3.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  33/ 291]                blk.3.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  34/ 291]                  blk.3.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  35/ 291]             blk.3.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  36/ 291]                  blk.3.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  37/ 291]                  blk.3.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  38/ 291]                blk.4.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  39/ 291]                  blk.4.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  40/ 291]             blk.4.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  41/ 291]                  blk.4.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  42/ 291]                  blk.4.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  43/ 291]              blk.10.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  44/ 291]               blk.10.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  45/ 291]               blk.10.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  46/ 291]                 blk.10.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  47/ 291]               blk.10.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  48/ 291]                 blk.10.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  49/ 291]            blk.10.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  50/ 291]                 blk.10.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  51/ 291]                 blk.10.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  52/ 291]                 blk.11.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  53/ 291]            blk.11.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  54/ 291]                 blk.11.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  55/ 291]                 blk.11.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  56/ 291]               blk.4.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  57/ 291]                blk.4.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  58/ 291]                  blk.4.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  59/ 291]                blk.4.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  60/ 291]               blk.5.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  61/ 291]                blk.5.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[  62/ 291]                blk.5.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  63/ 291]                  blk.5.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  64/ 291]                blk.5.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  65/ 291]                  blk.5.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  66/ 291]             blk.5.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  67/ 291]                  blk.5.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  68/ 291]                  blk.5.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  69/ 291]               blk.6.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  70/ 291]                blk.6.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  71/ 291]                blk.6.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  72/ 291]                  blk.6.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  73/ 291]                blk.6.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  74/ 291]                  blk.6.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  75/ 291]             blk.6.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  76/ 291]                  blk.6.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  77/ 291]                  blk.6.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  78/ 291]               blk.7.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  79/ 291]                blk.7.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  80/ 291]                blk.7.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  81/ 291]                  blk.7.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  82/ 291]                blk.7.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  83/ 291]                  blk.7.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  84/ 291]             blk.7.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  85/ 291]                  blk.7.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  86/ 291]                  blk.7.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[  87/ 291]               blk.8.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  88/ 291]                blk.8.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[  89/ 291]                blk.8.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  90/ 291]                  blk.8.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  91/ 291]                blk.8.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  92/ 291]                  blk.8.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  93/ 291]             blk.8.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  94/ 291]                  blk.8.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[  95/ 291]                  blk.8.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[  96/ 291]               blk.9.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[  97/ 291]                blk.9.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  98/ 291]                blk.9.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[  99/ 291]                  blk.9.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 100/ 291]                blk.9.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 101/ 291]                  blk.9.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 102/ 291]             blk.9.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 103/ 291]                  blk.9.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 104/ 291]                  blk.9.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 105/ 291]              blk.11.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 106/ 291]               blk.11.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 107/ 291]               blk.11.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 108/ 291]                 blk.11.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 109/ 291]               blk.11.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 110/ 291]              blk.12.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 111/ 291]               blk.12.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 112/ 291]               blk.12.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 113/ 291]                 blk.12.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 114/ 291]               blk.12.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 115/ 291]                 blk.12.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 116/ 291]            blk.12.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 117/ 291]                 blk.12.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 118/ 291]                 blk.12.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 119/ 291]              blk.13.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 120/ 291]               blk.13.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 121/ 291]               blk.13.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 122/ 291]                 blk.13.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 123/ 291]               blk.13.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 124/ 291]                 blk.13.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 125/ 291]            blk.13.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 126/ 291]                 blk.13.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 127/ 291]                 blk.13.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 128/ 291]              blk.14.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 129/ 291]               blk.14.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 130/ 291]               blk.14.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 131/ 291]                 blk.14.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 132/ 291]               blk.14.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 133/ 291]                 blk.14.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 134/ 291]            blk.14.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 135/ 291]                 blk.14.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 136/ 291]                 blk.14.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 137/ 291]              blk.15.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 138/ 291]               blk.15.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 139/ 291]               blk.15.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 140/ 291]                 blk.15.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 141/ 291]               blk.15.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 142/ 291]                 blk.15.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 143/ 291]            blk.15.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 144/ 291]                 blk.15.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 145/ 291]                 blk.15.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 146/ 291]              blk.16.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 147/ 291]               blk.16.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 148/ 291]               blk.16.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 149/ 291]                 blk.16.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 150/ 291]               blk.16.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 151/ 291]                 blk.16.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 152/ 291]            blk.16.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 153/ 291]                 blk.16.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 154/ 291]                 blk.16.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 155/ 291]              blk.17.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 156/ 291]               blk.17.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 157/ 291]               blk.17.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 158/ 291]                 blk.17.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 159/ 291]               blk.17.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 160/ 291]                 blk.17.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 161/ 291]            blk.17.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 162/ 291]                 blk.17.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 163/ 291]                 blk.17.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 164/ 291]              blk.18.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 165/ 291]               blk.18.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 166/ 291]               blk.18.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 167/ 291]                 blk.18.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 168/ 291]               blk.18.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 169/ 291]                 blk.18.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 170/ 291]            blk.18.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 171/ 291]                 blk.18.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 172/ 291]                 blk.18.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 173/ 291]              blk.19.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 174/ 291]               blk.19.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 175/ 291]               blk.19.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 176/ 291]                 blk.19.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 177/ 291]               blk.19.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 178/ 291]                 blk.19.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 179/ 291]            blk.19.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 180/ 291]                 blk.19.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 181/ 291]                 blk.19.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 182/ 291]              blk.20.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 183/ 291]               blk.20.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 184/ 291]               blk.20.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 185/ 291]                 blk.20.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 186/ 291]               blk.20.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 187/ 291]                 blk.20.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 188/ 291]            blk.20.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 189/ 291]                 blk.20.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 190/ 291]                 blk.20.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 191/ 291]              blk.21.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 192/ 291]               blk.21.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 193/ 291]               blk.21.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 194/ 291]                 blk.21.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 195/ 291]               blk.21.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 196/ 291]                 blk.21.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 197/ 291]            blk.21.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 198/ 291]                 blk.21.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 199/ 291]                 blk.21.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 200/ 291]              blk.22.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 201/ 291]               blk.22.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 202/ 291]               blk.22.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 203/ 291]                 blk.22.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 204/ 291]               blk.22.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 205/ 291]                 blk.22.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 206/ 291]            blk.22.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 207/ 291]                 blk.22.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 208/ 291]                 blk.22.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 209/ 291]              blk.23.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 210/ 291]               blk.23.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 211/ 291]               blk.23.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 212/ 291]                 blk.23.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 213/ 291]               blk.23.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 214/ 291]                 blk.23.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 215/ 291]            blk.23.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 216/ 291]                 blk.23.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 217/ 291]                 blk.23.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 218/ 291]               blk.24.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 219/ 291]                 blk.24.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 220/ 291]                 blk.24.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 221/ 291]            blk.24.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 222/ 291]                 blk.24.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 223/ 291]                 blk.24.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 224/ 291]              blk.24.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 225/ 291]               blk.24.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 226/ 291]               blk.24.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 227/ 291]              blk.25.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 228/ 291]               blk.25.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 229/ 291]               blk.25.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 230/ 291]                 blk.25.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 231/ 291]               blk.25.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 232/ 291]                 blk.25.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 233/ 291]            blk.25.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 234/ 291]                 blk.25.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 235/ 291]                 blk.25.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 236/ 291]              blk.26.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 237/ 291]               blk.26.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 238/ 291]               blk.26.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 239/ 291]                 blk.26.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 240/ 291]               blk.26.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 241/ 291]                 blk.26.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 242/ 291]            blk.26.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 243/ 291]                 blk.26.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 244/ 291]                 blk.26.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 245/ 291]              blk.27.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 246/ 291]               blk.27.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 247/ 291]               blk.27.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 248/ 291]                 blk.27.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 249/ 291]               blk.27.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 250/ 291]                 blk.27.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 251/ 291]            blk.27.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 252/ 291]                 blk.27.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 253/ 291]                 blk.27.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 254/ 291]              blk.28.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 255/ 291]               blk.28.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 256/ 291]               blk.28.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 257/ 291]                 blk.28.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 258/ 291]               blk.28.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 259/ 291]                 blk.28.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 260/ 291]            blk.28.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 261/ 291]                 blk.28.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 262/ 291]                 blk.28.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 263/ 291]              blk.29.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 264/ 291]               blk.29.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 265/ 291]               blk.29.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 266/ 291]                 blk.29.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 267/ 291]               blk.29.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 268/ 291]                 blk.29.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 269/ 291]            blk.29.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 270/ 291]                 blk.29.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 271/ 291]                 blk.29.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 272/ 291]              blk.30.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 273/ 291]               blk.30.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 274/ 291]               blk.30.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 275/ 291]                 blk.30.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 276/ 291]               blk.30.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 277/ 291]                 blk.30.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 278/ 291]            blk.30.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 279/ 291]                 blk.30.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 280/ 291]                 blk.30.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 281/ 291]               blk.31.ffn_gate.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 282/ 291]                 blk.31.attn_k.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q4_K .. size =     8.00 MiB ->     2.25 MiB
[ 283/ 291]            blk.31.attn_output.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 284/ 291]                 blk.31.attn_q.weight - [ 4096,  4096,     1,     1], type =    f16, converting to q4_K .. size =    32.00 MiB ->     9.00 MiB
[ 285/ 291]                 blk.31.attn_v.weight - [ 4096,  1024,     1,     1], type =    f16, converting to q6_K .. size =     8.00 MiB ->     3.28 MiB
[ 286/ 291]                        output.weight - [ 4096, 128256,     1,     1], type =    f16, converting to q6_K .. size =  1002.00 MiB ->   410.98 MiB
[ 287/ 291]              blk.31.attn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 288/ 291]               blk.31.ffn_down.weight - [14336,  4096,     1,     1], type =    f16, converting to q6_K .. size =   112.00 MiB ->    45.94 MiB
[ 289/ 291]                 blk.31.ffn_up.weight - [ 4096, 14336,     1,     1], type =    f16, converting to q4_K .. size =   112.00 MiB ->    31.50 MiB
[ 290/ 291]               blk.31.ffn_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
[ 291/ 291]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
llama_model_quantize_internal: model size  = 15317.02 MB
llama_model_quantize_internal: quant size  =  4685.30 MB

main: quantize time = 182635.01 ms
main:    total time = 182635.01 ms
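Per the summary above, the mixed k-quant output (mostly q4_K, with q6_K used for some ffn_down / attn_v tensors and for output.weight) shrinks the model from 15,317 MB to 4,685 MB, roughly 30% of the f16 size, in about three minutes on this machine.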

[ext] download.py script

from huggingface_hub import snapshot_download

# Download the full beomi/Llama-3-Open-Ko-8B repository snapshot
# into the local folder ./Llama-3-Open-Ko-8B.
model_id = "beomi/Llama-3-Open-Ko-8B"
snapshot_download(repo_id=model_id, local_dir="Llama-3-Open-Ko-8B",
                  local_dir_use_symlinks=False, revision="main")
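The script pulls every file in the repository into ./Llama-3-Open-Ko-8B. Note that local_dir_use_symlinks is deprecated in recent huggingface_hub releases (see the warning in the log below) and can simply be omitted. Assuming a recent huggingface_hub is installed, the same download can also be done from the shell:

huggingface-cli download beomi/Llama-3-Open-Ko-8B --local-dir Llama-3-Open-Ko-8B --revision main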

[ext] download model

snoopy_kr@MacBookPro models % python download.py 
/Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages/huggingface_hub/file_download.py:1194: UserWarning: `local_dir_use_symlinks` parameter is deprecated and will be ignored. The process to download files to a local folder has been updated and do not rely on symlinks anymore. You only need to pass a destination folder as`local_dir`.
For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.
  warnings.warn(
generation_config.json: 100%|███████████████████| 132/132 [00:00<00:00, 751kB/s]
.gitattributes: 100%|██████████████████████| 1.57k/1.57k [00:00<00:00, 11.1MB/s]
config.json: 100%|█████████████████████████████| 698/698 [00:00<00:00, 4.48MB/s]
model.safetensors.index.json: 100%|████████| 23.9k/23.9k [00:00<00:00, 2.72MB/s]
README.md: 100%|███████████████████████████| 8.21k/8.21k [00:00<00:00, 46.9MB/s]
model-00004-of-00006.safetensors: 100%|████| 2.94G/2.94G [03:44<00:00, 13.1MB/s]
model-00002-of-00006.safetensors: 100%|████| 2.94G/2.94G [04:14<00:00, 11.5MB/s]
model-00006-of-00006.safetensors: 100%|████| 1.29G/1.29G [04:54<00:00, 4.36MB/s]
model-00003-of-00006.safetensors: 100%|████| 2.97G/2.97G [05:06<00:00, 9.67MB/s]
pytorch_model-00002-of-00006.bin: 100%|████| 2.94G/2.94G [06:01<00:00, 8.11MB/s]
pytorch_model.bin.index.json: 100%|████████| 23.9k/23.9k [00:00<00:00, 90.4MB/s]
special_tokens_map.json: 100%|█████████████████| 301/301 [00:00<00:00, 3.85MB/s]
tokenizer.json: 100%|██████████████████████| 9.09M/9.09M [00:03<00:00, 2.45MB/s]
tokenizer_config.json: 100%|███████████████| 51.0k/51.0k [00:00<00:00, 12.9MB/s]
pytorch_model-00006-of-00006.bin: 100%|████| 1.29G/1.29G [01:46<00:00, 12.0MB/s]
pytorch_model-00004-of-00006.bin: 100%|████| 2.94G/2.94G [03:11<00:00, 15.4MB/s]
pytorch_model-00003-of-00006.bin: 100%|████| 2.97G/2.97G [03:45<00:00, 13.2MB/s]
model-00005-of-00006.safetensors: 100%|████| 2.94G/2.94G [08:04<00:00, 6.06MB/s]
pytorch_model-00001-of-00006.bin: 100%|████| 3.00G/3.00G [08:37<00:00, 5.79MB/s]
model-00001-of-00006.safetensors: 100%|████| 3.00G/3.00G [08:46<00:00, 5.69MB/s]
pytorch_model-00005-of-00006.bin: 100%|████| 2.94G/2.94G [03:53<00:00, 12.6MB/s]
Fetching 21 files: 100%|████████████████████████| 21/21 [08:52<00:00, 25.35s/it]
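The log shows that both the safetensors shards and the duplicate pytorch_model-*.bin shards were downloaded (roughly 30 GB in total), while the GGUF converters only need one set. A minimal sketch that restricts the snapshot to the safetensors weights plus the JSON config/tokenizer files (allow_patterns is a standard snapshot_download argument):

from huggingface_hub import snapshot_download

# Skip the duplicate pytorch_model-*.bin shards; fetch only the safetensors
# weights and the config/tokenizer JSON files.
snapshot_download(
    repo_id="beomi/Llama-3-Open-Ko-8B",
    local_dir="Llama-3-Open-Ko-8B",
    revision="main",
    allow_patterns=["*.safetensors", "*.json"],
)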

[ext] Install llama.cpp requirements

snoopy_kr@MacBookPro llama.cpp % pip install -r requirements.txt
Requirement already satisfied: numpy~=1.24.4 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert.txt (line 1)) (1.24.4)
Requirement already satisfied: sentencepiece~=0.1.98 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert.txt (line 2)) (0.1.99)
Requirement already satisfied: transformers<5.0.0,>=4.35.2 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert.txt (line 3)) (4.40.1)
Requirement already satisfied: gguf>=0.1.0 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert.txt (line 4)) (0.6.0)
Requirement already satisfied: protobuf<5.0.0,>=4.21.0 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert.txt (line 5)) (4.25.3)
Requirement already satisfied: torch~=2.1.1 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (2.1.2)
Requirement already satisfied: einops~=0.7.0 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from -r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 3)) (0.7.0)
Requirement already satisfied: filelock in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (3.13.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (0.23.0)
Requirement already satisfied: packaging>=20.0 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (23.1)
Requirement already satisfied: pyyaml>=5.1 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (6.0.1)
Requirement already satisfied: regex!=2019.12.17 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (2023.10.3)
Requirement already satisfied: requests in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (2.31.0)
Requirement already satisfied: tokenizers<0.20,>=0.19 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (0.19.1)
Requirement already satisfied: safetensors>=0.4.1 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (0.4.3)
Requirement already satisfied: tqdm>=4.27 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (4.65.0)
Requirement already satisfied: typing-extensions in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (4.9.0)
Requirement already satisfied: sympy in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (1.12)
Requirement already satisfied: networkx in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (3.1)
Requirement already satisfied: jinja2 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (3.1.3)
Requirement already satisfied: fsspec in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (2023.10.0)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from jinja2->torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (2.1.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from requests->transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from requests->transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from requests->transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from requests->transformers<5.0.0,>=4.35.2->-r llama.cpp/./requirements/requirements-convert.txt (line 3)) (2024.2.2)
Requirement already satisfied: mpmath>=0.19 in /Users/snoopy_kr/Apps/anaconda3/lib/python3.11/site-packages (from sympy->torch~=2.1.1->-r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2)) (1.3.0)

[ref] convert-hf-to-gguf.py help

snoopy_kr@MacBookPro llama.cpp % python convert-hf-to-gguf.py -h
usage: convert-hf-to-gguf.py [-h] [--vocab-only] [--awq-path AWQ_PATH] [--outfile OUTFILE] [--outtype {f32,f16}] [--bigendian] [--use-temp-file] [--no-lazy]
                             [--model-name MODEL_NAME] [--verbose]
                             model

Convert a huggingface model to a GGML compatible file

positional arguments:
  model                 directory containing model file

options:
  -h, --help            show this help message and exit
  --vocab-only          extract only the vocab
  --awq-path AWQ_PATH   Path to scale awq cache file
  --outfile OUTFILE     path to write to; default: based on input
  --outtype {f32,f16}   output format - use f32 for float32, f16 for float16
  --bigendian           model is executed on big endian machine
  --use-temp-file       use the tempfile library while processing (helpful when running out of memory, process killed)
  --no-lazy             use more RAM by computing all outputs before writing (use in case lazy evaluation is broken)
  --model-name MODEL_NAME
                        name of the model
  --verbose             increase output verbosity
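Using the options above, a typical f16 conversion of the downloaded folder looks like the following (the paths are illustrative and assume the model sits in ../models next to llama.cpp, matching the file loaded later):

python convert-hf-to-gguf.py ../models/Llama-3-Open-Ko-8B \
  --outfile ../models/Llama-3-Open-Ko-8B.gguf \
  --outtype f16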

[ref] convert.py help

snoopy_kr@MacBookPro llama.cpp % python convert.py -h         
usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab]
                  [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR]
                  [--vocab-type VOCAB_TYPE] [--outfile OUTFILE] [--ctx CTX]
                  [--concurrency CONCURRENCY] [--big-endian] [--pad-vocab]
                  [--skip-unknown] [--verbose]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself
                        (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --no-vocab            store model without the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default:
                        f16 or f32 based on input)
  --vocab-dir VOCAB_DIR
                        directory containing tokenizer.model, if separate from
                        model file
  --vocab-type VOCAB_TYPE
                        vocab types to try in order, choose from 'spm', 'bpe',
                        'hfft' (default: spm,hfft)
  --outfile OUTFILE     path to write to; default: based on input
  --ctx CTX             model training context (default: based on input)
  --concurrency CONCURRENCY
                        concurrency used for conversion (default: 8)
  --big-endian          model is executed on big endian machine
  --pad-vocab           add pad tokens when model vocab expects more than
                        tokenizer metadata provides
  --skip-unknown        skip unknown tensor names instead of failing
  --verbose             increase output verbosity
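convert.py reads the PyTorch/safetensors shards directly. Since Llama 3 repositories ship a BPE tokenizer.json rather than a sentencepiece tokenizer.model, the vocab type typically has to be selected explicitly; a sketch with assumed paths and output name:

python convert.py ../models/Llama-3-Open-Ko-8B \
  --outtype f16 \
  --vocab-type bpe \
  --outfile ../models/Llama-3-Open-Ko-8B-f16.gguf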

[ref] quantize help

snoopy_kr@MacBookPro llama.cpp % ./quantize
usage: ./quantize [--help] [--allow-requantize] [--leave-output-tensor] [--pure] [--imatrix] [--include-weights] [--exclude-weights] [--output-tensor-type] [--token-embedding-type] [--override-kv] model-f32.gguf [model-quant.gguf] type [nthreads]

  --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit
  --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing
  --pure: Disable k-quant mixtures and quantize all tensors to the same type
  --imatrix file_name: use data in file_name as importance matrix for quant optimizations
  --include-weights tensor_name: use importance matrix for this/these tensor(s)
  --exclude-weights tensor_name: use importance matrix for this/these tensor(s)
  --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor
  --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor
  --keep-split: will generate quantized model in the same shards as input
  --override-kv KEY=TYPE:VALUE
      Advanced option to override model metadata by key in the quantized model. May be specified multiple times.
Note: --include-weights and --exclude-weights cannot be used together

Allowed quantization types:
   2  or  Q4_0    :  3.56G, +0.2166 ppl @ LLaMA-v1-7B
   3  or  Q4_1    :  3.90G, +0.1585 ppl @ LLaMA-v1-7B
   8  or  Q5_0    :  4.33G, +0.0683 ppl @ LLaMA-v1-7B
   9  or  Q5_1    :  4.70G, +0.0349 ppl @ LLaMA-v1-7B
  19  or  IQ2_XXS :  2.06 bpw quantization
  20  or  IQ2_XS  :  2.31 bpw quantization
  28  or  IQ2_S   :  2.5  bpw quantization
  29  or  IQ2_M   :  2.7  bpw quantization
  24  or  IQ1_S   :  1.56 bpw quantization
  31  or  IQ1_M   :  1.75 bpw quantization
  10  or  Q2_K    :  2.63G, +0.6717 ppl @ LLaMA-v1-7B
  21  or  Q2_K_S  :  2.16G, +9.0634 ppl @ LLaMA-v1-7B
  23  or  IQ3_XXS :  3.06 bpw quantization
  26  or  IQ3_S   :  3.44 bpw quantization
  27  or  IQ3_M   :  3.66 bpw quantization mix
  12  or  Q3_K    : alias for Q3_K_M
  22  or  IQ3_XS  :  3.3 bpw quantization
  11  or  Q3_K_S  :  2.75G, +0.5551 ppl @ LLaMA-v1-7B
  12  or  Q3_K_M  :  3.07G, +0.2496 ppl @ LLaMA-v1-7B
  13  or  Q3_K_L  :  3.35G, +0.1764 ppl @ LLaMA-v1-7B
  25  or  IQ4_NL  :  4.50 bpw non-linear quantization
  30  or  IQ4_XS  :  4.25 bpw non-linear quantization
  15  or  Q4_K    : alias for Q4_K_M
  14  or  Q4_K_S  :  3.59G, +0.0992 ppl @ LLaMA-v1-7B
  15  or  Q4_K_M  :  3.80G, +0.0532 ppl @ LLaMA-v1-7B
  17  or  Q5_K    : alias for Q5_K_M
  16  or  Q5_K_S  :  4.33G, +0.0400 ppl @ LLaMA-v1-7B
  17  or  Q5_K_M  :  4.45G, +0.0122 ppl @ LLaMA-v1-7B
  18  or  Q6_K    :  5.15G, +0.0008 ppl @ LLaMA-v1-7B
   7  or  Q8_0    :  6.70G, +0.0004 ppl @ LLaMA-v1-7B
   1  or  F16     : 14.00G, -0.0020 ppl @ Mistral-7B
  32  or  BF16    : 14.00G, -0.0050 ppl @ Mistral-7B
   0  or  F32     : 26.00G              @ 7B
          COPY    : only copy tensors, no quantizing
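Following the usage line above, quantizing the f16 GGUF produced by the conversion step into, for example, a Q4_K_M file (output name chosen here for illustration) would be:

./quantize ../models/Llama-3-Open-Ko-8B.gguf ../models/Llama-3-Open-Ko-8B-Q4_K_M.gguf Q4_K_M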

[ref] Verify the conversion

snoopy_kr@iMac llama.cpp % ./main -m ../models/Llama-3-Open-Ko-8B.gguf \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8 \
  -ngl 0
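Here -m points at the f16 GGUF from the conversion step, --ctx_size 2048 sets the context window, -n -1 generates until an end token, -ins turns on interactive instruction mode (hence the '### Instruction:' reverse prompt below), -b 256 is the batch size, -t 8 uses eight CPU threads, and -ngl 0 keeps every layer on the CPU.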

Log start
main: build = 2784 (a2ac89d6)
main: built with Apple clang version 15.0.0 (clang-1500.1.0.2.5) for x86_64-apple-darwin22.6.0
main: seed  = 1721177599
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from ../models/Llama-3-Open-Ko-8B.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Llama-3-Open-Ko-8B
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{ ...
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 14.96 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = Llama-3-Open-Ko-8B
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size = 15317.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 256
llama_new_context_with_model: n_ubatch   = 256
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =   129.25 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 8 / 6 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 10000, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.200
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 1


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

<|begin_of_text|>
> 
이번에 새로 나온 제품은 어떤가요?

llama_print_timings:        load time =  196953.77 ms
llama_print_timings:      sample time =      29.38 ms /    20 runs   (    1.47 ms per token,   680.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  223184.69 ms /    20 runs   (11159.23 ms per token,     0.09 tokens per second)
llama_print_timings:       total time =  320525.77 ms /    21 tokens
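This run loads the unquantized f16 GGUF (about 15 GiB) and keeps all layers on the CPU (-ngl 0, "offloaded 0/33 layers"), which is why generation crawls at roughly 0.09 tokens per second. Loading the ~4.6 GB quantized file produced above instead should be far lighter on memory, although it would still run on the CPU with these settings.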

References

huggingface

llama.cpp

k-quants

Tutorial: How to convert HuggingFace model to GGUF format #2948

Refactor convert.py and add support for Metas official Llama 3 model #6819

Intel CPU and Graphics card Macbook pro: failed to create context with model ‘./models/model.q4_k_s.gguf’ #3129
