# Llama3.java

Practical [Llama 3](https://github.com/meta-llama/llama3) and [3.1](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) inference implemented in a single Java file.

<p align="center">
  <img width="700" src="https://github.com/mukel/llama3.java/assets/1896283/7939588c-c0ff-4261-b67f-8a54bad59ab5">
</p>

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM.

- [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Llama 3 tokenizer based on [minbpe](https://github.com/karpathy/minbpe)
- Llama 3 inference with Grouped-Query Attention
- Support for Llama 3.1 (ad-hoc RoPE scaling, sketched after this list)
- Support for Q8_0 and Q4_0 quantizations
- Fast matrix-vector multiplication routines for quantized tensors using Java's [Vector API](https://openjdk.org/jeps/469), illustrated in the simplified sketch below
- Simple CLI with `--chat` and `--instruct` modes.
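
The "ad-hoc RoPE scaling" above presumably refers to the frequency rescaling Meta published with Llama 3.1: each rotary frequency is kept as-is, divided by a scale factor, or smoothly blended between the two, depending on its wavelength relative to the original 8192-token training context. Below is a minimal sketch of that adjustment using the published default parameters; the method name and signature are illustrative, not this project's actual code.

```java
// Sketch of Llama 3.1-style RoPE frequency scaling (illustrative only).
// The constants are the published Llama 3.1 defaults; names are hypothetical.
static float[] scaleRopeFreqs(float[] freqs) {
    float scaleFactor = 8.0f;          // overall context-extension factor
    float loFreqFactor = 1.0f;
    float hiFreqFactor = 4.0f;
    float oldContextLength = 8192.0f;  // original Llama 3 training context
    float loFreqWavelen = oldContextLength / loFreqFactor;
    float hiFreqWavelen = oldContextLength / hiFreqFactor;
    float[] scaled = new float[freqs.length];
    for (int i = 0; i < freqs.length; i++) {
        float wavelen = (float) (2.0 * Math.PI / freqs[i]);
        if (wavelen < hiFreqWavelen) {
            scaled[i] = freqs[i];                // high frequency: keep as-is
        } else if (wavelen > loFreqWavelen) {
            scaled[i] = freqs[i] / scaleFactor;  // low frequency: scale fully
        } else {                                 // mid band: smooth blend
            float smooth = (oldContextLength / wavelen - loFreqFactor)
                    / (hiFreqFactor - loFreqFactor);
            scaled[i] = (1.0f - smooth) * freqs[i] / scaleFactor + smooth * freqs[i];
        }
    }
    return scaled;
}
```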
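
The fast matrix-vector routines fuse dequantization into SIMD loops over quantized blocks; those kernels are involved, so the snippet below only shows the basic `jdk.incubator.vector` idiom on plain floats (a simplified sketch, not the project's quantized kernels).

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Simplified SIMD dot product; the real routines operate on Q4_0/Q8_0
// blocks and dequantize on the fly inside the loop.
static float dot(float[] a, float[] b) {
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    FloatVector acc = FloatVector.zero(species);
    int i = 0;
    for (; i < species.loopBound(a.length); i += species.length()) {
        FloatVector va = FloatVector.fromArray(species, a, i);
        FloatVector vb = FloatVector.fromArray(species, b, i);
        acc = va.fma(vb, acc);      // acc += va * vb, lane-wise
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {     // scalar tail
        sum += a[i] * b[i];
    }
    return sum;
}
```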

Here's the interactive `--chat` mode in action:

## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The `~4.3GB` pure `Q4_0` quantized model is recommended; please be gentle with [huggingface.co](https://huggingface.co) servers:
```
# Llama 3.1
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

# Llama 3
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model ~8GB
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
```

#### Optional: quantize to pure `Q4_0` manually

A **pure** `Q4_0` quantization can be generated from a high-precision (F32, F16, …) .gguf source with the `llama-quantize` utility from [llama.cpp](https://github.com/ggerganov/llama.cpp) as follows:

```bash
./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
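
For context on what a **pure** `Q4_0` file stores: GGUF packs `Q4_0` tensors into independent 32-weight blocks, each a 2-byte FP16 scale followed by 16 bytes of packed 4-bit quants. A minimal dequantization sketch of one such block (illustrative, not this project's actual GGUF parser):

```java
// One GGUF Q4_0 block: FP16 scale + 16 bytes of packed 4-bit quants.
// Each weight decodes as (quant - 8) * scale; low nibbles hold elements
// 0..15, high nibbles elements 16..31. Illustrative sketch only.
static float[] dequantizeQ4_0Block(byte[] block) {
    short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
    float scale = Float.float16ToFloat(bits); // Float.float16ToFloat needs Java 20+
    float[] out = new float[32];
    for (int i = 0; i < 16; i++) {
        int b = block[2 + i] & 0xFF;
        out[i]      = ((b & 0x0F) - 8) * scale;
        out[i + 16] = ((b >>> 4) - 8) * scale;
    }
    return out;
}
```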

## Build and run

The project can also be launched straight from source with Java 21+, e.g. `java --enable-preview --source 21 --add-modules jdk.incubator.vector Llama3.java`.

A simple [Makefile](./Makefile) is provided; run `make` to produce `llama3.jar`, or manually:
```bash
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:
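
(A plausible invocation: the JVM flags are assumed to mirror the compile step above, and `--chat`/`--model` follow the CLI options listed earlier; adjust the model path to the file you downloaded.)

```bash
# Assumes the Q4_0 model from the Setup section is in the current directory.
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar \
     --model ./Meta-Llama-3.1-8B-Instruct-Q4_0.gguf --chat
```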