# Llama3.java

Practical [Llama 3](https://github.com/meta-llama/llama3) and [3.1](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) inference implemented in a single Java file.

<p align="center">
  <img width="700" src="https://github.com/mukel/llama3.java/assets/1896283/7939588c-c0ff-4261-b67f-8a54bad59ab5">
</p>

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM.

- [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Llama 3 tokenizer based on [minbpe](https://github.com/karpathy/minbpe)
- Llama 3 inference with Grouped-Query Attention
- Support for Llama 3.1 (ad-hoc RoPE scaling, sketched after this list)
- Support for Q8_0 and Q4_0 quantizations
- Fast matrix-vector multiplication routines for quantized tensors using Java's [Vector API](https://openjdk.org/jeps/469), illustrated in the simplified sketch below
- Simple CLI with `--chat` and `--instruct` modes.
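
The "ad-hoc RoPE scaling" above presumably refers to the frequency rescaling Meta published with Llama 3.1: each rotary frequency is kept as-is, divided by a scale factor, or smoothly blended between the two, depending on its wavelength relative to the original 8192-token training context. Below is a minimal sketch of that adjustment using the published default parameters; the method name and signature are illustrative, not this project's actual code.

```java
// Sketch of Llama 3.1-style RoPE frequency scaling (illustrative only).
// The constants are the published Llama 3.1 defaults; names are hypothetical.
static float[] scaleRopeFreqs(float[] freqs) {
    float scaleFactor = 8.0f;          // overall context-extension factor
    float loFreqFactor = 1.0f;
    float hiFreqFactor = 4.0f;
    float oldContextLength = 8192.0f;  // original Llama 3 training context
    float loFreqWavelen = oldContextLength / loFreqFactor;
    float hiFreqWavelen = oldContextLength / hiFreqFactor;
    float[] scaled = new float[freqs.length];
    for (int i = 0; i < freqs.length; i++) {
        float wavelen = (float) (2.0 * Math.PI / freqs[i]);
        if (wavelen < hiFreqWavelen) {
            scaled[i] = freqs[i];                // high frequency: keep as-is
        } else if (wavelen > loFreqWavelen) {
            scaled[i] = freqs[i] / scaleFactor;  // low frequency: scale fully
        } else {                                 // mid band: smooth blend
            float smooth = (oldContextLength / wavelen - loFreqFactor)
                    / (hiFreqFactor - loFreqFactor);
            scaled[i] = (1.0f - smooth) * freqs[i] / scaleFactor + smooth * freqs[i];
        }
    }
    return scaled;
}
```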
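
The fast matrix-vector routines fuse dequantization into SIMD loops over quantized blocks; those kernels are involved, so the snippet below only shows the basic `jdk.incubator.vector` idiom on plain floats (a simplified sketch, not the project's quantized kernels).

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Simplified SIMD dot product; the real routines operate on Q4_0/Q8_0
// blocks and dequantize on the fly inside the loop.
static float dot(float[] a, float[] b) {
    VectorSpecies<Float> species = FloatVector.SPECIES_PREFERRED;
    FloatVector acc = FloatVector.zero(species);
    int i = 0;
    for (; i < species.loopBound(a.length); i += species.length()) {
        FloatVector va = FloatVector.fromArray(species, a, i);
        FloatVector vb = FloatVector.fromArray(species, b, i);
        acc = va.fma(vb, acc);      // acc += va * vb, lane-wise
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) {     // scalar tail
        sum += a[i] * b[i];
    }
    return sum;
}
```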

Here's the interactive `--chat` mode in action:

## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The `~4.3GB` pure `Q4_0` quantized model is recommended; please be gentle with [huggingface.co](https://huggingface.co) servers:
```
# Llama 3.1
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

# Llama 3
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model ~8GB
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
```

#### Optional: quantize to pure `Q4_0` manually

A **pure** `Q4_0` quantization can be generated from a high-precision (F32, F16, …) .gguf source with the `llama-quantize` utility from [llama.cpp](https://github.com/ggerganov/llama.cpp) as follows:

```bash
./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
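
For context on what a **pure** `Q4_0` file stores: GGUF packs `Q4_0` tensors into independent 32-weight blocks, each a 2-byte FP16 scale followed by 16 bytes of packed 4-bit quants. A minimal dequantization sketch of one such block (illustrative, not this project's actual GGUF parser):

```java
// One GGUF Q4_0 block: FP16 scale + 16 bytes of packed 4-bit quants.
// Each weight decodes as (quant - 8) * scale; low nibbles hold elements
// 0..15, high nibbles elements 16..31. Illustrative sketch only.
static float[] dequantizeQ4_0Block(byte[] block) {
    short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
    float scale = Float.float16ToFloat(bits); // Float.float16ToFloat needs Java 20+
    float[] out = new float[32];
    for (int i = 0; i < 16; i++) {
        int b = block[2 + i] & 0xFF;
        out[i]      = ((b & 0x0F) - 8) * scale;
        out[i + 16] = ((b >>> 4) - 8) * scale;
    }
    return out;
}
```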

## Build and run

The project can also be launched straight from source with Java 21+, e.g. `java --enable-preview --source 21 --add-modules jdk.incubator.vector Llama3.java`.

A simple [Makefile](./Makefile) is provided; run `make` to produce `llama3.jar`, or manually:
```bash
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:
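
(A plausible invocation: the JVM flags are assumed to mirror the compile step above, and `--chat`/`--model` follow the CLI options listed earlier; adjust the model path to the file you downloaded.)

```bash
# Assumes the Q4_0 model from the Setup section is in the current directory.
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar \
     --model ./Meta-Llama-3.1-8B-Instruct-Q4_0.gguf --chat
```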