diff --git a/.gitignore b/.gitignore
index 9dd056f..31ca5dd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,3 +1,4 @@
**/src-gen/
+**/fed-gen/
**/build/
-**/bin/
\ No newline at end of file
+**/bin/
diff --git a/llm/README.md b/llm/README.md
new file mode 100644
index 0000000..ceef144
--- /dev/null
+++ b/llm/README.md
@@ -0,0 +1,142 @@
+
+# LLM Demo Overview
+This is a quiz-style game between two LLM agents. For each question the user types at the keyboard, both agents answer in parallel. The Judge announces whichever answer arrives first (or a timeout if neither responds within 60 seconds) and prints the elapsed logical and physical time for each question.
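+
+Conceptually, each round is a race between the two agents with a 60-second timeout. The sketch below illustrates that idea in plain Python with hypothetical `agent_a`/`agent_b` callables; the actual demo implements it with Lingua Franca reactors, logical actions, and logical/physical time measurement.
+
+```
+import concurrent.futures
+import time
+
+def judge(question, agent_a, agent_b, timeout_s=60):
+    # Ask both agents in parallel.
+    start = time.monotonic()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
+        futures = {
+            pool.submit(agent_a, question): "LLM-A",
+            pool.submit(agent_b, question): "LLM-B",
+        }
+        # Wait for whichever agent finishes first, or give up after timeout_s.
+        done, _ = concurrent.futures.wait(
+            futures, timeout=timeout_s,
+            return_when=concurrent.futures.FIRST_COMPLETED)
+        elapsed_ms = int((time.monotonic() - start) * 1000)
+        if not done:
+            print(f"TIMEOUT ({timeout_s} s) | physical {elapsed_ms} ms")
+            return None
+        winner = next(iter(done))
+        print(f"Winner: {futures[winner]} | physical {elapsed_ms} ms")
+        # Note: leaving the with-block still waits for the slower agent to finish.
+        return winner.result()
+```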
+
+# Directory Structure
+- [federated](src/federated/) - Directory for federated versions of LLM demos.
+- [agents](src/agents/) - Directory for Python files for various LLM agents.
+
+# Pre-requisites
+
+You need Python >= 3.10 installed.
+
+## Library Dependencies
+The dependencies required to run this project are listed in the [requirements.txt](requirements.txt) file. The model used in this repository has been quantized to 4-bit precision (bnb_4bit) and relies on bitsandbytes for efficient matrix operations and memory optimization, so compatible versions of bitsandbytes, torch, and torchvision are required.
+While newer versions of other dependencies may work, the versions listed in [requirements.txt](requirements.txt) have been tested and are recommended.
+It is highly recommended to create a Python virtual environment or a Conda environment to manage dependencies. \
+To create a virtual environment, follow the steps below.
+
+### Step 1: Creating environment
+```
+python3 -m venv llm
+source llm/bin/activate
+```
+To activate the environment in a new shell, run `source llm/bin/activate`.
+Or, using Conda:
+```
+conda create -n llm python=3.12
+conda activate llm
+```
+### Step 2: Installing the required packages
+Check if pip is installed:
+```
+pip --version
+```
+If it is outdated, upgrade it:
+```
+python -m pip install --upgrade pip
+```
+Run this command to install the packages from the [requirements.txt](requirements.txt) file:\
+**Note**: Since this demo uses LLMs with 7B and 70B parameters, a device with GPU support is recommended.
+```
+pip install -r requirements.txt
+```
+To check if all the requirements are installed, run:
+```
+pip list | grep -E "transformers|accelerate|tokenizers|bitsandbytes"
+```
+To install PyTorch:
+
+1. For devices without a GPU:
+```
+pip install torch torchvision
+```
+2. For devices with a GPU:
+ Check the CUDA version by running:
+ ```
+ nvidia-smi
+ ```
+ Look for the line "CUDA Version" in the output, as shown in the image below:
+
+ ![CUDA version reported by nvidia-smi](img/cudaversion.png)
+ With the correct CUDA version identified, install PyTorch from the [PyTorch](https://pytorch.org/get-started/locally/) "Get Started" page by selecting the correct OS and compute platform, as shown in the image below for a Linux system with CUDA 12.8:
+
+ ![PyTorch install selector](img/pytorch.png)
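+
+ For example, at the time of writing, the selector produces a command of roughly this form for Linux, pip, and CUDA 12.8 (verify the exact index URL on the PyTorch page):
+ ```
+ pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
+ ```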
+### Step 3: Model Dependencies
+- **Pre-trained models used in `agents/llm.py`**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) \
+**Note:** Follow the steps below to obtain access and an authentication token for the Hugging Face models.
+1. Create a user access token by following the official documentation: [User access tokens](https://huggingface.co/docs/hub/en/security-tokens)
+2. Log in using the Hugging Face CLI by running `huggingface-cli login`, as shown after this list. Refer to the official documentation for step-by-step instructions: [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
+3. The Llama models are gated, so you must request access before using them for the first time. Open these links and apply for access: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
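+
+For reference, the login step (item 2 above) looks like the following; paste the access token created in step 1 when prompted:
+```
+huggingface-cli login
+# Optional: confirm that you are logged in
+huggingface-cli whoami
+```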
+
+## System Requirements
+
+The demo was developed and tested with the following hardware and software setup. \
+**Note:** To replicate this demo, you can use any equivalent hardware that meets the computational requirements.
+
+### Hardware Requirements
+The demo was tested with the following hardware setup.
+- **GPU**: NVIDIA RTX A6000, NVIDIA RTX PRO 6000 Blackwell
+
+### Software Requirements
+- **OS**: Linux
+- **Python**: 3.12.3+
+- **CUDA Version**: 12.8+
+- **Lingua Franca**: 0.10.1
+
+Make sure the environment is properly configured to use CUDA for optimal GPU acceleration.
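+
+A quick sanity check that PyTorch can see the GPU (run inside the activated environment):
+```
+python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
+```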
+
+# Files and directories in this repository
+ - **`llm_base_class.lf`** - Contains the base reactors LlmA, LlmB, and Judge.
+ - **`llm_quiz_game.lf`** - Lingua Franca program that defines the quiz game reactors (LLM agent A, LLM agent B and Judge).
+
+# Execution Workflow
+
+### Step 1: Compile the program
+Compile **`llm_quiz_game.lf`** with the Lingua Franca compiler (`lfc`).
+
+**Note:**
+- Ensure that you specify the correct file paths
+
+Run the following command:
+
+```
+lfc src/llm_quiz_game.lf
+```
+
+### Step 2: Run the binary and enter a quiz question
+Run the following command:
+
+```
+./bin/llm_quiz_game
+```
+
+The program will prompt you to enter a quiz question from the keyboard.
+
+Example output printed on the terminal:
+
+
+ +-------------------------------------------------- +******* Using Python version: 3.12.3 +[LlmA] Loading Llama-2-7B chat model +Loading checkpoint shards: 100%|| 2/2 [00:09<00:00, 4.61s/it] +[LlmA] 7B model ready. +[LlmB] Loading Llama-2-70B chat model +Loading checkpoint shards: 100%|| 15/15 [01:36<00:00, 6.40s/it] +[LlmB] 70B model ready. +---- System clock resolution: 1 nsec +---- Start execution on Tue Dec 02 13:57:35 2025 ---- plus 38464851 nanoseconds +Enter the quiz question +What is the capital of South Korea? +Enter the quiz question +Query: What is the capital of South Korea? +waiting... +Winner: LLM-B | logical 0 ms | physical 2521 ms +Answer: Seoul. +-------------------------------------------------- + ++ +# Contributors +- Deeksha Prahlad (dprahlad@asu.edu), Ph.D. student at Arizona State University +- Hokeun Kim (hokeun@asu.edu, https://hokeun.github.io/), Assistant professor at Arizona State University diff --git a/llm/img/cudaversion.png b/llm/img/cudaversion.png new file mode 100644 index 0000000..2b7e874 Binary files /dev/null and b/llm/img/cudaversion.png differ diff --git a/llm/img/pytorch.png b/llm/img/pytorch.png new file mode 100644 index 0000000..3ecd8af Binary files /dev/null and b/llm/img/pytorch.png differ diff --git a/llm/requirements.txt b/llm/requirements.txt new file mode 100644 index 0000000..c8a18f7 --- /dev/null +++ b/llm/requirements.txt @@ -0,0 +1,5 @@ +accelerate +transformers +tokenizers +bitsandbytes>=0.43.0 + diff --git a/llm/src/agents/llm.py b/llm/src/agents/llm.py new file mode 100644 index 0000000..4ef69f2 --- /dev/null +++ b/llm/src/agents/llm.py @@ -0,0 +1,89 @@ +### Import Libraries +import transformers +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig +from torch import cuda, bfloat16 + + +### Model to be chosen to act as an agent +model_id = "meta-llama/Llama-2-7b-chat-hf" +model_id_2 = "meta-llama/Llama-2-70b-chat-hf" + +### To check if there is GPU and convert it into float 16 +has_cuda = torch.cuda.is_available() +dtype = torch.bfloat16 if has_cuda else torch.float32 + +### To convert the model into 4bit quantization +bnb_config = None +### if there is cuda then the model is converted to 4bit quantization +if has_cuda: + try: + import bitsandbytes as bnb + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + except Exception: + bnb_config = None + +### calling pre-trained tokenizer +tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) +tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, use_fast=True) +for tok in (tokenizer, tokenizer_2): + if tok.pad_token_id is None: + tok.pad_token = tok.eos_token + +### since both the models have same device map and using 4bit quantization for both +common = dict( + device_map="auto" if has_cuda else None, + torch_dtype=dtype, # Changed from dtype=dtype (correct arg name) + low_cpu_mem_usage=True, +) +if bnb_config is not None: + common["quantization_config"] = bnb_config + +### calling pre-trained model +model = AutoModelForCausalLM.from_pretrained(model_id, **common) +model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, **common) +model.eval(); model_2.eval() + + +### arguments for both the models +GEN_A = dict(max_new_tokens=24, do_sample=False, temperature=0.1, + eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id) +GEN_B = dict(max_new_tokens=24, do_sample=False, temperature=0.1, + 
eos_token_id=tokenizer_2.eos_token_id, pad_token_id=tokenizer_2.pad_token_id) + +###to resturn only one line answers +def postprocess(text: str) -> str: + t = text.strip() + for sep in ["\n", ". ", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + +###Calling agent1 from .lf code +def agent1(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer(prompt, return_tensors="pt") + if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = model.generate(**inputs, **GEN_A) + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True) + return postprocess(result) + +###Calling agent2 from .lf code +def agent2(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer_2(prompt, return_tensors="pt") + if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = model_2.generate(**inputs, **GEN_B) + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) + return postprocess(result) diff --git a/llm/src/agents/llm_a.py b/llm/src/agents/llm_a.py new file mode 100644 index 0000000..0126e48 --- /dev/null +++ b/llm/src/agents/llm_a.py @@ -0,0 +1,78 @@ +# llm_a.py + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +#Model +model_id = "meta-llama/Llama-2-7b-chat-hf" + + +has_cuda = torch.cuda.is_available() +if not has_cuda: + raise RuntimeError("CUDA GPU required for this configuration.") +dtype = torch.bfloat16 if has_cuda else torch.float32 + +#4-bit quantization +bnb_config = None +if has_cuda: + try: + import bitsandbytes as bnb + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + except Exception: + bnb_config = None + +#Tokenizer and the token is automatically used if logged in via CLI +tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) +if tokenizer.pad_token_id is None: + tokenizer.pad_token = tokenizer.eos_token + + +common = dict( + device_map="auto" if has_cuda else None, + torch_dtype=dtype, + low_cpu_mem_usage=True, +) + +if bnb_config is not None: + common["quantization_config"] = bnb_config + +#model +model = AutoModelForCausalLM.from_pretrained(model_id, **common) +model.eval() + +#Generation +GEN_A = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id +) + +#post-processing +def postprocess(text: str) -> str: + t = text.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + +#Agent 1 +def agent1(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer(prompt, return_tensors="pt") + if has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = model.generate(**inputs, **GEN_A) + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True) + print(result) + return postprocess(result) diff --git a/llm/src/agents/llm_b.py b/llm/src/agents/llm_b.py new file mode 100644 index 0000000..9ae257f --- /dev/null +++ b/llm/src/agents/llm_b.py @@ -0,0 +1,81 @@ +# llm_b.py + +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +#Model +model_id_2 = "meta-llama/Llama-2-70b-chat-hf" + +#Requires the GPU for this model +has_cuda = torch.cuda.is_available() +if not has_cuda: + raise RuntimeError("CUDA GPU required for this configuration.") +dtype = torch.bfloat16 if has_cuda else torch.float32 + +#4-bit quantization +bnb_config = None +if has_cuda: + try: + import bitsandbytes as bnb + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + except Exception: + bnb_config = None + +#Tokenizer and the token automatically used if logged in via CLI +tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, use_fast=True) +if tokenizer_2.pad_token_id is None: + tokenizer_2.pad_token = tokenizer_2.eos_token + + +common = dict( + device_map="auto" if has_cuda else None, + torch_dtype=dtype, + low_cpu_mem_usage=True, +) + +if bnb_config is not None: + common["quantization_config"] = bnb_config + +#Model +model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, **common) +model_2.eval() + +#Generation +GEN_B = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=tokenizer_2.eos_token_id, + pad_token_id=tokenizer_2.pad_token_id, +) + +#Post-processing +def postprocess(text: str) -> str: + t = text.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + +#Agent 2 +def agent2(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer_2(prompt, return_tensors="pt") + + if has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + + with torch.no_grad(): + out = model_2.generate(**inputs, **GEN_B) + + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) + print(result) + return postprocess(result) diff --git a/llm/src/agents/llm_b_jetson.py b/llm/src/agents/llm_b_jetson.py new file mode 100644 index 0000000..8ac042f --- /dev/null +++ b/llm/src/agents/llm_b_jetson.py @@ -0,0 +1,48 @@ +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer + + +# Model ID +model_id = "meta-llama/Llama-3.2-1B" + +# Check GPU availability +has_cuda = torch.cuda.is_available() +device = torch.device("cuda" if has_cuda else "cpu") +compute_dtype = torch.float16 if has_cuda else torch.float32 + + +common = dict( + low_cpu_mem_usage=True, + attn_implementation="eager", +) + +#Load tokenizer and the token automatically used from CLI login +tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) +if tokenizer.pad_token_id is None: + tokenizer.pad_token = tokenizer.eos_token + +#Load model +mp_kwargs = dict(torch_dtype=compute_dtype, **common) +model = AutoModelForCausalLM.from_pretrained(model_id, **mp_kwargs) +model.to(device) +model.eval() + +#Generation +GEN = dict( + max_new_tokens=64, + do_sample=True, + temperature=0.7, + top_p=0.95, + eos_token_id=tokenizer.eos_token_id, + pad_token_id=tokenizer.pad_token_id, +) + +#Agent 2 +def agent2(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer(prompt, return_tensors="pt").to(device) + with torch.inference_mode(): + out = model.generate(**inputs, **GEN) + gen = out[0, inputs["input_ids"].shape[1]:] + return tokenizer.decode(gen, skip_special_tokens=True).strip() + diff --git a/llm/src/agents/llm_b_m2.py b/llm/src/agents/llm_b_m2.py new file mode 100644 index 0000000..aa699ec --- /dev/null +++ b/llm/src/agents/llm_b_m2.py @@ -0,0 +1,92 @@ +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig + +#Model +model_id_2 = "google/gemma-3-270m" + +#Device setup +has_cuda = torch.cuda.is_available() +has_mps = torch.backends.mps.is_available() + +if has_cuda: + device = torch.device("cuda") + compute_dtype = torch.float16 +elif has_mps: + device = torch.device("mps") + compute_dtype = torch.float32 +else: + device = torch.device("cpu") + compute_dtype = torch.float32 + +#Common model kwargs +common = dict( + low_cpu_mem_usage=True, + attn_implementation="eager" +) + +#4-bit quantization on CUDA if available +if has_cuda: + try: + import bitsandbytes as bnb + common["quantization_config"] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=compute_dtype, + ) + common["device_map"] = "auto" + except Exception: + print("[WARN] bitsandbytes not available; using full-precision fp16 on CUDA.", flush=True) + common["device_map"] = "auto" +else: + common["device_map"] = None + +#Tokenizer and the token automatically used if logged in via CLI +tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, use_fast=True) +if tokenizer_2.pad_token_id is None: + tokenizer_2.pad_token = tokenizer_2.eos_token + +# Model +mp_kwargs = 
dict(dtype=compute_dtype, **common) +model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, **mp_kwargs) + + +if not has_cuda: + model_2.to(device) +model_2.eval() + +# Generation +GEN_B = dict( + max_new_tokens=32, + do_sample=True, + eos_token_id=tokenizer_2.eos_token_id, + pad_token_id=tokenizer_2.pad_token_id, +) + +def postprocess(text: str) -> str: + t = text.strip() + for sep in ["\n", ". ", " "]: + i = t.find(sep) + if i > 0: + t = t[:i] + break + return t.strip().strip(":").strip() + +def agent2(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer_2(prompt, return_tensors="pt") + + if has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + elif has_mps: + inputs = {k: v.to("mps") for k, v in inputs.items()} + else: + inputs = {k: v.to("cpu") for k, v in inputs.items()} + + with torch.inference_mode(): + out = model_2.generate(**inputs, **GEN_B) + + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) + print(result) + return postprocess(result) diff --git a/llm/src/agents/llm_small.py b/llm/src/agents/llm_small.py new file mode 100644 index 0000000..0a1c985 --- /dev/null +++ b/llm/src/agents/llm_small.py @@ -0,0 +1,89 @@ +### Import Libraries +import transformers +import torch +from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig +from torch import cuda, bfloat16 + + +### Model to be chosen to act as an agent +model_id = "microsoft/Phi-3.5-mini-instruct" +model_id_2 = "EleutherAI/pythia-70m" + +### To check if there is GPU and convert it into float 16 +has_cuda = torch.cuda.is_available() +dtype = torch.bfloat16 if has_cuda else torch.float32 + +### To convert the model into 4bit quantization +bnb_config = None +### if there is cuda then the model is converted to 4bit quantization +if has_cuda: + try: + import bitsandbytes as bnb + bnb_config = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + except Exception: + bnb_config = None + +### calling pre-trained tokenizer +tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True) +tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, use_fast=True) +for tok in (tokenizer, tokenizer_2): + if tok.pad_token_id is None: + tok.pad_token = tok.eos_token + +### since both the models have same device map and using 4bit quantization for both +common = dict( + device_map="auto" if has_cuda else None, + dtype=dtype, + low_cpu_mem_usage=True, +) +if bnb_config is not None: + common["quantization_config"] = bnb_config + +### calling pre-trained model +model = AutoModelForCausalLM.from_pretrained(model_id, **common) +model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, **common) +model.eval(); model_2.eval() + + +### arguments for both the models +GEN_A = dict(max_new_tokens=24, do_sample=False, temperature=0.1, + eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id) +GEN_B = dict(max_new_tokens=24, do_sample=False, temperature=0.1, + eos_token_id=tokenizer_2.eos_token_id, pad_token_id=tokenizer_2.pad_token_id) + +###to resturn only one line answers +def postprocess(text: str) -> str: + t = text.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + +###Calling agent1 from .lf code +def agent1(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer(prompt, return_tensors="pt") + if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = model.generate(**inputs, **GEN_A) + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True) + return postprocess(result) + +###Calling agent2 from .lf code +def agent2(q: str) -> str: + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + inputs = tokenizer_2(prompt, return_tensors="pt") + if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = model_2.generate(**inputs, **GEN_B) + prompt_len = inputs["input_ids"].shape[1] + result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) + return postprocess(result) diff --git a/llm/src/federated/README.md b/llm/src/federated/README.md new file mode 100644 index 0000000..23cc689 --- /dev/null +++ b/llm/src/federated/README.md @@ -0,0 +1,158 @@ +# LLM Demo (Federated Execution) Overview + +This is a quiz-style game between two LLM agents using federated execution. For each user question asked to the Judge, both agents answer in parallel. The Judge announces whichever answer arrives first (or a timeout if neither responds within 60 sec), and prints per-question elapsed logical and physical times. There are three federates (federate__llma, federate__llmb, federate__j) and an RTI. + +# Pre-requisites + +You need Python >= 3.10 installed. + +## Library Dependencies +To run this project, there are dependencies required which are in [requirements.txt](requirements.txt) file. The model used in this repository has been quantized using 4-bit precision (bnb_4bit) and relies on bitsandbytes for efficient matrix operations and memory optimization. So specific versions of bitsandbytes, torch, and torchvision are mandatory for compatibility. +While newer versions of other dependencies may work, the specific versions listed below have been tested and are recommended for optimal performance. +It is highly recommended to create a Python virtual environment or a Conda environment to manage dependencies. \ +To create the a virtual environment follow the steps below. + +### Step 1: Creating environment +Replace this <> with the environment name +``` +python3 -m venv
+
+ With the correct CUDA version identified, install PyTorch from the [PyTorch](https://pytorch.org/get-started/locally/) "Get Started" page by selecting the correct OS and compute platform (for example, a Linux system with CUDA 12.8).
+
+### Step 3: Model Dependencies
+- **Pre-trained models used in `agents/llm_a.py` and `agents/llm_b.py`**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) \
+**Note:** Follow the steps below to obtain access and an authentication token for the Hugging Face models.
+1. Create a user access token by following the official documentation: [User access tokens](https://huggingface.co/docs/hub/en/security-tokens)
+2. Log in using the Hugging Face CLI by running `huggingface-cli login`. Refer to the official documentation for step-by-step instructions: [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
+3. The Llama models are gated, so you must request access before using them for the first time. Open these links and apply for access: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
+
+## System Requirements
+
+The demo was developed and tested with the following hardware and software setup. \
+**Note:** To replicate this demo, you can use any equivalent hardware that meets the computational requirements.
+
+### Hardware Requirements
+The demo was tested with the following hardware setup.
+- **GPU**: NVIDIA RTX A6000, NVIDIA RTX PRO 6000 Blackwell
+
+### Software Requirements
+- **OS**: Linux
+- **Python**: 3.12.3+
+- **CUDA Version**: 12.8+
+- **Lingua Franca**: 0.10.1
+
+Make sure the environment is properly configured to use CUDA for optimal GPU acceleration.
+
+# Files and directories in this repository
+ - **`llm_base_class_federate.lf`** - Contains the base reactors LlmA, LlmB and Judge.
+ - **`llm_game_federated.lf`** - Lingua Franca program that defines the quiz game as federated execution.
+
+# Execution Workflow
+
+### Step 1: Compile the program
+Before compiling, specify the RTI host by setting its IP address in the federated reactor declaration:
+```
+federated reactor llm_game_federated at 10.xxx.xxx.xx {
+}
+```
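+
+To find the IP address of the machine that will run the RTI on Linux, a command such as the following can be used:
+```
+hostname -I
+```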
+
+Then compile **`llm_game_federated.lf`**.
+
+**Note:**
+- Ensure that you specify the correct file paths
+
+Run the following command:
+
+```
+lfc src/federated/llm_game_federated.lf
+```
+
+### Step 2: Run the RTI and the federates, then enter a quiz question
+First, change into the generated federation directory:
+
+```
+cd fed-gen/llm_game_federated/
+```
+
+In the first terminal run:
+```
+./bin/RTI -n 3
+```
+In the second terminal run:
+```
+./bin/federate__j
+```
+In the third terminal run:
+```
+./bin/federate__llma
+```
+In the fourth terminal run:
+```
+./bin/federate__llmb
+```
+
+The program will prompt you to enter a quiz question from the keyboard in the terminal running `federate__j`.
+
+Example output printed on the terminal where federate__j is running:
+
++ +-------------------------------------------------- +******* Using Python version: 3.12.3 +---- System clock resolution: 1 nsec +---- Start execution on Tue Dec 02 14:31:36 2025 ---- plus 537640559 nanoseconds +Fed 0 (j): Connected to 10.218.100.78:15045. +Fed 0 (j): Starting timestamp is: 1764711104560384525. +[Judge] Waiting for models +[Judge] Ready +Enter the quiz question (or 'quit') +What is the opposite of tall? +Enter the quiz question (or 'quit') + +Query: What is the opposite of tall? + +waiting... + +Winner: LLM-A | logical 0 ms | physical 378 ms +A: The opposite of tall is short. +-------------------------------------------------- + ++ +# Contributors +- Deeksha Prahlad (dprahlad@asu.edu), Ph.D. student at Arizona State University +- Hokeun Kim (hokeun@asu.edu, https://hokeun.github.io/), Assistant professor at Arizona State University + diff --git a/llm/src/federated/llm_base_class_federate.lf b/llm/src/federated/llm_base_class_federate.lf new file mode 100644 index 0000000..a7ae791 --- /dev/null +++ b/llm/src/federated/llm_base_class_federate.lf @@ -0,0 +1,344 @@ +/** + * This program implements a simple LF-based quiz between two LLM agents + * and a Judge reactor that measures latency. + * + * LlmA loads a Llama-2-7B chat model. + * LlmB loads a Llama-2-70B chat model. + * Both use optional 4-bit quantization (bitsandbytes) and, run on a CUDA GPU. + * + * Each Llm reactor: + * Initializes its tokenizer and model once in the preamble. + * Spawns a background thread per query to call model.generate(). + * Cleans the decoded text into a short, one-line answer. + * Uses a logical action (done) to notify that the answer is ready and sets its output port. + * + * The Judge reactor: + * Reads user queries. + * Broadcasts each query on ask to both LlmA and LlmB. + * Reads logical and physical timestamps when the question is issued. + * Declares the winner as the LLM that responds first and prints its latency and answer. + * Triggers a 60 s timeout if neither LLM responds, and terminates the program when the user types "quit". 
+ * + * @author Deeksha Prahlad + */ +target Python { keepalive: true } + +preamble {= + import threading + import torch + from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig +=} + +reactor LlmA { + state th + state running = False + state out_buffer = "" + state ready = False + + input user_in + output answer + output ready_out + logical action done + + reaction(startup) -> ready_out {= + print("[LlmA] Loading 7B", flush=True) + + self.has_cuda = torch.cuda.is_available() + self.dtype = torch.bfloat16 if self.has_cuda else torch.float32 + + model_id = "meta-llama/Llama-2-7b-chat-hf" + + self.tokenizer_a = AutoTokenizer.from_pretrained(model_id, use_fast=True) + if self.tokenizer_a.pad_token_id is None: + self.tokenizer_a.pad_token = self.tokenizer_a.eos_token + + common_a = dict( + device_map="auto" if self.has_cuda else None, + torch_dtype=self.dtype, + low_cpu_mem_usage=True, + ) + + try: + import bitsandbytes as bnb + common_a["quantization_config"] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=self.dtype, + ) + except Exception: + pass + + self.model_a = AutoModelForCausalLM.from_pretrained(model_id, **common_a) + self.model_a.eval() + + self.GEN_A = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=self.tokenizer_a.eos_token_id, + pad_token_id=self.tokenizer_a.pad_token_id, + ) + + print("[LlmA] Ready.", flush=True) + self.ready = True + ready_out.set(True) + =} + + reaction(user_in) -> done {= + if not self.ready: + return + if self.running: + return + self.running = True + query = user_in.value + + def worker(): + try: + prompt = f"You are a concise Q&A assistant.\n\n{query}\n" + inputs = self.tokenizer_a(prompt, return_tensors="pt") + if self.has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = self.model_a.generate(**inputs, **self.GEN_A) + plen = inputs["input_ids"].shape[1] + txt = self.tokenizer_a.decode(out[0][plen:], skip_special_tokens=True) + t = txt.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + self.out_buffer = t.strip().strip(":").strip() + finally: + done.schedule(0) + + self.th = threading.Thread(target=worker, daemon=True) + self.th.start() + =} + + reaction(done) -> answer {= + self.running = False + answer.set(self.out_buffer) + =} +} + +reactor LlmB { + state th + state running = False + state out_buffer = "" + state ready = False + + input user_in + output answer + output ready_out + logical action done + + reaction(startup) -> ready_out {= + print("[LlmB] Loading 70B", flush=True) + + self.has_cuda = torch.cuda.is_available() + self.dtype = torch.bfloat16 if self.has_cuda else torch.float32 + + model_id = "meta-llama/Llama-2-70b-chat-hf" + + self.tokenizer_b = AutoTokenizer.from_pretrained(model_id, use_fast=True) + if self.tokenizer_b.pad_token_id is None: + self.tokenizer_b.pad_token = self.tokenizer_b.eos_token + + common_b = dict( + device_map="auto" if self.has_cuda else None, + torch_dtype=self.dtype, + low_cpu_mem_usage=True, + ) + + try: + import bitsandbytes as bnb + common_b["quantization_config"] = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=self.dtype, + ) + except Exception: + pass + + self.model_b = AutoModelForCausalLM.from_pretrained(model_id, **common_b) + self.model_b.eval() + + self.GEN_B = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=self.tokenizer_b.eos_token_id, + pad_token_id=self.tokenizer_b.pad_token_id, + ) + + print("[LlmB] Ready.", flush=True) + self.ready = True + ready_out.set(True) + =} + + reaction(user_in) -> done {= + if not self.ready: + return + if self.running: + return + self.running = True + query = user_in.value + + def worker(): + try: + prompt = f"You are a concise Q&A assistant.\n\n{query}\n" + inputs = self.tokenizer_b(prompt, return_tensors="pt") + if self.has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + with torch.no_grad(): + out = self.model_b.generate(**inputs, **self.GEN_B) + plen = inputs["input_ids"].shape[1] + txt = self.tokenizer_b.decode(out[0][plen:], skip_special_tokens=True) + t = txt.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + self.out_buffer = t.strip().strip(":").strip() + finally: + done.schedule(0) + + self.th = threading.Thread(target=worker, daemon=True) + self.th.start() + =} + + reaction(done) -> answer {= + self.running = False + answer.set(self.out_buffer) + =} +} + +reactor Judge { + state th + state reader_started = False + state terminate = False + state eof = False + state buffer = "" + state waiting = False + state logical_base_time = 0 + state physical_base_time = 0 + state a_ready = False + state b_ready = False + + input ready_a + input ready_b + input llma + input llmb + + output ask + output quit + + logical action line + logical action tick + physical action timeout(60 sec) + + reaction(startup) {= + print("[Judge] Waiting for models", flush=True) + =} + + reaction(ready_a) -> line {= + self.a_ready = True + if self.a_ready and self.b_ready and not self.reader_started: + import threading + def reader(): + while not self.terminate: + s = input("Enter the quiz question (or 'quit')\n") + if s == "" or s.lower().strip() == "quit": + self.eof = True + line.schedule(0) + break + self.buffer = s + line.schedule(1) + self.reader_started = True + print("[Judge] Ready", flush=True) + self.th = threading.Thread(target=reader, daemon=True) + self.th.start() + =} + + reaction(ready_b) -> line {= + self.b_ready = True + if self.a_ready and self.b_ready and not self.reader_started: + import threading + def reader(): + while not self.terminate: + s = input("Enter the quiz question (or 'quit')\n") + if s == "" or s.lower().strip() == "quit": + self.eof = True + line.schedule(0) + break + self.buffer = s + line.schedule(1) + self.reader_started = True + print("[Judge] Ready", flush=True) + self.th = threading.Thread(target=reader, daemon=True) + self.th.start() + =} + + reaction(line) -> tick, ask, timeout, quit {= + if self.eof: + quit.set() + environment().sync_shutdown() + else: + self.waiting = True + self.logical_base_time = lf.time.logical_elapsed() + self.physical_base_time = lf.time.physical_elapsed() + timeout.schedule(0) + print(f"\n\n\nQuery: {self.buffer}\n", flush=True) + print("waiting...\n", flush=True) + tick.schedule(5) + =} + + reaction(tick) -> ask {= + ask.set(self.buffer) + =} + + reaction(llma) {= + if not self.waiting: + return + self.waiting = False + ln = lf.time.logical_elapsed() + pn = lf.time.physical_elapsed() + lm = int((ln - self.logical_base_time)/1000000) + pm = int((pn - self.physical_base_time)/1000000) + print(f"Winner: LLM-A | logical {lm} ms | physical {pm} ms", flush=True) + print(f"{llma.value}", flush=True) + =} + + reaction(llmb) {= + if not self.waiting: + return + self.waiting = False + ln = lf.time.logical_elapsed() + pn = lf.time.physical_elapsed() + lm = int((ln - self.logical_base_time)/1000000) + pm = int((pn - self.physical_base_time)/1000000) + print(f"Winner: LLM-B | logical {lm} ms | physical {pm} ms", flush=True) + print(f"{llmb.value}", flush=True) + =} + + reaction(timeout) {= + if not self.waiting: + return + self.waiting = False + ln = lf.time.logical_elapsed() + pn = lf.time.physical_elapsed() + lm = int((ln - self.logical_base_time)/1000000) + pm = int((pn - self.physical_base_time)/1000000) + print(f"TIMEOUT (60 s) | logical {lm} ms | physical {pm} ms", flush=True) + =} + + reaction(shutdown) {= + self.terminate = True + if self.th and self.th.is_alive(): + self.th.join() + =} +} diff --git a/llm/src/federated/llm_game_federated.lf b/llm/src/federated/llm_game_federated.lf new file 
mode 100644 index 0000000..fe4d7da --- /dev/null +++ b/llm/src/federated/llm_game_federated.lf @@ -0,0 +1,19 @@ +target Python { keepalive: true } + +import LlmA, LlmB, Judge from "llm_base_class_federate.lf" + +federated reactor llm_game_federated at 10.218.100.78 { + j = new Judge() + llma = new LlmA() + llmb = new LlmB() + + j.ask -> llma.user_in + j.ask -> llmb.user_in + + llma.answer -> j.llma + llmb.answer -> j.llmb + + llma.ready_out -> j.ready_a + llmb.ready_out -> j.ready_b +} + diff --git a/llm/src/llm_base_class.lf b/llm/src/llm_base_class.lf new file mode 100644 index 0000000..000479d --- /dev/null +++ b/llm/src/llm_base_class.lf @@ -0,0 +1,334 @@ +/** + * This program implements a simple LF-based quiz between two LLM agents + * and a Judge reactor that measures latency. + * + * LlmA loads a Llama-2-7B chat model. + * LlmB loads a Llama-2-70B chat model. + * Both use optional 4-bit quantization (bitsandbytes) and, run on a CUDA GPU. + * + * Each Llm reactor: + * Initializes its tokenizer and model once in the preamble. + * Spawns a background thread per query to call model.generate(). + * Cleans the decoded text into a short, one-line answer. + * Uses a logical action (done) to notify that the answer is ready and sets its output port. + * + * The Judge reactor: + * Reads user queries. + * Broadcasts each query on ask to both LlmA and LlmB. + * Reads logical and physical timestamps when the question is issued. + * Declares the winner as the LLM that responds first and prints its latency and answer. + * Triggers a 60 s timeout if neither LLM responds, and terminates the program when the user types "quit". + * + * @author Deeksha Prahlad + */ +target Python { keepalive: true } + +preamble {= + import threading + import torch + from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig +=} + +reactor LlmA { + state th + state running = False + state out_buffer = "" + + input user_in + output answer + logical action done + + preamble {= + print("[LlmA] Loading Llama-2-7B chat model", flush=True) + has_cuda = torch.cuda.is_available() + dtype = torch.bfloat16 if has_cuda else torch.float32 + + model_id_a = "meta-llama/Llama-2-7b-chat-hf" + + tokenizer_a = AutoTokenizer.from_pretrained(model_id_a, use_fast=True) + if tokenizer_a.pad_token_id is None: + tokenizer_a.pad_token = tokenizer_a.eos_token + + common_a = dict( + device_map="auto" if has_cuda else None, + torch_dtype=dtype, + low_cpu_mem_usage=True, + ) + + try: + import bitsandbytes as bnb # noqa: F401 + quant_a = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + common_a["quantization_config"] = quant_a + except Exception: + quant_a = None + + + model_a = AutoModelForCausalLM.from_pretrained(model_id_a, **common_a) + model_a.eval() + + generation_a = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=tokenizer_a.eos_token_id, + pad_token_id=tokenizer_a.pad_token_id, + ) + + def run_llm_a(self, q: str) -> str: + """Bound method: uses class-level model/tokenizer.""" + cls = type(self) + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + + tok = cls.tokenizer_a + model = cls.model_a + generation_a = cls.generation_a + has_cuda = cls.has_cuda + + inputs = tok(prompt, return_tensors="pt") + if has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + + with torch.no_grad(): + out = model.generate(**inputs, **generation_a) + + prompt_len = inputs["input_ids"].shape[1] + result 
= tok.decode(out[0][prompt_len:], skip_special_tokens=True) + + + t = result.strip() + for sep in ["\n", ". ", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + + print("[LlmA] 7B model ready.", flush=True) + =} + + reaction(user_in) -> done {= + if self.running: + return + self.running = True + query = user_in.value + + def worker(): + try: + self.out_buffer = self.run_llm_a(query) + finally: + done.schedule(0) + + self.th = threading.Thread(target=worker, daemon=True) + self.th.start() + =} + + reaction(done) -> answer {= + self.running = False + answer.set(self.out_buffer) + =} +} + + +reactor LlmB { + state th + state running = False + state out_buffer = "" + + input user_in + output answer + logical action done + + preamble {= + print("[LlmB] Loading Llama-2-70B chat model", flush=True) + + has_cuda = torch.cuda.is_available() + dtype = torch.bfloat16 if has_cuda else torch.float32 + + model_id_b = "meta-llama/Llama-2-70b-chat-hf" + + tokenizer_b = AutoTokenizer.from_pretrained(model_id_b, use_fast=True) + if tokenizer_b.pad_token_id is None: + tokenizer_b.pad_token = tokenizer_b.eos_token + + common_b = dict( + device_map="auto" if has_cuda else None, + torch_dtype=dtype, + low_cpu_mem_usage=True, + ) + + try: + import bitsandbytes as bnb # noqa: F401 + quant_b = BitsAndBytesConfig( + load_in_4bit=True, + bnb_4bit_quant_type="nf4", + bnb_4bit_use_double_quant=True, + bnb_4bit_compute_dtype=dtype, + ) + common_b["quantization_config"] = quant_b + except Exception: + quant_b = None + + model_b = AutoModelForCausalLM.from_pretrained(model_id_b, **common_b) + model_b.eval() + + generation_b = dict( + max_new_tokens=24, + do_sample=False, + temperature=0.1, + eos_token_id=tokenizer_b.eos_token_id, + pad_token_id=tokenizer_b.pad_token_id, + ) + + def run_llm_b(self, q: str) -> str: + cls = type(self) + prompt = f"You are a concise Q&A assistant.\n\n{q}\n" + + tok = cls.tokenizer_b + model = cls.model_b + generation_b = cls.generation_b + has_cuda = cls.has_cuda + + inputs = tok(prompt, return_tensors="pt") + if has_cuda: + inputs = {k: v.to("cuda") for k, v in inputs.items()} + + with torch.no_grad(): + out = model.generate(**inputs, **generation_b) + + prompt_len = inputs["input_ids"].shape[1] + result = tok.decode(out[0][prompt_len:], skip_special_tokens=True) + + t = result.strip() + for sep in ["\n", ". 
", " "]: + idx = t.find(sep) + if idx > 0: + t = t[:idx] + break + return t.strip().strip(":").strip() + + print("[LlmB] 70B model ready.", flush=True) + =} + + reaction(user_in) -> done {= + if self.running: + return + self.running = True + query = user_in.value + + def worker(): + try: + self.out_buffer = self.run_llm_b(query) + finally: + done.schedule(0) + + self.th = threading.Thread(target=worker, daemon=True) + self.th.start() + =} + + reaction(done) -> answer {= + self.running = False + answer.set(self.out_buffer) + =} +} + + +reactor Judge { + state th + state terminate = False + state eof = False + state buffer = "" + + output ask + output quit + input llma + input llmb + + state waiting = False + state logical_base_time = 0 + state physical_base_time = 0 + state winner = "" + + logical action line + logical action timeout(60 sec) + + reaction(startup) -> line {= + def reader(): + while not self.terminate: + s = input("Enter the quiz question\n") + if s == "": + self.eof = True + line.schedule(0) + break + elif s.lower().strip() == "quit": + self.eof = True + line.schedule(0) + break + else: + self.buffer = s + line.schedule(1) + + self.th = threading.Thread(target=reader, daemon=True) + self.th.start() + =} + + reaction(line) -> ask, quit, timeout {= + if self.eof: + quit.set() + environment().sync_shutdown() + else: + self.waiting = True + self.winner = "" + self.logical_base_time = lf.time.logical_elapsed() + self.physical_base_time = lf.time.physical_elapsed() + timeout.schedule(0) + print(f"\n\n\nQuery: {self.buffer}\n", flush=True) + print("waiting...\n", flush=True) + ask.set(self.buffer) + =} + + reaction(llma) {= + if not self.waiting: + return + self.waiting = False + logical_now = lf.time.logical_elapsed() + physical_now = lf.time.physical_elapsed() + logical_ms = int((logical_now - self.logical_base_time) / 1000000) + physical_ms = int((physical_now - self.physical_base_time) / 1000000) + print(f" Winner: LLM-A | logical {logical_ms} ms | physical {physical_ms} ms", flush=True) + print(f"{llma.value}", flush=True) + =} + + reaction(llmb) {= + if not self.waiting: + return + self.waiting = False + logical_now = lf.time.logical_elapsed() + physical_now = lf.time.physical_elapsed() + logical_ms = int((logical_now - self.logical_base_time) / 1000000) + physical_ms = int((physical_now - self.physical_base_time) / 1000000) + print(f"Winner: LLM-B | logical {logical_ms} ms | physical {physical_ms} ms", flush=True) + print(f"{llmb.value}", flush=True) + =} + + reaction(timeout) {= + if not self.waiting: + return + self.waiting = False + logical_now = lf.time.logical_elapsed() + physical_now = lf.time.physical_elapsed() + logical_ms = int((logical_now - self.logical_base_time) / 1000000) + physical_ms = int((physical_now - self.physical_base_time) / 1000000) + print(f"TIMEOUT (60 s) | logical {logical_ms} ms | physical {physical_ms} ms", flush=True) + =} + + reaction(shutdown) {= + self.terminate = True + if self.th and self.th.is_alive(): + self.th.join() + =} +} diff --git a/llm/src/llm_quiz_game.lf b/llm/src/llm_quiz_game.lf new file mode 100644 index 0000000..105aa8f --- /dev/null +++ b/llm/src/llm_quiz_game.lf @@ -0,0 +1,26 @@ +/** + * Main LF reactor that connects three independently defined reactors: + * + * LlmA : inference for the Llama-2-7B model. + * LlmB : inference for the Llama-2-70B model. + * Judge: Reads user queries, measures timing, and decides the winner. + * This main reactor composes the LLM reactors and Judge into one coordinated system. 
+ * All timing, concurrency, and winner-determination logic reside in the sub-reactors in "llm_base_class.lf" + */ +target Python { keepalive: true } + +import LlmA from "llm_base_class.lf" +import LlmB from "llm_base_class.lf" +import Judge from "llm_base_class.lf" + +main reactor { + llma_response = new LlmA() + llmb_response = new LlmB() + j = new Judge() + + j.ask -> llma_response.user_in + j.ask -> llmb_response.user_in + llma_response.answer -> j.llma + llmb_response.answer -> j.llmb +} +