
Conversation


@JayDi11a JayDi11a commented Jan 5, 2026

What does this PR do?

This commit builds on the work done in PR #4191, which instantiates APIs to register FastAPI router factories and migrates away from the legacy @webmethod decorator system. The implementation focuses on migrating the Inference API: it updates the server and OpenAPI generation while maintaining the existing routing and inspection behavior.

The Inference API adopts the same new API package structure as the earlier Batches API migration, i.e., protocol definitions and models live in llama_stack_api/inference, while the FastAPI router implementation lives in llama_stack/core/server/routers/inference, maintaining the established separation between API contracts and server routing logic.
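
As a rough sketch of the pattern (the module paths come from this PR; the factory name, route body, and method names are illustrative assumptions, not the exact code):

# illustrative sketch of llama_stack/core/server/routers/inference
from fastapi import APIRouter

from llama_stack_api.inference import Inference  # protocol and models live in llama_stack_api


def create_inference_router(impl: Inference) -> APIRouter:
    """Build the Inference API router from a protocol implementation (hypothetical factory)."""
    router = APIRouter(tags=["Inference"])

    @router.get("/v1/chat/completions/{completion_id}")
    async def get_chat_completion(completion_id: str):
        # Delegate to the implementation; the real router also covers the
        # chat/completions, completions, embeddings, and rerank routes.
        return await impl.get_chat_completion(completion_id)

    return router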

Migrating the Inference API involved a few nuances: fixing model chunking where chunk_id values are not uniform across 100+ models, keeping chunk IDs and metadata in sync, and a broader effort toward backward compatibility, including content types. Last but not least, the Stainless config needed to be updated for the /v1/inference/rerank path.
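
For illustration, a minimal sketch of the backward-compatible chunk shape this describes (field names beyond chunk_id are assumptions based on the commit notes, not the exact model):

from pydantic import BaseModel, model_validator


class Chunk(BaseModel):
    content: str
    chunk_id: str | None = None  # optional, so legacy chunks without an ID still validate
    chunk_metadata: dict | None = None  # assumed metadata field

    @model_validator(mode="after")
    def _sync_chunk_id(self) -> "Chunk":
        # Keep the top-level chunk_id and the metadata copy in sync:
        # fall back to the metadata value when no explicit ID was given.
        if self.chunk_id is None and self.chunk_metadata:
            self.chunk_id = self.chunk_metadata.get("chunk_id")
        return self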

This implementation is an incremental migration of the Inference API to the router system while maintaining full backward compatibility with the existing webmethod-based APIs.

Test Plan

Run this from the command line; the same routes should still be served:

curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("inference") or contains("chat") or contains("completion") or contains("embedding"))'

Since the inference unit tests only import types (not routing logic), and those types are re-exported unchanged, the unit tests did not need modification. Therefore:

uv run pytest tests/integration/inference/ -vv --stack-config=http://localhost:8321
      Built llama-stack @ file:///Users/geraldtrotman/Virtualenvs/llama-stack
      Built llama-stack-api @ file:///Users/geraldtrotman/Virtualenvs/llama-stack/src/llama_stack_api
Uninstalled 2 packages in 2ms
Installed 2 packages in 2ms
================================================================================================ test session starts ================================================================================================
platform darwin -- Python 3.12.12, pytest-8.4.2, pluggy-1.6.0 -- /Users/geraldtrotman/Virtualenvs/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.12', 'Platform': 'macOS-26.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/geraldtrotman/Virtualenvs/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 86 items                                                                                                                                                                                                  

tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[None-None-None-None-None-inference:completion:sanity] SKIPPED (text_model_id empty - skipping test)               [  1%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming_suffix[None-None-None-None-None-inference:completion:suffix] SKIPPED (text_model_id empty - skipping test)        [  2%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[None-None-None-None-None-inference:completion:sanity] SKIPPED (text_model_id empty - skipping test)                   [  3%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[None-None-None-None-None] SKIPPED (text_model_id empty - skipping test)                                           [  4%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-None-None-None-None-None-inference:chat_completion:non_streaming_01] SKIPPED (text_model_id
empty - skipping test)                                                                                                                                                                                        [  5%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-None-None-None-None-None-inference:chat_completion:streaming_01] SKIPPED (text_model_id empty -
skipping test)                                                                                                                                                                                                [  6%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-None-None-None-None-None-inference:chat_completion:streaming_01] SKIPPED (text_model_id
empty - skipping test)                                                                                                                                                                                        [  8%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-None-None-None-None-None-True] SKIPPED (text_model_id empty - skipping test)                                        [  9%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-None-None-None-None-None-True] SKIPPED (text_model_id empty - skipping test)                             [ 10%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[None-None-None-None-None] SKIPPED (text_model_id empty - skipping test)                            [ 11%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_stop_sequence[None-None-None-None-None-inference:completion:stop_sequence] SKIPPED (text_model_id empty - skipping test)        [ 12%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs[None-None-None-None-None-inference:completion:log_probs] SKIPPED (text_model_id empty - skipping test)                 [ 13%]
tests/integration/inference/test_openai_completion.py::test_openai_completion_logprobs_streaming[None-None-None-None-None-inference:completion:log_probs] SKIPPED (text_model_id empty - skipping test)       [ 15%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                        [ 16%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                     [ 17%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)           [ 18%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                      [ 19%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                  [ 20%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                     [ 22%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                  [ 23%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)   [ 24%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)          [ 25%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[openai_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)              [ 26%]
tests/integration/inference/test_provider_data_routing.py::test_unregistered_model_routing_with_provider_data[None-None-None-None-None] SKIPPED (Test requires library client for provider-level patching)    [ 27%]
tests/integration/inference/test_rerank.py::test_rerank_text[None-None-None-None-None-string-query-string-items] SKIPPED (rerank_model_id empty - skipping test)                                              [ 29%]
tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-image-query-url] SKIPPED (rerank_model_id empty - skipping test)                                                       [ 30%]
tests/integration/inference/test_rerank.py::test_rerank_max_results[None-None-None-None-None] SKIPPED (rerank_model_id empty - skipping test)                                                                 [ 31%]
tests/integration/inference/test_rerank.py::test_rerank_max_results_larger_than_items[None-None-None-None-None] SKIPPED (rerank_model_id empty - skipping test)                                               [ 32%]
tests/integration/inference/test_rerank.py::test_rerank_semantic_correctness[None-None-None-None-None-What is a reranking model? -items0-A reranking model reranks a list of items based on the query. ] SKIPPED [ 33%]
tests/integration/inference/test_tools_with_schemas.py::TestOpenAICompatibility::test_openai_chat_completion_with_tools[openai_client-None-None-None-None-None] SKIPPED (text_model_id empty - skipping test) [ 34%]
tests/integration/inference/test_tools_with_schemas.py::TestOpenAICompatibility::test_openai_format_preserves_complex_schemas[openai_client-None-None-None-None-None] SKIPPED (text_model_id empty - skipping
test)                                                                                                                                                                                                         [ 36%]
tests/integration/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[None-None-None-None-None] SKIPPED (vision_model_id empty - skipping test)                                      [ 37%]
tests/integration/inference/test_vision_inference.py::test_image_chat_completion_multiple_images[None-None-None-None-None-True] SKIPPED (vision_model_id empty - skipping test)                               [ 38%]
tests/integration/inference/test_vision_inference.py::test_image_chat_completion_streaming[None-None-None-None-None] SKIPPED (vision_model_id empty - skipping test)                                          [ 39%]
tests/integration/inference/test_vision_inference.py::test_image_chat_completion_base64[None-None-None-None-None] SKIPPED (vision_model_id empty - skipping test)                                             [ 40%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools[None-inference:chat_completion:tool_calling] SKIPPED (text_model_id empty - skipping test)                      [ 41%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tools_and_streaming[None-inference:chat_completion:tool_calling] SKIPPED (text_model_id empty - skipping test)        [ 43%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_with_tool_choice_none[None-inference:chat_completion:tool_calling] SKIPPED (text_model_id empty - skipping test)           [ 44%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_structured_output[None-inference:chat_completion:structured_output] SKIPPED (text_model_id empty - skipping test)          [ 45%]
tests/integration/inference/test_tools_with_schemas.py::TestChatCompletionWithTools::test_simple_tool_call[None] SKIPPED (text_model_id empty - skipping test)                                                [ 46%]
tests/integration/inference/test_tools_with_schemas.py::TestChatCompletionWithTools::test_tool_with_complex_schema[None] SKIPPED (text_model_id empty - skipping test)                                        [ 47%]
tests/integration/inference/test_tools_with_schemas.py::TestMCPToolsInChatCompletion::test_mcp_tools_in_inference[None] SKIPPED (text_model_id empty - skipping test)                                         [ 48%]
tests/integration/inference/test_tools_with_schemas.py::TestProviderSpecificBehavior::test_openai_provider_drops_output_schema[None] SKIPPED (text_model_id empty - skipping test)                            [ 50%]
tests/integration/inference/test_tools_with_schemas.py::TestStreamingWithTools::test_streaming_tool_calls[None] SKIPPED (text_model_id empty - skipping test)                                                 [ 51%]
tests/integration/inference/test_tools_with_schemas.py::TestEdgeCases::test_tool_without_schema[None] SKIPPED (text_model_id empty - skipping test)                                                           [ 52%]
tests/integration/inference/test_tools_with_schemas.py::TestEdgeCases::test_multiple_tools_with_different_schemas[None] SKIPPED (text_model_id empty - skipping test)                                         [ 53%]
tests/integration/inference/test_openai_vision_inference.py::test_openai_chat_completion_image_url[None] SKIPPED (vision_model_id empty - skipping test)                                                      [ 54%]
tests/integration/inference/test_openai_vision_inference.py::test_openai_chat_completion_image_data[None] SKIPPED (vision_model_id empty - skipping test)                                                     [ 55%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-None-None-None-None-None-inference:chat_completion:non_streaming_02] SKIPPED (text_model_id
empty - skipping test)                                                                                                                                                                                        [ 56%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-None-None-None-None-None-inference:chat_completion:streaming_02] SKIPPED (text_model_id empty -
skipping test)                                                                                                                                                                                                [ 58%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-None-None-None-None-None-inference:chat_completion:streaming_02] SKIPPED (text_model_id
empty - skipping test)                                                                                                                                                                                        [ 59%]
tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-None-None-None-None-None-False] SKIPPED (text_model_id empty - skipping test)                                       [ 60%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-None-None-None-None-None-False] SKIPPED (text_model_id empty - skipping test)                            [ 61%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                   [ 62%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                [ 63%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)      [ 65%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                 [ 66%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)             [ 67%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)                [ 68%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)             [ 69%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping
test)                                                                                                                                                                                                         [ 70%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)     [ 72%]
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[llama_stack_client-None-None-None-None-None] SKIPPED (embedding_model_id empty - skipping test)         [ 73%]
tests/integration/inference/test_rerank.py::test_rerank_text[None-None-None-None-None-text-query-text-items] SKIPPED (rerank_model_id empty - skipping test)                                                  [ 74%]
tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-image-query-base64] SKIPPED (rerank_model_id empty - skipping test)                                                    [ 75%]
tests/integration/inference/test_rerank.py::test_rerank_semantic_correctness[None-None-None-None-None-What is C++?-items1-C++ is a programming language. ] SKIPPED (rerank_model_id empty - skipping test)    [ 76%]
tests/integration/inference/test_tools_with_schemas.py::TestOpenAICompatibility::test_openai_chat_completion_with_tools[client_with_models-None-None-None-None-None] SKIPPED (text_model_id empty - skipping
test)                                                                                                                                                                                                         [ 77%]
tests/integration/inference/test_tools_with_schemas.py::TestOpenAICompatibility::test_openai_format_preserves_complex_schemas[client_with_models-None-None-None-None-None] SKIPPED (text_model_id empty -
skipping test)                                                                                                                                                                                                [ 79%]
tests/integration/inference/test_vision_inference.py::test_image_chat_completion_multiple_images[None-None-None-None-None-False] SKIPPED (vision_model_id empty - skipping test)                              [ 80%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-None-None-None-None-None-inference:chat_completion:non_streaming_01] SKIPPED              [ 81%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-None-None-None-None-None-inference:chat_completion:streaming_01] SKIPPED (text_model_id empty
- skipping test)                                                                                                                                                                                              [ 82%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-None-None-None-None-None-inference:chat_completion:streaming_01] SKIPPED               [ 83%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-None-None-None-None-None-True] SKIPPED (text_model_id empty - skipping test)                                   [ 84%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-None-None-None-None-None-True] SKIPPED (text_model_id empty - skipping test)                        [ 86%]
tests/integration/inference/test_rerank.py::test_rerank_text[None-None-None-None-None-mixed-content-1] SKIPPED (rerank_model_id empty - skipping test)                                                        [ 87%]
tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-text-query-image-item] SKIPPED (rerank_model_id empty - skipping test)                                                 [ 88%]
tests/integration/inference/test_rerank.py::test_rerank_semantic_correctness[None-None-None-None-None-What are good learning habits? -items2-Good learning habits include reading daily and taking notes. ] SKIPPED [ 89%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-None-None-None-None-None-inference:chat_completion:non_streaming_02] SKIPPED              [ 90%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-None-None-None-None-None-inference:chat_completion:streaming_02] SKIPPED (text_model_id empty
- skipping test)                                                                                                                                                                                              [ 91%]
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-None-None-None-None-None-inference:chat_completion:streaming_02] SKIPPED               [ 93%]
tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-None-None-None-None-None-False] SKIPPED (text_model_id empty - skipping test)                                  [ 94%]
tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-None-None-None-None-None-False] SKIPPED (text_model_id empty - skipping test)                       [ 95%]
tests/integration/inference/test_rerank.py::test_rerank_text[None-None-None-None-None-mixed-content-2] SKIPPED (rerank_model_id empty - skipping test)                                                        [ 96%]
tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-mixed-content-1] SKIPPED (rerank_model_id empty - skipping test)                                                       [ 97%]
tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-mixed-content-2] SKIPPED (rerank_model_id empty - skipping test)                                                       [ 98%]
tests/integration/inference/test_tools_with_schemas.py::TestProviderSpecificBehavior::test_gemini_array_support PASSED                                                                                        [100%]

=============================================================================================== slowest 10 durations ================================================================================================
0.11s setup    tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[None-None-None-None-None-inference:completion:sanity]
0.00s setup    tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[None-None-None-None-None]
0.00s setup    tests/integration/inference/test_rerank.py::test_rerank_text[None-None-None-None-None-string-query-string-items]
0.00s teardown tests/integration/inference/test_tools_with_schemas.py::TestProviderSpecificBehavior::test_gemini_array_support
0.00s setup    tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[openai_client-None-None-None-None-None]
0.00s setup    tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-None-None-None-None-None-inference:chat_completion:non_streaming_01]
0.00s setup    tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[llama_stack_client-None-None-None-None-None]
0.00s teardown tests/integration/inference/test_rerank.py::test_rerank_max_results[None-None-None-None-None]
0.00s setup    tests/integration/inference/test_rerank.py::test_rerank_image[None-None-None-None-None-image-query-base64]
0.00s setup    tests/integration/inference/test_provider_data_routing.py::test_unregistered_model_routing_with_provider_data[None-None-None-None-None]
===================================================================================== 1 passed, 85 skipped, 1 warning in 0.19s ======================================================================================

Lastly, after each migration is completed, the server is tested against demo_script.py from the getting-started documentation, as well as the inference.py, agent.py, and rag_agent.py examples built from the detailed_tutorial.mdx documentation.


meta-cla bot commented Jan 5, 2026

Hi @JayDi11a!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!


mergify bot commented Jan 5, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @JayDi11a please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


meta-cla bot commented Jan 5, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 5, 2026

mergify bot commented Jan 6, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @JayDi11a please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@mergify mergify bot added the needs-rebase label Jan 6, 2026
JayDi11a and others added 7 commits January 6, 2026 09:50
Following PR llamastack#4191 pattern, migrate Inference API from @webmethod decorators
to FastAPI router-based architecture.

- Created inference package (api.py, models.py, fastapi_routes.py, __init__.py)
- Registered inference router in fastapi_router_registry
- Updated server.py and routes.py for inference route processing
- Regenerated OpenAPI specs (49 paths, 68 operations)
- Updated stainless config for /v1/inference/rerank endpoint
- Fixed Chunk model: optional chunk_id for backward compatibility
- Re-exported InterleavedContent in inference/__init__.py

Inference endpoints (6 routes):
- POST /v1/chat/completions (streaming support)
- GET /v1/chat/completions (list)
- GET /v1/chat/completions/{completion_id} (retrieve)
- POST /v1/completions (streaming support)
- POST /v1/embeddings
- POST /v1/inference/rerank
Update demo script to use the newer LlamaStackClient and Agent API instead
of the manual OpenAI client approach.

Changes:
- Switch from OpenAI client to LlamaStackClient
- Use Agent API for simplified RAG implementation
- Auto-select models with preference for Ollama (no API key needed)
- Reduce code complexity from ~136 to ~102 lines
- Remove manual RAG implementation in favor of agentic approach

This provides a cleaner, more modern example for users getting started
with Llama Stack.
Simplify the Ollama model selection logic in the detailed tutorial.

Changes:
- Replace complex custom_metadata filtering with simple ID check
- Use direct 'ollama' in model ID check instead of metadata lookup
- Makes code more concise and easier to understand

This aligns with the simplified approach used in the updated demo_script.py.
Update the agent examples to use the latest API methods.

Changes:
- Simplify model selection (already applied in previous commit)
- Use response.output_text instead of response.output_message.content
- Use direct print(event) instead of event.print() for streaming

This aligns the tutorial with the current Agent API implementation.
Modernize the RAG agent example to use the latest Vector Stores API.

Changes:
- Replace deprecated VectorDB API with Vector Stores API
- Use file upload and vector_stores.create() instead of rag_tool.insert()
- Download files via requests and upload to Llama Stack
- Update to use file_search tool type with vector_store_ids
- Simplify model selection with Ollama preference
- Improve logging and user feedback
- Update event logging to handle both old and new API
- Add note about known server routing issues

This provides a more accurate example using current Llama Stack APIs.
Fix conformance test failures by explicitly defining both application/json
and text/event-stream media types in the 200 responses for streaming
endpoints (/chat/completions and /completions).

Changes:
- Updated fastapi_routes.py to include explicit response schemas for both media types
- Regenerated OpenAPI specs with proper 200 responses
- Regenerated Stainless config

This fixes the "response-success-status-removed" conformance errors while
maintaining the dynamic streaming/non-streaming behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
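
A hedged sketch of what declaring both media types on a streaming route can look like in FastAPI (the route body and schemas here are placeholders, not the PR's exact code):

from fastapi import APIRouter

router = APIRouter()


@router.post(
    "/v1/chat/completions",
    responses={
        200: {
            "description": "A chat completion, or an event stream of chunks when stream=True.",
            "content": {
                "application/json": {},   # non-streaming responses
                "text/event-stream": {},  # streaming responses
            },
        }
    },
)
async def openai_chat_completion(request: dict):  # request model elided
    ...  # dispatch to the implementation; streaming returns a StreamingResponse
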
BREAKING CHANGE: Chunk.chunk_id field was made optional in commit 53af013 to support backward compatibility with legacy data. This changes the OpenAPI schema but maintains API compatibility.
@mergify mergify bot removed the needs-rebase label Jan 6, 2026

mergify bot commented Jan 7, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @JayDi11a please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 7, 2026
@mergify mergify bot removed the needs-rebase label Jan 7, 2026
Collaborator

leseb commented Jan 7, 2026

Thanks a lot for your contribution; you didn't pick the easiest one. Please see the test failures.

Contributor

Why are we introducing so many changes to the demo_script?

Author

I wrote it up in the forum-llama-stack-core channel. But here is the list of what needed to be done to fix the demo script:

Vector DB API Change (demo_script.py:21-23)
1. Removed client.vector_dbs.register() - this API no longer exists
2. The vector store doesn't need to be explicitly registered; it's handled automatically

RAG Tool Insert API Mismatch (demo_script.py:35-42)
Changed from client.tool_runtime.rag_tool.insert() to using client.post() directly
1. The client library expects vector_db_id but the server requires vector_store_id in the request body
2. Used a direct POST to work around this client/server API mismatch (which exposes a deeper issue with client and server maintenance)

Tool Configuration for Agent (demo_script.py:48-53)
Changed from using builtin::rag/knowledge_search to the correct format
1. Used type: "file_search" with the vector_store_ids parameter
2. This matches the OpenAI-compatible API format

Event Logger Output (demo_script.py:69-73)
1. Added a check for log objects that might not have a .print() method
2. Falls back to regular print() for string outputs

These fixes still seem to hold for client v0.4.0 as well.
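
For reference, a minimal sketch of the file_search tool configuration described above (the model ID and vector store ID are placeholders, and the Agent constructor usage follows llama-stack-client examples rather than this PR's exact code):

from llama_stack_client import Agent, LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# file_search replaces the old builtin::rag/knowledge_search tool name;
# vector_store_ids points at the store built from the uploaded files.
agent = Agent(
    client,
    model="ollama/llama3.2:3b",  # placeholder model ID
    instructions="Answer using the attached documents.",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ["vs_placeholder_id"],  # placeholder store ID
        }
    ],
)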

"""Request model for listing chat completions."""

after: str | None = Field(default=None, description="The ID of the last chat completion to return.")
limit: int | None = Field(default=20, description="The maximum number of chat completions to return.")

Contributor

missing ge=1 constraint

class GetChatCompletionRequest(BaseModel):
"""Request model for getting a chat completion."""

completion_id: str = Field(..., description="ID of the chat completion.")

Contributor

missing min_length=1

...,
description="The search query to rank items against. Can be a string, text content part, or image content part. The input must not exceed the model's max input token length.",
)
items: list[str | OpenAIChatCompletionContentPartTextParam | OpenAIChatCompletionContentPartImageParam] = Field(

Contributor

missing min_length=1 for non-empty list

description="List of items to rerank. Each item can be a string, text content part, or image content part. Each input must not exceed the model's max input token length.",
)
max_num_results: int | None = Field(
default=None, description="Maximum number of results to return. Default: returns all."

Contributor

missing ge=1 when provided
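
Taken together, a sketch of how these suggested constraints might look on the request models (model and field names follow the snippets above; the rerank model name and item types are simplified assumptions):

from pydantic import BaseModel, Field


class ListChatCompletionsRequest(BaseModel):
    after: str | None = Field(default=None, description="The ID of the last chat completion to return.")
    # ge=1 rejects zero or negative page sizes.
    limit: int | None = Field(default=20, ge=1, description="The maximum number of chat completions to return.")


class GetChatCompletionRequest(BaseModel):
    # min_length=1 rejects an empty completion ID.
    completion_id: str = Field(..., min_length=1, description="ID of the chat completion.")


class RerankRequest(BaseModel):  # simplified: real items also allow content parts
    # min_length=1 enforces a non-empty list of items.
    items: list[str] = Field(..., min_length=1, description="List of items to rerank.")
    # ge=1 applies only when a value is provided; None still means "return all".
    max_num_results: int | None = Field(default=None, ge=1)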

:returns: RerankResponse with indices sorted by relevance score (descending).
"""
raise NotImplementedError("Reranking is not implemented")
return # this is so mypy's safe-super rule will consider the method concrete

Contributor

if we chose to appease mypy, might as well do it for the methods at lines 100-119 as well (that is to add the return at the end of the method)
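
For context, the pattern under discussion looks like this (a generic sketch, not the PR's actual signatures):

class InferenceProvider:
    async def rerank(self, query: str):
        raise NotImplementedError("Reranking is not implemented")
        return  # unreachable, but makes mypy's safe-super rule treat the method as concrete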

Contributor

The new inference/models.py is 15x larger than any other API's models file. This suggests:

  • It may be duplicating types that already exist elsewhere
  • It should potentially be split into multiple files
  • Some types might belong in llama_stack_api/common

Contributor

Comparing api.py files:

API        Lines  Classes     Pattern
batches    54     1 Protocol  All ... (pure protocol)
models     39     1 Protocol  All ... (pure protocol)
inference  120    2 classes   Mixed: ... + NotImplementedError

The inference/api.py:

  • Has two classes (InferenceProvider + Inference) instead of one Protocol
  • Uses mixed patterns - some methods use ..., others raise NotImplementedError
  • The Inference class extends a Protocol (unusual pattern)

Not saying this is necessarily bad, but worth looking into again
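
For readers unfamiliar with the two styles being compared, a minimal illustration (generic method names, not the PR's definitions):

from typing import Protocol


class Batches(Protocol):
    # Pure-protocol style: the body is just "..." and carries no behavior.
    async def create_batch(self, input_file_id: str) -> dict: ...


class Inference(Protocol):
    # Mixed style: a raising body acts as a default implementation,
    # which is why extending this Protocol reads as unusual.
    async def rerank(self, query: str) -> dict:
        raise NotImplementedError("Reranking is not implemented")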

JayDi11a and others added 4 commits January 7, 2026 11:00
…dpoints

Updated the InferenceRouter.list_chat_completions and
InferenceRouter.get_chat_completion methods to accept the new
request object parameters (ListChatCompletionsRequest and
GetChatCompletionRequest) instead of individual parameters.

This fixes a SQLite parameter binding error where Pydantic request
objects were being passed directly to SQL queries. The router now
unpacks request object fields when calling the underlying store.

Fixes the following error:
sqlite3.ProgrammingError: Error binding parameter 1: type
'ListChatCompletionsRequest' is not supported

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
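
A sketch of the unpacking this commit describes: scalar fields are passed to the store instead of the Pydantic object (the store method signature is assumed):

async def list_chat_completions(self, request: ListChatCompletionsRequest):
    # Passing the Pydantic object itself reached the SQL layer and raised
    # sqlite3.ProgrammingError during parameter binding; unpack fields instead.
    return await self.store.list_chat_completions(
        after=request.after,
        limit=request.limit,
    )
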
Added missing exports for GetChatCompletionRequest and
ListChatCompletionsRequest to llama_stack_api/__init__.py.

These request types are needed by InferenceRouter and must be
importable from the top-level llama_stack_api package.

Fixes import error:
ImportError: cannot import name 'GetChatCompletionRequest' from
'llama_stack_api'

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
FastAPI routers need to explicitly return StreamingResponse for
streaming endpoints. The old webmethod system had middleware that
automatically wrapped AsyncIterator results in SSE format, but the
new router system requires explicit handling.

Changes:
- Added _create_sse_event() helper to format data as SSE events
- Added _sse_generator() to convert async generators to SSE format
- Updated openai_chat_completion() to return StreamingResponse when stream=True
- Updated openai_completion() to return StreamingResponse when stream=True

Fixes streaming errors:
TypeError("'async_generator' object is not iterable")

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
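
A sketch of the SSE helpers this commit names (_create_sse_event and _sse_generator come from the commit message; their bodies here are assumptions):

import json
from collections.abc import AsyncIterator

from fastapi.responses import StreamingResponse


def _create_sse_event(data) -> str:
    # One chunk in SSE wire format: "data: <json>\n\n".
    payload = data.model_dump_json() if hasattr(data, "model_dump_json") else json.dumps(data)
    return f"data: {payload}\n\n"


async def _sse_generator(stream) -> AsyncIterator[str]:
    # Convert an async generator of chunks into SSE-formatted strings,
    # closing with the OpenAI-style terminator.
    async for chunk in stream:
        yield _create_sse_event(chunk)
    yield "data: [DONE]\n\n"


def _streaming_response(stream) -> StreamingResponse:
    # Used by the route handlers when stream=True.
    return StreamingResponse(_sse_generator(stream), media_type="text/event-stream")
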
Author

JayDi11a commented Jan 7, 2026

Thanks a lot for your contribution; you didn't pick the easiest one. Please see the test failures.

Thank you @leseb. I am honored to help. Then it sounds like I picked the right one. I had some silly TypeErrors handling streaming but I think those are fixed and I am hopefully getting closer.

JayDi11a and others added 2 commits January 7, 2026 21:41
Upgrade llama-stack-client from >=0.3.0 to >=0.4.0 to ensure compatibility
with recently added APIs (admin, vector_io parameter changes, beta namespace,
and tool_runtime authorization parameter).

This resolves test failures caused by client SDK being out of sync with
server API changes from router migrations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

mergify bot commented Jan 8, 2026

This pull request has merge conflicts that must be resolved before it can be merged. @JayDi11a please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 8, 2026