
server: improve correctness of request parsing and responses #2929

Merged
16 commits merged from improve-local-server into main on Sep 9, 2024

Conversation

cebtenzzre (Member) commented Aug 30, 2024

This PR significantly improves the accuracy of the local API server's request parsing and responses.

Functional changes:

  • Split the chat and completions endpoints, since they have significant differences. This is a functional change: what was previously just isChat is now a finer distinction, and some parameters are no longer accepted by one endpoint or the other because they are not in the spec
  • Rewrite request parsing using dedicated classes with stricter type checking and more meaningful client error responses (see the sketch after this list)
  • Fix /v1/completions to stop applying a prompt template (it should already have been applied by the client, if any) and to give token-accurate responses with echo=true (whitespace was being trimmed too eagerly)
  • Fix /v1/chat/completions to insert actual user and assistant messages into the history using fakeReply, instead of just joining all of the user messages into a string to prefix the prompt
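
To illustrate the new parsing approach, here is a minimal sketch of what a dedicated request class for one endpoint could look like. The class and field names are hypothetical, not the PR's actual code; the point is that each endpoint declares exactly the keys it accepts, checks their types, and rejects anything else with a client error instead of silently ignoring it.

#include <QJsonObject>
#include <QJsonValue>
#include <QString>
#include <optional>

struct CompletionRequest {
    QString model;
    QString prompt;        // /v1/completions takes a raw prompt, unlike chat
    bool    echo = false;
    double  temperature = 1.0;

    // Returns nullopt and sets `error` on any unknown or mistyped key.
    static std::optional<CompletionRequest> parse(const QJsonObject &body, QString &error)
    {
        CompletionRequest req;
        for (auto it = body.begin(); it != body.end(); ++it) {
            const QString key = it.key();
            const QJsonValue val = it.value();
            if (key == "model" && val.isString())
                req.model = val.toString();
            else if (key == "prompt" && val.isString())
                req.prompt = val.toString();
            else if (key == "echo" && val.isBool())
                req.echo = val.toBool();
            else if (key == "temperature" && val.isDouble())
                req.temperature = val.toDouble();
            else {
                error = QStringLiteral("Unrecognized or invalid request argument supplied: %1").arg(key);
                return std::nullopt;
            }
        }
        return req;
    }
};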

Other changes:

  • Change llmodel to use std::optional for fakeReply, since constructing a pointer to a std::string requires too much boilerplate (a sketch of the difference follows this list)
  • Apply a few cleanups to existing code as recommended by iwyu/clazy/clang-tidy
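
To make the fakeReply change concrete, here is a simplified, runnable sketch. The function below is a hypothetical stand-in reduced to the relevant parameter, and the optional's payload type (std::string_view) is an assumption; the point is only the switch from a raw pointer to an optional.

#include <iostream>
#include <optional>
#include <string>
#include <string_view>

// Stand-in for the llmodel call. Before the change, the equivalent parameter
// was `std::string *fakeReply = nullptr`, forcing callers to create a named
// std::string just to take its address.
void prompt(std::string_view userPrompt,
            std::optional<std::string_view> fakeReply = std::nullopt)
{
    if (fakeReply)
        std::cout << "faking reply: " << *fakeReply << '\n';  // recorded, not generated
    else
        std::cout << "generating reply for: " << userPrompt << '\n';
}

int main()
{
    // With std::optional, a literal can be passed inline:
    prompt("john: could you please repeat that?", "bob: my name is bob");
    prompt("France is a");  // no fake reply
}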

Outstanding issues (all existing before this PR):

  • Reproducibility is inhibited by changing top_k, n_batch, repeat_penalty, and repeat_last_n in the UI; these should really be configured by the client
  • Reproducibility may be inhibited by changing promptTemplate in the UI; perhaps the client would like to configure this as well
  • Reproducibility is inhibited because the client cannot tell which model was actually used when the requested model is not found
  • Reproducibility is inhibited by enabling LocalDocs in the UI (the client probably wants control of this), which affects the model's context in a way that cannot be represented by user/assistant queries in follow-up requests

@cebtenzzre cebtenzzre marked this pull request as ready for review September 3, 2024 17:29
@cebtenzzre cebtenzzre changed the title WIP: local API server improvements Improve correctness of local API server parsing and responses Sep 3, 2024
cebtenzzre (Member, Author) commented Sep 3, 2024

Test 1: /v1/chat/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "temperature": 0,
    "messages": [
        {
            "role": "user",
            "content": "this is a conversation between john and another person.\njohn: what is your name?"
        },
        {
            "role": "assistant",
            "content": "bob: my name is bob"
        },
        {
            "role": "user",
            "content": "john: could you please repeat that?"
        }
    ]
}
Response on main
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "Other Person: My name is Emily.\n\n(Note: I'll respond as the other",
        "role": "assistant"
      },
      "references": []
    }
  ],
  "created": 1725387192,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 40,
    "total_tokens": 56
  }
}

Why the above is unexpected: The model forgot its previous response, despite it being explicitly provided in the request.

Response with this PR
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "bob: sure thing, john! my name is bob.",
        "role": "assistant"
      },
      "references": null
    }
  ],
  "created": 1725385942,
  "id": "placeholder",
  "model": "Llama 3 8B Instruct",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 53,
    "total_tokens": 65
  }
}

Test 2: /v1/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "temperature": 0,
    "echo": true,
    "prompt": "France is a"
}
Response on main
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "references": [],
      "text": "France is a\nI think I know where this is going!\n\n...country!"
    }
  ],
  "created": 1725387245,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 12,
    "prompt_tokens": 14,
    "total_tokens": 26
  }
}

Why the above is unexpected: The model's response is joined to the prompt with a newline, and a chat template is applied behind the scenes, even though this is not a chat endpoint.

Response with this PR
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "references": null,
      "text": "France is a country with a rich history and culture, known for its iconic landmarks like the E"
    }
  ],
  "created": 1725386185,
  "id": "placeholder",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 5,
    "total_tokens": 21
  }
}

Test 3: /v1/chat/completions

Request
{
    "model": "Llama 3 8B Instruct",
    "prompt": "what is 2+2?",
    "messages": [ { "role": "user", "content": "?" } ]
}
Response on main
{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "message": {
        "content": "The answer to 2 + 2 is... (drumroll please)...",
        "role": "assistant"
      },
      "references": []
    }
  ],
  "created": 1725390175,
  "id": "foobarbaz",
  "model": "Llama 3 8B Instruct",
  "object": "text_completion",
  "usage": {
    "completion_tokens": 16,
    "prompt_tokens": 19,
    "total_tokens": 35
  }
}

Why the above is unexpected: The server not only failed to complain about the invalid prompt argument, it actually generated a response based on its content.

Response with this PR
{
  "error": {
    "code": null,
    "message": "Unrecognized request argument supplied: prompt",
    "param": null,
    "type": "invalid_request_error"
  }
}
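
This error body follows the OpenAI error format. As a rough illustration (a hypothetical helper, not the PR's actual code), such a response can be assembled with Qt's JSON classes:

#include <QByteArray>
#include <QJsonDocument>
#include <QJsonObject>
#include <QJsonValue>
#include <QString>

// Build an OpenAI-style error body, to be returned with HTTP 400.
QByteArray invalidRequestError(const QString &message)
{
    QJsonObject error {
        { "code",    QJsonValue::Null },
        { "message", message },
        { "param",   QJsonValue::Null },
        { "type",    "invalid_request_error" },
    };
    return QJsonDocument(QJsonObject {{ "error", error }}).toJson();
}

// e.g. invalidRequestError("Unrecognized request argument supplied: prompt")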

@cebtenzzre cebtenzzre changed the title Improve correctness of local API server parsing and responses server: improve correctness of request parsing and responses Sep 3, 2024
VS 2019 does not support C++23 std::optional::transform.

Signed-off-by: Jared Van Bortel <[email protected]>
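
For context: std::optional::transform applies a function to the contained value if one is present and propagates an empty optional otherwise. On toolchains without C++23 support, the equivalent is an explicit check, as in this minimal illustration:

#include <cstddef>
#include <optional>
#include <string>

std::optional<std::size_t> replyLength(const std::optional<std::string> &reply)
{
    // C++23 (unavailable on VS 2019):
    //     return reply.transform([](const std::string &s) { return s.size(); });
    // Portable pre-C++23 equivalent:
    return reply ? std::optional<std::size_t>(reply->size()) : std::nullopt;
}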
@cebtenzzre cebtenzzre marked this pull request as draft September 4, 2024 15:26
cebtenzzre (Member, Author) commented:

Marking as draft until I fully sort out the CI issues. We need to use VS 2022, but CMake can't find the C++ compiler if I just switch the machine image, probably because the hardcoded Enter-VsDevShell equivalent needs to be updated.

@cebtenzzre cebtenzzre marked this pull request as ready for review September 4, 2024 19:57
cebtenzzre (Member, Author) commented Sep 4, 2024

Windows is building fine now, but the GCC we are using on Linux is too old for std::optional::transform (we need GCC 12, released in May 2022, but have 11), and on macOS, std::format is incompatible with OS targets older than Ventura 13.3 (which cannot be used on pre-2017 Intel Macs).

@cebtenzzre cebtenzzre marked this pull request as draft September 5, 2024 14:55
cebtenzzre (Member, Author) commented:

The final nail in the coffin for std::format is that on Linux it requires libstdc++ 13, which was released in April 2023 and was not made available until Ubuntu Lunar 23.04. Around 50% of our Ubuntu users are still on something older than 23.04, despite Ubuntu Noble 24.04 LTS releasing earlier this year.
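
One common workaround (mentioned here as a possibility, not necessarily what this PR settled on): the {fmt} library, which std::format was standardized from, provides the same formatting API on older toolchains, so call sites stay nearly identical:

#include <fmt/format.h>
#include <string>

// std::format("Unrecognized request argument supplied: {}", key) becomes:
std::string badArgMessage(const std::string &key)
{
    return fmt::format("Unrecognized request argument supplied: {}", key);
}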

manyoso (Collaborator) commented Sep 5, 2024

The changes that you're attempting to make with regard to string formatting are tangentially related to the thrust of the change though, right? Perhaps it would be best to separate the concerns here as some of these fixes should go in sooner?


@cebtenzzre cebtenzzre marked this pull request as ready for review September 6, 2024 23:24
manyoso (Collaborator) commented Sep 7, 2024

Seems to be failing on CI still...

cebtenzzre (Member, Author) commented:

> Seems to be failing on CI still...

CI passed, actually. I canceled the Windows and macOS builds because they succeeded with the previous commit.

@cebtenzzre cebtenzzre merged commit 3900528 into main Sep 9, 2024
6 of 21 checks passed
@cebtenzzre cebtenzzre deleted the improve-local-server branch September 9, 2024 14:48