Streaming responses #3

Closed
mihau12 opened this issue Nov 26, 2023 · 1 comment

mihau12 commented Nov 26, 2023

Hello! Thanks for your work! Is it possible to add the stream=true parameter to get streamed responses?

Zaki-1052 (Owner) commented Dec 3, 2023

Hi, apologies for the wait, I was super swamped with work! TL;DR is that it would be difficult to handle beyond just sending the request to the server, because openai-node doesn't support event streams; afaik that's only well supported in the Python library, and this repo is written in Node.js. There's a great thread at openai/openai-node#18 that explains it.

Basically, parsing the return data is extremely complicated, especially for a POST request, and there doesn't yet seem to be a good way to keep the chunks from arriving all at once by the time they reach the frontend, which kinda defeats the purpose.

The consensus seems to be:

streaming experience is currently not good and only seems to return all chunks in bulk instead of as they come in.

So far, no one in this community thread has a great solution to the problem: https://community.openai.com/t/how-to-stream-response-in-javascript/7310/20

You can always add it to the parameters next to temperature and everything in server.js, per their documentation, with:
stream: true,
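In context, that would look roughly like this; the surrounding payload fields are an assumption about what server.js builds, not copied from it:

// Hypothetical shape of the request body in server.js; only the stream flag
// is the point here, the other fields (model, temperature, etc.) are assumed.
const data = {
  model: "gpt-3.5-turbo",
  messages: conversationHistory,
  temperature: 1,
  stream: true, // ask the API for server-sent-event chunks instead of one JSON body
};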
But one of my main goals here was not to make this repo bloated with other dependencies, and unfortunately, as far as I can tell, at least in a JavaScript web app like this one that uses POST requests, an SSE library is required to stream the responses to the frontend. Specifically, that would be HTTP streaming; an explanation is here.

According to the Docs:

If you’d like to stream results from the POST variant in your browser, consider using the SSE library.

From the thread:

If you look into the code, it seems like it’s listening to the progress event when opening up an XHR instance and then parsing that data into tokens, so I’m guessing setting stream: true in the POST request case is just enabling chunked transfer encoding on the response.

Also, I've already been handling everything in a messageContent field, so some pretty severe modifications would be needed to change variable names, parsing, and the like, since according to the OpenAI Cookbook, streaming completions use a delta field instead of a message field. So basically, the following code implements stream=true, but it doesn't actually display that way on the frontend; it only streams from the API to the server, with some extra work to reassemble the text.
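For reference, the parsing difference boils down to which field the text lives in; a minimal sketch (the function and variable names here are mine, not the repo's):

// Non-streaming responses put the full text in choices[0].message.content;
// streaming chunks put partial text in choices[0].delta.content instead.
function extractText(parsed, isStreamChunk) {
  return isStreamChunk
    ? parsed.choices[0]?.delta?.content ?? ''
    : parsed.choices[0]?.message?.content ?? '';
}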

This is the snippet that would be used in this repo; it worked when I tested it, but it doesn't actually stream to the frontend, and I'm wary of implementing it because the image endpoint is already a bit scuffed:

try {
    const response = await axios.post('https://api.openai.com/v1/chat/completions', data, { headers, responseType: 'stream' });
    let buffer = '';

    response.data.on('data', (chunk) => {
      buffer += chunk.toString(); // Accumulate the chunks in a buffer
    });

    response.data.on('end', () => {
      try {
        const lines = buffer.split('\n');
        let messageContent = '';

        for (const line of lines) {
          if (line.startsWith('data: ')) {
            const jsonString = line.substring(6).trim();
            if (jsonString !== '[DONE]') {
              const parsedChunk = JSON.parse(jsonString);
              // Use ?? '' so role-only or finish chunks without delta.content
              // don't append the literal string "undefined"
              messageContent += parsedChunk.choices.map(choice => choice.delta?.content ?? '').join('');
            }
          }
        }

        const lastMessageContent = messageContent;

        if (lastMessageContent) {
          // Add assistant's message to the conversation history
          conversationHistory.push({ role: "assistant", content: lastMessageContent.trim() });

          // Send this back to the client
          res.json({ text: lastMessageContent.trim() });
        } else {
          // Handle no content scenario
          res.status(500).json({ error: "No text was returned from the API" });
        }
      } catch (parseError) {
        console.error('Error parsing complete response:', parseError.message);
        res.status(500).json({ error: "Error parsing the response from OpenAI API" });
      }
    });

  } catch (error) {
    console.error('Error calling OpenAI API:', error.message);
    if (error.response) {
      console.error(error.response.data);
    }
    res.status(500).json({ error: "An error occurred when communicating with the OpenAI API.", details: error.message });
  }

}); // closes the surrounding POST route handler in server.js (not shown here)

Am I correct in thinking that your motivation for streamed responses is so that completions appear in chunks like on the ChatGPT interface, but are still added to the conversation history as a single response, and handled server-side in the same way?
Because if so, I could make it parse the output that way and add some handling on the frontend (rough idea sketched below); streamed responses also slow down how quickly requests are handled by the server, so I'd probably create a new branch for this with the required dependencies and everything, because it's otherwise not worth it, especially in a locally hosted project like this that will be slow regardless.
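For the record, the rough idea would be something like the following untested sketch. The route name, the data and headers objects, and the surrounding Express setup are assumptions about this repo, and the frontend would have to read it with fetch() and a ReadableStream reader rather than EventSource, since EventSource only supports GET:

// Untested sketch: relays each delta to the browser as a server-sent event
// while still accumulating the full reply for conversationHistory.
app.post('/message-stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const response = await axios.post(
    'https://api.openai.com/v1/chat/completions',
    { ...data, stream: true },
    { headers, responseType: 'stream' }
  );

  let fullText = '';
  let buffer = '';

  response.data.on('data', (chunk) => {
    buffer += chunk.toString();
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any partial line for the next chunk
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const payload = line.slice(6).trim();
      if (payload === '[DONE]') continue;
      const delta = JSON.parse(payload).choices[0]?.delta?.content ?? '';
      fullText += delta;
      // Forward the delta to the browser immediately as its own SSE event
      res.write(`data: ${JSON.stringify({ text: delta })}\n\n`);
    }
  });

  response.data.on('end', () => {
    // Still store the complete reply as a single history entry
    conversationHistory.push({ role: 'assistant', content: fullText.trim() });
    res.end();
  });
});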

Let me know if you'd like to fork it yourself, or if I should otherwise close the issue, since I believe their documentation doesn't address anything like replicating what they have on their Chat interface exactly, only this example:

import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const completion = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ],
    stream: true,
  });

  for await (const chunk of completion) {
    console.log(chunk.choices[0].delta.content);
  }
}

main();

Even when each chunk is parsed and sent to the frontend individually, it wouldn't be fast enough to make much of a meaningful difference; the chat completion chunk object is described here, and handling it would mean quite the workaround.
Also, be aware that I'll be fixing the issue mentioned about image inputs, so when the data is streamed and parsed by the server, there will likely be a problem with the image URLs if too much of the JSON payload is changed; just FYI if you do decide to tackle it.
From the docs:

If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message.
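Concretely, the raw stream body those docs describe is just a series of data: lines ending with the [DONE] sentinel, roughly like this (the values here are illustrative, not captured output):

data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}
data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]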

Edit: You may want to look into .toReadableStream() and .fromReadableStream(); see examples/stream-to-client-next.ts for one example.

The end of the thread has some suggestions for implementation in V4, so maybe something like the rough sketch below could work, but I'm not too sure about the feasibility for something so small, sorry.
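Going off their V4 README and the Next.js example they link, it would look something along these lines; the import path, helper names, and route are taken from those examples and I haven't verified them against this repo, so treat it as a sketch only (in practice the two halves would live in separate files):

import OpenAI from "openai";
import { ChatCompletionStream } from "openai/lib/ChatCompletionStream";

const openai = new OpenAI();

// Server side (Next.js-style route handler): pipe the SDK's stream through
// as a web ReadableStream instead of buffering it.
export async function POST() {
  const stream = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
  });
  return new Response(stream.toReadableStream());
}

// Client side: rebuild the typed stream from the fetch() response body and
// handle each delta as it arrives.
async function readFromServer() {
  const res = await fetch("/api/chat", { method: "POST" });
  const runner = ChatCompletionStream.fromReadableStream(res.body);
  runner.on("content", (delta, snapshot) => {
    console.log(delta); // would append to the chat window here
  });
  await runner.finalChatCompletion();
}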

Apologies if I've missed anything obvious; I'm new to this, but feel free to submit a PR if you figure it out!


Zaki-1052 closed this as not planned on Dec 5, 2023.