[Feature-Request] Support for GPT Vision #624
Comments
Hi @antoan, I believe you would have to pass the path through an
I have the same desire to try this out with images. Looking at the GPT class, a way forward would be to allow GPT to accept a list of 'prompts' which are type-annotated somehow, and to dynamically construct the messages based on those types, e.g. 'inline base64 image'. Not sure if there is existing machinery in DSPy which can help with this (I've really only looked at this for an hour or so, including reading the docs!)
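The type-annotated prompt idea above could be sketched roughly like this (a minimal sketch; `TextPart`, `ImagePart`, and `build_message_content` are hypothetical names, not part of DSPy — the point is just that each typed part maps to one entry of the OpenAI chat `content` array):

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical prompt-part types; an LM wrapper could accept a list of these
# and build the chat payload per type instead of parsing a flat prompt string.

@dataclass
class TextPart:
    text: str

@dataclass
class ImagePart:
    base64_data: str               # inline base64-encoded image bytes
    mime_type: str = "image/jpeg"  # assumed default, as in the snippets below

PromptPart = Union[TextPart, ImagePart]

def build_message_content(parts: List[PromptPart]) -> list:
    """Map each typed part to the chat 'content' entry it represents."""
    content = []
    for part in parts:
        if isinstance(part, TextPart):
            content.append({"type": "text", "text": part.text})
        elif isinstance(part, ImagePart):
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:{part.mime_type};base64,{part.base64_data}"},
            })
        else:
            raise TypeError(f"Unsupported prompt part: {type(part)!r}")
    return content
```

The resulting list can be dropped straight into a `{"role": "user", "content": ...}` message.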
Maybe start here: https://github.com/stanfordnlp/dspy/blob/main/dsp/modules/lm.py
I have made a PR.
Here is a potential implementation: https://github.com/stanfordnlp/dspy/blob/56a0949ad285e0a3dd5649de58a6f5fb6f734a60/dsp/modules/gpt4vision.py#L106C1-L147C20
I am also in need of this feature and started with this. Tested it with both Gemini and GPT-4o. If anybody is interested, welcome to try it out!
I have also been looking for a solution to this problem, and saw that someone had written a GPT4Vision class (https://github.com/stanfordnlp/dspy/blob/56a0949ad285e0a3dd5649de58a6f5fb6f734a60/dsp/modules/gpt4vision.py#L106C1-L147C20), but that one is too complicated. This is a simpler solution I wrote based on the documentation, and it has actually been tested and works:

import dspy
import requests
import base64
import re
from dsp import LM


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


class GPTVision(LM):
    def __init__(self, model, api_key):
        super().__init__(model)
        self.model = model
        self.api_key = api_key
        self.provider = "openai"
        self.history = []
        self.base_url = "https://api.openai.com/v1/chat/completions"

    def basic_request(self, prompt, **kwargs):
        # Extract the image file path from the prompt, then strip those lines
        # back out of the prompt.
        pattern = r'^Image Path: .*'
        matches = re.findall(pattern, prompt, re.MULTILINE)
        # matches[1]: DSPy prompts repeat the field (template first, then the
        # filled-in value), so the second match carries the actual path.
        image_path = matches[1].replace("Image Path: ", "")
        for match in matches:
            prompt = prompt.replace(f"\n{match}\n", "")
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {self.api_key}"
        }
        base64_image = encode_image(image_path)
        data = {
            **kwargs,
            "model": self.model,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/jpeg;base64,{base64_image}"
                            }
                        }
                    ]
                }
            ],
            "max_tokens": 300
        }
        response = requests.post(self.base_url, headers=headers, json=data)
        response = response.json()
        self.history.append({
            "prompt": prompt,
            "response": response,
            "kwargs": kwargs
        })
        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        responses = self.request(prompt, **kwargs)
        completions = [choice["message"]["content"] for choice in responses["choices"]]
        return completions


class VqaCoT(dspy.Signature):
    """Answer the questions based on the pictures."""
    image_path = dspy.InputField(desc="Base64 format of the image")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="Answer based on image and question")


if __name__ == "__main__":
    gpt4o = GPTVision(model='gpt-4o', api_key="your OpenAI API key")
    qa = dspy.ChainOfThought(VqaCoT)
    with dspy.context(lm=gpt4o):
        print(qa(question="What is the occupation of the man in the picture?",
                 image_path="/Users/dream/myProjects/ITS-llm/demo/curry.jpeg"))
    # Prediction(
    #     rationale='Question: What is the occupation of the man in the picture?\n\nReasoning: Let\'s think step by step in order to determine his occupation. We observe that he is wearing a basketball jersey with the team name "Golden State Warriors" and holding a basketball. These indicators suggest that he is likely employed in a profession related to basketball.',
    #     answer='The man is a basketball player.'
    # )
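For what it's worth, the path-extraction step in the snippet above can be pulled out into a standalone helper and checked without any API call (`split_image_path` is a hypothetical name; it just mirrors the regex logic in `basic_request`):

```python
import re

def split_image_path(prompt: str, occurrence: int = 1):
    """Pull 'Image Path: ...' lines out of a DSPy-style prompt.

    Returns (image_path, cleaned_prompt). `occurrence` selects which match
    carries the real path; the snippet above uses matches[1], since the
    field template usually appears in the prompt before the filled-in value.
    """
    matches = re.findall(r'^Image Path: .*', prompt, re.MULTILINE)
    image_path = matches[occurrence].replace("Image Path: ", "")
    for match in matches:
        prompt = prompt.replace(f"\n{match}\n", "")
    return image_path, prompt

demo_prompt = (
    "Question: q\n\n"
    "Image Path: Base64 format of the image\n\n"   # template line
    "Image Path: /tmp/curry.jpeg\n\n"              # filled-in value
    "Answer:"
)
path, cleaned = split_image_path(demo_prompt)
# path == "/tmp/curry.jpeg"; cleaned no longer contains "Image Path" lines
```

Note the helper raises `IndexError` if the prompt contains fewer than two matches, so the selected occurrence is worth double-checking against the actual prompt your signature produces.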
Thank you |
Here is another simple GptVision class using the Azure OpenAI client, in case it helps:

import base64
import json
import re
from dsp import LM
from dsp.modules.azure_openai import AzureOpenAI, chat_request


def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')


class GptVision(LM):
    def __init__(self, model, api_version, api_base, azure_ad_token_provider, max_tokens):
        super().__init__(model)
        client = AzureOpenAI(
            model=model,
            api_version=api_version,
            api_base=api_base,
            azure_ad_token_provider=azure_ad_token_provider,
            max_tokens=max_tokens,
        )
        self.client = client
        self.history = []

    def basic_request(self, prompt, **kwargs):
        raw_kwargs = kwargs
        # Extract the image file path from the prompt, then strip those lines
        # out; matches[1] is the filled-in value (the first match is the
        # field template).
        pattern = r'^Image Path: .*'
        matches = re.findall(pattern, prompt, re.MULTILINE)
        image_path = matches[1].replace("Image Path: ", "")
        for match in matches:
            prompt = prompt.replace(f"\n{match}\n", "")
        base64_image = encode_image(image_path)
        kwargs = {**self.kwargs, **kwargs}
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": prompt
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        }
                    }
                ]
            }
        ]
        kwargs["messages"] = messages
        kwargs = {"stringify_request": json.dumps(kwargs)}
        response = chat_request(self.client.client, **kwargs)
        history = {
            "prompt": prompt,
            "response": response,
            "kwargs": kwargs,
            "raw_kwargs": raw_kwargs,
        }
        self.history.append(history)
        return response

    def __call__(self, prompt, only_completed=True, return_sorted=False, **kwargs):
        responses = self.request(prompt, **kwargs)
        completions = [choice["message"]["content"] for choice in responses["choices"]]
        return completions
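Both the OpenAI and Azure snippets share the same `encode_image` helper and data-URL construction, which can be sanity-checked locally without any API call (`to_data_url` is a hypothetical wrapper; the JPEG MIME type is assumed, as in the snippets above):

```python
import base64
import tempfile

def encode_image(image_path: str) -> str:
    """Read a file and return its bytes as a base64 string (same helper as above)."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def to_data_url(image_path: str, mime: str = "image/jpeg") -> str:
    """Build the data URL that the chat 'image_url' field expects."""
    return f"data:{mime};base64,{encode_image(image_path)}"

# Round-trip check with a throwaway file standing in for a real image.
with tempfile.NamedTemporaryFile(suffix=".jpeg", delete=False) as f:
    f.write(b"fake image bytes")
    tmp_path = f.name

url = to_data_url(tmp_path)
assert url.startswith("data:image/jpeg;base64,")
assert base64.b64decode(url.split(",", 1)[1]) == b"fake image bytes"
```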
I have already tried a solution along these lines, suggested by @truebit earlier, without any luck. I have documented my attempt a little here:
Originally posted by @antoan in #459 (comment)