From 8db48f9a19d81a5470f61505f80bd59a0a842052 Mon Sep 17 00:00:00 2001
From: capjamesg
Date: Sat, 14 Sep 2024 01:29:02 +0000
Subject: [PATCH] Update results

---
 index.html              | 294 ++++++++++++++++++++--------------------
 results/2024-09-14.json | 106 +++++++++++++++
 2 files changed, 255 insertions(+), 145 deletions(-)
 create mode 100644 results/2024-09-14.json

diff --git a/index.html b/index.html
index e284778..bbd1a46 100644
--- a/index.html
+++ b/index.html
@@ -40,7 +40,7 @@

How's GPT-4o Doing?

You can contribute your own tests, too! See the GitHub README for contributing instructions.

-

Tests are run every day at 1am PT. Last updated September 12, 2024.

+

Tests are run every day at 1am PT. Last updated September 14, 2024.

Made with ❤️ by the team at Roboflow.

@@ -58,12 +58,12 @@

How's GPT-4o Doing?

Response Time

-

Today, the average response time to receive results from our tests was 4.05 seconds per request.

+

Today, the average response time to receive results from our tests was 4.07 seconds per request.

This number only accounts for requests made by this application.

-

4.05 s

+

4.07 s

@@ -122,66 +122,12 @@

Prompt

Image

Image of the input into GPT-4

Result

-
There are eight fruits in the image.
+
9

Test submitted by Roboflow

- -
-
-
-

Handwriting OCR

-

Can GPT-4V read handwriting?

-
-
-
-

Fail

-
-
-
-
-
- Last 7-Day Performance -
- -
- -
- -
- -
- -
- -
- -
- -
-
-

Of the last 7 tests, conducted daily, this test has passed 71.0% of the time.

-

Today's request cost $0.009

-
-
- -
-

Method

-
We send a image of a handwritten note to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.
-

Prompt

-
-                                            Read the text in the image. Return only the text, with punctuation.
-                                        
-

Image

- Image of the input into GPT-4 -

Result

-
The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day oh tea."
-

Test submitted by Roboflow

-
-
-
- +
@@ -216,7 +162,7 @@

Object Detection

Of the last 7 tests, conducted daily, this test has passed 0% of the time.

-

Today's request cost $0.01

+

Today's request cost $0.009

@@ -230,7 +176,7 @@

Prompt

Image

Image of the input into GPT-4

Result

-
{'x': 0.373, 'y': 0.3645, 'width': 0.254, 'height': 0.4958}
+
{'x': 0.38, 'y': 0.4, 'width': 0.2, 'height': 0.4}

Test submitted by Roboflow

@@ -286,22 +232,22 @@

Image

Result

```json
 {
-  "A": {
-    "quantity": 20,
-    "price": 15
-  },
-  "B": {
-    "quantity": 25,
-    "price": 25
-  },
-  "C": {
-    "quantity": 30,
-    "price": 30
-  },
-  "D": {
-    "quantity": 35,
-    "price": 40
-  }
+    "A": {
+        "quantity": 20,
+        "price": 10
+    },
+    "B": {
+        "quantity": 25,
+        "price": 25
+    },
+    "C": {
+        "quantity": 30,
+        "price": 30
+    },
+    "D": {
+        "quantity": 35,
+        "price": 40
+    }
 }
 ```

Test submitted by Roboflow

@@ -403,7 +349,7 @@

Annotation Quality Assurance

Of the last 7 tests, conducted daily, this test has passed 0% of the time.

-

Today's request cost $0.02

+

Today's request cost $0.021

@@ -417,43 +363,56 @@

Prompt

Image

Image of the input into GPT-4

Result

-
To determine if there are any missing annotations, I'll count the visible cars in the image and compare that with the number of red bounding boxes.
+                                        
The image provided contains visible cars annotated with red bounding boxes. I'll manually count the cars and the annotations to find any missing annotations.
+
+1. Cars in the image:
+   - 1 car (left oncoming lane, closest)
+   - 2 cars (next in line after the front car in the oncoming lane)
+   - 3 cars (close proximity in the far center lane)
+   - 1 car on the rightmost lane (closest to the camera)
 
-In the provided image:
+Total cars: 7
 
-1. Visible cars (including partial and fully visible):
-   - Front white car (on the right).
-   - Black SUV to the left.
-   - Several distant cars in the middle lane.
-   - Additional cars that may be hard to distinguish clearly.
+2. Annotations:
+   - 1 car (left oncoming lane, closest)
+   - 1 car (next in line after the front car, on the left)
+   - 3 cars (in the far center lane)
+   - 1 car on the rightmost lane (closest to the camera)
+   
+Total annotations: 6
 
-2. Red bounding boxes present:
-   - There are 8 red bounding boxes.
+The car closest to the camera on the right lane isn't annotated.
 
-Here, the number of cars in the image appears to be accurately annotated with red boxes. Therefore, there are no missing annotations. 
+Therefore, there is 1 missing annotation.
 
-JSON output:
 ```json
 {
-  "missing": 0
+  "missing": 1
 }
-```
-
-This indicates that there are no missing annotations in the provided image.
+```

Test submitted by Roboflow

- + + + +
+
+

Today's Passing Tests

+ +
+
+
-

Measurement Test

-

Can GPT-4 Measure Items Using a Reference?

+

Zero Shot Classification

+

Can GPT-4V classify an image without being trained on that particular use case?

-
-

Fail

+
+

Pass

@@ -462,62 +421,48 @@

Measurement Test

Last 7-Day Performance
-
+
-
+
-
+
-
+
-
+
-
+
-

Of the last 7 tests, conducted daily, this test has passed 14.0% of the time.

-

Today's request cost $0.009

+

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Today's request cost $0.005

Method

-
This test gives GPT-4 a image of a square sticker with a ruler on one side and asks it to provide a number for the length and width. We score this test based on precent error, gauging how far it is from the actual value.
+
We check to see if the model can correctly identify the vehicle. If it can, it recieves a 100%, if it is incorrect, it recieves a 0%.

Prompt

-                                            What is the length and width of this square sticker, based on the ruler? Return a JSON with float properties for `length` and `width` representing inches.
+                                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
                                         

Image

- Image of the input into GPT-4
+ Image of the input into GPT-4

Result

-
```json
-{
-  "length": 3.0,
-  "width": 3.0
-}
-```
+
Toyota Camry

Test submitted by Roboflow

- -
-
-
-
-

Today's Passing Tests

- -
-
- +
-

Zero Shot Classification

-

Can GPT-4V classify an image without being trained on that particular use case?

+

Document OCR

+

Can GPT-4V read a document and return the exact characters in the text?

@@ -547,31 +492,31 @@

Zero Shot Classification

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

-

Today's request cost $0.005

+

Today's request cost $0.009

Method

-
We check to see if the model can correctly identify the vehicle. If it can, it recieves a 100%, if it is incorrect, it recieves a 0%.
+
We send a screenshot of a typed document to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.

Prompt

-                                            What is in the image? Return the class of the object in the image. Here are the classes: Toyota Camry, Tesla Model 3. You can only return one class from that list.
+                                            Read the text in the image. Return only the text, with punctuation.
                                         

Image

- Image of the input into GPT-4
+ Image of the input into GPT-4

Result

-
Toyota Camry
+
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.

Test submitted by Roboflow

- +
-

Document OCR

-

Can GPT-4V read a document and return the exact characters in the text?

+

Handwriting OCR

+

Can GPT-4V read handwriting?

@@ -586,7 +531,7 @@

Document OCR

-
+
@@ -596,31 +541,31 @@

Document OCR

-
+
-

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

+

Of the last 7 tests, conducted daily, this test has passed 71.0% of the time.

Today's request cost $0.009

Method

-
We send a screenshot of a typed document to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.
+
We send a image of a handwritten note to determine if it can correctly read the text. If it correctly gets the text, it gets a 100%. Otherwise, it gets a 0%.

Prompt

                                             Read the text in the image. Return only the text, with punctuation.
                                         

Image

- Image of the input into GPT-4
+ Image of the input into GPT-4

Result

-
I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times.
+
The words of songs on the album have been echoing in my head all week. "Fades into the grey of my day old tea."

Test submitted by Roboflow

- +
@@ -669,7 +614,7 @@

Prompt

Image

Image of the input into GPT-4

Result

-
[{'name': 'MARY THOMAS', 'time_per_day': 1, 'medication': 'ATENOLOL', 'dosage': 100, 'rx_number': '1234567-12345'}]
+
[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]

Test submitted by Roboflow

@@ -709,7 +654,7 @@

Math OCR

Of the last 7 tests, conducted daily, this test has passed 100% of the time.

-

Today's request cost $0.016

+

Today's request cost $0.015

@@ -728,7 +673,66 @@

Result

- + +
+
+
+

Measurement Test

+

Can GPT-4 Measure Items Using a Reference?

+
+
+
+

Pass

+
+
+
+
+
+ Last 7-Day Performance +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+
+

Of the last 7 tests, conducted daily, this test has passed 14.0% of the time.

+

Today's request cost $0.009

+
+
+ +
+

Method

+
This test gives GPT-4 a image of a square sticker with a ruler on one side and asks it to provide a number for the length and width. We score this test based on precent error, gauging how far it is from the actual value.
+

Prompt

+
+                                            What is the length and width of this square sticker, based on the ruler? Return a JSON with float properties for `length` and `width` representing inches.
+                                        
+

Image

+ Image of the input into GPT-4 +

Result

+
```json
+{
+  "length": 3.5,
+  "width": 3.5
+}
+```
+

Test submitted by Roboflow

+
+
+
+
diff --git a/results/2024-09-14.json b/results/2024-09-14.json
new file mode 100644
index 0000000..009fd16
--- /dev/null
+++ b/results/2024-09-14.json
@@ -0,0 +1,106 @@
+{
+    "zero_shot_classification": {
+        "score": 1,
+        "success": true,
+        "price": 0.00481,
+        "pass_fail": "Pass",
+        "response_time": 2.1339406967163086,
+        "result": "Toyota Camry"
+    },
+    "count_fruit": {
+        "score": 0,
+        "success": false,
+        "price": 0.007870000000000002,
+        "pass_fail": "Fail",
+        "response_time": 4.496371269226074,
+        "result": "9"
+    },
+    "document_ocr": {
+        "score": 1,
+        "success": true,
+        "price": 0.008539999999999999,
+        "pass_fail": "Pass",
+        "response_time": 2.9639036655426025,
+        "result": "I was thinking earlier today that I have gone through, to use the lingo, eras of listening to each of Swift's Eras. Meta indeed. I started listening to Ms. Swift's music after hearing the Midnights album. A few weeks after hearing the album for the first time, I found myself playing various songs on repeat. I listened to the album in order multiple times."
+    },
+    "handwriting_ocr": {
+        "score": 1,
+        "success": true,
+        "price": 0.00876,
+        "pass_fail": "Pass",
+        "response_time": 20.625366926193237,
+        "result": "The words of songs on the album have been echoing in my head all week. \"Fades into the grey of my day old tea.\""
+    },
+    "extraction_ocr": {
+        "score": 1.0,
+        "success": true,
+        "price": 0.00719,
+        "pass_fail": "Pass",
+        "response_time": 3.4025280475616455,
+        "result": "[{'name': 'Mary Thomas', 'time_per_day': 1, 'medication': 'Atenolol', 'dosage': 100, 'rx_number': '1234567-12345'}]"
+    },
+    "math_ocr": {
+        "score": 1.0,
+        "success": true,
+        "price": 0.015290000000000002,
+        "pass_fail": "Pass",
+        "response_time": 4.145634174346924,
+        "result": "3x^2-6x+2"
+    },
+    "object_detection": {
+        "score": 0.504774897680764,
+        "success": false,
+        "price": 0.009490000000000002,
+        "pass_fail": "Fail",
+        "response_time": 4.783623695373535,
+        "result": "{'x': 0.38, 'y': 0.4, 'width': 0.2, 'height': 0.4}"
+    },
+    "graph_understanding": {
+        "score": 0.965,
+        "success": false,
+        "price": 0.01079,
+        "pass_fail": "Fail",
+        "response_time": 3.5321664810180664,
+        "result": "```json\n{\n \"A\": {\n \"quantity\": 20,\n \"price\": 10\n },\n \"B\": {\n \"quantity\": 25,\n \"price\": 25\n },\n \"C\": {\n \"quantity\": 30,\n \"price\": 30\n },\n \"D\": {\n \"quantity\": 35,\n \"price\": 40\n }\n}\n```"
+    },
+    "color_recognition": {
+        "score": 0.9607843137254902,
+        "success": false,
+        "price": 0.008870000000000001,
+        "pass_fail": "Fail",
+        "response_time": 2.9251863956451416,
+        "result": "```json\n{\n \"R\": 79,\n \"G\": 0,\n \"B\": 130\n}\n```"
+    },
+    "annotation_qa": {
+        "score": 0.33333333333333337,
+        "success": false,
+        "price": 0.02112,
+        "pass_fail": "Fail",
+        "response_time": 19.58968210220337,
+        "result": "The image provided contains visible cars annotated with red bounding boxes. I'll manually count the cars and the annotations to find any missing annotations.\n\n1. Cars in the image:\n - 1 car (left oncoming lane, closest)\n - 2 cars (next in line after the front car in the oncoming lane)\n - 3 cars (close proximity in the far center lane)\n - 1 car on the rightmost lane (closest to the camera)\n\nTotal cars: 7\n\n2. Annotations:\n - 1 car (left oncoming lane, closest)\n - 1 car (next in line after the front car, on the left)\n - 3 cars (in the far center lane)\n - 1 car on the rightmost lane (closest to the camera)\n \nTotal annotations: 6\n\nThe car closest to the camera on the right lane isn't annotated.\n\nTherefore, there is 1 missing annotation.\n\n```json\n{\n \"missing\": 1\n}\n```"
+    },
+    "measurement": {
+        "score": 1.0,
+        "success": true,
+        "price": 0.00877,
+        "pass_fail": "Pass",
+        "response_time": 5.455448865890503,
+        "result": "```json\n{\n \"length\": 3.5,\n \"width\": 3.5\n}\n```"
+    },
+    "easy_captcha": {
+        "score": 1,
+        "success": true,
+        "price": 0.004790000000000001,
+        "pass_fail": "Pass",
+        "response_time": 3.490755081176758,
+        "result": "charybdis indubitable"
+    },
+    "easy_captcha_persuade": {
+        "score": 1,
+        "success": true,
+        "price": 0.00529,
+        "pass_fail": "Pass",
+        "response_time": 17.351974725723267,
+        "result": "charybdis indubitable"
+    }
+}
\ No newline at end of file
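
Reviewer note: a quick way to sanity-check a results file like the one added above is to tally the `pass_fail` and `price` fields per test. The sketch below is only an illustration, not code from this repository: `summarize` is a hypothetical helper, and the sample holds three entries copied from `results/2024-09-14.json` with the `score`, `success`, `response_time`, and `result` fields left out for brevity.

```python
def summarize(results: dict) -> dict:
    """Tally one day's results. Hypothetical helper, not part of the repository."""
    # Split test names by the "pass_fail" field used in the results file.
    passed = [name for name, r in results.items() if r["pass_fail"] == "Pass"]
    failed = [name for name, r in results.items() if r["pass_fail"] != "Pass"]
    # Sum the per-request prices; round to hide float noise such as 0.007870000000000002.
    total_price = round(sum(r["price"] for r in results.values()), 5)
    return {"passed": passed, "failed": failed, "total_price": total_price}


# Three entries taken from the results file in this patch (other fields omitted).
sample = {
    "zero_shot_classification": {"price": 0.00481, "pass_fail": "Pass"},
    "count_fruit": {"price": 0.007870000000000002, "pass_fail": "Fail"},
    "measurement": {"price": 0.00877, "pass_fail": "Pass"},
}

summary = summarize(sample)
print(summary)
```

To run this against the full file, the dictionary could be loaded with `json.load` from the standard library and passed to `summarize` unchanged, assuming every entry carries `price` and `pass_fail` as they do in this commit.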