Turn pictures into structured data with ChatGPT

I’ve been spending more time on the treadmill lately and realized that I wasn’t really keeping track of my progress. While I could create a spreadsheet and type in all the important numbers at the end of my workout, it’s much easier to just snap a picture. And once I get home, I can easily upload this picture and get ChatGPT to turn it into structured data, which I can put into a spreadsheet or whatever tool I want to use to track this data over time.

This is definitely the sort of workflow that would make more sense if you were building a workout app and wanted to let users log their workout progress by taking a picture of a treadmill or a weight machine after they were done. However, it does work as a standalone process as well!

In this article, we’ll take a look at a basic prompt to process these images, what modifications needs to be made to support multiple kinds of treadmill images and how to parse a batch of images and have ChatGPT generate a CSV file for you to download.

Parsing an image with ChatGPT

To get started, I grabbed the latest image I took of a completed workout, which looked like this:

Luckily for us, this image is pretty straightforward, with the labels right underneath the data points. There’s no indication whether the distance is miles or kilometers, so we’ll need to be specific about that. But, with our image ready, we can head over to ChatGPT, upload our image by clicking on the paperclip and providing the following prompt.

Attached is an image showing a treadmill at the completion of a workout. What are the calories burned, distance (in miles) and time (in minutes and seconds) of the workout?

Please provide just the data points with no further explanation or wording.

When we run this, ChatGPT looks at the image and uses that, along with the prompt, to give us our response. After letting it run for a few seconds, we get the following output:

ChatGPT
512 Calories, 3.89 Miles, 49:41 (49 minutes and 41 seconds)

This is exactly what we asked for! ChatGPT was able to parse the image and extract the information we were looking for. We were able to use a relatively simple prompt because we had a clear image, with a straightforward layout of labels and data. But what if we had a slightly more complicated image?

Parsing more complicated or varied images

Our simple prompt works great for simple images, but what about treadmills where more than just a couple statistics are visible or where certain numbers need a button press to access. Take a look at this image for example.

With this image, each numerical display shows one of two possible statistics, based on which indicator light above is illuminated. For example, the middle number is either total calories burned or calories per hour, but we know that total calories are being displayed because of the illuminated light next to CALS . To help ChatGPT parse this image correctly, we need to tell it about this added complexity, with something like the following prompt.

Attached is an image showing a treadmill at the completion of a workout. Please output the important statistics shown in the image. If a particular number has two potential labels, the label with a light illuminated to the left of it is the correct label. Assume all distances are in miles and speeds in miles per hour.

Please provide just the data points with no further explanation or wording.

Including the image above with that prompt, we get the following output from ChatGPT:

ChatGPT
• Incline: 0.0
• Time: 24:26
• Calories: 307
• Calories/Hour: 757
• Heart Rate: —
• Distance: 4.54
• Vertical Feet: —
• Pace: —
• Speed: 0.0

What’s interesting about this is that, in addition to parsing the actual output of the image, it calculated the Calories/Hour based on the calories and time it found in the image. It misinterpreted the distance number (which was actually vertical feet), because there was no light next to either label.

This highlights an important feature of all of these AI tools. You can do a lot with these tools, but having good input data is one of the best ways to make more effective tools.

When I tried this with blurry images (or image without fully lit label indicators like this one) many prompt adjustments and different approaches simply had no impact on getting more correct output. However, with clear imagery, it’s amazing the sort of data we can get with relatively straightforward prompting.

Parsing a batch of images

What if instead of a single image, we had a bunch of images to parse? Luckily, ChatGPT can handle this too. We just need to modify our prompt slightly to ask it to loop through each image and tell it what sort of file format we want our data to be presented in.

Attached are multiple images showing a treadmill at the completion of a workout. Please output the important statistics shown in each image. If a particular statistic has two potential labels, the label with a light illuminated to the left of it is the correct label. Assume all distances are in miles and speeds in miles per hour.

Please provide just the data points with no further explanation or wording.

This takes a bit longer because ChatGPT has to parse each of the images, analyze them and return your data. A couple of times when I ran this prompt and asked for a CSV download, I actually ran into issues where ChatGPT’s execution would time out, it would try an alternate solution and still fail, which shows just how important the specific wording in your prompt really is.

Once we get to the end of this process, we see that ChatGPT has once again correctly identified the relevant data in each of the images.

ChatGPT
Image 1:

• Calories: 511
• Distance: 2.87 miles
• Time: 52:00 minutes
• Incline: 0.0

Image 2:

Calories: 517
Distance: 2.27 miles
Time: 45:55 minutes
Incline: 0.0

Image 3:

• Calories: 501
• Distance: 2.13 miles
• Time: 42:58 minutes
• Incline: 0.0

Wrapping Up

What I love about these sorts of workflow examples is that they used to be a relatively difficult challenge of writing a computer vision script, figuring out where to deploy it and actually managing the compute needed to run it. Now, you can drag and drop a few images, iterate on a couple lines of text to define your prompt and you have structured data generated from images.

If you were using the OpenAI API to power a feature for a particular application, you might want to get a bit more specific with your prompts and request a certain format of output, but this provides a general model for how to use AI tooling to parse data trapped in images.

How are you using workflows like this in your own work? I would love to hear about it. Send me an email at keanan@floorboardai.com and let me know!