IBM’s Granite 3.2 Vision Model
On NVIDIA with open-source UI for AI (Tech to Art coverage).
Background:
As someone who is always on the lookout for the best advancements in generative AI, I recently decided to try IBM’s latest Granite 3.2 Vision model locally on my NVIDIA Jetson Orin Super GPU Kit in my home GenAI lab. Over the past few years, I have worked with 100+ language models (large and small) and wrote a dedicated chapter on model selection, with 29 criteria, in my book published on Amazon: Generative AI for Enterprises, Essential Insights for Decision Makers.
The results were nothing short of astonishing, reinforcing the immense potential of open-source models, in this case a powerful yet small vision model.
Granite 3.2 Vision is IBM’s latest iteration in its AI model series, designed to deliver high accuracy in visual understanding tasks. In other words, it is a compact and efficient vision-language model, specifically designed for visual document understanding, enabling automated content extraction from tables, charts, infographics, plots, diagrams, and more. The model was trained on a meticulously curated instruction-following dataset, comprising diverse public datasets and synthetic datasets tailored to support a wide range of document understanding and general image tasks. It was trained by fine-tuning a Granite large language model with both image and text modalities.
Given my focus on exploring cutting-edge generative AI solutions, deploying this model on an edge GPU computing device was an amazing experience. The NVIDIA Jetson Orin Super GPU Kit provided the perfect hardware platform to test the model’s capabilities. With its robust computational power, the device ensured smooth inferencing with minimal latency, making it an ideal choice for real-time AI applications.
The combination of IBM’s model and NVIDIA’s hardware created a synergy that exemplified the future of AI at the edge.
The results demonstrated the model’s remarkable precision in processing visual data. Image detection, scene segmentation, and contextual understanding were executed with impressive accuracy. The model’s ability to distinguish fine details and make intelligent sense of them showcased IBM’s commitment to advancing generative AI in the open-source space. As GenAI continues to evolve (and become commoditized), the ability to run sophisticated models on compact yet powerful devices will be a game-changer for many industries.
My experience serving IBM’s Granite 3.2 Vision model on NVIDIA’s Jetson Orin platform felt like a significant step forward toward scalable, high-performance vision solutions at the edge.
Practical:
These days I am using Ollama with Open WebUI as my home GenAI platform for model testing, proxy testing, agentic AI testing, and more.
To install the Granite 3.2 Vision model, I had to upgrade Ollama to 0.5.13 as a prerequisite, as shown below. Ollama also identified the NVIDIA platform automatically.
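For readers who prefer scripting the same step outside the UI, here is a minimal sketch using the Ollama Python client. It assumes the `ollama` package is installed and an Ollama server (0.5.13 or later) is running locally; the model tag `granite3.2-vision` is the one published in the Ollama library at the time of writing.

```python
import ollama

# Pull the Granite 3.2 Vision model
# (equivalent to running `ollama pull granite3.2-vision` on the CLI)
ollama.pull("granite3.2-vision")

# Print the local model inventory to confirm the download
print(ollama.list())
```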

With version 0.5.12, the installation would fail. After the upgrade, I installed the model successfully and could access Granite 3.2 Vision through the UI. I decided to test the model’s output in two ways, as follows:
Step 1: Technology-related image
I uploaded an image of the Microsoft 365 Copilot architecture and asked the model to explain it.
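For reference, the same request could also be made programmatically with the Ollama Python client; this is only a sketch, and the local image path shown here is hypothetical.

```python
import ollama

# Hypothetical local path to the architecture diagram uploaded in Open WebUI
image_path = "m365_copilot_architecture.png"

response = ollama.chat(
    model="granite3.2-vision",
    messages=[
        {
            "role": "user",
            "content": "Explain this architecture diagram.",
            "images": [image_path],  # the client encodes the file for the API
        }
    ],
)
print(response["message"]["content"])
```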
Below is the output.


Step 2: Art-related image
I uploaded an image of the Mona Lisa and deliberately asked, “Whose picture is this? Provide all the details.”
Below is the output. Quite impressively, it identified the individual in the image rather than describing the person generically.

As you can see, both outputs were accurate and of high quality.
Throughout this experiment, I also wanted to understand the GPU’s behaviour during and after inference with this vision model, so I captured NVIDIA performance stats as shown below:
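One way to log these readings on a Jetson is to sample the stock tegrastats utility that ships with NVIDIA JetPack; the snippet below is a rough sketch of that approach, not the exact method used for the screenshots.

```python
import subprocess
import time

# Start tegrastats in the background and capture its output line by line
proc = subprocess.Popen(["tegrastats"], stdout=subprocess.PIPE, text=True)

start = time.time()
lines = []
while time.time() - start < 10:  # sample for ~10 seconds around the inference call
    line = proc.stdout.readline()
    if line:
        lines.append(line.strip())

proc.terminate()

# Each tegrastats line reports RAM usage and GR3D_FREQ (GPU load), among other counters
for line in lines:
    print(line)
```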
NVIDIA / GPU utilization during inference:

NVIDIA / GPU utilization after inference:

Conclusion:
In my experience, the Granite 3.2 Vision model is indeed one of the highest-quality small open-source vision models.
Disclaimer: Personal blog, personal experiment, personal views.