Exploring New Horizons: Gemini Pro Vision's Test Integration with Raster Promises a Smarter Way to Handle Visual Content.

  • Leonardo
    Senior Engineer
  • Claudio
    Director of Engineering

What is Raster?

Raster is a digital asset manager for modern teams, developed in-house by the Monogram team. It focuses on saving time in organizing, editing, and hosting photography. Raster offers features like AI-driven organization, nondestructive collaborative editing, and efficient photo management. It uses AI to organize images with smart tags and streamlines workflows for developers, designers, and marketing teams.

What is Gemini Pro Vision?

Gemini Pro Vision is a model in Google's Gemini family, which Google describes as its most advanced and versatile AI to date. Developed collaboratively by teams across Google, including Google Research, Gemini is designed to be multimodal, capable of understanding and operating across different types of information such as text, code, audio, images, and video. Gemini Pro is optimized for scaling across a wide range of tasks, with capabilities aimed at helping developers and enterprise customers build and scale with AI. For more details, you can visit the Google blog post.

The need for AI

We've been eagerly anticipating the opportunity to test out the multimodal prompts capability of the Gemini API. And what better platform to test it with than Raster!

This would allow Raster to:

  • automatically label each image with highly relevant tags, revolutionizing the way we search and organize our visual assets
  • generate alt tags that are both user-friendly and SEO-optimized, giving our images an extra boost in terms of search engine visibility and accessibility

How we did it

Let’s focus on the alt text generation functionality. Here’s how we did it:

Image Acquisition: Access the uploaded image via a direct API call.

Image Processing: Preprocess the image by resizing it and converting it to a format that the Gemini Pro Vision API accepts.

Content Analysis via Clear Prompt Instructions: Use Gemini Pro Vision's content analysis capabilities to extract meaningful insights from the image, identifying prominent objects, scenes, and other visual elements. This is where Gemini’s multimodal capability shines.

Alt Text Generation: Based on the extracted insights, Gemini Pro Vision generates a concise, descriptive alt text that accurately conveys the image's content.

Integration with Raster: Store the generated alt text in the appropriate fields in Raster's database and display the result in the user interface.

Google AI Studio

We started with Google AI Studio at https://makersuite.google.com to test and refine prompts until we got the responses we wanted.

After experimenting with various inputs ourselves, we tried asking Gemini to write an optimal prompt for an AI model. This helped us refine our input and achieve the desired outcome.

In addition to testing different prompts, the generation config is also important for fine-tuning the responses, specifically the temperature, topK, and topP parameters. You can learn more about their meaning and values in the Gemini API documentation.
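To build intuition for these three parameters, here is a simplified sketch of how temperature, top-K, and top-P typically shape the pool of candidate tokens before one is sampled. It illustrates the general technique only and is not Gemini's actual implementation; the function names and the toy scores are made up for this example.

```typescript
// Simplified illustration of temperature / top-K / top-P filtering.
// Generic technique only; NOT Gemini's internal implementation.

type Scored = { token: string; logit: number }
type Candidate = { token: string; prob: number }

// Turn raw scores (logits) into probabilities. A lower temperature sharpens
// the distribution; a higher one flattens it.
function softmax(scores: Scored[], temperature: number): Candidate[] {
  const max = Math.max.apply(null, scores.map((s) => s.logit))
  const weights = scores.map((s) => Math.exp((s.logit - max) / temperature))
  const total = weights.reduce((sum, w) => sum + w, 0)
  return scores.map((s, i) => ({ token: s.token, prob: weights[i] / total }))
}

// Keep the topK most likely tokens, then the shortest prefix of those whose
// cumulative probability reaches topP.
function filterCandidates(candidates: Candidate[], topK: number, topP: number): Candidate[] {
  const sorted = candidates.slice().sort((a, b) => b.prob - a.prob).slice(0, topK)
  const kept: Candidate[] = []
  let cumulative = 0
  for (const candidate of sorted) {
    kept.push(candidate)
    cumulative += candidate.prob
    if (cumulative >= topP) break
  }
  return kept
}

// Toy scores for the next word of an image description (invented values).
const logits: Scored[] = [
  { token: 'dog', logit: 3 },
  { token: 'puppy', logit: 2 },
  { token: 'cat', logit: 1 },
  { token: 'car', logit: -1 }
]

const sharp = filterCandidates(softmax(logits, 0.4), 32, 0.9)
const flat = filterCandidates(softmax(logits, 2), 32, 0.9)
console.log(sharp.map((c) => c.token))
console.log(flat.map((c) => c.token))
```

With these toy numbers, the low temperature concentrates almost all probability on `dog`, so it is the only candidate that survives the top-P cut, while the higher temperature leaves `dog`, `puppy`, and `cat` in the pool. That is why a low temperature tends to suit tasks like alt text, where consistent output matters more than variety.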

After iterating in Google AI Studio, we found the "Get code" option highly useful. The AI Studio web app provides all the code needed to run your prompts, which simplifies moving ideas from the studio into the application.

As JavaScript developers, the quickest way for us to start using Gemini was to call the API from our web app through the Google AI JavaScript SDK. This SDK is suitable for anyone who prefers not to work with REST APIs or server-side code (such as Node.js) to access Gemini models in their web app.

Side note: If you run the code on the client, where the API key is exposed, make sure to set restrictions on the key in the Google Cloud Console based on your specific use case.

Google AI SDK

The Google AI JavaScript SDK, available as an open source repo, allows developers to utilize Google's Generative AI models. This SDK supports various use cases including:

  • Generate text from text-only input
  • Generate text from multimodal prompts (text and images)
  • Build multi-turn conversations (chat)

After getting the code from Google AI Studio, we refactored it to align with our style guide and enhance reusability for future Gemini API features within Raster.

The Code

First, install the JS SDK as a dependency in your project:

pnpm add @google/generative-ai

This is the definition of the reusable function we created:

async function getGeminiProVisionResponse(
	text: string,
	imageUrl: string,
	options?: { generationConfig?: GenerationConfig; safetySettings?: SafetySetting[] }
): Promise<string>

We will discuss each parameter in more detail below.

Import the library into the project, set up the API key as an environment variable, and instantiate the AI model. In our case, we used gemini-pro-vision as we're dealing with images.

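// The types and enums below are used by the snippets that follow.
import {
	GoogleGenerativeAI,
	HarmBlockThreshold,
	HarmCategory,
	type GenerationConfig,
	type Part,
	type SafetySetting
} from '@google/generative-ai'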

async function getGeminiProVisionResponse(
	text: string,
	imageUrl: string,
	options?: { generationConfig?: GenerationConfig; safetySettings?: SafetySetting[] }
): Promise<string> { 
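	// Fail fast if the key is missing rather than sending a request without it.
	const apiKey = process.env.GEMINI_API_KEY
	if (!apiKey) throw new Error('GEMINI_API_KEY is not set')
	const genAI = new GoogleGenerativeAI(apiKey)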

	const model = genAI.getGenerativeModel({
		model: 'gemini-pro-vision'
	})
	
	// ...

We fetch the images that are already uploaded to Raster:

const imageFile = await fetch(imageUrl)
if (!imageFile.ok) throw new Error(`Failed to find image at ${imageUrl}`)

Define the Gemini API parameters: the prompt string and the image as base64.

const parts: Part[] = [
  { text },
  {
    inlineData: {
      mimeType: 'image/jpeg',
      data: Buffer.from(await imageFile.arrayBuffer()).toString('base64')
    }
  }
]
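The snippet above hard-codes `image/jpeg`. If the stored images can be in other formats (an assumption about Raster, not something stated here), one option is to derive the MIME type from the response's `Content-Type` header. The helper below is a hypothetical sketch; its name and the list of accepted types are assumptions that should be checked against the current Gemini API documentation.

```typescript
// Hypothetical helper: choose the value to send as `inlineData.mimeType`.
// The accepted list is an assumption; verify it against the Gemini API docs.
const SUPPORTED_IMAGE_TYPES = ['image/jpeg', 'image/png', 'image/webp', 'image/heic', 'image/heif']

function resolveImageMimeType(contentTypeHeader: string | null): string {
  // A header can carry parameters, e.g. "image/png; charset=binary".
  const [rawType = ''] = (contentTypeHeader ?? '').split(';')
  const mimeType = rawType.trim().toLowerCase()
  if (SUPPORTED_IMAGE_TYPES.indexOf(mimeType) === -1) {
    throw new Error(`Unsupported image type: ${mimeType || 'unknown'}`)
  }
  return mimeType
}
```

In the function from this post, it could be used as `mimeType: resolveImageMimeType(imageFile.headers.get('content-type'))`, so that an unexpected format fails early with a clear message instead of being sent with the wrong label.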

The generation config is a parameter that can be adjusted on a case-by-case basis to optimize your responses. We made the options parameter optional when creating a reusable function, but retained the default values from the AI Studio as a fallback.

// see: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini#gemini-pro-vision
const DEFAULT_GENERATION_CONFIG: GenerationConfig = {
  temperature: 0.4,
  topK: 32,
  topP: 1,
  maxOutputTokens: 4096
}

// see: https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini#gemini-pro-vision
const DEFAULT_SAFETY_SETTINGS: SafetySetting[] = [
  {
    category: HarmCategory.HARM_CATEGORY_HARASSMENT,
    threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
  },
  {
    category: HarmCategory.HARM_CATEGORY_HATE_SPEECH,
    threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
  },
  {
    category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT,
    threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
  },
  {
    category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
    threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
  }
]

Finally, we can call the Gemini API to obtain the response and return it:

const result = await model.generateContent({
  contents: [{ role: 'user', parts }],
  generationConfig: options?.generationConfig || DEFAULT_GENERATION_CONFIG,
  safetySettings: options?.safetySettings || DEFAULT_SAFETY_SETTINGS
})

return result.response.text()
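One thing to be aware of: the `||` fallback swaps in the defaults only when no config is passed at all. A caller who passes `{ temperature: 0.1 }` gets just that field, without the default `topK`, `topP`, or `maxOutputTokens`. If per-field fallbacks are preferred, a merge is one possible alternative. The sketch below uses a local stand-in type and a hypothetical `withDefaults` helper so it runs on its own; in the real function the SDK's `GenerationConfig` type would take its place.

```typescript
// Local stand-in so the sketch is self-contained; the real code would use
// `GenerationConfig` from '@google/generative-ai'.
type GenerationConfigSketch = {
  temperature?: number
  topK?: number
  topP?: number
  maxOutputTokens?: number
}

// Same values as the AI Studio defaults shown earlier in this post.
const DEFAULTS: GenerationConfigSketch = {
  temperature: 0.4,
  topK: 32,
  topP: 1,
  maxOutputTokens: 4096
}

// Hypothetical helper: caller-supplied fields win, the rest fall back.
function withDefaults(overrides?: GenerationConfigSketch): GenerationConfigSketch {
  return { ...DEFAULTS, ...overrides }
}

const config = withDefaults({ temperature: 0.1 })
console.log(config)
```

Note that a field explicitly set to `undefined` in the overrides would still replace the default, so callers should omit fields they do not want to change.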

Now, we can implement multiple features in the app simply by calling this function.
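As a rough sketch of what such features might look like, the two wrappers below build alt text and tags on top of a function with the same shape as `getGeminiProVisionResponse`. The prompts, the wrapper names, and the clean-up rules are hypothetical and are not the ones used in Raster; the model function is passed in as a parameter so the example can run without the SDK or an API key.

```typescript
// Same shape as the reusable function from this post, minus the options argument.
type VisionFn = (text: string, imageUrl: string) => Promise<string>

// Hypothetical prompts; the real ones were tuned in Google AI Studio.
const ALT_TEXT_PROMPT = 'Describe this image in one concise sentence suitable for an alt attribute.'
const TAGS_PROMPT = 'List up to 10 short tags for this image as a comma-separated list.'

async function generateAltText(ask: VisionFn, imageUrl: string): Promise<string> {
  const response = await ask(ALT_TEXT_PROMPT, imageUrl)
  // Collapse whitespace, then strip wrapping quotes the model may add.
  return response.replace(/\s+/g, ' ').trim().replace(/^["']+|["']+$/g, '')
}

async function generateTags(ask: VisionFn, imageUrl: string): Promise<string[]> {
  const response = await ask(TAGS_PROMPT, imageUrl)
  const tags = response
    .split(',')
    .map((tag) => tag.trim().toLowerCase())
    .filter((tag) => tag.length > 0)
  // Drop duplicates while keeping first-seen order.
  return tags.filter((tag, index) => tags.indexOf(tag) === index)
}
```

In the app, the calls would then look like `await generateAltText(getGeminiProVisionResponse, imageUrl)`. Passing the model function in also makes the wrappers easy to unit test with a fake response.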

Here's a demonstration of the image alt text:

The recent test integration of Gemini Pro Vision into Raster has demonstrated promising potential, including the ability to automatically label images with relevant tags. This innovation could revolutionize how we search and categorize visual assets. Additionally, the generation of user-friendly and SEO-optimized alt tags in this test phase suggests a future where image visibility and accessibility are greatly enhanced, improving the overall user experience.

This collaboration between Raster and Gemini Pro Vision, though still in its testing phase, highlights the incredible possibilities that AI can bring to digital asset management, accessibility, and user experience. As we watch technology evolve, it's thrilling to consider how these emerging innovations might shape our interaction with visual content. We're on the cusp of a new era in content management, driven by the burgeoning partnership of Raster and Gemini Pro Vision.

Useful Links