July 14, 2024 How to use Yi-Vision with TextGrad

#AI #textgrad

I've already written two posts on TextGrad, "July 9: How to use DeepSeek with TextGrad" and "June 25, 2024: TextGrad," in which I explore this autograd engine that improves language model outputs through iterative textual feedback. TextGrad has recently added multimodal optimization, so it can refine responses about images as well. At the moment, this feature works only with the GPT-4 series from OpenAI and the Claude series from Anthropic. However, a small change is enough to use it with other models: rather than writing a new engine class, you make a minor edit to the existing script. I'll walk you through the process using Yi-vision from 01-ai.

Step 1

Let's make a couple of tweaks to the openai.py script, found at textgrad/engine/openai.py. In the ChatOpenAI class, change the default model_string to yi-vision and set is_multimodal to True. These changes are minor but essential for the steps that follow.
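As a rough illustration of the Step 1 edit, here is a minimal sketch of what the two changed defaults look like. ChatOpenAISketch is a hypothetical stand-in written for this post, not the library's actual source: the real ChatOpenAI constructor takes more parameters and does real work, and only model_string and is_multimodal are taken from the description above.

```python
class ChatOpenAISketch:
    """Hypothetical stand-in for textgrad's ChatOpenAI.

    Only the two defaults discussed in Step 1 are modeled here;
    everything else about the real class is omitted.
    """

    def __init__(self, model_string: str = "yi-vision",
                 is_multimodal: bool = True, **kwargs):
        # After the edit, the engine targets yi-vision by default ...
        self.model_string = model_string
        # ... and reports that it accepts image input.
        self.is_multimodal = is_multimodal


engine = ChatOpenAISketch()
print(engine.model_string, engine.is_multimodal)
```

In the real file you would only change the two default values in the existing signature and leave the rest of the class untouched.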

Step 2

Next, add the model name yi-vision to the __MULTIMODAL_ENGINES__ list in the engine package's __init__.py file (textgrad/engine/__init__.py). This step ensures the model is recognized as capable of handling multimodal tasks.
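To make the purpose of Step 2 concrete, here is a small sketch of an allow-list check. The names multimodal_engines and is_multimodal_engine are hypothetical stand-ins, and the existing entry is only illustrative; the library's own list and validation code may differ in detail.

```python
# Hypothetical stand-in for the list of multimodal-capable model names.
# "gpt-4o" is just an illustrative existing entry.
multimodal_engines = ["gpt-4o"]

# The Step 2 edit amounts to adding one more name to that list.
multimodal_engines.append("yi-vision")


def is_multimodal_engine(model_string: str) -> bool:
    """Return True if the model name is on the multimodal allow-list."""
    return model_string in multimodal_engines


print(is_multimodal_engine("yi-vision"))
```

The point of the sketch is that a model missing from the list is treated as text-only, no matter what the provider actually supports.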

Step 3

Now, you can utilize ChatExternalClient with your API key from 01-ai to handle multimodal tasks. Here's how you can integrate the code:

from openai import OpenAI
from textgrad.engine.local_model_openai_api import ChatExternalClient
import textgrad as tg
from PIL import Image
from textgrad.autograd import MultimodalLLMCall
from textgrad.loss import ImageQALoss
import io


client = OpenAI(base_url="https://api.lingyiwanwu.com/v1", api_key="<your api_key>")
engine = ChatExternalClient(client=client, model_string='yi-vision')
tg.set_backward_engine(engine, override=True)

With setup complete, you're ready to load your image. The example picture was generated with Midjourney.


# Load the image and re-encode it as PNG bytes
image_path = 'dylan.png'
image = Image.open(image_path)
byte_io = io.BytesIO()
image.save(byte_io, "PNG")
image_bytes = byte_io.getvalue()

# Note: `question` holds the image bytes; `question_variable` holds the text question
question = tg.Variable(image_bytes, role_description="image to answer a question about", requires_grad=False)
question_variable = tg.Variable('What do you see in this image?', role_description="question", requires_grad=False)

# First pass: ask the model about the image
answer = MultimodalLLMCall(engine)([question, question_variable])
print("first answer:", answer)
print('-'*50)

# Critique the answer without producing a new one
loss_fn = ImageQALoss(evaluation_instruction="Does this seem like a complete and good answer for the image? Criticize. Do not provide a new answer.", engine=engine)
loss = loss_fn(question=question_variable, image=question, response=answer)
print("Some points to improve the answer:\n")
print(loss)
print('-'*50)

# One optimization step: propagate the critique and rewrite the answer
optimizer = tg.TGD(parameters=[answer])
loss.backward()
optimizer.step()
print('final answer:', answer.value)

After running the code, review the output:

first answer:  The image captures the majestic beauty of a Blue-throated Macaw, a bird species known for its striking appearance and vibrant personality. The bird is the central focus of the image, its body filling most of the frame. Its feathers are a soft, light blue, providing a serene contrast to the fiery orange and red crest that adorns its head. The crest, composed of long, slender feathers tipped with white, fans out like a burst of flames, adding a dynamic element to the bird's overall appearance.

The bird's face is a study in contrasts, with a black mask encircling its eye and a blue throat that seamlessly blends into the rest of its body. Its eye, a deep, piercing red, stands out against the black and blue of its face, drawing the viewer's attention. The beak, a sharp, curved black, adds a touch of sharpness to the bird's otherwise soft appearance.

The background of the image is a blurred, dark blue, providing a neutral backdrop that allows the bird to stand out. The bokeh effect in the background adds a touch of depth to the image, making the bird appear as if it's part of a larger, unseen world. The overall composition of the image, with its focus on the bird and the blurred background, creates a sense of intimacy, as if the viewer is getting a close-up look at this beautiful creature in its natural habitat.

--------------------------------------------------
Some points to improve the answer:

The image is a beautiful representation of a Blue-throated Macaw, capturing its unique features and vibrant personality. The bird's light blue feathers and fiery orange crest create a striking contrast, making it the focal point of the image. The blurred, dark blue background enhances the bird's appearance, creating a sense of intimacy and depth. The image effectively highlights the bird's distinctive features, such as its black mask, blue throat, and red eye, making it a captivating portrait of this majestic creature.

--------------------------------------------------
final answer: The image is a stunning portrayal of a Blue-throated Macaw, showcasing its unique and captivating features. The bird's light blue plumage is soft and delicate, providing a serene contrast to the vibrant orange and red of its crest. This crest, composed of long, slender feathers tipped with white, fans out like a burst of flames, adding a dynamic and dramatic element to the bird's appearance.

The bird's face is a striking study in contrasts, with a bold black mask encircling its piercing red eye and a blue throat that seamlessly blends into the rest of its body. The texture of the feathers is intricately detailed, with each feather showcasing a unique pattern that adds depth and complexity to the bird's overall appearance.

The blurred, dark blue background of the image serves to highlight the bird, creating a sense of intimacy and depth. The bokeh effect in the background adds an additional layer of visual interest, making the bird appear as if it's part of a larger, unseen world.

In addition to its striking appearance, the Blue-throated Macaw is known for its vibrant personality and complex social behaviors. These birds are native to the rainforests of Central and South America, where they can be found living in large, noisy flocks. They are known for their intelligence and ability to mimic a wide range of sounds, including human speech.

By providing a detailed and vivid description of the Blue-throated Macaw's unique features, as well as information about its natural habitat and behavior, the image offers a comprehensive and engaging portrait of this majestic creature.

Because the picture was generated with Midjourney, the exact species is unclear, since generated images don't always match real-world birds. Even so, it's encouraging to see a noticeably more detailed answer after just a single iteration, and running additional optimization steps could refine it further. This kind of iterative refinement is particularly useful when testing the capabilities of smaller models.
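If you want to try more than one iteration, the loss, backward, and step sequence can be wrapped in a loop. The helper below is a minimal sketch, not part of TextGrad: refine is a hypothetical name, and it assumes only that the optimizer exposes zero_grad() and step(), that the loss exposes backward(), and that the answer has a value attribute.

```python
from typing import Any, Callable, List


def refine(answer: Any, compute_loss: Callable[[Any], Any],
           optimizer: Any, steps: int = 3) -> List[str]:
    """Run several critique-and-rewrite steps and record each answer.

    compute_loss(answer) must return an object with a backward() method;
    optimizer must provide zero_grad() and step() (assumed interface).
    """
    history = [answer.value]          # keep the starting answer
    for _ in range(steps):
        optimizer.zero_grad()         # clear feedback from the previous step
        loss = compute_loss(answer)   # re-evaluate the current answer
        loss.backward()               # propagate the textual critique
        optimizer.step()              # rewrite the answer in place
        history.append(answer.value)
    return history


# Possible usage with the objects from the script above (not run here):
# history = refine(
#     answer,
#     lambda a: loss_fn(question=question_variable, image=question, response=a),
#     optimizer,
#     steps=3,
# )
```

Keeping the history makes it easy to compare successive answers and to stop early if later steps start adding details the image doesn't support.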

Conclusion

In this blog post, I introduced a quick and lazy way to integrate a third-party provider with TextGrad for multimodal tasks. It's efficient because it doesn't require writing a new class for the third-party client. In this example, a single TextGrad step was enough to produce a noticeably more detailed response from the model.

