Microsoft’s new vision-language model outranks humans at image captioning
The company will integrate the new model into Azure Cognitive Services to support vision-language tasks

Microsoft researchers have developed a new object-attribute detection model for image encoding, called VinVL (Visual features in Vision-Language).
Vision-language (VL) systems make it possible to search for images relevant to a text query (or vice versa). They also help describe an image’s content.
In most cases, the systems use two modules to achieve the VL understanding: an image encoding module to generate feature maps of an input image and a vision-language fusion module to map the encoded image and text into vectors in the same semantic space.
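To make the idea of a shared semantic space concrete, here is a minimal Python sketch of text-to-image search. It is not Microsoft's implementation: the three-dimensional vectors, the file names, and the `rank_images` helper are all hypothetical stand-ins for the outputs a real image encoder and fusion module would produce. The sketch only shows why placing text and images in the same space lets a system rank images against a query by similarity.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical image embeddings. In a real VL system these would come from
# the image encoding module followed by the fusion module.
IMAGE_VECTORS = {
    "dog_on_beach.jpg": [0.9, 0.1, 0.2],
    "city_at_night.jpg": [0.1, 0.8, 0.3],
    "bowl_of_fruit.jpg": [0.2, 0.2, 0.9],
}


def rank_images(query_vector, image_vectors):
    """Sort images from most to least similar to the query vector."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in image_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embedding of the text query "a dog playing by the sea".
query = [0.85, 0.15, 0.25]

for name, score in rank_images(query, IMAGE_VECTORS):
    print(f"{name}: {score:.3f}")
```

Because both kinds of input end up as vectors of the same length, the same ranking function works in the other direction too: given an image vector, it can rank candidate text descriptions.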
Microsoft’s new research focuses on improving the image-encoding module. When combined with VL fusion modules such as OSCAR and VIVO, Microsoft’s newest VL system scored big on the most competitive artificial intelligence (AI) benchmarks, including visual question answering (VQA), Microsoft COCO Image Captioning, and novel object captioning (nocaps).
The tech giant also highlighted that VinVL significantly surpasses human performance on the nocaps leaderboard for consensus-based image description evaluation (CIDEr).
To achieve these results, Microsoft trained its VinVL object-attribute detection model on a large object detection dataset containing 2.49 million images, covering 1,848 object classes and 524 attribute classes. Microsoft formed the dataset by merging four public object detection datasets: COCO, Open Images, Objects365, and Visual Genome (VG).
“We first pretrained an object detection model on the merged dataset, and then fine-tuned the model with an additional attribute branch on VG, making it capable of detecting both objects and attributes,” said Microsoft.
“Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes. As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments.”
Despite the promising results, Microsoft said its model is by no means close to human-level VL understanding.
Microsoft also announced that VinVL would be available to the public for general use. Additionally, it will integrate VinVL into Azure Cognitive Services to power a wide range of Microsoft services, including Image Captioning in Office and LinkedIn, and Seeing AI.