How to Burn Images and Text into MP4 Frames with Media Foundation

Mar 16, 2026 01:00 · Go Komura · Media Foundation, C++, Windows Development, GDI+, Direct2D, DirectWrite, H.264

Logo watermarks, inspection results, equipment IDs, operator names, timestamps. The requirement to burn this kind of information into every frame of an MP4 video and produce a new MP4 is quite common in surveillance, inspection, audit trails, and analysis UIs.

But once you start working with Media Foundation, you are confronted with IMFSourceReader, IMFSample, IMFMediaBuffer, IMFTransform, and IMFSinkWriter, and it suddenly becomes hard to see where exactly you are supposed to overlay text or a PNG.

In this article, we first lay out the big picture — Source Reader -> drawing -> color conversion -> Sink Writer — and then provide a single-file sample you can paste straight into a Visual Studio C++ console application. The sample reads a given MP4, draws a given image plus the text HelloWorld onto every frame, and produces an output MP4.

Note that this sample prioritizes being something you can paste in and run immediately, so it uses a configuration that re-encodes the video only. You could cram audio remuxing into the same program, but since the topic of this article is “burning an image and text into every frame,” we focus on that first.

The code in this article is published on GitHub as a complete sample set (a single-file .cpp plus a CMake build configuration).

media-foundation-overlay-image-text-on-mp4-frames - komurasoft-blog-samples (GitHub)

1. The Short Answer First

The basic pattern for putting an image or text into every frame of an MP4 is decode with the Source Reader -> composite onto uncompressed frames -> convert colors if needed -> re-encode with the Sink Writer.
Placing the image or text itself is not Media Foundation’s job. It is more natural to think about this part with drawing APIs such as GDI+, Direct2D, DirectWrite, and WIC.
If you are writing back to MP4 (H.264), you will often need a conversion stage that bridges RGB32 / ARGB32, which is easy to draw on, and NV12 / I420 / YUY2, which encoders accept readily.
If you want to get your first version working, the configuration Source Reader -> RGB32 -> draw with GDI+ -> NV12 -> Sink Writer is easy to follow.
If you want to prioritize speed and extensibility, moving toward D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer gives you more headroom.

2. Why This Problem Is a Bit Tricky

“Putting text into a video” is actually four different topics mixed together.

Containers vs. codecs: An MP4 is a container, not the frames themselves. The contents are usually compressed data such as H.264 or H.265.
Decoding / encoding: While the data is still compressed, you cannot simply overlay text or a PNG with an ordinary 2D drawing API. You first need to get back to uncompressed frames.
Drawing: Text, logos, alpha-blended PNGs, and anti-aliased text rendering are not the responsibility of Media Foundation itself. This is the job of GDI+ or Direct2D / DirectWrite / WIC.
Color spaces and pixel formats: The format that is easy to draw on and the format the encoder prefers are not the same. This is where people quietly get stuck.

Putting it bluntly in one line: rather than “putting text in with Media Foundation,” the clearest mental model is “use Media Foundation to move frames around, use a drawing API to overlay things, then apply any needed color conversion before encoding.”

3. The Overview Table to Look at First

Approach	Configuration	Best suited for	Watch out for
Get it working correctly first	`Source Reader -> RGB32 -> composite -> NV12 -> Sink Writer`	Batch processing, internal tools, initial implementations	CPU-side copies and conversions tend to add up
Increase speed	`D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer`	Long videos, high resolutions, bulk processing	More D3D11 and DXGI management
Make it a reusable component	Implement as a custom `MFT` and insert it into a topology	Effects shared across multiple apps, integration into an MF pipeline	Implementation, registration, and debugging get harder

The sample in this article is limited to the top row, the “get it working correctly first” configuration.

3.1 Processing Overview

flowchart LR
    A[input.mp4] --> B[IMFSourceReader]
    B --> C[Uncompressed frame<br/>RGB32]
    C --> D[Draw image + HelloWorld with GDI+]
    D --> E[BGRA -> NV12 conversion]
    E --> F[IMFSinkWriter]
    F --> G[output.mp4]

    B --> H[Audio samples]
    H --> I[Copy as is<br/>or re-encode]
    I --> F

The important point here is that the drawing itself is not Media Foundation’s job. Media Foundation is responsible for moving frames in and out; placing the image and text is delegated to a drawing API.

4. How to Split Up the Pipeline

4.1 Receive the Input with `IMFSourceReader`

If the input is a file path, use MFCreateSourceReaderFromURL; if it is video data in memory, a clear approach is to create an IMFByteStream and use MFCreateSourceReaderFromByteStream.

The first thing to decide here is whether to receive frames in a format that is easy to draw on, or in a format aimed at the encoder.

If you want a simple implementation, use RGB32 or ARGB32
If you want encoding efficiency, use a YUV format such as NV12

That said, compositing text and PNGs is overwhelmingly easier to reason about in RGB formats, so receiving frames as RGB32 / ARGB32 is the easy first move.
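To make "easy to reason about" concrete, here is a minimal sketch of addressing one pixel in an RGB32 frame. It assumes 4 bytes per pixel stored in memory as B, G, R, X, and a stride that may be negative when the image is stored bottom-up, which is the same convention the sample's BufferLock deals with. `Bgrx`, `PixelAt`, and `ReadPixel` are hypothetical helpers for illustration, not Media Foundation APIs.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical view of one RGB32 pixel: bytes are stored as B, G, R, X in memory.
struct Bgrx
{
    std::uint8_t b;
    std::uint8_t g;
    std::uint8_t r;
    std::uint8_t x; // unused padding byte in RGB32
};

// Returns a pointer to pixel (x, y), where y = 0 is the top row of the picture.
// scanline0 points at the first byte of the top row; stride is the signed byte
// distance from one row to the next (negative when the image is stored bottom-up).
inline const std::uint8_t* PixelAt(const std::uint8_t* scanline0, std::ptrdiff_t stride,
                                   std::size_t x, std::size_t y)
{
    return scanline0 + static_cast<std::ptrdiff_t>(y) * stride
                     + static_cast<std::ptrdiff_t>(x) * 4;
}

// Reads pixel (x, y) into a Bgrx value.
inline Bgrx ReadPixel(const std::uint8_t* scanline0, std::ptrdiff_t stride,
                      std::size_t x, std::size_t y)
{
    const std::uint8_t* p = PixelAt(scanline0, stride, x, y);
    return Bgrx{ p[0], p[1], p[2], p[3] };
}
```

With a bottom-up buffer, scanline0 is the start of the last row in memory and the stride is negative, so the same loop over y works for both orientations.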

If you enable MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, the Source Reader performs YUV -> RGB32 conversion and deinterlacing for you. This is convenient at the “I just want to pull out frames and work with them” stage, but it tends to get heavy with long or high-resolution videos, so if you need speed in production it is worth revisiting the configuration later.

4.2 Think About Image and Text Compositing in Terms of `GDI+` or `Direct2D / DirectWrite`

You take the buffer out of the IMFSample received from Media Foundation and place a logo image and text on top of it.

This sample prioritizes being easy to paste as a single file, so it uses GDI+ for drawing.

It can load images
It can draw text
It requires relatively little extra setup
It fits easily into a single console-app .cpp

On the other hand, for workloads that process long videos or lots of 4K content, D3D11 + Direct2D + DirectWrite has more headroom. A natural progression is to use GDI+ for the first implementation and move to Direct2D / DirectWrite when you need to optimize for speed.
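For intuition about what a drawing API does when it places a semi-transparent logo or text plate on a frame, the following is a minimal sketch of straight-alpha "source over" blending for one BGRA pixel. It is not how GDI+ or Direct2D are implemented internally. `BlendChannel` and `BlendPixelOver` are hypothetical names, and the rounding choice is an assumption.

```cpp
#include <cstdint>

// Blends one 8-bit channel: result = src * a + dst * (1 - a), with a in [0, 255].
// The +127 term rounds to nearest instead of truncating.
inline std::uint8_t BlendChannel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const unsigned value = src * alpha + dst * (255u - alpha) + 127u;
    return static_cast<std::uint8_t>(value / 255u);
}

// Blends a straight-alpha BGRA source pixel over an opaque BGRA destination pixel in place.
inline void BlendPixelOver(const std::uint8_t src[4], std::uint8_t dst[4])
{
    const std::uint8_t alpha = src[3];
    dst[0] = BlendChannel(src[0], dst[0], alpha); // B
    dst[1] = BlendChannel(src[1], dst[1], alpha); // G
    dst[2] = BlendChannel(src[2], dst[2], alpha); // R
    dst[3] = 0xFF;                                // the video frame stays opaque
}
```

In the sample this work is left entirely to GDI+. Writing it by hand only makes sense for very simple overlays, since text rendering and image scaling are where a drawing API earns its keep.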

4.3 You Cannot Necessarily Write `RGB32` Straight to `H.264`

This is where people get stuck most often.

When writing back to MP4 (H.264), Microsoft’s H.264 encoder usually expects YUV-family input such as I420 / IYUV / NV12 / YUY2 / YV12. In other words, compositing in the easy-to-draw RGB32 / ARGB32 and then handing the result straight to IMFSinkWriter is not guaranteed to just work.

So in practice you need one of two conversions.

Insert a Video Processor MFT to do RGB32 / ARGB32 -> NV12
Implement your own RGB -> NV12 conversion

This sample prioritizes being a single self-contained file, so it takes the latter route with a hand-rolled conversion. In production, inserting a Video Processor MFT, which can handle color-space conversion, resizing, and deinterlacing all together, is also a strong option.
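As a rough illustration of the hand-rolled route, the sketch below converts a single RGB pixel to limited-range Y, U, V using the widely used BT.601 8-bit fixed-point coefficients, the same ones the sample applies to SD-sized frames. `Yuv` and `RgbToYuv601` are hypothetical names. A full converter also has to average each 2x2 block for chroma and switch to BT.709 coefficients for HD content.

```cpp
#include <cstdint>

struct Yuv
{
    std::uint8_t y;
    std::uint8_t u;
    std::uint8_t v;
};

inline std::uint8_t ClampToByte(int value)
{
    if (value < 0) return 0;
    if (value > 255) return 255;
    return static_cast<std::uint8_t>(value);
}

// Converts one full-range RGB pixel to limited-range BT.601 YUV (Y in 16-235, U/V in 16-240).
// Coefficients are scaled by 256, and "+ 128" rounds before the shift. The U/V offset
// of 128 is folded in before shifting (128 << 8) so the shifted value is never negative.
inline Yuv RgbToYuv601(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    const int y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16;
    const int u = (-38 * r - 74 * g + 112 * b + 128 + (128 << 8)) >> 8;
    const int v = (112 * r - 94 * g - 18 * b + 128 + (128 << 8)) >> 8;
    return Yuv{ ClampToByte(y), ClampToByte(u), ClampToByte(v) };
}
```

White maps to Y = 235 and black to Y = 16, both with U = V = 128, which is a quick sanity check for any limited-range converter.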

4.4 Write the Output with `IMFSinkWriter`

For video output, IMFSinkWriter is the easiest to work with.

The idea is simple: you configure two things separately.

The output stream type: the format you want written to the file. Example: MFVideoFormat_H264
The input stream type: the format the app hands to the Sink Writer. Example: MFVideoFormat_NV12

So from the Sink Writer’s perspective, the relationship is as follows.

The app hands over uncompressed NV12 frames
The Sink Writer encodes them to H.264 and writes them into the MP4

4.5 Treating Audio Separately at First Keeps Things Tidy

Very often you only want to put a logo or text into the video and do not want to touch the audio at all.

In practice, the following split is easy to work with.

Video stream: Source Reader -> composite -> Sink Writer
Audio stream: remux it while still compressed

However, since this sample focuses on burning an image and text into the frames, the output is a video-only MP4. A version that preserves audio is easier to follow if you add it later as an extension.

5. Assumptions and Usage for This Sample

The assumptions for this code are as follows.

Windows 10 / 11
A Visual Studio 2022 C++ console application
x64 build
This .cpp file does not use precompiled headers
The input video’s width and height are even
The input is an ordinary MP4 video file
The output is a video-only MP4
The image is in a format GDI+ can read, such as PNG / JPEG / BMP / GIF

NV12 is a 4:2:0 format in which each 2x2 block of pixels shares a single U/V pair, so the width and height must be even. For that reason, this sample explicitly raises an error when that condition is not met.
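To see why even dimensions matter, here is a minimal sketch of the NV12 memory layout for a tightly packed buffer. It assumes the row pitch equals the width, which real buffers with padding may not satisfy. `Nv12Layout` and `ComputeNv12Layout` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct Nv12Layout
{
    std::size_t ySize;    // bytes in the Y plane (one byte per pixel)
    std::size_t uvOffset; // byte offset where the interleaved UV plane starts
    std::size_t uvSize;   // bytes in the UV plane (one U/V pair per 2x2 block)
    std::size_t total;    // total buffer size
};

// Computes plane sizes for a tightly packed NV12 frame.
// Throws when the dimensions cannot be split into whole 2x2 blocks.
inline Nv12Layout ComputeNv12Layout(std::uint32_t width, std::uint32_t height)
{
    if (width == 0 || height == 0 || (width % 2) != 0 || (height % 2) != 0)
    {
        throw std::invalid_argument("NV12 needs non-zero, even width and height.");
    }

    Nv12Layout layout{};
    layout.ySize = static_cast<std::size_t>(width) * height;
    layout.uvOffset = layout.ySize;
    layout.uvSize = static_cast<std::size_t>(width / 2) * (height / 2) * 2;
    layout.total = layout.ySize + layout.uvSize;
    return layout;
}
```

For a 1920x1080 frame this gives a 2,073,600-byte Y plane followed by a 1,036,800-byte UV plane, which matches the width * height * 3 / 2 buffer size used in the sample.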

5.1 Usage

Create a Console App in Visual Studio
Paste this .cpp in wholesale
Set that .cpp file’s precompiled header option to “Not Using”
Build for x64
Run it as follows

OverlayMp4.exe input.mp4 overlay.png output.mp4

input.mp4: the source video
overlay.png: the image to overlay
output.mp4: the output destination

The text string is hard-coded to HelloWorld in kOverlayText at the top of the code. Position and size can also be changed by adjusting the constants in the code.
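The sizing rule those constants feed into can be sketched on its own. The function below mirrors the sample's logic: shrink the image so that it fits within a fraction of the frame while keeping its aspect ratio, and never enlarge it. `Size2F` and `FitOverlay` are hypothetical names, and the 0.20 defaults correspond to kImageMaxWidthRatio and kImageMaxHeightRatio.

```cpp
#include <algorithm>

struct Size2F
{
    float width;
    float height;
};

// Returns the drawn size of an overlay image. Assumes srcW and srcH are positive.
inline Size2F FitOverlay(float srcW, float srcH, float frameW, float frameH,
                         float maxWidthRatio = 0.20f, float maxHeightRatio = 0.20f)
{
    const float maxW = frameW * maxWidthRatio;
    const float maxH = frameH * maxHeightRatio;

    // Use the tighter of the two limits, and cap at 1.0 so small images are not upscaled.
    const float scale = std::min(1.0f, std::min(maxW / srcW, maxH / srcH));
    return Size2F{ srcW * scale, srcH * scale };
}
```

A 1000x500 logo on a 1920x1080 frame is limited by the width and is drawn at roughly 384x192, while a 100x50 logo is drawn at its original size.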

6. Single-File Code You Can Paste Straight into a `.cpp`

#define NOMINMAX
#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>
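#include <algorithm>

// Some Windows SDK versions' GDI+ headers call unqualified min/max, which NOMINMAX
// removes. Making the std versions visible in namespace Gdiplus keeps them compiling.
namespace Gdiplus
{
    using std::min;
    using std::max;
}
#include <gdiplus.h>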
#include <wrl/client.h>

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")
#pragma comment(lib, "mfuuid.lib")
#pragma comment(lib, "mf.lib")
#pragma comment(lib, "gdiplus.lib")

using Microsoft::WRL::ComPtr;

namespace
{
    const wchar_t* kOverlayText = L"HelloWorld";
    const float kMarginRatio = 0.03f;
    const float kImageMaxWidthRatio = 0.20f;
    const float kImageMaxHeightRatio = 0.20f;
    const float kMinFontPx = 24.0f;

    std::string HrToHex(HRESULT hr)
    {
        char buf[32]{};
        std::snprintf(buf, sizeof(buf), "0x%08X", static_cast<unsigned int>(hr));
        return std::string(buf);
    }

    void ThrowIfFailed(HRESULT hr, const char* message)
    {
        if (FAILED(hr))
        {
            throw std::runtime_error(std::string(message) + " failed. HRESULT=" + HrToHex(hr));
        }
    }

    void ThrowIfGdiplusError(Gdiplus::Status status, const char* message)
    {
        if (status != Gdiplus::Ok)
        {
            char buf[128]{};
            std::snprintf(buf, sizeof(buf), "%s failed. GDI+ status=%d", message, static_cast<int>(status));
            throw std::runtime_error(buf);
        }
    }

    BYTE ClampToByte(int value)
    {
        if (value < 0) return 0;
        if (value > 255) return 255;
        return static_cast<BYTE>(value);
    }

    class ScopedGdiplus
    {
    public:
        ScopedGdiplus()
        {
            Gdiplus::GdiplusStartupInput input;
            ThrowIfGdiplusError(Gdiplus::GdiplusStartup(&token_, &input, nullptr), "GdiplusStartup");
        }

        ~ScopedGdiplus()
        {
            if (token_ != 0)
            {
                Gdiplus::GdiplusShutdown(token_);
            }
        }

    private:
        ULONG_PTR token_ = 0;
    };

    class ScopedMf
    {
    public:
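        ScopedMf()
        {
            ThrowIfFailed(CoInitializeEx(nullptr, COINIT_MULTITHREADED), "CoInitializeEx");
            comInitialized_ = true;

            // A destructor does not run when its constructor throws, so undo
            // CoInitializeEx here if MFStartup fails.
            const HRESULT hr = MFStartup(MF_VERSION);
            if (FAILED(hr))
            {
                CoUninitialize();
                comInitialized_ = false;
                ThrowIfFailed(hr, "MFStartup");
            }
            mfStarted_ = true;
        }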

        ~ScopedMf()
        {
            if (mfStarted_)
            {
                MFShutdown();
            }

            if (comInitialized_)
            {
                CoUninitialize();
            }
        }

    private:
        bool comInitialized_ = false;
        bool mfStarted_ = false;
    };

    class BufferLock
    {
    public:
        explicit BufferLock(IMFMediaBuffer* buffer)
            : buffer_(buffer)
        {
            if (!buffer_)
            {
                throw std::runtime_error("BufferLock received a null buffer.");
            }

            buffer_.As(&buffer2D_);
        }

        HRESULT LockBuffer(LONG defaultStride, DWORD heightInPixels, BYTE** scanline0, LONG* actualStride)
        {
            if (scanline0 == nullptr || actualStride == nullptr)
            {
                return E_POINTER;
            }

            HRESULT hr = S_OK;

            if (buffer2D_)
            {
                hr = buffer2D_->Lock2D(scanline0, actualStride);
            }
            else
            {
                BYTE* data = nullptr;
                hr = buffer_->Lock(&data, nullptr, nullptr);
                if (SUCCEEDED(hr))
                {
                    *actualStride = defaultStride;
                    if (defaultStride < 0)
                    {
                        *scanline0 = data + (static_cast<LONG>(heightInPixels) - 1) * std::abs(defaultStride);
                    }
                    else
                    {
                        *scanline0 = data;
                    }
                }
            }

            locked_ = SUCCEEDED(hr);
            return hr;
        }

        ~BufferLock()
        {
            if (!locked_)
            {
                return;
            }

            if (buffer2D_)
            {
                buffer2D_->Unlock2D();
            }
            else
            {
                buffer_->Unlock();
            }
        }

    private:
        ComPtr<IMFMediaBuffer> buffer_;
        ComPtr<IMF2DBuffer> buffer2D_;
        bool locked_ = false;
    };

    struct VideoFormatInfo
    {
        UINT32 width = 0;
        UINT32 height = 0;
        UINT32 fpsNum = 0;
        UINT32 fpsDen = 0;
        UINT32 parNum = 1;
        UINT32 parDen = 1;
        LONG sourceStride = 0;
        LONGLONG defaultFrameDuration = 0;
        UINT32 bitrate = 0;
    };

    LONG GetDefaultStride(IMFMediaType* type)
    {
        LONG stride = 0;

        HRESULT hr = type->GetUINT32(MF_MT_DEFAULT_STRIDE, reinterpret_cast<UINT32*>(&stride));
        if (SUCCEEDED(hr))
        {
            return stride;
        }

        GUID subtype = GUID_NULL;
        UINT32 width = 0;
        UINT32 height = 0;

        ThrowIfFailed(type->GetGUID(MF_MT_SUBTYPE, &subtype), "GetGUID(MF_MT_SUBTYPE)");
        ThrowIfFailed(MFGetAttributeSize(type, MF_MT_FRAME_SIZE, &width, &height), "MFGetAttributeSize(MF_MT_FRAME_SIZE)");
        ThrowIfFailed(MFGetStrideForBitmapInfoHeader(subtype.Data1, width, &stride), "MFGetStrideForBitmapInfoHeader");
        ThrowIfFailed(type->SetUINT32(MF_MT_DEFAULT_STRIDE, static_cast<UINT32>(stride)), "SetUINT32(MF_MT_DEFAULT_STRIDE)");

        return stride;
    }

    UINT32 ChooseBitrate(IMFMediaType* nativeType, UINT32 width, UINT32 height, UINT32 fpsNum, UINT32 fpsDen)
    {
        UINT32 srcBitrate = 0;
        if (SUCCEEDED(nativeType->GetUINT32(MF_MT_AVG_BITRATE, &srcBitrate)) && srcBitrate > 0)
        {
            return srcBitrate;
        }

        const double fps = static_cast<double>(fpsNum) / static_cast<double>(fpsDen);
        double estimated = static_cast<double>(width) * static_cast<double>(height) * fps * 0.07;

        if (estimated < 1500000.0)
        {
            estimated = 1500000.0;
        }

        if (estimated > 25000000.0)
        {
            estimated = 25000000.0;
        }

        return static_cast<UINT32>(estimated);
    }

    VideoFormatInfo ConfigureSourceReader(IMFSourceReader* reader)
    {
        ThrowIfFailed(reader->SetStreamSelection(MF_SOURCE_READER_ALL_STREAMS, FALSE), "SetStreamSelection(all,false)");
        ThrowIfFailed(reader->SetStreamSelection(MF_SOURCE_READER_FIRST_VIDEO_STREAM, TRUE), "SetStreamSelection(video,true)");

        ComPtr<IMFMediaType> nativeType;
        ThrowIfFailed(reader->GetNativeMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &nativeType), "GetNativeMediaType(video)");

        ComPtr<IMFMediaType> requestedType;
        ThrowIfFailed(MFCreateMediaType(&requestedType), "MFCreateMediaType(video requested)");
        ThrowIfFailed(requestedType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(video requested major)");
        ThrowIfFailed(requestedType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32), "SetGUID(video requested subtype RGB32)");
        ThrowIfFailed(reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, requestedType.Get()), "SetCurrentMediaType(video RGB32)");

        ComPtr<IMFMediaType> currentType;
        ThrowIfFailed(reader->GetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, &currentType), "GetCurrentMediaType(video)");

        VideoFormatInfo info;
        ThrowIfFailed(MFGetAttributeSize(currentType.Get(), MF_MT_FRAME_SIZE, &info.width, &info.height), "Get video frame size");

        HRESULT hr = MFGetAttributeRatio(currentType.Get(), MF_MT_FRAME_RATE, &info.fpsNum, &info.fpsDen);
        if (FAILED(hr))
        {
            ThrowIfFailed(MFGetAttributeRatio(nativeType.Get(), MF_MT_FRAME_RATE, &info.fpsNum, &info.fpsDen), "Get video frame rate");
        }

        if (info.fpsNum == 0 || info.fpsDen == 0)
        {
            throw std::runtime_error("Video frame rate is zero.");
        }

        hr = MFGetAttributeRatio(currentType.Get(), MF_MT_PIXEL_ASPECT_RATIO, &info.parNum, &info.parDen);
        if (FAILED(hr) || info.parNum == 0 || info.parDen == 0)
        {
            info.parNum = 1;
            info.parDen = 1;
        }

        info.sourceStride = GetDefaultStride(currentType.Get());
        info.defaultFrameDuration = (10000000LL * info.fpsDen) / info.fpsNum;
        if (info.defaultFrameDuration <= 0)
        {
            throw std::runtime_error("Calculated frame duration is invalid.");
        }

        info.bitrate = ChooseBitrate(nativeType.Get(), info.width, info.height, info.fpsNum, info.fpsDen);
        return info;
    }

    ComPtr<IMFSinkWriter> CreateSinkWriter(const std::wstring& outputPath, const VideoFormatInfo& videoInfo, DWORD* streamIndex)
    {
        if (streamIndex == nullptr)
        {
            throw std::runtime_error("streamIndex is null.");
        }

        ComPtr<IMFAttributes> attributes;
        ThrowIfFailed(MFCreateAttributes(&attributes, 1), "MFCreateAttributes(sink)");
        ThrowIfFailed(attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE), "SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS)");

        ComPtr<IMFSinkWriter> writer;
        ThrowIfFailed(MFCreateSinkWriterFromURL(outputPath.c_str(), nullptr, attributes.Get(), &writer), "MFCreateSinkWriterFromURL");

        ComPtr<IMFMediaType> outputType;
        ThrowIfFailed(MFCreateMediaType(&outputType), "MFCreateMediaType(video output)");
        ThrowIfFailed(outputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(output major)");
        ThrowIfFailed(outputType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264), "SetGUID(output subtype H264)");
        ThrowIfFailed(outputType->SetUINT32(MF_MT_AVG_BITRATE, videoInfo.bitrate), "SetUINT32(output bitrate)");
        ThrowIfFailed(outputType->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive), "SetUINT32(output interlace)");
        ThrowIfFailed(MFSetAttributeSize(outputType.Get(), MF_MT_FRAME_SIZE, videoInfo.width, videoInfo.height), "MFSetAttributeSize(output frame size)");
        ThrowIfFailed(MFSetAttributeRatio(outputType.Get(), MF_MT_FRAME_RATE, videoInfo.fpsNum, videoInfo.fpsDen), "MFSetAttributeRatio(output fps)");
        ThrowIfFailed(MFSetAttributeRatio(outputType.Get(), MF_MT_PIXEL_ASPECT_RATIO, videoInfo.parNum, videoInfo.parDen), "MFSetAttributeRatio(output PAR)");
        ThrowIfFailed(writer->AddStream(outputType.Get(), streamIndex), "AddStream(video)");

        ComPtr<IMFMediaType> inputType;
        ThrowIfFailed(MFCreateMediaType(&inputType), "MFCreateMediaType(video input)");
        ThrowIfFailed(inputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(input major)");
        ThrowIfFailed(inputType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_NV12), "SetGUID(input subtype NV12)");
        ThrowIfFailed(inputType->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive), "SetUINT32(input interlace)");
        ThrowIfFailed(MFSetAttributeSize(inputType.Get(), MF_MT_FRAME_SIZE, videoInfo.width, videoInfo.height), "MFSetAttributeSize(input frame size)");
        ThrowIfFailed(MFSetAttributeRatio(inputType.Get(), MF_MT_FRAME_RATE, videoInfo.fpsNum, videoInfo.fpsDen), "MFSetAttributeRatio(input fps)");
        ThrowIfFailed(MFSetAttributeRatio(inputType.Get(), MF_MT_PIXEL_ASPECT_RATIO, videoInfo.parNum, videoInfo.parDen), "MFSetAttributeRatio(input PAR)");
        ThrowIfFailed(writer->SetInputMediaType(*streamIndex, inputType.Get(), nullptr), "SetInputMediaType(video)");

        ThrowIfFailed(writer->BeginWriting(), "BeginWriting");
        return writer;
    }

    void CopySampleToTopDownBgra(IMFSample* sample, const VideoFormatInfo& videoInfo, std::vector<BYTE>& bgra)
    {
        ComPtr<IMFMediaBuffer> buffer;
        ThrowIfFailed(sample->ConvertToContiguousBuffer(&buffer), "ConvertToContiguousBuffer");

        BufferLock lock(buffer.Get());

        BYTE* scanline0 = nullptr;
        LONG actualStride = 0;
        ThrowIfFailed(lock.LockBuffer(videoInfo.sourceStride, videoInfo.height, &scanline0, &actualStride), "LockBuffer");

        const size_t dstStride = static_cast<size_t>(videoInfo.width) * 4;
        bgra.resize(dstStride * videoInfo.height);

        for (UINT32 y = 0; y < videoInfo.height; ++y)
        {
            const BYTE* srcRow = scanline0 + static_cast<LONG>(y) * actualStride;
            BYTE* dstRow = bgra.data() + static_cast<size_t>(y) * dstStride;
            std::memcpy(dstRow, srcRow, dstStride);

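            // RGB32 leaves the fourth byte undefined, so force it to opaque before
            // GDI+ interprets the buffer as a 32-bit format with alpha.
            for (UINT32 x = 0; x < videoInfo.width; ++x)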
            {
                dstRow[static_cast<size_t>(x) * 4 + 3] = 0xFF;
            }
        }
    }

    void DrawOverlay(std::vector<BYTE>& bgra, UINT32 width, UINT32 height, Gdiplus::Image& overlayImage)
    {
        const INT stride = static_cast<INT>(width * 4);

        Gdiplus::Bitmap frameBitmap(
            static_cast<INT>(width),
            static_cast<INT>(height),
            stride,
            PixelFormat32bppPARGB,
            bgra.data());
        ThrowIfGdiplusError(frameBitmap.GetLastStatus(), "Create frame bitmap");

        Gdiplus::Graphics graphics(&frameBitmap);
        ThrowIfGdiplusError(graphics.GetLastStatus(), "Create graphics");

        graphics.SetCompositingMode(Gdiplus::CompositingModeSourceOver);
        graphics.SetCompositingQuality(Gdiplus::CompositingQualityHighQuality);
        graphics.SetInterpolationMode(Gdiplus::InterpolationModeHighQualityBicubic);
        graphics.SetSmoothingMode(Gdiplus::SmoothingModeAntiAlias);
        graphics.SetTextRenderingHint(Gdiplus::TextRenderingHintAntiAliasGridFit);

        const Gdiplus::REAL margin = std::max<Gdiplus::REAL>(16.0f, static_cast<Gdiplus::REAL>(height) * kMarginRatio);
        const Gdiplus::REAL maxImageW = static_cast<Gdiplus::REAL>(width) * kImageMaxWidthRatio;
        const Gdiplus::REAL maxImageH = static_cast<Gdiplus::REAL>(height) * kImageMaxHeightRatio;

        const Gdiplus::REAL srcW = static_cast<Gdiplus::REAL>(overlayImage.GetWidth());
        const Gdiplus::REAL srcH = static_cast<Gdiplus::REAL>(overlayImage.GetHeight());
        if (srcW <= 0.0f || srcH <= 0.0f)
        {
            throw std::runtime_error("Overlay image has invalid size.");
        }

        const Gdiplus::REAL imageScale =
            std::min<Gdiplus::REAL>(1.0f, std::min(maxImageW / srcW, maxImageH / srcH));

        const Gdiplus::REAL drawW = srcW * imageScale;
        const Gdiplus::REAL drawH = srcH * imageScale;

        Gdiplus::RectF imageRect(margin, margin, drawW, drawH);
        Gdiplus::SolidBrush imagePlate(Gdiplus::Color(96, 0, 0, 0));
        graphics.FillRectangle(
            &imagePlate,
            imageRect.X - 8.0f,
            imageRect.Y - 8.0f,
            imageRect.Width + 16.0f,
            imageRect.Height + 16.0f);

        graphics.DrawImage(&overlayImage, imageRect);

        const Gdiplus::REAL fontPx =
            std::max<Gdiplus::REAL>(kMinFontPx, static_cast<Gdiplus::REAL>(height) * 0.06f);

        Gdiplus::Font font(L"Segoe UI", fontPx, Gdiplus::FontStyleBold, Gdiplus::UnitPixel);
        ThrowIfGdiplusError(font.GetLastStatus(), "Create font");

        Gdiplus::StringFormat stringFormat;
        stringFormat.SetAlignment(Gdiplus::StringAlignmentNear);
        stringFormat.SetLineAlignment(Gdiplus::StringAlignmentNear);

        Gdiplus::RectF measureLayout(
            margin,
            static_cast<Gdiplus::REAL>(height) - margin - fontPx * 2.0f,
            static_cast<Gdiplus::REAL>(width) - margin * 2.0f,
            fontPx * 2.0f);

        Gdiplus::RectF measured;
        graphics.MeasureString(kOverlayText, -1, &font, measureLayout, &stringFormat, &measured);

        Gdiplus::RectF textBg(
            measured.X - 12.0f,
            measured.Y - 8.0f,
            measured.Width + 24.0f,
            measured.Height + 16.0f);

        Gdiplus::SolidBrush textPlate(Gdiplus::Color(128, 0, 0, 0));
        graphics.FillRectangle(&textPlate, textBg);

        Gdiplus::SolidBrush shadowBrush(Gdiplus::Color(220, 0, 0, 0));
        Gdiplus::RectF shadowLayout = measureLayout;
        shadowLayout.X += 2.0f;
        shadowLayout.Y += 2.0f;
        graphics.DrawString(kOverlayText, -1, &font, shadowLayout, &stringFormat, &shadowBrush);

        Gdiplus::SolidBrush textBrush(Gdiplus::Color(235, 255, 255, 255));
        graphics.DrawString(kOverlayText, -1, &font, measureLayout, &stringFormat, &textBrush);
    }

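    // Converts top-down, tightly packed BGRA to NV12: a full-size Y plane followed by an
    // interleaved half-resolution UV plane. Coefficients are limited-range, scaled by 256.
    void BgraToNv12(const BYTE* bgra, UINT32 width, UINT32 height, BYTE* nv12)
    {
        // Heuristic: treat frames larger than 1024x576 as HD (BT.709), otherwise SD (BT.601).
        const bool useBt709 = (width > 1024 || height > 576);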

        const int yR = useBt709 ? 47 : 66;
        const int yG = useBt709 ? 157 : 129;
        const int yB = useBt709 ? 16 : 25;

        const int uR = useBt709 ? -26 : -38;
        const int uG = useBt709 ? -87 : -74;
        const int uB = 112;

        const int vR = 112;
        const int vG = useBt709 ? -102 : -94;
        const int vB = useBt709 ? -10 : -18;

        BYTE* yPlane = nv12;
        BYTE* uvPlane = nv12 + static_cast<size_t>(width) * height;

        const size_t srcStride = static_cast<size_t>(width) * 4;

        for (UINT32 y = 0; y < height; ++y)
        {
            const BYTE* srcRow = bgra + static_cast<size_t>(y) * srcStride;
            BYTE* dstY = yPlane + static_cast<size_t>(y) * width;

            for (UINT32 x = 0; x < width; ++x)
            {
                const BYTE b = srcRow[x * 4 + 0];
                const BYTE g = srcRow[x * 4 + 1];
                const BYTE r = srcRow[x * 4 + 2];

                const int Y = ((yR * r + yG * g + yB * b + 128) >> 8) + 16;
                dstY[x] = ClampToByte(Y);
            }
        }

        for (UINT32 y = 0; y < height; y += 2)
        {
            const BYTE* row0 = bgra + static_cast<size_t>(y) * srcStride;
            const BYTE* row1 = bgra + static_cast<size_t>(y + 1) * srcStride;
            BYTE* dstUV = uvPlane + static_cast<size_t>(y / 2) * width;

            for (UINT32 x = 0; x < width; x += 2)
            {
                int b = 0;
                int g = 0;
                int r = 0;

                for (UINT32 dy = 0; dy < 2; ++dy)
                {
                    const BYTE* row = (dy == 0) ? row0 : row1;
                    for (UINT32 dx = 0; dx < 2; ++dx)
                    {
                        const UINT32 ix = x + dx;
                        b += row[ix * 4 + 0];
                        g += row[ix * 4 + 1];
                        r += row[ix * 4 + 2];
                    }
                }

                b = (b + 2) / 4;
                g = (g + 2) / 4;
                r = (r + 2) / 4;

                const int U = ((uR * r + uG * g + uB * b + 128) >> 8) + 128;
                const int V = ((vR * r + vG * g + vB * b + 128) >> 8) + 128;

                dstUV[x + 0] = ClampToByte(U);
                dstUV[x + 1] = ClampToByte(V);
            }
        }
    }

    ComPtr<IMFSample> CreateNv12Sample(
        const std::vector<BYTE>& bgra,
        const VideoFormatInfo& videoInfo,
        LONGLONG sampleTime,
        LONGLONG sampleDuration)
    {
        const DWORD bufferSize =
            static_cast<DWORD>(videoInfo.width * videoInfo.height * 3 / 2);

        ComPtr<IMFMediaBuffer> buffer;
        ThrowIfFailed(MFCreateMemoryBuffer(bufferSize, &buffer), "MFCreateMemoryBuffer");

        BYTE* dst = nullptr;
        DWORD maxLength = 0;
        DWORD currentLength = 0;
        ThrowIfFailed(buffer->Lock(&dst, &maxLength, &currentLength), "Lock(NV12 buffer)");

        try
        {
            BgraToNv12(bgra.data(), videoInfo.width, videoInfo.height, dst);
        }
        catch (...)
        {
            buffer->Unlock();
            throw;
        }

        ThrowIfFailed(buffer->Unlock(), "Unlock(NV12 buffer)");
        ThrowIfFailed(buffer->SetCurrentLength(bufferSize), "SetCurrentLength(NV12 buffer)");

        ComPtr<IMFSample> sample;
        ThrowIfFailed(MFCreateSample(&sample), "MFCreateSample");
        ThrowIfFailed(sample->AddBuffer(buffer.Get()), "AddBuffer(output sample)");
        ThrowIfFailed(sample->SetSampleTime(sampleTime), "SetSampleTime");
        ThrowIfFailed(sample->SetSampleDuration(sampleDuration), "SetSampleDuration");

        return sample;
    }
}

int wmain(int argc, wchar_t* argv[])
{
    if (argc != 4)
    {
        std::wcerr << L"Usage: OverlayMp4.exe <input.mp4> <overlayImage.png> <output.mp4>" << std::endl;
        return 1;
    }

    const std::wstring inputPath = argv[1];
    const std::wstring imagePath = argv[2];
    const std::wstring outputPath = argv[3];

    try
    {
        if (_wcsicmp(inputPath.c_str(), outputPath.c_str()) == 0)
        {
            throw std::runtime_error("Input and output paths must be different.");
        }

        ScopedMf mf;
        ScopedGdiplus gdiplus;

        ComPtr<IMFAttributes> readerAttributes;
        ThrowIfFailed(MFCreateAttributes(&readerAttributes, 1), "MFCreateAttributes(reader)");
        ThrowIfFailed(
            readerAttributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE),
            "SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING)");

        ComPtr<IMFSourceReader> reader;
        ThrowIfFailed(
            MFCreateSourceReaderFromURL(inputPath.c_str(), readerAttributes.Get(), &reader),
            "MFCreateSourceReaderFromURL");

        VideoFormatInfo videoInfo = ConfigureSourceReader(reader.Get());

        if ((videoInfo.width % 2) != 0 || (videoInfo.height % 2) != 0)
        {
            throw std::runtime_error(
                "This sample requires even video width and height because NV12 is 4:2:0.");
        }

        Gdiplus::Image overlayImage(imagePath.c_str());
        ThrowIfGdiplusError(overlayImage.GetLastStatus(), "Load overlay image");

        DWORD videoStreamIndex = 0;
        ComPtr<IMFSinkWriter> writer =
            CreateSinkWriter(outputPath, videoInfo, &videoStreamIndex);

        std::vector<BYTE> bgra;
        LONGLONG firstTimestamp = -1;
        unsigned long long frameCount = 0;

        while (true)
        {
            DWORD flags = 0;
            LONGLONG timestamp = 0;
            ComPtr<IMFSample> inputSample;

            ThrowIfFailed(
                reader->ReadSample(
                    MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                    0,
                    nullptr,
                    &flags,
                    &timestamp,
                    &inputSample),
                "ReadSample(video)");

            if ((flags & MF_SOURCE_READERF_CURRENTMEDIATYPECHANGED) != 0)
            {
                throw std::runtime_error("Dynamic video format change is not supported in this sample.");
            }

            if ((flags & MF_SOURCE_READERF_NATIVEMEDIATYPECHANGED) != 0)
            {
                throw std::runtime_error("Native video format change is not supported in this sample.");
            }

            // A stream tick marks a gap in the stream; forward it so the writer sees the gap too.
            if ((flags & MF_SOURCE_READERF_STREAMTICK) != 0)
            {
                if (firstTimestamp < 0)
                {
                    firstTimestamp = timestamp;
                }

                ThrowIfFailed(
                    writer->SendStreamTick(videoStreamIndex, timestamp - firstTimestamp),
                    "SendStreamTick");
            }

            if (inputSample)
            {
                if (firstTimestamp < 0)
                {
                    firstTimestamp = timestamp;
                }

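                // Prefer the duration carried by the input sample; fall back to
                // the fps-derived default only when it is missing or invalid.
                LONGLONG duration = 0;
                if (FAILED(inputSample->GetSampleDuration(&duration)) || duration <= 0)
                {
                    duration = videoInfo.defaultFrameDuration;
                }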

                CopySampleToTopDownBgra(inputSample.Get(), videoInfo, bgra);
                DrawOverlay(bgra, videoInfo.width, videoInfo.height, overlayImage);

                ComPtr<IMFSample> outputSample =
                    CreateNv12Sample(bgra, videoInfo, timestamp - firstTimestamp, duration);

                ThrowIfFailed(
                    writer->WriteSample(videoStreamIndex, outputSample.Get()),
                    "WriteSample(video)");

                ++frameCount;
            }

            if ((flags & MF_SOURCE_READERF_ENDOFSTREAM) != 0)
            {
                break;
            }
        }

        ThrowIfFailed(writer->Finalize(), "Finalize");

        std::wcout
            << L"Done. frames=" << frameCount
            << L", output=" << outputPath
            << std::endl;

        return 0;
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
        return 1;
    }
}

7. Points to Keep in Mind When Reading This Implementation

7.1 The Format That Is Easy to Draw on and the Format the Encoder Accepts Are Different

This sample uses the following flow.

Source Reader output: RGB32
Drawing: GDI+
Sink Writer input: NV12

The reason is simple: RGB formats are easy to work with when overlaying text and PNGs, and NV12 is easy to hand off to H.264 encoding.

When reading the implementation, it becomes easier to follow if you split it into a “drawing stage” and a “prepare-for-encoding stage.”
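To make the "prepare-for-encoding stage" concrete, here is a minimal, platform-independent sketch of a BGRA to NV12 conversion. It is not the article's `CreateNv12Sample`; the function name `BgraToNv12`, the BT.601 limited-range integer coefficients, and the 2x2 averaging for chroma are assumptions chosen for illustration, and the input is assumed to be tightly packed, top-down, with even width and height.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate result to the 0..255 byte range.
static std::uint8_t ClampToByte(int value)
{
    return static_cast<std::uint8_t>(std::min(255, std::max(0, value)));
}

// Convert a tightly packed, top-down BGRA image (4 bytes per pixel) to NV12.
// Assumptions: width and height are even; BT.601 limited-range coefficients.
// NV12 layout: a full-resolution Y plane followed by one half-resolution
// plane of interleaved U,V byte pairs.
std::vector<std::uint8_t> BgraToNv12(const std::vector<std::uint8_t>& bgra, int width, int height)
{
    const std::size_t pixelCount = static_cast<std::size_t>(width) * height;
    std::vector<std::uint8_t> nv12(pixelCount * 3 / 2);
    std::uint8_t* yPlane = nv12.data();
    std::uint8_t* uvPlane = nv12.data() + pixelCount;

    // Luma: one Y value per pixel.
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            const std::uint8_t* p = &bgra[(static_cast<std::size_t>(y) * width + x) * 4];
            const int b = p[0], g = p[1], r = p[2];
            yPlane[static_cast<std::size_t>(y) * width + x] =
                ClampToByte(((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);
        }
    }

    // Chroma: one U,V pair per 2x2 block, computed from the block average.
    for (int y = 0; y < height; y += 2)
    {
        for (int x = 0; x < width; x += 2)
        {
            int sumB = 0, sumG = 0, sumR = 0;
            for (int dy = 0; dy < 2; ++dy)
            {
                for (int dx = 0; dx < 2; ++dx)
                {
                    const std::uint8_t* p =
                        &bgra[(static_cast<std::size_t>(y + dy) * width + (x + dx)) * 4];
                    sumB += p[0];
                    sumG += p[1];
                    sumR += p[2];
                }
            }
            const int b = (sumB + 2) / 4, g = (sumG + 2) / 4, r = (sumR + 2) / 4;
            std::uint8_t* uv = uvPlane + static_cast<std::size_t>(y / 2) * width + x;
            uv[0] = ClampToByte(((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128); // U
            uv[1] = ClampToByte(((112 * r - 94 * g - 18 * b + 128) >> 8) + 128);  // V
        }
    }
    return nv12;
}
```

Keeping the conversion as a pure function over byte vectors makes it easy to unit-test separately from Media Foundation, and easy to swap out later for a Video Processor MFT or a GPU path.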

7.2 Stride and Vertical Orientation Are Normalized Before Drawing

Video frames are not necessarily laid out in memory the way they appear on screen.

The stride may not match width * 4
The image may be stored upside down
IMF2DBuffer and IMFMediaBuffer are handled slightly differently

For that reason, this code first normalizes into a top-down BGRA buffer before drawing. Getting this sorted out up front lets the drawing code stay quite straightforward.
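The normalization step can be sketched without any Media Foundation types. The helper below is hypothetical (it is not the article's `CopySampleToTopDownBgra`) and assumes the caller already holds a pointer to the visually top row plus a signed pitch, which is the shape of data that IMF2DBuffer::Lock2D hands back. A negative pitch means the image is stored bottom-up in memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy a 32-bit-per-pixel frame into a tightly packed, top-down buffer.
//   topRow: first byte of the row that appears at the top of the picture
//   pitch : bytes from one visual row to the next; may be larger than
//           width * 4 (padding) or negative (bottom-up storage)
void CopyToTopDownPacked(
    const std::uint8_t* topRow,
    std::ptrdiff_t pitch,
    int width,
    int height,
    std::vector<std::uint8_t>& out)
{
    const std::size_t rowBytes = static_cast<std::size_t>(width) * 4;
    out.resize(rowBytes * static_cast<std::size_t>(height));

    for (int y = 0; y < height; ++y)
    {
        // Compute each source row from the top row so that no pointer
        // is ever advanced past the ends of the source buffer.
        const std::uint8_t* src = topRow + pitch * static_cast<std::ptrdiff_t>(y);
        std::memcpy(out.data() + rowBytes * static_cast<std::size_t>(y), src, rowBytes);
    }
}
```

When only IMFMediaBuffer::Lock is available, the returned pointer is the start of the memory block, so for a bottom-up image the caller would first have to compute the address of the last row in memory and pass it as `topRow` together with a negative pitch.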

7.3 With `ReadSample`, Check the Flags and the `sample`, Not Just the `HRESULT`

ReadSample can return S_OK with sample == nullptr. Typical cases are

MF_SOURCE_READERF_STREAMTICK
MF_SOURCE_READERF_ENDOFSTREAM
other stream events

So the loop needs to look at all three together: the HRESULT, the flags, and the inputSample. If you ignore STREAMTICK, the gap it signals never reaches the Sink Writer, and if you ignore ENDOFSTREAM, the loop has no exit condition.
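One way to keep the three inputs together is to reduce them to a small decision record before acting on them. The sketch below is an illustration only: `ClassifyRead`, `ReadOutcome`, and the `kFlag*` bit values are hypothetical stand-ins, and real code would test the MF_SOURCE_READERF_* constants from mfreadwrite.h instead.

```cpp
#include <cstdint>

// Stand-in flag bits for this sketch only. Real code tests the
// MF_SOURCE_READERF_* constants from mfreadwrite.h instead.
constexpr std::uint32_t kFlagEndOfStream = 1u << 0;
constexpr std::uint32_t kFlagStreamTick = 1u << 1;
constexpr std::uint32_t kFlagFormatChanged = 1u << 2;

struct ReadOutcome
{
    bool fail;       // abort: the call failed or the format changed mid-stream
    bool sendTick;   // forward a gap notification to the writer
    bool writeFrame; // a sample is present and should be processed
    bool finished;   // leave the loop after handling this iteration
};

// Combine the call result, the flags, and the presence of a sample
// into one decision, mirroring the order of checks in the main loop.
ReadOutcome ClassifyRead(bool callSucceeded, std::uint32_t flags, bool hasSample)
{
    ReadOutcome outcome{};
    if (!callSucceeded || (flags & kFlagFormatChanged) != 0)
    {
        outcome.fail = true;
        return outcome;
    }
    outcome.sendTick = (flags & kFlagStreamTick) != 0;
    outcome.writeFrame = hasSample;
    outcome.finished = (flags & kFlagEndOfStream) != 0;
    return outcome;
}
```

Because the function has no Media Foundation dependency, the combinations that are awkward to reproduce with real files (a tick without a sample, end of stream without a sample) can be covered by plain unit tests.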

7.4 It Is Safer to Carry Over Timestamps and Durations from the Input

Timestamps are in 100-ns units. ReadSample returns the timestamp through an out-parameter, but the duration has to be retrieved separately from the IMFSample with GetSampleDuration.

Rather than assuming a fixed fps and adding a hard-coded increment each time, it is more robust to carry over the input sample’s timestamp / duration as much as possible. This sample does exactly that, falling back to a default value computed from the fps only when the duration cannot be obtained.
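The arithmetic behind this is small enough to show on its own. The helpers below are a sketch, not the article's code: the names are hypothetical, and the 30 fps fallback for an unknown frame rate is an assumption made for this example. How `videoInfo.defaultFrameDuration` is actually computed in the sample may differ.

```cpp
#include <cstdint>

// Media Foundation expresses time in 100-nanosecond units.
constexpr std::int64_t kHnsPerSecond = 10000000;

// Fallback frame duration from a frame-rate fraction such as 30000/1001.
// Assumption for this sketch: treat an unknown rate as 30 fps.
std::int64_t DefaultFrameDuration(std::uint32_t fpsNumerator, std::uint32_t fpsDenominator)
{
    if (fpsNumerator == 0 || fpsDenominator == 0)
    {
        return kHnsPerSecond / 30;
    }
    // Round to the nearest 100-ns unit.
    return (kHnsPerSecond * fpsDenominator + fpsNumerator / 2) / fpsNumerator;
}

// Prefer the duration carried by the sample; use the fallback otherwise.
std::int64_t ChooseDuration(std::int64_t sampleDuration, std::int64_t fallback)
{
    return sampleDuration > 0 ? sampleDuration : fallback;
}

// Shift timestamps so that the first output frame starts at 0.
std::int64_t RebaseTimestamp(std::int64_t timestamp, std::int64_t firstTimestamp)
{
    return timestamp - firstTimestamp;
}
```

For example, a 30000/1001 stream yields a fallback of 333667 units per frame, and a sample stamped 5000000 in a stream whose first sample was stamped 1000000 is written at 4000000.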

7.5 `GDI+` Is Lightweight to Adopt, but There Is a Next Step for Long or High-Resolution Content

GDI+ is very well suited to a single-file sample, but for workloads processing long videos or lots of 4K content, D3D11 + Direct2D + DirectWrite can be the better choice.

First get the whole pipeline working with GDI+
Then, if needed, replace it with Direct2D / DirectWrite
Move color conversion to a Video Processor MFT or the GPU side

A staged progression like this lets you extend the system without breaking the design.

7.6 This Sample Is Limited to Video Only

Covering audio in the same article would dilute the focus. For that reason, this sample concentrates on burning an image and text into the video frames, and the output is a video-only MP4.

In practice, the next step is to grow it into the following two-track configuration, which is easy to manage.

video: Source Reader -> composite -> Sink Writer
audio: remux it while still compressed

8. If the “Given Video Data” Is an In-Memory MP4 Byte Sequence Rather Than a File

The code in this article uses MFCreateSourceReaderFromURL, so the input is a file path.

But if the requirement is “do the same thing to mp4 bytes received from an API,” the thinking does not change. Only the entry point changes.

Prepare an IStream or a custom stream
Hand it to the Source Reader as an IMFByteStream
From there on it is the same: RGB32 -> draw -> NV12 -> Sink Writer

In other words, the essence is not how the video data is held, but how you draw onto each decoded frame.
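The entry-point change described above can be sketched as follows. This is a minimal sketch under several assumptions: the helper name `CreateReaderFromMemory` is hypothetical, SHCreateMemStream is used as one convenient way to obtain an IStream over a copy of the bytes, MFStartup has already been called, and the program links against mfplat.lib, mfreadwrite.lib, and shlwapi.lib. It has not been exercised against specific inputs here.

```cpp
#ifdef _WIN32 // Windows-only sketch
#include <windows.h>
#include <shlwapi.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <wrl/client.h>
#include <climits>
#include <cstddef>

using Microsoft::WRL::ComPtr;

// Hypothetical helper: build a Source Reader over MP4 bytes held in memory.
// The attributes argument plays the same role as readerAttributes in wmain.
HRESULT CreateReaderFromMemory(
    const BYTE* data,
    std::size_t size,
    IMFAttributes* attributes,
    IMFSourceReader** reader)
{
    if (data == nullptr || size == 0 || size > UINT_MAX || reader == nullptr)
    {
        return E_INVALIDARG;
    }

    // 1. Wrap the bytes in an IStream.
    ComPtr<IStream> stream;
    stream.Attach(SHCreateMemStream(data, static_cast<UINT>(size)));
    if (!stream)
    {
        return E_OUTOFMEMORY;
    }

    // 2. Expose the IStream to Media Foundation as an IMFByteStream.
    ComPtr<IMFByteStream> byteStream;
    HRESULT hr = MFCreateMFByteStreamOnStream(stream.Get(), &byteStream);
    if (FAILED(hr))
    {
        return hr;
    }

    // 3. Create the Source Reader from the byte stream instead of a URL.
    return MFCreateSourceReaderFromByteStream(byteStream.Get(), attributes, reader);
}
#endif
```

In the sample's `wmain`, a call like this would stand in for `MFCreateSourceReaderFromURL`, and everything from `ConfigureSourceReader` onward would stay unchanged.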

9. Growing It for Production

9.1 Add Audio Remuxing

The most practical first extension is to preserve the audio as is. Re-encode only the video and write the audio back in the same format while still compressed; this meets the requirement without adding much implementation.

9.2 Insert a `Video Processor MFT`

This sample converts BGRA -> NV12 by hand to stay self-contained in a single file, but in production, inserting a Video Processor MFT is also a very strong option.

The Video Processor MFT makes it easier to handle the following in one place.

color-space conversion
resizing
deinterlacing
frame-rate conversion

9.3 Replace `GDI+` with `Direct2D / DirectWrite`

For overlays such as logo images, subtitles, and timestamps, GDI+ is often sufficient, but if you need to squeeze out performance, Direct2D / DirectWrite has the edge.

A configuration based on D3D11 / DXGI surfaces becomes worth considering in particular when you have conditions such as the following.

high resolution
long durations
large numbers of videos
a future move toward a GPU path

9.4 Consider a Custom `MFT` Once It Becomes a “Video Effect You Want to Reuse”

In Media Foundation, effects can be implemented as an IMFTransform. So if you want to reuse the same overlay processing across multiple apps or pipelines, a custom MFT is a clean choice.

However, as a first implementation, a custom MFT adds the following burdens.

you must satisfy the IMFTransform contract
input/output media-type management increases
registration and debugging get harder

So in practice it is usually easier to first get things working correctly with Source Reader + compositing + Sink Writer, and extract an MFT when you actually need one.

10. Summary

When burning images or text into every frame of an MP4 with Media Foundation, breaking the problem into these four parts gives you a clear view.

Extract: IMFSourceReader
Draw: GDI+ or Direct2D / DirectWrite
Convert into a format the encoder accepts: NV12, etc.
Write back: IMFSinkWriter

And if what you want is “a sample you can paste entirely into one .cpp and run as is,” then the configuration used in this article is a natural fit.

Source Reader -> RGB32 -> image + HelloWorld with GDI+ -> BGRA to NV12 -> Sink Writer

If you then grow it for production, proceeding in this order keeps things from falling apart.

Add audio remuxing
Replace GDI+ with Direct2D / DirectWrite
Move the NV12 conversion to a Video Processor MFT or the GPU side
Move to a D3D11 surface-based design for long, high-resolution content
Extract a custom MFT if you need reusability

If you try to do everything at once, COM, strides, color spaces, and surface management all hit you at the same time. Getting it working stage by stage first, then strengthening only the parts you need later, makes both the design and the debugging considerably easier.

12. References

The complete sample code for this article (a single-file .cpp plus a CMake build configuration) https://github.com/gomurin0428/komurasoft-blog-samples/tree/main/media-foundation-overlay-image-text-on-mp4-frames
Microsoft Learn: Using the Source Reader to Process Media Data
Microsoft Learn: MFCreateSourceReaderFromByteStream
Microsoft Learn: MFCreateMFByteStreamOnStream
Microsoft Learn: IMFSourceReader::SetCurrentMediaType
Microsoft Learn: MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING
Microsoft Learn: MF_SOURCE_READER_ENABLE_ADVANCED_VIDEO_PROCESSING
Microsoft Learn: IMFSourceReader::ReadSample
Microsoft Learn: Working with Media Samples
Microsoft Learn: IMF2DBuffer::Lock2D
Microsoft Learn: Video Subtype GUIDs
Microsoft Learn: H.264 Video Encoder
Microsoft Learn: Video Processor MFT
Microsoft Learn: Using the Sink Writer
Microsoft Learn: Tutorial: Using the Sink Writer to Encode Video
Microsoft Learn: Interoperability Overview (Direct2D)
Microsoft Learn: Text Rendering with Direct2D and DirectWrite
Microsoft Learn: Writing a Custom MFT

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
