How to Burn Images and Text into MP4 Frames with Media Foundation

Media Foundation, C++, Windows Development, GDI+, Direct2D, DirectWrite, H.264

Logo watermarks, inspection results, equipment IDs, operator names, timestamps: burning this kind of information into every frame of an MP4 video and producing a new MP4 is a common requirement in surveillance, inspection, audit trails, and analysis UIs.

But once you start working with Media Foundation, you are confronted with IMFSourceReader, IMFSample, IMFMediaBuffer, IMFTransform, and IMFSinkWriter, and it suddenly becomes hard to see where exactly you are supposed to overlay text or a PNG.

In this article, we first lay out the big picture (Source Reader -> drawing -> color conversion -> Sink Writer) and then provide a single-file sample you can paste straight into a Visual Studio C++ console application. The sample reads a given MP4, draws a given image plus the text HelloWorld onto every frame, and produces an output MP4.

Note that this sample prioritizes being something you can paste in and run immediately, so it uses a configuration that re-encodes the video only. You could cram audio remuxing into the same program, but since the topic of this article is “burning an image and text into every frame,” we focus on that first.

The code in this article is published on GitHub as a complete sample set (a single-file .cpp plus a CMake build configuration).

media-foundation-overlay-image-text-on-mp4-frames - komurasoft-blog-samples (GitHub)

1. The Short Answer First

  • The basic pattern for putting an image or text into every frame of an MP4 is decode with the Source Reader -> composite onto uncompressed frames -> convert colors if needed -> re-encode with the Sink Writer.
  • Placing the image or text itself is not Media Foundation’s job. It is more natural to think about this part with drawing APIs such as GDI+, Direct2D, DirectWrite, and WIC.
  • If you are writing back to MP4 (H.264), you will often need a conversion stage that bridges RGB32 / ARGB32, which are easy to draw on, and NV12 / I420 / YUY2, which encoders accept readily.
  • If you want to get your first version working, the configuration Source Reader -> RGB32 -> draw with GDI+ -> NV12 -> Sink Writer is easy to follow.
  • If you want to prioritize speed and extensibility, moving toward D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer gives you more headroom.

2. Why This Problem Is a Bit Tricky

“Putting text into a video” is actually four different topics mixed together.

  1. Containers vs. codecs: An MP4 is a container, not the frames themselves. The contents are usually compressed data such as H.264 or H.265.

  2. Decoding / encoding: While the data is still compressed, you cannot simply overlay text or a PNG with an ordinary 2D drawing API. You first need to get back to uncompressed frames.

  3. Drawing: Text, logos, alpha-blended PNGs, and anti-aliased text rendering are not the responsibility of Media Foundation itself. This is the job of GDI+ or Direct2D / DirectWrite / WIC.

  4. Color spaces and pixel formats: The format that is easy to draw on and the format the encoder prefers are not the same. This is where people quietly get stuck.

Putting it bluntly in one line: rather than “putting text in with Media Foundation,” the clearest mental model is “use Media Foundation to move frames around, use a drawing API to overlay things, then apply any needed color conversion before encoding.”

3. The Overview Table to Look at First

Approach | Configuration | Best suited for | Watch out for
--- | --- | --- | ---
Get it working correctly first | Source Reader -> RGB32 -> composite -> NV12 -> Sink Writer | Batch processing, internal tools, initial implementations | CPU-side copies and conversions tend to add up
Increase speed | D3D11 / DXGI surface -> Direct2D / DirectWrite -> Video Processor MFT -> Sink Writer | Long videos, high resolutions, bulk processing | More D3D11 and DXGI management
Make it a reusable component | Implement as a custom MFT and insert it into a topology | Effects shared across multiple apps, integration into an MF pipeline | Implementation, registration, and debugging get harder

The sample in this article is limited to the top row, the “get it working correctly first” configuration.

3.1 Processing Overview

Video path: input.mp4 -> IMFSourceReader -> uncompressed frame (RGB32) -> draw image + HelloWorld with GDI+ -> BGRA -> NV12 conversion -> IMFSinkWriter -> output.mp4
Audio path: audio samples -> copy as is, or re-encode

The important point here is that the drawing itself is not Media Foundation’s job. Media Foundation is responsible for moving frames in and out; placing the image and text is delegated to a drawing API.
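To make that division of labor concrete, here is a minimal, API-independent sketch of the per-frame loop. The Frame and Pipeline types and the RunPipeline function are hypothetical names invented for illustration; each callback only stands in for the real component named in its comment, and no Media Foundation call is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical frame type for illustration: raw pixel bytes plus a timestamp.
struct Frame
{
    std::vector<std::uint8_t> pixels;
    long long timestamp100ns = 0;
};

// Each stage is a callback so the loop stays independent of any real API.
struct Pipeline
{
    std::function<bool(Frame&)> readFrame;                 // stands in for the Source Reader; false = end of stream
    std::function<void(Frame&)> drawOverlay;               // stands in for GDI+ / Direct2D drawing
    std::function<Frame(const Frame&)> convertForEncoder;  // stands in for BGRA -> NV12 conversion
    std::function<void(const Frame&)> writeFrame;          // stands in for the Sink Writer
};

// Pulls frames until the reader reports end of stream; returns the number written.
inline std::size_t RunPipeline(const Pipeline& p)
{
    assert(p.readFrame && p.drawOverlay && p.convertForEncoder && p.writeFrame);

    std::size_t written = 0;
    Frame frame;
    while (p.readFrame(frame))
    {
        p.drawOverlay(frame);                      // drawing happens on the uncompressed frame
        p.writeFrame(p.convertForEncoder(frame));  // conversion happens just before encoding
        ++written;
    }
    return written;
}
```

The only point of the sketch is the ordering: drawing is applied to the uncompressed frame, and the pixel-format conversion sits between drawing and encoding.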

4. How to Split Up the Pipeline

4.1 Receive the Input with IMFSourceReader

If the input is a file path, use MFCreateSourceReaderFromURL; if it is video data in memory, a clear approach is to create an IMFByteStream and use MFCreateSourceReaderFromByteStream.

The first thing to decide here is whether to receive frames in a format that is easy to draw on, or in a format aimed at the encoder.

  • If you want a simple implementation, use RGB32 or ARGB32
  • If you want encoding efficiency, use a YUV format such as NV12

That said, compositing text and PNGs is overwhelmingly easier to reason about in RGB formats, so receiving frames as RGB32 / ARGB32 is the easy first move.

If you enable MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, the Source Reader performs YUV -> RGB32 conversion and deinterlacing for you. This is convenient at the “I just want to pull out frames and work with them” stage, but it tends to get heavy with long or high-resolution videos, so if you need speed in production it is worth revisiting the configuration later.

4.2 Think About Image and Text Compositing in Terms of GDI+ or Direct2D / DirectWrite

You take the buffer out of the IMFSample received from Media Foundation and place a logo image and text on top of it.
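One detail worth keeping in mind when taking pixels out of the buffer is stride: rows may be padded, and a negative stride means the image is stored bottom-up. The following is a minimal sketch, independent of Media Foundation, of copying such an image into a tightly packed top-down buffer. CopyToTopDown is a hypothetical helper name, and the sketch assumes 4 bytes per pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies a 32-bit-per-pixel image into a tightly packed, top-down buffer.
// scanline0 points at the first visual row (the top of the picture).
// strideBytes is the signed distance to the next visual row, so it is
// negative when the image is stored bottom-up in memory.
inline std::vector<std::uint8_t> CopyToTopDown(
    const std::uint8_t* scanline0,
    std::ptrdiff_t strideBytes,
    std::size_t width,
    std::size_t height)
{
    assert(scanline0 != nullptr);

    const std::size_t rowBytes = width * 4;  // visible bytes per row, without padding
    std::vector<std::uint8_t> out(rowBytes * height);

    for (std::size_t y = 0; y < height; ++y)
    {
        const std::uint8_t* src = scanline0 + static_cast<std::ptrdiff_t>(y) * strideBytes;
        std::memcpy(out.data() + y * rowBytes, src, rowBytes);
    }
    return out;
}
```

Working on a packed top-down copy keeps the drawing and conversion code free of stride handling, at the cost of one extra copy per frame.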

This sample prioritizes being easy to paste as a single file, so it uses GDI+ for drawing.

  • It can load images
  • It can draw text
  • It requires relatively little extra setup
  • It fits easily into a single console-app .cpp

On the other hand, for workloads that process long videos or lots of 4K content, D3D11 + Direct2D + DirectWrite has more headroom. A natural progression is to use GDI+ for the first implementation and move to Direct2D / DirectWrite when you need to optimize for speed.
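Whichever drawing API you choose, what it does per pixel when placing a semi-transparent logo is source-over blending. The sketch below illustrates that arithmetic for a single pixel, assuming straight (non-premultiplied) alpha and an opaque video frame. Bgra, BlendChannel, and BlendOver are hypothetical names for illustration, not GDI+ or Direct2D APIs.

```cpp
#include <cassert>
#include <cstdint>

// One BGRA pixel with straight (non-premultiplied) alpha.
struct Bgra
{
    std::uint8_t b;
    std::uint8_t g;
    std::uint8_t r;
    std::uint8_t a;
};

// Blends one channel: the source is weighted by alpha, the destination by the
// remainder, and the result is rounded to the nearest integer.
inline std::uint8_t BlendChannel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const int value = src * alpha + dst * (255 - alpha);
    return static_cast<std::uint8_t>((value + 127) / 255);
}

// Source-over compositing of an overlay pixel onto an opaque video pixel.
inline Bgra BlendOver(Bgra src, Bgra dst)
{
    assert(dst.a == 255);  // the video frame is treated as fully opaque
    return Bgra{
        BlendChannel(src.b, dst.b, src.a),
        BlendChannel(src.g, dst.g, src.a),
        BlendChannel(src.r, dst.r, src.a),
        255};
}
```

In practice you would let the drawing API do this for whole images and glyphs; the sketch is only meant to show why compositing is easiest to reason about in an RGB format.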

4.3 You Cannot Necessarily Write RGB32 Straight to H.264

This is where people get stuck most often.

When writing back to MP4 (H.264), Microsoft’s H.264 encoder usually expects YUV-family input such as I420 / IYUV / NV12 / YUY2 / YV12. In other words, compositing in the easy-to-draw RGB32 / ARGB32 and then handing the result straight to IMFSinkWriter is not guaranteed to work.

So in practice you need one of two conversions.

  • Insert a Video Processor MFT to do RGB32 / ARGB32 -> NV12
  • Implement your own RGB -> NV12 conversion

This sample prioritizes being a single self-contained file, so it takes the latter route with a hand-rolled conversion. In production, inserting a Video Processor MFT, which can handle color-space conversion, resizing, and deinterlacing all together, is also a strong option.
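As a rough illustration of what a hand-rolled conversion involves, here is a per-pixel sketch using the BT.601 limited-range integer coefficients (scaled by 256) that the sample uses for smaller frames. The Yuv type and the function names are hypothetical and exist only for this illustration; the sketch also assumes that right-shifting a negative int is an arithmetic shift, which mainstream compilers provide and C++20 guarantees.

```cpp
#include <algorithm>
#include <cstdint>

// One limited-range YUV triple (Y in 16..235, U and V centered on 128).
struct Yuv
{
    std::uint8_t y;
    std::uint8_t u;
    std::uint8_t v;
};

// Clamps an intermediate result into the 0..255 byte range.
inline std::uint8_t ClampByte(int value)
{
    return static_cast<std::uint8_t>(std::min(255, std::max(0, value)));
}

// Converts one 8-bit RGB pixel to BT.601 limited-range YUV.
// The "+ 128" rounds before the ">> 8" that removes the fixed-point scale.
inline Yuv RgbToYuvBt601(int r, int g, int b)
{
    const int y = ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16;
    const int u = ((-38 * r - 74 * g + 112 * b + 128) >> 8) + 128;
    const int v = ((112 * r - 94 * g - 18 * b + 128) >> 8) + 128;
    return Yuv{ClampByte(y), ClampByte(u), ClampByte(v)};
}
```

For NV12, Y is stored for every pixel, while U and V are computed once per 2x2 block (for example from the averaged RGB of the four pixels) and stored interleaved.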

4.4 Write the Output with IMFSinkWriter

For video output, IMFSinkWriter is the easiest to work with.

The idea is simple: you configure two things separately.

  • The output stream type: the format you want written to the file. Example: MFVideoFormat_H264
  • The input stream type: the format the app hands to the Sink Writer. Example: MFVideoFormat_NV12

From the Sink Writer’s perspective, the relationship is:

  • the app side hands over uncompressed NV12 frames
  • the Sink Writer encodes them to H.264 and writes them into the MP4

4.5 Treating Audio Separately at First Keeps Things Tidy

Very often you only want to put a logo or text into the video and do not want to touch the audio at all.

In practice, the following split is easy to work with:

  • video stream only: Source Reader -> composite -> Sink Writer
  • audio stream: remux it while still compressed

However, since this sample focuses on burning an image and text into the frames, the output is a video-only MP4. A version that preserves audio is easier to follow if you add it later as an extension.

5. Assumptions and Usage for This Sample

The assumptions for this code are as follows.

  • Windows 10 / 11
  • A Visual Studio 2022 C++ console application
  • x64 build
  • This .cpp file does not use precompiled headers
  • The input video’s width and height are even
  • The input is an ordinary MP4 video file
  • The output is a video-only MP4
  • The image is in a format GDI+ can read, such as PNG / JPEG / BMP / GIF

NV12 is 4:2:0, so the width and height must be even. For that reason, this sample explicitly raises an error when those conditions are not met.
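The even-size requirement falls out of the NV12 memory layout: a full-resolution Y plane followed by one interleaved UV plane at half resolution in both directions. The sketch below computes that layout for a tightly packed buffer (row stride equal to the width), which is an assumption of this illustration. Nv12Layout and ComputeNv12Layout are hypothetical names.

```cpp
#include <cstddef>
#include <stdexcept>

// Byte layout of a tightly packed NV12 frame.
struct Nv12Layout
{
    std::size_t yPlaneBytes = 0;    // one byte per pixel
    std::size_t uvPlaneOffset = 0;  // the UV plane starts right after the Y plane
    std::size_t uvPlaneBytes = 0;   // one U,V byte pair per 2x2 pixel block
    std::size_t totalBytes = 0;     // width * height * 3 / 2
};

// Computes the layout, rejecting sizes that cannot be split into 2x2 blocks.
inline Nv12Layout ComputeNv12Layout(std::size_t width, std::size_t height)
{
    if (width == 0 || height == 0 || width % 2 != 0 || height % 2 != 0)
    {
        throw std::invalid_argument("NV12 needs non-zero, even width and height.");
    }

    Nv12Layout layout;
    layout.yPlaneBytes = width * height;
    layout.uvPlaneOffset = layout.yPlaneBytes;
    layout.uvPlaneBytes = width * (height / 2);  // (width / 2) pairs of 2 bytes per row
    layout.totalBytes = layout.yPlaneBytes + layout.uvPlaneBytes;
    return layout;
}
```

For a 1920x1080 frame this gives a 2,073,600-byte Y plane and a 3,110,400-byte buffer in total, compared with 8,294,400 bytes for the same frame in a 32-bit RGB format.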

5.1 Usage

  1. Create a Console App in Visual Studio
  2. Paste this .cpp in wholesale
  3. Set that .cpp file’s precompiled header option to “Not Using”
  4. Build for x64
  5. Run it as follows
OverlayMp4.exe input.mp4 overlay.png output.mp4
  • input.mp4: The source video
  • overlay.png: The image to overlay
  • output.mp4: The output destination

The text string is hard-coded to HelloWorld in kOverlayText at the top of the code. Position and size can also be changed by adjusting the constants in the code.

6. Single-File Code You Can Paste Straight into a .cpp

#define NOMINMAX
#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>
#include <gdiplus.h>
#include <wrl/client.h>

#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cwchar>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")
#pragma comment(lib, "mfuuid.lib")
#pragma comment(lib, "mf.lib")
#pragma comment(lib, "gdiplus.lib")

using Microsoft::WRL::ComPtr;

namespace
{
    const wchar_t* kOverlayText = L"HelloWorld";
    const float kMarginRatio = 0.03f;
    const float kImageMaxWidthRatio = 0.20f;
    const float kImageMaxHeightRatio = 0.20f;
    const float kMinFontPx = 24.0f;

    std::string HrToHex(HRESULT hr)
    {
        char buf[32]{};
        std::snprintf(buf, sizeof(buf), "0x%08X", static_cast<unsigned int>(hr));
        return std::string(buf);
    }

    void ThrowIfFailed(HRESULT hr, const char* message)
    {
        if (FAILED(hr))
        {
            throw std::runtime_error(std::string(message) + " failed. HRESULT=" + HrToHex(hr));
        }
    }

    void ThrowIfGdiplusError(Gdiplus::Status status, const char* message)
    {
        if (status != Gdiplus::Ok)
        {
            char buf[128]{};
            std::snprintf(buf, sizeof(buf), "%s failed. GDI+ status=%d", message, static_cast<int>(status));
            throw std::runtime_error(buf);
        }
    }

    BYTE ClampToByte(int value)
    {
        if (value < 0) return 0;
        if (value > 255) return 255;
        return static_cast<BYTE>(value);
    }

    class ScopedGdiplus
    {
    public:
        ScopedGdiplus()
        {
            Gdiplus::GdiplusStartupInput input;
            ThrowIfGdiplusError(Gdiplus::GdiplusStartup(&token_, &input, nullptr), "GdiplusStartup");
        }

        ~ScopedGdiplus()
        {
            if (token_ != 0)
            {
                Gdiplus::GdiplusShutdown(token_);
            }
        }

    private:
        ULONG_PTR token_ = 0;
    };

    class ScopedMf
    {
    public:
        ScopedMf()
        {
            ThrowIfFailed(CoInitializeEx(nullptr, COINIT_MULTITHREADED), "CoInitializeEx");
            comInitialized_ = true;

            ThrowIfFailed(MFStartup(MF_VERSION), "MFStartup");
            mfStarted_ = true;
        }

        ~ScopedMf()
        {
            if (mfStarted_)
            {
                MFShutdown();
            }

            if (comInitialized_)
            {
                CoUninitialize();
            }
        }

    private:
        bool comInitialized_ = false;
        bool mfStarted_ = false;
    };

    class BufferLock
    {
    public:
        explicit BufferLock(IMFMediaBuffer* buffer)
            : buffer_(buffer)
        {
            if (!buffer_)
            {
                throw std::runtime_error("BufferLock received a null buffer.");
            }

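            // IMF2DBuffer is optional. If the query fails, buffer2D_ stays null and
            // LockBuffer falls back to IMFMediaBuffer::Lock with the default stride.
            buffer_.As(&buffer2D_);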
        }

        HRESULT LockBuffer(LONG defaultStride, DWORD heightInPixels, BYTE** scanline0, LONG* actualStride)
        {
            if (scanline0 == nullptr || actualStride == nullptr)
            {
                return E_POINTER;
            }

            HRESULT hr = S_OK;

            if (buffer2D_)
            {
                hr = buffer2D_->Lock2D(scanline0, actualStride);
            }
            else
            {
                BYTE* data = nullptr;
                hr = buffer_->Lock(&data, nullptr, nullptr);
                if (SUCCEEDED(hr))
                {
                    *actualStride = defaultStride;
                    if (defaultStride < 0)
                    {
                        *scanline0 = data + (static_cast<LONG>(heightInPixels) - 1) * std::abs(defaultStride);
                    }
                    else
                    {
                        *scanline0 = data;
                    }
                }
            }

            locked_ = SUCCEEDED(hr);
            return hr;
        }

        ~BufferLock()
        {
            if (!locked_)
            {
                return;
            }

            if (buffer2D_)
            {
                buffer2D_->Unlock2D();
            }
            else
            {
                buffer_->Unlock();
            }
        }

    private:
        ComPtr<IMFMediaBuffer> buffer_;
        ComPtr<IMF2DBuffer> buffer2D_;
        bool locked_ = false;
    };

    struct VideoFormatInfo
    {
        UINT32 width = 0;
        UINT32 height = 0;
        UINT32 fpsNum = 0;
        UINT32 fpsDen = 0;
        UINT32 parNum = 1;
        UINT32 parDen = 1;
        LONG sourceStride = 0;
        LONGLONG defaultFrameDuration = 0;
        UINT32 bitrate = 0;
    };

    LONG GetDefaultStride(IMFMediaType* type)
    {
        LONG stride = 0;

        HRESULT hr = type->GetUINT32(MF_MT_DEFAULT_STRIDE, reinterpret_cast<UINT32*>(&stride));
        if (SUCCEEDED(hr))
        {
            return stride;
        }

        GUID subtype = GUID_NULL;
        UINT32 width = 0;
        UINT32 height = 0;

        ThrowIfFailed(type->GetGUID(MF_MT_SUBTYPE, &subtype), "GetGUID(MF_MT_SUBTYPE)");
        ThrowIfFailed(MFGetAttributeSize(type, MF_MT_FRAME_SIZE, &width, &height), "MFGetAttributeSize(MF_MT_FRAME_SIZE)");
        ThrowIfFailed(MFGetStrideForBitmapInfoHeader(subtype.Data1, width, &stride), "MFGetStrideForBitmapInfoHeader");
        ThrowIfFailed(type->SetUINT32(MF_MT_DEFAULT_STRIDE, static_cast<UINT32>(stride)), "SetUINT32(MF_MT_DEFAULT_STRIDE)");

        return stride;
    }

    UINT32 ChooseBitrate(IMFMediaType* nativeType, UINT32 width, UINT32 height, UINT32 fpsNum, UINT32 fpsDen)
    {
        UINT32 srcBitrate = 0;
        if (SUCCEEDED(nativeType->GetUINT32(MF_MT_AVG_BITRATE, &srcBitrate)) && srcBitrate > 0)
        {
            return srcBitrate;
        }

        const double fps = static_cast<double>(fpsNum) / static_cast<double>(fpsDen);
        double estimated = static_cast<double>(width) * static_cast<double>(height) * fps * 0.07;

        if (estimated < 1500000.0)
        {
            estimated = 1500000.0;
        }

        if (estimated > 25000000.0)
        {
            estimated = 25000000.0;
        }

        return static_cast<UINT32>(estimated);
    }

    VideoFormatInfo ConfigureSourceReader(IMFSourceReader* reader)
    {
        ThrowIfFailed(reader->SetStreamSelection(MF_SOURCE_READER_ALL_STREAMS, FALSE), "SetStreamSelection(all,false)");
        ThrowIfFailed(reader->SetStreamSelection(MF_SOURCE_READER_FIRST_VIDEO_STREAM, TRUE), "SetStreamSelection(video,true)");

        ComPtr<IMFMediaType> nativeType;
        ThrowIfFailed(reader->GetNativeMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, 0, &nativeType), "GetNativeMediaType(video)");

        ComPtr<IMFMediaType> requestedType;
        ThrowIfFailed(MFCreateMediaType(&requestedType), "MFCreateMediaType(video requested)");
        ThrowIfFailed(requestedType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(video requested major)");
        ThrowIfFailed(requestedType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32), "SetGUID(video requested subtype RGB32)");
        ThrowIfFailed(reader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, nullptr, requestedType.Get()), "SetCurrentMediaType(video RGB32)");

        ComPtr<IMFMediaType> currentType;
        ThrowIfFailed(reader->GetCurrentMediaType(MF_SOURCE_READER_FIRST_VIDEO_STREAM, &currentType), "GetCurrentMediaType(video)");

        VideoFormatInfo info;
        ThrowIfFailed(MFGetAttributeSize(currentType.Get(), MF_MT_FRAME_SIZE, &info.width, &info.height), "Get video frame size");

        HRESULT hr = MFGetAttributeRatio(currentType.Get(), MF_MT_FRAME_RATE, &info.fpsNum, &info.fpsDen);
        if (FAILED(hr))
        {
            ThrowIfFailed(MFGetAttributeRatio(nativeType.Get(), MF_MT_FRAME_RATE, &info.fpsNum, &info.fpsDen), "Get video frame rate");
        }

        if (info.fpsNum == 0 || info.fpsDen == 0)
        {
            throw std::runtime_error("Video frame rate is zero.");
        }

        hr = MFGetAttributeRatio(currentType.Get(), MF_MT_PIXEL_ASPECT_RATIO, &info.parNum, &info.parDen);
        if (FAILED(hr) || info.parNum == 0 || info.parDen == 0)
        {
            info.parNum = 1;
            info.parDen = 1;
        }

        info.sourceStride = GetDefaultStride(currentType.Get());
        info.defaultFrameDuration = (10000000LL * info.fpsDen) / info.fpsNum;
        if (info.defaultFrameDuration <= 0)
        {
            throw std::runtime_error("Calculated frame duration is invalid.");
        }

        info.bitrate = ChooseBitrate(nativeType.Get(), info.width, info.height, info.fpsNum, info.fpsDen);
        return info;
    }

    ComPtr<IMFSinkWriter> CreateSinkWriter(const std::wstring& outputPath, const VideoFormatInfo& videoInfo, DWORD* streamIndex)
    {
        if (streamIndex == nullptr)
        {
            throw std::runtime_error("streamIndex is null.");
        }

        ComPtr<IMFAttributes> attributes;
        ThrowIfFailed(MFCreateAttributes(&attributes, 1), "MFCreateAttributes(sink)");
        ThrowIfFailed(attributes->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE), "SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS)");

        ComPtr<IMFSinkWriter> writer;
        ThrowIfFailed(MFCreateSinkWriterFromURL(outputPath.c_str(), nullptr, attributes.Get(), &writer), "MFCreateSinkWriterFromURL");

        ComPtr<IMFMediaType> outputType;
        ThrowIfFailed(MFCreateMediaType(&outputType), "MFCreateMediaType(video output)");
        ThrowIfFailed(outputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(output major)");
        ThrowIfFailed(outputType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_H264), "SetGUID(output subtype H264)");
        ThrowIfFailed(outputType->SetUINT32(MF_MT_AVG_BITRATE, videoInfo.bitrate), "SetUINT32(output bitrate)");
        ThrowIfFailed(outputType->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive), "SetUINT32(output interlace)");
        ThrowIfFailed(MFSetAttributeSize(outputType.Get(), MF_MT_FRAME_SIZE, videoInfo.width, videoInfo.height), "MFSetAttributeSize(output frame size)");
        ThrowIfFailed(MFSetAttributeRatio(outputType.Get(), MF_MT_FRAME_RATE, videoInfo.fpsNum, videoInfo.fpsDen), "MFSetAttributeRatio(output fps)");
        ThrowIfFailed(MFSetAttributeRatio(outputType.Get(), MF_MT_PIXEL_ASPECT_RATIO, videoInfo.parNum, videoInfo.parDen), "MFSetAttributeRatio(output PAR)");
        ThrowIfFailed(writer->AddStream(outputType.Get(), streamIndex), "AddStream(video)");

        ComPtr<IMFMediaType> inputType;
        ThrowIfFailed(MFCreateMediaType(&inputType), "MFCreateMediaType(video input)");
        ThrowIfFailed(inputType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video), "SetGUID(input major)");
        ThrowIfFailed(inputType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_NV12), "SetGUID(input subtype NV12)");
        ThrowIfFailed(inputType->SetUINT32(MF_MT_INTERLACE_MODE, MFVideoInterlace_Progressive), "SetUINT32(input interlace)");
        ThrowIfFailed(MFSetAttributeSize(inputType.Get(), MF_MT_FRAME_SIZE, videoInfo.width, videoInfo.height), "MFSetAttributeSize(input frame size)");
        ThrowIfFailed(MFSetAttributeRatio(inputType.Get(), MF_MT_FRAME_RATE, videoInfo.fpsNum, videoInfo.fpsDen), "MFSetAttributeRatio(input fps)");
        ThrowIfFailed(MFSetAttributeRatio(inputType.Get(), MF_MT_PIXEL_ASPECT_RATIO, videoInfo.parNum, videoInfo.parDen), "MFSetAttributeRatio(input PAR)");
        ThrowIfFailed(writer->SetInputMediaType(*streamIndex, inputType.Get(), nullptr), "SetInputMediaType(video)");

        ThrowIfFailed(writer->BeginWriting(), "BeginWriting");
        return writer;
    }

    void CopySampleToTopDownBgra(IMFSample* sample, const VideoFormatInfo& videoInfo, std::vector<BYTE>& bgra)
    {
        ComPtr<IMFMediaBuffer> buffer;
        ThrowIfFailed(sample->ConvertToContiguousBuffer(&buffer), "ConvertToContiguousBuffer");

        BufferLock lock(buffer.Get());

        BYTE* scanline0 = nullptr;
        LONG actualStride = 0;
        ThrowIfFailed(lock.LockBuffer(videoInfo.sourceStride, videoInfo.height, &scanline0, &actualStride), "LockBuffer");

        const size_t dstStride = static_cast<size_t>(videoInfo.width) * 4;
        bgra.resize(dstStride * videoInfo.height);

        for (UINT32 y = 0; y < videoInfo.height; ++y)
        {
            const BYTE* srcRow = scanline0 + static_cast<LONG>(y) * actualStride;
            BYTE* dstRow = bgra.data() + static_cast<size_t>(y) * dstStride;
            std::memcpy(dstRow, srcRow, dstStride);

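            // RGB32 does not carry meaningful alpha, so force every pixel opaque
            // before the buffer is handed to GDI+ as a 32-bit bitmap with alpha.
            for (UINT32 x = 0; x < videoInfo.width; ++x)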
            {
                dstRow[static_cast<size_t>(x) * 4 + 3] = 0xFF;
            }
        }
    }

    void DrawOverlay(std::vector<BYTE>& bgra, UINT32 width, UINT32 height, Gdiplus::Image& overlayImage)
    {
        const INT stride = static_cast<INT>(width * 4);

        Gdiplus::Bitmap frameBitmap(
            static_cast<INT>(width),
            static_cast<INT>(height),
            stride,
            PixelFormat32bppPARGB,
            bgra.data());
        ThrowIfGdiplusError(frameBitmap.GetLastStatus(), "Create frame bitmap");

        Gdiplus::Graphics graphics(&frameBitmap);
        ThrowIfGdiplusError(graphics.GetLastStatus(), "Create graphics");

        graphics.SetCompositingMode(Gdiplus::CompositingModeSourceOver);
        graphics.SetCompositingQuality(Gdiplus::CompositingQualityHighQuality);
        graphics.SetInterpolationMode(Gdiplus::InterpolationModeHighQualityBicubic);
        graphics.SetSmoothingMode(Gdiplus::SmoothingModeAntiAlias);
        graphics.SetTextRenderingHint(Gdiplus::TextRenderingHintAntiAliasGridFit);

        const Gdiplus::REAL margin = std::max<Gdiplus::REAL>(16.0f, static_cast<Gdiplus::REAL>(height) * kMarginRatio);
        const Gdiplus::REAL maxImageW = static_cast<Gdiplus::REAL>(width) * kImageMaxWidthRatio;
        const Gdiplus::REAL maxImageH = static_cast<Gdiplus::REAL>(height) * kImageMaxHeightRatio;

        const Gdiplus::REAL srcW = static_cast<Gdiplus::REAL>(overlayImage.GetWidth());
        const Gdiplus::REAL srcH = static_cast<Gdiplus::REAL>(overlayImage.GetHeight());
        if (srcW <= 0.0f || srcH <= 0.0f)
        {
            throw std::runtime_error("Overlay image has invalid size.");
        }

        const Gdiplus::REAL imageScale =
            std::min<Gdiplus::REAL>(1.0f, std::min(maxImageW / srcW, maxImageH / srcH));

        const Gdiplus::REAL drawW = srcW * imageScale;
        const Gdiplus::REAL drawH = srcH * imageScale;

        Gdiplus::RectF imageRect(margin, margin, drawW, drawH);
        Gdiplus::SolidBrush imagePlate(Gdiplus::Color(96, 0, 0, 0));
        graphics.FillRectangle(
            &imagePlate,
            imageRect.X - 8.0f,
            imageRect.Y - 8.0f,
            imageRect.Width + 16.0f,
            imageRect.Height + 16.0f);

        graphics.DrawImage(&overlayImage, imageRect);

        const Gdiplus::REAL fontPx =
            std::max<Gdiplus::REAL>(kMinFontPx, static_cast<Gdiplus::REAL>(height) * 0.06f);

        Gdiplus::Font font(L"Segoe UI", fontPx, Gdiplus::FontStyleBold, Gdiplus::UnitPixel);
        ThrowIfGdiplusError(font.GetLastStatus(), "Create font");

        Gdiplus::StringFormat stringFormat;
        stringFormat.SetAlignment(Gdiplus::StringAlignmentNear);
        stringFormat.SetLineAlignment(Gdiplus::StringAlignmentNear);

        Gdiplus::RectF measureLayout(
            margin,
            static_cast<Gdiplus::REAL>(height) - margin - fontPx * 2.0f,
            static_cast<Gdiplus::REAL>(width) - margin * 2.0f,
            fontPx * 2.0f);

        Gdiplus::RectF measured;
        graphics.MeasureString(kOverlayText, -1, &font, measureLayout, &stringFormat, &measured);

        Gdiplus::RectF textBg(
            measured.X - 12.0f,
            measured.Y - 8.0f,
            measured.Width + 24.0f,
            measured.Height + 16.0f);

        Gdiplus::SolidBrush textPlate(Gdiplus::Color(128, 0, 0, 0));
        graphics.FillRectangle(&textPlate, textBg);

        Gdiplus::SolidBrush shadowBrush(Gdiplus::Color(220, 0, 0, 0));
        Gdiplus::RectF shadowLayout = measureLayout;
        shadowLayout.X += 2.0f;
        shadowLayout.Y += 2.0f;
        graphics.DrawString(kOverlayText, -1, &font, shadowLayout, &stringFormat, &shadowBrush);

        Gdiplus::SolidBrush textBrush(Gdiplus::Color(235, 255, 255, 255));
        graphics.DrawString(kOverlayText, -1, &font, measureLayout, &stringFormat, &textBrush);
    }

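    // Converts tightly packed top-down BGRA to tightly packed NV12 (limited range).
    // Coefficients are 8-bit fixed-point values (scaled by 256). Frames larger than
    // 1024x576 use BT.709 coefficients; smaller frames use BT.601.
    void BgraToNv12(const BYTE* bgra, UINT32 width, UINT32 height, BYTE* nv12)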
    {
        const bool useBt709 = (width > 1024 || height > 576);

        const int yR = useBt709 ? 47 : 66;
        const int yG = useBt709 ? 157 : 129;
        const int yB = useBt709 ? 16 : 25;

        const int uR = useBt709 ? -26 : -38;
        const int uG = useBt709 ? -87 : -74;
        const int uB = 112;

        const int vR = 112;
        const int vG = useBt709 ? -102 : -94;
        const int vB = useBt709 ? -10 : -18;

        BYTE* yPlane = nv12;
        BYTE* uvPlane = nv12 + static_cast<size_t>(width) * height;

        const size_t srcStride = static_cast<size_t>(width) * 4;

        for (UINT32 y = 0; y < height; ++y)
        {
            const BYTE* srcRow = bgra + static_cast<size_t>(y) * srcStride;
            BYTE* dstY = yPlane + static_cast<size_t>(y) * width;

            for (UINT32 x = 0; x < width; ++x)
            {
                const BYTE b = srcRow[x * 4 + 0];
                const BYTE g = srcRow[x * 4 + 1];
                const BYTE r = srcRow[x * 4 + 2];

                const int Y = ((yR * r + yG * g + yB * b + 128) >> 8) + 16;
                dstY[x] = ClampToByte(Y);
            }
        }

        for (UINT32 y = 0; y < height; y += 2)
        {
            const BYTE* row0 = bgra + static_cast<size_t>(y) * srcStride;
            const BYTE* row1 = bgra + static_cast<size_t>(y + 1) * srcStride;
            BYTE* dstUV = uvPlane + static_cast<size_t>(y / 2) * width;

            for (UINT32 x = 0; x < width; x += 2)
            {
                int b = 0;
                int g = 0;
                int r = 0;

                for (UINT32 dy = 0; dy < 2; ++dy)
                {
                    const BYTE* row = (dy == 0) ? row0 : row1;
                    for (UINT32 dx = 0; dx < 2; ++dx)
                    {
                        const UINT32 ix = x + dx;
                        b += row[ix * 4 + 0];
                        g += row[ix * 4 + 1];
                        r += row[ix * 4 + 2];
                    }
                }

                b = (b + 2) / 4;
                g = (g + 2) / 4;
                r = (r + 2) / 4;

                const int U = ((uR * r + uG * g + uB * b + 128) >> 8) + 128;
                const int V = ((vR * r + vG * g + vB * b + 128) >> 8) + 128;

                dstUV[x + 0] = ClampToByte(U);
                dstUV[x + 1] = ClampToByte(V);
            }
        }
    }

    ComPtr<IMFSample> CreateNv12Sample(
        const std::vector<BYTE>& bgra,
        const VideoFormatInfo& videoInfo,
        LONGLONG sampleTime,
        LONGLONG sampleDuration)
    {
        const DWORD bufferSize =
            static_cast<DWORD>(videoInfo.width * videoInfo.height * 3 / 2);

        ComPtr<IMFMediaBuffer> buffer;
        ThrowIfFailed(MFCreateMemoryBuffer(bufferSize, &buffer), "MFCreateMemoryBuffer");

        BYTE* dst = nullptr;
        DWORD maxLength = 0;
        DWORD currentLength = 0;
        ThrowIfFailed(buffer->Lock(&dst, &maxLength, &currentLength), "Lock(NV12 buffer)");

        try
        {
            BgraToNv12(bgra.data(), videoInfo.width, videoInfo.height, dst);
        }
        catch (...)
        {
            buffer->Unlock();
            throw;
        }

        ThrowIfFailed(buffer->Unlock(), "Unlock(NV12 buffer)");
        ThrowIfFailed(buffer->SetCurrentLength(bufferSize), "SetCurrentLength(NV12 buffer)");

        ComPtr<IMFSample> sample;
        ThrowIfFailed(MFCreateSample(&sample), "MFCreateSample");
        ThrowIfFailed(sample->AddBuffer(buffer.Get()), "AddBuffer(output sample)");
        ThrowIfFailed(sample->SetSampleTime(sampleTime), "SetSampleTime");
        ThrowIfFailed(sample->SetSampleDuration(sampleDuration), "SetSampleDuration");

        return sample;
    }
}

int wmain(int argc, wchar_t* argv[])
{
    if (argc != 4)
    {
        std::wcerr << L"Usage: OverlayMp4.exe <input.mp4> <overlayImage.png> <output.mp4>" << std::endl;
        return 1;
    }

    const std::wstring inputPath = argv[1];
    const std::wstring imagePath = argv[2];
    const std::wstring outputPath = argv[3];

    try
    {
        if (_wcsicmp(inputPath.c_str(), outputPath.c_str()) == 0)
        {
            throw std::runtime_error("Input and output paths must be different.");
        }

        ScopedMf mf;
        ScopedGdiplus gdiplus;

        ComPtr<IMFAttributes> readerAttributes;
        ThrowIfFailed(MFCreateAttributes(&readerAttributes, 1), "MFCreateAttributes(reader)");
        ThrowIfFailed(
            readerAttributes->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE),
            "SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING)");

        ComPtr<IMFSourceReader> reader;
        ThrowIfFailed(
            MFCreateSourceReaderFromURL(inputPath.c_str(), readerAttributes.Get(), &reader),
            "MFCreateSourceReaderFromURL");

        VideoFormatInfo videoInfo = ConfigureSourceReader(reader.Get());

        if ((videoInfo.width % 2) != 0 || (videoInfo.height % 2) != 0)
        {
            throw std::runtime_error(
                "This sample requires even video width and height because NV12 is 4:2:0.");
        }

        Gdiplus::Image overlayImage(imagePath.c_str());
        ThrowIfGdiplusError(overlayImage.GetLastStatus(), "Load overlay image");

        DWORD videoStreamIndex = 0;
        ComPtr<IMFSinkWriter> writer =
            CreateSinkWriter(outputPath, videoInfo, &videoStreamIndex);

        std::vector<BYTE> bgra;
        LONGLONG firstTimestamp = -1;
        unsigned long long frameCount = 0;

        while (true)
        {
            DWORD flags = 0;
            LONGLONG timestamp = 0;
            ComPtr<IMFSample> inputSample;

            ThrowIfFailed(
                reader->ReadSample(
                    MF_SOURCE_READER_FIRST_VIDEO_STREAM,
                    0,
                    nullptr,
                    &flags,
                    &timestamp,
                    &inputSample),
                "ReadSample(video)");

            if ((flags & MF_SOURCE_READERF_CURRENTMEDIATYPECHANGED) != 0)
            {
                throw std::runtime_error("Dynamic video format change is not supported in this sample.");
            }

            if ((flags & MF_SOURCE_READERF_NATIVEMEDIATYPECHANGED) != 0)
            {
                throw std::runtime_error("Native video format change is not supported in this sample.");
            }

            // A stream tick marks a gap in the input; forward it so the writer keeps the timeline intact.
            if ((flags & MF_SOURCE_READERF_STREAMTICK) != 0)
            {
                if (firstTimestamp < 0)
                {
                    firstTimestamp = timestamp;
                }

                ThrowIfFailed(
                    writer->SendStreamTick(videoStreamIndex, timestamp - firstTimestamp),
                    "SendStreamTick");
            }

            if (inputSample)
            {
                if (firstTimestamp < 0)
                {
                    firstTimestamp = timestamp;
                }

                LONGLONG duration = 0;
                if (FAILED(inputSample->GetSampleDuration(&duration)) || duration <= 0)
                {
                    duration = videoInfo.defaultFrameDuration;
                }

                CopySampleToTopDownBgra(inputSample.Get(), videoInfo, bgra);
                DrawOverlay(bgra, videoInfo.width, videoInfo.height, overlayImage);

                ComPtr<IMFSample> outputSample =
                    CreateNv12Sample(bgra, videoInfo, timestamp - firstTimestamp, duration);

                ThrowIfFailed(
                    writer->WriteSample(videoStreamIndex, outputSample.Get()),
                    "WriteSample(video)");

                ++frameCount;
            }

            if ((flags & MF_SOURCE_READERF_ENDOFSTREAM) != 0)
            {
                break;
            }
        }

        ThrowIfFailed(writer->Finalize(), "Finalize");

        std::wcout
            << L"Done. frames=" << frameCount
            << L", output=" << outputPath
            << std::endl;

        return 0;
    }
    catch (const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
        return 1;
    }
}

7. Points to Keep in Mind When Reading This Implementation

7.1 The Format That Is Easy to Draw on and the Format the Encoder Accepts Are Different

This sample uses the following flow.

  • Source Reader output: RGB32
  • Drawing: GDI+
  • Sink Writer input: NV12

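The reason is simple: RGB formats are easy to work with when overlaying text and PNGs because every pixel is a self-contained B, G, R, A quadruple, while NV12 (a full-resolution Y plane followed by a half-resolution interleaved UV plane) is a format that H.264 encoders commonly accept as input.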

When reading the implementation, it becomes easier to follow if you split it into a “drawing stage” and a “prepare-for-encoding stage.”
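
To make the "prepare-for-encoding stage" concrete, here is a minimal sketch of a BGRA to NV12 conversion. It is not the sample's CreateNv12Sample; the function name BgraToNv12Sketch is hypothetical, and the BT.601 limited-range integer coefficients are an assumption (HD content would often use BT.709 instead). It also assumes a tightly packed, top-down buffer with even width and height.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts tightly packed, top-down BGRA to NV12.
// Assumption: BT.601 limited-range coefficients in 8-bit fixed point.
std::vector<std::uint8_t> BgraToNv12Sketch(
    const std::vector<std::uint8_t>& bgra, int width, int height)
{
    // NV12 is 4:2:0, so one U/V pair covers a 2x2 block of pixels.
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    assert(bgra.size() == static_cast<std::size_t>(width) * height * 4);

    const std::size_t lumaSize = static_cast<std::size_t>(width) * height;
    std::vector<std::uint8_t> nv12(lumaSize + lumaSize / 2);

    std::uint8_t* yPlane = nv12.data();
    std::uint8_t* uvPlane = nv12.data() + lumaSize;  // interleaved U, V

    for (int y = 0; y < height; y += 2)
    {
        for (int x = 0; x < width; x += 2)
        {
            int sumR = 0, sumG = 0, sumB = 0;

            // Luma is written per pixel; chroma is averaged over the 2x2 block.
            for (int dy = 0; dy < 2; ++dy)
            {
                for (int dx = 0; dx < 2; ++dx)
                {
                    const std::size_t pixel =
                        static_cast<std::size_t>(y + dy) * width + (x + dx);
                    const int b = bgra[pixel * 4 + 0];
                    const int g = bgra[pixel * 4 + 1];
                    const int r = bgra[pixel * 4 + 2];

                    yPlane[pixel] = static_cast<std::uint8_t>(
                        ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16);

                    sumR += r;
                    sumG += g;
                    sumB += b;
                }
            }

            const int r = (sumR + 2) / 4;
            const int g = (sumG + 2) / 4;
            const int b = (sumB + 2) / 4;

            // 32896 = 128 * 256 + 128: the +128 chroma offset plus rounding,
            // folded in before the shift so the value never goes negative.
            const std::size_t uv = static_cast<std::size_t>(y / 2) * width + x;
            uvPlane[uv + 0] = static_cast<std::uint8_t>(
                (-38 * r - 74 * g + 112 * b + 32896) >> 8);
            uvPlane[uv + 1] = static_cast<std::uint8_t>(
                (112 * r - 94 * g - 18 * b + 32896) >> 8);
        }
    }

    return nv12;
}
```

The output buffer is width * height * 3 / 2 bytes, which is why the even-dimension check in main matters: an odd width or height would leave a partial 2x2 block with no well-defined chroma sample.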

7.2 Stride and Vertical Orientation Are Normalized Before Drawing

Video frames are not necessarily laid out in memory the way they appear on screen.

  • The stride may not match width * 4
  • The image may be stored upside down
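  • IMF2DBuffer and IMFMediaBuffer are handled slightly differently: IMF2DBuffer::Lock2D returns a pointer to the first scan line plus a signed pitch, whereas IMFMediaBuffer::Lock returns one contiguous block whose stride you have to determine from the media type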

For that reason, this code first normalizes into a top-down BGRA buffer before drawing. Getting this sorted out up front lets the drawing code stay quite straightforward.
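
The normalization step can be sketched independently of Media Foundation. The helper below is hypothetical (it is not the sample's CopySampleToTopDownBgra) and assumes you already have a pointer to the first visible scan line and a signed pitch in bytes, which is the shape of data IMF2DBuffer::Lock2D hands back. A negative pitch means the image is stored bottom-up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a 32-bit-per-pixel frame into a tightly
// packed, top-down buffer.
//   scanline0  - pointer to the first *visible* row (top of the picture)
//   pitchBytes - signed distance in bytes to the next visible row;
//                negative when the image is stored bottom-up in memory
std::vector<std::uint8_t> CopyToTopDownTight(
    const std::uint8_t* scanline0,
    std::ptrdiff_t pitchBytes,
    int width,
    int height)
{
    assert(scanline0 != nullptr && width > 0 && height > 0);

    const std::size_t rowBytes = static_cast<std::size_t>(width) * 4;
    std::vector<std::uint8_t> out(rowBytes * static_cast<std::size_t>(height));

    for (int y = 0; y < height; ++y)
    {
        // Walking by the signed pitch handles both padding and orientation.
        const std::uint8_t* src =
            scanline0 + static_cast<std::ptrdiff_t>(y) * pitchBytes;
        std::memcpy(out.data() + rowBytes * y, src, rowBytes);
    }

    return out;
}
```

After this copy, the drawing code can assume row y starts at offset y * width * 4 and never has to think about padding or orientation again.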

7.3 With ReadSample, Check the Flags and the sample, Not Just the HRESULT

ReadSample can return S_OK and still leave the sample pointer null. Typical cases are:

  • MF_SOURCE_READERF_STREAMTICK
  • MF_SOURCE_READERF_ENDOFSTREAM
  • other stream events

So the loop needs to look at all three together: the HRESULT, the flags, and the inputSample. In particular, if you miss STREAMTICK or ENDOFSTREAM, downstream timeline handling tends to break.

7.4 It Is Safer to Carry Over Timestamps and Durations from the Input

Timestamps are in 100-ns units. ReadSample hands back the timestamp directly, but the duration has to be retrieved separately from the IMFSample with GetSampleDuration.

Rather than assuming a fixed fps and adding a hard-coded increment each time, it is more robust to carry over the input sample’s timestamp and duration as much as possible. This sample does that: it rebases each timestamp against the first one so the output starts at 0, keeps the input duration, and falls back to a default value computed from the fps only when the duration cannot be obtained.
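
The timing rule described here comes down to a few lines of arithmetic. The sketch below uses hypothetical names (SampleTiming, DefaultFrameDuration, MakeOutputTiming) and plain integers instead of Media Foundation types. How the sample actually computes videoInfo.defaultFrameDuration is not shown in this section, so the round-to-nearest behavior is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Media Foundation expresses times in 100-nanosecond units.
constexpr std::int64_t kHundredNsPerSecond = 10000000;

struct SampleTiming
{
    std::int64_t time;      // presentation time, 100-ns units
    std::int64_t duration;  // sample duration, 100-ns units
};

// Hypothetical helper: frame duration for a rational frame rate,
// rounded to the nearest 100-ns unit.
std::int64_t DefaultFrameDuration(std::uint32_t fpsNumerator,
                                  std::uint32_t fpsDenominator)
{
    assert(fpsNumerator != 0 && fpsDenominator != 0);
    const std::int64_t num = fpsNumerator;
    const std::int64_t den = fpsDenominator;
    return (kHundredNsPerSecond * den + num / 2) / num;
}

// Hypothetical helper: rebases the input time against the first sample and
// keeps the input duration unless it is missing or non-positive.
SampleTiming MakeOutputTiming(std::int64_t inputTime,
                              std::int64_t inputDuration,
                              std::int64_t firstInputTime,
                              std::int64_t fallbackDuration)
{
    SampleTiming timing;
    timing.time = inputTime - firstInputTime;
    timing.duration = inputDuration > 0 ? inputDuration : fallbackDuration;
    return timing;
}
```

For example, a 30000/1001 fps stream gives a fallback of 333667 units per frame, while any sample that reports its own duration keeps that value, so variable-frame-rate input is not forced onto a fixed grid.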

7.5 GDI+ Is Lightweight to Adopt, but There Is a Next Step for Long or High-Resolution Content

GDI+ is very well suited to a single-file sample, but for workloads processing long videos or lots of 4K content, D3D11 + Direct2D + DirectWrite can be the better choice.

  • First get the whole pipeline working with GDI+
  • Then, if needed, replace it with Direct2D / DirectWrite
  • Move color conversion to a Video Processor MFT or the GPU side

A staged progression like this lets you extend the system without breaking the design.

7.6 This Sample Is Limited to Video Only

If you also pile audio into the same article, the focus gets diluted. For that reason, this sample concentrates on burning an image and text into the video frames, and the output is a video-only MP4.

In practice, the natural next step is to split the work as follows:

  • video: Source Reader -> composite -> Sink Writer
  • audio: remux it while still compressed

This keeps the configuration easy to manage.

8. If the “Given Video Data” Is an In-Memory MP4 Byte Sequence Rather Than a File

The code in this article uses MFCreateSourceReaderFromURL, so the input is a file path.

But if the requirement is “do the same thing to mp4 bytes received from an API,” the thinking does not change. Only the entry point changes.

  • Prepare an IStream or a custom stream
  • Hand it to the Source Reader as an IMFByteStream
  • From there on it is the same: RGB32 -> draw -> NV12 -> Sink Writer

In other words, the essence is not how the video data is held, but how you draw onto each decoded frame.

9. Growing It for Production

9.1 Add Audio Remuxing

The most practical first extension is to preserve the audio as is. Re-encode only the video and write the audio back in its original format while it is still compressed; this meets the requirement without adding much code.

9.2 Insert a Video Processor MFT

This sample converts BGRA -> NV12 by hand to stay self-contained in a single file, but in production, inserting a Video Processor MFT is also a very strong option.

With the Video Processor MFT, it becomes easier to handle

  • color-space conversion
  • resizing
  • deinterlacing
  • frame-rate conversion

all in one place.

9.3 Replace GDI+ with Direct2D / DirectWrite

For overlays such as logo images, subtitles, and timestamps, GDI+ is often sufficient, but if you need to squeeze out performance, Direct2D / DirectWrite has the edge.

In particular, if you have conditions such as

  • high resolution
  • long durations
  • large numbers of videos
  • a future move toward a GPU path

then a design built on D3D11 / DXGI surfaces becomes worth considering.

9.4 Consider a Custom MFT Once It Becomes a “Video Effect You Want to Reuse”

In Media Foundation, effects can be implemented as an IMFTransform. So if you want to reuse the same overlay processing across multiple apps or pipelines, a custom MFT is a clean choice.

As a first implementation, however, a custom MFT adds cost:

  • you must satisfy the IMFTransform contract
  • there is more input/output media-type management
  • registration and debugging get harder

So in practice it is usually easier to first get things working correctly with Source Reader + compositing + Sink Writer, and extract an MFT when you actually need one.

10. Summary

When burning images or text into every frame of an MP4 with Media Foundation, breaking the problem into these four parts gives you a clear view.

  • Extract: IMFSourceReader
  • Draw: GDI+ or Direct2D / DirectWrite
  • Convert into a format the encoder accepts: NV12, etc.
  • Write back: IMFSinkWriter

And if what you want is “a sample you can paste entirely into one .cpp and run as is,” then a configuration like the one in this article,

Source Reader -> RGB32 -> image + HelloWorld with GDI+ -> BGRA to NV12 -> Sink Writer

is a natural fit.

If you grow it for production next, thinking in this order keeps things from falling apart.

  1. Add audio remuxing
  2. Replace GDI+ with Direct2D / DirectWrite
  3. Move the NV12 conversion to a Video Processor MFT or the GPU side
  4. Move to a D3D11 surface-based design for long, high-resolution content
  5. Extract a custom MFT if you need reusability

If you try to do everything at once, COM, strides, color spaces, and surface management all hit you at the same time. Getting it working stage by stage first, then strengthening only the parts you need later, makes both the design and the debugging considerably easier.

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
