How to Convert YUV to RGB with Media Foundation

Tags: Media Foundation, C++, Windows Development, Video Processing, YUV

You want to pull a frame out of a video and save it as a PNG, pass it to WIC or GDI, or display it in your UI. In those situations, the application side wants RGB pixel data.

However, the frames that come out of a Media Foundation decoder are, quite commonly, YUV-family formats such as NV12 or YUY2. If you treat the raw byte stream as an image as-is, you get a rather sad picture: broken colors, banding, or a strangely greenish tint.

In an earlier post, "What Is Media Foundation - Why You Start to See the Face of COM and the Windows Media APIs", we covered the big picture, and in "How to Extract a Still Image from an MP4 at a Specific Time with Media Foundation - A Single-File Version You Can Paste Straight into a .cpp", we covered still-image extraction. This time we tackle the step that sits in between: the YUV -> RGB conversion itself.

In this article, we separate and organize the following two patterns.

  • Pattern A: let IMFSourceReader automatically take the frames all the way to RGB32
  • Pattern B: receive NV12 / YUY2 and convert to RGB yourself

The goal is not to memorize API names. It is to be able to picture, in your head, where in Media Foundation the YUV appears and where it turns into RGB.

The code that appears in this article is published on GitHub as a complete sample set (C++ code for Pattern A / Pattern B, CMake configuration, and tests for the pixel conversion).

media-foundation-yuv-to-rgb-conversion-patterns - komurasoft-blog-samples (GitHub)

1. The Conclusion First

Summarizing the conclusions up front:

  • For extracting a few still images or generating thumbnails, the easiest route is to enable MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING and request MFVideoFormat_RGB32
  • However, this automatic conversion is software processing and is not optimized for real-time playback
  • If you are going to write your own conversion, the shortest path is to properly understand NV12 and YUY2 first
  • YUV -> RGB is not just “multiply by three coefficients and you are done” — in practice, subsampling, range, matrix, and stride all come into play
  • The Media Foundation documentation broadly uses the term YUV, but for digital video it is easier to read if you assume it effectively means Y’CbCr
  • In practice, the most common causes of broken colors are ignoring MF_MT_YUV_MATRIX and MF_MT_VIDEO_NOMINAL_RANGE, and assuming the stride is width * bytesPerPixel

In short: if you want the easy path, have the Source Reader output RGB32. If you need high-volume processing or control over color, receive the frames as YUV and convert them yourself. Those are the two choices.

2. Start with a Picture

It is faster to first look at a diagram of what happens inside Media Foundation.

MP4 / H.264 / HEVC
  -> decoder
  -> YUV frames such as NV12 / YUY2 / YV12
       -> Pattern A: Source Reader video processing -> RGB32
       -> Pattern B: Your own conversion code -> BGRA / RGB

If the contents of the video file are a compressed format such as H.264 or HEVC, the decoder first turns them back into uncompressed frames. These uncompressed frames are not necessarily RGB. In fact, in the Windows video world, YUV-family formats are the norm.

So when the application wants RGB, you choose one of the following.

  1. Have Media Foundation take the frames all the way to RGB32
  2. Receive YUV and turn it into RGB with your own code

This article is exactly about that fork in the road.

3. Sorting Out the Relationship Between YUV and RGB First

3.1. It Says YUV, but It Is Really About Y’CbCr

Windows API names and documentation broadly use the term YUV. In the context of digital video, however, you can read U as Cb and V as Cr with essentially no problem.

Roughly speaking:

  • Y is the brightness-oriented component
  • U / V are the color-difference components
  • RGB is each pixel directly carrying Red / Green / Blue

That is the relationship.

The human eye is more sensitive to fine detail in brightness than in color. So for video, a design that keeps Y at full detail and U/V somewhat coarser pays off. This is why YUV-family formats are so widely used.

3.2. 4:4:4 / 4:2:2 / 4:2:0 Is “How Much the Color Is Thinned Out”

This is the key to reading YUV.

Notation | Meaning                                  | Typical examples
4:4:4    | Each pixel has its own Y/U/V             | AYUV, I444
4:2:2    | 2 pixels horizontally share one U/V pair | YUY2, UYVY, I422
4:2:0    | A 2x2 pixel block shares one U/V pair    | NV12, YV12, I420
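
To make the table concrete, here is a small sketch that computes the size of one 8-bit frame for each scheme. The names ChromaSubsampling and UnpaddedFrameBytes are made up for this illustration and are not Media Foundation APIs. The sketch assumes three components with no alpha channel and no row padding, so real buffers (and AYUV, which carries alpha) can be larger.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper types for illustration; not Media Foundation APIs.
enum class ChromaSubsampling { k444, k422, k420 };

// Returns the byte size of one 8-bit Y/U/V frame with no row padding.
// Assumes an even width for 4:2:2 and an even width and height for 4:2:0.
inline std::size_t UnpaddedFrameBytes(
    std::uint32_t width,
    std::uint32_t height,
    ChromaSubsampling subsampling)
{
    const std::size_t luma = static_cast<std::size_t>(width) * height;

    switch (subsampling)
    {
    case ChromaSubsampling::k444:
        return luma * 3;               // Y, U, and V all at full resolution

    case ChromaSubsampling::k422:
        assert(width % 2 == 0);
        return luma + 2 * (luma / 2);  // U and V at half horizontal resolution

    case ChromaSubsampling::k420:
        assert(width % 2 == 0 && height % 2 == 0);
        return luma + 2 * (luma / 4);  // U and V halved in both directions
    }
    return 0;
}
```

For a 1920x1080 frame this gives 6,220,800 bytes for 4:4:4, 4,147,200 for 4:2:2, and 3,110,400 for 4:2:0, which is half the size of the 4:4:4 frame.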

It helps a lot to first look at the shape of the two formats you encounter most in practice.

NV12 (4:2:0, planar)

Y plane
Y Y Y Y
Y Y Y Y
Y Y Y Y
Y Y Y Y

UV plane
U V U V
U V U V

In NV12, the 4 pixels of a 2x2 block share a single U/V pair. Y exists for each individual pixel.

YUY2 (4:2:2, packed)

bytes:
Y0 U0 Y1 V0   Y2 U2 Y3 V2   ...

In YUY2, 2 horizontal pixels share one U/V pair. Y0 and Y1 are separate, but U0 and V0 are shared.

At this point you can already see that YUV -> RGB is not a simple one-pixel-to-one-pixel substitution. First you have to decide which pixels each shared U/V pair applies to.
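
The two layouts above can be expressed as a lookup of which bytes a given pixel reads. The following is a minimal sketch under stated assumptions: the struct and function names are hypothetical, the stride is positive, and for NV12 the UV plane is assumed to start immediately after the last Y row with the same stride.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte offsets of the Y, U, and V samples that one pixel reads.
// Hypothetical struct for illustration.
struct SampleOffsets
{
    std::size_t y;
    std::size_t u;
    std::size_t v;
};

// NV12: a Y plane of `height` rows followed by an interleaved UV plane.
// Assumes both planes share one positive stride and that no extra rows
// sit between them.
inline SampleOffsets Nv12Offsets(
    std::uint32_t x,
    std::uint32_t y,
    std::size_t stride,
    std::uint32_t height)
{
    assert(x < stride);
    assert(y < height);

    const std::size_t uvPlane = stride * height;
    const std::size_t uvRow = uvPlane + stride * (y / 2);   // one UV row per 2 Y rows
    const std::size_t pair = static_cast<std::size_t>(x / 2) * 2;

    return { stride * y + x, uvRow + pair, uvRow + pair + 1 };
}

// YUY2: each 4-byte group Y0 U Y1 V covers two horizontal pixels.
inline SampleOffsets Yuy2Offsets(
    std::uint32_t x,
    std::uint32_t y,
    std::size_t stride)
{
    const std::size_t group = stride * y + static_cast<std::size_t>(x / 2) * 4;
    return { group + (x % 2) * 2, group + 1, group + 3 };
}
```

In a 4x4 NV12 frame with a stride of 4, pixels (2,2) and (3,3) have different Y offsets but read the same U/V bytes, which is exactly the 2x2 sharing shown in the diagram.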

3.3. YUV -> RGB Is “Color Space Conversion + Sampling Conversion”

If you look at Media Foundation’s Extended Color Information, strictly correct color conversion has quite a few stages: inverse quantization, chroma upsampling, YUV -> RGB, the transfer function, primaries conversion, and finally quantization.

That said, as practical code for 8-bit SDR, it is easiest to understand if you split it into the following three layers.

  1. Undo the subsampling: expand the 4:2:0 or 4:2:2 U/V so that every pixel can reference a value
  2. Undo the range: video Y normally uses 16..235 and U/V use 16..240, so undo that scaling
  3. Apply the matrix: convert to RGB using coefficients such as BT.601 or BT.709

In other words, in practical terms, YUV -> RGB conversion is the process of deciding:

  • which U/V is the color for that pixel
  • which coefficients to use to turn that Y/U/V back into RGB

3.4. Treat BT.601 and BT.709 Carelessly and Colors Drift Subtly

The Media Foundation documentation describes the relationship as BT.601 being preferred for SDTV and below, and BT.709 for video beyond SD.

However, silently guessing “the resolution is large, so it must be 709” is not a great idea. Color drift does not crash, so it easily slips into production unnoticed.

Media Foundation can carry color space information as media type attributes. At minimum, look at these two:

  • MF_MT_YUV_MATRIX
  • MF_MT_VIDEO_NOMINAL_RANGE

Looking at these two and explicitly accepting only the combinations your code supports makes silent accidents much less likely later.

3.5. The First Formula to Learn Is the BT.601 Limited-Range Version

The canonical 8-bit BT.601 formula looks like this.

C = Y - 16
D = U - 128
E = V - 128

R = clip(1.164383 * C + 1.596027 * E)
G = clip(1.164383 * C - 0.391762 * D - 0.812968 * E)
B = clip(1.164383 * C + 2.017232 * D)

For BT.709 the coefficients change. We will show that in code later.

What matters here is not memorizing the coefficients but the structure: subtract the black level 16 from Y, and view U/V as centered on 128.
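
If you want to see where these numbers come from, the sketch below derives them from the luma weights Kr and Kb of each standard (0.299 / 0.114 for BT.601, 0.2126 / 0.0722 for BT.709). The struct and function names are hypothetical and exist only for this illustration. The derivation assumes 8-bit limited range, where Y spans 219 code values and U/V span 224.

```cpp
#include <cassert>

// Coefficients of the 8-bit limited-range YUV -> RGB formula.
// Hypothetical struct for illustration.
struct YuvToRgbCoefficients
{
    double y;   // scale applied to C = Y - 16
    double rv;  // R = y * C + rv * E
    double gu;  // G = y * C - gu * D - gv * E
    double gv;
    double bu;  // B = y * C + bu * D
};

// Derives the coefficients from the luma weights Kr and Kb.
// Y covers 16..235 (219 steps) and U/V cover 16..240 (224 steps),
// so each is rescaled to the 255-step RGB range.
inline YuvToRgbCoefficients DeriveLimitedRangeCoefficients(double kr, double kb)
{
    assert(kr > 0.0 && kb > 0.0 && kr + kb < 1.0);

    const double kg = 1.0 - kr - kb;
    const double lumaScale = 255.0 / 219.0;
    const double chromaScale = 255.0 / 224.0;

    YuvToRgbCoefficients c{};
    c.y = lumaScale;
    c.rv = chromaScale * 2.0 * (1.0 - kr);
    c.gu = chromaScale * 2.0 * (1.0 - kb) * kb / kg;
    c.gv = chromaScale * 2.0 * (1.0 - kr) * kr / kg;
    c.bu = chromaScale * 2.0 * (1.0 - kb);
    return c;
}
```

Calling DeriveLimitedRangeCoefficients(0.299, 0.114) reproduces 1.164383, 1.596027, 0.391762, 0.812968, and 2.017232 to six decimal places, so the only thing that changes between BT.601 and BT.709 is the pair of weights you pass in.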

4. Pattern A: Let Media Foundation Convert Automatically

4.1. When This Is a Good Fit

This approach is well suited to situations like the following.

  • You want to extract a single still image from an MP4
  • You want to create a few thumbnails
  • You want an RGB image to hand to WIC
  • Batch or tooling use is fine; this is not real-time playback

The Source Reader has a feature that performs limited video processing of YUV -> RGB32 when you use MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING.

However, as Microsoft Learn also notes, this is software processing and not optimized for playback. If you want to process hundreds of frames per second, this is not the right tool.

4.2. What You Set to Get RGB32 Out

The flow is quite straightforward.

  1. In the attributes passed to MFCreateSourceReaderFromURL, set MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING = TRUE
  2. Select the video stream
  3. Request MFMediaType_Video / MFVideoFormat_RGB32 via SetCurrentMediaType
  4. Read samples with ReadSample

That alone makes the limited video processing inserted behind the decoder do the YUV -> RGB32 for you.

4.3. Code

The following code assumes CoInitializeEx and MFStartup have already been done. A minimal version looks roughly like this.

#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>
#include <wrl/client.h>

#pragma comment(lib, "mfplat.lib")
#pragma comment(lib, "mfreadwrite.lib")
#pragma comment(lib, "mfuuid.lib")
#pragma comment(lib, "ole32.lib")

using Microsoft::WRL::ComPtr;

HRESULT CreateSourceReaderWithAutoRgb(
    const wchar_t* path,
    IMFSourceReader** ppReader)
{
    if (!path || !ppReader) return E_POINTER;
    *ppReader = nullptr;

    ComPtr<IMFAttributes> attrs;
    HRESULT hr = MFCreateAttributes(&attrs, 2);
    if (FAILED(hr)) return hr;

    hr = attrs->SetUINT32(MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING, TRUE);
    if (FAILED(hr)) return hr;

    // Hold the reader in a ComPtr so it is released if a later step fails.
    ComPtr<IMFSourceReader> reader;
    hr = MFCreateSourceReaderFromURL(path, attrs.Get(), &reader);
    if (FAILED(hr)) return hr;

    hr = reader->SetStreamSelection(MF_SOURCE_READER_ALL_STREAMS, FALSE);
    if (FAILED(hr)) return hr;

    hr = reader->SetStreamSelection(MF_SOURCE_READER_FIRST_VIDEO_STREAM, TRUE);
    if (FAILED(hr)) return hr;

    ComPtr<IMFMediaType> outType;
    hr = MFCreateMediaType(&outType);
    if (FAILED(hr)) return hr;

    hr = outType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    if (FAILED(hr)) return hr;

    hr = outType->SetGUID(MF_MT_SUBTYPE, MFVideoFormat_RGB32);
    if (FAILED(hr)) return hr;

    hr = reader->SetCurrentMediaType(
        MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        nullptr,
        outType.Get());
    if (FAILED(hr)) return hr;

    // Transfer ownership to the caller only after every step has succeeded.
    *ppReader = reader.Detach();
    return S_OK;
}

HRESULT ReadOneRgb32Sample(
    IMFSourceReader* reader,
    IMFSample** ppSample,
    LONGLONG* pTimestamp100ns)
{
    if (!reader || !ppSample) return E_POINTER;
    *ppSample = nullptr;
    if (pTimestamp100ns) *pTimestamp100ns = 0;

    DWORD streamIndex = 0;
    DWORD flags = 0;
    LONGLONG timestamp = 0;

    HRESULT hr = reader->ReadSample(
        MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        0,
        &streamIndex,
        &flags,
        &timestamp,
        ppSample);

    if (FAILED(hr)) return hr;
    if (flags & MF_SOURCE_READERF_ENDOFSTREAM) return MF_E_END_OF_STREAM;
    if (*ppSample == nullptr) return MF_E_INVALID_STREAM_DATA;

    if (pTimestamp100ns) *pTimestamp100ns = timestamp;
    return S_OK;
}

After this, calling GetCurrentMediaType lets you check the actual output size and stride.

4.4. Strengths of This Approach

The good thing about this approach is that it gets you to a correct picture quickly.

  • You do not have to write the 4:2:0 / 4:2:2 expansion yourself
  • It hides much of the hassle of matrix handling / deinterlacing
  • The output is easy to hand to WIC or GDI
  • For processing a handful of frames, it is perfectly practical

For still-image extraction tools, starting here is quite natural.

4.5. But There Are Pitfalls Too

This automatic conversion has the following characteristics.

Item                    | Details
Conversion target       | Basically RGB32
Implementation          | Software processing
Suited for              | Small numbers of frames, thumbnails, offline processing
Not suited for          | D3D-based real-time rendering, high-volume frame processing
Incompatible attributes | MF_SOURCE_READER_D3D_MANAGER, MF_READWRITE_DISABLE_CONVERTERS

And one more important thing: the handling of the 4th byte in RGB32. In memory, Windows RGB32 is laid out as Blue / Green / Red / Alpha or Don’t Care. It is not ARGB32. If you pass it to WIC as 32bppBGRA, it is safer to fill the 4th byte with 0xFF to make it opaque.

We touched on this as an easy thing to trip over in the previous still-image extraction article as well.
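
Making the 4th byte opaque can be sketched as follows. ForceOpaqueAlpha is a hypothetical helper name, and the sketch uses std::uint8_t instead of BYTE so that it does not depend on Windows headers. It assumes a top-down buffer with a positive stride in bytes. For a bottom-up buffer you would need signed stride arithmetic instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sets the 4th byte of every pixel to 0xFF in a 32-bit B,G,R,X buffer.
// Assumes a top-down layout with a positive stride given in bytes.
inline void ForceOpaqueAlpha(
    std::uint8_t* pixels,
    std::uint32_t width,
    std::uint32_t height,
    std::size_t strideBytes)
{
    assert(pixels != nullptr);
    assert(strideBytes >= static_cast<std::size_t>(width) * 4);

    for (std::uint32_t y = 0; y < height; ++y)
    {
        // Move between rows with the stride, not with width * 4.
        std::uint8_t* row = pixels + strideBytes * y;
        for (std::uint32_t x = 0; x < width; ++x)
        {
            row[static_cast<std::size_t>(x) * 4 + 3] = 0xFF;
        }
    }
}
```

Only the alpha position of each visible pixel is touched. Padding bytes at the end of each row are left alone.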

5. Pattern B: Write the Conversion Yourself

5.1. When This Is a Good Fit

Doing the conversion yourself is a good fit in cases like these.

  • You process a large number of frames and want to optimize the conversion yourself
  • You want to feed NV12 straight to the GPU or SIMD code
  • You want to handle BT.601 / BT.709 / range explicitly
  • You want to produce output formats other than RGB32
  • The Source Reader’s limited automatic conversion is not enough

You could call it the pattern where you take on responsibility for throughput and color in exchange for freedom.

5.2. Overall Flow of Manual Conversion

The steps are as follows.

  1. Set the Source Reader output to NV12 or YUY2
  2. Get the actual subtype and attributes via GetCurrentMediaType
  3. Check MF_MT_FRAME_SIZE, MF_MT_DEFAULT_STRIDE, MF_MT_YUV_MATRIX, and MF_MT_VIDEO_NOMINAL_RANGE
  4. Extract the buffer from the sample and lock it
  5. Determine which Y/U/V each pixel references
  6. Apply the matrix and write BGRA

The code in this article is limited to 8-bit SDR / progressive / NV12 or YUY2 / limited range. Narrowing the assumptions here is not laziness — it actually matters. If you write a YUV conversion that “accepts everything for now,” it tends to break colors silently.

5.3. First, Specify the Output Media Type Explicitly

First, tell the Source Reader “please output the YUV as is.” Again, this assumes CoInitializeEx / MFStartup have been done.

#include <windows.h>
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <mferror.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

HRESULT ConfigureSourceReaderForSubtype(
    IMFSourceReader* reader,
    REFGUID subtype)
{
    if (!reader) return E_POINTER;

    HRESULT hr = reader->SetStreamSelection(MF_SOURCE_READER_ALL_STREAMS, FALSE);
    if (FAILED(hr)) return hr;

    hr = reader->SetStreamSelection(MF_SOURCE_READER_FIRST_VIDEO_STREAM, TRUE);
    if (FAILED(hr)) return hr;

    ComPtr<IMFMediaType> outType;
    hr = MFCreateMediaType(&outType);
    if (FAILED(hr)) return hr;

    hr = outType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Video);
    if (FAILED(hr)) return hr;

    hr = outType->SetGUID(MF_MT_SUBTYPE, subtype);
    if (FAILED(hr)) return hr;

    hr = reader->SetCurrentMediaType(
        MF_SOURCE_READER_FIRST_VIDEO_STREAM,
        nullptr,
        outType.Get());
    if (FAILED(hr)) return hr;

    return S_OK;
}

Here you pass MFVideoFormat_NV12 or MFVideoFormat_YUY2 as subtype.

Note that the requested subtype is not guaranteed to be accepted as is. Check what actually comes out with GetCurrentMediaType.

5.4. Before Converting, Accept Only the Color Information You Support

For a manual conversion, first pull the minimum information from the media type. The sample in this article accepts only NV12 / YUY2, and lets through only BT.601 or BT.709 for the matrix and only MFNominalRange_16_235 for the range.

#include <vector>

struct DecodedFrameInfo
{
    GUID subtype = GUID_NULL;
    UINT32 width = 0;
    UINT32 height = 0;
    LONG defaultStride = 0;
    MFVideoTransferMatrix matrix = MFVideoTransferMatrix_Unknown;
    MFNominalRange nominalRange = MFNominalRange_Unknown;
};

HRESULT GetDefaultStride(
    IMFMediaType* pType,
    LONG* plStride)
{
    if (!pType || !plStride) return E_POINTER;

    LONG stride = 0;
    HRESULT hr = pType->GetUINT32(
        MF_MT_DEFAULT_STRIDE,
        reinterpret_cast<UINT32*>(&stride));

    if (FAILED(hr))
    {
        GUID subtype = GUID_NULL;
        UINT32 width = 0;
        UINT32 height = 0;

        hr = pType->GetGUID(MF_MT_SUBTYPE, &subtype);
        if (FAILED(hr)) return hr;

        hr = MFGetAttributeSize(pType, MF_MT_FRAME_SIZE, &width, &height);
        if (FAILED(hr)) return hr;

        hr = MFGetStrideForBitmapInfoHeader(subtype.Data1, width, &stride);
        if (FAILED(hr)) return hr;

        hr = pType->SetUINT32(MF_MT_DEFAULT_STRIDE, static_cast<UINT32>(stride));
        if (FAILED(hr)) return hr;
    }

    *plStride = stride;
    return S_OK;
}

HRESULT GetStrictDecodedFrameInfo(
    IMFMediaType* pType,
    DecodedFrameInfo* pInfo)
{
    if (!pType || !pInfo) return E_POINTER;

    HRESULT hr = pType->GetGUID(MF_MT_SUBTYPE, &pInfo->subtype);
    if (FAILED(hr)) return hr;

    if (pInfo->subtype != MFVideoFormat_NV12 &&
        pInfo->subtype != MFVideoFormat_YUY2)
    {
        return MF_E_INVALIDMEDIATYPE;
    }

    hr = MFGetAttributeSize(pType, MF_MT_FRAME_SIZE, &pInfo->width, &pInfo->height);
    if (FAILED(hr)) return hr;

    hr = GetDefaultStride(pType, &pInfo->defaultStride);
    if (FAILED(hr)) return hr;

    UINT32 value = 0;

    hr = pType->GetUINT32(MF_MT_YUV_MATRIX, &value);
    if (FAILED(hr)) return hr;

    pInfo->matrix = static_cast<MFVideoTransferMatrix>(value);
    if (pInfo->matrix != MFVideoTransferMatrix_BT601 &&
        pInfo->matrix != MFVideoTransferMatrix_BT709)
    {
        return MF_E_INVALIDMEDIATYPE;
    }

    hr = pType->GetUINT32(MF_MT_VIDEO_NOMINAL_RANGE, &value);
    if (FAILED(hr)) return hr;

    pInfo->nominalRange = static_cast<MFNominalRange>(value);
    if (pInfo->nominalRange != MFNominalRange_16_235)
    {
        return MF_E_INVALIDMEDIATYPE;
    }

    return S_OK;
}

We are deliberately being strict here. The Media Foundation enum documentation does say things like “treat Unknown as BT.709,” but in practice, silently rounding that off makes color drift harder to notice. At least in a first implementation, it is safer to return an error for unsupported combinations.

With cameras and JPEG-derived sources, you may want to handle full-range paths separately. Here we deliberately do not silently lump them together — the policy is to explicitly narrow the assumptions this code accepts.
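
For reference, a separate full-range path could look like the sketch below. This is not part of the article's sample: the function name is hypothetical, it covers only the BT.601 matrix, and it assumes Y, U, and V all use the full 0..255 range, so there is no 16 offset and no 255/219 scaling.

```cpp
#include <cassert>
#include <cstdint>

// Rounds to nearest and clips to 0..255.
inline std::uint8_t ClampRoundToByte(double value)
{
    if (value <= 0.0) return 0;
    if (value >= 255.0) return 255;
    return static_cast<std::uint8_t>(value + 0.5);
}

// Full-range (0..255) BT.601 conversion of one pixel to B,G,R,A.
// Hypothetical helper; assumes Y has no black-level offset.
inline void ConvertFullRangeBt601PixelToBgra(
    std::uint8_t y,
    std::uint8_t u,
    std::uint8_t v,
    std::uint8_t* dstPixel)
{
    assert(dstPixel != nullptr);

    const double c = static_cast<double>(y);          // no "- 16" here
    const double d = static_cast<double>(u) - 128.0;
    const double e = static_cast<double>(v) - 128.0;

    dstPixel[0] = ClampRoundToByte(c + 1.772 * d);                   // B
    dstPixel[1] = ClampRoundToByte(c - 0.344136 * d - 0.714136 * e); // G
    dstPixel[2] = ClampRoundToByte(c + 1.402 * e);                   // R
    dstPixel[3] = 255;
}
```

Feeding limited-range white (Y = 235) into this function yields 235 rather than 255, which is the washed-out look you get when the two ranges are mixed up. That is why the article's code rejects anything other than MFNominalRange_16_235 instead of guessing.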

5.5. Read the Buffer Trusting the Stride

This part is quite important too.

  • MF_MT_DEFAULT_STRIDE is the minimum stride
  • The sample buffer's actual stride may be larger because it can include padding
  • If IMF2DBuffer::Lock2D is available, prefer it

Taking the helper pattern from Microsoft Learn’s Uncompressed Video Buffers and making it directly usable gives us this.

class BufferLock
{
public:
    explicit BufferLock(IMFMediaBuffer* buffer)
        : m_buffer(buffer),
          m_2dBuffer(nullptr),
          m_locked(false)
    {
        if (m_buffer)
        {
            m_buffer->AddRef();
            m_buffer->QueryInterface(IID_PPV_ARGS(&m_2dBuffer));
        }
    }

    ~BufferLock()
    {
        Unlock();

        if (m_2dBuffer)
        {
            m_2dBuffer->Release();
            m_2dBuffer = nullptr;
        }

        if (m_buffer)
        {
            m_buffer->Release();
            m_buffer = nullptr;
        }
    }

    HRESULT Lock(
        LONG defaultStride,
        DWORD heightInPixels,
        BYTE** ppScanline0,
        LONG* pActualStride)
    {
        if (!m_buffer || !ppScanline0 || !pActualStride) return E_POINTER;
        if (m_locked) return MF_E_INVALIDREQUEST;

        if (m_2dBuffer)
        {
            HRESULT hr = m_2dBuffer->Lock2D(ppScanline0, pActualStride);
            if (FAILED(hr)) return hr;

            m_locked = true;
            return S_OK;
        }

        BYTE* pData = nullptr;
        HRESULT hr = m_buffer->Lock(&pData, nullptr, nullptr);
        if (FAILED(hr)) return hr;

        *pActualStride = defaultStride;
        if (defaultStride < 0)
        {
            *ppScanline0 =
                pData + static_cast<size_t>(-defaultStride) * (heightInPixels - 1);
        }
        else
        {
            *ppScanline0 = pData;
        }

        m_locked = true;
        return S_OK;
    }

    void Unlock()
    {
        if (!m_locked) return;

        if (m_2dBuffer)
        {
            m_2dBuffer->Unlock2D();
        }
        else
        {
            m_buffer->Unlock();
        }

        m_locked = false;
    }

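    // Non-copyable: a copy would unlock and release the same buffer twice.
    BufferLock(const BufferLock&) = delete;
    BufferLock& operator=(const BufferLock&) = delete;

private: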
    IMFMediaBuffer* m_buffer;
    IMF2DBuffer* m_2dBuffer;
    bool m_locked;
};

The recommended YUV surface definitions assume a top-down image (origin at the top left) with a positive stride, but for actual buffer access it is safer to use the pitch the API returned, as is. If you hard-code a value derived from the width here, things break silently later.
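
As a small illustration of stride-based row access, the sketch below copies a padded image into a tightly packed vector. CopyRowsTightlyPacked is a hypothetical helper that uses only the standard library. It assumes scanline0 points at the top row and accepts a negative stride for bottom-up images.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies `height` rows of `rowBytes` bytes each into a tightly packed vector.
// `scanline0` points at the top row. `strideBytes` may exceed `rowBytes`
// because of padding, and is negative for bottom-up images.
inline std::vector<std::uint8_t> CopyRowsTightlyPacked(
    const std::uint8_t* scanline0,
    std::ptrdiff_t strideBytes,
    std::size_t rowBytes,
    std::uint32_t height)
{
    assert(scanline0 != nullptr);

    std::vector<std::uint8_t> packed(rowBytes * height);
    if (packed.empty()) return packed;

    for (std::uint32_t y = 0; y < height; ++y)
    {
        // Signed arithmetic so that a negative stride walks backwards.
        const std::uint8_t* src =
            scanline0 + strideBytes * static_cast<std::ptrdiff_t>(y);
        std::memcpy(packed.data() + rowBytes * y, src, rowBytes);
    }
    return packed;
}
```

With two 4-byte rows stored at a stride of 6, the two padding bytes per row are skipped. Passing a pointer to the last row with a stride of -6 returns the same rows in reverse order.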

5.6. Turning the Per-Pixel Conversion Formula into Code

Here we handle only the limited-range versions of BT.601 and BT.709. The output is BGRA32, which is easy to hand to WIC or GDI.

inline BYTE ClampToByte(double value)
{
    if (value <= 0.0) return 0;
    if (value >= 255.0) return 255;
    return static_cast<BYTE>(value + 0.5);
}

HRESULT ConvertLimitedYuvPixelToBgra(
    BYTE y,
    BYTE u,
    BYTE v,
    MFVideoTransferMatrix matrix,
    BYTE* dstPixel)
{
    if (!dstPixel) return E_POINTER;

    const double c = static_cast<double>(y) - 16.0;
    const double d = static_cast<double>(u) - 128.0;
    const double e = static_cast<double>(v) - 128.0;

    double r = 0.0;
    double g = 0.0;
    double b = 0.0;

    switch (matrix)
    {
    case MFVideoTransferMatrix_BT601:
        r = 1.164383 * c + 1.596027 * e;
        g = 1.164383 * c - 0.391762 * d - 0.812968 * e;
        b = 1.164383 * c + 2.017232 * d;
        break;

    case MFVideoTransferMatrix_BT709:
        r = 1.164383 * c + 1.792741 * e;
        g = 1.164383 * c - 0.213249 * d - 0.532909 * e;
        b = 1.164383 * c + 2.112402 * d;
        break;

    default:
        return MF_E_INVALIDMEDIATYPE;
    }

    dstPixel[0] = ClampToByte(b);
    dstPixel[1] = ClampToByte(g);
    dstPixel[2] = ClampToByte(r);
    dstPixel[3] = 255;

    return S_OK;
}

What is happening here is simple.

  • Subtract 16 from Y
  • Subtract 128 from U / V
  • Multiply by the coefficients for the given matrix
  • Clip the result to 0..255
  • Set the 4th BGRA byte to 255
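
When floating point is too slow for the per-pixel path, the same steps can be done with integers. The sketch below is one such variant, not the article's sample code: it covers BT.601 limited range only, the function names are hypothetical, and each coefficient is the floating-point value multiplied by 256 and rounded (298, 409, 100, 208, 516), so results can differ from the double version by one step.

```cpp
#include <cassert>
#include <cstdint>

// Divides a value scaled by 256 back down, rounding to nearest,
// and clips the result to 0..255.
inline std::uint8_t DescaleAndClip(int scaled)
{
    const int rounded = scaled + 128;
    if (rounded <= 0) return 0;
    const int value = rounded >> 8;  // non-negative here, so the shift is well defined
    return static_cast<std::uint8_t>(value > 255 ? 255 : value);
}

// BT.601 limited-range conversion of one pixel to B,G,R,A using integers only.
// Hypothetical helper for illustration.
inline void ConvertBt601PixelToBgraFixedPoint(
    std::uint8_t y,
    std::uint8_t u,
    std::uint8_t v,
    std::uint8_t* dstPixel)
{
    assert(dstPixel != nullptr);

    const int c = static_cast<int>(y) - 16;
    const int d = static_cast<int>(u) - 128;
    const int e = static_cast<int>(v) - 128;

    dstPixel[0] = DescaleAndClip(298 * c + 516 * d);            // B
    dstPixel[1] = DescaleAndClip(298 * c - 100 * d - 208 * e);  // G
    dstPixel[2] = DescaleAndClip(298 * c + 409 * e);            // R
    dstPixel[3] = 255;
}
```

The structure is identical to the list above: offset, multiply, clip, and set the 4th byte. Only the number representation changes.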

5.7. Converting NV12 to BGRA32

NV12 is 4:2:0, so the 4 pixels of a 2x2 block share the same U/V. As a minimal implementation, the most understandable approach is to use that shared chroma directly for all 4 pixels.

HRESULT ConvertNv12ToBgra32(
    IMFMediaBuffer* buffer,
    const DecodedFrameInfo& info,
    std::vector<BYTE>& dstBgra)
{
    if (!buffer) return E_POINTER;
    if (info.subtype != MFVideoFormat_NV12) return MF_E_INVALIDMEDIATYPE;
    if ((info.width & 1u) != 0 || (info.height & 1u) != 0)
    {
        return MF_E_INVALIDMEDIATYPE;
    }

    dstBgra.resize(static_cast<size_t>(info.width) * info.height * 4);

    BufferLock lock(buffer);

    BYTE* scanline0 = nullptr;
    LONG actualStride = 0;
    HRESULT hr = lock.Lock(
        info.defaultStride,
        info.height,
        &scanline0,
        &actualStride);
    if (FAILED(hr)) return hr;

    if (actualStride <= 0)
    {
        lock.Unlock();
        return MF_E_INVALIDMEDIATYPE;
    }

    const BYTE* yPlane = scanline0;
    const BYTE* uvPlane =
        scanline0 + static_cast<size_t>(actualStride) * info.height;

    for (UINT32 y = 0; y < info.height; ++y)
    {
        const BYTE* yRow = yPlane + static_cast<size_t>(actualStride) * y;
        const BYTE* uvRow = uvPlane + static_cast<size_t>(actualStride) * (y / 2);
        BYTE* dstRow =
            dstBgra.data() + static_cast<size_t>(info.width) * 4 * y;

        for (UINT32 x = 0; x < info.width; ++x)
        {
            const BYTE Y = yRow[x];
            const BYTE U = uvRow[(x / 2) * 2 + 0];
            const BYTE V = uvRow[(x / 2) * 2 + 1];

            hr = ConvertLimitedYuvPixelToBgra(
                Y,
                U,
                V,
                info.matrix,
                dstRow + static_cast<size_t>(x) * 4);
            if (FAILED(hr))
            {
                lock.Unlock();
                return hr;
            }
        }
    }

    lock.Unlock();
    return S_OK;
}

This code performs the chroma upsampling in a nearest-neighbor fashion. Visually that is often perfectly serviceable, but if you are aiming for maximum quality, a design that first performs the 4:2:0 -> 4:2:2 -> 4:4:4 upconversion, as described in the Microsoft Learn YUV article, is theoretically cleaner.
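
To give a feel for what a smoother upsampling step looks like, here is a deliberately simplified sketch of the horizontal 4:2:2 -> 4:4:4 stage using plain linear interpolation. This is not the filter described in the Microsoft Learn article. The function name is hypothetical, and the sketch assumes each chroma sample is co-sited with the even-numbered pixel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Doubles the horizontal chroma resolution of one row (4:2:2 -> 4:4:4).
// Assumes each chroma sample is co-sited with the even-numbered pixel.
// Odd-numbered pixels get the rounded average of their two neighbors.
inline std::vector<std::uint8_t> UpsampleChromaRowLinear(
    const std::vector<std::uint8_t>& chroma)
{
    assert(!chroma.empty());

    std::vector<std::uint8_t> out(chroma.size() * 2);
    for (std::size_t i = 0; i < chroma.size(); ++i)
    {
        const unsigned left = chroma[i];
        // Repeat the last sample at the right edge.
        const unsigned right = (i + 1 < chroma.size()) ? chroma[i + 1] : left;

        out[2 * i] = static_cast<std::uint8_t>(left);
        out[2 * i + 1] = static_cast<std::uint8_t>((left + right + 1) / 2);
    }
    return out;
}
```

A chroma row of {100, 200} becomes {100, 150, 200, 200}, whereas the nearest-neighbor approach in ConvertNv12ToBgra32 would produce {100, 100, 200, 200}. You would run this once for the U row and once for the V row.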

5.8. Converting YUY2 to BGRA32

YUY2 is packed 4:2:2. Two pixels simply share one U/V pair, so it is a bit easier to read than NV12.

#include <cstddef>

HRESULT ConvertYuy2ToBgra32(
    IMFMediaBuffer* buffer,
    const DecodedFrameInfo& info,
    std::vector<BYTE>& dstBgra)
{
    if (!buffer) return E_POINTER;
    if (info.subtype != MFVideoFormat_YUY2) return MF_E_INVALIDMEDIATYPE;
    if ((info.width & 1u) != 0) return MF_E_INVALIDMEDIATYPE;

    dstBgra.resize(static_cast<size_t>(info.width) * info.height * 4);

    BufferLock lock(buffer);

    BYTE* scanline0 = nullptr;
    LONG actualStride = 0;
    HRESULT hr = lock.Lock(
        info.defaultStride,
        info.height,
        &scanline0,
        &actualStride);
    if (FAILED(hr)) return hr;

    for (UINT32 y = 0; y < info.height; ++y)
    {
        const BYTE* src =
            scanline0 +
            static_cast<ptrdiff_t>(actualStride) * static_cast<ptrdiff_t>(y);

        BYTE* dstRow =
            dstBgra.data() + static_cast<size_t>(info.width) * 4 * y;

        for (UINT32 x = 0; x < info.width; x += 2)
        {
            const BYTE Y0 = src[0];
            const BYTE U  = src[1];
            const BYTE Y1 = src[2];
            const BYTE V  = src[3];

            hr = ConvertLimitedYuvPixelToBgra(
                Y0,
                U,
                V,
                info.matrix,
                dstRow + static_cast<size_t>(x) * 4);
            if (FAILED(hr))
            {
                lock.Unlock();
                return hr;
            }

            hr = ConvertLimitedYuvPixelToBgra(
                Y1,
                U,
                V,
                info.matrix,
                dstRow + static_cast<size_t>(x + 1) * 4);
            if (FAILED(hr))
            {
                lock.Unlock();
                return hr;
            }

            src += 4;
        }
    }

    lock.Unlock();
    return S_OK;
}

In YUY2, the bytes are laid out as Y0 U Y1 V, so the structure of “reuse the U/V for every 2 pixels” is directly visible. This makes the mental model easier to build than for NV12.

5.9. The Entry Point When Calling from a Sample

Finally, if you extract a contiguous buffer from the IMFSample and branch by subtype, it becomes easy to use.

HRESULT ConvertSampleToBgra32(
    IMFSample* sample,
    const DecodedFrameInfo& info,
    std::vector<BYTE>& dstBgra)
{
    if (!sample) return E_POINTER;

    ComPtr<IMFMediaBuffer> buffer;
    HRESULT hr = sample->ConvertToContiguousBuffer(&buffer);
    if (FAILED(hr)) return hr;

    if (info.subtype == MFVideoFormat_NV12)
    {
        return ConvertNv12ToBgra32(buffer.Get(), info, dstBgra);
    }

    if (info.subtype == MFVideoFormat_YUY2)
    {
        return ConvertYuy2ToBgra32(buffer.Get(), info, dstBgra);
    }

    return MF_E_INVALIDMEDIATYPE;
}

With this, the preceding steps become:

  • create the reader
  • request NV12 or YUY2
  • build a DecodedFrameInfo from GetCurrentMediaType
  • ReadSample
  • ConvertSampleToBgra32

The actual calling code looks something like this.

ComPtr<IMFMediaType> currentType;
HRESULT hr = reader->GetCurrentMediaType(
    MF_SOURCE_READER_FIRST_VIDEO_STREAM,
    &currentType);
if (FAILED(hr)) return hr;

DecodedFrameInfo info;
hr = GetStrictDecodedFrameInfo(currentType.Get(), &info);
if (FAILED(hr)) return hr;

DWORD flags = 0;
LONGLONG timestamp = 0;
ComPtr<IMFSample> sample;

hr = reader->ReadSample(
    MF_SOURCE_READER_FIRST_VIDEO_STREAM,
    0,
    nullptr,
    &flags,
    &timestamp,
    &sample);
if (FAILED(hr)) return hr;
if (flags & MF_SOURCE_READERF_ENDOFSTREAM) return MF_E_END_OF_STREAM;
if (!sample) return MF_E_INVALID_STREAM_DATA;

std::vector<BYTE> bgra;
hr = ConvertSampleToBgra32(sample.Get(), info, bgra);
if (FAILED(hr)) return hr;

// bgra can be treated as top-down / 32bpp BGRA

5.10. Where to Put the “Manual Conversion”

The code so far takes the form of the application converting after the Source Reader. That is the easiest to understand.

However, if you want to insert the conversion inside the Media Foundation pipeline, there are other designs.

  • Write your own MFT
  • Use the Video Processor MFT / XVP
  • Write an NV12 -> RGB shader on the GPU side

Going that far changes the topic somewhat, so this article focused on application-side code. Still, it is useful to know that between “let Media Foundation handle it” and “do everything in the app,” there is a middle ground: the Video Processor MFT.

6. Which One Should You Choose?

When in doubt, the following table sorts things out quite well.

Aspect                             | Automatic conversion (MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING) | Manual conversion
Implementation speed               | Excellent | Fair
Extracting a few stills            | Excellent | Good
High frame volume / real-time      | Fair      | Excellent
Explicit control of matrix / range | Fair      | Excellent
Combining with GPU / D3D           | Fair      | Good to excellent
Output formats other than RGB32    | Fair      | Excellent
Understanding the fundamentals     | Good      | Excellent

For your first implementation, this framing makes it easy.

  • Want it working first -> automatic conversion
  • Want to own color and performance -> manual conversion

In practice, the sequence “first confirm a correct picture with automatic conversion, then replace it with the manual path” is also quite effective. If you take on everything from the start, it becomes hard to tell where the picture broke.

7. Pitfalls That Are Easy to Hit in Practice

7.1. Assuming RGB32 Is RGBA with Alpha

In memory, RGB32 is B, G, R, Alpha or Don't Care. If you write it out to a PNG as BGRA as is, the 4th byte may be 0, making the image transparent. It is safer to set it to 0xFF before saving.

7.2. Hard-Coding the Stride as width * bytesPerPixel

A very common accident. The actual sample buffer can contain padding, so the rule is to use the actual stride to move between rows.

7.3. Confusing MF_MT_DEFAULT_STRIDE with the Actual Pitch

MF_MT_DEFAULT_STRIDE is “the minimum stride when that format is represented in contiguous memory.” For the actual pitch of the sample buffer, prefer the value returned by IMF2DBuffer::Lock2D.

7.4. Silently Guessing 601 / 709 Without Looking at the Color Metadata

Color accidents are hard to see. They do not crash either. That is what makes them troublesome.

  • MF_MT_YUV_MATRIX
  • MF_MT_VIDEO_NOMINAL_RANGE

At the very least, look at these. And the right attitude is roughly: values your code does not support should be errors.

7.5. Locating the NV12 UV Plane with width * height

The plane offset is determined by the actual stride and height. Not by width * height. Do this sloppily and you get shifted colors or corrupted images.
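
The difference is easy to see with numbers. The sketch below computes where the UV plane starts and how large the whole buffer is. Nv12Layout and ComputeNv12Layout are hypothetical names. It assumes a positive stride shared by both planes and no extra rows between them, which you should verify for your own decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Byte layout of an NV12 buffer. Hypothetical struct for illustration.
struct Nv12Layout
{
    std::size_t uvPlaneOffset;  // where the interleaved UV plane begins
    std::size_t totalBytes;     // Y plane plus UV plane
};

// Assumes a positive stride shared by both planes and a UV plane that
// starts immediately after the last Y row.
inline Nv12Layout ComputeNv12Layout(
    std::size_t strideBytes,
    std::uint32_t width,
    std::uint32_t height)
{
    assert(strideBytes >= width);
    assert(height % 2 == 0);

    const std::size_t yPlaneBytes = strideBytes * height;
    const std::size_t uvPlaneBytes = strideBytes * (height / 2);
    return { yPlaneBytes, yPlaneBytes + uvPlaneBytes };
}
```

For a 1920x1080 frame with a stride of 2048, the UV plane starts at byte 2,211,840. Using width * height would give 2,073,600, which lands inside the Y plane and reads luma as if it were chroma.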

7.6. Processing Interlaced Video Assuming Progressive

The manual samples in this article assume progressive video. An interlaced frame holds two fields captured at different times, so converting it as if it were a single progressive picture can produce comb-like artifacts. If you need deinterlacing, it is more natural to consider the Source Reader’s automatic video processing or the Video Processor MFT.

7.7. Ignoring the Quality of 4:2:0 Chroma Upsampling

For clarity, the NV12 conversion in this article uses the shared chroma directly for each pixel. That is sufficient for many uses, but if image quality is the priority, it is worth studying the upconversion approach described in the recommended YUV formats documentation.

8. Summary

When converting YUV to RGB with Media Foundation, keeping the following framework in mind makes it much harder to get lost.

  • Behind the decoder, NV12 or YUY2 — not RGB — is what normally comes out
  • If you want the easy path, request RGB32 via MF_SOURCE_READER_ENABLE_VIDEO_PROCESSING
  • If you want control, receive NV12 / YUY2 and convert to BGRA yourself
  • On the manual path, get sampling / range / matrix / stride right before worrying about the formula
  • Being vague about BT.601 / BT.709, 16..235, and 4:2:0 / 4:2:2 leads to color drift or broken pictures

YUV -> RGB is a bit unapproachable at first. But once the picture of

  • NV12 shares U/V across 2x2 blocks
  • YUY2 shares U/V across 2 horizontal pixels
  • apply the matrix to that U/V together with Y

settles into your head, it becomes quite tame. Those mysterious cosmic-colored byte sequences start to look like properly meaningful pixels.

9. References

Sample Code for This Article

Microsoft Learn

Author Profile

Profile page for the article author.

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.