An Introduction to Windows Text Encodings - The Mojibake That Happens When Integrating with Linux

Tags: Windows, Mojibake, UTF-8, CP932, Linux, PowerShell, Unicode

Mojibake on Windows does not happen because Japanese is difficult. Almost all of it is caused by reading the same byte sequence as a different encoding, or by saving the result of a misread in yet another encoding.

Especially when you cross between Windows and Linux, the Windows side carries multiple contexts — CP932, UTF-8, UTF-16, the console’s code page, version differences in PowerShell — while the Linux side mostly flows on a UTF-8 assumption. Mismatched assumptions that were previously invisible surface all at once.

This is less about the difficulty of Japanese text processing and more about whether you have aligned the assumptions under which bytes are handled. In this article, we organize Windows text encodings from the angle of “why does mojibake happen,” with a practical focus on the points where accidents multiply when Linux enters the picture.

1. What to Grasp First

Stating the essentials up front, these six points matter most.

  • Mojibake is not a problem of “characters” — it is a problem of “how a byte sequence was interpreted.”
  • On Windows, the Unicode world and the legacy code page world coexist, and even within a single machine the assumptions differ by context.
  • The Linux side has a strong UTF-8 assumption, so when Windows-side CP932 or UTF-16 gets mixed in, accidents follow easily.
  • The stage where only the display is garbled and the stage where corrupted content has been saved should be considered separately.
  • It is safest to make UTF-8 the first choice for new text and to leave existing legacy files as they are until an explicit migration task.
  • A file’s encoding, the editor’s encoding, the console’s code page, and the app’s internal string format are different things. Confuse these and your investigation gets lost.

The phrase “it got garbled on Windows” alone cannot pinpoint a cause. At minimum, you need to separate which of these is misaligned.

  • The encoding of the file itself
  • The encoding used at save time
  • The editor’s interpretation
  • The console’s input/output code page
  • The app’s internal string format
  • The locale and assumed encoding on the Linux side

2. What Mojibake Actually Is

The mechanism behind mojibake is quite simple.

  1. A string is encoded with some encoding into a byte sequence
  2. That byte sequence is decoded with some encoding back into a string
  3. If the encode and decode assumptions do not match, it is read as a different string

For example, saving in UTF-8 yields these bytes.

E3 81 82

Read this byte sequence as UTF-8 and you get あ; read it in a CP932 context and it looks like some other string, such as 縺� (the first two bytes pair up as 縺, and the leftover third byte has no partner). That is mojibake.

What matters is that what happened here is not “the Japanese broke” — it is merely that the interpretation of the same bytes diverged.
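
The three steps above can be sketched in a few lines of Python. This is only an illustration: the codec names `utf-8` and `cp932` are Python's, and the sample character is the one from the byte example.

```python
# Encode one character as UTF-8, then decode the same bytes two ways.
text = "あ"
raw = text.encode("utf-8")

print(raw.hex(" "))                  # e3 81 82
print(raw.decode("utf-8"))           # あ (encode and decode assumptions match)

# Mismatched assumption: the same three bytes read as CP932.
# errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
garbled = raw.decode("cp932", errors="replace")
print(garbled)                       # starts with 縺

# The bytes themselves never changed; only the interpretation did.
print(raw == text.encode("utf-8"))   # True
```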

2.1 If only the display is garbled, recovery may still be possible

Mojibake has a stage where things can still be recovered. For instance, if the original bytes are unchanged, reopening the file with the correct encoding can restore it.

What is dangerous is a flow like this.

  1. A UTF-8 file is misread as CP932
  2. On screen it looks like 縺�
  3. The “string as displayed” is saved as-is
  4. The original UTF-8 bytes are lost

Once you enter this stage, it is no longer a mere display issue — it is data corruption.
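
The difference between the two stages can be sketched in Python as follows. The sample string and the choice of `errors="replace"` are assumptions made for illustration; real editors differ in how they handle undecodable bytes.

```python
original_bytes = "あいう".encode("utf-8")

# Steps 1-2: a reader wrongly assumes CP932; undecodable bytes become U+FFFD.
displayed = original_bytes.decode("cp932", errors="replace")

# Display-only stage: the bytes are untouched, so decoding them again
# with the right encoding restores the text.
recovered = original_bytes.decode("utf-8")
print(recovered == "あいう")            # True

# Steps 3-4: the garbled string is saved, replacing the original bytes.
saved_bytes = displayed.encode("utf-8")
print(saved_bytes == original_bytes)    # False

# Trying to undo the damage: U+FFFD has no CP932 form, so the byte it
# stood in for cannot be reconstructed.
undone = (
    displayed.encode("cp932", errors="replace")
    .decode("utf-8", errors="replace")
)
print(undone == "あいう")               # False
```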

2.2 Even more dangerous: dropping “unrepresentable characters” into a narrow code page

The other classic accident is converting a Unicode string down into a legacy code page like CP932.

For example, if the string contains characters that do not exist in the destination code page:

  • they get replaced with ?
  • the replacement character appears
  • they get converted to a similar but different character
  • or the conversion fails

This accident should be judged not only by readable vs. unreadable but by whether a round-trip conversion returns the original. Once a character is lost, knowing the correct encoding later cannot restore it.
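
A round-trip check of this kind might look like the following Python sketch. The helper name `survives_round_trip` is hypothetical, and Python's codecs show only the "conversion fails" and "replaced with ?" outcomes; other converters may behave differently.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False

print(survives_round_trip("日本語", "cp932"))   # True
print(survives_round_trip("한국어", "cp932"))   # False: Hangul is not in CP932

# A lossy conversion silently turns each unrepresentable character into "?".
lossy = "한국어".encode("cp932", errors="replace")
print(lossy)                                    # b'???'
```

Running such a check before a downward conversion tells you whether anything would be lost, while the original is still available.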

3. Why Things Get So Tangled on Windows

Windows is not tangled simply because it is old. It is because the Unicode world and the legacy code page world still live side by side.

3.1 The Windows API has both Unicode and code page lineages

The Windows API has two major lineages.

  • The W family: wide character. Handles Unicode as UTF-16
  • The A family: the code page lineage, called ANSI

In other words, Windows has had both a “handle it as Unicode” path and a “handle it via the currently active code page” path from the start. So even on the same Windows machine, the assumptions change depending on which API or which tool the text passed through.

3.2 “Japanese on Windows” is not one thing

In practice, the four that most often get mixed up in Windows Japanese-text work are:

  • CP932: common in legacy Japanese Windows text
  • UTF-8: increasingly common in newer text assets, the web, and cross-platform contexts
  • UTF-16LE: still routinely appears in the context of Windows tools and APIs
  • The console’s code page: a separate layer that affects the input/output of cmd.exe and some console tools

The important point here: running chcp 65001 does not make your files UTF-8. Changing the console’s code page and what bytes an existing file contains are separate questions.

Incidentally, legacy Japanese Windows text is often loosely called “Shift_JIS,” but in practice keeping the name CP932 in mind keeps conversations from drifting. At minimum, it makes explicit that “we are talking about the Windows-derived Japanese legacy encoding.”
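
To make the difference among the first three concrete, the sketch below prints the bytes of a single character under each encoding. Python and its codec names are used only for illustration.

```python
text = "あ"
encoded = {name: text.encode(name) for name in ("cp932", "utf-8", "utf-16-le")}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
# cp932      82 a0
# utf-8      e3 81 82
# utf-16-le  42 30
```

The same character becomes two, three, or two bytes with no byte in common, which is why a reader cannot recover the text without knowing which encoding was used.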

3.3 File names and file contents are separate problems

When Japanese file names display fine on Windows, it is tempting to assume “then the contents must be fine too.” That is where the danger lies.

  • The layer that handles paths / file names
  • The layer that reads file contents
  • The layer that displays to the console

These three are distinct.

For example, Japanese paths may work flawlessly while the file contents, saved in CP932, break when read as UTF-8 on the Linux side. Conversely, even if the contents are UTF-8, the display alone breaks when the console’s code page does not match.

3.4 The defaults of PowerShell and surrounding tools are not aligned either

A quiet multiplier of accidents on Windows is that the same “I wrote some text” produces different output bytes depending on the path it takes.

The points to watch in particular:

  • Windows PowerShell 5.1 does not have consistent default encodings
  • Some paths, such as Out-File and the > / >> redirection operators, produce UTF-16LE
  • Other paths, such as Set-Content and Add-Content, use the active ANSI code page
  • PowerShell 7 and later default to UTF-8 without BOM

So “text produced by PowerShell” alone does not determine the encoding. You need to know which version, which cmdlet, and which write path were used.

4. Classic Accidents When Linux Enters the Mix

It is not unusual for something that more or less worked on Windows alone to break the moment Linux is involved. The reason is simple: the Linux side carries a strong UTF-8 assumption.

4.1 Text saved as CP932 on Windows, read as UTF-8 on Linux

The most common accident.

  • A legacy Windows app or an old operational process writes CSVs / TXT / logs in CP932
  • Linux-side scripts and tools read with a UTF-8 assumption per the locale
  • The result: decode errors, � (the U+FFFD replacement character), or meaningless strings

The Linux tool is not at fault here. The root cause is that the received bytes carried no agreement about their encoding.
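
The accident and its fix can be reproduced in Python. In this sketch the file name `legacy.csv` and its contents are made up, and a temporary directory stands in for the shared location.

```python
import os
import tempfile

# Simulate a CSV written by a legacy Windows process in CP932.
path = os.path.join(tempfile.mkdtemp(), "legacy.csv")
with open(path, "w", encoding="cp932", newline="") as f:
    f.write("id,name\r\n1,東京\r\n")

# A strict UTF-8 reader fails outright...
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)            # True

# ...and a lenient one produces U+FFFD instead of the original text.
with open(path, encoding="utf-8", errors="replace") as f:
    replaced = f.read()
print("\ufffd" in replaced)     # True

# Stating the agreed encoding explicitly restores the text.
with open(path, encoding="cp932", newline="") as f:
    restored = f.read()
print(restored == "id,name\r\n1,東京\r\n")   # True
```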

4.2 UTF-8 without BOM, created on Linux / VS Code, treated as ANSI on Windows

Accidents happen in the opposite direction too.

  • A script / config / text file is created on Linux or in VS Code as UTF-8 without BOM
  • Windows PowerShell 5.1 or a legacy tool treats the BOM-less file as the ANSI-side code page
  • Only the lines containing Japanese or other non-ASCII break

UTF-8 tends to get blamed here, but the actual cause is a reader somewhere in the chain that fails to recognize BOM-less UTF-8.

4.3 Windows writes UTF-16LE, and on Linux it “doesn’t look like text”

This one is also quite common.

  • Some Windows PowerShell 5.1 output or a legacy tool writes UTF-16LE
  • Linux-side text tools expect an ASCII-compatible UTF-8 byte stream
  • The result is “binary-looking text” riddled with NUL bytes

UTF-16LE itself is not bad. But it does not mesh with workflows that pipe text straight into Linux text-processing tools.
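
A minimal Python sketch of why the text looks binary, and of converting it at the boundary. The sample line is made up; the bytes are built by hand to mimic a UTF-16LE writer that emits a BOM.

```python
import codecs

# Bytes as a UTF-16LE writer with a BOM would produce them.
raw = codecs.BOM_UTF16_LE + "status=OK\n".encode("utf-16-le")

# Every ASCII character carries a NUL as its second byte.
print(raw.count(b"\x00"))                     # 10
print(raw.startswith(codecs.BOM_UTF16_LE))    # True

# Convert at the boundary: decode as UTF-16 (the BOM selects the byte
# order and is consumed), then re-encode as UTF-8 without BOM.
utf8 = raw.decode("utf-16").encode("utf-8")
print(utf8)                                   # b'status=OK\n'
```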

4.4 BOM presence causes friction too

A BOM is not the encoding itself, but in practice it matters a lot.

  • Some Windows-side tools are helped by a BOM
  • Some Linux-side tools treat the BOM as extraneous leading bytes
  • The result: only the first column or the start of the first line breaks, invisible junk appears, comparisons mismatch

In UTF-8 especially, the BOM variant starts with three extra bytes (EF BB BF), so UTF-8 with BOM and UTF-8 without BOM are different byte sequences. “We switched to UTF-8” alone is only half an operational rule.
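
The "only the first column breaks" symptom can be reproduced with Python's csv module. The CSV content below is made up for illustration.

```python
import csv
import io

raw = b"\xef\xbb\xbfid,name\n1,Tokyo\n"   # UTF-8 with BOM

# Plain "utf-8" keeps the BOM as U+FEFF at the start of the first field.
rows_plain = list(csv.DictReader(io.StringIO(raw.decode("utf-8"))))
print(list(rows_plain[0].keys()))    # ['\ufeffid', 'name']

# "utf-8-sig" strips a leading BOM if present and is harmless otherwise.
rows_sig = list(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"))))
print(list(rows_sig[0].keys()))      # ['id', 'name']
```

With the plain decoder, a lookup such as `row["id"]` fails even though the header looks correct on screen, because U+FEFF is invisible in most displays.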

4.5 Trusting what the console shows leads you astray

When crossing between Windows and Linux, the other danger is the console.

  • The Windows console has input / output code pages
  • Linux terminals mostly run on a UTF-8 locale assumption
  • Going through WSL, SSH, containers, or CI multiplies the display paths

In this state, judging “it was readable in the console, so the file is fine” or “it was garbled in the console, so the file is corrupted” misfires easily. Whether what you see is broken and whether the saved bytes are broken should be verified separately.

4.6 The classic accidents in a table

| Situation | Actual bytes | Reader’s assumption | Typical symptom |
| --- | --- | --- | --- |
| CSV saved by a legacy Windows app | CP932 | Linux side assumes UTF-8 | �, decode errors, meaningless Japanese |
| File created on Linux / VS Code | UTF-8 no BOM | Windows PowerShell 5.1 treats it as ANSI | Only Japanese lines break |
| Some Windows PowerShell 5.1 output | UTF-16LE or ANSI | Linux side expects UTF-8 text | NUL bytes mixed in, binary-like behavior |
| UTF-8 file with BOM | UTF-8 + BOM | Unix tools assume plain UTF-8 | Only the first column breaks, stray characters appear |
| Trusting console display alone | Different assumptions for file and console | Investigator judges by display only | Misses the root-cause split |

5. Drive Mojibake Investigations with These Four Questions

When a mojibake investigation gets stuck, returning to these four questions is the fastest path.

5.1 What are the original bytes?

The first thing to look at is “what bytes is this file right now.” You need the mindset of looking at bytes, not appearance.

  • Is it UTF-8?
  • UTF-8 with BOM?
  • CP932?
  • UTF-16LE?
  • Has it been re-saved somewhere along the way and become something else?
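
One way to start answering these questions is to check for a BOM and then try strict decodes in a fixed order. The function below is a hypothetical best-effort sketch, not a reliable detector: BOM-less UTF-16LE is not handled, and a successful CP932 decode only means the bytes are plausible as CP932.

```python
import codecs

def sniff_encoding(raw: bytes) -> str:
    """Best-effort guess among the encodings this article discusses."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "utf-16"
    try:
        raw.decode("utf-8")
        return "utf-8"          # also matches pure ASCII
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("cp932")
        return "cp932"          # plausible, not proven
    except UnicodeDecodeError:
        return "unknown"

sample = "日本語"
print(sniff_encoding(sample.encode("utf-8")))                              # utf-8
print(sniff_encoding(sample.encode("utf-8-sig")))                          # utf-8-sig
print(sniff_encoding(sample.encode("cp932")))                              # cp932
print(sniff_encoding(codecs.BOM_UTF16_LE + sample.encode("utf-16-le")))    # utf-16
```

UTF-8 is tried before CP932 because Japanese text in CP932 rarely happens to be valid UTF-8, whereas many byte sequences decode under CP932 without error.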

5.2 Who wrote it first, under what assumption?

Next, identify “the original writer.”

  • A legacy Windows app?
  • PowerShell 5.1 or 7?
  • A Linux script?
  • VS Code?
  • An Excel-derived export?
  • Some middleware / batch / CI?

Leave this vague and encoding inference becomes a matter of luck.

5.3 Who is reading it now, under what assumption?

You need not just the writer but the reader’s assumption as well.

  • Is the editor auto-detecting?
  • Is PowerShell looking at the BOM?
  • Is the Linux side treating it as UTF-8 per the locale?
  • Is the library using its default encoding?
  • Is Encoding.UTF8 or cp932 specified explicitly?

Mojibake almost always originates here.

5.4 Has the misread content already been saved?

Finally, confirm whether the damage has stopped at the display stage.

  • Are the bytes still the original ones?
  • Has someone saved the broken-looking content?
  • Have ? or � entered the diff?
  • Has the entire file been rewritten in a different encoding?

Fill in these four questions and the cause is usually visible.

6. Operational Rules That Reduce Accidents

From here, the practical part. In projects spanning Windows and Linux, deciding the following rules up front cuts accidents substantially.

6.1 Make UTF-8 the first choice for new files

For new text files, making UTF-8 the first candidate is the safe default. But do not stop there. You also need to decide what to do about the BOM.

Our recommended way of deciding:

  • Text mostly read on the Linux side: default to UTF-8 without BOM
  • Scripts read by legacy Windows tools or Windows PowerShell 5.1: state BOM presence explicitly based on the consumer’s needs
  • If there is a clear counterpart that requires UTF-16LE, write that requirement into the spec

Write only “standardize on UTF-8” and you will fight about the BOM later.

6.2 Keep existing legacy files as-is until an explicit migration task

If existing files are CP932, it is safer not to convert them to UTF-8 on the side, as part of everyday functional fixes.

The safe operational shape:

  • Existing files keep their original encoding / BOM / newlines
  • Encoding changes are separated out as a migration task
  • Convert in bulk only after confirming the conversion targets, the blast radius, and downstream consumers

Many mojibake accidents begin with a well-intentioned “converting to UTF-8 while I’m at it.”

6.3 Treat the encoding as part of the interface

For CSVs, TXT, logs, configuration files, and simple protocols, the encoding itself is part of the interface, not just the content.

At minimum, a spec should state:

  • Is this file UTF-8 / CP932 / UTF-16LE?
  • If UTF-8, does it carry a BOM?
  • Are the newlines LF or CRLF?
  • Which side, Linux or Windows, is the producer / consumer?
  • Does any intermediate batch or ETL re-save it?

“We pass it as text” is not a specification.

6.4 Do not trust defaults; be explicit when writing

In code and in scripts alike, it is safer to specify the encoding explicitly.

The dangerous lines of thinking:

  • Save with the defaults
  • It will probably come out fine matching the OS
  • It was readable in the console, so the file is probably fine
  • Auto-detect exists, so it will be fine

Defaults vary routinely between Windows / Linux, PowerShell 5.1 / 7, editors, and runtimes. Unless you are explicit, it tends to be “working by coincidence.”
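
In Python, being explicit could look like the sketch below. The helper `write_text`, its `bom` flag, and the file name are hypothetical; the point is only that the encoding, the BOM, and the newline style are all stated rather than inherited from the platform.

```python
import os
import tempfile

def write_text(path: str, text: str, *, bom: bool = False) -> None:
    """Write UTF-8 with LF newlines; add a BOM only when asked."""
    encoding = "utf-8-sig" if bom else "utf-8"
    # newline="\n" disables the platform-dependent newline translation.
    with open(path, "w", encoding=encoding, newline="\n") as f:
        f.write(text)

path = os.path.join(tempfile.mkdtemp(), "config.txt")
write_text(path, "名前=値\n")

with open(path, "rb") as f:
    data = f.read()
print(data == "名前=値\n".encode("utf-8"))   # True on Windows and Linux alike
```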

6.5 Verify the console and the file separately

A quietly effective rule:

  • Verify the display in the console
  • Verify by reopening the file

Keep these two separate.

Even if chcp or the terminal display is aligned, it means nothing if the saved file is in a different encoding. Conversely, the file can be perfectly fine while only the appearance breaks because the console’s display code page does not match.

6.6 Git will not fix your encodings

Unglamorous but important.

Git fundamentally tracks bytes. Which means it dutifully commits broken bytes into history, as-is.

So when:

  • a huge diff appears even though you changed nothing
  • only the Japanese lines show mysterious diffs
  • only the first line changed
  • newlines and encoding changed together

suspect a re-encoding accident before suspecting a content change.
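
A check along these lines can be sketched in Python: given the old and new bytes of a file (for example, two revisions obtained with `git show <rev>:<path>`), decide whether the text changed or only its encoding, BOM, or newlines did. The function names and the decode order are assumptions; `decode_guess` raises UnicodeDecodeError for bytes that fit none of the three encodings.

```python
import codecs

def decode_guess(raw: bytes) -> str:
    """Decode via BOM check, then UTF-8, then CP932 (an assumed order)."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw.decode("utf-16")
    try:
        return raw.decode("utf-8-sig")   # strips a UTF-8 BOM if present
    except UnicodeDecodeError:
        return raw.decode("cp932")

def normalized(raw: bytes) -> str:
    """Text with encoding, BOM, and CRLF/LF differences removed."""
    return decode_guess(raw).replace("\r\n", "\n")

def classify_change(old: bytes, new: bytes) -> str:
    if old == new:
        return "identical"
    if normalized(old) == normalized(new):
        return "re-encoding only"
    return "content changed"

old = "名前,値\r\n".encode("cp932")
new = codecs.BOM_UTF8 + "名前,値\n".encode("utf-8")
print(classify_change(old, new))   # re-encoding only
```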

7. The Minimum Checklist

Here is the checklist worth pinning down first in projects where Windows and Linux mix.

7.1 Before editing

  • What is this file’s current encoding?
  • Is there a BOM?
  • Are the newlines LF or CRLF?
  • Have you noted 2-3 representative Japanese lines?
  • Do you know which side, Linux or Windows, is the final consumer?

7.2 While editing

  • Are you writing in a way that depends on the default encoding?
  • Are you saving with auto-detect left in charge?
  • Are you using PowerShell or shell redirection paths carelessly?
  • Are you taking comfort merely in “the display is readable”?

7.3 After editing

  • Did you reopen and verify after saving?
  • Are the representative lines intact on both the Linux and Windows sides?
  • Have ? or � increased in the diff?
  • Is only the first line or first column broken?
  • Is the diff a huge BOM-only / newline-only change?
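
Part of this list can be automated. The Python sketch below checks saved bytes against the encoding you intended and against the representative lines noted before editing. The function name and the wording of the messages are hypothetical.

```python
def check_saved_text(raw: bytes, encoding: str, expected_lines: list[str]) -> list[str]:
    """Return a list of problems found in raw; an empty list means none found."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return [f"not valid {encoding}: {exc.reason} at byte {exc.start}"]

    problems = []
    if "\ufffd" in text:
        problems.append("contains U+FFFD replacement characters")
    if text.startswith("\ufeff"):
        problems.append("starts with a BOM")
    lines = text.splitlines()
    for expected in expected_lines:
        if expected not in lines:
            problems.append(f"representative line missing: {expected!r}")
    return problems

good = "設定=有効\nname=test\n".encode("utf-8")
print(check_saved_text(good, "utf-8", ["設定=有効"]))   # []

bad = "設定=有効\n".encode("cp932")
print(len(check_saved_text(bad, "utf-8", ["設定=有効"])))   # 1
```

It reads bytes rather than console output, so the result does not depend on how the terminal happens to render the text.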

7.4 Things that belong in a migration task

  • Bulk conversion from CP932 to UTF-8
  • Unifying the UTF-8 BOM policy
  • Taking inventory of scripts that assume PowerShell 5.1
  • Documenting text paths that go through CI / containers / WSL / SSH
  • Unifying the save settings of editors / formatters / batch jobs

8. Summary

If the Windows encoding problem had to be put in one sentence, its essence is that the Unicode world and the legacy code page world still live side by side.

And the reason accidents multiply when Linux is added is that the Linux side mostly flows on a UTF-8 assumption, so Windows-side CP932 and UTF-16, the console code page, and PowerShell version differences all surface at once.

The five points worth remembering:

  • Mojibake is a misalignment in byte interpretation
  • Garbled display and data corruption are different things
  • On Windows, think in separate layers: file / editor / console / API
  • For text exchanged with Linux, make UTF-8 the first choice
  • Keep the conversion of existing legacy files separate from ordinary maintenance

Taken at face value, “it got garbled on Windows” is far too broad. But cut it along these four questions —

  • What are the original bytes?
  • Who wrote it, and how?
  • Who read it, and how?
  • Has it already been saved?

— and it becomes quite tractable.

Text encodings are unglamorous, but between Windows and Linux they are the I/O contract itself. Refusing to leave them ambiguous is the single most effective countermeasure.


Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
