An Introduction to Windows Text Encodings - The Mojibake That Happens When Integrating with Linux

Mojibake on Windows does not happen because Japanese is difficult. Almost all of it is caused by reading the same byte sequence as a different encoding, or by saving the result of a misread in yet another encoding.

Especially when you cross between Windows and Linux, the Windows side carries multiple contexts — CP932, UTF-8, UTF-16, the console’s code page, version differences in PowerShell — while the Linux side mostly flows on a UTF-8 assumption. Mismatched assumptions that were previously invisible surface all at once.

This is less about the difficulty of Japanese text processing and more about whether you have aligned the assumptions under which bytes are handled. In this article, we organize Windows text encodings from the angle of “why does mojibake happen,” with a practical focus on the points where accidents multiply when Linux enters the picture.

1. What to Grasp First

Stating the essentials up front, these six points matter most.

Mojibake is not a problem of “characters” — it is a problem of “how a byte sequence was interpreted.”
On Windows, the Unicode world and the legacy code page world coexist, and even within a single machine the assumptions differ by context.
The Linux side has a strong UTF-8 assumption, so when Windows-side CP932 or UTF-16 gets mixed in, accidents follow easily.
The stage where only the display is garbled and the stage where corrupted content has been saved should be considered separately.
It is safest to make UTF-8 the first choice for new text and to leave existing legacy files as they are until an explicit migration task.
A file’s encoding, the editor’s encoding, the console’s code page, and the app’s internal string format are different things. Confuse these and your investigation gets lost.

The phrase “it got garbled on Windows” alone cannot pinpoint a cause. At minimum, you need to separate which of these is misaligned.

The encoding of the file itself
The encoding used at save time
The editor’s interpretation
The console’s input/output code page
The app’s internal string format
The locale and assumed encoding on the Linux side

2. What Mojibake Actually Is

The mechanism behind mojibake is quite simple.

A string is encoded with some encoding into a byte sequence
That byte sequence is decoded with some encoding back into a string
If the encode and decode assumptions do not match, it is read as a different string

For example, saving あ in UTF-8 yields these bytes.

E3 81 82

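Read this byte sequence as UTF-8 and you get あ; read it in a CP932 context and it looks like some other string, such as 縺�. CP932 takes E3 81 as the double-byte character 縺, and the leftover 82 is a lead byte with nothing to pair with. That is mojibake.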

What matters is that what happened here is not “the Japanese broke” — it is merely that the interpretation of the same bytes diverged.
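The divergence can be reproduced in a few lines. This is a minimal sketch using Python's built-in codecs; the variable names are illustrative, and the exact garbled output is what CPython's cp932 codec produces when asked to substitute undecodable bytes.

```python
# Encode once, then decode the same bytes under two different assumptions.
original = "あ"
raw = original.encode("utf-8")       # the bytes that actually get stored
print(raw.hex(" "))                  # e3 81 82

as_utf8 = raw.decode("utf-8")        # reader's assumption matches the writer's
# Reader's assumption differs; errors="replace" substitutes U+FFFD
# for bytes that cannot be decoded instead of raising.
as_cp932 = raw.decode("cp932", errors="replace")

print(as_utf8 == original)           # True
print(as_cp932 == original)          # False
```

The bytes never changed between the two decode calls. Only the assumption did.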

2.1 If only the display is garbled, recovery may still be possible

Mojibake has a stage where things can still be recovered. For instance, if the original bytes are unchanged, reopening the file with the correct encoding can restore it.

What is dangerous is a flow like this.

A UTF-8 file is misread as CP932
On screen it looks like 縺�
The “string as displayed” is saved as-is
The original UTF-8 bytes are lost

Once you enter this stage, it is no longer a mere display issue — it is data corruption.
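The difference between the two stages can be sketched as follows. This is an illustration under the assumption that the misreading tool substitutes U+FFFD for undecodable bytes; real editors vary in what they substitute.

```python
original_bytes = "あ".encode("utf-8")

# Stage 1: display-only mojibake. The stored bytes are untouched,
# so decoding them again with the right encoding recovers the text.
garbled_view = original_bytes.decode("cp932", errors="replace")
recovered = original_bytes.decode("utf-8")

# Stage 2: the garbled view is saved. The new bytes describe the
# wrong characters, and the U+FFFD placeholder has replaced real data.
resaved_bytes = garbled_view.encode("utf-8")
data_was_lost = "\ufffd" in garbled_view

print(recovered)                        # あ
print(resaved_bytes == original_bytes)  # False
print(data_was_lost)                    # True
```

Once `resaved_bytes` overwrites the file, nothing in the file records what the undecodable byte used to be.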

2.2 Even more dangerous: dropping “unrepresentable characters” into a narrow code page

The other classic accident is converting a Unicode string down into a legacy code page like CP932.

For example, if the string contains characters that do not exist in the destination code page:

they get replaced with ?
the replacement character � appears
they get converted to a similar but different character
or the conversion fails

This accident should be judged not only by readable vs. unreadable but by whether a round-trip conversion returns the original. Once a character is lost, knowing the correct encoding later cannot restore it.
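A round-trip check is easy to automate. The helper below is a hypothetical sketch, not part of any library: it reports whether a string survives encoding into a target code page and decoding back.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # At least one character has no representation in the target encoding.
        return False

print(survives_round_trip("日本語", "cp932"))  # True
print(survives_round_trip("𠮷野", "cp932"))    # False: U+20BB7 is not in CP932

# A lossy conversion "succeeds" but silently substitutes a question mark.
lossy = "𠮷野".encode("cp932", errors="replace")
print(lossy.decode("cp932"))                   # ?野
```

Running a check like this before writing to a legacy code page turns silent character loss into an explicit failure.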

3. Why Things Get So Tangled on Windows

Windows is not tangled simply because it is old. It is because the Unicode world and the legacy code page world still live side by side.

3.1 The Windows API has both Unicode and code page lineages

The Windows API has two major lineages.

The W family: wide character. Handles Unicode as UTF-16
The A family: the code page lineage, called ANSI

In other words, Windows has had both a “handle it as Unicode” path and a “handle it via the currently active code page” path from the start. So even on the same Windows machine, the assumptions change depending on which API or which tool the text passed through.

3.2 “Japanese on Windows” is not one thing

In practice, the four that most often get mixed up in Windows Japanese-text work are:

CP932: common in legacy Japanese Windows text
UTF-8: increasingly common in newer text assets, the web, and cross-platform contexts
UTF-16LE: still routinely appears in the context of Windows tools and APIs
The console’s code page: a separate layer that affects the input/output of cmd.exe and some console tools

The important point here: running chcp 65001 does not make your files UTF-8. Changing the console’s code page and what bytes an existing file contains are separate questions.

Incidentally, legacy Japanese Windows text is often loosely called “Shift_JIS,” but in practice keeping the name CP932 in mind keeps conversations from drifting. CP932 is Microsoft's variant of Shift_JIS with vendor extension characters added, so the two are not byte-for-byte identical. At minimum, the name makes explicit that “we are talking about the Windows-derived Japanese legacy encoding.”
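To make the file-level encodings above concrete, here is a small sketch that encodes the same one-character string four ways with Python's built-in codecs. Note that Python's `utf-16-le` codec does not add a BOM, whereas Windows tools that write UTF-16LE commonly prepend FF FE.

```python
text = "あ"

# Same character, four different byte sequences.
encoded = {
    name: text.encode(name)
    for name in ("cp932", "utf-8", "utf-8-sig", "utf-16-le")
}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
# cp932      82 a0
# utf-8      e3 81 82
# utf-8-sig  ef bb bf e3 81 82
# utf-16-le  42 30
```

None of these byte sequences changes when the console code page changes. The console setting only affects how a program's input and output are interpreted.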

3.3 File names and file contents are separate problems

When Japanese file names display fine on Windows, it is tempting to assume “then the contents must be fine too.” That is where the danger lies.

The layer that handles paths / file names
The layer that reads file contents
The layer that displays to the console

These three are distinct.

For example, Japanese paths may work flawlessly while the file contents, saved in CP932, break when read as UTF-8 on the Linux side. Conversely, even if the contents are UTF-8, the display alone breaks when the console’s code page does not match.

3.4 The defaults of PowerShell and surrounding tools are not aligned either

A quiet multiplier of accidents on Windows is that the same “I wrote some text” produces different output bytes depending on the path it takes.

The points to watch in particular:

Windows PowerShell 5.1 does not have consistent default encodings
In 5.1, some cmdlets and redirection, such as Out-File and >, produce UTF-16LE
Other 5.1 paths, such as Set-Content, use the active ANSI code page
PowerShell 7 and later defaults to UTF-8 without BOM

So “text produced by PowerShell” alone does not determine the encoding. You need to know which version, which cmdlet, and which write path were used.

4. Classic Accidents When Linux Enters the Mix

It is not unusual for something that more or less worked on Windows alone to break the moment Linux is involved. The reason is simple: the Linux side carries a strong UTF-8 assumption.

4.1 Text saved as CP932 on Windows, read as UTF-8 on Linux

The most common accident.

A legacy Windows app or an old operational process writes CSVs / TXT / logs in CP932
Linux-side scripts and tools read with a UTF-8 assumption per the locale
The result: decode errors, �, or meaningless strings

The Linux tool is not at fault here. The root cause is that the received bytes carried no agreement about their encoding.
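The accident and its fix can be sketched in Python. The sample data and the file name `legacy.csv` are hypothetical stand-ins for a legacy Windows export; the point is that the reader states the agreed encoding instead of inheriting one from the locale.

```python
import csv
import os
import tempfile

# Hypothetical sample standing in for a CSV written by a legacy Windows app.
sample = "名前,部署\r\n山田,営業\r\n"
path = os.path.join(tempfile.mkdtemp(), "legacy.csv")
with open(path, "wb") as f:
    f.write(sample.encode("cp932"))

# Reading under a UTF-8 assumption fails on the first non-ASCII byte.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_read_failed = False
except UnicodeDecodeError:
    utf8_read_failed = True

# Reading with the agreed encoding stated explicitly works.
with open(path, encoding="cp932", newline="") as f:
    rows = list(csv.reader(f))

print(utf8_read_failed)  # True
print(rows)              # [['名前', '部署'], ['山田', '営業']]
```

A strict decode error is the friendly outcome here. A reader that substitutes replacement characters instead would carry on with damaged text.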

4.2 UTF-8 without BOM, created on Linux / VS Code, treated as ANSI on Windows

Accidents happen in the opposite direction too.

A script / config / text file is created on Linux or in VS Code as UTF-8 without BOM
Windows PowerShell 5.1 or a legacy tool decodes the BOM-less file with the active ANSI code page
Only the lines containing Japanese or other non-ASCII break

UTF-8 tends to get blamed here, but the actual cause is a reader somewhere in the chain that does not treat BOM-less input as UTF-8.

4.3 Windows writes UTF-16LE, and on Linux it “doesn’t look like text”

This one is also quite common.

Some Windows PowerShell 5.1 output or a legacy tool writes UTF-16LE
Linux-side text tools expect a byte-oriented stream such as UTF-8, in which ASCII characters are single bytes
The result is “binary-looking text” riddled with NUL bytes

UTF-16LE itself is not bad. But it often does not mesh with the assumption of piping it straight into Linux text processing tools.
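Where the NUL bytes come from, and how an explicit conversion removes them, can be sketched as follows. The sample line is hypothetical; the FF FE prefix is the UTF-16LE byte order mark that Windows tools commonly write.

```python
# Hypothetical log line as a Windows tool might write it: BOM + UTF-16LE.
utf16_bytes = b"\xff\xfe" + "OK: 完了\r\n".encode("utf-16-le")

# Every ASCII character carries a 00 high byte, which is why
# byte-oriented text tools see the file as binary.
nul_count = utf16_bytes.count(b"\x00")
print(nul_count)  # 6

# Decode with the BOM-aware "utf-16" codec, then re-encode deliberately.
text = utf16_bytes.decode("utf-16")
utf8_bytes = text.encode("utf-8")
print(b"\x00" in utf8_bytes)  # False
```

The conversion is lossless in both directions, so the design question is only where in the pipeline it happens and who owns it.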

4.4 BOM presence causes friction too

A BOM is not the encoding itself, but in practice it matters a lot.

Some Windows-side tools are helped by a BOM
Some Linux-side tools treat the BOM as extraneous leading bytes
The result: only the first column or the start of the first line breaks, invisible junk appears, comparisons mismatch

In UTF-8 especially, UTF-8 with BOM and UTF-8 without BOM are different bytes: the BOM variant begins with the three bytes EF BB BF. “We switched to UTF-8” alone is only half an operational rule.
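The "only the first column breaks" symptom is easy to reproduce. In this sketch the CSV content is hypothetical; `utf-8-sig` is Python's codec name for UTF-8 that strips a leading BOM when one is present.

```python
import csv
import io

# Hypothetical CSV saved as UTF-8 with BOM.
data = b"\xef\xbb\xbf" + "id,name\n1,山田\n".encode("utf-8")

# A plain UTF-8 decode keeps the BOM as an invisible U+FEFF character
# glued to the first header name.
naive = list(csv.DictReader(io.StringIO(data.decode("utf-8"))))
# A BOM-aware decode strips it.
aware = list(csv.DictReader(io.StringIO(data.decode("utf-8-sig"))))

print("id" in naive[0])   # False: the key is actually "\ufeffid"
print("id" in aware[0])   # True
print(aware[0]["name"])   # 山田
```

Because U+FEFF is invisible in most terminals, the two header names look identical when printed, which is what makes this accident slow to diagnose.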

4.5 Trusting what the console shows leads you astray

When crossing between Windows and Linux, the other danger is the console.

The Windows console has input / output code pages
Linux terminals mostly run on a UTF-8 locale assumption
Going through WSL, SSH, containers, or CI multiplies the display paths

In this state, judging “it was readable in the console, so the file is fine” or “it was garbled in the console, so the file is corrupted” misfires easily. Whether what you see is broken and whether the saved bytes are broken should be verified separately.

4.6 The classic accidents in a table

| Situation | Actual bytes | Reader’s assumption | Typical symptom |
| --- | --- | --- | --- |
| CSV saved by a legacy Windows app | CP932 | Linux side assumes UTF-8 | `�`, decode errors, meaningless Japanese |
| File created on Linux / VS Code | UTF-8 no BOM | Windows PowerShell 5.1 treats it as ANSI | Only Japanese lines break |
| Some Windows PowerShell 5.1 output | UTF-16LE or ANSI | Linux side expects UTF-8 text | NUL bytes mixed in, binary-like behavior |
| UTF-8 file with BOM | UTF-8 + BOM | Unix tools assume plain UTF-8 | Only the first column breaks, stray characters appear |
| Trusting console display alone | Different assumptions for file and console | Investigator judges by display only | Misses the root-cause split |

5. Drive Mojibake Investigations with These Four Questions

When a mojibake investigation gets stuck, returning to these four questions is the fastest path.

5.1 What are the original bytes?

The first thing to look at is “what bytes is this file right now.” You need the mindset of looking at bytes, not appearance.

Is it UTF-8?
UTF-8 with BOM?
CP932?
UTF-16LE?
Has it been re-saved somewhere along the way and become something else?
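A first pass over these questions can be scripted. The function below is a hypothetical heuristic, not a detector: a BOM is strong evidence, a strict UTF-8 decode that succeeds on non-ASCII input is good evidence, and a successful CP932 decode only means the bytes are not obviously something else. UTF-32 and other encodings are deliberately ignored.

```python
import codecs

def describe_bytes(data: bytes) -> str:
    """Give a rough, heuristic description of how data might be encoded."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le with BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be with BOM"
    if data.isascii():
        return "ascii"  # valid under UTF-8 and CP932 alike
    try:
        data.decode("utf-8")  # strict decode
        return "utf-8 without BOM"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("cp932")
        return "possibly cp932"  # a guess, not a proof
    except UnicodeDecodeError:
        return "unknown"

print(describe_bytes("あ".encode("utf-8")))      # utf-8 without BOM
print(describe_bytes("あ".encode("utf-8-sig")))  # utf-8 with BOM
print(describe_bytes("あ".encode("cp932")))      # possibly cp932
```

A result from a heuristic like this is a starting point for the next question, who wrote the file, rather than a replacement for it.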

5.2 Who wrote it first, under what assumption?

Next, identify “the original writer.”

A legacy Windows app?
PowerShell 5.1 or 7?
A Linux script?
VS Code?
An Excel-derived export?
Some middleware / batch / CI?

Leave this vague and encoding inference becomes a matter of luck.

5.3 Who is reading it now, under what assumption?

You need not just the writer but the reader’s assumption as well.

Is the editor auto-detecting?
Is PowerShell looking at the BOM?
Is the Linux side treating it as UTF-8 per the locale?
Is the library using its default encoding?
Is Encoding.UTF8 or cp932 specified explicitly?

Mojibake almost always originates here.

5.4 Has the misread content already been saved?

Finally, confirm whether the damage has stopped at the display stage.

Are the bytes still the original ones?
Has someone saved the broken-looking content?
Have ? or � entered the diff?
Has the entire file been rewritten in a different encoding?

Fill in these four questions and the cause is usually visible.

6. Operational Rules That Reduce Accidents

From here, the practical part. In projects spanning Windows and Linux, deciding the following rules up front cuts accidents substantially.

6.1 Make UTF-8 the first choice for new files

For new text files, making UTF-8 the first candidate is the safe default. But do not stop there. You also need to decide what to do about the BOM.

Our recommended way of deciding:

Text mostly read on the Linux side: default to UTF-8 without BOM
Scripts read by legacy Windows tools or Windows PowerShell 5.1: state BOM presence explicitly based on the consumer’s needs
If there is a clear counterpart that requires UTF-16LE, write that requirement into the spec

Write only “standardize on UTF-8” and you will fight about the BOM later.

6.2 Keep existing legacy files as-is until an explicit migration task

If existing files are CP932, it is safer not to convert them to UTF-8 on the side, as part of everyday functional fixes.

The safe operational shape:

Existing files keep their original encoding / BOM / newlines
Encoding changes are separated out as a migration task
Convert in bulk only after confirming the conversion targets, the blast radius, and downstream consumers
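When the migration task does happen, the conversion itself can be written so that it refuses to guess. The following is a sketch with hypothetical function names: it decodes strictly, confirms the bytes round-trip, and writes UTF-8 without BOM while leaving newlines untouched.

```python
from pathlib import Path

def cp932_bytes_to_utf8(raw: bytes) -> bytes:
    """Convert CP932 bytes to BOM-less UTF-8, failing loudly on doubt."""
    text = raw.decode("cp932")  # strict: raises UnicodeDecodeError on bad input
    if text.encode("cp932") != raw:
        # Conservative guard: anything that does not round-trip
        # byte-for-byte is left for manual review.
        raise ValueError("bytes do not round-trip through cp932")
    return text.encode("utf-8")

def convert_file(src: Path, dst: Path) -> None:
    """Binary read and write, so CRLF or LF newlines are preserved as-is."""
    dst.write_bytes(cp932_bytes_to_utf8(src.read_bytes()))

converted = cp932_bytes_to_utf8("請求書\r\n".encode("cp932"))
print(converted == "請求書\r\n".encode("utf-8"))  # True
```

Writing to a separate destination path rather than overwriting in place keeps the original bytes available until downstream consumers have been confirmed.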

Many mojibake accidents begin with a well-intentioned “converting to UTF-8 while I’m at it.”

6.3 Treat the encoding as part of the interface

For CSVs, TXT, logs, configuration files, and simple protocols, the encoding itself is part of the interface, not just the content.

At minimum, a spec should state:

Is this file UTF-8 / CP932 / UTF-16LE?
If UTF-8, does it carry a BOM?
Are the newlines LF or CRLF?
Which side, Linux or Windows, is the producer / consumer?
Does any intermediate batch or ETL re-save it?

“We pass it as text” is not a specification.

6.4 Do not trust defaults; be explicit when writing

In code and in scripts alike, it is safer to specify the encoding explicitly.

The dangerous lines of thinking:

Save with the defaults
It will probably come out fine matching the OS
It was readable in the console, so the file is probably fine
Auto-detect exists, so it will be fine

Defaults vary routinely between Windows / Linux, PowerShell 5.1 / 7, editors, and runtimes. Unless you are explicit, it tends to be “working by coincidence.”
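In Python, being explicit means passing both `encoding` and `newline` when opening a file for writing. The file name and content in this sketch are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "settings.txt")

# State the encoding and the newline convention instead of inheriting
# whatever the OS, locale, or runtime version happens to default to.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("名前=山田\n")

# Confirm what actually landed on disk by reading bytes, not text.
with open(path, "rb") as f:
    written = f.read()

print(written == "名前=山田\n".encode("utf-8"))  # True
print(written.startswith(b"\xef\xbb\xbf"))       # False: no BOM
```

If a consumer needs a BOM, `encoding="utf-8-sig"` writes one. Either way the choice is visible in the code rather than implied by the environment.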

6.5 Verify the console and the file separately

A quietly effective rule:

Verify the display in the console
Verify by reopening the file

Keep these two separate.

Even if chcp or the terminal display is aligned, it means nothing if the saved file is in a different encoding. Conversely, the file can be perfectly fine while only the appearance breaks because the console’s display code page does not match.

6.6 Git will not fix your encodings

Unglamorous but important.

Git fundamentally tracks bytes. Which means it dutifully commits broken bytes into history, as-is.

So when:

a huge diff appears even though you changed nothing
only the Japanese lines show mysterious diffs
only the first line changed
newlines and encoding changed together

suspect a re-encoding accident before suspecting a content change.
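One way to test that suspicion is to compare two versions as decoded text rather than as bytes. The helper below is a hypothetical sketch and assumes the encoding of each version is already known.

```python
def same_text(old: bytes, old_enc: str, new: bytes, new_enc: str) -> bool:
    """Return True if two byte versions hold the same text once
    encoding, BOM, and CRLF/LF differences are set aside."""
    def normalize(data: bytes, enc: str) -> str:
        # Decoding with "utf-8-sig" drops a leading BOM if present.
        return data.decode(enc).replace("\r\n", "\n")
    return normalize(old, old_enc) == normalize(new, new_enc)

# Hypothetical before/after: CP932 with CRLF vs UTF-8 with BOM and LF.
old = "設定=有効\r\n".encode("cp932")
new = b"\xef\xbb\xbf" + "設定=有効\n".encode("utf-8")

print(old == new)                                  # False: every byte differs
print(same_text(old, "cp932", new, "utf-8-sig"))   # True: content is unchanged
```

If this returns True for a commit that shows a huge diff, the commit is a re-encoding, and it belongs in a migration task rather than mixed into a functional change.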

7. The Minimum Checklist

Here is the checklist worth pinning down first in projects where Windows and Linux mix.

7.1 Before editing

What is this file’s current encoding?
Is there a BOM?
Are the newlines LF or CRLF?
Have you noted 2-3 representative Japanese lines?
Do you know which side, Linux or Windows, is the final consumer?

7.2 While editing

Are you writing in a way that depends on the default encoding?
Are you saving with auto-detect left in charge?
Are you using PowerShell or shell redirection paths carelessly?
Are you taking comfort merely in “the display is readable”?

7.3 After editing

Did you reopen and verify after saving?
Are the representative lines intact on both the Linux and Windows sides?
Have ? or � increased in the diff?
Is only the first line or first column broken?
Is the diff a huge BOM-only / newline-only change?
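Several of the after-editing checks can be run mechanically against the saved bytes. This is a minimal sketch with a hypothetical function name, assuming the intended encoding and a few representative lines were noted before editing. It requires Python 3.9 or later for the `list[str]` annotations.

```python
def check_saved_bytes(
    data: bytes, encoding: str, expected_lines: list[str]
) -> list[str]:
    """Return a list of problems found in saved bytes (empty means none)."""
    try:
        text = data.decode(encoding)  # strict decode
    except UnicodeDecodeError as exc:
        return [f"not valid {encoding}: {exc.reason}"]
    problems = []
    if "\ufffd" in text:
        problems.append("contains U+FFFD replacement characters")
    if text.startswith("\ufeff"):
        problems.append("starts with a BOM")
    for line in expected_lines:
        if line not in text:
            problems.append(f"missing representative line: {line}")
    return problems

good = "件名: 見積書\n".encode("utf-8")
print(check_saved_bytes(good, "utf-8", ["件名: 見積書"]))  # []
```

Whether a BOM counts as a problem depends on the project's BOM policy; the check here assumes BOM-less UTF-8 is the agreed form.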

7.4 Things that belong in a migration task

Bulk conversion from CP932 to UTF-8
Unifying the UTF-8 BOM policy
Taking inventory of scripts that assume PowerShell 5.1
Documenting text paths that go through CI / containers / WSL / SSH
Unifying the save settings of editors / formatters / batch jobs

8. Summary

If the Windows encoding problem had to be put in one sentence, its essence is that the Unicode world and the legacy code page world still live side by side.

And the reason accidents multiply when Linux is added is that the Linux side mostly flows on a UTF-8 assumption, so Windows-side CP932 and UTF-16, the console code page, and PowerShell version differences all surface at once.

The five points worth remembering:

Mojibake is a misalignment in byte interpretation
Garbled display and data corruption are different things
On Windows, think in separate layers: file / editor / console / API
For text exchanged with Linux, make UTF-8 the first choice
Keep the conversion of existing legacy files separate from ordinary maintenance

Taken at face value, “it got garbled on Windows” is far too broad. But cut it along these four questions —

What are the original bytes?
Who wrote it, and how?
Who read it, and how?
Has it already been saved?

— and it becomes quite tractable.

Text encodings are unglamorous, but between Windows and Linux they are the I/O contract itself. Refusing to leave them ambiguous is the single most effective countermeasure.

9. References

Windows / Microsoft

[Code Pages - Win32 apps Microsoft Learn](https://learn.microsoft.com/en-us/windows/win32/intl/code-pages)
[Code Page Identifiers - Win32 apps Microsoft Learn](https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers)
[Unicode in the Windows API - Win32 apps Microsoft Learn](https://learn.microsoft.com/en-us/windows/win32/intl/unicode-in-the-windows-api)
[Console Code Pages - Windows Console Microsoft Learn](https://learn.microsoft.com/en-us/windows/console/console-code-pages)
[chcp Microsoft Learn](https://learn.microsoft.com/en-us/windows-server/administration/windows-commands/chcp)
[Use UTF-8 code pages in Windows apps Microsoft Learn](https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page)

PowerShell / VS Code

[about_Character_Encoding Microsoft Learn](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding)
[Understanding file encoding in VS Code and PowerShell Microsoft Learn](https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding)

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
