Windows Text Encodings and Line Endings - The Basics of Mojibake and CRLF/LF

In consultations about text handling on Windows, the following questions get tangled together with remarkable frequency.

- What is the difference between Shift_JIS and UTF-8?
- Why does mojibake (garbled text) happen?
- What is the difference between CRLF and LF?
- We switched to UTF-8, so why is some text still unreadable?
- Why does the same file look different in the editor, the console, Excel, and Git?

None of this happens because Japanese is hard. Almost all of it comes from reading the same bytes under a different assumption, or from saving the misread content as is.

On top of that, Windows still hosts both the Unicode world and the code-page world side by side. Add BOMs, line endings, editor auto-detection, the console code page, and Git’s newline conversion, and the whole topic looks complicated.

This article organizes, for practical use, the Shift_JIS / UTF-8 / UTF-16 mix-ups common on Windows, why mojibake happens, the difference between CRLF and LF, and why this topic is so confusing in the first place.

The content is based on Microsoft Learn, PowerShell, Git, and W3C / Unicode public documentation as of April 2026. See the references at the end for details.

What to Grasp First
Breaking Down the Terminology
- 2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ
- 2.2 How to Think About Shift_JIS vs. CP932
- 2.3 The Traps in the Words ANSI, Unicode, and UTF-8N
Why Mojibake Happens
- 3.1 What Mojibake Really Is
- 3.2 Garbled Display and Data Corruption Are Different
- 3.3 Dropping to Unrepresentable Characters Is Irreversible
What the Line-Ending Differences Are
- 4.1 CRLF / LF / CR
- 4.2 Line Endings Are a Separate Problem from Encodings
- 4.3 \n and the Newline Bytes in the File Are Not Necessarily the Same
Why Windows in Particular Is So Confusing
- 5.1 Unicode and Legacy Code Pages Coexist
- 5.2 The Labels Don’t Line Up
- 5.3 ASCII-Only Content Hides the Problem
- 5.4 File Contents, File Names, Console, and Source Files Are Separate Layers
- 5.5 BOM and Line Endings Act on Yet Other Axes
- 5.6 Tools Change Things on Their Own
Common Accident Patterns
Rules That Reduce Accidents in Practice
Drive Mojibake / Line-Ending Investigations with These 5 Questions
Summary
Related Articles
References

1. What to Grasp First

Putting the conclusions first, these seven points matter most.

- A text file is not the string itself but bytes + an encoding + a line-ending convention. Sometimes a BOM is attached too.
- Mojibake happens when the same bytes are decoded as a different encoding.
- Line-ending trouble happens when the encoding is correct but the assumption about line separators is misaligned.
- Unicode and UTF-8 do not mean the same thing. Unicode is about the character set; UTF-8 and UTF-16 are about encodings.
- What Windows people call “Shift_JIS” is, in practice, better thought of as CP932 / the Windows family of Japanese code pages; conversations drift less that way.
- “We switched to UTF-8” is still an incomplete specification. It only becomes an operating rule once you also decide whether there is a BOM and which line-ending style is used.
- The source of the confusion is not Japanese itself, but multiple assumptions with different histories surviving on the same Windows machine.

When handling text on Windows, the starting point is to separate these four:

- What are the bytes of this file?
- With which encoding was it written?
- With which encoding was it read?
- Are the line endings CRLF or LF?

Just separating these makes you far less likely to get lost.

2. Breaking Down the Terminology

2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ

First, it pays to take the words apart once.

Term	What it refers to	Example	Common confusion
Unicode	A framework for representing characters as numbers	`U+3042` (`あ`)	Assumed to be the same as `UTF-8`
UTF-8	An encoding that turns Unicode into bytes	`E3 81 82`	Assumed to be `Unicode` itself
UTF-16LE	An encoding that turns Unicode into bytes	`42 30`	Conflated with “Unicode” in menus
CP932	The Windows Japanese legacy code page	`82 A0`	Assumed to be exactly the same as `Shift_JIS`
CRLF / LF	The bytes that separate lines	`0D 0A` / `0A`	Mistaken for a kind of encoding
BOM	Identification bytes at the start of a file	`EF BB BF` etc.	Mistaken for an encoding name

Even the single character あ has different bytes per encoding.

Character: あ

UTF-8    : E3 81 82
CP932    : 82 A0
UTF-16LE : 42 30

What matters is that characters and bytes are different things. An application appears to handle “characters” on screen, but for saving and transmission it ultimately exchanges bytes. The accidents almost always happen at the boundary of that conversion.
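The byte listing above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `hex_bytes` is made up for illustration, and the codec names `utf-8`, `cp932`, and `utf-16-le` are the ones Python's standard library uses.

```python
# Encode the same character under three encodings and print the bytes.
# `hex_bytes` is a hypothetical helper name used only in this sketch.

def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


for name in ("utf-8", "cp932", "utf-16-le"):
    print(f"{name:10}: {hex_bytes('あ', name)}")
```

The character is the same in all three calls; only the conversion rule to bytes changes.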

2.2 How to Think About `Shift_JIS` vs. `CP932`

In the field, Japanese Windows text files often get lumped together as “Shift_JIS”. That works for conversation, but for practical work it is a bit sloppy.

If you want to be more precise about Japanese legacy text on Windows, it is safer to think CP932, or the Windows family of Japanese code pages.

Be sloppy here, and conversations like these drift:

You said “save it as Shift_JIS”, but the other party assumed Windows-side CP932
The Linux / macOS side treated it as shift_jis, but some Windows-origin files do not round-trip
You were told to save as ANSI, but which code page that means is environment-dependent

So in specifications and investigation notes, it is safer to write it as concretely as possible:

Not Shift_JIS but CP932
Not ANSI but ACP (active code page) / normally CP932 in Japanese environments
Not text but something concrete like UTF-8 no BOM, LF
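One way to see why the Shift_JIS / CP932 distinction matters is to compare two codecs that treat them as different things. The sketch below assumes Python, whose standard library ships separate `shift_jis` and `cp932` codecs; other runtimes may draw the line differently, so treat the result as an illustration rather than a definition of either name.

```python
# Compare Python's `shift_jis` and `cp932` codecs on two well-known cases.
# Assumption: behavior of CPython's bundled codecs; other libraries may differ.

CIRCLED_ONE = "\u2460"  # the character "①", a Windows-side extension

cp932_bytes = CIRCLED_ONE.encode("cp932")

try:
    CIRCLED_ONE.encode("shift_jis")
    strict_shift_jis_ok = True
except UnicodeEncodeError:
    strict_shift_jis_ok = False

# The same two bytes decode to different code points under the two codecs.
wave_bytes = b"\x81\x60"
as_shift_jis = wave_bytes.decode("shift_jis")
as_cp932 = wave_bytes.decode("cp932")

print("cp932 bytes for U+2460:", cp932_bytes.hex(" "))
print("shift_jis can encode U+2460:", strict_shift_jis_ok)
print("81 60 as shift_jis: U+%04X" % ord(as_shift_jis))
print("81 60 as cp932    : U+%04X" % ord(as_cp932))
```

If a spec only says "Shift_JIS", the two sides of an exchange can end up on opposite sides of exactly these differences.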

2.3 The Traps in the Words `ANSI`, `Unicode`, and `UTF-8N`

Around Windows, the word labels themselves are also a source of confusion.

These three are especially misleading.

- ANSI: appears in the Windows UI and older documentation, but it is not ASCII. In most cases it refers to the machine’s active code page.
- Unicode: in some editors and tools, the Unicode entry in a menu means UTF-16LE. “I saved it as Unicode” does not necessarily mean UTF-8.
- UTF-8N: seen in some Japanese-market editors; it is normally a UI label to distinguish UTF-8 without a BOM. It is not an official encoding name.

In short, on Windows, the same word means different things in different tools. This is the first big point of confusion.

3. Why Mojibake Happens

3.1 What Mojibake Really Is

The mechanism behind mojibake is quite simple.

A string is turned into bytes with some encoding
Those bytes are turned back into a string with a different encoding
If the assumptions do not match, a different string comes out

For example, saving あ as UTF-8 produces these bytes:

E3 81 82

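Read as UTF-8 this is あ, but read under CP932 assumptions it appears as a different string like 縺�: the first two bytes decode to the unrelated kanji 縺, and the leftover byte 82 has no partner byte, so it is typically shown as a replacement character. What is broken at that point is not “the Japanese” but the decoding assumption.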

Mojibake in one line:

The same bytes were read as a different encoding.
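The same misread can be reproduced in Python. This is a small sketch: `errors="replace"` is my choice to mimic a viewer that shows a replacement character for undecodable bytes; a real editor may display something else.

```python
# Decode the same bytes under two different assumptions.
original = "あ"
raw = original.encode("utf-8")  # bytes E3 81 82

# Wrong assumption: treat the UTF-8 bytes as CP932.
# errors="replace" stands in for a viewer that substitutes undecodable bytes.
misread = raw.decode("cp932", errors="replace")

# Right assumption: the bytes were never damaged, so this still works.
reread = raw.decode("utf-8")

print("bytes  :", raw.hex(" "))
print("misread:", repr(misread))
print("reread :", repr(reread))
```

The bytes in `raw` are identical in both calls; only the decoding assumption differs.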

3.2 Garbled Display and Data Corruption Are Different

What matters here is separating the stage where things can still be recovered from the stage where they mostly cannot.

For example, this flow can still be recovered from:

A UTF-8 file is opened as CP932
On screen it looks like 縺�
Nothing has been saved yet

At this stage the original bytes are still UTF-8. Reopen with the correct encoding and it may come back intact.

The dangerous flow is this one:

A UTF-8 file is misread as CP932
The visibly broken content is saved as is
The original UTF-8 bytes are lost

Once you get this far, it is no longer garbled display but data corruption.

In practice, instead of summarizing everything as “it got garbled”, it is important to at least separate these two questions:

Are the bytes themselves still correct
Has the misread content already been re-saved
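The difference between the two stages can be sketched in Python. The assumption here is an editor that replaces undecodable bytes with a replacement character and then saves as UTF-8; real tools vary, but the loss mechanism is the same.

```python
# Stage 1 (display only) versus stage 2 (misread content saved back).
raw = "あ".encode("utf-8")

# Stage 1: the file is only *shown* under the wrong assumption.
shown = raw.decode("cp932", errors="replace")
still_recoverable = raw.decode("utf-8") == "あ"  # original bytes untouched

# Stage 2: what was shown gets saved, here as UTF-8 (hypothetical editor).
resaved = shown.encode("utf-8")

# Try to undo the damage by reversing the steps.
undone = resaved.decode("utf-8").encode("cp932", errors="replace")
recovered = undone == raw

print("recoverable before saving:", still_recoverable)
print("bytes changed by saving  :", resaved != raw)
print("recovered after saving   :", recovered)
```

In this sketch the third byte was replaced during the misread, so reversing the conversion after the save cannot bring it back.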

3.3 Dropping to Unrepresentable Characters Is Irreversible

The other danger is converting a Unicode string down to a narrow code page like CP932.

If the string contains characters that do not exist on the target side, one of the following happens:

They become ?
A replacement character is inserted
The conversion errors out
They are mapped to some similar nearby character

For example, some emoji and extended kanji cannot be dropped into CP932 as is. This kind of accident should be judged not by “is it readable?” but by “does a round-trip conversion return the original?”.

Information once lost does not come back, even if you learn the correct encoding afterwards.
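A round-trip check is easy to sketch in Python. The function name `round_trips` is hypothetical; the point is only that the test is "encode, decode, compare" rather than "does it look fine".

```python
# Check whether text survives a conversion to a target encoding and back.
# `round_trips` is a hypothetical helper name for this sketch.

def round_trips(text: str, encoding: str) -> bool:
    """True if encoding then decoding returns exactly the original text."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # The text contains something the target encoding cannot represent.
        return False


for sample in ("あ", "😀"):
    print(repr(sample), "-> cp932 round-trips:", round_trips(sample, "cp932"))

# With a lossy error handler the conversion "succeeds" but the data is gone.
lossy = "😀".encode("cp932", errors="replace")
print("lossy bytes:", lossy)
```

Running the check before a bulk conversion shows which files would lose information, while the originals are still intact.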

4. What the Line-Ending Differences Are

4.1 `CRLF` / `LF` / `CR`

Line endings are bytes too.

CR = carriage return = 0D
LF = line feed = 0A
Windows text files traditionally use CRLF (0D 0A)
Linux / Unix systems generally use LF (0A)
Bare CR can appear in older contexts such as classic Mac

In table form:

Line ending	Bytes	Main context
`CRLF`	`0D 0A`	Traditional Windows text files, legacy tools
`LF`	`0A`	Linux / macOS / most development tools
`CR`	`0D`	Rather old legacy data

4.2 Line Endings Are a Separate Problem from Encodings

This is quite important.

The line-ending convention is a separate problem from the encoding.

The same UTF-8 file can have either CRLF or LF line endings. For content of A, a newline, then B, the bytes change like this:

UTF-8 + LF   : 41 0A 42
UTF-8 + CRLF : 41 0D 0A 42

Which means all of these are perfectly possible:

UTF-8, but only the line endings differ
CP932, but the line endings are LF
UTF-16LE, but the line endings are CRLF

So when “we switched to UTF-8 and it’s still wrong”, the reality is sometimes that only the line endings are misaligned, not the encoding.
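Because line endings are just bytes, they can be counted without deciding the encoding first, at least for ASCII-compatible encodings. The sketch below assumes UTF-8 or CP932 input; it would miscount UTF-16, where a line feed is stored as two bytes. The function name is made up for illustration.

```python
# Count line-ending styles in raw bytes.
# Assumption: an ASCII-compatible encoding such as UTF-8 or CP932.
# UTF-16 data needs to be decoded first, so it is out of scope here.

def count_line_endings(data: bytes) -> dict[str, int]:
    """Return how many CRLF, bare LF, and bare CR sequences `data` contains."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR not followed by LF
    }


print(count_line_endings("A\nB".encode("utf-8")))
print(count_line_endings("A\r\nB".encode("utf-8")))
print(count_line_endings("あ\r\nい\n".encode("cp932")))
```

A file reporting both CRLF and LF counts above zero has mixed line endings, which is worth knowing before blaming the encoding.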

4.3 `\n` and the Newline Bytes in the File Are Not Necessarily the Same

From a programmer’s perspective, this is the quietly confusing part.

Writing \n in source code does not guarantee that only 0A lands in the file. In the text mode of a language, runtime, or I/O API, \n may be converted to CRLF on Windows.

Which means the following can all be out of sync:

the newline notation in the source code
the string at runtime
the bytes saved in the file
the line endings visible in the editor

This is how the “I’m sure I wrote LF, but the file is CRLF” accident happens.
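Python's text mode is one concrete example of this translation. In the sketch below, the `newline` argument of the built-in `open()` decides what a `\n` in the string becomes on disk; `written_bytes` is a hypothetical helper that writes to a temporary file and returns the raw bytes.

```python
# Show how the same "\n" in a string can land in a file as different bytes.
import tempfile
from pathlib import Path
from typing import Optional


def written_bytes(text: str, newline: Optional[str]) -> bytes:
    """Write `text` in text mode with the given `newline` and return the file bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        return path.read_bytes()


as_lf = written_bytes("A\nB", "\n")        # "\n" written as-is
as_crlf = written_bytes("A\nB", "\r\n")    # "\n" translated to CRLF
platform_default = written_bytes("A\nB", None)  # depends on the OS

print("newline='\\n'  :", as_lf.hex(" "))
print("newline='\\r\\n':", as_crlf.hex(" "))
print("newline=None  :", platform_default.hex(" "))
```

With `newline=None`, the default, the result depends on the platform, which is exactly how a script that "wrote LF" ends up producing CRLF on Windows.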

Modern editors handle bare LF just fine, but surrounding tools, legacy applications, and business operations still carry CRLF assumptions. So line-ending problems are not “ancient history” — they routinely come up in practice today.

5. Why Windows in Particular Is So Confusing

5.1 Unicode and Legacy Code Pages Coexist

This is the biggest reason Windows is complicated.

Windows retains both:

the Unicode path
the code-page-based path

Newer applications, the web, and cross-platform assets lean toward UTF-8, while older CSVs, TXTs, logs, the Excel ecosystem, and business-system integrations still carry CP932. On top of that, UTF-16LE routinely appears in some outputs and around certain APIs.

In other words, multiple text cultures cohabit inside a single Windows machine.

5.2 The Labels Don’t Line Up

What multiplies the confusion is less the technology than the misaligned labels.

It says Shift_JIS, but the reality is CP932
It says ANSI, but the reality is the active code page
It says Unicode, but the reality is UTF-16LE
It says UTF-8, but BOM-or-not has never been pinned down
Editor-specific labels like UTF-8N show up

Leave this vague, and the conversation appears to work while the underlying realities never match.

5.3 ASCII-Only Content Hides the Problem

This one is big as well.

Because UTF-8 is compatible with the ASCII range, a file containing only alphanumerics and symbols can “sort of read fine” even under the wrong assumption. On the CP932 side too, the ASCII-equivalent range rarely looks broken, so the problem never surfaces.

The result is a state like this:

The English-only config looks fine
The moment one line of Japanese goes in, it breaks
A problem that lurked all along ignites for the first time in production

This is why encoding accidents tend to look like “it worked until yesterday and suddenly broke today”. In reality, it is usually that the landmine was there all along, and it only became visible the moment non-ASCII characters arrived.
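The hiding effect is easy to demonstrate. In this sketch the two config lines are invented examples; the comparison shows that ASCII-only content produces identical bytes under UTF-8 and CP932, while a single Japanese word makes them diverge.

```python
# ASCII-only content is byte-identical in UTF-8 and CP932; Japanese is not.
ascii_only = "timeout=30\n"      # hypothetical config line
with_japanese = "title=設定\n"    # hypothetical config line with Japanese

same_for_ascii = ascii_only.encode("utf-8") == ascii_only.encode("cp932")
same_for_japanese = with_japanese.encode("utf-8") == with_japanese.encode("cp932")

# Misreading the ASCII-only file is harmless, so nobody notices the mismatch.
ascii_misread = ascii_only.encode("utf-8").decode("cp932")

print("ASCII bytes identical   :", same_for_ascii)
print("Japanese bytes identical:", same_for_japanese)
print("ASCII misread unchanged :", ascii_misread == ascii_only)
```

A test suite that only uses ASCII fixtures therefore cannot detect a wrong encoding assumption.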

5.4 File Contents, File Names, Console, and Source Files Are Separate Layers

On Windows, lumping all of the following together as “the encoding” gets you lost.

file names / paths
file contents
display in the console
the encoding of the source code file itself
the string representation at runtime
display on the clipboard and in GUI widgets

For example, even if a Japanese file name displays fine, the file’s contents may be saved in CP932. Conversely, even if the file itself is UTF-8, the display alone breaks if the console code page does not match.

Operations like chcp 65001 fundamentally act on the console-side assumption — they do not change the bytes of existing files.

Furthermore, even if the source code file is UTF-8, the log file it writes at runtime is not necessarily UTF-8. You have to separate, every single time, which layer’s encoding you are talking about.
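To make the layers visible, it can help to print them side by side. The following is a rough sketch in Python: the function name and dictionary keys are made up, and the Windows-only part assumes the Win32 functions `GetACP` and `GetConsoleOutputCP` are reachable through `ctypes`. On other platforms that part is skipped.

```python
# Report several independent "encoding" settings that are easy to conflate.
import locale
import sys


def text_layer_report() -> dict:
    """Collect encoding-related settings from separate layers (sketch)."""
    report = {
        # What Python uses when writing to the current stdout.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # What Python uses to convert file names to and from bytes.
        "filesystem_encoding": sys.getfilesystemencoding(),
        # The locale's preferred encoding, which open() typically falls
        # back to when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
    }
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # The "ANSI" code page of the machine, e.g. 932 on Japanese Windows.
        report["active_code_page"] = kernel32.GetACP()
        # The code page the console uses for output; chcp changes this one.
        report["console_output_code_page"] = kernel32.GetConsoleOutputCP()
    return report


for key, value in text_layer_report().items():
    print(f"{key:26}: {value}")
```

None of these values says anything about the bytes inside an existing file; they only describe what each layer will assume.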

Incidentally, on Japanese Windows the \ character can display as a yen sign, which also tends to get mixed into encoding discussions. In most cases, however, this is a display font / glyph issue — the meaning as a path separator or escape character has not changed.

5.5 BOM and Line Endings Act on Yet Other Axes

Saying UTF-8 only settles half the question.

In practice, these matter too:

Is there a BOM or not
Are the line endings CRLF or LF

For example, with the very same UTF-8:

some Windows tools read it only if the BOM is present
some Unix-side processing ends up with junk characters in the first column when a BOM is present
some legacy-side tools struggle with bare LF
CRLF can make shell scripts and diffs messy

So even with the encoding matched, you can still have an accident.
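The BOM side of this can be shown with Python's `utf-8` and `utf-8-sig` codecs. This is a sketch with an invented config line; it illustrates the "junk in the first column" symptom and one way a reader can tolerate both variants.

```python
# UTF-8 with and without BOM, and what a BOM-unaware reader sees.
import codecs

text = "key=value"  # hypothetical config line

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# A reader that assumes plain UTF-8 keeps the BOM as an invisible character.
naive = with_bom.decode("utf-8")

# A reader using utf-8-sig strips the BOM if present and accepts its absence.
tolerant_a = with_bom.decode("utf-8-sig")
tolerant_b = without_bom.decode("utf-8-sig")

print("BOM bytes      :", codecs.BOM_UTF8.hex(" "))
print("naive read     :", repr(naive))
print("tolerant reads :", repr(tolerant_a), repr(tolerant_b))
```

In the naive read, a key lookup for `key` fails because the first key is actually `\ufeffkey`.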

5.6 Tools Change Things on Their Own

What makes it even trickier is that your local tools change things implicitly.

The editor auto-detects
A BOM is added / removed at save time
Git converts between CRLF and LF
Shells and commands save with their default encoding
A CSV export uses an unexpected code page
Defaults differ between versions of PowerShell or other tools

In other words, even when the operator specified nothing, some layer adds its own assumption — that is everyday Windows practice.

This is the true identity of “it broke even though I changed nothing”. More often than not, it was not a person but a tool’s default that made the change.

6. Common Accident Patterns

The typical accidents, in table form:

Situation	What is actually misaligned	Typical symptom
A legacy Windows tool treats a UTF-8 no BOM config file as `ANSI` / CP932	Decoding assumption	Only the Japanese gets garbled
A CP932 CSV is fed into a UTF-8-assuming pipeline	Decoding assumption	`�`, decode errors, nonsensical Japanese
A UTF-16LE log is fed into Unix-side text tools	Encoding assumption	NUL bytes mixed in; looks binary-ish
An `LF` source file is converted to `CRLF` in another environment	Line-ending assumption	Massive line-ending diffs, script malfunctions
Misread content is saved as is	The bytes themselves become something else	Unrecoverable data corruption
The spec just says “output a CSV”	The interface is undefined	Excel reads it, another tool breaks
The only decision is “standardize on UTF-8”	BOM / line endings undefined	Only certain tools fail

The most dangerous pattern of all is seeing the garbled display and then saving, locking the accident in.

7. Rules That Reduce Accidents in Practice

This section covers what to decide as operating rules in order to reduce accidents.

7.1 Decide the Baseline for New Text

For new files, making UTF-8 the first candidate is reasonable. But that alone is not enough.

At minimum, it is safer to decide all of the following:

UTF-8 with BOM or UTF-8 no BOM
Line endings CRLF or LF
Who reads this file
Is compatibility with legacy Windows tools required
Will Linux / macOS / CI / containers read it too

For example, for cross-platform source code and config, UTF-8 no BOM + LF is the likely first choice. On the other hand, if you must align with old Windows tools or existing operations, UTF-8 with BOM or even CP932 + CRLF may still be necessary.

What matters is deciding based on who you exchange data with, not on generic notions of “what is correct”.

7.2 Don’t Convert Existing Legacy Files on a Whim

If an existing file is CP932, it is safer not to UTF-8-ify it as a side effect of a routine small fix.

The safe-side operation is:

Existing files keep their original encoding / BOM / line endings
Encoding conversion is split out as a separate migration task
Bulk-convert only after confirming the conversion targets and the downstream consumers

Mojibake accidents tend to start from well-intentioned “modernizing while we’re at it”.

7.3 Treat Encoding and Line Endings as Part of the Interface

For CSVs, TXTs, logs, config files, and simple protocols, the text format itself is the interface, not just the content.

A spec should state at least:

the encoding
BOM or not
the newline style
header or not
the quoting / delimiter rules
which tool it was validated with

The three letters CSV are not enough. Only once you write UTF-8 with BOM, CRLF, comma delimiter, with header does the conversation stop drifting.
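A contract like "UTF-8 with BOM, CRLF, comma delimiter, with header" can be pinned in code instead of being left to defaults. The sketch below uses Python's `csv` module; the function name, column names, and sample row are invented for illustration.

```python
# Write a CSV whose encoding, BOM, line endings, and delimiter are all explicit.
import csv
import tempfile
from pathlib import Path


def write_contract_csv(path: Path, rows: list) -> None:
    """Write rows as UTF-8 with BOM, CRLF line endings, comma-delimited, with header."""
    # newline="" hands line-ending control to the csv writer itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\r\n")
        writer.writerow(["id", "name"])  # header row (hypothetical columns)
        writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "export.csv"
    write_contract_csv(target, [[1, "山田"]])
    produced = target.read_bytes()

print(produced)
```

Every item in the spec maps to one visible argument, so a reviewer can compare the code against the written contract line by line.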

7.4 Be Explicit at Read/Write Boundaries

On the code side too, it is safer not to lean on implicit defaults.

Specify the encoding explicitly on file read / write
Stay conscious of encoding when passing text between processes
Pin the line endings as part of the spec in export / import logic
Don’t let a casual shell redirect become a production path

On Windows especially, “it saved” and “it saved with the correct bytes” are not the same thing.
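On the reading side, being explicit also means failing loudly instead of guessing. The sketch below is one way to do that in Python; `read_text_strict` is a hypothetical helper, and the sample file is a two-line CP932 fragment created only for the demonstration.

```python
# Read with an explicit encoding and strict error handling, no newline translation.
import tempfile
from pathlib import Path


def read_text_strict(path: Path, encoding: str) -> str:
    """Read the whole file with the stated encoding; raise on any mismatch."""
    # newline="" keeps CRLF / LF exactly as stored instead of normalizing them.
    with open(path, "r", encoding=encoding, errors="strict", newline="") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"
    legacy.write_bytes("あ\r\n".encode("cp932"))  # a CP932 file with CRLF

    right = read_text_strict(legacy, "cp932")
    try:
        read_text_strict(legacy, "utf-8")
        wrong_assumption_detected = False
    except UnicodeDecodeError:
        wrong_assumption_detected = True

print(repr(right), wrong_assumption_detected)
```

A decode error at the boundary is usually cheaper than garbled text that gets saved somewhere downstream.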

7.5 Share the Git and Editor Rules Too

Git is not a tool that automatically fixes encodings. Line endings, on the other hand, can get converted.

So it is safer to decide, per repository:

Is LF the baseline for source code
Is CRLF tolerated for Windows-only text
How is this pinned with .gitattributes
How are editor settings shared

It is important to think about encoding and line endings separately. Even if Git normalizes the line endings for you, the encoding accidents remain untouched.

7.6 Don’t Stop at “It Got Garbled” — Say What Is Misaligned

In the field, this rephrasing works wonders.

Bad phrasing: “It got garbled”
Good phrasing: “It looks like a UTF-8 no BOM file is being opened under CP932 assumptions”
Bad phrasing: “The line endings are weird”
Good phrasing: “An LF file is being converted to CRLF, inflating the diff”

Just being able to say what is misaligned changes the speed of the investigation considerably.

8. Drive Mojibake / Line-Ending Investigations with These 5 Questions

When an investigation stalls, returning to these five questions is the fastest route.

What are the bytes of this file right now
- UTF-8?
- UTF-8 with BOM?
- CP932?
- UTF-16LE?
Who wrote it first, under which assumption
- an editor
- a legacy app
- an Excel export
- a shell / script
- a batch job / middleware
Who is reading it now, under which assumption
- the editor’s auto-detect
- the console code page
- the library’s default encoding
- the importer’s specification
What are the BOM and line endings
- BOM present / absent
- CRLF / LF
Has the misread content already been saved
- still display-only?
- already re-saved, with the original bytes lost?

Fill in these five, and the cause usually comes into view.
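The first and fourth questions can be partly answered by looking at the bytes directly. The following is a rough sketch, not a detector: the function name and result keys are made up, and "decodes without error" is only a hint, since many byte sequences are valid in more than one encoding. The line-ending counts assume an ASCII-compatible encoding and are not meaningful for UTF-16 data.

```python
# Collect byte-level hints about a file: BOM, NUL bytes, decodability, line endings.
import codecs

BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def sniff(data: bytes) -> dict:
    """Return hints only; the result narrows candidates, it does not prove one."""
    bom = next((name for mark, name in BOMS if data.startswith(mark)), None)

    decodes_as = []
    for encoding in ("utf-8", "cp932"):
        try:
            data.decode(encoding)
            decodes_as.append(encoding)
        except UnicodeDecodeError:
            pass

    crlf = data.count(b"\r\n")
    return {
        "bom": bom,
        "has_nul": b"\x00" in data,  # NUL bytes often point at UTF-16
        "decodes_as": decodes_as,
        "crlf": crlf,
        "bare_lf": data.count(b"\n") - crlf,
    }


print(sniff("あ\r\n".encode("cp932")))
print(sniff("あ\n".encode("utf-8-sig")))
print(sniff("log\n".encode("utf-16-le")))
```

The remaining questions (who wrote it, who reads it, and whether it was re-saved) cannot be answered from bytes alone and still need the history of the file.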

9. Summary

Windows text encodings and line endings look complicated not because Japanese itself is hard. It is because bytes, encoding, BOM, newline, and tool defaults exist independently — and on top of that, old and new text cultures cohabit on Windows.

The six points especially worth remembering:

Mojibake is the result of reading the same bytes as a different encoding
Line-ending problems are on a separate axis from encodings
Don’t over-trust the words Shift_JIS, CP932, ANSI, Unicode at face value
“We switched to UTF-8” is insufficient — BOM and line endings are needed too
Treat garbled display and already-re-saved data corruption as separate things
In specs, write not text but something like UTF-8 no BOM, LF

In short, when handling text on Windows, the practical mindset is that it is not “a discussion about strings” but a discussion about how to align the contract over bytes.

11. References

Microsoft Learn, Code Page Identifiers - Win32 apps https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
Microsoft Learn, about_Character_Encoding - PowerShell https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.6
Microsoft Learn, Understanding file encoding in VS Code and PowerShell https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding?view=powershell-7.6
W3C Internationalization, Character encodings: Essential concepts https://www.w3.org/International/articles/definitions-characters/
Git documentation, gitattributes https://git-scm.com/docs/gitattributes
Git documentation, git-config https://git-scm.com/docs/git-config

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
