Windows Text Encodings and Line Endings - The Basics of Mojibake and CRLF/LF
· Go Komura · Windows, Text Encoding, Mojibake, Line Endings, UTF-8, CP932, PowerShell, Unicode
In consultations about text handling on Windows, the following questions get tangled together with remarkable frequency.
- What is the difference between Shift_JIS and UTF-8
- Why does mojibake (garbled text) happen
- What is the difference between CRLF and LF
- We switched to UTF-8, so why can it still be unreadable
- Why does the same file look different in the editor, the console, Excel, and Git
This does not happen because Japanese is hard. Almost all of it is caused by reading the same bytes under a different assumption, or saving the misread content as is.
On top of that, Windows still hosts both the Unicode world and the code-page world side by side. Add BOMs, line endings, editor auto-detection, the console code page, and Git’s newline conversion, and the whole topic looks complicated.
This article organizes, for practical use, the Shift_JIS / UTF-8 / UTF-16 mix-ups common on Windows, why mojibake happens, the difference between CRLF and LF, and why this topic is so confusing in the first place.
The content is based on Microsoft Learn, PowerShell, Git, and W3C / Unicode public documentation as of April 2026. See the references at the end for details.
Table of Contents
- What to Grasp First
- Breaking Down the Terminology
  - 2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ
  - 2.2 How to Think About Shift_JIS vs. CP932
  - 2.3 The Traps in the Words ANSI, Unicode, and UTF-8N
- Why Mojibake Happens
  - 3.1 What Mojibake Really Is
  - 3.2 Garbled Display and Data Corruption Are Different
  - 3.3 Dropping to Unrepresentable Characters Is Irreversible
- What the Line-Ending Differences Are
  - 4.1 CRLF / LF / CR
  - 4.2 Line Endings Are a Separate Problem from Encodings
  - 4.3 \n and the Newline Bytes in the File Are Not Necessarily the Same
- Why Windows in Particular Is So Confusing
  - 5.1 Unicode and Legacy Code Pages Coexist
  - 5.2 The Labels Don’t Line Up
  - 5.3 ASCII-Only Content Hides the Problem
  - 5.4 File Contents, File Names, Console, and Source Files Are Separate Layers
  - 5.5 BOM and Line Endings Act on Yet Other Axes
  - 5.6 Tools Change Things on Their Own
- Common Accident Patterns
- Rules That Reduce Accidents in Practice
- Drive Mojibake / Line-Ending Investigations with These 5 Questions
- Summary
- Related Articles
- References
1. What to Grasp First
Putting the conclusions first, these seven points matter most.
- A text file is not the string itself but bytes + an encoding + a line-ending convention. Sometimes a BOM is attached too.
- Mojibake happens when the same bytes are decoded as a different encoding.
- Line-ending trouble happens when the encoding is correct but the assumption about line separators is misaligned.
- Unicode and UTF-8 do not mean the same thing. Unicode is about the character set; UTF-8 and UTF-16 are about encodings.
- What Windows people call “Shift_JIS” is, in practice, better thought of as CP932 / the Windows family of Japanese code pages; conversations drift less that way.
- “We switched to UTF-8” is still an incomplete specification. It only becomes an operating rule once you also decide whether there is a BOM and the line-ending style.
- The source of the confusion is not Japanese itself, but multiple assumptions with different histories surviving on the same Windows machine.
When handling text on Windows, the starting point is to separate these four:
- What are the bytes of this file
- With which encoding was it written
- With which encoding was it read
- Are the line endings CRLF or LF
Just separating these makes you far less likely to get lost.
2. Breaking Down the Terminology
2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ
First, it pays to take the words apart once.
| Term | What it refers to | Example | Common confusion |
|---|---|---|---|
| Unicode | A framework for representing characters as numbers | U+3042 (あ) | Assumed to be the same as UTF-8 |
| UTF-8 | An encoding that turns Unicode into bytes | E3 81 82 | Assumed to be Unicode itself |
| UTF-16LE | An encoding that turns Unicode into bytes | 42 30 | Conflated with “Unicode” in menus |
| CP932 | The Windows Japanese legacy code page | 82 A0 | Assumed to be exactly the same as Shift_JIS |
| CRLF / LF | The bytes that separate lines | 0D 0A / 0A | Mistaken for a kind of encoding |
| BOM | Identification bytes at the start of a file | EF BB BF etc. | Mistaken for an encoding name |
Even the single character あ has different bytes per encoding.
Character: あ
UTF-8 : E3 81 82
CP932 : 82 A0
UTF-16LE : 42 30
What matters is that characters and bytes are different things. An application appears to handle “characters” on screen, but for saving and transmission it ultimately exchanges bytes. The accidents almost always happen at the boundary of that conversion.
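As a minimal sketch of the character-versus-bytes distinction, the following Python snippet encodes the same character with the three encodings from the table and prints the resulting bytes. Python is used here only for illustration, and the helper name hex_bytes is hypothetical.

```python
# Minimal sketch: one character, three different byte sequences.
# hex_bytes is a hypothetical helper introduced only for this illustration.

def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()

for name in ("utf-8", "cp932", "utf-16-le"):
    # The character is identical each time; only the encoding changes.
    print(f"{name:10}: {hex_bytes('あ', name)}")
```

The string passed in never changes; the only variable is the encoding name, which is exactly the piece of information that gets lost when a file is handed over as "just text".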
2.2 How to Think About Shift_JIS vs. CP932
In the field, Japanese Windows text files often get lumped together as “Shift_JIS”. That works for conversation, but for practical work it is a bit sloppy.
If you want to be more precise about Japanese legacy text on Windows, it is safer to think CP932, or the Windows family of Japanese code pages.
Be sloppy here, and conversations like these drift:
- You said “save it as Shift_JIS”, but the other party assumed Windows-side CP932
- The Linux / macOS side treated it as shift_jis, but some Windows-origin files do not round-trip
- You were told to save as ANSI, but which code page that means is environment-dependent
So in specifications and investigation notes, it is safer to write it as concretely as possible:
- Not Shift_JIS but CP932
- Not ANSI but ACP (active code page) / normally CP932 in Japanese environments
- Not text but something concrete like UTF-8 no BOM, LF
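One way to see why the label matters is to compare two codecs that both sound like "Shift_JIS". The sketch below assumes Python's standard codecs, where shift_jis is the stricter JIS-based mapping and cp932 includes the Windows vendor extensions. The helper can_encode is a hypothetical name, and other runtimes may draw the line differently.

```python
# Sketch: two codecs from the same label family disagree on a Windows-origin character.
# Assumption: Python's "shift_jis" follows the stricter JIS mapping, while "cp932"
# includes the Windows vendor extensions. can_encode is a hypothetical helper.

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "①"  # circled digit one, common in Windows-origin Japanese text
print("cp932    :", can_encode(sample, "cp932"))
print("shift_jis:", can_encode(sample, "shift_jis"))
```

A file containing such a character can be perfectly valid on the Windows side and still fail on a consumer that took "Shift_JIS" literally.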
2.3 The Traps in the Words ANSI, Unicode, and UTF-8N
Around Windows, the word labels themselves are also a source of confusion.
These three are especially misleading.
- ANSI: Appears in the Windows UI and older documentation, but it is not ASCII. In most cases it refers to the machine’s active code page.
- Unicode: In some editors and tools, the Unicode entry in a menu means UTF-16LE. “I saved it as Unicode” does not necessarily mean UTF-8.
- UTF-8N: Seen in some Japanese-market editors; it is normally a UI label to distinguish UTF-8 without a BOM. It is not an official encoding name.
In short, on Windows, the same word means different things in different tools. This is the first big point of confusion.
3. Why Mojibake Happens
3.1 What Mojibake Really Is
Mojibake itself is quite simple.
- A string is turned into bytes with some encoding
- Those bytes are turned back into a string with a different encoding
- If the assumptions do not match, a different string comes out
For example, saving あ as UTF-8 produces these bytes:
E3 81 82
Read as UTF-8 this is あ, but read under CP932 assumptions it appears as a different string like 縺�.
What is broken at that point is not “the Japanese” but the decoding assumption.
Mojibake in one line:
The same bytes were read as a different encoding.
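The same experiment can be sketched in Python. The bytes are produced once and then decoded under two assumptions. The errors="replace" argument is a choice made for this illustration so that the undecodable trailing byte becomes a replacement character instead of raising an exception.

```python
# Sketch: the bytes stay the same; only the decoding assumption changes.
data = "あ".encode("utf-8")  # the three bytes E3 81 82

as_utf8 = data.decode("utf-8")                     # matching assumption
as_cp932 = data.decode("cp932", errors="replace")  # mismatched assumption

print(repr(as_utf8))
print(repr(as_cp932))
```

Nothing in data changed between the two decode calls, which is the whole point: the damage is in the interpretation, not yet in the bytes.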
3.2 Garbled Display and Data Corruption Are Different
What matters here is separating the stage where things can still be recovered from the stage where they barely can.
For example, this flow can still be recovered from:
- A UTF-8 file is opened as CP932
- On screen it looks like 縺�
- Nothing has been saved yet
At this stage the original bytes are still UTF-8. Reopen with the correct encoding and it may come back intact.
The dangerous flow is this one:
- A UTF-8 file is misread as CP932
- The visibly broken content is saved as is
- The original UTF-8 bytes are lost
Once you get this far, it is no longer garbled display but data corruption.
In practice, instead of summarizing everything as “it got garbled”, it is important to at least separate these two questions:
- Are the bytes themselves still correct
- Has the misread content already been re-saved
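The two stages can be sketched in memory with Python. This is an illustration under the assumption that the misread text is re-saved as UTF-8; the variable names are hypothetical and a real editor may behave differently in detail.

```python
# Sketch: display-only misread versus re-saved corruption.
original = "あ".encode("utf-8")

# Stage 1: misread for display only. The original bytes are untouched.
shown = original.decode("cp932", errors="replace")
still_recoverable = original.decode("utf-8") == "あ"

# Stage 2: the misread string is saved. The stored bytes are now different.
resaved = shown.encode("utf-8")
bytes_changed = resaved != original

# Attempting to undo stage 2 by reversing the misread fails here, because the
# replacement character produced during the misread has no CP932 representation.
try:
    shown.encode("cp932").decode("utf-8")
    undo_worked = True
except UnicodeError:
    undo_worked = False

print(still_recoverable, bytes_changed, undo_worked)
```

As long as only stage 1 has happened, reopening with the right encoding is enough. After stage 2, the information needed to reverse the mistake may already be gone.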
3.3 Dropping to Unrepresentable Characters Is Irreversible
The other danger is converting a Unicode string down to a narrow code page like CP932.
If the string contains characters that do not exist on the target side, one of the following happens:
- They become ?
- A replacement character is inserted
- The conversion errors out
- They are mapped to some similar nearby character
For example, some emoji and extended kanji cannot be dropped into CP932 as is. This kind of accident should be judged not by “is it readable” but by “does a round-trip conversion return the original”.
Information once lost does not come back, even if you learn the correct encoding afterwards.
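A round-trip check of this kind can be sketched in a few lines of Python. The function name survives_round_trip is hypothetical, and the behavior shown is that of Python's cp932 codec; other converters may substitute characters differently.

```python
# Sketch: judge a down-conversion by whether the round trip returns the original.
# survives_round_trip is a hypothetical helper name.

def survives_round_trip(text: str, encoding: str = "cp932") -> bool:
    """Return True if text can be encoded and decoded back without change."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False

print(survives_round_trip("あ"))   # exists in CP932
print(survives_round_trip("😀"))  # emoji, not representable in CP932

# With errors="replace" the conversion "succeeds", but the character is gone.
lossy = "😀".encode("cp932", errors="replace")
print(lossy)
```

Running such a check before a bulk conversion gives a concrete list of characters that would be lost, instead of discovering them afterwards.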
4. What the Line-Ending Differences Are
4.1 CRLF / LF / CR
Line endings are bytes too.
- CR = carriage return = 0D
- LF = line feed = 0A
- Windows text files traditionally use CRLF (0D 0A)
- Linux / Unix systems generally use LF (0A)
- Bare CR can appear in older contexts such as classic Mac
In table form:
| Line ending | Bytes | Main context |
|---|---|---|
| CRLF | 0D 0A | Traditional Windows text files, legacy tools |
| LF | 0A | Linux / macOS / most development tools |
| CR | 0D | Rather old legacy data |
4.2 Line Endings Are a Separate Problem from Encodings
This is quite important.
The line-ending convention is a separate problem from the encoding.
The same UTF-8 file can have either CRLF or LF line endings.
For content of A, a newline, then B, the bytes change like this:
UTF-8 + LF : 41 0A 42
UTF-8 + CRLF : 41 0D 0A 42
Which means all of these are perfectly possible:
- UTF-8, but only the line endings differ
- CP932, but the line endings are LF
- UTF-16LE, but the line endings are CRLF
So when “we switched to UTF-8 and it’s still wrong”, the reality is sometimes that only the line endings are misaligned, not the encoding.
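Because line endings are just bytes, they can be counted without knowing the encoding, at least for ASCII-compatible encodings. The sketch below assumes UTF-8 or CP932 input (it does not apply to UTF-16, where each line-ending unit is two bytes), and count_line_endings is a hypothetical helper name.

```python
# Sketch: count line-ending styles directly on the bytes.
# Assumption: an ASCII-compatible encoding such as UTF-8 or CP932, not UTF-16.

def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LF bytes, and bare CR bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,  # LF bytes not preceded by CR
        "CR": data.count(b"\r") - crlf,  # CR bytes not followed by LF
    }

lf_version = "A\nB".encode("utf-8")      # 41 0A 42
crlf_version = "A\r\nB".encode("utf-8")  # 41 0D 0A 42

print(count_line_endings(lf_version))
print(count_line_endings(crlf_version))
```

A file that reports both CRLF and bare LF counts above zero has mixed line endings, which is a common source of noisy diffs.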
4.3 \n and the Newline Bytes in the File Are Not Necessarily the Same
From a programmer’s perspective, this is the quietly confusing part.
Writing \n in source code does not guarantee that only 0A lands in the file.
In the text mode of a language, runtime, or I/O API, \n may be converted to CRLF on Windows.
Which means the following can all be out of sync:
- the newline notation in the source code
- the string at runtime
- the bytes saved in the file
- the line endings visible in the editor
This is how the “I’m sure I wrote LF, but the file is CRLF” accident happens.
Modern editors handle bare LF just fine, but surrounding tools, legacy applications, and business operations still carry CRLF assumptions.
So line-ending problems are not “ancient history” — they routinely come up in practice today.
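Python is one concrete case of this behavior: in text mode, open() with the default newline setting translates "\n" to the platform's line separator on write. A minimal sketch that pins the result by passing newline explicitly follows; the helper write_and_dump and the file name sample.txt are hypothetical.

```python
# Sketch: the same "\n" in the program, different bytes in the file,
# depending on the newline policy passed to open().
import tempfile
from pathlib import Path

def write_and_dump(text: str, newline: str) -> bytes:
    """Write text with an explicit newline policy and return the raw file bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"  # hypothetical file name
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        return path.read_bytes()

print(write_and_dump("A\nB", "\n"))    # "\n" is written unchanged
print(write_and_dump("A\nB", "\r\n"))  # each "\n" is written as CR LF
```

Passing newline explicitly makes the output independent of the operating system the script happens to run on.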
5. Why Windows in Particular Is So Confusing
5.1 Unicode and Legacy Code Pages Coexist
This is the biggest reason Windows is complicated.
Windows retains both:
- the Unicode path
- the code-page-based path
Newer applications, the web, and cross-platform assets lean toward UTF-8, while older CSVs, TXTs, logs, the Excel ecosystem, and business-system integrations still carry CP932. On top of that, UTF-16LE routinely appears in some outputs and around certain APIs.
In other words, multiple text cultures cohabit inside a single Windows machine.
5.2 The Labels Don’t Line Up
What multiplies the confusion is less the technology than the misaligned labels.
- It says Shift_JIS, but the reality is CP932
- It says ANSI, but the reality is the active code page
- It says Unicode, but the reality is UTF-16LE
- It says UTF-8, but BOM-or-not has never been pinned down
- Editor-specific labels like UTF-8N show up
Leave this vague, and the conversation appears to work while the underlying realities never match.
5.3 ASCII-Only Content Hides the Problem
This one is big as well.
Because UTF-8 is compatible with the ASCII range, a file containing only alphanumerics and symbols can “sort of read fine” even under the wrong assumption. On the CP932 side too, the ASCII-equivalent range rarely looks broken, so the problem never surfaces.
The result is a state like this:
- The English-only config looks fine
- The moment one line of Japanese goes in, it breaks
- A problem that lurked all along ignites for the first time in production
This is why encoding accidents tend to look like “it worked until yesterday and suddenly broke today”. In reality, it is usually that the landmine was there all along, and it only became visible the moment non-ASCII characters arrived.
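The hiding effect can be sketched in Python by decoding the same bytes under both assumptions and comparing the results. The helper same_under_both and the sample config lines are hypothetical.

```python
# Sketch: ASCII-only bytes read identically under UTF-8 and CP932,
# so a wrong assumption stays invisible until non-ASCII text arrives.

def same_under_both(data: bytes) -> bool:
    """Return True if the bytes decode to the same text as UTF-8 and as CP932."""
    as_utf8 = data.decode("utf-8", errors="replace")
    as_cp932 = data.decode("cp932", errors="replace")
    return as_utf8 == as_cp932

ascii_config = "name=server01\n".encode("utf-8")       # hypothetical config line
japanese_config = "name=サーバー01\n".encode("utf-8")  # same line with Japanese

print(ascii_config.isascii(), same_under_both(ascii_config))
print(japanese_config.isascii(), same_under_both(japanese_config))
```

A test suite that only uses ASCII data exercises the first case and never the second, which is why adding at least one non-ASCII sample to test data tends to surface these problems early.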
5.4 File Contents, File Names, Console, and Source Files Are Separate Layers
On Windows, lumping all of the following together as “the encoding” gets you lost.
- file names / paths
- file contents
- display in the console
- the encoding of the source code file itself
- the string representation at runtime
- display on the clipboard and in GUI widgets
For example, even if a Japanese file name displays fine, the file’s contents may be saved in CP932. Conversely, even if the file itself is UTF-8, the display alone breaks if the console code page does not match.
Operations like chcp 65001 fundamentally act on the console-side assumption — they do not change the bytes of existing files.
Furthermore, even if the source code file is UTF-8, the log file it writes at runtime is not necessarily UTF-8. You have to separate, every single time, which layer’s encoding you are talking about.
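To make the layers tangible, the following Python sketch prints the encoding that three different layers report in the current process. The values are environment-dependent, so no particular output is implied, and the dictionary labels are descriptive names chosen for this illustration.

```python
# Sketch: several "encodings" coexist in one process, each belonging to a different layer.
import locale
import sys

layers = {
    # Encoding used when writing text to the console or a redirected stdout.
    "console output": getattr(sys.stdout, "encoding", None),
    # Encoding used to convert file names between str and the OS representation.
    "file names": sys.getfilesystemencoding(),
    # Locale-derived encoding, typically what open() uses when encoding= is omitted.
    "locale default": locale.getpreferredencoding(False),
}

for layer, encoding in layers.items():
    print(f"{layer:15}: {encoding}")
```

If these three values differ on a machine, that alone explains many cases where a file name displays correctly while file contents or console output do not.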
Incidentally, on Japanese Windows the \ character can display as a yen sign, which also tends to get mixed into encoding discussions.
In most cases, however, this is a display font / glyph issue — the meaning as a path separator or escape character has not changed.
5.5 BOM and Line Endings Act on Yet Other Axes
Saying UTF-8 only settles half the question.
In practice, these matter too:
- Is there a BOM or not
- Are the line endings CRLF or LF
For example, with the very same UTF-8:
- some Windows tools read it only if the BOM is present
- some Unix-side processing ends up with junk characters in the first column when a BOM is present
- some legacy-side tools struggle with bare LF
- CRLF can make shell scripts and diffs messy
So even with the encoding matched, you can still have an accident.
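The "junk in the first column" symptom can be reproduced with a short Python sketch. The detect_bom helper is a hypothetical name and deliberately ignores UTF-32. The two decode calls show that the plain utf-8 codec keeps the BOM as a character, while utf-8-sig strips it.

```python
# Sketch: detect a leading BOM and show how it leaks into decoded text.
import codecs

def detect_bom(data: bytes):
    """Return a label for a leading BOM, or None. UTF-32 is not handled here."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE BOM"
    return None

with_bom = codecs.BOM_UTF8 + "A".encode("utf-8")  # EF BB BF 41

kept = with_bom.decode("utf-8")          # the BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")  # the BOM is removed

print(detect_bom(with_bom), repr(kept), repr(stripped))
```

A consumer that compares the first field of a file against an expected header name will fail on the kept variant even though the file looks identical in most editors.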
5.6 Tools Change Things on Their Own
What makes it even trickier is that your local tools change things implicitly.
- The editor auto-detects
- A BOM is added / removed at save time
- Git converts between CRLF and LF
- A CSV export uses an unexpected code page
- Defaults differ between versions of PowerShell or other tools
In other words, even when the operator specified nothing, some layer adds its own assumption — that is everyday Windows practice.
This is the true identity of “it broke even though I changed nothing”. More often than not, it was not a person but a tool’s default that made the change.
6. Common Accident Patterns
The typical accidents, in table form:
| Situation | What is actually misaligned | Typical symptom |
|---|---|---|
| A legacy Windows tool treats a UTF-8 no BOM config file as ANSI / CP932 | Decoding assumption | Only the Japanese gets garbled |
| A CP932 CSV is fed into a UTF-8-assuming pipeline | Decoding assumption | �, decode errors, nonsensical Japanese |
| A UTF-16LE log is fed into Unix-side text tools | Encoding assumption | NUL bytes mixed in; looks binary-ish |
| An LF source file is converted to CRLF in another environment | Line-ending assumption | Massive line-ending diffs, script malfunctions |
| Misread content is saved as is | The bytes themselves become something else | Unrecoverable data corruption |
| The spec just says “output a CSV” | The interface is undefined | Excel reads it, another tool breaks |
| The only decision is “standardize on UTF-8” | BOM / line endings undefined | Only certain tools fail |
The most dangerous pattern of all is seeing the garbled display and then saving, locking the accident in.
7. Rules That Reduce Accidents in Practice
This section covers what to decide as operating rules in order to reduce accidents.
7.1 Decide the Baseline for New Text
For new files, making UTF-8 the first candidate is reasonable. But that alone is not enough.
At minimum, it is safer to decide as far as:
- UTF-8 with BOM or UTF-8 no BOM
- Line endings CRLF or LF
- Is compatibility with legacy Windows tools required
- Will Linux / macOS / CI / containers read it too
For example, for cross-platform source code and config, UTF-8 no BOM + LF is the likely first choice. On the other hand, if you must align with old Windows tools or existing operations, UTF-8 with BOM or even CP932 + CRLF may still be necessary.
What matters is deciding based on who you exchange data with, not on generic notions of “what is correct”.
7.2 Don’t Convert Existing Legacy Files on a Whim
If an existing file is CP932, it is safer not to UTF-8-ify it as a side effect of a routine small fix.
The safe-side operation is:
- Existing files keep their original encoding / BOM / line endings
- Encoding conversion is split out as a separate migration task
- Bulk-convert only after confirming the conversion targets and the downstream consumers
Mojibake accidents tend to start from well-intentioned “modernizing while we’re at it”.
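A small fix that leaves the format alone can be sketched in Python by working on bytes and decoding with the file's known encoding. This is a sketch under the assumption that the encoding is already known; the function name replace_preserving_format and the file name legacy.txt are hypothetical.

```python
# Sketch: apply a small text fix while keeping the encoding, BOM state,
# and line endings exactly as they were.
import tempfile
from pathlib import Path

def replace_preserving_format(path: Path, old: str, new: str, encoding: str) -> None:
    """Replace text in a file without changing its encoding or line endings."""
    raw = path.read_bytes()
    text = raw.decode(encoding)  # strict decode: fail loudly on a wrong assumption
    # Byte-level read and write avoid any newline translation by the runtime.
    path.write_bytes(text.replace(old, new).encode(encoding))

with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.txt"  # hypothetical CP932 + CRLF file
    legacy.write_bytes("名前=旧\r\n".encode("cp932"))
    replace_preserving_format(legacy, "旧", "新", "cp932")
    result = legacy.read_bytes()

print(result == "名前=新\r\n".encode("cp932"))
```

Because both decode and encode are strict, a wrong encoding assumption or a replacement character that does not exist in the target encoding raises an error instead of silently damaging the file.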
7.3 Treat Encoding and Line Endings as Part of the Interface
For CSVs, TXTs, logs, config files, and simple protocols, the text format itself is the interface, not just the content.
A spec should state at least:
- the encoding
- BOM or not
- the newline style
- header or not
- the quoting / delimiter rules
- which tool it was validated with
The three letters CSV are not enough.
Only once you write UTF-8 with BOM, CRLF, comma delimiter, with header does the conversation stop drifting.
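As an illustration of turning that one-line spec into code, the sketch below writes a CSV with Python's csv module according to "UTF-8 with BOM, CRLF, comma delimiter, with header". The function name, file name, and sample rows are hypothetical.

```python
# Sketch: a CSV writer whose arguments mirror the spec
# "UTF-8 with BOM, CRLF, comma delimiter, with header".
import csv
import tempfile
from pathlib import Path

def write_csv_utf8_bom_crlf(path: Path, header: list, rows: list) -> None:
    # "utf-8-sig" prepends the UTF-8 BOM; newline="" stops Python from
    # translating line endings, so the csv module's "\r\n" is written as is.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\r\n")
        writer.writerow(header)
        writer.writerows(rows)

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "export.csv"  # hypothetical output file
    write_csv_utf8_bom_crlf(out, ["id", "name"], [[1, "山田"]])
    raw = out.read_bytes()

print(raw)
```

Every term of the spec maps to one explicit argument, so a reviewer can check the code against the spec line by line.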
7.4 Be Explicit at Read/Write Boundaries
On the code side too, it is safer not to lean on implicit defaults.
- Specify the encoding explicitly on file read / write
- Stay conscious of encoding when passing text between processes
- Pin the line endings as part of the spec in export / import logic
- Don’t let a casual shell redirect become a production path
On Windows especially, “it saved” and “it saved with the correct bytes” are not the same thing.
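For the process boundary in particular, a Python sketch looks like the following. It launches a child Python process and pins the encoding on both sides: PYTHONIOENCODING for the child's output and encoding= for the parent's decoding. The child program here is a stand-in for any external tool, and real tools need their own documented switch for output encoding.

```python
# Sketch: pin the encoding on both ends of a pipe instead of relying on defaults.
import os
import subprocess
import sys

# The child prints あ; it is written as an escape so the command line stays ASCII.
child_code = "print('\\u3042')"

# Child side: force UTF-8 for its stdout regardless of the console code page.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    env=env,
    encoding="utf-8",  # parent side: decode the captured bytes as UTF-8
    check=True,
)
received = result.stdout.strip()
print(repr(received))
```

If either side were left at its default, the outcome would depend on the machine's locale or console settings rather than on the code.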
7.5 Share the Git and Editor Rules Too
Git is not a tool that automatically fixes encodings. Line endings, on the other hand, can get converted.
So it is safer to decide, per repository:
- Is LF the baseline for source code
- Is CRLF tolerated for Windows-only text
- How is this pinned with .gitattributes
- How are editor settings shared
It is important to think about encoding and line endings separately. Even if Git normalizes the line endings for you, the encoding accidents remain untouched.
7.6 Don’t Stop at “It Got Garbled” — Say What Is Misaligned
In the field, this rephrasing works wonders.
- Bad phrasing: “It got garbled”
- Good phrasing: “It looks like a UTF-8 no BOM file is being opened under CP932 assumptions”
- Bad phrasing: “The line endings are weird”
- Good phrasing: “An LF file is being converted to CRLF, inflating the diff”
Just being able to say what is misaligned changes the speed of the investigation considerably.
8. Drive Mojibake / Line-Ending Investigations with These 5 Questions
When an investigation stalls, returning to these five questions is the fastest route.
- What are the bytes of this file right now
  - UTF-8?
  - UTF-8 with BOM?
  - CP932?
  - UTF-16LE?
- Who wrote it first, under which assumption
  - an editor
  - a legacy app
  - an Excel export
  - a shell / script
  - a batch job / middleware
- Who is reading it now, under which assumption
  - the editor’s auto-detect
  - the console code page
  - the library’s default encoding
  - the importer’s specification
- What are the BOM and line endings
  - BOM present / absent
  - CRLF / LF
- Has the misread content already been saved
  - still display-only?
  - already re-saved, with the original bytes lost?
Fill in these five, and the cause usually comes into view.
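The first and fourth questions can be partly answered mechanically. The following Python sketch gathers byte-level evidence from a file's contents; the function name triage and its result keys are hypothetical. A successful strict decode is evidence rather than proof, since some byte sequences are valid under more than one encoding.

```python
# Sketch: collect byte-level evidence about a file before guessing its encoding.
import codecs

def triage(data: bytes) -> dict:
    """Report BOM, strict-decode results, NUL bytes, and line-ending counts."""
    def decodes(encoding: str) -> bool:
        try:
            data.decode(encoding)
            return True
        except UnicodeDecodeError:
            return False

    crlf = data.count(b"\r\n")
    return {
        "utf8_bom": data.startswith(codecs.BOM_UTF8),
        "utf16le_bom": data.startswith(codecs.BOM_UTF16_LE),
        "valid_utf8": decodes("utf-8"),
        "valid_cp932": decodes("cp932"),
        "has_nul": b"\x00" in data,  # a hint that the data may be UTF-16
        "crlf": crlf,
        "bare_lf": data.count(b"\n") - crlf,
    }

print(triage("あ\r\n".encode("utf-8")))
print(triage("あ\n".encode("cp932")))
print(triage("あ\n".encode("utf-16-le")))
```

The remaining questions (who wrote it, who reads it, whether it was re-saved) still need a human answer, but starting from these facts keeps the discussion anchored to the actual bytes.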
9. Summary
Windows text encodings and line endings look complicated not because Japanese itself is hard. It is because bytes, encoding, BOM, newline, and tool defaults exist independently — and on top of that, old and new text cultures cohabit on Windows.
The six points especially worth remembering:
- Mojibake is the result of reading the same bytes as a different encoding
- Line-ending problems are on a separate axis from encodings
- Don’t over-trust the words Shift_JIS, CP932, ANSI, Unicode at face value
- “We switched to UTF-8” is insufficient; BOM and line endings are needed too
- Treat garbled display and already-re-saved data corruption as separate things
- In specs, write not text but something like UTF-8 no BOM, LF
In short, when handling text on Windows, the practical mindset is that it is not “a discussion about strings” but a discussion about how to align the contract over bytes.
10. Related Articles
- Sorting Out Windows Text Encodings - Why Mojibake Happens, and What Goes Wrong Especially in Combination with Linux
- Best Practices for Reducing Codex Mojibake Accidents on Windows - Decide ‘How to Instruct’ Before Tuning the Environment
11. References
- Microsoft Learn, Code Page Identifiers - Win32 apps https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
- Microsoft Learn, about_Character_Encoding - PowerShell https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.6
- Microsoft Learn, Understanding file encoding in VS Code and PowerShell https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding?view=powershell-7.6
- W3C Internationalization, Character encodings: Essential concepts https://www.w3.org/International/articles/definitions-characters/
- Git documentation, gitattributes https://git-scm.com/docs/gitattributes
- Git documentation, git-config https://git-scm.com/docs/git-config
Author Profile
Go Komura
Representative of KomuraSoft LLC
Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.