Windows Text Encodings and Line Endings - The Basics of Mojibake and CRLF/LF

In consultations about text handling on Windows, the following questions get tangled together with remarkable frequency.

  • What is the difference between Shift_JIS and UTF-8?
  • Why does mojibake (garbled text) happen?
  • What is the difference between CRLF and LF?
  • We switched to UTF-8, so why is the text still unreadable sometimes?
  • Why does the same file look different in the editor, the console, Excel, and Git?

This does not happen because Japanese is hard. Almost all of it is caused by reading the same bytes under a different assumption, or saving the misread content as is.

On top of that, Windows still hosts both the Unicode world and the code-page world side by side. Add BOMs, line endings, editor auto-detection, the console code page, and Git’s newline conversion, and the whole topic looks complicated.

This article organizes, for practical use, the Shift_JIS / UTF-8 / UTF-16 mix-ups common on Windows, why mojibake happens, the difference between CRLF and LF, and why this topic is so confusing in the first place.

The content is based on Microsoft Learn, PowerShell, Git, and W3C / Unicode public documentation as of April 2026. See the references at the end for details.

Table of Contents

  1. What to Grasp First
  2. Breaking Down the Terminology
    • 2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ
    • 2.2 How to Think About Shift_JIS vs. CP932
    • 2.3 The Traps in the Words ANSI, Unicode, and UTF-8N
  3. Why Mojibake Happens
    • 3.1 What Mojibake Really Is
    • 3.2 Garbled Display and Data Corruption Are Different
    • 3.3 Dropping to Unrepresentable Characters Is Irreversible
  4. What the Line-Ending Differences Are
    • 4.1 CRLF / LF / CR
    • 4.2 Line Endings Are a Separate Problem from Encodings
    • 4.3 \n and the Newline Bytes in the File Are Not Necessarily the Same
  5. Why Windows in Particular Is So Confusing
    • 5.1 Unicode and Legacy Code Pages Coexist
    • 5.2 The Labels Don’t Line Up
    • 5.3 ASCII-Only Content Hides the Problem
    • 5.4 File Contents, File Names, Console, and Source Files Are Separate Layers
    • 5.5 BOM and Line Endings Act on Yet Other Axes
    • 5.6 Tools Change Things on Their Own
  6. Common Accident Patterns
  7. Rules That Reduce Accidents in Practice
  8. Drive Mojibake / Line-Ending Investigations with These 5 Questions
  9. Summary
  10. Related Articles
  11. References

1. What to Grasp First

Putting the conclusions first, these seven points matter most.

  • A text file is not the string itself but bytes + an encoding + a line-ending convention. Sometimes a BOM is attached too.
  • Mojibake happens when the same bytes are decoded as a different encoding.
  • Line-ending trouble happens when the encoding is correct but the assumption about line separators is misaligned.
  • Unicode and UTF-8 do not mean the same thing. Unicode is about the character set; UTF-8 and UTF-16 are about encodings.
  • What people on Windows call “Shift_JIS” is, in practice, better thought of as CP932 / the Windows family of Japanese code pages. Naming it that way keeps conversations from drifting.
  • “We switched to UTF-8” is still an incomplete specification. It only becomes an operating rule once you also decide whether there is a BOM and the line-ending style.
  • The source of the confusion is not Japanese itself, but multiple assumptions with different histories surviving on the same Windows machine.

When handling text on Windows, the starting point is to separate these four:

  1. What are the bytes of this file?
  2. With which encoding was it written?
  3. With which encoding was it read?
  4. Are the line endings CRLF or LF?

Just separating these makes you far less likely to get lost.

2. Breaking Down the Terminology

2.1 How Unicode / UTF-8 / UTF-16 / CP932 Differ

First, it pays to take the words apart once.

Term | What it refers to | Example | Common confusion
Unicode | A framework for representing characters as numbers | U+3042 (あ) | Assumed to be the same as UTF-8
UTF-8 | An encoding that turns Unicode into bytes | E3 81 82 | Assumed to be Unicode itself
UTF-16LE | An encoding that turns Unicode into bytes | 42 30 | Conflated with “Unicode” in menus
CP932 | The Windows Japanese legacy code page | 82 A0 | Assumed to be exactly the same as Shift_JIS
CRLF / LF | The bytes that separate lines | 0D 0A / 0A | Mistaken for a kind of encoding
BOM | Identification bytes at the start of a file | EF BB BF etc. | Mistaken for an encoding name

Even the single character あ has different bytes per encoding.

Character: あ

UTF-8    : E3 81 82
CP932    : 82 A0
UTF-16LE : 42 30

What matters is that characters and bytes are different things. An application appears to handle “characters” on screen, but for saving and transmission it ultimately exchanges bytes. The accidents almost always happen at the boundary of that conversion.
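The byte table above can be reproduced with a few lines of Python. This is a minimal sketch; `hex_bytes` is a hypothetical helper name introduced only for illustration, and `cp932` and `utf-16-le` are the codec names Python's standard library uses for these encodings.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and format the resulting bytes as spaced upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


# One character, three different byte sequences.
for name in ("utf-8", "cp932", "utf-16-le"):
    print(f"{name:<9}: {hex_bytes('あ', name)}")
# utf-8    : E3 81 82
# cp932    : 82 A0
# utf-16-le: 42 30
```

The string `'あ'` is the same in all three calls; only the encoding step at the boundary changes the bytes.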

2.2 How to Think About Shift_JIS vs. CP932

In the field, Japanese Windows text files often get lumped together as “Shift_JIS”. That works for conversation, but for practical work it is a bit sloppy.

If you want to be more precise about Japanese legacy text on Windows, it is safer to think in terms of CP932, or the Windows family of Japanese code pages.

Be sloppy here, and conversations like these drift:

  • You said “save it as Shift_JIS”, but the other party assumed Windows-side CP932
  • The Linux / macOS side treated it as shift_jis, but some Windows-origin files do not round-trip
  • You were told to save as ANSI, but which code page that means is environment-dependent

So in specifications and investigation notes, it is safer to write it as concretely as possible:

  • Not Shift_JIS but CP932
  • Not ANSI but ACP (active code page) / normally CP932 in Japanese environments
  • Not text but something concrete like UTF-8 no BOM, LF
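The gap between the two names can be observed directly. The sketch below assumes Python's standard `shift_jis` and `cp932` codecs as stand-ins for "what a non-Windows tool calls Shift_JIS" and "what Windows actually writes"; other tools may draw the line elsewhere. `try_encode` is a hypothetical helper.

```python
def try_encode(text: str, encoding: str) -> str:
    """Return the encoded bytes as hex, or a note if the codec rejects the text."""
    try:
        return text.encode(encoding).hex(" ").upper()
    except UnicodeEncodeError:
        return "not representable"


# The circled digit is a vendor extension: present in cp932, absent from shift_jis.
print(try_encode("①", "cp932"))       # 87 40
print(try_encode("①", "shift_jis"))   # not representable

# The same two bytes decode to different code points under each codec.
wave = bytes([0x81, 0x60])
print(f"U+{ord(wave.decode('cp932')):04X}")      # U+FF5E (fullwidth tilde)
print(f"U+{ord(wave.decode('shift_jis')):04X}")  # U+301C (wave dash)
```

Both effects are typical reasons why a Windows-origin file fails to round-trip when the other side treats it as plain `shift_jis`.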

2.3 The Traps in the Words ANSI, Unicode, and UTF-8N

Around Windows, the word labels themselves are also a source of confusion.

These three are especially misleading.

  • ANSI: Appears in the Windows UI and older documentation, but it is not ASCII. In most cases it refers to the machine’s active code page.
  • Unicode: In some editors and tools, the Unicode entry in a menu means UTF-16LE. “I saved it as Unicode” does not necessarily mean UTF-8.
  • UTF-8N: Seen in some Japanese-market editors; it is normally a UI label to distinguish UTF-8 without a BOM. It is not an official encoding name.

In short, on Windows, the same word means different things in different tools. This is the first big point of confusion.

3. Why Mojibake Happens

3.1 What Mojibake Really Is

The mechanism behind mojibake is quite simple.

  1. A string is turned into bytes with some encoding
  2. Those bytes are turned back into a string with a different encoding
  3. If the assumptions do not match, a different string comes out

For example, saving あ as UTF-8 produces these bytes:

E3 81 82

Read as UTF-8 this is あ, but read under CP932 assumptions it appears as a different string like 縺�. What is broken at that point is not “the Japanese” but the decoding assumption.

Mojibake in one line:

The same bytes were read as a different encoding.
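The three steps can be reproduced in memory. This is a minimal sketch; `errors="replace"` is used here so that the wrong decode produces visible garbage instead of raising, which is roughly what a lenient editor does.

```python
original = "あ"
stored = original.encode("utf-8")            # bytes on disk: E3 81 82

# Correct assumption: decode with the encoding that was used to write.
print(stored.decode("utf-8"))                # あ

# Wrong assumption: the very same bytes interpreted as CP932.
# errors="replace" substitutes U+FFFD for sequences CP932 cannot decode.
garbled = stored.decode("cp932", errors="replace")
print(garbled)
print(garbled == original)                   # False
```

Note that `stored` never changes. Only the decoding assumption differs between the two reads.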

3.2 Garbled Display and Data Corruption Are Different

What matters here is separating the stage where things can still be recovered from the stage where they barely can.

For example, this flow can still be recovered from:

  1. A UTF-8 file is opened as CP932
  2. On screen it looks like 縺�
  3. Nothing has been saved yet

At this stage the original bytes are still UTF-8. Reopen with the correct encoding and it may come back intact.

The dangerous flow is this one:

  1. A UTF-8 file is misread as CP932
  2. The visibly broken content is saved as is
  3. The original UTF-8 bytes are lost

Once you get this far, it is no longer garbled display but data corruption.

In practice, instead of summarizing everything as “it got garbled”, it is important to at least separate these two questions:

  • Are the bytes themselves still correct?
  • Has the misread content already been re-saved?
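The difference between the two stages can be sketched as follows. The editor behavior is simulated, not taken from any specific product: `misread_then_resave` and `can_undo` are hypothetical helpers, and the assumption is an editor that decodes leniently as CP932 and then saves as UTF-8.

```python
def misread_then_resave(original: bytes) -> bytes:
    """Simulate an editor that decodes UTF-8 bytes as CP932 and saves as UTF-8."""
    shown_on_screen = original.decode("cp932", errors="replace")
    return shown_on_screen.encode("utf-8")


def can_undo(resaved: bytes, original: bytes) -> bool:
    """Try to reverse the mistake: decode as UTF-8, re-encode as CP932."""
    try:
        return resaved.decode("utf-8").encode("cp932") == original
    except UnicodeError:
        return False


original = "あ".encode("utf-8")

# Stage 1: display only. The bytes are untouched, so a correct re-read works.
print(original.decode("utf-8"))        # あ

# Stage 2: the misread text was saved. The file now holds different bytes.
resaved = misread_then_resave(original)
print(resaved != original)             # True
print(can_undo(resaved, original))     # False
```

When every misread byte pair happens to map to a valid CP932 character, the reversal in `can_undo` may succeed, but as soon as a replacement character has been written, the original bytes are gone.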

3.3 Dropping to Unrepresentable Characters Is Irreversible

The other danger is converting a Unicode string down to a narrow code page like CP932.

If the string contains characters that do not exist on the target side, one of the following happens:

  • They become ?
  • A replacement character is inserted
  • The conversion errors out
  • They are mapped to some similar nearby character

For example, some emoji and extended kanji cannot be dropped into CP932 as is. This kind of accident should be judged not by “is it readable” but by “does a round-trip conversion return the original”.

Information once lost does not come back, even if you learn the correct encoding afterwards.
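A round-trip check of this kind is short to write. The sketch below uses a hypothetical helper `round_trips` and Python's `cp932` codec; which characters fail depends on the target code page.

```python
def round_trips(text: str, encoding: str) -> bool:
    """True if text survives encode -> decode under the given encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False


print(round_trips("あ", "cp932"))    # True
print(round_trips("😀", "cp932"))    # False: CP932 has no mapping for this emoji

# errors="replace" makes the loss silent: the emoji becomes a literal "?".
lossy = "A😀".encode("cp932", errors="replace")
print(lossy)                         # b'A?'
```

After the lossy conversion the bytes contain an ordinary `?`, so nothing in the file indicates which character used to be there.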

4. What the Line-Ending Differences Are

4.1 CRLF / LF / CR

Line endings are bytes too.

  • CR = carriage return = 0D
  • LF = line feed = 0A
  • Windows text files traditionally use CRLF (0D 0A)
  • Linux / Unix systems generally use LF (0A)
  • Bare CR can appear in older contexts such as classic Mac OS

In table form:

Line ending | Bytes | Main context
CRLF | 0D 0A | Traditional Windows text files, legacy tools
LF | 0A | Linux / macOS / most development tools
CR | 0D | Rather old legacy data

4.2 Line Endings Are a Separate Problem from Encodings

This is quite important.

The line-ending convention is a separate problem from the encoding.

The same UTF-8 file can have either CRLF or LF line endings. For content of A, a newline, then B, the bytes change like this:

UTF-8 + LF   : 41 0A 42
UTF-8 + CRLF : 41 0D 0A 42

Which means all of these are perfectly possible:

  • UTF-8, but only the line endings differ
  • CP932, but the line endings are LF
  • UTF-16LE, but the line endings are CRLF

So when “we switched to UTF-8 and it’s still wrong”, the reality is sometimes that only the line endings are misaligned, not the encoding.
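That the two axes are independent can be checked by combining them freely. A minimal sketch, with `show` as a hypothetical helper name:

```python
def show(text: str, encoding: str) -> str:
    """Encode text and return the bytes as spaced upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


lf, crlf = "A\nB", "A\r\nB"

print(show(lf, "utf-8"))         # 41 0A 42
print(show(crlf, "utf-8"))       # 41 0D 0A 42
print(show(lf, "cp932"))         # 41 0A 42
print(show(crlf, "utf-16-le"))   # 41 00 0D 00 0A 00 42 00
```

The encoding decides how each character becomes bytes; the line-ending choice decides which characters sit between the lines. Neither constrains the other.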

4.3 \n and the Newline Bytes in the File Are Not Necessarily the Same

From a programmer’s perspective, this is the quietly confusing part.

Writing \n in source code does not guarantee that only 0A lands in the file. In the text mode of a language, runtime, or I/O API, \n may be converted to CRLF on Windows.

Which means the following can all be out of sync:

  • the newline notation in the source code
  • the string at runtime
  • the bytes saved in the file
  • the line endings visible in the editor

This is how the “I’m sure I wrote LF, but the file is CRLF” accident happens.
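Python's text mode is one concrete case of this translation. The sketch below uses the `newline` parameter of the built-in `open()` to make the behavior visible on any platform; `written_bytes` is a hypothetical helper, and other languages and runtimes have their own rules.

```python
import os
import tempfile
from typing import Optional


def written_bytes(text: str, newline: Optional[str]) -> bytes:
    """Write text in text mode with the given newline setting; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-8", newline=newline) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(written_bytes("A\nB", newline="\n"))     # b'A\nB'   : "\n" written as is
print(written_bytes("A\nB", newline="\r\n"))   # b'A\r\nB' : "\n" translated to CRLF
print(written_bytes("A\nB", newline=None))     # platform default line separator
```

With `newline=None`, which is the default, the same source code produces different bytes on Windows and on Linux. Passing `newline` explicitly removes that dependency.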

Modern editors handle bare LF just fine, but surrounding tools, legacy applications, and business operations still carry CRLF assumptions. So line-ending problems are not “ancient history”; they routinely come up in practice today.

5. Why Windows in Particular Is So Confusing

5.1 Unicode and Legacy Code Pages Coexist

This is the biggest reason Windows is complicated.

Windows retains both:

  • the Unicode path
  • the code-page-based path

Newer applications, the web, and cross-platform assets lean toward UTF-8, while older CSVs, TXTs, logs, the Excel ecosystem, and business-system integrations still carry CP932. On top of that, UTF-16LE routinely appears in some outputs and around certain APIs.

In other words, multiple text cultures cohabit inside a single Windows machine.

5.2 The Labels Don’t Line Up

What multiplies the confusion is less the technology than the misaligned labels.

  • It says Shift_JIS, but the reality is CP932
  • It says ANSI, but the reality is the active code page
  • It says Unicode, but the reality is UTF-16LE
  • It says UTF-8, but BOM-or-not has never been pinned down
  • Editor-specific labels like UTF-8N show up

Leave this vague, and the conversation appears to work while the underlying realities never match.

5.3 ASCII-Only Content Hides the Problem

This one is big as well.

Because UTF-8 is compatible with the ASCII range, a file containing only alphanumerics and symbols can “sort of read fine” even under the wrong assumption. On the CP932 side too, the ASCII-equivalent range rarely looks broken, so the problem never surfaces.

The result is a state like this:

  • The English-only config looks fine
  • The moment one line of Japanese goes in, it breaks
  • A problem that lurked all along ignites for the first time in production

This is why encoding accidents tend to look like “it worked until yesterday and suddenly broke today”. In reality, it is usually that the landmine was there all along, and it only became visible the moment non-ASCII characters arrived.
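The hidden landmine is easy to demonstrate. In this minimal sketch, the config line and its key are made up for illustration.

```python
# ASCII-only content: both assumptions give the same text, so nothing looks wrong.
ascii_only = b"name=value\n"
print(ascii_only.decode("utf-8") == ascii_only.decode("cp932"))   # True

# One non-ASCII character is enough to make the two assumptions diverge.
with_japanese = "name=値\n".encode("utf-8")
as_utf8 = with_japanese.decode("utf-8")
as_cp932 = with_japanese.decode("cp932", errors="replace")
print(as_utf8 == as_cp932)                                        # False
```

A test suite that only uses ASCII fixtures passes under both assumptions, which is why including at least one non-ASCII sample in test data tends to surface the mismatch earlier.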

5.4 File Contents, File Names, Console, and Source Files Are Separate Layers

On Windows, lumping all of the following together as “the encoding” gets you lost.

  • file names / paths
  • file contents
  • display in the console
  • the encoding of the source code file itself
  • the string representation at runtime
  • display on the clipboard and in GUI widgets

For example, even if a Japanese file name displays fine, the file’s contents may be saved in CP932. Conversely, even if the file itself is UTF-8, the display alone breaks if the console code page does not match.

Operations like chcp 65001 fundamentally act on the console-side assumption — they do not change the bytes of existing files.

Furthermore, even if the source code file is UTF-8, the log file it writes at runtime is not necessarily UTF-8. You have to separate, every single time, which layer’s encoding you are talking about.

Incidentally, on Japanese Windows the \ character can display as a yen sign, which also tends to get mixed into encoding discussions. In most cases, however, this is a display font / glyph issue: the byte is still 0x5C, and its meaning as a path separator or escape character has not changed.
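Seen from a single Python process, several of these layers can be queried separately. The sketch below only reports what Python itself assumes; `text_layer_report` is a hypothetical helper, and the values depend on the machine, the console, and the Python version.

```python
import locale
import sys


def text_layer_report() -> dict:
    """Collect the encodings Python sees for several independent layers."""
    return {
        # How str paths are converted to and from OS file names.
        "file names": sys.getfilesystemencoding(),
        # What print() encodes to; None if stdout has no encoding attribute.
        "console (stdout)": getattr(sys.stdout, "encoding", None),
        # Locale-based default that open() has historically used
        # when encoding= is omitted.
        "locale default": locale.getpreferredencoding(False),
    }


for layer, encoding in text_layer_report().items():
    print(f"{layer:<17}: {encoding}")
```

On a Japanese Windows machine the three values can differ from one another, which is exactly why "the encoding" has to be qualified by layer.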

5.5 BOM and Line Endings Act on Yet Other Axes

Saying UTF-8 only settles half the question.

In practice, these matter too:

  • Is there a BOM or not
  • Are the line endings CRLF or LF

For example, with the very same UTF-8:

  • some Windows tools read it only if the BOM is present
  • some Unix-side processing ends up with junk characters in the first column when a BOM is present
  • some legacy-side tools struggle with bare LF
  • CRLF can make shell scripts and diffs messy

So even with the encoding matched, you can still have an accident.
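The BOM side of this can be sketched with Python's `utf-8` and `utf-8-sig` codecs. The `key=値` line is made-up sample data.

```python
text = "key=値"
with_bom = text.encode("utf-8-sig")      # EF BB BF followed by the UTF-8 bytes
without_bom = text.encode("utf-8")

print(with_bom[:3].hex(" ").upper())     # EF BB BF
print(with_bom[3:] == without_bom)       # True

# A plain "utf-8" reader keeps the BOM as U+FEFF at the start of the text.
naive = with_bom.decode("utf-8")
print(naive.startswith("\ufeff"))        # True
print(naive.split("=")[0] == "key")      # False: the first field is polluted

# "utf-8-sig" strips a leading BOM if present and also accepts BOM-less input.
print(with_bom.decode("utf-8-sig") == text)      # True
print(without_bom.decode("utf-8-sig") == text)   # True
```

The polluted first field is the "junk characters in the first column" symptom: the decode succeeds, but the first key or header no longer compares equal to what the program expects.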

5.6 Tools Change Things on Their Own

What makes it even trickier is that your local tools change things implicitly.

  • The editor auto-detects
  • A BOM is added / removed at save time
  • Git converts between CRLF and LF
  • Shells and commands save with their default encoding
  • A CSV export uses an unexpected code page
  • Defaults differ between versions of PowerShell or other tools

In other words, even when the operator specified nothing, some layer adds its own assumption — that is everyday Windows practice.

This is the true identity of “it broke even though I changed nothing”. More often than not, it was not a person but a tool’s default that made the change.

6. Common Accident Patterns

The typical accidents, in table form:

Situation | What is actually misaligned | Typical symptom
A legacy Windows tool treats a UTF-8 no BOM config file as ANSI / CP932 | Decoding assumption | Only the Japanese gets garbled
A CP932 CSV is fed into a UTF-8-assuming pipeline | Decoding assumption | � (replacement characters), decode errors, nonsensical Japanese
A UTF-16LE log is fed into Unix-side text tools | Encoding assumption | NUL bytes mixed in; looks binary-ish
An LF source file is converted to CRLF in another environment | Line-ending assumption | Massive line-ending diffs, script malfunctions
Misread content is saved as is | The bytes themselves become something else | Unrecoverable data corruption
The spec just says “output a CSV” | The interface is undefined | Excel reads it, another tool breaks
The only decision is “standardize on UTF-8” | BOM / line endings undefined | Only certain tools fail

The most dangerous pattern of all is seeing the garbled display and then saving, locking the accident in.

7. Rules That Reduce Accidents in Practice

This section covers what to decide as operating rules in order to reduce accidents.

7.1 Decide the Baseline for New Text

For new files, making UTF-8 the first candidate is reasonable. But that alone is not enough.

At minimum, it is safer to decide as far as:

  • UTF-8 with BOM or UTF-8 no BOM
  • Line endings CRLF or LF
  • Who reads this file
  • Is compatibility with legacy Windows tools required
  • Will Linux / macOS / CI / containers read it too

For example, for cross-platform source code and config, UTF-8 no BOM + LF is the likely first choice. On the other hand, if you must align with old Windows tools or existing operations, UTF-8 with BOM or even CP932 + CRLF may still be necessary.

What matters is deciding based on who you exchange data with, not on generic notions of “what is correct”.

7.2 Don’t Convert Existing Legacy Files on a Whim

If an existing file is CP932, it is safer not to UTF-8-ify it as a side effect of a routine small fix.

The safe-side operation is:

  • Existing files keep their original encoding / BOM / line endings
  • Encoding conversion is split out as a separate migration task
  • Bulk-convert only after confirming the conversion targets and the downstream consumers

Mojibake accidents tend to start from well-intentioned “modernizing while we’re at it”.

7.3 Treat Encoding and Line Endings as Part of the Interface

For CSVs, TXTs, logs, config files, and simple protocols, the text format itself is the interface, not just the content.

A spec should state at least:

  • the encoding
  • BOM or not
  • the newline style
  • header or not
  • the quoting / delimiter rules
  • which tool it was validated with

The three letters CSV are not enough. Only once you write UTF-8 with BOM, CRLF, comma delimiter, with header does the conversation stop drifting.
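A spec written that concretely maps almost one-to-one onto code. The following is a minimal sketch using Python's standard `csv` module; `export_csv` and the sample rows are hypothetical, and the chosen format is just the example from the sentence above, not a recommendation for every consumer.

```python
import csv
import os
import tempfile


def export_csv(path: str, header: list, rows: list) -> None:
    """Write a CSV as UTF-8 with BOM, CRLF, comma delimiter, with header."""
    # newline="" stops text mode from translating the writer's own line endings.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f, delimiter=",", lineterminator="\r\n",
                            quoting=csv.QUOTE_MINIMAL)
        writer.writerow(header)
        writer.writerows(rows)


fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    export_csv(path, ["id", "name"], [[1, "東京"], [2, "大阪"]])
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

print(data[:3].hex(" ").upper())   # EF BB BF
print(data.count(b"\r\n"))         # 3
```

Every item of the spec (encoding, BOM, line terminator, delimiter, header) appears as an explicit argument, so none of it is left to a platform default.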

7.4 Be Explicit at Read/Write Boundaries

On the code side too, it is safer not to lean on implicit defaults.

  • Specify the encoding explicitly on file read / write
  • Stay conscious of encoding when passing text between processes
  • Pin the line endings as part of the spec in export / import logic
  • Don’t let a casual shell redirect become a production path

On Windows especially, “it saved” and “it saved with the correct bytes” are not the same thing.
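One way to be explicit on both sides of the boundary is sketched below: a small edit that states the encoding on read and write and leaves line endings untouched. `edit_in_place` is a hypothetical helper, and the sample file content is made up.

```python
import os
import tempfile


def edit_in_place(path: str, encoding: str, old: str, new: str) -> None:
    """Replace text in a file while keeping its encoding and line endings."""
    # newline="" on read returns line endings untranslated;
    # newline="" on write emits them exactly as they appear in the string.
    with open(path, "r", encoding=encoding, newline="") as f:
        content = f.read()
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(content.replace(old, new))


fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write("設定=旧\r\n".encode("cp932"))   # a CP932 + CRLF legacy file
    edit_in_place(path, "cp932", "旧", "新")
    with open(path, "rb") as f:
        result = f.read()
finally:
    os.remove(path)

print(result == "設定=新\r\n".encode("cp932"))   # True
```

Because the default error handling is strict, a wrong `encoding` argument tends to fail loudly at read time instead of silently writing misread content back.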

7.5 Share the Git and Editor Rules Too

Git is not a tool that automatically fixes encodings. Line endings, on the other hand, can get converted.

So it is safer to decide, per repository:

  • Is LF the baseline for source code
  • Is CRLF tolerated for Windows-only text
  • How is this pinned with .gitattributes
  • How are editor settings shared

It is important to think about encoding and line endings separately. Even if Git normalizes the line endings for you, the encoding accidents remain untouched.

7.6 Don’t Stop at “It Got Garbled” — Say What Is Misaligned

In the field, this rephrasing works wonders.

  • Bad phrasing: “It got garbled”
  • Good phrasing: “It looks like a UTF-8 no BOM file is being opened under CP932 assumptions”
  • Bad phrasing: “The line endings are weird”
  • Good phrasing: “An LF file is being converted to CRLF, inflating the diff”

Just being able to say what is misaligned changes the speed of the investigation considerably.

8. Drive Mojibake / Line-Ending Investigations with These 5 Questions

When an investigation stalls, returning to these five questions is the fastest route.

  1. What are the bytes of this file right now
    • UTF-8?
    • UTF-8 with BOM?
    • CP932?
    • UTF-16LE?
  2. Who wrote it first, under which assumption
    • an editor
    • a legacy app
    • an Excel export
    • a shell / script
    • a batch job / middleware
  3. Who is reading it now, under which assumption
    • the editor’s auto-detect
    • the console code page
    • the library’s default encoding
    • the importer’s specification
  4. What are the BOM and line endings
    • BOM present / absent
    • CRLF / LF
  5. Has the misread content already been saved
    • still display-only?
    • already re-saved, with the original bytes lost?

Fill in these five, and the cause usually comes into view.
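Questions 1 and 4 can be partly answered by looking at the raw bytes. The sketch below is a rough first-pass helper, not an encoding detector: `inspect_bytes` is a hypothetical function, the candidate list is limited to UTF-8 and CP932, and the line-ending counts are only meaningful for ASCII-compatible encodings (not UTF-16).

```python
import codecs


def inspect_bytes(data: bytes) -> dict:
    """Report BOM, line-ending counts, and which candidate encodings decode cleanly."""
    boms = [
        ("UTF-8", codecs.BOM_UTF8),
        ("UTF-16LE", codecs.BOM_UTF16_LE),
        ("UTF-16BE", codecs.BOM_UTF16_BE),
    ]
    bom = next((name for name, mark in boms if data.startswith(mark)), None)

    crlf = data.count(b"\r\n")
    decodes = []
    for enc in ("utf-8", "cp932"):
        try:
            data.decode(enc)          # strict: raises on invalid sequences
            decodes.append(enc)
        except UnicodeDecodeError:
            pass

    return {
        "bom": bom,
        "crlf": crlf,
        "lf_only": data.count(b"\n") - crlf,
        "cr_only": data.count(b"\r") - crlf,
        "decodes_as": decodes,
    }


print(inspect_bytes("あ\r\nい\n".encode("utf-8")))
print(inspect_bytes("あ\r\n".encode("cp932")))
```

A clean decode is a hint, not proof: ASCII-only data decodes under both candidates, and some UTF-8 byte sequences also happen to be valid CP932. The remaining questions (who wrote it, who reads it, was it re-saved) still have to be answered from the surrounding process.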

9. Summary

Windows text encodings and line endings look complicated, but not because Japanese itself is hard. The cause is that bytes, encoding, BOM, line endings, and tool defaults vary independently, and on top of that, old and new text cultures cohabit on Windows.

The six points especially worth remembering:

  • Mojibake is the result of reading the same bytes as a different encoding
  • Line-ending problems are on a separate axis from encodings
  • Don’t over-trust the words Shift_JIS, CP932, ANSI, Unicode at face value
  • “We switched to UTF-8” is insufficient — BOM and line endings are needed too
  • Treat garbled display and already-re-saved data corruption as separate things
  • In specs, write not text but something like UTF-8 no BOM, LF

In short, when handling text on Windows, the practical mindset is that it is not “a discussion about strings” but a discussion about how to align the contract over bytes.

11. References

  1. Microsoft Learn, Code Page Identifiers - Win32 apps https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
  2. Microsoft Learn, about_Character_Encoding - PowerShell https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.6
  3. Microsoft Learn, Understanding file encoding in VS Code and PowerShell https://learn.microsoft.com/en-us/powershell/scripting/dev-cross-plat/vscode/understanding-file-encoding?view=powershell-7.6
  4. W3C Internationalization, Character encodings: Essential concepts https://www.w3.org/International/articles/definitions-characters/
  5. Git documentation, gitattributes https://git-scm.com/docs/gitattributes
  6. Git documentation, git-config https://git-scm.com/docs/git-config

Author Profile

Go Komura

Representative of KomuraSoft LLC

Focused on Windows software development, technical consulting, and investigations into failures that are difficult to reproduce.
