What is a character encoding? UTF-8, ASCII, and how computers store text

A character encoding is a mapping between human-readable characters and the bytes a computer stores. Without one, a byte value of 233 (0xE9) could mean é (Latin-1), the first byte of a 2-byte Japanese character (Shift-JIS), or the first byte of a 3-byte sequence (UTF-8). The encoding tells the computer which interpretation is correct.
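
To make the ambiguity concrete, here is a minimal Python sketch that decodes the same two bytes under three encodings. The helper name `interpretations` and the choice of sample bytes are illustrative assumptions, not part of any tool described on this page.

```python
# The UTF-8 bytes for "é" (0xC3 0xA9), read back under three encodings.
data = "é".encode("utf-8")


def interpretations(raw):
    """Decode the same bytes under several encodings (hypothetical helper)."""
    encodings = ("utf-8", "latin-1", "shift_jis")
    # errors="replace" keeps the sketch from raising on invalid sequences.
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


for name, text in interpretations(data).items():
    print(f"{name:10} {text!r}")
```

Only the UTF-8 reading gives back the original character. The other two produce different, equally "valid" text, which is why the encoding has to be known rather than guessed.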

How this is calculated

Before Unicode, most languages and regions had their own encodings: Latin-1 for Western European languages, Shift-JIS for Japanese, GB2312 for Simplified Chinese. A file written in one encoding would render as gibberish (mojibake) when interpreted as another. Unicode solved this by assigning a unique number (code point) to every character across all writing systems. UTF-8 is the most popular Unicode encoding because it is backwards-compatible with ASCII (every ASCII file is already valid UTF-8), uses one byte per character for ASCII text, and can represent every Unicode code point in one to four bytes. UTF-16 is used internally by Windows and Java. UTF-32 is rarely used for storage or transmission because it spends four bytes on every code point, including plain ASCII letters.
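
As a rough illustration of the space trade-offs between the three Unicode encodings, the following Python sketch reports the code point and encoded size of a few sample characters. The function name `encoded_sizes` and the sample characters are assumptions chosen for the example.

```python
def encoded_sizes(ch):
    """Return the code point and byte length of one character per encoding."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variants skip the byte order mark, so only the
        # character's own bytes are counted.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in ("A", "é", "€", "😀"):
    print(ch, encoded_sizes(ch))

# ASCII text is byte-for-byte identical in UTF-8.
assert "A".encode("utf-8") == "A".encode("ascii")
```

UTF-8 grows from one to four bytes as the code point gets larger, UTF-16 uses two or four, and UTF-32 always uses four.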

Verdict

UTF-8 is the answer. Always. Unless you're maintaining a legacy system with a specific encoding requirement, every new file, API, and database column should be UTF-8. The encoding wars ended. UTF-8 won.

Frequently asked questions

How do I convert text to Base64?
Paste your string into the Text field and the Base64 output appears instantly. The tool uses standard Base64 (RFC 4648), so the output matches every major language's built-in Base64 encoder and Linux's base64 command (which, by default, also wraps its output at 76 characters).
What's the difference between Base64 and hex encoding?
Both represent binary data as text, but with different alphabets. Base64 uses 64 characters and needs roughly 4 chars per 3 bytes (33% overhead). Hex uses 16 characters and needs exactly 2 chars per byte (100% overhead). Base64 is denser, while hex is easier to read byte by byte.
Why does my UTF-8 text break when converted to binary?
UTF-8 encodes non-ASCII characters as multibyte sequences, so a single emoji or accented letter becomes 2-4 bytes. The binary output will be longer than the character count suggests; that's correct behavior, not a bug.
Is it safe to paste sensitive data into the converter?
Yes. The encoding conversion runs entirely in your browser with JavaScript; nothing is sent to our servers, logged, or stored. You can verify this with your browser's Network tab: no requests fire when you type.
What is URL-safe Base64?
A variant that replaces `+` with `-` and `/` with `_` so the result can be safely placed in URLs without percent-encoding. JWTs use URL-safe Base64, with the trailing `=` padding omitted. Standard Base64 is fine for most other uses.
Can I decode Base64 back to the original text?
Yes, the converter is bidirectional. Paste Base64 into the Base64 field and you'll get the original UTF-8 string back. If decoding fails silently, the input isn't valid Base64 (wrong characters or bad padding). If the result is another Base64-looking string rather than readable text, the input was double-encoded and needs to be decoded again.
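
The Base64 answers above can be reproduced outside the browser. The following is a minimal Python sketch using the standard library's `base64` module; it is not the converter's own code, and the helper names (`to_base64`, `from_base64`, `to_hex`, `is_valid_base64`) are hypothetical.

```python
import base64
import binascii


def to_base64(text):
    """Encode a string as UTF-8 bytes, then as standard Base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_base64(encoded):
    """Decode standard Base64 back to a UTF-8 string."""
    # validate=True rejects characters outside the Base64 alphabet
    # instead of silently discarding them.
    return base64.b64decode(encoded, validate=True).decode("utf-8")


def to_hex(text):
    """Encode a string as UTF-8 bytes, then as two hex digits per byte."""
    return text.encode("utf-8").hex()


def is_valid_base64(encoded):
    """Return True if the input decodes as standard Base64."""
    try:
        base64.b64decode(encoded, validate=True)
        return True
    except binascii.Error:
        return False


# Two bytes chosen so that the standard alphabet produces "+" and "/".
raw = b"\xfb\xff"
standard = base64.b64encode(raw).decode("ascii")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(to_base64("hello"), to_hex("hello"))
print(standard, url_safe)
print(from_base64(to_base64("héllo 😀")))
print(is_valid_base64("not base64!"))
```

For the five ASCII bytes of "hello", Base64 needs 8 characters and hex needs 10, which matches the overhead figures above. Note that Python's `urlsafe_b64encode` keeps the `=` padding, so formats that omit it need the padding stripped separately.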