What is a character encoding? UTF-8, ASCII, and how computers store text
A character encoding is a mapping between human-readable characters and the bytes a computer stores. Without one, a byte is ambiguous: the value 233 (0xE9) is é in Latin-1, the first byte of a 2-byte Japanese character in Shift-JIS, or the first byte of a 3-byte character in UTF-8. The encoding tells the computer which interpretation is correct.
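As a minimal sketch of that ambiguity, the Python below decodes the same byte under two encodings. The variable names are illustrative only, and the byte 0xE9 was picked for the demonstration because it is a complete character in Latin-1 but only the start of a longer sequence in UTF-8.

```python
# One byte, two interpretations. All names here are illustrative.
raw = bytes([0xE9])

# Latin-1 maps every byte value to exactly one character.
as_latin1 = raw.decode("latin-1")  # 'é'

# In UTF-8, 0xE9 announces a three-byte sequence, so on its own it is invalid.
try:
    raw.decode("utf-8")
    utf8_accepts_lone_byte = True
except UnicodeDecodeError:
    utf8_accepts_lone_byte = False

# Followed by two continuation bytes, the same lead byte forms one character.
as_utf8 = bytes([0xE9, 0x9B, 0xA8]).decode("utf-8")  # '雨' (U+96E8)

print(repr(as_latin1), utf8_accepts_lone_byte, repr(as_utf8))
```

The bytes never change; only the declared encoding decides what they mean.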
How it works
Before Unicode, each language or region had its own encodings: Latin-1 for Western European languages, Shift-JIS for Japanese, GB2312 for Simplified Chinese. A file written in one encoding renders as gibberish (mojibake) when read as another. Unicode solved this by assigning a unique number (a code point) to every character across all writing systems. UTF-8 is the most popular Unicode encoding because it is backwards-compatible with ASCII (every ASCII file is already valid UTF-8), space-efficient for Latin text (one byte per ASCII character, up to four bytes for anything else), and capable of representing the full Unicode range. UTF-16 is used internally by Windows and Java. UTF-32 is rarely used for storage or transmission because it spends four bytes on every character, including plain ASCII.
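The points above can be checked with a short Python sketch. The sample strings and variable names are assumptions chosen for illustration; the `-le` codec variants are used so that no byte order mark is added to the length counts.

```python
# Mojibake: UTF-8 bytes read back as Latin-1.
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
garbled = utf8_bytes.decode("latin-1")   # 'cafÃ©'

# ASCII compatibility: ASCII text has identical bytes in UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

# Bytes needed per character in each Unicode encoding.
samples = ["A", "é", "雨", "😀"]
sizes = {
    ch: (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    for ch in samples
}

print(garbled, ascii_compatible)
for ch, (u8, u16, u32) in sizes.items():
    print(f"{ch!r}: utf-8={u8} utf-16={u16} utf-32={u32}")
```

UTF-8 grows from one to four bytes as characters move away from ASCII, while UTF-32 spends four bytes on every character.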
Verdict
UTF-8 is the answer. Unless you're maintaining a legacy system with a specific encoding requirement, every new file, API, and database column should be UTF-8. The encoding wars are over. UTF-8 won.
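In practice, following this advice mostly means naming the encoding explicitly instead of relying on a platform default. A minimal Python sketch, where `roundtrip` and the file name `note.txt` are hypothetical names used only for the demonstration:

```python
import tempfile
from pathlib import Path


def roundtrip(text: str) -> str:
    """Write text to a temporary file as UTF-8 and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "note.txt"  # hypothetical file name
        # Naming the encoding on both sides avoids depending on the
        # platform's default, which can differ between systems.
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8")


print(roundtrip("naïve 雨 😀"))
```

Because both the write and the read name UTF-8, the text survives unchanged regardless of the machine's locale settings.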
