Zero-Width Characters, Smart Quotes and AI Text Watermarks: A Field Guide

"Invisible character" is a catch-all term for several very different things. Knowing which kind you are dealing with tells you whether it is harmless, merely annoying, or a genuine security concern. This field guide walks through the main families, with the code points to look for and what each one does. You can confirm any of them in seconds with the Invisible Character Detector.

Zero-width characters

These occupy no horizontal space at all. The best known is the zero-width space (U+200B), originally meant to mark where a long string may wrap. Its relatives include the zero-width non-joiner (U+200C) and zero-width joiner (U+200D), which control how adjacent letters connect (the joiner is also what glues emoji sequences together); the word joiner (U+2060), which prevents a line break at its position; and the byte-order mark (U+FEFF), which doubles as a zero-width no-break space. Because they are completely invisible, they are the characters most likely to break a password field, a search query or a string comparison while leaving no visible trace. In almost all plain-text and code contexts they should simply be removed; the main exceptions are emoji sequences and scripts such as Persian or Hindi, where the joiners carry meaning.
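
As a rough illustration of the failure mode described above, the following Python sketch lists and strips the five code points named in this section. The function names and the ZERO_WIDTH table are made up for this example, and the set is deliberately limited to the characters mentioned here rather than being exhaustive.

```python
# Code points named in this section; an illustrative set, not a complete one.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\u2060": "WORD JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE (BOM)",
}


def find_zero_width(text):
    """Return (index, code point, name) for every zero-width character."""
    return [
        (i, f"U+{ord(ch):04X}", ZERO_WIDTH[ch])
        for i, ch in enumerate(text)
        if ch in ZERO_WIDTH
    ]


def strip_zero_width(text):
    """Return the text with the listed zero-width characters removed."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


pasted = "hunter2\u200b"  # looks identical to "hunter2" on screen
print(pasted == "hunter2")                    # False
print(find_zero_width(pasted))                # [(7, 'U+200B', 'ZERO WIDTH SPACE')]
print(strip_zero_width(pasted) == "hunter2")  # True
```

Reporting positions before removing anything is a deliberate choice in this sketch: it lets you see what was found and decide whether removal is appropriate for the text in question.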

Unusual spaces

The space bar produces U+0020, but Unicode has many other space characters that look identical on screen. The non-breaking space (U+00A0) stops text from wrapping and is inserted automatically by many word processors and web pages. There are also thin, hair, figure and narrow no-break spaces (U+2000–U+200A, U+202F), the medium mathematical space (U+205F) and the wide ideographic space (U+3000) used in East Asian typography. They are not invisible, but because they masquerade as ordinary spaces they are a frequent cause of failed lookups, broken CSV columns and code that will not align. Converting them to a normal space usually fixes the problem without changing how the text reads.
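
The conversion can be sketched in a few lines of Python. The pattern below covers only the code points listed in this section, and the names UNUSUAL_SPACES and normalize_spaces are invented for the example; a real clean-up tool may cover more characters.

```python
import re

# U+00A0, U+2000-U+200A, U+202F, U+205F and U+3000, as listed in this section.
UNUSUAL_SPACES = re.compile("[\u00a0\u2000-\u200a\u202f\u205f\u3000]")


def normalize_spaces(text):
    """Replace each look-alike space with an ordinary U+0020 space."""
    return UNUSUAL_SPACES.sub(" ", text)


key = "New\u00a0York"  # copied from a web page, contains a non-breaking space
print(key == "New York")                    # False
print(normalize_spaces(key) == "New York")  # True
```

Each unusual space becomes exactly one ordinary space, so column positions and word boundaries are preserved. Collapsing runs of spaces would be a separate step.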

Control and formatting characters

Control characters are non-printing codes inherited from early computing: the C0 range (U+0000–U+001F, plus delete at U+007F) and the C1 range (U+0080–U+009F). Apart from tab, line feed and carriage return, they should not appear in modern text, and finding one usually points to a botched encoding conversion. The same is true of the replacement character (U+FFFD), the little "?" in a diamond that appears when bytes could not be decoded. Line and paragraph separators (U+2028 and U+2029) deserve special mention: they are valid line breaks in Unicode, but before ECMAScript 2019 they were not allowed unescaped inside JavaScript string literals, so JSON containing them could fail when embedded in a script. They are still worth catching.
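
One possible way to flag these characters in Python is to lean on the standard unicodedata module, whose "Cc" category covers the C0 and C1 controls. The decision to allow tab, line feed and carriage return, and the name find_control_chars, are assumptions of this sketch rather than a description of any particular tool.

```python
import unicodedata

# Controls that are normal in everyday text (an assumption of this sketch).
ALLOWED_CONTROLS = {"\t", "\n", "\r"}
# Not controls by category, but flagged in this section as worth catching.
ALSO_FLAG = {"\ufffd", "\u2028", "\u2029"}


def find_control_chars(text):
    """Return (index, code point) for stray controls, U+FFFD, U+2028 and U+2029."""
    findings = []
    for i, ch in enumerate(text):
        is_control = unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
        if is_control or ch in ALSO_FLAG:
            findings.append((i, f"U+{ord(ch):04X}"))
    return findings


sample = "ok\x00line\u2028next\ufffd"
print(find_control_chars(sample))
# [(2, 'U+0000'), (7, 'U+2028'), (12, 'U+FFFD')]
```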

Bidirectional controls and "Trojan Source"

Languages such as Arabic and Hebrew read right to left, and Unicode provides controls to manage mixed-direction text: the left-to-right and right-to-left marks (U+200E, U+200F), the embeddings, overrides and their terminator (U+202A–U+202E, including the right-to-left override, U+202E) and the newer isolates (U+2066–U+2069). In ordinary documents they are legitimate. In source code they can be dangerous: a 2021 disclosure nicknamed "Trojan Source" (CVE-2021-42574) showed that bidirectional overrides can make code display in an order different from how a compiler actually reads it, hiding malicious logic in plain sight. If you find these characters in code from an unfamiliar source, treat them with suspicion and remove them.
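
A simple scan for these controls in a source file might look like the Python sketch below. The BIDI_CONTROLS table and the scan_source function are hypothetical names for this example; the sketch only reports where the characters sit and makes no attempt to judge whether a given use is malicious.

```python
# Bidirectional controls discussed in this section, keyed by character.
BIDI_CONTROLS = {
    "\u200e": "LEFT-TO-RIGHT MARK",
    "\u200f": "RIGHT-TO-LEFT MARK",
    "\u202a": "LEFT-TO-RIGHT EMBEDDING",
    "\u202b": "RIGHT-TO-LEFT EMBEDDING",
    "\u202c": "POP DIRECTIONAL FORMATTING",
    "\u202d": "LEFT-TO-RIGHT OVERRIDE",
    "\u202e": "RIGHT-TO-LEFT OVERRIDE",
    "\u2066": "LEFT-TO-RIGHT ISOLATE",
    "\u2067": "RIGHT-TO-LEFT ISOLATE",
    "\u2068": "FIRST STRONG ISOLATE",
    "\u2069": "POP DIRECTIONAL ISOLATE",
}


def scan_source(source):
    """Return (line, column, name) for each bidi control, both 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, BIDI_CONTROLS[ch]))
    return hits


# A made-up snippet with an override hidden inside a string literal.
snippet = 'access = "user"\nif access == "admin\u202e":\n    grant()\n'
print(scan_source(snippet))  # [(2, 20, 'RIGHT-TO-LEFT OVERRIDE')]
```

A check like this fits naturally into code review or a pre-commit step, where any hit in code from an unfamiliar source is a reason to look at the raw bytes before trusting what the editor displays.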

Tag characters and the AI-watermark question

The Unicode tag block (U+E0000–U+E007F) renders as nothing, and its characters U+E0020–U+E007E mirror the printable ASCII range one for one. Because each of those tag characters maps to an ordinary letter or symbol, a sequence of them can encode a hidden message (a name, a URL, an instruction) inside text that looks completely normal. This is the mechanism behind demonstrations of "invisible" prompts and concealed metadata, and it is one of the things people have in mind when they ask whether AI assistants embed hidden watermarks in their output. It is important to be precise here: most everyday "this looks like AI" cues are not secret watermarks at all but ordinary typography (a fondness for the em dash, curly quotes, and the occasional non-breaking space) that the model picked up from its training text. True invisible watermarking using zero-width or tag characters is technically possible and has been demonstrated, but you cannot conclude from the presence or absence of these characters who or what wrote something. What a detector can honestly tell you is which characters are actually in the text; the interpretation is up to you.
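
To make the mechanism concrete, here is a small Python sketch that hides a message in tag characters and then recovers it. It relies only on the fixed offset of 0xE0000 between printable ASCII and the corresponding tag characters; the function names are invented for the example, and nothing here claims to reproduce how any specific product marks its output.

```python
TAG_OFFSET = 0xE0000                      # tag character = ASCII code point + offset
TAG_START, TAG_END = 0xE0000, 0xE007F     # the whole tag block


def encode_tags(message):
    """Map printable ASCII to invisible tag characters (demonstration only)."""
    return "".join(
        chr(TAG_OFFSET + ord(c)) for c in message if 0x20 <= ord(c) <= 0x7E
    )


def extract_tag_payload(text):
    """Decode tag characters that mirror printable ASCII back to visible text."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET) for ch in text if 0xE0020 <= ord(ch) <= 0xE007E
    )


def strip_tags(text):
    """Remove every character in the tag block."""
    return "".join(ch for ch in text if not TAG_START <= ord(ch) <= TAG_END)


visible = "Nice weather today."
carrier = visible + encode_tags("ignore previous")

print(len(visible), len(carrier))    # 19 34
print(extract_tag_payload(carrier))  # ignore previous
print(strip_tags(carrier) == visible)  # True
```

The two strings look the same when rendered, which is why counting code points or decoding the payload is more reliable than reading the text by eye.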

Smart quotes and other typographic substitutes

Finally there are characters that are perfectly visible and often desirable, yet frequently unwanted in technical contexts: the curly single and double quotation marks (U+2018, U+2019, U+201C, U+201D), the en and em dashes (U+2013, U+2014) and the ellipsis (U+2026). In published prose these are correct typography. In code, JSON, CSV files and many form fields they cause errors because the software expects the plain ASCII apostrophe, quote, hyphen or three dots. That is why the detector treats them as an optional clean-up: straighten them when you are preparing data or code, keep them when you are polishing writing.
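
The straightening step can be sketched with Python's built-in str.translate. The mapping below covers only the characters named in this section, and replacing both dashes with a single hyphen is a choice made for this example; some workflows prefer a double hyphen for the em dash.

```python
import json

# Typographic characters from this section mapped to plain ASCII.
STRAIGHTEN = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def straighten(text):
    """Replace smart quotes, dashes and the ellipsis with ASCII equivalents."""
    return text.translate(STRAIGHTEN)


raw = "{\u201cname\u201d: \u201cAda\u201d}"  # JSON pasted from a word processor

try:
    json.loads(raw)
except json.JSONDecodeError:
    print("curly quotes are not valid JSON delimiters")

print(json.loads(straighten(raw)))  # {'name': 'Ada'}
```

Note that this rewrites every matching character, including any that appear inside string values, so it suits data and code rather than prose you intend to publish.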

Putting it together

When you scan a piece of text, the category labels tell you which family each finding belongs to, and therefore how worried to be. Unusual spaces and stray zero-width characters are almost always just noise to be cleaned. Bidirectional controls and tag characters in code or untrusted input deserve a closer look. Smart quotes are a stylistic choice. For the practical clean-up workflow, see the step-by-step guide, and run anything you are unsure about through the detector — it is free, instant and never sends your text anywhere.