Whitespace / Invisible Character Stripper
This free tool detects and removes hidden characters lurking in your text that cause formatting issues, break code, corrupt data, and create mysterious bugs you can't see. Paste in your content, and the stripper identifies every invisible character by type, shows you exactly where each one sits, and cleans them out in a single click. Fix the problems you can't see before they break the things you can.
What Are Invisible Characters?
Invisible characters are Unicode code points that occupy space in a string without producing any visible glyph on screen. You can't see them by reading the text. You can't find them by scanning the page. But they're there, taking up bytes, interfering with string comparisons, breaking regex patterns, corrupting database entries, and causing the kind of intermittent bugs that make developers question their sanity.
Some invisible characters are mundane. A regular space, a tab, and a line break are all technically invisible but serve obvious formatting purposes. The problem characters are the ones that look like nothing, behave like something, and arrived in your text through a path you didn't expect.
Zero-width spaces that split a word into two tokens without any visible gap. Non-breaking spaces that look identical to regular spaces but behave differently in code. Byte order marks that sit at the start of a file and confuse parsers. Soft hyphens that exist as invisible break points until a line wraps. Right-to-left marks that reverse text rendering in contexts you didn't anticipate. Ideographic spaces borrowed from CJK character sets that are wider than standard spaces but visually ambiguous.
These characters enter your text through copy-paste from web pages, word processors, PDFs, email clients, CMS editors, and any other system that processes text with its own Unicode handling. They survive transfers between systems because they're valid Unicode. They persist because nothing flags them. And they cause problems that range from cosmetic annoyances to data integrity failures.
Where Do These Characters Come From?
Understanding the sources helps you prevent the contamination in the first place, not just clean it up after the fact.
- Word processors. Microsoft Word, Google Docs, and other rich text editors insert non-breaking spaces, soft hyphens, smart quotes (curly quotes that are different Unicode code points from straight quotes), em dashes, en dashes, and other typographic characters automatically. Copy-pasting from these editors into a code editor, CMS, or database carries all of these along invisibly.
- Web pages. HTML source code can contain any Unicode character, and browsers render most invisible ones silently. Copy text from a web page and you may inherit zero-width spaces used for line-breaking hints, zero-width joiners used in emoji rendering, right-to-left marks used in bidirectional text, and non-breaking spaces used for formatting control. The text looks clean in the browser. The underlying string is polluted.
- PDFs. Text extracted from PDFs is notorious for invisible character contamination. PDF rendering engines use their own internal character mapping, and extraction tools often produce output with soft hyphens at line break points, non-standard space characters, ligature artifacts, and control characters that were part of the PDF's internal formatting instructions.
- Spreadsheets. Data exported from Excel, Google Sheets, or CSV files frequently contains trailing spaces, leading spaces, non-breaking spaces inserted by auto-formatting, and line breaks embedded within cells. These characters survive export and import cycles between systems, accumulating across data pipelines.
- APIs and databases. Data received from third-party APIs or migrated between databases can carry invisible characters from the source system. A name field that contains a zero-width space after the last character will fail exact-match lookups even though it looks identical to the clean version on screen.
- Keyboard input. Some keyboard shortcuts and input methods produce invisible characters directly. On macOS, Option+Space inserts a non-breaking space. Certain international keyboard layouts insert zero-width non-joiners or right-to-left marks as part of normal text entry. Users insert these characters without realizing it.
What Characters Does This Tool Detect?
The stripper identifies and categorizes every invisible and problematic whitespace character in your text, grouped by type and risk level.
- Zero-width characters. Zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), and word joiner (U+2060). These take up zero pixels of visual space but exist as real characters in the string. A word containing a zero-width space in the middle appears as one word visually but is two tokens programmatically. This breaks search functionality, string matching, URL validation, and anything that processes text by splitting on word boundaries.
- Non-breaking spaces. Non-breaking space (U+00A0) and narrow non-breaking space (U+202F). These look identical to regular spaces but prevent line wrapping at their position. In HTML, the &nbsp; entity produces U+00A0. The character causes problems because it doesn't match a regular space in string comparisons, regex patterns, or trim operations that target only standard whitespace.
- Byte order marks. The byte order mark (U+FEFF) is a zero-width character placed at the beginning of a file to indicate its encoding. It's valid as the first character of a UTF-8 or UTF-16 file, but when it appears mid-text or gets concatenated into strings through file processing, it becomes invisible garbage that breaks JSON parsing, XML processing, CSV imports, and header comparisons.
- Directional formatting characters. Left-to-right mark (U+200E), right-to-left mark (U+200F), left-to-right embedding (U+202A), and related bidirectional control characters. These control text rendering direction and are essential for languages like Arabic and Hebrew. In English text or in code, they're contaminants that can cause text to render incorrectly or behave unpredictably in string processing.
- Soft hyphens. The soft hyphen (U+00AD) is an invisible hint that tells rendering engines where a word can be broken across lines. If the line doesn't break there, the hyphen is invisible. If it does, a visible hyphen appears. In contexts outside of text rendering, like database fields, URLs, or code, soft hyphens are invisible characters that corrupt the data without any visual indication.
- Special whitespace. Em space (U+2003), en space (U+2002), thin space (U+2009), hair space (U+200A), figure space (U+2007), punctuation space (U+2008), ideographic space (U+3000), and other Unicode space variants. Each has a specific typographic purpose but causes problems when mixed into text that expects only standard spaces. They're visually similar or identical to regular spaces but are different code points.
- Control characters. Null bytes (U+0000), backspace (U+0008), escape (U+001B), delete (U+007F), and other ASCII control characters that occasionally appear in text through encoding errors, data corruption, or copy-paste from binary-adjacent sources. These can terminate strings early, corrupt file formats, and trigger security vulnerabilities.
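To make the categories above concrete, here is a minimal Python sketch of how a detector could classify characters. It is an illustration, not the tool's actual implementation: the INVISIBLE_CATEGORIES table and the categorize and find_invisibles functions are hypothetical names, and the directional set assumes "related bidirectional control characters" means U+202A through U+202E.

```python
import unicodedata

# Hypothetical lookup table built from the code points listed above.
# The directional set is an assumption about which "related" controls to include.
INVISIBLE_CATEGORIES = {
    "zero-width": {0x200B, 0x200C, 0x200D, 0x2060},
    "non-breaking space": {0x00A0, 0x202F},
    "byte order mark": {0xFEFF},
    "directional": {0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E},
    "soft hyphen": {0x00AD},
    "special whitespace": {0x2002, 0x2003, 0x2007, 0x2008, 0x2009, 0x200A, 0x3000},
}


def categorize(ch):
    """Return the category name for a problem character, or None if it is ordinary."""
    cp = ord(ch)
    for name, code_points in INVISIBLE_CATEGORIES.items():
        if cp in code_points:
            return name
    # ASCII control characters, excluding tab, line feed, and carriage return.
    if (cp < 0x20 and ch not in "\t\n\r") or cp == 0x7F:
        return "control"
    return None


def find_invisibles(text):
    """List (index, code point, category, Unicode name) for every problem character."""
    findings = []
    for index, ch in enumerate(text):
        category = categorize(ch)
        if category is not None:
            name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
            findings.append((index, f"U+{ord(ch):04X}", category, name))
    return findings


# Escapes are used so the hidden characters stay visible in the source.
sample = "\ufeffNew\u200bYork\u00a0City"
for finding in find_invisibles(sample):
    print(finding)
```

For the sample string, the sketch reports three findings, at indexes 0, 4, and 9: the byte order mark, the zero-width space, and the non-breaking space.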
How Do Invisible Characters Cause Real Problems?
The damage invisible characters cause is wildly disproportionate to their visibility, which is zero.
- Broken code. A variable name with a zero-width space in it looks correct in every editor but fails to compile or match any reference to the "same" variable without the hidden character. A JSON key containing a non-breaking space instead of a regular space passes visual inspection and parses without error, but every lookup that uses the regular-space key fails. A CSS class name pasted from a design document with an invisible character appended doesn't match the class name in the HTML. Developers can stare at these bugs for hours because the code looks perfect.
- Failed string comparisons. Two strings that look identical on screen but contain different invisible characters are not equal. A database lookup for "New York" returns nothing when the stored value is "New York" with a zero-width space hidden after "New". An email deduplication process fails to match entries where one has a trailing non-breaking space and the other doesn't. Every system that compares, sorts, or matches strings is vulnerable to invisible character contamination.
- SEO and content issues. Invisible characters in title tags, meta descriptions, heading tags, URLs, and anchor text can interfere with how search engines process your content. A title tag with a byte order mark at the beginning may display correctly in the browser but show a garbage character in search results. A URL with a zero-width space encodes as %E2%80%8B, turning a clean URL into an ugly, broken-looking one that may not resolve correctly.
- Data pipeline corruption. Invisible characters survive most data transformations because they're valid Unicode. They pass through ETL processes, API transfers, database migrations, and file conversions without triggering any error. Each handoff is an opportunity for contamination to enter, and because no step flags the characters, they accumulate silently until they cause a visible failure downstream.
- Accessibility problems. Screen readers process invisible characters as part of the text content. A zero-width space in the middle of a word can cause a screen reader to pronounce it as two separate words. Non-breaking spaces can affect how assistive technologies parse sentence and paragraph boundaries. Characters that sighted users never notice can create audible artifacts for users relying on screen readers.
- Security vulnerabilities. Invisible characters have been used in spoofing attacks, close relatives of homograph attacks, where a URL or identifier contains hidden characters yet looks identical to a legitimate one. A username with a zero-width character appended can display as "admin" while being stored as a different string that maps to a different account. A right-to-left override character (U+202E) can reverse how part of a filename or identifier is displayed. Byte order marks and null bytes have been used to bypass input validation filters that don't account for them.
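Two of the failures above, the silent comparison mismatch and the percent-encoded URL, are easy to reproduce. The short Python sketch below uses made-up sample values (the "admin" string, the cities dictionary, and the example.com URL are illustrative only).

```python
from urllib.parse import quote

ZWSP = "\u200b"  # zero-width space, written as an escape so it stays visible here

clean = "admin"
tainted = "admin" + ZWSP

print(clean == tainted)          # False
print(len(clean), len(tainted))  # 5 6
print(tainted.strip() == clean)  # False: str.strip() leaves U+200B in place

# A stored key with a hidden character never matches the clean lookup value.
cities = {"New" + ZWSP + " York": "NY"}
print(cities.get("New York"))    # None

# Percent-encoding exposes the hidden character as its three UTF-8 bytes.
url = "https://example.com/page" + ZWSP
encoded = quote(url, safe=":/")
print(encoded)                   # https://example.com/page%E2%80%8B
```

Note that Python's built-in strip() does not help here, because a zero-width space is a format character, not whitespace.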
What's the Difference Between Stripping and Normalizing?
The tool offers both options because they address different needs.
Stripping removes invisible characters entirely. A zero-width space is deleted. A non-breaking space is deleted. The text becomes shorter by the number of characters removed. This is the right approach when the invisible characters are contaminants that shouldn't be there at all, which is the case for the vast majority of situations.
Normalizing replaces invisible and typographic characters with their closest standard equivalent. A non-breaking space becomes a regular space. Various Unicode space variants become standard spaces. Smart quotes become straight quotes. Em dashes become hyphens or double hyphens. The text stays close to its original length (only a multi-character replacement, such as a double hyphen for an em dash, changes it), and the meaning is preserved, but every affected character is converted to its simplest, most universal form.
Normalization is the better choice when invisible characters might be carrying intentional formatting. A non-breaking space between a number and its unit ("100 km") was probably placed deliberately to prevent a line break at that position. Stripping it removes the space entirely, joining "100" and "km" into "100km." Normalizing it converts it to a regular space, preserving the gap while eliminating the special behavior.
For code, data, and most web content, stripping is the default. For editorial content that was composed in a word processor with intentional typographic characters, normalizing preserves the author's formatting choices while cleaning up the Unicode.
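The difference between the two modes can be sketched in a few lines of Python. This is an assumption-laden illustration, not the tool's code: the table contents and the strip_invisibles and normalize_invisibles names are hypothetical, and the sketch assumes normalization simply deletes zero-width characters, since they have no visible equivalent.

```python
# Hypothetical character sets; a real tool would likely cover more code points.
ZERO_WIDTH_AND_MARKS = "\u200b\u200c\u200d\u2060\ufeff\u00ad\u200e\u200f"
SPACE_VARIANTS = "\u00a0\u202f\u2002\u2003\u2007\u2008\u2009\u200a\u3000"

# Stripping: every listed character maps to None, which deletes it.
STRIP_TABLE = {ord(c): None for c in ZERO_WIDTH_AND_MARKS + SPACE_VARIANTS}

# Normalizing: space variants become a regular space, typographic
# punctuation becomes ASCII, zero-width characters are still deleted.
NORMALIZE_TABLE = {ord(c): None for c in ZERO_WIDTH_AND_MARKS}
NORMALIZE_TABLE.update({ord(c): " " for c in SPACE_VARIANTS})
NORMALIZE_TABLE.update({
    0x2018: "'", 0x2019: "'",   # curly single quotes
    0x201C: '"', 0x201D: '"',   # curly double quotes
    0x2014: "--",               # em dash
})


def strip_invisibles(text):
    """Delete invisible characters and non-standard spaces outright."""
    return text.translate(STRIP_TABLE)


def normalize_invisibles(text):
    """Replace characters with standard equivalents where one exists."""
    return text.translate(NORMALIZE_TABLE)


measurement = "100\u00a0km"
print(strip_invisibles(measurement))      # 100km
print(normalize_invisibles(measurement))  # 100 km
```

The "100 km" case shows why the choice matters: stripping joins the number to its unit, while normalizing keeps the gap.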
When Should I Use This Tool?
Build invisible character stripping into your workflow at every point where text crosses a system boundary.
- Before pasting content into a CMS. Any text composed in Word, Google Docs, or another rich text editor should be stripped before entering your content management system. This prevents invisible characters from embedding in your database and propagating to every rendered page.
- Before committing code. If you've pasted code from a tutorial, Stack Overflow answer, documentation page, or AI assistant, run it through the stripper. A single invisible character in a code file can create a bug that no linter catches and no amount of visual inspection reveals.
- Before importing data. CSV files, JSON feeds, API responses, and database exports should be cleaned before importing into your system. One contaminated field in a thousand-row import can break queries, corrupt aggregations, and create ghost records that resist deduplication.
- After extracting text from PDFs. PDF text extraction is one of the most reliable sources of invisible character contamination. Always clean extracted text before using it in any downstream process.
- When debugging mysterious failures. If a string comparison that should work doesn't, if a URL that looks correct returns a 404, if a search query that should match returns nothing, invisible characters are a prime suspect. Paste the problematic text into the tool and see what it reveals. This diagnostic use alone saves hours of debugging time.
- Before publishing to the web. A final pass through the stripper before publishing catches invisible characters that entered the text at any point during the content production process. Think of it as a last line of defense against the characters that every other step failed to catch.
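For the data import case, cleaning can be a small step in the import script itself. The following Python sketch shows one possible shape, assuming CSV input already decoded to a string; clean_field and clean_csv are hypothetical helper names, and the list of removed code points is an assumption you would adjust to your data.

```python
import csv
import io

# Assumed set of characters to delete from every field (maps each to None).
REMOVE = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF, 0x00AD])


def clean_field(value):
    """Delete zero-width characters, turn NBSP into a space, trim the ends."""
    return value.translate(REMOVE).replace("\u00a0", " ").strip()


def clean_csv(raw_text):
    """Parse CSV text and return rows with every cell cleaned."""
    reader = csv.reader(io.StringIO(raw_text, newline=""))
    return [[clean_field(cell) for cell in row] for row in reader]


# Sample input with a leading BOM, a zero-width space, an NBSP, and a trailing space.
raw = "\ufeffname,city\r\nAda\u200b,New\u00a0York \r\n"
print(clean_csv(raw))  # [['name', 'city'], ['Ada', 'New York']]
```

When reading from a file, opening it with encoding="utf-8-sig" also drops a leading byte order mark, but it does nothing for marks that appear mid-text.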
Common Invisible Character Problems to Avoid
- Assuming your text editor shows everything. Most text editors, including many code editors, render invisible characters as nothing by default. VS Code, Sublime Text, and others have settings to show whitespace and invisible characters, but they're typically off by default. Even with these settings enabled, some zero-width characters may not be visualized depending on the editor and its font configuration. Don't trust what you see. Trust what the tool detects.
- Running string comparisons without normalization. If your application accepts user input and compares it against stored values, invisible characters in either the input or the stored value will cause false negatives. Normalize or strip both sides of the comparison before matching. This is especially important for login systems, search functions, and any form of text matching.
- Cleaning data once and assuming it stays clean. Every time text passes through a new system, new invisible characters can be introduced. Cleaning data at import doesn't protect against contamination added by the CMS editor, by a content update, or by a future data migration. Build stripping into your pipeline as a recurring step, not a one-time fix.
- Stripping characters you actually need. In rare cases, invisible characters serve a legitimate purpose. Zero-width joiners control emoji rendering by combining separate emoji into a single glyph (a woman emoji, a zero-width joiner, and a laptop emoji render together as one "woman technologist" emoji). Bidirectional marks are essential for correct rendering of Arabic and Hebrew text. Non-breaking spaces prevent awkward line breaks in typography-sensitive contexts. If your content legitimately uses these characters, use normalization mode or selective stripping rather than removing everything.
- Not testing across browsers and platforms. An invisible character that renders harmlessly in Chrome on macOS might display as a visible box character in Firefox on Windows or cause a layout break in Safari on iOS. After stripping, verify the cleaned text displays correctly across the browsers and platforms your audience uses.
- Ignoring invisible characters in URLs. URLs with invisible characters often work in the browser because the browser encodes them silently. But pasted into an email, a document, or a social post, the encoded version appears with percent-encoded sequences that look broken and may exceed character limits. Always strip invisible characters from URLs specifically, where they have zero legitimate purpose and maximum potential for disruption.
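Two of the points above, cleaning both sides of a comparison and keeping the characters you actually need, can be combined in one small Python sketch. The selective_strip and same_text functions and their keep_joiners and keep_bidi parameters are hypothetical names chosen for illustration, and the character lists are assumptions rather than a complete inventory.

```python
ALWAYS_REMOVE = [0x200B, 0x2060, 0xFEFF, 0x00AD]  # ZWSP, word joiner, BOM, soft hyphen
JOINERS = [0x200C, 0x200D]                        # needed by some scripts and emoji
BIDI_MARKS = [0x200E, 0x200F]                     # needed by right-to-left text


def selective_strip(text, keep_joiners=True, keep_bidi=True):
    """Remove contaminants while optionally preserving joiners and bidi marks."""
    targets = list(ALWAYS_REMOVE)
    if not keep_joiners:
        targets += JOINERS
    if not keep_bidi:
        targets += BIDI_MARKS
    return text.translate(dict.fromkeys(targets))


def same_text(a, b):
    """Compare two strings after cleaning both sides the same way."""
    def key(s):
        cleaned = selective_strip(s, keep_joiners=False, keep_bidi=False)
        return cleaned.replace("\u00a0", " ").strip()
    return key(a) == key(b)


# Woman + zero-width joiner + laptop forms a single emoji sequence.
technologist = "\U0001F469\u200D\U0001F4BB"
print(selective_strip(technologist + "\u200b") == technologist)  # True
print(same_text("admin\u200b", " admin\u00a0"))                  # True
```

Display text goes through the gentle mode so emoji sequences survive, while comparison keys go through the aggressive mode so hidden characters can never cause a false mismatch.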