URL Encoding: Percent-Encoding for Web Developers

By Utilavo Editorial · Reviewed

URL encoding, more precisely called percent-encoding, is the mechanism that lets arbitrary text and binary data travel safely inside a URL. A URL is not a free-form string — it has a rigid grammar defined by RFC 3986, where characters like the slash, question mark, ampersand, and hash each carry structural meaning. When data you want to put into a URL happens to contain one of those characters, percent-encoding replaces it with a percent sign followed by two hexadecimal digits representing the character's byte value, so the data is preserved without colliding with the URL's syntax.

Getting percent-encoding right matters far more than it first appears. The same logical string must be encoded differently depending on whether it lands in a path segment, a query parameter, or a fragment, and the rules differ again between true URI encoding and the application/x-www-form-urlencoded format used by HTML forms. Misunderstanding these distinctions produces broken links, corrupted query parameters, failed API calls, double-encoding bugs, and even security vulnerabilities such as open redirects and reflected XSS. This guide walks through the character classes defined by RFC 3986, the practical differences between the JavaScript encoding functions, how non-ASCII text is handled through UTF-8, and the mistakes that most often cause production failures.

What percent-encoding is and why it exists

RFC 3986 defines the generic syntax for a URI and restricts the characters that may appear unescaped to a small subset of US-ASCII. Everything outside that subset (spaces, most punctuation, control characters, and every non-ASCII byte) must be represented using percent-encoding. A percent-encoded octet is written as a percent sign followed by two hexadecimal digits. The digits are case-insensitive, so %2f and %2F are equivalent, but RFC 3986 recommends that encoders emit uppercase. A single byte with the value 0x20 (a space) becomes %20, and a byte with value 0x26 (an ampersand) becomes %26. Each escape always represents exactly one byte, never one character, which becomes important the moment multi-byte UTF-8 is involved.

The reason this scheme exists is that a URL is parsed structurally before its contents are ever read as data. When a browser or server sees a question mark, it treats everything after it as the query string; when it sees a hash, everything after becomes the fragment; a slash starts a new path segment. If your data literally contains a question mark and you do not encode it, the parser splits the URL at the wrong place and the remainder is silently misinterpreted. Percent-encoding lets you smuggle a literal question mark into a parameter value as %3F, where the parser sees an ordinary data octet rather than a delimiter.
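
The splitting behaviour described above can be observed with the standard URL class. The following is a minimal sketch; the host files.example and the file name are made up for illustration.

```javascript
// Hypothetical file name that contains a literal question mark.
const fileName = "what?.txt";

// Unencoded: the parser treats "?" as the start of the query string.
const unencoded = new URL("https://files.example/docs/" + fileName);
console.log(unencoded.pathname); // "/docs/what"
console.log(unencoded.search);   // "?.txt"

// Encoded: "?" travels as the data octet %3F and the path stays intact.
const encoded = new URL(
  "https://files.example/docs/" + encodeURIComponent(fileName)
);
console.log(encoded.pathname);   // "/docs/what%3F.txt"
console.log(encoded.search);     // ""

// The receiver recovers the original name by decoding the last segment.
const recovered = decodeURIComponent(encoded.pathname.split("/").pop());
console.log(recovered);          // "what?.txt"
```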

Percent-encoding is a reversible, lossless transformation. Decoding simply walks the string and replaces every %XX sequence with the byte 0xXX, leaving all other characters untouched. Encoding is not idempotent, however: because the percent sign is itself escaped, encoding an already-encoded string changes it again, and doing so is almost always a bug, a point covered in detail below. You can encode or decode any string interactively with the URL Encoder/Decoder to see exactly which characters change and which pass through unchanged.

It is worth stressing that percent-encoding is an encoding, not encryption or hashing. It provides no confidentiality and no integrity protection — anyone can decode %2F back to a slash trivially. Its only job is to make data syntactically safe to place inside a URL. If you need to obscure or protect data, percent-encoding is the wrong tool; if you only need to transmit binary blobs as text you would reach for Base64 instead, which is a different transformation entirely.

Reserved and unreserved characters in RFC 3986

RFC 3986 sorts characters into two groups that determine whether encoding is required. The unreserved set is the uppercase and lowercase letters A-Z and a-z, the digits 0-9, and exactly four symbols: hyphen, period, underscore, and tilde (- . _ ~). These characters never need to be encoded and, by the specification, should not be: an encoder that escapes them produces a technically valid but non-canonical URL, and because RFC 3986 defines %41 and the letter A as equivalent, normalizers should decode such escapes.

The reserved set is the characters that may serve as delimiters within the URL's generic syntax. RFC 3986 splits these into gen-delims (: / ? # [ ] @) and sub-delims (! $ & ' ( ) * + , ; =). A reserved character is allowed to appear unencoded only when it is being used for its structural purpose. The instant you want one of these characters to be part of your data rather than part of the URL skeleton, it must be percent-encoded. For example, an ampersand used to separate two query parameters stays as &, but an ampersand inside a parameter value must become %26 so it is not mistaken for a separator.
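
A small sketch of the ampersand case, using the standard URLSearchParams parser to play the role of the receiving side. The parameter names topic and page are invented for the example.

```javascript
// Hypothetical value that contains a reserved character.
const value = "R&D";

// Unencoded: the "&" inside the value is read as a pair separator.
const naive = new URLSearchParams("topic=" + value + "&page=2");
console.log(naive.get("topic")); // "R"
console.log(naive.has("D"));     // true, a stray parameter appeared

// Encoded: the data ampersand travels as %26 and the value survives.
const safe = new URLSearchParams(
  "topic=" + encodeURIComponent(value) + "&page=2"
);
console.log(safe.get("topic"));  // "R&D"
console.log(safe.get("page"));   // "2"
```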

Any character that is neither unreserved nor reserved (spaces, double quotes, angle brackets, the backslash, the caret, the backtick, curly braces, the pipe, and every byte above 0x7F) must always be percent-encoded wherever it appears. A space becomes %20, a double quote becomes %22, a less-than sign becomes %3C, and a percent sign itself becomes %25. Encoding the literal percent sign is essential precisely because it is the escape introducer: a raw percent sign followed by two hex digits would be decoded as an escape, and one followed by anything else makes the URL malformed. The fact that the percent sign is itself encoded is also what makes the double-encoding problems discussed later possible.

A subtle but important rule is that the reserved set is context-sensitive. A forward slash is structural in a path but is just data inside a query value, and a colon is structural in the scheme and authority but ordinary inside a path segment. This is why no single encoding function is correct everywhere, and why the standard library exposes more than one. Use the URL Encoder/Decoder to check how a specific character is treated, and always think about which component of the URL the data is destined for before encoding it.

encodeURI vs encodeURIComponent vs form encoding

JavaScript ships two distinct encoding functions because there are two distinct jobs. encodeURI is designed to take a complete, already-structured URL and make it safe without breaking it, so it deliberately does not encode the reserved characters that form the URL skeleton. It leaves : / ? # @ ! $ & ' ( ) * + , ; = intact, encoding only characters that cannot appear raw, such as spaces and non-ASCII text. One exception is worth knowing: encodeURI does escape the square brackets [ and ] (as %5B and %5D), even though RFC 3986 reserves them for IPv6 literals in the host. Use it when you have a whole URL in hand and merely need to clean up illegal characters, never to encode an individual value.

encodeURIComponent is the function for encoding a single component: one query parameter name, one parameter value, or one path segment. It encodes everything except the unreserved set and five sub-delims that it leaves alone for historical reasons (! ' ( ) *), which means it does escape : / ? # & = + and the rest of the reserved characters. This is exactly what you want when inserting user data into a URL, because it guarantees that an ampersand or equals sign inside the value cannot break out and corrupt the surrounding query string. The practical rule is simple: build the URL skeleton yourself, and run every dynamic value through encodeURIComponent before inserting it.
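
The contrast between the two functions is easiest to see side by side. In this sketch the sample value and the example.com URL are illustrative only.

```javascript
// A value that happens to contain URL delimiters.
const value = "a/b?c=d&e";

const viaEncodeURI = encodeURI(value);
const viaComponent = encodeURIComponent(value);
console.log(viaEncodeURI); // "a/b?c=d&e"  (delimiters left alone)
console.log(viaComponent); // "a%2Fb%3Fc%3Dd%26e"

// A whole URL with spaces in it.
const wholeUrl = "https://example.com/my docs/?q=x y";

const cleaned = encodeURI(wholeUrl);
const destroyed = encodeURIComponent(wholeUrl);
console.log(cleaned);   // "https://example.com/my%20docs/?q=x%20y"
console.log(destroyed); // "https%3A%2F%2Fexample.com%2Fmy%20docs%2F%3Fq%3Dx%20y"
```

The last line shows why component encoding must never be applied to a complete URL: the scheme separator, slashes, and question mark are all escaped, so the result is no longer a usable address.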

The application/x-www-form-urlencoded format is a third convention, defined by the WHATWG URL Standard rather than RFC 3986, and it is what browsers use when submitting an HTML form or building a query string via URLSearchParams. Its most visible quirk is that it encodes a space as a plus sign (+) rather than %20, and correspondingly it encodes a literal plus sign as %2B. It is otherwise similar to component encoding but follows the WHATWG percent-encode sets, which differ in a handful of characters. The key takeaway is that form encoding and RFC 3986 component encoding are not interchangeable, and mixing them is a frequent source of bugs.

The plus-versus-%20 distinction is the single most common point of confusion. For example, the value "a b+c" becomes "a%20b%2Bc" under encodeURIComponent but "a+b%2Bc" under form encoding. A server that parses the query string with the wrong convention will either turn legitimate plus signs into spaces or leave plus signs where spaces were intended. When you control both ends, pick one convention and apply it consistently; when you do not, match whatever the receiving endpoint expects, which for classic form posts is almost always the form-encoded variant.
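
The two conventions, and the mismatch that occurs when they are mixed, can be reproduced with built-in functions. The parameter name v is an arbitrary placeholder.

```javascript
const value = "a b+c";

// RFC 3986 style component encoding: space becomes %20.
const component = encodeURIComponent(value);
console.log(component); // "a%20b%2Bc"

// Form encoding via URLSearchParams: space becomes "+".
const form = new URLSearchParams({ v: value }).toString();
console.log(form); // "v=a+b%2Bc"

// Decoding with the wrong convention changes the data.
const formDecoded = new URLSearchParams("v=a+b").get("v");
const componentDecoded = decodeURIComponent("a+b");
console.log(formDecoded);      // "a b"  (plus read as a space)
console.log(componentDecoded); // "a+b"  (plus left as a plus)
```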

Encoding the path, query string, and fragment

A URL has three data-bearing regions with different encoding requirements: the path, the query, and the fragment. In the path, segments are separated by slashes, so a slash that is part of your data must be encoded as %2F while structural slashes stay literal. Spaces in a path are encoded as %20 — never as a plus sign, because the plus-for-space convention applies only to the query under form encoding. A path segment like a product name "summer/winter sale" used as a single segment becomes summer%2Fwinter%20sale.

The query string after the question mark is conventionally a series of key=value pairs joined by ampersands. Both the keys and the values must be individually encoded so that any literal =, &, or + inside them does not collide with the delimiters. A search for "cats & dogs" placed in a query parameter named q produces q=cats%20%26%20dogs under component encoding, or q=cats+%26+dogs under form encoding — in both cases the data ampersand is %26 so it is not read as a pair separator. Always encode each value separately; never encode the assembled query as one blob.

The fragment, everything after the hash, is handled by the client and never sent to the server in the HTTP request line, but it still follows percent-encoding rules so it can be parsed reliably by JavaScript and the History API. A hash character inside fragment data should be encoded as %23, because RFC 3986 does not permit a raw # inside the fragment and parsers differ in how leniently they treat one. RFC 3986 allows the same characters in a fragment as in a query, and both are somewhat more permissive than a path segment (a raw / or ? is legal in either), so encoders may leave a few extra characters unescaped there, but encoding conservatively with component encoding is always safe.

The governing principle across all three regions is to encode the data, not the skeleton. Build the structural characters — the slashes between path segments, the question mark, the ampersands and equals signs between query pairs, the hash before the fragment — yourself, and percent-encode only the dynamic values that slot into those positions. Running an entire URL through component encoding destroys its structure by escaping the very delimiters that make it a URL.
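
One way to apply "encode the data, not the skeleton" is a small builder that adds the delimiters itself and encodes only the dynamic pieces. This is a sketch rather than a complete URL library: buildUrl, its parameters, and the shop.example origin are hypothetical names, and it uses component encoding (not form encoding) for the query.

```javascript
// Assemble a URL from an origin, path segments, query pairs, and a fragment.
// Structural characters are written literally; every value is encoded once.
function buildUrl(origin, segments, params, fragment) {
  const path = segments.map((s) => encodeURIComponent(s)).join("/");
  const query = Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");

  let url = `${origin}/${path}`;
  if (query) url += `?${query}`;
  if (fragment) url += `#${encodeURIComponent(fragment)}`;
  return url;
}

const built = buildUrl(
  "https://shop.example",
  ["products", "summer/winter sale"],
  { q: "cats & dogs" },
  "size#1"
);
console.log(built);
// "https://shop.example/products/summer%2Fwinter%20sale?q=cats%20%26%20dogs#size%231"
```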

UTF-8 and international characters

Percent-encoding operates on bytes, but text above the ASCII range is made of characters that occupy multiple bytes. The modern, standard approach, recommended by RFC 3986 and required by the WHATWG URL Standard, is to first encode the text as UTF-8 and then percent-encode each resulting byte individually. A single non-ASCII character therefore expands into several percent escapes. The euro sign, for instance, is the three-byte UTF-8 sequence 0xE2 0x82 0xAC, so it percent-encodes to %E2%82%AC.

The same rule applies to accented Latin letters, CJK characters, and emoji. The é in "café" encodes to %C3%A9 because its UTF-8 representation is two bytes. The Japanese hiragana か ("ka") becomes %E3%81%8B, and an emoji from the supplementary planes expands to four escapes, such as %F0%9F%98%80 for 😀. JavaScript's encodeURIComponent already performs the UTF-8 step internally, which is one more reason to prefer it over hand-rolled encoding when handling user input.
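
To make the two steps visible, the sketch below performs them by hand with TextEncoder and compares the result with encodeURIComponent. The helper percentEncodeBytes is a hypothetical name written for illustration; unlike a real encoder it escapes every byte, including ASCII letters, so it should only be read as a demonstration.

```javascript
// Step 1: text -> UTF-8 bytes. Step 2: each byte -> %XX (uppercase hex).
function percentEncodeBytes(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const euro = "\u20AC";     // euro sign, 3 bytes in UTF-8
const eAcute = "\u00E9";   // e with acute accent, 2 bytes
const ka = "\u304B";       // hiragana "ka", 3 bytes
const grin = "\u{1F600}";  // grinning-face emoji, 4 bytes

console.log(percentEncodeBytes(euro));   // "%E2%82%AC"
console.log(percentEncodeBytes(eAcute)); // "%C3%A9"
console.log(percentEncodeBytes(ka));     // "%E3%81%8B"
console.log(percentEncodeBytes(grin));   // "%F0%9F%98%80"

// The built-in function produces the same escapes for non-ASCII text.
console.log(encodeURIComponent(grin) === percentEncodeBytes(grin)); // true
```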

Domain names are a special case. Hostnames are not percent-encoded in practice; internationalized domain names are instead converted to an ASCII-compatible form called Punycode (an A-label like xn--...), which is a completely separate transformation from percent-encoding. This is why a browser may show Unicode in the address bar for readability while the underlying DNS query and HTTP request use the Punycode and percent-encoded forms respectively. Encoding affects the path, query, and fragment; Punycode affects the host.

Because the byte-level encoding depends entirely on choosing UTF-8 first, decoding must assume UTF-8 as well. Legacy systems that percent-encoded text in another character set, such as Latin-1, can produce sequences that fail to decode as valid UTF-8 and surface as replacement characters or errors. When you control the data, always encode as UTF-8; when consuming third-party URLs, be prepared for the occasional non-UTF-8 escape and decode defensively.
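
In JavaScript, decodeURIComponent throws a URIError when the escapes do not form valid UTF-8, so "decode defensively" usually means wrapping the call. A minimal sketch follows; safeDecode is a hypothetical helper name, and returning null on failure is just one possible policy.

```javascript
// Decode a percent-encoded component, returning null instead of throwing
// when the escapes are malformed or are not valid UTF-8.
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err; // anything else is unexpected, so let it propagate
  }
}

console.log(safeDecode("caf%C3%A9")); // "café"  (valid UTF-8)
console.log(safeDecode("caf%E9"));    // null    (Latin-1 byte, not UTF-8)
console.log(safeDecode("100%"));      // null    (dangling percent sign)
```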

Double-encoding bugs and when to decode

Double encoding is the most common percent-encoding bug, and it stems directly from the fact that the percent sign is itself a character that gets encoded to %25. If you encode a string that has already been encoded, every existing escape's percent sign is escaped again: %20 becomes %2520, and %26 becomes %2526. A server that decodes such a URL once recovers %20 as literal text rather than a space, so a URL that looks plausible silently carries corrupted data. The classic symptom is a literal "%20" appearing in a page where a space was expected.

The cure is to encode exactly once, at exactly one layer, and to be deliberate about ownership of that step. A frequent cause of accidental double encoding is a value that is encoded by application code and then encoded again by an HTTP client library, a templating layer, or a redirect helper that assumes it received raw text. Before encoding, establish whether the data is already encoded; if you know it has been encoded at least once but are unsure how many times, decoding once and re-encoding gives a clean result to inspect. Do not apply this blindly to text that may be raw: a raw value containing a literal percent sign, such as "100%", either fails to decode or is silently altered. The URL Encoder/Decoder is handy for inspecting a suspect string and confirming whether it is raw, single-encoded, or double-encoded.
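
The layering problem is easy to reproduce with URLSearchParams, which encodes values on its own. In this sketch the application code "helpfully" pre-encodes a value before handing it to that layer; the parameter name q is illustrative.

```javascript
const raw = "cats & dogs";

// BUG: the value is encoded by hand and then again by URLSearchParams.
const doubled = new URLSearchParams();
doubled.set("q", encodeURIComponent(raw));
const doubledQuery = doubled.toString();
console.log(doubledQuery); // "q=cats%2520%2526%2520dogs"

// A receiver that decodes once gets escapes as literal text, not the data.
const received = new URLSearchParams(doubledQuery).get("q");
console.log(received); // "cats%20%26%20dogs"

// Fix: pass the raw value and let exactly one layer own the encoding.
const single = new URLSearchParams();
single.set("q", raw);
const singleQuery = single.toString();
console.log(singleQuery); // "q=cats+%26+dogs"
console.log(new URLSearchParams(singleQuery).get("q") === raw); // true
```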

Decoding is the mirror image and should be applied symmetrically: decode once for each time the data was encoded, and only when you actually need the raw value. The common place to decode is on the receiving side, after the URL has been parsed into its components — decode each query parameter value after splitting on & and =, not before, otherwise an encoded ampersand inside a value would be turned into a real one and split the value incorrectly. Order matters: parse structure first, decode the pieces second.
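
The ordering rule can be illustrated with a tiny hand-written parser. parseQuery is a hypothetical helper that assumes RFC 3986 style encoding (it does not treat + as a space); in real code the built-in URLSearchParams is usually the better choice.

```javascript
// Split on the structural "&" and "=" first, then decode each piece.
function parseQuery(query) {
  const result = {};
  for (const pair of query.split("&")) {
    if (!pair) continue;
    const idx = pair.indexOf("=");
    const rawKey = idx === -1 ? pair : pair.slice(0, idx);
    const rawValue = idx === -1 ? "" : pair.slice(idx + 1);
    result[decodeURIComponent(rawKey)] = decodeURIComponent(rawValue);
  }
  return result;
}

const query = "q=cats%20%26%20dogs&page=2";

// Correct order: parse structure, then decode the pieces.
const right = parseQuery(query);
console.log(right.q);    // "cats & dogs"
console.log(right.page); // "2"

// Wrong order: decoding first turns %26 into a real separator.
const wrong = parseQuery(decodeURIComponent(query));
console.log(wrong.q);              // "cats "
console.log(Object.keys(wrong));   // [ "q", " dogs", "page" ]
```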

A related pitfall is decoding too eagerly in the middle of a pipeline, which re-introduces the delimiter-collision problem the encoding was meant to prevent. Keep values encoded while they flow through URL-shaped contexts and decode them only at the boundary where they are consumed as plain data, such as just before a database lookup or display. Treating encode and decode as a matched pair at well-defined boundaries eliminates the vast majority of percent-encoding defects.

Security considerations: open redirects and XSS

Because percent-encoding controls how a URL is parsed, mishandling it has direct security consequences. A reflected cross-site scripting vulnerability often arises when a query parameter is read, decoded, and then written into the HTML response without contextual output encoding. An attacker can craft a parameter whose decoded value contains a less-than sign and a script tag; if the application echoes it unescaped, the browser executes attacker-controlled JavaScript. Percent-encoding the input on the way in is not sufficient protection — the defense is correct output encoding for the HTML context at the point of insertion.
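
A minimal sketch of output encoding for an HTML text or quoted-attribute context. escapeHtml is a hypothetical helper shown only to make the point concrete; a production application would normally rely on its template engine's automatic escaping instead.

```javascript
// Replace the characters that are significant in HTML with entities.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// The attacker's payload arrives percent-encoded, which offers no protection.
const param = "%3Cscript%3Ealert(1)%3C%2Fscript%3E";
const decoded = decodeURIComponent(param);
console.log(decoded); // "<script>alert(1)</script>"

// Escaping at the point of insertion renders it as inert text.
const safeHtml = escapeHtml(decoded);
console.log(safeHtml); // "&lt;script&gt;alert(1)&lt;/script&gt;"
```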

Open-redirect flaws are another classic. An endpoint that takes a destination URL in a parameter and redirects to it can be abused to send victims to a malicious site if it trusts the value blindly. Attackers frequently use encoding tricks — double-encoded slashes, encoded scheme separators, or mixed-case escapes — to slip a hostile target past naive validation that only checks the raw string. The robust fix is to decode fully, then validate the resulting URL's scheme and host against an allowlist, rather than pattern-matching the still-encoded text.
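
The allowlist approach might look like the sketch below. safeRedirectTarget and ALLOWED_HOSTS are hypothetical names, the hosts are placeholders, and the function assumes it receives the parameter value after the framework has already decoded it once.

```javascript
// Hosts this hypothetical application is willing to redirect to.
const ALLOWED_HOSTS = new Set(["example.com", "www.example.com"]);

// Return a normalized URL if the target is acceptable, otherwise null.
function safeRedirectTarget(decodedParam) {
  let url;
  try {
    url = new URL(decodedParam); // parse; relative or malformed input throws
  } catch {
    return null;
  }
  if (url.protocol !== "https:") return null;        // validate the scheme
  if (!ALLOWED_HOSTS.has(url.hostname)) return null; // validate the host
  return url.href;
}

// The query string as received; URLSearchParams decodes the value once.
const next = new URLSearchParams(
  "next=https%3A%2F%2Fevil.example%2Fexample.com"
).get("next");

console.log(next.includes("example.com")); // true, a naive check is fooled
console.log(safeRedirectTarget(next));     // null, the host is evil.example
console.log(safeRedirectTarget("https://example.com/account"));
// "https://example.com/account"
```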

Server-side request forgery and path-traversal attacks similarly lean on encoding to evade filters. Encoding a slash as %2F or a dot as %2E can bypass a filter that looks for literal "../" while still being decoded into a traversal sequence by a downstream component. The general lesson is that filters must operate on the fully decoded, canonical form of the input, and the same input should not be decoded again after it has been validated, or a checker-versus-user discrepancy reopens the hole.
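
As a rough illustration of filtering on the decoded form, the sketch below decodes a path once, rejects it if another layer of escapes remains, and only then looks for traversal segments. isSafeRelativePath is a hypothetical helper and deliberately strict (it also rejects legitimate names containing an encoded percent sign); real applications should additionally resolve the path against a fixed base directory.

```javascript
// Decide whether a percent-encoded relative path is free of ".." segments.
function isSafeRelativePath(encodedPath) {
  let decoded;
  try {
    decoded = decodeURIComponent(encodedPath);
  } catch {
    return false; // malformed escapes are rejected outright
  }
  // If escapes survive one decode, the input was encoded more than once;
  // refuse it rather than risk a later component decoding it again.
  if (/%[0-9A-Fa-f]{2}/.test(decoded)) return false;

  // Check segments of the decoded form, treating "/" and "\" as separators.
  const segments = decoded.split(/[\\\/]/);
  return !segments.includes("..");
}

console.log(isSafeRelativePath("images/logo.png"));         // true
console.log(isSafeRelativePath("..%2F..%2Fetc%2Fpasswd"));  // false
console.log(isSafeRelativePath("%2E%2E/secret"));           // false
console.log(isSafeRelativePath("%252E%252E%252Fsecret"));   // false
```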

The unifying security principle is to canonicalize input by decoding it completely and exactly once, validate the canonical form against strict allowlists, and then apply output encoding appropriate to wherever the value is finally used — HTML, an attribute, JavaScript, or another URL. Percent-encoding is a safe-transport mechanism, not a security control, so it must be paired with validation and context-aware output encoding rather than relied upon alone.

Key takeaways

  • Percent-encoding replaces each unsafe byte with %XX so data can travel inside a URL without colliding with its RFC 3986 syntax; it is encoding, not encryption.
  • Only the unreserved set (A-Z a-z 0-9 - . _ ~) never needs encoding; reserved characters must be encoded whenever they are data rather than delimiters.
  • Use encodeURIComponent for individual values and encodeURI only for whole URLs; build the URL skeleton yourself and encode each dynamic value separately.
  • A space is %20 in URI/path contexts but a plus sign in application/x-www-form-urlencoded; the two conventions are not interchangeable.
  • Non-ASCII text is encoded as UTF-8 first, then each byte is percent-encoded individually, so one character can expand to several escapes.
  • Encode exactly once and decode symmetrically at component boundaries; canonicalize and validate input to prevent double-encoding bugs, open redirects, and XSS.

Frequently asked questions

What is the difference between encodeURI and encodeURIComponent?

encodeURI is meant for an entire, already-structured URL and deliberately leaves the reserved delimiters (: / ? # & = and others) unencoded so it does not break the URL. encodeURIComponent is meant for a single value — one query parameter or path segment — and encodes those reserved characters too, so an ampersand or equals sign inside the value cannot corrupt the surrounding URL. Use encodeURIComponent for any dynamic value you insert into a URL.

Why is a space sometimes %20 and sometimes a plus sign?

Both represent a space, but in different conventions. RFC 3986 percent-encoding, used in paths and by encodeURIComponent, encodes a space as %20. The application/x-www-form-urlencoded format used by HTML forms and URLSearchParams encodes a space as a plus sign and a literal plus sign as %2B. They are valid in their own contexts but not interchangeable, so the receiving server must parse with the convention the URL was built with.

Is URL encoding the same as Base64?

No. URL encoding escapes individual unsafe characters so text is safe inside a URL, leaving safe characters readable. Base64 transforms arbitrary binary data into a 64-character ASCII alphabet, expanding size by about a third and producing output that is not human-readable. They solve different problems; you can learn about the latter with the Base64 Encoder and the base64 encoding guide.

Does URL encoding encrypt or protect data?

No. Percent-encoding is a reversible, public transformation with no secret key — anyone can decode %2F back to a slash instantly. It provides neither confidentiality nor integrity and must never be used as a security measure. For protecting data you need encryption or hashing, and for safely placing data in HTML you need context-appropriate output encoding in addition to URL encoding.

How do I fix a URL that shows %2520 or a literal %20 in it?

Those symptoms indicate double encoding, where already-encoded text was encoded again and the percent sign became %25. Fix it by ensuring the value is encoded exactly once, at one layer of your code. If you know a string has been encoded but are unsure how many times, decode it once and re-encode it to get a clean result to inspect, bearing in mind that this is not safe for raw text that may contain a literal percent sign; the URL Encoder/Decoder lets you inspect and correct such strings.

How are international characters like emoji or accented letters encoded in a URL?

They are first encoded as UTF-8 bytes, then each byte is percent-encoded individually, so a single character can become several escapes. The é in café becomes %C3%A9, the euro sign becomes %E2%82%AC, and many emoji become four escapes such as %F0%9F%98%80. encodeURIComponent performs the UTF-8 step automatically; internationalized domain names are a separate case handled by Punycode rather than percent-encoding.