perishablepress.com (69.16.192.217)

URL: https://perishablepress.com/stop-using-unsafe-characters-in-urls/
Submission: On November 05 via manual from US — Scanned from US


Perishable Press
Web Dev + WordPress + Security

(PLEASE) STOP USING UNSAFE CHARACTERS IN URLS

♦ Posted by Jeff Starr in Security
Updated May 31, 2022 • 26 comments

Just as there are specifications for designing with CSS, HTML, and JavaScript,
there are specifications for working with URIs/URLs. The Internet Engineering
Task Force (IETF) clearly defines these specifications in RFC 3986: Uniform
Resource Identifier (URI): Generic Syntax. Within that document, there are
guidelines regarding which characters may be used safely within URIs. This post
summarizes the information, and encourages developers to understand and
implement accordingly.



FYI: URL is a specific type of URI. Learn more »


TABLE OF CONTENTS

 * About the RFC 3986 Specification
 * Character Encoding Chart
 * More about Character Types
   * Reserved Characters
   * Unreserved Characters
   * Unsafe Characters
   * ASCII Characters
 * URLs in HTML and JavaScript
 * Unsafe Characters in WordPress
 * A Dangerous Trend
 * URLs and Firewall Security
 * WordPress and 5G Blacklist
 * Take-home Message


ABOUT THE RFC 3986 SPECIFICATION

The specifications for Uniform Resource Identifiers (URIs) and more specifically
Uniform Resource Locators (URLs) provide a safe, consistent way to request,
identify, and resolve resources on the Internet. As clearly stated in RFC 3986:

> A Uniform Resource Identifier (URI) is a compact sequence of characters that
> identifies an abstract or physical resource. This specification defines the
> generic URI syntax and a process for resolving URI references that might be in
> relative form, along with guidelines and security considerations for the use
> of URIs on the Internet. The URI syntax defines a grammar that is a superset
> of all valid URIs, allowing an implementation to parse the common components
> of a URI reference without knowing the scheme-specific requirements of every
> possible identifier.

Thanks to the brilliant work of experts such as Tim Berners-Lee, Roy Fielding,
Larry Masinter, and Mark McCahill, developers have a safe, consistent protocol
for working with URIs/URLs on the Web. It is important that we adhere to these
specifications when developing software, plugins, apps, and the like. Failing to
do so introduces potential security vulnerabilities which may be exploited by
nefarious individuals and malicious scripts.


CHARACTER ENCODING CHART

To help promote the cause of Web Standards and adhering to specifications, here
is a quick reference chart explaining which characters are “safe” and which
characters should be encoded in URLs.

 * Safe characters (encoding required: NO): alphanumerics [0-9a-zA-Z] and unreserved characters; also reserved characters when used for their reserved purposes (e.g., a question mark used to denote a query string)
 * Unreserved characters (encoding required: NO): - . _ ~ (does not include the blank space)
 * Reserved characters (encoding required: YES 1): : / ? # [ ] @ ! $ & ' ( ) * + , ; = (does not include the blank space)
 * Unsafe characters (encoding required: YES): the blank/empty space and " < > % { } | \ ^ `
 * ASCII control characters (encoding required: YES): the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F (127 decimal)
 * Non-ASCII characters (encoding required: YES): the entire “top half” of the ISO-Latin set, 80-FF hex (128-255 decimal)
 * All other characters (encoding required: YES): any character(s) not mentioned above should be percent-encoded

1 Reserved characters only need to be encoded when not used for their defined, reserved purposes.
FYI: As with the specifications, the above chart is a work in progress and
subject to change. If you have any suggestions to improve it, please let me
know.

The above chart is a summary of which characters need to be encoded in
URIs/URLs, based on the current specification RFC 3986. Web developers should be
mindful when working with URLs in their applications and implementations.
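
To see the chart in action, here is a brief sketch using JavaScript’s standard encoding functions. (Note that `encodeURIComponent` follows an older RFC and leaves `! * ' ( )` unencoded, so it is close to, but not an exact match for, the RFC 3986 sets.)

```javascript
// Unreserved characters pass through unencoded.
console.log(encodeURIComponent("abc-._~123")); // "abc-._~123"

// Unsafe characters (space, double quote, angle brackets) are encoded.
console.log(encodeURIComponent('a b"<>')); // "a%20b%22%3C%3E"

// encodeURIComponent treats reserved characters as data and encodes
// them; encodeURI assumes they are delimiters and leaves them alone.
console.log(encodeURIComponent("?&=")); // "%3F%26%3D"
console.log(encodeURI("?&="));          // "?&="
```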


MORE ABOUT CHARACTER TYPES

Here is some further discussion about each of the various character types:
Reserved Characters, Unreserved Characters, Unsafe Characters, and ASCII
Characters.


RESERVED CHARACTERS

More information about “reserved” characters from RFC 3986:

URIs include components and subcomponents that are delimited by characters in
the “reserved” set. These characters are called “reserved” because they may (or
may not) be defined as delimiters by the generic syntax, by each scheme-specific
syntax, or by the implementation-specific syntax of a URI’s dereferencing
algorithm. If data for a URI component would conflict with a reserved
character’s purpose as a delimiter, then the conflicting data must be
percent-encoded before the URI is formed.

reserved    = gen-delims / sub-delims

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="

The purpose of reserved characters is to provide a set of delimiting characters
that are distinguishable from other data within a URI. URIs that differ in the
replacement of a reserved character with its corresponding percent-encoded octet
are not equivalent. Percent-encoding a reserved character, or decoding a
percent-encoded octet that corresponds to a reserved character, will change how
the URI is interpreted by most applications. Thus, characters in the reserved
set are protected from normalization and are therefore safe to be used by
scheme-specific and producer-specific algorithms for delimiting data
subcomponents within a URI. […]

If a reserved character is found in a URI component and no delimiting role is
known for that character, then it must be interpreted as representing the data
octet corresponding to that character’s encoding in US-ASCII.
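
The non-equivalence described above is easy to demonstrate with the WHATWG URL parser built into browsers and Node.js (example.com is a placeholder host):

```javascript
// A raw "&" acts as a parameter delimiter; the percent-encoded "%26"
// is plain data inside a single value.
const raw     = new URL("https://example.com/?q=cats&dogs=1");
const encoded = new URL("https://example.com/?q=cats%26dogs=1");

console.log([...raw.searchParams.keys()]);  // ["q", "dogs"]
console.log(encoded.searchParams.get("q")); // "cats&dogs=1"
```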


UNRESERVED CHARACTERS

More information about “unreserved” characters from RFC 3986:

Characters that are allowed in a URI but do not have a reserved purpose are
called unreserved. These include uppercase and lowercase letters, decimal
digits, hyphen, period, underscore, and tilde.

unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"

URIs that differ in the replacement of an unreserved character with its
corresponding percent-encoded US-ASCII octet are equivalent: they identify the
same resource. However, URI comparison implementations do not always perform
normalization prior to comparison (see Normalization and Comparison). For
consistency, percent-encoded octets in the ranges of ALPHA (%41–%5A and
%61–%7A), DIGIT (%30–%39), hyphen (%2D), period (%2E), underscore (%5F), or
tilde (%7E) should not be created by URI producers and, when found in a URI,
should be decoded to their corresponding unreserved characters by URI
normalizers.
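
The normalization step described in that last sentence can be sketched in a few lines of JavaScript (`normalizeUnreserved` is a hypothetical helper name, not a standard API):

```javascript
// Decode only those percent-encoded octets that correspond to
// unreserved characters (ALPHA / DIGIT / "-" / "." / "_" / "~"),
// leaving reserved and unsafe octets encoded.
function normalizeUnreserved(uri) {
  return uri.replace(/%([0-9A-Fa-f]{2})/g, (match, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : match;
  });
}

console.log(normalizeUnreserved("%7Euser/%41bc")); // "~user/Abc"
console.log(normalizeUnreserved("a%2Fb")); // "a%2Fb" (reserved "/" stays encoded)
```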


UNSAFE CHARACTERS

More about “unsafe” characters, from RFC 1738. Note that RFC 1738 is now
obsolete; however, the information remains useful in a general sense, and is
shared below for educational and reference purposes.

Characters can be unsafe for a number of reasons. The space character is unsafe
because significant spaces may disappear and insignificant spaces may be
introduced when URLs are transcribed or typeset or subjected to the treatment of
word-processing programs. The characters < and > are unsafe because they are
used as the delimiters around URLs in free text; the quote mark (") is used to
delimit URLs in some systems. The character # is unsafe and should always be
encoded because it is used in World Wide Web and in other systems to delimit a
URL from a fragment/anchor identifier that might follow it. The character % is
unsafe because it is used for encodings of other characters. Other characters
are unsafe because gateways and other transport agents are known to sometimes
modify such characters. These characters are {, }, |, \, ^, ~, [, ], and `.

All unsafe characters must always be encoded within a URL. For example, the
character # must be encoded within URLs even in systems that do not normally
deal with fragment or anchor identifiers, so that if the URL is copied into
another system that does use them, it will not be necessary to change the URL
encoding.

Again, the above “unsafe character” information is from RFC 1738, which is
obsoleted and replaced by RFC 3986. For example, the tilde ~ character is now an
unreserved character and does not need to be encoded. Likewise, the # hash
(octothorpe) character is now a reserved character and does not need to be
encoded unless used for non-reserved purposes. If in doubt, follow RFC 3986.
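
The special status of “#” is easy to observe with the WHATWG URL parser built into browsers and Node.js (example.com and the tag parameter are illustrative):

```javascript
// "#" encoded as %23 stays inside the query; a raw "#" starts the
// fragment and truncates the query at that point.
const u1 = new URL("https://example.com/?tag=c%23");   // data "c#" properly encoded
const u2 = new URL("https://example.com/?tag=c#sharp"); // raw "#" in the URL

console.log(u1.search, u1.hash); // "?tag=c%23" ""
console.log(u2.search, u2.hash); // "?tag=c" "#sharp"
```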


ASCII CHARACTERS

More information about ASCII characters from W3C Internationalization:

Currently Web addresses are typically expressed using Uniform Resource
Identifiers or URIs. The URI syntax defined in RFC 3986 STD 66 (Uniform Resource
Identifier (URI): Generic Syntax) essentially restricts Web addresses to a small
number of characters: basically, just upper and lower case letters of the
English alphabet, European numerals and a small number of symbols.

The original reason for this was to aid in transcription and usability, both in
computer systems and in non-computer communications, to avoid clashes with
characters used conventionally as delimiters around URIs, and to facilitate
entry using those input facilities available to most Internet users.

User’s expectations and use of the Internet have moved on since then, and there
is now a growing need to enable use of characters from any language in Web
addresses. A Web address in your own language and alphabet is easier to create,
memorize, transcribe, interpret, guess, and relate to. It is also important for
brand recognition. This, in turn, is better for business, better for finding
things, and better for communicating. In short, better for the Web.

Imagine, for example, that all web addresses had to be written in Japanese
katakana, as shown in the example below. How easy would it be for you, if you
weren’t Japanese, to recognize the content or owner of the site, or type the
address in your browser, or write the URI down on notepaper, etc.?

http://ヒキワリ.ナットウ.ニホン

There have been several developments recently that begin to make this possible.
[…] Learn more at w3.org »
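
One of those developments is easy to observe today: modern URL parsers apply IDNA, converting a non-ASCII hostname to an ASCII-compatible (“Punycode”, xn--) form before it goes on the wire. A minimal sketch using the WHATWG URL parser in Node.js, reusing the article’s katakana example:

```javascript
// The parser converts each non-ASCII hostname label to its
// ASCII-compatible "xn--" Punycode form for transmission.
const u = new URL("http://ヒキワリ.ナットウ.ニホン/");
console.log(u.hostname); // an ASCII "xn--..." encoding of the katakana labels
```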


URLS IN HTML AND JAVASCRIPT

In earlier versions of HTML, the entire range of the ISO-8859-1 (ISO-Latin)
character set could be used in documents. Since HTML 4, the entire Unicode
character set may also be used. In HTTP, however, the range of allowed
characters is expressly limited to a subset of the US-ASCII character set
(see the Character Encoding Chart for details).

So, when writing HTML, ISO and Unicode characters may be used everywhere in the
document except where URLs are referenced*. This includes the following
elements:

<a>, <applet>, <area>, <base>, <bgsound>, <body>,
<embed>, <form>, <frame>, <iframe>, <img>, <input>, <link>,
<object>, <script>, <table>, <td>, <th>, <tr>

* Update: As Mathias explains, “it’s perfectly okay to leave those symbols
unencoded, as browsers will take care of them as per the URL parsing algorithm
in the HTML spec.”

As flexible as HTML is in terms of which characters may be used, there are
strict limits to which characters may be used when referencing URLs. This
limitation applies not only to URLs used in HTML, but also to URLs referenced in
any coding language (e.g., JavaScript, PHP, Perl, etc.).
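
In practice the safe habit is the same in every language: encode each dynamic value before splicing it into a URL. A minimal JavaScript sketch (`buildSearchUrl` is a hypothetical helper; the base URL and parameter names are illustrative):

```javascript
// Percent-encode every key and value before assembling the query string.
function buildSearchUrl(base, params) {
  const query = Object.entries(params)
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return `${base}?${query}`;
}

console.log(buildSearchUrl("https://example.com/search", { q: "50% off?", lang: "en" }));
// "https://example.com/search?q=50%25%20off%3F&lang=en"
```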


UNSAFE CHARACTERS IN WORDPRESS

Note: Things have changed a lot with WordPress and URI specification. So the
following information may be outdated, but the example serves to illustrate how
developers should be mindful of specification when crafting URIs in
applications.

In version 3.5, WordPress uses improper, unencoded URLs to enqueue JavaScript
libraries. Specifically, in the WP Admin area, various URLs are called using
square brackets [ ], which are classified as unsafe characters. Here is an
example:

http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5

Also affecting the WordPress Admin, here is an example of unsafe characters in
URLs, pointed out in this comment:

http://test.site/wp-admin/post.php?t=1347548645469?t=1347548651124?t=1347548656685?t=1347548662469?t=1347548672300?t=1347548681615?

The specification reserves the question mark “?” for denoting a query string;
used for any other purpose, it must be encoded. Unfortunately, WordPress is
including multiple unencoded question marks in URLs involved with its “preview”
functionality. In other words, in any URL the first question mark “?” may be
unencoded to denote the query string, but any subsequent “?” must be encoded.
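
A standards-based parser makes the problem visible: the extra question marks are swallowed into a single garbled parameter value. (The URL below is a shortened, illustrative version of the WordPress example above.)

```javascript
// Only the first "?" begins the query string, so the parser sees one
// garbled "t" value instead of several separate parameters.
const u = new URL("http://test.site/wp-admin/post.php?t=111?t=222?t=333");
console.log(u.searchParams.get("t")); // "111?t=222?t=333"
```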

These errors may not be a huge deal, but they increase potential vulnerability
and certainly should be fixed in the next WP update. Likewise, future versions
of WordPress should keep URI/URL specifications in mind and verify that all URLs
are properly encoded.


A DANGEROUS TREND

Note: Things have changed a lot with Google and URI specification. So the
following information may be outdated, but the example serves to illustrate how
developers should be mindful of specification when crafting URIs in
applications.

WordPress isn’t the only popular piece of software that is not following
specification; rather, we’re seeing a disturbing trend wherein big companies
such as Google are including unsafe characters in their URLs. Here is a recently
reported example:

http://blog.sergeys.us/beer?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed:+SergeySus+(Sergey+Sus+Photography+%C2%BB+Blog)&utm_content=Google+Reader

Notice the unencoded colon : character? Apparently Google is including them in
URLs for FeedBurner and Google Reader. Hopefully this is just an oversight that
will be corrected in a future update.

For more examples of unsafe characters in popular apps and plugins, scan through
some of the comments left on my 5G, 6G (beta), and BBQ plugin.


URLS AND FIREWALL SECURITY

For the record, the 5G Blacklist, 6G Firewall (beta) — and all of my firewalls
for that matter — are built on the foundation of IETF specifications. As
explained in detail here and here, the .htaccess rules used in my G-series
firewalls are designed to block malicious URL requests such as those that
contain unsafe characters. Other firewall/security plugins and scripts operate
in similar fashion, using standards and specifications to determine which URLs
are potentially dangerous.

> Developers, please stop using unsafe characters in URLs.

Many web application firewalls (WAFs) and security applications rely on pattern
recognition to help protect sites against threatening activity, but such
security measures fail when developers ignore specification and include
unencoded unsafe characters in URLs. Worse, by introducing inconsistency into
the system, noncompliant scripts pose a potential security risk and open the
door to attacks.
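
As an illustration only (not the actual 5G/6G rules), a pattern-based check might flag query strings containing characters from the unsafe set. Note that “%” is deliberately excluded here, since legitimate percent-encodings depend on it:

```javascript
// Flag query strings containing unencoded "unsafe" characters:
// whitespace, double quote, angle brackets, braces, pipe, backslash,
// caret, backtick, and square brackets.
const UNSAFE = /[\s"<>{}|\\^`\[\]]/;

function looksSuspicious(url) {
  const i = url.indexOf("?");
  return i !== -1 && UNSAFE.test(url.slice(i + 1));
}

console.log(looksSuspicious("http://example.com/?load[]=jquery"));     // true
console.log(looksSuspicious("http://example.com/?load%5B%5D=jquery")); // false
```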


WORDPRESS AND 5G BLACKLIST

Note: This section contains outdated information. WordPress now is way beyond
version 3.5, and the URI specification has changed considerably. Also, the 5G
Blacklist is superseded by the 6G and 7G Firewall. And 8G is in the works :)

As mentioned, WordPress 3.5 includes unencoded square brackets in various URLs
in the Admin area. As explained, the 5G Blacklist blocks such unsafe characters
to help users secure their WP-powered sites. Thus, if you’re running both
WordPress and 5G, there will be an issue wherein certain URL requests are denied
with a “403 – Forbidden” response.

So, until WordPress can get things fixed up, here is how to modify the 5G
Blacklist (don’t even think about modifying any WP core files) to “allow” those
unsafe URLs to pass through the firewall.

STEP 1

In the 5G Blacklist, locate this section of code:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC,OR]
 RewriteCond %{QUERY_STRING} \[         [NC,OR]
 RewriteCond %{QUERY_STRING} \]         [NC]
 RewriteRule .* - [F]
</ifModule>

STEP 2

Replace that entire block of code with this revised version that excludes the
rules that block the unsafe characters:

# 5G:[QUERY STRINGS]
<ifModule mod_rewrite.c>
 RewriteEngine On
 RewriteBase /
 RewriteCond %{QUERY_STRING} (environ|localhost|mosconfig|scanner) [NC,OR]
 RewriteCond %{QUERY_STRING} (menu|mod|path|tag)\=\.?/? [NC,OR]
 RewriteCond %{QUERY_STRING} boot\.ini  [NC,OR]
 RewriteCond %{QUERY_STRING} echo.*kae  [NC,OR]
 RewriteCond %{QUERY_STRING} etc/passwd [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\%27$   [NC,OR]
 RewriteCond %{QUERY_STRING} \=\\\'$    [NC,OR]
 RewriteCond %{QUERY_STRING} \.\./      [NC,OR]
 RewriteCond %{QUERY_STRING} \?         [NC,OR]
 RewriteCond %{QUERY_STRING} \:         [NC]
 RewriteRule .* - [F]
</ifModule>

Done. No further edits should be required, unless you’ve made any of your own
modifications.


TAKE-HOME MESSAGE

When developing for the Web, adherence to standards and protocols is important.
By taking the time to properly encode your URLs, you eliminate inconsistency,
eliminate vulnerabilities, facilitate extensibility, and ensure proper
functionality. Hopefully this article serves as a reminder and helps clear up
any confusion about which characters need to be encoded and why it’s so
important to do so.



blacklist http tips url
About the Author
Jeff Starr = Creative thinker. Passionate about free and open Web.

26 RESPONSES TO “(PLEASE) STOP USING UNSAFE CHARACTERS IN URLS”

 1.  emrah 2012/12/31 7:49 pm • Reply
     
     this post is racist. web is evolving, and there are enough space for other
     charsets. Not only English chars.
     
     * Jeff Starr 2012/12/31 8:23 pm • Post Author • Reply
       
       Wow, interesting opinion, emrah. Thanks for chiming in with that.
       
       I should add the post merely strives to explain existing specifications
       (and why they’re important), it makes no value judgments one way or
       another regarding which character sets are better/worse than others.
       Totally not the point here.
       
       And I’m pretty sure that UTF-8 includes non-english characters as well,
       so no need to start accusing anything/anyone of being “racist”. Sheesh.
       
       * emrah 2013/01/01 8:15 am
         
         ok i was kidding with a little truth. UTF-8 is fine enough.
       
         
       * Jeff Starr 2013/01/01 11:43 pm • Post Author
         
         That’s good to know.. I was genuinely concerned about you ;)
       
         
       * Mathias Bynens 2013/01/03 6:09 am
         
         Minor nitpick: UTF-8 is a character encoding, so technically there is
         no such thing as “UTF-8 characters”. You probably meant to say “Unicode
         symbols”.
       
         
       * Jeff Starr 2013/01/03 5:49 pm • Post Author
         
         Yes that is what I meant to say, thanks Mathias.
       
         
       
     
 2.  Paul 2013/01/01 7:16 am • Reply
     
     Admirably thorough article. Let’s hope Google is reading.

     
 3.  Mathias Bynens 2013/01/02 4:18 am • Reply
     
     > The specifications for Uniform Resource Identifiers (URIs) and more
     > specifically Uniform Resource Locators (URLs) provide a safe, consistent
     > way to request, identify, and resolve resources on the Internet. As
     > clearly stated in RFC3986: […]
     
     If only that were true. Sadly, RFC3986 doesn’t match reality. That’s why
     Anne van Kesteren has been working on a URL spec based on existing
     implementations.
     
     > In HTTP, however, the range of allowed characters is expressly limited to
     > only a subset of the US-ASCII character set […] So, when writing HTML,
     > ISO and Unicode characters may be used everywhere in the document except
     > where URLs are referenced.
     
     That’s not true at all. HTML != HTTP. <a href=☺>…</a> is perfectly valid
     HTML. It’s up to browsers to resolve special characters to their
     percent-encoded escape sequences as needed before any HTTP requests are
     made to the URL. For more information, see the URL parsing algorithm in the
     HTML spec (404 link removed 2014/07/20).
     
     Also, in a blog post on special characters in URLs, you might want to
     mention Punycode.
     
     * Jeff Starr 2013/01/02 6:41 pm • Post Author • Reply
       
       Great feedback, Mathias, thank you. The new spec looks ambitious and
       promising. It will be interesting to watch as things unfold in 2013.
       
       For your second point, I guess I’m confused.. researching this topic on
       the Web, I came across numerous sources, including this article, which
       seem to contradict what you’re saying here.. Also, not all HTTP requests
       are initiated by browsers, especially those involving unsafe characters,
       which are usually transmitted via script, command line, etc. The main
       point of the article is that unsafe characters should not be present in
       the URL, according to the referenced specifications. A smiley face may be
       perfectly valid HTML, but it needs to be encoded to be safe for HTTP.
       
       Then again, my expertise is admittedly limited in this area, so any
       further infos are most welcome.
       
       * Mathias Bynens 2013/01/03 6:04 am
         
         Hey Jeff, thanks for the reply! I’ll try to explain what I meant
         exactly, in case it was unclear.
         
         I fail to see where that article contradicts anything I’m saying.
         You’re right that those special characters shouldn’t be used without
         encoding them in URLs as far as HTTP and RFC3986 are concerned — but
         contrary to what you’re saying here, in HTML it’s perfectly okay to
         leave those symbols unencoded, as browsers will take care of them as
         per the URL parsing algorithm in the HTML spec.
       
         
       * Jeff Starr 2013/01/03 6:17 pm • Post Author
         
         Ah, that makes sense, thanks for explaining. I went ahead and added a
         note in that section of the article to clarify.
         
         Question: where in the spec does it explain how browsers should handle
         unsafe and reserved characters? For example, if I have an HTML document
         that references a URL containing, say, square brackets (classified as
         “unsafe”):
         
         http://example.com/wp-admin/load-scripts.php?c=1&load[]=swfobject,jquery,utils&ver=3.5
         
         As mentioned, the issue with WordPress is that browsers aren’t encoding
         the square brackets that are included in some URLs. Similarly with
         Google, there are additional question marks (“reserved” characters)
         included in URLs that aren’t getting encoded. It would be great to hear
         your thoughts on what’s happening (or not) with this issue.
       
         
       
     
 4.  Yael K. Miller 2013/01/02 10:42 am • Reply
     
     Thanks for writing this post — I had no idea that an URL itself could be
     unsafe.
     
     Google also uses “?” when you create goal-specific URLs in Google
     Analytics.
     
     And Amazon loves making long URLs when you’re searching on amazon.com
     Example (my business partner’s books):
     http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Ddigital-text&field-keywords=Phyllis+Zimbler+Miller
     
     (Although, I could be wrong: when ? or % is displayed does that mean it’s
     always un-encoded or is Firefox displaying encoded characters as
     un-encoded?)
     
     Ok, so where can I find the information to encode these unsafe characters?
     
     * Jeff Starr 2013/01/02 6:53 pm • Post Author • Reply
       
       Hi Yael, thanks for the feedback. Just to be clear, the first instance of
       “?” in a URL denotes a query string and is valid; it’s only subsequent
       instances of “?” that need encoding. Also, the Amazon URL you mention
       looks valid, and technically there is nothing wrong with “long” URLs
       (although they can be unwieldy to work with).
       
       As Mathias mentions in the previous comment, it’s up to browsers to
       properly encode/escape URLs before sending the request, but I’m not sure
       if that’s always the case.
       
       And to encode unsafe characters, any online URL decoder/encoder should do
       the job, for example this one. There are various conversion charts also
       available online. If I have time, I’ll try posting an article about this.
       
       * Yael K. Miller 2013/01/03 10:18 am
         
         Thanks.
       
         
       
     
 5.  Maxi 2013/01/07 2:16 am • Reply
     
     You
     Are
     Brilliant!
     
     thanks for sharing 5G and this solution to my newly 403 problems

     
 6.  Sean Ellingham 2013/01/16 9:13 am • Reply
     
     Unless I’m reading the RFCs wrongly (or misinterpreting your post), saying
     that all reserved characters must always be encoded is incorrect. For
     example, in an HTTP URL (at least as far as I can tell), the reserved
     characters “;”, “:”, “@”, “&” and “=” are perfectly acceptable in the path
     and query string without being encoded – see page 17 of RFC1738.
     
     * Jeff Starr 2013/01/16 2:56 pm • Post Author • Reply
       
       > saying that all reserved characters must always be encoded is
       > incorrect.
       
       I don’t think I say that anywhere in the article, so if there is some
       confusion please let me know so I may clarify.
       
       Also, page 17 of RFC1738 refers to “ip based protocols”, which use the
       reserved characters according to their specifically defined reserved
       purpose. At least, that’s how I currently understand the RFC, please
       advise if I am misguided.
       
       * Sean Ellingham 2013/01/16 3:46 pm
         
         True, it’s not said directly, but when I read the article that was the
         impression I received. I believe the culprit is the quick reference
         chart, which says encoding is required for everything except safe
         characters.
         
         The particular part of page 17 of RFC1738 I was referring to was this
         section:
         
         ; HTTP
         
         httpurl = "http://" hostport [ "/" hpath [ "?" search ]]
         hpath = hsegment *[ "/" hsegment ]
         hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
         search = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
         
         My understanding of that (with the definitions around it) was that
         those five reserved characters were allowed as is in the path and query
         string. Then again, that’s just my interpretation – I’m not making any
         claims that I’m an expert in this area, so I’ll gladly defer to better
         judgement. These RFCs aren’t exactly the easiest things to get your
         head around!
       
         
       * Jeff Starr 2013/01/16 4:13 pm • Post Author
         
         Thanks for clarifying, I’ve updated the chart with a note about
         encoding of reserved characters.
         
         For RFC, page 3 states:
         
         “Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and
         reserved characters used for their reserved purposes may be used
         unencoded within a URL.”
         
         I think the information you reference on p17 is showing that reserved
         characters are acceptable if used according to definition (i.e., p3).
         
         But you are correct that the RFCs are not the easiest thing to
         understand. And like you, I’m no expert in this area so if anything is
         incorrect I’ll be glad to revise accordingly.
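[Editor’s note: the page-3 rule Jeff quotes can be sketched as a character set. The variable names below are illustrative; “safe” and “extra” are RFC 1738’s own class names.]

```python
import string

# RFC 1738's unencoded set: alphanumerics plus the "safe" and "extra" specials.
safe  = "$-_.+"
extra = "!*'(),"
unreserved = set(string.ascii_letters + string.digits + safe + extra)

assert "," in unreserved    # commas count as "extra" (see P. Don's comment below)
assert " " not in unreserved  # spaces always need encoding (%20)
assert "{" not in unreserved  # curly braces are outside the set entirely
```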
       
         
       * Sean Ellingham 2013/01/16 4:48 pm
         
         Hmm, I obviously missed that line on page 3 – that sentence does seem
         fairly conclusive. What is confusing me though is that the definitions
         for the schemes listed on pages 17 onwards seem to conflict with that,
         by implying that the reserved characters can potentially be used in
         certain other parts of a URL. Using HTTP as an example, if the
         definitions are expanded, it seems to say that a path can comprise
         multiple “/” delimited sections, where a section comprises any number
         of: unreserved characters, escape sequences, or the “:”, “;”, “@”, “&”
         or “=” characters – it is because these characters are explicitly
         listed, combined with the following statement on page 8, that I believe
         them to be valid.
         
         Within the and components, “/”, “;”, “?” are
         reserved. The “/” character may be used within HTTP to designate a
         hierarchical structure.
         
         However, it is worth considering that (if my interpretation is correct)
         whilst RFC1738 seems to allow unencoded “&” and “=” in the query
         string, we would expect these to be ‘reserved’ as we are used to their
         use in application/x-www-form-urlencoded style form submissions –
         although I don’t know where that encoding of key/value pairs is
         specified (I believe it might be in the HTML specification, although
         I’m not sure).
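[Editor’s note: Sean’s recollection is right — `application/x-www-form-urlencoded` is defined in the HTML specification (HTML 4.01 §17.13.4, now maintained in the WHATWG URL Standard). A quick sketch with Python’s `urllib` shows why “&” and “=” feel reserved in query strings even where RFC 1738 allows them; the parameters here are made up for illustration.]

```python
from urllib.parse import urlencode, parse_qs

# "&" separates pairs and "=" separates key from value, so those same
# characters are percent-encoded when they occur inside a value.
encoded = urlencode({"q": "a=b&c", "lang": "en"})
print(encoded)  # q=a%3Db%26c&lang=en

# Decoding recovers the original data:
assert parse_qs(encoded) == {"q": ["a=b&c"], "lang": ["en"]}
```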
       
         
       * Sean Ellingham 2013/01/16 4:51 pm
         
         Oops, the quote from page 8 got eaten as I forgot to replace the angle
         brackets with the HTML entities – need to get to bed! Here’s what it
         should have said:
         
         “Within the <path> and <searchpart> components, “/”, “;”, “?” are
         reserved. The “/” character may be used within HTTP to designate a
         hierarchical structure.”
       
         
       
     
 7.  Riva 2013/01/27 12:23 pm • Reply
     
     Really liked what you had to say in your post, (Please) Stop Using Unsafe
     Characters in URLs : Perishable Press, thanks for the good read!
     — Riva
     
     http://www.terrazoa.com

     
 8.  P. Don 2013/01/31 10:15 am • Reply
     
     I checked RFC 1738 since I was having a problem with commas in request
     strings triggering 403s using the 5G .htaccess.
     
     It appears that commas are fine, defined as “extra” characters (see the BNF
     section), one of the allowable sets of unreserved characters (alpha | digit
     | safe | extra). There is a comma in your initial list of safe characters
     in the OP, but it’s ambiguous whether it’s in the list or just punctuation.
     The comma does not seem to belong in the Reserved list.

     
 9.  Tim 2013/04/16 7:52 am • Reply
     
Very interesting. I recently developed a search interface that included
     arrays of checkboxes and used the get method so that searches could be
     saved. I wonder if my technique is using unsafe characters or if this is
     acceptable:
     
     the options[] field is added as a url parameter like this:
     
     ?options%5B%5D=large&options%5B%5D=medium
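[Editor’s note: Tim’s technique is the standard one — “[” and “]” are percent-encoded as %5B and %5D, and query-string parsers recover the repeated key on the other side. A sketch with Python’s `urllib`, reusing Tim’s field name:]

```python
from urllib.parse import quote, parse_qs

key = quote("options[]", safe="")   # "[" -> %5B, "]" -> %5D
query = "{0}=large&{0}=medium".format(key)
print(query)  # options%5B%5D=large&options%5B%5D=medium

# Decoding groups the repeated key back into a list of values:
assert parse_qs(query) == {"options[]": ["large", "medium"]}
```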

     
 10. G. C. 2022/05/31 11:39 am • Reply
     
     Just so you are aware:
     
Classifying sub-delims as “unsafe” is inaccurate.

Sub-delimiters are a class aimed at frameworks and projects that
implement URL-parsing algorithms. Specifically, from RFC 3986:
     
     A subset of the reserved characters (gen-delims) is used as delimiters of
     the generic URI components described in Section 3. A component’s ABNF
     syntax rule will not use the reserved or gen-delims rule names directly;
instead, each syntax rule lists the characters allowed within that component
     (i.e., not delimiting it), and any of those characters that are also in the
     reserved set are “reserved” for use as subcomponent delimiters within the
     component.
     
     Only the most common subcomponents are defined by this specification; other
     subcomponents may be defined by a URI scheme’s specification, or by the
     implementation-specific syntax of a URI’s dereferencing algorithm, provided
     that such subcomponents are delimited by characters in the reserved set
     allowed within that component.
     
     URI producing applications should percent-encode data octets that
     correspond to characters in the reserved set unless these characters are
     specifically allowed by the URI scheme to represent data in that component.
     If a reserved character is found in a URI component and no delimiting role
     is known for that character, then it must be interpreted as representing
     the data octet corresponding to that character’s encoding in US-ASCII.
     
Simply put, they are safe for their intended use as delimiters, and
should otherwise be percent-encoded.
     
     For framework and library devs like me the distinction is kind of
     important.
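[Editor’s note: G.C.’s distinction maps directly onto the `safe` parameter of Python’s percent-encoding helper — a sub-delim passes through only when it is acting as a delimiter; as data it gets encoded. The strings below are illustrative.]

```python
from urllib.parse import quote

# "&" as data inside a path segment: percent-encode it.
print(quote("rock&roll", safe=""))    # rock%26roll

# ";" and "," used as subcomponent delimiters the scheme defines: pass through.
print(quote("a,b;c", safe=",;"))      # a,b;c
```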

     
 11. Joe 2022/10/09 12:22 am • Reply
     
First: since you mention it, you should understand that “unsafe
characters” no longer exists as a character class in the current
standard. However, you then go on to say it is still useful — no, that
is incorrect. It is now outdated information.
     
     RFC 1738 is not just obsolete, it is deprecated.
     
     When you echo 1738’s verbiage about gen-delims and sub-delims, you are
     repeating the very misconception which caused the standard to be revised.
     
     {, }, |, \, ^, ~, [, ]
     These are not unsafe characters.
     
Example: it is safe to use [ ] { } in URLs.
It is non-standard to use them in resource names.
If you are writing a URL dereferencing algorithm, or using such an
algorithm, these characters are reserved for you:
     
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"

sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
           / "*" / "+" / "," / ";" / "="
     
In the host and query portions the rules are stricter; in the path
component, however, these characters may appear unencoded:

[]@:;,=()!*+
     
Each of these characters is reserved for URL systems to create
semantics with, so that you know no resource name conflicts with your
DSL syntax.

     


© 2004–2022 Perishable Press