Equipping our ASCII Armor

On one of our e-commerce web sites, we needed a unique transaction ID to pass to a third party reporting tool on the checkout pages. We already had a GUID on the page for internal use. And you know how much we love GUIDs!


This is a companion discussion topic for the original blog entry at: http://www.codinghorror.com/blog/2005/10/equipping-our-ascii-armor.html

Nice work! I never even HEARD of ASCII85. Learn something new every day.

"Base64 ought to be enough for anybody."

"ASCII values 0-255", eh? Repeat after me: ASCII is a 7-bit encoding. ASCII is a 7-bit encoding.

Indeed, the ASCII spec only defines values 0-127, and of those only 32-126 are printable characters. These 95 values are the only ones usable as text. Anything above 127 isn't ASCII.

However, given 20 printable ASCII characters, that is still about 131.397 bits of information. So you've still got over 3 bits to spare after your GUID! Just enough space to store a fairly small number.
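For anyone who wants to check that figure, here is a minimal sketch. It assumes "printable ASCII" means codes 32 through 126 inclusive, and the variable names are just for illustration.

```python
import math

# Printable ASCII, assumed here to be codes 32 (space) through 126 (~).
printable = [chr(code) for code in range(32, 127)]

# Each character drawn from an alphabet of N symbols carries log2(N) bits.
bits_per_char = math.log2(len(printable))

# Capacity of a 20-character string, and what is left after a 128-bit GUID.
capacity_bits = 20 * bits_per_char
spare_bits = capacity_bits - 128

print(len(printable), round(capacity_bits, 3), round(spare_bits, 3))
```

The spare capacity is a little over 3 bits, so the "fairly small number" would have to be below about 10.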

Ian, thanks for the clarification as always. Just when I thought I was a computer "scientist"...

Jon, I'm waiting for the inevitable BASE95 or ASCII95. There must be some good reason that Adobe chose to use just 85 of the 95 possible printable ASCII characters, but I can't think of what that could be right now...

Seems to me that the need for certain special case characters would preclude them from using the full 95 characters. I know from reading the wiki entry you linked that at least >, <, and z are generally off limits.

Am I off base?

P.S. This is my first comment here. I love your blog, though - truly a pleasure to read. It has been a staple of my Google homepage for quite some time :)

Base64 may be less efficient in terms of space, but it's far more efficient in terms of speed. Division is a slow operation (it becomes noticeable when done frequently). Obviously, when transmitting data over the network, the space saving could be relevant... until you remember that if you have such a restriction, it means either you're using XML (which is both slow and verbose) or you're dealing with an old, old legacy system (which is slow, and you have no say over the verbosity).

I work with handheld development... and both space and speed are frequent issues. However, in such a situation, I'd stick to base64 or maybe even use hex encoding.
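To make the shift-versus-division point concrete, here is a rough sketch of encoding one group in each scheme. The helper names `b64_group` and `a85_group` are made up for this example, and it skips padding and Ascii85's "z" shortcut for all-zero groups. It says nothing about real-world timings; it only shows which operations each scheme needs.

```python
import base64

B64_ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_group(three_bytes):
    """Encode exactly 3 bytes as 4 base64 chars using only shifts and masks."""
    n = int.from_bytes(three_bytes, "big")
    return "".join(B64_ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

def a85_group(four_bytes):
    """Encode exactly 4 bytes as 5 Ascii85 chars; needs repeated division by 85."""
    n = int.from_bytes(four_bytes, "big")
    digits = []
    for _ in range(5):
        n, remainder = divmod(n, 85)
        digits.append(chr(33 + remainder))  # Ascii85 digits start at '!'
    return "".join(reversed(digits))

# Cross-check both helpers against the standard library encoders.
print(b64_group(b"Man"), base64.b64encode(b"Man").decode())
print(a85_group(b"Man "), base64.a85encode(b"Man ").decode())
```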

A very simple solution to this problem is to use a base 32 rather than base 16 representation, which cuts the length down from 32 characters to 26. Hexadecimal encoding uses chars [0-9A-F], an alphabet of size 16, to represent 4-bit runs of a binary stream.

You can represent 5-bit runs of the same binary stream using a 32-character alphabet, for example 32 of the chars in [A-Za-z]. (Halving the ASCII representation to 16 characters would take 8 bits per character, which no printable alphabet can provide.) It would be fairly straightforward to implement such an encoding: use the same logic you would to convert to hexadecimal, but substitute the larger alphabet and cut the stream into 5-bit rather than 4-bit chunks.
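As a sanity check on the lengths being discussed, here is a small sketch that counts how many characters a 128-bit GUID needs for a given alphabet size. The function name `guid_length` is hypothetical, and it treats the GUID as one big integer, ignoring padding and delimiters.

```python
def guid_length(alphabet_size, bits=128):
    """Smallest number of characters whose combinations cover every `bits`-bit value."""
    length = 0
    capacity = 1
    target = 2 ** bits
    while capacity < target:
        capacity *= alphabet_size
        length += 1
    return length

# Alphabet sizes: hex, base 32, base 64, Ascii85, and a hypothetical 256-symbol one.
for size in (16, 32, 64, 85, 256):
    print(size, guid_length(size))
```

By this count, hex needs 32 characters, base 32 needs 26, base 64 needs 22, and 85 symbols need 20. Getting down to 16 characters would require a full 256-symbol alphabet.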

Nice touch updating the Wikipedia entry for ascii85. Of course, this would all be easier if we'd all use base 85, but no one will listen to me.

The reason there's not an ascii95 is because 85^5 (5 bytes of encoded string) is only a little bigger than 256^4 (4 bytes of unencoded string), which makes it a particularly effective encoding for blocks of such a small size. In fact, you have to get up to 21 bytes before ascii95 would be an improvement on ascii85 (21 unencoded bytes -> 26 ascii95, but 27 ascii85). By that point, the numbers are rather too large to deal with reasonably - we're barely at 64-bit computers, let alone 168 bits!

Since ascii95 wouldn't really be a feasible improvement on ascii85, we may as well just use 85 and then have those 10 characters for whatever we want, like 'z' and '~'.
(Plus we don't want to use space, so it would really only be ascii94. That doesn't change the math, though.)
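The block-size comparison can be checked with exact integer arithmetic. This is only a sketch: `chars_needed` and `first_saving` are names invented here, and each block is treated as a single big number with no padding or special-case characters.

```python
def chars_needed(n_bytes, base):
    """Smallest k such that base**k >= 256**n_bytes."""
    k = 0
    capacity = 1
    target = 256 ** n_bytes
    while capacity < target:
        capacity *= base
        k += 1
    return k

def first_saving(max_bytes=32):
    """First block size (in bytes) where 95 symbols need fewer chars than 85."""
    for n in range(1, max_bytes + 1):
        if chars_needed(n, 95) < chars_needed(n, 85):
            return n
    return None

print(85 ** 5, 256 ** 4)                            # 85^5 just clears 256^4
print(chars_needed(21, 95), chars_needed(21, 85))   # the 21-byte case
print(first_saving())
```

By this simple count, 21 bytes does come out as 26 versus 27 characters, and the first one-character saving shows up at 9 bytes (11 versus 12). Either block is wider than a 64-bit integer, so the practical point about awkward arithmetic still stands.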

I'm not sure what that Base85 is based off of, but it happens that there are 85 chars that can be used as un-escaped content in XML.

Because with base85 five ASCII characters represent exactly four bytes of binary data (32 bits).
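For a quick illustration of that 4-to-5 ratio on a GUID, here is a sketch using Python's standard library, which ships an Ascii85 encoder as `base64.a85encode`. The GUID value is an arbitrary sample, not one from the original post.

```python
import base64
import uuid

# Arbitrary sample GUID; any 16-byte value without all-zero 4-byte groups behaves the same.
guid = uuid.UUID("12345678-1234-5678-1234-567812345678")
raw = guid.bytes                      # 16 bytes = four 4-byte groups

encoded = base64.a85encode(raw)       # each 4-byte group becomes 5 characters
decoded = base64.a85decode(encoded)   # round-trips back to the original bytes

hex_len = len(guid.hex)               # the usual 32-character hex form
print(hex_len, len(encoded), encoded.decode("ascii"))
```

One caveat: Ascii85 writes an all-zero 4-byte group as a single "z", so the output can be shorter than 20 characters for GUIDs that contain such a group.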

Base65536 - this is amazing!