Gigabyte: Decimal vs. Binary

Jeff-

Back in the day of FAT16 (Windows 95), “large” (4 GB) hard drives suffered from inefficiency in cluster size. For example, I believe under FAT16 the smallest cluster size was 32K, so if you had a 1K file, it would waste 31K. FAT32 improved on this; space was still lost, but not as much, since clusters smaller than 32K could be defined.
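The slack arithmetic, sketched in Python (the cluster sizes here are assumptions for illustration; the real FAT16 value depended on partition size):

```python
# Rough sketch of cluster "slack": a file occupies whole clusters,
# so the tail of its last cluster is wasted.
import math

def wasted_bytes(file_size, cluster_size=32 * 1024):
    clusters = max(1, math.ceil(file_size / cluster_size))
    return clusters * cluster_size - file_size

print(wasted_bytes(1024))        # 31744 bytes (~31K) lost for a 1K file
print(wasted_bytes(1024, 4096))  # 3072 bytes with smaller 4K clusters
```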

So, a 1-terabyte drive is for “marketing” purposes by the hard drive manufacturer. You’ll never physically store 1 terabyte.

I’m sure someone can explain the gory details on this better than I can.

Jon Raynor

Gigabytes weren’t an SI unit until the IEEE decided to make them one. It’s rather insulting, actually - it reminds me of the gritty cop shows where the FBI steps into a police investigation and says “You guys go home, let the experts handle this”. Except that the FBI actually has that authority, whereas the IEEE just wishes it did.

The meaning of SI prefixes only applies to SI (metric) units, which bytes aren’t. A byte is already 8 bits, and it isn’t divisible into centibytes or millibytes, so it doesn’t even make sense as a metric unit. The composite units were so named because they approximated metric units, not because they were equivalent. That’s not “wrong”; it happens in every industry, it’s just that the nerds in other industries don’t kick and scream anal-retentively about it.

For those claiming that the inconsistency was always there because the network industry used kbps, think again. One of the reasons they used kiloBITS per second was to disambiguate it from storage units. The term was adapted from baud - slightly different meaning, but essentially equivalent by the time of 14400 baud modems, when baud was becoming an awkward measurement anyway. There was a legitimate need to compare bandwidth with storage (the Internet), but it also did not make sense to use powers of 2 because bandwidth was actually provided in powers of 10 (bits). There was no foul play here, just pragmatism.

Memory capacity, on the other hand, is always 2^n bytes. Hard drives are generally multiples of 512, too; when 500 gigabytes is used to mean 500 * 10^9 bytes, it is actually an approximation. The real number might be something like 499,289,948,160 bytes, though it could be more or less depending on the geometry. 500 GB is never quite accurate using ANY convention.
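To put numbers on it (the byte count is the example figure above; real drives vary):

```python
# The quoted capacity is a whole number of 512-byte sectors,
# not a round number in either base 10 or base 2.
capacity = 499_289_948_160        # example byte count from above
print(capacity % 512 == 0)        # True: an exact number of sectors
print(capacity / 10**9)           # ~499.29 decimal GB, not 500
print(capacity / 2**30)           # 465.0 GiB, not 500 either
```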

I think it’s obvious that the units for memory and disk should be the same, since data is constantly being swapped from one to the other. So let’s put the question about why the rules for memory should apply to hard drives to rest.

Of course I know what the proposed solution is. Just have everyone switch to the dorky “bi” prefixes! That’s nice, except that every part of the industry EXCEPT for the hard drive manufacturers has been using the same convention for 50 years. You don’t just stomp your foot, shake your fist and tell us to mend our evil non-standard ways. Standards should reflect conventions that are already widely used, not fight them. Frankly, I’d rather deal with the hard drive capacity gap than deal with the silly new SI units invented by academic suits with hardly any practical experience.

Whenever someone inured to the inconsistent KB/MB/GB definitions used in some computing contexts first hears of the kibibyte/mebibyte/gibibyte (KiB/MiB/GiB) construction, they think it silly. I did, too.

But after a few years of being bitten by related problems, having to explain and argue about the exceptions, and gaining familiarity with the new words and abbreviations, it looks better.

The use of powers-of-2 internally by computers is an implementation detail that only insiders need to optimize for, in their minds and communications. For everyone else, base-10 works better. There’s no reason for average users to understand or even see KiB, MiB, GiB names/numbers, in disk sizes, file sizes, bandwidths, clock speeds, etc. Everything can and should be in base-10, shift-units-at-a-glance SI. And the proportion of average users to insiders keeps growing. SI will win.

As for Jeff’s question about ever needing to use ‘petabytes’: many workplaces are now dealing with petabytes of data. We have a few petabytes of spinning disks at the Internet Archive; I know commercial and big-science entities have far more.

And, regarding being “glad I won’t have to deal with saying” zetta and yotta, why so pessimistic about the progress of technology and/or your own lifespan?

Sebastian: Err, yes. You’re right - 1440, not 1044 KB. Sorry about that!

Sean: There are very good reasons why RAM is going to come in powers of two - it would be quite a lot of effort to allow for 1000 megabytes of RAM on one DIMM and 1000 on another, compared to 1024 on each. (RAM is addressed by a computer over an address bus. Each line of that address bus is a bit of the address; allocating addresses to RAM DIMMs thus naturally falls on the boundary of an address bus line, which translates to a power of two in the address space. That’s why 1 GB of RAM is always going to be 2^30: 10^9 is not going to divide easily on an address bus. Doing divisions by 1000 would add an extra cycle or two to every RAM access, plus some extra chips!) Hard drives aren’t addressed this way, so they can be sizes that aren’t otherwise ‘nice’ for computers.
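Roughly, in Python (a sketch of the bit-slicing, not how any particular memory controller works):

```python
# n address lines select among 2**n locations; no n gives exactly 1000.
for lines in (10, 20, 30):
    print(f"{lines} address lines -> {2**lines:,} locations")

# Decoding is just slicing bits, which is essentially free in hardware.
addr = 0b1101_0110_1010
module = addr >> 10        # high bits pick the module/row
offset = addr & 0x3FF      # low 10 bits pick a byte within 1024
print(module, offset)      # 3 362
```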

Will: Modems always were weird, mainly because they (usually) used 10-bit bytes. Yeah, I was aware of some confusion with communications people, but I got the impression they hardly dealt with bytes anyway.

(Meanwhile, we’re skipping the octet vs. byte debate? 8 bits wasn’t always standard, you know! :slight_smile: )

‘Aaron G’ writes: “A byte is already 8 bits, and it isn’t divisible into centibytes or millibytes, so it doesn’t even make sense as a metric unit.”

In information theoretic contexts, even bits can be fractional. And when describing very slow links, it could be meaningful to speak of such exotic and peculiar things as centibytes or centibits per second.

Contrived and weird, yes, but not totally nonsensical.
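For example, Shannon entropy comes out in fractional bits (standard information-theory formula, sketched in Python):

```python
import math

def entropy_bits(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits per symbol."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(entropy_bits(0.5))  # 1.0 bit for a fair coin
print(entropy_bits(0.9))  # ~0.47 bits: a biased coin carries a fractional bit
```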

I see no reason why there can’t be an option of using both kilobytes (1000 bytes) and kibibytes (1024 bytes), like the labels here that say something like “1 Gal. (3.8 L).” Slowly people will begin to understand the relation between the two, the same way many people learn that a yard is almost a meter.

As for sounding ridiculous, that’s just ridiculous. They may sound funny, but so does the mole (mol) and the joule (J). In fact, my chemistry teacher in high school had us make a mole (the animal) for a grade!

Once again, the drives could use both the metric standard and the binary standard, as in “500 GB (465 GiB),” allowing consumers to see the difference and keeping them happier with the manufacturers, because they would know both possible measures instead of feeling ripped off.
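A sketch of how such a dual label could be computed (the function name here is made up for illustration):

```python
import math

# Hypothetical helper for a label like "500 GB (465 GiB)".
def dual_label(byte_count):
    gb = byte_count / 10**9
    gib = math.floor(byte_count / 2**30)   # truncate, as drive labels tend to
    return f"{gb:.0f} GB ({gib} GiB)"

print(dual_label(500 * 10**9))  # 500 GB (465 GiB)
```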

In the programming sense, the standard 10^x is a rather annoying convention because of the nature of bits - 0 or 1. If someone were to come up with a 10-state bit (perhaps with quantum computers), then I could see the warrant for using the standard metric definitions, but until then, no thanks. This difference in systems - base 2 instead of base 10 - led to the rise of other counting systems, such as octal and hexadecimal (hex). Personally, I like to count memory and the like in hex, using the binary units. In hex, the “strange” numbers tacked onto the end disappear. For example:

1024 bytes = 0x00000400 bytes = 1 KiB
1024 KiB   = 0x00100000 bytes = 1 MiB
1024 MiB   = 0x40000000 bytes = 1 GiB

It also comes in handy to use the KiB notation in small systems, where you need to know exactly how much memory you have left and if it’s enough for a 4KiB image.

Oh yes, the reason we use binary measurements is because computers use binary! Addressing for both RAM and hard disk is done using the binary/hex system. That being the case, it makes sense to me to use the binary versions of the prefixes, but that would confuse people. So once again, I think listing both notations on the package makes plenty of sense.

Not to mention, if a byte were a standard SI unit, it would be made of 10 bits. Then you could have a real decibyte. But naturally, if there were such a change, all the software out there right now would wind up being pretty useless because it isn’t built for 10-bit architectures (although that could fairly quickly be remedied).

In the end, I think placing both labels on products will help get people used to the relation between a GB and a GiB. I have started to be able to tell the approximate size of large files from one system to another, similar to converting between yards and meters.

Anyhow, that’s what I think.

I was completely unaware of these rules. This was SO helpful- cleared up a lot in my head! Thanks for this!

Honestly, I think that it’d be ultimately better to use a convention like this:

16 bytes = 1 da16B = 0x10 bytes
16^2 bytes = 1 h16B = 0x100 bytes
16^3 bytes = 1 k16B = 0x1000 bytes
etc.

and the number before the unit written in hexadecimal.

The manufacturer could provide this for anyone who’d find it useful then show a decimal conversion in scientific notation (x.xx * 10^x) for general consumers who really only care about scale.
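A toy formatter for the idea, in Python (the prefixes are the proposal above, not any standard, and the helper name is made up):

```python
# Hypothetical formatter for hex-based units; truncates remainders,
# so it's only a sketch of the labeling scheme.
PREFIXES = ["", "da16", "h16", "k16"]   # 16^0, 16^1, 16^2, 16^3

def hex_units(n):
    power = 0
    while n >= 16 and power < len(PREFIXES) - 1:
        n //= 16
        power += 1
    return f"0x{n:X} {PREFIXES[power]}B"

print(hex_units(16))    # 0x1 da16B
print(hex_units(4096))  # 0x1 k16B
```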

The binary prefix symbols are great. The solution to not sounding like you have a speech impediment is to pronounce them kili, megi, teri (KIL-ee MEG-ee, TER-ee) etc.

I can’t believe in all of these replies how nobody figured out the real problem - the byte. A byte is NOT the fundamental unit of memory - the bit is. A byte is 8 bits, which first of all violates the base ten system… so why are we using it as a base unit? The base 10 system can only be used on bits, and only upwards since there aren’t “centibits” as the bit cannot be broken down any more. Basically, a kilobyte in base 10 is really a kilo-octabit which makes zero sense.

Really we shouldn’t use the metric system at all since memory is not metric. It’s like saying kiloinch or megayard (which sounds awesome).

I believe the byte was used as a measurement because it took 8 bits to define a character. That’s where the byte came from.

The words kilo, mega, etc. were just “borrowed” for their meanings of one thousand and one million respectively… the same way as for kilohertz and megahertz in radio, which mixes wavelength in metres with frequency in hertz (Hz, one cycle per second). Radio bands are expressed as metre bands, but individual frequencies are expressed in Hz (VLF), kHz, MHz and GHz (gigahertz) - even the UHF (ultra-high frequency) bands and higher. Microwaves still use a mix of both worlds - bands listed in cm (centimetres) and mm (millimetres), but frequencies still listed in GHz. The extremely high bands are given letters (K-band, L-band, X-band, etc.). A really mixed-up world we live in.

Even though a byte is 8 bits, a KB is still 1,000 bytes (8,000 bits). Cellphone marketing uses this deceptively when advertising “speed”. Divide the advertised speed by 8 and you get the true speed in KB/s or MB/s. Oh, and Kb is kilobits, KB is kilobytes, Mb is megabits and MB is megabytes. Most often they accurately show their speed in Mb (small b), but I have seen some - including cable companies providing Internet access - using MB… which is a deception.
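The conversion is just a factor of eight:

```python
advertised_mbps = 100         # "100 Mb/s" in the marketing copy
print(advertised_mbps / 8)    # 12.5 MB/s is the real ceiling
```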

A little trivia: In the early days of radio, only Russia (USSR) actually expressed radio frequencies in metres. The rest of the world was using Cycles (cycles/second), KiloCycles and MegaCycles and eventually that became Hertz, Kilohertz and Megahertz. Eventually they came around and joined the crowd. I think that was back in the 1960s (when I first took up shortwave listening). Radio books would list the shortwave frequencies for different countries and the USSR was the only one who listed theirs in wavelengths (metres followed by decimals of them). It was probably a matter of the radios on the market at the time. They were all marked in kilocycles. The USSR wanted to sell their radios to the rest of the world so they pretty much had no choice than to make the switch. :slightly_smiling_face: Even after the change, many communist nations were still announcing their programs would be at such and such time on the x-metre band. You actually had to obtain a schedule in order to find out the actual frequencies they’d be transmitting on.

Computers are an odd beast - using binary, octal and decimal values at the same time.

“It’s us computer science types who are abusing the official prefix designations: …”

I respectfully disagree.

The operative words in the discussion are ‘bit’ and ‘byte’, not the SI prefixes.

The crux of the problem is binary. A bit has 2 values: 0 or 1. In other words, the use of bits establishes a base 2 (binary) number system. And, binary does not ‘cleanly’ translate to base 10.

A ‘standard’ byte uses 8 bits. Since the byte is composed of bits, it can represent 2^8, or 256 values (0 to 255 in decimal).

If 10 memory address lines are used in hardware, that’s 10 bits, or 2^10 address locations or 1024 locations (0-1023). To limit address space to 1000, or 0-999, one would have to use a hardware mask to disable the last 24 address locations, which would be both inefficient and silly. To describe it as 1000 + 24 is similarly clumsy.

So, how does one precisely describe what is going on in computer memory? By defining a ‘kilobyte’ as 1024 bytes. This accounts for all the bytes addressable in binary and provides a relatively simple framework to discuss binary without discussing it in binary.

To illustrate the disconnect: binary for 1000 is 1111101000 and decimal for the binary number 1000000000 is 512. The idea is to create equivalence using exponents in the base number system (2^N, 10^M, etc), not force one number system onto another.

Further, if memory is divided into sections or ‘pages’, the number of bytes in each page will also be counted in binary, as the addressing is based on the number of address bits used. Again, if 10 bits are used, 1024 bytes will be on each page and 1024 bytes will be exposed with each page swap. The 1024 definition facilitates discussions of this reality.
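The arithmetic from the last few paragraphs, spelled out in Python:

```python
address_bits = 10
print(2 ** address_bits)      # 1024 locations: the natural page size

print(bin(1000))              # 0b1111101000: decimal 1000 isn't round in binary
print(int("1000000000", 2))   # 512: a round binary number isn't round in decimal
```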

As with most things, the definition is important. ‘Kilobyte’ was defined as 1024 bytes because it accurately describes what is going on in a binary system and simplifies the discussion. Based on the comments to this blog post, this viewpoint has clearly been lost in time.

In the related topic of computer networks, where bits are described, there is no confusion because it is counting bits, not using bits to count bytes. So, saying 1000 bits/second or 1kbps accurately represents the data traffic. Introducing 1024 would not be helpful in this case because binary is not used to address or track data traffic.

Using 1000 for ‘kilobyte’ dissociates the unit from what is going on in the hardware and is imprecise. Sure, if you want to please an uneducated consumer, you can play the hard drive manufacturers’ game.

As for hard drive manufacturers, the same people who used Klingon years to inflate their MTBF numbers, I wouldn’t use their approach to justify any part of the discussion.

Marketing had a lot to do with that. Manufacturers began using kb/s, mb/s & gb/s instead of KB/s, MB/s & GB/s to make the customer believe their equipment is super fast. I tell my son to divide his speed by 8 (8 bits = 1 byte) and that’s his actual speed in KB, MB or GB. I told him if he sees lowercase letters, it’s in bits. Manufacturers are blowing smoke in the customers’ eyes… and they’ve gotten used to it. :roll_eyes:
