Base64 Encoding
Every language can be represented using a set of symbols. For example, English can be represented by the 26 letters of its alphabet along with some punctuation marks and the numerical digits 0 to 9.
If a language is made up of N symbols, then each symbol can be represented using $\lceil \log_2 N \rceil$ bits.
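
The bit count above can be checked with a few lines of Python. This is only a sketch: the function name `bits_per_symbol` is made up for illustration, and it uses integer arithmetic instead of a floating-point logarithm to avoid rounding problems.

```python
def bits_per_symbol(n_symbols: int) -> int:
    """Smallest number of bits that can distinguish n_symbols symbols."""
    if n_symbols < 1:
        raise ValueError("need at least one symbol")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)).
    return (n_symbols - 1).bit_length()


print(bits_per_symbol(26))   # 5 bits for the 26 English letters
print(bits_per_symbol(64))   # 6 bits for a 64-symbol alphabet
print(bits_per_symbol(128))  # 7 bits for the 128 ASCII characters
```

The 64-symbol case is the one that matters for Base64: 64 symbols need exactly 6 bits each.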
Since each language is made up of a different number of symbols, the number of bits needed for each language is different. For example, ASCII, which is commonly used to represent English, defines 128 characters and so needs 7 bits per character, although each character is usually stored in an 8-bit byte. An encoding scheme used to represent Chinese may require a substantially larger number of bits, since Chinese characters number in the tens of thousands, though most of them are only minor graphic variants encountered in historical texts.
Also, there may be non-textual data such as an image file, where some bits make up textual header information while the rest represent the pixels of the image.
The point being made here is that different encoding schemes exist, and each may require a different number of bits.
To transmit such data over a medium that supports only textual data, there should exist a common encoding scheme whose output is purely textual.
Base64 encoding provides a way of doing this by converting each group of 6 bits into a specific character. The character set chosen for the output varies between implementations of Base64, but they usually adhere to the convention that the output characters should be printable and should belong to a subset common to most character encodings.
With the above two guidelines, most Base64 implementations choose A-Z, a-z and 0-9 as the first 62 characters. The difference is mostly in the choice of the remaining 2 characters.
Usually the last 2 characters are chosen from the following: +, -, /, ., :, ! etc.
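
As a rough illustration of the 6-bit grouping and of how the last 2 characters differ between variants, here is a minimal Python sketch. The helper `encode_triplet` and its `last_two` parameter are hypothetical names introduced here. It only handles an input of exactly 3 bytes, and it defaults to "+/" as the last two characters, which is the choice Python's standard `base64.b64encode` uses; "-_" is what `base64.urlsafe_b64encode` uses.

```python
import base64
import string

# The 62 characters shared by most variants: A-Z, a-z, 0-9.
COMMON_62 = string.ascii_uppercase + string.ascii_lowercase + string.digits


def encode_triplet(chunk: bytes, last_two: str = "+/") -> str:
    """Encode exactly 3 bytes (24 bits) as 4 output characters."""
    if len(chunk) != 3:
        raise ValueError("expected exactly 3 bytes")
    if len(last_two) != 2:
        raise ValueError("expected exactly 2 extra characters")
    alphabet = COMMON_62 + last_two
    value = int.from_bytes(chunk, "big")  # the 24 input bits as one integer
    # Take 6 bits at a time, most significant group first.
    indices = [(value >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(alphabet[i] for i in indices)


print(encode_triplet(b"Man"))                 # TWFu
print(encode_triplet(b"\xfb\xff\xff"))        # +///
print(encode_triplet(b"\xfb\xff\xff", "-_"))  # -___
```

The second and third calls encode the same bytes; only the characters assigned to values 62 and 63 change.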
Note 1: Since Base64 encodes every 6 bits of input as one 8-bit character, its output is always larger than its input (about 4/3 times the size of the input, plus any padding). The only advantage gained from this inefficiency is that the output is textual and thus acceptable to most software systems.
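
The 4/3 growth can be seen directly with Python's standard `base64` module. The helper `encoded_length` below is a hypothetical name used only for this sketch, and it assumes padded output.

```python
import base64


def encoded_length(n_bytes: int) -> int:
    """Length of padded Base64 output for n_bytes of input."""
    # Every started group of 3 input bytes yields 4 output characters.
    return 4 * ((n_bytes + 2) // 3)


for n in (3, 10, 300):
    actual = len(base64.b64encode(bytes(n)))
    print(n, actual, encoded_length(n))
# 3 4 4
# 10 16 16
# 300 400 400
```

For inputs that are not a multiple of 3 bytes long (such as 10 above), the ratio is slightly above 4/3 because of the padding.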
Note 2: Other binary-to-text encoding schemes exist too, with varying degrees of efficiency. For example, the familiar hexadecimal system can also be considered a binary-to-text encoding, but it is not widely used for this purpose because it is much less efficient than Base64: it converts every 4 bits of input into an 8-bit output symbol, so its output is double the size of the input.
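
A quick comparison of the two schemes, using only Python's standard library. The 30-byte sample input is arbitrary and chosen just for illustration.

```python
import base64
import binascii

data = bytes(range(30))              # 30 arbitrary input bytes

hex_text = binascii.hexlify(data)    # 4 input bits per output character
b64_text = base64.b64encode(data)    # 6 input bits per output character

print(len(data), len(hex_text), len(b64_text))  # 30 60 40
```

Hexadecimal doubles the size, while Base64 grows it by a third.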
Padding in Base64
Base64 converts its input in groups of 6 bits.
So for text stored as 8-bit characters, such as ASCII English, it converts every 3 input characters (24 bits) into 4 output characters. If the number of characters in the input is an exact multiple of 3, then the number of characters in the output is an exact multiple of 4. When this is not so, 2 cases arise:
1. The last group in the input has 1 character.
2. The last group in the input has 2 characters.
For #1, there are 8 input bits. The first 6 bits are processed as usual, and the remaining 2 bits are padded with 4 0s to their right to form a second group of 6 bits. This gives 2 output characters. To indicate that the input did not actually end in those 4 0s, the output is padded with 2 ‘=’ characters.
Similarly, for #2, there are 16 input bits, of which the first 2 groups of 6 bits are processed as normal. The remaining 4 bits are padded with 2 0s to their right to form a third group of 6 bits. This gives 3 output characters. To indicate that the input did not actually end in those 2 0s, the output is padded with 1 ‘=’ character.
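
The two padding cases can be sketched in Python as follows. This is an illustrative encoder, not a reference implementation: the name `b64encode_sketch` is hypothetical, and the alphabet is assumed to end in "+/", as in Python's standard `base64` module. Appending zero bytes to a short final group has the same effect as padding its leftover bits with 0s.

```python
import base64
import string

# Assumed alphabet: A-Z, a-z, 0-9, then '+' and '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def b64encode_sketch(data: bytes) -> str:
    """Encode bytes as padded Base64 text, 3 input bytes at a time."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        missing = 3 - len(chunk)  # 0, 1 or 2 bytes short in the last group
        # Fill the group up to 24 bits with zero bits on the right.
        value = int.from_bytes(chunk + b"\x00" * missing, "big")
        chars = [ALPHABET[(value >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # Keep only the characters that carry input bits, then pad with '='.
        out.append("".join(chars[:4 - missing]) + "=" * missing)
    return "".join(out)


print(b64encode_sketch(b"M"))    # TQ==  (case #1: 1 character in last group)
print(b64encode_sketch(b"Ma"))   # TWE=  (case #2: 2 characters in last group)
print(b64encode_sketch(b"Man"))  # TWFu  (no padding needed)
```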
Thus by padding with 1 or 2 ‘=’ characters, the length of the output is always a multiple of 4 characters.
Strictly speaking, the padding ‘=’ characters are not needed, since the number of input characters can always be found by calculating $\lfloor 3N_o/4 \rfloor$, where $N_o$ is the number of output characters excluding padding. Still, some implementations mandate padding.
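
Python's standard `base64.b64decode` is one implementation that expects padding, so unpadded text has to have its ‘=’ characters restored before decoding. The sketch below shows one way to do that. The helper names `decoded_length` and `decode_unpadded` are hypothetical and introduced only for illustration.

```python
import base64


def decoded_length(n_chars: int) -> int:
    """Number of input bytes recovered from n_chars unpadded characters."""
    return (3 * n_chars) // 4


def decode_unpadded(text: str) -> bytes:
    """Decode Base64 text whose trailing '=' characters were stripped."""
    if len(text) % 4 == 1:
        # A single leftover character carries only 6 bits, less than a byte.
        raise ValueError("not a valid unpadded Base64 length")
    # Restore the '=' characters needed to reach a multiple of 4.
    return base64.b64decode(text + "=" * (-len(text) % 4))


for text in ("TQ", "TWE", "TWFu"):
    print(text, decode_unpadded(text), decoded_length(len(text)))
# TQ b'M' 1
# TWE b'Ma' 2
# TWFu b'Man' 3
```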
