The Burrows–Wheeler transform (BWT, also called block-sorting compression) is an algorithm used in data compression techniques such as bzip2. It was invented by Michael Burrows and David Wheeler in 1994 while working at DEC Systems Research Center in Palo Alto, California.^{[1]} It is based on a previously unpublished transformation discovered by Wheeler in 1983.
When a character string is transformed by the BWT, none of its characters change value. The transformation only permutes the order of the characters. If the original string had several substrings that occurred often, then the transformed string will have several places where a single character is repeated multiple times in a row. This is useful for compression, since a string that has runs of repeated characters tends to be easy to compress with techniques such as the move-to-front transform and run-length encoding.
For example:
Input:  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
The output is easier to compress because it has many repeated characters. In fact, in the transformed string, there are a total of six runs of identical characters: XX, SS, PP, .., II, and III, which together make 13 out of the 44 characters in it.
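The run count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `runs` and its `min_length` parameter are not part of the original text.

```python
from itertools import groupby


def runs(text, min_length=2):
    """Return the maximal runs of identical characters of at least min_length."""
    chunks = ("".join(group) for _, group in groupby(text))
    return [chunk for chunk in chunks if len(chunk) >= min_length]


original = "SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES"
transformed = "TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT"

print(runs(original))     # the input has no repeated neighbours
print(runs(transformed))  # the six runs listed in the text
print(sum(len(run) for run in runs(transformed)), "of", len(transformed))
```

A run-length encoder can replace each such run by a character and a count, which is why the transformed string is the easier of the two to compress.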
The transform is done by sorting all rotations of the text, then taking the last column. For example, the text "^BANANA@" is transformed into "BNN^AA@A" through these steps (the @ character indicates the 'EOF' pointer):
Transformation

Input: ^BANANA@

All Rotations    Sort the Rows
^BANANA@         ANANA@^B
@^BANANA         ANA@^BAN
A@^BANAN         A@^BANAN
NA@^BANA         BANANA@^
ANA@^BAN         NANA@^BA
NANA@^BA         NA@^BANA
ANANA@^B         ^BANANA@
BANANA@^         @^BANANA

Output (last column of the sorted rows): BNN^AA@A
The following pseudocode gives a simple, but inefficient, way to calculate the BWT and its inverse. It assumes that the input string s contains a special character 'EOF' which is the last character, occurs nowhere else in the text, and is ignored during sorting.
function BWT (string s)
    create a table, rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        insert s as a column of table before first column of the table   // first insert creates first column
        sort rows of the table alphabetically
    return (row that ends with the 'EOF' character)
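As a rough illustration, the forward pseudocode can be written in Python so that it reproduces the "^BANANA@" example. The collation order `ORDER` (letters first, then "^", then "@") is an assumption inferred from the sorted rows in the example table, and the function name `bwt_with_order` is hypothetical.

```python
# Assumed collation order, inferred from the sorted rows of the example:
# letters first, then "^", with the EOF marker "@" last.
ORDER = "ABN^@"


def bwt_with_order(s, order=ORDER):
    """Sort all rotations of s under the given order and return the last column."""
    rank = {ch: i for i, ch in enumerate(order)}
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=lambda row: [rank[ch] for ch in row])
    return "".join(row[-1] for row in rotations)


print(bwt_with_order("^BANANA@"))  # BNN^AA@A
```

With a different collation order the sorted rows, and therefore the output, would differ, so encoder and decoder have to agree on the order.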
To understand why this creates more easily compressible data, consider transforming a long English text that frequently contains the word "the". Sorting the rotations of this text will often group rotations starting with "he " together, and the last character of such a rotation (which is also the character before the "he ") will usually be "t". The result of the transform would therefore contain a number of "t" characters, with the perhaps less common exceptions (such as when the text contains "Brahe ") mixed in. The success of this transform thus depends on one value having a high probability of occurring before a given sequence, so in general it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
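The effect can be seen on a toy sentence by listing the characters that precede each occurrence of a context such as "he ". The helper `chars_before` and the sample sentence are made up for illustration only.

```python
def chars_before(text, context):
    """List the character preceding each occurrence of context (cyclically)."""
    found = []
    start = text.find(context)
    while start != -1:
        # text[start - 1] wraps around to the last character when start == 0,
        # matching the cyclic rotations used by the transform.
        found.append(text[start - 1])
        start = text.find(context, start + 1)
    return found


sample = "the cat saw the dog; Brahe saw the sky"
print(chars_before(sample, "he "))  # mostly "t", with one "a" from "Brahe "
```

In the sorted rotation table these characters would sit next to each other in the last column, which is where the runs come from.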
The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it is reversible, allowing the original document to be regenerated from the last column data.
The inverse can be understood this way. Take the final table in the BWT algorithm, and erase all but the last column. Given only this information, you can easily reconstruct the first column. The last column tells you all the characters in the text, so just sort these characters to get the first column. Then, the first and last columns together give you all pairs of successive characters in the document, where pairs are taken cyclically so that the last and first character form a pair. Sorting the list of pairs gives the first and second columns. Continuing in this manner, you can reconstruct the entire list. Then, the row with the "end of file" character at the end is the original text. Reversing the example above is done like this:
Inverse Transformation

Input: BNN^AA@A

Add 1    Sort 1   Add 2     Sort 2
B        A        BA        AN
N        A        NA        AN
N        A        NA        A@
^        B        ^B        BA
A        N        AN        NA
A        N        AN        NA
@        ^        @^        ^B
A        @        A@        @^

Add 3    Sort 3   Add 4     Sort 4
BAN      ANA      BANA      ANAN
NAN      ANA      NANA      ANA@
NA@      A@^      NA@^      A@^B
^BA      BAN      ^BAN      BANA
ANA      NAN      ANAN      NANA
ANA      NA@      ANA@      NA@^
@^B      ^BA      @^BA      ^BAN
A@^      @^B      A@^B      @^BA

Add 5    Sort 5   Add 6     Sort 6
BANAN    ANANA    BANANA    ANANA@
NANA@    ANA@^    NANA@^    ANA@^B
NA@^B    A@^BA    NA@^BA    A@^BAN
^BANA    BANAN    ^BANAN    BANANA
ANANA    NANA@    ANANA@    NANA@^
ANA@^    NA@^B    ANA@^B    NA@^BA
@^BAN    ^BANA    @^BANA    ^BANAN
A@^BA    @^BAN    A@^BAN    @^BANA

Add 7    Sort 7   Add 8     Sort 8
BANANA@  ANANA@^  BANANA@^  ANANA@^B
NANA@^B  ANA@^BA  NANA@^BA  ANA@^BAN
NA@^BAN  A@^BANA  NA@^BANA  A@^BANAN
^BANANA  BANANA@  ^BANANA@  BANANA@^
ANANA@^  NANA@^B  ANANA@^B  NANA@^BA
ANA@^BA  NA@^BAN  ANA@^BAN  NA@^BANA
@^BANAN  ^BANANA  @^BANANA  ^BANANA@
A@^BANA  @^BANAN  A@^BANAN  @^BANANA

Output: ^BANANA@
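The add-and-sort procedure shown in the table can be sketched in Python as follows. As before, the collation order `ORDER` is an assumption inferred from the example, and the name `inverse_bwt_table` is hypothetical.

```python
ORDER = "ABN^@"  # assumed collation order: letters, then "^", then the EOF marker "@"


def inverse_bwt_table(last, order=ORDER, eof="@"):
    """Rebuild the sorted rotation table column by column and return the original row."""
    rank = {ch: i for i, ch in enumerate(order)}

    def key(row):
        return [rank[ch] for ch in row]

    table = [""] * len(last)
    for _ in range(len(last)):
        # "Add": prepend the transformed string as a new first column.
        # "Sort": re-sort the rows under the assumed order.
        table = sorted((ch + row for ch, row in zip(last, table)), key=key)
    # The original text is the row that ends with the EOF marker.
    return next(row for row in table if row.endswith(eof))


print(inverse_bwt_table("BNN^AA@A"))  # ^BANANA@
```

Each loop iteration corresponds to one "Add"/"Sort" pair of the table, so the sketch performs as many sorts as the input has characters.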
A number of optimizations can make these algorithms run more efficiently without changing the output. In BWT, there is no need to represent the table in either the encoder or decoder. In the encoder, each row of the table can be represented by a single pointer into the string, and the sort performed using the indices. Some care must be taken to ensure that the sort does not exhibit bad worst-case behavior: standard library sort functions are unlikely to be appropriate. In the decoder, there is also no need to store the table, and in fact no sort is needed at all. In time proportional to the alphabet size and string length, the decoded string may be generated one character at a time from right to left. A "character" in the algorithm can be a byte, or a bit, or any other convenient size.
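One way to decode without building the table is to count characters and follow, for each row, the link from its last character to the row that starts with that same character. The sketch below assumes the transformed string contains exactly one end-of-file marker and that rows were sorted by plain character code; the name `ibwt_fast` and the `eof` parameter are illustrative choices, and it uses `sorted` on the alphabet for brevity, which is slightly more than strictly linear time.

```python
def ibwt_fast(r, eof="\0"):
    """Invert a BWT string r (containing one eof marker) without building the table."""
    # rank[i]: how many times r[i] has already occurred in r[:i].
    counts = {}
    rank = []
    for ch in r:
        rank.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1

    # first[ch]: index of the first sorted row that starts with ch.
    first = {}
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # The row ending in eof is the original text; step to the row starting with eof.
    i = r.index(eof)
    i = first[eof] + rank[i]

    out = []
    for _ in range(len(r) - 1):
        ch = r[i]
        out.append(ch)            # characters come out from right to left
        i = first[ch] + rank[i]   # jump to the row that starts with this character
    return "".join(reversed(out))


print(ibwt_fast("annb\0aa"))  # banana
```

The string "annb\0aa" is the transform of "banana" with a null end-of-file marker that sorts before every other character.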
There is no need to have an actual 'EOF' character. Instead, a pointer can be used that remembers where in a string the 'EOF' would be if it existed. In this approach, the output of the BWT must include both the transformed string, and the final value of the pointer. That means the BWT does expand its input slightly. The inverse transform then shrinks it back down to the original size: it is given a string and a pointer, and returns just a string.
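A minimal sketch of this pointer-based variant is given below, assuming plain character-code ordering. The function names `bwt_with_index` and `ibwt_with_index` are hypothetical; the forward function also illustrates sorting row indices rather than materialising the table (although the sort key still builds each rotation, so it is not efficient).

```python
def bwt_with_index(s):
    """Return (last column, row index of the original string); no EOF marker is added."""
    n = len(s)
    # Sort rotation start positions instead of storing the rotations themselves.
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = "".join(s[i - 1] for i in order)
    return last, order.index(0)


def ibwt_with_index(last, index):
    """Rebuild the original string from the last column and the saved row index."""
    n = len(last)
    # A stable sort of positions by character links each row of the first
    # column to the row of the last column holding the same character.
    nxt = sorted(range(n), key=lambda i: last[i])
    out = []
    i = index
    for _ in range(n):
        i = nxt[i]
        out.append(last[i])
    return "".join(out)


transformed, pointer = bwt_with_index("banana")
print(transformed, pointer)                   # nnbaaa 3
print(ibwt_with_index(transformed, pointer))  # banana
```

The output is one integer larger than the input, which is the slight expansion mentioned above.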
A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.
Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an 'EOF' marker to the input or augmenting the output with information, such as an index, that makes it possible to identify the input string from the class of all of its rotations.
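The rotation invariance is easy to observe directly. The following sketch (the helper name `last_column` is illustrative) transforms every rotation of a word without any marker and collects the distinct results.

```python
def last_column(s):
    """Last column of the sorted rotation table of s, with no EOF marker."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


word = "banana"
outputs = {last_column(word[i:] + word[:i]) for i in range(len(word))}
print(outputs)  # a single transformed string for all six rotations
```

Because all rotations share one output, the output alone cannot say which rotation was the input.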
There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.
The bijective transform is computed by first factoring the input into Lyndon words, and then sorting together all rotations of these words. The transformed string is then obtained by picking the last character of the strings in this sorted list.
For example, applying the bijective transform gives:
Input:  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT
The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII. In total, 18 characters take part in these runs.
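The construction described above can be sketched as follows. This is an illustration under stated assumptions rather than the published algorithm: the Lyndon factorisation uses Duval's algorithm, rotations are compared as if each were repeated indefinitely (approximated here by comparing prefixes of twice the input length), and the names `lyndon_factors` and `bijective_bwt` are hypothetical.

```python
def lyndon_factors(s):
    """Split s into a non-increasing sequence of Lyndon words (Duval's algorithm)."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def bijective_bwt(s):
    """Sort the rotations of all Lyndon factors together and take their last characters."""
    n = len(s)
    rotations = []
    for word in lyndon_factors(s):
        rotations.extend(word[i:] + word[:i] for i in range(len(word)))

    def key(rotation):
        # Compare rotations as periodic strings; a prefix of length 2n is
        # enough to separate any two distinct periodic strings here.
        repeats = 2 * n // len(rotation) + 1
        return (rotation * repeats)[:2 * n]

    rotations.sort(key=key)
    return "".join(rotation[-1] for rotation in rotations)


print(lyndon_factors("banana"))  # ['b', 'an', 'an', 'a']
print(bijective_bwt("banana"))   # annbaa
```

No marker or index is produced: the output has exactly the same length as the input.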
A complete description of the algorithm can be found in an article due to Gil and Scott.
Instead of reconstructing the Burrows–Wheeler transform of an edited text, Salson et al.^{[2]} propose an algorithm that deduces the new Burrows–Wheeler transform from the original one, doing a limited number of local reorderings in the original Burrows–Wheeler transform.
This Python implementation sacrifices speed for simplicity: the program is short, but takes more than the linear time that would be desired in a practical implementation.
Using the null character as the end of file marker, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows:

def bwt(s):
    """Apply Burrows-Wheeler transform to input string."""
    assert "\0" not in s, "Input string cannot contain null character ('\\0')"
    s = s + "\0"  # Add end of file marker
    table = [s[i:] + s[:i] for i in range(len(s))]  # Table of rotations of string
    table.sort()
    last_column = [row[-1] for row in table]  # Last characters of each row
    return "".join(last_column)  # Convert list of characters into string

The inverse transform repeatedly inserts r as the left column of the table and sorts the table. After the whole table is built, it returns the row that ends with null, minus the null.

def ibwt(r):
    """Apply inverse Burrows-Wheeler transform."""
    table = [""] * len(r)  # Make empty table
    for _ in range(len(r)):
        table = [r[i] + table[i] for i in range(len(r))]  # Add a column of r
        table.sort()
    s = [row for row in table if row.endswith("\0")][0]  # Find the correct row (ending in "\0")
    return s.rstrip("\0")  # Get rid of trailing null character

The advent of high-throughput sequencing (HTS) techniques at the end of the 2000s has led to another application of the Burrows–Wheeler transformation. In HTS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of "reads", each 20 to 100 base pairs ("DNA characters") long. In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long). A number of alignment programs specialized for this task were published, which initially relied on hashing (e.g., Eland, SOAP^{[3]}, or Maq^{[4]}). Here, the enormous memory requirements of the hash table were a problem, and so the next generation of alignment programs, notably Bowtie^{[5]}, BWA^{[6]}, and SOAP2^{[7]}, use the Burrows–Wheeler transform of the reference genome as the basis for very efficiently aligning the reads.
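The core idea behind such BWT-based aligners can be illustrated with exact pattern counting by backward search over the transformed reference. The sketch below is a toy under stated assumptions, not the implementation of any of the named programs: it builds the transform by naive suffix sorting, recounts occurrences on every step instead of using precomputed tables, and handles only exact matches. The name `fm_count` and the `eof` parameter are hypothetical.

```python
def fm_count(text, pattern, eof="\0"):
    """Count exact occurrences of pattern in text via backward search on the BWT."""
    s = text + eof  # eof is assumed to sort before every other character
    order = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix sorting
    last = "".join(s[i - 1] for i in order)             # BWT of the reference

    # first[ch]: index of the first sorted row that starts with ch.
    first = {}
    total = 0
    for ch in sorted(set(s)):
        first[ch] = total
        total += s.count(ch)

    lo, hi = 0, len(s)  # current range of rows whose prefix matches
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # A real index would look these counts up instead of rescanning.
        lo = first[ch] + last[:lo].count(ch)
        hi = first[ch] + last[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


print(fm_count("GATTACAGATTACA", "TTA"))  # 2
print(fm_count("GATTACAGATTACA", "CCC"))  # 0
```

The pattern is consumed from its last character to its first, and each step narrows a range of rows in the sorted table, so the work per read depends on the read length rather than on scanning the whole reference.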


