Converting a binary/blob GUID column to text in SQL
Because we live in an imperfect world, one where ease of use is routinely traded for performance and clean abstractions for speed, you'll frequently come across tasks when working with computer systems and standards that appear obvious and simple, only to discover later that they lead down a rabbit hole. Unfortunately, this is the case whenever a programmer comes into contact with the individual bits of a binary blob, which can happen at any time.
The premise is simple: given a database powered by, say, the widely used MySQL or SQLite RDBMS engines, database designers and systems programmers must choose between storing a GUID as plain text (VARCHAR or TEXT) and taking the performance, memory, and storage hit that comes with it, or digging out the BLOB column type in their DDL.
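To make the trade-off concrete, here is a minimal SQLite sketch of the two options (the table and column names are purely illustrative):

CREATE TABLE guids_as_text (
    guid TEXT   -- the 36-character string form, e.g. '4c36947a-f5ad-11e7-8f4c-7085c25fb679'
);

CREATE TABLE guids_as_blob (
    guid BLOB   -- the raw 16 bytes
);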
For some quick background on why this problem occurs (if you're already an encoding expert, you can skip this and the following paragraph): if you encounter data in the form of a series of hexadecimal characters, you can bet your bottom dollar that you're dealing with a representation of the underlying data rather than the data itself. The ubiquitous UUID is a concrete and highly relevant example: 4c36947a-f5ad-11e7-8f4c-7085c25fb679
What you're seeing here is a succession of ASCII characters: 32 of them, if you count them without the dashes (or the enclosing braces some formats add). 32 ASCII characters require 32 bytes, so why is a GUID or UUID said to be only sixteen bytes long? Because a UUID is defined as a sequence of bytes, not characters, and only 95 of the 256 possible values that can be stored in a single byte are printable text (the space character, the tilde, and everything in between). Since we can't display raw bytes directly, we fall back on binary or hexadecimal notation to represent the GUID instead.
Back to the point: when you tell your database to save our UUID from earlier in a VARCHAR or TEXT field, it has no way of knowing that the 32 hexadecimal characters you handed it are really a representation of a 16-byte value. Consider this: how does it know that after you run this query:
INSERT INTO mytable (guid)
VALUES ('4c36947a-f5ad-11e7-8f4c-7085c25fb679');
you won't follow it up with another:
INSERT INTO mytable (guid)
VALUES ('not a guid');
It can't. It's treating this column as UTF-8/ASCII text data because that's what you asked for. In fact, the query we just ran won't take up just 32 bytes, either, since the dashes in the GUID are an intrinsic part of the column value in the database and can't be discarded and reinserted at will. So the database dutifully stores your 16-byte GUID as a whopping 32-, 36-, or even 38-byte value (with the opening and closing braces and the dividing dashes), consuming up to 238 percent of the space it actually requires!
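You can verify the size difference directly in SQLite; the blob literal below is simply our sample GUID's hex digits with the dashes stripped:

SELECT length('4c36947a-f5ad-11e7-8f4c-7085c25fb679');   -- 36 (bytes of text)
SELECT length(X'4C36947AF5AD11E78F4C7085C25FB679');      -- 16 (bytes of blob)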
As a result, sane developers save their GUIDs as fixed-length 16-byte values in BLOB columns. The database driver/adapter/wrapper then translates the saved value from an array of 16 bytes to whatever datatype the software uses to represent a GUID at runtime. The rest, as they say, is history.
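By hand, inserting such a value in SQLite looks like this, using a hex blob literal (which byte order the writer uses is exactly the wrinkle we run into below):

INSERT INTO mytable (guid)
VALUES (X'4C36947AF5AD11E78F4C7085C25FB679');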
But what happens if you need to manually look something up in your database? For example, when we run a simple SQLite lookup against a table with the GUID stored in a blob column, say:
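-- assumes the messages table queried later in this article
SELECT guid FROM messages;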
we get back this: U���&<ZE��Z�T�? What the heck is that? That isn't a GUID at all! That's our GUID's raw bytes interpreted as UTF-8 text, and it isn't pretty. But how can we recover the GUID form we know and love from this abomination? It's not just for looks: what if we need to match against another developer's database, and they've saved their GUIDs in their table as text rather than binary?
The programmer, who is familiar with ASCII, UTF-8, binary, hexadecimal, and all that other good stuff, might rush in with a solution like this:
SELECT substr(hex(guid), 1, 8)
|| '-' || substr(hex(guid), 9, 4)
|| '-' || substr(hex(guid), 13, 4)
|| '-' || substr(hex(guid), 17, 4)
|| '-' || substr(hex(guid), 21, 12)
FROM [MyTable];
converting the binary value of the guid column to hexadecimal (as that's what our 0-9 and A-F characters are), splitting it up into the familiar 8-4-4-4-12 sequence of characters (keeping in mind that every two hexadecimal characters form one byte), and then rejoicing.
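Run against a blob that was written by a Windows/.NET program in its native little-endian layout, the query above would yield something like the following for our sample GUID (a worked example under that assumption, not captured output):

7A94364C-ADF5-E711-8F4C-7085C25FB679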
If something about that output looks strange, it's because it's incorrect. A wise developer understands that when dealing with representations of binary data, simply converting from binary to hexadecimal isn't enough. Furthermore, the intelligent developer suspects there might be some significance to the fact that UUIDs and GUIDs are always shown in the same familiar 8-4-4-4-12 pattern. So she says to herself, "Maybe I should check Wikipedia on this one."
It turns out GUIDs aren't defined as "a sequence of 16 bytes split into 4 bytes, 2 bytes, 2 bytes, 2 bytes, and 6 bytes," but rather as:
NAME | LENGTH (BYTES) | LENGTH (HEX DIGITS) | CONTENTS
--- | --- | --- | ---
time_low | 4 | 8 | integer giving the low 32 bits of the time
time_mid | 2 | 4 | integer giving the middle 16 bits of the time
time_hi_and_version | 2 | 4 | 4-bit "version" in the most significant bits, followed by the high 12 bits of the time
clock_seq_hi_and_reserved, clock_seq_low | 2 | 4 | 1-3 bit "variant" in the most significant bits, followed by the 13-15 bit clock sequence
node | 6 | 12 | the 48-bit node id
The use of the word integer, which has a very precise meaning, is crucial here. Unlike a single byte, which reads the same either way, a multi-byte integer can be stored in memory in little-endian or big-endian format, with the least-significant byte appearing first or last when viewed in memory.
The UUID's first three pieces (the 8-4-4) are therefore byte-order dependent, whereas the latter two (the 4-12) are not. To make matters even more confusing, while you'd imagine that little-endian architectures like Intel/AMD x86 would use a little-endian representation for these integers and big-endian architectures like MIPS or PowerPC would use a big-endian one, that isn't how it played out.
Instead, when the 8-4-4-4-12 UUID/GUID format was initially established, most (software) platforms used big-endian (aka network byte order) for UUIDs, while Microsoft used the representation native to the architecture the OS was running on (little-endian for Windows).
While Microsoft's approach was consistent with how the operating system handled four- and two-byte integers, other platforms focused on the "universal" part of UUID. They therefore used a hard-coded (and easier to work with!) big-endian layout regardless of the architecture of the machine in question, meaning that the binary representation of a UUID could be copied to a different architecture/environment and still retain the same properties.
Because of these discrepancies, we now have two common UUID variants, as defined by RFC 4122. Most of the world initially used Variant 1, with the underlying bytes in big-endian format, while Microsoft uses Variant 2 (which is what you encounter when working with .NET/Win32/COM GUID instances).
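To make the difference concrete, here is how the sixteen bytes of our sample GUID 4c36947a-f5ad-11e7-8f4c-7085c25fb679 would be laid out on disk under each convention (a sketch; only the first three fields are byte-swapped):

-- Variant 1 (big-endian, aka network byte order):
--   4C 36 94 7A  F5 AD  11 E7  8F 4C  70 85 C2 5F B6 79
-- Variant 2 (Microsoft: first three fields little-endian):
--   7A 94 36 4C  AD F5  E7 11  8F 4C  70 85 C2 5F B6 79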
That's why using SQL's HEX() function to convert a BLOB guid column to hexadecimal and splitting the string after a particular number of characters isn't enough. Instead, we have to resort to a heinous act like this:
SELECT
substr(hguid, 7, 2) || substr(hguid, 5, 2)
|| substr(hguid, 3, 2) || substr(hguid, 1, 2) || '-'
|| substr(hguid, 11, 2) || substr(hguid, 9, 2) || '-'
|| substr(hguid, 15, 2) || substr(hguid, 13, 2) || '-'
|| substr(hguid, 17, 4) || '-'
|| substr(hguid, 21, 12)
AS guid
FROM (SELECT hex(guid) AS hguid FROM messages);
And with that, we finally get the result we were searching for: our original 4c36947a-f5ad-11e7-8f4c-7085c25fb679, reassembled from its byte-swapped binary form.
This article from Percona is a fantastic read if you want to improve the performance of tables with UUID keys even further; note, however, that simply adopting UUIDv4 makes the optimization they describe moot, since those values are fully randomized.
Conclusion
I hope you found this information helpful.
User Questions
1. In SQLite, what is the BLOB data type?
A binary large object (BLOB) is a collection of binary data stored as a single value in the database. You can use a BLOB column to store documents, photos, and other multimedia assets. For the sake of demonstration, we'll create a new table called documents below.
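A minimal SQLite sketch of such a table (the schema here is purely illustrative):

CREATE TABLE documents (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    data  BLOB   -- the raw bytes of the stored file
);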
2. How do you retrieve a BLOB value?

The getBinaryStream() and getBlob() methods of the JDBC ResultSet interface can be used to retrieve BLOB values (binary data) from a table. These methods take an integer for the required column's index (or a String for the column's name) and read the BLOB data from it.
3. What is the best way to view my BLOB data?

When you run a select statement over a file with a field declared as a BLOB, the data in the field does not display; you'll see *POINTER as the field's value instead. To view the data in a BLOB field, you need a locator.
4. What's the best practice for handling GUIDs from SQL in Power BI? I'm trying to reduce the size of my file.