Tuesday January 31, 2023

What are UUIDs? And Why Are They Important?

A Universally Unique Identifier (UUID) is a specific form of identifier that can be safely deemed unique for most practical purposes. Two correctly generated UUIDs have a vanishingly small chance of being identical, even if they were created in different environments by two separate parties. That is why UUIDs are described as universally unique.

UUIDs can be used in any situation where unique, decentralized ID generation is required. Although we will approach them from the perspective of software that interacts with database records, they are applicable to other uses too.

What is a UUID?

A UUID is simply a value that you can safely treat as unique. The risk of a collision is so low that you can reasonably choose to ignore it altogether. You may see UUIDs referred to by other names, such as GUID (Globally Unique Identifier), but the meaning is the same.

True UUIDs are unique identifiers that are generated and represented in a standardized format. Valid UUIDs are defined by RFC 4122. This specification describes algorithms that generate UUIDs which remain unique across implementations, without requiring a central issuing authority.

The RFC includes five different algorithms, each using a different mechanism to produce a value. Here’s a quick summary of the available “versions”:

Version 1 – Time-Based – Combines a timestamp, a clock sequence, and a value specific to the generating device (usually its MAC address) to produce an output that is unique for that host at that moment in time.
Version 2 – DCE Security – Developed as an evolution of Version 1 for use with the Distributed Computing Environment (DCE). It is not widely used.
Version 3 – Name-Based (MD5) – Hashes a namespace and a name with MD5 to create a value that is unique to that name within the namespace. The output is reproducible: generating another UUID with the same namespace and name yields the same value.
Version 4 – Random – Most modern systems opt for UUID v4, which uses the host’s source of random or pseudo-random numbers to issue its values. The chance of the same UUID being generated twice is negligible.
Version 5 – Name-Based (SHA-1) – Similar to Version 3, but uses the stronger SHA-1 algorithm to hash the input namespace and name.

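The differences between the versions are easier to see in code. Here is a minimal sketch using Python's standard `uuid` module, which implements versions 1, 3, 4 and 5 but not version 2. The name "example.com" is an arbitrary illustrative input, not something taken from the RFC.

```python
import uuid

# Version 1: timestamp + clock sequence + node identifier of the host.
v1 = uuid.uuid1()

# Version 4: built from random bits.
v4 = uuid.uuid4()

# Versions 3 and 5: hash a namespace UUID together with a name.
# "example.com" is an arbitrary name chosen for illustration.
v3_a = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")
v3_b = uuid.uuid3(uuid.NAMESPACE_DNS, "example.com")
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")

for label, value in [("v1", v1), ("v3", v3_a), ("v4", v4), ("v5", v5)]:
    print(label, value, "version:", value.version)

# Name-based UUIDs are reproducible for the same namespace and name.
print("v3 reproducible:", v3_a == v3_b)
# MD5 and SHA-1 hashing give different results for the same input.
print("v3 equals v5:", v3_a == v5)
```

Running the script twice should print the same v3 and v5 values each time, while v1 and v4 change on every run.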
Although the RFC refers to the algorithms as versions, that doesn’t mean you should use Version 5 just because it appears to be the most recent. The best choice depends on the use case. In many scenarios v4 is preferred because of its random nature, which makes it a good candidate for simple “give me a new identifier” situations.

The generation algorithms produce a 128-bit unsigned integer. UUIDs can be stored as binary sequences of 16 bytes, but they are most commonly seen as hexadecimal strings. Here’s an example of a UUID string:

16763be4-6022-406e-a950-fcd5018633ca

The value is represented as five groups of hexadecimal characters separated by dashes. The dashes carry no information of their own; their presence stems from historical details of the UUID specification. They also make the identifier easier for humans to read.

UUID Use Cases

Decentralized generation of unique identifiers is the main use case for UUIDs. A UUID can be generated anywhere, and you can safely consider it unique regardless of whether it comes from your backend code or a client device.

UUIDs make it easier to establish and maintain object identity across disconnected environments. Historically, most applications used an auto-incrementing integer field as the primary key. With that approach, you can’t know an object’s ID until after it has been inserted into the database. UUIDs let you establish identity much earlier in your application.

Here’s a basic PHP demo that shows the difference. Let’s look at the integer-based system first:

    class BlogPost {
        public function __construct(
            public readonly ?int $Id,   // null until the database assigns a key
            public readonly string $Headline,
            public readonly ?AuthorCollection $Authors = null) {}
    }

    #[POST("/posts")]
    function createBlogPost(HttpRequest $Request) : void {
        $headline = $Request->getField("Headline");
        // The ID is unknown at this point, so null has to be passed.
        $blogPost = new BlogPost(null, $headline);
    }

Because we don’t know the value of the $Id property until the post has been persisted to the database, we must initialize it with null. This is not ideal: $Id shouldn’t really be nullable, and it allows BlogPost instances to exist in an incomplete state.

Changing to UUIDs addresses the problem:

    class BlogPost {
        public function __construct(
            public readonly string $Uuid,   // always set, never null
            public readonly string $Headline,
            public readonly ?AuthorCollection $Authors = null) {}
    }

    #[POST("/posts")]
    function createBlogPost(HttpRequest $Request) : void {
        $headline = $Request->getField("Headline");
        $blogPost = new BlogPost("16763be4-…", $headline);
    }

Post identifiers can now be generated within the application without risking duplicate values. This ensures that object instances always represent a valid state and don’t need nullable ID properties. The model also makes transactional logic easier: child records that need a reference to their parent (such as a post’s Author associations) can be inserted immediately, without a database round trip to fetch the ID the parent was assigned.

Your blog application may move more logic to the client in the future. The frontend might gain support for full offline draft creation, effectively creating BlogPost instances that are temporarily persisted on the user’s device. The client could now generate the post’s UUID and send it to the server once network connectivity is restored. If the client later retrieved the server’s copy of the draft, it could match it up with any remaining local state because the UUID would already be known.

UUIDs also help you combine data from different sources. Merging database tables and caches that use integer keys can be time-consuming and error-prone. UUIDs offer uniqueness not only within a table but across every table and system. This makes them much better candidates for replicated structures and for data that is frequently moved between storage systems.

UUIDs and Databases: Some Caveats

There are some issues to be aware of when UUIDs are used in real systems. Integer IDs are easy to scale and optimize.
Database engines can easily index, sort, and filter a list of numbers that only ever increases. The same is not true of UUIDs. In binary form a UUID is four times larger than a 32-bit integer (16 bytes vs. 4 bytes), and its 36-character string form is nine times larger, which can be a significant consideration for large datasets. The values are also harder to index and sort, especially the most common random UUIDs: because they are random, they have no natural order. Indexing performance can suffer if you use a UUID as a primary key.
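The size difference between the representations is easy to verify. The following Python sketch, using the standard `uuid` module, parses the example identifier from earlier in the article and compares its string, hexadecimal and binary forms.

```python
import uuid

# The example identifier shown earlier in the article.
text = "16763be4-6022-406e-a950-fcd5018633ca"
value = uuid.UUID(text)

print(len(str(value)))    # canonical string with dashes: 36 characters
print(len(value.hex))     # hexadecimal digits only: 32 characters
print(len(value.bytes))   # raw binary form: 16 bytes
print(value.version)      # the version is encoded in the value itself

# Parsing the dash-free form gives back the same value.
same_without_dashes = uuid.UUID(value.hex) == value
# The 16-byte binary form round-trips without loss.
same_from_bytes = uuid.UUID(bytes=value.bytes) == value
print(same_without_dashes, same_from_bytes)
```

The 16-byte form is what a binary column would hold, while the 36-character form is what ends up in a naive text column.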

These problems can be magnified in a well-normalized database that makes heavy use of foreign keys. You may now have many relational tables, each containing references to your 36-byte UUIDs. The extra memory needed to perform joins and sorts can have a significant impact on your system’s performance.

You can partially mitigate these issues by storing your UUIDs as binary data, using a BINARY(16) column instead of VARCHAR(36). Some databases, such as PostgreSQL, have a built-in UUID datatype. Others, like MySQL, have functions that convert a UUID string to its binary representation and back. This is more efficient, but you will still use more resources to store and select your data than you would with integers.

It is also possible to keep integers as your primary keys but add a UUID field for your application to reference. Relational link tables can then use the integer IDs to improve performance, while your code fetches and inserts top-level objects by UUID. It all depends on your system, its scale, and your priorities: UUIDs are the best choice when you need decentralized ID generation and simple data merges, but you must recognize the trade-offs.

Summary

UUIDs are unique values that you can safely use for decentralized identity generation. Collisions are possible, but so unlikely that they can be disregarded for most purposes. Assuming a good source of entropy, you would need to generate roughly one billion version 4 UUIDs per second for about 86 years to reach a 50% chance of a single duplicate.

UUIDs can be used to establish identity independently of your database, before an insert occurs. This simplifies application-level code and prevents improperly identified objects from existing in your system. UUIDs also aid data replication by offering uniqueness regardless of data store, device, or environment, in contrast to traditional integer keys, which are only unique within a table.

UUIDs are a common tool in software development. However, they are not a perfect solution.
Newcomers tend to fixate on the possibility of collisions, but this should not be your primary consideration unless your system is so sensitive that uniqueness must be strictly guaranteed.
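The odds of a collision can be estimated with the birthday approximation. The Python sketch below assumes version 4 UUIDs with 122 random bits (the other 6 of the 128 bits are fixed by the format) and a perfectly uniform random source. The function names are illustrative only.

```python
import math

RANDOM_BITS = 122  # 128 bits total, 6 of which are fixed in a version 4 UUID


def collision_probability(count, bits=RANDOM_BITS):
    """Approximate chance of at least one duplicate among `count` random IDs."""
    # Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits)).
    return -math.expm1(-(count * count) / (2 * 2**bits))


def count_for_probability(p, bits=RANDOM_BITS):
    """Approximate number of IDs needed to reach collision probability `p`."""
    return math.sqrt(2 * 2**bits * math.log(1 / (1 - p)))


needed = count_for_probability(0.5)
seconds_per_year = 365.25 * 24 * 3600
years = needed / 1e9 / seconds_per_year  # at one billion UUIDs per second

print(f"{needed:.2e} UUIDs for a 50% chance of a duplicate")
print(f"about {years:.0f} years at one billion UUIDs per second")
```

Under these assumptions, a 50% chance of a duplicate takes on the order of 10^18 identifiers, far more than most systems will ever issue.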

For most developers, the more pressing challenge is storing and retrieving UUIDs. Naively using a VARCHAR(36) column, or stripping out the hyphens and using VARCHAR(32), could significantly slow your application over time, because most database indexing optimizations are ineffective on such columns. To get the best performance from your solution, research the UUID handling capabilities built into your database system.
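As one illustration of the binary storage and integer-plus-UUID approaches discussed above, here is a minimal Python sketch using the standard `sqlite3` and `uuid` modules. SQLite is an assumption made only because it ships with Python. The table and column names are hypothetical, and other engines offer their own types and functions for this.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")

# Hypothetical schema: an integer key for joins inside the database,
# plus a 16-byte UUID column that the application uses as its reference.
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " uuid BLOB NOT NULL UNIQUE,"
    " headline TEXT NOT NULL)"
)

# The application generates the identifier before the insert happens.
post_uuid = uuid.uuid4()
conn.execute(
    "INSERT INTO posts (uuid, headline) VALUES (?, ?)",
    (post_uuid.bytes, "Hello UUIDs"),
)

# Look the row up by UUID and convert the stored bytes back to a UUID.
row = conn.execute(
    "SELECT id, uuid, headline FROM posts WHERE uuid = ?",
    (post_uuid.bytes,),
).fetchone()
stored = uuid.UUID(bytes=row[1])
print(row[0], stored, row[2])
```

The application only ever handles the UUID, while the integer key stays available for foreign keys and joins inside the database.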
