System Design
Design With Sid
distributed systems

Unique ID Generation in Distributed Systems

Unique ID Generation in Distributed Systems

Introduction

In distributed systems, generating unique IDs is a common challenge due to the need to ensure no collisions across multiple servers. Whether you're designing databases, user accounts, or any other service that requires global uniqueness, having an effective ID generation strategy is essential.

In this article, we’ll explore four methods for generating unique IDs in a distributed environment:

  1. Multi-Master Application: Multiple master nodes generate unique IDs independently using methods like database sequences or auto-incrementing keys. While it works well for certain scenarios, it can become complex due to the potential for ID collisions.

  2. UUID (Universally Unique Identifier): UUIDs are 128-bit numbers generated independently by each system. Due to their randomness and large size, the chance of collisions is practically zero, making them ideal for distributed systems. UUIDs are simple to implement but tend to be long and not human-readable.

  3. Ticketing Server: A centralized approach where a dedicated server is responsible for issuing incrementing unique IDs to all distributed services. This method ensures no ID collisions but can become a bottleneck or a single point of failure if not properly managed.

  4. Twitter Snowflake: This is a time-based, distributed ID generation algorithm developed by Twitter. It generates 64-bit IDs that are unique across different data centers and servers. The ID is composed of a timestamp, a data center ID, a machine ID, and a sequence number, ensuring both uniqueness and order across distributed systems.

In this blog, we'll dive into the code implementation of widely used Twitter Snowflake Technique.

Snowflake Algorithm Structure

The algorithm structure looks like this:

  • Timestamp (41 bits): Represents the number of milliseconds that have passed since the epoch.
  • Datacenter ID (5 bits): Identifies the datacenter where the ID is generated.
  • Worker ID (5 bits): Identifies the worker or machine generating the ID within the datacenter.
  • Sequence Number (12 bits): A counter that ensures uniqueness within the same millisecond.

This 64-bit ID structure is highly efficient for generating IDs at scale in distributed systems.

Code Explanation

Now let's move on to the code implementation, where we will use the above structure to generate unique IDs.

package algo;

public class SnowflakeIDGenerator {
    private final long epoch = 1288834974657L; // Start time for our epoch
    
    // Limits for workerId, datacenterId, and sequence
    private final int maxWorkerId = 31;  // Max 5 bits = 31
    private final int maxDatacenterId = 31; // Max 5 bits = 31
    private final int maxSequence = 4095; // Max 12 bits = 4095

    // Values for this instance
    private int workerId;
    private int datacenterId;
    private int sequence = 0;
    private long lastTimestamp = -1L; // last time when Id was generated, to keep track of sequence
    
    public SnowflakeIDGenerator(int workerId, int datacenterId) {
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException("Worker ID must be between 0 and " + maxWorkerId);
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException("Datacenter ID must be between 0 and " + maxDatacenterId);
        }
        this.workerId = workerId;
        this.datacenterId = datacenterId;
    }
    
    public synchronized long nextId() { 
        long timestamp = System.currentTimeMillis();  // Current time in milliseconds
        
        // If the system clock is running backward
        if (timestamp < lastTimestamp) {
            throw new RuntimeException("Clock is moving backwards!");
        }
        // If we're generating IDs within the same millisecond
        if (timestamp == lastTimestamp) {
            sequence++;
            if (sequence > maxSequence) { // We've used up all 4096 sequence numbers in this millisecond
                timestamp = waitNextMillis(); // Wait for the next millisecond
                sequence = 0; // Reset sequence
            }
        } else {
            // Reset the sequence when we're in a new millisecond
            sequence = 0;
        }
        
        lastTimestamp = timestamp; // Update the last timestamp used
        long id = (timestamp - epoch) * 100000000L // Use timestamp (in milliseconds) - twepoch
                + datacenterId * 100000L // Append datacenter ID in an easy-to-read format
                + workerId * 1000L       // Append worker ID
                + sequence;              // Append sequence number for uniqueness within the same millisecond

        return id;
    } 
    
    public long waitNextMillis() {
    	long currentMillis = System.currentTimeMillis();
        while (currentMillis == lastTimestamp) {
            currentMillis = System.currentTimeMillis();
        }
        return currentMillis;
    }
    
    public static void main(String[] args) {
        SnowflakeIDGenerator idGen = new SnowflakeIDGenerator(1, 1);

        System.out.println(idGen.nextId());
        System.out.println(idGen.nextId());
        System.out.println(idGen.nextId());
    }
}

Conclusion

The Snowflake ID Generator is an efficient and scalable way to generate unique IDs across distributed systems. By combining the current timestamp, datacenter ID, worker ID, and a sequence number, it ensures that no two IDs are the same, even when multiple machines are generating IDs simultaneously.

Key Takeaways:

  1. Scalability: The algorithm is designed to scale horizontally by allowing multiple workers and datacenters to generate IDs independently without collisions.
  2. Time-Based Uniqueness: The use of the current timestamp as the basis ensures that the IDs are chronologically ordered.
  3. Low Latency: With this approach, IDs can be generated very quickly (in the same millisecond) without the need for a central coordinator, which keeps the process fast.
  4. Simplicity: The algorithm is simple to implement and can be customized by adjusting the bit allocation for worker ID, datacenter ID, and sequence number according to system needs.

This technique has been widely used in high-scale applications, such as Twitter, and is a great solution for distributed systems where uniqueness and performance are critical.

Feel free to experiment with this implementation or adapt it to suit your system's specific needs. If you have any questions or suggestions, feel free to reach out!