Database Concurrency: Two phase Locking (2PL) to MVCC - Part 1

What is concurrency in databases?

Imagine this scenario:

User A initiates the process to book the final window seat on a flight, but the booking hasn't been finalized yet.

At the same time, User B manages to start booking the same seat because the database hasn't made it clear that User A has already claimed that seat

User C checks the number of window seats and sees one available, leading to confusion.

Without concurrency, the booking system could end up confirming both User A and User B for the same seat, leading to a chaotic situation. User C might also proceed under the incorrect assumption that there are seats available when there are none.

Concurrency is all about making sure that while lots of different tasks are happening at once, everything stays correct, consistent and isolated. The lack of concurrency control could lead to conflicts, data corruption, inconsistent data, and system crashes, which are unacceptable in a robust database environment.

Concurrency Benefits

Concurrency in databases matters a lot, and here’s why it’s a big deal:

Simultaneous Access: Databases often serve many users or processes at once. Concurrency ensures that everyone can access the data they need when they need it, without interference.
Transaction Isolation: Concurrency control mechanisms are put in place to ensure that transactions are isolated from each other, meaning the operations of one transaction are not visible to others until they are complete. This is crucial to maintaining the integrity of the data.
Improved Performance: When multiple operations can occur concurrently, the overall system performance is better. This means faster responses and less waiting time for users.
Resource Utilization: Concurrency leads to more efficient use of database resources, as it can process several operations at once, rather than processing transactions one by one.

Database Concurrency Challenges

Databases adhering to ACID principles must ensure that each transaction is executed in an isolated environment. It’s crucial that other operations should not to be aware of what’s going on with the records while the transaction is carried out. A lack of adequate isolation can result in transactions interfering with one another, potentially causing the database to reach inconsistent and unexpected states. So, what occurs when one transaction attempts to read a row that another transaction is updating? Let’s explore the challenges that can arise in handling concurrent transactions.

Concurrency Issue	Description	Example
Dirty Read	A transaction reads data modified by another transaction that hasn’t yet been committed.	T1 modifies a value in row 1. T2 reads the updated value in row 1 before T1 commits. T1 aborts, making T2’s read data invalid.
Non-Repeatable Read	A transaction re-reads data and finds it altered or deleted by another transaction.	T1 reads a record. T2 updates/deletes that record and commits. T1 re-reads the record and finds a different value or discovers that the record is gone.
Phantom Read	A transaction re-executes a query and finds the set of rows meeting the criteria has changed due to another committed transaction.	T1 runs a query to count rows meeting a condition. T2 inserts/deletes rows affecting this condition and commits. T1 re-runs the query and gets a different count.
Lost Update	Two transactions updating the same dataset, resulting in one update being lost as it is overwritten by the latter. In short, the last update wins.	T1 and T2 both read row 1. T1 modifies row1 and commits. T2 updates row1, not knowing the changes made by T1. T2 commits, it will lead to overwriting T1’s changes.

How do databases deal with this? They implement different levels of isolation to avoid such problems.

Isolation Levels

Read Uncommitted: The lowest level, where transactions can see uncommitted changes made by other transactions.
Read Committed: Transactions can only see changes that have been committed while the transaction is running. However, if the same data is read again within the same transaction, it’s possible to see different values (non-repeatable reads and phantom reads)
Repeatable Read: Ensures that if a transaction reads a record, it will read the same value every time, regardless of changes made by other transactions. However, it is still possible to have phantom reads (new records inserted by other transactions can be seen if they match the criteria).
Serializable: The highest level, which ensures full serializability of transactions. It is as if transactions are processed one after another rather than concurrently.

Each level is a trade-off between consistency and performance: higher isolation levels provide greater consistency but typically at the cost of concurrency.

Databases use a variety of methods to manage concurrency, including locking systems such as Two phase locking (2PL), Multiversion concurrency Control (MVCC), and other techniques. These are all designed to ensure data integrity while maximizing the database’s ability to handle multiple requests at the same time.

Two Phase Locking

Two-phase locking (2PL) is a concurrency control method used in database systems to ensure transaction serializability. Transaction serializability ensures that even when database transactions overlap in time, the final result looks if they had occurred one after another in sequence. The protocol works in two phases:

Locking (Growing) Phase: In this phase, a transaction may obtain locks on data (rows, tables etc) but cannot release any. There are mainly two types of locks:
- Shared Lock: Allows a transaction to read a resource. Multiple transactions can hold shared locks on the same resource, which means multiple readers can read the same data.
- Exclusive Lock: Allows a transaction to both read and write a resource. If a transaction holds an exclusive lock on a resource, no other transaction can acquire any lock (shared or exclusive) on that resource.
Unlocking (Shrinking) Phase: The transaction releases all the locks in this phase but cannot acquire any new ones. locks can be released as soon as the transaction no longer needs the locked resource

In 2PL, multiple readers can access data simultaneously through shared locks. However, this system struggles with concurrent reads and writes: reader and writer transactions interfere with each other since readers’ shared locks are incompatible with writers’ exclusive locks.
The lock compatibility would be like this:

Lock Compatibility	Shared Lock	Exclusive Lock
Shared Lock	Yes	No
Exclusive lock	No	No

Example

To illustrate two-phase locking, let’s consider two transactions, T1 and T2:

Transaction T1 wants to read a record A and then update it. Transaction T2 wants to read the same record A and then read record B.

Here’s how two-phase locking would work in this scenario:

T1 starts and requests a shared lock on record A.

T1 is granted the shared lock since no one else has a lock on A.

```
T1 reads record A.
```

Before writing, T1 upgraded its lock from shared to exclusive lock on record A.

T2 starts and requests a shared lock on record A, but must wait because T1 has an exclusive lock.

T1 writes to record A (still in the locking phase).

After writing, T1 enters the unlocking phase, releasing all the locks on record A and then commits(The unlocking/shrinking phase presents several scenarios, discussed in types of 2PL below)

As soon as the lock is released, T2 acquires a shared lock.

T2 reads record A and then requests and is granted a shared lock on record B.

T2 reads record B and then completes its operation.

T1 must complete its operations before T2 can read it, ensuring that T2 gets the updated value. This example shows how 2PL implements serialization of transactions by implementing locks and prevents inconsistent reads and updates.

Drawbacks:

1. Deadlocks

A deadlock occurs when two or more transactions are waiting for each other to release locks, and none of them can proceed.

Example

Transaction T1 and T2 are running concurrently

T1 requests and acquires an exclusive/shared lock on Row 1 in the sales table.

T2 requests and acquires an exclusive/shared lock on Row 2 in the same table.

T1 now requests an exclusive lock on Row 2 but has to wait because T2 holds the lock.

T2 requests an exclusive lock on Row 1 but also has to wait because T1 holds the lock.

Transactions can get stuck in a deadlock when each waits for the other to release its locks. To address this, the database must detect and resolve the deadlock, typically by aborting/rollback one of the transactions.

2. Cascading Rollbacks

Cascading rollback happens when the failure of one transaction forces rollbacks of other transactions. It occurs when other transactions read uncommitted data from the failed transaction, leading to a situation known as dirty reads.

Example

T1 starts,acquires exclusive lock on Row 1, writes data against row 1 and then unlocks row 1 during shrinking phase.

T2 starts, acquires shared lock and reads the uncommitted data from Row 1 (the changes made by T1).

T3 starts and reads the uncommitted data from Row 1 as well, or perhaps reads some derivative data from T2 that originated with T1's changes.

T1 encounters a problem and must be rolled back, undoing the changes to Row 1.

Because T2 and T3 have read and possibly acted on this now-rolled-back data, their operations may be based on invalid data states. Therefore, they too must be rolled back to ensure data consistency which depicts cascading rollback effect.

3. Reduced Concurrency

In 2PL, transactions maintain their locks until completion, potentially limiting the number of transactions that can be processed concurrently.. If another transaction needs the same locked resources, it has to wait, which can cause delays and a backlog of pending transactions.

Example

T1 starts a long-running operation, like generating a report, which requires a shared lock on many rows in the accounts table.

T2 and T3, which want to update some of the locked rows, are blocked until T1 completes and releases its locks, even though they might be quick operations.

What are the different variants of Two-Phase Locking, and how do they mitigate above limitations?

Strict 2PL

Strict 2PL ensures that transactions retain all their exclusive locks until they complete, either by committing or aborting. By holding onto write-locks until a transaction is fully finalized, it prevents other transactions from accessing the locked data, thereby minimizing the risk of cascading rollbacks. Shared locks can be released before commit/rollback.

Example

```
T1 starts and writes to Row 1.
```
```
T1 acquires a shared lock on Row 2.
```

T2 starts and try to acquire shared lock on Row 1 but since it is exclusively locked for writes by T1, T2 has to wait

T3 starts and try to acquire write lock on Row 2 but since it is shared locked by T1, T3 has to wait

T1 encounters a problem and rolled back, undoing the changes to Row 1.(Here it can release the shared lock on Row 2 before rollback and must release exclusive lock on Row 1 after rollback)

At this step, the database doesn’t need to rollback transactions T2 and T3 because they didn’t read the uncommitted data from T1.(No cascading effect)

Now T3 can take write lock on Row 2 and after T2 rollbacks successfully, T2 can take shared lock on Row1.

Rigorous 2PL

Rigorous 2PL takes a more conservative approach by having a transaction maintain all its locks, both exclusive and shared, until it either commits or aborts. This approach decreases the chances of deadlocks since resources are more strictly controlled, but it also lowers the level of concurrent operations as both shared and exclusively locked resources are held for the duration of the transaction.

Example

```
T1 starts and writes to Row 1.
```
```
T1 acquires a shared lock on Row 2.
```

T2 starts and try to acquire shared lock on Row 1 but since it is exclusively locked for writes by T1, T2 has to wait

T3 starts and try to acquire write lock on Row 2 but since it is shared locked by T1, T2 has to wait

T1 encounters a problem and rolled back, undoing the changes to Row 1.(Here it releases the shared lock on Row 2 and exclusive lock on Row1 after rollback)

At this step, the database doesn’t need to rollback transactions T2 and T3 because they didn’t read the uncommitted data from T1.(No cascading effect)

Now T2 can read Row1 and T3 can take exclusive lock on row 2 once T1 rollbacks successfully.

These both types are still prone to deadlocks during the locking phase which can be seen from the following example

```
T1 starts and locks Row 1 exclusively.
```
```
T2 starts and locks Row 2 exclusively.
```

T1 tries to lock Row 2 but has to wait because T2 holds the lock.

T2 tries to lock Row 1 but has to wait because T1 holds the lock.

Conservative/Static 2PL

Static Two-Phase Locking, or Conservative 2PL, requires transactions to lock all necessary data items exclusively at the start. If a transaction can’t immediately lock all its needed items, it will wait until everything is available. Once it secures all locks, the transaction can proceed, and it can release locks without the stricter constraints found in Strict or Rigorous 2PL.

Note that there is no restriction on releasing or unlocking data items in Static 2PL as we had in Strict or Rigorous 2-PL. While this method ensures a deadlock-free sequence of transactions, it doesn’t eliminate the possibility of cascading rollbacks, unlike Strict or Rigorous 2PL. Therefore, despite its deadlock prevention, Static 2PL doesn’t guarantee strict scheduling, which is a disadvantage in comparison to Strict and Rigorous 2-PL.

Example

```
T1 acquires lock on Row1 and Row2
```
```
T1 reads row1 and writes Row2.
```
```
T1 unlocks Row2.
```

T2 starts and try to acquire lock on Row 2 and reads it(Which is a dirty read)

T1 encounters a problem and rolled back, undoing the changes to Row 1 and releasing all locks

At this step, T2 also has to rollback transaction as dirty read can destroy data integrity

What is MVCC?

MVCC is a strategy that allows multiple users to access a database concurrently without interference from other transactions. PostgreSQL implements MVCC by creating a snapshot of the database state/version at the start of each transaction. This snapshot provides a consistent view of the data as it existed at the moment the transaction began, regardless of subsequent changes made by others.

The main benefit of the MVCC model over traditional locking for concurrency control is that MVCC allows queries to access data without conflicting with write operations. This means

“Readers never block writers and writers never block readers”

Understanding the basics of MVCC is just the beginning. In Part 2, we’re going to dive deeper into how PostgreSQL utilizes MVCC across its different transaction isolation levels: Read Committed, Repeatable Read, and Serializable.

Each of these levels offers different guarantees about how transaction changes are visible to others, and MVCC plays a pivotal role in implementing them.

Stay tuned!