Relational Databases: Exploring Internal Operations and Query Handling, Using PostgreSQL as an Example

Modern data management relies heavily on databases, which offer a structured and efficient way to store and handle large volumes of data. This article examines the internals of relational databases, using PostgreSQL as a key illustration. It covers the data storage mechanisms these databases employ, the processing of queries, the use of indexing to speed up data retrieval, and the management of concurrent access in multi-user environments. We also discuss the difficulties posed by concurrent operations, such as dirty reads and other transactional anomalies, and explain how PostgreSQL's transaction isolation levels and concurrency-control mechanisms address these problems.

Since their introduction in the 1970s by Edgar Codd, relational databases have become the norm for managing data efficiently, building on set theory and predicate logic to organize data into tables (relations), rows (tuples), and columns (attributes). A firm grasp of how relational databases handle storage, query processing, concurrency management, and data integrity is crucial for designing databases and improving their performance. This article focuses on PostgreSQL, an open-source relational database system recognized for its support of advanced SQL standards, its extensibility, and its compliance with the ACID properties (Atomicity, Consistency, Isolation, Durability).

Data Storage in PostgreSQL

In PostgreSQL, data is stored on disk in pages (also known as blocks), typically 8KB in size. These pages are the fundamental unit of storage. A table in PostgreSQL is represented by a set of these pages, organized into files. PostgreSQL employs a heap file organization, meaning that new rows are appended to the end of the table. Each row, in turn, is stored as a tuple, consisting of metadata (e.g., transaction IDs, tuple visibility information) and the actual column values.

Data is structured as follows:

  • Heap Files: Tables are stored in heap files, where each row can be placed in any available location, without any strict ordering. The heap file organization enables efficient inserts and moderate performance for reads.
  • TOAST (The Oversized Attribute Storage Technique): For handling large data types, PostgreSQL employs TOAST, which stores large data values (e.g., large text or binary data) outside the main table rows, preventing the page size limit from being exceeded.
  • Tablespaces: PostgreSQL supports tablespaces, which allow administrators to specify different storage locations (physical disks) for database objects, enabling optimization based on workload characteristics.
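As a concrete illustration, the following sketch shows how heap size, TOAST overhead, and tablespaces can be inspected. The `documents` table and the tablespace path are hypothetical assumptions for illustration only:

```sql
-- Hypothetical table; large "body" values are moved to a TOAST table automatically
CREATE TABLE documents (
    id   serial PRIMARY KEY,
    body text
);

-- Compare the size of the main heap with the total size including TOAST and indexes
SELECT pg_size_pretty(pg_relation_size('documents'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('documents')) AS total_size;

-- Tablespaces: place objects on a specific disk (the path is illustrative
-- and must exist on the server and be writable by the postgres user)
-- CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';
-- CREATE TABLE metrics (ts timestamptz, value double precision) TABLESPACE fast_ssd;
```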

Query Processing in PostgreSQL

When a query is executed, PostgreSQL follows a multi-step process:

  1. Parsing: The SQL query is first parsed into a query tree, where the syntax is validated.
  2. Rewriting: Query rewrite rules are applied, typically in cases of views or other transformation rules. The query tree is transformed accordingly.
  3. Planning/Optimization: PostgreSQL generates a series of possible execution plans based on the structure of the query, available indexes, and estimated costs. The cost-based optimizer assigns a cost to each plan, with lower-cost plans being more likely to be executed.
  4. Execution: The chosen plan is executed. The executor processes the query tree, retrieves the necessary data from the heap or indexes, and applies the specified operations (e.g., joins, filters).

PostgreSQL utilizes various join algorithms (nested loop, merge join, hash join) based on the cost estimations. The query planner aims to minimize disk I/O, which is typically the most expensive operation.
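These stages can be observed in practice with `EXPLAIN ANALYZE`, which prints the plan the optimizer chose along with actual execution statistics. The `orders` and `customers` tables in this sketch are hypothetical:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, c.name
FROM   orders o
JOIN   customers c ON c.id = o.customer_id
WHERE  o.created_at >= now() - interval '7 days';
-- The output shows the chosen join algorithm (e.g., Hash Join),
-- whether indexes were used, buffer hits vs. disk reads,
-- and estimated vs. actual row counts at each plan node.
```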

Indexing Mechanisms

Indexes in PostgreSQL are auxiliary structures that improve query performance by allowing the system to locate rows faster without scanning the entire table. PostgreSQL supports multiple index types, each optimized for different use cases:

  • B-Tree Indexes: The default index type, B-trees are balanced trees that allow efficient lookups, insertions, and deletions. B-tree indexes are ideal for equality and range queries.
  • Hash Indexes: Optimized for simple equality lookups; they support only the equality operator, which limits their applicability compared to B-tree indexes.
  • GIN (Generalized Inverted Indexes) and GiST (Generalized Search Trees): These are specialized index types used for complex data types like arrays, full-text search, and geometric data.

When an index is used, PostgreSQL traverses the index structure to quickly find the relevant data pages, significantly reducing the need to scan the entire table.
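The index type is selected with the `USING` clause at creation time. The tables and columns below are hypothetical sketches chosen to match each index type's strengths:

```sql
CREATE INDEX idx_orders_created ON orders (created_at);           -- B-tree (default): equality and range queries
CREATE INDEX idx_orders_tags    ON orders   USING gin  (tags);    -- GIN: array containment, full-text search
CREATE INDEX idx_shops_location ON shops    USING gist (location);-- GiST: geometric and nearest-neighbor queries
CREATE INDEX idx_sessions_token ON sessions USING hash (token);   -- hash: equality lookups only
```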

Parallel Query Execution and Access

PostgreSQL can execute queries in parallel to improve performance by using multiple CPU cores. The workload is split among background worker processes, each handling a different part of the dataset. PostgreSQL's parallel processing features include:

  • Parallel Sequential Scans: The table is divided into ranges of pages, with each worker scanning a different portion of the data.
  • Parallel Index Scans: Index scans can likewise be divided among workers, which scan different sections of the index simultaneously.
  • Parallel Joins and Aggregations: PostgreSQL can distribute join and aggregation work across workers for concurrent processing.

PostgreSQL's parallelism is built on its process-oriented design: each query is managed by a backend process, which spawns additional worker processes when parallel execution is enabled and coordinates them through inter-process communication (IPC). While this design allows operations to scale, it requires careful configuration to balance the benefits of parallelism against the associated overhead.
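Parallel execution is governed by planner settings such as `max_parallel_workers_per_gather`. The value below is illustrative, and `big_table` is a hypothetical table large enough to justify a parallel plan:

```sql
-- Allow up to 4 workers per Gather node for this session
SET max_parallel_workers_per_gather = 4;

EXPLAIN SELECT count(*) FROM big_table;
-- A parallel plan typically looks like:
--   Finalize Aggregate
--     -> Gather (Workers Planned: 4)
--          -> Partial Aggregate
--               -> Parallel Seq Scan on big_table
```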

Concurrency Control and Isolation

In multi-user databases like PostgreSQL, where many clients work on the system simultaneously, ensuring smooth transactional flow is essential. PostgreSQL uses Multi-Version Concurrency Control (MVCC), a technique that lets transactions read and write the database without interfering with one another while maintaining the required isolation between concurrent operations.
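MVCC is visible in the tuple metadata mentioned earlier: every row version records the transactions that created it and, if applicable, deleted it. A quick way to observe this, assuming a hypothetical `accounts` table:

```sql
-- xmin: ID of the transaction that inserted this row version
-- xmax: ID of the transaction that deleted/updated it (0 if the version is still live)
SELECT xmin, xmax, ctid, id, balance
FROM   accounts
LIMIT  5;
```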

Transaction Isolation Levels

PostgreSQL supports four standard transaction isolation levels as defined by SQL standards:

  1. Read Uncommitted: In the SQL standard, this level permits transactions to see uncommitted changes made by others (dirty reads). PostgreSQL accepts the syntax but internally treats it as Read Committed, so dirty reads never occur.
  2. Read Committed: The default isolation level in PostgreSQL. Each statement sees a snapshot of the data as committed just before that statement began; changes made by other transactions become visible only after they commit.
  3. Repeatable Read: A transaction sees the database as it was at the time the transaction began, even if other transactions commit changes in the meantime. This prevents non-repeatable reads; although the SQL standard permits phantom reads at this level, PostgreSQL's snapshot-based implementation prevents them as well.
  4. Serializable: The highest isolation level, which guarantees that the outcome of concurrent transactions is equivalent to some serial execution, eliminating both non-repeatable and phantom reads. PostgreSQL implements it using Serializable Snapshot Isolation (SSI).
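The isolation level is chosen per transaction. A minimal sketch, assuming a hypothetical `accounts` table:

```sql
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;
-- ... other transactions may commit changes to this row here ...
SELECT balance FROM accounts WHERE id = 1;  -- same result: both reads use one snapshot
COMMIT;
```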

Transaction Anomalies

  • Dirty Reads: Occur when a transaction reads data written by another uncommitted transaction. PostgreSQL prevents this through its use of MVCC.
  • Non-repeatable Reads: Happen when a transaction reads the same row twice and gets different results, due to updates from another transaction. Repeatable Read isolation prevents this anomaly.
  • Phantom Reads: Occur when a transaction re-executes a query and finds new rows that weren’t visible during the first execution. Serializable isolation in PostgreSQL prevents this issue.

Phantom Reads and Serializable Isolation in PostgreSQL

Phantom reads are an anomaly in which a query returns a different set of rows when re-executed, even though none of the rows it originally returned were changed. PostgreSQL addresses phantom reads at the Serializable level through Serializable Snapshot Isolation (SSI): it takes predicate locks (SIRead locks) on the data a transaction reads, detects dangerous read-write dependency patterns between concurrent transactions, and aborts one of them with a serialization failure rather than blocking.
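A two-session sketch of how Serializable isolation handles a would-be phantom, assuming a hypothetical `orders` table (the column shown is illustrative):

```sql
-- Session A
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM orders WHERE total > 100;

-- Session B (concurrently)
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO orders (total) VALUES (250);
COMMIT;

-- Back in Session A: re-running the count returns the original result,
-- because A still reads from its snapshot. If A then writes based on
-- what it read, SSI may abort the transaction with a serialization
-- failure (SQLSTATE 40001); the application should retry it.
COMMIT;
```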

References

  1. Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377-387.
  2. Stonebraker, M., & Hellerstein, J. M. (2005). What Goes Around Comes Around. Readings in Database Systems, 4th edition.
  3. PostgreSQL Global Development Group. (2024). PostgreSQL Documentation. Available at: https://www.postgresql.org/docs/.