Vojtěch Biberle · 9 min read

Improving MySQL Performance with Table Partitioning

This is the story of how we tackled challenges with MySQL performance, why traditional solutions didn't suffice, and how we ultimately leveraged table partitioning to efficiently manage our large datasets without causing downtime.

Introduction

At Keboola, we've been managing sizable MySQL tables since the days of MySQL 5. Over the years, we've watched some of these tables grow rapidly, especially the table storing our job records, which was exceptionally data-hungry. This massive table caused us significant issues, particularly with query performance and extensive disk space consumption.


The Issue

Background

Our platform relies on a MySQL table to track jobs running within the system. With thousands of new jobs added every hour, the table's size began to balloon rapidly. Over time, it grew to approximately 300 GB, and multiple tables of similar size exist across our stacks. This sheer volume of data led to significant operational concerns:

  • Slow Queries: Certain relatively basic SELECT queries took several seconds to execute, impacting user experience and system efficiency. In addition, almost any WHERE condition outside of a few optimized ones could easily lead to query execution times of several minutes.
  • Storage Constraints: The table consumed substantial disk space, highlighting the need for more effective storage management.
  • Maintenance Challenges: Operations like deleting old records became impractically slow due to the table's size.

We also realized that we didn't need historical data older than three months. Much of the data we had accumulated over the years had become useless to us. Consequently, efficient data purging became extremely important to free up space and reduce maintenance overhead.

We needed a strategy to shrink the table size, delete old job events, and optimize performance without causing service interruptions.

How It Started

The table in question was originally introduced back in 2018, when we were using MySQL 5. At that time, features like ALGORITHM=INPLACE were unavailable, limiting our ability to modify tables without significant downtime. Adding new columns to an existing table was challenging due to the lack of online Data Definition Language (DDL) support in MySQL 5.

To work around this limitation, we created a new table containing the additional columns we needed and established a relationship between the two tables using foreign keys. Although this allowed us to extend the data model without altering the original table, it introduced new problems. The new table also began to grow rapidly as it accumulated data that was linked to the primary table, further increasing storage requirements and complicating data management.
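
A hypothetical sketch of that workaround (the side table and its columns are illustrative, not our actual schema):

CREATE TABLE jobs_extension (
  jobId INT UNSIGNED NOT NULL PRIMARY KEY,
  errorMessage VARCHAR(255) NULL,
  -- linked back to the original jobs table via foreign key:
  CONSTRAINT fk_jobs_extension_job FOREIGN KEY (jobId) REFERENCES jobs (id)
);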

This setup compounded our storage issues and made maintenance more cumbersome. Managing two large tables with foreign key constraints created a bottleneck, especially as data volume continued to surge. These challenges underscored the need for a more efficient solution, setting the stage for our exploration into table partitioning.

Challenges with Traditional Approaches

Simple DELETE Operations

Deleting old records seems like a straightforward solution to reduce table size. However, we encountered significant challenges with this method:

  • Slow Deletion Process: Performing a simple DELETE FROM table was extremely slow, taking 5–6 hours on these large tables (see the sketch after this list).
  • Inefficient Space Reclamation: Even after deleting records, the disk space wasn’t immediately freed up due to the way that MySQL's InnoDB storage engine handles deleted data.
  • Internal Fragmentation: The space occupied by deleted rows remained allocated, causing internal fragmentation. Over time, this degraded performance, even when current SELECT queries seemed acceptable.
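
This is the kind of statement referenced above; on our ~300 GB tables, a delete like this ran for hours (the table and column names are illustrative):

DELETE FROM jobs
WHERE createdAt < '2024-01-01 00:00:00';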

Using OPTIMIZE TABLE

Running OPTIMIZE TABLE reclaimed unused space and defragmented the table, but it had several major drawbacks:

  • Table Locking: The operation locked the entire table, preventing reads and writes, which caused unacceptable downtime.
  • Maintenance Windows: Performing OPTIMIZE TABLE without impacting users required scheduling maintenance windows, which we strive to keep to a minimum.
  • Limited Scalability: Given the table's size (~300 GB), OPTIMIZE TABLE was a time-consuming process that further complicated maintenance.
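
For reference, the operation is a single statement (table name is illustrative). For InnoDB tables, OPTIMIZE TABLE is mapped to ALTER TABLE ... FORCE, i.e., a full table rebuild, which is what made it so costly at our scale:

OPTIMIZE TABLE jobs;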

Is There a Solution?

Yes, and it comes in the form of table partitioning.

Introducing Table Partitioning

Table partitioning allows you to divide a large table into smaller, more manageable segments called partitions. You can manage each partition independently, providing several benefits:

  • Efficient Data Management: You can quickly add or remove partitions, managing large datasets without affecting the entire table.
  • Improved Performance: The optimizer prunes partitions, scanning only those relevant to a query, which reduces query time.
  • No Full Table Locks: Partition operations don't require locking the entire table, minimizing downtime.

Our Solution: Range Partitioning by Date

How It Works

We implemented range partitioning based on a date column (createdAt). Here's the approach:

1. Monthly Partitions

The table is partitioned into monthly segments, with each partition containing all records for a specific month.

Partitions are created using SQL similar to the following:

ALTER TABLE table_name PARTITION BY RANGE (UNIX_TIMESTAMP(createdAt)) (
  PARTITION until202401 VALUES LESS THAN (UNIX_TIMESTAMP('2024-01-01 00:00:00')),
  PARTITION until202402 VALUES LESS THAN (UNIX_TIMESTAMP('2024-02-01 00:00:00')),
  PARTITION until202403 VALUES LESS THAN (UNIX_TIMESTAMP('2024-03-01 00:00:00')),
  PARTITION until202404 VALUES LESS THAN (UNIX_TIMESTAMP('2024-04-01 00:00:00')),
  PARTITION until202405 VALUES LESS THAN (UNIX_TIMESTAMP('2024-05-01 00:00:00')),
  PARTITION pmax VALUES LESS THAN (MAXVALUE)
);
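
Once the table is partitioned, you can verify that partition pruning kicks in with EXPLAIN; a quick sanity check against the scheme above:

EXPLAIN SELECT COUNT(*)
FROM table_name
WHERE createdAt >= '2024-03-01 00:00:00'
  AND createdAt < '2024-04-01 00:00:00';
-- The "partitions" column of the plan should list only the partition(s)
-- covering March (until202404 in the scheme above), not every partition.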

2. Automated Partition Management

Right before the end of each month, a scheduled cron job adds a new partition for the upcoming month and drops the oldest one. At this point, the pmax partition is still empty, making the reorganization fast.

First, we add the new partition by reorganizing pmax, using SQL similar to the following:

ALTER TABLE table_name REORGANIZE PARTITION pmax INTO (
  PARTITION until202406 VALUES LESS THAN (UNIX_TIMESTAMP('2024-06-01 00:00:00')),
  PARTITION pmax VALUES LESS THAN (MAXVALUE)
);

Then, we drop the oldest partition using SQL similar to the following:

ALTER TABLE table_name DROP PARTITION until202401;

Benefits of This Approach

  • Efficient Deletion: Dropping a partition is nearly instantaneous, regardless of its size, effectively bypassing slow DELETE operations.

  • Space Reclamation: Dropping a partition deletes the associated data file, immediately freeing up disk space.
  • No Fragmentation: Since entire partitions are removed, no internal fragmentation is left behind.
  • Minimal Impact: Dropping partitions doesn't lock the entire table, so the impact on ongoing operations is negligible.

Technical Details

When dropping a partition, we use the following command:

ALTER TABLE your_table DROP PARTITION partition_name;

MySQL supports dropping partitions using native in-place APIs with ALGORITHM=INPLACE. This operation deletes the data in the partition and drops it without rebuilding the entire table. It’s essential to select the right ALGORITHM setting:

  • ALGORITHM=INPLACE: Deletes data and drops the partition without rebuilding the table.
  • ALGORITHM=COPY or old_alter_table=ON: Rebuilds the table and attempts to move data to other partitions, which is unnecessary and time-consuming in our use case.
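
Before dropping anything, it's worth checking which partitions exist and roughly how many rows they hold; information_schema makes this easy (table name is illustrative):

SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'your_table'
ORDER BY PARTITION_ORDINAL_POSITION;
-- TABLE_ROWS is only an estimate for InnoDB, but it's good enough to
-- sanity-check which partition you're about to drop.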

Other Partitioning Types

Depending on your requirements, other partitioning methods may be more suitable:

  • List Partitioning: Partitions are defined by a list of values.
  • Hash Partitioning: Partitions are assigned based on a hash function.
  • Key Partitioning: This is similar to hash partitioning but is based on one or more columns.
  • Composite Partitioning: This combines multiple partitioning methods (sub-partitioning).

You’ll want to choose the partitioning strategy that best aligns with your data access patterns and maintenance needs.
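
For illustration, here's what hash partitioning looks like (a sketch with a hypothetical table): rows are distributed across a fixed number of partitions based on a hash of the key.

CREATE TABLE api_requests (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  payload VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id)
PARTITIONS 8;

Note that DROP PARTITION works only with range and list partitioning; hash- and key-partitioned tables are shrunk with COALESCE PARTITION instead, which is one reason range partitioning was the natural fit for our time-based purging.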

Disadvantages and Considerations

While partitioning offers significant benefits, it's essential to be aware of its limitations and challenges.

Lack of Automation

MySQL doesn't provide built-in automation for partition management, so you'll need to:

  • Create Custom Scripts: Write scripts to add or drop partitions as needed.
  • Use External Tools: Download and adapt scripts or procedures from external sources.
  • Schedule Regular Maintenance: Set up cron jobs or other automated tasks to manage partitions (see the sketch after this list).
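
If you'd rather keep scheduling inside the database, MySQL's event scheduler can stand in for cron. A minimal sketch, assuming a hypothetical manage_partitions() stored procedure that wraps the REORGANIZE and DROP statements shown earlier:

-- Requires the event scheduler to be enabled (event_scheduler=ON).
CREATE EVENT monthly_partition_maintenance
ON SCHEDULE EVERY 1 MONTH
STARTS '2024-06-30 22:00:00'
DO CALL manage_partitions(); -- hypothetical procedure, not built-in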

Indexes and Partitioning Keys

MySQL imposes specific requirements on indexes in partitioned tables:

  • Unique Keys: All columns used in the partitioning expression must be part of every unique key, including the primary key.

In our case, we initially considered using UUIDv7 as the primary key, since it includes time information. However, we needed to maintain backward compatibility, which required us to keep the AUTO_INCREMENT integer id as the primary key and createdAt as the partitioning key. We then realized that even with UUIDv7, we would still need to retain the id as the primary key and create a unique key on id and createdAt. Given these constraints, we decided not to add a UUIDv7 key to the table.
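
In practice, the unique-key rule leads to composite keys like this (a simplified sketch of the resulting pattern; the column set is illustrative):

CREATE TABLE jobs_partitioned (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  createdAt TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  status VARCHAR(32) NOT NULL,
  -- createdAt must be part of the primary key because every unique key
  -- has to include all columns used in the partitioning expression:
  PRIMARY KEY (id, createdAt)
)
PARTITION BY RANGE (UNIX_TIMESTAMP(createdAt)) (
  PARTITION until202405 VALUES LESS THAN (UNIX_TIMESTAMP('2024-05-01 00:00:00')),
  PARTITION pmax VALUES LESS THAN (MAXVALUE)
);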

Other Limitations

  • No Foreign Key References: Partitioned tables cannot have foreign key relationships.
  • Full-Text Indexes: Full-text indexes or searches are not supported in partitioned tables.
  • Spatial Columns: Data types like POINT or GEOMETRY are not allowed in partitioned tables.
  • Partitioning Key Data Types: The partitioning key must be an integer column or an expression that resolves to an integer. Columns with ENUM types or other non-integer expressions cannot be used directly.
  • Partitioning Existing Tables with Data: Applying PARTITION BY to an existing table forces a full table rebuild, so in practice partitioning should be defined when creating a new table or applied while the table is still empty.

Can We Do Better?

Yes. We could optimize our approach further by using UUIDv7 as the primary key, and we would have liked to; in our case, however, it wasn't possible (see below).

Using UUIDv7

What Is UUIDv7?

UUIDv7 is a new UUID version that includes a timestamp component, allowing for time-based ordering while maintaining the uniqueness of UUIDs.

Benefits in Partitioning

  • Single Index: By using UUIDv7 as the primary key, we can avoid composite keys that include the partitioning column.

Implementation Considerations

  • Schema Changes: Updating the primary key from a simple UNSIGNED INT to a UUID.
  • Application Updates: Modifications to how the application generates and handles IDs.
  • Compatibility: Ensuring that all parts of the system (e.g., ORM tools like Doctrine) support UUIDv7.

This topic is thoroughly covered in an article by Viktor Dorotovič: Partitioning InnoDB Tables by Time-Based Pseudo-Sequential UUIDs, which includes strategies for setting up partitions and comparing date-time values in UUIDv7.

Alternative Approaches and "Cheats"

Stored Procedures in the Database

  • Pros: Encapsulate partition management logic within the database.
  • Cons:
    • Version Control: Stored procedures are more difficult to version and manage in systems like Git.
    • Complexity: Stored procedures may add complexity to database maintenance tasks.

Challenges with ORM Doctrine

When working with Doctrine, implementing partitioning and composite primary keys can present unique challenges.

Composite Primary Keys

  • Doctrine's Requirements: Doctrine requires all parts of a composite primary key to be set before performing persistence operations like persist and flush.
    • Issue with Auto-Generated Columns: If your primary key includes an auto-generated column (e.g., an auto-incrementing ID or an auto-generated date component), it becomes challenging to meet Doctrine's requirement since the value isn't known until after insertion.

Workarounds

To overcome these limitations, consider the following approaches:

  • Use UUIDv7: Using UUIDv7 as the primary key, which includes a timestamp, eliminates the need for composite primary keys. This approach is compatible with Doctrine's requirements and avoids special handling during inserts. However, in our case, using UUIDv7 is not feasible.
  • Use Native SQL Queries for Inserts: If UUIDv7 is not an option, another approach is to bypass the ORM for insert operations by using native SQL queries. This allows the database to handle auto-generated values without conflicting with the ORM's expectations.

Loading Entities After Insertion

  • Loading into Doctrine: After inserting data via native SQL, you can load the entities back into Doctrine for further manipulation or tracking.

Simplified PHP code:

<?php

use Doctrine\ORM\EntityManagerInterface;

class StorageJobRepository {

  private EntityManagerInterface $entityManager;

  public function __construct(EntityManagerInterface $entityManager) {
    $this->entityManager = $entityManager;
  }

  public function createNewStorageJob(array $params): StorageJob {
    // Build a column => value array for the new row.
    $entityAsArray = StorageJob::createInsertData($params);
    $tableName = $this->entityManager->getConnection()->quoteIdentifier(
      $this->entityManager->getClassMetadata(StorageJob::class)->getTableName(),
    );
    // Quoted column names for the INSERT statement.
    $columnNames = implode(
        ',',
        array_map(
            function (string $arrayKey): string {
                return $this->entityManager->getConnection()->quoteIdentifier($arrayKey);
            },
            array_keys($entityAsArray),
        ),
    );
    // Named parameter placeholders (:column) matching the value array.
    $parameterWildcards = implode(
        ',',
        array_map(
            function (string $arrayKey): string {
                return sprintf(':%s', $arrayKey);
            },
            array_keys($entityAsArray),
        ),
    );

    // Insert via native SQL so the database handles auto-generated values.
    $insertSQL = "INSERT INTO {$tableName} ({$columnNames}) VALUES ({$parameterWildcards})";

    $this->entityManager->getConnection()->executeStatement($insertSQL, $entityAsArray);
    $lastInsertedId = $this->entityManager->getConnection()->lastInsertId();

    // Load the freshly inserted row back into Doctrine as a managed entity.
    /** @var StorageJob $storageJob */
    $storageJob = $this->entityManager->createQueryBuilder()
        ->select('sj')
        ->from(StorageJob::class, 'sj')
        ->andWhere('sj.id = :id')
        ->setParameter('id', $lastInsertedId)
        ->getQuery()
        ->getSingleResult();

    return $storageJob;
  }
}

Revisit Data Retention Policies

Before implementing partitioning, it's worth reconsidering your data retention needs:

  • Historical Data Importance: Assess whether older data might become valuable in the future.
  • Compliance Requirements: Be mindful of any legal obligations related to data storage and deletion.
  • Archiving Solutions: Consider archiving old data instead of deleting it, allowing for future access without impacting current performance.

How We Migrated Our Data

Transitioning from a massive, unpartitioned table to a partitioned, manageable table required careful planning and execution. Here's how we approached the migration:

Step 1: Deciding to Use Partitioning

Recognizing the limitations of our current setup, we decided partitioning was the optimal solution to handle our data management challenges.

Step 2: Creating a New Table

We created a new table without foreign keys and set up partitioning based on our requirements, accommodating the new data structure and partitioning scheme.

Step 3: Automating Partition Management

Our partition-management scripts are now operational, automatically creating new partitions and dropping old ones via cron, ensuring seamless data management.

Step 4: Updating the Codebase

Our application code was updated to insert data into both the old and new tables simultaneously. During this phase:

  • The old table remained the primary source of data and generated IDs.
  • The new table stored the data, synchronized using the same IDs as in the old table.

Step 5: Checking Data Consistency

We performed thorough checks to ensure that the data in the new table was consistent with the old one, a crucial step for validating migration integrity.

Step 6: Switching ID Generation to the New Table

After confirming data consistency, we shifted ID generation to the new table:

  • To prevent ID conflicts between the old and new tables (particularly since both were operational simultaneously), we introduced a gap in the auto-increment sequence of the new table (see the sketch after this list).
  • We used native SQL queries for data insertion due to the custom auto-increment setup.
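
Creating the gap itself is a one-line operation (a sketch; the table name and value are illustrative, the value just needs to sit safely above the old table's highest ID):

ALTER TABLE jobs_new AUTO_INCREMENT = 2000000000;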

Caution with ORM Behavior: Some ORMs, like Doctrine, may silently overwrite key values if configured to auto-generate them—something we learned firsthand.

Step 7: Rewriting the System to Use the New Table Exclusively

We are currently updating our system to rely solely on the new partitioned table, modifying all parts of the application to interact only with it.

Step 8: Decommissioning the Old Table

Once the system fully utilizes the new table, we plan to drop the old table entirely, completing the migration process.

Conclusion

Table partitioning offers a powerful solution for managing large MySQL tables efficiently. By partitioning data based on date ranges and automating the addition and removal of partitions, you can:

  • Improve query performance by targeting specific partitions.
  • Reclaim disk space by dropping old partitions.
  • Minimize downtime by avoiding full table locks during maintenance.

However, it's essential to be aware of partitioning limitations and plan accordingly. Partitioning requires careful schema design, ongoing maintenance, and potential adjustments to application code. By understanding these challenges and following best practices, you can significantly enhance your database's performance and scalability.


