Amazon Redshift and Yahoo Cloud Serving Benchmark

Amazon Redshift is an easy-to-use yet powerful data warehouse that is part of the Amazon AWS family. Redshift is a relational database based on PostgreSQL, designed to scale up to petabytes of data.

A data warehouse should be optimized for batch loading and SELECT queries, while INSERT and UPDATE queries are less relevant. As described in the post "Amazon RedShift, an independent review", INSERT queries perform poorly, while the suggested COPY command is definitely faster.
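
For reference, a bulk load with COPY looks roughly like the sketch below; the S3 path and IAM role ARN are placeholders, not values from this test.

-- bulk load from S3 (placeholder bucket and role); much faster than row-by-row INSERTs
COPY usertable
FROM 's3://my-bucket/ycsb-data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
DELIMITER '\t'
GZIP;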

The question is: without loading terabytes or even petabytes of data, how well do simple SELECT queries perform against a small Amazon Redshift cluster?

Environment

The cluster was pretty basic, with 2 worker nodes; each node was a dc1.large machine with these characteristics:

  • 7 EC2 Compute Units (2 virtual cores)
  • 15 GB memory
  • 160 GB SSD storage

The client used for the tests was an m4.2xlarge machine with this configuration:

  • 26 EC2 Compute Units (8 virtual cores)
  • 32 GB memory
  • 8 GB SSD storage

Yahoo Cloud Serving Benchmark (YCSB)

Yahoo Cloud Serving Benchmark (YCSB) is a tool developed by Yahoo to stress test database services. It is mainly used to test key/value stores by inserting and retrieving textual data.

Even though a data warehouse is not designed to be just a key/value store, and YCSB is therefore not the best test for one, I wanted to study what can be achieved in this scenario.

An initial configuration

I loaded 1 million YCSB-generated records into a usertable table (see the post "YCSB with JDBC tutorial") and ran a YCSB test with 1000 SELECT queries and 10 threads, without updates or inserts.
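
Under the hood, the YCSB JDBC binding performs single-row lookups by key; each read is roughly equivalent to the following statement (the key value here is just illustrative):

-- shape of a single YCSB read operation; the key value is illustrative
SELECT * FROM usertable WHERE YCSB_KEY = 'user6284781860667377211';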

This was the initial table:

CREATE TABLE usertable(YCSB_KEY VARCHAR (255) PRIMARY KEY,
  FIELD1 TEXT, FIELD2 TEXT,
  FIELD3 TEXT, FIELD4 TEXT,
  FIELD5 TEXT, FIELD6 TEXT,
  FIELD7 TEXT, FIELD8 TEXT,
  FIELD9 TEXT, FIELD10 TEXT);

And here is the result:

1000 SELECT queries, 10 threads, 1M records (1 GB data):
Run time (seconds) 10.39
Throughput (ops/sec) 9.61
AverageLatency (ms) 1033.29
MinLatency (ms) 533.31
MaxLatency (ms) 5333.14

The average latency was about 1 second; honestly, with just 1 million records that is not a great result, but we can definitely do better.

Some optimization

For Redshift the PRIMARY KEY on YCSB_KEY is just a hint (it is informational and not enforced), and on its own it does not take advantage of some key features, so I added a DISTKEY and a SORTKEY on YCSB_KEY as well.

The new table is the following:

CREATE TABLE usertable(YCSB_KEY VARCHAR (255) PRIMARY KEY DISTKEY SORTKEY,
  FIELD1 TEXT, FIELD2 TEXT,
  FIELD3 TEXT, FIELD4 TEXT,
  FIELD5 TEXT, FIELD6 TEXT,
  FIELD7 TEXT, FIELD8 TEXT,
  FIELD9 TEXT, FIELD10 TEXT);
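
With the DISTKEY, rows are spread across the nodes by a hash of YCSB_KEY; since the keys are nearly unique the distribution should be even, which can be checked through the SVV_TABLE_INFO system view:

-- check row-distribution skew; values near 1.0 mean an even spread
SELECT "table", diststyle, skew_rows
FROM svv_table_info
WHERE "table" = 'usertable';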

And here are the results:

1000 SELECT queries, 10 threads, 1M records (1 GB data) with DISTKEY and SORTKEY:
Run time (seconds) 6.97
Throughput (ops/sec) 143.41
AverageLatency (ms) 65.11
MinLatency (ms) 17.10
MaxLatency (ms) 327.98
95thPercentileLatency (ms) 97
99thPercentileLatency (ms) 131

This time the results were definitely better, with an average latency of 65 milliseconds.

Table compression

Amazon Redshift provides column-specific compression encodings, which work well on repeated data. Unfortunately, in this case the textual data is random, so compression might not be effective.
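
Rather than guessing an encoding, Redshift can recommend one per column based on a sample of the loaded data:

-- ask Redshift to suggest a compression encoding for each column
ANALYZE COMPRESSION usertable;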

For example this is the compressed table:

CREATE TABLE usertable(YCSB_KEY VARCHAR (255) PRIMARY KEY DISTKEY SORTKEY,
  FIELD1 TEXT ENCODE TEXT32K, FIELD2 TEXT ENCODE TEXT32K,
  FIELD3 TEXT ENCODE TEXT32K, FIELD4 TEXT ENCODE TEXT32K,
  FIELD5 TEXT ENCODE TEXT32K, FIELD6 TEXT ENCODE TEXT32K,
  FIELD7 TEXT ENCODE TEXT32K, FIELD8 TEXT ENCODE TEXT32K,
  FIELD9 TEXT ENCODE TEXT32K, FIELD10 TEXT ENCODE TEXT32K);

And here are the results:

1000 SELECT queries, 10 threads, 1M records (1 GB data) with DISTKEY and SORTKEY, compressed data:
Run time (seconds) 27.61
Throughput (ops/sec) 36.20
AverageLatency (ms) 269.57
MinLatency (ms) 47.86
MaxLatency (ms) 2694.48
95thPercentileLatency (ms) 367
99thPercentileLatency (ms) 471

Basically, compressing random textual values not only causes a performance hit but also brings zero space gain.
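
This can be verified directly: Redshift stores each column in 1 MB blocks, and the SVV_DISKUSAGE system view reports the blocks per column, so running the query below against the encoded and the raw table shows whether any space was actually saved:

-- 1 MB blocks used by each column of usertable
SELECT col, COUNT(*) AS mb_used
FROM svv_diskusage
WHERE name = 'usertable'
GROUP BY col
ORDER BY col;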

More records

1M records is not that much for a small cluster, so I added more YCSB-generated records and ran a stress test after each iteration, always performing VACUUM and ANALYZE (shown below) and waiting for the cluster to complete the rebalancing process.
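
The maintenance step after each bulk load was, in essence:

-- re-sort rows and reclaim deleted space, then refresh planner statistics
VACUUM usertable;
ANALYZE usertable;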

The following table structure was used in all the tests below:

CREATE TABLE usertable(YCSB_KEY VARCHAR (255) PRIMARY KEY DISTKEY SORTKEY,
  FIELD1 TEXT, FIELD2 TEXT,
  FIELD3 TEXT, FIELD4 TEXT,
  FIELD5 TEXT, FIELD6 TEXT,
  FIELD7 TEXT, FIELD8 TEXT,
  FIELD9 TEXT, FIELD10 TEXT);

1000 SELECT queries, 10 threads:

                      10M records  20M records  50M records  100M records  150M records
Run time (seconds)          38.93        51.08        59.71        281.73        366.02
Throughput (ops/sec)        25.68        19.57        16.74          3.54          2.73
AverageLatency (ms)        374.76       493.45       582.04       2779.25       3605.62
MinLatency (ms)             18.30        17.58        20.31        154.82        347.85
MaxLatency (ms)           1009.89      1527.15      1415.24       7116.18       8015.36

Cluster resizing

The performance drop was quite significant between 50M and 100M records. One of the interesting features of Amazon Redshift is effortless cluster resizing, which can be achieved with a few mouse clicks. After resizing the cluster from 2 dc1.large nodes to 3 dc1.large nodes, an operation which took about 20 minutes, the performance was the following:

1000 SELECT queries, 10 threads, 3 nodes:

                      100M records  150M records
Run time (seconds)          193.34        287.57
Throughput (ops/sec)          5.17          3.47
AverageLatency (ms)        1911.96       2848.15
MinLatency (ms)             111.11        643.23
MaxLatency (ms)            4486.48       5792.25
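
As a side note, one way to confirm that the resize completed, besides watching the console, is to count the compute nodes visible through the STV_SLICES system table:

-- each compute node serves one or more slices; expect 3 nodes after the resize
SELECT COUNT(DISTINCT node) AS nodes FROM stv_slices;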

Smaller records

Amazon Redshift is an analytical columnar database. This means that each column is stored separately, and performance can greatly improve if only a subset of the columns is retrieved. For example, if the previous table with 10 columns and 1 primary key is reduced to a table with 1 column and 1 primary key, we can improve performance by about 10 times, as the reduced table below shows.
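
As a quick illustration of the difference (the key value is again illustrative), compare a single-column fetch against a full-row fetch on the wide table:

-- touches only the FIELD1 column blocks (plus the sort key)
SELECT FIELD1 FROM usertable WHERE YCSB_KEY = 'user1000';
-- touches the blocks of all eleven columns
SELECT * FROM usertable WHERE YCSB_KEY = 'user1000';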

CREATE TABLE usertable(YCSB_KEY VARCHAR (255) PRIMARY KEY DISTKEY SORTKEY, FIELD1 TEXT);

1000 SELECT queries, 10 threads, 150M records (150 GB data), 3 nodes, reduced columns:
Run time (seconds) 21.35
Throughput (ops/sec) 46.81
AverageLatency (ms) 205.82
MinLatency (ms) 30.10
MaxLatency (ms) 3787.41

YCSB generates records with a primary key and 10 columns filled with random textual data; each record occupies 1 KB. This is not exactly a common scenario for a data warehouse, where columns with date, time, numeric, and other simple values are much more frequent.

If we create the following table:

CREATE TABLE USERTABLE(ID INTEGER PRIMARY KEY DISTKEY SORTKEY, VALUE DECIMAL(8,2));

and load 150M records with random decimal values into a cluster with 2 nodes, we can execute the following query:

SELECT SUM(VALUE) FROM USERTABLE;

The time to sum 150M decimal values is about 1.086 seconds, which is a very good result.

Conclusion

Yahoo Cloud Serving Benchmark (YCSB) should be considered a heavyweight test for a data warehouse: each record is 1 KB of mostly random textual data, and the keys can be duplicated. In this scenario, even 20 operations/second is pretty good in most cases.

Amazon Redshift is a columnar database, which means that each column is stored separately: the fewer the retrieved columns, the better the performance.

This type of data is clearly not common in a data warehouse, and Amazon Redshift performs very well on smaller columns or on SELECT queries over few columns.

For example, the time elapsed to sum 150M decimal values on a cluster with 2 nodes was about 1 second, and on a cluster with 3 nodes the average time to retrieve a value from a table of 150M records was about 205 milliseconds.