PostgresSQL to Amazon RedShift

As in the today's world, volume of data is increasing significantly along with the need of real time analytics on the data. This is driving unprecedented demand for Big Data/No SQL type of databases.

More and more requirements are coming to migrate from traditional (relation) databases to more Morden database. The good news is there are plenty of options like Mongo, Cassandra , Dynamo DB, Redshift , Hadoop etc . But the challenge is, they all solve a special needs and selecting right database depends on the actual use cases , type of data etc.

Here I am covering a generic use case of migration from Postgres (or any other relational databases) to Redshift.

We will focus on 2 key area ..

1) Fundamental Architecture difference between Postgres and Amazon Redshift, especially the way Storage and Scaling is managed. Storage played a very important role when it comes to performance as most of the time its I/O that's impact performances.

2) Best Practices in using Redshift. We will focus on few key aspect of Redshift as using any big data system in right way is extremely important to get most out of it, else it may not give you the desired results that you are looking for.

Fundamental Architecture difference between Postgres (or any relational DB) and Amazon Redshift

Storage

Data is stored in a small chunks on disk. Traditional database uses row based storage and the chunk size will be few KBs. This means couple of rows will be stored in the same chunk.

The advantage of row based storage is any update, delete or read on any row will required access to just one storage point. Row based operations are efficient.

So if you application requires frequent insert, update , delete & read (read all or most of the columns in the row), than Posgres may be a good choice.

Redshift is a columnar storage database. This means, column data is stored together in a chunk typically of size 1 MB.

This is mainly useful for Analytics , say a table has millions of rows and you are interested in selecting only few columns, in that case Redshift can pull out data from few chunks which has those columns, this will reduce I/Os, memory and cpu as it will not have to process each and every storage chuck on disk.

Scaling

The biggest advantage of redshift is that, it can scale horizontally where as most of traditional database relied heavily on vertical scaling. Redshift allow you to scale horizontally, this means you can keep on adding nodes as your data grows. As you increase nodes you get additional storage as well as compute power. This means , you may be able to get the same level of performance of your query, even if you data grows from terabytes to petabytes.

Best Practices/Usage of Redshift

While it is important to select right database based on your use case, its probably more important to use the database in the right way, otherwise you may not be able to get the desired result.

Data Distribution

This define, how your data is distributed across multiple nodes on the redshift cluster.

- Even distribution , this is a default option available, it distributes data evenly across all nodes. The challenge with this is, its unaware of data relations, so in case of join query, lots of data movements required.

- key based distribution - you have an option to distribute data based on key of your choice, if you go by this, its important to select right key, key should have high level of distributions (ie it should not be like Boolean, Yes/no kind of fields, date may be an option) and it should be able to reduce data movements due to joins. Going by Key based, distribution is ideal, if one can manage to identify right key.

- All distributions - this shouldn't be used as much as possible, but this could be a good option for small table which are used as join for many tables. All distribution means, full copy will be replicated on all nodes.

Use of SELECT

Avoid use of select * - Select * is very costly operation for Redshift as it will have to go through each column storage blocks. So its always better to ask for the column you really needs.

Use of GroupBy & Aggregation

Always try to use sort key for groupby if possible.

Data Type

Use appropriate data type , fit for purpose. Say if text can never be more than 4 char, varchar (4) is better option than varchar(50), as it save lots of space in total. Also choose the right data type like number, date etc. As you know, redshift store data column wise, so an efficient compression can be applied on data based on its type.

Joins

say NO to cross join, its does nested loop join , you can see that in query explain.

Also if your join has multiple tables, try to add data filter in all possible table (like its okay to <tableA>.ccy="INR" and <tableB>.country= "India", though the end result will be same even if we take out say country filter, but adding country filter on table B, help filter out data where its located and reduce the need of data movement across nodes).

Defining Constraints

Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. This is a very important point to note, as its application responsibility to ensure consistency of the data.

Redshift assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique.

Amazon Redshift does enforce NOT NULL column constraints

Monitoring database Performance :

It is extremely important that we should be able to easily monitor performance of our database and get analytics based on our usage to improve query, design etc to optimise the performance.

Thanks fully Amazon does provide few readymade template and script to do that.

it worth looking at AWS Labs redshift utilities

https://github.com/awslabs/amazon-redshift-utils

https://github.com/awslabs/amazon-redshift-monitoring

Documentation :

Amazon has good documents on Redshift or details

https://docs.aws.amazon.com/redshift/latest/dg/welcome.html

Database Developer Guide : https://docs.aws.amazon.com/redshift/latest/dg/redshift-dg.pdf

Redshift Best Practices : https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html

AWS Labs redshift utilities : https://github.com/awslabs/amazon-redshift-utils

https://github.com/awslabs/amazon-redshift-monitoring

Thank you very much for your time, I would love to hear your experiance with Amazon Redshift. Thanks

Building Blocks of Blockchain

Basics of Blockchain BUILDING BLOCKS OF BLOCKCHAIN: Core concepts of Blockchain, Bitcoin and Ethereum are explained in layman’s term. Also a step by step guide is provided to build your own blockchain in Java. About the Book & Author “Simplicity is the glory of expression” - Walt Whitman. The aim of the book is to put the complexity of blockchain technology in the simplest possible form. I am a software engineer by profession. I am in the field of software development from more than 1.5 decades. I have architected and delivered couple of complex projects for the organizations I worked for. I love to play with the problem of distributed caching, distributed computing and Application Scalability. I am involved with Blockchain technology from last couple of years. The purpose of the book is to share the knowledge I have built over the time, in the simplest possible format. The book covers various aspect of Blockchain like Business, Technology and In...

blockchain

Search This Blog