
PostgreSQL to Amazon Redshift

In today's world, the volume of data is increasing significantly, along with the need for real-time analytics on that data. This is driving unprecedented demand for Big Data/NoSQL databases.

More and more requirements are coming in to migrate from traditional (relational) databases to more modern ones. The good news is that there are plenty of options: MongoDB, Cassandra, DynamoDB, Redshift, Hadoop and so on. The challenge is that each of them solves a special need, and selecting the right database depends on the actual use case, the type of data, and so on.

Here I am covering a generic use case of migrating from Postgres (or any other relational database) to Redshift.
We will focus on two key areas:

1) The fundamental architecture difference between Postgres and Amazon Redshift, especially the way storage and scaling are managed. Storage plays a very important role when it comes to performance, as most of the time it is I/O that impacts performance.

2) Best practices in using Redshift. We will focus on a few key aspects of Redshift, as using any big data system in the right way is extremely important to get the most out of it; otherwise it may not give you the desired results.

Fundamental Architecture difference between Postgres (or any relational DB) and Amazon Redshift

Storage

Data is stored in small chunks on disk. Traditional databases use row-based storage, where the chunk size is a few KB. This means that several rows are stored together in the same chunk.

The advantage of row-based storage is that any update, delete or read on a row requires access to just one storage location, so row-based operations are efficient.

So if your application requires frequent inserts, updates, deletes and reads (reading all or most of the columns in a row), then Postgres may be a good choice.




Redshift is a columnar storage database. This means column data is stored together in chunks, typically of size 1 MB.

This is mainly useful for analytics. Say a table has millions of rows and you are interested in selecting only a few columns; Redshift can pull the data from just the chunks that hold those columns. This reduces I/O, memory and CPU usage, as it does not have to process each and every storage chunk on disk.
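The row vs. column trade-off can be sketched with a toy model (this is an illustration of the layout idea only, not Redshift internals; the table, columns and block sizes are made up):

```python
# Toy model: a 5-column table of 1000 rows, stored two ways.
ROWS = 1000
COLS = ["id", "name", "country", "amount", "created_at"]

# Row store: each block holds a whole row, so every block contains all columns.
row_blocks = [{c: f"{c}_{i}" for c in COLS} for i in range(ROWS)]

# Column store: one block per column, holding all values of that column.
col_blocks = {c: [f"{c}_{i}" for i in range(ROWS)] for c in COLS}

def row_store_scan(wanted):
    # Must visit every row block, no matter how few columns we need.
    blocks_read = len(row_blocks)
    data = [[b[c] for c in wanted] for b in row_blocks]
    return blocks_read, data

def col_store_scan(wanted):
    # Only the blocks of the requested columns are touched.
    blocks_read = len(wanted)
    data = list(zip(*(col_blocks[c] for c in wanted)))
    return blocks_read, data

r_io, _ = row_store_scan(["id", "amount"])
c_io, _ = col_store_scan(["id", "amount"])
print(r_io, c_io)  # 1000 2 -- far fewer "blocks" touched in the column store
```

Selecting 2 of 5 columns touches every block in the row layout but only 2 blocks in the columnar layout; that is the I/O saving described above.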






Scaling

The biggest advantage of Redshift is that it can scale horizontally, whereas most traditional databases rely heavily on vertical scaling. Redshift allows you to keep adding nodes as your data grows, and each node brings additional storage as well as compute power. This means you may be able to keep the same level of query performance even as your data grows from terabytes to petabytes.
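An idealized back-of-the-envelope sketch of why this works (the per-node throughput number is hypothetical; real performance also depends on network, skew and query shape): when nodes grow in step with the data, the work per node stays constant.

```python
# Idealized model: nodes scan their slice of the data in parallel.
def scan_seconds(total_rows, nodes, rows_per_sec_per_node=1_000_000):
    # Total work divided evenly across nodes.
    return total_rows / (nodes * rows_per_sec_per_node)

small = scan_seconds(10**9, 4)    # 1 billion rows on 4 nodes
large = scan_seconds(10**10, 40)  # 10x the data on 10x the nodes
print(small, large)  # 250.0 250.0 -- same scan time at 10x the scale
```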

Best Practices/Usage of Redshift

While it is important to select the right database for your use case, it is probably even more important to use the database in the right way; otherwise you may not get the desired results.

Data Distribution

This defines how your data is distributed across the nodes of a Redshift cluster.
- Even distribution: the default option; it distributes data evenly across all nodes. The challenge is that it is unaware of data relationships, so join queries can require a lot of data movement.
- Key-based distribution: you can distribute data based on a key of your choice. If you go this route, it is important to select the right key: it should have a high number of distinct values (i.e. not a Boolean or yes/no field; a date may be an option), and it should reduce the data movement caused by joins. Key-based distribution is ideal if you can identify the right key.
- All distribution: this should be avoided where possible, but it can be a good option for a small table that many other tables join against. All distribution means a full copy of the table is replicated on every node.
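The three styles can be sketched in a few lines (a toy illustration with made-up rows and a 4-node cluster; this is not Redshift's actual hash function):

```python
NODES = 4
rows = [{"order_id": i, "country": c}
        for i, c in enumerate(["IN", "US", "IN", "UK", "US", "IN", "UK", "US"])]

def dist_even(rows):
    # EVEN: round-robin; ignores data relationships.
    slices = [[] for _ in range(NODES)]
    for i, r in enumerate(rows):
        slices[i % NODES].append(r)
    return slices

def dist_key(rows, key):
    # KEY: rows sharing a key value land on the same node, so joins and
    # aggregations on that key avoid cross-node data movement.
    slices = [[] for _ in range(NODES)]
    for r in rows:
        slices[hash(r[key]) % NODES].append(r)
    return slices

def dist_all(rows):
    # ALL: every node holds a full copy (good for small join tables).
    return [list(rows) for _ in range(NODES)]

by_key = dist_key(rows, "country")
# With KEY distribution, every "IN" row sits on exactly one node:
in_nodes = {i for i, s in enumerate(by_key) for r in s if r["country"] == "IN"}
print(len(in_nodes))  # 1
```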

Use of SELECT

Avoid SELECT *. It is a very costly operation for Redshift, as it has to go through the storage blocks of every column. It is always better to ask only for the columns you really need.

Use of GROUP BY & Aggregation

Always try to use the sort key in your GROUP BY if possible.
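The intuition can be sketched with Python's `itertools.groupby`, which, like a group-by over sort-key-ordered data, streams group by group in one pass instead of building a hash table over the whole table (toy data, made up for illustration):

```python
from itertools import groupby

# Rows kept in sort-key order (as a Redshift sort key would keep them on disk).
data = [("IN", 1), ("US", 2), ("IN", 3), ("US", 4)]
data.sort(key=lambda t: t[0])

# Streaming aggregation: one pass, no table-wide hash table needed.
totals = {k: sum(v for _, v in grp)
          for k, grp in groupby(data, key=lambda t: t[0])}
print(totals)  # {'IN': 4, 'US': 6}
```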

Data Type 

Use an appropriate data type, fit for purpose. For example, if a text field can never be more than 4 characters, varchar(4) is a better option than varchar(50), as it saves a lot of space and memory in total. Also choose the right data type (number, date, etc.). As you know, Redshift stores data column-wise, so efficient compression can be applied to the data based on its type.



Joins

Say NO to cross joins: a cross join results in a nested-loop join, which you can see in the query's EXPLAIN plan.

Also, if your join involves multiple tables, try to add data filters on every table possible. For example, use both <tableA>.ccy = 'INR' and <tableB>.country = 'India': the end result would be the same even without the country filter, but adding it on tableB filters the data out on the nodes where it is located and reduces the data movement across nodes.
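A toy sketch of the effect (the tables and column values are made up; "data movement" here just means how many tableB rows participate in the join):

```python
table_a = [{"id": i, "ccy": "INR" if i % 2 else "USD"} for i in range(1000)]
table_b = [{"id": i, "country": "India" if i % 2 else "US"} for i in range(1000)]

def join(a_rows, b_rows):
    # Simple hash join on id.
    b_index = {r["id"]: r for r in b_rows}
    return [dict(a, **b_index[a["id"]]) for a in a_rows if a["id"] in b_index]

# Filter only tableA: all 1000 tableB rows still participate in the join.
res1 = join([a for a in table_a if a["ccy"] == "INR"], table_b)

# Filter both tables: tableB shrinks to 500 rows *before* the join.
b_filtered = [b for b in table_b if b["country"] == "India"]
res2 = join([a for a in table_a if a["ccy"] == "INR"], b_filtered)

print(len(res1) == len(res2), len(table_b), len(b_filtered))  # True 1000 500
```

Same result, half as many tableB rows feeding the join; on a cluster, that is half the data shuffled between nodes.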

Defining Constraints

Uniqueness, primary key and foreign key constraints are informational only; they are not enforced by Amazon Redshift. This is a very important point to note, as it is the application's responsibility to ensure the consistency of the data.
Redshift assumes that all keys in its tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique.

Amazon Redshift does, however, enforce NOT NULL column constraints.
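A toy sketch of why an unenforced key is dangerous (a simplified stand-in for the planner, not Redshift's actual code): if the planner trusts a declared primary key, it can skip the deduplication pass, and duplicates slip through DISTINCT.

```python
# Two rows with the same "unique" key were loaded; nothing rejected them.
loaded = [{"id": 1, "v": "a"}, {"id": 1, "v": "a"}]

def select_distinct(rows, pk_declared):
    if pk_declared:
        # Planner assumes rows are already unique -> no dedup pass.
        return rows
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

print(len(select_distinct(loaded, pk_declared=True)))   # 2 -- duplicates!
print(len(select_distinct(loaded, pk_declared=False)))  # 1
```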

Monitoring Database Performance:

It is extremely important to be able to easily monitor the performance of your database and get analytics on your usage, so that you can improve queries, design, etc. to optimise performance.
Thankfully, Amazon provides a few ready-made templates and scripts for this; the AWS Labs Redshift utilities are worth looking at.

Documentation :

Amazon has good documentation on Redshift for further details. The AWS Labs monitoring utilities mentioned above are at:

https://github.com/awslabs/amazon-redshift-monitoring


Thank you very much for your time. I would love to hear about your experience with Amazon Redshift.


















