As in the today's world, volume of data is increasing significantly along with the need of real time analytics on the data. This is driving unprecedented demand for Big Data/No SQL type of databases.
More and more requirements are coming to migrate from traditional (relation) databases to more Morden database. The good news is there are plenty of options like Mongo, Cassandra , Dynamo DB, Redshift , Hadoop etc . But the challenge is, they all solve a special needs and selecting right database depends on the actual use cases , type of data etc.
Here I am covering a generic use case of migration from Postgres (or any other relational databases) to Redshift.
We will focus on 2 key area ..
1) Fundamental Architecture difference between Postgres and Amazon Redshift, especially the way Storage and Scaling is managed. Storage played a very important role when it comes to performance as most of the time its I/O that's impact performances.
2) Best Practices in using Redshift. We will focus on few key aspect of Redshift as using any big data system in right way is extremely important to get most out of it, else it may not give you the desired results that you are looking for.
Fundamental Architecture difference between Postgres (or any relational DB) and Amazon Redshift
Storage
Data is stored in a small chunks on disk. Traditional database uses row based storage and the chunk size will be few KBs. This means couple of rows will be stored in the same chunk.
The advantage of row based storage is any update, delete or read on any row will required access to just one storage point. Row based operations are efficient.
So if you application requires frequent insert, update , delete & read (read all or most of the columns in the row), than Posgres may be a good choice.
Redshift is a columnar storage database. This means, column data is stored together in a chunk typically of size 1 MB.
This is mainly useful for Analytics , say a table has millions of rows and you are interested in selecting only few columns, in that case Redshift can pull out data from few chunks which has those columns, this will reduce I/Os, memory and cpu as it will not have to process each and every storage chuck on disk.
Scaling
The biggest advantage of redshift is that, it can scale horizontally where as most of traditional database relied heavily on vertical scaling. Redshift allow you to scale horizontally, this means you can keep on adding nodes as your data grows. As you increase nodes you get additional storage as well as compute power. This means , you may be able to get the same level of performance of your query, even if you data grows from terabytes to petabytes.
Best Practices/Usage of Redshift
While it is important to select right database based on your use case, its probably more important to use the database in the right way, otherwise you may not be able to get the desired result.
Data Distribution
This define, how your data is distributed across multiple nodes on the redshift cluster.
- Even distribution , this is a default option available, it distributes data evenly across all nodes. The challenge with this is, its unaware of data relations, so in case of join query, lots of data movements required.
- key based distribution - you have an option to distribute data based on key of your choice, if you go by this, its important to select right key, key should have high level of distributions (ie it should not be like Boolean, Yes/no kind of fields, date may be an option) and it should be able to reduce data movements due to joins. Going by Key based, distribution is ideal, if one can manage to identify right key.
- All distributions - this shouldn't be used as much as possible, but this could be a good option for small table which are used as join for many tables. All distribution means, full copy will be replicated on all nodes.
Use of SELECT
Avoid use of select * - Select * is very costly operation for Redshift as it will have to go through each column storage blocks. So its always better to ask for the column you really needs.
Use of GroupBy & Aggregation
Always try to use sort key for groupby if possible.
Data Type
Use appropriate data type , fit for purpose. Say if text can never be more than 4 char, varchar (4) is better option than varchar(50), as it save lots of space in total. Also choose the right data type like number, date etc. As you know, redshift store data column wise, so an efficient compression can be applied on data based on its type.
Joins
say NO to cross join, its does nested loop join , you can see that in query explain.
Also if your join has multiple tables, try to add data filter in all possible table (like its okay to <tableA>.ccy="INR" and <tableB>.country= "India", though the end result will be same even if we take out say country filter, but adding country filter on table B, help filter out data where its located and reduce the need of data movement across nodes).
Defining Constraints
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. This is a very important point to note, as its application responsibility to ensure consistency of the data.
Redshift assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique.
Amazon Redshift does enforce NOT NULL column constraints
Monitoring database Performance :
It is extremely important that we should be able to easily monitor performance of our database and get analytics based on our usage to improve query, design etc to optimise the performance.
Thanks fully Amazon does provide few readymade template and script to do that.
it worth looking at AWS Labs redshift utilities
Documentation :
Amazon has good documents on Redshift or details
Database Developer Guide : https://docs.aws.amazon.com/redshift/latest/dg/redshift-dg.pdf
Redshift Best Practices : https://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
AWS Labs redshift utilities : https://github.com/awslabs/amazon-redshift-utils
Thank you very much for your time, I would love to hear your experiance with Amazon Redshift. Thanks
Comments
Post a Comment