Sokrati released Athena, its our own data warehouse as a service.
Sokrati deals with a lot of data (20TB+), each team used to worry about scaling up their databases and similar problems were faced by them. Setting up a new shard for a database, migrating data between shards was a problem and a nightmarish activity. On business front, analytics team would wait forever to fetch the data, they would have to come up with their own way of analysing it as it was impossible to load it into standard tools.
This is when an idea struck as to why don’t we build a layer that will ease life of all the developers! This layer should be able to theoretically handle infinite scale, shard / re-shard when needed, Add / Remove boxes depending on size of data, Archive data to an archive store.
We came up with this idea of building a REST based web-service which would be simple to use and scale. This service would have just three calls, Store, Fetch and Status. Store would take tablename, dbname, schema and S3 file, containing actual data, as an input and insert / update the data. Fetch would build the right query depending on selectors and filters and save the output in an S3 file. Since this service would deal with huge amounts of data, we decided that both these calls should be offline jobs and hence we added a status servlet to check whether the job is complete or not.
Choice of database:
We evaluated different databases: HBase, MySQL + sharding, Amazon’s Redshift. Columnar data store performs a lot better that MySQL for analytics queries, hence it was a choice between HBase and Redshift. HBase is a columnar store but then we would have to take up maintaining that cluster which wasn’t an issue with Redshift. Hence we decided to go with Redshift as database.
Redshift is a columnar store with postgreSQL type interface for accessing data. It is a fast, fully managed, peta-byte scale datawarehouse.
Setup: 4 nodes, each node of type dw.hs1.xlarge.
- Data load for 100 million rows: 3 minutes.
- Simple select query on one table: 17 seconds.
- Select query with group by and order by: 31 seconds.
- select query with a join of 2 tables with million rows each: 48 seconds.
- 100K Upsert on a table with 10 million rows: 40 seconds
Downside of using Redshift:
Even though redshift has a lot of good things it is not a perfect system. They do not support upserts (update on duplicate key insert) out-of-the-box (like many other columnar DBs). So we had to implement our own layer for upsert based on primary keys.
Athena is ready!
The way Athena is built, its underlying database could be changed and no other team will have make any code changes, all of it would be seamless and smooth, since the API calls stay the same.
With Athena developer’s life has become easier, now they do not worry about scaling up of data. Only one team worries about databases.
We plan to implement a computation framework on top of Athena which would read data from Athena run a map-reduce job and store it back to Athena. This layer would take care of scheduling jobs so that the DB is never overloaded. Once this is rolled out Sokrati analytics would run on four calls Store, Fetch, Compute, Status.
Now Sokrati is ready to embark into an era of faster data analytics!