I recently got to attend the annual AWS re:Invent conference. It was a bit like drinking from a firehose at times, but I’ve tried to distill my thoughts down into a top ten list of takeaways.
First Things First
Man, what a great conference. The folks who made it all work did a great job — the sessions were super informative and interesting, the speakers excellent, and all the logistics of herding 20,000 attendees from room to room and meal to meal went off without a hitch. Impressive! With that out of the way, on to the list!
I noted lots of agreement around separating storage from compute. This was the most prominent, recurring theme of the sessions I went to. Most places seem to be doing this by dumping their raw events into S3 and leveraging different packages for compute (Presto, Hive, etc.).
EMR is a really flexible compute platform, and with EMRFS, can use S3 objects directly and transparently in MR jobs. In bog-standard Hadoop clusters using HDFS, when you need more HDFS, you need to add more compute nodes to the cluster, even if you don’t need the additional compute. I’ve run into this situation at a previous copmany, and then we joked “Hey, we’re getting the compute for free!” That’s not really true, of course; having to provision nodes that handle both storage and compute means you can’t use specialized nodes.
Companies are starting to come around on schemas and the need to manage them, vis-a-vis their big data operations. Sonos, for example, is going to take a look at Confluent Platform’s Schema Registry. FINRA has opted to run a single hive metastore in RDS and then connect multiple EMR clusters to it. This gives them a single source for schemas, as well as saving them 5-7 minutes of compute time when they bring up a new cluster (no need to populate millions of partitions in the HMS).
Make the transient transient, the long running durable. This goes to the use of S3 as primary storage layer. If you only need to run batch EMR jobs, then you can use the Spot market to spin up a cluster when needed and release it when done for very dollar-efficient compute.
There’s an absolute explosion in the types and number of tools available for doing analysis — Spark Streaming, Presto, Drill, Hive, SQL via Redshift, etc. I saw a lot more Hive than I was expecting, given the age of the tool and its performance, but the refrain I kept hearing was that companies trust Hive for their production workloads, whereas some of the next generation tools are still a little too immature. There’s a lot of work going in to proving them out, though, and I think the trend is to move off of Hive onto these new tools.
There’s also an explosion of file/data formats, with different pros and cons. Avro has wide adoption, which is good to see, as it validates our decision to use it for our event schemas. People like that it contains the write schema, but there is a lot of interest in columnar formats as well like Parquet and ORC.
With storage being so cheap, it seems like there is an opportunity to fit the storage format to the tool and just duplicate the data. So your “data lake” could be in S3, in its native format (Avro, etc.), and would be the ultimate source of truth. ETL jobs could then process the data into multiple formats for the different tools.
Kafka is generating lots of interest, and many companies are wrestling with the Kafka vs. Kinesis question. Obviously, Kinesis is going to come in for a lot of praise at an AWS conference, but it seemed like Kafka was more adopted. It will be interesting to watch the evolution of the two tools. Kinesis’ magic is the tight integration with the AWS ecosystem, and the fact that it is cloud-native (a speaker from Netflix mentioned frustration around trying to work with Kafka in an autoscaling environment where you expect the instances to come and go with some regularity). The new Kinesis Firehose functionality is really intriguing, and it is tools and features like that which could tip the balance for a lot comapnies, especially if they are going with an “AWS first” mentality. Kafka, on the other hand, is seen as more mature and more developer-friendly.
My current favorite Apache project, Samza, didn’t get mentioned much, but Netflix at least is doing a lot of work with it, submitting patches and fixing bugs. One of their points of focus has been to get Samza to run outside of YARN, or any sort of resource management layer, really. They run their Samza jobs as Docker containers, which is exciting as that’s how we’d like to run ours!
Docker has basically won the day. Literally everybody mentioned using it in some way within their organization, even on production. It’s astonishing to me how quickly the technology has matured, as well as how quickly consensus has formed around it. Pretty cool!
Are you interested in working with tools in the AWS ecosystem? Come join us!