Grammarly, like most growing companies, strives to make data-driven decisions. That means we need a reliable way to collect, analyze, and query data about our users. We started out using third-party tools like Mixpanel to handle our analytics needs, but soon our needs outgrew those tools: we wanted to control the pre-aggregation and enrichment of data, generate more customized reports, and have higher confidence in the accuracy of the data. So we built our own in-house analytics engine and application on top of Apache Spark. Recently, I gave a talk at the Spark Summit sharing some of the lessons we learned along the way. The talk covered:
- Outputting data to multiple storage systems in a single Spark job
- Working within the Spark memory model and building a custom spillable data structure for data traversal
- Implementing a custom query language with parser combinators on top of the Spark SQL parser
- Building a custom query optimizer and analyzer
- Storing flexible-schema data and querying across datasets with conflicting schemas
- Writing custom aggregation functions in Spark SQL
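To give a flavor of the spillable data structure mentioned above, here is a minimal sketch in Python (not Grammarly's actual Scala implementation; the class name, threshold policy, and pickle-based serialization are invented for illustration). The core idea is to buffer items in memory, spill full batches to disk once a threshold is reached, and replay everything in insertion order on traversal, so memory stays bounded regardless of total size:

```python
import os
import pickle
import tempfile

class SpillableList:
    """Append-only collection that spills batches to disk past a
    size threshold. A toy sketch of the spillable-structure idea."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.buffer = []       # in-memory portion
        self.spill_files = []  # temp files holding spilled batches

    def append(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Persist the current buffer to a temp file and clear it,
        # bounding memory use regardless of total item count.
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(self.buffer, f)
        f.close()
        self.spill_files.append(f.name)
        self.buffer = []

    def __iter__(self):
        # Replay spilled batches first (oldest first), then the live
        # buffer, preserving overall insertion order.
        for path in self.spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.buffer

    def close(self):
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []
```

In a real engine the batches would be sorted and merged on read rather than simply concatenated, and the spill threshold would be driven by actual memory accounting, but the structure of the trade-off is the same.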
Here is the video of the talk:
Check out the slides as well: