Apache Parquet is a data format for efficient storage and retrieval of massive quantities of data, popularized by widespread adoption of the “data lake” architecture. The title of this talk is a question I asked of a coworker (a data engineer) in the process of consolidating our reporting and inference models onto a new analytics platform. The short answer, “because Databricks”, didn’t satisfy me. I was accustomed to traditional SQL, executing ad hoc queries from any language or platform I chose.
The long answer comes from a months-long investigation into how to read and write Parquet files from my favorite languages (C# and F#) and a growing appreciation of the format’s flexibility. I hope to convey not just the Parquet specification, but also the “how” of parsing and writing Parquet and the “why” behind the format’s popularity. I also want to spark a discussion on the value of open-source data formats and how to avoid vendor lock-in for data access.
Food and drinks will be provided! Please contact us (on Meetup) if you have any dietary restrictions or food allergies.
The meeting will be held in the offices of SmartDraw Software, 1780 Hughes Landing Blvd #1100, on the 11th floor.
Chris is the software simulation manager at NOV ReedHycalog in Conroe, TX, where he manages a team of engineers and data scientists. His group supports the drill bit design team by providing on-demand wellbore simulation and analysis. Prior to working in software, Chris was a research mathematician studying algebraic geometry and representation theory at the University of Chicago and Louisiana State University.