Why Is All of My Data Stored in Parquet? (A Deep Dive into the Data Lake)

When
Thursday November 21st, 2024 @ 6:30 PM
Who
Christopher Bremer
Where
SmartDraw Software

Apache Parquet is a data format for efficient storage and retrieval of massive quantities of data that has been popularized by widespread adoption of the “data lake” architecture. The title of this talk is a question I asked of a coworker (a data engineer) in the process of consolidating our reporting and inference models onto a new analytics platform. The short answer, “because Databricks”, didn’t satisfy me. I was accustomed to traditional SQL, executing ad hoc queries from any language or platform I chose.

The long answer comes from a months-long investigation into how to read and write Parquet files from my favorite languages (C# and F#) and a growing appreciation of the format’s flexibility. I hope to convey not just the Parquet specification, but also the “how” of parsing and writing Parquet and the “why” behind the format’s popularity. I also want to spark a discussion on the value of open-source data formats and how to avoid vendor lock-in for data access.

Food / Drinks

Food and drinks will be provided! Please contact us (on Meetup) if you have any dietary restrictions or food allergies.

Directions and Parking Instructions

The meeting will be held in the offices of SmartDraw Software, 1780 Hughes Landing Blvd #1100, on the 11th floor.

Christopher Bremer

Chris is the software simulation manager at NOV ReedHycalog in Conroe, TX, where he manages a team of engineers and data scientists. His group supports the drill bit design team by providing on-demand wellbore simulation and analysis. Prior to working in software, Chris was a research mathematician studying algebraic geometry and representation theory at the University of Chicago and Louisiana State University.
Last Updated: August 25th, 2023
