Department of Computer Science HKBU
Keynote Speech

DataPrep: Accelerate Data Preparation for AI

Dr. Jiannan Wang

Associate Professor
School of Computing Science
Simon Fraser University, Canada

Title: DataPrep: Accelerate Data Preparation for AI
Date & Time: 09:00, 17 June 2022 GMT +8 (Hong Kong Time)
18:00, 16 June 2022 GMT -7 (Vancouver Time, Canada)
Zoom Details: Meeting ID: 982 2306 0925
Passcode: 096082
Link: https://bit.ly/zm617

ABSTRACT

Data scientists have complained about data preparation (data collection → data understanding → data cleaning → data enrichment → data integration → feature engineering) for many years. Although some effort has been devoted to solving this problem, a survey released by Anaconda in 2020 shows that it persists: "Data preparation and cleansing takes valuable time away from real data science work and has a negative impact on overall job satisfaction." Most recently, Andrew Ng urged the AI community to shift from Model-Centric toward Data-Centric AI development.

In this talk, I will start by answering two fundamental questions: i) what makes data preparation hard? ii) why has this problem not been solved? Then, I will present DataPrep, a fast and easy-to-use Python library that addresses these challenges. The DataPrep library currently contains three components: a data connector component to simplify and accelerate data collection, an exploratory data analysis (EDA) component to enable fast data understanding, and a data cleaning component to clean and standardize data. I will describe their novel design and demonstrate how they can save data scientists significant time. Finally, I will share some lessons I have learned from open-source software development.

BIOGRAPHY

Jiannan Wang is an Associate Professor and the Director of the Professional Master’s Program in the School of Computing Science at Simon Fraser University. Prior to that, he was a postdoc in the AMPLab at UC Berkeley. He obtained his PhD from Tsinghua University. He has over ten years of research experience in data preparation. His research contributions won him the VLDB Best Experiments, Analysis & Benchmark Paper Award (2021), a CS-Can|Info-Can Outstanding Early Career Researcher Award (2020), an IEEE TCDE Rising Star Award (2018), an ACM SIGMOD Best Demonstration Award (2016), a Distinguished Dissertation Award from the China Computer Federation (2013), and a Google Ph.D. Fellowship (2011). He is a General Co-chair for VLDB 2023, a PhD Symposium Track Chair for ICDE 2022, an Associate Editor for VLDB 2021, and a core PC member for SIGMOD 2019.