Nan Tang


I am an associate professor in the Data Science and Analytics Thrust, Information Hub, Hong Kong University of Science and Technology (Guangzhou). I also hold an affiliated position at the Hong Kong University of Science and Technology, Clear Water Bay campus, Hong Kong.

Before joining HKUST(GZ), I worked as a senior scientist at Qatar Computing Research Institute, a visiting scientist at MIT CSAIL, a research fellow at the University of Edinburgh, a scientific staff member at CWI (the national research institute for mathematics and computer science in the Netherlands), and a visiting scholar at the University of Waterloo.

I direct the Data Intelligence Lab, which focuses on finding good data and smart analytics that are fundamental to data management, data science, and artificial intelligence.

  • Retrieval-based language models using multi-modal data lakes. Data lakes have become increasingly popular in many organizations. Given a natural language question, retrieving relevant datasets (e.g., text, tables, graphs) and reasoning over them with language models are key to business intelligence.
  • Good data for AI (a.k.a. data-centric AI). For most machine learning practitioners, the success of machine learning projects heavily depends on whether we can find good data for model training.
  • AI for good data. Data scientists spend at least 80% of their time on data preparation. Machine learning models can help address diverse data preparation challenges.
  • Visualization. Data visualization is important to data analytics. I am working on automatic visualization, visualization recommendation, chat-to-story, chat-to-video, and visualization using AR/VR devices.

Office: E3 601
E-mail: nantang (at)
Call: (+86)-20-88330888


Jul 15, 2024 :pencil: [VLDB 2024] Six papers, (1) “MisDetect: Iterative Mislabel Detection using Early Loss”, (2) “LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes”, (3) “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”, (4) “Are Large Language Models a Good Replacement of Taxonomies?”, (5) “HAIChart: Human and AI Paired Visualization System”, (6) “The Dawn of Natural Language to SQL: Are We Fully Ready?”, and two demos, (i) “Retrieval-Based Tabular Data Cleaning Using LLMs and Data Lake”, (ii) “LakeCompass: An End-to-End System for Table Maintenance, Search and Analysis in Data Lakes”, were accepted.
Mar 18, 2024 :pencil: [SIGMOD 2024] Paper “Controllable Tabular Data Synthesis Using Diffusion Models” and two demos, “IDE: A System for Iterative Mislabel Detection” and “ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human-ChatGPT Interactions”, were accepted.
Mar 10, 2024 :pencil: [ICDE 2024] Two papers, “Mitigating Data Scarcity in Supervised Machine Learning through Reinforcement Learning Guided Data Generation” and “Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration”, were accepted.
Mar 8, 2024 :trophy: [KDD Cup 2024] Our proposal “CRAG–Comprehensive RAG Benchmark and Challenge”, co-hosted with Meta Reality Labs, was accepted.
Dec 16, 2023 :medal_sports: [2024 SIGMOD Research Highlight Award] Paper “Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration”. :medal_sports: [Best of SIGMOD 2023] Paper “GoodCore: Data-effective and Data-efficient Machine Learning through Coreset Selection over Incomplete Data”.