Red Bird 2023 Project
Title: Data Analytics over Multi-modal Data Lakes using Large Language Models
The project has three main components:
1. Improving large language models (LLMs) through fine-tuning
A common problem when applying LLMs to industrial applications is that off-the-shelf LLMs are simply not good enough. Many companies hold large data assets, yet it is common that even after fine-tuning on these assets, the LLMs still fall short. There are multiple reasons:
- The data is of low quality
- There is not enough data
- How to best fine-tune LLMs for certain data types, such as tables or knowledge graphs, is still unknown
Moreover, we have specific applications to work on:
- LLMs for creative education, especially machine learning courses. Please contact Prof. Wei Wang (weiwcs@hkust-gz.edu.cn) if you are interested in this thread.
- LLMs for database problems, such as NL2SQL (a minimal fine-tuning sketch appears right after this list)
- LLMs for table learning
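
To make the fine-tuning thread concrete, here is a minimal sketch of supervised fine-tuning of a causal language model on NL2SQL pairs with the Hugging Face transformers and datasets libraries. The model checkpoint, the two toy question/SQL pairs, and the hyperparameters are illustrative placeholders, not choices made by the project.

```python
# Minimal sketch: supervised fine-tuning of a small causal LM on NL2SQL pairs.
# Model name, examples, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "gpt2"  # placeholder; any causal LM checkpoint works the same way

# Toy NL2SQL pairs; a real data asset would supply many thousands of these.
pairs = [
    {"question": "How many employees are in the sales department?",
     "sql": "SELECT COUNT(*) FROM employees WHERE department = 'sales';"},
    {"question": "List the names of customers from Hong Kong.",
     "sql": "SELECT name FROM customers WHERE city = 'Hong Kong';"},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL)

def to_text(example):
    # Serialize each pair into a single prompt/completion string.
    return {"text": f"Question: {example['question']}\nSQL: {example['sql']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=256)

dataset = (Dataset.from_list(pairs)
           .map(to_text)
           .map(tokenize, remove_columns=["question", "sql", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nl2sql-sft", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```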
2. Retrieval-augmented generation
For data analytics over multi-modal data lakes, which include text files, tables, relational databases, and knowledge graphs, it is not reliable to ask LLMs to answer directly. A more reliable approach is to first retrieve the datasets needed to answer a given query or to perform a given analytics task, and then use LLMs or other existing tools (such as databases) to do the reasoning.
So far, how to index text files is well studied and widely used, but how to index (large) tables or graphs is still an open problem. Hence, given a natural language query, there are many open problems in retrieving:
- text file(s)
- table(s)
- graph(s)
- image(s)
- or a combination of the above files
and then doing the reasoning.
Please refer to the Symphony paper (https://www.cidrdb.org/cidr2023/papers/p51-chen.pdf) for a vision of this thread.
I envision that retrieving the required multi-modal datasets for a given natural language query will be the bottleneck that prevents many commercial applications of LLMs for data analytics from being grounded.
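
To make the retrieve-then-reason pipeline concrete, here is a minimal sketch under simplifying assumptions: every dataset in the lake is described by a short natural-language summary, a TF-IDF retriever (scikit-learn) picks the best-matching datasets for a query, and the reasoning step simply routes each modality to an appropriate backend. The catalogue entries and the routing logic are illustrative, not the Symphony design.

```python
# Minimal retrieve-then-reason sketch over a toy multi-modal "data lake".
# The catalogue, the TF-IDF retriever, and the routing stubs are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each dataset is indexed by a short natural-language summary of its content.
catalogue = [
    {"id": "sales.csv", "modality": "table",
     "summary": "monthly sales revenue per product and region"},
    {"id": "employee_handbook.txt", "modality": "text",
     "summary": "company policies on leave, travel and expenses"},
    {"id": "supply_chain.graphml", "modality": "graph",
     "summary": "supplier and warehouse relationships for logistics"},
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform([d["summary"] for d in catalogue])

def retrieve(query: str, k: int = 1):
    """Return the k catalogue entries whose summaries best match the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(scores, catalogue), key=lambda p: p[0], reverse=True)
    return [entry for _, entry in ranked[:k]]

def reason(query: str, datasets):
    """Route each retrieved dataset to an appropriate reasoning backend."""
    for d in datasets:
        if d["modality"] == "table":
            print(f"{d['id']}: translate '{query}' to SQL and run it on the table")
        elif d["modality"] == "graph":
            print(f"{d['id']}: translate '{query}' to a graph query (e.g. Cypher)")
        else:
            print(f"{d['id']}: pass the retrieved text and '{query}' to an LLM")

query = "Which region had the highest revenue last month?"
reason(query, retrieve(query))
```

In a real data lake the TF-IDF retriever would be replaced by learned embeddings or dedicated table/graph indexes, which is exactly the open problem this thread studies.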
3. Data analytics over multi-modal data using LLMs
This is more from the application side, where the required datasets are provided (for example, by the retrieval step above). There are multiple directions we are pursuing:
- Multi-modal IoT data analytics. Please contact Prof. Kaishun Wu (wuks@hkust-gz.edu.cn) if you are interested in this thread.
- LLM-powered data visualizations/stories/videos (see the sketch after this list). Please contact Prof. Yuyu Luo (yuyuluo@hkust-gz.edu.cn) if you are interested in this thread.
- H.A.R.V.I.S (HKUST AR for VIS): an LLM- and AR-powered system for data visualization
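
As a sketch of what "LLM-powered data visualization" can look like in code, the snippet below turns a natural-language question and a table schema into a Vega-Lite chart specification. The call_llm function is a hypothetical placeholder for whichever LLM API is actually used; only the prompt construction and the sanity checks are meant as illustration.

```python
# Minimal sketch of the "LLM-powered data visualization" direction: turn a
# natural-language question about a table into a Vega-Lite chart spec.
# call_llm() is a hypothetical placeholder for whichever LLM API is used.
import json

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a hosted or local LLM.
    raise NotImplementedError("plug in your LLM client here")

def nl_to_vegalite(question: str, columns: dict) -> dict:
    """Ask the LLM for a Vega-Lite spec answering `question` over `columns`."""
    prompt = (
        "You are a data visualization assistant.\n"
        f"Table columns and types: {json.dumps(columns)}\n"
        f"Question: {question}\n"
        "Return only a valid Vega-Lite JSON specification."
    )
    spec = json.loads(call_llm(prompt))
    # Basic sanity checks before handing the spec to a renderer
    # (or to an AR front end, as in H.A.R.V.I.S).
    assert "mark" in spec and "encoding" in spec, "not a usable Vega-Lite spec"
    return spec

# Example invocation (requires a real call_llm implementation):
# nl_to_vegalite("Show monthly revenue by region",
#                {"month": "temporal", "region": "nominal", "revenue": "quantitative"})
```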