Red Bird 2023 Project
Title: Data Analytics over Multi-modal Data Lakes using Large Language Models
The project has three main components:
1. Improving large language models (LLMs) through fine-tuning
A common problem when applying LLMs to industrial applications is that off-the-shelf LLMs are simply not good enough. Many companies hold large data assets, yet it is common that even after fine-tuning on these assets, the LLMs still fall short. There are multiple reasons:
- The data is of low quality
- There is not enough data
- How to best fine-tune LLMs for certain data types, such as tables or knowledge graphs, is still unknown
Moreover, we have specific applications to work on:
- LLMs for creative education, especially machine learning courses. Please contact Prof. Wei Wang (weiwcs@hkust-gz.edu.cn) if you are interested in this thread.
- LLMs for database problems, such as NL2SQL (a minimal fine-tuning sketch appears right after this list)
- LLMs for table learning
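
To make the fine-tuning thread concrete, here is a minimal sketch of supervised fine-tuning of a causal language model on NL2SQL pairs with the Hugging Face transformers and datasets libraries. The model checkpoint, the two toy question/SQL pairs, and the hyperparameters are illustrative placeholders, not choices made by the project.

```python
# Minimal sketch: supervised fine-tuning of a small causal LM on NL2SQL pairs.
# Model name, examples, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "gpt2"  # placeholder; any causal LM checkpoint works the same way

# Toy NL2SQL pairs; a real data asset would supply many thousands of these.
pairs = [
    {"question": "How many employees are in the sales department?",
     "sql": "SELECT COUNT(*) FROM employees WHERE department = 'sales';"},
    {"question": "List the names of customers from Hong Kong.",
     "sql": "SELECT name FROM customers WHERE city = 'Hong Kong';"},
]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL)

def to_text(example):
    # Serialize each pair into a single prompt/completion string.
    return {"text": f"Question: {example['question']}\nSQL: {example['sql']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=256)

dataset = (Dataset.from_list(pairs)
           .map(to_text)
           .map(tokenize, remove_columns=["question", "sql", "text"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="nl2sql-sft", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```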
2. Retrieval-augmented generation
For data analytics over multi-modal data lakes, which include text files, tables, relational databases, and knowledge graphs, it is not reliable to ask LLMs to answer directly. A more reliable approach is to first retrieve the datasets needed to answer a given query or to perform a given analytics task, and then use LLMs or other existing tools (such as databases) to do the reasoning.
So far, how to index text files is well studied and widely used, but how to index (large) tables or graphs is still an open problem. Hence, given a natural language query, there are many open problems in retrieving:
- text file(s)
- table(s)
- graph(s)
- image(s)
- or a combination of the above files
and then doing the reasoning.
Please refer to the Symphony paper (https://www.cidrdb.org/cidr2023/papers/p51-chen.pdf) for a vision of this thread.
I envision that retrieving the required multi-modal datasets for a given natural language query will be the bottleneck that prevents many commercial applications of LLMs for data analytics from being grounded.
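
To make the retrieve-then-reason pipeline concrete, here is a minimal sketch under simplifying assumptions: every dataset in the lake is described by a short natural-language summary, a TF-IDF retriever (scikit-learn) picks the best-matching datasets for a query, and the reasoning step simply routes each modality to an appropriate backend. The catalogue entries and the routing logic are illustrative, not the Symphony design.

```python
# Minimal retrieve-then-reason sketch over a toy multi-modal "data lake".
# The catalogue, the TF-IDF retriever, and the routing stubs are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each dataset is indexed by a short natural-language summary of its content.
catalogue = [
    {"id": "sales.csv", "modality": "table",
     "summary": "monthly sales revenue per product and region"},
    {"id": "employee_handbook.txt", "modality": "text",
     "summary": "company policies on leave, travel and expenses"},
    {"id": "supply_chain.graphml", "modality": "graph",
     "summary": "supplier and warehouse relationships for logistics"},
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform([d["summary"] for d in catalogue])

def retrieve(query: str, k: int = 1):
    """Return the k catalogue entries whose summaries best match the query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(zip(scores, catalogue), key=lambda p: p[0], reverse=True)
    return [entry for _, entry in ranked[:k]]

def reason(query: str, datasets):
    """Route each retrieved dataset to an appropriate reasoning backend."""
    for d in datasets:
        if d["modality"] == "table":
            print(f"{d['id']}: translate '{query}' to SQL and run it on the table")
        elif d["modality"] == "graph":
            print(f"{d['id']}: translate '{query}' to a graph query (e.g. Cypher)")
        else:
            print(f"{d['id']}: pass the retrieved text and '{query}' to an LLM")

query = "Which region had the highest revenue last month?"
reason(query, retrieve(query))
```

In a real data lake the TF-IDF retriever would be replaced by learned embeddings or dedicated table/graph indexes, which is exactly the open problem this thread studies.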
3. Data analytics over multi-modal data using LLMs
This is more from the application side, where the required datasets are provided (for example, by the retrieval step above). There are multiple directions we are pursuing:
- Multi-modal IoT data analytics. Please contact Prof. Kaishun Wu (wuks@hkust-gz.edu.cn) if you are interested in this thread.
- LLM-powered data visualizations/stories/videos (see the sketch after this list). Please contact Prof. Yuyu Luo (yuyuluo@hkust-gz.edu.cn) if you are interested in this thread.
- H.A.R.V.I.S (HKUST AR for VIS): an LLM- and AR-powered system for data visualization
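
As a sketch of what "LLM-powered data visualization" can look like in code, the snippet below turns a natural-language question and a table schema into a Vega-Lite chart specification. The call_llm function is a hypothetical placeholder for whichever LLM API is actually used; only the prompt construction and the sanity checks are meant as illustration.

```python
# Minimal sketch of the "LLM-powered data visualization" direction: turn a
# natural-language question about a table into a Vega-Lite chart spec.
# call_llm() is a hypothetical placeholder for whichever LLM API is used.
import json

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a hosted or local LLM.
    raise NotImplementedError("plug in your LLM client here")

def nl_to_vegalite(question: str, columns: dict) -> dict:
    """Ask the LLM for a Vega-Lite spec answering `question` over `columns`."""
    prompt = (
        "You are a data visualization assistant.\n"
        f"Table columns and types: {json.dumps(columns)}\n"
        f"Question: {question}\n"
        "Return only a valid Vega-Lite JSON specification."
    )
    spec = json.loads(call_llm(prompt))
    # Basic sanity checks before handing the spec to a renderer
    # (or to an AR front end, as in H.A.R.V.I.S).
    assert "mark" in spec and "encoding" in spec, "not a usable Vega-Lite spec"
    return spec

# Example invocation (requires a real call_llm implementation):
# nl_to_vegalite("Show monthly revenue by region",
#                {"month": "temporal", "region": "nominal", "revenue": "quantitative"})
```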