Danny To Eun Kim
김도은
toeunkim{at}cmu{dot}edu

(View on a wider screen to see the page with pictures!)

I'm a second-year PhD student at the CMU Language Technologies Institute (LTI), advised by Prof. Fernando Diaz. My research interests span Natural Language Processing (NLP) and Information Retrieval (IR), with a recent emphasis on retrieval-enhanced machine learning (REML) and known-item retrieval.

Prior to CMU, while I was in the UK, I was a graduate researcher in the Web Intelligence Group at University College London (UCL), where I received an M.Eng. in Computer Science in 2022. I worked with Prof. Emine Yilmaz and Prof. Aldo Lipani on Conversational AI and User Simulation. I was also a member of both the Knowledge Graphs and the NLP interest groups at The Alan Turing Institute.

During the summer of 2022, I interned at Raft as a Machine Learning Engineer, automating paperwork in the freight forwarding industry with OCR and NLP.
In May 2021, I joined team Condita as a main developer to compete in the first Amazon Alexa Prize TaskBot Challenge, where I spent a significant amount of time building a real-time, multi-modal, knowledge-intensive, and interactive conversational assistant; our team made it to the quarterfinals.
During my undergraduate degree, I worked with Prof. Marianna Obrist at the UCL Interaction Centre, finding ways to cluster text-based stories by the authors' smell experiences.

I take pride in being one of the early members of the UCL Artificial Intelligence Society, and during my three years of active involvement, I had the privilege of founding its first Machine Learning tutorial series, which has become an annual tradition.

Check out my blog page if you are interested!

CV  /  Google Scholar  /  Blog  /  Twitter  /  GitHub

Table of Contents  

Research  /  Publications  /  Presentations & Workshops  /  Teaching & Mentoring
Research

Current Research:

  • Retrieval-Enhanced Machine Learning (REML) / Retrieval-Augmented Generation (RAG)
  • Tip-of-the-tongue (known-item) retrieval

Theses


Selected Publications
Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation
To Eun Kim, Fernando Diaz
Preprint, 2024
arXiv / code

Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research.

Retrieval-Enhanced Machine Learning: Synthesis and Opportunities
To Eun Kim, Alireza Salemi, Andrew Drozdov, Fernando Diaz, Hamed Zamani
Preprint, 2024
arXiv

In this work, we posit that the paradigm of retrieval enhancement can be extended to a broader spectrum of machine learning (ML), including computer vision, time series prediction, and computational biology, rather than being limited to NLP. We introduce a formal framework for Retrieval-Enhanced Machine Learning (REML) by synthesizing the literature across various ML domains with consistent notation.

A Comprehensive Survey on Generative Diffusion Models for Structured Data
Heejoon Koo, To Eun Kim
Preprint, 2023
arXiv

Compared to other data modalities such as visual and textual data, there is still a lack of literature, and of reviews of that literature, on structured data modelling via diffusion models. To address this gap, we present a comprehensive review of recently proposed diffusion models in the field of structured data. First, this survey provides a concise overview of score-based diffusion model theory, then proceeds to technical descriptions of the majority of pioneering works that used structured data in both data-driven general tasks and domain-specific applications. Thereafter, we analyse and discuss the limitations and challenges of existing works and suggest potential research directions.

When and What to Ask Through World States and Text Instructions: IGLU NLP Challenge Solution
Z. Shi*, J. Ramos*, To Eun Kim, X. Wang, H. Rahmani, Aldo Lipani
* denotes equal contribution.
IGLU NeurIPS Workshop, 2022
arXiv

In the NeurIPS 2022 IGLU Challenge NLP Task, we address two key research questions: 1) when should the agent ask for clarification, and 2) what clarification questions should it ask. In this report, we briefly introduce our methods for the classification and ranking tasks. For the classification task, our model achieves an F1 score of 0.757, which placed 3rd on the leaderboard. For the ranking task, our model achieves a Mean Reciprocal Rank of about 0.38 by extending the traditional ranking model. Lastly, we discuss various neural approaches for the ranking task and future directions.

A Multi-Task Based Neural Model to Simulate Users in Goal-Oriented Dialogue Systems
To Eun Kim, Aldo Lipani
SIGIR, 2022
paper / poster / code

A conversational user simulator that, via multi-task learning, 1) generates user-side utterances, 2) predicts the user's next action, and 3) predicts the user's satisfaction level. It achieves state-of-the-art performance in satisfaction and action prediction on the USS dataset.

Condita: A State Machine Like Architecture for Multi-Modal Task Bots
Jerome Ramos*, To Eun Kim*, Z. Shi, X. Fu, F. Ye, Y. Feng, Aldo Lipani
* denotes equal contribution.
Alexa Prize TaskBot Challenge Proceedings, 2022
paper

We present Condita (COoking-aNd-DIy-TAsk-based), a task-oriented dialogue system built for the 2021 Alexa Prize TaskBot Challenge. Condita provides an engaging multi-modal agent that assists users in cooking and home improvement tasks, creating a memorable and enjoyable experience for users. We discuss Condita's state-machine-like architecture and analyze the various conversational strategies that allowed us to achieve excellent performance throughout the competition.

Attention-based Ingredient Phrase Parser
Z. Shi, P. Ni, M. Wang, To Eun Kim, Aldo Lipani
ESANN, 2022
paper / arXiv / code

Spin-off research from the Alexa Prize TaskBot Challenge. Assisting users in cooking is one of the tasks that intelligent assistants are expected to solve, where ingredients and their corresponding attributes, such as name, unit, and quantity, should be provided to users precisely and promptly. To provide an engaging and successful conversational service for cooking tasks, we propose a new ingredient parsing model.


Presentations & Workshops
A Multi-Task Based Neural Model to Simulate Users in Goal-Oriented Dialogue Systems
Presented at the SCAI: Search-Oriented Conversational AI Workshop
15th July, 2022
Recording / Slides

Teaching & Mentoring
Course Co-organiser
  • Intensive Python Programming Course (for MSc students' dissertations)
Teaching Assistant (MSc courses)
  • CEGE0096: Geospatial Programming (Fall 2022)
    • Coursework marking automation
  • CEGE0004: Machine Learning for Data Science (Spring 2023)
  • COMP0071: Software Engineering (Spring 2023)
  • COMP0189: Applied Artificial Intelligence (Spring 2023)
Transition Mentor
  • Helping 1st year CS students with programming (C, Java, Python, Haskell)
UCL Artificial Intelligence Society
Founder, Maintainer, and Lecturer of the Machine Learning Tutorial Series

You can get to know me better from my CV.


Jump to the top of this page.
Website source code is adapted from here.