# Intent Annotation of Recommendation Dialogue (IARD) Dataset

## Introduction

The IARD dataset is a labeled recommendation dialogue dataset (the raw dialogues are from ReDial [1]). It contains 336 multi-turn recommendation dialogues with 4,583 utterances, annotated with user intents and recommender actions at the utterance level. In the following, we introduce the dialogue data we collected and processed, the two taxonomies we established for user intents and recommender actions, respectively, and the resulting IARD dataset (more details can be found in [2]).

## Recommendation Dialogue Data

The raw recommendation dialogues are from the ReDial dataset, a publicly available dataset centered around movie recommendations [1]. It was collected by a team of researchers from Polytechnique Montréal, MILA - Quebec AI Institute, Microsoft Research Montréal, HEC Montréal, and Element AI.

Specifically, the ReDial dataset was collected through an interface where pairs of workers (from Amazon Mechanical Turk) accomplished a movie recommendation task using natural language. In each pair, one worker played the role of "seeker", looking for movies of interest, and the other played the role of "recommender", responsible for giving recommendations to the seeker. Every conversation session involved at least four movies, and at the end both the seeker and the recommender were asked reflective questions about each mentioned movie (e.g., "Was the movie suggested by the recommender?", "Has the seeker seen the movie?", "Did the seeker like the movie suggestion?"), which can be used to check the consistency of their responses.

We processed the raw dialogue data in ReDial (which contains 11,348 dialogues) by performing the following four steps:

1. We filtered out dialogues that contain fewer than three dialogue turns (one dialogue turn denotes a consecutive utterance-response pair, where the utterance is from the seeker and the response is from the recommender) or fewer than four different recommended movies.
2. We removed dialogues with inconsistent answers from seekers and recommenders to the post-conversation reflective questions.
3. We then randomly sampled satisfactory recommendation dialogues (SAT-Dial), in which one recommended movie was not liked by the seeker but a subsequent one was accepted by her/him. These dialogues are intended to capture the seeker's feedback intents when s/he was not satisfied with a recommendation, as well as the actions taken by the human recommender that helped the seeker find a satisfactory item later.
4. We also sampled unsatisfactory recommendation dialogues (unSAT-Dial) by choosing dialogues in which no recommendation was accepted by the seeker. These dialogues can be useful for detecting what kinds of interaction may lead to unsuccessful recommendations.

Finally, we obtained 253 satisfactory dialogues (SAT-Dial) and 83 unsatisfactory dialogues (unSAT-Dial) (see Table 1 for the statistics).

Table 1: Statistics of our selected dialogue data (from ReDial)

| Dialogue type | # Dialogues |
| ------------- | ----------- |
| SAT-Dial      | 253         |
| unSAT-Dial    | 83          |
| Total         | 336         |
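For illustration, the first two processing steps can be sketched in Python. This is a minimal sketch, not our actual processing script: it assumes the publicly released ReDial JSON-lines format, with per-dialogue fields such as `messages` (each message carrying a `senderWorkerId`), `movieMentions`, `initiatorQuestions`, and `respondentQuestions`; the file name is hypothetical, and `movieMentions` covers mentioned (not strictly recommended) movies.

```python
import json

def load_redial(path):
    """Load ReDial dialogues from a JSON-lines file (one dialogue per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def count_turns(messages):
    """Approximate the number of dialogue turns (seeker-utterance /
    recommender-response pairs) by counting speaker changes."""
    switches = sum(
        1 for prev, cur in zip(messages, messages[1:])
        if prev["senderWorkerId"] != cur["senderWorkerId"]
    )
    return (switches + 1) // 2

def answers_consistent(dialogue):
    """Step 2: keep a dialogue only if the seeker and the recommender gave
    the same answers to the reflective questions for every mentioned movie."""
    init = dialogue["initiatorQuestions"]
    resp = dialogue["respondentQuestions"]
    return init.keys() == resp.keys() and all(init[m] == resp[m] for m in init)

dialogues = load_redial("redial_dialogues.jsonl")   # hypothetical file name
kept = [
    d for d in dialogues
    if count_turns(d["messages"]) >= 3     # step 1: at least 3 dialogue turns
    and len(d["movieMentions"]) >= 4       # step 1: at least 4 different movies
    and answers_consistent(d)              # step 2: consistent reflective answers
]
print(f"{len(kept)} of {len(dialogues)} dialogues kept")
```

The turn counting and consistency check are deliberate approximations of steps 1 and 2; the sampling in steps 3 and 4 additionally requires the per-movie "liked" answers from the reflective questions.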

## Taxonomies for User Intents and Recommender Actions

We examined the selected recommendation dialogues to understand the language interaction between users (seekers) and human recommenders. Based on this analysis, we developed two hierarchical taxonomies, one for user intents and one for recommender actions, using a grounded theory approach (see more details in [2]).

### Taxonomy for User Intents

The taxonomy for user intents classifies the types of utterances produced by recommendation seekers. It consists of 3 top-level intents (i.e., Ask for Recommendation, Add Details, and Give Feedback) and 15 sub-intents (see Table 2).

Table 2: Taxonomy for user intents

### Taxonomy for Recommender Actions

From the recommenders' perspective, we characterized their behavior with 4 top-level actions (i.e., Request, Respond, Recommend, and Explain) and 9 sub-actions (see Table 3).

Table 3: Taxonomy for recommender actions

## IARD Dataset

### Data Annotation

After the two taxonomies were established, we asked two annotators to label all of the selected dialogue data (see Table 1). Concretely, for each seeker utterance or recommender response, the annotator was encouraged to choose all suitable code(s) that s/he thought could represent the seeker's intent(s) or the recommender's action(s). The two annotators first independently labeled 30 random dialogues, and then met to discuss and resolve any disagreements to ensure annotation quality and consistency, before labeling the remaining dialogues. For all of the labeled dialogues, the average inter-rater agreement scores (Cohen's kappa [3]) across the 15 sub-intents and the 9 sub-actions are 0.75 (min=0.50, max=0.95) and 0.82 (min=0.50, max=0.96), respectively, which indicates satisfactory agreement in terms of interrater reliability [4].
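As a rough illustration (not our exact evaluation script), per-label agreement of this kind can be computed by treating each sub-intent as a binary decision per utterance and applying `cohen_kappa_score` from scikit-learn; the label vectors below are made up:

```python
from sklearn.metrics import cohen_kappa_score

# Made-up binary labels for one sub-intent (e.g., "REJ"): whether each of
# ten utterances was assigned that code by each annotator (1 = assigned).
annotator_a = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]
annotator_b = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")

# Repeating this per label and averaging over the 15 sub-intents (and,
# separately, the 9 sub-actions) yields mean agreement scores of the kind
# reported above.
```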

### Data Fields

• conversation_id: a unique id for the conversation (consistent with the "conversation id" in the original ReDial dataset [1])

• accepted_recommendation: an array that contains the positions of utterances where the seeker accepted the recommended item(s) suggested by the recommender (if there is no accepted recommendation, this is an empty array)

• dialogue_info: all of the utterances in the current conversation. For each utterance, there is a unique id (i.e., utterance_id) and its associated information (e.g., utterance position, utterance text).

• utterance_id: a unique id for an utterance, e.g., "S1" or "R2". Note that the id of a seeker utterance starts with 'S', the id of a recommender response starts with 'R', and the number refers to the utterance position in the conversation.

• utterance_pos: the utterance position in the conversation
• worker_id: the id of the worker in AMT (according to the original ReDial dataset [1])
• role: the role of the worker (i.e., "recommender" or "seeker")
• utterance_text: the utterance's content
• top-level intent/action: the top-level intent(s)/action(s) labeled for the current utterance (which can be either a seeker utterance or a recommender response)
• sub-intent/action: the sub-intent(s)/action(s) labeled for the current utterance (which can be either a seeker utterance or a recommender response)

Note:

• For each movie mentioned in the conversation, we include both the movie_id and movie_name in the utterance_text (e.g., "Me too. Another good one is @124485 <Spaceballs (1987)>.").
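Such mentions can be extracted from `utterance_text` with a simple regular expression; the following is a minimal sketch assuming all mentions follow the `@movie_id <movie_name>` pattern shown above:

```python
import re

# Matches "@movie_id <movie_name>", e.g. "@124485 <Spaceballs (1987)>".
MOVIE_PATTERN = re.compile(r"@(\d+) <([^>]+)>")

text = "Me too. Another good one is @124485 <Spaceballs (1987)>."
for movie_id, movie_name in MOVIE_PATTERN.findall(text):
    print(movie_id, movie_name)   # -> 124485 Spaceballs (1987)
```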

### Example Data Format

"1057": {  "accepted_recommendation": [11],  "dialogue_info": {    "S1": {...}    ...    "R6": {      "utterance_pos": 6,      "worker_id": 66,      "role": "recommender",      "utterance_text": "Me too. Another good one is @124485 <Spaceballs (1987)> ",      "top-level intent/action": ["Recommend","Explain"],      "sub-intent/action": ["REC-S","EXP-I"]    },    "S7": {      "utterance_pos": 7,      "worker_id": 14,      "role": "seeker",      "utterance_text": "I did see that one, but I didn't really like it...I do love 80s movies though",      "top-level intent/action": ["GiveFeedback"],      "sub-intent/action": ["REJ","CRI-A"]    },    ...  }}

## Citation

If you use the IARD dataset in your research work, please cite the following paper:

• Wanling Cai and Li Chen. 2020. Predicting User Intents and Satisfaction with Dialogue-based Conversational Recommendations. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (UMAP '20), July 14-17, 2020.

Bibtex entry:

```bibtex
@inproceedings{IARD,
  author    = {Wanling Cai and Li Chen},
  title     = {Predicting User Intents and Satisfaction with Dialogue-based Conversational Recommendations},
  booktitle = {Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization},
  series    = {UMAP '20},
  year      = {2020},
}
```

The IARD dataset can be used for any research purpose under the following conditions:

• The user may not imply any endorsement from the authors of the paper or Hong Kong Baptist University.
• The user must acknowledge the use of the IARD dataset in her/his publications (see above for the citation information).
• The user may redistribute the dataset as long as it is distributed under the same license conditions.
• The user cannot use this dataset for any commercial or revenue-bearing purposes without obtaining permission from the authors of [2] (i.e., Wanling Cai and Li Chen).

Neither Hong Kong Baptist University nor any of the researchers involved can guarantee the correctness of the data, its suitability for any particular purpose, or the validity of results based on the use of the dataset.

In no event shall Hong Kong Baptist University and its affiliates or employees be liable to you for any damages arising out of the use or inability to use the data (including but not limited to loss of data or data being rendered inaccurately).

If you have any questions, please feel free to send us an email (cswlcai@comp.hkbu.edu.hk and lichen@comp.hkbu.edu.hk).

## Acknowledgement

This work was partially supported by Hong Kong Baptist University IRCMS Project (IRCMS/19-20/D05). We also thank Ms. Yangyang Zheng for her assistance in annotating.

## References

[1] Raymond Li, Samira Ebrahimi Kahou, Hannes Schulz, Vincent Michalski, Laurent Charlin, and Chris Pal. 2018. Towards Deep Conversational Recommendations. In Advances in Neural Information Processing Systems 31 (NIPS '18). 9748-9758.

[2] Wanling Cai and Li Chen. 2020. Predicting User Intents and Satisfaction with Dialogue-based Conversational Recommendations. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization (UMAP '20), July 14-17, 2020.

[3] Jacob Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37-46.

[4] Mary L McHugh. 2012. Interrater Reliability: the Kappa Statistic. Biochemia Medica 22, 3 (2012), 276-282.