Date: 2024-01-02

Researchers from Microsoft and Columbia University have investigated how to detect hallucinations in language models, focusing on decoder-only transformer models. The goal of hallucination detection is to determine whether generated text is faithful to the input prompt or contains false information.

To this end, the team built probes: classifiers trained on a language model's internal activations. The probes aim to predict when the model is producing hallucinated content during grounded generation tasks. To train and evaluate the probes, the team built a dataset of annotated examples of both synthetic and naturally occurring (organic) hallucinations.
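
To make the idea of a probe concrete, here is a minimal sketch in plain Python. It assumes a linear (logistic regression) probe, which is a common choice for probing but is not necessarily one of the architectures the researchers propose. The "hidden states" are random toy vectors rather than real model activations, and all function names (`train_linear_probe`, `make_toy_states`, and so on) are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_linear_probe(states, labels, epochs=200, lr=0.5):
    """Fit logistic regression (weights, bias) with batch gradient descent."""
    dim = len(states[0])
    w = [0.0] * dim
    b = 0.0
    n = len(states)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return the probe's estimated probability that a state is hallucinated."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def make_toy_states(n, dim, rng):
    """Assumption: stand-in for real hidden states. States labeled 1
    (hallucinated) are shifted along one fixed direction."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 2.0 if y else -2.0
        states.append(x)
        labels.append(y)
    return states, labels

rng = random.Random(0)
train_x, train_y = make_toy_states(200, 8, rng)
test_x, test_y = make_toy_states(100, 8, rng)

w, b = train_linear_probe(train_x, train_y)
correct = sum((predict(w, b, x) > 0.5) == bool(y) for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"probe accuracy on held-out toy states: {accuracy:.2f}")
```

In a real setting, the toy vectors would be replaced by activations extracted from a chosen layer of the language model, paired with human hallucination annotations.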

The research showed that probes trained to identify artificially induced hallucinations were not effective at detecting naturally occurring ones. This suggests that probes trained on modified or synthetic data may struggle to generalize to real-world scenarios. The information about hallucination encoded in the model's hidden states was also found to depend on the task and on the data distribution.

Comparing the saliency of intrinsic and extrinsic hallucinations across tasks, layers, and types of hidden states, the team found that extrinsic hallucinations, which add content that cannot be verified against the input, are more salient in the transformer's internal representations. Two methods were used to gather hallucinations: sampling responses produced by a language model conditioned on inputs, and introducing inconsistencies into reference inputs or outputs.
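
The second collection method, introducing inconsistencies into a reference output, can be illustrated with a toy sketch. The function `make_synthetic_example`, the swap table, and the token-level labeling scheme are all assumptions made for illustration; the article does not describe the researchers' actual editing procedure or annotation granularity.

```python
def make_synthetic_example(reference_tokens, swaps):
    """Replace tokens according to `swaps` and label every replaced
    token as hallucinated (1); untouched tokens are labeled 0."""
    edited, labels = [], []
    for tok in reference_tokens:
        if tok in swaps:
            edited.append(swaps[tok])
            labels.append(1)
        else:
            edited.append(tok)
            labels.append(0)
    return edited, labels

# Hypothetical reference output that is consistent with its source input.
reference = "The restaurant is located in Paris".split()

# Swap one entity so the output now contradicts the source.
edited, labels = make_synthetic_example(reference, {"Paris": "Berlin"})
print(" ".join(edited))  # The restaurant is located in Berlin
print(labels)            # [0, 0, 0, 0, 0, 1]
```

Because edits like this are made by rule rather than produced by the model itself, the resulting examples can differ from the errors a model makes on its own.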

The team observed that the second method yielded a higher rate of hallucination annotations from human annotators. However, the resulting synthetic examples were considered less valuable because they do not match the test distribution.

The researchers summarized their key contributions as follows:

1. Creation of a dataset consisting of over 15,000 utterances, tagged for hallucinations in both natural and artificial output texts, spanning three grounded generation tasks.

2. Presentation of three probe architectures that improve the efficiency and accuracy of hallucination detection compared to existing baselines.

3. Exploration of factors influencing probe accuracy, such as the nature of hallucinations (intrinsic vs. extrinsic), model size, and specific encoding components being probed.

This study contributes to the ongoing efforts to improve the robustness and reliability of language models, shedding light on the intricate nature of hallucinations and the challenges in their detection.