TAOBAO-MM Dataset

TAOBAO-MM is a large-scale recommendation dataset derived from user interaction logs on Taobao, one of the world’s largest e-commerce platforms. The dataset features historical behavior sequences of up to 1,000 interactions per user and includes high-quality multimodal embeddings for items.

Accurately modeling ultra-long user behavior sequences alongside rich multimodal item content is essential for advancing real-world recommender systems. To support and accelerate academic research in this direction, we publicly release TAOBAO-MM.

To the best of our knowledge, TAOBAO-MM is the first publicly available recommendation dataset that simultaneously provides both long user behavior sequences and high-quality multimodal embeddings for items.

Dataset Description

The table below summarizes key statistics for TAOBAO-MM:

Dataset   | Users | Items | Samples | Behavior Length
TAOBAO-MM | 8.79M | 35.4M | 99.0M   | 1,000

TAOBAO-MM contains interaction logs from 8.79 million users, involving 35.4 million distinct items, resulting in 99 million labeled samples. The samples are partitioned into a training set of 76M and a test set of 23M instances, following a temporally consistent split to preserve real-world dynamics.

Due to privacy and copyright restrictions, all ID-based features have been anonymized, and raw multimodal content (e.g., item images) is not released. Instead, we provide pre-computed 128-dimensional multimodal embeddings for each item, generated by encoders trained under the Semantic-aware Contrastive Learning (SCL) framework.

The dataset is distributed as a collection of Parquet files, each containing a single structured table: train_samples, test_samples, train_user_features, test_user_features, item_features, and scl_embeddings_int8_p90.

To construct the training sample table, we perform a join operation between train_samples, train_user_features, and item_features on their respective IDs. The test sample table is generated analogously by joining test_samples with test_user_features and item_features.
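The join described above can be sketched in plain Python on toy rows. This is a minimal illustration rather than the official loading code: it assumes an inner join, assumes each user ID and item ID appears once in its feature table, and uses made-up values. The helper names `index_by` and `join_samples` are hypothetical.

```python
# Toy rows standing in for the Parquet tables; all values are made up.
train_samples = [
    {"129_1": 1, "205": 10, "label_0": [1, 0]},
    {"129_1": 2, "205": 11, "label_0": [0, 1]},
]
train_user_features = [
    {"129_1": 1, "130_1": 3, "150_2_180": [11, 10]},
    {"129_1": 2, "130_1": 5, "150_2_180": [10]},
]
item_features = [
    {"205": 10, "206": 100},
    {"205": 11, "206": 101},
]


def index_by(rows, key):
    """Build a lookup from a key column to its row (assumes unique keys)."""
    return {row[key]: row for row in rows}


def join_samples(samples, user_rows, item_rows):
    """Join samples with user features on 129_1 and item features on 205."""
    users = index_by(user_rows, "129_1")
    items = index_by(item_rows, "205")
    joined = []
    for sample in samples:
        user = users.get(sample["129_1"])
        item = items.get(sample["205"])
        if user is None or item is None:
            continue  # inner join (an assumption): drop samples missing a side
        joined.append({**sample, **user, **item})
    return joined


train_table = join_samples(train_samples, train_user_features, item_features)
```

On the real Parquet files the same two joins would more typically be expressed with a dataframe library, merging on `129_1` and then on `205`; the test table follows the same pattern with test_samples and test_user_features.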

For clarity and ease of use, we also provide a detailed description of all column fields in each table.

train_samples / test_samples

Field   | Type         | Description
129_1   | Int64        | User ID
205     | Int64        | Item ID
label_0 | List of Int8 | One-hot label, [0,1] indicates non-click, [1,0] indicates click
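
Many training loops expect a scalar binary target rather than a one-hot pair. A minimal sketch of the conversion, following the encoding stated in the table ([1,0] is a click, [0,1] is a non-click); the helper name `is_click` is hypothetical:

```python
def is_click(label):
    """Return True for a click ([1, 0]) and False for a non-click ([0, 1])."""
    pair = list(label)
    if pair == [1, 0]:
        return True
    if pair == [0, 1]:
        return False
    raise ValueError(f"unexpected label_0 value: {label!r}")


# Toy labels, not taken from the dataset.
labels = [[1, 0], [0, 1], [0, 1]]
binary_targets = [int(is_click(label)) for label in labels]
```

Note that the click indicator sits in the first position, so taking the second element directly would invert the target.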

train_user_features / test_user_features

Field     | Type          | Description
129_1     | Int64         | User ID
130_1     | Int64         | User age
130_2     | Int64         | User gender
130_3     | Int64         | User province name
130_4     | Int64         | User city name
130_5     | Int64         | User city level
150_2_180 | List of Int64 | List of item IDs the user interacted with; the usual length is 1,000
151_2_180 | List of Int64 | List of item categories the user interacted with; the usual length is 1,000
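
Because the behavior lists usually, but not always, have 1,000 entries, batching them generally requires a fixed length. The sketch below pads or truncates a sequence under stated assumptions: the padding ID `0` is hypothetical (check that it does not collide with a real anonymized ID), and keeping the first `max_len` entries is arbitrary because the ordering of the lists is not documented here.

```python
def fix_length(sequence, max_len=1000, pad_id=0):
    """Truncate or right-pad a behavior sequence to exactly max_len entries.

    pad_id=0 is a hypothetical padding value, not one defined by the dataset.
    """
    trimmed = list(sequence[:max_len])  # keeps the first max_len entries
    return trimmed + [pad_id] * (max_len - len(trimmed))


short = fix_length([7, 8, 9], max_len=5)
long = fix_length(list(range(10)), max_len=4)
```

The same helper would be applied to both `150_2_180` and `151_2_180` so the two lists keep equal length.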

item_features

Field | Type  | Description
205   | Int64 | Item ID
206   | Int64 | Item category
213   | Int64 | Item city name
214   | Int64 | Item province name

scl_embeddings_int8_p90

Field | Type         | Description
205   | Int64        | Item ID
205_c | List of Int8 | SCL embeddings with shape (128)
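
A common use of such embeddings is measuring item-to-item similarity. The sketch below computes cosine similarity between two 128-dimensional int8 vectors. The quantization scheme is not described here, so this assumes the int8 values can be compared directly; the toy vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length integer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy 128-dimensional vectors within the int8 range, not real embeddings.
emb_a = [3, -4] + [0] * 126
emb_b = [6, -8] + [0] * 126   # same direction as emb_a
emb_c = [4, 3] + [0] * 126    # orthogonal to emb_a

sim_same = cosine_similarity(emb_a, emb_b)
sim_orth = cosine_similarity(emb_a, emb_c)
```

Python integers do not overflow, but when the same computation is done on NumPy int8 arrays the vectors should be cast to a wider type before multiplying.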

Download and Use

The TAOBAO-MM dataset is free to download for research purposes under the Apache License 2.0.

The dataset is now publicly available on HuggingFace. After installing the huggingface_hub package via pip install huggingface_hub, you can download the dataset directly using the following command:

huggingface-cli download --repo-type=dataset TaoBao-MM/Taobao-MM --local-dir your/local/path
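
The same download can also be scripted from Python with `snapshot_download` from huggingface_hub. This is a sketch, not an official recipe: the wrapper name `download_taobao_mm` is hypothetical, and the `raw/*` pattern in the example assumes the repository layout matches the directory structure listed on this page.

```python
def download_taobao_mm(local_dir, patterns=None):
    """Download the dataset repository, or a subset of it, into local_dir."""
    # Imported lazily so this module loads even without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="TaoBao-MM/Taobao-MM",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns,  # None downloads every file
    )


# Example (not executed here): fetch only the raw Parquet tables.
# download_taobao_mm("your/local/path", patterns=["raw/*"])
```

Restricting the patterns can be useful given the total size of the dataset, since only the matching files are transferred.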

Upon download, the dataset is organized with the following directory structure:

taobao-mm (139 GB)
├── feature_map (7.11 GB)
│   ├── 129_1_sorted_map.npy
│   ├── 130_1_sorted_map.npy
│   ├── ……
│   ├── scl_emb_int8_p90_keys.npy
│   └── scl_emb_int8_p90_values.npy
├── train (52.6 GB)
│   ├── metadata.json
│   ├── train-shard-000000.parquet
│   ├── train-shard-000001.parquet
│   ├── ……
│   └── train-shard-000161.parquet
├── test (15.6 GB)
│   ├── metadata.json
│   ├── test-shard-000000.parquet
│   ├── test-shard-000001.parquet
│   ├── ……
│   └── test-shard-000048.parquet
└── raw (64.1 GB)
    ├── train_samples.parquet
    ├── train_user_features.parquet
    ├── test_samples.parquet
    ├── test_user_features.parquet
    ├── item_features.parquet
    └── scl_embedding_int8_p90.parquet

The downloaded dataset is organized into four directories: feature_map, train, test, and raw.
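
The shard files in train and test can be enumerated in order with a small helper. This is a minimal sketch that relies only on the file naming shown in the directory listing; the helper name `list_shards` is hypothetical, and the demo runs against empty placeholder files in a temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path


def list_shards(root, split):
    """Return the sorted shard paths for a split ('train' or 'test')."""
    split_dir = Path(root) / split
    # Zero-padded shard numbers make lexicographic order match numeric order.
    return sorted(split_dir.glob(f"{split}-shard-*.parquet"))


# Demo on placeholder files that mimic the layout.
with tempfile.TemporaryDirectory() as root:
    train_dir = Path(root) / "train"
    train_dir.mkdir()
    for i in (1, 0):
        (train_dir / f"train-shard-{i:06d}.parquet").touch()
    (train_dir / "metadata.json").touch()
    shard_names = [path.name for path in list_shards(root, "train")]
```

Each returned path can then be opened with a Parquet reader. The shard schema is not documented on this page, so it is worth inspecting one shard before building a pipeline on it.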

To facilitate model development and evaluation with TAOBAO-MM, we provide ready-to-use Dataset class implementations in our official repository. We also include implementations of several baselines for multimodal long-sequence recommendation, including our proposed method, MUSE (MUltimodal SEarch-based framework for lifelong user interest modeling), to serve as reference points for benchmarking and further research.

Contact

If you have any questions, please feel free to contact us through GitHub issues or by emailing the authors.

Citation

If you find our work useful for your research, please consider citing the paper:

@misc{wu2025musesimpleeffectivemultimodal,
      title={MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling}, 
      author={Bin Wu and Feifan Yang and Zhangming Chan and Yu-Ran Gu and Jiawei Feng and Chao Yi and Xiang-Rong Sheng and Han Zhu and Jian Xu and Mang Ye and Bo Zheng},
      year={2025},
      eprint={2512.07216},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.07216}, 
}