TAOBAO-MM Dataset
TAOBAO-MM is a large-scale recommendation dataset derived from user interaction logs on Taobao, one of the world’s largest e-commerce platforms. The dataset features historical behavior sequences of up to 1,000 interactions per user and includes high-quality multimodal embeddings for items.
Accurately modeling ultra-long user behavior sequences alongside rich multimodal item content is essential for advancing real-world recommender systems. To support and accelerate academic research in this direction, we publicly release TAOBAO-MM.
To the best of our knowledge, TAOBAO-MM is the first publicly available recommendation dataset that simultaneously provides both long user behavior sequences and high-quality multimodal embeddings for items.
Dataset Description
The table below summarizes key statistics for TAOBAO-MM:
| Dataset | Users | Items | Samples | Behavior Length |
|---|---|---|---|---|
| TAOBAO-MM | 8.79M | 35.4M | 99.0M | 1,000 |
TAOBAO-MM contains interaction logs from 8.79 million users, involving 35.4 million distinct items, resulting in 99 million labeled samples. The samples are partitioned into a training set of 76M and a test set of 23M instances, following a temporally consistent split to preserve real-world dynamics.
Due to privacy and copyright restrictions, all ID-based features have been anonymized, and raw multimodal content (e.g., item images) is not released. Instead, we provide pre-computed 128-dimensional multimodal embeddings for each item, generated by encoders trained under the Semantic-aware Contrastive Learning (SCL) framework.
The dataset is distributed as a collection of Parquet files, each containing a single structured table, as detailed below:
- train_samples.parquet: Contains the training samples. Each record includes a user ID, an item ID, and a binary label indicating whether the user clicked (1) or did not click (0) on the item, as derived from real-world interaction logs.
- test_samples.parquet: Contains the test samples, structured identically to the training samples, with the same fields.
- train_user_features.parquet: Provides demographic and behavioral features for all users appearing in the training set. Each entry includes the user ID, age, gender, city, province, and the user’s historical behavior sequence (a list of up to 1,000 item IDs and categories).
- test_user_features.parquet: Analogous to the training counterpart, this file contains demographic attributes and behavior sequences for all users present in the test set.
- item_features.parquet: Includes metadata for all items that appear in either the training or test samples. Each record consists of the item ID, category, item city, and item province.
- scl_embeddings_int8_p90.parquet: A lookup table mapping each item ID to its corresponding SCL multimodal embedding, quantized to int8 for efficiency. This embedding table covers 100% of the target items (i.e., those appearing as positive or negative candidates in the training and test samples) and 90% of the items observed in users’ historical behavior sequences.
To construct the training sample table, we perform a join operation between train_samples, train_user_features, and item_features on their respective IDs. The test sample table is generated analogously by joining test_samples with test_user_features and item_features.
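The join described above can be sketched with pandas on tiny in-memory stand-ins for the raw tables. The column names follow the field tables in this README; the use of left joins, the `join_tables` helper, and the toy values are assumptions made for illustration, not the pipeline used to produce the released shards.

```python
import pandas as pd

# Toy stand-ins for the raw tables (values are made up for illustration).
samples = pd.DataFrame({
    "129_1": [1, 1, 2],                    # user ID
    "205": [10, 11, 10],                   # item ID
    "label_0": [[1, 0], [0, 1], [0, 1]],   # one-hot label
})
user_features = pd.DataFrame({
    "129_1": [1, 2],
    "130_1": [3, 5],                       # anonymized user age
    "150_2_180": [[10, 12], [11]],         # behavior sequence (item IDs)
})
item_features = pd.DataFrame({
    "205": [10, 11, 12],
    "206": [100, 101, 100],                # item category
})


def join_tables(samples, user_features, item_features):
    """Attach user and item features to every sample row.

    Assumption: left joins keyed on user ID ("129_1") and item ID ("205").
    validate="many_to_one" raises if an ID is duplicated in a feature table.
    """
    joined = samples.merge(
        user_features, on="129_1", how="left", validate="many_to_one"
    )
    joined = joined.merge(
        item_features, on="205", how="left", validate="many_to_one"
    )
    return joined


joined = join_tables(samples, user_features, item_features)
print(joined[["129_1", "205", "130_1", "206"]])
```

The full raw tables are far larger than this toy data, so an in-memory pandas join is likely impractical at full scale; the sketch only illustrates the join keys and the resulting row layout.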
For clarity and ease of use, we also provide a detailed description of all column fields in each table.
train_samples / test_samples
| Field | Type | Description |
|---|---|---|
| 129_1 | Int64 | User ID |
| 205 | Int64 | Item ID |
| label_0 | List of Int8 | One-hot label: [0,1] indicates non-click, [1,0] indicates click |
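Because `label_0` is stored as a one-hot pair, many training loops will want a scalar click flag instead. A minimal sketch, assuming the encoding listed in the table above ([1,0] for click, [0,1] for non-click); `onehot_to_click` is a hypothetical helper name:

```python
import numpy as np


def onehot_to_click(label_0):
    """Convert one-hot labels to binary click flags.

    [1, 0] -> 1 (click), [0, 1] -> 0 (non-click), per the field table.
    """
    labels = np.asarray(label_0, dtype=np.int8)
    # Position 0 is the "click" slot, so it already equals the binary flag.
    return labels[:, 0]


clicks = onehot_to_click([[1, 0], [0, 1], [0, 1]])
```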
train_user_features / test_user_features
| Field | Type | Description |
|---|---|---|
| 129_1 | Int64 | User ID |
| 130_1 | Int64 | User age |
| 130_2 | Int64 | User gender |
| 130_3 | Int64 | User province name |
| 130_4 | Int64 | User city name |
| 130_5 | Int64 | User city level |
| 150_2_180 | List of Int64 | List of item IDs the user interacted with; the usual length is 1,000 |
| 151_2_180 | List of Int64 | List of item categories the user interacted with; the usual length is 1,000 |
item_features
| Field | Type | Description |
|---|---|---|
| 205 | Int64 | Item ID |
| 206 | Int64 | Item category |
| 213 | Int64 | Item city name |
| 214 | Int64 | Item province name |
scl_embeddings_int8_p90
| Field | Type | Description |
|---|---|---|
| 205 | Int64 | Item ID |
| 205_c | List of Int8 | SCL embeddings with shape (128) |
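One common use of these embeddings is to score how similar a target item is to the items in a user's history. The sketch below computes cosine similarity directly on the int8 vectors. It assumes the quantization is a uniform scaling with no zero-point offset, in which case cosine similarity does not depend on the unknown scale; the README does not document the quantization scheme, so treat this as an assumption. The random vectors stand in for real embeddings.

```python
import numpy as np


def cosine_similarity(query, candidates):
    """Cosine similarity between one embedding and a batch of embeddings."""
    # Cast to float32 first: products of int8 values would overflow int8.
    q = np.asarray(query, dtype=np.float32)
    c = np.asarray(candidates, dtype=np.float32)
    q_norm = np.linalg.norm(q)
    c_norm = np.linalg.norm(c, axis=1)
    # Guard against division by zero for all-zero vectors.
    denom = np.maximum(q_norm * c_norm, 1e-12)
    return (c @ q) / denom


# Synthetic int8 vectors of dimension 128, standing in for SCL embeddings.
rng = np.random.default_rng(0)
target = rng.integers(-128, 128, size=128).astype(np.int8)
history = rng.integers(-128, 128, size=(5, 128)).astype(np.int8)
history[0] = target  # an identical item should score (almost exactly) 1.0

sims = cosine_similarity(target, history)
```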
Download and Use
The TAOBAO-MM dataset is free to download for research purposes under the Apache License 2.0.
The dataset is publicly available on Hugging Face. After installing the huggingface_hub package via pip install huggingface_hub, you can download the dataset directly using the following command:
huggingface-cli download --repo-type=dataset TaoBao-MM/Taobao-MM --local-dir your/local/path
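The same download can also be scripted from Python with `huggingface_hub.snapshot_download`, which additionally allows skipping parts of the repository. The sketch below skips the `raw/` directory (the unjoined source tables described later in this section). It assumes that the repository's top-level folder names match the directory layout documented in this README; `build_download_kwargs` and `download` are hypothetical helper names.

```python
def build_download_kwargs(local_dir, include_raw=False):
    """Assemble keyword arguments for huggingface_hub.snapshot_download."""
    kwargs = {
        "repo_id": "TaoBao-MM/Taobao-MM",
        "repo_type": "dataset",
        "local_dir": local_dir,
    }
    if not include_raw:
        # Assumption: the unjoined tables live under a top-level "raw/" folder.
        kwargs["ignore_patterns"] = ["raw/*"]
    return kwargs


def download(local_dir, include_raw=False):
    """Download the dataset snapshot and return the local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**build_download_kwargs(local_dir, include_raw))


kwargs = build_download_kwargs("your/local/path")
# Call download("your/local/path") to start the actual transfer.
```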
Upon download, the dataset is organized with the following directory structure:
taobao-mm (139 GB)
├── feature_map (7.11 GB)
│ ├── 129_1_sorted_map.npy
│ ├── 130_1_sorted_map.npy
│ ├── ……
│ ├── scl_emb_int8_p90_keys.npy
│ └── scl_emb_int8_p90_values.npy
├── train (52.6 GB)
│ ├── metadata.json
│ ├── train-shard-000000.parquet
│ ├── train-shard-000001.parquet
│ ├── ……
│ └── train-shard-000161.parquet
├── test (15.6 GB)
│ ├── metadata.json
│ ├── test-shard-000000.parquet
│ ├── test-shard-000001.parquet
│ ├── ……
│ └── test-shard-000048.parquet
└── raw (64.1 GB)
  ├── train_samples.parquet
  ├── train_user_features.parquet
  ├── test_samples.parquet
  ├── test_user_features.parquet
  ├── item_features.parquet
  └── scl_embedding_int8_p90.parquet
The downloaded dataset is organized into the following four directories:
- feature_map/: Contains mapping dictionaries stored in .npy format. These include (1) mappings that remap anonymized IDs to consecutive integers in the range [0, ID_SIZE] for efficient indexing, and (2) key–value mappings linking item IDs to their corresponding SCL multimodal embeddings.
- train/: Holds the training samples in sharded Parquet format. These files are generated by joining the training-related tables from the raw/ directory.
- test/: Contains the test samples, also stored as sharded Parquet files, constructed analogously to the training set through the same join process applied to the raw test-related tables.
- raw/: Includes the original, unjoined data tables in Parquet format. This directory is provided for users who wish to reconstruct or customize the preprocessing pipeline. If you only require the pre-joined training and test samples, downloading this directory is optional.
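The internal layout of the files in feature_map/ is not documented here, so the following is only a sketch under stated assumptions: that each `*_sorted_map.npy` file holds a sorted array of anonymized IDs whose positions serve as the consecutive indices, and that the `scl_emb_int8_p90_keys.npy` / `scl_emb_int8_p90_values.npy` pair holds item IDs and their row-aligned embeddings. Small synthetic arrays stand in for the real files, and `remap_ids` and `lookup_embeddings` are hypothetical helpers. The mask returned by the lookup matters because the embedding table covers only 90% of the items in behavior sequences.

```python
import numpy as np


def remap_ids(sorted_ids, raw_ids, unknown=-1):
    """Map raw IDs to their position in a sorted ID array.

    IDs that are absent from sorted_ids map to `unknown`.
    """
    raw_ids = np.asarray(raw_ids)
    pos = np.searchsorted(sorted_ids, raw_ids)
    # searchsorted can return len(sorted_ids); clip before comparing.
    pos = np.minimum(pos, len(sorted_ids) - 1)
    found = sorted_ids[pos] == raw_ids
    return np.where(found, pos, unknown)


def lookup_embeddings(keys, values, item_ids):
    """Gather embeddings for item_ids.

    Returns (embeddings, mask); rows for uncovered items are all zeros
    and have mask == False.
    """
    order = np.argsort(keys)  # do not assume the key array is pre-sorted
    idx = remap_ids(keys[order], item_ids)
    mask = idx >= 0
    out = np.zeros((len(idx), values.shape[1]), dtype=values.dtype)
    out[mask] = values[order[idx[mask]]]
    return out, mask


# Synthetic stand-in for a *_sorted_map.npy array.
sorted_ids = np.array([3, 17, 42, 99], dtype=np.int64)
remapped = remap_ids(sorted_ids, [42, 5, 3])

# Synthetic stand-ins for the embedding key and value arrays.
keys = np.array([42, 3, 99], dtype=np.int64)
values = np.arange(3 * 128, dtype=np.int64).reshape(3, 128).astype(np.int8)
emb, mask = lookup_embeddings(keys, values, np.array([99, 7, 42]))
```

With the real files, the arrays would presumably be loaded with `np.load`; inspecting their shapes and dtypes first is the quickest way to check whether the assumptions above hold.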
To facilitate model development and evaluation with TAOBAO-MM, we provide ready-to-use Dataset class implementations in our official repository. In addition, we include implementations of several baselines for multimodal long-sequence recommendation, including our proposed method, MUSE (MUltimodal SEarch-based framework for lifelong user interest modeling), to serve as reference points for benchmarking and further research.
Contact
If you have any questions, please feel free to contact us through GitHub issues or by emailing the authors.
Citation
If you find our work useful for your research, please consider citing the paper:
@misc{wu2025musesimpleeffectivemultimodal,
title={MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling},
author={Bin Wu and Feifan Yang and Zhangming Chan and Yu-Ran Gu and Jiawei Feng and Chao Yi and Xiang-Rong Sheng and Han Zhu and Jian Xu and Mang Ye and Bo Zheng},
year={2025},
eprint={2512.07216},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2512.07216},
}