TAOBAO-MM Dataset

TAOBAO-MM is a large-scale recommendation dataset derived from user interaction logs on Taobao, one of the world’s largest e-commerce platforms. The dataset features historical behavior sequences of up to 1,000 interactions per user and includes high-quality multimodal embeddings for items.

Accurately modeling ultra-long user behavior sequences alongside rich multimodal item content is essential for advancing real-world recommender systems. To support and accelerate academic research in this direction, we publicly release TAOBAO-MM.

To the best of our knowledge, TAOBAO-MM is the first publicly available recommendation dataset that provides both long user behavior sequences and high-quality multimodal embeddings for items.

Dataset Description

The table below summarizes key statistics for TAOBAO-MM:

Dataset	Users	Items	Samples	Behavior Length
TAOBAO-MM	8.79M	35.4M	99.0M	1,000

TAOBAO-MM contains interaction logs from 8.79 million users, involving 35.4 million distinct items, resulting in 99 million labeled samples. The samples are partitioned into a training set of 76M and a test set of 23M instances, following a temporally consistent split to preserve real-world dynamics.

Due to privacy and copyright restrictions, all ID-based features have been anonymized, and raw multimodal content (e.g., item images) is not released. Instead, we provide pre-computed 128-dimensional multimodal embeddings for each item, generated by encoders trained under the Semantic-aware Contrastive Learning (SCL) framework.

The dataset is distributed as a collection of Parquet files, each containing a single structured table, as detailed below:

train_samples.parquet: Contains the training samples. Each record includes a user ID, an item ID, and a binary click label derived from real-world interaction logs. The label is stored as a one-hot list (label_0): [0,1] indicates that the user clicked on the item and [1,0] indicates that the user did not.
test_samples.parquet: Contains the test samples, structured identically to the training samples, with the same fields.
train_user_features.parquet: Provides demographic and behavioral features for all users appearing in the training set. Each entry includes the user ID, age, gender, province, city, city level, and the user’s historical behavior sequence (lists of up to 1,000 item IDs and their categories).
test_user_features.parquet: Analogous to the training counterpart, this file contains demographic attributes and behavior sequences for all users present in the test set.
item_features.parquet: Includes metadata for all items that appear in either the training or test samples. Each record consists of the item ID, category, item city, and item province.
scl_embeddings_int8_p90.parquet: A lookup table mapping each item ID to its corresponding SCL multimodal embedding, quantized to int8 for efficiency. This embedding table covers 100% of the target items (i.e., those appearing as positive or negative candidates in the training and test samples) and 90% of the items observed in users’ historical behavior sequences.

To construct the training sample table, we perform a join operation between train_samples, train_user_features, and item_features on their respective IDs. The test sample table is generated analogously by joining test_samples with test_user_features and item_features.
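
The join described above can be sketched with pandas on toy data. This is a minimal illustration, not the official preprocessing code: the column names come from the field tables in this document, while the toy values, the helper name join_samples, and the choice of a left join are assumptions. For the real tables, which are tens of gigabytes, an out-of-core engine may be more practical than in-memory pandas.

```python
import pandas as pd

# Toy stand-ins for the raw tables. With the real data these would be read
# from raw/*.parquet (e.g. via pd.read_parquet), memory permitting.
samples = pd.DataFrame({
    "129_1": [10, 10, 11],                 # user ID
    "205": [100, 101, 100],                # item ID
    "label_0": [[0, 1], [1, 0], [1, 0]],   # one-hot click label
})
user_features = pd.DataFrame({
    "129_1": [10, 11],
    "130_1": [3, 5],                       # anonymized user age
    "150_2_180": [[100, 102], [101]],      # behavior sequence (item IDs)
})
item_features = pd.DataFrame({
    "205": [100, 101],
    "206": [7, 8],                         # item category
})


def join_samples(samples, user_features, item_features):
    """Attach user and item features to each (user, item, label) sample.

    A left join keeps every sample row even if a feature row is missing;
    the join type is an assumption, not stated in the dataset description.
    """
    joined = samples.merge(user_features, on="129_1", how="left")
    joined = joined.merge(item_features, on="205", how="left")
    return joined


joined = join_samples(samples, user_features, item_features)
print(joined.columns.tolist())
```

Each output row carries the sample's label together with the user's demographics and behavior sequence and the target item's metadata.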

For clarity and ease of use, we also provide a detailed description of all column fields in each table.

train_samples / test_samples

Field	Type	Description
129_1	Int64	User ID
205	Int64	Item ID
label_0	List of Int8	One-hot label, [1,0] indicates non-click, [0,1] indicates click
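
Since label_0 is stored as a one-hot pair, most training code will want a scalar 0/1 target. A minimal sketch of that conversion, assuming the labels of a batch have been collected into a list of pairs (the helper name one_hot_to_binary is illustrative):

```python
import numpy as np


def one_hot_to_binary(label_0):
    """Map [1, 0] -> 0 (non-click) and [0, 1] -> 1 (click)."""
    arr = np.asarray(label_0, dtype=np.int8)
    # The second position is the click indicator, per the field table.
    return arr[:, 1]


labels = one_hot_to_binary([[1, 0], [0, 1], [0, 1]])
print(labels.tolist(), "click rate:", float(labels.mean()))
```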

train_user_features / test_user_features

Field	Type	Description
129_1	Int64	User ID
130_1	Int64	User age
130_2	Int64	User gender
130_3	Int64	User province name
130_4	Int64	User city name
130_5	Int64	User city level
150_2_180	List of Int64	List of item IDs the user interacted with; the length is typically 1,000
151_2_180	List of Int64	List of item categories the user interacted with; the length is typically 1,000
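
Because the behavior lists are typically, but not necessarily, 1,000 entries long, batching usually requires a fixed length plus a validity mask. The sketch below is one way to do this under stated assumptions: the padding value 0, the choice to keep the first max_len entries, and the helper name pad_sequence are all illustrative, and the document does not specify how the sequences are ordered.

```python
import numpy as np

MAX_LEN = 1000  # maximum behavior length stated in this document


def pad_sequence(seq, max_len=MAX_LEN, pad_value=0):
    """Truncate or right-pad `seq` to `max_len` and return (padded, mask).

    The mask marks real entries, so a pad_value that collides with a valid
    ID (such as 0) can still be told apart from genuine history.
    """
    seq = np.asarray(seq, dtype=np.int64)[:max_len]
    padded = np.full(max_len, pad_value, dtype=np.int64)
    padded[: len(seq)] = seq
    mask = np.zeros(max_len, dtype=bool)
    mask[: len(seq)] = True
    return padded, mask


padded, mask = pad_sequence([42, 7, 13], max_len=5)
print(padded.tolist(), mask.tolist())
```

The same function can be applied to 150_2_180 and 151_2_180 so that item IDs and categories share one mask.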

item_features

Field	Type	Description
205	Int64	Item ID
206	Int64	Item category
213	Int64	Item city name
214	Int64	Item province name

scl_embeddings_int8_p90

Field	Type	Description
205	Int64	Item ID
205_c	List of Int8	SCL embeddings with shape (128)
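
The embedding table does not cover every item in users' histories, so a lookup has to handle missing IDs. The following sketch uses random toy data in place of the real table and returns zero vectors plus a found-mask for uncovered items; the helper names and the zero-vector fallback are assumptions. The quantization scale is not given in this document, so the example only computes cosine similarity, which is unaffected by a uniform positive scale (assuming symmetric quantization with a zero point of 0).

```python
import numpy as np

EMB_DIM = 128  # embedding size stated in this document


def build_lookup(item_ids, embeddings):
    """Sort the table by item ID so it can be searched with binary search."""
    order = np.argsort(item_ids)
    return item_ids[order], embeddings[order]


def lookup(sorted_ids, sorted_embs, query_ids):
    """Return (embs, found); rows for IDs not in the table are all zeros."""
    query_ids = np.asarray(query_ids, dtype=np.int64)
    pos = np.searchsorted(sorted_ids, query_ids)
    pos = np.clip(pos, 0, len(sorted_ids) - 1)
    found = sorted_ids[pos] == query_ids
    embs = np.zeros((len(query_ids), sorted_embs.shape[1]), dtype=np.float32)
    embs[found] = sorted_embs[pos[found]].astype(np.float32)
    return embs, found


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy table: three items with random int8 embeddings.
rng = np.random.default_rng(0)
table_ids = np.array([205, 17, 99], dtype=np.int64)
table_embs = rng.integers(-127, 127, size=(3, EMB_DIM)).astype(np.int8)

sorted_ids, sorted_embs = build_lookup(table_ids, table_embs)
embs_out, found = lookup(sorted_ids, sorted_embs, [17, 1234, 205])
print(found.tolist(), cosine(embs_out[0], embs_out[2]))
```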

Download and Use

The TAOBAO-MM dataset is free to download for research purposes under the Apache License 2.0.

The dataset is publicly available on Hugging Face. After installing the huggingface_hub package via pip install huggingface_hub, you can download the dataset directly using the following command:

huggingface-cli download --repo-type=dataset TaoBao-MM/Taobao-MM --local-dir your/local/path
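
The same download can be scripted from Python with snapshot_download from huggingface_hub, which also allows fetching only selected directories. This is a sketch under assumptions: it presumes the repository's top-level layout matches the directory names listed in this document (feature_map, train, test, raw), and the helper names are illustrative.

```python
def build_allow_patterns(subdirs):
    """Turn directory names into glob patterns for selective download."""
    return [f"{d}/*" for d in subdirs]


def download_taobao_mm(local_dir, subdirs=("feature_map", "train", "test")):
    """Download the chosen directories of the dataset repository.

    By default the raw/ directory is skipped. The directory names are
    assumed to match the layout described in this document.
    """
    # Imported lazily so the helper above works without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="TaoBao-MM/Taobao-MM",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=build_allow_patterns(subdirs),
    )


# Example (starts a large download, so it is left commented out):
# download_taobao_mm("your/local/path")
print(build_allow_patterns(("feature_map", "train", "test")))
```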

Upon download, the dataset is organized with the following directory structure:

taobao-mm (139 GB)
├── feature_map (7.11 GB)
│   ├── 129_1_sorted_map.npy
│   ├── 130_1_sorted_map.npy
│   ├── ……
│   ├── scl_emb_int8_p90_keys.npy
│   └── scl_emb_int8_p90_values.npy
├── train (52.6 GB)
│   ├── metadata.json
│   ├── train-shard-000000.parquet
│   ├── train-shard-000001.parquet
│   ├── ……
│   └── train-shard-000160.parquet
├── test (15.6 GB)
│   ├── metadata.json
│   ├── test-shard-000000.parquet
│   ├── test-shard-000001.parquet
│   ├── ……
│   └── test-shard-000048.parquet
└── raw (64.1 GB)
    ├── train_samples.parquet
    ├── train_user_features.parquet
    ├── test_samples.parquet
    ├── test_user_features.parquet
    ├── item_features.parquet
    └── scl_embedding_int8_p90.parquet

The downloaded dataset is organized into the following four directories:

feature_map/: Contains mapping dictionaries stored in .npy format. These include (1) mappings that remap anonymized IDs to consecutive integers in the range [0, ID_SIZE] for efficient indexing, and (2) key–value mappings linking item IDs to their corresponding SCL multimodal embeddings.
train/: Holds the training samples in sharded Parquet format. These files are generated by joining the training-related tables from the raw/ directory.
test/: Contains the test samples, also stored as sharded Parquet files, constructed analogously to the training set through the same join process applied to the raw test-related tables.
raw/: Includes the original, unjoined data tables in Parquet format. This directory is provided for those who wish to reconstruct or customize the preprocessing pipeline. If you only require the pre-joined training and test samples, downloading this directory is optional.
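
As an illustration of how an ID mapping from feature_map/ might be applied, the sketch below remaps raw anonymized IDs to consecutive integers with a binary search. It assumes that each *_sorted_map.npy file is a one-dimensional, ascending array of raw IDs whose positions serve as the consecutive IDs; the file layout is not specified in this document, so verify it against the official repository before relying on this. The helper name remap_ids and the -1 marker for unknown IDs are likewise illustrative.

```python
import numpy as np


def remap_ids(sorted_map, raw_ids, unknown=-1):
    """Map raw IDs to their positions in `sorted_map` (assumed ascending).

    IDs that do not occur in the map are replaced by `unknown`.
    """
    raw_ids = np.asarray(raw_ids, dtype=sorted_map.dtype)
    pos = np.searchsorted(sorted_map, raw_ids)
    pos = np.clip(pos, 0, len(sorted_map) - 1)
    hit = sorted_map[pos] == raw_ids
    return np.where(hit, pos, unknown)


# Toy stand-in for np.load("feature_map/129_1_sorted_map.npy").
sorted_map = np.array([1003, 5210, 77001, 90412], dtype=np.int64)
dense = remap_ids(sorted_map, [77001, 1003, 42])
print(dense.tolist())
```

Consecutive IDs of this kind can index an embedding matrix directly, which avoids hashing large anonymized integers at training time.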

To facilitate model development and evaluation with TAOBAO-MM, we provide ready-to-use Dataset class implementations in our official repository. In addition, we include implementations of several baselines for multimodal long-sequence recommendation, including our proposed method, MUSE (MUltimodal SEarch-based framework for lifelong user interest modeling), to serve as reference points for benchmarking and further research.

Contact

If you have any questions, please feel free to contact us through GitHub issues or by emailing the authors.

Citation

If you find our work useful for your research, please consider citing the paper:

@misc{wu2025musesimpleeffectivemultimodal,
      title={MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling}, 
      author={Bin Wu and Feifan Yang and Zhangming Chan and Yu-Ran Gu and Jiawei Feng and Chao Yi and Xiang-Rong Sheng and Han Zhu and Jian Xu and Mang Ye and Bo Zheng},
      year={2025},
      eprint={2512.07216},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2512.07216}, 
}