Distinguished Architect
Natural Language Processing Department at Baidu Inc.

I have been a distinguished architect at Baidu NLP since December 2017. Before that, I was a researcher at Microsoft Research Asia (MSRA) from September 2014 to December 2017. I obtained my Ph.D. degree in computer science from Harbin Institute of Technology (HIT) in September 2014, under the supervision of Prof. Hsiao-Wuen Hon (MSRA), Prof. Ting Liu (HIT) and Dr. Chin-Yew Lin (MSRA).

Please contact me via legendarydan (at) gmail (dot) com
My Sina Weibo (in Chinese) and Twitter (in English)

News

Please email me your resume (for internships or FTE positions) if you are interested in working with us on augmented large language models, including autonomous agents, question answering, automated numerical reasoning and theorem proving. Experience with machine learning (including but not limited to deep learning) for NLP is preferred.
May 2023: Our work on generative retrieval 📚 TOME was accepted by ACL 2023.
Oct 2022: DuReader_retrieval and DuQM were accepted by EMNLP 2022.
Mar 2022: We released a large-scale Chinese dataset DuReader_retrieval for passage retrieval.
Sep 2021: We released a dataset DuQM for robust question matching.
Aug 2021: 🚀 RocketQAv2 was accepted by EMNLP 2021.
Aug 2021: Dr. Hua Wu and I gave a talk, "Benchmarks: An Industry Perspective", at the ACL 2021 Workshop on Benchmarking: Past, Present and Future (BPPF).
May 2021: 💑 PAIR and DuReader_robust were accepted by ACL 2021.
Apr 2021: We released a dataset DuReader_checklist for robust machine reading comprehension.
Mar 2021: 🚀 RocketQAv1 was accepted by NAACL 2021.
Oct 2020: We proposed RocketQAv1, an optimized training approach to dense passage retrieval for open-domain question answering. RocketQA ranked 1st on the leaderboard of the MSMARCO Passage Ranking Task. It was featured in zh-cn and en-us.
Aug 2020: We launched LUGE（千言）, an open-source project of Chinese NLP benchmarks.
Apr 2020: We released a dataset DuReader_robust for robust machine reading comprehension.
Nov 2019: Our machine reading comprehension system D-NET ranked first in the MRQA 2019 Shared Task, which tests whether MRC systems can generalize beyond the datasets on which they were trained. D-NET was featured in zh-cn (1, 2) and en-us.

Papers [Google Scholar]

Investigating the Factual Knowledge Boundary of Large Language Models with Retrieval Augmentation , Preprint
Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, Haifeng Wang

TOME: A Two-stage Approach for Model-based Retrieval , ACL 2023
Ruiyang Ren, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, Haifeng Wang

SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases , ACL 2023 Workshop on Trustworthy Natural Language Processing (TrustNLP)
Yanchen Liu, Jing Yan, Yan Chen, Jing Liu, Hua Wu

Dense Text Retrieval based on Pretrained Language Models: A Survey , Preprint
Wayne Xin Zhao, Jing Liu, Ruiyang Ren, Ji-rong Wen

DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine , EMNLP 2022
Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, Haifeng Wang
[data] [code/models] [regular leaderboard (open)] [LIC2022 - leaderboard (closed)]

DuReader_vis: A Chinese Dataset for Open-domain Document Visual Question Answering , Findings of ACL 2022
Le Qi, Shangwen Lv, Hongyu Li, Jing Liu, Yu Zhang, Qiaoqiao She, Hua Wu, Haifeng Wang, Ting Liu
[data]

DuQM: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models , EMNLP 2022
Hongyu Zhu, Yan Chen, Jing Yan, Jing Liu, Yu Hong, Ying Chen, Hua Wu and Haifeng Wang
[data] [code/models] [regular leaderboard (open)] [CCF BDCI 2021 - leaderboard (closed)]

RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking , EMNLP 2021
Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen
[blog] [code/models]

PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval , Findings of ACL 2021
Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng Wang and Ji-Rong Wen
[code/models]

RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering, NAACL 2021
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu and Haifeng Wang
[blog] [code/models]

DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications, ACL 2021
Hongxuan Tang, Hongyu Li, Jing Liu, Yu Hong, Hua Wu and Haifeng Wang
[data] [code/models] [regular leaderboard (open)] [LIC2020 - leaderboard (closed)]

A Robust Adversarial Training Approach to Machine Reading Comprehension, AAAI 2020
Kai Liu, Xin Liu, An Yang, Jing Liu, Jinsong Su, Sujian Li and Qiaoqiao She

CoKE: Contextualized Knowledge Graph Embedding, preprint, 2019
Quan Wang, Pingping Huang, Haifeng Wang, Songtai Dai, Wenbin Jiang, Jing Liu, Yajuan Lyu, Yong Zhu, Hua Wu
[code/models]

D-NET: A Simple Framework for Improving the Generalization of Machine Reading Comprehension, EMNLP 2019 Workshop on Machine Reading for Question Answering (MRQA)
Hongyu Li, Xiyuan Zhang, Yibing Liu, Yiming Zhang, Quan Wang, Xiangyang Zhou, Jing Liu, Hua Wu and Haifeng Wang
[blog] [code/models] [leaderboard]

Enhancing Pre-trained Language Representations with Rich Knowledge for Machine Reading Comprehension, ACL 2019
An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She and Sujian Li
[blog] [code/models]

Towards Robust Neural Machine Reading Comprehension via Question Paraphrases, IALP 2019
Ying Li, Hongyu Li and Jing Liu

Towards Time-Aware Distant Supervision for Relation Extraction, preprint, 2019
Tianwen Jiang, Sendong Zhao, Jing Liu, Jin-Ge Yao, Ming Liu, Bing Qin, Ting Liu, Chin-Yew Lin

Answer-focused and Position-aware Neural Question Generation, EMNLP 2018
Xingwu Sun, Jing Liu, Yajuan Lyu, Yanjun Ma and Shi Wang

Aggregated Semantic Matching for Short Text Entity Linking, CoNLL 2018
Feng Nie, Shuyan Zhou, Jing Liu, Jinpeng Wang, Chin-Yew Lin and Rong Pan

Neural Math Word Problem Solver with Reinforcement Learning, COLING 2018
Danqing Huang, Jing Liu, Chin-Yew Lin and Jian Yin

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications, ACL 2018 Workshop on Machine Reading for Question Answering (MRQA)
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, Haifeng Wang
[data] [code/models] [leaderboard (open)]

Adaptations of ROUGE and BLEU to Better Evaluate Machine Reading Comprehension Task, ACL 2018 Workshop on Machine Reading for Question Answering (MRQA)
An Yang, Kai Liu, Jing Liu, Yajuan Lyu, Sujian Li

Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification, ACL 2018
Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, Haifeng Wang

Revisiting Distant Supervision for Relation Extraction, LREC 2018
Tingsong Jiang, Jing Liu and Chin-Yew Lin
[data]

A Statistical Framework for Product Description Generation, IJCNLP 2017
Jinpeng Wang, Yutai Hou, Jing Liu, Yunbo Cao and Chin-Yew Lin

News Citation Recommendation with Implicit and Explicit Semantics, ACL 2016
Hao Peng, Jing Liu and Chin-Yew Lin

Knowledge Base Completion via Coupled Path Ranking, ACL 2016
Quan Wang, Jing Liu, Yuanfei Luo, Bin Wang and Chin-Yew Lin

RBPB: Regularization-Based Pattern Balancing Method for Event Extraction, ACL 2016
Lei Sha, Jing Liu, Chin-Yew Lin, Sujian Li, Baobao Chang and Zhifang Sui

Improving Ranking Consistency for Web Search by Leveraging Knowledge Base and Search Logs, CIKM 2015
Jyun-Yu Jiang, Jing Liu, Chin-Yew Lin and Pu-Jen Cheng

A Regularized Competition Model for Question Difficulty Estimation in Community Question Answering Services, EMNLP 2014
Quan Wang, Jing Liu, Bin Wang and Li Guo

A Computational Approach to Measuring the Correlation between Expertise and Social Media Influence for Celebrities on Microblogs, ASONAM 2014
Xin Zhao, Jing Liu, Yulan He, Chin-Yew Lin and Ji-Rong Wen

Question Difficulty Estimation in Community Question Answering Services, EMNLP 2013
Jing Liu, Quan Wang, Chin-Yew Lin and Hsiao-Wuen Hon

A Hierarchical Entity-based Approach to Structuralize User Generated Content in Social Media: A Case of Yahoo! Answers, EMNLP 2013
Baichuan Li, Jing Liu, Chin-Yew Lin, Irwin King and Michael R. Lyu

What's in a Name? An Unsupervised Approach to Link Users across Communities , WSDM 2013
Jing Liu, Fan Zhang, Xinying Song, Young-In Song, Chin-Yew Lin and Hsiao-Wuen Hon

An Unsupervised Method for Author Extraction from Web Pages Containing User-Generated Content, CIKM 2012
Jing Liu, Xinying Song, Jingtian Jiang and Chin-Yew Lin

Competition-based User Expertise Score Estimation, SIGIR 2011
Jing Liu, Young-In Song and Chin-Yew Lin

Automatic Extraction of Web Data Records Containing User-Generated Content, CIKM 2010
Xinying Song, Jing Liu, Yunbo Cao, Chin-Yew Lin and Hsiao-Wuen Hon

Microsoft Research Asia with Redmond at the NTCIR-8 Community QA Pilot Task, NTCIR 2010
Young-In Song, Jing Liu, Tetsuya Sakai, Xinjing Wang, Guwen Feng, Yunbo Cao, Hisami Suzuki and Chin-Yew Lin

Datasets

LUGE（千言）[portal]
Baidu, CCF (China Computer Federation) and CIPSC (Chinese Information Processing Society of China) jointly launched LUGE（千言）, an open-source project of Chinese NLP benchmarks. LUGE was featured in videos (zh-cn, en-us) and articles (zh-cn). If you are interested in contributing to LUGE, please contact me.

DuReader_retrieval [paper][data][code/models][LIC2022 - leaderboard]
We released DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. The dataset contains over 90K questions and 8M passages from Baidu Search. We hosted a DuReader_retrieval shared task at the 2022 Language and Intelligence Challenge, which drew more than 794 teams and more than 910 submissions. The shared task was featured in zh-cn.

DuQM [paper][data][code/models][regular leaderboard (open)][CCF BDCI 2021 - leaderboard (closed)]
We released DuQM, a Chinese dataset that challenges question matching models from multiple aspects, including lexical semantics, syntactic structure, misspelling and speech filler. We hosted a DuQM shared task at the 2021 CCF Big Data & Computing Intelligence Contest, which drew more than 4,000 registered teams and more than 9,000 submissions. The shared task was featured in zh-cn.

DuReader_checklist [data][code/models][regular leaderboard (open)][LIC2021 - leaderboard(closed)]
We released DuReader_checklist, a Chinese dataset that challenges machine reading comprehension models from multiple aspects, including understanding of vocabulary, phrases, semantic roles, reasoning and so on. We hosted a DuReader_checklist shared task at the 2021 Language and Intelligence Challenge, which drew more than 1,080 teams and more than 4,800 submissions. The shared task was featured in zh-cn.

DuReader_robust [paper][data][code/models][regular leaderboard (open)][LIC2020 - leaderboard (closed)]
We released DuReader_robust, a Chinese dataset for evaluating the robustness of machine reading comprehension models. We hosted a DuReader_robust shared task at the 2020 Language and Intelligence Challenge, which drew more than 1,500 teams and more than 4,600 submissions. The shared task was featured in zh-cn.

Professional Activities

Area Chair: ACL 2021 (Question Answering), AACL 2022 (Question Answering)
Session Chair: AACL 2020 (Question Answering)
Program committee/reviewer: ACL, EMNLP, NAACL, EACL, AACL, SIGIR, KDD, WSDM, WWW, CIKM, ICWSM, ACM Transactions on the Web (TWEB), ACM Transactions on Intelligent Systems and Technology (TIST), ACM Transactions on Information Systems (TOIS), Frontiers of Computer Science (FCS)

Working Experience

Principal Architect, Baidu NLP, Dec. 2017 - present
Researcher, Microsoft Research Asia, Sep. 2014 - Dec. 2017
Intern, Microsoft Research Asia, Jul. 2009 - Sep. 2014

Education

PhD, Computer Science, Harbin Institute of Technology, Sep. 2009 - Sep. 2014
M.Sc, Computer Science, Harbin Institute of Technology, Sep. 2007 - Jul. 2009
B.Sc, Computer Science, Xidian University, Sep. 2003 - Jul. 2007

Jing Liu（刘璟）