Dubyak Seminar Series

Dr. Jian Wu, Old Dominion University

Title: Toward Uncertainty-aware Data Extraction from Complex Scientific Tables

Abstract: Scientific tables offer a compact format for reporting data and are ubiquitous in scholarly documents, including journal articles, conference proceedings, and electronic theses and dissertations. Automatically extracting and integrating data from these tables across multiple journals and years has the potential to enable data-driven research that would be infeasible through manual effort alone. However, complex scientific tables often contain intricate structural and content features that are rarely present in standard tables, which typically consist of simple text arranged in clear formats. Moreover, verifying data extracted by AI-based approaches remains labor-intensive. To address these challenges, we propose developing a software framework that enables automated and uncertainty-aware data extraction from complex tables across multiple scientific domains. We began by surveying recent deep learning models for table structure recognition (TSR) to identify high-performing and reproducible models. Based on this survey, we proposed an ensemble approach to quantify uncertainty in TSR results. We then designed a framework that integrates the outputs of TSR and optical character recognition (OCR) engines into grid cells. Using these grid representations, we applied conformal prediction methods to quantify uncertainty in data extraction arising from both TSR and OCR components. Empirical evaluations demonstrate the effectiveness of our ensemble method in capturing TSR uncertainty and of the conformal method in quantifying the combined uncertainty from TSR and OCR.

Bio: Dr. Jian Wu is an assistant professor of Computer Science at Old Dominion University, Norfolk, VA. He obtained his Ph.D. in Astronomy and Astrophysics from Pennsylvania State University in 2011 and worked as a postdoctoral fellow before joining ODU in 2018. Since then, his research has been supported by NSF, IMLS, DARPA, Los Alamos National Laboratory, and Virginia Commonwealth. His research interests include natural language processing, scholarly big data, information retrieval, digital libraries, and the science of science. He has published more than 90 peer-reviewed papers in ACM, IEEE, and AAAI venues, including best paper awards and nominations, in addition to his earlier publications in Astronomy and Astrophysics. He shared the British Computer Society Award 2021 for the Best Open Source Project with Dr. C. Lee Giles at Pennsylvania State University.

April 25, 1:00 PM (Online, Zoom)

RSVP: https://forms.gle/h1NUaLY243MSC9kv5