
e-Science 2008 4th IEEE International Conference on e-Science

Main Conference Sessions

Parallel Processing of Large-Scale XML-Based Application Documents on Multicore Architectures with PiXiMaL

Authors

  • Michael Head, Binghamton University
  • Madhusudhan Govindaraju, SUNY Binghamton

Abstract

Very large scientific datasets are becoming increasingly available in XML formats. Our earlier benchmarking results show that parsing XML is time-consuming compared with binary formats optimized for large-scale documents. This performance bottleneck will worsen as the size of XML data increases in e-Science applications. Our focus in this paper is on addressing this bottleneck. In recent times, the microprocessor industry has made rapid strides toward Chip Multi Processors (CMPs). The widely available XML parsers have been unable to take advantage of the opportunities presented by CMPs and instead pass the task of parallelization to the application programmer. The paradigms used thus far to process large XML documents on uni-processors are not applicable to CMPs. We present the design, implementation, and performance analysis of PiXiMaL, a parallel processing library for large-scale XML data files. In particular, we discuss an effective scheme to parallelize the tokenization process to achieve an overall performance increase when parsing the large-scale XML documents that are increasingly in use today. Our approach is to build a DFA-based parser that recognizes a useful subset of the XML specification and to convert the DFA into an NFA that can be applied to any subset of the input.
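
The idea in the abstract's final sentence can be illustrated with a minimal sketch. It is not PiXiMaL's implementation: the two-state automaton, the names `step`, `run_chunk`, `split`, and `chunk_start_states`, and the chunking policy are all assumptions made for illustration. The point it shows is that a chunk taken from the middle of a document has an unknown starting state, so the automaton is run from every possible state (the NFA-like step), and the per-chunk results are then stitched together in order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical two-state automaton: outside a tag (TEXT) or inside one (TAG).
# Real XML needs many more states (comments, CDATA, attribute values, ...).
TEXT, TAG = "TEXT", "TAG"
STATES = (TEXT, TAG)

def step(state, ch):
    """One DFA transition for the toy XML subset."""
    if state == TEXT:
        return TAG if ch == "<" else TEXT
    return TEXT if ch == ">" else TAG

def run_chunk(chunk):
    """Run the automaton from every possible start state.

    Returns a dict mapping each start state to the state reached
    at the end of the chunk.
    """
    result = {}
    for start in STATES:
        state = start
        for ch in chunk:
            state = step(state, ch)
        result[start] = state
    return result

def split(data, n):
    """Split data into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(data) // n))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def chunk_start_states(data, n_chunks):
    """Process chunks independently, then resolve each chunk's true start state."""
    chunks = split(data, n_chunks)
    with ThreadPoolExecutor() as pool:
        maps = list(pool.map(run_chunk, chunks))
    starts = []
    state = TEXT  # a document begins outside any tag
    for mapping in maps:
        starts.append(state)
        state = mapping[state]  # cheap sequential stitch
    return chunks, starts, state

if __name__ == "__main__":
    chunks, starts, final = chunk_start_states("<a>hello</a><b>x</b>", 3)
    for chunk, start in zip(chunks, starts):
        print(repr(chunk), "starts in", start)
    print("final state:", final)
```

The thread pool here only demonstrates that chunks are independent of one another; in CPython, threads would not be expected to speed up this CPU-bound loop, and a native or process-based implementation would be needed to benefit from multiple cores.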

Date and Time

Friday, December 12, 3:30 p.m. to 4:00 p.m.

Room Number

208
