One of the most exciting aspects of the Big Data era, for both industry and the research community, is the remarkable progress being made in machine learning and deep learning. Modern applications demand more resources than a single node can supply. The overall data processing environment must therefore handle a range of challenges: data engineering for pre- and post-processing, communication, and system integration. A crucial requirement for data analytics tools is the ability to integrate quickly with existing frameworks in a variety of languages, since this increases user productivity and efficiency. All of this calls for an efficient, highly distributed, and integrated approach to data processing, yet many of today’s popular data analytics tools cannot meet all of these requirements at the same time.
In this project, we introduce Cylon, an open-source, high-performance distributed data processing toolkit that integrates easily with existing Big Data and AI/ML frameworks. It is built on a compact data structure, with a flexible C++ core on top of it and language bindings for C++, Java, and Python.
We present Cylon’s design and demonstrate how it can be used as a standalone framework or imported as a library into existing applications. Early tests show that Cylon enhances well-known tools such as Apache Spark and Dask, with significant performance gains for key operations and better linkage between components. The ultimate goal is to demonstrate how Cylon’s design supports cross-platform use with minimal overhead, including with well-known AI tools such as PyTorch, TensorFlow, and Jupyter notebooks.