Streaming and Steering Applications: Requirements and Infrastructure
STREAM2015 and STREAM2016
http://streamingsystems.org
The STREAM2015 and STREAM2016 workshops are made possible by support from the National Science Foundation (Division of Advanced Cyberinfrastructure), the Department of Energy (Advanced Scientific Computing Research), and the Air Force Office of Scientific Research (AFOSR).
STREAM2016 will be held March 22-23, 2016 in Washington, DC. The meeting will follow up on issues identified in STREAM2015 and will also cover data from scientific instruments of interest to the Department of Energy.
Registration information, meeting logistics, and additional details can be found at: http://www.orau.gov/streaming2016
Call for White Papers STREAM2016
Data streaming from online instruments, large-scale simulations, and distributed sensors such as those found in transportation systems and urban environments points to the growing interest in, and important role of, streaming data and related real-time steering and control. As part of a two-part workshop series, we are organizing STREAM2016 to identify the application and technology communities in this area and to clarify the challenges that they face.
The report of the first workshop in the series (STREAM2015), held in October 2015, can be found at http://streamingsystems.org/finalreport.html. STREAM2016 will build on STREAM2015 and will focus on the features, requirements, and challenges of DOE applications and on the hardware and software systems needed to support them.
STREAM2016 will be held March 22-23, 2016 in Washington, DC. The meeting will follow up on issues identified in STREAM2015 and will also cover data from scientific domains of interest to the Department of Energy. Additional details on logistics will be provided shortly.
Members of the community are invited to submit a 1-2 page White Paper/Statement of Interest in areas of relevance to STREAM2016's scope and objectives including issues raised in STREAM2015. Participants will be selected based upon relevance of submissions as well as strategic balance of expertise. White papers are due by February 21 and should be sent to Sophia Pasadis (spasadis@lbl.gov). If you have any questions, please contact workshop organizers: Geoffrey Fox (gcf@indiana.edu), Shantenu Jha (shantenu.jha@rutgers.edu) or Lavanya Ramakrishnan (lramakrishnan@lbl.gov).
Geoffrey Fox (Indiana), gcf@indiana.edu; Shantenu Jha (Rutgers), shantenu.jha@rutgers.edu; Lavanya Ramakrishnan (LBL), lramakrishnan@lbl.gov
Summary of STREAM2015 and STREAM2016
The workshops STREAM2015 and STREAM2016 cover a class of applications – those associated with streaming data and related (near) real-time steering and control – that are of growing interest and importance. The goal of the workshops is to identify application and technology communities in this area and to clarify the challenges that they face. We will focus on application features and requirements as well as the hardware and software needed to support them. We will also look at research issues and possible future steps after the workshop. We have surveyed the field and identified application classes including the Internet of People, social media, financial transactions, the Industrial Internet of Things, DDDAS, cyberphysical systems, satellite and airborne monitors, national security, astronomy, light sources, instruments like the LHC, sequencers, data assimilation, analysis of simulation results, and steering and control. We also survey technology developments across academia and industry, where all the major commercial clouds offer significant new systems aimed at this area. The field needs such an interdisciplinary workshop addressing the big picture: applications, infrastructure, research and futures.
STREAM2015 will be the first in a series of two workshops and will have a focus on NSF applications and infrastructure. It will be held in Indianapolis, where two full meeting days (October 27-28) will be followed by a report writing day on October 29, 2015. The second workshop, STREAM2016, will have a focus on DOE activities and applications as well as following up on ideas raised in STREAM2015. STREAM2016 will be held in Washington on March 22-23, 2016. We will produce separate reports on the discussions at each workshop, to be completed around two months after each event. The first workshop budget covers travel and meeting expenses for about 30 attendees.
We have identified an organizing committee expanding the core group – Fox (Indiana), Jha (Rutgers) and Ramakrishnan (LBNL) – proposing these two workshops. In selecting the list of attendees, we will reach out to underrepresented communities, in particular women and ethnic minorities. Real-time streaming of STREAM2015 sessions will enhance opportunities for a broad community to engage with the meeting, and we will support questions and comments from remote participants. The web site http://streamingsystems.org will support this workshop and contain the final report, presentations, position papers, archival copies of streamed video, and a repository of useful documents and links.
Streaming Technology Requirements, Application and Middleware (STREAM2015): Call for Participation
The role of streaming data and related real-time steering and control are of growing interest and importance. STREAM2015 and STREAM2016 will be held to identify application and technology communities in this area and to clarify challenges that they face. The objective of the workshops is to identify common and open research issues and possible future steps.
Members of the community are invited to submit a 1-2 page White Paper/Statement of Interest in areas of relevance to STREAM2015's scope and objectives. White papers are due by October 5 and should be sent to workshop organizers. Participants will be selected based upon relevance of submissions as well as strategic balance of expertise. Partial travel support is available.
A complete list of STREAM2015 white papers can be found on the workshop web site, http://streamingsystems.org.
Detailed Description of STREAM2015
The STREAM2015 workshop covers a class of applications -- those associated with streaming data and related (near) real-time steering and control. The goal of the workshop is to identify application, infrastructure and technology communities in this area and to clarify the challenges that they face. We will focus on application features and requirements as well as the hardware and software needed to support them. We will also look at research issues. Below we cover typical application areas (Section 2) and some approaches to software (Section 3). Section 4 covers the workshop goals, objectives and organization. It also covers the report generated by the workshop, describing findings and identifying future activities to build the streaming and steering community. Section 5 concludes the description, with the appendix in Section 6 giving the proposed schedule and venue details.
In Table 1, we identify eight problem areas that involve streaming data. We argue that the applications of Table 1 are critical for next-generation scientific research and thus need research into a unifying conceptual architecture, programming models, and scalable runtimes. All problem areas are active today, but without agreed-upon software models focused on streaming.
# | Streaming/Steering Application Class | Details and Examples | Features
1 | Internet of People: wearables | Smart watches, bands, health, glasses, telemedicine | Small independent events
2 | Social media, Twitter, cell phones, blogs, financial transactions | Study of information flow, online algorithms, outliers, graph analytics | Sophisticated analytics across many events; text and numerical data
3 | Industrial Internet of Things, Cyberphysical Systems, DDDAS, Control | Software Defined Machines, smart buildings, transportation, electrical grid, environmental and seismic sensors, robotics, autonomous vehicles, drones | Real-time response often needed; data varies from large to small events
4 | Satellite and airborne monitors; National Security, Justice, Military | Surveillance, remote sensing, missile defense, anti-submarine warfare, Naval Tactical Cloud | Often large volumes of data and sophisticated image analysis
5 | Astronomy, Light Sources, Instruments like the LHC, Sequencers | Scientific data analysis in real time or batch from large sources; LSST, DES, SKA in astronomy | Real-time or sometimes batch, or even both; large complex events
6 | Data Assimilation | Integration of typically distributed data into simulations to enhance quality | Links large-scale parallel simulations with time-dependent data; sensitive to latency
7 | Analysis of Simulation Results | Climate, Fusion, Molecular Dynamics, Materials; typically local or in-situ data | Increasing bottleneck as simulations scale in size
8 | Steering and Control | Control of simulations or experiments; data could be local or distributed | Variety of scenarios with similarities to robotics
Table 1: Eight Streaming and/or Steering Application Classes
As we illustrate in Table 1, these applications are not new, but they are growing rapidly in size and importance. Correspondingly, it becomes relevant to examine the needed functionality and performance of hardware and software infrastructure that could support them. We can identify such applications within academic, commercial and government areas. Examples in Table 1 include the Internet of Things, projected to reach 30 to 70 billion devices in 2020 [1], with particular examples including wearables, brilliant machines [2] and smart buildings; this myriad of small devices contrasts with events streaming from larger scientific instruments such as light sources, telescopes, satellites and sequencers. There is also the social media phenomenon that adds over 20,000 photos online every second [3], with an active research program studying the structure and dynamics of information. In national security, one notable example comes from the Navy, which is developing Apache Foundation streaming (big data) software for missile defense [4]. A NIST survey [5] of big data applications found that 80% involved some sort of streaming [6], and the AFOSR DDDAS initiative [7-9] looks at streaming and control (steering). Data assimilation and Kalman filters have been used extensively to incorporate streaming data into analytics such as weather forecasts and target tracking.
Most of the applications involve linking analysis with distributed dynamic data and can require real-time response. The requirements of distributed computing problems that couple HPC and cloud computing with streaming data are distinct from those familiar from large-scale parallel simulations, grid computing, data repositories and workflows, which have generated sophisticated software platforms. Scientific experiments are increasingly producing large amounts of data that need to be processed on HPC and/or cloud platforms, and these experiments often need support for real-time feedback to steer the instruments. Thus, there is a growing need to generalize computational steering to include the coupling of distributed resources in real time, and a fresh perspective on how streaming data might be incorporated in this infrastructure. The analysis of simulation results and visualizations has been explored significantly in the last few years and is recognized as a serious challenge as simulations scale towards exascale. The in-situ analysis of such data shares features with streaming applications, but the data is not distributed if the simulation and analysis engines are identical or co-located.
One goal of the workshop will be to identify those features that distinguish different applications in the streaming/steering class. Five categories we have already identified are:
a) Set of independent events where precise time sequencing is unimportant, e.g., independent search requests or cloud accesses from smartphones or wearables.
b) Time series of connected small events where time ordering is important, e.g., streaming audio or video, or robot monitoring.
c) Set of independent large events where each event needs parallel processing and time sequencing is not critical, e.g., processing images from telescopes or light sources in materials science.
d) Set of connected large events where each event needs parallel processing and time sequencing is critical, e.g., processing high-resolution monitoring (including video) information from robots (such as self-driving cars) where real-time response is needed.
e) Stream of connected small or large events that need to be integrated in a complex way, e.g., streaming events used to update a model (clustering) rather than being classified with an existing static model, which fits category a).
These five categories can be further considered for single or multiple heterogeneous streams. We will refine and expand these categories as part of the workshop.
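To make the distinction between categories a) and b) concrete, the following is a minimal sketch of the two processing patterns. The event payloads, the classify() analytic, and the lateness bound are hypothetical placeholders, not taken from any application above: independent events can be handled in any order, while time-ordered events need a small reordering buffer with bounded lateness.

```python
# Minimal sketch of category a) vs. category b) processing.
# classify(), the event payloads, and the lateness bound are illustrative only.
import heapq


def classify(event):
    """Placeholder per-event analytic for category a)."""
    return "anomaly" if event.get("value", 0) > 10 else "normal"


def process_independent(events):
    """Category a): each event is handled on its own; arrival order is
    irrelevant, so the work is embarrassingly parallel."""
    return [classify(e) for e in events]


def process_time_ordered(timestamped_events, lateness=2.0):
    """Category b): events must be emitted in timestamp order.
    A small buffer reorders events that arrive slightly late."""
    buffer, watermark = [], float("-inf")
    for i, (ts, payload) in enumerate(timestamped_events):
        heapq.heappush(buffer, (ts, i, payload))   # i breaks timestamp ties
        watermark = max(watermark, ts - lateness)  # allow bounded lateness
        while buffer and buffer[0][0] <= watermark:
            out_ts, _, out_payload = heapq.heappop(buffer)
            yield out_ts, out_payload
    while buffer:                                  # flush remaining events
        out_ts, _, out_payload = heapq.heappop(buffer)
        yield out_ts, out_payload
```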
Although the growing importance of these application areas has been recognized, the needed hardware and software infrastructure is not as well studied. Particular solutions have been developed for specific cases, such as the analysis of events from the LHC or of imagery from telescopes and light sources.
The distributed stream processing community has produced frameworks to deploy, execute and manage event-based applications at large scale, and these are one important class of streaming and steering software. Examples of early event stream processing frameworks include Aurora [10], Borealis [11], StreamIt [12] and SPADE [13]. With the emergence of Internet-scale applications in recent years, new distributed map-streaming processing models like Apache S4 [14], Apache Storm [15], Apache Samza [16], Spark Streaming [17], Twitter's Heron [18] and Granules [19] have been developed, with commercial solutions including Google MillWheel [20], Azure Stream Analytics [21] and Amazon Kinesis [22].
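As a concrete illustration of the programming model these frameworks expose, below is a minimal, classic word-count sketch using Spark Streaming's Python API. The socket source, port and one-second batch interval are illustrative choices (e.g., fed by `nc -lk 9999`), not taken from any of the applications discussed above.

```python
# Minimal Spark Streaming sketch: word count over 1-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)                      # 1-second batch interval

lines = ssc.socketTextStream("localhost", 9999)    # incremental text source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's counts

ssc.start()                                        # start absorbing the stream
ssc.awaitTermination()
```

The same word-count pipeline can be written, with different APIs, in Storm, Samza, Heron or the commercial cloud services listed above; the micro-batch style shown here is specific to Spark Streaming.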
Although these academic and commercial approaches are effective, we suggest a more integrated approach that spans many application areas and many solutions, and that evaluates applications against current and future software. This could lead to new research directions for a scalable infrastructure, and to clearer ideas on what infrastructure is appropriate to support a range of applications. Note that in grid solutions for problems like LHC data analysis, events tend not to be streamed directly; rather, batches are processed on distributed infrastructure. In Table 2 below, we contrast some well-known scientific computing paradigms with streaming and steering.
# | Paradigm | Features and Examples
1 | Multiple Loosely Coupled Tasks | Grid computing, largely independent computing/event analysis, many-task computing
2 | MapReduce | Single-pass compute and collective computation
3 | BSP and Iterative MapReduce | Iterative staged compute (map) with computation including parallel machine learning, graph analytics, and simulations; typically batch
4 | Workflow | Dataflow linking functional stages of execution
5 | Streaming | Incremental (often distributed) data I/O feeding long-running analysis using other computing paradigms; typically interactive
6 | Steering | Incremental I/O from a computer or instrument driving a possibly real-time response (control)
Table 2: Six computing paradigms, with Streaming and Steering contrasted with four other paradigms common in scientific computing.
In the first four paradigms of the table above, data is typically accessed systematically, either at the start of a computation or, more generally, at programmatically controlled stages of it. In workflow, multiple examples of such data-driven computations are linked together. The streaming paradigm, on the other hand, absorbs data asynchronously throughout the computation, while steering feeds back control instructions.
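The access-pattern contrast can be summarized in a few lines of code. The running-mean example below is purely illustrative: the batch version sees all of its data before computing a single result, while the streaming version absorbs points as they arrive and keeps an up-to-date result throughout, which is what makes steering feedback possible.

```python
# Illustrative contrast between batch and streaming access patterns.
def batch_mean(values):
    data = list(values)            # all data available before compute begins
    return sum(data) / len(data)   # single result at the end


def streaming_mean(stream):
    count, total = 0, 0.0
    for x in stream:               # data absorbed asynchronously, point by point
        count += 1
        total += x
        yield total / count        # current result after every event, available
                                   # for steering or control feedback
```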
Identifying research directions will be one of the goals of the workshop. We can already identify the need to study the system architecture, including the balance between processing at the source, in the fog (local) and in the cloud (backend), as well as online algorithms, storage, data management, resource management, scheduling, programming models, quality of service (including delay in control responses) and fault tolerance. Optimizations like operator reordering, load balancing, fusion, and fission have been researched to reduce the latency of stream processing applications [23].
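As a hypothetical illustration of two of these optimizations, the sketch below pushes a cheap filter ahead of an expensive map (operator reordering, valid only when the predicate depends on the raw event rather than on the transformed value) and fuses the two operators into a single pass over a micro-batch. The operator names are placeholders, not from any particular framework.

```python
# Hypothetical sketch of operator reordering and fusion on one micro-batch.
# expensive_transform and keep are placeholder operators; the reordering is
# only valid because keep() is assumed to inspect the raw event.
def naive(batch, expensive_transform, keep):
    mapped = [expensive_transform(x) for x in batch]      # pass 1: map all events
    return [y for x, y in zip(batch, mapped) if keep(x)]  # pass 2: filter


def reordered_and_fused(batch, expensive_transform, keep):
    # Filter first (fewer expensive calls) and fuse both operators into a
    # single pass over the batch, reducing per-event latency.
    return [expensive_transform(x) for x in batch if keep(x)]
```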
The purpose of this workshop is to explore the landscape sketched above, identify the application and technology communities and converge on the immediate and long-term challenges. We propose examining four aspects of this landscape:
1. Application Study: Table 1 is a limited sampling of applications that critically depend upon streaming and steering with HPC. It is necessary to extend and refine Table 1 with a broader set of application characteristics and requirements. We need to improve the set of features at the end of Section 2 and identify which aspects are important in determining software and hardware requirements. A set of benchmarks may be important.
2. System Architectures: A critical challenge that follows is to understand scalable architectures that will support the different types of streaming and steering applications in Table 1, i.e., to firm up the vague concept of ubiquitous sensors and the Internet of Things to match the range of application and infrastructure types. In particular, we should identify where HPC, accelerators, and clouds are important.
3. Research Directions: There is a need to integrate features of traditional HPC, such as scientific libraries and communication, with the rich set of capabilities found in the commercial streaming ecosystem. This general approach has been validated for a range of traditional applications, but not for the rich class of streaming and steering problems. Interesting questions are centered on the data management requirements, while the NRC study [24] stressed the importance of new online (streaming, i.e., look at each data point once) algorithms; a minimal sketch of one such algorithm follows this list.
4. Next Steps Forward: We hope this workshop starts a process that will identify and bind the community of applications and systems researchers and providers in the streaming and steering areas. We intend to produce a thorough report, with the final day of the workshop devoted to writing it. As well as covering the findings of our workshop, the report will suggest next steps forward. These could include a second workshop to dig deeper in some areas, and other studies such as the collection of benchmarks.
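As a minimal example of the single-pass algorithms referred to in item 3, the sketch below computes a running mean and variance with Welford's method, inspecting each data point exactly once and retaining only constant state; it stands in for the broader class of online algorithms rather than any specific application requirement.

```python
# Minimal sketch of an online (single-pass) algorithm: Welford's method for a
# running mean and sample variance. Each point is seen once; state is O(1).
def online_mean_variance(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        variance = m2 / (n - 1) if n > 1 else 0.0
        yield mean, variance
```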
We are not aware of any meeting in this area that juxtaposes infrastructure, applications and systems (software). There are many Internet of Things workshops and conferences; the online listing WikiCFP [25], for example, lists 66 IoT events, with 11 still open this year, as well as 18 open Robotics events and 13 in Sensor Networks. These meetings would not attract the interdisciplinary mix we aim at in our proposed workshop. DDDAS meetings also cover some of the topics proposed here.
This workshop will be unique in that it will focus on understanding the big picture as opposed to discussing specific solutions. It will also bring together resource and infrastructure providers with the academic community.
The core organizing committee consists of Geoffrey Fox (Indiana University), Shantenu Jha (Rutgers) and Lavanya Ramakrishnan (LBNL). These three have worked together over the last six months to understand the workshop area and produced a report on HPC Streaming for DoE [26]. Fox has worked extensively on streaming problems for the last 15 years starting with the publish-subscribe system NaradaBrokering [27] and is now focused on cloud control of robotics [28]. Jha has extensive experience in computational steering, analysis of large scale simulations and distributed computing and middleware. Ramakrishnan's workflow research has covered several DoE streaming applications.
Breakout (working) groups will be asked to collaboratively author their reports in real time via shared collaborative tools (probably Google Documents), which allow multiple users to view and edit a document simultaneously, while saving and tracking edits by users. The breakout reports will be presented to the plenary, and made accessible online to all breakout groups for further discussion and edits. We will video record all major sessions of the workshop.
All participants will be encouraged to stay for the third (writing) day to refine notes, synthesize the main findings and formulate key report sections. A pre-workshop organizational conference call will select track and theme leads (who will double as editors). The writing team, comprising the organizers and the track and theme leads, will be required to stay.
The writing team/editors will continue to engage after the workshop to finalize the report. We will deliver a draft final report within 30 days of the workshop. Whereas the bulk of the writing will occur on Day 3, the editors will meet via a remote conferencing system within 30 days of the workshop to prepare a final draft of the workshop report and findings. We have found this to be an effective pathway from the immediate aftermath of a workshop to a quick report.
The draft report will be disseminated to all workshop participants and posted on the workshop web site; it will be distributed on mailing lists such as XSEDE, OSG, and DOE, welcoming and soliciting comments and feedback within a 45-day timeframe. We will thus deliver a final report to NSF within 90 days of the workshop.
The report will be a live document, e.g., hosted in an arXiv repository, with an essentially complete first version containing the main material that is then updated with incremental refinements. Taking advantage of the live document, in addition to bringing the community to the report, we will examine the possibility of taking the draft report out to the community, while respecting the time constraints.
Streaming data and steering are well-established fields, but they have gained profound impact only since data turned into a deluge. Now, with the Internet of Things and new experimental instruments, we see a streaming deluge requiring new approaches to control or steering. This workshop will bring together interdisciplinary experts on applications and infrastructure to address three conceptual goals: 1) What are the driving applications? 2) What are the existing and needed hardware and software? 3) What are the research challenges? The community identified for this workshop needs to work together on an ongoing basis, which is the fourth, futures-oriented goal of the workshop. We are not aware of any closely related activity and suggest the streaming deluge can only be addressed by a set of activities such as those proposed here.
The meeting will be held October 27-29 in Indianapolis at IUPUI (the combined Indiana University-Purdue University Indianapolis campus) using their event facilities [29], which are located in the center of campus. It is an easy (14-mile) taxi ride from Indianapolis airport and near many downtown hotels including the Marriott (nearest), Hyatt and Hilton. We have available the Tower Ballroom (oval seating style), with seating for 60 in conference style, plus breakout rooms. The rooms are equipped with video conferencing/streaming presentation support.
We will provide lunch and refreshments (coffee) to the participants plus a reception on the evening of October 27.
The meeting is organized as two days (October 27-28) for the main discussions plus a final day (October 29) for the organizers to work on the meeting report. We will provide only two small rooms on the final day, to support the 6-10 people expected to attend that day.
The meeting is organized around the four goals described in Section IV: Application Study, System Architectures, Research Directions, and Next Steps Forward, with the first two goals covered on day one (October 27) and the latter two on day two (October 28). A proposed schedule is given below.
Note that we will be streaming sessions and questions and comments will be solicited from those attending remotely.
Day One Morning: Introduction and Plenary on Architectures and Systems
● Attendee introductions: 2-slide presentations by those not on panels
● Application Requirements Panel and discussion
● System Architecture Panel and discussion
Day One Afternoon: Breakout Sessions
● Breakout Sessions: Application Requirements and System Architecture
● Plenary Summary
Day Two Morning: Plenary on Research Directions and Next Steps Forward
● Recap and lessons from Day One
● Research Directions Panel and discussion
● Next Steps Forward Panel and discussion
● Breakout Sessions: Research Directions and Next Steps Forward
Day Two Afternoon: Breakout Sessions and Planning
● Breakout Sessions: Research Directions and Next Steps Forward continued
● Plenary Summary
● Plenary discussion of findings in all four goals
● Organize report writing and discussion of follow-up activities
● Make as much progress as possible with the workshop report
NSF-funded conferences are required to address child care services. These are available to our workshop attendees through Sitters to the Rescue, a service established in 1996 with good credentials. The charge is $20 per hour per sitter. If needed by any participant, we will rent another room at the IUPUI facility to satisfy this requirement.
The proposed facilities satisfy federal accessibility requirements.
Streaming and Steering Applications: Requirements and Infrastructure
References
[1] Cisco Internet Business Solutions Group (IBSG) (Dave Evans). The Internet of Things: How the Next Evolution of the Internet Is Changing Everything. 2011 April [accessed 2013 August 14]; Available from: http://www.cisco.com/web/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf.
[2] Chauhan, N. Modernizing Machine-to-Machine Interactions. 2014 Available from: https://www.gesoftware.com/sites/default/files/GE-Software-Modernizing-Machine-to-Machine-Interactions.pdf.
[3] Kimberlee Morrison. How Many Photos Are Uploaded to Snapchat Every Second? 2015 June 9 [accessed 2015 June 15]; Available from: http://www.adweek.com/socialtimes/how-many-photos-are-uploaded-to-snapchat-every-second/621488.
[4] ONR. Data Focused Naval Tactical Cloud (DF-NTC): ONR Information Package. 2014 June 24 [accessed 2015 June 15]; Available from: http://www.onr.navy.mil/~/media/Files/Funding-Announcements/BAA/2014/14-011-Attachment-0001.ashx.
[5] NIST. NIST Big Data Public Working Group (NBD-PWG) Home Page. 2013 [accessed 2015 March 31]; Available from: http://bigdatawg.nist.gov/home.php.
[6] Geoffrey C. Fox, Shantenu Jha, Judy Qiu, and Andre Luckow, Towards an Understanding of Facets and Exemplars of Big Data Applications, in 20 Years of Beowulf: Workshop to Honor Thomas Sterling's 65th Birthday April 13, 2015. Annapolis http://dsc.soic.indiana.edu/publications/OgrePaperv11.pdf.
[7] DDDAS: Dynamic Data Driven Applications Systems NSF Site. [accessed 2015 July 22]; Available from: http://www.nsf.gov/cise/cns/dddas/.
[8] Dynamic Data Driven Applications Systems (DDDAS) AFOSR Site. [accessed 2015 July 22]; Available from: https://community.apan.org/afosr/w/researchareas/7661.dynamic-data-driven-applications-systems-dddas.aspx.
[9] DDDAS Dynamic Data-Driven Applications System Showcase. [accessed 2015 July 22]; Available from: http://www.1dddas.org/.
[15] Anderson, Q., Storm Real-time Processing Cookbook. 2013: Packt Publishing Ltd. ISBN:178216443X
[16] Kamburugamuve, S., Survey of distributed stream processing for large stream sources. 2013. http://grids.ucs.indiana.edu/ptliupages/publications/survey_stream_processing.pdf.
[19] Shrideep Pallickara. Granules Home Page. 2015 [accessed 2015 June 12]; Available from: http://granules.cs.colostate.edu/.
[21] Microsoft. Azure Stream Analytics. 2015 [accessed 2015 June 12]; Available from: http://azure.microsoft.com/en-us/services/stream-analytics/.
[22] Varia, J. and S. Mathew. Overview of amazon web services. 2013 Available from: http://docs.aws.amazon.com/gettingstarted/latest/awsgsg-intro/intro.html.
[24] Committee on the Analysis of Massive Data; Committee on Applied and Theoretical Statistics; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council, Frontiers in Massive Data Analysis. 2013: National Academies Press. http://www.nap.edu/catalog.php?record_id=18374
[25] WikiCFP: Listing of IOT Conferences and Workshops. [accessed 2015 June 15]; Available from: http://www.wikicfp.com/cfp/call?conference=IOT.
[26] Geoffrey Fox, Shantenu Jha, and Lavanya Ramakrishnan. Scalable HPC Workflow Infrastructure for Steering Scientific Instruments and Streaming Applications. 2015 April 14 [accessed 2015 June 15]; Available from: http://dsc.soic.indiana.edu/publications/StreamingHPC-WhitePaper-WorkflowWorkshop.pdf.
[27] NaradaBrokering. Scalable Publish Subscribe System. 2010 [accessed 2010 May]; Available from: http://www.naradabrokering.org/.
[28] Supun Kamburugamuve, Leif Christiansen, and Geoffrey Fox, A Framework for Real Time Processing of Sensor Data in the Cloud. Journal of Sensors, 2015. 2015: p. 11. DOI:10.1155/2015/468047. http://dsc.soic.indiana.edu/publications/iotcloud_hindavi_revised.pdf
[29] IUPUI Indianapolis Event Facilities. 2015 [accessed 2015 June 15]; Available from: http://eventservices.iupui.edu/university-tower.asp.