Canvas is used for submitting some assignments and for checking your grades; it is also where you can talk with fellow classmates and instructors through Canvas chat.
The class will interact through a Google community group. The instructor will be available for Google Hangouts at the times listed below; information for joining the Hangout events is posted in the Google community group for this class.
Lessons are created and delivered using Office Mix, which supports combined voice, video, and slides.
Competency in the course is evaluated on a student's engagement with and mastery of the content, through the following:
The breakdown used in determining a grade is 30% exercises (a), 30% projects (b), and 40% engagement (c, d, e). Graduate students may skip one reflection. Undergraduates may skip one reflection and one exercise (or, for exercises, the lowest grade will be dropped).
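For concreteness, here is a minimal sketch of how such a weighted grade could be computed, assuming scores on a 0-100 scale and a simple average within each category; the function and the averaging rule are illustrative, not the official calculation:

```python
def final_grade(exercises, projects, engagement):
    """Weighted course grade: 30% exercises, 30% projects, 40% engagement.

    Each argument is a list of 0-100 scores; each category is averaged
    before weighting (an assumption made for illustration only).
    """
    def avg(scores):
        return sum(scores) / len(scores)
    return 0.30 * avg(exercises) + 0.30 * avg(projects) + 0.40 * avg(engagement)

# Example: an undergraduate whose lowest exercise grade has been dropped.
print(final_grade([90, 85, 95], [88, 92], [100, 90, 95]))  # -> 92.0
```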
Beth Plale ("prof plale")
Professor of Informatics and Computing
School of Informatics and Computing
https://www.linkedin.com/in/bethplale
@bplale
Email
Zong Peng
Research Assistant, Data to Insight Center
PhD candidate, School of Informatics and Computing,
Indiana University
Email
Data is abundant, and this abundance offers potential for new discovery and for economic and social gain. But data can be difficult to use: it can be noisy and inadequately contextualized; the gap from data to knowledge can be too large; and limits in technology or policy can prevent it from being easily combined with other data. This course examines the underlying principles and technologies needed to capture data, clean it, contextualize it, store it, access it, and trust it for a repurposed use. Specifically, the course covers 1) the distributed systems and database concepts underlying noSQL and graph databases, 2) best practices in data pipelines, 3) foundational concepts in metadata and provenance, with examples, and 4) the developing theory of data trust and its role in reuse.
Lesson Number | Topic | Area | Week
---|---|---|---
Lesson 1 | Big Data in Business | Intro | 1
Lesson 2 | Big Data in Scientific Research | Intro | 2
Lesson 3 | Data Processing Pipelines in Science | Data pipelines | 3
Lesson 4 | Data Processing Pipelines in Business | Data pipelines | 3
Lesson 5 | Data Cleansing | Data pipelines | 4
Lesson 6 | Data Coding | Data pipelines | 4
Lesson 7 | Software Systems Design Overview | Software systems | 5
Lesson 8 | Complexity in Software Systems | Software systems | 5
Lesson 9 | Project: Twitter Dataset Analysis and Modeling | Project | 6
Lesson 10 | Distributed File System Introduction | Distributed systems | 7
Lesson 11 | Role of Caching in Distributed Computing | Distributed systems | 8
Lesson 12 | Role of Fault Tolerance in Distributed Computing | Distributed systems | 9
Lesson 13 | NoSQL Data Stores | Distributed data stores | 10
Lesson 14 | Consistency in Distributed noSQL Data Stores | Distributed data stores | 10
Lesson 15 | Routing in noSQL Data Stores | Distributed data stores | 11
Lesson 16 | Comparison of noSQL/SQL Data Models | Distributed data stores | 12
Lesson 17 | Data Provenance | Provenance | 13
Lesson 18 | Linked Data | Linked data | 14
Lesson 19 | Science Gateways, Scientific Workflows and Distributed Computing: Data In, Data Out | Open source open data | 15
Lesson 20 | Social and Technical Barriers to Data Sharing | Data sharing | 16

Each lesson is described in more detail below.
Lessons 1 and 2 give you a perspective on how society is thinking and talking about big data today. We look at the topic from the perspective of business (Lesson 1) and science (Lesson 2). Go to Lesson 1
Lesson 2 gives a science perspective on big data. You will read a dozen or so short articles from Science magazine (February 2011), written by practitioners in a dozen or so fields spanning the social, medical, and natural sciences, each discussing their discipline's unique data issues. Taken together, the collection highlights how differently one discipline sees its data challenges from another. Go to Lesson 2
What is a data pipeline? Data rarely arrive instantly ready to use for whatever exploratory purpose you have in mind; between creation and use they undergo numerous steps, some of which produce end products in themselves. Lesson 3 is derived from a talk that database scholar Jim Gray gave in 2009 on the Laboratory Information Management System (LIMS), an early notion of the data processing pipeline for science. Go to Lesson 3 - Part 1 Go to Lesson 3 - Part 2
Lesson 4 introduces the business perspective on data pipelines. It draws inspiration from "Data Without Limits", a 2011 talk by Werner Vogels, CTO of Amazon, in which he discusses data pipelines in the context of business computing and argues that cloud computing is core to a business model "without limits". The pipeline he proposes is: collect | store | organize | analyze | share. Go to Lesson 4
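As a rough sketch of the idea (not Vogels' own design), the five stages can be written as composable functions, each consuming the previous stage's output; the data shapes here are invented for illustration:

```python
# A minimal sketch of the collect | store | organize | analyze | share
# pipeline, with each stage as a plain function over hypothetical records.

def collect():
    # Gather raw records from some source (hard-coded here for illustration).
    return ["2014-01-03,42", "2014-01-04,17", "2014-01-05,99"]

def store(raw):
    # Persist raw records; here we just simulate durable storage in memory.
    return list(raw)

def organize(stored):
    # Parse the raw text into structured (date, value) tuples.
    return [(line.split(",")[0], int(line.split(",")[1])) for line in stored]

def analyze(records):
    # Derive a simple summary statistic from the organized records.
    values = [v for _, v in records]
    return {"count": len(values), "mean": sum(values) / len(values)}

def share(result):
    # Publish the analysis; here, just print it.
    print(result)

share(analyze(organize(store(collect()))))  # run the whole pipeline end to end
```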
Lesson 5 gives use cases in data cleansing from environmental science and social science. It includes a link to the video "Dealing with Missing Data and Data Cleansing" (Part 3 of 3 on Quantitative Coding and Data Entry) by Graham R. Gibbs, Research Methods in Social Sciences, University of Huddersfield. Go to Lesson 5 Lesson 5 Reflection
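To make the missing-data discussion concrete, here is a small sketch using the pandas library; the column names, the outlier threshold, and the fill strategies are assumptions for illustration, not steps prescribed by the lesson:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps and one obviously bad value.
df = pd.DataFrame({
    "temperature": [21.3, np.nan, 22.1, 999.0, 21.8],
    "humidity":    [40.0, 41.5, np.nan, 39.8, 40.2],
})

df.loc[df["temperature"] > 100, "temperature"] = np.nan   # flag impossible values
df["temperature"] = df["temperature"].interpolate()       # fill gaps between readings
df["humidity"] = df["humidity"].fillna(df["humidity"].mean())  # mean-impute

print(df)
```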
In Lesson 6 the student gains basic knowledge of coding data: tagging or categorizing it on important features so that it can be clustered and otherwise analyzed. The student will get a chance to try their hand at coding a dataset. One possible project uses 278 media mentions from the Pervasive Technology Institute over the year 2013-2014; the categorization the student does is illustrated by visualizing the results as a simple pie chart. Other possible projects will be introduced later. Go to Lesson 6
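The kind of coding-and-visualization exercise described might look like the following sketch; the category names and counts are invented, and the matplotlib pie chart stands in for whatever tool the student chooses:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical hand-coded categories assigned to a handful of media mentions.
codes = ["research", "education", "research", "outreach",
         "research", "education", "outreach", "research"]

counts = Counter(codes)  # tally mentions per category
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("Media mentions by category (illustrative data)")
plt.show()
```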
Lesson 7 is an introduction to the general concepts of software systems. These concepts are used in designing the large software systems needed to handle any of today's large applications (social media, cloud services, shopping carts, video rentals, ...). Go to Lesson 7 Lesson 7 Reflection
A key problem in large-scale applications is complexity. Lesson 8 examines the sources of that complexity and the design principles used to overcome it. Go to Lesson 8
Lesson 9 is the course project; an explanation of the project is the subject of the Lesson 9 video. You will have three weeks to complete the project. The project description is in Canvas, and the project is turned in through Canvas. Expect to give a demo of the project. Go to Lesson 9
We looked in Lessons 7 and 8 at systems design and some of the complexities therein. The focus of this course is for the student to understand the distributed systems concepts inherent in today's noSQL stores. The next step leading up to a study of noSQL stores themselves is distributed file systems, where key concepts like transparencies, session semantics, fault tolerance, and naming all take a form that is easily understood in terms of files and directories, which we all work with. Go to Lesson 10
Caching of data is a key contributor to the efficient performance of a large distributed application. In Lesson 11 we turn back to the Levy and Silberschatz paper to see what it says about caching. Go to Lesson 11
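As a small illustration of the caching idea (generic, not drawn from the paper), Python's standard library can memoize an expensive lookup so that repeated requests are served locally instead of repeating the slow operation:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)          # keep up to 128 recent results in memory
def fetch_block(block_id: int) -> bytes:
    """Simulate an expensive remote read (a hypothetical stand-in for a
    network fetch in a distributed file system)."""
    time.sleep(0.1)              # pretend this is network latency
    return b"x" * 4096

fetch_block(7)                   # slow: goes to the "remote" store
fetch_block(7)                   # fast: served from the local cache
print(fetch_block.cache_info())  # reports hits=1, misses=1
```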
When a distributed system spans multiple locations or computers, the incidence of failure increases substantially. In this lesson we turn back to the Levy and Silberschatz paper to learn about fault tolerance. Go to Lesson 12
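One elementary fault-tolerance technique, shown here as a generic sketch rather than the paper's own mechanism, is to retry a failed remote call with exponential backoff; the flaky_read function is a hypothetical stand-in for a remote operation:

```python
import random
import time

def call_with_retries(op, attempts=6, base_delay=0.1):
    """Retry a flaky operation, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                              # give up after the last try
            time.sleep(base_delay * 2 ** attempt)  # back off: 0.1s, 0.2s, 0.4s ...

def flaky_read():
    # Hypothetical remote read that fails most of the time.
    if random.random() < 0.7:
        raise ConnectionError("replica unavailable")
    return "data"

print(call_with_retries(flaky_read))
```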
This lesson is the introduction to large-scale data stores. The lesson video gives an overview of noSQL data stores using a 2012 talk by Martin Fowler at the GoTo Aarhus conference. The reading covers the concept of data services; as it states, "while data services were initially conceived to solve problems in the enterprise world, the cloud is now making data services accessible to a much broader range of consumers." Go to Lesson 13 Lesson 13 Reflection
When storage in a data store is spread across multiple storage devices, the consistency of reads and writes becomes a paramount issue. Lesson 14 examines how distributed noSQL stores manage consistency. Go to Lesson 14
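One widely used rule of thumb, sketched generically here rather than tied to any particular store from the lesson: with N replicas, requiring W write acknowledgments and R read responses such that R + W > N guarantees that every read quorum overlaps the most recent write quorum:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if any R read replicas must intersect any W written replicas."""
    return r + w > n

# With 3 replicas, writing to 2 and reading from 2 guarantees overlap,
# so a read always sees at least one copy of the most recent write.
print(quorums_overlap(n=3, w=2, r=2))  # True
print(quorums_overlap(n=3, w=1, r=1))  # False: a stale read is possible
```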
When data are stored across multiple computers in a single noSQL data store, and the store can be accessed through any of the multiple servers that support it, how does the data store ensure that a request for a data object is routed to the location where that object is stored? This is the routing problem in noSQL data stores. This lesson discusses ways of keeping track of the information needed to route requests to the right server. Go to Lesson 15
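One common answer is consistent hashing, one of several routing schemes; the minimal sketch below (server names and hash choice are illustrative) maps each key deterministically onto a ring of servers, so every router arrives at the same answer without consulting a central directory:

```python
import bisect
import hashlib

def ring_position(s: str) -> int:
    # Map any string to a fixed point on the hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

servers = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(s), s) for s in servers)
points = [p for p, _ in ring]

def route(key: str) -> str:
    """Send a key to the first server at or after its ring position."""
    i = bisect.bisect(points, ring_position(key)) % len(ring)
    return ring[i][1]

print(route("user:1234"))  # every copy of the router gives the same answer
```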
While efficient locating and retrieval of data from a distributed noSQL data store is important, equally important is the flexibility that noSQL stores give you in storing data objects. Relational databases give structured and normalized tables for rapid and precise querying; noSQL stores support less structured data and less precise querying. This lesson walks you through a real-life comparison, taking ecological data and storing it under relational, document store, key-value pair, and column store data models. Go to Lesson 16 Lesson 16 Reflection
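To make the contrast concrete, here is one invented ecological observation expressed under three of the four models (the column store is omitted); this is a sketch, not the lesson's actual dataset:

```python
import json

# Relational: fixed columns, one value per cell; shown as one row of a
# table observations(site, date, species, count).
row = ("site-17", "2014-06-01", "wood thrush", 3)

# Key-value: an opaque value looked up by a single key the application chooses.
kv = {"observation:17:2014-06-01": '{"species": "wood thrush", "count": 3}'}

# Document: nested, self-describing structure; fields can vary per record.
doc = {
    "site": "site-17",
    "date": "2014-06-01",
    "observations": [{"species": "wood thrush", "count": 3,
                      "notes": "heard singing at dawn"}],
}
print(json.dumps(doc, indent=2))
```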
As data sharing increasingly moves from a friendly exchange between two parties who know each other to a transaction in an open data-sharing market, the need grows for data to carry with it enough information for the recipient to establish whether they can trust and use the data. Data provenance lies at the heart of the descriptive data needed to discern trustworthiness. This lesson introduces data provenance and gives you a sense of what provenance data is and how it is represented. Go to Lesson 17
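As a taste of how provenance can be represented, here is a hand-rolled sketch that loosely borrows the entity/activity/agent vocabulary of the W3C PROV model; the record contents are invented:

```python
# A minimal provenance record: which activity, run by which agent,
# generated this dataset, and from which inputs it was derived.
provenance = {
    "entity": "ecology-summary.csv",
    "wasGeneratedBy": {
        "activity": "aggregate-observations",
        "startedAt": "2014-06-02T09:00:00Z",
        "used": ["site-17-raw.csv", "site-18-raw.csv"],
        "wasAssociatedWith": "zpeng",  # the agent who ran the step
    },
}

# A recipient can walk the record to see what a product was derived from.
print(provenance["wasGeneratedBy"]["used"])
```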
The Internet contains vast numbers of pages linked together. Linked data enables more detailed information to be linked in a loose way that does not require standardization on a small set of names (a vocabulary). Linked data is key to data sharing because it allows concepts about data objects to be connected loosely, without seeking complete agreement. This lesson introduces linked data through Manu Sporny's 2012 video "Intro to Linked Data" and discusses some of the issues around linked data through the readings. Go to Lesson 18 Lesson 18 Reflection
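As a small example of the linked-data idea, the sketch below uses a JSON-LD-style description (the person URIs are invented; the FOAF property URIs are a real, widely used vocabulary): the @context maps short names to URIs, so independently published descriptions can be connected without agreeing on a shared schema:

```python
import json

# "name" here means the same thing as "name" in anyone else's data that
# points at the same vocabulary URI, which is what makes the data linkable.
doc = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "knows": {"@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id"},
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "knows": "http://example.org/people/bob",  # a link into another dataset
}
print(json.dumps(doc, indent=2))
```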
Lesson 19 steps back from the details and gives a broad discussion of data cyberinfrastructure in science: how it is structured in the United States and how it is sustained. It discusses science gateways, their use in several scientific disciplines, and the open source and open governance philosophy of one project in support of science. Go to Lesson 19
This last lesson ends the semester with several short readings on the social and economic barriers to sharing data. Watch the Lesson 20 video, do the readings, and complete the Lesson 20 assignment, submitting your response via Canvas. This lesson will use peer review. Go to Lesson 20
There are 20 lessons in all. This on-line course covers a semester of work; a student is expected to follow the timeline given on the course web site and in the syllabus. The course uses regular weekly on-line interactions where timely topics are discussed, and uses peer review as a tool for student feedback. Expect to put 6-7 hours a week into the course, including time spent on readings, assignments, projects, and engaging with the instructional content and each other.