Canvas is used for submitting some assignments and for checking your grades; it is also where you can talk with fellow classmates and instructors through Canvas chat.
The class will interact through a Google community group. The instructor will be available for Google Hangouts at the times listed below; information for joining the Hangout events is posted in the Google community group for this class.
Lessons are created and delivered using Office Mix, which supports combined voice, video, and slides.
Competency in the course is evaluated on a student's engagement with and mastery of the content, through the following:
The breakdown used in determining a grade is 30% exercises (a), 30% projects (b), and 40% engagement (c, d, e). Graduate students may skip one reflection. Undergraduates may skip one reflection and one exercise (or, for exercises, the lowest grade will be dropped).
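For concreteness, here is a minimal sketch of how such a weighted grade could be computed, assuming scores on a 0-100 scale and a simple average within each category; the function and the averaging rule are illustrative, not the official calculation:

```python
def final_grade(exercises, projects, engagement):
    """Weighted course grade: 30% exercises, 30% projects, 40% engagement.

    Each argument is a list of 0-100 scores; each category is averaged
    before weighting (an assumption made for illustration only).
    """
    def avg(scores):
        return sum(scores) / len(scores)
    return 0.30 * avg(exercises) + 0.30 * avg(projects) + 0.40 * avg(engagement)

# Example: an undergraduate whose lowest exercise grade has been dropped.
print(final_grade([90, 85, 95], [88, 92], [100, 90, 95]))  # -> 92.0
```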
Beth Plale ("prof plale")
Professor of Informatics and Computing
School of Informatics and Computing
https://www.linkedin.com/in/bethplale
@bplale
Email
Zong Peng
Research Assistant, Data to Insight Center
PhD candidate, School of Informatics and Computing,
Indiana University
Email
Data is abundant, and this abundance offers potential for new discovery and for economic and social gain. But data can be difficult to use: it can be noisy and inadequately contextualized; the gap from data to knowledge can be too large; and limits in technology or policy can prevent it from being easily combined with other data. This course examines the underlying principles and technologies needed to capture data, clean it, contextualize it, store it, access it, and trust it for a repurposed use. Specifically, the course covers 1) the distributed systems and database concepts underlying noSQL and graph databases, 2) best practices in data pipelines, 3) foundational concepts in metadata and provenance, with examples, and 4) the developing theory of data trust and its role in reuse.
Lesson Number | Topic | Area | Week
---|---|---|---
Lesson 1 | Big Data in Business | Intro | 1
Lesson 2 | Big Data in Scientific Research | Intro | 2
Lesson 3 | Data Processing Pipelines in Science | Data pipelines | 3
Lesson 4 | Data Processing Pipelines in Business | Data pipelines | 3
Lesson 5 | Data Cleansing | Data pipelines | 4
Lesson 6 | Data Coding | Data pipelines | 4
Lesson 7 | Software Systems Design Overview | Software systems | 5
Lesson 8 | Complexity in Software Systems | Software systems | 5
Lesson 9 | Project: Twitter Dataset Analysis and Modeling | Project | 6
Lesson 10 | Distributed File System Introduction | Distributed systems | 7
Lesson 11 | Role of Caching in Distributed Computing | Distributed systems | 8
Lesson 12 | Role of Fault Tolerance in Distributed Computing | Distributed systems | 9
Lesson 13 | NoSQL Data Stores | Distributed data stores | 10
Lesson 14 | Consistency in Distributed noSQL Data Stores | Distributed data stores | 10
Lesson 15 | Routing in noSQL Data Stores | Distributed data stores | 11
Lesson 16 | Comparison of noSQL/SQL Data Models | Distributed data stores | 12
Lesson 17 | Data Provenance | Provenance | 13
Lesson 18 | Linked Data | Linked data | 14
Lesson 19 | Science Gateways, Scientific Workflows and Distributed Computing: Data In, Data Out | Open source open data | 15
Lesson 20 | Social and Technical Barriers to Data Sharing | Data sharing | 16

Each lesson is described in more detail below.
Lessons 1 and 2 give you a perspective on how society is thinking and talking about big data today. We look at the topic from the perspective of business (Lesson 1) and science (Lesson 2). Go to Lesson 1
Lesson 2 gives a science perspective on big data. You will read a dozen or so short articles from Science magazine (February 2011), written by practitioners in a dozen or so fields spanning the social, medical, and natural sciences, each discussing their discipline's unique data issues. Taken together, the collection highlights how differently one discipline sees its data challenges from another. Go to Lesson 2
What is a data pipeline? Data rarely arrive instantly ready to use for whatever exploratory purpose you have in mind; between creation and use they undergo numerous steps, some of which produce end products in themselves. Lesson 3 is derived from a talk that database scholar Jim Gray gave in 2009 on the Laboratory Information Management System (LIMS), an early notion of the data processing pipeline for science. Go to Lesson 3 - Part 1 Go to Lesson 3 - Part 2
Lesson 4 introduces the business perspective on data pipelines. It draws inspiration from "Data Without Limits", a 2011 talk by Werner Vogels, CTO of Amazon, in which he discusses data pipelines in the context of business computing and argues that cloud computing is core to a business model "without limits". The pipeline he proposes is: collect | store | organize | analyze | share. Go to Lesson 4
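As a rough sketch of the idea (not Vogels' own design), the five stages can be written as composable functions, each consuming the previous stage's output; the data shapes here are invented for illustration:

```python
# A minimal sketch of the collect | store | organize | analyze | share
# pipeline, with each stage as a plain function over hypothetical records.

def collect():
    # Gather raw records from some source (hard-coded here for illustration).
    return ["2014-01-03,42", "2014-01-04,17", "2014-01-05,99"]

def store(raw):
    # Persist raw records; here we just simulate durable storage in memory.
    return list(raw)

def organize(stored):
    # Parse the raw text into structured (date, value) tuples.
    return [(line.split(",")[0], int(line.split(",")[1])) for line in stored]

def analyze(records):
    # Derive a simple summary statistic from the organized records.
    values = [v for _, v in records]
    return {"count": len(values), "mean": sum(values) / len(values)}

def share(result):
    # Publish the analysis; here, just print it.
    print(result)

share(analyze(organize(store(collect()))))  # run the whole pipeline end to end
```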
Lesson 5 gives use cases in data cleansing from environmental science and social science. It includes a link to the video "Dealing with Missing Data and Data Cleansing" (Part 3 of 3 on Quantitative Coding and Data Entry) by Graham R. Gibbs, Research Methods in Social Sciences, University of Huddersfield. Go to Lesson 5 Lesson 5 Reflection
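To make the missing-data discussion concrete, here is a small sketch using the pandas library; the column names, the outlier threshold, and the fill strategies are assumptions for illustration, not steps prescribed by the lesson:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps and one obviously bad value.
df = pd.DataFrame({
    "temperature": [21.3, np.nan, 22.1, 999.0, 21.8],
    "humidity":    [40.0, 41.5, np.nan, 39.8, 40.2],
})

df.loc[df["temperature"] > 100, "temperature"] = np.nan   # flag impossible values
df["temperature"] = df["temperature"].interpolate()       # fill gaps between readings
df["humidity"] = df["humidity"].fillna(df["humidity"].mean())  # mean-impute

print(df)
```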
In Lesson 6 the student gains basic knowledge of coding data: tagging or categorizing it on important features so that it can be clustered and otherwise analyzed. The student will get a chance to try their hand at coding a dataset. One possible project uses 278 media mentions from the Pervasive Technology Institute over the year 2013-2014; the categorization the student does is illustrated by visualizing the results as a simple pie chart. Other possible projects will be introduced later. Go to Lesson 6
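The kind of coding-and-visualization exercise described might look like the following sketch; the category names and counts are invented, and the matplotlib pie chart stands in for whatever tool the student chooses:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical hand-coded categories assigned to a handful of media mentions.
codes = ["research", "education", "research", "outreach",
         "research", "education", "outreach", "research"]

counts = Counter(codes)  # tally mentions per category
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("Media mentions by category (illustrative data)")
plt.show()
```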
Lesson 7 is an introduction to the general concepts of software systems. These concepts are used in designing the large software systems needed to handle any of today's large applications (social media, cloud services, shopping carts, video rentals, ...). Go to Lesson 7 Lesson 7 Reflection
A key problem in large-scale applications is complexity. Lesson 8 examines the sources of that complexity and the design principles used to overcome it. Go to Lesson 8
Lesson 9 is the course project; an explanation of the project is the subject of the Lesson 9 video. You will have three weeks to complete the project. The project description is in Canvas, and the project is turned in through Canvas. Expect to give a demo of the project. Go to Lesson 9
We looked in Lessons 7 and 8 at systems design and some of the complexities therein. The focus of this course is for the student to understand the distributed systems concepts inherent in today's noSQL stores. The next step leading up to a study of noSQL stores themselves is distributed file systems, where key concepts like transparencies, session semantics, fault tolerance, and naming all take a form that is easily understood in terms of files and directories, which we all work with. Go to Lesson 10
Caching of data is a key contributor to the efficient performance of a large distributed application. In Lesson 11 we turn back to the Levy and Silberschatz paper to see what it says about caching. Go to Lesson 11
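As a small illustration of the caching idea (generic, not drawn from the paper), Python's standard library can memoize an expensive lookup so that repeated requests are served locally instead of repeating the slow operation:

```python
import time
from functools import lru_cache

@lru_cache(maxsize=128)          # keep up to 128 recent results in memory
def fetch_block(block_id: int) -> bytes:
    """Simulate an expensive remote read (a hypothetical stand-in for a
    network fetch in a distributed file system)."""
    time.sleep(0.1)              # pretend this is network latency
    return b"x" * 4096

fetch_block(7)                   # slow: goes to the "remote" store
fetch_block(7)                   # fast: served from the local cache
print(fetch_block.cache_info())  # reports hits=1, misses=1
```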
When a distributed system spans multiple locations or computers, the incidence of failure increases substantially. In this lesson we turn back to the Levy and Silberschatz paper to learn about fault tolerance. Go to Lesson 12
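One elementary fault-tolerance technique, shown here as a generic sketch rather than the paper's own mechanism, is to retry a failed remote call with exponential backoff; the flaky_read function is a hypothetical stand-in for a remote operation:

```python
import random
import time

def call_with_retries(op, attempts=6, base_delay=0.1):
    """Retry a flaky operation, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                              # give up after the last try
            time.sleep(base_delay * 2 ** attempt)  # back off: 0.1s, 0.2s, 0.4s ...

def flaky_read():
    # Hypothetical remote read that fails most of the time.
    if random.random() < 0.7:
        raise ConnectionError("replica unavailable")
    return "data"

print(call_with_retries(flaky_read))
```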
This lesson is the introduction to large-scale data stores. The lesson video gives an overview of noSQL data stores using a 2012 talk by Martin Fowler at the GoTo Aarhus conference. The reading covers the concept of data services; as it states, "while data services were initially conceived to solve problems in the enterprise world, the cloud is now making data services accessible to a much broader range of consumers." Go to Lesson 13 Lesson 13 Reflection
When storage in a data store is spread across multiple storage devices, the consistency of reads and writes becomes a paramount issue. Lesson 14 examines how distributed noSQL stores manage consistency. Go to Lesson 14
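One widely used rule of thumb, sketched generically here rather than tied to any particular store from the lesson: with N replicas, requiring W write acknowledgments and R read responses such that R + W > N guarantees that every read quorum overlaps the most recent write quorum:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if any R read replicas must intersect any W written replicas."""
    return r + w > n

# With 3 replicas, writing to 2 and reading from 2 guarantees overlap,
# so a read always sees at least one copy of the most recent write.
print(quorums_overlap(n=3, w=2, r=2))  # True
print(quorums_overlap(n=3, w=1, r=1))  # False: a stale read is possible
```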
When data are stored across multiple computers in a single noSQL data store, and the store can be accessed through any of the multiple servers that support it, how does the data store ensure that a request for a data object is routed to the location where that object is stored? This is the routing problem in noSQL data stores. This lesson discusses ways of keeping track of the information needed to route requests to the right server. Go to Lesson 15
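One common answer is consistent hashing, one of several routing schemes; the minimal sketch below (server names and hash choice are illustrative) maps each key deterministically onto a ring of servers, so every router arrives at the same answer without consulting a central directory:

```python
import bisect
import hashlib

def ring_position(s: str) -> int:
    # Map any string to a fixed point on the hash ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

servers = ["node-a", "node-b", "node-c"]
ring = sorted((ring_position(s), s) for s in servers)
points = [p for p, _ in ring]

def route(key: str) -> str:
    """Send a key to the first server at or after its ring position."""
    i = bisect.bisect(points, ring_position(key)) % len(ring)
    return ring[i][1]

print(route("user:1234"))  # every copy of the router gives the same answer
```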
While efficient locating and retrieval of data from a distributed noSQL data store is important, equally important is the flexibility that noSQL stores give you in storing data objects. Relational databases give structured and normalized tables for rapid and precise querying; noSQL stores support less structured data and less precise querying. This lesson walks you through a real-life comparison, taking ecological data and storing it under relational, document store, key-value pair, and column store data models. Go to Lesson 16 Lesson 16 Reflection
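To make the contrast concrete, here is one invented ecological observation expressed under three of the four models (the column store is omitted); this is a sketch, not the lesson's actual dataset:

```python
import json

# Relational: fixed columns, one value per cell; shown as one row of a
# table observations(site, date, species, count).
row = ("site-17", "2014-06-01", "wood thrush", 3)

# Key-value: an opaque value looked up by a single key the application chooses.
kv = {"observation:17:2014-06-01": '{"species": "wood thrush", "count": 3}'}

# Document: nested, self-describing structure; fields can vary per record.
doc = {
    "site": "site-17",
    "date": "2014-06-01",
    "observations": [{"species": "wood thrush", "count": 3,
                      "notes": "heard singing at dawn"}],
}
print(json.dumps(doc, indent=2))
```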
As data sharing increasingly moves from a friendly exchange between two parties who know each other to a transaction in an open data-sharing market, the need grows for data to carry with it enough information for the recipient to establish whether they can trust and use the data. Data provenance lies at the heart of the descriptive data needed to discern trustworthiness. This lesson introduces data provenance and gives you a sense of what provenance data is and how it is represented. Go to Lesson 17
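As a taste of how provenance can be represented, here is a hand-rolled sketch that loosely borrows the entity/activity/agent vocabulary of the W3C PROV model; the record contents are invented:

```python
# A minimal provenance record: which activity, run by which agent,
# generated this dataset, and from which inputs it was derived.
provenance = {
    "entity": "ecology-summary.csv",
    "wasGeneratedBy": {
        "activity": "aggregate-observations",
        "startedAt": "2014-06-02T09:00:00Z",
        "used": ["site-17-raw.csv", "site-18-raw.csv"],
        "wasAssociatedWith": "zpeng",  # the agent who ran the step
    },
}

# A recipient can walk the record to see what a product was derived from.
print(provenance["wasGeneratedBy"]["used"])
```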
The Internet contains vast numbers of pages linked together. Linked data enables more detailed information to be linked in a loose way that does not require standardization on a small set of names (a vocabulary). Linked data is key to data sharing because it allows concepts about data objects to be connected loosely, without seeking complete agreement. This lesson introduces linked data through Manu Sporny's 2012 video "Intro to Linked Data" and discusses some of the issues around linked data through the readings. Go to Lesson 18 Lesson 18 Reflection
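As a small example of the linked-data idea, the sketch below uses a JSON-LD-style description (the person URIs are invented; the FOAF property URIs are a real, widely used vocabulary): the @context maps short names to URIs, so independently published descriptions can be connected without agreeing on a shared schema:

```python
import json

# "name" here means the same thing as "name" in anyone else's data that
# points at the same vocabulary URI, which is what makes the data linkable.
doc = {
    "@context": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "knows": {"@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id"},
    },
    "@id": "http://example.org/people/alice",
    "name": "Alice",
    "knows": "http://example.org/people/bob",  # a link into another dataset
}
print(json.dumps(doc, indent=2))
```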
Lesson 19 steps back from the details and gives a broad discussion of data cyberinfrastructure in science: how it is structured in the United States and how it is sustained. It discusses science gateways, their use in several scientific disciplines, and the open source and open governance philosophy of one project in support of science. Go to Lesson 19
This last lesson ends the semester with several short readings on the social and economic barriers to sharing data. Watch the Lesson 20 video, do the readings, and complete the Lesson 20 assignment, submitting your response via Canvas. This lesson will use peer review. Go to Lesson 20
There are 20 lessons in all. This on-line course covers a semester of work; a student is expected to follow the timeline given on the course web site and in the syllabus. The course uses regular weekly on-line interactions where timely topics are discussed, and uses peer review as a tool for student feedback. Expect to put 6-7 hours a week into the course, including time spent on readings, assignments, projects, and engaging with the instructional content and each other.