Dr. Lisa Amini
Director, IBM Research, USA
Why AI needs even more Data Science, and vice versa
Wednesday June 6th, 9:00 – 10:00
Recent advances in AI and deep learning are capturing headlines, and yet suffer from a variety of shortcomings, including catastrophic forgetting, inability to generalize robustly, susceptibility to bias, and inadequate techniques for introspection and explanation. Many of these are challenges where an even greater influence from the expertise and rigorous approaches of data science could have profound effects. For example, AI has an urgent and critical need for learning causal models, an area requiring a sound grasp of statistical analysis, principles of identification, and other mainstays of data science. Conversely, differentiable (deep learning) techniques for learning causal structure could bring powerful new tools to data scientists. In another example, information theoretic approaches to understanding information flow in deep neural networks could enable more robust, efficient, and predictable AI. AI for ethical decision making is yet another area with a deep need for complementary data science and AI expertise. This talk will cover these and other examples of projects we are undertaking in the new MIT-IBM Watson AI Lab, and the necessary interplay of data science and AI. I will also highlight a novel academic+industry approach we are taking to AI research, and why it is both unique and compelling.
Dr. Lisa Amini is the Director of IBM Research Cambridge, which includes the newly announced MIT-IBM Watson AI Lab (http://mitibmwatsonailab.mit.edu). The MIT-IBM Watson AI Lab is dedicated to fundamental artificial intelligence (AI) research with the goal of propelling scientific breakthroughs in four research pillars: AI Algorithms, the Physics of AI, the Application of AI to industries, and Advancing shared prosperity through AI; all of which leverage and pioneer machine learning, deep learning, and machine reasoning algorithms. Lisa was previously Director of Knowledge & Reasoning Research in the Cognitive Computing group at IBM’s TJ Watson Research Center in New York, and she is also an IBM Distinguished Engineer. Lisa was the founding Director of IBM Research Ireland, and the first woman Lab Director of any IBM Research Global (i.e., non-US) Lab (2010-2013). As Senior Manager of the Exploratory Stream Processing Research Group, Lisa was the founding Chief Architect for IBM’s InfoSphere Streams product and Research predecessor. She earned her PhD in Computer Science from Columbia University, in New York.
Dr. Surajit Chaudhuri
Microsoft Research, Redmond, USA
What Can Data Platforms Do to Support Data Scientists?
Tuesday June 5th, 9:00 – 10:00
Almost two decades ago, some enterprise database systems including Microsoft SQL Server offered the ability to support data mining primitives (including predictive models) natively in the database systems with the goal of reducing data movement and supporting data governance. However, such attempts did not gain traction. In this talk, we reflect on that failure and discuss the evolution of “in-database” analytics landscape. In the second part of the talk, we discuss in what other fundamental ways data platforms need to advance to aid data scientists with the task of data exploration – an essential step towards gaining insights from data. Specifically, we focus on the challenges of finding and transforming structured data and our approach to “Self-Service” Data Transformation.
Surajit Chaudhuri is a Distinguished Scientist at Microsoft Research in Redmond (USA) and leads the Data Management, Exploration and Mining group. He also works closely with Microsoft’s Azure Data division. Surajit’s current areas of interest are Big Data platforms, self-manageability, and cloud database services. Working with his colleagues in Microsoft Research, he helped incorporate the Database Engine Tuning Advisor and Data Cleaning technology in Microsoft SQL Server. Surajit is an ACM Fellow, a recipient of the ACM SIGMOD Edgar F. Codd Innovations Award, ACM SIGMOD Contributions Award, a VLDB 10-year Best Paper Award, and an IEEE Data Engineering Influential Paper Award.
Prof. Andreas Krause
ETH Zurich, Switzerland
Towards Safe Reinforcement Learning
Tuesday June 5th, 14:00 – 15:00
More and more machine learning systems make data-driven decisions in the real world, in increasingly higher-stakes applications. This has caused substantial interest in reinforcement learning — the field of learning to make decisions from data — which has seen stunning recent empirical breakthroughs. At its heart is the challenge of trading off exploration — collecting data for learning better models — and exploitation — using the estimate to make decisions. In many applications, however, exploration is a potentially dangerous proposition, as it requires experimenting with actions that have unknown consequences. Hence, most prior work has confined exploration to simulated environments. In this talk, I will present our work towards rigorously reasoning about the safety of exploration in reinforcement learning. I will discuss a model-free approach, where we seek to optimize an unknown reward function subject to unknown constraints. Both reward and constraints are revealed through noisy experiments, and safety requires that no infeasible action is chosen at any point. I will also discuss model-based approaches, where we learn about system dynamics through exploration, yet need to verify safety of the estimated policy. Our approaches use Bayesian inference over the objective, constraints, and dynamics, and — under some regularity conditions — are guaranteed to be both safe and complete, i.e., to converge to a natural notion of reachable optimum. I will also show experiments on safely tuning cyber-physical systems in a data-driven manner.
Andreas Krause is a Professor of Computer Science at ETH Zurich, where he leads the Learning & Adaptive Systems Group. He also serves as Academic Co-Director of the Swiss Data Science Center. Before that he was an Assistant Professor of Computer Science at Caltech. He received his Ph.D. in Computer Science from Carnegie Mellon University (2008) and his Diplom in Computer Science and Mathematics from the Technical University of Munich, Germany (2004). He is a Microsoft Research Faculty Fellow, received an ERC Starting Investigator grant, the Deutscher Mustererkennungspreis as well as the ETH Golden Owl teaching award. His research has been recognized by awards at several premier conferences and journals. He is currently serving as Program Co-Chair for ICML 2018 and as Action Editor for the Journal of Machine Learning Research.
Prof. Volker Markl
Technische Universität Berlin, Germany
Big Data Management and Apache Flink: Key Challenges and (Some) Solutions
Monday June 4th, 14:00 – 15:00
The shortage of qualified data scientists is effectively limiting Big Data from fully realizing its potential to deliver insight and provide value for scientists, business analysts, and society as a whole. In order to remedy this situation, we believe that novel technologies that draw on the concepts of declarative languages, query optimization, automatic parallelization and hardware adaptation are necessary. In this talk, we will discuss several aspects of our research in this area, including results in how to optimize iterative data flow programs, optimistic fault-tolerance, and steps toward a deep language embedding of advanced data analysis programs. We will also discuss how our research activities have led to Apache Flink, an open-source big data analytics system, which by now has become a major data processing engine in the Apache Big Data Stack, used in a variety of applications by academia and industry.
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin), director of the research group “Intelligent Analysis of Massive Data” at the German Research Center for Artificial Intelligence (DFKI), and speaker of the Berlin Big Data Center (BBDC). Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 19 patents, has transferred technology into several commercial products, and advises several companies and startups. He was speaker and principal investigator of the Stratosphere research project that resulted in the Apache Flink big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany’s leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.”
Prof. Gil McVean
Oxford University, UK
The genomic analysis of biomedical big data
Monday June 4th, 9:00 – 10:00
Population-scale analysis of genomic variation, linked to individual-level data on disease, treatment, and response, has the potential to answer many key questions in biomedical research about the biological processes that lead to disease and about how best to design effective, individualized interventions. However, the analysis of medical data obtained from routine sources and through self-reporting presents many challenges. I will describe work using the UK Biobank data to characterize the genetic architecture and heterogeneity of common human diseases, and discuss some of the many open statistical and computational challenges in the field.
Gil McVean is Professor of Statistical Genetics at the University of Oxford and Director of Oxford’s Big Data Institute within the Li Ka Shing Centre for Health Information and Discovery (www.bdi.ox.ac.uk). His research focuses on understanding the molecular and evolutionary processes that shape genetic variation in populations and the relationship between genetic variation and phenotype. He has made contributions to our understanding of areas including recombination hotspots, historical patterns of natural selection, the male mutation rate, human genomic variation, the role of HLA in complex disease and genealogical processes. He has played a leading role in the HapMap and 1000 Genomes Projects, is co-founder of Genomics plc (www.genomicsplc.com), and currently works on organisms from HIV to malaria.
Prof. Victoria Stodden
University of Illinois at Urbana-Champaign, USA
Reproducibility and Generalizability in Data-enabled Discovery
Wednesday June 6th, 14:00 – 15:00
As computation becomes central to scientific research and discovery – bringing us the field of Data Science – new questions arise regarding the implementation, dissemination, and evaluation of methods that underlie scientific claims. I present a framework for conceptualizing the affordances that support Data Science including computational reproducibility, transparency, and generalizability of findings. For example, reproducibility in computational research can be interpreted most narrowly as a simple trace of computational steps that generate scientific findings, and most expansively as an entirely independent implementation of an experiment that tests the same hypothesis as previously published work. Standards for determining a scientific finding are necessarily adapting to computationally- and data-enabled research. Finally, the social context for these innovations raises important questions regarding incentives to engage in new research practices and the ethics of these practices themselves.
Victoria Stodden joined the School of Information Sciences as an associate professor in Fall 2014. She is a leading figure in the area of reproducibility in computational science, exploring how we can better ensure the reliability and usefulness of scientific results in the face of increasingly sophisticated computational approaches to research. Her work addresses a wide range of topics, including standards of openness for data and code sharing, legal and policy barriers to disseminating reproducible research, robustness in replicated findings, cyberinfrastructure to enable reproducibility, and scientific publishing practices. Stodden co-chairs the NSF Advisory Committee for CyberInfrastructure and is a member of the NSF Directorate for Computer and Information Science and Engineering (CISE) Advisory Committee. She also serves on the National Academies Committee on Responsible Science: Ensuring the Integrity of the Research Process.
Previously an assistant professor of statistics at Columbia University, Stodden taught courses in data science, reproducible research, and statistical theory and was affiliated with the Institute for Data Sciences and Engineering. She co-edited two books released in 2014—Privacy, Big Data, and the Public Good: Frameworks for Engagement published by Cambridge University Press and Implementing Reproducible Research published by Taylor & Francis. Stodden earned both her PhD in statistics and her law degree from Stanford University. She also holds a master’s degree in economics from the University of British Columbia and a bachelor’s degree in economics from the University of Ottawa.