
=Data Science=

//Notes from first meeting - 11/09/2011//

 * 1) Target audience:
  * Developers, who need to take their skills to the next level and work more with data
  * DBAs, who have database skills but again need to ramp up to work with much larger amounts of data
  * Data analysts, an amorphous group with some technical skills, but who may lack the formal credentials to move into other areas
  * Higher-level engineers/researchers needing to work with big data. This last group is clearly different from the other groups, but we're considering whether we can include them and, if so, how.
 * 2) We then started to brainstorm about possible tools to use, but that was just a starting point. Please also start to add possible program topics to Section 1 below. We'll review these later and then divide them into a logical course series.
 * 3) We considered the possibility of a boot camp which might take place in the summer before the Certificate starts. It would provide a foundation in math/statistics so that the starting point for the Cert program would be a bit higher and more consistent. If we develop this idea, we'll do so after the curriculum for the Certificate has been developed.
 * 4) Lastly, there was some level of agreement to check in with employees of major employers who might use data science. Add the questions we want answered to the Questions Brainstorm section below so that we can develop a relevant curriculum.

=1. Program Topics Brainstorm=
 * 1) //Storage//...Describe all the different ways data can be stored: file systems, databases, etc. This would address theoretical ideas like replication, transactions, and backup, and tools like SQL, HDFS, Solr, Lucene. This would be decoupled from a discussion of parallel processing.
 * 2) //Concurrency//...The theory of parallel processing: how to calculate theoretical upper bounds of parallel systems and evaluate actual performance. Amdahl's law (a short Python sketch follows this list). The concept of being I/O- vs. CPU-bound. Tools would be MapReduce and maybe a shared-memory architecture.
 * 3) //Map-Reduce:// A programming paradigm that allows processing of very large amounts of data. Basic programming (counting, histogram, index building) using the map-reduce paradigm. Hadoop, along with its file system HDFS, is currently the most widely adopted open-source environment for map-reduce.
 * 4) //Basic Machine Learning (Inferential Statistics):// Basic hypothesis testing (e.g., t-test) and confidence intervals (a short sketch follows this list). Basic models of regression (linear, logistic), curve fitting (least squares), classification (e.g., Naive Bayes and decision trees) and clustering (e.g., k-means).
 * 5) //Data Exploration:// Calculation of relative frequencies and database fill rates (e.g., 30% of users have a white MacBook that is at least 2 years old), SQL statements, Hadoop/Hive/Pig queries, estimates of how long until a query is completed, etc.
 * 6) //Unstructured Data:// Information retrieval by TF-IDF, PageRank and other trust networks, rule-based vs. statistics-based NLP, tag clouds, ontologies and the Semantic Web.
 * 7) //Personalization and Recommendation:// Basic machine learning technologies applied to recommendation and personalization, e.g., Naive Bayes, user profile-based recommendation, purchase history-based recommendation, market-basket analysis and association rules.
 * 8) //Location-based Data:// Geo-referencing on the spheroid, spatial indexes, working with vector-based map data and GPS location information, spatial data queries and data aggregation, routing, and the notion of proximity by route vs. radius.
 * 9) //Entity Resolution:// Reasoning with partial, incomplete and noisy data about individuals across multiple data sources.
 * 10) //Sampling and Experimental Design:// Designing experiments using basic inferential statistics (above), randomized sampling and A/B testing.
 * 11) //Problem Formulation and Definition:// Define the animal you would like to hunt and the animal you are in fact hunting: know what you are shooting at before you shoot. This is less about stats and more about defining the distribution and understanding the problem type -- preferably in a machine learning context.
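
For topic 2, a minimal sketch of Amdahl's law in Python (one of the tools listed in Section 2); the parallel fraction and worker counts are made-up illustrative values:

{{{
# Amdahl's law: the upper bound on speedup when a fraction p of the
# work parallelizes perfectly across n workers.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A 90%-parallel job tops out at 10x no matter how many workers you add.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
}}}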
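
And for topic 4, a sketch of a two-sample t-test and a 95% confidence interval using numpy/scipy (both on the tools list in Section 2); the two samples are invented for illustration:

{{{
import numpy as np
from scipy import stats

a = np.array([5.1, 4.9, 6.2, 5.8, 5.5, 6.0])   # hypothetical sample A
b = np.array([4.2, 4.8, 4.5, 5.0, 4.4, 4.7])   # hypothetical sample B

t_stat, p_value = stats.ttest_ind(a, b)         # H0: the means are equal
print("t = %.2f, p = %.4f" % (t_stat, p_value))

# 95% confidence interval for the mean of sample A
mean, se = a.mean(), stats.sem(a)
lo, hi = stats.t.interval(0.95, len(a) - 1, loc=mean, scale=se)
print("95%% CI for mean(a): (%.2f, %.2f)" % (lo, hi))
}}}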

=1A. Courses=

**Course 1: Storage / Basic Large Data Processing**

Description:
An introduction to data storage and processing techniques. This course covers the fundamentals of different mechanisms for storing, retrieving, and processing data. An overview of different database technologies is presented and the course finishes with a discussion of choosing the appropriate tool to get the job done.

Outline:

 * 1) Introduction to data (data types, data movement, terminology, etc.)
 * 2) What is data? Define data to be things on a disk, specifically files. Talk about how files are really just a bunch of bytes, and how the arrangement of those bytes, and their subsequent codification into application-defined type systems, lets them be useful to us as users.
 * 3) Type systems and their purpose
 * 4) Data movement as a pipeline. Cover the idea of "data -> process -> data" by demonstrating the UNIX pipe operator with simple programs (cut, sort, uniq, grep); a runnable version appears after this outline. Demonstrating this "program as building block" concept right away should help later items sink in.
 * 5) Storage and concurrency preliminaries. Physical: disks, memory, storage systems. Logical: basics of file systems. Defining parallelism, defining concurrency, multi-tenant systems, time sharing.
 * 6) Basics of physical disks: disk and cache - @http://momjian.us/main/writings/pgsql/Dziuba_OSCON_2011_Data.pdf
 * 7) The basics of filesystems. (I'd actually think it sane to focus on HDFS for this because it effectively abstracts away implementation details and still presents many core concepts of what makes a filesystem.)
 * 8) Defining concurrency. The Java Concurrency tutorial provides a good look at concurrency, but is too detailed. I think this should talk about what a lock is and what synchronizing access to shared state means.
 * 9) Files and file-based data systems
 * 10) File operations - open, close, seek, etc. This could be a simple demonstration in Python or C or anything that makes the process very clear (a short Python seek demo appears after this outline). This is more to clearly illustrate how a database of any sort will locate data inside a file.
 * 11) ISAM and VSAM; we'll look at MySQL's MyISAM engine as a practical example.
 * 12) LevelDB - a much more complex way of storing data, but a simple key-value database that folds files together and merges them in the background. This should be an "at your discretion" read since there's no good documentation that isn't source code.
 * 13) Relational Database Management Systems - 1
 * 14) Fundamentals of an RDBMS
 * 15) Data Design, Normalization
 * 16) Optimization - Indexes
 * 17) Relational Database Management Systems - 2
 * 18) Introduction to the SQL Language (a small sqlite3 sketch appears after this outline)
 * 19) Sharding and Partitioning Introduction
 * 20) Hadoop Introduction and Overview (a toy word-count sketch appears after this outline)
 * 21) Jeff Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI 2004 - @http://usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
 * 22) "RDBMS-like" extensions to Hadoop (languages, schemas, optimization): Pig (Olston 08), Hive (Thusoo 09), HBase
 * 23) NoSQL: MapReduce vs. Parallel RDBMS
 * 24) Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?", CACM 2010
 * 25) Performance analysis of parallel DBs and MapReduce: Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD 2009
 * 26) NoSQL - Specific NoSQL systems
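
A few sketches referenced from the outline above. First, item 4's "data -> process -> data" pipeline, driven from Python via subprocess so the UNIX building blocks stay visible (the five input lines are made up; this assumes sort and uniq are on the PATH):

{{{
import subprocess

# Wire up the equivalent of: printf "b\na\nb\nb\na\n" | sort | uniq -c
sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True)
uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout,
                        stdout=subprocess.PIPE, text=True)
sort.stdin.write("b\na\nb\nb\na\n")   # feed raw data into the head of the pipe
sort.stdin.close()
sort.stdout.close()                   # let uniq see EOF once sort finishes
print(uniq.communicate()[0])          # ->   2 a  /  3 b
}}}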
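
Next, item 10's file-location demonstration: a minimal open/seek/read sketch over fixed-width records (the file name and 16-byte record size are assumptions for illustration):

{{{
RECORD_SIZE = 16  # bytes per record, so record i starts at byte i * 16

with open("records.dat", "wb") as f:   # write 100 fixed-width records
    for i in range(100):
        f.write(b"%-16d" % i)          # pad each record to 16 bytes

with open("records.dat", "rb") as f:   # random access by record number
    f.seek(42 * RECORD_SIZE)           # jump straight to record 42
    print(f.read(RECORD_SIZE))         # -> b'42              '
}}}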
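
For items 16 and 18, a self-contained taste of SQL and indexes using Python's built-in sqlite3, which runs without a database server; the table and its contents are invented:

{{{
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO users (name, city) VALUES (?, ?)",
                 [("Ada", "Seattle"), ("Ben", "Portland"), ("Cy", "Seattle")])
conn.execute("CREATE INDEX idx_users_city ON users(city)")  # item 16: an index

for row in conn.execute("SELECT name FROM users WHERE city = ?", ("Seattle",)):
    print(row)   # ('Ada',) then ('Cy',)
conn.close()
}}}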
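
Finally, for items 20-21, a toy word count showing the map-reduce paradigm of the Dean & Ghemawat paper, with plain Python standing in for Hadoop:

{{{
from itertools import groupby

def mapper(line):                      # emit (word, 1) for every word
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):             # sum the counts for one word
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = sorted(kv for line in lines for kv in mapper(line))   # map + shuffle
for word, group in groupby(pairs, key=lambda kv: kv[0]):      # reduce by key
    print(reducer(word, (count for _, count in group)))
}}}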

MORE ADVANCED: //Map-Reduce// and related large-scale systems (the map-reduce paradigm itself is described in topic 3 of Section 1 above):
 * 1) Low-latency, highly scalable, eventually consistent key-value and document stores: Cassandra, CouchDB, MongoDB
 * 2) Dremel (Google)
 * 3) Advanced search techniques: Lucene, Solr
 * 4) Search concepts - indexes, analyzers, documents, and terms
 * 5) Inverted indexes - the poor man's search, and the core of everything that we're doing with search (see the sketch after this list)
 * 6) Writing Solr queries. This may be the easiest way to show how things work.
 * 7) Platform Selection Techniques
 * 8) Developing the solution (end result desired)
 * 9) Defining the data sources
 * 10) Mapping requirements to capabilities
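
A "poor man's search" sketch for the inverted-index item above: map each term to the documents containing it, then intersect posting lists to answer a query. The three documents are invented; Lucene/Solr build on this same core idea:

{{{
from collections import defaultdict

docs = {1: "big data storage", 2: "big data analysis", 3: "data visualization"}

index = defaultdict(set)               # term -> set of doc ids (posting list)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    postings = [index[t] for t in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search("big data"))   # -> [1, 2]
}}}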

**Course 2: Parallel Processing / Hadoop / Advanced Data Processing**

 * Structure the course around a machine learning book.
 * [May better belong in Course 1?] Practical tools and techniques for scrubbing and cleaning data. "If you have enough rows, every numeric field will have a few strings in it." (Paraphrased from John Rauser.)
 * //Basic Machine Learning (Inferential Statistics):// Basic hypothesis testing (e.g., t-test) and confidence intervals. Basic models of regression (linear, logistic), curve fitting (least squares), classification (e.g., Naive Bayes and decision trees) and clustering (e.g., k-means; a bare-bones k-means sketch follows this list).
 * //Personalization and Recommendation:// Basic machine learning technologies applied to recommendation and personalization, e.g., Naive Bayes, user profile-based recommendation, purchase history-based recommendation, market-basket analysis and association rules.
 * //Location-based Data:// Geo-referencing on the spheroid (a haversine sketch follows this list), spatial indexes, working with vector-based map data and GPS location information, spatial data queries and data aggregation, routing, and the notion of proximity by route vs. radius.
 * //Entity Resolution:// Reasoning with partial, incomplete and noisy data about individuals across multiple data sources.
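
Two sketches referenced from the Course 2 list. First, a bare-bones k-means on invented 2-D points for the clustering item; real work would use R, scipy, or Mahout from the tools list:

{{{
import random

def kmeans(points, k, iters=20):
    centers = random.sample(points, k)           # start from k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assignment step
            i = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                            (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):        # update step: new means
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

pts = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
print(kmeans(pts, 2))   # one center near (1.2, 1.2), one near (8.3, 8.8)
}}}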
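
Second, for the location-based topic, great-circle distance via the haversine formula on a spherical Earth (which is exactly the "proximity by radius" notion, as opposed to proximity by route); the coordinates are approximate Seattle/Portland values:

{{{
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Seattle to Portland: roughly 233 km as the crow flies
print(round(haversine_km(47.61, -122.33, 45.52, -122.68), 1))
}}}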


**Course 3: Analysis / Hypothesis Testing / Statistics / Visualization / R**

 * //Sampling and Experimental Design:// Designing experiments using basic inferential statistics (above), randomized sampling and A/B testing (a two-proportion z-test sketch follows this list).
 * //Problem Formulation and Definition:// Define the animal you would like to hunt and the animal you are in fact hunting: know what you are shooting at before you shoot. This is less about stats and more about defining the distribution and understanding the problem type -- preferably in a machine learning context.
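
A sketch of the A/B-testing item above: a two-proportion z-test comparing the conversion rates of two variants, using scipy; all counts are made up:

{{{
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 200, 5000   # conversions and visitors, variant A
conv_b, n_b = 260, 5000   # conversions and visitors, variant B

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled rate under H0
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                  # two-sided test
print("z = %.2f, p = %.4f" % (z, p_value))            # z ~ 2.86, p ~ 0.004
}}}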

=1B. Course Teams=

 * Course 1: Buck, Bill, Jeremiah
 * Course 2: Vitor, Olly, Darren, Mike
 * Course 3: Olly, Mike, Jon, Nathan

=2. Tools Brainstorm=

 * R - @http://www.r-project.org/
 * Python - @http://www.python.org/
 * numpy - @http://numpy.scipy.org/
 * scipy - @http://www.scipy.org/
 * matplotlib
 * Matlab - @http://www.mathworks.com/
 * Hadoop
 * Solr
 * SQL
 * SPSS
 * SAS
 * NLP tools (please elaborate)
 * Hive
 * Pig
 * HBase
 * Cassandra
 * MongoDB
 * Gnuplot
 * Mechanical Turk
 * Mahout
 * Ganglia monitoring system
 * Agreement on NOT using Excel

=3. Questions Brainstorm=

 * 1) What are some of the primary skills you see missing or in need of greater depth in the analysis of data at your organization?
 * 2) Are you seeing a particular technology or tool becoming prominent in data analysis?
 * 3) Are you finding that you need folks who know more about storing and managing data?
 * 4) What is the title you give to the roles of people who analyze data at your organization?
 * 5) Does your organization have an Enterprise Data Warehouse?
 * 6) What tools do people use to visualize and present findings of data analysis in your organization?
 * 7) Do you leverage data mining and predictive modeling in your organization?
 * 8) Do you deal with real-time data or streaming data in your organization?
 * 9) Do you perform A/B testing or other types of randomized experimentation to improve your product, user experience, or operations?

Interview with DJ Patil on what makes a good data scientist: http://www.youtube.com/watch?feature=player_embedded&v=UqTcKpk-X_E

=4. Target Organizations=

 * Boeing - Buck has questions out at this time
 * Microsoft - Olly has questions out
 * Starbucks
 * Nike
 * Amazon - Olly has questions out. Darren spoke to Principal Quantitative Engineer John Rauser, who echoed the points he made at http://www.forbes.com/sites/danwoods/2011/10/07/amazons-john-rauser-on-what-is-a-data-scientist/.
 * LinkedIn - Darren spoke to Daniel Tunkelang. Emphasis was on the data scientist as a discoverer of new data-driven products that no non-data-driven product manager could conceive of.
 * Zynga
 * Google - Olly has questions out
 * RichRelevance
 * AdReady - Olly has questions out
 * INRIX - Olly has questions out
 * Decide.com - Olly has questions out
 * Visible Technologies - Olly has questions out
 * Predixion Software - Olly has questions out
 * Globys - Olly has questions out
 * Intelius
 * Facebook
 * Twitter
 * AllRecipes.com - Jeremiah has questions out
 * GovernmentJobs.com - Jeremiah has questions out

=5. Resources=

 * NPR series on Big Data
 * Data Science Kit from O'Reilly: @http://shop.oreilly.com/category/deals/data-science-kit.do
 * Data science competitions!
 * Data science introduction course at Berkeley
 * Data Science Summer Institute (topics, courses, tutorials, etc.)
 * A job description to consider for program prerequisites or outcomes
 * A stats course at MIT we could consider (free)
 * Useful blog pages: "List of Known Scalable Architecture Templates", "Professional NoSQL"

=6. Board Members=

 * Erik Bansleben, Director, Academic Programs, UW Professional & Continuing Education, ebansleben@pce.uw.edu
 * Roger Barga, Architect, Microsoft
 * Vitor Carvalho, Principal Scientist and Data Engineering Manager at Intelius
 * Olly Downs, Chief Scientist, Analytical Insights, Inc., Consulting Chief Scientist, AdReady, Inc. odowns@analyticalinsights.com
 * Bill Howe, Senior Scientist, eScience Institute, University of Washington
 * Simon Kahan
 * Nathan Kutz, Professor and Chair of Applied Mathematics, kutz@uw.edu
 * Mike Lazarus, Vice President of Analytics, Atigeo, mikelazarus@yahoo.com
 * Bill McNeill, Software Engineer at Intelius, billmcn@gmail.com
 * Jeremiah Peschka, Managing Director, Brent Ozar PLF, LLC, jeremiah@brentozar.com - http://facility9.com http://brentozar.com
 * Darren Vengroff, Chief Scientist, RichRelevance, vengroff@richrelevance.com
 * Jon Wakefield, Professor of Statistics and Professor of Biostatistics, UW, jonno@uw.edu
 * Buck Woody, Senior Technical Specialist, Microsoft, woodyb@hotmail.com - @http://buckwoody.com
 * Paul Brown, President, Multifarious, Inc., prb@mult.ifario.us

=7. Stat Comments from John Helm=

The overall curriculum also looks good, but I must gently remonstrate about what I perceive the approach to statistics to be. Big data is messy, and the assumptions of classical statistics are not well satisfied. Please accept my urging to consider:

 * Explicitly deriving classical statistics from Bayes' Theorem so as to lay bare the assumptions that are at play when doing least squares.
 * If classical statistics are to be emphasized, spend some time on advanced techniques for assumption checking. At a minimum, cover the Anderson-Darling statistic (and the Cramér-von Mises statistic) for testing the hypothesis that a data set (e.g., residuals) is normally distributed -- Kolmogorov's D simply is not adequate (a quick scipy sketch follows this list).
 * Find a way to introduce the techniques of "Exploratory Data Analysis" (EDA) as founded and espoused by Tukey and his followers. There is a magnificent EDA program for the Mac called DataDesk. It was developed by Paul Velleman, one of Tukey's students. There are student editions, etc., so it may be worth a look. (Another good program, which is FREE, is DataPlot by NIST.)
 * Consider formally introducing some robust and non-parametric techniques, such as regression with robust norms (e.g., Chebyshev, Huber, and LMS) and sliding-window techniques such as LOESS.
 * Consider covering techniques for model selection. One can do this formally using Bayesian techniques, or use information-theoretic statistics such as the Akaike Information Criterion (AIC) or the Schwarz Bayesian Criterion (SBC).
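
As a quick illustration of the assumption-checking point, the Anderson-Darling normality test is a single scipy call; the "residuals" here are simulated draws from a normal distribution, so the test should fail to reject:

{{{
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 500)        # stand-in for regression residuals

result = anderson(residuals, dist="norm")
print("A^2 =", round(result.statistic, 3))
for crit, sig in zip(result.critical_values, result.significance_level):
    print("reject at %g%% level: %s" % (sig, result.statistic > crit))
}}}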