The term “Big Data” refers to large data sets, structured or unstructured, that can grow so much and so quickly that they become difficult to manage with the database techniques and analysis tools that were standard until recently. Managing, analyzing and using such massive amounts of complex data requires new solutions that go beyond the traditional processes and software tools used in daily operations.
The Specialization in Data Science provides professionals with tools to design, prepare, analyze and manage large volumes of structured and unstructured data. Emphasis is placed on theory, so that graduates have the flexibility to adapt to sudden changes in technology, as well as on case studies and lab practice using commercial and open source software.
Duration: 18 months.
Format: Classroom attendance, twice a week: Fridays from 5:30 PM to 10 PM and Saturdays from 9 AM to 1:30 PM, at 25 de Mayo 444, Autonomous City of Buenos Aires.
Degree: Specialist in Data Science
Accredited under CONEAU No. 424, 27 July 2015.
Ministerial Resolution 1753/16
Doctorate in Computer Science, Universidad de Buenos Aires (UBA).
Professor and researcher in the field of Data Science, especially Data Warehousing and Business Intelligence, the Semantic Web and Geographic Information Systems.
Director of the Center for Information Retrieval, ITBA.
This subject provides the statistical rationale behind smart data analytics. In other words, it focuses not on any particular algorithm but on a conceptual approach. These principles are used by many of the other subjects in the program.
Exploratory Data Analysis (EDA). Dimensionality Reduction: Principal Component Analysis. Simple and Multiple Linear Regression. Logistic Regression. Analysis of Variance (ANOVA). Analysis of Survey Data. ROC Curves, Gains. Bayesian Networks. Introduction to Time Series Analysis: ARIMA (Autoregressive Integrated Moving Average), ARCH (Autoregressive Conditional Heteroskedasticity), GARCH (Generalized Autoregressive Conditional Heteroskedasticity) models.
This subject will deal with basic data mining techniques and algorithms, putting special emphasis on regression, association analysis and clustering. We will start from classic techniques and discuss the new challenges brought by Big Data.
Basic Concepts of Data Mining. Descriptive and Predictive Models. Basic Techniques: Association Rules, Classification, Clustering, Patterns. Decision Trees. Application to Specific Forecasting Problems. Case Studies. KPIs (Key Performance Indicators). Dashboards. Commercial and Open Source Tools.
Students will learn about the architecture of data warehouses (DW), their conceptual, logical and physical design, and how to use them with Online Analytical Processing (OLAP) tools, data mining, dashboards, etc. Special attention is given to DW in relation to the three problems posed by Big Data: Volume, Velocity and Variety. To that end, graph databases (variety) and real-time DW (velocity) will be studied.
Architecture. Conceptual, Logical and Physical Design. Multidimensional Schema: Star, Snowflake and Constellation. Slowly Changing Dimensions. Physical Design. Online Analytical Processing: OLAP. OLAP vs OLTP. Query Languages: Basic and Advanced MDX. Advanced Environments for OLAP. Commercial and Open Source Tools. OLAP on Big Data: Real-Time Analysis, Graph Databases.
This subject focuses on the tools popularly linked to Big Data: Hadoop and MapReduce, and massively parallel processing architectures using commodity clusters. Students will thus be exposed to a real Big Data environment in terms of hardware and software.
Introduction to Distributed Systems. Models. CAP Theorem. Clusters for Massively Parallel Processing (MPP). Virtualization of Clusters and Data Centers. Cloud Architecture. Big Data Principles: Velocity, Volume, Variety, Veracity. What is “Big Data” and What is Not. Structured and Unstructured Data. NoSQL Databases: MongoDB. The MapReduce Paradigm. Hadoop File System. YARN: Hadoop Evolution. Architecture, Components. Columnar Databases: Apache Cassandra, HBase. Key-value stores: Amazon DynamoDB, Redis. High Level Languages: HiveQL and Pig Latin. Data Analysis with Hadoop and Hive. Apache Spark. Programming with Spark. Streaming, Tweet Collection and e-data in real time, machine learning using Spark.
Information visualization is an important component of data analysis. This subject presents the theoretical bases of data visualization (for instance, how to visualize high-dimensional data) and the practical tools to implement them.
Introduction, Definitions, Historic Background, Outstanding Charts. Principles of Graphical Excellence. Observations and Variables. Types of Variables. Visualization of Tables, Hierarchies and Networks. Use of Color. Efficient Representation of Information, Summarization and Visualization of Large Data Sets. Practices with d3js, jit, Processing, Google Visualization API, Tableau, Fusion Tables and QGIS.
By taking this subject, students will gain general and practical knowledge of state-of-the-art Machine Learning that they can then apply in their professional lives, particularly in a Big Data environment. Graduates will be acquainted with the leading machine learning algorithms and models and will be able to define methods and tests to select the appropriate model for the cases they encounter in the real world. This subject supplements the models and algorithms studied under “Data Mining”.
Machine Learning Basic Concepts. Inferences. Version Space. Learning as Heuristic Search. General Concepts for Bias and Pruning. Decision Trees. Extension to Basic Algorithms and Implementation Problems. Rule Generation. Bayesian Learning. Agglomerative and Partitioning Algorithms. K-Means, SVM. Descriptive and Discriminating Features. Overview of Other Models (Genetic, Neural Networks, etc.). Big Data Applications.
The Extract, Transform and Load (ETL) process in a data warehouse (DW) is key to any project and accounts for 80% of the project budget. It becomes even more critical in a Big Data environment because, in addition to large data sets, there is a need for quasi-real-time analysis (due to the velocity at which data arrive) and for a wide range of collection and acquisition processes (due to the variety of data, most of which come from the web). This subject discusses these processes with a strong focus on their practical application.
Extract, Transform and Load (ETL) Process. Conceptual Design. Use of BPMN Techniques. Application. Commercial Tools (MS Integration Services) and Open Source Tools (Pentaho Kettle). ETL to support real-time OLAP and DW. Use of Hadoop/MapReduce in the ETL process. ETL vs ELT. Examples and ETL programming.
Geographic Information Systems (GIS) and scientific applications are, together with social media, the largest sources of Big Data, and they require special treatment to be managed, integrated with other types of data and queried. This subject discusses these problems.
Geographic Information Systems (GIS): Discrete and Continuous Models (Continuous Fields). OLAP over GIS. Analysis of Mobile Objects Paths: Patterns. Ontologies. Analysis of Biological, Astronomical and Chemical Data. Microarray Analysis.
This seminar runs for one full-time week and is given by visiting professors with recognized experience in the field. The intention is to share the views of other specialists, promote exchanges with other institutions and present students with possible topics for their final paper.
During the seminar, students will plan their Final Integrated Assignment. The workshop is intended to foster critical thinking. By the end of it, students will have acquired basic knowledge of scientific methods and methodological techniques; learned how to conduct research and the stages involved; become familiar with the various types of research, the different data collection instruments and the pros and cons of each; become able to perform a methodological analysis of research papers; learned the key concepts needed to take part in a research project and prepare their Final Integrated Assignment; and acquired the tools necessary to write a final report.
The final assignment consists of an individual project related to one of the applicable areas (GIS, biology, etc.).
Students will make a proposal, and the Program Director and the Academic Committee will appoint a tutor chosen from the group of tutors mentioned above. The Program Director, together with the Academic Committee, will issue a formal acceptance of the topic selected for the final assignment. The assignment shall be submitted no later than 12 months after the date on which the last module was passed. The Final Integrated Assignment will be evaluated by professionals appointed by the Director on the basis of their teaching and professional experience relevant to the topic chosen by the student. They shall issue a written, substantiated opinion no later than sixty (60) days after receiving the paper, following the guidelines provided by the Program Director with the agreement of the Academic Committee.
* Dr. Delrieux, Claudio (Universidad Nacional del Sur)
* Dr. Yankilevich, Daniel (Pragma Consultores)
* Dr. Romero, Oscar (Universidad Politécnica de Cataluña, Spain)
Licentiate in Computer Science, Department of Computer Science, FCEyN, UBA. 25 years of teaching experience (Artificial Intelligence, Expert Systems, Data Mining and Information Visualization) and 10 years as an independent consultant in the following areas: Spatial Databases, GIS, Data Quality, InfoVis and BI. Has also led GIS development in various organizations.
MBA and Licentiate in Computer Science, São Paulo, Brazil; Chief Technology Officer; Endeavor Entrepreneur. Specialist in Big Data, Machine Learning and DevOps. Over the last 10 years, developed Analytics and Big Data solutions as CTO and as a consultant to large companies, researching and implementing batch, stream processing and machine learning platforms with Spark, Storm, Cassandra, MongoDB, ElasticSearch, Kafka, Postgres and others, in a variety of cloud environments.
Master's degree in Business Administration, Universidad del CEMA; Computer Engineer, Universidad de la República, Uruguay; Licentiate in Computer Science, ESLAI (Escuela Latinoamericana de Informática).
Doctorate in Computer Science, UBA. Licentiate in Mathematical Sciences, UBA. Full-time Full Professor in the Department of Computer Engineering, ITBA.
Doctorate in Computer Engineering, ITBA. Full Professor of Databases and Director of the Center for Human-Device Interaction and Usability, ITBA. Specialist in Data Warehousing, OLAP and Geographic Information Systems.
Doctorate in Computer Languages and Systems, Universidad Rey Juan Carlos (Madrid). Master's degree in Software Engineering, Universidad Politécnica de Madrid. Diploma of Advanced Studies from the doctoral program in Anthropology, Universidad Complutense de Madrid. Licentiate in Sociology, UBA.
Computer Engineer. Has worked on Big Data projects at Google (through Globant), Despegar.com and Socialmetrix using tools such as Sqoop, Pig, HBase, Hive, Oozie, Spark and Cassandra. Taught undergraduate courses at ITBA (web application development). Currently developing the data processing infrastructure for real-time bidding on ads and event tracking for mobile applications at Jampp. Has taught in ITBA's Diploma in Big Data since April 2015.
Doctorate in Computer Science, Universidad de Buenos Aires, Argentina.
Has 15 years of experience in companies across various industries (IBM Information Architect, Deloitte Pre-Sales, IBM SPSS Sales Executive), carrying out consulting projects in business, information technology and AI applied to business: Predictive Analysis, Business Analytics, Data Mining and Information Management. Specialist in Artificial Intelligence and Robotics, and has spoken at TEDxRosario (Argentina, 2012), TEDxBarcelona (Spain, 2013), Makers of Barcelona 2013, Campus Party México 2013 and AXIS México 2014, among other events.