Course 1: Data Extraction : Flume

Course 2: Data Extraction : Sqoop

Course 3: Data Processing : Hadoop

Course 4: Data Processing : MapR

Course 4: Data Processing : Pig and Hive


This course will provide students with an introduction to Apache Pig. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for writing data analysis programs, coupled with an infrastructure for executing these programs. Pig, in conjunction with Apache Hadoop, can handle very large data sets with relatively simple programs.


Installing and Running Pig
The Pig Data Model
Basic Pig Latin
Advanced Pig Latin
Developing and Testing Scripts
Tuning Pig
Embedding Pig Latin in Python
Writing Evaluation and Filter Functions
Writing Load and Store Functions
Pig and the Rest of the Hadoop Zoo
Built-in User Defined Functions and Piggybank


Students should have some structured programming experience. Familiarity with SQL databases is helpful but not required.

2 Days

Course Outline

1. Introduction
Overview of Hadoop
Hadoop Distributed File System
What Is Pig?
Why Use Pig?
2. Installing and Running Pig
Downloading and Installing Pig
Running Pig
3. Grunt
Interpreting Pig Latin Scripts
HDFS Commands
Controlling Pig
4. The Pig Data Model
Data Types
5. Basic Pig Latin
Input and Output
Relational Operations
User Defined Functions
6. Advanced Pig Latin
Advanced Relational Operations
Using Pig with Legacy Code
Integrating Pig and MapReduce
Nonlinear Data Flows
Controlling Execution
Pig Latin Preprocessor
7. Developing and Testing Scripts
Development Tools
Testing Your Scripts with PigUnit
8. Tuning Pig
Improving Script Performance
Improving Performance with User Defined Functions
Using Compression in Intermediate Results
Data Layout Optimization
Handling Bad Records
9. Embedding Pig Latin in Python
Utility Methods
10. Writing Evaluation and Filter Functions
Writing an Evaluation Function in Java
Algebraic Interface
Accumulator Interface
Python UDFs
Writing Filter Functions
11. Writing Load and Store Functions
Load Functions
Store Functions
12. Pig and the Rest of the Hadoop Zoo
Pig and Hive
NoSQL Databases
Metadata in Hadoop
13. Built-in User Defined Functions and Piggybank
Built-in UDFs

Course 5: Data Processing : Greenplum

Course 6: Data Processing : Vertica

Course 7: Data Processing : Lucene Solr

Course 8: Data Storage : HDFS

Course 9: Data Storage : NoSQL - DataStax

Course 9: Data Storage : NoSQL - Cassandra
The Apache Cassandra course at Spectramind starts with the fundamental concepts of using a highly scalable, column-oriented database to implement appropriate use cases. It covers topics such as Cassandra data models, Cassandra architecture, and the differences between RDBMS and Cassandra, to name a few. The course includes many challenging, practical, and focused hands-on exercises for the learners.
Course Objectives

After the completion of the 'Apache Cassandra' course at Spectramind, you should be able to:

1. Understand Cassandra and NoSQL domain.

2. Create Cassandra clusters for different kinds of applications.

3. Understand Apache Cassandra Architecture.

4. Design and model Applications for Cassandra.

5. Port an existing application from RDBMS to Cassandra.

6. Learn to use Cassandra with various programming languages.
Who should go for this course?

A developer working with large-scale, high-volume websites.

An application architect or data architect who needs to understand the available options for high-performance, decentralized, elastic data stores

A database administrator or database developer currently working with standard relational database systems who needs to understand how to implement a fault-tolerant, eventually consistent data store

A manager who wants to understand the advantages (and disadvantages) of Cassandra and related columnar databases to help make decisions about technology strategy

A student, analyst, or researcher who is designing a project related to Cassandra or other non-relational data store options.

This course assumes no prior knowledge of Apache Cassandra or any other NoSQL database. Some familiarity with the Linux command line is essential; only minimal exposure to Java, database, or data-warehouse concepts is required.
Why Learn Cassandra?
Apache Cassandra™, an Apache Software Foundation project, is an open-source NoSQL distributed database management system. Apache Cassandra was originally developed at Facebook and is used by many companies today. While many developers have embraced simpler NoSQL variants (like MongoDB and CouchDB), Cassandra is arguably at the forefront of NoSQL innovation, providing a level of reliability and fine-tuning not found in many competitors' offerings. When it comes to scaling, little else compares; the biggest example is Facebook, which uses Cassandra to store petabytes of data.

Cassandra is designed to handle workloads across multiple data centers with no single point of failure, providing enterprises with extremely high database performance and availability.

Some of the world's largest websites run on Cassandra.

Every day, hundreds of start-ups and large product companies choose Cassandra for their next-generation computing and data platforms. Companies using Cassandra include Facebook, Twitter, IBM, Cisco, Rackspace, Netflix, eBay, Reddit, @WalmartLabs, Zoho, and Digg.

Apache Cassandra is open-source. This means you can dive deep into its source code and change it according to your own requirements.

The job market for Apache Cassandra is at its peak and growing at a rate of 300%!

Module 1
Learning Objectives - After this module students will be able to:
Explain the differences between NoSQL and RDBMS databases, Explain what the various NoSQL databases are, Explain the various Cassandra features, Explain why Cassandra scores over other NoSQL databases, Distinguish between use cases when Cassandra is a strong choice and when it is not, Understand the use cases where Cassandra is implemented.
Topics - Quick Review of RDBMS:
Transactions, ACIDity, Schema, Two-Phase Commit, Sharding and Shared-Nothing Architecture, Feature Based, Key Based, Lookup Table Based, NoSQL Databases, Brewer's CAP Theorem, Cassandra Definition and Features, Distributed and Decentralised, Elastic Scalability, High Availability and Fault Tolerance, Tunable Consistency, Strict Consistency, Causal Consistency, Weak (Eventual) Consistency, Column Orientation, Schema Free, High Performance, Use Cases for Cassandra, Cassandra Installation.

Module 2
Learning Objectives - After this module students will be able to:
Run basic Cassandra commands, Understand design differences between the RDBMS and Cassandra data models, Describe what a Cassandra cluster is, Describe what a Keyspace is, how it relates to a Cluster and what is stored in the Keyspace, Explain what a Column Family is, Explain the primary key and its uses, Explain the parts of the compound primary key, Explain what a partition key is, Explain how data is stored in a partition, Explain how clustering columns ensure that the stored data will be clustered in a partition, Explain secondary indexes and their implications, Explain how Cassandra locates data in the cluster, Explain expiring columns and Time to Live (TTL).
Topics - Installing Cassandra, Running the Command-Line Client Interface, Basic CLI Commands, Help, Connecting to a Server, Describing the Environment, Creating a Keyspace and Column Family, Writing and Reading Data, The Relational Data Model, Simple Introduction, Cluster, Keyspaces, Column Families, Column Family Options, Columns, Wide Rows, Skinny Rows, Column Sorting, Super Columns, Composite Keys, Design Differences between RDBMS and Cassandra, Query Language, Referential Integrity, Secondary Indexes, Sorting, Denormalisation, Design Patterns, Materialized Views.

Module 3
Learning Objectives - After this module students will be able to:
Explain what happens during the read and write operations, Explain how Cassandra accomplishes some of its basic notable aspects, such as durability and high availability. Understand more complex inner workings, such as the gossip protocol, hinted handoffs, read repairs, Merkle trees etc, Understand Staged Event-Driven Architecture (SEDA).
Topics - System Keyspace, Peer-To-Peer, Gossip and Failure Detection, Anti-Entropy and Read Repair, Memtables, SSTables, and Commit Logs, Hinted Handoff, Compaction, Bloom Filters, Tombstones, Staged Event-Driven Architecture (SEDA), Read, Mutation, Gossip, Response, Anti-Entropy, Load Balance, Migration, Streaming, Managers and Services, Cassandra Daemon, Storage Service, Messaging Service, Hinted Handoff Manager.

Module 4
Learning Objectives - After this module students will be able to:
Analyze the requirements for a Cassandra use case and apply data modeling techniques, Identify the challenges faced by RDBMS, Identify the design considerations for the Cassandra data model, Understand how data modeling differs in Cassandra from traditional relational databases, Understand how to de-normalize RDBMS data, Demonstrate how queries are used to design the Cassandra data model, Demonstrate the ability to apply data modeling concepts to various exercises given during the class, Understand the implications of client-side joins when writing applications that access data in Cassandra, Insert data, perform batch updates and search column families.
Topics - Database Design, Sample Application RDBMS Design, Sample Application Cassandra Design, Application Code, Creating Database, Loading Schema, Data Structures, Setting Connections, Population of database, Application Features.

Module 5
Learning Objectives - After this module students will be able to:
Understand what Replicas are, Understand various Replica Placement Strategies, Understand Partitions, Understand Snitches, Create Clusters, Understand Dynamic Ring Participation, Understand Security within Cassandra, Understand Miscellaneous Settings and various additional tools in Cassandra, Understand basic Read and Write Properties, Understand what Slice Predicates are.
Topics - Keyspaces, Replicas, Replica Placement Strategy, Replication Factor, Partitioner, Snitches, Creating Clusters, Dynamic Ring Participation, Security, Miscellaneous Settings, Additional Tools, Query differences between RDBMS and Cassandra, Basic Write Properties, Consistency Level, Basic Read Properties, APIs, Set Up and Inserting Data, Slice Predicate, Get Range Slices, Multiget Slice, Deleting, Programmatically Defining Keyspaces and Column Families.

Module 6
Learning Objectives - After this module students will be able to:
Understand what Hadoop is and how it is used, Describe the Cassandra File System, Start working with MapReduce, Understand tools built on MapReduce such as Pig and Hive and how they work with Cassandra, Understand Cluster Configuration, Understand live use cases.
Topics - Hadoop, MapReduce, Cassandra Hadoop Source Package, Outputting Data to Cassandra, Pig, Hive, Use Cases.

Module 7
Learning Objectives - After this module students will be able to:
Perform Data Definition Language (DDL) Statements within Cassandra, Perform Data Manipulation Language (DML) Statements within Cassandra, Create and modify Users and User permission within Cassandra, Capture CQL output to a file, Import and export data with CQL, Execute CQL scripts from within CQL and from the command prompt.
Topics - Data Definition Language (DDL) Statements, Data Manipulation Language (DML) Statements, Create and modify Users, User permissions, Capture CQL output to a file, Import and export data, CQL scripts from within CQL, CQL scripts from the command prompt.

Module 8
Learning Objectives - After this module students will be able to:
Understand what Thrift is, Understand Cassandra web console, Demonstrate ability to implement the concepts learnt during the course on a real life problem.
Topics - Basic Client API, Thrift, Thrift Support for Java, Exceptions, Thrift Summary, Cassandra Web Console, Hector (Java), Features, Hector API, Live Project.

Course 10: Data Storage : NoSQL - MongoDB

This course will help you master one of the most popular NoSQL databases. It is designed to provide the knowledge and skills needed to become a successful mongoDB® expert. The course covers a range of NoSQL and mongoDB® topics such as CRUD Operations, Schema Design and Data Modelling, Scalability, etc.
Course Objectives

After the completion of the mongoDB® Course at Spectramind, you should be able to:

Gain an insight into the 'Roles' played by a mongoDB® expert.

Learn how to design Schema using Advanced Queries.

Troubleshoot Performance issues

Understand mongoDB® Aggregation framework, and mongoDB® Backup and Recovery options and strategies.

Understand scalability and availability in mongoDB® using the concept of Sharding.

This course helps you become a certified mongoDB® Developer and Operations Professional with the right skills and knowledge needed to develop and run applications on mongoDB®.
Who should go for this course?

If you are a software architect, database professional, IT manager, software developer, student, DBA, or system administrator interested in learning the most popular NoSQL database, this course is for you. Professionals, business managers, and executives interested in understanding how mongoDB® can solve many of their Big Data problems and help achieve their business goals should also take this course. After completing this course you will be able to produce database designs for mongoDB® applications and understand its use in solving your business problems.

There are no prerequisites for attending this mongoDB® course. It will help if you have knowledge of a mainstream programming language such as Java, a basic understanding of database concepts, and knowledge of a text editor such as the 'VI editor' (recommended, not mandatory, as these concepts will also be covered during the course).
Why Learn MongoDB Development and Administration?
mongoDB® is a mature NoSQL database product with ever-growing adoption. Many big enterprise and internet companies such as Cisco, eBay, and Disney are now running large production deployments.

If you’re learning mongoDB® now, you will be learning to use a well established product that has industry validation and similar functionality to many RDBMS systems you’ve encountered before.

With its increased adoption, mongoDB® has enabled developers to build new types of applications for cloud, mobile and social technologies. This makes mongoDB® developers an invaluable resource for companies looking to innovate in each of these areas.

mongoDB® is the most widely adopted NoSQL technology; mongoDB® skills are in high demand from businesses; and, most importantly, your peers are learning mongoDB® to stay relevant.

If you work at a large engineering company, it’s likely that some new projects for social communications, advanced analytics products, content management or archiving could use a mongoDB® backend. With the right expertise, you can position yourself to lead the project.

Employers are looking for bright professionals who stay up-to-speed on new technologies. But even if you’re not looking for a new position, learning mongoDB® can place you in line to lead a new project or oversee a large database migration.

So Get Ahead and learn mongoDB® now to stay relevant in the industry.

Module 1
Design Goals, Architecture and Installation
Learning Objectives:
In this module, you will get an understanding of Design Goals and Architecture of mongoDB. This Module will also cover installation of mongoDB and associated tools.
Overview of MongoDB, Design Goals for MongoDB Server & Databases, MongoDB Tools, Collections and Documents, Introduction to JSON and BSON, Installation of Tools, Bottle, PyMongo, Introduction to MongoDB, Installing MongoDB on Windows, Linux, Mac OS, etc., Project: Problem Statement.

Module 2
CRUD Operations
Learning Objectives:
In this module, you will get an understanding of CRUD Operations and their functional usage.
Introduction, Read and Write Operations, MongoDB CRUD Tutorials, MongoDB CRUD Reference.

Module 3
Schema Design and Data Modelling
Learning Objectives:
In this module, you will learn Schema Design and Data Modelling in mongoDB.
Concept, Examples and Patterns, Model Relationships Between Documents, Model Tree Structures, Model Specific Application Contexts, Data Model Reference.

Module 4
Learning Objectives:
In this module you will learn MongoDB Administration activities such as Backup, Recovery, Data Import/Export, Performance tuning etc.
Operational Strategies, Data Management, Optimization Strategies for MongoDB, Backup and Recovery, Configuration and Maintenance.

Module 5
Scalability and Availability
Learning Objectives:
In this module, you will understand the setup and configuration of mongoDB High Availability, Disaster Recovery, and Load Balancing.
Replication: Introduction, Concepts, Replica Set
Sharding: Introduction, Concepts, Sharded Cluster.

Module 6
Indexing and Aggregation Framework
Learning Objectives:
In this module, you will learn the Indexing and Aggregation Framework in mongoDB.
Indexes: Introduction, Concepts, Index Types, Index Properties, Index Creation
Aggregation: Introduction, Concepts, Map-Reduce, Aggregation Examples.

Module 7
Application Engineering and MongoDB Tools
Learning Objectives:
In this module, you will learn mongoDB tools to develop and deploy your applications. This module will also help you understand the multiple package components.
API Documentation for MongoDB drivers, mongo Shell, MMS (MongoDB Monitoring Service), MongoDB + Hadoop Connector, MongoDB Package Components.

Module 8
Project, Additional Concepts and Case Studies
Learning Objectives:
In this module, you will understand how multiple mongoDB components work together in a mongoDB project implementation. We will also discuss multiple case studies and specifications of the course project.
Advance Security, MongoDB Methods, Course Project, Case Studies.

Course 11: Data Storage : NoSQL - Oracle

Course 13: Data Analytics :Mahout

This course will introduce you to the fundamentals of machine learning, and where Mahout fits in the Hadoop ecosystem. The course will provide a blend of Machine Learning Techniques, recommendation system, and Mahout on Hadoop and Amazon EMR.
Course Objectives

After the completion of Apache Mahout Course at Spectramind, you should be able to:

Gain an insight into Machine Learning techniques.
Understand various Machine Learning techniques and how to implement them using Apache Mahout.
Understand the recommendation system.
Learn Collaborative Filtering, Clustering and Categorization.
Get an overview of the recommendation platform.
Analyse Big Data using Hadoop and Mahout.
Implement a recommender using MapReduce.
Who should go for this course?

This course is designed for all those who are interested in learning Big Data technologies and write intelligent applications using Apache Mahout.

Some of the prerequisites for learning Apache Mahout are familiarity with the Hadoop framework and other ecosystem components. A mathematical background with beginner-level Java development knowledge will also be an added advantage. Basic Java and Hadoop knowledge is recommended but not mandatory, as these concepts will also be covered during the course.
Why Learn Machine Learning with Mahout?
In the modern information age of exponential data growth, the success of companies and enterprises depends on how quickly and efficiently they turn vast amounts of data into actionable information. Whether it's for processing hundreds or thousands of personal e-mail messages a day or deriving user intent from petabytes of weblogs, the need for tools that can organize and enhance data has never been greater. Therein lies the premise and the promise of the field of machine learning and Apache Mahout.

Module 1
Introduction to Machine Learning and Apache Mahout
Learning Objectives - This module will give you an insight into what 'Machine Learning' is and how Apache Mahout algorithms are used in building intelligent applications.
Topics - Machine Learning Fundamentals, Apache Mahout Basics, History of Mahout, Supervised and Unsupervised Learning techniques, Mahout and Hadoop, Introduction to Clustering, Classification.

Module 2
Mahout and Hadoop
Learning Objectives - In this module you will learn how to set up Mahout on Apache Hadoop. You will also get an understanding of Myrrix Machine Learning Platform.
Topics - Mahout on Apache Hadoop setup, Mahout and Myrrix.

Module 3
Recommendation Engine
Learning Objectives - In this module you will get an understanding of the recommendation system in Mahout and different filtering methods.
Topics - Recommendations using Mahout, Introduction to Recommendation systems, Content Based (Collaborative filtering, User based, Nearest N Users, Threshold, Item based), Mahout Optimizations.

Module 4
Implementing a recommender and recommendation platform
Learning Objectives - In this module you will learn about the Recommendation platforms and implement a Recommender using MapReduce.
Topics - User based recommendation, User Neighbourhood, Item based Recommendation, Implementing a Recommender using MapReduce, Platforms: Similarity Measures, Manhattan Distance, Euclidean Distance, Cosine Similarity, Pearson's Correlation Similarity, Log-likelihood Similarity, Tanimoto, Evaluating Recommendation Engines (Online and Offline), Recommenders in Production.
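
The similarity measures listed above are easy to state concretely. As an illustrative sketch (plain Python of our own devising, not Mahout's implementation; in Mahout these are provided by similarity classes in the recommender API), here are Euclidean, cosine, and Pearson similarity between two users' rating vectors:

```python
import math

def euclidean_similarity(a, b):
    # Euclidean distance turned into a similarity score in (0, 1].
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

def cosine_similarity(a, b):
    # Cosine of the angle between the two rating vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pearson_similarity(a, b):
    # Pearson correlation: cosine similarity of the mean-centred vectors.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Two users' ratings for the same five items (sample data):
u1 = [5.0, 3.0, 4.0, 4.0, 1.0]
u2 = [4.0, 3.0, 5.0, 3.0, 1.0]
```

Pearson correlation ignores differences in rating scale between users, which is why it is a common default for user-based neighbourhoods.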

Module 5
Learning Objectives - This module will help you in understanding 'Clustering' in Mahout and also give an overview of common Clustering Algorithms.
Topics - Clustering, Common Clustering Algorithms, K-means, Canopy Clustering, Fuzzy K-means, Mean Shift, etc., Representing Data, Feature Selection, Vectorization, Representing Vectors, Clustering documents through example, TF-IDF, Implementing clustering in Hadoop, Classification.

Module 6
Learning Objectives - In this module you will get a clear understanding of Classifier and the common Classifier Algorithms.
Topics - Examples, Basics, Predictor variables and Target variables, Common Algorithms, SGD, SVM, Naive Bayes, Random Forests, Training and evaluating a Classifier, Developing a Classifier.

Module 7
Mahout and Amazon EMR
Learning Objectives - At the end of this module, you will get an understanding of how Mahout can be used on Amazon EMR Hadoop distribution.
Topics - Mahout on Amazon EMR, Mahout vs R, Introduction to tools like Weka, Octave, Matlab, SAS.

Module 8
Project Discussion
Learning Objectives - In this module you will develop an intelligent application using Mahout on Hadoop.
Topics - A complete recommendation engine built on application logs and transactions.

Course 14: Data Analytics :NLTK

Course 15: Data Analytics :Open NLP

Course 16: Data Analytics :UIMA

Course 17: Data Visualization : Intellicus

Course 18: Data Visualization : Tableau

Tableau Course Description

This course is designed for the beginner to intermediate-level Tableau user. It is for anyone who works with data – regardless of technical or analytical background. This course is designed to help you understand the important concepts and techniques used in Tableau to move from simple to complex visualizations and learn how to combine them in interactive dashboards.

Course Objective Summary
• Understand the many options for connecting to data
• Understand the Tableau interface / paradigm – components, shelves, data elements, and terminology
• Use this knowledge to effectively create powerful visualizations
• Create basic calculations including string manipulation, basic arithmetic calculations, custom aggregations and ratios, date math, logic statements and quick table calculations
• Represent your data using the following visualization types:
• Cross Tab
• Geographic Map
• Page Trails
• Heat Map
• Density Chart
• Scatter Plots
• Pie Chart and Bar Charts
• Small Multiples
• Dual Axis and Combo Charts with different mark types
• Options for drill down and drill across
• Use Trend Lines, Reference Lines and statistical techniques to describe your data
• Understand how to use group, bin, hierarchy, sort, set and filter options effectively
• Work with the many formatting options to fine tune the presentation of your visualizations
• Understand how and when to use Measure Names and Measure Values
• Understand how to deal with data changes in your data source such as field addition, deletion or name change
• Understand all of your options for sharing your visualizations with others
• Combine your visualizations into Interactive Dashboards and publish them to the web

Course Content:

1. Introduction and Overview

• Why Tableau? Why Visualization?
• Level Setting – Terminology
• Getting Started – creating some powerful visualizations quickly
• The Tableau Product Line
• Things you should know about Tableau

2. Getting Started

• Connecting to Data and introduction to data source concept
• Working with data files versus database server
• Understanding the Tableau workspace
• Dimensions and Measures
• Using Show Me!
• Tour of Shelves (How shelves and marks work)
• Building Basic Views
• Help Menu and Samples
• Saving and sharing your work

3. Analysis

• Creating Views
• Marks
• Size and Transparency
• Highlighting
• Working with Dates
• Date aggregations and date parts
• Discrete versus Continuous
• Dual Axis / Multiple Measures
• Combo Charts with different mark types
• Geographic Map
• Page Trails
• Heat Map
• Density Chart
• Scatter Plots
• Pie Charts and Bar Charts
• Small Multiples
• Working with aggregate versus disaggregate data
• Analyzing, Sorting & Grouping
• Aliases
• Filtering and Quick Filters
• Cross-Tabs (Pivot Tables)
• Totals and Subtotals
• Drilling and Drill Through
• Aggregation and Disaggregation
• Percent of Total
• Working with Statistics and Trend lines
4. Getting Started with Calculated Fields
• Working with String Functions
• Basic Arithmetic Calculations
• Date Math
• Working with Totals
• Custom Aggregations
• Logic Statements
5. Formatting
• Options in Formatting your Visualization
• Working with Labels and Annotations
• Effective Use of Titles and Captions
• Introduction to Visual Best Practices
6. Building Interactive Dashboards
• Combining multiple visualizations into a dashboard
• Making your worksheet interactive by using actions and filters
• An Introduction to Best Practices in Visualization
7. Sharing Workbooks
• Publish to Reader
• Packaged Workbooks
• Publish to Office
• Publish to PDF
• Publish to Tableau Server and Sharing over the Web
8. Putting it all together
• Scenario-based Review Exercises
• Best Practices

Course 19: Data Visualization : MicroStrategy

Course 20: Data Analysis

  • Introduction to Data Science
  • What is Data Science
  • Disciplines that make up Data Science
  • What does a data scientist do with the data?
  • Data Science Applications (Churn Analysis, Segmentation and Profiling, Recommendations, etc.)
  • Understanding Data
  • Understanding Data Types
  • Importance and utility of big data (unstructured data)
  • Qualitative and Quantitative Data
  • Working with R
  • Basic Data Types
  • Vector
  • Matrix
  • List
  • Data Frame
  • Data Import and export
  • Control Structures
  • Some important R Packages
  • Data Munging
  • Data Pre-processing
  • Handling variety of file formats
  • Handling missing values
  • Type Conversion
  • Data Wrangling
  • R's apply and plyr Packages
  • EDA (Real data Sets - Concluding without Hypothesis testing)
  • Exploratory Data Analysis and Statistical Graphs/Charts
  • Univariate Analysis
  • Central Tendencies : Mean, Median, Mode
  • Dispersion : Range, Variance, Standard Deviation
  • Other Measures : Quartile and Percentile, Interquartile Range
  • EDA (Additional Characteristics of Data)
  • Skew, Kurtosis and Moments
  • Relationship between attributes : Covariance, Correlation Coefficient
  • Unreasonable effectiveness of Data
  • Anscombe's quartet

Course 21: Probability and Statistics

  • Probability Essentials
  • Probability Rules
  • Conditions of Statistical Independence
  • Conditions of Statistical Dependence
  • Probability Distributions
  • Bernoulli
  • Binomial
  • Multinomial
  • Poisson
  • Weibull
  • Geometric
  • Negative Binomial
  • Gamma and Exponential
  • Normal
  • Sampling and Sampling Distributions
  • Estimations and Confidence intervals
  • Central Limit Theorem
  • Hypothesis Testing - 1
  • Concepts of Hypothesis Testing
  • Testing for equality of variances of two samples
  • Comparing the equality of means of two samples
  • Comparing two proportions
  • Correlation between two samples
  • Tests on two-variable contingency tables
  • Business Forecasting
  • Trend analysis and Time Series
  • Cyclical and Seasonal analysis
  • Box-Jenkins method
  • Smoothing and Moving averages
  • Auto-correlation
  • ARIMA and Holt-Winters methods
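
Several of the discrete distributions listed above reduce to one-line probability mass functions. As a small illustrative sketch (pure Python, function names of our own choosing):

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p): k successes in n independent trials.
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam): k events at average rate lam.
    return exp(-lam) * lam**k / factorial(k)

# Probability of exactly 2 heads in 4 fair coin flips: 6/16 = 0.375
p2 = binomial_pmf(2, 4, 0.5)
```

The Bernoulli distribution is the n = 1 special case of the Binomial, and the Poisson arises as its limit for many trials with a small success probability.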

Course 22: Data Visualization

  • Fundamentals of R Graphing Systems
  • Faceting, Geoms, Aesthetics, Layers and Scales
  • Advanced Visualizations - Variety of Histogram/Bar Charts, Tree, Polar Bar graphs, etc.
  • Getting External Data, Data Mashups

Course 23: Big Data : Computing at Scale

Introduction to Big Data, Analytics and Data Science

  • Why does the current landscape exist?
  • Why do RDBMSs exist? Why do we need other data stores?

Basic Parallel/Distributed Computing

  • Basic Computing Architectures

NoSQL Concepts

  • ACID vs. BASE
  • Schema on Read vs. Schema on Write
  • CAP Theorem
  • NoSQL DBs (Key-value, Columnar, Document, Graph)

NoSQL Advanced Concepts

  • Replication and Sharding
  • Neo4j - Cypher
  • MongoDB - CRUD, Indexing, MR Programming
  • HBase

Graph Databases

  • Neo4j - Cypher

Document Databases

  • MongoDB - CRUD, Indexing, MR Programming

Introduction to Hadoop

  • Use Cases and Applications
  • Hadoop Evolution

HDFS Fundamentals

  • Cluster and File System Concepts
  • HDFS Overview and Architecture
  • File Formats and I/O

MapReduce (MR) Programming

  • MapReduce Programming Model
  • Concepts of Functional programming
  • Combiners & Partitioners
  • Example Walkthroughs
  • How to write MR Programs

MR lab on Amazon EMR

  • Streaming APIs (Python)
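
In Hadoop Streaming, the mapper and reducer are separate scripts that read lines from stdin and write tab-separated key/value lines to stdout, with the framework sorting by key between the two phases. A compact single-process sketch of that model (word count; structure and names are illustrative, not an EMR job as-is):

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: streaming hands the reducer key-sorted pairs;
    # groupby then lets us sum the counts for each distinct word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["the quick brown fox", "the lazy dog"])))
# counts["the"] == 2
```

On EMR the same two functions would live in separate mapper and reducer scripts reading sys.stdin, passed to the streaming jar as the job's map and reduce commands.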

Fundamental MapReduce (MR) Algorithms

  • Inverted Index , Page Rank
  • TF-IDF
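
TF-IDF weighs a term by how frequent it is within a document (TF) and how rare it is across the corpus (IDF). An illustrative single-machine sketch with toy documents (the MapReduce formulation distributes these same counts across mappers and reducers):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} dict per doc.
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["big", "data"], ["big", "deal"], ["data", "store"]]
w = tf_idf(docs)
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words are automatically down-weighted; rare terms like "deal" above score highest in their document.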

Pig : Dataflow Language

  • Pig Data Model
  • Input and Output
  • Relational Operations
  • User Defined Functions

Hive : Datawarehouse Framework

  • Hive Architecture
  • Data Definition
  • Data Manipulation
  • Hive - Schema Design and Tuning
  • Pig and Hive Comparison

Data Logistics in Hadoop

  • Data ingestion using Flume/Sqoop

Data Munging and Wrangling

  • Techniques!
  • Binning, Classing and Standardization
  • Outlier/Noise

Big data Analytics in R + Hadoop

  • RHadoop and RHIPE

Course 24: Predictive Analytics

Prediction Modeling

  • Linear Regression

Classification Techniques

  • Logistic Regression
  • K Nearest Neighbors (kNN)
  • Support Vector Machines
  • Naive Bayes
  • Decision Tree


Clustering Techniques

  • K-means

Association Rule Mining

  • Market Basket Analysis, Frequent Itemset Mining Algorithms
  • Apriori, FP-Growth

Recommendation Systems

  • Similarity Metrics, Distance Measurement, Recommendations
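
Apriori's core operation is counting itemset support: the fraction of transactions that contain a candidate itemset. A brute-force sketch of that counting step with made-up baskets (real Apriori prunes candidates level by level using the frequent sets of the previous level, which this toy version skips):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, k):
    # Count every k-item combination and keep those whose support
    # (share of transactions containing them) meets the threshold.
    counts = Counter()
    for t in transactions:
        for items in combinations(sorted(t), k):
            counts[items] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items()
            if c / n >= min_support}

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
pairs = frequent_itemsets(baskets, min_support=0.5, k=2)
# e.g. the pair ("bread", "milk") has support 0.5
```

FP-Growth reaches the same frequent itemsets without generating candidates at all, by compressing the transactions into an FP-tree and mining it recursively.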

Course 25: Text Analytics

  • Basic Text Processing
  • Introductory NLP