Machine Learning: Beginner to Pro Roadmap

This article explains how to learn machine learning step by step, with links to the best resources.

What is machine learning?

Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and statistical models that enable computers to perform tasks without being explicitly programmed for them. The key idea is to let computers learn from data and improve their performance over time.

To learn ML, you need some understanding of the following maths and statistics topics.

For maths:

Linear Algebra
Calculus
Optimization
Differential Equations
Geometry and Trigonometry

For statistics:

Descriptive Statistics
Inferential Statistics
Probability Distributions
Bayesian Statistics
Regression Analysis
Analysis of Variance (ANOVA)
Time Series Analysis

Now, to get your hands dirty with some real implementation, you need to know a programming language such as Python. After that, familiarize yourself with Python libraries like Pandas, Matplotlib, and Seaborn.

Okay, so everything we have covered so far was essentially a prerequisite. Now come the really important topics: machine learning algorithms.

For machine learning algorithms, get your basics clear on the following.

Machine Learning types:

Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning

Here is a list of topics for learning machine learning:

1 → Data Collection

Open Data Repositories:

Kaggle Datasets: Kaggle hosts a wide variety of datasets across different domains.
UCI Machine Learning Repository: A collection of databases, domain theories, and data generators widely used by the machine learning community.

Government and Public Databases:

Data.gov: The U.S. government’s open data.
European Data Portal: Open data from European countries.

Web Scraping Tools:

Beautiful Soup: A Python library for pulling data out of HTML and XML files.
Scrapy: An open-source and collaborative web crawling framework for Python.
Selenium: A powerful open-source framework often used for automated testing of web applications. It can also be used for web scraping when you need to interact with dynamic, JavaScript-heavy websites.
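
To give a feel for what scraping actually involves, here is a minimal sketch that pulls links out of an HTML snippet. It uses only Python's built-in html.parser module so it runs without extra installs; Beautiful Soup offers a friendlier API for the same job. The sample HTML and the LinkCollector class are made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h1>Datasets</h1>
  <a href="/datasets/iris">Iris</a>
  <a href="/datasets/wine">Wine</a>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text seen while inside an <a> tag belongs to that link.
        if self._current_href is not None and data.strip():
            self.links.append((self._current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

links = extract_links(SAMPLE_HTML)
print(links)
```

Before scraping a real site, check its terms of use and robots.txt.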

APIs for Data Retrieval:

Many websites and services offer APIs for accessing their data. Examples include Twitter, GitHub, and various financial APIs.
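
Working with an API usually comes down to two steps: building a request URL with query parameters and parsing the JSON that comes back. The sketch below shows both with the standard library. The endpoint, the parameter names, and the response body are all hypothetical; a real API will have its own URL scheme and will often require an authentication key.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.com/v1/repos"

def build_url(base_url, **params):
    """Append query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"

def parse_repos(payload):
    """Turn a JSON response body into a list of (name, stars) tuples."""
    data = json.loads(payload)
    return [(item["name"], item["stars"]) for item in data["items"]]

# A made-up response body, standing in for what a server might return.
SAMPLE_RESPONSE = (
    '{"items": [{"name": "ml-roadmap", "stars": 120},'
    ' {"name": "datasets", "stars": 45}]}'
)

url = build_url(BASE_URL, language="python", per_page=2)
repos = parse_repos(SAMPLE_RESPONSE)
print(url)
print(repos)
```

In practice the response body would come from an HTTP call, for example with urllib.request.urlopen from the standard library.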

Image Datasets:

ImageNet: A large dataset for image classification.
COCO (Common Objects in Context): A large-scale object detection, segmentation, and captioning dataset.

Text and NLP Datasets:

IMDb Reviews: Large movie review dataset for sentiment analysis.
Project Gutenberg: A large collection of free eBooks.

Healthcare Datasets:

MIMIC-III: Medical Information Mart for Intensive Care, a database of ICU patient data.
UCI Machine Learning Repository — Health: Various healthcare-related datasets.

Finance Datasets:

Yahoo Finance API: Provides historical stock data.
Quandl (now Nasdaq Data Link): A platform for financial, economic, and alternative data.

Social Media Data:

Twitter API: Access to Twitter’s data for various purposes.
Reddit API: Access to Reddit’s data for research and analysis.

Climate and Environmental Datasets:

NASA Earthdata: Diverse collection of environmental data.
NOAA Data Access Viewer: National Centers for Environmental Information datasets.

2 → Data Preprocessing

Rescaling
— MinMax Scaling
— Absolute Maximum Scaling
— Normalization
— Standardization
— Robust Scaling
Encoding
— Ordinal Encoding
— Label Encoding
— One-hot Encoding
Imputer
— Next-previous value
— KNN (K-Nearest Neighbours)
— Max / Min Value
— Missing value prediction
— Most Frequent Value
— Mean / Median
— Fixed Value
— Linear interpolation (Pandas Interpolate method)
Dimension Reduction
— PCA
— Backward Elimination (Only for Linear Regression and Logistic Regression)
— Forward selection
— Score Comparison
— Missing value Ratio
— Low Variance Filter
— High Correlation filter
— Random Forest
— Factor Analysis
Outlier Reduction
1. Two Types
— — Outlier Detection
— — Outlier Removal
2. Outlier Detection
— — Box Plot
— — IQR Methods
— — Z-score Method
— — Distance from the mean (Multivariate)
3. Outlier Removal
— — Trimming
— — Capping (replace outliers with an upper or lower limit; alternatively, treat them as missing values and impute)
— — Discretization (Binning): group the values into bins
Feature Engineering
— Feature Creation
— Transformation
— Feature extraction
— Feature Selection
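
To make a few of the preprocessing steps above concrete, here is a small sketch in plain Python covering median imputation, IQR-based capping, MinMax scaling, and one-hot encoding. The data and function names are made up for illustration, and the 1.5 × IQR rule is a common convention rather than a fixed law. In a real project you would more likely reach for the equivalent tools in pandas or scikit-learn.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def cap_outliers_iqr(values, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those limits."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, low), high) for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode each label as a 0/1 vector, one column per category."""
    categories = sorted(set(labels))
    return [[int(label == c) for c in categories] for label in labels]

# Made-up data: one missing age and one obvious outlier.
ages = [22, 25, None, 27, 30, 24, 95]
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Delhi", "Pune", "Mumbai"]

imputed = impute_median(ages)
capped = cap_outliers_iqr(imputed)
scaled = min_max_scale(capped)
encoded = one_hot(cities)

print(imputed)
print(capped)
print(scaled)
print(encoded[0])
```

The order matters here: imputing first means the quartiles are computed on a complete column, and capping before scaling keeps a single extreme value from squashing everything else toward zero.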

Check Normal Distribution

Seaborn histplot or displot (the older distplot is deprecated)
QQ plot
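
A QQ plot pairs each sorted observation with the quantile a normal distribution would put at the same position; if the data is roughly normal, the points fall close to a straight line. The sketch below computes those pairs without plotting and uses their correlation as a rough straightness score. The sample values, the (i + 0.5) / n plotting positions, and the helper names are assumptions made for this example.

```python
from statistics import NormalDist, mean

def qq_points(sample):
    """Pair each sorted observation with the matching standard-normal quantile."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i + 0.5) / n gives plotting positions strictly between 0 and 1.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def pearson(pairs):
    """Pearson correlation of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A roughly bell-shaped sample and a strongly right-skewed one (made-up numbers).
symmetric = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4]
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20, 90]

r_symmetric = pearson(qq_points(symmetric))
r_skewed = pearson(qq_points(skewed))
print(round(r_symmetric, 3), round(r_skewed, 3))
```

A correlation close to 1 suggests the points would sit near a straight line on the plot; a clearly lower value hints that a transformation may help.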

Transformation (convert the data toward a normal distribution)

Logarithm Transformation (FunctionTransformer): good for right-skewed data
Reciprocal Transformation (FunctionTransformer): 1/x
Box-Cox Transformation (PowerTransformer)
Square Transformation (FunctionTransformer): x², good for left-skewed data
Square Root Transformation (FunctionTransformer): √x
Yeo-Johnson Transformation (PowerTransformer)
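
Here is a small sketch of why these transformations help: it measures the skewness of a right-skewed column before and after applying a square root and a logarithm. The income figures and the skewness helper are made up for illustration. In scikit-learn, FunctionTransformer can wrap functions like these, and PowerTransformer covers the Box-Cox and Yeo-Johnson methods.

```python
import math
from statistics import mean

def skewness(values):
    """Population skewness: mean cubed deviation divided by std dev cubed."""
    m = mean(values)
    n = len(values)
    var = sum((v - m) ** 2 for v in values) / n
    third = sum((v - m) ** 3 for v in values) / n
    return third / var ** 1.5

# Made-up, right-skewed data (a few very large values pull the tail out).
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

sqrt_incomes = [math.sqrt(v) for v in incomes]
log_incomes = [math.log(v) for v in incomes]  # requires strictly positive values

print("raw :", round(skewness(incomes), 2))
print("sqrt:", round(skewness(sqrt_incomes), 2))
print("log :", round(skewness(log_incomes), 2))
```

The closer the skewness is to zero, the more symmetric the column. On this data the logarithm pulls the long right tail in more strongly than the square root does.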

3 → Feature Management techniques

PCA
ICA
LDA
LLE
t-SNE

Feature Selection techniques

Filter Method
— Information gain
— Chi-square Test
— Fisher’s score
— Correlation Coefficient
— Variance Threshold
— Mean Absolute Difference
— Dispersion Ratio
Wrapper Method
— Forward Selection
— Backward Elimination
— Bi-directional elimination (stepwise selection)
Embedded Method
— Random Forest
— Lasso Regularizations
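
As a small illustration of the filter method, the sketch below combines two of the checks listed above: a variance threshold and a correlation coefficient with the target. The column names, data, and both cutoff values are hypothetical; sensible thresholds depend on your dataset.

```python
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def filter_features(columns, target, min_variance=0.01, min_abs_corr=0.5):
    """Keep features whose variance and |correlation with target| are high enough."""
    selected = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # near-constant column carries almost no information
        if abs(pearson(values, target)) < min_abs_corr:
            continue  # weak linear relationship with the target
        selected.append(name)
    return selected

# Made-up dataset: one useful feature, one constant, one unrelated.
columns = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "constant_flag": [1, 1, 1, 1, 1, 1],
    "shoe_size": [42, 38, 41, 39, 42, 38],
}
exam_score = [52, 55, 61, 68, 74, 79]

selected = filter_features(columns, exam_score)
print(selected)
```

Filter methods like this look at each feature on its own, which makes them fast. Wrapper and embedded methods involve an actual model, so they are slower but can pick up interactions between features.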

4 → Ensemble Learning

Bagging
— Bootstrapping
— Aggregating
— Max voting
— Averaging
Boosting
— Adaboost
— Gradient Boosting
— Extreme gradient boosting or XGBoost
Stacking
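
Bagging is easier to grasp with a toy example. The sketch below trains several one-threshold classifiers ("stumps"), each on its own bootstrap sample, and combines them by max voting. The data, the stump learner, and the function names are all made up for illustration; scikit-learn provides BaggingClassifier and RandomForestClassifier for real use.

```python
import random
from collections import Counter

def fit_stump(points):
    """Find the threshold t that best separates labels with the rule x > t -> 1."""
    best_t, best_errors = None, None
    for t, _ in points:
        errors = sum((x > t) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def bagging_fit(points, n_models=15, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # bootstrapping
        thresholds.append(fit_stump(sample))
    return thresholds

def bagging_predict(thresholds, x):
    """Aggregate the stumps' predictions by max (majority) voting."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Made-up 1-D data: small values belong to class 0, large values to class 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

model = bagging_fit(data)
print(model)
print(bagging_predict(model, 0), bagging_predict(model, 10))
```

Using an odd number of models avoids tied votes in a two-class problem. For regression you would average the individual predictions instead of voting.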

5 → Machine Learning Algorithms

Two Types
— Supervised
— Unsupervised
Supervised Machine Learning (Algorithms)
— Regression
— — Linear
— — Polynomial
— — Ridge & Lasso
— — Gradient Descent
— Decision Tree
— Random Forest
— Classification
— — KNN
— — Trees
— — Logistic
— — Naive Bayes
— — SVM
Unsupervised Machine Learning
— Clustering
— — K-Means
— Dimensionality Reduction
— — SVD
— — PCA
— Association analysis
— — Apriori
— — FP-Growth
— Hidden Markov Models
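
To tie a couple of these topics together, here is a minimal sketch of linear regression trained with gradient descent, written in plain Python. The learning rate, the number of epochs, and the toy data are assumptions chosen so the example converges; real projects would normally use a library such as scikit-learn.

```python
def fit_linear_gd(xs, ys, lr=0.02, epochs=5000):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point with the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 3, 5, 7, 9, 11]

w, b = fit_linear_gd(xs, ys)
print(round(w, 4), round(b, 4))
```

The model is never told the rule behind the data. It starts from w = 0 and b = 0 and recovers the slope and intercept purely by reducing its error, which is the "learning from data" idea from the start of this article.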

If you want to see the full list of topics in Google Docs, here it is.

Hope you enjoy it!
