Machine learning is now used around the world, and it helps analytics teams cut costs and improve business decisions.
A machine learning project starts with raw data and ends with a web application that can predict outcomes and generate insights from that data.
The following steps make up a machine learning project pipeline.
Step 1 : EDA
EDA stands for exploratory data analysis. In this step we explore the raw data and determine the target variable. We check the data size and datatypes, look for missing values and outliers, analyse the distributions of the variables and features, measure the correlation of the input features with the target column, examine relationships among the features, and look for any visible patterns in the data.
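As a rough illustration of these checks, here is a minimal sketch using pandas. The toy dataset, the column names, and the helper functions `summarize` and `iqr_outliers` are all made up for this example; the 1.5 x IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; "price" plays the role of the target column.
df = pd.DataFrame({
    "area": [50.0, 65.0, np.nan, 120.0, 80.0, 300.0],
    "rooms": [2, 3, 3, 5, 4, 6],
    "city": ["A", "B", "A", "C", "B", "A"],
    "price": [100.0, 130.0, 125.0, 260.0, 170.0, 520.0],
})


def summarize(data: pd.DataFrame, target: str) -> dict:
    """Collect a few basic EDA facts about a DataFrame."""
    numeric = data.select_dtypes(include="number")
    return {
        "shape": data.shape,                           # data size
        "dtypes": data.dtypes.astype(str).to_dict(),   # datatypes per column
        "missing": data.isna().sum().to_dict(),        # missing values per column
        "describe": numeric.describe(),                # distribution summary
        "target_corr": numeric.corr()[target].drop(target),  # correlation with target
    }


def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return series[mask]


summary = summarize(df, target="price")
print(summary["shape"])
print(summary["missing"])
print(summary["target_corr"])
print(iqr_outliers(df["price"]))
```

On real data you would usually add plots (histograms, box plots, scatter plots) on top of these numeric summaries.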
Step 2 : Data Transformation and Cleaning
Most machine learning algorithms need the data in a specific format to perform at their best. For example, linear and logistic regression (especially when regularized or trained with gradient descent), PCA, and clustering algorithms that use distance to measure similarity between data points work best when the features are on the same scale. If the scales differ, the algorithm may give more importance to features with larger magnitudes and less to features with smaller magnitudes. In such cases we normalize or standardize the data with a scaler. Apart from that, treating missing values, treating outliers, and encoding categorical features are also important for getting the best performance from an algorithm.
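The cleaning steps described above could be sketched with scikit-learn as follows. The dataset and column names are invented for illustration, and the specific choices (median imputation, standardization with `StandardScaler`, one-hot encoding) are just one reasonable combination; `MinMaxScaler` or other imputation strategies may suit your data better.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with a missing value and a categorical column.
df = pd.DataFrame({
    "income": [30000.0, 52000.0, np.nan, 110000.0],
    "age": [25.0, 40.0, 33.0, 58.0],
    "city": ["A", "B", "A", "C"],
})

numeric_cols = ["income", "age"]
categorical_cols = ["city"]

# Numeric columns: fill missing values with the median, then standardize
# to zero mean and unit variance so that no feature dominates by magnitude.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(df)
# The result may be sparse or dense depending on its density; force dense here.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)
print(X.shape)
```

In a real project the transformer should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into training.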
Step 3 : Feature Engineering
Feature engineering is the science (and art) of extracting more information from existing data. Here we generate new features from the data we already have. For example, we can use a timestamp column in our dataset to generate month, quarter and year columns, giving the algorithm additional information that can significantly improve its performance. We can also transform existing columns to make them more useful to the algorithm.
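The timestamp example might look like this in pandas. The column names and values are hypothetical, and the log transform at the end is just one example of transforming an existing column.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with a timestamp column.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-15", "2023-05-03", "2024-11-30"]),
    "sales": [120, 95, 210],
})

# Derive calendar features from the existing timestamp column.
df["year"] = df["timestamp"].dt.year
df["quarter"] = df["timestamp"].dt.quarter
df["month"] = df["timestamp"].dt.month

# Transform an existing column: log(1 + x) compresses large values.
df["log_sales"] = np.log1p(df["sales"])

print(df)
```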
Step 4 : Train test split
An important step in our machine learning pipeline is to test how well our model performs on unseen data. To do this we hold back part of the given data so that the model never sees it during training, and use it later to evaluate the model's performance. This process of dividing the given data into train and test sets is known as the train test split.
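A minimal sketch of the split using scikit-learn's `train_test_split` is shown below. The synthetic arrays, the 80/20 ratio, and the use of `stratify` (which keeps the class proportions the same in both sets) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, balanced binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold back 20% of the rows as the unseen test set.
# random_state makes the split reproducible; stratify keeps class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```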
Step 5 : Selecting model
Model selection is an important step, and there are various ways to do it. Ideally, we compare the evaluation results of different models and select the one that performs best. But there can be other criteria for selecting the best model too. For example, KNN may give the best results in some cases, but it can be slow at prediction time because it compares each new data point with the points in the training dataset. On large training sets this can make it unsuitable for production scenarios where users don't want to wait long for results. So while selecting a model, consider its overall purpose, accuracy and speed. You can start with the model of your choice and keep comparing the results of various models to select the best one.
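One way to compare candidates on both accuracy and prediction speed is sketched below. The synthetic dataset, the two candidate models, and the validation split carved out of the training data are assumptions made for the example; timings will vary from machine to machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data standing in for the training set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical shortlist of candidate models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # Time only the scoring call, which includes prediction on X_val.
    start = time.perf_counter()
    accuracy = model.score(X_val, y_val)
    elapsed = time.perf_counter() - start
    results[name] = {"accuracy": accuracy, "predict_seconds": elapsed}

for name, r in results.items():
    print(f"{name}: accuracy={r['accuracy']:.3f}, "
          f"predict_seconds={r['predict_seconds']:.4f}")
```

The comparison here uses a validation set taken from the training data, so the test set stays untouched for the final evaluation.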
Step 6 : Model training and hyperparameter tuning
Once you have selected the model, train it on the train dataset that we created in the train test split step, and tune the hyperparameters to find the best values for your specific use case. You can use randomized search or grid search with cross-validation to tune your hyperparameters.
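Here is a minimal sketch of grid search with cross-validation using scikit-learn's `GridSearchCV`. The random forest model, the tiny parameter grid, and the synthetic data are assumptions chosen to keep the example fast; `RandomizedSearchCV` has a similar interface and samples from the grid instead of trying every combination.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical grid: every combination is trained and cross-validated.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ is retrained on all of X_train.
best_model = search.best_estimator_
```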
Step 7 : Evaluate Model
Once you have trained your model, it's time to evaluate it on the unseen data. Take the test data from the train test split, get the model's predictions on it, and score them with evaluation metrics such as accuracy, precision, recall, F-score or AUC for classification models, and mean absolute error, root mean squared error or R-squared for regression models. Select the evaluation metric that suits your requirements.
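The metrics listed above are all available in `sklearn.metrics`. The sketch below computes them on small hand-written label and prediction arrays, which are invented for illustration rather than taken from a real model.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

# Classification: true labels, predicted probabilities, and hard predictions
# obtained by thresholding the probabilities at 0.5.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3])
y_pred = (y_score >= 0.5).astype(int)

classification_report = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auc": roc_auc_score(y_true, y_score),  # AUC uses scores, not hard labels
}

# Regression: true values and predictions.
y_true_reg = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_reg = np.array([2.0, 5.0, 4.0, 8.0])

regression_report = {
    "mae": mean_absolute_error(y_true_reg, y_pred_reg),
    "rmse": float(np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))),
    "r2": r2_score(y_true_reg, y_pred_reg),
}

print(classification_report)
print(regression_report)
```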
Step 8 : Deploy model
Once you have your best model, it's time to deploy it so that end users can use it by sending API requests. You can create an endpoint using Flask or FastAPI and use your model in it to predict results for the input data sent in API calls. You can deploy your Flask app on the cloud or on local servers as needed.
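A minimal sketch of such an endpoint with Flask follows. The `/predict` route, the JSON payload shape with a `features` key, and the model trained in place on the iris dataset are all assumptions for illustration; in practice you would typically load a model that was trained and saved earlier, and validate the input more thoroughly.

```python
from flask import Flask, jsonify, request
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in model trained at startup; a real app would load a saved model.
iris = load_iris()
model = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Return the predicted class for one sample sent as JSON."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    # Basic shape check only; values are assumed to be numeric.
    if not isinstance(features, list) or len(features) != 4:
        return jsonify(error="expected 'features': a list of 4 numbers"), 400
    prediction = int(model.predict([features])[0])
    return jsonify(
        prediction=prediction,
        label=str(iris.target_names[prediction]),
    )


# Exercise the endpoint in-process with Flask's built-in test client.
client = app.test_client()
response = client.post("/predict", json={"features": [5.1, 3.5, 1.4, 0.2]})
print(response.status_code, response.get_json())
```

The test client calls the route without starting a server. To serve real traffic you would run the app behind a production WSGI server rather than Flask's development server.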