Tuesday 4 July 2023

 

What decides the profit? A decision tree in Python

 

Steps we are going to follow

·         The code starts by importing the necessary libraries: pandas, which helps with data manipulation, and scikit-learn (sklearn), which provides machine learning algorithms.

·         The dataset is read from an Excel file using pandas and stored in the variable "data." It contains information related to turnover, return on capital employed (ROCE), liquidity ratio, number of employees, and profit after tax (pat).

·         Some data preprocessing is performed, such as transposing the dataset, setting column names, and converting the "pat" column to numeric values.

·         The dataset is divided into input features (X) and the target variable (y). X contains the columns "turnover," "roce," "liquidityratio," and "noemployees," while y contains the "pat" column.

·         The dataset is split into training and testing sets using the train_test_split function from sklearn. The test set size is set to 20% of the data, and a random seed (random_state) is used for reproducibility.

·         An instance of the DecisionTreeClassifier is created and assigned to the variable "classifier." This classifier is a decision tree model used to make predictions based on the input features.

·         The classifier is fitted to the training data, meaning it learns patterns and relationships between the input features (X_train) and the target variable (y_train).

·         The code then predicts the profit values for the test set (X_test) using the trained classifier and stores the predictions in the variable "y_pred."

·         The classification_report function from sklearn.metrics is used to generate a report evaluating the performance of the classifier. This report includes metrics such as precision, recall, and F1-score for each class (profit label) in the test set.

·         Finally, the report is printed to the console, showing precision, recall, F1-score, and support for each profit label, as well as overall accuracy.

·         In summary, the code uses a decision tree algorithm to predict profit based on given input features. It trains the model on a training dataset, makes predictions on a test dataset, and evaluates its performance using classification metrics.

 

Python Code for the Work

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Read the dataset from the Excel file
data = pd.read_excel(r"C:\Users\Dell\Desktop\Book1.xlsx")

# Transpose the data, use the first row as column names, and drop that header row
data = data.T
data.columns = data.iloc[0]
data = data[1:]

# Convert the profit-after-tax column to numeric values
data['pat'] = pd.to_numeric(data['pat'], errors='coerce')

# Sanity check on the target column
print(data['pat'].unique())
print(data['pat'].dtype)

# Input features (X) and target variable (y)
X = data[['turnover', 'roce', 'liqudityratio', 'noemployees']]
y = data['pat']

# Split into training (80%) and testing (20%) sets with a fixed random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the decision tree classifier and fit it to the training data
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Optional imports for visualizing the fitted tree (not used below)
from sklearn import tree
import graphviz

# Predict on the test set and evaluate the predictions
y_pred = classifier.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

              precision    recall  f1-score   support

 

  -2490380.0       0.00      0.00      0.00       0.0

  -2448564.0       0.00      0.00      0.00       1.0

  -1906132.0       0.00      0.00      0.00       0.0

  -1571463.0       0.00      0.00      0.00       0.0

     60751.0       0.00      0.00      0.00       1.0

    373589.0       0.00      0.00      0.00       1.0

   3000785.0       0.00      0.00      0.00       1.0

 

    accuracy                           0.00       4.0

   macro avg       0.00      0.00      0.00       4.0

weighted avg       0.00      0.00      0.00       4.0

 

Interpretation

The classification report provides an assessment of the performance of the decision tree classifier. Let's interpret the different metrics:

Precision: Precision evaluates the accuracy of positive predictions. In this case, all the listed labels have a precision of 0.00, indicating that there were no correct positive predictions for these labels.

Recall: Recall measures the proportion of actual positives that were correctly identified. Similar to precision, the recall for all the listed labels is 0.00, indicating that there were no true positive predictions for these labels.

F1-score: The F1-score is a combined measure of precision and recall, taking into account both metrics. Since both precision and recall are 0.00, the F1-score for all the listed labels is also 0.00.

Support: The support indicates the number of occurrences of each label in the test data. Labels such as -2490380.0, -1906132.0, and -1571463.0 did not appear in the test data (support of 0); they show up in the report only because the classifier predicted them. Labels like -2448564.0, 60751.0, 373589.0, and 3000785.0 each occurred once (support of 1).

Accuracy: The overall accuracy of the classifier is reported as 0.00, indicating that none of the predictions were correct.

Macro avg: This row presents the average precision, recall, and F1-score across all labels. Since all the individual metrics are 0.00, the macro average is also 0.00.

Weighted avg: Similar to the macro average, the weighted average computes the metrics while weighting each label by its support. Because every per-label metric is 0.00, the weighted average metrics are also 0.00.

Overall, the classification report shows that the decision tree classifier did not make a single correct prediction. The main reason is that "pat" is a continuous profit figure, so each unique value is treated as its own class; with only a handful of observations per class, exact matches on the test set are extremely unlikely. Framing the task as regression, grouping profits into bands, or collecting more data would be more appropriate ways to improve performance, as sketched below.
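One possible follow-up, sketched below under the assumption that the same X_train, X_test, y_train and y_test from the code above are reused, is to treat "pat" as a continuous target with a regression tree rather than a classifier and score the predictions with an error metric instead of classification accuracy:

# A possible follow-up (not part of the original run): treat "pat" as a
# continuous target with a regression tree instead of a classifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)
y_pred_reg = regressor.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, y_pred_reg))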

 

 

 

Friday 23 June 2023

 

Using K-means Cluster Analysis to Discover Hidden Patterns

 

Introduction

Welcome to this article on our blog, in which we dig into the fascinating realm of k-means cluster analysis, a well-known unsupervised machine learning method. Cluster analysis divides the observations in a dataset into distinct but related subsets, or clusters, according to how similar they are across many dimensions. By running a k-means cluster analysis we can find subgroups of observations that display similar patterns of response on a set of clustering variables, which helps us make more informed decisions. These clustering variables are mostly quantitative measures, although binary variables may also be included.

 

Acquiring Knowledge of the K-means Clustering Method

K-means cluster analysis is a kind of unsupervised learning that partitions observations into separate clusters, with each observation belonging to exactly one cluster. The algorithm begins by randomly assigning observations to clusters and then repeatedly refines the cluster assignments and centroids to minimize the within-cluster sum of squares, stopping once the assignments stabilize. The "k" in "k-means" is the number of clusters we want to find. A toy sketch of this loop follows; the full analysis later in this post uses scikit-learn's KMeans.
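To make the iterative idea concrete, here is a minimal, illustrative sketch of the assignment-and-update loop (for teaching purposes only; scikit-learn's KMeans, used in the example below, handles initialization and edge cases far more carefully):

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    # Toy k-means loop: not production code (empty clusters are not handled)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign each observation to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the observations assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids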

 

The Steps Involved

 

Bringing in the Necessary Library Packages

Importing the required libraries into our Python environment is the first step. For k-means cluster analysis we will use scikit-learn, a well-known machine learning library that gives us access to the KMeans class.

 

The Dataset Is Being Loaded

Next, we need to load our dataset, which includes the clustering variables and observations that will serve as the foundation for our study.

 

K-means Cluster Analysis Being Carried Out

 

 

We will use the KMeans class to determine the subgroups of observations that exhibit comparable response patterns. We create the model with the number of clusters we want to build (k) and then fit it to our data.

 

Extracting the Assignments to the Clusters

After we have finished the k-means cluster analysis, we will be able to get the cluster assignments that correspond to each observation. Each observation is given a label that specifies which cluster it falls under.

 

The process of analyzing and interpreting the results is as follows

It is now time to examine and interpret the findings. You can gain insight into the subgroups discovered by the analysis by examining the characteristics of each cluster, for example the mean values of the clustering variables within each cluster. Visualizing the clusters with suitable plots or graphs can further improve your understanding. A short follow-up after the code listing below shows how the cluster centers can be inspected.

 

Python Code

 

# Import the required libraries

from sklearn.cluster import KMeans

import numpy as np

 

# Load the dataset

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

 

# Create an instance of the KMeans class

kmeans = KMeans(n_clusters=2, random_state=0)

 

# Fit the model to the data

kmeans.fit(X)

 

# Extract the cluster assignments

cluster_assignments = kmeans.labels_

 

# Print the cluster assignments

print("Cluster assignments:", cluster_assignments)

 

Summary

 

K-means cluster analysis is an unsupervised machine learning approach used to find subgroups of observations with similar response patterns. In this blog article we discussed the idea behind k-means cluster analysis and used it to divide the dataset into separate groups based on the clustering variables. Keep in mind that you are not required to run the cluster analysis on a separate test dataset unless you want to; if the number of observations in your dataset is not very large, you may skip splitting it into training and test sets, but make sure your written summary explains the reasoning behind this choice.

 

An Investigation into Lasso Regression Analysis with Regard to Variable Selection

 

Introduction

Welcome to this blog post, in which we delve into the intriguing realm of lasso regression analysis. Lasso regression is a powerful method applied to linear regression models to identify the essential variables and improve prediction accuracy. When predicting a quantitative response variable, its purpose is to find the subset of predictors that gives the lowest prediction error. Lasso regression achieves this by shrinking some of the regression coefficients toward zero, which effectively removes those variables from the model. This helps us identify the predictors most strongly associated with the response variable, which ultimately leads to more accurate predictions.

 

 

Acquiring Knowledge of the Lasso Regression Analysis

Lasso regression brings together two essential statistical ideas: shrinkage and variable selection. Shrinkage reduces the magnitude of the coefficients assigned to each predictor, and variable selection lets us pick the predictors that are most relevant to our study. The magic happens when certain variables are assigned coefficients of exactly zero, which essentially removes them from the model. The variables that keep non-zero coefficients are the ones with the most significant impact on the response variable. The small sketch below illustrates this effect on synthetic data.
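As a small illustration of this shrinkage effect (a standalone sketch on synthetic data, not part of the analysis below), increasing the penalty strength drives more and more coefficients exactly to zero:

# Toy illustration: as the lasso penalty (alpha) grows, fewer coefficients stay non-zero
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# Synthetic data in which only a few predictors are truly informative
X_toy, y_toy = make_regression(n_samples=100, n_features=8, n_informative=3,
                               noise=10.0, random_state=0)

for alpha in [0.1, 1.0, 10.0]:
    coefs = Lasso(alpha=alpha).fit(X_toy, y_toy).coef_
    print(f"alpha={alpha}: non-zero coefficients = {np.sum(coefs != 0)}")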

 

 

The Steps Involved

 

Bringing in the Necessary Library Packages

To get started, we need to import several libraries into Python. For lasso regression with k-fold cross-validation we will use scikit-learn, a well-known machine learning library that provides us with the LassoCV class.

The Dataset Is Being Loaded:

The next step is to load our dataset. It includes both the predictor variables that we will use to generate predictions and the quantitative response variable that we want to predict.

 

Performing Lasso Regression while Employing Cross-Validation

We will use the LassoCV class to choose the subset of predictors that best predicts our response variable. This class applies the lasso regression technique together with k-fold cross-validation. Cross-validation lets us evaluate the performance of the model and choose the most effective regularization parameter, the value that governs how much shrinkage is applied.

 

Extracting the Important Predictors

Once the lasso regression analysis has finished, we can extract the significant predictors. These are the predictors with non-zero regression coefficients, which indicates that they have a meaningful relationship with the response variable we are interested in.

 

Code for Lasso Regression Analysis

# Import the required libraries
import numpy as np
from sklearn.linear_model import LassoCV
# Note: load_boston was removed from scikit-learn (version 1.2 onwards), so the
# diabetes dataset is used here as a stand-in quantitative-response example
from sklearn.datasets import load_diabetes

# Load the dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

 

# Create an instance of the LassoCV class

lasso = LassoCV(cv=5)

 

# Fit the model to the data

lasso.fit(X, y)

 

# Extract the important predictors (those with non-zero coefficients)
feature_names = np.array(diabetes.feature_names)
important_predictors = feature_names[lasso.coef_ != 0]

 

# Print the important predictors

print("Important predictors:", important_predictors)

 

 

 

 

Summary

In this article we discussed lasso regression analysis, a powerful method for variable selection and shrinkage in linear regression models, and walked through an example. By running a lasso regression with k-fold cross-validation we identified the subset of predictors that most accurately predicts our quantitative response variable. The variables with non-zero regression coefficients are the ones most strongly associated with the response variable we are interested in. Keep in mind that if the number of observations in your dataset is relatively low, you may not need to split it into a training set and a test set, since doing so could leave an inadequate sample size for training the model.

 

Running a Random Forest

Introduction:

In this article, we will go into the topic of random forest analysis, a robust approach to predictive modeling in machine learning. Random forests let us investigate the relative importance of a number of potential explanatory variables when predicting a binary or categorical response variable. We will cover the steps required to perform a random forest analysis, analyze the findings, and understand variable importance.

What exactly is an analysis of a random forest?

Random forest analysis is a flexible modeling method that uses a collection of decision trees to predict a response variable. Many decision trees are built and their predictions aggregated to produce forecasts that are more accurate and robust than those of any single tree. Random forests also let us examine how the number of trees affects classification accuracy and provide insights into the importance of the explanatory variables in predicting the target variable. A toy sketch of the aggregation idea follows; the complete example later in this post uses scikit-learn's RandomForestClassifier.
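To make the aggregation idea concrete, here is a toy sketch (for illustration only; it is not how scikit-learn implements RandomForestClassifier internally): several decision trees are trained on bootstrap samples of the data and their predictions are combined by majority vote.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(10):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

# Each tree votes; the most common class label wins
votes = np.array([t.predict(X[:5]) for t in trees])
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print("Ensemble predictions for the first five samples:", majority)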

The Steps Involved

1) Bringing in the Necessary Library Files:

To get started, we will first import the required libraries into Python. The RandomForestClassifier class is available for use in the construction of random forest models inside Scikit-learn.

2) Adding Items to the Dataset

To carry out our analysis, we load a dataset that contains the categorical or binary response variable as well as the explanatory variables. This dataset has to be properly prepared, with the response variable encoded as discrete class labels.

 

3) Splitting the Dataset into Training and Testing Sets:

It is necessary to separate the dataset into a training set and a testing set before we can evaluate how well our random forest model performs. The accuracy of the model will be evaluated based on its performance on the testing set, while the training set will be utilized to train the model.

 

4) Development of the Random Forest Model:

Next, we create an instance of the RandomForestClassifier class and fit it to the training data. The model learns from the training set by building numerous decision trees, each on a random subset of the data and features.

5) Attempting to Make Predictions:

Now that our random forest model has been trained, we can make predictions on the testing data. The final prediction for each sample combines the outputs of all the individual decision trees.

 

6) Evaluating Variable Importance:

Random forests allow us to assess how relevant each explanatory variable is for predicting the response variable. By evaluating the effect each variable has on the model's performance, we learn which features have the greatest influence.

 

7) Interpretation:

After running the random forest analysis, we can investigate the variable importance scores to understand the relative significance of each explanatory variable. A higher importance score indicates a greater effect on the model's predictions. These findings can guide feature selection, data preparation, and further analysis; a short sketch after the code below shows how the importance scores can be read off the fitted model.

 

 

Python Code for Random Forest

# Import the required libraries

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

 

# Load the dataset

iris = load_iris()

X = iris.data

y = iris.target

 

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 

# Create an instance of the RandomForestClassifier

clf = RandomForestClassifier()

 

# Fit the classifier to the training data

clf.fit(X_train, y_train)

 

# Perform predictions on the testing data

y_pred = clf.predict(X_test)

 

# Print the predictions

print("Predicted labels:", y_pred)

 

 

Conclusion

 

 

Random forest analysis is a useful method for assessing the significance of explanatory variables when predicting a binary or categorical response variable. By following the steps in this blog article you can run your own random forest analysis with Python and scikit-learn, gain insight into the relevance of the variables, and make more accurate predictions.

 

Understanding Classification Tree Analysis

 

Introduction

In this article, we investigate classification tree analysis using the scikit-learn module in Python. Classification trees are a useful tool for analyzing nonlinear relationships and interactions between variables when predicting a categorical response variable. We will walk through the process of performing a classification tree analysis and interpreting its findings.

 

 

What exactly is meant by the term "classification tree analysis"?

 

Classification tree analysis is a kind of predictive modeling that uses decision trees to investigate the relationships between a categorical response variable and the explanatory variables. The analysis builds a set of straightforward rules or criteria that segment the data and identify the combinations of variables that best predict the target variable.

 

The Steps Involved

 

Importing the Essential Libraries: before getting started, we need to import the necessary libraries into Python. For constructing classification tree models, the scikit-learn package provides the DecisionTreeClassifier class.

 

1.      The Dataset Is Being Loaded

from sklearn.tree import DecisionTreeClassifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

 

 

In this part of the analysis we use the Iris dataset, a very popular dataset in machine learning. It contains measurements of iris flowers, and the goal is to identify the species of each iris from those measurements.

iris = load_iris()

X = iris.data

y = iris.target

 

2.      The Dataset Is Split into Training and Testing Sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The dataset has to be split into a training set and a testing set before we can evaluate the effectiveness of our classification tree model. The model is built from the training set, while its accuracy is evaluated on the testing set.

 

3.      Constructing the Classification Tree Model

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)

Next, we create an instance of the DecisionTreeClassifier class and train it on our data. To make accurate predictions, the model learns patterns and relationships within the training data.

 

4.      Attempting to Make Predictions:

y_pred = clf.predict(X_test)

Now that our model has been trained, we can make predictions on the testing data. The model applies the decision rules it has learned to assign a class to each sample based on the supplied features.

 

5.      Evaluating the Model:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

We can evaluate the effectiveness of our classification tree model by calculating metrics such as accuracy, precision, recall, and F1-score. These metrics show how well the model predicts the correct class labels; the sketch below shows how the additional metrics can be obtained.
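A short sketch (continuing from y_test, y_pred and the iris object above) of how the precision, recall, and F1-score mentioned here can be obtained alongside the accuracy:

from sklearn.metrics import classification_report

# Per-class precision, recall and F1-score for the predictions above
print(classification_report(y_test, y_pred, target_names=iris.target_names))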

 

6.      Interpretation:

 

 

By applying classification tree analysis to the Iris dataset we obtained an accuracy of X.XX, meaning the model correctly predicted the class for that proportion of the samples in the testing set. Through the decision tree analysis we were able to uncover nonlinear relationships and interactions between the explanatory variables and the categorical response variable, illuminating the underlying patterns present in the data. One way to inspect the learned rules is sketched below.
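To make the learned splits visible, one option (a sketch continuing from the fitted clf and the iris object above) is to print the tree's decision rules as text:

from sklearn.tree import export_text

# Print the decision rules learned by the fitted classifier
print(export_text(clf, feature_names=list(iris.feature_names)))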

 

 

 

 

Conclusion:

 

Classification tree analysis is a useful method for understanding nonlinear interactions and for generating predictions for categorical response variables. By following the procedure in this blog article you can perform your own classification tree analysis on your dataset using Python and scikit-learn, gain insights into your data, and make accurate predictions.