How To Design Configuration File For Machine Learning Code

This article was published as a part of the Data Science Blogathon

Research is to see what everybody else has seen and to think what nobody else has thought – Albert Szent-Gyorgyi

Introduction:

The data science lifecycle is an iterative process in which every step is revisited again and again at various stages. This is mainly due to the research/experiment-based approach the field demands, and at times there is no right or wrong result. Every result has its relevance based on the data, the approach, the assumptions made along the way, the factors considered or skipped, etc.

Finally, the approach that gives relatively better results and makes business sense makes it to production. But the cycle doesn't end there; even post-production, one needs to constantly monitor the model's performance and make revisions as often as necessary.

As businesses have realized the importance of data and the benefits of using it well, the size of data science teams has increased over the years. More teams are carrying out various experiments, revisions, and optimizations. It can get very complicated in no time if a process is not put in place where every experiment is tracked and measured and its results are documented for reference. This goes a long way in avoiding redundant research and experiments.

To achieve this, replicability and reproducibility play an important role, i.e., the ability to perform a data analysis and achieve the same results as someone else.

Why do we need reproducible reports?

[Image: Reproducible ML reports]

In this article, we will explore the process of building and managing machine learning reports using configuration files and generating HTML reports. For this simple machine learning project, I will use the Breast Cancer Wisconsin (Diagnostic) Data Set. The objective of this ML project is to predict whether a person has a benign or malignant tumour.

Let's get started!

  1. We will first build a classification model the conventional way.
  2. We will build the same model using a YAML configuration file.
  3. Finally, we will generate an HTML report and save it.

Classification model – without config file:

Let's create a Jupyter notebook named notebook.ipynb and put the code below in it. I am using VSCode as my editor; it gives a nice and easy way to create a Jupyter notebook.

    # Import important packages
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import joblib

    # Path to the dataset
    # filename = "../Data/breast-cancer-wisconsin.data"
    filename = "./Data/breast-cancer-wisconsin.csv"

    # Load data
    data = pd.read_csv(filename)

    # Replace "?" with -99999
    data = data.replace('?', -99999)

    # Drop id column
    data = data.drop(['id'], axis=1)

    # Define X (independent variables) and y (target variable)
    X = data.drop(['diagnosis', 'Unnamed: 32'], axis=1)
    y = data['diagnosis']

    # Split data into train and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Call our classifier and fit to our data
    classifier = KNeighborsClassifier(n_neighbors=5, weights="uniform",
                                      algorithm="auto", leaf_size=25,
                                      p=1, metric="minkowski", n_jobs=-1)

    # Training the classifier
    classifier.fit(X_train, y_train)

    # Test our classifier
    result = classifier.score(X_test, y_test)
    print("Accuracy score is. {:.1f}".format(result))

    # Save our classifier in the model directory
    joblib.dump(classifier, './Model/knn.pkl')

If you notice, the above code contains various hardcoded numbers, file names, model parameters, the train/test split percentage, etc. If you wish to experiment, you have to make changes in the code and re-run it.

As a best practice, it is not advisable to make changes to code directly; instead, it is recommended to use configuration files. There are various file types for configuration, such as YAML, JSON, XML, and INI, and in our case we will use the YAML format.
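To make the comparison between formats a little more concrete, here is a minimal sketch that reads the same two settings from JSON and from INI using only Python's standard library. The setting names mirror ones used in this article; the inline strings stand in for files on disk, and the `[split]` section name is an assumption made for illustration.

```python
import configparser
import json

# The same two settings expressed in JSON and in INI
# (illustrative strings standing in for files on disk).
JSON_TEXT = '{"test_size": 0.3, "random_state": 123}'
INI_TEXT = """
[split]
test_size = 0.3
random_state = 123
"""

# JSON keeps numeric types when parsed.
json_config = json.loads(JSON_TEXT)

# INI values are always read as strings, so they are converted explicitly.
parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)
ini_config = {
    "test_size": parser.getfloat("split", "test_size"),
    "random_state": parser.getint("split", "random_state"),
}

print(json_config)
print(ini_config)
```

Both dictionaries end up holding the same values; the difference lies in how much type handling the format leaves to your code.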

YAML is popular for its readability. It is relatively easy to write, and simple YAML files need very little formatting syntax such as braces and square brackets; most of the relations between items are defined using indentation.
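As a small illustration of indentation defining structure, the sketch below parses a nested YAML snippet with PyYAML (installed with pip install pyyaml). The nested `model` section is a hypothetical layout used only for this example; the config file in this article keeps all of its keys at the top level.

```python
import yaml  # PyYAML

# Hypothetical nested layout, for illustration only.
YAML_TEXT = """
test_size: 0.3
model:
  name: knn
  params:
    n_neighbors: 3
    weights: uniform
"""

config = yaml.safe_load(YAML_TEXT)

# Indented keys become nested dictionaries, and scalars get Python types.
print(config["model"]["params"]["n_neighbors"])  # 3
print(type(config["test_size"]).__name__)        # float
```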

Let's create our config file in YAML, Config.yaml:

    # INITIAL SETTINGS
    data_directory: "./Data/"
    data_name: "breast-cancer-wisconsin.csv"
    drop_columns: ["id", "Unnamed: 32"]
    target_name: "diagnosis"
    test_size: 0.3
    random_state: 123
    model_directory: "./Model"
    model_name: KNN_classifier.pkl

    # kNN parameters
    n_neighbors: 3
    weights: uniform
    algorithm: auto
    leaf_size: 15
    p: 2
    metric: minkowski
    n_jobs: 1

Now that we have built our model the conventional way, let's move to the next section, where we will do it slightly differently.

Classification model – with a config file:

There are two major changes compared to the previous approach.

  1. Loading and reading of the YAML file.
  2. Replacing all the hardcoded parameters with variables from the YAML config file.
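Once every hardcoded value comes from the file, a missing or misspelled key only surfaces when the notebook reaches the line that uses it. One optional safeguard, which is not part of the original code, is to check the loaded dictionary up front. The helper name `validate_config` and the list of required keys below are assumptions for illustration, and the sketch assumes `test_size` is given as a fraction.

```python
# Keys this sketch expects to find in the loaded config (an assumption
# based on the settings used in this article).
REQUIRED_KEYS = [
    "data_directory", "data_name", "drop_columns", "target_name",
    "test_size", "random_state", "model_directory", "model_name",
]


def validate_config(config):
    """Raise early if a required key is missing or test_size is out of range."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing config keys: {missing}")
    if not 0 < config["test_size"] < 1:
        raise ValueError("test_size must be a fraction between 0 and 1")
    return config


# An in-memory dictionary standing in for the loaded YAML file.
example = {
    "data_directory": "./Data/",
    "data_name": "breast-cancer-wisconsin.csv",
    "drop_columns": ["id", "Unnamed: 32"],
    "target_name": "diagnosis",
    "test_size": 0.3,
    "random_state": 123,
    "model_directory": "./Model",
    "model_name": "KNN_classifier.pkl",
}

validate_config(example)
print("config OK")
```

Calling such a check right after loading the file turns a late failure deep in the notebook into an immediate, readable error.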

Let's add the chunk of code below to notebook.ipynb; it loads Config.yaml.

    import os
    import yaml

    # Folder to load the config file from
    CONFIG_PATH = "./"

    # Function to load the YAML configuration file
    def load_config(config_name):
        with open(os.path.join(CONFIG_PATH, config_name)) as file:
            config = yaml.safe_load(file)
        return config

    config = load_config("Config.yaml")

Now, let's proceed to replace the hardcoded parameters with variables from the config file. For example, we will modify the train/test split code.

    # Split data into train and test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["test_size"], random_state=config["random_state"]
    )

Here are the changes we made:

  1. The test_size = 0.2 is replaced with config["test_size"]
  2. The random_state = 42 is replaced with config["random_state"]

After making similar changes throughout, the final file looks like this.

    # Import important packages
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    import joblib
    import os
    import yaml

    # Folder to load the config file from
    CONFIG_PATH = "./"

    # Function to load the YAML configuration file
    def load_config(config_name):
        with open(os.path.join(CONFIG_PATH, config_name)) as file:
            config = yaml.safe_load(file)
        return config

    config = load_config("Config.yaml")

    # Load data
    data = pd.read_csv(os.path.join(config["data_directory"], config["data_name"]))

    # Replace "?" with -99999
    data = data.replace("?", -99999)

    # Drop the columns listed in the config
    data = data.drop(config["drop_columns"], axis=1)

    # Define X (independent variables) and y (target variable)
    X = np.array(data.drop(config["target_name"], axis=1))
    y = np.array(data[config["target_name"]])

    # Split data into train and test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["test_size"], random_state=config["random_state"]
    )

    # Call our classifier and fit to our data
    classifier = KNeighborsClassifier(
        n_neighbors=config["n_neighbors"],
        weights=config["weights"],
        algorithm=config["algorithm"],
        leaf_size=config["leaf_size"],
        p=config["p"],
        metric=config["metric"],
        n_jobs=config["n_jobs"],
    )

    # Training the classifier
    classifier.fit(X_train, y_train)

    # Test our classifier
    result = classifier.score(X_test, y_test)
    print("Accuracy score is. {:.1f}".format(result))

    # Save our classifier in the model directory
    joblib.dump(classifier, os.path.join(config["model_directory"], config["model_name"]))

You can find the entire code on GitHub.

So far, we have successfully built a classification model, built a YAML config file, loaded the config file in the Jupyter notebook, and parameterized our entire code. Now, if you make changes to the config file and run notebook.ipynb, the model is trained and scored just as in the conventional approach, without any edits to the code.
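Since each run is now fully described by the config dictionary, one way to keep track of experiments is to store a short fingerprint of the config next to the score it produced. This is an optional sketch, not part of the original notebook; the function names and the record layout are assumptions.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a short, stable hash of a config dictionary."""
    # sort_keys makes the hash independent of the key order in the file.
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def experiment_record(config, score):
    """Bundle the config, its fingerprint, and the resulting score."""
    return {"config_id": config_fingerprint(config), "config": config, "score": score}


config_a = {"n_neighbors": 3, "test_size": 0.3}
config_b = {"test_size": 0.3, "n_neighbors": 3}  # same settings, different order
config_c = {"n_neighbors": 5, "test_size": 0.3}  # one setting changed

print(config_fingerprint(config_a) == config_fingerprint(config_b))  # True
print(config_fingerprint(config_a) == config_fingerprint(config_c))  # False

# score=None is a placeholder; in the notebook it would be the value
# returned by classifier.score(X_test, y_test).
record = experiment_record(config_a, score=None)
print(sorted(record))
```

Appending one such record per run to a log file gives a simple trail of which settings produced which result.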

We will move to the final section, where we will generate a report of everything we have done so far.

Generating reports:

Here are the steps to be followed to generate the report:

  1. Open the terminal as administrator and navigate to your project folder.
  2. We will be using the nbconvert library for report generation. If it is not installed, run pip install nbconvert or conda install nbconvert.
  3. Type jupyter nbconvert --execute --to html notebook.ipynb in the terminal. The --execute flag executes all the cells in the Jupyter notebook.
  4. A notebook.html file will be generated and saved in your project folder.
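If you would rather trigger the conversion from Python than type the command each time, the steps above can be sketched as a thin wrapper around the same nbconvert command. The function names here are assumptions for illustration, and the sketch presumes that jupyter and nbconvert are installed and on your PATH.

```python
import subprocess


def build_nbconvert_command(notebook, output_format="html"):
    """Build the nbconvert command that executes a notebook and exports it."""
    return ["jupyter", "nbconvert", "--execute", "--to", output_format, notebook]


def generate_report(notebook):
    """Run nbconvert; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_nbconvert_command(notebook), check=True)


command = build_nbconvert_command("notebook.ipynb")
print(" ".join(command))  # jupyter nbconvert --execute --to html notebook.ipynb
```

Calling generate_report("notebook.ipynb") from the project folder would then run the command and produce notebook.html, just as typing it in the terminal does.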

If you wish to experiment with your model, then instead of making changes in your code directly, make changes to your Config.yaml and follow the above steps to generate the report.


Conclusion:

Now we understand the importance of using a configuration file in a machine learning project. In this article, we learned what a configuration file is, why it matters in a machine learning project, and how to create a YAML file and use it in your ML project. You can now start using a configuration file in your next machine learning project.

If you learned something new or enjoyed reading this article, please share it so that others can see it.

Happy learning!

You can connect with me – LinkedIn

You can find the code for reference – GitHub


The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.

Source: https://www.analyticsvidhya.com/blog/2021/05/reproducible-ml-reports-using-yaml-configs-with-codes/

