Kainos and AI: One Simple Tool to Improve the Efficiency of your Machine Learning Systems
Have you ever been in a position where you didn't have enough data? Would that chatbot, recommendation system or fraud detector become possible if more data were available? If so, keep reading: this blog is for you!
Background
The Text Augmenter (henceforth referred to as TA) is a program which can be used to augment text records (generate new records from existing records) in order to supply Machine Learning projects with additional training data. I used research done by Jake Young as a jumping-off point for TA. I wanted to package the methods he had come up with in a more user-friendly form that the team could use.
Now you know why I made this, so what does this actually look like?

Synonym Augmentation
This works by breaking down each of the rows of text into individual words, then finding the keywords and replacing them with synonyms to create a new data record. This process is explained using the following flow chart:
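Alongside the flow chart, the same idea can be sketched in a few lines of Python. This is a minimal illustration, not TA's actual code: the synonym table, the stop-word list and the function names `augment_record` and `augment_dataset` are all hypothetical, and a real tool would more likely look synonyms up in a thesaurus than in a hand-written dictionary.

```python
import random

# Hypothetical, hand-written synonym table used only for illustration.
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "help": ["assist", "support"],
    "problem": ["issue", "fault"],
}

# Small illustrative stop-word list: these words are never treated as keywords.
STOP_WORDS = {"i", "a", "the", "with", "my", "need", "to", "and"}


def augment_record(text, rng=None):
    """Return a new record with each known keyword swapped for a synonym."""
    rng = rng or random.Random()
    new_words = []
    for word in text.split():  # break the row into individual words
        key = word.lower()
        if key not in STOP_WORDS and key in SYNONYMS:
            # Keyword with a known synonym: replace it.
            new_words.append(rng.choice(SYNONYMS[key]))
        else:
            # Stop word or unknown word: keep it unchanged.
            new_words.append(word)
    return " ".join(new_words)


def augment_dataset(records, rng=None):
    """Keep the original records and append each augmented copy that differs."""
    rng = rng or random.Random()
    augmented = list(records)
    for record in records:
        candidate = augment_record(record, rng)
        if candidate != record:
            augmented.append(candidate)
    return augmented


if __name__ == "__main__":
    print(augment_dataset(["I need quick help with my problem"]))
```

Splitting on whitespace keeps the sketch short, but it means a word with punctuation attached (such as "problem.") would not be matched; a fuller version would tokenise the text properly first.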


Benchmarking
TA has a built-in feature for benchmarking the original dataset against the augmented datasets using a variety of common Machine Learning algorithms. It uses the following:
Stochastic Gradient Descent [SGD]
Gaussian Naïve-Bayes [GNB] *
Complement Naïve-Bayes [CNB]
Linear Regression
Decision Tree Classifier [DTC] *
Multi-Layer Perceptron [MLP] *
* this classifier is used in the Flask application
The augmentation function splits each of the datasets up into training and test values, then trains the classifier and makes predictions. It then gives the user feedback about the classification: accuracy value, classification report and confusion matrix. It does this for all of the datasets (the original and the datasets created by the two methods of augmentation) and displays comparisons between them.
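The split, train, predict and report loop described above could look roughly like the sketch below. It assumes scikit-learn, which the post does not name, and the `benchmark` function, the bag-of-words features and the toy records are all made up for illustration; only two of the listed classifiers are shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.tree import DecisionTreeClassifier


def benchmark(texts, labels, classifiers, test_size=0.25, seed=0):
    """Train each classifier on one dataset and collect its test-set metrics."""
    # Split the dataset into training and test portions.
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Turn the text into word-count features (an assumption of this sketch).
    vectorizer = CountVectorizer()
    train_features = vectorizer.fit_transform(x_train)
    test_features = vectorizer.transform(x_test)

    class_names = sorted(set(labels))
    results = {}
    for name, clf in classifiers.items():
        clf.fit(train_features, y_train)
        predicted = clf.predict(test_features)
        results[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            "report": classification_report(y_test, predicted, zero_division=0),
            "confusion_matrix": confusion_matrix(y_test, predicted, labels=class_names),
        }
    return results


# Toy records, invented purely so the sketch runs end to end.
SAMPLE_TEXTS = [
    "card stolen and used abroad", "unknown payment on my account",
    "someone used my card online", "strange charge from a shop",
    "how do I reset my password", "please update my address",
    "what are your opening hours", "I want to change my email",
]
SAMPLE_LABELS = ["fraud"] * 4 + ["support"] * 4

if __name__ == "__main__":
    scores = benchmark(
        SAMPLE_TEXTS,
        SAMPLE_LABELS,
        {"CNB": ComplementNB(), "DTC": DecisionTreeClassifier(random_state=0)},
    )
    for name, metrics in scores.items():
        print(name, metrics["accuracy"])
        print(metrics["confusion_matrix"])
```

Calling `benchmark` once on the original dataset and once on each augmented dataset, with the same classifiers, gives the per-dataset results that can then be compared side by side.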

Currently, the main way we use TA is by running the main Python script from the terminal. This method allows for the use of all of the selected benchmarking classifiers, and is the fastest way of using TA.

This project gave me my first opportunity to use Flask, a web framework written in Python. Using it, I built a web page to house the functions for TA. Because I had little previous experience of the web, doing this gave me a better understanding of endpoints and requests. I'm currently unsure whether this will be hosted as a usable web page, or whether one of the other methods will be the primary way to use TA. The downside of the web version is that the full set of machine learning classifiers used for benchmarking bulks up the tool, so I have restricted its benchmarking to only three classifiers.

An implementation of the Python script has been created as a Jupyter Notebook. This is separated into cells containing each of the functions, and an explanation of what they do and how to use them.
