Literature Review of Predictive Analysis Methods
Christopher Gower
1. Introduction
Predictive analytics is one of the most useful applications of data science: it forecasts future outcomes and suggests solutions to potential problems. It allows the administrators of a system to anticipate problems and challenges, identify growth opportunities, and optimize internal operations. There is no single way to apply it; depending on the purpose, different methods yield the best results. This essay surveys those methods.
2. What is Predictive Analysis?
Predictive analytics is a field of data science that focuses on making sense of previous and current data using specific methods and statistical techniques to make informed and consistent predictions about future events.
Data mining: searching for meaningful patterns and logical relationships in large data stores (the data collected from the past to the present)
Data mining is the process of sorting through large data sets to identify meaningful patterns and logical relationships that can help solve problems, using a variety of analysis tools. Data mining techniques enable organizations to predict future trends and make more informed decisions.
Data mining is a crucial component of successful analytics solutions. The information it generates can be used in artificial intelligence, in advanced analytics applications that analyse historical data, and in real-time analytics applications that examine data as it is generated and collected.
Text analytics: transforming unstructured text into analysis-friendly structured data
Text analytics integrates a variety of machine learning, statistical, and linguistic approaches to extract insights and patterns from large quantities of textual data that does not match a predetermined structure. It makes it possible for organizations, governments, scholars, and the media to use the vast material at their disposal to make critical decisions. Sentiment analysis, topic modelling, named entity recognition, term frequency analysis, and event extraction are just a few of the techniques used in text analytics.
Text analytics and text mining are frequently used interchangeably. Strictly, text analytics produces quantitative results, while text mining extracts qualitative information from unstructured text (Simplilearn, 2022).
Text analytics may benefit corporations, organizations, and social movements in a variety of ways, including:
- Assisting companies in comprehending consumer trends, product performance, and service quality. Consequently, decisions are made faster, business intelligence improves, productivity rises, and costs fall.
- Assisting researchers in quickly exploring a large body of prior literature and retrieving the information pertinent to their inquiry, promoting faster scientific advances.
- Helping governments and other political entities gain a comprehensive grasp of societal patterns and viewpoints, which informs their decisions.
- Improving search engines and information retrieval systems, leading to faster, more relevant user experiences.
- Improving content recommendation systems by classifying related content.
Predictive modelling: developing and modifying a statistical model to forecast future events
Predictive modelling is a popular statistical method for forecasting behaviour. A predictive modelling solution is a data-mining technology that creates a model by studying past and current data and uses it to forecast future results. The process involves gathering data, creating a statistical model, making predictions, and validating (or updating) the model as new data becomes available.
3. Choosing The Right Model for Goal
3.1. Regression
Regression models identify the relationship between an independent variable or predictor and a dependent or target variable. Based on known predictors, this relationship is used to forecast unknown target variables of the same type. It is the most popular predictive analytics model and uses several conventional techniques (Bruce, n.d.).
3.2. Linear regression/ multivariate linear regression
Linear regression, also known as simple regression, establishes the association between two variables. The slope of the straight line used to represent a linear regression indicates how a change in one variable affects the other. The y-intercept represents the value of the dependent variable when the independent variable is 0. In simple linear regression, each dependent value is determined by a single independent variable.
When there are intricate relationships in the data, more than one variable may be needed to account for the link. In this situation, a multiple regression analysis is used to try to explain a dependent variable using several independent variables (Multiple Linear Regression (MLR) Definition, Formula, and Example, 2022).
Multiple regression analysis may be applied in two ways. The first is to predict the dependent variable from several independent variables. The second is to determine how strongly each variable is related to the others. For instance, you could examine how a crop yield would change as rainfall or temperature rises (Almahdi, 2021). Multiple regression assumes that the independent variables are not strongly correlated with one another. It also presupposes that each independent variable is correlated with the single dependent variable. Each of these relationships is weighted by assigning a separate regression coefficient to each independent variable, so that more significant independent variables have more influence on the dependent value.
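As a minimal sketch of the idea, a hypothetical crop-yield example like the one above can be fit with ordinary least squares; all numbers below are invented for illustration, and NumPy's `lstsq` is one standard solver:

```python
import numpy as np

# Invented data: crop yield explained by rainfall and temperature.
rainfall = np.array([10.0, 12.0, 8.0, 15.0, 11.0])
temp = np.array([20.0, 22.0, 19.0, 25.0, 21.0])
yield_ = np.array([3.1, 3.6, 2.7, 4.4, 3.3])

# Design matrix with an intercept column: y ≈ b0 + b1*rainfall + b2*temp.
X = np.column_stack([np.ones_like(rainfall), rainfall, temp])
coef, *_ = np.linalg.lstsq(X, yield_, rcond=None)
b0, b1, b2 = coef

# Predict yield for a new, unseen observation.
pred = b0 + b1 * 13.0 + b2 * 23.0
```

Each coefficient weights one independent variable, which is exactly the weighting scheme described above.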
3.3. Polynomial regression
In polynomial regression, a type of regression analysis, an nth-degree polynomial in x describes the connection between the independent variable x and the dependent variable y. Polynomial regression can fit a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y|x). Although it fits a nonlinear model to the data, the regression function E(y|x) is linear in the unknown parameters estimated from the data, making polynomial regression a linear statistical estimation problem. For this reason, polynomial regression is considered a special case of multiple linear regression. Regression analysis aims to model the expected value of a dependent variable in terms of the value of an independent variable (or vector of independent variables).
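Because the model is linear in its coefficients, ordinary least squares fits it directly. A minimal sketch with invented, exactly quadratic data:

```python
import numpy as np

# Invented data following a quadratic trend: y = 1 + 2x - 0.5x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1.0 + 2.0 * x - 0.5 * x**2

# Fitting a degree-2 polynomial is still a linear least-squares problem.
coeffs = np.polyfit(x, y, deg=2)   # coefficients, highest power first
y_hat = np.polyval(coeffs, x)      # fitted values at the sample points
```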
3.4. Logistic regression
Predictive analytics and categorization frequently make use of this kind of statistical model, commonly referred to as a logit model. Based on a given dataset of independent variables, logistic regression calculates the likelihood that an event will occur, such as voting or not voting. Given that the result is a probability, the dependent variable’s range is 0 to 1 (What Is Logistic Regression? | IBM, n.d.). In logistic regression, the odds — that is, the likelihood of success divided by the probability of failure — are transformed using the logit formula.
Unlike linear regression, logistic regression predicts categorical variables rather than continuous ones (Logistic Regression — Logicmojo, n.d.). It is used to evaluate the relationship between a dependent variable and one or more independent variables. A categorical variable may take values such as true/false, yes/no, or 1/0. Whereas linear regression outputs an unbounded continuous value, logistic regression outputs a probability, and the logit function turns its S-shaped curve into a straight line on the log-odds scale.
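The relationship between the probability, the odds, and the logit can be sketched in a few lines; the coefficients here are invented rather than fitted:

```python
import math

def sigmoid(z):
    """Map a linear score (log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds: log of (probability of success / probability of failure)."""
    return math.log(p / (1.0 - p))

# Invented coefficients for a one-feature model: score = b0 + b1 * x.
b0, b1 = -4.0, 0.8
x = 6.0
p = sigmoid(b0 + b1 * x)   # probability that the event occurs
line = logit(p)            # recovers the straight line b0 + b1 * x
```

The logit of the predicted probability is exactly the linear score, which is why the S-curve becomes a straight line on the log-odds scale.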
3.5. Classification
This type of predictive analytics aims to identify commonalities within a dataset and assigns a new piece of data to a certain category depending on its attributes. It requires the classes to be specified in advance, since it assigns future data to those predefined classes. Several methods of classification include:
3.5.1. Decision trees
A decision tree is a decision support tool that uses a tree-like model to represent options and their potential outcomes, including utility, resource costs, and chance event outcomes. It is also one way to display an algorithm that consists only of conditional control statements.
A decision tree can be linearized into decision rules: the contents of the leaf node become the outcome, and the conditions along the path form a conjunction in the if clause. The rules typically take the following form:
if condition1 and condition2 and condition3 then outcome.
By building association rules with the target variable on the right, decision rules may be created. They may also signify causal or temporal relationships.
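A decision rule of this form maps directly onto conditional statements in code. The example below linearizes a hypothetical loan-screening tree; the conditions and outcomes are invented for illustration:

```python
def classify(income, debt_ratio, years_employed):
    """A hypothetical decision tree linearized into decision rules.
    Each rule is a conjunction of conditions ending in a leaf outcome."""
    if income > 50_000 and debt_ratio < 0.4 and years_employed >= 2:
        return "approve"
    if income > 50_000 and debt_ratio < 0.4:
        return "review"
    return "decline"
```

Each `if` line is one root-to-leaf path, and the returned string is the content of that leaf node.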
Decision trees (and influence diagrams) are decision support tools that have various benefits. Decision trees:
- Are straightforward to comprehend and interpret. After a quick explanation, people can grasp decision tree models.
- Are valuable even with little hard data. Significant insights can be obtained from experts' descriptions of a situation, including its alternatives, probabilities, and costs, and their preferences for outcomes.
- Assist in determining the worst, best, and anticipated values for various circumstances.
- Use a white-box model: when the model produces a given result, the conditions that explain it can be traced directly.
- May be used in conjunction with other decision-making processes.
- Allow the decisions and actions of more than one decision-maker to be considered.
Disadvantages of decision trees:
- They are unstable: a slight change in the data can cause a substantial change in the structure of the optimal decision tree.
- They are often relatively inaccurate. Given the same data, many other predictors outperform them. This can be remedied by replacing a single decision tree with a random forest of decision trees, though a random forest is harder to interpret than a single tree.
- For categorical variables with varying numbers of levels, the information gain used in decision trees is biased in favour of attributes with more levels.
- Calculations can become quite complicated, especially if many values are ambiguous and/or if several outcomes are connected.
3.5.2. Random Forests
Random forest, created by Leo Breiman and Adele Cutler, is a widely used machine learning technique that combines the output of several decision trees into a single result. Its adaptability and ease of use drive its popularity, since it can handle both classification and regression problems.
The random forest algorithm extends the bagging technique, using feature randomness in addition to bagging to produce an uncorrelated forest of decision trees. Three key hyperparameters must be specified prior to training: node size, the number of trees, and the number of features sampled at each split (Thorn, 2021). From there, the random forest can be applied to classification or regression problems.
Each decision tree in the ensemble is built from a data sample drawn from the training set with replacement, known as a bootstrap sample. Feature bagging then injects a second layer of randomness, increasing dataset diversity and decreasing correlation among the trees. How the forecast is determined depends on the type of problem: for regression, the individual trees' predictions are averaged; for classification, the predicted class is chosen by majority vote, i.e. the most common categorical prediction. The model can then be evaluated with the out-of-bag (OOB) samples left out of each bootstrap (Brownlee, 2016).
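As a rough sketch of bagging plus feature randomness (not a production random forest), the following trains one-split "stump" trees on bootstrap samples, each restricted to one randomly chosen feature, and predicts by majority vote. The dataset and the mean-threshold split rule are invented for illustration:

```python
import random
from collections import Counter

random.seed(0)

# Invented toy dataset: rows of (feature_0, feature_1) with 0/1 labels.
X = [(1, 5), (2, 6), (3, 7), (8, 1), (9, 2), (10, 3)]
y = [0, 0, 0, 1, 1, 1]

def train_stump(sample_X, sample_y, feature):
    """A one-split 'tree': threshold at the sample mean of one feature."""
    thresh = sum(row[feature] for row in sample_X) / len(sample_X)
    left = [lbl for row, lbl in zip(sample_X, sample_y) if row[feature] <= thresh]
    right = [lbl for row, lbl in zip(sample_X, sample_y) if row[feature] > thresh]
    # Each side predicts its majority label (fall back to the whole sample).
    left_lbl = Counter(left or sample_y).most_common(1)[0][0]
    right_lbl = Counter(right or sample_y).most_common(1)[0][0]
    return feature, thresh, left_lbl, right_lbl

def forest_predict(stumps, row):
    """Classification: majority vote over the individual trees."""
    votes = [l if row[f] <= t else r for f, t, l, r in stumps]
    return Counter(votes).most_common(1)[0][0]

# Bagging: each stump sees a bootstrap sample (drawn with replacement)
# and is restricted to one randomly chosen feature.
stumps = []
for _ in range(25):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    feat = random.randrange(2)
    stumps.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feat))
```

For a regression task, the majority vote would be replaced by averaging the individual trees' numeric predictions.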
3.5.3. Naive Bayes
Naive Bayes is a probabilistic method for building classifiers. The naive Bayes classifier makes the defining assumption that, given the class variable, the value of one feature is independent of the value of any other feature.
Despite this simplifying assumption, naive Bayes classifiers perform well in challenging real-world scenarios. Naive Bayes has the benefit of supporting incremental training and of requiring only a modest quantity of training data to estimate the classification parameters.
Naive Bayes is a conditional probability model that assigns probabilities for each of K potential outcomes or classes to a problem instance that must be categorized. The problem instance is represented by a vector x = (x1, …, xn) of n features (independent variables).
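The conditional probability model can be sketched directly: under the naive independence assumption, the posterior for each class is proportional to the class prior times the product of per-feature likelihoods. The priors and word likelihoods below are invented for illustration:

```python
import math

# Invented two-class text example: class priors P(C) and
# per-word likelihoods P(word | C), assumed estimated beforehand.
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"win": 0.30, "money": 0.25, "meeting": 0.05},
    "ham":  {"win": 0.02, "money": 0.05, "meeting": 0.30},
}

def posterior(words):
    """Unnormalized log-posterior per class:
    log P(C) + sum_i log P(x_i | C), using logs for numerical stability."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in words:
            score += math.log(likelihood[c][w])
        scores[c] = score
    return scores

scores = posterior(["win", "money"])
best = max(scores, key=scores.get)   # the most probable class
```

Normalizing the scores over the K classes would recover the actual posterior probabilities; for classification only the argmax is needed.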
3.6. Clustering
Clustering is the process of organizing data into "clusters", or collections of related data, based on similarity. Clustering isolates the most important structure of a dataset: it charts the connections between the data, which may then be used to forecast the state of incoming data. K-means clustering is probably the best-known type, though alternatives exist. Rather than using predefined classes, clustering lets the data determine the clusters and, consequently, the defining attributes of each class. This is very beneficial when little is known about the data beforehand. Analysts commonly use cluster models for consumer segmentation, and these segments can then inform targeted marketing plans.
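A minimal sketch of the K-means iteration (assign each point to its nearest centroid, then move each centroid to its cluster's mean), on invented one-dimensional data:

```python
import random

random.seed(1)

# Invented 1-D data with two obvious groups (e.g. monthly customer spend).
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
k = 2

# Initialize centroids from the data, then alternate assignment and update.
centroids = random.sample(points, k)
for _ in range(10):
    clusters = [[] for _ in range(k)]
    for p in points:
        # Assignment step: nearest centroid by absolute distance.
        nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: each centroid moves to its cluster's mean
    # (an empty cluster keeps its previous centroid).
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]
```

On this data the centroids settle on the two group means, i.e. the data itself defines the segments without any predefined classes.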
4. References
Simplilearn. (2022, November 23). What is Text Mining in Data Mining? Simplilearn.com. https://www.simplilearn.com/what-is-text-mining-in-data-mining-article
Bruce, P. (n.d.). Practical Statistics for Data Scientists. O’Reilly Online Learning. https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/ch04.html
Multiple Linear Regression (MLR) Definition, Formula, and Example. (2022, June 24). Investopedia. https://www.investopedia.com/terms/m/mlr.asp
Almahdi, H. (2021, December 13). Predicting Crops Yield: Machine Learning Nanodegree Capstone Project. Medium. https://towardsdatascience.com/predicting-crops-yield-machine-learning-nanodegree-capstone-project-e6ec9349f69
What is Logistic regression? | IBM. (n.d.). https://www.ibm.com/topics/logistic-regression
Logistic Regression — Logicmojo. (n.d.). https://logicmojo.com/logistic-regression-machine-learning
Thorn, J. (2021, December 13). Decision Trees Explained — Towards Data Science. Medium. https://towardsdatascience.com/decision-trees-explained-3ec41632ceb6
Brownlee, J. (2016, April 22). Bagging and Random Forest Ensemble Algorithms for Machine Learning. Machine Learning Mastery. https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/