The purpose of this article is to bring together the seven most common types of classification algorithms, along with Python code for each: Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, K-Nearest Neighbours, Decision Tree, Random Forest, and Support Vector Machine.
1 Introduction

1.1 Structured Data Classification

Classification can be performed on structured or unstructured data. Classification is a technique in which we categorize data into a given number of classes. The main goal of a classification problem is to identify the category/class to which a new observation belongs. A few of the terminologies encountered in machine learning classification:

- Classifier: an algorithm that maps input data to a specific category.
- Classification model: a model that tries to draw conclusions from the input values given for training and predicts the class labels for new data.
- Feature: an individual measurable property of a phenomenon being observed.
- Binary classification: a classification task with two possible outcomes.
- Multi-class classification: a classification task with more than two classes, where each sample is assigned to exactly one class.
- Multi-label classification: a classification task where each sample can be assigned to multiple target labels.
The following are the steps involved in building a classification model:

1. Initialize the classifier to be used.
2. Train the classifier: fit the model on the training data and its labels.
3. Predict the target: given an unlabelled observation, return the predicted class label.
4. Evaluate the classifier model.
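These four steps look the same for every scikit-learn classifier. A minimal sketch, using the library's built-in iris dataset as a stand-in (the article's salary dataset is not bundled here) and logistic regression as the classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in dataset; any feature matrix X and label vector y works the same way.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)   # 1. initialize the classifier
clf.fit(X_train, y_train)                 # 2. train on the training data
y_pred = clf.predict(X_test)              # 3. predict the target for new data
print(accuracy_score(y_test, y_pred))     # 4. evaluate the model
```

Swapping in any of the seven classifiers discussed below only changes the line that initializes `clf`; the fit/predict/evaluate calls stay identical.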
1.2 Dataset Source and Contents

The dataset contains salaries. The following is a description of our dataset:
This data was extracted from the census bureau database found at: http://www.census.gov/ftp/pub/DES/www/welcome.html

1.3 Exploratory Data Analysis

2 Types of Classification Algorithms (Python)

2.1 Logistic Regression

Definition: Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

Advantages: Logistic regression is designed for this purpose (classification) and is most useful for understanding the influence of several independent variables on a single outcome variable.

Disadvantages: Works only when the predicted variable is binary, assumes all predictors are independent of each other, and assumes the data is free of missing values.

2.2 Naïve Bayes

Definition: The Naïve Bayes algorithm is based on Bayes' theorem, with the assumption of independence between every pair of features. Naïve Bayes classifiers work well in many real-world situations, such as document classification and spam filtering.

Advantages: This algorithm requires a small amount of training data to estimate the necessary parameters, and Naïve Bayes classifiers are extremely fast compared to more sophisticated methods.

Disadvantages: Naïve Bayes is known to be a bad estimator.

2.3 Stochastic Gradient Descent

Definition: Stochastic gradient descent is a simple and very efficient approach to fitting linear models. It is particularly useful when the number of samples is very large. It supports different loss functions and penalties for classification.

Advantages: Efficiency and ease of implementation.

Disadvantages: Requires a number of hyper-parameters and is sensitive to feature scaling.

2.4 K-Nearest Neighbours

Definition: Neighbours-based classification is a type of lazy learning, as it does not attempt to construct a general internal model but simply stores instances of the training data. Classification is computed from a simple majority vote of the k nearest neighbours of each point.

Advantages: This algorithm is simple to implement, robust to noisy training data, and effective when the training data is large.

Disadvantages: The value of K must be determined, and the computation cost is high because the distance from each new instance to all training samples must be computed.

2.5 Decision Tree

Definition: Given data of attributes together with its classes, a decision tree produces a sequence of rules that can be used to classify the data.

Advantages: A decision tree is simple to understand and visualise, requires little data preparation, and can handle both numerical and categorical data.

Disadvantages: Decision trees can grow complex trees that do not generalise well, and they can be unstable because small variations in the data might result in a completely different tree being generated.

2.6 Random Forest

Definition: A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve the predictive accuracy of the model and control over-fitting. The sub-sample size is always the same as the original input sample size, but the samples are drawn with replacement.

Advantages: Reduced over-fitting, and the random forest classifier is more accurate than a single decision tree in most cases.

Disadvantages: Slow real-time prediction, and a complex algorithm that is difficult to implement.

2.7 Support Vector Machine

Definition: A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

Advantages: Effective in high-dimensional spaces, and memory efficient because it uses a subset of training points in the decision function.
Disadvantages: The algorithm does not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

3 Conclusion

3.1 Comparison Matrix
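A comparison of this kind can be reproduced with a short script. The sketch below is an assumption about the setup, not the article's exact code: it uses scikit-learn's built-in breast-cancer dataset in place of the census salary data (which is not bundled here) and reports test-set accuracy for all seven classifiers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Stand-in dataset for the article's census salary data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# SGD and SVM are sensitive to feature scaling, so scale all inputs.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Stochastic Gradient Descent": SGDClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machine": SVC(kernel="linear"),
}

# Fit each classifier and print its accuracy on the held-out test set.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: {clf.score(X_test, y_test):.3f}")
```

The exact accuracies depend on the dataset and the train/test split; the value of such a matrix is the relative ranking, not the absolute numbers.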
Code location: https://github.com/f2005636/Classification