Machine learning with Spark and Python : essential techniques for predictive analytics
Michael Bowles
- 2nd ed.
- Indianapolis, IN : John Wiley and Sons, c2020.
- xxvii, 340 p. : ill. ; 25 cm.
• Machine generated contents note:
ch. 1 The Two Essential Algorithms for Making Predictions • Why Are These Two Algorithms So Useful? • What Are Penalized Regression Methods? • What Are Ensemble Methods? • How to Decide Which Algorithm to Use • The Process Steps for Building a Predictive Model • Framing a Machine Learning Problem • Feature Extraction and Feature Engineering • Determining Performance of a Trained Model • Chapter Contents and Dependencies • Summary
ch. 2 Understand the Problem by Understanding the Data • The Anatomy of a New Problem • Different Types of Attributes and Labels Drive Modeling Choices • Things to Notice about Your New Data Set • Classification Problems: Detecting Unexploded Mines Using Sonar • Physical Characteristics of the Rocks Versus Mines Data Set • Statistical Summaries of the Rocks Versus Mines Data Set • Visualization of Outliers Using a Quantile-Quantile Plot • Statistical Characterization of Categorical Attributes • How to Use Python Pandas to Summarize the Rocks Versus Mines Data Set • Visualizing Properties of the Rocks Versus Mines Data Set • Visualizing with Parallel Coordinates Plots • Visualizing Interrelationships between Attributes and Labels • Visualizing Attribute and Label Correlations Using a Heat Map • Summarizing the Process for Understanding the Rocks Versus Mines Data Set • Real-Valued Predictions with Factor Variables: How Old Is Your Abalone? • Parallel Coordinates for Regression Problems: Visualize Variable Relationships for the Abalone Problem • How to Use a Correlation Heat Map for Regression: Visualize Pair-Wise Correlations for the Abalone Problem • Real-Valued Predictions Using Real-Valued Attributes: Calculate How Your Wine Tastes • Multiclass Classification Problem: What Type of Glass Is That? • Using PySpark to Understand Large Data Sets
ch. 3 Predictive Model Building: Balancing Performance, Complexity, and Big Data • The Basic Problem: Understanding Function Approximation • Working with Training Data • Assessing Performance of Predictive Models • Factors Driving Algorithm Choices and Performance: Complexity and Data • Contrast between a Simple Problem and a Complex Problem • Contrast between a Simple Model and a Complex Model • Factors Driving Predictive Algorithm Performance • Choosing an Algorithm: Linear or Nonlinear? • Measuring the Performance of Predictive Models • Performance Measures for Different Types of Problems • Simulating Performance of Deployed Models • Achieving Harmony between Model and Data • Choosing a Model to Balance Problem Complexity, Model Complexity, and Data Set Size • Using Forward Stepwise Regression to Control Overfitting • Evaluating and Understanding Your Predictive Model • Control Overfitting by Penalizing Regression Coefficients: Ridge Regression • Using PySpark for Training Penalized Regression Models on Extremely Large Data Sets
ch. 4 Penalized Linear Regression • Why Penalized Linear Regression Methods Are So Useful • Extremely Fast Coefficient Estimation • Variable Importance Information • Extremely Fast Evaluation When Deployed • Reliable Performance • Sparse Solutions • Problem May Require Linear Model • When to Use Ensemble Methods • Penalized Linear Regression: Regulating Linear Regression for Optimum Performance • Training Linear Models: Minimizing Errors and More • Adding a Coefficient Penalty to the OLS Formulation • Other Useful Coefficient Penalties: Manhattan and ElasticNet • Why Lasso Penalty Leads to Sparse Coefficient Vectors • ElasticNet Penalty Includes Both Lasso and Ridge • Solving the Penalized Linear Regression Problem • Understanding Least Angle Regression and Its Relationship to Forward Stepwise Regression • How LARS Generates Hundreds of Models of Varying Complexity • Choosing the Best Model from the Hundreds LARS Generates • Using Glmnet: Very Fast and Very General • Comparison of the Mechanics of Glmnet and LARS Algorithms • Initializing and Iterating the Glmnet Algorithm • Extension of Linear Regression to Classification Problems • Solving Classification Problems with Penalized Regression • Working with Classification Problems Having More Than Two Outcomes • Understanding Basis Expansion: Using Linear Methods on Nonlinear Problems • Incorporating Non-Numeric Attributes into Linear Methods
ch. 5 Building Predictive Models Using Penalized Linear Methods • Python Packages for Penalized Linear Regression • Multivariable Regression: Predicting Wine Taste • Building and Testing a Model to Predict Wine Taste • Training on the Whole Data Set before Deployment • Basis Expansion: Improving Performance by Creating New Variables from Old Ones • Binary Classification: Using Penalized Linear Regression to Detect Unexploded Mines • Build a Rocks Versus Mines Classifier for Deployment • Multiclass Classification: Classifying Crime Scene Glass Samples • Linear Regression and Classification Using PySpark • Using PySpark to Predict Wine Taste • Logistic Regression with PySpark: Rocks Versus Mines • Incorporating Categorical Variables in a PySpark Model: Predicting Abalone Rings • Multiclass Logistic Regression with Meta Parameter Optimization
ch. 6 Ensemble Methods • Binary Decision Trees • How a Binary Decision Tree Generates Predictions • How to Train a Binary Decision Tree • Tree Training Equals Split Point Selection • How Split Point Selection Affects Predictions • Algorithm for Selecting Split Points • Multivariable Tree Training: Which Attribute to Split? • Recursive Splitting for More Tree Depth • Overfitting Binary Trees • Measuring Overfit with Binary Trees • Balancing Binary Tree Complexity for Best Performance • Modifications for Classification and Categorical Features • Bootstrap Aggregation: "Bagging" • How Does the Bagging Algorithm Work? • Bagging Performance: Bias Versus Variance • How Bagging Behaves on Multivariable Problem • Bagging Needs Tree Depth for Performance • Summary of Bagging • Gradient Boosting • Basic Principle of Gradient Boosting Algorithm • Parameter Settings for Gradient Boosting • How Gradient Boosting Iterates toward a Predictive Model • Getting the Best Performance from Gradient Boosting • Gradient Boosting on a Multivariable Problem • Summary for Gradient Boosting • Random Forests • Random Forests: Bagging Plus Random Attribute Subsets • Random Forests Performance Drivers • Random Forests Summary
ch. 7 Building Ensemble Models with Python • Solving Regression Problems with Python Ensemble Packages • Using Gradient Boosting to Predict Wine Taste • Using the Class Constructor for GradientBoostingRegressor • Using GradientBoostingRegressor to Implement a Regression Model • Assessing the Performance of a Gradient Boosting Model • Building a Random Forest Model to Predict Wine Taste • Constructing a RandomForestRegressor Object • Modeling Wine Taste with RandomForestRegressor • Visualizing the Performance of a Random Forest Regression Model • Incorporating Non-Numeric Attributes in Python Ensemble Models • Coding the Sex of Abalone for Gradient Boosting Regression in Python • Assessing Performance and the Importance of Coded Variables with Gradient Boosting • Coding the Sex of Abalone for Input to Random Forest Regression in Python • Assessing Performance and the Importance of Coded Variables • Solving Binary Classification Problems with Python Ensemble Methods • Detecting Unexploded Mines with Python Gradient Boosting • Determining the Performance of a Gradient Boosting Classifier • Detecting Unexploded Mines with Python Random Forest • Constructing a Random Forest Model to Detect Unexploded Mines • Determining the Performance of a Random Forest Classifier • Solving Multiclass Classification Problems with Python Ensemble Methods • Dealing with Class Imbalances • Classifying Glass Using Gradient Boosting • Determining the Performance of the Gradient Boosting Model on Glass Classification • Classifying Glass with Random Forests • Determining the Performance of the Random Forest Model on Glass Classification • Solving Regression Problems with PySpark Ensemble Packages • Predicting Wine Taste with PySpark Ensemble Methods • Predicting Abalone Age with PySpark Ensemble Methods • Distinguishing Mines from Rocks with PySpark Ensemble Methods • Identifying Glass Types with PySpark Ensemble Methods • Summary.
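The chapter 6 entries above name the basic principle of the gradient boosting algorithm: start from a constant prediction and repeatedly fit a small tree to the current residuals. A minimal self-contained sketch of that idea, using depth-1 "stumps" in plain Python for squared-error loss (illustrative only; the book itself works with scikit-learn and PySpark implementations, and none of this is the author's code):

```python
# Gradient boosting sketch: weak learners are depth-1 decision stumps,
# the loss is squared error, so each round's targets are the residuals.

def fit_stump(X, targets):
    """Exhaustively search one (feature, threshold) split minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for a, b in zip(values, values[1:]):
            thr = (a + b) / 2.0          # midpoint guarantees both sides nonempty
            left = [t for row, t in zip(X, targets) if row[j] <= thr]
            right = [t for row, t in zip(X, targets) if row[j] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            err = (sum((t - lmean) ** 2 for t in left)
                   + sum((t - rmean) ** 2 for t in right))
            if best is None or err < best[0]:
                best = (err, j, thr, lmean, rmean)
    return best[1:]                      # (feature, threshold, left value, right value)

def predict_stump(stump, row):
    j, thr, lval, rval = stump
    return lval if row[j] <= thr else rval

def gradient_boost(X, y, n_rounds=100, lr=0.1):
    """Each round fits a stump to the residuals (the negative gradient of squared error)."""
    base = sum(y) / len(y)               # start from the mean prediction
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [pi + lr * predict_stump(stump, row) for pi, row in zip(pred, X)]
    return (base, lr, stumps)

def predict(model, row):
    base, lr, stumps = model
    return base + sum(lr * predict_stump(s, row) for s in stumps)

# Demo: learn y = 2*x from ten one-feature points.
X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
model = gradient_boost(X, y, n_rounds=100, lr=0.2)
```

The `lr` and `n_rounds` arguments play the role of the "Parameter Settings for Gradient Boosting" entry above: a smaller step size with more rounds trades training time for smoother convergence.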