Machine Learning Overview

In 1959, Arthur Samuel defined machine learning (https://en.wikipedia.org/wiki/Machine_learning) as a "Field of study that gives computers the ability to learn without being explicitly programmed" Often hand programming not possible Solution? Get the computer to program itself, by showing it examples of the behavior we want! This is the learning approach of AI Really, we write the structure of the program and the computer tunes many internal parameters

• Recall that machine learning is the process of turning data into a model. Then with that model, you can perform inference on it to make predictions.

Course plan
Reflex Search problems

Markov decision processes
Adversarial games

Constraint satisfaction problems
Bayesian networks

Variables Logic
"Low-level intelligence" "High-level intelligence" Machine learning CS221 • While machine learning can be applied to any type of model, we will focus our attention on reflex-based models, which include models such as linear classifiers and neural networks. • In reflex-based models, inference (prediction) involves a fixed set of fast, feedforward operations.

Reflex-based models
• Abstractly, a reflex-based model (which we will call a predictor f ) takes some input x and produces some output y.
• (In statistics, y is known as the response, and when x is a real vector, it is known as covariates or sometimes predictors, which is an unfortunate naming clash.) • The input can usually be arbitrary (an image or sentence), but the form of the output y is generally restricted, and what it is determines the type of prediction task.
Binary classification Fraud detection: credit card transaction → fraud or no fraud Toxic comments: online comment → toxic or not toxic Higgs boson: measurements of event → decay event or background CS221 • One common prediction task is binary classification, where the output y, typically expressed as positive (+1) or negative (-1).
• In the context of classification tasks, f is called a classifier and y is called a label (sometimes class, category, or tag).
• Here are some practical applications.
• One application is fraud detection: given information about a credit card transaction, predict whether it is a fradulent transaction or not, so that the transaction can be blocked. • Another application is moderating online discussion forums: given an online comment, predict whether it is toxic (and therefore should get flagged or taken down) or not. • A final application comes from physics: After the discovery of the Higgs boson, scientists were interested in how it decays. The Large Hadron Collider at CERN smashes protons against each other and then detects the ensuing events. The goal is to predict whether each event is a Higgs boson decaying (into two tau particles) or just background noise.
• Each of these applications has an associated Kaggle dataset. You can click on the pictures to find out more details.
• As an aside, multiclass classification is a generalization of binary classification where the output y could be one of K possible values. For example, in digit classification, K = 10. • The second major type of prediction task we'll cover is regression. Here, the output y is a real number (often called the response or target).
• One application is poverty mapping: given a satellite image, predict the average asset wealth index of the homes in that area. This is used to measure poverty across the world and determine which areas are in greatest need of aid.
• Another application: given information about a house (e.g., location, number of bedrooms), predict its price.
• A third application is to predict the arrival time of some service, which could be package deliveries, flights, or rideshares.
• The key distinction between classification and regression is that classification has discrete outputs (e.g., "yes" or "no" for binary classification), whereas regression has continuous outputs.
Structured prediction x f y is a complex object • The final type of prediction task we will consider is structured prediction, which is a bit of a catch all.
• In structured prediction, the output y is a complex object, which could be a sentence or an image. So the space of possible outputs is huge.
• One application is machine translation: given an input sentence in one language, predict its translation into another language.
• Dialogue can be cast as structured prediction: given the past conversational history between a user and an agent (in the case of virtual assistants), predict the next utterance (what the agent should say).
• In image captioning, say for visual assistive technologies: given an image, predict a sentence describing what is in that image.
• In image segmentation, which is needed to localize objects for autonomous driving: given an image of a scene, predict the segmentation of that image into regions corresponding to objects in the world. • Generating an image or a sentence can seem daunting, but there's a secret here. A structured prediction task can often be broken up into a sequence of multiclass classification tasks. For example, to predict an entire sentence, predict one word at a time, going left to right. This is a very powerful reduction! • Aside: one challenge with this approach is that the errors might cascade: if you start making errors, then you might go off the rails and start making even more errors. • Here are the rest of the modules under the machine learning unit. • We will start by talking about regression and binary classification, the two most fundamental tasks in machine learning. Specifically, we study the simplest setting: linear regression and linear classification, where we have linear models trained by gradient descent.

Roadmap
• Next, we will introduce stochastic gradient descent, and show that it can be much faster than vanilla gradient descent.
• Then we will push the limits of linear models by showing how you can define non-linear features, which effectively gives us non-linear predictors using the machinery of linear models! Feature templates provide us with a framework for organizing the set of features. • Then we introduce neural networks, which also provide non-linear predictors, but allow these non-linearities to be learned automatically from data. We follow up immediately with backpropagation, an algorithm that allows us to automatically compute gradients needed for training without having to take gradients manually. • We then briefly discuss the extension of neural networks to differentiable programming, which allows us to easily build up many of the existing state-of-the-art deep learning models in NLP and computer vision like lego blocks. • So far we have focused on supervised learning. We take a brief detour and discuss K-means, which is a simple unsupervised learning algorithm for clustering data points. • We end on a more reflective note: Generalization is about answering the question: when does a model trained on set of training examples actually generalize to new test inputs? This is where model complexity comes up. Finally, we discuss best practices for doing machine learning in practice.