A bit about myself...
![bloomberg](images/bloomberg-logo.svg)
Infrastructure Engineer
Hackathons
Organiser / Mentor / Hacker
![bt](images/bt.svg)
![ibm](images/ibm.svg)
![helper.io](images/helper.svg)
Applied machine learning in:
* network security
* enterprise software management
* employment
## What is Machine Learning?
A mechanism for machines to **learn** behaviour without being explicitly
programmed
Programs that can **adapt** when exposed to new data
Based on **pattern recognition**
![education](images/application1.jpg)
![security](images/application2.jpg)
![robotics](images/application3.jpg)
![finance](images/application4.jpg)
![speech recognition](images/application5.jpg)
![advertising](images/application6.jpg)
### Huge Growth
### Why?
* cheaper computing power
* cheaper data storage
* more data than ever: everyone is online
* produces:
- greater volume
- greater variety
* because it's *cool*
### The Difference
![venn diagram](images/venn-diagram.svg)
## Supervised Learning
Use labelled historical data to predict future outcomes
Given some input data, predict the correct output
![shapes](images/shapes.svg)
What **features** of the input tell us about the output?
Feature Space
A feature is some property that describes raw input data
An input can be represented as a vector in
feature space
2 features = 2D vector = 2D space
### Why Use Feature Space?
![feature-extractor](images/feature-extractor.svg)
* Could simply use raw binary data as input
* Raw inputs are complex and noisy
* Abstract the complexity away by using features (see the sketch below)
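A toy sketch of the idea (hypothetical input and feature choices):

```python
import numpy as np

def to_feature_vector(rawImage):
    """Hypothetical extractor: raw pixels in, fixed-length vector out."""
    height, width = rawImage.shape[:2]
    # Two features = a 2D vector = one point in 2D feature space.
    return np.array([width / height, rawImage.mean()])
```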
Training data is used to produce a model
f(x̄) = mx̄ + c
Model divides feature space into segments
Each segment corresponds to one output class
Use trained model to classify new, unseen inputs
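A minimal sketch with scikit-learn (made-up shape features and labels):

```python
from sklearn import svm

# Hypothetical 2D feature vectors (aspect ratio, size) and their labels.
trainingInputs = [[1.0, 4.0], [4.0, 2.0], [1.05, 3.6], [6.0, 1.5]]
trainingLabels = ['square', 'rectangle', 'square', 'rectangle']

# Training learns the line that divides feature space into segments.
model = svm.SVC(kernel='linear')
model.fit(trainingInputs, trainingLabels)

# Classify a new, unseen input by which segment it falls into.
print(model.predict([[1.1, 3.9]]))  # -> ['square']
```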
Choosing a Suitable Model
### Evaluation
* Many ways to evaluate
- overall accuracy, number of false positives, etc.
* Depends on what problem you're trying to solve
- are false positives acceptable?
* `k`-fold cross validation commonly used
![k fold cross validation](images/k-fold-cross-validation.svg)
* measuring accuracy only on the training dataset is misleading
* classifier should generalise to unseen data points
* split training data into two sets
- one for training
- one for testing
* perform this process *k* times using different splits (sketched below)
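The splitting logic itself is simple; a from-scratch sketch (real
implementations usually shuffle the data first):

```python
import numpy as np

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = np.arange(n_samples)
    for fold in np.array_split(indices, k):
        train = np.setdiff1d(indices, fold)
        yield train, fold

# e.g. 40 images, 4 folds: train on 30, test on the held-out 10 each time.
for train, test in k_fold_splits(40, 4):
    print(len(train), len(test))  # 30 10
```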
### Discrete or Continuous?
* Classification
- output is from a finite, discrete set
- of *'classes'*
* Regression
- output is a real number from a continuous range
## Process
[Jason Brownlee's Process](http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/):
1. define the problem
2. prepare data
3. spot check algorithms
4. tuning
5. present results
### 1. Define the Problem
Models are useless if you're solving the wrong problem
* what exactly do you want to do?
* primary requirement: speed? correctness?
* where does the data come from?
* how will users/systems be affected by your model?
### Faces vs Vegetables
![face-veg-img](demo/data/face1.jpg)
![face-veg-img](demo/data/vegetable1.jpg)
![face-veg-img](demo/data/face5.jpg)
![face-veg-img](demo/data/vegetable10.jpg)
![face-veg-img](demo/data/vegetable19.jpg)
![face-veg-img](demo/data/vegetable7.jpg)
| | |
| ----------- | ------------------ |
| **Input:** | image (rgb pixels) |
| **Output:** | face or vegetable |
### 2. Prepare the Data
Garbage in, garbage out
* What's the source of the data?
- collected by your system(s)?
- provided by third-parties?
* Which features to extract?
- more features **≠** better accuracy
- how long does it take to extract features?
### Data Source
* 40 images
- 20 faces
- 20 vegetables
* manually labelled by a friend
- who we trust
### Features to Extract from Images
* average intensity of each colour channel across all pixels
- red, green, blue
* average saturation across all pixels
- bright and strong colours
![sobel edge detection](images/sobel-edge-detection.png)
* how about the complexity of the images?
* more complicated shapes have more edge pixels
* feature:
- proportion of pixels that are 'edge pixels'
* **sobel edge detection** used to find edge pixels
1. mean red colour
2. mean green colour
3. mean blue colour
4. mean saturation
5. edge pixel ratio
- \# edge pixels / \# total pixels
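A sketch of an extractor for these five features using scikit-image and
NumPy (assumes RGB images; the edge threshold is an arbitrary illustrative
choice):

```python
import numpy as np
from skimage import color, filters, io

def extract_features(path):
    """Build the 5-element feature vector for one image."""
    rgb = io.imread(path)                  # H x W x 3 array of RGB pixels
    # Features 1-3: mean intensity of each colour channel.
    mean_rgb = rgb.reshape(-1, 3).mean(axis=0)
    # Feature 4: mean saturation across all pixels.
    mean_saturation = color.rgb2hsv(rgb)[:, :, 1].mean()
    # Feature 5: edge pixel ratio via Sobel edge detection.
    edges = filters.sobel(color.rgb2gray(rgb))
    edge_ratio = (edges > 0.1).mean()      # arbitrary edge threshold
    return np.concatenate([mean_rgb, [mean_saturation, edge_ratio]])
```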
### 3. Spot Check Algorithms
![](images/machinelearningalgorithms.png)
![recursive data space](images/data-space2.png)
![decision tree](images/decision-tree.png)
![arbitrary data space](images/data-space3.png)
![neural network](images/neural-network.png)
Which algorithm to use?
* Humans can be biased
* Take the decision out of our hands entirely
* **Automate** the selection of algorithms
Spot check every algorithm you can!
* Run your dataset(s) across dozens of algorithms
* **10-fold cross-validation**
- measure accuracy, false positives and false negatives
* compare results of each algorithm using statistical tests
- say *"algorithm A is better than B"* with confidence
### Let's try it out!
![python](images/python.svg)
![sklearn](images/sklearn.svg)
* Python
* scikit-learn
* Could also use:
  - Pylearn2, MILK, Theano, ...
```python
from sklearn import svm
from sklearn import tree
from sklearn import naive_bayes
classifiers = {
    'SVM':
        svm.SVC(kernel='linear', C=1),
    'Decision Tree':
        tree.DecisionTreeClassifier(criterion='gini', splitter='best'),
    'Gaussian Naive Bayes':
        naive_bayes.GaussianNB(),
    ...
}
```
```python
from sklearn import preprocessing
from sklearn import cross_validation

# Normalise each feature vector to unit length.
featureVectors = preprocessing.normalize(featureVectors)

# Evaluate every candidate classifier with 10-fold cross-validation.
for name, classifier in classifiers.items():
    scores = cross_validation.cross_val_score(
        classifier, featureVectors, labels, cv=10)
```
| Classifier | Mean Acc. | Confidence Interval | Lower Bound Acc |
| ------------------------ | --------- | ------------------- | --------------- |
| Gaussian Naive Bayes | 0.975 | 0.15 | 0.825 |
| Decision Tree | 0.95 | 0.2 | 0.75 |
| Multinomial Naive Bayes  | 0.85      | 0.331662            | 0.518338        |
| SVM | 0.775 | 0.415331 | 0.359669 |
| Neural Network (Sigmoid) | 0.525 | 0.269258 | 0.255742 |
| Bernoulli Naive Bayes | 0.5 | 0 | 0.5 |
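(Lower bound accuracy = mean accuracy - confidence interval, e.g.
0.975 - 0.15 = 0.825 for Gaussian Naive Bayes.)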
Decision tree had good performance
![face veg decision tree](images/face-veg-decision-tree.svg)
### 4. Tuning
* Pick top `x` algorithms from previous step
* Smaller set of algorithms to manually investigate
* Greater confidence that the chosen algorithms are naturally good at
  picking out the structure of the dataset / feature space
#### Squeeze out Remaining Performance
1. algorithm tuning
- tune each algorithm for better accuracy
- search hyperparameter space
2. ensembles
- combine multiple 'okay' models into one, better model
3. feature refinement
#### Hyperparameter Optimisation
* Learn model **parameters**
* Must set model's **hyperparameters** before training
* Tune performance by searching hyperparameter space
![hyperparameter_optimisation](images/hyperparameter_optimisation.jpg)
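A minimal sketch of a grid search with scikit-learn (grid values are
illustrative; `sklearn.grid_search` was the module of that era, now
`sklearn.model_selection`):

```python
from sklearn import grid_search, svm

# Hyperparameter grid to search over (illustrative values).
paramGrid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1],
}

# Exhaustively evaluate every combination with 10-fold cross-validation.
search = grid_search.GridSearchCV(svm.SVC(), paramGrid, cv=10)
search.fit(featureVectors, labels)
print(search.best_params_, search.best_score_)
```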
[**auto_sklearn**](https://github.com/automl/auto-sklearn)
Library on top of `sklearn`
Automates the spot check and tuning stages
Automatically builds an ensemble of models
Learns the best model types and hyperparameters for you
It's as easy as:
```python
import autosklearn.classification

# Train classifier for up to two minutes.
# Longer training time, better accuracy (in general).
autoClassifier = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120)
autoClassifier.fit(trainingFeatures, trainingLabels)

# Output accuracy of classifier when run against test dataset.
print(autoClassifier.score(testFeatures, testLabels))
print(autoClassifier.show_models())
```
```
Score: 0.987452948557
[(0.940000, SimpleClassificationPipeline(configuration={
'balancing:strategy': 'none',
'classifier:__choice__': 'adaboost',
'classifier:adaboost:algorithm': 'SAMME',
'classifier:adaboost:learning_rate': 1.2306208006800998,
'classifier:adaboost:max_depth': 6,
'classifier:adaboost:n_estimators': 499,
'imputation:strategy': 'median',
'one_hot_encoding:use_minimum_fraction': 'False',
'preprocessor:__choice__': 'extra_trees_preproc_for_classification',
'preprocessor:extra_trees_preproc_for_classification:bootstrap': 'True',
'preprocessor:extra_trees_preproc_for_classification:criterion': 'entropy',
'preprocessor:extra_trees_preproc_for_classification:max_depth': 'None',
'preprocessor:extra_trees_preproc_for_classification:max_features': 3.5347851525007146,
'preprocessor:extra_trees_preproc_for_classification:min_samples_leaf': 6,
'preprocessor:extra_trees_preproc_for_classification:min_samples_split': 8,
'preprocessor:extra_trees_preproc_for_classification:min_weight_fraction_leaf': 0.0,
'preprocessor:extra_trees_preproc_for_classification:n_estimators': 100,
'rescaling:__choice__': 'standardize'})),
(0.020000, SimpleClassificationPipeline(configuration={
'balancing:strategy': 'none',
'classifier:__choice__': 'adaboost',
'classifier:adaboost:algorithm': 'SAMME',
'classifier:adaboost:learning_rate': 1.0081104516473922,
'classifier:adaboost:max_depth': 6,
'classifier:adaboost:n_estimators': 468,
'imputation:strategy': 'mean',
'one_hot_encoding:use_minimum_fraction': 'False',
'preprocessor:__choice__': 'liblinear_svc_preprocessor',
'preprocessor:liblinear_svc_preprocessor:C': 1.1828431725901418,
'preprocessor:liblinear_svc_preprocessor:dual': 'False',
'preprocessor:liblinear_svc_preprocessor:fit_intercept': 'True',
'preprocessor:liblinear_svc_preprocessor:intercept_scaling': 1,
'preprocessor:liblinear_svc_preprocessor:loss': 'squared_hinge',
'preprocessor:liblinear_svc_preprocessor:multi_class': 'ovr',
'preprocessor:liblinear_svc_preprocessor:penalty': 'l1',
'preprocessor:liblinear_svc_preprocessor:tol': 0.0022792606924326923,
'rescaling:__choice__': 'min/max'})),
(0.020000, SimpleClassificationPipeline(configuration={
'balancing:strategy': 'weighting',
'classifier:__choice__': 'proj_logit',
'classifier:proj_logit:max_epochs': 11,
'imputation:strategy': 'median',
'one_hot_encoding:minimum_fraction': 0.002883367159521145,
'one_hot_encoding:use_minimum_fraction': 'True',
'preprocessor:__choice__': 'gem',
'preprocessor:gem:N': 16,
'preprocessor:gem:precond': 0.30628439346357783,
'rescaling:__choice__': 'standardize'})),
(0.020000, SimpleClassificationPipeline(configuration={
'balancing:strategy': 'weighting',
'classifier:__choice__': 'passive_aggressive',
'classifier:passive_aggressive:C': 0.0036975653885940544,
'classifier:passive_aggressive:fit_intercept': 'True',
'classifier:passive_aggressive:loss': 'hinge',
'classifier:passive_aggressive:n_iter': 326,
'imputation:strategy': 'mean',
'one_hot_encoding:use_minimum_fraction': 'False',
'preprocessor:__choice__': 'kitchen_sinks',
'preprocessor:kitchen_sinks:gamma': 0.6227804363658538,
'preprocessor:kitchen_sinks:n_components': 1821,
'rescaling:__choice__': 'normalize'})),
]
```
### 5. Present Results
* Produce a document that explains:
- problem
- solution (final algorithm/features/datasets used)
- accuracy / speed of solution
- limitations of solution
* Be sure to list any other insights discovered along the way
![process](images/process.svg)
### Automation is Key
* Create a test harness that:
- extracts features
- trains models using many algorithms
- evaluates models in a rigorous way
* Ensure you can trust that harness
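One hypothetical shape for such a harness, reusing the `extract_features`
and `classifiers` sketches from earlier:

```python
from sklearn import cross_validation

def run_harness(imagePaths, labels, classifiers, kFolds=10):
    """Extract features once, then evaluate every model the same way."""
    features = [extract_features(path) for path in imagePaths]
    results = {}
    for name, classifier in classifiers.items():
        scores = cross_validation.cross_val_score(
            classifier, features, labels, cv=kFolds)
        results[name] = (scores.mean(), scores.std())
    # Every model goes through identical, repeatable evaluation.
    return results
```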
## Summary
Machine learning is all about **automation**.
It's about automatically finding patterns in data...
...and building models to fit that data.
Models that also *generalise*.
Follow the five step process:
1. define problem
2. prepare data
3. spot check algorithms
4. tuning
5. present results
Automate as much as you can.
So you can focus on **feature engineering**.
Don't reinvent the wheel.
Use the **hundreds of tools** already there.
### Tools
![r](images/r.svg)
![python](images/python.svg)
![java](images/java.svg)
![scala](images/scala.svg)
![sklearn](images/sklearn.svg)
![theano](images/theano.png)
![mdp](images/mdp.png)
![spark](images/spark.png)
### Useful Resources
[Data Mining: Practical Machine Learning Tools and Techniques](http://machinelearningmastery.com/6-practical-books-for-beginning-machine-learning/)
[A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)
[Jason Brownlee's Process](http://machinelearningmastery.com/process-for-working-through-machine-learning-problems/)
[Introduction to Machine Learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
[Efficient and Robust Automated Machine Learning](https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf)
### This Presentation
[donaldwhyte.github.io/intro-to-ml/automated](http://donaldwhyte.github.io/intro-to-ml/automated)
[github.com/DonaldWhyte/intro-to-ml/](http://github.com/DonaldWhyte/intro-to-ml)