Machine Learning Collection

Libraries, tools, recipes, sample code, and workshop content contributed by Microsoft for machine learning and deep learning.


Boosting

  • LightGBM - A fast, distributed, high-performance gradient boosting framework.
  • LightGBM benchmarking suite - Benchmark tools for LightGBM.
  • Explainable Boosting Machines - an interpretable model developed at Microsoft Research that uses bagging, gradient boosting, and automatic interaction detection to estimate generalized additive models.
  • Cyclic Boosting Machines - An explainable supervised machine learning algorithm specifically for predicting count-data, such as sales and demand.


AutoML

  • Neural Network Intelligence - An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
  • Archai - Reproducible Rapid Research for Neural Architecture Search (NAS).
  • FLAML - A fast and lightweight AutoML library.
  • Azure Automated Machine Learning - Automated machine learning for tabular data (regression, classification, and forecasting) by Azure Machine Learning.
  • Cream - A collection of Microsoft NAS and Vision Transformer work.

Neural Network

  • PyMarlin - Lightweight Deep Learning Model Training library based on PyTorch.
  • bayesianize - A Bayesian neural network wrapper in PyTorch.
  • O-CNN - Octree-based convolutional neural networks for 3D shape analysis.
  • ResNet - deep residual network.
  • CNTK - Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit.
  • InfiniBatch - Efficient, check-pointed data loading for deep learning with massive data sets.
  • Models under Hugging Face - Microsoft shares transformer models at Hugging Face. 51 pretrained models (as of June 28, 2021).
  • Muzic - Music Understanding and Generation with Artificial Intelligence.

Graph & Network

  • graspologic - utilities and algorithms designed for the processing and analysis of graphs with specialized graph statistical algorithms.
  • TF Graph Neural Network Samples - TensorFlow implementations of graph neural networks.
  • ptgnn - PyTorch Graph Neural Network Library
  • StemGNN - spectral temporal graph neural network (StemGNN) for multivariate time-series forecasting.
  • SPTAG - a distributed approximate nearest neighbor search (ANN) library.
  • DiskANN - Scalable graph based indices for approximate nearest neighbor search.


Computer Vision

  • Microsoft Vision Model ResNet50 - a large pretrained ResNet-50 vision model built using a search engine's web-scale image data.
  • Oscar - Object-Semantics Aligned Pre-training for Vision-Language Tasks.
  • TorchGeo - a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.
  • Swin Transformer - an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".

Time Series


Natural Language Processing

  • T-ULRv2 - Turing multilingual language model.
  • Turing-NLG - Turing Natural Language Generation, 17 billion-parameter language model.
  • DeBERTa - Decoding-enhanced BERT with Disentangled Attention
  • UniLM - Unified Language Model Pre-training / Pre-training for NLP and Beyond
  • Unicoder - Unicoder model for understanding and generation.
  • NeuronBlocks - build your NLP DNN models like playing with Lego.
  • Multilingual Model Transfer - new deep learning models for bootstrapping language understanding models for languages with no labeled data using labeled data from other languages.
  • MT-DNN - multi-task deep neural networks for natural language understanding.
  • inmt - interactive neural machine translation-lite
  • OpenKP - automatically extracting keyphrases that are salient to a document's meaning, an essential step in semantic document understanding.
  • DeText - a deep neural text understanding framework for ranking and classification tasks.
  • Genalog - an open source, cross-platform Python package for generating synthetic document images with custom degradations and text alignment capabilities.
  • FastFormers - highly efficient transformer models for NLU.
  • VERSEAGILITY - a Python-based toolkit to ramp up your custom natural language processing (NLP) task, allowing you to bring your own data and bring models into production. It is a central component of the Microsoft Data Science Toolkit.
  • DPU Utilities - Utilities used by the Deep Program Understanding team.
  • KEAR - Official code for achieving human parity on CommonsenseQA with External Attention.
  • Prompt Engine - A utility library for creating and maintaining prompts for Large Language Models.

Online Machine Learning

  • Vowpal Wabbit - fast, efficient, and flexible online machine learning techniques for reinforcement learning, supervised learning, and more.


Recommendation

  • Recommenders - examples and best practices for building recommendation systems (A2SVD, DKN, xDeepFM, LightGBM, LSTUR, NAML, NPA, NRMS, RLRMC, SAR, and Vowpal Wabbit were invented or contributed by Microsoft).
  • GDMIX - A deep ranking personalization framework
  • rankerEval - A fast numpy-based implementation of ranking metrics for information retrieval and recommendation.


  • DeepSpeed - DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
  • MMLSpark - machine learning library on Spark.
  • photon-ml - a scalable machine learning library on Apache Spark.
  • TonY - framework to natively run deep learning frameworks on Apache Hadoop.
  • isolation-forest - A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm.

Causal Inference

  • EconML - Python package for estimating heterogeneous treatment effects from observational data via machine learning.
  • DoWhy - Python library for causal inference that supports explicit modeling and testing of causal assumptions.

Responsible AI

  • InterpretML - a toolkit to help understand models and enable responsible machine learning.
    • Interpret Community - extends interpret repo with additional interpretability techniques and utility functions.
    • DiCE - diverse counterfactual explanations.
    • Interpret-Text - state-of-the-art explainers for text-based ML models, with dashboard visualization.
  • fairlearn - Python package to assess and improve fairness of machine learning models.
  • LiFT - LinkedIn Fairness Toolkit.
  • RobustDG - Toolkit for building machine learning models that generalize to unseen domains and are robust to privacy and other attacks.
  • SHAP - a game theoretic approach to explain the output of any machine learning model (Scott Lundberg, Microsoft Research).
  • LIME - explaining the predictions of any machine learning classifier (Marco, Microsoft Research).
  • BackwardCompatibilityML - Project for open sourcing research efforts on Backward Compatibility in Machine Learning
  • confidential-ml-utils - Python utilities for training and deploying ML models against data you can't see.
  • presidio - context aware, pluggable and customizable data protection and anonymization service for text and images.
    • Presidio-research - This package features data-science related tasks for developing new recognizers for Presidio.
  • Confidential ONNX Inference Server - An Open Enclave port of the ONNX inference server with data encryption and attestation capabilities to enable confidential inference on Azure Confidential Computing.
  • Responsible-AI-Widgets - responsible AI user interfaces for Fairlearn, interpret-community, and Error Analysis, as well as foundational building blocks that they rely on.
  • Error Analysis - A toolkit to help analyze and improve model accuracy.
  • Secure Data Sandbox - A toolkit for conducting machine learning trials against confidential data.
  • shrike - Python utilities to aid "compliant experimentation" in Azure Machine Learning: training ML models without seeing the training data.
  • HAX Toolkit - The Human-AI eXperience (HAX) Toolkit is a set of practical tools for creating human-AI experiences with people in mind from the beginning.
  • GAM Changer - Edit machine learning models to reflect human knowledge and values.
  • AdaTest - Find and fix bugs in natural language machine learning models using adaptive testing.


  • ONNXRuntime - cross-platform, high performance ML inference and training accelerator.
  • Hummingbird - compiles trained ML models into tensor computations for faster inference.
  • EdgeML - provides code for machine learning algorithms for edge devices developed at Microsoft Research India.
  • DirectML - high-performance, hardware-accelerated DirectX 12 library for machine learning.
  • MMdnn - MMdnn is a set of tools to help users inter-operate among different deep learning frameworks. E.g. model conversion and visualization.
  • InfiniBatch - Efficient, check-pointed data loading for deep learning with massive data sets.
  • InferenceSchema - Schema decoration for inference code
  • nnfusion - flexible and efficient deep neural network compiler.
  • Accera - Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research.

Reinforcement Learning

  • AirSim - open source simulator for autonomous vehicles, built on Unreal Engine / Unity, from Microsoft Research.

  • TextWorld - TextWorld is a sandbox learning environment for the training and evaluation of reinforcement learning (RL) agents on text-based games.

  • Moab - Project Moab, a new open-source balancing robot to help engineers and developers learn how to build real-world autonomous control systems with Project Bonsai.

  • MARO - multi-agent resource optimization (MARO) platform.

  • Training Data-Driven or Surrogate Simulators - build simulations from data for use in RL and the Bonsai platform for machine teaching.

  • Bonsai - low-code industrial machine teaching platform.

    • Bonsai Python SDK - A Python library for integrating data sources with Bonsai BRAIN.
  • SEGAR - Sandbox environment for generalizable agent research.


  • counterfit - a CLI that provides a generic automation layer for assessing the security of ML models.
  • Federated Learning Simulation Framework - a flexible framework for running experiments with PyTorch models in a simulated Federated Learning (FL) environment.
  • FLUTE - a platform for conducting high-performance federated learning simulations.



Datasets

  • COCO Dataset - COCO is a large-scale object detection, segmentation, and captioning dataset.
  • MS MARCO - collection of datasets focused on deep learning in search.
  • InnerEye CreateDataset - InnerEye dataset creation tool for InnerEye-DeepLearning library. Transforms DICOM data into mask for training Deep Learning models.
  • Sepsis Cohort from MIMIC III - Sepsis cohort from MIMIC dataset.
  • MIND : Microsoft News Dataset - a large-scale dataset for news recommendation research.
  • Dataset for AI for Earth - AIForEarthDataSets is a collection of datasets for AI research.
  • ORBIT - a collection of videos of objects in clean and cluttered scenes recorded by people who are blind/low-vision on a mobile phone.
  • EcoFlows - Community-representation to collaborate on labelled AI data for ecological and agricultural scenarios in APAC, updated monthly.

Debug & Benchmark

  • tensorwatch - debugging, monitoring and visualization for Python machine learning and data science.
  • Pyright - static type checker for Python.
  • Bench ML - Python library to benchmark popular pre-built cloud AI APIs.
  • debugpy - An implementation of the Debug Adapter Protocol for Python
  • kineto - A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters contributed by Azure AI Platform team.
  • SuperBenchmark - a benchmarking and diagnosis tool for AI infrastructure (software & hardware).
  • tempeh - tempeh is a framework to TEst Machine learning PErformance exHaustively which includes tracking memory usage and run time.


Pipeline

  • GitHub Actions - Automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub.
  • Azure Pipelines - Automate your builds and deployments with Pipelines so you spend less time with the nuts and bolts and more time being creative.
  • Dagli - framework for defining machine learning models, including feature generation and transformations, as a DAG.


Platform

  • AI for Earth API Platform - distributed infrastructure that provides secure, scalable, and customizable API hosting, designed to handle the needs of long-running/asynchronous machine learning model inference.

  • Open Platform for AI (OpenPAI) - resource scheduling and cluster management for AI.

    • OpenPAI Runtime - Runtime for deep learning workload.
    • OpenPAI Protocol - OpenPAI protocol enables job sharing and portability.
    • Openpaimarketplace - A marketplace that stores examples and job templates for OpenPAI.
    • OpenPAI FrameworkController - built to orchestrate all kinds of applications on Kubernetes by a single controller.
    • HivedDScheduler - Kubernetes Scheduler for Deep Learning.
    • OpenPAI JS SDK - The JavaScript SDK is designed to help OpenPAI developers offer a user-friendly experience.
    • OpenPAI VS Code Client - Extension to connect to OpenPAI clusters, submit AI jobs, simulate jobs locally, manage files, and so on.
  • MLOS - Data Science powered infrastructure and methodology to democratize and automate Performance Engineering.

  • Platform for Situated Intelligence - an open-source framework for multimodal, integrative AI.

  • Qlib - an AI-oriented quantitative investment platform.

Feature Engineering

  • Feast on Azure - Azure plugins for Feast (FEAture STore).
  • Feathr - An Enterprise-Grade, High Performance Feature Store.


Data Labeling

  • TagAnomaly - Anomaly detection analysis and labeling tool, specifically for multiple time series (one time series per category).
  • VoTT - Visual object tagging tool
  • Satellite imagery annotation tool - A lightweight web-interface for creating and sharing vector annotations over satellite/aerial imagery scenes.

Developer tool

  • Visual Studio Code - Code editor redefined and optimized for building and debugging modern web and cloud applications.
  • Gather - adds gather functionality in the Python language to the Jupyter Extension.
  • Pylance - an extension that works alongside Python in Visual Studio Code to provide performant language support.
  • Azure ML Snippets - VSCode snippets for Azure Machine Learning

Sample Code


Community

  • AI@Edge Community - find the resources you need to create solutions using intelligence at the edge through combinations of hardware, machine learning (ML), artificial intelligence (AI), and Microsoft Azure services.
  • Global AI Community - empowers developers who are passionate about AI to share knowledge through events and meetups.
  • Deep Learning Lab (Japan) - provides information on development cases and the latest technology trends related to deep learning.





Blog, News & Webinar



