Enhancements: Support for Hardware Accelerators

Created on 28 Feb 2017 · 21 Comments · Source: kubernetes/enhancements

Description

Kubernetes is becoming popular for managing workloads that consume accelerators, such as TensorFlow. The agility that Kubernetes offers makes it easy to consume accelerators across a fleet of machines.
Kubernetes can provide an end-to-end workflow by separating the provisioning and configuration of accelerators from their consumption.

Progress Tracker

[ ] Alpha
- [ ] Write and maintain draft quality doc
  - [ ] During development keep a doc up-to-date about the desired experience of the feature and how someone can try the feature in its current state. Think of it as the README of your new feature and a skeleton for the docs to be written before the Kubernetes release. Paste link to Google Doc: DOC-LINK
- [ ] Design Approval
  - [ ] Design Proposal. This goes under design-proposals. Doing a proposal as a PR allows line-by-line commenting from community, and creates the basis for later design documentation. Paste link to merged design proposal here: PROPOSAL-NUMBER
  - [ ] Decide which repo this feature's code will be checked into. Not everything needs to land in the core kubernetes repo. REPO-NAME
  - [ ] Identify shepherd (your SIG lead and/or [email protected] will be able to help you). My Shepherd is: _replace.[email protected]_ (and/or GH Handle)
  - A shepherd is an individual who will help acquaint you with the process of getting your feature into the repo, identify reviewers and provide feedback on the feature. They are _not_ (necessarily) the code reviewer of the feature, or tech lead for the area.
  - The shepherd is _not_ responsible for showing up to Kubernetes-PM meetings and/or communicating if the feature is on-track to make the release goals. That is still your responsibility.
  - [ ] Identify secondary/backup contact point. My Secondary Contact Point is: _replace.[email protected]_ (and/or GH Handle)
- [ ] Write (code + tests + docs) then get them merged. ALL-PR-NUMBERS
  - [ ] Code needs to be disabled by default. Verified by code OWNERS
  - [ ] Minimal testing
  - [ ] Minimal docs
  - cc @kubernetes/docs on docs PR
  - cc @kubernetes/feature-reviewers on this issue to get approval before checking this off
  - New apis: Glossary Section Item in the docs repo: kubernetes/kubernetes.github.io
  - [ ] Update release notes
[ ] Beta
- [ ] Testing is sufficient for beta
- [ ] User docs with tutorials
  - Updated walkthrough / tutorial in the docs repo: kubernetes/kubernetes.github.io
  - cc @kubernetes/docs on docs PR
  - cc @kubernetes/feature-reviewers on this issue to get approval before checking this off
- [ ] Thorough API review
  - cc @kubernetes/api
[ ] Stable
- [ ] docs/proposals/foo.md moved to docs/design/foo.md
  - cc @kubernetes/feature-reviewers on this issue to get approval before checking this off
- [ ] Soak, load testing
- [ ] Detailed user docs and examples
  - cc @kubernetes/docs
- cc @kubernetes/feature-reviewers on this issue to get approval before checking this off

FEATURE_STATUS is used for feature tracking and to be updated by @kubernetes/feature-reviewers.
FEATURE_STATUS: IN_DEVELOPMENT

cc @kubernetes/sig-node-feature-requests @kubernetes/sig-scheduling-feature-requests

do-not-merge/docs lifecycle/rotten sig/node stage/alpha

Source

vishh

All 21 comments

cc @aronchick for priority

vishh on 28 Feb 2017

s/accelerators/device assignment please? /cc @derekwaynecarr

jeremyeder on 1 Mar 2017

Regarding accelerators, does that mean some kind of device, e.g. a GPU (but not limited to GPUs)?

k82cn on 1 Mar 2017

/subscribe

cmluciano on 1 Mar 2017

@k82cn yes. Actually, per the SIG meeting yesterday, any PCI device (most tend to be accelerators, but I'd personally prefer more generic wording). Note that Intel has "accelerators" inside their CPUs (called CPU extensions). All of these things should become candidates for scheduler matchmaking.

jeremyeder on 1 Mar 2017

cmluciano on 1 Mar 2017

@jeremyeder

My understanding is that:

1. There needs to be a way to discover, represent and consume accelerators as a resource in Kubernetes.
2. As an optimization, node hardware topology needs to be taken into account while provisioning accelerators.

1 does not depend on 2, and 2 can be solved independently of 1.
This feature is meant to focus on 1.
It can benefit from 2 if it is made available in parallel.
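
The first point (discover, represent, consume) can be pictured with a toy model. This is only a sketch under assumptions, not the design proposed in this issue: the `Node` class, the `schedule` function and the resource name `example.com/gpu` are all hypothetical, and accelerators are treated as plain countable resources with topology ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical node that advertises countable accelerator capacity."""
    name: str
    capacity: Dict[str, int]  # resource name -> number of discovered devices
    allocated: Dict[str, int] = field(default_factory=dict)

    def free(self, resource: str) -> int:
        # Devices that were discovered but are not yet handed to a workload.
        return self.capacity.get(resource, 0) - self.allocated.get(resource, 0)


def schedule(nodes: List[Node], requests: Dict[str, int]) -> Optional[str]:
    """Bind the request to the first node that can satisfy all of it."""
    for node in nodes:
        if all(node.free(res) >= qty for res, qty in requests.items()):
            for res, qty in requests.items():
                node.allocated[res] = node.allocated.get(res, 0) + qty
            return node.name
    return None  # no node has enough free devices


# "example.com/gpu" is a made-up resource name used only for illustration.
nodes = [Node("cpu-only", {}), Node("gpu-node", {"example.com/gpu": 2})]
first = schedule(nodes, {"example.com/gpu": 2})
second = schedule(nodes, {"example.com/gpu": 1})
```

Here `first` is bound to `gpu-node`, while `second` finds no node because both devices are already allocated.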

vishh on 1 Mar 2017

👍2

Is the scope limited to accelerators, or does it also cover co-processors like TPMs?

> My understanding is that,
>
> There needs to be a way to discover, represent and consume Accelerators as a resource in Kubernetes

If hardware discovery is functionality that we are targeting, shouldn't the scope be broadened to all types of devices (including accelerators)?

ravisantoshgudimetla on 1 Mar 2017

This issue is not meant to support arbitrary third-party devices, which I believe warrants an issue by itself. Node Feature Discovery attempts to solve the device discovery problem to an extent.

vishh on 1 Mar 2017

Can we use the term "hardware accelerators"? I was really confused by this issue at first.

philips on 2 Mar 2017

👍3

Good proposal! I think topology support for devices is a must. For example, NVIDIA GPUs on different PCI bridges cannot talk P2P.
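
To make the topology concern concrete, here is a minimal sketch, assuming each device is tagged with the PCI bridge it sits behind. The function name `pick_p2p_group` and the device and bridge labels are hypothetical; nothing here reflects an actual Kubernetes API.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def pick_p2p_group(device_bridge: Dict[str, str], count: int) -> Optional[List[str]]:
    """Return `count` devices that share one PCI bridge, or None if no bridge has enough."""
    by_bridge: Dict[str, List[str]] = defaultdict(list)
    for device, bridge in sorted(device_bridge.items()):
        by_bridge[bridge].append(device)
    # Check bridges in a stable order so the result is deterministic.
    for bridge in sorted(by_bridge):
        if len(by_bridge[bridge]) >= count:
            return by_bridge[bridge][:count]
    return None


# Made-up inventory: gpu0 sits alone on one bridge, gpu1 and gpu2 share another.
devices = {"gpu0": "bridge-a", "gpu1": "bridge-b", "gpu2": "bridge-b"}
pair = pick_p2p_group(devices, 2)
triple = pick_p2p_group(devices, 3)
```

A topology-unaware allocator could hand a two-GPU workload `gpu0` and `gpu1`, which sit on different bridges; the grouping step avoids that, and returns nothing for a three-GPU request because no single bridge holds three devices.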

liyubobj on 3 Mar 2017

ping @calebamiles to review

idvoretskyi on 9 May 2017

One of the critical pieces of this problem, hardware device plugins, landed in v1.8: https://github.com/kubernetes/features/issues/368.
This feature is broad and requires more work around identifying and defining the matrix of devices, device plugins and workload compatibility. This aspect is expected to be handled outside of core Kubernetes, but the specifics are not yet defined. For that reason, I'm leaving this issue open and moving it to v1.9.

vishh on 12 Sep 2017

@vishh is it still alpha for 1.9?

Also, can you update the feature template to follow the new format? https://github.com/kubernetes/features/blob/master/ISSUE_TEMPLATE.md

idvoretskyi on 13 Nov 2017

It is still alpha for 1.9.

mindprince on 13 Nov 2017

@vishh 👋 Please indicate in the 1.9 feature tracking board whether this feature needs documentation. If yes, please open a PR and add a link to the tracking spreadsheet. Thanks in advance!

zacharysarah on 22 Nov 2017

@vishh Bump for docs ☝️

/cc @idvoretskyi

zacharysarah on 29 Nov 2017

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot on 27 Feb 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot on 29 Mar 2018

@vishh
Any plans for this in 1.11?

If so, can you please ensure the feature is up-to-date with the appropriate: