Reinforcing Learnings in the Organisation

  • Deep learning paradigms are about teaching computers what to do the way humans learn: by example. The popularity of reinforcement learning has continued to grow, with applications in healthcare, manufacturing and gaming, to name a few industries. We are becoming more algorithmically and computationally nuanced about how machines can learn intelligent behaviour in complex, dynamic environments. We would expect organizations to be as advanced, or more so, in how they create the environments, policies and rewards that help their people learn the relevant ‘winning’ behaviours for their business. Has the focus on human learning kept up, or is a gap opening up?
  • In the reinforcement learning paradigm, the goal is to enable a (software) agent to reach a desired state in a given environment by taking suitable actions, which are incentivized by rewards.
  • The function that takes in state observations (inputs) and maps them to actions (outputs) is called the policy. Given a set of observations, the policy decides which action the agent should take; a minimal sketch follows. The organizational analogue: employees act on policies or guiding principles, taking actions based on their work scenarios.
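A minimal sketch of a policy as a function, assuming a linear-softmax parameterization; the observation size, action count and weight shapes are illustrative assumptions, not from the original:

```python
import numpy as np

def policy(observation, weights):
    """Map state observations (inputs) to action probabilities (outputs).

    A linear-softmax policy: score each action with a weight matrix,
    then normalise the scores into a probability distribution.
    """
    scores = weights @ observation
    scores -= scores.max()                 # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Hypothetical setup: 4-dimensional observations, 2 possible actions.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 4))          # (n_actions, obs_dim)
observation = rng.normal(size=4)

probs = policy(observation, weights)
action = rng.choice(len(probs), p=probs)   # the policy decides the action
```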
  • Policy network and policy gradients
    • Policies set by the organization to guide actions
    • Rewards and penalties for actions, to steer behavior in the right direction
    • The only feedback is the scoreboard – the reward at the end
    • The agent tries to receive as much reward as possible by optimizing its policy
    • Collect a batch of experience (episodes) to train the policy on
    • Compute the gradients that make rewarding actions more likely for the agent to take in future
    • Actions leading to negative rewards are slowly filtered out and actions leading to positive rewards become more and more likely – so our agent is learning how to play the role (a minimal sketch of one such update follows this list)
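As a concrete illustration, here is a minimal sketch of a single REINFORCE-style policy-gradient update for the linear-softmax policy sketched above; the episode format, environment interface and learning rate are illustrative assumptions:

```python
import numpy as np

def policy_probs(weights, obs):
    """Linear-softmax policy: action probabilities for an observation."""
    scores = weights @ obs
    scores -= scores.max()            # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def reinforce_update(weights, episode, lr=0.01):
    """One policy-gradient (REINFORCE) update over a single episode.

    `episode` is a list of (obs, action, reward) triples collected by
    running the current policy in some environment (interface assumed).
    Scaling the gradient by the total reward makes actions from
    high-reward episodes more likely, and actions from negative-reward
    episodes less likely.
    """
    total_return = sum(r for _, _, r in episode)
    grad = np.zeros_like(weights)
    for obs, action, _ in episode:
        probs = policy_probs(weights, obs)
        one_hot = np.zeros(len(probs))
        one_hot[action] = 1.0
        # gradient of log pi(a|s) for a linear-softmax policy
        grad += np.outer(one_hot - probs, obs)
    return weights + lr * total_return * grad
```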
  • Downsides with policy gradient
    • Credit assignment problem: penalizing based only on the final result makes an agent discard promising intermediate steps that brought it to the brink of success
    • Sparse reward setting – the agent has to figure out which parts of its actions caused the eventual reward at the end of the episode (a common mitigation is sketched after this list)
    • Sample inefficiency – a ton of training time
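One standard mitigation for credit assignment under sparse rewards is to credit each action with the discounted return that followed it, rather than one end-of-episode score. A minimal sketch; the discount factors here are illustrative choices:

```python
def discounted_returns(rewards, gamma=0.99):
    """Per-step discounted returns: G_t = r_t + gamma * G_{t+1}.

    Each action is credited with the (discounted) reward that came
    after it, instead of a single scoreboard value for the episode.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Sparse reward: only the final step pays off, but earlier actions
# still receive (discounted) credit for leading up to it.
print(discounted_returns([0, 0, 0, 1], gamma=0.9))
# ≈ [0.729, 0.81, 0.9, 1.0]
```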
  • Reward shaping – manually shaping a reward function to guide desired behavior
    • Needs to be redone for every new environment the policy is applied to – not scalable
    • Alignment problem – very difficult: the agent will find surprising ways to collect rewards without doing what you actually want – the policy overfits to the reward function as designed (see the shaping sketch after this list)
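For illustration, potential-based shaping is one well-known form of reward shaping that provably leaves the optimal policy unchanged (Ng, Harada and Russell, 1999); the potential function below is a hypothetical, hand-designed guess and would still need redesigning per environment:

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).

    `potential` encodes a manual guess at how promising a state is,
    guiding the agent toward desired behavior without changing which
    policy is optimal.
    """
    return reward + gamma * potential(next_state) - potential(state)

# Hypothetical 1-D example: states are positions, the goal is at 10.
goal = 10
phi = lambda s: -abs(goal - s)   # more promising the closer to the goal

# Moving from position 3 to 4 earns shaping credit even though the
# raw environment reward is still 0 (sparse reward setting).
print(shaped_reward(0, 3, 4, phi, gamma=0.99))  # positive: progress made
```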
  • Parallels with people management
    • How do you design an environment and policies that enable employees (agents) to take the right actions and achieve positive outcomes?
