Reward maximisation in a system

Do humans in a system seek to move towards homeostasis or heterostasis? Do we seek a state of stability and equilibrium, or a condition of maximal or optimal functioning?

The Alignment Problem by Brian Christian touches on the intersection of pedagogy, psychology, societal norms, reward functions, animal behaviour and machine learning. The book addresses the alignment problem in machine learning, but the considerations it highlights could be translated to many other aspects of our lives.

The subject of homeostasis and heterostasis stands out for me because I have been thinking about reward maximisation, motivation and the interactions among different entities. A system in this context could refer to the structure of a personal life, an organisation or even a country. In Singapore, we are no strangers to bonus structures through which the government incentivises desirable behaviours among citizens. In an organisation, we have KPIs or OKRs. In our personal lives, we can also employ different systems, from gamification to a streak mentality to setting SMART goals.

We have probably heard statements like these:

“It would have been better for <Person X> to have done <Action A1> instead of <Action A2> when interacting with <Person Y/Group Y> in the context of <Event E1> to prevent <Consequences C1>.”

“<Organisation Y> is giving <bonus B1> in expectation of <Action A3>. <Organisation Z> cannot match up to <bonus B1>, therefore the expectation of <Action A3> does not exist.”

On the surface, it might seem reasonable to make these assumptions. The people making the above statements could even offer concrete justifications for them. However, we may be making sweeping statements like these without considering the motivations or reward functions of the individuals or groups involved.

If it is human nature to seek reward maximisation, we should expect to see diverse, and perhaps diverging, reward functions.

For example, <Group Y.Person 1> may prefer <Action A1> but <Group Y.Person 2> may prefer <Action A2>. The person making the statement in favour of <Action A1>, <Group Y.Person 3>, may be in constant contact with <Group Y.Person 1> and in less contact with <Group Y.Person 2>. In this case, <Group Y.Person 3> may come to think that <Action A1> benefits more people than <Action A2>.

Granted that both <Action A1> and <Action A2> will have certain negative consequences, how then can we best decide which action to take? Real life is unlike the virtual world: most of the time, we can take only one action at a given point in time. When <Person X> chooses one action over the other, only the negative consequences of that action play out. Real life also doesn’t come with a time machine. This leaves us with only our own imagination and extrapolation as to what would have happened had the other action been taken. I’ve been reading Rationality by Eliezer Yudkowsky, in which he constantly refers to Bayesian reasoning. I can’t fully grasp the concept yet because of my superficial statistical background, but the application of probability theory to inductive reasoning seems the closest we could get to a Doctor Strange-like experience of living through the various scenarios.
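As a rough illustration of what a single Bayesian update looks like, here is a minimal Python sketch. It is my own toy example, not something taken from Yudkowsky's book, and every number in it is a made-up assumption: a 50% prior belief that <Action A1> is the better choice, and a piece of evidence that is twice as likely to show up if that belief is true than if it is false.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) using Bayes' theorem."""
    numerator = likelihood_if_true * prior
    evidence = numerator + likelihood_if_false * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers, chosen only for illustration.
prior = 0.5                 # initial belief that Action A1 is the better choice
likelihood_if_true = 0.8    # chance of seeing the evidence if A1 really is better
likelihood_if_false = 0.4   # chance of seeing the same evidence if it is not

posterior = bayes_update(prior, likelihood_if_true, likelihood_if_false)
print(f"Belief after one piece of evidence: {posterior:.3f}")

# A second, similar piece of evidence starts from the updated belief.
posterior_2 = bayes_update(posterior, likelihood_if_true, likelihood_if_false)
print(f"Belief after two pieces of evidence: {posterior_2:.3f}")
```

The point of the sketch is only that belief shifts gradually as evidence accumulates, instead of jumping straight to certainty about the path not taken.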

Similarly, for the second statement, if we break down the organisational dynamics, we can roughly see two slightly differing reward functions. Employees may be seeking recognition, growth opportunities and financial incentives. An employer strives to enhance productivity, retain talent and pursue business opportunities. The objectives of the employee and the employer may not always be fully aligned, and there are intricacies to consider in maintaining a delicate balance.

One set of conflicting rewards for employees and employers, especially at a resource-constrained company, is the maximisation of financial incentives for individuals versus the maximisation of profit for the organisation. Big tech companies regularly make the news for poor working conditions and for putting profit over people.

To address the complexities of conflicting reward functions, it is crucial to delve into the motivations and reward functions of individuals or groups. Rather than making sweeping assumptions, it is essential to consider the unique perspectives, desires, and priorities of each stakeholder. This understanding forms the foundation for effective decision-making and navigating the intricacies of reward maximisation. By actively engaging in discussions and valuing different perspectives, it becomes possible to uncover common ground and build consensus.

Policy-based approaches led to a system—be it animal, human, or machine—with highly trained “muscle memory.” The right behavior just flowed effortlessly. Value-based approaches, by contrast, led to a system with a highly trained “spider-sense.” It could tell right away if a situation was threatening or promising. Either, alone, if fully developed, was enough. In practice, however, policy-based approaches and value-based approaches went hand in hand. Barto and Sutton began to elaborate an idea known as the “actor-critic” architecture, where the “actor” half of the system would learn to take good actions, and the “critic” half would learn to predict future rewards.

Christian’s book also touches on policy-based and value-based approaches to shaping behaviour. By borrowing the “actor-critic” architecture, both the employer and the employee can form a constant feedback loop towards desirable outcomes.
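To make the "actor" and "critic" halves a little more concrete, here is a minimal Python sketch. It is my own simplification and not code from Christian's book or from Barto and Sutton: it assumes a single situation with two hypothetical actions and fixed, made-up payoffs, so the critic only has to predict one number. The function name, learning rates and payoffs are all assumptions for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    peak = max(prefs)
    exps = [math.exp(p - peak) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(rewards, steps=2000, actor_lr=0.1, critic_lr=0.1, seed=0):
    """One-state actor-critic sketch: returns (policy probabilities, critic value)."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # actor: one preference per action
    value = 0.0                   # critic: predicted reward

    for _ in range(steps):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]

        # Critic: how much better or worse was the outcome than predicted?
        surprise = reward - value
        value += critic_lr * surprise

        # Actor: shift preferences towards actions that beat the prediction.
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += actor_lr * surprise * (indicator - probs[a])

    return softmax(prefs), value


# Hypothetical payoffs: action 0 pays 0.2, action 1 pays 1.0.
policy, predicted = train_actor_critic([0.2, 1.0])
print("Policy:", policy)
print("Critic's predicted reward:", predicted)
```

In this toy setting the actor gradually favours the better-paying action, while the critic's prediction rises to match what the actor actually receives. That is the feedback loop in miniature: the critic's surprise is the only signal the actor learns from.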