Chowa Giken Technical Blog

We share the technologies we are working on at Chowa Giken.

Uses of Dynamic Programming, Reinforcement Learning and Multi-Agent Systems to Solve the Dynamic Pricing Problem

Introduction:

I'm Tanzim Ahmed, working as a Software Engineer at Chowa Giken Corporation. Recently, for one of our projects, we had to survey Dynamic Pricing algorithms. There is a huge body of research in this area. In particular, we studied projects that make use of Dynamic Programming and Reinforcement Learning, and for the simulation side we also needed to study multi-agent systems. During the course of the survey, we found numerous Dynamic Pricing projects deployed in different industries around the world.

 

In this blog, I want to mention some interesting approaches followed by various research groups to solve the Dynamic Pricing problem. These approaches, of course, are centered around Dynamic Programming, Reinforcement Learning, and Multi-Agent Systems.

 

Keywords:

Dynamic Pricing, Dynamic Programming, Reinforcement Learning, Multi-Agent Systems

 

Key Points:

What is Dynamic Pricing?

Dynamic Pricing in Different Industries

Dynamic Programming in Dynamic Pricing

Reinforcement Learning in Dynamic Pricing

Multi-Agent Systems in Dynamic Pricing

Some Other Algorithms for Dynamic Pricing

Conclusion

References

 

What is Dynamic Pricing?

Dynamic Pricing is a pricing strategy in which businesses set flexible prices for products or services based on market supply and demand, customer types, competitor prices and other parameters.

 

Dynamic Pricing in Different Industries:

Dynamic pricing is a common practice in several industries such as hospitality, travel, entertainment, retail, electricity, and public transport.

 

Two of the world's most popular companies, Amazon and Uber, use Dynamic Pricing for their products and services. On Amazon, popular items are quickly discounted, while items that are less attractive may cost more than they do on rival sites. The prices of some products vary throughout the year. By testing a product at various price points, Amazon can determine the optimal low price to use during the peak shopping period. It may also increase the price of products for which it figures customers will not comparison-shop with its competitors, for example HDMI cables before Black Friday. It is said that Amazon changes its prices every 10 minutes on average.

Using LSTM networks in its Deep Learning models, Uber makes predictions about the future state of the market, and even about unexpected events before they happen. Inputs to the model include global news events, weather, historical data, holidays, time, and traffic. These data are collected, and the algorithm predicts the top price that customers are most likely willing to pay. This "willingness to pay" algorithm can determine how likely customers are to agree to the price of a ride at the current time. Uber also knows when its users' phone batteries are running low, because the app switches into power-saving mode; this information plays an important role in its pricing strategy.

 

Dynamic Programming in Dynamic Pricing:

Dynamic Programming is mainly an optimization over plain recursion. Wherever we see a recursive solution that has repeated calls for the same inputs, we can optimize it using Dynamic Programming. The idea is simply to store the results of subproblems so that we do not have to re-compute them when they are needed later. This simple optimization typically reduces time complexity from exponential to polynomial.
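
To make this concrete, here is a minimal Python sketch of memoization (my own illustration, not taken from any of the surveyed papers): caching the results of subproblems turns an exponential-time recursion into a linear-time one.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive recursion would recompute fib() for the same n exponentially often;
    the cache stores each subproblem result so it is computed only once."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(80))  # returns instantly thanks to memoization
```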

Dynamic Programming is extensively used in Dynamic Pricing problems.

 

Qing et al. (2006) worked on multiple customer classes. They considered the partial backlogging case, where price discounts are used to induce a desirable level of backlogging. Backlogging means that rejected customers wait with a probability that depends on the price discount. The model dynamically adjusts the price discounts offered to customers who are denied immediate service.

They solved the problem by induction over the stages of the inventory period. In each stage, three decisions have to be made: what discounts to offer, what prioritization to use, and how much inventory to allocate to demand in that stage. They proposed a six-step algorithm to determine the discount. They compared their model with a First Come, First Served (FCFS) solution and showed an increased profit for their model. They also showed that a two-stage heuristic, which makes the individual decisions based on the current inventory, the state of waiting customers, and expectations for the remainder of the period, performs very well.

Providing discounts in the model has a smaller, though still significant, effect on profits. If we consider multiple customer classes and discounts for groups of customers, then this paper's idea could be helpful.

 

Xin et al. (2015) used a memory-based reference price model, in which the reference price is generated by exponentially weighting past prices. For a more conservative result, they let the minimum historical price be the initial reference price. They covered both loss-averse and gain-seeking customer behaviors; for the other cases, they proposed an approximation heuristic. For general parameter settings, they proposed a heuristic and established lower and upper bounds on the optimal objective value achieved by the heuristic.

The model assumes a piecewise linear demand function and that consumers are more sensitive to losses than to gains. They converted the dynamic pricing problem into a longest path problem. All the parameters in the demand function can be conveniently estimated from real data on historical prices and demands using linear regression. In the loss/gain-neutral case, the unconstrained dynamic pricing problem can be solved easily using standard linear quadratic control techniques. To alleviate the multi-collinearity of the linear reference price model, they performed linear regression with regularization, i.e., lasso regression. According to the authors, lasso regression usually results in a better fit and corrects wrong signs in parameter estimates better than ridge regression.
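
A common way to write such an exponentially smoothed reference price, shown here purely as an illustration of the idea rather than the exact formulation of Xin et al., is a weighted memory of past posted prices. The smoothing parameter alpha below is a hypothetical value.

```python
def update_reference_price(r_prev: float, p_prev: float, alpha: float = 0.7) -> float:
    """Exponentially smoothed reference price: a weighted memory of past prices.
    alpha is an assumed memory parameter (0 = no memory, 1 = full memory);
    this is a generic illustration, not the exact model of Xin et al. (2015)."""
    return alpha * r_prev + (1.0 - alpha) * p_prev

# Example: the reference price drifts toward the most recently posted prices.
r = 100.0  # start from an initial (e.g., minimum historical) price
for price in [95.0, 90.0, 92.0, 88.0]:
    r = update_reference_price(r, price)
    print(round(r, 2))
```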

 

Norman et al. (2016) proposed a Dynamic Programming model for solving the Dynamic Pricing problem. The inputs to their model are the number of items in stock, the competitor's price, and time.

Equations of value functions of the model:


Figure 1: Value Functions

We tried to execute the value function of the model.


Figure 2: Optimal Price vs. Time for Different Numbers of Items in Stock

The code generates some random price data in a specific range and fits a Linear Regression model to it. For diminishing time and perishable items in stock, the program determines the corresponding optimal price and optimal profit. It then plots the optimal price (P) against time (T) for different numbers of items in stock (N).
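
The exact code is in the linked repository; the sketch below is our own simplified reconstruction of the same idea (a sale probability estimated from randomly generated price data with a linear regression, followed by backward induction over time and remaining stock). Function names, the demand curve, and the horizon/stock parameters here are illustrative assumptions, not the repository's API.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# 1) Generate random (price, sale probability) data and fit a linear demand model.
prices = rng.uniform(5.0, 20.0, size=200)
sale_prob = np.clip(1.0 - 0.05 * prices + rng.normal(0, 0.05, size=200), 0.0, 1.0)
demand_model = LinearRegression().fit(prices.reshape(-1, 1), sale_prob)

def p_sale(price: float) -> float:
    """Estimated probability of selling one unit in a period at this price."""
    return float(np.clip(demand_model.predict([[price]])[0], 0.0, 1.0))

# 2) Backward induction: V[t][n] = expected future profit with n items and t periods elapsed.
T, N = 20, 5                          # selling horizon and initial stock (illustrative)
price_grid = np.linspace(5.0, 20.0, 31)
V = np.zeros((T + 1, N + 1))          # V[T][n] = 0: unsold items are worthless at the deadline
best_price = np.zeros((T, N + 1))

for t in range(T - 1, -1, -1):
    for n in range(1, N + 1):
        values = [p_sale(p) * (p + V[t + 1][n - 1]) + (1 - p_sale(p)) * V[t + 1][n]
                  for p in price_grid]
        V[t][n] = max(values)
        best_price[t][n] = price_grid[int(np.argmax(values))]

print("Optimal price now with full stock:", best_price[0][N])
print("Expected profit:", round(V[0][N], 2))
```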

 

Yuanming et al. (2019) considered seasonality in their Dynamic Programming model. They proposed that, for multiple seasons, multiple value functions could be utilized. For example, in their work two seasons are considered, and hence two value functions are introduced:


Figure 3: Two Value Functions for Two Seasons

Here, (x1, y1) and (x2, y2) are state variables, h*1 and h*2 are the optimal control variables, and β1 and β2 are the seasonal contraction factors for the two seasons. In a Dynamic Programming model, this approach could be followed to capture seasonality in the data.

 

Reinforcement Learning in Dynamic Pricing:

Reinforcement Learning is an area of Machine Learning concerned with taking suitable actions to maximize reward in a given situation. It is employed by various software systems and machines to find the best possible behavior or path to take in a specific situation. Reinforcement Learning differs from supervised learning: in supervised learning the training data comes with the answer key, so the model is trained on the correct answers themselves, whereas in Reinforcement Learning there is no answer key and the agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.

 

Roberto et al. (2018) showed how to solve dynamic pricing using Reinforcement Learning (RL) techniques so that prices are optimized while keeping a balance between revenue and fairness. They demonstrated that RL provides two main features that support fairness in dynamic pricing: it can learn from recent experience, adapting the pricing policy to complex market environments, and it provides a trade-off between short- and long-term objectives, hence integrating fairness into the model's core. They proposed applying RL to revenue optimization, with fairness additionally integrated into the learning procedure by using Jain's index as the fairness metric. They applied concepts from resource allocation to dynamic pricing, interpreting fairness as similarly distributed prices among groups of customers (equality between groups). The learning procedure produces homogeneous prices among customer groups while, at the same time, considering price sensitivities within each group to maximize revenue.
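
Jain's index itself is simple to compute. Here is a small sketch (my own illustration, not the authors' code) that scores how evenly prices are distributed across customer groups, with 1.0 meaning perfectly equal:

```python
def jains_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2); ranges from 1/n (unfair) to 1 (fair)."""
    n = len(values)
    total = sum(values)
    return total * total / (n * sum(v * v for v in values))

# Example: average prices offered to three hypothetical customer groups.
print(jains_index([10.0, 10.0, 10.0]))  # 1.0  -> perfectly fair
print(jains_index([5.0, 10.0, 30.0]))   # ~0.66 -> less fair
```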

 

They used Q-Learning with an ϵ-greedy strategy; a gradient descent step is executed to reduce the loss between the current and the target Q-value. Results in a simulated environment show a significant improvement in fairness while maintaining revenue optimization. They demonstrated that an unfair scenario can be unbiased to a certain degree by integrating fairness metrics into the model optimization. They covered the integration of fairness from a resource allocation point of view, including it as a design principle.

They proposed design principles for fairness in dynamic pricing, with a focus on elements such as group or individual schemes and equity or equality definitions. This design makes it possible to specify a well-defined metric and include it in the overall model in order to ensure fairness.

This approach could be followed if the customers are distributed in multiple groups.

 

Rupal et al. (2014) examined the problem of establishing a pricing policy that maximizes the revenue from selling a given inventory by a fixed deadline. They proposed a methodology to optimize revenue in a model-free environment in which demand is learned and pricing decisions are updated in real time. In their work, the dynamic pricing of a perishable service is modeled as a discrete finite-horizon Markov Decision Process (MDP), because pricing is a real-time decision-making problem in a stochastic environment.

They used two methods: Q-learning and Q-learning with eligibility traces. The Q(λ) algorithm was used to solve the non-stationary Markov Decision Process; it performs particularly well in situations where the demand between successive days exhibits auto-correlation. Thus, it is possible to implicitly learn the auto-correlation of real-time customer arrival rates and customer reservations, along with any other influential factors, within the selling horizon.

The reinforcement learning algorithms used in their work produce look-up tables of value functions for all state-action pairs, where the state is characterized by the capacity level and the time until expiration, and the action is the price. The value function of a state-action pair is calculated from the immediate revenue gained and the expected future revenues. These algorithms allow the value functions to be initialized with the best estimated demand function and then updated as real-time demand is observed.

The price at any given time period is determined by the remaining capacity and the time (e.g., number of days) left before the deadline expires. Q-learning with eligibility traces converges to the optimal policy, and its rate of learning is faster than that of simple Q-learning. Q-learning with eligibility traces also has the merit of learning the pricing policy in a model-free environment and can hence implicitly incorporate the auto-correlation of demand information within its policy.
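
As a rough illustration of this kind of look-up-table approach, here is a minimal tabular Q-learning pricing agent with an assumed state space (remaining capacity, days left) and an assumed discrete price grid. It is a sketch of plain Q-learning only (no eligibility traces) and not the authors' implementation; the demand curve below is hypothetical.

```python
import random
from collections import defaultdict

PRICES = [8.0, 10.0, 12.0, 14.0]        # assumed discrete action space (prices)
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # learning rate, discount factor, exploration rate
Q = defaultdict(float)                   # Q[((capacity, days_left), price)] -> value

def choose_price(state):
    """Epsilon-greedy selection over the discrete price grid."""
    if random.random() < EPSILON:
        return random.choice(PRICES)
    return max(PRICES, key=lambda p: Q[(state, p)])

def update(state, price, reward, next_state, done):
    """Standard Q-learning backup: immediate revenue plus discounted future value."""
    target = reward if done else reward + GAMMA * max(Q[(next_state, p)] for p in PRICES)
    Q[(state, price)] += ALPHA * (target - Q[(state, price)])

def simulated_sale_prob(price):
    """Hypothetical demand curve used only for this sketch."""
    return max(0.0, 1.0 - price / 20.0)

# One simulated selling period: state = (remaining capacity, days until the deadline).
state = (50, 30)
price = choose_price(state)
sold = random.random() < simulated_sale_prob(price)
reward = price if sold else 0.0
next_state = (state[0] - int(sold), state[1] - 1)
update(state, price, reward, next_state, done=(next_state[0] == 0 or next_state[1] == 0))
```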

If the true model is different from the assumed structure, then after observing realized demand data, the algorithm derives a policy for the correct underlying demand model. They showed that, using reinforcement learning, they could model the problem with interdependent demands. This type of model can be useful for producing a more accurate pricing scheme for services or products when important events affect consumer preferences.

 

An anonymous group of researchers (2019) defined the pricing process as a Markov Decision Process and then defined the state space, discrete and continuous action spaces, and different reward functions for different pricing applications. The model is pre-trained on historical sales data and previous specialists' pricing actions, which are also used for offline evaluation.


Figure 4: Dynamic Pricing Framework with Deep Reinforcement Learning

At each time step, the pricing agent observes the state of the product and takes an action. The agent then receives the reward for that action as well as the observation of the new state. These four elements (state, action, reward, next state) form a transition.
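
In code, such a transition is often represented as a simple record and stored for training. A minimal sketch (my own, with hypothetical field values) could be:

```python
from collections import namedtuple

# The four elements mentioned above: state, action, reward, next state.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

replay_buffer = []
# Hypothetical example: state = (inventory, days left), action = the chosen price.
replay_buffer.append(Transition(state=(120, 14), action=9.99, reward=49.95, next_state=(115, 13)))
```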

Their field experiment showed that it outperformed the manual markdown pricing strategy. 

They showed that pricing policies from DDPG (Deep Deterministic Policy Gradient) and DQN (Deep Q-Network) outperformed other pricing policies significantly. This Reinforcement Learning model could be helpful for pricing products in real-time.

 

Multi-Agent Systems in Dynamic Pricing:

An agent is a computerized entity such as a computer program or a robot. An agent can be described as autonomous when it has the capacity to adapt as its environment changes.

A multi-agent system (MAS or "self-organized system") is a computerized system composed of multiple interacting intelligent agents. 

A multi-agent system solves problems that are difficult for individual agents to solve. It is made up of a set of computer processes that occur at the same time, i.e., several agents that exist at the same time, share common resources, and communicate with each other.

The key issue in multi-agent systems is to formalize the coordination between agents. 

 

Prithviraj et al. (2000) designed and implemented an electronic market in software to study the performance of different price-setting algorithms. To simulate asynchronous buyer requests, they implemented the buyers on independent threads in Java. They then compared price profiles for several combinations of sellers competing against each other: two fixed step-size Derivative Followers (DF) and one Adaptive step-size Derivative Follower (ADF); an ADF and a DF competing with a Fixed-Price seller (FP); an FP, an ADF, and a Model Optimizer (MO); an FP, an ADF, and a Model Optimizer with Exploration (MOE); three competing model optimizers without exploration; and three model optimizers with exploration.
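
A Derivative Follower is a simple price-adjustment rule: keep moving the price in the same direction as long as profit improves, and reverse direction when it drops (the adaptive variant additionally changes the step size). The sketch below is a generic illustration of that rule under assumed parameters, not the paper's actual implementation.

```python
class DerivativeFollower:
    """Fixed step-size derivative-follower seller: adjust the price in the direction
    that most recently improved profit, and reverse direction otherwise."""

    def __init__(self, initial_price: float, step: float = 0.5):
        self.price = initial_price
        self.step = step
        self.direction = 1          # +1 = raise price, -1 = lower price
        self.last_profit = None

    def next_price(self, observed_profit: float) -> float:
        if self.last_profit is not None and observed_profit < self.last_profit:
            self.direction *= -1    # profit fell, so reverse direction
        self.last_profit = observed_profit
        self.price += self.direction * self.step
        return self.price

# Usage: feed in each period's observed profit and post the returned price.
seller = DerivativeFollower(initial_price=10.0)
for profit in [5.0, 6.0, 6.5, 6.1, 5.8]:
    print(round(seller.next_price(profit), 2))
```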

 

Wenchong et al. (2018) explored a price Q-learning mechanism for perishable products that considers uncertain demand and customer preferences in a competitive multi-agent retailer market (a model-free environment). In the proposed simulation model, agents imitate the behavior of consumers and retailers. Four potential influencing factors (competition, customer preferences, uncertain demand, and perishability) are built into the pricing decisions. All retailer agents adjust their products' prices over a finite sales horizon to maximize expected revenues. One retailer agent adjusts its price according to the Q-learning mechanism, while the others follow traditional pricing strategies. Shortages are allowed, while backlogs are not.

The simulation results show that the dynamic pricing strategy via the Q-learning mechanism can be used for pricing perishable products in a competitive environment, as it can produce more revenue for retailers.

 

Some Other Algorithms for Dynamic Pricing:

Zhao et al. (2017) proposed a demand prediction model and a dynamic pricing model for a share-bike business using Machine Learning. The demand forecast model considers weather, day of the week, rush hours, and other factors.

They experimented with Linear Regression, Ridge Regression, Lasso Regression, Decision Trees, Random Forests, Gradient Boosting, and AdaBoost; according to their experiment, Random Forest achieved the highest precision score.
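
As an illustration of how such a model comparison might be set up (a sketch with synthetic data and assumed feature names, not the actual share-bike code), one could cross-validate several scikit-learn regressors on the same demand data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor

rng = np.random.default_rng(0)
# Synthetic features: temperature, rain flag, day of week, rush-hour flag (assumptions).
X = np.column_stack([rng.uniform(0, 35, 500), rng.integers(0, 2, 500),
                     rng.integers(0, 7, 500), rng.integers(0, 2, 500)])
y = 50 + 2 * X[:, 0] - 30 * X[:, 1] + 20 * X[:, 3] + rng.normal(0, 10, 500)  # demand

models = {
    "Linear":       LinearRegression(),
    "Ridge":        Ridge(),
    "Lasso":        Lasso(),
    "DecisionTree": DecisionTreeRegressor(),
    "RandomForest": RandomForestRegressor(n_estimators=100),
    "GradBoost":    GradientBoostingRegressor(),
    "AdaBoost":     AdaBoostRegressor(),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:13s} R^2 = {score:.3f}")
```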

They then used the following equation to set the optimal price:

O.P/C.P = (1/P.E) x (P.D/A.D)

Where, 

O.P = Optimal Price, C.P = Current Price

P.E = Price Elasticity

P.D = Predicted Demand, A.D = Average Demand
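
For example, plugging hypothetical numbers into this equation (current price 100, price elasticity 2, predicted demand 150, average demand 100) gives an optimal price of 100 × (1/2) × (150/100) = 75:

```python
def optimal_price(current_price, price_elasticity, predicted_demand, average_demand):
    """Optimal price from the equation above: O.P = C.P * (1 / P.E) * (P.D / A.D)."""
    return current_price * (1.0 / price_elasticity) * (predicted_demand / average_demand)

# Hypothetical numbers, used only to illustrate the formula.
print(optimal_price(100.0, 2.0, 150.0, 100.0))  # -> 75.0
```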

 

Kris et al. (2015) showed how combining machine learning and optimization techniques into a pricing decision-support tool made a substantial financial impact on the business of Rue La La (an online retailer). They developed a dynamic pricing algorithm that learns a customer's purchase probability for a product at each price point by observing real-time customer purchase decisions, and then uses this knowledge to dynamically change prices to maximize total revenue throughout the event.

The algorithm was built upon the well-known Thompson Sampling algorithm used for multi-armed bandit problems, creatively incorporating inventory constraints into the model and algorithm. The demand prediction model has a non-parametric structure. Several features are used to build the demand prediction models; however, only three of them are related to price: the price itself, the discount, and the relative price of competing styles. A common way to address overfitting in regression trees is to use random forests, but they chose bagging instead for better interpretability. They showed that their algorithm has both strong theoretical performance guarantees and promising numerical results when compared to other algorithms developed for the same setting.
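
To give a rough idea of the Thompson Sampling ingredient only (without the inventory constraints the authors add), a Beta-Bernoulli sampler over a small set of assumed price points could look like this sketch:

```python
import random

PRICES = [29.90, 39.90, 49.90]                 # assumed candidate price points
beta_params = {p: [1.0, 1.0] for p in PRICES}  # Beta(alpha, beta) prior per price

def choose_price():
    """Sample a purchase probability for each price from its posterior and
    pick the price with the highest sampled expected revenue (price * probability)."""
    samples = {p: random.betavariate(a, b) for p, (a, b) in beta_params.items()}
    return max(PRICES, key=lambda p: p * samples[p])

def observe(price, purchased: bool):
    """Update the posterior of the offered price with the observed purchase decision."""
    if purchased:
        beta_params[price][0] += 1.0
    else:
        beta_params[price][1] += 1.0

# Usage: offer choose_price() to the next customer, then record the outcome.
p = choose_price()
observe(p, purchased=random.random() < 0.3)
```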

 

Conclusion:

In this blog, I have tried to discuss some useful approaches to solving Dynamic Pricing problems. According to the results of these research works, it seems that researchers can solve Dynamic Pricing problems efficiently using Dynamic Programming, Reinforcement Learning, and Multi-Agent Simulations. As Dynamic Pricing is a modern, sophisticated approach to Revenue Maximization, these approaches could be studied and implemented to increase revenue margins in different businesses.

 

References:

Dynamic Pricing Wikipedia: https://en.wikipedia.org/wiki/Dynamic_pricing

Amazon Dynamic Pricing: https://www.cio.com/article/2870961/report-analyzes-amazons-dynamic-pricing-strategy.html

Uber Dynamic Pricing: https://www.uber.com/en-GB/blog/uber-dynamic-pricing/

DP-Qing et al.: https://www.researchgate.net/profile/Qing_Ding/publication/220243763_Dynamic_Pricing_Through_Discounts_for_Optimizing_Multiple-Class_Demand_Fulfillment/links/0fcfd50dac33ac045e000000.pdf

DP-Xin et al.: https://publish.illinois.edu/xinchen/files/2015/05/refprice_03162015.pdf

DP-Norman et al.: https://github.com/normanrz/dynamic-prices

DP-Yuanming et al.: https://www.sciencedirect.com/science/article/pii/S0377221718309822#bib0025

RL-Roberto et al.: https://arxiv.org/pdf/1803.09967.pdf

RL-Rupal et al.: https://www.sciencedirect.com/science/article/pii/S030504831300100X

RL-Anonymous: https://openreview.net/pdf?id=HJMRvsAcK7

MAS-Prithviraj et al.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.202.5816&rep=rep1&type=pdf

MAS-Wenchong et al.: http://jasss.soc.surrey.ac.uk/21/2/12.html

Others-Share Bike: https://github.com/weiwei1988/2nd-Datascience-Cources-

Others-Rue La La: https://www.hbs.edu/faculty/Publication%20Files/kris%20Analytics%20for%20an%20Online%20Retailer_6ef5f3e6-48e7-4923-a2d4-607d3a3d943c.pdf