Publications of Rémi Munos

[1] N. Korda, Prashanth, and R. Munos. Fast gradient descent for drifting least squares regression, with application to bandits. In AAAI Conference on Artificial Intelligence, 2015. [ bib | .pdf ]
Online learning algorithms often require recomputing least squares regression estimates of parameters. We study improving the computational complexity of such algorithms by using stochastic gradient descent (SGD) type schemes in place of classic regression solvers. We show that SGD schemes efficiently track the true solutions of the regression problems, even in the presence of a drift. This finding, coupled with an O(d) improvement in complexity, where d is the dimension of the data, makes them attractive for implementation in big data settings. In the case when strong convexity in the regression problem is guaranteed, we provide bounds on the error both in expectation and high probability (the latter is often needed to provide theoretical guarantees for higher level algorithms), despite the drifting least squares solution. As an example of this case we prove that the regret performance of an SGD version of the PEGE linear bandit algorithm is worse than that of PEGE itself only by a factor of O(log^4 n). When strong convexity of the regression problem cannot be guaranteed, we investigate using an adaptive regularisation. We make an empirical study of an adaptively regularised, SGD version of LinUCB in a news article recommendation application, which uses the large scale news recommendation dataset from Yahoo! front page. These experiments show a large gain in computational complexity and a consistently low tracking error.
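
As a purely illustrative companion to this abstract, the following Python sketch tracks the least-squares solution of a slowly drifting linear model with plain SGD; the drift model, step size and dimension are arbitrary choices for the example, not the constants analysed in the paper.

```python
import numpy as np

# Illustrative sketch: plain SGD tracking the solution of a drifting least
# squares problem. The drift model, step size and dimension are arbitrary
# choices for the example, not the constants analysed in the paper.

rng = np.random.default_rng(0)
d, n_steps, step_size = 10, 5000, 0.05
theta_true = rng.normal(size=d)   # slowly drifting target parameter
theta_sgd = np.zeros(d)           # SGD iterate, updated in O(d) per sample

for t in range(n_steps):
    theta_true += 0.001 * rng.normal(size=d)     # drift of the true solution
    x = rng.normal(size=d)                       # random feature vector
    y = x @ theta_true + 0.1 * rng.normal()      # noisy observation
    grad = (x @ theta_sgd - y) * x               # gradient of 0.5 * (x'theta - y)^2
    theta_sgd -= step_size * grad

print("tracking error:", np.linalg.norm(theta_sgd - theta_true))
```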

[2] A. Carpentier, R. Munos, and A. Antos. Minimax strategy for stratified sampling for Monte Carlo. To appear in Journal of Machine Learning Research, 2015. [ bib | .pdf ]
We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy, called MC-UCB, that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two pseudo-regret analyses: a distribution-dependent bound of order O(n^{-3/2}) that depends on a measure of the disparity of the strata, and a distribution-free bound of order O(n^{-4/3}) that does not. We also provide the first problem-independent (minimax) lower bound for this problem and demonstrate that MC-UCB matches this lower bound both in terms of the number of samples n and the number of strata K. Finally, we link the pseudo-regret to the difference between the mean squared error of the estimated weighted average of the mean values of the arms and that of the optimal oracle strategy: this also provides a problem-dependent and a problem-independent rate for this measure of performance and, as a corollary, asymptotic optimality.
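
For illustration, here is a hedged Python sketch of an MC-UCB-style allocation: at each round the stratum with the largest weighted upper confidence bound on its standard deviation (per sample already allocated) receives the next sample. The confidence constant and the toy strata below are placeholder choices, not the tuned quantities from the paper.

```python
import numpy as np

# Hedged sketch of an MC-UCB-style allocation over K strata. The exploration
# constant c and the toy strata are placeholders, not the paper's quantities.

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])      # strata weights w_k
true_std = np.array([2.0, 0.5, 1.0])     # unknown standard deviations (simulation only)
K, budget, c = len(weights), 3000, 1.0

samples = [list(rng.normal(0.0, s, size=2)) for s in true_std]  # two initial pulls each

for t in range(2 * K, budget):
    stds = np.array([np.std(s, ddof=1) for s in samples])
    counts = np.array([len(s) for s in samples])
    # weighted upper confidence bound on the std, per sample already allocated
    index = weights * (stds + c * np.sqrt(np.log(budget) / counts)) / counts
    k = int(np.argmax(index))
    samples[k].append(rng.normal(0.0, true_std[k]))

estimate = sum(w * np.mean(s) for w, s in zip(weights, samples))
print("stratified estimate of the weighted mean:", estimate)
print("samples per stratum:", [len(s) for s in samples])
```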

[3] L. Busoniu, R. Munos, and E. Páll. An analysis of optimistic, best-first search for minimax sequential decision making. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2014. [ bib | .pdf ]
We consider problems in which a maximizer and a minimizer agent take actions in turn, such as games or optimal control with uncertainty modeled as an opponent. We extend the ideas of optimistic optimization to this setting, obtaining a search algorithm that has been previously considered as the best-first search variant of the B* method. We provide a novel analysis of the algorithm relying on a certain structure for the values of action sequences, under which earlier actions are more important than later ones. An asymptotic branching factor is defined as a measure of problem complexity, and it is used to characterize the relationship between computation invested and near-optimality. In particular, when action importance decreases exponentially, convergence rates are obtained. Throughout, examples illustrate analytical concepts such as the branching factor. In an empirical study, we compare the optimistic best-first algorithm with two classical game tree search methods, and apply it to a challenging HIV infection control problem.

[4] K. Máthé, L. Busoniu, R. Munos, and B. De Schutter. Optimistic planning with a limited number of action switches for near-optimal nonlinear control. In Proceedings of the 53rd IEEE Conference on Decision and Control, 2014. [ bib | .pdf ]
We consider infinite-horizon optimal control of nonlinear systems where the actions (inputs) are discrete. With the goal of limiting computations, we introduce a search algorithm for action sequences constrained to switch at most a given number of times between different actions. The new algorithm belongs to the optimistic planning class originating in artificial intelligence, and is called optimistic switch-limited planning (OSP). It inherits the generality of the OP class, so it works for nonlinear, nonsmooth systems with nonquadratic costs. We develop an analysis showing that the switch constraint leads to polynomial complexity in the search horizon, in contrast to the exponential complexity of state-of-the-art OP, and to a correspondingly faster convergence. The degree of the polynomial varies with the problem and is a meaningful measure for the difficulty of solving it. We study this degree in two representative, opposite cases. In simulations we first apply OSP to a problem where limited-switch sequences are near-optimal, and then in a networked control setting where the switch constraint must be satisfied in closed loop.

[5] T. Lattimore and R. Munos. Bounded regret for finite-armed structured bandits. In Advances in Neural Information Processing Systems, 2014. [ bib | .pdf ]
We study a new type of K-armed bandit problem where the expected return of one arm may depend on the returns of other arms. We present a new algorithm for this general class of problems and show that under certain circumstances it is possible to achieve finite expected cumulative regret. We also give problem-dependent lower bounds on the cumulative regret showing that at least in special cases the new algorithm is nearly optimal.

[6] M. Soare, A. Lazaric, and R. Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, 2014. [ bib | poster | http | .pdf ]
We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence, while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimate of the reward of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the G-optimality criterion used in optimal experimental design.

[7] B. Szorenyi, G. Kedenburg, and R. Munos. Optimistic planning in Markov decision processes using a generative model. In Advances in Neural Information Processing Systems, 2014. [ bib | .pdf ]
We consider the problem of online planning in a Markov decision process with discounted rewards for any given initial state. We consider the PAC sample complexity problem of computing, with probability 1-δ, an ε-optimal action using the smallest possible number of calls to the generative model (which provides reward and next-state samples). We design an algorithm, called StOP (for Stochastic Optimistic Planning), based on the “optimism in the face of uncertainty” principle. StOP can be used in the general setting, requires only a generative model, and enjoys a complexity bound that only depends on the local structure of the MDP.

[8] T. Kocák, G. Neu, M. Valko, and R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, 2014. [ bib | poster | .pdf ]
We consider online learning problems under a partial observability model capturing situations where the information conveyed to the learner is between full information and bandit feedback. In the simplest variant, we assume that in addition to its own loss, the learner also gets to observe losses of some other actions. The revealed losses depend on the learner's action and a directed observation system chosen by the environment. For this setting, we propose the first algorithm that enjoys near-optimal regret guarantees without having to know the observation system before selecting its actions. Along similar lines, we also define a new partial information setting that models online combinatorial optimization problems where the feedback received by the learner is between semi-bandit and full feedback. As the predictions of our first algorithm cannot be always computed efficiently in this setting, we propose another algorithm with similar properties and with the benefit of always being computationally efficient, at the price of a slightly more complicated tuning mechanism. Both algorithms rely on a novel exploration strategy called implicit exploration, which is shown to be more efficient both computationally and information-theoretically than previously studied exploration strategies for the problem.

[9] S. Sabato and R. Munos. Active regression by stratification. In Advances in Neural Information Processing Systems, 2014. [ bib | http | .pdf ]
We propose a new active learning algorithm for parametric linear regression with random design. We provide finite sample convergence guarantees for general distributions in the misspecified model. This is the first active learner for this setting that provably can improve over passive learning. Unlike other learning settings (such as classification), in regression the passive learning rate of O(1/ε) cannot in general be improved upon. Nonetheless, the so-called 'constant' in the rate of convergence, which is characterized by a distribution-dependent risk, can be improved in many cases. For a given distribution, achieving the optimal risk requires prior knowledge of the distribution. Following the stratification technique advocated in Monte-Carlo function integration, our active learner approaches the optimal risk using piecewise constant approximations.

[10] Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In International Conference on Machine Learning, June 2014. [ bib | .pdf ]
This paper proposes a new method for the K-armed dueling bandit problem, a variation on the regular K-armed bandit problem that offers only relative feedback about pairs of arms. Our approach extends the Upper Confidence Bound algorithm to the relative setting by using estimates of the pairwise probabilities to select a promising arm and applying Upper Confidence Bound with the winner as a benchmark. We prove a sharp finite-time regret bound of order O(K log T) on a very general class of dueling bandit problems that matches a lower bound proven by Yue et al. In addition, our empirical results using real data from an information retrieval application show that it greatly outperforms the state of the art.

[11] Michal Valko, Rémi Munos, Branislav Kveton, and Tomáš Kocák. Spectral bandits for smooth graph functions. In International Conference on Machine Learning, 2014. [ bib | .pdf ]
Smooth functions on graphs have wide applications in manifold and semi-supervised learning. In this paper, we study a bandit problem where the payoffs of arms are smooth on a graph. This framework is suitable for solving online learning problems that involve graphs, such as content-based recommendation. In this problem, each item we can recommend is a node and its expected rating is similar to its neighbors. The goal is to recommend items that have high expected ratings. We aim for algorithms whose cumulative regret with respect to the optimal policy does not scale poorly with the number of nodes. In particular, we introduce the notion of an effective dimension, which is small in real-world graphs, and propose two algorithms for solving our problem that scale linearly and sublinearly in this dimension. Our experiments on a real-world content recommendation problem show that a good estimator of user preferences for thousands of items can be learned from just tens of node evaluations.

[12] Tomáš Kocák, Michal Valko, Rémi Munos, and Shipra Agrawal. Spectral Thompson sampling. In AAAI Conference on Artificial Intelligence, 2014. [ bib | .pdf ]
Thompson Sampling (TS) has attracted a lot of interest due to its good empirical performance, in particular in computational advertising. Though successful, the tools for its performance analysis appeared only recently. In this paper, we describe and analyze the SpectralTS algorithm for a bandit problem, where the payoffs of the choices are smooth given an underlying graph. In this setting, each choice is a node of a graph and the expected payoffs of the neighboring nodes are assumed to be similar. Although the setting has applications both in recommender systems and advertising, the traditional algorithms would scale poorly with the number of choices. For that purpose we consider an effective dimension d, which is small in real-world graphs. We deliver an analysis showing that the regret of SpectralTS scales as d sqrt(T ln N) with high probability, where T is the time horizon and N is the number of choices. Since a d sqrt(T ln N) regret is comparable to the known results, SpectralTS offers a computationally more efficient alternative. We also show that our algorithm is competitive on both synthetic and real-world data.

[13] Prashanth, N. Korda, and R. Munos. Fast LSTD using stochastic approximation: Finite time analysis and application to traffic control. In European Conference on Machine Learning, 2014. [ bib | http | .pdf ]
We propose a stochastic approximation based method with randomisation of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our method results in an O(d) improvement in complexity in comparison to regular LSTD, where d is the dimension of the data. We provide convergence rate results for our proposed method, both in high probability and in expectation. Moreover, we also establish that using our scheme in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. This result coupled with the low complexity of our method makes it attractive for implementation in big data settings, where d is large. Further, we also analyse a similar low-complexity alternative for least squares regression and provide finite-time bounds there. We demonstrate the practicality of our method for LSTD empirically by combining it with the LSPI algorithm in a traffic signal control application.

[14] Ph. Preux, R. Munos, and M. Valko. Bandits attack function optimization. In IEEE Congress on Evolutionary Computation, 2014. [ bib | .pdf ]
We consider function optimization as a sequential decision making problem under a budget constraint. Such a constraint limits the number of objective function evaluations allowed during the optimization. We consider an algorithm inspired by a continuous version of the multi-armed bandit problem, which attacks this optimization problem by solving the tradeoff between exploration (initial quasi-uniform search of the domain) and exploitation (local optimization around the potential global maxima). We introduce the so-called Simultaneous Optimistic Optimization (SOO), a deterministic algorithm that works by domain partitioning. The benefits of such an approach are the guarantees on the returned solution and the numerical efficiency of the algorithm. We present this machine learning rooted approach to optimization, and provide an empirical assessment of SOO on the test suite of the CEC 2014 competition on single-objective real-parameter numerical optimization.
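
The sketch below illustrates the flavour of SOO on a one-dimensional domain: cells are split into three children and, in each sweep over depths, a cell is expanded only if its centre value beats the best value expanded at shallower depths. The depth limit, budget and splitting details are simplified choices for the example and differ from the algorithm as specified in the paper.

```python
import numpy as np

# Simplified 1-D sketch of SOO-style optimization by hierarchical partitioning.
# Each cell is evaluated at its centre; in every sweep over depths, the best
# cell at a given depth is expanded (split into three children) only if its
# value beats the best value expanded at shallower depths in that sweep.
# The depth limit and budget are arbitrary choices for the example.

def soo_1d(f, lo=0.0, hi=1.0, budget=200, h_max=12):
    leaves = [(lo, hi, f(0.5 * (lo + hi)), 0)]   # (left, right, value at centre, depth)
    evals, best = 1, leaves[0][2]
    while evals < budget:
        v_max = -np.inf
        for h in range(h_max + 1):
            at_h = [i for i, cell in enumerate(leaves) if cell[3] == h]
            if not at_h:
                continue
            i_best = max(at_h, key=lambda i: leaves[i][2])
            left, right, value, _ = leaves[i_best]
            if value >= v_max:                    # only expand if it beats shallower depths
                v_max = value
                edges = np.linspace(left, right, 4)
                leaves.pop(i_best)
                for a, b in zip(edges[:-1], edges[1:]):
                    val = f(0.5 * (a + b))
                    evals += 1
                    best = max(best, val)
                    leaves.append((a, b, val, h + 1))
                if evals >= budget:
                    break
        if v_max == -np.inf:                      # nothing left to expand
            break
    return best

print(soo_1d(lambda x: -(x - 0.37) ** 2))         # global maximum is 0 at x = 0.37
```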

[15] R. Munos. From bandits to Monte-Carlo Tree Search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7(1):1-130, 2014. [ bib | http ]
This work covers several aspects of the optimism in the face of uncertainty principle applied to large scale optimization problems under finite numerical budget. The initial motivation for the research reported here originated from the empirical success of the so-called Monte-Carlo Tree Search method popularized in computer Go and further extended to many other games as well as optimization and planning problems. Our objective is to contribute to the development of theoretical foundations of the field by characterizing the complexity of the underlying optimization problems and designing efficient algorithms with performance guarantees. The main idea presented here is that it is possible to decompose a complex decision making problem (such as an optimization problem in a large search space) into a sequence of elementary decisions, where each decision of the sequence is solved using a (stochastic) multi-armed bandit (a simple mathematical model for decision making in stochastic environments). This so-called hierarchical bandit approach (where the reward observed by a bandit in the hierarchy is itself the return of another bandit at a deeper level) possesses the nice feature of starting the exploration with a quasi-uniform sampling of the space, then focusing progressively on the most promising areas, at different scales, according to the evaluations observed so far, and eventually performing a local search around the global optima of the function. The performance of the method is assessed in terms of the optimality of the returned solution as a function of the number of function evaluations. Our main contribution to the field of function optimization is a class of hierarchical optimistic algorithms designed for general search spaces (such as metric spaces, trees, graphs, Euclidean spaces, ...) with different algorithmic instantiations depending on whether the evaluations are noisy or noiseless and whether some measure of the “smoothness” of the function is known or unknown. The performance of the algorithms depends on the local behavior of the function around its global optima, expressed in terms of the quantity of near-optimal states measured with some metric. If this local smoothness of the function is known then one can design very efficient optimization algorithms (with convergence rate independent of the space dimension), and when it is not known, we can build adaptive techniques that can, in some cases, perform almost as well as when it is known.

[16] A. Carpentier and R. Munos. Minimax number of strata for online stratified sampling: the case of noisy samples. To appear in Theoretical Computer Science, 2014. [ bib | .pdf ]

[17] Ronald Ortner, Daniil Ryabko, Peter Auer, and Rémi Munos. Regret bounds for restless Markov bandits. Theoretical Computer Science, 558:62-76, 2014. [ bib | .pdf ]
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that first represents the setting as an MDP which exhibits some special structural properties. In order to grasp this information we introduce the notion of ε-structured MDPs, which are a generalization of concepts like (approximate) state aggregation and MDP homomorphisms. We propose a general algorithm for learning ε-structured MDPs and show regret bounds that demonstrate that additional structural information enhances learning. Applied to the restless bandit setting, this algorithm achieves, after any T steps, a regret of order O(sqrt(T)) with respect to the best policy that knows the distributions of all arms. We make no assumptions on the Markov chains underlying each arm except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.

[18] G. Kedenburg, R. Fonteneau, and R. Munos. Aggregating optimistic planning trees for solving Markov decision processes. In Advances in Neural Information Processing Systems, 2013. [ bib | .pdf ]
This paper addresses the problem of online planning in Markov decision processes using a generative model and under a budget constraint. We propose a new algorithm, ASOP, which is based on the construction of a forest of single successor state planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are explored using a “safe” optimistic planning strategy which combines the optimistic principle (in order to explore the most promising part of the search space first) and a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We report numerical results on a benchmark problem showing that ASOP performs as well as state-of-the-art optimistic planning algorithms.

[19] N. Korda, E. Kaufmann, and R. Munos. Thompson sampling for one-dimensional exponential family bandits. In Advances in Neural Information Processing Systems, 2013. [ bib | http | .pdf ]
Thompson Sampling has been demonstrated to perform well in many complex bandit models; however, the theoretical guarantees available for the parametric multi-armed bandit are still limited to the Bernoulli case. Here we extend them by proving asymptotic optimality of the algorithm using the Jeffreys prior for 1-dimensional exponential family bandits. Our proof builds on previous work, but also makes extensive use of closed forms for Kullback-Leibler divergence and Fisher information (and thus Jeffreys prior) available in an exponential family. This allows us to give a finite-time exponential concentration inequality for posterior distributions on exponential families that may be of interest in its own right. Moreover our analysis covers some distributions for which no optimistic algorithm has yet been proposed, including heavy-tailed exponential families.

[20] Alexandra Carpentier and Rémi Munos. Toward optimal stratification for stratified Monte-Carlo integration. In International Conference on Machine Learning, 2013. [ bib | .pdf ]
We consider the problem of adaptive stratified sampling for Monte Carlo integration of a noisy function, given a finite budget n of noisy evaluations of the function. We tackle in this paper the problem of adapting both the number of samples in each stratum and the partition itself to the function. More precisely, it is interesting to refine the partition of the domain in areas where the noise or the variations of the function are very heterogeneous. On the other hand, having a (too) refined stratification is not optimal. Indeed, the more refined the stratification, the more difficult it is to adjust the allocation of the samples to the stratification, i.e. to sample more points where the noise or the variations of the function are larger. We provide in this paper an algorithm that selects online, among a large class of partitions, the partition that provides the optimal trade-off, and allocates the samples almost optimally on this partition.

[21] Michal Valko, Alexandra Carpentier, and Rémi Munos. Stochastic simultaneous optimistic optimization. In International Conference on Machine Learning, 2013. [ bib | demo | poster | slides | code | .pdf ]
We study the problem of global maximization of a function f given a finite number of evaluations perturbed by noise. We consider a very weak assumption on the function, namely that it is locally smooth (in some precise sense) with respect to some semi-metric, around one of its global maxima. Compared to previous works on bandits in general spaces (Kleinberg et al., 2008; Bubeck et al., 2011) our algorithm does not require the knowledge of this semi-metric. Our algorithm, StoSOO, follows an optimistic strategy to iteratively construct upper confidence bounds over the hierarchical partitions of the function domain to decide which point to sample next. A finite-time analysis of StoSOO shows that it performs almost as well as the best specifically-tuned algorithms even though the local smoothness of the function is not known.

[22] Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nello Cristianini. Finite-time analysis of kernelised contextual bandits. In Conference on Uncertainty in Artificial Intelligence, 2013. [ bib | poster | code | .pdf ]
We tackle the problem of online reward maximisation over a large finite set of actions described by their contexts. We focus on the case when the number of actions is too big to sample all of them even once. However, we assume that we have access to the similarities between actions' contexts and that the expected reward is an arbitrary linear function of the contexts' images in the related reproducing kernel Hilbert space (RKHS). We propose KernelUCB, a kernelised UCB algorithm, and give a cumulative regret bound through a frequentist analysis. For contextual bandits, the related algorithm GP-UCB turns out to be a special case of our algorithm, and our finite-time analysis improves the regret bound of GP-UCB for the agnostic case, both in terms of the kernel-dependent quantity and the RKHS norm of the reward function. Moreover, for the linear kernel, our regret bound matches the lower bound for contextual linear bandits.
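
A minimal sketch in the spirit of KernelUCB is given below: each context is scored by a kernel ridge regression estimate of its reward plus an exploration width derived from the kernel variance term. The RBF kernel, regularisation and exploration constants are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np

# Hedged sketch in the spirit of KernelUCB: score each context by a kernel
# ridge estimate of its reward plus an exploration width. Kernel bandwidth,
# regularisation and exploration constant are illustrative assumptions.

def rbf(a, b, bw=0.5):
    return np.exp(-np.sum((a - b) ** 2) / (2 * bw ** 2))

def kernel_ucb_choice(contexts, past_x, past_r, reg=1.0, eta=1.0):
    if not past_x:
        return 0                                      # no data yet: pick arbitrarily
    K = np.array([[rbf(a, b) for b in past_x] for a in past_x])
    K_inv = np.linalg.inv(K + reg * np.eye(len(past_x)))
    alpha = K_inv @ np.array(past_r)
    scores = []
    for x in contexts:
        k_x = np.array([rbf(x, b) for b in past_x])
        mean = k_x @ alpha                            # kernel ridge estimate
        width = np.sqrt(max(rbf(x, x) - k_x @ K_inv @ k_x, 0.0))
        scores.append(mean + eta * width)             # optimistic score
    return int(np.argmax(scores))

# toy usage on random contexts with a noisy reward depending on the context
rng = np.random.default_rng(0)
contexts = [rng.normal(size=3) for _ in range(20)]
past_x, past_r = [], []
for t in range(50):
    a = kernel_ucb_choice(contexts, past_x, past_r)
    reward = contexts[a][0] + 0.1 * rng.normal()      # toy reward function
    past_x.append(contexts[a])
    past_r.append(reward)
print("last chosen context:", contexts[a])
```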

[23] R. Fonteneau, L. Busoniu, and R. Munos. Optimistic planning for belief-augmented Markov decision processes. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2013. [ bib | .pdf ]
This paper presents the Bayesian Optimistic Planning (BOP) algorithm, a novel model-based Bayesian reinforcement learning approach. BOP extends the planning approach of the Optimistic Planning for Markov Decision Processes (OP-MDP) algorithm to contexts where the transition model of the MDP is initially unknown and progressively learned through interactions with the environment. The knowledge about the unknown MDP is represented with a probability distribution over all possible transition models using Dirichlet distributions, and the BOP algorithm plans in the belief-augmented state space constructed by concatenating the original state vector with the current posterior distribution over transition models. We show that BOP becomes Bayesian optimal when the budget parameter increases to infinity. Preliminary empirical validations show promising performance.

[24] Lucian Busoniu, Alexander Daniels, Remi Munos, and Robert Babuska. Optimistic planning for continuous-action deterministic systems. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2013. [ bib | .pdf ]
We consider the class of online planning algorithms for optimal control, which compared to dynamic programming are relatively unaffected by large state dimensionality. We introduce a novel planning algorithm called SOOP that works for deterministic systems with continuous states and actions. SOOP is the first method to explore the true solution space, consisting of infinite sequences of continuous actions, without requiring knowledge about the smoothness of the system. SOOP can be used parameter-free at the cost of more model calls, but we also propose a more practical variant tuned by a parameter which balances finer discretization with longer planning horizons. Experiments on three problems show SOOP reliably ranks among the best algorithms, fully dominating competing methods when the problem requires both long horizons and fine discretization.

[25] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516-1541, 2013. [ bib | .pdf | .pdf ]
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins (1979), based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: The kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins (1985) and Burnetas and Katehakis (1996), respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state-of-the-art.
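
As an illustration of the index, here is a Python sketch of kl-UCB instantiated for Bernoulli rewards: the index of an arm is the largest mean q whose binary KL divergence from the empirical mean stays below an exploration level, found by bisection. For simplicity the level is log t, dropping the c log log t refinement used in the paper.

```python
import numpy as np

# Sketch of the kl-UCB index for Bernoulli rewards: the index of an arm is the
# largest mean q whose binary KL divergence from the empirical mean stays below
# an exploration level (here log t; the paper's c*log(log t) term is dropped).

def bernoulli_kl(p, q, eps=1e-12):
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(p_hat, n_pulls, t):
    level = np.log(max(t, 2)) / n_pulls
    lo, hi = p_hat, 1.0
    for _ in range(30):                      # bisection on q in [p_hat, 1]
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(p_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# toy run on two Bernoulli arms
rng = np.random.default_rng(0)
means = [0.4, 0.6]
counts = np.ones(2)
sums = np.array([float(rng.random() < m) for m in means])   # one initial pull per arm
for t in range(2, 2000):
    indices = [kl_ucb_index(sums[a] / counts[a], counts[a], t) for a in range(2)]
    a = int(np.argmax(indices))
    sums[a] += rng.random() < means[a]
    counts[a] += 1
print("pulls of the better arm:", counts[1], "out of", counts.sum())
```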

[26] M. G. Azar, R. Munos, and H.J. Kappen. Minimax PAC-bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning Journal, 91(3):325-349, 2013. [ bib | .pdf ]
We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample-complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ) / ((1-γ)^3 ε^2)) state-transition samples are required to find an ε-optimal estimation of the action-value function with probability (w.p.) 1-δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ) / ((1-γ)^3 ε^2)) samples is required to find an ε-optimal policy w.p. 1-δ. We also prove a matching lower bound of Ω(N log(N/δ) / ((1-γ)^3 ε^2)) on the sample complexity of estimating the optimal action-value function. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bound matches the lower bound in terms of N, ε, δ and 1/(1-γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state-of-the-art in terms of their dependence on 1/(1-γ).

[27] E.M. Thomas, M. Clerc, A. Carpentier, E. Daucé, D. Devlaminck, and R. Munos. Optimizing P300-speller sequences by RIP-ping groups apart. In IEEE/EMBS 6th international conference on neural engineering, 2013. [ bib | http | .pdf ]
So far P300-speller design has put very little emphasis on the design of optimized flash patterns, a surprising fact given the importance of the sequence of flashes on the selection outcome. Previous work in this domain has consisted in studying consecutive flashes, to prevent the same letter or its neighbors from flashing consecutively. To this effect, the flashing letters form more random groups than the original row-column sequences for the P300 paradigm, but the groups remain fixed across repetitions. This has several important consequences, among which a lack of discrepancy between the scores of the different letters. The new approach proposed in this paper accumulates evidence for individual elements, and optimizes the sequences by relaxing the constraint that letters should belong to fixed groups across repetitions. The method is inspired by the theory of Restricted Isometry Property matrices in Compressed Sensing, and it can be applied to any display grid size, and for any target flash frequency. This leads to P300 sequences which are shown here to perform significantly better than the state of the art, in simulations and online tests.

[28] J. Fruitet, A. Carpentier, R. Munos, and M. Clerc. Automatic motor task selection via a bandit algorithm for a brain-controlled button. Journal of Neural Engineering, 10(1), 2013. [ bib | http | .pdf ]
Objective. Brain-computer interfaces (BCIs) based on sensorimotor rhythms use a variety of motor tasks, such as imagining moving the right or left hand, the feet or the tongue. Finding the tasks that yield best performance, specifically to each user, is a time-consuming preliminary phase to a BCI experiment. This study presents a new adaptive procedure to automatically select (online) the most promising motor task for an asynchronous brain-controlled button. Approach. We develop for this purpose an adaptive algorithm UCB-classif based on the stochastic bandit theory and design an EEG experiment to test our method. We compare (offline) the adaptive algorithm to a naive selection strategy which uses uniformly distributed samples from each task. We also run the adaptive algorithm online to fully validate the approach. Main results. By not wasting time on inefficient tasks, and focusing on the most promising ones, this algorithm results in a faster task selection and a more efficient use of the BCI training session. More precisely, the offline analysis reveals that the use of this algorithm can reduce the time needed to select the most appropriate task by almost half without loss in precision, or alternatively, allow us to investigate twice the number of tasks within a similar time span. Online tests confirm that the method leads to an optimal task selection. Significance. This study is the first one to optimize the task selection phase by an adaptive procedure. By increasing the number of tasks that can be tested in a given time span, the proposed method could contribute to reducing 'BCI illiteracy'.

[29] O. Maillard and R. Munos. Linear regression with random projections. Journal of Machine Learning Research, 13:2735-2772, 2012. [ bib | .pdf ]
We investigate a method for regression that makes use of a randomly generated subspace GP (of finite dimension P) of a given large (possibly infinite) dimensional function space F, for example, L2([0,1]d). GP is defined as the span of P random features that are linear combinations of basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method shares with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F, and derive excess risk bounds for a specific regression algorithm (least squares regression in GP). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanation purposes and leaves room for future developments.
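
The following toy Python sketch mirrors the construction described above: P random features are formed as Gaussian-weighted combinations of a fixed basis, and least-squares regression is performed in the resulting subspace GP. The basis, dimensions and target function are arbitrary choices for the example.

```python
import numpy as np

# Toy sketch of least-squares regression in a random subspace G_P: P random
# features are Gaussian-weighted combinations of D fixed basis functions of F.
# The basis, sizes and target function are arbitrary choices for the example.

rng = np.random.default_rng(0)
D, P, n = 200, 20, 100                                # basis size, subspace dim, samples

def basis(x):
    ks = np.arange(1, D + 1)
    return np.sin(np.pi * np.outer(x, ks)) / ks       # D basis functions on [0, 1]

A = rng.normal(size=(D, P)) / np.sqrt(P)              # random Gaussian mixing weights
x = rng.random(n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)  # noisy target function

features = basis(x) @ A                               # random features of the inputs
theta, *_ = np.linalg.lstsq(features, y, rcond=None)  # least squares in G_P

x_test = np.linspace(0, 1, 5)
print("predictions:", basis(x_test) @ A @ theta)
```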

[30] J. Fruitet, A. Carpentier, R. Munos, and M. Clerc. Bandit algorithms boost motor-task selection for brain computer interfaces. In Advances in Neural Information Processing Systems, 2012. [ bib ]
Brain-computer interfaces (BCI) allow users to “communicate” with a computer without using their muscles. BCIs based on sensorimotor rhythms use imaginary motor tasks, such as moving the right or left hand, to send control signals. The performance of a BCI can vary greatly across users and also depends on the tasks used, making the problem of appropriate task selection an important issue. This study presents a new procedure to automatically select as fast as possible a discriminant motor task for a brain-controlled button. We develop for this purpose an adaptive algorithm, UCB-classif, based on the stochastic bandit theory. This shortens the training stage, thereby allowing the exploration of a greater variety of tasks. By not wasting time on inefficient tasks, and focusing on the most promising ones, this algorithm results in a faster task selection and a more efficient use of the BCI training session. Comparing the proposed method to the standard practice in task selection, for a fixed time budget, UCB-classif leads to an improved classification rate, and for a fixed classification rate, to a reduction of the time spent in training by 50%.

[31] A. Carpentier and R. Munos. Adaptive stratified sampling for Monte-Carlo integration of differentiable functions. In Advances in Neural Information Processing Systems, 2012. [ bib ]
We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations of the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (this notion shares similarities with low discrepancy sampling). We prove that the estimate returned by the algorithm is almost as accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and provide a finite-sample analysis.

[32] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, 2012. [ bib | .pdf ]
In stochastic multi-armed bandits the objective is to solve the exploration-exploitation dilemma and ultimately maximize the expected reward. Nonetheless, in many practical problems, maximizing the expected reward is not the most desirable objective. In this paper, we introduce a novel setting based on the principle of risk-aversion where the objective is to compete against the arm with the best risk-return trade-off. This setting proves to be intrinsically more difficult than the standard multi-armed bandit setting due in part to an exploration risk which introduces a regret associated with the variability of an algorithm. Using variance as a measure of risk, we introduce two new algorithms, we investigate their theoretical guarantees, and we report preliminary empirical results.
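
As a rough illustration of the risk-averse objective (not the paper's actual algorithms, whose confidence terms differ), the sketch below scores each arm by its empirical mean-variance lowered by an exploration bonus and pulls the arm with the smallest score.

```python
import numpy as np

# Rough illustration of a risk-averse selection rule: each arm is scored by its
# empirical mean-variance (variance minus rho * mean) lowered by an exploration
# bonus, and the arm with the smallest score is pulled. The bonus and rho are
# illustrative, not the confidence terms analysed in the paper.

rng = np.random.default_rng(0)
arm_params = [(1.0, 1.5), (0.8, 0.2)]   # (mean, std): arm 0 has higher mean but higher risk
rho, horizon = 1.0, 2000
pulls = [[rng.normal(m, s) for _ in range(5)] for m, s in arm_params]   # initial pulls

for t in range(10, horizon):
    scores = []
    for rewards in pulls:
        n = len(rewards)
        mean_variance = np.var(rewards) - rho * np.mean(rewards)
        scores.append(mean_variance - np.sqrt(5 * np.log(t) / n))       # optimistic (lower) score
    a = int(np.argmin(scores))
    m, s = arm_params[a]
    pulls[a].append(rng.normal(m, s))

print("pulls per arm:", [len(p) for p in pulls])
```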

[33] L. Busoniu and R. Munos. Optimistic planning for Markov decision processes. In International conference on Artificial Intelligence and Statistics, 2012. [ bib | .pdf ]
The reinforcement learning community has recently intensified its interest in online planning methods, due to their relative independence of the state-space size. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree node expansions. While this optimistic planning requires a finite number of actions and possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and performance bound for some interesting classes of MDPs illustrates that the algorithm works better when there are fewer near-optimal policies and less uniform transition probabilities.

[34] A. Carpentier and R. Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In International conference on Artificial Intelligence and Statistics, 2012. [ bib | .pdf ]
We consider a linear stochastic bandit problem where the dimension K of the unknown parameter θ is larger than the sampling budget n. In such cases, it is in general impossible to derive sub-linear regret bounds since usual linear bandit algorithms have a regret of order O(K sqrt(n)). In this paper we assume that θ is S-sparse, i.e. has at most S non-zero components, and that the space of arms is the unit ball for the L2 norm. We combine ideas from Compressed Sensing and Bandit Theory and derive an algorithm with a regret bound of order O(S sqrt(n)). We detail an application to the problem of optimizing a function that depends on many variables but among which only a small number of them (initially unknown) are relevant.

[35] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite time analysis. In International Conference on Algorithmic Learning Theory, 2012. [ bib | .pdf ]
The question of the optimality of Thompson Sampling for solving the stochastic multi-armed bandit problem had been open since 1933. In this paper we answer it positively for the case of Bernoulli rewards by providing the first finite-time analysis that matches the asymptotic rate given in the Lai and Robbins lower bound for the cumulative regret. The proof is accompanied by a numerical comparison with other optimal policies, experiments that have been lacking in the literature until now for the Bernoulli case.
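
For reference, a minimal Python sketch of Thompson Sampling for Bernoulli rewards with a uniform Beta(1, 1) prior, the setting analysed in the paper; the toy arm means are arbitrary.

```python
import numpy as np

# Minimal sketch of Thompson Sampling for Bernoulli rewards with a Beta(1, 1)
# prior: sample a mean from each arm's posterior and pull the argmax.
# The toy arm means are arbitrary.

rng = np.random.default_rng(0)
means = [0.45, 0.55]
successes, failures = np.ones(2), np.ones(2)    # Beta(1, 1) priors

for t in range(5000):
    theta = rng.beta(successes, failures)       # one posterior sample per arm
    a = int(np.argmax(theta))
    reward = rng.random() < means[a]
    successes[a] += reward
    failures[a] += 1 - reward

print("posterior means:", successes / (successes + failures))
print("pulls per arm:", successes + failures - 2)
```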

[36] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless Markov bandits. In International Conference on Algorithmic Learning Theory, 2012. [ bib | .pdf ]
We consider the restless Markov bandit problem, in which the state of each arm evolves according to a Markov process independently of the learner's actions. We suggest an algorithm that after T steps achieves O(sqrt(T)) regret with respect to the best policy that knows the distributions of all arms. No assumptions on the Markov chains are made except that they are irreducible. In addition, we show that index-based policies are necessarily suboptimal for the considered problem.

[37] A. Carpentier and R. Munos. Minimax number of strata for online stratified sampling given noisy samples. In International Conference on Algorithmic Learning Theory, 2012. [ bib | .pdf ]
We consider the problem of online stratified sampling for Monte Carlo integration of a function given a finite budget of n noisy evaluations of the function. More precisely we focus on the problem of choosing the number of strata K as a function of the budget n. We provide asymptotic and finite-time results on how an oracle that has access to the function would choose the number of strata optimally. In addition we prove a lower bound on the learning rate for the problem of stratified Monte-Carlo. As a result, we are able to state, by improving the bound on its performance, that the algorithm MC-UCB is minimax optimal both in terms of the number of samples n and the number of strata K, up to a log(nK) factor. This enables us to deduce a minimax optimal bound on the difference between the performance of the estimate output by MC-UCB and the performance of the estimate output by the best oracle static strategy, on the class of Hölder continuous functions, and up to a factor log(n).

[38] M. G. Azar, R. Munos, and H.J. Kappen. On the sample complexity of reinforcement learning with a generative model. In International Conference on Machine Learning, 2012. [ bib | .pdf ]
We consider the problem of learning the optimal action-value function in discounted-reward Markov decision processes (MDPs). We prove a new PAC bound on the sample-complexity of the model-based value iteration algorithm in the presence of the generative model, which indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ) / ((1-γ)^3 ε^2)) samples are required to find an ε-optimal estimation of the action-value function with probability 1-δ. We also prove a matching lower bound of Ω(N log(N/δ) / ((1-γ)^3 ε^2)) on the sample complexity of estimating the optimal action-value function by any RL algorithm. To the best of our knowledge, this is the first matching result on the sample complexity of estimating the optimal (action-)value function in which the upper bound matches the lower bound of RL in terms of N, ε, δ and 1/(1-γ). Also, both our lower bound and our upper bound significantly improve on the state-of-the-art in terms of 1/(1-γ).

[39] L. Busoniu, R. Munos, and R. Babuska. A review of optimistic planning in Markov decision processes. In Frank Lewis and Derong Liu, editors, Reinforcement Learning and Adaptive Dynamic Programming for feedback control, pages 494-516. Wiley, 2012. [ bib | .pdf ]
We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state, and apply the first action of the best policy found. An overall receding-horizon algorithm results, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on areas with largest upper bounds on the value or upper confidence bounds, in the stochastic case. The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, in order to obtain several intuitive algorithms with good performance guarantees. We describe in detail three recent such algorithms, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.

[40] A. Carpentier and R. Munos. Finite time analysis of stratified sampling for Monte Carlo. In Advances in Neural Information Processing Systems, 2011. [ bib | .pdf ]
We consider the problem of stratified sampling for Monte-Carlo integration. We model this problem in a multi-armed bandit setting, where the arms represent the strata, and the goal is to estimate a weighted average of the mean values of the arms. We propose a strategy that samples the arms according to an upper bound on their standard deviations and compare its estimation quality to an ideal allocation that would know the standard deviations of the strata. We provide two regret analyses: a distribution-dependent bound of order O(n^{-3/2}) that depends on a measure of the disparity of the strata, and a distribution-free bound of order O(n^{-4/3}) that does not.

[41] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. J. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [ bib | .pdf ]
We introduce a new convergent variant of Q-learning, called speedy Q-learning (SQL), to address the problem of slow convergence in the standard form of the Q-learning algorithm. We prove a PAC bound on the performance of SQL, which shows that for an MDP with n state-action pairs and discount factor γ, only T = O(log(n)/(ε^2 (1-γ)^4)) steps are required for the SQL algorithm to converge to an ε-optimal action-value function with high probability. This bound has a better dependency on 1/ε and 1/(1-γ), and thus, is tighter than the best available result for Q-learning. Our bound is also superior to the existing results for both model-free and model-based instances of batch Q-value iteration that are considered to be more efficient than the incremental methods like Q-learning.
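
The sketch below illustrates a speedy-Q-learning-style update on a toy two-state MDP with a generative model, combining the empirical Bellman operator applied to the current and previous iterates; the MDP, the number of iterations and the step-size schedule alpha_k = 1/(k+1) are illustrative choices rather than a faithful reproduction of the paper's algorithm.

```python
import numpy as np

# Illustrative speedy-Q-learning-style update on a toy 2-state, 2-action MDP
# with a generative model. Each iteration applies the empirical Bellman
# operator (built from one shared next-state sample per state-action pair) to
# both the previous and the current iterate. The MDP and the step sizes
# alpha_k = 1/(k+1) are illustrative choices.

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
R = rng.random((nS, nA))                          # toy rewards
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # toy transition kernel

def empirical_bellman_pair(Q_old, Q_new):
    T_old, T_new = np.zeros_like(Q_old), np.zeros_like(Q_new)
    for s in range(nS):
        for a in range(nA):
            s_next = rng.choice(nS, p=P[s, a])    # one sample from the generative model
            T_old[s, a] = R[s, a] + gamma * Q_old[s_next].max()
            T_new[s, a] = R[s, a] + gamma * Q_new[s_next].max()
    return T_old, T_new

Q_prev = np.zeros((nS, nA))
Q = np.zeros((nS, nA))
for k in range(1, 5000):
    alpha = 1.0 / (k + 1)
    TQ_prev, TQ = empirical_bellman_pair(Q_prev, Q)
    Q_next = Q + alpha * (TQ_prev - Q) + (1 - alpha) * (TQ - TQ_prev)
    Q_prev, Q = Q, Q_next

print("estimated action values:\n", Q)
```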

[42] A. Carpentier, O. A. Maillard, and R. Munos. Sparse recovery with Brownian sensing. In Advances in Neural Information Processing Systems, 2011. [ bib | .pdf ]
We consider the problem of recovering the parameter α of a sparse function f (i.e. the number of non-zero entries of α is small compared to the number K of features) given noisy evaluations of f at a set of well-chosen sampling points. We introduce an additional randomization process, called Brownian sensing, based on the computation of stochastic integrals, which produces a Gaussian sensing matrix for which good recovery properties are proven, independently of the number of sampling points N, even when the features are arbitrarily non-orthogonal. Under the assumption that f is Hölder continuous with exponent at least 1/2, we provide an estimate of the parameter with quadratic error O(||η||/N), where η is the observation noise. The method uses a set of sampling points uniformly distributed along a one-dimensional curve selected according to the features. We report numerical experiments illustrating our method.

[43] R. Munos. Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in Neural Information Processing Systems, 2011. [ bib | .pdf ]
We consider a global optimization problem of a deterministic function f in a semi-metric space, given a finite budget of n evaluations. The function f is assumed to be locally smooth (around one of its global maxima) with respect to a semi-metric. We describe two algorithms based on optimistic exploration that use a hierarchical partitioning of the space at all scales. A first contribution is an algorithm, DOO, that requires the knowledge of this semi-metric. We report a finite-sample performance bound in terms of a measure of the quantity of near-optimal states. We then define a second algorithm, SOO, which does not require the knowledge of the semi-metric under which f is smooth, and whose performance is almost as good as that of DOO optimally-fitted.

[44] O. A. Maillard, R. Munos, and D. Ryabko. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems, 2011. [ bib | .pdf ]
The problem of selecting the right state-representation in a reinforcement learning problem is considered. Several models (functions mapping past observations to a finite set) of the observations are given, and it is known that for at least one of these models the resulting state dynamics are indeed Markovian. Without knowing which of the models is the correct one, or what the probabilistic characteristics of the resulting MDP are, it is required to obtain as much reward as the optimal policy for the correct model (or for the best of the correct models, if there are several). We propose an algorithm that achieves this, with a regret of order T^{2/3} where T is the horizon time.

[45] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of least-squares policy iteration. Journal of Machine Learning Research, 13:3041-3074, 2011. [ bib | .pdf ]
In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is beta-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
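
For illustration, here is a minimal Python sketch of the LSTD solution analysed in the paper, computed from a single trajectory of a toy Markov chain; the features, rewards and the small ridge term added for numerical safety are arbitrary choices for the example.

```python
import numpy as np

# Minimal sketch of the LSTD solution: from one trajectory of a toy Markov
# chain under a fixed policy, accumulate A = sum phi(x_t)(phi(x_t) - gamma *
# phi(x_{t+1}))' and b = sum phi(x_t) r_t, then solve A theta = b (a small
# ridge term is added for numerical safety). Features and rewards are toy
# choices for the example.

rng = np.random.default_rng(0)
nS, d, gamma, T = 5, 3, 0.95, 2000
P = rng.dirichlet(np.ones(nS), size=nS)     # toy Markov chain under a fixed policy
rewards = rng.random(nS)
phi = rng.normal(size=(nS, d))              # arbitrary feature map

A, b = np.zeros((d, d)), np.zeros(d)
s = 0
for t in range(T):
    s_next = rng.choice(nS, p=P[s])
    A += np.outer(phi[s], phi[s] - gamma * phi[s_next])
    b += phi[s] * rewards[s]
    s = s_next

theta = np.linalg.solve(A + 1e-6 * np.eye(d), b)
print("approximate value function:", phi @ theta)
```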

[46] Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz. Finite-time analysis of multi-armed bandits problems with Kullback-Leibler divergences. In Conference On Learning Theory, 2011. [ bib | http | .pdf ]
We consider a Kullback-Leibler-based algorithm for the stochastic multi-armed bandit problem in the case of distributions with finite supports (not necessarily known beforehand), whose asymptotic regret matches the lower bound of Burnetas and Katehakis (1996). Our contribution is to provide a finite-time analysis of this algorithm; we get bounds whose main terms are smaller than the ones of previously known algorithms with finite-time analyses (like UCB-type algorithms).

[47] M. Ghavamzadeh, A. Lazaric, R. Munos, and M. Hoffman. Finite-sample analysis of Lasso-TD. In International Conference on Machine Learning, 2011. [ bib | .pdf ]
In this paper, we analyze the performance of Lasso-TD, a modification of LSTD in which the projection operator is defined as a Lasso problem. We first show that Lasso-TD is guaranteed to have a unique fixed point and its algorithmic implementation coincides with the recently presented LARS-TD and LC-TD methods. We then derive two bounds on the prediction error of Lasso-TD in the Markov design setting, i.e., when the performance is evaluated on the same states used by the method. The first bound makes no assumption, but has a slow rate w.r.t. the number of samples. The second bound is under an assumption on the empirical Gram matrix, called the compatibility condition, but has an improved rate and directly relates the prediction error to the sparsity of the value function in the feature space at hand.

[48] Alexandra Carpentier, Mohammad Ghavamzadeh, Alessandro Lazaric, Rémi Munos, and Peter Auer. Upper confidence bounds algorithms for active learning in multi-armed bandits. In Algorithmic Learning Theory, 2011. [ bib | http | .pdf ]
In this paper, we study the problem of estimating the mean values of all the arms uniformly well in the multi-armed bandit setting. If the variances of the arms were known, one could design an optimal sampling strategy by pulling the arms proportionally to their variances. However, since the distributions are not known in advance, we need to design adaptive sampling strategies to select an arm at each round based on the previously observed samples. We describe two strategies based on pulling the arms proportionally to an upper bound on their variances and derive regret bounds for these strategies. We show that the performance of these allocation strategies depends not only on the variances of the arms but also on the full shape of their distributions.

[49] Odalric-Ambrym Maillard and Rémi Munos. Adaptive bandits: Towards the best history-dependent strategy. In International conference on Artificial Intelligence and Statistics, 2011. [ bib | .pdf ]
We consider multi-armed bandit games with possibly adaptive opponents. We introduce models of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some models. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model. This allows us to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When only one model is considered, we derive tractable algorithms achieving a tight regret (at time T) bounded by O(sqrt(TAC)), where C is the number of classes. Now, when many models are available, all known algorithms achieving a nice regret of O(sqrt(T)) are unfortunately not tractable and scale poorly with the number of models. Our contribution here is to provide tractable algorithms with regret bounded by T^{2/3} C^{1/3} log(|Θ|)^{1/2}.

[50] L. Busoniu, R. Munos, B. De Schutter, and R. Babuska. Optimistic planning for sparsely stochastic systems. In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2011. [ bib | .pdf ]
We propose an online planning algorithm for finite action, sparsely stochastic Markov decision processes, in which the random state transitions can only end up in a small number of possible next states. The algorithm builds a planning tree by iteratively expanding states, where each expansion exploits sparsity to add all possible successor states. Each state to expand is actively chosen to improve the knowledge about action quality, and this allows the algorithm to return a good action after a strictly limited number of expansions. More specifically, the active selection method is optimistic in that it chooses the most promising states first, so the novel algorithm is called optimistic planning for sparsely stochastic systems. We note that the new algorithm can also be seen as model-predictive (receding-horizon) control. The algorithm obtains promising numerical results, including the successful online control of a simulated HIV infection with stochastic drug effectiveness.

[51] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412:1832-1852, 2011. [ bib | http | .pdf ]
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of forecasters that perform an on-line exploration of the arms. These forecasters are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. One of the main results in the case of a finite number of arms is a general lower bound on the simple regret of a forecaster in terms of its cumulative regret: the smaller the latter, the larger the former. Keeping this result in mind, we then exhibit upper bounds on the simple regret of some forecasters. The paper ends with a study devoted to continuous-armed bandit problems; we show that the simple regret can be minimized with respect to a family of probability distributions if and only if the cumulative regret can be minimized for it. Based on this equivalence, we are able to prove that the separable metric spaces are exactly the metric spaces on which these regrets can be minimized with respect to the family of all probability distributions with continuous mean-payoff functions.

[52] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655-1695, 2011. [ bib | http | .pdf ]
We consider a generalization of stochastic bandits where the set of arms, X, is allowed to be a generic measurable space and the mean-payoff function is locally Lipschitz with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by sqrt(n), that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches.

[53] A. Lazaric and R. Munos. Learning with stochastic inputs and adversarial outputs. To appear in Journal of Computer and System Sciences, 2011. [ bib | .pdf ]
Most of the research in online learning is focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independent and identically distributed according to a stationary probability distribution. Nonetheless, in a number of domains the relationship between inputs and outputs may be adversarial, whereas input instances are i.i.d. from a stationary distribution (e.g., user preferences). This scenario can be formalized as a learning problem with stochastic inputs and adversarial outputs. In this paper, we introduce this novel stochastic-adversarial learning setting and we analyze its learnability. In particular, we show that in binary classification, given a hypothesis space H with finite VC-dimension, it is possible to design an algorithm which incrementally builds a suitable finite set of hypotheses from H used as input for an exponentially weighted forecaster and achieves a cumulative regret of order sqrt(n VC(H) log n) with overwhelming probability. This result shows that whenever inputs are i.i.d., it is possible to solve any binary classification problem using a finite VC-dimension hypothesis space with a sub-linear regret, independently of the way labels are generated (either stochastic or adversarial). We also discuss extensions to multi-label classification, regression, learning from experts and bandit settings with stochastic side information, and application to games.

[54] L. Busoniu, A. Lazaric, M. Ghavamzadeh, R. Munos, R. Babuska, and B. De Schutter. Least-squares methods for policy iteration. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art, number 12, pages 75-109. Springer, 2011. [ bib | .pdf ]
Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces, by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detailing them down to fully specified algorithms. We pay attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.

[55] J.-Y. Audibert, S. Bubeck, and R. Munos. Bandit view on noisy optimization. In Optimization for Machine Learning. MIT Press, 2010. [ bib | .pdf ]
This chapter deals with the problem of making the best use of a finite number of noisy evaluations to optimize an unknown function. We are primarily concerned with the case where the function is defined over a finite set. In this discrete setting, we discuss various objectives for the learner, from optimizing the allocation of a given budget of evaluations to optimal stopping time problems with (epsilon,delta)-PAC guarantees. We also consider the so-called online optimization framework, where the result of an evaluation is associated to a reward, and the goal is to maximize the sum of obtained rewards. In this case, we extend the algorithms to continuous sets and (weakly) Lipschitzian functions (with respect to a prespecified metric).

[56] O.-A. Maillard and R. Munos. Scrambled objects for least-squares regression. In Advances in Neural Information Processing Systems, 2010. [ bib | .pdf ]
We consider least-squares regression using a randomly generated subspace G ⊂ F of finite dimension P, where F is a function space of infinite dimension, e.g. L_2([0,1]^d). G is defined as the span of P random features that are linear combinations of the basis functions of F weighted by i.i.d. Gaussian random coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In the latter case, the resulting Gaussian objects are called scrambled wavelets, and we show that they make it possible to approximate functions in Sobolev spaces H^s([0,1]^d). An interesting aspect of the resulting bounds is that they do not depend on the distribution from which the data are generated, which is important in the statistical regression setting considered here. Randomization enables adaptation to any possible distribution.
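As an illustration of the general recipe (random features obtained as Gaussian combinations of a fixed basis, followed by ordinary least squares in the random subspace), here is a minimal sketch in NumPy. The one-dimensional hat basis, the number of features P, and the toy regression target are made-up choices, and this is not claimed to reproduce the multi-resolution scrambled-wavelet construction of the paper.

```python
import numpy as np

def hat_basis(x, n_basis):
    """Evaluate n_basis triangular ('hat') functions on a uniform grid over [0, 1]."""
    centers = np.linspace(0, 1, n_basis)
    width = centers[1] - centers[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / width)

def random_feature_matrix(x, n_basis, P, rng):
    """P random features: i.i.d. Gaussian combinations of the hat basis functions."""
    B = hat_basis(x, n_basis)                        # (n, n_basis)
    A = rng.normal(size=(n_basis, P)) / np.sqrt(P)   # random mixing weights
    return B @ A                                     # (n, P)

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # toy regression target

Phi = random_feature_matrix(x, n_basis=64, P=30, rng=rng)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                  # least squares in the random subspace
print("train RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)))
```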

[57] M. Ghavamzadeh, A. Lazaric, O.-A. Maillard, and R. Munos. LSTD with random projections. In Advances in Neural Information Processing Systems, 2010. [ bib | .pdf ]
We consider the problem of reinforcement learning in high-dimensional spaces when the number of features is bigger than the number of samples. In particular, we study the least-squares temporal difference (LSTD) learning algorithm when a space of low dimension is generated with a random projection from a high-dimensional space. We provide a thorough theoretical analysis of the LSTD with random projections and derive performance bounds for the resulting algorithm. We also show how the error of LSTD with random projections is propagated through the iterations of a policy iteration algorithm and provide a performance bound for the resulting least-squares policy iteration (LSPI) algorithm.
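The following is a minimal sketch of the idea under simple assumptions: compress high-dimensional features through a random Gaussian matrix and solve the usual LSTD linear system in the compressed space. The feature matrices, projection size, discount factor, and ridge term are placeholders, not the paper's construction or tuning.

```python
import numpy as np

def lstd_random_projections(phi, phi_next, rewards, proj_dim, gamma, rng, reg=1e-6):
    """LSTD in a random low-dimensional projection of the original features.

    phi, phi_next : (n, D) feature matrices at states s_t and s_{t+1}.
    Returns the value-function weights in the projected space and the projection matrix.
    """
    D = phi.shape[1]
    Pi = rng.normal(size=(D, proj_dim)) / np.sqrt(proj_dim)   # random projection
    psi, psi_next = phi @ Pi, phi_next @ Pi                   # projected features
    A = psi.T @ (psi - gamma * psi_next) + reg * np.eye(proj_dim)
    b = psi.T @ rewards
    return np.linalg.solve(A, b), Pi

# Toy usage in the regime of interest: more features than samples.
rng = np.random.default_rng(0)
n, D = 200, 500
phi = rng.normal(size=(n, D))
phi_next = rng.normal(size=(n, D))
rewards = rng.normal(size=n)
theta, Pi = lstd_random_projections(phi, phi_next, rewards, proj_dim=20, gamma=0.95, rng=rng)
value_estimates = (phi @ Pi) @ theta   # V(s) ~ psi(s)' theta
```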

[58] A.M. Farahmand, R. Munos, and Cs. Szepesvári. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, 2010. [ bib | .pdf ]
We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulting policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Nikodym derivative of a certain distribution rather than its supremum, in contrast to what previous results suggested. Our results also indicate that the contribution of the approximation/Bellman error to the performance loss is more prominent in the later iterations of API/AVI, and that the effect of an error term in the earlier iterations decays exponentially fast.

[59] O.-A. Maillard, R. Munos, A. Lazaric, and M. Ghavamzadeh. Finite sample analysis of Bellman residual minimization. In Masashi Sugiyama and Qiang Yang, editors, Asian Conference on Machine Learning, JMLR: Workshop and Conference Proceedings, volume 13, pages 309-324, 2010. [ bib | .pdf ]
We consider the Bellman residual minimization approach for solving discounted Markov decision problems, where we assume that a generative model of the dynamics and rewards is available. At each policy iteration step, an approximation of the value function for the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution, the immediate rewards, and the next states sampled from the model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual at a rate of order O(1/sqrt(n)). This result implies that minimizing the empirical residual is indeed a sound approach for the minimization of the true Bellman residual, which guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space is able to approximate the sequence of value functions.
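For concreteness, here is a minimal sketch of minimizing an empirical Bellman residual over a linear space, using a single next-state sample per state and a closed-form solve; the features, rewards, and next-state samples below are random stand-ins for a real generative model, and the sketch does not reproduce the paper's exact estimator or its analysis.

```python
import numpy as np

def empirical_brm(phi, phi_next, rewards, gamma, reg=1e-6):
    """Minimize the empirical Bellman residual ||Phi w - (r + gamma Phi' w)||^2
    over a linear space, in closed form (a simplified, single-sample version)."""
    M = phi - gamma * phi_next               # residual operator applied to the features
    A = M.T @ M + reg * np.eye(phi.shape[1])
    b = M.T @ rewards
    return np.linalg.solve(A, b)

# Toy usage: random features standing in for a real MDP and generative model.
rng = np.random.default_rng(0)
n, d = 1000, 10
phi = rng.normal(size=(n, d))                # phi(s_i) for i.i.d. sampled states
phi_next = rng.normal(size=(n, d))           # phi(s'_i), with s'_i drawn from the model
rewards = rng.normal(size=n)
w = empirical_brm(phi, phi_next, rewards, gamma=0.95)
bellman_residual = np.mean((phi @ w - rewards - 0.95 * (phi_next @ w)) ** 2)
```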

[60] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. In International Conference on Machine Learning, pages 615-622, 2010. [ bib | .pdf ]
In this paper we consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning algorithm. We report a finite-sample analysis of LSTD. We first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is beta-mixing.

[61] A. Lazaric, M. Ghavamzadeh, and R. Munos. Finite-sample analysis of LSTD. Technical report, INRIA-00482189, 2010. [ bib | http ]
[62] A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of a classification-based policy iteration algorithm. In International Conference on Machine Learning, pages 607-614, 2010. [ bib | .pdf ]
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a tradeoff between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.

[63] A. Lazaric, M. Ghavamzadeh, and R. Munos. Analysis of a classification-based policy iteration algorithm. Technical report, INRIA-00482065, 2010. [ bib | http ]
[64] O.-A. Maillard and R. Munos. Online learning in adversarial Lipschitz environments. In European Conference on Machine Learning, 2010. [ bib | http ]
We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We provide a class of algorithms with cumulative regret upper bounded by O(sqrt(dT ln(lambda))), where d is the dimension of the search space, T the time horizon, and lambda the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of non-linear regressors/classifiers, such as neural networks.
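A simple way to see the exponentially weighted machinery at work is to run it on a uniform discretization of the search space, as in the hypothetical sketch below (grid size, learning rate, and the toy sequence of 1-Lipschitz reward functions are made up); the particle-based implementations discussed in the paper are not reproduced here.

```python
import numpy as np

def ew_on_grid(reward_functions, n_grid=100, eta=0.5, rng=None):
    """Exponentially weighted forecaster over a uniform grid of [0, 1].

    reward_functions : list of callables f_t : [0, 1] -> [0, 1], revealed after playing.
    Returns the cumulative reward obtained.
    """
    rng = rng or np.random.default_rng(0)
    grid = np.linspace(0.0, 1.0, n_grid)
    log_w = np.zeros(n_grid)                  # log-weights, for numerical stability
    total = 0.0
    for f in reward_functions:
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        x = grid[rng.choice(n_grid, p=p)]     # play a grid point drawn from the EW distribution
        total += f(x)
        log_w += eta * f(grid)                # full-information update on every grid point
    return total

# Toy adversary: a sequence of 1-Lipschitz reward functions on [0, 1].
fs = [lambda x, c=c: 1.0 - np.abs(x - c) for c in np.linspace(0.2, 0.8, 200)]
print(ew_on_grid(fs))
```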

[65] S. Bubeck and R. Munos. Open loop optimistic planning. In Conference on Learning Theory, 2010. [ bib | .pdf ]
We consider the problem of planning in a stochastic and discounted environment with a limited numerical budget. More precisely, we investigate strategies exploring the set of possible sequences of actions, so that, once all available numerical resources (e.g. CPU time, number of calls to a generative model) have been used, one returns a recommendation on the best possible immediate action to follow based on this exploration. The performance of a strategy is assessed in terms of its simple regret, that is the loss in performance resulting from choosing the recommended action instead of an optimal one. We first provide a minimax lower bound for this problem, and show that a uniform planning strategy matches this minimax rate (up to a logarithmic factor). Then we propose a UCB (Upper Confidence Bounds)-based planning algorithm, called OLOP (Open-Loop Optimistic Planning), which is also minimax optimal, and prove that it enjoys much faster rates when there is a small proportion of near-optimal sequences of actions. Finally, we compare our results with the regret bounds one can derive for our setting with bandits algorithms designed for an infinite number of arms.

[66] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In Conference on Learning Theory, 2010. [ bib | .pdf ]
We consider the problem of finding the best arm in a stochastic multi-armed bandit game. The regret of a forecaster is here defined by the gap between the mean reward of the optimal arm and the mean reward of the ultimately chosen arm. We propose a highly exploring UCB policy and a new algorithm based on successive rejects. We show that these algorithms are essentially optimal since their regret decreases exponentially at a rate which is, up to a logarithmic factor, the best possible. However, while the UCB policy needs the tuning of a parameter depending on the unobservable hardness of the task, the successive rejects policy benefits from being parameter-free, and also independent of the scaling of the rewards.
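A minimal sketch of the parameter-free successive-rejects idea follows: the budget is split into K-1 phases, every surviving arm is pulled equally within a phase, and the empirically worst arm is dropped at the end of each phase. The phase lengths below use the usual log-bar(K) schedule; the Bernoulli arm means and budget are made up for illustration.

```python
import math
import numpy as np

def successive_rejects(means, budget, rng):
    """Successive Rejects for best-arm identification (Bernoulli arms for the demo)."""
    K = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    surviving = list(range(K))
    pulls = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    n_prev = 0
    for k in range(1, K):                                    # K - 1 elimination phases
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for arm in surviving:
            for _ in range(n_k - n_prev):                    # equal allocation within the phase
                sums[arm] += rng.binomial(1, means[arm])
                pulls[arm] += 1
        worst = min(surviving, key=lambda a: sums[a] / pulls[a])
        surviving.remove(worst)                              # reject the empirically worst arm
        n_prev = n_k
    return surviving[0]                                      # the single remaining arm

rng = np.random.default_rng(0)
print(successive_rejects([0.4, 0.5, 0.55, 0.6, 0.7], budget=3000, rng=rng))
```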

[67] R. Munos. Approximate dynamic programming. In Olivier Sigaud and Olivier Buffet, editors, Markov Decision Processes in Artificial Intelligence, chapter 3, pages 67-98. ISTE Ltd and John Wiley & Sons Inc, 2010. [ bib | .html | .pdf ]
In any complex or large scale sequential decision making problem, there is a crucial need to use function approximation to represent the relevant functions such as the value function or the policy. The Dynamic Programming (DP) and Reinforcement Learning (RL) methods introduced in previous chapters make the implicit assumption that the value function can be perfectly represented (i.e. kept in memory), for example by using a look-up table (with a finite number of entries) assigning a value to all possible states (assumed to be finite) of the system. Those methods are called exact because they provide an exact computation of the optimal solution of the considered problem (or at least, enable the computations to converge to this optimal solution). However, such methods often apply to toy problems only, since in most interesting applications, the number of possible states is so large (and possibly infinite if we consider continuous spaces) that a perfect representation of the function at all states is impossible. It becomes necessary to approximate the function by using a moderate number of coefficients (which can be stored in a computer), and therefore extend the range of DP and RL to methods using such approximate representations. These approximate methods combine DP and RL methods with function approximation tools.

[68] J.-Y. Audibert, S. Bubeck, G. Chaslot, V. Danjean, S. Gelly, T. Hérault, J.-B. Hoock, C.-S. Lee, R. Munos, J. Pérez, A. Rimmel, M. Schoenauer, M. Sebag, O. Teytaud, M.-H. Wang, and Y. Wang. Gothique. Plein Sud, 72:110-115, 2009. [ bib | .pdf ]
In 1997, a computer defeated the world chess champion, Garry Kasparov, for the first time. But the game of Go still remained a preserve of human players. More complex than chess, with more than 10^600 possible games, i.e. more than the number of particles in the Universe, the game of Go is a remarkable school of strategy. Computer programs have therefore had the greatest difficulty competing with the best human players, but new algorithms are changing the game.

[69] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. In Proc. of the 20th International Conference on Algorithmic Learning Theory, pages 23-37, 2009. [ bib | http | .pdf ]
We consider the framework of stochastic multi-armed bandit problems and study the possibilities and limitations of strategies that perform an online exploration of the arms. The strategies are assessed in terms of their simple regret, a regret notion that captures the fact that exploration is only constrained by the number of available rounds (not necessarily known in advance), in contrast to the case when the cumulative regret is considered and when exploitation needs to be performed at the same time. We believe that this performance criterion is suited to situations when the cost of pulling an arm is expressed in terms of resources rather than rewards. We discuss the links between the simple and the cumulative regret. The main result is that the required exploration/exploitation trade-offs are qualitatively different, in view of a general lower bound on the simple regret in terms of the cumulative regret.

[70] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 410:1876-1902, 2009. [ bib | .pdf ]
Algorithms based on upper confidence bounds for balancing exploration and exploitation are gaining popularity since they are easy to implement, efficient and effective. This paper considers a variant of the basic algorithm for the stochastic, multi-armed bandit problem that takes into account the empirical variance of the different arms. In earlier experimental works, such algorithms were found to outperform the competing algorithms. We provide the first analysis of the expected regret for such algorithms. As expected, our results show that the algorithm that uses the variance estimates has a major advantage over its alternatives that do not use such estimates provided that the variances of the payoffs of the suboptimal arms are low. We also prove that the regret concentrates only at a polynomial rate. This holds for all the upper confidence bound based algorithms and for all bandit problems except those special ones where with probability one the payoff obtained by pulling the optimal arm is larger than the expected payoff for the second best arm. Hence, although upper confidence bound bandit algorithms achieve logarithmic expected regret rates, they might not be suitable for a risk-averse decision maker. We illustrate some of the results by computer simulations.
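To illustrate the kind of variance-aware index analyzed in this line of work, here is a minimal sketch of a UCB rule with an empirical-Bernstein-style bonus for rewards in [0, 1]; the exploration constant zeta, the Bernoulli arms, and the horizon are made-up choices and are not claimed to match the paper's exact algorithm or tuning.

```python
import numpy as np

def ucb_variance(means, horizon, rng, zeta=1.2):
    """UCB with an empirical-Bernstein (variance-based) bonus, rewards in [0, 1]."""
    K = len(means)
    pulls = np.zeros(K, dtype=int)
    rewards = [[] for _ in range(K)]
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1                            # pull each arm once to initialise
        else:
            idx = np.empty(K)
            for a in range(K):
                x = np.asarray(rewards[a])
                mean, var, s = x.mean(), x.var(), pulls[a]
                bonus = np.sqrt(2 * var * zeta * np.log(t) / s) + 3 * zeta * np.log(t) / s
                idx[a] = mean + bonus
            arm = int(np.argmax(idx))
        rewards[arm].append(rng.binomial(1, means[arm]))
        pulls[arm] += 1
    return pulls                                   # allocation after `horizon` rounds

rng = np.random.default_rng(0)
print(ucb_variance([0.3, 0.5, 0.52, 0.7], horizon=5000, rng=rng))
```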

[71] O. Maillard and R. Munos. Compressed least squares regression. In Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1213-1221, 2009. [ bib | http | .pdf ]
We consider the problem of learning, from K data, a regression function in a linear space of high dimension N using projections onto a random subspace of lower dimension M. From any algorithm minimizing the (possibly penalized) empirical risk, we provide bounds on the excess risk of the estimate computed in the projected subspace (compressed domain) in terms of the excess risk of the estimate built in the high-dimensional space (initial domain). We show that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error. We apply the analysis to Least-Squares (LS) regression and discuss the excess risk and numerical complexity of the resulting "Compressed Least Squares Regression" (CLSR) in terms of N, K, and M. When we choose M = O(sqrt(K)), we show that CLSR has an estimation error of order O(log K / sqrt(K)).
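A minimal sketch of the compressed-regression recipe, under the stated choice M = O(sqrt(K)): project the N-dimensional features onto M random Gaussian directions, then solve (ridge-stabilized) least squares in the compressed domain. The data, feature dimension, and regularization below are placeholders.

```python
import numpy as np

def compressed_least_squares(X, y, M, rng, reg=1e-8):
    """Least squares after projecting the N-dim features onto M random directions."""
    N = X.shape[1]
    A = rng.normal(size=(N, M)) / np.sqrt(M)   # random projection (compressed domain)
    Z = X @ A
    w = np.linalg.solve(Z.T @ Z + reg * np.eye(M), Z.T @ y)
    return A, w

rng = np.random.default_rng(0)
K, N = 400, 2000                               # K data points, N-dimensional feature space
X = rng.normal(size=(K, N))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=K)
M = int(np.sqrt(K))                            # M = O(sqrt(K)), as suggested by the bound
A, w = compressed_least_squares(X, y, M, rng)
predictions = (X @ A) @ w
```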

[72] P.-A. Coquelin, R. Deguest, and R. Munos. Sensitivity analysis in HMMs with application to likelihood maximization. In Y. Bengio, D. Schuurmans, J. Lafferty, C.K.I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 387-395, 2009. [ bib | .pdf ]
This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

[73] A. Lazaric and R. Munos. Hybrid stochastic-adversarial on-line learning. In Conference on Learning Theory, 2009. [ bib | http | .pdf ]
Most of the research in online learning has focused either on the problem of adversarial classification (i.e., both inputs and labels are arbitrarily chosen by an adversary) or on the traditional supervised learning problem in which samples are independently generated from a fixed probability distribution. Nonetheless, in a number of domains the relationship between inputs and labels may be adversarial, whereas input instances are generated according to a fixed distribution. This scenario can be formalized as a hybrid classification problem in which inputs are stochastic, while labels are adversarial. In this paper, we introduce this hybrid stochastic-adversarial classification problem, we propose an online learning algorithm for its solution, and we analyze its performance. In particular, we show that, given a hypothesis space H with finite VC dimension, it is possible to incrementally build a suitable finite set of hypotheses that can be used as input for an exponentially weighted forecaster achieving a cumulative regret of order O(sqrt(n VC(H) log n)) with overwhelming probability. Finally, we discuss extensions to multi-label classification, learning from experts and bandit settings with stochastic side information, and application to games.
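The exponentially weighted forecaster that such a construction feeds its finite hypothesis set into can be sketched as below (the incremental construction of that set from H is not reproduced); the threshold hypotheses, the 0/1 losses, and the learning rate are made up for illustration.

```python
import numpy as np

def exponentially_weighted(hypotheses, stream, eta=0.3):
    """Hedge over a finite set of hypotheses h : x -> {0, 1}, with 0/1 loss.

    stream : iterable of (x, label) pairs; labels may be chosen adversarially.
    Returns the forecaster's cumulative loss.
    """
    M = len(hypotheses)
    log_w = np.zeros(M)
    total_loss = 0
    rng = np.random.default_rng(0)
    for x, label in stream:
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        h = hypotheses[rng.choice(M, p=p)]            # sample a hypothesis to predict with
        total_loss += int(h(x) != label)
        losses = np.array([int(g(x) != label) for g in hypotheses])
        log_w -= eta * losses                         # exponential weight update
    return total_loss

# Toy usage: threshold classifiers on [0, 1], i.i.d. inputs, noisy (adversarial-looking) labels.
hyps = [lambda x, t=t: int(x > t) for t in np.linspace(0, 1, 21)]
rng = np.random.default_rng(1)
data = [(x, int(x > 0.6) ^ (rng.random() < 0.1)) for x in rng.uniform(size=500)]
print(exponentially_weighted(hyps, data))
```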

[74] Sertan Girgin, Manuel Loth, Rémi Munos, Philippe Preux, and Daniil Ryabko, editors. Recent Advances in Reinforcement Learning. Springer. Lecture Notes in Artificial Intelligence 5323, 2009. [ bib | http ]
This book constitutes revised and selected papers of the 8th European Workshop on Reinforcement Learning, EWRL 2008, which took place in Villeneuve d'Ascq, France, during June 30 - July 3, 2008. The 21 papers presented were carefully reviewed and selected from 61 submissions. They are dedicated to the field of reinforcement learning and to current research in it.

[75] P.-A. Coquelin, R. Deguest, and R. Munos. Particle filter-based policy gradient for POMDPs. In Proceedings of Advances in Neural Information Processing Systems, volume 22. MIT Press, 2008. [ bib | .pdf ]
Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency.

[76] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. Online optimization of X-armed bandits. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 22, pages 201-208. MIT Press, 2008. [ bib | http | .pdf ]
We consider a generalization of stochastic bandit problems where the set of arms, X, is allowed to be a generic topological space. We constrain the mean-payoff function with a dissimilarity function over X in a way that is more general than Lipschitz. We construct an arm selection policy whose regret improves upon previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally Hölder with a known exponent, then the expected regret is bounded up to a logarithmic factor by sqrt(n), i.e., the rate of growth of the regret is independent of the dimension of the space. Moreover, we prove the minimax optimality of our algorithm for the class of mean-payoff functions we consider.

[77] Y. Wang, J.-Y. Audibert, and R. Munos. Infinitely many-armed bandits. In Proceedings of Advances in Neural Information Processing Systems, volume 22. MIT Press, 2008. [ bib | .pdf ]
We consider multi-armed bandit problems where the number of arms is larger than the possible number of experiments. We make a stochastic assumption on the mean-reward of a newly selected arm, which characterizes its probability of being a near-optimal arm. Our assumption is weaker than in previous works. We describe algorithms based on upper-confidence-bounds applied to a restricted set of randomly selected arms and provide upper bounds on the resulting expected regret. We also derive a lower bound which matches (up to a logarithmic factor) the upper bound in some cases.
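A minimal sketch of the generic strategy described above: draw a restricted set of arms at random from the (effectively infinite) reservoir, then run a standard UCB rule on that subset only. The subset-size rule, the Beta reservoir distribution, and the horizon are made-up choices, not the paper's prescriptions.

```python
import numpy as np

def ucb_on_random_subset(sample_new_arm, horizon, n_selected, rng):
    """Select `n_selected` arms at random from an infinite reservoir, then run UCB1 on them.

    sample_new_arm : callable returning the (unknown) mean of a freshly drawn arm.
    """
    means = np.array([sample_new_arm() for _ in range(n_selected)])   # restricted arm set
    pulls = np.zeros(n_selected, dtype=int)
    sums = np.zeros(n_selected)
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_selected:
            arm = t - 1                                  # pull each selected arm once
        else:
            ucb = sums / pulls + np.sqrt(2 * np.log(t) / pulls)
            arm = int(np.argmax(ucb))
        r = rng.binomial(1, means[arm])
        sums[arm] += r
        pulls[arm] += 1
        total += r
    return total

rng = np.random.default_rng(0)
horizon = 10000
K = int(np.sqrt(horizon))                                # one simple choice for the subset size
print(ucb_on_random_subset(lambda: rng.beta(1, 3), horizon, K, rng))
```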

[78] J.-F. Hren and R. Munos. Optimistic planning of deterministic systems. In Recent Advances in Reinforcement Learning (European Workshop on Reinforcement Learning), Springer LNAI 5323, pages 151-164, 2008. [ bib | .pdf ]
If one possesses a model of a controlled deterministic system, then from any state, one may consider the set of all possible reachable states starting from that state and using any sequence of actions. This forms a tree whose size is exponential in the planning time horizon. Here we ask the question: given finite computational resources (e.g. CPU time), which may not be known ahead of time, what is the best way to explore this tree, such that once all resources have been used, the algorithm would be able to propose an action (or a sequence of actions) whose performance is as close as possible to optimality? The performance with respect to optimality is assessed in terms of the regret (with respect to the sum of discounted future rewards) resulting from choosing the action returned by the algorithm instead of an optimal action. In this paper we investigate an optimistic exploration of the tree, where the most promising states are explored first, and compare this approach to a naive uniform exploration. Bounds on the regret are derived both for uniform and optimistic exploration strategies. Numerical simulations illustrate the benefit of optimistic planning.
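The optimistic exploration described above can be sketched as follows for a deterministic, discounted model with rewards assumed in [0, 1]: each leaf of the look-ahead tree carries the discounted sum of rewards along its path plus the optimistic term gamma^d/(1-gamma), the leaf with the largest such bound is expanded next, and once the budget is spent the first action of the best explored sequence is returned. The toy chain model and the recommendation rule used here are illustrative simplifications.

```python
import numpy as np

def optimistic_planning(model, actions, root_state, gamma, n_expansions):
    """Optimistic exploration of the look-ahead tree of a deterministic system.

    model : (state, action) -> (next_state, reward), with rewards assumed in [0, 1].
    Returns a recommended first action after `n_expansions` node expansions.
    """
    # A leaf is (u, depth, state, first_action): u = discounted reward along its path.
    leaves = [(0.0, 0, root_state, None)]
    for _ in range(n_expansions):
        # Expand the leaf with the largest optimistic value u + gamma^d / (1 - gamma).
        b_values = [u + gamma ** d / (1.0 - gamma) for (u, d, _, _) in leaves]
        u, d, s, first = leaves.pop(int(np.argmax(b_values)))
        for a in actions:
            s2, r = model(s, a)
            leaves.append((u + (gamma ** d) * r, d + 1, s2, a if first is None else first))
    # Recommend the first action of the leaf with the highest accumulated discounted reward.
    best = max(leaves, key=lambda leaf: leaf[0])
    return best[3]

# Toy deterministic chain: action 1 moves right (reward grows with the state), action 0 stays.
def toy_model(state, action):
    next_state = state + 1 if action == 1 else state
    return next_state, min(1.0, 0.1 * next_state)

print(optimistic_planning(toy_model, actions=[0, 1], root_state=0, gamma=0.9, n_expansions=200))
```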

[79] R. Maitrepierre, J. Mary, and R. Munos. Adaptive play in Texas Hold'em poker. In European Conference on Artificial Intelligence, 2008. [ bib | .pdf ]
We present a Texas Hold'em poker player for limit heads-up games. Our bot is designed to adapt automatically to the strategy of the opponent and is not based on Nash equilibrium computation. The main idea is to design a bot that builds beliefs about its opponent's hand. A forest of game trees is generated according to those beliefs and the solutions of the trees are combined to make the best decision. The beliefs are updated during the game according to several methods, each of which corresponds to a basic strategy. We then use an exploration-exploitation bandit algorithm, namely UCB (Upper Confidence Bound), to select a strategy to follow. This results in a global play that takes into account the opponent's strategy, and which turns out to be rather unpredictable. Indeed, if a given strategy is exploited by an opponent, the UCB algorithm will detect it using change point detection and will choose another one. The resulting program, called Brennus, participated in the AAAI'07 Computer Poker Competition in both the online and equilibrium competitions and ranked eighth out of seventeen competitors.

[80] R. Munos. Programmation dynamique avec approximation de la fonction valeur. In O. Sigaud and O. Buffet, editors, Processus décisionnels de Markov et intelligence artificielle, volume 2, chapter 11, pages 19-50. Hermes, 2008. [ bib | .html | .pdf ]
The use of tools for approximating the value function is essential for handling large-scale sequential decision-making problems. The dynamic programming (DP) and reinforcement learning (RL) methods introduced in Chapters 1 and 2 assume that the value function can be represented (stored) by assigning a value to each state (whose number is assumed finite), for example in the form of a table. These so-called exact solution methods determine the optimal solution of the problem under consideration (or at least converge to this optimal solution). However, they often apply only to toy problems, because for most interesting applications the number of possible states is so large (even infinite in the case of continuous spaces) that an exact representation of the function cannot be fully stored. It then becomes necessary to represent the value function approximately, using a moderate number of coefficients, and to redefine and analyze so-called approximate solution methods for DP and RL, in order to take into account the consequences of using such approximations in sequential decision-making problems.

[81] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning Journal, 71:89-129, 2008. [ bib | .pdf ]
We consider the problem of finding a near-optimal policy using value-function methods in continuous space, discounted Markovian Decision Problems (MDP) when only a single trajectory underlying some policy can be used as the input. Since the state-space is continuous, one must resort to the use of function approximation. In this paper we study a policy iteration algorithm iterating over action-value functions where the iterates are obtained by empirical risk minimization, where the loss function used penalizes high magnitudes of the Bellman-residual. It turns out that when a linear parameterization is used the algorithm is equivalent to least-squares policy iteration. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set and the controllability properties of the MDP. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.

[82] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Journal of Machine Learning Research, 9:815-857, 2008. [ bib | http | .pdf ]
In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaved in the sense that, for a large class of MDPs, arbitrarily good performance can be achieved with high probability by using a sufficiently large number of samples. An important feature of our proof technique is that it permits the study of weighted Lp-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation capabilities of the function space, the inherent Bellman residual, which reflects how well the function space is "aligned" with the essential characteristics of the dynamics of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
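One round of sampling-based fitted value iteration with a generative model and a linear architecture can be sketched as follows: sample states, estimate the Bellman backup at each sampled state by Monte Carlo over next-state samples, and fit the approximator to these targets by least squares. The toy MDP, features, and sample sizes below are made up, and the sketch ignores the weighting and analysis details of the paper.

```python
import numpy as np

def fitted_value_iteration(features, sample_state, generative_model, actions,
                           gamma, n_states, n_next, n_iters, rng):
    """Sampling-based FVI with a linear architecture V(s) ~ features(s) @ theta."""
    d = features(sample_state(rng)).shape[0]
    theta = np.zeros(d)
    for _ in range(n_iters):
        states = [sample_state(rng) for _ in range(n_states)]
        Phi = np.stack([features(s) for s in states])
        targets = np.empty(n_states)
        for i, s in enumerate(states):
            q_values = []
            for a in actions:                        # Monte Carlo estimate of the backup
                samples = [generative_model(s, a, rng) for _ in range(n_next)]
                q_values.append(np.mean([r + gamma * features(s2) @ theta
                                         for (s2, r) in samples]))
            targets[i] = max(q_values)               # estimate of the optimal Bellman operator
        theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # regression (fitting) step
    return theta

# Toy 1-d MDP: state in [0, 1], actions nudge the state left/right, reward favours high states.
def gen_model(s, a, rng):
    s2 = np.clip(s + 0.1 * a + 0.05 * rng.normal(), 0.0, 1.0)
    return s2, s2

feats = lambda s: np.array([1.0, s, s * s])
rng = np.random.default_rng(0)
theta = fitted_value_iteration(feats, lambda rng: rng.uniform(), gen_model, [-1, 1],
                               gamma=0.9, n_states=200, n_next=5, n_iters=20, rng=rng)
```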

[83] Rémi Munos. Performance bounds in Lp norm for approximate value iteration. SIAM J. Control and Optimization, 2007. [ bib | .ps ]
[84] R. Munos. Analyse en norme Lp de l'algorithme d'itérations sur les valeurs avec approximations. Revue d'Intelligence Artificielle, 21:55-76, 2007. [ bib | .pdf ]
[85] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Tuning bandit algorithms in stochastic environments. In International Conference on Algorithmic Learning Theory, 2007. [ bib | .pdf ]
[86] A. Antos, Cs. Szepesvári, and R. Munos. Value-iteration based fitted policy iteration: learning with a single trajectory. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007. [ bib | .pdf ]
[87] A. Antos, Cs. Szepesvári, and R. Munos. Fitted Q-iteration in continuous action-space MDPs. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 9-16, Cambridge, MA, 2007. MIT Press. [ bib | http | .pdf ]
[88] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Variance estimates and exploration function in multi-armed bandit. Technical report, Certis - Ecole des Ponts, 2007. [ bib | .pdf ]
[89] P.-A. Coquelin, R. Deguest, and R. Munos. Numerical methods for sensitivity analysis of Feynman-Kac models. Technical report, INRIA, 2007. [ bib | http | .pdf ]
[90] P.-A. Coquelin, S. Martin, and R. Munos. A dynamic programming approach to viability problems. In IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007. [ bib | http | .pdf ]
[91] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. Technical report, INRIA RR-6141, 2007. [ bib | http | .pdf ]
[92] P.-A. Coquelin and R. Munos. Bandit algorithms for tree search. In Uncertainty in Artificial Intelligence, 2007. [ bib | .pdf ]
[93] S. Gelly and R. Munos. L'ordinateur, champion de go ? Pour la Science, 354:28-35, 2007. [ bib | http | .pdf ]
[94] R. Munos. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771-791, 2006. [ bib | .pdf ]
[95] R. Munos. Geometric variance reduction in Markov chains. Application to value function and gradient estimation. Journal of Machine Learning Research, 7:413-427, 2006. [ bib | .pdf ]
[96] A. Antos, Cs. Szepesvári, and R. Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. In G. Lugosi and H.U. Simon, editors, Conference on Learning Theory, volume 4005 of LNCS/LNAI, pages 574-588, Berlin, Heidelberg, 2006. Springer-Verlag. [ bib | .pdf ]
[97] J.-Y. Audibert, R. Munos, and Cs. Szepesvári. Use of variance estimation in the multi-armed bandit problem. In NIPS workshop On-line Trading of Exploration and Exploitation, 2006. [ bib | .pdf ]
[98] C. Barrera-Esteve, F. Bergeret, E. Gobet, A. Meziou, R. Munos, and D. Reboul-Salze. Numerical methods for the pricing of swing options: a stochastic control approach. Methodology and Computing in Applied Probability, 8:517-540, 2006. [ bib | .pdf ]
[99] O. Bokanowski, S. Martin, R. Munos, and H. Zidani. An anti-diffusive scheme for viability problems. Applied Numerical Mathematics, special issue on Numerical methods for viscosity solutions and applications, 45-9:1147-1162, 2006. [ bib | .ps ]
[100] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Technical report, INRIA RR-6062, 2006. [ bib | http | .pdf ]
[101] R. Munos and Cs. Szepesvári. Finite time bounds for sampling based fitted value iteration. Technical report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [ bib | http ]
[102] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô Malliavin calculus and martingales. Application to stochastic optimal control. SIAM journal on Control and Optimization, 43(5):1676-1713, 2005. [ bib | .pdf ]
[103] Rémi Munos. Error bounds for approximate value iteration. Technical report, Ecole Polytechnique RR527, 2005. [ bib | .pdf ]
[104] R. Munos. Error bounds for approximate value iteration. In American Conference on Artificial Intelligence, 2005. [ bib | .pdf ]
[105] R. Munos. Geometric variance reduction in Markov chains. Application to value function and gradient estimation. In American Conference on Artificial Intelligence, 2005. [ bib | .pdf ]
[106] Rémi Munos and Hasna Zidani. Consistency of a simple multidimensional scheme for Hamilton-Jacobi-Bellman equations. C. R. Acad. Sci. Paris, Ser. I Math, 2005. [ bib | .ps ]
[107] Cs. Szepesvári and R. Munos. Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning, pages 881-886, 2005. [ bib | .ps ]
[108] R. Munos. Policy gradient in continuous time. In Conférence francophone sur l'apprentissage automatique, 2005. [ bib | .ps ]
[109] O. Bokanowski, S. Martin, R. Munos, and H. Zidani. An anti-diffusive scheme for viability problems. Technical report, INRIA, Research Report 5431, 2004. [ bib | http ]
[110] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô Malliavin calculus and martingales. Numerical implementation. Technical report, Ecole Polytechnique. Research Report 520, 2004. [ bib | .pdf ]
[111] Rémi Munos. Algorithme d'itération sur les politiques avec approximation linéaire. Journal Electronique d'Intelligence Artificielle, 4-37, 2004. [ bib | .ps ]
[112] C. Barrera-Esteve, F. Bergeret, C. Dossal, E. Gobet, A. Meziou, R. Munos, and D. Reboul-Salze. Numerical methods for the pricing of swing options: a stochastic control approach. Technical report, Ecole Polytechnique. Research Report 544, 2004. [ bib | .pdf ]
[113] Rémi Munos. Contributions à l'apprentissage par renforcement et au contrôle optimal avec approximation. PhD thesis, HDR de l'Université Pierre et Marie Curie, spécialité Mathématiques Appliquées, 2004. [ bib | .pdf ]
[114] R. Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560-567, 2003. [ bib | .pdf ]
[115] Rémi Munos and Andrew Moore. Variable resolution discretization in optimal control. Machine Learning Journal, 49:291-323, 2002. [ bib | .ps ]
[116] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô Malliavin calculus and martingales. application to stochastic optimal control. Technical report, Ecole Polytechnique RR-498, 2002. [ bib | .ps ]
[117] Rémi Munos. Decision-making under uncertainty: Efficiently estimating where extra resources are needed. Technical report, Ecole Polytechnique RR-550, 2002. [ bib | .ps ]
[118] Rémi Munos. Efficient resources allocation for Markov decision processes. In Advances in Neural Information Processing Systems, 2001. [ bib | .ps ]
[119] Rémi Munos and Andrew W. Moore. Rates of convergence for variable resolution schemes in optimal control. In International Conference on Machine Learning, 2000. [ bib | .ps ]
[120] Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40:265-299, 2000. [ bib | .ps ]
[121] Rémi Munos, Leemon Baird, and Andrew Moore. Gradient descent approaches to neural-net-based solutions of the Hamilton-Jacobi-Bellman equation. In International Joint Conference on Neural Networks, 1999. [ bib | .pdf ]
[122] Rémi Munos and Andrew Moore. Influence and variance of a Markov chain: Application to adaptive discretizations in optimal control. In Proceedings of the 38th IEEE Conference on Decision and Control, 1999. [ bib | .ps ]
[123] Rémi Munos and Andrew Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1348-1355, 1999. [ bib | .ps ]
[124] Rémi Munos. A general convergence theorem for reinforcement learning in the continuous case. In European Conference on Machine Learning, 1998. [ bib | .ps ]
[125] Rémi Munos and Andrew Moore. Barycentric interpolators for continuous space and time reinforcement learning. In Advances in Neural Information Processing Systems, volume 11, 1998. [ bib | .ps ]
[126] Rémi Munos. Finite-element methods with local triangulation refinement for continuous reinforcement learning problems. In European Conference on Machine Learning, pages 170-183, 1997. [ bib | .ps ]
[127] Rémi Munos. A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. In International Joint Conference on Artificial Intelligence, 1997. [ bib | .ps ]
[128] Rémi Munos. Catégorisation adaptative de données sensori-motrices pour un système d'apprentissage par renforcement. In Journées de Rochebrune, rencontres interdisciplinaires sur les systèmes complexes naturels et artificiels, 1997. [ bib | .ps ]
[129] Rémi Munos. Apprentissage par Renforcement, étude du cas continu. PhD thesis, Ecole des Hautes Etudes en Sciences Sociales, 1997. [ bib | .ps ]
[130] Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Neural Information Processing Systems, 1997. [ bib | .ps ]
[131] Rémi Munos. A convergent reinforcement learning algorithm in the continuous case: the finite-element reinforcement learning. In International Conference on Machine Learning. Morgan Kaufmann, 1996. [ bib | .ps ]
[132] Rémi Munos. Using finite differences methods for approximating the value function of continuous reinforcement learning problems. In International Symposium on Multi-Technology Information Processing, 1996. [ bib | .pdf ]
[133] R. Munos and J. Patinel. Reinforcement learning with dynamic covering of the state-action space: Partitioning Q-learning. In Simulation of Adaptive Behavior. The MIT Press/Bradford Books, 1994. [ bib ]

This file was generated by bibtex2html 1.97.