http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Home.html
Five papers at AI&Stats'17
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2017/3/6_Five_papers_at_AI%26Stats17.html
f1195575-c5db-40db-adfe-ae28767adfa0Mon, 6 Mar 2017 21:31:33 +0100<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2017/3/6_Five_papers_at_AI%26Stats17_files/droppedImage.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object002_1.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got five papers accepted at AI&Stats’17!<br/><br/>Linear Thompson Sampling Revisited (with Marc Abeille)<br/><br/>We derive an alternative proof for the regret of Thompson sampling (TS) in the stochastic linear bandit setting. While we obtain a regret bound of order $\widetilde{O}(d^{3/2}\sqrt{T})$ as in previous results, the proof sheds new light on the functioning of TS. We leverage the structure of the problem to show how the regret is related to the sensitivity (i.e., the gradient) of the objective function, and how selecting optimal arms associated with optimistic parameters controls it. Thus we show that TS can be seen as a generic randomized algorithm whose sampling distribution is designed to have a fixed probability of being optimistic, at the cost of an additional $\sqrt{d}$ regret factor compared to a UCB-like approach. Furthermore, we show that our proof readily applies to regularized linear optimization and generalized linear model problems.<br/><br/><br/><br/>Thompson Sampling for Linear-Quadratic Control Problems (with Marc Abeille)<br/><br/>We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics are linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS) (a.k.a. posterior sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed.
Despite its empirical and theoretical success in a wide range of problems, from multi-armed to linear bandits, we show that when studying the frequentist regret of TS in control problems, we need to trade off the frequency of sampling optimistic parameters against the frequency of switches in the control policy. This results in an overall regret of $O(T^{2/3})$, which is significantly worse than the $O(\sqrt{T})$ regret achieved by the optimism-in-the-face-of-uncertainty algorithm in LQ control problems.<br/><br/><br/><br/>Exploration-Exploitation in MDPs with Options (with Ronan Fruit)<br/><br/>While a large body of empirical results shows that temporally-extended actions and options may significantly affect the learning performance of an agent, the theoretical understanding of how and when options can be beneficial in online reinforcement learning is relatively limited. In this paper, we derive upper and lower bounds on the regret of a variant of UCRL using options. While we first analyze the algorithm in the general case of semi-Markov decision processes (SMDPs), we show how these results can be translated to the specific case of MDPs with options, and we illustrate simple scenarios in which the regret of learning with options can be provably much smaller than the regret suffered when learning with primitive actions.<br/><br/><br/><br/>Distributed Adaptive Sampling for Kernel Matrix Approximation (with D. Calandriello and M. Valko)<br/><br/>Most kernel-based methods, such as kernel regression, kernel PCA, ICA, or $k$-means clustering, do not scale to large datasets, because constructing and storing the kernel matrix $K_n$ requires at least $O(n^2)$ time and space for $n$ samples. Recent works by Alaoui and Mahoney and by Musco and Musco show that sampling points with replacement according to their ridge leverage scores (RLS) generates small dictionaries of relevant points with strong spectral approximation guarantees for $K_n$.
The drawback of RLS-based methods is that computing exact RLS requires constructing and storing the whole kernel matrix. In this paper, we introduce SQUEAK, a new algorithm for kernel approximation based on RLS sampling that sequentially processes the dataset, storing a dictionary which creates accurate kernel matrix approximations with a number of points that only depends on the effective dimension $d_{\mathrm{eff}}(\gamma)$ of the dataset. Moreover, since all the RLS estimations are efficiently performed using only the small dictionary, SQUEAK never constructs the whole matrix $K_n$, runs in time $\widetilde{O}(n\,d_{\mathrm{eff}}(\gamma)^3)$, linear w.r.t. $n$, and requires only a single pass over the dataset. We also propose a parallel and distributed version of SQUEAK achieving similar accuracy in as little as $\widetilde{O}(\log(n)\,d_{\mathrm{eff}}(\gamma)^3)$ time.<br/><br/><br/>Trading off Rewards and Errors in Multi-Armed Bandits (with A. Erraqabi, M. Valko, E. Brunskill, Y.-E. Liu)<br/><br/>In multi-armed bandits, the most common objective is the maximization of the cumulative reward. Alternative settings include active exploration, where a learner tries to gain accurate estimates of the rewards of all arms. While these objectives are contrasting, in many scenarios it is desirable to trade off rewards and errors. For instance, in educational games the designer wants to gather generalizable knowledge about the behavior of the students and teaching strategies (small estimation errors) but, at the same time, the system needs to avoid giving a bad experience to the players, who may leave the system permanently (large reward). In this paper, we formalize this tradeoff and introduce the ForcingBalance algorithm, whose performance is provably close to the best possible tradeoff strategy.
Finally, we demonstrate on real-world educational data that ForcingBalance returns useful information about the arms without compromising the overall reward.<br/><br/><br/>One paper at UAI'16
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2017/3/6_One_paper_at_UAI16.html
322a3209-eba9-4f32-b30c-4228910e2991Mon, 6 Mar 2017 21:26:36 +0100<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2017/3/6_One_paper_at_UAI16_files/IMG_9418.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object003_2.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got one paper accepted at UAI’16!<br/><br/>Analysis of Nyström method with sequential ridge leverage score sampling<br/> (with D. Calandriello and M. Valko)<br/><br/>Large-scale kernel ridge regression (KRR) is limited by the need to store a large kernel matrix $K_t$. To avoid storing the entire matrix $K_t$, Nyström methods subsample a subset of its columns and efficiently find an approximate KRR solution on the reconstructed matrix $\widetilde{K}_t$. The chosen subsampling distribution in turn affects the statistical and computational tradeoffs. For KRR problems, [15, 1] show that a sampling distribution proportional to the ridge leverage scores (RLSs) provides strong reconstruction guarantees for $K_t$. While exact RLSs are as difficult to compute as a KRR solution, we may be able to approximate them well enough. In this paper, we study KRR problems in a sequential setting and introduce the INK-ESTIMATE algorithm, which incrementally computes RLS estimates. INK-ESTIMATE maintains a small sketch of $K_t$ that at each step is used to compute an intermediate estimate of the RLSs. First, our sketch update does not require access to previously seen columns, so a single pass over the kernel matrix is sufficient. Second, the algorithm requires a fixed, small space budget that depends only on the effective dimension of the kernel matrix. Finally, our sketch provides strong approximation guarantees on the distance $\|K_t - \widetilde{K}_t\|_2$ and on the statistical risk of the approximate KRR solution at any time, because all our guarantees hold at any intermediate step.
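The RLS-based Nyström idea behind this abstract can be illustrated with a short numpy sketch. This is only a toy version under simplifying assumptions, not INK-ESTIMATE itself: it computes exact RLSs on a small synthetic dataset and samples a dictionary once, whereas the paper's algorithm estimates the scores incrementally in a single pass. The kernel bandwidth, regularization $\gamma$, and dictionary size are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset and Gaussian (RBF) kernel matrix K
X = rng.normal(size=(200, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Exact ridge leverage scores: l_i(gamma) = [K (K + gamma*n*I)^{-1}]_{ii}
n, gamma = K.shape[0], 1e-2
rls = np.diag(K @ np.linalg.inv(K + gamma * n * np.eye(n)))
d_eff = rls.sum()  # effective dimension d_eff(gamma) of the dataset

# Sample a small dictionary of columns with probability proportional to RLS
m = 40
probs = rls / rls.sum()
S = rng.choice(n, size=m, replace=True, p=probs)

# Nystrom approximation built from the sampled dictionary
K_nm = K[:, S]
K_mm = K[np.ix_(S, S)]
K_hat = K_nm @ np.linalg.pinv(K_mm) @ K_nm.T

# Relative spectral-norm reconstruction error
err = np.linalg.norm(K - K_hat, 2) / np.linalg.norm(K, 2)
```

The point of leverage-score sampling is that the dictionary concentrates on the directions that matter for the ridge problem, which is why a budget on the order of the effective dimension, rather than $n$, suffices for an accurate reconstruction.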
<br/><br/>One paper at COLT'16
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2016/5/26_One_paper_at_COLT16.html
c759da8f-fecd-4ff2-acdf-dd596baa30cdThu, 26 May 2016 17:39:34 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2016/5/26_One_paper_at_COLT16_files/IMG_9418.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object003_3.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got one paper accepted at COLT’16!<br/><br/>Reinforcement Learning of POMDPs using Spectral Methods (with K. Azizzadenesheli, A. Anandkumar)<br/><br/>We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. <br/>While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging, since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm that runs through episodes: in each episode, we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy, which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound w.r.t. the optimal memoryless policy and efficient scaling with respect to the dimensionality of the observation and action spaces. <br/><br/>One paper at AI&Stats'16
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2016/5/26_One_paper_at_AI%26Stats16.html
3c5e7d81-9d3a-4a68-affc-4cb38647eed9Thu, 26 May 2016 17:35:07 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2016/5/26_One_paper_at_AI%26Stats16_files/Old-City-Cadiz.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object001_5.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got one paper accepted at AI&Stats’16!<br/><br/>Improved Learning Complexity in Combinatorial Pure Exploration Bandits (with V. Gabillon, M. Ghavamzadeh, R. Ortner, P. Bartlett)<br/><br/>We study the problem of combinatorial pure exploration in the stochastic multi-armed bandit problem. We first construct a new measure of complexity that provably characterizes the learning performance of the algorithms we propose for the fixed-confidence and the fixed-budget settings. We show that this complexity is never higher than the one in existing work and illustrate a number of configurations in which it can be significantly smaller. While in general this improvement comes at the cost of increased computational complexity, we provide a series of examples, including a planning problem, where this extra cost is not significant. <br/><br/>Invited talk at PDIA'15
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/10/8_Invited_talk_at_PDIA15.html
92ed0cd6-0879-456a-9f06-7d97e73d1ca5Thu, 8 Oct 2015 14:27:16 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/10/8_Invited_talk_at_PDIA15_files/11907374_846940178707638_2335296311738270670_o.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object000_2.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a> Two papers at IJCAI’15
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/6/1_One_paper_at_ISIT15_2.html
6f2eac53-32be-4624-9960-4b7d3937a6f8Mon, 1 Jun 2015 17:19:57 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/6/1_One_paper_at_ISIT15_2_files/IMG_8186.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object005_1.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got two papers accepted at IJCAI’15!<br/><br/>Direct Policy Iteration with Demonstrations (with Jessica Chemali)<br/><br/>We consider the problem of learning the optimal policy of an unknown Markov decision process (MDP) when expert demonstrations are available along with interaction samples. We build on classification-based policy iteration to perform a seamless integration of interaction and expert data, thus obtaining an algorithm that can benefit from both sources of information at the same time. Furthermore, we provide a full theoretical analysis of the performance across iterations, providing insights into how the algorithm works. Finally, we report an empirical evaluation of the algorithm and a comparison with state-of-the-art algorithms.<br/><br/><br/><br/><br/><br/>Maximum Entropy Semi-Supervised Inverse Reinforcement Learning (with J. Audiffren, M. Valko, and M. Ghavamzadeh)<br/><br/>A popular approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning (IRL) problem. The MaxEnt-IRL algorithm successfully integrates the maximum entropy principle into IRL and, unlike its predecessors, resolves the ambiguity arising from the fact that a possibly large number of policies could match the expert's behavior. In this paper, we study an AL setting in which, in addition to the expert's trajectories, a number of unsupervised trajectories are available. We introduce MESSI, a novel algorithm that combines MaxEnt-IRL with principles coming from semi-supervised learning.
In particular, MESSI integrates the unsupervised data into the MaxEnt-IRL framework using a pairwise penalty on trajectories. Empirical results on highway driving and grid-world problems indicate that MESSI is able to take advantage of the unsupervised trajectories and improve the performance of MaxEnt-IRL. <br/><br/><br/>One paper at ISIT’15
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/5/15_Three_papers_at_NIPS14_2.html
c6ca5c7c-d356-457a-8485-22260d29ec34Fri, 15 May 2015 17:13:41 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/5/15_Three_papers_at_NIPS14_2_files/Hong_Kong_Skyline_Restitch_-_Dec_2007.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object001_8.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got one paper accepted at ISIT’15!<br/><br/>The Replacement Bootstrap for Dependent Data (with A. Sani and D. Ryabko)<br/><br/>Applications that deal with time-series data often require evaluating complex statistics for which each time series is essentially one data point. When only a few time series are available, bootstrap methods are used to generate additional samples that can be used to evaluate the statistic of interest empirically. In this work, a novel bootstrap method is proposed and shown to have asymptotic consistency guarantees under the sole assumption that the time series are stationary and ergodic. This contrasts with previously available results, which impose mixing or finite-memory assumptions on the data. An empirical evaluation on simulated and real data, using a practically relevant and complex extrema statistic, is provided.<br/><br/>Ph.D. Position at SequeL
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/3/30_Ph.D._Position_at_SequeL.html
4d488911-71a1-44d9-9052-e23e59982369Mon, 30 Mar 2015 12:07:08 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2015/3/30_Ph.D._Position_at_SequeL_files/inria_lille_logo_diapo.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object000_2.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>Transfer in multi-armed bandit and reinforcement learning<br/><br/>Keywords: reinforcement learning, multi-armed bandit, transfer learning, exploration-exploitation, representation learning, hierarchical learning.<br/><br/>Research Topic<br/>The main objective of this Ph.D. research project is to advance the state of the art in multi-armed bandit and reinforcement learning (RL) through the development of novel transfer learning algorithms.<br/><br/>Reinforcement learning (RL) formalizes the problem of learning an optimal behavior policy from the experience directly collected from an unknown environment. Such a general model already provides powerful tools that can be used to learn from data in a very diverse range of applications (e.g., see successful applications of RL to computer games, energy management, logistics, and autonomous robotics). Nonetheless, practical limitations of current algorithms have encouraged research into efficient ways of integrating expert prior knowledge into the learning process. Although this improves the performance of RL algorithms, it dramatically reduces their autonomy, since it requires constant supervision by a domain expert. A solution to this problem is provided by transfer learning, which is directly motivated by the observation that one of the key features allowing humans to accomplish complicated tasks is their ability to build general knowledge from past experience and transfer it when learning new tasks.
Thus, we believe that bringing transfer learning capabilities to existing machine learning algorithms will enable them to solve series of tasks in complex and unknown environments. The objective is to develop algorithms that not only learn from experience but also extract knowledge and transfer it across different tasks, thus obtaining a dramatic speed-up of the learning process and a significant improvement in its overall performance. In short, the general objective of this Ph.D. project is to design RL algorithms able to incrementally discover, construct, and transfer “prior” knowledge in a fully automatic way. <br/><br/>Research Program<br/>While the idea of transfer learning has been applied in a series of machine learning problems, its integration into RL is much more complicated. In fact, the range of scenarios that can be constructed and the types of knowledge that can be transferred are much larger than in simpler problems, such as supervised learning. During the Ph.D. we will thus investigate a variety of approaches to transfer in RL, ranging from transfer of samples to transfer of representations. More specifically, we will focus our attention on three aspects of RL algorithms that could significantly benefit from transfer of knowledge:<br/><br/>(i) Exploration. Which knowledge transfer can provably improve the exploration-exploitation performance of an RL agent in terms of sample complexity and regret?<br/>(ii) Representation. Which representation-learning techniques are best suited to transfer in RL?<br/>(iii) Hierarchical structures. Is it possible to prove the advantage of hierarchical structures (e.g., options) over flat structures in RL? Under which assumptions? How can we create such hierarchies automatically?<br/><br/>The previous questions will require theoretical, algorithmic, and empirical study. The Ph.D.
will cover different learning scenarios (e.g., multi-armed bandit, linear bandit, contextual bandit, full reinforcement learning) and different validation environments (e.g., fully synthetic, off-line evaluation from logged data, online simulation). As such, we expect the Ph.D. to produce a variety of results:<br/><br/>• Theoretical study of the conditions and the type of improvement brought by transfer methods w.r.t. no-transfer standard RL algorithms. <br/>• Empirical validation of the proposed algorithms and comparison with existing transfer and no-transfer methods.<br/>• Investigation of the application of transfer in RL to real-world problems such as recommendation systems, trading, and computer games.<br/><br/>Profile<br/>The applicant must have a Master of Science in Computer Science, Statistics, or a related field, possibly with a background in reinforcement learning, bandits, or optimization. Candidates with either a very strong mathematical or a very strong computer science background will be considered. The working language in the lab is English; good written and oral communication skills are required.<br/><br/>Application<br/>The application should include a brief description of research interests and past experience, a CV, degrees and grades, a copy of the Master thesis (or a draft thereof), a motivation letter (short but pertinent to this call), relevant publications, and other relevant documents. Candidates are encouraged to provide letter(s) of recommendation and contact information for reference persons. Please send your application as a single PDF to <a href="http://alessandro.lazaric-at-inria.fr/%22%20%5Ct%20%22_blank">alessandro.lazaric-at-inria.fr</a>. The deadline for the application is May 10, 2015.
The final decision will be communicated in June/July 2015.<br/>• Application closing date: May 15, 2015<br/>• Interviews: May/June 2015<br/>• Duration: 3 years (a full-time position)<br/>• Starting date: October 15, 2015 (flexible)<br/>• Supervisor: Alessandro Lazaric<br/>• Place: SequeL, INRIA Lille - Nord Europe<br/><br/>Working environment<br/>The Ph.D. candidate will work in the SequeL (<a href="https://sequel.lille.inria.fr/%22%20%5Ct%20%22_blank">https://sequel.lille.inria.fr/</a>) lab at Inria Lille - Nord Europe, located in Lille. <a href="http://www.inria.fr/en/">Inria</a> (<a href="http://www.inria.fr/%22%20%5Ct%20%22_blank">http://www.inria.fr/</a>) is France's leading institution in Computer Science, with over 2800 scientists employed, around 250 of them in Lille. Lille is the capital of the north of France, a metropolis with 1 million inhabitants and excellent train connections to Brussels (30 min), Paris (1h), and London (1h30). The research team <a href="https://sequel.lille.inria.fr/">SequeL</a> (Sequential Learning) is composed of about 20 members working in machine learning, notably in reinforcement learning, multi-armed bandit, statistical learning, and sequence prediction. The Ph.D. program will be co-funded by the <a href="https://project.inria.fr/ExTra-Learn/">ANR ExTra-Learn</a> project, which is entirely focused on the problem of transfer in RL.<br/><br/>Benefits<br/>• Salary: 1957,54 € the first two years and 2058,84 € the third year<br/>• Salary after taxes: around 1597,11 € the first two years and 1679,76 € the third year (benefits included)<br/>• Possibility of French courses<br/>• Help with housing<br/>• Participation in public transport costs<br/>• Scientific resident card and help with husband/wife visa<br/><br/>References<br/>D. Calandriello, A. Lazaric, M. Restelli. “Sparse Multi-task Reinforcement Learning”. In Proceedings of the Twenty-Eighth Annual Conference on Neural Information Processing Systems (NIPS'14), 2014.<br/>M.
Gheshlaghi-Azar, A. Lazaric, E. Brunskill. “Resource-efficient Stochastic Optimization of a Locally Smooth Function under Correlated Bandit Feedback”. In Proceedings of the Thirty-First International Conference on Machine Learning (ICML'14), 2014.<br/>M. Gheshlaghi-Azar, A. Lazaric, E. Brunskill. “Sequential Transfer in Multi-armed Bandit with Finite Set of Models”. In Proceedings of the Twenty-Seventh Annual Conference on Neural Information Processing Systems (NIPS'13), pp. 2220–2228, 2013.<br/>A. Lazaric and M. Restelli. “Transfer from Multiple MDPs”. In Proceedings of the Twenty-Fifth Annual Conference on Neural Information Processing Systems (NIPS'11), 2011.<br/>A. Lazaric. “Transfer in Reinforcement Learning: a Framework and a Survey”. In M. Wiering and M. van Otterlo, editors, Reinforcement Learning: State of the Art, Springer, 2011.<br/>M. E. Taylor and P. Stone. “Transfer Learning for Reinforcement Learning Domains: A Survey”. Journal of Machine Learning Research, 10(1): pp. 1633–1685, 2009.<br/>R. S. Sutton and A. Barto. Reinforcement Learning: an Introduction. MIT Press, Cambridge, MA, 1998.<br/>Invited talk at "30 minutes de sciences” @INRIA
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/12/10_Invited_talk_at_%2230_minutes_de_sciences_%40INRIA.html
1b81d15e-ffd2-4c25-90f8-c63a58735c01Wed, 10 Dec 2014 18:45:58 +0100<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/12/10_Invited_talk_at_%2230_minutes_de_sciences_%40INRIA_files/30-min-de-sciences_reference.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object001_6.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:344px; height:135px;"/></a>I’m giving an invited talk on:<br/><br/>Transfer in Reinforcement Learning<br/><br/>The reinforcement learning (RL) framework formalizes the problem of sequential decision-making under uncertainty. RL algorithms enable virtual or real agents to learn an optimal behavior strategy from the experience obtained by direct interaction with an unknown environment. Despite this high level of generality, current RL algorithms often require a significant amount of prior knowledge from a domain expert to be effective and can hardly generalize across different tasks. To overcome these limitations, it is possible to adopt a "transfer learning" approach, where the prior knowledge is incrementally constructed as the agent solves a series of problems. In particular, the idea is that the agent can automatically detect the similarity across problems and exploit it to improve its learning performance. In this talk, I will first review the basic concepts of RL and then discuss two major aspects of RL that can significantly benefit from effective transfer algorithms: the reduction of sample complexity in exploration-exploitation and the improvement of approximation accuracy in the representation problem.<br/><br/>References:<br/><br/>M. Azar, A. Lazaric, E. Brunskill, "Sequential Transfer in Multi-armed Bandit with Finite Set of Models". NIPS 2013. [<a href="Entries/2014/12/10_Invited_talk_at_%2230_minutes_de_sciences_%40INRIA_files/transfer-bandit-1.pdf">here</a>]<br/><br/>D. Calandriello, A. Lazaric, M.
Restelli, "Sparse Multi-task Reinforcement Learning", NIPS 2014. [<a href="Entries/2014/12/10_Invited_talk_at_%2230_minutes_de_sciences_%40INRIA_files/sparse_mtrl_camera-1.pdf">here</a>]<br/><br/><br/><br/><br/>Invited talk at "Journée sur les méthodologies pour le contrôle de systèmes complexes"
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/11/17_Invited_talk_at_%22Journee_sur_les_methodologies_pour_le_controle_de_systemes_complexes%22.html
10aaa661-2e57-4e61-a3de-1513f0638dcbMon, 17 Nov 2014 15:04:50 +0100<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/11/17_Invited_talk_at_%22Journee_sur_les_methodologies_pour_le_controle_de_systemes_complexes%22_files/adp.png"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object001_1.png" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>I’m giving an invited talk on:<br/><br/>Approximate Dynamic Programming meets Statistical Learning Theory<br/><br/>Approximate dynamic programming (ADP) refers to a set of techniques integrating approximation schemes (e.g., regression) into dynamic programming algorithms to solve the problem of decision-making under uncertainty in large and/or (partially) unknown environments. While ADP has been successfully applied in a wide range of domains (e.g., logistics, transportation, robotics, finance), its performance and properties are not always clearly understood. In this talk, we will review how we can resort to tools from statistical learning theory (SLT), adapting them to obtain a full theoretical characterization of the behavior of ADP algorithms and to derive bounds on their performance loss. After a general overview of the problem, we will focus on two specific instances of value and policy iteration (notably fitted Q-iteration (FQI) and least-squares policy iteration (LSPI)), analyze their performance, and review the lessons that can be learned from an effective use of statistical learning theory in the analysis of ADP algorithms.<br/><br/><br/><br/><br/>Three papers at NIPS'14
http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/9/15_Three_papers_at_NIPS14_1.html
0563cb11-dcee-4a68-873a-30ae1e97d18cMon, 15 Sep 2014 11:21:54 +0200<a href="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Entries/2014/9/15_Three_papers_at_NIPS14_1_files/00133.jpg"><img src="http://researchers.lille.inria.fr/%7Elazaric/Webpage/Home/Media/object001_7.jpg" style="float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;"/></a>My colleagues and I got three papers accepted at NIPS’14!<br/><br/>Best-Arm Identification in Linear Bandits (with M. Soare and R. Munos)<br/><br/>We study the best-arm identification problem in linear bandits, where the rewards of the arms depend linearly on an unknown parameter $\theta^*$ and the objective is to return the arm with the largest reward. We characterize the complexity of the problem and introduce sample allocation strategies that pull arms to identify the best arm with a fixed confidence while minimizing the sample budget. In particular, we show the importance of exploiting the global linear structure to improve the estimates of the rewards of near-optimal arms. We analyze the proposed strategies and compare their empirical performance. Finally, as a by-product of our analysis, we point out the connection to the $G$-optimality criterion used in optimal experimental design.<br/><br/>Exploiting easy data in online optimization (with A. Sani and G. Neu)<br/><br/>We consider the problem of online optimization, where a learner picks a decision from a given decision set and suffers some loss associated with the decision and the state of the environment. The learner's objective is to minimize its cumulative regret against the best fixed decision in hindsight. Over the past few decades, numerous variants have been considered, with many algorithms designed to achieve sublinear regret in the worst case. However, this level of robustness comes at a cost.
Proposed algorithms are often over-conservative, failing to adapt to the actual complexity of the loss sequence, which is often far from the worst case. In this paper we introduce a general algorithm that, provided with a "safe" learning algorithm and an opportunistic "benchmark", is able to effectively combine good worst-case guarantees with much improved performance on "easy" data. We derive general theoretical bounds on the regret of the proposed algorithm and discuss its implementation in a wide range of applications, notably in the problem of learning with shifting experts (a recent COLT open problem). Finally, we provide numerical simulations in the setting of prediction with expert advice, with a comparison to the state of the art.<br/><br/><br/>Sparse Multi-Task Reinforcement Learning (with D. Calandriello and M. Restelli)<br/><br/>In multi-task reinforcement learning (MTRL), the objective is to simultaneously learn multiple tasks and exploit their similarity to improve performance w.r.t. single-task learning. In this paper we investigate the case when all the tasks can be accurately represented in a linear approximation space using the same small subset of the original (large) set of features. This is equivalent to assuming that the weight vectors of the task value functions are jointly sparse, i.e., the set of their non-zero components is small and shared across tasks. Building on existing results in multi-task regression, we develop two multi-task extensions of the fitted $Q$-iteration algorithm. While the first algorithm assumes that the tasks are jointly sparse in the given representation, the second one learns a transformation of the features in an attempt to find a sparser representation. For both algorithms we provide a sample complexity analysis and numerical simulations.<br/><br/><br/><br/><br/>
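Several of the entries above revolve around the stochastic linear bandit model, so a toy sketch of linear Thompson sampling may help fix ideas. This is a simplified illustration, not the exact algorithm or constants analyzed in the papers: the arms, noise level $\sigma$, regularizer $\lambda$, and the Gaussian posterior with covariance $\sigma^2 V^{-1}$ are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear bandit: arms x in R^d, reward = <theta*, x> + Gaussian noise
d, n_arms, T = 3, 20, 2000
arms = rng.normal(size=(n_arms, d))
theta_star = rng.normal(size=d)
means = arms @ theta_star
best_mean = means.max()

lam, sigma = 1.0, 0.5
V = lam * np.eye(d)   # regularized design matrix sum x_t x_t^T + lam*I
b = np.zeros(d)       # running sum of r_t * x_t
regret = 0.0
for t in range(T):
    theta_hat = np.linalg.solve(V, b)       # ridge estimate of theta*
    # Sample a parameter from N(theta_hat, sigma^2 V^{-1}) and act greedily
    cov = sigma**2 * np.linalg.inv(V)
    cov = (cov + cov.T) / 2.0               # symmetrize against fp error
    theta_tilde = rng.multivariate_normal(theta_hat, cov)
    a = int(np.argmax(arms @ theta_tilde))
    x = arms[a]
    r = x @ theta_star + sigma * rng.normal()
    V += np.outer(x, x)                     # update design matrix
    b += r * x
    regret += best_mean - means[a]          # pseudo-regret of the pull

avg_regret = regret / T
```

The random perturbation `theta_tilde` plays the role of optimism: with constant probability the sampled parameter overestimates the best arm, which drives exploration; the extra spread needed to guarantee that probability is the intuition behind the additional $\sqrt{d}$ factor, compared to UCB-like approaches, discussed in the first post above.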