High-Dimensional Statistics

6 ECTS | Instructor: Stéphane Boucheron | Assessment: continuous assessment + exam
Weekly hours: 2h lectures, 1h tutorials | Duration: 10 weeks
Shared with: M2 MIDS
Website: http://stephane-v-boucheron.fr/courses/mmd/
Moodle: https://moodle.u-paris.fr/enrol/index.php?id=7947

Full title: Algorithms for Massive Data

Use of randomized methods for processing massive data and data streams (streaming). Introduction to Spark. Interplay between estimation and optimization.

Objectives

Master randomized methods for processing massive data
Become familiar with data stream processing (streaming)
Use Spark for practical applications
Understand the interplay between estimation and optimization

Syllabus

1. Nearest neighbors in high dimension

Locality-sensitive hashing and beyond
Applications to text data (Spark ML Feature Extraction)
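
As an illustration of this topic (not taken from the course material), here is a minimal sketch of MinHash signatures, the building block of one locality-sensitive hashing family for Jaccard similarity on text. The function names, the character 3-gram shingling, and the linear hash family modulo a Mersenne prime are all assumptions made for this example; it does not use Spark.

```python
import random

PRIME = (1 << 61) - 1  # large Mersenne prime for the (assumed) hash family


def shingles(text, k=3):
    """Return the set of character k-grams of a string (needs len(text) >= k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hashes(n, seed=0):
    """Draw n random (a, b) pairs defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def minhash_signature(items, hashes):
    """For each hash function, keep the minimum hash value over the set."""
    # Encode strings as integers in a reproducible way (built-in hash() of
    # str is randomized per process, so it is avoided here).
    keys = [int.from_bytes(s.encode("utf-8"), "big") % PRIME for s in items]
    return [min((a * x + b) % PRIME for x in keys) for a, b in hashes]


def estimated_jaccard(sig1, sig2):
    """Fraction of matching coordinates, an estimate of Jaccard similarity."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


def jaccard(s1, s2):
    """Exact Jaccard similarity, for comparison."""
    return len(s1 & s2) / len(s1 | s2)
```

Two documents with similar shingle sets agree on a large fraction of signature coordinates. A full LSH index would additionally split each signature into bands and bucket documents by band, which is omitted here.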

2. Compressed sensing

Exact recovery of sparse signals via ℓ1 penalization
Algorithms (LASSO, ADMM, coordinate descent, …)
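
To make the coordinate descent item concrete, here is a small pure-Python sketch for the LASSO objective (1/(2n))·‖y − Xβ‖² + λ‖β‖₁. The objective scaling, the function names, and the fixed number of sweeps are assumptions chosen for this illustration, not the course's reference implementation.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward 0 by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of n rows, each a list of p floats; y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual, coordinate j excluded
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Each coordinate update has a closed form, which is why the method needs no step size; coefficients whose correlation with the residual stays below λ are set exactly to zero, giving sparse solutions.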

3. Streaming data

Sampling methods
Approximate counting (HyperLogLog, Spark SQL)
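
The two items above can be illustrated with a short standalone sketch: reservoir sampling (Algorithm R) for drawing a uniform sample from a stream of unknown length, and a simplified HyperLogLog counter for estimating the number of distinct items. The class and function names, the choice of BLAKE2b as a 64-bit hash, and the default of 2^10 registers are assumptions of this example; it is not Spark's implementation.

```python
import hashlib
import math
import random


def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for t, item in enumerate(stream):
        if t < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, t)  # inclusive bounds: uniform over t + 1 slots
            if j < k:
                reservoir[j] = item
    return reservoir


class HyperLogLog:
    """Simplified HyperLogLog with m = 2**p registers (assumes p >= 7)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        x = int.from_bytes(digest, "big")        # 64-bit hash value
        idx = x >> (64 - self.p)                 # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1  # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-cardinality correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

With m registers the relative standard error of HyperLogLog is roughly 1.04/√m, so about 3% for the default used here, while memory stays fixed regardless of stream length.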

4. Robust estimation

Challenges
Median of Means
SDP relaxation
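
As a small illustration of the Median of Means item, the sketch below splits a sample into blocks at random, averages each block, and returns the median of the block means. The function name, the random shuffling, and the way blocks are formed are assumptions of this example.

```python
import random
import statistics


def median_of_means(values, n_blocks, seed=0):
    """Median-of-means estimate of the mean of a univariate sample."""
    data = list(values)
    if not 1 <= n_blocks <= len(data):
        raise ValueError("n_blocks must be between 1 and the sample size")
    rng = random.Random(seed)
    rng.shuffle(data)  # assign observations to blocks at random
    # Block i takes every n_blocks-th observation starting at position i
    blocks = [data[i::n_blocks] for i in range(n_blocks)]
    return statistics.median(statistics.fmean(b) for b in blocks)
```

A single gross outlier can corrupt at most one block mean, so the median of the block means is unaffected as long as fewer than half of the blocks are contaminated, whereas the empirical mean can be moved arbitrarily far.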

Format

In-person lectures, with a dedicated website and Moodle page.

Bibliography

Arnold, T. & Tilton, L. (2015). Humanities data in R: exploring networks, geospatial data, images, and text. Berlin: Springer.
Bandeira, A. S. (2015). Ten lectures and forty-two open problems in the mathematics of data science. Lecture Notes.
Blum, A., Hopcroft, J., & Kannan, R. (2016). Foundations of data science. Preliminary version of a textbook.
Boucheron, S., Lugosi, G., & Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford: Oxford University Press.
Chambers, B. & Zaharia, M. (2018). Spark: the definitive guide: big data processing made simple. Sebastopol: O’Reilly Media, Inc.
Foucart, S. & Rauhut, H. (2013). A mathematical introduction to compressive sensing. Boston: Birkhäuser.
Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge: Cambridge University Press.
Lugosi, G. (2017). Lectures on Combinatorial Statistics. St. Flour.
Moitra, A. (2018). Algorithmic aspects of machine learning. Cambridge: Cambridge University Press.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science. Cambridge: Cambridge University Press.
