Notes and links from SSC2021. I often find I get the most from conferences if I take notes with the vague notion of sharing them. So this is mostly for me, but I hope it might be of interest to others, too. Note: This post will be updated sporadically.
1:45 PM - 2:00 PM EDT on Monday, June 7
In response to disruptions to our students’ plans due to COVID-19 and widespread feelings of isolation, we built a virtual community to help our students build portfolios, meet peers, and explore careers. In all, over 700 students in our programs signed up, and among the 164 students who responded to our end-of-summer survey, 41% were active and 48% passive participants; our data suggests that even passive participation was beneficial in making students feel more connected. As part of the ISSC, we held formal and informal data science workshops, social events, and career-building activities culminating in a DataFest COVID-19 Virtual Challenge. In all, 92 eligible teams applied to participate in DataFest, 62 were invited to compete, and 42 submitted complete submissions. In this talk, we will outline the principles guiding how the ISSC was structured including practical advice and tips for building a sustainable community, lessons learned, and plans for 2021.
3:45 PM - 4:00 PM EDT on Wednesday, June 9
Statistics and data science have become ubiquitous in the hard and social sciences. When working with large data or complex methodology it is crucial that the data analysts are able to program. R is a statistical programming language that is free and popular in the statistics community. R works well for data visualizations, wrangling and employing simple to complex methodology. As educators in statistics we noticed a variation of programming backgrounds in our senior students. Our team of 7 undergraduate students, 2 graduate students, and 2 assistant professors have developed a toolkit to help students improve their programming in R. The toolkit is a set of interactive modules that students complete autonomously. The modules start from the very basics of installing R to tidyverse to employing Bayesian methods. In this talk, we will outline the development and uses of this toolkit, and highlight some next steps.
Notes and reflections of varying quality from the talks I attended + resources shared.
Note: If I’ve written about a talk and there is anything you’d like me to correct or add, please just let me know! Twitter/email links in the navigation bar.
Jeffrey Rosenthal, @ProbabilityProf
Jeff took us on a journey through his experiences with media and especially as an expert witness in court cases. There is a mish-mash of links below and much more on his website: http://probability.ca.
An example of a case that was especially interesting/surprising was when he was an expert witness in a case about a marijuana grow-op. This story was really neat because significant jail time rested on whether the number of plants was more or less than 500. More than 500 leads to a mandatory three-year jail term.
The level of aggressive attacks you end up facing as an expert witness sounds extremely off-putting! I’m glad Jeff’s skin is thick enough to go out and do this kind of work, because mine certainly is not.
Some of Jeff’s final notes and advice:
Here are several links that were referred to during that talk or shared in chat. (Not exhaustive.)
Lottery link: http://probability.ca/lotteryscandal/
Annals Quadfecta: https://imstat.org/2021/05/14/the-annals-quadfecta-23/
Jeff’s Canadian Supreme Court opinion writing article: http://probability.ca/jeff/ftpdir/SCC_UTLJ.pdf
Discussion of SIDS/SUDI case history: https://en.wikipedia.org/wiki/Sally_Clark
www.probability.ca/justice
Donald Estep, Canadian Statistical Sciences Institute and Simon Fraser University, @donestep1
Natalie Shlomo, University of Manchester
John Eltinge, US Census Bureau
Anne-Sophie Charest, Université Laval
Pierre Desrochers, Statistics Canada
My background is using official statistics for social and health research in Aotearoa New Zealand, so it was excellent to dip a toe back into this world and hear about the context here.
Big thanks to CANSSI for supporting this panel. @CANSSIINCASS
Michael Moon, University of Toronto, @micbonmoon
dplyr, removes need to do a bunch of error-prone copying and pasting (systematic).
Great presentation from Michael, and I’m super excited about playing with this package!
Nathalie Moon, University of Toronto
I was part of the ISSC team, so no notes, other than I think Nathalie did a fab job explaining it.
Unofficial list of Canadian universities doing ASA DataFest: U of T, UBC, Waterloo, MacEwan/University of Alberta — we should all be friends!
Tharshanna Nadarajah, University of Toronto
Neat intro to MyOpenMath advantages and how Tharshanna has used it in teaching, student response and outcomes.
Sohee Kang, University of Toronto
Some lovely data collection activities that students can do in class from a phone or computer. Love this from the abstract: “Students often feel disengaged with data that they do not perceive as being ‘real’ or ‘authentic’, and it is important that they believe that the data they are analyzing is representative of real-world problems.”
Michael Correll, Tableau
“We’re just making bar charts, what’s ethically laden about that!” *sarcasm intensifies*
Types of bad visualizations: deceptive, fragile, bullshit 1, evil
Loved the example of when bar charts should start at 0, and how this ‘rule’ becomes a major issue when you try to show line graphs of global temperature and demonstrate the serious increases.
There are multiple places in the pipeline where things can go wrong, but this means there are places for us to intervene!
Any human-computer interaction assistance project eventually just becomes Clippy.
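To make the point about starting at 0 concrete for myself, here is a rough sketch of how much of the vertical axis a temperature rise takes up under two axis choices. The temperatures and axis limits are hypothetical values I picked for illustration (not from Michael's talk), and `share_of_axis` is just a helper name I made up.

```python
def share_of_axis(low, high, axis_min, axis_max):
    """Fraction of the vertical axis spanned by the change from low to high."""
    return (high - low) / (axis_max - axis_min)

# Hypothetical illustrative values, NOT data from the talk
baseline_c = 14.0  # assumed starting mean temperature (deg C)
warmed_c = 15.0    # assumed later mean temperature (deg C)

# Axis forced to start at 0: the rise is squashed into a sliver of the plot
zero_based = share_of_axis(baseline_c, warmed_c, axis_min=0.0, axis_max=16.0)

# Axis fitted to the data: the same rise fills half of the plot
fitted = share_of_axis(baseline_c, warmed_c, axis_min=13.5, axis_max=15.5)

print(f"zero-based axis: {zero_based:.1%} of plot height")
print(f"fitted axis:     {fitted:.1%} of plot height")
```

With these made-up numbers, the zero-based axis hides a change that matters, which is why the 'rule' suits bar charts (where bar length encodes the value) better than line graphs of a trend.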
Frankfurt. (1986). On bullshit. http://www2.csudh.edu/ccauthen/576f12/frankfurt__harry_-_on_bullshit.pdf
McNutt, Kindlmann & Correll. (2020). Surfacing Visualization Mirages https://arxiv.org/abs/2001.02316
Correll. (2018). Ethical Dimensions of Visualization Research. https://arxiv.org/pdf/1811.07271.pdf
Boris Babic, INSEAD but soon to be U of T
White box (like a good ol’ GLM that is easy for a human to understand) vs a black box (e.g., deep learning models), but of course there is lots of potential for grey boxes, like how many parameters is too many? What about level of understanding of the user?
Explainable vs interpretable
‘Ersatz understanding’: in law/philosophy the idea of understanding is based in the ‘why’ between input and output.
We want explainable systems because this is tied to our ideas around accountability and trust.
We also want robustness (e.g., we’d expect similar advice for similar patients in a health context). White box approximations can give us a sense of local faithfulness but have the potential to produce super different outcomes.
In summary, the main issues are pseudo/ersatz understanding—we have a false impression that we do understand what the black box is doing, the potential for non-stability when trying to use approximations, and the challenges due to existing issues with statistical literacy of data product users. Assumptions about causality, etc.
Kristian Lum, UPenn, @KLDivergence
Yajuan Si, University of Michigan, @yajuansi
Heng Chen, Bank of Canada
Nothing specific to say other than how glad I am that SSC and NSERC and CANSSI and partner institutions are committed to improving EDI.
Folks to follow: @BouchraNasri, @donestep1, @statacake, @alejandroadem
Lengyi Han, UBC
Han proposes that the usual ways of introducing Poisson processes and notation can be quite difficult for students, especially to connect to real world examples.
Motivation with a Bernoulli process example (arrivals at a bank)
Suppose we divide the time in question (say 60 minutes) into equal sub-intervals and a person can only arrive at the midpoint of a sub-interval.
If we make finer and finer sub-intervals we show that the Poisson and Bernoulli process will be approximately the same for sufficiently large \(n\).
After the above, reiterate the assumptions and conditions for a Bernoulli process. From here, students can see that as the interval width tends to 0, we get the Poisson dist.
Specifically, the Bernoulli process avoids the use of \(o(h)\) notation at first, and students work through the derivation, motivating the use of this notation as helpful instead of obscure.
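A quick numerical sketch of the idea above (mine, not code from Han's talk): split the window into \(n\) sub-intervals, allow at most one arrival per sub-interval with probability \(\lambda/n\), and compare the resulting Binomial counts with a Poisson(\(\lambda\)) distribution as \(n\) grows. The rate of 10 arrivals per hour and the function names are assumptions for illustration only.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(exactly k arrivals) when each of n sub-intervals has an arrival w.p. p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, rate):
    """P(exactly k arrivals) under a Poisson distribution with mean `rate`."""
    return exp(-rate) * rate**k / factorial(k)

def max_abs_gap(n, rate, k_max=30):
    """Largest pointwise gap between the two pmfs over k = 0..k_max."""
    p = rate / n  # arrival probability per sub-interval keeps the mean at `rate`
    return max(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, rate))
        for k in range(k_max + 1)
    )

RATE = 10.0  # hypothetical: expected arrivals in the 60-minute window

# 60 sub-intervals = 1 minute each, 600 = 6 seconds each, 6000 = 0.6 seconds each
gaps = {n: max_abs_gap(n, RATE) for n in (60, 600, 6000)}
for n, gap in gaps.items():
    print(f"n = {n:>4}: max |Binomial - Poisson| = {gap:.6f}")
```

The gap shrinks as the sub-intervals get finer, which is the convergence the Bernoulli-process motivation is meant to make visible before any \(o(h)\) notation appears.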
Sam Caetano (presenting), University of Toronto @StatisticalSam
Also check out: @RohanAlexander, @michaelycchong, @pailfodgetts
The learnr package is great for creating interactive tutorials: https://rstudio.github.io/learnr/
Douglas Whitaker, Mount Saint Vincent University @DouglasWhitaker
S-SOMAS: Student Survey of MOtivational Attitude toward Statistics
One of a family of 6 instruments: 2 topics (statistics, data science) across 3 types (student, instructor and environment).
MASDER: Motivational Attitudes in Statistics and Data Science Education Research team
“Expectancy-Value Theory”: achievement-related choices and performance are affected by what you value and what you expect to happen. EVERYTHING else operates through one or both of these.
Use PCA, check the loadings/dimensionality before item response theory.
eRm package (Mair et al., 2021), mirt package (Chalmers, 2012) for generalized partial credit.
I’m looking forward to seeing how these tools develop!
Nooshin Rotondi, Ontario Tech University
Loved this talk! Excited to see next steps.
Chris Holdgraf, 2i2c.org, @choldgraf
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
~ Buckheit and Donoho
WaveLab and Reproducible Research, 1995
“Jupyter and the last mile problem”
“Jupyter is a community of people and an ecosystem of open tools and standards for interactive computing”
Language agnostic: “empower people to use open tools”
Metaphor of getting you to a coffee shop: walk? pay someone to drive you? use public infrastructure?
The notebook experience combines a document standard, an environment (R, Python, etc.), and an interface.
Notebooks are “structured but generic”. They can have many different front ends, for example: ‘classic’, JupyterLab and nteract.
JupyterLab is meant to provide a more flexible and customizable space. It is standalone as a UI, but you can also put components together, Lego-style, into the UI you need.
Real-time collaboration is being worked on!! Still an early prototype, but I am so excited.
Jupyter Book and MyST (pronounced ‘mist’)
Markdown is usually not extensible, but MyST lets you set out roles and directives. Can add images and include references.
Want JupyterHub for a small user base (2-80 people)? The Littlest JupyterHub (tljh.jupyter.org). For larger groups, JupyterHub on Kubernetes.
Data8.org / inferentialthinking.org
We’ve been using JupyterHub in teaching at U of T for almost a year now, and it has been working really well. I use it with R and it also allows us to use RStudio which is really nice. Big thanks to Nathan Taback who has been leading this from our side, and to the members of 2i2c.org that have been helping us. Highly recommend.
I really like Chris’ emphasis on design principles in this talk.
The rest of the session was hands-on workshops from David Liu, Nathan Taback (@NathanTaback) and Nathaniel Stevens. https://github.com/ssc-datascience/pythonjupyter_wshop2021 They were really good!
Knock on Wood: Luck, Chance, and the Meaning of Everything by Jeffrey S. Rosenthal https://www.goodreads.com/book/show/36300756-knock-on-wood
Data Visualization: Charts, Maps, and Interactive Graphics by Robert Grant https://www.goodreads.com/book/show/40684954-data-visualization
1. In the sense of Frankfurt’s On Bullshit: http://www2.csudh.edu/ccauthen/576f12/frankfurt__harry_-_on_bullshit.pdf
For attribution, please cite this work as
Bolton (2021, June 7). Liza Bolton: My first Statistical Society of Canada Annual Meeting. Retrieved from blog.lizabolton.com/posts/2021-06-07_ssc2021/
BibTeX citation
@misc{bolton2021my,
  author = {Bolton, Liza},
  title = {Liza Bolton: My first Statistical Society of Canada Annual Meeting},
  url = {blog.lizabolton.com/posts/2021-06-07_ssc2021/},
  year = {2021}
}