Inspired by a recent article from Erik Bernhardsson titled “Building a data team at a mid-stage startup: a short story”, Chris and Daniel discuss all things AI/data team building. They share some stories from their experiences kick-starting AI efforts at various organizations and weigh the pros and cons of things like centralized data management, prototype development, and a focus on engineering skills.
Pinecone is the first vector database for machine learning. Edo Liberty explains to Chris how vector similarity search works, and its advantages over traditional database approaches for machine learning. It enables one to search through billions of vector embeddings for similar matches, in milliseconds, and Pinecone is a managed service that puts this capability at the fingertips of machine learning practitioners.
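To make the idea of vector similarity search concrete, here is a minimal sketch in plain Python. It is not Pinecone's API or implementation: it is a brute-force, exact search over a tiny in-memory dictionary, and the names `cosine_similarity`, `top_k`, and the `doc-*` IDs are hypothetical. Searching billions of vectors in milliseconds would typically rely on approximate nearest neighbor indexes rather than a full scan like this one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(item_id, score) for score, item_id in scored[:k]]


# Hypothetical embeddings; a real system would get these from a model.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for item_id, score in results:
    print(item_id, round(score, 3))
```

The full scan costs time proportional to the number of stored vectors, which is the cost a dedicated vector index is meant to avoid.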
William Falcon wants AI practitioners to spend more time on model development, and less time on engineering. PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research that lets you train on multiple GPUs, TPUs, CPUs and even in 16-bit precision without changing your code! In this episode, we dig deep into Lightning, how it works, and what it is enabling. William also discusses the Grid AI platform (built on top of PyTorch Lightning). This platform lets you seamlessly train 100s of Machine Learning models on the cloud from your laptop.
Chris and Daniel sit down to chat about some exciting new AI developments including wav2vec-u (an unsupervised speech recognition model) and meta-learning (a new book about “How To Learn Deep Learning And Thrive In The Digital World”). Along the way they discuss engineering skills for AI developers and strategies for launching AI initiatives in established companies.
Tuhin Srivastava tells Daniel and Chris why BaseTen is the application development toolkit for data scientists. BaseTen’s goal is to make it simple to serve machine learning models, write custom business logic around them, and expose those through API endpoints without configuring any infrastructure.
Ro Gupta from CARMERA teaches Daniel and Chris all about road intelligence. CARMERA maintains the maps that move the world, from HD maps for automated driving to consumer maps for human navigation.
Nhung Ho joins Daniel and Chris to discuss how data science creates insights into financial operations and economic conditions. They delve into topics ranging from predictive forecasting to aid small businesses, to learning about the economic fallout from the COVID-19 pandemic.
A fun little microsite where you’re given a name (example: Azurill) and you have to guess whether it’s a Big Data project or a Pokémon. Surprisingly difficult! 😆
Dave Lacey takes Daniel and Chris on a journey that connects the user interfaces that we already know - TensorFlow and PyTorch - with the layers that connect to the underlying hardware. Along the way, we learn about Poplar Graph Framework Software. If you are the type of practitioner who values ‘under the hood’ knowledge, then this is the episode for you.
Nikola Mrkšić, CEO & Co-Founder of PolyAI, takes Daniel and Chris on a deep dive into conversational AI, describing the underlying technologies, and teaching them about the next generation of voice assistants that will be capable of handling true human-level conversations. It’s an episode you’ll be talking about for a long time!
In which Lj Miranda proposes an exercise that data scientists can do to learn relevant software skills (with a tangible output in the end).
Create a machine learning application that receives HTTP requests, then deploy it as a containerized app.
I’m willing to wager that this is a worthy goal even if you’re coming from the software engineering side of the spectrum. Don’t worry, he’ll walk you through the steps.
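For a feel of what the first half of that exercise involves, here is a minimal sketch using only the Python standard library. It is not the walkthrough from the post: the "model" is a hypothetical fixed linear scorer (`WEIGHTS`, `BIAS`), and `predict`, `PredictHandler`, and `serve` are names invented for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = [0.5, -0.25]
BIAS = 0.1


def predict(features):
    """Return a score for one example with len(WEIGHTS) features."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            features = [float(x) for x in payload["features"]]
            if len(features) != len(WEIGHTS):
                raise ValueError("wrong number of features")
            body, status = {"prediction": predict(features)}, 200
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, missing key, or wrong shape.
            body = {"error": "expected a JSON object with a 'features' list"}
            status = 400
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(host="0.0.0.0", port=8000):
    """Block forever, answering POST requests with predictions."""
    HTTPServer((host, port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server, and a POST with a body like `{"features": [2.0, 4.0]}` returns a JSON prediction. Binding to `0.0.0.0` is a deliberate choice for the second half of the exercise: inside a container, a server bound only to localhost would not be reachable through a published port.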
Chris has the privilege of talking with Stanford Professor Margot Gerritsen, who co-leads the Women in Data Science (WiDS) Worldwide Initiative. This is a conversation that everyone should listen to. Professor Gerritsen’s profound insights into how we can all help the women in our lives succeed - in data science and in life - make this a ‘must listen’ episode for everyone, regardless of gender.
David Sweet, author of “Tuning Up: From A/B testing to Bayesian optimization”, introduces Dan and Chris to system tuning, and takes them from A/B testing to response surface methodology, contextual bandits, and finally Bayesian optimization. Along the way, we get fascinating insights into recommender systems and high-frequency trading!
Elad Walach of Aidoc joins Chris to talk about the use of AI for medical imaging interpretation. Starting with the world’s largest annotated training data set of medical images, Aidoc is the radiologist’s best friend, helping the doctor to interpret imagery faster, more accurately, and improving the imaging workflow along the way. Elad’s vision for the transformative future of AI in medicine clearly soothes Chris’s concern about managing his aging body in the years to come. ;-)
There are 70% more open roles at companies in data engineering as compared to data science. As we train the next generation of data and machine learning practitioners, let’s place more emphasis on engineering skills.
This vibes with what I’ve been hearing on Practical AI lately. Organizations are facing big challenges when it comes to deploying, maintaining, and improving data processing tools and platforms in production settings. Big challenges produce big opportunities. And what does a data engineer do? According to this article:
Develops a robust and scalable set of data processing tools/platforms. Must be comfortable with SQL/NoSQL database wrangling and building/maintaining ETL pipelines.
If you have that skillset, you are in high demand today. And if you can adapt that skillset and be considered an ML engineer, you will be in high demand for a long, long time.
John Myers of Gretel puts on his apron and rolls up his sleeves to show Dan and Chris how to cook up some synthetic data for automated data labeling, differential privacy, and other purposes. His military and intelligence community background gives him an interesting perspective that piqued the interest of our intrepid hosts.
Daniel and Chris sniff out the secret ingredients for collecting, displaying, and analyzing odor data with Terri Jordan and Yanis Caritu of Aryballe. It certainly smells like a good time, so join them for this scent-illating episode!
At this year’s Government & Public Sector R Conference (or R|Gov) our very own Daniel Whitenack moderated a panel on how AI practitioners can engage with governments on AI-for-good projects. That discussion is being republished in this episode for all our listeners to enjoy!
The panelists were Danya Murali from Arcadia Power and Emily Martinez from the NYC Department of Health and Mental Hygiene. Danya and Emily gave some great perspectives on sources of government data, ethical uses of data, and privacy.
Bharat Sandhu, Director of Azure AI and Mixed Reality at Microsoft, joins Chris and Daniel to talk about how Microsoft is making AI accessible and productive for users, and how AI solutions can address real world challenges that customers face. He also shares Microsoft’s research-to-product process, along with the advances they have made in computer vision, image captioning, and how researchers were able to make AI that can describe images as well as people do.
Unsplash has released the world’s largest open library dataset, which includes 2M+ high-quality Unsplash photos, 5M keywords, and over 250M searches. They have big ideas about how the dataset might be used by ML/AI folks, and there have already been some interesting applications. In this episode, Luke and Tim discuss why they released this data and what it takes to maintain a dataset of this size.
Lucy D’Agostino McGowan, cohost of the Casual Inference Podcast and a professor at Wake Forest University, joins Daniel and Chris for a deep dive into causal inference. Referring to current events (e.g. misreporting of COVID-19 data in Georgia) as examples, they explore how we interact with, analyze, trust, and interpret data - addressing underlying assumptions, counterfactual frameworks, and unmeasured confounders (Chris’s next Halloween costume).
Anaconda CEO (and Practical AI guest) Peter Wang:
I am excited to announce the Anaconda Dividend Program, which formalizes our commitment to direct a portion of our revenue to open-source projects that help advance innovation in data science. We are launching the program in partnership with NumFOCUS, and will kick off with a seed donation of $10,000, as well as an additional 10% of single-user Commercial Edition subscription revenue through the end of this year. Going forward, we will fund the dividend with at least 1% of our revenue in 2021, with a minimum of $25,000 committed for the year.
We’ve been beating the successful-businesses-that-thrive-in-large-part-due-to-open-source-software-should-set-aside-revenues-to-support-those-projects drum for years now, so it’s exciting to see forward-looking companies like Anaconda step up and do just that. More like this! 🙏
Rajiv Shah teaches Daniel and Chris about data leakage and its major impact on machine learning models. It’s the kind of topic that we don’t often think about, but which can ruin our results. Raj discusses how to use activation maps and image embedding to find leakage, so that information from our test set does not find its way into our training set.
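As a much simpler illustration than the activation-map and embedding techniques discussed in the episode, the sketch below checks for one basic form of leakage: records that appear verbatim in both the training and test sets. The function names and the toy records are hypothetical, and exact-match hashing will not catch near-duplicates or leaky features.

```python
import hashlib


def fingerprint(record):
    """Stable hash of a record's printed form (works for tuples and lists)."""
    return hashlib.sha256(repr(record).encode("utf-8")).hexdigest()


def find_overlap(train, test):
    """Return the test records that also appear verbatim in train."""
    train_hashes = {fingerprint(r) for r in train}
    return [r for r in test if fingerprint(r) in train_hashes]


# Toy data: the second training row is duplicated in the test set.
train = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
test = [(4.9, 3.0, "b"), (5.9, 3.0, "d")]

leaked = find_overlap(train, test)
print(f"{len(leaked)} of {len(test)} test records also appear in train")
```

A non-empty result means evaluation scores are likely optimistic, since the model has already seen some of the examples it is being tested on.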