Data Science Icon

Data Science

141 Stories
All Topics

Robin Linacre robinlinacre.com

SQL should be your default choice for data engineering pipelines

Robin Linacre:

SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes.

You can make a similar argument for SQL that Gary Bernhardt made for Vim. Here’s Gary on Vim, run your own s/Vim/SQL/g filter on this as you read it:

…just for me, 15 years; at the beginning of that time, TextMate was just becoming popular. Then it was Sublime Text was cool. Then Atom was cool. Then VS Code was cool. A lot of people switched between two of those, three of those, maybe all four of those, and that whole time I was just getting better and better and better at Vim… And you multiply that out by the length of a career, you use Vim for 40 years - you’re gonna be so good at it by the end, and it’s still gonna be totally relevant, I think.

Tooling jeremiak.com

Datasette is my data hammer

Jeremia Kimelman:

Datasette is an open source tool that takes an SQLite database and gives you an out-of-the-box, web-based UI built specifically for exploring data. Need an example? Here’s a database of all of Motley Fool’s earning transcripts that I used to look for talk of their California campaign activity. And here’s a bunch of other examples of Datasette from the official site.

And the thing is: I love Datasette. It recently turned 5 years old and I wanted to write down the thing that makes it an absolutely delightful data hammer.

Neovim maxwellrules.com

Using Jupyter Notebooks inside NeoVim

Guillem Ballesteros:

I have reached Vim nirvana with my latest setup. I can finally bring all the advantages of working within Jupyter to my favorite text editor. You get the code cells and interactive development with a fine-tuned editor and plain text files which can be put through linters and code formatters.

He goes on to share the plugins and config that make the nirvana happen.

Practical AI Practical AI #203

AI competitions & cloud resources

In this special episode, we interview some of the sponsors and teams from a recent case competition organized by Purdue University, Microsoft, INFORMS, and SIL International. 170+ teams from across the US and Canada participated in the competition, which challenged students to create AI-driven systems to caption images in three languages (Thai, Kyrgyz, and Hausa).

Practical AI Practical AI #201

Protecting us with the Database of Evil

Online platforms and their users are susceptible to a barrage of threats – from disinformation to extremism to terror. Daniel and Chris chat with Matar Haller, VP of Data at ActiveFence, a leader in identifying online harm – is using a combination of AI technology and leading subject matter experts to provide Trust & Safety teams with precise, real-time data, in-depth intelligence, and automated tools to protect users and ensure safe online experiences.

Practical AI Practical AI #197

Data for All

People are starting to wake up to the fact that they have control and ownership over their data, and governments are moving quickly to legislate these rights. John K. Thompson has written a new book on the topic that is a must read! We talk about the new book in this episode along with how practitioners should be thinking about data exchanges, privacy, trust, and synthetic data.

Practical AI Practical AI #196

What's up, DocQuery?

Chris sits down with Ankur Goyal to talk about DocQuery, Impira’s new open source ML model. DocQuery lets you ask questions about semi-structured data (like invoices) and unstructured documents (like contracts) using Large Language Models (LLMs). Ankur illustrates many of the ways DocQuery can help people tame documents, and references Chris’s real life tasks as a non-profit director to demonstrate that DocQuery is indeed practical AI.

Practical AI Practical AI #195

Production data labeling workflows

It’s one thing to gather some labels for your data. It’s another thing to integrate data labeling into your workflows and infrastructure in a scalable, secure, and useful way. Mark from Xelex joins us to talk through some of what he has learned after helping companies scale their data annotation efforts. We get into workflow management, labeling instructions, team dynamics, and quality assessment. This is a super practical episode!

Practical AI Practical AI #191

Privacy in the age of AI

In this Fully-Connected episode, Daniel and Chris discuss concerns of privacy in the face of ever-improving AI / ML technologies. Evaluating AI’s impact on privacy from various angles, they note that ethical AI practitioners and data scientists have an enormous burden, given that much of the general population may not understand the implications of the data privacy decisions of everyday life.

This intentionally thought-provoking conversation advocates consideration and action from each listener when it comes to evaluating how their own activities either protect or violate the privacy of those whom they impact.

Chip Huyen huyenchip.com

Introduction to streaming for data scientists

Chip Huyen:

As machine learning moves towards real-time, streaming technology is becoming increasingly important for data scientists. Like many people coming from a machine learning background, I used to dread streaming. In our recent survey, almost half of the data scientists we asked said they would like to move from batch prediction to online prediction but can’t because streaming is hard, both technically and operationally…

Over the last year, working with a co-founder who’s super deep into streaming, I’ve learned that streaming can be quite intuitive. This post is an attempt to rephrase what I’ve learned.

Practical AI Practical AI #187

AI IRL & Mozilla's Internet Health Report

Every year Mozilla releases an Internet Health Report that combines research and stories exploring what it means for the internet to be healthy. This year’s report is focused on AI. In this episode, Solana and Bridget from Mozilla join us to discuss the power dynamics of AI and the current state of AI worldwide. They highlight concerning trends in the application of this transformational technology along with positive signs of change.

Sean Moriarity dockyard.com

Elixir versus Python for data science

Sean Moriarity:

A common argument against using Nx for a new machine learning project is its perceived lack of a library/support for some common task that is available in Python. In this post, I’ll do my best to highlight areas where this is not the case, and compare and contrast Elixir projects with their Python equivalents. Additionally, I’ll discuss areas where the Elixir ecosystem still comes up short, and using Nx for a new project might not be the best idea.

Sean is a prominent member of the Elixir community, so that’s the perspective on display here, but it’s a thorough and well-reasoned comparison. He concludes:

While there are still many gaps in the Elixir ecosystem, the progress over the last year has been rapid. Almost every library I’ve mentioned in this post is less than two years old, and I suspect there will be many more projects to fill some of the gaps I’ve mentioned in the coming months.

Practical AI Practical AI #183

AI's role in reprogramming immunity

Drausin Wulsin, Director of ML at Immunai, joins Daniel & Chris to talk about the role of AI in immunotherapy, and why it is proving to be the foremost approach in fighting cancer, autoimmune disease, and infectious diseases.

The large amount of high dimensional biological data that is available today, combined with advanced machine learning techniques, creates unique opportunities to push the boundaries of what is possible in biology.

To that end, Immunai has built the largest immune database called AMICA that contains tens of millions of cells. The company uses cutting-edge transfer learning techniques to transfer knowledge across different cell types, studies, and even species.

Practical AI Practical AI #171

Clothing AI in a data fabric

What happens when your data operations grow to Internet-scale? How do thousands or millions of data producers and consumers efficiently, effectively, and productively interact with each other? How are varying formats, protocols, security levels, performance criteria, and use-case specific characteristics meshed into one unified data fabric? Chris and Daniel explore these questions in this illuminating and Fully-Connected discussion that brings this new data technology into the light.

Practical AI Practical AI #166

Exploring deep reinforcement learning

In addition to being a Developer Advocate at Hugging Face, Thomas Simonini is building next-gen AI in games that can talk and have smart interactions with the player using Deep Reinforcement Learning (DRL) and Natural Language Processing (NLP). He also created a Deep Reinforcement Learning course that takes a DRL beginner to from zero to hero. Natalie and Chris explore what’s involved, and what the implications are, with a focus on the development path of the new AI data scientist.

Python kaggle.com

Get the daily Wordle on the first try using the tweet distribution

I love how much hacking has been inspired by Wordle.

The Wordle source code contains 2,315 days of answers (all common 5-letter English words) and 10,657 other valid, less-common 5-letter English words.

We combine these to form a set of 12,972 possible words/answers.

We then simulate playing 1,000 Wordle games for each of these possible words, guessing based on the frequency of the word in the English language and the feedback received.

Then we take three measures to evaluate the observed distribution of ⬛🟨🟩 squares on Twitter according to our valid words.

The resulting code is included in the article.

  0:00 / 0:00