
Data Science


Robin Linacre robinlinacre.com

SQL should be your default choice for data engineering pipelines

Robin Linacre:

SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable. A new SQL engine - DuckDB - makes SQL competitive with other high performance dataframe libraries, making SQL a good candidate for data of all sizes.

You can make a similar argument for SQL that Gary Bernhardt made for Vim. Here’s Gary on Vim; run your own s/Vim/SQL/g filter on this as you read it:

…just for me, 15 years; at the beginning of that time, TextMate was just becoming popular. Then Sublime Text was cool. Then Atom was cool. Then VS Code was cool. A lot of people switched between two of those, three of those, maybe all four of those, and that whole time I was just getting better and better and better at Vim… And you multiply that out by the length of a career, you use Vim for 40 years - you’re gonna be so good at it by the end, and it’s still gonna be totally relevant, I think.
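To make the pipeline argument concrete, here’s a minimal sketch of a SQL-first transform step. It uses Python’s built-in sqlite3 for portability (DuckDB exposes a very similar embedded API via `duckdb.sql(...)`, with columnar performance on top); the table and column names are invented for illustration.

```python
import sqlite3

# Illustrative only: the pipeline logic lives in plain, testable SQL
# rather than dataframe code. Table/column names here are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'west', 20.0),
                              (3, 'east', 5.0);
""")

# The transform is a readable SQL statement you can lift into any engine.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('east', 15.0), ('west', 20.0)]
```

Because the transform is just SQL text, swapping sqlite3 for DuckDB (or a warehouse) changes the connection line, not the logic.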

Tooling jeremiak.com

Datasette is my data hammer

Jeremia Kimelman:

Datasette is an open source tool that takes an SQLite database and gives you an out-of-the-box, web-based UI built specifically for exploring data. Need an example? Here’s a database of all of Motley Fool’s earning transcripts that I used to look for talk of their California campaign activity. And here’s a bunch of other examples of Datasette from the official site.

And the thing is: I love Datasette. It recently turned 5 years old and I wanted to write down the thing that makes it an absolutely delightful data hammer.
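Part of the hammer’s appeal is how little setup it needs: Datasette serves any SQLite file as-is. A hypothetical sketch (table and column names invented) of producing such a file:

```python
import sqlite3

# Hypothetical example data: build a small SQLite database that Datasette
# could serve directly with `datasette campaigns.db`. Writing to ":memory:"
# here keeps the demo self-contained; use a filename to get a real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (donor TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?)",
    [("Acme Corp", 500.0), ("Jane Doe", 25.0)],
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM contributions").fetchone()[0]
print(count)  # 2
```

From there, the web UI for filtering, faceting, and sharing the data comes for free.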

Neovim maxwellrules.com

Using Jupyter Notebooks inside NeoVim

Guillem Ballesteros:

I have reached Vim nirvana with my latest setup. I can finally bring all the advantages of working within Jupyter to my favorite text editor. You get the code cells and interactive development with a fine-tuned editor and plain text files which can be put through linters and code formatters.

He goes on to share the plugins and config that make the nirvana happen.
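The key enabler for setups like this is the plain-text “percent” cell format (used by jupytext and many editor plugins): cells are just specially-marked comments, so the file remains an ordinary Python script that linters and formatters understand. A tiny sketch:

```python
# %% [markdown]
# Cells are delimited with `# %%` comment markers (the "percent" format),
# so this file is simultaneously a notebook and a plain Python script.

# %%
values = [1, 2, 3]

# %%
total = sum(values)
print(total)  # 6
```

Run it top-to-bottom as a script, or cell-by-cell from an editor that understands the markers.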

Chip Huyen huyenchip.com

Introduction to streaming for data scientists

Chip Huyen:

As machine learning moves towards real-time, streaming technology is becoming increasingly important for data scientists. Like many people coming from a machine learning background, I used to dread streaming. In our recent survey, almost half of the data scientists we asked said they would like to move from batch prediction to online prediction but can’t because streaming is hard, both technically and operationally…

Over the last year, working with a co-founder who’s super deep into streaming, I’ve learned that streaming can be quite intuitive. This post is an attempt to rephrase what I’ve learned.
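The intuition at the core of the batch-vs-streaming distinction can be shown in miniature (this sketch is illustrative, not from the post): a batch job recomputes over the whole dataset, while a streaming consumer updates state incrementally as each event arrives.

```python
# Illustrative sketch: a streaming consumer keeps running state and emits
# an updated result per event, instead of recomputing over a full dataset.

def streaming_mean(events):
    """Yield the running mean after each incoming event."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count

# Simulated event stream (e.g. feature values arriving in real time).
stream = iter([10.0, 20.0, 30.0])
means = list(streaming_mean(stream))
print(means)  # [10.0, 15.0, 20.0]
```

Real systems add delivery guarantees, windowing, and state checkpointing on top, but the incremental-update shape is the same.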

Sean Moriarity dockyard.com

Elixir versus Python for data science

Sean Moriarity:

A common argument against using Nx for a new machine learning project is its perceived lack of a library/support for some common task that is available in Python. In this post, I’ll do my best to highlight areas where this is not the case, and compare and contrast Elixir projects with their Python equivalents. Additionally, I’ll discuss areas where the Elixir ecosystem still comes up short, and using Nx for a new project might not be the best idea.

Sean is a prominent member of the Elixir community, so that’s the perspective on display here, but it’s a thorough and well-reasoned comparison. He concludes:

While there are still many gaps in the Elixir ecosystem, the progress over the last year has been rapid. Almost every library I’ve mentioned in this post is less than two years old, and I suspect there will be many more projects to fill some of the gaps I’ve mentioned in the coming months.

Python kaggle.com

Get the daily Wordle on the first try using the tweet distribution

I love how much hacking has been inspired by Wordle.

The Wordle source code contains 2,315 days of answers (all common 5-letter English words) and 10,657 other valid, less-common 5-letter English words.

We combine these to form a set of 12,972 possible words/answers.

We then simulate playing 1,000 Wordle games for each of these possible words, guessing based on the frequency of the word in the English language and the feedback received.

Then we take three measures to evaluate the observed distribution of ⬛🟨🟩 squares on Twitter according to our valid words.

The resulting code is included in the article.
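One building block of any simulation like this is scoring a guess against an answer to produce the ⬛🟨🟩 feedback pattern. A hypothetical sketch (not the notebook’s code) of that scoring step, including the subtle no-letter-reuse rule for yellows:

```python
# Hypothetical sketch: score a Wordle guess against the answer, returning
# 'g' (green), 'y' (yellow), or 'b' (black) per position.

def score(guess, answer):
    feedback = ["b"] * 5
    remaining = []
    # First pass: greens consume their letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "g"
        else:
            remaining.append(a)
    # Second pass: yellows for letters present elsewhere, without reuse.
    for i, g in enumerate(guess):
        if feedback[i] == "b" and g in remaining:
            feedback[i] = "y"
            remaining.remove(g)
    return "".join(feedback)

print(score("crane", "caper"))  # gyyby
```

Simulating a game is then a loop: score the guess, filter the candidate words to those consistent with the feedback, and pick again.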

Alex Strick van Linschoten github.com

ZenML helps data scientists work across the full stack

ZenML is an extensible MLOps framework to create production-ready machine learning pipelines. Built for data scientists, it has a simple, flexible syntax, is cloud and tool agnostic, and has interfaces/abstractions that are catered towards ML workflows.

The code base was recently completely rewritten with better abstractions and to set us up for our ongoing growth and inclusion of more integrations with tools that data scientists love to use.

Electron github.com

A desktop app for JupyterLab (based on Electron)

If you already know what JupyterLab is, then I don’t have to tell you why this might be exciting/useful. If you don’t, well, here’s what JupyterLab is:

JupyterLab is the next-generation user interface for Project Jupyter offering all the familiar building blocks of the classic Jupyter Notebook (notebook, terminal, text editor, file browser, rich outputs, etc.) in a flexible and powerful user interface. JupyterLab will eventually replace the classic Jupyter Notebook.


Lj Miranda ljvmiranda921.github.io

How to improve software engineering skills as a researcher

In which Lj Miranda proposes an exercise that data scientists can do to learn relevant software skills (with a tangible output in the end).

Create a machine learning application that receives HTTP requests, then deploy it as a containerized app.

I’m willing to wager that this is a worthy goal even if you’re coming from the software engineering side of the spectrum. Don’t worry, he’ll walk you through the steps.
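A minimal sketch of the exercise’s first half (the names and the “model” here are invented): wrap a prediction function in an HTTP endpoint using only the standard library. A real version would load a trained model and add a Dockerfile for the containerized-deployment half.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in model: classify by a simple threshold on the feature sum."""
    return "positive" if sum(features) > 1.0 else "negative"

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        response = json.dumps({"label": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Exercise the endpoint once, as a deployed client would.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}",
    data=json.dumps({"features": [0.9, 0.4]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result)  # {'label': 'positive'}
server.shutdown()
```

In practice you’d reach for a framework like FastAPI or Flask, but the request-in/prediction-out shape is identical.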

Career mihaileric.com

We don't need data scientists, we need data engineers

TLDR:

There are 70% more open roles at companies in data engineering as compared to data science. As we train the next generation of data and machine learning practitioners, let’s place more emphasis on engineering skills.

This vibes with what I’ve been hearing on Practical AI lately. Organizations are facing big challenges when it comes to deploying, maintaining, and improving data processing tools and platforms in production settings. Big challenges produce big opportunities. And what does a data engineer do? According to this article:

Develops a robust and scalable set of data processing tools/platforms. Must be comfortable with SQL/NoSQL database wrangling and building/maintaining ETL pipelines.

If you have that skillset, you are in high demand today. And if you can adapt that skillset and be considered a ML engineer, you will be in high demand for a long, long time.
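That “building/maintaining ETL pipelines” skillset has a recognizable shape, sketched here with invented data: extract raw records, transform them defensively (bad rows are a fact of life), and load the result into a destination.

```python
# Illustrative only: the extract/transform/load shape in miniature.
# Real pipelines swap these functions for database readers, SQL
# transforms, and warehouse writers.

def extract():
    """Pull raw records from a source (here, a hard-coded list)."""
    return [
        {"user": "a", "ms": "1500"},
        {"user": "b", "ms": "oops"},   # malformed row to be dropped
        {"user": "a", "ms": "2500"},
    ]

def transform(rows):
    """Clean and aggregate: drop bad rows, convert units, sum per user."""
    totals = {}
    for row in rows:
        try:
            seconds = int(row["ms"]) / 1000
        except ValueError:
            continue  # a real pipeline would log/quarantine this row
        totals[row["user"]] = totals.get(row["user"], 0) + seconds
    return totals

def load(totals, sink):
    """Write results to a destination (here, an in-memory dict)."""
    sink.update(totals)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'a': 4.0}
```

Much of the day-to-day engineering work is hardening exactly these seams: schema drift in extract, edge cases in transform, idempotency in load.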


Peter Wang anaconda.com

Anaconda's dividend program helps sustain the open source DS/ML community

Anaconda CEO (and Practical AI guest) Peter Wang:

I am excited to announce the Anaconda Dividend Program, which formalizes our commitment to direct a portion of our revenue to open-source projects that help advance innovation in data science. We are launching the program in partnership with NumFOCUS, and will kick off with a seed donation of $10,000, as well as an additional 10% of single-user Commercial Edition subscription revenue through the end of this year. Going forward, we will fund the dividend with at least 1% of our revenue in 2021, with a minimum of $25,000 committed for the year.

We’ve been beating the successful-businesses-that-thrive-in-large-part-due-to-open-source-software-should-set-aside-revenues-to-support-those-projects drum for years now, so it’s exciting to see forward-looking companies like Anaconda step up and do just that. More like this! 🙏

Go github.com

Go+ is like Go if it were built for data scientists

This new data-science-focused language is fully compatible with Go*, but streamlines things for data science use. It simplifies common scripting tasks. This in Go:

package main

func main() {
    a := []float64{1, 2, 3.4}
    println(a)
}

Becomes this in Go+:

a := [1, 2, 3.4]
println(a)

And adds features like list comprehensions for easier data processing:

a := [1, 3, 5, 7, 11]
b := [x*x for x <- a, x > 3]
println(b) // output: [25 49 121]

mapData := {"Hi": 1, "Hello": 2, "Go+": 3}
reversedMap := {v: k for k, v <- mapData}
println(reversedMap) // output: map[1:Hi 2:Hello 3:Go+]

It can be compiled directly to bytecode or transpiled into Go code. Give it a go on the playground.

*I almost described it as a “superset” of Go, but I’m not 💯 sure that’s true.

Career dfrieds.com

Data Science: reality doesn't meet expectations

After taking a 12-week data science bootcamp in 2016 and then launching into industry, Dan Friedman’s expectations weren’t remotely met:

Over the past few years, I’ve worked as a Data Scientist, a Data Engineer, and as an industry consultant. I’ve also learned from the stories of dozens of data scientists and similar professions, actively read articles on data science and followed data science thought leaders on Twitter.

Across these diverse data experiences, I have noticed common themes.

Below are the seven most common (and at times flagrant) ways that data science has failed to meet expectations in industry. Throughout each section, I’ll propose solutions to these shortcomings.

Maybe I’ve been listening to Practical AI too much, but I am not surprised that one of his seven shortcomings is that most of the job is spent cleaning data. That being said, there’s a lot here that is surprising to me and worthy of consideration for anyone thinking about entering the industry.

StackShare

Cultivating your data lake

This post by Lauren Reeder of Segment goes over the different layers to consider when working with a data lake. What’s a data lake, you ask?

A data lake is a centralized repository that stores both structured and unstructured data and allows you to store massive amounts of data in a flexible, cost effective storage layer.

Her article explains what tools are needed and provides code & SQL statements to get started. 🤟

Andrew Ste cvcompiler.com

The most in-demand data science skills of 2019

Since data science has a huge impact on today’s businesses, the demand for DS experts is growing. At the moment I’m writing this, there are 144,527 data science jobs on LinkedIn alone. But still, it’s important to keep your finger on the pulse of the industry to be aware of the fastest and most efficient data science solutions.

Click through for key takeaways and trend analysis.


Machine Learning towardsdatascience.com

How to automate tasks on GitHub with machine learning for fun and profit

This is an explainer on how to build a GitHub App that predicts and applies issue labels using Tensorflow and public datasets. Hamel Husain writes:

In order to show you how to create your own apps, we will walk you through the process of creating a GitHub app that can automatically label issues. Note that all of the code for this app, including the model training steps are located in this GitHub repository.

See also: Issue Label Bot
