Why Semantic Layers Matter (and how to build one with DuckDB)

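What a semantic layer is and how to build one, using DuckDB as the example
➤ Unlocking the power of the semantic layer: a guide from concept to DuckDB practice
✤ https://motherduck.com/blog/semantic-layer-duckdb-tutorial/
This article looks closely at why semantic layers matter and demonstrates how to build a simple one using DuckDB and Ibis together with YAML configuration files and Python scripts. The author stresses that a semantic layer can define business metrics in one consistent place, simplify complex analysis, make data governance more efficient, and improve interaction with AI models. The article also explains when a semantic layer is not needed and suggests resources for further study.
+ Finally a practical article about semantic layers, and it uses DuckDB, which I already know! Looking forward to trying it hands-on.
+ Well put, especially the point about a "single unified source of truth"; that one really resonates.
#DataAnalysis #SemanticLayer #DuckDB #Python #Ibis #DataGovernance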
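To make the summary above a little more concrete, here is a minimal sketch of the core idea: metrics are defined once and every query is compiled from those definitions. It is not the article's code. The article is described as using YAML and Ibis; this sketch assumes a plain Python dict in place of the YAML file and only builds a SQL string, and the table, column, and function names (`orders`, `amount`, `ordered_at`, `compile_query`) are hypothetical.

```python
# Hypothetical semantic model; a dict stands in for the YAML config file.
SEMANTIC_MODEL = {
    "table": "orders",
    "dimensions": {
        "region": "region",
        "order_month": "date_trunc('month', ordered_at)",
    },
    "measures": {
        "revenue": "sum(amount)",
        "order_count": "count(*)",
    },
}


def compile_query(model, dimensions, measures):
    """Compile requested dimension and measure names into one SQL string."""
    # Reject anything that is not defined centrally in the model.
    unknown = [d for d in dimensions if d not in model["dimensions"]]
    unknown += [m for m in measures if m not in model["measures"]]
    if unknown:
        raise KeyError(f"not defined in the semantic model: {unknown}")

    columns = [f"{model['dimensions'][d]} AS {d}" for d in dimensions]
    columns += [f"{model['measures'][m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(columns)} FROM {model['table']}"

    # Group by the dimension columns using their positions in the SELECT list.
    if dimensions:
        positions = ", ".join(str(i) for i in range(1, len(dimensions) + 1))
        sql += f" GROUP BY {positions}"
    return sql


print(compile_query(SEMANTIC_MODEL, ["region"], ["revenue", "order_count"]))
```

The generated string could then be handed to whatever engine is in use. Because "revenue" is spelled out only once, in the model, every report that asks for it gets the same definition.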
I just shipped the biggest update since I started building Shaper.
This is why I am excited about Shaper's new Tasks feature:
https://taleshape.com/blog/why-i-am-excited-about-shapers-new-task-feature/
"Basic Feature Engineering with #DuckDB"
#RStats #duckdb gurus, is thread safety the main reason we don't have (as of today) R-level user-defined functions like DuckDB's snake binding has? Or is the reason that we don't really have scalar functions?
github.com/duckdb/duckd...
User Defined Functions with R ...
My talk at @useR_conf is here: https://defuneste.codeberg.page/useR_2025/
tldr: I think storing "big" data as Parquet files in S3, accessed with DuckDB and wrapped in an R package, is a nice way to save some of your sanity.
Now that we know that DuckDB is great, let's start showing how R can make it in production!
Side notes: loved using {litedown} and Codeberg for the prez. Mermaid.js, you are also great, but I am not ready!
@duckdb The future of this new package is unknown, but maybe I will implement a few more functions from {sf} and {areal} in {ducksf} in the coming months. It is also not unlikely that the devs of the #DuckDB Spatial extension (https://github.com/duckdb/duckdb-spatial) will just implement areal interpolation themselves, but then my job will only be easier: I will just wrap their function in {𝐝𝐮𝐜𝐤𝐬𝐟} instead of implementing it in SQL as I do right now.
Get a 9-30x speedup on areal-weighted interpolation with my new {𝐝𝐮𝐜𝐤𝐬𝐟} #rstats package compared to {sf}/{areal}. Experimental, but tested against both {areal} and {sf}. https://github.com/e-kotov/ducksf . Despite the cost of moving data between R and #DuckDB, the performance of {𝐝𝐮𝐜𝐤𝐬𝐟} is impressive, thanks to #DuckDB. Look at the attached benchmark results. And be sure to read @duckdb's recent post on the performance improvements to their spatial joins here: https://duckdb.org/2025/08/08/spatial-joins.html
I’ve always known that the #DuckDB appender interface was the way to go for bulk loading data. But today I had reason to write a #Golang benchmark to see just how much faster it is and discovered it’s at least 250x faster (on my laptop) at inserting a bigint into a table.
I tested both in-memory and on-disk as well as testing INSERT with auto-commit and with batched commits at various batch sizes.
https://gist.github.com/rkennedy-argus/9e9b2a9fe79d7b098ff40bfb4ffc0384
I suppose I should test INSERTs with prepared statements, too. But I doubt they’ll put much of a dent in that difference.
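Xorq: ML cataloging, composition, and deployment at SQL scale with Python simplicity
➤ Building next-generation ML pipelines with Python's simplicity and SQL's scalability
✤ https://github.com/xorq-labs/xorq
Xorq is a new machine learning framework that aims to simplify and standardize how ML pipelines are built, shared, and deployed. By combining Python's ease of use with SQL's scalability, it lets developers declaratively build reusable ML pipelines across multiple compute engines (such as DuckDB, Snowflake, and DataFusion). Xorq's core technology includes zero-copy data transfer with Apache Arrow and efficient computation through Ibis and DataFusion. Its features include: support for pandas-style syntax and Ibis's multi-engine declarative expressions; defining Python expressions in YAML format to ensure reproducibility; and portable UDFs and UDAFs, with support for automatic…
#MachineLearning #DataEngineering #Pipelines #Python #SQL #Ibis #DuckDB #Snowflake #DataFusion #Apache Arrow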
@christoffel66 exactly…the “highest” available in the list provided. So far the ORDER BY with list_position seems to be the clearest winner in terms of readability and not repeating itself.
This is using #DuckDB.
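For readers without the rest of the thread, the "pick the highest available value from a provided list" idea can be sketched in plain Python. This is only a guess at the problem being discussed: the function name, the sample values, and the SQL shown in the comment (including the table `t` and column `status`) are hypothetical. The one thing taken from DuckDB itself is that `list_position(list, element)` returns the element's 1-based position in the list.

```python
def pick_highest(available, priority):
    """Return the available value that appears earliest in `priority`.

    Mirrors the shape of a query such as (assumed, not from the thread):
        SELECT status FROM t
        ORDER BY list_position(['gold', 'silver', 'bronze'], status)
        LIMIT 1
    Values that are missing from `priority` are ignored in this sketch.
    """
    ranked = [value for value in available if value in priority]
    if not ranked:
        return None
    # The priority list is written once; its order is the ranking.
    return min(ranked, key=priority.index)


priority = ["gold", "silver", "bronze"]
print(pick_highest(["bronze", "silver"], priority))
```

The appeal is the same as in the post: the ranking lives in a single list, so there is no need for a CASE expression that repeats every value.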
Nerd post!
I just discovered that #DuckDB (follow at @duckdb), a really cool #opensource tool that lets you run SQL queries directly against CSV, JSON, and many other file types (among many other features), also lets you output query results directly to #LaTeX.
@statstas #duckdb can connect to external databases using odbc (https://duckdb.org/docs/stable/clients/odbc/windows.html) and can write parquet. That might work (I have zero experience with this).
I track @stratosphere's posts & their bot has a daily top 10 sketch IPs list. My kept
lots of "*.100" IPs & I was curious how frequently they showed up.
Went back 200 posts w/GH:McKael/madonctl using both R and DuckDB.
Def block these.
— #DuckDB: https://ray.so/SdMcBZa
— #RStats: https://ray.so/naTBBMS