Optimizing Your Data Analysis with Malloy

An experimental language for data.
tools
data engineering
malloy
Author

Abhishek

Published

May 17, 2023

Note

If you’re someone who doesn’t often write SQL, parts of this discussion may seem complex. However, if you’re involved in data management, building interfaces, or working with databases, this blog post will prove insightful.

The Shift in Data Management

Data is a crucial part of all organizations, and it’s only when data is understood and utilised correctly, the businesses can truly thrive. Running a successful business today depends on the data and insights derived from it that make you smarter. A common misconception that surrounds data is that data is rectangular. That is, in fact, not true. Humans tend to visualize data as rectangles than leads to this misconception. Most operations that we perform on data such as filtering, aggregating, projecting, and windowing are all based on rectangular models. Even the join operation takes in two rectangles and gives a rectangle as the joined output.

In a common ecommerce analysis you can expect two tables:

Orders

order_id order_date shipping_cost user_id
1 2022-01-01 2 1
2 2022-01-01 3 2
3 2022-01-02 1 1
4 2022-01-02 23 3

Order Items

item_id order_id item price
1 1 Chocolate 2
2 1 Twizzler 1
3 2 Chocolate 2
4 2 M and M 1
5 3 Twizzler 1
6 4 Fudge 3
7 4 Skittles 1

Using the above tables let’s calculate:

  • total_shipping
  • total_revenue

total_shipping

SELECT
 sum(shipping_cost) AS total_shipping
FROM orders
total_shipping
8

total_revenue

SELECT
 sum(price) AS total_revenue
FROM items
total_revenue
11

Looking at the same calculation across the date dimension, we get:

total_shipping by date

SELECT
 order_date,
 sum(shipping_cost) AS total_shipping
FROM ‘orders.csv’
GROUP BY 1
ORDER BY 1
order_date total_shipping
2022-01-01 5
2022-01-02 3

total_revenue by date

SELECT
 order_date,
 sum(price) AS total_revenue
FROM ‘orders.csv’ AS orders
JOIN ‘items.cvs’ AS items on
 orders.order_id = items.order_id
GROUP BY 1
ORDER BY 1
order_date total_revenue
2022-01-01 6
2022-01-02 5

How does revenue relate to shipping?

This is what we expect:

order_date total_revenue total_shipping
2022-01-01 6 5
2022-01-02 5 3

This is what we get by joining:

SELECT
 orders.order_date,
 sum(items.price) AS total_revenue,
 sum(orders.shipping_cost) AS total_shipping
FROM ‘orders.csv’ AS orders
JOIN ‘items.cvs’ AS items ON orders.order_id = items.order_id
GROUP BY 1
ORDER BY 1
order_date total_revenue total_shipping
2022-01-01 6 10
2022-01-02 5 5

The incorrect results happen because the order rows in the join are duplicated resulting in an overstated computation.

SELECT *
FROM ‘orders.csv’ orders
LEFT JOIN ‘items.csv’ AS items ON orders.order_id = items.order_id
order_id order_date shipping_cost user_id item_id order_id item price
1 2022-01-01 2 1 2 1 Twizzler 1
2 2022-01-01 3 2 4 2 M and M 1
3 2022-01-02 1 1 5 3 Twizzler 1
4 2022-01-02 2 3 7 4 Skittles 1
1 2022-01-02 2 1 1 1 Chocolate 2
2 2022-01-02 3 2 3 2 Chocolate 2
4 2022-01-02 2 3 6 4 Fudge 3

The Limitations of Traditional Data Warehousing and SQL

SQL Templates

When using SQL, you will find yourself copying and pasting a lot, building queries with many templates. The issue lies in the fact that we operate in the rectangular context while the graph is network-based.

From the date dimension the total_shipping and total_revenue look like this:

Now, suppose there is another dimension user that looks like:

user_id total_revenue total_shipping
1 4 3
2 3 3
3 4 2

From the user dimension the total_shipping and total_revenue looks like this:

Hence we end up using a lot of query templates which may lead to copy-paste mistakes.

Data Warehouse Schema Design

In trditional data warehousing, we often refer to the Star Schema. This involves creating too many fact tables during ETL processes, with the unit of reusability being a table with some dimensionality. But the truth is, reusability should be about the data itself, not its definition.

The star schema was designed during a time when the databases were slow and the data was relatively big. Another issue with this approach is the lack of real-time data processing. If you produce an intermittent artifact, it won’t be real time, as these artifacts are produced and then joined later.

Introducing Malloy - An experimental language for data

Malloy, a revolutionary language for data, promises to transform the way we handle and view data. It ensures that join relations don’t affect aggregate calculations, fundamentally changing the way we approach data.

In Malloy, data is first described in a network, and the network on joined rectangles forms a reusable object called a source. Aggregate calculations are applied in a query operation, which can reference any ‘locality’ in the join network to compute results correctly.

The above example in Malloy looks like:

query: table('duckdb:orders.csv') + {
 join_many: items is table('duckdb:order_items.csv')
 on order_id = items.order_id
}
-> {
 group_by: order_date
 aggregate:
 total_revenue is items.price.sum()
 total_shipping is shipping_cost.sum()
order_by: 1
}

which results in this output:

order_date total_revenue total_shipping
2022-01-01 6 5
2022-01-02 5 3

Here,

table('duckdb:orders.csv') + {
 join_many: items is table('duckdb:order_items.csv')
 on order_id = items.order_id
}

is the SOURCE. The aggregate calculation,

total_revenue is items.price.sum()

is local to item, and

total_shipping is shipping_cost.sum()

is local to orders.

The above malloy query aggregates along a date dimension, but we can aggregate along any dimension dynamically, simply by updating the dimension as below:

query: table('duckdb:orders.csv') + {
 join_many: items is table('duckdb:order_items.csv')
 on order_id = items.order_id
}
-> {
 group_by: user_id
 aggregate:
 total_revenue is items.price.sum()
 total_shipping is shipping_cost.sum()
order_by: 1
}
user_id total_revenue total_shipping
1 4 3
2 3 3
3 4 2

Dimensional Freedom

Malloy offers the power of ‘Dimensional Freedom’. This concept allows you to produce results from anywhere in the join network, giving you the flexibility to view your data from different dimensions. The system also deduces a primary key, streamlines it, and formulates the calculation for you.

Source

  • In contrast to traditional SQL, Malloy’s reusability lies in a source.
source: orders_items is table('duckdb:orders.csv') + {
 join_many: items is table('duckdb:order_items.csv')
 on order_id = items.order_id
 declare:
 total_revenue is items.price.sum()
 total_shipping is shipping_cost.sum()
}
  • A source can be named, like the above source is named order_items.
  • The sources describe the join relationships, like the above source describes the relationship between orders and order_items
  • Sources describe both aggregate and scalar calculations.
  • Source makes queries incredibly simple. For example, performing an aggregate calculation on the above source looks like this in malloy:
query: orders_items -> {
 group_by: order_date
 aggregate: total_revenue, total_shipping
 order_by: 1
}
order_date total_revenue total_shipping
2022-01-01 6 5
2022-01-02 5 3

It recognizes that data naturally comes nested, and the need to separate them is an artifact of the SQL engine, not a real-world necessity.

Malloy’s Features

Malloy can query logs straight out of the box, providing a user-friendly interface for data management. It recognizes three types of join:

  1. Join_one: for a single item.
  2. Join_many: for multiple items.
  3. Join_cross: for cross joins.

But that’s not all. Malloy also writes nested data and handles non-rectangular data effectively, both in reading and writing. In fact, all Malloy queries read the data only once, which greatly increases efficiency.

Furthermore, Malloy is compatible with a variety of systems, capable of writing queries against BigQuery, PostgreSQL, and DuckDB.

Malloy also offers a VSCode plugin which is a file of extension .malloy and notebook support which is a file of extension .malloynb. In addition, malloy offers in-built support for charting and visualization which you can learn more about here.

Conclusion

Malloy presents a fresh approach to data management, breaking free from the rectangular confines of traditional SQL and data warehousing. By focusing on data’s natural network structure, it enables efficient, real-time data operations that can revolutionize the way your business handles and utilizes data.

I reckon this is something remarkable, and I expect to see it being adopted purely because it simplifies the data viewing aspect that will enable domain experts to focus on business logic than on what tables to join. I’ll be keeping an eye on this project.

Let me know your thoughts!

You can find the link to the notebook here

References

Buy Me A Coffee