alteryx2dbx
Weeks of manual migration work, automated into a single CLI command.
Python · PySpark · Alteryx · Databricks · Compiler Design
Alteryx Designer workflow → generated Databricks notebook:
from pyspark.sql.functions import avg, col, desc, sum, trim, upper, when
from pyspark.sql.window import Window

# Cmd 1 — Read all inputs
df_sales = spark.read.table("sales_transactions")
df_product = spark.read.table("product_master")
df_targets = spark.read.table("region_targets")

# Cmd 2 — Select + Cleanse + Margin
df_clean = (df_sales
    .select("order_id", "product_id", "region", "revenue", "cost")
    .withColumn("region", upper(trim(col("region"))))
    .withColumn("margin", (col("revenue") - col("cost")) / col("revenue") * 100)
)

# Cmd 3 — Filter valid vs exceptions
df_valid = df_clean.filter(col("margin") > 0)
df_exceptions = df_clean.filter(~(col("margin") > 0))

# Cmd 4 — Reference data: format + summarize + pivot
df_ref = (df_targets
    .withColumn("region", upper(col("region")))
    .groupBy("region")
    .pivot("metric")
    .agg(avg("target_value"))
)

# Cmd 5 — Join product data + running total
w = Window.partitionBy("region").orderBy("order_id")
df_enriched = (df_valid
    .join(df_product, "product_id")
    .withColumn("running_rev", sum("revenue").over(w))
)

# Cmd 6 — Deduplicate, sort, merge targets
df_merged = (df_enriched
    .dropDuplicates(["order_id"])
    .orderBy("region", desc("revenue"))
    .join(df_ref, "region")
)

# Cmd 7 — Final score calculation
df_final = (df_merged
    .withColumn("vs_target", col("margin") - col("margin_target"))
    .withColumn("score", when(col("vs_target") > 0, "pass").otherwise("flag"))
)

# Cmd 8 — Write validated + exceptions
df_final.write.mode("overwrite").saveAsTable("validated_sales")
df_exceptions.write.mode("overwrite").saveAsTable("exceptions_log")

34 tools · 80+ functions · 0 hallucinations
The problem
Migrating from Alteryx to Databricks means rewriting every workflow in PySpark by hand. It takes weeks. Subtle logic gets lost. LLM-assisted approaches make it worse: they invent PySpark functions that don’t exist, change how nulls behave, flip case sensitivity. You end up debugging the generated code anyway.
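One example of the subtle logic that gets lost: Alteryx's Substring counts from 0, while PySpark's substring counts from 1, so a literal translation is off by one. A deterministic compiler can encode the shift once. This is an illustrative sketch only; map_substring and the argument shape are made-up names, not the tool's actual code:

```python
def map_substring(args):
    """Map Alteryx Substring(s, start, length), where start is 0-based,
    to a PySpark substring(s, pos, len) expression, where pos is 1-based.

    Hypothetical sketch of one function mapping; not the tool's real code.
    """
    s, start, length = args
    # Shift the start index by one so the result matches Alteryx semantics.
    return f"substring({s}, {int(start) + 1}, {length})"
```

Baking rules like this into a fixed mapping table is what makes the translation repeatable: the same workflow always compiles to the same notebook.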
What it does
A CLI that reads Alteryx workflow files and outputs Databricks notebooks. Deterministic, no LLM. Point it at a workflow or a whole directory, get parameterized PySpark notebooks you can run directly. Every mapping is explicit. If something can’t be converted, it gets flagged, not silently dropped.
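The "flagged, not silently dropped" rule can be sketched as follows. This is a hypothetical illustration; SUPPORTED and emit_cell are invented names, not the tool's API:

```python
# Abbreviated for illustration; the real tool covers 34 tool types.
SUPPORTED = {"AlteryxSelect", "Filter", "Join", "Summarize"}

def emit_cell(tool_type, tool_id, code):
    """Return the notebook-cell source for one Alteryx tool.

    An unsupported tool becomes an explicit FIXME cell in the output
    instead of vanishing from the generated notebook."""
    if tool_type in SUPPORTED and code is not None:
        return code
    return (f"# FIXME: Alteryx tool '{tool_type}' (id={tool_id}) has no "
            "automatic PySpark mapping; manual conversion required")
```

The payoff is auditability: every tool in the source workflow is accounted for somewhere in the generated notebook.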
What makes it non-trivial
- Parses Alteryx XML and resolves execution order across 34 tool types through a DAG
- Custom expression grammar with operator precedence and 80+ function mappings between Alteryx and PySpark
- Fixes four semantic gaps that break every manual migration: case sensitivity, substring indexing, null handling, date formats
- 470 tests
- Batch mode handles entire directories with aggregate reporting
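Resolving execution order across the tool DAG is, at its core, a topological sort. A minimal sketch using Kahn's algorithm, assuming the tools and connection edges have already been parsed out of the workflow XML (the function name and data shapes here are illustrative, not the tool's internals):

```python
from collections import defaultdict, deque

def execution_order(tools, connections):
    """Order tools so every tool runs after all of its upstream inputs.

    tools: iterable of tool ids; connections: (src, dst) edges.
    Illustrative sketch of DAG resolution via Kahn's algorithm.
    """
    indegree = {t: 0 for t in tools}
    downstream = defaultdict(list)
    for src, dst in connections:
        downstream[src].append(dst)
        indegree[dst] += 1
    queue = deque(t for t in tools if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    if len(order) != len(indegree):
        raise ValueError("cycle detected in workflow graph")
    return order
```

Each tool in the resulting order becomes one notebook cell, which is why the generated notebooks read top-to-bottom like the original canvas.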