
alteryx2dbx

Weeks of manual migration work, automated into a single CLI command.

Python · PySpark · Alteryx · Databricks · Compiler Design
Alteryx Designer workflow → generated Databricks notebook:

# Cmd 1 — Imports + read all inputs
from pyspark.sql.functions import avg, col, desc, sum, trim, upper, when  # note: sum shadows the builtin
from pyspark.sql.window import Window

df_sales   = spark.read.table("sales_transactions")
df_product = spark.read.table("product_master")
df_targets = spark.read.table("region_targets")

# Cmd 2 — Select + Cleanse + Margin
df_clean = (df_sales
  .select("order_id", "product_id", "region", "revenue", "cost")
  .withColumn("region", upper(trim(col("region"))))
  .withColumn("margin", (col("revenue") - col("cost")) / col("revenue") * 100)
)

# Cmd 3 — Filter valid vs exceptions
df_valid      = df_clean.filter(col("margin") > 0)
df_exceptions = df_clean.filter(col("margin").isNull() | ~(col("margin") > 0))  # null margins route to the False output, as in Alteryx

# Cmd 4 — Reference data: format + summarize + pivot
df_ref = (df_targets
  .withColumn("region", upper(col("region")))
  .groupBy("region")
  .pivot("metric")
  .agg(avg("target_value"))
)

# Cmd 5 — Join product data + running total
w = Window.partitionBy("region").orderBy("order_id")
df_enriched = (df_valid
  .join(df_product, "product_id")
  .withColumn("running_rev", sum("revenue").over(w))
)

# Cmd 6 — Deduplicate, sort, merge targets
df_merged = (df_enriched
  .dropDuplicates(["order_id"])
  .orderBy("region", desc("revenue"))
  .join(df_ref, "region")
)

# Cmd 7 — Final score calculation
df_final = (df_merged
  .withColumn("vs_target", col("margin") - col("margin_target"))
  .withColumn("score", when(col("vs_target") > 0, "pass").otherwise("flag"))
)

# Cmd 8 — Write validated + exceptions
df_final.write.mode("overwrite").saveAsTable("validated_sales")
df_exceptions.write.mode("overwrite").saveAsTable("exceptions_log")

34 tools  ·  80+ functions  ·  0 hallucinations

The problem

Migrating from Alteryx to Databricks means rewriting every workflow in PySpark by hand. It takes weeks, and subtle logic gets lost along the way. LLM-assisted approaches make it worse: they invent PySpark functions that don't exist, change how nulls behave, and flip case sensitivity. You end up debugging the generated code anyway.

What it does

A CLI that reads Alteryx workflow files and outputs Databricks notebooks. Deterministic, no LLM. Point it at a workflow or a whole directory, get parameterized PySpark notebooks you can run directly. Every mapping is explicit. If something can’t be converted, it gets flagged, not silently dropped.

What makes it non-trivial

  • Parses Alteryx XML and resolves execution order across 34 tool types through a DAG (see the topological-sort sketch below)
  • Custom expression grammar with operator precedence and 80+ function mappings between Alteryx and PySpark (see the parser sketch below)
  • Fixes four semantic gaps that break every manual migration: case sensitivity, substring indexing, null handling, and date formats (see the translation examples below)
  • 470 tests; batch mode handles entire directories with aggregate reporting
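
To make the DAG step concrete, here is a minimal sketch of resolving execution order from a workflow file, assuming the standard .yxmd XML layout (Node, Connection, Origin, Destination elements with ToolID attributes). The function and variable names are illustrative, not alteryx2dbx's actual API.

# Minimal sketch: topological sort of Alteryx tools via Kahn's algorithm.
# XML element/attribute names match the .yxmd layout; everything else
# (function name, error message) is illustrative, not the tool's API.
import xml.etree.ElementTree as ET
from collections import defaultdict, deque

def execution_order(path):
    root = ET.parse(path).getroot()
    tools = [n.get("ToolID") for n in root.iter("Node")]
    downstream = defaultdict(list)
    indegree = {t: 0 for t in tools}
    for conn in root.iter("Connection"):
        src = conn.find("Origin").get("ToolID")
        dst = conn.find("Destination").get("ToolID")
        downstream[src].append(dst)
        indegree[dst] += 1
    # Tools with no unresolved inputs can be emitted immediately.
    ready = deque(t for t in tools if indegree[t] == 0)
    order = []
    while ready:
        tool = ready.popleft()
        order.append(tool)
        for nxt in downstream[tool]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tools):
        raise ValueError("cycle detected: not a valid workflow DAG")
    return order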
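The expression translator is, in effect, a small compiler front end. Below is a compressed, hypothetical sketch of the precedence-climbing idea: a toy operator table, Alteryx's [field] syntax mapped to col(...), and left-associative infix handling. The real grammar also covers string literals, unary operators, and the 80+ function mappings.

# Illustrative precedence-climbing translator for Alteryx expressions.
# Operator table and names are a toy subset, not the full grammar.
import re

PRECEDENCE = {"OR": 1, "AND": 2, "=": 3, "!=": 3, "<": 3, ">": 3,
              "+": 4, "-": 4, "*": 5, "/": 5}
SPARK_OP = {"OR": "|", "AND": "&", "=": "=="}  # operators that change spelling

def tokenize(src):
    return re.findall(r"\[[^\]]+\]|\d+\.?\d*|!=|[=<>+\-*/()]|\w+", src)

def parse(tokens, min_prec=1):
    left = parse_atom(tokens)
    while tokens and tokens[0].upper() in PRECEDENCE \
            and PRECEDENCE[tokens[0].upper()] >= min_prec:
        op = tokens.pop(0).upper()
        # Recurse one level tighter to get left associativity.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = f"({left} {SPARK_OP.get(op, op)} {right})"
    return left

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        tokens.pop(0)  # consume closing ")"
        return inner
    if tok.startswith("["):  # [field] -> col("field")
        return f'col("{tok[1:-1]}")'
    return tok

# parse(tokenize("([revenue] - [cost]) / [revenue] * 100"))
#   -> '(((col("revenue") - col("cost")) / col("revenue")) * 100)'
# i.e. the margin expression from Cmd 2 above, modulo parentheses.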
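Two of the four semantic gaps, shown as illustrative translation helpers (hypothetical names, not the tool's API): Alteryx's Substring starts counting at 0 while Spark's substring starts at 1, and Alteryx date specifiers are C-style (%Y-%m-%d) while Spark expects Java-style patterns (yyyy-MM-dd).

# Illustrative fixes for two semantic gaps; helper names are hypothetical.

def translate_substring(field, start, length):
    # Alteryx Substring() is 0-based; Spark substring() is 1-based.
    return f'substring(col("{field}"), {start + 1}, {length})'

# C-style Alteryx date specifiers -> Java-style Spark pattern letters.
DATE_TOKEN_MAP = {"%Y": "yyyy", "%m": "MM", "%d": "dd",
                  "%H": "HH", "%M": "mm", "%S": "ss"}

def translate_date_format(alteryx_fmt):
    out = alteryx_fmt
    for src, dst in DATE_TOKEN_MAP.items():
        out = out.replace(src, dst)
    return out

# translate_substring("region", 0, 2)  -> 'substring(col("region"), 1, 2)'
# translate_date_format("%Y-%m-%d")    -> 'yyyy-MM-dd'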