alteryx2dbx
Weeks of manual migration work, automated into a single CLI command.
Python · PySpark · Alteryx · Databricks · Compiler Design
Alteryx Designer workflow → generated Databricks notebook:
from pyspark.sql.functions import avg, col, desc, sum, trim, upper, when
from pyspark.sql.window import Window

# Cmd 1 — Read all inputs
df_sales = spark.read.table("sales_transactions")
df_product = spark.read.table("product_master")
df_targets = spark.read.table("region_targets")

# Cmd 2 — Select + Cleanse + Margin
df_clean = (df_sales
    .select("order_id", "product_id", "region", "revenue", "cost")
    .withColumn("region", upper(trim(col("region"))))
    .withColumn("margin", (col("revenue") - col("cost")) / col("revenue") * 100)
)

# Cmd 3 — Filter valid vs exceptions
df_valid = df_clean.filter(col("margin") > 0)
df_exceptions = df_clean.filter(~(col("margin") > 0))

# Cmd 4 — Reference data: format + summarize + pivot
df_ref = (df_targets
    .withColumn("region", upper(col("region")))
    .groupBy("region")
    .pivot("metric")
    .agg(avg("target_value"))
)

# Cmd 5 — Join product data + running total
w = Window.partitionBy("region").orderBy("order_id")
df_enriched = (df_valid
    .join(df_product, "product_id")
    .withColumn("running_rev", sum("revenue").over(w))
)

# Cmd 6 — Deduplicate, sort, merge targets
df_merged = (df_enriched
    .dropDuplicates(["order_id"])
    .orderBy("region", desc("revenue"))
    .join(df_ref, "region")
)

# Cmd 7 — Final score calculation
df_final = (df_merged
    .withColumn("vs_target", col("margin") - col("margin_target"))
    .withColumn("score", when(col("vs_target") > 0, "pass").otherwise("flag"))
)

# Cmd 8 — Write validated + exceptions
df_final.write.mode("overwrite").saveAsTable("validated_sales")
df_exceptions.write.mode("overwrite").saveAsTable("exceptions_log")

34 tools · 80+ functions · 0 hallucinations
The problem
Migrating from Alteryx to Databricks means rewriting every workflow in PySpark by hand. It takes weeks. Subtle logic gets lost. LLM-assisted approaches make it worse: they invent PySpark functions that don’t exist, change how nulls behave, flip case sensitivity. You end up debugging the generated code anyway.
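One example of the subtle logic that gets lost: Alteryx's Substring counts from 0, while PySpark's substring counts from 1, so a literal translation is off by one. A deterministic compiler can encode the shift once. This is an illustrative sketch only; map_substring and the argument shape are made-up names, not the tool's actual code:

```python
def map_substring(args):
    """Map Alteryx Substring(s, start, length), where start is 0-based,
    to a PySpark substring(s, pos, len) expression, where pos is 1-based.

    Hypothetical sketch of one function mapping; not the tool's real code.
    """
    s, start, length = args
    # Shift the start index by one so the result matches Alteryx semantics.
    return f"substring({s}, {int(start) + 1}, {length})"
```

Baking rules like this into a fixed mapping table is what makes the translation repeatable: the same workflow always compiles to the same notebook.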
What it does
A CLI that reads Alteryx workflow files and outputs Databricks notebooks. Deterministic, no LLM. Point it at a workflow or a whole directory, get parameterized PySpark notebooks you can run directly. Every mapping is explicit. If something can’t be converted, it gets flagged, not silently dropped.
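The "flagged, not silently dropped" rule can be sketched as follows. This is a hypothetical illustration; SUPPORTED and emit_cell are invented names, not the tool's API:

```python
# Abbreviated for illustration; the real tool covers 34 tool types.
SUPPORTED = {"AlteryxSelect", "Filter", "Join", "Summarize"}

def emit_cell(tool_type, tool_id, code):
    """Return the notebook-cell source for one Alteryx tool.

    An unsupported tool becomes an explicit FIXME cell in the output
    instead of vanishing from the generated notebook."""
    if tool_type in SUPPORTED and code is not None:
        return code
    return (f"# FIXME: Alteryx tool '{tool_type}' (id={tool_id}) has no "
            "automatic PySpark mapping; manual conversion required")
```

The payoff is auditability: every tool in the source workflow is accounted for somewhere in the generated notebook.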
What makes it non-trivial
- Parses Alteryx XML and resolves execution order across 34 tool types through a DAG
- Custom expression grammar with operator precedence and 80+ function mappings between Alteryx and PySpark
- Fixes four semantic gaps that break every manual migration: case sensitivity, substring indexing, null handling, date formats
- 470 tests
- Batch mode handles entire directories with aggregate reporting
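Resolving execution order across the tool DAG is, at its core, a topological sort. A minimal sketch using Kahn's algorithm, assuming the tools and connection edges have already been parsed out of the workflow XML (the function name and data shapes here are illustrative, not the tool's internals):

```python
from collections import defaultdict, deque

def execution_order(tools, connections):
    """Order tools so every tool runs after all of its upstream inputs.

    tools: iterable of tool ids; connections: (src, dst) edges.
    Illustrative sketch of DAG resolution via Kahn's algorithm.
    """
    indegree = {t: 0 for t in tools}
    downstream = defaultdict(list)
    for src, dst in connections:
        downstream[src].append(dst)
        indegree[dst] += 1
    queue = deque(t for t in tools if indegree[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)
    if len(order) != len(indegree):
        raise ValueError("cycle detected in workflow graph")
    return order
```

Each tool in the resulting order becomes one notebook cell, which is why the generated notebooks read top-to-bottom like the original canvas.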