[Gold Standard] Updated plans for all tpcds queries with spark-only setup #377
apoorvedave1 wants to merge 28 commits into microsoft:master
Conversation
src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
Note to reviewers: currently q49 doesn't work well with the build pipelines, so per offline suggestions I have removed it from this PR. I will add it back once the issue is resolved.
@apoorvedave1 Could you do this? The reason is I want to make sure no changes other than the expression reorder went in.
Ok sure, let me get back to you.
src/test/resources/tpcds/spark-2.4/approved-plans-v1_4/q1/explain.txt
Could you update this PR description as well? It's not easy to follow which portion of the parent proposal applies to this PR.
…_initial

# Conflicts:
#   src/test/scala/com/microsoft/hyperspace/goldstandard/PlanStabilitySuite.scala
#   src/test/scala/com/microsoft/hyperspace/goldstandard/TPCDSBase.scala
Are the test failures related to the cross join?
@imback82 Yeah, outside of the test function it was not picking up the config for enabling cross joins. I made slight changes to the code and it works now. Please take a look.
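For context, a minimal sketch of what enabling cross joins at the suite level could look like, assuming a shared SparkSession built once for all tests (the builder below is illustrative, not the PR's actual code; spark.sql.crossJoin.enabled is the real Spark 2.4 conf):

    import org.apache.spark.sql.SparkSession

    // Setting the conf on the session itself makes it apply to every query in
    // the suite, not just to code running inside a single test function.
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("PlanStabilitySuite")
      .config("spark.sql.crossJoin.enabled", "true") // some TPC-DS queries plan cross joins
      .getOrCreate()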
Union
  LocalTableScan [customer_id,year_total] [customer_id,year_total]
why union with LocalTableScan?
: : : : +- *(2) Project [d_date_sk#21, d_year#13]
: : : :    +- *(2) Filter ((isnotnull(d_year#13) && (d_year#13 = 2001)) && isnotnull(d_date_sk#21))
: : : :       +- *(2) FileScan parquet default.date_dim[d_date_sk#21,d_year#13] Batched: true, Format: Parquet, Location [not included in comparison]/{warehouse_dir}/date_dim], PartitionFilters: [], PushedFilters: [IsNotNull(d_year), EqualTo(d_year,2001), IsNotNull(d_date_sk)], ReadSchema: struct<d_date_sk:int,d_year:int>
: : : +- LocalTableScan <empty>, [customer_id#24, year_total#25]
    import spark.implicits._  // assumes a `spark` session in scope, e.g. spark-shell

    val df = Seq.empty[(String, String, String, String)].toDF("a", "b", "c", "d")
    println(df.queryExecution.toString())

creates this:

    == Physical Plan ==
    LocalTableScan <empty>, [a#12, b#13, c#14, d#15]
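Presumably the same thing happens in the plan above: when the optimizer can prove a branch of the query is empty, it collapses it to an empty LocalRelation, which the planner renders as LocalTableScan <empty>. A hedged sketch of how that arises (illustrative, not from the PR):

    import org.apache.spark.sql.functions.lit
    import spark.implicits._ // assumes a `spark` session in scope

    // A filter the optimizer can prove is always false is pruned away, leaving
    // an empty LocalRelation that plans to `LocalTableScan <empty>`.
    val df = Seq(("c1", 1.0)).toDF("customer_id", "year_total")
    val emptyBranch = df.filter(lit(false))
    println(emptyBranch.queryExecution.executedPlan)
    // LocalTableScan <empty>, [customer_id#.., year_total#..]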
@@ -1,279 +1,49 @@
== Physical Plan ==
TakeOrderedAndProject [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
WholeStageCodegen (10)
HashAggregate [cd_gender,cd_marital_status,cd_education_status,cd_purchase_estimate,cd_credit_rating,cd_dep_count,cd_dep_employed_count,cd_dep_college_count,count] [count(1),cnt1,cnt2,cnt3,cnt4,cnt5,cnt6,count]
TakeOrderedAndProject [cd_credit_rating,cd_dep_college_count,cd_dep_count,cd_dep_employed_count,cd_education_status,cd_gender,cd_marital_status,cd_purchase_estimate,cnt1,cnt2,cnt3,cnt4,cnt5,cnt6]
TakeOrderedAndProject [i_category,i_class,i_item_id,i_item_desc,revenueratio,i_current_price,itemrevenue]
WholeStageCodegen (6)
Project [i_item_desc,i_category,i_class,i_current_price,itemrevenue,_w0,_we0,i_item_id]
TakeOrderedAndProject [i_category,i_class,i_current_price,i_item_desc,i_item_id,itemrevenue,revenueratio]
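The reordered attribute lists above are the "expression reorder" discussed earlier: the simplified plans appear to be normalized by sorting the names inside each bracketed list so the golden files stay stable across nondeterministic attribute ordering. A hedged sketch of that kind of normalization (the helper name and regex are assumptions, not the suite's actual code):

    // Sort the comma-separated fields inside every non-nested `[...]` group.
    def sortBracketedFields(planLine: String): String = {
      val bracketed = """\[([^\[\]]*)\]""".r
      bracketed.replaceAllIn(planLine, m => {
        val sorted = m.group(1).split(",").map(_.trim).sorted.mkString(",")
        java.util.regex.Matcher.quoteReplacement("[" + sorted + "]")
      })
    }

    // sortBracketedFields("Project [i_item_desc,i_category,i_class]")
    //   => "Project [i_category,i_class,i_item_desc]"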
@@ -1,137 +1,24 @@
== Physical Plan ==
Project [ss_ext_sales_price,ss_ext_wholesale_cost,ss_quantity]
BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
Project [cd_education_status,cd_marital_status,ss_ext_sales_price,ss_ext_wholesale_cost,ss_hdemo_sk,ss_quantity,ss_sales_price]
BroadcastHashJoin [cd_demo_sk,ss_cdemo_sk]
Check this part? Are more columns/checks pushed down in Spark 3.x?
Maybe it's because the remaining columns are used in the higher-level broadcast join two lines above:

    BroadcastHashJoin [cd_education_status,cd_marital_status,hd_demo_sk,hd_dep_count,ss_hdemo_sk,ss_sales_price]
What is the context for this pull request?
What changes were proposed in this pull request?
This PR updates the plans for all TPC-DS queries (q2-q99). Please review the dependency PR #384 first; it contains the code for creating and validating the golden files (query plan files) for the gold standard.
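For reviewers new to the setup, a hedged sketch of the golden-file check these plans feed into (the real logic lives in PlanStabilitySuite from #384; the helper below and its names are assumptions):

    import java.nio.charset.StandardCharsets
    import java.nio.file.{Files, Paths}

    // Compare a query's simplified plan against its approved golden file, or
    // regenerate the file when plans are intentionally updated (as in this PR).
    def checkPlanStability(query: String, actualPlan: String, regenerate: Boolean): Unit = {
      val golden = Paths.get(
        s"src/test/resources/tpcds/spark-2.4/approved-plans-v1_4/$query/explain.txt")
      if (regenerate) {
        Files.write(golden, actualPlan.getBytes(StandardCharsets.UTF_8))
      } else {
        val approved = new String(Files.readAllBytes(golden), StandardCharsets.UTF_8)
        assert(approved.trim == actualPlan.trim,
          s"Plan for $query changed; regenerate the golden files if this is intended.")
      }
    }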
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests