Add warning log - newly created Hyperspace context for different Spark Session #374
sezruby wants to merge 2 commits into microsoft:master
Conversation
      contexts.put(spark, (Thread.currentThread().getId, new HyperspaceContext(spark)))
    }
  } else if (ctx.get._1 != Thread.currentThread().getId) {
    throw HyperspaceException(s"Hyperspace does not support multiple threads " +
@imback82
I guess this is too restrictive. How about writing a warning log and creating a new context in this case?
It seems the checkAnswer API uses a different thread internally.
    }
  } else if (ctx.get._1 != Thread.currentThread().getId) {
    logWarning(s"Hyperspace is not thread safe for threads using one Spark session. " +
      s"Please be aware of it. Current thread id: ${Thread.currentThread().getId}, " +
What is the implication of potentially sharing the HyperspaceContext by threads if we already know that it's not thread safe? Isn't it better to fail faster here instead of failing somewhere else?
checkAnswer won't work if we throw an exception here. I'm not sure how to fix it; one possible scenario (a minimal sketch follows the list):
- setup & check in the main thread
- run with a worker thread
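For illustration, a minimal runnable sketch of that scenario (illustrative data and names, not the actual test code): the DataFrame is set up on the main thread, but a worker thread triggers execution, so plan optimization runs under a thread id that differs from the one recorded when the context was created.

import org.apache.spark.sql.SparkSession

// Sketch only: shows why a thread-id check can fire even in a "single user" flow.
object WorkerThreadScenario {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    // Setup & check on the main thread: a per-thread context would be
    // created here with the main thread's id.
    val df = Seq(1, 2, 3).toDF("c1").filter("c1 > 1")

    // Run with a worker thread: optimization and execution happen here,
    // under a different Thread.currentThread().getId.
    val worker = new Thread(new Runnable {
      override def run(): Unit = df.collect()
    })
    worker.start()
    worker.join()
    spark.stop()
  }
}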
are you still blocked on this?
I think throwing an exception is too restrictive. Users might not know how to fix the issue?
I thought it's the user's responsibility when using an API that is not thread safe.
A different thread id doesn't always mean the threads are running concurrently.
I think we need documentation rather than restricting the use case?
A different thread id doesn't always mean the threads are running concurrently.
But the new structure allows multiple threads to access the object concurrently? The existing implementation guarantees a one-to-one mapping via a thread local. So I would keep this condition as it is, and if there is demand for accessing the Hyperspace object from multiple threads, we should think about making it thread-safe instead of documenting it.
Plus, I think random failures are much worse and harder to debug, especially failures caused by thread-safety issues.
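For reference, a minimal sketch of the thread-local pattern described above (illustrative names, not the exact Hyperspace internals): each thread gets at most one context, and the context is rebuilt when that thread switches to a different session, so no two threads can ever share an instance.

import org.apache.spark.sql.SparkSession

// Illustrative only: ThreadLocal enforces a one-to-one thread/context mapping.
class HyperspaceContext(val spark: SparkSession)

object ThreadLocalContextHolder {
  private val context = new ThreadLocal[HyperspaceContext]

  def getContext(spark: SparkSession): HyperspaceContext = {
    val ctx = context.get()
    if (ctx == null || !ctx.spark.equals(spark)) {
      // Recreate the context when this thread's active session changes,
      // because the context depends on session-level configs.
      context.set(new HyperspaceContext(spark))
    }
    context.get()
  }
}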
I agree that analyzing failures from concurrent threads is difficult.
I added the exception change; note that this can be a breaking change for some use cases.
Added breaking change label so we can document it in the release notes.
btw, even if this is a breaking change, it alerts the user to a possible misuse where the Hyperspace context is being recreated, so I think it's worthwhile.
Resolved review threads (outdated unless noted otherwise):
- src/test/scala/com/microsoft/hyperspace/index/plananalysis/ExplainTest.scala (2 threads)
- src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala (2 threads)
- src/main/scala/com/microsoft/hyperspace/util/HyperspaceConf.scala
- src/test/scala/com/microsoft/hyperspace/index/DeltaLakeIntegrationTest.scala (3 threads, one not outdated)
Could you update the PR description as it seems out of date? Basically, we are preventing the Hyperspace context from being re-created in certain scenarios, and making sure too many contexts are not created.
assert(!basePlan.equals(dfWithHyperspaceEnabled.queryExecution.optimizedPlan))
checkAnswer(dfWithHyperspaceDisabled, dfWithHyperspaceEnabled)
val resultEnabled = dfWithHyperspaceEnabled.collect().toSeq.toSet
assert(resultEnabled.equals(resultDisabled))
is this change still needed? if so, why?
In checkAnswer, a new thread tries to build the query plan again, so it causes the exception.
I'm not sure why the other checkAnswer calls have no problem with it.
oh interesting. could you share the code where a new thread is spawned in checkAnswer?
In checkAnswer:
val sparkAnswer = try df.collect().toSeq catch {
  case e: Exception =>
    val errorMessage =
      s"""
         |Exception thrown while executing query:
         |${df.queryExecution}
         |== Exception ==
         |$e
         |${org.apache.spark.sql.catalyst.util.stackTraceToString(e)}
       """.stripMargin
    return Some(errorMessage)
}
It's because of the broadcast join:
[info] at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:146)
[info] at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:387)
[info] at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:144)
[info] at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
[info] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[info] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[info] at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
[info] at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:117)
[info] at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:211)
[info] at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:101)
failed thread stack:
..
[info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info] at com.microsoft.hyperspace.index.rules.FilterIndexRule$.apply(FilterIndexRule.scala:52)
[info] at com.microsoft.hyperspace.index.rules.FilterIndexRule$.apply(FilterIndexRule.scala:38)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:87)
[info] at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info] at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info] at scala.collection.immutable.List.foldLeft(List.scala:89)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:84)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:76)
[info] at scala.collection.immutable.List.foreach(List.scala:392)
[info] at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
[info] at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:66)
[info] at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:66)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
[info] at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
[info] at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
[info] at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3359)
[info] at org.apache.spark.sql.Dataset.collect(Dataset.scala:2782)
[info] at org.apache.spark.sql.delta.PartitionFiltering.filesForScan(PartitionFiltering.scala:40)
[info] at org.apache.spark.sql.delta.PartitionFiltering.filesForScan$(PartitionFiltering.scala:27)
[info] at org.apache.spark.sql.delta.Snapshot.filesForScan(Snapshot.scala:52)
[info] at org.apache.spark.sql.delta.files.TahoeLogFileIndex.matchingFiles(TahoeFileIndex.scala:140)
[info] at org.apache.spark.sql.delta.files.TahoeFileIndex.listFiles(TahoeFileIndex.scala:56)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions$lzycompute(DataSourceScanExec.scala:193)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.selectedPartitions(DataSourceScanExec.scala:190)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.updateDriverMetrics(DataSourceScanExec.scala:529)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:307)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:305)
[info] at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:327)
[info] at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
[info] at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
Seems we shouldn't throw the exception..? 😧😧😧😧😧😧😧
Hmm, then we have to think about this feature again, since it may hit a thread-safety issue even if the user didn't intend it?
For example, this could have gone unnoticed if we didn't throw the exception?
Seems so. How about ThreadLocal[HashMap[Session, Context]]? 😁
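For concreteness, a rough sketch of what that ThreadLocal[HashMap[Session, Context]] structure could look like (illustrative names only, not a proposed implementation): each thread keeps its own session-to-context map, so a context is created lazily per (thread, session) pair.

import scala.collection.mutable
import org.apache.spark.sql.SparkSession

// Illustrative only.
class HyperspaceContext(val spark: SparkSession)

object PerThreadContexts {
  private val contexts =
    new ThreadLocal[mutable.Map[SparkSession, HyperspaceContext]] {
      override def initialValue(): mutable.Map[SparkSession, HyperspaceContext] =
        mutable.Map.empty[SparkSession, HyperspaceContext]
    }

  def getContext(spark: SparkSession): HyperspaceContext =
    // Each thread sees only its own map; entries are added lazily on first use.
    contexts.get().getOrElseUpdate(spark, new HyperspaceContext(spark))
}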
But that would allow multiple threads to share the same Hyperspace context.
So basically, in the Delta code, it does allFiles.toDF() and eventually that dataframe is collect-ed, which triggers our optimizer. In our optimizer, we have an extractor that calls Hyperspace.getContext(spark).sourceProviderManager in ExtractRelation. So instead of case filter @ Filter(condition: Expression, ExtractRelation(relation)), we could do case filter @ Filter if isSupportedRelation, and I think we can avoid the issue.
But I think this is still hacky, and the right way to fix it seems to be making the context thread-safe.
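A hypothetical sketch of that suggested rewrite (my reading of it; isSupportedRelation and the object name are placeholders, not existing Hyperspace helpers): guard the match with a cheap predicate instead of an extractor that needs Hyperspace.getContext(spark) up front.

import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LeafNode, LogicalPlan}

object GuardedFilterRuleSketch {
  // Placeholder predicate; the real rule would consult the source provider
  // manager lazily rather than inside an eager extractor.
  def isSupportedRelation(plan: LogicalPlan): Boolean = plan.isInstanceOf[LeafNode]

  def applyRule(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    // Before: case filter @ Filter(condition: Expression, ExtractRelation(relation)) => ...
    // After (suggested): keep the match cheap and guard it with a predicate.
    case filter @ Filter(_: Expression, child) if isSupportedRelation(child) =>
      filter // the actual index-replacement logic would go here
  }
}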
val resultEnabled = dfWithHyperspaceEnabled.collect().toSeq.sortBy(_.hashCode())
assert(!basePlan.equals(updatedPlan))
assert(resultEnabled.equals(resultDisabled))
checkAnswer(dfWithHyperspaceDisabled, dfWithHyperspaceEnabled)
After executing collect(), checkAnswer doesn't throw the exception 😧
| s"Current limit: ${contexts.size}") | ||
| } | ||
| val newCtx = new HyperspaceContext(spark) | ||
| contexts.put(spark, (threadId, newCtx)) |
Should we check the return value of put and throw an exception if it returns Some? There could be multiple threads sharing the same Spark session that hit if (!ctx.isDefined) at the same time.
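For illustration, one race-free shape of that insert, sketched with a ConcurrentHashMap and illustrative names (not the actual Hyperspace structures): putIfAbsent makes the check-then-put atomic, and a non-null return value plays the role of the Some case mentioned above.

import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.SparkSession

// Illustrative only.
class HyperspaceContext(val spark: SparkSession)

object Contexts {
  private val contexts =
    new ConcurrentHashMap[SparkSession, (Long, HyperspaceContext)]()

  def getOrCreate(spark: SparkSession): HyperspaceContext = {
    val candidate = (Thread.currentThread().getId, new HyperspaceContext(spark))
    // putIfAbsent is atomic: a non-null result means another thread created
    // the entry between our isDefined-style check and the put.
    val existing = contexts.putIfAbsent(spark, candidate)
    if (existing != null) {
      // Mirrors the suggestion above; in Hyperspace this would presumably be a HyperspaceException.
      throw new IllegalStateException(
        s"Context for this session was concurrently created by thread ${existing._1}.")
    }
    candidate._2
  }
}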
// the one HyperspaceContext is using because Hyperspace depends on the
// session's properties such as configs, etc.
context.set(new HyperspaceContext(spark))
if (!ctx.spark.equals(spark)) {
@imback82 I'm thinking about this change and what we can do about it.
I think the problem is that we access the active Spark session when optimizing the query plan, but it can be different from the session that was used to create the Hyperspace object.
So how about:
- not resetting Hyperspace and just using the previous context even with a different active Spark session
- if a user wants to use other configs / another Spark session, they need to redefine the Hyperspace object
I think this is clearer behavior because we do val hs = new Hyperspace(spark).
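For illustration, the usage pattern that proposal implies (spark-shell style snippet; otherSession is just an example of a second session with different configs):

import org.apache.spark.sql.SparkSession
import com.microsoft.hyperspace.Hyperspace

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val hs = new Hyperspace(spark)            // context tied to the session passed in

// To work with other configs or another session, the user would create a
// separate Hyperspace object instead of relying on whichever session is active.
val otherSession = spark.newSession()
val otherHs = new Hyperspace(otherSession)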
- not resetting Hyperspace and just using the previous context even with a different active Spark session

Will it be ok to silently use a different active session?
What is the context for this pull request?
What changes were proposed in this pull request?
Currently, Hyperspace keeps only one HyperspaceContext for each Hyperspace object.
So if an app uses multiple concurrent Spark sessions with one Hyperspace object and one thread, a new Hyperspace context is continuously created for each request.
As the Hyperspace object is not thread-safe, we tried to enforce that "one client thread" should use only "one Spark session", but we can't, because the same SparkSession can be accessed from multiple threads (e.g. Delta Lake broadcast join query execution).
So for now, we leave a warning log in that case, so that users can notice it might be an ineffective use of Hyperspace.
Does this PR introduce any user-facing change?
Yes, a warning log can be emitted if a Hyperspace object is accessed with a different Spark session.
How was this patch tested?
Tested on a local environment.