
Conversation

@bdlucas1 (Collaborator) commented Dec 9, 2025

We only have a couple of plot tests that actually look at the output; those were constructed and run laboriously, and none of them test the new vectorized plot code. This makes refactoring the code risky (so far I've been relying mostly on end-to-end testing that compares the generated visuals in my test front-end).

Here's a different approach: the idea is to write the result expression to a file in outline tree form, and then compare the actual result with an expected reference result using diff. This provides easily constructed tests with a fairly readable indication of what has changed. See comment in test_plot__detailed.py
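A minimal sketch of that flow (not the PR's actual code; the helper name, directories, and reference-file layout here are assumptions for illustration):

```python
# Sketch only: render the Plot result as an outline tree elsewhere
# (e.g. via print_expression_tree), write that text to a scratch file,
# and diff it against a checked-in reference file.
import subprocess
from pathlib import Path

def assert_matches_reference(outline_text: str, name: str) -> None:
    ref = Path("test/data/plot") / f"{name}.txt"   # reference directory is assumed
    out = Path("/tmp") / f"{name}.txt"
    out.write_text(outline_text)
    # diff exits non-zero and prints a readable report when the trees differ
    subprocess.run(["diff", "-u", str(ref), str(out)], check=True)
```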

In addition, this change

  • Makes it possible to change at runtime whether vectorized or classic plot code is used
  • Augments print_expression_tree to print abbreviated numpy arrays (using numpy's default str implementation) so that numeric data can be checked efficiently for gross changes.
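For illustration (not part of the change itself): numpy's default str conversion already elides long arrays and prints limited precision, which is what keeps the outline files compact, e.g.:

```python
import numpy as np

a = np.linspace(0.0, 1.0, 2000)
print(str(a))
# prints something like (middle elided, limited precision):
# [0.00000000e+00 5.00250125e-04 1.00050025e-03 ... 9.98999500e-01
#  9.99499750e-01 1.00000000e+00]
```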

@bdlucas1 (Collaborator, Author) commented Dec 9, 2025

So there was a test failure because of a difference in the last digit of a double-precision float. Couple of possible fixes:

  • Have the print_expression_tree function special-case Machine`Real and round it to single precision before printing
  • Avoid trig functions where machines might differ and just use simple functions like x+y or x*y that are unlikely to show any difference on different architectures.
  • Probably both of the above.

I'll look into this tomorrow.

@bdlucas1 (Collaborator, Author) commented Dec 9, 2025

Pyodide failed because (of course) it can't run diff. Couple options:

  • don't run the tests under Pyodide
  • There's a Python diff package. Don't know what capability it has. It would create a dependency on another package at test time. I can look into it if that would be desirable.
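As an aside (not something this PR does): Python's standard-library difflib can produce the same kind of unified diff without an external diff binary or an extra dependency, which would also sidestep the Pyodide and Windows questions. A minimal sketch:

```python
import difflib
from pathlib import Path

def unified_diff(ref_file: str, out_file: str) -> str:
    # Return a unified diff between two text files ('' when they match).
    ref = Path(ref_file).read_text().splitlines(keepends=True)
    new = Path(out_file).read_text().splitlines(keepends=True)
    return "".join(difflib.unified_diff(ref, new, fromfile=ref_file, tofile=out_file))

# usage (file names hypothetical):
# diff_text = unified_diff("ref/plot_sin.txt", "/tmp/plot_sin.txt")
# assert not diff_text, diff_text
```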

I'm slightly surprised that the Windows test succeeded - wasn't sure about /tmp and diff. Is it running under Cygwin or something?

@mmatera (Contributor) commented Dec 9, 2025

So there was a test failure because of a difference in the last digit of a double-precision float. Couple of possible fixes:

What about applying eval_N with 4 digits of accuracy to the GraphicsBox expression? The result should match on any architecture.


@mmatera (Contributor) commented Dec 9, 2025

Pyodide failed because (of course) it can't run diff. Couple options:

  • don't run the tests under Pyodide
  • There's a Python diff package. Don't know what capability it has. It would create a dependency on another package at test time. I can look into it if that would be desirable.

I'm slightly surprised that the Windows test succeeded - wasn't sure about /tmp and diff. Is it running under Cygwin or something?

I guess these tests should be in their own test workflow, and it probably makes sense to check just on Ubuntu. If other architectures fail, it would be an issue with a lower-level library, wouldn't it?

@mmatera (Contributor) commented Dec 9, 2025


Something else that would be useful is a way to regenerate the reference files automatically when needed. Box expressions follow some internal rules and can change from release to release.

@bdlucas1 (Collaborator, Author) commented:

So I've made the following improvements:

  • When printing the result, approximate the numbers and conceal the difference in the printed result between numpy Integer32/64, Real32/64, etc. This eliminates all differences in printed numeric output between our test environments.
  • Error out early and print if there are any messages, which made it possible to diagnose the following failure on Pyodide:
  • Don't try to run ContourPlot tests if scikit-image isn't installed. (Do we want to install it there for our tests?)

What about applying eval_N with 4 digits of accuracy

Yes, that's essentially what I'm doing, except in native floating point as the numbers are being printed. (Specifically, str(round(x * 1e6) / 1e6) to limit absolute precision to 6 digits).
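A minimal sketch of that kind of normalization (not the PR's exact code; the numpy dtype handling shown here is an assumption):

```python
# Format numbers so that architecture-dependent last-digit differences and
# numpy dtype widths (int32 vs int64, float32 vs float64) don't show up in
# the printed outline tree.
import numpy as np

def format_number(x) -> str:
    if isinstance(x, (np.integer, int)):
        return str(int(x))                        # int32/int64 print identically
    if isinstance(x, (np.floating, float)):
        return str(round(float(x) * 1e6) / 1e6)   # limit absolute precision to 1e-6
    return str(x)

print(format_number(np.float32(0.1)))  # '0.1' rather than '0.10000000149011612'
```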

I guess these tests should be in their own test workflow, and it probably makes sense to check just on Ubuntu. If other architectures fail, it would be an issue with a lower-level library, wouldn't it?

Yes, but with the changes above the small inconsequential differences between the architectures are concealed and the tests pass, so it should be possible to just run in all environments in the current test workflows, if you want. (Note that just running our tests on Ubuntu in itself doesn't guarantee our tests will see the same results someone gets when they generate the reference files on their Mac...)

a way to regenerate the reference files automatically

Indeed. See make_ref_files near the end of test_plot_detail.py. All that's needed is to run the tests, pointing their output at the ref file directory instead of /tmp.
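For illustration only (the environment variable, directories, and test module path here are assumptions, not the PR's actual code), regeneration amounts to rerunning the same tests with their output directed at the reference directory:

```python
# Sketch: rerun the plot tests with output pointed at the reference directory
# so that the current results become the new reference files.
import os
import subprocess

def make_ref_files(ref_dir: str = "test/data/plot_refs") -> None:
    env = dict(os.environ, PLOT_TEST_OUTPUT_DIR=ref_dir)   # hypothetical env var
    subprocess.run(["pytest", "test/test_plot_detail.py"], env=env, check=True)
```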

You mentioned boxing - to be clear, these unit tests verify the immediate Graphics* output of Plot*, which is not boxed. I'm not sure how testing boxing should be done. These specific tests could be enlarged to further generate GraphicsBox* output from the Graphics* output of Plot*, but are the unit tests for Plot* the right home for unit tests for Graphics* -> GraphicsBox*?

@mmatera (Contributor) commented Dec 11, 2025

So I've made the following improvements:

  • When printing the result, approximate the numbers and conceal the difference in the printed result between numpy Integer32/64, Real32/64, etc. This eliminates all differences in printed numeric output between our test environments.

Excellent

  • Error out early and print if there are any messages, which made it possible to diagnose the following failure on Pyodide:

  • Don't try to run ContourPlot tests if scikit-image isn't installed. (Do we want to install it there for our tests?)

What about applying eval_N with 4 digits of accuracy

Yes, that's essentially what I'm doing, except in native floating point as the numbers are being printed. (Specifically, str(round(x * 1e6) / 1e6) to limit absolute precision to 6 digits).

OK

I guess these tests should be in their own test workflow, and it probably makes sense to check just on Ubuntu. If other architectures fail, it would be an issue with a lower-level library, wouldn't it?

Yes, but with the changes above the small inconsequential differences between the architectures are concealed and the tests pass, so it should be possible to just run in all environments in the current test workflows, if you want. (Note that just running our tests on Ubuntu in itself doesn't guarantee our tests will see the same results someone gets when they generate the reference files on their Mac...)

OK, but wouldn't those be issues outside the graphics module? If MS Windows produces a plot differing in some decimal places relative to Ubuntu, that doesn't say much about how we build graphics, but rather about the numpy implementation, SymPy, or some other low-level library...

a way to regenerate the reference files automatically

Indeed. See make_ref_files near the end of test_plot_detail.py. All that's needed is to run the tests, pointing their output at the ref file directory instead of /tmp.

Great

You mentioned boxing - to be clear, these unit tests verify the immediate Graphics* output of Plot*, which is not boxed. I'm not sure how testing boxing should be done. These specific tests could be enlarged to further generate GraphicsBox* output from the Graphics* output of Plot*, but are the unit tests for Plot* the right home for unit tests for Graphics* -> GraphicsBox*?

OK, yes, my bad. I said boxing because I have been dealing with it for a while, but here we are working with an earlier stage, the Graphics(3D) sub-language. What we are testing here is the translation / evaluation of Plot expressions into Graphics(3D) expressions.

Review thread on a new reference file (diff hunk @@ -0,0 +1,198 @@, beginning with System`Graphics):

Contributor:

If brackets and commas were kept, these files could be evaluated and would be proper WL. In that case, a .m extension would be more appropriate.

@bdlucas1 (Collaborator, Author) replied:

It's close, but some things aren't WL - for example, the numpy arrays underlying NumericArray are presented using numpy's built-in string conversion, which limits the number of elements printed (and also shows limited precision). The primary intent is human consumption for debugging, but if it proves too limiting that it's also not readily machine-readable, maybe we can revisit it and find a way to satisfy both requirements.

@mmatera (Contributor) left a review comment:

Apart from the comments, LGTM. Merge when you consider it ready.

@bdlucas1 (Collaborator, Author) commented:

I guess these tests should be in their own test workflow, and it probably makes sense to check just on Ubuntu. If other architectures fail, it would be an issue with a lower-level library, wouldn't it?

Yes, but with the changes above the small inconsequential differences between the architectures are concealed and the tests pass, so it should be possible to just run in all environments in the current test workflows, if you want. (Note that just running our tests on Ubuntu in itself doesn't guarantee our tests will see the same results someone gets when they generate the reference files on their Mac...)

OK, but wouldn't those be issues outside the graphics module? If MS Windows produces a plot differing in some decimal places relative to Ubuntu, that doesn't say much about how we build graphics, but rather about the numpy implementation, SymPy, or some other low-level library...

I just meant that if your suggestion to run only on Ubuntu was a way to avoid architecture-specific behavior in numpy (and maybe that's not what you meant), it would only work if the reference files were generated on the same architecture (and version, etc.). In any case, I think we're on the same page: with the tests now hiding inconsequential differences, they should be robust with respect to exact architecture-specific library details.

@bdlucas1 bdlucas1 merged commit a7acec0 into Mathics3:master Dec 11, 2025
17 checks passed