Skip to content

Conversation

@davidrohr
Copy link
Collaborator

@wiechula : Most of the stuff should be there, it is already working standalone.
Tomorrow, I'll pull in Rubens PR #3242 as soon as it passes the CI, and your new binary, to see how for it runs.

@davidrohr davidrohr requested review from a team, shahor02 and wiechula as code owners March 27, 2020 21:36
Copy link
Collaborator

@wiechula wiechula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was quick!
Looks good.
Thanks a lot.

@davidrohr
Copy link
Collaborator Author

@ktf : No reply from all builders since 11 hours. Any idea what is going on?

@ktf
Copy link
Member

ktf commented Mar 28, 2020

WIP prs are not tested anymore. Remove the WIP and it will start testing. Rationale here:

https://aliceo2group.github.io/quickstart/devel.html#draft-wip-pull-requests

@ktf ktf changed the title [WIP] Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together Mar 28, 2020
@davidrohr
Copy link
Collaborator Author

davidrohr commented Mar 28, 2020 via email

@davidrohr davidrohr changed the title Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together [TEST] Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together Mar 28, 2020
@davidrohr
Copy link
Collaborator Author

For reference, this needs alisw/alidist#2157 to pass the GPU CI.

@davidrohr
Copy link
Collaborator Author

@wiechula : I have rebased onto @shahor02 's patch (using his new feature), and cherry picked your commit with the encoder binary into my PR, and then adjusted your binary to use the new version of the RunZSEncoder function passing in the RawEncoder.
It runs through, doesn't crash, and produces non-empty output.
Yet, I'm lacking a way to read it and feed it back into the reco workflow, so I cannot test it.

I think probably the best way to do this, would be to link the o2-raw-file-reader-workflow to the o2-tpc-reco-workflow with the '|' mechanics, and modify the o2-tpc-reco-workflow to take the input, create the GPUTrackingInOutZS, and pass it in the CATrackerSpec.cxx. (A dummy for it is already there, although also that was never tested due to lacking input possibilities.)

Finally, I am using o2::InteractionRecord ir = o2::raw::HBFUtils::Instance().getFirstIR(); to get the first InteractionRecord which defines the bunch crossing that relates to the very first time bin. This is used in convertDigitsToRawZS.cxx for the encoder and in CATrackerSpec.cxx to set up the decoder, and obviously this must remain in sync. For testing, this should be OK, but it should be cleaned up eventually.

Could you test it and let me know what you think. Are your working to implement the reading, or shall I have a look?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, the only identifier to be used for mapping during decoding is the RDH.feeId, the cruID and linkID may change. As far as I know, the TPC at the moment does not write its feeID, but eventually the CRU should do this (apparently using the mapping provided by the TPC).
@wiechula If such a mapping exists, perhaps it is better to use it here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no definition nor mapping for the feeId at the moment. We started to discuss it. The easiest for the moment would be to use the obsolete definitions (CruId << 16) | ((LinkId + 1) << (CruEndPoint == 1 ? 8 : 0))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, simply i*20+j, to be independent of the data-taking-time params.

@wiechula
Copy link
Collaborator

@wiechula : I have rebased onto @shahor02 's patch (using his new feature), and cherry picked your commit with the encoder binary into my PR, and then adjusted your binary to use the new version of the RunZSEncoder function passing in the RawEncoder.
It runs through, doesn't crash, and produces non-empty output.

Looks good to me. I'll need to work on the binary then for some beautification. I would also like to add the possibility to only write defined sectors.

Yet, I'm lacking a way to read it and feed it back into the reco workflow, so I cannot test it.

I think probably the best way to do this, would be to link the o2-raw-file-reader-workflow to the o2-tpc-reco-workflow with the '|' mechanics, and modify the o2-tpc-reco-workflow to take the input, create the GPUTrackingInOutZS, and pass it in the CATrackerSpec.cxx. (A dummy for it is already there, although also that was never tested due to lacking input possibilities.)

Yes, I agree. It might need an intermediate aggregator device, which also creates the index.

Could you test it and let me know what you think. Are your working to implement the reading, or shall I have a look?

I can have a look. But don't know how fast I'll be.

Will you merge, or shall I use this branch?

wiechula
wiechula previously approved these changes Mar 29, 2020
@davidrohr
Copy link
Collaborator Author

@wiechula : I think an intermedia device to create the meta structure doesn't make much sense, since the structure is not easily shippable with DPL. I would just scan over the inputs in the tpc-reco-workflow. I am not sure if we have to do it in the CATrackerSpec, but probably. At least otherwise I have no idea how we could pass it easily.

Btw: I removed the MC flags. The ZS encoding will not support MC labels. The way it will be handled is: One has to pass the ZS encoded data + digits + digits-mc-labels to the tpc-reco-workflow. It will run the reconstruction based on the the ZS data, but then fetch the mc labels from the digits, and match based on pad,row,time of the digits. I hope this will be sufficient for MC processing.

I cannot merge it right now, because it depends on alisw/alidist#2157, which fails the CI for unclear reasons and also does not provide an error log.

@davidrohr
Copy link
Collaborator Author

davidrohr commented Mar 29, 2020 via email

@shahor02
Copy link
Collaborator

@davidrohr for the span: yes, I've realized that you are using different c-tor, that's why deleted my comment.
Concerning the MC: is there really a reason to implement the MC for reco from the raw data? Routine MC productions will not be done via raw files, and just to validate the raw reco it is enough to require that its results are similar to the reco from digits.
I don't think any other detector will run raw reco + labels.

@davidrohr
Copy link
Collaborator Author

ah, ok, I just replied to the mail.

For the MC: It comes basically at 0 overhead in the way it is implemented. Raw reconstruction just runs independent from the MC. The MC label propagation in the clusterizer is kind of a postprocessing step. It uses only the digits, and it basically maps the digit pad/row/time position to the clusters, using the cluster charge map. In this step, it doesn't play any role whether the cluster was created from digits or zero suppressed raw data. If we have digits and MC and process the derived ZS raw data, we can still run the MC propagator, since the digit positions by definition match the adc samples from the ZS raw. Thus, if we have it for free, why not. It can actually be useful. In AliRoot, there were situations where I was annoyed this was not possible.

@shahor02
Copy link
Collaborator

ok, thanks, if it does not require extra work, surely it does not hurt.

@davidrohr
Copy link
Collaborator Author

@shahor02 @wiechula : There seems to be some kind of memory corruption in the current o2-tpc-digits-to-rawzs. It is not fully clear to me yet when it happens.
What I see is:

  • Sometimes it segfaults (inside addData, stacktrace is below).
  • Sometimes addData throws an exception claiming the requested feeid/cruid/linkid/... have not been registered.
  • Sometimes it runs through without error, but o2-raw-file-reader-workflow reports invalid RDHs
  • Sometimes it runs through, and o2-raw-file-reader-workflow doesn't report errors.
  • I have tried to run with valgrind, but then it will always run through correctly.

I went through the code of o2-tpc-digits-to-rawzs and RunZSEncoder, but couldn't spot any place that should lead to memory corruption.
I am now trying to take the output of one of the runs where it went through, trying to connect the o2-raw-file-reader-workflow to the tpc-reco-workflow, to understand if the data is fully correct in such a case.

Stacktrace:

#0  0x00007f67a62c49ee in std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node (__code=802955167, __k=@0x7f678445ca5c: 802955167, __n=155, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:1532
#1  std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_node (__c=802955167, __key=@0x7f678445ca5c: 802955167, __bkt=155, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:655
#2  std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false,#0  0x00007f67a62c49ee in std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_before_node (__code=802955167, __k=@0x7f678445ca5c: 802955167, __n=155, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:1532
#1  std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_find_node (__c=802955167, __key=@0x7f678445ca5c: 802955167, __bkt=155, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:655
#2  std::_Hashtable<unsigned int, std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> >, std::__detail::_Select1st, std::equal_to<unsigned int>, std::hash<unsigned int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::find (__k=@0x7f678445ca5c: 802955167, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:1435
#3  std::unordered_map<unsigned int, o2::raw::RawFileWriter::LinkData, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> > >::find (__x=@0x7f678445ca5c: 802955167, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/unordered_map.h:925
#4  o2::raw::RawFileWriter::isLinkRegistered (ss=802955167, this=0x7fff4ba515a0) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Raw/include/DetectorsRaw/RawFileWriter.h:148
#5  o2::raw::RawFileWriter::addData (this=this@entry=0x7fff4ba515a0, feeid=<optimized out>, cru=<optimized out>, lnk=lnk@entry=15 '\017', endpoint=<optimized out>, ir=..., data=..., preformatted=true) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Raw/src/RawFileWriter.cxx:111
#6  0x00007f67a639fe1c in o2::gpu::GPUReconstructionConvert::_ZN2o23gpu24GPUReconstructionConvert12RunZSEncoderEPKNS0_22GPUTrackingInOutDigitsEPSt10unique_ptrIA_ySt14default_deleteIS6_EEPjPNS_3raw13RawFileWriterEPKNS_17InteractionRecordERKNS0_8GPUParamEbb._omp_fn.0(void) () at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/DataFormats/common/include/CommonDataFormat/InteractionRecord.h:204
#7  0x00007f67a21dc9ae in gomp_thread_start (xdata=<optimized out>) at /var/tmp/portage/sys-devel/gcc-9.3.0/work/gcc-9.3.0/libgomp/team.c:123
#8  0x00007f67a5e21347 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f67a2f3ef5f in clone () from /lib64/libc.so.6 false, true> >::find (__k=@0x7f678445ca5c: 802955167, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/hashtable.h:1435
#3  std::unordered_map<unsigned int, o2::raw::RawFileWriter::LinkData, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, o2::raw::RawFileWriter::LinkData> > >::find (__x=@0x7f678445ca5c: 802955167, this=0x7fff4ba515b0) at /usr/lib/gcc/x86_64-pc-linux-gnu/9.3.0/include/g++-v9/bits/unordered_map.h:925
#4  o2::raw::RawFileWriter::isLinkRegistered (ss=802955167, this=0x7fff4ba515a0) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Raw/include/DetectorsRaw/RawFileWriter.h:148
#5  o2::raw::RawFileWriter::addData (this=this@entry=0x7fff4ba515a0, feeid=<optimized out>, cru=<optimized out>, lnk=lnk@entry=15 '\017', endpoint=<optimized out>, ir=..., data=..., preformatted=true) at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/Detectors/Raw/src/RawFileWriter.cxx:111
#6  0x00007f67a639fe1c in o2::gpu::GPUReconstructionConvert::_ZN2o23gpu24GPUReconstructionConvert12RunZSEncoderEPKNS0_22GPUTrackingInOutDigitsEPSt10unique_ptrIA_ySt14default_deleteIS6_EEPjPNS_3raw13RawFileWriterEPKNS_17InteractionRecordERKNS0_8GPUParamEbb._omp_fn.0(void) () at /home/qon/alice/sw/SOURCES/O2/v1.2.0/0/DataFormats/common/include/CommonDataFormat/InteractionRecord.h:204
#7  0x00007f67a21dc9ae in gomp_thread_start (xdata=<optimized out>) at /var/tmp/portage/sys-devel/gcc-9.3.0/work/gcc-9.3.0/libgomp/team.c:123
#8  0x00007f67a5e21347 in start_thread () from /lib64/libpthread.so.0
#9  0x00007f67a2f3ef5f in clone () from /lib64/libc.so.6

@shahor02
Copy link
Collaborator

@davidrohr what do you use an input? I can try to check.

@davidrohr
Copy link
Collaborator Author

@shahor02 : To try to reproduce:

  • merge the latest version of this PR (If you have a GPU build enabled, you need also Bump CMake to 3.17 alisw/alidist#2157).
  • I am running o2-sim -n 4 -g pythia8hi -e TGeant3, not sure if the event type is critical.
  • o2-sim-digitizer-workflow
  • o2-tpc-digits-to-rawzs -i tpcdigits.root -o tpcrawzs.raw

@shahor02
Copy link
Collaborator

@davidrohr the writer was not meant to be thread safe:) Trying to fix this

@davidrohr
Copy link
Collaborator Author

@shahor02 : ok, makes sense, sorry, I didn't even think about that. Worst case, we can enforce OpenMP num threads = 1 here.

@davidrohr
Copy link
Collaborator Author

Made some progress, thx to @shahor02 the encoding works correctly now, and we can pipe the data from the raw reader workflow to the tpc reco workflow.
Steps to run:

o2-sim ...
o2-sim-digitizer-workflow
OMP_NUM_THREADS=1 o2-tpc-digits-to-rawzs -i tpcdigits.root -o tpcrawzs.raw
o2-raw-file-reader-workflow --conf tpcraw.cfg | o2-tpc-reco-workflow --ca-clusterer --input-type=zsraw

with the attached tpcraw.cfg:
tpcraw.cfg.gz
(@shahor02 : Perhaps we could add an option to the RawFileWriter to create such a config file automatically?)

Where I am stuck now:
in CATrackerSpec.cxx I get all the raw data (lines 376 - 388 in current PR), and I need to fill the o2::gpu::GPUTrackingInOutZS tpcZS struct.
What it not clear to me yet: how do I get back the feeid? And I need the pointer to the page with 8kb pages and the number of 8 kb pages. How do I get this. Is it the *rdh or *raw pointers, I get from the DPLRawParset? For the payload size, I only see 8192 (or 0). So it seems to be only a single page. But I'd like to access the full superpage, and need to know how many 8 kb pages are inside.

Does anyone know how I can get this data easily?
Otherwise, I'll continue tomorrow.

@davidrohr davidrohr requested a review from noferini as a code owner March 30, 2020 08:43
@davidrohr davidrohr force-pushed the dev_pull_request branch 4 times, most recently from 7afa6d9 to 23958cd Compare March 31, 2020 07:25
@davidrohr
Copy link
Collaborator Author

New version just pushed, it includes caching of the zs pages in the workflow, and runs the tracking when the full tf is received. Caching seems to work, and tracking runs without crashing, but output is not correct.
Now need to figure out why.

@davidrohr davidrohr changed the title [TEST] Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together Work on ZS encoding / decoding + unrelated commits from my dev branch that give conflicts if not merged together Mar 31, 2020
@davidrohr
Copy link
Collaborator Author

@wiechula @shahor02 The TPC Raw processing is now fully working with this PR.

  • In order to merge this PR, we still need to merge Bump CMake to 3.17 alisw/alidist#2157 first, which is currently stuck by some unrelated AliRoot CI issue.
  • There should be some clean up done later on as described below, but I think for now this can be merged.
  • For now, we need the --disable-mc parameter, because we cannot feed both the digits and the zsraw data to the workflow, since it supports only one input. Perhaps we could just enable --disable-mc by default if the input is zsraw and if there are no digits with mc labels.
  • The output is minimally different to the output from the digits, but I think this comes from the zero-suppression itself and from the rounding. To verify that, one would need to modify the digits, remove everything below 2 ADC counts, and round all digits to integers.
  • In order to run it:
o2-sim ...
o2-sim-digitizer....
mkdir tpcrawzs.raw
o2-tpc-digits-to-rawzs -i tpcdigits.root -o tpcrawzs.raw
o2-raw-file-reader-workflow --conf tpcrawzs.raw/tpcraw.cfg | o2-tpc-reco-workflow --ca-clusterer --input-type=zsraw --disable-mc

The following things should be cleaned up in a follow-up PR. Perhaps someone else can take care here:

  • The computation of the fee ID, and the recovery of sector, region, endpoint from feeID should be unified in a helper class.
  • The last commit should be reverted, and replaced with a proper completion strategy.
  • We should foresee a way to pass digits, digit mc labels, and the zsraw data to the workflow.

@davidrohr
Copy link
Collaborator Author

GPU CI passed after alisw/alidist#2157 was merged, merging this.

@davidrohr davidrohr merged commit a8e1c23 into AliceO2Group:dev Apr 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

6 participants