
Tiling processor segfaults when multithreading is enabled #1929

Open
edelmanjm opened this issue Nov 18, 2024 · 9 comments · Fixed by gdsfactory/kfactory#521
Comments

@edelmanjm

Hi all, I'm currently using klayout via gdsfactory, and I've run into an issue where klayout segfaults when calling fill.fill_tiled() with n_threads > 1. Based on the stack trace and the apparent threading behavior, I believe this is an issue in klayout itself, but I haven't had time to fully root-cause everything. Here's a sample backtrace, with the full log attached:

Thread 0::  Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib        	    0x7ff804063c3a __psynch_cvwait + 10
1   libsystem_pthread.dylib       	    0x7ff8040a16f3 _pthread_cond_wait + 1211
2   lib_tl.cpython-312-darwin.so  	       0x106038dc2 tl::WaitConditionPrivate::wait(tl::Mutex*, unsigned long) + 162
3   lib_tl.cpython-312-darwin.so  	       0x106038170 tl::JobBase::wait(long) + 64
4   lib_db.cpython-312-darwin.so  	       0x10a92f835 db::TilingProcessor::execute(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 4709
5   lib_db.cpython-312-darwin.so  	       0x10b1cb016 gsi::MethodVoid1<db::TilingProcessor, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>::call(void*, gsi::SerialArgs&, gsi::SerialArgs&) const + 118
6   lib_pya.cpython-312-darwin.so 	       0x106c6ff38 pya::method_adaptor(int, _object*, _object*, _object*) + 536
...

Thread 2:
0   lib_db.cpython-312-darwin.so  	       0x109c71468 tl::copy_on_write_ptr<db::Shapes, tl::copy_duplicator<db::Shapes>>::get_non_const() + 8
1   lib_db.cpython-312-darwin.so  	       0x109e1abc2 db::FlatRegion::compute_bbox() const + 18
2   lib_db.cpython-312-darwin.so  	       0x109c9ef0e db::AsIfFlatRegion::bbox() const + 46
3   lib_db.cpython-312-darwin.so  	       0x109ca8e93 db::AsIfFlatRegion::and_with(db::Region const&, db::PropertyConstraint) const + 1699
4   lib_db.cpython-312-darwin.so  	       0x10afe93bb db::Region::operator&(db::Region const&) const + 27
5   lib_db.cpython-312-darwin.so  	       0x10b028631 gsi::ConstMethod1<db::Region, db::Region, db::Region const&, gsi::arg_default_return_value_preference>::call(void*, gsi::SerialArgs&, gsi::SerialArgs&) const + 129
6   lib_gsi.cpython-312-darwin.so 	       0x106973dc6 gsi::VariantUserClassImpl::execute_gsi(tl::ExpressionParserContext const&, tl::Variant&, tl::Variant&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<tl::Variant, std::__1::allocator<tl::Variant>> const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, tl::Variant, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, tl::Variant>>> const*) const + 2758
7   lib_gsi.cpython-312-darwin.so 	       0x106972e21 gsi::VariantUserClassImpl::execute(tl::ExpressionParserContext const&, tl::Variant&, tl::Variant&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<tl::Variant, std::__1::allocator<tl::Variant>> const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, tl::Variant, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, tl::Variant>>> const*) const + 369
8   lib_tl.cpython-312-darwin.so  	       0x106001989 tl::AmpersandExpressionNode::execute(tl::EvalTarget&) const + 441
9   lib_tl.cpython-312-darwin.so  	       0x105ffba79 tl::AssignExpressionNode::execute(tl::EvalTarget&) const + 89
10  lib_tl.cpython-312-darwin.so  	       0x105ffbc4c tl::SequenceExpressionNode::execute(tl::EvalTarget&) const + 44
11  lib_tl.cpython-312-darwin.so  	       0x105ff265a tl::Expression::execute() const + 74
12  lib_db.cpython-312-darwin.so  	       0x10a92c146 db::TilingProcessorWorker::do_perform(db::TilingProcessorTask const*) + 3222
13  lib_tl.cpython-312-darwin.so  	       0x1060385db tl::Worker::run() + 91
14  lib_tl.cpython-312-darwin.so  	       0x106038fa6 tl::Thread::do_run() + 22
15  lib_tl.cpython-312-darwin.so  	       0x106038f89 tl::start_thread(void*) + 9
16  libsystem_pthread.dylib       	    0x7ff8040a118b _pthread_start + 99
17  libsystem_pthread.dylib       	    0x7ff80409cae3 thread_start + 15

Thread 3 Crashed:
0   libsystem_kernel.dylib        	    0x7ff804067db6 __pthread_kill + 10
1   libsystem_pthread.dylib       	    0x7ff8040a0ebd pthread_kill + 262
2   libsystem_c.dylib             	    0x7ff803fc6a79 abort + 126
3   libsystem_malloc.dylib        	    0x7ff803ecb3a9 malloc_vreport + 857
4   libsystem_malloc.dylib        	    0x7ff803ece97a malloc_report + 151
5   lib_db.cpython-312-darwin.so  	       0x109cb7dab std::__1::pair<db::polygon_contour<int>*, db::polygon_contour<int>*> std::__1::__unwrap_and_dispatch[abi:v160006]<std::__1::__overload<std::__1::__copy_loop<std::__1::_ClassicAlgPolicy>, std::__1::__copy_trivial>, db::polygon_contour<int>*, db::polygon_contour<int>*, db::polygon_contour<int>*, 0>(db::polygon_contour<int>*, db::polygon_contour<int>*, db::polygon_contour<int>*) + 91
6   lib_db.cpython-312-darwin.so  	       0x109cb7c63 void std::__1::vector<db::polygon_contour<int>, std::__1::allocator<db::polygon_contour<int>>>::assign<db::polygon_contour<int>*, 0>(db::polygon_contour<int>*, db::polygon_contour<int>*) + 211
7   lib_db.cpython-312-darwin.so  	       0x10a8bf66f void db::unstable_box_tree<db::box<int, int>, db::polygon<int>, db::box_convert<db::polygon<int>, true>, 100ul, 100ul, 4u>::tree_sort<db::box_tree_picker<db::box<int, int>, db::polygon<int>, db::box_convert<db::polygon<int>, true>, tl::vector<db::polygon<int>>>>(db::box_tree_node<db::unstable_box_tree<db::box<int, int>, db::polygon<int>, db::box_convert<db::polygon<int>, true>, 100ul, 100ul, 4u>>*, std::__1::__wrap_iter<db::polygon<int>*>, std::__1::__wrap_iter<db::polygon<int>*>, db::box_tree_picker<db::box<int, int>, db::polygon<int>, db::box_convert<db::polygon<int>, true>, tl::vector<db::polygon<int>>>&, db::box<int, int> const&, int) + 463
8   lib_db.cpython-312-darwin.so  	       0x10a8bf491 db::unstable_box_tree<db::box<int, int>, db::polygon<int>, db::box_convert<db::polygon<int>, true>, 100ul, 100ul, 4u>::sort(db::box_convert<db::polygon<int>, true> const&, db::simple_bbox_tag const&) + 257
9   lib_db.cpython-312-darwin.so  	       0x10a838de3 db::layer_class<db::polygon<int>, db::unstable_layer_tag>::sort() + 35
10  lib_db.cpython-312-darwin.so  	       0x10a742199 db::Shapes::update() + 41
11  lib_db.cpython-312-darwin.so  	       0x10a5901ac db::RecursiveShapeIterator::validate(db::RecursiveShapeReceiver*) const + 1628
12  lib_db.cpython-312-darwin.so  	       0x10a591390 db::RecursiveShapeIterator::at_end() const + 16
13  lib_db.cpython-312-darwin.so  	       0x109ca882f db::AsIfFlatRegion::and_with(db::Region const&, db::PropertyConstraint) const + 63
14  lib_db.cpython-312-darwin.so  	       0x10afe93bb db::Region::operator&(db::Region const&) const + 27
15  lib_db.cpython-312-darwin.so  	       0x10b028631 gsi::ConstMethod1<db::Region, db::Region, db::Region const&, gsi::arg_default_return_value_preference>::call(void*, gsi::SerialArgs&, gsi::SerialArgs&) const + 129
16  lib_gsi.cpython-312-darwin.so 	       0x106973dc6 gsi::VariantUserClassImpl::execute_gsi(tl::ExpressionParserContext const&, tl::Variant&, tl::Variant&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<tl::Variant, std::__1::allocator<tl::Variant>> const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, tl::Variant, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, tl::Variant>>> const*) const + 2758
17  lib_gsi.cpython-312-darwin.so 	       0x106972e21 gsi::VariantUserClassImpl::execute(tl::ExpressionParserContext const&, tl::Variant&, tl::Variant&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<tl::Variant, std::__1::allocator<tl::Variant>> const&, std::__1::map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, tl::Variant, std::__1::less<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, tl::Variant>>> const*) const + 369
18  lib_tl.cpython-312-darwin.so  	       0x106001989 tl::AmpersandExpressionNode::execute(tl::EvalTarget&) const + 441
19  lib_tl.cpython-312-darwin.so  	       0x105ffba79 tl::AssignExpressionNode::execute(tl::EvalTarget&) const + 89
20  lib_tl.cpython-312-darwin.so  	       0x105ffbc4c tl::SequenceExpressionNode::execute(tl::EvalTarget&) const + 44
21  lib_tl.cpython-312-darwin.so  	       0x105ff265a tl::Expression::execute() const + 74
22  lib_db.cpython-312-darwin.so  	       0x10a92c146 db::TilingProcessorWorker::do_perform(db::TilingProcessorTask const*) + 3222
23  lib_tl.cpython-312-darwin.so  	       0x1060385db tl::Worker::run() + 91
24  lib_tl.cpython-312-darwin.so  	       0x106038fa6 tl::Thread::do_run() + 22
25  lib_tl.cpython-312-darwin.so  	       0x106038f89 tl::start_thread(void*) + 9
26  libsystem_pthread.dylib       	    0x7ff8040a118b _pthread_start + 99
27  libsystem_pthread.dylib       	    0x7ff80409cae3 thread_start + 15


Thread 3 crashed with X86 Thread State (64-bit):
  rax: 0x0000000000000000  rbx: 0x0000000000000006  rcx: 0x00007000076b65a8  rdx: 0x0000000000000000
  rdi: 0x0000000000000f03  rsi: 0x0000000000000006  rbp: 0x00007000076b65d0  rsp: 0x00007000076b65a8
   r8: 0x000000000000002e   r9: 0x0000000000000000  r10: 0x0000000000000000  r11: 0x0000000000000246
  r12: 0x0000000000000f03  r13: 0x0000000000000050  r14: 0x00007000076b8000  r15: 0x0000000000000016
  rip: 0x00007ff804067db6  rfl: 0x0000000000000246  cr2: 0x0000000000000000
  
Logical CPU:     0
Error Code:      0x02000148 
Trap Number:     133
...

I'm currently able to reproduce this issue consistently, but I haven't had time to write up a minimal example or debug it with ThreadSanitizer. If that would be useful, I'd be happy to delve into it a little later; for now I'm mostly posting here for visibility. Thanks!

crash.txt

@klayoutmatthias
Collaborator

klayoutmatthias commented Nov 19, 2024

I think there is a general problem here.

The tiling processor works by executing small code pieces on each tile. These code pieces are written in KLayout's "expressions" notation, mainly because that provides a certain level of multithreading safety. Hence, multiple tiles can be worked on in parallel.

The results of the tile actions are finally queued, reordered and sent to a single-threaded output stage. This scheme is safe as long as the tiled operations do not modify the layout the tiling processor works on: the Layout object is thread-safe for reading, but not for writing.

That holds for all reading operations - specifically those which pull a tile's content into a Region object and act on it before sending the results to the single-threaded output channel. This is the main use case of the tiling processor.

I assume the "fill_tiled" implementation deviates from this scheme by performing some "fill_region" call, which modifies the layout. That, I assume, is what causes the crash.

The solution is not straightforward. The easiest option is to put a lock around "fill_region", but that lock would also be needed for the read operations, as they are not thread-safe while another thread is writing. In the end, that boils down to single-threaded operation.
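A toy illustration of that last point in plain Python (not KLayout): if every worker must hold one shared lock for the whole operation, N threads take roughly N times the single-thread time, i.e. the work is effectively serialized:

```python
# Simulate N tile workers that each hold one global lock while "working".
# time.sleep stands in for the fill + boolean work per tile.
import threading
import time

lock = threading.Lock()
N = 4
WORK = 0.05  # seconds of simulated work per tile

def worker():
    with lock:            # everything, reads included, guarded by one lock
        time.sleep(WORK)  # simulated per-tile work

threads = [threading.Thread(target=worker) for _ in range(N)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"{N} threads, one lock: {elapsed:.2f}s (~{N * WORK:.2f}s, serialized)")
```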

A solution conforming to the concept would be to establish an output channel that performs the fill insertion. But again, that would degrade the performance to single-threaded mode.

I think a somewhat weird, yet feasible solution is to set up a small Layout per tile, deliver the fill cells into it, and send these layouts to an output channel that finally assembles the pieces into the output layout. That would at least use multiple threads for the bulk of the fill operation, which is computing the fill arrays. The assembly step remains single-threaded though.

I will give that a try.

Matthias

@klayoutmatthias
Collaborator

Here is an experiment. The following code is a multi-core capable implementation of a filling tiling processor:


import time

# For issue #1929

ly = pya.CellView.active().layout()
cell = pya.CellView.active().cell

li2fill = pya.LayerInfo(69, 20)

l2fill = ly.layer(li2fill)
# to stash the bounding box for the fill cell
l0 = ly.layer(0, 0)

# create a sample fill cell 1x1µm dimension, 400x400nm fill pattern
fc = ly.cell("FILL")
if fc is None:
  fc = ly.create_cell("FILL") 
  fc.shapes(l2fill).insert(pya.DBox(-0.2, -0.2, 0.2, 0.2))
  fc.shapes(l0).insert(pya.DBox(-0.5, -0.5, 0.5, 0.5))

# create a small template layout for each tile  
ly_temp = pya.Layout()
ly_temp.dbu = ly.dbu
tc_temp = ly_temp.create_cell("TOP")
fc_temp = ly_temp.create_cell(fc.name)
fc_temp.copy_shapes(fc)

# an output channel that collects and finally delivers the fill cell arrays
class FillingOutputReceiver(pya.TileOutputReceiver):

  def __init__(self):
    self.results = []
    
  def put(self, ix, iy, tile, what, dbu, clip):
    # NOTE: needs to "dup" as "what" is owned by the expressions script
    self.results.append(what.dup())
    
  def deliver(self, output_cell, output_fill_cell, tc_temp):
    """
    To be called externally after all the processing is done.
    This method will take all the individual results and deliver them 
    to the output layout in a single thread
    """
    for res in self.results:
      # use a local name here: re-assigning "tc_temp" would shadow the
      # template cell from the second iteration on
      tc_tile = res.cell(tc_temp.cell_index())
      for inst in tc_tile.each_inst():
        arr = inst.cell_inst
        arr.cell_index = output_fill_cell.cell_index()
        output_cell.insert(arr)

# instantiate the fill receiver
result_collector = FillingOutputReceiver()

# build the fill tiling processor
tp = pya.TilingProcessor()
tp.input("to_exclude", ly, cell.cell_index(), li2fill)
tp.tile_size(100.0, 100.0)
tp.var("ly_temp", ly_temp)
tp.var("tc_temp_index", tc_temp.cell_index())
tp.var("fc_temp_index", fc_temp.cell_index())
tp.var("distance_to_exclude", 0.3)
tp.output("results", result_collector)
tp.queue("""
  # NOTE: "dup" should maintain the cell indexes
  # "dup" is the ONLY thing we should do here on "ly_temp"
  # as we are going to modify the layout and we need a local copy.
  var ly_tile = ly_temp.dup;
  var to_fill = _tile - to_exclude.sized(distance_to_exclude / ly_tile.dbu);
  ly_tile.cell(tc_temp_index).fill_region(
    to_fill,
    fc_temp_index,
    ly_tile.cell(fc_temp_index).bbox
  );
  _output(results, ly_tile)
""")

# execute the processor
start = time.time()
tp.threads = 4
tp.execute("Fill")
end = time.time()
print(f"Fill execution done in {end - start} seconds with {tp.threads} core(s).")

# and build the results
start = time.time()
result_collector.deliver(cell, fc, tc_temp)
end = time.time()
print(f"Results built in {end - start} seconds.")

For input I used the Sky130 Caravel chip. The runtimes in my case are:

Fill execution done in 11.692350387573242 seconds with 1 core(s).
Results built in 0.4245586395263672 seconds.

and

Fill execution done in 3.1786248683929443 seconds with 4 core(s).
Results built in 0.43107128143310547 seconds.

So it basically seems to scale nicely. Assembly times are small compared to the computation times for the fill + boolean.

Matthias

@sebastian-goeldi
Contributor

Hi Matthias,

The code in question is directly calling the kfactory code here: https://github.com/gdsfactory/kfactory/blob/main/src/kfactory/utils/fill.py

I used and adapted an old forum post for this, where you gave an example of how to use it.

So, is your recommendation to defer the output receiver's filling of the layout until all sub-regions have finished calculating the regions needed for filling? I guess I could modify it to do so. Though I have to say, I have used this plenty of times without issue on Linux (with CI runners from 1-2 cores up to my machine with 16 cores / 32 threads) and it never segfaulted like this. And I have done weird stuff like using it for very low-density fills (we had requirements for certain cells being in every region, but also not too densely).

Best,
Sebastian

@klayoutmatthias
Collaborator

Yes, I am sorry. I probably missed that detail before.

Threading issues are a nightmare. They may never happen, and then suddenly someone manages to trigger them all the time. Actually, the Layout object is "mostly" thread-safe, but if one thread pours in fill cell instances while another one reads the same hierarchy at the same time, bad things will happen. Better to play it safe.

Matthias

@sebastian-goeldi
Contributor

I see, that makes sense, thanks!

Best,
Sebastian

@edelmanjm
Author

Thanks for all the investigating and info! Should I close this out and open a corresponding issue with kfactory? Or do we want to leave this open pending further safety around the tiling processor in klayout?

@sebastian-goeldi
Contributor

sebastian-goeldi commented Nov 21, 2024

I don't think there is a lot of room for improvement on klayout's side.

The problem boils down to the following, which even in plain Python leads to unexpected results - elements get skipped or visited twice (dicts even raise a RuntimeError):

_list = [j for j in range(20)]
for i, el in enumerate(_list):
    if i % 2 == 0:
        del _list[i]
    print(i, el)

Only here it's the other way around as it's inserting instead of deleting elements.

It's just a bad idea to iterate through mappings/linked lists while modifying them. I was wondering whether that wouldn't cause any problems when I read that forum post, but assumed that .start_changes()/.end_changes() on the layout object probably did some magic to keep the reads consistent.
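The mapping case is runnable in plain Python: CPython detects the size change during iteration and raises immediately, instead of silently corrupting the traversal:

```python
# Mutating a dict while iterating over it: CPython raises RuntimeError
# ("dictionary changed size during iteration") on the next step.
d = {i: i for i in range(5)}
err = None
try:
    for k in d:
        d[k + 100] = 0  # inserting while iterating
except RuntimeError as exc:
    err = exc
print(err)
```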

@sebastian-goeldi
Contributor

@klayoutmatthias is it safe to use region fill in a while loop, feeding the remaining to-be-filled region it returns back into the next iteration, while in start_changes()/end_changes() mode? I would assume so, but want to confirm ;)

sebastian-goeldi added a commit to gdsfactory/kfactory that referenced this issue Nov 21, 2024
@sebastian-goeldi
Contributor

@edelmanjm could you try the new version of kfactory and see whether it fixes your segfault?

I implemented Matthias' suggestion with the temp layout (which hopefully speeds it up as well, if only slightly).
