Memory corruption was detected in ERIS #27

alex-aparin · 2018-02-05T10:16:19Z

Hi. I learned internals of eris little bit during debugging undump of lua script. As I understood, there is memory corruption which occurs during undumping script (script creates not many objects (like closures etc). This is happened due inproper (I guess) working with marks of gco objects. Here is brief example of corruption:

Undumping closure (which leads to creation of new proto)
Creation of proto
Filling proto
Running of garbage collector which removes newly created proto
Working with dead pointer to proto (in debug leads to access violation, in release it depends on actual state of memory)

I suppose that some work with garbage collector barriers is needed because following steps fix problem:

Stop garbage collector (before undumping/unpresisting)
Undump/Unpersist
Restart garbage collector (after undumping/unpresisting)

cbeck88 · 2018-02-06T05:34:56Z

Hi, do you think you can show the code that demonstrates the problem? That would give others a better chance to reproduce the problem and try to fix it.

alex-aparin · 2018-02-06T16:11:16Z

@cbeck88 , Unfortunately I cannot show original lua script which I debugged. I suppose it can be reproduced if lua script will contain many nested prototypes, or any other objects. Main purpose of such script is just to contain many objects to cause garbage collection. I can show workaround code, but it has no meaning without main lua script. In free time I will try to create simplified version of lua script which will break undumping.

ptwohig · 2018-06-11T18:48:13Z

To clarify, is this with a C Closure or a Lua function?

cbeck88 · 2018-06-12T02:21:30Z

This is not a useful bug report if there isn't any kind of code example given -- it isn't reproduceable.

@LmTinyToon If you can't give a MVCE (https://stackoverflow.com/help/mcve) then if it were my repository I would close the issue

If OP has really disappeared then I would be inclined to assume that he figured it out it was a bug in his code and moved on.

ptwohig · 2018-06-12T04:05:34Z

@cbeck88 Well, agreed that the bug report is lacking. But, I was drawn to this because I seem to be getting a similar issue in my code and have spent a few days trying to reproduce it.

ptwohig · 2018-06-12T05:55:50Z

I'm having what is possibly the same issue. Unfortunately, it would be hard for me to post my whole test case (and I've tried to reproduce it in a simplified test case). However, running my production code in Valgrind seems to corroborate @LmTinyToon 's story.

==15944== Invalid read of size 1
==15944== at 0x211ABC3B: traverseproto (lgc.c:489)
==15944== by 0x211AC20D: propagatemark (lgc.c:594)
==15944== by 0x211AD3E6: singlestep (lgc.c:1056)
==15944== by 0x211AD6A4: luaC_step (lgc.c:1136)
==15944== by 0x211A16D0: lua_pushfstring (lapi.c:526)
==15944== by 0x211D03CE: searchpath (loadlib.c:449)
==15944== by 0x211D05CA: findfile (loadlib.c:480)
==15944== by 0x211D06A2: searcher_Lua (loadlib.c:498)
==15944== by 0x211A8D4B: luaD_precall (ldo.c:434)
==15944== by 0x211A90D9: luaD_call (ldo.c:498)
==15944== by 0x211A9147: luaD_callnoyield (ldo.c:509)
==15944== by 0x211A2BA4: lua_callk (lapi.c:925)
==15944== Address 0x25e963d9 is 9 bytes inside a block of size 120 free'd
==15944== at 0x4C2ED3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x211C24DF: l_alloc (lauxlib.c:1011)
==15944== by 0x211AF940: luaM_realloc_ (lmem.c:86)
==15944== by 0x211AAB44: luaF_freeproto (lfunc.c:132)
==15944== by 0x211AC5F9: freeobj (lgc.c:698)
==15944== by 0x211AC788: sweeplist (lgc.c:743)
==15944== by 0x211AD2E8: sweepstep (lgc.c:1032)
==15944== by 0x211AD477: singlestep (lgc.c:1070)
==15944== by 0x211AD6A4: luaC_step (lgc.c:1136)
==15944== by 0x211A14DC: lua_pushlstring (lapi.c:485)
==15944== by 0x211D24FC: u_string (eris.c:900)
==15944== by 0x211D62B4: unpersist (eris.c:2301)
==15944== Block was alloc'd at
==15944== at 0x4C2DA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x4C2FD84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x211C24F9: l_alloc (lauxlib.c:1015)
==15944== by 0x211AF940: luaM_realloc_ (lmem.c:86)
==15944== by 0x211AAE48: luaC_newobj (lgc.c:210)
==15944== by 0x211AA93F: luaF_newproto (lfunc.c:100)
==15944== by 0x211D39C6: u_proto (eris.c:1338)
==15944== by 0x211D62FA: unpersist (eris.c:2316)

ptwohig · 2018-06-12T05:57:39Z

Note that the error appears in multiple places, but the issue it is always the the same "block was free'd at". So I suspect this is indeed a bug.

Further, when I run my code under heavy load it almost always crashes with a heap corruption message. Sometimes it's a double free, sometimes it's a corrupted double-linked list. Basically it's always heap corruption and it has only started happening since I introduced Eris on this project.

ptwohig · 2018-06-12T07:12:51Z

Even more, all seems to be allocated at the same place and ultimately runs up against a double-free error.

https://gist.github.com/ptwohig/5ddc97166a61af608b19cdfc346c9761

ptwohig · 2018-06-12T17:14:48Z

Actually, I just figured out the reproduction steps which work 100% of the time:

Edit src/Makefile
Add -DHARDMEMTESTS to the compiler flags (probably want to add -g and -O0 so you can get stack traces as well)
Run unit tests.
And Boom! goes the dynamite. You get a segmentation fault each time

Same place in u_closure. I think @LmTinyToon is correct. I'm not sure that disabling garbage collection during will be a suitable workaround because this basically makes for a double-free regardless.

cbeck88 · 2018-06-12T17:29:35Z

Awesome, thank you 👍

I am interested to try to help fix this if that is welcome but I can't make time to do anything here until the weekend

ptwohig · 2018-06-12T18:43:44Z

Here's where I think the offending lines are:

    LClosure *cl;
    Proto *p;

    eris_checkstack(info->L, 4);

    /* Create closure and anchor it on the stack (avoid collection via GC). */
    cl = eris_newLclosure(info->L, nups);
    eris_setclLvalue(info->L, info->L->top, cl);                   /* ... lcl */
    api_incr_top(info->L);

    /* Preregister closure for handling of cycles (upvalues). */
    registerobject(info);

    /* Read prototype. In general, we create protos (and upvalues) before
     * trying to read them and pass a pointer to the instance along to the
     * unpersist function. This way the instance is safely hooked up to an
     * object, so we don't have to worry about it getting GCed. */
    pushpath(info, ".proto");
    cl->p = eris_newproto(info->L);
    /* Push the proto into which to unpersist as a parameter to u_proto. */
    lua_pushlightuserdata(info->L, cl->p);                /* ... lcl nproto */
    unpersist(info);                          /* ... lcl nproto nproto/oproto */

Consider this scenario:

eris_newproto is just a macro for luaF_newproto
luaF_newproto actually uses luaC_newobj to allocate the memory
luaC_newobj creates the object as a white object, which means its eligible for garbage collection immediately
unpersist will call eventually u_proto a few frames higher in the stack
It appears that the usage of luaM_realloc_ does just that, it just allocates a block of memory. It's not aware of the garbage collector. So it could be freeing a block of memory now that will also be free'd again later.

I suspect that's why the author's of Lua included that HARDMEMTESTS to force a failure as soon as possible in testing.

alex-aparin · 2018-06-13T09:00:13Z

I am sorry for late. Unfortunately I cannot give original code which represents bug. I tried to reproduce but it does not have effect.

alex-aparin · 2018-06-13T09:02:43Z

@ptwohig, Yes, it seems like you have the same problem (I remember similar stack after corruption). It was "floating" bug. Sometimes it worked sometimes not. Temporal disabling of garbage collector during work of eris fixed problem (anyway it helped me and bug disappeared)

ptwohig · 2018-06-13T16:16:01Z

@LmTinyToon I tried the disabling of GC and it still crashed on me. Granted I had a few hundred threads working a few hundred separate lua_States to get it to happen.

Here is something that is definitely wrong, though:

eris.c:1267

  p->sizecode = READ_VALUE(int);
  eris_reallocvector(info->L, p->code, 0, p->sizecode, Instruction);

And observe what happens beginning on lgc.c:479

static int traverseproto (global_State *g, Proto *f) {
  int i;
  if (f->cache && iswhite(f->cache))
    f->cache = NULL;  /* allow cache to be collected */
  markobjectN(g, f->source);
  for (i = 0; i < f->sizek; i++)  /* mark literals */
    markvalue(g, &f->k[i]);
  for (i = 0; i < f->sizeupvalues; i++)  /* mark upvalue names */
    markobjectN(g, f->upvalues[i].name);
  for (i = 0; i < f->sizep; i++)  /* mark nested protos */
    markobjectN(g, f->p[i]);
  for (i = 0; i < f->sizelocvars; i++)  /* mark local-variable names */
    markobjectN(g, f->locvars[i].varname);
  return sizeof(Proto) + sizeof(Instruction) * f->sizecode +
                         sizeof(Proto *) * f->sizep +
                         sizeof(TValue) * f->sizek +
                         sizeof(int) * f->sizelineinfo +
                         sizeof(LocVar) * f->sizelocvars +
                         sizeof(Upvaldesc) * f->sizeupvalues;
}

The garbage collector attempts to iterate each array of objects and mark them individually, however if a value exists (for example) for p->sizek and the pointer is 0, a segfault will happen. More specifically if it is possible that a garbage collection cycle hits between the assignment of one of the sizes and the assignment/allocation of the array holding the data we end up with a hard crash.

Changing everywhere in eris.c to code similar to this seems to have fixed that problem and all tests pass now.

  {
  	int psizek = READ_VALUE(int);
        eris_reallocvector(info->L, p->k, 0, psizek, TValue);
	p->sizek = psizek;
  }

I'm still not convinced that there isn't a need for barriers to prevent memory corruption but this is definitely problematic as well. Granted that the particular above segment can't trigger a garbage collection cycle, similar parts of the code can potentially do so as well.

I also noticed that between Lua 5.2 and Lua 5.3 they removed a proto barrier function, so maybe this was the cause of it. It's very hard to say right now until I do more digging.

jyelon · 2021-12-10T20:02:58Z

I had the same problem. Turns out, this is just a missing luaC_objbarrier.

In u_proto, look for this line of code:

p->p[i] = eris_newproto(info->L);

In this case, if proto p is marked "black", we need a luaC_objbarrier. I preceded this line of code with this:

assert(!isblack(obj2gco(p)));

And sure enough, I got a couple of assertion fails. So definitely, proto p can be black.

I was able to fix it by changing the code as follows:

p->p[i] = cp = eris_newproto(info->L);
luaC_objbarrier(info->L, p, cp);

If you want to see the bug yourself, this is very hard to do. You have to get the garbage collector to mark proto p black at just the wrong moment. It's extremely unlikely, unless you're saving and loading a lot of nested protos. I have not been able to write a small test case that triggers the bug. Despite that, I can assure you that the bug does happen, and that the fix above works.

Note that the "reallocvector" bug documented above, by ptwohig, is a completely separate bug. His repair works for that bug.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory corruption was detected in ERIS #27

Memory corruption was detected in ERIS #27

alex-aparin commented Feb 5, 2018

cbeck88 commented Feb 6, 2018

alex-aparin commented Feb 6, 2018 •

edited

Loading

ptwohig commented Jun 11, 2018

cbeck88 commented Jun 12, 2018 •

edited

Loading

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

cbeck88 commented Jun 12, 2018

ptwohig commented Jun 12, 2018

alex-aparin commented Jun 13, 2018

alex-aparin commented Jun 13, 2018 •

edited

Loading

ptwohig commented Jun 13, 2018 •

edited

Loading

jyelon commented Dec 10, 2021

Memory corruption was detected in ERIS #27

Memory corruption was detected in ERIS #27

Comments

alex-aparin commented Feb 5, 2018

cbeck88 commented Feb 6, 2018

alex-aparin commented Feb 6, 2018 • edited Loading

ptwohig commented Jun 11, 2018

cbeck88 commented Jun 12, 2018 • edited Loading

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

ptwohig commented Jun 12, 2018

cbeck88 commented Jun 12, 2018

ptwohig commented Jun 12, 2018

alex-aparin commented Jun 13, 2018

alex-aparin commented Jun 13, 2018 • edited Loading

ptwohig commented Jun 13, 2018 • edited Loading

jyelon commented Dec 10, 2021

alex-aparin commented Feb 6, 2018 •

edited

Loading

cbeck88 commented Jun 12, 2018 •

edited

Loading

alex-aparin commented Jun 13, 2018 •

edited

Loading

ptwohig commented Jun 13, 2018 •

edited

Loading