Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption was detected in ERIS #27

Open
alex-aparin opened this issue Feb 5, 2018 · 15 comments
Open

Memory corruption was detected in ERIS #27

alex-aparin opened this issue Feb 5, 2018 · 15 comments

Comments

@alex-aparin
Copy link

Hi. I learned internals of eris little bit during debugging undump of lua script. As I understood, there is memory corruption which occurs during undumping script (script creates not many objects (like closures etc). This is happened due inproper (I guess) working with marks of gco objects. Here is brief example of corruption:

  1. Undumping closure (which leads to creation of new proto)
  2. Creation of proto
  3. Filling proto
  4. Running of garbage collector which removes newly created proto
  5. Working with dead pointer to proto (in debug leads to access violation, in release it depends on actual state of memory)

I suppose that some work with garbage collector barriers is needed because following steps fix problem:

  1. Stop garbage collector (before undumping/unpresisting)
  2. Undump/Unpersist
  3. Restart garbage collector (after undumping/unpresisting)
@cbeck88
Copy link
Contributor

cbeck88 commented Feb 6, 2018

Hi, do you think you can show the code that demonstrates the problem? That would give others a better chance to reproduce the problem and try to fix it.

@alex-aparin
Copy link
Author

alex-aparin commented Feb 6, 2018

@cbeck88 , Unfortunately I cannot show original lua script which I debugged. I suppose it can be reproduced if lua script will contain many nested prototypes, or any other objects. Main purpose of such script is just to contain many objects to cause garbage collection. I can show workaround code, but it has no meaning without main lua script. In free time I will try to create simplified version of lua script which will break undumping.

@ptwohig
Copy link

ptwohig commented Jun 11, 2018

To clarify, is this with a C Closure or a Lua function?

@cbeck88
Copy link
Contributor

cbeck88 commented Jun 12, 2018

This is not a useful bug report if there isn't any kind of code example given -- it isn't reproduceable.

@LmTinyToon If you can't give a MVCE (https://stackoverflow.com/help/mcve) then if it were my repository I would close the issue

If OP has really disappeared then I would be inclined to assume that he figured it out it was a bug in his code and moved on.

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

@cbeck88 Well, agreed that the bug report is lacking. But, I was drawn to this because I seem to be getting a similar issue in my code and have spent a few days trying to reproduce it.

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

I'm having what is possibly the same issue. Unfortunately, it would be hard for me to post my whole test case (and I've tried to reproduce it in a simplified test case). However, running my production code in Valgrind seems to corroborate @LmTinyToon 's story.

==15944== Invalid read of size 1
==15944== at 0x211ABC3B: traverseproto (lgc.c:489)
==15944== by 0x211AC20D: propagatemark (lgc.c:594)
==15944== by 0x211AD3E6: singlestep (lgc.c:1056)
==15944== by 0x211AD6A4: luaC_step (lgc.c:1136)
==15944== by 0x211A16D0: lua_pushfstring (lapi.c:526)
==15944== by 0x211D03CE: searchpath (loadlib.c:449)
==15944== by 0x211D05CA: findfile (loadlib.c:480)
==15944== by 0x211D06A2: searcher_Lua (loadlib.c:498)
==15944== by 0x211A8D4B: luaD_precall (ldo.c:434)
==15944== by 0x211A90D9: luaD_call (ldo.c:498)
==15944== by 0x211A9147: luaD_callnoyield (ldo.c:509)
==15944== by 0x211A2BA4: lua_callk (lapi.c:925)
==15944== Address 0x25e963d9 is 9 bytes inside a block of size 120 free'd
==15944== at 0x4C2ED3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x211C24DF: l_alloc (lauxlib.c:1011)
==15944== by 0x211AF940: luaM_realloc_ (lmem.c:86)
==15944== by 0x211AAB44: luaF_freeproto (lfunc.c:132)
==15944== by 0x211AC5F9: freeobj (lgc.c:698)
==15944== by 0x211AC788: sweeplist (lgc.c:743)
==15944== by 0x211AD2E8: sweepstep (lgc.c:1032)
==15944== by 0x211AD477: singlestep (lgc.c:1070)
==15944== by 0x211AD6A4: luaC_step (lgc.c:1136)
==15944== by 0x211A14DC: lua_pushlstring (lapi.c:485)
==15944== by 0x211D24FC: u_string (eris.c:900)
==15944== by 0x211D62B4: unpersist (eris.c:2301)
==15944== Block was alloc'd at
==15944== at 0x4C2DA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x4C2FD84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15944== by 0x211C24F9: l_alloc (lauxlib.c:1015)
==15944== by 0x211AF940: luaM_realloc_ (lmem.c:86)
==15944== by 0x211AAE48: luaC_newobj (lgc.c:210)
==15944== by 0x211AA93F: luaF_newproto (lfunc.c:100)
==15944== by 0x211D39C6: u_proto (eris.c:1338)
==15944== by 0x211D62FA: unpersist (eris.c:2316)

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

Note that the error appears in multiple places, but the issue it is always the the same "block was free'd at". So I suspect this is indeed a bug.

Further, when I run my code under heavy load it almost always crashes with a heap corruption message. Sometimes it's a double free, sometimes it's a corrupted double-linked list. Basically it's always heap corruption and it has only started happening since I introduced Eris on this project.

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

Even more, all seems to be allocated at the same place and ultimately runs up against a double-free error.

https://gist.github.com/ptwohig/5ddc97166a61af608b19cdfc346c9761

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

Actually, I just figured out the reproduction steps which work 100% of the time:

  • Edit src/Makefile
  • Add -DHARDMEMTESTS to the compiler flags (probably want to add -g and -O0 so you can get stack traces as well)
  • Run unit tests.
  • And Boom! goes the dynamite. You get a segmentation fault each time

Same place in u_closure. I think @LmTinyToon is correct. I'm not sure that disabling garbage collection during will be a suitable workaround because this basically makes for a double-free regardless.

@cbeck88
Copy link
Contributor

cbeck88 commented Jun 12, 2018

Awesome, thank you 👍

I am interested to try to help fix this if that is welcome but I can't make time to do anything here until the weekend

@ptwohig
Copy link

ptwohig commented Jun 12, 2018

Here's where I think the offending lines are:

    LClosure *cl;
    Proto *p;

    eris_checkstack(info->L, 4);

    /* Create closure and anchor it on the stack (avoid collection via GC). */
    cl = eris_newLclosure(info->L, nups);
    eris_setclLvalue(info->L, info->L->top, cl);                   /* ... lcl */
    api_incr_top(info->L);

    /* Preregister closure for handling of cycles (upvalues). */
    registerobject(info);

    /* Read prototype. In general, we create protos (and upvalues) before
     * trying to read them and pass a pointer to the instance along to the
     * unpersist function. This way the instance is safely hooked up to an
     * object, so we don't have to worry about it getting GCed. */
    pushpath(info, ".proto");
    cl->p = eris_newproto(info->L);
    /* Push the proto into which to unpersist as a parameter to u_proto. */
    lua_pushlightuserdata(info->L, cl->p);                /* ... lcl nproto */
    unpersist(info);                          /* ... lcl nproto nproto/oproto */

Consider this scenario:

  • eris_newproto is just a macro for luaF_newproto
  • luaF_newproto actually uses luaC_newobj to allocate the memory
  • luaC_newobj creates the object as a white object, which means its eligible for garbage collection immediately
  • unpersist will call eventually u_proto a few frames higher in the stack
  • It appears that the usage of luaM_realloc_ does just that, it just allocates a block of memory. It's not aware of the garbage collector. So it could be freeing a block of memory now that will also be free'd again later.

I suspect that's why the author's of Lua included that HARDMEMTESTS to force a failure as soon as possible in testing.

@alex-aparin
Copy link
Author

I am sorry for late. Unfortunately I cannot give original code which represents bug. I tried to reproduce but it does not have effect.

@alex-aparin
Copy link
Author

alex-aparin commented Jun 13, 2018

@ptwohig, Yes, it seems like you have the same problem (I remember similar stack after corruption). It was "floating" bug. Sometimes it worked sometimes not. Temporal disabling of garbage collector during work of eris fixed problem (anyway it helped me and bug disappeared)

@ptwohig
Copy link

ptwohig commented Jun 13, 2018

@LmTinyToon I tried the disabling of GC and it still crashed on me. Granted I had a few hundred threads working a few hundred separate lua_States to get it to happen.

Here is something that is definitely wrong, though:

eris.c:1267

  p->sizecode = READ_VALUE(int);
  eris_reallocvector(info->L, p->code, 0, p->sizecode, Instruction);

And observe what happens beginning on lgc.c:479

static int traverseproto (global_State *g, Proto *f) {
  int i;
  if (f->cache && iswhite(f->cache))
    f->cache = NULL;  /* allow cache to be collected */
  markobjectN(g, f->source);
  for (i = 0; i < f->sizek; i++)  /* mark literals */
    markvalue(g, &f->k[i]);
  for (i = 0; i < f->sizeupvalues; i++)  /* mark upvalue names */
    markobjectN(g, f->upvalues[i].name);
  for (i = 0; i < f->sizep; i++)  /* mark nested protos */
    markobjectN(g, f->p[i]);
  for (i = 0; i < f->sizelocvars; i++)  /* mark local-variable names */
    markobjectN(g, f->locvars[i].varname);
  return sizeof(Proto) + sizeof(Instruction) * f->sizecode +
                         sizeof(Proto *) * f->sizep +
                         sizeof(TValue) * f->sizek +
                         sizeof(int) * f->sizelineinfo +
                         sizeof(LocVar) * f->sizelocvars +
                         sizeof(Upvaldesc) * f->sizeupvalues;
}

The garbage collector attempts to iterate each array of objects and mark them individually, however if a value exists (for example) for p->sizek and the pointer is 0, a segfault will happen. More specifically if it is possible that a garbage collection cycle hits between the assignment of one of the sizes and the assignment/allocation of the array holding the data we end up with a hard crash.

Changing everywhere in eris.c to code similar to this seems to have fixed that problem and all tests pass now.

  {
  	int psizek = READ_VALUE(int);
        eris_reallocvector(info->L, p->k, 0, psizek, TValue);
	p->sizek = psizek;
  }

I'm still not convinced that there isn't a need for barriers to prevent memory corruption but this is definitely problematic as well. Granted that the particular above segment can't trigger a garbage collection cycle, similar parts of the code can potentially do so as well.

I also noticed that between Lua 5.2 and Lua 5.3 they removed a proto barrier function, so maybe this was the cause of it. It's very hard to say right now until I do more digging.

@jyelon
Copy link

jyelon commented Dec 10, 2021

I had the same problem. Turns out, this is just a missing luaC_objbarrier.

In u_proto, look for this line of code:

p->p[i] = eris_newproto(info->L);

In this case, if proto p is marked "black", we need a luaC_objbarrier. I preceded this line of code with this:

assert(!isblack(obj2gco(p)));

And sure enough, I got a couple of assertion fails. So definitely, proto p can be black.

I was able to fix it by changing the code as follows:

p->p[i] = cp = eris_newproto(info->L);
luaC_objbarrier(info->L, p, cp);

If you want to see the bug yourself, this is very hard to do. You have to get the garbage collector to mark proto p black at just the wrong moment. It's extremely unlikely, unless you're saving and loading a lot of nested protos. I have not been able to write a small test case that triggers the bug. Despite that, I can assure you that the bug does happen, and that the fix above works.

Note that the "reallocvector" bug documented above, by ptwohig, is a completely separate bug. His repair works for that bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants