Performance issue impacting WCOSS2 users #2364
-
I am working with John Wagner to optimize his workflow on the WCOSS2 systems. The workflow makes heavy use of the stat_analysis utility: it can run ~60 separate PBS jobs, each running ~70 separate instances of stat_analysis. With MET_TMP_DIR set to a directory within the user's Lustre area, I found that most of the stat_analysis processes were stuck in "uninterruptible sleep", with only a couple of processes on each compute node making any progress at any given time. Here are some typical stack traces seen while investigating (gdb -p ...). I also used the strace utility to profile a single instance of stat_analysis. In summary, the strace data reveals a repeating pattern of file system activity: a small temporary file is created, read back, and then deleted.
This pattern is repeated 8128 times for the 'po' element type. Is this the expected behavior?! Setting MET_TMP_DIR=/tmp (a node-local RAM file system) allowed the entire workflow to complete in ~12 minutes, while the usual MET_TMP_DIR setting results in runtimes of 4-5 hours. With 60*70=4200 of these processes attempting to run at the same time, the Lustre file system becomes bogged down for all users. Is setting MET_TMP_DIR=/tmp the recommended solution to this issue?
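To make the cost easy to see outside of MET, here is a minimal standalone C++ timing sketch of that same create/write/read/delete cycle (the file name and contents are illustrative; 8128 matches the count observed above). Pointing it at a Lustre directory versus /tmp makes the two locations easy to compare:

```cpp
// Standalone sketch (not MET code): time N create/write/read/delete
// cycles of a tiny file in a given directory. This metadata-heavy
// pattern is cheap on a node-local tmpfs but expensive on Lustre.
#include <chrono>
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char **argv) {
    const std::string dir = (argc > 1) ? argv[1] : "/tmp";
    const int n_iter = 8128;  // matches the count seen in the strace data

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iter; ++i) {
        const std::string path = dir + "/tmp_probe_" + std::to_string(i);
        { std::ofstream out(path); out << "config string\n"; }               // create + write
        { std::ifstream in(path); std::string line; std::getline(in, line); } // read back
        std::remove(path.c_str());                                            // delete
    }
    const auto stop = std::chrono::steady_clock::now();

    std::cout << n_iter << " cycles in "
              << std::chrono::duration<double>(stop - start).count()
              << " s in " << dir << "\n";
    return 0;
}
```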
-
Hello @dkokron, I see that you're supporting John Wagner in his configuration and use of the tools in METplus. Thanks for providing the stack trace information; that really helps point to where the bottleneck is occurring. I strongly suspect that the culprit is the MetConfig::read_string(const char * s) function, which matches the pattern in the stack traces you sent. Reading a string into a config object is handled by writing it to a small temporary text file and then parsing that file, purely as a convenience. But yes, I understand that it's wreaking havoc on the Lustre file system. The MET tools create/destroy temp files as needed while running, and we append the process id to the end of the temp file name (along with other identifying text) so that concurrent instances don't collide. A sketch of this pattern appears below.
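In rough form (a sketch of the pattern just described, not the actual MET source; parse_config_file is a hypothetical stand-in for the real parser), the behavior looks something like this:

```cpp
// Sketch of the temp-file round trip described above -- not the
// actual MET source code.
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
#include <unistd.h>

// Hypothetical stand-in for MET's real config-file parser.
static void parse_config_file(const std::string &path) {
    std::ifstream in(path);
    for (std::string line; std::getline(in, line); )
        std::cout << "parsed: " << line << "\n";
}

static void read_string_sketch(const char *s, const std::string &tmp_dir) {
    // The temp file name carries the process id so that concurrent
    // instances do not collide, e.g. "<MET_TMP_DIR>/met_config_12345".
    const std::string tmp_file =
        tmp_dir + "/met_config_" + std::to_string(getpid());

    { std::ofstream out(tmp_file); out << s; }  // 1. write the string out
    parse_config_file(tmp_file);                // 2. read it back and parse it
    std::remove(tmp_file.c_str());              // 3. delete the temp file
}

int main() {
    read_string_sketch("model = \"GFS\";", "/tmp");
    return 0;
}
```

Every call pays for a file create, a read, and an unlink, which lines up with the repeating pattern in your strace data.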
To answer your question directly: yes, setting MET_TMP_DIR=/tmp seems like a reasonable fix, although I think I remember that writing to /tmp on WCOSS2 was discouraged in the past. We typically use that default setting during development, which may explain why we haven't been able to replicate the runtime issues that John Wagner has been able to uncover! The better long-term solution may be reimplementing the MetConfig::read_string() logic so that no temp file is needed at all. @georgemccabe what impact, if any, does this discussion have on the setting of METplus config options? I see in the METplus User's Guide that TMP_DIR defaults to {OUTPUT_BASE}/tmp.
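As for the long-term idea above, one possible direction (just a sketch on my part, not a committed design) would be to parse the string from an in-memory stream so the file system is never touched:

```cpp
// Sketch of one possible long-term fix (an assumption, not a
// committed design): parse the string from an in-memory stream
// so that no temp file is ever created.
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a parser that reads from any istream.
static void parse_config_stream(std::istream &in) {
    for (std::string line; std::getline(in, line); )
        std::cout << "parsed: " << line << "\n";
}

static void read_string_in_memory(const char *s) {
    std::istringstream in(s);  // wrap the string; no file I/O at all
    parse_config_stream(in);
}

int main() {
    read_string_in_memory("model = \"GFS\";\ntmp_dir = \"/tmp\";");
    return 0;
}
```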
-
Please reach out to Steven Earle about the use of /tmp on WCOSS2 compute nodes. I am on PTO until 2 Oct, so my response time will not be great.
Dan
…On Fri, Sep 22, 2023, 12:33 AM jprestop wrote:
@JohnHalleyGotway <https://github.com/JohnHalleyGotway>
"Although I think I remember that writing to /tmp on WCOSS2 was discouraged in the past."
I just wanted to confirm what you have said here. That is correct. In the 20210125 MET User's Telecon, the following was discussed:
This is for all users of MET on WCOSS. If you do not use MET on WCOSS, please disregard. Please check your MET configuration files for the following line:
tmp_dir = "/tmp"
If you see this in your config files, please change tmp_dir to something else, perhaps on your own disk. The administrators of WCOSS do not want us to write to /tmp on compute nodes because it consumes a memory-based resource.
Changing TMP_DIR's default value from {OUTPUT_BASE}/tmp to something else was discussed. It was decided to leave the default as-is and that WCOSS "users should continue managing their own temp directory settings." That being said, it doesn't mean we shouldn't revisit the conversation to consider changing that default to the /tmp setting that appears in the MET config files. It is possible that things have changed over the past few years, so please let me know if you'd like me to follow up on this topic with NCO.
-
I summarized the changes to MET we should pursue in dtcenter/MET#2690. I'd recommend that we mark this discussion as answered, close it, and move any future discussion about MET's use of temp files to the body of that new issue.