psmgmt NEWS -- history of user-visible changes.
Copyright (C) 2010-2021 ParTec Cluster Competence Center GmbH, Munich
Copyright (C) 2021-2024 ParTec AG, Munich
Please send bug reports, questions and suggestions to <support@par-tec.com>
Version 5.1.60-1:
=================
Bugfixes:
- psslurm: respect step core map with cpu-bind=sockets (psc:#449)
- psslurmgetbind: Adjust coremap generation (psc:#446)
- psslurmgetbind: Allow option combinations (psc:#451)
Enhancements:
- Print message if core requested to pin does not match step core map
- psslurmgetbind: Print core map if PSSLURM_PRINT_COREMAPS set
Additional changes:
- psslurmgetbind: Change default Slurm version to 23.02
Version 5.1.60:
===============
Bugfixes:
- Don't stop psaccount collect scripts on temporary error (#186)
- Ensure PSP_DD_CHILDRESREL is sent without delay (#177)
- Ensure -x and -H (-N, -f) work together in mpiexec (#160)
- psiadmin has to expect local answer on plugin (#167)
- Rework mpiexec's environment handling (#182)
- Ensure building works even with missing NUMA support (#12)
- Fix potential memory leak and unexpected NULL attribute unveiled by Clang
- Fix warnings unveiled by recent Clang versions
- Fix psidsession information when spawning w/out Slurm
Enhancements:
- Add psslurm configuration option SSTAT_USERS (psc:#438)
- Add extra variables to client environment (#188)
* PS_SESSION_ID, PS_JOB_ID, PS_RESERVATION_ID, PS_JOB_RANK, PS_JOB_SIZE
- Implement --pset option for mpiexec (#110)
- Add support for OCI containers to psslurm (via Slurm's oci.conf)
- Add PAM session support for user processes in psslurm
- Prepare for PMIx_Spawn() support -- this is not yet complete
- Close psidforwarder's logger connection on dropped FINALIZE message
Additional changes:
- Get rid of the OpenMPI tweaks in mpiexec
* These have long been obsoleted by native Slurm support
- Get rid of starters for ancient MPI versions
- PMIx tests: Add spawn and simple abort tests
- Add script to calculate cpumap from /proc/cpuinfo
- Rename __PMI_SPAWN_SERVICE_RANK to __SPAWNER_SERVICE_RANK
- Introduce PSIDHOOK_EXEC_CLIENT_EXEC and bump plugin API to 141
- Push rrcomm's plugin version to 2
- Make PSID_execFunc() usable outside psid itself
- Publish loadPlugin(), findPlugin() as PSIDplugin_load(), PSIDplugin_find()
- Move spawn definitions into common pluginspawn.[ch]
- Rework psidspawn's testExecutable() and changeToWorkDir()
- Introduce mmapFile() in pluginhelper
- Rework env_t to make it opaque, rework filter functionality, etc.
- Allow stealing of the whole array from strv_t
- IWYU fixes (after update to version 0.22)
- Add Intel Xeon MAX topology w/ and w/out flat mode recorded with v2.2
Version 5.1.59-5:
=================
Bugfixes:
- Ensure reading from cached pack info messages starts from the beginning
Version 5.1.59-4:
=================
Bugfixes:
- Adapt to changed return code in libbpf 1.0.0 and above (jwt:#26118)
- Fix handling of srun's --threads-per-core for pinning (psc:#443)
- Rework psslurm's fillHints() to avoid random 'invalid hint' (psc:#440)
Enhancements:
- Introduce PSSLURM_VERBOSE_PINNING to print additional information
- Suppress more error messages if quiet flag is given in bpf_loader
Version 5.1.59-3:
=================
Enhancements:
- Update bpf_cgroup_device for compatibility with Rocky 9
Additional changes:
- Rename bpf_cgroup_device to prevent automatic removal (#176)
Version 5.1.59-2:
=================
Bugfixes:
- Evict false positive warnings from syslog (#178)
Version 5.1.59-1:
=================
Bugfixes:
- Stop hard-blocking the build of BPF support in RHEL9, Rocky9, etc.
Version 5.1.59:
===============
Bugfixes:
- Adapt to new error behavior of Slurm 23.02 (jwt:#24558)
- Execute SPANK hooks in batchscript context (deep:#3325)
- Prevent possible DOS in psserial (#172)
- Ensure pluginforwarder sends sufficient data
- Prevent segfault when psaccount gets insufficient answer
- Prevent memory leak when pscommon's logger is re-initialized
- Set correct size for allocation of Req_Signal_Tasks_t
- Fix various memory leaks, double free(), etc. found during development
- Various fixes unveiled by scan-build
Enhancements:
- Add a generic energy readout script (psc:#411)
- Optimize PSI_spawnRsrvtn() (#155)
- Ensure slurmutils is installed with matching Slurm
- Set correct job and step memory limits for SPANK
- RRComm v2 supporting messages across jobs (created by MPI_Comm_spawn())
- Allow psmgmt to be built without SPANK support (#171)
Additional changes:
- Remove the cgroup plugin (obsoleted by jail) (#174)
- Rework PStask_t's handling of extra information (#156)
- Drop support for Slurm 19.05 protocol
- Drop support to spawn to PSProtocolVersion < 341 (before 2019)
- Use PS_DataBuffer for unpacking messages using psserial functions
- Enable the compiler to verify types for more psserial functions
- Introduce PSID_dbg() and replace PSID_log() as much as possible
- Introduce psslurmprototypes
- Remove unused PSI_spawnSingle()
- Various IWYU fixes
Version 5.1.58-1:
=================
Bugfixes:
- sss-nss re-uses closed fd (#159/jwt:#20747)
- Prevent segfault from termJail() (jwt:#24361)
- Ensure nLocalSlots is initialized (#162/jwt:#24377)
- POSIX does not guarantee that getpwuid_r() et al. set errno
Enhancements:
- librrcomm needs -fPIC to be linked into libpscom
*************************************************************************
*************************************************************************
** **
** Since the pspelogue executable was moved, the slurmctld prologue **
** has to be adapted accordingly. The new location is libexecdir, **
** i.e. /opt/parastation/libexec/psmgmt/pspelogue **
** **
** psmgmt 5.1.58-0 will require psconfig 5.2.10 or later to work **
** **
*************************************************************************
*************************************************************************
Version 5.1.58:
===============
Bugfixes:
- Pin only to cores allowed by the step in psslurm
- Fix support for spawn in PMI
- Ensure step forwarder task owns its strings
Enhancements:
- Add user-facing error messages to psslurm's pinning
- Handle hint compute_bound in psslurm
- Print coremaps to stderr for debugging if PSSLURM_PRINT_COREMAPS is set
- Make energy reporting manageable by srun (meluxina #1242)
- Cache user-facing messages until connection to srun appears
- Add support for libbpf versions 1.0.0 and above (cgroup v2 on Rocky 9)
Additional changes:
- Use share/psconfig/dumps for main psconfig dumps (requires psconfig 5.2.10)
- Add flag termAfterFWmsg to job and step to stop immediately after
delivery of queued messages to the user
- Move psid to sbin directory
- Move various helpers (pspelogue,ps_acc,psaccounter,psilogger) to libexecdir
- Major rework of psidsession (addressing tasks via session/job/app/rank)
- Utilize psid (instead of psilogger) to check validity of
installation directory
- Introduce PSIDHOOK_LAST_RESRELEASED, PSIDHOOK_LAST_CHILD_GONE and
bump plugin API version to 140
- Introduce the concept of sister partitions and push daemon protocol
to version 416
- Allow PSIDpart_register() to extend a partition
- Introduce PSID_flog() and PSID_fdbg()
- Introduce PSIDfwd_inForwarder()
- Move clrPartQueue() to PSpart_clrQueue() and make it public
- Rework strv_t behavior
Version 5.1.57-3:
=================
Bugfixes:
- Various fixes for cgroup v2 (jwt:#23342)
* Sanitize calculations of jail limits, especially memory
* Prevent race conditions leading to concurrent modifications of cgroup
Version 5.1.57-2:
=================
Bugfixes:
- Delete allocation before sending MESSAGE_EPILOG_COMPLETE (jwt:#23342)
Version 5.1.57-1:
=================
Bugfixes:
- Set correct state of allocation to ensure proper cleanup of processes
- Set the correct sequence when setting cgroup memory limits
- Ensure correct locking in cgroupv1
- Fix race condition and ensure all SSH processes get jailed into cgroups
- Ensure cgroupv2 controllers get enabled correctly
- Fix quoting and other errors unveiled by shellcheck in jail scripts
- Ensure local variable does not influence top level behavior
- Various checks if data was unavailable from message
- Avoid array boundary violation
Enhancements:
- Print messages from PMIx server log interface
- Improve cleanup of leftover cgroupv2 directories
- Add caller name to jail log functions
- Avoid superfluous mdsave call in cgroup jail script
- Avoid temporary buffers of excessive size
Additional changes:
- Various improvements in the build system (e.g. to support Rocky 9)
- Fix gcc-11 warning emitted via rpmbuild on Rocky 9
- Improve documentation comments
*************************************************************************
*************************************************************************
** **
** psmgmt 5.1.57-0 introduces support for cgroup v2. Since enhanced **
** cgroup support was moved into the jail-plugin machinery away from **
** the original cgroup-plugin, loading both plugins was useless but **
** not really harmful. This changes with psmgmt 5.1.57-0 when usage **
** of cgroup v2 support is enabled on the system and within Slurm. **
** In this specific case default settings for the cgroup-plugin will **
** clash with the expectations of the jail-plugin's cgroup machinery **
** and the standard setup of cgroups found on the systems **
** **
** Therefore it is highly recommended to: **
** **
** - properly set up the cgroups machinery of the jail-plugin **
** - disable the use of the old cgroup-plugin **
** **
*************************************************************************
*************************************************************************
Version 5.1.57:
===============
Bugfixes:
- Fix jail memory check (#118)
- Deny access to all devices not allocated by the user via cgroups
- Cleanup leftover cgroupv1 psid directories
- Do not let psslurm init fail if optional Spank plugins are broken
- Do not overwrite jail information if multiple GRes have devices
- Fix inet_ntop()'s second argument
Enhancements:
- Introduce support for cgroup v2
- Set default cgroup version to autodetect
- Add psslurm set options SPANK_LOAD, SPANK_UNLOAD and SPANK_FIN (pct:#428)
- Introduce support for port ranges in SlurmctldPort in psslurm
- Send SIGTERM only in the first round of killing cgroup processes
- psslurm takes hwloc info as default if SKIP_CORE_VERIFICATION is given
- Avoid lapsed constant PMIX_ERR_NOT_IMPLEMENTED in PMIx 4
- Mark processes started with PSID_execFunc() as non daemon
Additional changes:
- Add support for changing IP addresses delivered in psmgmt-dynip
- Introduce PSC_traverseHostInfo() to avoid getaddrinfo()'s boilerplate code
- Introduce PSC_isDaemon() and rework PSC_setDaemonFlag()
- Add adjusted psconfig dumps for base and defaults configuration
- Refactor and streamline RDP
Version 5.1.56-2:
=================
Bugfixes:
- Ensure cached PSP_PACK_INFO info is handled (#131)
- Execute SPANK prologue/epilogue hooks without the presence of
a prologue/epilogue script
- Let doGetMsgBuf() (behind PSP_getMsgBuf() et al.) warn even if header.len is 0
Enhancements:
- Ensure cores is not used in gres.conf
- Include UID in PMIx' temp session dir name to help with node sharing
- Add sentence on ParaStation ID -1 to psiadmin's inline help
Version 5.1.56-1:
=================
Bugfixes:
- Set correct localNodeId for jobs on sister nodes (psc:#427)
- Prevent spank_api.so from being dlopen()ed twice
- Fix RPM's postun scriptlet
Enhancements:
- Add debugging output in psslurm's findGresCred()
Version 5.1.56:
===============
Bugfixes:
- Prevent segfault if resInfo is broken (jwt:#21216)
- Don't let exiting shell spoil output for steps with pty (jwt:#21114)
- Fix quoting to prevent the device jail script from generating unexpected files (dt:3151)
- Set correct cgroup configuration default if no cgroup.conf is present
- Ensure configuration updates will remove obsolete entries
- psslurm: Move SLURM_TRES_* setting place
- Fix one improbable segfault unveiled by gcc 13.1
Enhancements:
- pspmix: Use PSPMIX_ENV_TMOUT to steer environment timeout
- Add jail configuration option JAIL_INIT_SCRIPT and the corresponding script
- Save main psid PID to /run for jail scripts and pshc
- Let pluginforwarder jail pspmix server processes
- psslurm: Rework GPU pinning
- Avoid silent fails in GPU pinning
- Introduce spank_prepend_task_argv() expected for Slurm 23.11
- Let RPMs depend on PMIx minor but not release version
Additional changes:
- pspmix: Print input to server_grp_cb()
Version 5.1.55-2:
=================
Bugfixes:
- Prevent free() on uninitialized pointer (jwt:#20926)
Version 5.1.55-1:
=================
Bugfixes:
- Ensure SPANK calls (esp. TASK_EXIT) are not called for service procs
Enhancements:
- Extra analysis checks for jwt:#20747
Version 5.1.55:
===============
Bugfixes:
- Ensure the correct order of SPANK hook calls
- Add SPANK option handling at all necessary places
- Actually use the optional keyword of plugstack.conf
Enhancements:
- Add support for Slurm 23.02 protocol
- Add support for include statement in slurm.conf
- Set all Slurm configuration files to case-insensitive
- Allow SPANK hooks spank_init, spank_init_post_opt and spank_user_init
to modify the environment of corresponding user processes
- Rework and improve jail scripts
- Introduce PSIDnodes_lookupHostname() to internally resolve hostnames;
this utilizes <Psid.NetworkName>.Hostname in psconfig's node objects
- Add support for S_TASK_ARGV (might be in future Slurm versions) to psslurmspank
Additional changes:
- Add psslurm option SKIP_CORE_VERIFICATION as preparation for AWS pClusters
- Add ReFrame testsuite and first PMIx test
- Make pluginconfig's Config_t opaque
Version 5.1.54-3:
=================
Bugfixes:
- Ensure downward messages are not handled as upward (jwt:#19720)
Enhancements:
- Support srun option --threads-per-core in psslurmgetbind
- Reject mutually exclusive srun options in psslurmgetbind
Version 5.1.54-2:
=================
Bugfixes:
- slurm.conf might hold absolute or relative path to spank configuration
Version 5.1.54-1:
=================
Bugfixes:
- Fix possible segfault when parsing GRes usage in psslurm
- Update srun options for user triggered spawn in psslurm
Version 5.1.54:
===============
Bugfixes:
- Fix possible segfault if psslurm verifies an allocation (#96)
- Fix handling of hint nomultithread in pspmix (#104)
- Don't destroy steps in Job_delete()
- Ensure all delayed tasks get removed when allocation is gone
- Prevent premature exit of psilogger on short jobs
- Don't load psslurm if jail function handles are not available
- Ensure all data gets distributed during PMIx fence operation
Enhancements:
- Add cgroup support to psslurm (might limit various system resources)
- Major rework of jail plugin and scripts to allow node sharing
- Introduce automatic PMIx process sets to pspmix when acting as v4 server
- More scalable, tree protocol based PMIx fence operation in pspmix
- Introduce psslurm configuration option SLURM_CONFIG_DIR
- Rename psslurm configuration option SLURM_CONF_DIR to SLURM_CONF_CACHE
- Warn about missing support for multiple GPUs per task in psslurm (mlx#842)
- Verify sender, dest and user IDs in pspmix communication
- Add new psslurm configuration option DENIED_USERS
- Speedup completing phase of steps which failed to start
- Introduce __PSI_LOGGER_IO_FILE to detour psidlogger's outputs
Additional changes:
- Show sender's TID for failed psaccount update messages
- Give some indication why psidforwarder's connection to logger gets lost
- Move jail scripts to separate folders
- Use pscompress and psexpand in jail scripts
- Switch psid's Timer facility to POSIX per-process timer
- Rename Step_clearByJobid() to Step_destroyByJobid()
- Rename traverseHostList() to traverseCompList()
- Add functionality to find entry in vector
- Print pmix version at the end of configure
- Avoid %exclude statement in spec file for each .la (#100)
Version 5.1.53-1:
=================
Bugfixes:
- Avoid double/wrong free with PMIx_Cpuset_destruct() in pspmix
Version 5.1.53:
===============
Bugfixes:
- Fix filter logic bug unveiled by Clang
- Fix syntax error not reported by gcc-11 or later
- Don't mix loop variables
- Fix memory leak in pspmix
- Use PMIx_Cpuset_destruct() in current version in pspmix
- Add validity checks for array lengths in pspmix messages
- Sanitize file descriptor handling in pspmi
- Call GC for cbInfoPool only when necessary
- Fix print_array_if macro in gdbinit definitions
Enhancements:
- Set PMIX_LOCALITY_STRING using hwloc in pspmix
- Make logger print prefix look nicer and sort compatible
- Introduce rrcomm plugin and user-space library for Rank Router Communication
- Introduce mpiexec's --fullpartition (-P) option
- Use local reservation info for pinning (obsoletes PSP_DD_SPAWNLOC msgs)
- Support more fence info types used by openpmix
- Remove unnecessary initializations and reduce noise during pspmix's operation
- Add minRank/maxRank to PSIDsession's PSresinfo_t to speed up search
- Add memory cleanup functionality to PSIDpart module
- Introduce delayPSPMsg plugin (delays specific message types for debugging)
Additional changes:
- Introduce PSCio_recvBufB() for blocking receive
- Introduce tryRecvFragMsg() in psserial
- Introduce PSIDHOOK_SPAWN_TASK and bump plugin API version to 137
- Move pspmix server start to new hook HOOK_SPAWN_TASK
- Add hook PSIDHOOK_FRWRD_SETUP and bump plugin API version to 138
- Enhance PSCio_setFDblock() to report old setting on return
- Send local reservation infos to nodes (push PSDaemonProtocolVersion to 415)
- Use common definitions of crucial environment names
- Introduce Timer_restart()
- Introduce PSP_resolveType() and PSDaemonP_resolveType()
- Remove size limitation from fragmented messages
- Do not prevent sending 0-length fragmented messages
- Count number of selectors and make it available via Selector_getNum()
- Let psidforwarder rely on reported number of Selectors to decide on exit
- Store size in PSIDmsgbuf_t explicitly
- Make psserial's fragment types public
- Refactor PSIDsession to make use of PSitems
- Rework selector
- Remove obsolete remnants of psicomm (early sketches of rank-routing idea)
- Improve lots of documentation comments
Version 5.1.52-5:
=================
Bugfixes:
- Prevent segfault due to late REQUEST_LAUNCH_TASKS message
Version 5.1.52-4:
=================
Bugfixes:
- Prevent Selector deadlock (don't awaitWrite() on disabled Selector)
Version 5.1.52-3:
=================
Bugfixes:
- Ensure psslurm is fully initialized if config-less mode is combined
with Slurm healthcheck
- Allow to call psslurm's cleanup() even if not fully initialized
Version 5.1.52-2:
=================
Enhancements:
- Allow higher pmix release versions in pspmix RPM requirements
Version 5.1.52-1:
=================
Bugfixes:
- Adjust psslurmgetbind to Slurm's interpretation of -B option (#89)
- Fix hint nomultithread together with 'exact'
Version 5.1.52:
===============
Bugfixes:
- Allow directory as destination for srun's --bcast option (pct:#404)
- Use configured nodename for pspmix's namespace procmap (#82)
Enhancements:
- Add support for Slurm 22.05
- Add IPMI support to psslurm's energy monitoring
- psslurmgetbind: Add --exact option (implied by -c for Slurm >= 22.05)
- Try to autodetect Slurm version in psslurmgetbind if possible
- Show Slurm protocol version in config request log message
Additional changes:
- pspmix RPM requires the pmix version used to build
Version 5.1.51-2:
=================
Bugfixes:
- pspmix: Create namespace with only info arrays (works around OpenPMIx #2791)
- psslurm: Do not filter PMIX_MCA_* variables from user env
- pspmix: Fix memory leak
Version 5.1.51-1:
=================
Bugfixes:
- Fix round counter for pluginforwarder children
* This is used by pelogue to determine whether to run as root
*************************************************************************
*************************************************************************
** **
** psmgmt 5.1.51-0 renames psconfig parameters, i.e.: **
** RdpStatusTimeout => StatusTimeout **
** RdpStatusDeadLimit => DeadLimit **
** RdpStatusBroadcasts => StatusBroadcasts **
** This must be reflected in the psconfig database. To adapt the **
** defaults:psid object accordingly, the script **
** update_defaults_5.1.51.sh deployed in **
** /opt/parastation/share/doc/psmgmt/psconfig **
** must be run with corresponding rights on the system hosting the **
** psconfig database. If your configuration contains custom settings **
** of any parameter listed above, those have to be adapted, too. **
** Starting with version 5.1.51 the ParaStation daemon will refuse **
** to start if one of the now obsolete psconfig parameters is found **
** **
*************************************************************************
*************************************************************************
Version 5.1.51:
===============
Bugfixes:
- Prevent writing behind end of buffer in spawner (#84)
- Reset FPE exceptions on spawn and unload if ENABLE_FPE_EXCEPTION is enabled
- Allow to run PMIx jobs as root
- Prevent possible segfault in psgw at unload when it was uninitialized
- Various potential bugs unveiled by scan-build
- Fix format errors unveiled by cppcheck
- Ensure array size for gethostname() is sufficient
- Ensure debug loops have no bad side effects
Enhancements:
- Let MAP_LDOM pinning use physical domain numbers
- Support topology.conf and set SLURM_TOPOLOGY_* in rank environments
- Extend psslurm's accounting towards energy, interconnect and I/O
* Set their poll intervals from slurm.conf
* Add psaccount configuration option MONITOR_SCRIPT_PATH
* Introduce psaccount debug masks PSACC_LOG_FILESYS and PSACC_LOG_INTERCON
- Unload psslurm if Slurm configuration parsing in configless mode fails
- Enhance PMIx 4 support
- Add PMIx singleton support to pspmix
- Add default session ID explicitly (might be required by openpmix 4.1.2rc1)
Additional changes:
- Rename (misleading) psconfig parameters to match psiadmin
- Make use of PSC_concat() less error-prone (remove need for trailing 0L)
- Make flog() and fdbg() accessible to other plugins
- Introduce FW_CHILD_INFINITE to restart plugin forwarder's child endlessly
- Also call hookFWInitUser if user is root
- Add const qualifier to traverseHostList()
- Add some test programs for PMI
Version 5.1.50-5:
=================
Bugfixes:
- Do not pass node info array in pspmix with PMIx < 4 (jwt:#16581)
- Fix handling of PSPMIX_CLIENT_INIT/FINALIZE[_RES] message type
Enhancements:
- Integrate pspmix' info arrays output into debug logging
Version 5.1.50-4:
=================
Bugfixes:
- Rework request handling for Slurm messages messed up in 5.1.50-3
Version 5.1.50-3:
=================
Bugfixes:
- Ensure ptid is interpreted correctly if node goes down (jwt:#16170)
- Fix re-sending requests originally sent via sendSlurmctldReq() (jwt:#16170)
Version 5.1.50-2:
=================
Bugfixes:
- Ensure reqKeys.strings gets initialized (jwt:#16557)
Version 5.1.50-1:
=================
Bugfixes:
- Enforce PSP_DD_DAEMONCONNECT to be the first message delivered via RDP
Enhancements:
- Introduce PSSLURM_FAKE_UPTIME to help support Slurm's power saving feature
- Add example scripts for Slurm power saving feature
Additional changes:
- Introduce RDP_getState() and RDP_getNumPend()
Version 5.1.50:
===============
Bugfixes:
- Reorganize psslurm's con->info handling (#77)
- Let MASK_LDOM pinning use physical domain numbers (#72)
- Release delayed spawn requests in the step callback (meluxina:#219)
- Ensure an allocation is defined for spawn request (meluxina:#219)
- Do not crash psid when SLURM_OVERCOMMIT is not set
- Ensure psslurm's help will not crash psid if no key is given
- Avoid use-after-free for config.psiddomain
- Avoid memory leak upon connection reset in psslurm
- Fix memory leak in pspmix' fence data handling
- Substantial hardening of pspmix
- Retry sending Slurm message even if nothing is sent yet
- Verify an allocation if no job/step is started
- Ensure openSlurmctldConEx() returns an error
Enhancements:
- Add basic support for PMIx 4.0 and bump pspmix plugin version to 2
- Continue pinning on full nodes when overcommit is set in psslurm
- Use psid domain as cluster ID in pspmix
- Introduce PSC_getwd()
- Introduce addStringArrayToMsg()
Additional changes:
- Adapt strv_t in analogy to env_t and add compatibility to getStringArrayM()
- Replace add/getEnviron() by addStringArrayToMsg() and getStringArrayM()
- Remove obsolete info parameter from sendSlurmMsg()
Version 5.1.49-5:
=================
Bugfixes:
- Fix user ID of response message for a BCast RPC (pct:#405)
Version 5.1.49-4:
=================
Bugfixes:
- Don't delete the allocation before all terminate messages are sent
Version 5.1.49-3:
=================
Bugfixes:
- Delete allocation if the slurmctld prologue fails (jwt:#14755)
- Also delete lingering allocation if result cannot be sent to pspelogue
Enhancements:
- Add psslurm kvs set commands DEL_ALLOC, DEL_JOB, DEL_STEP
- Add hook PSIDHOOK_PELOGUE_DROP and bump plugin API version to 136
Version 5.1.49-2:
=================
Bugfixes:
- Fix possible memory leak if slurmctld fails to send a reply to request
- Prevent segfault when cleaning up clients in PMIx server
- Fix memory corruption if number of requested GRes is too large
- Fix memory leaks unveiled by valgrind
- Fix warnings unveiled by cppcheck
Enhancements:
- Add the ability to query psslurm's active Slurm connections via psiadmin
- Allow PMI client to connect multiple times (via TCP)
- Overwrite default PMI connection method by PMI_ENABLE_TCP environment
Additional changes:
- Raise preference of CONNECTED above RELEASED for PMI(x) status
Version 5.1.49-1:
=================
Messed up the tags, thus, need a new version
Version 5.1.49:
===============
Bugfixes:
- Avoid reading beyond end of buffer
- Ensure SIGPIPE is not ignored if started via systemd
- Prevent PSsignal_get() from being interrupted by RDP timeouts, too
Enhancements:
- Major rework of pspmix to a concept of one server per user
- Add function to deregister client from PMIx server library
- Unload psslurm if Slurm's healthcheck fails
- Introduce PStask_destroy() and rework PSIDtask_clearMem() to use it
- Plugin's help directive now takes an argument
- Rework update caching in providerloop.c
Additional changes:
- Store reservations in jobs as sets grouped by spawner
- Fix description of DEBUG_MASK in pspmix.conf
- Fix various gcc-12 warnings
- Split psidsession.[ch] from psidspawn.[ch]
- Introduce PSC_getVersionStr() and make use of it at various places
- Make PSID_findJobInSession() public
- Rename (original) PSjob_t to PSsession_t and PSresset_t to PSjob_t
- Rename PSI_sendSpawnReq() to sendSpawnReq() and make it private
- Consolidate env.[ch] into psenv.[ch]
- Centralize fixList() into list_fix() in list.h
- Add print_array_if macro to gdbinit definitions
Version 5.1.48-6:
=================
Bugfixes:
- Attach Slurm message hash to munge credential (psc:#402)
Enhancements:
- Allow to encode payload in psMungeEncodeRes()
- Improve debug log (unify resID print, minor fixes)
*************************************************************************
*************************************************************************
** **
** psmgmt 5.1.48-5 adapts psslurm to critical bugfixes in Slurm **
** (CVE-2022-29500, CVE-2022-29501, CVE-2022-29502). Therefore **
** psslurm in this and further versions is only compatible with **
** Slurm versions 20.11.9 and 21.08.8 and beyond **
** **
*************************************************************************
*************************************************************************
Version 5.1.48-5:
=================
Bugfixes:
- Allow only specific users to decode munge messages sent by psslurm
- Cleanup possibly sensitive information when spawning step- or job-forwarder
Version 5.1.48-4:
=================
Bugfixes:
- Set correct exit status for steps hitting the walltime limit
- Ensure symbol is found after second %, too
Enhancements:
- Add `--exact` option to srun when spawning additional processes
Version 5.1.48-3:
=================
Bugfixes:
- Prevent double sending of signals in psslurm
Version 5.1.48-2:
=================
Bugfixes:
- Fix slurmctld (v21.08) warning on empty gres links statement (psc:#400)
- Consider all spawn info objects during MPI_Comm_spawn_multiple (#69)
Enhancements:
- Add support for links parameter of gres.conf
Version 5.1.48-1:
=================
Bugfixes:
- Ensure to receive all pending ACKs from srun (#2977)
- Force spawning srun to use configuration cache in Slurm configless-mode
- Ensure to close all connections via Step_destroy() on step termination
- Allow NodeName entry of slurm.conf without additional options
Enhancements:
- Let (pmi[x])spawn fail early if resources are insufficient
- Add psslurm configuration option SRUN_BINARY used for spawning processes
Version 5.1.48:
===============
Bugfixes:
- Fix rank label output for pack jobs (jrt:13143)
- Fix ldom mask|map pinning with uncommon CPU maps in psslurm
- Register to slurmctld only after health-check at startup phase is finished
- Only release forwarder when all children are gone
- Don't use cpu map with map_cpu and mask_cpu in psslurmgetbind
Enhancements:
- Prevent sending superfluous CHILDDEAD messages
- Combine consecutive calls to sendCHILDRESREL() into one message
- Avoid double bookkeeping of parent/child signals
- Use $PWD over CWD provided in step if both resolve to same directory
- psslurmgetbind: Include --cpumap|-M in --help output
- Re-work messaging in between psid and pluginforwarder avoiding PSLog
Additional changes:
- Push daemon-protocol version to 414
- Get rid of now obsolete psidtimer.h
- Cosmetic fix of memory mask length in output of psslurmgetbind
Version 5.1.47-2:
=================
Bugfixes:
- Fix core credential mapping to nodes (#61)
- Fix forwarding of (ps)PMIx server TID to psid forwarders
Version 5.1.47-1:
=================
Bugfixes:
- Call PSIDHOOK_JAIL_CHILD after hookFWInit() to provide all information
- Test for valid child PID in jail scripts
- Don't expect a terminate script for every jail module
- Pass forwarder's PID to hook PSIDHOOK_JAIL_CHILD instead of UID
Enhancements:
- Call sinfo with SLURM_CONF_SERVER set for Slurm protocol version detection
- Switch off debug output in jail scripts as default
- Introduce hookFWInitUser() to pluginforwarder
*************************************************************************
*************************************************************************
** **
** psmgmt 5.1.47 introduces NIC binding. If psid is configured via **
** psconfig, this has to be reflected in psid's default object in the **
** database. Therefore, on the node hosting psconfig's master db run: **
** **
** psconfig -- set 'defaults:psid' 'BindNics' 'yes' **
** **
*************************************************************************
*************************************************************************
Version 5.1.47:
===============
Bugfixes:
- Fix possible segfault when a step is deleted (jwt:#12795)
- Only use errno value if kill() actually failed
- Check for uid_t consistency in msg_SIGNAL
- Ensure PMIx-server shutdown and add kill timer (steered via SERVER_KILL_WAIT)
Enhancements:
- Add NIC pinning with correct mapping to device names
- Speedup psid's shutdown/reset process
- pspmix: Add option to terminate job on server fail
- pspmix: Drop root privileges in pspmix server (shall help with #43)
- psslurm: BCast forwarder runs as user
- pspmix: Detect PMIx server usage within psid
- Allow to run plugin forwarder as user in a more natural way
- NIC device name fetching via hwloc
- Change nodeinfo's NICSort default from BIOS to PCI (only for new installs)
- Introduce PSCio_recvBufS(): suppress warning on closed psslurm PTY connection
- Make psid depend on munge when using systemd
Additional changes:
- Add SLURM_CONF_SERVER and SLURM_CONF_BACKUP_SERVER to example psslurm.conf
- Move psmunge plugin into psslurm sub-RPM
- Determine systemd's and sysctl's config directory via pkg-config
- Introduce PSIDclient_getNum()
- Introduce PSCfgHelp_getObject() into libputil
- Several improvements on cppcheck and scanbuild warnings
- Tons of IWYU cleanup
Version 5.1.46-4:
=================
Enhancements:
- Execute Slurm health-check when psslurm is loaded
* slurm.conf option "HealthCheckInterval" does *not* influence the
health-check execution on startup
* psslurm.conf option SLURM_HC_STARTUP may prevent execution on startup
Version 5.1.46-3:
=================
Bugfixes:
- Extract Slurm Protocol Version by calling sinfo
- Ensure initSerial() and finalizeSerial() are balanced
Version 5.1.46-2:
=================
Bugfixes:
- Do not exec client process with blocked SIGCHLD
Version 5.1.46-1:
=================
Bugfixes:
- Avoid NULL pointer de-reference (jwc:#12570)
Version 5.1.46:
===============
Bugfixes:
- Ensure psidclient will not crash after PSIDclient_clearMem()
- Fix various memory leaks
- Don't request configuration if config-server was not set
- Fix packing of flags for job info request
- Add sanity checks to execBatchJob()
- psslurmgetbind supports CPU mapping and fix up
- Fix several scanbuild warnings
Enhancements:
- Add support for Slurm protocol 21.08
- Add support for Slurm health-check (#31)
- Add support for REQUEST_REBOOT_NODES (#36)
- Make RPC RESPONSE_CONFIG compatible with Slurm 21.08
- Allow to use Prolog in addition to PrologSlurmctld in slurm.conf
- Remove the need for [prologue|epilogue].parallel
- Set SLURM_CONF to config-cache in configless mode (for external programs)
- Unset SLURM_CONF for jobs and steps in configless mode
- Replace getpwuid() with PSC_getpwuidBuf() in psidspawn
- Add timeout monitor (60 sec/SLURM_HC_TIMEOUT) to Slurm health-check
- Let psslurm parse SlurmdParameters of slurm.conf
- Add RPCs REQUEST_JOB_REQUEUE and REQUEST_KILL_JOB,
REQUEST_LAUNCH_PROLOG and REQUEST_COMPLETE_PROLOG
- Use sendSlurmReq() for RPCs REQUEST_UPDATE_NODE and MESSAGE_EPILOG_COMPLETE
- Set SHOW_LOCAL flag for Req_Job_Info_Single
- Synchronize slurmd prolog and batch job startup
- Cleanup memory in psslurm job/step forwarder
- Improve PSID_exec[script|func](): clear mem, jail forked procs, kill procgroup
- Provide pscommon functions to query user database in plugins, preventing psid bloating
- Avoid detour when deleting a job or step
Additional changes:
- Include SPANK error codes for Slurm protocol 21.08
- Rework the Slurm message unpack code
- Rework slurmctld request/response message handling
- Introduce destroyBCastByJobid() to kill running BCast forwarders
- Introduce Job_destroy() to kill all remaining processes of a job
- Introduce Step_destroy() to kill all remaining processes of a step
- Some refactoring in psslurm and function naming consolidation
- IWYU fixes
Version 5.1.45-4:
=================
Bugfixes:
- Set user's supplementary groups in psidforwarder (j3:#1382)
Version 5.1.45-3:
=================
Bugfixes:
- Ensure to use the correct nodelist for shutting down step forwarders
Version 5.1.45-2:
=================
Bugfixes:
- Don't wait for finalize message for tasks that could not be spawned (#45)
- Ensure heterogeneous steps will shutdown correctly if executable is missing
- Fix possible segmentation fault if psslurm connections get re-used
- Ensure correct termination of error message
Enhancements:
- Adapt visible copyright messages to ParTec AG
- Fix warnings emitted by clang13
Version 5.1.45-1:
=================
Bugfixes:
- Ensure slurm.conf's hash is defined and correct (#34)
- Ensure psidforwarder isn't caught in releaseLogger
- Fix memory leak
Enhancements:
- Expose MPI_* envvars only when PMI or PMIx is to be used
- Make slurm.conf hash and update time accessible via psiadmin
Version 5.1.45:
===============
Bugfixes:
- Ensure PATH resolution ignores non-executable files (jwt:#11381)
- Check for valid protocol version before tweaking fragment (jwt:#11068)
- Prevent inherited __PSI_NO_MEMBIND interfering with memory pinning (#20)
- Ensure psserial's byteorder is never messed up
- Ensure SIGTERM is not unblocked in main daemon
- Ensure forwarder ignores ghost messages
- psgw must unregister all message handlers on unload
- psexec shall not clear psslurm's dropper
- Fix buildsystem when slurm-devel is not installed
- Fix various bugs unveiled by scan-build
Enhancements:
- Trace PMIx initialization state to improve client release (#24)
- Catch down nodes early in psslurm's nodeinfo handling
- Update psroute.py to python3
- Hide psslurm's output of node_iter_next behind PSSLURM_LOG_PART
Additional changes:
- Rework psid's message handling
* Multiple handlers might be registered for a given type
* Handlers called in reverse order of registration (latest registered first)
* Steer further handling by their return value (true => fully handled)
* PSID_clearMsg() now unregisters a specific handler
- Refactor psidforwarder to utilize PSID_handleMsg() and
bump plugin API version to 133
- Cleanup psslurm's nodeinfo interfaces
- Rework PSIDHOOK_FRWRD_CLNT_RLS to support multiple plugins and
bump plugin API version to 134
- Rework PSIDHOOK_FRWRD_EXIT's interface
- Obsolete PSIDHOOK_FRWRD_KVS, PSIDHOOK_FRWRD_SPAWNRES, PSIDHOOK_FRWRD_CC_ERROR
- Introduce PSCio_recvMsgT(), i.e. receive message with timeout
- Introduce PSIDcomm_registerSendMsgFunc()
- Introduce recvDaemonMsg() to replace recvMsg()
- Add list_item macro to gdbinit definitions
Version 5.1.44-2:
=================
Bugfixes:
- Let visspank start without additional parameters