PnetCDF large_coalesce test fails due to incorrect data on read (ROMIO problem) #752

Open
adammoody opened this issue Dec 28, 2022 · 0 comments
The PnetCDF test/largefile/large_coalesce test reads back 0 for bytes that should be non-zero.

*** TESTING C   large_coalesce for skip filetype buftype coalesce  ------ 0 (at line 285): expect buf[1073741814]=97 but got 0
0 (at line 285): expect buf[1073741815]=98 but got 0
0 (at line 285): expect buf[1073741816]=99 but got 0
0 (at line 285): expect buf[1073741817]=100 but got 0
0 (at line 285): expect buf[1073741818]=101 but got 0
0 (at line 285): expect buf[1073741819]=102 but got 0
0 (at line 285): expect buf[1073741820]=103 but got 0
0 (at line 285): expect buf[1073741821]=104 but got 0
0 (at line 285): expect buf[1073741822]=105 but got 0
0 (at line 285): expect buf[1073741823]=106 but got 0
0 (at line 285): expect buf[1073741824]=107 but got 0
0 (at line 285): expect buf[1073741825]=108 but got 0
0 (at line 285): expect buf[1073741826]=109 but got 0
0 (at line 285): expect buf[1073741827]=110 but got 0
0 (at line 285): expect buf[1073741828]=111 but got 0
0 (at line 285): expect buf[1073741829]=112 but got 0
0 (at line 285): expect buf[1073741830]=113 but got 0
0 (at line 285): expect buf[1073741831]=114 but got 0
0 (at line 285): expect buf[1073741832]=115 but got 0
0 (at line 285): expect buf[1073741833]=116 but got 0
0 (at line 293): expect buf[2147483638]=65 but got 0
0 (at line 293): expect buf[2147483639]=66 but got 0
0 (at line 293): expect buf[2147483640]=67 but got 0
0 (at line 293): expect buf[2147483641]=68 but got 0
0 (at line 293): expect buf[2147483642]=69 but got 0
0 (at line 293): expect buf[2147483643]=70 but got 0
0 (at line 293): expect buf[2147483644]=71 but got 0
0 (at line 293): expect buf[2147483645]=72 but got 0
0 (at line 293): expect buf[2147483646]=73 but got 0
0 (at line 293): expect buf[2147483647]=74 but got 0
0 (at line 293): expect buf[2147483648]=75 but got 0
0 (at line 293): expect buf[2147483649]=76 but got 0
0 (at line 293): expect buf[2147483650]=77 but got 0
0 (at line 293): expect buf[2147483651]=78 but got 0
0 (at line 293): expect buf[2147483652]=79 but got 0
0 (at line 293): expect buf[2147483653]=80 but got 0
0 (at line 293): expect buf[2147483654]=81 but got 0
0 (at line 293): expect buf[2147483655]=82 but got 0
0 (at line 293): expect buf[2147483656]=83 but got 0
0 (at line 293): expect buf[2147483657]=84 but got 0

The mismatch is reported near this line:

https://github.com/Parallel-NetCDF/PnetCDF/blob/c7e22c81ac4c2922f84281a4a19f7000079e6c3f/test/largefile/large_coalesce.c#L284
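
One detail worth noting in the output above: every failing index lies within 10 bytes of a 1 GiB (2^30 = 1073741824) or 2 GiB (2^31 = 2147483648) boundary, and the expected values 97..116 and 65..84 are the ASCII codes for 'a'..'t' and 'A'..'T'. The sketch below only checks that arithmetic. The helper name `offset_from_nearest_gib` is hypothetical and is not part of PnetCDF or the test.

```c
#include <stdint.h>

/* 1 GiB expressed as a 64-bit value so the arithmetic cannot overflow. */
#define GIB ((int64_t)1 << 30)

/* Hypothetical helper: signed distance from a non-negative buffer index
 * to the nearest multiple of 1 GiB. Negative means the index lies just
 * before the boundary, positive means at or just after it. */
static int64_t offset_from_nearest_gib(int64_t idx)
{
    int64_t rem = idx % GIB;
    return (rem < GIB / 2) ? rem : rem - GIB;
}
```

Applied to the first and last index of each failing run, this gives -10 and 9, so each run covers 20 bytes straddling a boundary.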

This same test throws a segfault when using Lustre as the file system, so the test failure is not unique to UnifyFS.

Tracing under a debug build of MVAPICH2, the test hits an ADIOI assertion at this line:

https://github.com/pmodels/mpich/blob/5b88f46620607707201768f4b3df39907082f344/src/mpi/romio/adio/common/ad_read_str_naive.c#L311

The value req_len = 2147483126 fails the assertion check req_len == (int) req_len.

The stack trace at this point is:

main
ncmpi_wait_all
ncmpio_wait
req_commit
wait_getput
req_aggregation
mgetput
ncmpi_read_write
PMPI_File_read_at_all
MPIOI_File_read_all
ADIOI_GEN_ReadStridedColl
ADIOI_GEN_ReadStrided
ADIOI_GEN_ReadStrided_naive

Apparently, this "naive" read code path within ROMIO does not support requests larger than 2GB.
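
To illustrate the kind of limit involved, here is a minimal sketch of a round-trip narrowing check in the same form as the assertion quoted above. It assumes the request length is held in a 64-bit signed integer. The function name `fits_in_int` is hypothetical and this is not the ROMIO source.

```c
#include <stdint.h>

/* Hypothetical helper mirroring a check of the form
 * req_len == (int) req_len.
 * Converting an out-of-range 64-bit value to int is
 * implementation-defined in C. On common compilers the value wraps,
 * so the comparison is false for lengths of 2^31 bytes or more. */
static int fits_in_int(int64_t req_len)
{
    return req_len == (int64_t)(int)req_len;
}
```

With a 32-bit int, the largest length that passes such a check is 2^31 - 1 bytes, so any code path guarded this way cannot handle a single request of 2 GiB or more.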

The test runs as a single process, launched with this script:

#!/bin/bash
set -x

nodes=$SLURM_NNODES
procs=$(($nodes * 1))

export UNIFYFS_MARGO_CLIENT_TIMEOUT=70000

export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
touch $UNIFYFS_CONFIGFILE

srun --overlap -n $nodes -N $nodes mkdir /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

export UNIFYFS_CLIENT_LOCAL_EXTENTS=1
export UNIFYFS_CLIENT_WRITE_SYNC=0

export UNIFYFS_LOG_VERBOSITY=1

# test_ncmpi_put_var1_schar executes many small writes,
# so it was necessary to reduce the chunk size to avoid exhausting space
export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 4096)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 1024 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

export UNIFYFS_CLIENT_SUPER_MAGIC=0

installdir="/path/to/unifyfs.git/install"
export LD_LIBRARY_PATH="${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH"

# turn off Darshan profiling
export DARSHAN_DISABLE=1

# sleep for some time after unlink
# see https://github.com/LLNL/UnifyFS/issues/744
export UNIFYFS_CLIENT_UNLINK_USECS=1000000

export LD_PRELOAD="${installdir}/lib/libunifyfs_mpi_gotcha.so"

filename="/unifyfs/testfile.nc"

export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 8192 \* 1048576)
cd test/largefile
./large_coalesce $filename
@adammoody adammoody changed the title PnetCDF largefile PnetCDF largefile/large_coalesce fails after detecting invalid data during a read Dec 28, 2022
@adammoody adammoody changed the title PnetCDF largefile/large_coalesce fails after detecting invalid data during a read PnetCDF large_coalesce test fails due to incorrect data on read (ROMIO problem) Dec 28, 2022