
Out of direct memory when flushing the WAL buffer to disk in the flush method; please confirm this. #11375

Open
shuangquansq opened this issue Jun 6, 2024 · 2 comments

Comments


shuangquansq commented Jun 6, 2024

Hi,
Maybe this code has a bug: a memory leak. A tlb.remove() call needs to be added at line 469 of this file:
org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle#flush(org.apache.ignite.internal.pagemem.wal.record.WALRecord, boolean)

   try {
        // We can safely allow other threads to start building next chains while we are doing flush here.
        ByteBuffer buf;

        boolean tmpBuf = false;

        if (expHead.chainSize() > tlbSize) {
            buf = GridUnsafe.allocateBuffer(expHead.chainSize());

            tmpBuf = true; // We need to manually release this temporary direct buffer.
        }
        else
            buf = tlb.get();

        try {
            long pos = fillBuffer(buf, expHead);

            writeBuffer(pos, buf);
        }
        finally {
            if (tmpBuf)
                GridUnsafe.freeBuffer(buf);
            // Suggested fix: add tlb.remove() here.
        }

        return true;
    }
    catch (Throwable e) {
        StorageException se = e instanceof StorageException ? (StorageException)e :
            new StorageException("Unable to write", new IOException(e));

        cctx.kernalContext().failure().process(new FailureContext(CRITICAL_ERROR, se));

        // All workers waiting for a next segment must be woken up and stopped
        signalNextAvailable();

        throw se;
    }
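To illustrate the suspected leak pattern: Ignite caches one direct buffer per thread in a ThreadLocal (tlb), and that buffer is only released when the owning thread dies or remove() is called on that thread. A minimal sketch of the mechanism, using a plain ThreadLocal<ByteBuffer> as a hypothetical stand-in for tlb (not the actual Ignite code):

```java
import java.nio.ByteBuffer;

public class TlbDemo {
    // Hypothetical stand-in for Ignite's per-thread WAL buffer cache (tlb).
    static final ThreadLocal<ByteBuffer> TLB =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(128 * 1024));

    public static void main(String[] args) {
        ByteBuffer first = TLB.get();

        // The thread receives the same cached direct buffer on every get()...
        System.out.println(first == TLB.get()); // true

        // ...until remove() drops the thread's entry; the next get() allocates
        // a new direct buffer, and the old one becomes eligible for GC.
        TLB.remove();
        System.out.println(first == TLB.get()); // false
    }
}
```

Note that calling remove() after every flush would defeat the purpose of the per-thread cache, since each flush would then pay for a fresh 128 KB direct allocation; whether that trade-off is acceptable is exactly what this issue asks the maintainers to confirm.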

We suggest this because we often encounter OutOfMemoryError: Cannot reserve 131072 bytes...... The detailed error is below:

2024-06-01T00:20:07,454Z+0000|WARN|sys-stripe-2-#3|Log4J2Logger#warning:523|Still waiting for a concurrent write to complete [written=165737, pos=1448132, lastFsyncPos=165737, stop=false, actualPos=165737]
2024-06-01T00:20:08,259Z+0000|INFO|qtp1471464273-162|SharedClient#testConnection:41|user is added [[42]]
2024-06-01T00:20:08,964Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.StorageException: Unable to write]]
org.apache.ignite.internal.processors.cache.persistence.StorageException: Unable to write
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:475) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.addRecord(FsyncFileWriteHandle.java:267) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:968) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager.log(FileWriteAheadLogManager.java:921) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.beforeReleaseWrite(PageMemoryImpl.java:1867) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1698) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:523) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:515) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:416) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:276) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.writeUnlock(DataStructure.java:247) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.put(PagesList.java:933) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:199) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList$PutBucket.run(PagesList.java:173) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:313) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:304) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.flushBucketsCache(PagesList.java:415) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.saveMetadata(PagesList.java:356) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.saveStoreMetadata(GridCacheOffheapManager.java:322) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.lambda$syncMetadata$1(GridCacheOffheapManager.java:299) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.util.IgniteUtils.lambda$wrapIgniteFuture$5(IgniteUtils.java:11930) ~[ignite-core-2.16.0.jar:2.16.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.io.IOException: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
	... 24 more
Caused by: java.lang.OutOfMemoryError: Cannot reserve 131072 bytes of direct buffer memory (allocated: 3221191584, limit: 3221225472)
	at java.nio.Bits.reserveMemory(Unknown Source) ~[?:?]
	at java.nio.DirectByteBuffer.<init>(Unknown Source) ~[?:?]
	at java.nio.ByteBuffer.allocateDirect(Unknown Source) ~[?:?]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:133) ~[ignite-core-2.16.0.jar:2.16.0]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle$1.initialValue(FsyncFileWriteHandle.java:131) ~[ignite-core-2.16.0.jar:2.16.0]
	at java.lang.ThreadLocal.setInitialValue(Unknown Source) ~[?:?]
	at java.lang.ThreadLocal.get(Unknown Source) ~[?:?]
	at org.apache.ignite.internal.processors.cache.persistence.wal.filehandle.FsyncFileWriteHandle.flush(FsyncFileWriteHandle.java:460) ~[ignite-core-2.16.0.jar:2.16.0]
	... 23 more
2024-06-01T00:20:08,965Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|No deadlocked threads detected.
2024-06-01T00:20:09,918Z+0000|ERROR|ttl-cleanup-worker-#54|Log4J2Logger#error:533|Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [workerName=sys-stripe-7, threadName=sys-stripe-7-#8, blockedFor=11s]
2024-06-01T00:20:09,918Z+0000|WARN|jvm-pause-detector-worker|Log4J2Logger#warning:523|Possible too long JVM pause: 923 milliseconds.
2024-06-01T00:20:09,937Z+0000|ERROR|checkpoint-runner-cpu-#76|Log4J2Logger#error:533|Thread dump at 2024/06/01 00:20:09 UTC
Thread [name="Reference Handler", id=2, state=RUNNABLE, blockCnt=6, waitCnt=0]
        at [email protected]/java.lang.ref.Reference.waitForReferencePendingList(Native Method)
        at [email protected]/java.lang.ref.Reference.processPendingReferences(Unknown Source)
        at [email protected]/java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)

Thread [name="Finalizer", id=3, state=WAITING, blockCnt=6635, waitCnt=6636]
    Lock [object=java.lang.ref.ReferenceQueue$Lock@2967a826, ownerName=null, ownerId=-1]
        at [email protected]/java.lang.Object.wait(Native Method)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)

Thread [name="Signal Dispatcher", id=4, state=RUNNABLE, blockCnt=0, waitCnt=0]

Thread [name="Common-Cleaner", id=11, state=TIMED_WAITING, blockCnt=276, waitCnt=5872]
    Lock [object=java.lang.ref.ReferenceQueue$Lock@76e66cc0, ownerName=null, ownerId=-1]
        at [email protected]/java.lang.Object.wait(Native Method)
        at [email protected]/java.lang.ref.ReferenceQueue.remove(Unknown Source)
        at [email protected]/jdk.internal.ref.CleanerImpl.run(Unknown Source)
        at [email protected]/java.lang.Thread.run(Unknown Source)
        at [email protected]/jdk.internal.misc.InnocuousThread.run(Unknown Source)

Thread [name="Notification Thread", id=12, state=RUNNABLE, blockCnt=0, waitCnt=0]

Thread [name="Apollo-RemoteConfigRepository-1", id=18, state=TIMED_WAITING, blockCnt=0, waitCnt=1124]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2d9b9311, ownerName=null, ownerId=-1]
        at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
        at [email protected]/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
        at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
        at [email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at [email protected]/java.lang.Thread.run(Unknown Source)
......
sk0x50 (Contributor) commented Jun 17, 2024

Hello @shuangquansq ,

I can't tell if this is a mistake or not. However, the common recommendation is:

If you use Ignite native persistence, we recommend that you set the MaxDirectMemorySize JVM parameter to walSegmentSize * 4. With the default WAL settings, this value is equal to 256MB.

https://ignite.apache.org/docs/latest/perf-and-troubleshooting/memory-tuning#generic-gc-settings

Could you please share what value is used for walSegmentSize in your case?
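For reference, the arithmetic behind that recommendation can be spelled out; the helper below is purely illustrative (not an Ignite API), showing that with the reporter's 64 MB segments the documented rule of thumb yields 256 MB:

```java
public class DirectMemSizing {
    // Recommended -XX:MaxDirectMemorySize per the Ignite tuning docs:
    // walSegmentSize * 4 (hypothetical helper, not part of Ignite).
    static long recommendedDirectMemory(long walSegmentSize) {
        return walSegmentSize * 4;
    }

    public static void main(String[] args) {
        long walSegmentSize = 64L * 1024 * 1024; // reporter's value: 64 MB
        long recommended = recommendedDirectMemory(walSegmentSize);
        System.out.println(recommended / (1024 * 1024) + "MB"); // 256MB
    }
}
```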

shuangquansq (Author) commented:

Our walSegmentSize is 64MB.
defaultDataRegionConfiguration: 1.2G
-XX:MaxDirectMemorySize=1.8G
system region: 100M
checkpoint: 256M

I tried adjusting MaxDirectMemorySize to 3G, but the issue hasn't been solved.
