Meta shares how it detects silent data corruptions in its data centres

One of Facebook’s data centres in Prineville, Oregon.


Image: Meta

After years of testing various approaches to detecting silent data corruptions (SDCs), Meta has outlined how it tackles the hardware issue.

SDCs are data errors that do not leave any record or trace in system logs. Sources of SDCs include datapath dependencies, temperature variance, and age, among other silicon factors. Since these data errors are silent, they can stay undetected within workloads and propagate across several services.

These errors can affect memory, storage, networking, and CPUs, causing data loss and corruption.

Meta engineers started testing three years ago because they had difficulty detecting SDCs once components had already gone into its production data centre fleets.

“We [needed] novel detection approaches for preserving application health and fleet resiliency by detecting SDCs and mitigating them at scale,” Meta engineer Harish Dattatraya Dixit said in a blog post.

Based on its tests, Meta found that the best way to detect SDCs is to combine out-of-production testing with ripple testing.

Out-of-production testing is an SDC detection method that runs when machines go through a maintenance event, such as a system reboot, kernel upgrade, or host provisioning. This type of testing piggybacks on these events so that tests can have longer runtimes, enabling a “more intrusive nature of detection”.

Ripple testing, meanwhile, runs silent error detection alongside active workloads. It does this by shadow testing with workloads and by intermittently injecting bit patterns with known expected results into fleets and workloads, which Meta found enabled faster SDC detection than out-of-production testing.
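The idea of injecting bit patterns with expected results can be illustrated with a minimal Python sketch. This is not Meta's implementation: the patterns, the `compute` workload, and the function names are all hypothetical, and the golden results are computed on the same machine here, whereas a real deployment would presumably take them from known-good hardware.

```python
# Hypothetical bit patterns; the article does not describe the patterns Meta uses.
PATTERNS = [
    0x0000000000000000,
    0xFFFFFFFFFFFFFFFF,
    0xAAAAAAAAAAAAAAAA,
    0x5555555555555555,
]
MASK = (1 << 64) - 1  # keep all arithmetic within 64 bits


def compute(pattern: int, rounds: int = 1000) -> int:
    """Deterministic integer workload: the same input must always give the same output."""
    x = pattern
    for i in range(rounds):
        x = (x * 6364136223846793005 + 1442695040888963407 + i) & MASK
        x ^= x >> 29
    return x


def build_golden(patterns):
    """Record the expected result for each pattern (assumed to run on known-good hardware)."""
    return {p: compute(p) for p in patterns}


def ripple_check(golden, run=compute):
    """Re-run each pattern and return those whose result differs from the expected value."""
    return [p for p, expected in golden.items() if run(p) != expected]


if __name__ == "__main__":
    golden = build_golden(PATTERNS)
    mismatches = ripple_check(golden)
    print("suspected silent corruption" if mismatches else "no mismatch detected")
```

Because each check is short, a scheduler could interleave calls to `ripple_check` with normal work; any non-empty result would flag the host for closer inspection.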

This faster type of testing “ripples” through Meta’s infrastructure, with test times that are 1,000 times shorter than out-of-production test runtimes.

Meta engineers observed, however, that ripple testing could detect only 70% of fleet data corruptions, although it did so within 15 days. By comparison, out-of-production testing took six months to detect the same corruptions, along with others.

In explaining these benefits and tradeoffs, Dattatraya Dixit recommended that organisations with large-scale infrastructure use both approaches to detect SDCs.

“We recommend using and deploying both in a large-scale fleet,” Dattatraya Dixit said.

“While detecting SDCs is a challenging problem for large-scale infrastructures, years of testing have shown us that [out-of-production] and ripple testing can provide a novel solution for detecting SDCs at scale as quickly as possible.”

When Meta engineers used both tests, they found all SDCs could eventually be detected. Meta said ripple testing detected 70% of SDCs within 15 days, out-of-production testing caught up to a further 23% within six months, and the remaining 7% were found through repeated ripple instances within its data centre fleets.

To push further innovation in detecting SDCs, Meta has also announced it will provide five grants, each worth around $50,000, for academic research proposals in this field.
