Last update: 2016-11-30

Introduction

Sorry about the outrageous title of this article.  It really should be “Consumer-grade Solid State Drives (SSD) are less reliable than Hard Disk Drives (HDD) under typical computer usage”.  (I can already foresee people attacking this article or accusing me of trolling, but I hope the information here helps others with their own research, and perhaps prompts more consideration before deciding to have an SSD as the only drive in a new high-performance PC.  I know that many people will disagree with me, and for the record I do know that there are people who have been using SSDs for several years without a single problem.)

When SSDs were first introduced to the market a few years ago, I thought this type of product would be really great: fast and very reliable, since it has no moving parts like an HDD does.  But when I researched choosing the best SSD in terms of reliability, performance and price, I came across too many posts about SSDs dying or not being detected.  In one extreme case, someone bought 8 SSDs of various brands and they all died within 2 years.  So I decided to write this article to share what I’ve found.  I know some will dismiss the information presented here as outdated and claim that SSDs are a lot more reliable nowadays (which is true to a certain extent), but I still see many reports of SSD issues in various forums.  Others may point out that HDDs suffer the same fate, as many new HDDs also turn bad within months of purchase – although this is true, the point of this article is that consumer-grade SSDs are not really more reliable than HDDs.  To be fair, I can say that both HDDs and SSDs are unreliable.  😉

In late 2015 the cheapest SSD already matched the cheapest HDD in device price (not cost per GB).  Since an SSD solves the single most important performance bottleneck in most PCs, SSDs can be considered a standard item in a PC.

I’m not advocating against using SSDs.  The point of this article is not to avoid SSDs at all costs, but to understand the higher risks associated with them.  The following precautions are suggested:

  • Use SSD for boot drive, and HDD as data drive
  • Backup (if the data size is small, prepare copies in both the SSD and HDD)
  • Keep installable OS media handy in case your SSD boot drive suddenly fails

In 2016, Schroeder et al. published the results of a large-scale field study of drives in Google’s data centers.  They concluded that “comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors.”  Meaning: SSD users will have a higher chance of data loss.  Even though there are some positive results to be drawn from this study, keep in mind that the drives in it are not the cheapest consumer-grade drives people typically buy for home use.  In particular, with the industry-wide adoption of TLC (which is not in the study), things will be very different for consumer-grade SSDs.

Ultra Cheap SSD

In 2016, two trends have emerged.  The first is that all vendors are switching to TLC for their consumer products.  The second is more worrisome – the rise of ultra-cheap SSDs manufactured with the lowest possible cost in mind, using inferior-grade NAND chips and the cheapest components, sometimes even eliminating the DRAM cache.  To understand why this is such a great threat, one needs to understand that in NAND manufacturing it is simply not possible for all output to be perfect, so the good chips are marked with the brand of the manufacturer, while the bad-but-not-unusable chips are sold at a lower price to someone else, probably to be resold under a different brand.  (In Chinese these are sometimes described as 白片, literally “white chips”.)  The use of inferior NAND chips seals the fate of the ultra-cheap SSD – they come with a significantly higher probability of causing trouble (such as data loss) than a quality SSD from a reputable manufacturer.

The DRAM cache is a critical component that reduces writes to NAND, thereby increasing the drive’s lifetime, and also improves performance.  Eliminating the cache hurts performance and increases writes to NAND – coupled with inferior-grade NAND, the threat becomes even greater.  (Some models eliminate the DRAM cache but rely on an SLC cache instead – I don’t think this is ideal.)  Since typical consumers often buy the lowest-priced component they can find, I foresee many people having SSD troubles due to these low-quality drives.
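To see how a write buffer reduces NAND wear, here is a toy model – not any real controller’s algorithm, and the 16 KiB page and 512 KiB buffer sizes are made-up round numbers – that counts how many NAND page programs a stream of small writes costs with and without coalescing:

```python
# Toy model (not any real controller's algorithm): count NAND page programs
# for a stream of small 4 KiB writes, with and without a coalescing buffer.
# Page size and buffer size below are assumed, illustrative numbers.

PAGE_KIB = 16          # assumed NAND page size
BUFFER_KIB = 512       # assumed DRAM write-buffer capacity

def page_programs(write_sizes_kib, buffer_kib):
    """Return the number of NAND page programs needed for the write stream."""
    programs = 0
    buffered = 0
    for size in write_sizes_kib:
        if buffer_kib == 0:
            # No cache: every write, however small, programs at least one page.
            programs += max(1, -(-size // PAGE_KIB))  # ceiling division
        else:
            buffered += size
            if buffered >= buffer_kib:
                # Flush the coalesced buffer as full pages.
                programs += buffered // PAGE_KIB
                buffered %= PAGE_KIB
    if buffered:
        programs += -(-buffered // PAGE_KIB)  # flush the remainder
    return programs

writes = [4] * 1000  # one thousand 4 KiB writes
print(page_programs(writes, 0), page_programs(writes, BUFFER_KIB))  # -> 1000 250
```

In this (simplified) model the cache cuts page programs by 4x for small writes; real gains depend on the workload and the controller’s flash translation layer.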

Some of the DRAM-less SSD include:

  • WD Green SSD
  • SanDisk SSD Plus / Z400s / Z410
  • Transcend SSD360S
  • OCZ TL100

SSD Endurance and Write Speed Slowdown / Degradation

Computer enthusiasts already know that flash and SSDs have a limited lifetime in the form of write cycles, so naturally there are worries about SSD endurance.  However, for most people, any of the various SSD issues described in this article can show up before one can use up the endurance of drives using good MLC (or even SLC) NAND chips.  For real-world test results, read this Xtreme Systems thread and this Tech Report SSD Endurance Experiment.  TLC, unfortunately, has even lower endurance.
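A back-of-envelope calculation shows why endurance rarely runs out first for typical desktop use.  The P/E cycle counts and write-amplification factor below are illustrative assumptions, not any vendor’s specification:

```python
# Back-of-envelope endurance estimate; all numbers are illustrative, not specs.
def years_to_wear_out(capacity_gb, pe_cycles, gb_written_per_day,
                      write_amplification=2.0):
    """Estimate years until the rated program/erase cycles are consumed."""
    total_writable_gb = capacity_gb * pe_cycles / write_amplification
    return total_writable_gb / gb_written_per_day / 365

# A 250 GB drive with 20 GB of host writes per day:
mlc = years_to_wear_out(250, 3000, 20)   # ~3000 P/E often cited for MLC
tlc = years_to_wear_out(250, 1000, 20)   # planar TLC is commonly rated lower
print(round(mlc), round(tlc))  # -> 51 17
```

Even the pessimistic TLC figure outlasts the typical PC, which is exactly why the other failure modes in this article matter more than raw endurance.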

A related problem is that some manufacturers opt to put the SSD into a degraded mode that makes it perform much slower once a write limit is reached.  To make matters worse, users generally have no way to reset it – not even a secure erase or letting the drive idle for days will recover the lost performance.  Although not restricted to TLC SSDs, such a design is probably common among them due to the inherently low endurance of TLC.

TLC SSD Read Speed Slowdown / Degradation

It is common knowledge that SSDs slow down over time, so many reviews attempt to test steady-state performance after stressing the drives.  However, with TLC there is an additional, much more serious and non-obvious slowdown.  In particular, the Samsung 840 EVO is notorious for slowing down 10x when reading old data.  Firmware updates could not solve it completely.  The way Samsung deals with the 840 EVO slowdown is to have the SSD rewrite the old data – which means the already low TLC endurance is further stressed by invisible background rewriting.

This type of slowdown occurs with old data – not newly written data.  Since benchmark tests typically write new data to an SSD, this problem is not detectable by the usual benchmarks and is therefore not discussed in SSD review articles.  The proper benchmark for detecting the TLC old-data read speed slowdown is SSD Read Speed Tester.
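The idea behind such a test is simple: read files that already exist on the drive, measure throughput, and group the results by file age.  Here is a minimal sketch of that idea – it is not the actual tool’s algorithm, and the directory layout and helpers are my own:

```python
# Minimal sketch of an old-data read-speed test: sequentially read existing
# files, measure MB/s, and bucket the results by file age (mtime).
import os
import time

def read_speed_mb_s(path, chunk=1 << 20):
    """Sequentially read a file and return throughput in MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    elapsed = time.perf_counter() - start
    return size / 1e6 / max(elapsed, 1e-9)

def speeds_by_age(root):
    """Map file age in whole days -> list of measured read speeds."""
    now = time.time()
    results = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.path.getmtime(path)) / 86400)
            results.setdefault(age_days, []).append(read_speed_mb_s(path))
    return results
```

A real measurement would also have to bypass the OS page cache (e.g. unbuffered/direct I/O), otherwise recently read files appear artificially fast; if old files read markedly slower than new ones, the drive likely has this problem.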

Unfortunately, this problem is not limited to the 840 EVO.  Some people believe that the Samsung 850 EVO does not suffer from the old-data read slowdown, but there is a report indicating that the 850 EVO 1TB model does slow down just like the 840 EVO did.

Similar slowdowns were found to occur with the Crucial BX200, ADATA SP550, and in general with “SM2256 paired with 16nm TLC from either Micron or SK Hynix”.  Note: the Intel 540s uses the SM2258 with SK Hynix 16nm TLC NAND.

With the switch to TLC by all vendors, the performance advantage of SSDs is negated by slow TLC in some scenarios even if we ignore the old-data case – especially when the cache fills up.  While 3D TLC performs reasonably well, the cheap planar TLC found in low-cost drives does not.  Some of them have an SLC cache, but once the cache fills up, the SSD slows to HDD-comparable speeds.  If the reason for buying an SSD is speed, it does not make much sense to pay a price premium for a slow-performing one (the cheapest SSD is still more expensive than an HDD on a price-per-GB basis).
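The cache-overflow effect is easy to quantify.  The speeds and cache size below are made-up but plausible round numbers for a planar TLC drive, chosen only to illustrate the arithmetic:

```python
# Illustrative arithmetic (made-up speeds): average write speed of a TLC
# drive with an SLC cache, once a large transfer overflows the cache.
def avg_write_speed(total_gb, cache_gb, cache_mb_s, native_mb_s):
    """Average MB/s over a transfer: fast while cached, slow afterwards."""
    cached = min(total_gb, cache_gb)
    rest = total_gb - cached
    seconds = cached * 1000 / cache_mb_s + rest * 1000 / native_mb_s
    return total_gb * 1000 / seconds

# A 30 GB transfer with a 6 GB SLC cache at 450 MB/s, 60 MB/s native TLC:
print(round(avg_write_speed(30, 6, 450, 60)))  # -> 73
```

Under these assumed numbers the average works out to roughly 73 MB/s – firmly in HDD territory, despite the headline 450 MB/s figure a review might quote for short bursts.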

Buy an MLC SSD instead of a planar TLC SSD – if you can still find one.

SSD Firmware Bugs

[Since 2015 there has been far fewer critical SSD firmware issues that I know of.]

There have been many critical bugs across various brands of SSD.  At a time when the SandForce controller was the best-performing consumer-grade SSD controller, many brands of SSD used it.  Naturally, SandForce controller bugs were inherited by all the drives that used it.  It took a while for SandForce to fix its infamous Blue Screen of Death (BSOD) bug.  (If any reader would like to argue that having firmware bugs does not imply SSDs are unreliable, my response is: can you tell your boss, who is experiencing daily BSODs with the PC you built for him, that the PC is actually reliable?)  Even some models of Intel SSDs were affected by SandForce controller bugs.

There are other BSOD bugs, not just with SandForce.  As another example, the Crucial m4, which uses a Marvell controller, had a BSOD bug that appeared after 5184 hours of use.  Although this is specific to one series of one brand, it really matters because, at one time, this particular series was widely regarded as the best buy in SSDs thanks to its high performance at a reasonable price.

BSOD is bad enough, but there were even more serious bugs.  For example, a certain firmware upgrade for the Intel X25-M could actually brick it.  There are other similar incidents of firmware upgrades bricking SSDs as well.

Apple issued a recall for some MacBook Air 2012 models with 64GB or 128GB SSD.  Earlier Corsair recalled Force 3 drives.  Intel and Kingston also recalled early models of their SSD drives.

Although it was reported that the Samsung-specific SSD-killing Secure Erase bug was fixed, there are still a few reports of issues with it.  A firmware update also bricked some users’ Samsung 850 Pro.

While I like manufacturers releasing regular firmware updates, one should always exercise caution (i.e. back up first) when upgrading.  There is a report of a Transcend 370S (SM2246EN controller) user losing the drive’s data after a firmware upgrade.

SSD Freeze Problems

Some SSDs have issues that are prone to freezing the system for several seconds (e.g. the Plextor SSD freeze problem, reported in Chinese).  This is a particularly hard issue to resolve, since different people using the same firmware on the same SSD get different results.  If you happen to be experiencing this problem, see if a firmware update that fixes it is available.  Also read why some of these freeze problems are related to Link Power Management and how to deal with it – for Intel chipset users (in Chinese), for AMD chipset users (in Chinese), or in English.

SSD Sleep Problems

For a variety of technical reasons, sleep/hibernate is bad for SSDs.  In the worst case, sleep/wake operations may cause a BSOD, especially with SandForce controllers (again).  Even without a SandForce controller, the Crucial m4 needed firmware update 070H to resolve a hang issue that “would typically occur during power-up or resume from Sleep or Hibernate.”  So it’s usually a good idea to set the SSD to Never Sleep, disable Windows hybrid sleep, disable hibernation, and set the BIOS to use S1 instead of S3.

(By the way, even CPU sleep states affect SSD performance.  On some (but not all) system configurations, disabling C1E, C-States and/or EIST in the BIOS yields higher SSD benchmark scores.)

SSD Not Detected Problems

For unknown reasons, there are many reports of the system failing to detect the SSD, across pretty much all brands, but some Crucial models, such as the MX100, are especially prone to this symptom.  Google “MX100 disappear”, or “镁光 掉盘” (“Micron drive drop-out”) if you can read Chinese.  If you’re suffering from this problem, try these:

  • Turn on SATA Hotplug in BIOS
  • Change Windows Power Plan to High Performance
  • Set SSD to Never Sleep, and disable Windows hibernate and hybrid sleep
  • Disable Link Power Management if you use Intel RST
  • Enable HIPM+DIPM for SSD power option
  • Upgrade SSD firmware
  • Uninstall Intel RST altogether and use MSAHCI driver instead
  • Try a different SATA cable

This problem is so prevalent that there are both an official procedure and an unofficial procedure for recovering Crucial SSDs from it.  Crucial attributes it to sudden power loss, but some people simply shut down their computer normally and then hit this problem on the next power-up.

[Note: It looks like beginning with the MX200 and BX100, Crucial has managed to decrease the probability of this issue.  The company I work for has 6 or 7 MX200s and has had no problems so far.]

Some people advocate logging off instead of shutting down when you finish using the PC, to let the SSD perform its garbage collection (GC).  Since the SSD-not-detected problems occur at boot time, I figured that to eliminate the chance entirely, perhaps SSD users should forget about saving the environment and just run their PCs 24×7, never sleeping or shutting down, with a UPS attached – just kidding  😉

SSD Power Fault Problems

There is a 2013 paper called Understanding the Robustness of SSDs under Power Fault, which found that SSDs can suffer data loss or even become bricked after a power fault.

This is not just an academic study.  It is very real: the Intel SSD 320 series had an infamous “8MB bug” – the drive suddenly reports a capacity of 8MB after a power loss.  Unfortunately, the fix destroys all data.

Even if one uses an uninterruptible power supply (UPS), in the event of a computer hang, there are times when one needs to force power down the system.

SSD Sudden Death Problems

SSDs tend to die abruptly, usually with no warning from SMART and certainly with no click-of-death warning.  Although HDDs can also die abruptly, they have several other common failure modes as well, so even for the same total number of SSD and HDD failures, abrupt deaths must account for a much higher proportion on the SSD side.  In the past, some articles described SSDs as becoming read-only when they die, so your data would be safe.  Reading real-life reports quickly disproves this statement.  (It might still be true at the flash-chip level – I don’t know – but it certainly is not true from the user’s point of view.)
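Even though SMART often gives no warning before an SSD dies, it still costs nothing to watch a few attributes.  The sketch below parses the attribute table printed by `smartctl -A` (from smartmontools); the sample output is fabricated for illustration, and the attribute shortlist is my own choice, not an authoritative set:

```python
# Hedged sketch: scan a `smartctl -A` attribute table for values worth a
# closer look. The sample text below is fabricated, and the WATCH set is an
# illustrative choice, not an authoritative list.
WATCH = {"Reallocated_Sector_Ct", "Wear_Leveling_Count",
         "Media_Wearout_Indicator"}

def warnings_from_smartctl(text):
    """Return (attribute, value) pairs that deserve attention."""
    flagged = []
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(parts) >= 10 and parts[0].isdigit():
            name, raw = parts[1], parts[9]
            if name == "Reallocated_Sector_Ct" and raw.isdigit() and int(raw) > 0:
                flagged.append((name, int(raw)))  # any reallocation is notable
            elif name in WATCH and parts[3].isdigit() and int(parts[3]) <= int(parts[5]):
                flagged.append((name, parts[3]))  # normalized value at/below threshold
    return flagged

sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       12
177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       152
"""
print(warnings_from_smartctl(sample))  # -> [('Reallocated_Sector_Ct', 12)]
```

A growing reallocated-sector count will not predict every sudden death, but it is one of the few advance signals SSDs ever give, so it is worth checking periodically alongside proper backups.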

Dying abruptly is a problem, but if very few users experienced it we should not be concerned.  However, from user feedback it seems to me that the number of users suffering sudden SSD deaths within 2 years of purchase is not “very few”.

SSD Data Loss if Left Without Power

This sounds unbelievable, but it is a fact that an SSD gradually loses data if left without power.  The problem worsens as temperature increases.

Dell Q&A: “Q: I have unplugged my SSD drive and put it into storage. How long can I expect the drive to retain my data without needing to plug the drive back in?
A: It depends on how much the flash has been used (P/E cycles used), the type of flash, and the storage temperature. In MLC and SLC, this can be as low as 3 months and best case can be more than 10 years. The retention is highly dependent on temperature and workload.”  See http://www.dell.com/downloads/global/products/pvaul/en/solid-state-drive-faq-us.pdf#page6 (credit: @timleavy)

SSD in RAID

There is a blog discussing the risk of using SSDs in RAID, primarily the danger of a rebuild when all the SSDs in a RAID group are about to die.  Although it sounds theoretical – in reality the flash chips may die at different times rather than roughly simultaneously – one of the comments indicated it really happened.
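A toy model shows why the worry is not purely theoretical.  In a mirrored pair of identical SSDs written in lockstep, both drives consume endurance at the same rate, so the survivor’s remaining life when the first drive dies is just the spread between the two drives’ endurance limits.  All distributions and numbers below are my own assumptions:

```python
# Toy model (assumptions are mine): two identical SSDs in RAID-1 wear out in
# lockstep; how much life is left on the survivor when the first one dies?
import random

random.seed(1)  # deterministic for reproducibility

def remaining_life_fraction(spread):
    """Draw two endurance limits from a band around 1.0 (normalized writes);
    return the fraction of life left on the survivor when the first dies."""
    a = random.uniform(1 - spread, 1 + spread)
    b = random.uniform(1 - spread, 1 + spread)
    first, second = min(a, b), max(a, b)
    return (second - first) / second

# Tightly matched drives (5% spread) vs. loosely matched drives (50% spread):
narrow = sum(remaining_life_fraction(0.05) for _ in range(10000)) / 10000
wide = sum(remaining_life_fraction(0.50) for _ in range(10000)) / 10000
print(round(narrow, 3), round(wide, 3))
```

With tightly matched drives the survivor has only a few percent of its life left on average – exactly when the rebuild subjects it to a heavy burst of reads and writes.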

Contradictory best practices for configuring Windows for SSD

There are many guides that teach people how to optimize Windows for an SSD.  However, not everyone agrees with some of those recommendations.  In particular, while many people suggest moving the pagefile to an HDD or simply disabling it (advice I consider as outrageous as the title of this article), this contradicts Microsoft’s Q&A for SSD drives.  Although I generally trust Microsoft technical publications, this guide found the Q&A claim about automatic disabling of Superfetch not to be true.  These discrepancies make things confusing.

Good Power Supply of 5V Required by SSD

The importance of a good power supply unit (PSU) is well known to PC users with graphics cards.  It is not immediately obvious that a good PSU is also critical for SSDs, because their specs usually claim only a few watts – nothing compared to any decent graphics card, or the CPU itself.  However, one thing makes SSD power consumption totally different: the CPU and GPU draw from the PSU’s 12V output, while a 2.5″ SSD draws from the 5V output.  A decent PSU may be well designed to handle the high current draw of the CPU and GPU, with less attention paid to 5V.  Besides, peak current may reach 1A at 5V despite the really low power consumption in normal operation.  So a system that can handle a stressed, overclocked CPU plus GPU is not automatically capable of supporting an SSD equally well.  This is discussed in Jim Handy’s article, and may have been the cause of one user’s 100% SSD failure rate.

SSD requirement for the PSU: a minimum of 1A of clean 5V output per SSD.
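A quick budget check makes the rule concrete.  The 1A-per-SSD peak comes from the guideline above; the 0.7A figure for an HDD’s 5V electronics is my own illustrative assumption, so substitute your drives’ actual label ratings:

```python
# Simple 5 V rail budget check. The SSD peak (1 A) follows the guideline in
# the text; the HDD 5 V figure (0.7 A) is an illustrative assumption --
# check the labels on your actual drives.
def rail_headroom_amps(rail_amps, ssd_count, hdd_count,
                       ssd_peak_a=1.0, hdd_5v_a=0.7):
    """Remaining 5 V capacity (amps) after worst-case drive draw."""
    return rail_amps - (ssd_count * ssd_peak_a + hdd_count * hdd_5v_a)

# A PSU rated 20 A on the 5 V rail, with 2 SSDs and 3 HDDs:
print(round(rail_headroom_amps(20, 2, 3), 1))  # -> 15.9
```

Plenty of headroom in this example – but very cheap PSUs may rate the 5V rail far lower, and ripple/noise on that rail matters as much as the rated amps.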

Bait and Switch

Of course bait and switch is not limited to SSD – it’s mentioned here just for completeness.

OCZ Vertex 2 – changed from 34nm NAND to slower-performing 25nm NAND

Kingston SSDNow V300 – changed to much slower-performing NAND.  It also renders itself useless after a certain amount of writes due to a “feature” known as Drive Life Protection.

PNY Optima – retail drives use different controller than the ones sent to media for review.

Silicon Power SP60 – this model appears to change its controller continually, so you don’t really know which controller you’re buying until you actually buy one and disassemble it.

ADATA SP920 – in the past the SP920 (silver version) was a good alternative to the M550 because it is a Micron OEM M510 and uses Micron firmware, especially when it was slightly cheaper and the M550 was no longer available.  However, the new black version downgrades the controller and uses worse NAND.  Stay away from it.

SanDisk SSD Plus – changed from MLC (SM2246XT controller) [Z400s-equivalent model] to TLC (SM2256S controller) [Z410-equivalent model].  Strangely, Western media did not report on it – perhaps because it’s meant to be the slowest drive in the lineup (behind the Extreme Pro and Ultra II) and that positioning has not changed.

Linux Kernel TRIM compatibility

An investigation into a potential TRIM bug in Samsung SSD led to the discovery of a potential TRIM bug in Linux kernel.

Some SSDs are not completely compatible with Linux, so the kernel maintains a list of drives and firmware versions that need a special workaround:

        /* devices that don't properly handle queued TRIM commands */
        { "Micron_M500_*",      NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Crucial_CT*M500*",   NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Micron_M5[15]0_*",   "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Crucial_CT*M550*",   "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Crucial_CT*MX100*",  "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "Samsung SSD 8*",     NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },
        { "FCCT*M500*",         NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                        ATA_HORKAGE_ZERO_AFTER_TRIM, },

        /* devices that don't properly handle TRIM commands */
        { "SuperSSpeed S238*",  NULL,   ATA_HORKAGE_NOTRIM, },

Which SSD to buy

Most importantly, stay away from ultra-cheap SSDs.  Stay away from non-leading brands.  Stay away from models that change their controllers and NAND without notice.

Given a choice between MLC and TLC, always choose MLC (assuming the best SLC is out of budget), e.g. the SanDisk Extreme Pro and Crucial MX200.  Between the Crucial MLC MX200 and TLC MX300, the MX200 is distinctly better, as shown in the AnandTech MX300 benchmark (likewise, the MLC BX100 is distinctly better than the TLC BX200).

This benchmark is noteworthy as it also shows planar TLC drives placed at the bottom, i.e. slowest.  (Some may argue that I chose a benchmark that unfairly shows the weakness of TLC – no, I chose it because, for most people, random 4K access is the single most important benchmark affecting general computer responsiveness.  In fact, planar TLC performs poorly in many benchmarks, not just this one.)  Unfortunately, with the MX200 being discontinued, there are not many MLC drives remaining.

Some people like NVMe drives for the fastest performance, but to boot Windows from an NVMe SSD you need a 9-series or newer Intel chipset motherboard, a corresponding BIOS that includes NVMe support, and Windows 8.1 or above.  The Intel SSD 750 is slow to boot because it takes a long time to enumerate; if one does not mind the high price and the slow boot time, there is nothing else to complain about.  Samsung NVMe drives are really fast, but they suffer from thermal throttling more than other brands.

No matter which you choose, remember to look at user reviews on online retailer websites and see how many users (relative to the total number of comments) have had their SSD suddenly die within months.