Technology scaling, smaller feature sizes, and reduced supply voltages inherently lower the energy threshold needed to cause soft errors. Soft errors are therefore set to become one of the most crucial concerns in the design of next-generation high-performance processors. Simultaneous Multi-Threading (SMT) architectures are of interest primarily for performance reasons, but they are also well suited to an integrated fault-tolerant design, since the very concept of SMT provides the redundancy needed to detect and recover from faults. Based on this observation, recent proposals run two copies of the same thread on an SMT platform in order to detect and correct soft errors. Upon detection of an error, the processor state can be rolled back to a known safe point and the instructions retried, yielding completely error-free execution. This paper focuses on two crucial implementation issues raised by this generic approach: (i) the performance overhead introduced by the fault-tolerant features; and (ii) the possible occurrence of deadlock. To lower the performance overhead inherent to this duplicated execution mode, we first present two new design strategies: (i) applying SMT thread scheduling at the instruction dispatch stage rather than at the fetch stage; and (ii) copying the fetched instructions to generate the redundant thread instead of fetching two identical threads from the instruction cache. Compared to the baseline processor (single-thread execution mode), our simulation results show that these two schemes reduce the performance overhead from 30% to 23% on average. To further improve performance, we adopt from prior work the concept of the Load Value Queue (LVQ), which lets the leading thread prefetch data for the trailing thread.
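The LVQ idea can be sketched as follows. This is a minimal conceptual model (the class and method names are ours, not from the paper or any real simulator): the leading thread performs each load once and forwards the (address, value) pair through a FIFO; the trailing thread consumes the queued value instead of accessing memory, and an address mismatch between the two threads signals a possible soft error that would trigger a rollback.

```python
from collections import deque

class LoadValueQueue:
    """Conceptual sketch of a Load Value Queue (our simplification):
    the leading thread performs each load and enqueues the result;
    the trailing thread pops the value instead of re-accessing memory."""

    def __init__(self):
        self.q = deque()

    def leading_load(self, memory, addr):
        value = memory[addr]          # real memory access, leading thread only
        self.q.append((addr, value))  # forward (addr, value) to the trailing thread
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        # A mismatch means the two redundant threads diverged, i.e. a
        # possible soft error: the processor would roll back and retry.
        if lead_addr != addr:
            raise RuntimeError("address mismatch: possible soft error, rollback")
        return value
```

Besides avoiding a second cache access, this structure guarantees that both threads observe the same loaded value, so a divergence can only stem from a fault rather than from a racing store.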
Adding an LVQ to our design further lowers the performance overhead to only 11% on average. Finally, because the two copied threads cooperate with each other in the fault-tolerant execution mode, deadlock situations could be quite common. We therefore present a detailed deadlock analysis and conclude that fairly scheduling the threads at the dispatch stage prevents such deadlocks.
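To illustrate why fair scheduling at the dispatch stage matters, the following toy simulation (our own simplification, not the paper's simulator or analysis) models a shared dispatch window in which a trailing instruction can retire only after its matching leading instruction has retired and forwarded its result. A policy that greedily dispatches the trailing thread fills the window with instructions that can never retire, producing deadlock, whereas an alternating policy always makes progress.

```python
CAPACITY = 4  # shared dispatch-window size (arbitrary for this sketch)

def simulate(policy, n_instr=8, max_cycles=100):
    """Run a toy two-thread model. `policy(lead_next, trail_next)` returns
    'L' or 'T' to pick which thread fills the next free window slot.
    Returns the number of retired instructions, or 'deadlock'."""
    window = []                    # entries: ('L', i) or ('T', i)
    lead_next = trail_next = 0
    lead_done = set()
    retired = 0
    for _ in range(max_cycles):
        # Retire: leading instructions retire freely; a trailing
        # instruction needs its leading counterpart's result.
        remaining = []
        for tag, i in window:
            if tag == 'L':
                lead_done.add(i)
                retired += 1
            elif i in lead_done:
                retired += 1
            else:
                remaining.append((tag, i))
        window = remaining
        # Dispatch into free window slots according to the policy.
        while len(window) < CAPACITY:
            src = policy(lead_next, trail_next)
            if src == 'L' and lead_next < n_instr:
                window.append(('L', lead_next))
                lead_next += 1
            elif src == 'T' and trail_next < n_instr:
                window.append(('T', trail_next))
                trail_next += 1
            else:
                break
        if retired == 2 * n_instr:
            return retired
    return 'deadlock'

fair = lambda l, t: 'L' if l <= t else 'T'   # alternate the two threads
greedy_trailing = lambda l, t: 'T'           # always favor the trailing thread
```

With the fair policy the model retires all 16 instructions; with the greedy policy the window fills with trailing instructions whose leading counterparts were never dispatched, and the run ends in deadlock.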