Discussion: Predictive failures
Don Y
2024-04-15 17:13:02 UTC
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?

I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).

So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Martin Rid
2024-04-15 17:32:41 UTC
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Current and voltages outside of normal operation?

Cheers
--
----Android NewsGroup Reader----
https://piaohong.s3-us-west-2.amazonaws.com/usenet/index.html
Don Y
2024-04-16 02:19:06 UTC
Post by Martin Rid
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Current and voltages outside of normal operation?
I think "outside" is (often) likely indicative of
"something is (already) broken".

But, perhaps TRENDS in either/both can be predictive.

E.g., if a (sub)circuit has always been consuming X (which
is nominal for the design) and, over time, starts to consume
1.1X, is that suggestive that something is in the process of
failing?
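
As a rough sketch of the kind of trend check I mean (the baseline window
and the 10% threshold below are arbitrary placeholders, not tuned values):

from collections import deque

class DriftMonitor:
    """Flag a slow upward drift in a sensed quantity (e.g., supply current)
    relative to a baseline learned while the unit was known-good."""

    def __init__(self, baseline_samples=1000, warn_ratio=1.10):
        self.baseline = deque(maxlen=baseline_samples)  # learned "nominal X"
        self.warn_ratio = warn_ratio                    # 1.1X => advisory

    def learn(self, sample):
        # call during commissioning/burn-in, while behavior is known-good
        self.baseline.append(sample)

    def check(self, sample):
        # True once the reading has crept past the advisory ratio
        if not self.baseline:
            return False
        nominal = sum(self.baseline) / len(self.baseline)
        return sample > nominal * self.warn_ratio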

Note that the goal is not to troubleshoot the particular design
or its components but, rather, act as an early warning that
maintenance may be required (or, that performance may not be
what you are expecting/have become accustomed to).

You can include mechanisms to verify outputs are what you
*intended* them to be (in case the output drivers have shit
the bed).

You can, also, do sanity checks that ensure values are never
what they SHOULDN'T be (this is commonly done within software
products -- if something "can't happen" then noticing that
it IS happening is a sure-fire indication that something
is broken!)

[Limit switches on mechanisms are there to ensure the impossible
is not possible -- like driving a mechanism beyond its extents]

And, where possible, notice second-hand effects of your actions
(e.g., if you switched on a load, you should see an increase
in supplied current).

But, again, these are more helpful in detecting FAILED items.
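
Something like the following is all I mean by a commanded-action /
side-effect check (the callables and the expected delta are placeholders):

def verify_load_switched_on(read_supply_current_amps, switch_on_load,
                            expected_delta_amps, tolerance_amps=0.5):
    # Switch a load on and confirm the supply current rises roughly as
    # expected.  No rise suggests the load (or its driver) has already
    # failed; too large a rise suggests something else is wrong downstream.
    before = read_supply_current_amps()
    switch_on_load()
    after = read_supply_current_amps()
    delta = after - before
    return abs(delta - expected_delta_amps) <= tolerance_amps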
Edward Rawde
2024-04-16 03:33:34 UTC
Post by Don Y
Post by Martin Rid
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Current and voltages outside of normal operation?
I think "outside" is (often) likely indicative of
"something is (already) broken".
But, perhaps TRENDS in either/both can be predictive.
E.g., if a (sub)circuit has always been consuming X (which
is nominal for the design) and, over time, starts to consume
1.1X, is that suggestive that something is in the process of
failing?
That depends on many other unknown factors.
Temperature sensors are common in electronics.
So is current sensing. Voltage sensing too.
Post by Don Y
Note that the goal is not to troubleshoot the particular design
or its components but, rather, act as an early warning that
maintenance may be required (or, that performance may not be
what you are expecting/have become accustomed to).
If the system is electronic then you can detect whether currents and/or
voltages are within expected ranges.
If they are just a little out of expected range then you might turn on a
warning LED.
If they are way out of range then you might tell the power supply to turn
off quick.
By all means tell the software what has happened, but don't put software
between the current sensor and the emergency turn off.
Be aware that components in monitoring circuits can fail too.
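
Roughly like this on the software side (illustrative limits only; the hard
trip itself stays in the comparator/crowbar hardware and acts on its own):

def classify_reading(value, nominal, warn_frac=0.05, fault_frac=0.20):
    # Advisory-only classification.  A genuine out-of-range trip is assumed
    # to have been handled in hardware before this code ever runs.
    deviation = abs(value - nominal) / nominal
    if deviation >= fault_frac:
        return "FAULT"   # hardware should already have acted; log the event
    if deviation >= warn_frac:
        return "WARN"    # light the warning LED, flag for maintenance
    return "OK"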
Post by Don Y
You can include mechanisms to verify outputs are what you
*intended* them to be (in case the output drivers have shit
the bed).
You can, also, do sanity checks that ensure values are never
what they SHOULDN'T be (this is commonly done within software
products -- if something "can't happen" then noticing that
it IS happening is a sure-fire indication that something
is broken!)
[Limit switches on mechanisms are there to ensure the impossible
is not possible -- like driving a mechanism beyond its extents]
And, where possible, notice second-hand effects of your actions
(e.g., if you switched on a load, you should see an increase
in supplied current).
But, again, these are more helpful in detecting FAILED items.
What system would you like to have early warnings for?
Are the warnings needed to indicate operation out of expected limits or to
indicate that maintenance is required, or both?
Without detailed knowledge of the specific system, only speculative answers
can be given.
Don Y
2024-04-16 05:32:04 UTC
On 4/15/2024 8:33 PM, Edward Rawde wrote:

[Shouldn't that be Edwar D rawdE?]
Post by Edward Rawde
Post by Don Y
Post by Martin Rid
Current and voltages outside of normal operation?
I think "outside" is (often) likely indicative of
"something is (already) broken".
But, perhaps TRENDS in either/both can be predictive.
E.g., if a (sub)circuit has always been consuming X (which
is nominal for the design) and, over time, starts to consume
1.1X, is that suggestive that something is in the process of
failing?
That depends on many other unknown factors.
Temperature sensors are common in electronics.
So is current sensing. Voltage sensing too.
Sensors cost money. And, HAVING data but not knowing how to
USE it is a wasted activity (and cost).

Why not monitor every node in the schematic and compare
them (with dedicated hardware -- that is ALSO monitored??)
with expected operational limits?

Then, design some network to weight the individual
observations to make the prediction?
Post by Edward Rawde
Post by Don Y
Note that the goal is not to troubleshoot the particular design
or its components but, rather, act as an early warning that
maintenance may be required (or, that performance may not be
what you are expecting/have become accustomed to).
If the system is electronic then you can detect whether currents and/or
voltages are within expected ranges.
If they are just a little out of expected range then you might turn on a
warning LED.
If they are way out of range then you might tell the power supply to turn
off quick.
By all means tell the software what has happened, but don't put software
between the current sensor and the emergency turn off.
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"

As such, software is invaluable as designing PREDICTIVE hardware is
harder than designing predictive software (algorithms).

You don't want to tell the user "The battery in your smoke detector
is NOW dead (leaving you vulnerable)" but, rather, "The battery in
your smoke detector WILL cease to be able to provide the power necessary
for the smoke detector to provide the level of protection that you
desire."

And, the WAY that you inform the user has to be "productive/useful".
A smoke detector beeping every minute is likely to find itself unplugged,
leading to exactly the situation that the alert was trying to avoid!

A smoke detector that beeps once a day risks not being heard
(what if the occupant "works nights"?). A smoke detector
that beeps a month in advance of the anticipated failure (and
requires acknowledgement) risks being forgotten -- until
it is forced to beep more persistently (see above).
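
One way to drive that kind of advance warning, sketched with made-up
numbers (a real detector would track load-compensated battery voltage or a
coulomb count, not bare readings):

def days_until_depleted(history, cutoff_volts=7.2):
    # history: list of (day_number, measured_battery_volts).
    # Fit a straight line to the trend and extrapolate to the cutoff.
    # Returns None if the battery isn't visibly declining.
    n = len(history)
    if n < 2:
        return None
    xs = [d for d, _ in history]
    ys = [v for _, v in history]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    if denom == 0:
        return None
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom
    if slope >= 0:
        return None                      # not declining; nothing to predict
    intercept = mean_y - slope * mean_x
    depletion_day = (cutoff_volts - intercept) / slope
    return max(0.0, depletion_day - xs[-1])

Start the weekly reminders when the returned figure drops under ~30 days.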
Post by Edward Rawde
Be aware that components in monitoring circuits can fail too.
Which is why hardware interlocks are physical switches -- yet
can only be used to protect against certain types of faults
(those that are most costly -- injury or loss of life)
Post by Edward Rawde
Post by Don Y
But, again, these are more helpful in detecting FAILED items.
What system would you like to have early warnings for?
Are the warnings needed to indicate operation out of expected limits or to
indicate that maintenance is required, or both?
Without detailed knowledge of the specific system, only speculative answers
can be given.
I'm not looking for speculation. I'm looking for folks who have DONE
such things (designing to speculation is more expensive than just letting
the devices fail when they need to fail!)

E.g., when making tablets, it is possible that a bit of air will
get trapped in the granulation during compression. This is dependent
on a lot of factors -- tablet dimensions, location in the die
where the compression event is happening, characteristics of the
granulation, geometry/condition of the tooling, etc.

But, if this happens, some tens of milliseconds later, the top will "pop"
off the tablet. It now is cosmetically damaged as well as likely out
of specification (amount of "active" present in the dose). You want
to either be able to detect this (100% of the time on 100% of the tablets)
and dynamically discard those units (and only those units!). *OR*,
identify the characteristics of the process that most affect this condition
and *monitor* for them to AVOID the problem.

If that means replacing your tooling more frequently (expensive!),
it can save money in the long run (imagine having to "sort" through
a million tablets each hour to determine if any have popped like this?)
Or, throttling down the press so the compression events are "slower"
(more gradual). Or, moving the event up in the die to provide
a better egress for the trapped air. Or...

TELLING the user that this is happening (or likely to happen, soon)
has real $$$ value. Even better if your device can LEARN which
tablets and conditions will likely lead to this -- and when!
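
If a per-event compression signature is available (a big *if* -- the
signal names and limits below are invented purely for illustration, not how
any real press does it), the per-unit reject decision could be as simple as:

def should_reject(peak_force_kN, ejection_force_kN,
                  force_limits=(8.0, 12.0), ejection_limit_kN=0.6):
    # Flag a single compression event as suspect (possible air entrapment /
    # capping).  In practice the limits would be *learned* from events that
    # actually produced defective tablets, not hard-coded like this.
    low, high = force_limits
    if not (low <= peak_force_kN <= high):
        return True
    return ejection_force_kN > ejection_limit_kN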
Edward Rawde
2024-04-16 15:10:40 UTC
Post by Don Y
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
Post by Don Y
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
...
Joe Gwinn
2024-04-16 15:21:00 UTC
On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
Post by Edward Rawde
Post by Don Y
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
Post by Don Y
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
My experience has been that smoke detectors too close (as the smoke
travels) to the kitchen tend to suffer mysterious disablement.
Relocation of the smoke detector usually solves the problem.

Joe Gwinn
Edward Rawde
2024-04-16 16:06:47 UTC
Post by Joe Gwinn
On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
Post by Edward Rawde
Post by Don Y
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
Post by Don Y
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
My experience has been that smoke detectors too close (as the smoke
travels) to the kitchen tend to suffer mysterious disablement.
Oh yes I've had that too. I call them burnt toast detectors.
Post by Joe Gwinn
Relocation of the smoke detector usually solves the problem.
Joe Gwinn
legg
2024-04-17 12:11:33 UTC
On Tue, 16 Apr 2024 11:10:40 -0400, "Edward Rawde"
Post by Edward Rawde
Post by Don Y
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
...
Post by Don Y
A smoke detector that beeps once a day risks not being heard
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
The occasional beeping is a low battery alert.

RL
Edward Rawde
2024-04-16 16:02:43 UTC
Post by Don Y
[Shouldn't that be Edwar D rawdE?]
I don't mind how you pronounce it.
Post by Don Y
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"
As such, software is invaluable as designing PREDICTIVE hardware is
harder than designing predictive software (algorithms).
Two comparators can make a window detector which will tell you whether some
parameter is in a specified range.
And it doesn't need monthly updates because software is never finished.
Post by Don Y
You don't want to tell the user "The battery in your smoke detector
is NOW dead (leaving you vulnerable)" but, rather, "The battery in
your smoke detector WILL cease to be able to provide the power necessary
for the smoke detector to provide the level of protection that you
desire."
And, the WAY that you inform the user has to be "productive/useful".
A smoke detector beeping every minute is likely to find itself unplugged,
leading to exactly the situation that the alert was trying to avoid!
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
Post by Don Y
I'm not looking for speculation. I'm looking for folks who have DONE
such things (designing to speculation is more expensive than just letting
the devices fail when they need to fail!)
Well I don't recall putting anything much into a design which could predict
remaining life.
The only exceptions, also drawing from other replies in this thread, might
be temperature sensing,
voltage sensing, current sensing, air flow sensing, noise sensing, iron in
oil sensing,
and any other kind of sensing which might provide information on parameters
outside or getting close to outside expected range.
Give that to some software which also knows how long the equipment has been
in use, how often
it has been used, what the temperature and humidity was, how long it's been
since the oil was changed,
and you might be able to give the operator useful information about when to
schedule specific maintenance.
But don't give the software too much control. I don't want to be told that I
can't use the equipment because an oil change was required 5 minutes ago and
it hasn't been done yet.
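
As a toy sketch of what I mean (every weight and limit here is invented):

def maintenance_advisory(run_hours, hours_since_oil_change,
                         avg_temp_c, current_drift_ratio):
    # Combine usage history and sensed drift into advisory text only.
    # The operator decides what to do with it; nothing here locks anyone out.
    reasons = []
    if hours_since_oil_change > 500:
        reasons.append("oil change overdue")
    if run_hours > 10000:
        reasons.append("scheduled overhaul interval reached")
    if avg_temp_c > 60:
        reasons.append("running hot; check cooling/air flow")
    if current_drift_ratio > 1.10:
        reasons.append("supply current trending up; inspect the drive")
    return reasons   # empty list == no advisory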
...
Don Y
2024-04-16 16:53:42 UTC
Post by Edward Rawde
Post by Don Y
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"
As such, software is invaluable as designing PREDICTIVE hardware is
harder than designing predictive software (algorithms).
Two comparators can make a window detector which will tell you whether some
parameter is in a specified range.
Yes, but you are limited in the relationships that you can encode
in hardware -- because they are "hard-wired".
Post by Edward Rawde
And it doesn't need monthly updates because software is never finished.
Software is finished when the design is finalized. When management
fails to discipline itself to stop the list of "Sales would like..."
requests, then it's hard for software to even CLAIM to be finished.

[How many hardware products see the addition of features and new
functionality at the rate EXPECTED of software? Why can't I drive
my car from the back seat? It's still a "car", right? It's not like
I'm asking for it to suddenly FLY! Why can't this wall wart deliver
400A at 32VDC? It's still a power supply, right? It's not like I'm
asking for it to become an ARB!]
Post by Edward Rawde
Post by Don Y
You don't want to tell the user "The battery in your smoke detector
is NOW dead (leaving you vulnerable)" but, rather, "The battery in
your smoke detector WILL cease to be able to provide the power necessary
for the smoke detector to provide the level of protection that you
desire."
And, the WAY that you inform the user has to be "productive/useful".
A smoke detector beeping every minute is likely to find itself unplugged,
leading to exactly the situation that the alert was trying to avoid!
Reminds me of a tenant who just removed the battery to stop the annoying
beeping.
"Dinner will be served at the sound of the beep".

[I had a friend who would routinely trip her smoke detector while cooking.
Then, wave a dishtowel in front of it (she was short) to try to "clear" it.]

Most places have specific rules regarding the placement of smoke detectors
to 1) ensure safety and 2) avoid nuisance alarms. (I was amused to discover
that our local fire department couldn't cite the local requirements when
I went asking!)

Add CO and heat detectors to the mix and they get *really* confused!
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."

Remind the occupants once a week (requiring acknowledgement) starting
a month prior to ANTICIPATED battery depletion. When the battery is on
its last leg, you can be a nuisance.

Folks will learn to remember at the first (or second or third) reminder
in order to avoid the annoying nuisance behavior that is typical of
most detectors. (there is not a lot of $aving$ to replacing the
battery at the second warning instead of at the "last minute")

[We unconditionally replace all of ours each New Years. Modern units
now come with sealed "10 year" batteries -- 10 years being the expected
lifespan of the detector itself!]
Post by Edward Rawde
Post by Don Y
I'm not looking for speculation. I'm looking for folks who have DONE
such things (designing to speculation is more expensive than just letting
the devices fail when they need to fail!)
Well I don't recall putting anything much into a design which could predict
remaining life.
Most people don't. Most people don't design for high availability
*or* "costly" systems. When I was designing for pharma, my philosophy was
to make it easy/quick to replace the entire control system. Let someone
troubleshoot it on a bench instead of on the factory floor (which is
semi-sterile).

When you have hundreds/thousands of devices in a single installation,
then you REALLY don't want to have to be playing whack-a-mole with whichever
devices have crapped out, TODAY. If these *10* are likely to fail in
the next month, then replace all ten of them NOW, when you can fit
that maintenance activity into the production schedule instead of
HAVING to replace them when they DISRUPT the production schedule.
Post by Edward Rawde
The only exceptions, also drawing from other replies in this thread, might
be temperature sensing,
voltage sensing, current sensing, air flow sensing, noise sensing, iron in
oil sensing,
and any other kind of sensing which might provide information on parameters
outside or getting close to outside expected range.
Give that to some software which also knows how long the equipment has been
in use, how often
it has been used, what the temperature and humidity was, how long it's been
since the oil was changed,
and you might be able to give the operator useful information about when to
schedule specific maintenance.
I have all of those things -- with the exception of knowing which sensory data
is most pertinent to failure prediction. I can watch to see when one device
fails and use it to anticipate the next failure. After many years (of 24/7
operation) and many sites, I can learn from actual experience.

Just like I can learn that you want your coffee pot started 15 minutes after
you arise -- regardless of time of day. Or, that bar stock will be routed
directly to the Gridley's. Because that's what I've *observed* you doing!
Post by Edward Rawde
But don't give the software too much control. I don't want to be told that I
can't use the equipment because an oil change was required 5 minutes ago and
it hasn't been done yet.
Advisories are just that; advisories. They are there to help the user
avoid the "rudeness" of a piece of equipment "suddenly" (as far as the
user is concerned) failing. They add value by increasing availability.

If you choose to ignore the advisory (e.g., not purchasing a spare to have
on hand for that "imminent" failure), then that's your prerogative. If
you can afford to have "down time" and only react to ACTUAL failures,
then that's a policy decision that YOU make.

OTOH, if there is no oil in the gearbox, the equipment isn't going to
start; if the oil sensor is defective, then *it* needs to be replaced.
But, if the gearbox is truly empty, then it needs to be refilled.
In either case, the equipment *needs* service -- now. Operating in
this FAILED state presumably poses some risk, hence the prohibition.

[Cars have gas gauges to save the driver from "discovering" that he's
run out of fuel!]
Edward Rawde
2024-04-16 17:25:21 UTC
Post by Don Y
Post by Edward Rawde
Post by Don Y
Again, the goal is to be an EARLY warning, not an "Oh, Shit! Kill the power!!"
...
Post by Don Y
Add CO and heat detectors to the mix and they get *really* confused!
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their
phone.
Post by Don Y
Post by Edward Rawde
The only exceptions, also drawing from other replies in this thread, might
be temperature sensing,
voltage sensing, current sensing, air flow sensing, noise sensing, iron in
oil sensing,
and any other kind of sensing which might provide information on parameters
outside or getting close to outside expected range.
Give that to some software which also knows how long the equipment has been
in use, how often
it has been used, what the temperature and humidity was, how long it's been
since the oil was changed,
and you might be able to give the operator useful information about when to
schedule specific maintenance.
I have all of those things -- with the exception of knowing which sensory data
is most pertinent to failure prediction.
That's one reason why you want feedback from people who use your equipment.
Post by Don Y
OTOH, if there is no oil in the gearbox, the equipment isn't going to
start; if the oil sensor is defective, then *it* needs to be replaced.
Preferably by me purchasing a new sensor and being able to replace it
myself.
Post by Don Y
...
Don Y
2024-04-16 19:31:33 UTC
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.

OTOH, the manufacturer wants to "keep selling toilet paper" and
has used that business model to underwrite the cost of the "device".

Everything, here, is wired. And, my designs have the same approach:
partly because I can distribute power over the fabric; partly because
it removes an attack surface (RF jamming); partly for reliability
(no reliance on services that the user doesn't "own"); partly for
privacy (no information leaking -- even by side channel inference).
My wireless connections are all short range and of necessity
(e.g., the car connects via wifi so the views from the external
cameras can be viewed on its LCD screen as it pulls out/in).
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Give that to some software which also knows how long the equipment has been
in use, how often
it has been used, what the temperature and humidity was, how long it's been
since the oil was changed,
and you might be able to give the operator useful information about when to
schedule specific maintenance.
I have all of those things -- with the exception of knowing which sensory data
is most pertinent to failure prediction.
That's one reason why you want feedback from people who use your equipment.
All they know is when something BREAKS. But, my device also knows that
(unless EVERY "thing" breaks at the same time). The devices can capture
pertinent data to adjust their models of when those other devices are
likely to suffer similar failures because the failed device shared
its observations with them (via a common knowledge base).
Post by Edward Rawde
Post by Don Y
OTOH, if there is no oil in the gearbox, the equipment isn't going to
start; if the oil sensor is defective, then *it* needs to be replaced.
Preferably by me purchasing a new sensor and being able to replace it
myself.
If it makes sense to do so. Replacing a temperature sensor inside an
MCU is likely not cost effective, not something folks are capable of doing,
nor supported as an FRU.
Edward Rawde
2024-04-16 19:43:17 UTC
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Not for most users here.
They tried to put me on LSN/CGNAT not long ago.
After complaining I was given a free static IPv4.
Most users wouldn't know DDNS from a banana, and will expect it to work out
of the box after installing the app on their phone.
Post by Don Y
...
Don Y
2024-04-16 21:35:34 UTC
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Not for most users here.
They tried to put me on LSN/CGNAT not long ago.
After complaining I was given a free static IPv4.
Most folks, here, effectively have static IPs -- even if not guaranteed
as such. But, most also have AUPs that prohibit running their own servers
(speaking about consumers, not businesses).
Post by Edward Rawde
Most users wouldn't know DDNS from a banana, and will expect it to work out
of the box after installing the app on their phone.
There's no reason the app can't rely on DDNS. Infineon used to make
a series of "power control modules" (think BSR/X10) for consumers.
You could talk to the "controller" -- placed on YOUR network -- directly.
No need to go THROUGH a third party (e.g., Infineon).

If you wanted to access those devices (through the controller) from
a remote location, the controller -- if you provided internet access
to it -- would register with a DDNS and you could access it through
that URL.
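
The device-side half of that is tiny. A sketch against a purely
hypothetical update endpoint (the URL, parameters and credentials are
placeholders, not any particular provider's API):

import base64
import urllib.parse
import urllib.request

def update_ddns(hostname, username, password,
                update_url="https://ddns.example.net/update"):
    # Tell a (hypothetical) dynamic-DNS service our current public address.
    # The service sees the request's source IP, so nothing else is needed.
    query = urllib.parse.urlencode({"hostname": hostname})
    req = urllib.request.Request(update_url + "?" + query)
    token = base64.b64encode((username + ":" + password).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode()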

It is only recently that vendors have been trying to bake themselves into
their products. Effectively turning their products into "rentals".
You can buy a smart IP camera that will recognize *people*! For $30.
Plus $6/month -- forever! (and, if you stop paying, you have a nice
high-tech-looking PAPERWEIGHT).

I rescued another (APC) UPS, recently. I was excited that they FINALLY
had included the NIC in the basic model (instead of as an add-in card
as it has historically been supplied - at additional cost).

[I use the network access to log my power consumption and control the
power to attached devices without having to press power buttons]

Ah, but you can't *talk* to that NIC! It exists so the UPS can talk to the
vendor! Who will let you talk to them to get information about YOUR UPS.

So, you pay for a NIC that you can't use -- unless you agree to their
terms (I have no idea if there is a fee involved or if they just want
to spy on your usage and sell you batteries!)

In addition to the sleaze factor, it's also a risk. Do you know what
the software in the device does? Are you sure it is benevolent? And,
not snooping your INTERNAL network (it's INSIDE your firewall)? Maybe
just trying to sort out what sorts of hardware you have (for which they
could pitch additional products/services)? Are you sure the product
(if benign) can't be hacked and act as a beachhead for some other
infestation?

And, what, exactly, am I *getting* for this risk that I couldn't get
with the "old style" NIC?
Edward Rawde
2024-04-16 22:19:32 UTC
Post by Don Y
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
But vendors know that most people want it easy so the push towards
subscription services and products which phone home isn't going to change.

Most people don't know or care what their products are sending to the
vendor.

I like to see what is connecting to what with https://www.pfsense.org/
But I might be the only person in 100 mile radius doing so.

I can also remote desktop from anywhere of my choice, with the rest of the
world unable to connect.

Pretty much all of my online services are either restricted to specific IPs
(cameras, remote desktop and similar).
Or they have one or more countries and other problem IPs blocked. (web sites
and email services).

None of that is possible when the vendor is in control because users will
want their camera pictures available anywhere.
Don Y
2024-04-17 01:12:14 UTC
Post by Edward Rawde
But vendors know that most people want it easy so the push towards
subscription services and products which phone home isn't going to change.
Until it does. There is nothing inherent in the design of any
of these products that requires another "service" to provide the
ADVERTISED functionality.

Our stove and refrigerator have WiFi "apps" -- that rely on a tie-in
to the manufacturer's site (no charge but still, why do they need
access to my stovetop?).

Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
Post by Edward Rawde
Most people don't know or care what their products are sending to the
vendor.
I think that is a generational issue. My neighbor just bought
camera and, when she realized it had to connect to their OUTBOUND
wifi, she just opted to return it. So, a lost sale AND the cost
of a return.

Young people seem to find nothing odd about RENTING -- anything!
Wanna listen to some music? You can RENT it, one song at a time!
Wanna access the internet using free WiFi *inside* a business?
The idea that they are leaking information never crosses their
mind. They *deserve* to discover that some actuary has noted
a correlation between people who shop at XYZCo and alcoholism.
Or, inability to pay their debts. Or, cannabis use. Or...
whatever the Big Data tells *them*.

Like the driver who complained that his CAR was revealing his
driving behavior through OnStar to credit agencies and their
subscribers (insurance companies) were using that to determine
the risk he represented.
Post by Edward Rawde
I like to see what is connecting to what with https://www.pfsense.org/
But I might be the only person in 100 mile radius doing so.
I can also remote desktop from anywhere of my choice, with the rest of the
world unable to connect.
Pretty much all of my online services are either restricted to specific IPs
(cameras, remote desktop and similar).
Or they have one or more countries and other problem IPs blocked. (web sites
and email services).
But IP and MAC masquerading are trivial exercises. And, don't require
a human participant to interact with the target (i.e., they can be automated).

I have voice access to the services in my home. I don't rely on the
CID information provided as it can be forged. But, I *do* require
the *voice* match one of a few known voiceprints -- along with other
conditions for access (e.g., if I am known to be HOME, then anyone
calling with my voice is obviously an imposter; likewise, if
someone "authorized" calls and passes the authentication procedure,
they are limited in what they can do -- like, maybe close my garage
door if I happened to leave it open and it is now after midnight).
And, recording a phrase (uttered by that person) only works if you
know what I am going to ASK you; anything that relies on your own
personal knowledge can't be emulated, even by an AI!

No need for apps or appliances -- you could technically use a "payphone"
(if such things still existed) or an office phone in some business.

I have a "cordless phone" in the car that lets me talk to the house from
a range of 1/2 mile, without relying on cell phone service. I can't
send video over the link -- but, I can ask "Did I remember to close
the garage door?" Or, "Did I forget to turn off the tea kettle?"
as I drive away.
Post by Edward Rawde
None of that is possible when the vendor is in control because users will
want their camera pictures available anywhere.
No, you just have to rely on other mechanisms for authentication.

I have a friend who manages a datafarm at a large multinational bank.
When he is here, he uses my internet connection -- which is "foreign"
as far as the financial institution is concerned -- with no problems.
But, he carries a time-varying "token" with him that ensures he
has the correct credentials for any ~2 minute slice of time!

I rely on biometrics, backed with "shared secrets" ("Hi Jane!
How's Tom doing?" "Hmmm, I don't know anyone by the name of Tom")
because I don't want to have to carry a physical key (and
don't want the other folks with access to have to do so, either)

And, most folks don't really need remote access to the things
that are offering that access. Why do I need to check the state
of my oven/stove WHEN I AM NOT AT HOME? (Why the hell would
I leave it ON when the house is empty???) There are refrigerators
that take a photo of the contents of the frig each time you close
the door. Do I care if the photo on my phone is of the state of the
refrigerator when I was last IN PROXIMITY OF IT vs. its most recent
state? Do I need to access my thermostat "online" vs. via SMS?
Or voice?
Edward Rawde
2024-04-17 01:38:29 UTC
Post by Don Y
Post by Edward Rawde
But vendors know that most people want it easy so the push towards
subscription services and products which phone home isn't going to change.
Until it does. There is nothing inherent in the design of any
of these products that requires another "service" to provide the
ADVERTISED functionality.
Our stove and refrigerator have WiFi "apps" -- that rely on a tie-in
to the manufacturer's site (no charge but still, why do they need
access to my stovetop?).
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
I'd have trouble here with no wifi access.
I can restrict outbound with a firewall as necessary.
Post by Don Y
Post by Edward Rawde
Most people don't know or care what their products are sending to the
vendor.
I think that is a generational issue. My neighbor just bought
a camera and, when she realized it had to connect to their OUTBOUND
wifi, she just opted to return it. So, a lost sale AND the cost
of a return.
Young people seem to find nothing odd about RENTING -- anything!
Wanna listen to some music? You can RENT it, one song at a time!
Wanna access the internet using free WiFi *inside* a business?
The idea that they are leaking information never crosses their
mind. They *deserve* to discover that some actuary has noted
a correlation between people who shop at XYZCo and alcoholism.
Or, inability to pay their debts. Or, cannabis use. Or...
whatever the Big Data tells *them*.
Like the driver who complained that his CAR was revealing his
driving behavior through OnStar to credit agencies and their
subscribers (insurance companies) were using that to determine
the risk he represented.
Post by Edward Rawde
I like to see what is connecting to what with https://www.pfsense.org/
But I might be the only person in 100 mile radius doing so.
I can also remote desktop from anywhere of my choice, with the rest of the
world unable to connect.
Pretty much all of my online services are either restricted to specific IPs
(cameras, remote desktop and similar).
Or they have one or more countries and other problem IPs blocked. (web sites
and email services).
But IP and MAC masquerading are trivial exercises. And, don't require
a human participant to interact with the target (i.e., they can be automated).
That's why most tor exit nodes and home user vpn services are blocked.
I don't allow unauthenticated access to anything (except web sites).
I prefer to keep authentication simple and drop packets from countries and
places who have no business connecting.
Granted a multinational bank may need a different approach since their
customers could be anywhere.
If I were a multinational bank I'd be employing people to watch where the
packets come from and decide which ones the firewall should drop.
Post by Don Y
I have voice access to the services in my home. I don't rely on the
CID information provided as it can be forged. But, I *do* require
the *voice* match one of a few known voiceprints -- along with other
conditions for access (e.g., if I am known to be HOME, then anyone
calling with my voice is obviously an imposter; likewise, if
someone "authorized" calls and passes the authentication procedure,
they are limited in what they can do -- like, maybe close my garage
door if I happened to leave it open and it is now after midnight).
And, recording a phrase (uttered by that person) only works if you
know what I am going to ASK you; anything that relies on your own
personal knowledge can't be emulated, even by an AI!
No need for apps or appliances -- you could technically use a "payphone"
(if such things still existed) or an office phone in some business.
I have a "cordless phone" in the car that lets me talk to the house from
a range of 1/2 mile, without relying on cell phone service. I can't
send video over the link -- but, I can ask "Did I remember to close
the garage door?" Or, "Did I forget to turn off the tea kettle?"
as I drive away.
Post by Edward Rawde
None of that is possible when the vendor is in control because users will
want their camera pictures available anywhere.
No, you just have to rely on other mechanisms for authentication.
I have a friend who manages a datafarm at a large multinational bank.
When he is here, he uses my internet connection -- which is "foreign"
as far as the financial institution is concerned -- with no problems.
But, he carries a time-varying "token" with him that ensures he
has the correct credentials for any ~2 minute slice of time!
I rely on biometrics, backed with "shared secrets" ("Hi Jane!
How's Tom doing?" "Hmmm, I don't know anyone by the name of Tom")
because I don't want to have to carry a physical key (and
don't want the other folks with access to have to do so, either)
And, most folks don't really need remote access to the things
that are offering that access. Why do I need to check the state
of my oven/stove WHEN I AM NOT AT HOME? (Why the hell would
I leave it ON when the house is empty???) There are refrigerators
that take a photo of the contents of the frig each time you close
the door. Do I care if the photo on my phone is of the state of the
refrigerator when I was last IN PROXIMITY OF IT vs. its most recent
state? Do I need to access my thermostat "online" vs. via SMS?
Or voice?
Don Y
2024-04-17 03:17:12 UTC
Post by Edward Rawde
Post by Don Y
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
I'd have trouble here with no wifi access.
I can restrict outbound with a firewall as necessary.
I have 25 general purpose drops, here. So, you can be in any room,
front/back porch -- even the ROOF -- and get connected.

When I *need* wifi, I have to turn on one of the radios in
the ceiling, temporarily. (they are there as convenience
features for visiting guests; they are blocked from all of
the wired connections in the house)
Post by Edward Rawde
Post by Don Y
But IP and MAC masquerading are trivial exercises. And, don't require
a human participant to interact with the target (i.e., they can be automated).
That's why most tor exit nodes and home user vpn services are blocked.
I don't allow unauthenticated access to anything (except web sites).
I prefer to keep authentication simple and drop packets from countries and
places who have no business connecting.
Granted a multinational bank may need a different approach since their
customers could be anywhere.
If I were a multinational bank I'd be employing people to watch where the
packets come from and decide which ones the firewall should drop.
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.

I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".

The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
Edward Rawde
2024-04-17 04:21:24 UTC
Post by Don Y
Post by Edward Rawde
Post by Don Y
Simple solution: router has no radio! Even if the appliances wanted
to connect (ignoring their "disable WiFi access" setting), there's
nothing they can connect *to*.
I'd have trouble here with no wifi access.
I can restrict outbound with a firewall as necessary.
I have 25 general purpose drops, here. So, you can be in any room,
front/back porch -- even the ROOF -- and get connected.
I have wired LAN to every room too but it's not only me who uses wifi so
wifi can't be turned off.
Post by Don Y
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I wouldn't want to give online security advice to others without doing it
myself.
Post by Don Y
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Post by Don Y
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
It's been a while since I had to clean a malware infested PC.
Don Y
2024-04-17 05:14:06 UTC
Post by Edward Rawde
Post by Don Y
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.

I have no desire to waste any time installing the latest OS & AV updates,
keeping an IDS operationally effective, etc. My business is designing
devices so my uses reflect that -- and nothing else.

"Patch Tuesday?" What's that?? Why would I *want* to play that game?
Post by Edward Rawde
I wouldn't want to give online security advice to others without doing it
myself.
The advice I give to others is to only leave "exposed" what you absolutely
MUST leave exposed. Most of my colleagues have adopted similar strategies
to keep their intellectual property secure; it's a small inconvenience
to (physically) move to a routed workstation when one needs to check email
or chase down a resource online.
Post by Edward Rawde
Post by Don Y
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the world,
it looks like a cable dangling in air...
Post by Edward Rawde
Post by Don Y
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
So, you'd have to *police* all such connections. What do you do with hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN PLACE OF
a legitimate device? (after all, it is a trivial matter to unplug a network
cable from one "approved" PC and plug it into a "foreign import")
Post by Edward Rawde
It's been a while since I had to clean a malware infested PC.
My current project relies heavily on internetworking for interprocessor
communication. So, has to be designed to tolerate (and survive) a
hostile actor being directly connected TO that fabric -- because that
is a likely occurrence, "in the wild".

Imagine someone being able to open your PC and alter the internals...
and be expected to continue to operate as if this had not occurred!
Edward Rawde
2024-04-17 05:39:51 UTC
Post by Don Y
Post by Edward Rawde
Post by Don Y
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
Must be a big house.
Post by Don Y
...
Post by Edward Rawde
Post by Don Y
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the world,
it looks like a cable dangling in air...
You could ping me if you knew my IP address.
Post by Don Y
Post by Edward Rawde
Post by Don Y
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
Post by Don Y
So, you'd have to *police* all such connections. What do you do with hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN PLACE OF
a legitimate device? (after all, it is a trivial matter to unplug a network
cable from one "approved" PC and plug it into a "foreign import")
Devices on a LAN should be secure just like Internet facing devices.
Post by Don Y
Post by Edward Rawde
It's been a while since I had to clean a malware infested PC.
My current project relies heavily on internetworking for interprocessor
communication. So, has to be designed to tolerate (and survive) a
hostile actor being directly connected TO that fabric -- because that
is a likely occurrence, "in the wild".
Imagine someone being able to open your PC and alter the internals...
and be expected to continue to operate as if this had not occurred!
Don Y
2024-04-17 06:33:11 UTC
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
Must be a big house.
The office is ~150 sq ft. Three sets of dual workstations each sharing a
set of monitors and a tablet (for music) -- 7 drops for each such set.
Eight drops for my "prototyping platform". Twelve UPSs. Four scanners
(two B size, one A-size w/ADF and a film scanner). An SB2000 and Voyager
(for cross development testing; I'm discarding a T5220 tomorrow).
Four "toy" NASs (for sharing files between myself and SWMBO, documents
dropped by the scanners, etc.). Four 12-bay NASs, two 16 bay. Four
8-bay ESXi servers. Two 1U servers. Two 2U servers. My DBMS server.
A "general services" appliance (DNS, NTP, PXE, FTP, TFTP, font, etc.
services). Three media front ends. One media tank. Two 12 bay
(and one 24 bay) iSCSI SAN devices.

[It's amazing how much stuff you can cram into a small space when you
try hard! :> To be completely honest, the scanners are located in
my adjoining bedroom]

The house is a bit under 2000 sq ft. But, the drops go to places that "people"
don't normally access -- with the notable exception of the 25 "uncommitted
drops": 2 in each bedroom, 2 on kitchen counters, 4 in living room,
3 in family room, 2 in dining room, front hall, back porch, front porch,
etc.

E.g., there are 4 in the kitchen ceiling -- for four "network speakers"
(controller, amplifier, network interface). Four more in the family room
(same use). And two on the back porch.

There's one on the roof *in* the evaporative cooler (to control the
evaporative cooler, of course). Another for a weather station
(to sort out how best to use the HVAC options available). Another
in the furnace/ACbrrr.

One for a genset out by the load center. Another for a solar installation.
One to monitor utility power consumption. Another for municipal water.
And natural gas. One for the irrigation system. One for water
"treatment".

One for the garage (door opener and "parking assistant"). Another for the
water heater. Washer. Dryer. Stove/oven. Refrigerator. Dishwasher.

One for each skylight (to allow for automatic venting, shading and
environmental sensing). One for each window (automate window coverings).

Three "control panels". One "privileged port" (used to "introduce" new
devices to the system, securely).

Two cameras on each corner of the house. A camera looking at the front
door. Another looking away from it. One more looking at the potential
guest standing AT the door. One on the roof (for the wildlife that
invariably end up there)

One for the alarm system. Phone system. CATV. CATV modem. 2 OTA TV
receivers. 2 SDRs.

10 BT "beacons" in the ceiling to track the location of occupants.
2 WiFi APs (also in the ceiling).

Etc. Processors are cheap. As is CAT5 to talk to them and power them.

You'll *see* the cameras, speaker grills, etc. But, the kit controlling
each of them is hidden -- in the devices, walls, ceilings, etc. (each
"controller" is about the size/shape/volume of a US electrical receptacle)
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the world,
it looks like a cable dangling in air...
You could ping me if you knew my IP address.
You can't see me, at all. You have to know the right sequence of packets
(connection attempts) to throw at me before I will "wake up" and respond
to the *final*/correct one. And, while doing so, will continue to
ignore *other* attempts to contact me. So, even if you could see that
I had started to respond, you couldn't "get my attention".
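
(For the curious: the idea is in the same family as port knocking. A
bare-bones sketch of the listening side -- ports, timing and transport are
invented for illustration and are NOT my actual mechanism:)

import socket
import time

KNOCK_SEQUENCE = [7000, 8000, 9000]   # arbitrary example ports
WINDOW_SECONDS = 10                   # whole sequence must arrive within this

def wait_for_knock(bind_addr="0.0.0.0"):
    # Silently watch for the correct sequence of UDP "knocks"; only then
    # would the real service be exposed.  Wrong or late sequences just reset
    # the state -- nothing is ever sent back, so probes see a dead cable.
    socks = []
    for port in KNOCK_SEQUENCE:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind((bind_addr, port))
        s.setblocking(False)
        socks.append((port, s))
    progress, started = 0, 0.0
    while True:
        for port, s in socks:
            try:
                _, addr = s.recvfrom(64)
            except BlockingIOError:
                continue
            if port == KNOCK_SEQUENCE[progress]:
                if progress == 0:
                    started = time.time()
                progress += 1
                if progress == len(KNOCK_SEQUENCE):
                    return addr[0]        # caller now opens the real port
            else:
                progress = 0              # out-of-order knock: start over
        if progress and time.time() - started > WINDOW_SECONDS:
            progress = 0
        time.sleep(0.05)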
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
The structure of the house's fabric allows me to treat any individual
node as being directly connected to the ISP while isolating the
rest of the nodes. I.e., if you bring a laptop loaded with malware into
the house, you can't infect anything (or even know that there are other
hosts, here); it's as if you had a dedicated connection to the Internet
with no other devices "nearby".
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
If you were to unplug any of the above mentioned ("house") drops,
you'd find nothing at the other end. Each physical link is an
encrypted tunnel that similarly "hides" until (and unless) properly
tickled. As a result, eavesdropping on the connection doesn't
"give" you anything (because it's immune from replay attacks and
its content is opaque to you)
Post by Edward Rawde
Post by Don Y
So, you'd have to *police* all such connections. What do you do with hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN PLACE OF
a legitimate device? (after all, it is a trivial matter to unplug a network
cable from one "approved" PC and plug it into a "foreign import")
Devices on a LAN should be secure just like Internet facing devices.
They should be secure from the threats they are LIKELY TO FACE.
If the only access to my devices is by gaining physical entry
to the premises, then why waste CPU cycles and man-hours protecting
against a threat that can't manifest? Each box has a password...
pasted on the outer skin of the box (for any intruder to read).

Do I *care* about the latest MS release? (ANS: No)
Do I care about the security patches for it? (No)
Can I still do MY work with MY tools? (Yes)

I have to activate an iPhone, tonight. So, drag out a laptop
(I have 7 of them), install the latest iTunes. Do the required
song and dance to get the phone running. Wipe the laptop's
disk and reinstall the image that was present, there, minutes
earlier (so, I don't care WHICH laptop I use!)
Edward Rawde
2024-04-17 17:49:38 UTC
Permalink
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
The internal network isn't routed. So, the only machines to worry about are
this one (used only for email/news/web) and a laptop that is only used
for ecommerce.
My LAN is more like a small/medium size business with all workstations,
servers and devices behind a firewall and able to communicate both with each
other and online as necessary.
I have 72 drops in the office and 240 throughout the rest of the house
(though the vast majority of those are for dedicated "appliances")...
about 2.5 miles of CAT5.
Must be a big house.
The office is ~150 sq ft. Three sets of dual workstations each sharing a
set of monitors and a tablet (for music) -- 7 drops for each such set.
Eight drops for my "prototyping platform". Twelve UPSs. Four scanners
(two B size, one A-size w/ADF and a film scanner). An SB2000 and Voyager
(for cross development testing; I'm discarding a T5220 tomorrow).
Four "toy" NASs (for sharing files between myself and SWMBO, documents
dropped by the scanners, etc.). Four 12-bay NASs, two 16 bay. Four
8-bay ESXi servers. Two 1U servers. Two 2U servers. My DBMS server.
A "general services" appliance (DNS, NTP, PXE, FTP, TFTP, font, etc.
services). Three media front ends. One media tank. Two 12 bay
(and one 24 bay) iSCSI SAN devices.
....
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Don Y
I have an out-facing server that operates in stealth mode and won't appear
on probes (only used to source my work to colleagues). The goal is not to
look "interesting".
Not sure what you mean by that.
Given what gets thrown at my firewall I think you could maybe look more
interesting than you think.
Nothing on my side "answers" connection attempts. To the rest of the world,
it looks like a cable dangling in air...
You could ping me if you knew my IP address.
You can't see me, at all. You have to know the right sequence of packets
(connection attempts) to throw at me before I will "wake up" and respond
to the *final*/correct one. And, while doing so, will continue to
ignore *other* attempts to contact me. So, even if you could see that
I had started to respond, you couldn't "get my attention".
I've never bothered with port knocking.
Those of us with inbound connectable web servers, database servers, email
servers etc have to be connectable by more conventional means.

....
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
If you were to unplug any of the above mentioned ("house") drops,
you'd find nothing at the other end. Each physical link is an
encrypted tunnel that similarly "hides" until (and unless) properly
tickled. As a result, eavesdropping on the connection doesn't
"give" you anything (because it's immune from replay attacks and
its content is opaque to you)
I'm surprised you get anything done with all the tickle processes you must
need before anything works.
Post by Don Y
Post by Edward Rawde
Post by Don Y
So, you'd have to *police* all such connections. What do you do with hundreds
of drops on a factory floor? Or, scattered throughout a business? Can
you prevent any "foreign" devices from being connected -- even if IN
PLACE
OF
a legitimate device? (after all, it is a trivial matter to unplug a network
cable from one "approved" PC and plug it into a "foreign import")
Devices on a LAN should be secure just like Internet facing devices.
They should be secure from the threats they are LIKELY TO FACE.
If the only access to my devices is by gaining physical entry
to the premises, then why waste CPU cycles and man-hours protecting
against a threat that can't manifest? Each box has a password...
pasted on the outer skin of the box (for any intruder to read).
Sounds like you are the only user of your devices.
Consider a small business.
Here you want a minimum of either two LANs or VLANs so that guest access to
wireless can't connect to your own LAN devices.
Your own LAN should have devices which are patched and have proper
identification so that even if you do get a compromised device on your own
LAN it's not likely to spread to other devices.
You might also want a firewall which is monitored remotely by someone who
knows how to spot anything unusual.
I have much written in python which tells me whether I want a closer look at
the firewall log or not.
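
A minimal sketch of that kind of first-pass filter (the log format and
thresholds here are made up -- real firewall logs differ): flag any source
that got dropped on many distinct ports, i.e. a likely scanner worth a
closer look.

from collections import defaultdict

def suspicious_sources(path, port_threshold=20):
    """Return {src_ip: distinct dropped ports} for sources over the threshold."""
    ports_per_src = defaultdict(set)
    with open(path) as log:
        for line in log:
            fields = line.split()          # e.g. "time action src_ip dst_port"
            if len(fields) < 4 or fields[1] != "DROP":
                continue
            ports_per_src[fields[2]].add(fields[3])
    return {src: len(p) for src, p in ports_per_src.items()
            if len(p) >= port_threshold}

for src, n in sorted(suspicious_sources("firewall.log").items(),
                     key=lambda kv: -kv[1]):
    print(src, "dropped on", n, "distinct ports")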
Post by Don Y
Do I *care* about the latest MS release? (ANS: No)
Do I care about the security patches for it? (No)
Can I still do MY work with MY tools? (Yes)
But only for your situation.
If I advised a small business to run like that they'd get someone else to do
it.
Post by Don Y
I have to activate an iPhone, tonight. So, drag out a laptop
(I have 7 of them), install the latest iTunes. Do the required
song and dance to get the phone running. Wipe the laptop's
disk and reinstall the image that was present, there, minutes
earlier (so, I don't care WHICH laptop I use!)
You'll have to excuse me for laughing at that.
Cybersecurity is certainly a very interesting subject, and thanks for the
discussion.
If I open one of the wordy cybersecurity books I have (pdf) at a random page
I get this.
"Once the attacker has gained access to a system, they will want to gain
administrator-level access to the current resource, as well as additional
resources on the network."
Well duh. You mean like once the bank robber has gained access to the bank
they will want to find out where the money is?
Don Y
2024-04-17 18:50:07 UTC
Permalink
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
You could ping me if you knew my IP address.
You can't see me, at all. You have to know the right sequence of packets
(connection attempts) to throw at me before I will "wake up" and respond
to the *final*/correct one. And, while doing so, will continue to
ignore *other* attempts to contact me. So, even if you could see that
I had started to respond, you couldn't "get my attention".
I've never bothered with port knocking.
Those of us with inbound connectable web servers, database servers, email
servers etc have to be connectable by more conventional means.
As with installing updates and other "maintenance issues", I have
no desire to add to my workload. I want to spend my time *designing*
things.

I run the server to save me time handling requests from colleagues for
source code releases. This lets them access the repository and
pull whatever versions they want without me having to get them and
send them. Otherwise, they gripe about my weird working hours, etc.
(and I gripe about their poorly timed requests for STATIC resources)

There is some overhead to their initial connection to the server
as the script has to take into account that packets aren't delivered
instantly and retransmissions can cause a connection attempt to be
delayed -- so, *I* might not see it when they think I am.

But, once the connection is allowed, there is no additional
overhead or special protocols required.
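
A minimal sketch of what the server side of such a gate can look like (an
illustration of the general idea only -- the sequence, window and bookkeeping
are made up): a source has to hit the right ports in order within a window,
and anything wrong or late starts it over.

import time

KNOCK_SEQUENCE = [7001, 8002, 9003]     # hypothetical secret sequence
WINDOW_SECONDS = 30                     # tolerates delayed/retransmitted packets
progress = {}                           # src_ip -> (next index, deadline)

def observe_knock(src_ip, port, now=None):
    """Record one connection attempt; True means the sequence just completed."""
    now = time.time() if now is None else now
    idx, deadline = progress.get(src_ip, (0, 0.0))
    if now > deadline:                  # stale or first contact: fresh window
        idx, deadline = 0, now + WINDOW_SECONDS
    if port != KNOCK_SEQUENCE[idx]:     # wrong port: back to square one
        progress[src_ip] = (0, now + WINDOW_SECONDS)
        return False
    idx += 1
    if idx == len(KNOCK_SEQUENCE):
        progress.pop(src_ip, None)
        return True                     # caller may now expose the real service
    progress[src_ip] = (idx, deadline)
    return False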
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Post by Edward Rawde
I wouldn't bother. I'd just not connect it to wifi or wired if I thought
there was a risk.
What I mean by that is I'd clean it without it being connected.
The Avira boot CD used to be useful but I forget how many years ago.
If you were to unplug any of the above mentioned ("house") drops,
you'd find nothing at the other end. Each physical link is an
encrypted tunnel that similarly "hides" until (and unless) properly
tickled. As a result, eavesdropping on the connection doesn't
"give" you anything (because it's immune from replay attacks and
it's content is opaque to you)
I'm surprised you get anything done with all the tickle processes you must
need before anything works.
I wouldn't "unplug any of the above mentioned drops". I'd let them connect
using their native protocols. This is already baked into the code so "costs"
nothing.

The hiding prevents an adversary from cutting an exposed (e.g., outside the
house) cable and trying to interfere with the system. Just like an adversary
on a factory floor could find a convenient, out-of-the-way place to access the
fabric with malevolent intent. Or, a guest in a hotel. Or, a passenger
on an aircraft/ship. Or, a CAN node in an automobile (!).
Post by Edward Rawde
Post by Don Y
They should be secure from the threats they are LIKELY TO FACE.
If the only access to my devices is by gaining physical entry
to the premises, then why waste CPU cycles and man-hours protecting
against a threat that can't manifest? Each box has a password...
pasted on the outer skin of the box (for any intruder to read).
Sounds like you are the only user of your devices.
I'm a "development lab". I want to spend my time using my tools to
create new products. I don't want to bear the overhead of trying to
keep up with patches for 0-day exploits just to be able to USE those
tools. I am more than willing to trade the hassle of walking
down the hall to another computer (this one) to access my email.
And, if I DL a research paper, copying it onto a thumb drive to
SneakerNet it back to my office. To me, that's a HUGE productivity
increase!
Post by Edward Rawde
Consider a small business.
Here you want a minimum of either two LANs or VLANs so that guest access to
wireless can't connect to your own LAN devices.
Your own LAN should have devices which are patched and have proper
identification so that even if you do get a compromised device on your own
LAN it's not likely to spread to other devices.
The house network effectively implements a VLAN per drop. My OS only lets
"things" talk to other things that they've been preconfigured to talk to.
So, I can configure the drop in the guest bedroom to access the ISP.
Or, one of the radios in the ceiling to do similarly. If I later decide that
I want to plug a TV into that guest bedroom drop, then the ISP access is
"unwired" from that drop and access to the media server wired in its place.

And, KNOW that there is no way that any of the traffic on either of those
tunnels can *see* (or access) any of the other traffic flowing through the
switch. The switch is the source of all physical security as you have
to be able to convince it to allow your traffic to go *anywhere* (and WHERE).

[So, the switch is in a protected location AND has the hardware
mechanisms that let me add new devices to the fabric -- by installing
site-specific "secrets" over a secure connection]

Because a factory floor would need the ability to "dial out" from a
drop ON the floor (or WiFi) without risking compromise to any of
the machines that are concurrently using that same fabric.

Imagine having a firewall ENCASING that connection so it can't see *anything*
besides the ISP. (and, imagine that firewall not needing any particular rules
governing the traffic that it allows as it's an encrypted tunnel letting
NOTHING through)
Post by Edward Rawde
You might also want a firewall which is monitored remotely by someone who
knows how to spot anything unusual.
I have much written in python which tells me whether I want a closer look at
the firewall log or not.
Yet another activity I don't have to worry about. Sit in the guest bedroom
and you're effectively directly connected to The Internet. If your machine
is vulnerable (because of measures YOU failed to take), then YOUR machine
is at risk. Not any of the other devices sharing that fabric. You can get
infected while sitting there and I'm still safe.

My "labor costs" are fixed and don't increase, regardless of the number
of devices and threats that I may encounter. No need for IT staff to handle
the "exposed" guests -- that's THEIR problem.
Post by Edward Rawde
Post by Don Y
Do I *care* about the latest MS release? (ANS: No)
Do I care about the security patches for it? (No)
Can I still do MY work with MY tools? (Yes)
But only for your situation.
If I advised a small business to run like that they'd get someone else to do
it.
And they would forever be "TAXED" for their choice. Folks are starting
to notice that updates often don't give them anything that is worth the
risk/cost of the update. Especially if that requires/entices them to have
that host routed!

My colleagues have begrudgingly adopted a similar "unrouted development
network" for their shops. The savings in IT-related activities are
enormous. And, they sleep more soundly knowing the only threats
they have to worry about are physical break-in and equipment
failure.

You want to check your email? Take your phone out of your pocket...
Need to do some on-line work (e.g., chasing down research papers
or browsing a remote repository)? Then move to an "exposed"
workstation FOR THAT TASK.

[Imagine if businesses required their employees to move to such
a workstation to browse YouTube videos or check their facebook
page! "Gee, you're spending an awful lot of time 'on-line',
today, Bob... Have you finished that DESIGN, yet?"]
Post by Edward Rawde
Post by Don Y
I have to activate an iPhone, tonight. So, drag out a laptop
(I have 7 of them), install the latest iTunes. Do the required
song and dance to get the phone running. Wipe the laptop's
disk and reinstall the image that was present, there, minutes
earlier (so, I don't care WHICH laptop I use!)
You'll have to excuse me for laughing at that.
Cybersecurity is certainly a very interesting subject, and thanks for the
discussion.
If I open one of the wordy cybersecurity books I have (pdf) at a random page
I get this.
"Once the attacker has gained access to a system, they will want to gain
administrator-level access to the current resource, as well as additional
resources on the network."
Hence the reason for NOT letting anything "talk" to anything that it shouldn't.
E.g., the oven has no need to talk to the front door lock. Or, the garage
door opener. Or, the HVAC controller. So, even if compromised, an adversary
can only do what those items could normally do. There is no *path* to
the items that it has no (designed) need to access!

With a conventional fabric, anything that establishes a beachhead on ANY
device can start poking around EVERYWHERE. You have to monitor traffic
INSIDE your firewall to verify nothing untoward is happening (IDS -- yet
another cost to install and maintain and police!)
Post by Edward Rawde
Well duh. You mean like once the bank robber has gained access to the bank
they will want to find out where the money is?
Banks keep the money in well-known places. Most commercial (and free) OS's
are similarly unimaginative. So, *looking* for it is relatively easy.
Especially OSs which use a unified file system as a naming mechanism
for everything in the system ("Gee, let's go have a peek at passwd(5)...")

In my approach, an actor only knows about the items that he SHOULD know about.
So, you may *SUSPECT* that there is a "front door" but the only things you
have access to are named "rose bush" and "garden hose" (if you are
an irrigation controller).

In a conventional (50 year old design!) system, you would *see* the names
of all of the devices in the system and HOPE that someone had implemented
one of them incorrectly. Your task (pen-test) would be to figure out
which one and how best to exploit it.

Had the designers, instead, adhered to the notions of information hiding,
encapsulation, principle of least privilege, etc., there'd be less attack
surface exposed to the "outside" AND to devices on the *inside*! (But,
you need to approach the design of the OS entirely differently instead of
hoping to layer protections onto some legacy codebase)
Edward Rawde
2024-04-17 19:28:49 UTC
Permalink
Post by Edward Rawde
You could ping me if you knew my IP address.
So let's take a step back before the posts get so big that my newsreader
crashes.

All networks are different.
All businesses have different online/offline needs.
All businesses have different processes and device needs.
All businesses have different people with different ideas about how their
network and devices should be secured.
Businesses which design or manufacture technology may have different
requirements when compared with businesses who just use it.
People find security inconvenient. "Don't give your password to anyone else"
is likely to fall on deaf ears.

There is no one-size-fits-all cybersecurity solution.
Any solution requires a detailed analysis of the network, the devices, and
how the people and/or their guests use it.

Few people know what is going in/out of the connection to their Internet
provider.
Few people care until it's too late.

Human behaviour is a major factor.

I had one manager do the equivalent of bursting into the operating theatre
while the heart surgeon was busy with a delicate and complicated operation.
He wanted to know all the details of the operation and why this part was
connected to that part etc.
It turned out that his reasoning was that after getting this information he
could do it himself instead of paying "cybersecurity" people.

Unskilled and unaware of it comes to mind. Search engine it if you need to.
Jasen Betts
2024-04-17 05:12:55 UTC
Permalink
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
Better to inform the individual who can get the replacement done when the
tenant isn't even home.
So, a WiFi/BT link to <whatever>? Now the simple smoke detector starts
suffering from feeping creaturism. "Download the app..."
No thanks. I have the same view of cameras.
They won't be connecting outbound to a server anywhere in the world.
But the average user does not know that and just wants the pictures on their
phone.
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Someone still needs to provide DDNS.

Yes, UPnP has been a thing for several generations of routers now,
but browsers have become fussier about port numbers too. Also, some
customers are on Carrier Grade NAT; I don't think that UPnP can traverse
that. IPv6, however, can avoid the CGNAT problem.

It's an ease of use vs quality of service problem.
--
Jasen.
🇺🇦 Слава Україні
Don Y
2024-04-17 05:51:37 UTC
Permalink
Post by Jasen Betts
Post by Don Y
There is no need for a manufacturer to interpose themselves in such
"remote access". Having the device register with a DDNS service
cuts out the need for the manufacturer to essentially provide THAT
service.
Someone still needs to provide DDNS.
Yes, but ALL they are providing is name resolution. They aren't
processing your data stream or "adding any value", there.
So, point your DNS at an IP that maps to the DDNS service
of your choice when the device "registers" with it!

Manufacturer can abandon a product line and your hardware STILL WORKS!
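
A minimal sketch of what that device-side registration can amount to
(endpoint, hostname and token are hypothetical -- providers differ, but many
accept an HTTP update in roughly this shape):

import urllib.request

UPDATE_URL = "https://ddns.example.net/update"   # hypothetical provider endpoint
HOSTNAME   = "frontdoor-cam.example.net"         # a name the *owner* controls
TOKEN      = "user-issued-token"                 # hypothetical credential

def register(ip):
    """Tell the DDNS provider that HOSTNAME should now resolve to ip."""
    url = "%s?hostname=%s&myip=%s&token=%s" % (UPDATE_URL, HOSTNAME, ip, TOKEN)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status, resp.read()

# Called whenever the device's public address changes (e.g. at boot and on
# a WAN-IP change); no manufacturer server is involved anywhere.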
Post by Jasen Betts
Yes, UPnP has been a thing for several generations of routers now,
but browsers have become fussier about port numbers too. Also, some
customers are on Carrier Grade NAT; I don't think that UPnP can traverse
that. IPv6, however, can avoid the CGNAT problem.
It's an ease of use vs quality of service problem.
Liz Tuddenham
2024-04-17 07:56:55 UTC
Permalink
...When I was designing for pharma, my philosophy was
to make it easy/quick to replace the entire control system. Let someone
troubleshoot it on a bench instead of on the factory floor (which is
semi-sterile).
That's fine if the failure is clearly in the equipment itself, but what
if it is in the way it interacts with something outside it, some
unpredictable or unrecognised input condition? It works perfectly on the
bench, only to fail when put into service ...again and again.
--
~ Liz Tuddenham ~
(Remove the ".invalid"s and add ".co.uk" to reply)
www.poppyrecords.co.uk
Don Y
2024-04-17 11:05:37 UTC
Permalink
Post by Liz Tuddenham
...When I was designing for pharma, my philosophy was
to make it easy/quick to replace the entire control system. Let someone
troubleshoot it on a bench instead of on the factory floor (which is
semi-sterile).
That's fine if the failure is clearly in the equipment itself, but what
if it is in the way it interacts with something outside it, some
unpredictable or unrecognised input condition? It works perfectly on the
bench, only to fail when put into service ...again and again.
Then the *replacement* -- now installed in the system -- would have
the same faulty behavior as the "pulled" unit. Lending credibility
to the pulled unit NOT being at fault.

When the control system is a 7 ft tall, 24 inch rack, bolted to
the floor, your only option is to troubleshoot the system there,
taking the system out of production while doing so.
john larkin
2024-04-15 18:28:13 UTC
Permalink
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Checking temperatures is good. An overload or a fan failure can be bad
news.

We put temp sensors on most products. Some parts, like ADCs and FPGAs,
have free built-in temp sensors.

I have tried various ideas to put an air flow sensor on boards, but so
far none have worked very well. We do check fan tachs to be sure they
are still spinning.

Blocking air flow generally makes fan speed *increase*.
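
A minimal sketch of the kind of check that implies (thresholds here are made
up): a dead tach is an outright fault, while a fan spinning well above its
baseline with the board also running hotter hints at blocked airflow, even
though the fan is still "spinning".

def fan_health(rpm, temp_c, rpm_baseline, temp_baseline_c):
    if rpm < 0.2 * rpm_baseline:
        return "FAULT: fan stalled or tach signal lost"
    if rpm > 1.3 * rpm_baseline and temp_c > temp_baseline_c + 10:
        return "WARNING: possible blocked airflow or failing fan"
    if temp_c > temp_baseline_c + 20:
        return "WARNING: overtemperature"
    return "OK"

print(fan_health(rpm=3100, temp_c=71, rpm_baseline=2300, temp_baseline_c=55))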
Joe Gwinn
2024-04-15 19:41:57 UTC
Permalink
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.

Joe Gwinn
john larkin
2024-04-15 20:05:40 UTC
Permalink
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?

I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.

Cars do, for some failure modes, like low oil level.

Don, what does the thing do?
Joe Gwinn
2024-04-15 22:03:23 UTC
Permalink
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.

It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
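
A minimal sketch of that trend check, assuming two arrays of 1 Hz 4-wire
readings (via chain and plain-copper reference): the ratio cancels the common
temperature drift, and the power in a band around 0.1 Hz is what gets tracked
from one measurement run to the next.

import numpy as np

def low_freq_noise_power(r_via, r_ref, fs=1.0, band=(0.05, 0.2)):
    ratio = np.asarray(r_via) / np.asarray(r_ref)   # temperature-corrected
    ratio = ratio - ratio.mean()                    # drop the DC value
    psd = np.abs(np.fft.rfft(ratio)) ** 2 / len(ratio)
    freqs = np.fft.rfftfreq(len(ratio), d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return float(psd[mask].mean())                  # average power near 0.1 Hz

# It is the run-over-run *rise* in this number, not its absolute level,
# that flags a via chain starting to crack.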

The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.

I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Post by john larkin
Don, what does the thing do?
Good question.

Joe Gwinn
john larkin
2024-04-15 23:26:35 UTC
Permalink
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.

One could instrument a PCB fab test board, I guess. But DC tests would
be fine.

We have one board with over 4000 vias, but they are mostly in
parallel.
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.

They measure shaft torque and SHP too, from the shaft twist.
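
As a rough worked example of that last bit (made-up but plausible numbers):
twist over a known gauge length gives torque through the shaft's torsional
stiffness, and torque times shaft speed gives power.

import math

G     = 79e9                     # shear modulus of steel, Pa
D     = 0.50                     # shaft diameter, m (solid shaft assumed)
L     = 3.0                      # gauge length between the twist sensors, m
rpm   = 100.0                    # shaft speed
theta = math.radians(0.8)        # measured twist over the gauge length

J      = math.pi * D**4 / 32     # polar moment of area, m^4
torque = G * J * theta / L       # N*m
power  = torque * 2 * math.pi * rpm / 60    # W
print(round(power / 745.7), "SHP")          # ~32,000 with these numbers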

I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Joe Gwinn
2024-04-16 14:19:00 UTC
Permalink
Post by john larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
Post by john larkin
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Post by john larkin
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested, but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
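
A quick sanity check on that claim: each of ~4000 equal vias carries 1/4000
of the conductance, so one going open raises the net resistance by roughly
250 ppm -- far above the ppm-level resolution of a 6.5-digit reading.

n = 4000
r_via = 1.0e-3                  # assume ~1 milliohm per via (illustrative)
r_all = r_via / n               # all vias in parallel
r_one_gone = r_via / (n - 1)    # one via open
print((r_one_gone - r_all) / r_all * 1e6, "ppm increase per failed via")  # ~250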
Post by john larkin
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.

Joe Gwinn
John Larkin
2024-04-16 15:16:04 UTC
Permalink
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
Post by john larkin
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an
operating electronic gadget. The answer is mostly NO.

We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on
random noise with a big 1/f component, just like climate prediction.
Post by Joe Gwinn
Post by john larkin
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested , but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.

The main engine gearbox had padlocks on the covers.

There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.

It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.

Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Post by Joe Gwinn
Joe Gwinn
Joe Gwinn
2024-04-16 17:20:34 UTC
Permalink
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
Post by john larkin
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an
operating electronic gadget. The answer is mostly NO.
Agree.
Post by John Larkin
We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on
random noise with a big 1/f component, just like climate prediction.
Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
the like, but that does not measure noise. Do you recall any more of
what they were doing? I might know what they were up to. The
military were big on prognostics for a while, and still talk of this,
but it never worked all that well in the field compared to what it was
supposed to improve on.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested, but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
Not at the currents in that logic card. Too much ambient thermal
noise.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.

Joe Gwinn
John Larkin
2024-04-17 00:48:19 UTC
Permalink
Post by Joe Gwinn
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
Post by john larkin
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an
operating electronic gadget. The answer is mostly NO.
Agree.
Post by John Larkin
We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on
random noise with a big 1/f component, just like climate prediction.
Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
the like, but that does not measure noise. Do you recall any more of
what they were doing? I might know what they were up to. The
military were big on prognostics for a while, and still talk of this,
but it never worked all that well in the field compared to what it was
supposed to improve on.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested, but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
Not at the currents in that logic card. Too much ambient thermal
noise.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.

I told him to touch an FPGA to see how warm it was getting, and he
refused.
Edward Rawde
2024-04-17 01:04:40 UTC
Permalink
Post by John Larkin
Post by Joe Gwinn
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
....
Post by John Larkin
Post by Joe Gwinn
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
That's what happens when they grow up having never accidentally touched the
top cap of a 40KG6A/PL519
John Larkin
2024-04-17 03:19:19 UTC
Permalink
On Tue, 16 Apr 2024 21:04:40 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
Post by Joe Gwinn
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
....
Post by John Larkin
Post by Joe Gwinn
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
That's what happens when they grow up having never accidentally touched the
top cap of a 40KG6A/PL519
They can type code. Rust is supposed to be safe.
Edward Rawde
2024-04-17 03:50:19 UTC
Permalink
Post by John Larkin
On Tue, 16 Apr 2024 21:04:40 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
Post by Joe Gwinn
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
On Mon, 15 Apr 2024 15:41:57 -0400, Joe Gwinn
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
....
Post by John Larkin
Post by Joe Gwinn
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
That's what happens when they grow up having never accidentally touched the
top cap of a 40KG6A/PL519
They can type code. Rust is supposed to be safe.
I doubt it's safe from the programmer who implemented my humidifier like
this:

if humidity < setting {
    fan_on();
} else {
    fan_off();
}
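
Presumably the objection is that there's no hysteresis, so the fan chatters
on and off around the setpoint. A sketch of the deadband version (band width
made up):

def humidifier_step(humidity, setting, fan_is_on, band=2.0):
    """Return the new fan state for one control pass."""
    if humidity < setting - band:
        return True              # dry enough: run the fan
    if humidity > setting + band:
        return False             # humid enough: stop
    return fan_is_on             # inside the deadband: leave it alone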
Joe Gwinn
2024-04-17 15:47:53 UTC
Permalink
On Tue, 16 Apr 2024 17:48:19 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
But nobody is going to monitor every via on a PCB, even if it were
possible.
It was not possible to test the vias on the failing logic board, but
we knew from metallurgical cut, polish, and inspect studies of failed
boards that it was the vias that were failing.
Post by john larkin
One could instrument a PCB fab test board, I guess. But DC tests would
be fine.
What was being tested was a fab test board that had both the series
via chain path and the no-via path of roughly the same DC resistance,
set up so we could do 4-wire Kelvin resistance measurements of each
path independent of the other path.
Yes, but the question was whether one could predict the failure of an
operating electronic gadget. The answer is mostly NO.
Agree.
Post by John Larkin
We had a visit from the quality team from a giant company that you
have heard of. They wanted us to trend analyze all the power supplies
on our boards and apply a complex algorithm to predict failures. It
was total nonsense, basically predicting the future by zooming in on
random noise with a big 1/f component, just like climate prediction.
Hmm. My first instinct was that they were using MIL-HDBK-217 (?) or
the like, but that does not measure noise. Do you recall any more of
what they were doing? I might know what they were up to. The
military were big on prognostics for a while, and still talk of this,
but it never worked all that well in the field compared to what it was
supposed to improve on.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
We have one board with over 4000 vias, but they are mostly in
parallel.
This can also be tested , but using a 6.5-digit DMM intended for
measuring very low resistance values. A change of one part in 4,000
is huge to a 6.5-digit instrument. The conductivity will decline
linearly as vias fail one by one.
Millikelvin temperature changes would make more signal than a failing
via.
Not at the currents in that logic card. Too much ambient thermal
noise.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
Post by Joe Gwinn
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
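(The usual trick being time-synchronous averaging: bin the accelerometer
samples by shaft angle, using a once-per-rev pulse, and average over many
revolutions so that only rotation-correlated content survives. A sketch,
with the per-revolution frames assumed to arrive from elsewhere:)

// Rotation-correlated content survives the averaging; everything else
// washes out. Trend the RMS of the averaged frame over time.
const BINS: usize = 256; // angular bins per revolution

fn synchronous_average(revolutions: &[[f64; BINS]]) -> [f64; BINS] {
    let mut avg = [0.0f64; BINS];
    for rev in revolutions {
        for (a, s) in avg.iter_mut().zip(rev.iter()) {
            *a += s;
        }
    }
    for a in avg.iter_mut() {
        *a /= revolutions.len() as f64;
    }
    avg
}

fn rotation_correlated_rms(revolutions: &[[f64; BINS]]) -> f64 {
    let avg = synchronous_average(revolutions);
    (avg.iter().map(|a| a * a).sum::<f64>() / BINS as f64).sqrt()
}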
Big ships that I've worked on have a long propeller shaft in the shaft
alley, a long tunnel where nobody often goes. They have magnetic shaft
runout sensors and shaft bearing temperature monitors.
They measure shaft torque and SHP too, from the shaft twist.
Yep. And these kinds of things fail slowly. At first.
They could repair a bearing at sea, given a heads-up about violent
failure. A serious bearing failure on a single-screw machine means
getting a seagoing tug.
The main engine gearbox had padlocks on the covers.
There was also a chem lab to analyze oil and water and such, looking
for contaminants that might suggest something going on.
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
It was fun being an electronic guru on sea trials of a ship full of
big hairy Popeye types. I, skinny gawky kid, got my own stateroom when
other tech reps slept in cots in the hold.
Have you noticed how many lumberjack types are afraid of electricity?
That can be funny.
Oh yes. And EEs frightened by a 9-v battery.
Joe Gwinn
I had an intern, an EE senior, who was afraid of 3.3 volts.
I told him to touch an FPGA to see how warm it was getting, and he
refused.
Yeah.

Not quite as dramatic, but in the last year I have been involved in
some full-scale vibration tests, where a relay rack packed full of
equipment is shaken and resulting phase noise is measured. People are
afraid to touch the vibrating equipment., but I tell people to put a
hand on a convenient place.

It's amazing how much one can tell by feel. There is some
low-frequency spectral analysis capability there, and one can detect
for instance a resonance. It's a very good cross-check on the fancy
instrumentation.

Joe Gwinn
Glen Walpert
2024-04-16 23:41:48 UTC
Permalink
On Tue, 16 Apr 2024 08:16:04 -0700, John Larkin wrote:

<clip>
Post by John Larkin
The main engine gearbox had padlocks on the covers.
Padlocks went on every reduction gearbox in the USN in the summer of 1972,
after CV60's departure to Vietnam was delayed by 3 days due to a bucket of
bolts being dumped into #3 Main Machinery Room reduction gear. The locks
were custom made for the application and not otherwise available, to serve
as a tamper-evident seal. You could easily cut one off but you couldn't
get a replacement. #3 main gear was cleaned up and large burrs filed off,
but still made a thump with every revolution for the entire 'Nam cruise.
(This followed a 3 fatality fire in 3-Main which delayed departure by 3
weeks, done skillfully enough to be deemed an accident.)

I was assigned to 3-Main on CV60 for the shipyard overhaul following the
'Nam cruise and heard the stories from those who were there. The
reduction gear thump went away entirely after a full power run following
overhaul, something rarely done except for testing on account of the
~million gallon a day fuel consumption.
Post by John Larkin
Post by Joe Gwinn
Post by john larkin
I liked hiding out in the shaft alley. It was private and cool, that
giant shaft slowly rotating.
Probably had a calming flowing water sound as well.
Yes, cool and beautiful and serene after the heat and noise and
vibration of the engine room. A quiet 32,000 horsepower.
So you're not claustrophobic and you like cool and quiet - you would have made a good
submariner.

Glen
Phil Hobbs
2024-04-16 00:51:03 UTC
Permalink
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
There are a number of instruments available that look for metal particles
in the lubricating oil.

Cheers

Phil Hobbs
--
Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC /
Hobbs ElectroOptics Optics, Electro-optics, Photonics, Analog Electronics
John Larkin
2024-04-16 02:17:39 UTC
Permalink
On Tue, 16 Apr 2024 00:51:03 -0000 (UTC), Phil Hobbs
Post by Phil Hobbs
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
There are a number of instruments available that look for metal particles
in the lubricating oil.
Cheers
Phil Hobbs
And water. Some of our capacitor simulators include a parallel
resistance component.

One customer used to glue bits of metal onto a string and pull it
through the magnetic sensor. We did a simulator for that too.

Jet engines have magnetic eddy-current blade-tip sensors. For
efficiency, they want a tiny clearance between fan blades and the
casing, but not too tiny.
Joe Gwinn
2024-04-16 13:54:40 UTC
Permalink
On Tue, 16 Apr 2024 00:51:03 -0000 (UTC), Phil Hobbs
Post by Phil Hobbs
Post by Joe Gwinn
Post by john larkin
Post by john larkin
On Mon, 15 Apr 2024 10:13:02 -0700, Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
There is a standard approach that may work: Measure the level and
trend of very low frequency (around a tenth of a Hertz) flicker noise.
When connections (perhaps within a package) start to fail, the flicker
level rises. The actual frequency monitored isn't all that critical.
Joe Gwinn
Do connections "start to fail" ?
Yes, they do, in things like vias. I went through a big drama where a
critical bit of radar logic circuitry would slowly go nuts.
It turned out that the copper plating on the walls of the vias was
suffering from low-cycle fatigue during temperature cycling and slowly
breaking, one little crack at a time, until it went open. If you
measured the resistance to parts per million (6.5 digit DMM), sampling
at 1 Hz, you could see the 1/f noise at 0.1 Hz rising. It's useful to
also measure a copper line, and divide the via-chain resistance by the
no-via resistance, to correct for temperature changes.
The solution was to redesign the vias, mainly to increase the critical
volume of copper. And modern SMD designs have less and less copper
volume.
I bet precision resistors can also be measured this way.
Post by john larkin
I don't think I've ever owned a piece of electronic equipment that
warned me of an impending failure.
Onset of smoke emission is a common sign.
Post by john larkin
Cars do, for some failure modes, like low oil level.
The industrial method for big stuff is accelerometers attached near
the bearings, and listen for excessive rotation-correlated (not
necessarily harmonic) noise.
There are a number of instruments available that look for metal particles
in the lubricating oil.
Yes.

The old-school version was a magnetic drain plug, which one inspected
for clinging iron chips or dust, also serving to trap those chips. The
newer-school version was to send a sample of the dirty oil to the lab
for microscope and chemical analysis. There are companies that will
take your old lubrication oil and reprocess it, yielding new oil.

If there was an oil filter, inspect the filter surface.

And when one was replacing the oil in the gear case, wipe the bottom
with a white rag, and look at the rag.

Nobody did electronic testing until very recently, because even
expensive electronics were far too unreliable and fragile.

Joe Gwinn
Edward Rawde
2024-04-15 20:32:17 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.

It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.

Back in that era I was doing a lot of repair work when I should have been
doing my homework.
So I knew that there were many unrelated kinds of hardware failure.

A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real cause
would never be known.

A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could also
burn up.

A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
It would often be easy to locate the dried up electrolytic due to it having
already let go of some of its contents.

So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.

If it's not possible to have the equivalent of two TVs then you will want to
be in a position to get the existing TV repaired or replaced as quickly as
possible.

My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.

I don't think it would be possible to "watch" everything because it's rare
that you can properly test a component while it's part of a working system.

These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
Post by Don Y
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Don Y
2024-04-16 03:20:55 UTC
Permalink
Post by Edward Rawde
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations? If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA1234? Without any data corruption
(CRC error) AND within the normal access time limits defined by the location
of those magnetic domains on the rotating medium?

Why should it attempt to retry this MORE than once?

Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?

Assuming you have properly configured a EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?

I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.

The controller COULD be watching this (cuz it knows when it
initiated the operation and there is an "end-of-stroke"
sensor available) and KNOW that the drive belt was stretching
to the point where it was impacting operation.

[And, that a stretched belt wasn't going to suddenly decide to
unstretch to fix the problem!]
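The bookkeeping for that is almost nothing. A sketch, where now_ms(),
end_of_stroke() and start_tray_motor() are made-up hooks into the real
firmware:

fn now_ms() -> u64 { 0 }
fn end_of_stroke() -> bool { true }
fn start_tray_motor() {}

const NOMINAL_MS: u64 = 1500; // stroke time when the drive was new (assumed)

fn open_tray() -> Option<u64> {
    let t0 = now_ms();
    start_tray_motor();
    while !end_of_stroke() {
        if now_ms() - t0 > 10 * NOMINAL_MS {
            return None; // jammed outright, not just slow
        }
    }
    let elapsed = now_ms() - t0;
    if 2 * elapsed > 3 * NOMINAL_MS {
        // 50% slower than it used to be: the belt is stretching and it
        // isn't going to un-stretch, so report it now rather than later
        println!("tray is getting slow: {} ms vs {} ms nominal", elapsed, NOMINAL_MS);
    }
    Some(elapsed)
}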
Post by Edward Rawde
Back in that era I was doing a lot of repair work when I should have been
doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
The goal isn't to predict ALL failures but, rather, to anticipate
LIKELY failures and treat them before they become an inconvenience
(or worse).

One morning, the (gas) furnace repeatedly tried to light as the
thermostat called for heat. Then, a few moments later, the
safeties would kick in and shut down the gas flow. This attracted my
attention as the LIT furnace should STAY LIT!

The furnace was too stupid to notice its behavior so would repeat
this cycle, endlessly.

I stepped in and overrode the thermostat to eliminate the call
for heat as this behavior couldn't be productive (if something
truly IS wrong, then why let it continue? and, if there is nothing
wrong with the controls/mechanism, then clearly it is unable to meet
my needs so why let it persist in trying?)

[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to
temperature as quickly as the designers had expected]
Post by Edward Rawde
A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
Post by Edward Rawde
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could also
burn up.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
Post by Edward Rawde
A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
Post by Edward Rawde
It would often be easy to locate the dried up electrolytic due to it having
already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want to
be in a position to get the existing TV repaired or replaced as quicky as
possible.
Two TVs are affordable. Consider two controllers for a wire-EDM machine.

Or, the cost of having that wire-EDM machine *idle* (because you didn't
have a spare controller!)
Post by Edward Rawde
My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance notice
that allows for preemptive action to be taken (and not TOO much advance
notice that the user ends up replacing items prematurely).
Post by Edward Rawde
I don't think it would be possible to "watch" everything because it's rare
that you can properly test a component while it's part of a working system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
Post by Edward Rawde
These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?

You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?

[Why have ECC at all?]

There are (and have been) many efforts to *predict* lifetimes of
components (and, systems). And, some work to examine the state
of systems /in situ/ with an eye towards anticipating their
likelihood of future failure.

[The former has met with poor results -- predicting the future
without a position in its past is difficult. And, knowing how
a device is "stored" when not powered on also plays a role
in its future survival! (is there some reason YOUR devices
can't power themselves on, periodically; notice the environmental
conditions; log them and then power back off)]

The question is one of a practical nature; how much does it cost
you to add this capability to a device and how accurately can it
make those predictions (thus avoiding some future cost/inconvenience).

For small manufacturers, the research required is likely not cost-effective;
just take your best stab at it and let the customer "buy a replacement"
when the time comes (hopefully, outside of your warranty window).

But, anything you can do to minimize this TCO issue gives your product
an edge over competitors. Given that most devices are smart, nowadays,
it seems obvious that they should undertake as much of this task as
they can (conveniently) afford.

<https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>

<https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>

<https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>

<https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>

<https://ieeexplore.ieee.org/document/1656125>

<https://journals.sagepub.com/doi/10.1177/0142331208092031>

[Sorry, I can't publish links to the full articles]
Edward Rawde
2024-04-16 04:14:13 UTC
Permalink
Post by Don Y
Post by Edward Rawde
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
My conclusion would be no.
Some of my reasons are given below.
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
Because the software designer didn't understand hardware.
The correct approach is to mark that part of the disk as unusable and, if
possible, move any data from it elsewhere quick.
Post by Don Y
If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA1234? Without any data corruption
(CRC error) AND within the normal access time limits defined by the location
of those magnetic domains on the rotating medium?
Why should it attempt to retry this MORE than once?
Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?
I'd have put an SSD in by now, along with an off site backup of the same
data :)
Post by Don Y
Assuming you have properly configured a EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?
I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.
If it hasn't been used for some time then I'm ready with a tiny screwdriver
blade to help it open.
But I forget when I last used an optical drive.
Post by Don Y
The controller COULD be watching this (cuz it knows when it
initiated the operation and there is an "end-of-stroke"
sensor available) and KNOW that the drive belt was stretching
to the point where it was impacting operation.
[And, that a stretched belt wasn't going to suddenly decide to
unstretch to fix the problem!]
Post by Edward Rawde
Back in that era I was doing a lot of repair work when I should have been
doing my homework.
So I knew that there were many unrelated kinds of hardware failure.
The goal isn't to predict ALL failures but, rather, to anticipate
LIKELY failures and treat them before they become an inconvenience
(or worse).
One morning, the (gas) furnace repeatedly tried to light as the
thermostat called for heat. Then, a few moments later, the
safeties would kick in and shut down the gas flow. This attracted my
attention as the LIT furnace should STAY LIT!
The furnace was too stupid to notice its behavior so would repeat
this cycle, endlessly.
I stepped in and overrode the thermostat to eliminate the call
for heat as this behavior couldn't be productive (if something
truly IS wrong, then why let it continue? and, if there is nothing
wrong with the controls/mechanism, then clearly it is unable to meet
my needs so why let it persist in trying?)
[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to
temperature as quickly as the designers had expected]
That's why the furnace designers couldn't have anticipated it.
They did not know that such a condition might occur so never tested for it.
Post by Don Y
Post by Edward Rawde
A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
There will always be sudden unexpected loss of functionality for reasons
which could not easily be predicted.
People who service lawn mowers in the area where I live are very busy right
now.
Post by Don Y
Post by Edward Rawde
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could also
burn up.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
Post by Edward Rawde
A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
Yes a power supply ripple detection circuit could have turned on a warning
LED but that never happened for at least two reasons.
1. The detection circuit would have increased the cost of the equipment and
thus diminished the profit of the manufacturer.
2. The user would not have understood and would have ignored the warning
anyway.
Post by Don Y
Post by Edward Rawde
It would often be easy to locate the dried up electrolytic due to it having
already let go of some of its contents.
So I concluded that if I wanted to be sure that I could always watch my
favourite TV show, we would have to have at least two TVs in the house.
If it's not possible to have the equivalent of two TVs then you will want to
be in a position to get the existing TV repaired or replaced as quickly as
possible.
Two TVs are affordable. Consider two controllers for a wire-EDM machine.
Or, the cost of having that wire-EDM machine *idle* (because you didn't
have a spare controller!)
Post by Edward Rawde
My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance notice
that allows for preemptive action to be taken (and not TOO much advance
notice that the user ends up replacing items prematurely).
Get feedback from the people who use your equipment.
Post by Don Y
Post by Edward Rawde
I don't think it would be possible to "watch" everything because it's rare
that you can properly test a component while it's part of a working system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
Sure but you have to be the operator for that.
So you can be ready to help the tray open when needed.
Post by Don Y
Post by Edward Rawde
These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?
Then I probably can't, because the power supply may be just a bought in
power supply which was never designed with upcoming failure detection in
mind.
Post by Don Y
You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?
[Why have ECC at all?]
Things are sometimes done the way they've always been done.
I used to notice a missing chip in the 9th position but now you mention it
the RAM I just looked at has 9 chips each side.
Post by Don Y
There are (and have been) many efforts to *predict* lifetimes of
components (and, systems). And, some work to examine the state
of systems /in situ/ with an eye towards anticipating their
likelihood of future failure.
I'm sure that's true.
Post by Don Y
[The former has met with poor results -- predicting the future
without a position in its past is difficult. And, knowing how
a device is "stored" when not powered on also plays a role
in its future survival! (is there some reason YOUR devices
can't power themselves on, periodically; notice the environmental
conditions; log them and then power back off)]
The question is one of a practical nature; how much does it cost
you to add this capability to a device and how accurately can it
make those predictions (thus avoiding some future cost/inconvenience).
For small manufacturers, the research required is likely not
cost-effective;
just take your best stab at it and let the customer "buy a replacement"
when the time comes (hopefully, outside of your warranty window).
But, anything you can do to minimize this TCO issue gives your product
an edge over competitors. Given that most devices are smart, nowadays,
it seems obvious that they should undertake as much of this task as
they can (conveniently) afford.
<https://www.sciencedirect.com/science/article/abs/pii/S0026271409003667>
<https://www.researchgate.net/publication/3430090_In_Situ_Temperature_Measurement_of_a_Notebook_Computer-A_Case_Study_in_Health_and_Usage_Monitoring_of_Electronics>
<https://www.tandfonline.com/doi/abs/10.1080/16843703.2007.11673148>
<https://www.prognostics.umd.edu/calcepapers/02_V.Shetty_remaingLifeAssesShuttleRemotemanipulatorSystem_22ndSpaceSimulationConf.pdf>
<https://ieeexplore.ieee.org/document/1656125>
<https://journals.sagepub.com/doi/10.1177/0142331208092031>
[Sorry, I can't publish links to the full articles]
Don Y
2024-04-16 05:40:29 UTC
Permalink
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
Because the software designer didn't understand hardware.
Actually, he DID understand the hardware which is why he retried
it instead of ASSUMING every operation would proceed correctly.

[Why bother testing the result code if you never expect a failure?]
Post by Edward Rawde
The correct approach is to mark that part of the disk as unusable and, if
possible, move any data from it elsewhere quick.
That only makes sense if the error is *persistent*. "Shit
happens" and you can get an occasional failed operation when
nothing is truly "broken".

(how do you know the HBA isn't the culprit?)
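A sketch of the sort of policy in question -- retry the transient, bound the
retries, but *count* them, because the count is the early warning.
read_block_raw() is a made-up stand-in for the real driver call:

fn read_block_raw(_lba: u64, _buf: &mut [u8]) -> Result<(), ()> { Ok(()) }

struct BlockDev {
    retries_this_hour: u64,
    retry_warn_threshold: u64, // tuned from "normal" history
}

impl BlockDev {
    fn read_block(&mut self, lba: u64, buf: &mut [u8]) -> Result<(), ()> {
        for attempt in 0..3u64 {
            if read_block_raw(lba, buf).is_ok() {
                if attempt > 0 {
                    self.retries_this_hour += attempt;
                    if self.retries_this_hour > self.retry_warn_threshold {
                        eprintln!("drive health: {} retries this hour",
                                  self.retries_this_hour);
                    }
                }
                return Ok(());
            }
        }
        Err(()) // persistent: remap/report it, don't keep hammering
    }
}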
Post by Edward Rawde
Post by Don Y
If I ask the drive to give
me LBA 1234, shouldn't it ALWAYS give me LBA1234? Without any data corruption
(CRC error) AND within the normal access time limits defined by the location
of those magnetic domains on the rotating medium?
Why should it attempt to retry this MORE than once?
Now, if you knew your disk drive was repeatedly retrying operations,
would your confidence in it be unchanged from times when it did not
exhibit such behavior?
I'd have put an SSD in by now, along with an off site backup of the same
data :)
So, any problems you have with your SSD, today, should be solved by using the
technology that will be invented 10 years hence! Ah, that's a sound strategy!
Post by Edward Rawde
Post by Don Y
Assuming you have properly configured a EIA232 interface, why would you
ever get a parity error? (OVERRUN errors can be the result of an i/f
that is running too fast for the system on the receiving end) How would
you even KNOW this was happening?
I suspect everyone who has owned a DVD/CD drive has encountered a
"slow tray" as the mechanism aged. Or, a tray that wouldn't
open (of its own accord) as soon/quickly as it used to.
If it hasn't been used for some time then I'm ready with a tiny screwdriver
blade to help it open.
Why don't they ship such drives with tiny screwdrivers to make it
easier for EVERY customer to address this problem?
Post by Edward Rawde
But I forget when I last used an optical drive.
When the firmware in your SSD corrupts your data, what remedy will
you use?

You're missing the forest for the trees.
Post by Edward Rawde
Post by Don Y
[Turns out, there was a city-wide gas shortage so there was enough
gas available to light the furnace but not enough to bring it up to
temperature as quickly as the designers had expected]
That's why the furnace designers couldn't have anticipated it.
Really? You can't anticipate the "gas shutoff" not being in the ON
position? (which would yield the same endless retry cycle)
Post by Edward Rawde
They did not know that such a condition might occur so never tested for it.
If they planned on ENDLESSLY retrying, then they must have imagined
some condition COULD occur that would lead to such an outcome.
Else, why not just retry *once* and then give up? Or, not
retry at all?
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
A component could fail suddenly, such as a short circuit diode, and
everything would work fine after replacing it.
The cause could perhaps have been a manufacturing defect, such as
insufficient cooling due to poor quality assembly, but the exact real cause
would never be known.
You don't care about the real cause. Or, even the failure mode.
You (as user) just don't want to be inconvenienced by the sudden
loss of the functionality/convenience that the device provided.
There will always be sudden unexpected loss of functionality for reasons
which could not easily be predicted.
And if they CAN'T be predicted, then they aren't germane to this
discussion, eh?

My concern is for the set of failure modes that can realistically
be anticipated.

I *know* the inverters in my monitors are going to fail. It
would be nice if I knew before I was actively using one when
it went dark!

[But, most users would only use this indication to tell them
to purchase another monitor; "You have been warned!"]
Post by Edward Rawde
People who service lawn mowers in the area where I live are very busy right
now.
Post by Don Y
Post by Edward Rawde
A component could fail suddenly as a side effect of another failure.
One short circuit output transistor and several other components could also
burn up.
So, if you could predict the OTHER failure...
Or, that such a failure might occur and lead to the followup failure...
Post by Edward Rawde
A component could fail slowly and only become apparent when it got to the
stage of causing an audible or visible effect.
But, likely, there was something observable *in* the circuit that
just hadn't made it to the level of human perception.
Yes a power supply ripple detection circuit could have turned on a warning
LED but that never happened for at least two reasons.
1. The detection circuit would have increased the cost of the equipment and
thus diminished the profit of the manufacturer.
That would depend on the market, right? Most of my computers have redundant
"smart" (i.e., internal monitoring and reporting) power supplies. Because
they were marketed to folks who wanted that sort of reliability. Because
a manufacturer who didn't provide that level of AVAILABILITY would quickly
lose market share. The cost of the added components and "handling" is
small compared to the cost of lost opportunity (sales).
Post by Edward Rawde
2. The user would not have understood and would have ignored the warning
anyway.
That makes assumptions about the market AND the user.

If one of my machines signals a fault, I look to see what it is complaining
about: is it a power supply failure (in which case, I'm now reliant on
a single power supply)? is it a memory failure (in which case, a bank
of memory may have been disabled which means the machine will thrash
more and throughput will drop)? is it a link aggregation error (and
network traffic will suffer)?

If I can't understand these errors, then I either don't buy a product
with that level of reliability *or* have someone on hand who CAN
understand the errors and provide remedies/advice.

Consumers will replace a PC because of malware, trashed registry,
creeping cruft, etc. That's a problem with the consumer buying the
"wrong" sort of computing equipment for his likely method of use.
(buy a Mac?)
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
My home wireless Internet system doesn't care if one access point fails, and
I would not expect to be able to do anything to predict a time of failure.
Experience says a dead unit has power supply issues. Usually external but
could be internal.
Again, the goal isn't to predict "time of failure". But, rather, to be
able to know that "this isn't going to end well" -- with some advance notice
that allows for preemptive action to be taken (and not TOO much advance
notice that the user ends up replacing items prematurely).
Get feedback from the people who use your equipment.
Users often don't understand when a device is malfunctioning.
Or, how to report the conditions and symptoms in a meaningful way.

I recall a woman I worked with ~45 years ago sitting, patiently,
waiting for her computer to boot. As I walked past, she asked me how
long it takes for that to happen (floppy based systems). Alarmed
(I had designed the workstations), I asked "How long have you been
waiting?"

Turns out, she had inserted the (8") floppy rotated 90 degrees from
its proper orientation.

How much longer would she have waited had I not walked past?
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
I don't think it would be possible to "watch" everything because it's rare
that you can properly test a component while it's part of a working system.
You don't have to -- as long as you can observe its effects on other
parts of the system. E.g., there's no easy/inexpensive way to
check to see how much the belt on that CD/DVD player has stretched.
But, you can notice that it HAS stretched (or, some less likely
change has occurred that similarly interferes with the tray's actions)
by noting how the activity that it is used for has changed.
Sure but you have to be the operator for that.
So you can be ready to help the tray open when needed.
One wouldn't bother with a CD/DVD player -- they are too disposable
and reporting errors won't help the user (even though you have a
big ATTACHED display at your disposal!)

"For your continued video enjoyment, replace me, now!"

OTOH, if a CNC machine tries to "home" a mechanism and doesn't
get (electronic) confirmation of that event having been completed,
would you expect *it* to just sit there endlessly waiting?
Possibly causing damage to itself in the process?

Would you expect it to "notice" if the drive motor APPEARED to
be connected and was drawing the EXPECTED amount of current?

Or, would you expect an electrician to come along and start
troubleshooting (taking the machine out of production in the process)?
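The checks are cheap if the axis already reports motor current and the home
switch is wired in. A sketch, with every hardware hook made up:

use std::time::Instant;

fn home_switch_made() -> bool { true }
fn motor_current_amps() -> f32 { 0.8 }
fn start_homing_move() {}
fn stop_axis() {}

enum HomeResult { Homed, Timeout, NoMotorCurrent }

fn home_axis() -> HomeResult {
    const TIMEOUT_MS: u128 = 5_000; // generous compared to a normal stroke
    const MIN_CURRENT: f32 = 0.2;   // below this the motor isn't really driving

    let start = Instant::now();
    start_homing_move();
    while !home_switch_made() {
        if motor_current_amps() < MIN_CURRENT {
            stop_axis();
            return HomeResult::NoMotorCurrent; // wiring/driver fault: flag it
        }
        if start.elapsed().as_millis() > TIMEOUT_MS {
            stop_axis();
            return HomeResult::Timeout; // switch or mechanism fault: flag it
        }
    }
    stop_axis();
    HomeResult::Homed
}

Either way the machine stops hurting itself *and* reports which of the two
checks tripped.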
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
These days I would expect to have fun with management asking for software to
be able to diagnose and report any hardware failure.
Not very easy if the power supply has died.
What if the power supply HASN'T died? What if you are diagnosing the
likely upcoming failure *of* the power supply?
Then I probably can't, because the power supply may be just a bought in
power supply which was never designed with upcoming failure detection in
mind.
You wouldn't pick such a power supply if that was an important
failure mode to guard against! (that's why smart power supplies
are so common -- and redundant!)
Post by Edward Rawde
Post by Don Y
You have ECC memory in most (larger) machines. Do you silently
expect it to just fix all the errors? Does it have a way of telling you
how many such errors it HAS corrected? Can you infer the number of
errors that it *hasn't*?
[Why have ECC at all?]
Things are sometimes done the way they've always been done.
Then, we should all be using machines with MEGAbytes of memory...
Post by Edward Rawde
I used to notice a missing chip in the 9th position but now you mention it
the RAM I just looked at has 9 chips each side.
Much consumer kit has non-ECC RAM. I'd wager many of the
devices designed by folks *here* use non-ECC RAM (because
support for ECC in embedded products is less common).

Is this ignorance? Or, willful naiveté?
Edward Rawde
2024-04-16 15:07:27 UTC
Permalink
Post by Don Y
Post by Edward Rawde
Post by Don Y
Post by Edward Rawde
It always puzzled me how HAL could know that the AE-35 would fail in the
near future, but maybe HAL had a motive for lying.
Why does your PC retry failed disk operations?
Because the software designer didn't understand hardware.
Actually, he DID understand the hardware which is why he retried
it instead of ASSUMING every operation would proceed correctly.
....
When the firmware in your SSD corrupts your data, what remedy will
you use?
Replace drive and restore backup.
It's happened a few times, and a friend had one of those Amazon SSDs that is
really 16 GB but reports itself to the OS as 1 TB.
Chris Jones
2024-04-17 12:18:59 UTC
Permalink
On a vaguely related rant, shamelessly hijacking your thread:

Why do recent mechanical hard drives have a "Annualised Workload Rate"
limit saying that you are only supposed to write say 55TB/year?

What is the wearout mechanism, or is it just bullshit to discourage
enterprise customers from buying the cheapest drives?

It seems odd to me that they would all do it, if it really is just made
up bullshit. It also seems odd to express it in terms of TB
read+written. I can't see why that would be more likely to wear it out
than some number of hours of spindle rotation, or seek operations, or
spindle starts, or head load/unload cycles. I could imagine they might
want to use a very high current density in the windings of the write
head that might place an electromigration limit on the time spent
writing, but they apply the limit to reads as well. Is there something
that wears out when the servo loop is keeping the head on a track?
Don Y
2024-04-17 15:42:03 UTC
Permalink
Why do recent mechanical hard drives have a "Annualised Workload Rate" limit
saying that you are only supposed to write say 55TB/year?
Are you sure they aren't giving you a *recommendation*? I.e., "this
device will give acceptable performance (not durability) in applications
with a workload of X TB/yr"?

I built a box to sanitize and characterize disks for recycling. It seems
a typical run-of-the-mill disk performs at about 60MB/s. So, ~350MB/min
or 21GB/hr. That's ~500GB/day or 180TB/yr.

Assuming 24/7/365 use.

In a 9-to-5 environment, that would be (5/7)*60TB (to account for idle time
on weekends) or ~40TB/yr.

Said another way, I'd expect a 55TB/yr drive to run at about (55/40)*60MB/s
or ~80MB/s. A drive that runs at 100MB/s (not uncommon) would be ~100TB/yr.
What is the wearout mechanism, or is it just bullshit to discourage enterprise
customers from buying the cheapest drives?
It seems odd to me that they would all do it, if it really is just made up
bullshit. It also seems odd to express it in terms of TB read+written. I can't
As this seems to be a relatively new "expression", it may be a side-effect of
SSD ratings (in which *wear* is a real issue). It would allow for a rough
comparison of the durability of the media in a synthetic workload.
see why that would be more likely to wear it out than some number of hours of
spindle rotation, or seek operations, or spindle starts, or head load/unload
cycles. I could imagine they might want to use a very high current density in
the windings of the write head that might place an electromigration limit on
the time spent writing, but they apply the limit to reads as well. Is there
something that wears out when the servo loop is keeping the head on a track?
I've encountered drives with 50K PoH that still report no SMART issues.
I assume they truly are running 24/7/365 (based on the number of power cycles
reported) so that's *6* years spinning on its axis! (I wonder how many
miles it would have traveled if it was a "wheel"?)

Most nearline drives pulled from DASs seem to be discarded (upgraded?)
at about 20K PoH, FWIW. Plenty of useful life remaining!

[FWIW, I've only lost two drives in my life -- one a laptop drive installed
in an application that spun it up and down almost continuously and another
that magically lost access to its boot sector. OTOH, I've heard horror
stories of folks having issues with SSDs (firmware). So, I just put all the
rescued SSDs I come across in a box thinking "someday" I will play with them]
Don Y
2024-04-17 16:05:56 UTC
Permalink
a typical run-of-the-mill disk performs at about 60MB/s. So, ~350MB/min
or 21GB/hr. That's ~500GB/day or 180TB/yr.
Assuming 24/7/365 use.
In a 9-to-5 environment, that would be (8/24)*(5/7)*180TB (to account for
nights and idle weekends) or ~40TB/yr.
Said another way, I'd expect a 55TB/yr drive to run at about (55/40)*60MB/s
or ~80MB/s. A drive that runs at 100MB/s (not uncommon) would be ~100TB/yr.
That number doesn't look right. 100/80 = 1.25 so that 55 should probably be
about 70TB/yr (not 100!).

I guess a calculator would be handy... :>
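
For anyone who wants to do the arithmetic themselves, a minimal sketch (plain
Python; the 55 TB/yr rating and 100 MB/s sustained rate below are example
inputs, not vendor figures) that turns an annualised workload rating into an
average transfer rate and an implied duty cycle:

# Relate an "Annualised Workload Rate" (TB/yr, reads + writes) to an
# average transfer rate and to a duty cycle at a given sustained speed.
# The 55 TB/yr and 100 MB/s figures are example inputs only.

SECONDS_PER_YEAR = 365 * 24 * 3600

def avg_rate_mb_s(tb_per_year):
    """Average rate (MB/s) if the rated bytes are spread over a full year."""
    return tb_per_year * 1e6 / SECONDS_PER_YEAR      # 1 TB = 1e6 MB (decimal)

def duty_cycle(tb_per_year, sustained_mb_s):
    """Fraction of the year spent transferring at the given sustained rate."""
    return avg_rate_mb_s(tb_per_year) / sustained_mb_s

if __name__ == "__main__":
    rating, speed = 55.0, 100.0                      # TB/yr, MB/s
    print(f"{rating} TB/yr is ~{avg_rate_mb_s(rating):.2f} MB/s averaged over a year")
    print(f"at {speed} MB/s sustained, that is a {duty_cycle(rating, speed):.1%} duty cycle")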
Martin Brown
2024-04-16 08:45:34 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
You have to be very careful that the additional complexity doesn't
itself introduce new annoying failure modes. My previous car had
filament bulb failure sensors (new one is LED) of which the one for the
parking light had itself failed - the parking light still worked.
However, the car would greet me with "parking light failure" every time
I started the engine and the main dealer refused to cancel it.

Repair of parking light sensor failure required swapping out the
*entire* front light assembly since it was assembled with one-time hot glue.
That would be a very expensive "repair" for a trivial fault.

The parking light is not even a required feature.
Post by Don Y
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Monitoring temperature, voltage supply and current consumption isn't a
bad idea. If they get unexpectedly out of line something is wrong.
Likewise with power on self tests you can catch some latent failures
before they actually affect normal operation.
--
Martin Brown
Don Y
2024-04-16 11:26:28 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
You have to be very careful that the additional complexity doesn't itself
introduce new annoying failure modes.
*Or*, decrease the reliability of the device, in general.
My previous car had filament bulb failure
sensors (new one is LED) of which the one for the parking light had itself
failed - the parking light still worked. However, the car would greet me with
"parking light failure" every time I started the engine and the main dealer
refused to cancel it.
My goal is to provide *advisories*. You don't want to constrain the
user.

Smoke detectors that pester you with "replace battery" alerts are nags.
A car that refuses to start unless the seat belts are fastened is a nag.

You shouldn't require a third party to enable you to ignore an
advisory. But, it's OK to require the user to acknowledge that
advisory!
Repair of parking light sensor failure required swapping out the *entire* front
light assembly since it was built in one time hot glue. That would be a very
expensive "repair" for a trivial fault.
The parking light is not even a required feature.
Post by Don Y
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Monitoring temperature, voltage supply and current consumption isn't a bad
idea. If they get unexpectedly out of line something is wrong.
Extremes are easy to detect -- but often indicate failures.
E.g., a short, an open.

The problem is sorting out what magnitude changes are significant
and which are normal variation.

I think being able to track history gives you a leg up in that
it gives you a better idea of what MIGHT be normal instead of
just looking at an instant in time.
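
As a minimal sketch of that history-based approach (the smoothing factor,
sigma threshold and warm-up count are arbitrary placeholders, not tuned
values): keep an exponentially weighted running mean and variance per
monitored quantity and raise an advisory only when a new reading wanders
well outside its *own* recent history.

# Flag a monitored quantity (rail current, temperature, ...) when it drifts
# well outside its own recent history rather than a fixed design limit.
# alpha, n_sigma and warmup are placeholder values, not tuned ones.

class DriftWatch:
    def __init__(self, alpha=0.05, n_sigma=4.0, warmup=20):
        self.alpha, self.n_sigma, self.warmup = alpha, n_sigma, warmup
        self.mean, self.var, self.count = None, 0.0, 0

    def update(self, x):
        """Feed one reading; return True if it looks anomalous vs. history."""
        self.count += 1
        if self.mean is None:                 # first sample seeds the statistics
            self.mean = x
            return False
        dev = x - self.mean
        anomalous = (self.count > self.warmup and
                     abs(dev) > self.n_sigma * max(self.var, 1e-12) ** 0.5)
        # exponentially weighted update of the running mean and variance
        self.mean += self.alpha * dev
        self.var = (1 - self.alpha) * (self.var + self.alpha * dev * dev)
        return anomalous

watch = DriftWatch()
readings = [1.00, 1.01, 0.99] * 40 + [1.12]   # e.g. amps drawn by a subcircuit
for i, r in enumerate(readings):
    if watch.update(r):
        print(f"advisory: sample {i} = {r:.2f} is outside recent history")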
Likewise with
power on self tests you can catch some latent failures before they actually
affect normal operation.
POST is seldom executed as devices tend to run 24/7/365.
So, I have to design runtime BIST support that can, hopefully,
coax this information from a *running* system without interfering
with that operation.

This puts constraints on how you operate the hardware
(unless you want to add lots of EXTRA hardware to
extract these observations).

E.g., if you can control N loads, then individually (sequentially)
activating them and noticing the delta power consumption reveals
more than just enabling ALL that need to be enabled and only seeing
the aggregate of those loads.

This can also simplify gross failure detection if part of the
normal control strategy.
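
A hypothetical sketch of that sequential-activation check -- the
read_supply_current() and set_load() calls stand in for whatever the real
hardware provides, and the nominal values would come from as-built
characterization:

# Enable loads one at a time and compare each current step against the
# value recorded when the unit was new. The I/O functions are placeholders.
import time

NOMINAL_DELTA_A = {"heater": 2.0, "pump": 0.8, "lamp": 0.4}   # as-built deltas
TOLERANCE = 0.25                                              # +/-25% advisory band

def read_supply_current():
    raise NotImplementedError("replace with the real ADC / power-monitor read")

def set_load(name, on):
    raise NotImplementedError("replace with the real output driver")

def check_loads():
    """Activate each load alone, record its current step, compare to nominal."""
    report = {}
    for name, nominal in NOMINAL_DELTA_A.items():
        baseline = read_supply_current()
        set_load(name, True)
        time.sleep(0.1)                       # let the reading settle
        delta = read_supply_current() - baseline
        set_load(name, False)
        report[name] = delta / nominal
        if abs(delta / nominal - 1.0) > TOLERANCE:
            print(f"advisory: {name} draws {delta:.2f} A vs {nominal:.2f} A nominal")
    return report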

E.g., I designed a medical instrument many years ago that had an
external "sensor array". As that could be unplugged at any time,
I had to continually monitor for its disconnection. At the same
time, individual sensors in the array could be "spoiled" by
spilled reagents. Yet, the other sensors shouldn't be compromised
or voided just because of the failure of certain ones.

Recognizing that this sort of thing COULD happen in normal use
was the biggest part of the design; the hardware and software
to actually handle these exceptions was then straightforward.

Note that some failures may not be possible to recover from
without adding significant cost (and other failure modes).
So, it's a value decision as to what you support and what
you "tolerate".
John Larkin
2024-04-16 15:22:17 UTC
Permalink
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Post by Martin Brown
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
You have to be very careful that the additional complexity doesn't
itself introduce new annoying failure modes. My previous car had
filament bulb failure sensors (new one is LED) of which the one for the
parking light had itself failed - the parking light still worked.
However, the car would greet me with "parking light failure" every time
I started the engine and the main dealer refused to cancel it.
Repair of parking light sensor failure required swapping out the
*entire* front light assembly since it was built in one time hot glue.
That would be a very expensive "repair" for a trivial fault.
The parking light is not even a required feature.
Post by Don Y
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
Monitoring temperature, voltage supply and current consumption isn't a
bad idea. If they get unexpectedly out of line something is wrong.
Likewise with power on self tests you can catch some latent failures
before they actually affect normal operation.
The real way to reduce failure rates is by designing carefully.

Sometimes BIST can help ensure that small failures won't become
board-burning failures, but an RMA will happen anyhow.

I just added a soft-start feature to a couple of boards. Apply a
current-limited 48 volts to the power stages before the real thing is
switched on hard.
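
For anyone sizing that kind of precharge, the back-of-envelope is just
I = C dV/dt: at a constant current limit the time to reach the rail is about
C*V/I. The values below are illustrative only, assuming the downstream
converters are held off during the ramp:

# Back-of-envelope numbers for a current-limited precharge / soft start.
# 1000 uF, 48 V and 0.5 A are illustrative values only.

def precharge_time_s(c_farads, v_rail, i_limit):
    """Time to ramp an ideal capacitance to v_rail at a constant current."""
    return c_farads * v_rail / i_limit

def stored_energy_j(c_farads, v_rail):
    """Energy parked in the bulk caps once charged: E = C*V^2/2."""
    return 0.5 * c_farads * v_rail ** 2

if __name__ == "__main__":
    C, V, I = 1000e-6, 48.0, 0.5
    print(f"precharge time ~ {precharge_time_s(C, V, I) * 1e3:.0f} ms")
    print(f"stored energy  ~ {stored_energy_j(C, V):.2f} J")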
Bill Sloman
2024-04-16 16:58:41 UTC
Permalink
<snip>
Post by John Larkin
Sometimes BIST can help ensure that small failures won't become
board-burning failures, but an RMA will happen anyhow.
Built-in self test is mostly auto-calibration. You can use temperature
sensitive components for precise measurements if you calibrate out the
temperature shift and re-calibrate if the measured temperature shifts
appreciably (or every few minutes).

It might also take out the effects of dopant drift in a hot device, but
it wouldn't take it out forever.
Post by John Larkin
I just added a soft-start feature to a couple of boards. Apply a
current-limited 48 volts to the power stages before the real thing is
switched on hard.
Soft-start has been around forever. If you don't pay attention to what
happens to your circuit at start-up and turn-off you can have some real
disasters.

At Cambridge Instruments I once replaced all the tail resistors in a
bunch of class-B long-tailed-pair-based scan amplifiers with constant
current diodes. With the resistor tails, the scan amps drew a lot of
current when the 24V rail was being ramped up and that threw the 24V
supply into current limit, so it didn't ramp up. The constant current
diodes stopped this (not that I can remember how).

This was a follow-up after I'd been brought in to stop the 24V power supply
from blowing up (because it hadn't had a properly designed current limit).

The problem had shown up in production - where it was known as the three-bag
problem because when things did go wrong the excursions on the 24V
rail destroyed three bags of components.
--
Bill Sloman, Sydney
Edward Rawde
2024-04-16 17:39:07 UTC
Permalink
Post by Bill Sloman
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
<snip>
Post by John Larkin
Sometimes BIST can help ensure that small failures won't become
board-burning failures, but an RMA will happen anyhow.
Built-in self test is mostly auto-calibration. You can use temperature
sensitive components for precise measurements if you calibrate out the
temperature shift and re-calibrate if the measured temperature shifts
appreciably (or every few minutes).
It might also take out the effects of dopant drift in a hot device, but it
wouldn't take it out forever.
Post by John Larkin
I just added a soft-start feature to a couple of boards. Apply a
current-limited 48 volts to the power stages before the real thing is
switched on hard.
Soft-start has been around forever. If you don't pay attention to what
happens to your circuit at start-up and turn-off you can have some real
disasters.
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.

I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Post by Bill Sloman
At Cambridge Instruments I once replaced all the tail resistors in a bunch
of class-B long-tailed-pair-based scan amplifiers with constant current
diodes. With the resistors tails, the scan amps drew a lot of current when
the 24V rail was being ramped up and that threw the 24V supply into
current limit, so it didn't ramp up. The constant current diodes stopped
this (not that I can remember how).
This was a follow-up after I'd brought in to stop the 24V power supply
from blowing up (because it hadn't had a properly designed current limit).
The problem had shown up in production - where it was known as the three
back problem because when things did go wrong the excursions on the 24V
rail destroyed three bags of components.
--
Bill Sloman, Sydney
John Larkin
2024-04-17 00:58:47 UTC
Permalink
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as many as four.

LM3880 is a dedicated powerup sequencer, most cool.

https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Edward Rawde
2024-04-17 01:16:45 UTC
Permalink
Post by John Larkin
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as may as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V
connector is in a position where it isn't clear whether it's connected or
not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
John Larkin
2024-04-17 03:23:46 UTC
Permalink
On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as may as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V
connector is in a position where it isn't sure whether it's connected or
not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
We considered the brownout case. The MAX809 handles that.

This supply will also tolerate +24v input, in case someone grabs the
wrong wart. Or connects the power backwards.
John Larkin
2024-04-17 15:17:10 UTC
Permalink
On Tue, 16 Apr 2024 20:23:46 -0700, John Larkin
Post by John Larkin
On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as may as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V
connector is in a position where it isn't sure whether it's connected or
not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
We considered the brownout case. The MAX809 handles that.
This supply will also tolerate +24v input, in case someone grabs the
wrong wart. Or connects the power backwards.
Another hazard/failure mode happens when things like opamps use pos
and neg supply rails. A positive regulator, for example, can latch up
if its output is pulled negative, through ground, at startup. Brownout
dippies can trigger that too.

Add schottky diodes to ground.
Edward Rawde
2024-04-17 18:31:13 UTC
Permalink
Post by John Larkin
On Tue, 16 Apr 2024 20:23:46 -0700, John Larkin
Post by John Larkin
On Tue, 16 Apr 2024 21:16:45 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 13:39:07 -0400, "Edward Rawde"
Post by Edward Rawde
Post by John Larkin
On Tue, 16 Apr 2024 09:45:34 +0100, Martin Brown
Yes I've seen that a lot.
The power rails in the production product came up in a different order to
those in the development lab.
This caused all kinds of previously unseen behaviour including an expensive
flash a/d chip burning up.
I'd have it in the test spec that any missing power rail does not cause
issues.
And any power rail can be turned on and off any time.
The equipment may not work properly with a missing power rail but it should
not be damaged.
Some FPGAs require supply sequencing, as may as four.
LM3880 is a dedicated powerup sequencer, most cool.
https://www.dropbox.com/scl/fi/gwrimefrgm729k8enqrir/28S662D_sh_19.pdf?rlkey=qvyip7rjqfy6i9yegqrt57n23&dl=0
Ok that doesn't surprise me.
I'd want to be sure that the requirement is always met even when the 12V
connector is in a position where it isn't sure whether it's connected or
not.
Or rapid and repeated connect/disconnect of 12V doesn't cause any issue.
We considered the brownout case. The MAX809 handles that.
This supply will also tolerate +24v input, in case someone grabs the
wrong wart. Or connects the power backwards.
Another hazard/failure mode happens when things like opamps use pos
and neg supply rails. A positive regulator, for example, can latch up
if its output is pulled negative, though ground, at startup. Brownout
dippies can trigger that too.
Add schottky diodes to ground.
I've seen many a circuit with pos and neg supply rails for op amps when a
single rail would have been fine.
In one case the output went negative during startup and the following device
(a VCO) didn't like that and refused to start.
A series diode was the easiest solution in that case.
Don
2024-04-16 13:25:07 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
A singular speculative spitball - the capacitive marker:

In-situ Prognostic Method of Power MOSFET Based on Miller Effect

... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...

(10.1109/PHM.2017.8079139)

Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
Edward Rawde
2024-04-16 15:37:16 UTC
Permalink
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Very interesting but are there any products out there which make use of this
or other prognostic methods to provide information on remaining useful life?
Post by Don
Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
Don
2024-04-16 17:15:31 UTC
Permalink
Post by Edward Rawde
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Very interesting but are there any products out there which make use of this
or other prognostic methods to provide information on remaining useful life?
Perhaps this popular application "rings a bell"?

Battery and System Health Monitoring of Battery-Powered Smart Flow
Meters Reference Design

<https://www.ti.com/lit/ug/tidudo5a/tidudo5a.pdf>

Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
Don Y
2024-04-16 21:22:28 UTC
Permalink
Post by Edward Rawde
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Very interesting but are there any products out there which make use of this
or other prognostic methods to provide information on remaining useful life?
Wanna bet there's a shitload of effort going into sorting out how to
prolong the service life of batteries for EVs?

It's only a matter of time before large organizations and nations start
looking hard at "eWaste" both from the standpoint of efficient use of
capital and resources, and of the environmental consequences. If recycling was
mandated (by law), how many vendors would rethink their approach to
product design? (Do we really need to assume the cost of retrieving
that 75 inch TV from the customer just so we can sell him ANOTHER?
Is there a better way to pitch improvements in *features* instead of
pels or screen size?)

Here, you have to PAY (typ $25) for someone to take ownership of
your CRT-based devices. I see Gaylords full of LCD monitors discarded
each week. And, a 20 ft roll-off of "flat screen TVs" monthly.

Most businesses discard EVERY workstation in their fleet on a
2-3 yr basis. The software update cycle coerces hardware developers
to design for a similarly (artificially) limited lifecycle.

[Most people are clueless at the volume of eWaste that their communities
generate, regularly.]
john larkin
2024-04-16 22:06:49 UTC
Permalink
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Danke,
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.

This doesn't sound very useful for "in-situ" diagnostics.

GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Don
2024-04-16 23:25:30 UTC
Permalink
Post by john larkin
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Threshold voltage measurement is indeed one of two parameters. The
second parameter is the Miller plateau voltage measurement.
The Miller plateau is directly related to the gate-drain
capacitance, Cgd. It's why "capacitive marker" appears in my
original followup.
Long story short, the Miller plateau length provides a way to
estimate Tj without a sensor. Some may find this useful.
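
For reference, the plateau arithmetic is plain gate-charge bookkeeping:
during the plateau essentially all the gate-drive current displaces charge in
Cgd while Vds swings, so t_plateau is roughly Cgd*dVds/Ig. A sketch with
made-up component values (the plateau-to-Tj mapping itself needs a per-device
calibration, which is what the paper addresses):

# Gate-charge bookkeeping around the Miller plateau (illustrative values only):
# t_plateau ~ Qgd / Ig ~ Cgd_effective * dVds / Ig.

def plateau_time_s(c_gd_farads, delta_vds, i_gate):
    return c_gd_farads * delta_vds / i_gate

def c_gd_from_plateau(t_plateau_s, delta_vds, i_gate):
    """Invert the relation: effective Cgd inferred from a measured plateau."""
    return t_plateau_s * i_gate / delta_vds

if __name__ == "__main__":
    # example: 500 pF effective Cgd, 400 V drain swing, 0.5 A gate drive
    t = plateau_time_s(500e-12, 400.0, 0.5)
    print(f"plateau length ~ {t * 1e9:.0f} ns")
    print(f"back-computed Cgd ~ {c_gd_from_plateau(t, 400.0, 0.5) * 1e12:.0f} pF")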

Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
John Larkin
2024-04-17 01:02:24 UTC
Permalink
Post by Don
Post by john larkin
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Threshold voltage measurement is indeed one of two parameters. The
second parameter is Miller platform voltage measurement.
The Miller plateau is directly related to the gate-drain
capacitance, Cgd. It's why "capacitive marker" appears in my
original followup.
Long story short, the Miller Plateau length provides a metric
principle to measure Tj without a sensor. Some may find this useful.
Danke,
When we want to measure actual junction temperature of a mosfet, we
use the substrate diode. Or get lazy and thermal image the top of the
package.
Don
2024-04-17 03:46:02 UTC
Permalink
Post by John Larkin
Post by Don
Post by john larkin
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
(10.1109/PHM.2017.8079139)
Sounds like they are really measuring gate threshold, or gate transfer
curve, drift with time. That happens and is usually no big deal, in
moderation. Ions and charges drift around. We don't build opamp
front-ends from power mosfets.
This doesn't sound very useful for "in-situ" diagnostics.
GaN fets can have a lot of gate threshold and leakage change over time
too. Drive them hard and it doesn't matter.
Threshold voltage measurement is indeed one of two parameters. The
second parameter is Miller platform voltage measurement.
The Miller plateau is directly related to the gate-drain
capacitance, Cgd. It's why "capacitive marker" appears in my
original followup.
Long story short, the Miller Plateau length provides a metric
principle to measure Tj without a sensor. Some may find this useful.
When we want to measure actual junction temperature of a mosfet, we
use the substrate diode. Or get lazy and thermal image the top of the
package.
My son asked me to explain how Government works. So I told him. They
hire a guy, give him a FLIR, and bundle both with their product as an
in-situ prognostic solution.

Danke,
--
Don, KB7RPU, https://www.qsl.net/kb7rpu
There was a young lady named Bright Whose speed was far faster than light;
She set out one day In a relative way And returned on the previous night.
Don Y
2024-04-16 22:21:25 UTC
Permalink
Post by Don
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
I suspect most would involve *relative* changes that would be
suggestive of changing conditions in the components (and not
directly related to environmental influences).
So, perhaps, a good strategy is to just "watch" everything and
notice the sorts of changes you "typically" encounter in the hope
that something of greater magnitude would be a harbinger...
In-situ Prognostic Method of Power MOSFET Based on Miller Effect
... This paper presents a new in-situ prognosis method for
MOSFET based on miller effect. According to the theory
analysis, simulation and experiment results, the miller
platform voltage is identified as a new degradation
precursor ...
With the levels of integration we now routinely encounter, this
is likely more of interest to a component vendor than an end designer.
I.e., sell a device that provides this sort of information in a
friendlier form.

Most consumers/users don't care about which component failed.
They just see the DEVICE as having failed.

The Reliability Engineer likely has more of an interest -- but,
only if he gets a chance to examine the failed device (how many
"broken" devices actually get returned to their manufacturer
for such analysis? Even when covered IN warranty??)

When I see an LCD monitor indicating signs of imminent failure,
I know I have to have a replacement on-hand. (I keep a shitload).
I happen to know that this particular type of monitor (make/model)
*tends* to fail in one of N (for small values of N) ways. So,
when I get around to dismantling it and troubleshooting, I know
where to start instead of having to wander through an undocumented
design -- AGAIN.

[I've standardized on three different (sized) models to make this
process pretty simple; I don't want to spend more than a few minutes
*repairing* a monitor!]

If the swamp (evaporative) cooler cycles on, I can monitor the rate
of water consumption compared to "nominal". Using this, I can infer
the level of calcification of the water valve *in* the cooler.
To some extent, I can compensate for obstruction by running the
blower at a reduced speed (assuming the cooler can meet the needs
of the house in this condition). With a VFD, I could find the sweet
spot! :>

So, I can alert the occupants of an impending problem that they might
want to address before the cooler can't meet their needs (when the
pads are insufficiently wetted, you're just pushing hot, dry air into
the house/office/business).
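
A sketch of that consumption check (all numbers invented for illustration;
the nominal figure would come from commissioning data):

# Infer water-valve calcification from fill rate relative to the as-new value.
# Both numbers below are invented for illustration.

NOMINAL_L_PER_HR = 20.0      # fill rate measured when the cooler was new
ADVISE_BELOW = 0.75          # advise once flow drops below 75% of nominal

def valve_health(measured_l_per_hr):
    """Return (fraction of nominal flow, advisory flag)."""
    frac = measured_l_per_hr / NOMINAL_L_PER_HR
    return frac, frac < ADVISE_BELOW

if __name__ == "__main__":
    frac, advise = valve_health(13.5)
    print(f"valve passing ~{frac:.0%} of nominal flow")
    if advise:
        print("advisory: valve likely calcifying; pads may run dry at full blower speed")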

A "dumb" controller just looks at indoor temperature and cycles
the system on/off based on whether or not it is above or below
the desired setpoint (which means it can actually make the house
warmer, the harder it tries to close the loop!)
Buzz McCool
2024-04-18 17:18:08 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never
demonstrated to me (given ample opportunity) that this technology
actually worked on intermittently failing hardware I had, so be cautious
in applying it in any future endeavors.

https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Don Y
2024-04-18 22:05:07 UTC
Permalink
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never demonstrated
to me (given ample opportunity) that this technology actually worked on
intermittently failing hardware I had, so be cautious in applying it in any
future endeavors.
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
whack-a-mole.
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so its addition will
be welcome.

Sun has historically been aggressive in trying to increase availability,
especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).

I am now seeing similar features in Dell servers. But, the *actual*
implementation details are always shrouded in mystery.

But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been
precipitated by it.

Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked in "hard limit".

E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors! Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory
controller throws an error).

This is paradoxically amusing; code to HANDLE errors is likely the least
accessed code in a product. So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you
will be of the handlers' abilities to address faults that DO manifest!

The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?

[One common flaw with RAID implementations and naive reliance on that
technology]
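
A minimal sketch of such a patrol/scrub pass -- re-hash each file and compare
against a previously stored manifest (the path<TAB>digest manifest format is
just an example):

# Re-read files and compare against stored SHA-256 digests so silent
# corruption is found before the data is actually needed.
# Manifest format (one "path<TAB>hexdigest" per line) is only an example.

import hashlib
import os

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scrub(manifest_path):
    """Return files whose current digest no longer matches the manifest."""
    bad = []
    with open(manifest_path, "r", encoding="utf-8") as m:
        for line in m:
            path, expected = line.rstrip("\n").split("\t")
            if not os.path.exists(path) or sha256_of(path) != expected:
                bad.append(path)
    return bad

if __name__ == "__main__":
    for victim in scrub("manifest.tsv"):
        print("scrub: mismatch or missing:", victim)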
Glen Walpert
2024-04-19 01:27:11 UTC
Permalink
Post by Buzz McCool
Is there a general rule of thumb for signalling the likelihood of an
"imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never
demonstrated to me (given ample opportunity) that this technology
actually worked on intermittently failing hardware I had, so be
cautious in applying it in any future endeavors.
Intermittent failures are the bane of all designers. Until something is
reliably observable, trying to address the problem is largely
wack-a-mole.
Post by Buzz McCool
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so it's addition
will be welcome.
Sun has historically been aggressive in trying to increase availability,
especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).
I am now seeing similar features in Dell servers. But, the *actual*
implementation details are always shrouded in mystery.
But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been
precipitated by it.
Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some baked
in "hard limit".
E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors! Bit rot may not be detected until some time after
it has occurred -- when you eventually access that memory (and the
memory controller throws an error).
This is paradoxically amusing; code to HANDLE errors is likely the least
accessed code in a product. So, bit rot IN that code is more likely to
go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you will
be of the handlers' abilities to address faults that DO manifest!
The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced WHEN
NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that
technology]
RAID, even with backups, is unsuited to high reliability storage of large
databases. Distributed storage can be of much higher reliability:

https://telnyx.com/resources/what-is-distributed-storage

<https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d>

This requires successful retrieval of any n of m data files, normally from
different locations, where n can be arbitrarily smaller than m depending
on your needs. Overkill for small databases but required for high
reliability storage of very large databases.
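
To make the "any n of m" idea concrete in the simplest possible case
(m = n + 1), a toy sketch: n data pieces plus one XOR parity piece, so any
single lost piece can be rebuilt from the survivors. Real distributed stores
use Reed-Solomon-style erasure codes so m - n can be much larger, but the
principle is the same:

# Toy "any n of m" storage for m = n + 1: n data pieces plus one XOR parity
# piece; any one missing piece is recoverable from the rest.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(pieces):
    """Append one parity piece (all data pieces must be the same length)."""
    return pieces + [reduce(xor_bytes, pieces)]

def rebuild(stored):
    """Recover the original data when at most one piece is missing (None)."""
    missing = [i for i, p in enumerate(stored) if p is None]
    if missing:
        i = missing[0]
        stored[i] = reduce(xor_bytes, (p for p in stored if p is not None))
    return stored[:-1]                        # drop the parity piece

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]        # n = 3 data pieces
    shares = encode(data)                     # m = 4 stored pieces
    shares[1] = None                          # lose any one of them...
    print(rebuild(shares))                    # ...and the data comes back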
Don Y
2024-04-19 03:08:17 UTC
Permalink
Post by Glen Walpert
Post by Don Y
The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced WHEN
NEEDED -- if you aren't doing patrol reads/scrubbing to verify that it
is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that
technology]
RAID, even with backups, is unsuited to high reliability storage of large
https://telnyx.com/resources/what-is-distributed-storage
<https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d>
This requires successful retrieval of any n of m data files, normally from
different locations, where n can be arbitrarily smaller than m depending
on your needs. Overkill for small databases but required for high
reliability storage of very large databases.
This is effectively how I maintain my archive. Except that the
media are all "offline", requiring a human operator (me) to
fetch the required volumes in order to locate the desired files.

Unlike mirroring (or other RAID technologies), my scheme places
no constraints as to the "containers" holding the data. E.g.,

DISK43 /somewhere/in/filesystem/ fileofinterest
DISK21 >some>other>place anothernameforfile
CDROM77 \yet\another\place archive.type /where/in/archive foo

Can all yield the same "content" (as verified by their prestored signatures).
Knowing the hash of each object means you can verify its contents from a
single instance instead of looking for confirmation via other instance(s).

[Hashes take up considerably less space than a duplicate copy would]

This makes it easy to create multiple instances of particular "content"
without imposing constraints on how it is named, stored, located, etc.

I.e., pull a disk out of a system, catalog its contents, slap an adhesive
label on it (to be human-readable) and add it to your store.

(If I could mount all of the volumes -- because I wouldn't know which volume
might be needed -- then access wouldn't require a human operator, regardless
of where the volumes were actually mounted or the peculiarities of the
systems on which they are mounted! But, you can have a daemon that watches to
see WHICH volumes are presently accessible and have it initiate a patrol
read of their contents while the media are being accessed "for whatever OTHER
reason" -- and track the time/date of last "verification" so you know which
volumes haven't been checked, recently)

The inconvenience of requiring human intervention is offset by the lack of
wear on the media (as well as BTUs to keep it accessible) and the ease of
creating NEW content/copies. NOT useful for data that needs to be accessed
frequently but excellent for "archives"/repositories -- that can be mounted,
accessed and DUPLICATED to online/nearline storage for normal use.
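
A sketch of the bookkeeping behind that sort of catalog -- key everything by
digest so several differently-named copies on different volumes map to the
same object, and any one retrieved copy can be verified on its own (the
digest below is a placeholder; the volume/path entries reuse the made-up
examples above):

# Content catalog keyed by digest: many (volume, path) instances per object,
# any one of which can be verified against the stored digest by itself.
# The digest string below is a placeholder, not a real hash.

import hashlib
from collections import defaultdict

catalog = defaultdict(list)            # digest -> [(volume, path), ...]

def register(digest, volume, path):
    catalog[digest].append((volume, path))

def verify_copy(digest, local_path):
    """Hash a retrieved copy and confirm it matches the catalogued object."""
    h = hashlib.sha256()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == digest

OBJ = "placeholder-sha256-hexdigest"
register(OBJ, "DISK43", "/somewhere/in/filesystem/fileofinterest")
register(OBJ, "DISK21", ">some>other>place anothernameforfile")
register(OBJ, "CDROM77", r"\yet\another\place archive.type /where/in/archive foo")
print(len(catalog[OBJ]), "known instances of this object")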
boB
2024-04-19 18:16:02 UTC
Permalink
On Thu, 18 Apr 2024 15:05:07 -0700, Don Y
Post by Don Y
Post by Don Y
Is there a general rule of thumb for signalling the likelihood of
an "imminent" (for some value of "imminent") hardware failure?
This reminded me of some past efforts in this area. It was never demonstrated
to me (given ample opportunity) that this technology actually worked on
intermittently failing hardware I had, so be cautious in applying it in any
future endeavors.
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittent sometimes.
Post by Don Y
https://radlab.cs.berkeley.edu/classes/cs444a/KGross_CSTH_Stanford.pdf
Thanks for that. I didn't find it in my collection so it's addition will
be welcome.
Yes, neat paper.

boB
Post by Don Y
Sun has historically been aggressive in trying to increase availability,
especially on big iron. In fact, such a "prediction" led me to discard
a small server, yesterday (no time to dick with failing hardware!).
I am now seeing similar features in Dell servers. But, the *actual*
implementation details are always shrouded in mystery.
But, it is obvious (for "always on" systems) that there are many things
that can silently fail that will only manifest some time later -- if at
all and possibly complicated by other failures that may have been
precipitated by it.
Sorting out WHAT to monitor is the tricky part. Then, having the
ability to watch for trends can give you an inkling that something is
headed in the wrong direction -- before it actually exceeds some
baked in "hard limit".
E.g., only the memory that you actively REFERENCE in a product is ever
checked for errors! Bit rot may not be detected until some time after it
has occurred -- when you eventually access that memory (and the memory
controller throws an error).
This is paradoxically amusing; code to HANDLE errors is likely the least
accessed code in a product. So, bit rot IN that code is more likely
to go unnoticed -- until it is referenced (by some error condition)
and the error event complicated by the attendant error in the handler!
The more reliable your code (fewer faults), the more uncertain you
will be of the handlers' abilities to address faults that DO manifest!
The same applies to secondary storage media. How will you know if
some-rarely-accessed-file is intact and ready to be referenced
WHEN NEEDED -- if you aren't doing patrol reads/scrubbing to
verify that it is intact, NOW?
[One common flaw with RAID implementations and naive reliance on that
technology]
Don Y
2024-04-19 19:10:22 UTC
Permalink
Post by boB
Post by Don Y
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittend sometimes.
My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???

You're going to "bless" a product that you, personally, know has a fault...
boB
2024-04-21 19:37:58 UTC
Permalink
On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
Post by Don Y
Post by boB
Post by Don Y
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittend sometimes.
My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???
You're going to "bless" a product that you, personally, know has a fault...
Yes, it may be hard to replicate but you just have to try and try
again sometimes. Or create something that exercises the unit or
software to make it happen and automatically catch it in the act.

I don't care to have to do that very often. When I do, I just try to
make it a challenge.

boB
Don Y
2024-04-21 21:23:32 UTC
Permalink
Post by boB
On Fri, 19 Apr 2024 12:10:22 -0700, Don Y
Post by Don Y
Post by boB
Post by Don Y
Intermittent failures are the bane of all designers. Until something
is reliably observable, trying to address the problem is largely
wack-a-mole.
The problem I have with troubleshooting intermittent failures is that
they are only intermittend sometimes.
My pet peeve is folks (developers) who OBSERVE FIRST HAND a particular
failure/fault but, because reproducing it is "hard", just pretend it
never happened! Really? Do you think the circuit/code is self-healing???
You're going to "bless" a product that you, personally, know has a fault...
Yes, it may be hard to replicate but you just have to try and try
again sometimes. Or create something that exercises the unit or
software to make it happen and automatically catch it in the act.
I think this was the perfect application for Google Glass! It
seems a given that whenever you stumble on one of these "events",
you aren't concentrating on how you GOT there; you didn't expect
the failure to manifest so weren't keeping track of your actions.

If, instead, you could "rewind" a recording of everything that you
had done up to that point, it would likely go a long way towards
helping you recreate the problem!

When you get a "report" of someone encountering some anomalous
behavior, it's easy to shrug it off because they are often very
imprecise in describing their actions; details (crucial) are
often missing or a subject of "fantasy". Is the person sure
that the machine wasn't doing exactly what it SHOULD in that
SPECIFIC situation??

OTOH, when it happens to YOU, you know that the report isn't
a fluke. But, you are just as weak on the details as those third-party
reporters!
Post by boB
I don't care to have to do that very often. When I do, I just try to
make it a challenge.
Being able to break a design into small pieces goes a long way to
improving its quality. Taking "contractual design" to its extreme
lets you build small, validatable modules that stand a greater
chance of working in concert.

Unfortunately, few have the discipline for such detail, hoping,
instead, to test bigger units (if they do ANY formal testing at all!)

Think of how little formal testing goes into a hardware design.
Aside from imposing inputs and outputs at their extremes, what
*really* happens before a design is released to manufacturing?
(I haven't seen a firm that does a rigorous shake-n-bake in
more than 40 years!)

And, how much less goes into software -- where it is relatively easy to
build test scaffolding and implement regression tests to ensure new
releases don't reintroduce old bugs...
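
The scaffolding in question can be very small. A sketch in pytest style, with
an invented function under test: once a bug is found and fixed, its
reproduction stays behind as a test so the same fault can't silently reappear.

# Minimal regression-test scaffolding (pytest style). The function and its
# past "bug" are invented purely to show the pattern.

import pytest

def scale_reading(raw, gain):
    """Convert a raw ADC count to engineering units (example code under test)."""
    if raw < 0:
        raise ValueError("raw count cannot be negative")
    return raw * gain

def test_nominal_scaling():
    assert scale_reading(100, 0.5) == 50.0

def test_rejects_negative_counts():
    # regression test for a hypothetical field fault: negative counts once
    # wrapped around and reported absurd values instead of being rejected
    with pytest.raises(ValueError):
        scale_reading(-1, 0.5)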

When the emphasis (Management) is getting product out the door,
it's easy to see engineering (and manufacturing) disciplines suffer.

:<
