Intermittent Server Performance Degradation Troubleshooting

I have been troubleshooting an intermittent server performance issue for too long and I am running out of ideas. I am looking for any suggestions as to how I might be able to identify the cause of the issue.

We (my team and I) developed a client/server Windows Forms application backed by a SQL Server database for a client some years ago. The client recently started experiencing some performance issues and decided to upgrade their infrastructure. They migrated from a single physical SBS machine to a virtual environment with multiple VMs. We successfully migrated the application and SQL bits to the new environment. The client then requested application updates to fix some memory leaks and other performance issues and bugs that they had been living with for years. We made the updates and the system benchmarked well in our environment. We then deployed to their new production environment and the system seemed to run well.

A day or two after the deployment we received complaints about the system hanging or lagging when loading/saving form data or generating reports. We connected to the client remotely and confirmed the issue. We analyzed the client environment and checked for possible memory leaks and other issues that might cause the symptoms. We found none. We then realized that the performance issue was affecting multiple machines on the network and must be environmental. The client then had their hardware support techs check the hardware and network configuration for a possible cause. They found none.

During our rounds of troubleshooting with the client we stumbled onto ways of correcting the performance problem when it arises (which seems random). A server reboot fixes the issue but that is not an acceptable fix.

The other workaround, and the reason I'm posting this, is that when the client notices performance degrading, they can open the "legacy" version of the application (which is still available on some client machines) and performance is restored. The client application instances that are already running do not need to be restarted.

The system performs well between incidents. The issue seems to occur every 2 to 3 days on average, but the system has run incident-free for as long as a week and has also had multiple incidents in a single day (one in the morning and then one in the afternoon).

We were thinking that the issue might be a SQL Server problem, so I've been profiling, saving traces, and monitoring SQL performance counters to look for clues. I'm no SQL performance expert, so I might not be looking at the right counters, but the SQL Server does not seem to be pushed very hard. There are no persistent spikes in CPU, Memory, Batches/Second, Transactions/Second, Compilations/Second, or Re-Compilations/Second, and the paging and cache counters are generally static.

The application may have 10 to 20 active instances running at a time. The application was not originally written with the most efficient data retrieval practices but the load produced is nothing the server can't handle.

I have also been monitoring the Windows Event logs for errors and warnings that might shed some light on the problem, but I haven't seen anything logged just before or during an incident that points to the cause.

Another strange observation is that the application performs without degradation when executed directly on the server, regardless of the overall system performance. I have run the application directly on the server while other machines were experiencing the problem and saw no slowness or lag.

Sorry for the book. I am going to keep digging for clues but any suggestions would be greatly appreciated.

Server: Windows Server 2012 R2 (VM with plenty of resources allocated)
SQL: SQL Server 2014 Standard
Clients: Mixed, but mostly Windows 7 Professional

asked 17 March 2016 at 22:12
2 answers

As for the database, I would start by logging activity to a table, like this. You will need to set the stored proc to run for longer so that data keeps being logged (SET @numberOfRuns = 10), or remove that check altogether.
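If running a logging stored proc on the server is not an option, the same idea can be approximated from any machine with a small polling loop. This is only a rough sketch, not the script referred to above: `log_samples`, `sample_fn`, and the CSV output are all hypothetical names and choices of mine, and `sample_fn` stands in for whatever query you would actually run against the server (for example one that reads the currently executing requests). The `number_of_runs` parameter plays the same role as `@numberOfRuns`; passing `None` removes the check.

```python
import csv
import time
from datetime import datetime, timezone


def log_samples(sample_fn, path, number_of_runs=10, interval_seconds=5.0):
    """Call sample_fn repeatedly and append timestamped rows to a CSV file.

    sample_fn is a hypothetical callable that returns an iterable of row
    tuples describing current activity. number_of_runs=None means keep
    sampling until the process is interrupted.
    """
    runs = 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        while number_of_runs is None or runs < number_of_runs:
            stamp = datetime.now(timezone.utc).isoformat()
            for row in sample_fn():
                writer.writerow([stamp, *row])
            # Flush after every run so the data survives an abrupt stop.
            handle.flush()
            runs += 1
            # Sleep only if another run is still due.
            if number_of_runs is None or runs < number_of_runs:
                time.sleep(interval_seconds)
    return runs
```

The point of keeping it running is that the incidents are days apart, so the log has to cover the moment one starts; you can then compare the rows just before an incident with a quiet period.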

There are tools that make server performance logs easier to analyze: here is one, and here is its authors' blog.

You could try using a network monitor to see what is happening on the client when the problem occurs. Also look at the network adapter traffic counters in perfmon on the server. Perhaps look at the TCP sessions with netstat while the problem is happening. I don't know much about networking, so this may be a case of the blind leading the blind :)
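A cheap complement to a full network capture is to time how long a plain TCP connection to the SQL Server port takes from an affected client, once during an incident and once when things are fine. The sketch below is just an illustration under my own assumptions: the function names are made up, and you would supply your own server name and port (1433 is the default SQL Server port, but your instance may use another).

```python
import socket
import time


def time_tcp_connect(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return None


def probe(host, port, attempts=5, pause=1.0):
    """Collect connect times for several attempts, pausing between them."""
    results = []
    for i in range(attempts):
        results.append(time_tcp_connect(host, port))
        if i < attempts - 1:
            time.sleep(pause)
    return results


# Example (placeholder host name, assumed default port):
#   for elapsed in probe("sqlserver01", 1433):
#       print("failed" if elapsed is None else f"{elapsed * 1000:.1f} ms")
```

If connect times jump or connections fail only during incidents, that points at the network path or the server's TCP stack rather than at SQL Server's query processing. Note that `create_connection` also resolves the host name, so a slow result can come from name resolution as well as from the connection itself.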

answered 3 December 2019 at 23:49

Did you ever figure this out? What connection string does your application use? If it works fine on the server but not on the clients, think about network connectivity. For example, if your connection string uses datasource = computername, then on the server it will use the loopback interface, while on the clients it will go through name resolution to an IP address. Perhaps try using the IP address in the connection string instead of the DNS name, to rule out DNS lookups.
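Before changing the connection string, you can check whether name resolution is actually slow from an affected client. Here is a minimal sketch, assuming you run it on the client during an incident; `time_name_lookup` is a name I made up, and the port argument only matters for the lookup call, not for the timing.

```python
import socket
import time


def time_name_lookup(name, port=1433):
    """Resolve name and return (elapsed_seconds, sorted list of addresses).

    The address list is empty if the name could not be resolved.
    """
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(name, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        infos = []
    elapsed = time.perf_counter() - start
    # Each entry ends with a sockaddr tuple whose first item is the address.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


# Example (placeholder server name):
#   elapsed, addresses = time_name_lookup("sqlserver01")
#   print(f"{elapsed * 1000:.1f} ms -> {addresses}")
```

A lookup that takes seconds during an incident, or that returns a different address than expected, would support the DNS theory. If lookups stay fast, the IP-address change is unlikely to help. Keep in mind that repeated lookups may be answered from the client's resolver cache, so a single fast result is not conclusive.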

answered 3 December 2019 at 23:49
