Of VNX, Mountain Lions, and Lessons Learned

It has been a busy week in the tech industry. There were several major conferences, including Dell Storage Forum, Cisco Live, and Microsoft TechEd, to name a few. Apple also had its annual Worldwide Developers Conference (WWDC). While the iOS 6 announcement may have stolen the show, Apple also announced Mac OS X 10.8 (Mountain Lion).

Now, you may remember my post around this time last year when Mac OS X 10.7 came out and I was asking everyone to please upgrade. Well, I have some good news on this front for 10.8. I received an email the other day from Drew Schlussel saying that the latest beta had completed its testing and things were looking good. This is great news to me and should be encouraging to everyone. Apple can still change things between now and when Mac OS X 10.8 goes GA later this summer, and engineering will continue to test against the latest build as it becomes available.

Looking back at last year, Mac OS X releases present a unique challenge to vendors. The low cost of adoption for customers makes widespread implementation much more common. Combine that with the ever-increasing Bring-Your-Own-Device movement in the workplace, and IT departments are losing control over which versions of software and operating systems are in their environments.


With the amount of time it takes for an engineering department to discover a bug, create a fix, perform testing, and publish the new code, we ended up being one of the few vendors with fixes available before the final version of Mac OS X 10.7 shipped. Once new code is available, it still takes time to do an upgrade. Last year, the majority of our upgrades were still being performed by EMC's Customer Engineers. That additional scheduling time was compounded by the change control in place at many organizations, which are often on a six-month upgrade cycle at best. This perfect storm can spell disaster when a major issue is discovered.

So what is being done to prevent a repeat problem? This year, upgrades on the VNX are essentially a self-service option. When new code is available, customers can use Unisphere Service Manager to upgrade their boxes that day. You no longer need to open a ticket and schedule an on-site visit; it can all be done from the comfort of your office computer (a LAN connection is preferred).

All that is left is your own internal change control process. The VNX is currently on a roughly 6-8 week service pack release cycle. Armed with this knowledge, you can file for your next upgrade as soon as you apply the current one, and you'll stay right in line with all the enhancements and fixes that come with every release. I am a big proponent of shorter upgrade cycles, and I encourage everyone to upgrade their VNX as close as they can to when new code is released.

Get a sneak peek at new VNX features

Today marks the first full day of EMC World 2012.  While everyone is busy watching keynotes and checking out the hands-on labs, I thought I'd offer you a sneak peek at some new VNX features you can look forward to in the second half of 2012.

New RAID Levels for Storage Pools

The first thing I want to talk about is storage pools.  As you are well aware, when you add disks into a storage pool today, you must use the same RAID level in all storage tiers in the pool.

[Image: a storage pool using a single RAID level across all tiers]

As you can see from the picture above, when creating a typical pool with a RAID 6 configuration, you must use it for your Flash, your SAS, and your NL-SAS drives.  This means you must use extra Flash drives just to fill out your pool.  What is changing in the future is a shift towards tier-specific RAID levels.

[Image: a storage pool with different RAID levels per tier]

As you can see in the picture above, you will now be able to have different RAID levels at different tiers in your pool.  By mixing a small amount of Flash with a larger amount of spinning disk, you can put the majority of your unread/archived data on cheaper storage while still being able to afford Flash drives for your performance data.  This translates into a lower initial cost for your storage and a more affordable option for customers just starting out.
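To put some rough numbers behind that lower entry cost, here is a back-of-the-envelope sketch in Python. The drive prices and drive counts are entirely made up for illustration; only the shape of the comparison matters.

```python
# Hypothetical per-drive prices, purely for illustration -- real pricing varies.
FLASH_PRICE = 4000   # per Flash (EFD) drive
NL_SAS_PRICE = 500   # per NL-SAS drive

def pool_cost(flash_drives, nl_sas_drives):
    """Total drive cost for a simple two-tier pool."""
    return flash_drives * FLASH_PRICE + nl_sas_drives * NL_SAS_PRICE

# One RAID level everywhere (e.g. RAID 6 as 6+2): the Flash tier must be
# filled out in the same 8-drive group as the spinning tier.
uniform = pool_cost(flash_drives=8, nl_sas_drives=8)

# Tier-specific RAID levels: a small RAID 1/0 Flash tier (2 drives) in
# front of the same RAID 6 spinning tier.
mixed = pool_cost(flash_drives=2, nl_sas_drives=8)

print(uniform, mixed)  # the mixed pool needs 6 fewer Flash drives up front
```

Even with made-up numbers, the point holds: the Flash tier dominates the entry price, so letting it use a smaller RAID group shrinks the minimum buy-in.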


What the SNAP!

The next big thing coming to the VNX is enhanced block snapshots.  I think everyone is well aware of the limitations of snapshots of LUNs on the VNX.  Well, I'm proud to announce that those are a thing of the past!  With the new functionality, the VNX increases the maximum number of writable snapshots to 256 per LUN.  That also raises the limit to 32,768 per system.  Picture me in my best Boston accent when I say that is a "wicked" high number of snaps.

Also introduced in this enhancement is the ability to take a snapshot of a snapshot.  This opens up all sorts of new use cases, such as test and development environments as well as point-in-time backups.  This functionality has existed on the file side for quite some time now, and I'm glad to see it's making its way to the LUN level as well.


Windows BranchCache Support for CIFS

With the release of Windows 7 and Windows Server 2008 R2, Microsoft added new functionality called BranchCache.  It allows remote computers to cache files and serve them out locally to their peers, reducing bandwidth over the WAN.  The cached data can either be distributed across client PCs or held on a local server in the branch office.  Application performance improves because the data has fewer hops to travel.


In the next big release for the VNX, we will see support for this functionality added to CIFS shares.  For more information, please read the Microsoft TechNet documentation on BranchCache.

Well, that about does it for now: three big new features to look forward to in the second half of 2012.  Please feel free to ask questions in the comments section and I'll try to answer them as best I can.

Making CAVA work with SMB2 on your VNX

As more and more people deploy a new VNX and switch to a modern Windows Server operating system, I am seeing higher utilization of the SMB2 protocol for CIFS.  With this increase come new problems.  Recently I noticed a rather peculiar notification in the server logs regarding CAVA: it was reporting the error "FILE_NOT_FOUND" on scans even though the file existed.  It would present itself as something like this:

2012-04-29 08:49:47: 81878122528: VC: 3: 32: Server '192.168.1.156' returned error 'FILE_NOT_FOUND' when checking file '\root_vdm_2\CIFS\Test\1234.exe'


Standard troubleshooting confirmed that the file did exist.  I even traced it back from the CAVA server through the "check$" share and had no problems with the file.  So why was CAVA reporting errors like this so often?  It turns out the problem was not with CAVA itself, but with an "enhancement" introduced as part of SMB2.

As part of the SMB2 protocol, the Microsoft redirector keeps a local cache of directory metadata, which is normally cleared after 10 seconds.  On file systems with a high rate of change, this creates an inconsistency with what the CAVA server sees when it goes to scan a file: the CAVA server reads from the cache and errors out when the file is not found in it.  That produces the error shown above.
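If you want to gauge how often this is happening before applying the fix, you can tally the errors per server from a saved copy of the server log. This is just a sketch in Python; it assumes straight quotes and the general log-line format shown above, so adjust the pattern to match your actual logs.

```python
import re
from collections import Counter

# Pattern for log lines like the FILE_NOT_FOUND example above; the exact
# format is assumed from that single sample.
PATTERN = re.compile(r"Server '(?P<server>[^']+)' returned error '(?P<error>[A-Z_]+)'")

def tally_errors(lines):
    """Count occurrences of each (server, error) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("server"), match.group("error"))] += 1
    return counts

sample = [
    "2012-04-29 08:49:47: 81878122528: VC: 3: 32: Server '192.168.1.156' "
    "returned error 'FILE_NOT_FOUND' when checking file '\\\\root_vdm_2\\CIFS\\Test\\1234.exe'",
]
print(tally_errors(sample))
```

A sudden spike for one server is a different problem (that server is likely offline); a steady trickle across all servers on a busy file system is the SMB2 caching behavior described here.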


Of course, with a problem comes a workaround.  This was identified and placed into the latest VNX Event Enabler release notes, but I will provide it for you here:

  1. Open the Windows Registry Editor and navigate to HKLM\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters.
  2. Right-click Parameters and select New > DWORD Value.
  3. Name the new REG_DWORD entry DirectoryCacheLifetime.
  4. Set the value to 0 to disable DirectoryCacheLifetime.
  5. Click OK.
  6. Restart the machine.


A simple registry change on each CAVA server, followed by a reboot, sets the cache lifetime to 0 so no more caching occurs.  After this change, you should not see any more problems caused by SMB2.
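If you are rolling this out to several CAVA servers, the same change can be captured in a .reg file and imported, rather than clicking through the steps above. This is just my own transcription of those steps, not an EMC-supplied file:

```reg
Windows Registry Editor Version 5.00

; 0 disables the SMB2 directory metadata cache
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanWorkstation\Parameters]
"DirectoryCacheLifetime"=dword:00000000
```

A reboot is still required after importing the file for the change to take effect.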

Understanding the EMC VNX/Celerra AntiVirus Agent (CAVA): Part 1 – server_viruschk

CAVA is one of the few parts of the Celerra/VNX that cannot be configured and monitored from the GUI.  Most, if not all, of the information you need about CAVA can be found on the command line.  Over the course of a few posts, I will start with a fully working CAVA setup and then work backwards to break it, so you can see common implementation problems and possible performance bottlenecks.  In this first post of the series, I will go line by line through the output of server_viruschk so that you can understand just what the output is saying.  For reference, this is the output I will be working with:
[nasadmin@UberCS ~]$ server_viruschk server_2
server_2 :
 10 threads started.
 1 Checker IP Address(es): 192.168.1.101     ONLINE at Thu May 26 19:41:13 2011 (GMT-00:00)
                        MS-RPC over SMB, CAVA version: 4.8.5.0, ntStatus: SUCCESS
                       AV Engine: Symantec AV
                       Server Name: cava.thulin.local
                        Last time signature updated: Tue May 17 05:55:23 2011 (GMT-00:00)

 1 File Mask(s):
 *.*
 5 Excluded File(s):
 ~$* >>>>>>>> *.PST *.TXT *.TMP
 Share \\UBERCIFS\CHECK$.
 RPC request timeout=25000 milliseconds.
 RPC retry timeout=5000 milliseconds.
 High water mark=200.
 Low water mark=50.
 Scan all virus checkers every 10 seconds.
 When all virus checkers are offline:
 Shutdown Virus Checking.
 Scan on read disable.
 Panic handler registered for 65 chunks.
 MS-RPC User: UBERCIFS$
 MS-RPC ClientName: ubercifs.THULIN.LOCAL


I will now go line by line starting with the first one.
  1. 10 threads started.
    • This is the number of threads for CAVA.  Each thread represents a file that can actively be scanned; CAVA will process up to 10 files at once, distributed across your available CAVA servers.  Any additional files are put into a holding queue until CAVA can get to them.  This limit exists so that we don't overwhelm the AV software running on each CAVA server.  It is adjustable by the support lab if it is determined that doing so will solve a performance issue.
  2. 1 Checker IP Address(es):
    • This line tells you how many CAVA servers you have defined in your viruschecker.conf file.  In this example I only have one server defined, but you should be running at least two.
  3. 192.168.1.101                                  ONLINE at Thu May 26 19:41:13 2011 (GMT-00:00)
    • This line tells you the IP address of your CAVA server, as well as its status and the last time we checked it.  If that line says anything other than ONLINE, there is a problem with the connection from the Windows server to the Celerra, and that server will not be used for scanning.  More information on possible errors will be in a later post.
  4. MS-RPC over SMB, CAVA version: 4.8.5.0, ntStatus: SUCCESS
    • This has three pieces of useful information.  The first is the connection method we use to send commands to the CAVA agent; in this case, we are using the MS-RPC protocol.  Older clients may use the ONCRPC protocol, but that is not supported on 64-bit systems.  The next part tells you the version of CAVA you are running; as of this writing, I am using the latest version (VNX Event Enabler 4.8.5).  Finally, just as the earlier line reported the connection from Windows back to the Celerra, the ntStatus section reports the status of our initial connection to the Windows server.
  5. AV Engine: Symantec AV
    • This tells you the AV software we detected for CAVA to use.  This can be helpful if you have more than one AV engine installed on the client.  In my case, I am using Symantec Endpoint.
  6. Last time signature updated: Tue May 17 05:55:23 2011 (GMT-00:00)
    • This is the last time your AV definitions were updated.
  7. 1 File Mask(s):
    • The number of file masks you have set to scan for.  In this case, it’s just 1 mask.
  8. *.*
    • These are the file masks you have in place.  Any files that match the entries here will be processed for scanning.  In this case I have *.* (everything with a . in it), but you can cut down a lot of traffic if you're only scanning for certain file types.
  9. 5 Excluded File(s):
    • This is how many file exclusion filters you have in place.  In this case I have 5.
  10. ~$* >>>>>>>> *.PST *.TXT *.TMP
    • These are the file filters I have in place.  There are a number of files that AV software just can't scan (like database files).  I also have ~$* and >>>>>>>> in place to ignore Microsoft Office temporary files, as they can become locked temporarily while being scanned and cause a loss of data in the Office application.
  11. Share \\UBERCIFS\CHECK$.
    • This is the beginning of the UNC path that will be sent for file scan requests.  It is determined from the CIFSserver line in viruschecker.conf and will change depending on whether you defined it with the IP address, NetBIOS name, or FQDN.  The check$ share is a hidden share created just for CAVA; the only account that can access it is the one granted the virus-checking privilege.
  12. RPC request timeout=25000 milliseconds.
    • This is the amount of time we will wait for a file to be scanned before trying again.
  13. RPC retry timeout=5000 milliseconds.
    • This is the amount of time we wait for an acknowledgement of each RPC command.
  14. High water mark=200.
    • I mentioned earlier that we process 10 files at a time and that additional files are put into a queue.  The high watermark is the point at which we allocate additional resources to CAVA to work through queued files faster.  Hitting this limit can cause a performance impact on your CIFS servers, so try not to let the queue get this deep.  In my case, I have it set to the default of 200.
  15. Low water mark=50.
    • Just like the high watermark, this is a lower limit that starts to indicate that files are queuing up too fast.  This won’t cause a performance problem, but is an indicator of a possible problem to come.
  16. Scan all virus checkers every 10 seconds.
    • Every 10 seconds we will check the status of each cava server to make sure it’s still online and ready to take requests.
  17. When all virus checkers are offline:
    Shutdown Virus Checking.

    • This is the action we take when none of the CAVA servers are marked ONLINE.  This setting shuts down CAVA so that files don't continue to queue up and hit the high watermark.  The other options are to do nothing (a setting of 'no') or to shut down CIFS (what I like to call paranoia mode).
  18. Scan on read disable.
    • This means that scan on read is not enabled and we are only processing scan on write.  If scan on read were enabled, the cutoff date and time would be listed here.
  19. Panic handler registered for 65 chunks.
    • This is mostly debug information: how many internal failures CAVA would survive before causing a panic.  Every process on the Celerra has a panic handler, and this information is of no use in basic CAVA troubleshooting.
  20. MS-RPC User: UBERCIFS$
    • Earlier I talked about how we use the MS-RPC protocol to connect to the CAVA agent servers.  This is the username we will use for the SMB connection; in this case, we are using the compname of the CIFS server for CAVA.
  21. MS-RPC ClientName: ubercifs.THULIN.LOCAL
    • This is the FQDN of the CIFS server we are using for CAVA, which is used as part of the MS-RPC process.
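For reference, most of the values above trace directly back to viruschecker.conf.  A sketch of a configuration that would produce output similar to this might look like the following; the parameter names here are from memory of the VNX Event Enabler documentation, so verify them against the release notes for your version:

```
masks=*.*
excl=~$*:>>>>>>>>:*.PST:*.TXT:*.TMP
addr=192.168.1.101
CIFSserver=UBERCIFS
highWaterMark=200
lowWaterMark=50
surveyTime=10
shutdown=viruschecking
```

After editing the file, the Data Mover has to re-read it before the changes show up in server_viruschk.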
This concludes my line-by-line explanation of the server_viruschk output.  I hope you understand the output of CAVA a bit better.  In future posts on CAVA, I will talk about some of the different information you might see when there is an error, as well as the output of the -audit option.  Please feel free to ask questions in the comments section below.