There are quite a few resources out there to help SAS users get started with writing some Python code. But for those of us who have SAS as a first language, switching to Python is quite a bit more than translating lines of code. It’s a change in world-view.
Here’s how to view the world as a Pythonista. Adopting this perspective will not only help you to replicate your SAS work. It will help you to expand your horizons of what a programming language can do for you.
Firstly though, this is not a SAS vs Python post. I have great respect for both of these ecosystems. And they can be used separately or together in infinite combinations.
I assume we are all grown-ups here, that you’ve done your due diligence on research and made the call that you want to add some Python capability to your business.
What this post will give you is the mental toolkit that will help you get the most out of that decision. If you are like me, you will have at least this set of questions as you set out.
Let’s go through them one by one to orient you to the Python world.
Why do I have to install a new library every time I sneeze?
The great power of Python…and the great pain…is its library system. Unless you are trying to replicate classic computer science algorithms, like bubble sort, from scratch, you will likely have to import at least a couple of libraries for every program you write.
This used to be extremely painful. Some libraries would be buggy to install and would often break a dependent library! Luckily for you, the conda and pip communities in Python have now made this pretty clean. GPU installs of TensorFlow aside, things will go fairly smoothly here.
There is still a small risk of something going catastrophically wrong.
To manage this, create a virtual environment that will isolate your project. Don’t worry about it on Day 1. But as you start to get serious about Python, always use a new venv for a project so you have a fresh set of libraries that won’t interfere with other projects.
You can save that library list to a requirements file and build a new venv in one line later on. https://docs.python.org/3/library/venv.html
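As a sketch of that workflow (assuming a Unix-like shell; the activate step differs slightly on Windows):

```bash
# Create and activate a fresh venv for this project
python -m venv .venv
source .venv/bin/activate

# Install what you need, then snapshot the library list
pip install pandas numpy
pip freeze > requirements.txt

# Later, rebuild the same environment in one line
pip install -r requirements.txt
```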
And why don’t they just add the popular libraries like pandas and numpy to core python and be done with it? Well, this decentralized development means that you get state-of-the-art and robust functionality like reinforcement learning in Python incredibly early. Although you might still want to shell out for commercial software for some production needs, you can at least assemble all of the starting tools for NLP, ML, graph, Deep Learning, simulation or computer vision projects in Python with two lines for each: a shell command to pip install; and an import statement in your code. In many cases, the functionality available will take you all the way to enterprise deployment.
The library system is remarkable and will power your innovation in data analysis and data science for no cost but the sweat of your brow.
What are the best starting libraries and resources for data science in Python?
I won’t belabour the general case here. There are so many resources out there on doing data science with Python. It has become the de facto language in the field. But here are a few notes, especially for you SAS users.
Pandas dataframes are dataset equivalents. You can manipulate them (pivot, derive new fields, summarize, concatenate, calculate) to your heart’s content, all within pandas.
The SAS data step MERGE has an equivalent in df.merge. Sorting and renaming are all pretty standard. The only issue I found is that pandas, while very powerful, is quite verbose. Renaming a column is a journey through curly and square brackets!
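Here is a minimal sketch with made-up tables, showing both the merge and that bracket-laden rename:

```python
import pandas as pd

# Two toy tables, as you might have after separate data steps
customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cal"]})
orders = pd.DataFrame({"id": [1, 1, 3], "amount": [50, 20, 70]})

# Roughly the equivalent of a data step MERGE (or a PROC SQL join)
merged = customers.merge(orders, on="id", how="left")

# The promised journey through curly and square brackets
merged = merged.rename(columns={"amount": "order_amount"})
```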
Read on to find out where you can also just switch to plain old SQL for standard querying and transforms within Python.
If you are a SAS IML user and do a lot of work with matrices, drop into Numpy. You’ll get all of your linear algebra functionality here and with powerful time-savers such as broadcasting. And if you want to hang out with the cool dude Python data scientists, just call your matrices ‘tensors’!
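A quick illustrative sketch of broadcasting, one of those time-savers:

```python
import numpy as np

# A 3x3 matrix (call it a tensor if you want to fit in)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

# Broadcasting: the row of column means is stretched across all rows,
# centring every column without writing a loop
X_centred = X - X.mean(axis=0)

# The usual linear algebra is all here too
XtX = X.T @ X
```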
For Machine Learning, there is no open source equivalent to SAS Enterprise Miner in Python. It is really all about coding in scikit-learn, TensorFlow, PyTorch or Keras. The upside is that these are amazing libraries. If you do want to go the GUI way, you can spend a bit of money on AutoML, RapidMiner, DataRobot or more. Or stick with Enterprise Miner for your GUI ML.
Where does the data persist?
(OK, I was wondering which animals persist so…tortoise. Not sure how long I can keep the animal theme going but let’s see!)
In SAS we traditionally deal with .sas7bdat files sitting on the file system - one file per table (or dataset to use the SAS term). If you are connected into a database such as Teradata or SQL Server through ODBC or a native connection, you will also have access to those relational database schemas. If you are a SAS BI user, you may also have access to OLAP data stores.
What about Python?
Well, Python is a language whereas SAS is a vertically integrated system, so ultimately Python doesn’t care so much. It will support anything. In practice though, data analysts and data scientists typically store data as files such as .csv on disk, at least when in R&D mode.
The nice thing about a csv is that it is fairly generic. It doesn’t hold data types or other metadata. You can specify the datatypes of fields on import as you would like. So you can pass a csv between software with no problem.
The flip side is…you have to define your data types on import. That can be a whole lot of plumbing code. And good luck hunting bugs around formatting, especially for dates. (Dates are a nightmare in every language!)
So if the data types really do matter for your program, consider parquet, a binary format that is just as easy to import and export but that bakes in standard datatypes for you.
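A small sketch of the difference, using a hypothetical sales file (parquet support needs the pyarrow or fastparquet library installed):

```python
import pandas as pd

# With csv, you re-declare types on every import
df = pd.read_csv("sales.csv",
                 dtype={"region": "string"},
                 parse_dates=["order_date"])

# With parquet, the types travel with the file
df.to_parquet("sales.parquet")
df2 = pd.read_parquet("sales.parquet")  # dtypes come back intact
```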
Now, what if you have created a really cool Python object like a pandas dataframe with a multi-index, a multi-dimensional array or even an ML model? Well here, the joblib library comes to the rescue. You can simply save your object to disk with a one-liner and re-import it wherever you like.
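Something like this (the object and filename are just examples):

```python
import joblib
import pandas as pd

# Any Python object will do: here, a dataframe with a multi-index
df = pd.DataFrame({"region": ["N", "S"], "year": [2020, 2021],
                   "amount": [100, 200]}).set_index(["region", "year"])

# One line to save, one line to load (even from a different script)
joblib.dump(df, "my_object.joblib")
df_again = joblib.load("my_object.joblib")
```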
And if you do want to connect to a database, there is wonderful support for that. SQLAlchemy, for example, allows you to connect to most databases with a few lines and a library import. Once you are connected, you can work away with your data as pandas tables, saving the results back to the db, or write SQL directly against the db itself within your code.
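A rough sketch; the connection string and table names here are hypothetical, so swap in your own:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical Postgres connection - use your own dialect and credentials
engine = create_engine("postgresql://user:password@dbhost:5432/sales")

# Pull a query result straight into a dataframe...
df = pd.read_sql("SELECT region, SUM(amount) AS total "
                 "FROM orders GROUP BY region", engine)

# ...work on it in pandas, then save the result back to the database
df.to_sql("region_totals", engine, if_exists="replace", index=False)
```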
How do I hold data in memory?
Python supports all the basic data structures you might expect - variables, lists and sets. Probably the handiest additional one is the dictionary, a simple key-value mapping that comes in handy more often than not. I’m not sure I’ve written a Python program that doesn’t make use of a dictionary.
As your values can take any data type, one great use case is to have a dictionary of named pandas dataframes. This allows you to loop through the set of tables, automatically applying data transformations to the whole set in one sweep. In SAS you could accomplish this with a macro loop. Trust me, the Python way is a lot easier and more legible.
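For example (with a couple of toy tables standing in for your SAS library):

```python
import pandas as pd

# A named set of tables, a bit like a small SAS library
tables = {
    "sales": pd.DataFrame({"amount": [100, 200]}),
    "refunds": pd.DataFrame({"amount": [10, 20]}),
}

# One loop applies the same transformation to every table - no macro needed
for name, df in tables.items():
    tables[name] = df.assign(amount_eur=df["amount"] * 0.92)
```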
Doing data science we typically hold more complex data in higher level data structures like pandas dataframes and numpy arrays.
Once your needs become more bespoke, as they will in a program of any complexity, you are free to create your own classes to house data. Python has a special construct, the dataclass, that allows you to create these concisely. Then the power of OO, with composability and inheritance, allows you to create a sophisticated data model that can be easily programmed against in memory.
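A minimal sketch of a dataclass holding, say, customer data:

```python
from dataclasses import dataclass

@dataclass
class Customer:
    customer_id: int
    name: str
    active: bool = True  # sensible default; no boilerplate __init__ needed

c = Customer(customer_id=42, name="Ann")
print(c)  # Customer(customer_id=42, name='Ann', active=True)
```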
If you are dealing with databases a lot, you may want to create an in-memory relational schema using an Object Relational Mapping (ORM) toolkit such as SQLAlchemy. This will provide a lot of convenience functions for traversing your data model with queries, joins and updates.
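As an illustrative sketch (SQLAlchemy 1.4+ declarative style, with a made-up two-table schema):

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    orders = relationship("Order", back_populates="customer")

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"))
    customer = relationship("Customer", back_populates="orders")
```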
To return to a theme, the power of Python to combine so many different data structures within a program can also become a painful headache. Combined with dynamic typing, it can create all sorts of side-effects and code illegibility. Good modular or OO design, care with naming conventions and type-hinting can smooth out most of these issues.
Can I find an Enterprise Guide equivalent?
I could beat around the bush here and come up with a few options….
But…the answer is just ‘no’. You won’t find it. There are fewer guides in the open source world. It’s a place for the more intrepid data scientist!
The open source world is very programmer-centric. You won’t find a free Python based tool that allows you to manipulate and join data, run models, get statistics, build modules into flows etc. and just generally do everything that you might want to do, all in a powerful UI.
As with Enterprise Miner, this is where SAS really shines. Stick with EGuide if you don’t want to code. Or check out some commercial alternatives like Snowflake for your data work and the ML GUI providers listed above for your modelling.
Do I need to start writing Object Oriented code?
No, you don’t. But you should get to grips with clean, modular code. While SAS is moving towards functions, most SAS code bases are written around the data step, with macros used for reusable sections. Python organized into functions, modules and packages is much more maintainable and legible. Learn the basics and you will start to write cleaner code.
For some use cases, OO does beat out modular. When your program grows beyond a certain complexity, you start to get into spaghetti code, with functions calling functions in ways that are hard to keep track of. One of the tell-tale signs is when you need to pass a bunch of data through 10 functions before it is used.
Start small with OO. Wherever it looks like your modular approach is wearing thin, consider creating a class for a set of data that requires a bunch of common functions on it. Take a look at basic UML and OO patterns and you will soon start building programs worthy of a software engineer.
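A small sketch of that idea: bundle the data with its common functions, instead of threading a dataframe through ten function calls:

```python
import pandas as pd

class SalesReport:
    """Holds the data and the common operations on it in one place."""

    def __init__(self, df: pd.DataFrame):
        self.df = df

    def clean(self) -> "SalesReport":
        self.df = self.df.dropna(subset=["amount"])
        return self

    def total_by_region(self) -> pd.Series:
        return self.df.groupby("region")["amount"].sum()

report = SalesReport(pd.DataFrame({"region": ["N", "S", "N"],
                                   "amount": [100.0, None, 50.0]}))
print(report.clean().total_by_region())
```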
Why does it take so much code to create visualizations?
Once you start working with matplotlib and seaborn you will see the true power of Python for data analysis. But with great power comes great complexity. I still have to google every time I create even the simplest chart.
If you are a graphics nerd, just dive in. There is so much here for you. Seaborn complements matplotlib. Bokeh, dash, streamlit and more will take you to interactivity. If you are a graphics dunce like me, stick with pandas. They have grafted all the basics onto the library so at a wild guess adding something like .plot.bar() to a dataframe will probably give you something reasonable.
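For instance, this is about as hard as pandas plotting gets (matplotlib does the rendering underneath):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"region": ["N", "S", "E"], "amount": [100, 80, 120]})

# One line for a reasonable chart
df.set_index("region")["amount"].plot.bar(title="Sales by region")
plt.show()
```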
For dashboarding, consider an open source toolset like Superset to save many lines of code and get to some nice, interactive functionality. If you want more control over the process, Bokeh and Streamlit are tools that will allow you to create interactive web applications without going too deeply into web app design.
But ultimately, as with EGuide and EMiner, the open source world does not do this kind of user friendly software as well as the commercial outfits. You may save a lot of headaches by sticking with SAS BI or checking out Tableau/PowerBI/Qlik at least for enterprise scale dashboards.
How do I share my work?
One of the great advantages of SAS and other vertically integrated commercial data systems, is their support for client-server type arrangements around sharing of artefacts. There are a couple of options here in Python. Firstly, for collaboration amongst developers, a hosted Jupyter service like The Littlest JupyterHub https://tljh.jupyter.org/en/latest/ is a great way to share code and data.
For sharing results with business users, the dashboarding solutions mentioned above are popular choices. But for a more cohesive enterprise data analysis system, you will want to return to the commercial world with offerings from SAS or alternatives such as Snowflake to support common data consumption and production workflows.
Do I need to worry about Git and version control?
You don’t need to worry about it but you should use it. It is not that SAS users never use Git. But they don’t tend to. And we get into awful messes around version control and releases without it.
Git is revolutionising the way software is built, making program maintenance and updates a whole lot cleaner and less error-prone. It also paves the way for CI/CD (Continuous Integration / Continuous Deployment), which allows large teams to push code reliably and quickly.
The thing about Git is that it is not easy. Like everything else around the Python ecosystem, it is designed for maximum functionality.
You can do anything with it, but sometimes it seems like you can’t do anything with it. As a newbie, you will inevitably get into horrible messes with merging, stashing and rebasing. Windows users have the option of a nice GUI to simplify things.
But the command line is worth learning. Consider your Git pain to be one-off. Once you get your head around it, you will not be able to live without it. No more Monday morning meetings asking ‘Where did my code go? I can see Colin’s new routine to clean the data but we don’t seem to be generating any …data!’
What IDE should I use to write code?
If you are coming out of the SAS world, I assume you are more likely a data scientist than a computer scientist by training. The CS guys and girls will always baulk at this question and claim to use Vi or vim to write code the way that our ancestors did. Those of us who value our sanity will use a modern editor to help catch bugs, offer autocomplete and to access a host of other conveniences such as database querying, markdown rendering and more.
All the cool kids these days spend a lot of time writing Python in Jupyter notebooks. It’s an incredibly handy way to write sequences of code snippets with quick feedback as you run each snippet in a ‘cell’. The installation is fairly painless and you work within the browser.
However, once a program reaches a couple of hundred lines of code and you think you may need to reuse it (hint: you will reuse it!), I find that the more traditional IDEs are indispensable.
With VSCode, PyCharm, Spyder and the others, you can create an application structure with packages and modules more easily and get a sense of how your program is actually called rather than just going with the sequence in which it was written. There are endless debates over this, but consider Jupyter and the fully-fledged IDEs as complementary and choose depending on your requirement.
I am a VSCode user and can’t say enough good things about it. The Python debugger has got me out of more jams than I can remember and the pytest, Flask and other integrations are key for creating my production applications. I hear that PyCharm is just as good, if not better.
Can I find something like Proc SQL?
Yes you can! It’s called pandasql and it works like a dream. You can write your SQL as a string and run it against any pandas dataframe or set of dataframes. I’ve built whole ETL pipelines using it with no problems. For working with normalized data, it provides a cleaner and more legible codebase that you will enjoy coming back to.
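A quick sketch against a toy dataframe:

```python
import pandas as pd
from pandasql import sqldf

orders = pd.DataFrame({"id": [1, 2, 3], "amount": [50, 20, 70]})

# Plain old SQL against a dataframe - very PROC SQL
big_orders = sqldf("SELECT id, amount FROM orders WHERE amount > 30",
                   locals())
```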
But for more heavy-duty ETLs involving complex reusable functions, multi-dimensional indices or time-series, Python and pandas will give you more horsepower. (The popular db systems usually turn to a procedural language around the SQL, like PL/pgSQL or PL/SQL, for this.) At the enterprise scale, you can start to access some great frameworks such as Prefect for orchestration and Great Expectations for schema validation.
You are now a Pythonista!
Well, maybe write some Python code first! But now that you have translated your SAS concepts and requirements into Pythonese,
I hope you can see some of the possibilities of this strange new snake of a language that you have adopted.
Talk to us if you are thinking of translating your SAS concepts and requirements into Pythonese.
Here is how we can help…
Advisory
We help you understand the power & perils of Python for your business
Solutions
We build or co-build AI and data solutions for you in Python with your team
Products
Web APIs, high performance code, beautiful dashboards = useful, robust, maintainable AI systems
Training
Augment your team with training and capability building including sample code builds, bespoke training and code reviews
To chat about creating or powering up your AI system, drop us a quick e-mail at admin@lastmile-ai.com