The Python book 7th edition PDF free download

Apart from several other enhancements, the second edition contains one new chapter on numerical methods of solution. The book formally splits the "pure" and "applied" parts of the contents by placing the discussion of selected mathematical models in separate chapters.

The book can be used independently by the average student to learn the fundamentals of the subject, while those interested in pursuing more advanced material can regard it as an easily taken first step on the way to the next level. Additionally, practitioners who encounter differential equations in their professional work will find this text to be a convenient source of reference.

Numerous examples from physics, technology, biomathematics, cosmology, economics and optimization allow a quick and motivating approach; abstract proofs and unnecessary formalism are avoided as far as possible. The focus is on the modelling of ordinary differential equations of the 1st and 2nd order as well as their analytical and numerical solution methods, with the theory treated briefly before the application examples.

In addition, sample code shows how even more demanding questions can be answered, and the results meaningfully represented, with the help of a computer algebra system.

The first chapter covers the necessary prerequisite knowledge from integral and differential calculus. A large number of exercises, including solutions, round off the work. The audience consisted of academics from New York University and other universities, as well as practitioners from investment banks, hedge funds and asset-management firms. The contributions to the 12th FLINS conference cover state-of-the-art research, development, and technology for computational intelligence systems, from both the foundations and the applications points of view.

The emphasis in this book is on theory and methods, and on differential equations as a part of analysis. The subject is worth studying in its own right, rather than merely as a set of recipes to be used in physical science. The text gives substantial emphasis to methods, which are generally presented first, with theoretical considerations following.

Essentially all proofs of the theorems used are included, making the book more useful as a reference. The book mentions the main computer algebra systems, yet the emphasis is placed on MATLAB and numerical methods which include graphing the solutions and obtaining tables of values. Featured applications are easily understood. Complete explanations of the mathematics and emphasis on methods for finding solutions are included.

Author: Steven J. Miller. Publisher: American Mathematical Soc. Category: Management science. Optimization Theory is an active area of research with numerous applications; many of the books are designed for engineering classes, and thus emphasize problems from such fields. Covering much of the same material, this book places less emphasis on coding and detailed applications, as the intended audience is more mathematical.

Several important problems are still discussed (especially scheduling problems), but there is more emphasis on theory and less on the nuts and bolts of coding.

Why are we able to do a calculation efficiently? How should we look at a problem? As many of the key algorithms in the subject require too much time or detail to analyze in a first course (such as the run-time of the Simplex Algorithm), there are numerous comparisons to simpler algorithms which students have either seen or can quickly learn (such as the Euclidean algorithm) to motivate the type of results on run-time savings.

Readers are also instructed in the extended potential theory in its three forms: the volume potential, the surface single-layer potential and the surface double-layer potential.

Furthermore, the book presents the main initial boundary value problems associated with elliptic, parabolic and hyperbolic equations. The second part of the book, which is addressed first and foremost to those who are already acquainted with the notions and the results from the first part, introduces readers to modern aspects of the theory of partial differential equations.

This field pertains to the design, analysis, and implementation of algorithms for the approximate solution of mathematical problems that arise in applications spanning science and engineering, and are not practical to solve using analytical techniques such as those taught in courses in calculus, linear algebra or differential equations. Topics covered include computer arithmetic, error analysis, solution of systems of linear equations, least squares problems, eigenvalue problems, nonlinear equations, optimization, polynomial interpolation and approximation, numerical differentiation and integration, ordinary differential equations, and partial differential equations.

For each problem considered, the presentation includes the derivation of solution techniques, analysis of their efficiency, accuracy and robustness, and details of their implementation, illustrated through the Python programming language.

This text is suitable for a year-long sequence in numerical analysis, and can also be used for a one-semester course in numerical linear algebra. Focusing on the modeling of real-world phenomena, it begins with a basic introduction to differential equations, followed by linear and nonlinear first order equations and a detailed treatment of the second order linear equations.

After presenting solution methods based on the Laplace transform and power series, it lastly presents systems of equations and offers an introduction to stability theory. To help readers practice the theory covered, two types of exercises are provided: those that illustrate the general theory, and others designed to expand on the text material.

Detailed solutions to all the exercises are included. The book is well suited for use as a textbook for an undergraduate class in ordinary differential equations for students of all disciplines. A table in the original text provides a list of useful aggregation functions available in NumPy. We may also wish to compute quantiles, such as the 25th percentile; a minimal sketch is shown below.
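A minimal sketch of computing aggregates and quantiles with NumPy; the heights array here is invented sample data, not data from the original text.

import numpy as np

heights = np.array([170, 165, 180, 175, 160, 172, 168])  # hypothetical sample data

print("Mean:            ", np.mean(heights))
print("Standard dev.:   ", np.std(heights))
print("Minimum:         ", np.min(heights))
print("25th percentile: ", np.percentile(heights, 25))
print("Median:          ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
print("Maximum:         ", np.max(heights))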

Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) to arrays of different sizes. We can similarly extend this to arrays of higher dimension. While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. The geometry of these examples is visualized in the original text's figure "Visualization of NumPy broadcasting". The light boxes represent the broadcasted values: again, this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
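As a minimal sketch of these rules (the array names M, a, b and c are assumptions for illustration, not taken from the original text):

import numpy as np

M = np.ones((2, 3))      # shape (2, 3)
a = np.arange(3)         # shape (3,)

# Rule 1: a is padded on the left to shape (1, 3)
# Rule 2: the dimension of size 1 is stretched to match M, giving shape (2, 3)
print(M + a)             # a is broadcast across both rows of M

b = np.arange(3)[:, np.newaxis]   # shape (3, 1)
c = np.arange(3)                  # shape (3,)
print(b + c)                      # both arrays are broadcast to shape (3, 3)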

The shapes of the two arrays determine how they are broadcast together. How does this affect the calculation? But this is not how the broadcasting rules work! That sort of flexibility might be useful in some cases, but it would lead to potential areas of ambiguity. Centering an array: in the previous section, we saw that ufuncs allow a NumPy user to remove the need to explicitly write slow Python loops.

Broadcasting extends this ability. Imagine you have an array of 10 observations, each of which consists of 3 values; broadcasting lets you center the data by subtracting the column-wise mean in a single expression. Plotting a two-dimensional function: one place that broadcasting is very useful is in displaying images based on two-dimensional functions. In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.

Example: counting rainy days. Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain? Digging into the data: one approach would be to answer these questions by hand, looping through the data and incrementing a counter each time we see values in some desired range.

The result of these comparison operators is always an array with a Boolean data type. Working with Boolean arrays: given a Boolean array, there are a host of useful operations you can do. Another way to get at this information is to use np.any() or np.all(); for example, are all values in each row less than 8? Python's built-in any(), all(), and sum() have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays, so be sure that you are using np.any(), np.all(), and np.sum(). But what if we want to know about all days with rain less than four inches and greater than one inch?

For example, we can address this sort of compound question by combining comparisons with the bitwise operators & and |. Some examples of results we can compute when combining masking with aggregations are sketched below.
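A minimal sketch of counting and masking with Boolean arrays; the rainfall values are invented stand-ins for the precipitation data discussed above.

import numpy as np

rainfall = np.array([0.0, 0.2, 0.0, 1.1, 0.0, 0.6, 2.3, 0.0, 0.4, 0.0])  # hypothetical inches

# counting entries that satisfy a condition
print("Days without rain:      ", np.sum(rainfall == 0))
print("Days with rain:         ", np.sum(rainfall > 0))
print("Days with > 0.5 inches: ", np.sum(rainfall > 0.5))

# compound conditions use the bitwise operators & and |, with parentheses
print("Rainy days with < 1 in.:", np.sum((rainfall > 0) & (rainfall < 1.0)))

# Boolean arrays as masks: select the rainy days and aggregate them
rainy = rainfall > 0
print("Median rain on rainy days:", np.median(rainfall[rainy]))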

A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves. We are then free to operate on these values as we wish. When would you use the Python keywords and/or versus the bitwise operators &/|? In Python, all nonzero integers will evaluate as True. For Boolean NumPy arrays, the latter (the element-wise bitwise operators) is nearly always the desired operation. Fancy indexing: in the previous sections, we saw how to access and modify portions of arrays using simple indices, slices, and Boolean masks.

Fancy indexing can also be combined with slicing and with broadcasting via np.newaxis. Modifying values with fancy indexing: just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. For a repeated index in a plain assignment such as x[[0, 0]] = [4, 6], the result, of course, is that x[0] contains the value 6: the two assignments simply happen one after the other. Why, then, does a repeated augmented assignment not accumulate? With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.

So what if you want the other behavior, where the operation is repeated? For this, you can use the at method of ufuncs, available since NumPy 1.8. Another method that is similar in spirit is the reduceat method of ufuncs, which you can read about in the NumPy documentation.
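A minimal sketch of the difference between repeated fancy-index assignment and the at method of ufuncs; the arrays here are invented for illustration.

import numpy as np

x = np.zeros(10)
i = np.array([0, 0, 0, 3, 3])

# with plain fancy indexing, the assignment (not the addition) is what repeats,
# so x[0] ends up as 1, not 3
x[i] += 1
print(x)

# np.add.at applies the addition once per index, including repeats
y = np.zeros(10)
np.add.at(y, i, 1)
print(y)   # y[0] == 3.0, y[3] == 2.0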

Example: binning data. You can use these ideas to efficiently bin data to create a histogram by hand. For example, imagine we have 1,000 values and would like to quickly find where they fall within an array of bins.

We could compute this binning ourselves using ufunc.at, as sketched below (the original text shows the result in the figure "A histogram computed by hand"). Of course, it would be silly to have to do this each time you want to plot a histogram.
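A minimal sketch of binning data by hand with np.searchsorted and np.add.at; the random data and bin edges are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # 1,000 hypothetical values

bins = np.linspace(-5, 5, 20)    # bin edges; all values are assumed to fall in this range
counts = np.zeros_like(bins)

# find the appropriate bin for each value
i = np.searchsorted(bins, x)

# add 1 to the count of each of these bins, once per occurrence
np.add.at(counts, i, 1)

print(counts)                    # a histogram computed by hand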

This is why Matplotlib provides the plt.hist() routine. To compute the binning, Matplotlib uses the np.histogram function. How can this be? If you dig into the np.histogram source code, you can see exactly how it computes the binning. Sorting arrays: up to this point we have been concerned mainly with tools to access and operate on array data with NumPy. This section covers algorithms related to sorting values in NumPy arrays. All are means of accomplishing a similar task: sorting the values in a list or array. Fortunately, Python contains built-in sorting algorithms that are much more efficient than the simplistic algorithms one might write by hand.

Fast sorting in NumPy: np.sort and np.argsort. By default np.sort uses an efficient O(N log N) algorithm. To return a sorted version of the array without modifying the input, you can use np.sort; to sort in place, use the array's own sort method. NumPy also provides a partial sort in the np.partition function: within the two partitions, the elements have arbitrary order. Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array. Finally, just as there is an np.argsort that computes indices of the sort, there is an np.argpartition that computes indices of the partition. With the pairwise square-distances computed, we can use np.argsort along each row to find each point's nearest neighbors, or np.argpartition to do so more cheaply. The original text's figure "Visualization of the neighbors of each point" shows the result: each point in the plot has lines drawn to its two nearest neighbors.
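As a minimal sketch of the approach just described (the point set is invented, and names such as dist_sq are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(42)
X = rng.random((10, 2))                   # 10 hypothetical points in 2 dimensions

# pairwise squared distances via broadcasting:
# the differences have shape (10, 10, 2); summing over the last axis gives (10, 10)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)

# argsort each row; column 0 is the point itself (distance zero),
# columns 1 and 2 are the indices of its two nearest neighbors
nearest = np.argsort(dist_sq, axis=1)
print(nearest[:, 1:3])

# np.argpartition does the same job without a full sort of each row
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
print(nearest_partition[:, :K + 1])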

At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A. Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python.

You might be tempted to do the same type of operation by manually looping through the data and sorting each set of neighbors individually, but this would almost certainly lead to a slower algorithm than the vectorized version we used. Big-O notation: Big-O notation is a means of describing how the number of operations required for an algorithm scales as the input grows in size. Far more common in the data science world is a less rigid use of big-O notation: as a general (if imprecise) description of the scaling of an algorithm.

Big-O notation, in this loose sense, tells you how much time your algorithm will take as you increase the amount of data. For our purposes, N will usually indicate some aspect of the size of the dataset (the number of points, the number of dimensions, etc.).

Notice that big-O notation by itself tells you nothing about the actual wall-clock time of a computation, but only about its scaling as you change N. For small datasets in particular, the algorithm with better scaling might not be faster. Creating structured arrays: structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method using np.dtype.

The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table in the original text). The last character or characters represent the size of the object in bytes.
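A minimal sketch of the two ways of specifying a structured dtype discussed above; the field names and values are invented for illustration.

import numpy as np

# dictionary method
dt1 = np.dtype({'names': ('name', 'age', 'weight'),
                'formats': ('U10', 'i4', 'f8')})

# compact string method: U10 = Unicode string of length 10,
# i4 = 4-byte integer, f8 = 8-byte float
dt2 = np.dtype('U10, i4, f8')

data = np.zeros(3, dtype=dt1)
data['name'] = ['Alice', 'Bob', 'Cathy']
data['age'] = [25, 45, 37]
data['weight'] = [55.0, 85.5, 68.0]

print(data[data['age'] < 40]['name'])   # structured arrays support Boolean masking too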

For example, you can create a type where each element contains an array or matrix of values. Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary? On to Pandas: this section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame.

As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

In this chapter, we will focus on the mechanics of using Series, DataFrame, and related structures effectively. We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.

Details on this installation can be found in the Pandas documentation. If you followed the advice outlined in the preface and used the Anaconda stack, you already have Pandas installed. Once Pandas is installed, you can import it and check the version with import pandas followed by pandas.__version__.

For example, to display all the contents of the pandas namespace, you can use tab completion on pd. in IPython. Introducing Pandas objects: at the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.

A Series wraps both a sequence of values and a sequence of indices. The values are simply a familiar NumPy array, accessible via the values attribute. For example, the index need not be an integer, but can consist of values of any desired type. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
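A minimal sketch of the Series object; the values and index labels are invented for illustration.

import pandas as pd

# default integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)      # the underlying NumPy array
print(data.index)       # a pd.Index object

# explicit, non-integer index: the Series behaves like a typed dictionary
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['b'])

# a Series can also be built directly from a Python dictionary
population = pd.Series({'California': 38332521, 'Texas': 26448193})
print(population['California'])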

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence. DataFrame as a generalized NumPy array: if a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects.

DataFrame as specialized dictionary: similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For a DataFrame, data['col0'] will return the first column. A DataFrame can be constructed in several ways, for example from a single Series object.

Any list of dictionaries can be made into a DataFrame; pd.DataFrame(data) then produces, for example:

   a  b
0  0  0
1  1  2
2  2  4

Even if some keys in the dictionary are missing, Pandas will fill them in with NaN (i.e., "Not a Number") values.

As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well. Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names; if omitted, an integer index will be used for each. Several of these construction routes are sketched below.
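A minimal sketch of several DataFrame construction routes mentioned above; the column names and values are invented for illustration.

import numpy as np
import pandas as pd

# from a single Series object
area = pd.Series({'California': 423967, 'Texas': 695662})
print(pd.DataFrame(area, columns=['area']))

# from a list of dictionaries; missing keys are filled with NaN
print(pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]))

# from a dictionary of Series objects
population = pd.Series({'California': 38332521, 'Texas': 26448193})
print(pd.DataFrame({'population': population, 'area': area}))

# from a two-dimensional NumPy array, with explicit column and index names
print(pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c']))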

This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. Index as ordered set: Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.

These include indexing, slicing, masking, and fancy indexing. Data selection in Series: as we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.

Examples of these are as follows: slicing by explicit index, as in data['a':'c']. Notice that when you are slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index the final index is excluded. Indexers: loc, iloc, and ix. These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

First, the loc attribute allows indexing and slicing that always references the explicit index; second, the iloc attribute allows indexing and slicing that always references the implicit Python-style index. The purpose of the ix indexer will become more apparent in the context of DataFrame objects, which we will discuss in a moment.

Data selection in DataFrame: recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure. DataFrame as a dictionary: the first analogy we will consider is the DataFrame as a dictionary of related Series objects.

For example, if the column names are not strings, or if the column names conflict with methods of the DataFrame, this attribute-style access is not possible. DataFrame as two-dimensional array: as mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute. For example, we can transpose the full DataFrame to swap rows and columns with data.T. In particular, passing a single index to the raw array accesses a row, as in data.values[0].

Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result.
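A minimal sketch of explicit versus implicit indexing with loc and iloc; the data values are invented for illustration.

import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])      # explicit index -> 'a'
print(data.loc[1:3])    # explicit slicing, final index included -> 'a', 'b'
print(data.iloc[1])     # implicit Python-style index -> 'b'
print(data.iloc[1:3])   # implicit slicing, final index excluded -> 'b', 'c'

df = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])
print(df.loc['Texas', 'pop'])    # label-based row and column selection
print(df.iloc[0, 1])             # position-based selection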

For example, in the loc indexer we can combine masking and fancy indexing, selecting rows with a Boolean mask and columns by name in a single expression. Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.

We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures. UFuncs: index alignment. For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. For example, calling A.add(B) is equivalent to A + B, but allows optional explicit specification of the fill value for any elements that might be missing. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.
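A minimal sketch of index alignment in binary operations; the series A and B are invented for illustration.

import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# the result contains the union of the indices; missing entries become NaN
print(A + B)

# the same operation via the add method, with a fill value for missing entries
print(A.add(B, fill_value=0))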

Handling missing data: the difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. Trade-offs in missing data conventions: a number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.

Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry. In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.

In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with a special value (e.g., -9999) or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.

None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation. Common special values like NaN are not available for all data types. As in most cases where no universally optimal choice exists, different languages and systems use different conventions. Missing data in Pandas: the way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.

While R contains four basic data types, NumPy supports far more than this: for example, while R has a single integer type, NumPy supports fourteen basic integer types once you account for available precisions, signedness, and endianness of the encoding.

Further, for the smaller data types (such as 8-bit integers), sacrificing a bit to use as a mask would significantly reduce the range of values it can represent. With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.

This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest. None: Pythonic missing data. The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code.

You should be aware that NaN is a bit like a data virus: it infects any other object it touches. NaN and None in Pandas: NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate, as in pd.Series([1, np.nan, 2, None]). For example, if we set a value in an integer array to np.nan, the array will be upcast to a floating-point type.

Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included. A table in the original text lists the upcasting conventions in Pandas when NA values are introduced. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

isnull: generate a Boolean mask indicating missing values
notnull: opposite of isnull
dropna: return a filtered version of the data
fillna: return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.

Detecting null values: Pandas data structures have two useful methods for detecting null data, isnull and notnull.

Either one will return a Boolean mask over the data. Dropping null values: in addition to the masking used before, there are the convenience methods dropna (which removes NA values) and fillna (which fills in NA values). For a Series, the result is straightforward: data.dropna(). Depending on the application, you might want one or the other, so dropna gives a number of options for a DataFrame. By default, dropna will drop all rows in which any null value is present.
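A minimal sketch of detecting and dropping null values; the data values are invented for illustration.

import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 'hello', None])
print(data.isnull())          # Boolean mask of missing entries
print(data[data.notnull()])   # masking out the nulls
print(data.dropna())          # same result as a convenience method

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
print(df.dropna())                 # drops any row containing a null value
print(df.dropna(axis='columns'))   # drops any column containing a null value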

This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.

You could do this in-place using the isnull method as a mask, but because it is such a common operation Pandas provides the fillna method, which returns a copy of the array with the null values replaced. Often it is useful to go beyond this and store higher-dimensional data—that is, data indexed by more than one or two keys.

In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects. For concreteness, we will consider a series of data where each point has a character and numerical key. The bad way: suppose you would like to track data about states from two different years. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

This syntax is much more convenient, and the operation is much more efficient! MultiIndex as extra dimension: you might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Methods of MultiIndex creation: the most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor.

Explicit MultiIndex constructors: for more flexibility in how the index is constructed, you can instead use the class method constructors available in the pd.MultiIndex class. For example, as we did before, you can construct the MultiIndex from a simple list of arrays, giving the index values within each level, using pd.MultiIndex.from_arrays.

MultiIndex level names: sometimes it is convenient to name the levels of the MultiIndex. You can accomplish this by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact, as sketched below.
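A minimal sketch of constructing, naming, and indexing a MultiIndex; the state/year data are invented for illustration.

import pandas as pd

index = pd.MultiIndex.from_arrays([['CA', 'CA', 'NY', 'NY'],
                                   [2000, 2010, 2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457, 19378102], index=index)

# naming the levels after the fact
pop.index.names = ['state', 'year']

print(pop['CA'])       # partial indexing on the first level
print(pop[:, 2010])    # selecting all states for a given year
print(pop.unstack())   # converting to a conventional DataFrame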

MultiIndex for columns: in a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.

Indexing and slicing a MultiIndex: indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. Rearranging multi-indices: one of the keys to working with multiply indexed data is knowing how to effectively transform the data.

Sorted and unsorted indices Earlier, we briefly mentioned a caveat, but we should emphasize it more here. Many of the MultiIndex slicing operations will fail if the index is not sorted.

For hierarchically indexed data, aggregation methods can be passed a level parameter that controls which subset of the data the aggregate is computed on. Panel data: Pandas has a few other fundamental data structures that we have not yet discussed, namely the pd.Panel and pd.Panel4D objects. These can be thought of, respectively, as three-dimensional and four-dimensional generalizations of the one-dimensional Series and two-dimensional DataFrame structures.

Once you are familiar with indexing and manipulation of data in a Series and DataFrame, Panel and Panel4D are relatively straightforward to use. Additionally, panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation. As the number of dimensions increases, the dense representation can become very inefficient for the majority of real-world datasets. Combining datasets: concat and append. Some of the most interesting studies of data come from combining different data sources.

Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward. Like np.concatenate, pd.concat can be used for a simple concatenation of Series or DataFrame objects. Duplicate indices: one important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices. While this is valid within DataFrames, the outcome is often undesirable.

Catching the repeats as an error: with the verify_integrity flag set to True, the concatenation will raise an exception if there are duplicate indices. Sometimes the index itself does not matter, and you would prefer it to simply be ignored: with the ignore_index flag set to True, the concatenation will create a new integer index for the resulting Series. Another alternative is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed series containing the data. These options are sketched below.
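A minimal sketch of pd.concat and the flags discussed above; the frames are invented for illustration.

import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3']}, index=[0, 1])   # note the repeated index

print(pd.concat([x, y]))                      # indices are preserved, even duplicates
print(pd.concat([x, y], ignore_index=True))   # build a fresh integer index instead
print(pd.concat([x, y], keys=['x', 'y']))     # label the sources with a hierarchical index

# verify_integrity=True raises a ValueError because of the duplicated index
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)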

Concatenation with joins: in the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. Consider the concatenation of two DataFrames which have some (but not all!) columns in common.

The append method: because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2). Keep in mind that unlike the append and extend methods of Python lists, the append method in Pandas does not modify the original object; instead, it creates a new object with the combined data. It also is not a very efficient method, because it involves creation of a new index and data buffer. Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the concat function. Combining datasets: merge and join. One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.

If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function. Relational algebra: the behavior implemented in pd.merge is a subset of what is known as relational algebra, a formal set of rules for manipulating relational data. Pandas implements several of these fundamental building blocks in the pd.merge function and in the related join method of Series and DataFrames.

As we will see, these let you efficiently link data from different sources. Categories of joins: the pd.merge function implements a number of types of joins: one-to-one, many-to-one, and many-to-many. All three types of joins are accessed via an identical call to the pd.merge interface; the type of join performed depends on the form of the input data. Here we will show simple examples of the three types of merges, and discuss detailed options further below.

The result of the merge is a new DataFrame that combines the information from the two inputs. Many-to-one joins: many-to-one joins are joins in which one of the two key columns contains duplicate entries. Many-to-many joins: many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.

This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group. However, often the column names will not match so nicely, and pd.merge provides a variety of options for handling this (for example, the left_on and right_on keywords).

Specifying set arithmetic for joins: in all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join.

This comes up when a value appears in one key column but not the other. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to 'inner'.

An outer join returns a join over the union of the input columns, and fills in all missing values with NAs; the left and right joins return joins over the left entries and right entries, respectively. All of these options can be applied straightforwardly to any of the preceding join types, as sketched below. Overlapping column names: the suffixes keyword. Finally, you may end up in a case where your two input DataFrames have conflicting column names.
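A minimal sketch of the join types and the default suffix behavior; the frames df6 through df9 referenced above are not available in this extract, so the frames here are invented stand-ins.

import pandas as pd

left = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                     'food': ['fish', 'beans', 'bread']})
right = pd.DataFrame({'name': ['Mary', 'Joseph'],
                      'drink': ['wine', 'beer']})

print(pd.merge(left, right))                # inner join: intersection of keys (Mary only)
print(pd.merge(left, right, how='outer'))   # union of keys, missing values become NaN
print(pd.merge(left, right, how='left'))    # keep every row of the left frame

# conflicting column names get the default suffixes _x and _y
l2 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
r2 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})
print(pd.merge(l2, r2, on='name'))                          # columns rank_x and rank_y
print(pd.merge(l2, r2, on='name', suffixes=['_L', '_R']))   # custom suffixes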

If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword. As a worked example, here we consider some data about US states and their populations.

We can look for rows with null values using merged[merged['population'].isnull()]. More importantly, we see that some of the new state entries are also null, which means that there was no corresponding entry in the abbrevs key!

We can fix these quickly by filling in appropriate entries. Now we can merge the result with the area data using a similar procedure. We can see that by far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); we can also check the end of the list with density.tail(). This type of messy data merging is a common task when one is trying to answer questions using real-world data sources.

It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets, or exoplanets for short). For example, we see in the year column that although exoplanets were discovered decades ago, half of all known exoplanets were not discovered until quite recently.

A table in the original text summarizes some other built-in Pandas aggregation methods:

count: total number of items
first, last: first and last item
mean, median: mean and median
min, max: minimum and maximum
std, var: standard deviation and variance
mad: mean absolute deviation
prod: product of all items
sum: sum of all items

These are all methods of DataFrame and Series objects.

To go deeper into the data, however, simple aggregates are often not enough. The next level of data summarization is the groupby operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

GroupBy: Split, Apply, Combine. Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby operation.

Rather, the GroupBy can often do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way. The power of the GroupBy is that it abstracts away these steps: the user need not think about how the computation is done under the hood, but rather thinks about the operation as a whole.

This object is where the magic is: you can think of it as a special view of the DataFrame, which is poised to dig into the groups but does no actual computation until the aggregation is applied. Perhaps the most important operations made available by a GroupBy are aggregate, filter, transform, and apply.

Column indexing: the GroupBy object supports column indexing in the same way as the DataFrame, as in planets.groupby('method')['orbital_period']. As with the GroupBy object, no computation is done until we call some aggregate on the object. Iteration over groups: the GroupBy object supports direct iteration over the groups, returning each group as a Series or DataFrame, as in for method, group in planets.groupby('method'). Dispatch methods: through some Python class magic, any method not explicitly implemented by the GroupBy object will be passed through and called on the groups, whether they are DataFrame or Series objects.

For example, you can use the describe method of DataFrames to perform a set of aggregations that describe each group in the data. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until relatively recently. This is just one example of the utility of dispatch methods.

Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Aggregate, filter, transform, apply: the preceding discussion focused on aggregation for the combine operation, but there are more options available.

In particular, GroupBy objects have aggregate(), filter(), transform(), and apply() methods that efficiently implement a variety of useful operations before combining the grouped data.

It can take a string, a function, or a list thereof, and compute all the aggregates at once (a quick example combining these appears in the sketch below). A filtering operation, by contrast, lets you drop data based on the group properties; here, because group A does not have a standard deviation greater than 4, it is dropped from the result. A transformation instead returns some transformed version of the full data to recombine; for such a transformation, the output is the same shape as the input.

A common example is to center the data by subtracting the group-wise mean. The apply method lets you apply an arbitrary function to the group results. The function should take a DataFrame and return either a Pandas object (e.g., a DataFrame or Series) or a scalar; the combine operation will be tailored to the type of output returned.
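A minimal sketch of the four GroupBy operations named above; the small frame df is invented for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# aggregate: strings, functions, or a list thereof, computed per group
print(df.groupby('key').aggregate(['min', 'median', 'max']))

# filter: drop whole groups based on a group-level property
print(df.groupby('key').filter(lambda g: g['data2'].std() > 4))

# transform: return data with the same shape as the input,
# here centering each group on its own mean
print(df.groupby('key').transform(lambda g: g - g.mean()))

# apply: an arbitrary function taking a DataFrame and returning a
# DataFrame, Series, or scalar
print(df.groupby('key').apply(lambda g: g['data2'].sum()))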

The split key can also be a list, array, series, or index providing the grouping keys; the key can be any series or list with a length matching that of the DataFrame. Similar to mapping, you can pass any Python function that will input the index value and output the group. Further, any of the preceding key choices can be combined to group on a multi-index. We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!

A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation.

That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid. This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class.

In code, this is a single call to pivot_table, with the survival column as the value and sex and class as the index and columns. First-class women survived with near certainty (hi, Rose!). For example, we might be interested in looking at age as a third dimension. The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. Additionally, it can be specified as a dictionary mapping a column to any of the desired options; a sketch is shown below.
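A minimal sketch of pivot tables in this style, using the Titanic dataset the surrounding text discusses; loading it via seaborn is an assumption of this sketch, not something stated in the extract.

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')   # assumed data source for this sketch

# survival rate grouped by sex and class
print(titanic.pivot_table('survived', index='sex', columns='class'))

# adding age as a third dimension by binning it into a multi-level index
age = pd.cut(titanic['age'], [0, 18, 80])
print(titanic.pivot_table('survived', index=['sex', age], columns='class'))

# aggfunc as a dictionary mapping columns to aggregation functions
print(titanic.pivot_table(index='sex', columns='class',
                          aggfunc={'survived': 'sum', 'fare': 'mean'}))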

At times it is useful to compute totals along each grouping; this can be done via the margins keyword. With a simple pivot table and plot method, we can immediately see the annual trend in births by gender (the original text's figure "Total number of US births by year and gender" shows the result).

We must start by cleaning the data a bit, removing outliers caused by mistyped dates. Creating a datetime index from the year, month, and day then allows us to quickly compute the weekday corresponding to each row.

Apparently births are slightly less common on weekends than on weekdays (see the original text's figure "Average daily births by day of week and decade")! Note that some decades are missing because the CDC data contains only the month of birth from a certain year onward. From this, we can use the plot method to plot the data. Vectorized string operations: one strength of Python is its relative ease in handling and manipulating string data.

Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when one is working with (read: cleaning up) real-world data. Introducing Pandas string operations: we saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. Pandas str methods that mirror Python string methods include: len, lower, translate, islower, ljust, upper, startswith, isupper, rjust, find, endswith, isnumeric, center, rfind, isalnum, isdecimal, zfill, index, isalpha, split, strip, rindex, isdigit, rsplit, rstrip, capitalize, isspace, partition, lstrip, swapcase, istitle, rpartition. Notice that these have various return values.

Some, like lower(), return a series of strings. With these, you can do a wide range of interesting operations. The get() and slice() operations, in particular, enable vectorized element access from each entry; for example, we can get a slice of the first three characters of each entry using str.slice(0, 3). A sketch follows.
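A minimal sketch of vectorized string methods via the str attribute; the monte Series follows the naming used in the extract, with its contents assumed for illustration.

import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

print(monte.str.lower())            # returns a Series of strings
print(monte.str.len())              # returns a Series of integers
print(monte.str.startswith('T'))    # returns a Series of Booleans
print(monte.str.split())            # returns a Series of lists

# get() and slice() give vectorized element access
print(monte.str.slice(0, 3))           # first three characters of each entry
print(monte.str.split().str.get(-1))   # last word of each entry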


