Search Results

Search found 28 results on 2 pages for 'pandas'.


  • pandas: complex filter on rows of DataFrame

    - by duckworthd
    I would like to filter rows by a function of each row, e.g.

        def f(row):
            return sin(row['velocity']) / np.prod(row['masses']) > 5

        df = pandas.DataFrame(...)
        filtered = df[apply_to_all_rows(df, f)]

    Or, for another more complex, contrived example,

        def g(row):
            if row['col1'].method1() == 1:
                val = row['col1'].method2() / row['col1'].method3(row['col3'], row['col4'])
            else:
                val = row['col2'].method5(row['col6'])
            return np.sin(val)

        df = pandas.DataFrame(...)
        filtered = df[apply_to_all_rows(df, g)]

    How can I do so?
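    One way to express this (an illustrative sketch, not from the original post): build a boolean
    mask with DataFrame.apply along axis=1 and index the frame with it. The column names below are
    the asker's; the sample values are made up.

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({'velocity': [1.0, 2.0, 3.0],
                           'masses': [[1, 2], [0.1, 0.2], [3, 4]]})

        def f(row):
            # any per-row logic can go here
            return np.sin(row['velocity']) / np.prod(row['masses']) > 5

        mask = df.apply(f, axis=1)   # boolean Series, one entry per row
        filtered = df[mask]

    Note that apply(axis=1) runs a Python function per row, so it is flexible but not fast;
    vectorising the condition directly on the columns is preferable when possible.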

    Read the article

  • How to split a column into two columns in pandas

    - by user1345283
    I have the following dataframe:

        data = read_csv('enero.csv')
        data
                      Fecha  DirViento  MagViento
        0  2011/07/01 00:00        318        6.6
        1  2011/07/01 00:15        342        5.5
        2  2011/07/01 00:30        329        6.6
        3  2011/07/01 00:45        279        7.5
        4  2011/07/01 01:00        318        6.0
        5  2011/07/01 01:15        329        7.1
        6  2011/07/01 01:30        300        4.7
        7  2011/07/01 01:45        291        3.1

    How do I split the column Fecha into two columns, for example, to get a dataframe as follows:

                Fecha   Hora  DirViento  MagViento
        0  2011/07/01  00:00        318        6.6
        1  2011/07/01  00:15        342        5.5
        2  2011/07/01  00:30        329        6.6
        3  2011/07/01  00:45        279        7.5
        4  2011/07/01  01:00        318        6.0
        5  2011/07/01  01:15        329        7.1
        6  2011/07/01  01:30        300        4.7
        7  2011/07/01  01:45        291        3.1

    I am using pandas to read the data. I am trying to calculate daily averages from a monthly
    database that has data recorded every 15 minutes. To do this, I group by the Date and Time
    columns to get a dataframe as follows:

        Fecha       Hora
        2011/07/01  00:00    -4.4
                    00:15    -1.7
                    00:30    -3.4
        2011/07/02  00:00    -4.5
                    00:15    -4.2
                    00:30    -7.6
        2011/07/03  00:00    -6.3
                    00:15   -13.7
                    00:30    -0.3

    With this layout, grouped.mean() then gives me the following:

        Fecha       DirRes
        2011/07/01      -3
        2011/07/02      -5
        2011/07/03      -6
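    A possible sketch (not from the original post): split the Fecha column on the space with the
    .str accessor; expand=True (reasonably recent pandas) returns the two pieces as columns.

        import pandas as pd

        data = pd.DataFrame({'Fecha': ['2011/07/01 00:00', '2011/07/01 00:15'],
                             'DirViento': [318, 342],
                             'MagViento': [6.6, 5.5]})   # stand-in for read_csv('enero.csv')

        # split once on the blank between date and time
        partes = data['Fecha'].str.split(' ', n=1, expand=True)
        data['Fecha'] = partes[0]
        data['Hora'] = partes[1]

        # daily averages then come from a plain groupby on the date part
        medias = data.groupby('Fecha')[['DirViento', 'MagViento']].mean()

    On older pandas without expand=True, data['Fecha'].str.split(' ').str[0] and .str[1] give the
    same two pieces.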

    Read the article

  • pandas read rotated csv files

    - by EricCoding
    Is there any function in pandas that can directly read a rotated csv file? To be specific, the
    header information is in the first column instead of the first row. For example:

        A 1 2
        B 3 5
        C 6 7

    and I would like the final DataFrame this way:

        A B C
        1 3 6
        2 5 7

    Of course we can get around this problem using some data wrangling techniques like transpose
    and slicing. I am wondering whether there is a quick way in the API, but I could not find it.
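    There is no dedicated "rotated csv" reader that I know of, but the detour is short (a sketch,
    assuming the values are comma separated; the sample data is inlined with StringIO just so it
    runs):

        import pandas as pd
        from io import StringIO   # on Python 2, use StringIO.StringIO

        raw = u"A,1,2\nB,3,5\nC,6,7\n"

        # no header row, first column becomes the index, then transpose
        df = pd.read_csv(StringIO(raw), header=None, index_col=0).T
        df.columns.name = None    # drop the leftover column-index name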

    Read the article

  • Existing function to slice pandas object by axis number

    - by Zero
    Pandas has the following indexers:

        Object Type  Indexers
        Series       s.loc[indexer]
        DataFrame    df.loc[row_indexer, column_indexer]
        Panel        p.loc[item_indexer, major_indexer, minor_indexer]

    I would like to be able to index dynamically by axis, for example:

        df = pd.DataFrame(data=0, index=['row1', 'row2', 'row3'],
                          columns=['col1', 'col2', 'col3'])
        df.index(['row1', 'row3'], axis=0)  # index by rows
        df.index(['col1', 'col2'], axis=1)  # index by columns

    Is there a built-in function that does this?
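    A couple of built-ins come close (a sketch; whether they fit depends on the pandas version):
    reindex and filter both take an axis argument, so the axis can be chosen at run time.

        import pandas as pd

        df = pd.DataFrame(0, index=['row1', 'row2', 'row3'],
                          columns=['col1', 'col2', 'col3'])

        rows = df.reindex(['row1', 'row3'], axis=0)    # select labels along rows
        cols = df.reindex(['col1', 'col2'], axis=1)    # select labels along columns

        # equivalents that also work on older pandas
        rows_old = df.reindex(index=['row1', 'row3'])
        cols_old = df.reindex(columns=['col1', 'col2'])

    Note that reindex returns NaN rows/columns for labels that do not exist rather than raising,
    which is a different contract from .loc.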

    Read the article

  • Pandas Dataframe to JSON File with Separate Records

    - by Chris
    I'm attempting to dump data from a Pandas DataFrame into a JSON file to import into MongoDB.
    The format I require in the file has one JSON record per line, of the form:

        {<column 1>:<value>, <column 2>:<value>, ..., <column N>:<value>}

    df.to_json(orient='records') gets close to the result, but all the records are dumped within a
    single JSON array. Any thoughts on an efficient way to get this result from a dataframe?

    UPDATE: The best solution I've come up with is the following:

        dlist = df.to_dict('records')
        dlist = [json.dumps(record) + "\n" for record in dlist]
        open('data.json', 'w').writelines(dlist)
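    For what it's worth, newer pandas (0.19 and later, so likely newer than the version used in the
    post) can write newline-delimited JSON directly, which is exactly the mongoimport-friendly
    one-record-per-line layout:

        import pandas as pd

        df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})   # stand-in data
        df.to_json('data.json', orient='records', lines=True)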

    Read the article

  • Creating a dataframe in pandas by multiplying two series together

    - by Aoife
    Say I have two series in pandas, series A and series B. How do I create a dataframe in which
    all of those values are multiplied together, i.e. with series A down the left hand side and
    series B along the top? Basically the same concept as this, where series A would be the yellow
    on the left and series B the yellow along the top, and all the values in between would be
    filled in by multiplication:
    http://www.google.co.uk/imgres?imgurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables/times-table-12x12.gif&imgrefurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables.htm&h=533&w=720&sz=58&tbnid=9B8R_kpUloA4NM:&tbnh=90&tbnw=122&zoom=1&usg=__meqZT9kIAMJ5b8BenRzF0l-CUqY=&docid=j9BT8tUCNtg--M&sa=X&ei=bkBpUpOWOI2p0AWYnIHwBQ&ved=0CE0Q9QEwBg
    Thanks!
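    A compact way to get the multiplication-table layout (a sketch, with made-up series): take the
    outer product of the two value arrays and label the result with the series' indexes.

        import numpy as np
        import pandas as pd

        A = pd.Series([1, 2, 3], index=['a1', 'a2', 'a3'])
        B = pd.Series([10, 20], index=['b1', 'b2'])

        # rows follow A, columns follow B, each cell is A[i] * B[j]
        table = pd.DataFrame(np.outer(A, B), index=A.index, columns=B.index)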

    Read the article

  • Pandas Dataframe add rows on top of dataframe

    - by yash.trojan.25
    I am trying to add rows on top of the pandas DataFrame data: some blank rows and some summary
    rows that contain, for each column, calculations such as the Average, Sigma, Minimum and
    Maximum. Can someone please help me with how I can do this?

    From:

                 A   B    D    E    F    G    H    I    J
        0       -8  10  532  533  533  532  534  532  532
        1       -8  12  520  521  523  523  521  521  521
        2       -8  14  520  523  522  523  522  521  522
        3       -4   2  526  527  527  528  528  527  529
        4       -4   4  516  518  517  519  518  516  518
        5       -4   6  528  529  530  531  530  528  530
        6       -4   8  518  521  521  521  522  519  521
        7       -4  10  524  525  525  525  525  524  524
        8       -4  12  522  523  524  525  525  522  523
        9       -2   2  525  526  527  527  527  525  527
        10      -2   4  518  519  519  521  520  519  520
        11      -2   6  520  522  522  522  522  520  523
        12      -2   8  551  551  552  552  552  550  552
        13      -2  10  533  534  535  536  535  534  535
        14      -2  12  537  539  539  539  538  537  539
        15      -2  14  528  530  530  531  530  529  530
        16      -1   2  518  519  519  521  520  518  520

    To:

                 A   B      D      E      F      G      H      I      J
        Average        525.6  527.1  527.4  528.0  527.6  526.0  527.4
        Sigma            8.6    8.3    8.5    8.1    8.3    8.3    8.4
        Minimum          516    518    517    519    518    516    518
        Maximum          551    551    552    552    552    550    552
        0       -8  10   532    533    533    532    534    532    532
        1       -8  12   520    521    523    523    521    521    521
        2       -8  14   520    523    522    523    522    521    522
        3       -4   2   526    527    527    528    528    527    529
        4       -4   4   516    518    517    519    518    516    518
        5       -4   6   528    529    530    531    530    528    530
        6       -4   8   518    521    521    521    522    519    521
        7       -4  10   524    525    525    525    525    524    524
        8       -4  12   522    523    524    525    525    522    523
        9       -2   2   525    526    527    527    527    525    527
        10      -2   4   518    519    519    521    520    519    520
        11      -2   6   520    522    522    522    522    520    523
        12      -2   8   551    551    552    552    552    550    552
        13      -2  10   533    534    535    536    535    534    535
        14      -2  12   537    539    539    539    538    537    539
        15      -2  14   528    530    530    531    530    529    530
        16      -1   2   518    519    519    521    520    518    520
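    One possible sketch (not the asker's code): compute the summary block with agg, rename its
    rows, and concatenate it on top of the original frame. agg with a list needs pandas 0.20+; on
    older versions the same rows can be built from df.mean(), df.std(), df.min() and df.max().

        import pandas as pd

        df = pd.DataFrame({'D': [532, 520, 551], 'E': [533, 521, 551]})   # stand-in data

        summary = df.agg(['mean', 'std', 'min', 'max'])
        summary.index = ['Average', 'Sigma', 'Minimum', 'Maximum']

        result = pd.concat([summary, df])   # summary rows first, data rows below

    If columns such as A and B should stay blank in the summary rows, restrict the agg to the
    measurement columns only, e.g. df[['D', 'E', 'F', 'G', 'H', 'I', 'J']].agg(...).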

    Read the article

  • Parsing a Multi-Index Excel File in Pandas

    - by rhaskett
    I have a time series excel file with a tri-level column MultiIndex that I would like to
    successfully parse if possible. There are some results on how to do this for an index on Stack
    Overflow, but not for the columns, and the parse function has a header argument that does not
    seem to take a list of rows. The ExcelFile layout is like the following:

        Column A is all the time series dates, starting at A4
        Column B has top_level1 (B1), mid_level1 (B2), low_level1 (B3), data (B4-B100+)
        Column C has null (C1), null (C2), low_level2 (C3), data (C4-C100+)
        Column D has null (D1), mid_level2 (D2), low_level1 (D3), data (D4-D100+)
        Column E has null (E1), null (E2), low_level2 (E3), data (E4-E100+)
        ...

    So there are two low_level values, many mid_level values and a few top_level values, but the
    trick is that the top and mid level values are null and are assumed to be the values to the
    left. So, for instance, all the columns above would have top_level1 as the top multi-index
    value. My best idea so far is to use transpose, but that fills in Unnamed: # everywhere and
    doesn't seem to work. In Pandas 0.13 read_csv seems to have a header parameter that can take a
    list, but this doesn't seem to work with parse.
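    A workaround sketch (the file name is hypothetical): read the sheet with no header at all,
    forward-fill the three label rows to the right to resolve the blank upper levels, and build
    the column MultiIndex by hand. In newer pandas, read_excel(header=[0, 1, 2], index_col=0) may
    do most of this directly.

        import pandas as pd

        raw = pd.read_excel('timeseries.xlsx', header=None)

        # rows 0-2 hold top/mid/low labels; blanks mean "same as the cell to the left"
        labels = raw.iloc[:3, 1:].ffill(axis=1)
        body = raw.iloc[3:]

        df = body.set_index(body.columns[0])
        df.index.name = 'date'
        df.columns = pd.MultiIndex.from_arrays(labels.values.tolist(),
                                               names=['top', 'mid', 'low'])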

    Read the article

  • How to sort a boxplot by the median values in pandas

    - by Chris
    I've got a dataframe outcome2 that I generate a grouped boxplot with in the following manner:

        In [11]: outcome2.boxplot(column='Hospital 30-Day Death (Mortality) Rates from Heart Attack',
                                  by='State')
                 plt.ylabel('30 Day Death Rate')
                 plt.title('30 Day Death Rate by State')
        Out[11]:

    What I'd like to do is sort the plot by the median for each state, instead of alphabetically.
    Not sure how to go about doing so.
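    One way (a sketch that assumes the asker's outcome2 frame and column names): compute the
    per-state medians, order the states by them, and hand matplotlib one array per state in that
    order instead of relying on DataFrame.boxplot's alphabetical grouping.

        import matplotlib.pyplot as plt

        col = 'Hospital 30-Day Death (Mortality) Rates from Heart Attack'

        order = outcome2.groupby('State')[col].median().sort_values().index
        data = [outcome2.loc[outcome2['State'] == s, col].dropna() for s in order]

        plt.boxplot(data, labels=list(order))
        plt.xticks(rotation=90)
        plt.ylabel('30 Day Death Rate')
        plt.title('30 Day Death Rate by State')
        plt.show()

    sort_values needs pandas 0.17+ (older versions spell it .order()).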

    Read the article

  • How to find subgroup statistics in pandas?

    - by user2808117
    I am grouping a DataFrame using multiple columns (e.g., columns A and B - my_df.groupby(['A','B'])).
    Is there a better (fewer lines of code, faster) way of finding how many rows are in each
    subgroup and how many subgroups there are in total? At the moment I am using:

        def get_grp_size(grp):
            grp['size'] = len(grp)
            return grp

        my_df = my_df.groupby(['A', 'B']).apply(get_grp_size)
        my_df[['A', 'B', 'size']].drop_duplicates().size
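    A shorter route (a sketch with made-up data): groupby(...).size() gives the per-subgroup row
    counts in one call, and its length is the number of subgroups.

        import pandas as pd

        my_df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': [1, 1, 2], 'v': [10, 20, 30]})

        sizes = my_df.groupby(['A', 'B']).size()   # rows per (A, B) subgroup
        n_groups = len(sizes)                      # total number of subgroups

        # to attach the count back onto every row, transform broadcasts it
        my_df['size'] = my_df.groupby(['A', 'B'])['v'].transform(len)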

    Read the article

  • Get particular row as series from pandas dataframe

    - by Pratyush
    How do we get a particular filtered row as a Series? Example dataframe:

        >>> df = pd.DataFrame({'date': [20130101, 20130101, 20130102], 'location': ['a', 'a', 'c']})
        >>> df
               date location
        0  20130101        a
        1  20130101        a
        2  20130102        c

    I need to select the row where location is 'c', as a Series. I tried:

        row = df[df["location"] == "c"].head(1)   # gives a dataframe
        row = df.ix[df["location"] == "c"]        # also gives a dataframe with a single row

    In either case I can't get the row as a Series.
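    A sketch of one way to do it: boolean-filter first, then take the first row of the result by
    position, which hands back a Series whose index is the original columns.

        import pandas as pd

        df = pd.DataFrame({'date': [20130101, 20130101, 20130102],
                           'location': ['a', 'a', 'c']})

        row = df[df['location'] == 'c'].iloc[0]   # a Series with index ['date', 'location']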

    Read the article

  • A faster alternative to Pandas `isin` function

    - by user3576212
    I have a very large data frame df that looks like:

          ID  Value1  Value2
        1345     3.2     332
        1355     2.2      32
        2346     1.0      11
        3456     8.9     322

    And I have a list, ID_list, that contains a subset of IDs. I need to get the subset of df for
    the IDs contained in ID_list. Currently, I am using

        df_sub = df[df.ID.isin(ID_list)]

    to do it, but it takes a lot of time. The IDs contained in ID_list don't follow any pattern,
    so they are not within a certain range. (And I need to apply the same operation to many
    similar dataframes.) I was wondering if there is any faster way to do this. Will it help a lot
    if I make ID the index? Thanks!
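    Two things usually worth trying (a sketch with the post's sample data; whether either is
    faster depends on the sizes involved): hash-based lookups through an ID index, which pays off
    when the same frame is queried repeatedly, or a plain numpy membership test on the raw values.

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({'ID': [1345, 1355, 2346, 3456],
                           'Value1': [3.2, 2.2, 1.0, 8.9],
                           'Value2': [332, 32, 11, 322]})
        ID_list = [1355, 3456]

        # 1) index once, then select by label
        indexed = df.set_index('ID')
        df_sub = indexed.loc[indexed.index.intersection(ID_list)].reset_index()

        # 2) numpy membership test, no index needed
        df_sub2 = df[np.in1d(df['ID'].values, ID_list)]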

    Read the article

  • Python Pandas operate on row

    - by wuha
    Hi, my dataframe looks like:

        Store,Dept,Date,Sales
        1,1,2010-02-05,245
        1,1,2010-02-12,449
        1,1,2010-02-19,455
        1,1,2010-02-26,154
        1,1,2010-03-05,29
        1,1,2010-03-12,239
        1,1,2010-03-19,264

    Simply, I need to add another column called '_id' as the concatenation of Store, Dept and
    Date, like "1_1_2010-02-05". I assumed I could do it with
    df['id'] = df['Store'] + '_' + df['Dept'] + '_' + df['Date'], but it turned out not to work.
    Similarly, I also need to add a new column as the log of Sales. I tried
    df['logSales'] = math.log(df['Sales']), and again it did not work.
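    A sketch of what usually makes both lines work: cast the numeric columns to strings before
    concatenating, and use numpy's elementwise log instead of math.log, which only accepts
    scalars.

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({'Store': [1, 1], 'Dept': [1, 1],
                           'Date': ['2010-02-05', '2010-02-12'],
                           'Sales': [245, 449]})

        df['_id'] = (df['Store'].astype(str) + '_' +
                     df['Dept'].astype(str) + '_' + df['Date'])

        df['logSales'] = np.log(df['Sales'])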

    Read the article

  • How do you calculate expanding mean on time series using pandas?

    - by mlo
    How would you create a column (or columns) in the pandas DataFrame below where the new columns
    are the expanding mean/median of 'val' for each 'Mod_ID_x'? Imagine this as if it were time
    series data, where 'ID' 1-2 was on Day 1 and 'ID' 3-4 was on Day 2. I have tried every way I
    could think of but just can't seem to get it right.

        left4 = pd.DataFrame({'ID': [1, 2, 3, 4],
                              'val': [10000, 25000, 20000, 40000],
                              'Mod_ID': [15, 35, 15, 42],
                              'car': ['ford', 'honda', 'ford', 'lexus']})
        right4 = pd.DataFrame({'ID': [3, 1, 2, 4],
                               'color': ['red', 'green', 'blue', 'grey'],
                               'wheel': ['4wheel', '4wheel', '2wheel', '2wheel'],
                               'Mod_ID': [15, 15, 35, 42]})
        df1 = pd.merge(left4, right4, on='ID').drop('Mod_ID_y', axis=1)
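    A sketch building on the df1 constructed above (expanding() needs pandas 0.18+; older versions
    spell it pd.expanding_mean): group by Mod_ID_x and let transform broadcast the running
    statistic back onto each row in its original order.

        df1 = df1.sort_values('ID')   # make the intended time order explicit

        df1['exp_mean'] = (df1.groupby('Mod_ID_x')['val']
                              .transform(lambda s: s.expanding().mean()))
        df1['exp_median'] = (df1.groupby('Mod_ID_x')['val']
                                .transform(lambda s: s.expanding().median()))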

    Read the article

  • How to sort a Pandas DataFrame according to multiple criteria?

    - by user1715271
    I have the following DataFrame containing song names, their peak chart positions and the
    number of weeks they spent at position no. 1:

             Song                                    Peak  Weeks
        76   Paperback Writer                           1     16
        117  Lady Madonna                               1      9
        118  Hey Jude                                   1     27
        22   Can't Buy Me Love                          1     17
        29   A Hard Day's Night                         1     14
        48   Ticket To Ride                             1     14
        56   Help!                                      1     17
        109  All You Need Is Love                       1     16
        173  The Ballad Of John And Yoko                1     13
        85   Eleanor Rigby                              1     14
        87   Yellow Submarine                           1     14
        20   I Want To Hold Your Hand                   1     24
        45   I Feel Fine                                1     15
        60   Day Tripper                                1     12
        61   We Can Work It Out                         1     12
        10   She Loves You                              1     36
        155  Get Back                                   1      6
        8    From Me To You                             1      7
        115  Hello Goodbye                              1      7
        2    Please Please Me                           2     20
        92   Strawberry Fields Forever                  2     12
        93   Penny Lane                                 2     13
        107  Magical Mystery Tour                       2     16
        176  Let It Be                                  2     14
        0    Love Me Do                                 4     26
        157  Something                                  4      9
        166  Come Together                              4     10
        58   Yesterday                                  8     21
        135  Back In The U.S.S.R.                      19      3
        164  Here Comes The Sun                        58     19
        96   Sgt. Pepper's Lonely Hearts Club Band     63     12
        105  With A Little Help From My Friends        63      7

    I'd like to rank these songs in order of popularity, so I'd like to sort them according to the
    following criteria: songs that reached the highest position come first, but if there is a tie,
    the songs that remained in the charts for the longest come first. I can't seem to figure out
    how to do this in Pandas.
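    A sketch of the usual approach (assuming the frame is called df): sort on both columns at
    once, ascending on Peak and descending on Weeks, so that ties on the peak position are broken
    by chart longevity.

        import pandas as pd

        df = pd.DataFrame({'Song': ['Hey Jude', 'Let It Be', 'Something'],
                           'Peak': [1, 2, 4],
                           'Weeks': [27, 14, 9]})   # stand-in for the full chart data

        ranked = df.sort_values(by=['Peak', 'Weeks'], ascending=[True, False])
        # on pandas older than 0.17 the equivalent call is df.sort([...], ascending=[...])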

    Read the article

  • Convert object to DateRange

    - by user655832
    I'm querying an underlying PostgreSQL database using Pandas 0.8. Pandas is returning the
    DataFrame properly, but the underlying timestamp column in my database is being returned as a
    generic "object" type in Pandas. As I would eventually like to do seasonal normalization of my
    data, I am curious how to convert this generic "object" column to something that is
    appropriate for analysis. Here is my current code to retrieve the data:

        # get records from db example
        import pandas.io.sql as psql
        import psycopg2

        # define query to get all subs created this year
        QRY = """
        select
            i i,
            i * random() f,
            case when random() > 0.5 then true else false end t,
            (current_date - (i*random())::int)::timestamp with time zone tsz
        from generate_series(1,1000) as s(i)
        order by 4
        ;
        """

        CONN_STRING = "host='localhost' port=5432 dbname='postgres' user='postgres'"

        # connect to db
        conn = psycopg2.connect(CONN_STRING)

        # get some data, set index on relid column
        df = psql.frame_query(QRY, con=conn)

        print "Row count retrieved: %i" % (len(df),)

    Thanks for any help you can render.

    M
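    The usual fix (a sketch; the column name tsz comes from the SQL alias above) is to convert the
    object column explicitly after the query and, if resampling or seasonal work is planned, move
    it into the index:

        import pandas as pd

        df['tsz'] = pd.to_datetime(df['tsz'])   # object -> datetime64
        df = df.set_index('tsz')

    pd.to_datetime has grown considerably since pandas 0.8, so on very old versions an upgrade (or
    an explicit per-value conversion) may be needed.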

    Read the article

  • How to replace & add dataframe elements from another dataframe in Python Pandas?

    - by bigbug
    Suppose I have two data frames 'df_a' and 'df_b', both of which have the same index structure
    and columns, but some of the data elements inside are different:

        >>> df_a
                   sales  cogs
        STK_ID QT
        000876 1     100   100
               2     100   100
               3     100   100
               4     100   100
               5     100   100
               6     100   100
               7     100   100

        >>> df_b
                   sales  cogs
        STK_ID QT
        000876 5      50    50
               6      50    50
               7      50    50
               8      50    50
               9      50    50
               10     50    50

    And now I want to replace the elements of df_a with the elements of df_b that have the same
    (index, column) coordinates, and append df_b's elements whose (index, column) coordinates are
    beyond the scope of df_a, just like applying a patch 'df_b' to 'df_a':

        >>> df_c = patch(df_a, df_b)
                   sales  cogs
        STK_ID QT
        000876 1     100   100
               2     100   100
               3     100   100
               4     100   100
               5      50    50
               6      50    50
               7      50    50
               8      50    50
               9      50    50
               10     50    50

    How do I write the patch(df_a, df_b) function?
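    combine_first gives exactly this patch semantics (a sketch reconstructing the frames above):
    the caller's values win wherever both frames have a value, and labels present in only one
    frame are kept.

        import pandas as pd

        idx_a = pd.MultiIndex.from_product([['000876'], range(1, 8)],
                                           names=['STK_ID', 'QT'])
        idx_b = pd.MultiIndex.from_product([['000876'], range(5, 11)],
                                           names=['STK_ID', 'QT'])
        df_a = pd.DataFrame({'sales': [100] * 7, 'cogs': [100] * 7}, index=idx_a)
        df_b = pd.DataFrame({'sales': [50] * 6, 'cogs': [50] * 6}, index=idx_b)

        df_c = df_b.combine_first(df_a)   # df_b patches df_a

    One caveat: the result comes back index-sorted, with columns possibly reordered, and may be
    upcast to float in some cases.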

    Read the article

  • matplotlib plot window won't appear

    - by user1518837
    I'm using Python 2.7.3 in 64-bit. I installed pandas as well as matplotlib 1.1.1, both for
    64-bit. Right now, none of my plots are showing. After attempting to plot from several
    different dataframes, I gave up in frustration and tried the following first example from
    http://pandas.pydata.org/pandas-docs/dev/visualization.html:

        INPUT:

        import matplotlib.pyplot as plt
        ts = Series(randn(1000), index=date_range('1/1/2000', periods=1000))
        ts = ts.cumsum()
        ts.plot()
        pylab.show()

        OUTPUT:

        Axes(0.125,0.1;0.775x0.8)

    And no plot window appeared. Other StackOverflow threads I've read suggested I might be
    missing DLLs. Any suggestions?
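    For reference, a fully spelled-out version of that docs example (the original assumes the
    pandas/numpy names are already imported) that ends with an explicit plt.show(). If still no
    window appears, the backend is the usual suspect; matplotlib.get_backend() will say whether a
    non-GUI backend such as Agg is active.

        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt

        ts = pd.Series(np.random.randn(1000),
                       index=pd.date_range('1/1/2000', periods=1000))
        ts = ts.cumsum()
        ts.plot()
        plt.show()   # blocks until the window is closed when not in interactive mode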

    Read the article

  • Memory-efficient import of many data files into a pandas DataFrame in Python

    - by richardh
    I import a directory of |-delimited .dat files into a pandas DataFrame. The following code
    works, but I eventually run out of RAM with a MemoryError:

        import pandas as pd
        import glob

        temp = []
        dataDir = 'C:/users/richard/research/data/edgar/masterfiles'
        for dataFile in glob.glob(dataDir + '/master_*.dat'):
            print dataFile
            temp.append(pd.read_table(dataFile, delimiter='|', header=0))

        masterAll = pd.concat(temp)

    Is there a more memory-efficient approach? Or should I go whole hog to a database? (I will
    move to a database eventually, but I am baby-stepping my move to pandas.) Thanks!

    FWIW, here is the head of an example .dat file:

        cik|cname|ftype|date|fileloc
        1000032|BINCH JAMES G|4|2011-03-08|edgar/data/1000032/0001181431-11-016512.txt
        1000045|NICHOLAS FINANCIAL INC|10-Q|2011-02-11|edgar/data/1000045/0001193125-11-031933.txt
        1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-11|edgar/data/1000045/0001193125-11-005531.txt
        1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-27|edgar/data/1000045/0001193125-11-015631.txt
        1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2011-02-14|edgar/data/1000045/0000929638-11-00151.txt
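    Two incremental savings worth trying before reaching for a database (a sketch; the usecols
    subset below is only an example, and usecols needs a reasonably recent pandas): load just the
    columns actually needed, and feed concat a generator so the intermediate Python list is never
    held. The individual frames still have to fit in memory together, so if that is not the case
    an on-disk store (SQL, HDF5) is the real answer.

        import glob
        import pandas as pd

        dataDir = 'C:/users/richard/research/data/edgar/masterfiles'
        wanted = ['cik', 'ftype', 'date']   # example subset of the five columns

        def frames():
            for dataFile in glob.glob(dataDir + '/master_*.dat'):
                yield pd.read_table(dataFile, delimiter='|', header=0, usecols=wanted)

        masterAll = pd.concat(frames(), ignore_index=True)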

    Read the article

  • How to retrieve view of MultiIndex DataFrame

    - by Henry S. Harrison
    This question was inspired by this question. I had the same problem, updating a MultiIndex
    DataFrame by selection. The drop_level=False solution in Pandas 0.13 will allow me to achieve
    the same result, but I am still wondering why I cannot get a view from the MultiIndex
    DataFrame. In other words, why does this not work?:

        >>> sat = d.xs('sat', level='day', copy=False)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2248, in xs
            raise ValueError('Cannot retrieve view (copy=False)')
        ValueError: Cannot retrieve view (copy=False)

    Of course it could be only because it is not implemented, but is there a reason? Is it somehow
    ambiguous or impossible to implement? Returning a view is more intuitive to me than returning
    a copy then later updating the original. I looked through the source and it seems this
    situation is checked explicitly to raise an error. Alternatively, is it possible to get the
    same sort of view from any of the other indexing methods? I've experimented but have not been
    successful.

    [edit] Some potential implementations are discussed here. I guess with the last question above
    I'm wondering what the current best solution is to index into arbitrary multiindex slices and
    cross-sections.
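    For the practical side of the question (updating through a cross-section without juggling
    copies), a common pattern is to write through .loc with an IndexSlice, which avoids needing a
    view at all. A sketch with made-up data (needs pandas 0.14+ and a lexsorted index):

        import pandas as pd

        idx = pd.IndexSlice
        d = pd.DataFrame({'value': range(4)},
                         index=pd.MultiIndex.from_product([['w1', 'w2'], ['sat', 'sun']],
                                                          names=['week', 'day']))

        # select the 'sat' cross-section and assign back in one step
        d.loc[idx[:, 'sat'], 'value'] = 99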

    Read the article

  • Non-standard interaction between two tables to avoid a very large merge

    - by riko
    Suppose I have two tables A and B. Table A has a multi-level index (a, b) and one column (ts);
    b determines ts univocally.

        A = pd.DataFrame(
            [('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5),
             ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)],
            columns=['a', 'b', 'ts']).set_index(['a', 'b'])
        AA = A.reset_index()

    Table B is another one-column (ts) table with a non-unique index (a). The ts's are sorted
    "inside" each group, i.e., B.ix[x] is sorted for each x. Moreover, there is always a value in
    B.ix[x] that is greater than or equal to the values in A.

        B = pd.DataFrame(
            dict(a=list('aaaaabbcccccc'),
                 ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9])).set_index('a')

    The semantics of this is that B contains observations of occurrences of an event of the type
    indicated by the index. I would like to find from B the timestamp of the first occurrence of
    each event type after the timestamp specified in A, for each value of b. In other words, I
    would like to get a table with the same shape as A that, instead of ts, contains the "minimum
    value occurring after ts" as specified by table B. So, my goal would be:

        C:
        ('a', 'x')  4
        ('a', 'y')  7
        ('a', 'z')  5
        ('b', 'x')  7
        ('b', 'z')  7
        ('c', 'y')  8

    I have some working code, but it is terribly slow:

        C = AA.apply(lambda row: (
            row[0], row[1],
            B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2]))),
            axis=1).set_index(['a', 'b'])

    Profiling shows the culprit is obviously B.ix[row[0]].irow(np.searchsorted(B.ts[row[0]], row[2])).
    However, standard solutions using merge/join would take too much RAM in the long run. Consider
    that now I have 1000 a's, assume constant the average number of b's per a (probably 100-200),
    and consider that the number of observations per a is probably in the order of 300. In
    production I will have 1000 times more a's. 1,000,000 x 200 x 300 = 60,000,000,000 rows may be
    a bit too much to keep in RAM, especially considering that the data I need is perfectly
    described by a C like the one I discussed above. How would I improve the performance?
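    This is essentially an "as-of" join, and pd.merge_asof (pandas 0.19+, direction= needs 0.20+)
    does it without materialising the full cross product. A sketch reconstructing the frames
    above:

        import pandas as pd

        AA = pd.DataFrame([('a', 'x', 4), ('a', 'y', 6), ('a', 'z', 5),
                           ('b', 'x', 4), ('b', 'z', 5), ('c', 'y', 6)],
                          columns=['a', 'b', 'ts'])
        B = pd.DataFrame(dict(a=list('aaaaabbcccccc'),
                              ts=[1, 2, 4, 5, 7, 7, 8, 1, 2, 4, 5, 8, 9]))

        # keep B's timestamp in an extra column so it survives the merge
        B = B.sort_values('ts')
        B['first_ts_after'] = B['ts']

        C = pd.merge_asof(AA.sort_values('ts'), B,
                          on='ts', by='a', direction='forward')
        C = C.set_index(['a', 'b']).sort_index()[['first_ts_after']]

    direction='forward' picks, within each group a, the first B row whose ts is greater than or
    equal to the A row's ts, which matches the goal table C above.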

    Read the article

  • Efficient way to get highly correlated pairs from large data set in Python or R

    - by Akavall
    I have a large data set (let's say 10,000 variables with about 1000 elements each). We can
    think of it as a 2D list, something like:

        [[variable_1],
         [variable_2],
         ...
         [variable_n]]

    I want to extract highly correlated variable pairs from that data, where "highly correlated"
    is a parameter that I can choose. I don't need all pairs to be extracted, and I don't
    necessarily want the most correlated pairs; as long as there is an efficient method that gets
    me highly correlated pairs I am happy. Also, it would be nice if a variable does not show up
    in more than one pair, although this might not be crucial. Of course, there is a brute force
    way of finding such pairs, but it is too slow for me. I've googled around for a bit and found
    some theoretical work on this issue, but I wasn't able to find a package that could do what I
    am looking for. I mostly work in Python, so a package in Python would be most helpful, but if
    there exists a package in R that does what I am looking for, that would be great too. Does
    anyone know of a package that does the above in Python or R? Or any other ideas? Thank you in
    advance.
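    A numpy-only sketch of the standard vectorised route: compute all pairwise correlations at
    once and threshold the upper triangle. For 10,000 variables the correlation matrix is roughly
    0.8 GB in float64, so it can be computed in blocks of rows if that is too much; the data here
    is random just so the snippet runs.

        import numpy as np

        rng = np.random.RandomState(0)
        data = rng.randn(500, 100)          # one variable per row (stand-in for 10,000 x 1,000)
        threshold = 0.8

        corr = np.corrcoef(data)                      # all pairwise correlations
        iu = np.triu_indices_from(corr, k=1)          # upper triangle, no self-pairs
        hits = np.abs(corr[iu]) >= threshold
        pairs = list(zip(iu[0][hits], iu[1][hits]))   # (variable_i, variable_j) index pairs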

    Read the article

  • Paste Excel clip to body of an email through Python

    - by Twinkle
    I am using win32com.client in Python to send an email. However, I want the body of the email
    to be a table (an HTML-formatted table). I could build it in Excel first and then copy and
    paste it (but how?), or work directly from the corresponding Pandas data frame.

        newMail.body = my_table

    where my_table is a Pandas data frame, didn't work. So I'm wondering if there are smarter
    ways, for example, to combine Excel with the Outlook apps within Python?

    Cheers,
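    A sketch of one route (the address is a placeholder; requires Outlook plus the pywin32
    package): render the frame with DataFrame.to_html and assign it to the mail item's HTMLBody
    instead of Body, so Outlook treats it as an HTML table.

        import pandas as pd
        import win32com.client

        my_table = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})   # stand-in table

        outlook = win32com.client.Dispatch('Outlook.Application')
        newMail = outlook.CreateItem(0)                        # 0 = olMailItem
        newMail.To = 'someone@example.com'
        newMail.Subject = 'Table'
        newMail.HTMLBody = my_table.to_html(index=False)
        newMail.Send()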

    Read the article

  • CentOS (rel6) with default python 2.6, but separate 3.3.5 installation

    - by Silvertiger
    I have a CentOS server (rel6) that had python installed (2.6), but I needed a few features
    from 3.3+. I installed 3.3 into a separate folder and made a symbolic link to execute it.

    I installed setuptools:

        yum install python-setuptools

    I installed a needed module, "pandas":

        easy_install pandas

    I executed my python script, which encountered an error that required I use a newer version.
    I downloaded and installed Python 3.3.5 into its own folder so as to not override my default
    python:

        wget http://www.python.org/ftp/python/3.3.5/Python-3.3.5.tar.xz
        tar xJf ./Python-3.3.5.tar.xz
        cd ./Python-3.3.5
        ./configure --prefix=/opt/python
        make
        make install

    I made a symbolic link to allow me to execute this new python:

        ln -s /opt/python3.3/bin/python3.3 ~/bin/py

    The problem is that when I execute the python script with my new py alias, it does not have
    all the add-ons needed (explicitly MySQLdb) which the default install does. How do I go about
    installing the MySQLdb module, or any module for that matter, so that it is reachable or
    usable by the new Python 3.3.5 installation? Or is there a way to make the current modules in
    2.6 available to 3.3.5 as well?

    Read the article

  • How to use OO for data analysis? [closed]

    - by Konsta
    In which ways could object-orientation (OO) make my data analysis more efficient and let me
    reuse more of my code? The data analysis can be broken up into:

        get data (from db or csv or similar)
        transform data (filter, group/pivot, ...)
        display/plot (graph timeseries, create tables, etc.)

    I mostly use Python and its Pandas and Matplotlib packages for this, besides some DB
    connectivity (SQL). Almost all of my code is a functional/procedural mix. While I have started
    to create a data object for a certain collection of time series, I wonder if there are OO
    design patterns/approaches for other parts of the process that might increase efficiency?
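    One light-weight OO shape that maps onto those three steps (a sketch, not a prescription; the
    file name is hypothetical): a small pipeline class with one method per stage, returning self
    so the stages chain.

        import pandas as pd

        class TimeSeriesReport(object):
            """get data -> transform -> display, one method per stage."""

            def __init__(self, source):
                self.source = source          # e.g. a csv path or a DB query
                self.data = None

            def load(self):
                self.data = pd.read_csv(self.source, parse_dates=True, index_col=0)
                return self

            def transform(self, freq='M'):
                self.data = self.data.resample(freq).mean()
                return self

            def plot(self, **kwargs):
                return self.data.plot(**kwargs)

        # report = TimeSeriesReport('prices.csv').load().transform('W').plot()

    Each stage stays swappable (a DB-backed loader can subclass and override load), which is where
    the reuse tends to come from.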

    Read the article
